climateprediction.net (CPDN) home page
Thread 'Replanca Error/Sigseg fault.'

bernard_ivo

Joined: 18 Jul 13
Posts: 438
Credit: 25,837,810
RAC: 9,716
Message 56574 - Posted: 27 Jul 2017, 12:21:31 UTC - in response to Message 56573.  

Two more from the same batch 617 failed with SIGSEGV (segmentation violation) at 13 h 49 min. I now have 4 failures in a row, all at the first attempt (_0).
https://www.cpdn.org/cpdnboinc/result.php?resultid=20576338
https://www.cpdn.org/cpdnboinc/result.php?resultid=20566129
ID: 56574 · Report as offensive     Reply Quote
bernard_ivo

Joined: 18 Jul 13
Posts: 438
Credit: 25,837,810
RAC: 9,716
Message 56575 - Posted: 27 Jul 2017, 17:45:42 UTC - in response to Message 56574.  
Last modified: 27 Jul 2017, 17:46:37 UTC

Two more from batch 617 failed on my second Linux machine after 12 minutes. Model crashed: In boinc_exit called with status 0
Calloing set_signal_exit_code with status 0
https://www.cpdn.org/cpdnboinc/result.php?resultid=20574397
https://www.cpdn.org/cpdnboinc/result.php?resultid=20574922
ID: 56575 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 56580 - Posted: 28 Jul 2017, 7:47:13 UTC - in response to Message 56575.  

https://www.cpdn.org/cpdnboinc/workunit.php?wuid=11133626
This one is now in the last chance saloon. On both a Mac and my Linux laptop it failed with a segmentation fault just before the first zip was created. It is now on a Windows box, so I will check back in a while to see what happens and whether the Linux/Mac versions need to be pulled.
ID: 56580 · Report as offensive     Reply Quote
bernard_ivo

Joined: 18 Jul 13
Posts: 438
Credit: 25,837,810
RAC: 9,716
Message 56581 - Posted: 28 Jul 2017, 9:53:14 UTC - in response to Message 56580.  
Last modified: 28 Jul 2017, 9:55:23 UTC

So far 9 WUs have failed on my Linux machine (and later on Darwin/Linux at the second attempt). 5 of them now report trickles under Windows; the other 4 haven't reported under Windows yet. I still have two more in the queue.
ID: 56581 · Report as offensive     Reply Quote
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 56582 - Posted: 28 Jul 2017, 13:17:22 UTC - in response to Message 56580.  

https://www.cpdn.org/cpdnboinc/workunit.php?wuid=11133626
This one is now in the last chance saloon. On both a Mac and my Linux laptop it failed with a segmentation fault just before the first zip was created. It is now on a Windows box, so I will check back in a while to see what happens and whether the Linux/Mac versions need to be pulled.

I'm assuming this is once again one of those batches that fails on the first timestep of the regional model on January 1st on Mac and Linux?
ID: 56582 · Report as offensive     Reply Quote
bernard_ivo

Joined: 18 Jul 13
Posts: 438
Credit: 25,837,810
RAC: 9,716
Message 56583 - Posted: 28 Jul 2017, 14:45:25 UTC - in response to Message 56582.  

I'm assuming this is once again one of those batches that fails on the first timestep of the regional model on January 1st on Mac and Linux?


It looks like it. I'm tailing my last one and will have results in 12 h. One more has failed but is stuck as in progress, so I have 10 failures in total on my Linux machine.
ID: 56583 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 56584 - Posted: 28 Jul 2017, 15:32:07 UTC - in response to Message 56583.  

Looks very much like it. Am emailing project.
ID: 56584 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 56585 - Posted: 28 Jul 2017, 16:25:47 UTC - in response to Message 56584.  

Got the following back from Sarah:

Hi Dave,

Thanks yes sorry this is a resend of the natural batch. We have established that natural runs appear to fail at the start of a new model year under linux and mac and have traced this back to a particular SO2 file (that runs fine under windows for some reason). So far despite many attempts we are yet to find a solution for this. We are currently endeavouring to better understand the cause of the fault so that we can hopefully come up with a fix. We will keep you informed of our progress.

Best wishes,
Sarah


So any of this batch on Linux/Mac machines can be aborted.
ID: 56585 · Report as offensive     Reply Quote
bernard_ivo

Joined: 18 Jul 13
Posts: 438
Credit: 25,837,810
RAC: 9,716
Message 56588 - Posted: 28 Jul 2017, 17:00:39 UTC - in response to Message 56585.  
Last modified: 28 Jul 2017, 17:00:53 UTC

Thanks Dave,

So no need to tail the last 617er. Will abort.
ID: 56588 · Report as offensive     Reply Quote
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 56598 - Posted: 29 Jul 2017, 20:32:15 UTC - in response to Message 56585.  

I got two batch 617 work units and each failed with a segmentation fault after about 12 hours on a Linux machine. Computer ID 1256552

wah2_sas50_l09y_198612_13_617_011131907_1
wah2_sas50_l2nz_199512_13_617_011135004_1
ID: 56598 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 56599 - Posted: 30 Jul 2017, 6:12:54 UTC - in response to Message 56598.  

They will do. On Linux and Mac, please just abort this batch. I'm not sure what is happening with the ones that crash out after a few minutes, though. I am assuming that is a different problem. I don't know whether that one affects Windows machines or not.
ID: 56599 · Report as offensive     Reply Quote
bernard_ivo

Joined: 18 Jul 13
Posts: 438
Credit: 25,837,810
RAC: 9,716
Message 56603 - Posted: 30 Jul 2017, 15:13:07 UTC - in response to Message 56599.  

The 3 WUs that failed after 8-12 minutes (mentioned above) on my Linux machine seem to work fine under Windows, as zips and trickles are piling up.
ID: 56603 · Report as offensive     Reply Quote
JIM

Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 56604 - Posted: 31 Jul 2017, 3:15:47 UTC - in response to Message 56599.  

They will do. On Linux and Mac, please just abort this batch. I'm not sure what is happening with the ones that crash out after a few minutes, though. I am assuming that is a different problem. I don't know whether that one affects Windows machines or not.


I have a Windows-version batch 617 WU that’s been running for 2 days now with no problems.
ID: 56604 · Report as offensive     Reply Quote
bernard_ivo

Joined: 18 Jul 13
Posts: 438
Credit: 25,837,810
RAC: 9,716
Message 56893 - Posted: 20 Sep 2017, 19:20:09 UTC

ID: 56893 · Report as offensive     Reply Quote
bernard_ivo

Joined: 18 Jul 13
Posts: 438
Credit: 25,837,810
RAC: 9,716
Message 56911 - Posted: 22 Sep 2017, 4:41:06 UTC

It might be irrelevant to the current topic, but two WUs from the bad batch 617 discussed earlier are still shown as In Progress on the web, even though the Linux one crashed and the Windows one finished successfully. I will detach/reattach to release them, but the project people might want to take another look at that batch and find out why ghost WUs are being reported here.
ID: 56911 · Report as offensive     Reply Quote