climateprediction.net (CPDN)
Thread 'Replanca Error/Sigseg fault.'

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 56250 - Posted: 17 May 2017, 19:25:11 UTC

I noticed that a retread I was crunching failed on my box at a run time close enough to the earlier failure to make me wonder whether it was more than coincidence. The other machine had a noticeably shorter time but was faster. The Windows machine that had the first go had a replanca error, whereas my Linux box failed with a SIGSEGV fault. I did have a reboot shortly before the failure, but the task was still running three minutes after I resumed computation.

Work unit is
https://www.cpdn.org/cpdnboinc/workunit.php?wuid=10824511
Task now in the last chance saloon on another Windows machine.
ID: 56250
bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,837,810
RAC: 9,716
Message 56251 - Posted: 17 May 2017, 19:51:33 UTC - in response to Message 56250.  
Last modified: 17 May 2017, 19:52:17 UTC

All 3 WUs from batch 567 that I got on my Linux boxes failed with SIGSEGV: segmentation violation,
e.g. this one
ID: 56251
bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,837,810
RAC: 9,716
Message 56344 - Posted: 9 Jun 2017, 6:35:10 UTC

Two of the WUS25 tasks from batch 583 crashed under Linux with
SIGSEGV: segmentation violation
.....
.....
Exiting...
Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=6999, iMonCtr=2
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=6859, iMonCtr=2
Model crash detected, will try to restart...
Leaving CPDN_ain::Monitor...
Calling boinc_finish...03:09:40 (6859): called boinc_finish(0)
In boinc_exit called with status 0
Calloing set_signal_exit_code with status 0

they seem to be unsent after the crash
ID: 56344
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 56345 - Posted: 9 Jun 2017, 10:41:06 UTC - in response to Message 56344.  

they seem to be unsent after the crash


Do the re-issues only appear under the work unit once they have been sent? I'm not sure whether they go out next or go to the back of the queue behind the 10,000-odd tasks still waiting to be sent.
ID: 56345
bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,837,810
RAC: 9,716
Message 56356 - Posted: 10 Jun 2017, 8:18:13 UTC - in response to Message 56345.  
Last modified: 10 Jun 2017, 8:18:25 UTC

they seem to be unsent after the crash


Do the re-issues only appear under the work unit once they have been sent? I'm not sure whether they go out next or go to the back of the queue behind the 10,000-odd tasks still waiting to be sent.


These two in particular have now been re-issued. One failed on Darwin and is in progress on a Windows machine; the other is in progress on a Windows machine. I thought applications would run on a single platform only, or did I misunderstand the info?
ID: 56356
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 56358 - Posted: 10 Jun 2017, 17:50:11 UTC - in response to Message 56344.  

Two of the WUS25 tasks from batch 583 crashed under Linux with
SIGSEGV: segmentation violation
.....
.....
Exiting...
Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=6999, iMonCtr=2
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=6859, iMonCtr=2
Model crash detected, will try to restart...
Leaving CPDN_ain::Monitor...
Calling boinc_finish...03:09:40 (6859): called boinc_finish(0)
In boinc_exit called with status 0
Calloing set_signal_exit_code with status 0

they seem to be unsent after the crash


All three of my batch 583 tasks that made it to the third trickle crashed with SIGSEGV on my Linux box. It appears to be a Linux app problem with this particular batch. Batch 583 tasks under Windows have made it well past that point.

SIGSEGV: segmentation violation
Stack trace (12 frames):
/home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x839e357]
[0x2a9e3ca0]
/home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x81443f4]
/home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x814b133]
/home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8141220]
/home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x813ff46]
/home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8077583]
/home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x831cd74]
/home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8330985]
/home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x833318a]
/home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8334c8d]
/lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf7)[0x2a7ad637]

Exiting...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=11748, iMonCtr=2
Model crash detected, will try to restart...
Leaving CPDN_ain::Monitor...
Calling boinc_finish...08:10:00 (11748): called boinc_finish(0)
In boinc_exit called with status 0
Calloing set_signal_exit_code with status 0
ID: 56358
pvh
Joined: 9 Apr 14
Posts: 14
Credit: 1,962,018
RAC: 0
Message 56360 - Posted: 11 Jun 2017, 8:12:15 UTC - in response to Message 56358.  

I see the same issue with the segfaults: 100% failure rate on WUs from the 583 batch, all after about the same amount of CPU time (so this doesn't look random at all). Looking at my wing men, I could not find a single one where the task finished OK (including a few Windows computers). These are statistics on 14 failed WUs. Several of those are already counted out with 3 failures.

With quite a few of my wing men I also saw this error

../../projects/climateprediction.net/wah2_8.25_i686-pc-linux-gnu: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory
ID: 56360
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 56361 - Posted: 11 Jun 2017, 8:30:04 UTC - in response to Message 56360.  

With quite a few of my wing men I also saw this error

../../projects/climateprediction.net/wah2_8.25_i686-pc-linux-gnu: error while loading shared libraries: libstdc++


That is an unrelated problem. Those of us who use Linux have to (in most cases) manually install some 32-bit libraries to get the executables to run.

The SIGSEGV problem has been reported to the project. I would guess, however, that the disk taking uploads will be the first problem to be addressed.
ID: 56361
Desti
Joined: 6 Aug 04
Posts: 124
Credit: 9,195,838
RAC: 0
Message 56363 - Posted: 11 Jun 2017, 13:32:40 UTC

ID: 56363
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 56379 - Posted: 14 Jun 2017, 8:42:22 UTC - in response to Message 56363.  
Last modified: 14 Jun 2017, 8:51:59 UTC

I will email the project to ask whether those of us with Linux boxes should abort tasks from batch 583.

I am certainly thinking about doing this on my three boxes.
ID: 56379
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 56380 - Posted: 14 Jun 2017, 10:13:53 UTC - in response to Message 56379.  

Yes, if you are running Linux, please abort these tasks, as they will crash before the 4th zip is created. Oxford are trying to track down the root of the problem.
ID: 56380
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 56383 - Posted: 15 Jun 2017, 5:45:58 UTC - in response to Message 56380.  

And it seems those running Darwin should also abort tasks from this batch.
ID: 56383
bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,837,810
RAC: 9,716
Message 56387 - Posted: 15 Jun 2017, 19:04:11 UTC - in response to Message 56380.  

This one also failed on WIN, and when my Ubuntu box got the last reissue I simply aborted it as advised.
ID: 56387
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 56388 - Posted: 15 Jun 2017, 19:33:12 UTC - in response to Message 56387.  

This one also failed on WIN


However, that computer seems to be failing over 80% of the tasks thrown at it, so it is possibly not a good measure of the reliability of this batch of tasks.
ID: 56388
Venkatesh Srinivas
Joined: 7 May 17
Posts: 16
Credit: 3,480,030
RAC: 2,845
Message 56390 - Posted: 16 Jun 2017, 1:22:01 UTC

How can you tell what batch a task is from?
ID: 56390
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 56391 - Posted: 16 Jun 2017, 2:51:21 UTC - in response to Message 56390.  
Last modified: 16 Jun 2017, 2:51:52 UTC

How can you tell what batch a task is from?

For example, in wah2_eu50r_mzhp_20174_3_569_011014676_1, the batch is the number that precedes the long string of digits near the end. So, in the above example, the batch is 569. The number preceding the batch number is the number of model months the task has, so it's a 3 model month task.
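
As a rough illustration of that naming rule, here is a minimal Python sketch. It assumes the task name is underscore-separated and always ends with the long number followed by one short final number, as in the example; other name layouts may exist. The function name `parse_task_name` is made up for illustration and is not part of BOINC or CPDN.

```python
def parse_task_name(task_name):
    """Return (model_months, batch) from a task name.

    Assumption: fields are separated by underscores and the name ends with
    ..._<months>_<batch>_<long number>_<final number>, as in the example.
    """
    parts = task_name.split("_")
    if len(parts) < 4:
        raise ValueError(f"unexpected task name: {task_name!r}")
    # Count from the end so extra fields at the front do not matter.
    months, batch = parts[-4], parts[-3]
    if not (months.isdigit() and batch.isdigit()):
        raise ValueError(f"unexpected task name: {task_name!r}")
    return int(months), int(batch)


months, batch = parse_task_name("wah2_eu50r_mzhp_20174_3_569_011014676_1")
print(f"batch {batch}, {months} model months")
```

Counting fields from the end rather than the start is a deliberate choice here, since the post describes the batch by its position relative to the end of the name.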
ID: 56391
Alan K
Joined: 22 Feb 06
Posts: 492
Credit: 31,523,112
RAC: 14,304
Message 56392 - Posted: 16 Jun 2017, 8:12:05 UTC - in response to Message 56391.  

Batch 583 are 25-month models.
ID: 56392
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 56393 - Posted: 17 Jun 2017, 21:54:27 UTC - in response to Message 56392.  

Yeah. I was just using that task as an example of how to decode the batch and run length from the task name. But I can see how someone could think my post related to the crash of batch 583 tasks after 3 months/trickles.
ID: 56393
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,904,898
RAC: 2,026
Message 56394 - Posted: 19 Jun 2017, 9:39:50 UTC - in response to Message 56383.  

[Dave Jackson wrote:]And it seems those running Darwin should also abort tasks from this batch.

Confirmed on my Mac - SIGSEGV after three trickles.
ID: 56394
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 56395 - Posted: 19 Jun 2017, 16:50:30 UTC - in response to Message 56360.  
Last modified: 19 Jun 2017, 16:54:13 UTC

When you get this:
With quite a few of my wing men I also saw this error

../../projects/climateprediction.net/wah2_8.25_i686-pc-linux-gnu: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory


it usually means you are running a 64-bit version of Linux that lacks the 32-bit compatibility libraries. The executable is a 32-bit program, so the dynamic loader needs the 32-bit libstdc++.so.6; when it cannot find one, the program fails before it even starts.

Install those libraries and you should be OK. On my Red Hat Enterprise Linux 6.9 system, they are in

$ rpm -qf libstdc++.so.6
libstdc++-4.4.7-18.el6.i686 <---<<<


$ locate libstdc++.so.6
/usr/lib/libstdc++.so.6
/usr/lib/libstdc++.so.6.0.13
/usr/lib64/libstdc++.so.6
/usr/lib64/libstdc++.so.6.0.13

$ ls -l libstdc++*
lrwxrwxrwx. 1 root root 19 Mar 21 08:37 libstdc++.so.6 -> libstdc++.so.6.0.13
-rwxr-xr-x. 1 root root 930192 Oct 18 2016 libstdc++.so.6.0.13
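
To check whether a given copy of the library is the 32-bit one, you can look at the ELF header: the fifth byte of an ELF file (EI_CLASS) is 1 for 32-bit and 2 for 64-bit. Below is a minimal Python sketch of that check. The function name `elf_class` is made up for illustration, and the two paths are simply the ones from the listing above, so they are an assumption on any other distribution.

```python
import os

ELF_MAGIC = b"\x7fELF"


def elf_class(path):
    """Return 32 or 64 for an ELF file, or None if it is not a valid ELF file."""
    with open(path, "rb") as f:
        header = f.read(5)
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        return None
    # EI_CLASS is the fifth byte: 1 = ELFCLASS32, 2 = ELFCLASS64
    return {1: 32, 2: 64}.get(header[4])


# Paths taken from the listing above; they may differ on other systems.
for candidate in ("/usr/lib/libstdc++.so.6", "/usr/lib64/libstdc++.so.6"):
    if os.path.exists(candidate):
        print(candidate, "->", elf_class(candidate), "bit")
    else:
        print(candidate, "-> not found")
```

If no copy of libstdc++.so.6 reports 32, the 32-bit library is probably missing and the i686 executable will not load.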
ID: 56395