Replanca Error/Sigseg fault.

Dave Jackson
Joined: 15 May 09
Posts: 1790
Credit: 2,671,578
RAC: 898
Message 56250 - Posted: 17 May 2017, 19:25:11 UTC

I notice that a retread I was crunching failed on my box at a run time close enough to the earlier failure to make me wonder whether it was more than coincidence. The other machine had a noticeably shorter run time, but it is faster. The Windows machine that had the first go hit a replanca error, whereas my Linux box failed with a SIGSEGV fault. I did reboot shortly before the failure, but the task was still running three minutes after I resumed computation.

Work unit is

https://www.cpdn.org/cpdnboinc/workunit.php?wuid=10824511
The task is now in the last chance saloon on another Windows machine.

bernard_ivo
Joined: 18 Jul 13
Posts: 252
Credit: 5,903,045
RAC: 23
Message 56251 - Posted: 17 May 2017, 19:51:33 UTC - in response to Message 56250.
Last modified: 17 May 2017, 19:52:17 UTC

All three WUs from batch 567 that I got on my Linux boxes failed with SIGSEGV: segmentation violation,
e.g. this one

bernard_ivo
Joined: 18 Jul 13
Posts: 252
Credit: 5,903,045
RAC: 23
Message 56344 - Posted: 9 Jun 2017, 6:35:10 UTC

Two of the WUS25 under Linux batch 583 crashed with
SIGSEGV: segmentation violation
.....
.....
Exiting...
Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=6999, iMonCtr=2
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=6859, iMonCtr=2
Model crash detected, will try to restart...
Leaving CPDN_ain::Monitor...
Calling boinc_finish...03:09:40 (6859): called boinc_finish(0)
In boinc_exit called with status 0
Calloing set_signal_exit_code with status 0

they seem to be unsent after the crash

Dave Jackson
Joined: 15 May 09
Posts: 1790
Credit: 2,671,578
RAC: 898
Message 56345 - Posted: 9 Jun 2017, 10:41:06 UTC - in response to Message 56344.

[bernard_ivo wrote:] they seem to be unsent after the crash


Do the re-issues only appear under the work unit once they have been sent? I'm not sure whether they go out next or go to the back of the queue behind the 10,000-odd tasks still waiting to be sent.

bernard_ivo
Joined: 18 Jul 13
Posts: 252
Credit: 5,903,045
RAC: 23
Message 56356 - Posted: 10 Jun 2017, 8:18:13 UTC - in response to Message 56345.
Last modified: 10 Jun 2017, 8:18:25 UTC

[Dave Jackson wrote:]
[bernard_ivo wrote:] they seem to be unsent after the crash

Do the re-issues only appear under the work unit once they have been sent? I'm not sure whether they go out next or go to the back of the queue behind the 10,000-odd tasks still waiting to be sent.


These two in particular have now been re-issued. One failed on Darwin and is in progress on a Windows machine; the other is also in progress on a Windows machine. I thought an application would run on a single platform only, or did I misunderstand the info?

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 1670
Credit: 32,083,245
RAC: 31,083
Message 56358 - Posted: 10 Jun 2017, 17:50:11 UTC - in response to Message 56344.

[bernard_ivo wrote:] Two of the WUS25 under Linux batch 583 crashed with
SIGSEGV: segmentation violation
.....
.....
Exiting...
Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=6999, iMonCtr=2
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=6859, iMonCtr=2
Model crash detected, will try to restart...
Leaving CPDN_ain::Monitor...
Calling boinc_finish...03:09:40 (6859): called boinc_finish(0)
In boinc_exit called with status 0
Calloing set_signal_exit_code with status 0

they seem to be unsent after the crash


All three of my batch 583 tasks that made it to the third trickle then crashed with SIGSEGV on my Linux box. It appears to be a Linux app problem with this particular batch. Batch 583 tasks under Windows have made it well past that point.

SIGSEGV: segmentation violation
Stack trace (12 frames):
/home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x839e357]
[0x2a9e3ca0]
/home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x81443f4]
/home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x814b133]
/home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8141220]
/home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x813ff46]
/home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8077583]
/home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x831cd74]
/home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8330985]
/home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x833318a]
/home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8334c8d]
/lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf7)[0x2a7ad637]

Exiting...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=11748, iMonCtr=2
Model crash detected, will try to restart...
Leaving CPDN_ain::Monitor...
Calling boinc_finish...08:10:00 (11748): called boinc_finish(0)
In boinc_exit called with status 0
Calloing set_signal_exit_code with status 0

pvh
Joined: 9 Apr 14
Posts: 14
Credit: 1,951,379
RAC: 2,145
Message 56360 - Posted: 11 Jun 2017, 8:12:15 UTC - in response to Message 56358.

I see the same issue with the segfaults: 100% failure rate on WUs from the 583 batch, all after about the same amount of CPU time (so this doesn't look random at all). Looking at my wing men, I could not find a single one where the task finished OK (including a few Windows computers). These are statistics on 14 failed WUs. Several of those are already counted out with 3 failures.

With quite a few of my wing men I also saw this error

../../projects/climateprediction.net/wah2_8.25_i686-pc-linux-gnu: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory

Dave Jackson
Joined: 15 May 09
Posts: 1790
Credit: 2,671,578
RAC: 898
Message 56361 - Posted: 11 Jun 2017, 8:30:04 UTC - in response to Message 56360.

[pvh wrote:] With quite a few of my wing men I also saw this error

../../projects/climateprediction.net/wah2_8.25_i686-pc-linux-gnu: error while loading shared libraries: libstdc++


That is an unrelated problem. Those of us who use Linux have to (in most cases) manually install some 32-bit libraries in order to get the executables to run.

The SIGSEGV problem has been reported to the project. I would guess, however, that the disk taking uploads will be the first problem to be addressed.

Desti
Joined: 6 Aug 04
Posts: 124
Credit: 9,195,838
RAC: 0
Message 56363 - Posted: 11 Jun 2017, 13:32:40 UTC

Can confirm 583 does three trickles and then segfaults.

https://www.cpdn.org/cpdnboinc/result.php?resultid=20467695
https://www.cpdn.org/cpdnboinc/result.php?resultid=20467661
https://www.cpdn.org/cpdnboinc/result.php?resultid=20467816
https://www.cpdn.org/cpdnboinc/result.php?resultid=20467712
https://www.cpdn.org/cpdnboinc/result.php?resultid=20467912
https://www.cpdn.org/cpdnboinc/result.php?resultid=20467799
https://www.cpdn.org/cpdnboinc/result.php?resultid=20467522


They all went to the exact same point of 1,539.60 credits.
____________
Linux Users Everywhere @ BOINC

Dave Jackson
Joined: 15 May 09
Posts: 1790
Credit: 2,671,578
RAC: 898
Message 56379 - Posted: 14 Jun 2017, 8:42:22 UTC - in response to Message 56363.
Last modified: 14 Jun 2017, 8:51:59 UTC

I will email the project to ask whether those of us with Linux boxes should abort tasks from batch 583.

I am certainly thinking about doing this on my three boxes.

Dave Jackson
Joined: 15 May 09
Posts: 1790
Credit: 2,671,578
RAC: 898
Message 56380 - Posted: 14 Jun 2017, 10:13:53 UTC - in response to Message 56379.

Yes, if you are running Linux, please abort these tasks, as they will crash before the fourth zip is created. Oxford are trying to track down the root of the problem.

Dave Jackson
Joined: 15 May 09
Posts: 1790
Credit: 2,671,578
RAC: 898
Message 56383 - Posted: 15 Jun 2017, 5:45:58 UTC - in response to Message 56380.

And it seems those running Darwin should also abort tasks from this batch.

bernard_ivo
Joined: 18 Jul 13
Posts: 252
Credit: 5,903,045
RAC: 23
Message 56387 - Posted: 15 Jun 2017, 19:04:11 UTC - in response to Message 56380.

This one also failed on WIN, and when my Ubuntu box got the last reissue I simply aborted it as advised.

Dave Jackson
Joined: 15 May 09
Posts: 1790
Credit: 2,671,578
RAC: 898
Message 56388 - Posted: 15 Jun 2017, 19:33:12 UTC - in response to Message 56387.

[bernard_ivo wrote:] This one also failed on WIN


However, that computer seems to be failing over 80% of the tasks thrown at it, so it is possibly not a good measure of the reliability of this batch of tasks.

Venkatesh Srinivas
Joined: 7 May 17
Posts: 15
Credit: 450,081
RAC: 458
Message 56390 - Posted: 16 Jun 2017, 1:22:01 UTC

How can you tell what batch a task is from?

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 1670
Credit: 32,083,245
RAC: 31,083
Message 56391 - Posted: 16 Jun 2017, 2:51:21 UTC - in response to Message 56390.
Last modified: 16 Jun 2017, 2:51:52 UTC

[Venkatesh Srinivas wrote:] How can you tell what batch a task is from?

Take wah2_eu50r_mzhp_20174_3_569_011014676_1 as an example: the batch is the number that precedes the long string of digits near the end. So, in this example, the batch is 569. The number preceding the batch number is the number of model months in the task, so it's a 3 model month task.
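The decoding rule above can be sketched in a few lines of Python. This is only an illustration: the function name decode_task_name is made up here, and it assumes every task name uses the same underscore-separated layout as the example, which may not hold for other applications or older batches.

```python
def decode_task_name(name):
    """Read the batch and model-month fields from a task name.

    Assumes the layout of the example in this thread, counting fields
    from the end of the name:
      [-1] trailing suffix, [-2] long digit string,
      [-3] batch number,    [-4] number of model months.
    """
    parts = name.split("_")
    if len(parts) < 4:
        raise ValueError("unexpected task name layout: %r" % name)
    return {"batch": int(parts[-3]), "model_months": int(parts[-4])}


print(decode_task_name("wah2_eu50r_mzhp_20174_3_569_011014676_1"))
# prints {'batch': 569, 'model_months': 3}
```

Counting from the end rather than the start is deliberate, since only the fields near the end are described above.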

Alan K
Joined: 22 Feb 06
Posts: 203
Credit: 10,727,650
RAC: 8,191
Message 56392 - Posted: 16 Jun 2017, 8:12:05 UTC - in response to Message 56391.

Batch 583 are 25-month models.

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 1670
Credit: 32,083,245
RAC: 31,083
Message 56393 - Posted: 17 Jun 2017, 21:54:27 UTC - in response to Message 56392.

Yeah. I was just using that task as an example of how to decode the batch and run length from the task name. But I can see how someone could think my post related to the crash of batch 583 tasks after 3 months/trickles.

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 877
Credit: 100,083
RAC: 3,242
Message 56394 - Posted: 19 Jun 2017, 9:39:50 UTC - in response to Message 56383.

[Dave Jackson wrote:] And it seems those running Darwin should also abort tasks from this batch.

Confirmed on my Mac - SIGSEGV after three trickles.

Jean-David Beyer
Joined: 5 Aug 04
Posts: 145
Credit: 2,941,994
RAC: 3
Message 56395 - Posted: 19 Jun 2017, 16:50:30 UTC - in response to Message 56360.
Last modified: 19 Jun 2017, 16:54:13 UTC

When you get this:

[pvh wrote:] With quite a few of my wing men I also saw this error

../../projects/climateprediction.net/wah2_8.25_i686-pc-linux-gnu: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory


it means you are running a 64-bit version of Linux that does not have the 32-bit libraries installed. The application is a 32-bit (i686) executable, so the dynamic loader needs a 32-bit libstdc++.so.6; when it cannot find one, the program fails before it even starts.

Install those libraries and you should be OK. On my Red Hat Enterprise Linux 6.9 system, this is where they are:

$ rpm -qf libstdc++.so.6
libstdc++-4.4.7-18.el6.i686 <---<<<


$ locate libstdc++.so.6
/usr/lib/libstdc++.so.6
/usr/lib/libstdc++.so.6.0.13
/usr/lib64/libstdc++.so.6
/usr/lib64/libstdc++.so.6.0.13

$ ls -l libstdc++*
lrwxrwxrwx. 1 root root 19 Mar 21 08:37 libstdc++.so.6 -> libstdc++.so.6.0.13
-rwxr-xr-x. 1 root root 930192 Oct 18 2016 libstdc++.so.6.0.13
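
If you are not sure whether a given copy of the library is the 32-bit or the 64-bit one, the ELF header will tell you: the fifth byte of the file is 1 for 32-bit and 2 for 64-bit. Below is a minimal Python sketch of that check. The function name elf_class is made up for illustration, and the two default paths are simply the ones from my listing above; they may be different on your distribution.

```python
import sys

ELF_MAGIC = b"\x7fELF"                      # first four bytes of every ELF file
EI_CLASS_NAMES = {1: "32-bit", 2: "64-bit"}  # values of the EI_CLASS byte


def elf_class(path):
    """Return '32-bit', '64-bit', or None if the file is not recognisable ELF."""
    with open(path, "rb") as f:
        header = f.read(5)
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        return None
    return EI_CLASS_NAMES.get(header[4])


if __name__ == "__main__":
    # Paths given on the command line, or the two from the listing above.
    default_paths = ["/usr/lib/libstdc++.so.6", "/usr/lib64/libstdc++.so.6"]
    for path in sys.argv[1:] or default_paths:
        try:
            print(path, "->", elf_class(path))
        except OSError as exc:
            print(path, "-> cannot read:", exc)
```

The 32-bit application needs a copy that reports 32-bit; a 64-bit copy on its own will not satisfy the loader.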


Copyright © 2017 climateprediction.net