Thread 'Replanca Error/Sigseg fault.'

Author	Message
Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944	Message 56250 - Posted: 17 May 2017, 19:25:11 UTC
I notice that a retread I was crunching failed on my box after a run time close enough to the previous attempt's that I wonder whether it is more than coincidence. The other machine had a noticeably shorter time, but it was a faster machine. The Windows machine which had the first go had a replanca error, whereas my Linux box failed with a sigsegv fault. I did have a reboot shortly before the failure, but the task was still running three minutes after I resumed computation.
Work unit is https://www.cpdn.org/cpdnboinc/workunit.php?wuid=10824511
The task is now in the last chance saloon on another Windows machine.

bernard_ivo Joined: 18 Jul 13 Posts: 438 Credit: 25,837,810 RAC: 9,716	Message 56251 - Posted: 17 May 2017, 19:51:33 UTC - in response to Message 56250. Last modified: 17 May 2017, 19:52:17 UTC
All 3 WUs from batch 567 that I got on my Linux boxes failed with SIGSEGV: segmentation violation, e.g. this one

bernard_ivo Joined: 18 Jul 13 Posts: 438 Credit: 25,837,810 RAC: 9,716	Message 56344 - Posted: 9 Jun 2017, 6:35:10 UTC
Two of the WUS25 tasks from batch 583 crashed under Linux with
    SIGSEGV: segmentation violation
    .....
    .....
    Exiting...
    Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=6999, iMonCtr=2
    Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=6859, iMonCtr=2
    Model crash detected, will try to restart...
    Leaving CPDN_ain::Monitor...
    Calling boinc_finish...03:09:40 (6859): called boinc_finish(0)
    In boinc_exit called with status 0
    Calloing set_signal_exit_code with status 0
They seem to be unsent after the crash.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944	Message 56345 - Posted: 9 Jun 2017, 10:41:06 UTC - in response to Message 56344.
[bernard_ivo wrote:] they seem to be unsent after the crash
Do the re-issues only appear under the work unit once they have been sent? I am not sure whether they go out next or go to the back of the queue behind the 10,000-odd tasks still waiting to be sent.

bernard_ivo Joined: 18 Jul 13 Posts: 438 Credit: 25,837,810 RAC: 9,716	Message 56356 - Posted: 10 Jun 2017, 8:18:13 UTC - in response to Message 56345. Last modified: 10 Jun 2017, 8:18:25 UTC
[bernard_ivo wrote:] they seem to be unsent after the crash
[Dave Jackson wrote:] Do the re-issues only appear under the work unit once they have been sent? Not sure whether they go out next or go to the back of the queue behind the 10,000 odd tasks still waiting to be sent?
These two in particular have now been re-issued. One failed on Darwin and is in progress on a Windows machine; the other is in progress on a Windows machine. I thought applications would run on a single platform only, or have I misunderstood the info?

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 56358 - Posted: 10 Jun 2017, 17:50:11 UTC - in response to Message 56344.
[bernard_ivo wrote:]
    Two of the WUS25 under Linux batch 583 crashed with
    SIGSEGV: segmentation violation
    .....
    .....
    Exiting...
    Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=6999, iMonCtr=2
    Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=6859, iMonCtr=2
    Model crash detected, will try to restart...
    Leaving CPDN_ain::Monitor...
    Calling boinc_finish...03:09:40 (6859): called boinc_finish(0)
    In boinc_exit called with status 0
    Calloing set_signal_exit_code with status 0
    they seem to be unsent after the crash
All three of my batch 583 tasks that made it to the third trickle crashed with sigsegv on my Linux box. It appears to be a Linux app problem with this particular batch. Batch 583 tasks under Windows have made it well past that point.
    SIGSEGV: segmentation violation
    Stack trace (12 frames):
    /home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x839e357]
    [0x2a9e3ca0]
    /home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x81443f4]
    /home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x814b133]
    /home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8141220]
    /home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x813ff46]
    /home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8077583]
    /home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x831cd74]
    /home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8330985]
    /home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x833318a]
    /home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8334c8d]
    /lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf7)[0x2a7ad637]
    Exiting...
    Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=11748, iMonCtr=2
    Model crash detected, will try to restart...
    Leaving CPDN_ain::Monitor...
    Calling boinc_finish...08:10:00 (11748): called boinc_finish(0)
    In boinc_exit called with status 0
    Calloing set_signal_exit_code with status 0

pvh Joined: 9 Apr 14 Posts: 14 Credit: 1,962,018 RAC: 0	Message 56360 - Posted: 11 Jun 2017, 8:12:15 UTC - in response to Message 56358.
I see the same issue with the segfaults: 100% failure rate on WUs from the 583 batch, all after about the same amount of CPU time (so this doesn't look random at all). Looking at my wingmen, I could not find a single one where the task finished OK (including a few Windows computers). These are statistics on 14 failed WUs. Several of those are already counted out with 3 failures.
With quite a few of my wingmen I also saw this error:
    ../../projects/climateprediction.net/wah2_8.25_i686-pc-linux-gnu: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944	Message 56361 - Posted: 11 Jun 2017, 8:30:04 UTC - in response to Message 56360.
[pvh wrote:] With quite a few of my wing men I also saw this error ../../projects/climateprediction.net/wah2_8.25_i686-pc-linux-gnu: error while loading shared libraries: libstdc++
That is an unrelated problem. Those of us who use Linux have to (in most cases) manually install some 32-bit libraries in order to get the executables to run. The sigsegv fault problem has been reported to the project. I would guess, however, that the disk taking uploads will be the first problem to be addressed.
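The two failure modes distinguished above leave different text in a task's stderr output, so they can be told apart mechanically. The following is only a minimal sketch, assuming the stderr text has already been copied into a string; the function name `classify_failure` and the returned labels are hypothetical, and the matched substrings are simply the ones quoted in this thread.

```python
def classify_failure(stderr_text: str) -> str:
    """Return a rough label for a failed task based on its stderr text.

    The substrings below are taken from the messages quoted in this thread;
    other failure modes are reported as "unknown".
    """
    # The dynamic loader message means the executable never started.
    if "error while loading shared libraries" in stderr_text:
        return "missing-library"
    # The model started and later crashed with a segmentation violation.
    if "SIGSEGV" in stderr_text:
        return "segfault"
    return "unknown"


if __name__ == "__main__":
    sample = "SIGSEGV: segmentation violation\nStack trace (12 frames):"
    print(classify_failure(sample))
```

The missing-library check comes first because a task that cannot load its libraries never reaches the point where the model could segfault.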

Desti Joined: 6 Aug 04 Posts: 124 Credit: 9,195,838 RAC: 0	Message 56363 - Posted: 11 Jun 2017, 13:32:40 UTC
Can confirm 583 does three trickles and then segfaults.
https://www.cpdn.org/cpdnboinc/result.php?resultid=20467695
https://www.cpdn.org/cpdnboinc/result.php?resultid=20467661
https://www.cpdn.org/cpdnboinc/result.php?resultid=20467816
https://www.cpdn.org/cpdnboinc/result.php?resultid=20467712
https://www.cpdn.org/cpdnboinc/result.php?resultid=20467912
https://www.cpdn.org/cpdnboinc/result.php?resultid=20467799
https://www.cpdn.org/cpdnboinc/result.php?resultid=20467522
They all went to the exact same point of 1,539.60 credits.
Linux Users Everywhere @ BOINC

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944	Message 56379 - Posted: 14 Jun 2017, 8:42:22 UTC - in response to Message 56363. Last modified: 14 Jun 2017, 8:51:59 UTC
Will email the project to ask if those of us with Linux boxes should abort tasks from batch 583. Certainly thinking about doing this on my three boxes.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944	Message 56380 - Posted: 14 Jun 2017, 10:13:53 UTC - in response to Message 56379.
Yes, if running Linux, please abort these tasks, as they will crash before the 4th zip is created. Oxford are trying to track down the root of the problem.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944	Message 56383 - Posted: 15 Jun 2017, 5:45:58 UTC - in response to Message 56380.
And it seems those running Darwin should also abort tasks from this batch.

bernard_ivo Joined: 18 Jul 13 Posts: 438 Credit: 25,837,810 RAC: 9,716	Message 56387 - Posted: 15 Jun 2017, 19:04:11 UTC - in response to Message 56380.
This one also failed on Windows, and when my Ubuntu box got the last reissue I simply aborted it as advised.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944	Message 56388 - Posted: 15 Jun 2017, 19:33:12 UTC - in response to Message 56387.
[bernard_ivo wrote:] This one also failed on WIN
However, that computer seems to be failing over 80% of the tasks thrown at it, so it is possibly not a good measure of the reliability of this batch of tasks.

Venkatesh Srinivas Joined: 7 May 17 Posts: 16 Credit: 3,480,030 RAC: 2,845	Message 56390 - Posted: 16 Jun 2017, 1:22:01 UTC
How can you tell what batch a task is from?

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 56391 - Posted: 16 Jun 2017, 2:51:21 UTC - in response to Message 56390. Last modified: 16 Jun 2017, 2:51:52 UTC
[Venkatesh Srinivas wrote:] How can you tell what batch a task is from?
For example, in wah2_eu50r_mzhp_20174_3_569_011014676_1, the batch is the number that precedes the long string of digits near the end. So, in the above example, the batch is 569. The number preceding the batch number is the number of model months that task has, so it's a 3 model month task.
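The decoding rule described above can be written down as a few lines of Python. This is a sketch under the assumption that every task name follows the same underscore-separated layout as the single example given (model months, then batch, then the long number, then a trailing suffix); the names `TaskInfo` and `parse_task_name` are hypothetical and not part of any CPDN or BOINC tool.

```python
from typing import NamedTuple


class TaskInfo(NamedTuple):
    model_months: int
    batch: int


def parse_task_name(name: str) -> TaskInfo:
    """Extract model months and batch number from a task name.

    Assumes the layout of the example in this thread:
    ..._<model months>_<batch>_<long number>_<suffix>
    """
    parts = name.split("_")
    if len(parts) < 4:
        raise ValueError(f"unexpected task name layout: {name!r}")
    # Counting from the end: suffix, long number, batch, model months.
    return TaskInfo(model_months=int(parts[-4]), batch=int(parts[-3]))


if __name__ == "__main__":
    info = parse_task_name("wah2_eu50r_mzhp_20174_3_569_011014676_1")
    print(info.batch, info.model_months)
```

Counting fields from the end rather than the start avoids depending on how many underscore-separated parts the application and region names contain.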

Alan K Joined: 22 Feb 06 Posts: 492 Credit: 31,523,112 RAC: 14,304	Message 56392 - Posted: 16 Jun 2017, 8:12:05 UTC - in response to Message 56391.
Batch 583 tasks are 25-month models.

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 56393 - Posted: 17 Jun 2017, 21:54:27 UTC - in response to Message 56392.
Yeah. I was just using that task as an example of how to decode the batch and run length from the task name. But I can see how someone could think my post related to the crash of batch 583 tasks after 3 months/trickles.

Iain Inglis Volunteer moderator Joined: 16 Jan 10 Posts: 1084 Credit: 7,904,898 RAC: 2,026	Message 56394 - Posted: 19 Jun 2017, 9:39:50 UTC - in response to Message 56383.
[Dave Jackson wrote:] And it seems those running Darwin should also abort tasks from this batch.
Confirmed on my Mac: SIGSEGV after three trickles.

Jean-David Beyer Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 56395 - Posted: 19 Jun 2017, 16:50:30 UTC - in response to Message 56360. Last modified: 19 Jun 2017, 16:54:13 UTC
When you get this:
[pvh wrote:] With quite a few of my wing men I also saw this error ../../projects/climateprediction.net/wah2_8.25_i686-pc-linux-gnu: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory
it means you are running on a 64-bit machine with a 64-bit version of Linux that does not have the 32-bit libraries installed. The executable is 32-bit, so the dynamic loader needs the 32-bit libstdc++.so.6, and the program cannot start without it. Install those libraries and you should be OK. On my Red Hat Enterprise Linux 6.9 system, they are in
    $ rpm -qf libstdc++.so.6
    libstdc++-4.4.7-18.el6.i686   <---<<<
    $ locate libstdc++.so.6
    /usr/lib/libstdc++.so.6
    /usr/lib/libstdc++.so.6.0.13
    /usr/lib64/libstdc++.so.6
    /usr/lib64/libstdc++.so.6.0.13
    $ ls -l libstdc++*
    lrwxrwxrwx. 1 root root     19 Mar 21 08:37 libstdc++.so.6 -> libstdc++.so.6.0.13
    -rwxr-xr-x. 1 root root 930192 Oct 18  2016 libstdc++.so.6.0.13
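Whether a given libstdc++.so.6 is the 32-bit or the 64-bit build can be read from the file itself: in the ELF format, the fifth byte of the header (EI_CLASS) is 1 for 32-bit and 2 for 64-bit objects. The sketch below is illustrative only; the function name `elf_class` is hypothetical, and the two paths in the usage section are the ones from the Red Hat listing above, which may differ on other distributions.

```python
import os


def elf_class(path: str) -> str:
    """Report whether an ELF file is a 32-bit or 64-bit object.

    Reads only the first five bytes: the 4-byte ELF magic followed by
    the EI_CLASS byte (1 = 32-bit, 2 = 64-bit).
    """
    with open(path, "rb") as f:
        header = f.read(5)
    if len(header) < 5 or header[:4] != b"\x7fELF":
        return "not-elf"
    return {1: "32-bit", 2: "64-bit"}.get(header[4], "unknown")


if __name__ == "__main__":
    # Paths taken from the listing above; they are assumptions elsewhere.
    for candidate in ("/usr/lib/libstdc++.so.6", "/usr/lib64/libstdc++.so.6"):
        if os.path.exists(candidate):
            print(candidate, elf_class(candidate))
        else:
            print(candidate, "not found")
```

If no candidate reports "32-bit", the 32-bit CPDN executable has nothing to load, which matches the "cannot open shared object file" error quoted in this thread.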