hadam3p eu WU segfault


Message boards : Number crunching : hadam3p eu WU segfault
pvh
Joined: 9 Apr 14
Posts: 14
Credit: 1,962,018
RAC: 0
Message 48914 - Posted: 27 Apr 2014, 0:04:37 UTC

I just had a hadam3p eu WU fail on a segfault

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>
SIGSEGV: segmentation violation
Stack trace (13 frames):
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu(boinc_catch_signal+0x6f)[0x8340f8f]
linux-gate.so.1(__kernel_sigreturn+0x0)[0x55578400]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x813cc30]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x81426bb]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x813a3ce]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x8143cea]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x813924a]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x8090a3e]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x8055900]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x8069392]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x806a310]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x82ccda2]
/lib/libc.so.6(__libc_start_main+0xf3)[0x555fc9d3]
... etc ...


The WU is http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=16599536

Is this a known problem? I could not find any mention of it on the forum...
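
When several work units crash on the same machine, it can help to check whether the traces point at the same code address. The following is a minimal sketch, not part of BOINC or the CPDN tools: `FRAME_RE`, `parse_frames` and `first_unsymbolised` are hypothetical names, and the comparison only makes sense assuming the same model binary version and an executable that is loaded at a fixed address. The frame after the signal handler and the signal trampoline is probably where the fault occurred.

```python
import re

# One backtrace frame looks like  module(symbol+offset)[0xADDR]  or  module[0xADDR].
FRAME_RE = re.compile(
    r"^(?P<module>[^\s(\[]+)"          # binary or library path
    r"(?:\((?P<symbol>[^)]*)\))?"      # optional (symbol+offset)
    r"\[(?P<addr>0x[0-9a-fA-F]+)\]$"   # absolute address
)

def parse_frames(stderr_text):
    """Return (module basename, symbol or None, address) for each frame line."""
    frames = []
    for line in stderr_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            module = match.group("module").rsplit("/", 1)[-1]
            frames.append((module, match.group("symbol"), int(match.group("addr"), 16)))
    return frames

def first_unsymbolised(frames):
    """First frame without a symbol name: probably the faulting code address."""
    for module, symbol, addr in frames:
        if symbol is None:
            return module, hex(addr)
    return None

# A few frames copied from the trace above.
SAMPLE = """\
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu(boinc_catch_signal+0x6f)[0x8340f8f]
linux-gate.so.1(__kernel_sigreturn+0x0)[0x55578400]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x813cc30]
/lib/libc.so.6(__libc_start_main+0xf3)[0x555fc9d3]
"""

print(first_unsymbolised(parse_frames(SAMPLE)))
```

If two crashes of the same model version give the same address, a software cause looks more plausible; addresses that differ from crash to crash are more in line with flaky hardware. Neither outcome is proof.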

Greg van Paassen
Joined: 17 Nov 07
Posts: 142
Credit: 4,271,370
RAC: 0
Message 48916 - Posted: 27 Apr 2014, 10:20:38 UTC - in response to Message 48914.  

Segfaults happen occasionally (rarely), with all model types.

Why? Who knows? A misbehaving driver for one of the computer's peripherals. Cosmic rays flipping a bit in the computer's RAM. Electrical 'noise' on the power supply from (for example) the refrigerator turning off. Static electricity building up and then discharging. Oxide films building up on connectors in the computer. A failing component or solder joint. Most likely: a very obscure bug in the model code that can never be reproduced by the developers, because it only shows up after a certain pattern of disk accesses, with model data in a specific place in RAM. Or something like that.

If segfaults happen frequently with one computer, then it might be worth investigating further, for example checking that all the connectors to the motherboard and disk drives are properly seated, and running Memtest86+ for 72 hours or so.

But a one-off segfault is nothing to worry about.

pvh
Joined: 9 Apr 14
Posts: 14
Credit: 1,962,018
RAC: 0
Message 48917 - Posted: 27 Apr 2014, 15:06:51 UTC

Thanks! All I know is that the segfault happened almost simultaneously with another WU on the same machine finishing. Could be a coincidence of course. But it would be consistent with the non-reproducibility...

Ananas
Volunteer moderator
Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 48918 - Posted: 27 Apr 2014, 17:34:35 UTC
Last modified: 27 Apr 2014, 17:34:53 UTC

Hmmm, just lately I had the impression that two models influenced each other in this thread. A Windows machine with less detailed error output, but still with a somewhat similar effect.

I guess it might be useful to have a look at the activity of models running concurrently on the same machine when such a thing happens.

Bonsai911
Joined: 9 Sep 04
Posts: 228
Credit: 30,231,740
RAC: 3,115
Message 48925 - Posted: 27 Apr 2014, 20:17:31 UTC

Hi Mr. Greg van Paassen,

that sounds right and cool.

Nice to read.

Greg van Paassen
Joined: 17 Nov 07
Posts: 142
Credit: 4,271,370
RAC: 0
Message 48926 - Posted: 27 Apr 2014, 23:36:55 UTC - in response to Message 48925.  

Thanks, Bonsai911! :-)

pvh
Joined: 9 Apr 14
Posts: 14
Credit: 1,962,018
RAC: 0
Message 48945 - Posted: 28 Apr 2014, 14:18:28 UTC

Just had a 2nd eu model crash on a segfault on the same machine:

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=16599512

The other types of climateprediction WUs are still OK on this machine, but I guess it is still early days... I will run memtest to be on the safe side, but I have not seen flaky behavior with other projects.

I have had 3 eu WUs on this machine so far, and all 3 failed. One immediately crashed due to a different cause which was reproduced on other machines, so that must be inherent to the WU. But the other two both failed on segfaults.

pvh
Joined: 9 Apr 14
Posts: 14
Credit: 1,962,018
RAC: 0
Message 48952 - Posted: 28 Apr 2014, 22:22:28 UTC

Just checked the memory. It's fine. No clue what is going on here...

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1079
Credit: 6,905,706
RAC: 6,529
Message 48954 - Posted: 29 Apr 2014, 0:09:50 UTC

pvh: Note that at least one of your crashed models has "Model crashed: INITTIME: Atmosphere basis time mismatch" (as, indeed, has one recent one of mine). This is a project configuration error and nothing to do with your machine. So at least you don't have to worry about that one.

pvh
Joined: 9 Apr 14
Posts: 14
Credit: 1,962,018
RAC: 0
Message 48971 - Posted: 29 Apr 2014, 20:32:07 UTC

That one crash is clearly reproducible. But in the meantime two more WUs crashed on a segfault. This time both were hadam3p anz models. This is clearly not a rare event. I am seriously contemplating abandoning this project...

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 48973 - Posted: 29 Apr 2014, 21:47:45 UTC - in response to Message 48971.  

Over the years there have been a number of reports about the SIGSEGV problem.
The vague ideas that I've formed about them, without knowing anything about the cause, are:
Always Linux.
They usually get posted about in the Linux thread.
They, like the "Visual Fortran run-time error" and the "exited with zero status but no 'finished' file" problems, only happen to some computers. Also, they start suddenly, and stop happening just as suddenly. Or, at least, stop getting posted about.

I haven't had it happen to my computers in the few months that I've been running Linux, so, basically, it's your computer that has a problem, and not the models.




Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,379,331
RAC: 3,596
Message 48980 - Posted: 30 Apr 2014, 6:43:42 UTC

Interesting, Les.
I have had these in the past on this box but not for a long time, certainly not since I replaced the PSU when it stopped booting up. The only other significant changes have been doubling the RAM to 4GB and periodically changing the Linux version: I started off with Mandriva, then had problems with one of its incarnations and moved over to Kubuntu. The trouble in working out the cause is, as you say, how erratic it is. I would get a couple in a month and then none for six months or more, which makes me wonder: while it is, as you say, almost certainly a computer problem, are some models more prone to it than others?

tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 48981 - Posted: 30 Apr 2014, 7:21:41 UTC
Last modified: 30 Apr 2014, 7:23:02 UTC

I have a hadam3p_anz unit at 50% after 366 hours on my SuSE Linux box. Oddly enough, estimated completion time is given as 107 hours only. How come?
Tullio

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 48983 - Posted: 30 Apr 2014, 8:23:28 UTC - in response to Message 48981.  

Probably in a loop. They don't last THAT long.


tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 48989 - Posted: 30 Apr 2014, 12:51:06 UTC - in response to Message 48983.  
Last modified: 30 Apr 2014, 12:51:19 UTC

Probably in a loop. They don't last THAT long.


Should I abort it?

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,379,331
RAC: 3,596
Message 48990 - Posted: 30 Apr 2014, 18:40:01 UTC - in response to Message 48989.  

Should I abort it?


Yes, models stuck in a loop are like the computer programmer found dead in the shower. Instructions on the bottle:
Wet hair
Apply Shampoo
Rinse
Repeat

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 48991 - Posted: 30 Apr 2014, 20:08:13 UTC - in response to Message 48990.  

If it actually only had 100 hours to go, a little over 450 hours total might not be a bad estimate for his computer, which is averaging 12 sec/TS for the ANZ model.

But at 50%, and 366 hours, there has to be a problem.
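
The arithmetic behind that judgement can be sketched as follows. This is only an illustration using the figures quoted in this thread; the function names are made up here, and it assumes progress is roughly linear in run time, which need not hold exactly.

```python
def projected_total_hours(elapsed_hours, fraction_done):
    """Total run time implied by the progress so far, assuming linear progress."""
    return elapsed_hours / fraction_done

def timesteps_in(hours, sec_per_timestep):
    """Number of timesteps that fit in `hours` at a given seconds-per-timestep rate."""
    return hours * 3600 / sec_per_timestep

elapsed, done, remaining_estimate = 366, 0.5, 107

# Total implied by "50% after 366 hours".
implied_total = projected_total_hours(elapsed, done)

# Total implied by the client's "107 hours remaining" estimate.
estimated_total = elapsed + remaining_estimate

print(implied_total, estimated_total)   # 732.0 versus 473
print(timesteps_in(450, 12))            # timesteps covered by 450 hours at 12 sec/TS
```

The two totals disagree by a wide margin, which is the inconsistency being pointed out: either the progress figure or the elapsed time does not reflect useful work.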

pvh
Joined: 9 Apr 14
Posts: 14
Credit: 1,962,018
RAC: 0
Message 48997 - Posted: 1 May 2014, 11:02:27 UTC

@Les Bayliss. Let me assure you that there are no hardware problems on my computer. I have checked. The machine is healthy.

You state that the segfaults only occur on Linux machines. This proves beyond a doubt that software problems must play a role in the problem as it can be ruled out that hardware problems would exclusively occur on Linux machines.

My guess is that it is the BOINC library. I have seen that library produce spurious segfaults on occasion on all my machines, e.g. when there is a problem with the ethernet switch. A heavy load on the disk or OS can also do this (i.e., when the kernel starts doing high-priority tasks). The problem (or problems, there could be multiple bugs in the BOINC library...) clearly needs a combination of factors to pop up and is in part driven by external factors.

I have read an interesting posting on another message board that stated that this was due to naive design of the BOINC library, pretending it had control over things it simply cannot control. Unfortunately I cannot find that posting again. But it mentioned arbitrary timeouts on system commands to "check" if the system is still healthy...

If so, there must be something in the way climateprediction uses the BOINC library that makes this problem more likely to pop up, as the segfaults are clearly far more frequent here than on other BOINC projects that I ran on the same box. Pointers could be that BOINC resides on a RAID5 array on this machine, and that it is my regular desktop machine (so it is more loaded with external tasks).


Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,888,554
RAC: 1,481,373
Message 48998 - Posted: 1 May 2014, 11:49:26 UTC - in response to Message 48997.  
Last modified: 1 May 2014, 11:56:13 UTC

The reason that SIGSEGV (signal 11) only happens on Linux and Mac machines is --
wait for it --
only on Linux and other *ix machines is this SIGSEGV defined.

SIGSEGV, signal 11, is defined on all *ix machines. On any of them the definition is basically: access memory you are not allowed to touch, get signal 11, SIGSEGV.

Other OSes, other definitions.

How, say, MS Windows reports this type of error, I don't know.

But on any Unix-based machine, SIGSEGV means either a programming error or a hardware error, and you need real skills to figure out which.
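
To make the signal 11 point concrete, here is a small sketch of how a parent process on a POSIX system sees a child that died on SIGSEGV. It is an illustration only, not how BOINC itself handles this; `run_and_describe` and `CRASHER` are names invented for the example, and the child simply sends SIGSEGV to itself rather than touching bad memory.

```python
import signal
import subprocess
import sys

def run_and_describe(cmd):
    """Run a command and say whether it exited normally or was killed by a signal."""
    proc = subprocess.run(cmd)
    if proc.returncode < 0:
        # On POSIX, subprocess reports death-by-signal as the negated signal number.
        sig = signal.Signals(-proc.returncode)
        return f"killed by {sig.name} (signal {sig.value})"
    return f"exited with status {proc.returncode}"

# A child that raises SIGSEGV against itself, standing in for a crashing model.
CRASHER = [
    sys.executable,
    "-c",
    "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)",
]

print(run_and_describe(CRASHER))
```

The signal itself says nothing about why the bad access happened, which is why a bug and a hardware fault look identical at this level.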

Refer to the hardware docs at Intel or AMD.

Almost always when you get a SIGSEGV on any U*x machine, it's either incompetent C programming or, more likely, a hardware problem.

pvh states:
"You state that the segfaults only occur on Linux machines. This proves beyond a doubt that software problems must play a role in the problem as it can be ruled out that hardware problems would exclusively occur on Linux machines."

Nah. Hardware problems happen on all machines. Linux reports segfaults; Windows reports -- ??

Non sequitur.

"Let me assure you that there are no hardware problems on my computer. I have checked. The machine is healthy."
This is a "famous last words" candidate.
Believe me, or don't: how many times have I thought "my machine has no problems, neither hardware nor OS nor DLLs"?
Don't pretend to be so sure about what none of us knows.

Oh, and let me add that I've been running mostly CPDN for almost a decade, and the very, very few SIGSEGVs in that time were almost always hardware problems.

tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 49010 - Posted: 1 May 2014, 14:08:06 UTC - in response to Message 48991.  

If it actually only had 100 hours to go, a little over 450 hours total might not be a bad estimate for his computer, which is averaging 12 sec/TS for the ANZ model.

But at 50%, and 366 hours, there has to be a problem.

Anyway I aborted it and am waiting for a new unit, while running SETI@home Astropulse and Test4Theory@home.
Tullio

©2024 climateprediction.net