Stalled regional (PNW) model, is this a known behavior?

[B^S] sTrey Joined: 9 Jan 05 Posts: 30 Credit: 434,469 RAC: 0	Message 40813 - Posted: 4 Oct 2010, 5:44:34 UTC Just started running CPDN again, regional PNW models only. I run only 1 CPDN task at a time; the other cores run other projects, primarily WCG. This combination has not had any problems in the past and nothing else has changed on this rig. (6.10.58 on XP Pro 4GB, VM usage is staying within tolerable range.) The 2nd model accumulated over 4 hours of clock time before I noticed it had zero CPU time. After I suspended and then restarted the client, the model restarted at zero elapsed time, and so far it seems to be running normally. I've seen ice worlds but have not had a model stall in this way before. If it's an occasional but normal occurrence then never mind; I was just curious about this behavior. Apologies in advance if this is old news; I didn't see it in the READMEs I looked at.
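
The symptom described above (hours of wall-clock time but essentially no CPU time) can be written down as a simple check. This is only a sketch: the function name and both thresholds are hypothetical, and the two inputs are assumed to be read by hand from the elapsed-time and CPU-time figures that the BOINC client shows for a task.

```python
def looks_stalled(elapsed_seconds, cpu_seconds,
                  min_elapsed=3600.0, min_ratio=0.01):
    """Return True if a task has run long enough to judge and has used
    almost no CPU time relative to its wall-clock time.

    min_elapsed and min_ratio are illustrative thresholds, not values
    taken from BOINC or CPDN.
    """
    if elapsed_seconds < min_elapsed:
        # Too early to tell; a freshly started task may still be loading.
        return False
    return (cpu_seconds / elapsed_seconds) < min_ratio


# The case from the post: over 4 hours elapsed, zero CPU time.
print(looks_stalled(4 * 3600, 0.0))
# A healthy task that got most of a core for the same period.
print(looks_stalled(4 * 3600, 3.5 * 3600))
```

A ratio is used rather than an absolute CPU-time figure so that a task sharing a core with other work is not flagged just for being slow.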

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 40814 - Posted: 4 Oct 2010, 11:21:41 UTC This problem has never been reported before with any of the regional model types; in fact I can't remember anyone reporting this problem for any model type. Of course there are hundreds of people whose models have problems (or, more accurately in most cases, whose misconfigured computers have problems with models) but who never report it, so there's no guarantee that this has never happened before. There are AFAIK no known bugs in any of the regional model types. They are very memory-intensive, so running a full load of them on a multicore machine will probably make them slow each other down. We have warned members about this several times in the News thread, but it doesn't apply to your situation. It will be interesting to see whether your PNW now progresses normally. Thanks for reporting this.

old_user588361 Joined: 10 Sep 09 Posts: 9 Credit: 253,090 RAC: 0	Message 40884 - Posted: 19 Oct 2010, 17:24:14 UTC
10/19/2010 10:20:09 AM climateprediction.net task hadsm3dhet2_u6ac_006725419_1 aborted by user
This WU was "in progress", stuck at 33.3333% for over a week. Rebooted PC, suspended, restarted, etc. It wouldn't progress past 33.3333%. After I aborted it and did an "update" it reported it as a completed task, so hopefully someone there will be able to figure out why it crashed. Pulling up the Graphics on this WU showed a completely blue-covered planet, so maybe there was a flood (didn't see an Ark though).

[B^S] sTrey Joined: 9 Jan 05 Posts: 30 Credit: 434,469 RAC: 0	Message 40885 - Posted: 19 Oct 2010, 19:28:07 UTC - in response to Message 40884. Just FYI, this model completed fine.

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 40886 - Posted: 19 Oct 2010, 19:28:10 UTC - in response to Message 40884. Larry, this thread is about hadam3p models. The one that you posted about is a hadsm3 (slab ocean) model, which is totally different. These are prone to "ice world" behaviour, for which there's a thread just below this one, and an information sticky further up near the top of the list.

old_user588361 Joined: 10 Sep 09 Posts: 9 Credit: 253,090 RAC: 0	Message 40887 - Posted: 19 Oct 2010, 21:02:37 UTC - in response to Message 40885. Thanks for checking on it.

old_user588361 Joined: 10 Sep 09 Posts: 9 Credit: 253,090 RAC: 0	Message 40888 - Posted: 19 Oct 2010, 21:03:14 UTC - in response to Message 40886. And Les, sorry I posted in the wrong thread. Didn't notice the other one.

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 40889 - Posted: 19 Oct 2010, 21:10:05 UTC - in response to Message 40888. Posting in the wrong place isn't a problem. I was concerned that you'd missed all of the info about ice worlds, but I guess that you know about them now. :) The 'slab' models are the only ones that do this, by the way.

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 40890 - Posted: 20 Oct 2010, 17:52:39 UTC Last modified: 20 Oct 2010, 17:56:40 UTC
Hi Larry
I've looked at your HadSM's web page and noticed a few things.
* It crashed at 33.3%, which is the end of Phase 1, but there's no graph for Phase 1. HadSM produces 24 trickles for each phase plus a file at the end of each. Your model only sent 23 trickles.
* This means it probably crashed while it was post-processing Phase 1 and generating its file. HadSMs hate to be disturbed during post-processing, and this job takes them quite a while. If the computer is shut down or the owner exits from Boinc during this job, or during the next whole countdown when progress resumes, the model's likely to go wrong. It may crash, go back to the beginning of the phase or, as seems to be the case with yours, just stay stuck there, constantly trying but failing.
* So this isn't a case of the typical iceworld, which is caused by an as yet undiagnosed flaw within some models.
* Within a workunit one expects the model to behave the same way on all computers with the same CPU type (AMD or Intel) and operating system (Win, Linux or Mac). Your computer has AMD + Windows. Here's the workunit.
* Computers #7 and #10 in the list also have AMD + Windows. They completed the model. This means that the model itself was almost certainly OK.
* Even if models crash, they then say 'Completed' in the Boinc manager Status column. I think crashed tasks should say something like 'Finished prematurely' if Boinc can reliably detect this. I suggested this some months ago on a Boinc email list, but the Boinc programmers mustn't have been very keen on my idea.
Anyway, I wouldn't worry about it. You did the right thing to abort a model that wouldn't advance. Everybody crashes a model or two or more from time to time. But if you're running HadSMs, it's worth looking to see what point they've reached before exiting from Boinc.
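
The trickle arithmetic in the diagnosis above can be checked with a few lines of Python. This is a rough sketch, not anything taken from the CPDN or Boinc code: the 24 trickles per phase come from the post, while the count of 3 phases is an assumption inferred from 33.3% being the end of Phase 1, and the function name and the idea that trickles are evenly spaced in progress are also assumptions.

```python
TRICKLES_PER_PHASE = 24  # stated in the post for HadSM
PHASES = 3               # assumption: inferred from 33.3% = end of Phase 1


def trickle_position(trickles_sent):
    """Return (completed_phases, trickles_into_current_phase, percent_done),
    assuming trickles are evenly spaced over the whole run."""
    completed_phases, into_phase = divmod(trickles_sent, TRICKLES_PER_PHASE)
    percent = 100.0 * trickles_sent / (TRICKLES_PER_PHASE * PHASES)
    return completed_phases, into_phase, percent


# 23 trickles: still inside Phase 1, one trickle short of its end.
print(trickle_position(23))
# 24 trickles: Phase 1 complete, which corresponds to the 33.3% mark.
print(trickle_position(24))
```

Under these assumptions, a model showing 33.3% progress with only 23 trickles sent has reached the end of Phase 1 without sending its final trickle, which fits the explanation that it failed during post-processing.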