HADSM3 total CPU time goes up, but so does To Completion !

JimMcCarthy_StellarSolns
Joined: 3 Sep 08
Posts: 23
Credit: 41,989,607
RAC: 2,734
Message 35501 - Posted: 14 Nov 2008, 21:41:19 UTC
Last modified: 14 Nov 2008, 21:52:48 UTC

Something seems to be awry with this task:

> Task ID 8083338
> Name hadsm3mh_kj5j_005999507_7
> Workunit 6234188
> Server state In Progress
> Computer ID 909163

It continues to crunch (at least, the CPU time reported in BOINC Manager keeps rising -- currently up to 1106 hr 50 min and climbing), but the CPU time "To Completion" reported by BOINC Manager is *increasing* (slowly ... 62 hr 34 min yet to go ... 94.595%), so it appears I'm not getting any closer to completion.
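A rising "To Completion" figure is what a simple linear extrapolation would produce once the percent done stops moving while CPU time keeps accumulating. The sketch below is only an illustration of that effect: the formula is an assumption, not BOINC's documented estimator, and `linear_remaining` is a made-up name. With the figures quoted above it gives roughly 63 hours, in the same ballpark as the 62 hr 34 min shown.

```python
def linear_remaining(elapsed_hours: float, fraction_done: float) -> float:
    """Estimate hours left by assuming progress so far was made at a steady rate.

    Assumption: remaining = elapsed * (1 - f) / f. This is a generic
    extrapolation, not necessarily what BOINC Manager computes.
    """
    if not 0.0 < fraction_done < 1.0:
        raise ValueError("fraction_done must be strictly between 0 and 1")
    return elapsed_hours * (1.0 - fraction_done) / fraction_done


# Figures from the post: 1106 hr 50 min of CPU time at 94.595% done.
elapsed = 1106 + 50 / 60
stuck_fraction = 0.94595

now = linear_remaining(elapsed, stuck_fraction)
# One more day of crunching with the percentage stuck at the same value.
tomorrow = linear_remaining(elapsed + 24, stuck_fraction)

print(f"estimate now:      {now:.1f} hours")
print(f"estimate tomorrow: {tomorrow:.1f} hours")
```

Under this assumption the estimate can only grow while the percentage stays fixed, which would match the slow increase described.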

Calling up the model graphics, I see it's up to time step 203224 of 259248 (current model date is 04/09/2062 20:00), but the globe is almost devoid of clouds (only around the coast of Antarctica and northern Greenland), with no precipitation anywhere and very cold (off-scale at -24C or below) global temperatures.

Although the model is still crunching (time steps and model date/time do very slowly advance), it certainly does seem that something is wrong (both in the current year 2062 global climate, and in that the longer it runs, the farther from completion BOINC Manager says it's getting). No trickles have been uploaded for this task in the last 8 days (since 06-Nov), which is also suspicious.

I do have WinZIP backups of my BOINC data folders made on 05-Nov and 12-Nov, but since this machine is running 4 CPDN tasks, I would need special instructions for how to restore from ZIP just one of the four tasks, if advice is to go back to 05-Nov backup for this model and reprocess from that point forward. Or should I just be patient and let it slog through the remaining 56,024 time steps no matter how long it takes ?

Watching the BOINC manager graphics for this task, it seems to be slogging along at one time step every 5-min or so, which could mean another 5,000 hours (not 62-hrs) assuming the model neither crashes nor speeds up again.
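The rough figure above can be checked from the numbers already quoted in this post. A minimal sketch, assuming the rate really does stay at one time step per 5 minutes (the variable names are illustrative only):

```python
# Figures taken from the post; the steady 5-minute rate is an assumption.
total_steps = 259_248
current_step = 203_224
minutes_per_step = 5

remaining_steps = total_steps - current_step
remaining_hours = remaining_steps * minutes_per_step / 60
remaining_days = remaining_hours / 24

print(f"{remaining_steps:,} time steps left")
print(f"about {remaining_hours:,.0f} hours, or {remaining_days:.0f} days")
```

That works out to a little under 4,700 hours, so "another 5,000 hours" is the right order of magnitude.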

-- Jim

P.S. I looked at which other CPDN volunteers and computers have been assigned this same model task. I see someone else is running it with the exact same accumulated credit as I have so far (stuck in a similar state?), but meanwhile someone else _has_ completed the model, and their Phase 4 (P4) temperature and precipitation plots looked relatively "normal" -- certainly not the same cloud-free, dry, and cold year 2062 world I'm looking at ... suggesting execution of the model at my end experienced a computational glitch at some point recently?
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 35503 - Posted: 15 Nov 2008, 1:08:06 UTC

Hi Jim

You've got an 'iceworld'. Nothing has really turned to ice, but computation gets stuck and the graphics colours go to their defaults, which for temp is blue, so the whole globe appears frozen. A small but significant proportion of HADSM models, including the mid-Holocenes, are prone to this problem, which is described here.

Sometimes within a WU every task is affected. Sometimes it depends on the OS. One person who's finished this task has Linux, which doesn't surprise me, but the other person who's completed it has Windows like you, which does surprise me a little.

If no other tasks in a WU develop this problem, only yours, you need to think about whether the computer's completely stable. But that isn't the case here.

Abort it. Don't try to hang onto it or restore it. I'll send a private message to everyone still running a task from this WU to warn them.

If I were you I'd subscribe to the forum News thread, which is at the top of Number Crunching. That's where we announce known issues with models.


JimMcCarthy_StellarSolns
Joined: 3 Sep 08
Posts: 23
Credit: 41,989,607
RAC: 2,734
Message 35505 - Posted: 15 Nov 2008, 19:22:50 UTC
Last modified: 15 Nov 2008, 19:24:46 UTC

Hi mo.v --

Thanks for the explanation and pointer to previous threads about this issue. As it appears possible that some machine-dependent "glitch" may have triggered the model's transition into an 'iceworld', I do now recall that a week ago I experienced a "blue screen" crash on that computer (an HP ML150 G2 server I named "b" since it's my second one at work) while starting up a software tool that was seriously incompatible with the SATA disk controller driver in that machine (my first ML150 G2 machine uses SCSI). Given that, as you pointed out, someone has successfully completed this model under Windows, might it be worth a shot to restore it from my WinZIP backup of the BOINC data folder tree made on 05-Nov (one day before the last trickle was uploaded for the model that's now an 'ice-world', and when all still appeared to be normal) and let it run for another week or two, just to see if it becomes an 'ice-world' again? Absent a repeat of whatever "glitch" triggered the 'ice-world' the first time, it could finish completely in another 10 days or so....

Under the BOINC data folder, I see subfolders for "projects" and "slots", and under BOINC\projects\climateprediction.net\ there is one subfolder per model. It seems straightforward to restore selectively the subfolder for the 'ice-world' model only (leaving alone the subdirectories for the other 3 models in progress on this machine, which appear to be doing fine). I think I'd also need to diff the *.xml files in the parent BOINC data directory, and whatever portions apply to the 'ice-world' model will also need to be set back to what they were on 05-Nov before restarting BOINC. But is there anything I also need to restore selectively in the BOINC\slots\ directory tree, say for the slot running the model that's become an 'ice-world'?
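The "restore one subfolder only" step described above could be scripted along these lines. This is a sketch under stated assumptions, not a tested CPDN restore procedure: the function names, the archive name and the folder names in the usage note are hypothetical, it handles only the files under one model's subfolder, and it does nothing about the *.xml state files or the slots directory asked about above. BOINC would need to be shut down before anything is written back.

```python
import zipfile


def members_under(names, subfolder):
    """Return the archive member names that live inside `subfolder`.

    Both forward slashes and backslashes are accepted as separators.
    """
    prefix = subfolder.replace("\\", "/").strip("/") + "/"
    return [n for n in names if n.replace("\\", "/").startswith(prefix)]


def restore_subfolder(backup_zip, subfolder, dest_dir):
    """Extract only the members under `subfolder` from `backup_zip` into `dest_dir`.

    Everything else in the archive (other models, XML files, slots) is left
    untouched. Returns the list of member names that were extracted.
    """
    with zipfile.ZipFile(backup_zip) as zf:
        wanted = members_under(zf.namelist(), subfolder)
        for name in wanted:
            zf.extract(name, dest_dir)
    return wanted
```

A call might look like `restore_subfolder("boinc_backup.zip", "projects\\climateprediction.net\\some_model", "restore_staging")`, where all three arguments are placeholders. Extracting into a staging folder first, rather than straight over the live data directory, leaves room to compare files before copying anything back.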

Thanks again,

-- Jim
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 35506 - Posted: 15 Nov 2008, 19:59:57 UTC

On our alternate forum, here, there's a section near the top called Readme posts, which contains links to hints, tips, and advice.
There are 5 individual sections in there, and the one labelled Backup and Restore has a link to a post that explains how to restore a single model on a multi-core processor.


mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 35517 - Posted: 16 Nov 2008, 1:57:06 UTC
Last modified: 16 Nov 2008, 2:22:31 UTC

PeteB's method has been tried and tested by several members. I'd be tempted to try an easier solution: restore the whole folder to a single-core Windows machine and just run the one MH model there, keeping the 3 other models suspended because they'd still be running on the quad. I think that would work. You'd have to suspend the MH on the quad and get a replacement task for that core from another project. As you know how to look after your models, you could consider CPDN Beta.

If we knew whether Daboo's model has in fact become an iceworld, the decision would be easier. So far no reply to my PM.

If you do rerun the iceworld from backup, please let us know what happens to it.
JimMcCarthy_StellarSolns
Joined: 3 Sep 08
Posts: 23
Credit: 41,989,607
RAC: 2,734
Message 35544 - Posted: 18 Nov 2008, 20:12:15 UTC

Thanks Les for the link to the "alternate" forums. Using the selective backup instructions, over lunch yesterday I was able to restore the 'ice-world' model task to its pre-ice-world state (from 05-Nov-2008), and after restarting it, confirmed it was running OK and that it both looked and behaved normally. Checking it this morning, I discovered it had once again become an 'ice-world' and that computation had reached the same percent progress (94.595%) where CPU remaining "To Completion" had stopped decreasing before.

So I've aborted this model. Please let me know if there's any value in using my 05-Nov backup for this model "on the verge" of becoming an ice-world, for any diagnostic / investigation purposes at your end. I'd also be curious to know when/if you hear from Daboo what's become of his attempt to complete this same model task under Windows.

Thanks again for your help and advice. I'm now subscribed to the "news" thread (which is one of the "alternate" forums at the destination of the link that Les provided two postings up above), and will try to keep up with those alerts from this point forward.

-- Jim
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 35545 - Posted: 18 Nov 2008, 21:41:29 UTC

There's also a CPDN News thread on this forum at the top of Number Crunching.

I expect you're relieved to have proved that the model's at fault and the computer isn't unstable. Just delete that backup, Jim. In Oxford they already know that a small proportion of these models fail, but I don't think there are any current plans to tweak them. At least some of these iceworlds could be an inevitable consequence of the researchers using a very wide range of parameter values.

I'd also love to hear from the many members I've sent private messages to warning them about their actual or potential iceworlds. I've never received a single reply. One problem is that the forum software default setting is for members not to receive email notification of PMs. That's to protect people's privacy. Another problem is that most members rarely visit the forums and they post about problems even less frequently. So all we can do is go through the web page for the complete workunit to see what happens to the other models. Sometimes I've found that a member still apparently has a model stuck in an iceworld two months after my message.