climateprediction.net home page
Full Resolution Ocean 6.07 Model failing with computation error

Roland

Joined: 22 Mar 06
Posts: 3
Credit: 29,860
RAC: 0
Message 50606 - Posted: 24 Oct 2014, 16:36:07 UTC

I'm a newbie to climateprediction.net. BOINC has downloaded about 10 models for the UK Met Office Coupled Model Full Resolution Ocean 6.07, but every one of them has failed with "Computation error" status after a few hours, at a low percentage complete (usually 2-3%). Even without looking I can tell when one has failed because the fan on my late 2009 iMac Core i7 spins up. This seems to be down to the failed model sharing one core 50/50 with kernel_task and not responding to anything other than quitting and restarting BOINC. On the other hand, I have 3 HadCM3 short 7.24 models running which look set to complete (>60% so far). I have "keep in memory" selected.

Any ideas? I've disabled that model for future downloads which is a sort of workaround, but not a cure.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50607 - Posted: 24 Oct 2014, 19:10:15 UTC - in response to Message 50606.  

Hello Roland

There's a thread about this here in the Number crunching forum.

There's also a repost of a message from the researcher for this experiment, at the top of the Science section.

Roland

Joined: 22 Mar 06
Posts: 3
Credit: 29,860
RAC: 0
Message 50744 - Posted: 6 Nov 2014, 16:45:34 UTC - in response to Message 50607.  

Thanks Les,

I couldn't live with the graceless way that the full resolution model failed - effectively taking a processor core with it each time, and leaving gigabytes of orphan data behind. I disabled all except the HadCM3 short model for future downloads. I've had more than 20 of these download, run through to 100% (or nearly - I haven't actually managed to catch one in the act yet) and upload. However, looking at the 'my tasks' list on the web site, every single one of them is showing status "Error while computing". I'm getting credit for the uploads / trickles, but I'm not sure I'm doing much good if all the models are failing. At least they fail gracefully with no collateral damage, but it is still disturbing...
Any ideas? With a 100% failure rate, I'm not sure the cause can be the design of the models.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50745 - Posted: 6 Nov 2014, 20:02:28 UTC - in response to Message 50744.  

The short models also have a thread in Number crunching, here.

There's another short thread about them here.

As you can see from these two, that model type, at least with the original batches, didn't like Windows systems. Not sure about the latest lot.
And I've not kept track of any Mac problems, so you'll have to read through the whole thread to see.

The Number crunching section is usually where all the activity is when there's a failure of some sort with models.

Profile Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1079
Credit: 6,904,049
RAC: 6,657
Message 50746 - Posted: 6 Nov 2014, 23:03:05 UTC

My experience, based on a single rather old Mac, is that no application with v7 in the version number will work correctly: each misbehaves in a different way. I have not, however, done a search for Macs that have completed without "error 9" - there may be some.

The uploads for HADCM3S from your Mac should be fine. That means that if the scientific post-processing does not take into account the model completion status then those uploads are as good as any other. Since the HADCM3S is catastrophically error-prone, your model may be the only one to produce anything, in which case it is again useful.

However, if you look at the work units for the models on your machine, you will see that the error return causes more models to be issued - the work is repeated. If you're concerned about waste then that might be a negative factor (though running duplicates makes CPDN in this instance more like other BOINC projects).

I don't run HADCM3S, so the question of whether it's worthwhile doesn't arise for me, but as a convenience I have deselected all v7 models on that Mac, which is now happily running a mix of v6 HADAM3P ANZ and HADCM3N models. You can see the application version numbers on the applications page.
Roland

Joined: 22 Mar 06
Posts: 3
Credit: 29,860
RAC: 0
Message 50750 - Posted: 7 Nov 2014, 18:36:17 UTC
Last modified: 7 Nov 2014, 18:36:55 UTC

What helpful people there are here!
I had a look through the last 30 work units I have been assigned (see computer 1342365), and the vast majority of them had been attempted four or five times previously by other computers with diverse operating systems and had always failed with "Error while computing". Only one of the 30 has subsequently run to completion on another computer. I've tried switching to the two models suggested by Iain Inglis and I'll see how that goes. I can't quite get my head around a simulation design that is known to fail on most computers most of the time - surely there is a better way?!
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50751 - Posted: 7 Nov 2014, 19:24:21 UTC - in response to Message 50750.  

It depends on the reason for the failure. Some are computer problems, some are model problems.
As the program is heavily dependent on values from the floating point unit (FPU) of the processor, and each brand has its own copyrighted way of doing the maths, different processors produce slightly different results. Those differences may push the model into an unstable state and cause it to be deemed unviable, at which point the part of the program that checks for this stops the model.

And the researcher DID say that he was pushing the forcing values into areas that would probably cause a lot of failures, so that he could examine what happened "around the edges of the stable zone".

And just in case you're not aware: the aim of CPDN is NOT to predict future weather/climate by running all models from start to finish.
Rather, certain forcing elements are chosen, loaded with certain values, and then the model is run to see what happens to it. And the next one in the batch will have slightly different values in some of those elements, to see what happens to IT.

So there is a spread of starting values used, with the hope that some will fail, and some will succeed, thereby exposing the "edge" of whatever is being investigated.
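
The idea of a spread of starting values can be sketched in a few lines of Python. This is a toy under stated assumptions, not the project's software: `run_toy_model`, `find_stability_edge`, the `forcing` parameter and the blow-up `limit` are all hypothetical, and the one-line "model" is chosen only because it becomes unstable once the forcing passes 2.

```python
def run_toy_model(forcing, steps=500, limit=1e6):
    """Run a toy model; return True if it stays bounded for all steps."""
    x = 1.0
    for _ in range(steps):
        x = x - forcing * x      # toy update: grows without bound if forcing > 2
        if abs(x) > limit:       # stand-in for the model's stability check
            return False
    return True


def find_stability_edge(forcings):
    """Return (largest forcing that survived, smallest forcing that failed)."""
    survived = [f for f in forcings if run_toy_model(f)]
    failed = [f for f in forcings if not run_toy_model(f)]
    return max(survived), min(failed)


# One "batch": the same model with a slightly different forcing each time.
batch = [1.6, 1.8, 1.9, 2.1, 2.2, 2.4]
edge = find_stability_edge(batch)
print("edge of the stable zone lies between", edge[0], "and", edge[1])
```

In this toy the failed runs are as informative as the successful ones: without them there would be no upper bracket on where the stable zone ends.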




©2024 climateprediction.net