Hardware problem or just bad luck?

wavydave

Joined: 24 Sep 04
Posts: 3
Credit: 687,503
RAC: 0
Message 34077 - Posted: 16 Jun 2008, 18:25:54 UTC

Hi,

For the best part of a year now, I've been having plenty of success finishing CP models, but over the last month I've hit a run of crashes - the latest is 6210836.

They all seem to give error 22 - the only info I can find suggests that this isn't the initial error, it's the result of a previous error. I'm running Vista and have seen hints that write permissions problems can happen with Vista when Boinc is installed to the default directory, so on Saturday (14th), I uninstalled and reinstalled to C:\Boinc to try and rule that out. Since then I've had two more models crash, so I don't think it's that!

Can anyone shed any light as to what's suddenly going wrong? I've set CP to get no more work for now until I've got some sort of idea as to what's going on.

Would appreciate some help if possible - thanks in advance.

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 34078 - Posted: 16 Jun 2008, 20:09:06 UTC

See Mike's advice; your recent errors are the same:
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=6062&nowrap=true#33855
Do you exit boinc before shutting down? (Crucial with Vista.)

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 34081 - Posted: 16 Jun 2008, 21:40:20 UTC
Last modified: 16 Jun 2008, 21:57:50 UTC

Hi Wavydave, welcome to the forum.

I think that error 22 is a CPDN or BOINC code (not sure which), but it generates an error message that would be true if it were a Windows error code - that the device (= hard drive) cannot recognize the command. But as far as I know it isn't a Windows error. So I think we have to ignore this part of the error messages, which is useless for diagnosis. The error messages can be seen by clicking the + beside stderrout on each model's page.

A lot of the crashed models are HADAMs and some have fortunately crunched long enough to produce graphs. The graphs are most interesting. Have a look first at the 3 time series graphs of these HADAMs I crunched successfully; each time you need to click on 'Timeseries' to open that graph up:

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7130566
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7127338

Or the 3 graphs of a HADAM crunched by Astro; he's crunched others that produced similar graphs:

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7886790

Dave, here are the graphs of your crashed HADAMs that progressed far enough to generate them:

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7880026
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7884873
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7925389

There is something wrong with how your computer's processed the data for these crashed models. It's very rare for HADAM models to be faulty. Almost all of them should process and complete without inherent errors.

But here's one of your successful HADAMs completed earlier on the same computer. The 3 graphs were all normal:

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7167724

So I think something went wrong on your computer about a month ago. It seems to have become unstable. What changed on this computer in mid-May? Did you overclock it, change the RAM settings, add any new parts, install any new software? Do you know whether it's overheating?
Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 34083 - Posted: 16 Jun 2008, 23:24:16 UTC - in response to Message 34081.  

... What changed on this computer in mid-May? Did you overclock it, change the RAM settings, add any new parts, install any new software? ...

... that's about Vista SP1 time, but it doesn't seem to have caused problems elsewhere.
wavydave

Joined: 24 Sep 04
Posts: 3
Credit: 687,503
RAC: 0
Message 34087 - Posted: 17 Jun 2008, 8:22:07 UTC

Thanks for the replies and the welcome...

I can rule out anything related to Windows shutdown as my PC is only rebooted once every week or two. The models always seem to crash when I'm not using the PC for anything else, so I'd be surprised if it was overheating but I'll check that when I get home. I'm running other projects as well (Seti, Cosmology and Einstein) and don't get any errors with those WUs, so it's purely affecting CP. Hadn't noticed the graphs until you pointed them out - there's something going badly wrong with the calculations!

I've not changed any hardware recently and the only software change I can think of around a month ago was the release of the new SSE3 optimized Seti client (I installed Vista SP1 quite a bit earlier). I've heard that this can be a bit of a resource hog - could it be that CP is being starved of resources if Seti is running at the same time (dual-core CPU) and falls over? I could revert to the stock Seti application for a while, or configure Boinc to only use one CPU if resource problems are suspected.

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 34088 - Posted: 17 Jun 2008, 12:12:14 UTC
Last modified: 17 Jun 2008, 12:32:05 UTC

When you do reboot make sure you've fully exited from BOINC and don't start the shutdown process until the BOINC icon's disappeared.

Seti could be the culprit. Have you got an optimised version of BOINC or standard BOINC-from-Berkeley? Or is it some Seti tasks that are specially optimised to run faster?

Seti tasks in my experience usually have a fairly generous time to completion. You could try suspending Seti in the Projects tab for a couple of days (for example) and letting CPDN run alongside your other projects. Then for the next couple of days suspend the CPDN project and let Seti run. See whether your climate model then runs stably.

But if you have an optimised version of BOINC this might not solve the problem. You'll have to try running Seti and CPDN alternately and tell us what happens.

Edit: I've just looked at your computer details again. You have Vista. Remember that Vista doesn't like BOINC to be installed in the default location, which is C:\Program Files\BOINC. People with Vista need to move the BOINC folder to C:\BOINC.

The computer has 2GB RAM. Vista is very memory-hungry and HADAM models need nearly 1GB memory at peak processing moments if I remember correctly. I've run two HADAMs side-by-side on my Core2 Duo which has 2GB RAM, but I have XP. I think that if you want to run two climate models side-by-side at any time you should select HADSM for the other model. HADSM models need much less memory. We can choose the next type of model we download in our CPDN preferences in our accounts.
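To make the memory arithmetic above concrete, here is a rough sketch in Python. Only the 2GB of installed RAM and the "nearly 1GB" HADAM peak come from the post; the amount reserved for Vista and background software is a hypothetical figure chosen purely for illustration, and the function name is made up for this example.

```python
# Rough memory budget for running climate models side by side.
# Only the 2 GB total and the ~1 GB HADAM peak come from the post above;
# the OS reservation is a hypothetical, illustrative figure.

def models_that_fit(total_mb: int, os_reserved_mb: int, model_peak_mb: int) -> int:
    """Return how many models fit in RAM at their peak without paging."""
    available = total_mb - os_reserved_mb
    if available <= 0 or model_peak_mb <= 0:
        return 0
    return available // model_peak_mb

TOTAL_MB = 2048          # 2 GB installed
HADAM_PEAK_MB = 1000     # "nearly 1GB" at peak, per the post
VISTA_RESERVED_MB = 800  # hypothetical: Vista plus background software

print(models_that_fit(TOTAL_MB, VISTA_RESERVED_MB, HADAM_PEAK_MB))
```

Under that assumed reservation only one HADAM fits at its peak, so a second one would push the machine into paging. Measure the real figures in Task Manager before drawing conclusions.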

If you get a HADSM, don't interrupt the model while it's creating any of the 3 end of phase zip files after each 15 model years. Then wait until at least one trickle after the zip file upload before exiting from BOINC. The zip files are created during the first few days of December. This was probably why your last HADSM crashed after its 24th trickle which means at the end of a phase. It sent its trickle but didn't send a zip file to the server. That's why it hasn't got any Phase 1 graphs.
[B^S] mavau

Joined: 30 Aug 04
Posts: 142
Credit: 9,936,132
RAC: 0
Message 34089 - Posted: 17 Jun 2008, 18:44:38 UTC

Following a second HadCM3 error 22 with negative theta, I'm now running 2 HADAM models for the first time. Inspiron 9400 laptop, Core 2 Duo 2GHz, Vista SP1, 2GB RAM.
I'll let you know if there's any problem.
Though so far the only issue I've had with HADAM is the drop in credit.
btw, I've been looking for recent info on the negative theta issue. My first one was in April. My understanding was it was a bad model.
Since then I've completed a control model, but then got that second error.
On the 22 error: it is a BOINC error, but you can find more information at the end of the error messages. You can check my latest one here (error 22 at the top and the negative thetas at the bottom).
Finally, regarding Vista. I've also suffered the problems mentioned in the past:
not waiting long enough for things to complete. I've been extra careful since and I've noticed prolonged disk activity when shutting down BOINC with a HADAM model.
The Program Files directory issue has to do with permissions. What happens is that it won't let BOINC run on startup. I just activate it manually on the rare occasions I shut the computer down.


astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 34090 - Posted: 17 Jun 2008, 19:49:19 UTC
Last modified: 17 Jun 2008, 20:01:26 UTC

mavau,

For what it's worth: HadAM3 seems harder on HDD than CM or SM Models.

A laptop* was recently added to my mix and the first two Models were HadAM3. The Core2Duo CPU temperatures are reasonable but the HDD temperatures are not. It's a concern. (The machine is elevated more than an additional cm to aid cooling.)

CPU 51 vs 57C (depends on whether the fan is running; 20C ambient temperature.)
HDD 47+C with a pair of HadAM3 Models, "only" 43 or 44C with a pair of Spinups.
That compares with 31C on a 10,000 RPM HDD on a Quad Q9300 running 4 Spinups. (All temperatures reported by HWMonitor.)

I expected heat from a laptop CPU but the HDD temperature, especially with HadAM3, was a bit of a shock.

[* Acer Extensa 5620, 3GB, C2D T5500 (1.83 GHz), Vista Home SP1.]
Edited for typo.
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 34099 - Posted: 18 Jun 2008, 20:46:17 UTC
Last modified: 18 Jun 2008, 20:48:12 UTC

The higher HDD temperature with HadAM3 doesn't surprise me. Open Task Manager and add the page faults delta (i.e. number of times data had to be read from or written to the disk since the last update) column to the display. The other applications very rarely page fault; HadAM3 does lots of it all of the time.
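For anyone unsure what the delta column shows, here is a minimal sketch in Python: it is simply the difference between successive readings of a cumulative page-fault counter. The sample counts are hypothetical, not measurements from any CPDN application, and the function name is invented for this example. Note also that the counter can include soft faults that are satisfied from memory rather than disk, so a high delta is a hint of disk activity, not proof.

```python
def page_fault_deltas(samples):
    """Turn cumulative page-fault counts into per-interval deltas."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]

# Hypothetical cumulative counts, one reading per display refresh.
quiet_app = [1200, 1200, 1201, 1201]
busy_app = [50000, 53500, 57200, 60900]

print(page_fault_deltas(quiet_app))
print(page_fault_deltas(busy_app))
```

An application whose deltas stay near zero is barely paging; one that shows thousands per refresh, as in the made-up second series, is the pattern described above for HadAM3.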
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
[B^S] mavau

Joined: 30 Aug 04
Posts: 142
Credit: 9,936,132
RAC: 0
Message 34216 - Posted: 4 Jul 2008, 13:02:00 UTC - in response to Message 34099.  

Re Inspiron 9400
My models completed successfully, but I noticed CPU throttling events in Event Viewer.
Using Christian Diefer's fan control GUI I saw CPU temperature going up to 90°C, at which point the Intel utility throttled back.
Well, time for some cleaning up after a year and a half's crunching 24/7 mostly. You will find all the necessary information on this page.
It's unfortunate you have to take the whole laptop apart to get at the grids behind the fans, which is where the crud builds up.
It took me a careful hour's work.
The result: temperature down to under 60°C, fan running slow and noiselessly most of the time.
One final note: since you have to disconnect the BIOS battery, you have to enter setup to reset the clock on startup.

