Client Error/Computation Error - HADSMs

old_user392646
Joined: 24 Apr 06
Posts: 4
Credit: 8,984,280
RAC: 0
Message 38003 - Posted: 17 Sep 2009, 21:31:25 UTC

For the recent batch of HADSM models I have been getting the following messages:

17/09/2009 20:31:05 climateprediction.net Started upload of hadsm3fub_k4q9_006418923_1_1.zip
17/09/2009 20:31:06 climateprediction.net Computation for task hadsm3fub_k4q9_006418923_1 finished
17/09/2009 20:31:06 climateprediction.net Output file hadsm3fub_k4q9_006418923_1_2.zip for task hadsm3fub_k4q9_006418923_1 absent
17/09/2009 20:31:06 climateprediction.net Output file hadsm3fub_k4q9_006418923_1_3.zip for task hadsm3fub_k4q9_006418923_1 absent
17/09/2009 20:31:44 climateprediction.net Finished upload of hadsm3fub_k4q9_006418923_1_1.zip
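For anyone wanting to scan a longer BOINC message log for the same symptom, here is a minimal sketch that picks out the "Output file ... absent" lines and groups them by task. The function name `find_absent_outputs` and the `SAMPLE_LOG` variable are made up for illustration; the pattern is based only on the wording of the log lines quoted above, so other BOINC versions may word things differently.

```python
import re

# Matches lines such as:
#   "... Output file <file>.zip for task <task> absent"
# The wording is taken from the log excerpt in this thread (an assumption
# for any other BOINC client version).
ABSENT_RE = re.compile(r"Output file (?P<file>\S+) for task (?P<task>\S+) absent")


def find_absent_outputs(log_text):
    """Return a dict mapping task name -> list of output files reported absent."""
    missing = {}
    for line in log_text.splitlines():
        match = ABSENT_RE.search(line)
        if match:
            missing.setdefault(match.group("task"), []).append(match.group("file"))
    return missing


# The log lines quoted in this post, used as sample input.
SAMPLE_LOG = """\
17/09/2009 20:31:05 climateprediction.net Started upload of hadsm3fub_k4q9_006418923_1_1.zip
17/09/2009 20:31:06 climateprediction.net Computation for task hadsm3fub_k4q9_006418923_1 finished
17/09/2009 20:31:06 climateprediction.net Output file hadsm3fub_k4q9_006418923_1_2.zip for task hadsm3fub_k4q9_006418923_1 absent
17/09/2009 20:31:06 climateprediction.net Output file hadsm3fub_k4q9_006418923_1_3.zip for task hadsm3fub_k4q9_006418923_1 absent
17/09/2009 20:31:44 climateprediction.net Finished upload of hadsm3fub_k4q9_006418923_1_1.zip
"""

if __name__ == "__main__":
    for task, files in find_absent_outputs(SAMPLE_LOG).items():
        print(task, "is missing", ", ".join(files))
```

The pattern ignores the timestamp and project-name prefix, so it does not depend on the date format of the log.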


When I look at my account TASK information it indicates Client Error/Computation Error.

Any ideas why? HADSM3Ps seem to be running fine.

Regards

Coz

ID: 38003
DJStarfox
Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 38004 - Posted: 17 Sep 2009, 21:58:56 UTC - in response to Message 38003.  

Sphagc,
Would you post a link to the workunit that you're talking about? And also, which computer is this in your list of computers?
ID: 38004
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 38005 - Posted: 17 Sep 2009, 21:58:56 UTC

My guess is that you're interrupting the 3-phase slab models at the end of a phase and before the next phase has started. They don't like this!
There's LOTS of post processing at the end of each phase, which involves extracting data, consolidating it, and then zipping it for upload. Interrupt this and the files are history.

If a model has reached the end of a phase, wait until after the first trickle in the next phase before interrupting.


Backups: Here
ID: 38005
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 38006 - Posted: 17 Sep 2009, 22:23:45 UTC

It looks like you've had 7 errors right at the end of phase 1. As Les said, something appears to be happening to interrupt post-processing at that critical end-of-phase time. It seems unlikely that you would be manually interrupting each model at the time of failure since the failures occurred at 7 different times.

If I recall correctly, some executable other than the hadsm3 um process is called at post processing. Perhaps Vista, or an antivirus, or anti-malware application has locked this file that is only needed at that time? Ian/Thyme might have a better idea.
ID: 38006
old_user392646
Joined: 24 Apr 06
Posts: 4
Credit: 8,984,280
RAC: 0
Message 38007 - Posted: 18 Sep 2009, 13:26:04 UTC - in response to Message 38004.  

Sphagc,
Would you post a link to the workunit that you're talking about? And also, which computer is this in your list of computers?



http://climateapps2.oucs.ox.ac.uk/cpdnboinc/hosts_user.php?userid=392646
Computer which is showing problem:
996941 [tasks] Cozzie-VistaX64 home 4,198.64 88,812 GenuineIntel
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz [Intel64 Family 6 Model 15 Stepping 11] Microsoft Windows Vista
Ultimate x64 Edition, Service Pack 2, (06.00.6002.00) 18 Sep 2009 12:41:42 UTC

Machine is left running 24/7 and I only reboot after Microsoft Updates (making sure I close down BOINC before shutdown).

Tasks with Errors.
Task ID | Work unit ID | Sent | Time reported | Server state | Outcome | Client state | CPU time (sec) | Claimed credit | Granted credit
9938307 | 6649750 | 9 Sep 2009 20:00:32 UTC | 15 Sep 2009 17:51:07 UTC | Over | Client error | Compute error | 364,086.40 | 2,282.60 | 2,282.60
9894197 | 6645339 | 2 Sep 2009 19:15:06 UTC | 7 Sep 2009 19:51:26 UTC | Over | Client error | Compute error | 417,807.50 | 2,282.60 | 2,282.60
9891457 | 6645065 | 11 Sep 2009 15:06:47 UTC | 16 Sep 2009 10:13:09 UTC | Over | Client error | Compute error | 368,213.10 | 2,282.60 | 2,282.60
9826402 | 6638561 | 6 Sep 2009 16:39:28 UTC | 11 Sep 2009 15:06:47 UTC | Over | Client error | Compute error | 400,210.60 | 2,282.60 | 2,282.60
9811750 | 6637096 | 12 Sep 2009 20:35:18 UTC | 17 Sep 2009 19:32:18 UTC | Over | Client error | Compute error | 392,664.40 | 2,282.60 | 2,282.60
9752529 | 6631174 | 7 Sep 2009 19:53:02 UTC | 12 Sep 2009 20:35:18 UTC | Over | Client error | Compute error | 393,902.40 | 2,282.60 | 2,282.60
9618960 | 6597657 | 9 Sep 2009 17:06:57 UTC | 15 Sep 2009 19:57:11 UTC | Over | Client error | Compute error | 378,376.20 | 2,282.60 | 2,282.60

NB. Everything else seems to be working fine with the shorter HADSM3Ps. I am doing nothing different with them, and I have not had a problem with the longer ones before.

Many thanks for your help

Coz.
ID: 38007
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 38008 - Posted: 18 Sep 2009, 15:13:55 UTC

@sphagc

Are there any differences in setup between that PC and your other Windows PCs that are successfully running hadsm3 type models? Different antivirus? Different antimalware program? Different firewalls?
ID: 38008
DJStarfox
Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 38012 - Posted: 20 Sep 2009, 4:57:50 UTC - in response to Message 38007.  

Seems like a file permissions problem. Reset security on all files in your BOINC data/projects directory. Could also be Vista security.... The climate applications need to be able to spawn themselves and their post-processing items. Without this execute permission, the task will fail. I know there's a Windows Defender or Vista Security something-or-other or perhaps virus protection that might be preventing this.

Other than that, afraid I can't be much help with Vista....
ID: 38012
Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 38018 - Posted: 22 Sep 2009, 14:29:26 UTC - in response to Message 38006.  
Last modified: 22 Sep 2009, 14:30:07 UTC

If I recall correctly, some executable other than the hadsm3 um process is called at post processing.

The se process is indeed the problem. All of the HadSM3 tasks are failing with the same error, namely

Could not launch smallexecs process. Last Error=5

(e.g. click the '+' by stderr out for task id 9938307).

Check that projects/climateprediction.net in your BOINC data directory contains the file hadsm3_se_6.07_windows_intelx86.zip (1,958,740 bytes) and that it has been unzipped to hadsm3_se_6.07_windows_intelx86.exe (2,212,352 bytes, modification time 12:11:16 on 21 August 2008).
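The file check described above can be scripted. This is only a sketch: the file names and byte sizes are the ones quoted in this post, while the function name `check_se_files` and the example directory below are assumptions for illustration (the BOINC data directory varies by installation). It checks presence and size only, not the modification time or permissions.

```python
from pathlib import Path

# File names and sizes as quoted in the post above.
EXPECTED_SIZES = {
    "hadsm3_se_6.07_windows_intelx86.zip": 1_958_740,
    "hadsm3_se_6.07_windows_intelx86.exe": 2_212_352,
}


def check_se_files(project_dir):
    """Return a list of problems found; an empty list means both files
    exist and have the expected sizes."""
    problems = []
    for name, expected_size in EXPECTED_SIZES.items():
        path = Path(project_dir) / name
        if not path.is_file():
            problems.append(f"{name}: missing")
            continue
        actual_size = path.stat().st_size
        if actual_size != expected_size:
            problems.append(
                f"{name}: size {actual_size} bytes, expected {expected_size}"
            )
    return problems
```

It could be called as `check_se_files(r"C:\ProgramData\BOINC\projects\climateprediction.net")`, where that path is a guess at a typical Vista data directory and should be replaced with the real one. A clean result does not rule out the "Access denied" case, since a file can be present with the right size and still be locked or blocked from executing.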
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 38018
Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 38019 - Posted: 23 Sep 2009, 7:50:43 UTC - in response to Message 38018.  

Could not launch smallexecs process. Last Error=5

A further thought about that message. Error number 5 is "Access denied" so the cause could be file permissions or locking.
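As a small illustration of reading that number, the sketch below pulls the code out of a "Last Error=N" line and looks it up in a short table of standard Windows system error codes. It assumes the number in the message is a Win32 `GetLastError` value, which the "Access denied" reading above implies but the thread does not confirm. The function name `describe_last_error` is made up for this example, and the table is deliberately tiny.

```python
import re

# A few standard Windows system error codes (names as in winerror.h).
WIN32_ERRORS = {
    2: "ERROR_FILE_NOT_FOUND",
    3: "ERROR_PATH_NOT_FOUND",
    5: "ERROR_ACCESS_DENIED",
    32: "ERROR_SHARING_VIOLATION",
    33: "ERROR_LOCK_VIOLATION",
}


def describe_last_error(message):
    """Return (code, name) for a 'Last Error=N' message, or None if the
    message contains no such code. Codes outside the table are reported
    with the name 'unknown'."""
    match = re.search(r"Last Error=(\d+)", message)
    if match is None:
        return None
    code = int(match.group(1))
    return code, WIN32_ERRORS.get(code, "unknown")


if __name__ == "__main__":
    print(describe_last_error("Could not launch smallexecs process. Last Error=5"))
```

Codes 32 and 33 are included because they are the usual results when another program holds a file open or locked, which is the other possibility raised in this thread.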
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 38019
old_user392646
Joined: 24 Apr 06
Posts: 4
Credit: 8,984,280
RAC: 0
Message 38020 - Posted: 23 Sep 2009, 8:27:14 UTC - in response to Message 38019.  

Could not launch smallexecs process. Last Error=5

A further thought about that message. Error number 5 is "Access denied" so the cause could be file permissions or locking.



Thanks for all the replies. I have checked, and both the exe and zip files are present, with all permissions set correctly as far as I can see.

The two quad-core systems are both running Vista X64 Ultimate with Spyware Doctor for malware detection, but the problem system has Kaspersky Internet Security 2009 running, whilst the other has Kaspersky Anti-Virus 6 for Workstations. File permissions etc. have been set identically, unless the security suite has something extra I have missed, although previous HADSMs have caused no problems.

Anyway, thanks everyone for the messages. I will keep an eye on the systems and report back if I spot any further problems.

Regards

Coz.
ID: 38020
old_user568298
Joined: 20 May 09
Posts: 1
Credit: 36,702
RAC: 0
Message 38321 - Posted: 17 Nov 2009, 13:03:39 UTC

Well... Wish I could figure out why, but I've had far too many compute errors running cpdn tasks and far too much frustration like this one: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6693008 where I've burned hundreds of thousands of compute seconds only to have it punt and get but a fraction of the credit. And judging from the above result, I'm not the only one experiencing this type of failure. Perhaps my computer isn't up to the demand, but I don't believe that explains it. I've run Aqua Multithread for hundreds of hours without error, and I've got Folding running on both GPUs daily with nary a problem, all while getting my normal work done. And other BOINC projects crunch along happily side by side with cpdn while it "face-plants" yet again. Ah well... I gave it a go. That should count for something I guess...
ID: 38321
DJStarfox
Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 38371 - Posted: 23 Nov 2009, 5:07:33 UTC

22-Nov-2009 13:59:05 [climateprediction.net] Computation for task hadsm3mh_kv40_006489252_4 finished
22-Nov-2009 13:59:05 [climateprediction.net] Output file hadsm3mh_kv40_006489252_4_2.zip for task hadsm3mh_kv40_006489252_4 absent
22-Nov-2009 13:59:05 [climateprediction.net] Output file hadsm3mh_kv40_006489252_4_3.zip for task hadsm3mh_kv40_006489252_4 absent
22-Nov-2009 13:59:05 [climateprediction.net] Output file hadsm3mh_kv40_006489252_4_4.zip for task hadsm3mh_kv40_006489252_4 absent

I have a few (but not all) HadSM_MH models that crash around timestep 260,000. Not sure why, as some of the MH models do finish properly, although not lately.
Good one:
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=10543047

Bad ones:
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=10531431
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=9374407
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=9362859
ID: 38371
old_user540633
Joined: 7 Oct 08
Posts: 7
Credit: 165,698
RAC: 0
Message 38386 - Posted: 24 Nov 2009, 20:53:12 UTC

I know this WU failed because BOINC switched projects while it was trying to do post processing:

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=9616085

However this WU failed without any reason I can find just yet:

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=10415116

My Primegrid WUs around the time were unaffected, which rules out processor problems, and the NFS WU in memory survived, which rules out a lack of available memory (since NFS is very sensitive to memory issues). None of the other 15 projects showed any issues whatsoever, just the CPDN WU. It had jumped to 100% sometime while I was gone, but was still 'Waiting to Run'. I caught it before it restarted and changed the 'waiting' to 'computer error'. The graphics listed it as being at only 71% (despite the 100% given in the BOINC manager) and the temps had gone blue.
~It only takes one bottle cap moving at 23,000 mph to ruin your whole day~

ID: 38386
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 38387 - Posted: 24 Nov 2009, 21:07:56 UTC

If the temperatures were blue, then either the model hadn't run long enough to generate the data needed by the graphics package to show the correct colours (blue is the default colour immediately on starting, and before sufficient data has been crunched), or it had turned into an 'iceworld'.
Iceworld description here, discussion here, and appeal for data here. The latter only applies to people who take regular backups, and are prepared to do some extra work.


Backups: Here
ID: 38387