Thread 'Late November batch of Windows work'

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 52980 - Posted: 29 Nov 2015, 22:48:00 UTC

OK, getting replies back, so several things:

The following is mostly intended for the mods, but I've included it all:
As the number of failures appeared to steadily be increasing for batch 238 (wah2_ri version 7.08) I have now removed the remaining workunits from this batch from the queue. However I am still very interested in how the workunits that have already been taken up progress. If you have any of these workunits running on your machines and are able to capture the working directories or .out files this would be very useful to us. Also if you have other information on the failures - when they are failing / how long they ran for this would also be helpful. Similarly if you have information on these workunits running successfully that would also be of benefit.


Part of a separate email:
This is the latest version of the newest incarnation of the W@H application: region-independent, start-date-independent and length-independent, with the latest land cover model.


This latest batch is a submission for the MARiUS project, details of which are on the site.



Kevin
Joined: 5 Jul 09
Posts: 63
Credit: 6,091,274
RAC: 0
Message 52985 - Posted: 30 Nov 2015, 10:19:30 UTC - in response to Message 52980.  

OK, getting replies back, so several things:

The following is mostly intended for the mods, but I've included it all:
As the number of failures appeared to steadily be increasing for batch 238 (wah2_ri version 7.08) I have now removed the remaining workunits from this batch from the queue. However I am still very interested in how the workunits that have already been taken up progress. If you have any of these workunits running on your machines and are able to capture the working directories or .out files this would be very useful to us. Also if you have other information on the failures - when they are failing / how long they ran for this would also be helpful. Similarly if you have information on these workunits running successfully that would also be of benefit.




Just for those who have not done this before (like me):

I have two of these, one of which has just started running. It failed on a previous machine.

Where and when are the files present, and where would you like them sent? I am assuming this is only for tasks that fail.

Thank you.

Kevin

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 52986 - Posted: 30 Nov 2015, 10:34:59 UTC - in response to Message 52985.  

Information about speed of progress and size of zip files when created may also be of interest.

Alan K
Joined: 22 Feb 06
Posts: 492
Credit: 31,516,485
RAC: 14,727
Message 52991 - Posted: 30 Nov 2015, 14:01:59 UTC - in response to Message 52986.  

Currently running four on my i5 machine: one repeat (computing error after 32 secs!) and three new ones. Up to about 7% after 12 hours.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 52992 - Posted: 30 Nov 2015, 15:36:28 UTC - in response to Message 52985.  

Kevin, the files are in ..../BOINC/projects/climateprediction.net, and the main files of interest are those ending with _x.zip, where x is an integer between 1 and 13. If things work normally, these get transferred to the relevant server automatically, but I am currently 58% of the way through transferring about 1 GB of files to Andy for the beta site, which is still down.
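
For anyone wanting to check which of those zips are present, here is a minimal Python sketch. It assumes you pass in the project folder yourself, since the full BOINC data path depends on the installation. The names `find_upload_zips` and `ZIP_PATTERN` are hypothetical, made up for this illustration, and are not part of BOINC or the CPDN application.

```python
import re
from pathlib import Path

# Hypothetical pattern: file names ending in _x.zip, where x is a number.
ZIP_PATTERN = re.compile(r"_(\d+)\.zip$")


def find_upload_zips(project_dir):
    """Return (task_prefix, index, path) for files ending in _x.zip, 1 <= x <= 13."""
    found = []
    for path in Path(project_dir).iterdir():
        match = ZIP_PATTERN.search(path.name)
        if not (match and path.is_file()):
            continue
        index = int(match.group(1))
        # Keep only the range mentioned in the post (1 to 13 inclusive).
        if 1 <= index <= 13:
            task_prefix = path.name[: match.start()]
            found.append((task_prefix, index, path))
    # Sort numerically by index so that _2.zip comes before _10.zip.
    return sorted(found, key=lambda item: (item[0], item[1]))
```

Calling `find_upload_zips(...)` with the climateprediction.net project folder returns the matching files grouped by task name, which makes it easy to see how many uploads a task has produced so far.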

Kevin
Joined: 5 Jul 09
Posts: 63
Credit: 6,091,274
RAC: 0
Message 52993 - Posted: 30 Nov 2015, 15:59:17 UTC - in response to Message 52992.  

Dave, OK thanks, mine is running fine so far. If I need it I will shout for the relevant address.

Kevin

ed2353
Joined: 15 Feb 06
Posts: 137
Credit: 35,517,114
RAC: 10,523
Message 52994 - Posted: 30 Nov 2015, 16:37:58 UTC - in response to Message 52986.  

I have 6 of these tasks running on my i7 computer.
The estimated time for the tasks was 122 hours, but the first task took 19 hours to reach 9%, so they will probably take just over 200 hours.
First trickle at about 9%:
Time Sent (UTC)	Host ID	Result ID	Result Name	Phase	Timestep	CPU Time (sec)	Average (sec/TS)
30 Nov 2015 16:03:43	1305473	19108850	wah2_eu25_h9ig_197912_12_010206730_0	1	11,819	64,677	5.4723

I have 2 more tasks at 8.4%, so they should send a first trickle soon.
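
The "just over 200 hours" figure follows from a simple linear extrapolation of the numbers in this post. A small Python sketch of that arithmetic, assuming progress stays at a constant rate for the whole run (which is an assumption, not something the project guarantees); `projected_total_hours` is a name invented for this example:

```python
def projected_total_hours(elapsed_hours, fraction_done):
    """Linear extrapolation: total run time if progress continues at the same rate."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in the range (0, 1]")
    return elapsed_hours / fraction_done


# 9% done after 19 hours, as reported above.
print(f"{projected_total_hours(19, 0.09):.0f} hours")  # prints "211 hours"
```

That is well above the 122 hours BOINC estimated, so the initial estimate looks optimistic for this machine.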

ed2353
Joined: 15 Feb 06
Posts: 137
Credit: 35,517,114
RAC: 10,523
Message 52995 - Posted: 30 Nov 2015, 17:05:59 UTC - in response to Message 52994.  

The next 2 tasks sent their first trickle at 8.6%, 18 hours 45 mins.
Time Sent (UTC)	Host ID	Result ID	Result Name	Phase	Timestep	CPU Time (sec)	Average (sec/TS)
30 Nov 2015 17:03:50	1305473	19110047	wah2_eu25_i7gm_198712_12_010207898_0	1	11,819	65,809	5.5681
30 Nov 2015 17:03:50	1305473	19107829	wah2_eu25_h3bh_197312_12_010205732_0	1	11,819	65,614	5.5516

Hope this info is of some help.
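
The "Average (sec/TS)" column in these trickle rows appears to be just CPU time divided by timestep. A minimal Python sketch that parses one pasted row and re-derives the figure; the function names are hypothetical, and the parser assumes the last four whitespace-separated fields are phase, timestep, CPU time and average, as in the rows above:

```python
def parse_trickle_row(row):
    """Parse one trickle row as pasted from the website into a dict."""
    fields = row.split()
    phase, timestep, cpu_time, average = fields[-4:]
    return {
        "result_name": fields[-5],
        "phase": int(phase),
        # Numbers are shown with thousands separators, e.g. "11,819".
        "timestep": int(timestep.replace(",", "")),
        "cpu_time_sec": int(cpu_time.replace(",", "")),
        "reported_sec_per_ts": float(average),
    }


def sec_per_timestep(trickle):
    """Recompute the average seconds per timestep from the raw fields."""
    return trickle["cpu_time_sec"] / trickle["timestep"]


row = ("30 Nov 2015 17:03:50 1305473 19110047 "
       "wah2_eu25_i7gm_198712_12_010207898_0 1 11,819 65,809 5.5681")
trickle = parse_trickle_row(row)
print(round(sec_per_timestep(trickle), 4), trickle["reported_sec_per_ts"])
```

For this row, 65,809 / 11,819 rounds to the reported 5.5681, so the recomputed and reported values agree.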

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 52999 - Posted: 30 Nov 2015, 20:20:23 UTC

And they're back.
Server Status page

KWSN - Sir Frank of the Wood
Joined: 3 Nov 10
Posts: 39
Credit: 2,494,427
RAC: 0
Message 53002 - Posted: 1 Dec 2015, 10:14:16 UTC

hello les

my machine just attempted to download 4 of the PNW workunits...all 4 failed on download...

frank

Alan K
Joined: 22 Feb 06
Posts: 492
Credit: 31,516,485
RAC: 14,727
Message 53003 - Posted: 1 Dec 2015, 15:50:55 UTC - in response to Message 52991.  

These units have sent two trickles each so far. Taking about 4.8s/ts (i5, 3.5GHz).

KWSN - Sir Frank of the Wood
Joined: 3 Nov 10
Posts: 39
Credit: 2,494,427
RAC: 0
Message 53004 - Posted: 1 Dec 2015, 19:19:08 UTC
Last modified: 1 Dec 2015, 19:21:08 UTC

well, another download attempt was more successful...4 out of 5 workunits are now running...the 5th failed the download...all are PNW units

frank

Trotador
Joined: 21 Aug 11
Posts: 10
Credit: 26,553,404
RAC: 1,491
Message 53005 - Posted: 1 Dec 2015, 21:42:21 UTC

Data from downloaded and running units:

trickles:
Latest Trickles Received
Result ID	Result Name	Phase	Timestep	CPU Time (sec)	Average (sec/TS)
19107519	wah2_eu25_h1ao_197112_12_010205426_0	1	23,339	161,793	6.9323
19107519	wah2_eu25_h1ao_197112_12_010205426_0	1	11,819	82,173	6.9526

Latest Trickles Received
Result ID 	Result Name 	Phase 	Timestep 	CPU Time (sec) 	Average (sec/TS)
19107486 	wah2_eu25_h0ik_197012_12_010205393_0 	1 	23,339 	161,913 	6.9374
19107486 	wah2_eu25_h0ik_197012_12_010205393_0 	1 	11,819 	82,149 	6.9506

Latest Trickles Received
Result ID 	Result Name 	Phase 	Timestep 	CPU Time (sec) 	Average (sec/TS)
19107361 	wah2_eu25_f2bd_195212_12_010202599_1 	1 	23,339 	161,386 	6.9149
19107361 	wah2_eu25_f2bd_195212_12_010202599_1 	1 	11,819 	82,047 	6.9420


zip file sizes:

16.407 wah2_eu25_j1ic_199112_12_010208514.zip
15.844 wah2_eu25_c0dm_192012_12_010197870.zip
15.631 wah2_eu25_f2bd_195212_12_010202599.zip
16.224 wah2_eu25_g2hd_196212_12_010204179.zip
15.735 wah2_eu25_g8il_196812_12_010205096.zip
15.629 wah2_eu25_h0ib_197012_12_010205384.zip
15.629 wah2_eu25_h0ik_197012_12_010205393.zip
15.631 wah2_eu25_h1ao_197112_12_010205426.zip

UXJnHL
Joined: 1 Nov 06
Posts: 11
Credit: 579,556
RAC: 1,322
Message 53007 - Posted: 2 Dec 2015, 5:35:04 UTC

How often do these tasks checkpoint? Looking at the task running now, it seems it's been over 50 minutes of CPU time since the last checkpoint.

Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 53010 - Posted: 2 Dec 2015, 22:15:54 UTC - in response to Message 53007.  

How often do these tasks checkpoint? Looking at the task running now, it seems it's been over 50 minutes of CPU time since the last checkpoint.

All CPDN models checkpoint at fixed points in the calculation. For these models it's at the end of each model day, with trickles and uploads being made every 30 model days.

My i5 has a checkpoint interval of just under 50 minutes and for the Q6600 it's around 70 minutes.
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
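
The checkpoint interval can be estimated from the trickle data posted earlier in the thread. The first two trickles arrived at timesteps 11,819 and 23,339; if those are exactly 30 model days apart, as described above, that gives 384 timesteps per model day. A rough Python sketch under that assumption (uniform timesteps, one checkpoint per model day; the function names are invented for illustration):

```python
MODEL_DAYS_PER_TRICKLE = 30  # from the explanation above


def timesteps_per_model_day(first_ts, second_ts, days=MODEL_DAYS_PER_TRICKLE):
    """Timesteps in one model day, inferred from two consecutive trickles."""
    return (second_ts - first_ts) / days


def checkpoint_interval_minutes(sec_per_ts, ts_per_day):
    """CPU minutes between checkpoints, assuming one checkpoint per model day."""
    return sec_per_ts * ts_per_day / 60


ts_per_day = timesteps_per_model_day(11_819, 23_339)
print(ts_per_day)  # 384.0
# Reported averages from this thread, in seconds per timestep.
for sec_per_ts in (4.8, 5.47, 6.93):
    print(sec_per_ts, round(checkpoint_interval_minutes(sec_per_ts, ts_per_day), 1))
```

At about 6.9 sec/TS this works out to roughly 44 minutes of CPU time per checkpoint, and an interval of around 50 minutes would correspond to about 7.8 sec/TS, which is in line with the figures people are reporting.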

UXJnHL
Joined: 1 Nov 06
Posts: 11
Credit: 579,556
RAC: 1,322
Message 53016 - Posted: 4 Dec 2015, 0:42:37 UTC - in response to Message 53010.  

How often do these tasks checkpoint? Looking at the task running now, it seems it's been over 50 minutes of CPU time since the last checkpoint.

All CPDN models checkpoint at fixed points in the calculation. For these models it's at the end of each model day, with trickles and uploads being made every 30 model days.

My i5 has a checkpoint interval of just under 50 minutes and for the Q6600 it's around 70 minutes.

Cheers, thanks for the reply!

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,904,898
RAC: 2,026
Message 53037 - Posted: 6 Dec 2015, 14:07:44 UTC

Two of my WAH2 models from the 29 November batch have completed. Some others have failed early on: at least one of those has made some progress on another computer, which makes me wonder whether they don't like being run with too many in parallel (my habit is to run at 25% of CPUs, except when getting new work, when I put CPUs back to 100%; the crashes all occurred during the 100% period).

ed2353
Joined: 15 Feb 06
Posts: 137
Credit: 35,517,114
RAC: 10,523
Message 53038 - Posted: 6 Dec 2015, 16:53:33 UTC - in response to Message 53037.  

I'm running at 75% CPUs (6 tasks) during the day and 100% CPUs (8 tasks) during the night on my Win 10 64-bit i7.

One WAH2 failed having trickled once.
Another WAH2 failed after 8 trickles.
Both failed during the day, so not at 100% CPUs.

The other WAH2s are at 70% completion.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53039 - Posted: 6 Dec 2015, 19:41:52 UTC
Last modified: 6 Dec 2015, 19:53:23 UTC

Oops. Wrong models.
I'll move this to the CM3n thread.

Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 53042 - Posted: 7 Dec 2015, 1:30:12 UTC - in response to Message 53037.  

Two of my WAH2 models from the 29 November batch have completed. Some others have failed early on: at least one of those has made some progress on another computer, which makes me wonder whether they don't like being run with too many in parallel (my habit is to run at 25% of CPUs, except when getting new work, when I put CPUs back to 100%; the crashes all occurred during the 100% period).

The memory load for WAH2 seems to be much higher than was the case for previous applications. My wah2_eu25 tasks have a total working set size of around 460MB and I've changed the project resource shares on my Q6600 (which only has 2GB of RAM) to prevent it from running more than one of these tasks.
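
As a rough guide to how many of these tasks a machine can hold in memory at once, here is a small Python sketch. It uses the 460 MB working set quoted above; the amount of RAM set aside for the operating system and other programs (`reserved_mb`) is purely an assumption you would need to adjust for your own machine, and `max_tasks_by_memory` is a name made up for this example.

```python
def max_tasks_by_memory(total_ram_mb, working_set_mb, reserved_mb=0):
    """Upper bound on concurrent tasks that fit in RAM after reserving some for the OS."""
    usable_mb = total_ram_mb - reserved_mb
    if usable_mb <= 0:
        return 0
    return int(usable_mb // working_set_mb)


# A 2 GB machine with 460 MB per wah2_eu25 task.
print(max_tasks_by_memory(2048, 460))                    # 4 with nothing reserved
print(max_tasks_by_memory(2048, 460, reserved_mb=1024))  # 2 if 1 GB is kept free
```

This is only an upper bound from working set size; a more conservative limit, such as the single task per 2 GB machine described above, leaves more headroom for everything else running on the computer.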