climateprediction.net (CPDN) home page
Thread 'New small batches of long runs --> with problems'

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 55894 - Posted: 14 Mar 2017, 2:26:39 UTC

From Sarah Sparrow, Oxford CPDN scientist:

Hi all,

I just wanted to let you know that I have just sent out 12 batches of long simulations (10 years) to the main site in case you hear anything on the message boards. These are batches 539-550 and represent simulations that we need to ensure that the deep soil is spun up correctly for each region. We can only do this by way of a longer continuous run. These batches are small (only 10 workunits per region) so as not to tie up too much resource for a long period of time. I have three further batches like this to send out that I am awaiting some final details for.

Best wishes,
Sarah


Two appeared, one on each of two of my machines. Both failed early: after about two minutes on an i7, and after ~5.5 minutes on a Q9550.

The BOINC installation on the i7 was left unusable: BOINC couldn't reconnect. (All data was wiped from all pages; the BOINC framework and labels remained.) The situation persisted after a reboot, after a "repair" installation, and after a BOINC uninstall/reinstall. The Q9550 suffered the same page data wipe but recovered.

Email sent to Sarah, but it's the middle of the night in England ...

Until we receive guidance from on high, my suggestion is to suspend, upon receipt, any tasks from the mentioned batches (539-550). It seems luck of the draw whether boinc suffers a knockdown or a knockout.
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 55894
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 55895 - Posted: 14 Mar 2017, 18:03:32 UTC - in response to Message 55894.  

Any updates on how to recognize these killer WUs, other than that they come from batches 539-550? Have they been pulled from the queue?
ID: 55895
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 55899 - Posted: 14 Mar 2017, 23:38:29 UTC - in response to Message 55895.  

Hi Jim

Most of these batches (up to 552 now) seem to be running OK, so I don't think that there's a general problem.
ID: 55899
Alan K
Joined: 22 Feb 06
Posts: 492
Credit: 31,510,687
RAC: 14,925
Message 55900 - Posted: 15 Mar 2017, 11:00:32 UTC - in response to Message 55899.  

I've got one from batch 545 waiting to start but it won't get to the front of the queue for a few days yet. Might suspend it as a task and set no new tasks until it's ready to go.
ID: 55900
MartinNZ
Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 55905 - Posted: 16 Mar 2017, 10:22:15 UTC - in response to Message 55894.  

"long simulations" - well there's an understatement. The typical completion time estimate is currently 3 days, but one of these is in my queue with an estimate of 72 days. Not sure I remember anything running that long before.

No more feedback on the reliability of these? I've suspended it waiting for some more feedback.
ID: 55905
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 55911 - Posted: 16 Mar 2017, 10:57:50 UTC

Not sure I remember anything running that long before.


There was a time when tasks would last six months or more!
ID: 55911
bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,820,415
RAC: 9,194
Message 55912 - Posted: 16 Mar 2017, 12:12:49 UTC - in response to Message 55911.  

Not sure I remember anything running that long before.


There was a time when tasks would last six months or more!

Well, wah2_global_a04o..145_520. runs around 30 days on my i5-2520M @2.5 GHz so I can only imagine how long the "long runs" will be.
ID: 55912
WB8ILI
Joined: 1 Sep 04
Posts: 161
Credit: 81,522,141
RAC: 1,164
Message 55913 - Posted: 16 Mar 2017, 12:22:24 UTC
Last modified: 16 Mar 2017, 12:51:05 UTC

The estimate for mine is 52 days (AMD Phenom II X4 945, so not the latest and fastest CPU). Windows XP.

It has been running for almost 24 hours and is 1.526% complete.

No problems.
ID: 55913
Alan K
Joined: 22 Feb 06
Posts: 492
Credit: 31,510,687
RAC: 14,925
Message 55939 - Posted: 20 Mar 2017, 11:19:58 UTC - in response to Message 55900.  

Now running this task. 1 hour down, 121 days to go!
ID: 55939
MartinNZ
Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 55942 - Posted: 20 Mar 2017, 20:47:07 UTC - in response to Message 55905.  

Well the 72 days turned out to be about 3 minutes before it crashed. Surprised me as the new PC build is proving to have a pretty low error rate.

Still, a blessing in disguise, as I see the supply of tasks is petering out. With tasks available the PC runs 24/7, but if there are none around it now gets turned off at night.
ID: 55942
Alan K
Joined: 22 Feb 06
Posts: 492
Credit: 31,510,687
RAC: 14,925
Message 55947 - Posted: 22 Mar 2017, 18:10:44 UTC - in response to Message 55939.  

2.5% in 2 days, so it looks as if the 121-day estimate will be out by about 40 days. We shall see. Also running a global model with a 145-month run, which is generating trickles about every 2,800 timesteps.
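For anyone who wants to redo that sum on their own task, here is a minimal sketch of the extrapolation. It assumes progress is roughly linear in elapsed time, which a model may not honour over a months-long run. The function name and arguments are made up for illustration and are not part of BOINC.

```python
def projected_total_days(elapsed_days: float, percent_done: float) -> float:
    """Linearly extrapolate the total run time from progress so far."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    return elapsed_days * 100.0 / percent_done


# Figures from this post: 2.5% done after 2 days, against an initial
# BOINC estimate of 121 days.
total = projected_total_days(elapsed_days=2, percent_done=2.5)
print(f"projected total: {total:.0f} days")
print(f"initial estimate is out by about {121 - total:.0f} days")
```

With these inputs the projection is 80 days, so the 121-day figure is out by roughly 41 days, in line with the "about 40" above.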
ID: 55947
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 55948 - Posted: 22 Mar 2017, 20:30:13 UTC - in response to Message 55947.  

Those people running these very long tasks might want to think about going back to the practice of making frequent backups. Four months is a long time and a lot can happen. You don’t want to invest 2 or 3 months in a task and then lose it because of a power failure or unexpected reboot.
ID: 55948
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 55949 - Posted: 22 Mar 2017, 23:21:23 UTC - in response to Message 55947.  

That's about 80 days with an i5 4690, so a fast PC. Running this type of model on a slow PC would really take forever.
ID: 55949
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 55950 - Posted: 23 Mar 2017, 0:21:30 UTC - in response to Message 55949.  

I don’t know if you have been with the project long enough to remember the 160-year models that took about 8 months to run on 1.2 GHz processors with 256 KB of RAM. In those days, they used to post every time someone finished one.
ID: 55950
MartinNZ
Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 55951 - Posted: 23 Mar 2017, 2:37:27 UTC - in response to Message 55948.  

I no longer back up CPDN data as it is just too difficult to restore. OK, you can restore it, but then you have overlapping trickles etc. Then there is the issue that if a PC glitch knocks out your tasks (I run 12), another series of tasks starts running, so how do you deal with that and the restore? I decided that the project probably realises that tasks fail and that there are enough running to cover this. (If they don't, they shouldn't be using distributed computing!) Besides, any failed task gets reallocated to another PC a couple of times. If all 3 fail, it probably points to a model problem or is the result they were after anyway.
ID: 55951
MartinNZ
Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 55952 - Posted: 23 Mar 2017, 2:48:05 UTC - in response to Message 55949.  
Last modified: 23 Mar 2017, 2:48:41 UTC

Running this type of model on a slow PC would really take forever.


Interesting. Looking back at my first CPDN PC in 2006 (Pentium 4 CPU 3.06GHz) it had a floating point speed of 1420 million ops/sec. My latest creation (i7-6900K CPU @ 3.20GHz) is 4870 million ops/sec.

Now although that is a factor of roughly 3.4 difference, it is not as large a difference as I was expecting given it is 11 years on. Although agreed, it would still be taking forever :-(
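A quick check of that ratio, as a rough sketch only. It takes the two quoted benchmark figures at face value and assumes they are comparable single-core numbers. The helper names are invented for the example.

```python
def speed_ratio(old_mflops: float, new_mflops: float) -> float:
    """How many times faster the new benchmark figure is than the old one."""
    return new_mflops / old_mflops


def annual_growth(ratio: float, years: float) -> float:
    """Compound yearly growth rate that would produce `ratio` over `years`."""
    return ratio ** (1.0 / years) - 1.0


# Figures quoted above: Pentium 4 3.06GHz (2006) vs i7-6900K 3.20GHz (2017).
ratio = speed_ratio(1420, 4870)
print(f"speed ratio: {ratio:.2f}x")
print(f"implied growth: {annual_growth(ratio, 11):.1%} per year")
```

The second number is only the compound rate implied by these two data points, not a general statement about CPU progress.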
ID: 55952
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 55953 - Posted: 23 Mar 2017, 7:30:45 UTC - in response to Message 55950.  

Or a 650MHz Duron in my case. Cue Monty Python sketch https://www.youtube.com/watch?v=Xe1a1wHxTyo
ID: 55953
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 55970 - Posted: 27 Mar 2017, 4:07:10 UTC

The fast running chickens are starting to come home to roost.
ID: 55970
Pete B
Joined: 26 Aug 04
Posts: 67
Credit: 10,385,915
RAC: 6,416
Message 55984 - Posted: 30 Mar 2017, 12:04:02 UTC - in response to Message 55952.  

Running this type of model on a slow PC would really take forever.


Interesting. Looking back at my first CPDN PC in 2006 (Pentium 4 CPU 3.06GHz) it had a floating point speed of 1420 million ops/sec. My latest creation (i7-6900K CPU @ 3.20GHz) is 4870 million ops/sec.

Now although that is a factor of roughly 3.4 difference, it is not as large a difference as I was expecting given it is 11 years on. Although agreed, it would still be taking forever :-(


Reminiscing about old runs and machines, I remember running one of the first HADCM3 Spinups back in 2005/6 in advance of the BBC Expt on a Pentium 4 @ 3.2GHz with 1GB RAM. This ran at around 2 sec/TS on that machine. I recently found I still had an old backup of the spinup files in an as-downloaded, not-yet-started state.

I made all the adjustments to the files to get the spinup to run on my current i7 6700 with 16GB RAM in 2017. I ran it for a day and, checking the zip files, it was returning 0.45 sec/TS, about 4.5 times faster than on the 12-year-old Pentium 4 machine.

The old Pentium 4 machine was single core (2 if running hyperthreading). This is 4 core (8 if running hyperthreading). Taking both into account, some difference :) I'm only actually running 2 models together on it though just now.
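To put rough numbers on "some difference", here is a small sketch. It assumes the "around 2 sec/TS" figure is exactly 2.0, and, for the whole-machine figure, that every physical core could run its own model at the same single-model rate. That second assumption is optimistic, since several models sharing one machine also share memory bandwidth. The function names are hypothetical.

```python
def per_model_speedup(old_sec_per_ts: float, new_sec_per_ts: float) -> float:
    """Speed-up of one model, from seconds per timestep on each machine."""
    return old_sec_per_ts / new_sec_per_ts


def machine_speedup(per_model: float, old_cores: int, new_cores: int) -> float:
    """Upper bound on whole-machine throughput gain, one model per core."""
    return per_model * new_cores / old_cores


single = per_model_speedup(2.0, 0.45)        # Pentium 4 vs i7 6700
whole = machine_speedup(single, old_cores=1, new_cores=4)
print(f"one model: about {single:.1f}x faster")
print(f"all four cores busy: up to about {whole:.0f}x the throughput")
```

With only 2 models running together, as here, the same formula with new_cores=2 gives the more relevant upper bound.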
ID: 55984
Alan K
Joined: 22 Feb 06
Posts: 492
Credit: 31,510,687
RAC: 14,925
Message 56004 - Posted: 4 Apr 2017, 9:34:02 UTC - in response to Message 55949.  

Updated progress: task now about 24% complete after 12 days and a few hours. This would imply about 53 days total run time. So about another 41 days to go! BOINC still thinks about 90 days left!!
ID: 56004