Thread 'Suspending WUs safely'

Author	Message
bozz4science Send message Joined: 10 May 20 Posts: 50 Credit: 3,417,917 RAC: 2,363	Message 62857 - Posted: 5 Nov 2020, 16:55:18 UTC Last modified: 5 Nov 2020, 16:59:55 UTC I was very excited to receive my first WUs on all 3 of my hosts, and I am trying to read up on most of the recent discussion. The Linux-based host received UK Met Office HadAM4 at N216 resolution v8.52 WUs and the 2 Windows hosts received Weather At Home 2 (wah2) v8.24 windows_intelx86 WUs. I am looking forward to seeing these WUs being crunched once my remaining queue is drained in a couple of days. All other projects are set to no new tasks in the meantime. Now to my questions. As my hosts are considerably slower than the average host attached to this project, I am afraid that a WU suspension followed by a reboot, due to some random Windows update or power blackout (more likely with those above-average runtimes), will cause the already started WUs to error out. As I wasn't successful in finding a thread that matched my questions with the search function, I started this one. If I just overlooked an existing thread, I would highly appreciate a pointer to it. My questions: 1) So, can I just suspend some WUs and continue later if the "leave in memory" option is activated in BOINC Manager? 2) At what time or percentage interval do the WUs usually checkpoint? 3) Is it safe to suspend WUs and continue them after a reboot? Would I risk losing incremental progress on a WU if it is crunching away between 2 checkpoints? In addition, I would be very interested in how many WUs you would recommend running simultaneously on an (old) 6-core processor, because I read about the strain they apparently put on L3 cache. So combining them with any Rosetta or WCG-MIP tasks on the remaining threads would also hurt performance tremendously, right? 
Just want to be sure that my computers actually produce valuable results, as opposed to invalid results after days of crunching because of something that I could have done to mitigate or avoid the error altogether. Thx in advance! ID: 62857 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 62858 - Posted: 5 Nov 2020, 19:49:48 UTC - in response to Message 62857. 1) Yes. If already running, WAIT a few seconds after suspending each one, as there are a lot of files to close, which takes time on a slow HD. 2) Click on a task. Click on Properties to the left of the list of tasks. There are 2 items about checkpoints near the top of the drop-down list. 3) It's safe for most model types. But some in the past have been found to fail. (Which may be due to the suspend/shutdown process used.) It's the Linux N216 models that use a lot of L3 cache. We've found experimentally that each likes a good 4 Megs of L3. Look up the Intel specs for each of your computers to see how much L3 they each have. Run too many, and they slow WAAAAY down. ID: 62858 · Reply Quote

bozz4science Send message Joined: 10 May 20 Posts: 50 Credit: 3,417,917 RAC: 2,363	Message 62859 - Posted: 5 Nov 2020, 20:16:11 UTC - in response to Message 62858. Last modified: 5 Nov 2020, 20:17:29 UTC Thanks so much for getting back to me this quickly. Very informative indeed. 1) I will wait after suspending tasks in the future. I run most of my computers with HDs only, so this is definitely good advice. 2) I haven't had the chance yet to see this in action and didn't know that checkpoints were implemented the same way as for other projects. Very helpful for some intermediate/scheduled downtime or if a reboot is required. 3) That's great to know. I will be cautious then. 4) Thanks for that info. I will adjust the app config file on my Linux host accordingly. For this host, based on an old Xeon X5660 with 12 MB of L3 cache, that will be 3 work units, and additionally I will limit other tasks of L3-cache-intensive apps for the time being. Very much appreciated, Les Bayliss! Thanks again ID: 62859 · Reply Quote
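The arithmetic in point 4 (12 MB of L3 at roughly 4 MB per N216 task gives 3 tasks) can be sketched as below. This is only an illustration of the rule of thumb quoted in the thread, not a project tool. The function names are made up for this example, and the app name passed to `render_app_config` is an assumption: check the actual short name of the application on your own host (for example in `client_state.xml`) before using it. The `<max_concurrent>` element itself is a standard part of BOINC's `app_config.xml`.

```python
def max_tasks_for_l3(l3_mb, mb_per_task=4.0, cores=None):
    """Rule-of-thumb task count: one task per ~4 MB of L3 (figure from the thread)."""
    if l3_mb <= 0 or mb_per_task <= 0:
        raise ValueError("cache sizes must be positive")
    tasks = int(l3_mb // mb_per_task)
    if cores is not None:
        # Never suggest more tasks than there are cores to run them.
        tasks = min(tasks, cores)
    return tasks


def render_app_config(app_name, max_concurrent):
    """Build app_config.xml text limiting one app to max_concurrent tasks.

    app_name is an assumption supplied by the caller; it must match the
    application's short name as known to the BOINC client.
    """
    return (
        "<app_config>\n"
        "  <app>\n"
        f"    <name>{app_name}</name>\n"
        f"    <max_concurrent>{max_concurrent}</max_concurrent>\n"
        "  </app>\n"
        "</app_config>\n"
    )


if __name__ == "__main__":
    # Xeon X5660 from the post: 12 MB L3, 6 cores.
    limit = max_tasks_for_l3(12, cores=6)
    # "hadam4h" is a placeholder app name, not confirmed by the thread.
    print(render_app_config("hadam4h", limit))
```

The generated text would go into `app_config.xml` in the project's folder under the BOINC data directory; the client picks it up after "Read config files" or a restart.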

bozz4science Send message Joined: 10 May 20 Posts: 50 Credit: 3,417,917 RAC: 2,363	Message 62937 - Posted: 13 Nov 2020, 14:00:29 UTC I didn't want to start a new thread and clutter the forum unnecessarily, and didn't know where else to put this. Now that my first 6 WUs have trickled a few times, I wonder what "Phase", "Timestep" and "Average (sec/TS)" mean exactly. How many phases and/or trickles are there in a WU? Do they differ between the wah2 and the UK Met Office HadAM4 at N144 resolution WUs? Could you briefly elaborate on these? Thanks ID: 62937 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944	Message 62939 - Posted: 13 Nov 2020, 14:48:43 UTC - in response to Message 62937. Two of your tasks, wah2_sam50_a0mf_201312_25_881_012034815_2 and hadam4h_e0c4_207111_5_887_012042274_0, can be used to illustrate what the task names mean, which gives some of this information. In the wah2 task: sam50 - South America region, 50 km square resolution. 201312 - start month December 2013. _25_ - runs for twenty-five model months. 881 - batch no. _2 at the end indicates it has failed on two computers previously and this is its final chance. In the Linux task, hadam4h is the model type. Start month is November 2071. It runs for 5 model months. Batch no. is 887, and _0 means it is the first attempt at the task. Timestep is, as the name suggests, the steps through the model time. Pretty sure these vary between task types and possibly even within task types. sec/timestep is the CPU time to complete that bit of processing. N144 tasks are lower resolution than N216 HadAM4h tasks and take up correspondingly less RAM, cache and disk space. Trickles are sent at the same time as the monthly zip files are produced and are what credit is based on. There are changes afoot to how credit is managed, but I am not sure about the time scale for that. (I am not holding my breath!) I or one of the other mods should probably tidy this up a bit and make it a sticky for people to read or to be referred to when these questions get asked, to prevent re-inventing the wheel each time. ID: 62939 · Reply Quote
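The naming scheme described above can be sketched as a small parser. This is an illustration built only from the two example names in this thread, not an official specification: it assumes the last five underscore-separated fields are always start date (YYYYMM), model months, batch number, an ID field and the attempt number. The fields between the model type and the start date (such as `sam50` and `a0mf`) are kept as opaque strings because the thread only explains the region part. The class and function names are made up for this example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TaskName:
    model: str               # e.g. "wah2" or "hadam4h"
    extra: Tuple[str, ...]   # fields before the start date, e.g. ("sam50", "a0mf")
    start_year: int
    start_month: int
    model_months: int
    batch: int
    id_field: str            # not explained in the thread; kept verbatim
    attempt: int             # 0 = first attempt


def parse_task_name(name: str) -> TaskName:
    """Split a task name into the fields described in the post above.

    Assumes the layout seen in the two example names; other task types
    may be laid out differently.
    """
    parts = name.split("_")
    if len(parts) < 7:
        raise ValueError(f"unexpected task name layout: {name!r}")
    start, months, batch, id_field, attempt = parts[-5:]
    if len(start) != 6 or not start.isdigit():
        raise ValueError(f"start date is not YYYYMM: {start!r}")
    return TaskName(
        model=parts[0],
        extra=tuple(parts[1:-5]),
        start_year=int(start[:4]),
        start_month=int(start[4:]),
        model_months=int(months),
        batch=int(batch),
        id_field=id_field,
        attempt=int(attempt),
    )


if __name__ == "__main__":
    for n in ("wah2_sam50_a0mf_201312_25_881_012034815_2",
              "hadam4h_e0c4_207111_5_887_012042274_0"):
        print(parse_task_name(n))
```

Parsing from the right-hand end is a deliberate choice here, since the wah2 name carries an extra region field near the front that the hadam4h name lacks.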

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 62943 - Posted: 13 Nov 2020, 19:35:05 UTC - in response to Message 62937. To complete the answers, "Phase" is a leftover from the beginning of the project. The model used then was a "Slab Ocean" model - an atmosphere with lots of weather going on in it, and a big blob of water at the bottom, which had fixed characteristics. There were 3 parts to it - Past, Present, and Future, each lasting for 25 years. These were the phases. The intent was to create a huge pool of data that could be used by climate researchers to look for events that could improve the understanding of climate. Somewhere about 10 years ago, perhaps a bit longer, the Oxford researchers started looking outside of the university, to research groups worldwide, to provide the work. Then the models used were changed to more advanced Met Office models, and the purpose changed as well. The intent now was to study climate attribution. The term "Phase" is still in there somewhere, but it's not used now. As for trickle_up files, these too are pretty much a leftover from the past. The data is contained in zip files, which show up in the Transfers tab of BOINC. Once they're sent, they disappear into a server somewhere around the planet, never to be seen again by the person who created them. ID: 62943 · Reply Quote

bozz4science Send message Joined: 10 May 20 Posts: 50 Credit: 3,417,917 RAC: 2,363	Message 62947 - Posted: 13 Nov 2020, 21:06:07 UTC Thank you both very much! Very much appreciated. I agree on making this sticky as I couldn't find a fitting thread via the forum search. Maybe you could also edit the title accordingly by adding " + task naming explained" to make finding it easier for others. Fingers crossed though, that I will successfully compute the …_2 wah WUs. Thanks again ID: 62947 · Reply Quote