HADSM3 Model "stuck"

Aaron Doucett Joined: 1 Oct 05 Posts: 12 Credit: 10,041,430 RAC: 0	Message 40972 - Posted: 5 Nov 2010, 0:36:59 UTC Hi, I've been using BOINC for about a month now and have had a lot of success, but this is the first time I've run into this issue. On a machine with Windows XP, I've been running a model, "hadsm3dhet2_jkrd_006589899_4". For the past few hours or more, the model has been stuck at 95.324% done, and while the elapsed time keeps counting up, the time to completion does not follow. The time steps have changed, but at a rate of about one every 6 seconds, rather than my usual 0.4 s per TS. The total elapsed time is now at 305 hours. There are 4 models in total running simultaneously on this machine (quad-core Intel), and this one doesn't seem to want to finish, despite still using a 25% chunk of the CPU. I've come this far... It would be a shame to have to abort! What should I do? Is there any hope?

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 40973 - Posted: 5 Nov 2010, 5:54:15 UTC - in response to Message 40972. Hi Aaron, welcome to the forum. First off, perhaps I should say that the climate models are not intended, or guaranteed, to run right to the end. Rather, each one is started with a set of parameters (the number of which can vary), each of which can have slightly different values from those in other similar models. Then the model is allowed to run to see what happens to it. Some will complete, others won't. But all of them tell the physicists something about the starting values. This is used to form a huge 'ensemble' of model values and results. The hadsm3 models (also known as 'slab ocean', or just 'slab', models) have 2 failure modes: 1) They are prone to failure if they get interrupted near the point where they change phase. (There are 3 phases.) 2) They can become "iceworlds", so called because the Temperature graph shows an overall blue colour. These are further subdivided into fast processing and slow processing, which may be what you have. Either way, some of the data from "iceworld" models is missing in the returned results, so they need to be aborted. There is a sticky post near the top of the Number crunching section about ice worlds, how to recognize them and what to do about them.

Aaron Doucett Joined: 1 Oct 05 Posts: 12 Credit: 10,041,430 RAC: 0	Message 40974 - Posted: 5 Nov 2010, 14:50:26 UTC - in response to Message 40973. Thanks for the input! Now that you mention it, I did see a completely blue earth under temperature mode! I assumed it just wasn't computing anything... but I guess I had instead encountered one of these "ice worlds"... how horrifying! :D In any case, after a night of running I am now seeing what seem to be completely different models running... very strange.

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 40977 - Posted: 5 Nov 2010, 20:37:55 UTC Last modified: 5 Nov 2010, 20:39:13 UTC Aaron, we need to look at the workunit that the problem 'iceworld' model comes from to see whether there are any other computers with the same OS and CPU type crunching it. If there are any, their owners need to be sent an email, as their HadSM will probably develop the same problem. I see you have a lot of computers crunching a lot of models. Here they are. Could you please tell us the ID number of the computer that had the iceworld, or give us a link to the computer or the model? Blue is the default temperature colour. An all-blue world is a sign that data isn't being generated even though the model keeps trying desperately. It's a good idea to have a look at the graphics of each HadSM every two or three days to check that everything's still OK. Other model types don't have this bug.

Aaron Doucett Joined: 1 Oct 05 Posts: 12 Credit: 10,041,430 RAC: 0	Message 40978 - Posted: 5 Nov 2010, 22:15:02 UTC I believe I have some information that might help you. If you look at my "error while computing" list, http://climateapps2.oucs.ox.ac.uk/cpdnboinc/results.php?userid=100701&offset=0&show_names=0&state=5 you will find that computer 1104234 has had several "errors while computing" in the past couple of days. This is a little odd, as I haven't seen this volume of errors in the past. I also discovered that one of my "COSMOS" machines (an array of PCs I have set up, dedicated to running CPDN 24/7) had been running a hadsm for 700+ hours! So I aborted that task and noticed a significant loss of average credit on that machine (computer 1105294). The task ID for that "iceworld" was Task 11014309: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=11014309 This, however, http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1104234 is the link to the machine that I also saw issues with. Nothing else in the system has changed in the past couple of weeks, so I'm not sure what would lead to the errors all of a sudden... Hope this helps you help me!

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 40980 - Posted: 6 Nov 2010, 9:26:17 UTC Last modified: 6 Nov 2010, 9:38:01 UTC Thanks for that info. I'm glad you were still watching the thread. One useful thing you may want to know before I get started and look at what's been happening to those two computers: if you want to post a live link to another internet page, e.g. a computer's page, do this: type the opening tag [url], paste in the address of the page, then type the closing tag [/url]. Instead of typing the tags in, you can click the URL button above the message box. You have to do this both before and after the address. When you post the message the tags become hidden. If you use the Preview button before posting the message, you can see whether you got it right and whether the link is going to work. Anyone who wants is welcome to start a Tag Practice Thread in the Cafe section to try out what all the tags do. If you click the Quote button below anyone's post you will be shown exactly what they typed in to produce the effects they got.

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 40982 - Posted: 6 Nov 2010, 10:35:13 UTC Last modified: 6 Nov 2010, 10:38:27 UTC Here are the tasks from computer 1104234: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/results.php?hostid=1104234 On the Task page for each model we need to click on stderr + to display the messages. There are a few issues. Task 11966718 is a regional PNW. It crashed after 2 trickles/file uploads. The messages include: Install Directory : C:\bonic\program files\BOINC\ Data Directory : C:\bonic\CommonAppData\BOINC Can there really be a folder on this computer's C drive called 'bonic'? Is this a spelling error or is it your username, Aaron? (Let's deal with this before we look at other crashed models.)

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 40983 - Posted: 6 Nov 2010, 12:17:27 UTC Last modified: 6 Nov 2010, 12:20:16 UTC Same task as before, 11966718. It crashed with exit code -1073741819 (0xffffffffc0000005). Jorden's BOINC FAQ for this error is here. Unfortunately there appear to be a number of possible causes. However, several models on the same computer have crashed with Exit code -2 and Last error=193, for example Task 11828983. Moderator Thyme Lawn gave instructions regarding this error, which probably indicates a permissions problem, on the independent forum here. The instructions are two years old and may not be exactly right for this computer's version of Windows. Further down the thread you can see that the member with the problem, owdjim (Chris), solved it by reinstalling BOINC instead of resetting the permissions. You don't actually need to uninstall the current BOINC before reinstalling, but you should completely exit from it first: right-click on the BOINC icon in the tray and select Exit. It looks as if something has happened on this computer in the last few days to cause this problem, as it was previously producing good results. Task 11959901 has Last error=1450, but I'd guess this is a different manifestation of a permissions problem.
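A note on reading the exit code quoted above: the stderr output shows it as a negative signed number, while Windows status codes are usually written as unsigned 32-bit hex values. The following is a minimal sketch of the conversion (the function name is made up for illustration). The result, 0xC0000005, is the Windows status code for an access violation, which fits the remark that there are a number of possible causes.

```python
def to_unsigned32(exit_code: int) -> int:
    """Reinterpret a signed exit code as an unsigned 32-bit value."""
    return exit_code & 0xFFFFFFFF


# Exit code reported for task 11966718 in the post above.
code = -1073741819
print(hex(to_unsigned32(code)))  # 0xc0000005
```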

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 40984 - Posted: 6 Nov 2010, 12:39:23 UTC The computer that had the iceworld is doing really well. The workunit that Aaron's iceworld belonged to is here. We can see from the sec/TS of Aaron's model 11014309 how it slowed down disastrously. Jason has been stuck in an iceworld since mid-October without noticing or realising. Anonymous also has Intel + Windows and may have just hit the problem. I'll ask for emails to be sent to these two members.

Aaron Doucett Joined: 1 Oct 05 Posts: 12 Credit: 10,041,430 RAC: 0	Message 40991 - Posted: 8 Nov 2010, 15:59:48 UTC I've noticed some strange behavior from the computer http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1104234 which I think is related to stability issues. I've cleaned it up a bit and am HOPING that I get the more consistent performance I was getting before. It's a shame to have the average credit for one of your fastest PCs halved because of system lockups! As for the folder naming, C:\ , that is the actual name of the folder on the machine. A typo, I think. Even though there were models in progress, that same machine seems to have already downloaded all new tasks. (4 FAMOUS models... interesting.) Hope this clears a couple of things up at least.

Aaron Doucett Joined: 1 Oct 05 Posts: 12 Credit: 10,041,430 RAC: 0	Message 40998 - Posted: 9 Nov 2010, 23:30:26 UTC - in response to Message 40991. There are definitely problems with that machine... Today BOINC detached from the client all on its own... and again downloaded 4 new models. Something is not right! (This is after a complete reinstall as well.)

[AF>Le_Pommier] Jerome_C2005 Joined: 21 Oct 10 Posts: 53 Credit: 2,101,753 RAC: 3,985	Message 41174 - Posted: 28 Nov 2010, 12:54:45 UTC Last modified: 28 Nov 2010, 12:56:21 UTC Hi all, I think maybe I should have read this forum and thread earlier, since I may be facing a similar issue. I'm new to CPDN (but not at all to BOINC) and I decided to go for CPDN for the following reason: I have a computer at the office which cannot be connected to the net (and that's definitive). I know about the way of moving BOINC files from one instance to another via a USB key, but I didn't want to do this often, so after discussing it with BOINC fellows from L'Alliance Francophone I thought going for CPDN was a good idea since the WUs are long. So I bring the WUs (the first and only 2 WUs I have had since I started) back home at the weekend and have them crunching in a Windows 7 virtual machine (I have an iMac at home), and then take them back to the office during the week to the Vista computer (2 cores in each). The problem is that at the beginning I could see credit rising, and the 2 WUs were sending some trickles when connected to the net, but then it started to "freeze": the number of hours is increasing, the % is moving forward and then backward (so it is more or less stuck below 70%), and the remaining hours are always increasing... http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=10960172 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=10960153 I can't bring myself to cancel WUs that are still active and not in "error", and on which I have spent so many crunching hours... What should I do? Thanks for your help. Jerome

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 41175 - Posted: 28 Nov 2010, 14:06:50 UTC - in response to Message 41174. Last modified: 28 Nov 2010, 14:08:44 UTC Hi Jerome, bad news: those 2 models have gone "ice world". (Sticky info here.) Look at the s/TS for the last few trickles: it has jumped suddenly. Abort them now. It's just a waste of time to continue. Also, it's not a good idea to swap models between computers that aren't identical, as the floating point unit on the processor works slightly differently for each type. This produces slightly different results for the same model run on each OS/processor brand, so you're effectively falsifying the results. Even if you produce a model that runs all the way through, the validation on this project is by statistical analysis, and your models may get rejected by the researchers.
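The check described above, looking for a sudden jump in s/TS over the last few trickles, can be sketched roughly as follows. This is only an illustration: the function name, the factor-of-three threshold and the sample numbers are assumptions, not anything published by CPDN, and the input is assumed to be one s/TS value per trickle interval rather than the cumulative average.

```python
from statistics import median


def sts_jumped(sec_per_ts, recent=3, factor=3.0):
    """Return True if the latest `recent` s/TS values are all at least
    `factor` times the median of the earlier values."""
    if len(sec_per_ts) <= recent:
        return False  # not enough history to compare against
    baseline = median(sec_per_ts[:-recent])
    return all(v >= factor * baseline for v in sec_per_ts[-recent:])


# Hypothetical per-trickle values: steady near 0.4 s/TS, then about 6 s/TS.
history = [0.41, 0.40, 0.42, 0.40, 5.8, 6.1, 6.0]
print(sts_jumped(history))
```

A flagged model still needs a look at its graphics and its workunit, since, as later replies in the thread point out, a slowdown can also come from an unstable PC.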

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2168 Credit: 64,533,542 RAC: 6,551	Message 41179 - Posted: 28 Nov 2010, 18:20:28 UTC It is not known whether the first task linked in Jerome's post is a "classic" Intel+Windows iceworld, as no Intel+Windows PC in that work unit has gotten that far. It could be. On the other hand, the second task linked has had several completions on Intel+Windows, so it is not a "classic" iceworld, despite the increase in s/TS. Something else is going on. It could be a stability issue on that PC, or something else. I would probably abort both of them, if you haven't already, but watch out for additional stability issues on other models. hadsm3 type models are no longer available, and FAMOUS models have some expected crashes, so it is difficult to determine the stability of your computer from those. On the other hand, if you run the hadam3p type models and have trouble with crashes there, you may need to do something to your PC to ensure it is running stably.

[AF>Le_Pommier] Jerome_C2005 Joined: 21 Oct 10 Posts: 53 Credit: 2,101,753 RAC: 3,985	Message 41181 - Posted: 29 Nov 2010, 9:07:15 UTC - in response to Message 41179. Last modified: 29 Nov 2010, 9:34:59 UTC Hi, too bad, I thought I had activated mail notification on that thread but obviously I hadn't, so I am only seeing these answers now that I am back at the office... Anyway, I understand (and I should have understood before, I guess) that it's useless to keep these two running. I also understand that it may not be a good idea to run CPDN across two different computers, and I gather from both your comments that it may be linked to the problem I'm having. However, I'm willing to give it another try, so I'll figure out how to configure CPDN so that I only get hadam3p, and will load them at home tonight. Then I'll see how it behaves; if I run into such a problem again I'll know for sure that CPDN is not an option here. Anyway, I don't have much choice: as I said, there is no way I can configure BOINC to connect to the internet here (proxy setting hidden + Windows user password unknown - ID card login system), so I'm trying to find a way to have a weekly load of WUs from some project, and I thought CPDN might be a good idea. By the way: what does s/TS mean? Thanks for your quick responses anyway. Jerome. Edit: I had activated the mail notification, it seems not to work properly...

Iain Inglis Volunteer moderator Joined: 16 Jan 10 Posts: 1081 Credit: 7,005,213 RAC: 4,277	Message 41182 - Posted: 29 Nov 2010, 9:31:25 UTC - in response to Message 41181. "I also understand that it may not be a good idea to run CPDN across two different computers, and I feel from both your comments that it may be linked to the problem I'm having." The advice to avoid mixing machines is right. However, one way of working around your situation might be to run the models to completion on one machine and then move the installation to another machine for reporting to the server and getting more work. It's the running of the model that shouldn't be mixed; if you immediately suspend new work when it downloads (i.e. before the first save point) then it will start again from the beginning when moved to the hidden machine. "By the way : what does s/TS mean ?" It means 'seconds per timestep' and is therefore the main way to compare performance.

[AF>Le_Pommier] Jerome_C2005 Joined: 21 Oct 10 Posts: 53 Credit: 2,101,753 RAC: 3,985	Message 41183 - Posted: 29 Nov 2010, 11:06:59 UTC - in response to Message 41182. This is interesting advice. Considering that I have 5 days of activity per week on this computer, do you mean that I should not even bother bringing the data back home at the weekend (since I shouldn't have it running on my Windows virtual machine) until it is completed at the office, and only then do the update from home? Or is it of some interest to have it communicate from home in the same way (without running) so it can deliver the "trickles" during the weekend, and then let it continue at work until completed? (I assume it would take more than 5 days to complete, or not necessarily?) "and is therefore the main way to compare performance": to compare performance between what? Two different computers?

Iain Inglis Volunteer moderator Joined: 16 Jan 10 Posts: 1081 Credit: 7,005,213 RAC: 4,277	Message 41185 - Posted: 29 Nov 2010, 12:47:06 UTC - in response to Message 41183. Last modified: 29 Nov 2010, 12:49:52 UTC "Considering that I have 5 days of this computer activity per week, do you mean that I should no even care bringing back the data home on the week-end (since I shouldn't have it running on my windows virtual machine) until it is completed at the office, and only then do the update from home ?" Yes, that's exactly right. "Or is it of some interest to have it communicate from home the same way (without running) so it can deliver the "trickels" during the week-end, and then let it continue at work until completed ?" You could do this but it is extra effort for no real gain. As far as I know the project gets no benefit from knowing early about a partially complete model that is going to complete. (If a model crashes then it may be useful to have some trickles uploaded, but if the model is likely to finish then you might as well wait until it is finished and upload all of the trickles together.) ""and is therefore the main way to compare performance" : to compare performance between what ? two different computers ?" People make a variety of comparisons: one model with another model of the same type, or with a different model type, or before and after an upgrade, or with one/two/three/four models running at the same time. It's often the case that a mix of different model types runs better on a computer than having all the same type, and s/TS is one way of finding out. Note that the s/TS reported on each model's results page is cumulative, i.e. Average (sec/TS) is calculated by dividing the CPU Time (sec) by the Timestep (those are the English column names; perhaps your page shows something else). If you want to experiment with different configurations it's better to calculate the s/TS from the difference in CPU time and the difference in timestep over one trickle interval.
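The difference-based calculation described above might look like this. It is a minimal sketch: the function names and the trickle records are invented for illustration, with each record holding the cumulative timestep and the cumulative CPU time in seconds, as on the results page.

```python
def cumulative_sts(cpu_sec, timestep):
    """Average sec/TS as reported on the results page: CPU time / timestep."""
    return cpu_sec / timestep


def interval_sts(prev, curr):
    """sec/TS over one trickle interval.

    `prev` and `curr` are (timestep, cpu_sec) pairs from consecutive trickles.
    """
    d_ts = curr[0] - prev[0]
    d_cpu = curr[1] - prev[1]
    if d_ts <= 0:
        raise ValueError("timestep must increase between trickles")
    return d_cpu / d_ts


# Hypothetical trickles: (cumulative timestep, cumulative CPU seconds)
trickles = [(10000, 4000.0), (20000, 8000.0), (30000, 68000.0)]

for prev, curr in zip(trickles, trickles[1:]):
    print(curr[0],
          round(interval_sts(prev, curr), 3),
          round(cumulative_sts(curr[1], curr[0]), 3))
```

With these made-up numbers the last interval works out to 6.0 s/TS while the cumulative average is still only about 2.27 s/TS, which is why the per-interval figure is the better one for spotting a recent change.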

[AF>Le_Pommier] Jerome_C2005 Joined: 21 Oct 10 Posts: 53 Credit: 2,101,753 RAC: 3,985	Message 41187 - Posted: 29 Nov 2010, 20:26:53 UTC - in response to Message 41185. Last modified: 29 Nov 2010, 20:34:49 UTC Things are not starting well... I changed the project options to restrict to hadam3p (4 applications have that name), set BOINC to suspended with network activity on again, then cancelled those 2 WUs :'-( and then did a project update... I got 2 new hadam3p WUs... they tried to download, and did download a lot of files... then "error downloading file", and the WUs stayed in that error status... so I tried a project reset, then nothing happened anymore (no requests to the server)... so I forced a project update again... and "reached daily quota of 4 tasks" !!! Bloody ***** !! I don't have any more tasks, they were cancelled and then in error !! Actually I can see 2 pending units in my profile, but I swear there is nothing in my BOINC: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=12331802 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=12331801 What can I do to tell it they don't exist anymore ??

old_user92639 Joined: 13 Aug 05 Posts: 54 Credit: 117,227 RAC: 0	Message 41188 - Posted: 29 Nov 2010, 22:28:58 UTC CPDN Monitor - Abort request from BOINC... CPDN Moniteur - demande d'arrêt de BOINC ... -197 error --------------------- "what can i do to tell it they don't exist anymore ??" no, you're alone :)