Smaller Work Units

Author	Message
old_user692374 Send message Joined: 5 Jan 13 Posts: 2 Credit: 32,547 RAC: 0	Message 45478 - Posted: 18 Jan 2013, 23:31:59 UTC I wish the work units was smaller. I run 8 different projects that I like to distribute work between evenly, but with the work unit I got from climateprediction estimated to take 600 hours, it seems that my Boinc wont be accepting any work from my other projects for a couple months, which I dont find acceptable. Instead of having my work unit calculating 40 years, why not set each work unit doing like 5 year intervals? ID: 45478 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 45481 - Posted: 19 Jan 2013, 5:42:53 UTC - in response to Message 45478. The workunit that you're describing is the hadcm3 Coupled Ocean model. It's already been cut back from what it was years ago, and is unlikely to be made any smaller. There is already a smaller model, the hadam3 regional model, aka Weather At Home model, which runs for about 70-80 hours. i.e. 1 year models. However, as has been mentioned more than once, the work here is based on the work being done at climate research centres around the world, and it's up to them when to provide the data for more models. And then it's only in batches of a few thousand at a time. As for Boinc wont be accepting any work from my other projects for a couple months it will. But it'll take a few weeks the first time it encounters these long models to "learn" about them. Then it'll go back to 'round robin' running, because BOINC version 7 works in a completely different way to version 6. Backups: Here ID: 45481 · Reply Quote

old_user692374 Send message Joined: 5 Jan 13 Posts: 2 Credit: 32,547 RAC: 0	Message 45482 - Posted: 19 Jan 2013, 6:12:13 UTC - in response to Message 45481. Glad to hear most are smaller and that I will get the other projects in a couple weeks. I understand you get a lot of data for this project, but I would imagine you could get results in smaller work units that instead of one person taking 3 months, for an example, break the work unit into 12 smaller work units, and if each work unit went to a separate computer, you could get the results in about a week instead of 3 months. ID: 45482 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 45483 - Posted: 19 Jan 2013, 8:02:51 UTC - in response to Message 45482. Last modified: 19 Jan 2013, 8:05:01 UTC And the science would be worthless. Been there, discussed that. This is climate science. And it won't suit everyone. PS My computers complete these long units in 3 weeks, not 3 months. They're currently running quite happily along side 3 hour WUs from a different project. Backups: Here ID: 45483 · Reply Quote

Iain Inglis Volunteer moderator Send message Joined: 16 Jan 10 Posts: 1081 Credit: 7,002,728 RAC: 4,238	Message 45487 - Posted: 20 Jan 2013, 14:43:25 UTC - in response to Message 45482. Just to add to what Les has already said: The models are a time sequence, so they start at one time with a set of initial data and then 'propagate' that initial state to some final time (one year, two years, forty years later). The final model state is saved and returned to the servers, for use as the initial state of another model (or set of models). So it doesn't actually make sense to split a 40-year run into parts to be run in parallel because the later parts would need an initial set of data from an earlier run but, if run in parallel, the earlier runs have by definition not finished! ID: 45487 · Reply Quote

old_user682835 Send message Joined: 28 Jul 12 Posts: 1 Credit: 8,615 RAC: 0	Message 45511 - Posted: 28 Jan 2013, 2:11:03 UTC I must disagree with both Les and Iain and say that you�re missing the point. We�re not only your computing service providers, but we are also your customers. You�re justifications are OK for the computing side, but they are a poke in the eye to me, your customer. You�re saying that gargantuan work units are the only ones that make sense. I disagree. First, have you analyzed how much wasted effort is going on with units failed or cancelled, etc? The amount of lost computing time can be substantial. Do you even look at the average up time of the recipients PC? My home computer is likely not on long enough to complete 500 hours of processing in a reasonable time. Second, much of the thrill and satisfaction of this shared computing is seeing work completed and getting some vague sense of accomplishment. I�m not getting any of that from you. Just an annoyed feeling. I�m subscribed to four projects on my home computer, two running on my AMD graphics card and two running on my CPU. I was occasionally viewing project status and jobs completed. I was satisfied. Then (Oct, Nov, Dec? it�s been so long) two of your super-big jobs appeared on my system, each requiring over 500 hours of CPU time. Now, my second CPU project is essentially �un-enrolled � by you for the 2-4 months it will take to process these jobs. Right now, one of your jobs is suspended in the hope the other would finish by the deadline. Not a chance, even for that one running alone. What should I do? Should I delete one or both of them now, after a combined 162 hours of run time? I�m not happy. This project is no longer giving me the physic satisfaction I guess I need. I guess these unattainable projects have me in the don�t care mode. Compare ClimatePrediction to my Einstein@Home GPU project. You have prevented me from running other CPU projects. You will not finish by the Feb & Mar deadlines. You have taken away the fun of participating. But, Einstein is not blocking other GPU projects. I�m completing projects* about once per hour. And is satisfying my desire to make progress. You�ve awarded me 8,000 point. Einstein has awarded me 220,000 points. On a scale of 1-10, that�s an 11. * So, it doesn�t matter to me at all that I�ve no insight to the amount of work completed. I�ve not looked. And, I don�t really care about the points, but some of your customers will. ID: 45511 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 45512 - Posted: 28 Jan 2013, 2:31:59 UTC - in response to Message 45511. I'm sorry that you feel that way, but this is climate science, and this project just isn't suitable for everyone. The work being done is run on programs created by the UK's Met Office, for professional modelling on their supercomputers. The results of the models being worked on here are for the use of professional climatologists in various climate institutions. THEY decide what is a statistically valid minimum model length. If the length of time taken for them doesn't suit you, (or anyone else), then your only recourse is to Disconnect (Remove in later versions of BOINC), your computer(s) from the project and concentrate on projects with smaller WUs. As for the statistics of failures here, yes this is looked at from time to time. And, yet again, THERE IS NO DEADLINE FOR RETURN OF THE DATA. Backups: Here ID: 45512 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4345 Credit: 16,523,697 RAC: 5,963	Message 45514 - Posted: 28 Jan 2013, 7:56:16 UTC You�re justifications are OK for the computing side, but they are a poke in the eye to me, your customer. The customers for this project are those who pay Oxford University to get the work done. We the crunchers are not the customers. With regards to the length of time taken to run the current models, (about 800 hours on one of my computers and over 3,000 on the other), the length of these models was announced when the first batch was released and those not wanting to run them advised to exclude them in their model preferences in their accounts. Those who don't look at the boards regularly or new to the project would obviously not be aware of this. I haven't looked at the information on the main website recently to see how clear this is and would suggest that if it isn't clear then it should be. I only run other projects when there is no work available from CPDN so the effect on other projects doesn't bother me though as described in other threads the resource share does average things out fairly if you look over a long enough period. For me the question it comes down to is, "Do you crunch because you believe in the value of the projects you choose or do you crunch in order to see your computer going up in league tables or just for the satisfaction of seeing units complete and being replaced by new ones?" I would guess that the majority of long term cpdn crunchers do it because they believe in the value of the project and are therefore prepared to put up with things that annoy even us from time to time. ID: 45514 · Reply Quote

JIM Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,077,407 RAC: 1,835	Message 45515 - Posted: 28 Jan 2013, 8:10:54 UTC It would seem that your computers or computing habits are not well suited to this project. The answer is to run some other project like Malaria Control or Rosetta@home were the Wu�s can be completed in a few hours each. They have lots of worthwhile projects over at World Community Grid. I like you enjoy finishing WU�s. That is why I like to run a mixture of hadam3p and hadcm3n WU�s when possible. This lets me run the long hadcm3n models and still finish an Hadam3p WU every 5 days or so. It would be nice if they could keep a steadier stream of these models coming (this is the wish list). That said, there is a sense of accomplishment when I complete one of the hadcm3p WU�s. Nursing these temperamental giants through 7 weeks of processing takes some doing, and I feel good when I succeed. ID: 45515 · Reply Quote

astroWX Volunteer moderator Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0	Message 45516 - Posted: 28 Jan 2013, 22:21:05 UTC bk, By way of background, these 40-year HadCM3n tasks are not trivial but are mere "babies" compared to what we used to run. They are a mere quarter of the HadCM3 tasks we used to run (on slower machines!) -- the originals were 160-year versions and we interrupted them frequently to make backups. (Those were days before reliable UPS units and even a minor power glitch on the mains could make a mess of things.) In addition, some of us ran 200-year 'spinups' to create starting conditions so that each task wouldn't have to run its own 'spinup.' (Imagine running 360-year tasks!) Breaking the 160-year tasks into pieces, as a response to whinging on the boards, required investigation of scientific consequences of running the four parts on different machines -- different CPU's floating-point instruction sets, etc. Though not optimal, it was a compromise the scientists could deal with. (I don't know what normalizations they might run to account for the differences.) One SERIOUS consequence is the huge increase in the servers' loads -- to handle up/download of the additional restart dumps. That overload contributed to recurring space problems, not helped by severely limited budgets for both servers and staff to manage them. Hope that helps. (By the way, in case you haven't guessed, I prefer the longer tasks; all 160 years on a single box is best for the scientific results.) "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. ID: 45516 · Reply Quote

JIM Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,077,407 RAC: 1,835	Message 45518 - Posted: 29 Jan 2013, 19:25:24 UTC I also remember running those 160 year WU�s on my old 1.2 GHz laptop with only 256 MB�s of RAM. That�s right, � GB of RAM. Win XP will run on a surprisingly same amount. This of course was a single core machine so that one WU was all I could run at one time. It progressed at slightly less than 0.5% per day running 24/7. They took 8 to 9 months to complete. I got an awful lot of message about increasing the size of the paging file. ID: 45518 · Reply Quote

mo.v Volunteer moderator Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 45526 - Posted: 31 Jan 2013, 20:31:19 UTC Could I suggest that members who want shorter tasks should go to their account (find it in the blue menu to the left) and then in the climateprediction preferences edit the model types you want. Deselect Hadcm which is the long 40-year model and select the three types of Hadam for the three regions: EU Europe SAF Southern Africa PNW Pacific North-West However, there's less work available now than a few years ago so if you reduce the number of model types you want your computer may spend periods without a model to crunch. Cpdn news ID: 45526 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4345 Credit: 16,523,697 RAC: 5,963	Message 45527 - Posted: 1 Feb 2013, 13:16:35 UTC - in response to Message 45526. But there are currently a couple of thousand eu regional models going, just snagged three for when a cm3n finishes in about 30 hours time. ID: 45527 · Reply Quote

Gene Send message Joined: 24 Apr 08 Posts: 1 Credit: 2,550,034 RAC: 0	Message 46203 - Posted: 13 May 2013, 17:37:32 UTC It is viable to have smaller work units without comprising the integrity of the data. Since the model has to run for a continuous period of time, it cannot be parallelized among many computers simultaneously, but it can be swapped between one computer and another at arbitrarily short prescribed checkpoints as long as all the associated data is passed along. So, for example, with a 40 year run (2020-2060), Ivan can crunch year 2037, after the server gets the data from Bob, who crunched 2036, etc. After all the data is received from those 40 chronological work units it is stitched together and is entered as part of the model ensemble. This has the added benefit of being able to set up short deadlines for the work units, and sending out the work unit again if it isn't crunched in time. This might also save a lot of time if a model spins out of control; rather than running for another 400 hours, the server performs a quick reality check of the parameters before sending out further sequential work units, and if they exceed certain thresholds informs the modelers. Setting up the modeling in this way would increase the server traffic by the factor proportional to the decrease in work unit time (maybe a factor of 100 would be ideal), which might put a strain on the server hardware. This reduction in model length could be very easily implemented. It seems it would go a long way to solving user frustrations with running a model for 800 hours only to have it end in a computation error, or upload error, etc. (I have had a few of those since I have been contributing my computing time for the last 5 years.) I'm a climate scientist at the Lamont-Doherty Earth Observatory, so I don't work with these models directly, but I also understand very well how breaking a model into subcomponents can be done easily. If anyone wants to discuss with me how this is impossible, I'd be glad to chat about how to do it (pm or email me at Lhenry@ldeo.columbia.edu) . ~Cheers ID: 46203 · Reply Quote

Belfry Send message Joined: 19 Apr 08 Posts: 179 Credit: 4,306,992 RAC: 0	Message 46204 - Posted: 13 May 2013, 18:38:50 UTC - in response to Message 46203. It is viable to have smaller work units without comprising the integrity of the data.... Setting up the modeling in this way would increase the server traffic by the factor proportional to the decrease in work unit time (maybe a factor of 100 would be ideal), which might put a strain on the server hardware. Hi Gene, welcome to the forum. I think model integrity and network traffic are less of a concern than the increased time it would take to complete since many pieces would end up with unstable and frequently turned-off machines. Anyway, since the time we both joined hadcm3's have been halved and hadam3's divided by three. And with newer processors turning hadcm3's around in one to two weeks at stock clocks this has become less of an issue for many users. ID: 46204 · Reply Quote

astroWX Volunteer moderator Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0	Message 46207 - Posted: 13 May 2013, 19:21:43 UTC Last modified: 13 May 2013, 19:29:47 UTC Welcome to the boards, Gene. The argument is less about the possibility of fragmenting the models but of managing the consequences of fragmentation. Up- down-loads of restart dumps are not trivial. (You are experienced running the models, so you can do the arithmetic.) 160 iterations of what is now done with four? It boggles the imagination. (Decades ago, when I was one of many mainframe programmers at US Air Force Global Weather Central [as it was then called], we had a saying: If we can talk about it, we can program it, but that doesn't necessarily mean it would be a good idea.) Consideration must be given to the way individuals manage their boinc work. Stacking tasks in a machine's queue would not enhance a model's 140-year completion. This small reply doesn't exhaust the negative issues associated with model fragmentation. This topic won't die. If models were chopped into 160 pieces rather than four, we'd surely get recurring complaints and schemes for paring runs down to where they'd fit into WCG limitations. I hope you stay with the project. [Edited for typo.] "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. ID: 46207 · Reply Quote

mo.v Volunteer moderator Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 46208 - Posted: 13 May 2013, 19:22:19 UTC Yes, all the models we're currently running are already time-sliced. Hadcm has actually been reduced in size by more than half for certain experiments. Each model we get is 40 years but they can be sewn together for various periods depending on what the researchers need for particular experiments. The Hadam3P can also be stitched together for long periods. I'm not sure whether the programmers would want the models broken down into shorter periods. The Lamont-Doherty EO looks an interesting place to work, Gene. With a fascinating website and some meetings and lectures I'd like to go to if I wasn't in the UK. Cpdn news ID: 46208 · Reply Quote

Lockleys Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0	Message 46210 - Posted: 13 May 2013, 21:52:32 UTC The other dimension to this is the faster computers which we all now use. The first climate model I ran (under the umbrella of the BBC project) was a 160 year model and on the pathetic system I had then it took almost a year and a half. Now, several upgrades later, the 40 year models take me about 24 days on my slower system or 14 days on my faster system. (Many crunchers do better than me.) So in the time it would take our scarce technical resource to put in place the processes needed to split the tasks further, Moore's Law will probably have delivered a further performance fillip and the project will have been sped up again. Therefore, we may as well just wait for technological improvements to speed the models up for us. ID: 46210 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 46213 - Posted: 14 May 2013, 0:03:48 UTC Hi Gene. I'll pass a note to the project people about your offer, but you need to understand that they don't do any of the actual climate work. They're just software engineers, tasked with getting batches of models to run stably. The climate physicists who supply the data, (and are also paying for project people's time), are based in various climate centres around the planet. For the long RAPIT/RAPID models, there are several UK universities involved. I found the following saved on my computer: Consortium members: National Oceanography Centre, British Antarctic Survey, Universty of Reading, Univerity of Oxford, Durham University, Met-Office, LSE, Imperial College. Details were in the Experiments section at the front section of the project's web site, which was lost when the server involved had to be taken down when a different part of it running our php board was hacked. We're still attempting to get this back up, which will be after a long overdue update/error correction exercise. For the short, so called regional models: The SAF people are based in South Africa, (Cape Town?), but haven't proved data to run for a few years now. The PNW people are at the University of Oregon, in Oregon USA. The EU people are at a university in either France or Germany, I forget where. So you really need to talk to people in these places about their work. And, as you're a climate physicist, I don't need to tell you the basics, about chaotic systems, etc. But these supercomputer models don't run too well on ALL desktop computers. There are several things that introduce disturbances into models, such as over clocking, (which allows less time for the processor to retrieve accurate results from the FPU), unstable power supplies that are OK for most uses, but Not for the work in this project), And the one of most relevance to your argument: differences in the maths of the FPUs of different brands of processor, and possible even between versions of a given brand. There was some research done a few years ago about this, and it was found that running a model with the same starting data on an Intel processor and on an AMD processor produced results sufficiently different that they were effectively different models. The paper that resulted from this was in the Research papers section at the front of the project web site. So, for the foreseeable future, it's take-it-as-it-is. Backups: Here ID: 46213 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 46214 - Posted: 14 May 2013, 1:54:09 UTC PS Gene The reason why your computer with lots of memory is crashing everything could be answered in this sticky post at the top of the Macintosh section. The other may have crashed it's FAMOUS models because of the problem with the Mac compiler, as was posted somewhere back at the time they were being issued. Backups: Here ID: 46214 · Reply Quote