Smaller Work Units

Questions and Answers : Wish list : Smaller Work Units

old_user692374
Joined: 5 Jan 13
Posts: 2
Credit: 32,547
RAC: 0
Message 45478 - Posted: 18 Jan 2013, 23:31:59 UTC

I wish the work units were smaller. I run 8 different projects that I like to distribute work between evenly, but with the work unit I got from climateprediction estimated to take 600 hours, it seems that my BOINC won't be accepting any work from my other projects for a couple of months, which I don't find acceptable. Instead of having one work unit calculate 40 years, why not have each work unit cover something like a 5-year interval?

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 6408
Credit: 16,839,542
RAC: 21,887
Message 45481 - Posted: 19 Jan 2013, 5:42:53 UTC - in response to Message 45478.

The workunit that you're describing is the hadcm3 Coupled Ocean model.
It's already been cut back from what it was years ago, and is unlikely to be made any smaller.

There is already a smaller model, the hadam3 regional model (aka the Weather At Home model), which runs for about 70-80 hours, i.e. 1-year models.

However, as has been mentioned more than once, the work here is based on the work being done at climate research centres around the world, and it's up to them when to provide the data for more models. And then it's only in batches of a few thousand at a time.

As for:

"BOINC won't be accepting any work from my other projects for a couple of months"

it will. But the first time it encounters these long models, it'll take a few weeks to "learn" about them. Then it'll go back to 'round robin' running, because BOINC version 7 works in a completely different way to version 6.


____________
Backups: Here

old_user692374
Joined: 5 Jan 13
Posts: 2
Credit: 32,547
RAC: 0
Message 45482 - Posted: 19 Jan 2013, 6:12:13 UTC - in response to Message 45481.

Glad to hear most are smaller and that I will get the other projects in a couple of weeks.

I understand you get a lot of data for this project, but I would imagine you could get results in smaller work units. Instead of one person taking 3 months, for example, break the work unit into 12 smaller work units; if each one went to a separate computer, you could get the results in about a week instead of 3 months.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 6408
Credit: 16,839,542
RAC: 21,887
Message 45483 - Posted: 19 Jan 2013, 8:02:51 UTC - in response to Message 45482.
Last modified: 19 Jan 2013, 8:05:01 UTC

And the science would be worthless.
Been there, discussed that.

This is climate science. And it won't suit everyone.

PS
My computers complete these long units in 3 weeks, not 3 months.
They're currently running quite happily alongside 3-hour WUs from a different project.
____________
Backups: Here

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 877
Credit: 100,083
RAC: 3,242
Message 45487 - Posted: 20 Jan 2013, 14:43:25 UTC - in response to Message 45482.

Just to add to what Les has already said:

The models are a time sequence, so they start at one time with a set of initial data and then 'propagate' that initial state to some final time (one year, two years, forty years later). The final model state is saved and returned to the servers, for use as the initial state of another model (or set of models).

So it doesn't actually make sense to split a 40-year run into parts to be run in parallel: each later part needs its initial data from an earlier part, and if everything ran in parallel, the earlier parts would, by definition, not have finished!
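
In pseudo-Python the dependency looks something like this, where step() is just a stand-in for one model-year of physics (not the project's real code):

    # Each year's calculation consumes the state produced by the year before,
    # so the loop is inherently serial and can't be split across machines.
    def step(state):
        # placeholder for a full year of atmosphere/ocean time-stepping
        return [x * 1.001 for x in state]

    state = [1.0, 2.0, 3.0]   # initial conditions supplied by the server
    for year in range(40):    # a 40-year hadcm3-style run
        state = step(state)   # year N+1 cannot start until year N is done
    # only now is 'state' available to seed a follow-on model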

old_user682835
Joined: 28 Jul 12
Posts: 1
Credit: 8,615
RAC: 0
Message 45511 - Posted: 28 Jan 2013, 2:11:03 UTC

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 6408
Credit: 16,839,542
RAC: 21,887
Message 45512 - Posted: 28 Jan 2013, 2:31:59 UTC - in response to Message 45511.

I'm sorry that you feel that way, but this is climate science, and this project just isn't suitable for everyone.

The work being done is run on programs created by the UK's Met Office, for professional modelling on their supercomputers.
The results of the models being worked on here are for the use of professional climatologists in various climate institutions. THEY decide what is a statistically valid minimum model length.

If the length of time they take doesn't suit you (or anyone else), then your only recourse is to Disconnect (Remove, in later versions of BOINC) your computer(s) from the project and concentrate on projects with smaller WUs.

As for the statistics of failures here: yes, this is looked at from time to time.

And, yet again, THERE IS NO DEADLINE FOR RETURN OF THE DATA.


____________
Backups: Here

Dave Jackson
Joined: 15 May 09
Posts: 1783
Credit: 2,671,578
RAC: 898
Message 45514 - Posted: 28 Jan 2013, 7:56:16 UTC

JIM
Joined: 31 Dec 07
Posts: 982
Credit: 14,320,108
RAC: 19,627
Message 45515 - Posted: 28 Jan 2013, 8:10:54 UTC

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1459
Credit: 76,183,576
RAC: 71,704
Message 45516 - Posted: 28 Jan 2013, 22:21:05 UTC

bk,

By way of background, these 40-year HadCM3n tasks are not trivial, but they are mere "babies" compared with what we used to run: a quarter the length of the original 160-year HadCM3 tasks, which we ran on slower machines and interrupted frequently to make backups. (Those were the days before reliable UPS units, when even a minor power glitch on the mains could make a mess of things.) In addition, some of us ran 200-year 'spinups' to create starting conditions so that each task wouldn't have to run its own 'spinup'. (Imagine running 360-year tasks!)

Breaking the 160-year tasks into pieces, as a response to whinging on the boards, required investigation of the scientific consequences of running the four parts on different machines -- different CPUs' floating-point instruction sets, etc. Though not optimal, it was a compromise the scientists could deal with. (I don't know what normalizations they might run to account for the differences.)

One SERIOUS consequence is the huge increase in the servers' loads -- to handle up/download of the additional restart dumps. That overload contributed to recurring space problems, not helped by severely limited budgets for both servers and staff to manage them.

Hope that helps. (By the way, in case you haven't guessed, I prefer the longer tasks; all 160 years on a single box is best for the scientific results.)

____________
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.

JIM
Joined: 31 Dec 07
Posts: 982
Credit: 14,320,108
RAC: 19,627
Message 45518 - Posted: 29 Jan 2013, 19:25:24 UTC

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 10,773,446
RAC: 2,347
Message 45526 - Posted: 31 Jan 2013, 20:31:19 UTC

Could I suggest that members who want shorter tasks go to their account (find it in the blue menu to the left) and then, in the climateprediction preferences, edit the model types you want. Deselect Hadcm, which is the long 40-year model, and select the three types of Hadam for the three regions:

EU Europe
SAF Southern Africa
PNW Pacific North-West

However, there's less work available now than a few years ago, so if you reduce the number of model types you accept, your computer may spend periods without a model to crunch.
____________
Cpdn news

Dave Jackson
Joined: 15 May 09
Posts: 1783
Credit: 2,671,578
RAC: 898
Message 45527 - Posted: 1 Feb 2013, 13:16:35 UTC - in response to Message 45526.

But there are currently a couple of thousand EU regional models going; I just snagged three for when a cm3n finishes in about 30 hours' time.

Gene
Joined: 24 Apr 08
Posts: 1
Credit: 2,529,811
RAC: 0
Message 46203 - Posted: 13 May 2013, 17:37:32 UTC

It is viable to have smaller work units without compromising the integrity of the data.

Since the model has to run for a continuous period of time, it cannot be parallelized among many computers simultaneously, but it can be swapped between one computer and another at arbitrarily short prescribed checkpoints as long as all the associated data is passed along.

So, for example, with a 40-year run (2020-2060), Ivan can crunch year 2037 after the server gets the data from Bob, who crunched 2036, and so on. After all the data is received from those 40 chronological work units, it is stitched together and entered as part of the model ensemble. This has the added benefit of allowing short deadlines for the work units, with a work unit sent out again if it isn't crunched in time.
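
In rough pseudo-Python, the hand-off might look like this (every name below is invented for illustration; this isn't the actual BOINC scheduler API):

    # A 40-year run issued as 40 one-year work units, each seeded with the
    # checkpoint the previous one returned. All names are illustrative only.

    def crunch_one_year(year, state):
        # stands in for a volunteer machine integrating one model year
        return [x * 1.001 for x in state]

    def run_chained_model(initial_state, years=range(2020, 2060)):
        state = initial_state
        for year in years:
            # the server would package (year, state) as a work unit with a
            # short deadline, reissuing it if the volunteer doesn't report back
            state = crunch_one_year(year, state)
        return state   # the stitched 40-year result joins the ensemble

    final_state = run_chained_model([1.0, 2.0, 3.0])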

This might also save a lot of time if a model spins out of control: rather than letting it run for another 400 hours, the server performs a quick reality check of the parameters before sending out further sequential work units, and informs the modelers if they exceed certain thresholds.

Setting up the modeling in this way would increase the server traffic by a factor proportional to the decrease in work unit time (maybe a factor of 100 would be ideal), which might put a strain on the server hardware.

This reduction in model length could be implemented very easily. It seems it would go a long way toward easing user frustration at running a model for 800 hours only to have it end in a computation error, upload error, etc. (I have had a few of those in the 5 years I have been contributing my computing time.)

I'm a climate scientist at the Lamont-Doherty Earth Observatory. I don't work with these models directly, but I understand very well how breaking a model into subcomponents can be done. If anyone wants to discuss why this is impossible, I'd be glad to chat about how to do it (PM or email me at Lhenry@ldeo.columbia.edu).


~Cheers



Belfry
Joined: 19 Apr 08
Posts: 179
Credit: 4,306,992
RAC: 0
Message 46204 - Posted: 13 May 2013, 18:38:50 UTC - in response to Message 46203.

It is viable to have smaller work units without compromising the integrity of the data.... Setting up the modeling in this way would increase the server traffic by a factor proportional to the decrease in work unit time (maybe a factor of 100 would be ideal), which might put a strain on the server hardware.



Hi Gene, welcome to the forum. I think model integrity and network traffic are less of a concern than the increased completion time, since many pieces would end up on unstable or frequently turned-off machines. Anyway, since the time we both joined, hadcm3s have been halved and hadam3s divided by three. And with newer processors turning hadcm3s around in one to two weeks at stock clocks, this has become less of an issue for many users.

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1459
Credit: 76,183,576
RAC: 71,704
Message 46207 - Posted: 13 May 2013, 19:21:43 UTC
Last modified: 13 May 2013, 19:29:47 UTC

Welcome to the boards, Gene.

The argument is less about the possibility of fragmenting the models than about managing the consequences of fragmentation. Up/downloads of restart dumps are not trivial. (You are experienced at running the models, so you can do the arithmetic.) 160 iterations of what is now done with four? It boggles the imagination. (Decades ago, when I was one of many mainframe programmers at US Air Force Global Weather Central [as it was then called], we had a saying: if we can talk about it, we can program it, but that doesn't necessarily mean it would be a good idea.)
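
For a back-of-envelope sense of the load (both numbers below are assumed purely for illustration; the real dump size depends on the model):

    # Rough server-traffic arithmetic with ASSUMED, illustrative numbers.
    dump_mb = 50          # assumed size of one restart dump, in MB
    hosts = 30_000        # assumed number of active volunteer machines

    for pieces in (4, 160):            # today's split vs 1-year fragments
        transfers = pieces * 2         # each piece is one download plus one upload
        total_tb = transfers * dump_mb * hosts / 1024 / 1024
        print(f"{pieces:>3} pieces: ~{total_tb:,.1f} TB moved across the fleet")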

Consideration must also be given to the way individuals manage their BOINC work: stacking tasks in a machine's queue would not help a model's 140-year completion.

This small reply doesn't exhaust the negative issues associated with model fragmentation.

This topic won't die. If models were chopped into 160 pieces rather than four, we'd surely get recurring complaints and schemes for paring runs down to where they'd fit into WCG limitations.

I hope you stay with the project.


[Edited for typo.]
____________
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 10,773,446
RAC: 2,347
Message 46208 - Posted: 13 May 2013, 19:22:19 UTC

Yes, all the models we're currently running are already time-sliced. Hadcm has actually been reduced in size by more than half for certain experiments. Each model we get is 40 years, but they can be sewn together over various periods, depending on what the researchers need for particular experiments. The Hadam3P can also be stitched together over long periods. I'm not sure whether the programmers would want the models broken down into shorter periods.

The Lamont-Doherty EO looks like an interesting place to work, Gene, with a fascinating website and some meetings and lectures I'd like to go to if I weren't in the UK.
____________
Cpdn news

Lockleys
Joined: 13 Jan 07
Posts: 164
Credit: 7,624,218
RAC: 7,993
Message 46210 - Posted: 13 May 2013, 21:52:32 UTC

The other dimension to this is the faster computers we all now use. The first climate model I ran (under the umbrella of the BBC project) was a 160-year model, and on the pathetic system I had then it took almost a year and a half. Now, several upgrades later, the 40-year models take me about 24 days on my slower system or 14 days on my faster one. (Many crunchers do better than me.) So in the time it would take our scarce technical resources to put in place the processes needed to split the tasks further, Moore's Law will probably have delivered another performance fillip and the project will have been sped up again. We may as well just wait for technological improvements to speed the models up for us.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 6408
Credit: 16,839,542
RAC: 21,887
Message 46213 - Posted: 14 May 2013, 0:03:48 UTC

Hi Gene.

I'll pass a note to the project people about your offer, but you need to understand that they don't do any of the actual climate work; they're software engineers, tasked with getting batches of models to run stably.
The climate physicists who supply the data (and who also pay for the project people's time) are based in various climate centres around the planet.

For the long RAPIT/RAPID models, there are several UK universities involved.
I found the following saved on my computer:

Consortium members: National Oceanography Centre, British Antarctic Survey, University of Reading, University of Oxford, Durham University, Met Office, LSE, Imperial College.

Details were in the Experiments section at the front of the project's web site, which was lost when the server involved had to be taken down after a different part of it, running our PHP board, was hacked. We're still attempting to get this back up, which will happen after a long-overdue update and error-correction exercise.

For the short, so called regional models:
The SAF people are based in South Africa (Cape Town?), but haven't provided data to run for a few years now.
The PNW people are at the University of Oregon, in Oregon, USA.
The EU people are at a university in either France or Germany, I forget where.

So you really need to talk to people in these places about their work.

And, as you're a climate physicist, I don't need to tell you the basics about chaotic systems, etc.
But these supercomputer models don't run well on ALL desktop computers. Several things introduce disturbances into models: overclocking (which allows the processor less time to retrieve accurate results from the FPU); power supplies that are stable enough for most uses, but not for the work in this project; and, of most relevance to your argument, differences in the maths of the FPUs of different brands of processor, and possibly even between versions of a given brand.

There was some research done a few years ago about this, and it was found that running a model with the same starting data on an Intel processor and on an AMD processor produced results sufficiently different that they were effectively different models.
The paper that resulted from this was in the Research papers section at the front of the project web site.
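
The effect is easy to reproduce with any chaotic system: a perturbation at the scale of FPU rounding grows until the two runs are unrelated. A toy illustration (obviously not the Met Office code):

    # Two runs of a chaotic map differing by one part in 10^15 -- roughly
    # the scale of rounding differences between FPU implementations.
    a, b = 0.5, 0.5 + 1e-15
    for _ in range(80):
        a = 3.9 * a * (1.0 - a)   # logistic map in its chaotic regime
        b = 3.9 * b * (1.0 - b)
    print(abs(a - b))   # now of the order of the attractor itself: effectively two different runs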

So, for the foreseeable future, it's take-it-as-it-is.



____________
Backups: Here

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 6408
Credit: 16,839,542
RAC: 21,887
Message 46214 - Posted: 14 May 2013, 1:54:09 UTC

PS Gene

The reason your computer with lots of memory is crashing everything may be answered in the sticky post at the top of the Macintosh section.
The other machine may have crashed its FAMOUS models because of the problem with the Mac compiler, as was posted somewhere back at the time they were being issued.


____________
Backups: Here
