Last 11 FAMOUS "Error while Computing" - BORING
Author | Message |
---|---|
Send message Joined: 20 Feb 06 Posts: 158 Credit: 1,251,176 RAC: 0 |
It is soul destroying that the only tasks available on CPDN are nowadays 100% "Error while Computing". Unless something changes, I shall seriously consider giving up CPDN as a waste of time, I regret to say. I hope there is a glimmer of light at the end of the tunnel. This should be a really worthwhile cause, but seems to be going nowhere at present, and I do not mean that this is because of the recent change of staff, as it has been like this since last August but continues to get worse, having now reached the 100% failure rate after originally starting with about 33% successful completions. Keith |
Send message Joined: 31 Aug 04 Posts: 42 Credit: 547,031 RAC: 0 |
I can only speak for myself, but I have at least an 80% success rate. Perhaps you've just been unlucky or maybe you need to adjust your settings, but you certainly won't improve the situation by giving up... |
Send message Joined: 7 Aug 04 Posts: 2173 Credit: 64,758,769 RAC: 3,155 |
Since mid December, I've completed about 90% of these tasks in Linux. Unfortunately for Keith, the improvement in model stability for FAMOUS (the only tasks that Macs can currently run) has not bled over onto the Apple platform. With the Intel compiler used for the Mac, they can't force certain optimizations to OFF and FAMOUS doesn't like that. It is running too fast and certain routines within the model don't like the full optimization. We are testing hadcm3 for all platforms, and hadam3p regional models for Linux and Mac on the Beta site. But without dedicated programmers, and with the holiday at Oxford, I am unsure of the status of these. They are not ready to go to the main site yet. |
Send message Joined: 15 May 09 Posts: 4472 Credit: 18,448,326 RAC: 22,385 |
Don't give up, Keith. I have gone through phases like that, but I understand from other posts on this topic that it is by seeing which work units end up with errors and which don't that the model gets refined, so eventually a greater percentage should complete. I am assuming that you are getting errors such as negative pressure value, and not a problem with your computer, which may be due to overclocking etc. I am also told that the data these units do produce prior to crashing is of use. Dave |
Send message Joined: 20 Feb 06 Posts: 158 Credit: 1,251,176 RAC: 0 |
Dave & Programmers/CPDN staff I feel sure something is happening to bin the zip files. Here is a listing of Messages during the last day of my Mac suffering yet another (12th) calculating error. Always the zip files are absent, so something must be deleting them. ------------------------------------------------------------------- Fri 7 Jan 23:00:00 2011 Resuming network activity Fri 7 Jan 23:00:00 2011 climateprediction.net Sending scheduler request: To send trickle-up message. Fri 7 Jan 23:00:00 2011 climateprediction.net Not reporting or requesting tasks Fri 7 Jan 23:00:01 2011 climateprediction.net Started upload of famous_xrzv_1699_200_007098114_1_15.zip Fri 7 Jan 23:00:01 2011 climateprediction.net Started upload of famous_xd0z_899_200_007078714_1_2.zip Fri 7 Jan 23:00:06 2011 climateprediction.net Scheduler request completed Fri 7 Jan 23:01:39 2011 climateprediction.net Finished upload of famous_xd0z_899_200_007078714_1_2.zip Fri 7 Jan 23:01:45 2011 climateprediction.net Finished upload of famous_xrzv_1699_200_007098114_1_15.zip Fri 7 Jan 23:18:47 2011 climateprediction.net Started upload of famous_xd0z_899_200_007078714_1_3.zip Fri 7 Jan 23:18:49 2011 climateprediction.net Sending scheduler request: To send trickle-up message. Fri 7 Jan 23:18:49 2011 climateprediction.net Not reporting or requesting tasks Fri 7 Jan 23:19:33 2011 climateprediction.net Scheduler request failed: HTTP gateway timeout Fri 7 Jan 23:19:43 2011 climateprediction.net Finished upload of famous_xd0z_899_200_007078714_1_3.zip Fri 7 Jan 23:20:33 2011 climateprediction.net Sending scheduler request: To send trickle-up message. Fri 7 Jan 23:20:33 2011 climateprediction.net Not reporting or requesting tasks Fri 7 Jan 23:20:37 2011 climateprediction.net Scheduler request completed Fri 7 Jan 23:39:33 2011 climateprediction.net Sending scheduler request: To send trickle-up message. 
Fri 7 Jan 23:39:33 2011 climateprediction.net Not reporting or requesting tasks Fri 7 Jan 23:39:36 2011 climateprediction.net Scheduler request completed Sat 8 Jan 00:17:04 2011 climateprediction.net Sending scheduler request: To send trickle-up message. Sat 8 Jan 00:17:04 2011 climateprediction.net Not reporting or requesting tasks Sat 8 Jan 00:17:07 2011 climateprediction.net Scheduler request completed Sat 8 Jan 00:39:40 2011 climateprediction.net Started upload of famous_xrzv_1699_200_007098114_1_16.zip Sat 8 Jan 00:39:40 2011 climateprediction.net Sending scheduler request: To send trickle-up message. Sat 8 Jan 00:39:40 2011 climateprediction.net Not reporting or requesting tasks Sat 8 Jan 00:39:45 2011 climateprediction.net Scheduler request completed Sat 8 Jan 00:40:35 2011 climateprediction.net Finished upload of famous_xrzv_1699_200_007098114_1_16.zip Sat 8 Jan 01:00:00 2011 Suspending network activity - time of day Sat 8 Jan 22:11:09 2011 climateprediction.net Computation for task famous_xrzv_1699_200_007098114_1 finished Sat 8 Jan 22:11:09 2011 climateprediction.net Output file famous_xrzv_1699_200_007098114_1_19.zip for task famous_xrzv_1699_200_007098114_1 absent Sat 8 Jan 22:11:09 2011 climateprediction.net Output file famous_xrzv_1699_200_007098114_1_20.zip for task famous_xrzv_1699_200_007098114_1 absent Sat 8 Jan 22:11:09 2011 climateprediction.net Starting famous_xazz_2099_200_007076086_1 Sat 8 Jan 22:11:09 2011 climateprediction.net Starting task famous_xazz_2099_200_007076086_1 using famous version 611 Sat 8 Jan 23:00:00 2011 Resuming network activity Sat 8 Jan 23:00:01 2011 climateprediction.net Sending scheduler request: To send trickle-up message. 
Sat 8 Jan 23:00:01 2011 climateprediction.net Reporting 1 completed tasks, not requesting new tasks Sat 8 Jan 23:00:01 2011 climateprediction.net Started upload of famous_xrzv_1699_200_007098114_1_17.zip Sat 8 Jan 23:00:01 2011 climateprediction.net Started upload of famous_xrzv_1699_200_007098114_1_18.zip Sat 8 Jan 23:00:08 2011 climateprediction.net Scheduler request completed Sat 8 Jan 23:00:09 2011 climateprediction.net Started upload of famous_xd0z_899_200_007078714_1_4.zip Sat 8 Jan 23:00:09 2011 climateprediction.net Started upload of famous_xd0z_899_200_007078714_1_5.zip Sat 8 Jan 23:01:39 2011 climateprediction.net Finished upload of famous_xd0z_899_200_007078714_1_4.zip Sat 8 Jan 23:01:54 2011 climateprediction.net Finished upload of famous_xd0z_899_200_007078714_1_5.zip Sat 8 Jan 23:09:31 2011 climateprediction.net Sending scheduler request: To send trickle-up message. Sat 8 Jan 23:09:31 2011 climateprediction.net Not reporting or requesting tasks Sat 8 Jan 23:09:34 2011 climateprediction.net Scheduler request completed --------------------------------------------------------------------------- You will see that uploads of 15.zip and 16.zip were successful. Then 19.zip and 20.zip were reported absent. And finally Start of upload of 17.zip and 18.zip was reported. But no report of their absence, nor of their Upload being finished. There is something very illogical here and may be an indication where the programming has gone astray???? Keith P.S. I may not have seen any others, but I think all these failures end with THETA messages:- ---------------------------------------------------- Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy Suspended CPDN Monitor - Suspend request from BOINC... 
Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy Sorry, too many model crashes! :-( (15896): called boinc_finish -------------------------------------------------------------- Keith |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Nothing wrong. Sat 8 Jan 22:11:09 2011 climateprediction.net Computation for task famous_xrzv_1699_200_007098114_1 finished "finished" means either completed the full run, or crashed before that. The wording is that chosen by the BOINC programmers to reflect the usage across the greatest number of projects. Sat 8 Jan 22:11:09 2011 climateprediction.net Output file famous_xrzv_1699_200_007098114_1_19.zip for task famous_xrzv_1699_200_007098114_1 absent This means that the model crashed after creating zip file 18, and before creating zip file 19. Therefore there IS no zip file 19, so it CAN'T be sent. *************** You will see that uploads of 15.zip and 16.zip were successful. When a model fails, BOINC tries to upload the data as specified in its client_state.xml file, (aka its "to-do" list), to the specified upload servers. However, there usually isn't time to finish sending the biggish zips before BOINC finishes running through its "to-do" list. The last item on the list says, in effect, "That's it for this work unit. Erase all references to that WU, and get on with the next item on the schedule." So it's quite possible to see a zip file start to upload, and then for it (perhaps several zips), to just disappear. Again, a "BOINC" thing. Backups: Here |
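For anyone who wants to check their own message log along the lines described above, here is a minimal Python sketch. It assumes the log wording matches the lines quoted in this thread ("Started upload of", "Finished upload of", "Output file ... absent"); other BOINC versions may phrase these differently. The function name `summarize_uploads` and the sample text are purely illustrative and are not part of BOINC or CPDN.

```python
import re

# Patterns based only on the message wording quoted in this thread (assumption:
# your BOINC client uses the same phrasing).
STARTED = re.compile(r"Started upload of (\S+\.zip)")
FINISHED = re.compile(r"Finished upload of (\S+\.zip)")
ABSENT = re.compile(r"Output file (\S+\.zip) for task \S+ absent")


def summarize_uploads(log_text):
    """Group zip files by what the log says happened to them (hypothetical helper)."""
    started = set(STARTED.findall(log_text))
    finished = set(FINISHED.findall(log_text))
    absent = set(ABSENT.findall(log_text))
    return {
        # Uploads that the log reports as completed.
        "finished": sorted(finished),
        # Zips never created because the model stopped first.
        "absent": sorted(absent),
        # Uploads that started but have no matching "Finished" line.
        "unfinished": sorted(started - finished),
    }


# A few lines taken from the log posted earlier in the thread.
sample = """\
Sat 8 Jan 00:39:40 2011 climateprediction.net Started upload of famous_xrzv_1699_200_007098114_1_16.zip
Sat 8 Jan 00:40:35 2011 climateprediction.net Finished upload of famous_xrzv_1699_200_007098114_1_16.zip
Sat 8 Jan 22:11:09 2011 climateprediction.net Output file famous_xrzv_1699_200_007098114_1_19.zip for task famous_xrzv_1699_200_007098114_1 absent
Sat 8 Jan 23:00:01 2011 climateprediction.net Started upload of famous_xrzv_1699_200_007098114_1_17.zip
"""

if __name__ == "__main__":
    for group, names in summarize_uploads(sample).items():
        print(group, names)
```

On the sample, zip 16 lands in "finished", zip 19 in "absent", and zip 17 in "unfinished", which is the pattern Keith noticed: an upload that starts and then vanishes from the log without any further report.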
Send message Joined: 20 Feb 06 Posts: 158 Credit: 1,251,176 RAC: 0 |
Thanks for explanation, Les. But I am no happier with this demoralizing situation. Another "Computation error" today at 5:30 pm on zip 8. Have had no completed FAMOUS for over a month now - 100% failures. Last 2 successes were on 4th and 5th December. Obviously it was a mistake sending to my MAC a HADAM3P European task, because that completed successfully on November 28th!!! It seems that our checking of the FAMOUS tasks to find errors and improve them is failing. It seems we are searching for an elusive task that works among many useless ones. Surely some must be getting "on target" even if they do not get the "bullseye"!! It seems we are trying to get a working application rather like the example quoted of putting chimpanzees typing randomly on typewriters for an infinite time, so that eventually one will type the complete works of William Shakespeare!!!! Keith |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,643,694 RAC: 4,803 |
It seems we are trying to get a working application rather like the example quoted of putting chimpanzees typing randomly on typewriters for an infinite time, so that eventually one will type the complete works of William Shakespeare!!!! For ordinary people like us, there is an unavoidable difficulty of perception with the FAMOUS/Millennium models: how many climate models would you expect to run for 1200 years and produce climates that are physically viable? And how would you go about working out that the completion rate should be 1% or 10% or 100%? I haven't a clue and, to be frank, the surprising thing to me is that any get through at all. It seems that the current batch aren't very successful and the lack of model success is known only because large numbers have been run by CPDN participants and not succeeded. That's a success for the participants even though it's not for the models. If the failure rate were to be caused by some kind of programming error then, for sure, participants have every right to be upset: the CPDN main site is not a test facility for finding bugs - there's a CPDN beta site for that kind of debugging and people there expect to have some 'avoidable' failures. But that isn't what's happening here on the main CPDN site, as far as I can see: each FAMOUS batch represents a different category and range of model settings - and some will work and others not. Though CPDN has 'prediction' in the title and people might naturally expect the prediction process to be pretty reliable at least over 'short' times, the project's broader aim is to help improve the modelling of climate - and there's no reason, in my opinion, to suppose that all or indeed any new type of model will finish. |
Send message Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
Seti@home has not found any alien signal so far and Einstein@home has not found any gravitational wave, although it found one pulsar (and perhaps another). Yet people are crunching happily and, in the case of SETI, even donating money to buy two new servers, Oscar and Jocelyn, and a third one, Synergy, is arriving. So people are happy to search, even if they don't find anything. Tullio |
Send message Joined: 12 Feb 08 Posts: 66 Credit: 4,877,652 RAC: 0 |
Are there going to be more FAMOUS tasks? The server status page shows only 211 WUs left, and it's the only model type available for Linux and Mac. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Yes, more on the way. |
Send message Joined: 20 Feb 06 Posts: 158 Credit: 1,251,176 RAC: 0 |
Les, any chance they will be of a different variety that might possibly occasionally complete like they used to? I may have been sent 2 of the new batch, as I see the "stock" of FAMOUS tasks has been replenished. I have had 2 more "Calculation errors" today, bringing my 100% successive FAMOUS failure rate since beginning of December to 15 tasks. All that I have inspected have ended with "Invalid Theta" messages:- ---------------------------------------------- ......... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy Sorry, too many model crashes! :-( (3678): called boinc_finish ------------------------------------------------- Keith |
Send message Joined: 20 Feb 06 Posts: 158 Credit: 1,251,176 RAC: 0 |
All of the last batch of FAMOUS tasks have gone already. Is everyone (as well as Mac & Linux) having early Computation Errors? Will there be more tasks? Keith P.S. I have noticed the servers have been down frequently in the mornings (GMT). Is work being done there? It would be interesting to know what progress is being made, Keith |
Send message Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0 |
Keith, on your P.S., it seems that the reason for the server being down is the daily backup of the database. Apparently the database has grown quite a lot in the last year. Famous produces a greater volume of trickles than older model types; that might be the cause of the growth. It's going to be a few months before the situation gets better since the project has lost its sysadmin and programmers, and the replacements haven't started yet. On the errors, I ran a couple of X-series Famouses successfully on a windows virtual machine running on a linux box, and another linux box has completed an X- and a W-series, with two more in progress. No failed work units so far; maybe I'm just lucky at the moment. On the lack of Famous work, I think what has happened is this. In the past, each model was sent out to five or six computers, but the X and W series are only being sent out to one computer (and only to another if the earlier computer fails). So, suddenly we're getting through the work queue six times as fast. I don't think the researchers allowed for this. With both programmers having left the project the researchers can't create new work units quickly. There are plenty of regional models left for Windows computers, but unfortunately Mac and Linux boxes are having a mandatory vacation, probably for a few months :(. I hope it's less. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Keith, the new batch of FAMOUS may be slightly different to the last, but the program for Mac computers is still the same as before: Can't turn off SSE2 usage. The sponsorship for the Oxford part of the Millennium project, (the FAMOUS climate models), expired at the end of December, and the project person, Hiro, has moved back to his previous work at another university in Northern England. The FAMOUS project, i.e. the models and data collection, will now only be part time for him. He has said privately that even he has a low success rate on his Mac, and also that it will be several months before it's possible to attempt another build of the program. To be quite blunt about it, you have two choices: 1) Move to another project for a few months, until the other model types are out of beta testing. 2) Buy a Windows computer. |
Send message Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370 |
Of the last 5 tasks downloaded (across 2 computers), only 2 failed and pretty early in the computation. The daily backup seems to take several hours, so I just set my network available times around this late-night window (in the compute preferences). IIRC, the backup goes from 23:00 to 03:00 GMT. Better if one of the admins can confirm this. |
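A small sketch of the time-window check behind that kind of setting, for anyone scripting around it. The 23:00 to 03:00 GMT window is only the unconfirmed recollection from the post above, and `in_window`, `BACKUP_START` and `BACKUP_END` are hypothetical names, not BOINC settings. The one subtlety is that the window wraps past midnight, so a plain "start <= now < end" test is not enough.

```python
from datetime import time

# Assumed backup window (GMT), taken from the unconfirmed figure in the post above.
BACKUP_START = time(23, 0)
BACKUP_END = time(3, 0)


def in_window(now, start, end):
    """Return True if `now` lies in [start, end), allowing the window to wrap past midnight."""
    if start <= end:
        # Ordinary same-day window, e.g. 09:00 to 17:00.
        return start <= now < end
    # Wrapping window, e.g. 23:00 to 03:00: late evening OR early morning.
    return now >= start or now < end


if __name__ == "__main__":
    for hour, minute in [(23, 30), (2, 59), (3, 0), (12, 0)]:
        t = time(hour, minute)
        print(t, "in backup window:", in_window(t, BACKUP_START, BACKUP_END))
```

If the window is right, network activity scheduled inside it would be the likeliest to hit scheduler timeouts, so the allowed network hours would be set to whatever falls outside it.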
Send message Joined: 20 Feb 06 Posts: 158 Credit: 1,251,176 RAC: 0 |
Les, thanks for telling me the Mac is no good with FAMOUS and to change to Windows!!! But to be serious, will the SSE2 problem still be a problem when using a virtual Windows XP under Parallels DeskTop? If so, I shall quit using OSX 10.5.8, and use Windows XP on Parallels DeskTop for Mac. At the very least I will be able to compare the relative performance. As a matter of interest, why should the SSE2 cause a problem? I have no idea what it is, I am afraid to say. All I have found which seems vaguely understandable is:- The key benefits of SSE2 are that MMX instructions can work on 128-bit data blocks, and that SSE instructions now support 64-bit floating-point values. ... Keith |
Send message Joined: 20 Feb 06 Posts: 158 Credit: 1,251,176 RAC: 0 |
Now 17 successive Computation errors. And is it the end for my crunching? ........... ================================================ Thu 13 Jan 23:04:07 2011 climateprediction.net Message from server: No work is available for HadCM3 Coupled Model Experiment Optimised File I/O Thu 13 Jan 23:04:07 2011 climateprediction.net Message from server: No work is available for UK Met Office FAMOUS Thu 13 Jan 23:04:07 2011 climateprediction.net Message from server: UK Met Office HADAM3P European Region is not available for your type of computer. Thu 13 Jan 23:04:07 2011 climateprediction.net Message from server: UK Met Office HADAM3P Southern Africa is not available for your type of computer. Thu 13 Jan 23:04:07 2011 climateprediction.net Message from server: UK Met Office HADAM3P Pacific North West is not available for your type of computer. Thu 13 Jan 23:04:07 2011 climateprediction.net Message from server: No work available for the applications you have selected. Please check your settings on the web site. Thu 13 Jan 23:04:31 2011 climateprediction.net update requested by user Thu 13 Jan 23:04:32 2011 climateprediction.net Sending scheduler request: Requested by user. Thu 13 Jan 23:04:32 2011 climateprediction.net Requesting new tasks Thu 13 Jan 23:04:33 2011 climateprediction.net Scheduler request completed: got 0 new tasks Thu 13 Jan 23:04:33 2011 climateprediction.net Message from server: No work sent Thu 13 Jan 23:04:33 2011 climateprediction.net Message from server: No work is available for UK Met Office HadSM3 Slab Model Thu 13 Jan 23:04:33 2011 climateprediction.net Message from server: No work is available for HadCM3 Coupled Model Experiment Optimised File I/O Thu 13 Jan 23:04:33 2011 climateprediction.net Message from server: No work is available for UK Met Office FAMOUS Thu 13 Jan 23:04:33 2011 climateprediction.net Message from server: UK Met Office HADAM3P European Region is not available for your type of computer. 
Thu 13 Jan 23:04:33 2011 climateprediction.net Message from server: UK Met Office HADAM3P Pacific North West is not available for your type of computer. Thu 13 Jan 23:04:33 2011 climateprediction.net Message from server: UK Met Office HADAM3P Southern Africa is not available for your type of computer. Thu 13 Jan 23:04:33 2011 climateprediction.net Message from server: No work available for the applications you have selected. Please check your settings on the web site. Thu 13 Jan 23:05:38 2011 climateprediction.net Sending scheduler request: To fetch work. Thu 13 Jan 23:05:38 2011 climateprediction.net Requesting new tasks Thu 13 Jan 23:05:39 2011 climateprediction.net Scheduler request completed: got 0 new tasks Thu 13 Jan 23:05:39 2011 climateprediction.net Message from server: No work sent Thu 13 Jan 23:05:39 2011 climateprediction.net Message from server: No work is available for UK Met Office HadSM3 Slab Model Thu 13 Jan 23:05:39 2011 climateprediction.net Message from server: No work is available for HadCM3 Coupled Model Experiment Optimised File I/O Thu 13 Jan 23:05:39 2011 climateprediction.net Message from server: No work is available for UK Met Office FAMOUS Thu 13 Jan 23:05:39 2011 climateprediction.net Message from server: UK Met Office HADAM3P European Region is not available for your type of computer. Thu 13 Jan 23:05:39 2011 climateprediction.net Message from server: UK Met Office HADAM3P Southern Africa is not available for your type of computer. Thu 13 Jan 23:05:39 2011 climateprediction.net Message from server: UK Met Office HADAM3P Pacific North West is not available for your type of computer. Thu 13 Jan 23:05:39 2011 climateprediction.net Message from server: No work available for the applications you have selected. Please check your settings on the web site. Thu 13 Jan 23:06:45 2011 climateprediction.net Sending scheduler request: To fetch work. 
Thu 13 Jan 23:06:45 2011 climateprediction.net Requesting new tasks Thu 13 Jan 23:06:47 2011 climateprediction.net Scheduler request completed: got 0 new tasks Thu 13 Jan 23:06:47 2011 climateprediction.net Message from server: No work sent Thu 13 Jan 23:06:47 2011 climateprediction.net Message from server: No work is available for UK Met Office HadSM3 Slab Model Thu 13 Jan 23:06:47 2011 climateprediction.net Message from server: No work is available for HadCM3 Coupled Model Experiment Optimised File I/O Thu 13 Jan 23:06:47 2011 climateprediction.net Message from server: No work is available for UK Met Office FAMOUS Thu 13 Jan 23:06:47 2011 climateprediction.net Message from server: UK Met Office HADAM3P European Region is not available for your type of computer. ======================================================== All in beautiful red type!!!! If my last 4 tasks go now, I will give up. Keith P.S. Maybe come back in 3 or 4 months to see what is happening. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,643,694 RAC: 4,803 |
There is, of course, another possibility - which is that your hardware has developed a fault. Try the suggestions in the hardware section of the 'read me' files, here. The only time I've had unexpected errors was when a memory stick went wrong. It took a long time to realise that was the problem because the machine's diagnostic test passed every time it was run for a couple of months - and then started failing every time. New stick: crashes stopped. CPDN really thrashes the memory in particular. |
Send message Joined: 20 Feb 06 Posts: 158 Credit: 1,251,176 RAC: 0 |
One more Calculation error today. Only 3 remaining FAMOUS tasks. Shall stop crunching CPDN when these 3 fail. Keith P.S. See no fault with hardware. Always THETA problem failures per BOINC. |