HADAM3P - Maximum elapsed time exceeded


Eirik Redd
Joined: Aug 31 04
Posts: 249
Credit: 26,987,757
RAC: 20,070
Message 42711 - Posted 29 Jul 2011 20:08:16 UTC

    Last modified: 29 Jul 2011 20:25:46 UTC

    This task, hadam3p_pnw_32s5_1985_1_007369346_0, failed with "Maximum elapsed time exceeded" after just over 231,000 seconds of run time.
    The estimated time to completion had been way low ever since the task started -- about ten hours, slowly increasing as the task ran. 100 or 120 hours would seem more likely on this machine or the other host where there are still a couple of these running.

    Is this a case where changing <rsc_fpops_bound> in client_state.xml might help?

    Thanks

    Eric

    <<edit>>

    Just got another hadam3p_pnw on the same machine, and time-to-completion looks normal at 111 hrs.
    ____________

    astroWX
    Forum moderator
    Joined: Aug 5 04
    Posts: 1295
    Credit: 37,520,270
    RAC: 17,812
    Message 42712 - Posted 30 Jul 2011 6:16:00 UTC

      Last modified: 30 Jul 2011 6:28:27 UTC

      Eric,

      The first batch of these tasks had run estimates about 1/10 of reality. Your second PNW task is closer to the mark.

      Whether you tweak client_state is up to you, depending on whether the erroneous value interferes with other BOINC projects. If it doesn't, it will sort itself out as the task moves along.

      Edit:
      I forgot, my bad. Please do make the tweak, else the task is likely to turn belly up about 80% completion.

      Jim
      ____________
      "We have met the enemy and he is us." -- Pogo
      Greetings from coastal Washington state, the scenic US Pacific Northwest.

      JIM
      Joined: Dec 31 07
      Posts: 676
      Credit: 3,957,635
      RAC: 2,780
      Message 42716 - Posted 30 Jul 2011 14:18:13 UTC

        Last modified: 30 Jul 2011 14:20:44 UTC

        Dear astroWX

        [Quote:] I forgot, my bad. Please do make the tweak, else the task is likely to turn belly up about 80% completion.

        Please explain the tweak mentioned above. I also have a hadam3p_pnw that is "waiting to start" with a "to completion" time of 18 hours. You say that it is likely to crash at about 80% unless it is modified.

        I think I have found the place in the client_state.xml file that needs to be modified.

        <rsc_fpops_bound> in client_state.xml

        <name>hadam3p_pnw_314p_1995_1_007369937</name>
        <app_name>hadam3p_pnw</app_name>
        <version_num>609</version_num>
        <rsc_fpops_est>79683833333333.000000</rsc_fpops_est>
        <rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>
        <rsc_memory_bound>364000000.000000</rsc_memory_bound>
        <rsc_disk_bound>2000000000.000000</rsc_disk_bound>

        Is this right? Also please explain which value I have to change and to what.

        I don't want to waste three and a half days of crunching on a defective WU, only to have it crash from a problem that is fixable.
        ____________

        Ingleside
        Joined: Aug 5 04
        Posts: 92
        Credit: 7,982,232
        RAC: 8,702
        Message 42717 - Posted 30 Jul 2011 16:20:37 UTC - in response to Message 42716.

          Last modified: 30 Jul 2011 16:22:36 UTC

          [Quote:] <rsc_fpops_bound> in client_state.xml

          <name>hadam3p_pnw_314p_1995_1_007369937</name>
          <app_name>hadam3p_pnw</app_name>
          <version_num>609</version_num>
          <rsc_fpops_est>79683833333333.000000</rsc_fpops_est>
          <rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>
          <rsc_memory_bound>364000000.000000</rsc_memory_bound>
          <rsc_disk_bound>2000000000.000000</rsc_disk_bound>

          Is this right? Also please explain which value I have to change and to what.

          I don't want to waste three and a half days of crunching on a defective WU, only to have it crash from a problem that is fixable.

          You'll need to change <rsc_fpops_bound>.

          Just make sure BOINC is stopped, open client_state.xml in Notepad (or something similar under Linux), add another digit to <rsc_fpops_bound> (before the decimal point), save the new client_state.xml and restart BOINC.

          It doesn't matter if you also change <rsc_fpops_bound> for other tasks, so it's possible to search & replace all occurrences of <rsc_fpops_bound> with <rsc_fpops_bound>9 or something (adding an extra 9 to all).

          To avoid a very high duration correction factor, it's also a good idea to change <rsc_fpops_est> by adding an 8 or 9 at the start. This should only be done to the wrongly estimated task(s).
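
          For anyone who would rather script the edit than do it by hand, here is a minimal Python sketch of the same search & replace. It is only an illustration: the helper name prepend_digit and the sample text are made up for this example, and it assumes each task sits in its own <workunit>...</workunit> block in client_state.xml. It works on a string, so nothing is touched on disk.

```python
import re


def prepend_digit(xml_text, tag, digit="9", only_name=None):
    """Prepend `digit` to the number inside every <tag>...</tag> element.

    If `only_name` is given, only <workunit> blocks whose <name> equals it
    are changed. Hypothetical helper, not part of BOINC.
    """
    tag_re = re.compile(r"(<%s>)(\d)" % re.escape(tag))

    def bump(block):
        # Insert the extra digit right after the opening tag.
        return tag_re.sub(lambda m: m.group(1) + digit + m.group(2), block)

    if only_name is None:
        return bump(xml_text)

    # Assumption: each task sits in its own <workunit>...</workunit> block.
    wu_re = re.compile(r"<workunit>.*?</workunit>", re.DOTALL)
    wanted = "<name>%s</name>" % only_name

    def bump_if_wanted(match):
        block = match.group(0)
        return bump(block) if wanted in block else block

    return wu_re.sub(bump_if_wanted, xml_text)


sample = """<workunit>
    <name>hadam3p_pnw_314p_1995_1_007369937</name>
    <rsc_fpops_est>79683833333333.000000</rsc_fpops_est>
    <rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>
</workunit>
<workunit>
    <name>some_other_task</name>
    <rsc_fpops_est>1000.000000</rsc_fpops_est>
    <rsc_fpops_bound>10000.000000</rsc_fpops_bound>
</workunit>
"""

# Raise the bound for every task, then the estimate for the broken task only.
edited = prepend_digit(sample, "rsc_fpops_bound")
edited = prepend_digit(edited, "rsc_fpops_est",
                       only_name="hadam3p_pnw_314p_1995_1_007369937")
print(edited)
```

          To use this on a real file you would still stop BOINC first, keep a backup copy of client_state.xml, read the file into a string, run the function, and write the result back before restarting BOINC.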

          JIM
          Joined: Dec 31 07
          Posts: 676
          Credit: 3,957,635
          RAC: 2,780
          Message 42718 - Posted 30 Jul 2011 17:07:41 UTC - in response to Message 42716.

            Thanks for the info. Will try it tonight after I get home from work. Wish me luck (will make backup first).

            Just to be sure that I understand, I should modify the entry from this:

            <rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>

            To look like this?

            <rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>9

            Is this correct? Please respond.

            ____________

            Ingleside
            Joined: Aug 5 04
            Posts: 92
            Credit: 7,982,232
            RAC: 8,702
            Message 42719 - Posted 30 Jul 2011 18:45:17 UTC - in response to Message 42718.

              [Quote:] Thanks for the info. Will try it tonight after I get home from work. Wish me luck (will make backup first).

              Just to be sure that I understand, I should modify the entry from this:

              <rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>

              To look like this?

              <rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>9

              Is this correct? Please respond.

              No, it should be from:
              <rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>
              to
              <rsc_fpops_bound>9796838333333330.000000</rsc_fpops_bound>
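
              A rough Python check of what the extra 9 buys. This is a sketch under assumptions that are not stated in this thread: that the client aborts a task once its elapsed time passes rsc_fpops_bound divided by the host's measured speed, and that the host runs at a made-up 3.0e9 floating-point operations per second. The function name max_elapsed_hours is hypothetical.

```python
def max_elapsed_hours(fpops_bound, host_flops):
    """Hours a task may run before the client gives up on it (assumed rule)."""
    return fpops_bound / host_flops / 3600.0


fpops_est = 79683833333333.0
old_bound = 796838333333330.0
new_bound = 9796838333333330.0
host_flops = 3.0e9  # hypothetical host speed, floating-point ops per second

print(old_bound / fpops_est)   # bound is 10x the estimate
print(new_bound / old_bound)   # roughly 12.3x more headroom
print(max_elapsed_hours(old_bound, host_flops))
print(max_elapsed_hours(new_bound, host_flops))
```

              With these assumed numbers the original bound allows a little under 74 hours, which is less than the 100 hours or more these models need, while the edited bound allows several hundred hours.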


              JIM
              Joined: Dec 31 07
              Posts: 676
              Credit: 3,957,635
              RAC: 2,780
              Message 42720 - Posted 30 Jul 2011 18:58:20 UTC - in response to Message 42719.

                Thanks for the quick reply. I would have had it backwards.
                ____________

                Eirik Redd
                Joined: Aug 31 04
                Posts: 249
                Credit: 26,987,757
                RAC: 20,070
                Message 42733 - Posted 2 Aug 2011 6:45:39 UTC - in response to Message 42720.

                  This tweak worked ok for me.
                  The newer WUs don't need it.

                  Keep crunching

                  Eric
                  ____________

                  Les Bayliss
                  Forum moderator
                  Joined: Sep 5 04
                  Posts: 5344
                  Credit: 8,876,229
                  RAC: 549
                  Message 42735 - Posted 2 Aug 2011 11:26:33 UTC

                    News post about a new problem.


                    ____________
                    Backups: Here

                    Darmok
                    Joined: Dec 29 09
                    Posts: 27
                    Credit: 2,577,258
                    RAC: 546
                    Message 42738 - Posted 2 Aug 2011 11:28:58 UTC - in response to Message 42733.

                      Last modified: 2 Aug 2011 11:45:03 UTC

                      Just a thought: wouldn't it be easier to send a new client_state file? As described in the post from Les in News & Announcements, I am experiencing all these problems with a slew of PNWs set at too low a completion time, and my first ones crashed at exactly 89:38:18. I do not want to abort the rest, as it took several hours of download to receive them and it would be a waste.

                      As Les said, not everybody reads the boards, not everybody is computer savvy and some have multiple hosts.
                      ____________

                      DJStarfox
                      Joined: Jan 27 07
                      Posts: 260
                      Credit: 1,159,804
                      RAC: 204
                      Message 42740 - Posted 2 Aug 2011 13:26:21 UTC - in response to Message 42738.

                        The BOINC infrastructure doesn't allow such changes. The client state file can also change every 5 seconds.

                        Darmok
                        Joined: Dec 29 09
                        Posts: 27
                        Credit: 2,577,258
                        RAC: 546
                        Message 42746 - Posted 2 Aug 2011 23:41:01 UTC - in response to Message 42740.

                          You are right. The file reverts back after a restart.

                          Overtonesinger
                          Joined: Dec 30 05
                          Posts: 4
                          Credit: 940,357
                          RAC: 0
                          Message 42758 - Posted 5 Aug 2011 21:24:48 UTC

                            Huge disappointment

                            The BIG minus of having 8 logical CPUs is: NOT 1 or 2 ... but 8 WUs are DESTROYED at 68 percent complete (about 108 hours each) before I find out what's happening. Imagine how bad I feel. So much CPU time wasted.

                            OK, just tell me here, please, when all new WUs are fixed, so their config is right. As a computer programmer I hate to change it manually to fix someone else's mistake... I hate to fear for, and have to CARE about, every work unit's LIFE. ;) The lives of my newborn twins (kids) are just enough... Thanks.




                            *P.S.* Please, try to create 1 MultiThreaded (8-threaded preferably) WorkUnits, because when it fails - it fails only one WU, so I find it out much much sooner than 8x 108 hours of CPU time... , I would find it about eight times sooner. ;)

                            I loved the huge AQUA work units, MT 8, and they lasted about 25 hours. Sadly, that project has had only a few 1-threaded TEST units lately (with a LIMITED NUMBER of max. two at the same time per computer) after some major crash. So I switched to CPDN hoping to put some 1 of 8 gigabytes of my RAM to some good use... what a PITY it crashed all 8 WUs. :(
                            Please, fix it. I love CPDN and I have been running it on several computers for 2 years now... Please, fix it, as I have *too much* RAM and I cannot use it ALL. :)
                            ____________

                            Les Bayliss
                            Forum moderator
                            Joined: Sep 5 04
                            Posts: 5344
                            Credit: 8,876,229
                            RAC: 549
                            Message 42759 - Posted 5 Aug 2011 21:39:10 UTC - in response to Message 42758.

                              Last modified: 5 Aug 2011 21:41:34 UTC

                              I found out about the problem within hours, and notified the project people, who cancelled the remaining faulty WUs.
                              All WU's currently running/being created are, as far as is known, fault free.

                              CPDN isn't a set-and-forget project; keeping an eye on the message boards is needed if people don't want to get caught out by problems.


                              **************************

                              Multicore models were tested a couple of years back, but they were too unstable to even release to the beta testers. It's unlikely that the two project people will ever have the time to try again.
                              ____________
                              Backups: Here

                              JIM
                              Joined: Dec 31 07
                              Posts: 676
                              Credit: 3,957,635
                              RAC: 2,780
                              Message 42767 - Posted 11 Aug 2011 4:15:47 UTC

                                Several days ago, I downloaded three Hadam3p WUs. The "to completion" time was listed as 1548 hours! I know that these "to completion" times are often significantly inflated, often 2 or 3 times the actual running time, but this is about 10 times what it will take to run the WUs.

                                Are other people seeing this wild inflation, or does it have something to do with the tweak that I made to the client_state file to fix a problem with an earlier WU? That WU had a very low "to completion" time of only 18 hours. That tweak is described above in this thread. Do I need to undo the tweak?


                                ____________

                                Iain Inglis
                                Forum moderator
                                Joined: Jan 16 10
                                Posts: 495
                                Credit: 9,532
                                RAC: 0
                                Message 42768 - Posted 11 Aug 2011 9:14:59 UTC

                                  The HADCM3N time estimates were excessive, but the HADAM3P models haven't been reported as having systematic problems (other than 'fpops_bound'). I've never been quite clear how BOINC handles the transition between model types: it may be that if you've switched from mostly HADCM3N to HADAM3P then there would be some transient effect as BOINC adjusts - in which case the best thing to do is just wait.

                                  If others have similarly inflated HADAM3P estimates for the new models then perhaps the HADCM3N values have been copied across ...

                                  Dave Jackson
                                  Joined: May 15 09
                                  Posts: 807
                                  Credit: 632,379
                                  RAC: 338
                                  Message 42769 - Posted 11 Aug 2011 10:12:24 UTC - in response to Message 42768.

                                    I've just gone back to a HADAM3P, and the time estimate seemed about right at the start. But then I haven't done any manual editing of the client_state.xml file, so I can't comment on that.

                                    Dave

                                    Eirik Redd
                                    Joined: Aug 31 04
                                    Posts: 249
                                    Credit: 26,987,757
                                    RAC: 20,070
                                    Message 42771 - Posted 12 Aug 2011 0:11:52 UTC

                                      Here -- one of my hosts got a new HADAM3P est at 1100+ hours, and the last few HADAM3P on this same host at startup estimated at 1500. Actual time is about 120 hrs. It all settles down after a while.
                                      The strange bit is, that my other hosts never did this gross overestimate.
                                      Maybe it's something in client_state.xml, but I'm not going to worry about it.
                                      Too many BOINC options for me to sweat it.
                                      If it ain't broke --

                                      Eric
                                      ____________

                                      JIM
                                      Joined: Dec 31 07
                                      Posts: 676
                                      Credit: 3,957,635
                                      RAC: 2,780
                                      Message 42772 - Posted 12 Aug 2011 5:56:06 UTC

                                        Dear Eric:

                                        I know what you mean. My other host just got a hadam3p_pnw WU and the initial "to completion" time is only 265 hours. Hadam3p WUs on that machine take about 175 hours to complete.

                                        The only problem with these wildly inflated "to completion" times is that it is hard to get new work. The BOINC Manager does not even ask for work when it thinks that you have a month and a half of crunching ahead of you, when in reality the WU will finish in only a few days. This was fine when there were 20,000 WUs waiting to be crunched, but these days, with the queue often empty (and the built-in back-off times), the machine may have to beg for days just to get something.

                                        ____________

                                        Ingleside
                                        Joined: Aug 5 04
                                        Posts: 92
                                        Credit: 7,982,232
                                        RAC: 8,702
                                        Message 42775 - Posted 12 Aug 2011 11:40:46 UTC - in response to Message 42767.

                                          Last modified: 12 Aug 2011 11:51:19 UTC

                                          [Quote:] Several days ago, I downloaded three Hadam3p WUs. The "to completion" time was listed as 1548 hours! I know that these "to completion" times are often significantly inflated, often 2 or 3 times the actual running time, but this is about 10 times what it will take to run the WUs.

                                          Are other people seeing this wild inflation, or does it have something to do with the tweak that I made to the client_state file to fix a problem with an earlier WU? That WU had a very low "to completion" time of only 18 hours. That tweak is described above in this thread. Do I need to undo the tweak?

                                          If you finished one of the broken Hadam3p tasks by editing fpops_bound but didn't also increase fpops_est, your Duration Correction Factor (DCF for short) will have increased accordingly. So if, for example, the initial estimate for the broken model was 10 hours but the model actually took 100 hours to run, your DCF became 10x what it was before.

                                          This new DCF will influence all future estimates, so a new Hadam3p with a "correct" fpops_est will show 1000 hours instead of 100 hours.

                                          The DCF will slowly decrease again as you finish tasks. If I remember correctly, it decreases by at most 10% per task, unless the client thinks the difference between the current DCF and the lower one is too large, in which case it decreases by only 1% per task... But in any case, it should slowly decrease again.
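
                                          The behaviour described above can be sketched in a few lines of Python. This is a simplified model, not the client's actual code: the function name update_dcf is made up, and the 1.1 and 0.1 thresholds are assumptions chosen to match the "jump up at once, come down by 10% or 1% per task" description.

```python
def update_dcf(dcf, elapsed_hours, raw_estimate_hours):
    """Return the new DCF after one finished task (simplified model)."""
    raw_ratio = elapsed_hours / raw_estimate_hours           # vs. uncorrected estimate
    adj_ratio = elapsed_hours / (raw_estimate_hours * dcf)   # vs. DCF-corrected estimate
    if adj_ratio > 1.1:
        # Task ran longer than predicted: jump straight up (assumed threshold).
        return raw_ratio
    if adj_ratio < 0.1:
        # Finished far earlier than predicted: move only 1% of the way.
        return 0.99 * dcf + 0.01 * raw_ratio
    # Otherwise move 10% of the way toward the observed ratio.
    return 0.9 * dcf + 0.1 * raw_ratio


dcf = 1.0
# Broken task: estimated at 10 hours, actually took 100 hours.
dcf = update_dcf(dcf, elapsed_hours=100.0, raw_estimate_hours=10.0)
print(dcf)           # DCF after the broken task
print(100.0 * dcf)   # hours shown for a new, correctly estimated 100-hour task

# Later tasks with a correct 100-hour estimate, each taking 120 hours.
for _ in range(5):
    dcf = update_dcf(dcf, elapsed_hours=120.0, raw_estimate_hours=100.0)
    print(round(dcf, 3))
```

                                          In this model a single badly underestimated task raises the DCF in one step, while it takes many normal tasks to bring it back down, which fits the inflated estimates reported in this thread.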


                                          The DCF, and therefore the estimates, will never be very good here at CPDN. For one thing, the HADCM3N estimate is too high, so after a string of these the DCF will become 0.5 or so, but a single Hadam3p will push the DCF back up to 1 again. Also, for some of the models the speed depends significantly on what else the computer is running; if you run several of the same model they can slow each other down, so this too gives some variation between runs.


                                          Edit:

                                          You can see your current DCF in BOINC Manager, as long as you're running v6.6.xx or later, by selecting the Projects tab, selecting a project, and hitting "Properties". DCF is the last item listed.

                                          The DCF is also displayed on the web page: if you look at the details of one of your own computers, you'll see the DCF.

                                          SFCC
                                          Joined: Sep 3 09
                                          Posts: 5
                                          Credit: 509,410
                                          RAC: 0
                                          Message 42818 - Posted 26 Aug 2011 22:10:23 UTC

                                            All hadam3p WUs that I have received during the past week or so have shown an estimated run time between 500 hrs and 1200 hrs, with a 'due date' approximately 3 months hence. The WUs run for several days with little or no decrease in remaining estimated run time, resulting in the WU running at high priority and blocking any other project from running on single core machines. Some have eventually settled down after a few days and started showing a systematic decrease in remaining run time, but others continued to run at high priority for many days until I finally aborted them. I have therefore suspended the project on my single core machines and only allow it to run on my three multicore machines (under two different user IDs), which can still run other projects while the climate model is running at high priority. Just thought someone might like to know...

                                            Greg van Paassen
                                            Joined: Nov 17 07
                                            Posts: 142
                                            Credit: 4,271,370
                                            RAC: 79
                                            Message 42819 - Posted 27 Aug 2011 0:39:13 UTC - in response to Message 42818.

                                              Last modified: 27 Aug 2011 0:46:31 UTC

                                              As I understand it, the BOINC time estimation algorithm works better the more models the PC has run to completion. So the more models you let run, the better the initial estimate will be.

                                              During the course of a model run, BOINC seems to stick with its original estimate much longer than we humans do. In my experience it doesn't get really accurate till about 90% completed, or even later. There's nothing we can do about this.

                                              So it's a case of just letting BOINC do its thing, if you still want to contribute. In the long run, things will work out. (That is, after finishing the CPDN model, the PC will spend its time with other projects, to work off its "time deficit".)

                                              Oh and yes, the CPDN models do take weeks (or months!) to run. That's normal.

                                              Cheers.

                                              w1hue
                                              Joined: Aug 31 05
                                              Posts: 13
                                              Credit: 574,406
                                              RAC: 375
                                              Message 42820 - Posted 27 Aug 2011 4:36:35 UTC - in response to Message 42819.

                                                Last modified: 27 Aug 2011 4:49:10 UTC

                                                [Quote:] Oh and yes, the CPDN models do take weeks (or months!) to run. That's normal.

                                                And while it's running at high priority on a single core machine, nothing else gets done until one of the other WUs gets near its 'due date'. The problem would be resolved if the due date (of the HADAM3P WUs) was made much longer. Previous ClimatePrediction WUs typically had a due date about one year in the future -- in which case the run time estimate was always much less than the time until the due date.

                                                EDIT: I just noticed that the offending WU's are HADCM3N's rather than HADAM3P's -- guess this is the wrong thread... :-[
                                                ____________

                                                Les Bayliss
                                                Forum moderator
                                                Joined: Sep 5 04
                                                Posts: 5344
                                                Credit: 8,876,229
                                                RAC: 549
                                                Message 42821 - Posted 27 Aug 2011 4:48:16 UTC - in response to Message 42820.

                                                  The problem with long "due dates" is that people treated this as "The project's not in a hurry to get any results".

                                                  Except that each of the research groups does want results fairly quickly, as the next lot of work is dependent on what's currently running.

                                                  And with the tight deadline for completion of the RAPIT project, (it may even be past it, due to problems getting out the bugs at the start of the year), there's just no time to "be nice" to single core/multi project crunchers.

                                                  Welcome to the New Look CPDN.


                                                  ____________
                                                  Backups: Here

                                                  w1hue
                                                  Joined: Aug 31 05
                                                  Posts: 13
                                                  Credit: 574,406
                                                  RAC: 375
                                                  Message 42822 - Posted 27 Aug 2011 4:52:22 UTC - in response to Message 42821.

                                                    [Quote:] And with the tight deadline for completion of the RAPIT project, (it may even be past it, due to problems getting out the bugs at the start of the year), there's just no time to "be nice" to single core/multi project crunchers.


                                                    If that is the project's attitude, then I'll just donate my puny little computers' time elsewhere... :-(

                                                    ____________

                                                    Iain Inglis
                                                    Forum moderator
                                                    Joined: Jan 16 10
                                                    Posts: 495
                                                    Credit: 9,532
                                                    RAC: 0
                                                    Message 42823 - Posted 27 Aug 2011 13:01:36 UTC - in response to Message 42822.

                                                      [Quote:] And with the tight deadline for completion of the RAPIT project, (it may even be past it, due to problems getting out the bugs at the start of the year), there's just no time to "be nice" to single core/multi project crunchers.


                                                      If that is the project's attitude, then I'll just donate my puny little computers' time elsewhere... :-(

                                                      It isn't the project's attitude. CPDN models are usually a pretty relaxed affair, with results continuing to be accepted beyond any reasonable deadline. However, as Les says, the RAPIT sub-project is different - they do have a more constrained timeline. If that's a problem then it is always possible to deselect the HADCM3N models from your project preferences and select another model type.

                                                      I take the use of CPDN by other research groups as a credit to the project team and also to the prodigious efforts of the volunteers. It shouldn't be a surprise if different teams have different objectives - and we volunteers may need to adjust our contributions accordingly.

                                                      w1hue
                                                      Joined: Aug 31 05
                                                      Posts: 13
                                                      Credit: 574,406
                                                      RAC: 375
                                                      Message 42825 - Posted 27 Aug 2011 17:03:43 UTC - in response to Message 42823.

                                                        [Quote:] If that's a problem then it is always possible to deselect the HADCM3N models from your project preferences and select another model type.

                                                        Good point -- I'll deselect HADCM3N. But not much else seems to be available lately...

                                                        ____________

                                                        JIM
                                                        Joined: Dec 31 07
                                                        Posts: 676
                                                        Credit: 3,957,635
                                                        RAC: 2,780
                                                        Message 42826 - Posted 28 Aug 2011 0:15:30 UTC - in response to Message 42825.

                                                          [Quote:] Good point -- I'll deselect HADCM3N. But not much else seems to be available lately.

                                                          Hadam3p WUs are still being produced in limited batches. They seem to be released at irregular intervals. The problem is that they go fast. It is not like in the past, when slab models were available in seemingly endless numbers. If you stay connected 24/7 you should get 1 or 2 in a few days. Just make sure that your work buffer is set for 10 days.
                                                          ____________

                                                          Copyright © 2002-2014 climateprediction.net