Benchmark stopped model being crunched

old_user272

Joined: 6 Aug 04
Posts: 58
Credit: 1,286,603
RAC: 0
Message 14640 - Posted: 25 Jul 2005, 12:25:53 UTC

Just noticed the following in the log of one of my machines running hadsm3 4.13

25/07/2005 07:26:37 128 Suspending computation and network activity - running CPU benchmarks
25/07/2005 07:26:37 129 Pausing result 13pm_100072003_1 (removed from memory)
25/07/2005 07:26:39 130 Running CPU benchmarks
25/07/2005 07:26:47 131 Aborting CPU benchmarks, one or more active tasks are still running.

I remember this happening once before on an earlier model and, I _think_, a different machine. The problem is that BOINCVIEW (and, I suppose, BOINC Manager) shows the model as running although it was never actually restarted after the benchmark abort. If you're not paying attention, it's quite easy to miss the situation and end up with a machine sitting idle.
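Since the abort is easy to miss, one option is to scan the client's message log for the abort line. Below is a minimal sketch, assuming the log text has already been read into a string and that the message wording matches the excerpt above; the function name and the sample text are illustrative only, not part of BOINC.

```python
ABORT_TEXT = "Aborting CPU benchmarks"  # message text taken from the log excerpt above


def find_benchmark_aborts(log_text):
    """Return every log line that reports an aborted benchmark run."""
    return [line.strip() for line in log_text.splitlines() if ABORT_TEXT in line]


# Two lines copied from the log excerpt above, used as sample input.
sample_log = """\
25/07/2005 07:26:39 130 Running CPU benchmarks
25/07/2005 07:26:47 131 Aborting CPU benchmarks, one or more active tasks are still running.
"""

for hit in find_benchmark_aborts(sample_log):
    print("Possible stalled model:", hit)
```

A hit does not prove the model is stalled; it only flags a machine worth checking by hand.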

Is this a known problem?

Ian


ID: 14640
Andrew Hingston
Volunteer moderator

Joined: 17 Aug 04
Posts: 753
Credit: 9,804,700
RAC: 0
Message 14641 - Posted: 25 Jul 2005, 13:09:28 UTC

See <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=2893">this thread</a>.

It is, as you say, easily overlooked. Only a few people have reported it, but it may be common.
ID: 14641
old_user2500

Joined: 28 Aug 04
Posts: 65
Credit: 9,636,280
RAC: 0
Message 14644 - Posted: 25 Jul 2005, 16:17:41 UTC - in response to Message 14641.  

It is occurring on a number of my machines. Some of these are "headless" (no monitor, keyboard or mouse), and I don't notice it until I check the CPDN stats and see that a machine has not reported in for some time.

I am running CPDN and SETI under BOINC 4.45. I have both apps configured to remain in memory when they are suspended. When the event occurs, both applications still show in the BOINC Manager but CPDN does not appear to be running. Not sure if the SETI app still runs, but I will watch for it.
ID: 14644
old_user2500

Joined: 28 Aug 04
Posts: 65
Credit: 9,636,280
RAC: 0
Message 14953 - Posted: 7 Aug 2005, 5:07:09 UTC - in response to Message 14644.  

OK, it just occurred on one of my machines:

- CPU benchmark starts to run and CPDN is paused.
- an error occurs
- the message "Aborting CPU benchmarks, one or more active tasks are still running." is displayed
- both of the hadsm processes appear to be killed but BOINC does not seem to be aware of it. I am running Seti (10%) and CPDN (90%) and have set both to remain in memory when suspended.
- on the "work" tab BOINC still shows CPDN and SETI.
- BOINC still allocates time slices to CPDN and shows it as running, even though the hadsm processes are no longer there.
- When BOINC allocates time slices to SETI, SETI does run, as the SETI process is still in memory.
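The mismatch in the list above (BOINC shows CPDN as running, but no hadsm process exists) can be checked from a process listing. Here is a rough sketch, assuming the listing comes from the Windows `tasklist /FO CSV /NH` command and that the model's image name starts with "hadsm". Both the prefix and the sample image names below are assumptions for illustration; check the real executable name on your own machine.

```python
import csv
import io


def model_process_running(tasklist_csv, prefix="hadsm"):
    """Return True if any image name in the CSV listing starts with prefix.

    tasklist_csv is expected to look like the output of
    `tasklist /FO CSV /NH`: one quoted CSV row per process,
    with the image name in the first column.
    """
    reader = csv.reader(io.StringIO(tasklist_csv))
    return any(bool(row) and row[0].lower().startswith(prefix) for row in reader)


# Hypothetical listings; the image names are made up for this example.
healthy = '"boinc.exe","1234","Console","0","10,000 K"\n' \
          '"hadsm_example.exe","5678","Console","0","50,000 K"\n'
stalled = '"boinc.exe","1234","Console","0","10,000 K"\n'

print(model_process_running(healthy))
print(model_process_running(stalled))
```

On a real machine the listing could be captured with `subprocess.run(["tasklist", "/FO", "CSV", "/NH"], capture_output=True, text=True).stdout`. A missing process while the manager still shows the model as running would match the symptom described here.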


ID: 14953
Keck_Komputers

Joined: 5 Aug 04
Posts: 426
Credit: 2,426,069
RAC: 0
Message 14956 - Posted: 7 Aug 2005, 8:51:57 UTC

This is a known issue that is hopefully fixed in the 4.71 version of the BOINC client.

The timeout for stopping applications for the benchmark run has been increased, hopefully enough to allow CPDN to stop in time. What is apparently happening is that the wait times out and then CPDN stops. At that point the app is waiting for the benchmarks to finish, and the benchmarks have already given up, so neither does anything.
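To make the sequence concrete, here is a toy model of the hand-off as described in this post. It is not taken from the BOINC source; the function name and the exit times used in the examples are invented for illustration, and only the 10-second figure comes from this thread.

```python
def benchmark_outcome(app_exit_seconds, timeout_seconds):
    """Toy model: the client asks the app to stop, then waits up to timeout_seconds.

    app_exit_seconds is how long the app needs to stop (a made-up input here).
    """
    if app_exit_seconds <= timeout_seconds:
        # The app stopped within the wait, so the benchmarks run and the app resumes.
        return "benchmarks run, app resumed"
    # The wait expired first: the benchmarks abort, the app stops afterwards
    # and then sits waiting for a resume that never comes.
    return "benchmarks aborted, app left stopped"


print(benchmark_outcome(app_exit_seconds=8, timeout_seconds=10))
print(benchmark_outcome(app_exit_seconds=15, timeout_seconds=10))
```

Under this model a longer timeout only helps if the app really does stop within it, which is why the fix is described as "hopefully enough".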

Restarting the client will get things going again. Running the benchmarks manually every 4.5 days is a preventative.


John Keck -- BOINCing since 2002/12/08
ID: 14956
old_user9685

Joined: 2 Sep 04
Posts: 44
Credit: 372,682
RAC: 0
Message 15163 - Posted: 17 Aug 2005, 12:43:39 UTC - in response to Message 14956.  

> The timeout for stopping applications for the benchmark run has been
> increased, hopefully enough to allow CPDN to stop in time. What is apparently

In the latest dev branch (4.72), the timeout is still at 10 seconds. Unless someone is still planning to increase it, it's not likely to be in the next release.
ID: 15163
crandles
Volunteer moderator

Joined: 16 Oct 04
Posts: 692
Credit: 277,679
RAC: 0
Message 15170 - Posted: 17 Aug 2005, 18:39:10 UTC

Chris Sutton has compiled a version 4.45a with a 30 sec delay rather than 10 sec. Arnaud has offered to host it, so hopefully it will be available soon.
_______________________________
Visit <a href="http://boinc-doc.net/boinc-wiki/index.php?title=Climateprediction_FAQ">BOINC WIKI</a> for help

And join <a href="http://www.boincsynergy.com/">BOINC Synergy</a> for all the news in one place.
ID: 15170
old_user159

Joined: 5 Aug 04
Posts: 2
Credit: 142,931
RAC: 0
Message 15172 - Posted: 17 Aug 2005, 20:59:43 UTC - in response to Message 14641.  

> It is, as you say, easily overlooked. Only a few people have reported it, but
> it may be common.

FYI, I also observed the same problem (CC 4.45 / Windows). I first assumed it was coming from my config, but I now see it is a frequently encountered problem.

By the way, if it's just a matter of changing the value of a delay, does anyone know why Berkeley doesn't just make the change for the next release?

> Unless someone is still planning to increase it, it's not likely
> to be in the next release.

Shall we do a petition? Where do we sign up? ;-)


ID: 15172
crandles
Volunteer moderator

Joined: 16 Oct 04
Posts: 692
Credit: 277,679
RAC: 0
Message 15206 - Posted: 18 Aug 2005, 19:17:50 UTC

To get 4.45a with a 30 sec delay instead of 10,

Arnaud is hosting a version created by Chris Sutton:

<a href="http://arnaudboinc.free.fr/">4.45a site</a>
ID: 15206
old_user9685

Joined: 2 Sep 04
Posts: 44
Credit: 372,682
RAC: 0
Message 15219 - Posted: 19 Aug 2005, 7:38:40 UTC - in response to Message 15206.  
Last modified: 19 Aug 2005, 7:41:14 UTC

> To get 4.45a with a 30 sec delay instead of 10,

There is a problem with the benchmarks run by this version: they are very low. I haven't yet figured out the cause (hopefully just missing optimizations), so please don't download this version.

If you already have, I would suggest reverting to the UCB 4.45 until I have had a chance to figure out the problem.

Sorry all.
Chris :(
ID: 15219
old_user9685

Joined: 2 Sep 04
Posts: 44
Credit: 372,682
RAC: 0
Message 15273 - Posted: 21 Aug 2005, 11:02:26 UTC - in response to Message 15219.  

> There is a problem with the benchmarks run by this version. They are very low.

There's a new version (4.45b) on its way to Arnaud.
Benchmark issue appears sorted. Holler if you find otherwise.
ID: 15273
Arnaud

Joined: 3 Sep 04
Posts: 268
Credit: 256,045
RAC: 0
Message 15277 - Posted: 21 Aug 2005, 14:39:15 UTC
Last modified: 21 Aug 2005, 14:39:43 UTC

<a href="http://arnaudboinc.free.fr">BOINC 4.45b</a> available :o)
-----------------------------------------------
<a href="http://boinc-doc.net/boinc-wiki/index.php?title=Main_Page">Boinc Wiki</a>
<a href="http://forum.boinc.fr/">L'Alliance Francophone</a>
ID: 15277
old_user15351

Joined: 8 Sep 04
Posts: 23
Credit: 121,446
RAC: 0
Message 15390 - Posted: 26 Aug 2005, 0:18:05 UTC

I have a concern regarding the timeout. What if the problem is not that the CPDN model takes more than 10 s to exit, but that you're running a "non CPU intensive" project app (the Berkeley computer science "crash collection" project, for example: http://winerror.cs.berkeley.edu/crashcollection/)? I'm using a 3GHz Xeon generation workstation, so I don't think it's CPDN that's causing the problem. Possibly something is wrong with the "restart apps" code in the core client?
I'm not a coder, so this is out of my league; just a suggestion, though.

Lee
ID: 15390
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2169
Credit: 64,550,109
RAC: 6,649
Message 15392 - Posted: 26 Aug 2005, 2:06:22 UTC - in response to Message 15390.  

> I have a concern regarding the timeout, what if it's not that the CPDN model
> is taking more than 10s to exit but that your running a "non CPU intensive"
> project app (berkeley computer science "crash collection" project for example
> - http://winerror.cs.berkeley.edu/crashcollection/) as i'm using a 3GHz Xeon
> generation workstation, so i don't think it's CPDN that's causing the problem,
> possibly something is wrong with the "restart apps" code in the core client

Could be, but every unattended benchmark with 4.45 resulted in a stoppage of work that wouldn't restart. Once I copied in Ralic's 4.45b files, all automatic benchmarks have worked as they should.
ID: 15392
old_user15351

Joined: 8 Sep 04
Posts: 23
Credit: 121,446
RAC: 0
Message 15429 - Posted: 26 Aug 2005, 16:28:58 UTC - in response to Message 15392.  

> > I have a concern regarding the timeout, what if it's not that the CPDN model
> > is taking more than 10s to exit but that your running a "non CPU intensive"
> > project app (berkeley computer science "crash collection" project for example
> > - http://winerror.cs.berkeley.edu/crashcollection/) as i'm using a 3GHz Xeon
> > generation workstation, so i don't think it's CPDN that's causing the problem,
> > possibly something is wrong with the "restart apps" code in the core client
>
> Could be, but every unattended benchmark with 4.45 resulted in a stoppage of
> work that wouldn't restart. Once I copied in Ralic's 4.45b files, all
> automatic benchmarks have worked as they should.

True, but I gather from the posts on various threads that the current problem most probably arises because CPDN isn't allowed enough time to exit properly. A "non CPU intensive" task, however, doesn't try to exit (because it won't affect the benchmarks), so the problem may well persist even with the increased timeout in 4.45b.

I'm just trying to get to the bottom of a potentially bigger problem, so that when BOINC becomes more popular there are fewer issues.

Lee
ID: 15429
old_user6115

Joined: 31 Aug 04
Posts: 14
Credit: 404,382
RAC: 0
Message 16538 - Posted: 10 Oct 2005, 20:48:24 UTC - in response to Message 15172.  

> It is, as you say, easily overlooked. Only a few people have reported it, but
> it may be common.

FYI, I also observed the same problem (CC 4.45 / Windows). I first assumed it was coming from my config, but I now see it is a frequently encountered problem.

By the way, if it's just a matter of changing the value of a delay, does anyone know why Berkeley doesn't just make the change for the next release?

> Unless someone is still planning to increase it, it's not likely
> to be in the next release.

Shall we do a petition? Where do we sign up? ;-)

I wonder if it was fixed by V5.1.8. I recently saw the problem on one of my boxes (Win2003 Server). It was very difficult to spot since the other 3 projects the box is running were just fine. I only caught it when I saw that the box hadn't trickled in several days. Then I noticed that CPDN was "running", but not using any CPU time.

I saw the same symptoms on a WinXP box that is a "work" computer. I just happened to notice the "running" without using CPU time. That one gets rebooted reasonably often in normal use, so I probably would never notice a delayed trickle. I haven't seen it on another "work" computer, but it's pretty hard to notice when using a number of machines.

Since the four projects I am running now all support V5 clients, I went ahead and upgraded. I just wonder how vigilant I need to be.



ID: 16538


©2024 climateprediction.net