Task ... exited with zero status but no 'finished' file.

old_user470806
Joined: 6 Sep 07
Posts: 47
Credit: 23,188
RAC: 0
Message 30572 - Posted: 20 Sep 2007, 5:14:18 UTC

Although I searched for words and phrases from the subject event, the search reported nothing on the board:

9/19/2007 2:05:44 PM| |Starting BOINC client version 5.10.13 for windows_intelx86
9/19/2007 2:05:44 PM| |log flags: task, file_xfer, sched_ops
9/19/2007 2:05:44 PM| |Libraries: libcurl/7.16.1 OpenSSL/0.9.8e zlib/1.2.3
9/19/2007 2:05:44 PM| |Data directory: C:\Program Files\BOINC
9/19/2007 2:05:44 PM| |Processor: 1 AuthenticAMD AMD Athlon(tm) XP 2400+ [x86 Family 6 Model 8 Stepping 1]
9/19/2007 2:05:44 PM| |Processor features: fpu tsc sse 3dnow mmx
9/19/2007 2:05:44 PM| |Memory: 1.94 GB physical, 2.29 GB virtual
9/19/2007 2:05:44 PM| |Disk: 57.26 GB total, 19.35 GB free
9/19/2007 2:05:44 PM| climateprediction.net |URL: http://climateprediction.net/; Computer ID: 753323; location: (none); project prefs: default
9/19/2007 2:05:44 PM| |General prefs: from climateprediction.net (last modified 2007-09-18 17:32:13)
9/19/2007 2:05:44 PM| |Host location: none
9/19/2007 2:05:44 PM| |General prefs: using your defaults
9/19/2007 2:05:44 PM| |Reading preferences override file
9/19/2007 2:05:44 PM| |Preferences limit memory usage when active to 768.55MB
9/19/2007 2:05:44 PM| |Preferences limit memory usage when idle to 1031.35MB
9/19/2007 2:05:44 PM| |Preferences limit disk usage to 9.31GB
9/19/2007 2:05:44 PM| climateprediction.net |Restarting task hadcm3iozn_cpm4_2000_80_45898965_7 using hadcm3i version 544

9/19/2007 7:00:51 PM| climateprediction.net |Task hadcm3iozn_cpm4_2000_80_45898965_7 exited with zero status but no 'finished' file
9/19/2007 7:00:51 PM| climateprediction.net |If this happens repeatedly you may need to reset the project.
9/19/2007 7:00:51 PM| climateprediction.net |Restarting task hadcm3iozn_cpm4_2000_80_45898965_7 using hadcm3i version 544

9/19/2007 11:59:43 PM| climateprediction.net |Task hadcm3iozn_cpm4_2000_80_45898965_7 exited with zero status but no 'finished' file
9/19/2007 11:59:43 PM| climateprediction.net |If this happens repeatedly you may need to reset the project.
9/19/2007 11:59:43 PM| climateprediction.net |Restarting task hadcm3iozn_cpm4_2000_80_45898965_7 using hadcm3i version 544
==========

As you may note, the subject event occurs at intervals of approximately 5 hours -- on almost every day during which I have let the climate model execute for at least 6 or 7 consecutive hours. Although the BOINC stderr.txt file does not include a field that has the date and time of each entry, the apparent cause of the subject event is "No heartbeat from core client for 31 sec - exiting / CPDN Monitor - No 'heartbeat' from BOINC..."
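The roughly five-hour spacing can be checked from the log itself rather than by eye. A minimal Python sketch, assuming the `date time AM/PM| project |message` layout of the lines quoted above; the helper name `event_times` is illustrative and not part of BOINC:

```python
from datetime import datetime

# Three lines copied from the log above (BOINC 5.10.13 message format).
LOG = """\
9/19/2007 2:05:44 PM| climateprediction.net |Restarting task hadcm3iozn_cpm4_2000_80_45898965_7 using hadcm3i version 544
9/19/2007 7:00:51 PM| climateprediction.net |Task hadcm3iozn_cpm4_2000_80_45898965_7 exited with zero status but no 'finished' file
9/19/2007 11:59:43 PM| climateprediction.net |Task hadcm3iozn_cpm4_2000_80_45898965_7 exited with zero status but no 'finished' file
"""


def event_times(log, needle):
    """Return the timestamps of all log lines that contain `needle`."""
    times = []
    for line in log.splitlines():
        if needle in line:
            stamp = line.split("|", 1)[0].strip()
            times.append(datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p"))
    return times


start = event_times(LOG, "Restarting task")[0]
exits = event_times(LOG, "exited with zero status")

# Gaps between the first start and each successive exit, in hours.
points = [start] + exits
gaps_hours = [(b - a).total_seconds() / 3600 for a, b in zip(points, points[1:])]
for gap in gaps_hours:
    print(f"{gap:.2f} h")
```

For the three lines shown, both gaps come out just under five hours.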

It has never been clear in any of the documents that I've read on the matter (including the Wiki, among others) whether this behavior affects the validity of the output obtained by executing the climate model. The general consensus seems to be that it doesn't, but it seems that no one who might be an authority has offered an opinion.

Aside from that, I have not been able to correlate the subject event with any other activity or event(s) that might be simultaneously occurring on my computer. In fact, most of the time that the subject event occurs, the climate model is the only application that is actively running, aside from several resident programs such as Norton AV, Spybot - Search & Destroy, or the Sunbelt Personal Firewall. And all they are doing is monitoring any network traffic that happens to occur.

Windows has apparently attempted to synchronize the time on my computer with the federal Naval Observatory one and only one time each WEEK -- whether it succeeds or fails, it waits another week to try again, and I doubt that it has ever succeeded.

There seems to be a possibility that the model is "looping", although I believe that I downloaded and installed, respectively, the most recent version of BOINC and of the climate model software. If the climate model is "looping", then the program itself is supposed to detect this and report it as a specific error condition. It hasn't done that yet.

As described above, the subject event appears to occur because BOINC loses contact with the CPDN monitor.

I suppose that what I need (if there isn't any known way to stop the subject event) is some assurance that I'm not just wasting my computer's resources by running climate modeling software that might have a bug in it which will invalidate its output.

|
| --- Stardance
|
| nil carborundum illegitimi
ID: 30572

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 30574 - Posted: 20 Sep 2007, 7:23:03 UTC
Last modified: 20 Sep 2007, 7:23:36 UTC

There are 5 README files near the top of our alternative forum, here, where we put hints, tips, and help info.

The error message in question is in: Crashes and other problems, 3rd from the top.

It's usually to do with synchronising the internal clock, but I think I've seen another explanation somewhere, perhaps on the BOINC/dev boards.

Supposedly, the message itself is a leftover from a very early version of BOINC and, as the Hitchhiker's Guide to the Galaxy (2nd edition) says in its description of Earth, it's: Mostly harmless.

However, a few people have reported crashes associated with it.
If the model doesn't crash, then the data is OK.


Backups: Here
ID: 30574

Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 30576 - Posted: 20 Sep 2007, 9:50:17 UTC
Last modified: 20 Sep 2007, 10:31:13 UTC

[Stardance wrote] There seems to be a possibility that the model is "looping", although I believe that I downloaded and installed, respectively, the most recent version of BOINC and of the climate model software. If the climate model is "looping", then the program itself is supposed to detect this and report it as a specific error condition. It hasn't done that yet.

There haven't been any reports of coupled models looping with current science app. versions. That doesn't mean, of course, that yours can't be the first. It's easy to check: just note the date from the graphics and check occasionally - if it continually rewinds (i.e. beyond the built-in day, month, year cycle) then it's a looper.

One of my models had that message a few days ago and carried on unaffected: it's repetition that's more of a concern. There were still 60 checkpoints between the bracketing trickles, so it looks like the checkpoint rewind worked on that occasion.

18/09/2007 14:05:47|climateprediction.net|[checkpoint_debug] result hadcm3inct_clip_1920_160_35868093_4 checkpointed
18/09/2007 14:27:22|climateprediction.net|[checkpoint_debug] result hadcm3inct_clip_1920_160_35868093_4 checkpointed
18/09/2007 14:27:30|climateprediction.net|Task hadcm3inct_clip_1920_160_35868093_4 exited with zero status but no 'finished' file
18/09/2007 14:27:30|climateprediction.net|If this happens repeatedly you may need to reset the project.
18/09/2007 14:27:30|climateprediction.net|Restarting task hadcm3inct_clip_1920_160_35868093_4 using hadcm3i version 540
18/09/2007 14:47:16|climateprediction.net|[checkpoint_debug] result hadcm3inct_clip_1920_160_35868093_4 checkpointed
18/09/2007 15:07:15|climateprediction.net|[checkpoint_debug] result hadcm3inct_clip_1920_160_35868093_4 checkpointed
ID: 30576

old_user470806
Joined: 6 Sep 07
Posts: 47
Credit: 23,188
RAC: 0
Message 30586 - Posted: 20 Sep 2007, 22:11:43 UTC - in response to Message 30574.  

There are 5 README files near the top of our alternative forum, here, where we put hints, tips, and help info.

The error message in question is in: Crashes and other problems, 3rd from the top.

.....

However, a few people have reported crashes associated with it.
If the model doesn't crash, then the data is OK.


FWIW, I've already read -- multiple times -- at least three of the README files to which you refer (running the climate model, crashes and other problems, backups), and plenty of others such as the unofficial BOINC wiki and the official BOINC wiki. The consensus seems to be that the subject event occurs because BOINC and/or the CPDN monitor lose contact -- i.e., the "heartbeat" signal hasn't been received by one or the other or by both, so they shut down.

As to why the heartbeat signal disappears, no one really seems to know. On my computer, at least, it isn't from Windows synchronizing the clock, because an attempt to do that occurs only once per week.

On the face of it, the subject event doesn't seem to do any harm. It is an orderly response to an error condition which, we hope, doesn't arise from anything that the climate model is doing computationally.


|
| --- Stardance
|
| nil carborundum illegitimi
ID: 30586

old_user470806
Joined: 6 Sep 07
Posts: 47
Credit: 23,188
RAC: 0
Message 30587 - Posted: 20 Sep 2007, 22:22:47 UTC - in response to Message 30576.  

[quote] .... There haven't been any reports of coupled models looping with current science app. versions. That doesn't mean, of course, that yours can't be the first. It's easy to check: just note the date from the graphics and check occasionally - if it continually rewinds (i.e. beyond the built-in day, month, year cycle) then it's a looper.[/quote]

Alas, the current version of nVidia Forceware does NOT fully support OpenGL 2.0 (contrary to nVidia's apparent claim), so there is no graphics output available to examine. I have no way of determining what the climate model is doing.

[quote] One of my models had that message a few days ago and carried on unaffected: it's repetition that's more of a concern. There were still 60 checkpoints between the bracketing trickles, so it looks like the checkpoint rewind worked on that occasion.[/quote]

If repetition is a concern, then I suppose that I have something to be concerned about. :-( I haven't seen any message yet that mentions that BOINC or the model has "checkpointed". I have seen only one "trickle" reported, a few days ago. The model has been running, at present, a total of 103 hours and some minutes, which is reported as 4.8% of the work. The estimate of time that will be required has increased from the initial 1009 hours to 1017 hours, and if 103 hours is 4.8% of the total run, it will take about 2,145 hours to finish (assuming a linear relationship between the time spent and progress to completion).
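The linear extrapolation in the last sentence can be written out as a short Python sketch. The function name is illustrative, and linearity is the poster's stated assumption rather than a known property of the model:

```python
def projected_total_hours(elapsed_hours, fraction_done):
    """Extrapolate total run time, assuming progress is linear in time."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in the range (0, 1]")
    return elapsed_hours / fraction_done


elapsed = 103            # hours of run time reported so far
fraction = 0.048         # 4.8% of the work reported as done

total = projected_total_hours(elapsed, fraction)
remaining = total - elapsed
print(f"projected total: {total:.1f} h, remaining: {remaining:.1f} h")
```

The projection is a little under 2,146 hours, roughly double the 1017 hours that BOINC itself estimates.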


|
| --- Stardance
|
| nil carborundum illegitimi
ID: 30587

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 30588 - Posted: 20 Sep 2007, 22:43:11 UTC


You only get a 'checkpoint' message if you have some diagnostics file/options running. I wasn't interested in this, so didn't take much notice when I read about it.

If no other projects are running, and only one climate model, then the time of the last checkpoint will be the date stamp on the client_state.xml file.
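Reading that date stamp does not require the graphics. A minimal Python sketch, assuming the data directory reported in the startup log at the top of the thread; the function name is illustrative, and the result only approximates the last checkpoint under the conditions described above (one climate model, no other projects):

```python
import os
from datetime import datetime


def last_modified(path):
    """Return the file's last-modified time as a local datetime."""
    return datetime.fromtimestamp(os.path.getmtime(path))


if __name__ == "__main__":
    # Assumed location, taken from the 'Data directory' line of the startup log.
    state_file = r"C:\Program Files\BOINC\client_state.xml"
    if os.path.exists(state_file):
        print("client_state.xml last written:", last_modified(state_file))
    else:
        print("client_state.xml not found at", state_file)
```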

In the coupled model, trickles are on 3rd/4th December each model year, but this requires the graphics. :(
These as you know get mentioned in the Messages tab.

I thought that the graphics only required OpenGL 1.2, but perhaps that's changed.


Backups: Here
ID: 30588

Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 30589 - Posted: 21 Sep 2007, 0:12:58 UTC

Sorry about the mention of checkpointing without a proper explanation. As Les says, BOINC Manager has some diagnostics that can be turned on to report various things, including the points at which the model is saved to disk.

To do this, create a file called cc_config.xml and drop it into the BOINC folder, then load the file using 'Advanced | Read config file' (in 5.10.13).

Here's mine:


<cc_config>
  <log_flags>
    <task>1</task>
    <file_xfer>1</file_xfer>
    <sched_ops>1</sched_ops>
    <checkpoint_debug>1</checkpoint_debug>
  </log_flags>
  <options>
    <save_stats_days>90</save_stats_days>
  </options>
</cc_config>


Set the content of the checkpoint_debug element to 1 for reporting and 0 for no reporting. There are lots of other parameters.
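Switching the flag can also be scripted instead of edited by hand. A minimal Python sketch using the standard library XML parser; the function name `set_log_flag` is illustrative, and the sketch only rewrites the text of a config like the one above, so the result still has to be saved as cc_config.xml and re-read by BOINC as described:

```python
import xml.etree.ElementTree as ET


def set_log_flag(xml_text, flag, enabled):
    """Return cc_config XML text with <log_flags>/<flag> set to 1 or 0."""
    root = ET.fromstring(xml_text)
    log_flags = root.find("log_flags")
    if log_flags is None:
        log_flags = ET.SubElement(root, "log_flags")
    element = log_flags.find(flag)
    if element is None:
        element = ET.SubElement(log_flags, flag)
    element.text = "1" if enabled else "0"
    return ET.tostring(root, encoding="unicode")


config = "<cc_config><log_flags><task>1</task></log_flags></cc_config>"
print(set_log_flag(config, "checkpoint_debug", True))
```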

In the coupled model, there are 360 days per year and the model is saved every six model days (i.e. 432 steps, which is 6 days x 72 steps per day) - so there are 60 saves (i.e. checkpoints) per year.
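The arithmetic in that paragraph, written out as a short Python sketch. The constant names are illustrative; the values are the ones quoted above:

```python
DAYS_PER_MODEL_YEAR = 360     # the coupled model uses a 360-day year
STEPS_PER_MODEL_DAY = 72      # timesteps per model day
DAYS_PER_CHECKPOINT = 6       # the model is saved every six model days

steps_per_checkpoint = DAYS_PER_CHECKPOINT * STEPS_PER_MODEL_DAY
checkpoints_per_year = DAYS_PER_MODEL_YEAR // DAYS_PER_CHECKPOINT
steps_per_year = DAYS_PER_MODEL_YEAR * STEPS_PER_MODEL_DAY

print("steps per checkpoint:", steps_per_checkpoint)
print("checkpoints per model year:", checkpoints_per_year)
print("steps per model year:", steps_per_year)
```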

Reporting the checkpoints can be useful for timing backups on machines that are very slow or have defective graphics (I have one each of those). In the case of crashes, it should also show how the model rewinding was applied (month, year etc. - though I've not recorded any).

In the older science apps that looped, 'repetition' meant continuous repetition of the crash, not just a couple. The isolated occurrences seem, as you say, to be benign.
ID: 30589

old_user470806
Joined: 6 Sep 07
Posts: 47
Credit: 23,188
RAC: 0
Message 30624 - Posted: 21 Sep 2007, 22:13:45 UTC - in response to Message 30588.  


You only get a 'checkpoint' message if you have some diagnostics file/options running. I wasn't interested in this, so didn't take much notice when I read about it. If no other projects are running, and only one climate model, then the time of the last checkpoint will be the date stamp on the client_state.xml file.


Thank you for the information. The following may be of interest:

client_state.xml 28KB ... 9/21/2007 5:42 PM
client_state_prev.xml 28KB ... 9/21/2007 5:15 PM


....
I thought that the graphics only required OpenGL 1.2, but perhaps that's changed.


I thought that OpenGL 2.0 was required, but the "Getting Started|Technical Requirements" page (currently) says, "The application graphics requires a graphics card that supports OpenGL, with the latest drivers." nVidia claims that its most recent Forceware driver supports OpenGL 2.0, but it actually fully supports only prior versions up to and including 1.9.

If BOINC/Climate model graphics require only OpenGL 1.2, then my computer's video subsystem should be able to display them. After I click on the "graphics available" tag on the simple view of the BOINC manager, all I see are (1) a vertical rectangular area on the left which is dark blue with a couple of narrow rectangular horizontal white fields (sometimes what appears to be a temperature scale with some numbers and red/white coloring is also in the lower left part of that area), and (2) another rectangular area on the right that is totally black, containing a small globe of the earth with the continents outlined by a white line -- this picture doesn't change.

I started running the climate model at 12:04 PM, and after almost five hours, BOINC output the following messages:

9/21/2007 4:45:10 PM| climateprediction.net |Task hadcm3iozn_cpm4_2000_80_45898965_7 exited with zero status but no 'finished' file
9/21/2007 4:45:10 PM| climateprediction.net |If this happens repeatedly you may need to reset the project.
9/21/2007 4:45:10 PM| climateprediction.net |Restarting task hadcm3iozn_cpm4_2000_80_45898965_7 using hadcm3i version 544

-----
It also put out a message that a new version of BOINC is available for download.
It doesn't say anything about how to install the update. :-(


|
| --- Stardance
|
| nil carborundum illegitimi
ID: 30624

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 30625 - Posted: 21 Sep 2007, 22:29:01 UTC


I'm not currently running the latest version of BOINC, and I can't remember what it looks like, but I believe that it displays a globe image for the climate projects.
This is just a static image, and isn't what your model is doing.

To see YOUR model, you need to go into the Tasks tab of the manager, and click on Show graphics. (Unless 'they' have changed the wording of tabs/buttons again.)

Tolu simplified the graphics at about the start of the year, to try and prevent some of the display difficulties some people were getting. A lot of the help info built into this site (as accessed from the blue menu to the left) is out of date, but can't be corrected by the moderators. And the 2 project people are somewhat busy.

************

As for the 'exited' problem, try changing the time interval that Windows uses to sync the clock. Make it 3 hours, or 7, or something obviously different to now.
Then see how long it takes for the message to appear.

ID: 30625

old_user470806
Joined: 6 Sep 07
Posts: 47
Credit: 23,188
RAC: 0
Message 30626 - Posted: 21 Sep 2007, 22:29:13 UTC - in response to Message 30589.  

Sorry about the mention of checkpointing without a proper explanation. As Les says, BOINC Manager has some diagnostics that can be turned on to report various things, including the points at which the model is saved to disk. ....


Thank you for the instructions for setting up checkpointing. I may do that after I install the upgrade of BOINC to version 5.10.13.


|
| --- Stardance
|
| nil carborundum illegitimi
ID: 30626

old_user470806
Joined: 6 Sep 07
Posts: 47
Credit: 23,188
RAC: 0
Message 30629 - Posted: 22 Sep 2007, 1:55:56 UTC - in response to Message 30625.  


....
As for the 'exited' problem, try changing the time interval that Windows uses to sync the clock. Make it 3 hours, or 7, or something obviously different to now.
Then see how long it takes for the message to appear.


Unfortunately, I cannot find a way to change the time interval (which now appears to be one week) in the Windows XP Control Panel GUI. I could probably change it by editing the registry, but I don't know enough about the Date & Time feature to be confident of finding the appropriate key(s) and making the appropriate change(s).
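For reference, the Windows Time service is generally documented as keeping this interval in the registry, as the `SpecialPollInterval` value (in seconds) of the W32Time `NtpClient` time provider, with 604800 seconds (7 days) as the usual default on a standalone machine. The Python sketch below only reads the value and changes nothing. The key path and value name are stated here as assumptions that should be checked against Microsoft's W32Time documentation before anything is edited, and the helper names are illustrative:

```python
try:
    import winreg            # available on Windows only
except ImportError:
    winreg = None

# Assumed location of the time-sync interval (under HKEY_LOCAL_MACHINE).
KEY_PATH = r"SYSTEM\CurrentControlSet\Services\W32Time\TimeProviders\NtpClient"
VALUE_NAME = "SpecialPollInterval"


def describe_interval(seconds):
    """Render a poll interval given in seconds as days and hours."""
    days, rest = divmod(seconds, 86400)
    return f"{seconds} s = {days} d {rest / 3600:g} h"


def read_poll_interval():
    """Return the interval in seconds, or None if it cannot be read."""
    if winreg is None:
        return None
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
        return value
    except OSError:
        return None


if __name__ == "__main__":
    interval = read_poll_interval()
    if interval is None:
        print("SpecialPollInterval could not be read on this machine")
    else:
        print(describe_interval(interval))
```

A value of 604800 would correspond to the one-week interval described above.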

Please be aware, the subject event (Task ... exits with zero status and no 'finished' file) happens quite predictably at approximately 5-hour intervals, and these are not "isolated" events. So far, I have simply not been able to find anything going on with my computer system that correlates with that event EXCEPT that BOINC and/or the CPDN Monitor fail to send their respective 'heartbeat' signals.

Apparently, no one knows why they would fail to send the 'heartbeat' signal, other than the fact that BOINC sometimes takes too long to make an internet connection to the CPDN server when it endeavors to do that. I don't have any evidence of that happening. Maybe the problem will be resolved in the upgrade to BOINC version 5.10.20 (which I've downloaded and not yet installed).
|
| --- Stardance
|
| nil carborundum illegitimi
ID: 30629

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 30633 - Posted: 22 Sep 2007, 4:24:12 UTC


The reason for the loss of heartbeat IS known.
It's when the BOINC gui doesn't get a response from the worker daemon for a set time.
And if this loss of contact lasts for around 15 minutes, the manager will consider the work unit lost, and begin the process of telling the project server that the model has failed, and then request a new data set to work on.

Resource intense programs running at the same time are one reason for loss of heartbeat, which is why the advice is to Suspend BOINC before running them, and Resume BOINC afterwards.

ID: 30633

old_user470806
Joined: 6 Sep 07
Posts: 47
Credit: 23,188
RAC: 0
Message 30653 - Posted: 23 Sep 2007, 1:19:19 UTC - in response to Message 30633.  

NOTE: the "subject event (message)" is "Task ... exited with zero status but no 'finished' file."


The reason for the loss of heartbeat IS known. It's when the BOINC gui doesn't get a response from the worker daemon for a set time.


That defines the situation but doesn't say why "the BOINC GUI" does not get a response from "the worker daemon". In one of the FAQs that I've read, the author stated that the science application itself monitors for a 'heartbeat' from BOINC, and exits if it doesn't receive the signal when it expects. That is what the subject event message means: CPDN Monitor did not receive a BOINC 'heartbeat' during some span of time, so it "exited with zero status" (no internal error occurred during its own execution) "but no 'finished' file" (i.e., the work unit was not completed).

However, the first of the two corresponding messages that are recorded in stderr.txt appears to be from BOINC and the second from CPDN Monitor:

....
No heartbeat from core client for 31 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
....

So it appears that BOINC doesn't receive a 'heartbeat' from CPDN for 31 seconds and it "exits", then CPDN exits because it doesn't receive a 'heartbeat' from BOINC. I don't see how BOINC actually "exits" first, though, since it seems to me that BOINC passes along the subject event message that it displays onscreen. After all, how would BOINC 'know' that CPDN is not in an error state and that there will not be a "finished file", if CPDN is still running?

The only explanation that I can think of is that when BOINC does not receive a 'heartbeat' from CPDN, it stops sending its own 'heartbeat' (and outputs the first message to stderr.txt). Cessation of the BOINC 'heartbeat' causes CPDN to issue the subject event message, output its message to stderr.txt, and actually exit (cease execution). BOINC outputs the subject event message from CPDN, then either (1) declares an unrecoverable error or (2) restarts CPDN.
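The mechanism under discussion is a heartbeat watchdog: one process gives up when it has not heard from the other for longer than a timeout. A generic Python sketch of that pattern follows. It is not BOINC or CPDN code; the class name, the function name and the 30-second timeout are assumptions, with the timeout chosen to line up with the "31 sec" in the stderr.txt message.

```python
class HeartbeatWatchdog:
    """Remembers the last heartbeat and reports when the peer has been silent too long."""

    def __init__(self, timeout_s=30, start=0):
        self.timeout_s = timeout_s
        self.last_beat = start

    def beat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        self.last_beat = now

    def expired(self, now):
        """True once more than `timeout_s` seconds have passed without a beat."""
        return now - self.last_beat > self.timeout_s


def first_expiry(beat_times, end_s):
    """Step through simulated seconds 0..end_s and return the first second
    at which the watchdog expires, or None if it never does."""
    beats = set(beat_times)
    dog = HeartbeatWatchdog()
    for now in range(end_s + 1):
        if now in beats:
            dog.beat(now)
        if dog.expired(now):
            return now
    return None


# Heartbeats arrive every second until t=100, then stop.
print(first_expiry(range(0, 101), 200))
# Heartbeats never stop, so the watchdog never fires.
print(first_expiry(range(0, 201), 200))
```

In this sketch the watchdog fires 31 simulated seconds after the last heartbeat. Note that the process doing the watching is the one that exits, so a watchdog of this kind says nothing by itself about why the heartbeats stopped.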


And if this loss of contact lasts for around 15 minutes, the manager will consider the work unit lost, and begin the process of telling the project server that the model has failed, and then request a new data set to work on.


So far, BOINC has done nothing but restart the climate model. Sunbelt Personal Firewall has yet to report any outgoing BOINC traffic in the context of the subject event. It has reported traffic when BOINC seeks to schedule a "trickle".


Resource intense programs running at the same time are one reason for loss of heartbeat, which is why the advice is to Suspend BOINC before running them, and Resume BOINC afterwards.


The subject event occurs despite the fact that there are no "resource intensive" processes in progress at the time the event happens. When I want to scan files with Ad Aware, Spybot Search & Destroy, or Norton AntiVirus, I exit BOINC and the climate model beforehand. I do that prior to running any "resource intensive" application, such as a video game. The only application competing for system resources as I write is the Firefox 2.0.0.6 browser displaying the webpage and the input field for this message. How "resource intensive" is that?

Regardless, the subject event occurs at approximately five-hour intervals despite the fact that BOINC and the climate model are using 100% of the CPU instruction cycles about 95% of the time UP TO THE POINT AT WHICH THE EVENT OCCURS. I've watched it with Sysinternals Process Explorer, which I run instead of the native Windows XP task manager. But I haven't seen anything yet that gives a hint as to why CPDN fails to send a 'heartbeat' to BOINC at five-hour intervals.

Of course, I will continue to investigate this matter, but it increasingly appears to me that CPDN probably has a "bug" that delays the issue of its 'heartbeat' signal once in every five hours of execution.
|
| --- Stardance
|
| nil carborundum illegitimi
ID: 30653

old_user470806
Joined: 6 Sep 07
Posts: 47
Credit: 23,188
RAC: 0
Message 30654 - Posted: 23 Sep 2007, 2:04:46 UTC - in response to Message 30589.  

Thanks, Iain.

To do this {institute checkpointing}, create a file called cc_config.xml and drop it into the BOINC folder, then load the file using 'Advanced | Read config file' (in 5.10.13).

Here's mine:


<cc_config>
  <log_flags>
    <task>1</task>
    <file_xfer>1</file_xfer>
    <sched_ops>1</sched_ops>
    <checkpoint_debug>1</checkpoint_debug>
  </log_flags>
  <options>
    <save_stats_days>90</save_stats_days>
  </options>
</cc_config>


Set the content of the checkpoint_debug element to 1 for reporting and 0 for no reporting. There are lots of other parameters.


FWIW, I've composed a larger cc_config.xml from a BOINC FAQ Service page. It doesn't seem necessary to go into the gory details, but here is some of the output:

9/22/2007 2:49:31 PM| climateprediction.net |[checkpoint_debug] result hadcm3iozn_cpm4_2000_80_45898965_7 checkpointed
9/22/2007 3:16:48 PM| climateprediction.net |[checkpoint_debug] result hadcm3iozn_cpm4_2000_80_45898965_7 checkpointed
9/22/2007 3:44:02 PM| climateprediction.net |[checkpoint_debug] result hadcm3iozn_cpm4_2000_80_45898965_7 checkpointed

9/22/2007 4:01:32 PM| climateprediction.net |Task hadcm3iozn_cpm4_2000_80_45898965_7 exited with zero status but no 'finished' file
9/22/2007 4:01:32 PM| climateprediction.net |If this happens repeatedly you may need to reset the project.
9/22/2007 4:01:32 PM| climateprediction.net |Restarting task hadcm3iozn_cpm4_2000_80_45898965_7 using hadcm3i version 544

9/22/2007 4:28:47 PM| climateprediction.net |[checkpoint_debug] result hadcm3iozn_cpm4_2000_80_45898965_7 checkpointed
9/22/2007 4:55:57 PM| climateprediction.net |[checkpoint_debug] result hadcm3iozn_cpm4_2000_80_45898965_7 checkpointed
9/22/2007 5:23:04 PM| climateprediction.net |[checkpoint_debug] result hadcm3iozn_cpm4_2000_80_45898965_7 checkpointed

The "pattern" that I see in the timestamps:

2:49 - 3:16 = 27 minutes
3:16 - 3:44 = 28 min.
3:44 -(4:11)= 27 min. but the model stopped & was restarted at 4:01
4:01 - 4:28 = 27 min.
4:28 - 4:55 = 27 min.
4:55 - 5:23 = 28 min.

So the model is "checkpointed" every 27 or 28 minutes unless something goes wrong in the interim. When the subject event occurs and the model is then re-started, the model is again "checkpointed" every 27 or 28 minutes.
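The same pattern can be computed with the seconds included, which also shows how much run time preceded the exit. A minimal Python sketch using the timestamps from the log above; the variable and function names are illustrative:

```python
from datetime import datetime

FMT = "%m/%d/%Y %I:%M:%S %p"

# Checkpoint times before the exit, and the exit itself, from the log above.
checkpoints = [
    "9/22/2007 2:49:31 PM",
    "9/22/2007 3:16:48 PM",
    "9/22/2007 3:44:02 PM",
]
exit_time = "9/22/2007 4:01:32 PM"


def minutes_between(start, end):
    """Elapsed minutes between two BOINC log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 60


gaps = [minutes_between(a, b) for a, b in zip(checkpoints, checkpoints[1:])]
since_last_checkpoint = minutes_between(checkpoints[-1], exit_time)

for gap in gaps:
    print(f"checkpoint gap: {gap:.2f} min")
print(f"run time since last checkpoint at exit: {since_last_checkpoint:.2f} min")
```

With seconds included, both gaps are a little over 27 minutes, and the exit at 4:01:32 came 17.5 minutes after the last checkpoint. The next checkpoint, at 4:28:47, follows the restart by about 27 minutes, which is consistent with the model resuming from the 3:44:02 checkpoint and repeating the work in between.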

Is there something else that I'm not seeing? What does the "checkpointed" result mean?
|
| --- Stardance
|
| nil carborundum illegitimi
ID: 30654

MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 30659 - Posted: 23 Sep 2007, 11:19:23 UTC


'Checkpointed' means that the model's data was written out to disk. If you get a 'zero status' message, it goes back to the previous checkpoint.

So your model is making progress, overall...

I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 30659

old_user470806
Joined: 6 Sep 07
Posts: 47
Credit: 23,188
RAC: 0
Message 30783 - Posted: 2 Oct 2007, 19:26:00 UTC - in response to Message 30659.  


'Checkpointed' means that the model's data was written out to disk. If you get a 'zero status' message, it goes back to the previous checkpoint.

So your model is making progress, overall...


Thanks for letting me know that. Following are some lines from the file stderr.txt that is in the BOINC\slots\0\ subdirectory:

scan: init_data.xml
scan: NAT_VOLC.gz
scan: ocean_05yi_0_2000.gz
scan: ozone_hadcm3_2075.gz
scan: solar_v01.gz
scan: spec3a_lw_3_asol2c_hadcm3.gz
scan: spec3a_sw_3_asol2b_hadcm3.gz
scan: stderr.txt
scan: SULPC_OXIDANTS_19_A2_1990.gz
scan: SULPC_OXIDANTS_19_A2_1990.mod.gz
scan: volc_v01.gz
Not a JPEG file: starts with 0x01 0xda
Not a JPEG file: starts with 0x01 0xda
Not a JPEG file: starts with 0x01 0xda
No heartbeat from core client for 31 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
scan: 1002_flux_corr.anc.gz
scan: atmos_bmet_0_2000.gz
scan: boinc_lockfile

Does each line which starts with the word "scan:" signify that there is an error condition with respect to the file named on that line? (Also, I have no idea which file is "not a JPEG file" or why BOINC is accessing it.)

|
| --- Stardance
|
| nil carborundum illegitimi
ID: 30783

MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 30784 - Posted: 2 Oct 2007, 20:25:32 UTC


I think 'scan' appears whenever the model starts up, as it searches through what files are available. The 'not a JPEG' message appears once each time the graphics are shown.

I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 30784

old_user470806
Joined: 6 Sep 07
Posts: 47
Credit: 23,188
RAC: 0
Message 31380 - Posted: 15 Nov 2007, 7:21:02 UTC

UPDATE: I have noticed previously that, on two occasions, there was a four-hour interval between hadcm3iozn exits, instead of an interval ranging from 4.75 - 5.5 hours. Thinking about this apparent "discrepancy", I eventually realized that, since I ordinarily ran BOINC soon after booting the computer, there was no clear way to distinguish between an hadcm3iozn exit caused by the climate-model software itself and an exit caused by the operating system or other software. So, I delayed running BOINC after booting the system as follows:

November 13: computer booted at 4:05 PM
- BOINC ran at 6:05 PM
- hadcm3iozn exited at 9:36 PM (restarted by BOINC)
- hadcm3iozn exited at 2:35 AM (restarted by BOINC)

November 14: computer booted at 3:30 PM
- BOINC ran at 5:58 PM
- hadcm3iozn exited at 8:27 PM (restarted by BOINC)
- hadcm3iozn exited at 1:25 AM (restarted by BOINC)

Clearly, the interval of 5 to 5.5 hours is with respect to the time that the computer boots, not the time that the climate model is initially run by BOINC.
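The comparison behind that conclusion can be laid out as a short Python sketch using the times listed above. The dictionary layout and names are illustrative, and the exits at 2:35 AM and 1:25 AM are assumed to fall on the following calendar day:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Times from the post; after-midnight exits are assumed to be on the next day.
SESSIONS = {
    "November 13": {
        "boot": "2007-11-13 16:05",
        "boinc_start": "2007-11-13 18:05",
        "exits": ["2007-11-13 21:36", "2007-11-14 02:35"],
    },
    "November 14": {
        "boot": "2007-11-14 15:30",
        "boinc_start": "2007-11-14 17:58",
        "exits": ["2007-11-14 20:27", "2007-11-15 01:25"],
    },
}


def hours_between(start, end):
    """Elapsed hours between two timestamps in FMT."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 3600


for day, s in SESSIONS.items():
    first_exit, second_exit = s["exits"]
    print(day)
    print(f"  boot to first exit:        {hours_between(s['boot'], first_exit):.2f} h")
    print(f"  BOINC start to first exit: {hours_between(s['boinc_start'], first_exit):.2f} h")
    print(f"  first exit to second exit: {hours_between(first_exit, second_exit):.2f} h")
```

Measured from boot, the first exit comes after about 5 to 5.5 hours on both days. Measured from the BOINC start, the figures are about 3.5 and 2.5 hours, which agree neither with each other nor with the later gaps of about 5 hours.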

What Windows XP (fully current) or other software "running resident" -- or perhaps the AMD Athlon XP 2400+ CPU -- does every five hours or so that causes HADcm3iozn to exit is a complete mystery to me. The only thing that I am reasonably sure of is that it is not caused by updating or attempting to update the computer's time clock. That occurs at intervals of 7 days, and, as far as I can determine, an actual update has probably never been effected; regardless, the interval is not changeable without editing the registry.

Any ideas??

|
| --- Stardance
|
| nil carborundum illegitimi
ID: 31380

MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 31387 - Posted: 15 Nov 2007, 8:42:59 UTC
Last modified: 15 Nov 2007, 8:49:58 UTC

If you right-click on the system clock, adjust date/time, and then untick 'automatically synchronise with an internet time server', does it continue doing this?


There was an issue with one series of nVidia chipsets (nForce2), so older motherboards based on that sometimes do it. This has only been reported two or three times so it's very rare. See the following post:
http://www.climateprediction.net/board/viewtopic.php?t=4870&postdays=0&postorder=asc&start=0#43701

This isn't mentioned in Les's sticky. Les, is it worth adding 'If you have an nForce2 motherboard, see this'?
I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 31387

old_user470806
Joined: 6 Sep 07
Posts: 47
Credit: 23,188
RAC: 0
Message 31395 - Posted: 16 Nov 2007, 1:18:32 UTC - in response to Message 31387.  

If you right-click on the system clock, adjust date/time, and then untick 'automatically synchronise with an internet time server', does it continue doing this?


I have unticked the option now. It will be several hours before I can determine whether having the option disabled eliminates the exits by the climate-model, so I'll let you know then.

There was an issue with one series of nVidia chipsets (nForce2), so older motherboards based on that sometimes do it. This has only been reported two or three times so it's very rare. See the following post:
http://www.climateprediction.net/board/viewtopic.php?t=4870&postdays=0&postorder=asc&start=0#43701

....


Funny you should mention it. The mainboard in my computer is an Asus A7N8X-VM which apparently has an NVIDIA nForce2 chipset since I have the Rev. 44.08 drivers for it on a CD-ROM that came with the machine. (I bought it on September 11, 2003.) I'll check out the hyperlink that you've posted. Thanks!!

|
| --- Stardance
|
| nil carborundum illegitimi
ID: 31395

©2024 climateprediction.net