HadAM3P-HadRM3P restart loop on Windows 7

MaynardVizzutti
Joined: 29 Mar 15
Posts: 3
Credit: 368,808
RAC: 920
Message 51723 - Posted: 30 Mar 2015, 0:06:31 UTC

The program will start, run for around 10 seconds and fail, then restart immediately and fail again in a never-ending loop. I didn't find any instructions on what diagnostic information to gather, but I'll be happy to collect whatever is needed.

Windows 7 SP1, 8-core Intel i7, 8GB, BOINC 7.4.42 x64. HadAM3P 7.22. It failed on the first try and has never worked on this machine.

The Coupled Model program seems to be proceeding normally, so I'll run that in the meantime. Thanks.

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1459
Credit: 76,183,576
RAC: 71,704
Message 51724 - Posted: 30 Mar 2015, 4:22:56 UTC - in response to Message 51723.

Hi, Maynard,

Welcome to the project and to the boards.

Checked your machine, found one task running and four aborted by user. Guaranteed: User aborts will kill tasks every time.

How many times did each task crash/restart on its own? What does your 'Messages' tab show?
____________
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.

MaynardVizzutti
Joined: 29 Mar 15
Posts: 3
Credit: 368,808
RAC: 920
Message 51727 - Posted: 30 Mar 2015, 21:51:31 UTC - in response to Message 51724.

Thanks for your quick reply.

At first, I was assigned one task, which I allowed to restart for approximately 5-7 minutes before aborting. I estimate something like 30-50 restarts for that task. The system's memory-in-use display oscillated up and down with the same period, which is what made me notice in the first place.
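As a quick sanity check, those estimates can be turned into a per-cycle time and compared with the "around 10 seconds" figure from the first post. This is only a back-of-the-envelope sketch; the variable names are made up for illustration and the inputs are the rough estimates quoted above, not measurements.

```python
# Figures estimated in the post above: 30-50 restarts over roughly 5-7 minutes.
restarts_low, restarts_high = 30, 50
minutes_low, minutes_high = 5, 7

# Shortest cycle: the most restarts packed into the least time.
shortest_cycle_s = minutes_low * 60 / restarts_high
# Longest cycle: the fewest restarts spread over the most time.
longest_cycle_s = minutes_high * 60 / restarts_low

print(f"Each start/fail cycle lasted roughly "
      f"{shortest_cycle_s:.0f} to {longest_cycle_s:.0f} seconds")
```

That works out to somewhere between 6 and 14 seconds per cycle, which brackets the 10-second figure.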

I hoped it was an isolated incident, but on the next batch, I received three more such jobs, which I saw behaving the same way and terminated much sooner, probably within one minute.

The fourth job I received was hadcm3n_um_6.07_windows_intelx86 *32. It is running normally, but on a side note, the deadline calls for 400 hours of CPU over 92 days, which I'm not sure I can deliver. The shorter tasks had deadlines a year away and would easily make it.

Nothing appears in the BOINC event log (with only the default logging enabled), nor did I find any log files in the project/task directories. If you have instructions for enabling better logging, I'll be happy to do it.
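For readers wondering what "better logging" might look like: the BOINC client reads extra log flags from a cc_config.xml file in its data directory. The sketch below only builds and prints such a file. The flag names task_debug and heartbeat_debug are given from memory of the BOINC client configuration documentation and should be treated as assumptions to verify for your client version; the helper name build_cc_config is made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumed flag names; confirm them against the BOINC client configuration docs.
ASSUMED_FLAGS = ("task_debug", "heartbeat_debug")


def build_cc_config(flags):
    """Build a <cc_config><log_flags>...</log_flags></cc_config> tree with each flag set to 1."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    return root


config = build_cc_config(ASSUMED_FLAGS)
if hasattr(ET, "indent"):  # pretty-printing helper, available in Python 3.9+
    ET.indent(config)
xml_text = ET.tostring(config, encoding="unicode")
print(xml_text)
```

The printed text would be saved as cc_config.xml in the BOINC data directory, after which the client needs to re-read its config files (or be restarted) before the extra messages show up in the event log.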

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 6408
Credit: 16,839,542
RAC: 21,887
Message 51728 - Posted: 30 Mar 2015, 22:02:25 UTC - in response to Message 51727.
Last modified: 31 Mar 2015, 2:19:49 UTC

1) As posted VERY regularly, there is NO "deadline" for returning the data. BOINC simply requires that one be set, and that requirement can't be bypassed.

2) As for the error messages, they appear under Stderr on each model's page. Click the plus sign to expand the list.

MaynardVizzutti
Joined: 29 Mar 15
Posts: 3
Credit: 368,808
RAC: 920
Message 51730 - Posted: 31 Mar 2015, 0:53:45 UTC - in response to Message 51728.

Thanks for pointing me to the error messages. Here is a sample:

18:41:05 (5248): BOINC client no longer exists - exiting
18:41:05 (5248): timer handler: client dead, exiting
CPDN Monitor - No 'heartbeat' from BOINC...
18:41:49 (10872): BOINC client no longer exists - exiting
18:41:49 (10872): timer handler: client dead, exiting
CPDN Monitor - No 'heartbeat' from BOINC...
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=9760, selfPID=9760, iMonCtr=2
18:42:00 (8224): BOINC client no longer exists - exiting
18:42:00 (8224): timer handler: client dead, exiting
CPDN Monitor - No 'heartbeat' from BOINC...
18:42:11 (6156): BOINC client no longer exists - exiting
18:42:11 (6156): timer handler: client dead, exiting
CPDN Monitor - No 'heartbeat' from BOINC...
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=8332, selfPID=8332, iMonCtr=2

And so on.

I saw a similar sequence in another thread, but the program identified in that case as the culprit is not installed on my machine. I'll assume the virus/firewall protection is a good place to start looking and will try some things when my current task nears completion. Thanks to both of you for your help.
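For anyone comparing their own Stderr output against the sample above, a short script can pull out the exit times and process IDs and show how far apart the failures are. This is a rough sketch that assumes the lines look exactly like the ones quoted in this post; the names EXIT_LINE, find_exits and gaps_in_seconds are made up for the example and are not part of BOINC or the CPDN software.

```python
import re
from datetime import datetime

# Sample copied from the Stderr excerpt quoted in this thread.
STDERR_SAMPLE = """\
18:41:05 (5248): BOINC client no longer exists - exiting
18:41:05 (5248): timer handler: client dead, exiting
CPDN Monitor - No 'heartbeat' from BOINC...
18:41:49 (10872): BOINC client no longer exists - exiting
18:41:49 (10872): timer handler: client dead, exiting
CPDN Monitor - No 'heartbeat' from BOINC...
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=9760, selfPID=9760, iMonCtr=2
18:42:00 (8224): BOINC client no longer exists - exiting
18:42:00 (8224): timer handler: client dead, exiting
CPDN Monitor - No 'heartbeat' from BOINC...
18:42:11 (6156): BOINC client no longer exists - exiting
18:42:11 (6156): timer handler: client dead, exiting
CPDN Monitor - No 'heartbeat' from BOINC...
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=8332, selfPID=8332, iMonCtr=2
"""

# Matches lines such as "18:41:05 (5248): BOINC client no longer exists - exiting".
EXIT_LINE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}) \((\d+)\): BOINC client no longer exists - exiting$"
)


def find_exits(text):
    """Return a list of (time, pid) pairs, one per 'client no longer exists' line."""
    exits = []
    for line in text.splitlines():
        match = EXIT_LINE.match(line.strip())
        if match:
            when = datetime.strptime(match.group(1), "%H:%M:%S")
            exits.append((when, int(match.group(2))))
    return exits


def gaps_in_seconds(exits):
    """Seconds between consecutive exits (assumes all lines are from the same day)."""
    return [int((later[0] - earlier[0]).total_seconds())
            for earlier, later in zip(exits, exits[1:])]


exits = find_exits(STDERR_SAMPLE)
pids = [pid for _, pid in exits]
gaps = gaps_in_seconds(exits)
heartbeat_losses = STDERR_SAMPLE.count("No 'heartbeat' from BOINC")

print("PIDs that exited:", pids)
print("Seconds between exits:", gaps)
print("Heartbeat-loss messages:", heartbeat_losses)
```

On the sample above this reports four different PIDs, four heartbeat-loss messages, and gaps of 44, 11 and 11 seconds between exits, in line with the roughly 10-second restart cycle described at the start of the thread.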

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 6408
Credit: 16,839,542
RAC: 21,887
Message 51734 - Posted: 1 Apr 2015, 1:51:38 UTC

Your problem has been identified and, coincidentally, another cruncher has posted about it as well.
I have answered him here.


Copyright © 2017 climateprediction.net