stuck task (1006)

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4347 Credit: 16,541,921 RAC: 6,087	Message 70629 - Posted: 8 Mar 2024, 10:31:04 UTC I have a task that is stuck on 98.058%. Task properties shows --- against time since last checkpoint. Restarting client and manager makes no difference. 1. Is there anything I can do to bump start the task? 2. Is there anything I can look at to try and work out what has gone wrong? Nothing in event log or event log backup from before re-starting. I shall wait till after the second task has completed before looking at whether I can start the wah2_8.29_windows_intelx86.exe manually. (Or I could just abort.) ID: 70629

Glenn Carver Joined: 29 Oct 17 Posts: 809 Credit: 13,604,352 RAC: 5,068	Message 70630 - Posted: 8 Mar 2024, 11:30:10 UTC - in response to Message 70629. Last modified: 8 Mar 2024, 11:33:39 UTC

Hi Dave, Are you sure the task processes are actually running? That's the first thing to check. There are 3 processes per task. Count the number of wah2_8.29_windows_intelx86.exe, wah2am* & wah2rm* processes you have running in Task Manager (or Resource Manager). If the numbers don't match however many tasks boincmgr shows, let me know.

If they match, then have a look in the model log files to see what's going on. First, find the folder for the task in question. Using one of mine as an example, if boincmgr shows the name of the task as 'wah2_eas25_a3pf_200912_24_1007_012269659_0', go to your boinc data folder (might be hidden), then projects\climateprediction.net, and you should see a folder of the same name but without the trailing _0 (task try number).

In the task folder you should see a text file: stdout_mon.txt. This contains a print of the timesteps completed. Check the 'Date modified' column: was the file updated recently? It's normally updated every few minutes. Open the file up. It'll contain lines like this:

wah2_eas25_a3pf_200912_24_1007_012269659 - PH 1 TS 0131015 A - 12/11/2010 17:45 - H:M:S=0061:51:37 AVG= 1.70 DLT= 0.30
wah2_eas25_a3pf_200912_24_1007_012269659 - PH 1 TS 0131016 A - 12/11/2010 18:00 - H:M:S=0061:51:37 AVG= 1.70 DLT= 0.28
wah2_eas25_a3pf_200912_24_1007_012269659 - PH 1 TS 0131017 A - 12/11/2010 18:15 - H:M:S=0061:51:38 AVG= 1.70 DLT= 0.47

Your names & numbers will be different, but that's a timestep log of how far the model has got. You can reload this file every few minutes to see if the lines have changed. What you're looking for is changes to the middle of the line: .. A - 12/11/2010 18:15 .... That's the current model date & time.

If that shows the model is not progressing, despite the process running, that's unusual. Never seen that before. Zip up the files: stdout_mon.txt, stdout_rm.txt, stdout_um.txt in the task folder, together with stderr.txt in the task's slot folder, and email the zip to me. I'll take a look and see what's going on.

p.s. had you made any hand edits to the client_state.xml file at all?
pp.s. it's not possible to 'hand-start' the task. It has to be done under boinc.
--- CPDN Visiting Scientist ID: 70630
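The stdout_mon.txt check described above can be sketched in Python. This is only an illustration: the regular expression is inferred from the three sample lines quoted in the post, so the real log format may differ, and the function names `last_timestep` and `has_progressed` are hypothetical helpers, not part of BOINC or the CPDN model code.

```python
import re
from typing import Optional, Tuple

# Assumption: pattern inferred from the sample lines in the post above.
# Fields: task name, phase (PH), timestep (TS), a flag, model date/time.
LINE_RE = re.compile(
    r"^(?P<task>\S+) - PH\s+(?P<phase>\d+)\s+TS\s+(?P<ts>\d+)\s+\S+\s+-\s+"
    r"(?P<model_time>\d{2}/\d{2}/\d{4} \d{2}:\d{2})\s+-"
)

SAMPLE = """\
wah2_eas25_a3pf_200912_24_1007_012269659 - PH 1 TS 0131015 A - 12/11/2010 17:45 - H:M:S=0061:51:37 AVG= 1.70 DLT= 0.30
wah2_eas25_a3pf_200912_24_1007_012269659 - PH 1 TS 0131016 A - 12/11/2010 18:00 - H:M:S=0061:51:37 AVG= 1.70 DLT= 0.28
wah2_eas25_a3pf_200912_24_1007_012269659 - PH 1 TS 0131017 A - 12/11/2010 18:15 - H:M:S=0061:51:38 AVG= 1.70 DLT= 0.47
"""


def last_timestep(log_text: str) -> Optional[Tuple[int, str]]:
    """Return (timestep, model date/time string) from the last parsable line."""
    result = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            # The model date is kept as a string: the day/month order
            # is not stated in the thread, so it is not parsed here.
            result = (int(match.group("ts")), match.group("model_time"))
    return result


def has_progressed(earlier: str, later: str) -> bool:
    """True if the later snapshot's last timestep is ahead of the earlier one."""
    before, after = last_timestep(earlier), last_timestep(later)
    if before is None or after is None:
        return False
    return after[0] > before[0]
```

To use it, read the file twice a few minutes apart and pass both snapshots to `has_progressed`; a `False` result with the processes still running corresponds to the unusual case Glenn describes.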

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4347 Credit: 16,541,921 RAC: 6,087	Message 70631 - Posted: 8 Mar 2024, 14:18:23 UTC - in response to Message 70630. Just two tasks in the VM running Windows (tiny10). One wah2_8.29_windows_intelx86.exe is using only ... Both say they are using 0.5MB RAM but only one has any disk usage. Two wah2am processes are running, one using just 0.8MB RAM, the other 152.1MB. Two wah2rm processes show as running, one averaging about 26% CPU usage (the VM has 4 CPUs allocated); that one shows about 257MB of RAM. The other shows 0% CPU usage and just 1MB of RAM. Pretty obvious that the task crashed, I think. Three or four instances of this in stderr:
Model crashed: READHIST: End of file in READ from history file for namelist NLIHISTO tmp/xadae.pipe_dummy
I couldn't find anything informative in the other files you mentioned. There was a power outage for five minutes, which is probably relevant. If you still want the files to check I can send them, but I am not hopeful of finding much. I am just glad none of my Linux testing branch tasks suffered! ID: 70631
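The "3 processes per task" count suggested earlier in the thread could be sketched as below. This is a rough illustration under stated assumptions: the prefixes `wah2_`, `wah2am` and `wah2rm` are taken from the process names mentioned in the posts, the list of process names is supplied by hand (for example, copied from Task Manager), and `count_task_processes` and `counts_match` are hypothetical helper names.

```python
from typing import Dict, Iterable

# Assumption: these prefixes cover the three per-task processes named in
# the thread (wah2_8.29_windows_...exe, wah2am*, wah2rm*).
PREFIXES = ("wah2_", "wah2am", "wah2rm")


def count_task_processes(names: Iterable[str]) -> Dict[str, int]:
    """Count process names by prefix, ignoring case and unrelated processes."""
    counts = {prefix: 0 for prefix in PREFIXES}
    for name in names:
        lowered = name.lower()
        for prefix in PREFIXES:
            if lowered.startswith(prefix):
                counts[prefix] += 1
                break
    return counts


def counts_match(counts: Dict[str, int], running_tasks: int) -> bool:
    """True if every process type appears once per task shown in boincmgr."""
    return all(n == running_tasks for n in counts.values())
```

As the report above shows, a matching count is only a first check: here both tasks had all three processes present, yet one set was idle at 0% CPU, so resource usage and the log files still need looking at.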

Glenn Carver Joined: 29 Oct 17 Posts: 809 Credit: 13,604,352 RAC: 5,068	Message 70632 - Posted: 8 Mar 2024, 19:41:00 UTC - in response to Message 70631. Could you please send the text files? I'd like to have a look. The monitor process has obviously disappeared, but the models should have stopped as well, as they are each checking that the other is still running. Not sure what's happening there. Thanks. --- CPDN Visiting Scientist ID: 70632
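Collecting the requested files into one zip can be done by hand, or with a short script along these lines. It is a minimal sketch: the file names come from the earlier post, while the folder paths are parameters the user must supply (they are not given in the thread) and `bundle_logs` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path
from typing import List

# File names as listed earlier in the thread.
TASK_FILES = ("stdout_mon.txt", "stdout_rm.txt", "stdout_um.txt")
SLOT_FILES = ("stderr.txt",)


def bundle_logs(task_dir: Path, slot_dir: Path, out_zip: Path) -> List[str]:
    """Zip the model logs from the task folder plus stderr.txt from the slot folder.

    Returns the names actually added; missing files are skipped, not invented.
    """
    added: List[str] = []
    with zipfile.ZipFile(out_zip, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for folder, names in ((task_dir, TASK_FILES), (slot_dir, SLOT_FILES)):
            for name in names:
                path = Path(folder) / name
                if path.is_file():
                    zf.write(path, arcname=name)
                    added.append(name)
    return added
```

The returned list makes it easy to spot a file that was not found before emailing the archive.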

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4347 Credit: 16,541,921 RAC: 6,087	Message 70633 - Posted: 8 Mar 2024, 21:32:31 UTC - in response to Message 70632. Last modified: 9 Mar 2024, 17:34:29 UTC Will send in morning. No hand edits to client_state.xml or other BOINC files. Edit: Done. It will be interesting if there is anything significant. I was a bit surprised not to lose anything else from an unplanned powerdown. I did notice that some of the aborted tasks still had their folders sitting there to delete. Not a biggie as I check these things periodically anyway and as you said, that has been fixed or will be before further batches go out. ID: 70633

Glenn Carver Joined: 29 Oct 17 Posts: 809 Credit: 13,604,352 RAC: 5,068	Message 70634 - Posted: 9 Mar 2024, 22:33:22 UTC - in response to Message 70633. Last modified: 9 Mar 2024, 22:33:51 UTC Having looked at the output, I think what's happened is the power-cut killed the task as it was writing out the history files (or checkpoint files if you want to call them that). That left them incomplete so when the model restarted it couldn't read the input it needed. The puzzle for me is why the two model processes didn't get killed as well. I should be able to create a test with a corrupted history file and see if I get the same behaviour. --- CPDN Visiting Scientist ID: 70634