Message boards : Number crunching : Batch 1015 Discussion/problems
Author | Message |
---|---|
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Batch 1015 is being released now. This is the next batch in the East Asia 25 km configuration (eas25). I just got one of these on my little machine (Computer 1512658). It has a little over half an hour on it now, so it did not crash on start-up. It predicts about 16 days to go. |
Send message Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944 |
Six downloaded here. (Placeholder for this batch) I assume this means you have tracked down the problem Glenn? |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I now have two on my pipsqueak Windows 10 machine (ID: 1512658). They both seem to be running OK. Predicted to take about 16 days, but by eyeball it looks like they will be a little faster than that. |
Send message Joined: 12 Apr 21 Posts: 318 Credit: 15,030,773 RAC: 4,296 |
It seems that the task directory and files that should go into the slots directory still go into the projects/climateprediction.net directory. When I ran out of work a couple of days ago I cleaned out all of the older ones, but when I got new work today, new ones appeared. |
Send message Joined: 29 Oct 17 Posts: 1052 Credit: 16,817,940 RAC: 12,877 |
"It seems that the task directory and files that should go into the slots directory still go into the projects/climateprediction.net directory. When I ran out of work a couple of days ago I cleaned out all of the older ones, but when I got new work today, new ones appeared." This is changed in the next release. To keep results consistent within a running project, we keep the version the same for all batches in that project. --- CPDN Visiting Scientist |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
One of the two 1015 tasks on my pipsqueak machine has accomplished its first trickle. Another potential obstacle overcome. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,816,918 RAC: 4,574 |
Slightly off topic, but related to current issues. I've been sent a batch 1007 resend: wah2_eas25_a1cu_199312_24_1007_012266614_1 The previous user had an Intel i9, but only managed three trickles in a fortnight. My i5 will probably run through it in 8 - 9 days, but is it worth it? |
Send message Joined: 29 Oct 17 Posts: 1052 Credit: 16,817,940 RAC: 12,877 |
Yes, definitely. Batch 1007 is a valid batch. Don't abort it! 1006 & 1007 might be hitting the deadline for volunteers who have not yet started tasks. That might be why resends are coming. --- CPDN Visiting Scientist |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"One of the two 1015 tasks on my pipsqueak machine has accomplished its first trickle. Another potential obstacle overcome." And now each of those two tasks has delivered four trickles. |
Send message Joined: 2 Jul 15 Posts: 21 Credit: 4,256,934 RAC: 1,418 |
I had a Batch 1015 task abort overnight. Only other CPDN tasks would have been running at the time. Everything else was quiesced. Name: wah2_eas25_a3id_201912_24_1015_012281245_0; Workunit: 12281245; Created: 15 Apr 2024, 10:43:22 UTC; Sent: 15 Apr 2024, 14:43:43 UTC; Report deadline: 24 Jul 2024, 14:43:43 UTC; Received: 17 Apr 2024, 8:27:45 UTC; Server state: Over; Outcome: Computation error; Client state: Compute error; Exit status: 0 (0x00000000); Computer ID: 1367467; Run time: 1 days 5 hours 52 min 58 sec; CPU time: 1 days 5 hours 52 min 58 sec; Validate state: Invalid; Credit: 1,678.16; Device peak FLOPS: 3.48 GFLOPS; Application version: Weather At Home 2 (wah2) (region independent) v8.29 windows_intelx86. Another Batch 1015 task is still running ... as is a Batch 1005 task that started mysteriously last week running version 8.24. |
Send message Joined: 29 Oct 17 Posts: 1052 Credit: 16,817,940 RAC: 12,877 |
I had a look: there's no 'stderr' output on the task webpage, which is the task log, so I can't see why the model failed. Though the fact that there's no stderr output on the task page is itself a clue. I notice the PC only has 8 GB RAM. How much RAM do you allocate for BOINC? And how many CPDN tasks do you have running at a time? I suspect a problem with available memory. Also, memory can get fragmented on Windows (similar to disk fragmentation). It's not impossible that the task died because it couldn't allocate a memory segment big enough. The best way to clear memory fragmentation is to reboot the machine. That's all I can help with on this. --- CPDN Visiting Scientist |
Send message Joined: 2 Jul 15 Posts: 21 Credit: 4,256,934 RAC: 1,418 |
Thank you, Glenn. I've tended to allow the computer to run steadily while CPDN projects are active. I will change my paradigm to begin rebooting at the end of every day, quiescing the CPDN tasks before doing so and then activating them afterward. Following is my profile; I welcome suggestions to improve it. Thank you. When computer is in use ('in use' means mouse/keyboard input in last 5 minutes): Suspend all computing: No; Suspend GPU computing: No; Use at most 75 % of the CPUs; Use at most 50 % of CPU time; Suspend when non-BOINC CPU usage is above 50 %; Use at most 50 % of memory. When computer is not in use (the first three settings require BOINC 7.20.3+): Use at most 75 % of the CPUs; Use at most 75 % of CPU time; Suspend when non-BOINC CPU usage is above 75 %; Use at most 75 % of memory; Suspend when no mouse/keyboard input in last --- minutes. General: Suspend when computer is on battery: N/A; Switch between tasks every 60 minutes; Request tasks to checkpoint at most every 60 seconds; Leave non-GPU tasks in memory while suspended: Yes; Store at least --- days of work; Store up to an additional 0.25 days of work; Compute only between ---. Disk: Use no more than 250 GB; Leave at least 0.001 GB free; Use no more than 50 % of total; Page/swap file: use at most 75 %. Network: Limit download rate to --- KB/second; Limit upload rate to --- KB/second; Limit usage to --- MB every --- days; Transfer files only between ---; Skip data verification for image files; Confirm before connecting to Internet; Disconnect when done. |
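The memory question above comes down to simple arithmetic, which can be sketched roughly as follows. This is only an illustration: the 8 GB total and the 50 % / 75 % limits come from the posts above, but the per-task memory figure (`PER_TASK_GB`) is a made-up placeholder, not a measured value for wah2 tasks, and the function names are hypothetical.

```python
def boinc_memory_budget_gb(total_ram_gb: float, percent_allowed: float) -> float:
    """RAM (in GB) that BOINC may use under a 'use at most N % of memory' setting."""
    return total_ram_gb * percent_allowed / 100.0


def max_concurrent_tasks(total_ram_gb: float, percent_allowed: float,
                         per_task_gb: float) -> int:
    """How many tasks of a given size fit inside the BOINC memory budget."""
    if per_task_gb <= 0:
        raise ValueError("per_task_gb must be positive")
    budget = boinc_memory_budget_gb(total_ram_gb, percent_allowed)
    return int(budget // per_task_gb)


# ASSUMPTION: placeholder per-task footprint; check the real figure in the
# BOINC Manager or Task Manager before relying on this.
PER_TASK_GB = 1.5

print("in use  :", max_concurrent_tasks(8, 50, PER_TASK_GB), "tasks")
print("not used:", max_concurrent_tasks(8, 75, PER_TASK_GB), "tasks")
```

If more tasks are allowed to run (via the CPU percentage) than fit in the memory budget, some will wait or compete for memory, which is the situation the question above is probing.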
Send message Joined: 2 Oct 19 Posts: 21 Credit: 47,674,094 RAC: 24,265 |
I picked up 32 of these batch 1015 tasks on a 5950X with 64 GB RAM. One failed after the 2nd trickle. The others seem to be doing fine so far. Here's a link to the one that failed: link to task Stderr output: <core_client_version>7.22.2</core_client_version> <![CDATA[ <message> The system cannot find the drive specified. (0xf) - exit code 15 (0xf)</message> <stderr_txt> modelGetExecutables: check control files, strTemp0 & 1 : C:\ProgramData\BOINC/projects/climateprediction.net/wah2_eas25_a1j8_201312_24_1015_012278684/jobs/xadae.namelists C:\ProgramData\BOINC/projects/climateprediction.net/wah2_eas25_a1j8_201312_24_1015_012278684/jobs/xacxf.namelists modelGetExecutables: unzipping control files : strInput & strTmp wah2_eas25_a1j8_201312_24_1015_012278684.zip wah2_eas25_a1j8_201312_24_1015_012278684/jobs gstrDump[0] = generic_phase1_spinup_eas25_global_aabaka gstrDump[1] = generic_phase1_spinup_eas25_regional_aabaka global model: command string: "C:\ProgramData\BOINC/projects/climateprediction.net/wah2am3m2_um_8.29_windows_intelx86.exe" wah2_eas25_a1j8_201312_24_1015_012278684 generic_phase1_spinup_eas25_global_aabaka ic19610812_12_N96 AERclim_ancil_168months_CMIP6-MIROC6_SST_2009-01-01_2022-12-30_v2404 AERclim_ancil_168months_CMIP6-MIROC6_SIC_2009-01-01_2022-12-30_v2404 SO2DMS_N96_cmip6hist-ssp245_2009-2020 oxi.addfa ozone_cmip6hist-ssp245_N96_1979_2031 regional model: command string: "C:\ProgramData\BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.29_windows_intelx86.exe" wah2_eas25_a1j8_201312_24_1015_012278684 cpdn_check_running: got RM PID of zero; ignoring this call and waiting for PID via shMem. cpdn_check_running: got RM PID of zero; ignoring this call and waiting for PID via shMem. executeModelProcess: MonID=4856, GCM_PID=19780, RCM_PID=1608 Global Worker:: CPDN process is not running, exiting, bRetVal = T, checkPID = 19780, selfPID = 19780, iMonCtr = 1 </stderr_txt> |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"And now each of those two tasks has delivered four trickles." And now eight trickles. |
Send message Joined: 29 Oct 17 Posts: 1052 Credit: 16,817,940 RAC: 12,877 |
The error message "The system cannot find the drive specified." is a Windows issue. It has a number of possible causes, such as a disk timeout or a failing drive. Might be worth doing a SMART check on the drive concerned if it keeps happening. This error accounts for about 10% of CPDN WAH task fails. --- CPDN Visiting Scientist |
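For anyone who wants to script the SMART check mentioned above, here is a minimal sketch. It assumes the smartmontools `smartctl` utility is installed and on the PATH (it is not part of Windows or BOINC), and the device name passed in is platform-specific and left to the reader. The helper names are hypothetical.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_smart_health(report: str) -> Optional[str]:
    """Extract the overall health verdict from `smartctl -H` text output."""
    # ATA/NVMe drives report a line such as:
    #   SMART overall-health self-assessment test result: PASSED
    match = re.search(r"overall-health self-assessment test result:\s*(\S+)", report)
    if match:
        return match.group(1)
    # SCSI/SAS drives report "SMART Health Status: OK" instead.
    match = re.search(r"SMART Health Status:\s*(\S+)", report)
    return match.group(1) if match else None


def check_drive(device: str) -> Optional[str]:
    """Run `smartctl -H <device>`; return the verdict, or None if unavailable."""
    if shutil.which("smartctl") is None:
        return None  # smartmontools not installed
    result = subprocess.run(["smartctl", "-H", device],
                            capture_output=True, text=True, check=False)
    return parse_smart_health(result.stdout)

# Example call (device name is an assumption; usually needs admin rights):
#   print(check_drive("/dev/sda"))
```

A PASSED verdict only means the drive's own self-assessment is clean; it does not rule out timeouts or other causes of the error.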
Send message Joined: 2 Oct 19 Posts: 21 Credit: 47,674,094 RAC: 24,265 |
"The error message ..." Thanks for looking into this, Glenn. I'll run a SMART check on the drive when all the tasks are completed. |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
"The error message ..." One would think that. But Windows being Windows, there are also apparently some ways it can show up that aren't "drive failure" related, as I've just learned... I've had two on a machine that, as far as I can tell, has a perfectly good drive (NVMe drive with zero reported problems, virtio block devices through to the Windows 10 VM doing the compute these days). https://www.cpdn.org/result.php?resultid=22418452 https://www.cpdn.org/result.php?resultid=22418464 I don't know what the codebase looks like, but according to: https://superuser.com/questions/1807763/inexplicable-the-system-cannot-find-the-drive-specified-how-to-solve-it and https://stackoverflow.com/questions/19843849/unexpected-the-system-cannot-find-the-drive-specified-in-batch-file that error message can occur when something in a batch file is misinterpreted as a drive path, not what it's supposed to be. I'm sure some are the result of a failing drive, but when perfectly good hosts on modern drives are throwing it (without any subsequent errors in other tasks), it seems worth pulling the "weird corner case in a batch file" thread a bit. I'd have assumed it was purely a failing drive error message too, but... apparently not. |
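To make the "misinterpreted as a drive path" idea concrete, here is a small sketch of how Windows-style path parsing picks out a drive. It uses Python's standard `pathlib.PureWindowsPath`, which works on any OS. The first path is copied from the stderr output earlier in the thread; the `X:report.txt` string is a hypothetical example. Nothing here claims to show what the CPDN code actually does.

```python
from pathlib import PureWindowsPath
from typing import Tuple


def split_drive(raw: str) -> Tuple[str, str]:
    """Return (drive, normalised path) using Windows path rules."""
    path = PureWindowsPath(raw)
    return path.drive, str(path)


# Path from the stderr log: note the mix of backslashes and forward slashes.
logged = (r"C:\ProgramData\BOINC/projects/climateprediction.net/"
          r"wah2_eas25_a1j8_201312_24_1015_012278684/jobs/xadae.namelists")
drive, normalised = split_drive(logged)
print(drive)        # drive component
print(normalised)   # separators normalised to backslashes

# Hypothetical string: a colon in second position makes Windows rules treat
# the first letter as a drive, even though no such drive may exist.
print(split_drive("X:report.txt"))
```

Mixed separators like those in the log are normally accepted by Windows, so this is a way to inspect suspicious strings rather than evidence of a bug.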
Send message Joined: 29 Oct 17 Posts: 1052 Credit: 16,817,940 RAC: 12,877 |
The batch file examples are down to misunderstanding how wildcarding works. We don't use batch files in the Windows apps. Also, if we had an error like that it would probably fail every time. I guess most people use the default install of BOINC, so it ends up on the C: drive. If that drive gets busy due to other activities, a BOINC process's drive access might time out, particularly because the BOINC processes run at a lower priority. I'm just guessing, but maybe running in a VM might make this error more likely? (Assuming it's not a hardware issue, of course.) The other Windows-related error we see (about 15-20% of task fails) is "Invalid control block address". When I looked this up it seemed to be related to Windows Update doing something. I didn't read too far once I knew it wasn't a problem I needed to fix :D. But it's not obvious to me why Windows Update should cause an issue for a running task. Maybe someone who knows Windows better than me might have an idea. I'd be interested to know if it's potentially recoverable. --- CPDN Visiting Scientist |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"I guess most people use the default install of BOINC, so it ends up on the C: drive. If that drive gets busy due to other activities, a BOINC process's drive access might time out, particularly because the BOINC processes run at a lower priority." I am not having problems with the two 1015 tasks on my Windows 10 machine. They have now uploaded 10 trickles each. You will see that it has a small amount of RAM. It has only one drive, and it is solid state (the machine was too small to put a spinning hard drive in it). The machine is set up to run up to 7 BOINC tasks at a time, and it is currently doing that. I only bought the machine to run TaxAct once a year, and I finished that about March 15. Four times a year I run Garmin Express to update the maps in the GPS for my car. So rather than waste the machine the rest of the year, I run BOINC on it. According to my UPS, it costs me $0.93/day for the electricity to run it. It is another story for my big Linux machine, but I have not gotten any CPDN work for it since last June. Computer 1512658. Computer information: CPU type: GenuineIntel 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz [Family 6 Model 140 Stepping 1]; Number of processors: 8; Operating System: Microsoft Windows 10 Core x64 Edition, (10.00.19045.00); BOINC version: 7.24.1; Memory: 15.64 GB; Cache: 256 KB |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
"The batch file examples are down to misunderstanding how wildcarding works. We don't use batch files in the Windows apps. Also, if we had an error like that it would fail probably every time." Unless it's dependent on some particular sequence in the name. Sorry, I don't know enough Windows to reason about it deeply. But my point is mostly that this particular error message can be caused by things that are not a hardware failure on the disk, and it may be worth trying to see if something in the proximity of the failure is doing something dumber-than-desired with disk access strings.
Does Windows do that? It seems a particularly harsh error for a low priority access, especially as it kills the process. "I'll get to you when I get to you" is a lot more standard for low priority tasks under heavy disk IO; they just block on the disk IO until there's a gap to fill them. On Linux, at least, you'll see a very high "iowait" time for a process, but it won't kill it if it can't service the disk requests. "I'm just guessing, but maybe running in a VM might make this error more likely? (assuming it's not a hardware issue of course)." *shrug* I just used 'winsat disk -drive c' to test the performance of my Win10 VM, which is the only VM, on a dedicated compute rig doing nothing else, and it reported 253 MB/s for 16 KB random read, 2542 MB/s for 64 KB sequential read, and 1377 MB/s for 64 KB sequential write. I doubt disk IO is timing out on that box. I don't know what the failure rates are for Windows in general - it may be a low enough overall failure rate that it's not worth running down. |
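For hosts without `winsat`, a rough cross-platform sanity check of sequential write throughput can be sketched as below. This is not equivalent to the winsat test quoted above: the function name, the 16 MB test size and the 64 KB block size are arbitrary choices for illustration, and OS caching means the number is only a coarse indicator.

```python
import os
import tempfile
import time


def sequential_write_mb_per_s(directory: str, total_mb: int = 16,
                              block_kb: int = 64) -> float:
    """Write total_mb of random data in block_kb chunks and return MB/s."""
    block = os.urandom(block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as handle:
            for _ in range(blocks):
                handle.write(block)
            handle.flush()
            os.fsync(handle.fileno())  # force the data out to the device
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # always clean up the temporary file
    return total_mb / max(elapsed, 1e-9)

# Example: measure the drive holding the system temp directory.
#   print(round(sequential_write_mb_per_s(tempfile.gettempdir()), 1), "MB/s")
```

To test the drive BOINC actually uses, point `directory` at a folder on that drive, ideally while tasks are running, since contention is the scenario under discussion.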
©2024 cpdn.org