Message boards : Number crunching : #1020,1,2,3...
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
The first of these has gone out. I have about 10 minutes left on the project time out before I can get some. This is the one with all forcings. 1020 EASHA 5,000 tasks WAH2 East Asia 25km 1986-2010 |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
In the last 4 hours, my pipsqueak Windows 11 machine (computer 1512658) got four 1020 tasks, one each hour, and they all seem to be running OK. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Just two running on my Ryzen9. I have upped the number of cores the VM can use so probably will get some more before too much longer. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
4 more running. Server status page says down to 457 of the first batch left. They will all be gone by the time I get up to check on things tomorrow. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
#1021 6048 tasks ALL WAH2 East Asia 25km has been released. I have four running alongside 6 from #1020. #1022 5040 tasks NAT WAH2 East Asia 25km has gone out too. Time to get those cores crunching, but please don't download lots for the cache, as the researcher does want results back as quickly as possible. In winter my 16 CPU Ryzen9 will heat my small office. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
#1023 5040 tasks GHG WAH2 East Asia 25km And another one! |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,575,067 RAC: 15,735 |
"#1023 5040 tasks GHG WAH2 East Asia 25km" Dave, I posted a list of the forthcoming batches already. See: https://www.cpdn.org/forum_thread.php?id=9232&postid=71086 --- CPDN Visiting Scientist |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
That's all of them out there. So far failure rate looks not too high. Another two days till the first of mine is due to finish. With a tad over 5,000 tasks with recent credit they should last a little while. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
My first one has completed. CPU time 5 days 6 hours 35 min 19 sec another should finish later this morning. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
FYI: I got a whole bunch of failures today. Here is a typical one. Several of them failed at the same time, but not all of them. These were all on my Windows 11 machine. I normally leave the Boinc manager running, but it was not running when I turned the monitor on. So it is possible that Windows did an update and reboot without telling me. Normally I run SupportAssist for updates, but it has not run lately. (It keeps a log of recent activity). Task 22470326 Name wah2_eas25_n0d1_201012_24_1022_012312644_0 Workunit 12312644 <---<<< Created 24 Jul 2024, 13:26:05 UTC Sent 9 Aug 2024, 22:47:14 UTC Report deadline 17 Nov 2024, 22:47:14 UTC Received 14 Aug 2024, 3:16:50 UTC Server state Over Outcome Computation error Client state Compute error Exit status 9 (0x00000009) Unknown error code Computer ID 1512658 <---<<< Run time 4 days 3 hours 10 min 33 sec CPU time 3 days 14 hours 59 min 52 sec Validate state Invalid Credit 5,819.81 Device peak FLOPS 3.68 GFLOPS Application version Weather At Home 2 (wah2) (region independent) v8.32 windows_intelx86 Peak working set size 341.16 MB Peak swap size 308.56 MB Peak disk usage 94.80 MB Stderr <core_client_version>8.0.2</core_client_version> <![CDATA[ <message> The storage control block address is invalid. 
<----<<< (0x9) - exit code 9 (0x9)</message> <stderr_txt> modelGetExecutables: check control files, strTemp0 & 1 : C:\ProgramData\BOINC/projects/climateprediction.net/wah2_eas25_n0d1_201012_24_1022_012312644/jobs/xadae.namelists C:\ProgramData\BOINC/projects/climateprediction.net/wah2_eas25_n0d1_201012_24_1022_012312644/jobs/xacxf.namelists modelGetExecutables: unzipping control files : strInput & strTmp wah2_eas25_n0d1_201012_24_1022_012312644.zip wah2_eas25_n0d1_201012_24_1022_012312644/jobs gstrDump[0] = generic_phase1_spinup_eas25_global_aabaka_f gstrDump[1] = generic_phase1_spinup_eas25_regional_aabaka_f global model: command string: "C:\ProgramData\BOINC/projects/climateprediction.net/wah2am3m2_um_8.32_windows_intelx86.exe" wah2_eas25_n0d1_201012_24_1022_012312644 generic_phase1_spinup_eas25_global_aabaka_f ic19611128_10_N96 NATclim_ancil_168months_CMIP6-ACCESS-CM2_SST_2009-01-01_2022-12-30_v2404b NATclim_ancil_168months_CMIP6-ACCESS-CM2_SIC_2009-01-01_2022-12-30_v2404b so2dms_prei_N96_1855_0000P oxi.addfa ozone_preind_N96_1879_0000Pv5 regional model: command string: "C:\ProgramData\BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.32_windows_intelx86.exe" wah2_eas25_n0d1_201012_24_1022_012312644 cpdn_check_running: got RM PID of zero; ignoring this call and waiting for PID via shMem. cpdn_check_running: got RM PID of zero; ignoring this call and waiting for PID via shMem. 
executeModelProcess: MonID=2964, GCM_PID=16812, RCM_PID=2028 Queuing intermediate upload for CPDN/BOINC: cpdnout1.zip Queuing intermediate upload for CPDN/BOINC: cpdnout2.zip Queuing intermediate upload for CPDN/BOINC: cpdnout3.zip Queuing intermediate upload for CPDN/BOINC: cpdnout4.zip Queuing intermediate upload for CPDN/BOINC: cpdnout5.zip Queuing intermediate upload for CPDN/BOINC: cpdnout6.zip Queuing intermediate upload for CPDN/BOINC: cpdnout7.zip Global Worker:: CPDN process is not running, exiting, bRetVal = T, checkPID = 16812, selfPID = 16812, iMonCtr = 1 No Process Handle Regional Worker:: CPDN process is not running, exiting, bRetVal = T, checkPID = 16812, selfPID = 2028, iMonCtr = 1 </stderr_txt> ]]> |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
"Normally I run SupportAssist for updates, but it has not run lately. (It keeps a log of recent activity)." Surely there is still some record of when updates have run? - Found this. Open Start. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,924,640 RAC: 13,177 |
"I normally leave the Boinc manager running, but it was not running when I turned the monitor on. So it is possible that Windows did an update and reboot without telling me." I'd guess it unlikely that a reboot would lead to tasks crashing with this new app version. I'd want to know if you rebooted the PC between any of the crashes. If not, I'd suggest a reboot; it's probably one of the first troubleshooting steps to try for Windows-related errors, which this one appears to be. There certainly seems to be a common cause behind your tasks crashing. Checking the Update History in Settings, you'll be able to see the dates, but not the times, of both successful and failed updates. Also check Reliability History and Event Viewer to see if anything happened around the crash times. Checking stdoutdae.txt in the BOINC directory and the different std....txt files in the task directories of the failed tasks might provide some clues. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,742,261 RAC: 6,307 |
It's likely that Windows update has run recently - yesterday was Microsoft's "Patch Tuesday" (the day that they release major updates each month). Usually, Windows 11 restarts automatically after applying the patches - sometime after the end of your defined 'working day'. The other route to finding information about, and controlling, updates is Start --> Settings --> Windows Update. There's an 'Update history' link on that page, and also controls for delaying automatic updates for up to five weeks. Using that, you can arrange to apply the updates yourself between tasks, and then allow the next task to run uninterrupted. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,575,067 RAC: 15,735 |
The error message "The storage control block address is invalid." seen in your log has been mentioned and discussed before on the forums, by yourself in fact, Jean: https://www.cpdn.org/forum_thread.php?id=9233&postid=70386 and https://www.cpdn.org/forum_thread.php?id=9277&postid=70852. I looked it up on the web and found plenty of reports that it is associated with Windows Update in some way. This particular error message occurs in 10% of the failures we see in a batch, so it's quite common. If I get a moment, I'll look through the database and check which days of the week we see these failures on. That would back up Richard's suggestion. We also see a high number of disk (or storage) related errors such as: 'system cannot find drive specified', 'drive cannot find specific area or track', 'code 193 error, e.g. boinc_finish(193)', and 'extended attributes are inconsistent'. The last one may also be associated with Windows Update. I think the rest are likely to be hardware related. Those errors combined account for ~25-30% of failed tasks in a batch. --- CPDN Visiting Scientist |
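The day-of-week check mentioned above could be sketched roughly as follows. This is only an illustration, not the project's actual query: the function name `failures_by_weekday` is hypothetical, and the two sample dates are simply the failure dates reported in this thread rather than anything read from the CPDN database.

```python
from collections import Counter
from datetime import date

# Fixed English names so the result does not depend on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def failures_by_weekday(failure_dates):
    """Tally failure dates by day of the week (hypothetical helper)."""
    counts = Counter(WEEKDAYS[d.weekday()] for d in failure_dates)
    return dict(counts)

# Sample input: the two failure dates reported in this thread.
# A real check would use the failure timestamps for a whole batch.
sample = [date(2024, 8, 14), date(2024, 8, 9)]
print(failures_by_weekday(sample))
```

A strong peak on the Wednesday following the second Tuesday of each month would support the Windows Update explanation; a flat distribution would not.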
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,742,261 RAC: 6,307 |
'Patch Tuesday' is always the second Tuesday of the month (US pacific time), and it usually reaches the UK on 'the Wednesday after the second Tuesday' of the month - not necessarily 'the second Wednesday'. Time zones, and all that. I have a strong feeling that Windows 11 continues to install bits of the update for a significant period after the reboot. If BOINC is installed as a service, it will be auto-launched while these residual processes are still happening - they may be responsible for these otherwise surprising errors apparently originating deep in the hardware. My Windows 11 laptop has two running tasks currently at 90% complete - but I've blocked updates until they finish. Depending on the time of day they're predicted to finish, I may download replacements in advance - but I'll suspend them from running until well after I've dealt with the updates. Rinse and repeat. |
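For anyone wanting to plan around it, the date rule described above can be worked out in a few lines. This is a minimal sketch using only the calendar rule stated in the post; the helper names `nth_weekday` and `patch_tuesday` are made up for illustration, and time zones are ignored.

```python
from datetime import date, timedelta

TUESDAY, WEDNESDAY = 1, 2  # date.weekday() numbering: Monday is 0

def nth_weekday(year, month, weekday, n):
    """Return the n-th occurrence of a weekday in the given month."""
    first = date(year, month, 1)
    days_ahead = (weekday - first.weekday()) % 7
    return first + timedelta(days=days_ahead + 7 * (n - 1))

def patch_tuesday(year, month):
    """Second Tuesday of the month."""
    return nth_weekday(year, month, TUESDAY, 2)

# August 2024: the second Tuesday is the 13th, so the Wednesday after is the 14th.
print(patch_tuesday(2024, 8))

# May 2024 starts on a Wednesday, so 'the Wednesday after the second Tuesday'
# and 'the second Wednesday' are different days.
print(patch_tuesday(2024, 5) + timedelta(days=1))
print(nth_weekday(2024, 5, WEDNESDAY, 2))
```

Any month that begins on a Wednesday shows the same one-week gap between the two definitions.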
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"Normally I run SupportAssist for updates, but it has not run lately. (It keeps a log of recent activity)." C:\Windows\System32>wmic path Win32_OperatingSystem get LastBootUpTime LastBootUpTime 20240813231633.500000-240 One group failed August 14 03:16:50 (the other group failed August 9). I am 4 time zones behind GMT. I notice there were two groups of failures. Within each group, all the failures happened at the same time. Each group had four tasks. My app_config.xml allows up to four of these to run at a time. I prefer Linux, which does not do updates until I tell it to. It does tell me when there are some. |
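The LastBootUpTime value above is in the CIM datetime format: local date and time, then a signed offset from UTC in minutes. A small sketch of the conversion, so the boot time can be compared directly with the UTC timestamps on the task page (the function name `parse_cim_datetime` is an illustrative choice, and the slicing assumes the fixed-width layout shown in the output above):

```python
from datetime import datetime, timedelta, timezone

def parse_cim_datetime(value):
    """Convert a CIM datetime such as 20240813231633.500000-240 to UTC.

    Assumes the fixed layout yyyymmddHHMMSS.ffffff followed by a signed
    UTC offset in minutes.
    """
    stamp, offset_minutes = value[:21], int(value[21:])
    local = datetime.strptime(stamp, "%Y%m%d%H%M%S.%f")
    local = local.replace(tzinfo=timezone(timedelta(minutes=offset_minutes)))
    return local.astimezone(timezone.utc)

boot = parse_cim_datetime("20240813231633.500000-240")
# 'Received' time of task 22470326 as shown on its task page (UTC).
received = datetime(2024, 8, 14, 3, 16, 50, tzinfo=timezone.utc)

print(boot.isoformat())
print((received - boot).total_seconds())
```

The last boot works out to 14 Aug 2024 03:16:33 UTC, about 17 seconds before the server received the failed result. Note that 'Received' is when the report reached the server, not necessarily the moment the task crashed.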
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"The other route to finding information about, and controlling, updates is Start --> Settings --> Windows Update. There's an 'Update history' link on that page, and also controls for delaying automatic updates for up to five weeks." I have it set to 1 week. I tried to set it to 3 weeks, but it will not allow me to set it to anything but 1 week. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,924,640 RAC: 13,177 |
"One group failed August 14 03:16:50; (The other group failed August 9) I am 4 time zones behind GMT." Since the first group failed before Patch Tuesday, these failures may not be related to it. Check the Update History to see if any updates happened on the days of the failures. I'd say a reboot is in order; with the new app version your current tasks are almost certain to be fine. But I'd be concerned that there is a non-trivial chance that in 3-4 days the same thing will happen again. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"I'd say a reboot is in order, with the new app version your current tasks are almost certain to be fine. But I'd be concerned that there may be a non-trivial chance that in 3-4 days same thing will happen again." Well, there are two tasks running on that machine that have a little over 15 days to go. I assume if I suspend them and then reboot, they will not come back. So should I abort them? |
Send message Joined: 5 Jun 09 Posts: 97 Credit: 3,736,855 RAC: 4,073 |
I suspend running tasks before shutting my PC down at night, and the current load has resumed OK each morning with no problems (provided I remember to resume them). They've survived about a dozen nights this way so far, and I assume they will survive the last few and complete soon. |
©2024 cpdn.org