Thread 'Receiving new tasks with impossible to meet deadlines'

Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,817,746
RAC: 4,590
Message 71303 - Posted: 18 Aug 2024, 14:13:36 UTC - in response to Message 71302.  

It might make a difference depending on whether you do a soft shutdown from the screen or a hard shutdown from the power button - or worst of all, a forced shutdown by holding the power button for four seconds or flicking the mains power switch.

The software route is designed to give you time to respond to messages - 'do you want to save changes to that file you've forgotten you're editing?'. If any running task needs time to shut down from whatever it's doing, Windows should wait for it. It should do the same for a hard shutdown, but I personally don't feel confident doing that if the screen and keyboard are working.

A forced shutdown is helpful if the machine has locked, and is no longer responding to mouse or keyboard - even ctrl-alt-del. But you can't be sure exactly what caused it.
ID: 71303
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,817,746
RAC: 4,590
Message 71304 - Posted: 18 Aug 2024, 14:20:46 UTC - in response to Message 71301.  

It's not brutal ...
I was having visions of driving down a freeway with BOINC at the controls. Full throttle for one second, coast for the next second. BOINC doesn't have a cruise control - it would sound like a ramjet.
ID: 71304
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 71305 - Posted: 18 Aug 2024, 17:02:34 UTC - in response to Message 71304.  
Last modified: 18 Aug 2024, 17:03:43 UTC

I was having visions of driving down a freeway with BOINC at the controls. Full throttle for one second, coast for the next second. BOINC doesn't have a cruise control - it would sound like a ramjet.


Do you not mean a pulse jet?

https://en.wikipedia.org/wiki/Pulsejet
ID: 71305
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,817,746
RAC: 4,590
Message 71306 - Posted: 18 Aug 2024, 18:11:28 UTC - in response to Message 71305.  

I was thinking those were the same thing, but I see they're not. The example I had in mind was as in that wiki article.
ID: 71306
AndreyOR

Joined: 12 Apr 21
Posts: 318
Credit: 15,030,773
RAC: 4,296
Message 71308 - Posted: 19 Aug 2024, 0:03:12 UTC - in response to Message 71301.  

As for this idea of shutting the BOINC client down before shutting down the PC, that's a myth. Whether the client gets the quit signal from the user via BOINC Manager or from the PC shutdown, the software goes through the same route.

If I remember correctly, Linux BOINC doesn't have the Exit BOINC option in the menu. Perhaps that's why? I have a habit of always closing programs before restarting but perhaps now it'll be less annoying if a restart happens before exiting BOINC.

On a separate but perhaps related note, I kind of think people sometimes baby/worry about their systems too much. Two things that come to mind are SSD usage and temperatures.

The more objective info seems to be that even with heavy usage (writing to disk), relatively modern SSDs will last quite a long time. Also, relatively modern components, i.e. CPUs, GPUs, seem to be designed to run hot for a long time too. By hot I mean at or close to their maximums. It's very likely all of these will still be working just fine by the time the user decides to upgrade. It seems a little odd to me when users spend money on (especially) good quality, more expensive components and then try to baby them instead of letting them earn their pay, so to say.
ID: 71308
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 71309 - Posted: 19 Aug 2024, 6:25:48 UTC - in response to Message 71308.  
Last modified: 19 Aug 2024, 6:30:05 UTC

If I remember correctly, Linux BOINC doesn't have the Exit BOINC option in the menu. Perhaps that's why? I have a habit of always closing programs before restarting but perhaps now it'll be less annoying if a restart happens before exiting BOINC.
It is relatively recently that option was taken away. It was certainly there when the myth or not myth started which was back in the days of tasks that took months even on a then fast machine. There was also the fun of taking and restoring backups. Something I last did with the now defunct slab model.

The option of stopping the running client is still there if you compile your own client. What I used to do before exiting BOINC was stop all running tasks by suspending them before shutting down BOINC.
ID: 71309
Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 71311 - Posted: 19 Aug 2024, 9:40:22 UTC - in response to Message 71309.  
Last modified: 19 Aug 2024, 9:41:36 UTC

The 'must shut down client before PC' advice was frequent on the forums when Weather@Home tasks were failing some time ago, and it's great that many folk contributed to finding patterns of behaviour. It certainly helps me debug the code. But even then it was a myth, because we now know it had nothing to do with the way the tasks were shut down. It was related to the way the processes were failing to communicate when they started back up. Since we're compiling the codes with Microsoft libraries, if it was that risky you'd need to do the same for all the programs running on the PC.

As Richard rightly points out, a hard power-off risks losing the contents of the I/O buffers before they are written to storage. If the tasks are in the middle of a checkpoint that might cause a failure on reboot.
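To illustrate the point about I/O buffers, here is a minimal Python sketch of a checkpoint write that is designed to survive an abrupt power-off. It is an illustration only, not the code the CPDN applications use; the function name `write_checkpoint` and the `.ckpt-` prefix are made up for the example. The idea is to write the new checkpoint to a temporary file, force it out of the OS buffers with `os.fsync`, and only then rename it over the old one, so that an interruption should leave either the old or the new checkpoint rather than a half-written file.

```python
import os
import tempfile


def write_checkpoint(path: str, data: bytes) -> None:
    """Write `data` to `path` via a temporary file and a rename (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory (same filesystem),
    # so the final rename does not have to copy data.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's own buffer to the OS
            os.fsync(tmp.fileno())  # ask the OS to push its buffers to the device
        os.replace(tmp_path, path)  # swap the new checkpoint in place of the old one
    except BaseException:
        # Remove the partial temporary file; the old checkpoint is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Without the `fsync` step, the data may still be sitting in RAM when the power goes, which is the risk described above.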

I spend a lot of time analyzing the task failures and none are because a client was shut down (either manually or by PC power-off). I just want to get across that there's no need for this any more; people shouldn't waste time shutting the client down for CPDN when the PC is shut down. Life is too short.
ID: 71311
Forrest

Joined: 19 May 06
Posts: 9
Credit: 4,294,690
RAC: 11,165
Message 71334 - Posted: 20 Aug 2024, 19:20:47 UTC - in response to Message 71311.  
Last modified: 20 Aug 2024, 19:27:39 UTC

The 'must shut down client before PC' advice was frequent on the forums when Weather@Home tasks were failing some time ago, and it's great that many folk contributed to finding patterns of behaviour. It certainly helps me debug the code. But even then it was a myth, because we now know it had nothing to do with the way the tasks were shut down. It was related to the way the processes were failing to communicate when they started back up. Since we're compiling the codes with Microsoft libraries, if it was that risky you'd need to do the same for all the programs running on the PC.

As Richard rightly points out, a hard power-off risks losing the contents of the I/O buffers before they are written to storage. If the tasks are in the middle of a checkpoint that might cause a failure on reboot.

I spend a lot of time analyzing the task failures and none are because a client was shut down (either manually or by PC power-off). I just want to get across that there's no need for this any more; people shouldn't waste time shutting the client down for CPDN when the PC is shut down. Life is too short.


Apologies, I was not able to respond to you yesterday.
After I commented with the "shut down before pc" post, I did have to shut down the pc. At the time there were 16 tasks running & 7 waiting to run. After restarting, all 16 tasks that were running had computation errors. The 7 tasks that had been idle were not affected. I didn't think to screenshot it at the time, but have marked up one I took earlier in the day, after setting the % time to 100. The green highlighted tasks are those that did not error out.

I’ve never had all of the tasks error out like they did this time. When it’s happened before, it was:
1. Limited to just a few tasks
2. Always the tasks with the most progress

If shutting down the pc is not the reason I’m encountering these computation errors, what else could be the cause? Is it possible that all the tasks were in the middle of a checkpoint at the same time?

task list: https://www.dropbox.com/scl/fi/is4846zhl31f25xymd1pq/tasks-remaining.png?rlkey=dxpy7uw1a8nejbueqy42flak7&dl=0
log: https://www.dropbox.com/scl/fi/mez0w2ngqlo2ytm2wu7xe/2024-08-20-log-file.txt?rlkey=w1wl12peowheuvfgp50ko0knl&dl=0
current settings: https://www.dropbox.com/scl/fi/myd7hjvi5txtmseddokb5/current-settings.png?rlkey=3vmd5jc5l0eul2vinzycqffw0&dl=0

edit: by "I did have to shut down the pc" I mean perform a hard shutdown. Keyboard/mouse were unresponsive. I've had no issues with responsiveness since.
ID: 71334
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,817,746
RAC: 4,590
Message 71336 - Posted: 21 Aug 2024, 13:06:30 UTC - in response to Message 71334.  

edit: by "I did have to shut down the pc" I mean perform a hard shutdown. Keyboard/mouse were unresponsive. I've had no issues with responsiveness since.
I've had a look through most, if not all, of the tasks reported at 18 Aug 2024, 5:26:07 UTC.

All have returned a good number of trickles, so they've been running OK: but not all the same number - so they're not running synchronously, and the 'all tried to checkpoint at the same moment' is an unlikely explanation.

There are a number of different error messages (the commonest being "Exit status 25 (0x00000019) Unknown error code"), and they all seem to reference some problems with accessing the data disk. But there's no consistency about the actual problem reported.

All seem to have completed their close-down processes and called boinc finish - except the final one (at the top of the list), which has an exit code of zero and a completely blank stderr.txt

Where that leaves us, I don't know.
ID: 71336
Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 71337 - Posted: 21 Aug 2024, 13:27:01 UTC - in response to Message 71334.  
Last modified: 21 Aug 2024, 13:33:07 UTC

.... At the time there were 16 tasks running & 7 waiting to run. After restarting, all 16 tasks that were running had computation errors. ...
I had a look at the failed tasks. The reason for exit code 25 is given in the stderr output:
The drive cannot locate a specific area or track on the disk.
 (0x19) - exit code 25 (0x19)

This is a seek error, and it may point to corrupt files or directories on the storage device, or to some bad sectors. Try running a filesystem check on the storage device that holds the BoincData directory. The command is 'chkdsk' on Windows. There are other causes and solutions online; I'm not a Windows expert, and others here will know more than me about this. As it happened when the PC started, I wonder if it might also be caused by a timeout - is the BoincData directory on a spinning HDD perhaps? I'm not sure if a timeout would generate the same error. First step though, check the disk for errors. Unfortunately, there's not enough information in the logs to know what file(s) the tasks were trying to access. That's something I'm working on.
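For anyone wondering how exit status 25 relates to the 0x19 in the log: they are the same number in decimal and hexadecimal, and it corresponds to the Windows system error ERROR_SEEK. A small Python sketch of the lookup follows; the table and the function name `describe_exit_status` are only for illustration, and the table holds just the one code discussed here.

```python
# Windows system error codes relevant to this thread (value from winerror.h).
# Only the one code discussed above is listed; this is not a complete table.
WINDOWS_ERRORS = {
    25: ("ERROR_SEEK",
         "The drive cannot locate a specific area or track on the disk."),
}


def describe_exit_status(status: int) -> str:
    """Return a readable description of a task exit status (hypothetical helper)."""
    name, text = WINDOWS_ERRORS.get(status, ("UNKNOWN", "Unknown error code"))
    return f"exit code {status} (0x{status:02x}): {name} - {text}"


print(describe_exit_status(0x19))
```

On a Windows machine, `ctypes.FormatError(25)` should return the same system message text without needing a hand-written table.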

p.s. I found this webpage about the problem which was useful: https://windowsreport.com/cannot-locate-specific-area-track-on-disk/
---
CPDN Visiting Scientist
ID: 71337
AndreyOR

Joined: 12 Apr 21
Posts: 318
Credit: 15,030,773
RAC: 4,296
Message 71340 - Posted: 21 Aug 2024, 20:25:36 UTC

Forrest,

I wonder if your system may still have been overloaded, as some of your failed tasks have blank Stderr and the system was unresponsive and you had to reboot.

Also, you're allowing BOINC only half of your RAM, which likely means you should run fewer than 16 CPDN tasks.

Right now it seems like you have 7 tasks left and have set No New Tasks for CPDN and had no errors in about 3 days and trickles have been coming in steadily. Keep it this way and see if your tasks complete but be careful not to load BOINC with other projects. You can try troubleshooting your disk as suggested above now or wait to see if you get any more errors.

Lastly, if you have the desire or ability, double your RAM. That way you'll be able to run more of CPDN and other BOINC projects and do other things without overloading your system. You almost certainly already have an SSD as your system seems pretty new.
ID: 71340
Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 71342 - Posted: 21 Aug 2024, 22:08:29 UTC - in response to Message 71340.  
Last modified: 21 Aug 2024, 22:08:58 UTC

There's enough RAM at 50% usage for 16 tasks. BOINC gets 16 GB of the 32 GB on the machine. Each Weather@Home task will take a total of 450 MB RAM (that's the sum of the 3 processes per task). So 16 GB for BOINC running CPDN is fine, with room to spare.
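The arithmetic behind that, as a quick Python sketch. The figures are the ones quoted above; the only assumption is treating 1 GB as 1024 MB.

```python
TOTAL_RAM_GB = 32    # RAM installed on the machine
BOINC_PERCENT = 50   # share of RAM BOINC is allowed to use
TASKS = 16           # Weather@Home tasks running at once
MB_PER_TASK = 450    # total for the 3 processes of one task

# Assumption: 1 GB = 1024 MB.
boinc_mb = TOTAL_RAM_GB * 1024 * BOINC_PERCENT // 100  # 16384 MB available to BOINC
needed_mb = TASKS * MB_PER_TASK                        # 7200 MB needed by the tasks
headroom_mb = boinc_mb - needed_mb                     # 9184 MB to spare

print(f"BOINC allowance: {boinc_mb} MB")
print(f"16 tasks need:   {needed_mb} MB")
print(f"Headroom:        {headroom_mb} MB")
```

So the 16 tasks need roughly 7 GB, well under half of what BOINC is allowed.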

Personally, I'd do the filesystem check ASAP. If there is anything wrong, it needs fixing before anything more goes wrong. I suspect, as AndreyOR mentions, that it might be due to some overload on the machine, but that's also how filesystem issues can arise.
---
CPDN Visiting Scientist
ID: 71342
Forrest

Joined: 19 May 06
Posts: 9
Credit: 4,294,690
RAC: 11,165
Message 71390 - Posted: 29 Aug 2024, 22:38:20 UTC - in response to Message 71342.  
Last modified: 29 Aug 2024, 22:40:24 UTC

I tested & retested the drive, a Samsung EVO 870 w/1.6 TB free space. At the start of testing, it had been 9 days since last trim.

Win 10 utils:
Drive properties/Error checking - no errors on drive
chkdsk - no errors found

Samsung Diagnostics:
Short scan - no errors
Full scan - no errors
Short SMART self-test - no errors
Extended SMART self-test - no errors

I also ran these on my OS drive. All tests were good, with the exception that chkdsk found & corrected some errors it encountered there.

On one occasion while testing the 870, the PC repeatedly failed to boot up. I noticed that the lighting on one of the two DIMMs was not behaving the same as the other. When I removed it, I noticed that it wasn't fully locked into the slot. I pulled & reseated both and have had no further computation errors since. The remaining 5 tasks completed in less than a week. After the last of those completed, I allowed new tasks again & am processing a new batch of 16.

I would be curious to know if a DIMM not making full contact could account for the same errors or if just running fewer tasks is the more likely reason for the successful completion.
ID: 71390
Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 71391 - Posted: 30 Aug 2024, 9:17:11 UTC - in response to Message 71390.  
Last modified: 30 Aug 2024, 9:17:38 UTC

I would be curious to know if a DIMM not making full contact could account for the same errors or if just running fewer tasks is the more likely reason for the successful completion.
As modern operating systems use spare RAM as a cache for files, it sounds plausible that it's related to the memory. Good to hear you found it.

The error you had, 'unable to find track', is rare, looking at the batch statistics. It only occurred in about 2% of failures, which amounts to about 16-18 tasks.
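As a rough cross-check of those two figures, here is a small Python sketch of what they imply for the total number of failures in the batch. Both inputs are approximate ("about 2%", "about 16-18"), so the result is only a ballpark and not an official batch statistic; the function name is made up for the example.

```python
def implied_total_failures(matching_tasks: int, percent: int) -> int:
    """Total failures implied if `matching_tasks` is `percent`% of them (hypothetical helper)."""
    return matching_tasks * 100 // percent


low = implied_total_failures(16, 2)   # 800
high = implied_total_failures(18, 2)  # 900
print(f"Implied failures in the batch: roughly {low}-{high}")
```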
---
CPDN Visiting Scientist
ID: 71391

©2024 cpdn.org