Thread 'WaH batches 996 & 1001 have been closed'

Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 70598 - Posted: 5 Mar 2024, 10:08:25 UTC
Last modified: 5 Mar 2024, 10:09:42 UTC

The project scientists have enough results from batches 996 & 1001 to compare against results from the new version of the app running the identical batches 1006 & 1007.

Comparison shows that the newer v8.29, recently recompiled, produced slightly warmer temperatures in the winter months, compared to the old version 8.24. The differences are not statistically significant (and not unexpected).

WaH v8.29 is much more stable with very few hard fails and correctly restarts on a host power cycle. There will be no more new batches using WaH v8.24.
---
CPDN Visiting Scientist
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 70599 - Posted: 5 Mar 2024, 10:36:09 UTC - in response to Message 70598.  

WaH v8.29 is much more stable with very few hard fails and correctly restarts on a host power cycle. There will be no more new batches using WaH v8.24.


Good news indeed! I will abort my 1001 resends. Well done for getting this sorted.
SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 70600 - Posted: 5 Mar 2024, 16:11:13 UTC - in response to Message 70598.  

Comparison shows that the newer v8.29, recently recompiled, produced slightly warmer temperatures in the winter months, compared to the old version 8.24. The differences are not statistically significant (and not unexpected).


Good news! Is there a reason model result changes were "not unexpected"? Fixing correctness issues shouldn't alter results... I'd think... but the WaH stuff seems a bit of a special case as far as code goes.


WaH v8.29 is much more stable with very few hard fails and correctly restarts on a host power cycle. There will be no more new batches using WaH v8.24.


Even better news! I won't miss Windows Update or a power outage trashing a CPU-month or two of work.

Thank you so much for your work on improving this code!
Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 70601 - Posted: 5 Mar 2024, 16:29:52 UTC - in response to Message 70600.  

Comparison shows that the newer v8.29, recently recompiled, produced slightly warmer temperatures in the winter months, compared to the old version 8.24. The differences are not statistically significant (and not unexpected).
Good news! Is there a reason model result changes were "not unexpected"? Fixing correctness issues shouldn't alter results... I'd think... but the WaH stuff seems a bit special case as far as code goes.

Atmospheric models are non-linear in nature. A new compiler can optimize the code differently, and new library versions can change results too. The cloud/convection code is one example of where differences in model results can come from. It tests whether water vapour saturation in cloud-free air is above a particular value, so a single-bit difference can decide whether a cloud forms or not. A cloud represents a change of state: water vapour condenses out, which is a non-linear process that changes the properties of the air parcel and its environment.
---
CPDN Visiting Scientist
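
The threshold test described in the post above can be illustrated with a toy sketch. This is not the WaH cloud scheme: the threshold value, the variable names, and the `cloud_forms` helper are all hypothetical, chosen only to show how two inputs that differ in the last bit can land on opposite sides of a comparison.

```python
import math

# Hypothetical saturation threshold; the real model's value is not given in the thread.
SATURATION_THRESHOLD = 0.8


def cloud_forms(saturation: float) -> bool:
    """Toy threshold test: a cloud forms only if saturation exceeds the threshold."""
    return saturation > SATURATION_THRESHOLD


# Two values that differ by a single unit in the last place, standing in for
# what two differently compiled builds might compute for the same grid box.
sat_old_build = SATURATION_THRESHOLD
sat_new_build = math.nextafter(SATURATION_THRESHOLD, math.inf)  # Python 3.9+

difference = sat_new_build - sat_old_build
print(f"difference between inputs: {difference:.3e}")
print("old build forms cloud:", cloud_forms(sat_old_build))
print("new build forms cloud:", cloud_forms(sat_new_build))
```

Once the two builds disagree on whether a cloud exists, everything downstream of that grid box (condensation, heating, the surrounding air) evolves from a different state, which is why the runs diverge rather than staying one bit apart.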
Helmer Bryd

Joined: 16 Aug 04
Posts: 156
Credit: 9,035,872
RAC: 2,928
Message 70603 - Posted: 5 Mar 2024, 16:36:56 UTC
Last modified: 5 Mar 2024, 16:48:49 UTC

*rant* I just don't understand how a "Windows Update" can be allowed to stop and restart the system anytime, idiotic..
Ingleside

Joined: 5 Aug 04
Posts: 127
Credit: 24,659,954
RAC: 6,714
Message 70604 - Posted: 5 Mar 2024, 16:50:15 UTC - in response to Message 70598.  

WaH batches 996 & 1001 have been closed
What does this mean in practice?
Does it mean that continuing to crunch will just be a waste of electricity, since anything returned is just dumped?
Or is it still useful to continue crunching these until they either finish or crap out on the next reboot?
SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 70606 - Posted: 5 Mar 2024, 17:27:56 UTC - in response to Message 70601.  

Yeah, I understand they're exceedingly sensitive to disturbances and such, with non-linear follow-on effects. I just work in a space where, if the code generated different results based on the compiler, we'd be running down the bugs. But I also like to think x86 floating point, even vector, is well enough defined that you shouldn't get differences between chips, and I'm aware that's a falsehood - I just don't work in floating point spaces. Just interesting. I suppose if it's reordering some of the rounding operations and such, you can get subtly different output from a series of operations. I like my computers deterministic, darn it! :p
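
The reordering effect mentioned above is easy to reproduce, since floating-point addition is not associative. The following is a minimal sketch in Python with made-up values; it does not reflect what any particular compiler does to the WaH code, only that the same numbers summed in a different order can round differently.

```python
import math


def running_sum(values):
    """Add values left to right, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total


# The same three numbers in two different orders (illustrative values only).
order_a = [1e16, 1.0, -1e16]  # the 1.0 is lost when added to 1e16
order_b = [1e16, -1e16, 1.0]  # the large terms cancel first, so 1.0 survives

sum_a = running_sum(order_a)
sum_b = running_sum(order_b)
exact = math.fsum(order_a)  # correctly rounded sum, independent of order

print("order a:", sum_a)
print("order b:", sum_b)
print("fsum   :", exact)

# A smaller-scale case of the same effect: regrouping changes the last bit.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print("grouping changes result:", left != right)
```

An optimizer that vectorizes a loop or fuses operations is effectively choosing a different grouping, so bit-level differences between builds are not necessarily a bug in the source code.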

*rant* I just don't understand how a "Windows Update" can be allowed to stop and restart the system anytime, idiotic..


There's probably some way to disable it. I don't really "do Windows" anymore, so I'm not sure how to do it. I have Linux compute rigs, and when there's non-Linux work (Windows, 32-bit Intel Mac, etc), I spin up VMs for the duration of the work and destroy them when done. I don't have enough disk space to store all of them on the compute rigs, and "copying VMs around between hosts" causes some very interesting failures when two systems are identical enough that they get the same computer ID and start smashing each other's work allocation.

What's extra double special is that unless you change some other notification settings, it's likely to install updates, reboot, and then sit at the "But would you pretty please make an Online Microsoft Account????" nag screen (which doesn't allow any compute to start). No, you blasted OS, I created an offline account, through your increasingly troublesome process (now you have to actually not have a network connection at all to even see the option), because I wanted an offline account!
rob

Joined: 5 Jun 09
Posts: 97
Credit: 3,746,817
RAC: 869
Message 70607 - Posted: 5 Mar 2024, 18:05:35 UTC - in response to Message 70604.  

You might as well abandon them as any results they return will be discarded. This will save you a bit of electricity, and reduce the workload on the project servers a little bit.
wateroakley

Joined: 6 Aug 04
Posts: 195
Credit: 28,617,787
RAC: 9,624
Message 70608 - Posted: 5 Mar 2024, 20:57:09 UTC - in response to Message 70607.  
Last modified: 5 Mar 2024, 20:57:33 UTC

I've received a Batch 995 WAH 8.24 retread from July 2023. wah2_nz25_20aa_209105_25_995_012220768.
https://www.cpdn.org/workunit.php?wuid=12220768
Should this WU be crunched or abandoned, please?
Thank you.
SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 70609 - Posted: 5 Mar 2024, 21:51:34 UTC

It's not 996 or 1001, so crunch it, as far as I know.
Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 70610 - Posted: 5 Mar 2024, 21:55:52 UTC - in response to Message 70608.  
Last modified: 5 Mar 2024, 21:57:16 UTC

A closed batch doesn't send out retries.

I've received a Batch 995 WAH 8.24 retread from July 2023. wah2_nz25_20aa_209105_25_995_012220768.
https://www.cpdn.org/workunit.php?wuid=12220768
Should this WU be crunched or abandoned, please?
Thank you.

---
CPDN Visiting Scientist
