Thread 'Batch 996 Weather@Home2 East Asia25'

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 69663 - Posted: 5 Oct 2023, 18:00:52 UTC
Last modified: 6 Oct 2023, 10:46:18 UTC

Firstly, I am thinking that having threads by batch numbers might help keeping relevant posts together in one place.

With five currently running on my Ryzen7 using WINE. I am estimating about 7 days computing time for these.
ID: 69663 · Report as offensive     Reply Quote
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 69664 - Posted: 5 Oct 2023, 20:13:49 UTC - in response to Message 69663.  

Firstly, I am thinking that having threads by batch numbers might help keeping relevant posts together in one place.

With five currently running on my Ryzen7 using WINE. I am estimating about 7 days computing time for these.

While there have been no "hard fails" in this batch so far (where all 3 tasks in a work unit fail), and there is no way to view the number of individual task failures, it looks like Signal 11 failures are dominating at this point. The task on my Ryzen running Windows natively failed at the usual point with a signal 11 (segmentation fault) during the first model day. Tasks running under Wine appear to be progressing nicely.
ID: 69664 · Report as offensive     Reply Quote
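For anyone unsure what "signal 11" means: on Linux and other POSIX systems it is SIGSEGV, the signal a process receives on a segmentation fault. The sketch below is only an illustration under that assumption (a POSIX host with Python 3, not the native Windows or WINE setups discussed in this thread). It makes a child process send itself SIGSEGV and shows how the parent sees the result. The helper name run_and_report is made up for the example.

```python
import signal
import subprocess
import sys


def run_and_report(code: str) -> int:
    """Run a Python snippet in a child process and return its exit status."""
    result = subprocess.run([sys.executable, "-c", code])
    return result.returncode


# The child deliberately sends itself SIGSEGV to imitate a segmentation fault.
CRASHING_SNIPPET = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"

status = run_and_report(CRASHING_SNIPPET)
if status < 0:
    # On POSIX, a negative return code -N means "terminated by signal N".
    print(f"child killed by signal {-status} ({signal.Signals(-status).name})")
else:
    print(f"child exited normally with status {status}")
```

A negative return code is how a crash of this kind differs from an ordinary non-zero exit. Whether a given environment reports the fault or lets the program carry on is exactly the question raised about WINE later in the thread.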
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 69665 - Posted: 5 Oct 2023, 20:32:52 UTC - in response to Message 69664.  

The task on my Ryzen running Windows natively failed at the usual point with a signal 11 (segmentation fault) during the first model day. Tasks running under Wine appear to be progressing nicely.


I have three of those tasks running on my Windows 10 machine. They started at about one-hour intervals and have about 1.7, 2.7, and 3.7 hours completed.
About 9.5 days for them to complete. 8-core machine running on 7 of the cores. Machine not doing anything else (except 4 other Boinc tasks).
ID: 69665 · Report as offensive     Reply Quote
rob
Joined: 5 Jun 09
Posts: 97
Credit: 3,746,817
RAC: 869
Message 69666 - Posted: 5 Oct 2023, 21:05:37 UTC

I just picked up 8 from this batch, and was just about to celebrate all 8 behaving properly.....
However wah2_eas25_a2qn_200212_24_996_012227099 failed with a computation error after 2:39
All the others have passed this time and are plodding on, hopefully for the next few days and to completion.
ID: 69666 · Report as offensive     Reply Quote
Alan K
Joined: 22 Feb 06
Posts: 492
Credit: 31,500,747
RAC: 15,338
Message 69667 - Posted: 5 Oct 2023, 22:22:02 UTC

Picked up 4. Fingers crossed.
ID: 69667 · Report as offensive     Reply Quote
Glenn Carver
Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 69671 - Posted: 6 Oct 2023, 9:16:53 UTC - in response to Message 69664.  

Firstly, I am thinking that having threads by batch numbers might help keeping relevant posts together in one place.

With five currently running on my Ryzen7 using WINE. I am estimating about 7 days computing time for these.

While there have been no "hard fails" in this batch so far (where all 3 tasks in a work unit fail), and there is no way to view the number of individual task failures, it looks like Signal 11 failures are dominating at this point. The task on my Ryzen running Windows natively failed at the usual point with a signal 11 (segmentation fault) during the first model day. Tasks running under Wine appear to be progressing nicely.

I don't trust WINE for running the model correctly. We discovered during testing that WINE implementations do not fail the model when it suffers a memory fault unlike on bare metal Windows. I think there is some memory protection in place for WINE. That implies the results from incorrect memory addresses (e.g. maybe zero) are being used by the model, potentially corrupting the results.
---
CPDN Visiting Scientist
ID: 69671 · Report as offensive     Reply Quote
Glenn Carver
Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 69672 - Posted: 6 Oct 2023, 9:19:18 UTC - in response to Message 69663.  

Firstly, I am thinking that having threads by batch numbers might help keeping relevant posts together in one place.

Good idea, maybe add more description to the title though? e.g. 'Batch 996: Weather@Home2 East Asia25' or similar?
---
CPDN Visiting Scientist
ID: 69672 · Report as offensive     Reply Quote
rob
Joined: 5 Jun 09
Posts: 97
Credit: 3,746,817
RAC: 869
Message 69673 - Posted: 6 Oct 2023, 11:09:02 UTC - in response to Message 69672.  

Good idea - and I see someone has already done it, thanks to whoever that was :-)
ID: 69673 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 69674 - Posted: 6 Oct 2023, 11:17:00 UTC - in response to Message 69672.  

I don't trust WINE for running the model correctly.


Is there a way of testing for this? For example, take models that crash on native Windows, run them under WINE, and see where the results fit in the statistical model and whether they look reasonable compared with tasks that have very similar initial conditions. Maybe collect a few of the hard fails and run them from the testing site on WINE hosts. Is it just the tasks that fail, or should fail, that are problematic, or should the ones that don't give the memory error produce valid results?

Does this mean that those of us who run tasks under WINE have been producing dodgy or at least unsafe results for years?

I have no idea how many hosts use WINE to run CPDN tasks. I suspect the percentage is higher among those who read the fora than among the set-and-forget brigade who don't.

(I am ready to find out if this is an area where my understanding of how this works is wrong.)
ID: 69674 · Report as offensive     Reply Quote
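One way to picture the check suggested above is a simple outlier test: compare a diagnostic from a WINE run with the same diagnostic from an ensemble of runs with very similar initial conditions. This is a minimal sketch only. The numbers are hypothetical placeholders, not CPDN output, and the function name z_score and the threshold of 3 standard deviations are assumptions made for the illustration, not anything the project is known to use.

```python
from statistics import mean, stdev


def z_score(candidate: float, ensemble: list[float]) -> float:
    """Distance of a candidate value from the ensemble mean, in standard deviations."""
    return (candidate - mean(ensemble)) / stdev(ensemble)


# Hypothetical placeholder values for one diagnostic (e.g. a regional mean
# temperature) from runs with near-identical initial conditions.
native_results = [14.8, 15.1, 15.0, 14.9, 15.2]
wine_result = 15.05

score = z_score(wine_result, native_results)
flagged = abs(score) > 3.0  # assumed threshold, chosen only for illustration

print(f"z-score of WINE result: {score:.2f}, flagged as outlier: {flagged}")
```

A result that sits inside the ensemble spread would not prove the run was uncorrupted, since a bad memory read could still produce plausible-looking values, but results far outside the spread would be a warning sign.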
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 69675 - Posted: 6 Oct 2023, 11:29:45 UTC - in response to Message 69671.  
Last modified: 6 Oct 2023, 11:51:20 UTC

I don't trust WINE for running the model correctly. We discovered during testing that WINE implementations do not fail the model when it suffers a memory fault unlike on bare metal Windows. I think there is some memory protection in place for WINE. That implies the results from incorrect memory addresses (e.g. maybe zero) are being used by the model, potentially corrupting the results.


I think memory faults, WINE or not, are an indication of an incorrect program or a hardware fault. If WINE has some memory protection in it in addition to the hardware, perhaps this is just more proof of my theory. It seems to hide the memory faults.

When I first used Windows (Windows 95) it had so many faults that it crashed several times a day even if it was not doing anything. I did not run BOINC then (I do not remember if it existed at that time). Since then Windows has improved some. IIRC Windows 7 was pretty good and I am now running Windows 10 on my other machine.

The three current tasks on my Windows machine now have about 18 hours on them with about 9 days to go.
ID: 69675 · Report as offensive     Reply Quote
Yeti
Joined: 5 Aug 04
Posts: 178
Credit: 19,600,119
RAC: 26,117
Message 69676 - Posted: 6 Oct 2023, 16:34:24 UTC

Just hopped on these Tasks.

Can you tell me how much disk space and RAM is needed per task?

On a different machine I have an "8.52 HadAM4 at N216", can you tell me the same facts for these?

Thanks in advance

Yeti
Supporting BOINC, a great concept !
ID: 69676 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 69677 - Posted: 6 Oct 2023, 16:49:20 UTC - in response to Message 69676.  
Last modified: 6 Oct 2023, 17:14:39 UTC

Haven't checked disk space, but maximum RAM use on my box is 1.1% of 32GB. With 8 tasks running my box is using 6.43GB for CPDN according to BOINC. Virtual memory size reported on task properties is 273MB per task.

Edit: you can see the figures BOINC reports for your tasks by clicking on a task and then Properties down the left-hand side of the manager.

On a different machine I have an "8.52 HadAM4 at N216", can you tell me the same facts for these?

The recommended minimum RAM for the N216 is 2GB/task so running one should present no problems for current machines. Peak working set size 1,412.09 MB for the last N216 task I ran.
ID: 69677 · Report as offensive     Reply Quote
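The memory figures in the post above can be turned into a rough per-task budget. The sketch below just does the arithmetic on the quoted numbers. It assumes the "1.1% of 32GB" figure refers to a single task, which the post does not state outright, and the 16 GB host and the helper max_tasks are hypothetical.

```python
# Figures quoted in the post above.
TOTAL_RAM_GB = 32.0
PEAK_SHARE = 0.011          # "1.1% of 32GB", assumed here to be one task's share
RUNNING_TASKS = 8
BOINC_REPORTED_GB = 6.43    # total BOINC reports for CPDN with 8 tasks running

per_task_from_share = TOTAL_RAM_GB * PEAK_SHARE
per_task_from_total = BOINC_REPORTED_GB / RUNNING_TASKS


def max_tasks(free_ram_gb: float, per_task_gb: float) -> int:
    """How many tasks fit in the given RAM at the given per-task footprint."""
    return int(free_ram_gb // per_task_gb)


print(f"per task, from the 1.1% figure: {per_task_from_share:.2f} GB")
print(f"per task, from the BOINC total:  {per_task_from_total:.2f} GB")

# Hypothetical host with 16 GB free; budget with the larger estimate.
budget = max(per_task_from_share, per_task_from_total)
print(f"tasks that fit in 16 GB: {max_tasks(16.0, budget)}")
```

The two estimates do not agree (roughly 0.35 GB against roughly 0.80 GB per task) and the post does not say how each was measured, so the larger one is the safer planning figure.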
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,901,585
RAC: 2,106
Message 69678 - Posted: 6 Oct 2023, 18:05:27 UTC

Are uploads clearing? I've got four Zips backed off, but it may of course be my problem rather than a general one.

Trickles recording fine.
ID: 69678 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 69679 - Posted: 6 Oct 2023, 18:16:17 UTC - in response to Message 69678.  

Are uploads clearing?


Seven of my tasks have produced their first two trickles and the zips have all uploaded, however, one did prove stubborn and was hanging around for several hours requiring a lot of retries. So my guess is things are not totally OK at the server but things should eventually clear.
ID: 69679 · Report as offensive     Reply Quote
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,901,585
RAC: 2,106
Message 69680 - Posted: 6 Oct 2023, 18:31:21 UTC - in response to Message 69679.  

Are uploads clearing?


Seven of my tasks have produced their first two trickles and the zips have all uploaded, however, one did prove stubborn and was hanging around for several hours requiring a lot of retries. So my guess is things are not totally OK at the server but things should eventually clear.


Thanks, Dave — I’ll have to have a bit of patience …
ID: 69680 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 69681 - Posted: 6 Oct 2023, 20:00:07 UTC

Thanks, Dave — I’ll have to have a bit of patience …

got two 3.zip files currently uploading slowly and another four in the queue but the slow rate is my bored band. If I notice any more sticking I will check the error messages.
ID: 69681 · Report as offensive     Reply Quote
Alan K
Joined: 22 Feb 06
Posts: 492
Credit: 31,500,747
RAC: 15,338
Message 69682 - Posted: 6 Oct 2023, 22:14:25 UTC - in response to Message 69667.  

These going OK. Another 8 picked up!
ID: 69682 · Report as offensive     Reply Quote
Alan K
Joined: 22 Feb 06
Posts: 492
Credit: 31,500,747
RAC: 15,338
Message 69684 - Posted: 7 Oct 2023, 7:56:22 UTC - in response to Message 69682.  

Trickle files uploading OK but zips are now getting stuck.

07/10/2023 08:17:46 | climateprediction.net | [fxd] starting upload, upload_offset -1
07/10/2023 08:17:46 | climateprediction.net | Started upload of wah2_eas25_a49c_201212_24_996_012229068_0_r1003289668_3.zip
07/10/2023 08:17:46 | climateprediction.net | [file_xfer] URL: http://upload7.cpdn.org/cgi-bin/file_upload_handler
07/10/2023 08:17:48 | climateprediction.net | [file_xfer] http op done; retval 0 (Success)
07/10/2023 08:17:48 | climateprediction.net | [error] Error reported by file upload server: [wah2_eas25_a49c_201212_24_996_012229068_0_r1003289668_3.zip] locked by file_upload_handler PID=3567801
07/10/2023 08:17:48 | climateprediction.net | [file_xfer] parsing upload response: <data_server_reply> <status>1</status> <message>[wah2_eas25_a49c_201212_24_996_012229068_0_r1003289668_3.zip] locked by file_upload_handler PID=3567801</message></data_server_reply>
07/10/2023 08:17:48 | climateprediction.net | [file_xfer] parsing status: -127
07/10/2023 08:17:48 | climateprediction.net | [file_xfer] file transfer status -127 (transient upload error)
07/10/2023 08:17:48 | climateprediction.net | Temporarily failed upload of wah2_eas25_a49c_201212_24_996_012229068_0_r1003289668_3.zip: transient upload error
07/10/2023 08:17:48 | climateprediction.net | [file_xfer] project-wide upload delay for 1913.964660 sec
07/10/2023 08:17:48 | climateprediction.net | Backing off 00:22:55 on upload of wah2_eas25_a49c_201212_24_996_012229068_0_r1003289668_3.zip
ID: 69684 · Report as offensive     Reply Quote
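The server reply embedded in the log above is a small XML fragment, so it can be picked apart with a few lines of code when checking many stuck uploads. This is a minimal sketch that parses the exact reply shown in the log. The function name parse_upload_reply is made up for the example, and nothing here is taken from BOINC's own code.

```python
import re
import xml.etree.ElementTree as ET

# The <data_server_reply> text copied from the log line above.
REPLY = (
    "<data_server_reply> <status>1</status> "
    "<message>[wah2_eas25_a49c_201212_24_996_012229068_0_r1003289668_3.zip] "
    "locked by file_upload_handler PID=3567801</message></data_server_reply>"
)


def parse_upload_reply(xml_text: str) -> tuple[int, str]:
    """Return the (status, message) pair from an upload server reply."""
    root = ET.fromstring(xml_text)
    status = int(root.findtext("status"))
    message = root.findtext("message") or ""
    return status, message


status, message = parse_upload_reply(REPLY)

# Pull out the process ID that the server says is holding the file.
lock = re.search(r"locked by file_upload_handler PID=(\d+)", message)

print(f"server status: {status}")
if lock:
    print(f"file is locked by upload handler process {lock.group(1)}")
```

The message suggests that another upload handler process on the server was still holding the file when the retry arrived, which would fit the client treating it as a transient error and backing off.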
rob
Joined: 5 Jun 09
Posts: 97
Credit: 3,746,817
RAC: 869
Message 69685 - Posted: 7 Oct 2023, 8:31:31 UTC

After some rejoicing the other day at getting 8 from this batch I'm now somewhat less happy - all but one have failed :-(
Can someone who understands please have a look at the failed tasks (should be easy as I only have one active cruncher) - I suspect there's something amiss with the way it is set up.
Thanks in advance.
ID: 69685 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 69686 - Posted: 7 Oct 2023, 9:30:25 UTC - in response to Message 69685.  

Hi Rob. Had hoped that the signal 11 failures would be a lot lower with this batch but it seems this might not be the case. This is to do with the batch and not your computer. Just hoping there are enough good tasks between this and the last lot for the researcher to get what she needs.
ID: 69686 · Report as offensive     Reply Quote