Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944 |
Firstly, I am thinking that having threads by batch numbers might help keeping relevant posts together in one place. With five currently running on my Ryzen 7 using WINE, I am estimating about 7 days of computing time for these. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Firstly, I am thinking that having threads by batch numbers might help keeping relevant posts together in one place. While there have been no "hard fails" in this batch so far (where all 3 tasks in a work unit fail), and there is no way to view the number of individual task failures, it looks like Signal 11 failures are dominating at this point. The task on my Ryzen running Windows natively failed at the usual point with a signal 11 (segmentation fault) during the first model day. Tasks running under Wine appear to be progressing nicely. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
The task on my Ryzen running Windows natively failed at the usual point with a signal 11 (segmentation fault) during the first model day. Tasks running under Wine appear to be progressing nicely. I have three of those tasks running on my Windows 10 machine. They started at about one-hour intervals and have about 1.7, 2.7, and 3.7 hours completed. About 9.5 days for them to complete. 8-core machine running on 7 of the cores. Machine not doing anything else (except 4 other BOINC tasks). |
Send message Joined: 5 Jun 09 Posts: 97 Credit: 3,746,817 RAC: 869 |
I just picked up 8 from this batch, and was just about to celebrate all 8 behaving properly... However, wah2_eas25_a2qn_200212_24_996_012227099 failed with a computation error after 2:39. All the others have passed that time and are plodding on, hopefully for the next few days and to completion. |
Send message Joined: 22 Feb 06 Posts: 492 Credit: 31,500,747 RAC: 15,338 |
Picked up 4. Fingers crossed. |
Send message Joined: 29 Oct 17 Posts: 1052 Credit: 16,817,940 RAC: 12,877 |
I don't trust WINE for running the model correctly. We discovered during testing that WINE implementations do not fail the model when it suffers a memory fault, unlike on bare-metal Windows. I think there is some memory protection in place for WINE. That implies the results from incorrect memory addresses (e.g. maybe zero) are being used by the model, potentially corrupting the results. --- CPDN Visiting Scientist |
Send message Joined: 29 Oct 17 Posts: 1052 Credit: 16,817,940 RAC: 12,877 |
Firstly, I am thinking that having threads by batch numbers might help keeping relevant posts together in one place. Good idea, maybe add more description to the title though? e.g. 'Batch 996: Weather@Home2 East Asia25' or similar? --- CPDN Visiting Scientist |
Send message Joined: 5 Jun 09 Posts: 97 Credit: 3,746,817 RAC: 869 |
Good idea - and I see someone has already done it, thanks to whoever that was :-) |
Send message Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944 |
I don't trust WINE for running the model correctly. Is there a way of testing for this? For example, take models that crash on native Windows, run them under WINE, and see where the results fit in the statistical model and whether they look reasonable compared with results from tasks that have very similar initial conditions. Maybe collect a few of the hard fails and run them from the testing site on WINE hosts? Is it just the ones that fail or should fail that are problematic, or should the ones that don't give the memory error produce valid results? Does this mean that those of us who run tasks under WINE have been producing dodgy or at least unsafe results for years? I have no idea how many hosts use WINE to run CPDN tasks. I suspect a higher percentage among those who read the fora than in the set-and-forget brigade who don't. (I am ready to find out if this is an area where my understanding of how this works is wrong.) |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I don't trust WINE for running the model correctly. We discovered during testing that WINE implementations do not fail the model when it suffers a memory fault unlike on bare metal Windows. I think there is some memory protection in place for WINE. That implies the results from incorrect memory addresses (e.g. maybe zero) are being used by the model, potentially corrupting the results. I think memory faults, WINE or not, are an indication of an incorrect program or a hardware fault. If WINE has some memory protection in it in addition to the hardware, perhaps this is just more proof of my theory. It seems to hide the memory faults. When I first used Windows (Windows 95) it had so many faults that it crashed several times a day even if it was not doing anything. I did not run BOINC then (I do not remember if it existed at that time). Since then Windows has improved some. IIRC Windows 7 was pretty good and I am now running Windows 10 on my other machine. The three current tasks on my Windows machine now have about 18 hours on them with about 9 days to go. |
Send message Joined: 5 Aug 04 Posts: 178 Credit: 19,600,119 RAC: 26,117 |
Just hopped on these tasks. Can you tell me how much disk space and RAM is needed per task? On a different machine I have an "8.52 HadAM4 at N216"; can you tell me the same facts for that one? Thanks in advance. Yeti Supporting BOINC, a great concept ! |
Send message Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944 |
Haven't checked disk space but maximum RAM use on my box is 1.1% of 32GB. With 8 tasks running my box is using 6.43GB for CPDN according to BOINC. Virtual memory size reported on task properties is 273MB per task. Edit: you can see the figures BOINC reports for your tasks by clicking on a task and then Properties down the left-hand side of the manager. On a different machine I have an "8.52 HadAM4 at N216", can you tell me same facts for these ? The recommended minimum RAM for the N216 is 2GB/task so running one should present no problems for current machines. Peak working set size 1,412.09 MB for the last N216 task I ran. |
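For anyone wanting to turn the figures in the post above into a per-task estimate, here is a rough sketch of the arithmetic. It assumes the 1.1% figure is per task (the post does not say so explicitly), and the function names and the optional `reserve_gb` headroom are made up for illustration. The two per-task estimates come from different measurements, so they need not agree.

```python
# Rough RAM arithmetic using the figures quoted in the post above.
# Assumption: "1.1% of 32GB" refers to a single task.

def per_task_from_percent(percent, total_gb):
    """RAM per task in GB when BOINC reports a percentage of total RAM."""
    return total_gb * percent / 100.0

def per_task_from_total(total_used_gb, n_tasks):
    """Average RAM per task in GB from the total used by all running tasks."""
    return total_used_gb / n_tasks

def max_tasks(ram_gb, per_task_gb, reserve_gb=0.0):
    """How many tasks fit in RAM, leaving reserve_gb (hypothetical) for the OS."""
    return int((ram_gb - reserve_gb) // per_task_gb)

wah2_from_percent = per_task_from_percent(1.1, 32.0)  # from the 1.1% figure
wah2_from_total = per_task_from_total(6.43, 8)        # from 6.43GB across 8 tasks
n216_fit = max_tasks(32.0, 2.0)                       # N216 at the recommended 2GB/task

print(f"per task (percent figure): {wah2_from_percent:.2f} GB")
print(f"per task (total / 8):      {wah2_from_total:.2f} GB")
print(f"N216 tasks that fit in 32 GB at 2 GB each: {n216_fit}")
```

Using the larger of the two per-task estimates is the safer choice when deciding how many tasks to run at once.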
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,901,585 RAC: 2,106 |
Are uploads clearing? I've got four Zips backed off, but it may of course be my problem rather than a general one. Trickles recording fine. |
Send message Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944 |
Are uploads clearing? Seven of my tasks have produced their first two trickles and the zips have all uploaded, however, one did prove stubborn and was hanging around for several hours requiring a lot of retries. So my guess is things are not totally OK at the server but things should eventually clear. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,901,585 RAC: 2,106 |
Are uploads clearing? Thanks, Dave — I’ll have to have a bit of patience … |
Send message Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944 |
Thanks, Dave — I’ll have to have a bit of patience … got two 3.zip files currently uploading slowly and another four in the queue but the slow rate is my bored band. If I notice any more sticking I will check the error messages. |
Send message Joined: 22 Feb 06 Posts: 492 Credit: 31,500,747 RAC: 15,338 |
These going OK. Another 8 picked up! |
Send message Joined: 22 Feb 06 Posts: 492 Credit: 31,500,747 RAC: 15,338 |
Trickle files uploading OK but zips are now getting stuck.
07/10/2023 08:17:46 | climateprediction.net | [fxd] starting upload, upload_offset -1
07/10/2023 08:17:46 | climateprediction.net | Started upload of wah2_eas25_a49c_201212_24_996_012229068_0_r1003289668_3.zip
07/10/2023 08:17:46 | climateprediction.net | [file_xfer] URL: http://upload7.cpdn.org/cgi-bin/file_upload_handler
07/10/2023 08:17:48 | climateprediction.net | [file_xfer] http op done; retval 0 (Success)
07/10/2023 08:17:48 | climateprediction.net | [error] Error reported by file upload server: [wah2_eas25_a49c_201212_24_996_012229068_0_r1003289668_3.zip] locked by file_upload_handler PID=3567801
07/10/2023 08:17:48 | climateprediction.net | [file_xfer] parsing upload response: <data_server_reply> <status>1</status> <message>[wah2_eas25_a49c_201212_24_996_012229068_0_r1003289668_3.zip] locked by file_upload_handler PID=3567801</message></data_server_reply>
07/10/2023 08:17:48 | climateprediction.net | [file_xfer] parsing status: -127
07/10/2023 08:17:48 | climateprediction.net | [file_xfer] file transfer status -127 (transient upload error)
07/10/2023 08:17:48 | climateprediction.net | Temporarily failed upload of wah2_eas25_a49c_201212_24_996_012229068_0_r1003289668_3.zip: transient upload error
07/10/2023 08:17:48 | climateprediction.net | [file_xfer] project-wide upload delay for 1913.964660 sec
07/10/2023 08:17:48 | climateprediction.net | Backing off 00:22:55 on upload of wah2_eas25_a49c_201212_24_996_012229068_0_r1003289668_3.zip |
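If you have a lot of stuck zips, it can help to pull the key facts out of the event log rather than reading it by eye. Below is a minimal sketch that assumes the log layout seen in the post above (`date time | project | message`) and matches only the message wording that appears there. The `summarise` function and its regular expressions are my own illustration, not part of BOINC, and other client versions may word these messages differently.

```python
import re
from datetime import timedelta

# Sample lines copied from the log in the post above.
LOG = """\
07/10/2023 08:17:46 | climateprediction.net | Started upload of wah2_eas25_a49c_201212_24_996_012229068_0_r1003289668_3.zip
07/10/2023 08:17:48 | climateprediction.net | [error] Error reported by file upload server: [wah2_eas25_a49c_201212_24_996_012229068_0_r1003289668_3.zip] locked by file_upload_handler PID=3567801
07/10/2023 08:17:48 | climateprediction.net | [file_xfer] file transfer status -127 (transient upload error)
07/10/2023 08:17:48 | climateprediction.net | [file_xfer] project-wide upload delay for 1913.964660 sec
07/10/2023 08:17:48 | climateprediction.net | Backing off 00:22:55 on upload of wah2_eas25_a49c_201212_24_996_012229068_0_r1003289668_3.zip
"""

# Patterns based only on the message text seen in that log.
LOCKED = re.compile(r"\[(?P<file>[^\]]+\.zip)\] locked by file_upload_handler PID=(?P<pid>\d+)")
DELAY = re.compile(r"project-wide upload delay for (?P<sec>[\d.]+) sec")
BACKOFF = re.compile(r"Backing off (?P<h>\d+):(?P<m>\d+):(?P<s>\d+) on upload of (?P<file>\S+)")

def summarise(log_text):
    """Collect locked files, the project-wide delay and per-file backoffs."""
    summary = {"locked": {}, "project_delay": None, "backoffs": {}}
    for line in log_text.splitlines():
        parts = line.split(" | ", 2)  # date time, project, message
        if len(parts) != 3:
            continue  # skip anything that is not a normal log line
        message = parts[2]
        if (m := LOCKED.search(message)):
            summary["locked"][m["file"]] = int(m["pid"])
        elif (m := DELAY.search(message)):
            summary["project_delay"] = timedelta(seconds=float(m["sec"]))
        elif (m := BACKOFF.search(message)):
            summary["backoffs"][m["file"]] = timedelta(
                hours=int(m["h"]), minutes=int(m["m"]), seconds=int(m["s"])
            )
    return summary

if __name__ == "__main__":
    result = summarise(LOG)
    for name, pid in result["locked"].items():
        print(f"{name} is locked on the server by PID {pid}")
    print("project-wide delay:", result["project_delay"])
    for name, wait in result["backoffs"].items():
        print(f"{name} backing off for {wait}")
```

Note that the log reports two separate waits: a project-wide upload delay in seconds and a per-file backoff in hh:mm:ss, so the sketch keeps them apart rather than assuming they are the same timer.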
Send message Joined: 5 Jun 09 Posts: 97 Credit: 3,746,817 RAC: 869 |
After some rejoicing the other day at getting 8 from this batch I'm now somewhat less happy - all but one have failed :-( Can someone who understands please have a look at the failed tasks (should be easy as I only have one active cruncher) - I suspect there's something amiss with the way it is set up. Thanks in advance. |
Send message Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944 |
Hi Rob. I had hoped that the signal 11 failures would be a lot lower with this batch, but it seems this might not be the case. This is to do with the batch and not your computer. Just hoping there are enough good tasks between this and the last lot for the researcher to get what she needs. |
©2024 cpdn.org