Message boards : Number crunching : Batch 1017 Errors
Author | Message |
---|---|
Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
Great to get some work after a very long time. However, two completed work units show an error after the 2nd trickle has been uploaded. I think this is after the 14th zip file: </stderr_txt> <message> upload failure: <file_xfer_error> <file_name>oifs_43r3_bl_a05v_2016092300_20_1017_12282038_0_r1427327128_15.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error> <file_xfer_error> <file_name>oifs_43r3_bl_a05v_2016092300_20_1017_12282038_0_r1427327128_16.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error> <file_xfer_error> <file_name>oifs_43r3_bl_a05v_2016092300_20_1017_12282038_0_r1427327128_17.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error> <file_xfer_error> <file_name>oifs_43r3_bl_a05v_2016092300_20_1017_12282038_0_r1427327128_18.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error> <file_xfer_error> <file_name>oifs_43r3_bl_a05v_2016092300_20_1017_12282038_0_r1427327128_19.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error> </message> They ran for almost six and a half hours before failing. Conan |
Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
The next 2 failed the same way. My hosts are visible, so you can see the error messages. I am running Linux Fedora 37 on a Ryzen 9 7900X and a 5900X. The 5900X has not returned a result yet. Conan |
Joined: 21 Aug 19 Posts: 1 Credit: 5,075,286 RAC: 896 |
Same here with these tasks. Running Linux Ubuntu 22.04.4 on an Intel i5-13500. They stopped with an error after successfully uploading the 14th cycle, according to std_err.txt. |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,575,067 RAC: 15,735 |
The model completes successfully. The controlling code has messed up the count of upload files. All the results are being returned, so just let the tasks run as the output will be useable. --- CPDN Visiting Scientist |
Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
The resent tasks are now running correctly. I completed one successfully and have a few more running. Thanks, Conan |
Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
Sorry, the last 7 work units failed, but not because the work units were faulty. I ran out of memory when another programme started up and launched 22 work units using 1 GB each. That is normally not a problem, but with 2 Climate Prediction WUs running and using 3 to 5 GB each, I had nothing left. It took a while to get control of the computer back. I then aborted the other project's work and set it to No New Work, which should stop this from happening again. Conan |
Joined: 15 Jul 17 Posts: 99 Credit: 18,701,746 RAC: 318 |
So far 3 WUs completed and 5 failed. I've got memory under control so that's not the problem. |
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
"So far 3 WUs completed and 5 failed. I've got memory under control so that's not the problem." @Aurum: These were some of the errors on the 5 that crashed on this PC. The errors suggest numerical instability, possibly caused by CPU or memory errors, i.e. hardware issues, but Glenn might know more about that. 1. forrtl: error (73): floating divide by zero 2. SMILAG TRAJECTORY OUT OF ATM 1 TIMES. POINT= 17 MAX ETADOT VERTICAL VEL.= 1.002539831634478E-003 LON = 158.625000000000 degrees LAT = 0.560744942544222 degrees MAX V WIND= 946.703584439113 MAX U WIND= 276.212476087232 V WIND = 946.703584439113 IS TOO STRONG, EXPLOSION. LEVEL= 45 POINT= 17 LON = 158.625000000000 degrees LAT = -0.560744942544222 degrees ABORT! 1 !V WIND TOO STRONG, EXPLOSION!!! 3. forrtl: error (72): floating overflow 4. SMILAG TRAJECTORY OUT OF ATM 13 TIMES. POINT= 16 MAX ETADOT VERTICAL VEL.= 2.468154707297061E-005 LON = 166.500000000000 degrees LAT = -8.41117384374318 degrees POINT= 17 MAX ETADOT VERTICAL VEL.= 3.891867336306428E-005 LON = 165.375000000000 degrees LAT = -8.41117384374318 degrees POINT= 18 MAX ETADOT VERTICAL VEL.= 2.441204446343112E-005 LON = 164.250000000000 degrees LAT = -8.41117384374318 degrees MAX V WIND= 341.887839959058 SMILAG TRAJECTORY OUT OF ATM 1 TIMES. POINT= 17 MAX ETADOT VERTICAL VEL.= 1.710159647687799E-005 LON = 165.375000000000 degrees LAT = -9.53266359317569 degrees 5. forrtl: error (72): floating overflow |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,575,067 RAC: 15,735 |
As George says, the model has gone unstable. The log says the V wind is too strong. Some simulations will do this; it's not a CPU or memory issue. The batch is about trying to force storm systems to grow in an idealised atmosphere. Some will grow very fast and give an unstable solution. --- CPDN Visiting Scientist |
Joined: 15 Jul 17 Posts: 99 Credit: 18,701,746 RAC: 318 |
I'd run more WUs but I get this mysterious missive and WUs stop coming: "This computer has finished a daily quota of 1 tasks" |
Joined: 14 Sep 08 Posts: 127 Credit: 42,504,403 RAC: 75,848 |
"I'd run more WUs but I get this mysterious missive and WUs stop coming: 'This computer has finished a daily quota of 1 tasks'" AFAIK, this is server-side work-issuing logic that tries to protect against faulty hosts that always error out. If a host returns error results, its quota is reduced until it reaches 1. Once a task finishes successfully, the quota is lifted and you can get more WUs. This happened to me when this fixed batch initially started, because a few days ago every result was an error. All my hosts that took part in that round had to finish 1 WU first before getting more tasks as usual. Meanwhile, one of my hosts did not get any WUs last time, and it was able to fetch more work right off the bat. |
Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
"AFAIK, this is server side work issuing logic trying to protect against faulty hosts that always error out. If a host returned error results, the quota will be reduced until it becomes 1. Once a task successfully finishes, your quota will be lifted and can get more WUs." That is right. I have only escaped it because I downloaded a bunch of tasks, but my slow bored band can't even keep up with one task at a time on top of the testing WAH2 tasks I am running. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I never saw this before. I noticed them today even though they are a week old. I got a bunch of these. Computer: 1511241; CPU type: GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]; Number of processors: 16; Operating System: Linux Red Hat Enterprise Linux, Red Hat Enterprise Linux 8.10 (Ootpa) [4.18.0-553.5.1.el8_10.x86_64|libc 2.28]; BOINC version: 7.20.2. Workunit: 12281949; name: oifs_43r3_bl_a03e_2016092300_20_1017_12281949; application: OpenIFS 43r3 Baroclinic Lifecycle; created: 7 Jun 2024, 11:01:27 UTC; minimum quorum: 1; initial replication: 1; max # of error/total/success tasks: 3, 1, 1; errors: Too many total results |
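
The daily quota behaviour described earlier in this thread (errors shrink a host's quota down to a floor of 1, and one successful task lifts it again) can be illustrated with a minimal Python sketch. This is not the BOINC server's code: the `HostQuota` class, the ceiling of 10 tasks, the one-step reduction per error, and the full reset on success are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class HostQuota:
    """Hypothetical per-host daily task quota (illustration, not BOINC's real structure)."""
    max_quota: int = 10   # assumed project-wide ceiling
    quota: int = 10       # tasks this host may be sent today
    sent_today: int = 0   # tasks already sent today

    def record_result(self, success: bool) -> None:
        if success:
            # Assumption: one good result restores the full quota.
            self.quota = self.max_quota
        else:
            # Assumption: each error reduces the quota by one, never below 1.
            self.quota = max(1, self.quota - 1)

    def fetch(self) -> bool:
        """Return True if the host is given another task today."""
        if self.sent_today >= self.quota:
            return False  # "This computer has finished a daily quota of N tasks"
        self.sent_today += 1
        return True


host = HostQuota()
for _ in range(12):          # a round in which every result is an error
    host.record_result(False)
print(host.quota)            # quota is now at its floor
print(host.fetch())          # the single allowed task is sent
print(host.fetch())          # further requests are refused
host.record_result(True)     # that task finishes successfully
print(host.fetch())          # work is issued again
```

Under these assumptions, a host that took part in the failed round has to return one good result before it is sent more work, while a host that returned no errors keeps its full quota, which matches what was reported above.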
©2024 cpdn.org