climateprediction.net (CPDN) home page
Thread 'Batch 1017 Errors'

Conan
Joined: 6 Jul 06
Posts: 147
Credit: 3,615,496
RAC: 420
Message 70949 - Posted: 8 Jun 2024, 0:12:28 UTC

Great to get some work after a very long time.

However, two completed work units show an error after the 2nd trickle has been uploaded.

I think this is after the 14th zip file.

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>oifs_43r3_bl_a05v_2016092300_20_1017_12282038_0_r1427327128_15.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>oifs_43r3_bl_a05v_2016092300_20_1017_12282038_0_r1427327128_16.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>oifs_43r3_bl_a05v_2016092300_20_1017_12282038_0_r1427327128_17.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>oifs_43r3_bl_a05v_2016092300_20_1017_12282038_0_r1427327128_18.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>oifs_43r3_bl_a05v_2016092300_20_1017_12282038_0_r1427327128_19.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
</message>
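
For anyone comparing several failed tasks, a small script can pull the file names and error codes out of a log like the one above. This is only a sketch: the `LOG` sample is shortened to two entries, the names `failed_uploads` and `zip_index` are made up for illustration, and a regular expression is used because the pasted fragment is not a complete XML document.

```python
import re

# Shortened sample in the same shape as the log excerpt above.
LOG = """
<message>
upload failure: <file_xfer_error>
<file_name>oifs_43r3_bl_a05v_2016092300_20_1017_12282038_0_r1427327128_15.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>oifs_43r3_bl_a05v_2016092300_20_1017_12282038_0_r1427327128_16.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
</message>
"""

# Match each file name together with the numeric code and the text in brackets.
PATTERN = re.compile(
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)\s*\((?P<reason>[^)]*)\)</error_code>"
)


def failed_uploads(text):
    """Return (file_name, error_code, reason) for each failed upload."""
    return [(m["name"], int(m["code"]), m["reason"]) for m in PATTERN.finditer(text)]


def zip_index(file_name):
    """Return the trailing zip number, e.g. 15 for '..._15.zip'."""
    return int(file_name.rsplit("_", 1)[1].split(".")[0])


if __name__ == "__main__":
    for name, code, reason in failed_uploads(LOG):
        print(zip_index(name), code, reason)
```

Listing the zip numbers this way makes it easy to see where the missing files start on each task.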

Ran for almost 6 and a half hours before failing.

Conan

Conan
Joined: 6 Jul 06
Posts: 147
Credit: 3,615,496
RAC: 420
Message 70951 - Posted: 8 Jun 2024, 4:21:22 UTC

The next 2 failed the same way.

My hosts are visible, so you can see the error messages.
I am running Linux Fedora 37 on a Ryzen 9 7900X and a 5900X. The 5900X has not returned a result yet.

Conan

Sven
Joined: 21 Aug 19
Posts: 1
Credit: 5,075,286
RAC: 896
Message 70953 - Posted: 8 Jun 2024, 6:39:00 UTC - in response to Message 70949.  

Same here with these tasks.
Running Linux Ubuntu 22.04.4 on an Intel i5-13500.
Stopped with an error after successfully uploading the 14th cycle, according to std_err.txt.

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,575,067
RAC: 15,735
Message 70964 - Posted: 8 Jun 2024, 14:24:54 UTC - in response to Message 70953.  

The model completes successfully. The controlling code has messed up the count of upload files. All the results are being returned, so just let the tasks run as the output will be useable.
---
CPDN Visiting Scientist

Conan
Joined: 6 Jul 06
Posts: 147
Credit: 3,615,496
RAC: 420
Message 70976 - Posted: 13 Jun 2024, 5:29:39 UTC

The resent tasks are now running correctly. I have completed one successfully and a few more are running.

Thanks
Conan

Conan
Joined: 6 Jul 06
Posts: 147
Credit: 3,615,496
RAC: 420
Message 70977 - Posted: 13 Jun 2024, 10:32:24 UTC

Sorry, the last 7 work units failed, but not because the work units were faulty.

I ran out of memory when another programme started up and launched 22 work units using 1 GB each. Normally that is not a problem, but with 2 Climate Prediction WUs running and using 3 to 5 GB each, I had nothing left.
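
The arithmetic behind the shortage can be written out as a quick sketch. The task counts and per-task figures come from the post; the function name is made up for illustration, and the host's installed RAM is not stated, so it is left out.

```python
def total_memory_gb(other_tasks, gb_per_other_task, cpdn_tasks, gb_per_cpdn_task):
    """Rough combined memory demand in GB for two groups of running tasks."""
    return other_tasks * gb_per_other_task + cpdn_tasks * gb_per_cpdn_task


# 22 tasks at 1 GB each plus 2 CPDN tasks at 3 to 5 GB each.
low = total_memory_gb(22, 1, 2, 3)
high = total_memory_gb(22, 1, 2, 5)

if __name__ == "__main__":
    print(f"Estimated demand: {low} to {high} GB")
```

That works out to roughly 28 to 32 GB of demand before counting the operating system and anything else running.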

It took a while to get control of the computer back. I then aborted the other project and set it to No New Work, which should stop this from happening again.

Conan

Aurum
Joined: 15 Jul 17
Posts: 99
Credit: 18,701,746
RAC: 318
Message 70978 - Posted: 13 Jun 2024, 13:10:36 UTC

So far 3 WUs completed and 5 failed. I've got memory under control so that's not the problem.

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 70979 - Posted: 13 Jun 2024, 16:58:55 UTC - in response to Message 70978.  

So far 3 WUs completed and 5 failed. I've got memory under control so that's not the problem.


@Aurum

These were some of the errors on the 5 that crashed on this PC. The errors suggest numerical instability, possibly caused by CPU or memory errors, i.e. hardware issues. But Glenn might know more about that.

1. forrtl: error (73): floating divide by zero

2. SMILAG TRAJECTORY OUT OF ATM 1 TIMES.
POINT= 17 MAX ETADOT VERTICAL VEL.= 1.002539831634478E-003
LON = 158.625000000000 degrees
LAT = 0.560744942544222 degrees
MAX V WIND= 946.703584439113
MAX U WIND= 276.212476087232
V WIND = 946.703584439113 IS TOO STRONG, EXPLOSION.
LEVEL= 45 POINT= 17
LON = 158.625000000000 degrees
LAT = -0.560744942544222 degrees
ABORT! 1 !V WIND TOO STRONG, EXPLOSION!!!

3. forrtl: error (72): floating overflow

4. SMILAG TRAJECTORY OUT OF ATM 13 TIMES.
POINT= 16 MAX ETADOT VERTICAL VEL.= 2.468154707297061E-005
LON = 166.500000000000 degrees
LAT = -8.41117384374318 degrees
POINT= 17 MAX ETADOT VERTICAL VEL.= 3.891867336306428E-005
LON = 165.375000000000 degrees
LAT = -8.41117384374318 degrees
POINT= 18 MAX ETADOT VERTICAL VEL.= 2.441204446343112E-005
LON = 164.250000000000 degrees
LAT = -8.41117384374318 degrees
MAX V WIND= 341.887839959058
SMILAG TRAJECTORY OUT OF ATM 1 TIMES.
POINT= 17 MAX ETADOT VERTICAL VEL.= 1.710159647687799E-005
LON = 165.375000000000 degrees
LAT = -9.53266359317569 degrees

5. forrtl: error (72): floating overflow
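
To tally errors like these across many tasks, something along the following lines could help. It is a sketch only: the `LOG` sample is a shortened copy of the lines above, `summarise` is a made-up name, and the patterns are written to match the text as it appears here rather than any documented log format.

```python
import re

# Shortened sample built from the error lines quoted above.
LOG = """\
forrtl: error (73): floating divide by zero
SMILAG TRAJECTORY OUT OF ATM 1 TIMES.
MAX V WIND= 946.703584439113
MAX U WIND= 276.212476087232
V WIND = 946.703584439113 IS TOO STRONG, EXPLOSION.
forrtl: error (72): floating overflow
"""

# Fortran runtime errors, e.g. "forrtl: error (72): floating overflow".
FORRTL = re.compile(r"forrtl: error \((\d+)\): (.+)")
# Maximum wind diagnostics, e.g. "MAX V WIND= 946.70".
MAX_WIND = re.compile(r"MAX ([UV]) WIND=\s*([0-9.Ee+-]+)")


def summarise(text):
    """Return (runtime errors, max wind values, whether an EXPLOSION was logged)."""
    errors = [(int(code), msg.strip()) for code, msg in FORRTL.findall(text)]
    winds = [(component, float(value)) for component, value in MAX_WIND.findall(text)]
    exploded = "EXPLOSION" in text
    return errors, winds, exploded


if __name__ == "__main__":
    errors, winds, exploded = summarise(LOG)
    print(errors)
    print(winds)
    print("model explosion logged:", exploded)
```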

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,575,067
RAC: 15,735
Message 70981 - Posted: 13 Jun 2024, 19:13:21 UTC - in response to Message 70979.  

As George says, the model has gone unstable. The log says the V wind is too strong.
Some simulations will do this; it's not a CPU or memory issue. The batch is about trying to force storm systems to grow in an idealised atmosphere. Some will grow very fast and give an unstable solution.
---
CPDN Visiting Scientist

Aurum
Joined: 15 Jul 17
Posts: 99
Credit: 18,701,746
RAC: 318
Message 70984 - Posted: 14 Jun 2024, 5:27:41 UTC

I'd run more WUs but I get this mysterious missive and WUs stop coming: "This computer has finished a daily quota of 1 tasks"

wujj123456
Joined: 14 Sep 08
Posts: 127
Credit: 42,504,403
RAC: 75,848
Message 70985 - Posted: 14 Jun 2024, 5:41:41 UTC - in response to Message 70984.  

I'd run more WUs but I get this mysterious missive and WUs stop coming: "This computer has finished a daily quota of 1 tasks"

AFAIK, this is server-side work-issuing logic trying to protect against faulty hosts that always error out. If a host returns error results, its quota is reduced until it becomes 1. Once a task finishes successfully, the quota is lifted and the host can get more WUs.

This happened to me when this fixed batch initially started, because a few days ago every result was an error. All my hosts that took part in that round had to finish 1 WU first before getting more tasks as usual. Meanwhile, I happened to have one host that did not get any WUs last time, and it was able to fetch more work right off the bat.
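
The behaviour described above can be pictured with a toy model. This is not the BOINC server code: the class name, the starting quota, the one-at-a-time reduction, and the full reset on success are all assumptions made to illustrate the description.

```python
class HostQuota:
    """Toy model of a per-host daily quota (illustrative, not BOINC's code)."""

    def __init__(self, max_quota=10):  # starting quota is a made-up value
        self.max_quota = max_quota
        self.quota = max_quota

    def report(self, success):
        """Update the quota after one returned result and return the new value."""
        if success:
            # Assumption: a single success restores the full quota.
            self.quota = self.max_quota
        else:
            # Assumption: each error lowers the quota by one, never below 1.
            self.quota = max(1, self.quota - 1)
        return self.quota


if __name__ == "__main__":
    host = HostQuota(max_quota=4)
    for _ in range(5):
        print("after error:", host.report(success=False))
    print("after success:", host.report(success=True))
```

In this model a host that only returned errors ends up at a quota of 1, which matches the "daily quota of 1 tasks" message, and one good result gets it going again.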

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 70986 - Posted: 14 Jun 2024, 6:10:20 UTC

AFAIK, this is server-side work-issuing logic trying to protect against faulty hosts that always error out. If a host returns error results, its quota is reduced until it becomes 1. Once a task finishes successfully, the quota is lifted and the host can get more WUs.
That is right. I have only avoided it because I downloaded a bunch of tasks, but my slow bored band can't even keep up with one task at a time on top of the testing WAH2 tasks I am running.

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 70996 - Posted: 15 Jun 2024, 14:34:31 UTC

I never saw this before. I noticed them today even though they are a week old.
I got a bunch of these.

Computer: 1511241
CPU type: GenuineIntel
    Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors: 16
Operating System: Linux Red Hat Enterprise Linux
    Red Hat Enterprise Linux 8.10 (Ootpa) [4.18.0-553.5.1.el8_10.x86_64|libc 2.28]
BOINC version: 7.20.2

Workunit: 12281949
Name: oifs_43r3_bl_a03e_2016092300_20_1017_12281949
Application: OpenIFS 43r3 Baroclinic Lifecycle
Created: 7 Jun 2024, 11:01:27 UTC
Minimum quorum: 1
Initial replication: 1
Max # of error/total/success tasks: 3, 1, 1
Errors: Too many total results
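
One way to read the three limits on that workunit is sketched below. This is a simplified guess at how the limits relate to the error text, not the server's actual logic: the names are made up, the strict "greater than" comparison is an assumption, and the task counts in the example (one failed task plus one resend) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    """The three per-workunit limits shown as 'error/total/success'."""
    max_error: int
    max_total: int
    max_success: int


def exceeded(limits, n_error, n_total, n_success):
    """Return the limits a workunit would exceed for the given task counts."""
    reasons = []
    if n_error > limits.max_error:
        reasons.append("Too many error results")
    if n_total > limits.max_total:
        reasons.append("Too many total results")
    if n_success > limits.max_success:
        reasons.append("Too many success results")
    return reasons


if __name__ == "__main__":
    limits = Limits(max_error=3, max_total=1, max_success=1)
    # Hypothetical counts: one failed task, then one resend.
    print(exceeded(limits, n_error=1, n_total=2, n_success=0))
```

With a total limit of 1, any second task for the same workunit would go over the limit under this reading.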
©2024 cpdn.org