Thread 'One of my oifs_43r3_bl_1018 tasks errored out.'

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 70988 - Posted: 14 Jun 2024, 13:25:39 UTC
Last modified: 14 Jun 2024, 13:32:14 UTC

One of my oifs_43r3_bl_1018 tasks errored out.

Task 22443572
Name 	oifs_43r3_bl_a1a0_2016092300_20_1018_12289983_0
Workunit 	12289983

Server state 	Over
Outcome 	Computation error
Client state 	Compute error
Exit status 	1 (0x00000001) Unknown error code
Computer ID 	1511241
Run time 	7 hours 24 min 44 sec
CPU time 	7 hours 20 min 23 sec
Validate state 	Invalid
Credit 	1,318.46
Device peak FLOPS 	5.93 GFLOPS
Application version 	OpenIFS 43r3 Baroclinic Lifecycle v1.13
x86_64-pc-linux-gnu
Peak working set size 	5,548.55 MB
Peak swap size 	5,981.07 MB
Peak disk usage 	1,286.50 MB


I won't bore you with the entire stderr file, but the important part is

[EC_DRHOOK:localhost.localdomain:1:1:1644391:1644391] [20240614:084327:1718369007:27437.112] [signal_drhook@/home/abowery/Working_folder/OpenIFS/oifs_43r3_bl/gc_oifs43r3_2/src/ifsaux/support/drhook.c:1734] DrHook backtrace done for signal#8, nsigs = 1
[EC_DRHOOK:localhost.localdomain:1:1:1644391:1644391] [20240614:084327:1718369007:27437.112] [signal_drhook@/home/abowery/Working_folder/OpenIFS/oifs_43r3_bl/gc_oifs43r3_2/src/ifsaux/support/drhook.c:1785] Calling previous signal handler at 0x1cf8da0 for signal#8, nsigs = 1
forrtl: error (72): floating overflow


I infer that the software and hardware are working correctly, but the mathematics of the model disagreed with reality. I have been processing _1018_ tasks successfully. This is the first failure.
ID: 70988
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 70989 - Posted: 14 Jun 2024, 14:47:19 UTC

I have had one error out too. I think it is just the physics of the model.
ID: 70989
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 70990 - Posted: 14 Jun 2024, 16:12:30 UTC - in response to Message 70989.  

forrtl: error (72): floating overflow

Does this mean the program is written in Fortran? It would not have occurred to me that OIFS tasks are in that 1950s language. Or do they just call a library that is written in it?

My failing task has been assigned to another user, so we will see how (s)he does with it.

Task 22448825
Name 	oifs_43r3_bl_a1a0_2016092300_20_1018_12289983_1
Workunit 	12289983

ID: 70990
pututu
Joined: 18 Jun 17
Posts: 18
Credit: 10,293,533
RAC: 33,275
Message 70991 - Posted: 14 Jun 2024, 18:52:45 UTC
Last modified: 14 Jun 2024, 18:59:56 UTC

I've one WU where all three different computers are erroring out. https://www.cpdn.org/workunit.php?wuid=12291686
Same error
forrtl: error (72): floating overflow

This WU has two computers erroring out.
https://www.cpdn.org/cpdnboinc/workunit.php?wuid=12289928
This time slightly different error number:
forrtl: error (65): floating invalid
Let us see if the third computer can validate the model.
ID: 70991
Glenn Carver
Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 70992 - Posted: 14 Jun 2024, 22:28:44 UTC - in response to Message 70991.  

If one or both of the previous tasks in a workunit have failed with a floating point exception, the 3rd definitely will not work.

It's expected some of the tasks will fail in this batch in this way. There's no need to report it.
---
CPDN Visiting Scientist
ID: 70992
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 70993 - Posted: 14 Jun 2024, 23:31:51 UTC - in response to Message 70992.  
Last modified: 14 Jun 2024, 23:42:09 UTC

It's expected some of the tasks will fail in this batch in this way. There's no need to report it.


I could expect the floating point overflow errors, but the forrtl: error (65): floating invalid is another thing entirely.

If something got out of range, you can get a floating point overflow, but an invalid floating point number is something else entirely. This would not be something out of range, but something that is a bunch of bits that are not a floating point number at all.

https://lucid.co/techblog/2022/03/04/if-its-not-a-number-what-is-it-demystifying-nan-for-the-working-programmer
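
For reference, here is a minimal Python sketch of how the two conditions differ under IEEE 754 arithmetic. It is not taken from OpenIFS, and the helper name `classify` is made up for illustration. An overflow produces infinity, while an invalid operation (here infinity minus infinity) produces NaN, which is still a defined floating-point encoding. Python returns these values silently for `*` and `-`; a Fortran program built with floating-point trapping enabled would typically abort instead.

```python
import math

def classify(value):
    """Name the IEEE 754 condition that a float result suggests (illustrative helper)."""
    if math.isnan(value):
        return "invalid"    # NaN: the result of an undefined operation
    if math.isinf(value):
        return "overflow"   # infinity: here, a product that exceeded the double range
    return "ok"

big = 1e308
too_big = big * 10.0            # exceeds the largest finite double, becomes inf
undefined = too_big - too_big   # inf - inf has no defined value, becomes nan

print(classify(big), classify(too_big), classify(undefined))
```

In this sketch the invalid result only appears because an overflow happened first, which is one way both errors could show up in the same batch.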
ID: 70993
pututu
Joined: 18 Jun 17
Posts: 18
Credit: 10,293,533
RAC: 33,275
Message 70994 - Posted: 15 Jun 2024, 3:32:08 UTC
Last modified: 15 Jun 2024, 3:45:15 UTC

This WU is interesting. First returned unit has "Error while computing" but full credit is granted. Didn't notice anything abnormal in the stderr output. The task then got sent to another PC but with error (65): floating invalid.
https://www.cpdn.org/workunit.php?wuid=12292871

Is there a way for the standard BOINC server setup to make an exception so that, if the error is 72 (floating overflow), the task is not recycled, since it is going to fail on the next two hosts anyway? My guess is that this is probably not easy to do, but I am just asking. Alternatively, each volunteer could manually check recycled tasks and decide whether to abort them or let them crunch.
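
A rough sketch of the manual check, assuming the volunteer has copied the stderr text of an earlier result from the workunit page into a string. It does not talk to the BOINC server or client. The function names are hypothetical, and the set of codes is just the two seen in this thread.

```python
import re

# Matches runtime lines such as "forrtl: error (72): floating overflow".
FORRTL_ERROR = re.compile(r"forrtl: error \((\d+)\): (.+)")

# Codes reported in this thread as floating point exceptions (an assumption, not a full list).
FLOATING_POINT_CODES = {65, 72}

def find_forrtl_errors(stderr_text):
    """Return (code, message) pairs for every forrtl error line in the text."""
    return [(int(code), msg.strip()) for code, msg in FORRTL_ERROR.findall(stderr_text)]

def likely_to_fail_again(stderr_text):
    """True if an earlier result's stderr shows one of the floating point codes."""
    return any(code in FLOATING_POINT_CODES for code, _ in find_forrtl_errors(stderr_text))

sample = (
    "DrHook backtrace done for signal#8, nsigs = 1\n"
    "forrtl: error (72): floating overflow\n"
)
print(find_forrtl_errors(sample))
print(likely_to_fail_again(sample))
```

Whether to abort a resend is still the volunteer's call; the check only flags workunits where an earlier host already hit one of these errors.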
ID: 70994
alanb1951
Joined: 31 Aug 04
Posts: 37
Credit: 9,581,380
RAC: 3,853
Message 70995 - Posted: 15 Jun 2024, 4:45:28 UTC

I've had one of those "apparently valid but flagged as error" tasks as well. Workunit https://www.cpdn.org/workunit.php?wuid=12293995

Looking at the stderr.txt it appears that a SIGKILL has been issued after boinc_finish() has been called -- hence the Error status :-(

Also, looking at some of the wingmen for failing tasks, I notice there are one or two cases where there's no stderr.txt visible on the result page -- looks like some systems are sometimes having problems completing the wrap-up of tasks (successfully completed or otherwise!)

Cheers - Al.
ID: 70995
wujj123456
Joined: 14 Sep 08
Posts: 127
Credit: 43,914,791
RAC: 53,378
Message 70998 - Posted: 16 Jun 2024, 18:00:07 UTC - in response to Message 70994.  
Last modified: 16 Jun 2024, 18:04:17 UTC

I have had a couple of interesting ones that I had to abort. Upon reaching 99.98% or so, they just never finished, and the time left continued to count hours into negative territory. For one of them, I checked `ps` and the oifs process had actually already exited. I originally thought it was specific to one of my hosts, until another host got a similar result. However, the resends were successful. It's unclear to me what went wrong for them. Perhaps something in the wrapper that handles the final results?

https://www.cpdn.org/result.php?resultid=22441906
https://www.cpdn.org/result.php?resultid=22443068
https://www.cpdn.org/result.php?resultid=22443808
https://www.cpdn.org/result.php?resultid=22446273

It's pretty rare though, affecting ~1% of my WUs so far. Just a bit annoying to babysit because I need to abort them manually...
ID: 70998
ChelseaOilman
Joined: 24 Dec 19
Posts: 32
Credit: 41,611,797
RAC: 16,781
Message 70999 - Posted: 16 Jun 2024, 18:24:49 UTC
Last modified: 16 Jun 2024, 18:25:07 UTC

I also had 2 tasks get to 99.9% and just sit there doing nothing. One each on 2 different computers. I aborted them.
ID: 70999
Glenn Carver
Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 71001 - Posted: 16 Jun 2024, 21:38:50 UTC - in response to Message 70999.  

Ok, thanks for reporting. I will bring this up with Andy tomorrow as he manages the code controlling the OpenIFS model. I don't work on it anymore.
---
CPDN Visiting Scientist
ID: 71001
pututu
Joined: 18 Jun 17
Posts: 18
Credit: 10,293,533
RAC: 33,275
Message 71020 - Posted: 22 Jun 2024, 3:19:34 UTC

I got one task that stayed at 99.99% much longer than my average run time. I suspended the task and kept crunching the rest of the tasks. At some point, I restarted the boinc client and resumed that suspended task, and it got validated. Maybe I just restarted a missing or stalled application. Anyway, this was my first encounter.
ID: 71020
Glenn Carver
Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 71021 - Posted: 22 Jun 2024, 8:16:24 UTC - in response to Message 71020.  

Which task was it? I can look at the logs and see what happened.
---
CPDN Visiting Scientist
ID: 71021
pututu
Joined: 18 Jun 17
Posts: 18
Credit: 10,293,533
RAC: 33,275
Message 71022 - Posted: 22 Jun 2024, 13:30:05 UTC

Sorry, I didn't write down that particular task number. After restarting the boinc client, the run time got recalculated and went back to the normal average run time (not the CPU time), and the task continued from where it thought it had left off. I believe this happens to every task when you restart the boinc client: it recalculates the run time. At least this is what I observed on my system.
ID: 71022
pututu
Joined: 18 Jun 17
Posts: 18
Credit: 10,293,533
RAC: 33,275
Message 71023 - Posted: 22 Jun 2024, 15:43:30 UTC - in response to Message 71022.  

Sorry, I didn't write down that particular task number. After restarting the boinc client, the run time got recalculated and went back to the normal average run time (not the CPU time), and the task continued from where it thought it had left off. I believe this happens to every task when you restart the boinc client: it recalculates the run time. At least this is what I observed on my system.


I meant to say using data from the checkpoint to recalculate the run time.
ID: 71023
Glenn Carver
Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 71024 - Posted: 23 Jun 2024, 14:21:24 UTC - in response to Message 71023.  

To calculate the percentage done, the controlling program (not boinc) reads one of the model log files to see what model step it's on. I think that sometimes goes wrong. When boinc and therefore the task are restarted, the file is read correctly.
---
CPDN Visiting Scientist
ID: 71024
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 71222 - Posted: 12 Aug 2024, 16:13:01 UTC

I just got six 1018 work units. Four are running and two are waiting to run. They are all _1 reruns, and all the _0 ones failed.

Here is one of mine:

Task 22505782
Name oifs_43r3_bl_a0k2_2016092300_20_1018_12289049_1
Workunit 12289049

All the _0 ones are from the same machine and user. Here is one of those work units.

Workunit 12289049
Task 	Computer 	Sent 	Time reported or deadline 	Status 	Run time 	CPU time 	Credit 	Application
22505782 	1511241 	12 Aug 2024, 7:18:49 UTC 	11 Oct 2024, 7:18:49 UTC 	In progress 	--- 	--- 	1,318.46 	OpenIFS 43r3 Baroclinic Lifecycle v1.13 x86_64-pc-linux-gnu
22442624 	1443502 	13 Jun 2024, 7:16:37 UTC 	12 Aug 2024, 7:16:37 UTC 	Timed out - no response 	0.00 	0.00 	--- 	OpenIFS 43r3 Baroclinic Lifecycle v1.13 x86_64-pc-linux-gnu

ID: 71222
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 71223 - Posted: 12 Aug 2024, 16:28:55 UTC - in response to Message 71222.  

That the originals errored out with 0 run time and your running ones have produced a first zip suggests a problem with the original computer rather than the tasks.
ID: 71223
Helmer Bryd
Joined: 16 Aug 04
Posts: 156
Credit: 9,035,872
RAC: 2,928
Message 71224 - Posted: 12 Aug 2024, 16:53:03 UTC - in response to Message 71223.  
Last modified: 12 Aug 2024, 17:01:51 UTC

That the originals errored out with 0 run time and your running ones have produced a first zip suggests a problem with the original computer rather than the tasks.

No, read more carefully please: "Timed out - no response".
Today is the deadline for that batch, 2 months
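
The dates in the workunit listing quoted earlier bear this out. A small Python check, using only the sent and deadline timestamps from that listing (the helper name `time_allowed` is made up for illustration), shows both tasks were given exactly 60 days:

```python
from datetime import datetime, timedelta

# Timestamp layout as shown on the workunit page, without the trailing "UTC".
FMT = "%d %b %Y, %H:%M:%S"

def time_allowed(sent, deadline):
    """Return the interval between the sent time and the deadline."""
    return datetime.strptime(deadline, FMT) - datetime.strptime(sent, FMT)

original = time_allowed("13 Jun 2024, 7:16:37", "12 Aug 2024, 7:16:37")
resend = time_allowed("12 Aug 2024, 7:18:49", "11 Oct 2024, 7:18:49")

print(original.days, resend.days)
```

So the _0 task was not reported as failed by the host; it simply reached its deadline, and the resend went out about two minutes later.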
ID: 71224
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 71225 - Posted: 12 Aug 2024, 18:17:57 UTC - in response to Message 71224.  

My bad. Presumably tasks downloaded and never started.
ID: 71225

©2024 cpdn.org