Message boards : Number crunching : Model crashed: INITTIME: Atmosphere basis time mismatch
Author | Message |
---|---|
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
New batch of EU models this morning, but they all fail within a few seconds with the INITTIME error. The major nuisance is that with limited bandwidth it takes minutes to download WUs that then crash after a few seconds of run time. Is the whole batch of several thousand likely to fail with this error? Edit: I now have at least one that has run for several minutes, so the fear that the whole batch is bad looks unjustified. |
Send message Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944 |
I suspended one of my tasks to see if the 1996 unit I had downloaded today would crash or not and it didn't. I have also downloaded one 2013 model and one 2002 model. Seems a strange mix if they are all part of the batch released - I suspect they are not. Edit: Two of them actually downloaded yesterday evening at some point. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,902,413 RAC: 2,084 |
There is a mix of EU models at the moment. Two I got this morning are reissued timeouts from 5 April 2013. Nothing to do with the flooding attribution at all. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
After more (slow) downloads: all the models I've got today are eu models, all 2013, and all but the first 3 started OK. The 3 with the INITTIME error on startup were named a4my, a4mz and arn1. The INITTIME error comes from some kind of inconsistent parameters in the WU, yes? |
Send message Joined: 9 Sep 04 Posts: 228 Credit: 30,811,302 RAC: 3,160 |
Same error over 19 workunits: <core_client_version>7.2.39</core_client_version> <![CDATA[ <message> WU download error: couldn't get input files: <file_xfer_error> <file_name>o32_A2_1984_2020_N96_f.gz</file_name> <error_code>-224 (permanent HTTP error)</error_code> <error_message>permanent HTTP error</error_message> </file_xfer_error> <file_xfer_error> <file_name>so2dms_N96_2013_12_2015_02f.gz</file_name> <error_code>-224 (permanent HTTP error)</error_code> <error_message>permanent HTTP error</error_message> </file_xfer_error> <file_xfer_error> <file_name>ancil_OSTIA_seaice_2014.gz</file_name> <error_code>-224 (permanent HTTP error)</error_code> <error_message>permanent HTTP error</error_message> </file_xfer_error> <file_xfer_error> <file_name>ancil_OSTIA_SST_2014.gz</file_name> <error_code>-224 (permanent HTTP error)</error_code> <error_message>permanent HTTP error</error_message> </file_xfer_error> </message> ]]> |
Send message Joined: 18 Dec 13 Posts: 62 Credit: 1,078,935 RAC: 0 |
Hadam3p_eu_qgp9_2004 (d/l last night), a5qm_2013 and a5qc_2013 (d/l this morning) running normally, so far. WU a5qk_2013 ready to start when a core becomes available. Looks like a patchy glitch. HTH |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I've been gradually allowing new work over the last few hours, and all models are for 2013, a5 and a6 series. All except one are original. The resend ran for 6+ hours on the original computer. Oops! The one that just finished downloading on this computer has just failed after 19 seconds. I'll log it, then upload and see what happened. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Bonsai, that list says that you have BOINC 7.2.39. There's a post on BOINC/dev somewhere about that version having a bad bug. I think that it was something to do with file transfers. :( |
Send message Joined: 9 Sep 04 Posts: 228 Credit: 30,811,302 RAC: 3,160 |
changed to 7.2.42 |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The failure was due to the INITTIME problem. The replacement is a4u9, so this batch is, so far: a4.., a5.., and a6.. And they're going fast. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I got three this morning that are eu_a8 (2013) series. Each one has over 7 hours on it, so if they fail, they are not failing fast. |
Send message Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944 |
All of the recent tasks that crashed for me did so in under two minutes. In the graphics they never got to the stage of showing a running model. I have therefore suspended my running models to check the downloaded ones, and I now have two running and three that I believe to be good waiting to run. If I had waited for my current models to finish before testing, I would have missed the boat on those I do have. |
Send message Joined: 28 May 17 Posts: 49 Credit: 17,420,787 RAC: 2,223 |
Reviving an old thread as it was the 1st result on Google. I am getting the error in the thread title on HadSM4 at N144 tasks: no CPU usage, then an abort after around a minute. The same PC completes N216 tasks. https://www.cpdn.org/result.php?resultid=22071424 https://www.cpdn.org/result.php?resultid=22071421 I've got several more paused, as BOINC downloaded too many. I'd like to fix this if possible before resuming. |
Send message Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944 |
This has been reported to the project. Looking through tasks from this batch, I have so far found one other with this type of crash and will let the project know. As of about 0100 UTC there were only 46 of this batch running, so it is difficult to know yet how widespread the problem is, but having found a third one out of those 46, I suspect a problem with the ancillary files for the tasks. Edit: As of 13 minutes ago, the batch has been paused while they do some checking. A subsequent batch that was about to go out has also been paused, as it is part of the same experiment. |
Send message Joined: 28 May 17 Posts: 49 Credit: 17,420,787 RAC: 2,223 |
OK, thanks for checking. I resumed the other 4 tasks on that PC, since it seemed like a batch issue rather than missing libs or something on the PC. Same result. |
Send message Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944 |
They think the issue has been identified and will, I presume, either cancel the tasks and resend them with correctly dated files or fix them in situ on the server. (I don't know if they can do that.) |
Send message Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944 |
Looks like they have been withdrawn, as there are only 12 waiting to go out on the server. Sarah thinks it is a start date issue on one or more of the files. At a guess, they will reappear with this corrected later today or tomorrow. Any sitting on machines now that have gotten past the one or two minute stage may well be OK, but if I had any in my queue I would abort them until the fixed ones come out. |
Send message Joined: 28 May 17 Posts: 49 Credit: 17,420,787 RAC: 2,223 |
Resends are still being sent out. It's a lot of downloading just to abort in a minute. And since there are no app selections here, I've got a ton of N216 work. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Me, too. I got only one N144, and it bombed very fast. I got it yesterday (I think), but my client server did not try to run it until today. Task 22073399 Name hadsm4_a1in_201310_6_907_012084122_0 Workunit 12084122 Created 25 May 2021, 11:15:17 UTC Sent 26 May 2021, 8:08:45 UTC Report deadline 8 May 2022, 13:28:45 UTC Received 26 May 2021, 14:57:57 UTC Server state Over Outcome Computation error Client state Compute error Exit status 22 (0x00000016) Unknown error code Computer ID 1511241 Run time 25 sec CPU time Validate state Invalid Credit 0.00 Device peak FLOPS 6.51 GFLOPS Application version UK Met Office HadSM4 at N144 resolution v8.02 i686-pc-linux-gnu Peak working set size 10.14 MB Peak swap size 16.77 MB Peak disk usage 0.02 MB Stderr <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> process exited with code 22 (0x16, -234)</message> <stderr_txt> Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xnnuj.pipe_dummy Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xnnuj.pipe_dummy Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xnnuj.pipe_dummy Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xnnuj.pipe_dummy Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xnnuj.pipe_dummy Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xnnuj.pipe_dummy Sorry, too many model crashes! :-( 09:56:54 (266358): called boinc_finish(22) </stderr_txt> ]]> |
Send message Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944 |
Am told the script is fixed now. The batches involved will probably go out some time after the "9-5" staff arrive in Oxford. |
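
For anyone triaging a similar report, here is a minimal sketch of how the two failure signatures posted in this thread could be told apart from a saved copy of a task's stderr text. It assumes the text has been copied from the task page into a string; the function names `count_inittime_crashes` and `failed_downloads` are made up for illustration and are not part of BOINC or the CPDN apps. It only looks for the exact strings quoted in the posts above.

```python
import re
from typing import List, Tuple

# Exact message quoted in the stderr output posted in this thread.
INITTIME_MSG = "Model crashed: INITTIME: Atmosphere basis time mismatch"

# Matches one <file_xfer_error> entry as it appears in the posted download
# error: a file name followed by a numeric error code (e.g. -224).
XFER_ERROR_RE = re.compile(
    r"<file_xfer_error>\s*<file_name>([^<]+)</file_name>\s*<error_code>(-?\d+)"
)


def count_inittime_crashes(stderr_text: str) -> int:
    """Return how many times the INITTIME crash message appears."""
    return stderr_text.count(INITTIME_MSG)


def failed_downloads(stderr_text: str) -> List[Tuple[str, int]]:
    """Return (file name, error code) pairs for each file transfer error."""
    return [(name, int(code)) for name, code in XFER_ERROR_RE.findall(stderr_text)]
```

A non-zero crash count points at the batch-side date problem discussed above, whereas a non-empty download list means the task never got its input files, which is a different failure.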