Message boards : Number crunching : OpenIFS Discussion
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944 |
I got four from testing; all failed in just over one and a half minutes with: ABORT! 1 RRTM_KGB16:ERROR READING FILE RADRRTM. At least that is the only line that leaps out at me. Batch D523. Don't know if it is worth setting up Trello cards for these or not? |
Send message Joined: 29 Oct 17 Posts: 1052 Credit: 16,817,940 RAC: 12,877 |
Sorry - being an idiot. I was looking in projects/climateprediction.net which has an app_config.xml instead of dev.cpdn which doesn't and is where the apps were running..... oh I wish the forums would let me delete posts :D |
Send message Joined: 29 Oct 17 Posts: 1052 Credit: 16,817,940 RAC: 12,877 |
I got four from testing all failed in just over one and a half minutes with ABORT! 1 RRTM_KGB16:ERROR READING FILE RADRRTM At least that is the only line that leaps out at me. batch D523.
No, don't bother. I've already emailed the scientist. The file is there, I've checked, but the model configuration is wrong. I wish they would stop sending out tasks though.
p.s. Dave - top marks for finding the 1-line error message in the very long traceback! If you are interested: the file the model is looking for is in slots/?/ifsdata/RADRRTRM. My guess is the model has been told the wrong directory name. |
Send message Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944 |
Two running from #945. Edit: Running two at once, no problem with my broadband keeping up. Peak memory usage at the moment seems to be about 13% of 32GB. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
two running from #945
I, too, have two running at once. How would I tell if I were choking my broadband connection? I get 75 Megabits/second up and down if the server at the other end can keep up. Memory usage at the moment is:
$ free -hw
        total   used    free   shared  buffers  cache  available
Mem:     62Gi   12Gi   1.6Gi    102Mi    332Mi   48Gi       49Gi
Swap:    15Gi   82Mi    15Gi
So I am really using 12 Gigabytes out of 62 Gigabytes total. This includes 10 other Boinc tasks that are not CPDN. Not only are there 1.6 Gigabytes free, but there are also 49 Gigabytes available by grabbing some of the input disk cache if needed (without swapping). |
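The `free` output above separates memory that is strictly free from memory that is available (free plus reclaimable cache). A minimal Python sketch of reading those two figures, assuming the text comes from `free -bw` (bytes, wide layout). The function name `parse_free_mem` is hypothetical, and the sample values are made up to roughly mirror the figures in the post, not copied from the poster's machine.

```python
def parse_free_mem(text: str) -> dict[str, int]:
    """Return the 'Mem:' row of `free -bw` output as {column_name: bytes}."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()  # total, used, free, shared, buffers, cache, available
    for ln in lines[1:]:
        parts = ln.split()
        if parts[0] == "Mem:":
            return dict(zip(header, map(int, parts[1:])))
    raise ValueError("no 'Mem:' row found")


# Hypothetical sample, sized to resemble the 62Gi machine in the post.
SAMPLE = """\
              total        used        free      shared     buffers       cache   available
Mem:    66571993088 12884901888  1717986918   106954752   348127232 51539607552 52613349376
Swap:   16106127360    85983232 16020144128
"""

mem = parse_free_mem(SAMPLE)
gib = 2**30
print(f"used {mem['used'] / gib:.1f} GiB, "
      f"free {mem['free'] / gib:.1f} GiB, "
      f"available {mem['available'] / gib:.1f} GiB")
```

The "available" column is usually the more useful number when deciding whether another model will fit, since cache can be dropped without swapping.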
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,817,746 RAC: 4,590 |
They released them while I was out at the pub! Never mind - got my first couple, and have preserved the detail for the morning. 123 output files! Waiting till I see the sizes, but I hope Oxford know what they've unleashed on their creaking infrastructure. Initial runtime estimate 60 hours 46 minutes. Again, I'll do the maths in the morning. |
Send message Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944 |
123 output files! Waiting till I see the sizes, but I hope Oxford know what they've unleashed on their creaking infrastructure.
Averaging about 14.2MB looking at mine. (Haven't actually done the arithmetic to calculate the mean.) |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,817,746 RAC: 4,590 |
One file every 11 minutes: 28/11/2022 22:35:43 | climateprediction.net | [cpu_sched] Starting task oifs_43r3_ps_0923_2021050100_123_945_12164012_0 using oifs_43r3_ps version 101 in slot 2 |
Send message Joined: 22 Feb 06 Posts: 492 Credit: 31,494,949 RAC: 15,461 |
Got 3. Posting zips about every 10 mins. Estimated completion about 17 hours. (3.5GHz i5, 24GB RAM). |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
One file every 11 minutes:
Mine is a little different; I have two of these running. So one every 7 minutes for each of them. If I knew how big they were, I could tell how much bandwidth I need to send them. They seem to take my machine about 5 seconds to upload each one.
Mon 28 Nov 2022 08:06:14 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_0002_2021050100_123_945_12163091_0_r1172930429_45.zip
Mon 28 Nov 2022 08:06:19 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0002_2021050100_123_945_12163091_0_r1172930429_45.zip
Mon 28 Nov 2022 08:07:31 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_0030_2021050100_123_945_12163119_0_r632876908_45.zip
Mon 28 Nov 2022 08:07:36 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0030_2021050100_123_945_12163119_0_r632876908_45.zip
Mon 28 Nov 2022 08:13:25 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_0002_2021050100_123_945_12163091_0_r1172930429_46.zip
Mon 28 Nov 2022 08:13:30 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0002_2021050100_123_945_12163091_0_r1172930429_46.zip
Mon 28 Nov 2022 08:14:42 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_0030_2021050100_123_945_12163119_0_r632876908_46.zip
Mon 28 Nov 2022 08:14:47 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0030_2021050100_123_945_12163119_0_r632876908_46.zip |
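The bandwidth question in the post above can be estimated roughly. A minimal Python sketch, assuming the roughly 14.2MB average file size another poster reported in this thread, 1 MB = 10^6 bytes, one file per task every 7 minutes, about 5 seconds per upload, and uploads that do not overlap. The function name `upload_rates` is hypothetical.

```python
def upload_rates(file_mb: float, interval_s: float, burst_s: float, tasks: int = 1):
    """Return (sustained, burst) upload rates in megabits per second.

    sustained: average rate needed to keep up with all running tasks
    burst:     rate actually achieved while one file is being sent
    """
    sustained = tasks * file_mb * 8 / interval_s
    burst = file_mb * 8 / burst_s
    return sustained, burst


# Two tasks, one ~14.2 MB file every 7 minutes each, ~5 s per upload.
sustained, burst = upload_rates(file_mb=14.2, interval_s=7 * 60, burst_s=5, tasks=2)
print(f"sustained ~{sustained:.2f} Mbit/s, burst ~{burst:.1f} Mbit/s")
```

Under these assumptions the sustained requirement is well under 1 Mbit/s, so a link is only at risk of choking when its upstream rate is close to or below that figure.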
Send message Joined: 22 Feb 06 Posts: 492 Credit: 31,494,949 RAC: 15,461 |
Got 3. Posting zips about every 10 mins. Estimated completion about 17 hours. (3.5GHz i5, 24GB RAM).
Got another 4 and set CPU to 100% (i.e. 4 cores). Getting message that one task is running or waiting for memory as expected. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,817,746 RAC: 4,590 |
Running the numbers from last night. I see the server has started me off with an expected speed that exactly matches my Whetstone benchmark. The actual running speed seems to be much faster than that, which is no bad thing: better that we don't underestimate it, and risk missing deadlines.
My machines normally run primarily as GPU platforms, so my CPU efficiency is low - CPU time is currently barely two-thirds of wall-clock time. I'm running down my GPU cache and other work, so I'll get some 'normal' times from the next batch.
Uploads are being generated as I saw last night, and all are going through cleanly. Trickle reports are being batched up and sent once per hour, as per server delay request. Trickle data is minimal, but that's probably all it needs to be.
<msg_from_host>
<result_name>oifs_43r3_ps_0799_2021050100_123_945_12163888_0</result_name>
<time>1669718045</time>
<variety>orig</variety>
<wu>oifs_43r3_ps_0799_2021050100_123_945_12163888</wu>
<result>oifs_43r3_ps_0799_2021050100_123_945_12163888_0_r987065464</result>
<ph></ph>
<ts>6307200</ts>
<cp>24355</cp>
<vr></vr>
</msg_from_host> |
Send message Joined: 31 May 18 Posts: 53 Credit: 4,725,987 RAC: 9,174 |
I've got a few of these new units. So far two completed ok and two with errors. The first error log ends with:
Uploading the intermediate file: upload_file_21.zip
00:22:21 STEP 529 H= 529:00 +CPU= 12.302
Uploading trickle at timestep: 1900800
00:22:36 STEP 530 H= 530:00 +CPU= 15.541
double free or corruption (out)
The other:
Uploading the intermediate file: upload_file_19.zip
18:58:27 STEP 481 H= 481:00 +CPU= 9.772
18:58:37 STEP 482 H= 482:00 +CPU= 10.168
free(): invalid next size (fast) |
Send message Joined: 29 Oct 17 Posts: 1052 Credit: 16,817,940 RAC: 12,877 |
Got 3. Posting zips about every 10 mins. Estimated completion about 17 hours. (3.5GHz i5, 24GB RAM).
We can adjust the trickle frequency if it causes a problem. Please, just ignore what the boinc client reports as estimated time to completion. It's not going to get it right at all because it's a new app. Work it out from the '%age done' and the elapsed time. |
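The "work it out from the '%age done' and the elapsed time" advice amounts to a simple proportion. A minimal Python sketch, assuming the model progresses at a roughly constant rate; the function name `remaining_seconds` and the example figures (25% done after 3 hours) are hypothetical.

```python
def remaining_seconds(elapsed_s: float, fraction_done: float) -> float:
    """Estimate time left from elapsed time and fraction complete (0 < f <= 1).

    Assumes a constant rate of progress: total = elapsed / fraction_done.
    """
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_s * (1 - fraction_done) / fraction_done


# Hypothetical example: 25% done after 3 hours of elapsed time.
left = remaining_seconds(elapsed_s=3 * 3600, fraction_done=0.25)
print(f"about {left / 3600:.1f} hours remaining")
```

The estimate becomes more trustworthy as the task progresses, because early percentages are dominated by start-up time.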
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
I've seen at least one of those 'double free or corruption' but only on an old i7-7700 with non-ecc memory. |
Send message Joined: 29 Oct 17 Posts: 1052 Credit: 16,817,940 RAC: 12,877 |
I've got a few of these new units. So far two completed ok and two with errors.
Ah! Excellent. I've been trying to understand why some tasks are apparently stopping with nothing in the stderr.txt returned to the server to explain why they stopped. @DarkAngel - can you tell me which resultids those were so I can look them up? Also, what machine & OS are you running these on?
This kind of error message indicates a memory problem, often caused by a bug in the code, but I've also seen it caused by certain versions of compilers/system libraries. I've never seen it with the model itself, but then I've never run the model on such a wide range of systems as this. It could also be the wrapper code we use.
Quick question. When the tasks are running, if you do 'ps -ef' you should see the same number of 'master.exe' processes as 'oifs_43r3_ps_1.01_x86_64-pc-linux-gnu'. The latter is the 'controller' for the model itself (master.exe). Do you have the same number of each? I ask because we know of one issue that can kill the running 'oifs_43r3....' process but still leave the model 'master.exe' running.
Thanks for your help. |
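The 'ps -ef' check described above can be scripted. A minimal Python sketch that counts controller and model processes in captured `ps -ef` text, assuming the standard eight-column layout (UID PID PPID C STIME TTY TIME CMD). The function name `count_processes` and the sample listing, including its paths and PIDs, are hypothetical illustrations, not output from a real host.

```python
def count_processes(ps_output: str) -> tuple[int, int]:
    """Return (controllers, models) counted from `ps -ef` text."""
    controllers = models = 0
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 7)  # the 8th field is the full command
        if len(fields) < 8:
            continue
        exe = fields[7].split()[0].rsplit("/", 1)[-1]  # basename of the executable
        if exe.startswith("oifs_43r3_ps_"):
            controllers += 1
        elif exe == "master.exe":
            models += 1
    return controllers, models


# Hypothetical listing: two controllers but three models, so one model is orphaned.
SAMPLE_PS = """\
UID          PID    PPID  C STIME TTY          TIME CMD
boinc       2101    1500  0 08:00 ?        00:00:02 ../../projects/climateprediction.net/oifs_43r3_ps_1.01_x86_64-pc-linux-gnu
boinc       2110    2101 99 08:00 ?        02:10:11 ./master.exe
boinc       2202    1500  0 08:01 ?        00:00:01 ../../projects/climateprediction.net/oifs_43r3_ps_1.01_x86_64-pc-linux-gnu
boinc       2215    2202 99 08:01 ?        02:09:40 ./master.exe
boinc       2300       1 99 07:30 ?        03:00:00 ./master.exe
"""

controllers, models = count_processes(SAMPLE_PS)
if controllers != models:
    print(f"mismatch: {controllers} controller(s), {models} master.exe process(es)")
```

On a live system the text could be captured with `subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout` and passed to the same function.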
Send message Joined: 29 Oct 17 Posts: 1052 Credit: 16,817,940 RAC: 12,877 |
I've seen at least one of those 'double free or corruption' but only on an old i7-7700 with non-ecc memory.
It's not the hardware. I have an even older i7-3770 which I've never seen this issue on. It's a software/OS issue, which unfortunately won't be easy to track down. If anyone who gets these can report the URL of the result id (e.g. https://www.cpdn.org/cpdnboinc/result.php?resultid=22248137) that would help (please send Private Message so as not to flood this thread, thx.) |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Sadness. My ADSL that's been adequate isn't adequate anymore for the many uploads per model - that's about a GiB and a half per work-unit. Throttling downloads until my very Asymmetric ISP upload bottleneck gets replaced with Gbit (likely soon). Models run in about 11 hours on my slowest and fastest multicore machines, but as was disclosed way in advance, they need at least 5GB per running model; if they get less, they slow waaay down. I've ordered an AMD 5800X3D to see if the bigger L3 cache helps with this kind of work. Thanks to all for supporting this work with your time and compute capacity. |
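The "GiB and a half per work-unit" figure is consistent with numbers given earlier in the thread (123 output files averaging about 14.2MB). A minimal Python sketch of that arithmetic, assuming 1 MB = 10^6 bytes; the 1 Mbit/s ADSL upstream rate is a hypothetical value chosen for illustration, not the poster's actual line speed.

```python
FILES_PER_TASK = 123   # output files per task, as reported in the thread
AVG_FILE_MB = 14.2     # approximate average file size, as reported in the thread


def total_upload_bytes(n_files: int, avg_mb: float) -> float:
    """Total bytes uploaded for one task, taking 1 MB as 10**6 bytes."""
    return n_files * avg_mb * 1e6


def hours_to_upload(total_bytes: float, uplink_mbit_s: float) -> float:
    """Hours needed to send total_bytes over an uplink of the given rate."""
    return total_bytes * 8 / (uplink_mbit_s * 1e6) / 3600


total = total_upload_bytes(FILES_PER_TASK, AVG_FILE_MB)
hours = hours_to_upload(total, uplink_mbit_s=1.0)  # hypothetical 1 Mbit/s upstream
print(f"~{total / 2**30:.2f} GiB per task, ~{hours:.1f} h of upload at 1 Mbit/s")
```

With an 11-hour run time, an uplink of this hypothetical speed would be busy for a large share of each task, and several concurrent tasks could produce data faster than the link can send it.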
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
We can adjust the trickle frequency if it causes a problem.
I do not see any problem. I have completed three work units without error (no credit assigned yet, but that is to be expected). As far as a problem is concerned, would that be too many trickles?
ps_1016 took 5 seconds to upload. Then 8 minutes until the next one.
ps_1785 took 6 seconds to upload. Then 8 minutes until the next one.
ps_0961 took 4 seconds to upload. Then 8 minutes until the next one.
Tue 29 Nov 2022 04:00:11 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_1016_2021050100_123_946_12164105_0_r1596679317_76.zip
Tue 29 Nov 2022 04:00:16 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_1016_2021050100_123_946_12164105_0_r1596679317_76.zip
Tue 29 Nov 2022 04:04:41 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_1785_2021050100_123_946_12164874_0_r29850111_9.zip
Tue 29 Nov 2022 04:04:47 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_1785_2021050100_123_946_12164874_0_r29850111_9.zip
Tue 29 Nov 2022 04:05:10 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_0961_2021050100_123_945_12164050_0_r1728848570_77.zip
Tue 29 Nov 2022 04:05:14 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0961_2021050100_123_945_12164050_0_r1728848570_77.zip
Tue 29 Nov 2022 04:08:13 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_1016_2021050100_123_946_12164105_0_r1596679317_77.zip
Tue 29 Nov 2022 04:08:18 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_1016_2021050100_123_946_12164105_0_r1596679317_77.zip
Tue 29 Nov 2022 04:12:33 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_1785_2021050100_123_946_12164874_0_r29850111_10.zip
Tue 29 Nov 2022 04:12:39 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_1785_2021050100_123_946_12164874_0_r29850111_10.zip
Tue 29 Nov 2022 04:13:03 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_0961_2021050100_123_945_12164050_0_r1728848570_78.zip
Tue 29 Nov 2022 04:13:07 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0961_2021050100_123_945_12164050_0_r1728848570_78.zip
Running three at a time seems to be no problem at all. I do have a fast Internet link. According to my CPDN computer (Computer 1511241) page, I get
Average upload rate: 3170 KB/sec
Average download rate: 15674.33 KB/sec
And according to the Speakeasy speed test site,
Timestamp: 11/29/2022 16:30:21
Download: 78.70 Mbps
Upload: 89.08 Mbps
Latency: 6 ms
Jitter: 1 ms
Quality Score: Excellent
Test Server: nyc.speedtest.clouvider.net.prod.hosts.ooklaserver.net |
Send message Joined: 29 Oct 17 Posts: 1052 Credit: 16,817,940 RAC: 12,877 |
My ADSL that's been adequate isn't adequate anymore for the many uploads per model
Just so I understand properly: is it the number of trickles that's an issue, or the total amount of data? The model output for the complete forecast is split into the smaller trickle files (to ease the data upload burden). We could do fewer trickles but the total data size would be the same (each trickle would be larger). I'm assuming it's the total size of the upload (sum of all trickle sizes) that's a problem? We can ask the scientist to reduce the model output if necessary. |