Thread 'Feedback on running OpenIFS large memory (16-25 Gb+) configurations requested'

Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 71595 - Posted: 14 Oct 2024, 13:50:55 UTC

People might recall that last year I was talking about running the OpenIFS model at a higher resolution configuration (60km grid) instead of the normal one we run (125km grid). We now have a project that wants to make use of this configuration. Before we decide whether to go ahead, CPDN would like some feedback from users, as this is a substantial increase on the normal resources used.

There are two key issues: memory required and the size of the checkpoint files.

OpenIFS@60km would have a peak memory requirement of roughly 25Gb. The checkpoint (or restart) files which are normally written periodically would be approx 4Gb. This compares to 6Gb RAM & 1Gb checkpoint filesize for the resolution configurations we have run to date.
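
To put those numbers side by side, here is a minimal Python sketch (not project code). It treats the quoted "Gb" figures as gigabytes. The `host_can_run` helper and its 2 GB of headroom for the operating system are hypothetical illustrations, not how the CPDN scheduler decides eligibility.

```python
# Figures quoted in the post above, treated as gigabytes.
CURRENT_CONFIG = {"peak_ram_gb": 6.0, "checkpoint_gb": 1.0}
HIRES_CONFIG = {"peak_ram_gb": 25.0, "checkpoint_gb": 4.0}


def increase_factors(old, new):
    """Return how many times larger each requirement is in the new config."""
    return {key: new[key] / old[key] for key in old}


def host_can_run(total_ram_gb, free_disk_gb, config, ram_headroom_gb=2.0):
    """Hypothetical check: RAM covers the peak plus some headroom (an assumed
    2 GB for the OS), and free disk covers one checkpoint file."""
    enough_ram = total_ram_gb >= config["peak_ram_gb"] + ram_headroom_gb
    enough_disk = free_disk_gb >= config["checkpoint_gb"]
    return enough_ram and enough_disk


factors = increase_factors(CURRENT_CONFIG, HIRES_CONFIG)
print(factors)                               # roughly 4.2x RAM, 4x checkpoint size
print(host_can_run(32, 500, HIRES_CONFIG))   # a 32 GB machine
print(host_can_run(16, 500, HIRES_CONFIG))   # a 16 GB machine
```

Under these assumptions a 32 GB host would pass and a 16 GB host would not; the real limits would depend on what else the machine is running.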

The question is: how do volunteers feel about this?

The model would run multi-core to shorten completion time. There would also be credit multipliers to take into account the extra memory & disk required.

We also think this should be an 'opt-in' application via the project preferences and the scheduler would only allow 1 task per host.

There are some technical steps I can look into to make reductions in memory & disk usage but for now I'm interested if anyone would be prepared to run this or whether it's a non-starter?

Thanks.
---
CPDN Visiting Scientist

PDW

Joined: 29 Nov 17
Posts: 82
Credit: 16,423,616
RAC: 46,988
Message 71596 - Posted: 14 Oct 2024, 14:40:19 UTC - in response to Message 71595.  

How long would you guess they might take to complete, say at 4 core or 32 core ?
Will they utilise all cores for the majority of the time ?
Would probably give it a serious go if cores are kept busy but duration would be a factor too.

Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 71597 - Posted: 14 Oct 2024, 15:07:30 UTC - in response to Message 71595.  

There are some technical steps I can look into to make reductions in memory & disk usage but for now I'm interested if anyone would be prepared to run this or whether it's a non-starter?


I think it should definitely be a starter.

My main machine (Computer 1511241) runs Red Hat Enterprise Linux release 8.10 (Ootpa) with kernel 4.18.0-553.22.1.el8_10.x86_64.

While it has 16 cores (8 real, 8 hyperthreaded), I usually allow 12 for Boinc, except in the summer when it gets too hot and I reduce it to 8 or 10.
As for checkpoint files, I run Boinc in a partition all its own. Right now, running 12 Boinc tasks, it uses hardly any of this space.

Memory 	                125.08 GB
Cache 	                16896 KB
Swap space 	        15.62 GB
Total disk space 	488.04 GB
Free Disk Space 	479.37 GB


LCB001

Joined: 5 May 13
Posts: 1
Credit: 5,893,091
RAC: 2,625
Message 71598 - Posted: 14 Oct 2024, 15:20:38 UTC

Would be interested in running these especially if the mem requirements could be dropped a bit.

rob

Joined: 5 Jun 09
Posts: 97
Credit: 3,746,817
RAC: 869
Message 71599 - Posted: 14 Oct 2024, 15:47:43 UTC

Sounds interesting - if only I had enough memory...

Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 71600 - Posted: 14 Oct 2024, 16:11:59 UTC - in response to Message 71596.  

How long would you guess they might take to complete, say at 4 core or 32 core ?
Will they utilise all cores for the majority of the time ?
Would probably give it a serious go if cores are kept busy but duration would be a factor too.
The model is very efficient, the cores will be used >95% of the time except when it's doing I/O which is on a single core. You'll also see lower efficiency if the machine is busy with other jobs.

As for how long, this OpenIFS@60 config runs 3x slower than the configurations used so far. But we would run with 2 cores (to begin with), so if you ran an OpenIFS task before, multiply the time it took by roughly 1.5. YMMV.
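
The arithmetic behind that 1.5 factor can be sketched as follows. This assumes the earlier OpenIFS tasks ran on a single core and that the model scales close to ideally across cores; `relative_runtime` and its `parallel_efficiency` parameter are illustrative names, not anything taken from the model.

```python
def relative_runtime(slowdown, cores, parallel_efficiency=1.0):
    """Runtime relative to a single-core task of the current configuration.

    slowdown: how much slower the new config is per core (3.0 in the post).
    parallel_efficiency: assumed fraction of ideal speed-up gained from each
    extra core (1.0 means perfect scaling).
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    speedup = 1 + (cores - 1) * parallel_efficiency
    return slowdown / speedup


for cores in (1, 2, 4):
    print(cores, "core(s):", relative_runtime(3.0, cores))
```

With perfect scaling, 2 cores give the quoted factor of 1.5. Lowering `parallel_efficiency` shows how quickly that estimate grows if the cores are shared with other jobs.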

I'm more concerned with how people feel about the large 4Gb checkpoint files being written to their drives periodically, particularly if anyone's running servers. I can adjust the frequency of the writes but at risk of repeating a lot of computation. The model results are highly compressed (Mb not Gb) and pose less of an issue but need to be added to the total data load. I'm investigating new compression algorithms to see if I can get the checkpoint filesize down.
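
A rough way to see the trade-off between write frequency and repeated computation is sketched below. Only the 4 GB checkpoint size comes from the post; the 24-hour runtime and the intervals are made-up values, and the model simply assumes an interruption is equally likely at any moment.

```python
def checkpoint_tradeoff(runtime_hours, interval_hours, checkpoint_gb=4.0):
    """Estimate disk writes versus work at risk for a checkpoint interval."""
    if runtime_hours <= 0 or interval_hours <= 0:
        raise ValueError("runtime and interval must be positive")
    writes = int(runtime_hours // interval_hours)
    # Work done since the last checkpoint is lost on an interruption.
    worst_case_lost = min(interval_hours, runtime_hours)
    return {
        "writes": writes,
        "total_written_gb": writes * checkpoint_gb,
        "worst_case_lost_hours": worst_case_lost,
        "average_lost_hours": worst_case_lost / 2,
    }


# Hypothetical 24-hour task with three candidate intervals.
for interval in (1, 4, 12):
    print(interval, "h interval:", checkpoint_tradeoff(24, interval))
```

Writing less often cuts the total data written in direct proportion, while the work lost to a single interruption grows by the same factor.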
---
CPDN Visiting Scientist

Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,819,403
RAC: 4,657
Message 71601 - Posted: 14 Oct 2024, 16:51:09 UTC

You can include me in the group of users willing to give them a try. My older machines are still in test mode with 16 GB RAM, but I can restore them back up to 64 GB / 32 GB when I get a convenient moment. All the machines I use for CPDN have a dedicated 2 TB SSD for BOINC data, and I have a decent upload speed for returning the results.

PDW

Joined: 29 Nov 17
Posts: 82
Credit: 16,423,616
RAC: 46,988
Message 71602 - Posted: 14 Oct 2024, 17:14:13 UTC - in response to Message 71600.  

I'd prefer the checkpointing to be user defined, in which case I'd probably turn it off !
If 2 cores works and you can quickly scale it up to more (user defined) so much the better.

Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 71603 - Posted: 14 Oct 2024, 18:36:48 UTC - in response to Message 71602.  
Last modified: 14 Oct 2024, 18:38:54 UTC

I'd prefer the checkpointing to be user defined, in which case I'd probably turn it off !
If 2 cores works and you can quickly scale it up to more (user defined) so much the better.

I wondered about people being able to turn it off. But if the machine is rebooted, the task would have to start all over again. If you're prepared to accept that, I can look into it at the user level. But I'm not going to make the checkpoint frequency user definable.

I would not go above 4 cores for this configuration.
---
CPDN Visiting Scientist

PDW

Joined: 29 Nov 17
Posts: 82
Credit: 16,423,616
RAC: 46,988
Message 71604 - Posted: 14 Oct 2024, 18:59:11 UTC - in response to Message 71603.  

I'm prepared to accept that, I have very rare unplanned shutdowns caused by external factors.
Failures due to the model computations rather than the user environment would be of higher concern.
4 cores is probably fine, they shouldn't be running that long from what you've said.

How many workunits would you expect there to be for this configuration ?

Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 71605 - Posted: 14 Oct 2024, 19:22:33 UTC - in response to Message 71604.  
Last modified: 14 Oct 2024, 19:22:40 UTC

I'm prepared to accept that, I have very rare unplanned shutdowns caused by external factors.
Ok, noted. That's useful feedback.

How many workunits would you expect there to be for this configuration ?
The project in question is likely to release 4-5 batches of ~2000 workunits each. Ideally we'd like the first batch to go out end of this year or early next but there's still some development work to be done.
---
CPDN Visiting Scientist

AndreyOR

Joined: 12 Apr 21
Posts: 318
Credit: 15,031,602
RAC: 4,207
Message 71606 - Posted: 14 Oct 2024, 19:35:45 UTC

I'd definitely be up for running it. What would you say would be the peak disk space requirement per task? The data files, plus multiples of the 4GB checkpoints. I'm wondering if I'd have to increase disk space allowed for BOINC, which is not a problem. I'll think about other questions/suggestions and post later.

Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 71607 - Posted: 14 Oct 2024, 21:22:43 UTC - in response to Message 71606.  

I'd definitely be up for running it. What would you say would be the peak disk space requirement per task? The data files, plus multiples of the 4GB checkpoints. I'm wondering if I'd have to increase disk space allowed for BOINC, which is not a problem. I'll think about other questions/suggestions and post later.
There's only ever 1 checkpoint file (4Gb for the new config). Plus a day's worth of model output becomes 90Mb instead of 22Mb for the batches so far. So it's the checkpoint file that dominates.

Peak space requirements would probably occur when uploads are not working: the number of model output files then starts increasing, but there's still only ever 1 checkpoint. Back of envelope, let's say ~20 uploads of model output waiting to transfer, roughly 1.75Gb extra on top of the 4Gb checkpoint? It's not huge, and if I add compression I should be able to get the checkpoint file down to less than 3Gb. Wear & tear on the storage might be a concern? (Personally it's not, but I want to raise it.)
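
That back-of-envelope estimate can be reproduced with a short sketch. The sizes are the ones quoted above (read as megabytes and gigabytes, 1024 MB per GB); the function name and the `checkpoints_on_disk` parameter are hypothetical, the latter included only to show the effect if a second checkpoint copy ever existed briefly during a write.

```python
def peak_disk_gb(checkpoint_gb=4.0, output_mb_per_upload=90.0,
                 queued_uploads=20, checkpoints_on_disk=1):
    """Checkpoint file(s) plus model output still waiting to upload, in GB."""
    queued_output_gb = queued_uploads * output_mb_per_upload / 1024.0
    return checkpoints_on_disk * checkpoint_gb + queued_output_gb


print(peak_disk_gb())                       # 4 GB checkpoint + ~1.76 GB queued output
print(peak_disk_gb(checkpoint_gb=3.0))      # with the hoped-for compressed checkpoint
print(peak_disk_gb(checkpoints_on_disk=2))  # assumption: two copies exist at once
```

Either way the checkpoint dominates: the queued output only matters once uploads have been stalled for many model days.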

You will definitely notice the impact of this configuration running on your machine if you are using it.
---
CPDN Visiting Scientist

PDW

Joined: 29 Nov 17
Posts: 82
Credit: 16,423,616
RAC: 46,988
Message 71608 - Posted: 14 Oct 2024, 21:36:00 UTC - in response to Message 71607.  

Don't you create a new checkpoint file before deleting the old checkpoint file ?

Alan K

Joined: 22 Feb 06
Posts: 492
Credit: 31,496,606
RAC: 15,431
Message 71609 - Posted: 14 Oct 2024, 22:27:34 UTC - in response to Message 71595.  

4 core CPU, 32Gb RAM and 2Tb hdd on a machine that is only used for CPDN - I see no real problem for me.

rfbrooks

Joined: 31 Aug 04
Posts: 10
Credit: 7,283,021
RAC: 14,019
Message 71610 - Posted: 14 Oct 2024, 22:51:19 UTC

I would be open to it. Upgraded to 32 GB RAM specifically to handle this type of task.

cetus

Joined: 7 Aug 04
Posts: 10
Credit: 148,100,750
RAC: 29,951
Message 71611 - Posted: 15 Oct 2024, 2:01:44 UTC

I'll definitely run these. I too have been getting lots of RAM for the last few years in order to run large models - I'll be happy to see them show up. The large checkpoint files shouldn't be an issue for me either. The only problem I had with OIFS models in the past was the internet bandwidth for uploading. Since you can't run very many of these at a time, hopefully that won't be as big an issue for these models.

Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 71612 - Posted: 15 Oct 2024, 4:20:27 UTC - in response to Message 71611.  
Last modified: 15 Oct 2024, 4:22:43 UTC

The only problem I had with OIFS models in the past was the internet bandwidth for uploading.


I do not expect that to be a problem for me. I have a Verizon FiOS fiber-optic-to-the-house Internet connection that is about 1 gigabit per second up and down, so it does not get choked up at my end; the rest is up to the servers. I notice it is a little slow tonight, but I was listening to video at the same time as I ran this speed test.

Timestamp 	   Download    Upload 	   Latency Jitter Quality Score Test Server
10/15/2024 0:1:23  878.87 Mbps 830.28 Mbps 6 ms    1 ms   Excellent     nyc.mega.host.speedtest.net.prod.hosts.ooklaserver.net
10/3/2024 23:41:31 860.61 Mbps 860.97 Mbps 4 ms    2 ms   Excellent     nyc.mega.host.speedtest.net.prod.hosts.ooklaserver.net
6/25/2024 13:57:52 819.96 Mbps 925.68 Mbps 5 ms    1 ms   Excellent     nyc.mega.host.speedtest.net.prod.hosts.ooklaserver.net


Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 71613 - Posted: 15 Oct 2024, 7:22:50 UTC
Last modified: 15 Oct 2024, 7:31:56 UTC

You won't be surprised that I am up for running these, as I now have 64GB of RAM and 32 cores. Looking forward to them on the testing site.
Running just one at a time will prevent a large backlog of data uploads when 2 cores are being used. It might not do so with 4. Though I am guessing I just wouldn't get a new task of this type before the first one uploaded.

AndreyOR

Joined: 12 Apr 21
Posts: 318
Credit: 15,031,602
RAC: 4,207
Message 71614 - Posted: 15 Oct 2024, 7:34:26 UTC - in response to Message 71607.  

Ok, a few Gb per task peak disk requirement is not a problem for me. Disk usage is also not a concern, relatively modern SSDs last a long time even with heavy usage.

I'd also like the ability to use more cores. In addition, perhaps allow users to choose to run 2-3 tasks at a time, if they have the RAM.