climateprediction.net (CPDN) home page
Thread 'Erroneous disk space notices'

Aurum
Joined: 15 Jul 17
Posts: 99
Credit: 18,701,746
RAC: 318
Message 64511 - Posted: 28 Sep 2021, 16:37:40 UTC - in response to Message 64506.  

1. The computer may be crashing lots of tasks which are not being cleared out, thus slowly taking up disk space.
My computer is not crashing these WUs; they are crashing because their parameters are faulty. But either way, you're saying this project is incapable of cleaning up behind itself.
2. If the computer is being allowed to continue running tasks while the re-cabling of Oxford is being done and we're "off the air", then it will also fill up with files waiting to be sent back.
Everything has been sent back and the problem persists. Besides, I've never come close to filling my SSDs, which have anywhere from 74 GB to 700 GB available, and the problem still persists.
ID: 64511 · Report as offensive     Reply Quote
Aurum
Joined: 15 Jul 17
Posts: 99
Credit: 18,701,746
RAC: 318
Message 64512 - Posted: 28 Sep 2021, 16:40:10 UTC - in response to Message 64505.  

You may be right, but even then, given that as far as I know this exact problem hasn't appeared before, that implies there is something about your setup that triggers the problem. I, like Les, have been crunching since the early days of the project and have not encountered it either personally or on the message boards until now.

I just checked the system requirements page and the page has clearly needed an update for a long time! I would suggest a minimum of 2GB/core, and that will likely go up further when the mythical OpenIFS tasks appear. (I have 32GB for a 16 core (8 real ones) machine and am regretting not getting double that.)
Do I misunderstand something, or is something else wrong? The original post shows complaints that the O.P. does not have enough DISK SPACE. And the responses seem to be about the amount of RAM needed.
Exactly. The message seems to be about storage memory of which I have no shortage. Maybe RAM is the problem. I think this program reserves memory somehow and hogs it for itself. Problem is I don't know how to get Linux to give me a report showing what it's reserved for itself, which is obviously too much.
ID: 64512 · Report as offensive     Reply Quote
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 64513 - Posted: 28 Sep 2021, 17:06:02 UTC - in response to Message 64512.  

Exactly. The message seems to be about storage memory of which I have no shortage. Maybe RAM is the problem. I think this program reserves memory somehow and hogs it for itself. Problem is I don't know how to get Linux to give me a report showing what it's reserved for itself, which is obviously too much.


If RAM is the problem, why is the complaint:
climateprediction.net: Notice from server
UK Met Office HadAM4 at N216 resolution needs 133.09MB more disk space. You currently have 1774.26 MB available and it needs 1907.35 MB.
9/19/2021 3:57:48 AM  Rig-45    
--------------------------------------------------------------------------------
climateprediction.net: Notice from server
UK Met Office HadAM4 at N216 resolution needs 1907.35MB more disk space. You currently have 0.00 MB available and it needs 1907.35 MB.
9/18/2021 3:45:56 PM  Rig-17, Rig-36 
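
The numbers in these notices can be checked against each other. The following is a minimal sketch, assuming the notice wording is exactly as quoted in this thread; `NOTICE` and `parse_notice` are hypothetical names made up for illustration, not part of BOINC.

```python
import re

# Pattern written against the notice wording quoted in this thread only.
NOTICE = re.compile(
    r"needs (?P<short>[\d.]+) ?MB more disk space\.\s+"
    r"You currently have (?P<have>[\d.]+) MB available "
    r"and it needs (?P<need>[\d.]+) MB"
)


def parse_notice(text):
    """Extract the three figures from a notice and check they add up."""
    match = NOTICE.search(text)
    if match is None:
        return None
    short, have, need = (float(match.group(k)) for k in ("short", "have", "need"))
    return {
        "shortfall_mb": short,
        "available_mb": have,
        "needed_mb": need,
        # The shortfall should equal needed minus available (to rounding).
        "consistent": abs((need - have) - short) < 0.01,
    }


if __name__ == "__main__":
    rig45 = ("UK Met Office HadAM4 at N216 resolution needs 133.09MB more disk "
             "space. You currently have 1774.26 MB available and it needs 1907.35 MB.")
    print(parse_notice(rig45))
```

Both notices are internally consistent (1907.35 - 1774.26 = 133.09), so the open question is where the "available" figure comes from. It is far smaller than the free space on the SSD, which suggests it is an allowance rather than the raw free space on the drive.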


The disk space you report differs from the disk space reported by whatever process is generating the shortage notices (above). Why are they different? And just what is the process that produces these error messages?

Rig-45 has 338 GB available on its SSD with 22.5 GiB used of 31.1 GiB RAM and a barely used 16 GiB swap file.
Rig-17 has 89 GB available on its SSD with 23.9 GiB used of 31.1 GiB RAM and a barely used 16 GiB swap file.


I would think your computer would complain about RAM shortage. But I have never seen that. OTOH, if you did not have enough RAM, would not the Boinc Client just refrain from running the program until enough RAM were available?

The Linux command free -h will tell you a lot about RAM and swap space usage. On my machine with 64 GBytes RAM and running Red Hat Enterprise Linux release 8.4 (Ootpa), I get:

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           62Gi        10Gi       2.5Gi       120Mi        49Gi        50Gi
Swap:          15Gi        12Mi        15Gi
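
For anyone who wants the same RAM figures from a script, here is a minimal sketch that reads /proc/meminfo, which is where free gets its numbers on Linux. The function names and the SAMPLE text are made up for illustration; the sample values are not taken from any machine in this thread.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: value in KiB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def summarize_memory(values):
    """Return (total, available, swap free) in GiB."""
    kib_per_gib = 1024 * 1024
    keys = ("MemTotal", "MemAvailable", "SwapFree")
    return tuple(values.get(key, 0) / kib_per_gib for key in keys)


# Made-up sample, used only when /proc/meminfo cannot be read.
SAMPLE = """\
MemTotal:       65011712 kB
MemFree:         2621440 kB
MemAvailable:   52428800 kB
SwapTotal:      15728640 kB
SwapFree:       15716352 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as handle:
            text = handle.read()
    except OSError:
        text = SAMPLE
    total, available, swap_free = summarize_memory(parse_meminfo(text))
    print(f"RAM total {total:.1f} GiB, available {available:.1f} GiB, "
          f"swap free {swap_free:.1f} GiB")
```

The "available" figure is the one to watch, since "free" excludes memory the kernel is using for cache and can hand back on demand. None of this says anything about disk space.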

ID: 64513 · Report as offensive     Reply Quote
Aurum
Joined: 15 Jul 17
Posts: 99
Credit: 18,701,746
RAC: 318
Message 64514 - Posted: 28 Sep 2021, 17:26:01 UTC

I just doubled the RAM in Rig-47 from 32 GB to 64 GB. Rig-47: https://www.cpdn.org/show_host_detail.php?hostid=1521364
Note that this computer has Free Disk Space = 69.52 GB and yet WCG won't run because it says it needs 500 MB more.
134	World Community Grid	9/28/2021 10:15:20 AM	Message from server: OpenPandemics - COVID 19 needs 200.00MB more disk space.  You currently have 0.00 MB available and it needs 200.00 MB.	
135	World Community Grid	9/28/2021 10:15:20 AM	Message from server: Mapping Cancer Markers needs 500.00MB more disk space.  You currently have 0.00 MB available and it needs 500.00 MB.
aurum@Rig-47:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:           62Gi        15Gi        25Gi       126Mi        21Gi        46Gi
Swap:          15Gi          0B        15Gi
Rig-47 has 10 hadam4h WUs and one hadam4 WU running. Hyperthreading is disabled so this i9-10980XE CPU has 18 CPU cores available.
I think the problem lies entirely with the code for hadam.
ID: 64514 · Report as offensive     Reply Quote
Aurum
Joined: 15 Jul 17
Posts: 99
Credit: 18,701,746
RAC: 318
Message 64515 - Posted: 28 Sep 2021, 17:41:26 UTC
Last modified: 28 Sep 2021, 17:44:09 UTC

Rig-34 had this problem when it had 10 hadam4h WUs running but after completing one it's behaving normally and allowing WCG WUs to run. So it can handle 9 but not 10 WUs.
Rig-08 runs as expected with 4 hadam4h WUs plus 9 hadam4 WUs. This is what makes me think the problem is coded in hadam4h.
ID: 64515 · Report as offensive     Reply Quote
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 64516 - Posted: 28 Sep 2021, 19:07:19 UTC - in response to Message 64514.  

I just doubled the RAM in Rig-47 from 32 GB to 64 GB. Rig-47: https://www.cpdn.org/show_host_detail.php?hostid=1521364
Note that this computer has Free Disk Space = 69.52 GB and yet WCG won't run because it says it needs 500 MB more.

134	World Community Grid	9/28/2021 10:15:20 AM	Message from server: OpenPandemics - COVID 19 needs 200.00MB more disk space.  You currently have 0.00 MB available and it needs 200.00 MB.	
135	World Community Grid	9/28/2021 10:15:20 AM	Message from server: Mapping Cancer Markers needs 500.00MB more disk space.  You currently have 0.00 MB available and it needs 500.00 MB.

aurum@Rig-47:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:           62Gi        15Gi        25Gi       126Mi        21Gi        46Gi
Swap:          15Gi          0B        15Gi

It seems to me that the generators of these two sets of readings of disk space consumption are reporting about two different machines.
ID: 64516 · Report as offensive     Reply Quote
[SG]Felix

Joined: 4 Oct 15
Posts: 34
Credit: 9,075,151
RAC: 374
Message 64517 - Posted: 28 Sep 2021, 19:46:08 UTC - in response to Message 64514.  

I just doubled the RAM in Rig-47 from 32 GB to 64 GB. Rig-47: https://www.cpdn.org/show_host_detail.php?hostid=1521364
Note that this computer has Free Disk Space = 69.52 GB and yet WCG won't run because it says it needs 500 MB more.
134	World Community Grid	9/28/2021 10:15:20 AM	Message from server: OpenPandemics - COVID 19 needs 200.00MB more disk space.  You currently have 0.00 MB available and it needs 200.00 MB.	
135	World Community Grid	9/28/2021 10:15:20 AM	Message from server: Mapping Cancer Markers needs 500.00MB more disk space.  You currently have 0.00 MB available and it needs 500.00 MB.
aurum@Rig-47:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:           62Gi        15Gi        25Gi       126Mi        21Gi        46Gi
Swap:          15Gi          0B        15Gi
Rig-47 has 10 hadam4h WUs and one hadam4 WU running. Hyperthreading is disabled so this i9-10980XE CPU has 18 CPU cores available.
I think the problem lies entirely with the code for hadam.


This message shows you exactly where the problem is. If you are right and you have this much free space on your rigs, then you have configured BOINC wrong in terms of how much disk space it is allowed to use.

Please double-check the following options, and test how changes affect your message logs:

(tickbox) Use no more than XX GB;
(tickbox) Leave at least XX GB free;
(tickbox not ticked) Use no more than XX% of total
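
To illustrate why these three options matter, here is a minimal sketch of how the limits could combine. It assumes, based on my understanding of the BOINC client rather than anything confirmed in this thread, that the strictest of the three limits wins and that BOINC's own current usage is then subtracted. The function and parameter names are made up, and every number in the usage example except the 69.52 GB free figure is hypothetical.

```python
GIB = 1024 ** 3  # assumption: binary gigabytes; the exact unit is not confirmed here


def boinc_disk_available(total, free, boinc_used,
                         max_used_gb=None, min_free_gb=None, max_used_pct=None):
    """Return the bytes BOINC may still use under the strictest active limit.

    total, free and boinc_used are in bytes; a limit left as None is ignored.
    """
    # BOINC can never use more than what it already holds plus what is free.
    limits = [boinc_used + free]
    if max_used_gb is not None:        # "Use no more than XX GB"
        limits.append(max_used_gb * GIB)
    if min_free_gb is not None:        # "Leave at least XX GB free"
        limits.append(boinc_used + free - min_free_gb * GIB)
    if max_used_pct is not None:       # "Use no more than XX% of total"
        limits.append(total * max_used_pct / 100.0)
    allowed = min(limits)
    # If BOINC already holds more than it is allowed, nothing is available.
    return max(0.0, allowed - boinc_used)


if __name__ == "__main__":
    left = boinc_disk_available(total=500 * GIB, free=69.52 * GIB,
                                boinc_used=60 * GIB, max_used_gb=50)
    print(f"{left / GIB:.2f} GiB available to BOINC")
```

With these hypothetical settings the drive has 69.52 GB free, yet BOINC already holds more than its 50 GB cap, so the space available to it comes out as zero. That would match the shape of the "You currently have 0.00 MB available" notices, though whether it is the actual cause on these rigs is not established.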


The above works fine for me.

Greets
Felix
ID: 64517 · Report as offensive     Reply Quote
klepel

Joined: 9 Oct 04
Posts: 82
Credit: 70,017,155
RAC: 3,100
Message 64520 - Posted: 28 Sep 2021, 23:56:27 UTC - in response to Message 64516.  

134	World Community Grid	9/28/2021 10:15:20 AM	Message from server: OpenPandemics - COVID 19 needs 200.00MB more disk space.  You currently have 0.00 MB available and it needs 200.00 MB.	
135	World Community Grid	9/28/2021 10:15:20 AM	Message from server: Mapping Cancer Markers needs 500.00MB more disk space.  You currently have 0.00 MB available and it needs 500.00 MB.

You do not have enough disk space available. You might reconfigure your BOINC options as indicated by the post above.
And you are right: if a climateprediction.net WU crashes, the zip files will not be cleaned up afterwards. So you will have a lot of worthless information eating up your disk space. Someone mentioned it before, there are two solutions:
reset the project,
or clean up all the crashed WUs by hand in the corresponding project folder.
Yes, it is not easy or hassle-free to run climateprediction.net, but in return it is fun and will help future generations!
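
Before cleaning up by hand, it helps to see which folders are large and how long ago they were last touched. Here is a minimal sketch that only reports and deletes nothing. The name `folder_report` is made up, and the project path in the usage example is an assumption about a typical Linux install that may differ on your machine.

```python
import os
import time


def folder_report(project_dir):
    """Return [(name, size in bytes, days since last change)], largest first.

    Only the immediate subfolders of project_dir are listed; nothing is deleted.
    """
    now = time.time()
    rows = []
    with os.scandir(project_dir) as entries:
        for entry in entries:
            if not entry.is_dir(follow_symlinks=False):
                continue
            size = 0
            newest = entry.stat(follow_symlinks=False).st_mtime
            for root, _dirs, files in os.walk(entry.path):
                for name in files:
                    try:
                        info = os.lstat(os.path.join(root, name))
                    except OSError:
                        continue  # file vanished while scanning
                    size += info.st_size
                    newest = max(newest, info.st_mtime)
            rows.append((entry.name, size, (now - newest) / 86400.0))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


if __name__ == "__main__":
    # Assumed location; adjust to wherever your BOINC data directory lives.
    path = "/var/lib/boinc/projects/climateprediction.net"
    if os.path.isdir(path):
        for name, size, age_days in folder_report(path):
            print(f"{size / 1024 ** 2:10.1f} MiB  {age_days:6.1f} days  {name}")
```

A folder that has not changed for many days while tasks are running is only a candidate for removal. Check it against the task names shown in the BOINC Manager before deleting anything.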
ID: 64520 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 64526 - Posted: 29 Sep 2021, 13:35:33 UTC - in response to Message 64514.  

If only running ten, then memory wouldn't be a problem. With hyperthreading enabled and all 36 cores in use it would be, though a bigger problem then would be the massive slowdown from running out of cache memory and spilling over to RAM. I am at a loss to explain the disk space being misreported, though: as an experiment I have run 16 of the N216 tasks at once with broadly similar disk space available and had no issues.

With respect to crashed tasks not cleaning up after themselves, this seems to me much less of a problem than it used to be and it only rarely seems to happen to me now whereas it used to happen frequently. That may be because outside of testing branch, I only get the very occasional crashed task these days.
ID: 64526 · Report as offensive     Reply Quote
Aurum
Joined: 15 Jul 17
Posts: 99
Credit: 18,701,746
RAC: 318
Message 64530 - Posted: 30 Sep 2021, 11:38:23 UTC - in response to Message 64516.  

I just doubled the RAM in Rig-47 from 32 GB to 64 GB. Rig-47: https://www.cpdn.org/show_host_detail.php?hostid=1521364
Note that this computer has Free Disk Space = 69.52 GB and yet WCG won't run because it says it needs 500 MB more.
134	World Community Grid	9/28/2021 10:15:20 AM	Message from server: OpenPandemics - COVID 19 needs 200.00MB more disk space.  You currently have 0.00 MB available and it needs 200.00 MB.	
135	World Community Grid	9/28/2021 10:15:20 AM	Message from server: Mapping Cancer Markers needs 500.00MB more disk space.  You currently have 0.00 MB available and it needs 500.00 MB.
aurum@Rig-47:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:           62Gi        15Gi        25Gi       126Mi        21Gi        46Gi
Swap:          15Gi          0B        15Gi
It seems to me that the generators of these two sets of readings of disk space consumption are reporting about two different machines.
Well then it seems you're wrong.
ID: 64530 · Report as offensive     Reply Quote
Aurum
Joined: 15 Jul 17
Posts: 99
Credit: 18,701,746
RAC: 318
Message 64531 - Posted: 30 Sep 2021, 11:40:42 UTC - in response to Message 64517.  

I just doubled the RAM in Rig-47 from 32 GB to 64 GB. Rig-47: https://www.cpdn.org/show_host_detail.php?hostid=1521364
Note that this computer has Free Disk Space = 69.52 GB and yet WCG won't run because it says it needs 500 MB more.
134	World Community Grid	9/28/2021 10:15:20 AM	Message from server: OpenPandemics - COVID 19 needs 200.00MB more disk space.  You currently have 0.00 MB available and it needs 200.00 MB.	
135	World Community Grid	9/28/2021 10:15:20 AM	Message from server: Mapping Cancer Markers needs 500.00MB more disk space.  You currently have 0.00 MB available and it needs 500.00 MB.
aurum@Rig-47:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:           62Gi        15Gi        25Gi       126Mi        21Gi        46Gi
Swap:          15Gi          0B        15Gi
Rig-47 has 10 hadam4h WUs and one hadam4 WU running. Hyperthreading is disabled so this i9-10980XE CPU has 18 CPU cores available.
I think the problem lies entirely with the code for hadam.


This message shows you exactly where the problem is. If you are right and you have this much free space on your rigs, then you have configured BOINC wrong in terms of how much disk space it is allowed to use.

Please double-check the following options, and test how changes affect your message logs:

(tickbox) Use no more than XX GB;
(tickbox) Leave at least XX GB free;
(tickbox not ticked) Use no more than XX% of total


The above works fine for me.

Greets
Felix
I already said they're set to 95%. I presented that in different ways. Exactly what should these settings be???
ID: 64531 · Report as offensive     Reply Quote
Aurum
Joined: 15 Jul 17
Posts: 99
Credit: 18,701,746
RAC: 318
Message 64532 - Posted: 30 Sep 2021, 11:55:17 UTC - in response to Message 64520.  
Last modified: 30 Sep 2021, 12:14:22 UTC

134	World Community Grid	9/28/2021 10:15:20 AM	Message from server: OpenPandemics - COVID 19 needs 200.00MB more disk space.  You currently have 0.00 MB available and it needs 200.00 MB.	
135	World Community Grid	9/28/2021 10:15:20 AM	Message from server: Mapping Cancer Markers needs 500.00MB more disk space.  You currently have 0.00 MB available and it needs 500.00 MB.
You do not have enough disk space available. You might reconfigure your BOINC options as indicated by the post above.
And you are right: if a climateprediction.net WU crashes, the zip files will not be cleaned up afterwards. So you will have a lot of worthless information eating up your disk space. Someone mentioned it before, there are two solutions:
reset the project,
or clean up all the crashed WUs by hand in the corresponding project folder.
Yes, it is not easy or hassle-free to run climateprediction.net, but in return it is fun and will help future generations!
One computer submitted its last WU and I detached it. My available disk space increased by 60 GB!!! Yes, this is a buggy program that cannot clean up behind itself.
I've tried this command several times and it has no effect whatsoever:
/etc/init.d/boinc-client restart
I assume what you mean is the Reset in BOINCmgr:
Reset project: Stop the project's current work, if any, and start from scratch. Use this if BOINC has become stuck for some reason. Any unreported results and tasks in progress will be discarded.
I don't want to wipe out all current work having committed weeks to it. Better to let them finish and Detach CP.
As for doing it by hand how can I know for sure what is garbage and what is still in use???
I agree climate studies should be first or second priority.
EDIT: I looked in the CP project folder and it's obvious there's many outdated folders. I deleted them and it started playing nice with others again.
ID: 64532 · Report as offensive     Reply Quote
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 64538 - Posted: 30 Sep 2021, 14:32:16 UTC - in response to Message 64530.  

It seems to me that the generators of these two sets of readings of disk space consumption are reporting about two different machines.

Well then it seems you're wrong.


I see I am making the same mistake that some others may be making: confusing RAM and DISK SPACE. The UNIX and Linux free command deals with RAM and the df command deals with disk space. And with the large amount of RAM some of us have, which exceeds the amount of disk space that used to be common (20 years ago), it is an easy mistake to make. I remember how thrilled I was when someone came out with a 2 GByte hard drive.

So if the free command says you have 46 GBytes of RAM available, and whatever source you get your error messages from says it needs 200 MBytes or 500 MBytes of disk space, then what I said previously, while not wrong, is nonsense: a clear case of comparing apples to oranges.

So we should be comparing apples to apples, and the Linux command for that is df.
On my machine, for example, I run the Boinc client in a dedicated partition. So my total disk space looks enormous to me, but the amount dedicated to Boinc is much more modest. The boinc partition is 118 GBytes, and I am using only 23% of it. There are four N216 CPDN tasks up in there and some WCG, Rosetta, and Universe ones as well.
$ df -h
Filesystem             Size  Used Avail Use% Mounted on

/dev/mapper/rhel-root   50G  9.7G   41G  20% /
/dev/mapper/rhel-home  410G   19G  392G   5% /home
/dev/nvme0n1p2        1014M  349M  666M  35% /boot
/dev/nvme0n1p1         599M   17M  583M   3% /boot/efi
/dev/sdb3              118G   25G   87G  23% /var/lib/boinc <---<<<
/dev/sdb1               92G   60M   87G   1% /D3P1
/dev/sda2               98G   19G   79G  20% /home/jeandavid8/Sound
/dev/sda1              489G  204G  285G  42% /home/jeandavid8/Videos
/dev/sdb2               92G   13G   75G  15% /D3P2
/dev/sdb7              196G  1.8G  194G   1% /D3P7
/dev/sdb6              196G  2.4G  193G   2% /D3P6
/dev/sdb5              387G   16G  371G   5% /home/margaret

ID: 64538 · Report as offensive     Reply Quote
klepel

Joined: 9 Oct 04
Posts: 82
Credit: 70,017,155
RAC: 3,100
Message 64544 - Posted: 30 Sep 2021, 17:34:28 UTC - in response to Message 64526.  

With respect to crashed tasks not cleaning up after themselves, this seems to me much less of a problem than it used to be and it only rarely seems to happen to me now whereas it used to happen frequently. That may be because outside of testing branch, I only get the very occasional crashed task these days.
I would not say so: if Aurum has the problem with disk space and, as it seems to me, a lot of crashed models, these crashed models will eat up a lot of space quite fast! Since I got WSL working on two Win10 computers, I have had to clean up crashed WUs by hand every time Win10 decided to restart my computer without my intervention after the monthly update cycle. And I remember well going around my Linux computers with a written list of the WU numbers reported on climateprediction.net as crashed, and cleaning them off the hard disk so new ones could be downloaded again. This is the reason I do not run climateprediction.net on my server.
ID: 64544 · Report as offensive     Reply Quote
klepel

Joined: 9 Oct 04
Posts: 82
Credit: 70,017,155
RAC: 3,100
Message 64545 - Posted: 30 Sep 2021, 17:38:36 UTC - in response to Message 64532.  

EDIT: I looked in the CP project folder and it's obvious there's many outdated folders. I deleted them and it started playing nice with others again.
Great that it worked! Unfortunately, this is a bit of housekeeping one has to do for climateprediction.net when there is no disk space left.
ID: 64545 · Report as offensive     Reply Quote

©2024 cpdn.org