climateprediction.net (CPDN) home page
Thread 'OpenIFS Discussion'

Message boards : Number crunching : OpenIFS Discussion

Previous · 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8 . . . 32 · Next

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 66690 - Posted: 1 Dec 2022, 11:57:48 UTC - in response to Message 66687.  
Last modified: 1 Dec 2022, 12:45:24 UTC

I think what's happened is CPDN have not set the memory usage limit high enough and, depending on what process does what when, it can blow past the limit. It's a working theory I want them to test.
I have had one of those failures,

  06:37:27 STEP 2509 H=2509:00 +CPU= 16.937
  06:37:44 STEP 2510 H=2510:00 +CPU= 16.658
  06:38:11 STEP 2511 H=2511:00 +CPU= 24.246
Suspend request received from the BOINC client, suspending the child process
double free or corruption (out) 
So, if I am understanding you correctly, CPDN specify a maximum amount of memory for the application to use and you get problems when (if) it goes above that? It clearly isn't lack of memory on the host machine, as this one has 32GB and only one task was running at the time.
ID: 66690
Glenn Carver
Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 66691 - Posted: 1 Dec 2022, 13:42:01 UTC - in response to Message 66690.  

So, if I am understanding you correctly, CPDN specify a maximum amount of memory for the application to use and you get problems when (if) it goes above that? It clearly isn't lack of memory on the host machine, as this one has 32GB and only one task was running at the time.
It may be an incorrect assumption, but I am presuming that the client either puts the processes in a 'sandbox' (chroot to a slot & restricts memory), or it kills the process because it exceeds the memory limit - though then I would expect to see a message in the log that it's done that. Anyway, the limits are wrong, so let's try the low-hanging fruit first before we try other things on volunteer machines. I'll be doing more testing on my machine in the meantime.
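For illustration, here's a minimal sketch of the kind of memory check I mean, assuming Linux /proc is available. This is hypothetical, not the actual BOINC client or wrapper code:

```python
import os

def rss_kb(pid):
    """Resident set size of a process in kB, read from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # VmRSS is reported in kB
    return None

def over_limit(pid, limit_mb):
    """True if the process's resident memory exceeds limit_mb megabytes."""
    rss = rss_kb(pid)
    return rss is not None and rss > limit_mb * 1024

# A wrapper could poll its child's PID like this; here we just check our
# own process against a deliberately generous 16 GB limit.
print(over_limit(os.getpid(), 16 * 1024))
```

The interesting question is what the client does when the check trips: log and suspend, or kill outright with nothing in stderr.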

A lot of the failed tasks with the double free happened right after the trickle files were zipped, so I was beginning to suspect that was a clue, but further checking showed it's not as common as I thought. Unfortunately there is not enough information coming back from the controlling wrapper when something goes wrong - something else I hope they will change.

I think I've also convinced Andy that we need to do a more realistic batch test on the dev site: a much bigger batch with more volunteers, to test it as it would go out on the production site. We could have picked up these problems earlier had that been done. Typically the dev test site is used to check the server config, as the model has already been tested to run standalone. To that end, the active users on this forum might get an invite soon to join the dev site.
ID: 66691
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,826,880
RAC: 4,822
Message 66692 - Posted: 1 Dec 2022, 13:58:55 UTC - in response to Message 66691.  

What many projects do is to create special short-running tasks for evaluation on their test sites. These would exercise all the major loops in the code, but cover a shorter time simulation. That way, you would start to see the results more quickly, and you might capture the totality of stderr within the 64 KB limit. Would there be any scope for that here?
ID: 66692
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66693 - Posted: 1 Dec 2022, 14:02:15 UTC - in response to Message 66691.  

Typically the dev test site is used to check the server config as the model has already been tested to run standalone. To that end, the active users on this forum might get an invite soon to join the dev site.


Would that invite be in our Inbox? Or some other way?

I assume those invited would be given instructions on how to participate.
ID: 66693
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 66694 - Posted: 1 Dec 2022, 14:49:40 UTC - in response to Message 66693.  

Typically the dev test site is used to check the server config as the model has already been tested to run standalone. To that end, the active users on this forum might get an invite soon to join the dev site.


Would that invite be in our Inbox? Or some other way?

I assume those invited would be given instructions on how to participate.

Yes, it would come via your inbox, with instructions on how to join the dev site. It will show up as another project, cpdn_boinc, once anyone invited has joined.
ID: 66694
Glenn Carver
Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 66695 - Posted: 1 Dec 2022, 15:24:27 UTC - in response to Message 66692.  
Last modified: 1 Dec 2022, 15:27:30 UTC

What many projects do is to create special short-running tasks for evaluation on their test sites. These would exercise all the major loops in the code, but cover a shorter time simulation. That way, you would start to see the results more quickly, and you might capture the totality of stderr within the 64 KB limit. Would there be any scope for that here?
Absolutely. No need to run for the full 3 months. I'm more interested in capturing the way volunteers run the tasks on their machine (stuffed to the limit in some cases from what I've read!). I think that's the problem, we haven't tested at the scale we're running on the production site, so the first batch effectively becomes that test.
ID: 66695
Glenn Carver
Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 66696 - Posted: 1 Dec 2022, 15:34:05 UTC
Last modified: 1 Dec 2022, 15:35:52 UTC

Assume we get instructions...
It works much the same as climateprediction.net: you get tasks and credit as usual, but they may not always work.
ID: 66696
Steven
Joined: 28 Jun 14
Posts: 4
Credit: 8,570,955
RAC: 6
Message 66698 - Posted: 1 Dec 2022, 17:10:58 UTC
Last modified: 1 Dec 2022, 17:21:24 UTC

I'm getting all sorts of errors here. I've been trying to budget 8GB of RAM per OpenIFS workunit.

This workunit ran to the end and then aborted? Did BOINC crash? I was running two at a time on this system with 16GB of RAM.
https://www.cpdn.org/result.php?resultid=22247140

<message>
Process still present 5 min after writing finish file; aborting</message>


This one failed with an upload error. Running one at a time since it only has 8GB of RAM.
https://www.cpdn.org/result.php?resultid=22246386

<message>
upload failure: <file_xfer_error>
  <file_name>oifs_43r3_ps_1304_2021050100_123_946_12164393_0_r264053712_122.zip</file_name>
  <error_code>-161 (not found)</error_code>
</file_xfer_error>
</message>


This one failed with code 9. Same machine as previous, but this one may have been running more than one for a while before I noticed the OpenIFS tasks were being sent out.
https://www.cpdn.org/result.php?resultid=22247027

<message>
process exited with code 9 (0x9, -247)</message>

double free or corruption (out)


This one ran for 15 hours and somehow has no output file?
https://www.cpdn.org/result.php?resultid=22245680

Same machine as previous, ran to the end and then had an upload failure.
https://www.cpdn.org/result.php?resultid=22245367

<message>
upload failure: <file_xfer_error>
  <file_name>oifs_43r3_ps_0334_2021050100_123_945_12163423_0_r1586639697_122.zip</file_name>
  <error_code>-161 (not found)</error_code>
</file_xfer_error>
</message>


Meanwhile, this old computer has been running two at a time with no trouble.
https://www.cpdn.org/show_host_detail.php?hostid=1526772

These are all on ethernet, sharing a switch. I don't think I'm running into bandwidth issues. Checked a couple of the machines for disk usage: BOINC has 100GB to play with, with ~90GB free.

Quick edit: Another one just failed. Received this morning on a machine with 8GB of RAM. Running just one workunit. Ran for about 5 hours before failing. "Trickle up message pending" in BOINC manager. Hasn't been reported to the server yet. No output file in the folder, but there was this progress file, if it helps:

https://www.cpdn.org/result.php?resultid=22248845
<?xml version="1.0" encoding="utf-8"?>
<running_values>
  <last_cpu_time>19262.910000</last_cpu_time>
  <upload_file_number>44</upload_file_number>
  <last_iter>1059</last_iter>
  <last_upload>3801600</last_upload>
  <model_completed>0</model_completed>
</running_values>
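In case it helps anyone else poke at these, the progress file is plain XML and can be read with a few lines of Python. The field meanings below are my guesses from the names, not anything documented:

```python
import xml.etree.ElementTree as ET

# The running_values file quoted above, verbatim.
progress_xml = """<?xml version="1.0" encoding="utf-8"?>
<running_values>
  <last_cpu_time>19262.910000</last_cpu_time>
  <upload_file_number>44</upload_file_number>
  <last_iter>1059</last_iter>
  <last_upload>3801600</last_upload>
  <model_completed>0</model_completed>
</running_values>"""

root = ET.fromstring(progress_xml)
cpu_hours = float(root.findtext("last_cpu_time")) / 3600  # seconds -> hours
step = int(root.findtext("last_iter"))
done = root.findtext("model_completed") == "1"
print(f"stopped at step {step} after {cpu_hours:.1f} CPU hours, completed: {done}")
```

For this task it works out to about 5.4 CPU hours at step 1059, which matches the roughly 5 hours it ran before failing.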
ID: 66698
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 66699 - Posted: 1 Dec 2022, 18:29:16 UTC - in response to Message 66698.  

The
double free or corruption (out)
error is a problem with the model or the wrapper code. The failed uploads happen because the model has crashed before producing the final upload(s), so they are missing when BOINC tries to send them to the server once the task has finished. These errors are happening on machines of known good pedigree. Glenn is on the case, and we may be doing a larger than normal batch over on the testing site to try and resolve this.
ID: 66699
wateroakley
Joined: 6 Aug 04
Posts: 195
Credit: 28,619,443
RAC: 6,667
Message 66700 - Posted: 1 Dec 2022, 19:48:21 UTC

To get the openIFS tasks to run on this VirtualBox ubuntu host: https://www.cpdn.org/results.php?hostid=1512045, I increased the ubuntu VM disc partition from 40GB to 100GB (gparted).

After five early openIFS successes, the subsequent tasks have crashed with one error or another. The event log has reported a lot of 'file absent' records, with no obvious local reason that I can see.
This afternoon I've increased the memory allocated to the ubuntu VM from 28GB to 32GB and reduced the cpus (tasks running) from six to four.

On a positive note, after the reboot all the suspended tasks started up successfully!
ID: 66700
klepel
Joined: 9 Oct 04
Posts: 82
Credit: 70,017,155
RAC: 3,100
Message 66701 - Posted: 1 Dec 2022, 20:34:15 UTC - in response to Message 66689.  

Update.
After meeting yesterday with CPDN, the disk and memory requirements for these tasks need revising: memory requirement up & disk down. What was not taken into account when setting the memory was the additional amount required by the wrapper code & all the boinc functions it uses (such as zipping). Hopefully this will eliminate some of the memory errors.
The plan is to put out a repeat of the first batch with corrected limits to check how it performs before sending out the rest of this experiment.
Sure this will help!
On trickles, agree these longer (3 month) runs are producing too many trickle files which I'll adjust. However, I looked at the output filesize per output instance and it's reasonable and at the lower limit of what the scientist needs. I am reluctant to change it.
Understood! Hopefully fewer trickles will help make the uploads smoother.
Question for ADSL people: knowing your bottleneck is network, are you happy just reducing the no. of tasks running concurrently? What's your sustainable data-flow rate you would be happy with (give me a number to work with).
I do not have any problem reducing the number of tasks running on my computers to fit my ADSL bandwidth. I have to remind myself that I offer the scientists a certain amount of compute power, but they have to accept the offer - there are a lot of other worthy BOINC projects! (Hopefully I will remind myself of this when I go shopping for computer parts for climateprediction.net that I do not need for my personal daily computing requirements!) However, I am still concerned about how many climateprediction.net participants read the forums, and how many users out there have installed BOINC and attached to climateprediction.net but never check their machines. You might end up with a lot of OpenIFS results piling up on computers with slow internet connections, wasting energy and resources and never helping science. I will send you a PM with my ADSL speed, so you have a number for the WUs I am likely to contribute each day. It is not much!
ID: 66701
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66703 - Posted: 1 Dec 2022, 20:54:56 UTC - in response to Message 66701.  

However, I am still concerned, how many climateprediction.net participants are reading the Forums and how many users are out there, who have installed BOINC and attached to climateprediction.net, but never check their machines. You might end up, with a lot of OpenIFS results piling up on computers with slow internet connections, wasting energy and resources and never help science.


I notice, with favor, that these Oifs work units come with about a one-month expiry date instead of the one-year deadline the traditional work units come with.
ID: 66703
wateroakley
Joined: 6 Aug 04
Posts: 195
Credit: 28,619,443
RAC: 6,667
Message 66704 - Posted: 1 Dec 2022, 21:10:10 UTC - in response to Message 66701.  

ADSL people: knowing your bottleneck is network, are you happy just reducing the no. of tasks running concurrently? What's your sustainable data-flow rate you would be happy with (give me a number to work with).
The broadband uplink here is 12Mbps and the downlink 40Mbps, and it's pretty consistent at those speeds. The event log showed that uploads from six concurrent tasks over the past few days are taking 12-15 seconds each, which is not giving the network a headache. A single new task download (3 jf_c... files) takes less than two minutes.
ID: 66704
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 66705 - Posted: 1 Dec 2022, 21:38:11 UTC

On broadband here with a max upload speed of about 100KB/s, it can just about keep up with 2 tasks running at a time. Not a problem for me: if they do build up I can just cut down to 1 task until it catches up. For the smaller numbers of tasks in testing runs, I sometimes tether my phone to get four times the throughput, but with a 15GB/month limit I won't be doing that for main site batches of these!
ID: 66705
xii5ku
Joined: 27 Mar 21
Posts: 79
Credit: 78,322,658
RAC: 1,085
Message 66706 - Posted: 1 Dec 2022, 21:51:31 UTC - in response to Message 66689.  
Last modified: 1 Dec 2022, 21:58:24 UTC

Glenn Carver wrote:
On trickles, agree these longer (3 month) runs are producing too many trickle files which I'll adjust. However, I looked at the output filesize per output instance and it's reasonable and at the lower limit of what the scientist needs. I am reluctant to change it.

Question for ADSL people: knowing your bottleneck is network, are you happy just reducing the no. of tasks running concurrently? What's your sustainable data-flow rate you would be happy with (give me a number to work with).
If the scientist needs 1.72 GB result data per workunit, then that's what I'll be happily producing. After all, it's the data which the scientist desires, not the CPU cycles which produce them.

Going by the task properties of those in the first 3000s batch:

    Based on the CPUs, RAM and disk space which I have available, I could produce >330 results/day = 570 GB/day.
    If I switched on some older gear and let the flat become uncomfortably warm, it'd be >460 results/day = 790 GB/day.

    But based on my Internet uplink, I can deliver at most 8 Mbit/s = 84 GB/day in steady state, minus outages. (That's at most 48 results/day, minus outages.)

I have no trouble partitioning my currently running computers such that I produce ≤48 r/d for CPDN and have the rest of computer capacity busy at other projects.

If everyone had a narrow uplink like mine (there are lesser links which they still call "broadband" here), and if you want >42,000 results done by Christmas 2022, you would obviously need >36 people like me, assuming they manage to nearly saturate their uplinks the whole time. server_status.php claims there were 95 users at OpenIFS 43r3 Perturbed Surface during the last 24 hours, so that looks good. OTOH it seems only 1000 of the first 3000 tasks are done yet, so that does not look as good.

Obviously, from the comments in this thread, we have folks here who are bottlenecked by CPU, others by RAM, and others by transfer bandwidth.
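The uplink arithmetic above is easy to sanity-check (binary GB assumed, figures rounded):

```python
# Sanity check of the uplink figures quoted above.
uplink_mbit = 8        # sustained uplink, Mbit/s
result_gb = 1.72       # result data per task, GB

mb_per_s = uplink_mbit / 8                    # 8 Mbit/s = 1 MB/s
gb_per_day = mb_per_s * 86400 / 1024          # MB/day -> GiB/day
results_per_day = gb_per_day / result_gb
print(f"{gb_per_day:.0f} GB/day, about {results_per_day:.0f} results/day")
```

That comes out to ~84 GB/day and ~49 results/day before outages, in line with the "at most 48 results/day, minus outages" figure above.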

ID: 66706
xii5ku
Joined: 27 Mar 21
Posts: 79
Credit: 78,322,658
RAC: 1,085
Message 66708 - Posted: 1 Dec 2022, 22:34:27 UTC
Last modified: 1 Dec 2022, 22:40:21 UTC

I'll say one more thing about vboxwrapper, and then I'll stay away from this subject: If you look at LHC@home, the highest producers there run the native Linux ATLAS application, not any of the virtualized applications. And that's no coincidence. One of the reasons is a lot lower RAM requirement by the native application. (Also check out the "average computing" column at apps.php. Or anybody who ever took part in a contest at LHC@home knows very well that the native application is the way to go if computing throughput is of any concern at all.)
ID: 66708
SolarSyonyk
Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 66709 - Posted: 1 Dec 2022, 22:39:15 UTC
Last modified: 1 Dec 2022, 22:40:52 UTC

I've had several OOMs, despite BOINC being set to use 90% of system RAM, on dedicated hardware (well, a VM dedicated to BOINC tasks in the winter).

https://www.cpdn.org/result.php?resultid=22247094 is one - the rest look identical, just a child task exited.

It's a hex-core VM with 12GB RAM. I would have assumed that BOINC would limit processes based on memory use, but that doesn't seem to be happening. I'll pull a couple of cores out of it for future units, but however the math is working out, OpenIFS tasks are OOMing easily.
ID: 66709
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66710 - Posted: 1 Dec 2022, 22:53:52 UTC - in response to Message 66706.  

Obviously, from the comments in this thread, we have folks here who are bottlenecked by CPU, others by RAM, and others by transfer bandwidth.


I think I am bottlenecked by the size of my processor cache. My CPU is pretty fast, I have 64 GBytes of RAM, and I get 75 Megabits per second on my fiber-optic Internet connection. My other computer is a little one running Windows 10; it spends most of its life doing BOINC, but not OpenIFS.

Memory 	62.28 GB
Cache 	16896 KB
Swap space 	15.62 GB
Total disk space 	488.04 GB
Free Disk Space 	477.76 GB
Measured floating point speed 	6.13 billion ops/sec
Measured integer speed 	26.09 billion ops/sec
Average upload rate 	4480.76 KB/sec
Average download rate 	45235.53 KB/sec


(I think the data rates reported by BOINC-CPDN are really kilobits per second, not kilobytes per second.)

Right now I am running three Oifs tasks, three Rosetta tasks, three WCG tasks, two Einstein tasks, and one (single-processor) MilkyWay task. The perf run below shows my machine's cache-miss ratio, so the hit ratio would be 50.45%. Not too bad, but not wonderful either. Other than the 12 BOINC processes, the machine is not doing much else at the moment (aside from my typing into Firefox, which is otherwise idle).

# perf stat -aB -e cache-references,cache-misses
 Performance counter stats for 'system wide':

    20,626,539,435      cache-references                                            
    10,220,773,584      cache-misses              #   49.552 % of all cache refs    

      61.867007273 seconds time elapsed
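For anyone repeating this, the ratio falls straight out of the two counters:

```python
# Recompute the miss/hit ratio from the two perf counters above.
refs = 20_626_539_435     # cache-references
misses = 10_220_773_584   # cache-misses

miss_pct = 100 * misses / refs
hit_pct = 100 - miss_pct
print(f"miss {miss_pct:.3f} %, hit {hit_pct:.2f} %")
```

This reproduces perf's 49.552 % miss figure and the 50.45 % hit ratio quoted above.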

ID: 66710
Vato
Joined: 4 Oct 19
Posts: 15
Credit: 9,174,915
RAC: 3,722
Message 66711 - Posted: 2 Dec 2022, 1:21:07 UTC - in response to Message 66691.  

I will happily run tests on the dev server if invited.
So far I have 9 tasks that appear to run well - no credit though.
ID: 66711
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 66712 - Posted: 2 Dec 2022, 5:39:51 UTC

It's a hex-core VM with 12GB RAM - I would have assumed that BOINC would limit processes based on memory use, but that doesn't seem to be happening. I'll pull a couple cores out of it for future units, but however the math is happening, OpenIFS tasks are OOMing easily.
During the early stages of these on the testing site, I was able to run 4 tasks on a box that had only 8GB RAM. That laptop is now dead, but it did it, albeit at a massive hit to speed because it was swapping to disk every time two or more tasks peaked in memory usage at the same time. There wasn't much of a hit when only running 2 at once. Sadly, though, the client will not limit how many tasks it runs based on memory. I have the whole of the laptop's SSD boot disk that I salvaged set up as swap on this machine, so 128GB, but I am not trying to run 16 or even 8 tasks at once because connection bandwidth is my bottleneck.
ID: 66712


©2024 cpdn.org