Bringing down the error rate


Message boards : Number crunching : Bringing down the error rate

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 50975 - Posted: 12 Dec 2014, 17:36:13 UTC

It may be a little premature to post this, but from what I can see of the error rates on various machines, it looks to be timely. Basically I run an i5-3550 under Win7 64-bit, and had my share of errors a couple of years ago. But things are looking up this time around, running CPDN on all four cores (this CPU does not have hyper-threading):
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1330836

The only real difference is that I am running BOINC (7.4.35 x64) not as a service this time, whereas I was running it as a service last time. Also my Z77 motherboard has internal Intel graphics, but it is disabled in the BIOS, the same as last time as I recall. Someone mentioned that it could be a problem.

Otherwise, I don't reboot this machine much and let it crunch 24/7, and it is on a UPS and it is very stable, but that was also true a couple of years ago. It seems to be the BOINC installation change that is the main difference, though of course the applications may be different too. I hope this is of some assistance in bringing down the error rate.


Jean-David Beyer
Joined: 5 Aug 04
Posts: 1056
Credit: 16,521,771
RAC: 1,278
Message 50981 - Posted: 16 Dec 2014, 16:32:05 UTC - in response to Message 50975.  

Here is a work unit I recently completed: Workunit 9279597. Now what is interesting about it, and about some others, is that I completed it successfully on my machine: Computer 1256552

CPU type: GenuineIntel Intel(R) Xeon(R) CPU E5-2603 0 @ 1.80GHz [Family 6 Model 45 Stepping 7]
Number of processors: 4
Coprocessors: ---
Operating system: Linux 2.6.32-504.1.3.el6.x86_64
BOINC client version: 7.2.33

There were two other attempts to run this work unit, and both failed with "Error while computing."

The same kind of thing happened with Workunit 8651207 and Workunit 9272257.

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 50982 - Posted: 16 Dec 2014, 17:01:56 UTC - in response to Message 50981.  
Last modified: 16 Dec 2014, 17:07:33 UTC

I don't see any rhyme or reason to it. A statistician will probably need to work it out, but it is fairly clear that some things work better than others. The variability of the work units themselves tends to mask other effects, of course. But before I leave the subject, I would note that I run BOINC on a ramdisk, though that is for the high write volume of CEP2 and probably has no effect on the error rate in CPDN. Also, it is a dedicated machine and I don't use an anti-virus on it. Anti-virus programs sometimes cause problems on some projects, though I don't know specifically about CPDN.

I have to leave the above machine to WCG, since the scheduler does not like resuming CEP2 after a series of long CPDN jobs. I had that problem a couple of years ago too, and had hoped that the changes to BOINC would fix it. But I will try CPDN next on an i7-4790 that does not run CEP2, whenever they have more work; maybe next year?

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,063,325
RAC: 928
Message 50985 - Posted: 16 Dec 2014, 19:34:45 UTC - in response to Message 50982.  

CPDN has had its share of problems with antivirus programs over the years. Just recently one of the major AV programs told several crunchers that CPDN was a virus. That's why we recommend that all crunchers exempt the BOINC Data Folder from AV scans.

About a year ago I ran (on a lark) the Panda free Cloud Scanner and it deleted everything in the BOINC folder in the ProgramData folder. Fortunately, I had made a backup first on a backup drive, so only a few hours were lost.


Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 50996 - Posted: 19 Dec 2014, 0:31:56 UTC - in response to Message 50985.  
Last modified: 19 Dec 2014, 0:32:32 UTC

I am long past paranoid on the subject of anti-viruses (the viruses don't scare me). Exclusions don't always help, since they often exclude just the scans and not the real-time protection, which is the usual culprit. And I like cloud AVs like Panda, except that a couple of weeks ago I had a crash due to a bad drive (not even the OS drive), and Panda refused to uninstall. It was so deeply embedded in the network stack that it wouldn't leave. It was not the first time that I had to re-install the OS due to an AV problem, but I am hoping to make it the last.

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 51034 - Posted: 25 Dec 2014, 16:36:02 UTC
Last modified: 25 Dec 2014, 16:37:24 UTC

It seems that things are a bit different on the UK Met Office HadAM3P-HadRM3P Australia New Zealand v6.10 work units. There it may possibly help to run BOINC as a service. At least on my i7-4770 and i7-4790 machines I have completed 5 successfully when running BOINC as a non-service, but 3 others errored out after 20 seconds or less (Win7 64-bit). However, of those 3, one was completed successfully on an "Anonymous" machine (also Win7 64-bit), which presumably means that BOINC was being run as a service. So I think you need a checklist to know what is best for each project.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 51038 - Posted: 25 Dec 2014, 19:54:09 UTC - in response to Message 51034.  

Hi Jim

"Anonymous" machines are those that have been hidden by their owners.
There's no visible means to identify service from non-service computers.


Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 51039 - Posted: 25 Dec 2014, 20:09:33 UTC - in response to Message 51038.  

OK, I was not sure. Otherwise, the other machine seemed comparable to mine, except that it was running a 24-core Xeon. It has done quite well on most of the projects and I don't see much of a pattern otherwise.
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=486411&offset=0&show_names=0&state=0

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 51169 - Posted: 9 Jan 2015, 21:14:02 UTC
Last modified: 9 Jan 2015, 21:41:28 UTC

It should be mentioned that this i5-3550 machine has a wired Ethernet connection. Another machine I have (an i7-4770) has a PCIe wireless adapter, and it has picked up several download errors. But the download errors are only on the HadCM3 shorts, which I don't run on the i5-3550 machine, so I realize that it could be other factors and not the wireless adapter, and it has never been a problem on other BOINC projects (WCG, GPUGrid, Einstein) or Folding.

Also, I think the ramdisk that I place the BOINC data folder in probably helps after all, based on a comment about disk errors that I saw on this forum recently, but I can't find it again. With a ramdisk, all the BOINC writes and reads go to main memory rather than the disk drive, so they are much faster and not tripped up by any write contention on the disk. I use Primo Ramdisk on the i5-3550, and have sized it at 11 GB for the WCG/CEP work, but it could be smaller for CPDN only; about 1 GB per CPU core should be enough, if you clean out any failed work units that leave files behind. I have also used DataRAM Ramdisk, and it works OK. (In Linux, I think you can set up a virtual disk for that purpose.)
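
As a rough illustration of the sizing rule above, here is a small Python sketch. The function name and the `headroom_gb` allowance are made up for this example, and the 1 GB per core figure is only the estimate quoted here, not a project requirement.

```python
def ramdisk_size_gb(cpu_cores, gb_per_core=1.0, headroom_gb=0.0):
    """Estimate a ramdisk size for a CPDN-only BOINC data folder.

    gb_per_core=1.0 is the rule of thumb from the post above.
    headroom_gb is a hypothetical extra allowance for files left
    behind by failed work units that have not been cleaned out yet.
    """
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    if gb_per_core <= 0 or headroom_gb < 0:
        raise ValueError("sizes must be positive")
    return cpu_cores * gb_per_core + headroom_gb


if __name__ == "__main__":
    # The i5-3550 has four cores and no hyper-threading.
    print(f"Suggested ramdisk size: {ramdisk_size_gb(4):.1f} GB")
```

A machine that also runs other projects from the same ramdisk would need more than this, which is presumably why the 11 GB figure for WCG/CEP is so much larger.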

Otherwise, I see a lot of apparent restarts on a lot of machines, which probably cause problems. This is not a project for laptops in my opinion.

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 51380 - Posted: 9 Feb 2015, 21:31:07 UTC - in response to Message 51169.  
Last modified: 9 Feb 2015, 21:32:38 UTC

I thought I would follow up on the use of a ramdisk a bit. I set up my i7-4770 machine for the HadCM3 shorts, since they tend to have a high error rate, to see what difference a Ramdisk would make. This machine is also very stable and not overclocked, with a back-up power supply and automatic shutdown via USB in case of a power failure (which did not happen during the test period and rarely happens anyway). It has a Samsung 840 EVO SSD, which is itself quite fast at writes, and in addition I installed the Samsung Magician 4.5 "Rapid Mode" cache. That provides an additional 1 GB of DRAM buffer, which smooths out the writes that actually reach the SSD to a very benign level, while providing a very high speed cache for subsequent reads for data still in the cache. This is about as good as it gets for modern SSDs.

The results are here: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1350074

However, up until 5 February 2015, I was still getting a few errors; not a large number, but I thought it could be reduced further since not all of them were model errors, and a few were completed successfully on other machines. So I set up a DATARam ramdisk on that machine also, 11 GB in size. Then I removed BOINC and re-installed it so that the BOINC program folder was on the ramdisk volume. (You may have to hack into the registry to remove all references to the BOINC program folder to get it installed on the right one if it was previously pointing to your OS drive.)

It can be seen that for all work units after 5 February, the error rate is much lower, and apparently consists of only a model error. So I think that if you want to invest in more DRAM memory for a ramdisk, and a good uninterruptible power supply, that is the way to go for low error rates.
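
For anyone who wants to put a number on a before/after comparison like this one, a minimal Python sketch follows. The result list is placeholder data invented for the example, not the actual results from the host page, and the function names are hypothetical.

```python
from datetime import date


def error_rate(results):
    """Return the fraction of results whose outcome is 'error' (0.0 if empty)."""
    if not results:
        return 0.0
    errors = sum(1 for _, outcome in results if outcome == "error")
    return errors / len(results)


def split_by_cutoff(results, cutoff):
    """Split (date, outcome) pairs into those up to the cutoff and those after it."""
    before = [r for r in results if r[0] <= cutoff]
    after = [r for r in results if r[0] > cutoff]
    return before, after


if __name__ == "__main__":
    # Placeholder data only: NOT the real results from the host page.
    sample = [
        (date(2015, 1, 20), "success"),
        (date(2015, 1, 28), "error"),
        (date(2015, 2, 2), "success"),
        (date(2015, 2, 3), "error"),
        (date(2015, 2, 6), "success"),
        (date(2015, 2, 8), "success"),
    ]
    before, after = split_by_cutoff(sample, date(2015, 2, 5))
    print(f"up to cutoff: {error_rate(before):.0%} of {len(before)} failed")
    print(f"after cutoff: {error_rate(after):.0%} of {len(after)} failed")
```

With only a handful of work units on each side the two rates can differ by chance, so a comparison like this is suggestive rather than conclusive.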

Good luck.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4342
Credit: 16,501,246
RAC: 5,648
Message 51383 - Posted: 10 Feb 2015, 8:24:29 UTC

Interesting thoughts, Jim. I have stuff running on my new laptop. I think it may only have the one slot for RAM though, so I am stuck at 8 GB. I will check at some point, and if I can increase the RAM to 16 GB I might do so and move my BOINC data over to it.

The one thing a laptop does have, if connected to the mains all the time, is a built-in UPS; mine is currently set to suspend to RAM after 15 minutes when battery life gets down to 20% or less. My desktop machine is getting old and is already maxed out with 4 GB of RAM for its two cores, so it is stuck unless someone makes a PCI ramdisk I could shove in it. I am old enough to remember sticking an ISA card in a machine to get an extra two or four MB of RAM a few years ago!

Alan K
Joined: 22 Feb 06
Posts: 484
Credit: 29,602,471
RAC: 2,231
Message 51384 - Posted: 10 Feb 2015, 9:52:53 UTC - in response to Message 51383.  

Aaah! The days of 512 KB RAM and, if you were lucky, a 30 MB HDD!

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 51385 - Posted: 10 Feb 2015, 10:01:27 UTC - in response to Message 51383.  
Last modified: 10 Feb 2015, 10:01:40 UTC

Dave,

Thanks for correcting that to the BOINC DATA folder, which should be on the ramdisk, not the PROGRAM folder, which can remain on the OS drive. I looked into expansion cards for disk purposes a few years ago and did not find anything really useful. If they exist, they cost more than they are worth. But you do point out one good attribute of laptops, which is that they inherently have a battery backup. I have not tried the suspension route (or hibernation now with Win7), but it should work. However, my machines run 24/7 so it is not really needed.

There are still other problems before you get to disk drive issues. As noted above, someone before me found that the Intel graphics adapter causes errors on the HadCM3 shorts. I ran a small test to confirm that, with two pretty much identical Ivy Bridge machines (i7-3770) with the same BIOSTAR Z77 motherboards and ramdisks and backup power supplies as usual.

The first one had the internal Intel graphics adapter enabled, and errored out all the shorts, but the others were ok: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1354337
The second machine had the internal Intel graphics adapter disabled in the BIOS, and did not error out a single one of the 10 shorts: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1354336

I think that pretty much covers the subject for what I have found. Maybe a moderator will condense this into a short "Best Practices" list and make a sticky out of it, along with whatever other suggestions people find. I expect these points apply generally to Windows (all Win7 64-bit for me), but probably not for Linux and other OS's, which write to the disk quite differently I believe, among other things.
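
Whether a split like the two-machine comparison above could be down to chance can be checked roughly with Fisher's exact test. The sketch below is plain Python. The count of 10 errors out of 10 on the first machine is an assumption for illustration (only the second machine's 10 shorts are stated above), the function name is hypothetical, and the test ignores the variability between work units.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are machines, columns are (errors, successes).
    """
    row1, row2 = a + b, c + d
    col1 = a + c  # total number of errors across both machines
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability that x of the errors fall on machine 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Add up every possible table that is at least as unlikely as the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


if __name__ == "__main__":
    # ASSUMED counts: machine 1 = 10 errors, 0 successes (not stated in the post);
    # machine 2 = 0 errors, 10 successes (as stated in the post).
    p = fisher_exact_two_sided(10, 0, 0, 10)
    print(f"two-sided p-value: {p:.2e}")
```

A small p-value only says the two machines differ; it cannot say whether the graphics adapter itself or something tied to it is responsible.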

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4342
Credit: 16,501,246
RAC: 5,648
Message 51386 - Posted: 10 Feb 2015, 10:16:58 UTC - in response to Message 51384.  

512K!
When I learned to program on an Eliott22,000 as a first year I was only allowed a max of 16K!

Cue Monty Python sketch!

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1081
Credit: 6,982,827
RAC: 3,789
Message 51387 - Posted: 10 Feb 2015, 11:30:37 UTC - in response to Message 51385.  

[Jim1348 wrote:] As noted above, someone before me found that the Intel graphics adapter causes errors on the HadCM3 shorts. I ran a small test to confirm that, with two pretty much identical Ivy Bridge machines (i7-3770) with the same BIOSTAR Z77 motherboards and ramdisks and backup power supplies as usual.

The first one had the internal Intel graphics adapter enabled, and errored out all the shorts, but the others were ok: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1354337
The second machine had the internal Intel graphics adapter disabled in the BIOS, and did not error out a single one of the 10 shorts: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1354336 ....

The Intel graphics adapter angle on the HADCM3S failures is new to me: do you remember where else it was mentioned? Hard to see how a modern operating system would allow that sort of interaction, but your test looks like a good one ...

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 51391 - Posted: 10 Feb 2015, 16:45:43 UTC - in response to Message 51387.  
Last modified: 10 Feb 2015, 17:01:39 UTC

[Iain Inglis wrote:] The Intel graphics adapter angle on the HADCM3S failures is new to me: do you remember where else it was mentioned? Hard to see how a modern operating system would allow that sort of interaction, but your test looks like a good one ...

Good question, but I can't find it. I thought it was in one of those extended discussions about the HadCM3 shorts, or at least failures in general. It was someone who had looked at a lot of the failures and noticed a pattern. It was so unusual that it stuck in my mind. But since it was so strange, maybe it was another BOINC project entirely and I misremembered it? At any rate, it seems to be the case, but if anyone wants to check it out further and prove or disprove it, they are welcome to do so as far as I am concerned.

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 51393 - Posted: 11 Feb 2015, 10:24:30 UTC

I neglected the link for my i7-4790 machine, which is on the HadCM3 shorts also, now running one at a time and also using a DATARam ramdisk for the BOINC DATA folder.
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1351652

The i7-4770 noted previously is on an Asrock Z87 motherboard, and the i7-4790 on an Asrock Z97 MB. I also have a couple of GTX 750 Ti's on each board for various BOINC GPU projects, which are each supported by a CPU core. The Asrock boards do not allow both the PCIe graphics cards and the internal Intel graphics to be in use at the same time, and so the possibility of problems with the internal graphics is eliminated.

The BIOSTAR Z77 motherboards for my i7-3770 machines noted above did allow both to be in use simultaneously if set that way in the MB BIOS. Maybe the errors noted above are due to the interaction between the GPU cards and the internal graphics? It is an area for further study by someone, but not by me unfortunately at the moment.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4342
Credit: 16,501,246
RAC: 5,648
Message 51394 - Posted: 11 Feb 2015, 10:31:47 UTC

I have just been looking, and it is very simple to set up a ramdisk with Linux as well. At some point I might try setting up a 4 GB ramdisk on the laptop, which would drop the RAM available to everything else to 1 GB/core.
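
One common way to do this on Linux is a tmpfs mount, which keeps files in RAM. The Python sketch below only builds the /etc/fstab line and redoes the RAM arithmetic. The mount point /mnt/boinc-ram and the function names are hypothetical, and the core count of 4 is inferred from the 1 GB/core figure rather than stated. Bear in mind that a tmpfs is emptied at reboot, so the BOINC data would have to be copied back to disk before shutting down.

```python
def tmpfs_fstab_line(mount_point, size_gb):
    """Build an /etc/fstab entry for a tmpfs (RAM-backed) filesystem."""
    if size_gb < 1:
        raise ValueError("size_gb must be at least 1")
    # Fields: device, mount point, filesystem type, options, dump, fsck pass.
    return f"tmpfs {mount_point} tmpfs size={size_gb}g,mode=0755 0 0"


def ram_left_per_core_gb(total_ram_gb, ramdisk_gb, cores):
    """RAM left for everything else, per core, if the ramdisk were completely full."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return (total_ram_gb - ramdisk_gb) / cores


if __name__ == "__main__":
    # /mnt/boinc-ram is a made-up mount point for this example.
    print(tmpfs_fstab_line("/mnt/boinc-ram", 4))
    # 8 GB laptop, 4 GB ramdisk, 4 cores (core count is an inference).
    print(ram_left_per_core_gb(8, 4, 4), "GB per core")
```

The size given to tmpfs is an upper limit; memory is only used as files are actually written, so the per-core figure is the worst case.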

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 51395 - Posted: 11 Feb 2015, 11:20:10 UTC
Last modified: 11 Feb 2015, 11:32:22 UTC

Dave,

For Windows users, there is one more possibility that I should mention which in some cases might be more efficient than a ramdisk, and that is a disk cache. Romex Software sells PrimoCache, which can be set to cache writes to the disk in main memory. (You can cache reads also, but that just wastes memory space for our purposes). If you set the cache size and write latency large enough (it can go up to infinite time), then you get the same effect as a ramdisk. I have used it myself; they have a long free test period. There are advantages and disadvantages, but a cache will send all the writes to main memory first until it is full, and you don't have to specify a different installation folder for the BOINC DATA as you do with a ramdisk, and so it is easier to set up. Depending on what other programs you are using, you might be able to get away with less memory (or not).
http://www.romexsoftware.com/en-us/primo-cache/index.html

On Linux, I think there is something comparable that you can set up in the OS itself. As I recall Linux caches writes anyway, and you just tweak up the write latency to some suitably large value (maybe an hour or more). The more the write-delay, the more memory it will take of course, but you might be able to arrive at a happy medium.
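
The kernel settings usually involved here are vm.dirty_expire_centisecs (how old unwritten data may get before it is flushed) and vm.dirty_writeback_centisecs (how often the flusher threads wake up), both in hundredths of a second. Here is a small Python sketch that turns a delay in seconds into sysctl lines. The function name and the one-minute wake-up default are assumptions for illustration, and anything still sitting in the cache is lost on a power failure.

```python
def writeback_sysctl_lines(delay_seconds, wakeup_seconds=60):
    """Return sysctl.conf-style lines that delay write-back by delay_seconds.

    Both kernel settings are expressed in centiseconds (1/100 s).
    wakeup_seconds=60 is an arbitrary choice for this example.
    """
    if delay_seconds <= 0 or wakeup_seconds <= 0:
        raise ValueError("delays must be positive")
    return [
        f"vm.dirty_expire_centisecs = {delay_seconds * 100}",
        f"vm.dirty_writeback_centisecs = {wakeup_seconds * 100}",
    ]


if __name__ == "__main__":
    # One hour, as floated in the post above.
    for line in writeback_sysctl_lines(3600):
        print(line)
```

The kernel will still start writing earlier if the amount of unwritten data passes the vm.dirty_background_ratio threshold, and an application that explicitly syncs its files bypasses the delay, so this is not a full substitute for a ramdisk.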

Have fun.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4342
Credit: 16,501,246
RAC: 5,648
Message 51396 - Posted: 11 Feb 2015, 13:54:37 UTC - in response to Message 51395.  

Good pointers, Jim.

To some extent you can use more RAM for caching by playing with the swappiness. This means you don't even need to play about with settings for individual disks. On my laptop, the OS is on an SSD so loading programs is very fast anyway.
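
On Linux the current swappiness value can be read from /proc/sys/vm/swappiness. A minimal Python sketch for checking it before changing anything follows; the function names are hypothetical, and the sketch only reads the value and does not modify it.

```python
from pathlib import Path

SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")


def parse_swappiness(text):
    """Parse the contents of the swappiness file into an integer."""
    value = int(text.strip())
    if value < 0:
        raise ValueError("swappiness cannot be negative")
    return value


def current_swappiness(path=SWAPPINESS_PATH):
    """Return the current swappiness, or None if the file cannot be read."""
    try:
        return parse_swappiness(path.read_text())
    except OSError:
        # Not Linux, or /proc is not available.
        return None


if __name__ == "__main__":
    value = current_swappiness()
    if value is None:
        print("swappiness not available on this system")
    else:
        print(f"vm.swappiness = {value}")
```

Changing the value is normally done with sysctl as root; what setting works best will depend on how much RAM the models and the rest of the system need.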

©2024 climateprediction.net