Computational Error exit status 193 (0xc1) various Linux computers

Jonathan Brier Joined: 9 Dec 05 Posts: 3 Credit: 710,088 RAC: 0	Message 54467 - Posted: 9 Jul 2016, 20:04:59 UTC
Looking over my past workunits, I'm seeing that a majority of my devices are exiting with a computation error, exit status 193 (0xc1). It appears to be memory related: http://boincfaq.mundayweb.com/index.php?view=238
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1256213
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1401337
The virtualLHC project recently published their breakdown of computational errors at http://lhcathome2.cern.ch/vLHCathome/forum_thread.php?id=1846 which was quite informative about the problems they were encountering and tackling, and somewhat explained what they were. Could we get such a breakdown for climateprediction.net, to see how pervasive the various computational errors are across tasks or computer types?

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 54469 - Posted: 10 Jul 2016, 2:19:08 UTC - in response to Message 54467.
The problem with this is that the researchers are no longer at Oxford. They're climate physicists from all over the planet. The Oxford people are the IT people who look after the servers and connections. None of these has a reason for compiling a list of errors, or how many there are of each error. If a model runs to completion, good. If not, "why not" can be checked, and another one added to the next batch.
Most of the problems can be divided into 2 groups: 1) People who look at the results regularly, and 2) Those who just join and then forget about it.
The 2 main problems are: 1) Those who run 64 bit Linux and don't know that they also need 32 bit libraries. And a sub-group of these who only need one more lib, but don't check to see this. 2) Windows users who let MS update their computer whenever MS wants to, meaning a re-boot while models are running.
In all of the above, there's also failing hardware, incorrect permissions, running out of disk space due to lots of failed models taking up HD space, and not enough RAM.
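
The first of the two main problems above, a 64-bit Linux host that cannot run a 32-bit application, can be checked for roughly as follows. This is a minimal sketch and not CPDN's or BOINC's own check. It relies on two facts about ELF binaries on x86 Linux: the fifth byte of the file header (EI_CLASS) is 1 for 32-bit and 2 for 64-bit, and 32-bit glibc programs are normally started through the loader at `/lib/ld-linux.so.2`. The function names and the loader path constant are assumptions made for illustration; a distribution may place the loader elsewhere, and the application may still need further 32-bit libraries beyond the loader.

```python
import os

ELF_MAGIC = b"\x7fELF"
# EI_CLASS byte (offset 4 of the ELF header): 1 = 32-bit, 2 = 64-bit.
ELF_CLASSES = {1: 32, 2: 64}

# Assumed location of the 32-bit glibc loader on x86 Linux; may differ by distribution.
LOADER_32BIT = "/lib/ld-linux.so.2"


def elf_bits(header):
    """Return 32 or 64 for the first bytes of an ELF file, or None if not ELF."""
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        return None
    return ELF_CLASSES.get(header[4])


def file_bits(path):
    """Read just enough of a file to classify it as 32-bit or 64-bit ELF."""
    with open(path, "rb") as f:
        return elf_bits(f.read(5))


def has_32bit_loader(loader_path=LOADER_32BIT):
    """True if the 32-bit loader is present, a necessary (not sufficient) condition."""
    return os.path.exists(loader_path)
```

For example, `file_bits("/path/to/app")` returning 32 on a host where `has_32bit_loader()` is False suggests the 32-bit compatibility libraries have not been installed. The path here is a placeholder, not an actual CPDN file name.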

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4345 Credit: 16,528,667 RAC: 5,893	Message 54470 - Posted: 10 Jul 2016, 7:18:54 UTC
Worth running memtest for several hours to exclude a problem with one of your memory modules. Also, under Linux, I find it is not unusual for tasks to fail if the computer is turned off for any reason, even when I do follow the advice of suspending computation for a few minutes before exiting BOINC. I must admit I find it harder to remember to re-enable computation for tasks individually, one at a time, allowing a few minutes between restarting each one, though re-starting the lot at once doesn't seem to affect my machines running BOINC under WINE.

Jonathan Brier Joined: 9 Dec 05 Posts: 3 Credit: 710,088 RAC: 0	Message 54474 - Posted: 11 Jul 2016, 3:44:14 UTC - in response to Message 54469.
[quote]Worth running memtest for several hours to exclude a problem with one of your memory modules.[/quote]
Both computers passed memtest many times over. One is a new workstation that was stress tested for hours, and it only has issues with climateprediction.net, which I'm chalking up to the app's error handling and robustness.
[quote]The problem with this is that the researchers are no longer at Oxford. They're climate physicists from all over the planet. The Oxford people are the IT people who look after the servers and connections. None of these has a reason for compiling a list of errors, or how many there are of each error. If a model runs to completion, good. If not, "why not" can be checked, and another one added to the next batch.[/quote]
The physical location of researchers should not be an issue, as the Internet allows distributed work on software. In the interest of getting the science done efficiently and using volunteer resources ethically, someone should be monitoring the errors and caring about them so that needed fixes get implemented. Knowing which errors are occurring at what rate helps direct the time investment for the largest return in additional computing. You don't know their impact without measuring them. Additionally, people should care because these are volunteer resources, and software is not a static thing, especially when dealing with heterogeneous environments such as those BOINC was designed for. Disregard for volunteers' electricity and for the efficient use of their hardware will breed ill will toward the project no matter the science.
[quote]Most of the problems can be divided into 2 groups: 1) People who look at the results regularly, and 2) Those who just join and then forget about it.[/quote]
Neither group should have to worry about results failing. That is on those running the project: they should make sure their software operates correctly and is robust enough to use the donated resources efficiently with the least amount of waste, or the researchers should not be recruiting the general public. BOINC projects are designed to be installed and left alone. Anything less and the project is not mature enough for public participation.
[quote]The 2 main problems are: 1) Those who run 64 bit Linux and don't know that they also need 32 bit libraries. And a sub-group of these who only need one more lib, but don't check to see this.[/quote]
There should be zero expectation that participants install extra libraries. Instead, the project should detect their absence and either not send work units to those computers or provide the necessary libraries locally. There should not be a constant stream of errors and wasted volunteer resources because the project sends work units to computers that do not have sufficient environments. If providing the libraries locally is not possible, then the project should run in a virtual machine to have full control of the environment. Regarding the 32 bit libraries needed on a 64 bit Linux machine, there is insufficient documentation for this, and it needs to be moved to a more visible location than the sticky in the Linux section of the forums; the join instructions would be one additional place to note it. BOINC notifications would be an appropriate starting point for notifying computers that are missing the libraries, to help bring them into compliance before an automated means can be implemented. Participants shouldn't have to sift through the entire thread to find libraries that may need to be installed.
[quote]2) Windows users who let MS update their computer whenever MS wants to, meaning a re-boot while models are running. In all of the above, there's also failing hardware, incorrect permissions, running out of disk space due to lots of failed models taking up HD space, and not enough RAM.[/quote]
The whole point of checkpointing is to resume after unexpected interruptions. The app should recognise an invalid exit and attempt to resume from the last checkpoint. I'm well aware of the issues that can arise, given that I've watched BOINC evolve from day one. I expect climateprediction.net to have a more robust approach to project maintenance given how long they have been using BOINC, not just a pretty website and a highly valuable scientific project that is perfectly suited to engaging the public.
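
The checkpoint-and-resume behaviour described above can be illustrated with a toy example. This is a sketch of the general technique only, not the CPDN application or the BOINC API: the JSON checkpoint file, the function names, and the `stop_after` parameter (which simulates an unexpected interruption) are all assumptions made for illustration. The checkpoint is written to a temporary file and then renamed into place, so that an interruption during the write does not leave a half-written checkpoint behind.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: write a temp file, then rename it over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if there is no usable checkpoint."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def run(path, total_steps, checkpoint_every=10, stop_after=None):
    """Run a toy 'model', resuming from the last checkpoint if one exists.

    stop_after simulates an unexpected interruption after that many steps
    in this call; work done since the last checkpoint is then lost.
    """
    state = load_checkpoint(path) or {"step": 0, "value": 0}
    done_this_call = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_call >= stop_after:
            return state  # simulated interruption, no final checkpoint
        state["value"] += state["step"]
        state["step"] += 1
        done_this_call += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Calling `run` a second time with the same path picks up from the last saved step instead of starting from zero, which is the behaviour being asked for after a reboot or an invalid exit.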

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4345 Credit: 16,528,667 RAC: 5,893	Message 54478 - Posted: 11 Jul 2016, 8:16:08 UTC
[quote]Regarding the 32 bit libraries needed on a 64 bit Linux machine, there is insufficient documentation for this[/quote]
The problem with providing adequate documentation for this is that it varies according to distribution and often changes with each iteration of a distribution. It has been suggested that CPDN keep the required libraries in an area where they can be downloaded and copied to the appropriate location, along with instructions for doing so. I would support this, along with instructions on the joining page. It still wouldn't address the issue for the complete set-and-forget crowd, but it would for those who are able and willing to make a bit more effort.
[quote]Neither group should have to worry about results failing. That is on those running the project: they should make sure their software operates correctly and is robust enough to use the donated resources efficiently with the least amount of waste, or the researchers should not be recruiting the general public. BOINC projects are designed to be installed and left alone. Anything less and the project is not mature enough for public participation.[/quote]
I am inclined to agree with you on this. I know that sometimes the scientists do release batches of work that are not really ready for the main site. Occasionally batches get pulled from the main site for this reason. However, it is not unusual for one or two users to have a problem that others cannot replicate. I guess this is where some automated system for collating reasons why tasks fail would come in useful: it would show whether some among the set-and-forget crowd were also experiencing the problem, making an accurate diagnosis more likely.
Unfortunately, in the absence of a very significant donation of cash to Oxford that would enable extra resources in terms of hardware and staff to be devoted to the project, I can't see it happening.
All of the above is a personal view as one who has been crunching for CPDN since near the beginning, though a lost email account necessitated a change in user a few years back.
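
The kind of automated collation discussed in this thread, counting which errors occur at what rate and on which computer types, could start from something as small as the sketch below. It is illustrative only: the input format, a list of (platform, exit status) pairs with 0 meaning success, and the function name are assumptions, and the project's real result records are not shown anywhere in this thread.

```python
from collections import Counter


def tally_errors(results):
    """Count failures per (platform, exit status) and their share of all tasks.

    results: iterable of (platform, exit_status) pairs; exit_status 0 is success.
    Returns rows of (platform, exit_status, count, share), most common first.
    """
    results = list(results)
    total = len(results)
    failures = Counter((p, s) for p, s in results if s != 0)
    return [
        (platform, status, count, count / total)
        for (platform, status), count in failures.most_common()
    ]
```

Given a hypothetical batch of eight tasks in which three failed on 64-bit Linux with exit status 193, the first row returned would be that platform and status with a count of 3 and a share of 0.375.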