Trying to get tasks to not crash Linux client, now not receiving tasks

Pilot_51
Joined: 19 Sep 17
Posts: 9
Credit: 41,402
RAC: 14
Message 56919 - Posted: 22 Sep 2017, 20:18:23 UTC
Last modified: 22 Sep 2017, 20:29:08 UTC

I recently started using BOINC again (previously used it in 2003-2004, other programs until 2013) to contribute to climate modeling. Unfortunately, I've been having a lot of trouble getting CPDN tasks to work properly on my main PC running Linux Mint 18.2.

Here's the task list.

I installed the BOINC client and manager from the Mint/Ubuntu repository and moved /var/lib/boinc-client to another partition with much more space, making sure to change the BOINC_DIR line in /etc/default/boinc-client. That seemed to work with up to 8 CPDN tasks running, but in the morning I found that the boinc client had crashed and would crash immediately after starting it again.

This was what appeared in syslog when it crashed initially:

Sep 19 05:29:38 mark-main systemd[1]: boinc-client.service: Main process exited, code=exited, status=193/n/a
Sep 19 05:29:40 mark-main systemd[1]: boinc-client.service: Unit entered failed state.
Sep 19 05:29:40 mark-main systemd[1]: boinc-client.service: Failed with result 'exit-code'.

The exact same errors occurred each time I tried restarting the client.
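
For anyone wanting to pull the same kind of entries out of their own log, here is a minimal sketch. It assumes a systemd-based distribution whose messages end up in a plain-text log file; the function name `boinc_exit_lines` is made up for illustration and is not part of BOINC.

```shell
# Hypothetical helper: print log entries in which the boinc-client unit
# exited or entered a failed state. The log file is passed as an argument,
# so nothing is assumed about where the log lives.
boinc_exit_lines() {
    grep -E 'boinc-client\.service: (Main process exited|Unit entered failed state|Failed with result)' "$1"
}

# Example (the path is an assumption; adjust for your distribution):
#   boinc_exit_lines /var/log/syslog
```

On systems that keep logs only in the journal, `journalctl -u boinc-client` should show the same unit messages without needing a log file.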

I tried modifying client_state.xml and deleting files to clear the problem task, but whatever I did didn't help and appeared to cause the remaining tasks to go into an error state at the next client start/crash. I then removed all references to the project I could find and moved the data directory back to /var/lib/boinc-client and reverted BOINC_DIR, thinking maybe it didn't like that I moved the directory. The client started and I started one task which appeared to get farther along, but that ran out of space as I was doing something else that used a lot of /tmp and caused a computation error in the task. I moved the /var/lib/boinc-client directory back to the other partition as before, but this time just used a symbolic link without changing BOINC_DIR. I also made sure to chown boinc:boinc on the moved directory.
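
The symlink approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `relocate_with_symlink` and the `/mnt/data` path are hypothetical, the client is assumed to be stopped first, and the `boinc:boinc` owner is the one mentioned in this post.

```shell
# Hypothetical helper: move a directory to a new location and leave a
# symbolic link behind at the old path. Refuses to run if the source is
# already a symlink or if the destination already exists.
relocate_with_symlink() {
    src=$1    # e.g. /var/lib/boinc-client
    dest=$2   # e.g. a directory on the larger partition (must not exist yet)
    [ -d "$src" ] && [ ! -L "$src" ] || { echo "source must be a real directory" >&2; return 1; }
    [ ! -e "$dest" ] || { echo "destination already exists" >&2; return 1; }
    mv "$src" "$dest" && ln -s "$dest" "$src"
}

# Assumed usage on a systemd system, run as root (/mnt/data is a made-up path):
#   systemctl stop boinc-client
#   relocate_with_symlink /var/lib/boinc-client /mnt/data/boinc-client
#   chown -R boinc:boinc /mnt/data/boinc-client
#   systemctl start boinc-client
```

With a symlink in place, BOINC_DIR in /etc/default/boinc-client can stay at its default value, which is the variant tried here.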

I started one task again, but again it reached around 10% and crashed the client. I found that it appeared to be trying to send a result around the same time, so I deleted just the result part from client_state.xml and that allowed the client to restart. However, even though just about every other reference to the task was automatically removed and I even reset the project, I wasn't receiving more tasks and the website kept the failed task as 'In Progress' until I removed and re-added the project. I was going to try suspending network activity to see if that would prevent it from crashing before completion, but CPDN hasn't sent any new tasks for about a day now. I just keep getting this in the log:
Fri 22 Sep 2017 03:16:29 PM EDT | climateprediction.net | Sending scheduler request: To fetch work.
Fri 22 Sep 2017 03:16:29 PM EDT | climateprediction.net | Requesting new tasks for CPU
Fri 22 Sep 2017 03:16:31 PM EDT | climateprediction.net | Scheduler request completed: got 0 new tasks
Fri 22 Sep 2017 03:16:31 PM EDT | climateprediction.net | No tasks sent


I can see in the server status that there are still plenty of unsent tasks in wah2, the same application that I was receiving before.

Because of these issues and since the request_delay (communication deferred) time is so long, I've started contributing to WCG to fill the time, but I'd really prefer my resources go toward helping us understand the climate. I have not had a single issue with WCG after about 100 tasks.

Fortunately, all is not lost for my CPDN efforts. I have an Intel NUC server running Debian that has so far been crunching without issue on 3 of its 4 cores, currently 23-39% between the tasks.

Any help in resolving this would be appreciated.

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 1670
Credit: 32,083,245
RAC: 31,083
Message 56921 - Posted: 22 Sep 2017, 20:55:15 UTC - in response to Message 56919.

Welcome back. Unfortunately it is a rough time for Linux and Mac users. The problem detailed in this thread https://www.cpdn.org/cpdnboinc/forum_thread.php?id=8474 is occurring with some tasks that have a lot of months in them. It is a combination of a boinc limitation/bug and cpdn task problem that affects both Mac and Linux on some tasks. And the result is the inability to get boinc to start back up and continue processing tasks.

My advice is to continue on with WCG or other projects. Hopefully some way around this problem can be found. I think the developers are going to do something next week. It might be deprecating Mac and Linux apps until the cpdn problem is found, or creating a win only app for the problem task sets. Hopefully we'll have some news up next week on what the path forward will be.

WB8ILI
Joined: 1 Sep 04
Posts: 98
Credit: 40,088,982
RAC: 54,475
Message 56922 - Posted: 22 Sep 2017, 21:52:20 UTC

Pilot_51

If you are interested in doing some experimentation, I might suggest you re-install BOINC completely using the default settings. I don't use Mint so I am no help there. But, I do use UBUNTU.

To eliminate the possibility that your move of the boinc-client directory caused other problems, I would suggest leaving everything in the default locations. I know you wrote that space was limited.

From my experience, installing from the UBUNTU repository is an excellent way to go.

I tried moving BOINC files under UBUNTU once and ran into so many problems that I threw in the towel. I don't remember the details.

If BOINC works with the "default" locations, you can try moving it and see if it works. If it doesn't work, you know why.

Pilot_51
Joined: 19 Sep 17
Posts: 9
Credit: 41,402
RAC: 14
Message 56929 - Posted: 23 Sep 2017, 8:49:55 UTC

I completely wiped and reinstalled boinc-client and kept the data folder in the default location, making a backup copy of the fresh directory just in case. I also managed to free up about 7GB of space by uninstalling some software I hadn't used in a while, giving BOINC 9GB to work with and a 1GB margin. Unfortunately, that didn't fix the 0 tasks issue.

geophi wrote:
> I think the developers are going to do something next week. It might be deprecating Mac and Linux apps until the cpdn problem is found, or creating a win only app for the problem task sets. Hopefully we'll have some news up next week on what the path forward will be.

That sounds very plausible and I hope the lack of tasks is intentional in an effort to prevent and ultimately fix the crash issue. Can any other Linux/Mac users confirm whether they've received new WUs since a day or two ago? I suppose it's possible the server just didn't like how all 17 tasks it sent this computer failed, 16 of which were abandoned.

For now, I'll stick with WCG and continue checking CPDN for tasks, as well as keeping an eye out for any news on the crash issue.

bernard_ivo
Joined: 18 Jul 13
Posts: 252
Credit: 5,903,045
RAC: 23
Message 56951 - Posted: 24 Sep 2017, 17:49:01 UTC - in response to Message 56929.
Last modified: 24 Sep 2017, 17:50:12 UTC

Hi Pilot_51,
I also haven't received Linux WUs in the last few days, and after some info exchange it seems very likely there are none in the hopper. In such cases I usually use WINE, and I haven't had any issues on my two 14.04 LTS machines (there are a few WINE-related threads). I recently launched an i7-4790 Ubuntu 16.04 LTS machine with 10GB of BOINC data space, and usage goes up to 5-7 GB, so I guess you will be fine. This time I set up a separate /var partition (during the Ubuntu install) as I did not want to move the CPDN data folder around after install. I do have it moved to another HDD on one of my Linux boxes, but finding the instructions for how to do it took a while, so I went for the /var partition.

Pilot_51
Joined: 19 Sep 17
Posts: 9
Credit: 41,402
RAC: 14
Message 56955 - Posted: 24 Sep 2017, 21:19:16 UTC - in response to Message 56951.

Yeah, I noticed fewer than 300 unsent WUs this morning and now it's 0, so now it's a wait for more to become available.

I know I could use WINE and do use it for the occasional Windows-only game, but I'd rather not make that compromise and reduce the importance of them making things work correctly on Linux. If there's one thing I like less than running Windows-only software in WINE, it's running cross-platform software in WINE because the native build is broken or buggy, so I'd either deal with the bugs or not use it at all. I know, I'm weird.

Once things get going again and I'm receiving WUs, I'll make sure it completes a task on the main partition and then see if simply moving the data dir breaks the next task. I'm still quite determined to find a stable solution that lets me store the CPDN data on another drive, though probably won't go as far as reinstalling the OS or changing the location of /var.

JIM
Joined: 31 Dec 07
Posts: 982
Credit: 14,320,108
RAC: 19,627
Message 56957 - Posted: 25 Sep 2017, 6:08:37 UTC

There is a new post in the news section on this topic.

Pilot_51
Joined: 19 Sep 17
Posts: 9
Credit: 41,402
RAC: 14
Message 56960 - Posted: 25 Sep 2017, 17:44:47 UTC - in response to Message 56957.

Thanks for the heads-up, that helps clear up what was going on.

Interestingly, all 3 tasks given to my Debian server are still going great, currently at 41%, 58%, and 70%.

It would appear that at least one bad batch was pnw25, since that is not running on my server and it was always running on my main system when the client crashed, including the very last task which was running alone. The second-to-last task that got further along and ran out of storage was cam25. All the earlier tasks were running alongside several others including two pnw25 tasks.

So, I think it's safe to say that the location of the data dir had nothing to do with the crashes, and I honestly don't know how it could have. Without knowing exactly what was causing the crash in pnw25, I doubt there was anything that could be done short of using WINE to prevent it from crashing. If I were to receive more WUs with what I know now, assuming the issue wasn't fixed, I'd just abort any pnw25 tasks.
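
Aborting only the pnw25 tasks could be scripted along these lines. This is a rough sketch, not a tested recipe: it assumes that `boinccmd --get_tasks` prints one `name:` line per task and that the batch label (here `pnw25`) appears in the task name. The function name `pnw25_task_names` and the `PROJECT_URL` variable are hypothetical.

```shell
# Hypothetical helper: read `boinccmd --get_tasks` output on stdin and
# print the names of tasks whose name contains "pnw25".
# Only lines whose first field is exactly "name:" are considered, so
# "WU name:" lines are skipped.
pnw25_task_names() {
    awk '$1 == "name:" && $2 ~ /pnw25/ { print $2 }'
}

# Assumed usage (PROJECT_URL must be the project URL exactly as the client
# reports it, for example in the output of `boinccmd --get_project_status`):
#   boinccmd --get_tasks | pnw25_task_names | while read -r task; do
#       boinccmd --task "$PROJECT_URL" "$task" abort
#   done
```

Printing the matching names first and checking them by eye before piping them into the abort loop is the safer order.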

Profile Dave Jackson
Joined: 15 May 09
Posts: 1789
Credit: 2,671,578
RAC: 898
Message 56971 - Posted: 26 Sep 2017, 8:02:46 UTC - in response to Message 56960.

Pilot_51 wrote:
> If I were to receive more WUs with what I know now, assuming the issue wasn't fixed, I'd just abort any pnw25 tasks.


The tasks that caused the crashes have been deprecated for both Linux and Macs until a fix can be found. This will mean fewer tasks for us, however. Work is still going on to try to identify the cause of the problem and resolve it, but there has been no recent update on how far this has got.

Pilot_51
Joined: 19 Sep 17
Posts: 9
Credit: 41,402
RAC: 14
Message 56993 - Posted: 28 Sep 2017, 21:20:54 UTC

Doh! A bit off topic, but I'll use this opportunity for a reminder. I lost one on my server because it didn't have libz.so.1. I think it happened at the very end as it was wrapping up. I made sure to install dependencies on my main system and forgot to do it on my server. I just did (lib32ncurses5 and lib32z1) and verified with ldd, so that should prevent the same thing occurring to the remaining two tasks with about 1.5 and 8 days remaining.

For anyone getting started, remember to check/install dependencies on all systems!
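
The ldd check mentioned above can be wrapped in a small filter. This is a minimal sketch, assuming the usual ldd output in which an unresolved library is reported as `name => not found`; the function name `ldd_missing` is made up for illustration.

```shell
# Hypothetical helper: read ldd output on stdin and print the names of
# shared libraries that could not be resolved.
ldd_missing() {
    awk '/=> not found/ { print $1 }'
}

# Assumed usage (the binary path is a placeholder for the project's
# application executable in the BOINC data directory):
#   ldd /path/to/app_binary | ldd_missing
```

An empty result means ldd resolved every library the binary links against. A line such as `libz.so.1` points at the missing piece, which lib32z1 provided in this case.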

Pilot_51
Joined: 19 Sep 17
Posts: 9
Credit: 41,402
RAC: 14
Message 57037 - Posted: 4 Oct 2017, 17:01:16 UTC - in response to Message 56993.

Unfortunately, the remaining two tasks failed with the same error. Once the second of the three tasks failed, I restarted boinc-client to reload everything in hopes of saving the last task, but it didn't work.

It would be great if BOINC checked dependencies before starting a task, displaying a warning if they aren't satisfied and requiring the user to resolve it before the task starts. It's a waste of resources to spend 15 days on a task that was doomed to fail from the beginning.

WB8ILI
Joined: 1 Sep 04
Posts: 98
Credit: 40,088,982
RAC: 54,475
Message 57048 - Posted: 5 Oct 2017, 16:00:48 UTC

Regarding missing libz.so.1 -

I had that error on one computer a few days ago (UBUNTU 16.04 LTS 64-bit). I forgot to check my notes when I installed BOINC. I was missing lib32z1.

The following libraries MIGHT have something to do with the missing libz.so.1 library. Depending on the UBUNTU version they may not be available. But, in any case, I have installed all of them (if available).

lib32z1
zlib1g
zlib1g:i386
lib64z1
lib64z1:i386
libx32z1
libzadc1

Does anyone think this was a dumb idea?

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 1670
Credit: 32,083,245
RAC: 31,083
Message 57051 - Posted: 5 Oct 2017, 17:31:42 UTC - in response to Message 57048.

WB8ILI wrote:
> Regarding missing libz.so.1 -
>
> I had that error on one computer a few days ago (UBUNTU 16.04 LTS 64-bit). I forgot to check my notes when I installed BOINC. I was missing lib32z1.
>
> The following libraries MIGHT have something to do with the missing libz.so.1 library. Depending on the UBUNTU version they may not be available. But, in any case, I have installed all of them (if available).
>
> lib32z1
> zlib1g
> zlib1g:i386
> lib64z1
> lib64z1:i386
> libx32z1
> libzadc1
>
> Does anyone think this was a dumb idea?

On any recent version of Ubuntu, I just run this and it takes care of everything.

sudo apt-get install lib32ncurses5 lib32z1 gcc-4.7-multilib


I'm sure it installs some items that aren't strictly necessary for getting cpdn running on 64 bit distributions, but it works.

Pilot_51
Joined: 19 Sep 17
Posts: 9
Credit: 41,402
RAC: 14
Message 57053 - Posted: 5 Oct 2017, 18:52:51 UTC - in response to Message 57051.

geophi wrote:
> On any recent version of Ubuntu, I just run this and it takes care of everything.
>
> sudo apt-get install lib32ncurses5 lib32z1 gcc-4.7-multilib
>
> I'm sure it installs some items that aren't strictly necessary for getting cpdn running on 64 bit distributions, but it works.


My server is Debian and gcc-4.7-multilib isn't available in the repo. I would think that if all dependencies are satisfied according to ldd, as accomplished by installing lib32z1 in this case, nothing more would be needed.

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 1670
Credit: 32,083,245
RAC: 31,083
Message 57055 - Posted: 5 Oct 2017, 19:39:35 UTC - in response to Message 57053.



Pilot_51 wrote:
> My server is Debian and gcc-4.7-multilib isn't available in the repo. I would think that if all dependencies are satisfied according to ldd, as accomplished by installing lib32z1 in this case, nothing more would be needed.

Indeed. That should be fine.

Venkatesh Srinivas
Joined: 7 May 17
Posts: 15
Credit: 450,081
RAC: 458
Message 57062 - Posted: 6 Oct 2017, 12:30:42 UTC

Any news on Linux/Mac tasks?

Dave Jackson
Joined: 15 May 09
Posts: 1789
Credit: 2,671,578
RAC: 898
Message 57063 - Posted: 6 Oct 2017, 12:53:54 UTC - in response to Message 57062.

Venkatesh Srinivas wrote:
> Any news on Linux/Mac tasks?


I would guess that at least until BOINC 7.8.3 becomes widespread or the issue crashing some of the WAH2 tasks is resolved, there won't be a lot. The last I got were two hadcm3 tasks that had already failed on one or two other computers and promptly failed on mine also.

Venkatesh Srinivas
Joined: 7 May 17
Posts: 15
Credit: 450,081
RAC: 458
Message 57251 - Posted: 29 Oct 2017, 5:30:19 UTC

Any news on Linux/Mac tasks?

(Do we have any idea how much compute capacity is idled by the lack of Linux/Mac workers? Hopefully not that much? Though the backlog of tasks seems pretty high now...)

Dave Jackson
Joined: 15 May 09
Posts: 1789
Credit: 2,671,578
RAC: 898
Message 57257 - Posted: 29 Oct 2017, 23:00:27 UTC - in response to Message 57251.

Venkatesh Srinivas wrote:
> Any news on Linux/Mac tasks?


Afraid not. At some point there will probably be some more hadcm3s tasks, but it is down to the researchers giving Oxford the work to send out.

Pilot_51
Joined: 19 Sep 17
Posts: 9
Credit: 41,402
RAC: 14
Message 57279 - Posted: 1 Nov 2017, 4:21:49 UTC

I received a couple HadCM3 tasks a few hours ago and they both failed with repeated segmentation violation crashes within a minute. It appears to be another batch that is failing on Linux and so far working fine on Windows and Mac.



Copyright © 2017 climateprediction.net