Thread 'Intel I7 Woes....No successful completion since April 2015'

Message boards : Number crunching : Intel I7 Woes....No successful completion since April 2015

Art Masson
Joined: 16 Oct 11
Posts: 254
Credit: 15,954,577
RAC: 0
Message 53700 - Posted: 20 Mar 2016, 16:50:28 UTC - in response to Message 53698.  

Results so far:

WU 10324402 -- Finished successfully. (I had run this as my first test, with all other projects suspended, and all CPDN jobs suspended except for this one WU). Wasting lots of processing capacity on my I7, but have to try to figure out what is causing all my CPDN jobs to fail.

WU 10235486 -- Failed and no longer visible on my machine. Completed 3 trickles but then somehow failed. CPDN still sees this job as "in process"..so the failure was not communicated somehow. (I was running this as the only active CPDN job running with all other CPDN jobs suspended, but had enabled all my other projects).

This seems typical of some failures where I now have many many WU's which CPDN sees as still "in process" but are no longer visible to me in BOINC as WUs waiting or in process. If this is happening to many users, then this could explain the large number of jobs that CPDN is waiting on and which are not in anyone's active queue until they time out for lack of communications much later.

WU 10327029 -- Currently in process. 6 trickles completed. This WU had failed on another user's machine after 3 trickles. (This is running again as the only WU processing on my machine. I want to make sure the successful completion of the first WU above as the only active WU was not a fluke!) If this finishes successfully, I'm going to try two CPDN WU's processing at the same time with all other projects suspended....to try to determine if it is multiple CPDN WUs somehow impacting each other or if the problem is caused by interactions with the other projects. More later...time will tell.
ID: 53700
Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 53701 - Posted: 20 Mar 2016, 17:12:28 UTC - in response to Message 53700.  
Last modified: 20 Mar 2016, 17:12:58 UTC

WU 10235486 -- Failed and no longer visible on my machine. Completed 3 trickles but then somehow failed. CPDN still sees this job as "in process"..so the failure was not communicated somehow. (I was running this as the only active CPDN job running with all other CPDN jobs suspended, but had enabled all my other projects).

This seems typical of some failures where I now have many many WU's which CPDN sees as still "in process" but are no longer visible to me in BOINC as WUs waiting or in process. If this is happening to many users, then this could explain the large number of jobs that CPDN is waiting on and which are not in anyone's active queue until they time out for lack of communications

I have never seen that. All my jobs in progress as shown on climateprediction.net correspond to what BOINC (actually BoincTasks) shows on my PC. Maybe there is a communication problem between you and CPDN, either on your machine or elsewhere.
ID: 53701
Art Masson
Joined: 16 Oct 11
Posts: 254
Credit: 15,954,577
RAC: 0
Message 53702 - Posted: 20 Mar 2016, 19:25:45 UTC - in response to Message 53701.  

I cannot explain it either. I suspect this is related to how the CPDN Work Units fail on my I7 machine (ID 1266353)...with no communications to confirm/report their failure. Right now on the CPDN web site, the project sees a total of 50 Work Units "in progress" on this machine, but looking at BOINC on this machine there are only 13 Work Units total -- one is processing, and the other 12 I have in suspended status so that only the single work unit is running. So, I have 37 WU's that CPDN believes are "in progress" that are (in fact) not in my BOINC queue.
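The mismatch described above (50 on the web site, 13 in BOINC, 37 unaccounted for) is a set difference between two task lists. A minimal Python sketch, assuming the two lists have been copied out by hand; the task names and the `find_ghosts` helper are hypothetical placeholders, not anything CPDN or BOINC provides:

```python
def find_ghosts(server_in_progress, local_tasks):
    """Return task names the server lists as in progress but the local client lacks."""
    return sorted(set(server_in_progress) - set(local_tasks))


# Hypothetical names standing in for the real task lists.
server_in_progress = [f"task_{i:02d}" for i in range(50)]  # as shown on the web site
local_tasks = server_in_progress[:13]                       # as shown in BOINC locally

ghosts = find_ghosts(server_in_progress, local_tasks)
print(len(ghosts))  # 37
```

Listing the names rather than just counting them makes it easier to check each one individually against the local logs.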

I have four other machines processing CPDN work units (handling all the projects I listed earlier). CPDN work units do not fail this way on the other machines, and there is good agreement on "in progress" WUs between what BOINC reports and the status on the CPDN web site.

Like I say, if this is happening to other users, this is probably not a good thing for overall processing of CPDN units...with the overall project servers waiting for WUs to "time out" when they could be reassigned to other users if this type of error was not occurring.

Would like to hear from someone from the project to discuss if this is worth pursuing further...right now I'm just trying to understand better when my CPDN WU's are failing on this machine.

Art Masson
St. Charles IL
ID: 53702
bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,837,810
RAC: 9,716
Message 53703 - Posted: 20 Mar 2016, 19:33:31 UTC - in response to Message 53702.  

Hi Art,
To release the ghost (in-progress-but-no-longer-on-your-PC) WU's for others to crunch, you need to detach (delete) the CPDN project from BOINC and then reattach it. However, you may want to wait until all your tasks have finished or failed, or you will lose them as well. As for your main cause of trouble, I can't be of any help, unfortunately.
ID: 53703
Art Masson
Joined: 16 Oct 11
Posts: 254
Credit: 15,954,577
RAC: 0
Message 53704 - Posted: 20 Mar 2016, 19:45:37 UTC - in response to Message 53703.  

Ah....Thanks Bernard IVO.....I'll do that when I get to a better point on understanding why/when my CPDN WUs are failing. Right now I don't want to do that and lose where I am on the work units in my queue and the one work unit I have processing while I try to understand what combination of running WU's causes these failures.
ID: 53704
Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 53705 - Posted: 20 Mar 2016, 19:46:23 UTC

I have had a lot of disk drive (SSD) problems recently, though they have not affected CPDN or any BOINC project. But if something is failing randomly, I do a CHKDSK and if that does not fix it, replace the drive. But an often overlooked and insidious cause of problems is the SATA cables. I have replaced several of those in the last few months, and even replaced the replacements. Some brands are better than others, but they can all fail. (Check the Event Viewer for NTFS errors, and IDE port errors. The symptoms vary.)
ID: 53705
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53706 - Posted: 20 Mar 2016, 19:57:31 UTC
Last modified: 20 Mar 2016, 20:07:20 UTC

Art

Unless you have evidence in your stdoutdae.txt file of those "extra" models being downloaded to your computer, then Bernard is right - those are "ghost" models, which you never had in the first place.

This happens in some cases when parts of the overall server structure get overloaded; one part allocates a task to a computer and tells another part to add it to the list of tasks to be downloaded to that computer, but the second part, for some reason, doesn't actually download it.
So that task/model name appears on your Account page list, but is never sent.
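
One way to look for that evidence is to search stdoutdae.txt for each task name from the Account page. A rough Python sketch, assuming only that a downloaded task's name would appear somewhere in the log text; the task names and the log line below are made up for the demo and do not reflect the real log format:

```python
import tempfile
from pathlib import Path


def tasks_mentioned_in_log(log_path, task_names):
    """Return the subset of task_names that appear anywhere in the log file."""
    text = Path(log_path).read_text(errors="replace")
    return {name for name in task_names if name in text}


# Demo with a throwaway file; point log_path at the real stdoutdae.txt instead.
wanted = ["hypothetical_task_A", "hypothetical_task_B"]
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "stdoutdae.txt"
    log.write_text("made-up line mentioning hypothetical_task_A\n")
    found = tasks_mentioned_in_log(log, wanted)

never_seen = set(wanted) - found
print(never_seen)  # names with no trace in the log, i.e. candidate ghosts
```

A name that never shows up in the log is only a candidate ghost: the log may have been rotated or truncated since the task was allocated.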

You can ignore them, or you can make the Tasks list look pretty by getting a new computer ID, which won't have those tasks associated with it.

edit
Actually, it's the green message that disappears, to be replaced by a black error message, perhaps "Disconnected".
ID: 53706
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53707 - Posted: 20 Mar 2016, 20:04:27 UTC

Another possible problem is a hairline crack in the motherboard. Perhaps one in an area that makes it sensitive to a heat source near it.

It's been said that climate models "stress" a computer more than any other program. This is partly due to the very high usage of the FPU.
So cpdn will probably create more heat than other projects. And several models run together will create even more heat.

ID: 53707
Art Masson
Joined: 16 Oct 11
Posts: 254
Credit: 15,954,577
RAC: 0
Message 53708 - Posted: 20 Mar 2016, 20:06:29 UTC - in response to Message 53705.  

Thanks Jim1348.

This could be ANYTHING I agree....drivers or driver updates, intermittent Hardware CPU or drive failures or communications failures, Windows 7 updates applied, other applications including virus software interactions....

The current reason I'm doubting a hardware problem is that I can run all my other BOINC projects with all 8 processor cores running 100% with no failures at all for any other projects....it's only the CPDN WUs that are failing. All my other machines are running the same combination of projects and the same version of Windows and the same version of my virus software with no problems (except for my wife's MacBook, which is also processing the same set of projects including CPDN with no problems). So I currently suspect some strange interaction within BOINC on my I7 machine as the most probable cause.

Would love to hear from someone else running on Windows 7 with an I7 machine to understand what project mix they are running.

I'm hoping that what I'm doing to find the combination of processing which causes the failures will help identify where to look next. Just wish it didn't take 4-5 days to run each test -- but that's how long CPDN jobs typically run on my machine!

Art Masson
St. Charles, IL
ID: 53708
Art Masson
Joined: 16 Oct 11
Posts: 254
Credit: 15,954,577
RAC: 0
Message 53709 - Posted: 20 Mar 2016, 20:19:22 UTC - in response to Message 53706.  

Hi Les,

Thanks...that may explain some/many of these 37 WUs not in BOINC. I don't think that explains WU 10235486 however. That one is still "in progress" on CPDN but no longer on my BOINC work unit list. It clearly downloaded successfully, I processed 3 trickles and now it's "gone" from BOINC but still "in progress" on CPDN.

Correct? Looks like this one just failed somehow on my machine and CPDN still believes it is "in progress."

Art
ID: 53709
Alex Plantema
Joined: 3 Sep 04
Posts: 126
Credit: 26,610,380
RAC: 3,377
Message 53712 - Posted: 20 Mar 2016, 23:23:05 UTC - in response to Message 53708.  

Would love to hear from someone else running on Windows 7 with an I7 machine to understand what project mix they are running.

I'm using an i7 with Windows 7 too. I mix it mostly with Einstein@Home. I run 6 tasks simultaneously. I've set the interval to switch between tasks to 14400 minutes (10 days), which almost eliminates the need to pause tasks. Virus scanning skips the Boinc data folder.
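
The 14400 figure is just 10 days expressed in minutes. A small Python sketch of the conversion, plus the corresponding preference line; the tag name is an assumption about what BOINC's global_prefs_override.xml uses and should be checked against the BOINC documentation, and `override_fragment` is a hypothetical helper:

```python
MINUTES_PER_DAY = 24 * 60


def days_to_minutes(days):
    """Convert a task-switch interval in days to minutes."""
    return days * MINUTES_PER_DAY


def override_fragment(days):
    # Assumption: tag name as used in BOINC's global_prefs_override.xml;
    # verify against the BOINC documentation before relying on it.
    minutes = days_to_minutes(days)
    return f"<cpu_scheduling_period_minutes>{minutes}</cpu_scheduling_period_minutes>"


print(days_to_minutes(10))    # 14400
print(override_fragment(10))
```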
ID: 53712
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53715 - Posted: 21 Mar 2016, 1:28:32 UTC - in response to Message 53709.  
Last modified: 21 Mar 2016, 5:19:15 UTC

WU 10235486 will remain "in progress" until your computer "reports" the finished or failed model to the server.
The only way it could be missing from your computer is to have lost the client_state.xml plus the client_state_prev.xml somehow. Perhaps all of the BOINC_client files.
And it looks like that happened in mid-January.
ID: 53715
Art Masson
Joined: 16 Oct 11
Posts: 254
Credit: 15,954,577
RAC: 0
Message 53719 - Posted: 21 Mar 2016, 4:44:06 UTC - in response to Message 53715.  

I'll try to keep more precise time records as I start subsequent WU's. WU 10235486 was a work unit that I restarted a few days ago (around March 15th). This was the 2nd WU I started in my testing -- see below. It had clearly processed 3 trickles in January. When I restarted it in my recent testing it ran briefly and then "failed" and dropped off my BOINC WU list (presumably it didn't run long enough to create another trickle and apparently didn't report the WU as failed to CPDN). I don't know exactly how long it ran after I restarted it as a single CPDN WU (while allowing all my other projects to start running).

Art Masson
ID: 53719
Art Masson
Joined: 16 Oct 11
Posts: 254
Credit: 15,954,577
RAC: 0
Message 53720 - Posted: 21 Mar 2016, 5:27:27 UTC - in response to Message 53712.  

Hi Alex -- Can you please tell me:
1. What Virus software and version you are running. I'm running Norton 360 -- latest version 22-6-0.
2. How have you set up BOINC so that it only runs six WU's? My setup is allowing eight WU's to run simultaneously (one for each processor core).
Thanks so much!
ID: 53720
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53721 - Posted: 21 Mar 2016, 6:58:05 UTC - in response to Message 53720.  

2.
Is easy to do - go to the Computing preferences in your cpdn Account page, and for "On multiprocessors, use at most", change the percentage to the value that corresponds to the number of cores you want to use.
The hard part is working out the number for an odd number of cores. (I just use a calculator these days. :) )

Alternatively, go to the BOINC menu -> Options -> Computing preferences -> Use at most, and change it there.

*************

The first affects ALL of your computers, the second only affects the computer that you change.
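
The percentage arithmetic can be sketched in a few lines of Python. This assumes the client rounds down when it turns the percentage into a core count (an assumption worth checking against your own machine), so the sketch rounds the percentage up to the next whole number; both function names are hypothetical:

```python
def percent_for_cores(wanted, total):
    """Smallest whole percentage that still yields `wanted` of `total` cores,
    assuming the client computes floor(total * pct / 100)."""
    if not 1 <= wanted <= total:
        raise ValueError("wanted must be between 1 and total")
    return (100 * wanted + total - 1) // total  # integer ceiling of 100*wanted/total


def cores_used(pct, total):
    """Core count under the same round-down assumption, never below 1."""
    return max(1, total * pct // 100)


print(percent_for_cores(6, 8))  # 75
print(percent_for_cores(3, 8))  # 38
print(cores_used(38, 8))        # 3
```

On an 8-core machine, 6 cores comes out at exactly 75%, while an odd count such as 3 gives 37.5%, which is why rounding up to 38 is the safer entry under that assumption.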

ID: 53721
Alex Plantema
Joined: 3 Sep 04
Posts: 126
Credit: 26,610,380
RAC: 3,377
Message 53731 - Posted: 21 Mar 2016, 23:55:52 UTC - in response to Message 53720.  

1. What Virus software and version you are running. I'm running Norton 360 -- latest version 22-6-0.
2. How have you set up BOINC so that it only runs six WU's? My setup is allowing eight WU's to run simultaneously (one for each processor core).

1. MS Security Essentials.
2. In the Boinc menu, as Les explained.
ID: 53731
Art Masson
Joined: 16 Oct 11
Posts: 254
Credit: 15,954,577
RAC: 0
Message 53732 - Posted: 22 Mar 2016, 2:09:56 UTC - in response to Message 53731.  

TEST #3 RESULTS
WU 10327029 has completed successfully.


I believe I've been able to prove that I can run a single CPDN work unit successfully as long as it's the only WU running on any project on my machine, since this has now worked for two WU's.

I've just started running TWO CPDN work units at the same time on my machine (no other WU's on any other project are running...all other projects and WU's are suspended).

TEST #4
The two WU's I've started are WU 10327985 and WU 10292536 -- both are Weather At Home 2 7.08 units. Objective is to see if I can successfully complete multiple CPDN work units running simultaneously. Both WU's were started at approx 9PM CT March 21. Estimated times to completion are 4 days and 5 days respectively.

Time will tell....

Art Masson
ID: 53732
Art Masson
Joined: 16 Oct 11
Posts: 254
Credit: 15,954,577
RAC: 0
Message 53815 - Posted: 25 Mar 2016, 16:16:26 UTC - in response to Message 53732.  

TEST #4 RESULTS:
WU 10327985 completed. WU 10292536 still processing (2 days to go)
TEST #4 Successful

TEST #5 Started:
Trying six CPDN WU's simultaneously with no other projects running -- submitted at 5:30AM March 25. Hoping this can confirm that I can stress my CPU with only CPDN work units to see if failures occur.

Started 5 more WU's...Two failed within 3 minutes (WU 10335878 and WU 10335869) -- but these WU's had previously both failed on others' machines with a very early computational error, so I'm assuming these are "normal failures" and not due to my I7 processing problems. So I started two others so that I'd have six running.

Currently Running 6 Total CPDN Work Units
WU 10292536 -- still processing from TEST #4, started 9 PM CT March 21.
WU 10305793 -- started 5:30 AM CT March 25
WU 10301603 -- started 5:30 AM CT March 25
WU 10329659 -- started 5:30 AM CT March 25
WU 10298749 -- started 5:33 AM CT March 25
WU 10329645 -- started 5:33 AM CT March 25

Time will tell (again)...but so far it appears I can run any number of CPDN work units as long as I run no other projects....
ID: 53815
Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 53816 - Posted: 25 Mar 2016, 16:48:46 UTC - in response to Message 53815.  

I will bet that if these complete, then you will have no problem with other BOINC projects as well, for the most part. I typically run WCG along with CPDN with no problems, and Einstein is no problem either. But I am a bit concerned about Milky Way. Doesn't that use a lot of double-precision floating point? Maybe that stresses the CPU in certain ways that the others don't.
ID: 53816
Alan K
Joined: 22 Feb 06
Posts: 492
Credit: 31,523,112
RAC: 14,304
Message 53818 - Posted: 25 Mar 2016, 17:46:27 UTC - in response to Message 53816.  

I'm running CPDN and Milky Way without any problems. Most of the MW stuff I do is on the GPU so doesn't really affect the CPDN work on the CPU (i5 BTW).
ID: 53818