Intel I7 Woes....No successful completion since April 2015
Author | Message |
---|---|
Joined: 16 Oct 11 Posts: 254 Credit: 15,954,577 RAC: 0 |
Results so far: WU 10324402 -- Finished successfully. (I ran this as my first test, with all other projects suspended and all CPDN jobs suspended except for this one WU.) This wastes a lot of processing capacity on my I7, but I have to try to figure out what is causing all my CPDN jobs to fail. WU 10235486 -- Failed and no longer visible on my machine. It completed 3 trickles but then somehow failed. CPDN still sees this job as "in process", so the failure was somehow not communicated. (I was running this as the only active CPDN job, with all other CPDN jobs suspended, but had enabled all my other projects.) This seems typical of some failures: I now have many WUs which CPDN sees as still "in process" but which are no longer visible to me in BOINC as waiting or in process. If this is happening to many users, it could explain the large number of jobs that CPDN is waiting on and which are not in anyone's active queue until they time out for lack of communication much later. WU 10327029 -- Currently in process. 6 trickles completed. This WU had failed on another user's machine after 3 trickles. (This is running, again, as the only WU processing on my machine. I want to make sure the successful completion of the first WU above as the only active WU was not a fluke!) If this finishes successfully, I'm going to try two CPDN WUs processing at the same time with all other projects suspended, to try to determine whether multiple CPDN WUs are somehow impacting each other or whether the problem is caused by interactions with the other projects. More later... Time will tell. |
Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
WU 10235486 -- Failed and no longer visible on my machine. It completed 3 trickles but then somehow failed. CPDN still sees this job as "in process", so the failure was somehow not communicated. (I was running this as the only active CPDN job, with all other CPDN jobs suspended, but had enabled all my other projects.) I have never seen that. All my jobs in progress as shown on climateprediction.net correspond to what BOINC (actually BoincTasks) shows on my PC. Maybe there is a communication problem between you and CPDN, either on your machine or elsewhere. |
Joined: 16 Oct 11 Posts: 254 Credit: 15,954,577 RAC: 0 |
I cannot explain it either. I suspect this is related to how the CPDN work units fail on my I7 machine (ID 1266353), with no communication to confirm or report their failure. Right now, on the CPDN web site, the project sees a total of 50 work units "in progress" on this machine, but in BOINC on this machine there are only 13 work units in total: one is processing, and the other 12 I have suspended so that only the single work unit is running. So I have 37 WUs that CPDN believes are "in progress" that are in fact not in my BOINC queue. I have four other machines processing CPDN work units (handling all the projects I listed earlier). CPDN work units do not fail this way on the other machines, and there is good agreement on "in progress" work units between what BOINC reports and the status on the CPDN web site. Like I say, if this is happening to other users, it is probably not a good thing for overall processing of CPDN units, with the project servers waiting for WUs to time out when they could be reassigned to other users if this type of error were not occurring. I would like to hear from someone from the project to discuss whether this is worth pursuing further. Right now I'm just trying to understand better when my CPDN WUs are failing on this machine. Art Masson, St. Charles, IL |
Joined: 18 Jul 13 Posts: 438 Credit: 25,837,810 RAC: 9,716 |
Hi Art, to release the ghost (in-progress-but-no-longer-on-your-PC) WUs for others to crunch, you need to detach (delete) the CPDN project from BOINC and then reattach it. However, you may want to wait until all your tasks have finished or failed, or you will lose them as well. As for the main cause of your trouble, I can't be of any help, unfortunately. |
Joined: 16 Oct 11 Posts: 254 Credit: 15,954,577 RAC: 0 |
Ah... Thanks, Bernard IVO. I'll do that when I get to a better point in understanding why and when my CPDN WUs are failing. Right now I don't want to do that and lose where I am with the work in my queue and the one WU I have processing, while I try to understand what combination of running WUs causes these failures. |
Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
I have had a lot of disk drive (SSD) problems recently, though they have not affected CPDN or any BOINC project. But if something is failing randomly, I do a CHKDSK and if that does not fix it, replace the drive. But an often overlooked and insidious cause of problems is the SATA cables. I have replaced several of those in the last few months, and even replaced the replacements. Some brands are better than others, but they can all fail. (Check the Event Viewer for NTFS errors, and IDE port errors. The symptoms vary.) |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Art, unless you have evidence in your stdoutdae.txt file of those "extra" models being downloaded to your computer, Bernard is right: those are "ghost" models, which you never had in the first place. This happens in some cases when parts of the overall server structure get overloaded; one part allocates a task to a computer and tells another part to add it to the list of tasks to be downloaded to that computer, but the second part, for some reason, doesn't actually download it. So that task/model name appears on your Account page list, but is never sent. You can ignore them, or you can make the Tasks list look pretty by getting a new computer ID, which won't have those tasks associated with it. Edit: Actually, it's the green message that disappears, to be replaced by a black error message, perhaps "Disconnected". |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Another possible problem is a hairline crack in the motherboard, perhaps one in an area that makes it sensitive to a nearby heat source. It's been said that climate models "stress" a computer more than any other program. This is partly due to the very high usage of the FPU (floating-point unit). So CPDN will probably create more heat than other projects, and several models run together will create even more heat. |
Joined: 16 Oct 11 Posts: 254 Credit: 15,954,577 RAC: 0 |
Thanks, Jim1348. I agree this could be ANYTHING: drivers or driver updates, intermittent CPU or drive hardware failures, communications failures, Windows 7 updates, or interactions with other applications, including virus software. The current reason I'm doubting a hardware problem is that I can run all my other BOINC projects with all 8 processor cores at 100% with no failures at all; it's only the CPDN WUs that are failing. All my other machines are running the same combination of projects, the same version of Windows and the same version of my virus software with no problems (except for my wife's MacBook, which is also processing the same set of projects, including CPDN, with no problems). So I currently suspect some strange interaction within BOINC on my I7 machine as the most probable cause. I would love to hear from someone else running Windows 7 on an I7 machine to understand what project mix they are running. I'm hoping that what I'm doing to find the combination of processing which causes the failures will help identify where to look next. I just wish it didn't take 4-5 days to run each test, but that's how long CPDN jobs typically run on my machine! Art Masson, St. Charles, IL |
Joined: 16 Oct 11 Posts: 254 Credit: 15,954,577 RAC: 0 |
Hi Les, Thanks...that may explain some/many of these 37 WUs not in BOINC. I don't think that explains WU 10235486 however. That one is still "in progress" on CPDN but no longer on my BOINC work unit list. It clearly downloaded successfully, I processed 3 trickles and now it's "gone" from BOINC but still "in progress" on CPDN. Correct? Looks like this one just failed somehow on my machine and CPDN still believes it is "in progress." Art |
Joined: 3 Sep 04 Posts: 126 Credit: 26,610,380 RAC: 3,377 |
Would love to hear from someone else running on Windows 7 with an I7 machine to understand what project mix they are running. I'm using an i7 with Windows 7 too. I mix CPDN mostly with Einstein@Home. I run 6 tasks simultaneously. I've set the interval to switch between tasks to 14400 minutes (10 days), which almost eliminates the need to pause tasks. Virus scanning skips the BOINC data folder. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
WU 10235486 will remain "in progress" until your computer "reports" the finished or failed model to the server. The only way it could be missing from your computer is if you somehow lost client_state.xml plus client_state_prev.xml, or perhaps all of the BOINC client files. And it looks like that happened in mid-January. |
Joined: 16 Oct 11 Posts: 254 Credit: 15,954,577 RAC: 0 |
I'll try to keep more precise time records as I start subsequent WUs. WU 10235486 was a work unit that I restarted a few days ago (around March 15th). This was the 2nd WU I started in my testing -- see below. It had clearly processed 3 trickles in January. When I restarted it in my recent testing, it ran briefly and then "failed" and dropped off my BOINC WU list (presumably it didn't run long enough to create another trickle, and apparently it didn't report the WU as failed to CPDN). I don't know exactly how long it ran after I restarted it as a single CPDN WU (while allowing all my other projects to start running). Art Masson |
Joined: 16 Oct 11 Posts: 254 Credit: 15,954,577 RAC: 0 |
Hi Alex -- Can you please tell me: 1. What Virus software and version you are running. I'm running Norton 360 -- latest version 22-6-0. 2. How have you set up BOINC so that it only runs six WU's? My setup is allowing eight WU's to run simultaneously (one for each processor core). Thanks so much! |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
2. This is easy to do: go to the Computing preferences in your CPDN Account page and, for "On multiprocessors, use at most", change the percentage to the value that corresponds to the number of cores you want to use. The hard part is working out the number for an odd number of cores. (I just use a calculator these days. :) ) Alternatively, go to the BOINC menu -> Options -> Computing preferences -> "Use at most", and change it there. Note that the first affects ALL of your computers; the second only affects the computer on which you change it. |
Joined: 3 Sep 04 Posts: 126 Credit: 26,610,380 RAC: 3,377 |
1. What Virus software and version you are running. I'm running Norton 360 -- latest version 22-6-0. 1. MS Security Essentials. 2. In the BOINC menu, as Les explained. |
Joined: 16 Oct 11 Posts: 254 Credit: 15,954,577 RAC: 0 |
TEST #3 RESULTS: WU 10327029 has completed successfully. I believe I've been able to prove that I can run a single CPDN work unit successfully as long as it's the only WU running on any project on my machine. Since this has now worked for two WUs, I've just started running TWO CPDN work units at the same time on my machine (no other WUs on any other project are running; all other projects and WUs are suspended). TEST #4: The two WUs I've started are WU 10327985 and WU 10292536 -- both are Weather At Home 2 7.08 units. The objective is to see if I can successfully complete multiple CPDN work units running simultaneously. Both WUs were started at approx. 9 PM CT March 21. Estimated times to completion are 4 days and 5 days respectively. Time will tell.... Art Masson |
Joined: 16 Oct 11 Posts: 254 Credit: 15,954,577 RAC: 0 |
TEST #4 RESULTS: WU 10327985 completed. WU 10292536 still processing (2 days to go). TEST #4 successful. TEST #5 started: trying six CPDN WUs simultaneously with no other projects running -- submitted at 5:30 AM March 25. I'm hoping this lets me stress my CPU with only CPDN work units to see if failures occur. I started 5 more WUs. Two failed within 3 minutes (WU 10335878 and WU 10335869), but both of these had previously failed on other users' machines with a very early computational error, so I'm assuming these are "normal failures" and not due to my I7 processing problems. So I started two others so that I'd have six running. Currently running 6 CPDN work units in total: WU 10292536 -- still processing from TEST #4, started 9 PM CT March 21; WU 10305793 -- started 5:30 AM CT March 25; WU 10301603 -- started 5:30 AM CT March 25; WU 10329659 -- started 5:30 AM CT March 25; WU 10298749 -- started 5:33 AM CT March 25; WU 10329645 -- started 5:33 AM CT March 25. Time will tell (again)... but so far it appears I can run any number of CPDN work units as long as I run no other projects.... |
Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
I will bet that if these complete, then you will have no problem with other BOINC projects as well, for the most part. I typically run WCG along with CPDN with no problems, and Einstein is no problem either. But I am a bit concerned about Milky Way. Doesn't that use a lot of double-precision floating point? Maybe that stresses the CPU in certain ways that the others don't. |
Joined: 22 Feb 06 Posts: 492 Credit: 31,523,112 RAC: 14,304 |
I'm running CPDN and Milky Way without any problems. Most of the MW stuff I do is on the GPU so doesn't really affect the CPDN work on the CPU (i5 BTW). |
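
On Les's point about working out the "use at most N% of the processors" value for an awkward number of cores, the arithmetic can be sketched as below. This is only a minimal sketch: it assumes the client rounds the resulting core count down to a whole number, which is an assumption about BOINC's behaviour and not something stated in the thread, and the function name `percent_for_cores` is made up for illustration.

```python
def percent_for_cores(total_cores: int, wanted_cores: int) -> int:
    """Smallest whole-number percentage that still yields `wanted_cores`.

    Assumes the client computes floor(total_cores * percent / 100);
    that rounding rule is an assumption, not taken from the thread.
    """
    if not 1 <= wanted_cores <= total_cores:
        raise ValueError("wanted_cores must be between 1 and total_cores")
    # Ceiling division, done in integers to avoid floating-point surprises.
    return -(-100 * wanted_cores // total_cores)


if __name__ == "__main__":
    # Art's machine runs 8 tasks at once; Alex limits his to 6.
    for wanted in range(1, 9):
        print(wanted, "of 8 cores ->", percent_for_cores(8, wanted), "%")
```

Rounding the percentage up rather than down matters under that assumption: 3 of 8 cores is 37.5%, and entering 37 would give only 2 cores after rounding down, whereas 38 gives 3.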
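
The question running through the thread is whether a task the server lists as "in progress" was never downloaded (a ghost) or was downloaded and later lost. That comes down to a set difference plus a log search, roughly as sketched here. The task names and the log line are invented for illustration, and the sketch assumes that a downloaded task's name appears somewhere in the client log (for example stdoutdae.txt); the real log format is not shown in the thread.

```python
def classify_missing(server_tasks, local_tasks, log_lines):
    """Split tasks the server lists but the local queue lacks into two groups.

    "ghost": the task name never appears in the log, so it was probably
             never downloaded.
    "lost_after_download": the task name does appear in the log, so it
             reached the machine and then vanished without being reported.
    """
    missing = set(server_tasks) - set(local_tasks)
    seen_in_log = {
        task for task in missing
        if any(task in line for line in log_lines)
    }
    return {
        "ghost": sorted(missing - seen_in_log),
        "lost_after_download": sorted(seen_in_log),
    }


if __name__ == "__main__":
    # Hypothetical names; real CPDN task names look different.
    server = ["task_alpha", "task_beta", "task_gamma"]
    local = ["task_alpha"]
    log = ["(illustrative log line) Started download of task_beta.zip"]
    print(classify_missing(server, local, log))
```

With Art's numbers, the `missing` set would hold the 37 tasks that the server shows but BOINC does not; the log search then separates cases like WU 10235486, which clearly ran on the machine, from tasks that never arrived.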