Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues
Author | Message |
---|---|
Send message Joined: 29 Oct 17 Posts: 1052 Credit: 16,817,940 RAC: 12,877 |
Batch 1012 will fail on Intel machines. Batches 1013 & 1014 should continue running. --- CPDN Visiting Scientist |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
I have a 1008 task (https://www.cpdn.org/result.php?resultid=22417298) that seems "stuck" - it's been at 6% and change for 6 days now while the rest of the tasks blow past it. Windows 10 VM on an AMD system. Is there any value to letting it continue spinning, or should I just abort it and let some other system try? |
Send message Joined: 29 Oct 17 Posts: 1052 Credit: 16,817,940 RAC: 12,877 |
Rather than abort, could you please do me a favour? Open up Resource Monitor, click the 'CPU' tab and scroll down to find the 'wah2' list of processes. You should have 3 processes per task. For your task, n15e, please let me know how many you see. I think you'll only see one process: wah2_8.29_windows_intelx86.exe and not the wah2am_* and wah2rm_* processes. Can you confirm? Also, if you know your way around the BOINC folder layout, would be great if you could locate the task directory and check a file for me. I'd like to see the last few lines of a file called 'stdout_mon.txt'. It can be found in the task directory, which will be under your BOINC install 'data' directory: e.g. c:\Program Files\BOINC\data\projects\climateprediction.net\wah2_eas25_n15e_201212_24_1008_012272746_0\stdout_mon.txt Note the 'data' directory under BOINC is usually a hidden directory, you'll need to unhide folders in file explorer. The reason I ask this is because up to now it's appeared that only AMD PCs can run the 1008 tasks (though not produce correct results). It sounds like yours has crashed and I'd like to see how far it got. After this, rather than Abort I suggest doing 'End Task' in Resource Monitor (right click on the correct process name). I *think* this will avoid your host being marked down for aborting tasks. Many thanks. --- CPDN Visiting Scientist |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
I have 9 tasks currently running on the system (it's a bit short on disk, I should fix that). I see 9 wah2_8.29 tasks, 9 wah2am3m2_8.29 tasks, and 9 of the wah2am3m2t_8.29 tasks. One or two of them are showing no CPU use, though. Not sure how to link PIDs to tasks, though. But looking more closely at the task, "CPU Time" reports "1d 03:51:19" on an elapsed time of 6d and change. There's no C:\Program Files\BOINC\data directory, but I've got a C:\ProgramData\BOINC directory with that sort of stuff. stderr_rm and stderr_um are both empty. stdout_mon.txt: worker: Created shared memory region key = wah2_eas25_n15e_201212_24_1008_012272746 of size 73278744 bytes (version 608) Run for 2 Years and 0 Months pShMem->PRECIS_LATITUDE 185 pShMem->PRECIS_LONGITUDE 285 pShMem->EWSPACEA 0.220000 pShMem->NSSPACEA 0.220000 pShMem->FRSTLATA 19.100000 pShMem->FRSTLONA 328.500000 pShMem->POLELATA 55.500000 pShMem->POLELONA 308.000000 pShMem->L_RUN_REGION 1 pShMem->UPLOAD_INTERVAL 0 ulTotalPhaseTimestep 276864 Starting model ID wah2_eas25_n15e_201212_24_1008_012272746 Phase 1 Launching model "C:\ProgramData\BOINC/projects/climateprediction.net/wah2am3m2_um_8.29_windows_intelx86.exe" wah2_eas25_n15e_201212_24_1008_012272746 generic_phase1_spinup_eas25_global_aabaka ic19610319_14_N96 NATclim_ancil_168months_CMIP6-HadGEM3-GC31-LL_SST_2009-01-01_2022-12-30_v2403 NATclim_ancil_168months_CMIP6-HadGEM3-GC31-LL_SIC_2009-01-01_2022-12-30_v2403 so2dms_prei_N96_1855_0000P oxi.addfa ozone_preind_N96_1879_0000Pv5 Launching model "C:\ProgramData\BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.29_windows_intelx86.exe" wah2_eas25_n15e_201212_24_1008_012272746 executeModelProcess: MonID=7868, GCM_PID=9088, RCM_PID=9824 stdout_rm.txt: Starting HadRM3 model for ID# wah2_eas25_n15e_201212_24_1008_012272746... Attached to shared memory segment with ID Setting run-time Fortran environment... 
UM environment variables in use: ASTART=dataout/region_restart.day UNIT11=dataout/xacxf.phist UM_SECTOR_SIZE=2048 UNIT02=jobs/xacxf UM_LBC_COUP=0 VN=4.5 TYPE=CRUN UNIT09=tmp/xacxf.namelists UNIT22=datain/ancil/ctldata/STASHmaster STASETS_DIR=datain/ancil/ctldata/stasets CACHE2=tmp/xacxf.cache2 UNIT08=tmp/xacxf.pipe_dummy UNIT14=tmp/xacxf.errors APSUM1=tmp/xacxf.apsum1 APSTMP1=tmp/xacxf.apstmp1 AOTRANS=tmp/xacxf.aotrans UNIT04=jobs/xacxf.stashc UNIT05=jobs/xacxf.namelists DATAM=dataout/ UNIT12=dataout/xacxf.thist UNIT10=dataout/xacxf.phist UNIT06=dataout/xacxf.out UNIT00=dataout/xacxf.err AINITIAL=dataout/region_restart.day UNIT57=jobs/spec3a_sw_3_asol2c_hadcm3 UNIT80=jobs/spec3a_lw_3_asol2c_hadcm3 SWSPECTD=jobs/spec3a_sw_3_asol2c_hadcm3 LWSPECTD=jobs/spec3a_lw_3_asol2c_hadcm3 Changing to slots dir C:\ProgramData\BOINC\slots\1 stdout_um.txt: Starting HadAM3P model for ID# wah2_eas25_n15e_201212_24_1008_012272746... Attached to shared memory segment with ID Setting run-time Fortran environment... 
UM environment variables in use: ASTART=datain/dumps/generic_phase1_spinup_eas25_global_aabaka UNIT15=datain/ancil/ic19610319_14_N96 SSTIN=datain/ancil/NATclim_ancil_168months_CMIP6-HadGEM3-GC31-LL_SST_2009-01-01_2022-12-30_v2403 SICEIN=datain/ancil/NATclim_ancil_168months_CMIP6-HadGEM3-GC31-LL_SIC_2009-01-01_2022-12-30_v2403 SULPEMIS=datain/ancil/so2dms_prei_N96_1855_0000P CHEMOXID=datain/ancil/oxi.addfa OZONE=datain/ancil/ozone_preind_N96_1879_0000Pv5 UM_LBC_COUP=0 UNIT11=dataout/xadae.phist UM_SECTOR_SIZE=2048 UNIT02=jobs/xadae VN=4.5 TYPE=CRUN UNIT09=tmp/xadae.namelists AINITIAL=dataout/atmos_restart.day UNIT57=jobs/spec3a_sw_3_asol2c_hadcm3 UNIT80=jobs/spec3a_lw_3_asol2c_hadcm3 SWSPECTD=jobs/spec3a_sw_3_asol2c_hadcm3 LWSPECTD=jobs/spec3a_lw_3_asol2c_hadcm3 UNIT22=datain/ancil/ctldata/STASHmaster STASETS_DIR=datain/ancil/ctldata/stasets CACHE2=tmp/xadae.cache2 UNIT08=tmp/xadae.pipe_dummy UNIT14=tmp/xadae.errors APSUM1=tmp/xadae.apsum1 APSTMP1=tmp/xadae.apstmp1 AOTRANS=tmp/xadae.aotrans UNIT04=jobs/xadae.stashc UNIT05=jobs/xadae.namelists DATAM=dataout/ UNIT12=dataout/xadae.thist UNIT10=dataout/xadae.phist UNIT06=dataout/xadae.out UNIT00=dataout/xadae.err UM_ATM_NPROCX=1 UM_ATM_NPROCY=1 UM_NPES=1 RUNID=xadae Changing to slots dir C:\ProgramData\BOINC\slots\1 Closing model... Detaching shared memory segment... I don't see anything obviously wrong in them... I'm tempted to suspend and resume that task, see if it comes back up properly. |
Send message Joined: 29 Oct 17 Posts: 1052 Credit: 16,817,940 RAC: 12,877 |
Thanks. I can see the problem. Quote: "I have 9 tasks ... I see 9 wah2_8.29 tasks, 9 wah2am3m2_8.29 tasks, and 9 of the wah2am3m2t_8.29 tasks. One or two of them are showing no CPU use, though. Not sure how to link PIDs to tasks, though." There are several ways to link a PID to a task; I like Resource Monitor. Start it up. On the CPU tab, scroll to find the CPDN task process you are interested in and click the little checkbox to the left of it. Once clicked, open up the 'Associated Handles' section (little up/down arrow on the title bar below), and it will show you all the files and folders associated with the process. The last lines of the global model output, 'stdout_um.txt', show the problem: 'Closing model'. That means the model has stopped but for some reason BOINC hasn't recognised this and the process hasn't exited. That's why the task has hung: the global model isn't running, so the other two processes are just sitting waiting. Rather than suspend/resume, I would shut down the client to kill the processes. Make sure they really have gone (check Resource Monitor) and then start up the client again. It's possible the tasks will then error but that's what you need anyway. HTH p.s. I've just checked the machine this was running on. I noticed it's only got 8 GB RAM. How many CPDN tasks are running simultaneously and how much of that 8 GB is BOINC allowed to use? Am thinking you might have hit a memory limit causing this odd behaviour. --- CPDN Visiting Scientist |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
It's running 9 tasks, due to disk limits, showing 6.8GB of 8 in use, and BOINC is allowed 90% of RAM. I'll just reboot the VM and up the RAM to it - it's able to have at least 12GB, the system isn't running anything else. |
Send message Joined: 17 Feb 06 Posts: 2 Credit: 1,265,317 RAC: 2,930 |
Quote: "The reason I ask this is because up to now it's appeared that only AMD PCs can run the 1008 tasks (though not produce correct results)." Hello, I have 4 1008 tasks at about 60% on a Ryzen 3600x. Is it worth continuing them if they don't output correct results or should I abort them? Thanks! |
Send message Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944 |
I am going to let mine continue unless I see a message from Glen here or someone else at the project asking for them to be aborted. (Mine are suspended currently to let some testing branch tasks go through.) |
Send message Joined: 29 Oct 17 Posts: 1052 Credit: 16,817,940 RAC: 12,877 |
Quote: "The reason I ask this is because up to now it's appeared that only AMD PCs can run the 1008 tasks (though not produce correct results)." Please let it run. We're not 100% certain of the results. They are a useful comparison to the failed Intel runs. --- CPDN Visiting Scientist |
Send message Joined: 17 Feb 06 Posts: 2 Credit: 1,265,317 RAC: 2,930 |
Ok, will do! Thanks 👍 |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,816,918 RAC: 4,574 |
Task 22425060, test batch 1014, Intel has finished successfully. |
Send message Joined: 29 Oct 17 Posts: 1052 Credit: 16,817,940 RAC: 12,877 |
Weather@Home running on Intel crashes at start of new year. Followers of this thread will recall there were problems with batches 1008-1012 where the regional model would crash when it started the new calendar year. This only happened on Intel CPUs and not on AMD ones. I've been spending time understanding what's going on. The behaviour of the model was "correct" on Intel; it should have crashed. The problem is caused by a bug in the model code which causes a memory overwrite. Not a lot but enough to do some damage. It turns out this bug has been in the code from the time CPDN originally obtained it from the UK MetO (who have since moved this code on, I hasten to add). The impact of the code bug was data dependent and also compiler optimization dependent. The problem was in a part of the model code that recomputed the solar flux variability at the start of a new year. A scalar variable was being passed to a subroutine when it should have been an array of values. As the solar variability is small year on year and Weather@Home runs are relatively short, analysis shows this only has a minimal effect on model results. Certainly less than the variability introduced by the ensemble of forcings. Investigating the crashes also identified another problem. There was a slight discrepancy between the land/sea masks used by the new sea-surface temperature input data and by the model itself for some of the EAS25 batches. This led to some extra bogus sea-ice points appearing off the western edge of some coasts. This has now been corrected and verified with tests. The code changes will require a new app version. This is being prepared, though I am also making some more improvements to the exception handling and a few other aspects. It will be a couple of weeks before a new app appears. We will then rerun one of the earlier batches to do an analysis of the differences. --- CPDN Visiting Scientist |
Send message Joined: 6 Aug 04 Posts: 195 Credit: 28,617,787 RAC: 9,624 |
Glenn, Thank you for the explanation. |
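
For anyone who would rather script the check discussed earlier in the thread than click through Resource Monitor, here is a rough Python sketch. It is not a CPDN or BOINC tool. It relies only on two things visible in the logs posted above: the `executeModelProcess: MonID=..., GCM_PID=..., RCM_PID=...` line in `stdout_mon.txt`, and the 'Closing model' message near the end of `stdout_um.txt`. The function names, the reading of `MonID` as the monitor's process ID, and the assumption that both files sit in the same task directory are mine, so treat them as hypothetical.

```python
import re
from pathlib import Path

# Matches the PID line seen in the stdout_mon.txt posted in this thread.
PID_LINE = re.compile(
    r"executeModelProcess:\s*MonID=(\d+),\s*GCM_PID=(\d+),\s*RCM_PID=(\d+)"
)


def task_pids(stdout_mon_text):
    """Return the PIDs logged for a task, or None if no PID line is found.

    Assumption: MonID is the monitor process, GCM_PID the global model
    and RCM_PID the regional model.
    """
    matches = list(PID_LINE.finditer(stdout_mon_text))
    if not matches:
        return None
    # Use the last occurrence in case the models were relaunched.
    mon, gcm, rcm = (int(g) for g in matches[-1].groups())
    return {"monitor": mon, "global": gcm, "regional": rcm}


def looks_hung(stdout_um_text, tail_lines=5):
    """True if 'Closing model' appears in the last few non-empty lines."""
    lines = [ln.strip() for ln in stdout_um_text.splitlines() if ln.strip()]
    return any("Closing model" in ln for ln in lines[-tail_lines:])


def check_task_dir(task_dir):
    """Read both logs from a task directory (hypothetical helper).

    Assumes stdout_mon.txt and stdout_um.txt are both in task_dir.
    """
    task_dir = Path(task_dir)
    mon_text = (task_dir / "stdout_mon.txt").read_text(errors="replace")
    um_text = (task_dir / "stdout_um.txt").read_text(errors="replace")
    return task_pids(mon_text), looks_hung(um_text)
```

Calling `check_task_dir` with the full path of a task directory returns the logged PIDs and a flag for the 'Closing model' message. If the flag is set and the global model PID still shows up in Resource Monitor, that would match the hang described above; a task that has finished normally may well log the same message, so the flag on its own proves nothing.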
©2024 cpdn.org