climateprediction.net (CPDN) home page
Thread 'Can we exit non-zero for model crash?'

Message boards : Number crunching : Can we exit non-zero for model crash?
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 43,920,589
RAC: 52,786
Message 71635 - Posted: 16 Oct 2024, 7:08:31 UTC

I noticed that for certain computation errors, the application exits with code 0, which usually indicates success rather than failure. For example:
https://main.cpdn.org/result.php?resultid=22472239
https://main.cpdn.org/result.php?resultid=22481902

This is a bit annoying for monitoring because the exit code is pretty much all I get from "boinccmd --get_old_tasks" to infer whether a task succeeded. I have a cron job that regularly polls this to alert me about any failures that might be worth attention. It might also throw off BoincTasks' history, which seems to use the same RPC to obtain historical results.

Could we change the exit code for these crashes to non-zero?
ID: 71635
Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 71637 - Posted: 16 Oct 2024, 13:16:40 UTC - in response to Message 71635.  
Last modified: 16 Oct 2024, 13:39:37 UTC

I'd also noticed this and created an issue for it a while ago. One of the models is crashing (in the first of your examples) with an error code of 193, but there's a bug somewhere such that the monitor process does not pick this up and reports zero. I don't know why. It will be some time before I get to this, as I've got months' worth of more pressing work to do first.

A workaround might be to check the elapsed time of the task. If it's much less than expected, it probably didn't work despite the zero return code.
---
CPDN Visiting Scientist
ID: 71637
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 43,920,589
RAC: 52,786
Message 71642 - Posted: 16 Oct 2024, 16:23:56 UTC - in response to Message 71637.  

Thanks. Good point on checking elapsed time. I already do that to ignore server aborts from other projects, so it's simply a matter of adding another "if". :-D
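
For anyone wanting to try the same workaround, here is a minimal sketch of that extra "if". It assumes the exit status and elapsed time have already been extracted from the "boinccmd --get_old_tasks" output; the names OldTask and classify, the expected runtime, and the 10% threshold are all hypothetical choices for illustration, not anything CPDN or BOINC defines.

```python
from dataclasses import dataclass


@dataclass
class OldTask:
    """Hypothetical container for values already parsed from the old-task list."""
    name: str
    exit_status: int
    elapsed_seconds: float


def classify(task: OldTask, expected_seconds: float, min_fraction: float = 0.1) -> str:
    """Return "failed", "suspect", or "ok" for a finished task.

    expected_seconds and min_fraction are assumptions the operator must
    choose per application; nothing here is taken from BOINC itself.
    """
    if task.exit_status != 0:
        # A non-zero exit status is a clear failure.
        return "failed"
    if task.elapsed_seconds < min_fraction * expected_seconds:
        # Exit status 0 but far shorter than expected: probably a crash
        # that was reported as success.
        return "suspect"
    return "ok"


if __name__ == "__main__":
    # Illustrative values only: a 10-hour expected runtime.
    expected = 10 * 3600.0
    for t in (
        OldTask("crashed_nonzero", 193, 120.0),
        OldTask("crashed_zero", 0, 120.0),
        OldTask("finished", 0, 9.5 * 3600.0),
    ):
        print(t.name, classify(t, expected))
```

The threshold needs tuning per model type, since a legitimately short task would otherwise be flagged as suspect.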
ID: 71642


©2024 cpdn.org