
automatic job restart: msg#00070

Subject: automatic job restart

I believe this was discussed a while back: if a job returns a non-zero exit code (i.e., fails), Torque never re-executes it, right?

 

If this is the case, would it be possible to implement some interface, via the returned exit code or signal (or stdout/stderr, or some other way), through which a program can indicate “I failed, but due to something likely local to this machine or temporary, so try me again,” and have the job re-run without its job ID changing?

 

A job could specify at submit time which exit code indicates a re-run.  If you wanted to get more tricky, the job could even return more helpful hints, such as re-run anywhere (not abundantly clear this is that useful now), re-run on another host, or perhaps others.  To avoid confusion with existing jobs (i.e., those people have today), the job MUST specify that it wants to be treated this way, so that a return code of, say, 10 doesn’t cause restarts when it shouldn’t.

 

This allows for nice wrapper scripts around the real job that can check why something failed and, where appropriate, indicate a re-run.
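
To make the wrapper idea concrete, here is a rough sketch in Python. It is only an illustration of the proposal: the re-run exit code (10, borrowed from the example above), the list of "transient" stderr fragments, and the function names are all assumptions of mine, and nothing here is an existing Torque feature.

```python
import subprocess
import sys

# Hypothetical convention: exit code 10 means "failed, but please re-run me".
RERUN_EXIT_CODE = 10

# Illustrative stderr fragments treated as transient or machine-local failures.
TRANSIENT_MARKERS = (
    "No space left on device",
    "Connection timed out",
    "Stale file handle",
)


def classify_failure(returncode, stderr_text):
    """Map the real job's result to the exit code the wrapper should return."""
    if returncode == 0:
        return 0
    if any(marker in stderr_text for marker in TRANSIENT_MARKERS):
        return RERUN_EXIT_CODE
    # Anything else is treated as a genuine failure and passed through.
    return returncode


def run_wrapped(cmd):
    """Run the real job, pass its stderr through, and return the wrapper's exit code."""
    result = subprocess.run(cmd, stderr=subprocess.PIPE, text=True)
    sys.stderr.write(result.stderr)
    return classify_failure(result.returncode, result.stderr)
```

A job script would end with something like sys.exit(run_wrapped(["./real_job", "arg"])), where ./real_job is a placeholder. One caveat of this sketch: a real job that itself exits with 10 for an unrelated reason would be indistinguishable from a re-run request, which is another reason the behaviour would have to be opt-in.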

 

This works better than, say, an afternotok dependency that checks and resubmits, because I need the job to keep the same ID (so that items dependent on the original job stay dependent on the re-run).

 

Or is there a simple, stateless external way to do this?  (i.e., I don’t want to have to find all items dependent on a failed job and update every dependency.)

 

Some notion of limited restartability for a job would be very helpful for obtaining clean fault tolerance.
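
As a rough illustration of what "limited" could mean, the sketch below retries inside the job itself, a bounded number of times, whenever the work reports a hypothetical re-run exit code. The code value, the names, and the defaults are assumptions for illustration only. Since the retry happens within a single job, the job ID stays the same, but it can only help with failures that a retry on the same host might cure; it does not cover the "re-run on another host" case.

```python
import subprocess
import time

# Hypothetical "try me again" exit code; not something Torque defines.
RERUN_EXIT_CODE = 10


def run_with_retries(run_once, max_attempts=3, rerun_code=RERUN_EXIT_CODE,
                     delay_seconds=0.0):
    """Call run_once() until it returns something other than rerun_code,
    giving up after max_attempts calls. Returns the last exit code seen."""
    last_code = rerun_code
    for attempt in range(1, max_attempts + 1):
        last_code = run_once()
        if last_code != rerun_code:
            return last_code  # success, or a failure not worth retrying
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # give a transient problem time to clear
    return last_code  # still asking for a re-run after the final attempt


def make_runner(cmd):
    """Build a run_once callable that executes cmd and returns its exit code."""
    return lambda: subprocess.run(cmd).returncode
```

For example, run_with_retries(make_runner(["./real_job"]), max_attempts=3, delay_seconds=30) would run the placeholder command ./real_job at most three times before giving up.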

 

-sr

 

 

Sam Rash

srash@xxxxxxxxxxxxx

408-349-7312

vertigosr37

 

_______________________________________________
torqueusers mailing list
torqueusers@xxxxxxxxxxxxxxxx
http://www.supercluster.org/mailman/listinfo/torqueusers