I believe this was discussed a while back: if a job returns
non-zero (fail), the job is never re-executed by torque, right?
If this is the case, would it be possible to implement some
interface using the returned signals (or stdout/stderr, some way) through which a
program can indicate “I failed, but due to something likely local to this
machine or temporary, so try me again”, so that my job ID doesn’t change, etc.?
A job could specify at submit time which exit code indicates
re-run. If you wanted to get more tricky, the job could even return more
helpful hints such as re-run anywhere (not abundantly clear that this is useful
right now), re-run on another host, or perhaps others. To avoid confusion with
existing jobs (i.e. those people have today), the job MUST specify that it wants to be
treated this way, so that a return code of, say, 10 doesn’t cause restarts when it
shouldn’t.
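
To make the opt-in idea concrete, here is a rough sketch of how the decision could look on the server side. This is not an existing torque feature; RerunHint, RerunPolicy, decide and the max_reruns cap are all names made up for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class RerunHint(Enum):
    """Hypothetical hints a job could attach to an exit code."""
    ANYWHERE = "anywhere"
    OTHER_HOST = "other_host"


@dataclass
class RerunPolicy:
    """What a job declared at submit time (hypothetical, not a torque structure)."""
    # exit code -> hint; a job that declares nothing keeps today's behaviour
    rerun_codes: Dict[int, RerunHint] = field(default_factory=dict)
    # assumed cap so a permanently broken job cannot loop forever
    max_reruns: int = 3


def decide(policy: Optional[RerunPolicy], exit_code: int,
           reruns_so_far: int) -> Optional[RerunHint]:
    """Return a rerun hint, or None if the job should simply be marked failed."""
    if policy is None:
        # The job never opted in, so no exit code triggers a restart.
        return None
    if reruns_so_far >= policy.max_reruns:
        return None
    # Only exit codes the job explicitly listed trigger a rerun.
    return policy.rerun_codes.get(exit_code)
```

With a policy of {10: ANYWHERE, 11: OTHER_HOST}, an exit code of 10 would requeue the job under the same ID, while the same exit code from a job with no policy would be treated as an ordinary failure.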
This allows for nice wrapper scripts around the real job that
can check why something failed and, where appropriate, indicate the re-run.
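
A minimal sketch of such a wrapper, assuming the job declared exit code 10 as its "re-run me" code at submit time (that mechanism is the proposal here, not something torque does today). The stderr markers and the names classify and RERUN_EXIT_CODE are only examples.

```python
import subprocess
import sys

# Assumption: the exit code this job declared at submit time as "re-run me".
RERUN_EXIT_CODE = 10

# Example stderr fragments that suggest a machine-local or temporary problem.
TRANSIENT_MARKERS = (
    "No space left on device",
    "Stale file handle",
    "Connection timed out",
)


def classify(returncode: int, stderr_text: str) -> int:
    """Map the real job's result to the exit code the wrapper should return."""
    if returncode == 0:
        return 0
    if any(marker in stderr_text for marker in TRANSIENT_MARKERS):
        return RERUN_EXIT_CODE
    if returncode < 0:
        # subprocess reports death by signal N as -N; use the shell's 128+N form.
        return 128 - returncode
    if returncode == RERUN_EXIT_CODE:
        # The real job happened to exit with the rerun code; do not restart for that.
        return 1
    return returncode


def main(argv):
    # Run the real job, capture its stderr, and pass it through unchanged.
    proc = subprocess.run(argv, stderr=subprocess.PIPE, text=True)
    sys.stderr.write(proc.stderr)
    return classify(proc.returncode, proc.stderr)


# Only act when a command to wrap was actually given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

The job script would then run something like "python wrapper.py ./real_job args" (wrapper.py being whatever the file is saved as), and only the failures the wrapper recognizes as temporary come back as 10.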
The reason this works better than, say, an afternotok dependency
that checks and resubmits is that I need the job to keep the same ID (so items
dependent on the original job would still be dependent on the re-run).
Or is there a simple, stateless external way to do this? (i.e.
I don’t want to have to find all items dependent on a failed job and
update every dependency.)
Some notion of limited restartability for a job would be very
helpful for obtaining clean fault tolerance.
-sr
Sam Rash
srash@xxxxxxxxxxxxx
408-349-7312
vertigosr37