Re: killing over limit jobs is unfriendly to mpiexec: msg#00107

Subject: Re: killing over limit jobs is unfriendly to mpiexec
On Thu, Nov 23, 2006 at 03:48:30PM -0500, Pete Wyckoff alleged:
> I'm trying to figure out why mpiexec isn't catching the exit
> statuses of the tasks when a job goes over a limit, like walltime.
> 
> Mom sends SIGTERM to all the processes in the job.  Mpiexec catches
> the signal, sends tm_kill() to all tasks and waits for them to exit.
> 
> The top-level shell, meanwhile, does not catch the signal, and
> exits.  This triggers code in scan_for_terminated to mark the task
> TI_STATE_EXITED and to send another SIGTERM to all the remaining
> tasks.
> 
> Mpiexec catches this second SIGTERM and just exits, abandoning any
> tasks.  The thought was that when users hit ctrl-c, it tries to
> clean tasks up nicely, but if the batch system has hosed itself, a
> second tap of ctrl-c will force mpiexec to exit.  If I were to
> ignore future SIGTERMs, users would have to hit ctrl-z, then "kill
> -9" the process to get it to go away.

mpiexec could keep trapping (and ignoring) repeated SIGTERMs while still
exiting on SIGINT.  ctrl-c sends SIGINT, not SIGTERM, so users would still
have their second-tap escape hatch.
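
As a rough illustration of that split, here is a shell sketch (not
mpiexec's actual C code; the handler names on_term and on_int are made up
for the example).  TERM is counted and otherwise ignored, while INT exits
at once:

```shell
#!/bin/sh
# Sketch only: treat SIGTERM and SIGINT differently in one process.
# on_term/on_int are hypothetical names, not anything from mpiexec.
term_count=0

on_term() {
    # The batch system sent SIGTERM: note it and keep waiting for tasks.
    term_count=$((term_count + 1))
    echo "caught TERM ($term_count), still waiting for tasks" >&2
}

on_int() {
    # The user pressed ctrl-c: give up immediately.
    echo "caught INT, exiting" >&2
    exit 130
}

trap on_term TERM
trap on_int INT

# Simulate mom's two SIGTERMs; the process survives both.
kill -TERM $$
kill -TERM $$
echo "TERM signals survived: $term_count" >&2
```

Sending the same process an INT afterwards (kill -INT, or ctrl-c at a
terminal) would run on_int and end it.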

 
> However, I can hack/fix mpiexec to keep waiting across the second
> SIGTERM, but it still does not get the proper TM obit messages,
> because mom's scan_for_exiting() sets ptask->ti_fd to -1.  This
> causes task_check() to complain "cannot tm_reply to task 1" rather
> than send the TM message.  Commenting out that set of ti_fd does
> not change the behavior, because kill_task() sits in a tight loop
> for 4 sec waiting for the task to die rather than delivering the
> queued up obits.  Eventually everything dies with SIGKILL.

I'm not happy with that tight loop either, but mostly because it becomes
painfully slow when you have very large numbers of tasks and each one
exits slowly.

And of course, the eventual SIGKILL can be delayed with the kill_delay
attribute on the queue or server.
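
For example, something along these lines (a sketch; the values are
arbitrary and "batch" is an assumed queue name, so substitute your own):

```shell
# Seconds between SIGTERM and the final SIGKILL; values are examples only.
qmgr -c "set server kill_delay = 30"
# Per-queue setting; "batch" is a placeholder queue name.
qmgr -c "set queue batch kill_delay = 60"
```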

 
> Everything does work nicely, though, if I ignore the SIGTERM in the
> top-level shell:
> 
>     trap "echo Job shell caught TERM, ignoring >&2" TERM
>     mpiexec a.out
> 
> Works brilliantly, unmodified.  But I'd hate to force users to do
> this to get the right behavior.
> 
> Any ideas how to fix this in torque?  That loop in kill_task() is
> new compared to good-old PBS.  I'm fishing for thoughts at this
> point.  The behavior can always be papered over by not reporting
> exit values when they are missing, but a clean solution would be
> better.

This is tough because the current behaviour was introduced to solve a
different problem: killing uncooperative jobs more quickly.

