Re: Torque not deleting job: msg#00111 clustering.torque.user
Chris,

I just wanted to give some additional information on tests I ran:

1. I did try to restart the pbs_mom with a -r to see if it would remedy the problem and it did not.
2. The "qsig -s 0 1160" only returned a '0' return code, but the server still thought the process was there.
3. "qdel 1160" works to clear the job from the server

Thanks

Adam Emerich
IBM Corporation - Rochester, MN
Staff Engineer
Office: 030-3 F305
Office: (507) 253-5483
Cell: (507) 358-2999
aemerich@xxxxxxxxxx

"Insanity: doing the same thing over and over again and expecting different results." -Albert Einstein

From: Chris Samuel <csamuel@xxxxxxxx>
Sent by: torqueusers-bounces@xxxxxxxxxxxxxx
To: torqueusers@xxxxxxxxxxxxxxxx
cc:
Date: 04/21/2007 01:03 AM
Subject: Re: [torqueusers] Torque not deleting job

On Sat, 21 Apr 2007, Adam Emerich wrote:

Thanks for the replies to myself and Garrick, the plot thickens!

> 1. root 2015 1 0 08:54 ? 00:00:02 /usr/local/sbin/pbs_mom
> -> by default pbs_mom is not started with "-r" on our system

The pbs_mom manual page says about starting a pbs_mom with the -r option after reboot:

    If the -r option is used following a reboot, process IDs (pids) may be
    reused and MOM may kill a process that is not a batch session.

That could be a Bad Thing(tm). :-)

> 2. There is no entry in the server log for a failed epilogue or even a
> message that says the job is being terminated (note jobid is now 1160 as I
> had to recreate the issue to get more details). The first failure in the
> log is due to another process being run that was eventually preempted by
> job 1160:

Interesting - anything in the pbs_mom logs on the node about that job ?

> 3. "qsig -s 0 1160" did not terminate the job from the server's point of
> view.

OK - now that's just plain bizarre - that is supposed to identify whether or not the child process exists for it and unless you've got a process ID getting recycled (not beyond the realms of possibility) then it should declare that process dead and clear up.

It certainly does on our RH 7.3, FC5 and SLES 9 clusters!

Long shot - do you have SE Linux enabled ? If so, can you disable it and see if it still happens ?

cheers!
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia

_______________________________________________
torqueusers mailing list
torqueusers@xxxxxxxxxxxxxxxx
http://www.supercluster.org/mailman/listinfo/torqueusers
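
For readers following point 3 of the thread: sending signal 0 is the standard POSIX way to ask whether a process ID exists without delivering anything to it. The sketch below illustrates that generic mechanism in Python on a POSIX system. It is not TORQUE's own code, and `process_exists` is a hypothetical helper name chosen for this illustration.

```python
import os


def process_exists(pid: int) -> bool:
    """Return True if some process currently has this PID (POSIX only).

    Hypothetical helper for illustration; not part of TORQUE.
    """
    try:
        # Signal 0 performs the existence and permission checks only;
        # no signal is actually delivered to the target process.
        os.kill(pid, 0)
    except ProcessLookupError:
        # ESRCH: no process with this PID.
        return False
    except PermissionError:
        # EPERM: the PID exists but belongs to another user.
        return True
    return True


if __name__ == "__main__":
    # The current process always exists, so this prints True.
    print(process_exists(os.getpid()))
```

Note that a probe like this answers only "is some process using this PID?". If the PID has been recycled by an unrelated process, which is the possibility raised in the thread, the check still succeeds even though the original job process is gone.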