
nodes hung (or not?): msg#00100

Subject: nodes hung (or not?)

So we have a small cluster of about 24 nodes. We keep it packed about 8 hours a day. I happened to check qstat -Q and found that only about half as many jobs were in the running state as there should have been.

Example: about 3000 jobs were in the queued state and eligible to run. This smelled of some nodes being down or something, so I ran pbsnodes -a on each node. They looked fine (even the ‘empty ones’), entirely good to schedule on. Using maui checknode and checkjob, nothing was out of the ordinary (normal msgs for jobs indicating eligible and ready to schedule).
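
For anyone wanting to automate the same check, here is a rough sketch (not something I ran at the time) that parses pbsnodes -a output and lists nodes that report state "free" but have no jobs assigned. It assumes the usual layout of an unindented node name followed by indented "key = value" lines, with a "jobs" line present only on nodes that are running something. The function names and the sample node names are made up for illustration.

```python
import subprocess


def fetch_pbsnodes_output():
    """Run `pbsnodes -a` and return its text output.

    Assumes pbsnodes is on PATH and the pbs_server is reachable from this host.
    """
    result = subprocess.run(
        ["pbsnodes", "-a"], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_pbsnodes(text):
    """Parse `pbsnodes -a` style output into {node_name: {attribute: value}}."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None            # a blank line ends the current node record
        elif not line[0].isspace():
            current = line.strip()    # an unindented line starts a new node record
            nodes[current] = {}
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")   # split on the first "=" only
            nodes[current][key.strip()] = value.strip()
    return nodes


def idle_schedulable_nodes(nodes):
    """Return sorted names of nodes that are 'free' and list no jobs."""
    return sorted(
        name
        for name, attrs in nodes.items()
        if attrs.get("state") == "free" and not attrs.get("jobs")
    )


# Hypothetical sample in the assumed output layout (node and job names invented).
SAMPLE = """\
node01
     state = free
     np = 2
     jobs = 0/101.head

node02
     state = free
     np = 2

node03
     state = down
     np = 2
"""

print(idle_schedulable_nodes(parse_pbsnodes(SAMPLE)))
```

On a live cluster you would pass fetch_pbsnodes_output() to parse_pbsnodes instead of SAMPLE. A long list of idle "free" nodes while thousands of jobs sit queued would match the symptom described above, though it does not say why the scheduler is skipping them.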

 

I didn’t have time to investigate (already 5+ hrs behind on production), so I just restarted the nodes on the suspect boxes (about 15), and things went along well.

 

Has anyone seen this? This is the first time it has happened in over 10 months of using torque (and 3 with maui). If it happens again, hopefully I can check more logs and get a better idea.

Any help is greatly appreciated.

 

Sam Rash

srash@xxxxxxxxxxxxx

408-349-7312

vertigosr37

 

_______________________________________________
torqueusers mailing list
torqueusers@xxxxxxxxxxxxxxxx
http://www.supercluster.org/mailman/listinfo/torqueusers