So we have a small cluster of about 24 nodes. We keep it
packed about 8 hours a day. I happened to check qstat -Q and found that
only about half as many jobs were in the run state as there should have been.
Example: about 3000 jobs were in the queued state and
eligible to run. This smelled of some nodes being down or something, so I ran
pbsnodes -a and looked at each node. They all looked fine (even the 'empty
ones') and entirely good to schedule on. Maui's checknode and checkjob showed
nothing out of the ordinary either (normal msgs for jobs, indicating they were
eligible and ready to schedule).
I didn't have time to investigate (we were already 5+ hrs
behind on production), so I just restarted the nodes on the suspect boxes
(about 15), and things went along well after that.
Has anyone seen this? This is the first time in over 10
months of use with Torque (and 3 with Maui). If it happens again, hopefully I
can check more logs and get a better idea.
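
For next time, here is a rough sketch of how I might script the check instead
of eyeballing pbsnodes -a. It assumes the usual pbsnodes -a layout (node name
at the start of a line, indented "key = value" lines under it) and treats a
node as suspect if it reports state "free" but has no "jobs" line. The
function names are just made up for illustration, and I haven't run this
against the real cluster.

```python
import subprocess


def parse_pbsnodes(text):
    """Parse `pbsnodes -a` style text into {node_name: {attribute: value}}."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # a blank line separates node records
        if not line[0].isspace():
            # An unindented line is assumed to start a new node record.
            current = line.strip()
            nodes[current] = {}
        elif current is not None and "=" in line:
            # Split on the first "=" only; values may contain "=" themselves.
            key, _, value = line.partition("=")
            nodes[current][key.strip()] = value.strip()
    return nodes


def idle_free_nodes(nodes):
    """Return sorted names of nodes that report state 'free' with no jobs."""
    return sorted(
        name
        for name, attrs in nodes.items()
        if attrs.get("state") == "free" and not attrs.get("jobs")
    )


def report():
    """Run pbsnodes -a on the server and print the idle-but-free nodes."""
    out = subprocess.run(
        ["pbsnodes", "-a"], capture_output=True, text=True, check=True
    ).stdout
    for name in idle_free_nodes(parse_pbsnodes(out)):
        print(name)
```

Calling report() while thousands of jobs sit queued and eligible would give
the list of boxes worth a checknode and a look at the logs before restarting
anything.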
Any help is greatly appreciated.
Sam Rash
srash@xxxxxxxxxxxxx
408-349-7312
vertigosr37