On Fri, Sep 29, 2006 at 12:29:29AM -0400, Tim Miller alleged:
> Hi All,
>
> I have recently upgraded to Torque 2.1.2 with the C scheduler and am
> experiencing a very weird problem. Jobs will sit in the execution queue
> and not run, even though pbsnodes shows sufficient free nodes matching
> the job spec to run the job. I've done a fair bit of digging to try to
> find the root cause, and the problem seems to be in the code in
> node_manager.c for I get a lot of messages like "cannot allocate node
> n2.lobos.nih.gov to job ...".
>
> Adding in some custom debug code, it seems like the second for loop in
> node_spec picks the first node in the server's list of nodes, sees it's
> not valid to run on (the node is busy), goes searching for a new node
> via the search function, fails to find one for some reason, and then
> dies out with an error message like the one above.
>
> I should note that this problem occurs irregularly. That is, things will
> work fine for a few hours, and then this problem will crop up and then
> go away on its own after a little while.
>
> Since I don't recall seeing anything else like this on the list, I
> wonder if maybe my configuration is a problem -- I'm pasting my server
> config below in case someone sees something dumb that I have done or
> failed to do.
>
> Server m1.lobos.nih.gov
> server_state = Active
> scheduling = True
> total_jobs = 51
> state_count = Transit:0 Queued:13 Held:0 Waiting:0 Running:38 Exiting:0
> managers = <<user list deleted>>
> default_queue = entry
> log_events = 511
> mail_from = adm
> resources_assigned.nodect = 48
> scheduler_iteration = 600
> node_check_rate = 120
> tcp_timeout = 6
> pbs_version = 2.1
>
> I hope someone has some ideas since I'm tearing my hair out and going
> through the code in node_manager.c is somewhat tough sledding for
> someone not familiar with how this is supposed to work.
No doubt, this stuff is really complicated and very difficult to debug.
I'd run pbs_server in debug mode (set the PBSDEBUG environment variable
before starting it), and let it run for a while in your terminal. You'll
see a list of counts for
each subnode when jobs are scheduled. When you notice the problem
happening, look at those numbers and see if they make sense.
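
To make it easier to spot when the problem starts, something like the
following could tally the failure messages per node from the saved debug
output. This is only a rough sketch: the message text is taken from the
report above, but the surrounding log layout and the helper name
count_allocation_failures are assumptions of mine, not anything Torque
provides.

```python
import re
from collections import Counter

# Message text as quoted in the report; the rest of each log line is
# assumed to be arbitrary (timestamps, prefixes, job ids, ...).
ALLOC_FAILURE = re.compile(r"cannot allocate node (\S+) to job")


def count_allocation_failures(lines):
    """Return a Counter mapping node name -> number of failure messages."""
    counts = Counter()
    for line in lines:
        match = ALLOC_FAILURE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open file of the captured output to the function gives the
per-node totals. If one node dominates, that is the first place to compare
the subnode counts against what pbsnodes reports.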