It seems like setting some resource limits has changed this behaviour
for now. Before upgrading Suse I guess I did not have these set;
since I just copied the queues, perhaps it was not necessary. I still
don't understand exactly what I did (i.e. what is the meaning of, and
difference between, resources_default.nodes and
resources_default.nodect, and should they be set on the server, the
routing queue, or the execution queues?). Some googling and looking
through the manuals led me to try this, but I don't believe I quite
understand these resources.
I have 1 server, 1 routing queue, 4 execution queues separated by the
requested wall-time, and a couple of special-purpose queues (e.g. for
counted-license jobs). I set resources_default.nodes and
resources_default.nodect to 1 for everything; maybe this is overkill!
Once we have some parallel applications this may need rethinking.
I would be willing to try the snapshot if it can be done without
impacting running jobs in any way. Can a parallel installation run
alongside the current one without messing it up?
The jobs are running for a long time. They are other people's jobs, not
mine. I successfully migrated all but 2 jobs off the first node.
Otherwise I can only try installing on another (single) machine, but I
am not sure how to test this problem on such a setup.
-----Original Message-----
From: torqueusers-bounces@xxxxxxxxxxxxxxxx
[mailto:torqueusers-bounces@xxxxxxxxxxxxxxxx] On Behalf Of Garrick Staples
Sent: 19 July 2006 19:15
To: torqueusers@xxxxxxxxxxxxxxxx
Subject: Re: [torqueusers] All jobs going to one node
On Wed, Jul 19, 2006 at 01:19:40PM +0100, Atwood, Robert C alleged:
> I cannot figure out why all the jobs are going to one node. The nodes
> are configured as 2 processors per node. Now I have 10 jobs running on
> one node, the other nodes are free. Pbsnodes indicates the node is still
> free!
>
> Thanks
> Robert
>
>
>
> node01
> state = free
> np = 2
> ntype = cluster
> jobs = 0/304.mt-hive2.mt.ic.ac.uk, 0/303.mt-hive2.mt.ic.ac.uk,
> 0/302.mt-hive2.mt.ic.ac.uk, 0/299.mt-hive2.mt.ic.ac.uk,
> 0/298.mt-hive2.mt.ic.ac.uk, 0/294.mt-hive2.mt.ic.ac.uk,
> 0/293.mt-hive2.mt.ic.ac.uk, 0/292.mt-hive2.mt.ic.ac.uk,
> 0/291.mt-hive2.mt.ic.ac.uk, 0/287.mt-hive2.mt.ic.ac.uk
> status = opsys=linux,uname=Linux node01
> 2.6.15.1-clustervision-81_cvos #1 SMP Thu May 18 10:47:02 CEST 2006
> x86_64,sessions=21720 21952 21992 22032 22094 22537 22577 22687 22733
> 22773,nsessions=10,nusers=2,idletime=749989,totmem=12275280kb,
> availmem=8045184kb,physmem=8178748kb,ncpus=4,loadave=10.04,
> netload=7095136962,state=free,jobs=287.mt-hive2.mt.ic.ac.uk
> 291.mt-hive2.mt.ic.ac.uk
> 292.mt-hive2.mt.ic.ac.uk 293.mt-hive2.mt.ic.ac.uk
> 294.mt-hive2.mt.ic.ac.uk 298.mt-hive2.mt.ic.ac.uk
> 299.mt-hive2.mt.ic.ac.uk 302.mt-hive2.mt.ic.ac.uk
> 303.mt-hive2.mt.ic.ac.uk 304.mt-hive2.mt.ic.ac.uk,rectime=1153311345
>
> This is Torque 2.1.1 on Suse 10.0, default scheduler.
This could be a bug with cpu counting that I've been working on. Can
you give today's snapshot a try?
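For anyone wanting to check for the same symptom (a node still reported as
free while holding more jobs than its np), the two fields in the pbsnodes
record can be compared directly. What follows is only a rough sketch, not
part of Torque: the function names are made up for illustration, it assumes
each attribute sits on a single "key = value" line as pbsnodes prints it
before mail wrapping, and the sample is a shortened copy of the record
quoted above.

```python
def parse_pbsnodes_record(text):
    """Parse one node record: a node name followed by 'key = value' lines.

    Assumption: attributes are not wrapped across lines.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    node = {"name": lines[0]}
    for ln in lines[1:]:
        # Split on the first ' = ' only, so values such as
        # 'opsys=linux,...' in the status line are kept intact.
        key, sep, value = ln.partition(" = ")
        if sep:
            node[key] = value
    return node


def count_jobs(node):
    """Count the comma-separated entries in the 'jobs' attribute."""
    jobs = node.get("jobs", "")
    return len([j for j in jobs.split(",") if j.strip()])


def is_oversubscribed_but_free(node):
    """True if the node says 'free' yet lists more jobs than np."""
    np_slots = int(node.get("np", "0"))
    return node.get("state") == "free" and count_jobs(node) > np_slots


# Shortened copy of the node01 record from the quoted message.
SAMPLE = """
node01
     state = free
     np = 2
     ntype = cluster
     jobs = 0/304.mt-hive2.mt.ic.ac.uk, 0/303.mt-hive2.mt.ic.ac.uk, 0/302.mt-hive2.mt.ic.ac.uk
"""

node = parse_pbsnodes_record(SAMPLE)
print(node["name"], count_jobs(node), is_oversubscribed_but_free(node))
```

In practice the text would come from saved pbsnodes output split into one
record per node; a node flagged by this check shows the same mismatch as
node01 above.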