RE: All jobs going to one node: msg#00110 clustering.torque.user
Actually, it may still not be working as I would like... For some reason, it has placed 2 jobs on node2, 2 jobs on node3, then 4 jobs on node4! When I place more test jobs in, I get 5 jobs on node4 and 2 on all other nodes... Very strange. However, some of the jobs were rerun using qrerun, so maybe they did not get the new default resources? But qstat -f returns these for the jobs:

    Rerunable = True
    Resource_List.neednodes = 1
    Resource_List.nice = 20
    Resource_List.nodect = 1
    Resource_List.nodes = 1

-----Original Message-----
From: torqueusers-bounces@xxxxxxxxxxxxxxxx [mailto:torqueusers-bounces@xxxxxxxxxxxxxxxx] On Behalf Of Atwood, Robert C
Sent: 20 July 2006 12:35
To: Garrick Staples; torqueusers@xxxxxxxxxxxxxxxx
Subject: RE: [torqueusers] All jobs going to one node

It seems like setting some resource limits has changed this behaviour for now. Before upgrading the Suse I guess I did not have these set, since I just copied the queues, perhaps it was not necessary. I still don't understand exactly what I did (i.e. what is the meaning of, and the difference between, resources_default.nodes and resources_default.nodect, and should they be set on the server, the routing queue, or the execution queues?). Some googling and looking through the manuals led me to try this, but I don't believe I quite understand these resources.

I have 1 server, 1 routing queue, 4 execution queues separated by the requested wall-time, and a couple of special-purpose queues (e.g. for counted-license jobs). I set resources_default.nodes and resources_default.nodect to 1 for everything; maybe this is overkill! Once we have some parallel applications this may need rethinking.

I would be willing to try the snapshot if it can be done without in any way impacting running jobs. Can a parallel installation be running without messing up the current one? The jobs are running for a long time. They are other people's jobs, not mine. I successfully migrated all but 2 jobs off the first node. Otherwise I can only try installing on another (single) but I am not sure how to test this problem on such a setup.

-----Original Message-----
From: torqueusers-bounces@xxxxxxxxxxxxxxxx [mailto:torqueusers-bounces@xxxxxxxxxxxxxxxx] On Behalf Of Garrick Staples
Sent: 19 July 2006 19:15
To: torqueusers@xxxxxxxxxxxxxxxx
Subject: Re: [torqueusers] All jobs going to one node

On Wed, Jul 19, 2006 at 01:19:40PM +0100, Atwood, Robert C alleged:
> I cannot figure out why all the jobs are going to one node. The nodes
> are configured as 2 processors per node. Now I have 10 jobs running on
> one node, the other nodes are free. Pbsnodes indicates the node is still
> free!
>
> Thanks
> Robert
>
> node01
>      state = free
>      np = 2
>      ntype = cluster
>      jobs = 0/304.mt-hive2.mt.ic.ac.uk, 0/303.mt-hive2.mt.ic.ac.uk,
>        0/302.mt-hive2.mt.ic.ac.uk, 0/299.mt-hive2.mt.ic.ac.uk,
>        0/298.mt-hive2.mt.ic.ac.uk, 0/294.mt-hive2.mt.ic.ac.uk,
>        0/293.mt-hive2.mt.ic.ac.uk, 0/292.mt-hive2.mt.ic.ac.uk,
>        0/291.mt-hive2.mt.ic.ac.uk, 0/287.mt-hive2.mt.ic.ac.uk
>      status = opsys=linux,
>        uname=Linux node01 2.6.15.1-clustervision-81_cvos #1 SMP Thu May 18 10:47:02 CEST 2006 x86_64,
>        sessions=21720 21952 21992 22032 22094 22537 22577 22687 22733 22773,
>        nsessions=10,nusers=2,idletime=749989,totmem=12275280kb,
>        availmem=8045184kb,physmem=8178748kb,ncpus=4,loadave=10.04,
>        netload=7095136962,state=free,
>        jobs=287.mt-hive2.mt.ic.ac.uk 291.mt-hive2.mt.ic.ac.uk
>        292.mt-hive2.mt.ic.ac.uk 293.mt-hive2.mt.ic.ac.uk
>        294.mt-hive2.mt.ic.ac.uk 298.mt-hive2.mt.ic.ac.uk
>        299.mt-hive2.mt.ic.ac.uk 302.mt-hive2.mt.ic.ac.uk
>        303.mt-hive2.mt.ic.ac.uk 304.mt-hive2.mt.ic.ac.uk,
>        rectime=1153311345
>
> This is Torque 2.1.1 on Suse 10.0, default scheduler.

This could be a bug with cpu counting that I've been working on. Can you give today's snapshot a try?
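For anyone wanting to spot this condition automatically, here is a rough sketch (not part of the original thread) that scans pbsnodes-style text and reports nodes holding more jobs than their np value. It assumes the usual layout of a node name at column 0 followed by indented "key = value" lines. The function names parse_pbsnodes and oversubscribed are made up for illustration, the job list is shortened from the one quoted above, and node02 is a hypothetical idle node added for contrast.

```python
def parse_pbsnodes(text):
    """Parse pbsnodes-style text into {node_name: {attribute: value}}.

    Assumes node names start at column 0 and attributes are indented
    "key = value" lines, with blank lines separating nodes.
    """
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None          # blank line ends the current node block
            continue
        if not line[0].isspace():
            current = line.strip()  # unindented line: a new node name
            nodes[current] = {}
        elif current is not None and " = " in line:
            key, value = line.strip().split(" = ", 1)
            nodes[current][key] = value
    return nodes


def oversubscribed(nodes):
    """Return {node: (job_count, np)} for nodes with more jobs than np."""
    result = {}
    for name, attrs in nodes.items():
        slots = int(attrs.get("np", 0))
        jobs = [j for j in attrs.get("jobs", "").split(",") if j.strip()]
        if len(jobs) > slots:
            result[name] = (len(jobs), slots)
    return result


# Shortened sample: node01 is taken from the report above (job list
# truncated to three entries); node02 is a hypothetical idle node.
SAMPLE = """\
node01
     state = free
     np = 2
     ntype = cluster
     jobs = 0/304.mt-hive2.mt.ic.ac.uk, 0/303.mt-hive2.mt.ic.ac.uk, 0/302.mt-hive2.mt.ic.ac.uk

node02
     state = free
     np = 2
     ntype = cluster
"""

if __name__ == "__main__":
    for node, (count, slots) in oversubscribed(parse_pbsnodes(SAMPLE)).items():
        print(f"{node}: {count} jobs on {slots} slots")
```

In practice the text would come from the output of pbsnodes -a rather than a hard-coded sample. Note that in the report above every job is listed with slot index 0 (0/304, 0/303, ...), which this check does not look at but which may be relevant to the cpu counting issue mentioned.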
_______________________________________________
torqueusers mailing list
torqueusers@xxxxxxxxxxxxxxxx
http://www.supercluster.org/mailman/listinfo/torqueusers