
RE: All jobs going to one node: msg#00110

clustering.torque.user

Subject: RE: All jobs going to one node

Actually, it may still not be working as I would like ...

For some reason, it has placed 2 jobs on node2, 2 jobs on node3, then
4 jobs on node4! When I place more test jobs in, I get 5 jobs on node4
and 2 on all other nodes... Very strange. However, some of the jobs were
rerun using qrerun, so maybe they did not get the new default resources?
But qstat -f returns these for the jobs:


Rerunable = True
Resource_List.neednodes = 1
Resource_List.nice = 20
Resource_List.nodect = 1
Resource_List.nodes = 1

-----Original Message-----
From: torqueusers-bounces@xxxxxxxxxxxxxxxx
[mailto:torqueusers-bounces@xxxxxxxxxxxxxxxx] On Behalf Of Atwood,
Robert C
Sent: 20 July 2006 12:35
To: Garrick Staples; torqueusers@xxxxxxxxxxxxxxxx
Subject: RE: [torqueusers] All jobs going to one node

It seems like setting some resource limits has changed this behaviour
for now. Before upgrading Suse I guess I did not have these set; since
I just copied the queues, perhaps it was not necessary. I still don't
understand exactly what I did (i.e. what is the meaning of, and
difference between, resources_default.nodes versus
resources_default.nodect, and should it be on the server, the routing
queue, or the execution queues?). Some googling and looking through the
manuals led me to try this, but I don't believe I quite understand these
resources.


I have 1 server, 1 routing queue, 4 execution queues separated by the
requested wall-time, and a couple of special purpose queues (e.g. for
counted-license jobs). I set resources_default.nodes and
resources_default.nodect to 1 for everything; maybe this is overkill!
Once we have some parallel applications this may need rethinking.


I would be willing to try the snapshot if it can be done without in any
way impacting running jobs? Can a parallel installation be running
without messing up the current one?

The jobs are running for a long time. They are other people's jobs, not
mine. I successfully migrated all but 2 jobs off the first node.
Otherwise I can only try installing on another (single) machine, but I
am not sure how to test this problem on such a setup.




-----Original Message-----
From: torqueusers-bounces@xxxxxxxxxxxxxxxx
[mailto:torqueusers-bounces@xxxxxxxxxxxxxxxx] On Behalf Of Garrick
Staples
Sent: 19 July 2006 19:15
To: torqueusers@xxxxxxxxxxxxxxxx
Subject: Re: [torqueusers] All jobs going to one node

On Wed, Jul 19, 2006 at 01:19:40PM +0100, Atwood, Robert C alleged:
> I cannot figure out why all the jobs are going to one node. The nodes
> are configured as 2 processors per node. Now I have 10 jobs running on
> one node, the other nodes are free. Pbsnodes indicates the node is
> still free!
>
> Thanks
> Robert
>
>
>
> node01
> state = free
> np = 2
> ntype = cluster
> jobs = 0/304.mt-hive2.mt.ic.ac.uk, 0/303.mt-hive2.mt.ic.ac.uk,
> 0/302.mt-hive2.mt.ic.ac.uk, 0/299.mt-hive2.mt.ic.ac.uk,
> 0/298.mt-hive2.mt.ic.ac.uk, 0/294.mt-hive2.mt.ic.ac.uk,
> 0/293.mt-hive2.mt.ic.ac.uk, 0/292.mt-hive2.mt.ic.ac.uk,
> 0/291.mt-hive2.mt.ic.ac.uk, 0/287.mt-hive2.mt.ic.ac.uk
> status = opsys=linux,uname=Linux node01
> 2.6.15.1-clustervision-81_cvos #1 SMP Thu May 18 10:47:02 CEST 2006
> x86_64,sessions=21720 21952 21992 22032 22094 22537 22577 22687 22733
> 22773,nsessions=10,nusers=2,idletime=749989,totmem=12275280kb,availmem=8045184kb,physmem=8178748kb,ncpus=4,loadave=10.04,netload=7095136962,state=free,jobs=287.mt-hive2.mt.ic.ac.uk
> 291.mt-hive2.mt.ic.ac.uk 292.mt-hive2.mt.ic.ac.uk
> 293.mt-hive2.mt.ic.ac.uk 294.mt-hive2.mt.ic.ac.uk
> 298.mt-hive2.mt.ic.ac.uk 299.mt-hive2.mt.ic.ac.uk
> 302.mt-hive2.mt.ic.ac.uk 303.mt-hive2.mt.ic.ac.uk
> 304.mt-hive2.mt.ic.ac.uk,rectime=1153311345
>
> This is Torque 2.1.1 on Suse 10.0 , default scheduler.

This could be a bug with cpu counting that I've been working on. Can
you give today's snapshot a try?
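
For anyone wanting to check for the same symptom (more jobs on a node
than its np, while the node still reports state = free), here is a
minimal Python sketch. It assumes pbsnodes-style text in which each node
record starts with a bare node name followed by "key = value" lines, as
in the output quoted above. The function names and the sample text are
made up for illustration and are not part of Torque.

```python
def parse_pbsnodes(text):
    """Parse pbsnodes-style text into {node_name: {attribute: value}}.

    Assumption: a line containing " = " is an attribute of the most
    recently seen node; any other non-blank line is a node name.
    """
    nodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if " = " in line:
            if current is not None:
                key, value = line.split(" = ", 1)
                current[key] = value
        else:
            current = nodes.setdefault(line, {})
    return nodes


def oversubscribed(nodes):
    """Return (name, job_count, np) for nodes holding more jobs than np."""
    result = []
    for name, attrs in nodes.items():
        np_slots = int(attrs.get("np", 0))
        # Each comma-separated entry in "jobs" is one slot/job pair.
        jobs = [j for j in attrs.get("jobs", "").split(",") if j.strip()]
        if len(jobs) > np_slots:
            result.append((name, len(jobs), np_slots))
    return result


# Hypothetical sample, shortened from the output quoted in this thread.
sample = """\
node01
     state = free
     np = 2
     ntype = cluster
     jobs = 0/304.mt-hive2.mt.ic.ac.uk, 0/303.mt-hive2.mt.ic.ac.uk, 0/302.mt-hive2.mt.ic.ac.uk

node02
     state = free
     np = 2
     ntype = cluster
"""

if __name__ == "__main__":
    for name, count, np_slots in oversubscribed(parse_pbsnodes(sample)):
        print(f"{name}: {count} jobs on np={np_slots}")
```

To run it against a live cluster you would feed it the captured output
of pbsnodes instead of the sample string. It only reports the symptom;
it does not say anything about the cause.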

_______________________________________________
torqueusers mailing list
torqueusers@xxxxxxxxxxxxxxxx
http://www.supercluster.org/mailman/listinfo/torqueusers

