maui pausing on Torque multiple qsubs: msg#00118

Subject: maui pausing on Torque multiple qsubs
Hi there,

I have a problem extremely similar to
http://www.clusterresources.com/pipermail/torqueusers/2006-April/003576.html

When I run multiple qsubs, Maui starts scheduling, then times out getting node info. It then does not start scheduling again for a significant amount of time (15+ minutes).
The most recent time, this happened to me while qdel-ing 3 jobs:

[craigm@bohol terrier]$qdel 545 546 547
No Permission.
qdel: cannot connect to server trmaster (errno=15007)
[craigm@bohol terrier]$qstat
Job id              Name             User            Time Use S Queue
------------------- ---------------- --------------- -------- - -----
547.trmaster        index_wt2g_2     craigm                 0 H verylong
[craigm@bohol terrier]$qdel 547


At this point, Maui pauses.

This happens with torque-2.1.6 and maui 3.2.6p17.

My issue is that Maui seems to receive a timeout from libpbs, but doesn't seem to know what to do with it for a significant amount of time (until something else times out?). Are there any timeouts we can configure in Maui to reduce this?

The alternative route is to trace why the timeout occurs in pbs_server.

Configurations and logs below. Logs are from a previous occurrence of this event.

Many thanks

Craig

My Torque config:
set server node_check_rate = 150
set server tcp_timeout = 6
set server poll_jobs = True
set server scheduler_iteration = 600

And the relevant bit for Maui:
RMPOLLINTERVAL        00:00:30

Maui Log
=======
01/23 22:58:05 MResUpdateStats()
01/23 22:58:05 INFO: current util[2097]: 7/8 (87.50%) PH: 28.43% active jobs: 2 of 2 (completed: 413)
01/23 22:58:05 MQueueCheckStatus()
01/23 22:58:05 MNodeCheckStatus()
01/23 22:58:05 ALERT: node 'trnode03' sync from expected state 'Idle' to state 'Running' at Tue Jan 23 22:58:05
01/23 22:58:05 ALERT: node 'trnode04' sync from expected state 'Idle' to state 'Running' at Tue Jan 23 22:58:05
01/23 22:58:05 ALERT: node 'trnode05' sync from expected state 'Idle' to state 'Running' at Tue Jan 23 22:58:05
01/23 22:58:05 ALERT: node 'trnode06' sync from expected state 'Idle' to state 'Running' at Tue Jan 23 22:58:05
01/23 22:58:05 ALERT: node 'trnode08' sync from expected state 'Idle' to state 'Running' at Tue Jan 23 22:58:05
01/23 22:58:05 MUClearChild(PID)
01/23 22:58:05 INFO:     scheduling complete.  sleeping 30 seconds
01/23 22:58:14 INFO:     connect request from 130.209.249.20
01/23 22:58:14 INFO:     received service request from host 'trmaster'
01/23 22:58:14 MSURecvPacket(9,BufP,4,NULL,100000,SC)
01/23 22:58:14 INFO:     connect request from 130.209.249.20
01/23 22:58:14 INFO:     received service request from host 'trmaster'
01/23 22:58:14 MSURecvPacket(9,BufP,4,NULL,100000,SC)
01/23 22:58:14 ServerProcessRequests()
01/23 22:58:14 INFO:     not rolling logs (8941183 < 10000000)
01/23 22:58:14 MResAdjust(NULL,0,0)
01/23 22:58:14 MStatInitializeActiveSysUsage()
01/23 22:58:14 MStatClearUsage([NONE],Active)
01/23 22:58:14 ServerUpdate()
01/23 22:58:14 MSysUpdateTime()
01/23 22:58:14 INFO:     starting iteration 2098
01/23 22:58:14 MRMGetInfo()
01/23 22:58:14 MClusterClearUsage()
01/23 22:58:14 MRMClusterQuery()
01/23 22:58:14 MPBSClusterQuery(base,RCount,SC)
01/23 22:58:23 ERROR:    cannot get node info: Premature end of message
<PAUSE HERE>
01/23 23:13:44 ALERT: cannot load cluster resources on RM (RM 'base' failed in function 'clusterquery')
01/23 23:13:44 WARNING:  no resources detected
01/23 23:13:44 MRMWorkloadQuery()
01/23 23:13:44 MPBSWorkloadQuery(base,JCount,SC)
01/23 23:13:44 MPBSInitialize(base,SC)
01/23 23:13:45 MSUListen(S)
01/23 23:13:45 INFO:     opened service socket on port 15004
01/23 23:13:45 __MPBSSystemQuery(base,RCount,SC)
01/23 23:13:45 INFO:     connected to PBS server :0 on sd 1
01/23 23:13:45 MPBSJobUpdate(422,422.trmaster,TaskList,0)
01/23 23:13:45 MStatUpdateActiveJobUsage(422)
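To measure how long Maui stalls, the gaps between consecutive log timestamps can be computed. The following is a minimal Python sketch, not part of Maui or Torque. The function name find_pauses, the one-minute threshold, and the year 2007 (taken from the pbs_server log dates, since Maui log lines carry no year) are illustrative assumptions.

```python
from datetime import datetime, timedelta


def find_pauses(lines, threshold=timedelta(minutes=1), year=2007):
    """Return (start, end, gap) for consecutive timestamped log
    entries that are further apart than `threshold`."""
    pauses = []
    prev = None
    for line in lines:
        try:
            # Maui entries start with "MM/DD HH:MM:SS" (14 characters).
            # The year is an assumption; the log itself does not record it.
            ts = datetime.strptime(f"{year}/{line[:14]}", "%Y/%m/%d %H:%M:%S")
        except ValueError:
            # Skip lines without a leading timestamp, e.g. "<PAUSE HERE>".
            continue
        if prev is not None and ts - prev > threshold:
            pauses.append((prev, ts, ts - prev))
        prev = ts
    return pauses


# Lines copied from the Maui log above.
sample = [
    "01/23 22:58:14 MPBSClusterQuery(base,RCount,SC)",
    "01/23 22:58:23 ERROR:    cannot get node info: Premature end of message",
    "<PAUSE HERE>",
    "01/23 23:13:44 ALERT: cannot load cluster resources on RM "
    "(RM 'base' failed in function 'clusterquery')",
]

for start, end, gap in find_pauses(sample):
    print(f"{start:%H:%M:%S} -> {end:%H:%M:%S}  gap {gap}")
# prints: 22:58:23 -> 23:13:44  gap 0:15:21
```

On the excerpt above this reports a single stall of 15 minutes 21 seconds, between the "cannot get node info" error and the next log entry.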

Torque pbs_server log
================
01/23/2007 22:58:14;0040;PBS_Server;Svr;trmaster;Scheduler sent command new
01/23/2007 22:58:14;0100;PBS_Server;Req;;Type AuthenticateUser request received from craigm@trangan, sock=13
01/23/2007 22:58:14;0100;PBS_Server;Req;;Type QueueJob request received from craigm@trangan, sock=11
01/23/2007 22:58:14;0100;PBS_Server;Req;;Type JobScript request received from craigm@trangan, sock=11
01/23/2007 22:58:14;0100;PBS_Server;Req;;Type ReadyToCommit request received from craigm@trangan, sock=11
01/23/2007 22:58:14;0100;PBS_Server;Req;;Type Commit request received from craigm@trangan, sock=11
01/23/2007 22:58:14;0100;PBS_Server;Job;490.trmaster;enqueuing into feed, state 1 hop 1
01/23/2007 22:58:14;0100;PBS_Server;Job;490.trmaster;dequeuing from feed, state QUEUED
01/23/2007 22:58:14;0100;PBS_Server;Job;490.trmaster;enqueuing into verylong, state 1 hop 1
01/23/2007 22:58:14;0008;PBS_Server;Job;490.trmaster;Job Queued at request of craigm@trangan, owner = craigm@trangan, job name = tagDisk454, queue = verylong
01/23/2007 22:58:14;0100;PBS_Server;Req;;Type AuthenticateUser request received from craigm@trangan, sock=12
01/23/2007 22:58:39;0100;PBS_Server;Req;;Type QueueJob request received from craigm@trangan, sock=11
01/23/2007 22:58:39;0040;PBS_Server;Svr;trmaster;Scheduler sent command time
01/23/2007 22:58:39;0100;PBS_Server;Req;;Type StatusNode request received from craigm@trmaster, sock=9

