
RE: RE: RE: more maui Q...: msg#00090

Subject: RE: RE: RE: more maui Q...
It was a leftover from using the simple pbs_sched on BSD, where pbs_mom
doesn't seem to be able to report memory accurately.  I have one queue where
I want at most 2 of those jobs to execute at a time on a node (they use at
most 1gb each), but I allow 3 of any other.  Setting np=11 and using ppn=4
for that queue and ppn=3 for the others gets you this: two 4-slot jobs fit
in 11 slots, and so do three 3-slot jobs.

What I really need is just to have 2gb of consumable memory resource on each
node, and to set the one queue to require 1gb and the others 250mb.

I assume this can be done in maui, but I've yet to do it.
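
For anyone following the arithmetic, both schemes come down to integer
division of a node's capacity by what each job asks for. A minimal sketch in
Python, assuming ppn=4 for the 1gb queue and ppn=3 for the rest; the function
name is made up for illustration, and 2gb is taken as 2048mb:

```python
def jobs_per_node(capacity: int, per_job: int) -> int:
    """Return how many jobs fit on a node when each one claims per_job units."""
    if per_job <= 0:
        raise ValueError("per_job must be positive")
    return capacity // per_job


# Virtual-processor trick: np=11 slots per node (values from this thread).
NP = 11
print("1gb queue, ppn=4:", jobs_per_node(NP, 4))       # slots // 4
print("other queues, ppn=3:", jobs_per_node(NP, 3))    # slots // 3

# Consumable-memory idea: assumed 2048mb per node, figures in mb.
NODE_MB = 2048
print("1gb jobs by memory:", jobs_per_node(NODE_MB, 1024))
print("250mb jobs by memory:", jobs_per_node(NODE_MB, 250))
```

Note that memory alone would let 8 of the 250mb jobs share a node, so the
processor count would still be what holds the other queues to 3.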


Sam Rash
srash@xxxxxxxxxxxxx
408-349-7312
vertigosr37

-----Original Message-----
From: Donald E Tripp [mailto:dtripp@xxxxxxxxxx] 
Sent: Monday, July 17, 2006 2:29 PM
To: Sam Rash
Cc: torqueusers@xxxxxxxxxxxxxxxx
Subject: Re: RE: RE: [torqueusers] more maui Q...

Just curious, but why are you setting each node to have more "virtual"
processors than actual processors?

- Don

----- Original Message -----
From: Sam Rash <srash@xxxxxxxxxxxxx>
Date: Monday, July 17, 2006 11:22 am
Subject: RE: RE: [torqueusers] more maui Q...
To: torqueusers@xxxxxxxxxxxxxxxx

> So what I notice is the real load (using the command below, or just being
> logged on and watching top) is around 1, but I have a job running where I
> set ppn=4.  It seems like maui tries to estimate the load based on the ppn
> (which is reasonable).
> 
> So I fell back to setting my jobs/nodes to the REAL resources (np=3, ppn=1)
> and I get better util, but I still see what I guess is the torque bug where
> used slots is 2/3 yet I see job-exclusive as the node state via pbs.
> 
> Is it easy to configure the memory as a consumable resource on a box and set
> this as a default via the queue in pbs or directly with class attributes in
> maui?
> 
> 
> 
> Sam Rash
> srash@xxxxxxxxxxxxx
> 408-349-7312
> vertigosr37
> 
> -----Original Message-----
> From: Donald E Tripp [mailto:dtripp@xxxxxxxxxx] 
> Sent: Monday, July 17, 2006 1:51 PM
> To: Sam Rash
> Cc: 'Justin Bronder'; torqueusers@xxxxxxxxxxxxxxxx
> Subject: Re: RE: [torqueusers] more maui Q...
> 
> try:
> ssh <node> w
> 
> that should return something like:
> 
> ssh n001 w
> 10:44:40 up 3 days, 25 min,  0 users,  load average: 0.00, 0.00, 0.00
> USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
> 
> You are looking at the part "load average: 0.00, 0.00, 0.00", which for me
> means n001 has no load. Then look at /usr/local/maui/maui.cfg and find the
> line:
> 
> NODEMAXLOAD                 0.5
> 
> That means the load of a node must be below 0.5 for maui to see it as
> "free" and schedule jobs to it. If your node shows higher than this, it's
> not necessarily bad. It may be that you have some other processes running
> and taking up resources. If you just want to see whether this is the issue,
> change NODEMAXLOAD to something higher than the node's actual load, and it
> should schedule it fine.
> 
> - Don
> 
> ----- Original Message -----
> From: Sam Rash <srash@xxxxxxxxxxxxx>
> Date: Monday, July 17, 2006 9:31 am
> Subject: RE: [torqueusers] more maui Q...
> To: 'Justin Bronder' <jsbronder@xxxxxxxxx>
> Cc: torqueusers@xxxxxxxxxxxxxxxx
> 
> > Hmm,  when I ran checknode, 
> > 
> > 
> > 
> > checking node m24
> > 
> > 
> > 
> > State:   Running  (in current state for 00:00:00)
> > 
> > Expected State:     Idle   SyncDeadline: Mon Jul 17 11:57:18
> > 
> > Configured Resources: PROCS: 11  MEM: 2010M  SWAP: 2010M  DISK: 1M
> > 
> > Utilized   Resources: [NONE]
> > 
> > Dedicated  Resources: PROCS: 4
> > 
> > Opsys:       freebsd  Arch:      [NONE]
> > 
> > Speed:      1.00  Load:       1.000
> > 
> > Network:    [DEFAULT]
> > 
> > Features:   [NONE]
> > 
> > Attributes: [Batch]
> > 
> > Classes:    [education 11:11][test 11:11][news 7:11][tv 11:11]
> > [top_priority 11:11][low_priority 11:11][sports 11:11][games 11:11]
> > [movies 11:11][health 11:11][tech 11:11]
> > 
> > 
> > 
> > Total Time: 2:13:15:30  Up: 2:13:15:30 (100.00%)  Active: 3:37:42 (5.92%)
> > 
> > 
> > Reservations:
> > 
> >  Job '16995'(x4)  -00:00:32 -> 99:23:59:27 (99:23:59:59)
> > 
> > JobList:  16995
> > 
> > 
> > 
> > 
> > 
> > From what I see, only 4/11 are used => 7 free, yet checkjob on a job
> > that is 'idle' shows it failed for the reason below (need/free = 4/3).
> > 
> > 
> > 
> > The only thing I see that seems off is that state != expected state; if I
> > recall correctly, maui won't schedule to a node where state != expected
> > state.
> > 
> > 
> > 
> > If that's the case, what could be going on here?  (Is this the bug where
> > torque is off, so the maui/torque sync is an issue?)
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > Sam Rash
> > 
> > srash@xxxxxxxxxxxxx
> > 
> > 408-349-7312
> > 
> > vertigosr37
> > 
> >  _____  
> > 
> > From: Justin Bronder [mailto:jsbronder@xxxxxxxxx] 
> > Sent: Monday, July 17, 2006 11:32 AM
> > To: Sam Rash
> > Cc: torqueusers@xxxxxxxxxxxxxxxx
> > Subject: Re: [torqueusers] more maui Q...
> > 
> > 
> > 
> > We use Moab on our site, but I'm fairly certain the commands are the same.
> > What you are probably seeing is runaway processes on the node, with
> > only 3 of the 4 needed processors free ("nps needed/free: 4/3").  You can
> > check this by using "mdiag -n", "checknode <nodename>", or manually
> > checking the load on the node.
> > 
> > To fix this, you'll probably have to figure out what is doing all 
> > the IO on
> > the node and kill/modify it so this doesn't continue to happen.
> > 
> > Then again, if it's not a high-load problem, I'm not quite sure.
> > 
> > Hope this helps,
> > Justin.
> > 
> > On 7/17/06, Sam Rash <srash@xxxxxxxxxxxxx> wrote:
> > 
> > Not to spam the list, but I'm actively trying to properly configure maui
> > to work with pbs so we can move to it.
> > 
> > 
> > 
> > Regarding my seeing underutilized resources, when I see the resources at:
> > 
> > 
> > 
> >     7 Active Jobs      28 of   88 Processors Active (31.82%)
> > 
> >                         5 of    8 Nodes Active      (62.50%)
> > 
> > 
> > 
> > In the case where ~500 jobs are in the system
> > 
> > 
> > 
> > And I check a job to see why it hasn't run:
> > 
> > checking job 16553
> > 
> > 
> > 
> > State: Idle
> > 
> > Creds:  user:srash  group:users  class:news  qos:DEFAULT
> > 
> > WallTime: 00:00:00 of 99:23:59:59
> > 
> > SubmitTime: Mon Jul 17 10:55:17
> > 
> >  (Time Queued  Total: 00:03:19  Eligible: 00:01:05)
> > 
> > 
> > 
> > StartDate: -00:01:07  Mon Jul 17 10:57:29
> > 
> > Total Tasks: 4
> > 
> > 
> > 
> > Req[0]  TaskCount: 4  Partition: DEFAULT
> > 
> > Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> > 
> > Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> > 
> > 
> > 
> > 
> > 
> > IWD: [NONE]  Executable:  [NONE]
> > 
> > Bypass: 0  StartCount: 85
> > 
> > PartitionMask: [ALL]
> > 
> > Flags:       RESTARTABLE
> > 
> > 
> > 
> > Messages:  cannot start job - RM failure, rc: 15044, msg: 'Resource
> > temporarily unavailable REJHOST=m24 MSG=cannot allocate node 'm24' to job -
> > node not currently available (nps needed/free: 4/3,  joblist:
> > 17060.17:0,17060.m17:1,17060.m17:2,17060.m17:3)'
> > 
> > PE:  4.00  StartPriority:  0
> > 
> > job can run in partition DEFAULT (44 procs available.  4 procs required)
> > 
> > 
> > 
> > 
> > but checking m24 with pbsnodes -a as well as the host itself reveals no
> > problems with it.
> > 
> > 
> > 
> > suggestions?
> > 
> > 
> > 
> > 
> > 
> > Sam Rash
> > 
> > srash@xxxxxxxxxxxxx
> > 
> > 408-349-7312
> > 
> > vertigosr37
> > 
> > 
> > 
> > 
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers@xxxxxxxxxxxxxxxx 
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> > 
> > 
> > 
> > 
> > 
> > 
> 
> 
> 

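The load check Don describes in the quoted thread (read the load average
from "ssh <node> w", then compare it with NODEMAXLOAD) can be sketched in
Python. This is only an illustration: the helper names are hypothetical, the
"below the limit means free" rule follows Don's wording rather than anything
taken from maui itself, and the header line would really come from running w
on the node.

```python
import re

# Header line of `w` as shown for n001 earlier in the thread.
SAMPLE_HEADER = (
    "10:44:40 up 3 days, 25 min,  0 users,  load average: 0.00, 0.00, 0.00"
)


def one_minute_load(w_header: str) -> float:
    """Extract the 1-minute load average from the first line of `w` output."""
    # Some systems print "load averages:" (plural), so accept both spellings.
    match = re.search(r"load averages?:\s*([0-9.]+)", w_header)
    if match is None:
        raise ValueError("no load average found in: %r" % w_header)
    return float(match.group(1))


def looks_free(w_header: str, node_max_load: float) -> bool:
    """True when the load is below the limit, per the description above."""
    return one_minute_load(w_header) < node_max_load


# 0.5 is the NODEMAXLOAD value quoted from maui.cfg in the thread.
NODEMAXLOAD = 0.5
print("n001 load:", one_minute_load(SAMPLE_HEADER))
print("n001 looks free:", looks_free(SAMPLE_HEADER, NODEMAXLOAD))
```

A node reporting a load of 1.000, as m24 does in the checknode output, would
come out as not free under a 0.5 limit in this sketch.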

