It was a left-over from using simple pbs_sched and being on bsd where
pbs_mom doesn't seem to be able to get the memory accurately. I have 1
queue where I want at most 2 of those jobs to execute at a time on a node
(they use 1gb each at most) but I allow 3 of any other. Setting np=11 and
using ppn=4 for that queue and ppn=3 for the others gets you this.
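
As a quick sanity check on that arithmetic, here is a minimal sketch. The function name is made up for illustration and has nothing to do with TORQUE or Maui internals; it only assumes that each job takes ppn of the node's np virtual processor slots.

```python
def max_jobs_per_node(np: int, ppn: int) -> int:
    """Number of jobs that fit on a node when each job takes `ppn` of `np` slots."""
    if ppn <= 0:
        raise ValueError("ppn must be positive")
    return np // ppn


NP = 11  # virtual processors configured per node

print(max_jobs_per_node(NP, 4))  # restricted queue: 2
print(max_jobs_per_node(NP, 3))  # every other queue: 3
```

So np=11 is just the smallest slot count that caps the ppn=4 queue at 2 jobs while still letting 3 ppn=3 jobs share a node.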
What I really need is to just have 2gb of consumable memory resource on each
node and set the one queue to require 1gb and the others to 250mb.
I assume this can be done in maui, but I've yet to do it.
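
To make the intent concrete, here is a rough model of the packing rule I am after. This is only a sketch of the policy, not Maui configuration: the constant and function names are hypothetical, 2gb is taken as 2048 MB and 1gb as 1024 MB, and the cap of 3 jobs per node is assumed to come from the processor count.

```python
from typing import Iterable

NODE_MEM_MB = 2048  # assumed: "2gb" of consumable memory per node
MAX_JOBS = 3        # assumed: at most 3 jobs per node


def fits(job_mem_mb: Iterable[int],
         node_mem_mb: int = NODE_MEM_MB,
         max_jobs: int = MAX_JOBS) -> bool:
    """True if the jobs fit within both the memory and the job-count limits."""
    jobs = list(job_mem_mb)
    return len(jobs) <= max_jobs and sum(jobs) <= node_mem_mb


print(fits([1024, 1024]))       # two big jobs fit
print(fits([1024, 1024, 250]))  # a third job would exceed node memory
print(fits([250, 250, 250]))    # three small jobs fit
```

One caveat: the checknode output quoted further down reports MEM: 2010M for m24, so with that figure two 1024 MB requests would not fit, while two requests of about 1000 MB would.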
Sam Rash
srash@xxxxxxxxxxxxx
408-349-7312
vertigosr37
-----Original Message-----
From: Donald E Tripp [mailto:dtripp@xxxxxxxxxx]
Sent: Monday, July 17, 2006 2:29 PM
To: Sam Rash
Cc: torqueusers@xxxxxxxxxxxxxxxx
Subject: Re: RE: RE: [torqueusers] more maui Q...
Just curious, but why are you setting each node to have more "virtual"
processors than actual processors?
- Don
----- Original Message -----
From: Sam Rash <srash@xxxxxxxxxxxxx>
Date: Monday, July 17, 2006 11:22 am
Subject: RE: RE: [torqueusers] more maui Q...
To: torqueusers@xxxxxxxxxxxxxxxx
> So what I notice is the real load (using below or just being logged on and
> watching top) is around 1, but I have a job running where I set ppn=4. It
> seems like maui tries to estimate the load based on the ppn (which is
> reasonable).
>
> So I fell back to setting my jobs/nodes to the REAL resources (np=3, ppn=1)
> and I get better util, but I still see what I guess is the torque bug where
> used slots is 2/3 yet I see job-exclusive as the node state via pbs.
>
> Is it easy to configure the memory as a consumable resource on a box and set
> this as a default via the queue in pbs or directly with class attributes in
> maui?
>
>
>
> Sam Rash
> srash@xxxxxxxxxxxxx
> 408-349-7312
> vertigosr37
>
> -----Original Message-----
> From: Donald E Tripp [mailto:dtripp@xxxxxxxxxx]
> Sent: Monday, July 17, 2006 1:51 PM
> To: Sam Rash
> Cc: 'Justin Bronder'; torqueusers@xxxxxxxxxxxxxxxx
> Subject: Re: RE: [torqueusers] more maui Q...
>
> try:
> ssh <node> w
>
> that should return something like:
>
> ssh n001 w
> 10:44:40 up 3 days, 25 min, 0 users, load average: 0.00, 0.00, 0.00
> USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
>
> you are looking at the line: load average: 0.00, 0.00, 0.00, which for me
> means n001 has no load. Then, look at /usr/local/maui/maui.cfg and find the
> line:
>
> NODEMAXLOAD 0.5
>
> that means the load of a node must be below 0.5 for maui to see it as
> "free" and schedule jobs to it. If your node shows higher than this, it's not
> necessarily bad. It may be that you have some other processes running and
> taking up resources. If you want to just try to see if this is the issue,
> change NODEMAXLOAD to something higher than the node's actual load, and it
> should schedule it fine.
>
> - Don
>
> ----- Original Message -----
> From: Sam Rash <srash@xxxxxxxxxxxxx>
> Date: Monday, July 17, 2006 9:31 am
> Subject: RE: [torqueusers] more maui Q...
> To: 'Justin Bronder' <jsbronder@xxxxxxxxx>
> Cc: torqueusers@xxxxxxxxxxxxxxxx
>
> > Hmm, when I ran checknode,
> >
> > checking node m24
> >
> > State: Running (in current state for 00:00:00)
> > Expected State: Idle SyncDeadline: Mon Jul 17 11:57:18
> > Configured Resources: PROCS: 11 MEM: 2010M SWAP: 2010M DISK: 1M
> > Utilized Resources: [NONE]
> > Dedicated Resources: PROCS: 4
> > Opsys: freebsd Arch: [NONE]
> > Speed: 1.00 Load: 1.000
> > Network: [DEFAULT]
> > Features: [NONE]
> > Attributes: [Batch]
> > Classes: [education 11:11][test 11:11][news 7:11][tv 11:11][top_priority 11:11][low_priority 11:11][sports 11:11][games 11:11][movies 11:11][health 11:11][tech 11:11]
> >
> > Total Time: 2:13:15:30 Up: 2:13:15:30 (100.00%) Active: 3:37:42 (5.92%)
> >
> > Reservations:
> > Job '16995'(x4) -00:00:32 -> 99:23:59:27 (99:23:59:59)
> > JobList: 16995
> >
> >
> >
> >
> >
> > From what I see, only 4/11 are used => 7 free, yet checkjob on a job that is
> > 'idle' shows it failed due to the below reason (need/free = 4/3).
> >
> > The only thing I see that seems off is the state != expected state; if I
> > recall reading right, maui won't schedule to a node where state != expected
> > state.
> >
> > If that's the case, what could be going on here? (is this the bug where
> > torque is off so the maui/torque sync is an issue?)
> >
> >
> >
> >
> >
> >
> >
> > Sam Rash
> > srash@xxxxxxxxxxxxx
> > 408-349-7312
> > vertigosr37
> >
> > _____
> >
> > From: Justin Bronder [mailto:jsbronder@xxxxxxxxx]
> > Sent: Monday, July 17, 2006 11:32 AM
> > To: Sam Rash
> > Cc: torqueusers@xxxxxxxxxxxxxxxx
> > Subject: Re: [torqueusers] more maui Q...
> >
> >
> >
> > We use Moab on our site, but I'm fairly certain the commands are the same.
> > What you are probably seeing is runaway processes on the node, leaving
> > only 3 of the 4 needed processors free ("nps needed/free: 4/3"). You can
> > check this by using "mdiag -n", "checknode <nodename>", or manually checking
> > the load on the node.
> >
> > To fix this, you'll probably have to figure out what is doing all the IO on
> > the node and kill/modify it so this doesn't continue to happen.
> >
> > Then again, if it's not a high-load problem, I'm not quite sure.
> >
> > Hope this helps,
> > Justin.
> >
> > On 7/17/06, Sam Rash <srash@xxxxxxxxxxxxx> wrote:
> >
> > Not to spam the list, but I'm actively trying to properly configure maui to
> > work with pbs so we can move to it.
> >
> > Regarding my seeing underutilized resources, when I see the resources at:
> >
> > 7 Active Jobs 28 of 88 Processors Active (31.82%)
> > 5 of 8 Nodes Active (62.50%)
> >
> > in the case where ~500 jobs are in the system, and I check a job to see why
> > it hasn't run:
> >
> > checking job 16553
> >
> > State: Idle
> > Creds: user:srash group:users class:news qos:DEFAULT
> > WallTime: 00:00:00 of 99:23:59:59
> > SubmitTime: Mon Jul 17 10:55:17
> > (Time Queued Total: 00:03:19 Eligible: 00:01:05)
> >
> > StartDate: -00:01:07 Mon Jul 17 10:57:29
> > Total Tasks: 4
> >
> > Req[0] TaskCount: 4 Partition: DEFAULT
> > Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> > Opsys: [NONE] Arch: [NONE] Features: [NONE]
> >
> > IWD: [NONE] Executable: [NONE]
> > Bypass: 0 StartCount: 85
> > PartitionMask: [ALL]
> > Flags: RESTARTABLE
> >
> > Messages: cannot start job - RM failure, rc: 15044, msg: 'Resource
> > temporarily unavailable REJHOST=m24 MSG=cannot allocate node 'm24' to job -
> > node not currently available (nps needed/free: 4/3, joblist:
> > 17060.17:0,17060.m17:1,17060.m17:2,17060.m17:3)'
> >
> > PE: 4.00 StartPriority: 0
> > job can run in partition DEFAULT (44 procs available. 4 procs required)
> >
> >
> >
> > but checking m24 with pbsnodes -a as well as the host itself reveals no
> > problems with it.
> >
> > suggestions?
> >
> >
> >
> >
> >
> > Sam Rash
> > srash@xxxxxxxxxxxxx
> > 408-349-7312
> > vertigosr37
> >
> >
> >
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers@xxxxxxxxxxxxxxxx
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> >
> >
> >
> >
> >
>
>
>
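
The NODEMAXLOAD check Don describes in the quoted message can be sketched roughly as follows. This is only a model of his description, not Maui's actual code: the function names are hypothetical, and using the 1-minute load average for the comparison is an assumption.

```python
import re
from typing import Tuple

NODEMAXLOAD = 0.5  # value from the maui.cfg line quoted in the thread


def parse_load_average(line: str) -> Tuple[float, float, float]:
    """Extract the 1-, 5- and 15-minute load averages from `w`/`uptime` output."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", line
    )
    if match is None:
        raise ValueError(f"no load average found in: {line!r}")
    one, five, fifteen = (float(x) for x in match.groups())
    return one, five, fifteen


def looks_free(load_1min: float, max_load: float = NODEMAXLOAD) -> bool:
    """Model of the rule: the node counts as free only below the configured load."""
    return load_1min < max_load


sample = "10:44:40 up 3 days, 25 min, 0 users, load average: 0.00, 0.00, 0.00"
one_minute, _, _ = parse_load_average(sample)
print(looks_free(one_minute))  # idle node n001 from Don's example
print(looks_free(1.0))         # m24, whose checknode output shows Load: 1.000
```

Under this model m24's load of 1.000 is above the 0.5 threshold, which is why raising NODEMAXLOAD above the node's actual load is a reasonable first test.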