Hi everyone,
This problem has just manifested itself again, in a different way. Users' jobs
are actually being queued, and then being killed, rather than sitting in the
BLOCKED queue as before.
"lwgordon@whiteout:~> qsub -I -lncpus=5
qsub: waiting for job 14093.whiteout.sf.utas.edu.au to start
qsub: job 14093.whiteout.sf.utas.edu.au ready
error: wanted 5 cpus but only 1 cpus are available. Error in PBS setup?
qsub: job 14093.whiteout.sf.utas.edu.au completed"
However, showq reports: "68 Active Jobs 113 of 118 Processors Active
(95.76%)"
Single CPU jobs seem to be entering running queue fine, but 4 CPU jobs enter
the running queue, and then die shortly afterwards.
"whiteout:/var/spool/maui/log # tracejob 14090
/var/spool/torque/sched_logs/20070823: No such file or directory
Job: 14090.whiteout.sf.utas.edu.au
08/23/2007 12:54:52 S enqueuing into batch, state 1 hop 1
08/23/2007 12:54:52 S Job Queued at request of
rijamos@xxxxxxxxxxxxxxxxxxxxxxx, owner = rijamos@xxxxxxxxxxxxxxxxxxxxxxx, job
name = mpmnitror, queue = batch
08/23/2007 12:54:52 A queue=batch
08/23/2007 12:54:53 S Job Modified at request of
maui@xxxxxxxxxxxxxxxxxxxxxxx
08/23/2007 12:54:53 S Job Run at request of maui@xxxxxxxxxxxxxxxxxxxxxxx
08/23/2007 12:55:04 S send of job to whiteout failed error = 15020
08/23/2007 12:55:14 S send of job to whiteout failed error = 15020
08/23/2007 12:55:14 S child reported failure for job after 21 seconds
(dest=whiteout), rc=10
08/23/2007 12:55:14 S unable to run job, MOM rejected/timeout
08/23/2007 13:01:43 S Job Modified at request of
maui@xxxxxxxxxxxxxxxxxxxxxxx
08/23/2007 13:01:43 S Job Run at request of maui@xxxxxxxxxxxxxxxxxxxxxxx
08/23/2007 13:01:51 S send of job to whiteout failed error = 15020
08/23/2007 13:02:01 S send of job to whiteout failed error = 15020
08/23/2007 13:02:01 S child reported failure for job after 18 seconds
(dest=whiteout), rc=10
08/23/2007 13:02:01 S unable to run job, MOM rejected/timeout
08/23/2007 13:17:13 S Job Modified at request of
maui@xxxxxxxxxxxxxxxxxxxxxxx
08/23/2007 13:17:13 S Job Run at request of maui@xxxxxxxxxxxxxxxxxxxxxxx
08/23/2007 13:17:24 S send of job to whiteout failed error = 15020
08/23/2007 13:17:34 S send of job to whiteout failed error = 15020
08/23/2007 13:17:34 S child reported failure for job after 21 seconds
(dest=whiteout), rc=10
08/23/2007 13:17:34 S unable to run job, MOM rejected/timeout
08/23/2007 13:23:15 S Job Modified at request of
maui@xxxxxxxxxxxxxxxxxxxxxxx
08/23/2007 13:23:15 S Job Run at request of maui@xxxxxxxxxxxxxxxxxxxxxxx
08/23/2007 13:23:23 S send of job to whiteout failed error = 15020
08/23/2007 13:23:36 S send of job to whiteout failed error = 15020
08/23/2007 13:23:36 S child reported failure for job after 21 seconds
(dest=whiteout), rc=10
08/23/2007 13:23:36 S unable to run job, MOM rejected/timeout
08/23/2007 13:29:34 S Job Modified at request of
maui@xxxxxxxxxxxxxxxxxxxxxxx
08/23/2007 13:29:34 S Job Run at request of maui@xxxxxxxxxxxxxxxxxxxxxxx
08/23/2007 13:29:45 S send of job to whiteout failed error = 15020
08/23/2007 13:29:58 S send of job to whiteout failed error = 15020
08/23/2007 13:29:58 S child reported failure for job after 24 seconds
(dest=whiteout), rc=10
08/23/2007 13:29:58 S unable to run job, MOM rejected/timeout
08/23/2007 13:35:02 S Job Modified at request of
maui@xxxxxxxxxxxxxxxxxxxxxxx
08/23/2007 13:35:02 S Job Run at request of maui@xxxxxxxxxxxxxxxxxxxxxxx
08/23/2007 13:35:07 S child reported success for job after 5 seconds
(dest=whiteout), rc=0
08/23/2007 13:35:07 A user=rijamos group=users jobname=mpmnitror
queue=batch ctime=1187837692 qtime=1187837692 etime=1187837692
start=1187840107 exec_host=whiteout Resource_List.mem=6700mb
Resource_List.ncpus=4
Resource_List.neednodes=whiteout Resource_List.nodect=1
Resource_List.walltime=400:00:00
08/23/2007 13:35:08 S Exit_status=-2
08/23/2007 13:35:08 A user=rijamos group=users jobname=mpmnitror
queue=batch ctime=1187837692 qtime=1187837692 etime=1187837692
start=1187840107 exec_host=whiteout Resource_List.mem=6700mb
Resource_List.ncpus=4
Resource_List.neednodes=whiteout Resource_List.nodect=1
Resource_List.walltime=400:00:00 session=0 end=1187840108 Exit_status=-2
08/23/2007 13:36:27 M forking to user, uid: 55197 gid: 100
homedir: '/u/rijamos'
08/23/2007 13:36:35 S dequeuing from batch, state COMPLETE"
If I run strace on MOM, it's doing this for a LONG time, looping over every
process in the system(about 1300ish) a lot of times(once for each job in the
queue?):
"open("/proc/8181/stat", O_RDONLY) = 12
fstat(12, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
0x2000000002900000
read(12, "8181 (UVic_ESCM) R 8177 8097 809"..., 1024) = 267
fstat(12, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
close(12) = 0
munmap(0x2000000002900000, 65536) = 0
open("/proc/8457/stat", O_RDONLY) = 12
fstat(12, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
0x2000000002900000
read(12, "8457 (bash) S 16979 8457 8457 0 "..., 1024) = 265
fstat(12, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
close(12) = 0
munmap(0x2000000002900000, 65536) = 0
open("/proc/8468/stat", O_RDONLY) = 12
fstat(12, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
0x2000000002900000
read(12, "8468 (bash) S 16979 8468 8468 0 "..., 1024) = 266
fstat(12, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
close(12) = 0
munmap(0x2000000002900000, 65536) = 0"
Even when this finishes and the strace goes back to a sane speed(just
reading /proc/loadavg every now and then), the node reports as down for a
while, then comes back up, only to be taken down by MOM again!
Is there ANY way to stop the system from crawling to a halt and killing jobs
as soon as they are submitted? Even if people's jobs are queued and checked
less frequently, it would be preferable to the jobs being killed for no
reason and having to try and resubmit them hoping that it will work this
time! Frustrating as you can imagine :)
It seems that even when there's sufficient CPUs free, by the time it's gets
around to scheduling the job, it's back to it's "/proc trawling" again so
then can't communicate and the job either enters the queue and then dies, or
remains in a deferred state.
Any ideas? I realise a proper fix involves rewriting the program for better
scalability, but does anyone have any ideas to alleviate this problem in the
meantime? Maybe suggest which values in MOM/maui/pbs_server config can be
tweaked to lighten to load?
I think you were right on the money David, this /proc business is a bit
ridiculous and it's ironic that something with this much grunt can be slowed
to a crawl by a single process!
Thanks
Leigh
--
Regards,
Leigh Gordon
High Performance Computing Systems Administrator
IT Resources, University of Tasmania
Phone: 03 6226 6389
http://www.tpac.org.au
"What we're really after is simply that people acquire a legal license for
Windows for each computer they own before they move on to Linux or Sun
Solaris or BSD or OS/2 or whatever."
- Bill Gates -
On Mon, 20 Aug 2007 08:54:55 am Leigh Gordon wrote:
> Hi, thanks for the reply.
>
> I've looked into it a bit more(even going as far as looking at the source
> code for 'top', which unfortunately also uses "/proc" rather than system
> calls) and it seems like it's a job for someone with a lot more experience
> with C than myself. I did find a GPL project called Supermon, which I read
> a paper on(and it specifically said it used system calls rather than /proc
> on Linux) which may hold some clues on how to do it.
>
> Decreasing the mom polling frequency down to 5 minutes(was 3 minutes) seems
> to at least be preventing the jobs from becoming blocked for good(the jobs
> that get marked "Deferred" return to the "Idle" status, rather than
> "BatchHold"), although it doesn't prevent the system having it's little
> fits every few minutes.
>
> Thanks again,
> Leigh
>
> On Thursday 16 August 2007 06:43:45 David Singleton wrote:
> > Hi Leigh,
> >
> > I'm guessing the problem is the way torque MOMs trawl /proc on
> > Linux causing the MOM to spend a lot of time very busy. Have
> > you strace'd the MOM? Is your MOM running in the same cpuset
> > as jobs or otherwise fighting for cpu time?
> >
> > I haven't looked at this for a while but it used to be that the MOM
> > would run through the whole of /proc multiple times per job (in
> > cput_sum(), overcpu_proc(), mem_sum(), ....). Reading /proc on a
> > full 128P system is not cheap and if most of your jobs are single
> > cpu then your multiplier of this number is large. The cost is about
> > 64^2 times what it is on a cluster of 2P nodes.
> >
> > There are multiple ways this can be improved. On solaris and
> > digitalunix, MOMs reuse a single trawling of /proc for multiple jobs.
> > Since most uses are sampling for resource usage/limits, it doesn't
> > matter if the sampled /proc data is a few millisecs stale. [On top
> > of this, the binary blob dump of /proc entries on Solaris and Tru64
> > is noticeably *faster* than Linux's annoying "readable" format so
> > MOM on Linux takes a double hit in comparison.]
> >
> > Another alternative is to use cpusets or SGI "process aggregates"
> > (as used in SGI array sessions or their "job" module) to have the
> > kernel accurately track the list of processes in a job (actually in
> > each task of a job). Then, for each job, MOM can just loop over the
> > list of pids in these job "containers" task lists instead of the
> > whole of /proc. [Good news for Linux: a generic "process container"
> > infrastructure is hopefully being added to the kernel. It effectively
> > provides an indelible session id and a means of accessing job
> > task lists.]
> >
> > There is benefit in applying both these modifications.
> >
> > I could be totally wrong about what is causing your system grief but
> > it does sound like a MOM scalability issue. If so, fixing it will
> > require some development work.
> >
> > Cheers,
> > David
|