
Re: Qstat reporting false node use: msg#00050

Subject: Re: Qstat reporting false node use
On Wed, Apr 11, 2007 at 11:38:56AM -0700, Clevenger, Kevin alleged:
> Hi,
> 
> When running multiple NAMD jobs on the cluster (Rocks 4.2.1) we see qstat -n 
> report that the jobs start on separate nodes, but when you look at the 
> processes with cluster-ps they in fact are not. Anyone know why this is and 
> how to straighten it out? Output below.

Note that TORQUE has nothing to do with launching processes; it just
runs your job script.  It is the job script's responsibility to launch
processes on the nodes listed in $PBS_NODEFILE.

 
> Thanks
> 
> Kevin
> 
> ###################################################
> 
> $ qstat -n
> 
> cluster.coh.org: 
>                                                                    Req'd  Req'd   Elap
> Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
> -------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
> 153.cluster.coh.org  bob      longrun  eq32.submi   5042     8   1    --  1000: R 00:23
>    c-0-24+c-0-24+c-0-23+c-0-23+c-0-22+c-0-22+c-0-21+c-0-21+c-0-20+c-0-20+c-0-19
>    +c-0-19+c-0-18+c-0-18+c-0-17+c-0-17
> 154.cluster.coh.org  bob      longrun  eq08.submi  32618     4   1    --  1000: R 00:22
>    c-0-16+c-0-16+c-0-15+c-0-15+c-0-14+c-0-14+c-0-13+c-0-13
> 155.cluster.coh.org  bob      longrun  TAK779-eq0   1383     4   1    --  1000: R 00:18
>    c-0-12+c-0-12+c-0-11+c-0-11+c-0-10+c-0-10+c-0-9+c-0-9

OK, so we should see the job scripts running on the first node in each
list: c-0-12, c-0-16, and c-0-24.


> c-0-12: 
> bob      1414  0.0  0.0  5848  764 ?        S    11:08   0:00 
> /home/bob/vaidsim ++remote-shell ssh ++nodelist /share/data/etc/nodelist +p16 
> /home/bob/vaidsimpl /home/bob/CCR2TAK779/MD/eq08-con.namd

> c-0-16: 
> bob     32649  0.0  0.0  5848  764 ?        S    11:04   0:00 
> /home/bob/vaidsim ++remote-shell ssh ++nodelist /share/data/etc/nodelist +p16 
> /home/bob/vaidsimpl /home/bob/CCR2APO/MD/eq08-con.namd

> c-0-24: 
> bob      5069  0.0  0.0  5848  764 ?        S    11:03   0:00 
> /home/bob/vaidsim ++remote-shell ssh ++nodelist /share/data/etc/nodelist +p16 
> /home/bob/vaidsimpl /home/bob/STAT3/eq32.namd

Good, we see the jobs running on the correct nodes.  But it appears your
command is using a private nodes file (/share/data/etc/nodelist) to launch
processes wherever it wants, rather than on the nodes in $PBS_NODEFILE.
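
For what it's worth, here is a rough sketch of how a job script could build
a per-job nodelist from $PBS_NODEFILE instead of pointing at a shared file.
It assumes the launcher (vaidsim in the output above) is a Charm++
charmrun-style program that reads the usual "group main" / "host <name>"
nodelist format; the ++nodelist and +p flags suggest that, but it is a guess.
The helper name make_nodelist and the temp-file location are made up for
illustration.

```shell
#!/bin/sh
# Sketch only: convert the TORQUE node file into a Charm++-style nodelist.
# make_nodelist is a hypothetical helper, not part of TORQUE or NAMD.
make_nodelist() {
    # $1: PBS node file (one hostname per allocated processor slot)
    # $2: path of the nodelist file to write
    {
        echo "group main"
        # One "host" line per slot; repeated hostnames are kept on purpose
        awk 'NF { print "host", $1 }' "$1"
    } > "$2"
}

# TORQUE sets PBS_NODEFILE inside a running job; skip this part elsewhere.
if [ -n "${PBS_NODEFILE:-}" ]; then
    NODELIST="${TMPDIR:-/tmp}/nodelist.${PBS_JOBID:-$$}"
    make_nodelist "$PBS_NODEFILE" "$NODELIST"
    NP=$(grep -c . "$PBS_NODEFILE")   # number of allocated slots
    # Assumed launch line, mirroring the flags seen in the ps output:
    echo "launcher ++remote-shell ssh ++nodelist $NODELIST +p$NP ..."
fi
```

The echo at the end only prints the launch line it would use; swap in the
real launcher and NAMD arguments once the generated nodelist looks right.
That way +p follows the number of slots TORQUE actually granted rather than
a hard-coded 16.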

