Bug in Torque 1.2.0p6? (msg#00221, clustering.torque.user)
Hi.

We're running a 6-node cluster of dual-Opteron machines. One of our nodes is currently running 3 jobs instead of 2, and we get a strange result from qstat:

                                                                Req'd  Req'd   Elap
    Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
    --------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
    7447.ulmo.calcu bouchere infiniCa twi1D2      31413   1  -- 1000mb 2000: R 153:5
       callas04/0
    7993.ulmo.math. khodor   q1jourCa microf         --  --  -- 1500mb 02:00 Q    --
       callas04/1
    7994.ulmo.math. khodor   q1jourCa nsmgev      30742  --  -- 1500mb 02:00 R 00:52
       callas04/1

Job 7993 is marked as QUEUED, but it has a processor reserved: the same processor as job 7994 (callas04/1)! And it is actually RUNNING on the node, as ps auxf shows:

    root     16336  0.0  0.0  34520   3944 ?  Ss   2005    31:55 /usr/sbin/pbs_mom -r
    bouchere 31413  0.0  0.1  30984   5440 ?  Ss   Jan18    0:00  \_ -bash
    bouchere 31445  0.0  0.1  30988   5448 ?  S    Jan18    0:00  |   \_ -bash
    bouchere 31551  0.0  0.0   2336    292 ?  S    Jan18    0:00  |       \_ time /home/mab/bouchere/THESE/1D/2NIV/TWI_FIN/POLLEN/TWI/run
    bouchere 31552 97.8 17.3 713016 704240 ?  R    Jan18 9213:01  |           \_ /home/mab/bouchere/THESE/1D/2NIV/TWI_FIN/POLLEN/TWI/run
    khodor   30527  0.0  0.1  30988   5444 ?  Ss   16:35    0:00  \_ -bash
    khodor   30559  0.0  0.1  30992   5452 ?  S    16:35    0:00  |   \_ -bash
    khodor    2356  0.0  0.0   2336    292 ?  S    17:14    0:00  |       \_ time ./microf
    khodor    2357 16.1  0.0  12336   3756 ?  R    17:14    7:46  |           \_ ./microf
    khodor   30742  0.0  0.1  30988   5444 ?  Ss   16:36    0:00  \_ -bash
    khodor   30774  0.0  0.1  30992   5452 ?  S    16:36    0:00      \_ -bash
    khodor   31678  0.0  0.0   2336    292 ?  S    16:40    0:00          \_ time ./nsmgev
    khodor   31679 38.0  0.9  46076  37432 ?  R    16:40   30:56              \_ ./nsmgev

The nodes file says:

    callas01 np=2 opteron callas
    callas02 np=2 opteron callas
    callas03 np=2 opteron callas
    callas04 np=2 opteron callas
    callas05 np=2 opteron callas
    callas06 np=2 opteron callas

and pbsnodes -a reports only 2 jobs on the node:

    # pbsnodes -a callas04
    callas04
         state = job-exclusive
         np = 2
         properties = opteron,callas
         ntype = cluster
         jobs = 0/7447.ulmo.calcul, 1/7994.ulmo.math.u-bordeaux1.fr

What could be wrong?

-- 
Jacques Foury
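
The discrepancy described above (a slot claimed by two jobs in qstat while pbsnodes lists only one) could be checked for mechanically. The following is only a rough sketch, not part of Torque: the function names `parse_pbsnodes` and `find_slot_conflicts` are hypothetical, the qstat rows are assumed to have been extracted by hand as (job ID, state, exec host) tuples, and the parser assumes the `key = value` layout shown in the pbsnodes output above, which may differ in other versions.

```python
from collections import defaultdict


def parse_pbsnodes(text):
    """Parse pbsnodes-style text into {node: {"np": int, "jobs": {slot: job_id}}}.

    Assumes node names start in column 0 and attributes are indented
    "key = value" lines, as in the output quoted in this message.
    """
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new node record.
            current = line.strip()
            nodes[current] = {"np": 0, "jobs": {}}
        elif current is not None and "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            if key == "np":
                nodes[current]["np"] = int(value)
            elif key == "jobs":
                # Entries look like "0/7447.ulmo.calcul": slot, then job ID.
                for entry in value.split(","):
                    slot, job_id = entry.strip().split("/", 1)
                    nodes[current]["jobs"][int(slot)] = job_id
    return nodes


def find_slot_conflicts(qstat_rows):
    """Return exec-host slots that more than one job claims in qstat."""
    by_slot = defaultdict(list)
    for job_id, state, exec_host in qstat_rows:
        by_slot[exec_host].append((job_id, state))
    return {slot: jobs for slot, jobs in by_slot.items() if len(jobs) > 1}


# Data copied from the qstat and pbsnodes output in this message.
PBSNODES_SAMPLE = """callas04
     state = job-exclusive
     np = 2
     properties = opteron,callas
     ntype = cluster
     jobs = 0/7447.ulmo.calcul, 1/7994.ulmo.math.u-bordeaux1.fr
"""

QSTAT_ROWS = [
    ("7447.ulmo.calcu", "R", "callas04/0"),
    ("7993.ulmo.math.", "Q", "callas04/1"),
    ("7994.ulmo.math.", "R", "callas04/1"),
]

if __name__ == "__main__":
    nodes = parse_pbsnodes(PBSNODES_SAMPLE)
    print(nodes["callas04"])
    print(find_slot_conflicts(QSTAT_ROWS))
```

With the data from this message, the only slot reported as doubly claimed is callas04/1, shared by jobs 7993 and 7994, while the parsed pbsnodes record shows two job entries against np = 2.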