
Re: Re: Re: Trouble running jobs with TORQUE: msg#00181

Subject: Re: Re: Re: Trouble running jobs with TORQUE
All users who run jobs need to have passwordless logins. The easiest way to do this is to set up SSH key pairs across a shared home directory. They don't need to be able to log in all the time; pbs_server can modify the access.conf file if necessary. So, here's a hypothetical situation:

exportfs /home --> compute_nodes/home

cd /home/user
ssh-keygen -t rsa ...
cat .ssh/authorized_keys
  xxxxxxx user@admin

cd /home/user/job
qsub job_file

When the job gets queued, the scheduler will ssh to the compute nodes from the user's account:

ssh compute01
compute01$ 

is what it should get. Remember, it's in batch mode, so it can't enter a password.
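
A minimal sketch of the setup described above, assuming OpenSSH, a /home that is shared across the nodes, and compute01/compute02 as placeholder host names. The function names and the SSH variable are illustrative and not part of TORQUE. The BatchMode=yes option makes ssh fail instead of prompting, which mimics what a batch job experiences.

```shell
#!/bin/sh
# Sketch only: host names and function names are placeholders.
SSH=${SSH:-ssh}   # overridable so the check can be dry-run

# Create a passphrase-less RSA key pair and authorize it for this account.
# With a shared /home this covers every node that mounts it.
setup_keys() {
    mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"
    [ -f "$HOME/.ssh/id_rsa" ] || ssh-keygen -q -t rsa -N "" -f "$HOME/.ssh/id_rsa"
    cat "$HOME/.ssh/id_rsa.pub" >> "$HOME/.ssh/authorized_keys"
    chmod 600 "$HOME/.ssh/authorized_keys"
}

# Succeeds only if the login works without any prompt.
check_node() {
    "$SSH" -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1
}

# Print one status line per node.
report_nodes() {
    for node in "$@"; do
        if check_node "$node"; then
            echo "$node: ok"
        else
            echo "$node: FAILED"
        fi
    done
}

# Usage, run as the user who submits jobs:
#   setup_keys
#   report_nodes compute01 compute02
```

Note that a first-time host key prompt also counts as a failure in batch mode, so each node's host key needs to be known before the check will pass.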



- Donald Tripp
----------------------------------------------
HPC Systems Administrator
High Performance Computing Center
University of Hawai'i at Hilo
200 W. Kawili Street
Hilo,   Hawaii   96720


On Mar 27, 2007, at 3:57 PM, aohara@xxxxxxxxxxxxx wrote:

Thanks for responding.  I also got a private mail response asking a few
questions about the set up.  So I'll answer that here too.
In an e-mail from Thomas Pierce, he suggested not using pbs_sched at all
and using Maui alone as the scheduler.  I just wanted to clarify that this is
what we are doing/plan to do.  I only tried both in order to isolate the
problem as TORQUE- or Maui-specific.  Also, the version of TORQUE is 2.1.8.

I did configure with the '--with-scp' option; however, I was
under the impression that the root account, which is
running/operating TORQUE and Maui, was the only one that needed
passwordless ssh.  Under the current compiled configuration of TORQUE,
would I then need to add all users to each compute node and then set up
ssh for them?  Is there a configure option at compile time that would
let me keep passwordless ssh set up only for the root user on the compute
nodes?

After running echo 'hostname' | qsub, no output files were produced either.

Thanks again,
Andy

P.S. Is there an easier way to reply to a message since I'm getting the
digests. Thanks.

Date: Tue, 27 Mar 2007 11:26:49 -0400
Subject: Re: [torqueusers] Trouble running jobs with TORQUE
Content-Type: text/plain; charset="us-ascii"

The off-the-cuff answer is that there might be a problem with the rsh/ssh
permissions on the system.  Have you verified that the user submitting the
job (administrator@babbage) can do a passwordless ssh (assuming you
configured with --with-scp) to the compute nodes and back to the head node?

For the test echo 'hostname' | qsub, are you getting stdout and stderr
files back (STDIN.e123456-looking things)?  If you are, is there anything
in them?  Is the administrator getting an email about these jobs with any
information in it?
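
To make that check concrete, here is a hedged sketch. It assumes the default TORQUE naming, where a job submitted on standard input is named STDIN and its output files land in the submission directory as <jobname>.o<number> and <jobname>.e<number>. The function name is illustrative.

```shell
#!/bin/sh
# Derive the default output file names from the job id that qsub prints
# (for example 123.babbage). Assumes default naming; -N, -o or -e change it.
expected_outputs() {
    jobid=$1
    jobname=${2:-STDIN}
    seq=${jobid%%.*}            # numeric part before the first dot
    echo "${jobname}.o${seq} ${jobname}.e${seq}"
}

# Usage on the head node, as the submitting user:
#   jobid=$(echo hostname | qsub)
#   sleep 10                              # give the job time to finish
#   for f in $(expected_outputs "$jobid"); do
#       if [ -f "$f" ]; then echo "== $f"; cat "$f"; else echo "missing: $f"; fi
#   done
```

If both files are missing, that points toward the copy-back step rather than the job itself, which is consistent with the ssh question above.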

A separate issue with Python that I have run into is ensuring that the
'all set' Python setup includes PYTHONPATH being set appropriately in the
shell that TORQUE opens, if you have installed extra packages.  But any
problems here should show up in a stack trace in the stderr file and can
be diagnosed that way.
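
One way to see what the batch shell actually gets is a tiny job script that prints the relevant environment, sketched below. The job name envcheck and the function name are made up for illustration; -N and -j oe are standard qsub directives (job name, and stderr joined into stdout).

```shell
#!/bin/sh
#PBS -N envcheck
#PBS -j oe
# Print what the batch shell sees, to compare with an interactive login.
report_env() {
    echo "host=$(uname -n)"
    echo "python=$(command -v python || echo not-found)"
    echo "PYTHONPATH=${PYTHONPATH:-<unset>}"
}

report_env
```

Submitting this with qsub and comparing its output file with the same script run by hand on a node should show whether PYTHONPATH differs between the two shells.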

Hope that gives you a start,
Nate





26-Mar-2007 17:48

Subject: [torqueusers] Trouble running jobs with TORQUE






Hi,
We just recently began setting up a Linux cluster here at Haverford
College using TORQUE and Maui.  The general specs are 6 blades with two
dual-core AMD Opterons, 16 GB RAM, and a head node with a similar
processor setup.
Over the past week, we installed TORQUE (and Maui); however, TORQUE seems
to be having trouble running jobs.
Running 'pbsnodes -a' reports correctly on the state of all nodes, and if
neither pbs_sched nor Maui is running then qstat shows jobs labeled Q, as
expected.  However, when either pbs_sched or Maui is running, the jobs
don't seem to run properly.  I tried submitting both the test
phrase `echo "sleep 30" | qsub' and a script `qsub testjob' where testjob
is a script containing `python myprogram.py'.  All necessary Python
packages are installed too, so I know this isn't the problem (I've
manually run the Python code on all nodes).  The reason I suspect some
form of TORQUE error is that this job also completes immediately, even though
it should take roughly 20 minutes to run.  The tracejob output for one is
here (both are basically the same though):

03/26/2007 17:25:17  S    enqueuing into batch, state 1 hop 1
03/26/2007 17:25:17  S    Job Queued at request of administrator@babbage, owner = administrator@babbage, job name = testjob.sh, queue = batch
03/26/2007 17:25:18  S    Job Modified at request of root@babbage
03/26/2007 17:25:18  S    Job Run at request of root@babbage
03/26/2007 17:25:18  S    Job Modified at request of root@babbage
03/26/2007 17:25:18  S    Exit_status=-1
03/26/2007 17:25:18  S    Post job file processing error
03/26/2007 17:25:18  S    dequeuing from batch, state COMPLETE
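
For checking several jobs quickly, a small filter over tracejob output can pull out the logged exit status. This is only a sketch: the function name is hypothetical, and it assumes the status appears as Exit_status=<number> as in the trace above.

```shell
#!/bin/sh
# Read tracejob output on stdin and print the last logged exit status, if any.
trace_exit_status() {
    sed -n 's/.*Exit_status=\(-\{0,1\}[0-9][0-9]*\).*/\1/p' | tail -n 1
}

# Usage: tracejob <jobid> | trace_exit_status
```

Fed the trace above, this prints -1; a job with no Exit_status line produces no output.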

Any help would be greatly appreciated, thanks.  If you need any more
information about our cluster hardware/software setup, just ask.

Thanks,
Andy O'Hara
Haverford College Physics '09

_______________________________________________
torqueusers mailing list
torqueusers@xxxxxxxxxxxxxxxx
http://www.supercluster.org/mailman/listinfo/torqueusers