Torque on a Scyld cluster: msg#00132
Subject: Torque on a Scyld cluster
Has anybody experienced problems with Torque on a Scyld cluster? I'm
running Torque 2.1.8 on a Scyld cluster (Clusterware 4.13). When the
master node is defined as a single node with 48 processes (we have 24
dual-processor compute nodes), and the server, MOM, and scheduler (Maui)
all run on the master node, everything seems to work. But after a recent
upgrade of Torque (from the Scyld repository), the new configuration
runs a MOM on every compute node, and the compute nodes are listed
explicitly in the .../server_priv/nodes file. Now Maui doesn't work at
all (all jobs are deferred), and the original Torque scheduler works but
behaves inexplicably. Serial jobs (submitted using beorun) are all fine,
but MPI jobs behave very strangely. When started with mpiexec from
within a PBS script (with #PBS -l nodes=<n>:ppn=2), the processes don't
communicate with each other, and each one runs independently, as if
with ncpus=1. When started with mpirun from within a PBS script, all the
processes start on a single node. Could somebody tell me what is going
on and how we can fix it?
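
For anyone trying to reproduce this, here is a minimal sketch of a job
script that only reports what Torque allocated, without launching MPI at
all. It relies on the PBS_NODEFILE variable that Torque sets inside a
job (one line per allocated processor slot). The node count of 4, the
job name, and the summarize_nodes helper are placeholders made up for
illustration, not taken from the actual setup.

```shell
#!/bin/sh
#PBS -l nodes=4:ppn=2
#PBS -N nodefile_check
# NOTE: nodes=4 and the job name are illustrative placeholders.

# Print how many slots and how many distinct hosts a node file lists.
# summarize_nodes is a hypothetical helper, not part of Torque.
summarize_nodes() {
    nodefile="$1"
    if [ -z "$nodefile" ] || [ ! -r "$nodefile" ]; then
        echo "no readable node file"
        return 1
    fi
    # One line per processor slot; strip padding some wc versions add.
    echo "slots: $(wc -l < "$nodefile" | tr -d ' ')"
    echo "hosts: $(sort -u "$nodefile" | wc -l | tr -d ' ')"
}

# Inside a Torque job PBS_NODEFILE points at the allocation list.
# Outside a job it is unset, so the helper reports that and we carry on.
summarize_nodes "${PBS_NODEFILE:-}" || true
```

If the reported host count does not match the nodes= request, the
allocation itself is the problem; if it does match, the MPI launcher is
presumably not picking up the node list.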
Thanks,
Alex