
Torque on a Scyld cluster: msg#00132

Subject: Torque on a Scyld cluster
Has anybody experienced problems with torque on a Scyld cluster? I'm running torque 2.1.8 on a Scyld cluster (Clusterware 4.13).

When the master node is defined as a single node with 48 processors (we have 24 dual-CPU compute nodes), and the server, mom, and scheduler (Maui) all run on the master node, everything seems to work.

BUT! After a recent upgrade of torque (from the Scyld repository), the new configuration runs MOMs on all the compute nodes, and the compute nodes are explicitly listed in the .../server_priv/nodes file. Now Maui doesn't work at all (all jobs are deferred), and the original torque scheduler works but behaves in ways I can't explain.

All the serial jobs (submitted using beorun) are fine, but the MPI jobs behave very strangely. When submitted with mpiexec from within a PBS script (with #PBS -l nodes=<n>:ppn=2), the processes don't communicate with each other, and each process runs independently, as if ncpus=1. When submitted with mpirun from within a PBS script, all the processes are started on a single node.

Could somebody tell me what is going on and how we can fix it?
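
For reference, the two layouts of the server_priv/nodes file described above would look roughly like the sketch below. This is an illustration only, not a copy of the actual file: the hostnames (master, n0 through n23) are assumed, and the lines starting with # are annotations for this message. Only the np= values follow from the description (48 on the master before the upgrade, 2 per dual-CPU compute node after).

```
# Before the upgrade: the master stands in for all 24 compute nodes
master np=48

# After the upgrade: one entry per compute node, each running its own MOM
n0 np=2
n1 np=2
# (n2 through n22 follow the same pattern)
n23 np=2
```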

Thanks,
Alex
--

