
Re: Strange problem in Torque and Maui: msg#00075

Subject: Re: Strange problem in Torque and Maui
James J Coyle wrote:
Kandy,

   Have you logged into the server and checked that
hostname
and
cat /var/spool/torque/server_name (or cat /var/spool/pbs/server_name if you are using pbs rather than torque)

both return the same thing?
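
The check above can be sketched as a small shell function. This is only a rough sketch: the function name check_server_name is made up for illustration, the default path assumes a torque install under /var/spool/torque, and a short hostname versus a fully qualified one would be reported as a mismatch even though it may be harmless.

```shell
#!/bin/sh
# check_server_name [FILE]
# Hypothetical helper: compares the output of `hostname` with the first
# line of the server_name file (default: /var/spool/torque/server_name).
check_server_name() {
    file="${1:-/var/spool/torque/server_name}"
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 2
    fi
    # Command substitution drops the trailing newline; inner spaces are kept,
    # so a file containing 'server name' is reported exactly as written.
    stored=$(head -n 1 "$file")
    actual=$(hostname)
    if [ "$stored" = "$actual" ]; then
        echo "OK: $file matches hostname ($actual)"
        return 0
    fi
    echo "MISMATCH: $file contains '$stored', hostname is '$actual'"
    return 1
}
```

Run it on the server as check_server_name /var/spool/pbs/server_name if you are using pbs rather than torque.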

From your description, it looks like the logs are telling you that
the contents of /var/spool/pbs/server_name
are just the string 'server name',
not the name or the IP address.

  I don't know how this got changed. You might check your
initialization scripts, e.g. in /etc/rc3.d.
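
One way to look for a script that rewrites the file is to list the init scripts that mention server_name. A minimal sketch, assuming the scripts live directly in /etc/rc3.d (the function name list_server_name_refs is hypothetical, and a match only means the script mentions the file, not that it overwrites it):

```shell
#!/bin/sh
# list_server_name_refs [DIR]
# Hypothetical helper: prints the files in DIR (default: /etc/rc3.d) that
# contain the string 'server_name'. -l lists file names only, and -s hides
# errors for subdirectories and unreadable files.
list_server_name_refs() {
    dir="${1:-/etc/rc3.d}"
    grep -ls 'server_name' "$dir"/*
}
```

Any script it prints is worth reading by hand to see whether it writes to server_name at boot.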

Hi Everyone,

I'm using torque-2.1.8 with maui-3.2.6p17.
The system worked fine before, but a few days ago, when I tried to submit a job, it never ran even though I'm sure all the nodes are available. I could still force the job to run using 'qrun'.
The strange things are:
When I run 'showq', it shows
0 of    0 Processors Active (0.00%)

I can use the commands 'pbsnodes -a', 'qstat' and 'qmgr' on the nodes, but not on the server. The following is the output on the server:
pbsnodes -a
No default server name.
pbsnodes: cannot connect to server , error=15034

qstat:
No default server name.
qstat: cannot connect to server (null) (errno=15034)

qmgr
No default server name.
qmgr: cannot connect to server

Also, when I use the 'xpbs' and 'xpbsmon' commands on the nodes, they show all the correct information, like the server name and queues. But when I tried them on the server, they complained: 'No Permission.\nxpbs_datadump: Can not connect to server (15007)'

So I looked at /var/spool/pbs/server_logs:
PBS_Server;Svr;WARNING;ALERT: unable to contact node server name
PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler - port 15004

And in the log /var/spool/maui/
ERROR:    cannot connect to PBS server 'server name'  rc: -1 (errno: 15007)
ALERT:    cannot re-initialize PBS interface
10/23 11:12:02 ALERT: cannot load cluster resources on RM (RM '0' failed in function 'clusterquery')
10/23 11:12:02 WARNING:  no resources detected

The only thing I can think of is that the server was reset a week ago, but I'm sure the pbs_server, pbs_mom and maui services are all running again. The 'server_name' file is in /var/spool/pbs.
Any ideas?
Thank you very much.

Kandy

_______________________________________________
torqueusers mailing list
torqueusers@xxxxxxxxxxxxxxxx
http://www.supercluster.org/mailman/listinfo/torqueusers







This error occurs when neither the server_name file nor the PBS_DEFAULT environment variable points to pbs_server. Try setting one of them.
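
Setting them could look roughly like the sketch below. The function name set_pbs_server is made up for illustration, the default path assumes the /var/spool/pbs layout mentioned earlier in the thread, and writing to that file normally needs root.

```shell
#!/bin/sh
# set_pbs_server NAME [FILE]
# Hypothetical helper: writes NAME into the server_name file
# (default: /var/spool/pbs/server_name) and exports PBS_DEFAULT
# for the current shell session.
set_pbs_server() {
    name="$1"
    file="${2:-/var/spool/pbs/server_name}"
    if [ -z "$name" ]; then
        echo "usage: set_pbs_server NAME [FILE]" >&2
        return 2
    fi
    # The file holds a single line with the server's host name.
    printf '%s\n' "$name" > "$file" || return 1
    PBS_DEFAULT="$name"
    export PBS_DEFAULT
}
```

For example, set_pbs_server "$(hostname)" on the server, then try qstat again. Whether pbs_server and maui must be restarted to pick up the change is an assumption worth verifying on your install.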

--vinod

