
Re: com.cloud.agent.api.CheckRouterCommand timeout


Hi Daan,

thanks for your reply.

The latest occurrence of our VRs going to UNKNOWN resolved itself 24 hours
after it had occurred. Nevertheless, I would appreciate some insight into
how the checkRouter command is handled, as I expect the problem to come
back.

On 21.06.2018 at 10:39, Daan Hoogland wrote:
> Melanie, this depends a bit on the type of hypervisor. The command executes
> the checkrouter.sh script on the virtual router if it reaches it, but it
> seems your problem is before that. I would look at the network first and
> follow the path that the execution takes for your hypervisor type.

With Stephan's help I worked out the following guess at the path of
connections for the checkrouter command. Could someone please correct
me if my guess is wrong? ;)

 x The management node connects to the XenServer hypervisor host via the
management network on port 22 by SSH
 x On the hypervisor host, the wrapper script
"/opt/cloud/bin/router_proxy.sh" is used to call scripts on system VMs
via link-local IP and port 3922
 x On the VR, the script "/opt/cloud/bin/checkrouter.sh" does the actual
check.
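
To narrow down which hop is slow, each TCP leg of the guessed path could
be probed on its own. The following is only a minimal sketch, assuming
Python 3 is available on the machine where the hop starts and that the
ports are the ones named above (22 and 3922). The function name and the
example addresses are hypothetical and not part of CloudStack.

```python
import socket
import time


def time_tcp_connect(host, port, timeout=5.0):
    """Return the seconds needed to open a TCP connection, or None on failure.

    This only measures the TCP handshake. It does not log in and does not
    run any script on the remote side.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return None


# Example calls (hypothetical addresses, replace with real ones):
#   time_tcp_connect("xenserver.example.org", 22)  # run on the management node
#   time_tcp_connect("169.254.0.10", 3922)         # run on the hypervisor host
```

A result of None, or a value close to the timeout, on one leg would point
at the network for that leg. Fast results on both legs would suggest the
delay is in the command execution rather than in connectivity.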

In our case the API call times out with these log messages:
 x Operation timed out: Commands 1063975411966525473 to Host 29 timed
out after 60
 x Unable to update router r-2595-VM's status
 x Redundant virtual router (name: r-2595-VM, id: 2595)  just switch
from BACKUP to UNKNOWN
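
To see how often and for which hosts this happens, the management log
could be scanned for these two message types. This is a rough sketch
that assumes the messages appear in the log with the wording quoted
above; the function name and the returned tuple layout are my own choice.

```python
import re

# Patterns built from the message texts quoted above. Whitespace is matched
# loosely so that wrapped lines and double spaces are tolerated.
TIMEOUT_RE = re.compile(
    r"Operation timed out: Commands\s+(?P<seq>\d+)\s+to Host\s+(?P<host>\d+)"
    r"\s+timed\s+out\s+after\s+(?P<limit>\d+)"
)
SWITCH_RE = re.compile(
    r"Redundant virtual router \(name:\s*(?P<name>[\w-]+),\s*id:\s*(?P<id>\d+)\)"
    r"\s+just switch\s+from\s+(?P<old>\w+)\s+to\s+(?P<new>\w+)"
)


def summarize(log_text):
    """Return (timeouts, switches) found in the given log text.

    timeouts: list of (host id, command sequence, time limit)
    switches: list of (router name, router id, old state, new state)
    """
    timeouts = [
        (int(m["host"]), int(m["seq"]), int(m["limit"]))
        for m in TIMEOUT_RE.finditer(log_text)
    ]
    switches = [
        (m["name"], int(m["id"]), m["old"], m["new"])
        for m in SWITCH_RE.finditer(log_text)
    ]
    return timeouts, switches
```

Grouping the timeouts by host id would show whether only one hypervisor
host is affected or several.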

To me it seems that this timeout occurs while ACS management is
waiting for the API call to return. At which stage the answer is delayed
(management host <-> virtualization host, or virtualization host <-> VR)
is unclear to me. (SSH login from the virtualization host to the VR via
link-local works all the time.)
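
One way to find the slow stage would be to run the same command by hand
at each stage and time it against the limit from the log. The sketch
below assumes the "60" in the log message means seconds, which the log
itself does not state. The function name is hypothetical, and the command
to run is passed in by the caller and not invented here.

```python
import subprocess
import time


def run_with_deadline(argv, deadline=60.0):
    """Run argv and report whether it finished within the deadline.

    Returns (finished, seconds, returncode). When the deadline is
    exceeded, the child process is killed and returncode is None.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=deadline,
        )
    except subprocess.TimeoutExpired:
        return False, time.monotonic() - start, None
    return True, time.monotonic() - start, proc.returncode


# Example: pass the command to time as an argument list, e.g. the call
# that is normally made at the stage under test.
#   finished, seconds, code = run_with_deadline(["some", "command"])
```

If the command finishes quickly on the VR and on the virtualization host,
but not when started from the management host, the delay would be in the
first stage.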

It is also unclear to me why both VRs of the respective network stay in
UNKNOWN for 24 hours and remain accessible via link-local, but come back
immediately after a reboot.

I am grateful for any suggestions or explanations on this topic and will
investigate further as soon as the problem comes back.

A portion of our management log for the latest occurrence of the problem
is attached to this email.

Greetings,

Melanie

> 
> On Wed, Jun 20, 2018 at 1:53 PM, Melanie Desaive <
> m.desaive@xxxxxxxxxxxxxxxxxxx> wrote:
> 
>> Hi all,
>>
>> we have a recurring problem with our virtual routers. According to the
>> log messages, com.cloud.agent.api.CheckRouterCommand runs into a
>> timeout and the router therefore switches to UNKNOWN.
>>
>> All network traffic through the routers is still working. They can be
>> accessed by their link-local IP addresses, and the configuration looks
>> good at first sight. But configuration changes through the CloudStack
>> API no longer reach the routers. A reboot fixes the problem.
>>
>> I would like to investigate a little further but lack understanding
>> about how the checkRouter command is trying to access the virtual router.
>>
>> Could someone point me to some relevant documentation or give a short
>> overview of how the connection from CS-Management is made and where
>> such a timeout could occur?
>>
>> As background information, the sequence from the management log looks
>> roughly like this:
>>
>> ---
>>
>>  x Every few seconds the com.cloud.agent.api.CheckRouterCommand returns
>> a state BACKUP or MASTER correctly
>>  x When the problem occurs the log messages change. Some snippets below
>>
>>  x ... Waiting some more time because this is the current command
>>  x ... Waiting some more time because this is the current command
>>  x Could not find exception:
>> com.cloud.exception.OperationTimedoutException in error code list for
>> exceptions
>>  x Timed out on Seq 28-2352567855348137104
>>  x Seq 28-2352567855348137104: Cancelling.
>>  x Operation timed out: Commands 2352567855348137104 to Host 28 timed
>> out after 60
>>  x Unable to update router r-2594-VM's status
>>  x Redundant virtual router (name: r-2594-VM, id: 2594)  just switch
>> from MASTER to UNKNOWN
>>
>>  x These error messages are now repeated for each following
>> CheckRouterCommand until the virtual router is rebooted
>>
>>
>> Greetings,
>>
>> Melanie

-- 
--

Heinlein Support GmbH
Linux: Akademie - Support - Hosting

http://www.heinlein-support.de
Tel: 030 / 40 50 51 - 0
Fax: 030 / 40 50 51 - 19

Zwangsangaben lt. §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein  -- Sitz: Berlin