Re: com.cloud.agent.api.CheckRouterCommand timeout



On 21.06.2018 at 17:08, Daan Hoogland wrote:
> makes sense, well let's hope all breaks soon ;)

I am sure it will break! :D

And then I will get back to you with more questions!

Thanks a lot for taking the time!

> 
> On Thu, Jun 21, 2018 at 2:15 PM, Melanie Desaive <
> m.desaive@xxxxxxxxxxxxxxxxxxx> wrote:
> 
>> Hi Daan,
>>
>> On 21.06.2018 at 15:29, Daan Hoogland wrote:
>>> Melanie, attachments get deleted for this list. Your assumption for the
>>> comm path is right for Xen. Did you try to execute the script as it is
>>> called by the proxy script from the host, and capture the return? We had a
>>> bad problem with getting the template version in the past on Xen; this
>>> might be similar. That was due to processing of the returned string in the
>>> script.
>>
>> I called both stages of the script manually, but at a time when everything
>> was working as expected and the routers were back to MASTER and BACKUP.
>>
>> Looked like:
>>
>> [root@acs-compute-5 ~]# /opt/cloud/bin/router_proxy.sh checkrouter.sh
>> 169.254.1.178
>> Status: BACKUP
>>
>> root@r-2595-VM:~# /opt/cloud/bin/checkrouter.sh
>> Status: BACKUP
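
One way to follow the suggestion to capture the return is sketched below. It assumes the script output always contains a line of the form "Status: <STATE>" as in the two calls above; the function name parse_router_status is hypothetical and not part of CloudStack.

```shell
#!/bin/sh
# Extract the state word from checkrouter.sh-style output read on stdin
# (e.g. "Status: BACKUP"). Prints UNPARSEABLE if no such line is present.
parse_router_status() {
    state=$(sed -n 's/^Status: *\([A-Z][A-Z]*\).*$/\1/p' | head -n 1)
    echo "${state:-UNPARSEABLE}"
}

# Possible use on the hypervisor host (command and IP taken from the thread):
#   out=$(/opt/cloud/bin/router_proxy.sh checkrouter.sh 169.254.1.178)
#   rc=$?
#   echo "rc=$rc state=$(printf '%s\n' "$out" | parse_router_status)"
```

Recording both the exit code and the parsed state makes it visible whether a failure comes from the call itself or from an unexpected output string.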
>>
>>
>>>
>>> On Thu, Jun 21, 2018 at 1:16 PM, Melanie Desaive <
>>> m.desaive@xxxxxxxxxxxxxxxxxxx> wrote:
>>>
>>>> Hi Daan,
>>>>
>>>> thanks for your reply.
>>>>
>>>> The latest occurrence of our VRs going to UNKNOWN resolved itself 24 hours
>>>> after it had occurred. Nevertheless I would appreciate some insight into
>>>> how the checkRouter command is handled, as I expect the problem to come
>>>> back again.
>>>>
>>>> On 21.06.2018 at 10:39, Daan Hoogland wrote:
>>>>> Melanie, this depends a bit on the type of hypervisor. The command executes
>>>>> the checkrouter.sh script on the virtual router if it reaches it, but it
>>>>> seems your problem is before that. I would look at the network first and
>>>>> follow the path that the execution takes for your hypervisor type.
>>>>
>>>> With Stephan's help I worked out the following guess for the path of
>>>> connections for the checkrouter command. Could someone please correct
>>>> me if my guess is wrong? ;)
>>>>
>>>>  x The management node connects to the XenServer hypervisor host via the
>>>> management network on port 22 by SSH
>>>>  x On the hypervisor host the wrapper script
>>>> "/opt/cloud/bin/router_proxy.sh" is used to call scripts on system VMs
>>>> via link-local IP and port 3922
>>>>  x On the VR the script "/opt/cloud/bin/checkrouter.sh" does the actual
>>>> check.
>>>>
>>>> In our case the API call times out with log messages
>>>>  x Operation timed out: Commands 1063975411966525473 to Host 29 timed
>>>> out after 60
>>>>  x Unable to update router r-2595-VM's status
>>>>  x Redundant virtual router (name: r-2595-VM, id: 2595)  just switch
>>>> from BACKUP to UNKNOWN
>>>>
>>>> To me it seems that this is a timeout that occurs while ACS management is
>>>> waiting for the API call to return. At what stage (management host <->
>>>> virtualization host) or (virtualization host <-> VR) the answer is
>>>> delayed is unclear to me. (SSH login from the virtualization host to the VR
>>>> via link-local is working all the time.)
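
To narrow down which leg is slow, each stage could be timed separately the next time the problem occurs. The following is only a sketch: time_stage is a hypothetical helper, the SSH user root is an assumption, and the host name and link-local IP are the ones from the manual calls in this thread.

```shell
#!/bin/sh
# Run one stage of the assumed check path and report its exit code and
# duration in whole seconds. The stage's own output is sent to stderr so
# that stdout carries only the summary line.
# Usage: time_stage LABEL COMMAND [ARGS...]
time_stage() {
    label=$1
    shift
    start=$(date +%s)
    "$@" 1>&2
    rc=$?
    end=$(date +%s)
    echo "$label rc=$rc elapsed=$((end - start))s"
}

# On the hypervisor host (covers only the host <-> VR leg):
#   time_stage host-to-vr /opt/cloud/bin/router_proxy.sh checkrouter.sh 169.254.1.178
# On the management node (covers both legs through SSH):
#   time_stage mgmt-to-vr ssh root@acs-compute-5 \
#       /opt/cloud/bin/router_proxy.sh checkrouter.sh 169.254.1.178
```

If the first call returns quickly while the second approaches the 60 reported in the log, the delay would most likely sit between the management node and the hypervisor host.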
>>>>
>>>> It is also unclear to me why both VRs of the respective network stay in
>>>> UNKNOWN for 24 hours while remaining accessible via link-local, yet come
>>>> back immediately after a reboot.
>>>>
>>>> I am happy for any suggestions or explanations on this topic and will
>>>> investigate further as soon as the problem comes back.
>>>>
>>>> A portion of our management log for the latest occurrence of the problem
>>>> is attached to this email.
>>>>
>>>> Greetings,
>>>>
>>>> Melanie
>>>>
>>>>>
>>>>> On Wed, Jun 20, 2018 at 1:53 PM, Melanie Desaive <
>>>>> m.desaive@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> we have a recurring problem with our virtual routers. Judging by the log
>>>>>> messages, com.cloud.agent.api.CheckRouterCommand runs into a timeout and
>>>>>> the router state therefore switches to UNKNOWN.
>>>>>>
>>>>>> All network traffic through the routers is still working. They can be
>>>>>> accessed by their link-local IP addresses, and the configuration looks
>>>>>> good at first sight. But configuration changes through the CloudStack API
>>>>>> no longer reach the routers. A reboot fixes the problem.
>>>>>>
>>>>>> I would like to investigate a little further but lack understanding
>>>>>> of how the checkRouter command tries to access the virtual router.
>>>>>>
>>>>>> Could someone point me to some relevant documentation or give a short
>>>>>> overview of how the connection from CS-Management is made and where such
>>>>>> a timeout could occur?
>>>>>>
>>>>>> As background information, the sequence from the management log looks
>>>>>> roughly like this:
>>>>>>
>>>>>> ---
>>>>>>
>>>>>>  x Every few seconds the com.cloud.agent.api.CheckRouterCommand returns
>>>>>> a state BACKUP or MASTER correctly
>>>>>>  x When the problem occurs the log messages change. Some snippets below
>>>>>>
>>>>>>  x ... Waiting some more time because this is the current command
>>>>>>  x ... Waiting some more time because this is the current command
>>>>>>  x Could not find exception:
>>>>>> com.cloud.exception.OperationTimedoutException in error code list for
>>>>>> exceptions
>>>>>>  x Timed out on Seq 28-2352567855348137104
>>>>>>  x Seq 28-2352567855348137104: Cancelling.
>>>>>>  x Operation timed out: Commands 2352567855348137104 to Host 28 timed
>>>>>> out after 60
>>>>>>  x Unable to update router r-2594-VM's status
>>>>>>  x Redundant virtual router (name: r-2594-VM, id: 2594)  just switch
>>>>>> from MASTER to UNKNOWN
>>>>>>
>>>>>>  x Those error messages are now repeated for each following
>>>>>> CheckRouterCommand until the virtual router is rebooted
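
To see when the timeouts start and which hosts are affected, the messages quoted above could be counted per host. This is a rough sketch: count_timeouts_by_host is a hypothetical helper, it assumes each "Operation timed out" message sits on a single line of the real log (the snippets above are wrapped by the mail client), and the log path in the usage comment is an assumed default location.

```shell
#!/bin/sh
# Count "Operation timed out" messages per host id in a log read from stdin.
# Prints one "host=<id> timeouts=<count>" line per host, sorted by host id.
count_timeouts_by_host() {
    grep 'Operation timed out: Commands' |
        sed -n 's/.*Commands [0-9][0-9]* to Host \([0-9][0-9]*\) timed out.*/\1/p' |
        sort -n | uniq -c | awk '{print "host=" $2 " timeouts=" $1}'
}

# Possible use on the management node (path is an assumption, adjust as needed):
#   count_timeouts_by_host < /var/log/cloudstack/management/management-server.log
```

A count that is concentrated on one host id would point at that hypervisor host rather than at an individual router.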
>>>>>>
>>>>>>
>>>>>> Greetings,
>>>>>>
>>>>>> Melanie
>>>>>>
>>>>>> --
>>>>>> --
>>>>>>
>>>>>> Heinlein Support GmbH
>>>>>> Linux: Akademie - Support - Hosting
>>>>>>
>>>>>> http://www.heinlein-support.de
>>>>>> Tel: 030 / 40 50 51 - 0
>>>>>> Fax: 030 / 40 50 51 - 19
>>>>>>
>>>>>> Mandatory disclosures pursuant to §35a GmbHG:
>>>>>> HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
>>>>>> Managing director: Peer Heinlein  -- Registered office: Berlin
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
> 
> 
> 

-- 
--

Heinlein Support GmbH
Linux: Akademie - Support - Hosting

http://www.heinlein-support.de
Tel: 030 / 40 50 51 - 0
Fax: 030 / 40 50 51 - 19

Mandatory disclosures pursuant to §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Managing director: Peer Heinlein  -- Registered office: Berlin