
Re: com.cloud.agent.api.CheckRouterCommand timeout


Melanie, attachments are stripped on this list. Your assumption about the
communication path is right for Xen. Did you try to execute the script from
the host, the way the proxy script calls it, and capture what it returns? We
had a bad problem with getting the template version on Xen in the past, and
this might be similar. That one was due to the processing of the returned
string in the script.
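
One way to capture the return is a small wrapper like the one below. It is
only a sketch and assumes Python 3 is available on the machine where you run
it; the name run_and_capture is made up for illustration. It runs whatever
command line you pass in, with a timeout, and records the exit code, the
elapsed time, and the raw stdout/stderr, so a stray character in the returned
string becomes visible.

```python
import subprocess
import sys
import time


def _to_text(data):
    """Normalise captured output to str (it may be None or bytes on timeout)."""
    if data is None:
        return ""
    if isinstance(data, bytes):
        return data.decode("utf-8", errors="replace")
    return data


def run_and_capture(argv, timeout_s=60.0):
    """Run argv and return exit code, output, and elapsed time as a dict.

    A timeout is reported in the result instead of being raised, so slow
    runs and failing runs can be compared side by side.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout_s,
        )
        result = {
            "timed_out": False,
            "returncode": proc.returncode,
            "stdout": _to_text(proc.stdout),
            "stderr": _to_text(proc.stderr),
        }
    except subprocess.TimeoutExpired as exc:
        result = {
            "timed_out": True,
            "returncode": None,
            "stdout": _to_text(exc.stdout),
            "stderr": _to_text(exc.stderr),
        }
    result["elapsed_s"] = time.monotonic() - start
    return result


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the script name is taken as the command to run.
    outcome = run_and_capture(sys.argv[1:])
    print("timed out :", outcome["timed_out"])
    print("exit code :", outcome["returncode"])
    print("elapsed   : %.2f s" % outcome["elapsed_s"])
    # repr() exposes trailing whitespace or unexpected characters.
    print("stdout    :", repr(outcome["stdout"]))
    print("stderr    :", repr(outcome["stderr"]))
```

Pass it the exact command line the host uses to reach the VR through
router_proxy.sh. I am deliberately not spelling out the arguments here;
take them from the script itself.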

On Thu, Jun 21, 2018 at 1:16 PM, Melanie Desaive <
m.desaive@xxxxxxxxxxxxxxxxxxx> wrote:

> Hi Daan,
>
> thanks for your reply.
>
> The latest occurrence of our VRs going to UNKNOWN resolved itself 24 hours
> after it had occurred. Nevertheless I would appreciate some insight into
> how the checkRouter command is handled, as I expect the problem to come
> back.
>
> On 21.06.2018 at 10:39, Daan Hoogland wrote:
> > Melanie, this depends a bit on the type of hypervisor. The command
> > executes the checkrouter.sh script on the virtual router if it reaches
> > it, but it seems your problem is before that. I would look at the
> > network first and follow the path that the execution takes for your
> > hypervisor type.
>
> With Stephan's help I came up with the following guess for the path of
> connections for the checkrouter command. Could someone please correct
> me if my guess is wrong? ;)
>
>  x The management node connects to the XenServer hypervisor host via the
> management network on port 22 by SSH
>  x On the hypervisor host the wrapper script
> "/opt/cloud/bin/router_proxy.sh" is used to call scripts on system VMs
> via link-local IP and port 3922
>  x On the VR the script "/opt/cloud/bin/checkrouter.sh" does the actual
> check.
>
> In our case the API call times out with log messages
>  x Operation timed out: Commands 1063975411966525473 to Host 29 timed
> out after 60
>  x Unable to update router r-2595-VM's status
>  x Redundant virtual router (name: r-2595-VM, id: 2595)  just switch
> from BACKUP to UNKNOWN
>
> To me it seems that this is a timeout that occurs while ACS management is
> waiting for the API call to return. At which stage the answer is delayed
> (management host <-> virtualization host, or virtualization host <-> VR)
> is unclear to me. (SSH login from the virtualization host to the VR via
> link-local works all the time.)
>
> It is also unclear to me why both VRs of the respective network stay in
> UNKNOWN for 24 hours while remaining accessible via link-local, but come
> back immediately after a reboot.
>
> I would be grateful for any suggestions or explanations on this topic and
> will investigate further as soon as the problem comes back.
>
> A portion of our management log for the latest occurrence of the problem
> is attached to this email.
>
> Greetings,
>
> Melanie
>
> >
> > On Wed, Jun 20, 2018 at 1:53 PM, Melanie Desaive <
> > m.desaive@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> >> Hi all,
> >>
> >> we have a recurring problem with our virtual routers. By the log
> >> messages it seems that com.cloud.agent.api.CheckRouterCommand runs into
> >> a timeout and therefore switches to UNKNOWN.
> >>
> >> All network traffic through the routers is still working. They can be
> >> accessed by their link-local IP addresses, and the configuration looks
> >> good at first sight. But configuration changes made through the
> >> CloudStack API no longer reach the routers. A reboot fixes the problem.
> >>
> >> I would like to investigate a little further but lack understanding
> >> of how the checkRouter command tries to access the virtual router.
> >>
> >> Could someone point me to some relevant documentation or give a short
> >> overview of how the connection from CS-Management is made and where
> >> such a timeout could occur?
> >>
> >> As background information, the sequence from the management log looks
> >> roughly like this:
> >>
> >> ---
> >>
> >>  x Every few seconds the com.cloud.agent.api.CheckRouterCommand returns
> >> a state BACKUP or MASTER correctly
> >>  x When the problem occurs the log messages change. Some snippets below
> >>
> >>  x ... Waiting some more time because this is the current command
> >>  x ... Waiting some more time because this is the current command
> >>  x Could not find exception:
> >> com.cloud.exception.OperationTimedoutException in error code list for
> >> exceptions
> >>  x Timed out on Seq 28-2352567855348137104
> >>  x Seq 28-2352567855348137104: Cancelling.
> >>  x Operation timed out: Commands 2352567855348137104 to Host 28 timed
> >> out after 60
> >>  x Unable to update router r-2594-VM's status
> >>  x Redundant virtual router (name: r-2594-VM, id: 2594)  just switch
> >> from MASTER to UNKNOWN
> >>
> >>  x Those error messages are then repeated for each following
> >> CheckRouterCommand until the virtual router is rebooted
> >>
> >>
> >> Greetings,
> >>
> >> Melanie
> >>
> >> --
> >>
> >> Heinlein Support GmbH
> >> Linux: Akademie - Support - Hosting
> >>
> >> http://www.heinlein-support.de
> >> Tel: 030 / 40 50 51 - 0
> >> Fax: 030 / 40 50 51 - 19
> >>
> >> Mandatory disclosures per §35a GmbHG:
> >> HRB 93818 B / District Court (Amtsgericht) Berlin-Charlottenburg,
> >> Managing Director: Peer Heinlein  -- Registered office: Berlin
> >>
> >
> >
> >
>
>



-- 
Daan