
Re: com.cloud.agent.api.CheckRouterCommand timeout


Makes sense. Well, let's hope it all breaks soon ;)

On Thu, Jun 21, 2018 at 2:15 PM, Melanie Desaive <
m.desaive@xxxxxxxxxxxxxxxxxxx> wrote:

> Hi Daan,
>
> On 21.06.2018 at 15:29, Daan Hoogland wrote:
> > Melanie, attachments get deleted for this list. Your assumption for the
> > comm path is right for xen. Did you try to execute the script as it is
> > called by the proxy script from the host, and capture the return? We had
> > a bad problem with getting the template version in the past on xen; this
> > might be similar. That was due to processing of the returned string in
> > the script.
>
> I called both stages of the script manually, but at a time when everything
> was working as expected and the routers were back to MASTER and BACKUP.
>
> It looked like this:
>
> [root@acs-compute-5 ~]# /opt/cloud/bin/router_proxy.sh checkrouter.sh 169.254.1.178
> Status: BACKUP
>
> root@r-2595-VM:~# /opt/cloud/bin/checkrouter.sh
> Status: BACKUP
>
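A minimal sketch of how the return could be captured when repeating the two
calls above the next time the routers go to UNKNOWN. The helper names
parse_router_state and run_and_report are made up for illustration and are
not part of CloudStack; the only assumption about checkrouter.sh is the
"Status: <STATE>" line shown in the output above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of CloudStack.

# parse_router_state: read checkrouter.sh-style output on stdin and print the
# state from the first "Status: <STATE>" line, or UNKNOWN if there is none.
parse_router_state() {
    state=$(sed -n 's/^Status:[[:space:]]*\([A-Z]*\).*$/\1/p' | head -n 1)
    echo "${state:-UNKNOWN}"
}

# run_and_report CMD...: run CMD, capture stdout and stderr together, then
# print the exit code and the state parsed from the captured output.
run_and_report() {
    out=$("$@" 2>&1)
    rc=$?
    printf 'exit=%s state=%s\n' "$rc" "$(printf '%s\n' "$out" | parse_router_state)"
}
```

On the hypervisor host this could be called as
"run_and_report timeout 60 /opt/cloud/bin/router_proxy.sh checkrouter.sh 169.254.1.178".
The 60 only mirrors the "timed out after 60" log message and assumes
seconds. With GNU timeout, an exit code of 124 means the call itself hung.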
>
> >
> > On Thu, Jun 21, 2018 at 1:16 PM, Melanie Desaive <
> > m.desaive@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> >> Hi Daan,
> >>
> >> thanks for your reply.
> >>
> >> The latest occurrence of our VRs going to UNKNOWN resolved itself 24 hours
> >> after it had occurred. Nevertheless, I would appreciate some insight into
> >> how the checkRouter command is handled, as I expect the problem to come
> >> back.
> >>
> >> On 21.06.2018 at 10:39, Daan Hoogland wrote:
> >>> Melanie, this depends a bit on the type of hypervisor. The command
> >>> executes the checkrouter.sh script on the virtual router if it reaches it,
> >>> but it seems your problem is before that. I would look at the network
> >>> first and follow the path that the execution takes for your hypervisor
> >>> type.
> >>
> >> With Stephan's help I worked out the following guess for the path of
> >> connections for the checkrouter command. Could someone please correct
> >> me if my guess is wrong? ;)
> >>
> >>  x The management node connects to the XenServer hypervisor host via the
> >> management network on port 22 by SSH
> >>  x On the hypervisor host, the wrapper script
> >> "/opt/cloud/bin/router_proxy.sh" is used to call scripts on system VMs
> >> via link-local IP and port 3922
> >>  x On the VR, the script "/opt/cloud/bin/checkrouter.sh" does the actual
> >> check.
> >>
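The three hops listed above could be timed separately to see where the delay
sits. This is only a sketch: time_stage is a made-up helper, and
HYPERVISOR_HOST in the commented calls is a placeholder, not a name from this
thread. The link-local IP is the one from the manual test quoted earlier.

```shell
#!/bin/sh
# Hypothetical helper, not part of CloudStack.

# time_stage LABEL CMD...: run CMD, discard its output, and print the label,
# the exit code and the elapsed wall-clock time in whole seconds.
time_stage() {
    label=$1
    shift
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    rc=$?
    end=$(date +%s)
    echo "$label exit=$rc seconds=$((end - start))"
}

# Example calls, one per hop (run each on the machine named in the comment):
#   management node:  time_stage mgmt-to-host ssh root@HYPERVISOR_HOST true
#   hypervisor host:  time_stage host-to-vr /opt/cloud/bin/router_proxy.sh checkrouter.sh 169.254.1.178
#   virtual router:   time_stage on-vr /opt/cloud/bin/checkrouter.sh
```

A hop that takes close to 60 seconds, or never returns, would be the first
place to look.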
> >> In our case the API call times out with these log messages:
> >>  x Operation timed out: Commands 1063975411966525473 to Host 29 timed
> >> out after 60
> >>  x Unable to update router r-2595-VM's status
> >>  x Redundant virtual router (name: r-2595-VM, id: 2595)  just switch
> >> from BACKUP to UNKNOWN
> >>
> >> To me it seems that this is a timeout that occurs when ACS management is
> >> waiting for the API call to return. At what stage, (management host <->
> >> virtualization host) or (virtualization host <-> VR), the answer is
> >> delayed is unclear to me. (SSH login from the virtualization host to the VR
> >> via link-local is working all the time.)
> >>
> >> It is also unclear to me why both VRs of the respective network stay in
> >> UNKNOWN for 24 hours and are accessible via link-local, but come back
> >> immediately after a reboot.
> >>
> >> I would be grateful for any suggestions or explanations on this topic and
> >> will investigate further as soon as the problem comes back.
> >>
> >> A portion of our management log for the latest occurrence of the problem
> >> is attached to this email.
> >>
> >> Greetings,
> >>
> >> Melanie
> >>
> >>>
> >>> On Wed, Jun 20, 2018 at 1:53 PM, Melanie Desaive <
> >>> m.desaive@xxxxxxxxxxxxxxxxxxx> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> we have a recurring problem with our virtual routers. According to the
> >>>> log messages, com.cloud.agent.api.CheckRouterCommand runs into a timeout
> >>>> and the router state therefore switches to UNKNOWN.
> >>>>
> >>>> All network traffic through the routers is still working. They can be
> >>>> accessed by their link-local IP addresses, and the configuration looks
> >>>> good at first sight. But configuration changes through the CloudStack API
> >>>> no longer reach the routers. A reboot fixes the problem.
> >>>>
> >>>> I would like to investigate a little further but lack an understanding
> >>>> of how the checkRouter command tries to access the virtual router.
> >>>>
> >>>> Could someone point me to some relevant documentation or give a short
> >>>> overview of how the connection from CS-Management is made and where such
> >>>> a timeout could occur?
> >>>>
> >>>> As background information, the sequence from the management log looks
> >>>> roughly like this:
> >>>>
> >>>> ---
> >>>>
> >>>>  x Every few seconds the com.cloud.agent.api.CheckRouterCommand
> >>>> correctly returns a state of BACKUP or MASTER
> >>>>  x When the problem occurs the log messages change. Some snippets
> >>>> below:
> >>>>
> >>>>  x ... Waiting some more time because this is the current command
> >>>>  x ... Waiting some more time because this is the current command
> >>>>  x Could not find exception:
> >>>> com.cloud.exception.OperationTimedoutException in error code list for
> >>>> exceptions
> >>>>  x Timed out on Seq 28-2352567855348137104
> >>>>  x Seq 28-2352567855348137104: Cancelling.
> >>>>  x Operation timed out: Commands 2352567855348137104 to Host 28 timed
> >>>> out after 60
> >>>>  x Unable to update router r-2594-VM's status
> >>>>  x Redundant virtual router (name: r-2594-VM, id: 2594)  just switch
> >>>> from MASTER to UNKNOWN
> >>>>
> >>>>  x Those error messages are then repeated for each following
> >>>> CheckRouterCommand until the virtual router is rebooted
> >>>>
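To see which hosts and sequences are affected, the "Timed out on Seq"
messages quoted above could be pulled out of the management log. A rough
sketch: list_timed_out_seqs is a made-up helper, and it assumes only the
"Timed out on Seq <host>-<seq>" wording shown in the snippets.

```shell
#!/bin/sh
# Hypothetical helper, not part of CloudStack.

# list_timed_out_seqs: read log lines on stdin and print "<host> <seq>" for
# every line containing "Timed out on Seq <host>-<seq>". Other lines are
# ignored.
list_timed_out_seqs() {
    sed -n 's/.*Timed out on Seq \([0-9][0-9]*\)-\([0-9][0-9]*\).*/\1 \2/p'
}
```

Fed from the management log (the file name depends on the installation), the
result can be piped through "cut -d' ' -f1 | sort | uniq -c" to count
timeouts per host ID.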
> >>>>
> >>>> Greetings,
> >>>>
> >>>> Melanie
> >>>>
> >>>> --
> >>>> --
> >>>>
> >>>> Heinlein Support GmbH
> >>>> Linux: Akademie - Support - Hosting
> >>>>
> >>>> http://www.heinlein-support.de
> >>>> Tel: 030 / 40 50 51 - 0
> >>>> Fax: 030 / 40 50 51 - 19
> >>>>
> >>>> Mandatory disclosures per §35a GmbHG:
> >>>> HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> >>>> Managing Director: Peer Heinlein  -- Registered office: Berlin
> >>>>
> >>>
> >>>
> >>>
> >>
> >
> >
> >
>



-- 
Daan