com.cloud.agent.api.CheckRouterCommand timeout

Hi all,

we have a recurring problem with our virtual routers. According to the
log messages, com.cloud.agent.api.CheckRouterCommand runs into a timeout
and the router state therefore switches to UNKNOWN.

All network traffic through the routers is still working. They can be
accessed by their link-local IP addresses, and the configuration looks
good at first sight. But configuration changes made through the
CloudStack API no longer reach the routers. A reboot fixes the problem.

I would like to investigate a little further but lack an understanding
of how the CheckRouterCommand tries to access the virtual router.

Could someone point me to some relevant documentation or give a short
overview of how the connection from CS-Management is made and where
such a timeout could occur?

As background information, the sequence from the management log looks
roughly like this:


 x Every few seconds the com.cloud.agent.api.CheckRouterCommand
correctly returns the state BACKUP or MASTER
 x When the problem occurs, the log messages change. Some snippets:

 x ... Waiting some more time because this is the current command
 x ... Waiting some more time because this is the current command
 x Could not find exception:
com.cloud.exception.OperationTimedoutException in error code list for
 x Timed out on Seq 28-2352567855348137104
 x Seq 28-2352567855348137104: Cancelling.
 x Operation timed out: Commands 2352567855348137104 to Host 28 timed
out after 60
 x Unable to update router r-2594-VM's status
 x Redundant virtual router (name: r-2594-VM, id: 2594)  just switch

 x These error messages are then repeated for each subsequent
CheckRouterCommand until the virtual router is rebooted
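
To narrow down when the timeouts start and which host they hit, the
"Timed out on Seq" lines can be pulled out of the management log with a
small script. This is only a minimal sketch: the pattern is based solely
on the snippets above (host id, then sequence number, as the "to Host 28"
line suggests), the function names are made up for illustration, and the
real log lines may carry extra prefixes or differ between versions.

```python
import re

# Matches lines such as "Timed out on Seq 28-2352567855348137104".
# Assumption: the number before the dash is the host id, the number
# after it is the command sequence (inferred from the snippets above).
TIMEOUT_RE = re.compile(r"Timed out on Seq (\d+)-(\d+)")


def find_timeouts(lines):
    """Return a list of (host_id, seq) tuples, one per timeout line."""
    hits = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            hits.append((int(match.group(1)), int(match.group(2))))
    return hits


def count_by_host(hits):
    """Count how many timeouts were seen per host id."""
    counts = {}
    for host_id, _seq in hits:
        counts[host_id] = counts.get(host_id, 0) + 1
    return counts
```

It could be used by passing an open log file to find_timeouts(), for
example find_timeouts(open(path)), where path points at the management
server log on your installation.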




