
CloudStack Host HA Failure

Hi All:

       I am testing the Host HA feature and found that HA does not take effect when a host's hard disk fails.
       How does com.cloud.agent.api.CheckHealthCommand determine that the agent is still online?

Here is the test process:

CloudStack version: 4.10
OS version: Ubuntu 14.04.5, kernel 4.4.0-119-generic
Server: Dell R730, 2 x E5-2620 v3, 1 x 300 GB HDD (non-RAID)

There are three nodes in the cluster: Test-01, test-02, test-03

Global config -> ha.tag = ha
Host.tag = ha
Instance -> compute offering -> host.tag = ha

To simulate Host HA under a hard disk failure:

Action: pull the hard disk out of the server while it is running.

Result: HA does not succeed; the instances are not started on the other nodes.

Monitoring the HA state:

1. Ping from the management server to the agent is normal.

2. All instances on the agent host are inaccessible.

3. No commands other than cd can be executed on the agent host.

4. The management server shows the agent status as Up.

5. Management server log:

2018-06-19 16:17:59,299 INFO  [c.c.a.m.AgentManagerImpl] (AgentTaskPool-14:ctx-f8355fb7) (logid:f61237b8) Investigating why host 31 has disconnected with event PingTimeout
2018-06-19 16:17:59,303 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-14:ctx-f8355fb7) (logid:f61237b8) checking if agent (31) is alive
2018-06-19 16:17:59,314 DEBUG [c.c.a.t.Request] (AgentTaskPool-14:ctx-f8355fb7) (logid:f61237b8) Seq 31-4026218066869223474: Sending  { Cmd , MgmtId: 11065950535710, via: 31(test-02), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
2018-06-19 16:17:59,358 DEBUG [c.c.a.t.Request] (AgentTaskPool-14:ctx-f8355fb7) (logid:f61237b8) Seq 31-4026218066869223474: Received:  { Ans: , MgmtId: 11065950535710, via: 31(test-02), Ver: v1, Flags: 10, { CheckHealthAnswer } }
2018-06-19 16:17:59,358 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-14:ctx-f8355fb7) (logid:f61237b8) Details from executing class com.cloud.agent.api.CheckHealthCommand: resource is alive
2018-06-19 16:17:59,358 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-14:ctx-f8355fb7) (logid:f61237b8) agent (31) responded to checkHeathCommand, reporting that agent is Up
2018-06-19 16:17:59,358 INFO  [c.c.a.m.AgentManagerImpl] (AgentTaskPool-14:ctx-f8355fb7) (logid:f61237b8) The agent from host 31 state determined is Up
2018-06-19 16:17:59,358 INFO  [c.c.a.m.AgentManagerImpl] (AgentTaskPool-14:ctx-f8355fb7) (logid:f61237b8) Agent is determined to be up and running
2018-06-19 16:17:59,358 DEBUG [c.c.h.Status] (AgentTaskPool-14:ctx-f8355fb7) (logid:f61237b8) Transition:[Resource state = Enabled, Agent event = Ping, Host id = 31, name = test-02]
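
For anyone who wants to follow the same investigation sequence in their own management server log, here is a minimal sketch (not part of CloudStack) that splits lines of the format quoted above into fields and collects the entries sharing one logid. The regular expression is inferred only from the lines shown in this message, so other versions or log layouts may need changes; parse_line and entries_for_logid are hypothetical helper names.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Pattern inferred from the log lines quoted in this message (an assumption;
# a different log4j layout or CloudStack version may not match).
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<component>[^\]]+)\]\s+"
    r"\((?P<thread>[^)]*)\)\s+"
    r"\(logid:(?P<logid>[0-9a-f]+)\)\s+"
    r"(?P<message>.*)$"
)


class LogEntry(NamedTuple):
    timestamp: str
    level: str
    component: str
    thread: str
    logid: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Return the parsed fields, or None if the line does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return LogEntry(**match.groupdict())


def entries_for_logid(lines: Iterable[str], logid: str) -> List[LogEntry]:
    """Collect, in input order, all parsed entries that carry the given logid."""
    entries = []
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry.logid == logid:
            entries.append(entry)
    return entries


# Two lines copied from the log above, used only as sample input.
SAMPLE = [
    "2018-06-19 16:17:59,299 INFO  [c.c.a.m.AgentManagerImpl] "
    "(AgentTaskPool-14:ctx-f8355fb7) (logid:f61237b8) "
    "Investigating why host 31 has disconnected with event PingTimeout",
    "2018-06-19 16:17:59,358 INFO  [c.c.a.m.AgentManagerImpl] "
    "(AgentTaskPool-14:ctx-f8355fb7) (logid:f61237b8) "
    "The agent from host 31 state determined is Up",
]

if __name__ == "__main__":
    for entry in entries_for_logid(SAMPLE, "f61237b8"):
        print(entry.timestamp, entry.level, entry.component, entry.message)
```

Filtering by logid keeps the PingTimeout investigation, the CheckHealthCommand request and answer, and the final state decision together, even when other threads write to the log at the same time.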