
high availability with automated disaster recovery using zookeeper

Hi all,


We are examining how to achieve high availability for Flink, and also how to support automatic recovery in a disaster scenario where an entire data center (DC) goes down.

We have DC1, where we normally want the work to be done, and DC2, which is more remote and should take over only when DC1 is down.


We have examined a few options and would be glad to hear feedback, or a suggestion for another way to achieve this.

·         Two separate ZooKeeper and Flink clusters, one in each data center.

Only the cluster in DC1 is running, and its state is copied to DC2 by an offline process.

To achieve automatic recovery we need some kind of watchdog that checks DC1 availability and, if DC1 is down, starts the cluster in DC2 (and likewise later, if DC2 goes down).

Is there a recommended tool for this?
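
To make the watchdog idea concrete, here is a minimal Python sketch of the behaviour described in this option. It is only an illustration under stated assumptions: the names `tcp_probe` and `FailoverWatchdog`, the threshold of three consecutive failures, and the probe target are all hypothetical, and the action that actually starts the DC2 cluster is left as a callback.

```python
import socket
import time
from typing import Callable


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class FailoverWatchdog:
    """Calls on_failover once, after `threshold` consecutive failed probes."""

    def __init__(self, probe: Callable[[], bool],
                 on_failover: Callable[[], None], threshold: int = 3):
        self.probe = probe
        self.on_failover = on_failover
        self.threshold = threshold
        self.failures = 0
        self.failed_over = False

    def tick(self) -> bool:
        """Run one probe; return True once failover has been triggered."""
        if self.failed_over:
            return True
        if self.probe():
            # Any successful probe resets the failure counter.
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.failed_over = True
                self.on_failover()
        return self.failed_over

    def run(self, interval_s: float = 10.0) -> None:
        """Probe periodically until failover is triggered."""
        while not self.tick():
            time.sleep(interval_s)
```

It could be wired up as `FailoverWatchdog(lambda: tcp_probe("dc1-jobmanager.example", 8081), start_dc2).run()`, where the host name, the port and `start_dc2` are placeholders. Note that a watchdog running at a single site cannot tell "DC1 is down" apart from "the link to DC1 is down", so a real setup would need some protection against both clusters running at once.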

·         A ZooKeeper “stretch cluster” across data centers, with 2 nodes in DC1, 2 nodes in DC2 and one observer node.

In addition, a Flink cluster with jobmanager1 in DC1 and jobmanager2 in DC2.

This way, when DC1 is down, ZooKeeper will notice this automatically and transfer the work to jobmanager2 in DC2.

However, we would like the ZooKeeper leader and the Flink JobManager leader (the primary one) to be in DC1 unless it is down.

Is there a way to achieve this?
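
One thing worth checking for this layout is the vote count. The sketch below is a small Python calculation, assuming ZooKeeper's default majority quorum and assuming the "observer" is a standard non-voting ZooKeeper observer (so it is left out of the count). The function names and the alternative three-site layout are hypothetical and only there for comparison.

```python
from typing import Dict


def has_quorum(voters_alive: int, voters_total: int) -> bool:
    """Majority quorum: strictly more than half of the voting servers must be up."""
    return voters_alive > voters_total // 2


def survives_loss_of(site: str, voters_per_site: Dict[str, int]) -> bool:
    """Return True if the ensemble keeps quorum after losing every voter at `site`."""
    total = sum(voters_per_site.values())
    return has_quorum(total - voters_per_site[site], total)


# Layout from this option: 2 voters per DC; the observer does not vote.
stretch = {"DC1": 2, "DC2": 2}

# Hypothetical alternative: the fifth node is a voter at a third site.
three_sites = {"DC1": 2, "DC2": 2, "DC3": 1}

for name, layout in (("stretch", stretch), ("three_sites", three_sites)):
    print(name, "survives loss of DC1:", survives_loss_of("DC1", layout))
```

Under these assumptions, four voters need three votes for a majority, so losing either data center leaves the remaining two voters without quorum, whereas a fifth voting node at a third site would keep a majority of three.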


Thanks and regards,


Tovi Sofer

Software Engineer
+972 (3) 7405756