osdir.com

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Old job resurrected during HA failover


Hi Elias and Vino,

sorry for the late reply. 

I think your analysis is pretty much to the point. The current implementation does not properly respect the situation with multiple standby JobManagers. In the single JobManager case, a loss of leadership either means that the JobManager has died and, thus, also its ZooKeeper connection which causes the ephemeral nodes to disappear or the same JobManager will be reelected. In the case of multiple standby JobManagers a lost leadership caused by a ZooKeeper hickup could cause that a different JM will become the leader while the old leader still keeps its connection to ZooKeeper (e.g. after reestablishing it). In this case, the ephemeral nodes won't be deleted automatically. Consequently, it is necessary to explicitly free all locked resources as Elias has proposed. This problem affects the legacy as well as the new mode. This is a critical issue which we should fix asap.

Thanks for reporting this issue and the in-depth analysis of the cause Elias!

A somewhat related problem is that the actual ZooKeeper delete operation is executed in a background thread without proper failure handling. As far as I can tell, we only log on DEBUG that the node could not be deleted. I think this should be fixed as well because then the problem would be easier to identify.

Cheers,
Till

On Fri, Aug 3, 2018 at 5:42 PM Elias Levy <fearsome.lucidity@xxxxxxxxx> wrote:
Till,

Thoughts?

On Wed, Aug 1, 2018 at 7:34 PM vino yang <yanghua1127@xxxxxxxxx> wrote:
Your analysis is correct, yes, in theory the old jobgraph should be deleted, but Flink currently uses the method of locking and asynchronously deleting Path, so that it can not give you the acknowledgment of deleting, so this is a risk point.

cc Till, there have been users who have encountered this problem before. I personally think that asynchronous deletion may be risky, which may cause JM to be revived by the cancel job after the failover.