Hi Elias and Vino,
sorry for the late reply.
I think your analysis is pretty much to the point. The current implementation does not properly respect the situation with multiple standby JobManagers. In the single JobManager case, a loss of leadership either means that the JobManager has died and, thus, also its ZooKeeper connection which causes the ephemeral nodes to disappear or the same JobManager will be reelected. In the case of multiple standby JobManagers a lost leadership caused by a ZooKeeper hickup could cause that a different JM will become the leader while the old leader still keeps its connection to ZooKeeper (e.g. after reestablishing it). In this case, the ephemeral nodes won't be deleted automatically. Consequently, it is necessary to explicitly free all locked resources as Elias has proposed. This problem affects the legacy as well as the new mode. This is a critical issue which we should fix asap.
Thanks for reporting this issue and the in-depth analysis of the cause Elias!
A somewhat related problem is that the actual ZooKeeper delete operation is executed in a background thread without proper failure handling. As far as I can tell, we only log on DEBUG that the node could not be deleted. I think this should be fixed as well because then the problem would be easier to identify.