I can see in the logs that JM 1 (10.210.22.167), the one that became leader after failover, thinks it deleted the 2a4eff355aef849c5ca37dbac04f2ff1 job from ZK when it was canceled:

July 30th 2018, 15:32:27.231 Trying to cancel job with ID 2a4eff355aef849c5ca37dbac04f2ff1.
July 30th 2018, 15:32:27.232 Job Some Job (2a4eff355aef849c5ca37dbac04f2ff1) switched from state RESTARTING to CANCELED.
July 30th 2018, 15:32:27.232 Stopping checkpoint coordinator for job 2a4eff355aef849c5ca37dbac04f2ff1
July 30th 2018, 15:32:27.239 Removed job graph 2a4eff355aef849c5ca37dbac04f2ff1 from ZooKeeper.
July 30th 2018, 15:32:27.245 Removing /flink/cluster_1/checkpoints/2a4eff355aef849c5ca37dbac04f2ff1 from ZooKeeper
July 30th 2018, 15:32:27.251 Removing /checkpoint-counter/2a4eff355aef849c5ca37dbac04f2ff1 from ZooKeeper

Both /flink/cluster_1/checkpoints/2a4eff355aef849c5ca37dbac04f2ff1 and /flink/cluster_1/checkpoint-counter/2a4eff355aef849c5ca37dbac04f2ff1 no longer exist, but for some reason the job graph is still there.

Looking at the ZK logs I find the problem:

July 30th 2018, 15:32:27.241 Got user-level KeeperException when processing sessionid:0x2000001d2330001 type:delete cxid:0x434c zxid:0x60009dd94 txntype:-1 reqpath:n/a Error Path:/flink/cluster_1/jobgraphs/2a4eff355aef849c5ca37dbac04f2ff1 Error:KeeperErrorCode = Directory not empty for /flink/cluster_1/jobgraphs/2a4eff355aef849c5ca37dbac04f2ff1

Looking in ZK, we see:

[zk: localhost:2181(CONNECTED) 0] ls /flink/cluster_1/jobgraphs/2a4eff355aef849c5ca37dbac04f2ff1
[d833418c-891a-4b5e-b983-080be803275c]

From the comments in ZooKeeperStateHandleStore.java I gather that this child node is used as a deletion lock.

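To illustrate why the delete fails, here is a toy in-memory model in Python of a parent node guarded by a lock child. It is only a sketch of the behaviour suggested by the "Directory not empty" error above, not Flink or ZooKeeper code; ToyZnodeTree, NotEmptyError, try_remove_job_graph and the paths are made-up names.

```python
# Toy in-memory model of the znode layout discussed above.
# NOT Flink or ZooKeeper code; all names here are hypothetical.


class NotEmptyError(Exception):
    """Stands in for ZooKeeper's 'Directory not empty' KeeperException."""


class ToyZnodeTree:
    def __init__(self):
        # Maps each existing path to the set of its direct child names.
        self.children = {"/": set()}

    def create(self, path):
        parent, _, name = path.rpartition("/")
        parent = parent or "/"
        if parent not in self.children:
            raise KeyError(f"parent {parent} does not exist")
        self.children[parent].add(name)
        self.children[path] = set()

    def delete(self, path):
        # Like ZooKeeper, refuse to delete a node that still has children.
        if self.children[path]:
            raise NotEmptyError(path)
        parent, _, name = path.rpartition("/")
        self.children[parent or "/"].discard(name)
        del self.children[path]


def try_remove_job_graph(tree, job_path):
    """Return True if the job graph node was deleted, False if a lock child blocked it."""
    try:
        tree.delete(job_path)
        return True
    except NotEmptyError:
        return False


tree = ToyZnodeTree()
tree.create("/jobgraphs")
tree.create("/jobgraphs/job-a")
# Lock child left behind by a previous leader (hypothetical name).
tree.create("/jobgraphs/job-a/lock-of-old-leader")

removed_while_locked = try_remove_job_graph(tree, "/jobgraphs/job-a")

# Once the stale lock child is gone, the same delete goes through.
tree.delete("/jobgraphs/job-a/lock-of-old-leader")
removed_after_release = try_remove_job_graph(tree, "/jobgraphs/job-a")
```

In this model the first removal attempt is rejected and the second succeeds. That would be consistent with the logs above, where the delete of the job graph node fails while a lock child created by another session is still present.
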
Looking at the contents of this ephemeral lock node:

[zk: localhost:2181(CONNECTED) 16] get /flink/cluster_1/jobgraphs/2a4eff355aef849c5ca37dbac04f2ff1/d833418c-891a-4b5e-b983-080be803275c
10.210.42.62
cZxid = 0x60002ffa7
ctime = Tue Jun 12 20:01:26 UTC 2018
mZxid = 0x60002ffa7
mtime = Tue Jun 12 20:01:26 UTC 2018
pZxid = 0x60002ffa7
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x30000003f4a0003
dataLength = 12
numChildren = 0

and compared to the ephemeral node lock of the currently running job:

[zk: localhost:2181(CONNECTED) 17] get /flink/cluster_1/jobgraphs/d77948df92813a68ea6dfd6783f40e7e/596a4add-9f5c-4113-99ec-9c942fe91172
20.127.116.11
cZxid = 0x60009df4b
ctime = Mon Jul 30 23:01:04 UTC 2018
mZxid = 0x60009df4b
mtime = Mon Jul 30 23:01:04 UTC 2018
pZxid = 0x60009df4b
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x2000001d2330001
dataLength = 13
numChildren = 0

Assuming the content of the nodes represents the owner, it seems the job graph for the old canceled job, 2a4eff355aef849c5ca37dbac04f2ff1, is locked by the previous JM leader, JM 2 (10.210.42.62), while the running job is locked by the current JM leader, JM 1 (10.210.22.167).

Somehow the previous leader, JM 2, did not give up the lock when leadership failed over to JM 1.

Shouldn't something call ZooKeeperStateHandleStore.releaseAll during HA failover to release the locks on the graphs?

On Wed, Aug 1, 2018 at 9:49 AM Elias Levy <fearsome.lucidity@xxxxxxxxx> wrote:

Thanks for the reply. Looking in ZK I see:

[zk: localhost:2181(CONNECTED) 5] ls /flink/cluster_1/jobgraphs
[d77948df92813a68ea6dfd6783f40e7e, 2a4eff355aef849c5ca37dbac04f2ff1]

Again we see HA state for job 2a4eff355aef849c5ca37dbac04f2ff1, even though that job is no longer running (it was canceled while it was in a loop attempting to restart, but failing because of a lack of cluster slots).

Any idea why that may be the case?