Re: Old job resurrected during HA failover


Vino,

Thanks for the reply.  Looking in ZK I see:

[zk: localhost:2181(CONNECTED) 5] ls /flink/cluster_1/jobgraphs
[d77948df92813a68ea6dfd6783f40e7e, 2a4eff355aef849c5ca37dbac04f2ff1]

Again we see HA state for job 2a4eff355aef849c5ca37dbac04f2ff1, even though that job is no longer running (it was canceled while it was in a loop attempting to restart, but failing because of a lack of cluster slots).

Any idea why that may be the case?


On Wed, Aug 1, 2018 at 8:38 AM vino yang <yanghua1127@xxxxxxxxx> wrote:
If a job is explicitly canceled, its jobgraph node in ZK is deleted.
However, note that Flink deletes the jobgraph node asynchronously on a background thread,
so in some cases the deletion may not complete.
Meanwhile, the jobgraph node in ZK is the only basis the JM leader uses to restore a job,
so a leftover node can lead to an unexpected recovery, i.e. an old job being resurrected.
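
To spot leftover entries like the one above, the children of the jobgraphs node can be compared with the jobs that are actually supposed to be running. The following is only a minimal sketch: the function name find_stale_jobgraphs is hypothetical, the ZK children are pasted in from the `ls` output above rather than read from a live ensemble, and the list of active jobs is an assumption for illustration (the thread only states that 2a4eff355aef849c5ca37dbac04f2ff1 is no longer running).

```python
def find_stale_jobgraphs(zk_job_ids, active_job_ids):
    """Return job IDs that have a jobgraph node in ZK but are not active.

    zk_job_ids:     children of the jobgraphs znode (e.g. from `ls` in zkCli)
    active_job_ids: job IDs known to be running (source of this list is up to you)
    """
    # Normalize case so hex IDs copied from different tools still match.
    active = {job_id.lower() for job_id in active_job_ids}
    return sorted(job_id for job_id in zk_job_ids if job_id.lower() not in active)


# Children of /flink/cluster_1/jobgraphs, copied from the zkCli output above.
zk_job_ids = [
    "d77948df92813a68ea6dfd6783f40e7e",
    "2a4eff355aef849c5ca37dbac04f2ff1",
]

# Assumption for illustration: only the first job is still running.
active_job_ids = ["d77948df92813a68ea6dfd6783f40e7e"]

stale = find_stale_jobgraphs(zk_job_ids, active_job_ids)
print(stale)  # ['2a4eff355aef849c5ca37dbac04f2ff1']
```

Any ID this reports is a candidate for being restored by the JM leader on the next failover, so it is worth checking before relying on HA recovery.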