
After job cancel, leftover ZK state prevents job manager startup


Hello,

We've been seeing an issue with several Flink 1.5.4 clusters that looks like this:

1. Job is cancelled with a savepoint
2. The jar is deleted from our HA blobstore (S3)
3. The jobgraph in ZK is *not* deleted (see the ZooKeeper sketch after the stack trace below)
4. We restart the cluster
5. Startup fails in recovery because the jar is not available, with the stacktrace:

00:13:58.486 ERROR o.a.f.r.e.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
    at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.Exception: Cannot set up the user code libraries: No such file or directory: s3://streamingplatform-production/{JOB_NAME}/flink/highavailability/{JOB_NAME}/blob/job_5a3fe2c00c05efd3a552a1c6707d2c10/blob_p-6d585831f5c947335ac505b400cf8f3630cc706a-42355c2885b668b0bc5e15b856141b0
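For reference, the leftover entry from step 3 can be seen directly in ZooKeeper. Below is a minimal sketch using the plain ZooKeeper Java client; the connection string is a placeholder, and the paths assume the default high-availability.zookeeper.path.root (/flink) and cluster-id ("default"), so adjust both to your configuration:

import org.apache.zookeeper.ZooKeeper;

import java.util.List;

// Minimal sketch: list the jobgraph entries Flink has left in ZooKeeper.
// "zk-host:2181" is a placeholder; the path assumes the default
// high-availability.zookeeper.path.root (/flink) and cluster-id ("default").
public class ListLeftoverJobGraphs {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, event -> {});
        try {
            List<String> jobGraphs = zk.getChildren("/flink/default/jobgraphs", false);
            // Any job id listed here whose blobs are gone from the HA blobstore
            // will make recovery fail on startup, as in the stack trace above.
            jobGraphs.forEach(System.out::println);
        } finally {
            zk.close();
        }
    }
}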

This superficially seems similar to several issues that have apparently been fixed in 1.5.4, like FLINK-10255 and FLINK-10184.

Has anybody else seen this issue on 1.5.4 (or later) clusters? Or any advice for debugging?
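If manually clearing the stale entry before restart is the expected workaround, I assume it would look roughly like the sketch below (same default-path assumptions as above; the job id is the one from the stack trace, and the helper here is just for illustration). Please correct me if there is a supported way to do this instead:

import org.apache.zookeeper.ZooKeeper;

// Hypothetical manual cleanup: recursively delete the stale jobgraph znode so
// that recovery no longer references blobs already removed from S3.
// Paths assume the default /flink root and "default" cluster-id; run only
// against a stopped cluster, and with care.
public class RemoveStaleJobGraph {

    // Deletes a znode and any child nodes underneath it.
    static void deleteRecursive(ZooKeeper zk, String path) throws Exception {
        for (String child : zk.getChildren(path, false)) {
            deleteRecursive(zk, path + "/" + child);
        }
        zk.delete(path, -1); // -1 matches any node version
    }

    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, event -> {});
        try {
            deleteRecursive(zk, "/flink/default/jobgraphs/5a3fe2c00c05efd3a552a1c6707d2c10");
        } finally {
            zk.close();
        }
    }
}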

Thanks,
Micah