Could not cancel job (with savepoint) "Ask timed out"


I was trying to cancel a job with savepoint, but the CLI command failed with "akka.pattern.AskTimeoutException: Ask timed out".

The stack trace shows that the ask timeout is 10 seconds:

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/jobmanager_0#106635280]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

Indeed, the documentation lists "10 s" as the default value of akka.ask.timeout:
https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#distributed-coordination-via-akka

Behind the scenes, the savepoint creation and the job cancellation both succeeded, which was more or less what I expected. So my only problem is getting a proper response back from the CLI call instead of having it time out so eagerly.

To be exact, what I ran was:

flink-1.5.2/bin/flink cancel b7c7d19d25e16a952d3afa32841024e5 -m yarn-cluster -yid application_1533676784032_0001 --withSavepoint

Should I increase akka.ask.timeout? If so, can I override it just for the CLI call somehow? I'm concerned that setting it globally could have undesired side effects on the actual Flink jobs.

What about akka.client.timeout? Its default is also rather low: "60 s". Should it be increased as well if I want to allow savepoint creation to take longer than 60 s?
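For illustration, a CLI-only override could be attempted along the lines of the sketch below. It assumes that bin/flink reads its configuration from the directory named by the FLINK_CONF_DIR environment variable, so a copy of the conf directory with longer timeouts can be used for one call while the cluster's own flink-conf.yaml stays untouched. The helper name make_cli_conf, the /tmp path, and the values "60 s" and "120 s" are hypothetical. Nothing here confirms that akka.ask.timeout is the setting behind the 10000 ms in the stack trace, so this is something to try rather than a known fix.

```shell
#!/bin/sh
# Hypothetical helper: copy a Flink conf directory and override two timeouts
# in the copy only, leaving the original configuration unchanged.
make_cli_conf() {
    src_conf="$1"        # existing conf directory, e.g. flink-1.5.2/conf
    dst_conf="$2"        # new directory that will hold the CLI-only copy
    ask_timeout="$3"     # value for akka.ask.timeout, e.g. "60 s"
    client_timeout="$4"  # value for akka.client.timeout, e.g. "120 s"

    mkdir -p "$dst_conf" || return 1
    cp -R "$src_conf"/. "$dst_conf"/ || return 1

    conf_file="$dst_conf/flink-conf.yaml"
    # Drop any existing entries for the two keys, then append the new values.
    grep -v -e '^akka\.ask\.timeout:' -e '^akka\.client\.timeout:' \
        "$src_conf/flink-conf.yaml" > "$conf_file"
    printf 'akka.ask.timeout: %s\n' "$ask_timeout" >> "$conf_file"
    printf 'akka.client.timeout: %s\n' "$client_timeout" >> "$conf_file"
}

# Intended usage (assumed, not verified against a running cluster):
#   make_cli_conf flink-1.5.2/conf /tmp/flink-cli-conf "60 s" "120 s"
#   FLINK_CONF_DIR=/tmp/flink-cli-conf flink-1.5.2/bin/flink cancel \
#       b7c7d19d25e16a952d3afa32841024e5 -m yarn-cluster \
#       -yid application_1533676784032_0001 --withSavepoint
```

Keeping the override in a separate directory means the running jobs never see the changed values, which addresses the side-effect concern only if the CLI is in fact the component that enforces the timeout.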

Finally, the default timeout is so low that I would expect this to be a common problem. I think the Flink CLI should have a higher default timeout for cancel and savepoint operations.

Thanks!