We have a small QA environment with just one job manager and one task manager. There are several jobs running with parallelism 1.
There is a problem with one job. During our regular upgrade process, one job wasn't cancelled because the savepoint timed out:
Cancelling job 1b80efe346d437c01e17b6efda640909 with savepoint to /path/to/nfsrecovery/flink-distribution.
The program finished with the following exception:
java.util.concurrent.TimeoutException: Futures timed out after [60000 milliseconds]
at java.security.AccessController.doPrivileged(Native Method)
So we ended up with two similar jobs running in parallel (I'm not sure whether this is related to the problem).
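For anyone wanting to confirm the duplicates, here is a rough sketch that groups running jobs by name. It assumes the job manager's REST endpoint `/jobs/overview` returns a `jobs` array whose entries carry `jid`, `name` and `state`; the base URL in the usage note is a placeholder, not our real address, and the function names are made up for illustration.

```python
import json
from collections import defaultdict
from urllib.request import urlopen


def fetch_jobs_overview(base_url):
    """Fetch /jobs/overview from the job manager REST API.

    base_url is an assumption, e.g. the address of the Flink web UI.
    """
    with urlopen(base_url.rstrip("/") + "/jobs/overview") as resp:
        return json.load(resp)


def find_duplicate_running_jobs(overview):
    """Return {job name: [job ids]} for names with more than one RUNNING job."""
    by_name = defaultdict(list)
    for job in overview.get("jobs", []):
        # Only jobs that are still running count as duplicates.
        if job.get("state") == "RUNNING":
            by_name[job.get("name")].append(job.get("jid"))
    return {name: ids for name, ids in by_name.items() if len(ids) > 1}
```

Something like `find_duplicate_running_jobs(fetch_jobs_overview("http://localhost:8081"))` would then list every job name that is running more than once, together with the job ids.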
There is no activity on this environment right now, but I see high backpressure on one of the operators of this job. Also, all checkpoints for this particular job are failing with a timeout (5 minutes). The other jobs are fine.
I've looked at the job manager logs and noticed that once a day we have a connection issue between JM and TM nodes:
02 Aug 2018 22:07:23,502 WARN akka.remote.Remoting - Association to [akka.tcp://flink@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx:42579] with unknown UID is irrecoverably failed. Address cannot be quarantined without knowing the UID, gating instead for 5000 ms.
Other than that I don't see anything strange in the logs.
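To check how often the association failure really occurs, the job manager log could be scanned with something like the sketch below. It assumes every log entry starts with a timestamp in the layout shown above (`02 Aug 2018 22:07:23,502`) and that month abbreviations are in English; the function name and the log path in the usage note are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Timestamp layout copied from the log line above: "02 Aug 2018 22:07:23,502"
TIMESTAMP = re.compile(r"^(\d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2}),\d{3}")
# Text fragment that identifies the Akka association failure.
MARKER = "is irrecoverably failed"


def association_failures_per_day(lines):
    """Count association-failure messages per calendar day.

    A message wrapped over several lines is attributed to the most
    recent timestamp seen before the marker text.
    """
    counts = Counter()
    last_day = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            # %b is locale dependent; English month names are assumed.
            last_day = datetime.strptime(
                match.group(1), "%d %b %Y %H:%M:%S"
            ).date()
        if MARKER in line and last_day is not None:
            counts[last_day] += 1
    return counts
```

Calling it as `association_failures_per_day(open("jobmanager.log"))` would give one count per day, which should show whether the failures line up with the time the checkpoints started failing.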
I would very much appreciate any advice to help me solve the problem.