Do you have any quick idea what could cause this problem in Flink 1.4.2?
It seems that one operator takes too long to deploy, and the downstream tasks fail with a "partition not found" error. This only seems to happen when the job is restored from state, and that operator does in fact have both keyed and operator state.
Deploying the same job from empty state works well. We tried increasing taskmanager.network.request-backoff.max, but that didn't help.
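For reference, the change was made in flink-conf.yaml and looks roughly like the sketch below. The values are illustrative only, not the ones from our cluster; as far as I know the defaults are 100 ms for the initial backoff and 10000 ms for the maximum.

    # flink-conf.yaml (illustrative values, not our actual settings)
    # Backoff in milliseconds for partition requests made by downstream input channels.
    taskmanager.network.request-backoff.initial: 100
    taskmanager.network.request-backoff.max: 60000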
It would be great if you had some pointers on where to look further; I haven't seen this happen before.
org.apache.flink.runtime.io.network.partition.: Partition 4c5e9cd5dd410331103f51127996068a@b35ef4ffe25e3d17c5d6051ebe2860cd not found.