I recently upgraded from Flink 1.3 to 1.4 and use the Queryable State client in my application. I have 1 JobManager and 5 TaskManagers, all deployed behind Kubernetes. A large state is built and distributed evenly across the task managers, and the client can query the state for a specified key.
Issue: when a task manager dies, a new one is spun up automatically and the queryable state recovers successfully on the new nodes/task slots. After that, I start getting timeout exceptions whenever the client queries for a key, even if I reset or redeploy the client jobs.
I have been trying to triage this and find a way to remediate it. I found that KvStateClientProxyHandler, which is not exposed to user code, has a forceUpdate flag that can reset the cached KvStateLocations (plus the inetAddresses), but it defaults to false and can't be overridden.
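To make the behavior I am describing concrete, here is a toy model of a location cache with a forceUpdate flag. This is only a sketch of my understanding, not Flink's actual code: the LocationCache class, the registry map, and the "tm-old"/"tm-new" addresses are all hypothetical names made up for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Toy model of a cached location lookup; NOT Flink's implementation. */
public class LocationCacheSketch {

    /** Hypothetical cache: state name -> address of the task manager serving it. */
    static final class LocationCache {
        private final Map<String, String> cache = new ConcurrentHashMap<>();
        // Stands in for asking the job manager where the state currently lives.
        private final Function<String, String> authoritativeLookup;

        LocationCache(Function<String, String> authoritativeLookup) {
            this.authoritativeLookup = authoritativeLookup;
        }

        String lookup(String stateName, boolean forceUpdate) {
            if (forceUpdate) {
                // Bypass the cached entry and overwrite it with a fresh answer.
                String fresh = authoritativeLookup.apply(stateName);
                if (fresh == null) {
                    cache.remove(stateName);
                } else {
                    cache.put(stateName, fresh);
                }
                return fresh;
            }
            // Default path: reuse whatever was cached, even if it is stale.
            return cache.computeIfAbsent(stateName, authoritativeLookup);
        }
    }

    public static void main(String[] args) {
        // Hypothetical registry playing the role of the job manager's view.
        Map<String, String> registry = new ConcurrentHashMap<>();
        registry.put("my-state", "tm-old:9069");
        LocationCache cache = new LocationCache(registry::get);

        System.out.println(cache.lookup("my-state", false)); // tm-old:9069

        // The old task manager dies and a replacement registers the state.
        registry.put("my-state", "tm-new:9069");

        System.out.println(cache.lookup("my-state", false)); // tm-old:9069 (stale)
        System.out.println(cache.lookup("my-state", true));  // tm-new:9069
    }
}
```

In this model, a lookup with forceUpdate=false keeps returning the dead task manager's address until something triggers a forced refresh, which matches the symptom I am seeing.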
I was wondering if anyone knows how to remediate this kind of issue, or if there is a way to let the JobManager know that the cached task manager location is no longer valid.
Any tips to resolve this would be appreciated. (I can't downgrade back to 1.3 or upgrade past 1.4.)