As Andrey noted, it’s a known issue with (currently) no good solution.
I talk a bit about how we worked around it on slide 26 of my Flink Forward talk
on a Flink-based web crawler.
Basically we do some cheesy approximate monitoring of in-flight data, and throttle the key producer so that (hopefully) network buffers don’t fill up to the point of deadlock.
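The throttling idea above can be sketched as a simple permit counter: charge a permit when a record enters the iteration, release it when the record leaves, and have the producer back off when no permits remain. This is only an illustrative sketch of the general pattern, not code from the talk; the class and method names (`InFlightThrottle`, `tryEmit`, `onRecordDone`) and the threshold are all hypothetical, and in a real job the counter would have to be maintained approximately across operators.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch of the "approximate in-flight monitoring" workaround:
// bound how many records may circulate in the iteration at once, so network
// buffers cannot fill to the point of deadlock. The threshold must be tuned
// well below the buffer capacity of the loop.
public class InFlightThrottle {
    private final Semaphore permits;

    public InFlightThrottle(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    // Called by the producer before emitting a record into the iteration.
    // Returns false when the loop is "full"; the caller should pause/retry.
    public boolean tryEmit() {
        return permits.tryAcquire();
    }

    // Called when a record exits the loop (emitted downstream or dropped).
    public void onRecordDone() {
        permits.release();
    }

    public int availablePermits() {
        return permits.availablePermits();
    }
}
```

A producer using this would typically sleep briefly and retry when `tryEmit()` returns false, rather than blocking hard, so that checkpointing and other work can still make progress.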
It seems to be a known issue. The community will hopefully work on this, but I haven't seen any updates since the last answer to a similar question (see also the related discussions).
We've tried using the iterations feature, and under significant load the job sometimes stalls and stops processing events due to high back pressure, both in the task that produces records for the iteration and in all the other inputs to that task. The task can't keep up with the incoming records, the iteration sink loops back into the same task, and that feedback path gets back pressured as well: a "back pressure loop" that causes a complete job stoppage.
Is there a way to mitigate this, or to guarantee that the issue cannot occur?