Thank you very much for your response.
I will try the new feature of Flink 1.5 when it is released.
But I am not sure minimising buffers sizes will work in all scenarios.
If I understand correctly these settings are affecting the whole Flink instance.
We might have a flow like this:
Source: Read file paths --> Unpack and parse files --> Analyse parsed data -> ….
So it will be a very small amount of data at first step but quite a lot of parsed data later.
Changing buffer sizes globally will probably affect throughput of later steps, as you wrote.
Yes, Flink 1.5.0 will come with better tools to handle this problem. Namely you will be able to limit the “in flight” data, by controlling the number of assigned credits per channel/input gate. Even without any configuring Flink 1.5.0 will out of the box buffer less data, thus mitigating the problem.
There are some tweaks that you could use to make 1.4.x work better. With small records that require heavy processing, generally speaking you do not need huge buffers sizes to keep max throughput. You can try to both reduce the buffer pool and reduce the memory segment sizes:
• taskmanager.network.memory.fraction: Fraction of JVM memory to use for network buffers (DEFAULT: 0.1),
• taskmanager.network.memory.min: Minimum memory size for network buffers in bytes (DEFAULT: 64 MB),
• taskmanager.network.memory.max: Maximum memory size for network buffers in bytes (DEFAULT: 1 GB), and
• taskmanager.memory.segment-size: Size of memory buffers used by the memory manager and the network stack in bytes (DEFAULT: 32768 (= 32 KiBytes)).
Reducing those values will reduce amount of in-flight data that will be caught between checkpoints. But keep in mind that smaller values can lead to smaller throughput, but as I said, with small number of heavy processing records this is not an issue. In an extreme example, if your records are lets say 8 bytes each and require 1 hour to process, there is almost no need for any buffering.
With the current version of Flink, there is no general solution to this problem.
The upcoming version 1.5.0 of Flink adds a feature called credit-based flow control which might help here.
I'm adding @Piotr to this thread who knows more about the details of this new feature.