Yes, Flink 1.5.0 will come with better tools to handle this problem. Namely, you will be able to limit the "in-flight" data by controlling the number of credits assigned per channel/input gate. Even without any configuration, Flink 1.5.0 will buffer less data out of the box, thus mitigating the problem.
There are some tweaks that you could use to make 1.4.x work better. With small records that require heavy processing, you generally do not need huge buffer sizes to keep maximum throughput. You can try to reduce both the buffer pool and the memory segment size:
• taskmanager.network.memory.fraction: Fraction of JVM memory to use for network buffers (DEFAULT: 0.1),
• taskmanager.network.memory.min: Minimum memory size for network buffers in bytes (DEFAULT: 64 MB),
• taskmanager.network.memory.max: Maximum memory size for network buffers in bytes (DEFAULT: 1 GB), and
• taskmanager.memory.segment-size: Size of memory buffers used by the memory manager and the network stack in bytes (DEFAULT: 32768 (= 32 KiBytes)).
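As a rough sketch of what that could look like in flink-conf.yaml for 1.4.x: the keys are the ones listed above, but the values are only illustrative assumptions, not recommendations, and should be tuned against your own job. The sizes are given in bytes, as in the option descriptions.

```yaml
# Illustrative values only (assumptions); start from the defaults and reduce step by step.
taskmanager.network.memory.fraction: 0.05
# 16 MB instead of the default 64 MB
taskmanager.network.memory.min: 16777216
# 64 MB instead of the default 1 GB
taskmanager.network.memory.max: 67108864
# 8 KiB instead of the default 32 KiB
taskmanager.memory.segment-size: 8192
```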
Reducing those values will reduce the amount of in-flight data that gets caught between checkpoints. Keep in mind that smaller values can lead to lower throughput, but as I said, with a small number of records that each require heavy processing, this is not an issue. In an extreme example, if your records are, let's say, 8 bytes each and require 1 hour to process, there is almost no need for any buffering.
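To put numbers on that extreme example, here is a back-of-the-envelope calculation in Python. It is only a sketch: it ignores any serialization overhead per record, so the real number of records per buffer would be somewhat lower, and the helper names are made up for illustration.

```python
def records_per_buffer(segment_size_bytes: int, record_size_bytes: int) -> int:
    """Upper bound on how many records fit in one buffer.

    Assumption: no per-record serialization overhead is counted.
    """
    return segment_size_bytes // record_size_bytes


def hours_to_drain(records: int, hours_per_record: float) -> float:
    """Time one task needs to work through the given number of records."""
    return records * hours_per_record


default_segment_size = 32768  # default taskmanager.memory.segment-size in bytes
record_size = 8               # bytes per record, from the example above
hours_per_record = 1.0        # processing time per record, from the example above

buffered = records_per_buffer(default_segment_size, record_size)
print(buffered)                                    # records in a single buffer
print(hours_to_drain(buffered, hours_per_record))  # hours of work in that buffer
```

Under these assumptions a single default-sized buffer already holds 4096 records, i.e. 4096 hours of work for one task, which is why shrinking the buffers costs nothing in throughput in such a case.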