I am trying to use Flink for data ingestion.
The input is a Kafka topic of strings: paths to incoming archive files. The job unpacks the archives, reads the data inside, parses it, and stores it in another format.
Everything works fine if the topic is empty when execution starts and archives then arrive at regular intervals. But if the queue already contains several thousand paths when the job starts, checkpoint durations become far too long, and write transactions either fail or take hours to complete.
Because the Kafka messages are tiny (just paths), the Kafka source reads all of them almost instantly, before any backpressure is detected, and the job then tries to process all of these entries within a single checkpoint. Since the archives can be quite large, that takes hours.
Is there a known solution to this problem? Is it possible to make the Flink source read slowly at first, until the sustainable processing speed is detected?
As a workaround, I decreased the "fetch.max.bytes" Kafka source property to 1 KB and set the buffer timeout to 1 ms. This seems to work for the current data set, but it does not look like a good solution...
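For reference, this is roughly how the workaround looks with the `KafkaSource` builder API. A sketch only: the bootstrap servers, topic name, and group id are placeholders, not values from my actual job.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ArchiveIngestJob {
    public static void main(String[] args) throws Exception {
        // Throttle the consumer's fetch size so each poll returns only a
        // handful of path messages instead of thousands at once.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")       // placeholder
                .setTopics("archive-paths")              // placeholder
                .setGroupId("archive-ingest")            // placeholder
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .setProperty("fetch.max.bytes", "1024")  // 1 KB per fetch
                .build();

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // Flush network buffers almost immediately so records are not
        // batched up ahead of the slow unpacking operator.
        env.setBufferTimeout(1); // milliseconds

        DataStream<String> paths = env.fromSource(
                source, WatermarkStrategy.noWatermarks(), "archive-paths-source");

        // ... unpack, parse, and write stages follow here
        env.execute("archive-ingest");
    }
}
```

The effect is that the source hands records downstream in very small batches, so a checkpoint barrier can get through before too much work is in flight, but it also caps throughput permanently rather than only during the initial backlog.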