
Why checkpoint took so long


I noticed a strange thing with Flink 1.3 checkpointing. A checkpoint succeeded, but took very long: 15 minutes 53 seconds. The checkpoint metadata on S3 is only 1.7 MB. Most of the time, checkpoints actually fail.

aws --profile cure s3 ls --recursive --summarize --human s3://curation-two-admin/flink/sa-checkpoint/sa1/checkpoint_metadata-c99cfda10951
2018-08-16 13:34:07    1.7 MiB flink/sa-checkpoint/sa1/checkpoint_metadata-c99cfda10951

I came across this discussion: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpoints-very-slow-with-high-backpressure-td12762.html#a19370. But there the problem was caused by high back pressure, which is not the case for me.

taskmanager.network.memory.max    128 MB
I set this very small, hoping that smaller buffers would mean faster checkpoints. Since we read from durable storage (S3), I don't worry about buffering reads when writes are slow.
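For reference, the relevant part of our flink-conf.yaml looks roughly like this (the fraction and min values here are illustrative; only the 128 MB max is the actual setting quoted above — in Flink 1.3 these keys take byte values):

# cap network buffer memory so less in-flight data sits between operators
taskmanager.network.memory.fraction: 0.1
taskmanager.network.memory.min: 67108864
taskmanager.network.memory.max: 134217728

(67108864 bytes = 64 MB, 134217728 bytes = 128 MB)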

Any ideas what could cause such slow checkpointing? Thanks. -Alex

[Attachment: Screen Shot 2018-08-16 at 1.43.23 PM.png]