
Re: Why checkpoint took so long


1. There are no custom implementations for the source or for checkpoints. The source is JSON files on S3:
JsonLinesInputFormat format = new JsonLinesInputFormat(new Path(customerPath), configuration);
// Read JSON objects from the given path, monitoring it continuously for updates
env.readFile(format, customerPath, FileProcessingMode.PROCESS_CONTINUOUSLY, pollInterval.toMillis());

RocksDB is used as the state backend.
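Not part of the original mail, but for context: a minimal sketch of wiring up the RocksDB state backend in Flink 1.3. The checkpoint URI below is a placeholder, not the poster's actual path; the second constructor argument enables incremental checkpointing, which was introduced in 1.3 and can significantly reduce how much state is uploaded per checkpoint.

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbBackendSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder URI; with incremental checkpointing enabled (second argument),
        // only changed RocksDB SST files are uploaded on each checkpoint.
        RocksDBStateBackend backend =
            new RocksDBStateBackend("s3://my-bucket/flink/checkpoints", true);
        env.setStateBackend(backend);
    }
}
```

Whether incremental checkpointing helps here depends on how much of the state changes between checkpoints; for a file-monitoring source with large accumulated state it is usually worth trying.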

2. The majority of checkpoints time out after 15 minutes.
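As an aside (not from the thread): a 15-minute failure window matches Flink's default checkpoint timeout, which can be inspected or raised through `CheckpointConfig`. The interval and values below are illustrative, not the poster's settings.

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuningSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 60 seconds (illustrative value).
        env.enableCheckpointing(60_000L);

        CheckpointConfig cfg = env.getCheckpointConfig();
        // Declare a checkpoint failed only after 30 minutes instead of the default 10.
        cfg.setCheckpointTimeout(30 * 60 * 1000L);
        // Leave breathing room between the end of one checkpoint and the start of the next.
        cfg.setMinPauseBetweenCheckpoints(30_000L);
    }
}
```

Raising the timeout only hides the symptom, of course; it mainly buys time to find what is actually slow (alignment, state upload, or S3 latency).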


On Thu, Aug 16, 2018 at 8:48 PM vino yang <yanghua1127@xxxxxxxxx> wrote:
Hi Alex,

I still have a few questions:

1) Are the file source and the checkpoint logic implemented by you?
2) For the other failed checkpoints, can you provide the corresponding failure logs or more detail, for example whether they failed due to timeout or for other reasons?

Thanks, vino.

Alex Vinnik <alvinnik.g@xxxxxxxxx> wrote on Fri, Aug 17, 2018 at 3:03 AM:
I noticed a strange thing with Flink 1.3 checkpointing. A checkpoint succeeded, but took very long: 15 minutes 53 seconds. The size of the checkpoint metadata on S3 is just 1.7 MB. Most of the time, checkpoints actually fail.

aws --profile cure s3 ls --recursive --summarize --human s3://curation-two-admin/flink/sa-checkpoint/sa1/checkpoint_metadata-c99cfda10951
2018-08-16 13:34:07    1.7 MiB flink/sa-checkpoint/sa1/checkpoint_metadata-c99cfda10951

I came across this discussion: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpoints-very-slow-with-high-backpressure-td12762.html#a19370. But it looks like that problem was caused by high backpressure, which is not the case for me.

taskmanager.network.memory.max is set very small (128 MB); I was hoping to get faster checkpoints with smaller buffers. Since I am reading from durable storage (S3), I am not worried about buffering reads caused by slow writes.
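For reference, a sketch of how this setting sits in flink-conf.yaml. Only the `max` value is from the mail; the companion keys and their values are illustrative, and in this Flink version the values are plain byte counts.

```
# flink-conf.yaml (sketch; only the first value is from the mail)
taskmanager.network.memory.max: 134217728      # 128 MB, as reported
taskmanager.network.memory.min: 67108864       # illustrative: 64 MB
taskmanager.network.memory.fraction: 0.1       # illustrative default-style fraction
```

Small network buffers cap in-flight data, which can shorten barrier alignment during checkpoints, but they can also throttle throughput if any operator is briefly slow.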

Any ideas, what can cause such slow checkpointing? Thanks. -Alex

[Attachment: Screen Shot 2018-08-16 at 1.43.23 PM.png]