Re: Understanding checkpoint behavior


Hi,

“Checkpoint Duration (Sync)” is only the time taken by the synchronous part of snapshotting your operator's state. Your 11m end-to-end time most likely comes from the checkpoint barrier being stuck somewhere in your pipeline for that long before the snapshot, waiting behind a record or a batch of records that was still being processed.

If you write a simple function that does nothing but `Thread.sleep(new Random().nextInt(3600000))`, your checkpoints will take a random amount of time, because a snapshot cannot be taken while your function is executing code. You can read about these concepts in the documentation:

https://ci.apache.org/projects/flink/flink-docs-stable/internals/stream_checkpointing.html
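
To illustrate the point without any Flink dependency, here is a minimal sketch in plain Java. It is not Flink code: the class `BarrierDelayDemo`, the method `run`, and the string markers "record" and "barrier" are hypothetical names made up for this example. It assumes a single task thread that handles queued elements strictly in order, so a barrier queued behind a slow record has to wait for that record to finish before the (tiny) snapshot can happen.

```java
import java.util.ArrayDeque;
import java.util.Queue;

final class BarrierDelayDemo {

    /** Result of one simulated checkpoint. */
    static final class CheckpointTiming {
        final long syncMillis;      // time spent inside the snapshot itself
        final long endToEndMillis;  // time from trigger until the snapshot finished
        final long snapshot;        // the state value captured by the snapshot

        CheckpointTiming(long syncMillis, long endToEndMillis, long snapshot) {
            this.syncMillis = syncMillis;
            this.endToEndMillis = endToEndMillis;
            this.snapshot = snapshot;
        }
    }

    /**
     * Simulates one task thread whose input holds a slow record followed by a
     * checkpoint barrier. The barrier is only handled after the record is done.
     */
    static CheckpointTiming run(long recordProcessingMillis) {
        Queue<String> input = new ArrayDeque<>();
        input.add("record");
        input.add("barrier");

        long state = 0;                        // stand-in for operator state
        long triggeredAt = System.nanoTime();  // checkpoint triggered, barrier already queued

        while (!input.isEmpty()) {
            String element = input.poll();
            if ("record".equals(element)) {
                sleep(recordProcessingMillis); // stands in for slow user code (e.g. a remote call)
                state++;
            } else {
                long snapshotStart = System.nanoTime();
                long snapshot = state;         // the "synchronous part": copy the state
                long snapshotEnd = System.nanoTime();
                return new CheckpointTiming(
                        (snapshotEnd - snapshotStart) / 1_000_000,
                        (snapshotEnd - triggeredAt) / 1_000_000,
                        snapshot);
            }
        }
        throw new IllegalStateException("input contained no barrier");
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while simulating slow user code", e);
        }
    }

    public static void main(String[] args) {
        CheckpointTiming t = run(500);
        System.out.println("sync = " + t.syncMillis + " ms, end-to-end = " + t.endToEndMillis + " ms");
    }
}
```

In this sketch the sync time stays close to zero while the end-to-end time tracks whatever the record processing took, which is the same pattern as the 5 ms / 8 ms versus 11m 12s figures in the question.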

Piotrek

Btw, Flink 1.2.1 is very old and no longer supported. One reason to upgrade is the network stack improvements in Flink 1.5.x, which were aimed in part at reducing checkpoint duration.

> On 5 Nov 2018, at 21:33, PranjalChauhan <pranjalhchauhan@xxxxxxxxx> wrote:
> 
> Hi,
> 
> I am a new Flink user and am currently using Flink 1.2.1. I am trying to
> understand how checkpoints actually work when Window operator is processing
> events.
> 
> My pipeline has the following flow where each operator's parallelism is 1.
> source -> flatmap -> tumbling window -> sink
> In this pipeline, I had configured the window to be evaluated every 1 hour
> (3600 seconds) and the checkpoint interval was 5 mins. The checkpoint
> timeout was set to 1 hour as I wanted the checkpoints to complete.
> 
> In my window function, the job makes an HTTPS call to another service, so the
> window function may take some time to evaluate/process all events.
> 
> Please refer to the following image. In this case, the window was triggered at
> 23:00:00. Checkpoint 12 was triggered soon after that, and I noticed that
> checkpoint 12 takes a long time to complete (compared to other checkpoints
> taken when the window function is not processing events).
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1766/overall_checkpoint_duration_summary_when_waiting_for_window_operator.png> 
> 
> The following images show checkpoint 12 details for the window & sink operators.
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1766/window_operator_checkpoint_duration_after_window_interval.png> 
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1766/sink_operator_checkpoint_duration_after_window_interval.png> 
> 
> I see that the time spent for checkpoint was actually just 5 ms & 8 ms
> (checkpoint duration sync) for window & sink operators. However, End to End
> Duration for checkpoint was 11m 12s for both window & sink operator.
> 
> Is this expected behavior? If yes, do you have any suggestion to reduce the
> end to end checkpoint duration?
> 
> Please let me know if any more information is needed.
> 
> Thanks.
> 
> 
> 
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/