[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Small-files source - partitioning based on prefix of file

Hi Averell,

Conceptually, you are right. Checkpoints are taken at every operator at the same "logical" time.
It is not important, that each operator checkpoints at the same wallclock time. Instead, the need to take a checkpoint when they have processed the same input.
This is implemented with so-called Checkpoint Barriers, which are special records that are injected at the sources.
[Simplification] When an operator receives a barrier it performs a checkpoint. [/Simplification]
This way, we do not need to pause the processing of all operators but can perform the checkpoints locally for each operator.

This page of the Internal docs should help to understand how the mechanism works in detail [1].

Best, Fabian

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.5/internals/stream_checkpointing.html

2018-08-10 14:43 GMT+02:00 Averell <lvhuyen@xxxxxxxxx>:
Thank you Vino, Jorn, and Fabian.
Please forgive me for my ignorant, as I am still not able to fully
understand state/checkpointing and the statement that Fabian gave earlier:
"/In either case, some record will be read twice but if reading position can
be reset, you can still have exactly-once state consistency because the
state is reset as well./"

My current understanding is: checkpointing is managed at the
Execution-Environment level, and it would happen at the same time at all the
operators of the pipeline. Is this true?
My concern here is how to manage that synchronization? It would be quite
possible that at different operators, checkpointing happens at some
milliseconds apart, which would lead to duplicated or missed records,
wouldn't it?

I tried to read Flink's document about managing State  here
. However, I have not been able to find the information I am looking for.
Please help point me to the right place.

Thanks and best regards,