
Questions in sink exactly once implementation

	I am reading the book “Introduction to Apache Flink”, which mentions two ways to achieve an exactly-once sink:
	1. The first way is to buffer all output at the sink and commit it atomically when the sink receives a checkpoint record.
	2. The second way is to eagerly write data to the output, keeping in mind that some of this data might be “dirty” and replayed after a failure. If there is a failure, we roll back the output, overwriting and effectively deleting any dirty data that has already been written.
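
	The first approach can be sketched in a few lines. This is only an illustrative mock, not Flink's actual sink API: the names BufferingSink, invoke, onCheckpoint and onFailure are made up, and a real sink would commit via a transaction or an atomic file rename rather than an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of approach 1: buffer records between checkpoints
// and make them visible atomically when the checkpoint record arrives.
public class BufferingSink {
    private final List<String> pending = new ArrayList<>();   // records since last checkpoint
    private final List<String> committed = new ArrayList<>(); // stands in for the external system

    public void invoke(String record) {
        pending.add(record); // nothing is visible externally yet
    }

    // Checkpoint barrier reached the sink: commit the whole buffer atomically.
    public void onCheckpoint() {
        committed.addAll(pending);
        pending.clear();
    }

    // Failure before the checkpoint: drop the buffer; the records will be
    // replayed from the last checkpoint, so no dirty data is ever visible.
    public void onFailure() {
        pending.clear();
    }

    public List<String> committed() {
        return committed;
    }
}
```

	The cost of this approach is exactly the concern raised below: the sink must hold every record produced between two checkpoints in memory (or spill it somewhere) before anything becomes visible.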

	I read the code of the Elasticsearch sink and found a flushOnCheckpoint option: if it is set to true, changes accumulate until a checkpoint is made. I guess this guarantees at-least-once delivery, because although it uses batch flushing, the flush is not an atomic action, so it cannot guarantee exactly-once delivery.
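
	The at-least-once reasoning can be made concrete with a toy simulation (this is not the actual Elasticsearch sink code): if a non-atomic flush fails halfway, some records of the batch are already in the index, and replaying the batch from the last checkpoint writes them a second time.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a non-atomic batch flush that crashes mid-flush,
// followed by a replay of the whole batch, yields duplicates.
public class NonAtomicFlush {
    public static List<String> run() {
        List<String> external = new ArrayList<>(); // stands in for the ES index
        List<String> batch = List.of("a", "b", "c");

        // First attempt: the flush fails after writing "a" and "b".
        for (int i = 0; i < batch.size(); i++) {
            if (i == 2) break; // simulated failure mid-flush
            external.add(batch.get(i));
        }

        // Recovery replays the whole batch from the last checkpoint.
        external.addAll(batch);
        return external; // "a" and "b" now appear twice
    }
}
```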

	My questions are:
	1. Many sinks do not support transactions, so in those cases I have to choose approach 2 to achieve exactly-once. That means tracking all the records written between checkpoints so they can be rolled back after a failure, which is a rather heavy operation.
	2. I guess a MySQL sink should be able to support exactly-once delivery, because MySQL supports transactions, but in that case the batch has to cover all the records between two checkpoints rather than a fixed size such as 100. When there are a lot of records between checkpoints, that is a heavy operation as well.
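
	For the MySQL case, the shape I have in mind is roughly the following. This is a hand-rolled sketch, not Flink's API; MockTxConnection stands in for a real JDBC connection with autocommit disabled, and the method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical transactional sink: rows are executed inside one open
// transaction and committed only when the checkpoint completes, so the
// effective batch size is "whatever arrived between checkpoints",
// not a fixed count like 100.
public class TxSink {
    static class MockTxConnection {
        private final List<String> uncommitted = new ArrayList<>();
        private final List<String> table = new ArrayList<>();
        void execute(String row) { uncommitted.add(row); }
        void commit()   { table.addAll(uncommitted); uncommitted.clear(); }
        void rollback() { uncommitted.clear(); }
        List<String> table() { return table; }
    }

    private final MockTxConnection conn = new MockTxConnection();

    public void invoke(String row) { conn.execute(row); }  // inside the open transaction
    public void onCheckpoint()     { conn.commit(); }      // one commit per checkpoint
    public void onFailure()        { conn.rollback(); }    // dirty rows never become visible
    public List<String> rows()     { return conn.table(); }
}
```

	With this shape the rollback is free (the database discards the uncommitted transaction), but a long checkpoint interval means a large open transaction, which is the heaviness concern above.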