Thanks a lot for your response. I have a few questions about this topic. Would you please help me about it?
1. I have heard a idempotent way but I do not know how to implement it, would you please enlighten me about it by a example?
2. If dirty data are added but not updated, then only overwrite is not enough I think.
3. If using two-phase commit, the sink must support transaction.
3.1 If the sink does not support transaction, for example, elasticsearch, do I have to use idempotent to implement exactly-once?
3.2 If the sink support transaction, for example, mysql, idempotent and two-phase commit is both OK. But like you say, if there are a lot of items between checkpoints, the batch insert is a heavy action, I still have to use idempotent way to implement exactly-once.
Yes, exactly once using atomic way is heavy for mysql. However, you don't have to buffer data if you choose option 2. You can simply overwrite old records with new ones if result data is idempotent and this way can also achieve exactly once.
There is a document about End-to-End Exactly-Once Processing in Apache Flink, which may be helpful for you.
I am reading the book “Introduction to Apache Flink”, and in the book there mentions two ways to achieve sink exactly once:
1. The first way is to buffer all output at the sink and commit this atomically when the sink receives a checkpoint record.
2. The second way is to eagerly write data to the output, keeping in mind that some of this data might be “dirty” and replayed after a failure. If there is a failure, then we need to roll back the output, thus overwriting the dirty data and effectively deleting dirty data that has already been written to the output.
I read the code of Elasticsearch sink, and find there is a flushOnCheckpoint option, if set to true, the change will accumulate until checkpoint is made. I guess it will guarantee at-least-once delivery, because although it use batch flush, but the flush is not a atomic action, so it can not guarantee exactly-once delivery.
My question is :
1. As many sinks do not support transaction, at this case I have to choose 2 to achieve exactly once. At this case, I have to buffer all the records between checkpoints and delete them, it is a bit heavy action.
2. I guess mysql sink should support exactly once delivery, because it supports transaction, but at this case I have to execute batch according to the number of actions between checkpoints but not a specific number, 100 for example. When there are a lot of items between checkpoints, it is a heavy action either.