[DISCUSS][TABLE] How to handle empty delete for UpsertSource
Currently, I am working on FLINK-8577 Implement proctime DataStream to
Table upsert conversion <https://issues.apache.org/jira/browse/FLINK-8577>.
And a design doc can be found here
It received many valuable suggestions. Many thanks to all of you.
However there are some problems I think may need more discussion.
1. *Upsert Stream:* Stream that include a key definition and will be
updated. Message types include insert, update and delete.
2. *Upsert Source:* Source that ingest Upsert Stream.
3. *Empty Delete:* For a specific key, the first message is a delete
*Problem to be discussed*
How to handle empty deletes for UpsertSource？
*Ways to solve the problem*
1. Throw away empty delete messages in the UpsertSource(personally in
favor of this option)
- This makes sense in semantics. An empty table + delete message is
still an empty table. Losing deletes does not affect the final results.
- At present, the operators or functions in flink are assumed to
process the add message first and then delete. Throw away
empty deletes in
source, so that the downstream operators do not need to
consider the empty
- Maintaining the state in source is expensive, especially for some
simple sql like: UpsertSource -> Calc -> UpsertSink.
2. Throw away empty delete messages when source generate
retractions, otherwise pass empty delete messages down
- Downstream operator does not need to consider empty delete messages
when the source generates retraction.
- Performance is better since source don't have to maintain state
if it doesn't generate retractions.
- The judgment that whether the downstream operator will receive
empty delete messages is complicated. Not only take source into
consideration, but also should consider the operators that
are followed by
source. Take join as an example, for the sql: upsert_source
the join receives empty deletes while in sql(upsert_source ->
upsert_join), the join doesn't since empty deletes are ingested by
- The semantics of how to process empty deletes are not clear.
Users may be difficult to understand, because sometimes empty
passed down, but sometimes don't.
3. Pass empty delete messages down always
- Performance is better since source don't have to maintain state if
it doesn't generate retractions.
- All table operators and functions in flink need to consider empty
*Another related problem*
Another related problem is FLINK-9528 Incorrect results: Filter does not
treat Upsert messages correctly
<https://issues.apache.org/jira/browse/FLINK-9528> which I think should be
The problem in FLINK-9528 is, for sql like upsert_source -> filter ->
upsert_sink, when the data of a key changes from non-filtering to
filtering, the filter only removes the upsert message such that the
previous version remains in the result.
1. One way to solve the problem is to make UpserSource generates
2. Another way is to make a filter aware of the update semantics
(retract or upsert) and convert the upsert message into a delete message if
the predicate evaluates to false.
The second way will also generate many empty delete messages. To avoid too
many empty deletes, the solution is to maintain a filter state at sink to
prevent the empty deletes from causing devastating pressure on the physical
database. However, if UpsertSource can also output empty deletes, these
empty deletes will be more difficult to control. We don't know where these
deletes come from, and whether should be filtered out. The ambiguity of the
semantics of processing empty deletes makes the user unable to judge
whether there will be empty deletes output.
*My personal opinion*
>From my point of view, I think the first option(Throw away empty delete
messages in the UpsertSource) is the best, not only because the semantics
are more clear but also the processing logic of the entire table layer can
be more simple thus more efficient. Furthermore the performance loss is
acceptable (We can even only store key in state when source doesn't
Any suggestions are greatly appreciated!