
Re: Tombstone removal optimization and question

Yes, it does. Consider what would happen if it didn't: if you kept writing to the same partition, you'd never be able to remove any tombstones for that partition.

On Tue., 6 Nov. 2018, 19:40 DuyHai Doan <doanduyhai@xxxxxxxxx> wrote:
Hello all

I have tried to sum up all rules related to tombstone removal:


Given a tombstone written at timestamp (t) for a partition key (P) in SSTable (S1), this tombstone will be removed only when all of the following conditions hold:

1) after the gc_grace_seconds period has passed
2) at the next compaction round, if SSTable S1 is selected (not at all guaranteed because compaction is not deterministic)
3) if the partition key (P) is not present in any other SSTable that is NOT picked by the current round of compaction

Rule 3) is quite complex to understand so here is the detailed explanation:

If partition key (P) also exists in another SSTable (S2) that is NOT compacted together with SSTable (S1), and we remove the tombstone, some data in S2 that the tombstone was shadowing may be resurrected.

More precisely, at compaction time Cassandra has no details about the data for partition (P) that remains in S2, so it cannot remove the tombstone right away.
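
The three rules above can be sketched as a single predicate. This is only an illustration of the rules as stated in this message, not Cassandra's actual code; the names `SSTable`, `can_drop_tombstone` and all parameters are hypothetical, and each SSTable is modelled simply as the set of partition keys it contains.

```python
from dataclasses import dataclass


@dataclass
class SSTable:
    # Hypothetical model: an SSTable is just a name plus the partition keys it holds.
    name: str
    partition_keys: frozenset


def can_drop_tombstone(partition_key, deletion_time, now, gc_grace_seconds,
                       home, compacting, all_sstables):
    """Return True if the tombstone for partition_key stored in `home` may be dropped."""
    # Rule 1: the gc_grace_seconds period must have passed.
    if now < deletion_time + gc_grace_seconds:
        return False
    # Rule 2: the SSTable holding the tombstone must be part of this compaction round.
    if home not in compacting:
        return False
    # Rule 3: the partition must not exist in any SSTable left out of this round.
    for sstable in all_sstables:
        if sstable not in compacting and partition_key in sstable.partition_keys:
            return False
    return True


s1 = SSTable("S1", frozenset({"P"}))   # holds the tombstone for P
s2 = SSTable("S2", frozenset({"P"}))   # also holds data for P
s3 = SSTable("S3", frozenset({"Q"}))   # unrelated partition
tables = [s1, s2, s3]

# 864000 seconds (10 days) is the default gc_grace_seconds.
print(can_drop_tombstone("P", 0, 1_000_000, 864_000, s1, [s1], tables))      # False
print(can_drop_tombstone("P", 0, 1_000_000, 864_000, s1, [s1, s2], tables))  # True
```

In the first call S2 is left out of the compaction and still contains P, so rule 3 blocks the removal; in the second call S1 and S2 are compacted together and the tombstone can go.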

Now, for each SSTable, we have some metadata, namely minTimestamp and maxTimestamp. 

I wonder whether the current compaction code leverages this metadata for tombstone removal. Indeed, if we know that the tombstone timestamp (t) < minTimestamp of S2, then every write in S2 is newer than the tombstone and cannot be shadowed by it, so the tombstone can be safely removed.

Does anyone have this information?
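
The timestamp check proposed here can be sketched as follows. It is a minimal illustration of the idea, assuming the comparison is made against the minTimestamp of each SSTable that contains (P) but is not part of the compaction; `SSTableMeta` and `safe_to_drop` are hypothetical names, not Cassandra APIs.

```python
from dataclasses import dataclass


@dataclass
class SSTableMeta:
    # Hypothetical model of the per-SSTable metadata mentioned above.
    name: str
    min_timestamp: int  # smallest write timestamp found in the SSTable
    max_timestamp: int  # largest write timestamp found in the SSTable


def safe_to_drop(tombstone_ts, overlapping):
    """overlapping: SSTables that contain the partition but are NOT being compacted.

    Every write in an SSTable has a timestamp >= min_timestamp. If the tombstone
    is older than that, all data there is newer than the tombstone, so nothing in
    that SSTable is shadowed by it and nothing can be resurrected.
    """
    return all(tombstone_ts < s.min_timestamp for s in overlapping)


s2 = SSTableMeta("S2", min_timestamp=200, max_timestamp=300)
s3 = SSTableMeta("S3", min_timestamp=50, max_timestamp=120)

print(safe_to_drop(100, [s2]))      # True: everything in S2 is newer than t=100
print(safe_to_drop(100, [s2, s3]))  # False: S3 may hold data older than t=100
```

The check is conservative: a False result only means older data might exist in the other SSTable, not that it does.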