
Tombstone passed GC period causes un-repairable inconsistent data


We know that deleted data may reappear if repair is not run within
gc_grace_seconds: if the tombstone has not been propagated to all nodes by
then, the data comes back. But it also causes the following two issues
before the tombstone is compacted away:

a. Inconsistent query results

With consistency level ONE or QUORUM, a read may or may not return the value.

b. Lots of read repairs that don't repair anything

With consistency level ALL, every read triggers a read repair.
With consistency level QUORUM, a read is also very likely (2/3) to trigger a
read repair. But the read repair doesn't fix the data, so it is triggered
again on every such read.
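
To make the 2/3 figure concrete, here is a small sketch that enumerates the replica pairs a QUORUM read can contact at RF=3. It assumes the coordinator picks each pair with equal probability, which is a simplification (snitch and latency-based replica selection can skew this), and it treats node2 as the replica that missed the tombstone, matching the steps below.

```python
from fractions import Fraction
from itertools import combinations

replicas = ["node1", "node2", "node3"]
stale = {"node2"}  # assumption: the one replica that never received the tombstone

# A QUORUM read at RF=3 contacts 2 replicas.
# Assumption: every pair is equally likely to be chosen.
pairs = list(combinations(replicas, 2))

# A mismatch (and hence a read repair) happens whenever the stale replica is in the pair.
mismatched = [p for p in pairs if stale & set(p)]

probability = Fraction(len(mismatched), len(pairs))
print(pairs)
print(probability)  # 2/3
```

Only the (node1, node3) pair agrees, so two of the three possible pairs produce a mismatch.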

Here are the steps to reproduce:

1. Create a 3-node cluster
2. Create a table (with a small gc_grace_seconds):

CREATE KEYSPACE foo WITH replication = {'class': 'SimpleStrategy',
'replication_factor': 3};
CREATE TABLE foo.bar (
    id int PRIMARY KEY,
    name text
) WITH gc_grace_seconds=30;

3. Insert data with consistency ALL:

INSERT INTO foo.bar (id, name) VALUES(1, 'cstar');

4. Stop one node:

$ ccm node2 stop

5. Delete the data with consistency QUORUM:

DELETE FROM foo.bar WHERE id=1;

6. Wait 30 seconds and then start node2:

$ ccm node2 start

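Now the tombstone is on node1 and node3 but not on node2.

A QUORUM read may or may not return the value, depending on whether node2 is
one of the replicas contacted. When it is, read repair sends the data from
node2 to node1 and node3, but that doesn't repair anything: the data is
presumably still shadowed by the expired tombstone on those nodes, so the
next read hits the same mismatch.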
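
The behaviour above can be illustrated with a toy model. This is not Cassandra's code: `Cell`, `read`, `newest` and `read_repair` are hypothetical names, and the model rests on two assumptions about what is happening, namely that a replica omits a tombstone older than gc_grace_seconds from its read response, and that the same tombstone still wins over older writes on that replica until compaction removes it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

GC_GRACE = 30  # seconds, as in the table definition


@dataclass(frozen=True)
class Cell:
    value: Optional[str]  # None marks a tombstone
    write_ts: int         # write timestamp used for last-write-wins
    deleted_at: int = 0   # wall-clock second at which the tombstone was created


def newest(a: Optional[Cell], b: Optional[Cell]) -> Optional[Cell]:
    """Last-write-wins merge of two optional cells."""
    if a is None:
        return b
    if b is None:
        return a
    return a if a.write_ts >= b.write_ts else b


def read(stored: Optional[Cell], now: int) -> Optional[Cell]:
    """Assumed replica behaviour: expired tombstones are left out of the response."""
    if stored is not None and stored.value is None and now - stored.deleted_at > GC_GRACE:
        return None
    return stored


def read_repair(replicas: Dict[str, Optional[Cell]], contacted: List[str], now: int) -> Optional[Cell]:
    """Merge the responses, then write the winner back to the contacted replicas."""
    winner = None
    for name in contacted:
        winner = newest(winner, read(replicas[name], now))
    if winner is not None:
        for name in contacted:
            # The replica merges the repair write with what it already stores.
            replicas[name] = newest(replicas[name], winner)
    return winner


insert = Cell("cstar", write_ts=1)
tombstone = Cell(None, write_ts=2, deleted_at=100)
replicas = {"node1": tombstone, "node2": insert, "node3": tombstone}
before = dict(replicas)
now = 200  # more than GC_GRACE seconds after the delete

result = read_repair(replicas, ["node1", "node2"], now)
print(result.value)        # the deleted value is returned
print(replicas == before)  # the repair write changed nothing
print(read_repair(replicas, ["node1", "node3"], now))  # this pair returns no row
```

In this model the repair write carries the older timestamp, so the tombstone on node1 wins the merge and the replica state is unchanged, which is why the same mismatch recurs on every read that includes node2.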

I'd like to discuss a few potential solutions and workarounds:

1. Can hint replay send GCed tombstones?

2. Can we have a "deep repair" that detects this issue and repairs the GCed
tombstones? Or temporarily increase gc_grace_seconds for the repair?

What other suggestions do you have for a user who is hitting this issue?