Re: Academic paper about Cassandra database compaction
I suspect I know what causes the increased disk usage in TWCS, and it's a
solvable problem. It goes roughly like this:
- Window 1 has sstables 1, 2, 3, 4, 5, 6
- We start compacting 1, 2, 3, 4 (using STCS-in-TWCS first window)
- The TWCS window rolls over
- We flush (sstable 7), and trigger the TWCS window major compaction, which
starts compacting 5, 6, 7 + any other sstable from that window
- If the first compaction (1,2,3,4) has finished by the time sstable 7 is
flushed, we'll include its result in that compaction; if it hasn't, we'll
have to do the major compaction twice to guarantee we have exactly one
sstable per window, which will temporarily increase disk usage
We can likely fix this by not scheduling the major compaction until we know
all of the sstables in the window are available to be compacted.
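To make the proposed fix concrete, here is a minimal sketch of the gating
check in Python. This is a toy model, not Cassandra's actual code: the
Window class and the functions major_compaction_candidates and
finish_compaction are hypothetical names made up for illustration. The
idea is only that the window major compaction is deferred while any
sstable in that window is still an input to an in-flight compaction.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Window:
    """Toy model of one TWCS window (hypothetical, not Cassandra's types)."""
    sstables: Set[int] = field(default_factory=set)
    # Inputs of compactions that are currently running in this window.
    compacting: Set[int] = field(default_factory=set)


def major_compaction_candidates(window: Window) -> Set[int]:
    """Return the sstables to major-compact, or an empty set to defer."""
    # Defer while any sstable in the window is busy in another compaction;
    # otherwise its result would miss this major compaction and force a
    # second one later.
    if window.compacting & window.sstables:
        return set()
    # A single sstable already satisfies "one sstable per window".
    if len(window.sstables) < 2:
        return set()
    return set(window.sstables)


def finish_compaction(window: Window, inputs: Set[int], output: int) -> None:
    """Replace the compacted inputs with the single output sstable."""
    window.sstables -= inputs
    window.compacting -= inputs
    window.sstables.add(output)
```

With the scenario above (sstables 1-7 present, 1-4 still compacting), the
check defers. Once the first compaction finishes and produces, say,
sstable 8, the check returns 5, 6, 7 and 8, so the window is rewritten
once instead of twice.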
Also, your data model is probably typical, but not well suited for time
series cases. In my 2016 Cassandra Summit TWCS talk (it's on YouTube),
I mention aligning partition keys to TWCS windows, which involves
adding a second component to the partition key. This is hugely important in
terms of making sure TWCS data expires quickly and avoiding having to read
from more than one TWCS window at a time.
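As a rough illustration of what that second partition key component can
look like, here is a small Python sketch. It assumes a one-day TWCS
window, epoch-aligned buckets in UTC, and rows written close to their
event time; the names WINDOW, window_bucket, partition_key and
buckets_for_range are hypothetical, as is the sensor id used as the first
key component.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Assumption: this matches the table's TWCS compaction window size.
WINDOW = timedelta(days=1)


def window_bucket(ts: datetime, window: timedelta = WINDOW) -> int:
    """Map a timezone-aware timestamp to an epoch-aligned bucket number."""
    if ts.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return int(ts.timestamp()) // int(window.total_seconds())


def partition_key(sensor_id: str, ts: datetime) -> Tuple[str, int]:
    """Composite partition key: the natural key plus the window bucket."""
    return (sensor_id, window_bucket(ts))


def buckets_for_range(start: datetime, end: datetime,
                      window: timedelta = WINDOW) -> List[int]:
    """Buckets (and so partitions) a query over [start, end] must touch."""
    first = window_bucket(start, window)
    last = window_bucket(end, window)
    return list(range(first, last + 1))
```

With keys built this way, a partition never spans more than one window,
so a read for a single day touches one bucket, and a whole partition
ages out together once its window's TTL has passed.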
On Mon, May 14, 2018 at 7:12 AM, Lucas Benevides <
> Dear community,
> I want to tell you about my paper published in a conference in March. The
> title is "NoSQL Database Performance Tuning for IoT Data - Cassandra
> Case Study" and it is available (not for free) at DOI
> 10.5220/0006782702770284.
> TWCS is used and compared with DTCS.
> I hope you can download it, unfortunately I cannot send copies as the
> publisher has its copyright.
> Lucas B. Dias