Wondering how cql3 DISTINCT query is implemented

Hi, we built a simple system to migrate live Cassandra data to other databases, mainly using these two queries:

1. SELECT DISTINCT TOKEN(partition_key) FROM table WHERE TOKEN(partition_key) > current_offset AND TOKEN(partition_key) <= upper_bound LIMIT token_fetch_size
2. Any CQL query that retrieves all rows for a given set of tokens
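
For context, the loop around query 1 looks roughly like the sketch below. It is a simplified illustration and not our actual code: `iter_tokens`, `distinct_token_query` and the injected `execute` callable are hypothetical names, the `%s` placeholders assume the DataStax Python driver's parameter style, and the loop assumes results come back in token order.

```python
from typing import Callable, Iterator, List, Tuple

Params = Tuple[int, int, int]


def distinct_token_query(table: str, partition_key: str) -> str:
    """Build query 1 for one table (hypothetical helper)."""
    tok = f"TOKEN({partition_key})"
    return (
        f"SELECT DISTINCT {tok} FROM {table} "
        f"WHERE {tok} > %s AND {tok} <= %s LIMIT %s"
    )


def iter_tokens(
    execute: Callable[[str, Params], List[int]],
    table: str,
    partition_key: str,
    current_offset: int,
    upper_bound: int,
    token_fetch_size: int,
) -> Iterator[int]:
    """Yield every partition token in (current_offset, upper_bound].

    `execute` stands in for whatever runs the CQL and returns the
    token values of one page as a list (an assumption of this sketch).
    """
    query = distinct_token_query(table, partition_key)
    while current_offset < upper_bound:
        page = execute(query, (current_offset, upper_bound, token_fetch_size))
        if not page:
            return  # nothing left in this token range
        yield from page
        # Assumes token order: the last token of the page is the new offset.
        current_offset = page[-1]
        if len(page) < token_fetch_size:
            return  # a short page means the range is exhausted
```

Each iteration issues one "SELECT DISTINCT TOKEN" query, and that per-page call is the one whose latency grows on wide-partition tables.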

We observed that the "SELECT DISTINCT TOKEN" query takes much longer when the table has wide partitions (about 200+ rows per partition on average). It looks like the cost of the underlying operation does not grow linearly with the number of partitions returned.

Does the query scan every row of every partition it finds until token_fetch_size is reached? Or is the slowdown due to some low-level operations that are inherently more expensive on wide-partition data?

Any advice on this question, or pointers to the relevant code, would be appreciated.