
Quantifying Virtual Node Impact on Cassandra Availability

Josh Snyder and I have been evaluating virtual nodes for large-scale
deployments. While there seems to be a lot of anecdotal support for
reducing the vnode count [1], we couldn't find any concrete math on the
topic, so we had some fun and took a whack at quantifying how different
choices of num_tokens impact a Cassandra cluster.

According to the model we developed [2], it seems that at small cluster
sizes there isn't much of a negative impact on availability, but when
clusters scale up to hundreds of hosts, vnodes have a major impact on
availability. In particular, the probability of outage during short
failures (e.g. process restarts or failures) or permanent failure (e.g.
disk or machine failure) appears to be orders of magnitude higher for large
clusters.
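
A rough back-of-the-envelope sketch of the intuition follows. It is not the
model in [2]; it assumes tokens are placed uniformly at random, a
replication factor of 3, that each token shares ranges with up to
2 * (RF - 1) other hosts, and that every other host is independently down
with probability p_down. The function and parameter names are hypothetical.

```python
def expected_neighbors(num_hosts, num_tokens, rf=3):
    """Approximate count of distinct hosts sharing a token range with one host.

    Assumption: each token picks its 2 * (rf - 1) range neighbors uniformly
    at random from the other hosts (a balls-in-bins approximation).
    """
    others = num_hosts - 1
    if others <= 0:
        return 0.0
    draws = 2 * (rf - 1) * num_tokens
    # Expected number of distinct hosts hit at least once in `draws` picks.
    return others * (1.0 - (1.0 - 1.0 / others) ** draws)


def p_second_failure(num_hosts, num_tokens, p_down, rf=3):
    """Chance that at least one range neighbor is also down, given that one
    host is already down (so some range would lose quorum at rf=3)."""
    neighbors = expected_neighbors(num_hosts, num_tokens, rf)
    return 1.0 - (1.0 - p_down) ** neighbors


if __name__ == "__main__":
    # Compare a small and a large cluster across a few num_tokens values.
    for hosts in (6, 300):
        for tokens in (1, 16, 256):
            p = p_second_failure(hosts, tokens, p_down=0.001)
            print(f"hosts={hosts:4d} num_tokens={tokens:4d} p={p:.5f}")
```

Under these assumptions a host in a small cluster already neighbors nearly
every other host, so raising num_tokens changes little, while in a cluster
of hundreds of hosts the neighbor count grows from a handful toward the
whole cluster.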

The model attempts to explain why we may care about this and advances a few
existing and new ideas for how to solve the scalability problems that vnodes
address without the availability problems (and consistency problems, due to
the effects on repair) that a high num_tokens creates. We would of course be
very interested in any feedback. The model source code is on GitHub [3]; PRs
are welcome, or feel free to play around with the Jupyter notebook to match
your environment and see what the graphs look like. I didn't attach the PDF
here because it's apparently too large (lots of pretty graphs).

I know that users can always just pick whichever number they prefer, but I
think the current default was chosen when token placement was random, and I
wonder whether it's still the right default.

Thank you,
-Joey Lynch

[1] https://issues.apache.org/jira/browse/CASSANDRA-13701
[3] https://github.com/jolynch/python_performance_toolkit/tree/