Re: Quantifying Virtual Node Impact on Cassandra Availability


Great write-up. Glad someone finally did the math for us. I don't think
this will come as a surprise to many of the developers. Availability is
only one issue raised by vnodes; load distribution and performance are also
pretty big concerns.

I'm always a proponent of fixing vnodes, and of removing them as the default
until we do. Happy to help on this; we have ideas in mind that I'll create
tickets for at some point...

On Tue., 17 Apr. 2018, 06:16 Joseph Lynch, <joe.e.lynch@xxxxxxxxx> wrote:

> If the blob link on github doesn't work for the pdf (looks like mobile
> might not like it), try:
>
>
> https://github.com/jolynch/python_performance_toolkit/raw/master/notebooks/cassandra_availability/whitepaper/cassandra-availability-virtual.pdf
>
> -Joey
>
> On Mon, Apr 16, 2018 at 1:14 PM, Joseph Lynch <joe.e.lynch@xxxxxxxxx>
> wrote:
>
> > Josh Snyder and I have been working on evaluating virtual nodes for
> > large-scale deployments, and while there seems to be a lot of anecdotal
> > support for reducing the vnode count [1], we couldn't find any concrete
> > math on the topic, so we had some fun and took a whack at quantifying how
> > different choices of num_tokens impact a Cassandra cluster.
> >
> > According to the model we developed [2], it seems that at small cluster
> > sizes there isn't much of a negative impact on availability, but when
> > clusters scale up to hundreds of hosts, vnodes have a major impact on
> > availability. In particular, the probability of outage during short
> > failures (e.g. process restarts or process failures) or permanent
> > failures (e.g. disk or machine failure) appears to be orders of magnitude
> > higher for large clusters.
> >
> > The model attempts to explain why we may care about this and advances a
> > few existing and new ideas for how to solve the scalability problems that
> > vnodes address without the availability (and consistency, due to the
> > effects on repair) problems that high num_tokens create. We would of
> > course be very interested in any feedback. The model source code is on
> > github [3]; PRs are welcome, or feel free to play around with the jupyter
> > notebook to match your environment and see what the graphs look like. I
> > didn't attach the pdf here because it's apparently too large (lots of
> > pretty graphs).
> >
> > I know that users can always just pick whichever number they prefer, but
> > I think the current default was chosen when token placement was random,
> > and I wonder whether it's still the right default.
> >
> > Thank you,
> > -Joey Lynch
> >
> > [1] https://issues.apache.org/jira/browse/CASSANDRA-13701
> > [2] https://github.com/jolynch/python_performance_toolkit/raw/master/notebooks/cassandra_availability/whitepaper/cassandra-availability-virtual.pdf
> >     <https://github.com/jolynch/python_performance_toolkit/blob/master/notebooks/cassandra_availability/whitepaper/cassandra-availability-virtual.pdf>
> > [3] https://github.com/jolynch/python_performance_toolkit/tree/master/notebooks/cassandra_availability
> >
>
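
For readers who want a feel for why a higher num_tokens could hurt
availability as clusters grow, here is a rough back-of-the-envelope sketch.
It is NOT the model from the whitepaper [2]; it is a toy approximation that
assumes random token placement, independent node failures, and that each
vnode contributes about 2 * (RF - 1) replica "slots" filled by uniformly
random other nodes. The function names and the p_down parameter are made up
for illustration.

```python
def expected_neighbors(num_nodes, num_tokens, rf=3):
    """Approximate count of distinct nodes sharing a token range with one node.

    Toy assumption: each vnode yields 2 * (rf - 1) neighbor slots (replicas
    on either side of the range), each filled by a uniformly random other
    node. This is a balls-into-bins estimate, not the whitepaper's model.
    """
    others = num_nodes - 1
    if others <= 0:
        return 0.0
    slots = 2 * (rf - 1) * num_tokens
    # Expected number of distinct bins hit when throwing `slots` balls
    # into `others` bins.
    return others * (1.0 - (1.0 - 1.0 / others) ** slots)


def p_neighbor_also_down(num_nodes, num_tokens, p_down, rf=3):
    """Chance that at least one neighbor is down while a given node is down.

    Assumes every other node is independently down with probability p_down.
    With rf=3 and quorum requests, two replicas of the same range being down
    at once makes that range unavailable.
    """
    neighbors = expected_neighbors(num_nodes, num_tokens, rf)
    return 1.0 - (1.0 - p_down) ** neighbors


if __name__ == "__main__":
    # p_down = 0.001 is an arbitrary illustrative value, not measured data.
    for nodes in (12, 96, 384):
        for tokens in (1, 16, 256):
            p = p_neighbor_also_down(nodes, tokens, p_down=0.001)
            print(f"nodes={nodes:4d} num_tokens={tokens:4d} p={p:.4f}")
```

Under these assumptions, a small cluster saturates quickly (every node is
already a neighbor of every other), while in a large cluster the neighbor
count keeps growing with num_tokens, which is the qualitative effect the
quoted message describes.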