OSDir


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Quantifying Virtual Node Impact on Cassandra Availability


I'm also not convinced the problems listed in the paper with removenode are
so serious. With lots of vnodes per node, removenode causes data to be
streamed into all other nodes in parallel, so is (n-1) times quicker than
replacement for n nodes. For R=3, the failure rate goes up with vnodes
(without vnodes, after the first failure, any 4 neighbouring node failures
lose quorum but for vnodes, any other node failure loses quorum) by a
factor of (n-1)/4. The increase in speed more than offsets this so in fact
vnodes with removenode give theoretically 4x higher availability than no
vnodes.

If anyone is interested in using vnodes in large clusters I'd strongly
suggest testing this out to see if the concerns in section 4.3.3 are valid.

Richard.

On 17 April 2018 at 08:29, Jeff Jirsa <jjirsa@xxxxxxxxx> wrote:

> There are two huge advantages
>
> 1) during expansion / replacement / decom, you stream from far more
> ranges. Since streaming is single threaded per stream, this enables you to
> max out machines during streaming where single token doesn’t
>
> 2) when adjusting the size of a cluster, you can often grow incrementally
> without rebalancing
>
> Streaming entire wholly covered/contained/owned sstables during range
> movements is probably a huge benefit in many use cases that may make the
> single threaded streaming implementation less of a concern, and likely
> works reasonably well without major changes to LCS in particular  - I’m
> fairly confident there’s a JIRA for this, if not it’s been discussed in
> person among various operators for years as an obvious future improvement.
>
> --
> Jeff Jirsa
>
>
> > On Apr 17, 2018, at 8:17 AM, Carl Mueller <carl.mueller@xxxxxxxxxxxxxxx>
> wrote:
> >
> > Do Vnodes address anything besides alleviating cluster planners from
> doing
> > token range management on nodes manually? Do we have a centralized list
> of
> > advantages they provide beyond that?
> >
> > There seem to be lots of downsides. 2i index performance, the above
> > availability, etc.
> >
> > I also wonder if in vnodes (and manually managed tokens... I'll return to
> > this) the node recovery scenarios are being hampered by sstables having
> the
> > hash ranges of the vnodes intermingled in the same set of sstables. I
> > wondered in another thread in vnodes why sstables are separated into sets
> > by the vnode ranges they represent. For a manually managed contiguous
> token
> > range, you could separate the sstables into a fixed number of sets, kind
> of
> > vnode-light.
> >
> > So if there was rebalancing or reconstruction, you could sneakernet or
> > reliably send entire sstable sets that would belong in a range.
> >
> > I also thing this would improve compactions and repairs too. Compactions
> > would be naturally parallelizable in all compaction schemes, and repairs
> > would have natural subsets to do merkle tree calculations.
> >
> > Granted sending sstables might result in "overstreaming" due to data
> > replication across the sstables, but you wouldn't have CPU and random I/O
> > to look up the data. Just sequential transfers.
> >
> > For manually managed tokens with subdivided sstables, if there was
> > rebalancing, you would have the "fringe" edges of the hash range
> subdivided
> > already, and you would only need to deal with the data in the border
> areas
> > of the token range, and again could sneakernet / dumb transfer the tables
> > and then let the new node remove the unneeded in future repairs.
> > (Compaction does not remove data that is not longer managed by a node,
> only
> > repair does? Or does only nodetool clean do that?)
> >
> > Pre-subdivided sstables for manually maanged tokens would REALLY pay big
> > dividends in large-scale cluster expansion. Say you wanted to double or
> > triple the cluster. Since the sstables are already split by some numeric
> > factor that has lots of even divisors (60 for RF 2,3,4,5), you simply
> bulk
> > copy the already-subdivided sstables for the new nodes' hash ranges and
> > you'd basically be done. In AWS EBS volumes, that could just be a drive
> > detach / drive attach.
> >
> >
> >
> >
> >> On Tue, Apr 17, 2018 at 7:37 AM, kurt greaves <kurt@xxxxxxxxxxxxxxx>
> wrote:
> >>
> >> Great write up. Glad someone finally did the math for us. I don't think
> >> this will come as a surprise for many of the developers. Availability is
> >> only one issue raised by vnodes. Load distribution and performance are
> also
> >> pretty big concerns.
> >>
> >> I'm always a proponent for fixing vnodes, and removing them as a default
> >> until we do. Happy to help on this and we have ideas in mind that at
> some
> >> point I'll create tickets for...
> >>
> >>> On Tue., 17 Apr. 2018, 06:16 Joseph Lynch, <joe.e.lynch@xxxxxxxxx>
> wrote:
> >>>
> >>> If the blob link on github doesn't work for the pdf (looks like mobile
> >>> might not like it), try:
> >>>
> >>>
> >>> https://github.com/jolynch/python_performance_toolkit/
> >> raw/master/notebooks/cassandra_availability/whitepaper/cassandra-
> >> availability-virtual.pdf
> >>>
> >>> -Joey
> >>> <
> >>> https://github.com/jolynch/python_performance_toolkit/
> >> raw/master/notebooks/cassandra_availability/whitepaper/cassandra-
> >> availability-virtual.pdf
> >>>>
> >>>
> >>> On Mon, Apr 16, 2018 at 1:14 PM, Joseph Lynch <joe.e.lynch@xxxxxxxxx>
> >>> wrote:
> >>>
> >>>> Josh Snyder and I have been working on evaluating virtual nodes for
> >> large
> >>>> scale deployments and while it seems like there is a lot of anecdotal
> >>>> support for reducing the vnode count [1], we couldn't find any
> concrete
> >>>> math on the topic, so we had some fun and took a whack at quantifying
> >> how
> >>>> different choices of num_tokens impact a Cassandra cluster.
> >>>>
> >>>> According to the model we developed [2] it seems that at small cluster
> >>>> sizes there isn't much of a negative impact on availability, but when
> >>>> clusters scale up to hundreds of hosts, vnodes have a major impact on
> >>>> availability. In particular, the probability of outage during short
> >>>> failures (e.g. process restarts or failures) or permanent failure
> (e.g.
> >>>> disk or machine failure) appears to be orders of magnitude higher for
> >>> large
> >>>> clusters.
> >>>>
> >>>> The model attempts to explain why we may care about this and advances
> a
> >>>> few existing/new ideas for how to fix the scalability problems that
> >>> vnodes
> >>>> fix without the availability (and consistency—due to the effects on
> >>> repair)
> >>>> problems high num_tokens create. We would of course be very interested
> >> in
> >>>> any feedback. The model source code is on github [3], PRs are welcome
> >> or
> >>>> feel free to play around with the jupyter notebook to match your
> >>>> environment and see what the graphs look like. I didn't attach the pdf
> >>> here
> >>>> because it's too large apparently (lots of pretty graphs).
> >>>>
> >>>> I know that users can always just pick whichever number they prefer,
> >> but
> >>> I
> >>>> think the current default was chosen when token placement was random,
> >>> and I
> >>>> wonder whether it's still the right default.
> >>>>
> >>>> Thank you,
> >>>> -Joey Lynch
> >>>>
> >>>> [1] https://issues.apache.org/jira/browse/CASSANDRA-13701
> >>>> [2] https://github.com/jolynch/python_performance_toolkit/
> >>>> raw/master/notebooks/cassandra_availability/whitepaper/cassandra-
> >>>> availability-virtual.pdf
> >>>>
> >>>> <
> >>> https://github.com/jolynch/python_performance_toolkit/
> >> blob/master/notebooks/cassandra_availability/whitepaper/cassandra-
> >> availability-virtual.pdf
> >>>>
> >>>> [3] https://github.com/jolynch/python_performance_toolkit/tree/m
> >>>> aster/notebooks/cassandra_availability
> >>>>
> >>>
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@xxxxxxxxxxxxxxxxxxxx
> For additional commands, e-mail: dev-help@xxxxxxxxxxxxxxxxxxxx
>
>