Re: Quantifying Virtual Node Impact on Cassandra Availability


I've posted a bunch of things on here relevant to the commitlog --> sstable
path and the associated compaction / sstable metadata changes. I really need
to learn that section of the code.

On Tue, Apr 17, 2018 at 10:29 AM, Jeff Jirsa <jjirsa@xxxxxxxxx> wrote:

> There are two huge advantages
>
> 1) during expansion / replacement / decommission, you stream from far more
> ranges. Since streaming is single threaded per stream, this enables you to
> max out machines during streaming where a single token doesn't
>
> 2) when adjusting the size of a cluster, you can often grow incrementally
> without rebalancing
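
(A rough back-of-envelope for point 1, purely illustrative; the real stream
count depends on topology, rack placement, and the streaming implementation:)

    # Hypothetical bound on how many peers a bootstrapping node can pull from
    # concurrently; since each stream is single threaded, more sources means
    # more of the joining machine's bandwidth and CPU can actually be used.
    def max_stream_sources(cluster_size, num_tokens, rf=3):
        # with 1 token the new node's ranges live on roughly rf neighbours;
        # with vnodes each of its num_tokens ranges can come from a different
        # replica set, until every other node is already a source
        return min(num_tokens * rf, cluster_size - 1)

    for tokens in (1, 16, 256):
        print(tokens, max_stream_sources(cluster_size=100, num_tokens=tokens))
    # 1 -> 3, 16 -> 48, 256 -> 99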
>
> Streaming entire wholly covered/contained/owned sstables during range
> movements is probably a huge benefit in many use cases, and may make the
> single-threaded streaming implementation less of a concern; it likely
> works reasonably well without major changes to LCS in particular. I'm
> fairly confident there's a JIRA for this; if not, it's been discussed in
> person among various operators for years as an obvious future improvement.
>
> --
> Jeff Jirsa
>
>
> > On Apr 17, 2018, at 8:17 AM, Carl Mueller <carl.mueller@xxxxxxxxxxxxxxx> wrote:
> >
> > Do Vnodes address anything besides alleviating cluster planners from doing
> > token range management on nodes manually? Do we have a centralized list of
> > advantages they provide beyond that?
> >
> > There seem to be lots of downsides: 2i index performance, the availability
> > impact described above, etc.
> >
> > I also wonder if, in vnodes (and manually managed tokens... I'll return to
> > this), the node recovery scenarios are being hampered by sstables having the
> > hash ranges of the vnodes intermingled in the same set of sstables. I
> > wondered in another thread why, with vnodes, sstables aren't separated into
> > sets by the vnode ranges they represent. For a manually managed contiguous
> > token range, you could separate the sstables into a fixed number of sets,
> > kind of vnode-light.
> >
> > So if there were rebalancing or reconstruction, you could sneakernet or
> > reliably send entire sstable sets that belong to a range.
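
(Sketching what that fixed subdivision might look like, purely illustrative,
assuming Murmur3 tokens and a made-up split count of 60:)

    # Each node's contiguous token range is cut into a fixed number of
    # equal-width sub-ranges; sstables flushed per sub-range could then be
    # shipped wholesale when a sub-range moves to another node.
    MURMUR_MIN, MURMUR_MAX = -2**63, 2**63 - 1

    def sub_range_index(token, range_start, range_end, splits=60):
        assert range_start <= token < range_end
        width = (range_end - range_start) / splits
        return min(int((token - range_start) / width), splits - 1)

    # e.g. a node owning the first tenth of the ring:
    start = MURMUR_MIN
    end = MURMUR_MIN + (MURMUR_MAX - MURMUR_MIN) // 10
    print(sub_range_index(start + (end - start) // 2, start, end))  # middle bucket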
> >
> > I also think this would improve compactions and repairs. Compactions
> > would be naturally parallelizable in all compaction schemes, and repairs
> > would have natural subsets for Merkle tree calculations.
> >
> > Granted, sending whole sstables might result in "overstreaming" due to data
> > replication across the sstables, but you wouldn't spend CPU and random I/O
> > looking up the data; just sequential transfers.
> >
> > For manually managed tokens with subdivided sstables, if there were
> > rebalancing, you would have the "fringe" edges of the hash range subdivided
> > already, so you would only need to deal with the data in the border areas
> > of the token range, and again could sneakernet / dumb-transfer the tables
> > and then let the new node remove the unneeded data in future repairs.
> > (Compaction does not remove data that is no longer managed by a node, only
> > repair does? Or does only nodetool cleanup do that?)
> >
> > Pre-subdivided sstables for manually managed tokens would REALLY pay big
> > dividends in large-scale cluster expansion. Say you wanted to double or
> > triple the cluster. Since the sstables are already split by some numeric
> > factor that has lots of even divisors (60 covers RF 2, 3, 4, and 5), you
> > simply bulk copy the already-subdivided sstables for the new nodes' hash
> > ranges and you'd basically be done. With AWS EBS volumes, that could just
> > be a drive detach / drive attach.
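
(Toy arithmetic for why a split count like 60 is handy; illustrative numbers,
not an existing feature:)

    SPLITS = 60  # divisible by 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, ...

    def sets_handed_off_per_old_node(growth_factor):
        # if every node's data is pre-split into SPLITS sstable sets and the
        # cluster grows by growth_factor x, each old node keeps
        # SPLITS // growth_factor sets and ships the rest as whole files
        assert SPLITS % growth_factor == 0, "factor must divide the split count"
        return SPLITS - SPLITS // growth_factor

    print(sets_handed_off_per_old_node(2))  # doubling: 30 sets move per old node
    print(sets_handed_off_per_old_node(3))  # tripling: 40 sets move per old node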
> >
> >
> >
> >
> >> On Tue, Apr 17, 2018 at 7:37 AM, kurt greaves <kurt@xxxxxxxxxxxxxxx> wrote:
> >>
> >> Great write-up. Glad someone finally did the math for us. I don't think
> >> this will come as a surprise to many of the developers. Availability is
> >> only one issue raised by vnodes. Load distribution and performance are
> >> also pretty big concerns.
> >>
> >> I'm always a proponent of fixing vnodes, and of removing them as a default
> >> until we do. Happy to help on this, and we have ideas in mind that at some
> >> point I'll create tickets for...
> >>
> >>> On Tue., 17 Apr. 2018, 06:16 Joseph Lynch, <joe.e.lynch@xxxxxxxxx> wrote:
> >>>
> >>> If the blob link on github doesn't work for the pdf (looks like mobile
> >>> might not like it), try:
> >>>
> >>>
> >>> https://github.com/jolynch/python_performance_toolkit/raw/master/notebooks/cassandra_availability/whitepaper/cassandra-availability-virtual.pdf
> >>>
> >>> -Joey
> >>>
> >>> On Mon, Apr 16, 2018 at 1:14 PM, Joseph Lynch <joe.e.lynch@xxxxxxxxx>
> >>> wrote:
> >>>
> >>>> Josh Snyder and I have been working on evaluating virtual nodes for
> >>>> large scale deployments and while it seems like there is a lot of
> >>>> anecdotal support for reducing the vnode count [1], we couldn't find any
> >>>> concrete math on the topic, so we had some fun and took a whack at
> >>>> quantifying how different choices of num_tokens impact a Cassandra
> >>>> cluster.
> >>>>
> >>>> According to the model we developed [2] it seems that at small cluster
> >>>> sizes there isn't much of a negative impact on availability, but when
> >>>> clusters scale up to hundreds of hosts, vnodes have a major impact on
> >>>> availability. In particular, the probability of outage during short
> >>>> failures (e.g. process restarts or failures) or permanent failure (e.g.
> >>>> disk or machine failure) appears to be orders of magnitude higher for
> >>>> large clusters.
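
(A toy Monte Carlo of that intuition, my own simplification rather than the
model from the paper: with RF=3 and QUORUM, two concurrent node failures take
out some range exactly when the two down nodes share a replica set, and that
probability grows quickly with num_tokens:)

    import itertools
    import random

    def p_pair_shares_replica_set(nodes, num_tokens, rf=3, trials=25):
        # fraction of node pairs that co-own at least one replica set,
        # averaged over `trials` rings with random token placement
        total = 0.0
        for _ in range(trials):
            ring = sorted((random.random(), n)
                          for n in range(nodes) for _ in range(num_tokens))
            covered = set()
            for i in range(len(ring)):
                replicas, j = [], i
                while len(replicas) < rf:  # next rf distinct owners clockwise
                    owner = ring[j % len(ring)][1]
                    if owner not in replicas:
                        replicas.append(owner)
                    j += 1
                covered.update(itertools.combinations(sorted(replicas), 2))
            total += len(covered) / (nodes * (nodes - 1) / 2)
        return total / trials

    for v in (1, 16, 256):
        print(v, round(p_pair_shares_replica_set(nodes=96, num_tokens=v), 3))
    # climbs from a few percent at num_tokens=1 toward ~1.0 at 256, i.e. on a
    # large ring almost any two concurrent failures lose QUORUM somewhere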
> >>>>
> >>>> The model attempts to explain why we may care about this and advances a
> >>>> few existing/new ideas for how to solve the scalability problems that
> >>>> vnodes solve, without the availability (and consistency, due to the
> >>>> effects on repair) problems that high num_tokens creates. We would of
> >>>> course be very interested in any feedback. The model source code is on
> >>>> GitHub [3]; PRs are welcome, or feel free to play around with the Jupyter
> >>>> notebook to match your environment and see what the graphs look like. I
> >>>> didn't attach the PDF here because it's apparently too large (lots of
> >>>> pretty graphs).
> >>>>
> >>>> I know that users can always just pick whichever number they prefer, but
> >>>> I think the current default was chosen when token placement was random,
> >>>> and I wonder whether it's still the right default.
> >>>>
> >>>> Thank you,
> >>>> -Joey Lynch
> >>>>
> >>>> [1] https://issues.apache.org/jira/browse/CASSANDRA-13701
> >>>> [2] https://github.com/jolynch/python_performance_toolkit/raw/master/notebooks/cassandra_availability/whitepaper/cassandra-availability-virtual.pdf
> >>>>
> >>>> [3] https://github.com/jolynch/python_performance_toolkit/tree/master/notebooks/cassandra_availability
> >>>>
> >>>
> >>
>
>
>