Re: Quantifying Virtual Node Impact on Cassandra Availability


As far as I'm aware, if you're using a high number of tokens per host you
can't bootstrap two hosts simultaneously without potentially violating
read-after-write (RaW) consistency if they have overlapping token ranges
(with 256 tokens this is basically guaranteed). I'm definitely not an
expert on this, though; when I've used vnodes I've always scaled up a
single node at a time.
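For intuition on why overlap is near-certain at 256 tokens, here is a rough
Monte Carlo sketch (my own illustration, not anything from the paper):
treat the ring as a set of existing ranges and estimate how often two
bootstrapping nodes drop a token into the same range.

```python
import random

def overlap_probability(existing_ranges, tokens_per_node=256, trials=2000):
    """Estimate the chance that two simultaneously bootstrapping nodes
    each drop at least one token into the same pre-existing token range
    (a rough proxy for an overlapping range movement)."""
    hits = 0
    for _ in range(trials):
        a = {random.randrange(existing_ranges) for _ in range(tokens_per_node)}
        b = {random.randrange(existing_ranges) for _ in range(tokens_per_node)}
        if a & b:  # some range is claimed by both new nodes
            hits += 1
    return hits / trials

# A 12-node cluster at num_tokens=256 has ~3072 ranges; the estimate
# comes out at essentially 1.0, i.e. overlap is basically guaranteed.
```

With tokens_per_node=1 the same estimate collapses to roughly 1/ranges,
which is why simultaneous bootstrap is plausible with single tokens.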

Simultaneous bootstrap with a few (or one) tokens per node is very much
possible, and is the fourth solution we proposed in the paper: bootstrap
wherever, and then let a token rebalancer (as proposed in CASSANDRA-1418
<https://issues.apache.org/jira/browse/CASSANDRA-1418>) gradually shift the
cluster into balance over time, with little impact on the cluster.
Personally I am much more interested in this approach than in vnodes for
balancing clusters because it deals with hot partitions far better (in the
edge case you end up with a single node holding a single partition as that
partition gets hotter and hotter), doesn't impact repair or gossip, can be
throttled to limit impact on the cluster, and gives you incremental
bootstrap for free: you just bootstrap near another token and gradually
move your token over. If others agree this might be a useful direction to
explore, we might be interested in working on this instead of moving to
vnodes.
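To make that fourth solution concrete, here is a toy sketch (mine, not the
paper's or CASSANDRA-1418's actual design) of what a gradual token
rebalancer could look like: each pass nudges every node's token a bounded
distance toward an evenly spaced layout.

```python
def plan_moves(tokens, ring_size=2**64, max_step_fraction=0.05):
    """Hypothetical single-token rebalancer sketch: move each token a
    bounded distance toward an evenly spaced ring, so the cluster
    converges to balance over many small, low-impact passes."""
    n = len(tokens)
    current = sorted(tokens)
    ideal = [(current[0] + i * ring_size // n) % ring_size for i in range(n)]
    max_step = int(ring_size * max_step_fraction)
    moves = []
    for cur, tgt in zip(current, ideal):
        delta = (tgt - cur) % ring_size
        if delta > ring_size // 2:      # shorter to move backwards
            delta -= ring_size
        step = max(-max_step, min(max_step, delta))
        if step:
            moves.append((cur, (cur + step) % ring_size))
    return moves
```

Because each pass moves at most max_step_fraction of the ring per node, the
operator controls the streaming impact, and repeated passes converge on a
balanced layout.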

-Joey

On Tue, Apr 17, 2018 at 10:22 AM, Carl Mueller <carl.mueller@xxxxxxxxxxxxxxx
> wrote:

> Is this a fundamental vnode disadvantage:
>
> do vnodes preclude cluster expansion faster than one node at a time? I
> would think with manual management you could expand a datacenter by
> multiples of machines/nodes, or at least in multiples of the replication
> factor:
>
> RF3 starts as:
>
> a1 b1 c1
>
> doubles to:
>
> a1 a2 b1 b2 c1 c2
>
> expands again by 3:
>
> a1 a2 a3 b1 b2 b3 c1 c2 c3
>
> all via sneakernet or similar schemes? Or am I wrong that manual tokens
> allow bigger expansions and that vnodes can't safely do the same?
>
> Most of the paper seems to center on streaming time as what exposes the
> cluster to risk. But manual tokens lend themselves to sneakernet
> rebuilds, do they not?
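Carl's doubling example can be sketched as simple midpoint insertion (an
editorial illustration of the idea, not an actual tool): each new node
takes the midpoint of an existing node's range, so every original node
hands off exactly half its data.

```python
def doubled_ring(tokens, ring_size=2**64):
    """Midpoint-doubling sketch: a1 b1 c1 -> a1 a2 b1 b2 c1 c2, where
    each new token bisects an existing node's range."""
    ts = sorted(tokens)
    out = []
    for i, t in enumerate(ts):
        nxt = ts[(i + 1) % len(ts)]
        span = (nxt - t) % ring_size or ring_size  # wrap past the ring end
        out.extend([t, (t + span // 2) % ring_size])
    return out

# On a toy ring of size 300 with nodes at 0, 100, 200, the doubled ring
# is [0, 50, 100, 150, 200, 250].
```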
>
>
> On Tue, Apr 17, 2018 at 11:16 AM, Richard Low <richard@xxxxxxxxxxx> wrote:
>
> > I'm also not convinced the problems listed in the paper with removenode
> > are so serious. With lots of vnodes per node, removenode causes data to
> > be streamed into all other nodes in parallel, so it is (n-1) times
> > quicker than replacement for n nodes. For RF=3, the failure rate goes
> > up with vnodes by a factor of (n-1)/4 (without vnodes, after the first
> > failure, any of the 4 neighbouring node failures loses quorum, but with
> > vnodes any other node failure loses quorum). The increase in speed more
> > than offsets this, so in fact vnodes with removenode theoretically give
> > 4x higher availability than no vnodes.
> >
> > If anyone is interested in using vnodes in large clusters I'd strongly
> > suggest testing this out to see if the concerns in section 4.3.3 are
> > valid.
> >
> > Richard.
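Richard's back-of-envelope argument can be written out explicitly (my
arithmetic restatement of his paragraph, with his assumptions baked in as
constants):

```python
def availability_ratio(n):
    """Restates Richard's RF=3 removenode argument: vnodes widen the set
    of quorum-losing second failures by (n-1)/4, but shrink the recovery
    window by (n-1); the ratio nets out to a 4x vnode advantage."""
    vnode_window = 1.0 / (n - 1)   # removenode streams from all n-1 nodes
    single_window = 1.0            # replacement streams over one window
    vnode_risky = n - 1            # any other node failure loses quorum
    single_risky = 4               # only the 4 neighbouring replicas
    return (single_window * single_risky) / (vnode_window * vnode_risky)

# availability_ratio(n) is 4.0 for any n: under these assumptions the
# parallel-streaming speedup exactly offsets the wider failure domain.
```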
> >
> > On 17 April 2018 at 08:29, Jeff Jirsa <jjirsa@xxxxxxxxx> wrote:
> >
> > > There are two huge advantages
> > >
> > > 1) during expansion / replacement / decom, you stream from far more
> > > ranges. Since streaming is single threaded per stream, this enables
> > > you to max out machines during streaming where single token doesn’t
> > >
> > > 2) when adjusting the size of a cluster, you can often grow
> > > incrementally without rebalancing
> > >
> > > Streaming entire wholly covered/contained/owned sstables during range
> > > movements is probably a huge benefit in many use cases that may make
> > > the single threaded streaming implementation less of a concern, and
> > > likely works reasonably well without major changes to LCS in
> > > particular - I’m fairly confident there’s a JIRA for this; if not,
> > > it’s been discussed in person among various operators for years as an
> > > obvious future improvement.
> > >
> > > --
> > > Jeff Jirsa
> > >
> > >
> > > > On Apr 17, 2018, at 8:17 AM, Carl Mueller
> > > > <carl.mueller@xxxxxxxxxxxxxxx> wrote:
> > > >
> > > > Do vnodes address anything besides alleviating cluster planners
> > > > from doing token range management on nodes manually? Do we have a
> > > > centralized list of advantages they provide beyond that?
> > > >
> > > > There seem to be lots of downsides: 2i (secondary index)
> > > > performance, the availability issue above, etc.
> > > >
> > > > I also wonder if in vnodes (and manually managed tokens... I'll
> > > > return to this) the node recovery scenarios are being hampered by
> > > > sstables having the hash ranges of the vnodes intermingled in the
> > > > same set of sstables. I wondered in another thread why sstables
> > > > aren't separated into sets by the vnode ranges they represent. For a
> > > > manually managed contiguous token range, you could separate the
> > > > sstables into a fixed number of sets, kind of vnode-light.
> > > >
> > > > So if there was rebalancing or reconstruction, you could sneakernet
> > > > or reliably send entire sstable sets that would belong in a range.
> > > >
> > > > I also think this would improve compactions and repairs too.
> > > > Compactions would be naturally parallelizable in all compaction
> > > > schemes, and repairs would have natural subsets for merkle tree
> > > > calculations.
> > > >
> > > > Granted, sending sstables might result in "overstreaming" due to
> > > > data replication across the sstables, but you wouldn't incur the
> > > > CPU and random I/O needed to look up the data. Just sequential
> > > > transfers.
> > > >
> > > > For manually managed tokens with subdivided sstables, if there was
> > > > rebalancing, you would have the "fringe" edges of the hash range
> > > > subdivided already, and you would only need to deal with the data
> > > > in the border areas of the token range, and again could sneakernet
> > > > / dumb-transfer the tables and then let the new node remove the
> > > > unneeded data in future repairs. (Compaction does not remove data
> > > > that is no longer managed by a node, only repair does? Or does only
> > > > nodetool cleanup do that?)
> > > >
> > > > Pre-subdivided sstables for manually managed tokens would REALLY
> > > > pay big dividends in large-scale cluster expansion. Say you wanted
> > > > to double or triple the cluster. Since the sstables are already
> > > > split by some numeric factor that has lots of even divisors (60 for
> > > > RF 2, 3, 4, 5), you simply bulk copy the already-subdivided
> > > > sstables for the new nodes' hash ranges and you'd basically be
> > > > done. In AWS, with EBS volumes, that could just be a drive detach /
> > > > drive attach.
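Carl's 60-way pre-subdivision idea can be made concrete with a toy
assignment plan (hypothetical node names, not any real tooling): with
sstables pre-split into 60 subrange sets, doubling from 3 to 6 nodes is
just shipping whole sets.

```python
def expansion_plan(num_subranges=60, old_nodes=3, factor=2):
    """Toy plan for the pre-subdivided sstable idea: each old node keeps
    an even share of its subrange sets and ships the rest, whole, to the
    new nodes created by the expansion."""
    assert num_subranges % (old_nodes * factor) == 0
    per_old = num_subranges // old_nodes
    per_new = num_subranges // (old_nodes * factor)
    plan = {}
    for i in range(old_nodes):
        owned = list(range(i * per_old, (i + 1) * per_old))
        for f in range(1, factor):
            # hypothetical id for the f-th new node split off old node i
            plan[f"new-{i}-{f}"] = owned[f * per_new:(f + 1) * per_new]
    return plan

# Doubling 3 -> 6 nodes: each of the 3 new nodes receives 10 whole
# subrange sets, with no per-partition streaming needed.
```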
> > > >
> > > >
> > > >
> > > >
> > > >> On Tue, Apr 17, 2018 at 7:37 AM, kurt greaves
> > > >> <kurt@xxxxxxxxxxxxxxx> wrote:
> > > >>
> > > >> Great write up. Glad someone finally did the math for us. I don't
> > > >> think this will come as a surprise for many of the developers.
> > > >> Availability is only one issue raised by vnodes. Load distribution
> > > >> and performance are also pretty big concerns.
> > > >>
> > > >> I'm always a proponent for fixing vnodes, and removing them as a
> > > >> default until we do. Happy to help on this, and we have ideas in
> > > >> mind that at some point I'll create tickets for...
> > > >>
> > > >>> On Tue., 17 Apr. 2018, 06:16 Joseph Lynch, <joe.e.lynch@xxxxxxxxx>
> > > >>> wrote:
> > > >>>
> > > >>> If the blob link on github doesn't work for the pdf (looks like
> > > >>> mobile might not like it), try:
> > > >>>
> > > >>> https://github.com/jolynch/python_performance_toolkit/raw/master/notebooks/cassandra_availability/whitepaper/cassandra-availability-virtual.pdf
> > > >>>
> > > >>> -Joey
> > > >>>
> > > >>> On Mon, Apr 16, 2018 at 1:14 PM, Joseph Lynch
> > > >>> <joe.e.lynch@xxxxxxxxx> wrote:
> > > >>>
> > > >>>> Josh Snyder and I have been working on evaluating virtual nodes
> > > >>>> for large scale deployments, and while it seems like there is a
> > > >>>> lot of anecdotal support for reducing the vnode count [1], we
> > > >>>> couldn't find any concrete math on the topic, so we had some fun
> > > >>>> and took a whack at quantifying how different choices of
> > > >>>> num_tokens impact a Cassandra cluster.
> > > >>>>
> > > >>>> According to the model we developed [2], it seems that at small
> > > >>>> cluster sizes there isn't much of a negative impact on
> > > >>>> availability, but when clusters scale up to hundreds of hosts,
> > > >>>> vnodes have a major impact on availability. In particular, the
> > > >>>> probability of outage during short failures (e.g. process
> > > >>>> restarts or failures) or permanent failures (e.g. disk or machine
> > > >>>> failure) appears to be orders of magnitude higher for large
> > > >>>> clusters.
> > > >>>>
> > > >>>> The model attempts to explain why we may care about this, and
> > > >>>> advances a few existing/new ideas for how to fix the scalability
> > > >>>> problems that vnodes fix without the availability (and
> > > >>>> consistency, due to the effects on repair) problems that high
> > > >>>> num_tokens creates. We would of course be very interested in any
> > > >>>> feedback. The model source code is on github [3]; PRs are
> > > >>>> welcome, or feel free to play around with the jupyter notebook
> > > >>>> to match your environment and see what the graphs look like. I
> > > >>>> didn't attach the pdf here because it's apparently too large
> > > >>>> (lots of pretty graphs).
> > > >>>>
> > > >>>> I know that users can always just pick whichever number they
> > > >>>> prefer, but I think the current default was chosen when token
> > > >>>> placement was random, and I wonder whether it's still the right
> > > >>>> default.
> > > >>>>
> > > >>>> Thank you,
> > > >>>> -Joey Lynch
> > > >>>>
> > > >>>> [1] https://issues.apache.org/jira/browse/CASSANDRA-13701
> > > >>>> [2] https://github.com/jolynch/python_performance_toolkit/raw/master/notebooks/cassandra_availability/whitepaper/cassandra-availability-virtual.pdf
> > > >>>> [3] https://github.com/jolynch/python_performance_toolkit/tree/master/notebooks/cassandra_availability
> > > >>>>
> > > >>>
> > > >>
> > >
> > >
> > >
> >
>