Re: Quantifying Virtual Node Impact on Cassandra Availability


I'm pretty worried about large clusters using removenode given my experience
with Elasticsearch. Elasticsearch shard recovery is basically removenode +
bootstrap, and it does work really quickly if not throttled but it
completely destroys latency sensitive clusters (P99's spike to multiple
hundreds of milliseconds) if not limited to a maximum of 2-4 concurrent
shard recoveries
<https://www.elastic.co/guide/en/elasticsearch/reference/current/shards-allocation.html#shards-allocation>.
Since Cassandra doesn't have the ability to limit the number of nodes
impacted at a time during removenode, I'm concerned. I'd be interested to
hear whether any users at scale with >200 node clusters that require
sub-10ms p99 read latency successfully use removenode under load.
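The Elasticsearch throttle linked above caps how many shard recoveries run at once so latency-sensitive traffic survives. A minimal sketch of that idea (the names and the 2-slot default here are illustrative, not an Elasticsearch or Cassandra API):

```python
import threading

# Cap concurrent range/shard recoveries, analogous to Elasticsearch's
# cluster.routing.allocation.node_concurrent_recoveries setting.
MAX_CONCURRENT_RECOVERIES = 2
recovery_slots = threading.BoundedSemaphore(MAX_CONCURRENT_RECOVERIES)

def recover_range(range_id, stream_fn):
    # Blocks until one of the recovery slots frees up, so at most
    # MAX_CONCURRENT_RECOVERIES streams hammer the cluster at a time.
    with recovery_slots:
        stream_fn(range_id)
```

Cassandra's removenode has no equivalent knob, which is the concern: every owning node may stream simultaneously.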

Anecdotally I'm aware of large scale production users using every proposed
solution in the paper except for removenode (num_tokens=16, EBS data
volumes, range moves + simultaneous bootstrap).

-Joey

On Tue, Apr 17, 2018 at 9:16 AM, Richard Low <richard@xxxxxxxxxxx> wrote:

> I'm also not convinced the problems listed in the paper with removenode are
> so serious. With lots of vnodes per node, removenode causes data to be
> streamed into all other nodes in parallel, so is (n-1) times quicker than
> replacement for n nodes. For R=3, the failure rate goes up with vnodes by
> a factor of (n-1)/4 (without vnodes, after the first failure, any of the
> 4 neighbouring node failures loses quorum; with vnodes, any other node
> failure does). The increase in speed more than offsets this, so in fact
> vnodes with removenode theoretically give 4x higher availability than no
> vnodes.
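A back-of-envelope sketch of this trade-off, assuming independent failures and that total exposure scales as (number of quorum-threatening nodes) x (length of the recovery window); the function and variable names are mine, not from the paper:

```python
def quorum_loss_exposure(n, replacement_time, vnodes):
    """Relative quorum-loss exposure after one node failure, RF=3."""
    # Nodes whose failure during recovery would lose quorum:
    #   without vnodes: the 4 ring neighbours sharing replica ranges
    #   with many vnodes: effectively any of the other n - 1 nodes
    vulnerable = (n - 1) if vnodes else 4
    # removenode streams into all n - 1 remaining nodes in parallel, so
    # the recovery window shrinks by a factor of n - 1 vs. replacement.
    window = replacement_time / (n - 1) if vnodes else replacement_time
    return vulnerable * window

for n in (50, 200, 1000):
    ratio = (quorum_loss_exposure(n, 1.0, False)
             / quorum_loss_exposure(n, 1.0, True))
    print(n, ratio)  # 4.0 at every cluster size: Richard's 4x claim
```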
>
> If anyone is interested in using vnodes in large clusters I'd strongly
> suggest testing this out to see if the concerns in section 4.3.3 are valid.
>
> Richard.
>
> On 17 April 2018 at 08:29, Jeff Jirsa <jjirsa@xxxxxxxxx> wrote:
>
> > There are two huge advantages
> >
> > 1) during expansion / replacement / decom, you stream from far more
> > ranges. Since streaming is single threaded per stream, this enables you
> to
> > max out machines during streaming where single token doesn’t
> >
> > 2) when adjusting the size of a cluster, you can often grow incrementally
> > without rebalancing
> >
> > Streaming entire wholly covered/contained/owned sstables during range
> > movements is probably a huge benefit in many use cases that may make the
> > single threaded streaming implementation less of a concern, and likely
> > works reasonably well without major changes to LCS in particular. I'm
> > fairly confident there’s a JIRA for this, if not it’s been discussed in
> > person among various operators for years as an obvious future
> improvement.
> >
> > --
> > Jeff Jirsa
> >
> >
> > > On Apr 17, 2018, at 8:17 AM, Carl Mueller <
> carl.mueller@xxxxxxxxxxxxxxx>
> > wrote:
> > >
> > > Do vnodes address anything besides relieving cluster planners of
> > > manual token range management on nodes? Do we have a centralized
> > > list of advantages they provide beyond that?
> > >
> > > There seem to be lots of downsides: secondary index (2i) performance,
> > > the availability impact above, etc.
> > >
> > > I also wonder if in vnodes (and manually managed tokens... I'll
> > > return to this) the node recovery scenarios are being hampered by
> > > sstables having the hash ranges of the vnodes intermingled in the
> > > same set of sstables. I wondered in another thread why sstables
> > > aren't separated into sets by the vnode ranges they represent. For a
> > > manually managed contiguous token range, you could separate the
> > > sstables into a fixed number of sets, kind of vnode-light.
> > >
> > > So if there was rebalancing or reconstruction, you could sneakernet or
> > > reliably send entire sstable sets that would belong in a range.
> > >
> > > I also think this would improve compactions and repairs.
> Compactions
> > > would be naturally parallelizable in all compaction schemes, and
> repairs
> > > would have natural subsets to do merkle tree calculations.
> > >
> > > Granted, sending sstables might result in "overstreaming" due to
> > > data replication across the sstables, but you wouldn't spend CPU and
> > > random I/O looking up the data. Just sequential transfers.
> > >
> > > For manually managed tokens with subdivided sstables, if there was
> > > rebalancing, you would have the "fringe" edges of the hash range
> > subdivided
> > > already, and you would only need to deal with the data in the border
> > areas
> > > of the token range, and again could sneakernet / dumb transfer the
> tables
> > > and then let the new node remove the unneeded data in future repairs.
> > > (Compaction does not remove data that is no longer managed by a node,
> > only
> > > repair does? Or does only nodetool clean do that?)
> > >
> > > Pre-subdivided sstables for manually managed tokens would REALLY pay
> big
> > > dividends in large-scale cluster expansion. Say you wanted to double or
> > > triple the cluster. Since the sstables are already split by some
> numeric
> > > factor that has lots of even divisors (60 for RF 2,3,4,5), you simply
> > bulk
> > > copy the already-subdivided sstables for the new nodes' hash ranges and
> > > you'd basically be done. In AWS EBS volumes, that could just be a drive
> > > detach / drive attach.
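A tiny sketch of the arithmetic behind this idea; the split factor of 60 comes from the paragraph above, and the helper name is mine, not anything in Cassandra:

```python
# Pre-subdividing each node's range into a fixed number of sstable buckets
# means that growing the cluster by any factor that divides the split count
# is a whole-bucket operation (bulk copy / EBS re-attach), no re-splitting.
SPLITS = 60  # divisible by 2, 3, 4, 5 (and 6), per the suggestion above

def buckets_per_new_node(growth_factor):
    if SPLITS % growth_factor != 0:
        raise ValueError("growth factor must divide the split count evenly")
    return SPLITS // growth_factor

for factor in (2, 3, 4, 5):
    # doubling -> 30 buckets/node, tripling -> 20, 4x -> 15, 5x -> 12
    print(factor, buckets_per_new_node(factor))
```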
> > >
> > >
> > >
> > >
> > >> On Tue, Apr 17, 2018 at 7:37 AM, kurt greaves <kurt@xxxxxxxxxxxxxxx>
> > wrote:
> > >>
> > >> Great write up. Glad someone finally did the math for us. I don't
> think
> > >> this will come as a surprise to many of the developers. Availability
> is
> > >> only one issue raised by vnodes. Load distribution and performance are
> > also
> > >> pretty big concerns.
> > >>
> > >> I'm always a proponent of fixing vnodes, and removing them as a
> default
> > >> until we do. Happy to help on this and we have ideas in mind that at
> > some
> > >> point I'll create tickets for...
> > >>
> > >>> On Tue., 17 Apr. 2018, 06:16 Joseph Lynch, <joe.e.lynch@xxxxxxxxx>
> > wrote:
> > >>>
> > >>> If the blob link on github doesn't work for the pdf (looks like
> mobile
> > >>> might not like it), try:
> > >>>
> > >>>
> > >>> https://github.com/jolynch/python_performance_toolkit/raw/master/notebooks/cassandra_availability/whitepaper/cassandra-availability-virtual.pdf
> > >>>
> > >>> -Joey
> > >>>
> > >>> On Mon, Apr 16, 2018 at 1:14 PM, Joseph Lynch <joe.e.lynch@xxxxxxxxx
> >
> > >>> wrote:
> > >>>
> > >>>> Josh Snyder and I have been working on evaluating virtual nodes for
> > >> large
> > >>>> scale deployments and while it seems like there is a lot of
> anecdotal
> > >>>> support for reducing the vnode count [1], we couldn't find any
> > concrete
> > >>>> math on the topic, so we had some fun and took a whack at
> quantifying
> > >> how
> > >>>> different choices of num_tokens impact a Cassandra cluster.
> > >>>>
> > >>>> According to the model we developed [2] it seems that at small
> cluster
> > >>>> sizes there isn't much of a negative impact on availability, but
> when
> > >>>> clusters scale up to hundreds of hosts, vnodes have a major impact
> on
> > >>>> availability. In particular, the probability of outage during short
> > >>>> failures (e.g. process restarts or failures) or permanent failure
> > (e.g.
> > >>>> disk or machine failure) appears to be orders of magnitude higher
> for
> > >>> large
> > >>>> clusters.
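A simplified combinatorial illustration of why large clusters with many vnodes fare worse (my own toy model, not the paper's full math): consider exactly two concurrent node failures under RF=3 with QUORUM reads.

```python
def p_quorum_loss_two_failures(n, many_vnodes):
    """P(some range loses quorum | exactly 2 of n nodes down), RF=3."""
    if many_vnodes:
        # With a high num_tokens, almost every pair of nodes shares at
        # least one token range, so any second failure loses quorum.
        return 1.0
    # Single-token ring: the second failed node must be one of the 4
    # ring neighbours that share a replica range with the first.
    return 4 / (n - 1)

for n in (12, 96, 300):
    print(n,
          p_quorum_loss_two_failures(n, many_vnodes=False),
          p_quorum_loss_two_failures(n, many_vnodes=True))
```

At n = 300 the single-token probability is roughly 0.013 versus 1.0 with many vnodes, which is the "orders of magnitude" gap described above, and it widens as n grows.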
> > >>>>
> > >>>> The model attempts to explain why we may care about this and
> advances
> > a
> > >>>> few existing/new ideas for how to solve the scalability problems
> > >>>> that vnodes solve, without the availability (and consistency, due
> > >>>> to the effects on repair) problems that high num_tokens creates.
> > >>>> We would of course be very
> interested
> > >> in
> > >>>> any feedback. The model source code is on github [3], PRs are
> welcome
> > >> or
> > >>>> feel free to play around with the jupyter notebook to match your
> > >>>> environment and see what the graphs look like. I didn't attach the
> pdf
> > >>> here
> > >>>> because it's too large apparently (lots of pretty graphs).
> > >>>>
> > >>>> I know that users can always just pick whichever number they prefer,
> > >> but
> > >>> I
> > >>>> think the current default was chosen when token placement was
> random,
> > >>> and I
> > >>>> wonder whether it's still the right default.
> > >>>>
> > >>>> Thank you,
> > >>>> -Joey Lynch
> > >>>>
> > >>>> [1] https://issues.apache.org/jira/browse/CASSANDRA-13701
> > >>>> [2] https://github.com/jolynch/python_performance_toolkit/raw/master/notebooks/cassandra_availability/whitepaper/cassandra-availability-virtual.pdf
> > >>>>
> > >>>> [3] https://github.com/jolynch/python_performance_toolkit/tree/master/notebooks/cassandra_availability
> > >>>>
> > >>>
> > >>
> >
> >
> >
>