Re: Repair scheduling tools

Vnodes are related, and because we made them the default, lots of people
are using them. Repairing a cluster with vnodes is a catastrophe (even a
small one is often problematic), but we have to deal with it if we build in
repair scheduling.

Repair scheduling is very important and we should definitely include it
with C* (a sidecar makes the most sense to me long term, but only if we
look at moving other background ops to the sidecar too), but I'm positive
it's not going to work well with vnodes in their current state. Having said
that, it should still support scheduling repairs on vnode clusters, but the
vnode+repair problem should be fixed separately (and probably with more
attention than we've given it) because it's a major issue.

FWIW I know of 256-vnode clusters with > 100 nodes, yet I'd be surprised if
any of them are currently repairing successfully.

On 6 April 2018 at 03:03, Nate McCall <zznate.m@xxxxxxxxx> wrote:

> I think a takeaway here is that we can't assume a level of operational
> maturity will coincide automatically with scale. To make our core
> features robust, we have to account for less-experienced users.
> A lot of folks on this thread have *really* strong ops and OpsViz
> stories. Let's not forget that most of our users don't.
> ((Un)fortunately, as a consulting firm, we tend to see the worst of
> this).
> On Fri, Apr 6, 2018 at 2:52 PM, Jonathan Haddad <jon@xxxxxxxxxxxxx> wrote:
> > Off the top of my head I can remember clusters with 600 or 700 nodes with
> > 256 tokens.
> >
> > Not the best situation, but it’s real. 256 has been the default, for
> > better or worse.
> >
> > On Thu, Apr 5, 2018 at 7:41 PM Joseph Lynch <joe.e.lynch@xxxxxxxxx> wrote:
> >
> >> >
> >> > We see this in larger clusters regularly. Usually folks have just
> >> > 'grown into it' because it was the default.
> >> >
> >>
> >> I could understand a few dozen nodes with 256 vnodes, but hundreds is
> >> surprising. I have a whitepaper draft lying around showing how vnodes
> >> decrease availability in large clusters by orders of magnitude. I'll
> >> polish it up and send it out to the list when I get a second.
> >>
> >> In the meantime, sorry for derailing a conversation about repair
> >> scheduling to talk about vnodes. Let's chat about that in a different
> >> thread :-)
> >>
> >> -Joey
> >>
