Re: Repair scheduling tools


A "simple" scheduler is never simple. I agree in principle, a la a "cassandra-agent" which could manage any number of tasks, schedules, etc. needed to prune and manage the C* engine. Cassandra has enough TPs it needs to manage already.

On Apr 5, 2018, 3:09 PM -0400, Joseph Lynch <joe.e.lynch@xxxxxxxxx>, wrote:
> I think that getting into the various repair strategies in this discussion
> is perhaps orthogonal to how we schedule repair.
>
> Whether we end up with incremental, full, tickers (read @ALL), continuous
> <https://issues.apache.org/jira/browse/CASSANDRA-13924> repair, mutation
> based <https://issues.apache.org/jira/browse/CASSANDRA-8911> repair, etc
> ... something still needs to schedule them for all tables and give good
> introspection into when they ran, how long they took to run, etc. If we're
> able to get a simple scheduler into Cassandra, I think we can always add
> additional repair types
> <https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit#bookmark=id.ri8u2fn7uwd>
> and configuration options; we could even make them an interface so that
> users can plug in their own repair strategy (a sketch follows below).
>
> For example, if we added a "read-repair" repair type, we could do that
> <https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit#bookmark=id.xn6852786lv8>
> pretty effortlessly.
>
> -Joey
>
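To make the "plug in their own repair strategy" idea above concrete, here is a
minimal sketch of what such an interface could look like. It is purely
illustrative; RepairStrategy and its methods are assumptions, not Cassandra's
actual API.

    import java.time.Duration;

    // Hypothetical plug-in point for repair strategies; all names are
    // illustrative and not part of Cassandra's real codebase.
    public interface RepairStrategy {
        // Strategy identifier, e.g. "full", "incremental", "read-repair".
        String name();

        // Decide whether the given table is due for repair right now.
        boolean isDue(String keyspace, String table, long lastRepairedAtMillis);

        // Record the outcome so the scheduler can answer "when did it run,
        // how long did it take?" for introspection.
        void recordOutcome(String keyspace, String table, boolean success,
                           Duration elapsed);
    }

A scheduler that only talks to this interface could then ship with a couple of
built-in strategies while letting operators register their own.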
> On Thu, Apr 5, 2018 at 11:48 AM, benjamin roth <brstgt@xxxxxxxxx> wrote:
>
> > I'm not saying Reaper is the problem. I don't want to do Reaper wrong, but
> > in the end it is "just" an instrumentation for C*'s built-in repairs that
> > slices and schedules, right?
> > The problem I see is that the built-in repairs are rather inefficient (for
> > many, maybe not all, use cases) for many reasons. To name some of them:
> >
> > - Overstreaming, as only whole partitions are repaired, not single mutations
> > - Race conditions in merkle tree calculation on nodes taking part in a
> > repair session
> > - Every stream creates an SSTable that then needs to be compacted
> > - Floods of SSTable creation can even kill a node due to "too many open
> > files" - yes, we had that
> > - Incremental repairs have issues
> >
> > Today we had a super simple case where I first ran 'nodetool repair' on a
> > super small system keyspace and then ran a 'scrape repair':
> > - nodetool took 4 minutes on a single node
> > - scraping took 1 second, repairing all nodes together
> >
> > In the beginning I was twisting my brain over how this could be optimized
> > in C* - in the end, going with scraping solved every problem we had.
> >
> > 2018-04-05 20:32 GMT+02:00 Jonathan Haddad <jon@xxxxxxxxxxxxx>:
> >
> > > To be fair, reaper in 2016 only worked with 2.0 and was just sitting
> > > around, more or less.
> > >
> > > Since then we've had 401 commits changing tens of thousands of lines of
> > > code, dealing with fault tolerance, repair retries, scalability, etc.
> > > We've had 1 reaper node managing repairs across dozens of clusters and
> > > thousands of nodes. It's a totally different situation today.
> > >
> > >
> > > On Thu, Apr 5, 2018 at 11:17 AM benjamin roth <brstgt@xxxxxxxxx> wrote:
> > >
> > > > That would be totally awesome!
> > > >
> > > > Not sure if it helps here, but for completeness:
> > > > We completely "dumped" regular repairs - no matter if 'nodetool repair'
> > > > or Reaper - and run our own tool that simply does CL_ALL scraping over
> > > > the whole cluster.
> > > > It has run for over a year in production now, and the only problem we
> > > > encountered was that we got timeouts when scraping (too) large /
> > > > tombstoned partitions. It turned out that the large partitions weren't
> > > > even readable with CQL / cqlsh / DevCenter, so that wasn't a problem of
> > > > the repair. It was rather a design problem. Storing data that can't be
> > > > read doesn't make sense anyway.
> > > >
> > > > What I can tell from our experience (a sketch of the approach follows
> > > > below):
> > > > - It works much more reliably than what we had before - also more
> > > > reliably than Reaper (state of 2016)
> > > > - It runs totally smoothly and much faster than regular repairs, as it
> > > > only streams what needs to be streamed
> > > > - It's easily manageable, interruptible, and resumable on a very
> > > > fine-grained level. The only thing you need to do is store state
> > > > (KS/CF/last token) in a simple storage like Redis
> > > > - It works pretty well even when populating an empty node, e.g. when
> > > > changing RFs / bootstrapping DCs
> > > > - You can easily control the cluster load by tuning the concurrency of
> > > > the scrape process
> > > >
> > > > I don't see a reason for us to ever go back to built-in repairs if they
> > > > don't improve immensely. In many cases (especially with MVs) they are
> > > > true resource killers.
> > > >
> > > > Just my 2 cents and experience.
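For readers unfamiliar with the technique: "CL_ALL scraping" means reading
every partition at ConsistencyLevel.ALL, which forces Cassandra's read repair
to fix any inconsistent replica as a side effect of the read. Below is a
minimal, hedged sketch with the DataStax Java driver; the table my_ks.my_table
and its key column id are hypothetical, it assumes Murmur3Partitioner (bigint
tokens), and resume state would be persisted externally (e.g. Redis) as
described above.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.Statement;

    // Sketch of a CL_ALL "scrape repair": reading at ALL triggers read repair.
    public class ScrapeRepair {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder()
                     .addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {
                long lastToken = Long.MIN_VALUE; // load from Redis etc. to resume
                while (true) {
                    Statement stmt = new SimpleStatement(
                        "SELECT token(id), id FROM my_ks.my_table "
                            + "WHERE token(id) > ? LIMIT 1000", lastToken)
                        .setConsistencyLevel(ConsistencyLevel.ALL);
                    ResultSet rs = session.execute(stmt);
                    Row last = null;
                    for (Row row : rs) {
                        last = row; // each read at ALL repairs the row if needed
                    }
                    if (last == null) {
                        break; // walked the whole token range
                    }
                    lastToken = last.getLong(0); // persist this to be resumable
                }
            }
        }
    }

The concurrency knob mentioned above would come from running several such
loops over disjoint token ranges.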
> > > >
> > > > 2018-04-04 17:00 GMT+02:00 Ben Bromhead <ben@xxxxxxxxxxxxxxx>:
> > > >
> > > > > +1 to including the implementation in Cassandra itself. It makes
> > > > > managed repair a first-class citizen, nicely rounds out Cassandra's
> > > > > consistency story, and makes it 1000x more likely that repairs will
> > > > > get run.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Apr 4, 2018 at 10:45 AM Jon Haddad <jon@xxxxxxxxxxxxx> wrote:
> > > > >
> > > > > > Implementation details aside, I'm firmly in the "it would be nice
> > > > > > if C* could take care of it" camp. Reaper is pretty damn easy to
> > > > > > use and people *still* don't put it in prod.
> > > > > >
> > > > > >
> > > > > > > On Apr 4, 2018, at 4:16 AM, Rahul Singh <rahul.xavier.singh@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > I understand the merits of both approaches. In working with
> > > > > > > other DBs in the "old country" of SQL, we often had to write
> > > > > > > indexing sequences manually for important tables. It was "built
> > > > > > > into the product", but in order to leverage the maximum benefit
> > > > > > > of indices we had to have different indices other than the
> > > > > > > clustered (physical) index. The process still sucked. It's never
> > > > > > > perfect.
> > > > > > >
> > > > > > > The JVM is already fraught with GC issues, and putting another
> > > > > > > managed process in the same heap space is what I'm worried about.
> > > > > > > Technically the process could be in the same binary but started
> > > > > > > as a sidecar, or in the same main process.
> > > > > > >
> > > > > > > Consider a process called "cassandra-agent" that sits around
> > > > > > > with a scheduler based on config or a Cassandra table,
> > > > > > > distributed in the same release. Shell / service scripts would
> > > > > > > start it; the end user knows it only by examining the .sh files.
> > > > > > > This opens the possibility of including a GUI hosted in the same
> > > > > > > process without cluttering the core coolness of Cassandra. (A
> > > > > > > sketch of such an agent follows below.)
> > > > > > >
> > > > > > > Best,
> > > > > > >
> > > > > > > --
> > > > > > > Rahul Singh
> > > > > > > rahul.singh@xxxxxxxx
> > > > > > >
> > > > > > > Anant Corporation
> > > > > > >
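A minimal sketch of the "cassandra-agent" idea: a separate JVM process,
started by a service script, that wakes up on a schedule and triggers
maintenance from outside the daemon's heap. The class name, the placeholder
task, and the daily interval are all illustrative assumptions.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Hypothetical standalone agent process living outside the C* daemon,
    // so its GC behavior cannot affect the read / write path.
    public class CassandraAgent {
        public static void main(String[] args) {
            ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
            // A real agent would read its schedule from config or a Cassandra
            // table; here a placeholder task fires once a day. The non-daemon
            // scheduler thread keeps the process alive after main() returns.
            scheduler.scheduleWithFixedDelay(
                () -> System.out.println("would trigger repair via JMX here"),
                0, 24, TimeUnit.HOURS);
        }
    }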
> > > > > > > On Apr 4, 2018, 2:50 AM -0400, Dor Laor <dor@xxxxxxxxxxxx>, wrote:
> > > > > > > > We at Scylla implemented repair in a similar way to the
> > > > > > > > Cassandra Reaper. We do that using an external application,
> > > > > > > > written in Go, that manages repair for multiple clusters and
> > > > > > > > saves the data in an external Scylla cluster. The logic
> > > > > > > > resembles the Reaper one, with some specific internal sharding
> > > > > > > > optimizations, and uses the Scylla REST API.
> > > > > > > >
> > > > > > > > However, I have doubts it's the ideal way. After playing a bit
> > > > > > > > with CockroachDB, I realized it's super nice to have a single
> > > > > > > > binary that repairs itself, provides a GUI and is the core DB.
> > > > > > > >
> > > > > > > > Even while distributed, you can elect a leader node to manage
> > > > > > > > the repair in a consistent way, so the complexity can be
> > > > > > > > reduced to a minimum (a sketch of one way to do this follows
> > > > > > > > below). Repair can write its status to the system tables and
> > > > > > > > provide an API for progress, rate control, etc.
> > > > > > > >
> > > > > > > > The big advantage of repair being embedded in the core is that
> > > > > > > > there is no need to expose internal state to the repair logic.
> > > > > > > > So an external program doesn't need to deal with different
> > > > > > > > versions of Cassandra, different repair capabilities of the
> > > > > > > > core (such as incremental on/off) and so forth. A good database
> > > > > > > > should schedule its own repair; it knows whether the threshold
> > > > > > > > of hinted handoff was crossed, it knows whether nodes were
> > > > > > > > replaced, etc.
> > > > > > > >
> > > > > > > > My 2 cents. Dor
> > > > > > > >
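As a hedged illustration of the "elect a leader node" point above: Cassandra's
lightweight transactions (LWT) can implement a TTL-based lease, so whichever
node wins the conditional INSERT coordinates repair until the lease expires.
The keyspace, table, and node name below are hypothetical.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    // Sketch: TTL-based repair-leader lease via a lightweight transaction.
    // Assumes a table like:
    //   CREATE TABLE my_ks.repair_leader (shard int PRIMARY KEY, holder text)
    public class RepairLeaderElection {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder()
                     .addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {
                ResultSet rs = session.execute(new SimpleStatement(
                    "INSERT INTO my_ks.repair_leader (shard, holder) "
                        + "VALUES (0, ?) IF NOT EXISTS USING TTL 60", "node-a"));
                // LWT responses carry an [applied] column indicating success.
                boolean amLeader = rs.one().getBool("[applied]");
                System.out.println(amLeader ? "coordinating repair"
                                            : "standing by");
            }
        }
    }

The leader would renew its lease with a fresh TTL while it works, and any peer
can take over once the row expires.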
> > > > > > > > On Tue, Apr 3, 2018 at 11:13 PM, Dinesh Joshi <dinesh.joshi@xxxxxxxxx.invalid> wrote:
> > > > > > > >
> > > > > > > > > Simon,
> > > > > > > > >
> > > > > > > > > You could still do load-aware repair outside of the main
> > > > > > > > > process by reading Cassandra's metrics (a sketch follows
> > > > > > > > > below).
> > > > > > > > >
> > > > > > > > > In general, I don't think the maintenance tasks necessarily
> > > > > > > > > need to live in the main process. They could negatively
> > > > > > > > > impact the read / write path. Unless strictly required by the
> > > > > > > > > serving path, they could live in a sidecar process. There are
> > > > > > > > > multiple benefits, including isolation, faster iteration, and
> > > > > > > > > loose coupling. For example, this would mean that the
> > > > > > > > > maintenance tasks can have a different GC profile than the
> > > > > > > > > main process and it would be OK. Today that is not the case.
> > > > > > > > >
> > > > > > > > > The only issue I see is that the project does not provide an
> > > > > > > > > official sidecar. Perhaps there should be one. We probably
> > > > > > > > > wouldn't have had to have this discussion ;)
> > > > > > > > >
> > > > > > > > > Dinesh
> > > > > > > > >
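A small sketch of the "reading Cassandra's metrics" idea: a sidecar can poll
the node over JMX (default port 7199, assuming JMX is enabled) and, for
example, hold off on repair while compactions are piling up. The MBean name is
Cassandra's standard compaction pending-tasks metric; the threshold and policy
are illustrative assumptions.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Sketch: load-aware decision in a sidecar, using Cassandra's JMX metrics.
    public class LoadAwareCheck {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
            try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
                int pending = (Integer) mbs.getAttribute(new ObjectName(
                    "org.apache.cassandra.metrics:type=Compaction,name=PendingTasks"),
                    "Value");
                // Illustrative policy: only schedule repair on an idle-ish node.
                System.out.println(pending < 5 ? "ok to repair"
                                               : "node busy, skip this round");
            }
        }
    }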
> > > > > > > > > On Tuesday, April 3, 2018, 10:12:56 PM PDT, Qingcun Zhou <zhouqingcun@xxxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > > Repair has been a problem for us at Uber. In general I'm in
> > > > > > > > > favor of including the scheduling logic in the Cassandra
> > > > > > > > > daemon. It has the benefit of enabling something like
> > > > > > > > > load-aware repair, e.g. only schedule repair when there is no
> > > > > > > > > ongoing compaction or traffic is low, etc. As proposed by
> > > > > > > > > others, we can expose keyspace/table-level configurations so
> > > > > > > > > that users can opt in. Regarding the risk, yes, there will be
> > > > > > > > > problems at the beginning, but in the long run users will
> > > > > > > > > appreciate that repair works out of the box, just like
> > > > > > > > > compaction. We have large Cassandra deployments and can work
> > > > > > > > > with the Netflix folks on intensive testing to boost user
> > > > > > > > > confidence.
> > > > > > > > >
> > > > > > > > > On the other hand, have we looked into how other NoSQL
> > > > > > > > > databases do repair? Is there a sidecar process?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Apr 3, 2018 at 9:21 PM, sankalp kohli <kohlisankalp@xxxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > > > Repair is critical for running C*, and I agree with Roopa
> > > > > > > > > > that it needs to be part of the offering. I think we should
> > > > > > > > > > make it easy for new users to run C*.
> > > > > > > > > >
> > > > > > > > > > Can we have a sidecar process which we can add to the
> > > > > > > > > > Apache Cassandra offering and put this repair there? I am
> > > > > > > > > > also fine putting it in C* if the sidecar is the more
> > > > > > > > > > long-term option.
> > > > > > > > > >
> > > > > > > > > > On Tue, Apr 3, 2018 at 6:20 PM, Roopa Tangirala <rtangirala@xxxxxxxxxxx.invalid> wrote:
> > > > > > > > > >
> > > > > > > > > > > Having seen so many companies grapple with running
> > > > > > > > > > > repairs successfully in production, and seeing the
> > > > > > > > > > > success of distributed scheduled repair here at Netflix,
> > > > > > > > > > > I strongly believe that adding this to Cassandra would be
> > > > > > > > > > > a great addition to the database. I am hoping we as a
> > > > > > > > > > > community will make it easy for teams to operate and run
> > > > > > > > > > > Cassandra by enhancing the core product, and by making
> > > > > > > > > > > maintenance tasks like repairs and compactions part of
> > > > > > > > > > > the database without external tooling. We can have an
> > > > > > > > > > > experimental flag for the feature, and only teams who are
> > > > > > > > > > > confident with the service can enable it, while others
> > > > > > > > > > > fall back to default repairs.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > *Regards,*
> > > > > > > > > > >
> > > > > > > > > > > *Roopa Tangirala*
> > > > > > > > > > >
> > > > > > > > > > > Engineering Manager CDE
> > > > > > > > > > >
> > > > > > > > > > > *(408) 438-3156 - mobile*
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Apr 3, 2018 at 4:19 PM, Kenneth Brotman <kenbrotman@xxxxxxxxx.invalid> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Why not make it configurable?
> > > > > > > > > > > >
> > > > > > > > > > > > auto_manage_repair_consistency: true (default: false)
> > > > > > > > > > > >
> > > > > > > > > > > > Then users can use the built-in auto repair function
> > > > > > > > > > > > that would be created, or continue to handle it as they
> > > > > > > > > > > > do now. Default behavior would be "false" so nothing
> > > > > > > > > > > > changes on its own. Just wondering why not have that
> > > > > > > > > > > > option? It might accelerate progress, as others have
> > > > > > > > > > > > already suggested.
> > > > > > > > > > > >
> > > > > > > > > > > > Kenneth Brotman
> > > > > > > > > > > >
> > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > From: Nate McCall [mailto:zznate.m@xxxxxxxxx]
> > > > > > > > > > > > Sent: Tuesday, April 03, 2018 1:37 PM
> > > > > > > > > > > > To: dev
> > > > > > > > > > > > Subject: Re: Repair scheduling tools
> > > > > > > > > > > >
> > > > > > > > > > > > This document does a really good job of listing out
> > > > > > > > > > > > some of the issues of coordinating repair scheduling.
> > > > > > > > > > > > Regardless of which camp you fall into, it is certainly
> > > > > > > > > > > > worth a read.
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Apr 4, 2018 at 8:10 AM, Joseph Lynch <joe.e.lynch@xxxxxxxxx> wrote:
> > > > > > > > > > > > > I just want to say I think it would be great for our
> > > > > > > > > > > > > users if we moved repair scheduling into Cassandra
> > > > > > > > > > > > > itself. The team here at Netflix has opened the ticket
> > > > > > > > > > > > > <https://issues.apache.org/jira/browse/CASSANDRA-14346>
> > > > > > > > > > > > > and has written a detailed design document
> > > > > > > > > > > > > <https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit#heading=h.iasguic42ger>
> > > > > > > > > > > > > that includes problem discussion and prior art, if
> > > > > > > > > > > > > anyone wants to contribute to that. We tried to fairly
> > > > > > > > > > > > > discuss existing solutions, what their drawbacks are,
> > > > > > > > > > > > > and a proposed solution.
> > > > > > > > > > > > >
> > > > > > > > > > > > > If we were to put this as part of the main Cassandra
> > > > > > > > > > > > > daemon, I think it should probably be marked
> > > > > > > > > > > > > experimental and of course be something that users opt
> > > > > > > > > > > > > into (table by table or cluster by cluster) with the
> > > > > > > > > > > > > understanding that it might not fully work out of the
> > > > > > > > > > > > > box the first time we ship it. We have to be willing
> > > > > > > > > > > > > to take risks, but we also have to be honest with our
> > > > > > > > > > > > > users. It may help build confidence if a few major
> > > > > > > > > > > > > deployments use it (such as Netflix), and we are of
> > > > > > > > > > > > > course happy to provide that QA as best we can.
> > > > > > > > > > > > >
> > > > > > > > > > > > > -Joey
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Apr 3, 2018 at 10:48 AM, Blake Eggleston <beggleston@xxxxxxxxx> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi dev@,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The question of the best way to schedule repairs
> > > > > > > > > > > > > > came up on CASSANDRA-14346, and I thought it would
> > > > > > > > > > > > > > be good to bring up the idea of an external tool on
> > > > > > > > > > > > > > the dev list.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Cassandra lacks any sort of tooling for automating
> > > > > > > > > > > > > > the routine tasks that are required for running
> > > > > > > > > > > > > > clusters, specifically repair. Regular repair is a
> > > > > > > > > > > > > > must for most clusters, like compaction. This means
> > > > > > > > > > > > > > that, especially as far as eventual consistency is
> > > > > > > > > > > > > > concerned, Cassandra isn't totally functional out of
> > > > > > > > > > > > > > the box. Operators either need to find a 3rd-party
> > > > > > > > > > > > > > solution or implement one themselves. Adding this to
> > > > > > > > > > > > > > Cassandra would make it easier to use.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Is this something we should be doing? If so, what
> > > > > > > > > > > > > > should it look like?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Personally, I feel like this is a pretty big gap in
> > > > > > > > > > > > > > the project and would like to see an out-of-process
> > > > > > > > > > > > > > tool offered. Ideally, Cassandra would just take
> > > > > > > > > > > > > > care of itself, but writing a distributed repair
> > > > > > > > > > > > > > scheduler that you trust to run in production is a
> > > > > > > > > > > > > > lot harder than writing a single-process management
> > > > > > > > > > > > > > application that can fail over.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Any thoughts on this?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Blake
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Thank you & Best Regards,
> > > > > > > > > --Simon (Qingcun) Zhou
> > > > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > --
> > > > > Ben Bromhead
> > > > > CTO | Instaclustr <https://www.instaclustr.com/>
> > > > > +1 650 284 9692
> > > > > Reliability at Scale
> > > > > Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer
> > > > >
> > > >
> > >
> >