Re: Repair scheduling tools


This thread is mainly focused on how repairs are scheduled, not implementation details of how the repairs themselves work.

On 4/16/18, 11:07 AM, "Carl Mueller" <carl.mueller@xxxxxxxxxxxxxxx> wrote:

    So reading (
    https://www.datastax.com/dev/blog/anticompaction-in-cassandra-2-1)...
    anticompaction problems from repair seem related to the fact that the
    sstables for a repair range can have data that isn't in the repaired data
    range, so we then have an sstable with the repaired data (I'm ... guessing
    ... this "repaired" sstable only has the repair-range-relevant data?), and
    the unrepaired sstable with data outside the repair range needs to stick
    around too.
    
    But if our sstables from the start were organized by subdivided ranges
    (either the vnode ranges or some fraction of manually managed tokens), then
    the hash range is constrained for both the repair and compaction... and if
    the sstables are implicitly bucketed by a hash range, compactions are very
    easy to parallelize?
    
    I guess for 256 vnodes and RF 3 that would be 768 sets of sstables per
    table...
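A minimal sketch of the arithmetic above, assuming one set of sstables per token range a node replicates and that no two replicated ranges coincide. The function name is hypothetical and is not part of Cassandra.

```python
def sstable_sets_per_table(num_tokens, replication_factor):
    """Number of per-range sstable sets one node would hold for one table.

    Assumes the node owns `num_tokens` primary ranges and also stores
    replicas of (replication_factor - 1) * num_tokens ranges owned by
    other nodes, with one sstable set per range.
    """
    return num_tokens * replication_factor


if __name__ == "__main__":
    for num_tokens in (1, 16, 256):
        print(num_tokens, "vnodes, RF 3:", sstable_sets_per_table(num_tokens, 3))
```

Fewer vnodes per node shrink this count linearly, which is one knob for keeping the number of sstable sets manageable.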
    
    On Mon, Apr 16, 2018 at 12:21 PM, Carl Mueller
    <carl.mueller@xxxxxxxxxxxxxxx> wrote:
    
    > Is the fundamental nature of sstable fragmentation the big wrench here?
    > I've been trying to imagine aids like an offline repair resolver or a
    > gradual node replacement/regenerator process that could serve as a
    > backstop/insurance for compaction and repair problems. After all, some of
    > the "we don't even bother repairing" places just do gradual automatic node
    > replacement, or what the one with the ALL scrubber was doing.
    >
    > Is there a reason cassandra does not subdivide sstables by hash range,
    > especially for vnodes? Reduction of seeks (not an issue in the ssd era
    > really)?  Since repair that avoids overstreaming is performed on subranges
    > and generates new sstables for further compaction, if the sstables (in
    > vnodes or not) were split by dedicated hash ranges then maybe the scale of
    > data being dealt with on a node and a repair and compaction would be
    > reduced in scope/complexity.
    >
    > It's before lunch for me, so I'm probably missing a major major caveat
    > here...
    >
    > But I'm trying to think why we wouldn't bucket sstables by hash range.
    > Seems to me it would be simple to do in the commitlog --> sstable step, an
    > addition to the sstable metadata that isn't too big, and then the
    > compaction and repair processes could disentangle and validate ranges
    > more quickly, with less excess I/O.
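The bucketing idea above could be sketched roughly as follows. This is an illustration only, not how Cassandra flushes memtables: tokens are plain integers, `range_ends` is a hypothetical sorted list of the end tokens of the ranges a node holds, and each range is treated as (previous_end, end] with the first range wrapping around the ring.

```python
from bisect import bisect_left
from collections import defaultdict


def bucket_index(token, range_ends):
    """Index of the range (previous_end, end] that contains `token`.

    Tokens larger than every end fall into the wrapping range at index 0.
    """
    i = bisect_left(range_ends, token)
    return i % len(range_ends)


def split_flush_by_range(partitions, range_ends):
    """Group (token, payload) pairs by range, each group sorted by token.

    Each group stands in for one sstable that would be written at flush.
    """
    buckets = defaultdict(list)
    for token, payload in sorted(partitions, key=lambda p: p[0]):
        buckets[bucket_index(token, range_ends)].append((token, payload))
    return dict(buckets)


if __name__ == "__main__":
    ends = [100, 200, 300]
    rows = [(250, "c"), (50, "a"), (100, "b"), (350, "d")]
    for bucket, contents in sorted(split_flush_by_range(rows, ends).items()):
        print(bucket, contents)
```

With data grouped this way, a repair or compaction restricted to one range would only need to read the group for that range.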
    >
    > On Thu, Apr 12, 2018 at 9:18 PM, Rahul Singh
    > <rahul.xavier.singh@xxxxxxxxx> wrote:
    >
    >> Schedule scheme looks good. I believe in-process and sidecar can
    >> coexist. As an admin, I would love to be able to run one or the other,
    >> or neither.
    >>
    >> Thank you for taking a lead and producing a plan that can actually be
    >> executed.
    >>
    >> --
    >> Rahul Singh
    >> rahul.singh@xxxxxxxx
    >>
    >> Anant Corporation
    >>
    >> On Apr 12, 2018, 6:35 PM -0400, Joseph Lynch <joe.e.lynch@xxxxxxxxx>,
    >> wrote:
    >> > Given the feedback here and on the ticket, I've written up a proposal
    >> > for a repair sidecar tool
    >> > <https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit#heading=h.5f10ng8gzle8>
    >> > in the ticket's design document. If there are no major concerns we're
    >> > going to start working on porting the Priam implementation into this
    >> > new tool soon.
    >> >
    >> > -Joey
    >> >
    >> > On Tue, Apr 10, 2018 at 4:17 PM, Elliott Sims <elliott@xxxxxxxxxxxxx>
    >> > wrote:
    >> >
    >> > > My two cents as a (relatively small) user. I'm coming at this from the
    >> > > ops/user side, so my apologies if some of these don't make sense based
    >> > > on a more detailed understanding of the codebase:
    >> > >
    >> > > Repair is definitely a major missing piece of Cassandra. Integrated
    >> > > would be easier, but a sidecar might be more flexible. As an
    >> > > intermediate step that works towards both options, does it make sense
    >> > > to start with finer-grained tracking and reporting for subrange
    >> > > repairs? That is, expose a set of interfaces (both internally and via
    >> > > JMX) that give a scheduler enough information to run subrange repairs
    >> > > across multiple keyspaces or even non-overlapping ranges at the same
    >> > > time. That lets people experiment with and quickly/safely/easily
    >> > > iterate on different scheduling strategies in the short term, and
    >> > > long-term those strategies can be integrated into a built-in
    >> > > scheduler.
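As a rough illustration of running non-overlapping subrange repairs at the same time, a scheduler might pick work like this. It is a sketch under simplifying assumptions: ranges are non-wrapping (start, end] integer token pairs, only token overlap is checked (not which nodes replicate each range), and all names are hypothetical rather than an existing Cassandra or JMX interface.

```python
def overlaps(a, b):
    """True if two (start, end] token ranges share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def pick_concurrent_ranges(pending, running, max_parallel):
    """Greedily choose pending subranges that can start right now.

    A candidate is chosen only if it overlaps neither a running repair
    nor an already chosen candidate, and the total number of active
    repairs stays within `max_parallel`.
    """
    active = list(running)
    chosen = []
    for candidate in pending:
        if len(active) >= max_parallel:
            break
        if not any(overlaps(candidate, r) for r in active):
            active.append(candidate)
            chosen.append(candidate)
    return chosen


if __name__ == "__main__":
    pending = [(0, 10), (5, 15), (10, 20), (30, 40)]
    print(pick_concurrent_ranges(pending, running=[(20, 30)], max_parallel=3))
```

The `max_parallel` parameter doubles as a simple aggression knob: raising it starts more subranges at once, lowering it throttles repair.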
    >> > >
    >> > > On the subject of scheduling, I think adjusting parallelism/aggression
    >> > > with a possible whitelist or blacklist would be a lot more useful than
    >> > > a "time between repairs". That is, if repairs run for a few hours then
    >> > > don't run for a few (somewhat hard-to-predict) hours, I still have to
    >> > > size the cluster for the load when the repairs are running. The only
    >> > > reason I can think of for an interval between repairs is to allow
    >> > > re-compaction from repair anticompactions, and subrange repairs seem
    >> > > to eliminate this. Even if they didn't, a more direct method along the
    >> > > lines of "don't repair when the compaction queue is too long" might
    >> > > make more sense. Blacklisted timeslots might be useful for avoiding
    >> > > peak time or batch jobs, but only if they can be specified for
    >> > > consistent time-of-day intervals instead of unpredictable lulls
    >> > > between repairs.
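The two gating rules suggested above ("don't repair when the compaction queue is too long" and fixed time-of-day blackout windows) might look something like this in a scheduler. The function and parameter names are hypothetical; the pending-compaction count is assumed to come from wherever the scheduler reads it (for example the pending tasks figure reported by `nodetool compactionstats`).

```python
from datetime import time


def in_window(now, start, end):
    """True if time-of-day `now` falls in [start, end).

    Windows whose start is later than their end are treated as crossing
    midnight, for example 22:00 to 02:00.
    """
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def may_start_repair(now, pending_compactions, max_pending, blackout_windows):
    """Decide whether a new subrange repair may start at `now`."""
    if pending_compactions > max_pending:
        return False  # let compaction catch up first
    return not any(in_window(now, s, e) for s, e in blackout_windows)


if __name__ == "__main__":
    windows = [(time(9, 0), time(17, 0)), (time(22, 0), time(2, 0))]
    print(may_start_repair(time(3, 0), 5, 20, windows))
    print(may_start_repair(time(10, 0), 5, 20, windows))
```

Because the windows are expressed as times of day, they repeat at the same hours every day regardless of when the previous repair finished.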
    >> > >
    >> > > I really like the idea of automatically adjusting gc_grace_seconds
    >> > > based on repair state. The only_purge_repaired_tombstones option fixes
    >> > > this elegantly for sequential/incremental repairs on STCS, but not for
    >> > > subrange repairs or LCS (unless a scheduler gains the ability somehow
    >> > > to determine that every subrange in an sstable has been repaired and
    >> > > mark it accordingly?)
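The check hinted at in the parenthetical, whether every subrange touched by an sstable has been repaired, comes down to interval coverage. A minimal sketch, assuming integer tokens, non-wrapping (start, end] repaired ranges, and an sstable described only by its first and last token; it ignores whether each repair finished after the sstable was written, which a real scheduler would also have to track. All names are hypothetical.

```python
def is_fully_repaired(first_token, last_token, repaired_ranges):
    """True if every token in [first_token, last_token] lies in some
    repaired (start, end] range."""
    covered_up_to = first_token - 1  # tokens <= this are known covered
    for start, end in sorted(repaired_ranges):
        if start > covered_up_to:
            # Token covered_up_to + 1 is not in this range, and later
            # ranges start even further right, so there is a gap.
            return False
        covered_up_to = max(covered_up_to, end)
        if covered_up_to >= last_token:
            return True
    return covered_up_to >= last_token


if __name__ == "__main__":
    print(is_fully_repaired(10, 20, [(5, 15), (15, 25)]))
    print(is_fully_repaired(10, 20, [(5, 15), (16, 25)]))
```

An sstable that passes this check could then be a candidate for being marked repaired, subject to the timing caveat above.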
    >> > >
    >> > >
    >> > > On 2018/04/03 17:48:14, Blake Eggleston <b...@xxxxxxxxx> wrote:
    >> > > > Hi dev@,
    >> > > >
    >> > > > The question of the best way to schedule repairs came up on
    >> > > > CASSANDRA-14346, and I thought it would be good to bring up the idea
    >> > > > of an external tool on the dev list.
    >> > > >
    >> > > > Cassandra lacks any sort of tools for automating routine tasks that
    >> > > > are required for running clusters, specifically repair. Regular
    >> > > > repair is a must for most clusters, like compaction. This means
    >> > > > that, especially as far as eventual consistency is concerned,
    >> > > > Cassandra isn’t totally functional out of the box. Operators either
    >> > > > need to find a 3rd party solution or implement one themselves.
    >> > > > Adding this to Cassandra would make it easier to use.
    >> > > >
    >> > > > Is this something we should be doing? If so, what should it look
    >> > > > like?
    >> > > >
    >> > > > Personally, I feel like this is a pretty big gap in the project and
    >> > > > would like to see an out of process tool offered. Ideally, Cassandra
    >> > > > would just take care of itself, but writing a distributed repair
    >> > > > scheduler that you trust to run in production is a lot harder than
    >> > > > writing a single process management application that can failover.
    >> > > >
    >> > > > Any thoughts on this?
    >> > > >
    >> > > > Thanks,
    >> > > >
    >> > > > Blake
    >> > > >
    >> > > >
    >> > >
    >>
    >
    >
    


