Re: Repair scheduling tools

My two cents as a (relatively small) user.  I'm coming at this from the
ops/user side, so my apologies if some of these don't make sense based on a
more detailed understanding of the codebase:

Repair is definitely a major missing piece of Cassandra.  Integrated would
be easier, but a sidecar might be more flexible.  As an intermediate step
that works towards both options, does it make sense to start with
finer-grained tracking and reporting for subrange repairs?  That is, expose
a set of interfaces (both internally and via JMX) that give a scheduler
enough information to run subrange repairs across multiple keyspaces or
even non-overlapping ranges at the same time.  That lets people experiment
with and quickly/safely/easily iterate on different scheduling strategies
in the short term, and long-term those strategies can be integrated into a
built-in scheduler.
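
As a rough sketch of the kind of state such an interface might expose, here
is a minimal tracker in Python.  Everything in it is hypothetical (the class,
the method names, the tuple layout); none of it is an existing Cassandra or
JMX API.  It assumes non-wrapping (start, end] token ranges and, to be
conservative, treats overlapping ranges as a conflict even across keyspaces.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

TokenRange = Tuple[int, int]      # (start, end], start < end; no wrap-around
Subrange = Tuple[str, TokenRange]  # (keyspace, token range)


def overlaps(a: TokenRange, b: TokenRange) -> bool:
    """True if two (start, end] ranges share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


@dataclass
class SubrangeTracker:
    """Hypothetical repair state that an external scheduler could read."""
    last_repaired: Dict[Subrange, float] = field(default_factory=dict)
    running: List[Subrange] = field(default_factory=list)

    def start(self, keyspace: str, rng: TokenRange) -> None:
        self.running.append((keyspace, rng))

    def finish(self, keyspace: str, rng: TokenRange, now: float) -> None:
        self.running.remove((keyspace, rng))
        self.last_repaired[(keyspace, rng)] = now

    def next_candidates(self, subranges: List[Subrange],
                        max_parallel: int) -> List[Subrange]:
        """Pick the least recently repaired subranges that overlap neither
        a running repair nor each other, up to the free parallelism slots."""
        busy = [rng for _, rng in self.running]
        slots = max_parallel - len(self.running)
        # Never-repaired subranges sort first.
        ordered = sorted(
            subranges,
            key=lambda s: self.last_repaired.get(s, float("-inf")),
        )
        picked: List[Subrange] = []
        for keyspace, rng in ordered:
            if len(picked) >= slots:
                break
            if any(overlaps(rng, b) for b in busy):
                continue
            picked.append((keyspace, rng))
            busy.append(rng)
        return picked
```

A scheduler built on something like this would only need to change
next_candidates() to try a different strategy, which is the kind of quick
iteration described above.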

On the subject of scheduling, I think adjusting parallelism/aggression with
a possible whitelist or blacklist would be a lot more useful than a "time
between repairs".  That is, if repairs run for a few hours then don't run
for a few (somewhat hard-to-predict) hours, I still have to size the
cluster for the load when the repairs are running.  The only reason I can
think of for an interval between repairs is to allow re-compaction from
repair anticompactions, and subrange repairs seem to eliminate this.  Even
if they didn't, a more direct method along the lines of "don't repair when
the compaction queue is too long" might make more sense.  Blacklisted
timeslots might be useful for avoiding peak time or batch jobs, but only if
they can be specified for consistent time-of-day intervals instead of
unpredictable lulls between repairs.
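
The two gating rules above (back off when compactions are piling up, and skip
fixed time-of-day blackout windows) can be sketched as a single check.  This
is only an illustration: the function names, the threshold parameter and the
window format are assumptions, and how the pending-compaction count is
obtained from a node is left out.

```python
from datetime import datetime, time
from typing import List, Tuple

Window = Tuple[time, time]  # [start, end) in local time-of-day


def in_window(now: time, window: Window) -> bool:
    """True if `now` falls inside the window; handles windows that
    wrap past midnight (e.g. 23:00 to 01:00)."""
    start, end = window
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def repair_allowed(now: datetime,
                   pending_compactions: int,
                   max_pending_compactions: int,
                   blackout_windows: List[Window]) -> bool:
    """Hypothetical gate a scheduler could call before starting a repair."""
    # "Don't repair when the compaction queue is too long."
    if pending_compactions > max_pending_compactions:
        return False
    # Blacklisted time-of-day slots, e.g. peak hours or batch jobs.
    return not any(in_window(now.time(), w) for w in blackout_windows)


if __name__ == "__main__":
    windows = [(time(9, 0), time(17, 0)), (time(23, 0), time(1, 0))]
    print(repair_allowed(datetime(2024, 1, 1, 3, 0), 5, 20, windows))
    print(repair_allowed(datetime(2024, 1, 1, 12, 0), 5, 20, windows))
```

Because the windows are plain time-of-day intervals, the quiet periods are
predictable, unlike a fixed "time between repairs" that drifts with however
long each repair happens to take.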

I really like the idea of automatically adjusting gc_grace_seconds based on
repair state.  The only_purge_repaired_tombstones option fixes this
elegantly for sequential/incremental repairs on STCS, but not for subrange
repairs or LCS (unless a scheduler gains the ability somehow to determine
that every subrange in an sstable has been repaired and mark it

