Re: Proposing an Apache Cassandra Management process


Is this something we're moving ahead with despite the feature freeze?

On Sat, 22 Sep 2018 at 08:32, dinesh.joshi@xxxxxxxxx.INVALID
<dinesh.joshi@xxxxxxxxx.invalid> wrote:

> I have created a sub-task - CASSANDRA-14783. Could we get some feedback
> before we begin implementing anything?
>
> Dinesh
>
>     On Thursday, September 20, 2018, 11:22:33 PM PDT, Dinesh Joshi <
> dinesh.joshi@xxxxxxxxx.INVALID> wrote:
>
>  I have updated the doc with a short paragraph providing the
> clarification. Sankalp's suggestion is already part of the doc. If there
> aren't further objections could we move this discussion over to the jira
> (CASSANDRA-14395)?
>
> Dinesh
>
> > On Sep 18, 2018, at 10:31 AM, sankalp kohli <kohlisankalp@xxxxxxxxx> wrote:
> >
> > How about we start with a few basic features in the side car? How about
> > starting with this:
> > 1. Bulk nodetool commands: a user can curl any sidecar and run a nodetool
> > command in bulk across the cluster.
> >
> > <sidecar>:<port>/bulk/nodetool/tablestats?arg0=keyspace_name.table_name&arg1=<if required>
> >
> > And later
> > 2: Health checks.
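
To make the URL shape in item 1 concrete, here is a minimal sketch of how a client might build such a request. Only the `/bulk/nodetool/<command>?arg0=...&arg1=...` layout comes from the message above; the `http` scheme, the helper name `bulk_nodetool_url`, and the example host and port are assumptions for illustration, not part of any agreed design.

```python
from urllib.parse import urlencode


def bulk_nodetool_url(sidecar, port, command, *args):
    """Build a bulk nodetool URL in the shape proposed in the thread.

    Hypothetical helper: positional nodetool arguments become
    arg0, arg1, ... query parameters. The http scheme is assumed.
    """
    query = urlencode({f"arg{i}": arg for i, arg in enumerate(args)})
    url = f"http://{sidecar}:{port}/bulk/nodetool/{command}"
    return f"{url}?{query}" if query else url


# Example host and port are placeholders.
print(bulk_nodetool_url("10.0.0.1", 8080, "tablestats", "keyspace_name.table_name"))
```

A user could then pass the resulting URL to curl against any sidecar in the cluster.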
> >
> > On Thu, Sep 13, 2018 at 11:34 AM dinesh.joshi@xxxxxxxxx.INVALID <dinesh.joshi@xxxxxxxxx.invalid> wrote:
> > I will update the document to add that point. The document was not meant
> > to serve as a design or architectural document but rather as something that
> > would spark a discussion on the idea.
> > Dinesh
> >
> >    On Thursday, September 13, 2018, 10:59:34 AM PDT, Jonathan Haddad <jon@xxxxxxxxxxxxx> wrote:
> >
> > Most of the discussion and work was done off the mailing list - there's a
> > big risk involved when folks disappear for months at a time and resurface
> > with a big pile of code plus an agenda that you failed to loop everyone in
> > on. In addition, by your own words the design document didn't accurately
> > describe what was being built.  I don't write this to try to argue about
> > it, I just want to offer some perspective for those of us that weren't part
> > of this discussion on a weekly basis over the last several months.  Going
> > forward let's keep things on the ML so we can avoid confusion and
> > frustration for all parties.
> >
> > With that said - I think Blake made a really good point here and it's
> > helped me understand the scope of what's being built better.  Looking at it
> > from a different perspective it doesn't seem like there's as much overlap
> > as I had initially thought.  There's the machinery that runs certain tasks
> > (what Joey has been working on) and the user facing side of exposing that
> > information in a management tool.
> >
> > I do appreciate (and like) the idea of not trying to boil the ocean, and
> > working on things incrementally.  Putting a thin layer on top of Cassandra
> > that can perform cluster wide tasks does give us an opportunity to move in
> > the direction of a general purpose user-facing admin tool without
> > committing to trying to write the full stack all at once (or even make
> > decisions on it now).  We do need a sensible way of doing rolling restarts
> > / scrubs / scheduling and Reaper wasn't built for that, and even though we
> > can add it I'm not sure if it's the best mechanism for the long term.
> >
> > So if your goal is to add maturity to the project by making cluster wide
> > tasks easier by providing a framework to build on top of, I'm in favor of
> > that and I don't see it as antithetical to what I had in mind with Reaper.
> > Rather, the two are more complementary than I had originally realized.
> >
> > Jon
> >
> >
> >
> >
> > On Thu, Sep 13, 2018 at 10:39 AM dinesh.joshi@xxxxxxxxx.INVALID
> > <dinesh.joshi@xxxxxxxxx.invalid> wrote:
> >
> > > I have a few clarifications -
> > > The scope of the management process is not to simply run repair
> > > scheduling. Repair scheduling is one of the many features we could
> > > implement or adopt from existing sources. So could we please split the
> > > Management Process discussion and the repair scheduling?
> > > After re-reading the management process proposal, I see we failed to
> > > communicate a basic idea in the document. We wanted to take a pluggable
> > > approach to the various activities that the management process could
> > > perform. This could accommodate different implementations of common
> > > activities such as repair. The management process would provide the basic
> > > framework and it would have default implementations for some of the basic
> > > activities. This would allow for speedier iteration cycles and keep things
> > > extensible.
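
One way to picture the pluggable approach described above is a small registry in which each activity has a default implementation that can be swapped out. This is only an illustrative sketch: the class `ManagementProcess`, its `register` and `run` methods, and the two repair functions are hypothetical names, not taken from the proposal document or any patch.

```python
from typing import Callable, Dict


class ManagementProcess:
    """Hypothetical framework: one pluggable implementation per activity."""

    def __init__(self) -> None:
        self._activities: Dict[str, Callable[[str], str]] = {}

    def register(self, activity: str, impl: Callable[[str], str]) -> None:
        # A later registration replaces the earlier (for example default) one.
        self._activities[activity] = impl

    def run(self, activity: str, target: str) -> str:
        if activity not in self._activities:
            raise KeyError(f"no implementation registered for {activity!r}")
        return self._activities[activity](target)


def default_repair(keyspace: str) -> str:
    # Stand-in for a default implementation shipped with the framework.
    return f"default repair of {keyspace}"


def custom_repair(keyspace: str) -> str:
    # Stand-in for an alternative implementation supplied by an operator.
    return f"custom repair of {keyspace}"


process = ManagementProcess()
process.register("repair", default_repair)
print(process.run("repair", "ks1"))

process.register("repair", custom_repair)
print(process.run("repair", "ks1"))
```

The point of the sketch is that the framework only knows activity names; which repair implementation runs is a registration detail.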
> > > Turning to some questions that Jon and others have raised, when I +1, my
> > > intention is to fully contribute and stay with this community. That said,
> > > things feel rushed for some but for me it feels like analysis paralysis.
> > > We're looking for actionable feedback and to discuss the management
> > > process _not_ repair scheduling solutions.
> > > Thanks,
> > > Dinesh
> > >
> > >
> > >
> > > On Sep 12, 2018, at 6:24 PM, sankalp kohli <kohlisankalp@xxxxxxxxx> wrote:
> > > Here is a list of open discussion points from the voting thread. I think
> > > some are already answered but I will still gather these questions here.
> > >
> > > From several people:
> > > 1. Vote is rushed and we need more time for discussion.
> > >
> > > From Sylvain
> > > 2. About the voting process...I think that was addressed by Jeff Jirsa and
> > > deserves a separate thread as it is not directly related to this thread.
> > > 3. Does the project need a side car?
> > >
> > > From Jonathan Haddad
> > > 4. Are people doing +1 willing to contribute?
> > >
> > > From Jonathan Ellis
> > > 5. List of feature set, maturity, maintainer availability from Reaper or
> > > any other project being donated.
> > >
> > > From Mick Semb Wever
> > > 6. We should not vote on these things and instead build consensus.
> > >
> > > Open Questions from this thread
> > > 7. What technical debts are we talking about in Reaper? Can someone give
> > > concrete examples?
> > > 8. What is the timeline of donating Reaper to Apache Cassandra?
> > >
> > > On Wed, Sep 12, 2018 at 3:49 PM sankalp kohli <kohlisankalp@xxxxxxxxx> wrote:
> > >
> > > (Using this thread and not the vote thread intentionally)
> > > For folks talking about the vote being rushed: I would use the email from
> > > Joseph to show this is not rushed. There was no email on this thread for 4
> > > months until I pinged.
> > >
> > >
> > > Dec 2016: Vinay worked with Jon and Alex to try to collaborate on Reaper to
> > > come up with design goals for a repair scheduler that could work at Netflix
> > > scale.
> > >
> > > ~Feb 2017: Netflix believes that the fundamental design gaps prevented us
> > > from using Reaper as it relies heavily on remote JMX connections and
> > > central coordination.
> > >
> > > Sep. 2017: Vinay gives a lightning talk at NGCC about a highly available
> > > and distributed repair scheduling sidecar/tool. He is encouraged by
> > > multiple committers to build repair scheduling into the daemon itself and
> > > not as a sidecar so the database is truly eventually consistent.
> > >
> > > ~Jun. 2017 - Feb. 2018: Based on internal need and the positive feedback at
> > > NGCC, Vinay and myself prototype the distributed repair scheduler within
> > > Priam and roll it out at Netflix scale.
> > >
> > > Mar. 2018: I open a Jira (CASSANDRA-14346) along with a detailed 20 page
> > > design document for adding repair scheduling to the daemon itself and open
> > > the design up for feedback from the community. We get feedback from Alex,
> > > Blake, Nate, Stefan, and Mick. As far as I know there were zero proposals
> > > to contribute Reaper at this point. We hear the consensus that the
> > > community would prefer repair scheduling in a separate distributed sidecar
> > > rather than in the daemon itself and we re-work the design to match this
> > > consensus, re-aligning with our original proposal at NGCC.
> > >
> > > Apr 2018: Blake brings the discussion of repair scheduling to the dev list
> > > (https://lists.apache.org/thread.html/760fbef677f27aa5c2ab4c375c7efeb81304fea428deff986ba1c2eb@%3Cdev.cassandra.apache.org%3E).
> > > Many community members give positive feedback that we should solve it as
> > > part of Cassandra and there is still no mention of contributing Reaper at
> > > this point. The last message is my attempted summary giving context on how
> > > we want to take the best of all the sidecars (OpsCenter, Priam, Reaper) and
> > > ship them with Cassandra.
> > >
> > > Apr. 2018: Dinesh opens CASSANDRA-14395 along with a public design document
> > > for gathering feedback on a general management sidecar. Sankalp and Dinesh
> > > encourage Vinay and myself to kickstart that sidecar using the repair
> > > scheduler patch
> > >
> > > Apr 2018: Dinesh reaches out to the dev list
> > > (https://lists.apache.org/thread.html/a098341efd8f344494bcd2761dba5125e971b59b1dd54f282ffda253@%3Cdev.cassandra.apache.org%3E)
> > > about the general management process to gain further feedback. All feedback
> > > remains positive as it is a potential place for multiple community members
> > > to contribute their various sidecar functionality.
> > >
> > > May-Jul 2018: Vinay and I work on creating a basic sidecar for running the
> > > repair scheduler based on the feedback from the community in
> > > CASSANDRA-14346 and CASSANDRA-14395
> > >
> > > Jun 2018: I bump CASSANDRA-14346 indicating we're still working on this,
> > > nobody objects
> > >
> > > Jul 2018: Sankalp asks on the dev list if anyone has feature Jiras that
> > > need review before 4.0, I mention again that we've nearly got the
> > > basic sidecar and repair scheduling work done and will need help with
> > > review. No one responds.
> > >
> > > Aug 2018: We submit a patch that brings a basic distributed sidecar and
> > > robust distributed repair to Cassandra itself. Dinesh mentions that he will
> > > try to review. Now folks appear concerned about it being in tree and
> > > instead maybe it should go in a different repo altogether. I don't think
> > > we have consensus on the repo choice yet.
> > >
> > > On Sun, Sep 9, 2018 at 9:13 AM sankalp kohli <kohlisankalp@xxxxxxxxx> wrote:
> > >
> > > I agree with Jon and I think folks who are talking about tech debts in
> > > Reaper should elaborate with examples about these tech debts. Can we be
> > > more precise and list them down? I see it spread out over this long email
> > > thread!!
> > >
> > > On Sun, Sep 9, 2018 at 6:29 AM Elliott Sims <elliott@xxxxxxxxxxxxx> wrote:
> > >
> > > A big one to add to your list there, IMO as a user:
> > > * API for determining detailed repair state (and history?). Essentially,
> > > something beyond just "Is some sort of repair running?" so that tools like
> > > Reaper can parallelize better.
> > >
> > > On Sun, Sep 9, 2018 at 8:30 AM, Stefan Podkowinski <spod@xxxxxxxxxx> wrote:
> > >
> > > Does it have to be a single project with functionality provided by
> > > multiple plugins? Designing a plugin API at this point seems to be a bit
> > > early and comes with additional complexity around managing plugins in
> > > general.
> > >
> > > I was thinking more in the direction of: "what can we do to enable
> > > people to create any kind of side car or tooling solution?". Things like:
> > >
> > > Common cluster discovery and management API
> > > * Detect local Cassandra processes
> > > * Discover and receive events on cluster topology
> > > * Get assigned tokens for nodes
> > > * Read node configuration
> > > * Health checks (as already proposed)
> > >
> > > Any side cars should be easy to install on nodes that already run Cassandra
> > > * Scripts for packaging (tar, deb, rpm)
> > > * Templates for systemd support, optionally with auto-startup dependency
> > > on the Cassandra main process
> > >
> > > Integration testing
> > > * Provide basic testing framework for mocking cluster state and messages
> > >
> > > Support for other languages / avoid having to use JMX
> > > * JMX bridge (HTTP? gRPC?, already implemented in #14346?)
> > >
> > > Obviously the whole side car discussion is not moving in a direction
> > > everyone's happy with. Would it be an option to take a step back and
> > > start implementing such a tooling framework with scripts and libraries
> > > for the features described above, as a small GitHub project, instead of
> > > putting an existing side-car solution up for vote? If that would work
> > > and we get people collaborating on code shared between existing
> > > side-cars, then we could take the next step and think about either
> > > revisiting the "official Cassandra side-car" topic, or adding the created
> > > client tooling framework as an official sub-project to the Cassandra
> > > project (maybe via Apache incubator).
> > >
> > >
> > > On 08.09.18 02:49, Joseph Lynch wrote:
> > >
> > > On Fri, Sep 7, 2018 at 5:03 PM Jonathan Haddad <jon@xxxxxxxxxxxxx> wrote:
> > >
> > > We haven’t even defined any requirements for an admin tool. It’s hard to
> > > make a case for anything without agreement on what we’re trying to build.
> > >
> > > We were/are trying to sketch out scope/requirements in the #14395 and
> > > #14346 tickets as well as their associated design documents. I think
> > > the general proposed direction is a distributed 1:1 management sidecar
> > > process similar in architecture to Netflix's Priam except explicitly
> > > built to be general and pluggable by anyone rather than tightly
> > > coupled to AWS.
> > >
> > > Dinesh, Vinay and I were aiming for a small scope at first, taking
> > > things in an iterative approach with just enough upfront design
> > > but not so much we are unable to make any progress at all. For example
> > > maybe something like:
> > >
> > > 1. Get a super simple and non controversial sidecar process that ships
> > > with Cassandra and exposes a lightweight HTTP interface to e.g. some
> > > basic JMX endpoints
> > > 2a. Add a pluggable execution engine for cron/oneshot/scheduled jobs
> > > with the basic interfaces and state store and such
> > > 2b. Start scoping and implementing the full HTTP interface, e.g.
> > > backup status, cluster health status, etc ...
> > > 3a. Start integrating implementations of the jobs from 2a such as
> > > snapshot, backup, cluster restart, daemon + sstable upgrade, repair,
> > > etc
> > > 3b. Start integrating UI components that pair with the HTTP interface
> > > from 2b
> > > 4. ?? Perhaps start unlocking next generation operations like moving
> > > "background" activities like compaction, streaming, repair etc into
> > > one or more sidecar contained processes to ensure the main daemon only
> > > handles read+write requests
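
As a rough illustration of step 2a, the sketch below shows what a very small execution engine for one-shot and recurring jobs could look like. Everything here is an assumption made for illustration: the `Job` and `Engine` names, the in-memory dict standing in for the state store, and the explicit `tick(now)` call that replaces a real scheduler loop. It is not the design from #14346 or #14395.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Job:
    name: str
    action: Callable[[], str]
    interval_s: Optional[float] = None  # None means a one-shot job
    next_run: float = 0.0               # earliest time (seconds) the job may run


@dataclass
class Engine:
    jobs: List[Job] = field(default_factory=list)
    state: Dict[str, str] = field(default_factory=dict)  # stand-in state store

    def submit(self, job: Job) -> None:
        self.jobs.append(job)

    def tick(self, now: float) -> None:
        """Run every job that is due at time `now` and record its result."""
        remaining = []
        for job in self.jobs:
            if job.next_run <= now:
                self.state[job.name] = job.action()
                if job.interval_s is None:
                    continue  # one-shot jobs are dropped after running
                job.next_run = now + job.interval_s
            remaining.append(job)
        self.jobs = remaining


engine = Engine()
engine.submit(Job("snapshot", lambda: "snapshot done"))                # one-shot
engine.submit(Job("repair", lambda: "repair done", interval_s=60.0))   # recurring
engine.tick(now=0.0)
print(engine.state)
```

Job implementations such as snapshot or repair (step 3a) would plug in as the `action` callables, which keeps the engine itself independent of what the jobs do.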
> > >
> > > There are going to be a lot of questions to answer, and I think trying
> > > to answer them all up front will mean that we get nowhere or make
> > > unfortunate compromises that cripple the project from the start. If
> > > people think we need to do more design and discussion than we have
> > > been doing then we can spend more time on the design, but personally
> > > I'd rather start iterating on code and prove value incrementally. If
> > > it doesn't work out we won't release it GA to the community ...
> > >
> > > -Joey
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@xxxxxxxxxxxxxxxxxxxx
> > > For additional commands, e-mail: dev-help@xxxxxxxxxxxxxxxxxxxx
> >
> > --
> > Jon Haddad
> > http://www.rustyrazorblade.com
> > twitter: rustyrazorblade
>