Re: Proposing an Apache Cassandra Management process


I think this feature is important to the community and I don’t want to
stifle that, but if committers/contributors are working on the management
process instead of testing 4.0, it takes away from the testing effort
regardless of where the code lives. Waiting to merge until after 4.0, at a
minimum, would benefit that effort.

Jordan

On Sat, Sep 22, 2018 at 10:06 AM Sankalp Kohli <kohlisankalp@xxxxxxxxx>
wrote:

> This is not part of the core database and lives in a separate repo, so my
> impression is that it can continue to make progress. Also, we can always
> make progress and not merge it till the freeze is lifted.
>
> Open to ideas/suggestions if someone thinks otherwise.
>
> > On Sep 22, 2018, at 03:13, kurt greaves <kurt@xxxxxxxxxxxxxxx> wrote:
> >
> > Is this something we're moving ahead with despite the feature freeze?
> >
> > On Sat, 22 Sep 2018 at 08:32, dinesh.joshi@xxxxxxxxx.INVALID
> > <dinesh.joshi@xxxxxxxxx.invalid> wrote:
> >
> >> I have created a sub-task - CASSANDRA-14783. Could we get some feedback
> >> before we begin implementing anything?
> >>
> >> Dinesh
> >>
> >>    On Thursday, September 20, 2018, 11:22:33 PM PDT, Dinesh Joshi <
> >> dinesh.joshi@xxxxxxxxx.INVALID> wrote:
> >>
> >> I have updated the doc with a short paragraph providing the
> >> clarification. Sankalp's suggestion is already part of the doc. If there
> >> aren't further objections, could we move this discussion over to the
> >> Jira (CASSANDRA-14395)?
> >>
> >> Dinesh
> >>
> >>> On Sep 18, 2018, at 10:31 AM, sankalp kohli <kohlisankalp@xxxxxxxxx>
> >> wrote:
> >>>
> >>> How about we start with a few basic features in the sidecar? For
> >>> example:
> >>> 1. Bulk nodetool commands: a user can curl any sidecar and be able to
> >>> run a nodetool command in bulk across the cluster (a rough fan-out
> >>> sketch follows below), e.g.
> >>>
> >>> <sidecar>:<port>/bulk/nodetool/tablestats?arg0=keyspace_name.table_name&arg1=<if required>
> >>>
> >>> And later:
> >>> 2. Health checks.
> >>>
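> >>> (A minimal illustrative sketch of the fan-out such a bulk endpoint
> >>> implies - the class and method names here are invented, and it assumes
> >>> the sidecar already knows its peer list; it is not part of the actual
> >>> proposal.)
> >>>
> >>>   import java.net.URI;
> >>>   import java.net.http.HttpClient;
> >>>   import java.net.http.HttpRequest;
> >>>   import java.net.http.HttpResponse;
> >>>   import java.util.List;
> >>>   import java.util.stream.Collectors;
> >>>
> >>>   class BulkNodetool {
> >>>       private final HttpClient client = HttpClient.newHttpClient();
> >>>
> >>>       /** Forward a nodetool-style request to every peer sidecar and
> >>>        *  concatenate the per-node responses. */
> >>>       String runAcrossCluster(List<String> peers, String command) {
> >>>           // e.g. command = "nodetool/tablestats?arg0=ks.table"
> >>>           return peers.stream().map(peer -> {
> >>>               HttpRequest req = HttpRequest.newBuilder(
> >>>                       URI.create("http://" + peer + "/" + command)).build();
> >>>               try {
> >>>                   return peer + ": " + client.send(
> >>>                           req, HttpResponse.BodyHandlers.ofString()).body();
> >>>               } catch (Exception e) {
> >>>                   return peer + ": ERROR " + e.getMessage();
> >>>               }
> >>>           }).collect(Collectors.joining("\n"));
> >>>       }
> >>>   }
> >>>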
> >>> On Thu, Sep 13, 2018 at 11:34 AM dinesh.joshi@xxxxxxxxx.INVALID <
> >>> dinesh.joshi@xxxxxxxxx.invalid> wrote:
> >>> I will update the document to add that point. The document was not
> >>> meant to serve as a design or architectural document, but rather as
> >>> something to spark a discussion of the idea.
> >>> Dinesh
> >>>
> >>> On Thursday, September 13, 2018, 10:59:34 AM PDT, Jonathan Haddad <
> >>> jon@xxxxxxxxxxxxx> wrote:
> >>>
> >>> Most of the discussion and work was done off the mailing list - there's
> >>> a big risk involved when folks disappear for months at a time and
> >>> resurface with a big pile of code plus an agenda that you failed to
> >>> loop everyone in on. In addition, by your own words the design document
> >>> didn't accurately describe what was being built.  I don't write this to
> >>> try to argue about it, I just want to offer some perspective for those
> >>> of us that weren't part of this discussion on a weekly basis over the
> >>> last several months.  Going forward let's keep things on the ML so we
> >>> can avoid confusion and frustration for all parties.
> >>>
> >>> With that said - I think Blake made a really good point here and it's
> >>> helped me understand the scope of what's being built better.  Looking
> >>> at it from a different perspective, it doesn't seem like there's as
> >>> much overlap as I had initially thought.  There's the machinery that
> >>> runs certain tasks (what Joey has been working on) and the user-facing
> >>> side of exposing that information in a management tool.
> >>>
> >>> I do appreciate (and like) the idea of not trying to boil the ocean,
> >>> and working on things incrementally.  Putting a thin layer on top of
> >>> Cassandra that can perform cluster-wide tasks does give us an
> >>> opportunity to move in the direction of a general-purpose user-facing
> >>> admin tool without committing to writing the full stack all at once
> >>> (or even making decisions on it now).  We do need a sensible way of
> >>> doing rolling restarts / scrubs / scheduling, and Reaper wasn't built
> >>> for that; even though we could add it, I'm not sure it's the best
> >>> mechanism for the long term.
> >>>
> >>> So if your goal is to add maturity to the project by making
> >>> cluster-wide tasks easier through a framework to build on top of, I'm
> >>> in favor of that, and I don't see it as antithetical to what I had in
> >>> mind with Reaper.  Rather, the two are more complementary than I had
> >>> originally realized.
> >>>
> >>> Jon
> >>>
> >>> On Thu, Sep 13, 2018 at 10:39 AM dinesh.joshi@xxxxxxxxx.INVALID
> >>> <dinesh.joshi@xxxxxxxxx.invalid> wrote:
> >>>
> >>>> I have a few clarifications -
> >>>>
> >>>> The scope of the management process is not simply to run repair
> >>>> scheduling. Repair scheduling is one of the many features we could
> >>>> implement or adopt from existing sources. So could we please split
> >>>> the management process discussion from repair scheduling?
> >>>>
> >>>> After re-reading the management process proposal, I see we failed to
> >>>> communicate a basic idea in the document. We wanted to take a
> >>>> pluggable approach to the various activities that the management
> >>>> process could perform. This could accommodate different
> >>>> implementations of common activities such as repair. The management
> >>>> process would provide the basic framework, and it would ship default
> >>>> implementations for some of the basic activities. This would allow
> >>>> for speedier iteration cycles and keep things extensible.
> >>>>
> >>>> Turning to some questions that Jon and others have raised: when I +1,
> >>>> my intention is to fully contribute and stay with this community.
> >>>> That said, while things feel rushed to some, to me it feels like
> >>>> analysis paralysis. We're looking for actionable feedback and to
> >>>> discuss the management process, _not_ repair scheduling solutions.
> >>>> Thanks,
> >>>> Dinesh
> >>>>
> >>>> On Sep 12, 2018, at 6:24 PM, sankalp kohli <kohlisankalp@xxxxxxxxx>
> >>>> wrote:
> >>>> Here is a list of open discussion points from the voting thread. I
> >>>> think some are already answered, but I will still gather these
> >>>> questions here.
> >>>>
> >>>> From several people:
> >>>> 1. The vote is rushed and we need more time for discussion.
> >>>>
> >>>> From Sylvain:
> >>>> 2. About the voting process... I think that was addressed by Jeff
> >>>> Jirsa and deserves a separate thread, as it is not directly related
> >>>> to this one.
> >>>> 3. Does the project need a sidecar?
> >>>>
> >>>> From Jonathan Haddad:
> >>>> 4. Are the people +1'ing willing to contribute?
> >>>>
> >>>> From Jonathan Ellis:
> >>>> 5. A list of the feature set, maturity, and maintainer availability
> >>>> of Reaper or any other project being donated.
> >>>>
> >>>> From Mick Semb Wever:
> >>>> 6. We should not vote on these things and instead build consensus.
> >>>>
> >>>> Open questions from this thread:
> >>>> 7. What technical debt are we talking about in Reaper? Can someone
> >>>> give concrete examples?
> >>>> 8. What is the timeline for donating Reaper to Apache Cassandra?
> >>>>
> >>>> On Wed, Sep 12, 2018 at 3:49 PM sankalp kohli <kohlisankalp@xxxxxxxxx>
> >>>> wrote:
> >>>>
> >>>>
> >>>> (Using this thread and not the vote thread intentionally)
> >>>> For folks saying the vote was rushed: I would point to the email from
> >>>> Joseph to show it was not. There was no email on this thread for 4
> >>>> months until I pinged it.
> >>>>
> >>>>
> >>>> Dec 2016: Vinay worked with Jon and Alex to try to collaborate on
> >>>> Reaper to come up with design goals for a repair scheduler that could
> >>>> work at Netflix scale.
> >>>>
> >>>> ~Feb 2017: Netflix concludes that fundamental design gaps prevent us
> >>>> from using Reaper, as it relies heavily on remote JMX connections and
> >>>> central coordination.
> >>>>
> >>>> Sep 2017: Vinay gives a lightning talk at NGCC about a highly
> >>>> available and distributed repair scheduling sidecar/tool. He is
> >>>> encouraged by multiple committers to build repair scheduling into the
> >>>> daemon itself and not as a sidecar, so the database is truly
> >>>> eventually consistent.
> >>>>
> >>>> ~Jun 2017 - Feb 2018: Based on internal need and the positive
> >>>> feedback at NGCC, Vinay and I prototype the distributed repair
> >>>> scheduler within Priam and roll it out at Netflix scale.
> >>>>
> >>>> Mar 2018: I open a Jira (CASSANDRA-14346) along with a detailed
> >>>> 20-page design document for adding repair scheduling to the daemon
> >>>> itself, and open the design up for feedback from the community. We
> >>>> get feedback from Alex, Blake, Nate, Stefan, and Mick. As far as I
> >>>> know there were zero proposals to contribute Reaper at this point.
> >>>> We hear the consensus that the community would prefer repair
> >>>> scheduling in a separate distributed sidecar rather than in the
> >>>> daemon itself, and we re-work the design to match this consensus,
> >>>> re-aligning with our original proposal at NGCC.
> >>>>
> >>>> Apr 2018: Blake brings the discussion of repair scheduling to the dev
> >>>> list (
> >>>> https://lists.apache.org/thread.html/760fbef677f27aa5c2ab4c375c7efeb81304fea428deff986ba1c2eb@%3Cdev.cassandra.apache.org%3E
> >>>> ).
> >>>> Many community members give positive feedback that we should solve it
> >>>> as part of Cassandra, and there is still no mention of contributing
> >>>> Reaper at this point. The last message is my attempted summary giving
> >>>> context on how we want to take the best of all the sidecars
> >>>> (OpsCenter, Priam, Reaper) and ship them with Cassandra.
> >>>>
> >>>> Apr 2018: Dinesh opens CASSANDRA-14395 along with a public design
> >>>> document for gathering feedback on a general management sidecar.
> >>>> Sankalp and Dinesh encourage Vinay and me to kickstart that sidecar
> >>>> using the repair scheduler patch.
> >>>>
> >>>> Apr 2018: Dinesh reaches out to the dev list (
> >>>> https://lists.apache.org/thread.html/a098341efd8f344494bcd2761dba5125e971b59b1dd54f282ffda253@%3Cdev.cassandra.apache.org%3E
> >>>> ) about the general management process to gain further feedback. All
> >>>> feedback remains positive, as it is a potential place for multiple
> >>>> community members to contribute their various sidecar functionality.
> >>>>
> >>>> May-Jul 2018: Vinay and I work on creating a basic sidecar for
> >>>> running the repair scheduler, based on the feedback from the
> >>>> community in CASSANDRA-14346 and CASSANDRA-14395.
> >>>>
> >>>> Jun 2018: I bump CASSANDRA-14346 indicating we're still working on
> >>>> this; nobody objects.
> >>>>
> >>>> Jul 2018: Sankalp asks on the dev list if anyone has feature Jiras
> >>>> they need reviewed before 4.0. I mention again that we've nearly got
> >>>> the basic sidecar and repair scheduling work done and will need help
> >>>> with review. No one responds.
> >>>>
> >>>> Aug 2018: We submit a patch that brings a basic distributed sidecar
> >>>> and robust distributed repair to Cassandra itself. Dinesh mentions
> >>>> that he will try to review. Now folks appear concerned about it being
> >>>> in-tree, and suggest it should maybe go in a different repo
> >>>> altogether. I don't think we have consensus on the repo choice yet.
> >>>>
> >>>> On Sun, Sep 9, 2018 at 9:13 AM sankalp kohli <kohlisankalp@xxxxxxxxx>
> >>>> wrote:
> >>>>
> >>>>
> >>>> I agree with Jon, and I think folks who are talking about tech debt
> >>>> in Reaper should elaborate with examples. Can we be more precise and
> >>>> list them out? I see it spread out over this long email thread!
> >>>>
> >>>> On Sun, Sep 9, 2018 at 6:29 AM Elliott Sims <elliott@xxxxxxxxxxxxx>
> >>>> wrote:
> >>>>
> >>>>
> >>>> A big one to add to your list there, IMO as a user:
> >>>> * API for determining detailed repair state (and history?).
> >>>> Essentially, something beyond just "Is some sort of repair running?"
> >>>> so that tools like Reaper can parallelize better.
> >>>>
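> >>>> (Purely as an illustration of "beyond is-a-repair-running" - the field
> >>>> names below are hypothetical, not from any proposal - such an API
> >>>> might return per-range entries like:
> >>>>
> >>>>   {"keyspace": "ks1", "table": "tbl1",
> >>>>    "range": "(-9223372036854775808,-3074457345618258603]",
> >>>>    "state": "REPAIRING", "startedAtMillis": 1536510000000}
> >>>>
> >>>> so a scheduler could see exactly which ranges are busy and run
> >>>> non-overlapping ranges concurrently.)
> >>>>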
> >>>> On Sun, Sep 9, 2018 at 8:30 AM, Stefan Podkowinski <spod@xxxxxxxxxx>
> >>>> wrote:
> >>>>
> >>>>
> >>>> Does it have to be a single project with functionality provided by
> >>>> multiple plugins? Designing a plugin API at this point seems to be a
> >>>> bit early and comes with additional complexity around managing
> >>>> plugins in general.
> >>>>
> >>>> I was thinking more in the direction of: "what can we do to enable
> >>>> people to create any kind of side car or tooling solution?" Things
> >>>> like:
> >>>>
> >>>>
> >>>> Common cluster discovery and management API (a rough interface
> >>>> sketch follows this list)
> >>>> * Detect local Cassandra processes
> >>>> * Discover and receive events on cluster topology
> >>>> * Get assigned tokens for nodes
> >>>> * Read node configuration
> >>>> * Health checks (as already proposed)
> >>>>
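> >>>> (A minimal sketch of what a shared surface for those bullets could
> >>>> look like in Java; every name here is invented for illustration, not
> >>>> an agreed design.)
> >>>>
> >>>>   import java.net.InetAddress;
> >>>>   import java.util.List;
> >>>>   import java.util.Map;
> >>>>   import java.util.function.Consumer;
> >>>>
> >>>>   /** Hypothetical library surface shared by side cars and tools. */
> >>>>   interface ClusterManagementApi {
> >>>>       /** Detect Cassandra processes running on this host. */
> >>>>       List<Long> localCassandraPids();
> >>>>
> >>>>       /** Current topology plus a hook for topology-change events. */
> >>>>       List<InetAddress> liveNodes();
> >>>>       void onTopologyChange(Consumer<TopologyEvent> listener);
> >>>>
> >>>>       /** Tokens assigned to a given node. */
> >>>>       List<String> tokensOf(InetAddress node);
> >>>>
> >>>>       /** Parsed configuration (cassandra.yaml) of the local node. */
> >>>>       Map<String, Object> nodeConfig();
> >>>>
> >>>>       /** Simple liveness/readiness check, as already proposed. */
> >>>>       HealthStatus health();
> >>>>   }
> >>>>
> >>>>   record TopologyEvent(InetAddress node, String kind) {}
> >>>>   enum HealthStatus { UP, DEGRADED, DOWN }
> >>>>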
> >>>> Any side cars should be easy to install on nodes that already run
> >>>> Cassandra
> >>>> * Scripts for packaging (tar, deb, rpm)
> >>>> * Templates for systemd support, optionally with an auto-startup
> >>>>   dependency on the Cassandra main process
> >>>>
> >>>> Integration testing
> >>>> * Provide a basic testing framework for mocking cluster state and
> >>>>   messages
> >>>>
> >>>>
> >>>> Support for other languages / avoid having to use JMX
> >>>> * JMX bridge (HTTP? gRPC?, already implemented in #14346?) - a rough
> >>>>   sketch follows below
> >>>>
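> >>>> (A minimal sketch of the JMX-bridge idea, assuming Cassandra's default
> >>>> local JMX port 7199; the class name and HTTP mapping are invented for
> >>>> illustration.)
> >>>>
> >>>>   import javax.management.MBeanServerConnection;
> >>>>   import javax.management.ObjectName;
> >>>>   import javax.management.remote.JMXConnector;
> >>>>   import javax.management.remote.JMXConnectorFactory;
> >>>>   import javax.management.remote.JMXServiceURL;
> >>>>
> >>>>   /** Reads one JMX attribute so it can be re-served over HTTP. */
> >>>>   class JmxBridge {
> >>>>       static Object readAttribute(String mbean, String attribute)
> >>>>               throws Exception {
> >>>>           JMXServiceURL url = new JMXServiceURL(
> >>>>               "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
> >>>>           try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
> >>>>               MBeanServerConnection conn =
> >>>>                   jmxc.getMBeanServerConnection();
> >>>>               return conn.getAttribute(new ObjectName(mbean), attribute);
> >>>>           }
> >>>>       }
> >>>>   }
> >>>>
> >>>> An HTTP route (path invented) could then map e.g. GET
> >>>> /jmx/storage/operationmode onto
> >>>> readAttribute("org.apache.cassandra.db:type=StorageService",
> >>>> "OperationMode").
> >>>>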
> >>>> Obviously the whole side car discussion is not moving in a direction
> >>>> everyone's happy with. Would it be an option to take a step back and
> >>>> start implementing such a tooling framework, with scripts and
> >>>> libraries for the features described above, as a small GitHub
> >>>> project, instead of putting an existing side-car solution up for
> >>>> vote? If that works and we get people collaborating on code shared
> >>>> between existing side-cars, then we could take the next step and
> >>>> think about either revisiting the "official Cassandra side-car"
> >>>> topic, or adding the created client tooling framework as an official
> >>>> sub-project to the Cassandra project (maybe via the Apache Incubator).
> >>>>
> >>>>
> >>>> On 08.09.18 02:49, Joseph Lynch wrote:
> >>>>
> >>>> On Fri, Sep 7, 2018 at 5:03 PM Jonathan Haddad <jon@xxxxxxxxxxxxx>
> >>>> wrote:
> >>>>
> >>>>
> >>>>> We haven’t even defined any requirements for an admin tool. It’s
> >>>>> hard to make a case for anything without agreement on what we’re
> >>>>> trying to build.
> >>>>
> >>>> We were/are trying to sketch out scope/requirements in the #14395 and
> >>>> #14346 tickets as well as their associated design documents. I think
> >>>> the general proposed direction is a distributed 1:1 management
> >>>> sidecar process, similar in architecture to Netflix's Priam except
> >>>> explicitly built to be general and pluggable by anyone rather than
> >>>> tightly coupled to AWS.
> >>>>
> >>>> Dinesh, Vinay and I were aiming for a small amount of scope at first,
> >>>> taking an iterative approach with just enough upfront design, but not
> >>>> so much that we are unable to make any progress at all. For example,
> >>>> maybe something like:
> >>>>
> >>>> 1. Get a super simple and non-controversial sidecar process that
> >>>> ships with Cassandra and exposes a lightweight HTTP interface to e.g.
> >>>> some basic JMX endpoints (a skeleton sketch follows this list)
> >>>> 2a. Add a pluggable execution engine for cron/oneshot/scheduled jobs
> >>>> with the basic interfaces and state store and such
> >>>> 2b. Start scoping and implementing the full HTTP interface, e.g.
> >>>> backup status, cluster health status, etc ...
> >>>> 3a. Start integrating implementations of the jobs from 2a such as
> >>>> snapshot, backup, cluster restart, daemon + sstable upgrade, repair,
> >>>> etc
> >>>> 3b. Start integrating UI components that pair with the HTTP interface
> >>>> from 2b
> >>>> 4. ?? Perhaps start unlocking next-generation operations like moving
> >>>> "background" activities like compaction, streaming, repair etc into
> >>>> one or more sidecar-contained processes to ensure the main daemon
> >>>> only handles read+write requests
> >>>>
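> >>>> (A minimal sketch of step 1 above, using only the JDK's built-in HTTP
> >>>> server; the port and /health route are invented for illustration, and
> >>>> JMX-backed routes could reuse a bridge like the one sketched earlier
> >>>> in this thread.)
> >>>>
> >>>>   import com.sun.net.httpserver.HttpServer;
> >>>>   import java.net.InetSocketAddress;
> >>>>   import java.nio.charset.StandardCharsets;
> >>>>
> >>>>   public class SidecarMain {
> >>>>       public static void main(String[] args) throws Exception {
> >>>>           HttpServer server =
> >>>>               HttpServer.create(new InetSocketAddress(8080), 0);
> >>>>           // Step 1: one trivial endpoint; the job engine (2a) and the
> >>>>           // fuller HTTP surface (2b) would be layered on afterwards.
> >>>>           server.createContext("/health", exchange -> {
> >>>>               byte[] body = "UP".getBytes(StandardCharsets.UTF_8);
> >>>>               exchange.sendResponseHeaders(200, body.length);
> >>>>               exchange.getResponseBody().write(body);
> >>>>               exchange.close();
> >>>>           });
> >>>>           server.start();
> >>>>       }
> >>>>   }
> >>>>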
> >>>> There are going to be a lot of questions to answer, and I think
> >>>> trying to answer them all up front will mean that we get nowhere or
> >>>> make unfortunate compromises that cripple the project from the start.
> >>>> If people think we need to do more design and discussion than we have
> >>>> been doing then we can spend more time on the design, but personally
> >>>> I'd rather start iterating on code and prove value incrementally. If
> >>>> it doesn't work out we won't release it GA to the community ...
> >>>>
> >>>> -Joey
> >>>>
> >>>
> >>> --
> >>> Jon Haddad
> >>> http://www.rustyrazorblade.com
> >>> twitter: rustyrazorblade
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@xxxxxxxxxxxxxxxxxxxx
> For additional commands, e-mail: dev-help@xxxxxxxxxxxxxxxxxxxx
>
>