osdir.com

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Evolving the client protocol


On Mon, Apr 23, 2018 at 4:13 PM, Jonathan Haddad <jon@xxxxxxxxxxxxx> wrote:

> From where I stand it looks like you've got only two options for any
> feature that involves updating the protocol:
>
> 1. Don't built the feature
> 2. Built it in Cassanda & scylladb, update the drivers accordingly
>
> I don't think you have a third option, which is built it only in ScyllaDB,
> because that means you have to fork *all* the drivers and make it work,
> then maintain them.  Your business model appears to be built on not doing
> any of the driver work yourself, and you certainly aren't giving back to
> the open source community via a permissive license on ScyllaDB itself, so
> I'm a bit lost here.
>

It's totally not about business model.
Scylla itself is 99% open source with AGPL license that prevents abuse and
forces to be committed back to the project. We also have our core engine
(seastar) licensed
as Apache since it needs to be integrated with  the core application.
Recently one of our community members even created a new Seastar based, C++
driver.

Scylla chose to be compatible with the drivers in order to leverage the
existing infrastructure
and (let's be frank) in order to allow smooth migration.
We would have loved to contribute more to the drivers but up to recently we:
1. Were busy on top of our heads with the server
2. Happy w/ the existing drivers
3. Developed extensions - GoCQLX - our own contribution

Finally we can contribute back to the same driver project, we want to do it
the right way,
without forking and without duplicated efforts.

Many times, having a private fork is way easier than proper open source
work so from
a pure business perspective, we don't select the shortest path.


>
> To me it looks like you're asking a bunch of volunteers that work on
> Cassandra to accommodate you.  What exactly do we get out of this
> relationship?  What incentive do I or anyone else have to spend time
> helping you instead of working on something that interests me?
>

Jon, this is certainty not the case.
We genuinely wish to make true *open source* work on:
a. Cassandra drivers
b. Client protocol
c. Scylla server side.
d. Cassandra community related work: mailing list, Jira, design

But not
e. Cassandra server side

While I wouldn't mind doing the Cassandra server work, we don't have the
resources or
the expertise. The Cassandra _developer_ community is welcome to decide
whether
we get to contribute a/b/c/d. Avi has enumerated the options of
cooperation, passive cooperation
and zero cooperation (below).

1. The protocol change is developed using the Cassandra process in a JIRA
ticket, culminating in a patch to doc/native_protocol*.spec when consensus
is achieved.
2. The protocol change is developed outside the Cassandra process.
3. No cooperation.

Look, I can understand the hostility and suspicious, however, from the C*
project POV, it makes no
sense to ignore, otherwise we'll fork the drivers and you won't get
anything back. There is another
at least one vendor today with their server fork and driver fork and it
makes sense to keep the protocol
unified in an extensible way and to discuss new features _together_.



>
> Jon
>
>
> On Mon, Apr 23, 2018 at 7:59 AM Ben Bromhead <ben@xxxxxxxxxxxxxxx> wrote:
>
> > > >> This doesn't work without additional changes, for RF>1. The token
> ring
> > > could place two replicas of the same token range on the same physical
> > > server, even though those are two separate cores of the same server.
> You
> > > could add another element to the hierarchy (cluster -> datacenter ->
> rack
> > > -> node -> core/shard), but that generates unneeded range movements
> when
> > a
> > > node is added.
> > > > I have seen rack awareness used/abused to solve this.
> > > >
> > >
> > > But then you lose real rack awareness. It's fine for a quick hack, but
> > > not a long-term solution.
> > >
> > > (it also creates a lot more tokens, something nobody needs)
> > >
> >
> > I'm having trouble understanding how you loose "real" rack awareness, as
> > these shards are in the same rack anyway, because the address and port
> are
> > on the same server in the same rack. So it behaves as expected. Could you
> > explain a situation where the shards on a single server would be in
> > different racks (or fault domains)?
> >
> > If you wanted to support a situation where you have a single rack per DC
> > for simple deployments, extending NetworkTopologyStrategy to behave the
> way
> > it did before https://issues.apache.org/jira/browse/CASSANDRA-7544 with
> > respect to treating InetAddresses as servers rather than the address and
> > port would be simple. Both this implementation in Apache Cassandra and
> the
> > respective load balancing classes in the drivers are explicitly designed
> to
> > be pluggable so that would be an easier integration point for you.
> >
> > I'm not sure how it creates more tokens? If a server normally owns 256
> > tokens, each shard on a different port would just advertise ownership of
> > 256/# of cores (e.g. 4 tokens if you had 64 cores).
> >
> >
> > >
> > > > Regards,
> > > > Ariel
> > > >
> > > >> On Apr 22, 2018, at 8:26 AM, Avi Kivity <avi@xxxxxxxxxxxx> wrote:
> > > >>
> > > >>
> > > >>
> > > >>> On 2018-04-19 21:15, Ben Bromhead wrote:
> > > >>> Re #3:
> > > >>>
> > > >>> Yup I was thinking each shard/port would appear as a discrete
> server
> > > to the
> > > >>> client.
> > > >> This doesn't work without additional changes, for RF>1. The token
> ring
> > > could place two replicas of the same token range on the same physical
> > > server, even though those are two separate cores of the same server.
> You
> > > could add another element to the hierarchy (cluster -> datacenter ->
> rack
> > > -> node -> core/shard), but that generates unneeded range movements
> when
> > a
> > > node is added.
> > > >>
> > > >>> If the per port suggestion is unacceptable due to hardware
> > > requirements,
> > > >>> remembering that Cassandra is built with the concept scaling
> > > *commodity*
> > > >>> hardware horizontally, you'll have to spend your time and energy
> > > convincing
> > > >>> the community to support a protocol feature it has no (current) use
> > > for or
> > > >>> find another interim solution.
> > > >> Those servers are commodity servers (not x86, but still commodity).
> In
> > > any case 60+ logical cores are common now (hello AWS i3.16xlarge or
> even
> > > i3.metal), and we can only expect logical core count to continue to
> > > increase (there are 48-core ARM processors now).
> > > >>
> > > >>> Another way, would be to build support and consensus around a clear
> > > >>> technical need in the Apache Cassandra project as it stands today.
> > > >>>
> > > >>> One way to build community support might be to contribute an Apache
> > > >>> licensed thread per core implementation in Java that matches the
> > > protocol
> > > >>> change and shard concept you are looking for ;P
> > > >> I doubt I'll survive the egregious top-posting that is going on in
> > this
> > > list.
> > > >>
> > > >>>
> > > >>>> On Thu, Apr 19, 2018 at 1:43 PM Ariel Weisberg <ariel@xxxxxxxxxxx
> >
> > > wrote:
> > > >>>>
> > > >>>> Hi,
> > > >>>>
> > > >>>> So at technical level I don't understand this yet.
> > > >>>>
> > > >>>> So you have a database consisting of single threaded shards and a
> > > socket
> > > >>>> for accept that is generating TCP connections and in advance you
> > > don't know
> > > >>>> which connection is going to send messages to which shard.
> > > >>>>
> > > >>>> What is the mechanism by which you get the packets for a given TCP
> > > >>>> connection delivered to a specific core? I know that a given TCP
> > > connection
> > > >>>> will normally have all of its packets delivered to the same queue
> > > from the
> > > >>>> NIC because the tuple of source address + port and destination
> > > address +
> > > >>>> port is typically hashed to pick one of the queues the NIC
> > presents. I
> > > >>>> might have the contents of the tuple slightly wrong, but it always
> > > includes
> > > >>>> a component you don't get to control.
> > > >>>>
> > > >>>> Since it's hashing how do you manipulate which queue packets for a
> > TCP
> > > >>>> connection go to and how is it made worse by having an accept
> socket
> > > per
> > > >>>> shard?
> > > >>>>
> > > >>>> You also mention 160 ports as bad, but it doesn't sound like a big
> > > number
> > > >>>> resource wise. Is it an operational headache?
> > > >>>>
> > > >>>> RE tokens distributed amongst shards. The way that would work
> right
> > > now is
> > > >>>> that each port number appears to be a discrete instance of the
> > > server. So
> > > >>>> you could have shards be actual shards that are simply colocated
> on
> > > the
> > > >>>> same box, run in the same process, and share resources. I know
> this
> > > pushes
> > > >>>> more of the complexity into the server vs the driver as the server
> > > expects
> > > >>>> all shards to share some client visible like system tables and
> > certain
> > > >>>> identifiers.
> > > >>>>
> > > >>>> Ariel
> > > >>>>> On Thu, Apr 19, 2018, at 12:59 PM, Avi Kivity wrote:
> > > >>>>> Port-per-shard is likely the easiest option but it's too ugly to
> > > >>>>> contemplate. We run on machines with 160 shards (IBM POWER
> > 2s20c160t
> > > >>>>> IIRC), it will be just horrible to have 160 open ports.
> > > >>>>>
> > > >>>>>
> > > >>>>> It also doesn't fit will with the NICs ability to automatically
> > > >>>>> distribute packets among cores using multiple queues, so the
> kernel
> > > >>>>> would have to shuffle those packets around. Much better to have
> > those
> > > >>>>> packets delivered directly to the core that will service them.
> > > >>>>>
> > > >>>>>
> > > >>>>> (also, some protocol changes are needed so the driver knows how
> > > tokens
> > > >>>>> are distributed among shards)
> > > >>>>>
> > > >>>>>> On 2018-04-19 19:46, Ben Bromhead wrote:
> > > >>>>>> WRT to #3
> > > >>>>>> To fit in the existing protocol, could you have each shard
> listen
> > > on a
> > > >>>>>> different port? Drivers are likely going to support this due to
> > > >>>>>> https://issues.apache.org/jira/browse/CASSANDRA-7544 (
> > > >>>>>> https://issues.apache.org/jira/browse/CASSANDRA-11596).  I'm
> not
> > > super
> > > >>>>>> familiar with the ticket so their might be something I'm missing
> > > but it
> > > >>>>>> sounds like a potential approach.
> > > >>>>>>
> > > >>>>>> This would give you a path forward at least for the short term.
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> On Thu, Apr 19, 2018 at 12:10 PM Ariel Weisberg <
> > ariel@xxxxxxxxxxx>
> > > >>>> wrote:
> > > >>>>>>> Hi,
> > > >>>>>>>
> > > >>>>>>> I think that updating the protocol spec to Cassandra puts the
> > onus
> > > on
> > > >>>> the
> > > >>>>>>> party changing the protocol specification to have an
> > implementation
> > > >>>> of the
> > > >>>>>>> spec in Cassandra as well as the Java and Python driver (those
> > are
> > > >>>> both
> > > >>>>>>> used in the Cassandra repo). Until it's implemented in
> Cassandra
> > we
> > > >>>> haven't
> > > >>>>>>> fully evaluated the specification change. There is no
> substitute
> > > for
> > > >>>> trying
> > > >>>>>>> to make it work.
> > > >>>>>>>
> > > >>>>>>> There are also realities to consider as to what the maintainers
> > of
> > > the
> > > >>>>>>> drivers are willing to commit.
> > > >>>>>>>
> > > >>>>>>> RE #1,
> > > >>>>>>>
> > > >>>>>>> I am +1 on the fact that we shouldn't require an extra hop for
> > > range
> > > >>>> scans.
> > > >>>>>>> In JIRA Jeremiah made the point that you can still do this from
> > the
> > > >>>> client
> > > >>>>>>> by breaking up the token ranges, but it's a leaky abstraction
> to
> > > have
> > > >>>> a
> > > >>>>>>> paging interface that isn't a vanilla ResultSet interface.
> Serial
> > > vs.
> > > >>>>>>> parallel is kind of orthogonal as the driver can do either.
> > > >>>>>>>
> > > >>>>>>> I agree it looks like the current specification doesn't make
> what
> > > >>>> should
> > > >>>>>>> be simple as simple as it could be for driver implementers.
> > > >>>>>>>
> > > >>>>>>> RE #2,
> > > >>>>>>>
> > > >>>>>>> +1 on this change assuming an implementation in Cassandra and
> the
> > > >>>> Java and
> > > >>>>>>> Python drivers.
> > > >>>>>>>
> > > >>>>>>> RE #3,
> > > >>>>>>>
> > > >>>>>>> It's hard to be +1 on this because we don't benefit by boxing
> > > >>>> ourselves in
> > > >>>>>>> by defining a spec we haven't implemented, tested, and decided
> we
> > > are
> > > >>>>>>> satisfied with. Having it in ScyllaDB de-risks it to a certain
> > > >>>> extent, but
> > > >>>>>>> what if Cassandra decides to go a different direction in some
> > way?
> > > >>>>>>>
> > > >>>>>>> I don't think there is much discussion to be had without an
> > example
> > > >>>> of the
> > > >>>>>>> the changes to the CQL specification to look at, but even then
> if
> > > it
> > > >>>> looks
> > > >>>>>>> risky I am not likely to be in favor of it.
> > > >>>>>>>
> > > >>>>>>> Regards,
> > > >>>>>>> Ariel
> > > >>>>>>>
> > > >>>>>>>> On Thu, Apr 19, 2018, at 9:33 AM, glommer@xxxxxxxxxxxx wrote:
> > > >>>>>>>> On 2018/04/19 07:19:27, kurt greaves <kurt@xxxxxxxxxxxxxxx>
> > > wrote:
> > > >>>>>>>>>> 1. The protocol change is developed using the Cassandra
> > process
> > > in
> > > >>>>>>>>>>      a JIRA ticket, culminating in a patch to
> > > >>>>>>>>>>      doc/native_protocol*.spec when consensus is achieved.
> > > >>>>>>>>> I don't think forking would be desirable (for anyone) so this
> > > seems
> > > >>>>>>>>> the most reasonable to me. For 1 and 2 it certainly makes
> sense
> > > but
> > > >>>>>>>>> can't say I know enough about sharding to comment on 3 -
> seems
> > > to me
> > > >>>>>>>>> like it could be locking in a design before anyone truly
> knows
> > > what
> > > >>>>>>>>> sharding in C* looks like. But hopefully I'm wrong and there
> > are
> > > >>>>>>>>> devs out there that have already thought that through.
> > > >>>>>>>> Thanks. That is our view and is great to hear.
> > > >>>>>>>>
> > > >>>>>>>> About our proposal number 3: In my view, good protocol designs
> > are
> > > >>>>>>>> future proof and flexible. We certainly don't want to propose
> a
> > > >>>> design
> > > >>>>>>>> that works just for Scylla, but would support reasonable
> > > >>>>>>>> implementations regardless of how they may look like.
> > > >>>>>>>>
> > > >>>>>>>>> Do we have driver authors who wish to support both projects?
> > > >>>>>>>>>
> > > >>>>>>>>> Surely, but I imagine it would be a minority. ​
> > > >>>>>>>>>
> > > >>>>>>>>
> > > ---------------------------------------------------------------------
> > > >>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@xxxxxxxxxxxxxxxxxxxx
> > For
> > > >>>>>>>> additional commands, e-mail: dev-help@xxxxxxxxxxxxxxxxxxxx
> > > >>>>>>>>
> > > >>>>>>>
> > > ---------------------------------------------------------------------
> > > >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@xxxxxxxxxxxxxxxxxxxx
> > > >>>>>>> For additional commands, e-mail: dev-help@xxxxxxxxxxxxxxxxxxxx
> > > >>>>>>>
> > > >>>>>>> --
> > > >>>>>> Ben Bromhead
> > > >>>>>> CTO | Instaclustr <https://www.instaclustr.com/>
> > > >>>>>> +1 650 284 9692 <(650)%20284-9692> <(650)%20284-9692>
> > <(650)%20284-9692>
> > > >>>>>> Reliability at Scale
> > > >>>>>> Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer
> > > >>>>>>
> > > >>>>>
> > ---------------------------------------------------------------------
> > > >>>>> To unsubscribe, e-mail: dev-unsubscribe@xxxxxxxxxxxxxxxxxxxx
> > > >>>>> For additional commands, e-mail: dev-help@xxxxxxxxxxxxxxxxxxxx
> > > >>>>>
> > > >>>>
> > ---------------------------------------------------------------------
> > > >>>> To unsubscribe, e-mail: dev-unsubscribe@xxxxxxxxxxxxxxxxxxxx
> > > >>>> For additional commands, e-mail: dev-help@xxxxxxxxxxxxxxxxxxxx
> > > >>>>
> > > >>>> --
> > > >>> Ben Bromhead
> > > >>> CTO | Instaclustr <https://www.instaclustr.com/>
> > > >>> +1 650 284 9692 <(650)%20284-9692> <(650)%20284-9692>
> > > >>> Reliability at Scale
> > > >>> Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer
> > > >>>
> > > >>
> > > >> ------------------------------------------------------------
> ---------
> > > >> To unsubscribe, e-mail: dev-unsubscribe@xxxxxxxxxxxxxxxxxxxx
> > > >> For additional commands, e-mail: dev-help@xxxxxxxxxxxxxxxxxxxx
> > > >>
> > > >
> > > > ------------------------------------------------------------
> ---------
> > > > To unsubscribe, e-mail: dev-unsubscribe@xxxxxxxxxxxxxxxxxxxx
> > > > For additional commands, e-mail: dev-help@xxxxxxxxxxxxxxxxxxxx
> > > >
> > >
> > > --
> > Ben Bromhead
> > CTO | Instaclustr <https://www.instaclustr.com/>
> > +1 650 284 9692 <(650)%20284-9692>
> > Reliability at Scale
> > Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer
> >
>