Re: Evolving the client protocol


I have not asked this list to do any work on the drivers.


If Cassandra agrees to Scylla protocol changes (either proactively or retroactively) then the benefit to Cassandra is that if the drivers are changed (by the driver maintainers or by Scylla developers) then Cassandra developers need not do additional work to update the drivers. So there is less work for you, in the future, if those features are of interest to you.


On 2018-04-24 02:13, Jonathan Haddad wrote:
 From where I stand it looks like you've got only two options for any
feature that involves updating the protocol:

1. Don't build the feature
2. Build it in Cassandra & ScyllaDB, update the drivers accordingly

I don't think you have a third option, which is to build it only in ScyllaDB,
because that means you have to fork *all* the drivers and make it work,
then maintain them.  Your business model appears to be built on not doing
any of the driver work yourself, and you certainly aren't giving back to
the open source community via a permissive license on ScyllaDB itself, so
I'm a bit lost here.

To me it looks like you're asking a bunch of volunteers that work on
Cassandra to accommodate you.  What exactly do we get out of this
relationship?  What incentive do I or anyone else have to spend time
helping you instead of working on something that interests me?

Jon


On Mon, Apr 23, 2018 at 7:59 AM Ben Bromhead <ben@xxxxxxxxxxxxxxx> wrote:

This doesn't work without additional changes, for RF>1. The token ring
could place two replicas of the same token range on the same physical
server, even though those are two separate cores of the same server. You
could add another element to the hierarchy (cluster -> datacenter -> rack
-> node -> core/shard), but that generates unneeded range movements when a
node is added.
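
To make the RF>1 concern concrete, here is a minimal sketch of a simplified token ring in which every (host, shard) pair is advertised as its own endpoint. It is not Cassandra's or Scylla's replication code; the `replicas` function, the `distinct_hosts` flag, and the host names are hypothetical and exist only for illustration.

```python
from bisect import bisect_left

def replicas(ring, token, rf, distinct_hosts=False):
    """Walk the ring clockwise from `token` and pick `rf` endpoints.

    ring: list of (token, (host, shard)) sorted by token.
    distinct_hosts: if True, skip endpoints whose host was already chosen.
    """
    tokens = [t for t, _ in ring]
    start = bisect_left(tokens, token) % len(ring)
    chosen, seen_hosts = [], set()
    for i in range(len(ring)):
        _, endpoint = ring[(start + i) % len(ring)]
        host = endpoint[0]
        if endpoint in chosen:
            continue
        if distinct_hosts and host in seen_hosts:
            continue
        chosen.append(endpoint)
        seen_hosts.add(host)
        if len(chosen) == rf:
            break
    return chosen

# Two physical hosts, two shards each, every shard treated as a "node".
ring = sorted([
    (10, ("hostA", 0)),
    (20, ("hostA", 1)),
    (30, ("hostB", 0)),
    (40, ("hostB", 1)),
])

print(replicas(ring, 5, rf=2))
# [('hostA', 0), ('hostA', 1)]  both replicas on the same machine
print(replicas(ring, 5, rf=2, distinct_hosts=True))
# [('hostA', 0), ('hostB', 0)]
```

In this toy model the naive walk puts both copies of the range on hostA, which is the failure mode described above; avoiding it requires the placement logic to know which endpoints share a physical server.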
I have seen rack awareness used/abused to solve this.

But then you lose real rack awareness. It's fine for a quick hack, but
not a long-term solution.

(it also creates a lot more tokens, something nobody needs)

I'm having trouble understanding how you lose "real" rack awareness, as
these shards are in the same rack anyway, because the address and port are
on the same server in the same rack. So it behaves as expected. Could you
explain a situation where the shards on a single server would be in
different racks (or fault domains)?

If you wanted to support a situation where you have a single rack per DC
for simple deployments, extending NetworkTopologyStrategy to behave the way
it did before https://issues.apache.org/jira/browse/CASSANDRA-7544 with
respect to treating InetAddresses as servers rather than the address and
port would be simple. Both this implementation in Apache Cassandra and the
respective load balancing classes in the drivers are explicitly designed to
be pluggable so that would be an easier integration point for you.

I'm not sure how it creates more tokens? If a server normally owns 256
tokens, each shard on a different port would just advertise ownership of
256/# of cores (e.g. 4 tokens if you had 64 cores).
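
The arithmetic above can be sketched as follows. The helper name `tokens_per_shard` is hypothetical, and spreading the remainder over the first shards is only one possible assumption for core counts that do not divide the token count evenly.

```python
from typing import List

def tokens_per_shard(tokens_per_server: int, shards: int) -> List[int]:
    """Split a server's token count across its shards as evenly as possible."""
    base, extra = divmod(tokens_per_server, shards)
    # The first `extra` shards take one additional token (an assumption).
    return [base + 1 if i < extra else base for i in range(shards)]

counts = tokens_per_shard(256, 64)
print(counts[0], sum(counts))   # 4 256

# With 160 shards the split is uneven, but the server total is unchanged.
uneven = tokens_per_shard(256, 160)
print(min(uneven), max(uneven), sum(uneven))   # 1 2 256
```

Under this scheme the number of tokens per server stays at 256 no matter how many shards advertise them.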


Regards,
Ariel

On Apr 22, 2018, at 8:26 AM, Avi Kivity <avi@xxxxxxxxxxxx> wrote:



On 2018-04-19 21:15, Ben Bromhead wrote:
Re #3:

Yup I was thinking each shard/port would appear as a discrete server to the
client.

This doesn't work without additional changes, for RF>1. The token ring
could place two replicas of the same token range on the same physical
server, even though those are two separate cores of the same server. You
could add another element to the hierarchy (cluster -> datacenter -> rack
-> node -> core/shard), but that generates unneeded range movements when a
node is added.

If the per port suggestion is unacceptable due to hardware requirements,
remembering that Cassandra is built with the concept of scaling *commodity*
hardware horizontally, you'll have to spend your time and energy convincing
the community to support a protocol feature it has no (current) use for or
find another interim solution.

Those servers are commodity servers (not x86, but still commodity). In
any case 60+ logical cores are common now (hello AWS i3.16xlarge or even
i3.metal), and we can only expect logical core count to continue to
increase (there are 48-core ARM processors now).

Another way would be to build support and consensus around a clear
technical need in the Apache Cassandra project as it stands today.

One way to build community support might be to contribute an Apache
licensed thread per core implementation in Java that matches the protocol
change and shard concept you are looking for ;P

I doubt I'll survive the egregious top-posting that is going on in this
list.

On Thu, Apr 19, 2018 at 1:43 PM Ariel Weisberg <ariel@xxxxxxxxxxx> wrote:
Hi,

So at a technical level I don't understand this yet.

So you have a database consisting of single threaded shards and a socket
for accept that is generating TCP connections and in advance you don't know
which connection is going to send messages to which shard.

What is the mechanism by which you get the packets for a given TCP
connection delivered to a specific core? I know that a given TCP connection
will normally have all of its packets delivered to the same queue from the
NIC because the tuple of source address + port and destination address +
port is typically hashed to pick one of the queues the NIC presents. I
might have the contents of the tuple slightly wrong, but it always includes
a component you don't get to control.

Since it's hashing how do you manipulate which queue packets for a TCP
connection go to and how is it made worse by having an accept socket per
shard?
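
The queue selection described above can be sketched roughly as below. This is illustration only: real NICs use their own hash function and indirection table, so `zlib.crc32` is a stand-in, and `pick_queue`, the addresses, and the queue count of 8 are all assumptions. Port 9042 is the usual native protocol port.

```python
import socket
import struct
import zlib

def pick_queue(src_ip, src_port, dst_ip, dst_port, num_queues):
    """Map a TCP 4-tuple to a receive queue index (stand-in for a NIC's hash)."""
    key = (
        socket.inet_aton(src_ip) + struct.pack("!H", src_port)
        + socket.inet_aton(dst_ip) + struct.pack("!H", dst_port)
    )
    return zlib.crc32(key) % num_queues

# Every packet of one connection carries the same 4-tuple, so it always
# lands on the same queue.
q1 = pick_queue("10.0.0.5", 51000, "10.0.0.9", 9042, num_queues=8)
q2 = pick_queue("10.0.0.5", 51000, "10.0.0.9", 9042, num_queues=8)
assert q1 == q2

# The server fixes dst_ip and dst_port; the client's source port is the
# part the server does not choose, and it changes the resulting queue.
queues = {pick_queue("10.0.0.5", p, "10.0.0.9", 9042, 8)
          for p in range(51000, 51064)}
print(sorted(queues))
```

The sketch only shows that the queue is a pure function of the 4-tuple; it says nothing about how a server could steer a particular connection to a particular shard.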

You also mention 160 ports as bad, but it doesn't sound like a big number
resource wise. Is it an operational headache?

RE tokens distributed amongst shards. The way that would work right now is
that each port number appears to be a discrete instance of the server. So
you could have shards be actual shards that are simply colocated on the
same box, run in the same process, and share resources. I know this pushes
more of the complexity into the server vs the driver as the server expects
all shards to share some client-visible state, like system tables and certain
identifiers.

Ariel
On Thu, Apr 19, 2018, at 12:59 PM, Avi Kivity wrote:
Port-per-shard is likely the easiest option but it's too ugly to
contemplate. We run on machines with 160 shards (IBM POWER 2s20c160t
IIRC), it will be just horrible to have 160 open ports.

It also doesn't fit well with the NIC's ability to automatically
distribute packets among cores using multiple queues, so the kernel
would have to shuffle those packets around. Much better to have those
packets delivered directly to the core that will service them.

(also, some protocol changes are needed so the driver knows how tokens
are distributed among shards)

On 2018-04-19 19:46, Ben Bromhead wrote:
WRT #3
To fit in the existing protocol, could you have each shard listen on a
different port? Drivers are likely going to support this due to
https://issues.apache.org/jira/browse/CASSANDRA-7544
(https://issues.apache.org/jira/browse/CASSANDRA-11596). I'm not super
familiar with the ticket so there might be something I'm missing but it
sounds like a potential approach.

This would give you a path forward at least for the short term.


On Thu, Apr 19, 2018 at 12:10 PM Ariel Weisberg <ariel@xxxxxxxxxxx> wrote:
Hi,

I think that updating the protocol spec to Cassandra puts the onus on the
party changing the protocol specification to have an implementation of the
spec in Cassandra as well as the Java and Python driver (those are both
used in the Cassandra repo). Until it's implemented in Cassandra we haven't
fully evaluated the specification change. There is no substitute for trying
to make it work.

There are also realities to consider as to what the maintainers of the
drivers are willing to commit.

RE #1,

I am +1 on the fact that we shouldn't require an extra hop for range scans.
In JIRA Jeremiah made the point that you can still do this from the client
by breaking up the token ranges, but it's a leaky abstraction to have a
paging interface that isn't a vanilla ResultSet interface. Serial vs.
parallel is kind of orthogonal as the driver can do either.
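
As a rough illustration of breaking up the token ranges on the client side, the sketch below splits a signed 64-bit token space into contiguous sub-ranges. It assumes the Murmur3Partitioner token bounds; `split_token_range` and the table and column names in the printed query are hypothetical.

```python
MIN_TOKEN = -(2 ** 63)
MAX_TOKEN = 2 ** 63 - 1

def split_token_range(parts):
    """Return `parts` contiguous (start, end] ranges covering the token space."""
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + (span * i) // parts for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))

ranges = split_token_range(4)
for start, end in ranges:
    # Hypothetical table and key names; each sub-range could be queried
    # serially or in parallel, and paged independently.
    print(f"SELECT * FROM ks.tbl WHERE token(pk) > {start} AND token(pk) <= {end}")
```

A driver, or an application on top of it, would normally align these splits with the ring's actual token ownership rather than cutting the space evenly; the even split here just keeps the sketch short.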

I agree it looks like the current specification doesn't make what should
be simple as simple as it could be for driver implementers.

RE #2,

+1 on this change assuming an implementation in Cassandra and the Java and
Python drivers.

RE #3,

It's hard to be +1 on this because we don't benefit by boxing ourselves in
by defining a spec we haven't implemented, tested, and decided we are
satisfied with. Having it in ScyllaDB de-risks it to a certain extent, but
what if Cassandra decides to go a different direction in some way?

I don't think there is much discussion to be had without an example of the
changes to the CQL specification to look at, but even then if it looks
risky I am not likely to be in favor of it.

Regards,
Ariel

On Thu, Apr 19, 2018, at 9:33 AM, glommer@xxxxxxxxxxxx wrote:
On 2018/04/19 07:19:27, kurt greaves <kurt@xxxxxxxxxxxxxxx> wrote:
1. The protocol change is developed using the Cassandra process in
      a JIRA ticket, culminating in a patch to
      doc/native_protocol*.spec when consensus is achieved.
I don't think forking would be desirable (for anyone) so this seems
the most reasonable to me. For 1 and 2 it certainly makes sense but
can't say I know enough about sharding to comment on 3 - seems to me
like it could be locking in a design before anyone truly knows what
sharding in C* looks like. But hopefully I'm wrong and there are
devs out there that have already thought that through.

Thanks. That is our view and is great to hear.

About our proposal number 3: In my view, good protocol designs are
future proof and flexible. We certainly don't want to propose a design
that works just for Scylla, but one that would support reasonable
implementations regardless of what they may look like.

Do we have driver authors who wish to support both projects?

Surely, but I imagine it would be a minority.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@xxxxxxxxxxxxxxxxxxxx
For additional commands, e-mail: dev-help@xxxxxxxxxxxxxxxxxxxx


--
Ben Bromhead
CTO | Instaclustr <https://www.instaclustr.com/>
+1 650 284 9692
Reliability at Scale
Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer
