We have a cluster over 100 nodes that performs just fine for its use case. In our case, we needed the disk space and did not want the admin headache of very dense
nodes. It does take more automation and process to handle a larger cluster, but those are all good things to solve anyway.
But count me in on being interested in what DataStax is calling “Big Node.” Would love to be able to use denser nodes, if the headaches are reduced.
From: Ben Slater <ben.slater@xxxxxxxxxxxxxxx>
Sent: Wednesday, November 07, 2018 6:08 PM
Subject: [EXTERNAL] Re: Multiple cluster for a single application
I tend to recommend an approach similar to Eric’s functional sharding although I describe it at quality of service sharding - group your small, hot data into one cluster and your large, cooler data into another so you can provision infrastructure
and tune according. I guess it depends on you management environment but if you app functionality allows your to split into multiple clusters (ie all your data is not all in one giant table) then I would generally look to split. Splitting also gives you the
advantage of making it harder to have an outage that brings everything down.
Interesting approach Eric, thanks for sharing that.
> I've read documents recommended to use clusters with less than 50 or 100 nodes (Netflix got hundreds of clusters with less 100 nodes on each).
Not sure where you read that, but it's nonsense. We work with quite a few clusters that are several hundred nodes each. Your problems can get a bit amplified, for instance dynamic snitch can make a cluster perform significantly worse
than if you just flat out disable it, which is what I usually recommend.
I'm curious how you arrived at the estimate of needing > 100 nodes. Is that due to space constraints or performance ones?
We are engaging in both strategies at the same time:
1) We call it functional sharding - we write to clusters targeted according to the type of data being written. Because different data types often have different workloads this has the nice side effect of being able to tune each cluster
according to its workload. Your ability to grow in this dimension is limited by the number of business object types you're recording.
2) We write to clusters sharded by time. Our objects are network security events, so there's always an element of time. We encode that time into deterministic object IDs so that we are able to identify in the read path which shard to
direct the request to by extracting the time component. This basic idea should be able to work any time you're able to use surrogate keys instead of natural keys. If you are using natural keys, you may be facing an unpleasant migration should you need to
increase the number of shards in this dimension.
Our reason for engaging in the second strategy was not purely Cassandra's fault, rather we were using DSE with a search workload, and the cost of rebuilding Solr indexes on streaming operations (such as adding nodes to an existing cluster)
required enough resources that we found it prohibitive. That's because the bootstrapping node was also taking a production write workload, and we didn't want to run our cluster with enough overhead that a node could bootstrap and take production workload
at the same time.
For vanilla Cassandra workloads we have run clusters with quite a bit more nodes than 100 without any appreciable trouble. Curious if you can share documents about clusters over 100 nodes causing troubles for users. I'm wondering if it's
related to node failure rate combined with vnodes meaning that several concurrent node failures cause a part of the ring to go offline too reliably.
On Mon, Nov 5, 2018 at 7:38 AM onmstester onmstester <firstname.lastname@example.org> wrote:
One of my applications requires to create a cluster with more than 100 nodes, I've read documents recommended to use clusters with less than 50 or 100 nodes (Netflix got hundreds
of clusters with less 100 nodes on each).
Is it a good idea to use multiple clusters for a single application, just to decrease maintenance problems and system complexity/performance?
If So, which one of below policies is more suitable to distribute data among clusters and Why?
1. each cluster' would be responsible for a specific partial set of tables only (table sizes are almost equal so easy calculations here) for example inserts to table X would
go to cluster Y
2. shard data at loader level by some business logic grouping of data, for example all rows with some column starting with X would go to cluster Y
I would appreciate sharing your experiences working with big clusters, problem encountered and solutions.
Chief Product Officer
Read our latest technical blog posts here.
This email has been sent on behalf of Instaclustr Pty. Limited (Australia) and Instaclustr Inc (USA).
This email and any attachments may contain confidential and legally privileged information. If you are not the intended recipient, do not copy or disclose its content,
but please reply to this email immediately and highlight the error to the sender and then immediately delete the message.
The information in this Internet Email is confidential and may be legally privileged. It is intended solely for the addressee. Access to this Email by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, distribution
or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. When addressed to our clients any opinions or advice contained in this Email are subject to the terms and conditions expressed in any applicable governing The
Home Depot terms of business or client engagement letter. The Home Depot disclaims all responsibility and liability for the accuracy and content of this attachment and for any damages or losses arising from any inaccuracies, errors, viruses, e.g., worms, trojan
horses, etc., or other items of a destructive nature, which may be contained in this attachment and shall not be liable for direct, indirect, consequential or special damages in connection with this e-mail message or its attachment.