Re: CASSANDRA-13241 lower default chunk_length_in_kb
On Thu, Oct 11, 2018 at 4:31 PM Ben Bromhead <ben@xxxxxxxxxxxxxxx> wrote:
> This is something that's bugged me for ages, tbh the performance gain for
> most use cases far outweighs the increase in memory usage and I would even
> be in favor of changing the default now, optimizing the storage cost later
> (if it's found to be worth it).
> For some anecdotal evidence:
> 4kb is usually what we end up setting it to. 16kb feels more reasonable given
> the memory impact, but what would be the point if, practically, most folks
> set it to 4kb anyway?
> Note that chunk_length will largely be dependent on your read sizes, but 4k
> is the floor for most physical devices in terms of their block size.
It might be worthwhile to investigate how splitting chunk size into separate
data, index and compaction sizes would affect performance.
> +1 for making this change in 4.0 given the small size and the large
> improvement to new users' experience (as long as we are explicit in the
> documentation about memory consumption).
> On Thu, Oct 11, 2018 at 7:11 PM Ariel Weisberg <ariel@xxxxxxxxxxx> wrote:
> > Hi,
> > This is regarding https://issues.apache.org/jira/browse/CASSANDRA-13241
> > This ticket has languished for a while. IMO it's too late in 4.0 to
> > implement a more memory efficient representation for compressed chunk
> > offsets. However I don't think we should put out another release with the
> > current 64k default as it's pretty unreasonable.
> > I propose that we lower the value to 16kb. 4k might never be the correct
> > default anyway, as there is a cost to compression, and 16k will still be a
> > large improvement.
> > Benedict and Jon Haddad are both +1 on making this change for 4.0. In the
> > past there has been some consensus about reducing this value, though
> > paired with a more memory-efficient representation.
> > The napkin math for what this costs is:
> > "If you have 1TB of uncompressed data, with 64k chunks that's 16M chunks
> > at 8 bytes each (128MB).
> > With 16k chunks, that's 512MB.
> > With 4k chunks, it's 2G.
> > Per terabyte of data (pre-compression)."
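> > The napkin math above can be checked with a short sketch (illustrative
> > only; it assumes one 8-byte offset is kept per compressed chunk, as the
> > figures in the ticket do):
> >
> > ```python
> > # Per-chunk offset overhead for 1 TB of uncompressed data at each
> > # candidate chunk_length_in_kb. Assumes 8 bytes stored per chunk offset.
> > TB = 1024 ** 4
> > OFFSET_BYTES = 8  # one long per compressed-chunk offset
> >
> > def offset_overhead(data_bytes, chunk_kb):
> >     """Bytes spent on chunk offsets for `data_bytes` of uncompressed data."""
> >     chunks = data_bytes // (chunk_kb * 1024)
> >     return chunks * OFFSET_BYTES
> >
> > for kb in (64, 16, 4):
> >     mb = offset_overhead(TB, kb) // 1024 ** 2
> >     print(f"{kb} KB chunks -> {mb} MB of offsets per TB")
> > # 64 KB chunks -> 128 MB of offsets per TB
> > # 16 KB chunks -> 512 MB of offsets per TB
> > # 4 KB chunks -> 2048 MB of offsets per TB
> > ```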
> > By way of comparison, memory mapping the files has a similar cost of 8
> > bytes per 4k page. Multiple mappings make this more expensive. With a
> > default of 16kb, the chunk offsets would be 4x less expensive than memory
> > mapping.
> > I only mention this to give a sense of the costs we are already paying. I
> > am not saying they are directly related.
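> > The 4x figure falls straight out of the page and chunk sizes (a rough
> > sketch, assuming 8 bytes of cost per 4 KB mapped page and per 16 KB chunk
> > offset, as stated above):
> >
> > ```python
> > # Compare per-TB bookkeeping cost: mmap pages vs. 16 KB chunk offsets.
> > TB = 1024 ** 4
> >
> > mmap_cost = (TB // (4 * 1024)) * 8    # 8 bytes per 4 KB mapped page
> > chunk_cost = (TB // (16 * 1024)) * 8  # 8 bytes per 16 KB chunk offset
> >
> > print(mmap_cost // chunk_cost)  # -> 4 (chunk offsets are 4x cheaper)
> > ```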
> > I'll wait a week for discussion and, if there is consensus, make the
> > change.
> > Regards,
> > Ariel
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@xxxxxxxxxxxxxxxxxxxx
> > For additional commands, e-mail: dev-help@xxxxxxxxxxxxxxxxxxxx
> > --
> Ben Bromhead
> CTO | Instaclustr <https://www.instaclustr.com/>
> +1 650 284 9692
> Reliability at Scale
> Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer