Re: CASSANDRA-13241 lower default chunk_length_in_kb
This is something that's bugged me for ages. Tbh, the performance gain for
most use cases far outweighs the increase in memory usage, and I would even
be in favor of changing the default now and optimizing the storage cost later
(if it's found to be worth it).
For some anecdotal evidence:
4kb is usually what we end up setting it to. 16kb feels more reasonable given
the memory impact, but what would be the point if, in practice, most folks
set it to 4kb anyway?
Note that the right chunk_length will largely depend on your read sizes, but 4k
is the floor for most physical devices in terms of block size.
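To illustrate why read size matters, here is a rough sketch of the read amplification for a small read. It assumes the whole chunk has to be read and decompressed to serve any read inside it, and that the read starts on a chunk boundary (a read that straddles two chunks would cost more). The function names are made up for illustration, not anything in Cassandra.

```python
import math


def bytes_decompressed(read_bytes: int, chunk_kb: int) -> int:
    """Bytes decompressed to serve one read, assuming it starts on a chunk boundary."""
    chunk_bytes = chunk_kb * 1024
    # Every chunk the read touches is decompressed in full.
    chunks_touched = max(1, math.ceil(read_bytes / chunk_bytes))
    return chunks_touched * chunk_bytes


def read_amplification(read_bytes: int, chunk_kb: int) -> float:
    """Ratio of bytes decompressed to bytes actually requested."""
    return bytes_decompressed(read_bytes, chunk_kb) / read_bytes


if __name__ == "__main__":
    # Hypothetical 1 KiB read against the three chunk sizes under discussion.
    for chunk_kb in (64, 16, 4):
        print(f"{chunk_kb:>2}k chunks: {read_amplification(1024, chunk_kb):.0f}x")
```

With reads that small, going from 64k to 16k cuts the work per read by 4x, and 4k cuts it by 16x, which is why small-read workloads tend to end up at 4k.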
+1 for making this change in 4.0, given the small size of the change and the
large improvement to new users' experience (as long as we are explicit in the
documentation about memory consumption).
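For the documentation side, the napkin math from Ariel's message quoted below can be reproduced with a few lines. This is only a sketch: it assumes 8 bytes per chunk offset and binary units (1TB = 1024^4 bytes), as in the quoted figures, and the function name is hypothetical.

```python
TIB = 1024 ** 4   # 1TB of uncompressed data, in bytes (binary units)
OFFSET_BYTES = 8  # assumed cost of one chunk offset, per the quoted napkin math


def chunk_offset_memory_bytes(uncompressed_bytes: int, chunk_kb: int,
                              bytes_per_offset: int = OFFSET_BYTES) -> int:
    """Memory needed to hold one offset per compression chunk."""
    chunk_bytes = chunk_kb * 1024
    # Ceiling division: a partial trailing chunk still needs an offset.
    chunks = -(-uncompressed_bytes // chunk_bytes)
    return chunks * bytes_per_offset


if __name__ == "__main__":
    for chunk_kb in (64, 16, 4):
        mib = chunk_offset_memory_bytes(TIB, chunk_kb) / 1024 ** 2
        print(f"{chunk_kb:>2}k chunks: {mib:.0f}MB per TB (pre-compression)")
```

That gives 128MB, 512MB and 2048MB per TB for 64k, 16k and 4k respectively, matching the numbers in the ticket.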
On Thu, Oct 11, 2018 at 7:11 PM Ariel Weisberg <ariel@xxxxxxxxxxx> wrote:
> This is regarding https://issues.apache.org/jira/browse/CASSANDRA-13241
> This ticket has languished for a while. IMO it's too late in 4.0 to
> implement a more memory efficient representation for compressed chunk
> offsets. However I don't think we should put out another release with the
> current 64k default as it's pretty unreasonable.
> I propose that we lower the value to 16kb. 4k might never be the correct
> default anyways as there is a cost to compression and 16k will still be a
> large improvement.
> Benedict and Jon Haddad are both +1 on making this change for 4.0. In the
> past there has been some consensus about reducing this value although maybe
> with more memory efficiency.
> The napkin math for what this costs is:
> "If you have 1TB of uncompressed data, with 64k chunks that's 16M chunks
> at 8 bytes each (128MB).
> With 16k chunks, that's 512MB.
> With 4k chunks, it's 2G.
> Per terabyte of data (pre-compression)."
> By way of comparison, memory mapping the files has a similar cost of 8 bytes
> per 4k page. Multiple mappings make this more expensive. With a
> default of 16kb this would be 4x less expensive than memory mapping a file.
> I only mention this to give a sense of the costs we are already paying. I
> am not saying they are directly related.
> I'll wait a week for discussion and if there is consensus make the change.
CTO | Instaclustr <https://www.instaclustr.com/>