A few times a day I see spikes in request latency on my Cassandra clients. Usually the 99thPercentile is below 100ms, but during these spikes it grows above 1 second.
The type of request doesn't matter: different services are affected, and I found three absolutely identical requests (to the same partition key, issued within a three-second interval) that completed in 1ms, 30ms and 1100ms. I also found no correlation between the spikes and load patterns. G1 GC does not report any significant (>50ms) pauses.
A few suspicious things:
- nodetool shows that there are dropped READs
- there are DigestMismatchExceptions in the logs
- in the tracing events, "Executing single-partition query on *" sometimes follows "READ message received from /*.*.*.*" in less than 100 microseconds, and sometimes only after hundreds of milliseconds
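The gap described in the last item can be measured per replica with a small script along these lines. This is only a minimal sketch: the `TraceEvent` tuple, the `queue_delays_us` helper, the addresses, the table name and the sample values are all made up for illustration, and it assumes the elapsed time of each trace event is available in microseconds per source node.

```python
from typing import Iterable, List, NamedTuple, Tuple


class TraceEvent(NamedTuple):
    """Hypothetical, simplified view of one tracing event."""
    source: str             # node that logged the event
    activity: str           # event text as shown in the trace
    source_elapsed_us: int  # microseconds elapsed on that node (assumed unit)


def queue_delays_us(events: Iterable[TraceEvent]) -> List[Tuple[str, int]]:
    """Return (source, delay) pairs: time between a node receiving a READ
    message and that same node starting the single-partition query."""
    received = {}  # source -> elapsed time of the pending "received" event
    delays = []
    for ev in events:
        if ev.activity.startswith("READ message received"):
            received[ev.source] = ev.source_elapsed_us
        elif ev.activity.startswith("Executing single-partition query"):
            start = received.pop(ev.source, None)
            if start is not None:
                delays.append((ev.source, ev.source_elapsed_us - start))
    return delays


# Illustrative data only, not taken from the real cluster.
sample_events = [
    TraceEvent("10.0.0.1", "READ message received from /10.0.0.9", 40),
    TraceEvent("10.0.0.1", "Executing single-partition query on my_table", 95),
    TraceEvent("10.0.0.2", "READ message received from /10.0.0.9", 30),
    TraceEvent("10.0.0.2", "Executing single-partition query on my_table", 412030),
]

for source, delay in queue_delays_us(sample_events):
    print(f"{source}: {delay} us between receive and execute")
```

Collecting these delays per node over many traced requests would show whether the slow cases cluster on particular replicas or are spread evenly across the cluster.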
My cluster runs on six c5.2xlarge Amazon instances, and the data is stored on EBS. The Cassandra version is 3.10.
Any help in explaining this behavior is appreciated. I'm glad to share more details if needed.