[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Sporadic high IO bandwidth and Linux OOM killer


We are running the following setup on AWS EC2:

Host system (AWS AMI): Ubuntu 14.04.4 LTS,
Linux <hostname> 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Cassandra process runs inside a docker container.
Docker image is based on Ubuntu 18.04.1 LTS.

Apache Cassandra 3.0.17, installed from .deb packages.

$ java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1ubuntu0.18.04.1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

We have a total of 36 nodes.  All are r4.large instances, they have 2 vCPUs and ~15 GB RAM.
On each instance we have:
- 2TB gp2 SSD EBS volume for data and commit log,
- 8GB gp2 SSD EBS for system (root volume).

Non-default settings in cassandra.yaml:
num_tokens: 16
memtable_flush_writers: 1
concurrent_compactors: 1
snitch: Ec2Snitch

JVM heap/stack size options: -Xms8G -Xmx8G -Xmn800M -Xss256k
Garbage collection: CMS with default settings.

We repair once a week using Cassandra Reaper: parallel, intensity 1, 64 segments per node.  The issue also happens outside of repair time.

The symptoms:

Sporadically a node becomes unavailable for a period of time between few minutes and a few hours.  According to our analysis and as pointed out by AWS support team, the unavailability is caused by exceptionally high read bandwidth on the *root* EBS volume.  I repeat, on the root volume, *not* on the data/commitlog volume.  Basically, the amount if IO exceeds instance's bandwidth (~52MB/s) and all other network communication becomes impossible.

The root volume contains operating system, docker container with OpenJDK and Cassandra binaries, and the logs.

Most of the time, whenever this happens it is too late to SSH into the instance to troubleshoot: it becomes completely unavailable within very short period of time.
Rebooting the affected instance helps to bring it back to life.

Starting from the middle of last week we have seen this problem repeatedly 1-3 times a day, affecting different instances in a seemingly random fashion.  Most of the time it affects only one instance, but we've had one incident when 9 nodes (3 from each of the 3 Availability Zones) were down at the same time due to this exact issue.

Actually, we've had the same issue previously on the same Cassandra cluster around 3 months ago (beginning to mid of September 2018).  At that time we were running on m4.xlarge instances (these have 4 vCPUs and 16GB RAM).

As a mitigation measure we have migrated away from those to r4.2xlarge.  Then we didn't observe any issues for a few weeks, so we have scaled down two times: to r4.xlarge and then to r4.large.  The last migration was completed before Nov 13th.  No changes to the cluster or application happened since that time.

Now, after some weeks the issue appears again...

When we are not fast enough to react and reboot the affected instance, we can see that ultimately Linux OOM killer kicks in and kills the java process running Cassandra.  After that the instance becomes available almost immediately.  This allows us to rule out other processes running in background as potential offenders.

We routinely observe Memory.HeapMemoryUsage.used between 1GB and 6GB and Memory.NonHeapMemoryUsage.used below 100MB, as reported by JMX (via Jolokia).  At the same time, Committed_AS on each host is constantly around 11-12GB, as reported by atop(1) and prometheus.

We are running atop with sampling interval of 60 seconds.  After the fact we observe that the java process is the one responsible for the most of disk activity during unavailability period.  We also see kswapd0 high on the list from time to time, which always has 0K reads, but non-zero write bandwidth.  There is no swap space defined on these instances, so not really clear why kswapd appears at the top of the list all (measurement error?).

We also attempted to troubleshoot by running jstack, jmap and pmap against Cassandra process in background every few minutes.  The idea was to compare dumps taken before and during unavailability, but that didn't lead to any findings so far.  Ultimately we had to stop doing this, once we've seen that jmap can also become stuck burning CPU cycles.  Now the output of jmap is not that useful, but we fear that jstack might also expose the same behavior.  So we wanted to avoid making the issue worse than it currently is and disabled this debug sampling.

Now to my questions:

1. Is there anything in Cassandra or in the JVM that could explain all of a sudden reading from non-data volume at such a high rate, for prolonged periods of time as described above?

2. Why does JVM heap utilization never reaches the 8GB that we provide to it?

3. Why Committed virtual memory is so much bigger than sum of the heap and off-heap reported by JMX?  To what can this difference be attributed?
I've just visited a node at random and collected "off heap memory used" numbers reported by nodetool cfstats, and still I see only 2.6GB in total, while Commited_AS is ~12GB.  Is there a more direct way to monitor off-heap memory usage by JVM?

4. The only Jira issue related to Linux OOM, we've found is this one: https://issues.apache.org/jira/browse/CASSANDRA-13931  This might be related to our OOM, but still doesn't explain the unexpected IO anomalies.

I would really appreciate any hints / pointers / insights!  The more I learn about this issue, the less I understand it...