
Re: Sporadic high IO bandwidth and Linux OOM killer

On Thu, Dec 6, 2018 at 11:14 AM Riccardo Ferrari <ferrarir@xxxxxxxxx> wrote:

I had a few instances in the past that showed that unresponsive behaviour. Back then I saw with iotop/htop/dstat ... that the system was stuck on a single thread running at full throttle for seconds. According to iotop that was the kswapd0 process. That system was running Ubuntu, specifically "Ubuntu 16.04.4 LTS".


Did you by chance also observe Linux OOM?  How long did the unresponsiveness last in your case?

From there I started to dig into why the kswapd process was busy on a system with no swap, and found that it is also involved in handling mmapped memory. This erratic (allow me to say erratic) behaviour was not showing up when I was on 3.0.6 but started right after upgrading to 3.0.17.
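To check the "no swap, but lots of mmapped data" situation on a node, something like the following Python sketch could help. It only reads the standard Linux /proc/meminfo file; the function names `parse_meminfo` and `summarize` are made up for this illustration, and the choice of fields to report is an assumption about what is useful here, not a diagnosis.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> number.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def summarize(fields):
    """Pick out the values relevant to the kswapd/mmap question."""
    return {
        "swap_configured": fields.get("SwapTotal", 0) > 0,
        "mapped_kb": fields.get("Mapped", 0),
        "available_kb": fields.get("MemAvailable", 0),
    }


if __name__ == "__main__":
    # Only works on Linux, where /proc/meminfo exists.
    with open("/proc/meminfo") as f:
        print(summarize(parse_meminfo(f.read())))
```

Running it periodically while the node is unresponsive would show whether available memory drops while the mapped size stays large.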

By "load" I refer to the load as reported by `nodetool status`. On my systems, when disk_access_mode is auto (read: mmap), it is the sum of the node load plus the JVM heap size. Of course this is just what I noted on my systems; I am not really sure whether that should be the case on yours too.

I've checked and indeed we are using disk_access_mode=auto (well, implicitly, because it's not even part of the config file anymore). The log says: DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap.
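For anyone who wants to check the same thing across several nodes, here is a minimal Python sketch that pulls the access modes out of a log line of the form quoted above. The regular expression is derived only from that single message, so treat it as an assumption about the format; the helper name `parse_access_mode` is hypothetical.

```python
import re

# Pattern built from the one log message quoted in this thread;
# other Cassandra versions may word it differently.
ACCESS_MODE_RE = re.compile(
    r"DiskAccessMode '?(?P<configured>\w+)'? determined to be (?P<data>\w+), "
    r"indexAccessMode is (?P<index>\w+)"
)


def parse_access_mode(line):
    """Return the configured, data and index access modes, or None if absent."""
    match = ACCESS_MODE_RE.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    line = "DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap"
    print(parse_access_mode(line))
    # prints {'configured': 'auto', 'data': 'mmap', 'index': 'mmap'}
```

Feeding each line of a node's startup log through `parse_access_mode` and keeping the first non-None result is enough to confirm which mode a node ended up with.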

I hope someone with more experience than me will add a comment about your settings. Reading the configuration file, writers and compactors should be 2 at minimum. I can confirm that when I tried in the past to change concurrent_compactors to 1, really bad things happened (high system load, high message drop rate, ...)

As I've mentioned, we did not observe any other issues with the current setup: system load is reasonable, no dropped messages, no big number of hints, request latencies are OK, no big number of pending compactions.  Also during repair everything looks fine.

I have the "feeling" that when running on constrained hardware, tuning the underlying kernel is a must. I agree with Jonathan H. that you should think about increasing the instance size; CPU and memory matter a lot.

How did you solve your issue in the end?  You didn't roll back to 3.0.6?  Did you tune kernel parameters?  Which ones?

Thank you!