Re: Bootstrap streaming issues

Did anyone run into similar issues?

On Thu, Sep 6, 2018 at 10:27 AM Jai Bheemsen Rao Dhanwada <jaibheemsen@xxxxxxxxx> wrote:
Here is the stack trace from the failure. It looks like it is trying to gather all the column family metrics and going OOM. Is this just for the JMX metrics?


ERROR [MessagingService-Incoming-/] 2018-09-06 15:43:19,280 CassandraDaemon.java:231 - Exception in thread Thread[MessagingService-Incoming-/x.x.x.x,5,main]
java.lang.OutOfMemoryError: Java heap space
        at java.io.DataInputStream.<init>(DataInputStream.java:58) ~[na:1.8.0_151]
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:139) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:88) ~[apache-cassandra-2.1.16.jar:2.1.16]
ERROR [InternalResponseStage:1] 2018-09-06 15:43:19,281 CassandraDaemon.java:231 - Exception in thread Thread[InternalResponseStage:1,5,main]
java.lang.OutOfMemoryError: Java heap space
        at org.apache.cassandra.metrics.ColumnFamilyMetrics$AllColumnFamilyMetricNameFactory.createMetricName(ColumnFamilyMetrics.java:784) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.metrics.ColumnFamilyMetrics.createColumnFamilyHistogram(ColumnFamilyMetrics.java:716) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.metrics.ColumnFamilyMetrics.<init>(ColumnFamilyMetrics.java:597) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:361) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:527) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:498) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.db.Keyspace.initCf(Keyspace.java:335) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.db.DefsTables.addColumnFamily(DefsTables.java:385) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.db.DefsTables.mergeColumnFamilies(DefsTables.java:293) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.db.DefsTables.mergeSchemaInternal(DefsTables.java:194) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.db.DefsTables.mergeSchema(DefsTables.java:166) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.service.MigrationTask$1.response(MigrationTask.java:75) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:54) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:64) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_151]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_151]
        at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_151]
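
For anyone triaging a similar log, a small script can tally which threads hit the heap error. This is only a sketch: it assumes the log keeps the two-line shape shown above (an ERROR header immediately followed by the exception line), and the name count_oom_by_thread is made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Header Cassandra logs for an uncaught exception, for example:
# ERROR [InternalResponseStage:1] 2018-09-06 15:43:19,281 CassandraDaemon.java:231 - Exception in thread ...
HEADER = re.compile(r"^ERROR \[(?P<thread>[^\]]+)\] .* - Exception in thread ")


def count_oom_by_thread(lines: Iterable[str]) -> Counter:
    """Count OutOfMemoryError entries per logging thread.

    Assumes the exception line directly follows its ERROR header.
    """
    counts: Counter = Counter()
    pending_thread = None  # thread named by the header on the previous line
    for line in lines:
        header = HEADER.match(line)
        if header:
            pending_thread = header.group("thread")
            continue
        if pending_thread is not None and line.startswith("java.lang.OutOfMemoryError"):
            counts[pending_thread] += 1
        pending_thread = None
    return counts
```

It accepts any iterable of lines, so an open file handle on system.log works as input.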

On Thu, Aug 30, 2018 at 12:51 PM Jai Bheemsen Rao Dhanwada <jaibheemsen@xxxxxxxxx> wrote:
thank you

On Thu, Aug 30, 2018 at 11:58 AM Jeff Jirsa <jjirsa@xxxxxxxxx> wrote:
This is the closest JIRA that comes to mind (from memory, I didn't search, there may be others): https://issues.apache.org/jira/browse/CASSANDRA-8150

The best blog that's all in one place on tuning GC in Cassandra is actually Amy's 2.1 tuning guide: https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html - it's somewhat out of date as it's for 2.1, but since that's what you're running, that works out in your favor.

On Thu, Aug 30, 2018 at 10:53 AM Jai Bheemsen Rao Dhanwada <jaibheemsen@xxxxxxxxx> wrote:
Hi Jeff,

Is there any JIRA that discusses whether increasing the heap will help?
Also, are there any alternatives to increasing the heap size? The last time I tried increasing the heap, the longer GC pauses did more damage in terms of latency.

On Wed, Aug 29, 2018 at 11:07 PM Jai Bheemsen Rao Dhanwada <jaibheemsen@xxxxxxxxx> wrote:
okay, thank you

On Wed, Aug 29, 2018 at 11:04 PM Jeff Jirsa <jjirsa@xxxxxxxxx> wrote:
You’re seeing an OOM, not a socket error / timeout. 

Jeff Jirsa

On Aug 29, 2018, at 10:56 PM, Jai Bheemsen Rao Dhanwada <jaibheemsen@xxxxxxxxx> wrote:


Any idea if this is somehow related to https://issues.apache.org/jira/browse/CASSANDRA-11840?
Does increasing streaming_socket_timeout_in_ms to a higher value help?

On Wed, Aug 29, 2018 at 10:52 PM Jai Bheemsen Rao Dhanwada <jaibheemsen@xxxxxxxxx> wrote:
I have 72 nodes in the cluster, across 8 datacenters. The moment I try to increase the node count above 84 or so, the issue starts.

I am still using the CMS collector, and I assume it will do more harm than good if I increase the heap size beyond the recommended 8G.

On Wed, Aug 29, 2018 at 6:53 PM Jeff Jirsa <jjirsa@xxxxxxxxx> wrote:
Given the size of your schema, you’re probably getting flooded with a bunch of huge schema mutations as it hops into gossip and tries to pull the schema from every host it sees. You say 8 DCs but you don’t say how many nodes - I’m guessing it’s a lot?
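
To get a rough feel for why node count matters here, the worst case can be approximated as peers times tables times bytes per table. The sketch below is only back-of-the-envelope arithmetic: the 72 peers and roughly 3000 tables are the figures reported in this thread, while bytes_per_table is a pure placeholder and not a measured value, and the function name is hypothetical. The on-heap cost after deserialization is likely higher than the wire size.

```python
def estimate_schema_pull_bytes(peers: int, tables: int, bytes_per_table: int) -> int:
    """Worst-case serialized schema bytes if every peer answers at once.

    Each responding peer sends its full schema, so the total scales with
    peers * tables. bytes_per_table is an assumed average, not a measurement.
    """
    return peers * tables * bytes_per_table


if __name__ == "__main__":
    # 72 peers and ~3000 tables come from the thread; 2 KiB per table is a guess.
    total = estimate_schema_pull_bytes(peers=72, tables=3000, bytes_per_table=2048)
    print(f"{total / 2**30:.2f} GiB of serialized schema in the worst case")
```

Because the estimate is linear in the number of peers, letting only a handful of peers answer cuts it proportionally.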

This is something that’s incrementally better in 3.0, but a real proper fix has been talked about a few times, for example https://issues.apache.org/jira/browse/CASSANDRA-11748 and https://issues.apache.org/jira/browse/CASSANDRA-13569.

In the short term, you may be able to work around this by increasing your heap size. If that doesn’t work, there’s an ugly, ugly hack that’ll work on 2.1: limit the number of schema blobs the new node can get at a time. In this case, that means firewalling off all but a few nodes in your cluster for 10-30 seconds, making sure the new node gets the schema (watch the logs or file system for the tables to be created), then removing the firewall so it can start the bootstrap process. It needs the schema to set up the streaming plan, and it needs all the hosts up in gossip to stream successfully, so this is an ugly hack to give you time to get the schema and then heal the cluster so it can bootstrap.
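
A minimal sketch of that firewall hack, under several assumptions: inter-node traffic uses the default storage_port of 7000, dropping inbound packets from a peer on the joining node is enough to keep that peer's schema responses out (your setup may need OUTPUT rules as well), and table directories sit under <data_dir>/<keyspace>/<table>. The helper names iptables_rules and count_table_dirs are hypothetical. The code only builds the iptables command lines, it does not run them.

```python
from pathlib import Path
from typing import Iterable, List

STORAGE_PORT = 7000  # assumed default storage_port; change it if yours differs


def iptables_rules(action: str, peers: Iterable[str], allowed: Iterable[str],
                   port: int = STORAGE_PORT) -> List[List[str]]:
    """Build iptables commands that add ("-A") or delete ("-D") DROP rules
    for inbound inter-node traffic from every peer not listed in `allowed`."""
    if action not in ("-A", "-D"):
        raise ValueError("action must be '-A' (add) or '-D' (delete)")
    keep = set(allowed)
    return [
        ["iptables", action, "INPUT", "-p", "tcp", "-s", ip,
         "--dport", str(port), "-j", "DROP"]
        for ip in peers if ip not in keep
    ]


def count_table_dirs(data_dir: str) -> int:
    """Count <keyspace>/<table> directories, a rough signal that the schema landed."""
    root = Path(data_dir)
    return sum(
        1
        for keyspace in root.iterdir() if keyspace.is_dir()
        for table in keyspace.iterdir() if table.is_dir()
    )


if __name__ == "__main__":
    # Placeholder addresses, not taken from the thread.
    peers = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
    for cmd in iptables_rules("-A", peers, allowed=["10.0.0.1"]):
        print(" ".join(cmd))
```

The intended order would be: apply the "-A" rules on the joining node, start Cassandra, wait until count_table_dirs reports roughly the expected number of tables, then apply the matching "-D" rules so gossip can see every host before streaming starts.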

Yea that’s awful. Hopefully either of the two above JIRAs lands to make this less awful. 

Jeff Jirsa

On Aug 29, 2018, at 6:29 PM, Jai Bheemsen Rao Dhanwada <jaibheemsen@xxxxxxxxx> wrote:

It fails before bootstrap

Streaming throughput on the nodes is set to 400 Mbps.

On Wednesday, August 29, 2018, Jeff Jirsa <jjirsa@xxxxxxxxx> wrote:
Is the bootstrap plan succeeding (does streaming start or does it crash before it logs messages about streaming starting)?

Have you capped the stream throughput on the existing hosts? 

Jeff Jirsa

On Aug 29, 2018, at 5:02 PM, Jai Bheemsen Rao Dhanwada <jaibheemsen@xxxxxxxxx> wrote:

Hello All,

We are seeing an issue when we add more nodes to the cluster: the new node is not able to stream the entire metadata and fails to bootstrap. Eventually the process dies with an OOM (java.lang.OutOfMemoryError: Java heap space).

But if I remove a few nodes from the cluster, we don't see this issue.

Cassandra Version: 2.1.16
# of KS and CF : 100, 3000 (approx)
# of DC: 8
# of Vnodes per node: 256

Not sure what is causing this behavior. Has anyone come across this scenario?
Thanks in advance.