I was doing 500K inserts + 100K counter updates per second on my 12-node cluster (20 cores / 128 GB RAM / 4 * 600 HDD 10K per node) using batch statements,
with no problems.
I saw a lot of warnings saying that most of the batches did not target a single node, so they should not be batched. At the same time, the input load of my application
increased by 50%, so I switched to non-batch async inserts and increased the number of client threads to handle the extra 50% load.
The system ran for 2 days with no problems at 750K inserts + 150K counter updates per second, but then suddenly a lot of insert timeouts started appearing in the log files.
Reducing the input load to the previous level, or even lower, did not help.
When I restart my client (some hours after it started logging timeouts and errors), it works with no problems for about 20 minutes and then starts logging timeout errors again.
CPU load on the cluster nodes is below 25%.
How can I solve this problem? I'm saving all of Cassandra's JMX metrics in a monitoring system. What should I check?
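
For readers unfamiliar with the pattern, the non-batch async write path described above can be sketched roughly as follows. This is only an illustration in Python, not the actual client code: `write_async`, `execute_async`, and `max_in_flight` are hypothetical names, and a thread pool stands in for a driver's asynchronous execute call. The semaphore caps the number of pending requests, which is an assumption about how such a client would usually be throttled.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def write_async(execute_async, rows, max_in_flight=128):
    """Issue one async write per row with at most max_in_flight pending.

    execute_async is a hypothetical stand-in: any callable that takes a row
    and returns a concurrent.futures.Future.
    """
    slots = threading.BoundedSemaphore(max_in_flight)
    lock = threading.Lock()
    stats = {"ok": 0, "failed": 0}

    def on_done(future):
        # Record the outcome first, then free the slot for the next request.
        with lock:
            if future.exception() is None:
                stats["ok"] += 1
            else:
                stats["failed"] += 1
        slots.release()

    for row in rows:
        slots.acquire()  # blocks while max_in_flight requests are pending
        execute_async(row).add_done_callback(on_done)

    # Drain: once every slot can be taken, all callbacks have completed.
    for _ in range(max_in_flight):
        slots.acquire()
    return stats


if __name__ == "__main__":
    # The pool below only simulates a backend; it is not a Cassandra session.
    with ThreadPoolExecutor(max_workers=4) as pool:
        result = write_async(lambda row: pool.submit(str, row), range(1000), 32)
        print(result)
```

Without a cap like this, adding client threads simply raises the number of requests queued on the cluster at once, which is one thing worth ruling out when timeouts appear.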