osdir.com



Re: Metrics matrix: migrate 2.1.x metrics to 2.2.x+


metadata.csv: that helps a lot, thank you!

On Fri, Oct 5, 2018 at 5:42 AM Alain RODRIGUEZ <arodrime@xxxxxxxxx> wrote:
I sympathize with most of the troubles you faced; I have run into most of them too. Again, Datadog support can probably help you with most of those. You should really consider sharing this feedback with them.

there is re-namespacing of the metric names in lots of cases, and these don't appear to be centrally documented, but maybe I haven't found the magic page.

I don't know if that would be the 'magic' page, but that's something: https://github.com/DataDog/integrations-core/blob/master/cassandra/metadata.csv

There are sooooo many good stats.

Yes, and it's still improving. I love this about Cassandra. It's our job to pick the relevant ones for each situation. I would not like Cassandra to reduce the number of metrics exposed; we need to learn to handle them properly. This is also the reason we designed four dashboards out of the box: the goal was to have everything we need for distinct scenarios:
- Overview - global health-check / anomaly detection
- Read Path - troubleshooting / optimizing read ops
- Write Path - troubleshooting / optimizing write ops
- SSTable Management - troubleshooting / optimizing compaction, flushes, ... anything related to SSTables.

These replace the single overview dashboard that was present before. We are perfectly aware that they are far from perfect, but aiming at perfect would only have meant never releasing anything. Anyone interested can now build missing dashboards or improve existing ones for themselves and/or suggest improvements to Datadog :). I hope I'll do some more of this work at some point in the future.

Good luck,
C*heers,
-----------------------
Alain Rodriguez - @arodream - alain@xxxxxxxxxxxxxxxxx
France / Spain

The Last Pickle - Apache Cassandra Consulting

On Thu, Oct 4, 2018 at 9:21 PM Carl Mueller <carl.mueller@xxxxxxxxxxxxxxx.invalid> wrote:
For 2.1.x we had a custom reporter that delivered metrics to Datadog's endpoint via HTTPS, bypassing the agent-imposed limit of 350 metrics. But integrating that required targeting the other shared libs in the Cassandra path, so the build is a bit of a pain when we update major versions.

We are migrating our 2.1.x-specific dashboards. We will use agent-delivered metrics for the non-table metrics, and adapt the custom library to deliver the table-based ones at a slower rate than the "core" ones.

Datadog is also super annoying because there doesn't appear to be anything that reports what metrics the agent is sending (the metric count can indicate if a newly configured metric increased the count and is being reported, but it's still... a guess), and there is re-namespacing of the metric names in lots of cases, and these don't appear to be centrally documented, but maybe I haven't found the magic page.

There are sooooo many good stats. We might also implement some facility to dynamically turn on the delivery of detailed metrics on the nodes. 

On Tue, Oct 2, 2018 at 5:21 AM Alain RODRIGUEZ <arodrime@xxxxxxxxx> wrote:
Hello Carl,

I guess we can use bean_regex to do specific targeted metrics for the important tables anyway.

Yes, this would work, but 350 is very limited for Cassandra dashboards. We have a LOT of metrics available. 
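As a rough illustration of the bean_regex idea, here is a minimal Python sketch that builds one pattern covering a few metrics on a few important tables. The helper name table_bean_regex, the keyspace, table and metric names, and the MBean name layout (type=Table,keyspace=...,scope=...,name=...) are assumptions for illustration only; check the bean names your nodes actually expose, and how the agent applies the pattern, before relying on it.

```python
import re
from typing import Sequence


def table_bean_regex(keyspace: str, tables: Sequence[str], metrics: Sequence[str]) -> str:
    """Build one regex matching the given metrics for the given tables.

    Assumes per-table MBeans are named like:
    org.apache.cassandra.metrics:type=Table,keyspace=<ks>,scope=<table>,name=<metric>
    """
    table_alt = "|".join(re.escape(t) for t in tables)
    metric_alt = "|".join(re.escape(m) for m in metrics)
    return (
        r"org\.apache\.cassandra\.metrics:type=Table,"
        rf"keyspace={re.escape(keyspace)},scope=({table_alt}),name=({metric_alt})"
    )


# Hypothetical keyspace, tables and metrics.
pattern = table_bean_regex("ks1", ["users", "orders"], ["ReadLatency", "WriteLatency"])
print(pattern)
```

Keeping the table list short is the point: every table added multiplies the number of collected metrics by the number of metric names in the pattern.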

Datadog 350 metric limit is a PITA for tables once you get over 10 tables

I noticed this while I was working on providing default dashboards for the Cassandra-Datadog integration. I was told by the Datadog team that it would not be an issue for users and that I should not worry about it. As you pointed out, per-table metrics quickly increase the total number of metrics we need to collect.
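To make the arithmetic concrete, here is a small Python sketch of the metric budget. Only the 350 limit comes from this thread; the per-table and non-table counts are hypothetical placeholders, so substitute the numbers from your own configuration.

```python
AGENT_LIMIT = 350  # metric limit quoted in this thread


def metrics_needed(tables: int, per_table: int, non_table: int) -> int:
    """Total metrics collected: per-table metrics for every table plus the rest."""
    return tables * per_table + non_table


def max_tables(limit: int, per_table: int, non_table: int) -> int:
    """Largest number of tables that still fits under the limit."""
    return max(0, (limit - non_table) // per_table)


# Hypothetical figures: 30 metrics per table, 50 node-level metrics.
per_table, non_table = 30, 50
print(max_tables(AGENT_LIMIT, per_table, non_table))
print(metrics_needed(11, per_table, non_table) > AGENT_LIMIT)
```

With these placeholder figures the budget is exhausted at ten tables, and an eleventh table pushes the total past the limit.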

I believe you can set the following option: "max_returned_metrics: 1000". It raises the limit on the number of collected metrics, which helps when metrics are missing. Be aware of the extra CPU utilization this might imply (greatly improved in dd-agent version 6+, I believe, thanks to the Datadog teams for that, which makes this fully usable for Cassandra). Off the top of my head, this option should go in the Datadog agent's cassandra.yaml file for the Cassandra integration (not Cassandra's own cassandra.yaml).

Also, do not hesitate to reach out to Datadog directly with this kind of question. I have always been very happy with their support so far, and I am sure they would guide you through this as well, probably better than we can :). I imagine it also gives them feedback on what people are struggling with.

I am interested to know if you still have issues getting more metrics (option above not working / CPU under too much load) as this would make the dashboards we built mostly unusable for clusters with more tables. We might then need to review the design.

As a side note, I believe metrics are handled the same way across versions: they have the same name/label for C*2.1, 2.2 and 3+ on Datadog. There is an abstraction layer that removes this complexity (if I remember well; we built those dashboards a while ago).

C*heers
-----------------------
Alain Rodriguez - @arodream - alain@xxxxxxxxxxxxxxxxx
France / Spain

The Last Pickle - Apache Cassandra Consulting

On Mon, Oct 1, 2018 at 7:38 PM Carl Mueller <carl.mueller@xxxxxxxxxxxxxxx.invalid> wrote:
That's great too, thank you.

Datadog 350 metric limit is a PITA for tables once you get over 10 tables, but I guess we can use bean_regex to do specific targeted metrics for the important tables anyway.

On Mon, Oct 1, 2018 at 4:21 AM Alain RODRIGUEZ <arodrime@xxxxxxxxx> wrote:
Hello Carl,

Here is a message I sent to my team a few months ago. I hope this will be helpful to you and more people around :). It might not be exhaustive and we were moving from C*2.1 to C*3+ in this case, thus skipping C*2.2, but C*2.2 is similar to C*3.0 if I remember correctly in terms of metrics. Here it is for what it's worth:

Quite a few things changed between the metric reporters in C*2.1 and C*3.0.
- ColumnFamily --> Table
- XXpercentile --> pXX
- 1MinuteRate --> m1_rate
- The metric name now comes before the keyspace and table names, plus some other changes of this kind.
- ^ The node indexes used in aggregations / aliases changed because of this (using graphite, for example) ^
- '.value' is no longer appended to the metric name for gauges; nothing is appended instead.

For example (graphite):

From
aliasByNode(averageSeriesWithWildcards(cassandra.$env.$dc.$host.org.apache.cassandra.metrics.ColumnFamily.$ks.$table.ReadLatency.95percentile, 2, 3), 1, 7, 8, 9)

to
aliasByNode(averageSeriesWithWildcards(cassandra.$env.$dc.$host.org.apache.cassandra.metrics.Table.ReadLatency.$ks.$table.p95, 2, 3), 1, 8, 9, 10)
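The renames listed above can be sketched as a small Python helper for rewriting metric paths in existing dashboards. This is only a sketch under assumptions: the function name convert_21_to_30 is hypothetical, it covers only the changes listed in this message, it assumes per-table paths shaped like the graphite example, and it generalizes 1MinuteRate to any NMinuteRate, which should be checked against real metric names.

```python
import re

# Hypothetical helper covering only the renames listed in this message.
_PREFIX = "org.apache.cassandra.metrics."


def convert_21_to_30(path: str) -> str:
    """Rewrite a C*2.1-style graphite metric path into the C*3.0 layout."""
    head, sep, tail = path.partition(_PREFIX)
    if not sep:
        return path  # not a Cassandra metrics path; leave it untouched
    parts = tail.split(".")
    if parts[0] == "ColumnFamily":
        if len(parts) >= 4:
            # ColumnFamily.<ks>.<table>.<metric>... -> Table.<metric>.<ks>.<table>...
            ks, table, metric = parts[1:4]
            parts = ["Table", metric, ks, table] + parts[4:]
        else:
            parts[0] = "Table"
    if parts[-1] == "value":
        parts = parts[:-1]  # gauges no longer carry a trailing ".value"
    else:
        last = re.sub(r"^(\d+)percentile$", r"p\1", parts[-1])
        # Assumption: 5MinuteRate / 15MinuteRate follow the same pattern as 1MinuteRate.
        parts[-1] = re.sub(r"^(\d+)MinuteRate$", r"m\1_rate", last)
    return head + sep + ".".join(parts)


old = "cassandra.prod.dc1.node1.org.apache.cassandra.metrics.ColumnFamily.ks1.users.ReadLatency.95percentile"
print(convert_21_to_30(old))
```

Because the metric name moves ahead of the keyspace and table, any aliasByNode indexes in the surrounding query still need to be adjusted by hand, as in the example above.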

C*heers,
-----------------------
Alain Rodriguez - @arodream - alain@xxxxxxxxxxxxxxxxx
France / Spain

The Last Pickle - Apache Cassandra Consulting

On Fri, Sep 28, 2018 at 8:38 PM Carl Mueller <carl.mueller@xxxxxxxxxxxxxxx.invalid> wrote:
VERY NICE! Thank you very much

On Fri, Sep 28, 2018 at 1:32 PM Lyuben Todorov <lyuben.todorov@xxxxxxxxxxxxxxx> wrote:
Nothing as fancy as a matrix, but a list of what jmxterm can see.
Link to the online diff here: https://www.diffchecker.com/G9FE9swS

/lyubent

On Fri, 28 Sep 2018 at 19:04, Carl Mueller <carl.mueller@xxxxxxxxxxxxxxx.invalid> wrote:
It's my understanding that metrics got heavily re-namespaced in JMX for 2.2 from 2.1

Did anyone ever make a migration matrix/guide for conversion of old metrics to new metrics?