Can you show us the metrics-related configuration parameters in flink-conf.yaml?

Please also check the logs for any warnings from the MetricGroup and MetricRegistry classes.

Can you have a look at this JIRA ticket [1] and check whether it is related to the problems you are facing?
We keep track of metrics by using the value of MetricGroup::getMetricIdentifier, which returns the fully qualified metric name. The query that we use to monitor metrics filters for metric IDs that match '%Status.JVM.Memory%'. As long as the new metrics come online via the MetricReporter interface, I think the chart would be continuous; we would just see the old JVM memory metrics cycle into new metrics.
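
To make the filter concrete, here is a minimal sketch of what a '%Status.JVM.Memory%' match does to a set of identifiers. It is plain Java with no Flink dependency. The class name JvmMemoryFilter and the sample identifiers (host-1, tm-old, tm-new) are made up for illustration, and the identifier layout assumes Flink's default task manager scope format of <host>.taskmanager.<tm_id>.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Flink: mimics a SQL LIKE '%Status.JVM.Memory%' filter.
class JvmMemoryFilter {

    // LIKE '%x%' is a substring match anywhere in the identifier.
    static boolean matches(String metricIdentifier) {
        return metricIdentifier.contains("Status.JVM.Memory");
    }

    // Keeps only the identifiers that the filter would select.
    static List<String> select(List<String> identifiers) {
        return identifiers.stream()
                .filter(JvmMemoryFilter::matches)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Assumed identifiers: same host, task manager ID changed after a restart.
        List<String> identifiers = List.of(
                "host-1.taskmanager.tm-old.Status.JVM.Memory.Heap.Used",
                "host-1.taskmanager.tm-new.Status.JVM.Memory.Heap.Used",
                "host-1.taskmanager.tm-new.Status.JVM.CPU.Load");

        // Both memory metrics pass the filter, regardless of the task manager ID.
        select(identifiers).forEach(System.out::println);
    }
}
```

Under these assumptions, a changed task manager ID alone would not drop a metric out of the query, so a metric that disappears was most likely never reported after the restart.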

How are your metrics dimensionalized/named? Task managers often have UIDs generated for them. The task id dimension will change on restart. If you name your metric based on this 'task_id' there would be a discontinuity with the old metric.
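
As a rough illustration of the discontinuity: if the series key includes the task manager ID, a restart produces a new key. One possible workaround, sketched below, is to derive a stable key by dropping the ID component. This assumes identifiers follow the default <host>.taskmanager.<tm_id>.<metric> layout; the class StableMetricKey and the sample IDs are hypothetical, and a custom scope format would need different parsing.

```java
// Hypothetical helper, not part of Flink: strips the task manager ID from an identifier
// so that the same metric keeps one key across restarts.
class StableMetricKey {

    private static final String MARKER = ".taskmanager.";

    static String stableKey(String identifier) {
        int markerStart = identifier.indexOf(MARKER);
        if (markerStart < 0) {
            return identifier; // not a task manager metric; leave unchanged
        }
        int idStart = markerStart + MARKER.length();
        int idEnd = identifier.indexOf('.', idStart);
        if (idEnd < 0) {
            return identifier; // nothing follows the ID; leave unchanged
        }
        // Keep "<host>.taskmanager" and everything after the ID, including its leading dot.
        return identifier.substring(0, idStart - 1) + identifier.substring(idEnd);
    }

    public static void main(String[] args) {
        String before = "host-1.taskmanager.tm-old.Status.JVM.Memory.Heap.Used";
        String after = "host-1.taskmanager.tm-new.Status.JVM.Memory.Heap.Used";

        // The raw identifiers differ, so a chart keyed on them shows two separate series.
        System.out.println(before.equals(after));
        // The stable keys are equal, so the series would continue across the restart.
        System.out.println(stableKey(before).equals(stableKey(after)));
    }
}
```

Note that dropping the ID merges all task managers on the same host into one key, which may or may not be what you want.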

We are seeing our task manager JVM metrics disappear over time. This last time we correlated it to our job crashing and restarting. I wasn't able to grab the failing exception to share. Any thoughts?

We track metrics through the MetricReporter interface. As far as I can tell, this more or less only affects the JVM metrics; most or all of the other metrics continue reporting fine when the job is automatically restarted.

