Re: Measuring Release Quality
I've spent a good bit of time thinking about the above, and have bounced off
both different ways to measure quality and progress and attempts to
influence community behavior on this topic. My advice: start small and
simple (KISS, YAGNI, all that). Get metrics for pass/fail on
utest/dtest/flakiness over time, perhaps also aggregate bug count by
component over time. After spending a predetermined time doing that (a
couple months?) as an experiment, we retrospect as a project and see if
these efforts are adding value commensurate with the time investment
required to perform the measurement and analysis.
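
To make the "start small" suggestion concrete, here is a rough sketch of the pass/fail and flakiness bookkeeping. It assumes CI results can be exported as (commit, test name, passed) records with reruns on the same commit; that record shape and the function name are hypothetical, not something our CI produces today. A test is counted as flaky only if it both passed and failed on the same commit.

```python
from collections import defaultdict

def summarize_runs(runs):
    """Summarize CI results.

    runs: iterable of (commit, test_name, passed) tuples (hypothetical shape).
    Returns (pass_rate_by_commit, flaky_tests).
    """
    totals = defaultdict(lambda: [0, 0])   # commit -> [passed, total]
    outcomes = defaultdict(set)            # (commit, test) -> outcomes seen
    for commit, test, passed in runs:
        totals[commit][0] += int(passed)
        totals[commit][1] += 1
        outcomes[(commit, test)].add(bool(passed))
    pass_rate = {c: p / t for c, (p, t) in totals.items()}
    # Flaky: mixed outcomes for the same test on the same commit.
    flaky = sorted({test for (_, test), seen in outcomes.items() if len(seen) == 2})
    return pass_rate, flaky
```

Recording the two return values per build would give the time series; anything fancier can wait until the retrospective says it's worth it.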
There are a lot of really good ideas in that linked wiki article / this email
thread. The biggest challenge, and risk of failure, is in translating good
ideas into action and selling project participants on the value of changing
their behavior. The latter is where we've fallen short over the years;
building consensus (especially regarding process /shudder) is Very Hard.
Also - thanks for spearheading this discussion, Scott. It's one we come back
to with some regularity so there's real pain and opportunity here for the
On Wed, Sep 19, 2018 at 9:32 PM Scott Andreas <scott@xxxxxxxxxxxxxx> wrote:
> Hi everyone,
> Now that many teams have begun testing and validating Apache Cassandra
> 4.0, it’s useful to think about what “progress” looks like. While metrics
> alone may not tell us what “done” means, they do help us answer the
> question, “are we getting better or worse — and how quickly”?
> A friend described to me a few attributes of metrics he considered useful,
> suggesting that good metrics are actionable, visible, predictive, and
> consequent:
> – Actionable: We know what to do based on them – where to invest, what to
> fix, what’s fine, etc.
> – Visible: Everyone who has a stake in a metric has full visibility into
> it and participates in its definition.
> – Predictive: Good metrics enable forecasting of outcomes – e.g.,
> “consistent performance test results against build abc predict an x%
> reduction in 99%ile read latency for this workload in prod”.
> – Consequent: We take actions based on them (e.g., not shipping if tests
> are failing).
> Here are some notes in Confluence toward metrics that may be useful to
> track beginning in this phase of the development + release cycle. I’m
> interested in your thoughts on these. They’re also copied inline for easier
> reading in your mail client.
> – Scott
> Measuring Release Quality:
> [ This document is a draft + sketch of ideas. It is located in the
> "discussion" section of this wiki to indicate that it is an active draft –
> not a document that has been voted on, achieved consensus, or in any way
> official. ]
> This document outlines a series of metrics that may be useful toward
> measuring release quality, and quantifying progress during the testing /
> validation phase of the Apache Cassandra 4.0 release cycle.
> The goal of this document is to think through what we should consider
> measuring to quantify our progress testing and validating Apache Cassandra
> 4.0. This document explicitly does not discuss release criteria – though
> metrics may be a useful input to a discussion on that topic.
> Metric: Build / Test Health (produced via CI, recorded in Confluence):
> Bread-and-butter metrics intended to capture baseline build health and
> flakiness in the test suite, presented as a time series to understand
> how they’ve changed from build to build and release to release:
> – Pass / fail metrics for unit tests
> – Pass / fail metrics for dtests
> – Flakiness stats for unit and dtests
> Metric: “Found Bug” Count by Methodology (sourced via JQL, reported in
> Confluence):
> These are intended to help us understand the efficacy of each methodology
> being applied. We might consider annotating bugs found in JIRA with the
> methodology that produced them. This could be consumed as input in a JQL
> query and reported on the Confluence dev wiki.
> As we reach a pareto-optimal level of investment in a methodology, we’d
> expect to see its found-bug rate taper. As we achieve higher quality across
> the board, we’d expect to see a tapering in found-bug counts across all
> methodologies. In the event that one or two approaches are outliers, this
> could indicate the utility of doubling down on a particular form of testing.
> We might consider reporting “Found By” counts for methodologies such as:
> – Property-based / fuzz testing
> – Replay testing
> – Upgrade / Diff testing
> – Performance testing
> – Shadow traffic
> – Unit/dtest coverage of new areas
> – Source audit
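
A sketch of how the "Found By" annotation could feed a JQL query, assuming the annotation is done with JIRA labels. The label names below and the helper function are hypothetical; only the JQL fields used (project, issuetype, labels, created) are standard.

```python
from datetime import date

# Hypothetical label names; the project would need to agree on real ones.
FOUND_BY_LABELS = {
    "Property-based / fuzz testing": "found-by-fuzz",
    "Replay testing": "found-by-replay",
    "Upgrade / Diff testing": "found-by-upgrade-diff",
}

def found_by_jql(label, since, project="CASSANDRA"):
    """Build a JQL string selecting bugs carrying a found-by label."""
    return (
        f"project = {project} AND issuetype = Bug "
        f'AND labels = "{label}" '
        f'AND created >= "{since.isoformat()}"'
    )

for methodology, label in FOUND_BY_LABELS.items():
    print(methodology, "->", found_by_jql(label, date(2018, 9, 1)))
```

The same query shape would work for the other methodologies in the list; the issue count returned for each query is the number to report.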
> Metric: “Found Bug” Count by Subsystem/Component (sourced via JQL,
> reported in Confluence):
> Similar to “found by,” but “found where.” These metrics help us understand
> which components or subsystems of the database we’re finding issues in. In
> the event that a particular area stands out as “hot,” we’ll have the
> quantitative feedback we need to support investment there. Tracking these
> counts over time – and their first derivative, the rate – also helps us
> make statements regarding progress in various subsystems. Though we can’t
> prove a negative (“no bugs have been found, therefore there are no bugs”),
> we gain confidence as their rate decreases normalized to the effort we’re
> putting in.
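
The "rate normalized to effort" idea can be sketched as below. The unit of effort (say, person-weeks of testing per reporting period) is an assumption, as are the function names; the thread doesn't define either.

```python
def normalized_rates(bugs_found, effort):
    """Bugs found per unit of effort, one value per reporting period.

    bugs_found, effort: equal-length sequences; effort values must be > 0.
    """
    if len(bugs_found) != len(effort):
        raise ValueError("bugs_found and effort must cover the same periods")
    if any(e <= 0 for e in effort):
        raise ValueError("effort must be positive in every period")
    return [b / e for b, e in zip(bugs_found, effort)]

def period_deltas(rates):
    """Change in rate between consecutive periods (a discrete first derivative)."""
    return [later - earlier for earlier, later in zip(rates, rates[1:])]
```

Consistently negative deltas for a component, while effort holds steady or grows, would be the signal of increasing confidence described above. It still wouldn't prove the absence of bugs.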
> We might consider reporting “Found In” counts for components as enumerated
> in JIRA, such as:
> – Auth
> – Build
> – Compaction
> – Compression
> – Core
> – CQL
> – Distributed Metadata
> – …and so on.
> Metric: “Found Bug” Count by Severity (sourced via JQL, reported in
> Confluence):
> Similar to “found by/where,” but “how bad”? These metrics help us
> understand the severity of the issues we encounter. As build quality
> improves, we would expect to see decreases in the severity of issues
> identified. A high rate of critical issues identified late in the release
> cycle would be cause for concern, though it may be expected at an earlier
> stage.
> These could roughly be sourced from the “Priority” field in JIRA:
> – Trivial
> – Minor
> – Major
> – Critical
> – Blocker
> While “priority” doesn’t map directly to “severity,” it may be a useful
> proxy. Alternately, we could introduce a label intended to represent
> severity if we’d like to make that clear.
> Metric: Performance Tests
> Performance tests tell us “how fast” (and “how expensive”). There are many
> metrics we could capture here, and a variety of workloads they could be
> sourced from.
> I’ll refrain from proposing a particular methodology or reporting
> structure since many have thought about this. From a reporting perspective,
> I’m inspired by Mozilla’s “arewefastyet.com<http://arewefastyet.com>”
> Chrome’s: https://arewefastyet.com/win10/overview
> Having this sort of feedback on a build-by-build basis would help us catch
> regressions, quantify improvements, and provide a baseline against 3.0 and
> Metric: Code Coverage (/ other static analysis techniques)
> It may also be useful to publish metrics from CI on code coverage by
> package/class/method/branch. These might not be useful metrics for
> “quality” (the relationship between code coverage and quality is tenuous).
> However, it would be useful to quantify the trend over time between
> releases, and to source a “to-do” list for important but poorly-covered
> areas of the project.
> There are more things we could measure. We won’t want to drown ourselves
> in metrics (or the work required to gather them) – but there are likely
> more not described here that could be useful to consider.
> Convergence Across Metrics:
> The thesis of this document is that improvements in each of these areas
> are correlated with increases in quality. Improvements across all areas are
> correlated with an increase in overall release quality. Tracking metrics
> like these provides the quantitative foundation for assessing progress,
> setting goals, and defining criteria. In that sense, they’re not an end –
> but a beginning.