Re: Measuring Release Quality


Josh, thanks for reading and sharing feedback. Agreed with starting simple and measuring inputs that are high-signal; that’s a good place to begin.

To the challenge of building consensus, point taken + agreed. Perhaps the distinction is between producing something that’s “useful” vs. something that’s “authoritative” for decision-making purposes. My motivation is to work toward something “useful” (as measured by the value contributors find). I’d be happy to start putting some of these together as part of an experiment – and agreed on evaluating “value relative to cost” after we see how things play out.

To Benedict’s point on JIRA, agreed that plotting a value from messy input wouldn’t produce useful output. Some questions a small working group might take on toward better categorization might look like:

–––
– Revisiting the list of components: e.g., “Core” captures a lot right now.
– Revisiting which fields should be required when filing a ticket – and if there are any that should be removed from the form.
– Reviewing active labels: understanding what people have been trying to capture, and how they could be organized + documented better.
– Documenting “priority”: (e.g., a common standard we can point to, even if we’re pretty good now).
– Considering adding a “severity” field to capture the distinction between priority and severity.
–––

If there’s appetite for spending a little time on this, I’d put effort toward it if others are interested; is anyone?

Otherwise, I’m equally fine with an experiment to measure basics via the current structure as Josh mentioned, too.

– Scott


On September 20, 2018 at 8:22:55 AM, Benedict Elliott Smith (benedict@xxxxxxxxxx) wrote:

I think it would be great to start getting some high quality info out of JIRA, but I think we need to clean up and standardise how we use it to facilitate this.

Take the Component field as an example. This is the current list of options:

4.0
Auth
Build
Compaction
Configuration
Core
CQL
Distributed Metadata
Documentation and Website
Hints
Libraries
Lifecycle
Local Write-Read Paths
Materialized Views
Metrics
Observability
Packaging
Repair
SASI
Secondary Indexes
Streaming and Messaging
Stress
Testing
Tools

In some cases there's duplication (Metrics + Observability, Coordination (=“Storage Proxy, Hints, Batchlog, Counters…”) + Hints, Local Write-Read Paths + Core)
In others, there’s a lack of granularity (Streaming + Messaging, Core, Coordination, Distributed Metadata)
In others, there’s a lack of clarity (Core, Lifecycle, Coordination)
Others are probably missing entirely (Transient Replication, …?)

Labels are also used fairly haphazardly, and there’s no clear definition of “priority”.

Perhaps we should form a working group to suggest a methodology for filling out JIRA, standardise the necessary components, labels etc, and put together a wiki page with step-by-step instructions on how to do it?


> On 20 Sep 2018, at 15:29, Joshua McKenzie <jmckenzie@xxxxxxxxxx> wrote:
>
> I've spent a good bit of time thinking about the above and bounced off both
> different ways to measure quality and progress as well as trying to
> influence community behavior on this topic. My advice: start small and
> simple (KISS, YAGNI, all that). Get metrics for pass/fail on
> utest/dtest/flakiness over time, perhaps also aggregate bug count by
> component over time. After spending a predetermined time doing that (a
> couple months?) as an experiment, we retrospect as a project and see if
> these efforts are adding value commensurate with the time investment
> required to perform the measurement and analysis.
>
> There's a lot of really good ideas in that linked wiki article / this email
> thread. The biggest challenge, and risk of failure, is in translating good
> ideas into action and selling project participants on the value of changing
> their behavior. The latter is where we've fallen short over the years;
> building consensus (especially regarding process /shudder) is Very Hard.
>
> Also - thanks for spearheading this discussion Scott. It's one we come back
> to with some regularity so there's real pain and opportunity here for the
> project imo.
>
> On Wed, Sep 19, 2018 at 9:32 PM Scott Andreas <scott@xxxxxxxxxxxxxx> wrote:
>
>> Hi everyone,
>>
>> Now that many teams have begun testing and validating Apache Cassandra
>> 4.0, it’s useful to think about what “progress” looks like. While metrics
>> alone may not tell us what “done” means, they do help us answer the
>> question, “are we getting better or worse — and how quickly”?
>>
>> A friend described to me a few attributes of metrics he considered useful,
>> suggesting that good metrics are actionable, visible, predictive, and
>> consequent:
>>
>> – Actionable: We know what to do based on them – where to invest, what to
>> fix, what’s fine, etc.
>> – Visible: Everyone who has a stake in a metric has full visibility into
>> it and participates in its definition.
>> – Predictive: Good metrics enable forecasting of outcomes – e.g.,
>> “consistent performance test results against build abc predict an x%
>> reduction in 99%ile read latency for this workload in prod”.
>> – Consequent: We take actions based on them (e.g., not shipping if tests
>> are failing).
>>
>> Here are some notes in Confluence toward metrics that may be useful to
>> track beginning in this phase of the development + release cycle. I’m
>> interested in your thoughts on these. They’re also copied inline for easier
>> reading in your mail client.
>>
>> Link:
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=93324430
>>
>> Cheers,
>>
>> – Scott
>>
>> ––––––
>>
>> Measuring Release Quality:
>>
>> [ This document is a draft + sketch of ideas. It is located in the
>> "discussion" section of this wiki to indicate that it is an active draft –
>> not a document that has been voted on, achieved consensus, or in any way
>> official. ]
>>
>> Introduction:
>>
>> This document outlines a series of metrics that may be useful toward
>> measuring release quality, and quantifying progress during the testing /
>> validation phase of the Apache Cassandra 4.0 release cycle.
>>
>> The goal of this document is to think through what we should consider
>> measuring to quantify our progress testing and validating Apache Cassandra
>> 4.0. This document explicitly does not discuss release criteria – though
>> metrics may be a useful input to a discussion on that topic.
>>
>>
>> Metric: Build / Test Health (produced via CI, recorded in Confluence):
>>
>> Bread-and-butter metrics intended to capture baseline build health and
>> flakiness in the test suite, presented as a time series to understand
>> how they’ve changed from build to build and release to release:
>>
>> Metrics:
>>
>> – Pass / fail metrics for unit tests
>> – Pass / fail metrics for dtests
>> – Flakiness stats for unit and dtests
>>
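As an illustration of the build / test health metrics listed above, here is a minimal sketch in Python. It assumes CI results can be exported as (build, test, passed) records; that record shape, the function names, and the definition of "flaky" (a test that both passed and failed across the builds examined) are assumptions made for this example, not something the thread specifies.

```python
from collections import defaultdict


def pass_rate_by_build(runs):
    """Return {build_id: fraction of tests that passed in that build}.

    `runs` is an iterable of (build_id, test_name, passed) tuples.
    This record shape is an assumption, not an actual CI export format.
    """
    totals = defaultdict(lambda: [0, 0])  # build_id -> [passed, total]
    for build, _test, passed in runs:
        totals[build][0] += int(passed)
        totals[build][1] += 1
    return {build: p / t for build, (p, t) in totals.items()}


def flakiness_by_test(runs):
    """Return {test_name: failure rate} for tests with mixed outcomes.

    Assumed definition: a test is flaky if it has at least one pass and
    at least one failure across the builds provided.
    """
    outcomes = defaultdict(list)
    for _build, test, passed in runs:
        outcomes[test].append(bool(passed))
    flaky = {}
    for test, results in outcomes.items():
        failures = results.count(False)
        if 0 < failures < len(results):
            flaky[test] = failures / len(results)
    return flaky
```

Plotting the output of `pass_rate_by_build` in build order would give the time series described above; a test that fails in every build is a plain failure rather than a flaky one, so it is deliberately excluded from `flakiness_by_test`.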
>>
>> Metric: “Found Bug” Count by Methodology (sourced via JQL, reported in
>> Confluence):
>>
>> These are intended to help us understand the efficacy of each methodology
>> being applied. We might consider annotating bugs found in JIRA with the
>> methodology that produced them. This could be consumed as input in a JQL
>> query and reported on the Confluence dev wiki.
>>
>> As we reach a pareto-optimal level of investment in a methodology, we’d
>> expect to see its found-bug rate taper. As we achieve higher quality across
>> the board, we’d expect to see a tapering in found-bug counts across all
>> methodologies. In the event that one or two approaches are outliers, this
>> could indicate the utility of doubling down on a particular form of testing.
>>
>> We might consider reporting “Found By” counts for methodologies such as:
>>
>> – Property-based / fuzz testing
>> – Replay testing
>> – Upgrade / Diff testing
>> – Performance testing
>> – Shadow traffic
>> – Unit/dtest coverage of new areas
>> – Source audit
>>
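One way the "Found By" annotation could work is with a label convention, sketched below in Python. The `found-by-` label prefix, the function names, and the issue-record shape (a dict with a "labels" list) are hypothetical; the JQL string uses only the standard `project`, `issuetype`, and `labels` fields.

```python
from collections import Counter

# Hypothetical label convention for this example, e.g. "found-by-fuzz".
FOUND_BY_PREFIX = "found-by-"


def found_by_jql(project, methodology):
    """Build a JQL query for bugs annotated with one methodology label."""
    label = f"{FOUND_BY_PREFIX}{methodology}"
    return f'project = {project} AND issuetype = Bug AND labels = "{label}"'


def count_found_by(issues):
    """Tally issues per methodology from already-fetched issue records.

    `issues` is an iterable of dicts with an optional "labels" list;
    this shape is an assumption, not the JIRA API response format.
    """
    counts = Counter()
    for issue in issues:
        for label in issue.get("labels", []):
            if label.startswith(FOUND_BY_PREFIX):
                counts[label[len(FOUND_BY_PREFIX):]] += 1
    return counts
```

For example, `found_by_jql("CASSANDRA", "fuzz")` produces a query that could be pasted into the JIRA issue search, while `count_found_by` covers the case where issues have been exported and are tallied offline.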
>>
>> Metric: “Found Bug” Count by Subsystem/Component (sourced via JQL,
>> reported in Confluence):
>>
>> Similar to “found by,” but “found where.” These metrics help us understand
>> which components or subsystems of the database we’re finding issues in. In
>> the event that a particular area stands out as “hot,” we’ll have the
>> quantitative feedback we need to support investment there. Tracking these
>> counts over time – and their first derivative, the rate – also helps us
>> make statements regarding progress in various subsystems. Though we can’t
>> prove a negative (“no bugs have been found, therefore there are no bugs”),
>> we gain confidence as their rate decreases normalized to the effort we’re
>> putting in.
>>
>> We might consider reporting “Found In” counts for components as enumerated
>> in JIRA, such as:
>> – Auth
>> – Build
>> – Compaction
>> – Compression
>> – Core
>> – CQL
>> – Distributed Metadata
>> – …and so on.
>>
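The "rate, normalized to effort" idea above can be made concrete with a small Python sketch. It assumes we have a cumulative found-bug count per reporting period for one component, plus some measure of testing effort in each period; the effort unit (e.g. hours) and the function name are assumptions for illustration only.

```python
def found_bug_rates(cumulative_counts, effort):
    """Return bugs found per unit of effort for each period after the first.

    cumulative_counts[i]: total bugs found in a component through period i.
    effort[i]: testing effort spent during period i (unit is an assumption).
    A period with zero effort yields None, since its rate is undefined.
    """
    if len(cumulative_counts) != len(effort):
        raise ValueError("cumulative_counts and effort must align by period")
    rates = []
    for i in range(1, len(cumulative_counts)):
        # First difference of the cumulative count: bugs found in period i.
        new_bugs = cumulative_counts[i] - cumulative_counts[i - 1]
        rates.append(new_bugs / effort[i] if effort[i] else None)
    return rates
```

With made-up inputs, `found_bug_rates([0, 10, 16, 18], [0, 20, 20, 40])` gives a falling rate even though effort doubled in the last period, which is the pattern that would support growing confidence in a component.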
>>
>> Metric: “Found Bug” Count by Severity (sourced via JQL, reported in
>> Confluence)
>>
>> Similar to “found by/where,” but “how bad”? These metrics help us
>> understand the severity of the issues we encounter. As build quality
>> improves, we would expect to see decreases in the severity of issues
>> identified. A high rate of critical issues identified late in the release
>> cycle would be cause for concern, though it may be expected at an earlier
>> time.
>>
>> These could roughly be sourced from the “Priority” field in JIRA:
>> – Trivial
>> – Minor
>> – Major
>> – Critical
>> – Blocker
>>
>> While “priority” doesn’t map directly to “severity,” it may be a useful
>> proxy. Alternately, we could introduce a label intended to represent
>> severity if we’d like to make that clear.
>>
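If “Priority” is used as a proxy for severity, a per-period profile could be computed along these lines. This Python sketch uses the five priority values listed above; the function names and the choice of “Critical” as the high-severity cutoff are assumptions for the example.

```python
from collections import Counter

# Priority values as listed above, ordered from least to most severe.
PRIORITIES = ["Trivial", "Minor", "Major", "Critical", "Blocker"]


def severity_profile(priorities):
    """Return [(priority, count), ...] in ascending severity order."""
    counts = Counter(priorities)
    unknown = set(counts) - set(PRIORITIES)
    if unknown:
        raise ValueError(f"unknown priority values: {sorted(unknown)}")
    return [(p, counts[p]) for p in PRIORITIES]


def high_severity_share(priorities, threshold="Critical"):
    """Return the fraction of issues at or above `threshold` priority.

    The default cutoff is an assumption; an empty input yields 0.0.
    """
    priorities = list(priorities)
    if not priorities:
        return 0.0
    cutoff = PRIORITIES.index(threshold)
    high = sum(1 for p in priorities if PRIORITIES.index(p) >= cutoff)
    return high / len(priorities)
```

Computing `high_severity_share` for the bugs filed in each week or month would show whether the share of critical issues is falling as the release cycle progresses.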
>>
>> Metric: Performance Tests
>>
>> Performance tests tell us “how fast” (and “how expensive”). There are many
>> metrics we could capture here, and a variety of workloads they could be
>> sourced from.
>>
>> I’ll refrain from proposing a particular methodology or reporting
>> structure since many have thought about this. From a reporting perspective,
>> I’m inspired by Mozilla’s “arewefastyet.com”,
>> used to report the performance of their JavaScript engine relative to
>> Chrome’s: https://arewefastyet.com/win10/overview
>>
>> Having this sort of feedback on a build-by-build basis would help us catch
>> regressions, quantify improvements, and provide a baseline against 3.0 and
>> 3.x.
>>
>>
>> Metric: Code Coverage (/ other static analysis techniques)
>>
>> It may also be useful to publish metrics from CI on code coverage by
>> package/class/method/branch. These might not be useful metrics for
>> “quality” (the relationship between code coverage and quality is tenuous).
>>
>> However, it would be useful to quantify the trend over time between
>> releases, and to source a “to-do” list for important but poorly-covered
>> areas of the project.
>>
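The “to-do” list for important but poorly-covered areas could be derived roughly as follows. This is a Python sketch under stated assumptions: coverage is available as a fraction per package, “importance” is a weight the project would have to assign by hand, and the 0.6 threshold and function name are arbitrary choices for the example.

```python
def coverage_todo(coverage, importance, threshold=0.6):
    """Return [(package, coverage), ...] needing attention, most urgent first.

    coverage:   {package: fraction of lines covered, 0.0 to 1.0}
    importance: {package: weight > 0 for areas the project cares about}
    threshold:  coverage below this counts as "poorly covered" (assumed value)

    Packages are ordered by descending importance, then ascending coverage.
    """
    todo = [
        (pkg, cov)
        for pkg, cov in coverage.items()
        if cov < threshold and importance.get(pkg, 0) > 0
    ]
    return sorted(todo, key=lambda item: (-importance[item[0]], item[1]))
```

Running the same function against coverage reports from successive releases would also give the trend over time mentioned above.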
>>
>> Others:
>>
>> There are more things we could measure. We won’t want to drown ourselves
>> in metrics (or the work required to gather them) – but there are likely
>> more not described here that could be useful to consider.
>>
>>
>> Convergence Across Metrics:
>>
>> The thesis of this document is that improvements in each of these areas
>> are correlated with increases in quality. Improvements across all areas are
>> correlated with an increase in overall release quality. Tracking metrics
>> like these provides the quantitative foundation for assessing progress,
>> setting goals, and defining criteria. In that sense, they’re not an end –
>> but a beginning.
>>