| name | quality-metrics-and-kpis |
| description | Use when setting up quality dashboards, defining test coverage targets, tracking quality trends, configuring CI/CD quality gates, or reporting quality metrics to stakeholders - provides metric selection, threshold strategies, and dashboard design patterns |
Quality Metrics & KPIs
Overview
Core principle: Measure what matters. Track trends, not absolutes. Use metrics to drive action, not for vanity.
Rule: Every metric must have a defined threshold and action plan. If a metric doesn't change behavior, stop tracking it.
Quality Metrics vs Vanity Metrics
| Type | Example | Assessment | Follow-up |
|---|---|---|---|
| Vanity | "We have 10,000 tests!" | Says nothing about quality | Track pass rate and flakiness rate instead |
| Vanity | "95% code coverage!" | Easy to game; doesn't prove the tests are good | Track coverage delta on new code and mutation score instead |
| Actionable | "Test flakiness: 5% → 2%" | Shows a trend you can act on | Keep tracking the trend and set a target |
| Actionable | "P95 build time: 15 min" | Pinpoints a bottleneck | Optimize the slowest tests |
Actionable metrics answer: "What should I fix next?"
Core Quality Metrics
1. Test Pass Rate
Definition: % of tests that pass on first run
Pass Rate = (Passing Tests / Total Tests) × 100
Thresholds:
- > 98%: Healthy
- 95-98%: Investigate failures
- < 95%: Critical (tests are unreliable)
Why it matters: Low pass rate means flaky tests or broken code
Action: If < 98%, run flaky-test-prevention skill
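A minimal sketch of computing pass rate in CI from a pytest JUnit XML report (assumes the suite was run with pytest --junitxml=report.xml; the report file name is illustrative):
# Sketch: compute pass rate from a pytest JUnit XML report
import xml.etree.ElementTree as ET

def pass_rate(junit_xml_path: str = "report.xml") -> float:
    root = ET.parse(junit_xml_path).getroot()
    # pytest writes either <testsuite> or <testsuites> wrapping a <testsuite>
    suite = root if root.tag == "testsuite" else root.find("testsuite")
    total = int(suite.get("tests", 0))
    failed = int(suite.get("failures", 0)) + int(suite.get("errors", 0))
    skipped = int(suite.get("skipped", 0))
    ran = total - skipped
    return 100.0 * (ran - failed) / ran if ran else 100.0

if __name__ == "__main__":
    print(f"Pass rate: {pass_rate():.1f}%")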
2. Test Flakiness Rate
Definition: % of tests that fail intermittently
Flakiness Rate = (Flaky Tests / Total Tests) × 100
How to measure:
# Run each test 100 times (--count requires the pytest-repeat plugin)
pytest --count=100 test_checkout.py
# A test is flaky if it passes 1-99 times out of 100 (not 0 or 100)
Thresholds:
- < 1%: Healthy
- 1-5%: Moderate (fix soon)
- > 5%: Critical (CI is unreliable)
Action: Fix flaky tests before adding new tests
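A sketch of classifying tests from repeated runs, assuming you have already collected per-run pass/fail results (the results data structure is an assumption, not something pytest emits directly):
# Sketch: classify tests as stable, flaky, or broken from repeated runs
# `results` maps test id -> list of booleans (True = passed), one entry per run
from typing import Dict, List

def classify(results: Dict[str, List[bool]]) -> Dict[str, str]:
    labels = {}
    for test_id, outcomes in results.items():
        passes = sum(outcomes)
        if passes == len(outcomes):
            labels[test_id] = "stable"
        elif passes == 0:
            labels[test_id] = "broken"
        else:
            labels[test_id] = "flaky"
    return labels

def flakiness_rate(results: Dict[str, List[bool]]) -> float:
    labels = classify(results)
    flaky = sum(1 for label in labels.values() if label == "flaky")
    return 100.0 * flaky / len(labels) if labels else 0.0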
3. Code Coverage
Definition: % of code lines executed by tests
Coverage = (Executed Lines / Total Lines) × 100
Thresholds (by test type):
- Unit tests: 80-90% of business logic
- Integration tests: 60-70% of integration points
- E2E tests: 40-50% of critical paths
Configuration (pytest):
# .coveragerc
[run]
source = src
omit = */tests/*, */migrations/*

[report]
# Fail the report if total coverage is below 80%
fail_under = 80
show_missing = True
Anti-pattern: 100% coverage as goal
Why it's wrong: Easy to game (tests that execute code without asserting anything)
Better metric: Coverage + mutation score (see mutation-testing skill)
4. Coverage Delta (New Code)
Definition: Coverage of newly added code
Why it matters: More actionable than absolute coverage
# Run the full suite with coverage, then report only on lines changed vs. main
# (uses the diff-cover tool from the diff_cover package)
pytest --cov=src --cov-report=xml
diff-cover coverage.xml --compare-branch=origin/main --fail-under=90
Threshold: 90% for new code (stricter than legacy)
Action: Block PR if new code coverage < 90%
5. Build Time (CI/CD)
Definition: Time from commit to merge-ready
Track by stage:
- Lint/format: < 30s
- Unit tests: < 2 min
- Integration tests: < 5 min
- E2E tests: < 15 min
- Total PR pipeline: < 20 min
Why it matters: Slow CI blocks developer productivity
Action: If build > 20 min, see test-automation-architecture for optimization patterns
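A small sketch of checking measured stage durations against those budgets (the stage names and how durations are measured are assumptions):
# Sketch: flag CI stages that exceed their time budgets (seconds, from the list above)
STAGE_BUDGETS_SECONDS = {
    "lint": 30,
    "unit": 120,
    "integration": 300,
    "e2e": 900,
    "total": 1200,
}

def over_budget(measured: dict[str, float]) -> dict[str, float]:
    """Return stages whose measured duration exceeds the budget, with the overrun in seconds."""
    return {
        stage: measured[stage] - budget
        for stage, budget in STAGE_BUDGETS_SECONDS.items()
        if measured.get(stage, 0) > budget
    }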
6. Test Execution Time Trend
Definition: How test suite duration changes over time
# Track suite duration in CI and write it out for trend dashboards
import json
import time

import pytest

start = time.time()
exit_code = pytest.main()
duration = time.time() - start

metrics = {"test_duration_seconds": duration, "timestamp": time.time()}
with open("metrics.json", "w") as f:
    json.dump(metrics, f)

# Propagate the test result so CI still fails when tests fail
raise SystemExit(exit_code)
Threshold: < 5% growth per month
Action: If growth > 5%/month, parallelize tests or refactor slow tests
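A sketch of computing month-over-month growth from recorded durations (aggregating the metrics.json entries into a per-month average is assumed to happen elsewhere):
# Sketch: month-over-month growth of average suite duration
def monthly_growth_percent(avg_duration_by_month: dict[str, float]) -> dict[str, float]:
    """Map month -> % change in average suite duration vs. the previous month."""
    months = sorted(avg_duration_by_month)
    return {
        month: 100.0 * (avg_duration_by_month[month] - avg_duration_by_month[prev])
        / avg_duration_by_month[prev]
        for prev, month in zip(months, months[1:])
    }

growth = monthly_growth_percent({"2025-01": 300.0, "2025-02": 312.0, "2025-03": 331.0})
# {"2025-02": 4.0, "2025-03": 6.09...}: March breaches the 5%/month threshold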
7. Defect Escape Rate
Definition: Bugs found in production that should have been caught by tests
Defect Escape Rate = (Defects Found in Production / (Defects Found in Production + Defects Caught Before Release)) × 100
Example: 2 escaped bugs against 48 caught before release → 2 / 50 × 100 = 4% (acceptable, but worth investigating)
Thresholds:
- < 2%: Excellent
- 2-5%: Acceptable
- > 5%: Tests are missing critical scenarios
Action: For each escape, write regression test to prevent recurrence
8. Mean Time to Detection (MTTD)
Definition: Time from bug introduction to discovery
MTTD = Bug Detection Time - Bug Introduction Time
Thresholds:
- < 1 hour: Excellent (caught in CI)
- 1-24 hours: Good (caught in staging/canary)
- > 24 hours: Poor (caught in production)
Action: If MTTD > 24h, improve observability (see observability-and-monitoring skill)
9. Mean Time to Recovery (MTTR)
Definition: Time from bug detection to fix deployed
MTTR = Fix Deployment Time - Bug Detection Time
Thresholds:
- < 1 hour: Excellent
- 1-8 hours: Acceptable
- > 8 hours: Poor
Action: If MTTR > 8h, improve rollback procedures (see testing-in-production skill)
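A minimal sketch of computing MTTD and MTTR from incident records (the field names are illustrative assumptions):
# Sketch: mean time to detection / recovery from incident timestamps
from datetime import datetime, timedelta
from statistics import mean

def mttd_hours(incidents: list[dict]) -> float:
    """Mean hours from bug introduction to detection."""
    return mean((i["detected_at"] - i["introduced_at"]) / timedelta(hours=1) for i in incidents)

def mttr_hours(incidents: list[dict]) -> float:
    """Mean hours from detection to fix deployment."""
    return mean((i["fix_deployed_at"] - i["detected_at"]) / timedelta(hours=1) for i in incidents)

incidents = [{
    "introduced_at": datetime(2025, 1, 10, 9, 0),
    "detected_at": datetime(2025, 1, 10, 15, 0),
    "fix_deployed_at": datetime(2025, 1, 10, 18, 0),
}]
print(mttd_hours(incidents), mttr_hours(incidents))  # 6.0 3.0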
Dashboard Design
Grafana Dashboard Example
# grafana-dashboard.json
{
"panels": [
{
"title": "Test Pass Rate (7 days)",
"targets": [{
"expr": "sum(tests_passed) / sum(tests_total) * 100"
}],
"thresholds": [
{"value": 95, "color": "red"},
{"value": 98, "color": "yellow"},
{"value": 100, "color": "green"}
]
},
{
"title": "Build Time Trend (30 days)",
"targets": [{
"expr": "avg_over_time(ci_build_duration_seconds[30d])"
}]
},
{
"title": "Coverage Delta (per PR)",
"targets": [{
"expr": "coverage_new_code_percent"
}],
"thresholds": [
{"value": 90, "color": "green"},
{"value": 80, "color": "yellow"},
{"value": 0, "color": "red"}
]
}
]
}
CI/CD Quality Gates
GitHub Actions example:
# .github/workflows/quality-gates.yml
name: Quality Gates
on: [pull_request]
jobs:
quality-check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
      - name: Run tests with coverage
        # --json-report (from the pytest-json-report plugin) writes test-results.json,
        # which the test-duration check below reads
        run: pytest --cov=src --cov-report=json --json-report --json-report-file=test-results.json
      - name: Check coverage threshold
        run: |
          COVERAGE=$(jq '.totals.percent_covered' coverage.json)
          if (( $(echo "$COVERAGE < 80" | bc -l) )); then
            echo "Coverage $COVERAGE% below 80% threshold"
            exit 1
          fi
      - name: Check test duration
        run: |
          DURATION=$(jq '.duration' test-results.json)
          if (( $(echo "$DURATION > 300" | bc -l) )); then
            echo "Test run ${DURATION}s exceeds 5-minute threshold"
            exit 1
          fi
Reporting Patterns
Weekly Quality Report
Template:
# Quality Report - Week of 2025-01-13
## Summary
- **Test pass rate:** 98.5% (+0.5% from last week)
- **Flakiness rate:** 2.1% (-1.3% from last week) ✅
- **Coverage:** 85.2% (+2.1% from last week) ✅
- **Build time:** 18 min (-2 min from last week) ✅
## Actions Taken
- Fixed 8 flaky tests in checkout flow
- Added integration tests for payment service (+5% coverage)
- Parallelized slow E2E tests (reduced build time by 2 min)
## Action Items
- [ ] Fix remaining 3 flaky tests in user registration
- [ ] Increase coverage of order service (currently 72%)
- [ ] Investigate why staging MTTD increased to 4 hours
Stakeholder Dashboard (Executive View)
Metrics to show:
- Quality trend (6 months): Pass rate over time
- Velocity impact: How long does CI take per PR?
- Production stability: Defect escape rate
- Recovery time: MTTR for incidents
What NOT to show:
- Absolute test count (vanity metric)
- Lines of code (meaningless)
- Individual developer metrics (creates wrong incentives)
Anti-Patterns Catalog
❌ Coverage as the Only Metric
Symptom: "We need 100% coverage!"
Why bad: Easy to game with meaningless tests
# ❌ BAD: 100% coverage, 0% value
def calculate_tax(amount):
return amount * 0.08
def test_calculate_tax():
calculate_tax(100) # Executes code, asserts nothing!
Fix: Use coverage + mutation score
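For contrast, a minimal sketch of the same test with a real assertion (the parametrized cases are illustrative):
# ✅ BETTER: same coverage, but the behavior is actually asserted
import pytest

def calculate_tax(amount):
    return amount * 0.08

@pytest.mark.parametrize("amount, expected", [(100, 8.0), (0, 0.0)])
def test_calculate_tax(amount, expected):
    assert calculate_tax(amount) == pytest.approx(expected)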
❌ Tracking Metrics Without Thresholds
Symptom: Dashboard shows metrics but no action taken
Why bad: Metrics become noise
Fix: Every metric needs:
- Target threshold (e.g., flakiness < 1%)
- Alert level (e.g., alert if flakiness > 5%)
- Action plan (e.g., "Fix flaky tests before adding new features")
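A minimal sketch of encoding that contract so every tracked metric carries its threshold, alert level, and action plan (names and values are illustrative):
# Sketch: each tracked metric declares its threshold, alert level, and action plan
from dataclasses import dataclass

@dataclass
class Metric:
    name: str
    target: float    # desired threshold
    alert_at: float  # raise an alert when crossed
    action: str      # what to do when the alert fires

FLAKINESS = Metric(
    name="flakiness_rate_percent",
    target=1.0,
    alert_at=5.0,
    action="Fix flaky tests before adding new features",
)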
❌ Optimizing for Metrics, Not Quality
Symptom: Gaming metrics to hit targets
Example: Removing tests to increase pass rate
Fix: Track multiple complementary metrics (pass rate + flakiness + coverage)
❌ Measuring Individual Developer Productivity
Symptom: "Developer A writes more tests than Developer B"
Why bad: Creates wrong incentives (quantity over quality)
Fix: Measure team-level metrics, not individual ones
Tool Integration
SonarQube Metrics
Quality Gate:
# sonar-project.properties
sonar.qualitygate.wait=true
# Metrics tracked:
# - Bugs (target: 0)
# - Vulnerabilities (target: 0)
# - Code smells (target: < 100)
# - Coverage (target: > 80%)
# - Duplications (target: < 3%)
Codecov Integration
# codecov.yml
coverage:
status:
project:
default:
target: 80% # Overall coverage target
threshold: 2% # Allow 2% drop
patch:
default:
target: 90% # New code must have 90% coverage
threshold: 0% # No drops allowed
Bottom Line
Track actionable metrics with defined thresholds. Use metrics to drive improvement, not for vanity.
Core dashboard:
- Test pass rate (> 98%)
- Flakiness rate (< 1%)
- Coverage delta on new code (> 90%)
- Build time (< 20 min)
- Defect escape rate (< 2%)
Weekly actions:
- Review metrics against thresholds
- Identify trends (improving/degrading)
- Create action items for violations
- Track progress on improvements
If you're tracking a metric but not acting on it, stop tracking it. Metrics exist to drive action, not to fill dashboards.