SKILL.md

name: flaky-test-prevention
description: Use when debugging intermittent test failures, choosing between retries vs fixes, quarantining flaky tests, calculating flakiness rates, or preventing non-deterministic behavior - provides root cause diagnosis, anti-patterns, and systematic debugging

Flaky Test Prevention

Overview

Core principle: Fix root causes, don't mask symptoms.

Rule: Flaky tests indicate real problems - in test design, application code, or infrastructure.

Flakiness Decision Tree

| Symptom | Root Cause Category | Diagnostic | Fix |
|---|---|---|---|
| Passes alone, fails in suite | Test Interdependence | Run tests in random order | Use test isolation (transactions, unique IDs) |
| Fails randomly in ~10% of runs | Timing/Race Condition | Add logging, run 100x | Replace sleeps with explicit waits |
| Fails only in CI, not locally | Environment Difference | Compare CI vs local env | Match environments, use containers |
| Fails at specific times | Time Dependency | Check for date/time usage | Mock system time |
| Fails under load | Resource Contention | Run in parallel locally | Add resource isolation, increase limits |
| Different results each run | Non-Deterministic Code | Check for randomness | Seed random generators, use fixtures |

First step: Identify symptom, trace to root cause category.

Anti-Patterns Catalog

❌ Sleepy Assertion

Symptom: Using fixed sleep() or wait() instead of condition-based waits

Why bad: Wastes time on fast runs, still fails on slow runs, brittle

Fix: Explicit waits for conditions

# ❌ Bad
time.sleep(5)  # Hope 5 seconds is enough
assert element.text == "Loaded"

# ✅ Good
WebDriverWait(driver, 10).until(
    lambda d: d.find_element(By.ID, "status").text == "Loaded"  # By is from selenium.webdriver.common.by
)
assert element.text == "Loaded"

❌ Test Interdependence

Symptom: Tests pass when run in specific order, fail when shuffled

Why bad: Hidden dependencies, can't run in parallel, breaks test isolation

Fix: Each test creates its own data, no shared state

# ❌ Bad
def test_create_user():
    user = create_user("test_user")  # Shared ID

def test_update_user():
    update_user("test_user")  # Depends on test_create_user

# ✅ Good
def test_create_user():
    user_id = f"user_{uuid4()}"
    user = create_user(user_id)

def test_update_user():
    user_id = f"user_{uuid4()}"
    user = create_user(user_id)  # Independent
    update_user(user_id)

❌ Hidden Dependencies

Symptom: Tests fail due to external state (network, database, file system) beyond test control

Why bad: Unpredictable failures, environment-specific issues

Fix: Mock external dependencies

# ❌ Bad
def test_weather_api():
    response = requests.get("https://api.weather.com/...")
    assert response.json()["temp"] > 0  # Fails if API is down

# ✅ Good
@mock.patch('requests.get')
def test_weather_api(mock_get):
    mock_get.return_value.json.return_value = {"temp": 75}
    response = get_weather("Seattle")
    assert response["temp"] == 75

❌ Time Bomb

Symptom: Tests that depend on current date/time and fail at specific moments (midnight, month boundaries, DST)

Why bad: Fails unpredictably based on when tests run

Fix: Mock system time

# ❌ Bad
def test_expiration():
    created_at = datetime.now()
    assert is_expired(created_at) == False  # Fails at midnight

# ✅ Good
@freeze_time("2025-11-15 12:00:00")
def test_expiration():
    created_at = datetime(2025, 11, 15, 12, 0, 0)
    assert is_expired(created_at) == False

❌ Timeout Inflation

Symptom: Continuously increasing timeouts to "fix" flaky tests (5s → 10s → 30s)

Why bad: Masks root cause, slows test suite, doesn't guarantee reliability

Fix: Investigate why operation is slow, use explicit waits

# ❌ Bad
await page.waitFor(30000)  # Increased from 5s hoping it helps

# ✅ Good
await page.waitForSelector('.data-loaded', {timeout: 10000})
await page.waitForNetworkIdle()

Detection Strategies

Proactive Identification

Run tests multiple times (statistical detection):

# pytest with repeat plugin
pip install pytest-repeat
pytest --count=50 test_flaky.py

# Track pass rate
# 50/50 = 100% reliable
# 45/50 = 90% pass rate (10% flaky - investigate immediately)
# <99% pass rate = quarantine (see thresholds below)

CI Integration (automatic tracking):

# GitHub Actions example
- name: Run tests with flakiness detection
  run: |
    pytest --count=3 --junit-xml=results.xml
    python scripts/calculate_flakiness.py results.xml
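
The scripts/calculate_flakiness.py helper above is referenced but not included in this skill; a minimal sketch of such a script (the per-test aggregation and the 99% threshold are assumptions based on the targets in this guide) might look like:

# scripts/calculate_flakiness.py - hypothetical sketch, not part of the skill
# Parses one or more JUnit XML files and prints a pass rate per test.
import re
import sys
import xml.etree.ElementTree as ET
from collections import defaultdict

runs = defaultdict(lambda: {"total": 0, "failed": 0})

for path in sys.argv[1:]:
    for case in ET.parse(path).getroot().iter("testcase"):
        # pytest-repeat adds ids like "[3-5]"; strip them so repetitions aggregate per test
        name = re.sub(r"\[\d+-\d+\]$", "", f"{case.get('classname')}::{case.get('name')}")
        runs[name]["total"] += 1
        if case.find("failure") is not None or case.find("error") is not None:
            runs[name]["failed"] += 1

exit_code = 0
for name, stats in sorted(runs.items()):
    pass_rate = 100 * (stats["total"] - stats["failed"]) / stats["total"]
    print(f"{pass_rate:6.1f}%  {name}")
    if pass_rate < 99:
        exit_code = 1  # fail the CI step if any test drops below the 99% target

sys.exit(exit_code)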

Flakiness metrics to track:

  • Pass rate per test (target: >99%)
  • Mean Time Between Failures (MTBF)
  • Failure clustering (same test failing together)

Systematic Debugging

When a test fails intermittently:

  1. Reproduce consistently - Run 100x to establish failure rate
  2. Isolate - Run alone, with subset, with full suite (find interdependencies)
  3. Add logging - Capture state before assertion, screenshot on failure
  4. Bisect - If it fails only in the full suite, binary-search which other test triggers it (see the sketch after this list)
  5. Environment audit - Compare CI vs local (env vars, resources, timing)
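
A rough manual bisection, assuming the flaky test is test_flaky.py::test_target (paths and names are illustrative):

# Run the suspect test after each half of the suite; keep whichever half still fails,
# split it again, and repeat until a single polluting test remains
pytest tests/group_a/ test_flaky.py::test_target
pytest tests/group_b/ test_flaky.py::test_target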

Flakiness Metrics Guide

Calculating flake rate:

# Flakiness formula
flake_rate = (failed_runs / total_runs) * 100

# Example
# Test run 100 times: 7 failures
# Flake rate = 7/100 = 7%

Thresholds:

| Flake Rate | Action | Priority |
|---|---|---|
| 0% (100% pass) | Reliable | Monitor |
| 0.1-1% | Investigate | Low |
| 1-5% | Quarantine + Fix | Medium |
| 5-10% | Quarantine + Fix Urgently | High |
| >10% | Disable immediately | Critical |

Target: All tests should maintain >99% pass rate (< 1% flake rate)

Quarantine Workflow

Purpose: Keep CI green while fixing flaky tests systematically

Process:

  1. Detect - Test fails >1% of runs
  2. Quarantine - Mark with @pytest.mark.quarantine, exclude from CI
  3. Track - Create issue with flake rate, failure logs, reproduction steps
  4. Fix - Assign owner, set SLA (e.g., 2 weeks to fix or delete)
  5. Validate - Run fixed test 100x, must achieve >99% pass rate
  6. Re-Enable - Remove quarantine mark, monitor for 1 week

Marking quarantined tests:

@pytest.mark.quarantine(reason="Flaky due to timing issue #1234")
@pytest.mark.skip("Quarantined")
def test_flaky_feature():
    pass
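
quarantine is a custom marker, so pytest will warn about it unless it is registered; a minimal sketch, assuming a shared conftest.py:

# conftest.py - register the custom "quarantine" marker so pytest does not warn about it
def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "quarantine(reason): flaky test excluded from CI while under investigation",
    )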

CI configuration:

# Run all tests except quarantined
pytest -m "not quarantine"

SLA: Quarantined tests must be fixed within 2 weeks or deleted. No test stays quarantined indefinitely.

Tool Ecosystem Quick Reference

| Tool | Purpose | When to Use |
|---|---|---|
| pytest-repeat | Run test N times | Statistical detection |
| pytest-xdist | Parallel execution | Expose race conditions |
| pytest-rerunfailures | Auto-retry on failure | Temporary mitigation during fix |
| pytest-randomly | Randomize test order | Detect test interdependence |
| freezegun | Mock system time | Fix time bombs |
| pytest-timeout | Prevent hanging tests | Catch infinite loops |

Installation:

pip install pytest-repeat pytest-xdist pytest-rerunfailures pytest-randomly freezegun pytest-timeout

Usage examples:

# Detect flakiness (run 50x)
pytest --count=50 test_suite.py

# Detect interdependence (random order)
pytest --randomly-seed=12345 test_suite.py

# Expose race conditions (parallel)
pytest -n 4 test_suite.py

# Temporary mitigation (reruns, not a fix!)
pytest --reruns 2 --reruns-delay 1 test_suite.py
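
These plugins compose; a single stress run can combine repetition, parallel workers, and (with pytest-randomly installed) shuffled order:

# Stress run: 20 repetitions spread across 4 workers; pytest-randomly shuffles order by default,
# so statistical flakiness, race conditions, and interdependence are exercised at once
pytest --count=20 -n 4 test_suite.py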

Prevention Checklist

Use during test authoring to prevent flakiness:

  • No fixed time.sleep() - use explicit waits for conditions
  • Each test creates its own data (UUID-based IDs)
  • No shared global state between tests
  • External dependencies mocked (APIs, network, databases)
  • Time/date frozen with @freeze_time if time-dependent
  • Random values seeded (random.seed(42)) - see the fixture sketch after this checklist
  • Tests pass when run in any order (pytest --randomly-seed)
  • Tests pass when run in parallel (pytest -n 4)
  • Tests pass 100/100 times (pytest --count=100)
  • Teardown cleans up all resources (files, database, cache)
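
For the seeding item above, a minimal autouse fixture sketch (the fixed seed 42 mirrors the example in this checklist):

# conftest.py - seed Python's RNG before each test so "random" values are reproducible
import random

import pytest


@pytest.fixture(autouse=True)
def _seeded_random():
    random.seed(42)
    yield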

Common Fixes Quick Reference

| Problem | Fix Pattern | Example |
|---|---|---|
| Timing issues | Explicit waits | `WebDriverWait(driver, 10).until(condition)` |
| Test interdependence | Unique IDs per test | `user_id = f"test_{uuid4()}"` |
| External dependencies | Mock/stub | `@mock.patch('requests.get')` |
| Time dependency | Freeze time | `@freeze_time("2025-11-15")` |
| Random behavior | Seed randomness | `random.seed(42)` |
| Shared state | Test isolation | Transactions, teardown fixtures |
| Resource contention | Unique resources | Separate temp dirs, DB namespaces |
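
The "Shared state" row is the only fix without a snippet elsewhere in this guide; a sketch of rollback-based isolation, assuming a SQLAlchemy-style session factory named SessionLocal (hypothetical import):

# conftest.py - give each test its own session and discard whatever it changed
import pytest

from myapp.db import SessionLocal  # hypothetical import; adapt to your ORM


@pytest.fixture
def db_session():
    session = SessionLocal()   # fresh session (and transaction) per test
    try:
        yield session          # the test does its reads/writes here
    finally:
        session.rollback()     # discard uncommitted changes
        session.close()

If the code under test commits, the usual stronger variant wraps the session in a connection-level outer transaction and rolls back the connection at teardown, so even committed changes do not leak between tests.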

Your First Flaky Test Fix

Systematic approach for first fix:

Step 1: Reproduce (Day 1)

# Run test 100 times, capture failures
pytest --count=100 --verbose test_flaky.py | tee output.log

Step 2: Categorize (Day 1)

Check output.log:

  • Same failure message? → Likely timing/race condition
  • Different failures? → Likely test interdependence
  • Only fails in CI? → Environment difference

Step 3: Fix Based on Category (Day 2)

If timing issue:

# Before
time.sleep(2)
assert element.text == "Loaded"

# After
WebDriverWait(driver, 10).until(lambda d: element.text == "Loaded")

If interdependence:

# Before
user = User.objects.get(id=1)  # Assumes user exists

# After
user = create_test_user(id=f"test_{uuid4()}")  # Creates own data

Step 4: Validate (Day 2)

# Must pass 100/100 times
pytest --count=100 test_flaky.py
# Expected: 100 passed

Step 5: Monitor (Week 1)

Track in CI - test should maintain >99% pass rate for 1 week before considering it fixed.

CI-Only Flakiness (Can't Reproduce Locally)

Symptom: Test fails intermittently in CI but passes 100% locally

Root cause: Environment differences between CI and local (resources, parallelization, timing)

Systematic CI Debugging

Step 1: Environment Fingerprinting

Capture exact environment in both CI and locally:

# Add to conftest.py
import os, sys, platform, tempfile

def pytest_configure(config):
    print(f"Python: {sys.version}")
    print(f"Platform: {platform.platform()}")
    print(f"CPU count: {os.cpu_count()}")
    print(f"TZ: {os.environ.get('TZ', 'not set')}")
    print(f"Temp dir: {tempfile.gettempdir()}")
    print(f"Parallel: {os.environ.get('PYTEST_XDIST_WORKER', 'not parallel')}")

Run in both environments, compare all outputs.
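
One way to compare the two fingerprints (file names are illustrative):

# Capture the fingerprint from a local run and from CI, then diff the two files
pytest --collect-only -q > env_local.txt 2>&1
# download the equivalent CI output (e.g. as a workflow artifact) to env_ci.txt, then:
diff env_local.txt env_ci.txt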

Step 2: Increase CI Observation Window

For low-probability failures (<5%), run more iterations:

# GitHub Actions example
- name: Run test 200x to catch 1% flake
  run: pytest --count=200 --verbose --log-cli-level=DEBUG test.py

- name: Upload failure artifacts
  if: failure()
  uses: actions/upload-artifact@v3
  with:
    name: failure-logs
    path: |
      *.log
      screenshots/

Step 3: Check CI-Specific Factors

| Factor | Diagnostic | Fix |
|---|---|---|
| Parallelization | Run pytest -n 4 locally | Add test isolation (unique IDs, transactions) |
| Resource limits | Compare CI RAM/CPU to local | Mock expensive operations, add retries |
| Cold starts | First run vs warm runs | Check caching assumptions |
| Disk I/O speed | CI may use slower disks | Mock file operations |
| Network latency | CI network may be slower/different | Mock external calls |

Step 4: Replicate CI Environment Locally

Use exact CI container:

# Match the CI runner image (GitHub-hosted runners are Ubuntu-based, e.g. ubuntu:22.04)
docker run -it -v "$(pwd)":/work -w /work ubuntu:22.04 bash

# Inside the container: install Python and the test tooling
apt-get update && apt-get install -y python3 python3-pip
pip3 install pytest pytest-repeat  # plus your project's own dependencies

# Run the test in the container
pytest --count=500 test.py

Step 5: Enable CI Debug Mode

# GitHub Actions - Interactive debugging
- name: Setup tmate session (on failure)
  if: failure()
  uses: mxschmitt/action-tmate@v3

Quick CI Debugging Checklist

When test fails only in CI:

  • Capture environment fingerprint in both CI and local
  • Run test with parallelization locally (pytest -n auto)
  • Check for resource contention (CPU, memory, disk)
  • Compare timezone settings (TZ env var)
  • Upload CI artifacts (logs, screenshots) on failure
  • Replicate CI environment with Docker
  • Check for cold start issues (first vs subsequent runs)

Common Mistakes

❌ Using Retries as Permanent Solution

Fix: Retries (@pytest.mark.flaky or --reruns) are temporary mitigation during investigation, not fixes


❌ No Flakiness Tracking

Fix: Track pass rates in CI, set up alerts for tests dropping below 99%


❌ Fixing Flaky Tests by Making Them Slower

Fix: Diagnose root cause - don't just add more wait time


❌ Ignoring Flaky Tests

Fix: Quarantine workflow - either fix or delete, never ignore indefinitely

Quick Reference

Flakiness Thresholds:

  • <1% flake rate: Monitor, investigate if recurring
  • 1-5%: Quarantine + fix (medium priority)
  • 5-10%: Quarantine + fix urgently (high priority)
  • >10%: Disable immediately (critical)

Root Cause Categories:

  1. Timing/race conditions → Explicit waits
  2. Test interdependence → Unique IDs, test isolation
  3. External dependencies → Mocking
  4. Time bombs → Freeze time
  5. Resource contention → Unique resources

Detection Tools:

  • pytest-repeat (statistical detection)
  • pytest-randomly (interdependence)
  • pytest-xdist (race conditions)

Quarantine Process:

  1. Detect (>1% flake rate)
  2. Quarantine (mark, exclude from CI)
  3. Track (create issue)
  4. Fix (assign owner, 2-week SLA)
  5. Validate (100/100 passes)
  6. Re-enable (monitor 1 week)

Bottom Line

Flaky tests are fixable - find the root cause, don't mask with retries.

Use detection tools to find flaky tests early. Categorize by symptom, diagnose root cause, apply pattern-based fix. Quarantine if needed, but always with SLA to fix or delete.