| name | flaky-test-prevention |
| description | Use when debugging intermittent test failures, choosing between retries vs fixes, quarantining flaky tests, calculating flakiness rates, or preventing non-deterministic behavior - provides root cause diagnosis, anti-patterns, and systematic debugging |
Flaky Test Prevention
Overview
Core principle: Fix root causes, don't mask symptoms.
Rule: Flaky tests indicate real problems - in test design, application code, or infrastructure.
Flakiness Decision Tree
| Symptom | Root Cause Category | Diagnostic | Fix |
|---|---|---|---|
| Passes alone, fails in suite | Test Interdependence | Run tests in random order | Use test isolation (transactions, unique IDs) |
| Fails randomly ~10% | Timing/Race Condition | Add logging, run 100x | Replace sleeps with explicit waits |
| Fails only in CI, not locally | Environment Difference | Compare CI vs local env | Match environments, use containers |
| Fails at specific times | Time Dependency | Check for date/time usage | Mock system time |
| Fails under load | Resource Contention | Run in parallel locally | Add resource isolation, increase limits |
| Different results each run | Non-Deterministic Code | Check for randomness | Seed random generators, use fixtures |
First step: Identify symptom, trace to root cause category.
Anti-Patterns Catalog
❌ Sleepy Assertion
Symptom: Using fixed sleep() or wait() instead of condition-based waits
Why bad: Wastes time on fast runs, still fails on slow runs, brittle
Fix: Explicit waits for conditions
# ❌ Bad
time.sleep(5) # Hope 5 seconds is enough
assert element.text == "Loaded"
# ✅ Good
WebDriverWait(driver, 10).until(
    lambda d: d.find_element(By.ID, "status").text == "Loaded"
)
assert driver.find_element(By.ID, "status").text == "Loaded"
❌ Test Interdependence
Symptom: Tests pass when run in specific order, fail when shuffled
Why bad: Hidden dependencies, can't run in parallel, breaks test isolation
Fix: Each test creates its own data, no shared state
# ❌ Bad
def test_create_user():
    user = create_user("test_user")  # Shared ID

def test_update_user():
    update_user("test_user")  # Depends on test_create_user

# ✅ Good
def test_create_user():
    user_id = f"user_{uuid4()}"
    user = create_user(user_id)

def test_update_user():
    user_id = f"user_{uuid4()}"
    user = create_user(user_id)  # Independent
    update_user(user_id)
❌ Hidden Dependencies
Symptom: Tests fail due to external state (network, database, file system) beyond test control
Why bad: Unpredictable failures, environment-specific issues
Fix: Mock external dependencies
# ❌ Bad
def test_weather_api():
    response = requests.get("https://api.weather.com/...")
    assert response.json()["temp"] > 0  # Fails if API is down

# ✅ Good
@mock.patch('requests.get')
def test_weather_api(mock_get):
    mock_get.return_value.json.return_value = {"temp": 75}
    response = get_weather("Seattle")
    assert response["temp"] == 75
❌ Time Bomb
Symptom: Tests that depend on current date/time and fail at specific moments (midnight, month boundaries, DST)
Why bad: Fails unpredictably based on when tests run
Fix: Mock system time
# ❌ Bad
def test_expiration():
    created_at = datetime.now()
    assert is_expired(created_at) == False  # Fails at midnight

# ✅ Good
@freeze_time("2025-11-15 12:00:00")
def test_expiration():
    created_at = datetime(2025, 11, 15, 12, 0, 0)
    assert is_expired(created_at) == False
❌ Timeout Inflation
Symptom: Continuously increasing timeouts to "fix" flaky tests (5s → 10s → 30s)
Why bad: Masks root cause, slows test suite, doesn't guarantee reliability
Fix: Investigate why operation is slow, use explicit waits
# ❌ Bad
await page.waitFor(30000) # Increased from 5s hoping it helps
# ✅ Good
await page.waitForSelector('.data-loaded', {timeout: 10000})
await page.waitForNetworkIdle()
Detection Strategies
Proactive Identification
Run tests multiple times (statistical detection):
# pytest with repeat plugin
pip install pytest-repeat
pytest --count=50 test_flaky.py
# Track pass rate
# 50/50 = 100% reliable
# 45/50 = 90% pass rate (10% flake rate - investigate immediately)
# <95% pass rate = quarantine
CI Integration (automatic tracking):
# GitHub Actions example
- name: Run tests with flakiness detection
  run: |
    pytest --count=3 --junit-xml=results.xml
    python scripts/calculate_flakiness.py results.xml
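The scripts/calculate_flakiness.py helper referenced above is not a pytest built-in; a minimal sketch that parses the JUnit XML report and prints a per-test pass rate could look like this (the file name and output format are assumptions):
# calculate_flakiness.py - summarize per-test pass rates from a JUnit XML report
import sys
import xml.etree.ElementTree as ET
from collections import defaultdict

def report(xml_path):
    counts = defaultdict(lambda: [0, 0])  # test id -> [runs, failures]
    for case in ET.parse(xml_path).getroot().iter("testcase"):
        raw_name = case.get("name", "")
        # Strip the repeat suffix added by pytest-repeat (also collapses parametrized IDs)
        name = f"{case.get('classname')}::{raw_name.split('[')[0]}"
        counts[name][0] += 1
        if case.find("failure") is not None or case.find("error") is not None:
            counts[name][1] += 1
    for name, (runs, failures) in sorted(counts.items()):
        flake_rate = 100 * failures / runs
        print(f"{name}: {runs - failures}/{runs} passed ({flake_rate:.1f}% flake rate)")

if __name__ == "__main__":
    report(sys.argv[1])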
Flakiness metrics to track:
- Pass rate per test (target: >99%)
- Mean Time Between Failures (MTBF)
- Failure clustering (same test failing together)
Systematic Debugging
When a test fails intermittently:
- Reproduce consistently - Run 100x to establish failure rate
- Isolate - Run alone, with subset, with full suite (find interdependencies)
- Add logging - Capture state before assertion, screenshot on failure (see the hook sketch after this list)
- Bisect - If fails in suite, binary search which other test causes it
- Environment audit - Compare CI vs local (env vars, resources, timing)
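For the "Add logging" step, a pytest hook can capture diagnostics automatically whenever a test fails. A rough sketch, assuming a Selenium fixture named driver and a screenshots/ output directory (both are assumptions):
# conftest.py - capture extra context on every test failure
import os
import pytest

@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_makereport(item, call):
    outcome = yield
    report = outcome.get_result()
    if report.when == "call" and report.failed:
        print(f"[flaky-debug] {item.nodeid} failed at {call.stop:.3f}")
        driver = item.funcargs.get("driver")  # present only if the test used the fixture
        if driver is not None:
            os.makedirs("screenshots", exist_ok=True)
            driver.save_screenshot(f"screenshots/{item.name}.png")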
Flakiness Metrics Guide
Calculating flake rate:
# Flakiness formula
flake_rate = (failed_runs / total_runs) * 100
# Example
# Test run 100 times: 7 failures
# Flake rate = 7/100 = 7%
Thresholds:
| Flake Rate | Action | Priority |
|---|---|---|
| 0% (100% pass) | Reliable | Monitor |
| 0.1-1% | Investigate | Low |
| 1-5% | Quarantine + Fix | Medium |
| 5-10% | Quarantine + Fix Urgently | High |
| >10% | Disable immediately | Critical |
Target: All tests should maintain >99% pass rate (< 1% flake rate)
Quarantine Workflow
Purpose: Keep CI green while fixing flaky tests systematically
Process:
- Detect - Test fails >1% of runs
- Quarantine - Mark with @pytest.mark.quarantine, exclude from CI
- Track - Create issue with flake rate, failure logs, reproduction steps
- Fix - Assign owner, set SLA (e.g., 2 weeks to fix or delete)
- Validate - Run fixed test 100x, must achieve >99% pass rate
- Re-Enable - Remove quarantine mark, monitor for 1 week
Marking quarantined tests:
@pytest.mark.quarantine(reason="Flaky due to timing issue #1234")
@pytest.mark.skip("Quarantined")
def test_flaky_feature():
    pass
CI configuration:
# Run all tests except quarantined
pytest -m "not quarantine"
SLA: Quarantined tests must be fixed within 2 weeks or deleted. No test stays quarantined indefinitely.
Tool Ecosystem Quick Reference
| Tool | Purpose | When to Use |
|---|---|---|
| pytest-repeat | Run test N times | Statistical detection |
| pytest-xdist | Parallel execution | Expose race conditions |
| pytest-rerunfailures | Auto-retry on failure | Temporary mitigation during fix |
| pytest-randomly | Randomize test order | Detect test interdependence |
| freezegun | Mock system time | Fix time bombs |
| pytest-timeout | Prevent hanging tests | Catch infinite loops |
Installation:
pip install pytest-repeat pytest-xdist pytest-rerunfailures pytest-randomly freezegun pytest-timeout
Usage examples:
# Detect flakiness (run 50x)
pytest --count=50 test_suite.py
# Detect interdependence (random order)
pytest --randomly-seed=12345 test_suite.py
# Expose race conditions (parallel)
pytest -n 4 test_suite.py
# Temporary mitigation (reruns, not a fix!)
pytest --reruns 2 --reruns-delay 1 test_suite.py
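pytest-rerunfailures and pytest-timeout also support per-test markers, which keeps the mitigation visible next to the offending test; a sketch (the test name is illustrative, and reruns remain a stopgap, not a fix):
import pytest

@pytest.mark.flaky(reruns=2, reruns_delay=1)  # pytest-rerunfailures: temporary mitigation
@pytest.mark.timeout(30)                      # pytest-timeout: fail fast instead of hanging
def test_checkout_flow():
    ...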
Prevention Checklist
Use during test authoring to prevent flakiness (a fixture sketch enforcing several items follows the list):
- No fixed time.sleep() - use explicit waits for conditions
- Each test creates its own data (UUID-based IDs)
- No shared global state between tests
- External dependencies mocked (APIs, network, databases)
- Time/date frozen with @freeze_time if time-dependent
- Random values seeded (random.seed(42))
- Tests pass when run in any order (pytest --randomly-seed)
- Tests pass when run in parallel (pytest -n 4)
- Tests pass 100/100 times (pytest --count=100)
- Teardown cleans up all resources (files, database, cache)
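Several checklist items can be enforced once in shared fixtures rather than in every test. A minimal conftest.py sketch (fixture names are illustrative) that seeds randomness and hands each test a unique ID and its own temp directory:
# conftest.py - shared isolation helpers
import random
import uuid
import pytest

@pytest.fixture(autouse=True)
def _seeded_random():
    random.seed(42)  # deterministic random values in every test

@pytest.fixture
def unique_id():
    return f"test_{uuid.uuid4()}"  # per-test ID, safe to use in parallel runs

@pytest.fixture
def isolated_dir(tmp_path):
    return tmp_path  # pytest's built-in tmp_path is already unique per test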
Common Fixes Quick Reference
| Problem | Fix Pattern | Example |
|---|---|---|
| Timing issues | Explicit waits | WebDriverWait(driver, 10).until(condition) |
| Test interdependence | Unique IDs per test | user_id = f"test_{uuid4()}" |
| External dependencies | Mock/stub | @mock.patch('requests.get') |
| Time dependency | Freeze time | @freeze_time("2025-11-15") |
| Random behavior | Seed randomness | random.seed(42) |
| Shared state | Test isolation | Transactions, teardown fixtures (see sketch below) |
| Resource contention | Unique resources | Separate temp dirs, DB namespaces |
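For the shared-state row, one common isolation pattern is to wrap each test in a database transaction that is rolled back during teardown. A minimal sketch, assuming SQLAlchemy and an in-memory SQLite database (the fixture name and connection URL are illustrative):
# conftest.py - roll back everything a test wrote to the database
import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

@pytest.fixture
def db_session():
    engine = create_engine("sqlite:///:memory:")  # placeholder connection URL
    connection = engine.connect()
    transaction = connection.begin()
    session = sessionmaker(bind=connection)()
    yield session  # the test uses this session
    session.close()
    transaction.rollback()  # discard all writes made during the test
    connection.close()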
Your First Flaky Test Fix
Systematic approach for first fix:
Step 1: Reproduce (Day 1)
# Run test 100 times, capture failures
pytest --count=100 --verbose test_flaky.py | tee output.log
Step 2: Categorize (Day 1)
Check output.log:
- Same failure message? → Likely timing/race condition
- Different failures? → Likely test interdependence
- Only fails in CI? → Environment difference
Step 3: Fix Based on Category (Day 2)
If timing issue:
# Before
time.sleep(2)
assert element.text == "Loaded"
# After
wait.until(lambda: element.text == "Loaded")
If interdependence:
# Before
user = User.objects.get(id=1) # Assumes user exists
# After
user = create_test_user(id=f"test_{uuid4()}") # Creates own data
Step 4: Validate (Day 2)
# Must pass 100/100 times
pytest --count=100 test_flaky.py
# Expected: 100 passed
Step 5: Monitor (Week 1)
Track in CI - test should maintain >99% pass rate for 1 week before considering it fixed.
CI-Only Flakiness (Can't Reproduce Locally)
Symptom: Test fails intermittently in CI but passes 100% locally
Root cause: Environment differences between CI and local (resources, parallelization, timing)
Systematic CI Debugging
Step 1: Environment Fingerprinting
Capture exact environment in both CI and locally:
# Add to conftest.py
import os, sys, platform, tempfile
def pytest_configure(config):
print(f"Python: {sys.version}")
print(f"Platform: {platform.platform()}")
print(f"CPU count: {os.cpu_count()}")
print(f"TZ: {os.environ.get('TZ', 'not set')}")
print(f"Temp dir: {tempfile.gettempdir()}")
print(f"Parallel: {os.environ.get('PYTEST_XDIST_WORKER', 'not parallel')}")
Run in both environments, compare all outputs.
Step 2: Increase CI Observation Window
For low-probability failures (<5%), run more iterations:
# GitHub Actions example
- name: Run test 200x to catch 1% flake
  run: pytest --count=200 --verbose --log-cli-level=DEBUG test.py
- name: Upload failure artifacts
  if: failure()
  uses: actions/upload-artifact@v3
  with:
    name: failure-logs
    path: |
      *.log
      screenshots/
Step 3: Check CI-Specific Factors
| Factor | Diagnostic | Fix |
|---|---|---|
| Parallelization | Run pytest -n 4 locally | Add test isolation (unique IDs, transactions) |
| Resource limits | Compare CI RAM/CPU to local | Mock expensive operations, add retries |
| Cold starts | First run vs warm runs | Check caching assumptions |
| Disk I/O speed | CI may use slower disks | Mock file operations |
| Network latency | CI network may be slower/different | Mock external calls |
Step 4: Replicate CI Environment Locally
Use exact CI container:
# GitHub Actions runners use Ubuntu 22.04
docker run -it ubuntu:22.04 bash

# Inside the container: install Python, pip, and the test dependencies
apt-get update && apt-get install -y python3 python3-pip
pip3 install pytest pytest-repeat

# Run the test in the container
pytest --count=500 test.py
Step 5: Enable CI Debug Mode
# GitHub Actions - Interactive debugging
- name: Setup tmate session (on failure)
  if: failure()
  uses: mxschmitt/action-tmate@v3
Quick CI Debugging Checklist
When test fails only in CI:
- Capture environment fingerprint in both CI and local
- Run test with parallelization locally (pytest -n auto)
- Check for resource contention (CPU, memory, disk)
- Compare timezone settings (TZ env var)
- Upload CI artifacts (logs, screenshots) on failure
- Replicate CI environment with Docker
- Check for cold start issues (first vs subsequent runs)
Common Mistakes
❌ Using Retries as Permanent Solution
Fix: Retries (@pytest.mark.flaky or --reruns) are temporary mitigation during investigation, not fixes
❌ No Flakiness Tracking
Fix: Track pass rates in CI, set up alerts for tests dropping below 99%
❌ Fixing Flaky Tests by Making Them Slower
Fix: Diagnose root cause - don't just add more wait time
❌ Ignoring Flaky Tests
Fix: Quarantine workflow - either fix or delete, never ignore indefinitely
Quick Reference
Flakiness Thresholds:
- <1% flake rate: Monitor
- 1-5%: Quarantine + fix (medium priority)
- >5%: Disable + fix urgently (high priority)
Root Cause Categories:
- Timing/race conditions → Explicit waits
- Test interdependence → Unique IDs, test isolation
- External dependencies → Mocking
- Time bombs → Freeze time
- Resource contention → Unique resources
Detection Tools:
- pytest-repeat (statistical detection)
- pytest-randomly (interdependence)
- pytest-xdist (race conditions)
Quarantine Process:
- Detect (>1% flake rate)
- Quarantine (mark, exclude from CI)
- Track (create issue)
- Fix (assign owner, 2-week SLA)
- Validate (100/100 passes)
- Re-enable (monitor 1 week)
Bottom Line
Flaky tests are fixable - find the root cause, don't mask with retries.
Use detection tools to find flaky tests early. Categorize by symptom, diagnose root cause, apply pattern-based fix. Quarantine if needed, but always with SLA to fix or delete.