name	mutation-testing
description	Validate test suite quality using mutation testing with mutmut or cosmic-ray to detect weak tests, calculate mutation scores, and improve test coverage. Use when validating test effectiveness, achieving high test quality, or detecting tests that pass but don't validate behavior.
allowed-tools	Read, Bash, Grep

Mutation Testing for Test Quality Validation

Mutation Testing Concept
Tools Comparison
mutmut Workflow
Interpreting Results
Limitations & Caveats
Best Practices

1. Mutation Testing Concept

What is Mutation Testing?

Mutation testing validates test quality by introducing small, deliberate bugs (mutations) into your code and checking if your tests catch them.

Core Principle: If you change working code and tests still pass, those tests aren't validating behavior properly.

Why It Matters

Problem: 100% code coverage doesn't guarantee test quality.

# Code with 100% coverage
def calculate_discount(price, is_member):
    if is_member:
        return price * 0.9
    return price

# Weak test (100% coverage, but doesn't validate behavior)
def test_calculate_discount():
    result = calculate_discount(100, True)
    # No assertion - test always passes!

This test executes all code (100% coverage) but validates nothing. Mutation testing catches this.

Mutation Operators

Mutation testing applies these transformations:

Operator	Example	Description
Arithmetic	`+` → `-`, `*` → `/`	Change math operators
Comparison	`==` → `!=`, `>` → `<`	Flip comparisons
Boolean	`and` → `or`, `True` → `False`	Change logic
Statement deletion	Remove line	Delete code
Constant modification	`10` → `11`, `"foo"` → `"XXfooXX"`	Modify values
Return values	`return x` → `return None`	Change returns

Mutation Outcomes

Killed Mutation (Good):

Test fails when code is mutated
Test correctly validates behavior
Example: Change price * 0.9 to price * 0.8, test fails because assert result == 90

Survived Mutation (Bad):

Test passes even when code is mutated
Test doesn't validate behavior
Example: Change price * 0.9 to price * 0.8, test still passes because no assertion

Mutation Score: (Killed mutations / Total mutations) * 100%

Target: 80%+ mutation score indicates high-quality tests

2. Tools Comparison

mutmut (Recommended)

Pros:

Simple installation and configuration
Works almost out-of-the-box
Faster execution than cosmic-ray
Good default mutation operators
Active maintenance

Cons:

Limited customization
Fewer mutation operators than cosmic-ray
Single-threaded (no native parallelization)

Installation:

pip install mutmut

Best For: Most Python projects, getting started with mutation testing, CI/CD integration

cosmic-ray (Advanced)

Pros:

Highly customizable
More mutation operators (30+ vs mutmut's 10)
Distributed execution support
Detailed reporting

Cons:

Complex configuration
Slower execution
Steeper learning curve
More dependencies

Installation:

pip install cosmic-ray

Best For: Large projects needing distributed execution, advanced mutation operators, research

Recommendation

Start with mutmut, upgrade to cosmic-ray if you need:

Custom mutation operators
Distributed execution across multiple machines
More granular control over mutation selection

3. mutmut Workflow

Basic Usage

# Run mutation testing on entire project
mutmut run

# See results summary
mutmut results
# Output:
# Survived: 15
# Killed: 85
# Timeout: 2
# Suspicious: 1
# Total: 103
# Mutation score: 82.5%

# Show details of survived mutation #5
mutmut show 5

# Generate HTML report
mutmut html

Targeted Testing

# Test specific file
mutmut run --paths-to-mutate=src/processors/asset_processor.py

# Test specific directory
mutmut run --paths-to-mutate=src/processors/

# Rerun only failed mutations
mutmut run --rerun-all

# Run with custom test command
mutmut run --runner="pytest -x tests/unit/ --tb=short"

# Run specific mutation IDs
mutmut run 5 10 15

Configuration

pyproject.toml:

[tool.mutmut]
paths_to_mutate = "src/"
backup = false
runner = "pytest -x --tb=short"
tests_dir = "tests/"
dict_synonyms = "Struct, NamedStruct"

setup.cfg:

[mutmut]
paths_to_mutate=src/
backup=False
runner=pytest -x --tb=short
tests_dir=tests/

Quick Workflow Commands

# 1. Initialize (first run)
mutmut run

# 2. Review results
mutmut results

# 3. Show survived mutations
mutmut show

# 4. Fix tests to kill mutations

# 5. Rerun failed mutations
mutmut run --rerun-all

# 6. Generate report
mutmut html

# 7. Open report in browser
open html/index.html

4. Interpreting Results

Mutation Status Codes

Status	Meaning	Action
Killed	Test failed (mutation caught)	✅ GOOD - Test validates behavior
Survived	Test passed (mutation not caught)	❌ BAD - Test doesn't validate behavior
Suspicious	Tests took too long	⚠️ Possible infinite loop, investigate
Timeout	Mutation caused timeout	⚠️ Check for infinite loops
Skipped	Couldn't apply mutation	ℹ️ Ignore (rare, usually syntax edge cases)

Mutation Score Targets

Score	Quality	Action
< 50%	Weak test suite	Major gaps, prioritize test improvements
50-70%	Moderate quality	Room for improvement, focus on critical paths
70-80%	Good quality	Few gaps, polish edge cases
80-90%	Excellent quality	High confidence, minimal improvements needed
> 90%	Exceptional	Diminishing returns, focus elsewhere

Target: Aim for 80%+ mutation score for production code.

Reading Mutation Details

mutmut show 5

Output:

--- src/processors/asset_processor.py
+++ [Mutant] src/processors/asset_processor.py
@@ -123,7 +123,7 @@
     def calculate_match_quality(self, match_level):
-        if match_level >= 7:
+        if match_level > 7:
             return "High"
         return "Low"

Analysis:

Change: >= → >
Status: Survived
Problem: No test validates boundary case match_level == 7
Fix: Add test assert calculate_match_quality(7) == "High"

Common Weak Test Patterns

1. Missing Assertion:

# Weak: No assertion
def test_function():
    result = function()
    # Test passes without validating result

# Strong: Specific assertion
def test_function():
    result = function()
    assert result == expected_value

2. Assertion Too Broad:

# Weak: Too generic
assert result is not None

# Strong: Exact value
assert result == "Expected Value"

3. Missing Boundary Cases:

# Weak: Only normal case
def test_threshold():
    assert get_severity(8.0) == "High"

# Strong: Normal + boundaries
def test_threshold():
    assert get_severity(8.0) == "High"
    assert get_severity(7.0) == "High"  # Boundary
    assert get_severity(6.9) == "Low"   # Just below

5. Limitations & Caveats

Limitation 1: Slow Execution

Problem: Mutation testing is inherently slow.

Impact: 1000 mutations × 10s test suite = ~3 hours

Mitigation:

Run on nightly builds, not every commit
Use incremental testing (changed files only)
Parallelize execution manually (GNU parallel)
Optimize test suite performance

Limitation 2: Equivalent Mutations

Problem: Some mutations don't change behavior.

Example:

# Original
i += 1

# Mutation (logically equivalent)
i = i + 1

Both are functionally identical, but mutation testing marks this as "survived".

Mitigation:

Manual review of survived mutations
Accept some survived mutations (document why)
Focus on mutation score trends, not absolute numbers

Limitation 3: High False Positive Rate

Problem: Not all survived mutations indicate weak tests.

Examples:

Equivalent mutations
Defensive code never executed in practice
Logging statements (mutation doesn't affect behavior)

Example:

def process_data(data):
    LOG.debug("Processing %d items", len(data))  # Mutation survives
    return [transform(item) for item in data]

Mutating the log message doesn't affect behavior.

Mitigation:

Review survived mutations manually
Exclude logging/debug code from mutation testing
Document accepted survived mutations

Limitation 4: Doesn't Replace Code Coverage

Best Practice: Require both 95% coverage AND 80% mutation score.

Why:

Coverage finds untested code
Mutation testing finds weak tests
Both needed for comprehensive quality

Limitation 5: Limited to Unit Tests

Problem: Mutation testing works best on unit tests, not integration tests.

Why: Integration tests are slow, making mutation testing impractical.

Best Practice: Run mutation testing on unit tests only.

6. Best Practices

Start Small & Iterate

Don't run on entire codebase immediately
Do test one critical module, expand gradually
Fix 5-10 survived mutations per cycle
Track score over time (aim for gradual improvement)

Combine with Coverage

Require both 95%+ coverage AND 80%+ mutation score
Coverage finds untested code, mutation testing finds weak tests

Run Strategically

Run in nightly builds (too slow for every commit)
Focus on critical code: security, business logic, data processing
Deprioritize: CLI parsing, logging, configuration, getters/setters

Review Intelligently

Review survived mutations manually (not all indicate weak tests)
Accept some mutations (logging, equivalent mutations)
Document exceptions in .mutmut-exceptions.md
Set realistic 80% target (90%+ has diminishing returns)

Automate

Integrate in CI pipeline (nightly builds)
Use custom scripts to enforce mutation score thresholds
Cache results (.mutmut-cache) for faster reruns

Summary

Mutation Testing in 5 Steps:

Run: mutmut run --paths-to-mutate=your_module.py
Review: mutmut show to see survived mutations
Fix: Add assertions to kill mutations
Verify: mutmut run --rerun-all
Track: Monitor mutation score over time

Key Takeaways:

Mutation testing validates test quality, not code quality
Target 80%+ mutation score for production code
Combine with 95%+ code coverage for comprehensive quality
Run in nightly builds (too slow for every commit)
Review survived mutations manually (some are false positives)
Start small, iterate (critical modules first)

Tools:

mutmut: Recommended for most projects (simple, fast)
cosmic-ray: Advanced users needing customization

Remember: 100% coverage doesn't mean 100% quality. Mutation testing finds the gap.

For comprehensive examples and advanced strategies: See reference.md

mutation-testing

Install Skill

SKILL.md

Mutation Testing for Test Quality Validation

Table of Contents

1. Mutation Testing Concept

What is Mutation Testing?

Why It Matters

Mutation Operators

Mutation Outcomes

2. Tools Comparison

mutmut (Recommended)

cosmic-ray (Advanced)

Recommendation

3. mutmut Workflow

Basic Usage

Targeted Testing

Configuration

Quick Workflow Commands

4. Interpreting Results

Mutation Status Codes

Mutation Score Targets

Reading Mutation Details

Common Weak Test Patterns

5. Limitations & Caveats

Limitation 1: Slow Execution

Limitation 2: Equivalent Mutations

Limitation 3: High False Positive Rate

Limitation 4: Doesn't Replace Code Coverage

Limitation 5: Limited to Unit Tests

6. Best Practices

Start Small & Iterate

Combine with Coverage

Run Strategically

Review Intelligently

Automate

Summary