---
name: green-mirage-audit
description: Use when reviewing test suites, after test runs pass, or when user asks about test quality - performs exhaustive line-by-line audit tracing code paths through entire program, verifying tests actually validate what they claim. Outputs structured report compatible with fix-tests skill.
---
Your job: prove that tests actually verify correctness, or expose where they don't. You treat every passing test with suspicion until you've traced its execution path and verified it would catch real failures.
You are slow, methodical, and unbothered by token constraints. This is production-quality code for critical systems. Thoroughness is mandatory.
You MUST:
- Read every test file line by line
- Trace every code path from test through production code and back
- Verify each assertion would catch actual failures
- Identify all gaps where broken code would still pass
A green test suite means NOTHING if tests don't consume their outputs and verify correctness.
This is NOT optional. This is NOT negotiable. Take as long as needed. You'd better be sure.
**Step 1:** What does this test CLAIM to verify? (from name, docstring, comments)
**Step 2:** What code path does the test actually EXECUTE?
**Step 3:** What do the assertions actually CHECK?
**Step 4:** If the production code returned GARBAGE, would this test CATCH it?
**Step 5:** What specific failure scenario would PASS this test but break production?
Now proceed with this systematic checklist.
## Phase 1: Inventory
Before auditing, create a complete inventory:
## Test Inventory
### Files to Audit
1. path/to/test_file1.py - N tests
2. path/to/test_file2.py - M tests
...
### Production Code Under Test
1. path/to/module1.py - tested by: test_file1.py
2. path/to/module2.py - tested by: test_file1.py, test_file2.py
...
### Estimated Audit Scope
- Total test files: X
- Total test functions: Y
- Total production modules touched: Z
## Phase 2: Systematic Line-by-Line Audit
For EACH test file, work through EVERY test function:
### 2.1 Test Function Analysis Template
### Test: `test_function_name` (file.py:line)
**Purpose (from name/docstring):** What this test claims to verify
**Setup Analysis:**
- Line X: [what's being set up]
- Line Y: [dependencies/mocks introduced]
- Concern: [any setup that hides real behavior?]
**Action Analysis:**
- Line Z: [the actual operation being tested]
- Code path: function() -> calls X -> calls Y -> returns
- Side effects: [files created, state modified, etc.]
**Assertion Analysis:**
- Line A: `assert condition` - Would catch: [what failures] / Would miss: [what failures]
- Line B: `assert condition` - Would catch: [what failures] / Would miss: [what failures]
...
**Verdict:** SOLID / GREEN MIRAGE / PARTIAL
**Gap (if any):** [Specific scenario that passes test but breaks production]
**Fix (if any):** [Concrete code to add]
### 2.2 Code Path Tracing
For each test action, trace the COMPLETE path:
```
test_function()
└─> production_function(args)
    └─> helper_function()
    │   └─> external_call()  [mocked? real?]
    │   └─> returns value
    └─> processes result
    └─> returns final
└─> assertion checks final
```
Questions at each step:
- Is this step tested or assumed to work?
- If this step returned garbage, would the test catch it?
- Are error paths tested or only happy paths?
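For example, a minimal sketch of a trace where these questions expose a mirage (the module, function, and variable names are all hypothetical):

```python
from unittest import mock

# The external boundary is mocked (acceptable), but the assertion is a bare
# truthiness check - garbage returned anywhere along the path still passes.
@mock.patch("shop.payments.charge_card")
def test_checkout(mock_charge):
    mock_charge.return_value = {"status": "ok"}
    result = checkout(cart)   # -> process_order() -> charge_card() -> total
    assert result             # no step is verified: garbage also passes
```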
## Phase 3: The 8 Green Mirage Anti-Patterns
Check EVERY test against ALL patterns:
### Pattern 1: Existence vs. Validity
**Symptom:** Checking something exists without validating correctness.
```python
# GREEN MIRAGE
assert output_file.exists()
assert len(result) > 0
assert response is not None
```
**Question:** If the content was garbage, would this catch it?
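For contrast, a SOLID version asserts the actual content. A minimal sketch (the variable names and expected values are illustrative):

```python
# SOLID - validate content, not just existence
content = output_file.read_text()
assert content == expected_report                 # exact expected content
assert result == ["alpha", "beta", "gamma"]       # full value, not just length
assert response == {"status": "ok", "count": 3}   # the complete object
```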
### Pattern 2: Partial Assertions (CODE SMELL - INVESTIGATE DEEPER)
**Symptom:** Using `in`, substring checks, or partial matches instead of asserting complete values.
This pattern is a STRONG CODE SMELL requiring deeper investigation. Tests should shine a bright light on data, not take a quick glance at it.
```python
# GREEN MIRAGE - Partial assertions hide bugs
assert 'SELECT' in query          # Garbage SQL could contain SELECT
assert 'error' not in output      # Wrong output might not have 'error'
assert expected_id in result      # Result could have wrong structure
assert key in response_dict       # Value at key could be garbage
assert substring in full_string   # Full string could be malformed
```
SOLID tests assert COMPLETE objects:
```python
# SOLID - Full assertions expose everything
assert query == "SELECT id, name FROM users WHERE active = true"
assert output == expected_output   # Exact match, no hiding
assert result == {"id": 123, "name": "test", "status": "active"}
assert response_dict == {"key": "expected_value", "other": 42}
```
**Investigation Required When Found:**
- WHY is this a partial assertion? What is the test avoiding checking?
- WHAT could be wrong with the unchecked parts?
- HOW would a complete assertion change this test?
- IS the partial assertion hiding implementation uncertainty?
**The Rule:** If you can't assert the complete value, you don't understand what the code produces. Fix that first.
### Pattern 3: Shallow String/Value Matching
**Symptom:** Checking keywords without validating structure.
```python
# GREEN MIRAGE
assert 'SELECT' in query
assert 'error' not in output
assert result.status == 'success'  # But is the data correct?
```
**Question:** Could syntactically broken output still contain this keyword?
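A SOLID counterpart parses the output and asserts the full structure, so syntactically broken output fails before any keyword check. A sketch assuming, for illustration, that the output is JSON:

```python
import json

# SOLID - structure validated by parsing, content by exact match
payload = json.loads(output)   # syntactically broken output fails right here
assert payload == {"status": "success", "data": {"id": 123, "value": 42}}
```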
### Pattern 4: Lack of Consumption
**Symptom:** Never USING the generated output in a way that validates it.
```python
# GREEN MIRAGE
generated_code = compiler.generate()
assert generated_code  # Never compiled!

result = api.fetch_data()
assert result  # Never deserialized or used!
```
**Question:** Is this output ever compiled/parsed/executed/deserialized?
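One way to consume the outputs, assuming (for illustration only) that the generator emits Python source and the API returns a JSON string:

```python
import json

# SOLID - the outputs are actually used
generated_code = compiler.generate()
compile(generated_code, "<generated>", "exec")   # syntax errors now fail the test

result = api.fetch_data()
data = json.loads(result)                        # malformed payloads fail here
assert data == {"records": [{"id": 123}]}        # and the content is exact
```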
### Pattern 5: Mocking Reality Away
**Symptom:** Mocking the system under test, not just external dependencies.
```python
# GREEN MIRAGE - tests the mock, not the code
@mock.patch('mymodule.core_logic')
def test_processing(mock_logic):
    mock_logic.return_value = expected
    result = process()  # core_logic never runs!
```
**Question:** Is the ACTUAL code path exercised, or just mocks?
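A SOLID version mocks only the external boundary so the real logic still runs. A sketch with a hypothetical `http_client` dependency and an assumed transformation:

```python
from unittest import mock

# SOLID - mock the boundary, exercise the real code path
@mock.patch("mymodule.http_client.get")
def test_processing(mock_get):
    mock_get.return_value = {"id": 123, "name": "test"}
    result = process()                             # core_logic actually executes
    assert result == {"id": 123, "name": "TEST"}   # real transformation verified
```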
### Pattern 6: Swallowed Errors
**Symptom:** Exceptions caught and ignored, error codes unchecked.
```python
# GREEN MIRAGE
try:
    risky_operation()
except Exception:
    pass  # Bug hidden!

result = command()  # Return code ignored
```
**Question:** Would this test fail if an exception was raised?
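A SOLID version lets errors surface as failures and checks return codes explicitly. A sketch using pytest (the expected error type and the `mytool` command are assumptions):

```python
import subprocess
import pytest

def test_risky_operation_raises_on_bad_input():
    with pytest.raises(ValueError):              # expected error is explicit
        risky_operation(bad_input)

def test_command_exits_cleanly():
    proc = subprocess.run(["mytool", "--check"], capture_output=True, text=True)
    assert proc.returncode == 0, proc.stderr     # return code actually checked
```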
### Pattern 7: State Mutation Without Verification
**Symptom:** Test triggers side effects but never verifies the resulting state.
```python
# GREEN MIRAGE
user.update_profile(new_data)
assert user.update_profile  # Always true: asserts the method object, not the result

db.insert(record)
# Never queries DB to verify record exists and is correct
```
**Question:** After the mutation, is the actual state verified?
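A SOLID version reads the state back after the mutation. A sketch (the `User.get` and `db.query` accessors are hypothetical stand-ins for whatever the codebase provides):

```python
# SOLID - verify the resulting state, not the call
user.update_profile(new_data)
refreshed = User.get(user.id)                 # re-fetch from the source of truth
assert refreshed.email == new_data["email"]   # the mutation actually took effect

db.insert(record)
stored = db.query("SELECT * FROM records WHERE id = ?", record.id)
assert stored == record                       # the row exists and is correct
```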
### Pattern 8: Incomplete Branch Coverage
**Symptom:** Happy path tested, error paths assumed.
```python
# Tests only success case
def test_process_data():
    result = process(valid_data)
    assert result.success

# Missing: test_process_invalid_data, test_process_empty, test_process_malformed
```
**Question:** What happens when input is invalid/empty/malformed/boundary?
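The missing tests might look like this sketch; the error types and the empty-input contract are assumptions to be replaced with the code's real behavior:

```python
import pytest

def test_process_invalid_data():
    with pytest.raises(ValidationError):     # hypothetical error type
        process(invalid_data)

def test_process_empty():
    assert process([]) == EmptyResult()      # explicit empty-input contract

def test_process_malformed():
    with pytest.raises(ParseError):          # hypothetical error type
        process(malformed_data)
```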
## Phase 4: Cross-Test Analysis
After auditing individual tests, analyze the suite as a whole:
### 4.1 Coverage Gaps
## Functions/Methods Never Tested
- module.function_a() - no direct test
- module.function_b() - only tested as side effect of other tests
- module.Class.method_c() - no test
## Error Paths Never Tested
- What happens when X fails?
- What happens when Y returns None?
- What happens when Z raises exception?
## Edge Cases Never Tested
- Empty input
- Maximum size input
- Boundary values
- Concurrent access
- Resource exhaustion
### 4.2 Test Isolation Issues
## Tests That Depend on Other Tests
- test_B assumes test_A ran first (shared state)
## Tests That Depend on External State
- test_X requires specific environment variable
- test_Y requires database to be in specific state
## Tests That Don't Clean Up
- test_Z creates files but doesn't delete them
### 4.3 Assertion Density Analysis
## Tests With Weak Assertions
| Test | Lines of Code | Assertions | Ratio | Concern |
|------|---------------|------------|-------|---------|
| test_complex_flow | 50 | 1 | 1:50 | Single assertion for complex flow |
## Phase 5: Findings Report
### 5.1 Machine-Parseable Summary Block
At the START of your findings report, output this YAML block:
```yaml
---
# GREEN MIRAGE AUDIT REPORT
# Generated: [ISO 8601 timestamp]
# Audited by: green-mirage-audit skill

audit_metadata:
  timestamp: "2024-01-15T10:30:00Z"
  test_files_audited: 5
  test_functions_audited: 47
  production_files_touched: 12

summary:
  total_tests: 47
  solid: 31
  green_mirage: 12
  partial: 4

patterns_found:
  pattern_1_existence_vs_validity: 3
  pattern_2_partial_assertions: 4
  pattern_3_shallow_matching: 2
  pattern_4_lack_of_consumption: 1
  pattern_5_mocking_reality: 0
  pattern_6_swallowed_errors: 1
  pattern_7_state_mutation: 1
  pattern_8_incomplete_branches: 4

findings:
  - id: "finding-1"
    priority: critical
    test_file: "tests/test_auth.py"
    test_function: "test_login_success"
    line_number: 45
    pattern: 2
    pattern_name: "Partial Assertions"
    effort: trivial        # trivial | moderate | significant
    depends_on: []         # IDs of findings that must be fixed first
    blind_spot: "Login could return malformed user object and test would pass"
    production_impact: "Broken user sessions in production"

  - id: "finding-2"
    priority: critical
    test_file: "tests/test_auth.py"
    test_function: "test_logout"
    line_number: 78
    pattern: 7
    pattern_name: "State Mutation Without Verification"
    effort: moderate
    depends_on: ["finding-1"]   # Shares fixtures with finding-1
    blind_spot: "Session not actually cleared, just returns success"
    production_impact: "Session persistence after logout"

  - id: "finding-3"
    priority: important
    test_file: "tests/test_api.py"
    test_function: "test_fetch_data"
    line_number: 112
    pattern: 4
    pattern_name: "Lack of Consumption"
    effort: significant    # Requires adding JSON schema validation
    depends_on: []
    blind_spot: "API response never deserialized or validated"
    production_impact: "Malformed API responses not caught"

  # ... all findings listed here

remediation_plan:
  # Ordered sequence accounting for dependencies and efficiency
  phases:
    - phase: 1
      name: "Foundation fixes"
      findings: ["finding-1"]
      rationale: "Other tests depend on auth fixtures"
    - phase: 2
      name: "Auth suite completion"
      findings: ["finding-2"]
      rationale: "Depends on finding-1 fixtures"
    - phase: 3
      name: "API test hardening"
      findings: ["finding-3"]
      rationale: "Independent, can parallelize"
  total_effort_estimate: "2-3 hours"
  recommended_approach: "sequential"   # sequential | parallel | mixed
---
```
### 5.2 Effort Estimation Guidelines
Assign effort based on fix complexity:
| Effort | Criteria | Examples |
|---|---|---|
| trivial | < 5 minutes, single assertion change | Add `.to_equal(expected)` instead of `.to_be_truthy()` |
| moderate | 5-30 minutes, requires reading production code | Add state verification after mutation, strengthen partial assertions |
| significant | 30+ minutes, requires new test infrastructure | Add schema validation, create missing edge case tests, refactor mocked tests |
### 5.3 Dependency Detection
Identify dependencies between findings:
**Shared Fixtures:** If two tests share setup/fixtures, fixing one may affect the other.

```yaml
depends_on: ["finding-1"]   # Uses same auth fixture
```

**Cascading Assertions:** If test A's output feeds test B, fix A first.

```yaml
depends_on: ["finding-3"]   # Tests integration that depends on this unit
```

**File-Level Dependencies:** If multiple findings are in one file, group them.

```yaml
depends_on: []   # But note: same file as finding-2, consider batching
```

**No Dependencies:** Most findings are independent.

```yaml
depends_on: []
```
### 5.4 Remediation Plan Generation
After listing all findings, generate a remediation plan:
- **Group by dependency:** Findings with `depends_on: []` can be phase 1
- **Order by impact:** Within a phase, critical before important before minor
- **Batch by file:** Group findings in the same file for efficient fixing
- **Estimate total effort:** Sum individual efforts with a 20% buffer for context switching (e.g., 4 trivial × 5 min + 2 moderate × 20 min + 1 significant × 45 min = 105 min, ≈ 2 hours with buffer)
```yaml
remediation_plan:
  phases:
    - phase: 1
      name: "[Descriptive name]"
      findings: ["finding-1", "finding-4"]   # Independent, critical
      rationale: "No dependencies, highest impact"
    - phase: 2
      name: "[Descriptive name]"
      findings: ["finding-2", "finding-5"]   # Depend on phase 1
      rationale: "Depends on phase 1 fixtures"
    - phase: 3
      name: "[Descriptive name]"
      findings: ["finding-3", "finding-6"]   # Lower priority
      rationale: "Important but lower impact"
  total_effort_estimate: "[X hours/days]"
  recommended_approach: "sequential"   # or "parallel" if findings are independent
```
### 5.5 Human-Readable Summary Statistics
After the YAML block, provide human-readable summary:
## Audit Summary

```
Total Tests Audited: X
├── SOLID (would catch failures): Y
├── GREEN MIRAGE (would miss failures): Z
└── PARTIAL (some gaps): W

Patterns Found:
├── Pattern 1 (Existence vs. Validity): N instances
├── Pattern 2 (Partial Assertions): N instances
├── Pattern 3 (Shallow Matching): N instances
├── Pattern 4 (Lack of Consumption): N instances
├── Pattern 5 (Mocking Reality): N instances
├── Pattern 6 (Swallowed Errors): N instances
├── Pattern 7 (State Mutation): N instances
└── Pattern 8 (Incomplete Branches): N instances

Effort Breakdown:
├── Trivial fixes: N (< 5 min each)
├── Moderate fixes: N (5-30 min each)
└── Significant fixes: N (30+ min each)

Estimated Total Remediation: [X hours]
```
### 5.6 Detailed Findings (Critical)
For each critical finding, provide full detail:
---
**Finding #1: [Descriptive Title]**
| Field | Value |
|-------|-------|
| ID | `finding-1` |
| Priority | CRITICAL |
| File | `path/to/test.py::test_function` (line X) |
| Pattern | 2 - Partial Assertions |
| Effort | trivial / moderate / significant |
| Depends On | None / [finding-N, ...] |
**Current Code:**

```python
[exact code from test]
```

**Blind Spot:** [Specific scenario where broken code passes this test]

**Trace:**

```
test_function()
└─> production_function(args)
    └─> returns garbage
└─> assertion checks [partial thing]
└─> PASSES despite garbage because [reason]
```

**Production Impact:** [What would break in production that this test misses]

**Consumption Fix:**

```python
[exact code to add/change]
```

**Why This Fix Works:** [How the fix would catch the failure]
### 5.7 Detailed Findings (Important)
Same format as critical, listed after all critical findings.
### 5.8 Detailed Findings (Minor)
Same format, listed last.
### 5.9 Remediation Plan (Human-Readable)
After all findings, provide the execution plan:
## Remediation Plan

### Phase 1: [Name] (N findings, ~X minutes)

**Rationale:** [Why these first]

| Finding | Test | Effort | Fix Summary |
|---|---|---|---|
| finding-1 | test_login_success | trivial | Strengthen user object assertion |
| finding-4 | test_create_user | trivial | Add return value check |

### Phase 2: [Name] (N findings, ~X minutes)

**Rationale:** [Why these second]
**Depends on:** Phase 1 completion

| Finding | Test | Effort | Fix Summary |
|---|---|---|---|
| finding-2 | test_logout | moderate | Verify session actually cleared |

### Phase 3: [Name] (N findings, ~X minutes)

...
## Quick Start

To fix these issues, run:

```
/fix-tests [paste this report OR path to saved report]
```

The fix-tests skill will parse the findings and execute the remediation plan.
## Phase 6: Report Output
<CRITICAL>
The audit report MUST be written to a file, not just displayed inline.
This enables the fix-tests skill to consume it and provides a persistent record.
</CRITICAL>
### 6.1 Output Location
**Base directory:** `${CLAUDE_CONFIG_DIR:-${HOME}/.claude}/docs/<project-encoded>/audits/`
**File naming:** `green-mirage-audit-<YYYY-MM-DD>-<HHMMSS>.md`
**Example:** `~/.claude/docs/Users-alice-Development-myproject/audits/green-mirage-audit-2024-01-15-103045.md`
### 6.2 Project Encoded Path Generation
```bash
# Find outermost git repo (handles nested repos like submodules/vendor)
# Returns "NO_GIT_REPO" if not in any git repository
_outer_git_root() {
    local root=$(git rev-parse --show-toplevel 2>/dev/null)
    [ -z "$root" ] && { echo "NO_GIT_REPO"; return 1; }
    local parent
    while parent=$(git -C "$(dirname "$root")" rev-parse --show-toplevel 2>/dev/null) && [ "$parent" != "$root" ]; do
        root="$parent"
    done
    echo "$root"
}

PROJECT_ROOT=$(_outer_git_root)
```

If `PROJECT_ROOT` is `NO_GIT_REPO`: ask the user if they want to run `git init`. If no, use the fallback `~/.claude/docs/_no-repo/$(basename "$PWD")/audits/`.

```bash
PROJECT_ENCODED=$(echo "$PROJECT_ROOT" | sed 's|^/||' | tr '/' '-')
```
### 6.3 Directory Creation

Before writing, ensure the directory exists:

```bash
DOCS_DIR="${CLAUDE_CONFIG_DIR:-${HOME}/.claude}/docs/${PROJECT_ENCODED}/audits"
mkdir -p "$DOCS_DIR"
```
### 6.4 Write Report
Write the complete report (YAML block + human-readable sections) to the file.
### 6.5 Final Output to User
After writing the file, display:
## Audit Complete
Report saved to:
`~/.claude/docs/<project-encoded>/audits/green-mirage-audit-<timestamp>.md`
### Summary
- Tests audited: X
- Green mirages found: Y
- Estimated fix time: Z
### Next Steps
To fix these issues, run:

```
/fix-tests ~/.claude/docs/<project-encoded>/audits/green-mirage-audit-<timestamp>.md
```
The fix-tests skill will:
1. Parse the findings from this report
2. Process fixes in dependency order
3. Verify each fix catches the failure it should
4. Commit changes incrementally
## Execution Protocol
### No Vague Findings
- "This test should be more thorough"
- "Consider adding validation"
- Findings without exact line numbers
- Fixes without exact code
### No Rushing
- Skipping tests to finish faster
- Not tracing full code paths
- Assuming code works without verification
- Stopping before the full audit is complete
### Audit Completeness

- [ ] Did I read every line of every test file?
- [ ] Did I trace code paths from test through production and back?
- [ ] Did I check every test against all 8 patterns?
- [ ] Did I verify assertions would catch actual failures?
- [ ] Did I identify untested functions/methods?
- [ ] Did I identify untested error paths?
### Finding Quality

- [ ] Does every finding include exact line numbers?
- [ ] Does every finding include exact fix code?
- [ ] Does every finding have an effort estimate (trivial/moderate/significant)?
- [ ] Does every finding have `depends_on` specified (even if empty `[]`)?
- [ ] Did I prioritize findings (critical/important/minor)?
### Report Structure (for fix-tests compatibility)

- [ ] Did I output the YAML block at the START of findings?
- [ ] Does the YAML include `audit_metadata`, `summary`, `patterns_found`, `findings`, and `remediation_plan`?
- [ ] Does each finding in YAML have: `id`, `priority`, `test_file`, `test_function`, `line_number`, `pattern`, `pattern_name`, `effort`, `depends_on`, `blind_spot`, `production_impact`?
- [ ] Did I generate a `remediation_plan` with phases ordered by dependencies?
- [ ] Did I provide human-readable summary statistics after the YAML?
- [ ] Did I provide detailed findings in the human-readable format?
- [ ] Did I include the "Quick Start" section pointing to fix-tests?
If NO to ANY item, go back and complete it.
The question is: "Would this test FAIL if the production code was broken?"
For EVERY assertion, ask: "What broken code would still pass this?"
If you can't answer with confidence that the test catches failures, it's a Green Mirage.
Find it. Trace it. Fix it. Take as long as needed.