| name | debug-systematic |
| description | Systematic 4-phase debugging methodology for complex, intermittent, or mysterious issues. Use when investigating bugs, race conditions, or unexplained failures. |
Systematic Debugging Protocol
A disciplined, evidence-based approach to debugging that prevents guessing and ensures root cause discovery.
The 4-Phase Protocol
Phase 1: REPRODUCE (Establish Ground Truth)
Goal: Create reliable reproduction steps before ANY investigation.
Actions:
- Document exact steps to trigger the bug
- Record environment specifics (OS, versions, config, memory, network)
- Determine frequency: Always? Sometimes? Specific conditions?
- Capture exact error messages, stack traces, screenshots
- Test on different environments to isolate variables
Key Questions:
- When did it last work correctly?
- What changed since then? (code, deps, config, infrastructure)
- Is it environment-specific?
- Is it data-specific?
- Is it timing-specific?
Output: Clear reproduction steps that reliably trigger the issue.
Phase 2: ISOLATE (Narrow the Scope)
Goal: Reduce the search space from "entire codebase" to "specific component."
Techniques:
Binary Search:
- Identify two points: working state and broken state
- Test the midpoint
- Recurse into the broken half
- Continue until the change is identified
Git Bisect (for regressions):
git bisect start
git bisect bad HEAD
git bisect good <known-good-commit>
# Git will checkout commits for testing
# After each test:
git bisect good # or git bisect bad
# Continue until culprit found
Code Elimination:
- Comment out sections to isolate the problem
- Create minimal reproduction case
- Strip away everything non-essential
Environment Isolation:
- Test in isolation (unit test the failing path)
- Compare working vs broken environments
- Use fresh installs to eliminate pollution
Output: "The bug is in [specific component/function/line range]"
Phase 3: DIAGNOSE (Understand Root Cause)
Goal: Know exactly WHY the bug occurs, not just WHERE.
Scientific Method:
- Observe: What exactly is happening?
- Hypothesize: Why might this be happening?
- Predict: If hypothesis is correct, what else would be true?
- Test: Verify predictions with evidence
- Iterate: Refine hypothesis based on results
Logging Strategy:
// Add strategic logging at boundaries
console.log('[DEBUG] Function entry:', { input, state });
console.log('[DEBUG] After processing:', { result, sideEffects });
console.log('[DEBUG] Function exit:', { returnValue });
Common Root Causes:
| Symptom | Likely Causes |
|---|---|
| Works locally, fails in CI | Environment differences, timing, resources |
| Intermittent failure | Race condition, flaky network, resource contention |
| Works then stops working | State mutation, memory leak, cache poisoning |
| Wrong data | Type coercion, encoding, timezone, precision |
| Silent failure | Swallowed exception, async error, missing await |
Output: Clear explanation of the root cause with evidence.
Phase 4: FIX & VERIFY (Resolve and Prevent)
Goal: Fix the issue and prevent regression.
Fix Process:
- Write a failing test that captures the bug
- Implement minimal fix - change as little as possible
- Verify test passes - confirms fix works
- Check for similar patterns - same bug elsewhere?
- Review fix for side effects - does it break anything?
- Document the fix - why it happened, how to prevent
Verification Checklist:
- Test passes that specifically catches this bug
- Existing tests still pass
- Manual verification confirms fix
- Fix works in all affected environments
- No new warnings or errors introduced
Prevention:
- Add guards/validation at boundaries
- Improve error messages for easier future debugging
- Document gotchas for other developers
- Consider if architectural change prevents similar bugs
Debugging Anti-Patterns
DO NOT:
- Guess and hope (change things randomly)
- Assume you know the problem without evidence
- Trust comments/docs over actual code behavior
- Debug production with print statements you'll forget to remove
- Fix the symptom instead of the root cause
- Make multiple changes at once
DO:
- Verify assumptions with evidence
- Change one thing at a time
- Log actual values, not what you expect
- Trust the code over documentation
- Take breaks when stuck (fresh eyes help)
Quick Reference
1. REPRODUCE → Can I reliably trigger this?
2. ISOLATE → Where exactly is it failing?
3. DIAGNOSE → Why is it failing?
4. FIX → How do I fix it permanently?
Output Template
## Bug Investigation: [Title]
### Reproduction
- Steps to reproduce
- Environment details
- Frequency
### Isolation
- Search method used
- Scope narrowed to
### Root Cause
- What's actually wrong
- Why it happens
- Evidence
### Fix
- Code changes made
- Test added
### Prevention
- How to prevent similar bugs
- Documentation updates