| name | systematic-debugging |
| description | Required for any bug, error, test failure, or unexpected behavior. Enforces root cause investigation before fixes through 4-phase framework. Prevents random fix attempts and ensures evidence-based debugging. Auto-invokes on: test failures, errors, crashes, 'not working', performance issues. |
Systematic Debugging
Overview
Random fixes waste time and create new bugs. Quick patches mask underlying issues.
Core principle: Always find root cause before attempting fixes. Symptom fixes are failure.
No fixes without root cause investigation first. If you haven't completed Phase 1, you cannot propose fixes.
When to Use
For any technical issue: test failures, bugs, unexpected behavior, performance problems, build failures, integration issues
Especially when: Under time pressure, "quick fix" seems obvious, already tried multiple fixes, previous fix didn't work, don't fully understand issue
Never skip when: the issue seems simple, you're in a hurry, or you're under pressure
Phase Tracking (Required)
At start of each phase:
- Use TodoWrite to create phase checkpoint
- Mark current phase as "in_progress"
- List specific tasks
Phase Transition Protocol
To move between phases:
- State: "I am completing Phase N because: [completion criteria met]"
- Update TodoWrite: Mark phase N as completed
- State: "I am beginning Phase N+1"
- Update TodoWrite: Mark phase N+1 as in_progress
Do not silently move between phases. Each transition must be explicit.
Tool Usage by Phase
Phase 1-3 (Investigation):
- Allowed: Read, Grep, Glob, Bash (read-only commands)
- Allowed: Edit/Write only for debug statements (must be reverted)
- Forbidden: Edit/Write for fixes
Phase 4 (Implementation):
- Allowed: Edit/Write for the one root cause fix
- Forbidden: Edit/Write for multiple changes (one at a time)
If you catch yourself using Edit/Write in Phase 1-3 for anything other than debug statements, stop.
Phase 1: Root Cause Investigation
Before attempting any fix, you must complete all steps below.
Forbidden until Phase 1 complete:
- Proposing any code changes
- Suggesting fixes or solutions
- Using Edit or Write tools for fixes
- Moving to Phase 2
You may only:
- Add debug statements (temporary, will be removed)
- Read files and search code
- Run debugging tools
- Create temporary test files (will be deleted)
Steps:
Read Error Messages Carefully: Don't skip errors/warnings, read stack traces completely, note line numbers, file paths, and error codes
Reproduce Consistently: Can you trigger it reliably? Exact steps? Every time? If not reproducible, gather more data
Check Recent Changes: What changed that could cause this? Git diff, recent commits, dependencies, config, environment
Gather Evidence in Multi-Component Systems
When system has multiple components (API -> service -> database):
Add diagnostic instrumentation at each boundary:
- Log data entering component
- Log data exiting component
- Verify environment/config propagation
- Check state at each layer
Run once to gather evidence showing where the flow breaks, analyze the output to identify the failing component, then investigate that specific component only.
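For example, a minimal sketch in Python, assuming a hypothetical three-layer flow; every name here is invented, and the `[DEBUG:` tag matches the cleanup grep in Phase 4:

```python
# Temporary instrumentation for a hypothetical API -> service -> database flow.
# Every statement is tagged [DEBUG: so the Phase 4 cleanup grep can find it.

def query_user_row(user_id):            # stand-in for the database layer
    return {"id": user_id, "email": None}   # simulated bad value from storage

def get_user_profile(user_id):          # service layer
    print(f"[DEBUG: service-in] user_id={user_id!r}")
    row = query_user_row(user_id)
    print(f"[DEBUG: service-out] row={row!r}")   # email is already None here
    return row

def handle_request(user_id):            # API layer
    print(f"[DEBUG: api-in] user_id={user_id!r}")
    result = get_user_profile(user_id)
    print(f"[DEBUG: api-out] result={result!r}")
    return result

handle_request(42)
```

One run shows the bad `email` first appears at the database boundary, so only that layer needs further investigation.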
Trace Data Flow
When error is deep in call stack:
- Where does bad value originate?
- What called this with bad value?
- Keep tracing up until you find the source
- Fix at source, not at symptom
Add debug statements at each level of the call stack.
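A sketch of the same idea in Python, with hypothetical names: the crash surfaces in `format_price`, but tagged debug statements show the bad value originates in `load_config`:

```python
# Hypothetical example: the AttributeError surfaces deep in format_price,
# but debug statements at each level trace the None back to load_config.

def load_config():
    return {"currency": None}          # the source of the bad value (simulated)

def get_currency(config):
    value = config.get("currency")
    print(f"[DEBUG: get_currency] value={value!r}")   # already None here
    return value

def format_price(amount, currency):
    print(f"[DEBUG: format_price] currency={currency!r}")  # symptom appears here
    return f"{amount:.2f} {currency.upper()}"  # crashes: None has no .upper()

format_price(9.99, get_currency(load_config()))
```

The fix belongs in `load_config` (the source), not in a None-check inside `format_price` (the symptom).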
Evidence Gathering Techniques
See @evidence-gathering.md for:
- Debug statement format and examples
- Language-specific debugging tools
- Investigation strategies by issue type (memory, concurrency, performance, state, integration)
Phase 1 Completion Checklist
You have completed Phase 1 only if you can answer all:
- What fails: Exact component, line number, function name
- When it fails: Reproducible steps or conditions
- What value is wrong: Actual vs expected value
- Where it originates: Source of bad data/state (not just where it manifests)
If you cannot answer all four, you are still in Phase 1.
Phase 2: Pattern Analysis
Find the pattern before fixing:
- Find Working Examples: Locate similar working code in same codebase
- Compare Against References: If implementing pattern, read reference implementation completely
- Identify Differences: What's different between working and broken? List every difference
- Understand Dependencies: What other components, settings, config, environment does this need?
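As an illustration of difference-listing, a hedged Python sketch: a known-good call site next to a broken one, with each difference recorded as a comment. The `sqlite3` usage is an invented example, not drawn from any particular codebase:

```python
# Phase 2 comparison: working code found in the same codebase vs. the broken code.
import sqlite3

def working(db_path):
    with sqlite3.connect(db_path) as conn:       # difference 1: context manager
        conn.execute("CREATE TABLE IF NOT EXISTS t (x)")
        conn.execute("INSERT INTO t VALUES (1)")
    # transaction committed automatically on exit

def broken(db_path):
    conn = sqlite3.connect(db_path)              # difference 1: no context manager
    conn.execute("CREATE TABLE IF NOT EXISTS t (x)")
    conn.execute("INSERT INTO t VALUES (1)")
    # difference 2: never commits, so the insert is silently lost
```

Listing every difference, even ones that look irrelevant, is what turns "it's broken" into a testable hypothesis for Phase 3.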
Phase 3: Hypothesis and Testing
Scientific method:
- Form Single Hypothesis: State clearly "I think X is root cause because Y" - be specific
- Test Minimally: Make smallest possible change to test hypothesis, one variable at a time
- Verify Before Continuing: Did it work? Yes -> Phase 4. No -> Form new hypothesis, don't add more fixes
- When You Don't Know: Say "I don't understand X", don't pretend to know, ask for help, research
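A minimal sketch of one stated hypothesis tested with a single-variable change, using invented Python values:

```python
# Hypothesis: "I think the totals disagree because float addition accumulates
# rounding error, so summing integer cents should match." One variable changes
# (the numeric representation); everything else stays fixed.

prices = [0.10] * 3

float_total = sum(prices)                                 # current behavior
cent_total = sum(round(p * 100) for p in prices) / 100    # minimal change

print(f"[DEBUG: hypothesis] float={float_total!r} cents={cent_total!r}")
assert float_total != 0.30   # evidence: the float path really does drift
assert cent_total == 0.30    # evidence: the hypothesis holds
```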
Evidence Requirements
Before forming hypothesis, gather sufficient evidence:
- Add debug statements at key points
- Run tests with multiple inputs, edge cases
- Log entry/exit for suspect functions
- Create isolated test cases
The goal is evidence-based investigation, not arbitrary thresholds. If you can form a well-supported hypothesis with less evidence, that's acceptable; if you need more, gather it.
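For instance, a sketch of gathering evidence across multiple inputs, using a hypothetical `parse_port` function under suspicion:

```python
# Run the suspect function over normal and edge-case inputs and log every
# outcome, instead of reasoning from a single failure.

def parse_port(value):                 # hypothetical function under suspicion
    return int(value.strip())

for raw in ["8080", " 8080 ", "", "eighty", "8080.0", None]:
    try:
        print(f"[DEBUG: parse_port] input={raw!r} -> {parse_port(raw)!r}")
    except Exception as exc:
        print(f"[DEBUG: parse_port] input={raw!r} -> raised {exc!r}")
```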
When to Escalate
If after thorough investigation you cannot identify root cause:
- Document your investigation process
- List hypotheses tested and evidence gathered
- Identify what remains unclear
- Escalate with your findings
Don't continue indefinitely. If multiple hypotheses tested with evidence and root cause remains elusive, escalation is appropriate.
Before Proposing Any Fix
Ask yourself:
- Have I completed Phase 1? (Can I answer all 4 checklist items?)
- Have I found a working counterexample? (Phase 2)
- Have I stated one specific hypothesis? (Phase 3)
- Have I tested that hypothesis minimally? (Phase 3)
If any answer is "no", you are not ready to fix. State which phase you're in and continue investigation.
Phase 4: Implementation
Fix root cause, not symptom:
Create Failing Test Case: Write the simplest possible reproduction: an automated test if a framework exists, a one-off script if not. The test must fail with the current code and pass after the fix. Do not start fixing until it exists (a sketch follows these steps).
Implement Single Fix: Address root cause identified, one change at a time, no "while I'm here" improvements, no bundled refactoring
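A minimal failing-test sketch, reusing the hypothetical `parse_port` from the evidence example and assuming pytest is available:

```python
import pytest

def parse_port(value):
    # current (buggy) implementation: blank input produces an unhelpful error
    return int(value.strip())

def test_parse_port_rejects_blank_input_with_clear_error():
    # Fails today: int("") raises ValueError, but its message never mentions
    # "port". It passes once parse_port raises ValueError("invalid port: ...").
    with pytest.raises(ValueError, match="port"):
        parse_port("   ")
```

The test pins the root cause (unvalidated input) rather than the symptom, and it stays in the suite as a regression guard after the fix.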
Clean Up Investigation
Before marking Phase 4 complete:
- Run `grep -r "\[DEBUG:" .` -> result must be empty
- Run `find . -name "test_debug_*"` -> result must be empty
- Check TodoWrite for any "debug", "temporary", "test_debug" items
- Confirm: only the one root cause fix remains in the codebase
If any check fails, you have not completed cleanup.
Verify Fix: Test passes now? No other tests broken? Issue actually resolved?
If Fix Doesn't Work
- Stop
- Count: How many fixes have you tried?
- If < 3: Return to Phase 1, re-analyze with new information
- If >= 3: Stop and question the architecture (see below)
- Don't attempt Fix #4 without architectural discussion
If 3+ Fixes Failed: Question Architecture
Patterns indicating an architectural problem:
- Each fix reveals new shared state/coupling/problem elsewhere
- Fixes require "massive refactoring" to implement
- Each fix creates new symptoms elsewhere
Stop and question fundamentals:
- Is this pattern fundamentally sound?
- Are we "sticking with it through sheer inertia"?
- Should we refactor architecture vs continue fixing symptoms?
Discuss with the user before attempting more fixes. This is not a failed hypothesis; it is the wrong architecture.
Red Flags
Stop immediately if you catch yourself thinking:
- "Quick fix for now, investigate later"
- "Just try changing X and see if it works"
- "Add multiple changes, run tests"
- "Skip the test, I'll manually verify"
- "It's probably X, let me fix that"
- "I don't fully understand but this might work"
- "Pattern says X but I'll adapt it differently"
- "Here are the main problems: [lists fixes without investigation]"
- Proposing solutions before tracing data flow
- "One more fix attempt" (when already tried 2+)
- Each fix reveals new problem in different place
All of these mean: Stop. Return to Phase 1.
If 3+ fixes failed: Question the architecture (see "If 3+ Fixes Failed" above).
After investigation: Always clean up debug statements and temporary test files (see "Clean Up Investigation" above).
LLM Anti-Pattern Detection
If you output any of these phrases, you are violating this skill:
- "Let's try..." -> Stop. Have you completed Phase 1?
- "We could fix this by..." -> Stop. Are you in Phase 4?
- "One approach would be..." -> Stop. Have you formed one hypothesis?
- "I think the issue might be..." -> Stop. Where's your evidence?
- "Let's make a few changes..." -> Stop. One change at a time.
- "This should work..." -> Stop. Have you tested your hypothesis?
- "I'll implement..." -> Stop. Have you answered the 4 Phase 1 checklist items?
When you detect these phrases in your own output, immediately:
- Stop and acknowledge the violation
- State which phase you should be in
- Return to that phase and continue properly
User Signals You're Doing It Wrong
Watch for these redirections:
- "Is that not happening?" - You assumed without verifying
- "Will it show us...?" - You should have added evidence gathering
- "Stop guessing" - You're proposing fixes without understanding
- "Ultrathink this" - Question fundamentals, not just symptoms
- "We're stuck?" (frustrated) - Your approach isn't working
When you see these: Stop. Return to Phase 1.
Common Rationalizations
See @common-rationalizations.md for complete list of excuses to counter.
When Process Reveals "No Root Cause"
If systematic investigation reveals issue is truly environmental, timing-dependent, or external:
- You've completed the process
- Document what you investigated
- Implement appropriate handling (retry, timeout, error message)
- Add monitoring/logging for future investigation
But: 95% of "no root cause" cases are incomplete investigation.
Integration with Other Skills
Complementary skills:
- defense-in-depth: Add validation at multiple layers after finding root cause
- condition-based-waiting: Replace arbitrary timeouts identified in Phase 2
- verification-before-completion: Verify fix worked before claiming success