| name | debug-root-cause-analysis |
| description | Use when fixing a bug or analyzing a stack trace. Enforces systematic debugging methodology to prevent shotgun debugging and verify fixes. |
| author | Claude Code Learning Flywheel Team |
| allowed-tools | Read, Bash, Grep, Glob, Edit |
| version | 1.0.0 |
| last_verified | 2026-01-01 |
| tags | debugging, root-cause, analysis, systematic |
| related-skills | test-driven-workflow, code-navigation |
Skill: Debug Root Cause Analysis
Purpose
Prevent "shotgun debugging," where agents change code at random hoping the error disappears. Enforce a scientific-method approach that isolates the root cause before applying a fix, so that fixing one bug does not introduce another.
1. Negative Knowledge (Anti-Patterns)
| Failure Pattern | Context | Why It Fails |
|---|---|---|
| Shotgun Debugging | Changing multiple variables at once | Cannot isolate cause, introduces new bugs |
| Premature Fix | Applying a patch before reproducing failure | Fix is unverified, may not address root cause |
| Log Blindness | Ignoring severity levels or timestamps | Chasing symptoms, not causes |
| Assumption-Based Debugging | "I think it's X" without verification | Wastes time on wrong leads |
| Print Debugging Only | Adding console.logs without structured approach | Creates noise, hard to trace logic flow |
| Copy-Paste Solutions | Applying Stack Overflow answers without understanding | May not fit context, creates technical debt |
| Skipping Reproduction | Attempting fix without reproducing bug | Cannot verify fix works |
2. Verified Debugging Procedure
The Scientific Method for Debugging
1. OBSERVE → Gather evidence (logs, stack traces, user reports)
2. REPRODUCE → Create minimal reproduction case
3. HYPOTHESIZE → Form testable theories about the cause
4. TEST → Verify each hypothesis systematically
5. FIX → Apply minimal fix to root cause
6. VERIFY → Confirm fix resolves issue and doesn't break anything
7. PREVENT → Add regression test
Phase 1: OBSERVE - Gather Evidence
Collect all available information:
# Check application logs
tail -n 100 logs/error.log
# Check recent commits that may have introduced the bug
git log --since="2 days ago" --oneline
# Check for similar errors
grep -r "ErrorMessage" logs/
# Review stack trace carefully
# Note: Line numbers, function names, timestamps
Questions to answer:
- When did the bug first appear?
- What changed recently (code, dependencies, config)?
- Is it consistent or intermittent?
- What's the exact error message and stack trace?
- What data was being processed when it failed?
Phase 2: REPRODUCE - Create Minimal Case
Goal: Reliably trigger the bug with minimal setup.
// Example: Creating a minimal reproduction test
describe('Bug Reproduction', () => {
  it('should reproduce the error when...', async () => {
    // Arrange: set up minimal conditions
    const input = {
      userId: 'test-user',
      amount: -100 // Hypothesis: negative amounts cause the crash
    };

    // Act & Assert: verify the bug occurs
    await expect(
      processPayment(input)
    ).rejects.toThrow('Cannot process negative amount');
  });
});
If you cannot reproduce:
- Verify you have the same environment (Node version, dependencies)
- Check if it's environment-specific (production vs development)
- Look for missing configuration or state
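The environment check above can be scripted. A minimal fingerprint helper, sketched in Python (the checklist mentions Node, but the idea is the same for any runtime): run it on the machine where the bug reproduces and the one where it does not, then diff the output.

```python
import json
import platform
import sys

def environment_fingerprint() -> dict:
    """Collect basic facts to diff between the machine where the bug
    reproduces and the one where it does not."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
    }

print(json.dumps(environment_fingerprint(), indent=2))
```

Extend the dictionary with whatever matters for your stack (dependency lock-file hash, relevant env var names) before diffing.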
Phase 3: HYPOTHESIZE - Form Theories
Generate testable hypotheses based on evidence:
- Analyze the stack trace: Start from the innermost function
- Check recent changes: `git diff main...HEAD`
- Review data flow: Trace how data reaches the failing code
- Consider edge cases: Null values, empty arrays, special characters
Example hypothesis formation:
Stack trace shows: TypeError: Cannot read property 'id' of undefined
Location: userService.ts:45
Hypothesis 1: User object is undefined when passed to service
Hypothesis 2: User object exists but missing 'id' property
Hypothesis 3: Async race condition, user not loaded yet
Next step: Add logging at userService.ts:44 to check user object
Phase 4: TEST - Verify Hypothesis
Use the zero-context script to analyze logs:
# Run log analysis to find patterns
python .claude/skills/debug-root-cause-analysis/scripts/analyze_logs.py \
--log-file logs/error.log \
--error-pattern "TypeError"
Add strategic logging to test hypothesis:
// BEFORE (hypothesis testing)
async function getUserById(id: string) {
  console.log('[DEBUG] getUserById called with:', id);
  const user = await db.users.findOne({ id });
  console.log('[DEBUG] User found:', user);
  if (!user) {
    console.log('[DEBUG] User not found for ID:', id);
    throw new Error('User not found');
  }
  return user;
}
Verify with a test:
it('reveals which error occurs for a missing user', async () => {
  // Capture the rejection instead of letting it fail the test run
  const error = await getUserById('nonexistent-id').catch(e => e);
  // A TypeError mentioning 'id' confirms hypothesis 1;
  // a plain 'User not found' error points elsewhere
  console.log('[DEBUG] thrown:', error);
});
Phase 5: FIX - Apply Minimal Fix
Once root cause is confirmed, apply the minimal fix:
// AFTER (minimal fix)
async function getUserById(id: string) {
  const user = await db.users.findOne({ id });
  if (!user) {
    throw new Error(`User not found: ${id}`);
  }
  return user;
}
Principles:
- Fix ONLY the root cause
- Don't refactor during bug fixing
- Don't add features while fixing bugs
- Keep the fix as simple as possible
Phase 6: VERIFY - Confirm Fix
Verification checklist:
# 1. Run the reproduction test
npm test -- bug-reproduction.test.ts
# Expected: PASS
# 2. Run full test suite
npm test
# Expected: All tests PASS (no regressions)
# 3. Manual verification (if applicable)
npm run dev
# Test the actual user flow that triggered the bug
# 4. Check logs are clean
tail -f logs/error.log
# Expected: No errors when performing the previously failing action
Phase 7: PREVENT - Add Regression Test
Convert your reproduction case into a permanent test:
// tests/unit/services/UserService.test.ts
describe('UserService.getUserById', () => {
  let service: UserService;

  beforeEach(() => {
    service = new UserService(mockDb);
    mockDb.users.findOne.mockResolvedValue(null);
  });

  it('should throw a clear error when the user is not found', async () => {
    await expect(
      service.getUserById('nonexistent-id')
    ).rejects.toThrow('User not found: nonexistent-id');
  });

  it('should handle a null user from the database gracefully', async () => {
    await expect(
      service.getUserById('test-id')
    ).rejects.toThrow('User not found');
  });
});
3. Zero-Context Scripts
analyze_logs.py
Located at: .claude/skills/debug-root-cause-analysis/scripts/analyze_logs.py
Purpose: Parse error logs for frequency, patterns, and correlations.
Usage:
python .claude/skills/debug-root-cause-analysis/scripts/analyze_logs.py \
--log-file logs/error.log \
--error-pattern "TypeError|ReferenceError|null" \
--time-range "last 24h"
Output:
Error Frequency Analysis:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
TypeError: Cannot read property 'id': 47 occurrences
First seen: 2026-01-01 10:23:45
Last seen: 2026-01-01 14:32:10
Peak time: 2026-01-01 12:00-13:00 (23 errors)
Stack trace (most common):
at userService.ts:45
at authMiddleware.ts:89
ReferenceError: user is not defined: 3 occurrences
...
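The script itself is not reproduced here, but the core of this kind of frequency analysis is small. A minimal sketch of the idea (the function name and grouping heuristic are illustrative, not the actual `analyze_logs.py` implementation):

```python
import re
from collections import Counter

def error_frequency(lines, pattern):
    """Count log lines matching an error pattern, grouped by the exact
    text the pattern matched (e.g. 'TypeError' vs 'ReferenceError')."""
    rx = re.compile(pattern)
    counts = Counter()
    for line in lines:
        match = rx.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts

logs = [
    "2026-01-01 10:23:45 TypeError: Cannot read property 'id'",
    "2026-01-01 12:01:02 TypeError: Cannot read property 'id'",
    "2026-01-01 12:15:40 ReferenceError: user is not defined",
]
print(error_frequency(logs, r"TypeError|ReferenceError"))
# Counter({'TypeError': 2, 'ReferenceError': 1})
```

The real script would additionally track first/last-seen timestamps and the most common stack trace per error, as shown in the sample output above.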
4. Debugging Workflow for Agents
When you encounter a bug or error:
Stop and Observe:
- Don't immediately change code
- Read the full error message and stack trace
- Note the file, line number, and error type
Gather Context:
# What files are involved?
grep -r "functionName" src/
# What changed recently?
git log --oneline -10
# Are there logs?
ls -la logs/ && tail logs/error.log
Create Reproduction Test:
- Write a failing test that triggers the bug
- Make it as minimal as possible
- Verify it actually fails
Analyze Systematically:
- Use the zero-context script to analyze logs
- Add strategic logging (NOT shotgun console.logs)
- Test one hypothesis at a time
Apply Minimal Fix:
- Fix only the root cause
- Don't refactor or add features
- Keep changes minimal and focused
Verify Thoroughly:
- Run the reproduction test
- Run the full test suite
- Manual verification if needed
Prevent Recurrence:
- Convert reproduction to permanent test
- Document the root cause in comments if complex
- Consider if similar bugs could exist elsewhere
Example Agent Session
User: "The app crashes when I try to delete a user"
Agent: "I'll debug this systematically using root cause analysis."
[OBSERVE]
Agent: Reading error logs...
Found: TypeError: Cannot read property 'posts' of undefined at deleteUser:67
[REPRODUCE]
Agent: Creating reproduction test...
Writing: tests/bugs/delete-user-crash.test.ts
[RUN TEST]
Agent: Running test to confirm reproduction...
Output: FAIL - Confirmed the crash occurs
[HYPOTHESIZE]
Agent: Stack trace shows crash at line 67 when accessing user.posts.
Hypothesis: user object is undefined before we try to access posts.
Let me check the code...
[READ CODE]
Agent: Reading userService.ts:60-70...
Found issue: deleteUser() doesn't check if user exists before accessing user.posts
[FIX]
Agent: Applying minimal fix - adding null check before accessing user.posts
[VERIFY]
Agent: Running reproduction test...
Output: PASS
Running full test suite...
Output: All tests PASS
[PREVENT]
Agent: Converting reproduction test to permanent regression test
Committing: "fix: add null check before accessing user.posts in deleteUser"
5. Failed Attempts (Negative Knowledge Evolution)
❌ Attempt: Debug by commenting out code
- Context: Tried to isolate the bug by commenting out sections
- Failure: Changed program behavior, couldn't identify the actual cause
- Learning: Use logging and breakpoints, not code modification
❌ Attempt: Apply fix before confirming root cause
- Context: Saw a similar bug on Stack Overflow, applied that solution
- Failure: Didn't fix our bug, introduced a new edge case
- Learning: Always verify the root cause matches before applying fixes
❌ Attempt: Add try-catch around everything
- Context: Wrapped failing code in try-catch to "fix" errors
- Failure: Silenced errors, made debugging harder, root cause unfixed
- Learning: Fix the cause, don't suppress the symptom
❌ Attempt: Debug in production
- Context: Couldn't reproduce locally, added logging to production
- Failure: Exposed sensitive data in logs, caused performance issues
- Learning: Reproduce locally or use proper observability tools
❌ Attempt: Multi-variable changes
- Context: Changed error handling AND data validation AND logging
- Failure: Bug disappeared but couldn't identify which change fixed it
- Learning: Change one variable at a time, verify after each change
6. Common Bug Categories
Null/Undefined Errors
- Symptom: TypeError: Cannot read property 'X' of undefined
- Common causes: Missing null checks, async race conditions
- Fix approach: Add null checks, ensure async operations complete
Type Errors
- Symptom: Type mismatch, unexpected type coercion
- Common causes: Weak typing, incorrect assumptions
- Fix approach: Add type guards, validate inputs
Race Conditions
- Symptom: Intermittent failures, works sometimes
- Common causes: Async operations not awaited, shared state
- Fix approach: Proper async/await, eliminate shared mutable state
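A minimal sketch of the "not awaited" failure mode and its fix, shown in Python's asyncio (the functions are hypothetical; the same shape applies to unawaited Promises in the TypeScript examples above):

```python
import asyncio
from typing import Optional

async def fetch_user(db: dict, user_id: str) -> Optional[dict]:
    """Hypothetical lookup with simulated I/O latency."""
    await asyncio.sleep(0.01)  # stand-in for a real database call
    return db.get(user_id)

async def greet(db: dict, user_id: str) -> str:
    # BUG variant: `user = fetch_user(db, user_id)` without `await`
    # leaves `user` as a coroutine object, so `user['name']` raises
    # TypeError, and the lookup itself may never actually run.
    user = await fetch_user(db, user_id)  # FIX: await the result
    return f"Hello, {user['name']}" if user else "Hello, guest"

print(asyncio.run(greet({"u1": {"name": "Ada"}}, "u1")))  # Hello, Ada
print(asyncio.run(greet({}, "missing")))                  # Hello, guest
```

With the await in place the result is deterministic regardless of how long the simulated I/O takes.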
Off-by-One Errors
- Symptom: Array index out of bounds, fencepost errors
- Common causes: Loop conditions, array slicing
- Fix approach: Careful boundary analysis, add tests for edge cases
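A concrete fencepost edge case, sketched in Python (the function is hypothetical):

```python
def last_n(items, n):
    """Return the last n items of a list.

    The naive `items[-n:]` is wrong for n == 0: because -0 == 0,
    [1, 2, 3][-0:] returns the whole list instead of []."""
    return items[-n:] if n > 0 else []

assert last_n([1, 2, 3], 2) == [2, 3]
assert last_n([1, 2, 3], 0) == []  # the edge case the naive slice gets wrong
assert last_n([], 5) == []         # slicing never raises on short lists
```

The regression tests for the boundary values (0, 1, exact length, beyond length) are the real fix here; the guard alone is easy to get wrong again.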
Configuration Errors
- Symptom: Works locally, fails in production
- Common causes: Missing env vars, different configurations
- Fix approach: Validate configuration on startup, use same config locally
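Startup validation can be as simple as failing fast on missing variables. A minimal sketch (the variable names are placeholders, not a prescribed schema):

```python
import os

REQUIRED_VARS = ["DATABASE_URL", "API_KEY"]  # placeholder names

def validate_config(env=None):
    """Fail fast at startup if required configuration is missing,
    instead of crashing later deep inside request handling."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required config: " + ", ".join(missing))

# ok: all required variables present
validate_config({"DATABASE_URL": "postgres://localhost/app", "API_KEY": "k"})
```

Calling this once at process start turns a "works locally, fails in production" mystery into an immediate, self-describing error.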
7. Governance
- Token Budget: ~495 lines (within 500 limit)
- Dependencies: Python 3.8+ for log analysis script
- Pattern Origin: Scientific Method, Systematic Debugging (Andreas Zeller)
- Maintenance: Update anti-patterns as new failure modes discovered
- Verification Date: 2026-01-01