
SKILL.md

---
name: debug-root-cause-analysis
description: Use when fixing a bug or analyzing a stack trace. Enforces systematic debugging methodology to prevent shotgun debugging and verify fixes.
author: Claude Code Learning Flywheel Team
allowed-tools: Read, Bash, Grep, Glob, Edit
version: 1.0.0
last_verified: 2026-01-01
tags: debugging, root-cause, analysis, systematic
related-skills: test-driven-workflow, code-navigation
---

Skill: Debug Root Cause Analysis

Purpose

Prevent "Shotgun Debugging" where agents randomly change code hoping errors vanish. Enforce a scientific method approach to debugging that isolates the root cause before applying fixes, preventing introduction of new bugs while fixing old ones.

1. Negative Knowledge (Anti-Patterns)

| Failure Pattern | Context | Why It Fails |
| --- | --- | --- |
| Shotgun Debugging | Changing multiple variables at once | Cannot isolate cause, introduces new bugs |
| Premature Fix | Applying a patch before reproducing failure | Fix is unverified, may not address root cause |
| Log Blindness | Ignoring severity levels or timestamps | Chasing symptoms, not causes |
| Assumption-Based Debugging | "I think it's X" without verification | Wastes time on wrong leads |
| Print Debugging Only | Adding console.logs without structured approach | Creates noise, hard to trace logic flow |
| Copy-Paste Solutions | Applying Stack Overflow answers without understanding | May not fit context, creates technical debt |
| Skipping Reproduction | Attempting fix without reproducing bug | Cannot verify fix works |

2. Verified Debugging Procedure

The Scientific Method for Debugging

1. OBSERVE     → Gather evidence (logs, stack traces, user reports)
2. REPRODUCE   → Create minimal reproduction case
3. HYPOTHESIZE → Form testable theories about the cause
4. TEST        → Verify each hypothesis systematically
5. FIX         → Apply minimal fix to root cause
6. VERIFY      → Confirm fix resolves issue and doesn't break anything
7. PREVENT     → Add regression test

Phase 1: OBSERVE - Gather Evidence

Collect all available information:

# Check application logs
tail -n 100 logs/error.log

# Check recent commits that may have introduced the bug
git log --since="2 days ago" --oneline

# Check for similar errors
grep -r "ErrorMessage" logs/

# Review stack trace carefully
# Note: Line numbers, function names, timestamps

Questions to answer:

  • When did the bug first appear?
  • What changed recently (code, dependencies, config)?
  • Is it consistent or intermittent?
  • What's the exact error message and stack trace?
  • What data was being processed when it failed?
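
If "what changed recently" has no obvious answer, git bisect can find the introducing commit automatically. A sketch, assuming the reproduction test from Phase 2 exists and v1.2.0 is a known-good ref:

# Binary-search commit history for the first bad commit
git bisect start
git bisect bad HEAD
git bisect good v1.2.0   # assumed last-known-good release
git bisect run npm test -- bug-reproduction.test.ts
git bisect reset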

Phase 2: REPRODUCE - Create Minimal Case

Goal: Reliably trigger the bug with minimal setup.

// Example: Creating a minimal reproduction test
describe('Bug Reproduction', () => {
  it('should reproduce the error when...', async () => {
    // Arrange: Set up minimal conditions
    const input = {
      userId: 'test-user',
      amount: -100  // Hypothesis: negative amounts cause crash
    };

    // Act & Assert: Verify the bug occurs
    await expect(
      processPayment(input)
    ).rejects.toThrow('Cannot process negative amount');
  });
});

If you cannot reproduce:

  • Verify you have the same environment (Node version, dependencies)
  • Check if it's environment-specific (production vs development)
  • Look for missing configuration or state
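
A quick environment diff often explains a bug that won't reproduce. A sketch; the APP_ prefix is an assumed naming convention:

node --version            # match the failing environment's runtime
npm ls --depth=0          # compare top-level dependency versions
printenv | grep -i "app_" # look for missing or divergent configuration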

Phase 3: HYPOTHESIZE - Form Theories

Generate testable hypotheses based on evidence:

  1. Analyze the stack trace: Start from the innermost function
  2. Check recent changes: git diff main...HEAD
  3. Review data flow: Trace how data reaches the failing code
  4. Consider edge cases: Null values, empty arrays, special characters

Example hypothesis formation:

Stack trace shows: TypeError: Cannot read property 'id' of undefined
Location: userService.ts:45

Hypothesis 1: User object is undefined when passed to service
Hypothesis 2: User object exists but missing 'id' property
Hypothesis 3: Async race condition, user not loaded yet

Next step: Add logging at userService.ts:44 to check user object
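
A single probe there can discriminate all three hypotheses at once. A sketch; user is the variable implied by the stack trace:

// Hypothetical probe just before the failing line (userService.ts:44)
console.log('[DEBUG] user =', user === undefined ? 'undefined' : JSON.stringify(user));
// Always undefined           → Hypothesis 1
// Object without an 'id' key → Hypothesis 2
// Undefined only sometimes   → Hypothesis 3 (race condition)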

Phase 4: TEST - Verify Hypothesis

Use the zero-context script to analyze logs:

# Run log analysis to find patterns
python .claude/skills/debug-root-cause-analysis/scripts/analyze_logs.py \
  --log-file logs/error.log \
  --error-pattern "TypeError"

Add strategic logging to test hypothesis:

// BEFORE (hypothesis testing)
async function getUserById(id: string) {
  console.log('[DEBUG] getUserById called with:', id);
  const user = await db.users.findOne({ id });
  console.log('[DEBUG] User found:', user);

  if (!user) {
    console.log('[DEBUG] User not found for ID:', id);
    throw new Error('User not found');
  }

  return user;
}

Verify with a test:

it('should handle missing user gracefully', async () => {
  // Run and inspect the failure mode: a TypeError mentioning 'id'
  // confirms hypothesis 1; a clean 'User not found' error rules it out.
  await getUserById('nonexistent-id');
});

Phase 5: FIX - Apply Minimal Fix

Once root cause is confirmed, apply the minimal fix:

// AFTER (minimal fix)
async function getUserById(id: string) {
  const user = await db.users.findOne({ id });

  if (!user) {
    throw new Error(`User not found: ${id}`);
  }

  return user;
}

Principles:

  • Fix ONLY the root cause
  • Don't refactor during bug fixing
  • Don't add features while fixing bugs
  • Keep the fix as simple as possible

Phase 6: VERIFY - Confirm Fix

Verification checklist:

# 1. Run the reproduction test
npm test -- bug-reproduction.test.ts
# Expected: PASS

# 2. Run full test suite
npm test
# Expected: All tests PASS (no regressions)

# 3. Manual verification (if applicable)
npm run dev
# Test the actual user flow that triggered the bug

# 4. Check logs are clean
tail -f logs/error.log
# Expected: No errors when performing the previously failing action

Phase 7: PREVENT - Add Regression Test

Convert your reproduction case into a permanent test:

// tests/unit/services/UserService.test.ts
describe('UserService.getUserById', () => {
  const mockDb = { users: { findOne: jest.fn() } };
  const service = new UserService(mockDb);

  it('should throw clear error when user not found', async () => {
    mockDb.users.findOne.mockResolvedValue(null);

    await expect(
      service.getUserById('nonexistent-id')
    ).rejects.toThrow('User not found: nonexistent-id');
  });

  it('should handle null user gracefully', async () => {
    mockDb.users.findOne.mockResolvedValue(null);

    await expect(
      service.getUserById('test-id')
    ).rejects.toThrow('User not found');
  });
});

3. Zero-Context Scripts

analyze_logs.py

Located at: .claude/skills/debug-root-cause-analysis/scripts/analyze_logs.py

Purpose: Parse error logs for frequency, patterns, and correlations.

Usage:

python .claude/skills/debug-root-cause-analysis/scripts/analyze_logs.py \
  --log-file logs/error.log \
  --error-pattern "TypeError|ReferenceError|null" \
  --time-range "last 24h"

Output:

Error Frequency Analysis:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
TypeError: Cannot read property 'id': 47 occurrences
  First seen: 2026-01-01 10:23:45
  Last seen:  2026-01-01 14:32:10
  Peak time:  2026-01-01 12:00-13:00 (23 errors)

  Stack trace (most common):
    at userService.ts:45
    at authMiddleware.ts:89

ReferenceError: user is not defined: 3 occurrences
  ...
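
Internally, this is frequency counting over pattern-matched lines. A minimal sketch of that core in Python; the shipped script also extracts the timestamps and stack traces shown above:

import re
import sys
from collections import Counter

# Usage: python sketch.py <log-file> [pattern]
pattern = re.compile(sys.argv[2] if len(sys.argv) > 2 else r"Error")
counts = Counter()

with open(sys.argv[1]) as f:
    for line in f:
        if pattern.search(line):
            counts[line.strip()[:80]] += 1  # truncate for readable grouping

# Ten most frequent matching lines, highest first
for message, n in counts.most_common(10):
    print(f"{n:>5}  {message}")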

4. Debugging Workflow for Agents

When you encounter a bug or error:

  1. Stop and Observe:

    • Don't immediately change code
    • Read the full error message and stack trace
    • Note the file, line number, and error type
  2. Gather Context:

    # What files are involved?
    grep -r "functionName" src/
    
    # What changed recently?
    git log --oneline -10
    
    # Are there logs?
    ls -la logs/ && tail logs/error.log
    
  3. Create Reproduction Test:

    • Write a failing test that triggers the bug
    • Make it as minimal as possible
    • Verify it actually fails
  4. Analyze Systematically:

    • Use the zero-context script to analyze logs
    • Add strategic logging (NOT shotgun console.logs)
    • Test one hypothesis at a time
  5. Apply Minimal Fix:

    • Fix only the root cause
    • Don't refactor or add features
    • Keep changes minimal and focused
  6. Verify Thoroughly:

    • Run the reproduction test
    • Run the full test suite
    • Manual verification if needed
  7. Prevent Recurrence:

    • Convert reproduction to permanent test
    • Document the root cause in comments if complex
    • Consider if similar bugs could exist elsewhere

Example Agent Session

User: "The app crashes when I try to delete a user"

Agent: "I'll debug this systematically using root cause analysis."

[OBSERVE]
Agent: Reading error logs...
Found: TypeError: Cannot read property 'posts' of undefined at deleteUser:67

[REPRODUCE]
Agent: Creating reproduction test...
Writing: tests/bugs/delete-user-crash.test.ts

[RUN TEST]
Agent: Running test to confirm reproduction...
Output: FAIL - Confirmed the crash occurs

[HYPOTHESIZE]
Agent: Stack trace shows crash at line 67 when accessing user.posts.
Hypothesis: user object is undefined before we try to access posts.
Let me check the code...

[READ CODE]
Agent: Reading userService.ts:60-70...
Found issue: deleteUser() doesn't check if user exists before accessing user.posts

[FIX]
Agent: Applying minimal fix - adding null check before accessing user.posts

[VERIFY]
Agent: Running reproduction test...
Output: PASS
Running full test suite...
Output: All tests PASS

[PREVENT]
Agent: Converting reproduction test to permanent regression test
Committing: "fix: add null check before accessing user.posts in deleteUser"

5. Failed Attempts (Negative Knowledge Evolution)

❌ Attempt: Debug by commenting out code

Context: Tried to isolate bug by commenting out sections
Failure: Changed program behavior, couldn't identify actual cause
Learning: Use logging and breakpoints, not code modification

❌ Attempt: Apply fix before confirming root cause

Context: Saw similar bug on Stack Overflow, applied their solution
Failure: Didn't fix our bug, introduced new edge case
Learning: Always verify the root cause matches before applying fixes

❌ Attempt: Add try-catch around everything

Context: Wrapped failing code in try-catch to "fix" errors
Failure: Silenced errors, made debugging harder, root cause unfixed
Learning: Fix the cause, don't suppress the symptom

❌ Attempt: Debug in production

Context: Couldn't reproduce locally, added logging to production
Failure: Exposed sensitive data in logs, caused performance issues
Learning: Reproduce locally or use proper observability tools

❌ Attempt: Multi-variable changes

Context: Changed error handling AND data validation AND logging
Failure: Bug disappeared but couldn't identify which change fixed it
Learning: Change one variable at a time, verify after each change

6. Common Bug Categories

Null/Undefined Errors

  • Symptom: TypeError: Cannot read property 'X' of undefined
  • Common causes: Missing null checks, async race conditions
  • Fix approach: Add null checks, ensure async operations complete

Type Errors

  • Symptom: Type mismatch, unexpected type coercion
  • Common causes: Weak typing, incorrect assumptions
  • Fix approach: Add type guards, validate inputs
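
A type guard at the input boundary turns "incorrect assumptions" into explicit checks. A sketch; isUser and the payload shape are illustrative:

// Validate shape before use instead of assuming it
function isUser(value: unknown): value is { id: string } {
  return (
    typeof value === 'object' &&
    value !== null &&
    typeof (value as { id?: unknown }).id === 'string'
  );
}

const payload: unknown = JSON.parse('{"id": "u-1"}');
if (!isUser(payload)) {
  throw new Error('Invalid user payload');
}
// payload.id is now safely typed as string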

Race Conditions

  • Symptom: Intermittent failures, works sometimes
  • Common causes: Async operations not awaited, shared state
  • Fix approach: Proper async/await, eliminate shared mutable state
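
The fix is usually just awaiting completion before the dependent read. A self-contained sketch with illustrative names:

// loadUser stands in for any async fetch
async function loadUser(id: string): Promise<{ id: string }> {
  return new Promise((resolve) => setTimeout(() => resolve({ id }), 10));
}

async function main() {
  // BAD: fire-and-forget; the result isn't ready when it's read
  let current: { id: string } | undefined;
  loadUser('u-1').then((u) => { current = u; });
  console.log(current?.id); // undefined: classic race

  // GOOD: await completion before the dependent read
  const user = await loadUser('u-1');
  console.log(user.id); // 'u-1'
}

main();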

Off-by-One Errors

  • Symptom: Array index out of bounds, fencepost errors
  • Common causes: Loop conditions, array slicing
  • Fix approach: Careful boundary analysis, add tests for edge cases
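
Two boundaries account for most fencepost bugs. An illustrative sketch:

const items = ['a', 'b', 'c', 'd'];

// slice's end index is exclusive: slice(0, 3) returns indices 0..2
const firstThree = items.slice(0, 3); // ['a', 'b', 'c']

// Loop with < length, never <= : items[items.length] is undefined
for (let i = 0; i < items.length; i++) {
  console.log(items[i]);
}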

Configuration Errors

  • Symptom: Works locally, fails in production
  • Common causes: Missing env vars, different configurations
  • Fix approach: Validate configuration on startup, use same config locally
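
Validating configuration at startup fails fast instead of failing mid-request. A sketch; the env var names are illustrative:

// Check required environment variables before the app starts serving
const required = ['DATABASE_URL', 'API_KEY'];
const missing = required.filter((name) => !process.env[name]);

if (missing.length > 0) {
  throw new Error(`Missing required env vars: ${missing.join(', ')}`);
}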

7. Governance

  • Token Budget: ~495 lines (within 500 limit)
  • Dependencies: Python 3.8+ for log analysis script
  • Pattern Origin: Scientific Method, Systematic Debugging (Andreas Zeller)
  • Maintenance: Update anti-patterns as new failure modes are discovered
  • Verification Date: 2026-01-01