| name | SystematicDebugging |
| description | Evidence-based debugging methodology emphasizing observation over assumptions, following the scientific method. USE WHEN user reports a bug OR system behavior is unexpected OR troubleshooting issues OR investigating errors OR debugging failures. Follows the observe, hypothesize, test, verify cycle with disciplined evidence gathering. |
SystematicDebugging
Disciplined evidence-based approach to finding and fixing bugs using the scientific method.
The Scientific Method for Debugging
OBSERVE → HYPOTHESIZE → TEST → VERIFY
   ↑                              │
   └──────────────────────────────┘
         (Repeat until solved)
Core Principle
Evidence Over Assumptions
Never assume you know what's wrong. Always:
- Gather evidence first
- Form hypothesis based on evidence
- Test hypothesis with minimal changes
- Verify the fix works
The Process
Phase 1: Observe (Don't Assume)
Gather evidence systematically:
# Check application logs
journalctl -u ${service} -n 100 --no-pager
# Check system logs
tail -f /var/log/syslog
# Check service status
systemctl status ${service}
# Check resource usage
top -bn1 | head -20
# Check recent changes
git log --oneline --since="1 week ago" -- ${relevant_paths}
# Check environment
env | grep -i ${service}
Critical questions to answer:
- What is the EXACT error message (copy it verbatim)?
- When did this start happening?
- What changed recently (code, config, environment)?
- Can you reproduce it reliably?
- What's the minimum reproduction case?
Document your observations:
# Create debug log
cat > /tmp/debug-$(date +%Y%m%d-%H%M%S).log <<EOF
## Observations
**Error**: [Exact error message]
**Started**: [When it began]
**Frequency**: [Always/Sometimes/Rare]
**Recent changes**: [Git commits, deployments, config changes]
**Environment**: [OS, version, dependencies]
## Reproduction Steps
1. [Step 1]
2. [Step 2]
3. [Observed result]
4. [Expected result]
EOF
Phase 2: Form Hypothesis
Based on evidence, create testable hypothesis:
Template:
Given [observation],
I hypothesize [root cause],
because [reasoning].
If this is true, then [expected outcome].
Example:
Given: Service fails to start after reboot with "connection refused"
I hypothesize: Missing network dependency in systemd unit
because: Service likely starts before network is ready
If this is true, then: Adding After=network-online.target should fix it
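If the hypothesis holds, the eventual fix is a two-line dependency change. A minimal sketch as a systemd drop-in (unit and file names are placeholders):
# Make the service wait until the network is actually up
mkdir -p /etc/systemd/system/${service}.service.d
cat > /etc/systemd/system/${service}.service.d/10-network.conf <<'EOF'
[Unit]
Wants=network-online.target
After=network-online.target
EOF
systemctl daemon-reload
systemctl restart ${service}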
Document your hypothesis:
cat >> /tmp/debug-*.log <<EOF
## Hypothesis #1
**Given**: [Observation]
**Hypothesis**: [Root cause]
**Reasoning**: [Why you think this]
**Test**: [How to verify]
**Expected**: [What should happen if correct]
EOF
Phase 3: Test Hypothesis
Design minimal, isolated test:
Rules for testing:
- Change ONE variable at a time
- Add logging/instrumentation
- Create reproducible test case
- Document what you're changing
Example - Add debugging:
# Add logging to startup script
cat > /tmp/debug-startup.sh <<'EOF'
#!/bin/bash
echo "DEBUG: Starting service at $(date)" >> /tmp/service-debug.log
echo "DEBUG: Network status: $(ip addr show)" >> /tmp/service-debug.log
exec /usr/bin/actual-service
EOF
Example - Test specific condition:
# Test if file exists
if [ -f /var/run/service.pid ]; then
    echo "FOUND: PID file exists"
    cat /var/run/service.pid
else
    echo "MISSING: PID file does not exist"
    ls -la /var/run/
fi
Example - Isolate component:
// Add logging to isolate which component fails
func StartService() error {
    log.Println("DEBUG: Initializing database connection")
    db, err := initDB()
    if err != nil {
        log.Printf("DEBUG: Database init failed: %v", err)
        return err
    }
    log.Println("DEBUG: Starting HTTP server")
    return startHTTP(db)
}
Document your test:
cat >> /tmp/debug-*.log <<EOF
## Test #1
**Change**: [What you changed]
**Expected**: [What should happen if hypothesis is correct]
**Command**: \`[Command you ran]\`
**Result**: [What actually happened]
**Conclusion**: [Hypothesis confirmed/rejected]
EOF
Phase 4: Verify Fix
After implementing a fix:
- Verify bug no longer reproduces
- Verify no new bugs introduced
- Run full test suite
- Check logs for expected behavior
- Test edge cases
Verification checklist:
# 1. Original bug doesn't reproduce
[reproduction command]
# Expected: Works correctly
# 2. Related functionality still works
[test related features]
# 3. Tests pass
go test ./...
pytest
cargo test
# 4. Clean logs
journalctl -u ${service} -n 20 --no-pager
# Expected: No errors
# 5. Edge cases work
[test boundary conditions]
Add regression test:
// Prevent bug from returning
func TestBugFix_NegativePriceHandling(t *testing.T) {
    // Bug: Negative prices caused panic
    // Fix: Added validation to reject negative prices
    order := Order{Price: -10}
    err := order.Validate()
    if err == nil {
        // Fatal, not Error: err.Error() below would panic on nil
        t.Fatal("Expected error for negative price, got nil")
    }
    if !strings.Contains(err.Error(), "negative") {
        t.Errorf("Expected error about negative price, got: %v", err)
    }
}
Document the fix:
cat >> /tmp/debug-*.log <<EOF
## Solution
**Root cause**: [What was actually wrong]
**Fix**: [What you changed]
**Verification**:
- [✓] Original bug resolved
- [✓] No regressions
- [✓] Tests pass
- [✓] Regression test added
**Files changed**:
- [file1]: [description]
- [file2]: [description]
**Commit**: [commit hash]
EOF
# Save to history
mkdir -p ~/.config/claude/history/debugging/$(date +%Y-%m)
cp /tmp/debug-*.log ~/.config/claude/history/debugging/$(date +%Y-%m)/
Debugging Checklist
Before Diving In
- Read the COMPLETE error message (including stack trace)
- Check logs for full context (before and after error)
- Identify what changed recently (git log, deployments; see the bisect sketch after this list)
- Create minimal reproduction case
- State your hypothesis explicitly before testing
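When the recent-change window spans many commits, git bisect can pin down the offending one automatically; the good ref below is a placeholder, and the test script is your reproduction case:
git bisect start
git bisect bad                        # current commit exhibits the bug
git bisect good v1.4.0                # last ref known to work (placeholder)
git bisect run ./reproduce-bug.sh     # script must exit non-zero on failure
git bisect reset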
While Debugging
- Test ONE hypothesis at a time
- Add logging, don't assume
- Document what you tried and results
- Keep track of working states (see the stash sketch after this list)
- Don't change multiple things simultaneously
- Take breaks if stuck (fresh perspective helps)
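One way to keep working states within reach, sketched with git stash (the message is illustrative):
# Shelve the current attempt before trying the next idea
git stash push -m "debugging: timeout experiment"
git stash list    # every shelved attempt stays recoverable
git stash pop     # bring the most recent attempt back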
After Fixing
- Verify fix solves original problem
- Check for side effects
- Run full test suite
- Add regression test
- Remove debug logging
- Document root cause
- Update relevant documentation
Common Pitfalls
Don't Do This
❌ Changing multiple things without testing
# Wrong: Shotgun debugging
sed -i 's/timeout=5/timeout=30/' config.yaml
systemctl restart service
sed -i 's/retries=3/retries=10/' config.yaml
systemctl restart service
❌ Assuming you know the cause
"It's probably the database connection"
[Spends hours investigating database]
[Actual cause: typo in config file]
❌ Debugging without logs
// Wrong: No visibility
func Process() error {
    db.Connect()
    data := fetch()
    process(data)
    return nil
}
❌ Skipping reproduction
"User says it sometimes fails"
[Tries to fix without reproducing]
[Can't verify fix works]
Do This Instead
✅ Change one thing, test, observe
# Right: Systematic testing
sed -i 's/timeout=5/timeout=30/' config.yaml
systemctl restart service
journalctl -u service -n 20 # Check result
✅ Gather evidence first
# Check recent changes
git log --oneline --since="2 days ago"
# Check logs
journalctl -u service --since="2 days ago" | grep -i error
# Check config
diff config.yaml.backup config.yaml
✅ Add comprehensive logging
// Right: Instrument for visibility
func Process() error {
    log.Println("DEBUG: Connecting to database")
    if err := db.Connect(); err != nil {
        log.Printf("DEBUG: DB connection failed: %v", err)
        return err
    }
    log.Println("DEBUG: Fetching data")
    data := fetch()
    log.Printf("DEBUG: Fetched %d records", len(data))
    log.Println("DEBUG: Processing data")
    process(data)
    return nil
}
✅ Create reproduction case
# Minimal reproduction script
cat > reproduce-bug.sh <<'EOF'
#!/bin/bash
set -e
echo "Step 1: Start service"
systemctl start service
echo "Step 2: Send request"
curl http://localhost:8080/trigger-bug
echo "Step 3: Check logs"
journalctl -u service -n 10
EOF
chmod +x reproduce-bug.sh
./reproduce-bug.sh
Debugging Tools by Language/Environment
NixOS/systemd Services
# Service status and recent logs
systemctl status ${service}
# Follow logs in real-time
journalctl -u ${service} -f
# Check service dependencies
systemctl list-dependencies ${service}
# Verify configuration
nixos-rebuild build && nix-instantiate --eval '<nixpkgs/nixos>' -A config.systemd.services.${service}
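Two more systemd views that often matter for startup problems:
# Units that failed during this boot
systemctl --failed
# What ${service} actually waited on at startup
systemd-analyze critical-chain ${service}.service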
Go
# Use the delve debugger
dlv debug ./cmd/app -- --config=dev.yaml
// Add instrumentation (registers /debug/pprof handlers on the default mux)
import _ "net/http/pprof"
// Print a stack trace at runtime
import "runtime/debug"
debug.PrintStack()
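With net/http/pprof imported and an HTTP server listening, profiles can be pulled from the shell; localhost:6060 is the conventional address, not a given:
go tool pprof http://localhost:6060/debug/pprof/goroutine
go tool pprof http://localhost:6060/debug/pprof/heap
# Quick look without the interactive tool
curl -s http://localhost:6060/debug/pprof/goroutine?debug=1 | head -40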
Python
# Use pdb (or the built-in breakpoint() on Python 3.7+)
import pdb; pdb.set_trace()
# Logging
import logging
logging.basicConfig(level=logging.DEBUG)
# Trace function calls (trace_function is a callback you define)
import sys
sys.settrace(trace_function)
Rust
# Enable backtraces
RUST_BACKTRACE=1 cargo run
// Debug logging (set RUST_LOG=debug at runtime to see debug! output)
env_logger::init();
log::debug!("Value: {:?}", value);
# Use rust-lldb
rust-lldb target/debug/app
Network Issues
# Check connectivity
ping ${host}
telnet ${host} ${port}
curl -v http://${host}:${port}
# Check DNS
dig ${domain}
nslookup ${domain}
# Check open ports
ss -tlnp | grep ${port}
netstat -tlnp | grep ${port}
# Packet capture
tcpdump -i any -n port ${port}
Integration with Other Skills
Before debugging:
- Use Grep and Read to examine code
- Check Git history for recent changes
- Review service configurations
After fixing:
- Use TestDrivenDevelopment to add regression tests
- Update documentation with Notes skill
- Consider if fix should be documented in troubleshooting guide
When debugging is extensive:
- Document the investigation process
- Create Notes entry with solution
- Add to troubleshooting documentation
Examples
Example 1: Service won't start
User: "The jellyfin service fails to start after reboot"
→ Invoke SystematicDebugging skill
→ OBSERVE: Check systemctl status jellyfin
→ OBSERVE: Check journalctl -u jellyfin
→ Error: "Failed to bind to port 8096: address already in use"
→ OBSERVE: Check what's using port 8096
→ Find: Old process still running
→ HYPOTHESIS: Process not cleaned up on shutdown
→ TEST: Add KillMode=control-group to systemd unit
→ TEST: Reboot and verify
→ VERIFY: Service starts successfully
→ FIX: Update NixOS configuration
→ Add regression test in CI
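The "what's using port 8096" step in concrete commands (port number taken from this example):
ss -tlnp | grep 8096
lsof -i :8096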
Example 2: Intermittent failures
User: "API sometimes returns 500, can't reproduce consistently"
→ Invoke SystematicDebugging skill
→ OBSERVE: Gather error logs with timestamps
→ OBSERVE: Check for patterns (time of day, request rate, specific endpoints)
→ Pattern found: Errors increase under load
→ HYPOTHESIS: Race condition or resource exhaustion
→ TEST: Add connection pool monitoring
→ Find: Database connection pool exhausted during spikes
→ HYPOTHESIS: Connection pool too small
→ TEST: Increase pool size from 10 to 50
→ TEST: Load test with monitoring
→ VERIFY: No more 500 errors under load
→ FIX: Update configuration
→ Add metrics for connection pool usage
→ Set up alerts for pool exhaustion
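A hypothetical spot-check for the pool-exhaustion hypothesis, assuming the backing store is PostgreSQL:
# Watch connection states during a load spike
watch -n 2 'psql -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"'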
Example 3: Configuration issue after NixOS update
User: "After updating to nixos-unstable, my service is broken"
→ Invoke SystematicDebugging skill
→ OBSERVE: What changed in the update?
→ Check: nix store diff-closures
→ Find: Service package updated from 2.1 to 2.2
→ OBSERVE: Check service logs
→ Error: "Unknown configuration option: legacy_mode"
→ HYPOTHESIS: Config option removed in v2.2
→ TEST: Check v2.2 changelog
→ Confirmed: Option removed, replaced with new_mode
→ FIX: Update NixOS config to use new_mode
→ TEST: nixos-rebuild build
→ TEST: nixos-rebuild switch
→ VERIFY: Service starts successfully
→ Document: Add note about migration in comments
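The "what changed in the update?" step in concrete form; diff-closures compares two closures package by package:
# Compare the running system against the freshly built one
nixos-rebuild build
nix store diff-closures /run/current-system ./result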
Example 4: Memory leak investigation
User: "Service memory usage keeps growing until it crashes"
→ Invoke SystematicDebugging skill
→ OBSERVE: Monitor memory over time
→ OBSERVE: Check for goroutine leaks (if Go)
→ Find: Goroutines increasing steadily
→ HYPOTHESIS: Goroutines not being cleaned up
→ TEST: Add pprof profiling
→ Analyze goroutine stack traces
→ Find: WebSocket connections not closing properly
→ HYPOTHESIS: Missing context cancellation
→ TEST: Add context with timeout to WebSocket handler
→ TEST: Monitor goroutine count
→ VERIFY: Goroutines stable, memory stable
→ FIX: Update WebSocket handler
→ Add test: Verify connections close on timeout
→ Add metrics: Track active WebSocket connections
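A rough monitoring loop for the goroutine-count test, assuming the pprof endpoint from the Go tooling section is listening on localhost:6060; the profile's first line reports the total:
while true; do
    total=$(curl -s localhost:6060/debug/pprof/goroutine?debug=1 | head -n1)
    echo "$(date -Is) ${total}"   # e.g. "... goroutine profile: total 42"
    sleep 60
done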
Example 5: Performance degradation
User: "Queries are getting slower over time"
→ Invoke SystematicDebugging skill
→ OBSERVE: Measure current query performance
→ OBSERVE: Check database indices
→ OBSERVE: Check table sizes
→ Find: Large table with no index on commonly queried column
→ HYPOTHESIS: Missing index causing table scans
→ TEST: Analyze query execution plan
→ Confirmed: Full table scan on 10M rows
→ HYPOTHESIS: Adding index will improve performance
→ TEST: Add index in development environment
→ TEST: Measure query time improvement
→ Result: Query time: 5s → 50ms
→ FIX: Add index migration
→ VERIFY: Performance in production
→ Document: Add comment explaining index purpose
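A hedged sketch of the test and fix, assuming PostgreSQL; table, column, and index names are illustrative:
# Confirm the full table scan
psql -c "EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;"
# A Seq Scan node on the large table supports the hypothesis; add the
# index without blocking writes, then re-check the plan for an Index Scan
psql -c "CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);"
psql -c "EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;"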