name	quality-debugging-troubleshooting
description	Systematic debugging methodologies, troubleshooting workflows, logging strategies, error tracking, performance profiling, stack trace analysis, and debugging tools across languages and environments. Covers local debugging, distributed systems, production issues, and root cause analysis.

Debugging & Troubleshooting — Quick Reference

This skill provides execution-ready debugging strategies, troubleshooting workflows, and root cause analysis techniques. Claude should apply these patterns when users encounter bugs, errors, performance issues, or production incidents.

Modern Best Practices (2025): Structured logging (Pino/Winston), distributed tracing (OpenTelemetry), error tracking (Sentry/Rollbar), observability-first debugging, time-travel debugging, AI-assisted error analysis, and proactive monitoring.

Quick Reference

Symptom	Tool/Technique	Command/Approach	When to Use
Application crashes	Stack trace analysis	Check error logs, find the first stack frame in your own code	Unhandled exceptions
Slow performance	Profiling (CPU/memory)	`node --prof`, Chrome DevTools, cProfile	High CPU, latency issues
Memory leak	Heap snapshots	`node --inspect`, compare snapshots over time	Memory usage grows
Database slow	Query profiling	`EXPLAIN ANALYZE`, slow query log	Slow queries, high DB CPU
Production-only bug	Log analysis + feature flags	`grep "ERROR"` on the logs, enable verbose logging for the affected user	Can't reproduce locally
Distributed system issue	Distributed tracing	OpenTelemetry, Jaeger, trace request ID	Microservices, async workflows
Intermittent failures	Logging + monitoring	Add detailed logs, monitor metrics	Race conditions, timeouts
Network timeout	Network debugging	`curl`, Postman, check firewall/DNS	External API failures
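
For the "Slow performance" row, a minimal sketch of CPU profiling with Python's built-in cProfile. The workload functions (`slow_sum_of_squares`, `fast_constant`, `workload`) and the `profile_report` helper are hypothetical names made up for illustration; only `cProfile` and `pstats` are real standard-library modules.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n: int) -> int:
    """Deliberately wasteful: builds a fresh list on every iteration."""
    total = 0
    for i in range(n):
        total += sum([j * j for j in range(i % 50)])
    return total


def fast_constant(n: int) -> int:
    """Cheap helper, included so the report has a contrast."""
    return n * 2


def workload() -> int:
    # Hypothetical workload: one hot function and one cheap one.
    return slow_sum_of_squares(20_000) + fast_constant(10)


def profile_report(func, top: int = 5) -> str:
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func()
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_report(workload))
```

Sorting by cumulative time tends to surface the call path that dominates the run; sorting by `"tottime"` instead highlights the functions that burn time in their own body.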

Decision Tree: Debugging Strategy

Issue type: [Problem Scenario]
    ├─ Application Behavior?
    │   ├─ Crashes immediately? → Check stack trace, error logs
    │   ├─ Slow/hanging? → CPU/memory profiling
    │   ├─ Intermittent failures? → Add logging, reproduce consistently
    │   └─ Unexpected output? → Binary search (add logs to narrow down)
    │
    ├─ Performance Issues?
    │   ├─ High CPU? → CPU profiler to find hot functions
    │   ├─ Memory leak? → Heap snapshots, track over time
    │   ├─ Slow database? → EXPLAIN ANALYZE, check indexes
    │   ├─ Network latency? → Trace external API calls
    │   └─ Frontend slow? → Lighthouse, Web Vitals profiling
    │
    ├─ Production-Only?
    │   ├─ Can't reproduce? → Analyze logs for patterns
    │   ├─ Environment difference? → Compare configs, data volume
    │   ├─ Need safe debugging? → Feature flags for verbose logging
    │   └─ Recent deployment? → Git bisect to find regression
    │
    ├─ Distributed Systems?
    │   ├─ Multiple services involved? → Distributed tracing (Jaeger)
    │   ├─ Request lost? → Search logs by request ID
    │   ├─ Service dependency? → Check health checks, circuit breakers
    │   └─ Async workflow? → Trace message queue, event logs
    │
    └─ Error Type?
        ├─ TypeError/NullPointer? → Check object existence, defensive coding
        ├─ Network timeout? → Check external service health, retry logic
        ├─ Database error? → Check connection pool, query syntax
        └─ Unknown error? → Systematic debugging workflow (observe, hypothesize, test)
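
For the "Crashes immediately?" branch, a sketch of how stack trace triage could be automated in Python: walk the traceback from the deepest call outward and stop at the first frame that is not library code. Treating the standard library and site-packages directories as "not your code" is an assumption of this sketch, and `parse_settings`, `first_app_frame`, and `is_library_frame` are hypothetical names.

```python
import json
import os
import sysconfig
import traceback
from typing import Optional

# Directories treated as library code (assumption: stdlib + site-packages).
_LIBRARY_DIRS = tuple(
    os.path.realpath(path) + os.sep
    for key, path in sysconfig.get_paths().items()
    if key in ("stdlib", "platstdlib", "purelib", "platlib")
)


def is_library_frame(filename: str) -> bool:
    """True when the file lives under a stdlib or site-packages directory."""
    return os.path.realpath(filename).startswith(_LIBRARY_DIRS)


def first_app_frame(exc: BaseException) -> Optional[traceback.FrameSummary]:
    """Return the deepest traceback frame that is not library code, if any."""
    frames = traceback.extract_tb(exc.__traceback__)
    for frame in reversed(frames):  # Python lists the most recent call last
        if not is_library_frame(frame.filename):
            return frame
    return None


def parse_settings(raw: str) -> dict:
    # Hypothetical application function; the error is raised inside json.
    return json.loads(raw)


def demo() -> Optional[traceback.FrameSummary]:
    """Trigger a library-level error and locate the application frame."""
    try:
        parse_settings("{not valid json")
    except json.JSONDecodeError as exc:
        return first_app_frame(exc)
    return None


if __name__ == "__main__":
    frame = demo()
    if frame is not None:
        print(f"{frame.filename}:{frame.lineno} in {frame.name}")
```

The exception here is raised inside the `json` package, but the frame worth reading first is the application call that passed in the bad input.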

When to Use This Skill

Claude should invoke this skill when a user reports:

Application crashes or errors
Unexpected behavior or bugs
Performance issues (slow queries, memory leaks, high CPU)
Production incidents requiring root cause analysis
Stack trace or error message interpretation
Debugging strategies for specific scenarios
Log analysis and pattern detection
Distributed system debugging (microservices, async workflows)
Memory leaks and resource exhaustion
Race conditions and concurrency issues
Network connectivity problems
Database query optimization
Third-party API integration issues

Operational Deep Dives

See resources/operational-patterns.md for systematic debugging workflows, logging strategy details, stack trace and performance profiling guides, and language-specific tooling checklists.

Templates (Copy-Paste Ready)

Production templates organized by workflow type:

Debugging Workflow: templates/debugging/template-debugging-checklist.md - Universal debugging checklist with specialized checklists for performance, memory leaks, distributed systems, and production incidents
Incident Response: templates/incidents/template-incident-response.md - Complete incident response playbook with severity levels, communication templates, and postmortem format
Logging Setup: templates/observability/template-logging-setup.md - Production logging configurations for Node.js (Pino), Python (structlog), Go (zap), with Docker and CloudWatch integration

Resources (Deep-Dive Guides)

Operational best practices by domain:

Operational Patterns: resources/operational-patterns.md - Core debugging workflows, stack trace triage, profiling guides, and tool selection
Debugging Methodologies: resources/debugging-methodologies.md - Scientific method, binary search, delta debugging, rubber duck, time-travel debugging, observability-first approaches
Logging Best Practices: resources/logging-best-practices.md - Structured logging, log levels, what to log/not log, implementations by language, request ID propagation, performance optimization
Production Debugging: resources/production-debugging-patterns.md - Safe production debugging techniques, log analysis, metrics, distributed tracing, feature flags, incident response workflow
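
Structured logging with request ID propagation can be sketched with only the Python standard library. This is an illustrative setup, not the configuration shipped in the templates above: the logger name `app.orders`, the field names in the JSON payload, and the `handle_request` function are all assumptions.

```python
import contextvars
import io
import json
import logging

# Holds the request ID for whichever task or thread is currently running.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", "-"),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("app.orders")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, request_id: str, order_id: str) -> None:
    """Hypothetical request handler: set the ID, log, then restore the old value."""
    token = request_id_var.set(request_id)
    try:
        logger.info("order received: %s", order_id)
    finally:
        request_id_var.reset(token)


if __name__ == "__main__":
    out = io.StringIO()
    handle_request(build_logger(out), "req-123", "A1")
    print(out.getvalue(), end="")
```

Because the ID lives in a context variable, every log call made while handling the request carries it without passing it through each function signature, which is what makes "search logs by request ID" practical.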

Navigation

Data

data/sources.json — Curated external references

External Resources

See data/sources.json for:

Debugging tool documentation
Error tracking platforms (Sentry, Rollbar, Bugsnag)
Observability platforms (Datadog, New Relic, Honeycomb)
Profiling tutorials and guides
Production debugging best practices

Quick Decision Matrix

Symptom	Likely Cause	First Action
Application crashes	Unhandled exception	Check error logs and stack trace
Slow performance	Database/network/CPU bottleneck	Profile with performance tools
Memory usage grows	Memory leak	Take heap snapshots over time
Intermittent failures	Race condition, network timeout	Add detailed logging around failure
Production-only bug	Environment difference, data volume	Compare prod vs dev config/data
High CPU usage	Infinite loop, inefficient algorithm	CPU profiler to find hot functions
Database slow	Missing index, N+1 queries	Run EXPLAIN ANALYZE on slow queries
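
For the "Memory usage grows" row, a sketch of the snapshot-comparison idea using Python's built-in tracemalloc, as a stand-in for the Node.js heap snapshots mentioned earlier. The leaking `_cache` list and the `leaky_handler` and `top_growth` names are hypothetical.

```python
import tracemalloc

_cache = []  # hypothetical leak: grows on every call and is never trimmed


def leaky_handler(payload_size: int = 10_000) -> None:
    """Simulates a handler that keeps a reference to every payload it sees."""
    _cache.append(bytearray(payload_size))


def top_growth(calls: int = 200, limit: int = 3):
    """Snapshot before and after a workload; return the largest growth by source line."""
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        for _ in range(calls):
            leaky_handler()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # compare_to returns StatisticDiff entries, largest change first.
    return after.compare_to(before, "lineno")[:limit]


if __name__ == "__main__":
    for stat in top_growth():
        print(stat)
```

A line whose `size_diff` keeps climbing between snapshots taken under steady load is a reasonable first suspect; a single large allocation that stays flat usually is not a leak.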

Anti-Patterns to Avoid

Random changes - Making changes without a hypothesis
Inadequate logging - You can't debug what you can't see
Debugging in production - Reproduce locally when possible; when you cannot, use safe techniques such as feature-flagged verbose logging
Ignoring stack traces - The stack trace shows where the error surfaced; start there
Not writing tests - Fix today, break tomorrow
Symptom fixing - Treating symptoms instead of root cause
No monitoring - Flying blind in production
Skipping postmortems - Not learning from incidents
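
To illustrate the "Not writing tests" item, a minimal sketch of pairing a fix with a regression test using Python's unittest. The function `average_latency_ms` and the bug it once had (a division by zero on empty input) are invented for this example.

```python
import io
import unittest


def average_latency_ms(samples):
    """Mean latency in milliseconds; 0.0 for an empty list."""
    if not samples:  # the fix: this case used to raise ZeroDivisionError
        return 0.0
    return sum(samples) / len(samples)


class AverageLatencyRegressionTest(unittest.TestCase):
    def test_empty_samples_do_not_crash(self):
        # Pins the exact input that triggered the original (hypothetical) bug.
        self.assertEqual(average_latency_ms([]), 0.0)

    def test_normal_samples_unchanged(self):
        # Guards against the fix altering existing behavior.
        self.assertEqual(average_latency_ms([10, 20, 30]), 20.0)


def run_regression_tests() -> unittest.TestResult:
    """Run the suite quietly and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(
        AverageLatencyRegressionTest
    )
    return unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)


if __name__ == "__main__":
    print("passed" if run_regression_tests().wasSuccessful() else "failed")
```

The first test reproduces the failing input, so the same bug cannot return unnoticed; the second confirms the fix did not change results for valid input.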

Related Skills

This skill works with other skills in the framework:

Development & Operations:

foundation-git-workflow - Git bisect for finding regressions, version control workflows
foundation-api-design - API debugging, error handling, REST patterns, status codes

Infrastructure & Platform:

ops-devops-platform - CI/CD pipelines, monitoring, incident response, SRE practices, Kubernetes ops
ops-database-sql - Database query optimization, EXPLAIN ANALYZE, index tuning, slow query debugging

AI/ML Operations:

ai-ml-ops-production - ML model debugging, drift detection, API monitoring, batch pipeline troubleshooting
ai-ml-ops-security - Security debugging, jailbreak detection, privacy issues, threat modeling

Success Criteria: Issues are diagnosed systematically, root causes are identified accurately, fixes include regression tests, and debugging knowledge is documented for future reference.
