| name | incident-analysis |
| description | Analyze and resolve production incidents using systematic investigation, root cause analysis, and autonomous remediation |
Incident Analysis and Resolution Skill
This skill guides you through systematic incident response, from initial alert to resolution and documentation.
When to Use This Skill
Use this skill when:
- Production system is degraded or failing
- Users are reporting issues or errors
- Metrics show abnormal behavior
- Need to investigate and resolve incidents autonomously
Incident Response Workflow
MANDATORY FIRST STEP:
Before using any tools, use the Read tool to read:
.claude/skills/incident-analysis/phases/triage.md
This file contains Phase 1 instructions and will tell you which file to read next.
DO NOT proceed with tool calls until you've read Phase 1.
The complete workflow consists of 6 phases:
- Triage & Assessment (2-5 min) →
phases/triage.md - Investigation (5-10 min) →
phases/investigation.md - Root Cause Analysis (2-5 min) →
phases/rca.md - Remediation Planning (2-3 min) →
phases/remediation.md - Execution (2-5 min) →
phases/execution.md - Communication & Documentation (3-5 min) →
phases/documentation.md
Each phase file contains the "Next Step" section that directs you to the next phase file.
Available Tools
MCP servers provide the necessary tools:
- monitoring-analysis server - System metrics, log analysis, health checks
- workflow-orchestration server - Incident tickets, remediation, notifications
Tools are available throughout all phases as needed.
Key Principles
Progressive Disclosure: The phase files reveal detailed instructions progressively. Read each phase file in sequence - do not skip ahead or assume you know what to do.
Autonomous Execution: Make decisions based on evidence. Take action. Show reasoning.
TodoWrite: Create todos at start for all 6 phases. Update status as you progress.
Success Criteria
Incident is resolved when:
- ✅ Root cause identified with 90%+ confidence
- ✅ Remediation executed and verified
- ✅ Metrics returned to baseline
- ✅ No new errors occurring
- ✅ Team notified and documented