| name | incident-root-cause-analyzer |
| description | Comprehensive incident root cause analysis skill for distributed systems. Analyzes logs, metrics, and traces to identify cascading failures, resource contention, and root causes through systematic anomaly detection, timeline correlation, and evidence-based hypothesis testing. |
| license | MIT |
| allowed-tools | code-interpreter, file-system |
Incident Root Cause Analyzer
This skill provides a comprehensive methodology for analyzing production incidents in distributed systems, with special focus on cascading failures, resource contention, and backpressure propagation.
When to Use This Skill
Use this skill when:
- Analyzing production incidents with timeout errors or performance degradation
- Investigating cascading failures across multiple services
- Identifying resource contention and backpressure issues
- Tracing root causes through metrics, logs, and traces
- Generating evidence-based root cause analysis reports
Analysis Methodology
Phase 1: Data Collection and Preparation
Collect Evidence:
- Log files (structured and unstructured)
- Metrics data (Prometheus, Datadog, etc.)
- Trace data (if available)
- Incident timeline (error counts, time windows)
Data Cleaning:
- Parse structured logs (JSON, key-value pairs)
- Extract key fields: timestamp, service, API path, latency, error codes
- Normalize timestamps to common format
- Filter and consolidate by service/host
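A minimal sketch of this cleaning step, assuming JSON-lines logs with timestamp, service, path, latency_ms, and status fields (the field names are illustrative, not mandated by this skill):

```python
import json
import pandas as pd

def parse_logs(path: str) -> pd.DataFrame:
    """Parse JSON-lines logs into a DataFrame with normalized timestamps."""
    records = []
    with open(path) as f:
        for line in f:
            try:
                records.append(json.loads(line))  # structured line
            except json.JSONDecodeError:
                continue  # unstructured lines need a separate regex pass
    df = pd.DataFrame(records)
    # Normalize all timestamps to UTC so every source shares one time axis
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True, errors="coerce")
    return df.dropna(subset=["timestamp"]).sort_values("timestamp")

# Consolidate by service for per-service analysis:
# per_service = {svc: g for svc, g in parse_logs("incident.log").groupby("service")}
```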
Data Validation:
- Check data completeness
- Identify missing metrics or time windows
- Validate timestamp ranges match incident window
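One way to check completeness is to look for minutes inside the incident window with no samples at all; a sketch, assuming one-minute granularity:

```python
import pandas as pd

def missing_minutes(df: pd.DataFrame, start: str, end: str) -> pd.DatetimeIndex:
    """Return the minutes in [start, end] that have no samples in the cleaned data."""
    expected = pd.date_range(start, end, freq="1min", tz="UTC")
    observed = pd.DatetimeIndex(df["timestamp"].dt.floor("1min").unique())
    return expected.difference(observed)

# gaps = missing_minutes(df, "2024-01-01 06:30", "2024-01-01 07:30")  # placeholder date
# if len(gaps):
#     print("No data for:", [g.strftime("%H:%M") for g in gaps])
```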
Phase 2: Anomaly Detection
Metric Anomaly Scoring:
- Calculate baseline metrics (P50, P95, P99) for normal periods
- Compute anomaly scores for each metric at incident time
- Use statistical methods (z-score, IQR) for outlier detection
- Critical: Identify highest anomaly score (likely root trigger)
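A sketch of the scoring itself, using z-scores against a normal-period baseline; the exact formula is a choice, and a robust IQR-based variant works just as well:

```python
import numpy as np
import pandas as pd

def baseline_summary(baseline: pd.Series) -> dict:
    """P50/P95/P99 of the normal period, reported alongside the anomaly scores."""
    return {f"P{q}": float(np.percentile(baseline.dropna(), q)) for q in (50, 95, 99)}

def anomaly_score(incident: pd.Series, baseline: pd.Series) -> pd.Series:
    """Absolute z-score of each incident-window sample relative to the baseline."""
    mu, sigma = baseline.mean(), baseline.std()
    sigma = sigma if sigma and not np.isnan(sigma) else 1e-9  # guard flat baselines
    return (incident - mu).abs() / sigma

# scores = anomaly_score(incident_df["memory_mb"], normal_df["memory_mb"])
# print(scores.max(), baseline_summary(normal_df["memory_mb"]))
```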
Temporal Pattern Analysis:
- Align metrics by timestamp
- Identify which anomalies occurred FIRST
- Track cascade sequence (downstream → upstream or vice versa)
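A sketch of the ordering step: given already-scored metrics on a common time grid, record the first time each one crosses a threshold and sort (the 3.0 threshold is an assumption):

```python
import pandas as pd

def cascade_order(scored: dict, threshold: float = 3.0) -> pd.Series:
    """Map each metric name to the first timestamp its anomaly score exceeds threshold."""
    firsts = {}
    for name, scores in scored.items():  # scores: pd.Series indexed by timestamp
        hits = scores[scores > threshold]
        firsts[name] = hits.index.min() if not hits.empty else pd.NaT
    return pd.Series(firsts).sort_values()  # earliest entry = root-trigger candidate

# order = cascade_order({"ctlgo_center_memory": mem_scores, "cut_phase_p99": cut_scores})
```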
Key Metrics to Analyze:
- Throughput (QPS): Did it increase (traffic surge) or decrease (capacity collapse)?
- Error Rate: When did errors spike relative to throughput?
- Latency Metrics: P95, P99, long-tail patterns
- Resource Metrics: CPU, memory, thread pool, GPU utilization
- Network Metrics: TX/RX traffic, connection counts
Phase 3: Root Cause Hypothesis Generation
Evidence Review:
- Evidence 1: Throughput analysis (stable vs. changing)
- Evidence 2: Error timing relative to latency spikes
- Evidence 3: Highest anomaly score (root trigger candidate)
- Evidence 4: Timeline correlation (what happened first)
- Evidence 5: Resource contention patterns
Hypothesis Testing:
Hypothesis 1: Downstream Pressure → Backpressure → Resource Starvation ⭐⭐⭐⭐⭐
- Pattern: Throughput stable, error rate spikes, extreme long-tail latency
- Timeline: Downstream service anomaly → upstream latency spike → resource starvation
- Evidence: Memory/CPU anomaly in downstream service BEFORE upstream issues
Hypothesis 2: Traffic Surge ⭐⭐
- Pattern: Throughput increases, resources exhausted
- Evidence: QPS spike, simultaneous latency increase
- Rejection: If throughput didn't increase, reject this hypothesis
Hypothesis 3: Capacity Collapse ⭐⭐⭐
- Pattern: Throughput decreases, all metrics degrade simultaneously
- Evidence: Service health metrics all fail together
- Distinction: Different from backpressure (backpressure = stable throughput, high latency)
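These three patterns can be encoded as a first-pass rule of thumb; the thresholds below (roughly ±20% throughput change, 5x error and latency increases) are illustrative assumptions, not values defined by this skill:

```python
def classify_failure_pattern(qps_ratio: float, error_ratio: float, p99_ratio: float) -> str:
    """Each argument is the incident-window value divided by the baseline value."""
    if error_ratio < 5:
        return "no clear pattern: inspect manually"
    if qps_ratio > 1.2:
        return "Hypothesis 2: traffic surge (capacity issue)"
    if qps_ratio < 0.8:
        return "Hypothesis 3: capacity collapse (service failure)"
    if p99_ratio > 5:
        return "Hypothesis 1: backpressure and resource starvation"
    return "no clear pattern: inspect manually"

# classify_failure_pattern(qps_ratio=1.02, error_ratio=40, p99_ratio=25)
# -> "Hypothesis 1: backpressure and resource starvation"
```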
Evidence Chain Construction:
- Build causal chain: Root Trigger → First Cascade → Resource Contention → Second Cascade → Incident
- Verify temporal order (root cause must precede cascades)
- Calculate time gaps between events
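A sketch of the chain-building step: take the first-anomaly times from Phase 2, verify the ordering, and compute the gap between consecutive events (event names and dates below are placeholders):

```python
import pandas as pd

def build_causal_chain(events: dict) -> list:
    """Order events by timestamp and annotate each link with the gap to the previous one."""
    ordered = sorted(events.items(), key=lambda kv: kv[1])
    chain, prev = [], None
    for name, ts in ordered:
        gap = "" if prev is None else f" (+{(ts - prev).total_seconds():.0f}s)"
        chain.append(f"{ts:%H:%M:%S} {name}{gap}")
        prev = ts
    return chain

# Timestamps use a placeholder date; only the times come from the example later in this document.
# build_causal_chain({
#     "Ctlgo-Center memory pressure (root trigger)": pd.Timestamp("2024-01-01 06:55:00"),
#     "Cut phase latency spike (first cascade)": pd.Timestamp("2024-01-01 06:55:00"),
#     "Detection phase latency spike (second cascade)": pd.Timestamp("2024-01-01 06:56:00"),
#     "HTTP 504 timeout incident": pd.Timestamp("2024-01-01 07:00:00"),
# })
```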
Phase 4: Visualization Generation
Evidence Charts (6 critical visualizations):
- Evidence 1: Root trigger metric (highest anomaly score) - time series
- Evidence 2: First cascade point (cut phase latency spike)
- Evidence 3: Second cascade point (detection phase latency spike)
- Evidence 4: Resource contention (thread pool, active/wait queue)
- Evidence 5: Timeline correlation (all metrics aligned)
- Evidence 6: Network traffic anomaly (TX/RX spikes)
Charts Must Include:
- Clear English labels
- Incident time window highlighted
- Anomaly scores and peak values annotated
- Temporal markers (e.g., "06:55:00 - First cascade")
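A minimal matplotlib sketch showing how a single evidence chart can meet these requirements; the series, window, and file name are placeholders:

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_evidence(series: pd.Series, window: tuple, title: str, out_path: str) -> None:
    """Time-series evidence chart with incident window shading and peak annotation."""
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(series.index, series.values, linewidth=1.2)
    ax.axvspan(window[0], window[1], color="red", alpha=0.15, label="Incident window")
    peak_ts = series.idxmax()
    ax.annotate(f"peak={series.max():.1f}", xy=(peak_ts, series.max()),
                xytext=(10, 10), textcoords="offset points")
    ax.set_title(title)
    ax.set_xlabel("Time (UTC)")
    ax.legend(loc="upper left")
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)

# plot_evidence(mem_scores,
#               (pd.Timestamp("2024-01-01 06:55"), pd.Timestamp("2024-01-01 07:00")),
#               "Evidence 1: Ctlgo-Center memory anomaly",
#               "root_cause_evidence/evidence_1.png")
```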
Fault Evolution Diagram:
- Create Mermaid diagram showing:
  - Root trigger → Cascades → Resource contention → Incident
  - Time points and key metrics
  - Causal relationships
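As a sketch, the diagram text can be assembled programmatically; this helper is hypothetical (it is not an interface of scripts/generate_report.py), and the labels follow the example incident later in this document:

```python
def fault_evolution_mermaid(steps: list) -> str:
    """Render a linear fault-evolution chain as the body of a Mermaid flowchart."""
    lines = ["flowchart LR"]
    for i, (ts, label) in enumerate(steps):
        lines.append(f'    n{i}["{ts}<br/>{label}"]')
    lines += [f"    n{i} --> n{i + 1}" for i in range(len(steps) - 1)]
    return "\n".join(lines)

# print(fault_evolution_mermaid([
#     ("06:55:00", "Ctlgo-Center memory pressure (root trigger)"),
#     ("06:55:00", "Cut phase latency spike (first cascade)"),
#     ("06:56:00", "Detection phase latency spike (second cascade)"),
#     ("07:00:00", "HTTP 504 timeout incident"),
# ]))
```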
Phase 5: Report Generation
Executive Summary:
- Incident description (what, when, impact)
- Root cause statement (concise, evidence-based)
- Key findings (3-5 bullet points)
Detailed Analysis:
- Evidence review (5 critical evidence points)
- Hypothesis evaluation (ranked by likelihood)
- Root cause chain (temporal sequence)
Recommendations:
- Immediate actions (fix root cause)
- Short-term (mitigate cascades)
- Long-term (architectural improvements)
Key Principles
1. Temporal Causality
- Root cause MUST precede cascades in time
- Build timeline showing: Root trigger → Cascade 1 → Cascade 2 → Incident
- Calculate time gaps (e.g., "5 minutes before incident")
2. Anomaly Score Hierarchy
- Highest anomaly score = most likely root trigger
- Compare all metric anomaly scores
- Don't ignore metrics with highest scores
3. Throughput vs. Error Rate Analysis
- Stable throughput + spiking errors = resource contention (not traffic surge)
- Increasing throughput + errors = traffic surge (capacity issue)
- Decreasing throughput + errors = capacity collapse (service failure)
4. Long-tail Latency Patterns
- Consistent extreme long-tail (~5000ms) = resource starvation (queuing)
- Random spikes = transient issues or model complexity
- Long-tail timing = when did it start? (after first cascade?)
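One way to tell these patterns apart is to check whether per-minute P99 is pinned near a fixed ceiling (queuing against a timeout) rather than spiking sporadically; the 5000 ms ceiling and 90% fraction below are assumptions:

```python
import pandas as pd

def longtail_pattern(latency_ms: pd.Series, ceiling_ms: float = 5000.0) -> str:
    """Classify long-tail behaviour; latency_ms must be indexed by timestamp."""
    p99 = latency_ms.resample("1min").quantile(0.99).dropna()
    pinned_fraction = (p99 > 0.9 * ceiling_ms).mean()  # share of minutes near the ceiling
    if pinned_fraction > 0.9:
        return "consistent extreme long-tail: likely resource starvation / queuing"
    if pinned_fraction > 0:
        return "sporadic spikes: transient issue or request-dependent cost"
    return "no extreme long-tail observed"
```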
5. Resource Contention Indicators
- Thread pool: High active + Low wait queue = threads blocked
- Memory: High usage + GC pressure = performance degradation
- Network: High TX traffic = data backlog or retry storms
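These indicators can be checked mechanically once metrics are aligned; the thresholds below are illustrative assumptions rather than values defined by this skill:

```python
def contention_flags(active_threads: int, pool_size: int,
                     wait_queue: int, tx_mbps: float, tx_baseline_mbps: float) -> list:
    """Return human-readable resource-contention indicators for the report."""
    flags = []
    if pool_size and active_threads / pool_size > 0.9 and wait_queue < 2:
        flags.append("thread pool saturated with few waiters: threads likely blocked downstream")
    if tx_baseline_mbps and tx_mbps / tx_baseline_mbps > 3:
        flags.append("TX traffic well above baseline: possible data backlog or retry storm")
    return flags

# contention_flags(active_threads=198, pool_size=200, wait_queue=0,
#                  tx_mbps=420.0, tx_baseline_mbps=80.0)
```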
Output Structure
Required Outputs
Root Cause Analysis Report (ROOT_CAUSE_REANALYSIS.md):
- Executive summary
- Critical evidence review
- Hypothesis evaluation
- Root cause chain
- Recommendations
Evidence Visualizations (6 PNG files):
- Root trigger metric
- First cascade
- Second cascade
- Resource contention
- Timeline correlation
- Network traffic
Fault Evolution Diagram (fault_evolution_path.md):
- Mermaid diagrams (multiple views)
- Timeline view
- Component interaction view
- Sequence diagram
Evidence README (root_cause_evidence/README.md):
- Evidence index
- Chart descriptions
- Data points summary
- Evidence chain explanation
Execution Workflow
Initialize Analysis:
```bash
# Set up analysis directory structure
mkdir -p analysis/reports/metrics/anomaly_detection/{root_cause_evidence,core_charts}
```

Load and Process Data:

```bash
# Use scripts/analyze_metrics.py to:
# - Load metric CSV files
# - Calculate anomaly scores
# - Generate time series data
```

Generate Visualizations:

```bash
# Use scripts/generate_evidence_charts.py to:
# - Create 6 evidence charts
# - Add annotations and highlights
# - Export as PNG files
```

Build Root Cause Chain:

```bash
# Use scripts/root_cause_analysis.py to:
# - Evaluate hypotheses
# - Build causal chain
# - Generate timeline alignment
```

Generate Reports:

```bash
# Use scripts/generate_report.py to:
# - Create markdown reports
# - Generate Mermaid diagrams
# - Build evidence index
```
Example Analysis Flow
Scenario: 685 HTTP 504 Gateway Timeout errors at 07:00:00
- Data Collection: Load metrics for 06:30-07:30 window
- Anomaly Detection: Calculate scores, find highest = Ctlgo-Center memory (2316.86)
- Timeline Analysis:
- 06:55:00 - Memory pressure (root trigger)
- 06:55:00 - Cut phase spike (first cascade)
- 06:56:00 - Detection spike (second cascade)
- 07:00:00 - Timeout incident
- Hypothesis: Ctlgo-Center memory pressure → Backpressure → Resource starvation
- Evidence: Throughput stable, extreme long-tail latency, temporal causality confirmed
- Conclusion: Root cause = Ctlgo-Center memory pressure triggering cascading resource starvation
Scripts Reference
- scripts/analyze_metrics.py: Anomaly detection and scoring
- scripts/generate_evidence_charts.py: Visualization generation
- scripts/root_cause_analysis.py: Hypothesis evaluation and chain building
- scripts/generate_report.py: Report generation with Mermaid diagrams
References
See references/ directory for:
- analysis_methodology.md: Detailed methodology
- hypothesis_patterns.md: Common failure patterns
- visualization_templates.md: Chart templates and styles