| name | rca-analyst |
| description | Structured root cause analysis methodology with three-test isolation and prevention analysis |
Root Cause Analysis Skill
Evidence-based methodology for isolating true root causes from contributing factors.
Core Philosophy
Root cause analysis is evidence gathering, not solution design.
The RCA process produces:
- Clear problem definition with failure boundaries
- Complete evidence timeline
- Causal chain with differentiated root cause vs contributing factors
- Prevention analysis (WHAT needs to change, not HOW)
The RCA process does NOT produce:
- Proposed solutions
- Implementation roadmaps
- Prioritization decisions
- Resource allocation
The Three Tests for Root Cause Isolation
Most teams conflate root cause with contributing factors. Use these three tests to isolate the true root cause:
Test 1: Counterfactual Test
"If this factor didn't exist, would the failure still have occurred?"
- Root Cause: NO - removing it prevents the failure
- Contributing Factor: YES - failure would still occur via another path
Test 2: Sufficiency Test
"Is this factor alone sufficient to cause the failure?"
- Root Cause: YES - this factor alone can cause the failure
- Contributing Factor: NO - requires other factors to cause failure
Test 3: Necessity Test
"Is this factor necessary for the failure to occur?"
- Root Cause: YES - failure cannot occur without it
- Contributing Factor: NO - failure could occur without it
Applying the Tests
| Factor | Counterfactual | Sufficiency | Necessity | Classification |
|---|---|---|---|---|
| True Root Cause | NO (prevents failure) | YES | YES | ROOT CAUSE |
| Contributing Factor | YES (still fails) | NO | NO | Contributing |
| Necessary Condition | NO | NO | YES | Enabler |
| Amplifying Factor | YES | NO | NO | Amplifier |
The root cause is the factor that passes ALL THREE tests.
Five-Phase Workflow
Phase 1: Problem Definition
Goal: Establish clear boundaries around what failed and what didn't.
Required Outputs:
- Failure Statement: One sentence describing what failed
- Scope Boundaries: What's in scope, what's explicitly out
- Success Criteria: What would "working correctly" look like?
- Timeline Bounds: When did the failure window start/end?
Checkpoint Questions:
- Can you state the failure in one sentence?
- Are scope boundaries explicit?
- Is the success criteria measurable?
Phase 2: Evidence Gathering
Goal: Collect all relevant data without interpretation.
Evidence Types:
- System Evidence: Logs, metrics, traces, error messages
- Human Evidence: Who did what, when, communications
- Environmental Evidence: Config states, deployment status, external factors
- Temporal Evidence: Timeline of events
Rules:
- Record timestamps for everything
- Capture exact error messages/codes
- Note absence of expected events
- Don't interpret yet - just collect
Checkpoint Questions:
- Do you have a complete timeline?
- Are all system states captured?
- Are human actions documented?
- Are environmental factors recorded?
Phase 3: Causal Chain Construction
Goal: Build the chain of events that led to failure.
Process:
- Start from the failure event
- Ask "What directly caused this?" for each event
- Continue until you reach actionable factors
- Map dependencies between events
Output Format:
[Triggering Event]
↓
[Intermediate Event 1]
↓
[Intermediate Event 2]
↓
[Failure Event]
Checkpoint Questions:
- Does each link have evidence?
- Are there gaps in the chain?
- Have you identified branch points?
Phase 4: Root Cause Isolation
Goal: Apply the three tests to identify the true root cause.
Process:
- List all candidate factors from the causal chain
- Apply Counterfactual Test to each
- Apply Sufficiency Test to survivors
- Apply Necessity Test to survivors
- Factor passing all three = Root Cause
Classification Output:
| Factor | Counterfactual | Sufficiency | Necessity | Classification |
|---|---|---|---|---|
| [Factor A] | [Result] | [Result] | [Result] | [Type] |
| [Factor B] | [Result] | [Result] | [Result] | [Type] |
Checkpoint Questions:
- Did you test every candidate?
- Does your root cause pass all three tests?
- Can you explain why contributing factors failed each test?
Phase 5: Prevention Analysis
Goal: Identify WHAT needs to change to prevent recurrence (not HOW to change it).
Categories:
- Detection Gaps: What monitoring/alerting would have caught this earlier?
- Process Gaps: What process changes would prevent the root cause?
- System Gaps: What system changes would eliminate the failure mode?
- Knowledge Gaps: What documentation/training gaps contributed?
Rules:
- State the gap, not the solution
- Focus on prevention, not recovery
- Identify gaps at multiple levels
Checkpoint Questions:
- Have you identified gaps in each category?
- Are gaps stated without prescribing solutions?
- Do gaps map back to root cause and contributing factors?
Differential Analysis
Critical Question: Why did THIS case fail when similar cases succeed?
This section is REQUIRED for every RCA. It forces examination of:
- What was different about this specific failure?
- What conditions existed that don't exist in successful cases?
- Why didn't existing safeguards prevent this?
Template:
## Differential Analysis
### Similar Cases That Succeeded
[List similar situations that didn't fail]
### Key Differences in Failure Case
[What was different about this case?]
### Safeguard Failures
[Why didn't existing protections work?]
### Unique Conditions
[What conditions were present only in this failure?]
RCA Output Template
# Root Cause Analysis: [Incident Name]
**Date**: [Analysis Date]
**Incident Date**: [When failure occurred]
**Analyst**: [Who conducted RCA]
**Status**: [Draft | Review | Final]
---
## 1. Problem Definition
### Failure Statement
[One sentence describing what failed]
### Scope
- **In Scope**: [What this RCA covers]
- **Out of Scope**: [What this RCA excludes]
### Success Criteria
[What "working correctly" looks like]
### Timeline Bounds
- **Start**: [When failure window began]
- **End**: [When failure was resolved]
---
## 2. Evidence Timeline
| Timestamp | Event | Source | Evidence |
|-----------|-------|--------|----------|
| [Time] | [What happened] | [Log/Person/System] | [Exact data] |
---
## 3. Causal Chain
[Visual chain from trigger to failure]
[Event 1] → [Event 2] → [Event 3] → [FAILURE]
### Chain Narrative
[Prose explanation of how events connected]
---
## 4. Root Cause Isolation
### Candidate Factors
| Factor | Counterfactual | Sufficiency | Necessity | Classification |
|--------|----------------|-------------|-----------|----------------|
| [Factor] | [YES/NO] | [YES/NO] | [YES/NO] | [Type] |
### Root Cause Statement
[The factor that passed all three tests]
### Contributing Factors
[Factors that amplified or enabled but aren't root cause]
---
## 5. Differential Analysis
### Similar Cases That Succeeded
[List similar situations that didn't fail]
### Key Differences
[What was unique about this failure case?]
### Safeguard Failures
[Why didn't existing protections work?]
---
## 6. Prevention Analysis
### Detection Gaps
- [What monitoring would catch this earlier?]
### Process Gaps
- [What process changes would prevent root cause?]
### System Gaps
- [What system changes would eliminate failure mode?]
### Knowledge Gaps
- [What documentation/training gaps contributed?]
---
## Appendix: Raw Evidence
[Attach logs, screenshots, configs, communications]
Interactive Workflow for Agents
When conducting RCA with a user, follow this interactive flow:
Starting the RCA
I'll guide you through a structured Root Cause Analysis. We'll work through 5 phases:
1. Problem Definition
2. Evidence Gathering
3. Causal Chain Construction
4. Root Cause Isolation (using three tests)
5. Prevention Analysis
Let's start with Phase 1: Problem Definition.
Can you describe in one sentence what failed?
Phase Transitions
At each phase checkpoint, verify completion before proceeding:
Before we move to [Next Phase], let me verify:
- [ ] [Checkpoint 1]
- [ ] [Checkpoint 2]
- [ ] [Checkpoint 3]
[If incomplete]: We're missing [X]. Can you provide [specific ask]?
[If complete]: Great, let's proceed to [Next Phase].
Root Cause Isolation Dialogue
Now let's apply the three tests to isolate the root cause.
For [Factor]:
1. Counterfactual: If [Factor] didn't happen, would the failure still occur?
2. Sufficiency: Is [Factor] alone enough to cause this failure?
3. Necessity: Is [Factor] required for this failure to occur?
Based on your answers: [Classification]
Completing the RCA
Based on our analysis:
**Root Cause**: [Statement]
**Contributing Factors**: [List]
**Prevention Gaps**: [Categories]
Would you like me to generate the full RCA document?
Worked Example: API Rate Limiting Incident
Context
Merchant integration failed during high-traffic sale event.
Phase 1: Problem Definition
- Failure: Merchant orders dropped for 47 minutes during Black Friday
- Scope: Order processing via Channel API
- Success: All valid orders processed within SLA
Phase 2: Evidence
| Time | Event | Source |
|---|---|---|
| 09:15 | Traffic spike began | Metrics |
| 09:23 | MAX_COST_EXCEEDED errors started | API logs |
| 09:24 | Retry storm began | Client logs |
| 10:10 | Manual intervention resolved | Incident channel |
Phase 3: Causal Chain
[Traffic spike] → [Query complexity exceeded limit] → [MAX_COST_EXCEEDED] → [Client retries] → [Amplified load] → [Extended outage]
Phase 4: Root Cause Isolation
| Factor | Counterfactual | Sufficiency | Necessity | Classification |
|---|---|---|---|---|
| Traffic spike | YES (still fails if queries complex) | NO | NO | Amplifier |
| Complex queries | NO (prevents error) | YES | YES | ROOT CAUSE |
| Aggressive retries | YES (still errors) | NO | NO | Amplifier |
| No alerting | YES (still errors) | NO | NO | Enabler |
Root Cause: Query complexity exceeded platform limits under normal traffic patterns.
Phase 5: Differential Analysis
Why This Merchant?
- Other merchants with same traffic didn't fail
- This merchant's query pattern was 3x more complex
- Query complexity wasn't validated during onboarding
Prevention Gaps
- Detection: No query complexity monitoring pre-incident
- Process: No onboarding validation of query patterns
- System: No graceful degradation for complex queries
- Knowledge: Rate limit documentation didn't cover complexity costs