| name | error-recovery |
| description | Use when encountering failures - assess severity, preserve evidence, execute rollback decision tree, and verify post-recovery state |
Error Recovery
Overview
Handle failures gracefully with structured recovery.
Core principle: When things break, don't panic. Assess, preserve, recover, verify.
Announce at start: "I'm using error-recovery to handle this failure."
The Recovery Protocol
Error Detected
       │
       ▼
┌─────────────┐
│ 1. ASSESS   │ ← Severity? Scope? Impact?
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ 2. PRESERVE │ ← Capture evidence before it's lost
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ 3. RECOVER  │ ← Follow decision tree
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ 4. VERIFY   │ ← Confirm clean state
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ 5. DOCUMENT │ ← Record what happened
└─────────────┘
Step 1: Assess Severity
Severity Levels
| Level | Description | Examples |
|---|---|---|
| Critical | System unusable, data at risk | Build completely broken, tests cause data loss |
| Major | Significant functionality broken | Feature doesn't work, many tests failing |
| Minor | Isolated issue, workaround exists | Single test flaky, style error |
| Info | Warning only, not blocking | Deprecation notice, performance hint |
Assessment Questions
## Error Assessment
**Error:** [Description of error]
**Location:** [Where it occurred]
### Severity Checklist
- [ ] Is the system still functional?
- [ ] Is any data at risk?
- [ ] Are other features affected?
- [ ] Is this blocking progress?
### Scope
- Files affected: [list]
- Features affected: [list]
- Users affected: [none/some/all]
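The checklist answers can be turned into a severity level mechanically. The sketch below is illustrative only: the function name `classify_severity`, its yes/no arguments, and the exact mapping from answers to levels are assumptions, not something this skill defines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map yes/no checklist answers to a severity level.
# Arguments (each "yes" or "no"):
#   $1 system still functional?   $2 data at risk?
#   $3 other features affected?   $4 blocking progress?
classify_severity() {
  local functional="$1" data_at_risk="$2" others_affected="$3" blocking="$4"

  if [ "$functional" = "no" ] || [ "$data_at_risk" = "yes" ]; then
    echo "Critical"   # system unusable or data at risk
  elif [ "$others_affected" = "yes" ]; then
    echo "Major"      # damage spreads beyond one spot
  elif [ "$blocking" = "yes" ]; then
    echo "Minor"      # isolated, but still in the way
  else
    echo "Info"       # warning only
  fi
}
```

For example, `classify_severity yes no yes no` prints `Major`. Adjust the order of the checks if your project weighs the questions differently.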
Step 2: Preserve Evidence
Capture BEFORE attempting fixes:
Error Logs
# Capture error output
pnpm test 2>&1 | tee error-log.txt
# Or from failed command
./failing-command 2>&1 | tee error-log.txt
Stack Traces
## Stack Trace
Error: Connection refused
    at Database.connect (src/db/connection.ts:45)
    at UserService.init (src/services/user.ts:23)
    at main (src/index.ts:12)
State Capture
# Git state
git status
git diff
# Environment state
env | grep -E "NODE|NPM|PATH"
# Dependency state
pnpm list
Screenshot (if visual)
For UI errors, capture screenshots before changes.
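The log and state captures above can be bundled into one step so nothing is forgotten under pressure. This is a minimal sketch, assuming a hypothetical helper named `capture_evidence` and an output layout of my own choosing. Unlike a plain `| tee`, it also records the exit code of the failing command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: gather error output and state into one directory.
# Usage: capture_evidence <output-dir> <command> [args...]
capture_evidence() {
  local out_dir="$1"; shift
  local status=0
  mkdir -p "$out_dir"

  # Run the failing command, keeping stdout and stderr together.
  "$@" > "$out_dir/error-log.txt" 2>&1 || status=$?
  echo "$status" > "$out_dir/exit-code.txt"

  # Git state (skipped quietly outside a repository).
  if git rev-parse --is-inside-work-tree > /dev/null 2>&1; then
    git status > "$out_dir/git-status.txt" 2>&1
    git diff > "$out_dir/git-diff.patch" 2>&1
  fi

  # Environment state, using the same filter as above.
  env | grep -E "NODE|NPM|PATH" > "$out_dir/env.txt" || true

  # Hand the original exit code back to the caller.
  return "$status"
}
```

A call such as `capture_evidence evidence/run-1 pnpm test` would leave the log, exit code, and state files under `evidence/run-1`; the directory name is only an example.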
Step 3: Recover
Decision Tree
             What type of failure?
                       │
     ┌───────────┬─────┴─────┬───────────┐
     │           │           │           │
   Code        Build    Environment  External
   Error       Error       Issue      Service
     │           │           │           │
     ▼           ▼           ▼           ▼
  ┌─────┐     ┌─────┐     ┌─────┐     ┌─────┐
  │Git  │     │Clean│     │Re-  │     │Wait/│
  │reco-│     │build│     │init │     │Retry│
  │very │     │     │     │     │     │     │
  └─────┘     └─────┘     └─────┘     └─────┘
Code Error Recovery
Single file broken:
# Revert just that file
git checkout HEAD -- path/to/file.ts
Feature broken (multiple files):
# Find last good commit
git log --oneline
# Move HEAD back to that commit (soft reset: files stay untouched, later changes end up staged)
git reset --soft [GOOD_COMMIT]
# Or hard reset (discards uncommitted changes and drops later commits from the branch)
git reset --hard [GOOD_COMMIT]
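Because a hard reset is destructive, it can help to leave a way back first. The following is a sketch under assumptions: `safe_hard_reset` and the `backup/pre-reset-...` naming are hypothetical, not git features. It uses only standard commands (`git branch`, `git stash push`, `git reset --hard`).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of git): keep a way back before a hard reset.
# Usage: safe_hard_reset <good-commit>
# Prints the name of the backup branch on stdout.
safe_hard_reset() {
  local target="$1"
  local backup
  backup="backup/pre-reset-$(date +%Y%m%d-%H%M%S)"

  # Branch at the current commit so later commits stay reachable.
  git branch "$backup" || return 1

  # A branch does not cover uncommitted work, so stash that as well.
  # With nothing to stash, git says so and still exits successfully.
  git stash push --include-untracked --message "$backup" >&2 || return 1

  git reset --hard "$target" >&2 || return 1
  echo "$backup"
}
```

To undo the reset later, `git reset --hard <backup-branch>` followed by `git stash pop` should restore both the commits and the uncommitted changes.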
Working directory is a mess:
# Stash current changes
git stash
# Verify clean state
git status
# Optionally recover stash later
git stash pop
Build Error Recovery
# Clean build artifacts
rm -rf node_modules dist build .cache
# Reinstall dependencies
pnpm install --frozen-lockfile # Clean install from lock file
# Rebuild
pnpm build
Environment Error Recovery
# Check environment
env | grep -E "NODE|PNPM"
# Reset Node modules
rm -rf node_modules
pnpm install --frozen-lockfile
# If using nvm, verify version
nvm use
# Re-run init script
./scripts/init.sh
External Service Error
# Check if service is up
curl -I https://service.example.com/health
# If down, wait and retry
sleep 60
curl -I https://service.example.com/health
# If still down, check status page
# Document as external blocker
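The wait-and-retry step can be written as a bounded loop rather than repeated by hand. This is a minimal sketch; the `retry` function and its argument order are assumptions made for illustration. Note that `curl -I` exits with status 0 even when the server answers with an HTTP error, so the example adds `--fail` to make the exit code reflect the response.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a command a fixed number of times with a delay.
# Usage: retry <max-attempts> <delay-seconds> <command> [args...]
retry() {
  local max_attempts="$1" delay="$2"; shift 2
  local attempt=1

  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Still failing after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example (the URL is the placeholder used above):
# retry 3 60 curl --fail --silent --head https://service.example.com/health
```

If the loop gives up, treat the service as an external blocker and document it as described above.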
Step 4: Verify
After recovery, verify clean state:
Basic Verification
# Clean working directory
git status
# Expected: "nothing to commit, working tree clean" or known changes
# Tests pass
pnpm test
# Build succeeds
pnpm build
# Types check
pnpm typecheck
Functionality Verification
# Run the specific thing that was broken
# (--grep is the Mocha filter flag; Jest and Vitest use -t instead)
pnpm test --grep "specific test"
# Or verify the feature manually
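Running the verification commands one by one makes it easy to stop at the first failure and miss the rest. The sketch below runs every check and reports each result; `verify_all` is a hypothetical name, and passing each check as a quoted string is a design choice of this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run each check, report every result,
# and return non-zero if any check failed.
# Usage: verify_all "<command 1>" "<command 2>" ...
verify_all() {
  local failed=0 check
  for check in "$@"; do
    if sh -c "$check" > /dev/null 2>&1; then
      echo "PASS: $check"
    else
      echo "FAIL: $check"
      failed=1
    fi
  done
  return "$failed"
}

# Example, using the commands listed above:
# verify_all "pnpm test" "pnpm build" "pnpm typecheck"
```

The command output is discarded here to keep the summary short; rerun any failing check directly to see its details.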
Step 5: Document
Issue Comment
gh issue comment [ISSUE_NUMBER] --body "## Error Recovery
**Error encountered:** [Description]
**Severity:** Major
**Evidence:**
\`\`\`
[Error output]
\`\`\`
**Recovery actions:**
1. [Action 1]
2. [Action 2]
**Verification:**
- [x] Tests pass
- [x] Build succeeds
**Root cause:** [If known]
**Prevention:** [If applicable]
"
Knowledge Graph
// Store for future reference
mcp__memory__add_observations({
observations: [{
entityName: "Issue #[NUMBER]",
contents: [
"Encountered [error type] on [date]",
"Caused by: [root cause]",
"Resolved by: [recovery action]"
]
}]
});
Common Recovery Patterns
"Tests were passing, now failing"
# What changed?
git diff HEAD~3
# Did dependencies change?
git diff HEAD~3 pnpm-lock.yaml
# Clean reinstall
rm -rf node_modules && pnpm install --frozen-lockfile
"Works locally, fails in CI"
# Check for environment differences
# - Node version
# - OS differences
# - Env vars
# Run with CI-like settings
CI=true pnpm test
"Build was working, now broken"
# Check TypeScript errors
pnpm typecheck
# Check for circular dependencies
pnpm dlx madge --circular src/
# Clean build
rm -rf dist && pnpm build
"I broke everything"
# Don't panic
# Find last known good state
git log --oneline
# Reset to that state
git reset --hard [GOOD_COMMIT]
# Verify
pnpm test
# Start again more carefully
Escalation
If recovery fails after 2-3 attempts:
## Escalation: Unrecoverable Error
**Issue:** #[NUMBER]
**Error:** [Description]
**Recovery attempts:**
1. [Attempt 1] - [Result]
2. [Attempt 2] - [Result]
**Current state:** [Broken/Partially working]
**Evidence preserved:** [Links to logs, screenshots]
**Requesting help with:** [Specific question]
Mark issue as Blocked and await human input.
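The rule of trying a few recoveries and then escalating can be sketched as a loop that verifies after each attempt and prints the escalation header when all of them fail. Everything here is an assumption made for illustration: the `recover_or_escalate` name, the string-based commands, and the trimmed report format.

```shell
#!/usr/bin/env bash
# Hypothetical helper: try recovery commands in order, verifying after each;
# print an escalation report if none restores a passing state.
# Usage: recover_or_escalate "<verify command>" "<recovery 1>" "<recovery 2>" ...
recover_or_escalate() {
  local verify="$1"; shift
  local n=0 report="" recovery

  for recovery in "$@"; do
    n=$((n + 1))
    if sh -c "$recovery" > /dev/null 2>&1 && sh -c "$verify" > /dev/null 2>&1; then
      echo "Recovered after attempt $n: $recovery"
      return 0
    fi
    # Record the failed attempt as one numbered line of the report.
    report="${report}${n}. ${recovery} - failed
"
  done

  printf '## Escalation: Unrecoverable Error\n**Recovery attempts:**\n%s' "$report"
  return 1
}

# Example, using commands from this document:
# recover_or_escalate "pnpm test" "pnpm install --frozen-lockfile" "rm -rf dist && pnpm build"
```

The remaining fields of the escalation template (current state, evidence, specific question) still need to be filled in by hand.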
Checklist
When an error occurs:
- Severity assessed
- Evidence preserved (logs, state, screenshots)
- Recovery action selected
- Recovery executed
- Clean state verified
- Tests pass
- Build succeeds
- Issue documented
Integration
This skill is called by:
- issue-driven-development - When errors occur
- ci-monitoring - CI failures
This skill may trigger:
- research-after-failure - If cause is unknown
- Issue update via issue-lifecycle