| name | qa-resilience |
| description | Production resilience patterns with circuit breakers, retry strategies, bulkheads, timeouts, graceful degradation, health checks, and chaos engineering for fault-tolerant distributed systems. |
Error Handling & Resilience — Production Patterns
This skill provides execution-ready patterns for building resilient, fault-tolerant systems that handle failures gracefully. Claude should apply these patterns when users need error handling strategies, circuit breakers, retry logic, or production hardening.
Modern Best Practices (2025): Circuit breaker pattern, exponential backoff, bulkhead isolation, timeout policies, graceful degradation, health check design, chaos engineering, and observability-driven resilience.
When to Use This Skill
Claude should invoke this skill when a user requests:
- Circuit breaker implementation
- Retry strategies and exponential backoff
- Bulkhead pattern for resource isolation
- Timeout policies for external dependencies
- Graceful degradation and fallback mechanisms
- Health check design (liveness vs readiness)
- Error handling best practices
- Chaos engineering setup
- Production hardening strategies
- Fault injection testing
Quick Reference
| Pattern | Library/Tool | When to Use | Configuration |
|---|---|---|---|
| Circuit Breaker | Opossum (Node.js), pybreaker (Python) | External API calls, database connections | Error threshold: 50%, reset timeout: 30s, volume threshold: 10 requests |
| Retry with Backoff | p-retry (Node.js), tenacity (Python) | Transient failures, rate limits | Max retries: 5, exponential backoff with jitter |
| Bulkhead Isolation | Semaphore pattern, thread pools | Prevent resource exhaustion | Pool size ≈ CPU cores × (1 + wait time / service time) |
| Timeout Policies | AbortSignal, statement timeout | Slow dependencies, database queries | Connection: 5s, API: 10-30s, DB query: 5-10s |
| Graceful Degradation | Feature flags, cached fallback | Non-critical features, ML recommendations | Cache recent data, default values, reduced functionality |
| Health Checks | Kubernetes probes, /health endpoints | Service orchestration, load balancing | Liveness: shallow check; readiness: dependency checks; startup: slow-starting apps |
| Chaos Engineering | Chaos Toolkit, Netflix Chaos Monkey | Proactive resilience testing | Start non-prod, define hypothesis, automate failure injection |
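As a concrete starting point, here is a minimal sketch of the circuit-breaker row above using Opossum in Node.js. The `callInventoryApi` function and its URL are hypothetical stand-ins for an external dependency, and the thresholds simply mirror the Quick Reference defaults rather than tuned production values.

```typescript
import CircuitBreaker from "opossum";

// Hypothetical downstream call that the breaker protects.
async function callInventoryApi(sku: string): Promise<unknown> {
  const res = await fetch(`https://inventory.internal/items/${sku}`); // assumed URL
  if (!res.ok) throw new Error(`Inventory API returned ${res.status}`);
  return res.json();
}

const breaker = new CircuitBreaker(callInventoryApi, {
  timeout: 3_000,               // fail any single call that takes longer than 3s
  errorThresholdPercentage: 50, // open once 50% of recent calls have failed...
  volumeThreshold: 10,          // ...but only after at least 10 calls were observed
  resetTimeout: 30_000,         // after 30s, allow a half-open trial request
});

// Serve a degraded response while the breaker is open instead of erroring out.
breaker.fallback(() => ({ available: null, degraded: true }));

export function getInventory(sku: string) {
  // fire() routes the call through the breaker rather than calling the API directly.
  return breaker.fire(sku);
}
```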
Decision Tree: Resilience Pattern Selection
Failure scenario: [System Dependency Type]
├─ External API/Service?
│ ├─ Transient errors? → Retry with exponential backoff + jitter
│ ├─ Cascading failures? → Circuit breaker + fallback
│ ├─ Rate limiting? → Retry with Retry-After header respect
│ └─ Slow response? → Timeout + circuit breaker
│
├─ Database Dependency?
│ ├─ Connection pool exhaustion? → Bulkhead isolation + timeout
│ ├─ Query timeout? → Statement timeout (5-10s)
│ ├─ Replica lag? → Read from primary fallback
│ └─ Connection failures? → Retry + circuit breaker
│
├─ Non-Critical Feature?
│ ├─ ML recommendations? → Feature flag + default values fallback
│ ├─ Search service? → Cached results or basic SQL fallback
│ ├─ Email/notifications? → Log error, don't block main flow
│ └─ Analytics? → Fire-and-forget, circuit breaker for protection
│
├─ Kubernetes/Orchestration?
│ ├─ Service discovery? → Liveness + readiness + startup probes
│ ├─ Slow startup? → Startup probe (failureThreshold: 30)
│ ├─ Load balancing? → Readiness probe (check dependencies)
│ └─ Auto-restart? → Liveness probe (simple check)
│
└─ Testing Resilience?
├─ Pre-production? → Chaos Toolkit experiments
├─ Production (low risk)? → Feature flags + canary deployments
├─ Scheduled testing? → Game days (quarterly)
└─ Continuous chaos? → Netflix Chaos Monkey (1% failure injection)
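For the transient-error and rate-limit branches above, a sketch using p-retry (Node.js). The endpoint is an assumption; 429 and 5xx responses are retried with jittered exponential backoff, while other 4xx responses abort immediately since retrying them cannot help.

```typescript
import pRetry, { AbortError } from "p-retry";

async function fetchRate(): Promise<unknown> {
  const res = await fetch("https://rates.example.internal/quote"); // assumed endpoint
  if (res.status === 429 || res.status >= 500) {
    throw new Error(`Transient failure: ${res.status}`); // retryable: back off and try again
  }
  if (!res.ok) {
    throw new AbortError(`Client error ${res.status}`); // non-retryable: stop immediately
  }
  return res.json();
}

export function getExchangeRate(): Promise<unknown> {
  return pRetry(fetchRate, {
    retries: 5,        // matches the Quick Reference default
    factor: 2,         // exponential backoff: roughly 0.5s, 1s, 2s, 4s, 8s
    minTimeout: 500,
    maxTimeout: 10_000,
    randomize: true,   // full jitter to avoid synchronized retry storms
    onFailedAttempt: (error) =>
      console.warn(`Attempt ${error.attemptNumber} failed, ${error.retriesLeft} retries left`),
  });
}
```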
Navigation: Core Resilience Patterns
Circuit Breaker Patterns - Prevent cascading failures
- Classic circuit breaker implementation (Node.js, Python)
- Adaptive circuit breakers with ML-based thresholds (2024-2025)
- Fallback strategies and event monitoring
Retry Patterns - Handle transient failures
- Exponential backoff with jitter
- Retry decision table (which errors to retry)
- Idempotency patterns and Retry-After headers
Bulkhead Isolation - Resource compartmentalization
- Semaphore pattern for thread/connection pools
- Database connection pooling strategies
- Queue-based bulkheads with load shedding
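The semaphore pattern above can be hand-rolled in a few lines; this sketch is library-agnostic, and the pool size of 10 plus the `queryReportingReplica` call are illustrative assumptions.

```typescript
// Minimal counting semaphore: at most `limit` tasks run concurrently,
// the rest queue up instead of piling onto a shared resource.
class Semaphore {
  private readonly queue: Array<() => void> = [];
  private permits: number;

  constructor(limit: number) {
    this.permits = limit;
  }

  private acquire(): Promise<void> {
    if (this.permits > 0) {
      this.permits -= 1;
      return Promise.resolve();
    }
    return new Promise((resolve) => this.queue.push(resolve));
  }

  private release(): void {
    const next = this.queue.shift();
    if (next) next();           // hand the permit straight to the next waiter
    else this.permits += 1;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await task();
    } finally {
      this.release();
    }
  }
}

// One bulkhead per dependency, so a slow reporting query cannot exhaust
// the capacity needed by the critical request path. Size 10 is an assumption.
const reportingBulkhead = new Semaphore(10);

export function queryReportingReplica(sql: string): Promise<unknown[]> {
  return reportingBulkhead.run(async () => {
    // ...hypothetical database call using `sql` goes here...
    return [];
  });
}
```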
Timeout Policies - Prevent resource exhaustion
- Connection, request, and idle timeouts
- Database query timeouts (PostgreSQL, MySQL)
- Nested timeout budgets for chained operations
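A sketch of per-call timeouts with the standard `AbortSignal.timeout()` API (Node 18+), including a rough nested budget so downstream calls cannot outlive the caller's overall deadline. The URLs and the 10s/3s split are assumptions.

```typescript
// Overall budget for the whole operation (assumed: 10s), with each
// downstream call capped at 3s or the time remaining, whichever is smaller.
const OUTER_BUDGET_MS = 10_000;
const PER_CALL_CAP_MS = 3_000;

export async function loadOrder(orderId: string): Promise<unknown> {
  const deadline = Date.now() + OUTER_BUDGET_MS;

  const callWithBudget = (url: string) => {
    const remaining = Math.max(0, deadline - Date.now());
    const timeoutMs = Math.min(PER_CALL_CAP_MS, remaining);
    // AbortSignal.timeout() aborts the fetch when the timeout elapses.
    return fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
  };

  // Hypothetical internal services; a chained call never outlives the outer deadline.
  const orderRes = await callWithBudget(`https://orders.internal/${orderId}`);
  const customerRes = await callWithBudget(`https://customers.internal/by-order/${orderId}`);
  return { order: await orderRes.json(), customer: await customerRes.json() };
}
```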
Graceful Degradation - Maintain partial functionality
- Cached fallback strategies
- Default values and feature toggles
- Partial responses with Promise.allSettled
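A sketch of the partial-response approach: the page still renders when the non-critical recommendations call fails. Both service calls and their URLs are hypothetical.

```typescript
// Hypothetical service calls: product details are critical, recommendations are not.
async function fetchProductDetails(id: string): Promise<{ name: string }> {
  const res = await fetch(`https://catalog.internal/products/${id}`);
  if (!res.ok) throw new Error(`catalog returned ${res.status}`);
  return res.json();
}

async function fetchRecommendations(id: string): Promise<string[]> {
  const res = await fetch(`https://recs.internal/for/${id}`);
  if (!res.ok) throw new Error(`recommendations returned ${res.status}`);
  return res.json();
}

export async function buildProductPage(productId: string) {
  const [details, recs] = await Promise.allSettled([
    fetchProductDetails(productId),
    fetchRecommendations(productId),
  ]);

  // The page cannot render without product details: surface that failure.
  if (details.status === "rejected") throw details.reason;

  return {
    product: details.value,
    // Degrade to an empty list (or cached results) instead of failing the whole page.
    recommendations: recs.status === "fulfilled" ? recs.value : [],
    degraded: recs.status === "rejected",
  };
}
```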
Health Check Patterns - Service availability monitoring
- Liveness, readiness, and startup probes
- Kubernetes probe configuration
- Shallow vs deep health checks
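A sketch of split liveness and readiness endpoints, assuming an Express app and a hypothetical `db.ping()` check: liveness stays shallow so a dependency outage does not trigger needless restarts, while readiness gates traffic on the dependencies the service actually needs.

```typescript
import express from "express";

const app = express();

// Hypothetical dependency handle with a cheap connectivity check.
const db = { ping: async (): Promise<void> => { /* e.g. SELECT 1 against the pool */ } };

// Liveness: shallow. Only answers "can this process serve at all?"
app.get("/livez", (_req, res) => res.status(200).send("ok"));

// Readiness: deep. Fails while dependencies are down so the load balancer
// stops routing traffic, and recovers automatically once they return.
app.get("/readyz", async (_req, res) => {
  try {
    await db.ping();
    res.status(200).json({ status: "ready" });
  } catch (err) {
    res.status(503).json({ status: "not-ready", reason: String(err) });
  }
});

app.listen(8080);
```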
Navigation: Operational Resources
Resilience Checklists - Production hardening checklists
- Dependency resilience
- Health and readiness probes
- Observability for resilience
- Failure testing
Chaos Engineering Guide - Safe reliability experiments
- Planning chaos experiments
- Common failure injection scenarios
- Execution steps and debrief checklist
Navigation: Templates
Resilience Runbook Template - Service hardening profile
- Dependencies and SLOs
- Fallback strategies
- Rollback procedures
Fault Injection Playbook - Chaos testing script
- Success signals
- Rollback criteria
- Post-experiment debrief
Quick Decision Matrix
| Scenario | Recommendation |
|---|---|
| External API calls | Circuit breaker + retry with exponential backoff |
| Database queries | Timeout + connection pooling + circuit breaker |
| Slow dependency | Bulkhead isolation + timeout |
| Non-critical feature | Feature flag + graceful degradation |
| Kubernetes deployment | Liveness + readiness + startup probes |
| Testing resilience | Chaos engineering experiments |
| Transient failures | Retry with exponential backoff + jitter |
| Cascading failures | Circuit breaker + bulkhead |
Anti-Patterns to Avoid
- No timeouts - Infinite waits exhaust resources
- Infinite retries - Amplifies problems (thundering herd)
- No circuit breakers - Cascading failures
- Tight coupling - One failure breaks everything
- Silent failures - No observability into degraded state
- No bulkheads - Shared thread pools exhaust all resources
- Testing only the happy path - failure modes are first discovered in production
Related Skills
- ../ops-devops-platform/SKILL.md — Incident response, SLOs, and platform runbooks
- ../software-backend/SKILL.md — API error handling, retries, and database reliability patterns
- ../software-architecture-design/SKILL.md — System decomposition and dependency design for reliability
- ../qa-testing-strategy/SKILL.md — Regression, load, and fault-injection testing strategies
- ../software-security-appsec/SKILL.md — Security failure modes and guardrails
- ../qa-observability/SKILL.md — Metrics, tracing, logging, and performance monitoring
- ../qa-debugging/SKILL.md — Production debugging and incident investigation
- ../data-sql-optimization/SKILL.md — Database resilience, connection pooling, and query timeouts
- ../dev-api-design/SKILL.md — API design patterns including error handling and retry semantics
Usage Notes
Pattern Selection:
- Start with circuit breakers for external dependencies
- Add retries for transient failures (network, rate limits)
- Use bulkheads to prevent resource exhaustion
- Combine patterns for defense-in-depth
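One way these layers can compose (a sketch, not a prescribed stack): a per-attempt timeout on the inside, bounded retries around the single call, and the circuit breaker on the outside so retry loops stop once the dependency is known to be down. Opossum and p-retry are the libraries named in the Quick Reference; the endpoint and numbers are assumptions.

```typescript
import CircuitBreaker from "opossum";
import pRetry from "p-retry";

// Innermost layer: one attempt with its own timeout (assumed endpoint and budget).
const attempt = async (): Promise<unknown> => {
  const res = await fetch("https://pricing.internal/quote", {
    signal: AbortSignal.timeout(2_000),
  });
  if (!res.ok) throw new Error(`pricing returned ${res.status}`);
  return res.json();
};

// Middle layer: bounded, jittered retries around the single attempt.
const attemptWithRetries = () =>
  pRetry(attempt, { retries: 3, minTimeout: 500, maxTimeout: 2_000, randomize: true });

// Outer layer: the breaker wraps the retried call, so when the dependency is
// hard down the whole retry loop is skipped and the fallback is served instead.
const pricingBreaker = new CircuitBreaker(attemptWithRetries, {
  errorThresholdPercentage: 50,
  resetTimeout: 30_000,
  timeout: 20_000, // generous enough to cover the full retry sequence
});
pricingBreaker.fallback(() => ({ price: null, degraded: true }));

export const getQuote = () => pricingBreaker.fire();
```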
Observability:
- Track circuit breaker state changes
- Monitor retry attempts and success rates
- Alert on degraded mode duration
- Measure recovery time after failures
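The first two bullets can be wired up directly from the breaker's event emitter; this sketch uses Opossum's event names with a hypothetical `metrics` client, and the metric names are assumptions.

```typescript
import CircuitBreaker from "opossum";

// Hypothetical metrics client (e.g. a thin StatsD or Prometheus wrapper).
const metrics = {
  increment: (_name: string, _tags?: Record<string, string>): void => {
    /* emit a counter here */
  },
};

export function instrumentBreaker(breaker: CircuitBreaker, dependency: string): void {
  // State transitions: alert when a breaker stays open for too long.
  breaker.on("open", () => metrics.increment("circuit.open", { dependency }));
  breaker.on("halfOpen", () => metrics.increment("circuit.half_open", { dependency }));
  breaker.on("close", () => metrics.increment("circuit.close", { dependency }));

  // Per-call outcomes: success rate, fallbacks served, and calls shed while open.
  breaker.on("success", () => metrics.increment("circuit.success", { dependency }));
  breaker.on("failure", () => metrics.increment("circuit.failure", { dependency }));
  breaker.on("fallback", () => metrics.increment("circuit.fallback", { dependency }));
  breaker.on("reject", () => metrics.increment("circuit.reject", { dependency }));
}
```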
Testing:
- Start chaos experiments in non-production
- Define hypothesis before failure injection
- Set blast radius limits and auto-revert
- Document learnings and action items
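As a lightweight complement to Chaos Toolkit, a hand-rolled sketch of bounded fault injection: failures are injected into a small, configurable fraction of calls and only while an explicit flag is on, so the blast radius is capped and reverting is a flag flip. All names here are illustrative.

```typescript
// Illustrative fault-injection wrapper: inject failures into a small,
// configurable fraction of calls, gated behind a kill switch.
interface ChaosConfig {
  enabled: boolean;      // the auto-revert lever: flip to false to stop the experiment
  failureRate: number;   // blast radius, e.g. 0.01 = 1% of calls
  extraLatencyMs: number;
}

const chaos: ChaosConfig = { enabled: false, failureRate: 0.01, extraLatencyMs: 500 };

export async function withChaos<T>(name: string, call: () => Promise<T>): Promise<T> {
  if (chaos.enabled && Math.random() < chaos.failureRate) {
    // Inject latency first, then fail, to exercise timeout and retry handling.
    await new Promise((resolve) => setTimeout(resolve, chaos.extraLatencyMs));
    throw new Error(`chaos: injected failure for ${name}`);
  }
  return call();
}

// Hypothesis to verify: the user-facing flow still succeeds (via its fallback)
// when 1% of recommendation calls fail. Example usage:
// const recs = await withChaos("recommendations", () => fetchRecommendations(id));
```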
Success Criteria: Systems gracefully handle failures, recover automatically, maintain partial functionality during outages, and fail fast to prevent cascading failures. Resilience is tested proactively through chaos engineering.