Claude Code Plugins

Community-maintained marketplace


Production resilience patterns with circuit breakers, retry strategies, bulkheads, timeouts, graceful degradation, health checks, and chaos engineering for fault-tolerant distributed systems.

Install Skill

1. Download skill
2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reading through its instructions before using it.

SKILL.md

name: qa-resilience
description: Production resilience patterns with circuit breakers, retry strategies, bulkheads, timeouts, graceful degradation, health checks, and chaos engineering for fault-tolerant distributed systems.

Error Handling & Resilience — Production Patterns

This skill provides execution-ready patterns for building resilient, fault-tolerant systems that handle failures gracefully. Claude should apply these patterns when users need error handling strategies, circuit breakers, retry logic, or production hardening.

Modern Best Practices (2025): Circuit breaker pattern, exponential backoff, bulkhead isolation, timeout policies, graceful degradation, health check design, chaos engineering, and observability-driven resilience.


When to Use This Skill

Claude should invoke this skill when a user requests:

  • Circuit breaker implementation
  • Retry strategies and exponential backoff
  • Bulkhead pattern for resource isolation
  • Timeout policies for external dependencies
  • Graceful degradation and fallback mechanisms
  • Health check design (liveness vs readiness)
  • Error handling best practices
  • Chaos engineering setup
  • Production hardening strategies
  • Fault injection testing

Quick Reference

| Pattern | Library/Tool | When to Use | Configuration |
|---|---|---|---|
| Circuit Breaker | Opossum (Node.js), pybreaker (Python) | External API calls, database connections | Threshold: 50%, timeout: 30s, volume: 10 |
| Retry with Backoff | p-retry (Node.js), tenacity (Python) | Transient failures, rate limits | Max retries: 5, exponential backoff with jitter |
| Bulkhead Isolation | Semaphore pattern, thread pools | Prevent resource exhaustion | Pool size based on workload (CPU cores + wait/service time) |
| Timeout Policies | AbortSignal, statement timeout | Slow dependencies, database queries | Connection: 5s, API: 10-30s, DB query: 5-10s |
| Graceful Degradation | Feature flags, cached fallback | Non-critical features, ML recommendations | Cache recent data, default values, reduced functionality |
| Health Checks | Kubernetes probes, /health endpoints | Service orchestration, load balancing | Liveness: simple, readiness: dependency checks, startup: slow apps |
| Chaos Engineering | Chaos Toolkit, Netflix Chaos Monkey | Proactive resilience testing | Start in non-prod, define hypothesis, automate failure injection |
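
For example, a minimal circuit-breaker sketch in Node.js using Opossum, mapping the configuration row above (50% error threshold, 30s reset, 10-call volume). The callInventoryApi function and its URL are hypothetical placeholders for this illustration.

```typescript
import CircuitBreaker from "opossum";

// Hypothetical dependency call; replace with your own client code.
async function callInventoryApi(sku: string): Promise<unknown> {
  const res = await fetch(`https://inventory.example.internal/items/${sku}`);
  if (!res.ok) throw new Error(`Inventory API responded ${res.status}`);
  return res.json();
}

const breaker = new CircuitBreaker(callInventoryApi, {
  timeout: 10_000,              // fail calls that take longer than 10s
  errorThresholdPercentage: 50, // open the circuit at a 50% failure rate
  resetTimeout: 30_000,         // attempt a half-open probe after 30s
  volumeThreshold: 10,          // require 10 calls before the threshold applies
});

// Fallback keeps callers functional while the dependency is down.
breaker.fallback(() => ({ available: false, source: "fallback" }));

// Observability: track state changes (see Usage Notes below).
breaker.on("open", () => console.warn("inventory breaker opened"));
breaker.on("halfOpen", () => console.info("inventory breaker half-open"));
breaker.on("close", () => console.info("inventory breaker closed"));

export const getInventory = (sku: string) => breaker.fire(sku);
```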

Decision Tree: Resilience Pattern Selection

Failure scenario: [System Dependency Type]
    ├─ External API/Service?
    │   ├─ Transient errors? → Retry with exponential backoff + jitter
    │   ├─ Cascading failures? → Circuit breaker + fallback
    │   ├─ Rate limiting? → Retry with Retry-After header respect
    │   └─ Slow response? → Timeout + circuit breaker
    │
    ├─ Database Dependency?
    │   ├─ Connection pool exhaustion? → Bulkhead isolation + timeout
    │   ├─ Query timeout? → Statement timeout (5-10s)
    │   ├─ Replica lag? → Read from primary fallback
    │   └─ Connection failures? → Retry + circuit breaker
    │
    ├─ Non-Critical Feature?
    │   ├─ ML recommendations? → Feature flag + default values fallback
    │   ├─ Search service? → Cached results or basic SQL fallback
    │   ├─ Email/notifications? → Log error, don't block main flow
    │   └─ Analytics? → Fire-and-forget, circuit breaker for protection
    │
    ├─ Kubernetes/Orchestration?
    │   ├─ Service discovery? → Liveness + readiness + startup probes
    │   ├─ Slow startup? → Startup probe (failureThreshold: 30)
    │   ├─ Load balancing? → Readiness probe (check dependencies)
    │   └─ Auto-restart? → Liveness probe (simple check)
    │
    └─ Testing Resilience?
        ├─ Pre-production? → Chaos Toolkit experiments
        ├─ Production (low risk)? → Feature flags + canary deployments
        ├─ Scheduled testing? → Game days (quarterly)
        └─ Continuous chaos? → Netflix Chaos Monkey (1% failure injection)
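
As a sketch of the "retry with exponential backoff + jitter" branch above, here is a Node.js example using p-retry; fetchQuote, its URL, and the backoff budget are illustrative assumptions rather than prescribed values.

```typescript
import pRetry, { AbortError } from "p-retry";

// Hypothetical upstream call used only to demonstrate retry classification.
async function fetchQuote(symbol: string): Promise<unknown> {
  const res = await fetch(`https://quotes.example.internal/${symbol}`);
  if (res.status === 404) {
    // Permanent failure: abort instead of retrying.
    throw new AbortError(`No quote for ${symbol}`);
  }
  if (!res.ok) throw new Error(`Upstream returned ${res.status}`);
  return res.json();
}

export const getQuote = (symbol: string) =>
  pRetry(() => fetchQuote(symbol), {
    retries: 5,        // give up after 5 retries (quick-reference value)
    factor: 2,         // exponential backoff: 1s, 2s, 4s, ...
    minTimeout: 1_000,
    maxTimeout: 30_000,
    randomize: true,   // add jitter to avoid thundering-herd retries
    onFailedAttempt: (err) =>
      console.warn(`attempt ${err.attemptNumber} failed, ${err.retriesLeft} left`),
  });
```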

Navigation: Core Resilience Patterns

  • Circuit Breaker Patterns - Prevent cascading failures

    • Classic circuit breaker implementation (Node.js, Python)
    • Adaptive circuit breakers with ML-based thresholds (2024-2025)
    • Fallback strategies and event monitoring
  • Retry Patterns - Handle transient failures

    • Exponential backoff with jitter
    • Retry decision table (which errors to retry)
    • Idempotency patterns and Retry-After headers
  • Bulkhead Isolation - Resource compartmentalization

    • Semaphore pattern for thread/connection pools
    • Database connection pooling strategies
    • Queue-based bulkheads with load shedding
  • Timeout Policies - Prevent resource exhaustion

    • Connection, request, and idle timeouts
    • Database query timeouts (PostgreSQL, MySQL)
    • Nested timeout budgets for chained operations
  • Graceful Degradation - Maintain partial functionality

    • Cached fallback strategies
    • Default values and feature toggles
    • Partial responses with Promise.allSettled (see the sketch after this list)
  • Health Check Patterns - Service availability monitoring

    • Liveness, readiness, and startup probes
    • Kubernetes probe configuration
    • Shallow vs deep health checks
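
A minimal sketch of the partial-response approach referenced in the Graceful Degradation entry above: the critical call fails the request, while the non-critical recommendations call degrades to an empty list. Both endpoints and the timeout budgets are hypothetical; AbortSignal.timeout requires Node.js 17.3+ or a modern browser.

```typescript
type PagePayload = {
  product: unknown;
  recommendations: unknown[]; // non-critical: degrade to an empty list
};

export async function loadProductPage(id: string): Promise<PagePayload> {
  const [product, recommendations] = await Promise.allSettled([
    fetch(`https://catalog.example.internal/products/${id}`, {
      signal: AbortSignal.timeout(5_000), // critical path: 5s budget
    }).then((r) => r.json()),
    fetch(`https://ml.example.internal/recommendations/${id}`, {
      signal: AbortSignal.timeout(2_000), // non-critical: shorter budget
    }).then((r) => r.json()),
  ]);

  if (product.status === "rejected") {
    // Critical dependency failed: surface the error instead of degrading.
    throw product.reason;
  }
  return {
    product: product.value,
    // Degrade gracefully: serve the page without recommendations.
    recommendations:
      recommendations.status === "fulfilled" ? recommendations.value : [],
  };
}
```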

Navigation: Operational Resources

  • Resilience Checklists - Production hardening checklists

    • Dependency resilience
    • Health and readiness probes
    • Observability for resilience
    • Failure testing
  • Chaos Engineering Guide - Safe reliability experiments

    • Planning chaos experiments
    • Common failure injection scenarios
    • Execution steps and debrief checklist

Navigation: Templates


Quick Decision Matrix

| Scenario | Recommendation |
|---|---|
| External API calls | Circuit breaker + retry with exponential backoff |
| Database queries | Timeout + connection pooling + circuit breaker |
| Slow dependency | Bulkhead isolation + timeout |
| Non-critical feature | Feature flag + graceful degradation |
| Kubernetes deployment | Liveness + readiness + startup probes |
| Testing resilience | Chaos engineering experiments |
| Transient failures | Retry with exponential backoff + jitter |
| Cascading failures | Circuit breaker + bulkhead |
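
To illustrate the Kubernetes row, a sketch of liveness vs readiness endpoints using Express: liveness stays trivially simple, readiness aggregates dependency checks. The checkDatabase and checkCache probes, the route paths, and the port are assumptions for this example.

```typescript
import express from "express";

const app = express();

// Hypothetical dependency probes; implement each with a short timeout,
// e.g. `SELECT 1` for the database or PING for the cache.
async function checkDatabase(): Promise<boolean> { return true; }
async function checkCache(): Promise<boolean> { return true; }

// Liveness: answers only "is the process alive?" so the kubelet can restart it.
app.get("/livez", (_req, res) => res.status(200).send("ok"));

// Readiness: includes dependency checks so the load balancer withholds
// traffic while a dependency is unavailable.
app.get("/readyz", async (_req, res) => {
  const [db, cache] = await Promise.all([checkDatabase(), checkCache()]);
  const ready = db && cache;
  res.status(ready ? 200 : 503).json({ db, cache });
});

app.listen(8080);
```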

Anti-Patterns to Avoid

  • No timeouts - Infinite waits exhaust resources
  • Infinite retries - Amplifies problems (thundering herd)
  • No circuit breakers - Cascading failures
  • Tight coupling - One failure breaks everything
  • Silent failures - No observability into degraded state
  • No bulkheads - Shared thread pools exhaust all resources (see the bulkhead sketch below)
  • Testing only happy path - Production reveals failures
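
Following up on the "no bulkheads" item, here is a sketch of a semaphore-style bulkhead that caps concurrent calls to a single dependency. The Bulkhead class, the limit of 10, and the reporting URL are illustrative; a production version would also cap queue length (load shedding) and bound wait time.

```typescript
// Semaphore-style bulkhead: at most `limit` tasks run concurrently; the rest
// queue for a slot instead of exhausting shared resources.
class Bulkhead {
  private active = 0;
  private readonly waiters: Array<() => void> = [];

  constructor(private readonly limit: number) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active < this.limit) {
      this.active++;
    } else {
      // Wait until a finishing task hands its slot to this waiter.
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    }
    try {
      return await task();
    } finally {
      const next = this.waiters.shift();
      if (next) next();    // pass the slot on without changing the count
      else this.active--;  // no one waiting: release the slot
    }
  }
}

// Usage: all calls to one dependency share a 10-slot compartment.
const reportingBulkhead = new Bulkhead(10);
export const runReport = (id: string) =>
  reportingBulkhead.run(() => fetch(`https://reports.example.internal/${id}`));
```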

Related Skills


Usage Notes

Pattern Selection:

  • Start with circuit breakers for external dependencies
  • Add retries for transient failures (network, rate limits)
  • Use bulkheads to prevent resource exhaustion
  • Combine patterns for defense-in-depth (see the composition sketch below)
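
A sketch of that composition, assuming the same Opossum and p-retry libraries used earlier: a per-attempt timeout inside a bounded retry, wrapped by a circuit breaker so sustained failures stop hammering a struggling dependency. The endpoint, budgets, and thresholds are illustrative.

```typescript
import CircuitBreaker from "opossum";
import pRetry from "p-retry";

// Innermost layer: bound each attempt with a timeout.
const callOnce = (sku: string) =>
  fetch(`https://inventory.example.internal/items/${sku}`, {
    signal: AbortSignal.timeout(2_000),
  }).then((res) => {
    if (!res.ok) throw new Error(`status ${res.status}`);
    return res.json();
  });

// Middle layer: retry transient failures with jittered backoff.
const callWithRetry = (sku: string) =>
  pRetry(() => callOnce(sku), { retries: 3, randomize: true });

// Outermost layer: the breaker wraps the whole retried operation, so
// sustained failures open the circuit and skip the retries entirely.
const breaker = new CircuitBreaker(callWithRetry, {
  errorThresholdPercentage: 50,
  resetTimeout: 30_000,
});
breaker.fallback(() => ({ available: false, source: "fallback" }));

export const getItem = (sku: string) => breaker.fire(sku);
```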

Observability:

  • Track circuit breaker state changes
  • Monitor retry attempts and success rates
  • Alert on degraded mode duration
  • Measure recovery time after failures

Testing:

  • Start chaos experiments in non-production (a fault-injection sketch follows this list)
  • Define hypothesis before failure injection
  • Set blast radius limits and auto-revert
  • Document learnings and action items
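
As a sketch of lightweight, application-level fault injection along these lines (an assumption for illustration, not part of any named chaos tool): a wrapper that randomly injects latency or errors into dependency calls outside production, controlled by a hypothetical CHAOS_FAILURE_RATE environment variable.

```typescript
// Inject latency or failures into a small fraction of calls so timeouts,
// retries, and fallbacks get exercised before a real incident does.
const failureRate = Number(process.env.CHAOS_FAILURE_RATE ?? "0"); // e.g. 0.01 for 1%

export async function withChaos<T>(name: string, call: () => Promise<T>): Promise<T> {
  if (process.env.NODE_ENV !== "production" && Math.random() < failureRate) {
    if (Math.random() < 0.5) {
      await new Promise((resolve) => setTimeout(resolve, 5_000)); // injected latency
    } else {
      throw new Error(`chaos: injected failure in ${name}`); // injected error
    }
  }
  return call();
}

// Usage: wrap a dependency call inside an experiment's blast radius.
// await withChaos("inventory", () => getInventory("sku-123"));
```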

Success Criteria: Systems gracefully handle failures, recover automatically, maintain partial functionality during outages, and fail fast to prevent cascading failures. Resilience is tested proactively through chaos engineering.