| name | error-recovery-patterns |
| description | Error recovery and failure handling patterns for distributed systems including rollback strategies, circuit breakers, retry logic with exponential backoff, graceful degradation, and compensating transactions. Activates when task execution fails, timeouts occur, external services fail, database transactions fail, or cascade failure risks are detected. Use when recovering from failures, preventing data corruption, implementing resilience patterns, or handling partial system failures. |
Error Recovery Patterns
Overview
When systems fail, how you recover determines the difference between a minor hiccup and a catastrophic outage. This skill provides systematic approaches to error recovery, failure handling, and resilience patterns.
When to use this skill:
- Task execution fails with errors
- Timeouts or external service failures detected
- Database transaction failures occur
- Risk of cascading failures across systems
- Need to maintain data consistency during failures
- Implementing resilience for distributed systems
Quick Diagnosis Framework
1. Classify the Failure
Transient vs Permanent:
- Transient: Temporary issue, likely to succeed on retry (network blip, timeout, rate limit)
- Permanent: Fundamental issue, won't succeed without intervention (invalid input, missing resource, configuration error)
Recoverable vs Fatal:
- Recoverable: Can restore to consistent state (retry, rollback, compensating transaction)
- Fatal: Cannot recover automatically (data corruption, invariant violation, critical dependency failure)
Scope:
- Local: Single operation failure
- Distributed: Multiple service failure
- Cascading: Failure spreading across system boundaries
2. Select Recovery Strategy
| Failure Type | Recovery Strategy | Pattern to Use |
|---|---|---|
| Transient + Recoverable | Retry with backoff | Exponential backoff, jitter |
| Transient + Service failure | Circuit breaker | Open/half-open/closed states |
| Permanent + Recoverable | Rollback or compensate | Transaction rollback, saga pattern |
| Permanent + Fatal | Fail gracefully | Graceful degradation, fallback |
| Distributed + Transient | Retry with timeout | Distributed timeout, bulkhead |
| Distributed + Permanent | Compensating transaction | Saga pattern, event sourcing |
| Cascading risk | Prevent cascade | Circuit breaker, rate limiting |
Core Recovery Patterns
Pattern 1: Retry with Exponential Backoff
When to use:
- Transient failures (network timeouts, temporary service unavailability)
- Rate limiting (429 responses)
- Database connection pool exhaustion
Implementation approach:
# Exponential backoff with jitter
import logging
import random
import time

logger = logging.getLogger(__name__)

# TransientError is an application-defined exception class for retryable failures.
def retry_with_backoff(func, max_retries=5, base_delay=1, max_delay=60):
"""
Retry with exponential backoff and jitter.
Args:
func: Function to retry
max_retries: Maximum retry attempts (default 5)
base_delay: Initial delay in seconds (default 1)
max_delay: Maximum delay in seconds (default 60)
"""
for attempt in range(max_retries):
try:
return func()
except TransientError as e:
if attempt == max_retries - 1:
raise # Final attempt, re-raise
# Exponential backoff: 1s, 2s, 4s, 8s, 16s...
delay = min(base_delay * (2 ** attempt), max_delay)
# Add jitter (±25%) to prevent thundering herd
jitter = delay * 0.25 * (2 * random.random() - 1)
actual_delay = delay + jitter
logger.warning(f"Transient error on attempt {attempt + 1}, "
f"retrying in {actual_delay:.2f}s: {e}")
time.sleep(actual_delay)
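A usage sketch (assuming a hypothetical `fetch_profile` call that raises `TransientError` on network problems):

```python
profile = retry_with_backoff(lambda: fetch_profile(user_id), max_retries=3)
```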
Key principles:
- Exponential backoff: Double delay each retry (1s → 2s → 4s → 8s)
- Max delay cap: Prevent excessive wait times (typically 60s)
- Jitter: Add randomness (±25%) to prevent thundering herd problem
- Max retries: Limit attempts (typically 3-5) to prevent infinite loops
- Idempotency: Ensure retries are safe (use idempotency keys)
When NOT to use:
- Permanent failures (invalid input, missing resources)
- User-interactive operations (don't make users wait)
- Write operations without idempotency guarantees
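If a write must be retried anyway, pair the retry with an idempotency key so the server can deduplicate. A minimal sketch, assuming a hypothetical payments endpoint that honors an `Idempotency-Key` header and the `retry_with_backoff` helper above:

```python
import uuid
import requests

idempotency_key = str(uuid.uuid4())  # generated once, reused on every retry

def charge():
    try:
        response = requests.post(
            "https://payments.example.com/charges",        # hypothetical endpoint
            json={"amount": 1999, "currency": "usd"},
            headers={"Idempotency-Key": idempotency_key},  # server deduplicates on this key
            timeout=(3.0, 10.0),
        )
        response.raise_for_status()
        return response.json()
    except (requests.ConnectionError, requests.Timeout) as e:
        raise TransientError(str(e)) from e  # map network errors to the retryable class

result = retry_with_backoff(charge, max_retries=3)
```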
Pattern 2: Circuit Breaker
When to use:
- Repeated failures to external service
- Protecting downstream services from overload
- Preventing cascading failures
- Fast-failing when service is known to be down
Implementation approach:
import logging
import time

logger = logging.getLogger(__name__)

class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
"""
Circuit breaker pattern with three states:
- CLOSED: Normal operation, requests pass through
- OPEN: Failures exceeded threshold, requests fail fast
- HALF_OPEN: Testing if service recovered
"""
def __init__(self, failure_threshold=5, timeout=60, success_threshold=2):
self.failure_threshold = failure_threshold # Failures before opening
self.timeout = timeout # Seconds before trying again (half-open)
self.success_threshold = success_threshold # Successes to close
self.failure_count = 0
self.success_count = 0
self.state = "CLOSED"
self.opened_at = None
def call(self, func):
"""Execute function through circuit breaker."""
if self.state == "OPEN":
# Check if timeout elapsed
if time.time() - self.opened_at >= self.timeout:
self.state = "HALF_OPEN"
self.success_count = 0
logger.info("Circuit breaker entering HALF_OPEN state")
else:
raise CircuitBreakerOpenError("Circuit breaker is OPEN")
try:
result = func()
self._on_success()
return result
except Exception as e:
self._on_failure()
raise
def _on_success(self):
"""Handle successful call."""
self.failure_count = 0
if self.state == "HALF_OPEN":
self.success_count += 1
if self.success_count >= self.success_threshold:
self.state = "CLOSED"
logger.info("Circuit breaker CLOSED (service recovered)")
def _on_failure(self):
"""Handle failed call."""
self.failure_count += 1
if self.state == "HALF_OPEN":
# Failure in half-open state, reopen circuit
self.state = "OPEN"
self.opened_at = time.time()
logger.warning("Circuit breaker OPEN (half-open test failed)")
elif self.failure_count >= self.failure_threshold:
# Threshold exceeded, open circuit
self.state = "OPEN"
self.opened_at = time.time()
logger.error(f"Circuit breaker OPEN ({self.failure_count} failures)")
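A usage sketch, assuming a hypothetical `inventory_client` and a `cached_stock` fallback:

```python
breaker = CircuitBreaker(failure_threshold=5, timeout=60)

def get_stock(sku):
    try:
        return breaker.call(lambda: inventory_client.get_stock(sku))
    except CircuitBreakerOpenError:
        return cached_stock.get(sku)  # degrade to cached data while the circuit is open
```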
Key principles:
- CLOSED state: Normal operation, track failures
- OPEN state: Fast-fail without calling service (save resources)
- HALF_OPEN state: Test if service recovered (limited requests)
- Failure threshold: Open after N consecutive failures (typically 5-10)
- Timeout: Try again after N seconds (typically 30-60s)
- Success threshold: Close after N successes in half-open (typically 2-3)
Monitoring:
- Track state transitions (CLOSED → OPEN is critical alert)
- Monitor failure rate before opening
- Alert on circuit breaker open events
- Dashboard: show circuit breaker states across services
Pattern 3: Transaction Rollback
When to use:
- Database transaction failures
- Multi-step operations where consistency required
- ACID transactions (single database)
- Failures during data modification
Implementation approach:
# `db`, `now()`, `TransactionResult`, and the error classes are placeholders
# for your data-access layer; the pattern is the transaction context manager.
def transfer_funds(from_account, to_account, amount):
"""
Transfer funds with automatic rollback on failure.
Demonstrates transaction rollback pattern.
"""
with db.transaction(): # Automatically rolls back on exception
try:
# Step 1: Debit source account
from_balance = db.get_balance(from_account)
if from_balance < amount:
raise InsufficientFundsError(f"Balance {from_balance} < {amount}")
db.debit(from_account, amount)
logger.info(f"Debited {amount} from {from_account}")
# Step 2: Credit destination account
db.credit(to_account, amount)
logger.info(f"Credited {amount} to {to_account}")
# Step 3: Record transaction
db.record_transaction(from_account, to_account, amount, timestamp=now())
logger.info(f"Recorded transaction {from_account} → {to_account}")
# All steps succeeded, commit
return TransactionResult(success=True, transaction_id=...)
except Exception as e:
# Any failure triggers automatic rollback
logger.error(f"Transaction failed, rolling back: {e}")
# with statement handles rollback automatically
raise TransactionFailedError(f"Transfer failed: {e}") from e
Key principles:
- All-or-nothing: Either all steps succeed or none do
- Automatic rollback: Use database transaction context managers (see the sketch after this list)
- Idempotency: Safe to retry entire transaction
- Logging: Log each step for debugging rolled-back transactions
- Isolation: Prevent dirty reads during transaction
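A minimal sketch of what such a context manager can look like, using the standard-library sqlite3 module (the `db.transaction()` above is assumed to behave this way):

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def transaction(conn: sqlite3.Connection):
    """Commit if the block succeeds; roll back every statement if it raises."""
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
```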
When NOT to use:
- Distributed systems (multiple databases) - use saga pattern instead
- Long-running operations (locks held too long)
- Operations with external side effects (APIs, file writes)
Pattern 4: Compensating Transaction (Saga Pattern)
When to use:
- Distributed systems (multiple services/databases)
- Long-running business processes
- Cannot use traditional transactions (no ACID across services)
- Need eventual consistency
Implementation approach:
class BookingWorkflow:
"""
Hotel booking saga with compensating transactions.
Steps: Reserve hotel → Charge payment → Send confirmation
Compensations: Cancel reservation ← Refund payment ← Cancel email
"""
def execute(self, booking_request):
"""Execute saga with automatic compensation on failure."""
executed_steps = []
try:
# Step 1: Reserve hotel room
reservation = self.hotel_service.reserve_room(
booking_request.hotel_id,
booking_request.dates
)
executed_steps.append(("reserve_room", reservation.id))
logger.info(f"Reserved room {reservation.id}")
# Step 2: Charge payment
payment = self.payment_service.charge(
booking_request.card,
booking_request.amount
)
executed_steps.append(("charge_payment", payment.id))
logger.info(f"Charged payment {payment.id}")
# Step 3: Send confirmation email
email = self.email_service.send_confirmation(
booking_request.email,
reservation,
payment
)
executed_steps.append(("send_email", email.id))
logger.info(f"Sent confirmation {email.id}")
# All steps succeeded
return BookingResult(success=True, reservation_id=reservation.id)
except Exception as e:
# Failure occurred, compensate executed steps in reverse order
            logger.error(f"Booking failed at step {len(executed_steps) + 1}, "
                         f"compensating: {e}")
self._compensate(executed_steps)
raise BookingFailedError(f"Booking failed: {e}") from e
def _compensate(self, executed_steps):
"""Execute compensating transactions in reverse order."""
for step_name, step_id in reversed(executed_steps):
try:
if step_name == "reserve_room":
self.hotel_service.cancel_reservation(step_id)
logger.info(f"Compensated: cancelled reservation {step_id}")
elif step_name == "charge_payment":
self.payment_service.refund(step_id)
logger.info(f"Compensated: refunded payment {step_id}")
elif step_name == "send_email":
self.email_service.send_cancellation(step_id)
logger.info(f"Compensated: sent cancellation {step_id}")
except Exception as comp_error:
# Compensation failed - critical error requiring manual intervention
logger.critical(f"COMPENSATION FAILED for {step_name} {step_id}: "
f"{comp_error}")
# Alert operations team
self.alert_service.critical_alert(
f"Manual intervention required: compensation failed",
details={"step": step_name, "id": step_id, "error": comp_error}
)
Key principles:
- Forward steps: Execute business logic steps sequentially
- Track execution: Record each completed step for compensation
- Reverse compensation: Undo steps in reverse order on failure
- Semantic compensation: Cancel reservation, refund payment (not database rollback)
- Idempotent compensations: Safe to retry compensation
- Handle compensation failures: Critical alerts for manual intervention
Saga orchestration patterns:
- Orchestration: Central coordinator (shown above)
- Choreography: Services coordinate via events (more decoupled)
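For the choreography variant, each service subscribes to events and runs its own compensation. A hedged sketch with a hypothetical `event_bus` subscription API and `payment_service` client:

```python
@event_bus.subscribe("BookingFailed")            # hypothetical subscription decorator
def on_booking_failed(event):
    payment_id = event.get("payment_id")
    if not payment_id:
        return                                   # nothing was charged, nothing to undo
    if payment_service.is_refunded(payment_id):  # idempotent: safe on duplicate events
        return
    payment_service.refund(payment_id)
```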
Pattern 5: Graceful Degradation
When to use:
- Non-critical service failures
- Maintaining core functionality during partial outages
- User experience over complete accuracy
- Fallback to cached/simplified responses
Implementation approach:
def get_user_recommendations(user_id):
"""
Get personalized recommendations with graceful degradation.
Fallback chain: ML model → collaborative filtering → popular items
"""
try:
# Primary: ML-based personalized recommendations
recommendations = ml_service.get_recommendations(user_id, timeout=2.0)
logger.info(f"Returned ML recommendations for user {user_id}")
return recommendations
except (TimeoutError, ServiceUnavailableError) as e:
logger.warning(f"ML service unavailable, falling back to collaborative: {e}")
try:
# Fallback 1: Collaborative filtering (faster, simpler)
recommendations = collab_filter_service.get_similar_users_items(
user_id,
timeout=1.0
)
logger.info(f"Returned collaborative filtering for user {user_id}")
return recommendations
except (TimeoutError, ServiceUnavailableError) as e2:
logger.warning(f"Collaborative filtering failed, falling back to popular: {e2}")
try:
# Fallback 2: Popular items (cached, always available)
                recommendations = cache.get_popular_items()  # global popular list; no per-user data needed
logger.info(f"Returned popular items for user {user_id}")
return recommendations
except Exception as e3:
# Final fallback: Empty recommendations (degraded but functional)
logger.error(f"All recommendation sources failed, returning empty: {e3}")
return [] # Graceful degradation: site still works, just no recommendations
Key principles:
- Fallback chain: Primary → Fallback 1 → Fallback 2 → Minimal functionality
- Decreasing quality: Each fallback is simpler/less accurate but faster/more reliable
- Always functional: Never completely fail (return empty/cached data)
- Timeouts: Use aggressive timeouts (1-2s) to fail fast and try fallback
- Monitoring: Track fallback rate (high rate = investigate primary service)
Fallback strategies:
- Cached data: Stale but available
- Simplified logic: Faster, less accurate algorithm
- Default values: Reasonable defaults (popular items, average values)
- Degraded UI: Hide non-critical features
Data Consistency Strategies
Strategy 1: Immediate Consistency (Single Database)
Pattern: Use database transactions for ACID guarantees
When to use:
- Single database operations
- Strong consistency required
- Operations must be atomic
Implementation: Transaction rollback pattern (see Pattern 3 above)
Strategy 2: Eventual Consistency (Distributed Systems)
Pattern: Accept temporary inconsistency, guarantee eventual convergence
When to use:
- Distributed systems (multiple databases/services)
- High availability more important than immediate consistency
- Can tolerate temporary stale data
Implementation:
# Step 1: Write to primary database (source of truth)
db.create_order(order_id, user_id, items, total)
# Step 2: Publish event for eventual consistency
event_bus.publish(OrderCreatedEvent(
order_id=order_id,
user_id=user_id,
items=items,
total=total,
timestamp=now()
))
# Other services consume event asynchronously:
# - Inventory service decrements stock
# - Analytics service records purchase
# - Email service sends confirmation
# Each service eventually becomes consistent
Key principles:
- Source of truth: One service owns the data
- Event-driven: Publish events for others to react
- Idempotent consumers: Handle duplicate events safely (see the sketch after this list)
- Monitoring: Track lag between write and propagation
- Conflict resolution: Define merge strategy for concurrent updates
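A minimal sketch of an idempotent consumer for the OrderCreatedEvent above, assuming a hypothetical `processed_events` store and `inventory` client:

```python
def handle_order_created(event):
    # Duplicate deliveries are expected; skip events that were already applied.
    if processed_events.contains(event.event_id):
        return
    # Ideally the decrement and the marker write happen in one transaction.
    inventory.decrement(event.items)
    processed_events.add(event.event_id)
```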
Strategy 3: Causal Consistency
Pattern: Preserve cause-effect relationships while allowing other operations to be reordered
When to use:
- Social media feeds (comment after post)
- Collaborative editing (see your own changes)
- Chat applications (messages in order per conversation)
Implementation: Use vector clocks or logical timestamps to track causality
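A minimal logical-clock sketch (Lamport timestamps; vector clocks extend the same idea to one counter per node):

```python
class LamportClock:
    """If event A causally precedes event B, A's timestamp is smaller than B's."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Call before a local event or before sending a message."""
        self.time += 1
        return self.time

    def receive(self, sender_time):
        """Call when a message arrives; merge the sender's clock with ours."""
        self.time = max(self.time, sender_time) + 1
        return self.time
```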
Failure Prevention Patterns
Bulkhead Pattern
Purpose: Isolate failures to prevent cascade
Implementation:
- Connection pools: Separate pools per service (failure in service A doesn't exhaust pool for service B)
- Thread pools: Separate threads per operation type
- Resource limits: CPU/memory limits per service (cgroups, Docker limits)
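A hedged sketch of the thread-pool form of the bulkhead, with hypothetical `charge_card` and `search_products` calls:

```python
from concurrent.futures import ThreadPoolExecutor

# One pool per dependency: if the payment API hangs and saturates its pool,
# search requests still have threads available.
payment_pool = ThreadPoolExecutor(max_workers=10, thread_name_prefix="payments")
search_pool = ThreadPoolExecutor(max_workers=20, thread_name_prefix="search")

payment_future = payment_pool.submit(charge_card, order)    # hypothetical call
search_future = search_pool.submit(search_products, query)  # hypothetical call
results = search_future.result(timeout=2.0)                 # bounded wait
```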
Timeout Pattern
Purpose: Prevent indefinite waiting
Implementation:
import asyncio
import requests

# Set aggressive timeouts on synchronous HTTP calls
response = requests.get(
    external_api_url,
    timeout=(3.0, 10.0)  # (connection timeout, read timeout)
)

# Async with a deadline (asyncio.timeout requires Python 3.11+)
async def call_with_timeout():
    async with asyncio.timeout(5.0):
        result = await long_running_operation()
        return result
Guidelines:
- Connection timeout: 1-3s (how long to establish connection)
- Read timeout: 5-10s (how long to wait for response)
- User-facing: 2-5s max (prevent frustration)
- Background jobs: 30-60s (allow more time)
Rate Limiting
Purpose: Prevent overload from excessive requests
Implementation:
- Client-side: Throttle requests to prevent hitting limits
- Server-side: Reject requests above threshold
- Adaptive: Reduce rate during degraded performance
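A minimal client-side token-bucket sketch, one common way to implement the throttling described above:

```python
import time

class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should wait or shed the request
```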
Monitoring & Alerting for Recovery
Key Metrics
- Error rate: Percentage of failed requests (alert if >1%)
- Retry rate: How often retries happen (high rate = transient failures)
- Circuit breaker state: Alert when circuits open
- Recovery time: How long to recover from failure (MTTR)
- Compensation rate: How often sagas compensate (should be low)
Alerting Strategy
Symptom-based alerts (what users experience):
- High error rate (>1% for 5 minutes)
- High latency (p99 > 1s)
- Circuit breaker open
Cause-based logging (for debugging):
- Which specific operation failed
- Transient vs permanent failure
- Retry attempts and outcomes
- Compensation steps executed
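A hedged sketch of attaching those cause-based fields to log records with the standard logging `extra` parameter (field names are illustrative; a JSON formatter or log shipper is assumed to surface them):

```python
import logging

logger = logging.getLogger("recovery")

def log_failure(operation, failure_class, attempt, compensated_steps):
    # `extra` attaches structured fields to the log record for later querying.
    logger.warning(
        "operation failed",
        extra={
            "operation": operation,            # which specific operation failed
            "failure_class": failure_class,    # "transient" or "permanent"
            "attempt": attempt,                # retry attempt number
            "compensated": compensated_steps,  # compensation steps executed
        },
    )
```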
Testing Recovery Patterns
Chaos Engineering
Inject failures in test/staging:
- Kill random services
- Introduce network latency
- Cause database timeouts
- Fill disk space
- Exhaust connection pools
Verify:
- System recovers automatically
- No data corruption
- Appropriate fallbacks triggered
- Monitoring alerts fire correctly
Recovery Testing Checklist
- Retry logic works (verified with network failures)
- Circuit breaker opens/closes correctly
- Transaction rollback prevents partial writes
- Saga compensation executes in correct order
- Graceful degradation provides minimal functionality
- Timeouts prevent indefinite hangs
- Monitoring captures recovery events
- Alerts fire for critical failures
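A minimal pytest-style sketch for the first checklist item, exercising `retry_with_backoff` from Pattern 1 against a function that fails twice before succeeding:

```python
def test_retry_recovers_after_transient_failures():
    calls = {"count": 0}

    def flaky():
        calls["count"] += 1
        if calls["count"] < 3:
            raise TransientError("simulated network blip")
        return "ok"

    # base_delay=0 keeps the test fast; the jitter term is then zero as well.
    assert retry_with_backoff(flaky, max_retries=5, base_delay=0) == "ok"
    assert calls["count"] == 3
```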
Decision Framework
When a task fails, ask:
Is failure transient or permanent?
- Transient → Retry with exponential backoff
- Permanent → Don't retry, log error
Is system distributed?
- Single database → Transaction rollback
- Multiple services → Saga pattern
Is service external/unreliable?
- Yes → Circuit breaker + fallback
- No → Standard retry
Can I degrade gracefully?
- Yes → Fallback chain (ML → simple → cached → empty)
- No → Fast fail with error message
What consistency guarantees are needed?
- Strong (financial) → ACID transactions
- Eventual (social) → Event-driven updates
- Causal (chat) → Vector clocks
Common Anti-Patterns
❌ Retry Forever
Problem: Infinite retry loop exhausts resources.
Solution: Max retries (3-5), exponential backoff, circuit breaker.
❌ No Jitter
Problem: Thundering herd (all clients retry simultaneously).
Solution: Add jitter (±25% randomness) to backoff.
❌ Retry Non-Idempotent Operations
Problem: Duplicate charges, duplicate emails.
Solution: Use idempotency keys, check before retrying.
❌ Ignore Compensation Failures
Problem: Half-completed saga leaves inconsistent state.
Solution: Alert operations team, manual intervention required.
❌ No Timeouts
Problem: Indefinite hangs consume resources.
Solution: Set timeouts on all external calls (3-10s).
❌ Synchronous Saga
Problem: Long-running saga blocks the user.
Solution: Asynchronous saga (immediate response, poll for completion).
Real-World Examples
Example 1: Payment Processing Failure
Scenario: Credit card charge fails during checkout
Recovery strategy:
- Classify: Transient (network) or permanent (declined card)?
- If transient: Retry with exponential backoff (3 attempts)
- If permanent: Show user error, don't retry
- If retries are exhausted: Circuit breaker opens, fall back to a backup payment processor
Example 2: Microservice Dependency Down
Scenario: Recommendation service unavailable
Recovery strategy:
- Circuit breaker opens after 5 failures
- Fast-fail for next 60 seconds (don't call service)
- Graceful degradation: Show cached popular items
- After 60s: Half-open state, test if service recovered
- If recovered: Close circuit, resume normal operation
Example 3: Database Transaction Failure
Scenario: Transfer between accounts fails mid-transaction
Recovery strategy:
- Transaction rollback: Undo debit from source account
- Log failure for investigation
- Return error to user: "Transfer failed, please try again"
- User retries: Idempotent (check if already processed)
Summary Checklist
When implementing error recovery:
- Classify failure: Transient vs permanent, recoverable vs fatal
- Select pattern: Retry, circuit breaker, rollback, saga, or graceful degradation
- Implement idempotency: Use keys to prevent duplicate operations
- Add timeouts: Prevent indefinite hangs (3-10s)
- Monitor recovery: Track error rates, retry rates, circuit breaker states
- Test failures: Chaos engineering, inject faults, verify recovery
- Alert on critical: Circuit open, compensation failures, high error rates
- Document runbooks: How to manually intervene when automation fails
Remember: The goal is to recover automatically when possible, fail gracefully when not, and always maintain data consistency.