Claude Code Plugins

Community-maintained marketplace

chaos-engineering-resilience

@proffesor-for-testing/agentic-qe

Chaos engineering principles, controlled failure injection, resilience testing, and system recovery validation. Use when testing distributed systems, building confidence in fault tolerance, or validating disaster recovery.

Install Skill

1. Download skill
2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Verify the skill by reading through its instructions before using it.

SKILL.md

name: chaos-engineering-resilience
description: Chaos engineering principles, controlled failure injection, resilience testing, and system recovery validation. Use when testing distributed systems, building confidence in fault tolerance, or validating disaster recovery.
category: specialized-testing
priority: high
tokenEstimate: 900
agents: qe-chaos-engineer, qe-performance-tester, qe-production-intelligence
implementation_status: optimized
optimization_version: 1
last_optimized: 2025-12-02
dependencies:
quick_reference_card: true
tags: chaos, resilience, fault-injection, distributed-systems, recovery, netflix

Chaos Engineering & Resilience Testing

When testing system resilience or injecting failures:

1. DEFINE steady state (normal metrics: error rate, latency, throughput)
2. HYPOTHESIZE that the system continues in steady state during failure
3. INJECT real-world failures (network, instance, disk, CPU)
4. OBSERVE and measure deviation from steady state
5. FIX weaknesses discovered, document runbooks, repeat
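The five steps above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the skill's implementation: `runExperiment`, `measure`, `inject`, and `rollback` are hypothetical names, and the caller supplies the metrics source and the failure mechanism.

```javascript
// Minimal sketch of the define → hypothesize → inject → observe → fix loop.
// All names (runExperiment, measure, inject, rollback) are hypothetical.
async function runExperiment({ name, steadyState, measure, inject, rollback, samples = 3 }) {
  // 1. DEFINE: confirm the system is in steady state before touching it.
  const baseline = await measure();
  if (!steadyState(baseline)) {
    return { name, status: 'skipped', reason: 'not in steady state before injection', observations: [] };
  }

  // 2. HYPOTHESIZE: steadyState() should keep returning true during the failure.
  const observations = [];
  let status = 'hypothesis-held';

  // 3. INJECT the failure.
  await inject();
  try {
    // 4. OBSERVE: sample metrics and stop at the first deviation.
    for (let i = 0; i < samples; i++) {
      const metrics = await measure();
      observations.push(metrics);
      if (!steadyState(metrics)) {
        status = 'hypothesis-refuted';
        break;
      }
    }
  } finally {
    // Always undo the failure, even if measuring throws.
    await rollback();
  }

  // 5. FIX: a refuted hypothesis is the finding that feeds a runbook.
  return { name, status, observations };
}
```

A caller would pass a predicate such as `(m) => m.errorRate < 0.001` as `steadyState`. Putting `rollback` in `finally` means the injected failure is removed even when observation itself fails.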

Quick Chaos Steps:

  • Start small: Dev → Staging → 1% prod → gradual rollout
  • Define clear rollback triggers (e.g., error_rate > 5%)
  • Measure blast radius, never exceed planned scope
  • Document findings → runbooks → improved resilience
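A rollback trigger written as a string, like the `error_rate > 5%` example in the steps above, has to become an executable check at some point. The sketch below shows one way to do that. `parseTrigger` is a hypothetical helper, and treating a missing metric as a breach is an assumption made here for safety, not something the skill specifies.

```javascript
// Hypothetical helper: turn a trigger such as 'error_rate > 5%' into a predicate.
// Percentages are converted to fractions, so '5%' compares against 0.05.
function parseTrigger(expression) {
  const match = /^\s*([A-Za-z_][\w.]*)\s*(>=|<=|>|<)\s*(\d+(?:\.\d+)?)\s*(%?)\s*$/.exec(expression);
  if (!match) {
    throw new Error(`Unrecognized trigger: ${expression}`);
  }
  const [, metric, operator, rawValue, percent] = match;
  const threshold = percent ? Number(rawValue) / 100 : Number(rawValue);
  const compare = {
    '>': (a, b) => a > b,
    '>=': (a, b) => a >= b,
    '<': (a, b) => a < b,
    '<=': (a, b) => a <= b,
  }[operator];

  // Assumption: a missing or non-numeric metric counts as a breach,
  // so an experiment that cannot see its metrics rolls back.
  return (metrics) => {
    const value = metrics[metric];
    return typeof value !== 'number' || Number.isNaN(value) || compare(value, threshold);
  };
}

const shouldRollback = parseTrigger('error_rate > 5%');
console.log(shouldRollback({ error_rate: 0.02 })); // false
console.log(shouldRollback({ error_rate: 0.08 })); // true
```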

Critical Success Factors:

  • Controlled experiments with automatic rollback
  • Steady state must be measurable
  • Start in non-production, graduate to production

Quick Reference Card

When to Use

  • Distributed systems validation
  • Disaster recovery testing
  • Building confidence in fault tolerance
  • Pre-production resilience verification

Failure Types to Inject

| Category | Failures | Tools |
|---|---|---|
| Network | Latency, packet loss, partition | tc, toxiproxy |
| Infrastructure | Instance kill, disk failure, CPU | Chaos Monkey |
| Application | Exceptions, slow responses, leaks | Gremlin, LitmusChaos |
| Dependencies | Service outage, timeout | WireMock |
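The tools in the table inject failures at the network, host, or proxy level. For a quick local experiment, a similar effect can be approximated in-process by wrapping a dependency call. The sketch below is an illustration only: `withFaults` and its options are hypothetical and do not belong to any of the listed tools.

```javascript
// Hypothetical in-process fault injector for an async dependency call.
// delayMs simulates added latency; failureRate simulates an outage or exception.
function withFaults(fn, { delayMs = 0, failureRate = 0, random = Math.random } = {}) {
  return async (...args) => {
    if (delayMs > 0) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
    // random is injectable so tests can make the failure deterministic.
    if (random() < failureRate) {
      throw new Error('Injected failure');
    }
    return fn(...args);
  };
}

// Example: a stand-in lookup that becomes slow and fails about 10% of the time.
const lookupUser = async (id) => ({ id, name: 'example' });
const flakyLookup = withFaults(lookupUser, { delayMs: 50, failureRate: 0.1 });
flakyLookup(7).then(
  (user) => console.log('ok', user.id),
  (error) => console.log('failed:', error.message)
);
```

Wrapping at the call site keeps the experiment's scope to a single dependency, which suits the "start small" guidance. It does not exercise kernel or network behavior the way tc or toxiproxy do.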

Blast Radius Progression

Dev (safe) → Staging → 1% prod → 10% → 50% → 100%
     ↓           ↓         ↓        ↓
  Learn      Validate   Careful   Full confidence
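The progression above can be encoded as data plus a rule that advances only while steady state holds. This is a minimal sketch: `STAGES`, `targetCount`, and `advance` are hypothetical names, and treating dev and staging as fully targeted (100%) is an assumption. Rounding down keeps the number of affected instances within the planned percentage.

```javascript
// Stages mirror the progression above. Assumption: dev and staging are fully targeted.
const STAGES = [
  { env: 'dev', percent: 100 },
  { env: 'staging', percent: 100 },
  { env: 'prod', percent: 1 },
  { env: 'prod', percent: 10 },
  { env: 'prod', percent: 50 },
  { env: 'prod', percent: 100 },
];

// Number of instances to affect. Rounds down so the planned scope is never exceeded;
// a result of 0 means the fleet is too small for this stage.
function targetCount(totalInstances, percent) {
  return Math.floor((totalInstances * percent) / 100);
}

// Decide what to do after a stage finishes.
function advance(stageIndex, steadyStateHeld) {
  if (!steadyStateHeld) {
    return { action: 'halt', stage: STAGES[stageIndex] };
  }
  const next = stageIndex + 1;
  return next < STAGES.length
    ? { action: 'proceed', stage: STAGES[next] }
    : { action: 'done', stage: STAGES[stageIndex] };
}

console.log(targetCount(200, 1)); // 2
console.log(advance(2, true));    // { action: 'proceed', stage: { env: 'prod', percent: 10 } }
```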

Steady State Metrics

| Metric | Normal | Alert Threshold |
|---|---|---|
| Error rate | < 0.1% | > 1% |
| p99 latency | < 200ms | > 500ms |
| Throughput | Baseline | 20% below baseline |
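To make steady state measurable, the thresholds in this table can be turned into a check over collected samples. The sketch below computes p99 with the nearest-rank method and classifies a measurement window. `percentile` and `assess` are hypothetical helpers, and the `degraded` label for values between "normal" and "alert" is an assumption, since the table leaves that range unnamed.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values, p) {
  if (values.length === 0) {
    throw new Error('No samples');
  }
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Classify one measurement window against the thresholds in the table.
// errorRate is a fraction (0.01 = 1%); 'degraded' is an assumed label for the gap.
function assess({ errorRate, p99LatencyMs, throughput, baselineThroughput }) {
  const alerts = [];
  if (errorRate > 0.01) alerts.push('errorRate');
  if (p99LatencyMs > 500) alerts.push('p99Latency');
  if (throughput < baselineThroughput * 0.8) alerts.push('throughput');

  const normal = errorRate < 0.001 && p99LatencyMs < 200 && alerts.length === 0;
  const state = alerts.length > 0 ? 'alert' : normal ? 'normal' : 'degraded';
  return { state, alerts };
}

const latencies = [120, 95, 180, 110, 640, 130, 105, 98, 115, 125];
console.log(assess({
  errorRate: 0.0004,
  p99LatencyMs: percentile(latencies, 99),
  throughput: 950,
  baselineThroughput: 1000,
}));
// { state: 'alert', alerts: [ 'p99Latency' ] }
```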

Chaos Experiment Structure

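// Chaos experiment definition
const experiment = {
  name: 'Database latency injection',
  // What should remain true while the failure is active
  hypothesis: 'System handles 500ms DB latency gracefully',
  // Measurable conditions that define "healthy" for this experiment
  steadyState: {
    errorRate: '< 0.1%',
    p99Latency: '< 300ms'
  },
  // The failure to inject, where, and for how long
  method: {
    type: 'network-latency',
    target: 'database',
    delay: '500ms',
    duration: '5m'
  },
  // Abort and restore automatically once the trigger fires
  rollback: {
    automatic: true,
    trigger: 'errorRate > 5%'
  }
};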
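Because the definition stores durations and thresholds as strings, it helps to validate it before running anything. The following sketch checks the fields used above. `parseDuration` and `validateExperiment` are hypothetical helpers, and the accepted units (`ms`, `s`, `m`, `h`) are an assumption based on the values `'500ms'` and `'5m'`.

```javascript
// Assumed duration units, inferred from '500ms' and '5m' in the definition above.
const UNIT_MS = { ms: 1, s: 1000, m: 60000, h: 3600000 };

// Convert '500ms' or '5m' into milliseconds.
function parseDuration(text) {
  const match = /^(\d+(?:\.\d+)?)(ms|s|m|h)$/.exec(String(text).trim());
  if (!match) {
    throw new Error(`Invalid duration: ${text}`);
  }
  return Number(match[1]) * UNIT_MS[match[2]];
}

// Return a list of problems; an empty list means the definition looks runnable.
function validateExperiment(experiment) {
  const problems = [];
  if (!experiment.hypothesis) {
    problems.push('missing hypothesis');
  }
  if (!experiment.steadyState || Object.keys(experiment.steadyState).length === 0) {
    problems.push('steadyState needs at least one measurable metric');
  }
  for (const field of ['delay', 'duration']) {
    const value = experiment.method && experiment.method[field];
    if (value !== undefined) {
      try {
        parseDuration(value);
      } catch (error) {
        problems.push(`method.${field}: ${error.message}`);
      }
    }
  }
  if (!experiment.rollback || !experiment.rollback.automatic || !experiment.rollback.trigger) {
    problems.push('rollback must be automatic and declare a trigger');
  }
  return problems;
}

console.log(parseDuration('5m')); // 300000
console.log(validateExperiment({ hypothesis: 'x', steadyState: {}, method: { duration: '5 minutes' } }));
```

Returning all problems at once, instead of throwing on the first, lets an author fix a definition in one pass.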

Agent-Driven Chaos

// qe-chaos-engineer runs controlled experiments
await Task("Chaos Experiment", {
  target: 'payment-service',
  failure: 'terminate-random-instance',
  blastRadius: '10%',
  duration: '5m',
  steadyStateHypothesis: {
    metric: 'success-rate',
    threshold: 0.99
  },
  autoRollback: true
}, "qe-chaos-engineer");

// Validates:
// - System recovers automatically
// - Error rate stays within threshold
// - No data loss
// - Alerts triggered appropriately

Agent Coordination Hints

Memory Namespace

aqe/chaos-engineering/
├── experiments/*       - Experiment definitions & results
├── steady-states/*     - Baseline measurements
├── runbooks/*          - Generated recovery procedures
└── blast-radius/*      - Impact analysis
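Keys under this namespace can be built with a small helper so that agents do not drift into ad hoc paths. This is a sketch only: `memoryKey` is a hypothetical function, not part of the agentic-qe API, and the id format it enforces is an assumption.

```javascript
// Namespace and categories taken from the tree above.
const NAMESPACE = 'aqe/chaos-engineering';
const CATEGORIES = ['experiments', 'steady-states', 'runbooks', 'blast-radius'];

// Hypothetical helper: build a key such as 'aqe/chaos-engineering/runbooks/<id>'.
// Assumption: ids are lowercase letters, digits, and hyphens.
function memoryKey(category, id) {
  if (!CATEGORIES.includes(category)) {
    throw new Error(`Unknown category: ${category}`);
  }
  if (!/^[a-z0-9][a-z0-9-]*$/.test(id)) {
    throw new Error(`Invalid id: ${id}`);
  }
  return `${NAMESPACE}/${category}/${id}`;
}

console.log(memoryKey('experiments', 'db-latency-injection'));
// aqe/chaos-engineering/experiments/db-latency-injection
```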

Fleet Coordination

const chaosFleet = await FleetManager.coordinate({
  strategy: 'chaos-engineering',
  agents: [
    'qe-chaos-engineer',          // Experiment execution
    'qe-performance-tester',      // Baseline metrics
    'qe-production-intelligence'  // Production monitoring
  ],
  topology: 'sequential'
});


Remember

Break things on purpose to prevent unplanned outages. Find weaknesses before users do. Define steady state, inject failures, measure impact, fix weaknesses, create runbooks. Start small, increase blast radius gradually.

With Agents: qe-chaos-engineer automates chaos experiments with blast radius control, automatic rollback, and comprehensive resilience validation, and generates runbooks from experiment results.