Production-grade microservices design skill for service decomposition, service mesh, resilience patterns, and observability

SKILL.md

name: microservices-design
description: Production-grade microservices design skill for service decomposition, service mesh, resilience patterns, and observability
sasmp_version: 1.3.0
bonded_agent: 07-microservices
bond_type: PRIMARY_BOND

Microservices Design Skill

Purpose: Atomic skill for microservices architecture with comprehensive resilience and observability patterns.

Skill Identity

| Attribute | Value |
|---|---|
| Scope | Decomposition, Resilience, Observability |
| Responsibility | Single: Service architecture patterns |
| Invocation | Skill("microservices-design") |

Parameter Schema

Input Validation

parameters:
  microservices_context:
    type: object
    required: true
    properties:
      project_type:
        type: string
        enum: [greenfield, monolith_extraction, optimization]
        required: true
      current_state:
        type: object
        properties:
          services: { type: array, items: { type: string } }
          pain_points: { type: array, items: { type: string } }
          team_structure: { type: string }
      requirements:
        type: object
        properties:
          team_size: { type: integer, minimum: 1 }
          deployment_frequency: { type: string, enum: [daily, weekly, monthly] }
          availability_sla: { type: string, pattern: "^\\d{2}\\.\\d+%$" }
          max_latency_ms: { type: integer, minimum: 1 }
      constraints:
        type: object
        properties:
          budget: { type: string }
          timeline: { type: string }
          technology_stack: { type: array, items: { type: string } }

validation_rules:
  - name: "team_size_for_microservices"
    rule: "team_size >= 2"
    warning: "Microservices add overhead; consider monolith for small teams"
  - name: "sla_feasibility"
    rule: "availability_sla < '99.99%' or has_multi_region"
    warning: "99.99%+ SLA typically requires multi-region deployment"
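
The two advisory rules above can be sketched in Python as follows. This is a minimal illustration, not the skill's actual validator: `ValidationResult`, the `has_multi_region` argument, and the error messages are assumptions made for the example. The sketch compares the SLA numerically and follows the warning text by treating 99.99% and above as needing multi-region deployment.

```python
import re
from dataclasses import dataclass, field

# Same pattern as availability_sla in the schema above.
SLA_PATTERN = re.compile(r"^\d{2}\.\d+%$")
PROJECT_TYPES = ("greenfield", "monolith_extraction", "optimization")


@dataclass
class ValidationResult:  # hypothetical result type for this sketch
    valid: bool = True
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)


def validate_parameters(params, has_multi_region=False):
    """Check required fields, then apply the two advisory rules."""
    result = ValidationResult()
    ctx = params.get("microservices_context")
    if not isinstance(ctx, dict):
        result.valid = False
        result.errors.append("microservices_context is required")
        return result

    if ctx.get("project_type") not in PROJECT_TYPES:
        result.valid = False
        result.errors.append("project_type must be one of the enum values")

    req = ctx.get("requirements", {})
    team_size = req.get("team_size")
    if team_size is not None and team_size < 2:
        # Rule team_size_for_microservices: warn, do not reject.
        result.warnings.append(
            "Microservices add overhead; consider monolith for small teams"
        )

    sla = req.get("availability_sla")
    if sla is not None:
        if not SLA_PATTERN.match(sla):
            result.valid = False
            result.errors.append("availability_sla must look like '99.9%'")
        elif float(sla.rstrip("%")) >= 99.99 and not has_multi_region:
            # Rule sla_feasibility: warn, do not reject.
            result.warnings.append(
                "99.99%+ SLA typically requires multi-region deployment"
            )
    return result
```

Both rules only produce warnings, so a one-person team or a 99.99% target still validates; only schema violations set `valid` to false.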

Output Schema

output:
  type: object
  properties:
    service_catalog:
      type: array
      items:
        type: object
        properties:
          name: { type: string }
          responsibility: { type: string }
          api_type: { type: string }
          dependencies: { type: array }
          team_owner: { type: string }
          database: { type: string }
    architecture:
      type: object
      properties:
        communication: { type: object }
        service_mesh: { type: object }
        api_gateway: { type: object }
    resilience:
      type: object
      properties:
        patterns: { type: array }
        configuration: { type: object }
    observability:
      type: object
      properties:
        metrics: { type: array }
        tracing: { type: object }
        logging: { type: object }
        alerting: { type: object }

Core Patterns

Service Decomposition

By Business Capability:
├── Align with business domains
├── Stable boundaries over time
├── Example: Order, Inventory, Payment
└── Team: One team per capability

By Subdomain (DDD):
├── Core: Competitive advantage (build)
├── Supporting: Necessary (build or buy)
├── Generic: Commodity (buy)
└── Bounded Context = Service

By Team (Inverse Conway):
├── Structure services around teams
├── 2-3 services per two-pizza team
├── Full ownership model
└── DevOps: You build it, you run it

Anti-Patterns:
├── Distributed Monolith: Tight coupling
├── Nano-services: Too granular
├── Shared Database: Hidden coupling
└── Sync Chains: Latency adds up across hops

Resilience Patterns

Circuit Breaker:
├── States: Closed → Open → Half-Open
├── Config:
│   ├── failure_threshold: 50%
│   ├── slow_call_threshold: 50%
│   ├── wait_duration: 60s
│   └── half_open_calls: 3
├── Implementation: Resilience4j
└── Fallback: Cached data, default, queue
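
To make the state transitions concrete, here is a minimal circuit breaker sketch in plain Python. It is not Resilience4j's API: the class, the count-based `window_size`, and the injectable `clock` are assumptions for illustration, and the slow-call threshold is left out to keep the example short.

```python
import time
from collections import deque


class CircuitBreaker:
    """Closed -> Open -> Half-Open breaker over a count-based window."""

    def __init__(self, failure_threshold=0.5, window_size=10,
                 wait_duration=60.0, half_open_calls=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.wait_duration = wait_duration
        self.half_open_calls = half_open_calls
        self.clock = clock
        self.state = "closed"
        self.results = deque(maxlen=window_size)  # True marks a failure
        self.opened_at = 0.0
        self.trial_successes = 0

    def _open(self):
        self.state = "open"
        self.opened_at = self.clock()
        self.results.clear()

    def call(self, func, fallback):
        if self.state == "open":
            if self.clock() - self.opened_at < self.wait_duration:
                return fallback()        # fail fast while open
            self.state = "half_open"     # wait elapsed: allow trial calls
            self.trial_successes = 0
        try:
            value = func()
        except Exception:
            self._record_failure()
            return fallback()
        self._record_success()
        return value

    def _record_failure(self):
        if self.state == "half_open":
            self._open()                 # any trial failure reopens
            return
        self.results.append(True)
        window_full = len(self.results) == self.results.maxlen
        rate = sum(self.results) / len(self.results)
        if window_full and rate >= self.failure_threshold:
            self._open()

    def _record_success(self):
        if self.state == "half_open":
            self.trial_successes += 1
            if self.trial_successes >= self.half_open_calls:
                self.state = "closed"
            return
        self.results.append(False)
```

A caller would wrap each remote call, for example `breaker.call(fetch_price, lambda: cached_price)`, where both function names are placeholders. The breaker only evaluates the failure rate once the window is full, which avoids tripping on the first failed call after startup.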

Retry with Backoff:
├── Exponential: delay * 2^attempt
├── Max attempts: 3-5
├── Jitter: ±20%
├── Idempotency: Required
└── Non-retryable: 4xx client errors (400, 401, 403, 404)
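
The backoff rule above can be sketched as a small helper. The function names, the default values (taken from the 100 ms / 5 s figures used for HTTP calls later in this document), and the `is_retryable` callback are assumptions for illustration.

```python
import random
import time


def backoff_delay(attempt, base_delay=0.1, max_delay=5.0, jitter=0.2):
    """Seconds to wait before retrying after 0-based attempt number `attempt`."""
    delay = min(base_delay * 2 ** attempt, max_delay)       # delay * 2^attempt, capped
    return delay * random.uniform(1 - jitter, 1 + jitter)   # +/- 20% jitter


def call_with_retry(func, is_retryable, max_attempts=3, sleep=time.sleep):
    """Call func until it succeeds, hits a non-retryable error, or runs out of attempts."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            sleep(backoff_delay(attempt))
```

Jitter spreads retries from many clients over time so they do not hit a recovering service in lockstep. The retried operation must be idempotent, since a request that timed out may already have been applied.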

Bulkhead:
├── Isolate failure domains
├── Thread pool per dependency
├── Semaphore for lightweight
└── Config: maxConcurrentCalls: 25
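
A semaphore-based bulkhead, the lightweight variant named above, might look like this in Python. `SemaphoreBulkhead`, `BulkheadFullError`, and `max_wait_seconds` are hypothetical names for this sketch; the default of 25 mirrors the `maxConcurrentCalls` value above.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when no slot is free within the allowed wait."""


class SemaphoreBulkhead:
    """Caps concurrent calls to one dependency so it cannot exhaust the caller."""

    def __init__(self, max_concurrent_calls=25, max_wait_seconds=0.0):
        self._slots = threading.BoundedSemaphore(max_concurrent_calls)
        self._max_wait = max_wait_seconds

    def call(self, func, *args, **kwargs):
        if self._max_wait > 0:
            acquired = self._slots.acquire(timeout=self._max_wait)
        else:
            acquired = self._slots.acquire(blocking=False)  # reject immediately
        if not acquired:
            raise BulkheadFullError("bulkhead is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

Creating one bulkhead per downstream dependency keeps a slow inventory service, for example, from consuming the capacity needed for payment calls.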

Timeout:
├── Connection: 1s
├── Read: 5s
├── Total: 10s
└── Cascading: outer > inner
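
The "outer > inner" rule can be checked mechanically. The sketch below assumes the services are passed as a dict ordered from the outermost caller to the innermost dependency, with timeouts in milliseconds; `HierarchyResult` and the error wording are illustrative choices, not a fixed contract.

```python
from dataclasses import dataclass, field


@dataclass
class HierarchyResult:  # hypothetical result type for this sketch
    valid: bool = True
    errors: list = field(default_factory=list)


def validate_timeout_hierarchy(services):
    """Each caller's timeout must exceed the timeout of the service it calls."""
    result = HierarchyResult()
    chain = list(services.items())  # dicts preserve insertion order
    for (outer, outer_cfg), (inner, inner_cfg) in zip(chain, chain[1:]):
        if outer_cfg["timeout"] <= inner_cfg["timeout"]:
            result.valid = False
            result.errors.append(
                f"timeout hierarchy violated: {outer} ({outer_cfg['timeout']} ms) "
                f"must exceed {inner} ({inner_cfg['timeout']} ms)"
            )
    return result
```

If an inner call can outlive its caller, the caller gives up first and the inner work is wasted, which is why each layer needs headroom over the layer beneath it.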

Service Mesh

Capabilities:
├── Traffic Management
│   ├── Load balancing
│   ├── Traffic splitting (canary)
│   ├── Circuit breaking
│   └── Retries/timeouts
├── Security
│   ├── mTLS
│   ├── Service identity (SPIFFE)
│   └── Authorization policies
├── Observability
│   ├── Distributed tracing
│   ├── Service metrics
│   └── Access logging
└── Options
    ├── Istio: Full-featured
    ├── Linkerd: Lightweight
    ├── Consul: HashiCorp
    └── AWS App Mesh

Observability (Three Pillars)

Metrics:
├── RED: Rate, Errors, Duration
├── USE: Utilization, Saturation, Errors
├── Key Metrics:
│   ├── http_requests_total{method, path, status}
│   ├── http_request_duration_seconds{quantile}
│   └── http_requests_in_flight
└── Tools: Prometheus, Datadog

Logs:
├── Structured JSON
├── Correlation ID propagation
├── Level: DEBUG, INFO, WARN, ERROR
├── Format:
│   {
│     "timestamp": "ISO8601",
│     "level": "INFO",
│     "service": "order-service",
│     "trace_id": "abc123",
│     "message": "Order created"
│   }
└── Tools: ELK, Loki
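
One way to emit the log format above from Python's standard `logging` module is a custom formatter. `JsonFormatter` is a hypothetical name, and passing the correlation ID through `extra={"trace_id": ...}` is an assumption about how the caller propagates it.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the fields shown above."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            # Set via logger.info(..., extra={"trace_id": "..."}); None if absent.
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("order-service"))
logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

A call such as `logger.info("Order created", extra={"trace_id": "abc123"})` then produces a single JSON line that log aggregators can index by `trace_id`. Note that Python names the WARN level `WARNING`.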

Traces:
├── Distributed tracing
├── Span context propagation
├── W3C Trace Context
└── Tools: Jaeger, Zipkin, X-Ray
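
Span context propagation with W3C Trace Context comes down to forwarding a `traceparent` header of the form `version-traceid-parentid-flags`. The sketch below shows the idea with hypothetical helper names; it always emits version `00` and ignores the companion `tracestate` header, so a real tracing library should be preferred in production.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled=True):
    """Start a new trace with random trace and span IDs."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def child_traceparent(header):
    """Keep the incoming trace-id, mint a new span ID for the outgoing call."""
    match = TRACEPARENT_RE.match(header or "")
    if not match:
        return new_traceparent()  # missing or malformed: start a fresh trace
    _version, trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return new_traceparent()  # all-zero IDs are invalid
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Each service calls `child_traceparent` on the incoming header and sends the result downstream, so every span shares one trace ID while carrying its own span ID.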

Retry Logic

Service Call Retry

retry_config:
  http_calls:
    max_attempts: 3
    initial_delay_ms: 100
    max_delay_ms: 5000
    multiplier: 2.0
    jitter_factor: 0.2

  grpc_calls:
    max_attempts: 5
    initial_delay_ms: 50
    max_delay_ms: 2000
    multiplier: 1.5

  retryable:
    - UNAVAILABLE
    - DEADLINE_EXCEEDED
    - RESOURCE_EXHAUSTED
    - 502
    - 503
    - 504

  non_retryable:
    - INVALID_ARGUMENT
    - NOT_FOUND
    - ALREADY_EXISTS
    - 400
    - 401
    - 403
    - 404

  idempotency:
    header: "Idempotency-Key"
    required_for: [POST, PATCH]
    cache_ttl: 86400
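
A retry decision based on this configuration could be sketched as follows. The function name and signature are assumptions; the sketch also assumes that any code missing from the retryable list is not retried, and that POST and PATCH are retried only when an `Idempotency-Key` header was sent.

```python
# Codes copied from the retryable list in retry_config above.
RETRYABLE = {"UNAVAILABLE", "DEADLINE_EXCEEDED", "RESOURCE_EXHAUSTED", 502, 503, 504}
IDEMPOTENCY_REQUIRED_FOR = {"POST", "PATCH"}


def should_retry(code, method="GET", has_idempotency_key=False):
    """Return True when a failed call with this status code may be retried."""
    if code not in RETRYABLE:
        return False  # covers the non_retryable list and unknown codes
    if method.upper() in IDEMPOTENCY_REQUIRED_FOR and not has_idempotency_key:
        return False  # a retry could apply the same write twice
    return True
```

For example, `should_retry(503, "POST")` is false until the client attaches an idempotency key, while `should_retry(503, "GET")` is true.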

Logging & Observability

Log Format

log_schema:
  level: { type: string }
  timestamp: { type: string, format: ISO8601 }
  skill: { type: string, value: "microservices-design" }
  event:
    type: string
    enum:
      - service_designed
      - decomposition_planned
      - resilience_configured
      - mesh_deployed
      - sla_defined
  context:
    type: object
    properties:
      service_name: { type: string }
      pattern: { type: string }
      decision: { type: string }

example:
  level: INFO
  event: resilience_configured
  context:
    service_name: payment-service
    pattern: circuit_breaker
    decision: "5 failures in 60s triggers open state"

Metrics

metrics:
  - name: service_design_decisions
    type: counter
    labels: [service, decision_type]

  - name: decomposition_services_count
    type: gauge
    labels: [domain]

  - name: resilience_patterns_applied
    type: counter
    labels: [service, pattern]

  - name: sla_target
    type: gauge
    labels: [service]

Troubleshooting

Common Issues

| Issue | Cause | Resolution |
|---|---|---|
| High latency | Cascade calls | Parallelize, cache |
| Partial failures | No circuit breaker | Add resilience |
| Data inconsistency | Distributed tx | Saga pattern |
| Deployment failures | Coupling | API contracts |
| Debug difficulty | No tracing | Distributed tracing |
| Cascading failures | No bulkhead | Thread isolation |

Debug Checklist

□ Trace ID in all logs?
□ Circuit breakers monitored?
□ Timeouts on all calls?
□ Health checks passing?
□ Service mesh healthy?
□ Dependency graph documented?
□ SLOs defined and measured?
□ Alerting configured?

Unit Test Templates

Decomposition Tests

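# test_microservices_design.py
# Templates: validate_parameters, plan_decomposition, and the other helpers
# called below are expected to come from the implementation under test.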

def test_valid_microservices_context():
    params = {
        "microservices_context": {
            "project_type": "monolith_extraction",
            "current_state": {
                "services": ["monolith"],
                "pain_points": ["slow deployments", "scaling issues"]
            },
            "requirements": {
                "team_size": 15,
                "deployment_frequency": "daily",
                "availability_sla": "99.9%",
                "max_latency_ms": 200
            }
        }
    }
    result = validate_parameters(params)
    assert result.valid

def test_small_team_warning():
    params = {
        "microservices_context": {
            "project_type": "greenfield",
            "requirements": {"team_size": 1}
        }
    }
    result = validate_parameters(params)
    assert len(result.warnings) > 0
    assert "overhead" in result.warnings[0]

def test_service_decomposition():
    monolith = {
        "domains": ["users", "orders", "payments", "inventory"],
        "team_size": 12
    }
    result = plan_decomposition(monolith)

    assert len(result.services) == 4
    assert result.services[0].responsibility != ""
    assert result.communication_pattern in ["sync", "async", "mixed"]

Resilience Pattern Tests

def test_circuit_breaker_config():
    service = {"name": "payment-service", "sla": "99.9%"}
    config = generate_circuit_breaker_config(service)

    assert config.failure_rate_threshold == 50
    assert config.wait_duration_in_open_state == 60
    assert config.permitted_calls_in_half_open == 3

def test_timeout_hierarchy():
    services = {
        "gateway": {"timeout": 10000},
        "order": {"timeout": 8000},
        "payment": {"timeout": 5000},
        "db": {"timeout": 2000}
    }
    result = validate_timeout_hierarchy(services)
    assert result.valid  # Outer > Inner

def test_invalid_timeout_hierarchy():
    services = {
        "gateway": {"timeout": 5000},
        "order": {"timeout": 10000}  # Child > Parent
    }
    result = validate_timeout_hierarchy(services)
    assert not result.valid
    assert "hierarchy" in result.errors[0]

def test_bulkhead_sizing():
    service = {
        "name": "inventory-service",
        "expected_concurrency": 100,
        "dependency_latency_ms": 50
    }
    config = calculate_bulkhead_size(service)

    # Thread pool sized for expected load + buffer
    assert config.max_concurrent_calls >= 100
    assert config.max_wait_duration_ms <= 1000

SLA Calculation Tests

def test_serial_availability():
    services = [0.999, 0.999, 0.999]  # Three 9s each
    result = calculate_serial_availability(services)
    assert abs(result - 0.997) < 0.001  # ~99.7%

def test_parallel_availability():
    replicas = [0.999, 0.999]  # Two replicas
    result = calculate_parallel_availability(replicas)
    assert abs(result - 0.999999) < 0.000001  # ~99.9999%

def test_sla_achievability():
    result = check_sla_achievable(
        target_sla="99.99%",
        service_count=5,
        per_service_availability=0.9999,
        has_redundancy=True
    )
    assert result.achievable
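
The availability figures asserted in these tests follow from two standard formulas: services in series multiply their availabilities, and redundant replicas fail only when all of them fail. A minimal sketch of the two helpers, assuming independent failures and Python 3.8+ for `math.prod`:

```python
import math


def calculate_serial_availability(services):
    """All services must be up: multiply the individual availabilities."""
    return math.prod(services)


def calculate_parallel_availability(replicas):
    """At least one replica must be up: 1 minus the chance that all are down."""
    return 1 - math.prod(1 - a for a in replicas)
```

Three services at 99.9% in series give 0.999 ** 3, roughly 99.7%, which is why long synchronous call chains erode the end-to-end SLA. Two 99.9% replicas in parallel give 1 - 0.001 ** 2, roughly 99.9999%.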

Version History

| Version | Date | Changes |
|---|---|---|
| 2.0.0 | 2025-01 | Production-grade rewrite with resilience patterns |
| 1.0.0 | 2024-12 | Initial release |