Microservices Design Skill
Purpose: Atomic skill for microservices architecture design, covering service decomposition, resilience, and observability patterns.
Skill Identity
| Attribute | Value |
|---|---|
| Scope | Decomposition, Resilience, Observability |
| Responsibility | Single: Service architecture patterns |
| Invocation | Skill("microservices-design") |
Parameter Schema
Input Validation
```yaml
parameters:
  microservices_context:
    type: object
    required: true
    properties:
      project_type:
        type: string
        enum: [greenfield, monolith_extraction, optimization]
        required: true
      current_state:
        type: object
        properties:
          services: { type: array, items: { type: string } }
          pain_points: { type: array, items: { type: string } }
          team_structure: { type: string }
      requirements:
        type: object
        properties:
          team_size: { type: integer, minimum: 1 }
          deployment_frequency: { type: string, enum: [daily, weekly, monthly] }
          availability_sla: { type: string, pattern: "^\\d{2}\\.\\d+%$" }
          max_latency_ms: { type: integer, minimum: 1 }
      constraints:
        type: object
        properties:
          budget: { type: string }
          timeline: { type: string }
          technology_stack: { type: array, items: { type: string } }

validation_rules:
  - name: "team_size_for_microservices"
    rule: "team_size >= 2"
    warning: "Microservices add overhead; consider a monolith for small teams"
  - name: "sla_feasibility"
    rule: "availability_sla < '99.99%' or has_multi_region"
    warning: "99.99%+ SLA typically requires multi-region deployment"
```
Output Schema
```yaml
output:
  type: object
  properties:
    service_catalog:
      type: array
      items:
        type: object
        properties:
          name: { type: string }
          responsibility: { type: string }
          api_type: { type: string }
          dependencies: { type: array }
          team_owner: { type: string }
          database: { type: string }
    architecture:
      type: object
      properties:
        communication: { type: object }
        service_mesh: { type: object }
        api_gateway: { type: object }
    resilience:
      type: object
      properties:
        patterns: { type: array }
        configuration: { type: object }
    observability:
      type: object
      properties:
        metrics: { type: array }
        tracing: { type: object }
        logging: { type: object }
        alerting: { type: object }
```
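For concreteness, a `service_catalog` entry maps naturally onto a small data structure. A minimal Python sketch follows; the field names mirror the schema above, while the class itself and the sample values are illustrative, not part of the skill's API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ServiceCatalogEntry:
    """One entry in the service_catalog output (illustrative shape only)."""
    name: str
    responsibility: str
    api_type: str                       # e.g. "REST", "gRPC", "async/events"
    dependencies: List[str] = field(default_factory=list)
    team_owner: str = ""
    database: str = ""                  # database-per-service; empty if stateless

order_service = ServiceCatalogEntry(
    name="order-service",
    responsibility="Order lifecycle: create, amend, cancel",
    api_type="REST",
    dependencies=["payment-service", "inventory-service"],
    team_owner="orders-team",
    database="orders_db (PostgreSQL)",
)
```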
Core Patterns
Service Decomposition
By Business Capability:
├── Align with business domains
├── Stable boundaries over time
├── Example: Order, Inventory, Payment
└── Team: One team per capability
By Subdomain (DDD):
├── Core: Competitive advantage (build)
├── Supporting: Necessary (build or buy)
├── Generic: Commodity (buy)
└── Bounded Context = Service
By Team (Inverse Conway):
├── Structure services around teams
├── 2-3 services per team (2-pizza)
├── Full ownership model
└── DevOps: You build it, you run it
Anti-Patterns:
├── Distributed Monolith: Tight coupling
├── Nano-services: Too granular
├── Shared Database: Hidden coupling
└── Sync Chains: Latency multiplication
Resilience Patterns
Circuit Breaker:
├── States: Closed → Open → Half-Open
├── Config:
│ ├── failure_threshold: 50%
│ ├── slow_call_threshold: 50%
│ ├── wait_duration: 60s
│ └── half_open_calls: 3
├── Implementation: Resilience4j
└── Fallback: Cached data, default, queue
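The Closed → Open → Half-Open cycle above can be implemented as a small state machine. The sketch below is a deliberately simplified, sliding-window illustration using the thresholds from the config; production services would normally rely on a library such as Resilience4j (or pybreaker in Python) rather than this hand-rolled version.

```python
import time

class CircuitBreaker:
    """Simplified Closed -> Open -> Half-Open state machine (illustrative only)."""

    def __init__(self, failure_rate_threshold=0.5, wait_duration=60.0,
                 half_open_calls=3, window_size=10):
        self.failure_rate_threshold = failure_rate_threshold
        self.wait_duration = wait_duration        # seconds to stay open
        self.half_open_calls = half_open_calls    # trial calls allowed in half-open
        self.window_size = window_size
        self.window = []                          # recent outcomes (True = failure)
        self.state = "CLOSED"
        self.opened_at = 0.0
        self.trial_results = []

    def call(self, fn, fallback):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.wait_duration:
                self.state, self.trial_results = "HALF_OPEN", []
            else:
                return fallback()                 # fail fast while open
        try:
            result = fn()
            self._record(failure=False)
            return result
        except Exception:
            self._record(failure=True)
            return fallback()

    def _record(self, failure):
        if self.state == "HALF_OPEN":
            self.trial_results.append(failure)
            if failure:
                self._open()                      # any trial failure reopens
            elif len(self.trial_results) >= self.half_open_calls:
                self.state, self.window = "CLOSED", []
            return
        self.window = (self.window + [failure])[-self.window_size:]
        if (len(self.window) == self.window_size and
                sum(self.window) / self.window_size >= self.failure_rate_threshold):
            self._open()

    def _open(self):
        self.state, self.opened_at = "OPEN", time.monotonic()
```

The `fallback` callable is where the options listed above plug in: return cached data, a sensible default, or enqueue the request for later processing.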
Retry with Backoff:
├── Exponential: delay * 2^attempt
├── Max attempts: 3-5
├── Jitter: ±20%
├── Idempotency: Required
└── Non-retryable: 4xx errors
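A minimal sketch of the backoff formula above (delay × 2^attempt with ±20% jitter, capped delay and attempts), assuming the caller only retries idempotent operations and raises a distinct error type for non-retryable failures such as 4xx responses:

```python
import random
import time

class NonRetryableError(Exception):
    """Raised for errors that should never be retried (e.g. 4xx client errors)."""

def retry_with_backoff(fn, max_attempts=3, base_delay=0.1, jitter=0.2, max_delay=5.0):
    """Exponential backoff: delay = base_delay * 2**attempt, with +/- jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except NonRetryableError:
            raise                                  # do not retry client errors
        except Exception:
            if attempt == max_attempts - 1:
                raise                              # out of attempts
            delay = min(base_delay * (2 ** attempt), max_delay)
            delay *= 1 + random.uniform(-jitter, jitter)
            time.sleep(delay)
```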
Bulkhead:
├── Isolate failure domains
├── Thread pool per dependency
├── Semaphore for lightweight
└── Config: maxConcurrentCalls: 25
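The lightweight semaphore variant above is straightforward to sketch: cap concurrent calls per dependency so one slow downstream cannot exhaust the caller's capacity. The `BulkheadFullError` name below is made up for this example; `max_concurrent_calls` follows the config above.

```python
import threading

class BulkheadFullError(Exception):
    """Raised when the per-dependency concurrency limit is exhausted."""

class Bulkhead:
    """Semaphore-based bulkhead: at most max_concurrent_calls in flight."""

    def __init__(self, max_concurrent_calls=25, max_wait_seconds=0.5):
        self._sem = threading.Semaphore(max_concurrent_calls)
        self._max_wait = max_wait_seconds

    def call(self, fn):
        if not self._sem.acquire(timeout=self._max_wait):
            raise BulkheadFullError("bulkhead saturated; shed load or fall back")
        try:
            return fn()
        finally:
            self._sem.release()

# One bulkhead per downstream dependency keeps failure domains isolated.
inventory_bulkhead = Bulkhead(max_concurrent_calls=25)
```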
Timeout:
├── Connection: 1s
├── Read: 5s
├── Total: 10s
└── Cascading: outer > inner
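As a concrete example of these budgets: with the `requests` package (assumed available), connection and read timeouts can be set separately as a `(connect, read)` tuple, and a total deadline can be wrapped around the whole call. The URL and the 10 s total budget here are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor

import requests

def fetch_order(order_id: str):
    # (connect, read) timeouts: 1 s to establish the connection, 5 s per read
    return requests.get(f"https://orders.internal/api/orders/{order_id}",
                        timeout=(1, 5))

def fetch_order_with_deadline(order_id: str, total_timeout: float = 10.0):
    """Enforce an overall budget on top of the per-phase timeouts.

    Raises concurrent.futures.TimeoutError if the total budget is exceeded.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fetch_order, order_id).result(timeout=total_timeout)
    finally:
        # Do not block on a still-running call; let it finish in the background.
        pool.shutdown(wait=False)
```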
Service Mesh
Capabilities:
├── Traffic Management
│ ├── Load balancing
│ ├── Traffic splitting (canary)
│ ├── Circuit breaking
│ └── Retries/timeouts
├── Security
│ ├── mTLS
│ ├── Service identity (SPIFFE)
│ └── Authorization policies
├── Observability
│ ├── Distributed tracing
│ ├── Service metrics
│ └── Access logging
└── Options
├── Istio: Full-featured
├── Linkerd: Lightweight
├── Consul: HashiCorp
└── AWS App Mesh
Observability (Three Pillars)
Metrics:
├── RED: Rate, Errors, Duration
├── USE: Utilization, Saturation, Errors
├── Key Metrics:
│ ├── http_requests_total{method, path, status}
│ ├── http_request_duration_seconds{quantile}
│ └── http_requests_in_flight
└── Tools: Prometheus, Datadog
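Using the official `prometheus_client` package, the key metrics listed above could be declared roughly as follows. The label names follow the list above; the `{quantile}` label shown there would come from a Summary, whereas this sketch uses a Histogram with explicit buckets (the bucket values are an assumption), which is the more common Prometheus pattern.

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

HTTP_REQUESTS_TOTAL = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "path", "status"])

HTTP_REQUEST_DURATION = Histogram(
    "http_request_duration_seconds", "HTTP request latency in seconds",
    ["method", "path"],
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5))

HTTP_REQUESTS_IN_FLIGHT = Gauge(
    "http_requests_in_flight", "HTTP requests currently being served")

def handle_request(method: str, path: str):
    HTTP_REQUESTS_IN_FLIGHT.inc()
    with HTTP_REQUEST_DURATION.labels(method, path).time():
        status = "200"                  # ... actual handler work goes here ...
    HTTP_REQUESTS_TOTAL.labels(method, path, status).inc()
    HTTP_REQUESTS_IN_FLIGHT.dec()

start_http_server(9100)                 # expose /metrics for Prometheus to scrape
```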
Logs:
├── Structured JSON
├── Correlation ID propagation
├── Level: DEBUG, INFO, WARN, ERROR
├── Format:
│ {
│ "timestamp": "ISO8601",
│ "level": "INFO",
│ "service": "order-service",
│ "trace_id": "abc123",
│ "message": "Order created"
│ }
└── Tools: ELK, Loki
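A minimal sketch of the structured format above using only the standard library: a JSON formatter plus a correlation/trace ID carried in a contextvar so it appears on every log line. The field names follow the example above; the contextvar plumbing is an assumption about how propagation is wired up in a given framework.

```python
import contextvars
import json
import logging
from datetime import datetime, timezone

trace_id_var = contextvars.ContextVar("trace_id", default="-")

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "order-service",
            "trace_id": trace_id_var.get(),
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Set the correlation ID once per request (e.g. from an incoming header).
trace_id_var.set("abc123")
logger.info("Order created")
```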
Traces:
├── Distributed tracing
├── Span context propagation
├── W3C Trace Context
└── Tools: Jaeger, Zipkin, X-Ray
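In practice a tracing SDK (OpenTelemetry, Jaeger clients, etc.) handles span context propagation, but the W3C Trace Context header itself is simple enough to show by hand. A sketch of building and forwarding a `traceparent` header (format per the W3C spec: version-traceid-spanid-flags):

```python
import secrets

def new_traceparent() -> str:
    """Create a W3C traceparent: 00-<16-byte trace id>-<8-byte span id>-<flags>."""
    trace_id = secrets.token_hex(16)   # 32 hex chars
    span_id = secrets.token_hex(8)     # 16 hex chars
    return f"00-{trace_id}-{span_id}-01"   # 01 = sampled

def child_traceparent(parent: str) -> str:
    """Keep the trace id, mint a new span id for the outgoing call."""
    version, trace_id, _parent_span, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

incoming = new_traceparent()                       # normally read from the request
outgoing_headers = {"traceparent": child_traceparent(incoming)}
print(outgoing_headers)
```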
Retry Logic
Service Call Retry
```yaml
retry_config:
  http_calls:
    max_attempts: 3
    initial_delay_ms: 100
    max_delay_ms: 5000
    multiplier: 2.0
    jitter_factor: 0.2
  grpc_calls:
    max_attempts: 5
    initial_delay_ms: 50
    max_delay_ms: 2000
    multiplier: 1.5
  retryable:
    grpc: [UNAVAILABLE, DEADLINE_EXCEEDED, RESOURCE_EXHAUSTED]
    http: [502, 503, 504]
  non_retryable:
    grpc: [INVALID_ARGUMENT, NOT_FOUND, ALREADY_EXISTS]
    http: [400, 401, 403, 404]
  idempotency:
    header: "Idempotency-Key"
    required_for: [POST, PATCH]
    cache_ttl: 86400  # seconds (24 h)
```
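The Idempotency-Key header above is typically a client-generated UUID: create it once per logical operation and reuse the same value on every retry so the server can deduplicate replays within the cache TTL. A client-side sketch (assumes the `requests` package and an illustrative endpoint; the server-side dedup cache is not shown):

```python
import uuid

import requests

def create_order(payload: dict, max_attempts: int = 3):
    # Generate the key once per logical operation, then reuse it on every retry
    # so the server can recognise and deduplicate replays.
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    for attempt in range(max_attempts):
        try:
            return requests.post("https://orders.internal/api/orders",
                                 json=payload, headers=headers, timeout=(1, 5))
        except requests.ConnectionError:
            if attempt == max_attempts - 1:
                raise
```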
Logging & Observability
Log Format
```yaml
log_schema:
  level: { type: string }
  timestamp: { type: string, format: ISO8601 }
  skill: { type: string, value: "microservices-design" }
  event:
    type: string
    enum:
      - service_designed
      - decomposition_planned
      - resilience_configured
      - mesh_deployed
      - sla_defined
  context:
    type: object
    properties:
      service_name: { type: string }
      pattern: { type: string }
      decision: { type: string }

example:
  level: INFO
  event: resilience_configured
  context:
    service_name: payment-service
    pattern: circuit_breaker
    decision: "5 failures in 60s triggers open state"
```
Metrics
```yaml
metrics:
  - name: service_design_decisions
    type: counter
    labels: [service, decision_type]
  - name: decomposition_services_count
    type: gauge
    labels: [domain]
  - name: resilience_patterns_applied
    type: counter
    labels: [service, pattern]
  - name: sla_target
    type: gauge
    labels: [service]
```
Troubleshooting
Common Issues
| Issue | Cause | Resolution |
|---|---|---|
| High latency | Cascading synchronous calls | Parallelize calls, add caching |
| Partial failures | No circuit breaker | Add resilience patterns |
| Data inconsistency | Distributed transactions | Saga pattern |
| Deployment failures | Tight service coupling | Versioned API contracts |
| Debugging difficulty | No tracing | Add distributed tracing |
| Cascading failures | No bulkhead | Thread-pool isolation |
Debug Checklist
□ Trace ID in all logs?
□ Circuit breakers monitored?
□ Timeouts on all calls?
□ Health checks passing?
□ Service mesh healthy?
□ Dependency graph documented?
□ SLOs defined and measured?
□ Alerting configured?
Unit Test Templates
Decomposition Tests
```python
# test_microservices_design.py
def test_valid_microservices_context():
    params = {
        "microservices_context": {
            "project_type": "monolith_extraction",
            "current_state": {
                "services": ["monolith"],
                "pain_points": ["slow deployments", "scaling issues"]
            },
            "requirements": {
                "team_size": 15,
                "deployment_frequency": "daily",
                "availability_sla": "99.9%",
                "max_latency_ms": 200
            }
        }
    }
    result = validate_parameters(params)
    assert result.valid

def test_small_team_warning():
    params = {
        "microservices_context": {
            "project_type": "greenfield",
            "requirements": {"team_size": 1}
        }
    }
    result = validate_parameters(params)
    assert len(result.warnings) > 0
    assert "overhead" in result.warnings[0]

def test_service_decomposition():
    monolith = {
        "domains": ["users", "orders", "payments", "inventory"],
        "team_size": 12
    }
    result = plan_decomposition(monolith)
    assert len(result.services) == 4
    assert result.services[0].responsibility != ""
    assert result.communication_pattern in ["sync", "async", "mixed"]
```
Resilience Pattern Tests
```python
def test_circuit_breaker_config():
    service = {"name": "payment-service", "sla": "99.9%"}
    config = generate_circuit_breaker_config(service)
    assert config.failure_rate_threshold == 50
    assert config.wait_duration_in_open_state == 60
    assert config.permitted_calls_in_half_open == 3

def test_timeout_hierarchy():
    services = {
        "gateway": {"timeout": 10000},
        "order": {"timeout": 8000},
        "payment": {"timeout": 5000},
        "db": {"timeout": 2000}
    }
    result = validate_timeout_hierarchy(services)
    assert result.valid  # outer timeouts exceed inner timeouts

def test_invalid_timeout_hierarchy():
    services = {
        "gateway": {"timeout": 5000},
        "order": {"timeout": 10000}  # child timeout exceeds parent
    }
    result = validate_timeout_hierarchy(services)
    assert not result.valid
    assert "hierarchy" in result.errors[0]

def test_bulkhead_sizing():
    service = {
        "name": "inventory-service",
        "expected_concurrency": 100,
        "dependency_latency_ms": 50
    }
    config = calculate_bulkhead_size(service)
    # Thread pool sized for expected load plus a buffer
    assert config.max_concurrent_calls >= 100
    assert config.max_wait_duration_ms <= 1000
```
SLA Calculation Tests
```python
def test_serial_availability():
    services = [0.999, 0.999, 0.999]  # three nines each
    result = calculate_serial_availability(services)
    assert abs(result - 0.997) < 0.001  # ~99.7%

def test_parallel_availability():
    replicas = [0.999, 0.999]  # two redundant replicas
    result = calculate_parallel_availability(replicas)
    assert abs(result - 0.999999) < 0.000001  # ~99.9999%

def test_sla_achievability():
    result = check_sla_achievable(
        target_sla="99.99%",
        service_count=5,
        per_service_availability=0.9999,
        has_redundancy=True
    )
    assert result.achievable
```
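The availability math behind these tests is simple composition: serial dependencies multiply their availabilities, while redundant replicas multiply their failure probabilities. A sketch of the two helpers the tests assume (names taken from the tests above; not a canonical implementation):

```python
import math

def calculate_serial_availability(availabilities):
    """A chain of dependencies is only up when every link is up."""
    return math.prod(availabilities)                       # 0.999**3 ~= 0.9970

def calculate_parallel_availability(availabilities):
    """Redundant replicas fail only when all of them fail at once."""
    return 1 - math.prod(1 - a for a in availabilities)    # 1 - 0.001**2 = 0.999999
```

Chaining the two gives a quick feasibility check for an SLA target: five serial services at 99.99% each compose to roughly 99.95%, which is why the 99.99% target in the last test needs redundancy.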
Version History
| Version | Date | Changes |
|---|---|---|
| 2.0.0 | 2025-01 | Production-grade rewrite with resilience patterns |
| 1.0.0 | 2024-12 | Initial release |