| name | microservices-architecture |
| description | Use when designing microservices, splitting monoliths, handling distributed data consistency, choosing communication patterns, or implementing service boundaries - covers domain-driven design, saga patterns, API gateways, service mesh |
Microservices Architecture
Overview
Microservices architecture specialist covering service boundaries, communication patterns, data consistency, and operational concerns.
Core principle: Microservices decompose applications into independently deployable services organized around business capabilities - enabling team autonomy and technology diversity at the cost of operational complexity and distributed system challenges.
When to Use This Skill
Use when encountering:
- Service boundaries: Defining service scope, applying domain-driven design
- Monolith decomposition: Strategies for splitting existing systems
- Data consistency: Sagas, event sourcing, eventual consistency patterns
- Communication: Sync (REST/gRPC) vs async (events/messages)
- API gateways: Routing, authentication, rate limiting
- Service discovery: Registry patterns, DNS, configuration
- Resilience: Circuit breakers, retries, timeouts, bulkheads
- Observability: Distributed tracing, logging aggregation, metrics
- Deployment: Containers, orchestration, blue-green deployments
Do NOT use for:
- Monolithic architectures (microservices aren't always better)
- Single-team projects with fewer than ~5 services (overhead exceeds benefits)
- Simple CRUD applications (microservices add unnecessary complexity)
When NOT to Use Microservices
Stay monolithic if:
- Team < 10 engineers
- Domain is not well understood yet
- Strong consistency required everywhere
- Network latency is critical
- You can't invest in observability/DevOps infrastructure
Microservices require: Mature DevOps, monitoring, distributed systems expertise, organizational support.
Service Boundary Patterns (Domain-Driven Design)
1. Bounded Contexts
Pattern: One microservice = One bounded context
❌ Too fine-grained (anemic services):
- UserService (just CRUD)
- OrderService (just CRUD)
- PaymentService (just CRUD)
✅ Business capability alignment:
- CustomerManagementService (user profiles, preferences, history)
- OrderFulfillmentService (order lifecycle, inventory, shipping)
- PaymentProcessingService (payment, billing, invoicing, refunds)
Identifying boundaries:
- Ubiquitous language - Different terms for same concept = different contexts
- Change patterns - Services that change together should stay together
- Team ownership - One team should own one service
- Data autonomy - Each service owns its data, no shared databases
2. Strategic DDD Patterns
| Pattern | Use When | Example |
|---|---|---|
| Separate Ways | Contexts are independent | Analytics service, main app service |
| Partnership | Teams must collaborate closely | Order + Inventory services |
| Customer-Supplier | Upstream/downstream relationship | Payment gateway (upstream) → Order service |
| Conformist | Accept upstream model as-is | Third-party API integration |
| Anti-Corruption Layer | Isolate from legacy/external systems | ACL between new microservices and legacy monolith |
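Of these, the Anti-Corruption Layer is the one most often expressed directly in code: a thin translator that keeps legacy field names and semantics out of the new service's domain model. A minimal sketch, assuming a hypothetical legacy CRM client and field names:
from dataclasses import dataclass

# Hypothetical legacy client: returns the monolith's own representation
class LegacyCrmClient:
    async def fetch_cust_rec(self, cust_no: str) -> dict:
        return {"CUST_NO": cust_no, "CUST_NM": "Ada", "STAT_CD": "A"}

# Clean model used inside the new bounded context
@dataclass
class Customer:
    id: str
    name: str
    active: bool

# Anti-corruption layer: translates legacy records into the domain model,
# so the rest of the service never sees CUST_NO / STAT_CD
class CustomerAcl:
    def __init__(self, legacy: LegacyCrmClient):
        self._legacy = legacy

    async def get_customer(self, customer_id: str) -> Customer:
        rec = await self._legacy.fetch_cust_rec(customer_id)
        return Customer(
            id=rec["CUST_NO"],
            name=rec["CUST_NM"],
            active=rec["STAT_CD"] == "A",
        )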
3. Service Sizing Guidelines
Too small (Nanoservices):
- Excessive network calls
- Distributed monolith
- Coordination overhead exceeds benefits
Too large (Minimonoliths):
- Multiple teams modifying same service
- Mixed deployment frequencies
- Tight coupling re-emerges
Right size indicators:
- Single team can own it
- Deployable independently
- Changes don't ripple to other services
- Clear business capability
- 100-10,000 LOC (highly variable)
Communication Patterns
Synchronous Communication
REST APIs:
# Order service calling Payment service
async def create_order(order: Order):
# Synchronous REST call
payment = await payment_service.charge(
amount=order.total,
customer_id=order.customer_id
)
if payment.status == "success":
order.status = "confirmed"
await db.save(order)
return order
else:
raise PaymentFailedException()
Pros: Simple, request-response, easy to debug
Cons: Tight coupling, availability dependency, latency cascades
gRPC:
# Proto definition (order.proto)
service OrderService {
  rpc CreateOrder (OrderRequest) returns (OrderResponse);
}
# Python implementation
class OrderServicer(order_pb2_grpc.OrderServiceServicer):
    async def CreateOrder(self, request, context):
        # Type-safe, efficient binary protocol
        payment = await payment_stub.Charge(
            PaymentRequest(amount=request.total)
        )
        if payment.status != "success":
            raise PaymentFailedException()
        order = Order(total=request.total)
        await db.save(order)
        return OrderResponse(order_id=order.id)
Pros: Type-safe, efficient, streaming support
Cons: HTTP/2 required, less human-readable, proto dependencies
Asynchronous Communication
Event-Driven (Pub/Sub):
# Order service publishes event
await event_bus.publish("order.created", {
"order_id": order.id,
"customer_id": customer.id,
"total": order.total
})
# Inventory service subscribes
@event_bus.subscribe("order.created")
async def reserve_inventory(event):
await inventory.reserve(event["order_id"])
await event_bus.publish("inventory.reserved", {...})
# Notification service subscribes
@event_bus.subscribe("order.created")
async def send_confirmation(event):
await email.send_order_confirmation(event)
Pros: Loose coupling, services independent, scalable
Cons: Eventual consistency, harder to trace, ordering challenges
Message Queues (Point-to-Point):
# Producer
await queue.send("payment-processing", {
"order_id": order.id,
"amount": order.total
})
# Consumer
@queue.consumer("payment-processing")
async def process_payment(message):
result = await payment_gateway.charge(message["amount"])
if result.success:
await message.ack()
else:
await message.nack(requeue=True)
Pros: Guaranteed delivery, work distribution, retry handling
Cons: Queue can become a bottleneck, requires a message broker
Communication Pattern Decision Matrix
| Scenario | Pattern | Why |
|---|---|---|
| User-facing request/response | Sync (REST/gRPC) | Low latency, immediate feedback |
| Background processing | Async (queue) | Don't block user, retry support |
| Cross-service notifications | Async (pub/sub) | Loose coupling, multiple consumers |
| Real-time updates | WebSocket/SSE | Bidirectional, streaming |
| Data replication | Event sourcing | Audit trail, rebuild state |
| High throughput | Async (messaging) | Buffer spikes, backpressure |
Data Consistency Patterns
1. Saga Pattern (Distributed Transactions)
Choreography (Event-Driven):
# Order Service
async def create_order(order):
order.status = "pending"
await db.save(order)
await events.publish("order.created", order)
# Payment Service
@events.subscribe("order.created")
async def handle_order(event):
try:
await charge_customer(event["total"])
await events.publish("payment.completed", event)
except PaymentError:
await events.publish("payment.failed", event)
# Inventory Service
@events.subscribe("payment.completed")
async def reserve_items(event):
try:
await reserve(event["items"])
await events.publish("inventory.reserved", event)
except InventoryError:
await events.publish("inventory.failed", event)
# Order Service (Compensation)
@events.subscribe("payment.failed")
async def cancel_order(event):
order = await db.get(event["order_id"])
order.status = "cancelled"
await db.save(order)
@events.subscribe("inventory.failed")
async def refund_payment(event):
await payment.refund(event["order_id"])
await cancel_order(event)
Orchestration (Coordinator):
class OrderSaga:
def __init__(self, order):
self.order = order
self.completed_steps = []
async def execute(self):
try:
# Step 1: Reserve inventory
await self.reserve_inventory()
self.completed_steps.append("inventory")
# Step 2: Process payment
await self.process_payment()
self.completed_steps.append("payment")
# Step 3: Confirm order
await self.confirm_order()
except Exception as e:
# Compensate in reverse order
await self.compensate()
raise
async def compensate(self):
for step in reversed(self.completed_steps):
if step == "inventory":
await inventory_service.release(self.order.id)
elif step == "payment":
await payment_service.refund(self.order.id)
Choreography vs Orchestration:
| Aspect | Choreography | Orchestration |
|---|---|---|
| Coordination | Decentralized (events) | Centralized (orchestrator) |
| Coupling | Loose | Tight to orchestrator |
| Complexity | Distributed across services | Concentrated in orchestrator |
| Tracing | Harder (follow events) | Easier (single coordinator) |
| Failure handling | Implicit (event handlers) | Explicit (orchestrator logic) |
| Best for | Simple workflows | Complex workflows |
2. Event Sourcing
Pattern: Store events, not state
# Traditional approach (storing state)
class Order:
id: int
status: str # "pending" → "confirmed" → "shipped"
total: float
# Event sourcing (storing events)
class OrderCreated(Event):
order_id: int
total: float
class OrderConfirmed(Event):
order_id: int
class OrderShipped(Event):
order_id: int
# Rebuild state from events
def rebuild_order(order_id):
events = event_store.get_events(order_id)
order = Order()
for event in events:
order.apply(event) # Apply each event to rebuild state
return order
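The rebuild loop above assumes the Order aggregate knows how to apply each event. A minimal sketch of that method, dispatching on the event classes defined above:
class Order:
    def __init__(self):
        self.id = None
        self.status = None
        self.total = 0.0

    def apply(self, event):
        # Each event type mutates state; current state is never stored directly
        if isinstance(event, OrderCreated):
            self.id = event.order_id
            self.total = event.total
            self.status = "pending"
        elif isinstance(event, OrderConfirmed):
            self.status = "confirmed"
        elif isinstance(event, OrderShipped):
            self.status = "shipped"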
Pros: Complete audit trail, time travel, event replay
Cons: Complexity, eventual consistency, schema evolution challenges
3. CQRS (Command Query Responsibility Segregation)
Separate read and write models:
# Write model (commands)
class CreateOrder:
    async def execute(self, data):
order = Order(**data)
await db.save(order)
await event_bus.publish("order.created", order)
# Read model (projections)
class OrderReadModel:
# Denormalized for fast reads
def __init__(self):
self.cache = {}
@event_bus.subscribe("order.created")
async def on_order_created(self, event):
self.cache[event["order_id"]] = {
"id": event["order_id"],
"customer_name": await get_customer_name(event["customer_id"]),
"status": "pending",
"total": event["total"]
}
def get_order(self, order_id):
return self.cache.get(order_id) # Fast read, no joins
Use when: Read/write patterns differ significantly (e.g., analytics dashboards)
Resilience Patterns
1. Circuit Breaker
from circuitbreaker import circuit
@circuit(failure_threshold=5, recovery_timeout=60)
async def call_payment_service(amount):
response = await http.post("http://payment-service/charge", json={"amount": amount})
if response.status >= 500:
raise PaymentServiceError()
return response.json()
# Circuit states:
# CLOSED → normal operation
# OPEN → fails fast after threshold
# HALF_OPEN → test if service recovered
2. Retry with Exponential Backoff
from tenacity import retry, stop_after_attempt, wait_exponential
@retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=2, max=10)
)
async def call_with_retry(url):
return await http.get(url)
# Waits between the 3 attempts: 2s, then 4s (exponential, capped at 10s)
3. Timeout
import asyncio
async def call_with_timeout(url):
try:
return await asyncio.wait_for(
http.get(url),
timeout=5.0 # 5 second timeout
)
except asyncio.TimeoutError:
return {"error": "Service timeout"}
4. Bulkhead
Isolate resources to prevent cascade failures:
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Separate thread pools for different services
payment_pool = ThreadPoolExecutor(max_workers=10)
inventory_pool = ThreadPoolExecutor(max_workers=5)
async def call_payment():
return await asyncio.get_event_loop().run_in_executor(
payment_pool,
payment_service.call
)
# If payment service is slow, it only exhausts payment_pool,
# inventory calls still work
API Gateway Pattern
Centralized entry point for client requests:
Client → API Gateway → [Order, Payment, Inventory services]
Responsibilities:
- Routing requests to services
- Authentication/authorization
- Rate limiting
- Request/response transformation
- Caching
- Logging/monitoring
Example (Kong, AWS API Gateway, Nginx):
# API Gateway config
routes:
- path: /orders
service: order-service
auth: jwt
ratelimit: 100/minute
- path: /payments
service: payment-service
auth: oauth2
ratelimit: 50/minute
Backend for Frontend (BFF) Pattern:
Web Client → Web BFF → Services
Mobile App → Mobile BFF → Services
Each client type gets its own optimized gateway.
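A BFF mostly aggregates and reshapes: it fans out to several services and returns one client-shaped payload. A sketch of a mobile BFF endpoint, assuming FastAPI and the same pseudo service clients used elsewhere in this document:
import asyncio
from fastapi import FastAPI

app = FastAPI()

@app.get("/mobile/orders/{order_id}")
async def order_summary(order_id: str):
    # order_service / shipping_service: hypothetical async clients
    # Fan out to backend services in parallel
    order, shipment = await asyncio.gather(
        order_service.get(order_id),
        shipping_service.status(order_id),
    )
    # Mobile clients get a trimmed payload; a web BFF might return more fields
    return {"id": order["id"], "status": order["status"], "eta": shipment["eta"]}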
Service Discovery
1. Client-Side Discovery
# Service registry (Consul, Eureka)
registry = ServiceRegistry("http://consul:8500")
# Client looks up service
instances = registry.get_instances("payment-service")
instance = load_balancer.choose(instances)
response = await http.get(f"http://{instance.host}:{instance.port}/charge")
2. Server-Side Discovery (Load Balancer)
Client → Load Balancer → [Service Instance 1, Instance 2, Instance 3]
DNS-based: Kubernetes services, AWS ELB
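With server-side discovery the client stays trivial: in Kubernetes, for example, the service name is a stable DNS entry and the platform load-balances behind it (a sketch, service name assumed; http is the same pseudo client used throughout):
# No registry lookup in the client: "payment-service" is a stable DNS name;
# kube-proxy (or the load balancer) resolves it across healthy instances
response = await http.get("http://payment-service/charge")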
Observability
Distributed Tracing
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)

async def create_order(order):
    with tracer.start_as_current_span("create-order") as span:
        span.set_attribute("order.id", order.id)
        span.set_attribute("order.total", order.total)
        # Propagate the trace to the payment service via W3C trace headers
        headers = {}
        inject(headers)  # writes traceparent from the current span context
        payment = await payment_service.charge(
            amount=order.total,
            headers=headers
        )
        span.add_event("payment-completed")
        return order
Tools: Jaeger, Zipkin, AWS X-Ray, Datadog APM
Log Aggregation
Structured logging with correlation IDs:
import logging
import uuid
logger = logging.getLogger(__name__)
async def handle_request(request):
correlation_id = request.headers.get("X-Correlation-ID") or str(uuid.uuid4())
logger.info("Processing request", extra={
"correlation_id": correlation_id,
"service": "order-service",
"user_id": request.user_id
})
Tools: ELK stack (Elasticsearch, Logstash, Kibana), Splunk, Datadog
Monolith Decomposition Strategies
1. Strangler Fig Pattern
Gradually replace monolith with microservices:
Phase 1: Monolith handles everything
Phase 2: Extract a service, proxy some requests to it (see the routing sketch below)
Phase 3: More services extracted, proxy more requests
Phase 4: Monolith retired
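The proxy in front of the monolith can be as simple as a prefix-routing table. A minimal sketch, assuming a hypothetical http.forward helper and the paths shown:
# Strangler proxy: extracted paths go to new services,
# everything else still falls through to the monolith
ROUTES = {
    "/payments": "http://payment-service",  # extracted in phase 2
    "/orders": "http://order-service",      # extracted in phase 3
}

async def route(request):
    # http.forward: hypothetical helper that re-issues the request to target
    for prefix, target in ROUTES.items():
        if request.path.startswith(prefix):
            return await http.forward(request, target)
    return await http.forward(request, "http://monolith")  # default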
2. Branch by Abstraction
- Create abstraction layer in monolith (sketched after this list)
- Implement new service
- Gradually migrate code behind abstraction
- Remove old implementation
- Extract as microservice
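A minimal sketch of the first four steps, with the abstraction, both implementations, and a feature flag (all names are assumptions for illustration):
# Step 1: abstraction layer inside the monolith
class InvoiceGenerator:
    async def generate(self, order): ...

# Old code path, still in-process (monolith_billing: hypothetical module)
class LegacyInvoiceGenerator(InvoiceGenerator):
    async def generate(self, order):
        return monolith_billing.render_invoice(order)

# Step 2: new implementation backed by the extracted service
class InvoiceServiceClient(InvoiceGenerator):
    async def generate(self, order):
        return await http.post("http://invoice-service/generate", json=order)

# Steps 3-4: migrate callers gradually behind a feature flag,
# then delete the legacy implementation once traffic is fully shifted
def invoice_generator() -> InvoiceGenerator:
    if feature_flags.enabled("use-invoice-service"):  # hypothetical flag store
        return InvoiceServiceClient()
    return LegacyInvoiceGenerator()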
3. Extract by Bounded Context
Priority order:
- Services with clear boundaries (authentication, payments)
- Services changing frequently
- Services with different scaling needs
- Services with technology mismatches (e.g., Java monolith, Python ML service)
Anti-Patterns
| Anti-Pattern | Why Bad | Fix |
|---|---|---|
| Distributed Monolith | Services share database, deploy together | One DB per service, independent deployment |
| Nanoservices | Too fine-grained, excessive network calls | Merge related services, follow DDD |
| Shared Database | Tight coupling, schema changes break multiple services | Database per service |
| Synchronous Chains | A→B→C→D, latency adds up, cascading failures | Async events, parallelize where possible |
| Chatty Services | N+1 calls, excessive network overhead | Batch APIs (see sketch below), caching, coarser boundaries |
| No Circuit Breakers | Cascading failures bring down system | Circuit breakers + timeouts + retries |
| No Distributed Tracing | Impossible to debug cross-service issues | OpenTelemetry, correlation IDs |
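As a concrete example of the chatty-services fix, a batch endpoint collapses N+1 calls into one round trip (endpoint shape and catalog client are assumptions):
# Chatty: N+1 pattern, one network round trip per item
# (catalog: hypothetical async HTTP client)
items = [await catalog.get(f"/items/{i}") for i in order.item_ids]

# Batch API: a single round trip fetches all items at once
items = await catalog.get("/items", params={"ids": ",".join(order.item_ids)})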
Cross-References
Related skills:
- Message queues → message-queues (RabbitMQ, Kafka patterns)
- REST APIs → rest-api-design (service interface design)
- gRPC → Check if gRPC skill exists
- Security → ordis-security-architect (service-to-service auth, zero trust)
- Database → database-integration (per-service databases, migrations)
- Deployment → backend-deployment (Docker, Kubernetes, CI/CD)
- Testing → api-testing (contract testing, integration testing)
Further Reading
- Building Microservices by Sam Newman
- Domain-Driven Design by Eric Evans
- Release It! by Michael Nygard (resilience patterns)
- Microservices Patterns by Chris Richardson