---
name: architecture-design
description: Use when designing system architecture, making high-level technical decisions, or planning major system changes. Focuses on structure, patterns, and long-term strategy.
allowed-tools: Read, Grep, Glob, Bash
---

Architecture Design Skill

Design system architecture and make strategic technical decisions.

Core Principle

Good architecture enables change while maintaining simplicity.

Architecture vs Planning

Architecture Design (this skill):

Strategic: "How should the system be structured?"
Component interactions and boundaries
Technology and pattern choices
Long-term implications
System-level decisions

Technical Planning (technical-planning skill):

Tactical: "How do I implement feature X?"
Specific implementation tasks
Execution details
Short-term focus

Use architecture when:

Designing new systems or subsystems
Major refactors affecting multiple components
Technology selection decisions
Defining system boundaries and interfaces
Making decisions with long-term impact

Use planning when:

Implementing within existing architecture
Breaking down specific features
Task sequencing and execution

Architecture Process

1. Understand Context

Business context:

What problem are we solving?
Who are the users?
What are the business goals?
What are the success metrics?

Technical context:

What exists today?
What constraints exist?
What must we integrate with?
What scale must we support?

Team context:

What's our expertise?
What can we maintain?
What's our velocity?

2. Gather Requirements

Functional requirements:

What must the system do?
What are the features?
What are the user scenarios?

Non-functional requirements:

Performance: Response time, throughput
Scalability: Expected load, growth
Availability: Uptime requirements
Security: Compliance, data protection
Maintainability: Team size, skills
Cost: Budget constraints

Example:

## Requirements

### Functional
- Users can search products by name/category
- Users can add items to cart
- Users can checkout and pay

### Non-Functional
- Search response time < 200ms (p95)
- Support 10,000 concurrent users
- 99.9% uptime
- PCI DSS compliant for payments
- Team of 5 developers can maintain
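
Non-functional targets like the ones above are easier to verify once they are expressed as numbers. The following is a minimal sketch, assuming latency samples in milliseconds and a nearest-rank percentile; the function names `percentileNearestRank` and `allowedDowntimeMinutes` and the sample data are illustrative assumptions, not part of any library or of a real measurement.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// `pct` percent of the samples are less than or equal to it.
function percentileNearestRank(samplesMs: number[], pct: number): number {
  if (samplesMs.length === 0 || pct <= 0 || pct > 100) {
    throw new Error("Need at least one sample and 0 < pct <= 100");
  }
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  const rank = Math.ceil((pct * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// Downtime budget, in minutes, implied by an availability target.
function allowedDowntimeMinutes(availabilityPct: number, periodDays: number): number {
  const totalMinutes = periodDays * 24 * 60;
  return (totalMinutes * (100 - availabilityPct)) / 100;
}

// Hypothetical search latencies in milliseconds (not real measurements).
const searchLatenciesMs = [
  120, 95, 180, 150, 110, 90, 210, 130, 140, 100,
  160, 170, 105, 115, 125, 135, 145, 155, 165, 190,
];
const searchP95Ms = percentileNearestRank(searchLatenciesMs, 95);
console.log(`search p95 = ${searchP95Ms}ms, under 200ms: ${searchP95Ms < 200}`);
console.log(
  `99.9% uptime allows ${allowedDowntimeMinutes(99.9, 365).toFixed(1)} minutes of downtime per year`
);
```

With these sample values the p95 is 190 ms, and a 99.9% target leaves roughly 525.6 minutes (under nine hours) of downtime per year.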

3. Identify Constraints

Technical constraints:

Must use existing authentication system
Must integrate with legacy inventory system
Database must be PostgreSQL (existing infrastructure)

Business constraints:

Must launch in 3 months
Budget of $50k for infrastructure
Must support EU data residency

Team constraints:

Team experienced in Python, less in Go
No DevOps specialist on team
Remote team across timezones

4. Consider Alternatives

Never design in a vacuum - consider options:

Example: Data storage choice

Option 1: PostgreSQL

Pros: Team knows it, ACID guarantees, rich query support
Cons: Vertical scaling limits, setup complexity

Option 2: MongoDB

Pros: Flexible schema, horizontal scaling
Cons: Team unfamiliar, eventual consistency when reading from secondaries

Option 3: DynamoDB

Pros: Fully managed, auto-scaling
Cons: Vendor lock-in, query limitations, cost at scale

Decision: PostgreSQL

Team expertise outweighs scaling concerns
Can re-evaluate if scale becomes an issue
Faster initial development
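
One way to make a comparison like this explicit is a weighted decision matrix. The sketch below is only an illustration: the criteria, weights, and 1-5 scores are hypothetical values chosen to mirror the reasoning above, and a real team would supply its own.

```typescript
// Weighted decision matrix. All weights and scores are hypothetical
// illustrations, not measurements.
type Criterion = "teamExpertise" | "queryPower" | "horizontalScaling" | "opsSimplicity";

const criterionWeights: Record<Criterion, number> = {
  teamExpertise: 0.4,
  queryPower: 0.3,
  horizontalScaling: 0.2,
  opsSimplicity: 0.1,
};

// Scores from 1 (poor) to 5 (strong) per option and criterion.
const optionScores: Record<string, Record<Criterion, number>> = {
  PostgreSQL: { teamExpertise: 5, queryPower: 5, horizontalScaling: 2, opsSimplicity: 3 },
  MongoDB: { teamExpertise: 2, queryPower: 3, horizontalScaling: 4, opsSimplicity: 3 },
  DynamoDB: { teamExpertise: 2, queryPower: 2, horizontalScaling: 5, opsSimplicity: 5 },
};

// Sum of weight * score over all criteria.
function weightedScore(scores: Record<Criterion, number>): number {
  let total = 0;
  for (const criterion of Object.keys(criterionWeights) as Criterion[]) {
    total += criterionWeights[criterion] * scores[criterion];
  }
  return total;
}

// Options sorted from highest to lowest weighted score.
function rankOptions(): { option: string; score: number }[] {
  return Object.keys(optionScores)
    .map((option) => ({ option, score: weightedScore(optionScores[option]) }))
    .sort((a, b) => b.score - a.score);
}

const ranking = rankOptions();
for (const entry of ranking) {
  console.log(entry.option, entry.score.toFixed(2));
}
```

Under these assumed weights PostgreSQL ranks first. The value of the exercise is less the final number than the fact that the weights are written down and can be challenged.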

5. Design System Structure

Define components and their responsibilities:

┌─────────────────────────────────────────────┐
│             Client Apps                      │
│  (Web, iOS, Android)                         │
└────────────────┬────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────┐
│          API Gateway / Load Balancer         │
└────────────────┬────────────────────────────┘
                 │
        ┌────────┴────────┐
        ▼                 ▼
┌───────────────┐  ┌───────────────┐
│   Auth        │  │   Core API     │
│   Service     │  │   Service      │
└───────┬───────┘  └───────┬───────┘
        │                  │
        │         ┌────────┴────────┐
        │         ▼                 ▼
        │  ┌──────────────┐  ┌──────────────┐
        │  │  PostgreSQL  │  │   Redis      │
        │  │  (Primary)   │  │   (Cache)    │
        │  └──────────────┘  └──────────────┘
        │
        ▼
┌───────────────┐
│   User DB     │
└───────────────┘

Component descriptions:

## Components

### API Gateway
**Responsibility:** Route requests, rate limiting, authentication
**Technology:** Nginx
**Dependencies:** Auth Service, Core API Service
**Scale:** 2-3 instances behind load balancer

### Auth Service
**Responsibility:** User authentication, session management, JWT issuing
**Technology:** Python (Flask), PostgreSQL
**API:** REST
**Scale:** Stateless, 2-N instances

### Core API Service
**Responsibility:** Business logic, data access, external integrations
**Technology:** Python (FastAPI), PostgreSQL, Redis
**API:** REST
**Scale:** Stateless, 2-N instances

### PostgreSQL
**Responsibility:** Primary data store
**Scale:** Primary with read replica

### Redis
**Responsibility:** Session storage, caching, rate limiting
**Scale:** Cluster mode (3 nodes)
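
Component descriptions can also be kept as data and checked mechanically. This is a minimal sketch under stated assumptions: the `ComponentSpec` shape and `findCatalogProblems` are hypothetical names, and the dependency lists are inferred from the Dependencies and Technology lines above.

```typescript
// The `ComponentSpec` shape and the dependency lists are assumptions
// made for this sketch, inferred from the descriptions above.
interface ComponentSpec {
  responsibility: string;
  dependsOn: string[];
}

const componentCatalog: Record<string, ComponentSpec> = {
  "API Gateway": {
    responsibility: "Route requests, rate limiting, authentication",
    dependsOn: ["Auth Service", "Core API Service"],
  },
  "Auth Service": { responsibility: "Authentication, sessions, JWT issuing", dependsOn: ["PostgreSQL"] },
  "Core API Service": { responsibility: "Business logic, data access", dependsOn: ["PostgreSQL", "Redis"] },
  "PostgreSQL": { responsibility: "Primary data store", dependsOn: [] },
  "Redis": { responsibility: "Session storage, caching, rate limiting", dependsOn: [] },
};

// Returns a list of problems: undeclared dependencies and dependency cycles.
function findCatalogProblems(catalog: Record<string, ComponentSpec>): string[] {
  const problems: string[] = [];
  const visitState: Record<string, "visiting" | "done"> = {};
  const isDeclared = (componentName: string): boolean =>
    Object.prototype.hasOwnProperty.call(catalog, componentName);

  // Depth-first walk; meeting a component that is still "visiting" means a cycle.
  function visit(componentName: string, path: string[]): void {
    if (visitState[componentName] === "done") return;
    if (visitState[componentName] === "visiting") {
      problems.push(`Cycle: ${path.concat(componentName).join(" -> ")}`);
      return;
    }
    visitState[componentName] = "visiting";
    for (const dependency of catalog[componentName].dependsOn) {
      if (!isDeclared(dependency)) {
        problems.push(`${componentName} depends on undeclared component ${dependency}`);
      } else {
        visit(dependency, path.concat(componentName));
      }
    }
    visitState[componentName] = "done";
  }

  for (const componentName of Object.keys(catalog)) {
    visit(componentName, []);
  }
  return problems;
}

console.log(findCatalogProblems(componentCatalog));
```

A check like this could run in CI so that a new component cannot silently introduce a circular dependency.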

6. Define Interfaces

API contracts:

## API Design

### POST /api/auth/login
**Purpose:** Authenticate user, issue JWT

**Request:**
```json
{
  "email": "user@example.com",
  "password": "secure_password"
}
```

**Response (200):**
```json
{
  "token": "eyJ...",
  "user": {
    "id": "123",
    "email": "user@example.com",
    "name": "John Doe"
  }
}
```

**Errors:**
- 400: Invalid request
- 401: Invalid credentials
- 429: Rate limit exceeded

7. Plan for Failure

What can go wrong?

Database unavailable
External API down
Network partition
High load
Data corruption

Mitigation strategies:

Retry with exponential backoff
Circuit breakers for external services
Graceful degradation
Health checks and monitoring
Database backups

Example:

## Failure Scenarios

### Database Unavailable
**Impact:** Cannot read/write data
**Mitigation:**
- Read replica failover (automated)
- Circuit breaker after 3 failures
- Cache serves stale data for 5 minutes
- User sees degraded experience message
**Recovery:** Verify the promoted replica, then repair the old primary and rejoin it as a replica

### External Payment API Down
**Impact:** Cannot process payments
**Mitigation:**
- Retry 3 times with exponential backoff
- Queue payments for later processing
- User notified of delay
- Alert on-call engineer
**Recovery:** Process queued payments once API recovers
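
Two of the mitigations named in these scenarios, exponential backoff and a circuit breaker that opens after 3 failures, can be sketched in a few lines. This is a simplified, synchronous illustration under stated assumptions: `backoffDelayMs` and `SimpleCircuitBreaker` are hypothetical names, the base delay and cap are invented values, and a production system would more likely use an established resilience library.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt: number, baseMs: number, capMs: number): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Minimal synchronous circuit breaker: opens after `failureThreshold`
// consecutive failures and short-circuits calls until `resetAfterMs` has passed.
class SimpleCircuitBreaker {
  private consecutiveFailures = 0;
  private openedAtMs: number | null = null;

  constructor(
    private readonly failureThreshold: number,
    private readonly resetAfterMs: number,
    private readonly nowMs: () => number = () => Date.now()
  ) {}

  isOpen(): boolean {
    return this.openedAtMs !== null;
  }

  call<T>(operation: () => T, fallback: () => T): T {
    if (this.openedAtMs !== null) {
      if (this.nowMs() - this.openedAtMs < this.resetAfterMs) {
        return fallback(); // Fail fast while open, e.g. serve stale cache
      }
      // Reset window elapsed: let one trial call through (half-open).
      // A single failure of the trial call re-opens the breaker.
      this.openedAtMs = null;
      this.consecutiveFailures = this.failureThreshold - 1;
    }
    try {
      const result = operation();
      this.consecutiveFailures = 0;
      return result;
    } catch (error) {
      this.consecutiveFailures += 1;
      if (this.consecutiveFailures >= this.failureThreshold) {
        this.openedAtMs = this.nowMs();
      }
      return fallback();
    }
  }
}

// Delays for three payment retries, assuming a 200ms base and a 5s cap.
const paymentRetryDelaysMs = [0, 1, 2].map((attempt) => backoffDelayMs(attempt, 200, 5000));
console.log(paymentRetryDelaysMs);
```

The clock is injected through `nowMs` so the reset window can be tested without waiting. Real retry loops usually add random jitter to the delay so that many clients do not retry at the same instant.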

8. Document Decisions

Architecture Decision Record (ADR):

# ADR-001: Use PostgreSQL for Primary Database

**Status:** Accepted
**Date:** 2024-01-15
**Deciders:** Tech Lead, Backend Team

## Context

We need to choose a primary database for user data, products, and orders.

Requirements:
- Strong consistency (ACID)
- Complex queries (joins, aggregations)
- < 200ms query time for 90% of queries
- Support 100k users initially

## Decision

Use PostgreSQL as primary database.

## Alternatives Considered

### MongoDB
- **Pros:** Flexible schema, horizontal scaling
- **Cons:** Team unfamiliar, eventual consistency on secondary reads
- **Why not:** Team expertise more valuable than flexibility

### DynamoDB
- **Pros:** Managed service, auto-scaling
- **Cons:** Vendor lock-in, limited query capability, cost
- **Why not:** Query limitations would hurt development velocity

### MySQL
- **Pros:** Similar to PostgreSQL, team knows it
- **Cons:** Less feature-rich than PostgreSQL
- **Why not:** PostgreSQL offers richer JSON support (JSONB with indexing) and better full-text search

## Consequences

**Positive:**
- Team can be productive immediately
- Strong consistency guarantees
- Rich query capabilities
- JSON support for flexible data

**Negative:**
- Vertical scaling limits (mitigated for reads: can add read replicas; writes still scale vertically)
- More complex than managed services (mitigated: use RDS)
- Higher operational overhead

**Trade-offs:**
- Chose familiarity over horizontal scaling
- Chose rich queries and strong consistency over schema flexibility
- Can re-evaluate if scale requirements change

## Validation

- Team confirmed expertise in PostgreSQL
- Load testing shows it meets performance requirements
- Cost analysis shows it is acceptable for the first year

Architecture Principles

1. Simplicity

Start simple, add complexity only when needed.

❌ BAD: Microservices from day 1 with 20 services
✅ GOOD: Start with monolith, split when needed

Apply YAGNI: You Aren't Gonna Need It

Don't build for a hypothetical future
Add features when they are actually needed
Simpler systems are easier to maintain

2. Separation of Concerns

Each component has one clear responsibility.

✅ GOOD:
- Auth Service: Authentication only
- User Service: User profile management
- Order Service: Order processing

❌ BAD:
- God Service: Does everything

Apply SOLID principles:

Single Responsibility
Open/Closed
Liskov Substitution
Interface Segregation
Dependency Inversion

3. Loose Coupling

Components depend on interfaces, not implementations.

// ❌ BAD: Tight coupling
class OrderService {
  constructor(private db: PostgresDatabase) {}
}

// ✅ GOOD: Loose coupling
class OrderService {
  constructor(private db: Database) {}  // Interface
}

Benefits:

Easier to test (mock interface)
Easier to swap implementations
Components can evolve independently

4. High Cohesion

Related functionality stays together.

✅ GOOD:
user/
  - create_user.ts
  - update_user.ts
  - delete_user.ts
  - user_repository.ts

❌ BAD:
create/
  - create_user.ts
  - create_order.ts
update/
  - update_user.ts
  - update_order.ts

5. Explicit Over Implicit

Make dependencies and contracts clear.

// ❌ BAD: Implicit dependency
function processOrder(orderId: string) {
  const db = global.database  // Where does this come from?
  // ...
}

// ✅ GOOD: Explicit dependency
function processOrder(
  orderId: string,
  db: Database,
  logger: Logger
) {
  // Dependencies are clear
}

6. Fail Fast

Detect and report errors early.

// ❌ BAD: Silent failure
function divide(a: number, b: number) {
  if (b === 0) return 0  // Wrong!
  return a / b
}

// ✅ GOOD: Fail fast
function divide(a: number, b: number) {
  if (b === 0) {
    throw new Error('Division by zero')
  }
  return a / b
}
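
The same idea applies at the system level: validate configuration once at startup rather than discovering a bad value on the first request. A minimal sketch, assuming settings arrive as string key/value pairs; `ServiceSettings`, `loadSettings`, and the `DATABASE_URL` and `PORT` keys are hypothetical names chosen for illustration.

```typescript
// Hypothetical settings shape for this sketch.
interface ServiceSettings {
  databaseUrl: string;
  port: number;
}

// Validate raw key/value settings once at startup and throw on the
// first problem, instead of failing later on the first request.
function loadSettings(raw: Record<string, string | undefined>): ServiceSettings {
  const databaseUrl = raw["DATABASE_URL"];
  if (!databaseUrl) {
    throw new Error("DATABASE_URL is required");
  }
  const portText: string = raw["PORT"] || "8080"; // Assumed default port
  const port = Number(portText);
  // Math.floor(port) !== port also rejects NaN, since NaN !== NaN.
  if (Math.floor(port) !== port || port < 1 || port > 65535) {
    throw new Error(`PORT must be an integer in 1-65535, got "${portText}"`);
  }
  return { databaseUrl, port };
}

const exampleSettings = loadSettings({
  DATABASE_URL: "postgres://localhost/app",
  PORT: "5000",
});
console.log(exampleSettings);
```

Calling `loadSettings` as the first step of startup means a misconfigured deployment stops immediately with a clear message.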

7. Design for Testability

Make it easy to test.

// ❌ BAD: Hard to test
class OrderService {
  processOrder(orderId: string) {
    const db = new PostgresDatabase()  // Can't mock
    const api = new PaymentAPI()       // Can't mock
    // ...
  }
}

// ✅ GOOD: Easy to test
class OrderService {
  constructor(
    private db: Database,      // Can inject mock
    private api: PaymentAPI    // Can inject mock
  ) {}

  processOrder(orderId: string) {
    // ...
  }
}
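
To show what constructor injection buys in practice, here is a sketch of a test that uses in-memory fakes. The interfaces `OrderDatabase` and `PaymentGateway` stand in for the `Database` and `PaymentAPI` types above; their methods, and the class name `TestableOrderService`, are assumptions made for this example.

```typescript
// Hypothetical interfaces standing in for the `Database` and
// `PaymentAPI` types used above; the methods are assumptions.
interface OrderDatabase {
  getOrderTotalCents(orderId: string): number;
  markPaid(orderId: string): void;
}

interface PaymentGateway {
  charge(orderId: string, amountCents: number): boolean;
}

class TestableOrderService {
  constructor(private db: OrderDatabase, private payments: PaymentGateway) {}

  // Returns true when the order was charged and marked as paid.
  processOrder(orderId: string): boolean {
    const totalCents = this.db.getOrderTotalCents(orderId);
    if (!this.payments.charge(orderId, totalCents)) {
      return false;
    }
    this.db.markPaid(orderId);
    return true;
  }
}

// In-memory fakes: no real database or payment provider is needed.
class FakeOrderDatabase implements OrderDatabase {
  paidOrderIds: string[] = [];
  getOrderTotalCents(_orderId: string): number {
    return 2500; // Every order costs 25.00 in this fake
  }
  markPaid(orderId: string): void {
    this.paidOrderIds.push(orderId);
  }
}

class FakePaymentGateway implements PaymentGateway {
  chargedAmountsCents: number[] = [];
  constructor(private shouldSucceed: boolean) {}
  charge(_orderId: string, amountCents: number): boolean {
    this.chargedAmountsCents.push(amountCents);
    return this.shouldSucceed;
  }
}

const fakeDb = new FakeOrderDatabase();
const orderService = new TestableOrderService(fakeDb, new FakePaymentGateway(true));
console.log(orderService.processOrder("order-1"), fakeDb.paidOrderIds);
```

Because the fakes record what was called, a test can assert both the return value and the side effects, including the failure path where the charge is declined.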

Common Architecture Patterns

Layered Architecture

┌─────────────────────┐
│  Presentation       │ (UI, API controllers)
├─────────────────────┤
│  Business Logic     │ (Domain, services)
├─────────────────────┤
│  Data Access        │ (Repositories, ORMs)
├─────────────────────┤
│  Database           │ (Storage)
└─────────────────────┘

When to use: Systems of simple to moderate complexity
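
The layers in the diagram can be sketched as three small classes in which each layer only calls the one directly below it. All class names and the product data are hypothetical, and an in-memory array stands in for the database.

```typescript
interface ProductRecord {
  id: string;
  title: string;
  priceCents: number;
}

// Data access layer: the only layer that knows how products are stored.
class ProductRepository {
  private rows: ProductRecord[] = [
    { id: "p1", title: "Keyboard", priceCents: 4999 },
    { id: "p2", title: "Mouse", priceCents: 1999 },
  ];

  findByTitleFragment(fragment: string): ProductRecord[] {
    const needle = fragment.toLowerCase();
    return this.rows.filter((row) => row.title.toLowerCase().indexOf(needle) !== -1);
  }
}

// Business logic layer: rules live here, with no HTTP or storage details.
class ProductSearchService {
  constructor(private repository: ProductRepository) {}

  search(query: string): ProductRecord[] {
    const trimmed = query.trim();
    if (trimmed.length < 2) {
      throw new Error("Query must be at least 2 characters");
    }
    return this.repository.findByTitleFragment(trimmed);
  }
}

// Presentation layer: turns a request into a service call and
// shapes the result for the client.
class ProductSearchController {
  constructor(private service: ProductSearchService) {}

  handleSearch(query: string): { status: number; body: string } {
    try {
      const products = this.service.search(query);
      return { status: 200, body: JSON.stringify(products.map((p) => p.title)) };
    } catch (error) {
      return { status: 400, body: JSON.stringify({ error: "Invalid query" }) };
    }
  }
}

const searchController = new ProductSearchController(
  new ProductSearchService(new ProductRepository())
);
console.log(searchController.handleSearch("key"));
```

The controller never touches the repository directly, which is the property that keeps a layered design easy to reason about.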

Hexagonal Architecture (Ports & Adapters)

        ┌───────────────────────┐
        │   External Systems    │
        │  (UI, DB, APIs)       │
        └──────────┬────────────┘
                   │
        ┌──────────▼────────────┐
        │      Adapters         │ (Implementation)
        │  (REST, PostgreSQL)   │
        └──────────┬────────────┘
                   │
        ┌──────────▼────────────┐
        │       Ports           │ (Interfaces)
        │  (IUserRepo, IAuth)   │
        └──────────┬────────────┘
                   │
        ┌──────────▼────────────┐
        │    Core Domain        │ (Business logic)
        │    (Pure logic)       │
        └───────────────────────┘

When to use: You want to isolate business logic from infrastructure, or need to support multiple frontends
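
A minimal sketch of the port and adapter relationship follows. `IUserRepo` is the port named in the diagram; its methods, the `changeEmail` domain function, and the `InMemoryUserRepo` adapter are assumptions made for illustration.

```typescript
// Port: an interface owned by the core domain (named as in the diagram).
// The two methods are assumptions made for this sketch.
interface IUserRepo {
  findEmailById(userId: string): string | undefined;
  save(userId: string, email: string): void;
}

// Core domain: pure logic that depends only on the port.
function changeEmail(repo: IUserRepo, userId: string, newEmail: string): void {
  if (newEmail.indexOf("@") === -1) {
    throw new Error("Invalid email address");
  }
  if (repo.findEmailById(userId) === undefined) {
    throw new Error("Unknown user");
  }
  repo.save(userId, newEmail);
}

// Adapter: one implementation of the port. A PostgreSQL adapter would
// implement the same interface; this in-memory one is for illustration.
class InMemoryUserRepo implements IUserRepo {
  private emailsById: Record<string, string> = {};

  findEmailById(userId: string): string | undefined {
    return Object.prototype.hasOwnProperty.call(this.emailsById, userId)
      ? this.emailsById[userId]
      : undefined;
  }

  save(userId: string, email: string): void {
    this.emailsById[userId] = email;
  }
}

const userRepo = new InMemoryUserRepo();
userRepo.save("u1", "old@example.com");
changeEmail(userRepo, "u1", "new@example.com");
console.log(userRepo.findEmailById("u1"));
```

The domain function never imports the adapter, so swapping storage means writing a new class that implements `IUserRepo` and nothing else.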

Microservices

┌─────────┐  ┌─────────┐  ┌─────────┐
│  User   │  │  Order  │  │ Payment │
│ Service │  │ Service │  │ Service │
└────┬────┘  └────┬────┘  └────┬────┘
     │            │            │
     └────────────┴────────────┘
                  │
          ┌───────▼────────┐
          │  Message Bus   │
          │  (Event-driven)│
          └────────────────┘

When to use: Large team, need for independent deployments, clear service boundaries

Avoid when: Small team, unclear boundaries, early stage

Event-Driven Architecture

┌─────────┐       ┌─────────────┐       ┌─────────┐
│Producer │──────▶│ Event Bus   │──────▶│Consumer │
└─────────┘       └─────────────┘       └─────────┘

When to use: Async processing, decoupled systems, audit trails
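
The producer, bus, and consumer roles can be illustrated with an in-process sketch. `DomainEvent`, `DomainEventHandler`, and `InProcessEventBus` are hypothetical names, and delivery here is synchronous for simplicity; a production system would typically put a message broker in the middle.

```typescript
// Hypothetical event shape for this sketch.
interface DomainEvent {
  type: string;
  payload: Record<string, string>;
}

type DomainEventHandler = (event: DomainEvent) => void;

// In-process event bus. It only illustrates the decoupling between
// producer and consumer; it is not a substitute for a message broker.
class InProcessEventBus {
  private handlersByType: Record<string, DomainEventHandler[]> = Object.create(null);
  private eventLog: DomainEvent[] = []; // Append-only log, usable as an audit trail

  subscribe(type: string, handler: DomainEventHandler): void {
    if (!this.handlersByType[type]) {
      this.handlersByType[type] = [];
    }
    this.handlersByType[type].push(handler);
  }

  publish(event: DomainEvent): void {
    this.eventLog.push(event);
    for (const handler of this.handlersByType[event.type] || []) {
      handler(event);
    }
  }

  history(): DomainEvent[] {
    return this.eventLog.slice();
  }
}

// Consumer: reacts to orders without the producer knowing it exists.
const orderBus = new InProcessEventBus();
const sentConfirmations: string[] = [];
orderBus.subscribe("order.placed", (event) => {
  sentConfirmations.push(`Confirmation for ${event.payload["orderId"]}`);
});

// Producer: publishes a fact and moves on.
orderBus.publish({ type: "order.placed", payload: { orderId: "o-42" } });
console.log(sentConfirmations, orderBus.history().length);
```

Adding a second consumer, for example one that updates inventory, requires another `subscribe` call and no change to the producer.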

Anti-Patterns

❌ Premature Optimization

Don't optimize for scale you don't have.

BAD: Build microservices for 100 users
GOOD: Start with monolith, split when needed

❌ Resume-Driven Architecture

Don't choose a technology to pad your resume.

BAD: "I want to learn Kubernetes, let's use it"
GOOD: "Kubernetes fits our scale needs"

❌ Distributed Monolith

Microservices that are tightly coupled.

BAD: Service A can't deploy without Service B
GOOD: Services are independently deployable

❌ Big Ball of Mud

No structure, everything depends on everything.

BAD: Any code can call any other code
GOOD: Clear layers and boundaries

❌ Analysis Paralysis

Over-analyzing, never shipping.

BAD: Spend 6 months on perfect architecture
GOOD: Design enough to start, iterate

Architecture Review Checklist

Business goals clearly understood
Functional requirements documented
Non-functional requirements defined
Constraints identified
Multiple alternatives considered
Trade-offs explicitly stated
Component responsibilities clear
Interfaces well-defined
Failure scenarios planned for
Security considered
Scalability addressed
Testability designed in
Decisions documented (ADRs)
Team can implement and maintain

Integration with Other Skills

Apply solid-principles - Guide component design
Apply simplicity-principles - KISS, YAGNI
Apply orthogonality-principle - Independent components
Apply structural-design-principles - Composition patterns
Use technical-planning - For implementation after design

Remember

Simplicity first - Start simple, add complexity when needed
Document decisions - Future you will thank you
Consider alternatives - Never settle for the first idea
State trade-offs - Every decision has consequences
Design for change - Systems evolve

The best architecture is the one that's simple enough to ship and flexible enough to evolve.
