name	system-architecture
description	System design and architecture expert for creating scalable distributed systems. Covers system design interviews, architecture patterns, and real-world case studies like Netflix, Twitter, Uber. Use when designing systems, writing architecture docs, or preparing for system design interviews.
allowed-tools	Read, Glob, Grep, Write

System Architecture Expert

When to use this Skill

Use this Skill when:

Designing distributed systems
Writing system design documentation
Preparing for system design interviews
Creating architecture diagrams
Analyzing trade-offs between design choices
Reviewing or improving existing system designs

System Design Framework

1. Requirements Gathering (5-10 minutes)

Functional Requirements:

What are the core features?
What actions can users perform?
What are the inputs and outputs?

Non-Functional Requirements:

Scale: How many users? How much data?
Performance: Latency requirements? (p50, p95, p99)
Availability: What uptime is needed? (99.9%, 99.99%)
Consistency: Strong or eventual consistency?
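
To make the availability targets above concrete, the following sketch converts an uptime percentage into a yearly downtime budget. The function name `allowed_downtime_minutes` is illustrative, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: 365-day year


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


budget_three_nines = allowed_downtime_minutes(99.9)   # about 525.6 minutes
budget_four_nines = allowed_downtime_minutes(99.99)   # about 52.6 minutes
```

Under these assumptions, 99.9% leaves roughly 8.8 hours of downtime per year, while 99.99% leaves under an hour.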

Constraints:

Budget limitations
Technology stack constraints
Team expertise
Timeline

Example Questions:

- How many daily active users?
- What's the read:write ratio?
- What's the average data size?
- What's the peak load vs average load?
- Do we need real-time updates?
- Can we have data loss?

2. Capacity Estimation (Back-of-the-envelope)

Calculate:

Traffic:
- DAU = 100M users
- Each user makes 10 requests/day
- QPS = 100M * 10 / 86400 ≈ 11,574 QPS
- Peak QPS = 2-3x average ≈ 30,000 QPS

Storage:
- 100M users * 1KB of new data per user per day = 100GB/day
- With 3x replication = 300GB/day
- Growth: 300GB/day * 365 days = 109.5TB/year

Bandwidth:
- QPS * average request size
- 11,574 * 10KB = 115.74MB/s

Memory/Cache:

80-20 rule: roughly 20% of the data receives 80% of the traffic
Cache size ≈ 20% of total data, enough to hold the hot set
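
The traffic, bandwidth, and cache arithmetic above can be reproduced with a short script. This is a rough sketch: the function names, the `peak_factor` default of 2.5 (an assumed midpoint of the 2-3x range), and the use of decimal units (1 MB = 1000 KB) are illustrative choices, not part of the framework.

```python
SECONDS_PER_DAY = 86_400


def estimate_traffic(dau: int, requests_per_user: int, request_kb: float,
                     peak_factor: float = 2.5) -> dict:
    """Back-of-the-envelope traffic and bandwidth figures."""
    avg_qps = dau * requests_per_user / SECONDS_PER_DAY
    return {
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * peak_factor,
        # Decimal units: 1 MB = 1000 KB
        "bandwidth_mb_s": avg_qps * request_kb / 1000,
    }


def hot_cache_gb(total_data_gb: float, hot_fraction: float = 0.2) -> float:
    """Cache size needed to hold the hot set under the 80-20 rule."""
    return total_data_gb * hot_fraction


est = estimate_traffic(dau=100_000_000, requests_per_user=10, request_kb=10)
cache_gb = hot_cache_gb(100)  # 20% of a 100GB data set
```

With the inputs from the example, `est["avg_qps"]` comes out near 11,574 and `est["bandwidth_mb_s"]` near 115.74, matching the hand calculation.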

3. High-Level Design

Core Components:

Client Layer (Web, Mobile, Desktop)
API Gateway / Load Balancer
Application Servers (Business logic)
Cache Layer (Redis, Memcached)
Database (SQL, NoSQL, or both)
Message Queue (Kafka, RabbitMQ)
Object Storage (S3, GCS)
CDN (CloudFront, Akamai)

Draw Architecture:

[Clients] → [CDN]
            ↓
        [Load Balancer]
            ↓
    [Application Servers]
        ↙     ↓     ↘
   [Cache] [DB] [Queue] → [Workers]
                            ↓
                      [Object Storage]

4. Database Design

SQL vs NoSQL Decision:

Use SQL when:

ACID transactions required
Complex queries with JOINs
Structured data with relationships
Examples: PostgreSQL, MySQL

Use NoSQL when:

Massive scale (horizontal scaling)
Flexible schema
High write throughput
Examples: Cassandra, DynamoDB, MongoDB

Sharding Strategy:

Hash-based: user_id % num_shards
Range-based: Users 1-100M on shard 1
Geographic: US users on US shard
Consistent hashing: Limits key movement when shards are added or removed; virtual nodes even out the distribution
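
As an illustration of the last strategy, here is a minimal consistent-hash ring. It is a sketch under assumptions: the class name `ConsistentHashRing`, the choice of SHA-256, and the default of 100 virtual nodes per shard are all illustrative rather than prescribed.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps keys to shards using a hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes: int = 100):
        ring = []
        for node in nodes:
            # Each shard is placed on the ring at `vnodes` pseudo-random points
            for i in range(vnodes):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._hashes = [h for h, _ in ring]

    @staticmethod
    def _hash(key: str) -> int:
        # First 8 bytes of SHA-256 as a 64-bit ring position
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def get_node(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = ConsistentHashRing(["shard-1", "shard-2", "shard-3"])
owner = ring.get_node("user:42")
```

Unlike `user_id % num_shards`, adding a fourth shard to this ring only moves the keys that the new shard takes over; every other key keeps its current owner.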

Schema Design:

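-- Example: URL Shortener (PostgreSQL syntax)
CREATE TABLE urls (
    id BIGSERIAL PRIMARY KEY,
    short_url VARCHAR(10) UNIQUE NOT NULL,  -- UNIQUE already creates an index
    long_url TEXT NOT NULL,
    user_id BIGINT,
    created_at TIMESTAMP DEFAULT NOW(),
    expires_at TIMESTAMP,
    click_count INT DEFAULT 0
);

-- PostgreSQL does not accept inline INDEX clauses; create secondary indexes separately
CREATE INDEX idx_urls_user_id ON urls (user_id);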

5. Deep Dive Components

Caching Strategy:

Cache-Aside: App reads from cache, loads from DB on miss
Write-Through: Write to cache and DB together
Write-Behind: Write to cache, async write to DB
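
The cache-aside pattern can be sketched in a few lines. In this illustration a plain dict stands in for Redis or Memcached, and `load_from_db` is a hypothetical callable supplied by the application.

```python
class CacheAside:
    """Read path: check the cache first, fall back to the DB on a miss."""

    def __init__(self, load_from_db):
        self._cache = {}                  # stand-in for Redis/Memcached
        self._load_from_db = load_from_db

    def get(self, key):
        if key in self._cache:            # cache hit
            return self._cache[key]
        value = self._load_from_db(key)   # cache miss: read from the DB
        self._cache[key] = value          # populate the cache for next time
        return value

    def invalidate(self, key):
        # Call after a write to the DB so the next read reloads fresh data
        self._cache.pop(key, None)


db_reads = []

def fake_db_lookup(key):
    # Hypothetical DB call that records how often it is hit
    db_reads.append(key)
    return f"value-for-{key}"

users = CacheAside(fake_db_lookup)
first = users.get("user:1")    # miss, goes to the DB
second = users.get("user:1")   # hit, served from the cache
```

Invalidating on write, rather than updating the cached value in place, is one common way to keep the cache from serving stale data; it is a design choice, not the only option.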

Eviction Policies:

LRU (Least Recently Used) - Most common
LFU (Least Frequently Used)
TTL (Time To Live)
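
For reference, LRU eviction can be sketched with the standard library's `OrderedDict`. The class name `LRUCache` and its `get`/`put` methods are illustrative.

```python
from collections import OrderedDict


class LRUCache:
    """Evicts the least recently used entry once capacity is exceeded."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)         # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)         # mark as most recently used
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" is now the most recently used
cache.put("c", 3)   # evicts "b"
```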

Load Balancing:

Round Robin: Simple, equal distribution
Least Connections: Route to least busy server
Consistent Hashing: Minimize redistribution
Weighted: Based on server capacity
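
Two of the strategies above, round robin and least connections, are sketched below. The class and method names are illustrative, and the sketch ignores health checks and concurrency for brevity.

```python
import itertools


class RoundRobinBalancer:
    """Cycles through servers in a fixed order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def pick(self) -> str:
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Routes each request to the server with the fewest active connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self) -> str:
        # Ties resolve to the first server in insertion order
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server: str) -> None:
        self.active[server] -= 1


rr = RoundRobinBalancer(["app-1", "app-2", "app-3"])
rr_order = [rr.pick() for _ in range(4)]

lc = LeastConnectionsBalancer(["app-1", "app-2", "app-3"])
lc_order = [lc.acquire() for _ in range(3)]
lc.release("app-2")        # app-2 finishes its request
next_pick = lc.acquire()   # app-2 is now the least busy
```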

Message Queue Patterns:

Pub/Sub: One-to-many (notifications)
Work Queue: Task distribution (job processing)
Fan-out: Broadcast to multiple queues
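
The difference between pub/sub and a work queue is easiest to see side by side. The in-memory classes below are an illustrative sketch only; a real deployment would use a broker such as Kafka or RabbitMQ.

```python
from collections import defaultdict, deque


class PubSubBroker:
    """Pub/sub: every subscriber to a topic receives every message."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message) -> int:
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)  # number of deliveries


class WorkQueue:
    """Work queue: each task is handed to exactly one worker."""

    def __init__(self):
        self._tasks = deque()

    def enqueue(self, task) -> None:
        self._tasks.append(task)

    def dequeue(self):
        return self._tasks.popleft() if self._tasks else None


email_inbox, push_inbox = [], []
broker = PubSubBroker()
broker.subscribe("order.created", email_inbox.append)
broker.subscribe("order.created", push_inbox.append)
delivered = broker.publish("order.created", {"order_id": 1})

jobs = WorkQueue()
jobs.enqueue("resize-image-1")
jobs.enqueue("resize-image-2")
worker_a_task = jobs.dequeue()
worker_b_task = jobs.dequeue()
```

Here both inboxes receive the published event, while each queued job goes to only one of the two workers.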

6. Scalability Patterns

Horizontal Scaling:

Add more servers
Use load balancers
Stateless application servers
Session stored in cache/DB

Vertical Scaling:

Add more CPU/RAM to servers
Limited by hardware
Simpler but has limits

Microservices:

Monolith:
[Single App] → [DB]

Microservices:
[User Service] → [User DB]
[Post Service] → [Post DB]
[Feed Service] → [Feed DB]

Benefits:

Independent scaling
Technology flexibility
Fault isolation

Drawbacks:

Increased complexity
Network latency
Distributed transactions

7. Reliability & Availability

Replication:

Master-Slave: One writer, multiple readers
Master-Master: Multiple writers (conflict resolution needed)
Multi-region: Geographic redundancy

Failover:

Active-Passive: Standby server takes over
Active-Active: Both servers handle traffic

Rate Limiting:

Token bucket algorithm
Leaky bucket algorithm
Fixed window counter
Sliding window log
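
The token bucket algorithm can be sketched as follows. This is a single-process illustration; the class name, the `allow` method, and the injectable `clock` parameter (used to keep the example deterministic) are assumptions, and a distributed limiter would keep the bucket state in a shared store instead.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate` of tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, capped at capacity
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


fake_now = [0.0]  # controllable clock for the example
bucket = TokenBucket(rate=1, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(3)]  # third request exceeds the burst
fake_now[0] = 1.0                           # one second later, one token refilled
after_refill = bucket.allow()
```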

Circuit Breaker:

States:
Closed → Normal operation
Open → Reject requests immediately
Half-Open → Test if service recovered
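
The three states can be expressed as a small state machine. The sketch below is illustrative: the class names, the `failure_threshold` and `reset_timeout` parameters, and their defaults are assumptions, and thread safety is left out.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; request rejected")
            self.state = "half-open"  # timeout elapsed: allow a trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, opens the circuit
            if self.state == "half-open" or self._failures >= self.failure_threshold:
                self.state = "open"
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count
        self._failures = 0
        self.state = "closed"
        return result


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0)
healthy_result = breaker.call(lambda: "ok")  # passes through while closed
```

Resetting the failure count on every success means only consecutive failures trip the breaker; counting failures within a time window is a common alternative.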

8. Common System Design Patterns

Content Delivery:

Use CDN for static assets
Geo-distributed edge servers
Cache at edge locations

Data Consistency:

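Strong Consistency: Every read reflects the latest committed write
Eventual Consistency: Replicas converge over time, so reads may briefly return stale data (typical of BASE systems)
CAP Theorem: Network partitions cannot be ruled out in a distributed system, so during a partition you choose between Consistency and Availability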

API Design:

RESTful:
GET    /api/users/{id}
POST   /api/users
PUT    /api/users/{id}
DELETE /api/users/{id}

GraphQL:
query {
  user(id: "123") {
    name
    posts {
      title
    }
  }
}

9. System Design Template

Use this structure (based on system_design/00_template.md):

# {System Name}

## 1. Requirements
### Functional
- [List core features]

### Non-Functional
- Scale: [Users, QPS, Data]
- Performance: [Latency requirements]
- Availability: [Uptime target]

## 2. Capacity Estimation
- Traffic: [QPS calculations]
- Storage: [Data size, growth]
- Bandwidth: [Network requirements]

## 3. API Design

[endpoint] - [description]


## 4. High-Level Architecture
[Diagram]

## 5. Database Schema
[Tables and relationships]

## 6. Detailed Design
### Component 1
[Deep dive]

### Component 2
[Deep dive]

## 7. Scalability
[How to scale each component]

## 8. Trade-offs
[Decisions and alternatives]

10. Real-World Examples

Reference case studies in system_design/:

Netflix: Video streaming, recommendation
Twitter: Timeline, tweet storage, trending
Uber: Real-time matching, location tracking
Instagram: Image storage, feed generation
WhatsApp: Message delivery, presence

Common Patterns:

News Feed: Fan-out on write vs fan-out on read
Rate Limiter: Token bucket with Redis
URL Shortener: Base62 encoding, hash collision
Chat System: WebSocket, message queue
Notification: Push notification service, APNs/FCM
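
For the URL shortener pattern, Base62 encoding of a numeric ID can be sketched as below. The alphabet order (digits, then lowercase, then uppercase) is an assumption; any fixed ordering works as long as encode and decode agree.

```python
import string

# Assumed alphabet order: 0-9, a-z, A-Z
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)  # 62


def base62_encode(number: int) -> str:
    """Encode a non-negative integer ID as a Base62 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    chars = []
    while number:
        number, remainder = divmod(number, BASE)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def base62_decode(code: str) -> int:
    """Decode a Base62 string back to the integer ID."""
    number = 0
    for ch in code:
        number = number * BASE + ALPHABET.index(ch)
    return number


short_code = base62_encode(125)  # 125 = 2 * 62 + 1, so the code is "21"
```

Encoding a unique auto-increment ID this way avoids hash collisions entirely, at the cost of producing guessable, sequential codes.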

Interview Tips

Time Management:

Requirements: 10%
High-level design: 25%
Deep dive: 50%
Wrap up: 15%

Communication:

Think out loud
Ask clarifying questions
Discuss trade-offs
Acknowledge limitations

What interviewers look for:

Problem-solving approach
Technical depth
Trade-off analysis
Scale awareness
Communication skills

Common Mistakes to Avoid

Jumping to a solution without gathering requirements
Over-engineering simple problems
Under-estimating scale requirements
Ignoring single points of failure
Not considering monitoring/alerting
Forgetting about data consistency
Missing security considerations

Project Context

Templates in system_design/00_template.md
Case studies in system_design/*.md
Reference materials in doc/system_design/
Follow the established documentation pattern
