| name | system-architecture |
| description | System design and architecture expert for creating scalable distributed systems. Covers system design interviews, architecture patterns, and real-world case studies like Netflix, Twitter, Uber. Use when designing systems, writing architecture docs, or preparing for system design interviews. |
| allowed-tools | Read, Glob, Grep, Write |
System Architecture Expert
When to use this Skill
Use this Skill when:
- Designing distributed systems
- Writing system design documentation
- Preparing for system design interviews
- Creating architecture diagrams
- Analyzing trade-offs between design choices
- Reviewing or improving existing system designs
System Design Framework
1. Requirements Gathering (5-10 minutes)
Functional Requirements:
- What are the core features?
- What actions can users perform?
- What are the inputs and outputs?
Non-Functional Requirements:
- Scale: How many users? How much data?
- Performance: Latency requirements? (p50, p95, p99)
- Availability: What uptime is needed? (99.9%, 99.99%)
- Consistency: Strong or eventual consistency?
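To make the availability targets concrete, this small sketch converts an uptime percentage into a yearly downtime budget. The function name `allowed_downtime_minutes` is illustrative, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: 365-day year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Each extra "nine" cuts the budget by a factor of ten: roughly 8.8 hours per year at 99.9% versus under an hour at 99.99%.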
Constraints:
- Budget limitations
- Technology stack constraints
- Team expertise
- Timeline
Example Questions:
- How many daily active users?
- What's the read:write ratio?
- What's the average data size?
- What's the peak load vs average load?
- Do we need real-time updates?
- Is any data loss acceptable?
2. Capacity Estimation (Back-of-the-envelope)
Calculate:
Traffic:
- DAU = 100M users
- Each user makes 10 requests/day
- QPS = 100M * 10 / 86400 ≈ 11,574 QPS
- Peak QPS = 2-3x average ≈ 30,000 QPS
Storage:
- 100M users * 1KB per user = 100GB
- With 3x replication = 300GB
- Growth: if every user adds another 1KB per day, that is 300GB/day after replication, so 300GB * 365 days = 109.5TB/year
Bandwidth:
- QPS * average request size
- 11,574 * 10KB = 115.74MB/s
Memory/Cache:
- 80-20 rule: 20% of data gets 80% of traffic
- Cache size = 20% of total data to hold the hot set (e.g., 20% of 100GB = 20GB)
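The arithmetic above can be reproduced with a short script. This is a sketch under stated assumptions: the function `estimate_capacity` and its parameter names are illustrative, units are decimal (1 GB = 10^6 KB), and the peak factor of 2.5 is simply a midpoint of the 2-3x range.

```python
SECONDS_PER_DAY = 86_400


def estimate_capacity(dau, requests_per_user, record_kb, request_kb,
                      replication=3, peak_factor=2.5, hot_fraction=0.2):
    """Back-of-the-envelope estimates; all defaults are assumptions to adjust."""
    avg_qps = dau * requests_per_user / SECONDS_PER_DAY
    storage_gb = dau * record_kb / 1_000_000  # decimal units: 1 GB = 10^6 KB
    return {
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * peak_factor,
        "storage_gb": storage_gb,
        "replicated_storage_gb": storage_gb * replication,
        "bandwidth_mb_s": avg_qps * request_kb / 1_000,
        "cache_gb": storage_gb * hot_fraction,  # 80-20 rule: cache the hot 20%
    }


estimate = estimate_capacity(dau=100_000_000, requests_per_user=10,
                             record_kb=1, request_kb=10)
for name, value in estimate.items():
    print(f"{name}: {value:,.2f}")
```

Keeping the inputs as named parameters makes it easy to rerun the estimate when an interviewer changes an assumption, such as the read size or the replication factor.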
3. High-Level Design
Core Components:
- Client Layer (Web, Mobile, Desktop)
- API Gateway / Load Balancer
- Application Servers (Business logic)
- Cache Layer (Redis, Memcached)
- Database (SQL, NoSQL, or both)
- Message Queue (Kafka, RabbitMQ)
- Object Storage (S3, GCS)
- CDN (CloudFront, Akamai)
Draw Architecture:
[Clients] → [CDN]
↓
[Load Balancer]
↓
[Application Servers]
↙ ↓ ↘
[Cache] [DB] [Queue] → [Workers]
↓
[Object Storage]
4. Database Design
SQL vs NoSQL Decision:
Use SQL when:
- ACID transactions required
- Complex queries with JOINs
- Structured data with relationships
- Examples: PostgreSQL, MySQL
Use NoSQL when:
- Massive scale (horizontal scaling)
- Flexible schema
- High write throughput
- Examples: Cassandra, DynamoDB, MongoDB
Sharding Strategy:
- Hash-based: user_id % num_shards
- Range-based: Users 1-100M on shard 1
- Geographic: US users on US shard
- Consistent hashing: For even distribution
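The difference between modulo sharding and consistent hashing shows up when a shard is added. The sketch below is a minimal illustration, not a production ring: `ConsistentHashRing`, `stable_hash`, the shard names, and the choice of MD5 with 100 virtual nodes per shard are all assumptions made for the example.

```python
import bisect
import hashlib


def stable_hash(key: str) -> int:
    """Hash a key to a stable integer (the built-in hash() is salted per process)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


def modulo_shard(user_id: int, num_shards: int) -> int:
    """Hash-based sharding: simple, but most keys move when num_shards changes."""
    return user_id % num_shards


class ConsistentHashRing:
    """Places each shard at many points on a ring; a key goes to the next point clockwise."""

    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (stable_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Wrap around to the first point when the key hashes past the last one.
        idx = bisect.bisect(self._points, stable_hash(key)) % len(self._points)
        return self._ring[idx][1]


user_ids = range(10_000)
moved_modulo = sum(modulo_shard(u, 3) != modulo_shard(u, 4) for u in user_ids)

before = ConsistentHashRing(["shard-1", "shard-2", "shard-3"])
after = ConsistentHashRing(["shard-1", "shard-2", "shard-3", "shard-4"])
moved_ring = sum(before.node_for(f"user:{u}") != after.node_for(f"user:{u}") for u in user_ids)

print(f"modulo: {moved_modulo / len(user_ids):.0%} of keys moved")
print(f"consistent hashing: {moved_ring / len(user_ids):.0%} of keys moved")
```

With modulo sharding, going from 3 to 4 shards relocates about three quarters of sequential IDs. On the ring, only the keys claimed by the new shard move, which is roughly a quarter here.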
Schema Design:
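-- Example: URL Shortener (PostgreSQL syntax)
CREATE TABLE urls (
    id BIGSERIAL PRIMARY KEY,
    short_url VARCHAR(10) UNIQUE NOT NULL,  -- UNIQUE already creates an index
    long_url TEXT NOT NULL,
    user_id BIGINT,
    created_at TIMESTAMP DEFAULT NOW(),
    expires_at TIMESTAMP,
    click_count INT DEFAULT 0
);
-- PostgreSQL has no inline INDEX clause; secondary indexes are created separately
CREATE INDEX idx_urls_user_id ON urls (user_id);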
5. Deep Dive Components
Caching Strategy:
- Cache-Aside: App reads from cache, loads from DB on miss
- Write-Through: Write to cache and DB together
- Write-Behind: Write to cache, async write to DB
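As a rough illustration of the cache-aside flow, the sketch below uses plain dictionaries as stand-ins for the cache and the database. The class `CacheAsideStore` and its `db_reads` counter are hypothetical names for this example. Invalidating on write is one common choice, not the only one.

```python
class CacheAsideStore:
    """Cache-aside: the application checks the cache first and fills it on a miss."""

    def __init__(self, database: dict):
        self.database = database  # stand-in for the real database
        self.cache = {}           # stand-in for Redis or Memcached
        self.db_reads = 0         # counts trips to the database, for illustration

    def read(self, key):
        if key in self.cache:
            return self.cache[key]          # cache hit
        self.db_reads += 1                  # cache miss: go to the database
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value         # populate the cache for later reads
        return value

    def write(self, key, value):
        self.database[key] = value
        self.cache.pop(key, None)           # invalidate so the next read reloads


store = CacheAsideStore({"user:1": "Ada"})
store.read("user:1")    # miss: loads from the database
store.read("user:1")    # hit: served from the cache
print(store.db_reads)   # prints 1
```

A write-through variant would update `self.cache[key]` inside `write` instead of removing the entry, trading slightly slower writes for warmer reads.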
Eviction Policies:
- LRU (Least Recently Used) - Most common
- LFU (Least Frequently Used)
- TTL (Time To Live)
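LRU eviction is straightforward to sketch with `collections.OrderedDict`, which keeps keys in insertion order and can move a key to the end on access. The class name `LRUCache` and its method names are illustrative.

```python
from collections import OrderedDict


class LRUCache:
    """Evicts the least recently used entry once capacity is exceeded."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)         # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # drop the least recently used entry


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # "a" is now the most recently used
cache.put("c", 3)      # evicts "b"
print(cache.get("b"))  # prints None
```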
Load Balancing:
- Round Robin: Simple, equal distribution
- Least Connections: Route to least busy server
- Consistent Hashing: Minimize redistribution
- Weighted: Based on server capacity
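Two of these strategies can be sketched in a few lines. The class names and server labels below are hypothetical, and a real balancer would also track server health. The point is only to show how the selection rule differs.

```python
import itertools


class RoundRobinBalancer:
    """Hands out servers in a fixed rotation, ignoring their current load."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Routes each request to the server with the fewest active connections."""

    def __init__(self, servers):
        self._active = {server: 0 for server in servers}

    def acquire(self):
        server = min(self._active, key=self._active.get)  # ties go to the first listed
        self._active[server] += 1
        return server

    def release(self, server):
        self._active[server] -= 1


rr = RoundRobinBalancer(["app-1", "app-2", "app-3"])
print([rr.pick() for _ in range(4)])

lc = LeastConnectionsBalancer(["app-1", "app-2"])
first = lc.acquire()   # app-1
second = lc.acquire()  # app-2
lc.release(first)      # app-1 finishes its request
print(lc.acquire())    # app-1 again, since it is now the least busy
```

Round robin works well when requests cost about the same. Least connections adapts better when some requests are long-lived.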
Message Queue Patterns:
- Pub/Sub: One-to-many (notifications)
- Work Queue: Task distribution (job processing)
- Fan-out: Broadcast to multiple queues
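The key difference between pub/sub and a work queue is who receives each message. The in-process sketch below is only a model of the delivery semantics. `PubSub` and `WorkQueue` are hypothetical classes, not the API of Kafka or RabbitMQ.

```python
from collections import defaultdict, deque


class PubSub:
    """Pub/sub: every subscriber of a topic receives every message."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self._subscribers[topic]:
            handler(message)


class WorkQueue:
    """Work queue: each task is handed to exactly one worker."""

    def __init__(self):
        self._tasks = deque()

    def enqueue(self, task):
        self._tasks.append(task)

    def dequeue(self):
        return self._tasks.popleft() if self._tasks else None


bus = PubSub()
email_inbox, push_inbox = [], []
bus.subscribe("order.created", email_inbox.append)
bus.subscribe("order.created", push_inbox.append)
bus.publish("order.created", {"order_id": 1})
print(email_inbox, push_inbox)  # both subscribers received the same message

jobs = WorkQueue()
jobs.enqueue("resize-image-1")
jobs.enqueue("resize-image-2")
print(jobs.dequeue(), jobs.dequeue())  # each task is taken once, in order
```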
6. Scalability Patterns
Horizontal Scaling:
- Add more servers
- Use load balancers
- Stateless application servers
- Session stored in cache/DB
Vertical Scaling:
- Add more CPU/RAM to servers
- Capped by the largest machine available
- Simpler to operate, but no redundancy if that machine fails
Microservices:
Monolith:
[Single App] → [DB]
Microservices:
[User Service] → [User DB]
[Post Service] → [Post DB]
[Feed Service] → [Feed DB]
Benefits:
- Independent scaling
- Technology flexibility
- Fault isolation
Drawbacks:
- Increased complexity
- Network latency
- Distributed transactions
7. Reliability & Availability
Replication:
- Master-Slave: One writer, multiple readers
- Master-Master: Multiple writers (conflict resolution needed)
- Multi-region: Geographic redundancy
Failover:
- Active-Passive: Standby server takes over
- Active-Active: Both servers handle traffic
Rate Limiting:
- Token bucket algorithm
- Leaky bucket algorithm
- Fixed window counter
- Sliding window log
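A token bucket can be sketched as follows. This is a single-process illustration: the class name `TokenBucket`, its parameters, and the injectable `clock` (added so the behavior can be tested without sleeping) are assumptions for the example. A distributed limiter would keep this state in a shared store instead.

```python
import time


class TokenBucket:
    """Refills at `rate` tokens per second, up to `capacity`; each request spends tokens."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._tokens = float(capacity)  # start full so short bursts are allowed
        self._clock = clock
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Add the tokens earned since the last call, without exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


bucket = TokenBucket(rate=5, capacity=10)
results = [bucket.allow() for _ in range(12)]
print(results.count(True), "requests allowed in the initial burst")
```

The capacity sets the burst size and the rate sets the sustained throughput, which is why this algorithm tolerates short spikes better than a fixed window counter.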
Circuit Breaker:
States:
Closed → Normal operation
Open → Reject requests immediately
Half-Open → Test if service recovered
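The three states can be modeled as a small state machine. The sketch below is one possible implementation under stated assumptions: `CircuitBreaker`, `CircuitOpenError`, the threshold of consecutive failures, and the reset timeout are all illustrative choices. The `clock` parameter exists only to make the timing testable.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None
        self.state = "closed"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; request rejected")
            self.state = "half-open"  # timeout elapsed: let one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self.state == "half-open" or self._failures >= self.failure_threshold:
                self.state = "open"
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self.state = "closed"  # success closes the circuit again
        return result


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=5.0)
print(breaker.call(lambda: "ok"), breaker.state)
```

Rejecting calls immediately while open gives the failing dependency time to recover and keeps callers from tying up threads on requests that are likely to time out.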
8. Common System Design Patterns
Content Delivery:
- Use CDN for static assets
- Geo-distributed edge servers
- Cache at edge locations
Data Consistency:
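- Strong Consistency: Every read reflects the latest write
- Eventual Consistency: Replicas converge over time, so reads may briefly return stale data (BASE)
- CAP Theorem: During a network partition, choose between Consistency and Availability; Partition Tolerance is not optional in a distributed system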
API Design:
RESTful:
GET /api/users/{id}
POST /api/users
PUT /api/users/{id}
DELETE /api/users/{id}
GraphQL:
query {
  user(id: "123") {
    name
    posts {
      title
    }
  }
}
9. System Design Template
Use this structure (based on system_design/00_template.md):
# {System Name}
## 1. Requirements
### Functional
- [List core features]
### Non-Functional
- Scale: [Users, QPS, Data]
- Performance: [Latency requirements]
- Availability: [Uptime target]
## 2. Capacity Estimation
- Traffic: [QPS calculations]
- Storage: [Data size, growth]
- Bandwidth: [Network requirements]
## 3. API Design
[endpoint] - [description]
## 4. High-Level Architecture
[Diagram]
## 5. Database Schema
[Tables and relationships]
## 6. Detailed Design
### Component 1
[Deep dive]
### Component 2
[Deep dive]
## 7. Scalability
[How to scale each component]
## 8. Trade-offs
[Decisions and alternatives]
10. Real-World Examples
Reference case studies in system_design/:
- Netflix: Video streaming, recommendation
- Twitter: Timeline, tweet storage, trending
- Uber: Real-time matching, location tracking
- Instagram: Image storage, feed generation
- WhatsApp: Message delivery, presence
Common Patterns:
- News Feed: Fan-out on write vs fan-out on read
- Rate Limiter: Token bucket with Redis
- URL Shortener: Base62 encoding, hash collision handling
- Chat System: WebSocket, message queue
- Notification: Push notification service, APNs/FCM
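For the URL shortener pattern, Base62 encoding turns a numeric ID (for example, an auto-increment primary key) into a short string. The sketch below assumes the alphabet order digits, lowercase, uppercase. Any fixed ordering works as long as encoding and decoding agree.

```python
import string

# 62 symbols: 0-9, a-z, A-Z (the ordering is an assumption of this sketch)
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)


def base62_encode(number: int) -> str:
    """Encode a non-negative integer as a Base62 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def base62_decode(text: str) -> int:
    """Decode a Base62 string back to the original integer."""
    number = 0
    for char in text:
        number = number * BASE + ALPHABET.index(char)
    return number


short_code = base62_encode(125)
print(short_code, base62_decode(short_code))  # prints: 21 125
```

Seven Base62 characters cover 62^7 (about 3.5 trillion) distinct IDs. Encoding a unique ID also sidesteps the collision handling that hashing the long URL would require.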
Interview Tips
Time Management:
- Requirements: 10%
- High-level design: 25%
- Deep dive: 50%
- Wrap up: 15%
Communication:
- Think out loud
- Ask clarifying questions
- Discuss trade-offs
- Acknowledge limitations
What interviewers look for:
- Problem-solving approach
- Technical depth
- Trade-off analysis
- Scale awareness
- Communication skills
Common Mistakes to Avoid
- Jumping to solution without requirements
- Over-engineering simple problems
- Under-estimating scale requirements
- Ignoring single points of failure
- Not considering monitoring/alerting
- Forgetting about data consistency
- Missing security considerations
Project Context
- Templates in system_design/00_template.md
- Case studies in system_design/*.md
- Reference materials in doc/system_design/
- Follow the established documentation pattern