---
name: system-design-generator
description: Produces comprehensive system architecture plans for features and products including component breakdown, data flow diagrams, system boundaries, API contracts, and scaling considerations. Use for "system design", "architecture planning", "feature design", or "technical specs".
---

# System Design Generator

Create comprehensive system architecture plans from requirements.

## System Design Document Template

# System Design: [Feature/Product Name]

## Overview

Brief description of what we're building and why.

## Requirements

### Functional

- User can upload videos (max 1GB)
- System processes video within 5 minutes
- User receives notification when complete

### Non-Functional

- Handle 1000 uploads/day
- 99.9% uptime
- Process videos in <5 minutes (p95)
- Cost: <$0.50 per video

## High-Level Architecture

    ┌─────────┐      ┌──────────┐      ┌─────────────┐
    │ Client  │─────▶│   API    │─────▶│   Upload    │
    │         │      │ Gateway  │      │   Service   │
    └─────────┘      └──────────┘      └─────────────┘
                                              │
                                              ▼
                                       ┌─────────────┐
                                       │   Storage   │
                                       │    (S3)     │
                                       └─────────────┘
                                              │
                                              ▼
                                       ┌─────────────┐
                                       │ Processing  │◀─┐
                                       │    Queue    │  │
                                       └─────────────┘  │
                                              │         │
                                              ▼         │
                                       ┌─────────────┐  │
                                       │  Processor  │──┘
                                       │   Workers   │
                                       └─────────────┘
                                              │
                                              ▼
                                       ┌─────────────┐
                                       │Notification │
                                       │   Service   │
                                       └─────────────┘


## Components

### 1. API Gateway
**Responsibilities:**
- Authentication
- Rate limiting
- Request routing

**Technology:** Kong/AWS API Gateway
**Scaling:** Auto-scale based on requests/sec

### 2. Upload Service
**Responsibilities:**
- Generate pre-signed S3 URLs
- Validate file metadata
- Enqueue processing jobs

**API:**

    POST /uploads
    Request:  { filename, size, content_type }
    Response: { upload_url, upload_id }


**Technology:** Node.js + Express
**Scaling:** Horizontal (stateless)
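
The "validate file metadata" responsibility can be sketched as a pure function over the request body. This is a minimal sketch, not a definitive implementation: the function name, the reading of "max 1GB" as 1 GiB, and the content-type allow-list are all assumptions that the design does not specify.

```typescript
// Shape of the POST /uploads request body described above.
interface UploadRequest {
  filename: string;
  size: number; // bytes
  content_type: string;
}

// Assumption: "max 1GB" is read as 1 GiB; use 1e9 if decimal gigabytes are meant.
const MAX_UPLOAD_BYTES = 1024 * 1024 * 1024;

// Assumption: hypothetical allow-list; the design does not name accepted formats.
const ALLOWED_CONTENT_TYPES = ["video/mp4", "video/quicktime", "video/webm"];

// Returns a list of problems; an empty list means the request is valid.
function validateUploadRequest(req: UploadRequest): string[] {
  const errors: string[] = [];
  if (req.filename.trim().length === 0) {
    errors.push("filename must not be empty");
  }
  if (!Number.isInteger(req.size) || req.size <= 0) {
    errors.push("size must be a positive integer number of bytes");
  } else if (req.size > MAX_UPLOAD_BYTES) {
    errors.push("size exceeds the 1 GB limit");
  }
  if (ALLOWED_CONTENT_TYPES.indexOf(req.content_type) === -1) {
    errors.push("unsupported content_type: " + req.content_type);
  }
  return errors;
}
```

Returning every problem at once, rather than failing on the first, lets the client fix a rejected request in one round trip.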

### 3. Storage (S3)
**Responsibilities:**
- Store raw videos
- Store processed outputs
- Serve content via CDN

**Structure:**

    /uploads/{user_id}/{upload_id}/original.mp4
    /processed/{user_id}/{upload_id}/output.mp4
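
One way to keep this layout consistent across services is to build keys in a single place. The sketch below uses hypothetical helper names. It assumes the IDs never contain a slash, and it drops the leading slash because S3 object keys are conventionally written without one.

```typescript
// Rejects IDs that would break out of the intended key prefix.
// Assumption: user_id and upload_id are opaque tokens such as UUIDs.
function assertSafeId(id: string): void {
  if (id.length === 0 || id.indexOf("/") !== -1) {
    throw new Error("invalid id: " + JSON.stringify(id));
  }
}

// Key of the raw video uploaded by the client.
function originalKey(userId: string, uploadId: string): string {
  assertSafeId(userId);
  assertSafeId(uploadId);
  return "uploads/" + userId + "/" + uploadId + "/original.mp4";
}

// Key of the transcoded output written by a worker.
function processedKey(userId: string, uploadId: string): string {
  assertSafeId(userId);
  assertSafeId(uploadId);
  return "processed/" + userId + "/" + uploadId + "/output.mp4";
}
```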


### 4. Processing Queue
**Responsibilities:**
- Buffer processing jobs
- Ensure at-least-once delivery
- DLQ for failed jobs

**Technology:** AWS SQS
**Configuration:**
- Visibility timeout: 15 minutes
- DLQ after 3 retries
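
"DLQ after 3 retries" can be read in more than one way, so it helps to pin the rule down. The sketch below assumes one initial attempt plus three retries, so the fifth delivery is dead-lettered. In SQS this is normally configured through the queue's redrive policy (`maxReceiveCount`, which would be 4 under this reading) rather than coded by hand. The function is illustrative only.

```typescript
type Disposition = "process" | "dead-letter";

// Assumption: "3 retries" means one initial attempt plus three retries.
const MAX_RETRIES = 3;

// receiveCount is the number of deliveries so far, including this one
// (1 = first attempt).
function disposition(receiveCount: number, maxRetries: number = MAX_RETRIES): Disposition {
  const maxDeliveries = maxRetries + 1;
  return receiveCount <= maxDeliveries ? "process" : "dead-letter";
}
```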

### 5. Processor Workers
**Responsibilities:**
- Transcode videos
- Generate thumbnails
- Update database

**Technology:** Python + FFmpeg
**Scaling:** Auto-scale on queue depth
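
Scaling on queue depth amounts to asking how many workers are needed to drain the current backlog within a target time. The sketch below shows one such rule. The policy fields and their values are assumptions for illustration, not part of the design.

```typescript
// Hypothetical scaling policy; every value is an assumption to be tuned.
interface ScalingPolicy {
  minutesPerVideo: number;    // average processing time per video on one worker
  targetDrainMinutes: number; // how quickly the backlog should be cleared
  minWorkers: number;
  maxWorkers: number;
}

// Worker-minutes of backlog divided by the time allowed to drain it,
// clamped to the configured bounds.
function desiredWorkers(queueDepth: number, policy: ScalingPolicy): number {
  const needed = Math.ceil((queueDepth * policy.minutesPerVideo) / policy.targetDrainMinutes);
  return Math.min(policy.maxWorkers, Math.max(policy.minWorkers, needed));
}
```

With `minutesPerVideo: 5` and `targetDrainMinutes: 10`, a backlog of 7 jobs asks for 4 workers. The upper bound stops a burst from scaling the fleet past its budget.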

## Data Flow

### Upload Flow
1. Client requests upload URL from Upload Service
2. Upload Service generates pre-signed S3 URL
3. Client uploads directly to S3
4. Client notifies Upload Service of completion
5. Upload Service enqueues processing job
6. Returns upload_id to client

### Processing Flow
1. Worker polls queue for jobs
2. Downloads video from S3
3. Processes video (transcode, thumbnail)
4. Uploads results to S3
5. Updates database status
6. Sends notification
7. Deletes message from queue
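
The ordering of these steps is what gives at-least-once behaviour: the queue message is deleted only after everything else has succeeded. A minimal sketch follows, with the external systems behind a hypothetical `WorkerDeps` interface. All names are illustrative, and the calls are synchronous only to keep the example short.

```typescript
// Hypothetical ports for S3, FFmpeg, the database, notifications and the queue.
interface WorkerDeps {
  download(uploadId: string): string;                   // returns a local path
  transcode(localPath: string): string;                 // returns the output path
  uploadResult(uploadId: string, path: string): string; // returns processed_url
  markComplete(uploadId: string, processedUrl: string): void;
  notify(uploadId: string): void;
  deleteMessage(receiptHandle: string): void;
}

interface QueueMessage {
  receiptHandle: string;
  uploadId: string;
}

// Runs steps 2-7 for one message. If any step throws, deleteMessage is never
// reached, so the message becomes visible again and is redelivered.
function handleMessage(msg: QueueMessage, deps: WorkerDeps): void {
  const localPath = deps.download(msg.uploadId);
  const outputPath = deps.transcode(localPath);
  const processedUrl = deps.uploadResult(msg.uploadId, outputPath);
  deps.markComplete(msg.uploadId, processedUrl);
  deps.notify(msg.uploadId);
  deps.deleteMessage(msg.receiptHandle);
}
```

Because a failure after `notify` but before `deleteMessage` causes a redelivery, the steps before the delete need to tolerate being run twice.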

## Data Model

```typescript
interface Upload {
  id: string;
  user_id: string;
  filename: string;
  size: number;
  status: 'pending' | 'processing' | 'complete' | 'failed';
  original_url: string;
  processed_url?: string;
  created_at: Date;
  processed_at?: Date;
}

interface ProcessingJob {
  upload_id: string;
  attempts: number;
  error?: string;
}
```

## API Contract

### Upload Endpoints

    POST   /uploads           - Request upload URL
    GET    /uploads/:id       - Get upload status
    DELETE /uploads/:id       - Cancel upload
    GET    /uploads           - List user uploads

### Webhooks

    POST {webhook_url}
    {
      "event": "upload.completed",
      "upload_id": "...",
      "status": "complete",
      "processed_url": "..."
    }

## Scaling Considerations

### Current Capacity

- 1000 uploads/day ≈ 0.7 per minute on average; plan for ~1 per minute
- A single worker can process 1 video every 5 minutes
- Need 5 workers for current load

### 10x Scale (10,000/day)

- ~7 uploads per minute on average; plan for ~10
- Need 50 workers
- Use spot instances for cost savings
- Add Redis cache for status checks

### 100x Scale (100,000/day)

- ~70 uploads per minute on average; plan for ~100
- Partition by region
- Use Kafka instead of SQS
- Database sharding by user_id
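
The worker counts above come from multiplying the arrival rate by the per-video processing time (Little's law). The sketch below computes the average-load minimum for each tier. It assumes uploads are spread evenly over the day, which real traffic is not, and the function name is illustrative.

```typescript
const MINUTES_PER_DAY = 24 * 60;

// Average number of busy workers = arrival rate x processing time.
function minWorkers(uploadsPerDay: number, minutesPerVideo: number): number {
  const uploadsPerMinute = uploadsPerDay / MINUTES_PER_DAY;
  return Math.ceil(uploadsPerMinute * minutesPerVideo);
}

// Minimum workers at 1x, 10x and 100x, at 5 minutes per video.
const tiers = [1000, 10000, 100000].map((uploadsPerDay) => ({
  uploadsPerDay,
  workers: minWorkers(uploadsPerDay, 5),
}));
// tiers -> 4, 35 and 348 workers respectively
```

The 5 and 50 workers quoted above correspond to rounding the arrival rate up to 1 and 10 uploads per minute, which leaves roughly 40% headroom over these minimums for peak periods.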

## Failure Modes

### S3 Unavailable

- **Impact:** Uploads fail
- **Mitigation:** Multi-region S3 replication

### Queue Backed Up

- **Impact:** Processing delays
- **Mitigation:** Auto-scale workers faster

### Worker Crash During Processing

- **Impact:** Job retried
- **Mitigation:** Idempotent processing
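
Idempotent processing means a redelivered job must not produce duplicate work or duplicate outputs. A minimal sketch follows, using an in-memory map as a stand-in for the uploads table. The `transcode` callback is hypothetical and is assumed to write to a deterministic output key, so a rerun overwrites rather than duplicates.

```typescript
type UploadStatus = "pending" | "processing" | "complete" | "failed";

// In-memory stand-in for the uploads table; a real system would use the database.
const statusByUploadId = new Map<string, UploadStatus>();

// Returns true if work was performed, false if the job was a duplicate delivery.
function processIdempotently(uploadId: string, transcode: (id: string) => void): boolean {
  if (statusByUploadId.get(uploadId) === "complete") {
    return false; // already done: acknowledge the message without reprocessing
  }
  statusByUploadId.set(uploadId, "processing");
  transcode(uploadId); // if this throws, status stays "processing" and a retry reruns it
  statusByUploadId.set(uploadId, "complete");
  return true;
}
```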

## Cost Estimate

**Monthly (1000 uploads/day):**

- S3 Storage: $50
- S3 Transfer: $100
- SQS: $10
- Workers (EC2): $300
- Database: $100

**Total:** ~$560/month
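
The monthly total can be checked against the per-video cost target in the non-functional requirements. The sketch below uses the line items above and assumes a 30-day month; the variable names are illustrative.

```typescript
// Monthly line items from the estimate above, in USD.
const lineItems = [
  { name: "S3 Storage", usd: 50 },
  { name: "S3 Transfer", usd: 100 },
  { name: "SQS", usd: 10 },
  { name: "Workers (EC2)", usd: 300 },
  { name: "Database", usd: 100 },
];

// Assumption: a 30-day month at the stated 1000 uploads/day.
const uploadsPerMonth = 1000 * 30;

const totalMonthly = lineItems.reduce((sum, item) => sum + item.usd, 0);
const costPerVideo = totalMonthly / uploadsPerMonth;
const withinTarget = costPerVideo < 0.5; // target: <$0.50 per video
// totalMonthly is 560; costPerVideo is about $0.019, well under the target
```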

## Security

- Pre-signed URLs expire in 1 hour
- Videos in private S3 buckets
- CloudFront signed URLs for delivery
- Rate limiting per user

## Monitoring

**Metrics:**

- Upload success rate
- Processing time (p50, p95, p99)
- Queue depth
- Worker CPU/memory
- Error rate by type

**Alerts:**

- Queue depth >1000
- Processing time p95 >10 minutes
- Error rate >5%
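
The alert thresholds above can be written as a small rule table so they are easy to test and review. This is a sketch only: the snapshot shape and alert names are assumptions, and a real deployment would normally define these rules in its monitoring system.

```typescript
// Hypothetical snapshot of the metrics listed above.
interface MetricsSnapshot {
  queueDepth: number;
  processingP95Minutes: number;
  errorRate: number; // fraction between 0 and 1
}

// Returns the names of the alerts that should fire for this snapshot.
function firingAlerts(m: MetricsSnapshot): string[] {
  const alerts: string[] = [];
  if (m.queueDepth > 1000) alerts.push("queue-depth");
  if (m.processingP95Minutes > 10) alerts.push("processing-p95");
  if (m.errorRate > 0.05) alerts.push("error-rate");
  return alerts;
}
```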

## Open Questions

- Video retention policy? (30 days? 1 year?)
- Maximum video duration? (affects processing time)
- Regional data residency requirements?


## Component Template

````markdown
### Component Name

**Responsibilities:**
- Primary responsibility
- Secondary responsibility

**Technology Stack:**
- Language: [Python/Node/Go]
- Framework: [Express/FastAPI/Gin]
- Database: [PostgreSQL/MongoDB]

**API/Interface:**
```typescript
interface ComponentAPI {
  method(params): ReturnType;
}
```

**Scaling Strategy:**
- Horizontal: Stateless, load balanced
- Vertical: Cache layer, connection pooling

**Dependencies:**
- Service A (for X)
- Database B (for persistence)

**Failure Handling:**
- Retry with exponential backoff
- Circuit breaker for downstream services
- Fallback to cached data
````


## Best Practices

1. **Start with requirements**: Functional + non-functional
2. **Draw diagrams first**: Visual clarity
3. **Define boundaries**: What's in scope vs out
4. **Document tradeoffs**: Every choice has costs
5. **Plan for failure**: What breaks and how to handle
6. **Consider scale**: Current, 10x, 100x
7. **Estimate costs**: Build vs buy decisions
8. **Leave open questions**: Don't pretend to know everything

## Output Checklist

- [ ] Requirements documented (functional + non-functional)
- [ ] High-level architecture diagram
- [ ] Component breakdown (3-7 components)
- [ ] Data flow documented
- [ ] Data model defined
- [ ] API contracts specified
- [ ] Scaling considerations (1x, 10x, 100x)
- [ ] Failure modes identified
- [ ] Cost estimate provided
- [ ] Security considerations
- [ ] Monitoring plan
