---
name: rate-limiting-patterns
description: Use when implementing rate limiting, throttling, or API quotas. Covers algorithms like token bucket and sliding window, plus distributed rate limiting patterns.
allowed-tools: Read, Glob, Grep
---
# Rate Limiting Patterns
Patterns for protecting APIs and services through rate limiting, throttling, and quota management.
## When to Use This Skill
- Implementing API rate limiting
- Choosing rate limiting algorithms
- Designing distributed rate limiting
- Setting up quota management
- Protecting against abuse
## Why Rate Limiting
Protection against:
- DDoS attacks
- Brute force attempts
- Resource exhaustion
- Cost overruns (cloud APIs)
- Cascading failures
Business benefits:
- Fair resource allocation
- Predictable performance
- Cost control
- SLA enforcement
## Rate Limiting Algorithms

### Token Bucket

Concept: Tokens are added to a bucket at a fixed rate; each request consumes one or more tokens.
Configuration:
- Bucket size (max tokens): 100
- Refill rate: 10 tokens/second
Behavior:

```
┌─────────────────────────┐
│ Bucket (capacity: 100)  │
│ ████████████░░░░░░░░░░  │  60 tokens available
└─────────────────────────┘
      ↑                 ↓
 10 tokens/s   Request takes 1 token
```
Allows bursts up to bucket size, then rate-limited.
Characteristics:
- Allows controlled bursts
- Simple to implement
- Memory efficient
- Most common algorithm
Implementation sketch (single process; `state` holds the current `tokens` and `last_update` timestamp):

```python
import time

def token_bucket_allow(state, rate, capacity, cost=1):
    now = time.monotonic()
    # Refill: add tokens for the elapsed time, capped at bucket capacity
    state["tokens"] = min(state["tokens"] + (now - state["last_update"]) * rate, capacity)
    state["last_update"] = now  # measure the next refill from this moment
    if state["tokens"] >= cost:
        state["tokens"] -= cost
        return True   # ALLOW
    return False      # DENY
```
### Leaky Bucket

Concept: Requests queue up and are processed at a fixed rate.
```
┌─────────────────────────┐
│ Queue (capacity: 100)   │
│ ██████████████████████  │  Requests waiting
└──────────┬──────────────┘
           │
           ▼  Process at fixed rate (10/sec)
     [Processing]
```
Smooths traffic to constant rate.
Characteristics:
- Smooth output rate
- No bursts allowed
- Requests may queue
- Good for downstream protection
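A minimal single-process sketch of the queue-and-drain behavior, assuming requests are callables and using a background thread as the drain (class and parameter names are illustrative):

```python
import queue
import threading
import time

class LeakyBucket:
    def __init__(self, capacity=100, rate=10):
        self.waiting = queue.Queue(maxsize=capacity)  # requests wait here
        self.interval = 1.0 / rate                    # seconds between processed requests
        threading.Thread(target=self._drain, daemon=True).start()

    def submit(self, request):
        try:
            self.waiting.put_nowait(request)  # queue the request if there is room
            return True
        except queue.Full:
            return False                      # bucket overflow: shed the request

    def _drain(self):
        while True:
            request = self.waiting.get()  # block until work arrives
            request()                     # process one request...
            time.sleep(self.interval)    # ...then wait, enforcing the fixed output rate
```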
### Fixed Window
Concept: Count requests in fixed time windows
Window: 1 minute, Limit: 100 requests
```
|-------- Window 1 --------|-------- Window 2 --------|
        95 requests                 ? requests
          [Allow]                  [Reset to 0]
```
Problem: Boundary burst

```
End of window 1:   100 requests
Start of window 2: 100 requests
Total:             200 requests in a ~1 second span
```
Characteristics:
- Simple to implement
- Memory efficient
- Boundary burst problem
- Good for simple use cases
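A minimal sketch of the fixed-window counter, assuming a single process and an in-memory dict keyed by client and window number (names are illustrative; a production version would evict stale keys):

```python
import time
from collections import defaultdict

WINDOW = 60   # seconds per window
LIMIT = 100   # requests per window

counters = defaultdict(int)

def allow(client_id):
    window = int(time.time() // WINDOW)  # window number; a new number resets the count
    key = (client_id, window)
    if counters[key] >= LIMIT:
        return False
    counters[key] += 1
    return True
```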
### Sliding Window Log
Concept: Track timestamp of each request
Window: 1 minute, Limit: 100
```
Requests: [t-55s, t-50s, t-45s, ..., t-5s, t-2s, now]
Count all requests in [now - 60s, now]
```
No boundary burst problem, but memory intensive.
Characteristics:
- Precise limiting
- No boundary issues
- Memory intensive (stores all timestamps)
- Good for strict limits
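A sketch for a single key, assuming an in-memory deque of timestamps (a distributed version would keep the log in shared storage):

```python
import time
from collections import deque

WINDOW = 60.0  # seconds
LIMIT = 100

log = deque()  # timestamps of accepted requests, oldest first (one log per key)

def allow():
    now = time.monotonic()
    while log and log[0] <= now - WINDOW:  # evict timestamps that left the window
        log.popleft()
    if len(log) >= LIMIT:
        return False
    log.append(now)  # memory cost: up to LIMIT timestamps per key
    return True
```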
### Sliding Window Counter
Concept: Weighted average of current and previous windows
```
Previous window: 80 requests
Current window:  30 requests (40% through the current window)

Weighted count = 80 × (1 − 0.4) + 30 = 78
Limit: 100 → ALLOW (78 < 100)
```
Characteristics:
- Approximation (usually good enough)
- Memory efficient
- Smooths boundary issues
- Best balance for most cases
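A sketch of the weighted-count check for a single key, matching the arithmetic above (single process; names are illustrative):

```python
import time

WINDOW = 60.0  # seconds
LIMIT = 100

state = {"window": -1, "previous": 0, "current": 0}

def allow():
    now = time.time()
    window = int(now // WINDOW)
    if window != state["window"]:
        # Roll the windows over; if more than one window passed, both counts are stale
        state["previous"] = state["current"] if window == state["window"] + 1 else 0
        state["current"] = 0
        state["window"] = window
    elapsed = (now % WINDOW) / WINDOW  # fraction of the current window elapsed
    weighted = state["previous"] * (1 - elapsed) + state["current"]
    if weighted >= LIMIT:
        return False
    state["current"] += 1
    return True
```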
## Algorithm Selection Guide
| Algorithm | Burst Handling | Memory | Precision | Use Case |
|---|---|---|---|---|
| Token Bucket | Allows bursts | Low | Good | General API limiting |
| Leaky Bucket | No bursts | Low | Good | Smooth rate enforcement |
| Fixed Window | Boundary burst | Very Low | Poor | Simple limits |
| Sliding Log | No bursts | High | Exact | Strict compliance |
| Sliding Counter | Minimal burst | Low | Good | Best general choice |
## Distributed Rate Limiting

### Challenge
Single node: Simple in-memory counter
Multiple nodes: Need coordination
Without coordination:

```
Node 1: 50 requests (under 100 limit)
Node 2: 50 requests (under 100 limit)
Node 3: 50 requests (under 100 limit)
Total: 150 requests (over the 100 limit!)
```
### Pattern 1: Centralized (Redis)

```
┌─────────┐   ┌─────────┐   ┌─────────┐
│ Node 1  │   │ Node 2  │   │ Node 3  │
└────┬────┘   └────┬────┘   └────┬────┘
     │             │             │
     └─────────────┼─────────────┘
                   │
           ┌───────▼───────┐
           │     Redis     │
           │   (counters)  │
           └───────────────┘
```
Pros: Accurate, consistent
Cons: Redis dependency, latency, single point of failure
### Pattern 2: Local + Sync

Each node gets a fraction of the limit:
- 3 nodes, 100 limit → 33 per node
Periodically sync to rebalance unused capacity.
Pros: Low latency, resilient
Cons: Less precise, sync complexity
### Pattern 3: Sticky Sessions

Route the same client to the same node (keyed by IP, API key, etc.).
Pros: Simple, no coordination needed
Cons: Uneven load, failover complexity
### Redis Implementation

Token bucket with Redis (`EVALSHA` takes the SHA1 of the previously loaded script):

```
EVALSHA <token_bucket_sha> 1 {key} {capacity} {refill_rate} {tokens_requested}
```
Script:
1. Get current tokens and timestamp
2. Calculate tokens to add since last request
3. If enough tokens, decrement and allow
4. Return tokens remaining
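A sketch of those four steps as an atomic Lua script invoked from Python via redis-py (the hash fields, key layout, and idle TTL are assumptions, not a standard):

```python
import time
import redis

# Runs atomically inside Redis: refill, check, consume, persist.
TOKEN_BUCKET_LUA = """
local capacity = tonumber(ARGV[1])
local rate     = tonumber(ARGV[2])
local cost     = tonumber(ARGV[3])
local now      = tonumber(ARGV[4])
local tokens = tonumber(redis.call('HGET', KEYS[1], 'tokens') or capacity)
local ts     = tonumber(redis.call('HGET', KEYS[1], 'ts') or now)
tokens = math.min(tokens + (now - ts) * rate, capacity)  -- refill since last request
local allowed = tokens >= cost
if allowed then tokens = tokens - cost end               -- consume on success
redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', KEYS[1], 3600)                      -- drop idle buckets
return {allowed and 1 or 0, tokens}                      -- decision + tokens remaining
"""

r = redis.Redis()
token_bucket = r.register_script(TOKEN_BUCKET_LUA)  # handles SCRIPT LOAD / EVALSHA

def allow(key, capacity=100, rate=10, cost=1):
    allowed, remaining = token_bucket(keys=[key], args=[capacity, rate, cost, time.time()])
    return allowed == 1
```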
## Rate Limit Headers

De facto standard headers for communicating limits to clients:
```
HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1640000000
```

When a request is rejected, also send `Retry-After: 30`.
Or the IETF draft standard headers:

```
RateLimit-Limit: 100
RateLimit-Remaining: 45
RateLimit-Reset: 30
```
## Rate Limit Response

```
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30

{
  "error": {
    "code": "RATE_LIMITED",
    "message": "Rate limit exceeded",
    "retry_after": 30,
    "limit": 100,
    "window": "1m"
  }
}
```
## Multi-Tier Rate Limiting
Apply limits at multiple levels:
Level 1: Global (protect infrastructure)
- 10,000 req/sec across all clients
Level 2: Per-tenant (fair allocation)
- 1,000 req/min per organization
Level 3: Per-user (prevent abuse)
- 100 req/min per user
Level 4: Per-endpoint (protect expensive operations)
- 10 req/min for /export endpoint
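The tiers compose by checking the coarsest limit first and denying on the first one that is exhausted. A sketch, assuming each tier exposes an `allow(key)` method like the limiters above (tier names and key functions are illustrative):

```python
# limiters: list of (tier_name, limiter, key_fn), ordered global → endpoint
def check(request, limiters):
    for name, limiter, key_fn in limiters:
        if not limiter.allow(key_fn(request)):
            return False, name  # report which tier rejected the request
    return True, None

# Example wiring (illustrative):
# limiters = [
#     ("global",   global_bucket,   lambda r: "global"),
#     ("tenant",   tenant_bucket,   lambda r: r.org_id),
#     ("user",     user_bucket,     lambda r: r.user_id),
#     ("endpoint", endpoint_bucket, lambda r: r.path),
# ]
```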
## Quota Management

### Quota vs Rate Limit
Rate Limit: Requests per time window (burst protection)
- 100 requests/minute
Quota: Total allocation over period (budget)
- 10,000 API calls/month
### Quota Tracking
Track usage:
- Per API key
- Per endpoint
- Per operation type
Alert thresholds:
- 80% usage: Warning notification
- 100% usage: Hard block or overage charges
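A sketch of the threshold logic per API key; the in-memory store and notifier stand in for persistent counters that reset each billing period (all names are illustrative):

```python
QUOTA = 10_000  # API calls per month
usage = {}      # per API key; a real system persists this

def send_warning(api_key, used):
    print(f"quota warning: {api_key} at {used}/{QUOTA}")  # stand-in for a real notifier

def record_call(api_key):
    used = usage.get(api_key, 0) + 1
    usage[api_key] = used
    if used > QUOTA:
        return "BLOCK"               # hard block, or switch to overage billing
    if used == int(QUOTA * 0.8):
        send_warning(api_key, used)  # fires once, exactly at the 80% mark
    return "ALLOW"
```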
## Best Practices

### Graceful Degradation

Instead of a hard block:
1. Reduce quality (lower resolution, fewer results)
2. Queue requests (process later)
3. Serve cached responses
4. Allow burst with penalty (slower recovery)
### Client-Side Handling
Implement exponential backoff:
1. Receive 429
2. Wait Retry-After (or 1s)
3. Retry
4. If 429 again, wait 2s
5. Continue doubling up to max (e.g., 60s)
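A client-side sketch of those steps, assuming the `requests` library and a `Retry-After` value in seconds (URL and bounds are illustrative):

```python
import random
import time
import requests

def get_with_backoff(url, max_delay=60.0, max_tries=8):
    delay = 1.0
    for _ in range(max_tries):
        resp = requests.get(url)
        if resp.status_code != 429:
            return resp
        # Prefer the server's Retry-After; fall back to the current backoff delay
        wait = float(resp.headers.get("Retry-After", delay))
        time.sleep(wait + random.uniform(0, 0.1 * wait))  # jitter avoids synchronized retries
        delay = min(delay * 2, max_delay)                 # double the fallback, capped
    raise RuntimeError("rate limited: retry budget exhausted")
```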
### Testing Rate Limits
Test scenarios:
- Burst traffic
- Sustained high traffic
- Clock skew (distributed systems)
- Recovery after limit
- Multiple client types
## Related Skills

- `api-design-fundamentals` - API design patterns
- `idempotency-patterns` - Safe retries
- `quality-attributes-taxonomy` - Performance attributes