rate-limiting-patterns

@majiayu000/claude-skill-registry

SKILL.md

---
name: rate-limiting-patterns
description: Use when implementing rate limiting, throttling, or API quotas. Covers algorithms like token bucket and sliding window, plus distributed rate limiting patterns.
allowed-tools: Read, Glob, Grep
---

Rate Limiting Patterns

Patterns for protecting APIs and services through rate limiting, throttling, and quota management.

When to Use This Skill

  • Implementing API rate limiting
  • Choosing rate limiting algorithms
  • Designing distributed rate limiting
  • Setting up quota management
  • Protecting against abuse

Why Rate Limiting

Protection against:
- DDoS attacks
- Brute force attempts
- Resource exhaustion
- Cost overruns (cloud APIs)
- Cascading failures

Business benefits:
- Fair resource allocation
- Predictable performance
- Cost control
- SLA enforcement

Rate Limiting Algorithms

Token Bucket

Concept: Tokens are added at a fixed rate; each request consumes one or more tokens

Configuration:
- Bucket size (max tokens): 100
- Refill rate: 10 tokens/second

Behavior:
┌─────────────────────────┐
│ Bucket (capacity: 100)  │
│ ████████████░░░░░░░░░░  │ 60 tokens available
└─────────────────────────┘
        ↑           ↓
   10 tokens/s   Request takes 1 token

Bursts are allowed up to the bucket size; sustained traffic is held to the refill rate.

Characteristics:

  • Allows controlled bursts
  • Simple to implement
  • Memory efficient
  • Most common algorithm

Implementation sketch:

token_bucket:
  tokens = min(tokens + (now - last_update) * rate, capacity)
  last_update = now
  if tokens >= cost:
    tokens -= cost
    return ALLOW
  return DENY
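
A runnable version of the sketch in Python (class and method names are illustrative, not from any particular library):

import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity          # max tokens (burst size)
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity
        self.tokens = min(self.tokens + (now - self.last) * self.refill_rate,
                          self.capacity)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False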

Leaky Bucket

Concept: Requests enter a queue and are processed at a fixed rate

┌─────────────────────────┐
│ Queue (capacity: 100)   │
│ ██████████████████████  │ Requests waiting
└──────────┬──────────────┘
           │
           ▼ Process at fixed rate (10/sec)
       [Processing]

Smooths traffic to a constant rate.

Characteristics:

  • Smooth output rate
  • No bursts allowed
  • Requests may queue
  • Good for downstream protection
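
A minimal admission sketch in Python, modeling the queue as a counter drained at the fixed rate; actual processing is assumed to happen in a separate worker (names are illustrative):

import time

class LeakyBucket:
    def __init__(self, capacity, drain_rate):
        self.capacity = capacity        # max requests allowed to wait
        self.drain_rate = drain_rate    # requests processed per second
        self.queued = 0.0
        self.last = time.monotonic()

    def try_enqueue(self):
        now = time.monotonic()
        # Drain the virtual queue at the fixed rate since the last check
        self.queued = max(0.0, self.queued - (now - self.last) * self.drain_rate)
        self.last = now
        if self.queued + 1 <= self.capacity:
            self.queued += 1
            return True     # accepted: will be processed in turn
        return False        # queue full: shed the request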

Fixed Window

Concept: Count requests in fixed time windows

Window: 1 minute, Limit: 100 requests

|-------- Window 1 --------|-------- Window 2 --------|
   95 requests → allowed       counter resets to 0

Problem: Boundary burst
End of window 1: 100 requests
Start of window 2: 100 requests
= 200 requests in a ~1-second span

Characteristics:

  • Simple to implement
  • Memory efficient
  • Boundary burst problem
  • Good for simple use cases
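
A minimal in-memory sketch in Python (the counters dict stands in for whatever shared store you use):

import time

def fixed_window_allow(counters, key, limit, window_secs=60):
    # One counter per (key, window) pair; old windows are simply abandoned
    window = int(time.time() // window_secs)
    bucket = (key, window)
    counters[bucket] = counters.get(bucket, 0) + 1
    return counters[bucket] <= limit

# counters = {}
# fixed_window_allow(counters, "user:42", limit=100)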

Sliding Window Log

Concept: Track the timestamp of every request

Window: 1 minute, Limit: 100

Requests: [t-55s, t-50s, t-45s, ..., t-5s, t-2s, now]
Count all requests in [now - 60s, now]

No boundary burst problem, but memory intensive.

Characteristics:

  • Precise limiting
  • No boundary issues
  • Memory intensive (stores all timestamps)
  • Good for strict limits
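
A minimal sketch in Python, keeping one deque of timestamps per client (per-client storage is assumed):

import time
from collections import deque

def sliding_log_allow(log, limit, window_secs=60):
    # log: deque of request timestamps for a single client
    now = time.monotonic()
    while log and log[0] <= now - window_secs:
        log.popleft()                  # evict entries outside the window
    if len(log) < limit:
        log.append(now)
        return True
    return False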

Sliding Window Counter

Concept: Estimate the rolling count from the current window plus a weighted share of the previous window

Previous window: 80 requests
Current window: 30 requests (40% through window)

Weighted count = 80 * 0.6 + 30 = 78 (60% of the previous window still overlaps)
Limit: 100
Result: ALLOW (78 < 100)

Characteristics:

  • Approximation (usually good enough)
  • Memory efficient
  • Smooths boundary issues
  • Best balance for most cases
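
A minimal sketch in Python mirroring the example above (the counts dict is a stand-in for a shared store):

import time

def sliding_counter_allow(counts, key, limit, window_secs=60):
    now = time.time()
    window = int(now // window_secs)
    elapsed = (now % window_secs) / window_secs    # fraction through current window
    prev = counts.get((key, window - 1), 0)
    curr = counts.get((key, window), 0)
    # Weight the previous window by how much of it still overlaps
    weighted = prev * (1 - elapsed) + curr
    if weighted < limit:
        counts[(key, window)] = curr + 1
        return True
    return False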

Algorithm Selection Guide

| Algorithm       | Burst Handling | Memory   | Precision | Use Case                 |
|-----------------|----------------|----------|-----------|--------------------------|
| Token Bucket    | Allows bursts  | Low      | Good      | General API limiting     |
| Leaky Bucket    | No bursts      | Low      | Good      | Smooth rate enforcement  |
| Fixed Window    | Boundary burst | Very low | Poor      | Simple limits            |
| Sliding Log     | No bursts      | High     | Exact     | Strict compliance        |
| Sliding Counter | Minimal burst  | Low      | Good      | Best general choice      |

Distributed Rate Limiting

Challenge

Single node: Simple in-memory counter
Multiple nodes: Need coordination

Without coordination:
Node 1: 50 requests (under 100 limit)
Node 2: 50 requests (under 100 limit)
Node 3: 50 requests (under 100 limit)
Total: 150 requests (over 100 limit!)

Pattern 1: Centralized (Redis)

┌─────────┐     ┌─────────┐     ┌─────────┐
│ Node 1  │     │ Node 2  │     │ Node 3  │
└────┬────┘     └────┬────┘     └────┬────┘
     │               │               │
     └───────────────┼───────────────┘
                     │
              ┌──────▼──────┐
              │    Redis    │
              │ (counters)  │
              └─────────────┘

Pros: Accurate, consistent
Cons: Redis dependency, latency, single point of failure

Pattern 2: Local + Sync

Each node enforces a fraction of the global limit locally:
- 3 nodes, 100 limit → 33 per node

Periodically sync to rebalance unused capacity.

Pros: Low latency, resilient
Cons: Less precise, sync complexity

Pattern 3: Sticky Sessions

Route the same client to the same node (keyed by IP, API key, etc.) so its counters stay local

Pros: Simple, no coordination needed
Cons: Uneven load, failover complexity

Redis Implementation

Token Bucket with Redis:

EVALSHA token_bucket_script 1 {key}
  {capacity} {refill_rate} {tokens_requested}

Script:
1. Get current tokens and timestamp
2. Calculate tokens to add since last request
3. If enough tokens, decrement and allow
4. Return tokens remaining
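
A sketch of those steps using redis-py with an embedded Lua script, which keeps the read-modify-write atomic on the server (key layout, argument order, and function names are assumptions):

import time
import redis  # assumes the redis-py client is installed

TOKEN_BUCKET_LUA = """
local capacity = tonumber(ARGV[1])
local rate     = tonumber(ARGV[2])
local cost     = tonumber(ARGV[3])
local now      = tonumber(ARGV[4])
-- Step 1: get current tokens and timestamp
local state  = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(state[1]) or capacity
local ts     = tonumber(state[2]) or now
-- Step 2: add tokens for the time elapsed, capped at capacity
tokens = math.min(tokens + (now - ts) * rate, capacity)
local allowed = 0
if tokens >= cost then
  -- Step 3: enough tokens, so decrement and allow
  tokens = tokens - cost
  allowed = 1
end
redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', now)
-- Step 4: return the decision and tokens remaining
return {allowed, math.floor(tokens)}
"""

r = redis.Redis()
token_bucket = r.register_script(TOKEN_BUCKET_LUA)

def allow(key, capacity=100, rate=10, cost=1):
    allowed, remaining = token_bucket(
        keys=[key], args=[capacity, rate, cost, time.time()])
    return bool(allowed)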

Rate Limit Headers

Standard headers to communicate limits to clients:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1640000000
Retry-After: 30  (when rate limited)

Or the IETF draft standard (RateLimit header fields, where Reset counts seconds until the window resets):
RateLimit-Limit: 100
RateLimit-Remaining: 45
RateLimit-Reset: 30

Rate Limit Response

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30

{
  "error": {
    "code": "RATE_LIMITED",
    "message": "Rate limit exceeded",
    "retry_after": 30,
    "limit": 100,
    "window": "1m"
  }
}

Multi-Tier Rate Limiting

Apply limits at multiple levels:

Level 1: Global (protect infrastructure)
  - 10,000 req/sec across all clients

Level 2: Per-tenant (fair allocation)
  - 1,000 req/min per organization

Level 3: Per-user (prevent abuse)
  - 100 req/min per user

Level 4: Per-endpoint (protect expensive operations)
  - 10 req/min for /export endpoint
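
A sketch of chaining the tiers in Python, assuming each tier exposes a limiter with an allow(key) method (the limiter interface and key functions are illustrative):

def allow_request(request, tiers):
    # tiers: list of (name, limiter, key_fn), checked outermost first
    for name, limiter, key_fn in tiers:
        if not limiter.allow(key_fn(request)):
            return False, name     # report which tier rejected
    return True, None

# Example wiring (limiter instances assumed to exist):
# tiers = [
#     ("global",   global_limiter,   lambda r: "global"),
#     ("tenant",   tenant_limiter,   lambda r: r.org_id),
#     ("user",     user_limiter,     lambda r: r.user_id),
#     ("endpoint", endpoint_limiter, lambda r: (r.user_id, r.path)),
# ]

Note that checking tiers in order consumes budget from earlier tiers even when a later tier rejects; avoiding that requires checking all tiers before committing any of them.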

Quota Management

Quota vs Rate Limit

Rate Limit: Requests per time window (burst protection)
  - 100 requests/minute

Quota: Total allocation over period (budget)
  - 10,000 API calls/month

Quota Tracking

Track usage:
- Per API key
- Per endpoint
- Per operation type

Alert thresholds:
- 80% usage: Warning notification
- 100% usage: Hard block or overage charges
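
A minimal tracking sketch in Python (the usage store, key scheme, and threshold actions are assumptions):

def record_usage(usage, key, quota, amount=1):
    # usage maps key -> calls consumed in the current billing period
    used = usage.get(key, 0) + amount
    usage[key] = used
    if used > quota:
        return "block"             # hard block or overage charges
    if used >= 0.8 * quota:
        return "warn"              # send the 80% notification
    return "ok"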

Best Practices

Graceful Degradation

Instead of a hard block:
1. Reduce quality (lower resolution, fewer results)
2. Queue requests (process later)
3. Serve cached responses
4. Allow burst with penalty (slower recovery)

Client-Side Handling

Implement exponential backoff:
1. Receive 429
2. Wait Retry-After (or 1s)
3. Retry
4. If 429 again, wait 2s
5. Continue doubling up to max (e.g., 60s)
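
A client-side sketch of that loop using only Python's standard library, with jitter added to avoid synchronized retries (function name and defaults are illustrative):

import random
import time
import urllib.request
from urllib.error import HTTPError

def get_with_backoff(url, max_wait=60, attempts=6):
    wait = 1.0
    for _ in range(attempts):
        try:
            return urllib.request.urlopen(url)
        except HTTPError as err:
            if err.code != 429:
                raise                             # only retry rate limits
            retry_after = err.headers.get("Retry-After")
            # Assumes the delta-seconds form of Retry-After
            delay = float(retry_after) if retry_after and retry_after.isdigit() else wait
            time.sleep(min(delay, max_wait) * random.uniform(0.5, 1.0))  # jitter
            wait = min(wait * 2, max_wait)        # double up to the cap
    raise RuntimeError("rate limited: retries exhausted")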

Testing Rate Limits

Test scenarios:
- Burst traffic
- Sustained high traffic
- Clock skew (distributed systems)
- Recovery after limit
- Multiple client types

Related Skills

  • api-design-fundamentals - API design patterns
  • idempotency-patterns - Safe retries
  • quality-attributes-taxonomy - Performance attributes