Prompt Caching
Cache stable LLM prompt prefixes to cut input-token cost by up to 90% on cache hits.
When to Use
- Same system prompts across requests
- Few-shot examples in prompts
- Schema documentation in prompts
- High-volume API calls
Supported Models (2026)
| Provider | Models |
| --- | --- |
| Claude | Opus 4.1, Opus 4, Sonnet 4.5, Sonnet 4, Sonnet 3.7, Haiku 4.5, Haiku 3.5, Haiku 3 |
| OpenAI | gpt-4o, gpt-4o-mini, o1, o1-mini (automatic caching) |
Claude Prompt Caching
def build_cached_messages(
    system_prompt: str,
    few_shot_examples: str | None,
    user_content: str,
    use_extended_cache: bool = False,
) -> list[dict]:
    """Build messages with cache breakpoints.

    Cache structure (processing order: tools → system → messages):
      1. System prompt (cached)
      2. Few-shot examples (cached)
      ─────── CACHE BREAKPOINT ───────
      3. User content (NOT cached)
    """
    # TTL: "5m" (default, 1.25x write cost) or "1h" (extended, 2x write cost)
    ttl = "1h" if use_extended_cache else "5m"

    content_parts = []

    # Breakpoint 1: System prompt
    content_parts.append({
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral", "ttl": ttl},
    })

    # Breakpoint 2: Few-shot examples (up to 4 breakpoints allowed)
    if few_shot_examples:
        content_parts.append({
            "type": "text",
            "text": few_shot_examples,
            "cache_control": {"type": "ephemeral", "ttl": ttl},
        })

    # Dynamic content (NOT cached)
    content_parts.append({
        "type": "text",
        "text": user_content,
    })

    return [{"role": "user", "content": content_parts}]
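A minimal usage sketch with the Anthropic Python SDK (the model ID and the prompt constants are placeholders, not part of the helper above):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

messages = build_cached_messages(
    system_prompt=LONG_SYSTEM_PROMPT,     # placeholder: your stable system prompt
    few_shot_examples=FEW_SHOT_EXAMPLES,  # placeholder: your stable examples
    user_content="Summarize the attached report.",
)

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model ID
    max_tokens=1024,
    messages=messages,
)

# Usage fields show whether the prefix was written to or read from the cache
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)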
Cache Pricing (2026)
┌─────────────────────────────────────────────────────────────┐
│ Cache Cost Multipliers (relative to base input price) │
├─────────────────────────────────────────────────────────────┤
│ 5-minute cache write: 1.25x base input price │
│ 1-hour cache write: 2.00x base input price │
│ Cache read: 0.10x base input price (90% off!) │
└─────────────────────────────────────────────────────────────┘
Example: Claude Sonnet 4 @ $3/MTok input
Without Prompt Caching:
System prompt: 2,000 tokens @ $3/MTok = $0.006
Few-shot examples: 5,000 tokens @ $3/MTok = $0.015
User content: 10,000 tokens @ $3/MTok = $0.030
───────────────────────────────────────────────────
Total: 17,000 tokens = $0.051
With 5m Caching (first request = cache write):
Cached prefix: 7,000 tokens @ $3.75/MTok = $0.02625 (1.25x)
User content: 10,000 tokens @ $3/MTok = $0.03000
Total first req: = $0.05625
With 5m Caching (subsequent = cache read):
Cached prefix: 7,000 tokens @ $0.30/MTok = $0.0021 (0.1x)
User content: 10,000 tokens @ $3/MTok = $0.0300
Total cached req: = $0.0321
Savings: 37% per cached request, break-even after 2 requests
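The same arithmetic as a small helper for sanity-checking your own token counts (prices and multipliers are passed in, mirroring the table above):

def cache_request_cost(
    cached_prefix_tokens: int,
    dynamic_tokens: int,
    base_price_per_mtok: float,
    is_cache_write: bool,
    write_multiplier: float = 1.25,  # 2.0 for the 1h TTL
    read_multiplier: float = 0.10,
) -> float:
    """Input cost (USD) of one request under the multipliers above."""
    prefix_mult = write_multiplier if is_cache_write else read_multiplier
    prefix_cost = cached_prefix_tokens / 1e6 * base_price_per_mtok * prefix_mult
    dynamic_cost = dynamic_tokens / 1e6 * base_price_per_mtok
    return prefix_cost + dynamic_cost

# Reproduces the Sonnet 4 example: 0.05625 (write) and 0.0321 (read)
print(cache_request_cost(7_000, 10_000, 3.0, is_cache_write=True))
print(cache_request_cost(7_000, 10_000, 3.0, is_cache_write=False))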
Extended Cache (1-hour TTL)
Use 1-hour cache when:
- Prompt reused > 10 times per hour
- System prompts are highly stable
- Token count > 10k (maximize savings)
# Extended cache: 2x write cost but persists 12x longer
"cache_control": {"type": "ephemeral", "ttl": "1h"}
# Break-even vs. no caching: extra write cost (2.0x - 1.0x = 1.0x)
# ÷ 0.9x saved per read ≈ 2 reads; prefer 1h over 5m mainly when
# requests arrive too far apart to keep the 5m cache warm
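The same comparison as a tiny helper (a sketch; it looks only at input-token cost and ignores output tokens):

import math

def breakeven_reads(write_multiplier: float, read_multiplier: float = 0.10) -> int:
    """Cache reads needed before total input cost beats no caching.

    Extra cost paid up front = (write_multiplier - 1.0) x prefix price;
    each subsequent read saves (1.0 - read_multiplier) x prefix price.
    """
    extra_write = write_multiplier - 1.0
    saving_per_read = 1.0 - read_multiplier
    return math.ceil(extra_write / saving_per_read)

print(breakeven_reads(1.25))  # 5m cache: 1 read
print(breakeven_reads(2.0))   # 1h cache: 2 reads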
OpenAI Automatic Caching
# OpenAI caches prompt prefixes (1,024+ tokens) automatically
# No cache_control markers needed
from openai import AsyncOpenAI

client = AsyncOpenAI()

response = await client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},  # Cached
        {"role": "user", "content": user_content},     # Not cached
    ],
)

# Check cache usage in response
cached_tokens = response.usage.prompt_tokens_details.cached_tokens
Cache Processing Order
The cached prefix covers the prompt in processing order:
1. Tools (cached first)
2. System messages (cached second)
3. User messages (cached last)
⚠️ Extended thinking changes invalidate message caches
(but NOT system/tools caches)
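A sketch of how breakpoints sit across tools and system in a single request (per the Anthropic API, a cache_control marker on the last tool caches every tool definition before it; the tool itself is illustrative):

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model ID
    max_tokens=1024,
    tools=[
        {
            "name": "lookup_order",  # illustrative tool
            "description": "Look up an order by ID.",
            "input_schema": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
            # Breakpoint on the LAST tool caches the whole tools block
            "cache_control": {"type": "ephemeral"},
        },
    ],
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # placeholder: stable system prompt
            # Breakpoint here extends the cached prefix to tools + system
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": user_content}],
)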
Monitoring Cache Effectiveness
# Track these fields in the API response
response = await client.messages.create(...)

cache_created = response.usage.cache_creation_input_tokens  # New cache
cache_read = response.usage.cache_read_input_tokens         # Cache hit
regular = response.usage.input_tokens                       # Not cached

# Calculate cache hit rate
if cache_created + cache_read > 0:
    hit_rate = cache_read / (cache_created + cache_read)
    print(f"Cache hit rate: {hit_rate:.1%}")
Best Practices
# ✅ Good: Long, stable prefix first
messages = [
    {"role": "system", "content": LONG_SYSTEM_PROMPT},
    {"role": "user", "content": FEW_SHOT_EXAMPLES},
    {"role": "user", "content": user_input},  # Variable
]

# ❌ Bad: Variable content early (breaks cache)
messages = [
    {"role": "user", "content": user_input},  # Breaks cache!
    {"role": "system", "content": LONG_SYSTEM_PROMPT},
]
Key Decisions
| Decision | Recommendation |
| --- | --- |
| Min prefix size | 1,024 tokens (Claude) |
| Breakpoint count | 2-4 per request |
| Content order | Stable prefix first |
| Default TTL | 5m for most cases |
| Extended TTL | 1h if >10 reads/hour |
Common Mistakes
- Variable content before cached prefix
- Too many breakpoints (overhead)
- Prefix too short (min 1024 tokens)
- Not checking cache_read_input_tokens
- Using 1h TTL for infrequent calls (wastes 2x write)
Related Skills
semantic-caching - Redis similarity caching
cache-cost-tracking - Cost monitoring
llm-streaming - Streaming with caching