| name | prompt-caching |
| description | Prompt caching for Claude API to reduce latency by up to 85% and costs by up to 90%. Activate for cache_control, ephemeral caching, cache breakpoints, and performance optimization. |
| allowed-tools | Bash, Read, Write, Edit, Glob, Grep, Task |
| triggers | cache, prompt-caching, cache_control, ephemeral, cache breakpoint, latency, performance, cost optimization |
| dependencies | llm-integration |
| related-skills | extended-thinking, batch-processing |
Prompt Caching Skill
Leverage Anthropic's prompt caching to dramatically reduce latency and costs for repeated prompts.
When to Use This Skill
- RAG systems with large static documents
- Multi-turn conversations with long instructions
- Code analysis with large codebase context
- Batch processing with shared prefixes
- Document analysis and summarization
Core Concepts
Cache Control Placement
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant with access to a large knowledge base...",
            "cache_control": {"type": "ephemeral"}  # Cache this content
        }
    ],
    messages=[{"role": "user", "content": "What is...?"}]
)
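On the first call the static prefix is written to the cache (reported as cache_creation_input_tokens); any identical call within the TTL reads it back (cache_read_input_tokens) instead of reprocessing it.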
Cache Hierarchy
The cache prefix is built in this order, and a breakpoint caches everything up to and including the block it is attached to (see the sketch below):
- Tools - Tool definitions cached first
- System - System prompts cached second
- Messages - Conversation history cached last
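As a minimal sketch of all three levels in one request - the get_weather tool and the system_instructions variable are placeholders for illustration, not part of this skill:
# Sketch: one breakpoint at each level of the hierarchy.
# get_weather and system_instructions are placeholder examples.
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=[
        {
            "name": "get_weather",
            "description": "Get the current weather for a location.",
            "input_schema": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"]
            },
            "cache_control": {"type": "ephemeral"}  # 1. Tools cached first
        }
    ],
    system=[
        {
            "type": "text",
            "text": system_instructions,
            "cache_control": {"type": "ephemeral"}  # 2. System cached second
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What's the weather in Paris?",
                    "cache_control": {"type": "ephemeral"}  # 3. Messages cached last
                }
            ]
        }
    ]
)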
TTL Options
| TTL | Write Cost | Read Cost | Use Case |
|---|---|---|---|
| 5 minutes (default) | 1.25x base | 0.1x base | Interactive sessions |
| 1 hour | 2.0x base | 0.1x base | Batch processing, stable docs |
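As a rough break-even check using only the multipliers in the table above (not real prices): the write premium is paid once, and every later cache hit saves the difference between the full 1.0x rate and the 0.1x read rate.
# Break-even estimate from the table above: how many cache reads are needed
# before caching is cheaper than resending the prompt uncached every time.
def break_even_reads(write_multiplier, read_multiplier=0.1):
    extra_write_cost = write_multiplier - 1.0   # premium paid once, on the cache write
    saving_per_read = 1.0 - read_multiplier     # saved on every later cache hit
    return extra_write_cost / saving_per_read

print(break_even_reads(1.25))  # 5-minute TTL: ~0.28, so the first hit already pays it back
print(break_even_reads(2.0))   # 1-hour TTL: ~1.1, so it pays back by the second hit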
Cache Requirements
- Minimum cacheable length: 1024-4096 tokens (varies by model); shorter prefixes are processed normally, just not cached (see the token-count check below)
- Maximum breakpoints: 4 per request
- Supported models: current Claude models, including Opus 4.5, Sonnet 4.5, and Haiku 4.5
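One way to check that content clears the minimum before adding a breakpoint is the token counting endpoint. This is a minimal sketch, assuming a 1024-token threshold and reusing the large_document_content placeholder from Pattern 1 below:
# Sketch: verify the static prefix clears the model's minimum before caching it.
# Assumes a 1024-token threshold; large_document_content is a placeholder.
count = client.messages.count_tokens(
    model="claude-sonnet-4-20250514",
    system=[{"type": "text", "text": large_document_content}],
    messages=[{"role": "user", "content": "placeholder"}]
)
if count.input_tokens >= 1024:
    system = [{
        "type": "text",
        "text": large_document_content,
        "cache_control": {"type": "ephemeral"}
    }]
else:
    # Too short to cache; send it uncached
    system = [{"type": "text", "text": large_document_content}]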
Implementation Patterns
Pattern 1: Single Breakpoint (Recommended)
# Best for: Document analysis, Q&A with static context
system = [
    {
        "type": "text",
        "text": large_document_content,
        "cache_control": {"type": "ephemeral"}  # Single breakpoint at end
    }
]
Pattern 2: Multi-Turn Conversation
# The cache grows with the conversation: the breakpoint on the newest
# user message caches the entire history before it
messages = [
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Follow-up question",
                "cache_control": {"type": "ephemeral"}  # Cache the conversation up to here
            }
        ]
    }
]
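On each later turn, move the breakpoint to the newest user message. The previously cached prefix is still reused because the system looks back roughly 20 content blocks from each breakpoint for earlier cache entries (see the lookback note under Best Practices).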
Pattern 3: RAG with Multiple Breakpoints
# Two breakpoints: if retrieved_documents changes between requests,
# the stable instruction prefix (breakpoint 1) still gets cache hits
system = [
    {
        "type": "text",
        "text": "Tool definitions and instructions",
        "cache_control": {"type": "ephemeral"}  # Breakpoint 1: stable instructions
    },
    {
        "type": "text",
        "text": retrieved_documents,
        "cache_control": {"type": "ephemeral"}  # Breakpoint 2: retrieved documents
    }
]
Pattern 4: Batch Processing with 1-Hour TTL
# Warm the cache before the batch
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=100,
    system=[{
        "type": "text",
        "text": shared_context,
        "cache_control": {"type": "ephemeral", "ttl": "1h"}
    }],
    messages=[{"role": "user", "content": "Initialize cache"}]
)

# Now run the batch - subsequent requests hit the cache
for item in batch_items:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": shared_context,
            "cache_control": {"type": "ephemeral", "ttl": "1h"}
        }],
        messages=[{"role": "user", "content": item}]
    )
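The warm-up call keeps max_tokens small because only the cached prefix matters. The batch requests then hit the cache as long as shared_context and its cache_control block are repeated exactly and each request arrives within the 1-hour window; the TTL is refreshed every time the cached content is read.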
Performance Monitoring
Check Cache Usage
response = client.messages.create(...)
# Monitor these fields
cache_write = response.usage.cache_creation_input_tokens # New cache written
cache_read = response.usage.cache_read_input_tokens # Cache hit!
uncached = response.usage.input_tokens # After breakpoint
print(f"Cache hit rate: {cache_read / (cache_read + cache_write + uncached) * 100:.1f}%")
Cost Calculation
def calculate_cost(usage, model="claude-sonnet-4-20250514"):
    # Example rates (check current pricing); the 1.25x write multiplier assumes the 5-minute TTL
    base_input_rate = 0.003  # USD per 1K input tokens
    write_cost = (usage.cache_creation_input_tokens / 1000) * base_input_rate * 1.25
    read_cost = (usage.cache_read_input_tokens / 1000) * base_input_rate * 0.1
    uncached_cost = (usage.input_tokens / 1000) * base_input_rate
    return write_cost + read_cost + uncached_cost
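For example, combined with the usage fields from the monitoring snippet above:
cost = calculate_cost(response.usage)
print(f"Estimated request cost: ${cost:.4f}")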
Cache Invalidation
Changes that invalidate cache:
| Change | Impact |
|---|---|
| Tool definitions | Entire cache invalidated |
| System prompt | System and message caches invalidated (tool cache unaffected) |
| Any content before a breakpoint | That breakpoint and all later breakpoints invalidated |
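Appending new turns after a cached prefix does not invalidate it; only changes to content at or before a breakpoint do. Unused cache entries simply expire once their TTL passes without a read.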
Best Practices
DO:
- Place breakpoint at END of static content
- Keep tools/instructions stable across requests
- Use 1-hour TTL for batch processing
- Monitor cache_read_input_tokens for savings
DON'T:
- Place breakpoint in middle of dynamic content
- Change tool definitions frequently
- Expect caching below the minimum token count (1024+ depending on model)
- Ignore the ~20-block lookback window: cache hits are only checked about 20 content blocks back from each breakpoint, so very long prompts may need extra breakpoints
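A minimal helper sketch following the DO rules above (build_request and its arguments are illustrative, not part of the API):
# Hypothetical helper: static content goes first with the breakpoint at its end;
# dynamic content goes in the messages, after every breakpoint.
def build_request(static_instructions, dynamic_question, ttl=None):
    cache_control = {"type": "ephemeral"}
    if ttl:
        cache_control["ttl"] = ttl  # e.g. "1h" for batch workloads
    return {
        "system": [{
            "type": "text",
            "text": static_instructions,
            "cache_control": cache_control  # breakpoint at the END of static content
        }],
        "messages": [{"role": "user", "content": dynamic_question}]
    }

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    **build_request(large_document_content, "What is the key finding?", ttl="1h")
)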
Integration with Extended Thinking
# Cache + Extended Thinking
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    system=[{
        "type": "text",
        "text": large_context,
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": "Analyze this..."}]
)
See Also
- [[llm-integration]] - Claude API basics
- [[extended-thinking]] - Deep reasoning
- [[batch-processing]] - Bulk processing