| name | dspy-framework |
| description | DSPy declarative framework for automatic prompt optimization treating prompts as code with systematic evaluation and compilers. Use when optimizing prompts systematically, building production LLM systems, implementing RAG/classification tasks, or requiring version-controlled reproducible prompts. |
DSPy Framework
progressive_disclosure:
  entry_point:
    summary: "Declarative framework for automatic prompt optimization treating prompts as code"
    when_to_use:
      - "When optimizing prompts systematically with evaluation data"
      - "When building production LLM systems requiring accuracy improvements"
      - "When implementing RAG, classification, or structured extraction tasks"
      - "When version-controlled, reproducible prompts are needed"
    quick_start:
      - "pip install dspy-ai"
      - "Define signature: class QA(dspy.Signature): question = dspy.InputField(); answer = dspy.OutputField()"
      - "Create module: qa = dspy.ChainOfThought(QA)"
      - "Optimize: optimizer.compile(qa, trainset=examples)"
  token_estimate:
    entry: 75
    full: 5500
Core Philosophy
DSPy (Declarative Self-improving Python) shifts focus from manual prompt engineering to programming language models. Treat prompts as code with:
- Declarative signatures defining inputs/outputs
- Automatic optimization via compilers
- Version control and systematic testing
- Reproducible results across model changes
Key Principle: Don't write prompts manually—define task specifications and let DSPy optimize them.
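The rest of this document walks through each piece in detail; as a quick orientation, a minimal end-to-end sketch of the workflow looks like this (model name, API key, and metric are placeholders):

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Configure a language model (any supported provider; see Language Model Configuration)
dspy.configure(lm=dspy.LM('openai/gpt-4o-mini', api_key='YOUR_API_KEY'))

# 1. Declare the task, not the prompt
qa = dspy.ChainOfThought("question -> answer")

# 2. Provide labeled examples and a metric
trainset = [
    dspy.Example(question="Capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="What is 2+2?", answer="4").with_inputs("question"),
]

def exact_match(example, prediction, trace=None):
    return example.answer.lower() == prediction.answer.lower()

# 3. Let the optimizer compile the prompt, then persist it
optimized_qa = BootstrapFewShot(metric=exact_match).compile(qa, trainset=trainset)
optimized_qa.save("qa_optimized.json")
```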
Core Concepts
Signatures: Defining Task Interfaces
Signatures specify what your LM module should do (inputs → outputs) without saying how.
Basic Signature:
import dspy
# Inline signature (quick)
qa_module = dspy.ChainOfThought("question -> answer")
# Class-based signature (recommended for production)
class QuestionAnswer(dspy.Signature):
"""Answer questions with short factual answers."""
question = dspy.InputField()
answer = dspy.OutputField(desc="often between 1 and 5 words")
# Use signature
qa = dspy.ChainOfThought(QuestionAnswer)
response = qa(question="What is the capital of France?")
print(response.answer) # "Paris"
Advanced Signatures with Type Hints:
from typing import List
class DocumentSummary(dspy.Signature):
"""Generate concise document summaries."""
document: str = dspy.InputField(desc="Full text to summarize")
key_points: List[str] = dspy.OutputField(desc="3-5 bullet points")
summary: str = dspy.OutputField(desc="2-3 sentence summary")
sentiment: str = dspy.OutputField(desc="positive, negative, or neutral")
# Type hints provide strong typing and validation
summarizer = dspy.ChainOfThought(DocumentSummary)
result = summarizer(document="Long document text...")
Field Descriptions:
- Short, descriptive phrases (not full sentences)
- Examples: desc="often between 1 and 5 words", desc="JSON format"
- Used by optimizers to improve prompt quality
Modules: Building Blocks
Modules are DSPy's reasoning patterns—replacements for manual prompt engineering.
ChainOfThought (CoT):
# Zero-shot reasoning
class Reasoning(dspy.Signature):
"""Solve complex problems step by step."""
problem = dspy.InputField()
solution = dspy.OutputField()
cot = dspy.ChainOfThought(Reasoning)
result = cot(problem="Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many total?")
print(result.solution) # Includes reasoning steps automatically
print(result.rationale) # Access the chain-of-thought reasoning
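To see the prompt and completion DSPy actually produced for the call above, recent releases expose an inspection helper; a quick sketch (helper name per the current DSPy docs, and note that newer versions expose the reasoning text as result.reasoning rather than result.rationale):

```python
# Print the most recent LM call (rendered prompt + completion) for debugging
dspy.inspect_history(n=1)
```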
Retrieve Module (RAG):
class RAGSignature(dspy.Signature):
"""Answer questions using retrieved context."""
question = dspy.InputField()
context = dspy.InputField(desc="relevant passages")
answer = dspy.OutputField(desc="answer based on context")
# Combine retrieval + reasoning
retriever = dspy.Retrieve(k=3) # Retrieve top 3 passages
rag = dspy.ChainOfThought(RAGSignature)
# Use in pipeline
question = "What is quantum entanglement?"
context = retriever(question).passages
answer = rag(question=question, context=context)
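dspy.Retrieve pulls passages from whatever retrieval model (RM) is configured globally, so a retriever must be set up before the pipeline above will run. A minimal sketch using the ColBERTv2 Wikipedia endpoint from the DSPy tutorials (the URL is illustrative and may change):

```python
# Configure a retrieval model alongside the LM; dspy.Retrieve uses it implicitly
colbert = dspy.ColBERTv2(url="http://20.102.90.50:2017/wiki17_abstracts")
dspy.configure(lm=lm, rm=colbert)
```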
ReAct (Reasoning + Acting):
class ResearchTask(dspy.Signature):
"""Research a topic using tools."""
topic = dspy.InputField()
findings = dspy.OutputField()
# ReAct interleaves reasoning with tool calls
react = dspy.ReAct(ResearchTask, tools=[web_search, calculator])
result = react(topic="Apple stock price change last month")
# Automatically uses tools when needed
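The web_search and calculator tools above are assumed to be ordinary Python callables; ReAct uses their names, signatures, and docstrings to decide when to invoke them. Toy stand-ins might look like this:

```python
def calculator(expression: str) -> str:
    """Evaluate a basic arithmetic expression and return the result."""
    return str(eval(expression))  # demo only: never eval untrusted input

def web_search(query: str) -> str:
    """Search the web and return a short text snippet (stubbed for illustration)."""
    return f"Placeholder search result for: {query}"
```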
ProgramOfThought:
# Generate and execute Python code
class MathProblem(dspy.Signature):
"""Solve math problems by writing Python code."""
problem = dspy.InputField()
code = dspy.OutputField(desc="Python code to solve problem")
result = dspy.OutputField(desc="final numerical answer")
pot = dspy.ProgramOfThought(MathProblem)
answer = pot(problem="Calculate compound interest on $1000 at 5% for 10 years")
Custom Modules:
class MultiStepRAG(dspy.Module):
"""Custom module combining retrieval and reasoning."""
def __init__(self, num_passages=3):
super().__init__()
self.retrieve = dspy.Retrieve(k=num_passages)
self.generate = dspy.ChainOfThought("context, question -> answer")
def forward(self, question):
# Retrieve relevant passages
context = self.retrieve(question).passages
# Generate answer with context
prediction = self.generate(context=context, question=question)
# Return with metadata
return dspy.Prediction(
answer=prediction.answer,
context=context,
rationale=prediction.rationale
)
# Use custom module
rag = MultiStepRAG(num_passages=5)
optimized_rag = optimizer.compile(rag, trainset=examples)
Optimizers: Automatic Prompt Improvement
Optimizers compile your high-level program into optimized prompts or fine-tuned weights.
BootstrapFewShot
Best For: Small datasets (10-50 examples), quick optimization
Optimizes: Few-shot examples only
from dspy.teleprompt import BootstrapFewShot
# Define metric function
def accuracy_metric(example, prediction, trace=None):
"""Evaluate prediction correctness."""
return example.answer.lower() == prediction.answer.lower()
# Configure optimizer
optimizer = BootstrapFewShot(
metric=accuracy_metric,
max_bootstrapped_demos=4, # Max examples to bootstrap
max_labeled_demos=16, # Max labeled examples to consider
max_rounds=1, # Bootstrapping rounds
max_errors=10 # Stop after N errors
)
# Training examples
trainset = [
dspy.Example(question="What is 2+2?", answer="4").with_inputs("question"),
dspy.Example(question="Capital of France?", answer="Paris").with_inputs("question"),
# ... more examples
]
# Compile program
qa_module = dspy.ChainOfThought("question -> answer")
optimized_qa = optimizer.compile(
student=qa_module,
trainset=trainset
)
# Save optimized program
optimized_qa.save("qa_optimized.json")
How It Works (sketched below):
- Uses your program to generate outputs on training data
- Filters successful traces using your metric
- Selects representative examples as demonstrations
- Returns optimized program with best few-shot examples
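A conceptual sketch of that loop, for intuition only (this is not DSPy's internal implementation; BootstrapFewShot also handles tracing and demo formatting):

```python
def bootstrap_demos(program, trainset, metric, max_demos=4):
    """Collect successful traces to reuse as few-shot demonstrations."""
    demos = []
    for example in trainset:
        prediction = program(**example.inputs())   # run the unoptimized student
        if metric(example, prediction):            # keep only traces the metric accepts
            demos.append((example, prediction))
        if len(demos) >= max_demos:
            break
    return demos
```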
BootstrapFewShotWithRandomSearch
Best For: Medium datasets (50-300 examples), better exploration
Optimizes: Few-shot examples with candidate exploration
from dspy.teleprompt import BootstrapFewShotWithRandomSearch
config = dict(
max_bootstrapped_demos=4,
max_labeled_demos=4,
num_candidate_programs=10, # Explore 10 candidate programs
num_threads=4 # Parallel optimization
)
optimizer = BootstrapFewShotWithRandomSearch(
metric=accuracy_metric,
**config
)
optimized_program = optimizer.compile(
qa_module,
trainset=training_examples,
valset=validation_examples # Optional validation set
)
# Compare candidates
print(f"Best program score: {optimizer.best_score}")
Advantage: Explores multiple candidate programs in parallel and selects the best performer via random search.
MIPROv2 (State-of-the-Art 2025)
Best For: Large datasets (300+ examples), production systems
Optimizes: Instructions AND few-shot examples jointly via Bayesian optimization
import dspy
from dspy.teleprompt import MIPROv2
# Initialize language model
lm = dspy.LM('openai/gpt-4o-mini', api_key='YOUR_API_KEY')
dspy.configure(lm=lm)
# Define comprehensive metric
def quality_metric(example, prediction, trace=None):
"""Multi-dimensional quality scoring."""
correct = example.answer.lower() in prediction.answer.lower()
reasonable_length = 10 < len(prediction.answer) < 200
has_reasoning = hasattr(prediction, 'rationale') and len(prediction.rationale) > 20
# Weighted composite score
score = (
correct * 1.0 +
reasonable_length * 0.2 +
has_reasoning * 0.3
)
return score / 1.5 # Normalize to [0, 1]
# Initialize MIPROv2 with auto-configuration
teleprompter = MIPROv2(
metric=quality_metric,
auto="medium", # Options: "light", "medium", "heavy"
num_candidates=10, # Number of instruction candidates to explore
init_temperature=1.0 # Temperature for instruction generation
)
# Optimize program
optimized_program = teleprompter.compile(
dspy.ChainOfThought("question -> answer"),
trainset=training_examples,
num_trials=100, # Bayesian optimization trials
max_bootstrapped_demos=4,
max_labeled_demos=8
)
# Save for production
optimized_program.save("production_qa_model.json")
MIPROv2 Auto-Configuration Modes:
- light: Fast optimization, ~20 trials, best for iteration (15-30 min)
- medium: Balanced optimization, ~50 trials, recommended default (30-60 min)
- heavy: Exhaustive search, ~100+ trials, highest quality (1-3 hours)
How MIPROv2 Works (simplified sketch below):
- Bootstrap Candidates: Generates few-shot example candidates from training data
- Propose Instructions: Creates instruction variations grounded in task dynamics
- Bayesian Optimization: Uses surrogate model to find optimal instruction + example combinations
- Joint Optimization: Optimizes both components together (not separately) for synergy
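As a simplified illustration of the joint search, the sketch below swaps the Bayesian surrogate model for plain random sampling; program_factory is a hypothetical helper that builds a program from one instruction plus one demo set:

```python
import random

def joint_search(instructions, demo_pool, devset, metric, program_factory, trials=20):
    """Score candidate (instruction, demos) pairs together and keep the best pair."""
    best, best_score = None, float("-inf")
    for _ in range(trials):
        instruction = random.choice(instructions)                     # candidate instruction
        demos = random.sample(demo_pool, k=min(4, len(demo_pool)))    # candidate few-shot set
        program = program_factory(instruction, demos)                 # hypothetical builder
        score = sum(metric(ex, program(**ex.inputs())) for ex in devset) / len(devset)
        if score > best_score:
            best, best_score = (instruction, demos), score
    return best, best_score
```

The point of optimizing jointly is that an instruction and a demo set that each score well in isolation may not be the best combination together.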
Performance Gains (2025 study; relative improvements over baseline):
- Prompt Evaluation: +38.5% accuracy (46.2% → 64.0%)
- Guardrail Enforcement: +16.9% accuracy (72.1% → 84.3%)
- Code Generation: +21.9% accuracy (58.4% → 71.2%)
- Hallucination Detection: +20.8% accuracy (65.8% → 79.5%)
- Agent Routing: +18.5% accuracy (69.3% → 82.1%)
KNN Few-Shot Selector
Best For: Dynamic example selection based on query similarity
from dspy.teleprompt import KNNFewShot
# Requires embeddings for examples
knn_optimizer = KNNFewShot(
k=3, # Select 3 most similar examples
trainset=training_examples
)
optimized_program = knn_optimizer.compile(qa_module)
# Automatically selects relevant examples at inference time
# Math query → retrieves math examples
# Geography query → retrieves geography examples
SignatureOptimizer
Best For: Optimizing signature descriptions and field specifications
from dspy.teleprompt import SignatureOptimizer
sig_optimizer = SignatureOptimizer(
metric=accuracy_metric,
breadth=10, # Number of variations to generate
depth=3 # Optimization iterations
)
optimized_signature = sig_optimizer.compile(
initial_signature=QuestionAnswer,
trainset=trainset
)
# Use optimized signature
qa = dspy.ChainOfThought(optimized_signature)
Sequential Optimization Strategy
Combine optimizers for best results:
# Step 1: Bootstrap few-shot examples (fast)
bootstrap = dspy.BootstrapFewShot(metric=accuracy_metric)
bootstrapped_program = bootstrap.compile(qa_module, trainset=train_examples)
# Step 2: Optimize instructions with MIPRO (comprehensive)
mipro = dspy.MIPROv2(metric=accuracy_metric, auto="medium")
final_program = mipro.compile(
bootstrapped_program,
trainset=train_examples,
num_trials=50
)
# Step 3: Fine-tune signature descriptions
sig_optimizer = dspy.SignatureOptimizer(metric=accuracy_metric)
production_program = sig_optimizer.compile(final_program, trainset=train_examples)
# Save production model
production_program.save("production_optimized.json")
Teleprompters: Compilation Pipelines
Teleprompters orchestrate the optimization process ("teleprompter" is the legacy DSPy term for what are now called optimizers).
Custom Teleprompter:
class CustomTeleprompter:
"""Custom optimization pipeline."""
def __init__(self, metric):
self.metric = metric
def compile(self, student, trainset, valset=None):
# Stage 1: Bootstrap examples
bootstrap = BootstrapFewShot(metric=self.metric)
stage1 = bootstrap.compile(student, trainset=trainset)
# Stage 2: Optimize instructions
mipro = MIPROv2(metric=self.metric, auto="light")
stage2 = mipro.compile(stage1, trainset=trainset)
# Stage 3: Validate on held-out set
if valset:
score = self._evaluate(stage2, valset)
print(f"Validation score: {score:.2%}")
return stage2
def _evaluate(self, program, dataset):
correct = 0
for example in dataset:
prediction = program(**example.inputs())
if self.metric(example, prediction):
correct += 1
return correct / len(dataset)
# Use custom teleprompter
custom_optimizer = CustomTeleprompter(metric=accuracy_metric)
optimized = custom_optimizer.compile(
student=qa_module,
trainset=train_examples,
valset=val_examples
)
Metrics and Evaluation
Custom Metrics
Binary Accuracy:
def exact_match(example, prediction, trace=None):
"""Exact match metric."""
return example.answer.lower().strip() == prediction.answer.lower().strip()
Fuzzy Matching:
from difflib import SequenceMatcher
def fuzzy_match(example, prediction, trace=None):
"""Fuzzy string matching."""
similarity = SequenceMatcher(
None,
example.answer.lower(),
prediction.answer.lower()
).ratio()
return similarity > 0.8 # 80% similarity threshold
Multi-Criteria:
def comprehensive_metric(example, prediction, trace=None):
"""Evaluate on multiple criteria."""
# Correctness
correct = example.answer.lower() in prediction.answer.lower()
# Length appropriateness
length_ok = 10 < len(prediction.answer) < 200
# Has reasoning (if CoT)
has_reasoning = (
hasattr(prediction, 'rationale') and
len(prediction.rationale) > 30
)
# Citation quality (if RAG)
has_citations = (
hasattr(prediction, 'context') and
len(prediction.context) > 0
)
# Composite score
score = sum([
correct * 1.0,
length_ok * 0.2,
has_reasoning * 0.3,
has_citations * 0.2
]) / 1.7
return score
LLM-as-Judge:
def llm_judge_metric(example, prediction, trace=None):
"""Use LLM to evaluate quality."""
judge_prompt = f"""
Question: {example.question}
Expected Answer: {example.answer}
Predicted Answer: {prediction.answer}
Evaluate the predicted answer on a scale of 0-10 for:
1. Correctness
2. Completeness
3. Clarity
Return only a number 0-10.
"""
    judge_lm = dspy.LM('openai/gpt-4o-mini')
    # dspy.LM calls return a list of completions; take the first one
    response = judge_lm(judge_prompt)[0]
    score = float(response.strip()) / 10.0
return score > 0.7 # Pass if score > 7/10
Evaluation Pipeline
class Evaluator:
"""Comprehensive evaluation system."""
def __init__(self, program, metrics):
self.program = program
self.metrics = metrics
def evaluate(self, dataset, verbose=True):
"""Evaluate program on dataset."""
results = {name: [] for name in self.metrics.keys()}
for example in dataset:
prediction = self.program(**example.inputs())
for metric_name, metric_fn in self.metrics.items():
score = metric_fn(example, prediction)
results[metric_name].append(score)
# Aggregate results
aggregated = {
name: sum(scores) / len(scores)
for name, scores in results.items()
}
if verbose:
print("\nEvaluation Results:")
print("=" * 50)
for name, score in aggregated.items():
print(f"{name:20s}: {score:.2%}")
return aggregated
# Use evaluator
evaluator = Evaluator(
program=optimized_qa,
metrics={
"accuracy": exact_match,
"fuzzy_match": fuzzy_match,
"quality": comprehensive_metric
}
)
scores = evaluator.evaluate(test_dataset)
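For the common single-metric case, DSPy also ships a built-in parallel evaluation helper; a minimal sketch (argument names follow the current docs, so verify against your installed version):

```python
from dspy.evaluate import Evaluate

# Runs the program over the devset in parallel and reports the aggregate metric
evaluate = Evaluate(devset=test_dataset, metric=exact_match,
                    num_threads=8, display_progress=True)
score = evaluate(optimized_qa)
```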
Language Model Configuration
Supported Providers
OpenAI:
import dspy
lm = dspy.LM('openai/gpt-4o', api_key='YOUR_API_KEY')
dspy.configure(lm=lm)
# With custom settings
lm = dspy.LM(
'openai/gpt-4o-mini',
api_key='YOUR_API_KEY',
temperature=0.7,
max_tokens=1024
)
Anthropic Claude:
lm = dspy.LM(
'anthropic/claude-3-5-sonnet-20241022',
api_key='YOUR_ANTHROPIC_KEY',
max_tokens=4096
)
dspy.configure(lm=lm)
# Claude Opus for complex reasoning
lm_opus = dspy.LM('anthropic/claude-3-opus-20240229', api_key=key)
Local Models (Ollama):
# Requires Ollama running locally
lm = dspy.LM('ollama/llama3.1:70b', api_base='http://localhost:11434')
dspy.configure(lm=lm)
# Mixtral
lm = dspy.LM('ollama/mixtral:8x7b')
Multiple Models:
# Use different models for different stages
strong_lm = dspy.LM('openai/gpt-4o')
fast_lm = dspy.LM('openai/gpt-4o-mini')
# Configure per module
class HybridPipeline(dspy.Module):
def __init__(self):
super().__init__()
# Fast model for retrieval
self.retrieve = dspy.Retrieve(k=5)
        # Strong model for reasoning
        self.reason = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        # dspy.context applies at call time, so wrap the reasoning call here
        # rather than the module construction in __init__
        with dspy.context(lm=strong_lm):
            return self.reason(context=context, question=question)
Model Selection Strategy
def select_model(task_complexity, budget):
"""Select appropriate model based on task and budget."""
models = {
"simple": [
("openai/gpt-4o-mini", 0.15), # (model, cost per 1M tokens)
("anthropic/claude-3-haiku-20240307", 0.25),
],
"medium": [
("openai/gpt-4o", 2.50),
("anthropic/claude-3-5-sonnet-20241022", 3.00),
],
"complex": [
("anthropic/claude-3-opus-20240229", 15.00),
("openai/o1-preview", 15.00),
]
}
candidates = models[task_complexity]
affordable = [m for m, cost in candidates if cost <= budget]
return affordable[0] if affordable else candidates[0][0]
# Use in optimization
task = "complex"
model = select_model(task, budget=10.0)
lm = dspy.LM(model)
dspy.configure(lm=lm)
Program Composition
Chaining Modules
class MultiStepPipeline(dspy.Module):
"""Chain multiple reasoning steps."""
def __init__(self):
super().__init__()
self.step1 = dspy.ChainOfThought("question -> subtasks")
self.step2 = dspy.ChainOfThought("subtask -> solution")
self.step3 = dspy.ChainOfThought("solutions -> final_answer")
def forward(self, question):
# Break down question
decomposition = self.step1(question=question)
# Solve each subtask
solutions = []
for subtask in decomposition.subtasks.split('\n'):
if subtask.strip():
sol = self.step2(subtask=subtask)
solutions.append(sol.solution)
# Synthesize final answer
combined = '\n'.join(solutions)
final = self.step3(solutions=combined)
return dspy.Prediction(
answer=final.final_answer,
subtasks=decomposition.subtasks,
solutions=solutions
)
# Optimize entire pipeline
pipeline = MultiStepPipeline()
optimizer = MIPROv2(metric=quality_metric, auto="medium")
optimized_pipeline = optimizer.compile(pipeline, trainset=examples)
Conditional Branching
class AdaptivePipeline(dspy.Module):
"""Adapt reasoning based on query type."""
def __init__(self):
super().__init__()
self.classifier = dspy.ChainOfThought("question -> category")
self.math_solver = dspy.ProgramOfThought("problem -> solution")
self.fact_qa = dspy.ChainOfThought("question -> answer")
self.creative = dspy.ChainOfThought("prompt -> response")
def forward(self, question):
# Classify query type
category = self.classifier(question=question).category.lower()
# Route to appropriate module
if "math" in category or "calculation" in category:
return self.math_solver(problem=question)
elif "creative" in category or "story" in category:
return self.creative(prompt=question)
else:
return self.fact_qa(question=question)
# Optimize each branch independently
adaptive = AdaptivePipeline()
optimized_adaptive = optimizer.compile(adaptive, trainset=diverse_examples)
Production Deployment
Saving and Loading Models
# Save optimized program
optimized_program.save("models/qa_v1.0.0.json")
# Load in production
production_qa = dspy.ChainOfThought("question -> answer")
production_qa.load("models/qa_v1.0.0.json")
# Use loaded model
response = production_qa(question="What is quantum computing?")
Version Control
import json
from datetime import datetime
class ModelRegistry:
"""Version control for DSPy models."""
def __init__(self, registry_path="models/registry.json"):
self.registry_path = registry_path
self.registry = self._load_registry()
def register(self, name, version, model_path, metadata=None):
"""Register a model version."""
model_id = f"{name}:v{version}"
self.registry[model_id] = {
"name": name,
"version": version,
"path": model_path,
"created_at": datetime.utcnow().isoformat(),
"metadata": metadata or {}
}
self._save_registry()
return model_id
def get_model(self, name, version="latest"):
"""Load model by name and version."""
if version == "latest":
versions = [
v for k, v in self.registry.items()
if v["name"] == name
]
if not versions:
raise ValueError(f"No versions found for {name}")
latest = max(versions, key=lambda x: x["created_at"])
model_path = latest["path"]
else:
model_id = f"{name}:v{version}"
model_path = self.registry[model_id]["path"]
# Load model
module = dspy.ChainOfThought("question -> answer")
module.load(model_path)
return module
def _load_registry(self):
try:
with open(self.registry_path, 'r') as f:
return json.load(f)
except FileNotFoundError:
return {}
def _save_registry(self):
with open(self.registry_path, 'w') as f:
json.dump(self.registry, f, indent=2)
# Use registry
registry = ModelRegistry()
# Register new version
registry.register(
name="qa_assistant",
version="1.0.0",
model_path="models/qa_v1.0.0.json",
metadata={
"accuracy": 0.87,
"optimizer": "MIPROv2",
"training_examples": 500
}
)
# Load for production
qa = registry.get_model("qa_assistant", version="latest")
Monitoring and Logging
import json
import logging
from datetime import datetime
class DSPyMonitor:
"""Monitor DSPy program execution."""
def __init__(self, program, log_file="logs/dspy.log"):
self.program = program
self.logger = self._setup_logger(log_file)
self.metrics = []
def __call__(self, **kwargs):
"""Wrap program execution with monitoring."""
start_time = datetime.utcnow()
try:
# Execute program
result = self.program(**kwargs)
# Log success
duration = (datetime.utcnow() - start_time).total_seconds()
self._log_execution(
status="success",
inputs=kwargs,
outputs=result,
duration=duration
)
return result
except Exception as e:
# Log error
duration = (datetime.utcnow() - start_time).total_seconds()
self._log_execution(
status="error",
inputs=kwargs,
error=str(e),
duration=duration
)
raise
def _log_execution(self, status, inputs, duration, outputs=None, error=None):
"""Log execution details."""
log_entry = {
"timestamp": datetime.utcnow().isoformat(),
"status": status,
"inputs": inputs,
"duration_seconds": duration
}
if outputs:
log_entry["outputs"] = str(outputs)
if error:
log_entry["error"] = error
self.logger.info(json.dumps(log_entry))
self.metrics.append(log_entry)
def _setup_logger(self, log_file):
"""Setup logging."""
logger = logging.getLogger("dspy_monitor")
logger.setLevel(logging.INFO)
handler = logging.FileHandler(log_file)
handler.setFormatter(
logging.Formatter('%(asctime)s - %(message)s')
)
logger.addHandler(handler)
return logger
def get_stats(self):
"""Get execution statistics."""
if not self.metrics:
return {}
successes = [m for m in self.metrics if m["status"] == "success"]
errors = [m for m in self.metrics if m["status"] == "error"]
return {
"total_calls": len(self.metrics),
"success_rate": len(successes) / len(self.metrics),
"error_rate": len(errors) / len(self.metrics),
"avg_duration": sum(m["duration_seconds"] for m in self.metrics) / len(self.metrics),
"errors": [m["error"] for m in errors]
}
# Use monitor
monitored_qa = DSPyMonitor(optimized_qa)
result = monitored_qa(question="What is AI?")
# Check stats
stats = monitored_qa.get_stats()
print(f"Success rate: {stats['success_rate']:.2%}")
Integration with LangSmith
Evaluate DSPy programs using LangSmith:
import os
from langsmith import Client
from langsmith.evaluation import evaluate
# Setup
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-key"
client = Client()
# Wrap DSPy program for LangSmith
def dspy_wrapper(inputs: dict) -> dict:
"""Wrapper for LangSmith evaluation."""
question = inputs["question"]
result = optimized_qa(question=question)
return {"answer": result.answer}
# Define evaluator
def dspy_evaluator(run, example):
"""Evaluate DSPy output."""
predicted = run.outputs["answer"]
expected = example.outputs["answer"]
return {
"key": "correctness",
"score": 1.0 if expected.lower() in predicted.lower() else 0.0
}
# Create dataset
dataset = client.create_dataset(
dataset_name="dspy_qa_eval",
description="DSPy QA evaluation dataset"
)
# Add examples
for example in test_examples:
client.create_example(
dataset_id=dataset.id,
inputs={"question": example.question},
outputs={"answer": example.answer}
)
# Run evaluation
results = evaluate(
dspy_wrapper,
data="dspy_qa_eval",
evaluators=[dspy_evaluator],
experiment_prefix="dspy_v1.0"
)
print(f"Average correctness: {results['results']['correctness']:.2%}")
Real-World Examples
RAG Pipeline
class ProductionRAG(dspy.Module):
"""Production-ready RAG system."""
def __init__(self, k=5):
super().__init__()
self.retrieve = dspy.Retrieve(k=k)
# Multi-stage reasoning
self.rerank = dspy.ChainOfThought(
"question, passages -> relevant_passages"
)
self.generate = dspy.ChainOfThought(
"question, context -> answer, citations"
)
def forward(self, question):
# Retrieve candidate passages
candidates = self.retrieve(question).passages
# Rerank for relevance
reranked = self.rerank(
question=question,
passages="\n---\n".join(candidates)
)
# Generate answer with citations
result = self.generate(
question=question,
context=reranked.relevant_passages
)
return dspy.Prediction(
answer=result.answer,
citations=result.citations,
passages=candidates
)
# Optimize RAG pipeline
rag = ProductionRAG(k=10)
def rag_metric(example, prediction, trace=None):
"""Evaluate RAG quality."""
answer_correct = example.answer.lower() in prediction.answer.lower()
has_citations = len(prediction.citations) > 0
return answer_correct and has_citations
optimizer = MIPROv2(metric=rag_metric, auto="heavy")
optimized_rag = optimizer.compile(rag, trainset=rag_examples)
optimized_rag.save("models/rag_production.json")
Classification
class SentimentClassifier(dspy.Module):
"""Multi-class sentiment classification."""
def __init__(self, classes):
super().__init__()
self.classes = classes
class ClassificationSig(dspy.Signature):
text = dspy.InputField()
reasoning = dspy.OutputField(desc="step-by-step reasoning")
sentiment = dspy.OutputField(desc=f"one of: {', '.join(classes)}")
confidence = dspy.OutputField(desc="confidence score 0-1")
self.classify = dspy.ChainOfThought(ClassificationSig)
def forward(self, text):
result = self.classify(text=text)
# Validate output
if result.sentiment not in self.classes:
result.sentiment = "neutral" # Fallback
return result
# Train classifier
classes = ["positive", "negative", "neutral"]
classifier = SentimentClassifier(classes)
def classification_metric(example, prediction, trace=None):
return example.sentiment == prediction.sentiment
optimizer = BootstrapFewShot(metric=classification_metric)
optimized_classifier = optimizer.compile(
classifier,
trainset=sentiment_examples
)
# Use in production
result = optimized_classifier(text="This product is amazing!")
print(f"Sentiment: {result.sentiment} ({result.confidence})")
Summarization
class DocumentSummarizer(dspy.Module):
"""Hierarchical document summarization."""
def __init__(self):
super().__init__()
# Chunk-level summaries
self.chunk_summary = dspy.ChainOfThought(
"chunk -> summary"
)
# Document-level synthesis
self.final_summary = dspy.ChainOfThought(
"chunk_summaries -> final_summary, key_points"
)
def forward(self, document, chunk_size=1000):
# Split document into chunks
chunks = self._chunk_document(document, chunk_size)
# Summarize each chunk
chunk_summaries = []
for chunk in chunks:
summary = self.chunk_summary(chunk=chunk)
chunk_summaries.append(summary.summary)
# Synthesize final summary
combined = "\n---\n".join(chunk_summaries)
final = self.final_summary(chunk_summaries=combined)
return dspy.Prediction(
summary=final.final_summary,
key_points=final.key_points.split('\n'),
chunk_count=len(chunks)
)
def _chunk_document(self, document, chunk_size):
"""Split document into chunks."""
words = document.split()
chunks = []
for i in range(0, len(words), chunk_size):
chunk = ' '.join(words[i:i + chunk_size])
chunks.append(chunk)
return chunks
# Optimize summarizer
summarizer = DocumentSummarizer()
def summary_metric(example, prediction, trace=None):
# Check key points coverage
key_points_present = sum(
1 for kp in example.key_points
if kp.lower() in prediction.summary.lower()
)
coverage = key_points_present / len(example.key_points)
# Check length appropriateness
length_ok = 100 < len(prediction.summary) < 500
return coverage > 0.7 and length_ok
optimizer = MIPROv2(metric=summary_metric, auto="medium")
optimized_summarizer = optimizer.compile(summarizer, trainset=summary_examples)
Question Answering
class MultiHopQA(dspy.Module):
"""Multi-hop question answering."""
def __init__(self):
super().__init__()
# Decompose complex questions
self.decompose = dspy.ChainOfThought(
"question -> subquestions"
)
# Answer subquestions with retrieval
self.retrieve = dspy.Retrieve(k=3)
self.answer_subq = dspy.ChainOfThought(
"subquestion, context -> answer"
)
# Synthesize final answer
self.synthesize = dspy.ChainOfThought(
"question, subanswers -> final_answer, reasoning"
)
def forward(self, question):
# Decompose into subquestions
decomp = self.decompose(question=question)
subquestions = [
sq.strip()
for sq in decomp.subquestions.split('\n')
if sq.strip()
]
# Answer each subquestion
subanswers = []
for subq in subquestions:
context = self.retrieve(subq).passages
answer = self.answer_subq(
subquestion=subq,
context="\n".join(context)
)
subanswers.append(answer.answer)
# Synthesize final answer
combined = "\n".join([
f"Q: {sq}\nA: {sa}"
for sq, sa in zip(subquestions, subanswers)
])
final = self.synthesize(
question=question,
subanswers=combined
)
return dspy.Prediction(
answer=final.final_answer,
reasoning=final.reasoning,
subquestions=subquestions,
subanswers=subanswers
)
# Optimize multi-hop QA
multihop_qa = MultiHopQA()
def multihop_metric(example, prediction, trace=None):
# Check answer correctness
correct = example.answer.lower() in prediction.answer.lower()
# Check reasoning quality
has_reasoning = len(prediction.reasoning) > 50
# Check subquestion coverage
has_subquestions = len(prediction.subquestions) >= 2
return correct and has_reasoning and has_subquestions
optimizer = MIPROv2(metric=multihop_metric, auto="heavy")
optimized_multihop = optimizer.compile(multihop_qa, trainset=multihop_examples)
Migration from Manual Prompting
Before: Manual Prompting
# Manual prompt engineering
PROMPT = """
You are a helpful assistant. Answer questions accurately and concisely.
Examples:
Q: What is 2+2?
A: 4
Q: Capital of France?
A: Paris
Q: {question}
A: """
def manual_qa(question):
response = llm.invoke(PROMPT.format(question=question))
return response
After: DSPy
# DSPy declarative approach
class QA(dspy.Signature):
"""Answer questions accurately and concisely."""
question = dspy.InputField()
answer = dspy.OutputField(desc="short factual answer")
qa = dspy.ChainOfThought(QA)
# Optimize automatically
optimizer = MIPROv2(metric=accuracy_metric, auto="medium")
optimized_qa = optimizer.compile(qa, trainset=examples)
def dspy_qa(question):
result = optimized_qa(question=question)
return result.answer
Benefits:
- Systematic optimization vs. manual trial-and-error
- Version control and reproducibility
- Automatic adaptation to new models
- Performance gains: +18-38% accuracy
Best Practices
Data Preparation
# Create high-quality training examples
def prepare_training_data(raw_data):
"""Convert raw data to DSPy examples."""
examples = []
for item in raw_data:
example = dspy.Example(
question=item["question"],
answer=item["answer"],
context=item.get("context", "") # Optional fields
).with_inputs("question", "context") # Mark input fields
examples.append(example)
return examples
# Split data properly
def train_val_test_split(examples, train=0.7, val=0.15, test=0.15):
"""Split data for optimization and evaluation."""
import random
random.shuffle(examples)
n = len(examples)
train_end = int(n * train)
val_end = int(n * (train + val))
return {
"train": examples[:train_end],
"val": examples[train_end:val_end],
"test": examples[val_end:]
}
# Use split data
data = train_val_test_split(all_examples)
optimized = optimizer.compile(
program,
trainset=data["train"],
valset=data["val"] # For hyperparameter tuning
)
# Final evaluation on held-out test set
evaluator = Evaluator(optimized, metrics={"accuracy": accuracy_metric})
test_results = evaluator.evaluate(data["test"])
Metric Design
# Design metrics aligned with business goals
def business_aligned_metric(example, prediction, trace=None):
"""Metric aligned with business KPIs."""
# Core correctness (must have)
correct = example.answer.lower() in prediction.answer.lower()
if not correct:
return 0.0
# Business-specific criteria
is_concise = len(prediction.answer) < 100 # User preference
is_professional = not any(
word in prediction.answer.lower()
for word in ["um", "like", "maybe", "dunno"]
)
has_confidence = (
hasattr(prediction, 'confidence') and
float(prediction.confidence) > 0.7
)
# Weighted score
score = (
correct * 1.0 +
is_concise * 0.2 +
is_professional * 0.3 +
has_confidence * 0.2
) / 1.7
return score
Error Handling
import logging

class RobustModule(dspy.Module):
"""Module with error handling."""
def __init__(self):
super().__init__()
self.qa = dspy.ChainOfThought("question -> answer")
def forward(self, question, max_retries=3):
"""Forward with retry logic."""
for attempt in range(max_retries):
try:
result = self.qa(question=question)
# Validate output
if self._validate_output(result):
return result
else:
logging.warning(f"Invalid output on attempt {attempt + 1}")
except Exception as e:
logging.error(f"Error on attempt {attempt + 1}: {e}")
if attempt == max_retries - 1:
raise
# Fallback
return dspy.Prediction(
answer="I'm unable to answer that question.",
confidence=0.0
)
def _validate_output(self, result):
"""Validate output quality."""
return (
hasattr(result, 'answer') and
len(result.answer) > 0 and
len(result.answer) < 1000
)
Caching for Efficiency
from functools import lru_cache
import hashlib
class CachedModule(dspy.Module):
"""Module with semantic caching."""
def __init__(self, base_module):
super().__init__()
self.base_module = base_module
self.cache = {}
def forward(self, question):
# Check cache
cache_key = self._get_cache_key(question)
if cache_key in self.cache:
logging.info("Cache hit")
return self.cache[cache_key]
# Cache miss: execute module
result = self.base_module(question=question)
self.cache[cache_key] = result
return result
def _get_cache_key(self, question):
"""Generate cache key."""
return hashlib.md5(question.lower().encode()).hexdigest()
# Use cached module
base_qa = dspy.ChainOfThought("question -> answer")
cached_qa = CachedModule(base_qa)
Troubleshooting
Common Issues
Low Optimization Performance:
- Increase training data size (aim for 100+ examples)
- Use a better-quality, more specific metric
- Try a different optimizer or a heavier setting (auto="heavy" for MIPROv2)
- Check for data leakage in the metric (see the sketch below)
Optimization Takes Too Long:
- Use auto="light" instead of "heavy"
- Reduce num_trials for MIPROv2
- Use BootstrapFewShot instead of MIPROv2 for quick iteration
- Parallelize with the num_threads parameter
Inconsistent Results:
- Set a random seed (e.g., dspy.configure(random_seed=42))
- Increase temperature for diversity or decrease it for consistency
- Use ensemble of multiple optimized programs
- Validate on larger test set
Out of Memory:
- Reduce batch size in optimization
- Use streaming for large datasets
- Clear cache periodically
- Use smaller model for bootstrapping
Debugging Optimization
# Enable verbose logging
import logging
logging.basicConfig(level=logging.INFO)
# Custom teleprompter with debugging
class DebugTeleprompter:
def __init__(self, metric):
self.metric = metric
self.history = []
def compile(self, student, trainset):
print(f"\nStarting optimization with {len(trainset)} examples")
# Bootstrap with debugging
bootstrap = BootstrapFewShot(metric=self.metric)
for i, example in enumerate(trainset):
prediction = student(**example.inputs())
score = self.metric(example, prediction)
self.history.append({
"example_idx": i,
"score": score,
"prediction": str(prediction)
})
print(f"Example {i}: score={score}")
# Continue with optimization
optimized = bootstrap.compile(student, trainset=trainset)
print(f"\nOptimization complete")
print(f"Average score: {sum(h['score'] for h in self.history) / len(self.history):.2f}")
return optimized
# Use debug teleprompter
debug_optimizer = DebugTeleprompter(metric=accuracy_metric)
optimized = debug_optimizer.compile(qa_module, trainset=examples)
Performance Benchmarks
Based on 2025 production studies:
| Use Case | Baseline | DSPy Optimized | Relative Improvement | Optimizer Used |
|---|---|---|---|---|
| Prompt Evaluation | 46.2% | 64.0% | +38.5% | MIPROv2 |
| Guardrail Enforcement | 72.1% | 84.3% | +16.9% | MIPROv2 |
| Code Generation | 58.4% | 71.2% | +21.9% | MIPROv2 |
| Hallucination Detection | 65.8% | 79.5% | +20.8% | BootstrapFewShot |
| Agent Routing | 69.3% | 82.1% | +18.5% | MIPROv2 |
| RAG Accuracy | 54.0% | 68.5% | +26.9% | BootstrapFewShot + MIPRO |
Production Adopters: JetBlue, Databricks, Walmart, VMware, Replit, Sephora, Moody's
Resources
- Documentation: https://dspy.ai/
- GitHub: https://github.com/stanfordnlp/dspy
- Paper: "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines"
- 2025 Study: "Is It Time To Treat Prompts As Code?" (arXiv:2507.03620)
- Community: Discord, GitHub Discussions
Related Skills
When using DSPy, these skills enhance your workflow:
- langgraph: LangGraph for multi-agent orchestration (use with DSPy-optimized prompts)
- test-driven-development: Testing DSPy modules and prompt optimizations
- systematic-debugging: Debugging DSPy compilation and optimization failures
[Full documentation available in these skills if deployed in your bundle]