---
name: optimize-prompt-gepa
description: Optimizes prompts using full GEPA methodology (Genetic-Pareto Evolution). Use when user wants to improve a prompt's accuracy on test cases, mentions "optimize prompt", "improve prompt", or has examples of desired input/output pairs. Implements Pareto frontier selection, trace-based reflection, and crossover mutations.
---
You (Claude) are the optimizer: you run prompts, capture traces, reflect on failures, and evolve improvements. You execute the full GEPA loop yourself and return the optimized prompt.
Example:
Seed: "Extract action items"
Test case:
Input: "Meeting notes: John will prepare the report by Friday. Sarah to review."
Expected: "- John: Prepare report (Due: Friday)\n- Sarah: Review (Due: unspecified)"
After GEPA optimization:
"Extract action items from text. Think step by step:
1. Identify each person mentioned
2. Find what they committed to do
3. Extract any deadline mentioned
Format each item as:
- [Person]: [Task] (Due: [deadline or 'unspecified'])
Rules:
- Skip items without clear ownership
- If deadline is vague (e.g., 'soon', 'later'), mark as 'unspecified'
- One line per action item"
Required:
- Seed prompt - What prompt do you want to optimize?
- Test cases - Examples of input and expected output
  - Minimum: 1 example (I'll generate more synthetically)
  - Recommended: 5-10 examples for robust optimization
Optional:
- Target score (default: 90%)
- Max iterations (default: 10)
- Diversity weight (default: 0.3) - How much to favor diverse solutions
Please provide your prompt and test cases.
<data_structures>
## Pareto Frontier
ParetoFrontier = [Candidate]  # Candidates not dominated by any other
## Test Case with Trace
EvaluatedCase = {
  input: string,
  expected: string,
  actual: string,
  trace: string,     # Full reasoning chain
  score: float,
  feedback: string
}
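A minimal Python sketch of these structures, for concreteness only (the `Candidate` fields are inferred from how candidates are constructed in step 3; nothing requires you to literally run this code):
```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Candidate:
    """One prompt variant in the evolving population."""
    id: str
    prompt: str
    scores: Dict[str, float] = field(default_factory=dict)  # test_case_id -> score in [0, 1]
    parent_ids: List[str] = field(default_factory=list)
    mutation_type: str = "seed"  # "seed" | "reflection" | "crossover"

    @property
    def avg_score(self) -> float:
        """Mean score across all evaluated test cases."""
        return sum(self.scores.values()) / len(self.scores) if self.scores else 0.0

@dataclass
class EvaluatedCase:
    """A single test case run, including the reasoning trace."""
    input: str
    expected: str
    actual: str
    trace: str      # full reasoning chain
    score: float    # normalized to [0, 1]
    feedback: str

# The Pareto frontier is simply a list of non-dominated candidates.
ParetoFrontier = List[Candidate]
```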
</data_structures>
<process>
<step name="1_parse_input">
Parse user's input to extract:
- `seed_prompt`: The prompt to optimize
- `test_cases`: Array of {id, input, expected} pairs
- `target_score`: Default 0.9 (90%)
- `max_iterations`: Default 10
- `diversity_weight`: Default 0.3
Assign unique IDs to test cases (tc_1, tc_2, etc.)
</step>
<step name="2_synthetic_generation">
If fewer than 5 test cases, generate synthetic examples:
Given these examples: {for each test_case: input -> expected}
Generate 5 more examples that:
- Follow the EXACT same output format
- Cover edge cases:
  - Empty/null inputs
  - Multiple items
  - Missing information
  - Ambiguous cases
- Use different names, numbers, contexts
Return as JSON array: [{"input": "...", "expected": "..."}, ...]
Add generated examples to test_cases with IDs.
</step>
<step name="3_baseline_evaluation">
Create seed candidate and evaluate with TRACES:
seed_candidate = { id: "c_0", prompt: seed_prompt, scores: {}, parent_ids: [], mutation_type: "seed" }
For each test_case:
1. Run with trace capture:
```
{seed_prompt}
Input: {test_case.input}
Think through this step by step, then provide your final answer.
## Reasoning:
[Your step-by-step thinking]
## Answer:
[Your final output]
```
2. Parse trace (reasoning) and answer separately (see the parsing sketch at the end of this step)
3. Score answer against expected (0-10)
4. Store: seed_candidate.scores[test_case.id] = score/10
Calculate avg_score = mean(all scores)
Initialize:
- `pareto_frontier = [seed_candidate]`
- `all_candidates = [seed_candidate]`
- `best_avg_score = avg_score`
Report: "Baseline score: {avg_score:.0%}"
</step>
<step name="4_gepa_loop">
FOR iteration 1 to max_iterations:
**4a. Pareto Selection**
Select parent candidate using tournament selection with diversity bonus:
For 3 random candidates from pareto_frontier:
  Calculate selection_score = avg_score + diversity_weight * uniqueness
  (uniqueness = how different this candidate's strengths are from the others)
Select candidate with highest selection_score
selected_parent = winner
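A sketch of this selection rule, assuming the `Candidate` structure defined earlier. Measuring uniqueness as the mean per-case score distance from the rest of the frontier is one reasonable choice, not a requirement:
```python
import random

def uniqueness(candidate, frontier):
    """Mean per-case score distance from the other frontier members (0..1)."""
    others = [c for c in frontier if c.id != candidate.id]
    if not others:
        return 1.0
    diffs = [
        abs(candidate.scores[tc] - other.scores.get(tc, 0.0))
        for other in others
        for tc in candidate.scores
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0

def tournament_select(frontier, diversity_weight=0.3, k=3):
    """Pick the tournament winner by average score plus a diversity bonus."""
    contestants = random.sample(frontier, min(k, len(frontier)))
    return max(
        contestants,
        key=lambda c: c.avg_score + diversity_weight * uniqueness(c, frontier),
    )
```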
**4b. Mini-batch Evaluation**
Select mini-batch of 3 test cases, prioritizing:
- Cases where selected_parent scored lowest (exploitation)
- 1 random case (exploration)
Run selected_parent.prompt on mini-batch WITH TRACES
Collect: [{input, expected, actual, trace, score, feedback}, ...]
mini_batch_score = average score
Report: "Iteration {i}: Testing '{selected_parent.id}' on mini-batch: {mini_batch_score:.0%}"
**4c. Early Success Check**
IF mini_batch_score >= target_score:
Run full validation on ALL test cases
IF full_avg >= target_score:
Report: "✓ Target reached: {full_avg:.0%}"
GOTO step 5 (output)
**4d. Trace-Based Reflection**
Collect failures (score < 0.8) with their TRACES:
Reflection Task
Current prompt: {selected_parent.prompt}
Failed Cases Analysis
{for each failure:}
Case {id}:
- Input: {input}
- Expected: {expected}
- Actual: {actual}
- Reasoning Trace: {trace}
- Score: {score}/10
- Feedback: {feedback}
Analysis Questions
1. Trace Analysis: Where in the reasoning did the model go wrong?
   - Did it misunderstand the task?
   - Did it miss information in the input?
   - Did it apply wrong formatting?
2. Pattern Recognition: What patterns do you see across failures?
   - Common misunderstandings
   - Systematic format errors
   - Missing edge case handling
3. Root Cause: What's the SINGLE most impactful fix?
4. Specific Rules: List 3-5 explicit rules to add to the prompt.
Provide your analysis:
Save reflection_analysis
**4e. Generate Mutations**
Create 2 new candidates:
**Mutation 1: Reflection-based**
Current prompt: {selected_parent.prompt}
Analysis of failures: {reflection_analysis}
Create an improved prompt that:
- Addresses ALL identified issues
- Includes explicit rules from analysis
- Adds step-by-step reasoning instructions if helpful
- Specifies exact output format with examples
Write ONLY the new prompt (no explanation):
**Mutation 2: Crossover (if pareto_frontier has 2+ candidates)**
You have two successful prompts with different strengths:
Prompt A (excels on: {cases where A > B}): {candidate_a.prompt}
Prompt B (excels on: {cases where B > A}): {candidate_b.prompt}
Create a NEW prompt that combines the best elements of both. Merge their rules, keep the most specific instructions from each.
Write ONLY the merged prompt:
Create new candidates:
- mutation_1 = {id: "c_{n}", prompt: reflection_result, parent_ids: [selected_parent.id], mutation_type: "reflection"}
- mutation_2 = {id: "c_{n+1}", prompt: crossover_result, parent_ids: [a.id, b.id], mutation_type: "crossover"}
**4f. Full Evaluation of New Candidates**
For each new candidate:
Run on ALL test cases with traces
Calculate scores per test case and avg_score
**4g. Update Pareto Frontier**
For each new candidate:
Add to all_candidates
Check Pareto dominance:
- Candidate A dominates B if A scores >= B on ALL test cases AND > on at least one
Update pareto_frontier:
- Add new candidate if not dominated by any existing
- Remove any existing candidates now dominated by new one
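A sketch of the dominance check and frontier update, matching the definition in the Pareto frontier guide below (assumes the `Candidate` structure sketched earlier):
```python
def dominates(a, b, test_case_ids):
    """A dominates B: >= on every test case and strictly > on at least one."""
    at_least_one_better = False
    for tc in test_case_ids:
        sa, sb = a.scores.get(tc, 0.0), b.scores.get(tc, 0.0)
        if sa < sb:
            return False
        if sa > sb:
            at_least_one_better = True
    return at_least_one_better

def update_frontier(frontier, new_candidate, test_case_ids):
    """Insert new_candidate unless dominated; evict members it now dominates."""
    if any(dominates(existing, new_candidate, test_case_ids) for existing in frontier):
        return frontier
    survivors = [c for c in frontier if not dominates(new_candidate, c, test_case_ids)]
    return survivors + [new_candidate]
```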
**4h. Track Best**
IF any new candidate has avg_score > best_avg_score:
best_avg_score = new avg_score
Report: "✓ New best: {best_avg_score:.0%} (candidate {id})"
ELSE:
Report: "No improvement. Pareto frontier size: {len(pareto_frontier)}"
**4i. Diversity Check**
IF all candidates in pareto_frontier have similar prompts (>80% overlap):
Report: "⚠ Low diversity. Injecting random mutation."
Create random_mutation with aggressive changes
Add to next iteration's candidates
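For the >80% overlap check, a rough token-level Jaccard similarity is a workable stand-in (a sketch; any sensible text-similarity measure is fine):
```python
def prompt_overlap(prompt_a: str, prompt_b: str) -> float:
    """Jaccard similarity over lowercase word sets (rough proxy for prompt overlap)."""
    a, b = set(prompt_a.lower().split()), set(prompt_b.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0

def frontier_is_stale(frontier, threshold=0.8) -> bool:
    """True if every pair of frontier prompts overlaps more than the threshold."""
    pairs = [
        (frontier[i], frontier[j])
        for i in range(len(frontier))
        for j in range(i + 1, len(frontier))
    ]
    return bool(pairs) and all(
        prompt_overlap(x.prompt, y.prompt) > threshold for x, y in pairs
    )
```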
END FOR
</step>
<step name="5_output_results">
Select best_candidate = candidate with highest avg_score from pareto_frontier
Present final results:
# GEPA Optimization Results
## Performance
| Metric | Value |
|---|---|
| Baseline Score | {seed_candidate.avg_score:.0%} |
| Final Score | {best_candidate.avg_score:.0%} |
| Improvement | +{improvement:.0%} |
| Iterations | {iterations_run} |
| Candidates Evaluated | {len(all_candidates)} |
| Pareto Frontier Size | {len(pareto_frontier)} |
## Original Prompt
{seed_prompt}
## Optimized Prompt
{best_candidate.prompt}
## Per-Case Performance
| Test Case | Before | After | Δ |
|---|---|---|---|
{for each test_case:}
| {id} | {seed_scores[id]:.0%} | {best_scores[id]:.0%} | {delta} |
## Key Discoveries
{Summarize main patterns found during reflection:}
- {discovery_1}
- {discovery_2}
- {discovery_3}
## Alternative Prompts (Pareto Frontier)
{If pareto_frontier has multiple candidates with different strengths:}
- {candidate.id}: Best for {cases where it excels} ({avg:.0%} avg)
</step>
</process>
<scoring_guide>
## Scoring Outputs (0-10)
| Score | Criteria |
|-------|----------|
| 10 | Perfect match: correct content AND exact format |
| 9 | Correct content, trivial format difference (whitespace, punctuation) |
| 7-8 | Correct content, minor format difference (ordering, capitalization) |
| 5-6 | Mostly correct content, wrong format structure |
| 3-4 | Partial content, significant omissions |
| 1-2 | Minimal correct content |
| 0 | Completely wrong or empty |
## Feedback Template
Score: X/10
✓ Correct: [what's right]
✗ Wrong: [what's wrong]
→ Fix: [specific instruction that would fix it]
Be STRICT about format matching. Format errors indicate missing instructions in the prompt.
</scoring_guide>
<trace_analysis_guide>
## How to Analyze Reasoning Traces
When examining a trace, look for:
1. **Task Understanding**
- Did the model correctly interpret what to do?
- Did it miss any requirements?
2. **Information Extraction**
- Did it find all relevant info in the input?
- Did it hallucinate information not present?
3. **Logic Errors**
- Where did the reasoning go wrong?
- What assumption was incorrect?
4. **Format Application**
- Did it know the expected format?
- Did it apply it correctly?
## Red Flags in Traces
- "I assume..." → Missing explicit instruction
- "I'm not sure if..." → Ambiguous requirement
- Skipping steps → Need more structured guidance
- Wrong interpretation → Need examples in prompt
</trace_analysis_guide>
<pareto_frontier_guide>
## Pareto Dominance
Candidate A dominates Candidate B if:
- A.scores[tc] >= B.scores[tc] for ALL test cases
- A.scores[tc] > B.scores[tc] for AT LEAST ONE test case
## Why Pareto Matters
Different prompts may excel on different cases:
- Prompt A: Great at edge cases, weak on simple cases
- Prompt B: Great at simple cases, weak on edge cases
Both belong in the Pareto frontier. Crossover can combine their strengths.
## Frontier Maintenance
- Max size: 5 candidates (prevent explosion)
- If over limit, keep most diverse set using k-medoids
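If you want a lighter-weight substitute for k-medoids, a greedy farthest-point selection over prompt similarity keeps a diverse subset (a sketch only; it assumes the `Candidate` structure from the data structures section):
```python
def prompt_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets (same proxy as in step 4i)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if (wa | wb) else 1.0

def prune_frontier(frontier, max_size=5):
    """Greedy farthest-point pruning: keep the top scorer, then repeatedly add the
    candidate least similar to those already kept. A simpler substitute for k-medoids."""
    if len(frontier) <= max_size:
        return list(frontier)
    kept = [max(frontier, key=lambda c: c.avg_score)]
    remaining = [c for c in frontier if c is not kept[0]]
    while len(kept) < max_size and remaining:
        best = max(
            remaining,
            key=lambda c: min(1.0 - prompt_overlap(c.prompt, k.prompt) for k in kept),
        )
        kept.append(best)
        remaining.remove(best)
    return kept
```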
</pareto_frontier_guide>
<edge_cases>
**Only 1 test case**: Generate 5+ synthetic examples covering edge cases before starting.
**Perfect baseline (100%)**: Report success, no optimization needed. Suggest additional edge cases to test robustness.
**No improvement after 5 iterations**:
- Increase diversity_weight to 0.5
- Try aggressive mutations (rewrite from scratch based on learnings)
- Check if test cases have conflicting requirements
**Pareto frontier explodes (>5 candidates)**:
- Keep only the 5 most diverse candidates
- Prioritize candidates with unique strengths
**Crossover produces worse results**:
- Reduce crossover frequency
- Focus on reflection-based mutations
**Oscillating scores (up/down/up)**:
- Indicates conflicting requirements in test cases
- Review test cases for consistency
- Consider splitting into sub-tasks
</edge_cases>
<success_criteria>
Optimization completes when:
1. ✓ Full dataset score >= target_score (default 90%), OR
2. ✓ Max iterations reached, OR
3. ✓ No improvement for 3 consecutive iterations (early stopping)
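These conditions can be checked mechanically; a compact sketch, assuming you append the running `best_avg_score` to a `history` list after each iteration:
```python
def should_stop(history, target_score=0.9, max_iterations=10, patience=3):
    """Stop on target reached, iteration budget spent, or no improvement for `patience` rounds."""
    if history and history[-1] >= target_score:
        return True
    if len(history) >= max_iterations:
        return True
    if len(history) > patience and max(history[-patience:]) <= history[-patience - 1]:
        return True
    return False
```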
Always return:
1. Best prompt from Pareto frontier
2. Score improvement trajectory
3. Key discoveries from trace analysis
4. Alternative prompts if Pareto frontier has multiple strong candidates
</success_criteria>
<example_session>
## Example: Action Item Extraction
**User Input:**
Seed prompt: "Extract action items from meeting notes"
Test cases:
Input: "John will send the report by Friday" Expected: "- John: Send report (Due: Friday)"
Input: "We should discuss the budget sometime" Expected: ""
Input: "Sarah and Mike to review the proposal by EOD" Expected: "- Sarah: Review proposal (Due: EOD)\n- Mike: Review proposal (Due: EOD)"
**GEPA Execution:**
Iteration 1: Baseline 40%
- tc_1: 8/10 (format slightly off)
- tc_2: 0/10 (returned items when should be empty)
- tc_3: 4/10 (missed second person)
Reflection: "Model doesn't know to skip vague items or split multiple people"
Mutation 1 (reflection): Added rules for ownership and multiple people
Iteration 2: 70%
- tc_2 now correct (empty)
- tc_3 still failing (format)
Crossover with seed: Merged format examples
Iteration 3: 90% ✓ Target reached
**Final Optimized Prompt:**
Extract action items from meeting notes.
Step-by-step:
- Find each person with a specific commitment
- Identify their task and any deadline
- Format as: "- [Person]: [Task] (Due: [deadline])"
Rules:
- SKIP vague items without clear ownership ("we should...", "someone needs to...")
- If multiple people share a task, create separate lines for each
- If no deadline mentioned, use "Due: unspecified"
- If NO valid action items exist, return empty string
Example:
Input: "John and Mary will review docs by Monday. We should improve process."
Output:
- John: Review docs (Due: Monday)
- Mary: Review docs (Due: Monday)
</example_session>