---
name: prompt-forge
description: Meta-prompt that generates improved prompts and templates. Can improve other prompts, including Skill Forge, and even itself. All improvements are gated by a frozen eval harness. Use when optimizing prompts, creating prompt diffs, or running the recursive improvement loop.
version: 1.0.0
category: foundry
tags: [meta-prompt, self-improvement, recursive, dogfooding]
---
# Prompt Forge (Meta-Prompt)

## Purpose

Generate improved prompts and templates with:

- Explicit rationale for each change
- Predicted improvement metrics
- Risk assessment
- Actionable diffs
**Key Innovation:** Prompt Forge can improve Skill Forge's prompts, and Skill Forge can in turn rebuild Prompt Forge's, creating a recursive improvement loop.
## When to Use

- Optimizing existing prompts for better performance
- Creating prompt diffs with clear rationale
- Running the recursive improvement loop
- Auditing prompts for common issues
## MCP Requirements

### memory-mcp (Required)

Purpose: Store proposals, test results, version history

Activation:

```bash
claude mcp add memory-mcp npx @modelcontextprotocol/server-memory
```
## Core Operations

### Operation 1: Analyze Prompt

Before improving, deeply understand the target prompt.
```yaml
analysis:
  target: "{prompt_path}"
  structural_analysis:
    sections: [list of sections]
    flow: "How sections connect"
    dependencies: "What inputs/outputs exist"
  quality_assessment:
    clarity:
      score: 0.0-1.0
      issues: ["Ambiguous instruction in section X"]
    completeness:
      score: 0.0-1.0
      issues: ["Missing failure handling for case Y"]
    precision:
      score: 0.0-1.0
      issues: ["Vague success criteria in section Z"]
  pattern_detection:
    evidence_based_techniques:
      self_consistency: present|missing|partial
      program_of_thought: present|missing|partial
      plan_and_solve: present|missing|partial
    failure_handling:
      explicit_errors: present|missing|partial
      edge_cases: present|missing|partial
      uncertainty: present|missing|partial
  improvement_opportunities:
    - area: "Section X"
      issue: "Lacks explicit timeout handling"
      priority: high|medium|low
      predicted_impact: "+X% reliability"
```
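The pattern-detection step can be partially automated. Below is a minimal sketch, assuming prompts are plain Markdown files on disk; the regex heuristics and the file path are illustrative, not a fixed contract.

```javascript
// Minimal sketch of pattern detection. The heuristics and the target
// path are assumptions for illustration, not part of the spec.
const fs = require("fs");

const PATTERNS = {
  self_consistency: /perspective|cross-reference|consensus/i,
  program_of_thought: /step \d|step by step|show your work/i,
  plan_and_solve: /planning phase|execution phase|validation gate/i,
};

function detectPatterns(promptText) {
  const report = {};
  for (const [name, regex] of Object.entries(PATTERNS)) {
    report[name] = regex.test(promptText) ? "present" : "missing";
  }
  return report;
}

const text = fs.readFileSync("skills/skill-forge/SKILL.md", "utf8");
console.log(detectPatterns(text));
```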
### Operation 2: Generate Improvement Proposal

Create concrete, testable improvement proposals.
```yaml
proposal:
  id: "prop-{timestamp}"
  target: "{prompt_path}"
  type: "prompt_improvement"
  summary: "One-line description of improvement"
  changes:
    - section: "Section name"
      location: "Line X-Y"
      before: |
        Original text...
      after: |
        Improved text...
      rationale: "Why this change improves the prompt"
      technique: "Which evidence-based technique applied"
  predicted_improvement:
    primary_metric: "success_rate"
    expected_delta: "+5%"
    confidence: 0.8
    reasoning: "Based on similar improvements in prompt X"
  risk_assessment:
    regression_risk: low|medium|high
    affected_components:
      - "Component 1"
      - "Component 2"
    rollback_complexity: simple|moderate|complex
  test_plan:
    - test: "Run on benchmark task A"
      expected: "Improvement in clarity score"
    - test: "Check for regressions in task B"
      expected: "No degradation"
```
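A proposal in this shape is easy to sanity-check mechanically before it reaches the eval harness. The sketch below assumes a parsed proposal object with the field names shown above; the required-field set is an assumption for illustration.

```javascript
// Illustrative validator for the proposal schema above; field names mirror
// the YAML, but the required set is an assumption, not a fixed contract.
function validateProposal(p) {
  const errors = [];
  for (const field of ["id", "target", "summary", "changes", "risk_assessment", "test_plan"]) {
    if (!(field in p)) errors.push(`missing field: ${field}`);
  }
  for (const change of p.changes ?? []) {
    if (!change.rationale) errors.push(`change in ${change.section} lacks a rationale`);
    if (change.before === change.after) errors.push(`change in ${change.section} is a no-op`);
  }
  return { valid: errors.length === 0, errors };
}

console.log(validateProposal({
  id: "prop-001",
  changes: [{ section: "Intro", before: "a", after: "b", rationale: "clearer" }],
})); // -> flags the missing target, summary, risk_assessment, test_plan
```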
### Operation 3: Apply Evidence-Based Techniques

Systematically apply research-validated prompting patterns.
#### Self-Consistency Enhancement

BEFORE:

```
Analyze the code and report issues
```

AFTER:

```
Analyze the code from three perspectives:
1. Security perspective: What vulnerabilities exist?
2. Performance perspective: What bottlenecks exist?
3. Maintainability perspective: What code smells exist?

Cross-reference findings. Flag any inconsistencies between perspectives.
Provide confidence scores for each finding.

Return only findings that appear in 2+ perspectives OR have >80% confidence.
```
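The aggregation rule in the AFTER prompt (keep a finding if it appears in 2+ perspectives or any perspective reports >80% confidence) can be expressed directly in code. A minimal sketch, with hypothetical finding data:

```javascript
// Keep a finding if it appears in 2+ perspectives OR any perspective
// reports confidence above 0.8. The findings below are invented.
function aggregateFindings(perspectives) {
  const byFinding = new Map();
  for (const { findings } of perspectives) {
    for (const f of findings) {
      const entry = byFinding.get(f.id) ?? { count: 0, maxConfidence: 0 };
      entry.count += 1;
      entry.maxConfidence = Math.max(entry.maxConfidence, f.confidence);
      byFinding.set(f.id, entry);
    }
  }
  return [...byFinding.entries()]
    .filter(([, e]) => e.count >= 2 || e.maxConfidence > 0.8)
    .map(([id, e]) => ({ id, ...e }));
}

console.log(aggregateFindings([
  { name: "security", findings: [{ id: "sql-injection", confidence: 0.9 }] },
  { name: "performance", findings: [{ id: "n-plus-one", confidence: 0.6 }] },
  { name: "maintainability", findings: [{ id: "n-plus-one", confidence: 0.7 }] },
]));
// -> keeps sql-injection (high confidence) and n-plus-one (2 perspectives)
```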
#### Program-of-Thought Enhancement

BEFORE:

```
Calculate the optimal configuration
```

AFTER:

```
Calculate the optimal configuration step by step:

Step 1: Identify all configuration parameters
- List each parameter
- Document valid ranges
- Note dependencies between parameters

Step 2: Define optimization criteria
- Primary metric: [what to maximize/minimize]
- Constraints: [hard limits]
- Trade-offs: [acceptable compromises]

Step 3: Evaluate options
- For each viable configuration:
  - Calculate primary metric value
  - Verify all constraints met
  - Document trade-offs accepted

Step 4: Select and validate
- Choose configuration with best metric
- Verify against constraints
- Document reasoning

Show your work at each step.
```
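Steps 3 and 4 amount to constrained optimization over a finite candidate set. A minimal sketch, with an invented metric and invented constraints:

```javascript
// Enumerate candidates, drop those that violate any hard constraint,
// then pick the best primary-metric value. All numbers are invented.
function selectConfig(candidates, metric, constraints) {
  const viable = candidates.filter(c => constraints.every(check => check(c)));
  if (viable.length === 0) throw new Error("no configuration satisfies all constraints");
  return viable.reduce((best, c) => (metric(c) > metric(best) ? c : best));
}

const candidates = [
  { workers: 2, cacheMb: 256 },
  { workers: 8, cacheMb: 1024 },
  { workers: 16, cacheMb: 4096 },
];
const best = selectConfig(
  candidates,
  c => c.workers * 100 - c.cacheMb * 0.05,       // primary metric: throughput proxy
  [c => c.cacheMb <= 2048, c => c.workers >= 2]  // hard limits
);
console.log(best); // -> { workers: 8, cacheMb: 1024 }
```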
#### Plan-and-Solve Enhancement

BEFORE:

```
Implement the feature
```

AFTER:

```
Implement the feature using plan-and-solve:

PLANNING PHASE:
1. Create detailed implementation plan
2. Identify all subtasks
3. Map dependencies between subtasks
4. Estimate complexity per subtask
5. Identify risks and mitigations

VALIDATION GATE: Review plan before proceeding

EXECUTION PHASE:
1. Execute subtasks in dependency order
2. Validate completion of each subtask
3. Run tests after each significant change
4. Document any deviations from plan

VERIFICATION PHASE:
1. Verify all requirements met
2. Run full test suite
3. Check for regressions
4. Document final state
```
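"Execute subtasks in dependency order" is a topological sort over the subtask graph. A minimal sketch, with hypothetical subtask names and dependencies:

```javascript
// Depth-first topological sort: each task maps to the list of tasks it
// depends on. Subtask names below are hypothetical.
function executionOrder(tasks) {
  const order = [];
  const visited = new Set();
  const visiting = new Set();
  function visit(name) {
    if (visited.has(name)) return;
    if (visiting.has(name)) throw new Error(`dependency cycle at ${name}`);
    visiting.add(name);
    for (const dep of tasks[name] ?? []) visit(dep);
    visiting.delete(name);
    visited.add(name);
    order.push(name);
  }
  Object.keys(tasks).forEach(visit);
  return order;
}

console.log(executionOrder({
  "write-tests": ["schema"],
  "schema": [],
  "api-handler": ["schema", "write-tests"],
}));
// -> [ 'schema', 'write-tests', 'api-handler' ]
```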
#### Uncertainty Handling Enhancement

BEFORE:

```
Determine the best approach
```

AFTER:

```
Determine the best approach:

If confidence > 80%:
- State your recommendation clearly
- Provide supporting evidence
- Note any caveats

If confidence 50-80%:
- Present top 2-3 options
- Compare trade-offs explicitly
- Recommend with stated uncertainty
- Suggest what additional information would increase confidence

If confidence < 50%:
- Explicitly state uncertainty
- List what you don't know
- Propose information-gathering steps
- Do NOT guess or fabricate

Never present uncertain conclusions as certain.
```
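The three confidence bands translate into a simple routing function. A sketch; the thresholds are the ones stated above, while the response shapes are illustrative:

```javascript
// Route a conclusion into one of the three confidence bands from the
// prompt. The returned object shapes are assumptions for illustration.
function routeByConfidence(confidence, recommendation, alternatives) {
  if (confidence > 0.8) {
    return { mode: "recommend", recommendation, caveats: [] };
  }
  if (confidence >= 0.5) {
    return { mode: "compare", options: alternatives.slice(0, 3), statedUncertainty: confidence };
  }
  return { mode: "defer", knownUnknowns: [], nextSteps: ["gather more information"] };
}

console.log(routeByConfidence(0.65, "use approach A", ["A", "B", "C"]));
// -> { mode: 'compare', options: [ 'A', 'B', 'C' ], statedUncertainty: 0.65 }
```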
### Operation 4: Generate Prompt Diff

Create clear, reviewable diffs for any prompt change.
````diff
--- a/skills/skill-forge/SKILL.md
+++ b/skills/skill-forge/SKILL.md
@@ -45,2 +45,21 @@ Phase 2: Use Case Crystallization
 ## Phase 3: Structural Architecture
 Design the skill's structure based on progressive disclosure.
+
+### Failure Handling (NEW)
+
+For each operation in the skill:
+1. Identify possible failure modes
+2. Define explicit error messages
+3. Specify recovery actions
+4. Include timeout handling
+
+Example:
+```yaml
+error_handling:
+  timeout:
+    threshold: 30s
+    action: "Return partial results with warning"
+  invalid_input:
+    detection: "Validate against schema"
+    action: "Return clear error message with fix suggestion"
+```
````
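Diffs like this one don't have to be written by hand. A sketch of programmatic generation, assuming the jsdiff npm package (`npm install diff`) and its `createTwoFilesPatch` helper:

```javascript
// Generate a unified diff between two prompt versions, assuming the
// jsdiff package. File names and contents below are illustrative.
const { createTwoFilesPatch } = require("diff");

const before = "Design the skill's structure based on progressive disclosure.\n";
const after = before + "\n### Failure Handling (NEW)\n";

console.log(createTwoFilesPatch(
  "a/skills/skill-forge/SKILL.md",
  "b/skills/skill-forge/SKILL.md",
  before,
  after
));
```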
### Operation 5: Self-Improvement (Recursive)

Improve Prompt Forge itself (with safeguards).
```yaml
self_improvement:
  target: "prompt-forge/SKILL.md"
  safeguards:
    - "Changes must pass eval harness"
    - "Requires 2+ auditor approvals"
    - "Previous version archived before commit"
    - "Rollback available for 30 days"
  process:
    - "Analyze current Prompt Forge for weaknesses"
    - "Generate improvement proposals"
    - "Run proposals through eval harness"
    - "If improved: Create new version"
    - "If regressed: Reject and log"
  forbidden_changes:
    - "Removing safeguards"
    - "Bypassing eval harness"
    - "Modifying frozen benchmarks"
    - "Disabling rollback"
```
## Improvement Checklist

When generating prompt improvements, verify the items below (a mechanical pre-screen sketch follows the checklist):

### Clarity

- Each instruction has a single clear action
- Ambiguous terms are defined
- Success criteria are explicit
- Examples illustrate expected behavior

### Completeness

- All inputs are specified
- All outputs are defined
- Edge cases are addressed
- Failure modes have handlers

### Precision

- Quantifiable where possible
- Ranges specified for parameters
- Constraints explicitly stated
- Trade-offs documented

### Evidence-Based Techniques

- Self-consistency for factual tasks
- Program-of-thought for analytical tasks
- Plan-and-solve for complex workflows
- Uncertainty handling for ambiguous cases

### Safety

- Refuse/uncertainty pathway exists
- No forced coherence (the prompt may express uncertainty rather than demand a confident answer)
- Rollback instructions included
- Validation gates present
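Some of these items can be pre-screened mechanically before human review. A hypothetical sketch; each check is a coarse heuristic, not a guarantee that the item is satisfied:

```javascript
// Coarse automated pre-screen for a few checklist items. Both the item
// selection and the regexes are assumptions for illustration.
const CHECKS = [
  { item: "Success criteria are explicit", test: t => /success criteria|expected:/i.test(t) },
  { item: "Failure modes have handlers", test: t => /error|timeout|fallback/i.test(t) },
  { item: "Refuse/uncertainty pathway exists", test: t => /if confidence|do not guess|uncertain/i.test(t) },
];

function preScreen(promptText) {
  return CHECKS.map(({ item, test }) => ({ item, pass: test(promptText) }));
}

console.log(preScreen("If confidence < 50%, do NOT guess. On timeout, return partial results."));
```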
## Integration with Recursive Loop

### Prompt Forge -> Skill Forge

```javascript
// Prompt Forge improves Skill Forge
Task("Prompt Forge",
  `Analyze skill-forge/SKILL.md and generate improvement proposals:
   - Focus on Phase 2 (Use Case Crystallization)
   - Apply self-consistency technique
   - Add explicit failure handling
   Output: Improvement proposal with diff`,
  "prompt-forge")
```

### Skill Forge -> Prompt Forge

```javascript
// Skill Forge rebuilds improved Prompt Forge
Task("Skill Forge",
  `Using the improvement proposal from Prompt Forge:
   - Apply changes to prompt-forge/SKILL.md
   - Validate against skill creation standards
   - Generate test cases for new version
   Output: prompt-forge-v{N+1}/SKILL.md`,
  "skill-forge")
```
### Eval Harness Gate

```javascript
// All changes gated by frozen eval
Task("Eval Runner",
  `Run eval harness on proposed changes:
   - Benchmark suite: prompt-generation-v1
   - Regression tests: prompt-forge-regression-v1
   Requirements:
   - Improvement > 0% on primary metric
   - 0 regressions
   - No new test failures
   Output: ACCEPT or REJECT with reasoning`,
  "eval-runner")
```
## Output Format

All Prompt Forge outputs follow this structure:

```yaml
prompt_forge_output:
  operation: "analyze|propose|improve|diff"
  target: "{prompt_path}"
  timestamp: "ISO-8601"
  analysis: {...}   # If analyze operation
  proposal: {...}   # If propose operation
  diff: "..."       # If diff operation
  next_steps:
    - "Step 1"
    - "Step 2"
  warnings:
    - "Any concerns about this change"
  requires_human_review: true|false
  reason_for_human_review: "If true, why"
```
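The `requires_human_review` flag can be derived from the proposal itself. An illustrative rule, assuming the risk fields from Operation 2; the specific conditions are an assumption:

```javascript
// Return a reason string if human review is required, else false.
// The three trigger conditions are illustrative, not a fixed policy.
function needsHumanReview(proposal) {
  if (proposal.target.includes("prompt-forge")) return "self-modification always reviewed";
  if (proposal.risk_assessment.regression_risk === "high") return "high regression risk";
  if (proposal.predicted_improvement.confidence < 0.5) return "low confidence prediction";
  return false;
}

console.log(needsHumanReview({
  target: "skills/other/SKILL.md",
  risk_assessment: { regression_risk: "high" },
  predicted_improvement: { confidence: 0.8 },
})); // -> 'high regression risk'
```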
## Version History

Prompt Forge versions itself:

```
prompt-forge/
  SKILL.md            # Current version (v1.0.0)
  .archive/
    SKILL-v0.9.0.md   # Previous versions
  CHANGELOG.md        # What changed and why
  METRICS.md          # Performance over time
```
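The "previous version archived before commit" safeguard maps to a few filesystem calls. A sketch assuming the directory layout above; paths and naming are illustrative:

```javascript
// Copy the current SKILL.md into .archive/ before a new version is
// committed. Directory layout and naming follow the tree above.
const fs = require("fs");
const path = require("path");

function archiveCurrent(skillDir, currentVersion) {
  const archiveDir = path.join(skillDir, ".archive");
  fs.mkdirSync(archiveDir, { recursive: true });
  fs.copyFileSync(
    path.join(skillDir, "SKILL.md"),
    path.join(archiveDir, `SKILL-v${currentVersion}.md`)
  );
}

archiveCurrent("prompt-forge", "1.0.0"); // writes prompt-forge/.archive/SKILL-v1.0.0.md
```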
**Status:** Production-Ready | **Version:** 1.0.0 | **Key Constraint:** All self-improvements gated by frozen eval harness