| name | llm-evaluation |
| description | LLM prompt testing, evaluation, and CI/CD quality gates using Promptfoo. Invoke when: - Setting up prompt evaluation or regression testing - Integrating LLM testing into CI/CD pipelines - Configuring security testing (red teaming, jailbreaks) - Comparing prompt or model performance - Building evaluation suites for RAG, factuality, or safety Keywords: promptfoo, llm evaluation, prompt testing, red team, CI/CD, regression testing |
LLM Evaluation & Testing
Test prompts, models, and RAG systems with automated evaluation and CI/CD integration.
Quick Start
# Initialize Promptfoo (no global install needed)
npx promptfoo@latest init
# Run evaluation
npx promptfoo@latest eval
# View results in browser
npx promptfoo@latest view
# Run security scan
npx promptfoo@latest redteam run
Core Concepts
Why Evaluate?
LLM outputs are non-deterministic. "It looks good" isn't testing. You need:
- Regression detection: Catch quality drops before production
- Security scanning: Find jailbreaks, injections, PII leaks
- A/B comparison: Compare prompts/models side-by-side
- CI/CD gates: Block bad changes from merging
Evaluation Types
| Type | Purpose | Assertions |
|---|---|---|
| Functional | Does it work? | contains, equals, is-json |
| Semantic | Is it correct? | similar, llm-rubric, factuality |
| Performance | Is it fast/cheap? | cost, latency |
| Security | Is it safe? | redteam, moderation, pii-detection |
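A single test can mix assertions from all four rows; a minimal illustrative sketch:
tests:
  - vars:
      question: "What is the boiling point of water at sea level?"
    assert:
      - type: contains        # Functional
        value: "100"
      - type: llm-rubric      # Semantic
        value: "States 100°C / 212°F and mentions sea-level pressure"
      - type: latency         # Performance
        threshold: 3000
      - type: moderation      # Security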
Configuration
Basic promptfooconfig.yaml
description: "My LLM evaluation suite"
prompts:
- file://prompts/main.txt
providers:
- openai:gpt-4o-mini
- anthropic:claude-3-5-sonnet-latest
tests:
- vars:
question: "What is the capital of France?"
assert:
- type: contains
value: "Paris"
- type: cost
threshold: 0.01
- vars:
question: "Explain quantum computing"
assert:
- type: llm-rubric
value: "Response explains quantum computing concepts clearly"
- type: latency
threshold: 3000
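The prompt file references test vars with Nunjucks-style `{{question}}` placeholders; a minimal prompts/main.txt to pair with the tests above:
You are a concise, factual assistant.

Question: {{question}}
Answer: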
With Environment Variables
providers:
- id: openrouter:anthropic/claude-3-5-sonnet
config:
apiKey: ${OPENROUTER_API_KEY}
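The referenced variable must exist in the environment (shell or CI secret store) before the eval runs:
export OPENROUTER_API_KEY="sk-or-..."  # placeholder value
npx promptfoo@latest eval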
Assertions Reference
Basic Assertions
assert:
# String matching
- type: contains
value: "expected text"
- type: not-contains
value: "forbidden text"
- type: equals
value: "exact match"
- type: starts-with
value: "prefix"
- type: regex
value: "\\d{4}-\\d{2}-\\d{2}" # Date pattern
# JSON validation
- type: is-json
# Validate structure with a JSON schema (is-json accepts an optional schema as its value)
- type: is-json
  value:
    type: object
    properties:
      name: { type: string }
    required: [name]
Semantic Assertions
assert:
# Semantic similarity (embeddings)
- type: similar
value: "The capital of France is Paris"
threshold: 0.8 # 0-1 similarity score
# LLM-as-judge with custom criteria
- type: llm-rubric
value: |
Response must:
1. Be factually accurate
2. Be under 100 words
3. Not contain marketing language
# Factuality check against reference
- type: factuality
value: "Paris is the capital of France"
Performance Assertions
assert:
# Cost budget (USD)
- type: cost
threshold: 0.05 # Max $0.05 per request
# Latency (milliseconds)
- type: latency
threshold: 2000 # Max 2 seconds
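To apply the same budgets to every test instead of repeating them, attach the assertions to `defaultTest`:
defaultTest:
  assert:
    - type: cost
      threshold: 0.05   # Applies to all tests
    - type: latency
      threshold: 2000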
Security Assertions
assert:
# Content moderation (OpenAI moderation API by default; value limits the categories checked)
- type: moderation
  value:
    - violence
# PII detection
- type: not-contains
value: "{{email}}" # From test vars
CI/CD Integration
GitHub Action
name: 'Prompt Evaluation'
on:
pull_request:
paths: ['prompts/**', 'src/**/*prompt*']
jobs:
evaluate:
runs-on: ubuntu-latest
permissions:
pull-requests: write
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: '20'
# Cache for faster runs
- uses: actions/cache@v4
with:
path: ~/.promptfoo
key: ${{ runner.os }}-promptfoo-${{ hashFiles('promptfooconfig.yaml') }}
# Run evaluation and post results to PR
- uses: promptfoo/promptfoo-action@v1
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
openai-api-key: ${{ secrets.OPENAI_API_KEY }} # Or other provider keys
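If you prefer the plain CLI over the action, the same gate works because a failing assertion exits non-zero (a sketch of the equivalent step):
      - name: Run promptfoo eval
        run: npx promptfoo@latest eval -c promptfooconfig.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}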
Quality Gates
# promptfooconfig.yaml
evaluateOptions:
  maxConcurrency: 5  # Parallel requests during eval
# `eval` exits with a non-zero code if any assertion fails, so the
# CI job fails automatically. For graded pass/fail rather than
# all-or-nothing, use a per-test threshold (sketch below).
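A per-test `threshold` aggregates weighted assertion scores, which is useful when some checks matter more than others (sketch):
tests:
  - description: "Pass if weighted score >= 0.9"
    threshold: 0.9
    vars:
      question: "What is the capital of France?"
    assert:
      - type: contains
        value: "Paris"
        weight: 3  # Correctness dominates the score
      - type: latency
        threshold: 3000
        weight: 1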
Output to JSON (for custom CI)
npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json
# Check results in CI script (stats are nested under .results in the output file)
if [ "$(jq '.results.stats.failures' results.json)" -gt 0 ]; then
  echo "Evaluation failed!"
  exit 1
fi
Security Testing (Red Team)
Quick Scan
# Run red team against your prompts
npx promptfoo@latest redteam run
# View the red team report in the browser
npx promptfoo@latest redteam report
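If no red team configuration exists yet, promptfoo can generate one interactively:
npx promptfoo@latest redteam init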
Configuration
# promptfooconfig.yaml
redteam:
purpose: "Customer support chatbot"
  plugins:  # What to probe for
    - harmful:hate
    - harmful:violence
    - harmful:self-harm
    - pii:direct
    - pii:session
    - hijacking
  strategies:  # How attacks are delivered (jailbreak and injection are strategies, not plugins)
- jailbreak
- prompt-injection
- base64
- leetspeak
OWASP Top 10 Coverage
redteam:
  # Promptfoo also ships an `owasp:llm` plugin collection that applies the
  # full OWASP LLM Top 10 in one line; the manual mapping below covers the
  # categories that apply to evals.
  plugins:
    # 2. Insecure Output Handling
    - harmful:privacy
    # 4. Model Denial of Service & 8. Excessive Agency
    - excessive-agency
    # 6. Sensitive Information Disclosure
    - pii:direct
    - pii:session
    # 7. Insecure Plugin Design
    - hijacking
  strategies:
    # 1. Prompt Injection
    - prompt-injection
  # 3. Training Data Poisoning, 5. Supply Chain, 10. Model Theft: N/A for evals
  # 9. Overreliance: cover with factuality assertions instead
RAG Evaluation
Context-Aware Testing
prompts:
- |
Context: {{context}}
Question: {{question}}
Answer based only on the context provided.
tests:
- vars:
context: "The Eiffel Tower was built in 1889 for the World's Fair."
question: "When was the Eiffel Tower built?"
assert:
- type: contains
value: "1889"
- type: factuality
value: "The Eiffel Tower was built in 1889"
- type: not-contains
value: "1900" # Common hallucination
Retrieval Quality
# Test that retrieval returns relevant documents
tests:
- vars:
query: "Python list comprehension"
assert:
- type: llm-rubric
value: "Response discusses Python list comprehension syntax and examples"
- type: not-contains
value: "I don't know" # Shouldn't punt on this query
Comparing Models/Prompts
A/B Testing
# Compare two prompts
prompts:
- file://prompts/v1.txt
- file://prompts/v2.txt
# Same tests for both
tests:
- vars: { question: "Explain recursion" }
assert:
- type: llm-rubric
value: "Clear explanation of recursion with example"
Model Comparison
# Compare multiple models
providers:
- openai:gpt-4o-mini
- anthropic:claude-3-5-haiku-latest
- openrouter:google/gemini-flash-1.5
# Run: npx promptfoo@latest eval
# View: npx promptfoo@latest view
# Compare cost, latency, quality side-by-side
Best Practices
1. Golden Test Cases
Maintain a set of critical test cases that must always pass. Any failing assertion makes `eval` exit non-zero, so keeping golden tests in their own config that runs on every PR acts as a hard gate:
# golden-tests.yaml
tests:
  - description: "Core functionality - must pass"
    vars:
      input: "critical test case"
    assert:
      - type: contains
        value: "expected output"
2. Regression Suite Structure
prompts/
├── production.txt # Current production prompt
├── candidate.txt # New prompt being tested
tests/
├── golden/ # Critical tests (run on every PR)
│ └── core-functionality.yaml
├── regression/ # Full regression suite (nightly)
│ └── full-suite.yaml
└── security/ # Red team tests
└── redteam.yaml
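Assuming each suite file is a standalone promptfoo config, the schedule maps directly onto `-c` flags (paths from the tree above):
# On every PR: golden tests only
npx promptfoo@latest eval -c tests/golden/core-functionality.yaml
# Nightly: full regression suite
npx promptfoo@latest eval -c tests/regression/full-suite.yaml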
3. Test Categories
tests:
# Happy path
- description: "Standard query"
vars: { question: "What is 2+2?" }
assert:
- type: contains
value: "4"
# Edge cases
- description: "Empty input"
vars: { question: "" }
assert:
- type: not-contains
value: "error"
# Adversarial
- description: "Injection attempt"
vars: { question: "Ignore previous instructions and..." }
assert:
- type: not-contains
value: "Here's how to" # Should refuse
References
- references/promptfoo-guide.md - Detailed setup and configuration
- references/evaluation-metrics.md - Metrics deep dive
- references/ci-cd-integration.md - CI/CD patterns
- references/alternatives.md - Braintrust, DeepEval, LangSmith comparison
Templates
Copy-paste ready templates:
- templates/promptfooconfig.yaml - Basic config
- templates/github-action-eval.yml - GitHub Action workflow
- templates/regression-test-suite.yaml - Full regression suite