| name | promptfoo-evaluation |
| description | Configures and runs LLM evaluation using Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison". |
Promptfoo Evaluation
Overview
This skill provides guidance for configuring and running LLM evaluations using Promptfoo, an open-source CLI tool for testing and comparing LLM outputs.
Quick Start
# Initialize a new evaluation project
npx promptfoo@latest init
# Run evaluation
npx promptfoo@latest eval
# View results in browser
npx promptfoo@latest view
Configuration Structure
A typical Promptfoo project structure:
project/
├── promptfooconfig.yaml    # Main configuration
├── prompts/
│   ├── system.md           # System prompt
│   └── chat.json           # Chat format prompt
├── tests/
│   └── cases.yaml          # Test cases
└── scripts/
    └── metrics.py          # Custom Python assertions
Core Configuration (promptfooconfig.yaml)
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "My LLM Evaluation"

# Prompts to test
prompts:
  - file://prompts/system.md
  - file://prompts/chat.json

# Models to compare
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    label: Claude-4.5-Sonnet
  - id: openai:gpt-4.1
    label: GPT-4.1

# Test cases
tests: file://tests/cases.yaml

# Default assertions for all tests
defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:custom_assert
    - type: llm-rubric
      value: |
        Evaluate the response quality on a 0-1 scale.
      threshold: 0.7

# Output path
outputPath: results/eval-results.json
Prompt Formats
Text Prompt (system.md)
You are a helpful assistant.
Task: {{task}}
Context: {{context}}
Chat Format (chat.json)
[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "{{user_input}}"}
]
Few-Shot Pattern
Embed examples directly in the prompt, or use the chat format with assistant messages:
[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "Example input: {{example_input}}"},
  {"role": "assistant", "content": "{{example_output}}"},
  {"role": "user", "content": "Now process: {{actual_input}}"}
]
Test Cases (tests/cases.yaml)
- description: "Test case 1"
  vars:
    system_prompt: file://prompts/system.md
    user_input: "Hello world"
    # Load content from files
    context: file://data/context.txt
  assert:
    - type: contains
      value: "expected text"
    - type: python
      value: file://scripts/metrics.py:custom_check
      threshold: 0.8
Python Custom Assertions
Create a Python file for custom assertions (e.g., scripts/metrics.py):
def get_assert(output: str, context: dict) -> dict:
    """Default assertion function."""
    vars_dict = context.get('vars', {})

    # Access test variables
    expected = vars_dict.get('expected', '')

    # Return result
    return {
        "pass": expected in output,
        "score": 0.8,
        "reason": "Contains expected content",
        "named_scores": {"relevance": 0.9}
    }


def custom_check(output: str, context: dict) -> dict:
    """Custom named assertion."""
    word_count = len(output.split())
    passed = 100 <= word_count <= 500
    return {
        "pass": passed,
        "score": min(1.0, word_count / 300),
        "reason": f"Word count: {word_count}"
    }
Key points:
- Default function name is `get_assert`
- Specify a different function with `file://path.py:function_name`
- Return `bool`, `float` (score), or a `dict` with pass/score/reason
- Access test variables via `context['vars']`
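A minimal sketch of the three return forms, using the same `(output, context)` signature as above; the function names and the word-count/expected-text checks are placeholders for illustration, not promptfoo conventions:

def passes_check(output: str, context: dict) -> bool:
    """Boolean form: True/False maps directly to pass/fail."""
    return "hello" in output.lower()

def score_check(output: str, context: dict) -> float:
    """Float form: a 0-1 score (compared against the assertion's threshold, if set)."""
    words = len(output.split())
    return min(1.0, words / 300)

def graded_check(output: str, context: dict) -> dict:
    """Dict form: explicit pass/score/reason, as in the examples above."""
    expected = context.get('vars', {}).get('expected', '')
    found = expected in output
    return {
        "pass": found,
        "score": 1.0 if found else 0.0,
        "reason": "expected text found" if found else "expected text missing"
    }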
LLM-as-Judge (llm-rubric)
assert:
  - type: llm-rubric
    value: |
      Evaluate the response based on:
      1. Accuracy of information
      2. Clarity of explanation
      3. Completeness
      Score 0.0-1.0 where 0.7+ is passing.
    threshold: 0.7
    provider: openai:gpt-4.1  # Optional: override grader model
Best practices:
- Provide clear scoring criteria
- Use `threshold` to set the minimum passing score
- The default grader uses available API keys (OpenAI → Anthropic → Google)
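To set one grader for every model-graded assertion instead of overriding `provider` per assertion, promptfoo also accepts a grading provider under `defaultTest.options`. A sketch, assuming that option is available in your promptfoo version and reusing `openai:gpt-4.1` from the config above:

defaultTest:
  options:
    # Grader used by llm-rubric and other model-graded assertions
    provider: openai:gpt-4.1
  assert:
    - type: llm-rubric
      value: "Response is accurate, clear, and complete."
      threshold: 0.7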
Common Assertion Types
| Type | Usage | Example |
|---|---|---|
| `contains` | Check substring | `value: "hello"` |
| `icontains` | Case-insensitive | `value: "HELLO"` |
| `equals` | Exact match | `value: "42"` |
| `regex` | Pattern match | `value: "\\d{4}"` |
| `python` | Custom logic | `value: file://script.py` |
| `llm-rubric` | LLM grading | `value: "Is professional"` |
| `latency` | Response time | `threshold: 1000` |
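These types can be combined in a single test case. A sketch reusing assertion types from the table above (the expected values are placeholders):

- description: "Combined assertions"
  vars:
    user_input: "Say hello and include the year 2024"
  assert:
    - type: icontains
      value: "hello"
    - type: regex
      value: "\\d{4}"
    - type: latency
      threshold: 1000   # milliseconds
    - type: llm-rubric
      value: "The response is polite and professional."
      threshold: 0.7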
File References
All paths are relative to config file location:
# Load file content as variable
vars:
  content: file://data/input.txt

# Load prompt from file
prompts:
  - file://prompts/main.md

# Load test cases from file
tests: file://tests/cases.yaml

# Load Python assertion
assert:
  - type: python
    value: file://scripts/check.py:validate
Running Evaluations
# Basic run
npx promptfoo@latest eval
# With specific config
npx promptfoo@latest eval --config path/to/config.yaml
# Output to file
npx promptfoo@latest eval --output results.json
# Filter tests
npx promptfoo@latest eval --filter-metadata category=math
# View results
npx promptfoo@latest view
Troubleshooting
Python not found:
export PROMPTFOO_PYTHON=python3
Large outputs truncated:
Outputs over 30,000 characters are truncated. Use `head_limit` in assertions.
File not found errors:
Ensure paths are relative to promptfooconfig.yaml location.
Echo Provider (Preview Mode)
Use the echo provider to preview rendered prompts without making API calls:
# promptfooconfig-preview.yaml
providers:
  - echo  # Returns prompt as output, no API calls
tests:
  - vars:
      input: "test content"
Use cases:
- Preview prompt rendering before expensive API calls
- Verify few-shot examples are loaded correctly
- Debug variable substitution issues
- Validate prompt structure
# Run preview mode
npx promptfoo@latest eval --config promptfooconfig-preview.yaml
Cost: Free - no API tokens consumed.
Advanced Few-Shot Implementation
Multi-turn Conversation Pattern
For complex few-shot learning with full worked examples (the // comments below are annotations for readability; remove them if your prompt file must be strict JSON):

[
  {"role": "system", "content": "{{system_prompt}}"},

  // Few-shot Example 1
  {"role": "user", "content": "Task: {{example_input_1}}"},
  {"role": "assistant", "content": "{{example_output_1}}"},

  // Few-shot Example 2 (optional)
  {"role": "user", "content": "Task: {{example_input_2}}"},
  {"role": "assistant", "content": "{{example_output_2}}"},

  // Actual test
  {"role": "user", "content": "Task: {{actual_input}}"}
]
Test case configuration:
tests:
  - vars:
      system_prompt: file://prompts/system.md
      # Few-shot examples
      example_input_1: file://data/examples/input1.txt
      example_output_1: file://data/examples/output1.txt
      example_input_2: file://data/examples/input2.txt
      example_output_2: file://data/examples/output2.txt
      # Actual test
      actual_input: file://data/test1.txt
Best practices:
- Use 1-3 few-shot examples (more may dilute effectiveness)
- Ensure examples match the task format exactly
- Load examples from files for better maintainability
- Use the echo provider first to verify structure (see the preview sketch below)
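A minimal preview config for that last check, pairing the multi-turn prompt with the echo provider from the Echo Provider section. The file names here are hypothetical and should be replaced with your own:

# promptfooconfig-fewshot-preview.yaml (hypothetical file names)
prompts:
  - file://prompts/fewshot-chat.json   # The multi-turn prompt shown above
providers:
  - echo                               # Render only, no API calls
tests:
  - vars:
      system_prompt: file://prompts/system.md
      example_input_1: file://data/examples/input1.txt
      example_output_1: file://data/examples/output1.txt
      example_input_2: file://data/examples/input2.txt
      example_output_2: file://data/examples/output2.txt
      actual_input: file://data/test1.txt

Running npx promptfoo@latest eval --config promptfooconfig-fewshot-preview.yaml followed by npx promptfoo@latest view shows the fully rendered conversation without spending tokens.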
Long Text Handling
For Chinese/long-form content evaluations (10k+ characters):
Configuration:
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    config:
      max_tokens: 8192  # Increase for long outputs

defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:check_length
Python assertion for text metrics:
import re


def strip_tags(text: str) -> str:
    """Remove HTML tags for pure text."""
    return re.sub(r'<[^>]+>', '', text)


def check_length(output: str, context: dict) -> dict:
    """Check output length constraints."""
    raw_input = context['vars'].get('raw_input', '')
    input_len = len(strip_tags(raw_input))
    output_len = len(strip_tags(output))
    reduction_ratio = (1 - output_len / input_len) if input_len > 0 else 0

    return {
        "pass": 0.7 <= reduction_ratio <= 0.9,
        "score": reduction_ratio,
        "reason": f"Reduction: {reduction_ratio:.1%} (target: 70-90%)",
        "named_scores": {
            "input_length": input_len,
            "output_length": output_len,
            "reduction_ratio": reduction_ratio
        }
    }
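For check_length to work, each test case must supply the original text as the raw_input variable it reads from context['vars']. A sketch of a matching test case, with a hypothetical file path:

- description: "Long transcript reduction"
  vars:
    raw_input: file://data/transcript1.txt   # Original long text, also fed to the prompt
  assert:
    - type: python
      value: file://scripts/metrics.py:check_length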
Real-World Example
Project: Chinese short-video content curation from long transcripts
Structure:
tiaogaoren/
├── promptfooconfig.yaml           # Production config
├── promptfooconfig-preview.yaml   # Preview config (echo provider)
├── prompts/
│   ├── tiaogaoren-prompt.json     # Chat format with few-shot
│   └── v4/system-v4.md            # System prompt
├── tests/cases.yaml               # 3 test samples
├── scripts/metrics.py             # Custom metrics (reduction ratio, etc.)
├── data/                          # 5 samples (2 few-shot, 3 eval)
└── results/
See: /Users/tiansheng/Workspace/prompts/tiaogaoren/ for full implementation.
Resources
For detailed API reference and advanced patterns, see references/promptfoo_api.md.