| name | llm-evaluation |
| description | LLM prompt testing, evaluation, and CI/CD quality gates using Promptfoo. Invoke when: - Setting up prompt evaluation or regression testing - Integrating LLM testing into CI/CD pipelines - Configuring security testing (red teaming, jailbreaks) - Comparing prompt or model performance - Building evaluation suites for RAG, factuality, or safety Keywords: promptfoo, llm evaluation, prompt testing, red team, CI/CD, regression testing |
LLM Evaluation & Testing
Test prompts, models, and RAG systems with automated evaluation and CI/CD integration.
Quick Start
# Initialize Promptfoo (no global install needed)
npx promptfoo@latest init
# Run evaluation
npx promptfoo@latest eval
# View results in browser
npx promptfoo@latest view
# Run security scan
npx promptfoo@latest redteam run
Core Concepts
Why Evaluate?
LLM outputs are non-deterministic. "It looks good" isn't testing. You need:
- Regression detection: Catch quality drops before production
- Security scanning: Find jailbreaks, injections, PII leaks
- A/B comparison: Compare prompts/models side-by-side
- CI/CD gates: Block bad changes from merging
Evaluation Types
| Type | Purpose | Assertions |
|---|---|---|
| Functional | Does it work? | contains, equals, is-json |
| Semantic | Is it correct? | similar, llm-rubric, factuality |
| Performance | Is it fast/cheap? | cost, latency |
| Security | Is it safe? | redteam, moderation, pii-detection |
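A single test can mix assertions from all four rows; a minimal illustrative sketch:
tests:
  - vars:
      question: "What is the boiling point of water at sea level?"
    assert:
      - type: contains        # Functional
        value: "100"
      - type: llm-rubric      # Semantic
        value: "States 100°C / 212°F and mentions sea-level pressure"
      - type: latency         # Performance
        threshold: 3000
      - type: moderation      # Security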
Configuration
Basic promptfooconfig.yaml
description: "My LLM evaluation suite"
prompts:
- file://prompts/main.txt
providers:
- openai:gpt-4o-mini
- anthropic:claude-3-5-sonnet-latest
tests:
- vars:
question: "What is the capital of France?"
assert:
- type: contains
value: "Paris"
- type: cost
threshold: 0.01
- vars:
question: "Explain quantum computing"
assert:
- type: llm-rubric
value: "Response explains quantum computing concepts clearly"
- type: latency
threshold: 3000
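The prompt file references test vars with Nunjucks-style `{{question}}` placeholders; a minimal prompts/main.txt to pair with the tests above:
You are a concise, factual assistant.

Question: {{question}}
Answer: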
With Environment Variables
providers:
- id: openrouter:anthropic/claude-3-5-sonnet
config:
apiKey: ${OPENROUTER_API_KEY}
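The referenced variable must exist in the environment (shell or CI secret store) before the eval runs:
export OPENROUTER_API_KEY="sk-or-..."  # placeholder value
npx promptfoo@latest eval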
Assertions Reference
Basic Assertions
assert:
# String matching
- type: contains
value: "expected text"
- type: not-contains
value: "forbidden text"
- type: equals
value: "exact match"
- type: starts-with
value: "prefix"
- type: regex
value: "\\d{4}-\\d{2}-\\d{2}" # Date pattern
# JSON validation
- type: is-json
# Validate structure with a JSON schema (is-json accepts an optional schema as its value)
- type: is-json
  value:
    type: object
    properties:
      name: { type: string }
    required: [name]
Semantic Assertions
assert:
# Semantic similarity (embeddings)
- type: similar
value: "The capital of France is Paris"
threshold: 0.8 # 0-1 similarity score
# LLM-as-judge with custom criteria
- type: llm-rubric
value: |
Response must:
1. Be factually accurate
2. Be under 100 words
3. Not contain marketing language
# Factuality check against reference
- type: factuality
value: "Paris is the capital of France"
Performance Assertions
assert:
# Cost budget (USD)
- type: cost
threshold: 0.05 # Max $0.05 per request
# Latency (milliseconds)
- type: latency
threshold: 2000 # Max 2 seconds
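To apply the same budgets to every test instead of repeating them, attach the assertions to `defaultTest`:
defaultTest:
  assert:
    - type: cost
      threshold: 0.05   # Applies to all tests
    - type: latency
      threshold: 2000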
Security Assertions
assert:
# Content moderation (OpenAI moderation API by default; value limits the categories checked)
- type: moderation
  value:
    - violence
# PII detection
- type: not-contains
value: "{{email}}" # From test vars
CI/CD Integration
GitHub Action
name: 'Prompt Evaluation'
on:
pull_request:
paths: ['prompts/**', 'src/**/*prompt*']
jobs:
evaluate:
runs-on: ubuntu-latest
permissions:
pull-requests: write
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: '20'
# Cache for faster runs
- uses: actions/cache@v4
with:
path: ~/.promptfoo
key: ${{ runner.os }}-promptfoo-${{ hashFiles('promptfooconfig.yaml') }}
# Run evaluation and post results to PR
- uses: promptfoo/promptfoo-action@v1
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
openai-api-key: ${{ secrets.OPENAI_API_KEY }} # Or other provider keys
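If you prefer the plain CLI over the action, the same gate works because a failing assertion exits non-zero (a sketch of the equivalent step):
      - name: Run promptfoo eval
        run: npx promptfoo@latest eval -c promptfooconfig.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}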
Quality Gates
# promptfooconfig.yaml
evaluateOptions:
  maxConcurrency: 5  # Parallel requests during eval
# `eval` exits with a non-zero code if any assertion fails, so the
# CI job fails automatically. For graded pass/fail rather than
# all-or-nothing, use a per-test threshold (sketch below).
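A per-test `threshold` aggregates weighted assertion scores, which is useful when some checks matter more than others (sketch):
tests:
  - description: "Pass if weighted score >= 0.9"
    threshold: 0.9
    vars:
      question: "What is the capital of France?"
    assert:
      - type: contains
        value: "Paris"
        weight: 3  # Correctness dominates the score
      - type: latency
        threshold: 3000
        weight: 1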
Output to JSON (for custom CI)
npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json
# Check results in CI script (stats are nested under .results in the output file)
if [ "$(jq '.results.stats.failures' results.json)" -gt 0 ]; then
  echo "Evaluation failed!"
  exit 1
fi
Security Testing (Red Team)
Quick Scan
# Run red team against your prompts
npx promptfoo@latest redteam run
# View the red team report in the browser
npx promptfoo@latest redteam report
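If no red team configuration exists yet, promptfoo can generate one interactively:
npx promptfoo@latest redteam init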
Configuration
# promptfooconfig.yaml
redteam:
purpose: "Customer support chatbot"
  plugins:  # What to probe for
    - harmful:hate
    - harmful:violence
    - harmful:self-harm
    - pii:direct
    - pii:session
    - hijacking
  strategies:  # How attacks are delivered (jailbreak and injection are strategies, not plugins)
- jailbreak
- prompt-injection
- base64
- leetspeak
OWASP Top 10 Coverage
redteam:
  # Promptfoo also ships an `owasp:llm` plugin collection that applies the
  # full OWASP LLM Top 10 in one line; the manual mapping below covers the
  # categories that apply to evals.
  plugins:
    # 2. Insecure Output Handling
    - harmful:privacy
    # 4. Model Denial of Service & 8. Excessive Agency
    - excessive-agency
    # 6. Sensitive Information Disclosure
    - pii:direct
    - pii:session
    # 7. Insecure Plugin Design
    - hijacking
  strategies:
    # 1. Prompt Injection
    - prompt-injection
  # 3. Training Data Poisoning, 5. Supply Chain, 10. Model Theft: N/A for evals
  # 9. Overreliance: cover with factuality assertions instead
RAG Evaluation
Context-Aware Testing
prompts:
- |
Context: {{context}}
Question: {{question}}
Answer based only on the context provided.
tests:
- vars:
context: "The Eiffel Tower was built in 1889 for the World's Fair."
question: "When was the Eiffel Tower built?"
assert:
- type: contains
value: "1889"
- type: factuality
value: "The Eiffel Tower was built in 1889"
- type: not-contains
value: "1900" # Common hallucination
Retrieval Quality
# Test that retrieval returns relevant documents
tests:
- vars:
query: "Python list comprehension"
assert:
- type: llm-rubric
value: "Response discusses Python list comprehension syntax and examples"
- type: not-contains
value: "I don't know" # Shouldn't punt on this query
Comparing Models/Prompts
A/B Testing
# Compare two prompts
prompts:
- file://prompts/v1.txt
- file://prompts/v2.txt
# Same tests for both
tests:
- vars: { question: "Explain recursion" }
assert:
- type: llm-rubric
value: "Clear explanation of recursion with example"
Model Comparison
# Compare multiple models
providers:
- openai:gpt-4o-mini
- anthropic:claude-3-5-haiku-latest
- openrouter:google/gemini-flash-1.5
# Run: npx promptfoo@latest eval
# View: npx promptfoo@latest view
# Compare cost, latency, quality side-by-side
Best Practices
1. Golden Test Cases
Maintain a set of critical test cases that must always pass. Any failing assertion makes `eval` exit non-zero, so keeping golden tests in their own config that runs on every PR acts as a hard gate:
# golden-tests.yaml
tests:
  - description: "Core functionality - must pass"
    vars:
      input: "critical test case"
    assert:
      - type: contains
        value: "expected output"
2. Regression Suite Structure
prompts/
├── production.txt # Current production prompt
├── candidate.txt # New prompt being tested
tests/
├── golden/ # Critical tests (run on every PR)
│ └── core-functionality.yaml
├── regression/ # Full regression suite (nightly)
│ └── full-suite.yaml
└── security/ # Red team tests
└── redteam.yaml
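Assuming each suite file is a standalone promptfoo config, the schedule maps directly onto `-c` flags (paths from the tree above):
# On every PR: golden tests only
npx promptfoo@latest eval -c tests/golden/core-functionality.yaml
# Nightly: full regression suite
npx promptfoo@latest eval -c tests/regression/full-suite.yaml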
3. Test Categories
tests:
# Happy path
- description: "Standard query"
vars: { question: "What is 2+2?" }
assert:
- type: contains
value: "4"
# Edge cases
- description: "Empty input"
vars: { question: "" }
assert:
- type: not-contains
value: "error"
# Adversarial
- description: "Injection attempt"
vars: { question: "Ignore previous instructions and..." }
assert:
- type: not-contains
value: "Here's how to" # Should refuse
References
- references/promptfoo-guide.md - Detailed setup and configuration
- references/evaluation-metrics.md - Metrics deep dive
- references/ci-cd-integration.md - CI/CD patterns
- references/alternatives.md - Braintrust, DeepEval, LangSmith comparison
Templates
Copy-paste ready templates:
- templates/promptfooconfig.yaml - Basic config
- templates/github-action-eval.yml - GitHub Action workflow
- templates/regression-test-suite.yaml - Full regression suite