| name | deepeval |
| description | Use when discussing or working with DeepEval (the python AI evaluation framework) |
DeepEval
Overview
DeepEval is a pytest-based framework for testing LLM applications. It provides 50+ evaluation metrics covering RAG pipelines, conversational AI, agents, safety, and custom criteria. DeepEval integrates into development workflows through pytest, supports multiple LLM providers, and includes component-level tracing with the @observe decorator.
Repository: https://github.com/confident-ai/deepeval
Documentation: https://deepeval.com
Installation
pip install -U deepeval
Requires Python 3.9+.
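Optionally, log in to sync results with the Confident AI cloud platform (referenced in the dataset and workflow sections below):
deepeval login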
Quick Start
Basic pytest test
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def test_chatbot():
    metric = AnswerRelevancyMetric(threshold=0.7, model="anthropic-claude-sonnet-4-5")
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="You have 30 days for full refund"
    )
    assert_test(test_case, [metric])
Run with: deepeval test run test_chatbot.py
Environment setup
DeepEval automatically loads .env.local then .env:
# .env
OPENAI_API_KEY="sk-..."
Core Workflows
RAG Evaluation
Evaluate both retrieval and generation phases:
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    AnswerRelevancyMetric,
    FaithfulnessMetric
)

# Retrieval metrics
contextual_precision = ContextualPrecisionMetric(threshold=0.7)
contextual_recall = ContextualRecallMetric(threshold=0.7)
contextual_relevancy = ContextualRelevancyMetric(threshold=0.7)

# Generation metrics
answer_relevancy = AnswerRelevancyMetric(threshold=0.7)
faithfulness = FaithfulnessMetric(threshold=0.8)

test_case = LLMTestCase(
    input="What are the side effects of aspirin?",
    actual_output="Common side effects include stomach upset and nausea.",
    expected_output="Aspirin side effects include gastrointestinal issues.",
    retrieval_context=[
        "Aspirin common side effects: stomach upset, nausea, vomiting.",
        "Serious aspirin side effects: gastrointestinal bleeding.",
    ]
)

evaluate(test_cases=[test_case], metrics=[
    contextual_precision, contextual_recall, contextual_relevancy,
    answer_relevancy, faithfulness
])
Component-level tracing:
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span

# Reuses the metric instances defined in the snippet above
@observe(metrics=[contextual_relevancy])
def retriever(query: str):
    chunks = your_vector_db.search(query)
    update_current_span(
        test_case=LLMTestCase(input=query, retrieval_context=chunks)
    )
    return chunks

@observe(metrics=[answer_relevancy, faithfulness])
def generator(query: str, chunks: list):
    response = your_llm.generate(query, chunks)
    update_current_span(
        test_case=LLMTestCase(
            input=query,
            actual_output=response,
            retrieval_context=chunks
        )
    )
    return response

@observe()
def rag_pipeline(query: str):
    chunks = retriever(query)
    return generator(query, chunks)
Conversational AI Evaluation
Test multi-turn dialogues:
from deepeval import evaluate
from deepeval.test_case import Turn, ConversationalTestCase
from deepeval.metrics import (
    RoleAdherenceMetric,
    KnowledgeRetentionMetric,
    ConversationCompletenessMetric,
    TurnRelevancyMetric
)

convo_test_case = ConversationalTestCase(
    chatbot_role="professional, empathetic medical assistant",
    turns=[
        Turn(role="user", content="I have a persistent cough"),
        Turn(role="assistant", content="How long have you had this cough?"),
        Turn(role="user", content="About a week now"),
        Turn(role="assistant", content="A week-long cough should be evaluated.")
    ]
)

metrics = [
    RoleAdherenceMetric(threshold=0.7),
    KnowledgeRetentionMetric(threshold=0.7),
    ConversationCompletenessMetric(threshold=0.6),
    TurnRelevancyMetric(threshold=0.7)
]

evaluate(test_cases=[convo_test_case], metrics=metrics)
Agent Evaluation
Test tool usage and task completion:
from deepeval import evaluate
from deepeval.test_case import Turn, ConversationalTestCase, ToolCall
from deepeval.metrics import (
    TaskCompletionMetric,
    ToolUseMetric,
    ArgumentCorrectnessMetric
)

agent_test_case = ConversationalTestCase(
    turns=[
        Turn(role="user", content="When did Trump first raise tariffs?"),
        Turn(
            role="assistant",
            content="Let me search for that information.",
            tools_called=[
                ToolCall(
                    name="WebSearch",
                    arguments={"query": "Trump first raised tariffs year"}
                )
            ]
        ),
        Turn(role="assistant", content="Trump first raised tariffs in 2018.")
    ]
)

evaluate(
    test_cases=[agent_test_case],
    metrics=[
        TaskCompletionMetric(threshold=0.7),
        ToolUseMetric(threshold=0.7),
        ArgumentCorrectnessMetric(threshold=0.7)
    ]
)
Safety Evaluation
Check for harmful content:
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    ToxicityMetric,
    BiasMetric,
    PIILeakageMetric,
    HallucinationMetric
)

def safety_gate(output: str, user_input: str) -> tuple[bool, list]:
    """Returns a (passed, reasons) tuple."""
    test_case = LLMTestCase(input=user_input, actual_output=output)
    safety_metrics = [
        ToxicityMetric(threshold=0.5),
        BiasMetric(threshold=0.5),
        PIILeakageMetric(threshold=0.5)
    ]
    failures = []
    for metric in safety_metrics:
        metric.measure(test_case)
        if not metric.is_successful():
            failures.append(f"{type(metric).__name__}: {metric.reason}")
    return len(failures) == 0, failures
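A hedged usage sketch; user_query and llm_response stand in for your application's actual values:
import logging

# user_query / llm_response are placeholders for your app's real values
passed, reasons = safety_gate(output=llm_response, user_input=user_query)
if not passed:
    logging.warning("Safety gate failed: %s", reasons)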
Metric Selection Guide
RAG Metrics
Retrieval Phase:
- ContextualPrecisionMetric - Relevant chunks ranked higher than irrelevant ones
- ContextualRecallMetric - All necessary information retrieved
- ContextualRelevancyMetric - Retrieved chunks relevant to input
Generation Phase:
- AnswerRelevancyMetric - Output addresses the input query
- FaithfulnessMetric - Output grounded in retrieval context
Conversational Metrics
- TurnRelevancyMetric - Each turn relevant to conversation
- KnowledgeRetentionMetric - Information retained across turns
- ConversationCompletenessMetric - All aspects addressed
- RoleAdherenceMetric - Chatbot maintains assigned role
- TopicAdherenceMetric - Conversation stays on topic
Agent Metrics
- TaskCompletionMetric - Task successfully completed
- ToolUseMetric - Correct tools selected
- ArgumentCorrectnessMetric - Tool arguments correct
- MCPUseMetric - MCP used correctly
Safety Metrics
- ToxicityMetric - Harmful content detection
- BiasMetric - Biased output identification
- HallucinationMetric - Fabricated information detection (see the sketch below)
- PIILeakageMetric - Personal information leakage
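HallucinationMetric is the one metric above not shown in the safety gate; note it evaluates actual_output against context (the ground-truth documents), not retrieval_context. A minimal sketch with illustrative values:
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric

# Lower scores mean fewer hallucinations; threshold is the maximum allowed
metric = HallucinationMetric(threshold=0.5)
test_case = LLMTestCase(
    input="Who wrote the report?",
    actual_output="The report was written by the finance team in 2020.",
    context=["The 2021 report was authored by the finance team."]
)
metric.measure(test_case)
print(metric.score, metric.reason)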
Custom Metrics
G-Eval (LLM-based):
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

custom_metric = GEval(
    name="Professional Tone",
    criteria="Determine if response maintains professional, empathetic tone",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
    model="anthropic-claude-sonnet-4-5"
)
BaseMetric subclass:
See references/custom_metrics.md for complete guide on creating custom metrics with BaseMetric subclassing and deterministic scorers (ROUGE, BLEU, BERTScore).
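For orientation, a minimal deterministic subclass might look like this (the LengthMetric name and scoring rule are illustrative, not part of DeepEval):
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class LengthMetric(BaseMetric):
    """Passes when the output stays under a character budget."""
    def __init__(self, max_length: int = 300, threshold: float = 0.5):
        self.max_length = max_length
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        self.score = 1.0 if len(test_case.actual_output) <= self.max_length else 0.0
        self.success = self.score >= self.threshold
        self.reason = f"Output length: {len(test_case.actual_output)} chars"
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Length"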
Configuration
LLM Provider Setup
DeepEval supports OpenAI, Anthropic Claude, Google Gemini, AWS Bedrock, and 100+ providers via LiteLLM. Anthropic models are preferred.
CLI configuration (global):
deepeval set-azure-openai --openai-endpoint=... --openai-api-key=... --deployment-name=...
deepeval set-ollama deepseek-r1:1.5b
Python configuration (per-metric):
from deepeval.models import AnthropicModel, OllamaModel

# `settings` here stands for your application's own configuration object
anthropic_model = AnthropicModel(
    model_id=settings.anthropic_model_id,
    client_args={"api_key": settings.anthropic_api_key},
    temperature=settings.agent_temperature
)
metric = AnswerRelevancyMetric(model=anthropic_model)
See references/model_providers.md for complete provider configuration guide.
Performance Optimisation
Async mode is enabled by default. Configure with AsyncConfig and CacheConfig:
from deepeval import evaluate, AsyncConfig, CacheConfig

evaluate(
    test_cases=[...],
    metrics=[...],
    async_config=AsyncConfig(
        run_async=True,
        max_concurrent=20,  # Reduce if rate limited
        throttle_value=0    # Delay between test cases (seconds)
    ),
    cache_config=CacheConfig(
        use_cache=True,     # Read from cache
        write_cache=True    # Write to cache
    )
)
CLI parallelisation:
deepeval test run -n 4 -c -i # 4 processes, cached, ignore errors
Best practices:
- Limit to 5 metrics maximum (2-3 generic + 1-2 custom)
- Use the latest available Anthropic Claude Sonnet or Haiku models
- Reduce max_concurrent to 5 if hitting rate limits
- Prefer evaluate() over individual measure() calls
See references/async_performance.md for detailed performance optimisation guide.
Dataset Management
Loading datasets
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset()

# From CSV
dataset.add_goldens_from_csv_file(
    file_path="./test_data.csv",
    input_col_name="question",
    expected_output_col_name="answer",
    context_col_name="context",
    context_col_delimiter="|"
)

# From JSON
dataset.add_goldens_from_json_file(
    file_path="./test_data.json",
    input_key_name="query",
    expected_output_key_name="response"
)
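Goldens hold inputs and expected outputs but no actual_output; a common pattern is to generate outputs at evaluation time (your_app below is a placeholder for your LLM application):
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

# your_app is a placeholder for your LLM application
for golden in dataset.goldens:
    dataset.add_test_case(
        LLMTestCase(input=golden.input, actual_output=your_app(golden.input))
    )

evaluate(test_cases=dataset.test_cases, metrics=[AnswerRelevancyMetric()])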
Synthetic generation
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()

# From documents
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["./docs/knowledge_base.pdf"],
    max_goldens_per_document=10,
    evolution_types=["REASONING", "MULTICONTEXT", "COMPARATIVE"]
)

# From scratch
goldens = synthesizer.generate_goldens_from_scratch(
    subject="customer support for SaaS product",
    task="answer user questions about billing",
    max_goldens=20
)
Evolution types: REASONING, MULTICONTEXT, CONCRETISING, CONSTRAINED, COMPARATIVE, HYPOTHETICAL, IN_BREADTH
See references/dataset_management.md for complete dataset guide including versioning and cloud integration.
Test Case Types
Single-turn (LLMTestCase)
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="You have 30 days for full refund",
    expected_output="We offer 30-day full refund",
    retrieval_context=["All customers eligible for 30 day refund"],
    tools_called=[ToolCall(name="...", arguments={"...": "..."})]
)
Multi-turn (ConversationalTestCase)
from deepeval.test_case import Turn, ConversationalTestCase

convo_test_case = ConversationalTestCase(
    chatbot_role="helpful customer service agent",
    turns=[
        Turn(role="user", content="I need help with my order"),
        Turn(role="assistant", content="I'd be happy to help"),
        Turn(role="user", content="It hasn't arrived yet")
    ]
)
Multimodal (MLLMTestCase)
from deepeval.test_case import MLLMTestCase, MLLMImage

m_test_case = MLLMTestCase(
    input=["Describe this image", MLLMImage(url="./photo.png", local=True)],
    actual_output=["A red bicycle leaning against a wall"]
)
CI/CD Integration
# .github/workflows/test.yml
name: LLM Tests
on: [push, pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - name: Install dependencies
        run: pip install deepeval
      - name: Run evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: deepeval test run tests/
References
Detailed implementation guides:
references/model_providers.md - Complete guide for configuring OpenAI, Anthropic, Gemini, Bedrock, and local models. Includes provider-specific considerations, cost analysis, and troubleshooting.
references/custom_metrics.md - Complete guide for creating custom metrics by subclassing BaseMetric. Includes deterministic scorers (ROUGE, BLEU, BERTScore) and LLM-based evaluation patterns.
references/async_performance.md - Complete guide for optimising evaluation performance with async mode, caching, concurrency tuning, and rate limit handling.
references/dataset_management.md - Complete guide for dataset loading, saving, synthetic generation, versioning, and cloud integration with Confident AI.
Best Practices
Metric Selection
- Match metrics to use case (RAG systems need retrieval + generation metrics)
- Start with 2-3 essential metrics, expand as needed
- Use appropriate thresholds (0.7-0.8 for production, 0.5-0.6 for development; see the sketch after this list)
- Combine complementary metrics (answer relevancy + faithfulness)
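A sketch of the threshold guidance above (the APP_ENV convention is hypothetical, not a DeepEval feature):
import os
from deepeval.metrics import AnswerRelevancyMetric

# Hypothetical convention: stricter threshold in production than in development
threshold = 0.8 if os.getenv("APP_ENV") == "production" else 0.5
metric = AnswerRelevancyMetric(threshold=threshold)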
Test Case Design
- Create representative examples covering common queries and edge cases
- Include context when needed (
retrieval_contextfor RAG,expected_outputfor G-Eval) - Use datasets for scale testing
- Version test cases over time
Evaluation Workflow
- Component-level first - Use
@observefor individual parts - End-to-end validation before deployment
- Automate in CI/CD with
deepeval test run - Track results over time with Confident AI cloud
Testing Anti-Patterns
Avoid:
- Testing only happy paths
- Using unrealistic inputs
- Ignoring metric reasons
- Setting thresholds too high initially
- Running full test suite on every change
Do:
- Test edge cases and failure modes
- Use real user queries as test inputs
- Read and analyse metric reasons
- Adjust thresholds based on empirical results
- Use component-level tests during development
- Separate config and eval content from code