name: eval-frameworks
description: Evaluation framework patterns for RAG and LLMs, including faithfulness metrics, synthetic dataset generation, and LLM-as-a-judge patterns. Triggers: ragas, deepeval, llm-eval, faithfulness, hallucination-check, synthetic-data.
Evaluation Frameworks
Overview
Traditional software metrics (accuracy, F1) fail to capture the quality of LLM outputs. Evaluation frameworks like Ragas and DeepEval use "LLM-as-a-judge" to quantify subjective qualities like faithfulness, relevance, and professionalism.
When to Use
- RAG Benchmarking: To verify if answers are supported by retrieved context (Faithfulness).
- Regression Testing: Ensuring that a prompt change or model upgrade doesn't break existing behavior.
- Synthetic Benchmarking: Creating evaluation sets when manual gold-standard data is unavailable.
Decision Tree
- Do you want to check for hallucinations?
- YES: Run a Faithfulness metric.
- Do you need to check whether the retrieved context is actually useful for the question?
- YES: Run a Retrieval Relevance metric.
- Do you need to scale evaluation without manual labeling?
- YES: Use Synthetic Data Generation.
Workflows
1. Evaluating RAG Faithfulness
- Capture the `query`, the `retrieved_context`, and the `actual_output` from the system.
- Run a Faithfulness metric (from Ragas or LlamaIndex), which uses an LLM to verify whether the claims in the output are supported by the context (see the sketch below).
- If the score is low, investigate whether the context was irrelevant or the model hallucinated.
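A minimal sketch of this loop, assuming the Ragas 0.1-style `evaluate()` API with `question`/`answer`/`contexts` column names (newer Ragas releases rename these columns) and an API key available for the judge model:

```python
# Hedged sketch: faithfulness check with Ragas (0.1-style API).
# Assumes OPENAI_API_KEY is set so Ragas can call its default judge LLM.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

# One captured example; in practice, log these from your RAG pipeline.
rows = {
    "question": ["What does the warranty cover?"],
    "answer": ["The warranty covers manufacturing defects for two years."],
    "contexts": [[
        "Our warranty covers manufacturing defects for a period of 24 months.",
    ]],
}

# Faithfulness asks the judge LLM whether every claim in `answer`
# can be inferred from `contexts`; scores near 1.0 mean no hallucination.
result = evaluate(Dataset.from_dict(rows), metrics=[faithfulness])
print(result)
```

If the score is low, compare the flagged claims against the retrieved chunks to decide whether retrieval or generation is at fault.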
2. Unit Testing LLM Outputs (DeepEval)
- Install `deepeval` and create a test file `test_example.py`.
- Define an `LLMTestCase` with `input`, `actual_output`, and `expected_output`.
- Apply a `GEval` metric with a custom `criteria` (e.g., 'professionalism').
- Run `deepeval test run` to assert that the score meets the defined threshold (see the sketch below).
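A minimal sketch of such a test file, following DeepEval's documented GEval pattern; the criteria string, threshold, and example texts are placeholders, and exact parameter names may differ slightly between DeepEval versions:

```python
# test_example.py -- run with: deepeval test run test_example.py
# Hedged sketch of a GEval-based unit test; assumes an API key is
# configured so GEval can call its judge model.
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams


def test_professional_tone():
    test_case = LLMTestCase(
        input="A customer asks why their order is late.",
        actual_output="We sincerely apologize for the delay; your order ships tomorrow.",
        expected_output="A courteous apology with a concrete next step.",
    )

    # GEval grades the output against a natural-language criteria
    # using an LLM judge; threshold is the pass/fail cutoff.
    professionalism = GEval(
        name="Professionalism",
        criteria="Determine whether the actual output responds in a professional, courteous tone.",
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
        ],
        threshold=0.7,
    )

    assert_test(test_case, [professionalism])
```

`deepeval test run` executes the file like a pytest suite and fails the test if the GEval score falls below the threshold.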
3. Automated Question Generation
- Point the evaluation framework (e.g., LlamaIndex) at a set of source documents.
- Use the `QuestionGeneration` module to synthetically create test cases (question-context pairs).
- Run the RAG pipeline against these generated questions to benchmark performance across the entire dataset (see the sketch below).
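A minimal sketch of the generation step. The workflow above refers generically to a question-generation module; this sketch assumes LlamaIndex's `DatasetGenerator` from `llama_index.core.evaluation`, whose name and import path have moved across releases, so treat the exact identifiers as assumptions to check against your installed version:

```python
# Hedged sketch: generate benchmark questions from source documents.
# Assumes a ./docs folder of source files and a configured LLM API key.
from llama_index.core import SimpleDirectoryReader
from llama_index.core.evaluation import DatasetGenerator

documents = SimpleDirectoryReader("./docs").load_data()

# Split documents into chunks and ask the LLM to write questions that
# each chunk can answer, yielding question-context pairs.
generator = DatasetGenerator.from_documents(
    documents,
    num_questions_per_chunk=2,  # assumed parameter name; tune per version
)
questions = generator.generate_questions_from_nodes()

for q in questions[:5]:
    print(q)

# Next step: run each question through the RAG pipeline and score the
# answers with the faithfulness / relevance metrics from the workflows above.
```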
Non-Obvious Insights
- LLM-as-a-Judge: A stronger model (GPT-4o) can effectively grade a smaller/faster model (Llama 3) with human-like accuracy using research-backed metrics like GEval.
- Separation of Concerns: Good evaluation splits into 'Response' (was the answer good?) and 'Retrieval' (did we find the right docs?). Fixing one doesn't always fix the other (see the combined sketch below).
- Synthetic Scaling: Manual evaluation doesn't scale; using an LLM to generate 1000 edge cases from your data is often the only practical way to reach high production confidence.
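The Separation of Concerns point is easiest to see in code: response-side and retrieval-side metrics are computed independently over the same rows, so each can fail on its own. A hedged sketch, again assuming Ragas 0.1-style metric and column names:

```python
# Hedged sketch: score response quality and retrieval quality separately.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,    # response: does the answer address the question?
    context_precision,   # retrieval: are the retrieved chunks on-topic?
    context_recall,      # retrieval: did we retrieve everything needed?
    faithfulness,        # response: is the answer grounded in the context?
)

rows = {
    "question": ["What does the warranty cover?"],
    "answer": ["The warranty covers manufacturing defects for two years."],
    "contexts": [["Our warranty covers manufacturing defects for 24 months."]],
    # Some retrieval-side metrics also need a reference answer.
    "ground_truth": ["Manufacturing defects are covered for 24 months."],
}

result = evaluate(
    Dataset.from_dict(rows),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

# Low faithfulness alongside high context_precision points at the generator
# hallucinating; the reverse points at the retriever.
print(result)
```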
Evidence
- "Faithfulness: Evaluates if the answer is faithful to the retrieved contexts (in other words, whether if there’s hallucination)." - LlamaIndex
- "GEval is a research-backed metric... for you to evaluate your LLM output's on any custom metric with human-like accuracy." - DeepEval
- "Traditional evaluation metrics don't capture what matters for LLM applications." - Ragas
Scripts
- scripts/eval-frameworks_tool.py: Script defining a Faithfulness evaluation loop.
- scripts/eval-frameworks_tool.js: Node.js simulation for calculating relevance scores.
Dependencies
- ragas
- deepeval
- llama-index