SKILL.md

name: evaluating-rag
description: Evaluate RAG systems with hit rate, MRR, faithfulness metrics and compare retrieval strategies. Use when testing retrieval quality, generating evaluation datasets, comparing embeddings or retrievers, A/B testing, or measuring production RAG performance.

Evaluating RAG Systems

A guide to measuring RAG performance, comparing strategies, and implementing continuous evaluation, focused on key metrics and practical testing approaches.

When to Use This Skill

  • Testing retrieval quality and accuracy
  • Generating evaluation datasets for your domain
  • Comparing different retrieval strategies (vector vs BM25 vs hybrid)
  • A/B testing embedding models or rerankers
  • Measuring production RAG performance
  • Validating improvements after optimizations
  • Comparing your 7 retrieval strategies in src/ or src-iLand/

Key Evaluation Metrics

Retrieval Metrics

Hit Rate: Fraction of queries for which a relevant document appears in the top-k retrieved results

  • Perfect: 1.0 (all queries found relevant docs)
  • Good: 0.85+ (85%+ queries successful)
  • Needs work: <0.70

MRR (Mean Reciprocal Rank): Quality of ranking

  • Perfect: 1.0 (relevant doc always ranked #1)
  • Good: 0.80+ (relevant doc typically in top 2-3)
  • Formula: average of 1/rank of the first relevant document across queries, counted as 0 when nothing relevant is retrieved (see the short example below)
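
A quick worked example of both formulas (illustrative ranks; a rank is the position of the first relevant document, None if nothing relevant was retrieved in the top-k):

# Illustrative ranks of the first relevant document per query (None = miss)
ranks = [1, 2, None, 1, 4]

hit_rate = sum(r is not None for r in ranks) / len(ranks)       # 4/5 = 0.80
mrr = sum(1 / r for r in ranks if r is not None) / len(ranks)   # (1 + 0.5 + 1 + 0.25) / 5 = 0.55
print(f"Hit Rate: {hit_rate:.2f}, MRR: {mrr:.2f}")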

Response Metrics

Faithfulness: No hallucinations; the answer is grounded in the retrieved context
Correctness: Factually accurate compared with a reference answer
Relevancy: Directly addresses the query

Quick Decision Guide

When to Evaluate

  • After implementing → Baseline performance
  • After optimization → Validate improvements
  • Before production → Quality gate
  • In production → Continuous monitoring

What to Measure

  • Development → Hit rate + MRR (retrieval quality)
  • Production → All metrics (retrieval + response quality)
  • A/B testing → Comparative metrics

Dataset Size

  • Quick test → 20-50 Q&A pairs
  • Thorough eval → 100-200 pairs
  • Production → 500+ pairs

Quick Start Patterns

Pattern 1: Basic Retrieval Evaluation

from llama_index.core.evaluation import RetrieverEvaluator

# Create evaluator
evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"],
    retriever=retriever
)

# Run evaluation (returns one result per query)
eval_results = await evaluator.aevaluate_dataset(qa_dataset)

# Aggregate per-query scores into dataset-level metrics
hit_rate = sum(r.metric_vals_dict["hit_rate"] for r in eval_results) / len(eval_results)
mrr = sum(r.metric_vals_dict["mrr"] for r in eval_results) / len(eval_results)

print(f"Hit Rate: {hit_rate:.3f}")
print(f"MRR: {mrr:.3f}")

Pattern 2: Generate Evaluation Dataset

from llama_index.core.evaluation import generate_question_context_pairs
from llama_index.llms.openai import OpenAI

# Generate Q&A pairs from your documents
llm = OpenAI(model="gpt-4o-mini")
qa_dataset = generate_question_context_pairs(
    nodes,
    llm=llm,
    num_questions_per_chunk=2
)

# Filter invalid entries (filter_qa_dataset is a custom helper; see the sketch below)
qa_dataset = filter_qa_dataset(qa_dataset)

# Save for reuse
qa_dataset.save_json("evaluation_dataset.json")
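
filter_qa_dataset above is a custom helper rather than a LlamaIndex import; reference-metrics.md has the full filtering code. A minimal sketch of the idea, assuming the EmbeddingQAFinetuneDataset structure returned by generate_question_context_pairs:

from llama_index.core.evaluation import EmbeddingQAFinetuneDataset

def filter_qa_dataset(dataset: EmbeddingQAFinetuneDataset) -> EmbeddingQAFinetuneDataset:
    """Drop auto-generated artifacts (empty questions, leftover LLM boilerplate)."""
    bad_markers = ("Here are 2", "questions based on", "---")   # adjust to the artifacts you actually see
    keep = {
        qid: q for qid, q in dataset.queries.items()
        if q.strip() and not any(marker in q for marker in bad_markers)
    }
    return EmbeddingQAFinetuneDataset(
        queries=keep,
        corpus=dataset.corpus,
        relevant_docs={qid: dataset.relevant_docs[qid] for qid in keep},
    )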

Pattern 3: Compare Multiple Strategies

strategies = {
    "vector": vector_retriever,
    "bm25": bm25_retriever,
    "hybrid": hybrid_retriever,
    "metadata": metadata_retriever,
}

results = {}
for strategy_name, retriever in strategies.items():
    evaluator = RetrieverEvaluator.from_metric_names(
        ["mrr", "hit_rate"],
        retriever=retriever
    )
    eval_results = await evaluator.aevaluate_dataset(qa_dataset)
    # Aggregate per-query results into mean scores for this strategy
    results[strategy_name] = {
        metric: sum(r.metric_vals_dict[metric] for r in eval_results) / len(eval_results)
        for metric in ("hit_rate", "mrr")
    }
    print(f"{strategy_name}: {results[strategy_name]}")

# Find best strategy by hit rate
best_strategy = max(results, key=lambda name: results[name]["hit_rate"])
print(f"\nBest strategy: {best_strategy}")

Pattern 4: Compare With/Without Reranking

# Without reranking: retrieve the top 5 directly
retriever_no_rerank = index.as_retriever(similarity_top_k=5)

# With reranking: retrieve a wider top 10, then rerank down to 5.
# The reranker is attached to the evaluator as a node post-processor
# (post-processing is applied by the evaluator, not by as_retriever).
from llama_index.postprocessor.cohere_rerank import CohereRerank
retriever_wide = index.as_retriever(similarity_top_k=10)

# Evaluate both configurations
results = {}
for name, kwargs in [
    ("No Rerank", {"retriever": retriever_no_rerank}),
    ("With Rerank", {"retriever": retriever_wide,
                     "node_postprocessors": [CohereRerank(top_n=5)]}),
]:
    evaluator = RetrieverEvaluator.from_metric_names(["mrr", "hit_rate"], **kwargs)
    eval_results = await evaluator.aevaluate_dataset(qa_dataset)
    results[name] = {
        m: sum(r.metric_vals_dict[m] for r in eval_results) / len(eval_results)
        for m in ("hit_rate", "mrr")
    }
    print(f"{name}: Hit Rate={results[name]['hit_rate']:.3f}, MRR={results[name]['mrr']:.3f}")

# Calculate relative improvement in hit rate
improvement = (results["With Rerank"]["hit_rate"] - results["No Rerank"]["hit_rate"]) / results["No Rerank"]["hit_rate"]
print(f"Improvement: {improvement * 100:.1f}%")

Pattern 5: Response Quality Evaluation

from llama_index.core.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator
)

# Initialize evaluators
faithfulness_evaluator = FaithfulnessEvaluator()
relevancy_evaluator = RelevancyEvaluator()

# Generate response
response = query_engine.query("What is machine learning?")

# Evaluate faithfulness (no hallucinations)
faithfulness_result = faithfulness_evaluator.evaluate_response(
    response=response
)
print(f"Faithfulness: {faithfulness_result.passing}")

# Evaluate relevancy
relevancy_result = relevancy_evaluator.evaluate_response(
    query="What is machine learning?",
    response=response
)
print(f"Relevancy: {relevancy_result.passing}")

Your Codebase Integration

For src/ Pipeline (7 Strategies)

Compare All Strategies:

strategies = {
    "vector": "src/10_basic_query_engine.py",
    "summary": "src/11_document_summary_retriever.py",
    "recursive": "src/12_recursive_retriever.py",
    "metadata": "src/14_metadata_filtering.py",
    "chunk_decoupling": "src/15_chunk_decoupling.py",
    "hybrid": "src/16_hybrid_search.py",
    "planner": "src/17_query_planning_agent.py",
}

# Create evaluation framework to compare all 7

Baseline Performance:

  1. Generate Q&A dataset from your documents
  2. Evaluate each strategy
  3. Identify best performer
  4. Use as a baseline for later improvements (see the persistence sketch below)
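
A simple way to lock in that baseline is to persist the aggregated metrics dict from Pattern 3 and diff later runs against it (the file name and new_results are illustrative):

import json

# Save the aggregated metrics from Pattern 3 as the baseline
with open("baseline_metrics.json", "w") as f:
    json.dump(results, f, indent=2)

# Later, after an optimization, compare a fresh run against the stored baseline
with open("baseline_metrics.json") as f:
    baseline = json.load(f)
for name, metrics in new_results.items():   # new_results has the same shape as `results`
    delta = metrics["hit_rate"] - baseline[name]["hit_rate"]
    print(f"{name}: hit_rate {metrics['hit_rate']:.3f} ({delta:+.3f} vs baseline)")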

For src-iLand/ Pipeline (Thai Land Deeds)

Thai-Specific Evaluation:

# Generate Thai Q&A pairs
llm = OpenAI(model="gpt-4o-mini")  # Supports Thai
qa_dataset = generate_question_context_pairs(
    thai_nodes,
    llm=llm,
    num_questions_per_chunk=2
)

# Test with Thai queries
thai_queries = [
    "โฉนดที่ดินในกรุงเทพ",  # Land deeds in Bangkok
    "นส.3 คืออะไร",  # What is NS.3
    "ที่ดินในสมุทรปราการ"  # Land in Samut Prakan
]
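
For a quick spot-check before a full evaluation run, retrieve for each Thai query and eyeball the top hit (assumes a vector retriever built over thai_nodes, e.g. index.as_retriever(similarity_top_k=3)):

# Manual spot-check of Thai retrieval quality
for query in thai_queries:
    results = retriever.retrieve(query)
    if results:
        top = results[0]
        print(f"{query} -> score={top.score:.3f} | {top.get_content()[:80]}")
    else:
        print(f"{query} -> <no results>")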

Router Evaluation (src-iLand/retrieval/router.py):

  • Test index classification accuracy
  • Test strategy selection appropriateness
  • Measure end-to-end performance
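
For the classification-accuracy item above, one option is to score the router's choices against a small set of hand-labelled queries; router.route() and its return shape are hypothetical here, so adapt them to the actual interface in src-iLand/retrieval/router.py:

# Hypothetical interface: router.route(query) -> (index_name, strategy_name)
labelled_queries = [
    ("โฉนดที่ดินในกรุงเทพ", "expected_index_a"),   # illustrative labels
    ("นส.3 คืออะไร", "expected_index_b"),
]

correct = sum(
    router.route(query)[0] == expected_index
    for query, expected_index in labelled_queries
)
print(f"Index classification accuracy: {correct / len(labelled_queries):.2%}")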

Fast Metadata Testing:

  • Validate <50ms response time
  • Test filtering accuracy
  • Compare with/without fast indexing
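
For the <50ms target, a simple timing harness over a sample of evaluation queries is usually enough (a sketch; fast_retriever is whichever retriever you are validating):

import time

latencies = []
for query in list(qa_dataset.queries.values())[:100]:   # sample of evaluation queries
    start = time.perf_counter()
    fast_retriever.retrieve(query)
    latencies.append((time.perf_counter() - start) * 1000)   # milliseconds

latencies.sort()
p95 = latencies[int(0.95 * len(latencies)) - 1]
print(f"p50={latencies[len(latencies) // 2]:.1f} ms, p95={p95:.1f} ms")
assert p95 < 50, "fast metadata retrieval misses the 50 ms p95 target"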

Detailed References

Load these when you need comprehensive details:

  • reference-metrics.md: Complete evaluation guide

    • All metrics (hit rate, MRR, faithfulness, correctness)
    • Dataset generation techniques
    • A/B testing frameworks
    • Production monitoring
    • Statistical significance testing
  • reference-agents.md: Advanced techniques

    • Agents (FunctionAgent, ReActAgent)
    • Multi-agent systems
    • Query engines (Router, SubQuestion)
    • Workflow orchestration
    • Observability and debugging

Common Workflows

Workflow 1: Create Evaluation Dataset

  • Step 1: Prepare representative documents

    • Sample from different categories
    • Include edge cases
  • Step 2: Generate Q&A pairs

    qa_dataset = generate_question_context_pairs(
        nodes, llm=llm, num_questions_per_chunk=2
    )
    
  • Step 3: Filter invalid entries

    • Remove auto-generated artifacts
    • Load reference-metrics.md for filtering code
  • Step 4: Manual review (optional)

    • Check 10-20 samples
    • Ensure question quality
  • Step 5: Save for reuse

    qa_dataset.save_json("eval_dataset.json")
    

Workflow 2: Compare Retrieval Strategies

  • Step 1: Load evaluation dataset

    from llama_index.core.evaluation import EmbeddingQAFinetuneDataset
    qa_dataset = EmbeddingQAFinetuneDataset.from_json("eval_dataset.json")
    
  • Step 2: Define strategies to compare

    • List all retrievers to test
    • For src/: All 7 strategies
    • For src-iLand/: Router + individual strategies
  • Step 3: Run evaluation for each

    # evaluate() is a small helper, not a library call; see the sketch after this workflow
    results = {}
    for name, retriever in strategies.items():
        results[name] = evaluate(retriever, qa_dataset)
    
  • Step 4: Compare results

    • Identify best hit rate
    • Identify best MRR
    • Consider trade-offs (latency, cost)
  • Step 5: Document findings

    • Record baseline performance
    • Note best strategies for different query types
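
The evaluate(retriever, qa_dataset) shorthand used in Workflows 2 and 3 is not a library call; a minimal synchronous version, reusing the aggregation from Pattern 1, might look like this (inside a notebook, await aevaluate_dataset directly instead of using asyncio.run):

import asyncio

from llama_index.core.evaluation import RetrieverEvaluator

def evaluate(retriever, qa_dataset):
    """Return mean hit rate and MRR for a retriever on a Q&A dataset."""
    evaluator = RetrieverEvaluator.from_metric_names(
        ["mrr", "hit_rate"], retriever=retriever
    )
    eval_results = asyncio.run(evaluator.aevaluate_dataset(qa_dataset))
    return {
        metric: sum(r.metric_vals_dict[metric] for r in eval_results) / len(eval_results)
        for metric in ("hit_rate", "mrr")
    }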

Workflow 3: A/B Test an Optimization

  • Step 1: Measure baseline

    baseline_results = evaluate(current_retriever, qa_dataset)
    
  • Step 2: Apply optimization

    • Add reranking
    • Change embedding model
    • Adjust chunk size
    • etc.
  • Step 3: Measure optimized version

    optimized_results = evaluate(optimized_retriever, qa_dataset)
    
  • Step 4: Calculate improvement

    improvement = (optimized_results["hit_rate"] - baseline_results["hit_rate"]) / baseline_results["hit_rate"] * 100
    print(f"Hit Rate improvement: {improvement:.1f}%")
    
  • Step 5: Decide based on data

    • If improvement > 5%: Deploy
    • If improvement < 2%: Consider cost/complexity
    • If negative: Rollback

Workflow 4: Production Monitoring

  • Step 1: Create production evaluation set

    • Sample real user queries
    • Include ground truth when available
  • Step 2: Set up continuous evaluation

    class ProductionEvaluator:
        def evaluate_query(self, query, response):
            # Score the response, log metrics, and track trends over time
            # (see the fuller sketch after this workflow)
            ...
    
  • Step 3: Define alerts

    • Hit rate < 0.80 → Alert
    • MRR < 0.70 → Alert
    • Latency p95 > 2s → Alert
  • Step 4: Monitor trends

    • Daily/weekly metrics
    • Detect degradation early
  • Step 5: Iterate based on data

    • Identify failure patterns
    • Generate new test cases
    • Improve weak areas
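
A slightly fuller version of the Step 2 skeleton, wiring in the Step 3 hit-rate threshold as an alert (the in-memory history and logging call are placeholders for your real metrics store and alerting stack):

import logging

HIT_RATE_ALERT = 0.80   # Step 3 threshold; add MRR and latency checks the same way

class ProductionEvaluator:
    """Track per-query retrieval hits and alert when the rolling hit rate degrades."""

    def __init__(self):
        self.history = []                     # swap for your real metrics store

    def evaluate_query(self, query, retrieved_ids, expected_ids):
        hit = int(any(doc_id in expected_ids for doc_id in retrieved_ids))
        self.history.append({"query": query, "hit": hit})

    def check_alerts(self, window=100):
        recent = self.history[-window:]
        hit_rate = sum(r["hit"] for r in recent) / max(len(recent), 1)
        if hit_rate < HIT_RATE_ALERT:
            logging.warning("Rolling hit rate %.2f below %.2f", hit_rate, HIT_RATE_ALERT)
        return hit_rate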

Workflow 5: Evaluate All 7 Strategies (src/)

  • Step 1: Generate comprehensive dataset

    • Cover different query types
    • Factual, summarization, comparison
  • Step 2: Run each strategy

    python src/10_basic_query_engine.py  # Vector
    python src/11_document_summary_retriever.py  # Summary
    python src/12_recursive_retriever.py  # Recursive
    python src/14_metadata_filtering.py  # Metadata
    python src/15_chunk_decoupling.py  # Chunk decoupling
    python src/16_hybrid_search.py  # Hybrid
    python src/17_query_planning_agent.py  # Planner
    
  • Step 3: Collect metrics

    • Hit rate for each
    • MRR for each
    • Latency for each
  • Step 4: Create comparison table

    | Strategy | Hit Rate | MRR | Latency | Use Case     |
    |----------|----------|-----|---------|--------------|
    | Vector   | ...      | ... | ...     | General      |
    | Hybrid   | ...      | ... | ...     | Best overall |
    | ...      | ...      | ... | ...     | ...          |
  • Step 5: Document recommendations

    • Best for factual queries
    • Best for complex queries
    • Best for production (speed + quality)

Evaluation Metrics Reference

Hit Rate Interpretation

  • 1.0 → Perfect (all queries successful)
  • 0.90+ → Excellent
  • 0.80-0.89 → Good
  • 0.70-0.79 → Acceptable
  • <0.70 → Needs improvement

MRR Interpretation

  • 1.0 → Perfect ranking (relevant doc always #1)
  • 0.85+ → Excellent (relevant doc typically #1 or #2)
  • 0.70-0.84 → Good
  • 0.50-0.69 → Acceptable
  • <0.50 → Poor ranking quality

Latency Targets

  • <100ms → Excellent
  • 100-500ms → Good
  • 500ms-1s → Acceptable
  • >1s → Needs optimization

Performance Benchmarks

Embedding Model Comparison (from reference docs)

| Embedding   | Reranker           | Hit Rate | MRR   |
|-------------|--------------------|----------|-------|
| JinaAI Base | bge-reranker-large | 0.938    | 0.869 |
| JinaAI Base | CohereRerank       | 0.933    | 0.874 |
| OpenAI      | CohereRerank       | 0.927    | 0.866 |
| OpenAI      | bge-reranker-large | 0.910    | 0.856 |

Typical Improvements

  • Adding reranking: +5-15% hit rate
  • Hybrid vs vector: +3-8% hit rate
  • Optimal chunk size: +2-5% hit rate
  • Better embeddings: +3-10% hit rate

Scripts

This skill includes utility scripts in the scripts/ directory:

generate_qa_dataset.py

Generate evaluation Q&A pairs from documents:

python .claude/skills/evaluating-rag/scripts/generate_qa_dataset.py \
    --documents-dir ./data \
    --output eval_dataset.json \
    --num-questions-per-chunk 2

compare_retrievers.py

Compare multiple retrieval strategies:

python .claude/skills/evaluating-rag/scripts/compare_retrievers.py \
    --dataset eval_dataset.json \
    --strategies vector,bm25,hybrid \
    --output comparison_results.json

Outputs:

  • Hit rate and MRR for each strategy
  • Performance comparison table
  • Recommendations

run_evaluation.py

Run comprehensive evaluation:

python .claude/skills/evaluating-rag/scripts/run_evaluation.py \
    --retriever-config config.yaml \
    --dataset eval_dataset.json \
    --metrics hit_rate,mrr,faithfulness

Reports:

  • All requested metrics
  • Per-query breakdown
  • Summary statistics

Key Reminders

Dataset Quality:

  • Generate from your actual documents
  • Include diverse query types
  • Filter invalid auto-generated entries
  • Manual review recommended for critical domains

Evaluation Best Practices:

  • Start with baseline (before optimization)
  • Test one change at a time (for clear attribution)
  • Use same dataset for comparisons
  • Check statistical significance before acting; improvements under roughly 5% are often within run-to-run noise (see the sketch below)
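
For the significance point above, a paired bootstrap over per-query hit indicators (hits_baseline and hits_optimized are assumed 0/1 lists for the same queries, e.g. collected from eval_results) gives a rough confidence interval on the hit-rate difference; this is a sketch, not a full statistical treatment:

import random

def bootstrap_hit_rate_diff(hits_a, hits_b, n_boot=2000, seed=0):
    """95% confidence interval for the hit-rate difference (B - A) over the same queries."""
    rng = random.Random(seed)
    n = len(hits_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample queries with replacement
        diffs.append(sum(hits_b[i] for i in idx) / n - sum(hits_a[i] for i in idx) / n)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]

low, high = bootstrap_hit_rate_diff(hits_baseline, hits_optimized)
print(f"Hit-rate delta 95% CI: [{low:+.3f}, {high:+.3f}]")   # an interval excluding 0 suggests a real improvement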

Production Monitoring:

  • Continuous evaluation on sample queries
  • Track trends over time
  • Alert on degradation
  • Regular dataset refresh

For Your Pipelines:

  • src/: Compare all 7 strategies systematically
  • src-iLand/: Test with Thai queries and metadata
  • Both: Establish baselines before optimizations

Next Steps

After evaluation:

  • Optimize: Use optimizing-rag skill to improve low scores
  • Implement: Use implementing-rag skill to rebuild weak components
  • Monitor: Set up continuous evaluation in production
  • Iterate: Regular evaluation → optimization → re-evaluation cycle