---
name: ai-evaluation-suite
description: Comprehensive AI/LLM evaluation toolkit for production AI systems. Covers LLM output quality, prompt engineering, RAG evaluation, agent performance, hallucination detection, bias assessment, cost/token optimization, latency metrics, model comparison, and fine-tuning evaluation. Includes BLEU/ROUGE metrics, perplexity, F1 scores, LLM-as-judge patterns, and benchmarks like MMLU and HumanEval.
allowed-tools: Read, Write, Edit, Bash, Glob, Grep, WebFetch
---
AI Evaluation Suite - Quick Reference
Purpose
Production AI systems require rigorous evaluation beyond traditional software testing. This skill provides comprehensive evaluation capabilities for LLM quality, RAG systems, agents, hallucination detection, bias assessment, cost optimization, and performance metrics.
When to Use This Skill
- Evaluating LLM outputs for quality and correctness
- A/B testing prompt variations
- Measuring RAG system retrieval accuracy
- Detecting hallucinations in generated content
- Assessing model bias and fairness
- Optimizing token usage and costs
- Comparing multiple LLM models
- Evaluating fine-tuned models vs base models
- Running standard benchmarks (MMLU, HumanEval, etc.)
- Implementing LLM-as-judge evaluation patterns
Core Concepts
Evaluation Pyramid
```
┌─────────────┐
│ Human Eval  │  <- Gold standard but expensive
└──────┬──────┘
       │
┌──────┴──────────┐
│ LLM-as-Judge    │  <- Scalable proxy for human judgment
└──────┬──────────┘
       │
┌──────┴──────────┐
│ Reference-Based │  <- BLEU, ROUGE, F1 (needs ground truth)
└──────┬──────────┘
       │
┌──────┴──────────┐
│ Reference-Free  │  <- Perplexity, consistency, coherence
└─────────────────┘
```
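As a small illustration of the reference-based layer, here is a sketch of a SQuAD-style token-overlap F1 against a ground-truth answer (whitespace tokenization is a deliberate simplification):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model prediction and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Overlap counts each shared token at most min(pred, ref) times
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```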
Key Metric Categories
- Quality: Coherence, Relevance, Factuality, Completeness, Conciseness
- Performance: Latency, Throughput, Token Usage, Cost, Error Rate
- Safety: Hallucination Rate, Bias Scores, Toxicity, PII Leakage
- Task-Specific: RAG (Precision/Recall), Agents (Success Rate), Code (Pass@k)
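For the Pass@k code metric, a minimal sketch of the standard unbiased estimator (n generated samples per problem, c of which pass the tests):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 samples, 2 correct: pass@1 = 0.20, pass@5 ≈ 0.78
```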
Documentation Structure
This skill is organized into 6 files:
- SKILL.md (this file) - Quick reference and essential patterns
- KNOWLEDGE.md - Evaluation theory, benchmarks, resources
- PATTERNS.md - 8 evaluation patterns with complete code
- GOTCHAS.md - Common pitfalls, metric limitations, edge cases
- EXAMPLES.md - Complete real-world evaluation scenarios
- REFERENCE.md - Comprehensive metrics reference
Quick Start: LLM Quality Evaluation
```python
import json
from dataclasses import dataclass

import anthropic
import numpy as np


@dataclass
class QualityMetrics:
    coherence: float
    relevance: float
    factuality: float
    completeness: float
    conciseness: float
    overall_score: float


class LLMQualityEvaluator:
    def __init__(self, model="claude-3-5-sonnet-20241022"):
        self.client = anthropic.Anthropic()
        self.evaluator_model = model

    def evaluate(self, query: str, output: str,
                 ground_truth: str = None) -> QualityMetrics:
        """Score output on 5 dimensions using LLM-as-judge."""
        eval_prompt = f"""Evaluate this LLM output (score 0-10 each):
Query: {query}
Output: {output}
{f'Ground Truth: {ground_truth}' if ground_truth else ''}
Provide JSON: {{
  "coherence": X,
  "relevance": X,
  "factuality": X,
  "completeness": X,
  "conciseness": X
}}"""
        response = self.client.messages.create(
            model=self.evaluator_model,
            max_tokens=256,
            messages=[{"role": "user", "content": eval_prompt}]
        )
        scores = json.loads(response.content[0].text)
        return QualityMetrics(
            coherence=scores["coherence"] / 10.0,
            relevance=scores["relevance"] / 10.0,
            factuality=scores["factuality"] / 10.0,
            completeness=scores["completeness"] / 10.0,
            conciseness=scores["conciseness"] / 10.0,
            overall_score=np.mean(list(scores.values())) / 10.0
        )


# Usage
evaluator = LLMQualityEvaluator()
metrics = evaluator.evaluate(
    query="Explain quantum entanglement",
    output="Quantum entanglement occurs when particles..."
)
print(f"Quality Score: {metrics.overall_score:.2f}")
```
Quick Start: Hallucination Detection
```python
import json

import anthropic


class HallucinationDetector:
    def __init__(self, model="claude-3-5-sonnet-20241022"):
        self.client = anthropic.Anthropic()
        self.model = model

    def detect(self, context: str, generated_text: str) -> dict:
        """Compare generated output against the source context."""
        prompt = f"""What percentage of this generated text is supported by the context? (0-100)
Context: {context}
Generated: {generated_text}
Respond with JSON: {{"supported_percentage": X}}"""
        response = self.client.messages.create(
            model=self.model,
            max_tokens=100,
            messages=[{"role": "user", "content": prompt}]
        )
        result = json.loads(response.content[0].text)
        return {
            "hallucination_rate": 1 - (result["supported_percentage"] / 100),
            "supported_percentage": result["supported_percentage"]
        }


detector = HallucinationDetector()
result = detector.detect(
    context="The Eiffel Tower is in Paris and 300m tall",
    generated_text="The Eiffel Tower in Paris is 330m tall"
)
print(f"Hallucination Rate: {result['hallucination_rate']:.1%}")
```
Quick Start: RAG Evaluation
```python
@dataclass
class RAGMetrics:
    retrieval_precision: float
    retrieval_recall: float
    overall_score: float


class RAGEvaluator:
    def evaluate_retrieval(self, query: str, retrieved: list,
                           relevant_ids: list) -> dict:
        """Score retrieval: precision, recall, F1, MRR.

        `retrieved` is the ranked list of document IDs returned by the
        retriever; `relevant_ids` is the ground-truth set of relevant IDs.
        """
        retrieved_set = set(retrieved)
        relevant_set = set(relevant_ids)
        hits = len(retrieved_set & relevant_set)
        precision = hits / len(retrieved_set) if retrieved_set else 0.0
        recall = hits / len(relevant_set) if relevant_set else 0.0

        # Mean Reciprocal Rank: 1 / rank of the first relevant document
        mrr = 0.0
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant_set:
                mrr = 1 / rank
                break

        return {
            "precision": precision,
            "recall": recall,
            "f1": 2 * (precision * recall) / (precision + recall) if (precision + recall) else 0,
            "mrr": mrr
        }
```
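A usage sketch (the document IDs below are illustrative):

```python
rag_eval = RAGEvaluator()
scores = rag_eval.evaluate_retrieval(
    query="What is the capital of France?",
    retrieved=["doc_3", "doc_7", "doc_1"],  # ranked IDs returned by the retriever
    relevant_ids=["doc_3", "doc_9"]         # ground-truth relevant IDs
)
print(f"P={scores['precision']:.2f} R={scores['recall']:.2f} MRR={scores['mrr']:.2f}")
```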
Quick Start: Prompt Engineering A/B Test
```python
import numpy as np

# Assumes a module-level `client = anthropic.Anthropic()` and an
# `evaluate_output(text) -> float` scoring function (e.g. built on the
# LLMQualityEvaluator above).
def compare_prompts(variants: list, test_cases: list) -> dict:
    """Compare prompt variants on a shared test set."""
    results = {}
    for variant in variants:
        scores = []
        for case in test_cases:
            prompt = variant["template"].format(**case)
            response = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}]
            )
            score = evaluate_output(response.content[0].text)
            scores.append(score)
        results[variant["id"]] = {
            "mean_score": np.mean(scores),
            "std": np.std(scores)
        }
    return results
```
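A usage sketch; the variant templates and test case below are illustrative:

```python
variants = [
    {"id": "baseline", "template": "Summarize: {text}"},
    {"id": "structured", "template": "Summarize the following in 3 bullet points:\n{text}"},
]
test_cases = [{"text": "Quantum entanglement links the measured states of two particles..."}]

results = compare_prompts(variants, test_cases)
for variant_id, stats in results.items():
    print(f"{variant_id}: {stats['mean_score']:.2f} ± {stats['std']:.2f}")
```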
Quick Start: Cost Optimization
```python
# USD per million tokens (input / output)
PRICING = {
    "claude-3-5-sonnet-20241022": {"input": 3.0, "output": 15.0},
    "claude-3-haiku-20240307": {"input": 0.25, "output": 1.25}
}


def calculate_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    pricing = PRICING.get(model)
    return ((input_tokens / 1_000_000) * pricing["input"] +
            (output_tokens / 1_000_000) * pricing["output"])


# Compare models on a single prompt (assumes `client` from the earlier examples)
for model in ["claude-3-5-sonnet-20241022", "claude-3-haiku-20240307"]:
    response = client.messages.create(model=model, max_tokens=256, ...)  # messages elided
    cost = calculate_cost(response.usage.input_tokens,
                          response.usage.output_tokens, model)
    print(f"{model}: ${cost:.4f}")
```
Best Practices
DO's
- Use multiple metrics - combine reference-based, reference-free, LLM-as-judge
- Establish baselines - track metrics over time
- Test diverse data - cover edge cases, domains, lengths
- Automate evaluation - integrate into CI/CD
- Calibrate judges - validate LLM judges against humans
- Monitor production - continuous evaluation on real queries
- Version everything - track prompts, models, datasets
- Set thresholds - define acceptable ranges (see the CI gate sketch after these lists)
- Use held-out data - never evaluate on training data
- Document methodology - for reproducibility
DON'Ts
- Don't trust single metrics - BLEU/ROUGE alone insufficient
- Don't ignore edge cases - rare inputs reveal issues
- Don't overfit to benchmarks - high MMLU ≠ production-ready
- Don't skip human eval - periodically validate
- Don't forget cost - quality/cost tradeoff critical
- Don't ignore latency - response time matters for UX
- Don't test only happy paths - test failures, adversarial inputs
- Don't use stale benchmarks - models may have seen test data
- Don't evaluate in isolation - consider entire system
- Don't forget fairness - evaluate across demographic groups
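A minimal CI gate sketch for the "automate evaluation" and "set thresholds" points above; the threshold value and the shape of `eval_cases` are assumptions to adapt to your own baselines:

```python
MIN_QUALITY = 0.75  # illustrative threshold; calibrate against baseline runs

def run_eval_gate(eval_cases: list) -> None:
    """Fail the CI job if mean LLM-as-judge quality drops below the threshold."""
    evaluator = LLMQualityEvaluator()  # from the quick start above
    scores = [evaluator.evaluate(case["query"], case["output"]).overall_score
              for case in eval_cases]
    mean_score = sum(scores) / len(scores)
    assert mean_score >= MIN_QUALITY, (
        f"Quality regression: mean score {mean_score:.2f} < {MIN_QUALITY}")
```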
8 Core Evaluation Patterns
See PATTERNS.md for complete implementations:
- LLM Output Quality - Coherence, relevance, factuality, completeness, conciseness
- Prompt Engineering - A/B test variants with cost/quality tradeoff
- RAG Systems - Retrieval + context + generation evaluation
- Hallucination Detection - Fact-checking, consistency, claim verification
- Bias & Fairness - Demographic parity, sentiment disparity metrics
- Cost Optimization - Token usage analysis, model comparison, optimization
- Performance Metrics - Latency (p50, p95, p99), throughput, load testing (see the latency sketch after this list)
- Benchmarks - MMLU, HumanEval, GSM8K benchmark implementations
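A minimal latency-percentile sketch for the performance pattern above; `call_fn` stands in for any callable that issues one model request:

```python
import time
import numpy as np

def measure_latency(call_fn, prompts: list, runs_per_prompt: int = 3) -> dict:
    """Collect wall-clock latencies and report p50/p95/p99 in seconds."""
    latencies = []
    for prompt in prompts:
        for _ in range(runs_per_prompt):
            start = time.perf_counter()
            call_fn(prompt)
            latencies.append(time.perf_counter() - start)
    return {p: float(np.percentile(latencies, q))
            for p, q in [("p50", 50), ("p95", 95), ("p99", 99)]}
```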
Common Pitfalls
See GOTCHAS.md for detailed coverage:
- Benchmark overfitting and data contamination
- Metric-human judgment mismatch
- Hallucination detection challenges
- Bias and fairness assessment pitfalls
- Evaluation dataset size limitations
- Prompt sensitivity and context window issues
Metrics Reference
See REFERENCE.md for:
Text Metrics: BLEU, ROUGE-1/2/L, BERTScore, METEOR, perplexity, MAUVE
Ranking Metrics: Precision, Recall, F1, MRR, nDCG
LLM-as-Judge: G-Eval, Prometheus, AlpacaEval patterns
Task-Specific: Pass@k (code), Exact Match (QA), SacreBLEU (translation)
Knowledge Resources
See KNOWLEDGE.md for:
- Evaluation frameworks (HELM, LangSmith, Ragas, OpenAI Evals)
- Benchmarks (MMLU, HumanEval, MBPP, TruthfulQA, BBH, GSM8K, BEIR)
- LLM-as-judge resources and patterns
- Bias/fairness datasets and frameworks
- Production monitoring platforms
- Tool recommendations and libraries
Complete Examples
See EXAMPLES.md for:
- Building evaluation datasets from scratch
- Complete end-to-end evaluation workflows
- Production integration (CI/CD pipelines)
- Monitoring dashboards and reporting
- Cost tracking and analysis
Related Skills
- codebase-onboarding-analyzer - Analyze AI-generated code quality
- gap-analysis-framework - Identify evaluation coverage gaps
- evaluation-reporting-framework - Generate detailed evaluation reports
- orchestration-coordination-framework - Coordinate complex evaluation workflows
- security-scanning-suite - Security evaluation for AI systems
Key Concepts at a Glance
Evaluation Approaches
- Reference-based: BLEU, ROUGE (need ground truth)
- Reference-free: Perplexity, consistency (no ground truth)
- LLM-as-judge: Scalable, flexible, but needs calibration
- Human eval: Gold standard, expensive, slow
Common Metrics
- Quality: Coherence, Relevance, Factuality, Completeness, Conciseness
- Performance: Latency (p50/p95/p99), Throughput (req/sec), Token count
- Safety: Hallucination rate, Bias score, Toxicity, PII leakage
Key Benchmarks
- MMLU: 57 subjects of general knowledge, 15K questions
- HumanEval: 164 coding problems, Pass@k metric
- GSM8K: 8.5K grade school math problems
- TruthfulQA: 817 questions testing truthfulness
- HellaSwag: 70K commonsense reasoning examples
Critical Pitfalls
- Benchmark overfitting (high score ≠ production-ready)
- Metric mismatch (BLEU/ROUGE often correlate poorly with human judgment on open-ended generation)
- Hallucination blindness (LLMs sound confident when wrong)
- Prompt sensitivity (tiny changes cause large variance)
- Distribution shift (eval data differs from production)
Best Practice Workflow
- Define evaluation dimensions (quality, performance, safety)
- Establish baselines on existing systems
- Create diverse evaluation datasets
- Use multiple complementary metrics
- Automate evaluation in CI/CD
- Monitor production continuously
- Regular human eval for calibration
- Document methodology for reproducibility