| name | ai-llm-rag-engineering |
| description | Operational patterns for RAG systems (recent advances): page-level chunking (0.648 accuracy), hybrid retrieval with cross-encoder reranking, adaptive/multimodal/self-correcting systems, recall@k/nDCG evaluation, groundedness metrics, real-time quality tracking. Emphasizes the modern shift to dynamic, intelligent retrieval beyond static RAG. |
RAG Engineering – Quick Reference
This skill provides practical, production-grade RAG design patterns with recent advances:
- Chunking strategies: Page-level chunking (0.648 accuracy, highest in NVIDIA benchmarks)
- Contextual Retrieval: Anthropic's 2024 technique (up to 67% fewer retrieval failures when combined with reranking, made affordable by prompt caching)
- Hybrid retrieval: Lexical (BM25) + vector + cross-encoder reranking
- Reranking: Cross-encoder (ms-marco-TinyBERT-L-2-v2, 4.3M params, outperforms larger models)
- RAG evaluation: Recall@K, Precision@K, nDCG, groundedness, verbosity, instruction following
- Modern paradigm shift: Adaptive, multimodal, self-correcting systems replacing static retrieve-then-generate pipelines
Key Insights:
- Page-level chunking achieved highest accuracy (0.648) with lowest variance
- Contextual Retrieval reduces retrieval failures by 67% when combined with reranking
- Semantic chunking improves recall by up to 9% over simpler methods
- Hybrid retrieval + reranking consistently outperforms either method alone
- Static RAG is giving way to adaptive, self-correcting retrieval as the mainstream approach
It focuses on doing, not explaining theory.
Scope note: Retrieval algorithm tuning (BM25/HNSW/hybrid, query rewriting) lives in ai-llm-search-retrieval; this skill covers RAG-specific packaging, context injection, and grounded generation.
Quick Reference
| Task | Tool/Framework | Command/Pattern | When to Use |
|---|---|---|---|
| Chunking | Page-level, Semantic | RecursiveCharacterTextSplitter (400-512 tokens; sketch below) | 0.648 accuracy, 85-90% recall |
| Contextual Retrieval | Anthropic Claude | Generate chunk context + prompt caching | 67% fewer failures (with reranking), ~$1.02/M tokens |
| Hybrid Retrieval | BM25 + Vector | LlamaIndex, LangChain, Haystack | Better relevance than either method alone (modern standard) |
| Reranking | Cross-encoder | ms-marco-TinyBERT-L-2-v2 (4.3M params) | Strong accuracy gains at <100ms latency |
| Vector Index | HNSW, IVF | FAISS, Pinecone, Qdrant, Weaviate | <10M: HNSW, >10M: IVF/ScaNN |
| Evaluation | RAGAS, TruLens | Recall@K, nDCG, groundedness metrics | Quality validation, A/B testing |
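The chunking row above can be sketched in a few lines. A minimal example, assuming `langchain-text-splitters` and `tiktoken` are installed; the 450/50 sizes are illustrative values inside the 400-512 token band, not prescribed by the benchmarks:

```python
# Minimal sketch: recursive chunking sized by tokens rather than characters.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # token counting only; pick to match your embedder
    chunk_size=450,               # target tokens per chunk (400-512 band)
    chunk_overlap=50,             # overlap preserves context across boundaries
)

document_text = open("doc.txt").read()  # stand-in for your ingestion step
chunks = splitter.split_text(document_text)
```

For page-structured documents, split on page boundaries first and only apply the recursive splitter to pages that exceed the token budget.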
Decision Tree: RAG Architecture Selection
Building RAG system: [Architecture Path]
├─ Document type?
│ ├─ Page-structured? → Page-level chunking (0.648 accuracy, lowest variance)
│ ├─ Technical docs? → Semantic chunking (9% recall improvement)
│ └─ Simple content? → RecursiveCharacterTextSplitter (400-512 tokens, 85-90% recall)
│
├─ Retrieval accuracy low?
│ ├─ Multi-entity docs? → Contextual Retrieval (67% failure reduction)
│ ├─ Noisy results? → Cross-encoder reranking (TinyBERT, <100ms)
│ └─ Mixed queries? → Hybrid retrieval (BM25 + vector + reranking)
│
├─ Dataset size?
│ ├─ <100k chunks? → Flat index (exact search)
│ ├─ 100k-10M? → HNSW (low latency)
│ └─ >10M? → IVF/ScaNN/DiskANN (scalable)
│
└─ Production quality?
└─ Full pipeline: Page-level + Contextual + Hybrid + Reranking → best end-to-end accuracy (sketch below)
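A hedged sketch of the hybrid + reranking leg of that full pipeline, assuming `rank_bm25` and `sentence-transformers` are installed; the model names, candidate count, and the 0.5 blend weight are illustrative choices to tune, not part of the skill:

```python
# Hybrid retrieval (BM25 + dense) with cross-encoder reranking.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder, util

chunks = ["..."]  # output of the chunking step above

bm25 = BM25Okapi([c.split() for c in chunks])
embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_emb = embedder.encode(chunks, convert_to_tensor=True)
reranker = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2-v2")

def hybrid_search(query: str, k: int = 20, alpha: float = 0.5) -> list[int]:
    # Min-max normalize lexical and dense scores, then blend.
    lex = bm25.get_scores(query.split())
    dense = util.cos_sim(embedder.encode(query, convert_to_tensor=True),
                         chunk_emb)[0].cpu().numpy()
    norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-9)
    blended = alpha * norm(lex) + (1 - alpha) * norm(dense)
    return sorted(range(len(chunks)), key=lambda i: blended[i], reverse=True)[:k]

def retrieve(query: str, k_final: int = 5) -> list[str]:
    # Rerank hybrid candidates with the cross-encoder, keep the top few.
    candidates = hybrid_search(query)
    scores = reranker.predict([(query, chunks[i]) for i in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [chunks[i] for i, _ in ranked[:k_final]]
```

Retrieve a generous candidate set (k≈20-50) from the cheap hybrid stage, then let the cross-encoder pay its per-pair cost only on those candidates.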
When to Use This Skill
Claude should invoke this skill when the user asks:
- "Help me design a RAG pipeline."
- "How should I chunk this document?"
- "Optimize retrieval for my use case."
- "My RAG system is hallucinating — fix it."
- "Choose the right vector database / index type."
- "Create a RAG evaluation framework."
- "Debug why retrieval gives irrelevant results."
Related Skills
For adjacent topics, reference these skills:
- ai-llm-development - Prompting, fine-tuning, instruction datasets
- ai-llm-engineering - Agentic workflows, multi-agent systems, LLM orchestration
- ai-llm-search-retrieval - BM25, hybrid search, ranking pipelines (complements RAG retrieval)
- ai-llm-ops-inference - Serving performance, quantization, batching
- ai-ml-ops-security - Security, privacy, PII handling
- ai-ml-ops-production - Deployment, monitoring, data pipelines
- ai-prompt-engineering - Prompt patterns for RAG generation phase
Detailed Guides
Core RAG Architecture
- Pipeline Architecture - End-to-end RAG pipeline structure, ingestion, freshness, index hygiene, embedding selection
- Chunking Strategies - Modern benchmarks (page-level 0.648 accuracy, semantic, RecursiveCharacterTextSplitter 400-512)
- Index Selection Guide - Vector database configuration, HNSW/IVF/Flat selection, parameter tuning (FAISS sketch below)
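As a concrete instance of the HNSW branch, here is an illustrative FAISS setup; the dimension and `ef` parameters are assumptions to tune against your own recall/latency targets:

```python
# Illustrative FAISS HNSW index for the 100k-10M vector range.
# Assumes faiss-cpu and numpy; dim must match your embedding model.
import faiss
import numpy as np

dim = 384                               # e.g. all-MiniLM-L6-v2 output size
index = faiss.IndexHNSWFlat(dim, 32)    # M=32 links per node (memory vs. recall)
index.hnsw.efConstruction = 200         # build-time graph quality
index.hnsw.efSearch = 64                # query-time recall/latency knob

embeddings = np.random.rand(100_000, dim).astype("float32")  # stand-in data
index.add(embeddings)

distances, ids = index.search(embeddings[:1], k=10)  # top-10 nearest chunks
```

Below ~100k vectors, `faiss.IndexFlatIP` (exact search) is usually fast enough and removes the recall trade-off entirely.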
Advanced Retrieval Techniques
- Retrieval Patterns - Dense retrieval, hybrid search, query preprocessing, reranking workflow, metadata filtering
- Contextual Retrieval Guide - Anthropic's 2024 technique (67% fewer retrieval failures with reranking), prompt caching, implementation (sketch after this list)
- Grounding Checklists - Context compression, hallucination control, citation patterns, answerability validation
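A hedged sketch of the Contextual Retrieval preprocessing step referenced above, assuming the `anthropic` Python SDK and an `ANTHROPIC_API_KEY` in the environment; the model alias and prompt wording are illustrative, and the `cache_control` block is what keeps the per-chunk cost near the ~$1.02/M document tokens figure:

```python
# Sketch: prepend LLM-generated context to each chunk before embedding.
import anthropic

client = anthropic.Anthropic()

def contextualize(document: str, chunk: str) -> str:
    response = client.messages.create(
        model="claude-3-5-haiku-latest",   # illustrative model choice
        max_tokens=120,
        # Caching the full document means subsequent calls for other chunks
        # only pay full price for the chunk + instructions.
        system=[{
            "type": "text",
            "text": document,
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{
            "role": "user",
            "content": "Write one or two sentences situating this chunk "
                       f"within the document, for search retrieval:\n\n{chunk}",
        }],
    )
    # Embed and index the contextualized text, not the bare chunk.
    return response.content[0].text + "\n\n" + chunk
```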
Production & Evaluation
- RAG Evaluation Guide - Recall@K, nDCG, groundedness, RAGAS/TruLens, A/B testing, sliced evaluation (metric sketch after this list)
- Advanced RAG Patterns - Graph/multimodal RAG, online evaluation, telemetry, shadow/canary testing, adaptive retrieval
- RAG Troubleshooting - Failure mode triage, debugging irrelevant results, hallucination fixes
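Before reaching for RAGAS or TruLens, the retrieval-side metrics are easy to compute directly. A plain-Python sketch, assuming binary relevance labels in your test set:

```python
# Recall@K and nDCG@K over a labeled test set (binary relevance).
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = len(set(retrieved[:k]) & relevant)
    return hits / max(len(relevant), 1)

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    dcg = sum(1 / math.log2(i + 2)
              for i, doc_id in enumerate(retrieved[:k]) if doc_id in relevant)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

# Example: 2 of 3 relevant chunks retrieved in the top 5.
print(recall_at_k(["a", "x", "b", "y", "z"], {"a", "b", "c"}, k=5))  # ~0.667
print(ndcg_at_k(["a", "x", "b", "y", "z"], {"a", "b", "c"}, k=5))
```

Groundedness and verbosity need an LLM judge or framework support; recall and nDCG should be tracked per query slice so regressions in one document type are not averaged away.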
Existing Detailed Patterns
- Chunking Patterns - Technical implementation details for all chunking approaches
- Retrieval Patterns - Low-level retrieval implementation patterns
Templates
- Chunking & Ingestion - templates/chunking/ (basic, code, long-document chunking)
- Embedding & Indexing - templates/indexing/ (index config, metadata schema)
- Retrieval & Reranking - templates/retrieval/ (retrieval pipeline, hybrid search, reranking)
- Context Packaging & Grounding - templates/context/ (context packing, grounding)
- Evaluation - templates/eval/ (RAG eval, test set)
Navigation
Resources
- resources/rag-evaluation-guide.md
- resources/rag-troubleshooting.md
- resources/contextual-retrieval-guide.md
- resources/pipeline-architecture.md
- resources/advanced-rag-patterns.md
- resources/chunking-strategies.md
- resources/grounding-checklists.md
- resources/index-selection-guide.md
- resources/retrieval-patterns.md
- resources/chunking-patterns.md
Templates
- templates/context/template-context-packing.md
- templates/context/template-grounding.md
- templates/chunking/template-basic-chunking.md
- templates/chunking/template-code-chunking.md
- templates/chunking/template-long-doc-chunking.md
- templates/retrieval/template-retrieval-pipeline.md
- templates/retrieval/template-hybrid-search.md
- templates/retrieval/template-reranking.md
- templates/eval/template-rag-eval.md
- templates/eval/template-rag-testset.jsonl
- templates/indexing/template-index-config.md
- templates/indexing/template-metadata-schema.md
Data
- data/sources.json — Curated external references
External Resources
See data/sources.json for:
- Embedding models (OpenAI, Cohere, Sentence Transformers, Voyage AI, Jina)
- Vector DBs (FAISS, Pinecone, Qdrant, Weaviate, Milvus, Chroma, pgvector, LanceDB)
- Hybrid search libraries (Elasticsearch, OpenSearch, Typesense, Meilisearch)
- Reranking models (Cohere Rerank, Jina Reranker, RankGPT, Flashrank)
- Evaluation frameworks (RAGAS, TruLens, DeepEval, BEIR)
- RAG frameworks (LlamaIndex, LangChain, Haystack, txtai)
- Advanced techniques (RAG Fusion, CRAG, Self-RAG, Contextual Retrieval)
- Production platforms (Vectara, AWS Kendra)
Use this skill whenever the user needs retrieval-augmented system design or debugging; for prompt work or deployment, defer to the related skills above.