| name | implementing-rag |
| description | Set up RAG pipelines with document chunking, embedding generation, and retrieval strategies using LlamaIndex. Use when building new RAG systems, choosing chunking approaches, selecting embedding models, or implementing vector/hybrid retrieval for src/ or src-iLand/ pipelines. |
Implementing RAG Systems
Quick-start guide for building production-grade RAG pipelines with LlamaIndex. This skill helps you set up the foundational components: document processing, embedding generation, and retrieval.
When to Use This Skill
- Building a new RAG pipeline from scratch
- Choosing optimal chunking strategy for your documents
- Selecting and configuring embedding models
- Implementing retrieval (vector, BM25, or hybrid search)
- Optimizing for specific use cases (Thai language, legal documents, product catalogs)
- Integrating with existing pipelines in src/ or src-iLand/
Quick Decision Guide
Chunking Strategy
- Legal/Technical docs (like Thai land deeds) → 512-1024 tokens
- Narrative content → 1024-2048 tokens
- Q&A pairs → 256-512 tokens
- Rule of thumb: when halving chunk size, double similarity_top_k
Embedding Model Selection
- Cost-sensitive → HuggingFace bge-base-en-v1.5 with ONNX (3-7x faster, free)
- Quality-first → OpenAI text-embedding-3-small or JinaAI-base
- Multi-language (Thai) → Cohere embed-multilingual-v3.0
- Balanced → OpenAI or bge-large (open-source)
Retrieval Strategy
- Simple queries → Vector search
- Keyword-heavy → BM25
- Production systems → Hybrid (vector + BM25)
- Structured data → Metadata filtering
- Large doc sets (100+) → Document summary or recursive retrieval
Quick Start Patterns
Pattern 1: Standard RAG Setup (Vector Search)
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
# Configure chunking
Settings.chunk_size = 1024
Settings.chunk_overlap = 20
# Configure embeddings
Settings.embed_model = OpenAIEmbedding(
model="text-embedding-3-small",
embed_batch_size=100
)
# Load and index documents
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
# Create query engine
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("Your question here")
Pattern 2: Multi-Language Setup (Thai Support)
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.core.node_parser import SentenceSplitter
# Use multilingual embeddings
embed_model = CohereEmbedding(
model_name="embed-multilingual-v3.0",
api_key="YOUR_COHERE_API_KEY"
)
# Configure chunking for Thai text
node_parser = SentenceSplitter(
chunk_size=1024,
chunk_overlap=50
)
# Load Thai documents
documents = SimpleDirectoryReader("./data").load_data()
# Create index with multilingual support
index = VectorStoreIndex.from_documents(
documents,
embed_model=embed_model,
transformations=[node_parser]
)
query_engine = index.as_query_engine(similarity_top_k=5)
Pattern 3: Hybrid Search (Production-Ready)
from llama_index.core import VectorStoreIndex
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever
import Stemmer  # from the PyStemmer package (pip install PyStemmer llama-index-retrievers-bm25)
# Create vector index
index = VectorStoreIndex.from_documents(documents)
vector_retriever = index.as_retriever(similarity_top_k=10)
# Create BM25 retriever
bm25_retriever = BM25Retriever.from_defaults(
docstore=index.docstore,
similarity_top_k=10,
stemmer=Stemmer.Stemmer("english")
)
# Combine with query fusion
hybrid_retriever = QueryFusionRetriever(
[vector_retriever, bm25_retriever],
similarity_top_k=5,
mode="reciprocal_rerank",
use_async=True
)
# Use in query engine
from llama_index.core.query_engine import RetrieverQueryEngine
query_engine = RetrieverQueryEngine.from_args(hybrid_retriever)
Pattern 4: Cost-Optimized Local Embeddings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
# Use local model with ONNX acceleration (3-7x faster)
Settings.embed_model = HuggingFaceEmbedding(
model_name="BAAI/bge-base-en-v1.5",
backend="onnx" # Requires: pip install optimum[onnxruntime]
)
# Rest of the pipeline remains the same
index = VectorStoreIndex.from_documents(documents)
Your Codebase Integration
For src/ Pipeline
Document Preprocessing (src/02_prep_doc_for_embedding.py):
- Add configurable chunking strategy
- Support multiple node parser types
Batch Embeddings (src/09_enhanced_batch_embeddings.py):
- Increase embed_batch_size to 100 for faster processing
- Consider local models to eliminate API costs
Basic Retrieval (src/10_basic_query_engine.py):
- Baseline vector search implementation
- Adjust similarity_top_k based on chunk size (see the sketch below)
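A minimal sketch of how configurable chunking and the chunk-size/top_k pairing could look; the strategy names, default sizes, and the derive_top_k helper are illustrative assumptions, not existing code in src/.

from llama_index.core.node_parser import SentenceSplitter, TokenTextSplitter

# Hypothetical strategy registry -- names and defaults are assumptions, tune for your corpus
CHUNKING_STRATEGIES = {
    "legal": SentenceSplitter(chunk_size=512, chunk_overlap=50),
    "narrative": SentenceSplitter(chunk_size=1536, chunk_overlap=100),
    "qa": TokenTextSplitter(chunk_size=256, chunk_overlap=20),
}

def derive_top_k(chunk_size: int, baseline_chunk: int = 1024, baseline_top_k: int = 5) -> int:
    # Rule of thumb from this guide: halving chunk size -> double similarity_top_k
    return max(1, round(baseline_top_k * baseline_chunk / chunk_size))

node_parser = CHUNKING_STRATEGIES["legal"]
top_k = derive_top_k(node_parser.chunk_size)  # 512 tokens -> top_k of 10

The chosen parser can then be passed to the index via transformations=[node_parser] as in Pattern 2, and top_k to as_query_engine(similarity_top_k=top_k).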
For src-iLand/ Pipeline (Thai Land Deeds)
Data Processing (src-iLand/data_processing/):
- Current chunk size: 1024 tokens (good for legal documents)
- Consider testing 512 tokens with an increased top_k for better precision
Embeddings (src-iLand/docs_embedding/):
- Current: OpenAI text-embedding-3-small
- Test: Cohere embed-multilingual-v3.0 for better Thai understanding
Retrieval (src-iLand/retrieval/retrievers/):
- Vector strategy: Baseline semantic search
- Hybrid strategy: Combines vector + BM25 for Thai queries
- Metadata strategy: Fast filtering (<50ms) on Thai geographic/legal metadata
Detailed References
Load these reference files when you need comprehensive details:
reference-chunking.md: Complete chunking strategies guide
- Node parser types and configurations
- Chunk size optimization for different domains
- Sentence-window and hierarchical patterns
- Metadata preservation
reference-embeddings.md: Embedding model selection and optimization
- All supported models (OpenAI, Cohere, HuggingFace, JinaAI, Voyage)
- Performance benchmarks (hit rate, MRR)
- ONNX/OpenVINO optimization details
- Multi-language support
- Cost optimization strategies
reference-retrieval-basics.md: Core retrieval patterns
- Vector retrieval implementation
- BM25 keyword search
- Hybrid search (vector + BM25)
- Metadata filtering
- Configuration parameters and tuning
Common Workflows
Workflow 1: New RAG Pipeline Setup
Step 1: Choose chunking strategy based on document type
- Load reference-chunking.md for detailed guidance
- Test with sample documents
Step 2: Select embedding model
- Load reference-embeddings.md for model comparison
- Consider: cost, latency, language support
- For Thai: use multilingual models
Step 3: Implement basic vector retrieval
- Start with simple VectorStoreIndex
- Test with sample queries
- Measure baseline performance
Step 4: Optimize chunk size and top_k
- Adjust based on query results
- Balance precision vs recall (see the sketch below)
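One rough way to run this tuning step, assuming documents are already loaded as in Pattern 1; the candidate sizes, the sample_queries placeholder, and the top_k pairing rule are assumptions to adapt to your data.

from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

sample_queries = ["..."]  # replace with representative queries for your corpus

for chunk_size in (256, 512, 1024):
    splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=50)
    index = VectorStoreIndex.from_documents(documents, transformations=[splitter])
    # Smaller chunks -> retrieve more of them (rule of thumb from this guide)
    retriever = index.as_retriever(similarity_top_k=5 * 1024 // chunk_size)
    for query in sample_queries:
        nodes = retriever.retrieve(query)
        print(chunk_size, query, [n.score for n in nodes])

Inspect the retrieved scores and snippets manually, or feed them into the evaluating-rag skill's metrics.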
Step 5: (Optional) Upgrade to hybrid search
- Combine vector + BM25 for production quality
Workflow 2: Migrating from API to Local Embeddings
Step 1: Choose local model (see reference-embeddings.md)
- Recommended: BAAI/bge-base-en-v1.5
Step 2: Install optimization backend
pip install optimum[onnxruntime]
Step 3: Update embedding configuration
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-base-en-v1.5",
    backend="onnx"
)
Step 4: Re-index all documents
- CRITICAL: Must re-embed with new model
- Use caching to avoid redundant work (see the sketch below)
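A minimal sketch of the re-index with ingestion-pipeline caching, assuming the local model from Step 3 and a cache directory of your choosing; the chunking settings should mirror your production config.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline, IngestionCache
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5", backend="onnx")
pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(chunk_size=1024, chunk_overlap=20), embed_model],
    cache=IngestionCache(),
)
documents = SimpleDirectoryReader("./data").load_data()
nodes = pipeline.run(documents=documents)  # re-embeds everything with the new model
pipeline.persist("./pipeline_cache")       # cached transformations are reused on later runs
index = VectorStoreIndex(nodes, embed_model=embed_model)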
Step 5: Validate retrieval quality
- Compare results with API embeddings
- Measure latency improvement (3-7x speedup expected; see the timing sketch below)
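A rough timing comparison along these lines can quantify the speedup; the 50-document sample and model choices are placeholders.

import time
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

texts = [doc.text for doc in documents[:50]]  # small representative sample

for name, model in [
    ("openai-api", OpenAIEmbedding(model="text-embedding-3-small")),
    ("local-onnx", HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5", backend="onnx")),
]:
    start = time.perf_counter()
    model.get_text_embedding_batch(texts)
    print(f"{name}: {time.perf_counter() - start:.2f}s for {len(texts)} chunks")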
Workflow 3: Thai Language RAG Setup
Step 1: Use multilingual embedding model
embed_model = CohereEmbedding(
    model_name="embed-multilingual-v3.0"
)
Step 2: Configure appropriate chunking
- Test 512-1024 tokens for Thai text
- Preserve Thai metadata
Step 3: Implement hybrid search
- Vector for semantic understanding
- BM25 for exact Thai keyword matching
Step 4: Add metadata filtering
- จังหวัด (province), อำเภอ (district), ประเภท (type)
- Enable fast filtering (<50ms); see the sketch below
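A minimal sketch of metadata filtering, assuming chunks were indexed with metadata keys such as province and deed_type; the key names and values here are illustrative, not the actual src-iLand schema.

from llama_index.core.vector_stores import MetadataFilter, MetadataFilters

filters = MetadataFilters(
    filters=[
        MetadataFilter(key="province", value="เชียงใหม่"),    # จังหวัด
        MetadataFilter(key="deed_type", value="โฉนดที่ดิน"),  # ประเภทเอกสาร
    ]
)
retriever = index.as_retriever(similarity_top_k=5, filters=filters)
nodes = retriever.retrieve("ที่ดินในอำเภอเมือง")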
Step 5: Test with Thai queries
- Validate Unicode handling
- Check retrieval relevance (see the sketch below)
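A small smoke test along these lines catches encoding and relevance problems early; the query text is only an example.

query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("โฉนดที่ดินในจังหวัดเชียงใหม่มีเงื่อนไขการโอนอย่างไร")
print(response)  # the answer should render Thai text correctly
for node in response.source_nodes:  # inspect what was actually retrieved
    print(node.score, node.node.get_content()[:100])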
Key Reminders
Critical Requirements:
- Re-indexing: When changing embedding models, MUST re-index all data
- Model consistency: Identical models for indexing and querying
- Chunk size + top_k: halving chunk size → double similarity_top_k
Best Practices:
- Start simple (vector search) → Add complexity as needed (hybrid, metadata)
- Test with representative queries before full indexing
- Use local models for cost optimization at scale
- Enable parallel loading (num_workers=10) for up to 13x speedup
Performance Tips:
- Batch size: Set embed_batch_size=100 for API calls
- Parallel loading: Use SimpleDirectoryReader(...).load_data(num_workers=10) (see the sketch below)
- Caching: Enable ingestion pipeline caching for faster re-runs
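The first two tips combined in a short sketch; the directory path and worker count are placeholders to adjust for your environment.

from llama_index.core import SimpleDirectoryReader, Settings
from llama_index.embeddings.openai import OpenAIEmbedding

# Parallel document loading across worker processes
documents = SimpleDirectoryReader("./data").load_data(num_workers=10)

# Larger embedding batches mean fewer API round-trips
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    embed_batch_size=100,
)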
Next Steps
After implementing basic RAG:
- Optimize: Use the optimizing-rag skill for reranking, caching, and production deployment
- Evaluate: Use the evaluating-rag skill to measure hit rate, MRR, and compare strategies