Claude Code Plugins

Community-maintained marketplace

Set up RAG pipelines with document chunking, embedding generation, and retrieval strategies using LlamaIndex. Use when building new RAG systems, choosing chunking approaches, selecting embedding models, or implementing vector/hybrid retrieval for src/ or src-iLand/ pipelines.

Install Skill

1. Download skill
2. Enable skills in Claude

Open claude.ai/settings/capabilities and find the "Skills" section

3. Upload to Claude

Click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reviewing its instructions before using it.

SKILL.md

name: implementing-rag
description: Set up RAG pipelines with document chunking, embedding generation, and retrieval strategies using LlamaIndex. Use when building new RAG systems, choosing chunking approaches, selecting embedding models, or implementing vector/hybrid retrieval for src/ or src-iLand/ pipelines.

Implementing RAG Systems

Quick-start guide for building production-grade RAG pipelines with LlamaIndex. This skill helps you set up the foundational components: document processing, embedding generation, and retrieval.

When to Use This Skill

  • Building a new RAG pipeline from scratch
  • Choosing optimal chunking strategy for your documents
  • Selecting and configuring embedding models
  • Implementing retrieval (vector, BM25, or hybrid search)
  • Optimizing for specific use cases (Thai language, legal documents, product catalogs)
  • Integrating with existing pipelines in src/ or src-iLand/

Quick Decision Guide

Chunking Strategy

  • Legal/Technical docs (like Thai land deeds) → 512-1024 tokens
  • Narrative content → 1024-2048 tokens
  • Q&A pairs → 256-512 tokens
  • Rule of thumb: When halving chunk size, double similarity_top_k
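
A minimal sketch of this rule of thumb, assuming documents are loaded and Settings/VectorStoreIndex are imported as in Pattern 1 below (the exact values are illustrative, not tuned):

# Chunk size must be set before the index is built; pick top_k to keep a similar context budget.
#   1024-token chunks -> similarity_top_k=5
#    512-token chunks -> similarity_top_k=10
Settings.chunk_size = 512
Settings.chunk_overlap = 50
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=10)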

Embedding Model Selection

  • Cost-sensitive → HuggingFace bge-base-en-v1.5 with ONNX (3-7x faster, free)
  • Quality-first → OpenAI text-embedding-3-small or JinaAI-base
  • Multi-language (Thai) → Cohere embed-multilingual-v3.0
  • Balanced → OpenAI or bge-large (open-source)

Retrieval Strategy

  • Simple queries → Vector search
  • Keyword-heavy → BM25
  • Production systems → Hybrid (vector + BM25)
  • Structured data → Metadata filtering
  • Large doc sets (100+) → Document summary or recursive retrieval

Quick Start Patterns

Pattern 1: Standard RAG Setup (Vector Search)

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.openai import OpenAIEmbedding

# Configure chunking
Settings.chunk_size = 1024
Settings.chunk_overlap = 20

# Configure embeddings
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    embed_batch_size=100
)

# Load and index documents
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Create query engine
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("Your question here")
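
One way to sanity-check the baseline is to print the answer together with the chunks it was synthesized from; response.source_nodes carries the retrieved nodes and their similarity scores:

print(response)

# Inspect which chunks were retrieved and how well they scored
for node_with_score in response.source_nodes:
    print(f"{node_with_score.score:.3f}  {node_with_score.node.get_content()[:100]}")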

Pattern 2: Multi-Language Setup (Thai Support)

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.core.node_parser import SentenceSplitter

# Use multilingual embeddings
embed_model = CohereEmbedding(
    model_name="embed-multilingual-v3.0",
    api_key="YOUR_COHERE_API_KEY"
)

# Configure chunking for Thai text
node_parser = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=50
)

# Load Thai documents
documents = SimpleDirectoryReader("./data").load_data()

# Create index with multilingual support
index = VectorStoreIndex.from_documents(
    documents,
    embed_model=embed_model,
    transformations=[node_parser]
)

query_engine = index.as_query_engine(similarity_top_k=5)

Pattern 3: Hybrid Search (Production-Ready)

from llama_index.core import VectorStoreIndex
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever
import Stemmer

# Create vector index (assumes `documents` is already loaded, as in Pattern 1)
index = VectorStoreIndex.from_documents(documents)
vector_retriever = index.as_retriever(similarity_top_k=10)

# Create BM25 retriever
bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore,
    similarity_top_k=10,
    stemmer=Stemmer.Stemmer("english")
)

# Combine with query fusion
hybrid_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=5,
    mode="reciprocal_rerank",
    use_async=True
)

# Use in query engine
from llama_index.core.query_engine import RetrieverQueryEngine
query_engine = RetrieverQueryEngine.from_args(hybrid_retriever)
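
The hybrid engine is queried like any other. Because use_async=True runs both retrievers concurrently, notebook environments sometimes need nested event loops; nest_asyncio (pip install nest-asyncio) is the usual workaround, sketched here as an optional step:

# Optional: only needed if you hit event-loop errors in Jupyter with use_async=True
import nest_asyncio
nest_asyncio.apply()

response = query_engine.query("Your question here")
print(response)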

Pattern 4: Cost-Optimized Local Embeddings

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

# Use local model with ONNX acceleration (3-7x faster)
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-base-en-v1.5",
    backend="onnx"  # Requires: pip install optimum[onnxruntime]
)

# Rest of the pipeline remains the same
index = VectorStoreIndex.from_documents(documents)

Your Codebase Integration

For src/ Pipeline

Document Preprocessing (src/02_prep_doc_for_embedding.py):

  • Add configurable chunking strategy
  • Support multiple node parser types
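
One possible shape for this, sketched with a hypothetical helper (make_node_parser and the strategy names are illustrative, not taken from the existing file):

from llama_index.core.node_parser import (
    SentenceSplitter,
    SentenceWindowNodeParser,
    TokenTextSplitter,
)

def make_node_parser(strategy: str = "sentence", chunk_size: int = 1024, chunk_overlap: int = 50):
    """Map a config value to a LlamaIndex node parser (hypothetical helper)."""
    if strategy == "sentence":
        return SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    if strategy == "token":
        return TokenTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    if strategy == "sentence-window":
        return SentenceWindowNodeParser.from_defaults(window_size=3)
    raise ValueError(f"Unknown chunking strategy: {strategy}")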

Batch Embeddings (src/09_enhanced_batch_embeddings.py):

  • Increase embed_batch_size to 100 for faster processing
  • Consider local models to eliminate API costs

Basic Retrieval (src/10_basic_query_engine.py):

  • Baseline vector search implementation
  • Adjust similarity_top_k based on chunk size

For src-iLand/ Pipeline (Thai Land Deeds)

Data Processing (src-iLand/data_processing/):

  • Current chunk size: 1024 tokens (good for legal documents)
  • Consider testing 512 with increased top_k for better precision

Embeddings (src-iLand/docs_embedding/):

  • Current: OpenAI text-embedding-3-small
  • Test: Cohere embed-multilingual-v3.0 for better Thai understanding

Retrieval (src-iLand/retrieval/retrievers/):

  • Vector strategy: Baseline semantic search
  • Hybrid strategy: Combines vector + BM25 for Thai queries
  • Metadata strategy: Fast filtering (<50ms) on Thai geographic/legal metadata

Detailed References

Load these reference files when you need comprehensive details:

  • reference-chunking.md: Complete chunking strategies guide

    • Node parser types and configurations
    • Chunk size optimization for different domains
    • Sentence-window and hierarchical patterns
    • Metadata preservation
  • reference-embeddings.md: Embedding model selection and optimization

    • All supported models (OpenAI, Cohere, HuggingFace, JinaAI, Voyage)
    • Performance benchmarks (hit rate, MRR)
    • ONNX/OpenVINO optimization details
    • Multi-language support
    • Cost optimization strategies
  • reference-retrieval-basics.md: Core retrieval patterns

    • Vector retrieval implementation
    • BM25 keyword search
    • Hybrid search (vector + BM25)
    • Metadata filtering
    • Configuration parameters and tuning

Common Workflows

Workflow 1: New RAG Pipeline Setup

  • Step 1: Choose chunking strategy based on document type

    • Load reference-chunking.md for detailed guidance
    • Test with sample documents
  • Step 2: Select embedding model

    • Load reference-embeddings.md for model comparison
    • Consider: cost, latency, language support
    • For Thai: use multilingual models
  • Step 3: Implement basic vector retrieval

    • Start with simple VectorStoreIndex
    • Test with sample queries
    • Measure baseline performance (see the retrieval check sketched after this workflow)
  • Step 4: Optimize chunk size and top_k

    • Adjust based on query results
    • Balance precision vs recall
  • Step 5: (Optional) Upgrade to hybrid search

    • Combine vector + BM25 for production quality
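
A quick way to eyeball baseline quality (Step 3) and compare top_k settings (Step 4), assuming an index built as in Pattern 1; for systematic metrics (hit rate, MRR), use the evaluating-rag skill instead:

retriever = index.as_retriever(similarity_top_k=5)

for query in ["sample question 1", "sample question 2"]:
    print(query)
    for node_with_score in retriever.retrieve(query):
        # Consistently low similarity scores suggest revisiting chunk size or top_k
        print(f"  {node_with_score.score:.3f}  {node_with_score.node.get_content()[:80]}")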

Workflow 2: Migrating from API to Local Embeddings

  • Step 1: Choose local model (see reference-embeddings.md)

    • Recommended: BAAI/bge-base-en-v1.5
  • Step 2: Install optimization backend

    pip install optimum[onnxruntime]
    
  • Step 3: Update embedding configuration

    Settings.embed_model = HuggingFaceEmbedding(
        model_name="BAAI/bge-base-en-v1.5",
        backend="onnx"
    )
    
  • Step 4: Re-index all documents

    • CRITICAL: Must re-embed with new model
    • Use ingestion pipeline caching to avoid redundant work (see the sketch after this workflow)
  • Step 5: Validate retrieval quality

    • Compare results with API embeddings
    • Measure latency improvement (3-7x faster expected)
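
A sketch of the re-indexing step with ingestion pipeline caching, assuming documents are already loaded and Settings.embed_model is the local model configured above; on re-runs, unchanged documents reuse cached embeddings instead of being re-embedded:

from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.ingestion import IngestionCache, IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=1024, chunk_overlap=50),
        Settings.embed_model,  # the HuggingFaceEmbedding configured in Step 3
    ],
    cache=IngestionCache(),
)

nodes = pipeline.run(documents=documents)        # embeds, or reuses cached results
index = VectorStoreIndex(nodes)

pipeline.cache.persist("./pipeline_cache.json")  # reuse the cache on the next run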

Workflow 3: Thai Language RAG Setup

  • Step 1: Use multilingual embedding model

    embed_model = CohereEmbedding(
        model_name="embed-multilingual-v3.0"
    )
    
  • Step 2: Configure appropriate chunking

    • Test 512-1024 tokens for Thai text
    • Preserve Thai metadata
  • Step 3: Implement hybrid search

    • Vector for semantic understanding
    • BM25 for exact Thai keyword matching
  • Step 4: Add metadata filtering (sketched after this workflow)

    • จังหวัด (province), อำเภอ (district), ประเภท (type)
    • Enable fast filtering (<50ms)
  • Step 5: Test with Thai queries

    • Validate Unicode handling
    • Check retrieval relevance
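
A sketch of the metadata filtering step (Step 4). The key name and value here ("province", "เชียงใหม่") are illustrative; use whatever keys the src-iLand metadata schema actually attaches to nodes:

from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Restrict retrieval to a single province before semantic ranking
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="province", value="เชียงใหม่")]
)

retriever = index.as_retriever(similarity_top_k=5, filters=filters)
nodes = retriever.retrieve("Thai query here")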

Key Reminders

Critical Requirements:

  • Re-indexing: When changing embedding models, MUST re-index all data
  • Model consistency: Identical models for indexing and querying
  • Chunk size + top_k: Halving chunk size → double similarity_top_k

Best Practices:

  • Start simple (vector search) → Add complexity as needed (hybrid, metadata)
  • Test with representative queries before full indexing
  • Use local models for cost optimization at scale
  • Enable parallel loading (num_workers=10) for 13x speedup

Performance Tips:

  • Batch size: Set embed_batch_size=100 for API calls
  • Parallel loading: Use SimpleDirectoryReader(...).load_data(num_workers=10)
  • Caching: Enable ingestion pipeline caching for faster re-runs

Next Steps

After implementing basic RAG:

  • Optimize: Use optimizing-rag skill for reranking, caching, production deployment
  • Evaluate: Use evaluating-rag skill to measure hit rate, MRR, and compare strategies