rag-architecture

@liauw-media/CodeAssist
SKILL.md

name: rag-architecture
description: Use when building retrieval-augmented generation systems. Covers chunking strategies, embedding models, vector databases, retrieval patterns, and hybrid search. Apply when adding knowledge bases, document Q&A, or semantic search to applications.

RAG Architecture

Core Principle

RAG quality depends on retrieval quality. Optimize the retrieval pipeline before tuning the generation.

When to Use This Skill

  • Building document Q&A systems
  • Adding knowledge bases to chatbots
  • Implementing semantic search
  • Creating context-aware AI features
  • Reducing hallucinations with grounded responses
  • Building enterprise search systems

The Iron Law

GARBAGE IN, GARBAGE OUT.

If your chunks are poorly structured or your embeddings don't capture semantics, no amount of prompt engineering will save you.

Why RAG?

Benefits:

  • Grounds LLM responses in actual data
  • Reduces hallucinations significantly
  • Enables domain-specific knowledge
  • Keeps data private (no fine-tuning needed)
  • Easy to update (just re-index)
  • Cost-effective vs fine-tuning

Without RAG:

  • LLM makes up facts
  • Knowledge cutoff limits usefulness
  • No access to private data
  • Expensive fine-tuning for updates
  • Generic responses

RAG Pipeline Overview

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Documents  │───▶│  Chunking   │───▶│  Embedding  │
└─────────────┘    └─────────────┘    └─────────────┘
                                             │
                                             ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Response   │◀───│     LLM     │◀───│  Retrieval  │
└─────────────┘    └─────────────┘    └─────────────┘
                          ▲
                          │
                   ┌─────────────┐
                   │    Query    │
                   └─────────────┘
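
In code, the two halves of the diagram come down to an index-build step and a query step. A minimal in-memory sketch, assuming the helper functions defined in the steps below (semantic_chunk, embed_texts, embed_query) plus numpy:

import numpy as np

def build_index(documents):
    """Ingest path: chunk each document, then embed the chunks (Steps 1-3)."""
    chunks = [c for doc in documents for c in semantic_chunk(doc)]
    vectors = np.array(embed_texts(chunks))
    return chunks, vectors

def retrieve_top_k(query, chunks, vectors, k=5):
    """Query path: embed the query, return the k nearest chunks (Step 5)."""
    q = np.array(embed_query(query))
    # Cosine similarity between the query and every chunk vector
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in sims.argsort()[::-1][:k]]

The retrieved chunks then go to the LLM as context (Step 7); a real system swaps the in-memory arrays for a vector database (Step 4).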

Step 1: Document Ingestion

Supported Formats

# Common document loaders
SUPPORTED_FORMATS = {
    'text': ['.txt', '.md', '.rst'],
    'documents': ['.pdf', '.docx', '.doc', '.pptx'],
    'code': ['.py', '.js', '.ts', '.java', '.go'],
    'data': ['.csv', '.json', '.xml'],
    'web': ['html', 'urls', 'sitemaps'],
}

Pre-processing

def preprocess_document(doc):
    """Clean and prepare document for chunking."""
    # 1. Extract text
    text = extract_text(doc)

    # 2. Clean artifacts
    text = remove_headers_footers(text)
    text = normalize_whitespace(text)
    text = fix_encoding_issues(text)

    # 3. Preserve structure
    sections = identify_sections(text)

    # 4. Extract metadata
    metadata = {
        'source': doc.path,
        'title': extract_title(doc),
        'date': extract_date(doc),
        'type': doc.type,
    }

    return text, sections, metadata

Step 2: Chunking Strategies

Strategy Selection

| Content Type | Strategy | Chunk Size | Overlap |
| --- | --- | --- | --- |
| Documentation | Semantic | 500-1000 | 100-200 |
| Code | Function/Class | Varies | Minimal |
| Legal/Contracts | Paragraph | 300-500 | 50-100 |
| Chat logs | Message groups | 5-10 msgs | 2-3 msgs |
| Q&A pairs | Per pair | Full pair | None |
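
A small dispatch helper turns the table into code; a sketch assuming the chunking functions defined below (the content-type labels are illustrative):

def chunk_document(text, content_type):
    """Pick a chunking strategy per the table above."""
    if content_type == 'documentation':
        return semantic_chunk(text, max_chunk_size=1000)
    if content_type == 'code':
        # code_chunk returns dicts with metadata, not bare strings
        return code_chunk(text)
    if content_type in ('legal', 'contracts'):
        return semantic_chunk(text, max_chunk_size=500)
    # Fallback: fixed-size with overlap
    return fixed_size_chunk(text, chunk_size=500, overlap=100)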

Fixed-Size Chunking

def fixed_size_chunk(text, chunk_size=500, overlap=100):
    """Simple but effective for uniform content."""
    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]

        # Don't cut mid-word
        if end < len(text):
            last_space = chunk.rfind(' ')
            if last_space > chunk_size * 0.8:
                chunk = chunk[:last_space]
                end = start + last_space

        chunks.append(chunk.strip())
        start = end - overlap

    return chunks

Semantic Chunking

def semantic_chunk(text, max_chunk_size=1000):
    """Chunk by semantic boundaries (paragraphs, sections).

    Note: a single paragraph longer than max_chunk_size still becomes
    its own oversized chunk; pair with recursive_chunk if that matters.
    """
    # Split by semantic boundaries
    paragraphs = text.split('\n\n')

    chunks = []
    current_chunk = []
    current_size = 0

    for para in paragraphs:
        para_size = len(para)

        if current_size + para_size > max_chunk_size and current_chunk:
            chunks.append('\n\n'.join(current_chunk))
            current_chunk = []
            current_size = 0

        current_chunk.append(para)
        current_size += para_size

    if current_chunk:
        chunks.append('\n\n'.join(current_chunk))

    return chunks

Recursive Chunking (LangChain style)

def recursive_chunk(text, chunk_size=500, separators=None):
    """Recursively split on increasingly fine separators."""
    if separators is None:
        separators = ['\n\n', '\n', '. ', ' ', '']

    separator = separators[0]
    # str.split('') raises ValueError, so fall back to characters
    splits = text.split(separator) if separator else list(text)

    chunks = []

    def flush(parts):
        chunk_text = separator.join(parts)
        # Recursively split if still too large
        if len(chunk_text) > chunk_size and len(separators) > 1:
            chunks.extend(recursive_chunk(
                chunk_text, chunk_size, separators[1:]
            ))
        else:
            chunks.append(chunk_text)

    current = []
    current_len = 0

    for split in splits:
        if current_len + len(split) > chunk_size:
            if current:
                flush(current)
            current = [split]
            current_len = len(split)
        else:
            current.append(split)
            current_len += len(split) + len(separator)

    if current:
        flush(current)  # the final chunk may also need recursive splitting

    return chunks

Code-Aware Chunking

import ast

def code_chunk(source_code, language='python'):
    """Chunk code by logical units (top-level functions and classes)."""
    if language != 'python':
        raise ValueError(f"No chunker implemented for {language}")

    tree = ast.parse(source_code)
    chunks = []

    # Iterate top-level nodes only; ast.walk would also visit nested
    # definitions and emit them twice (alone and inside their parent)
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                'content': ast.get_source_segment(source_code, node),
                'type': type(node).__name__,
                'name': node.name,
                'line_start': node.lineno,
            })

    return chunks

Chunking Best Practices

DO:
- Keep related content together
- Include context in each chunk
- Preserve code blocks intact
- Add metadata for filtering
- Test with real queries

DON'T:
- Cut mid-sentence
- Split code functions
- Ignore document structure
- Use one-size-fits-all
- Forget about overlap
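
To make "include context in each chunk" and "add metadata for filtering" concrete, a sketch that prepends document and section titles to each chunk before embedding (the field names are illustrative):

def contextualize_chunks(section_title, doc_metadata, chunks):
    """Prepend section context and attach filterable metadata to each chunk."""
    enriched = []
    for i, chunk in enumerate(chunks):
        enriched.append({
            # The embedded text carries its own context
            'content': f"{doc_metadata['title']} > {section_title}\n\n{chunk}",
            'metadata': {
                **doc_metadata,          # source, title, date, type
                'section': section_title,
                'chunk_index': i,
            },
        })
    return enriched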

Step 3: Embedding Models

Model Selection

| Model | Dimensions | Speed | Quality | Cost (per 1M tokens) |
| --- | --- | --- | --- | --- |
| OpenAI text-embedding-3-small | 1536 | Fast | Good | $0.02 |
| OpenAI text-embedding-3-large | 3072 | Medium | Best | $0.13 |
| Cohere embed-v3 | 1024 | Fast | Good | $0.10 |
| Voyage-2 | 1024 | Medium | Excellent | $0.10 |
| BGE-large (local) | 1024 | Varies | Good | Free |
| all-MiniLM-L6 (local) | 384 | Fast | Decent | Free |

Embedding Implementation

from openai import OpenAI

client = OpenAI()

def embed_texts(texts, model="text-embedding-3-small"):
    """Embed multiple texts efficiently."""
    # Batch for efficiency (max 2048 texts per call)
    embeddings = []
    batch_size = 2048

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model=model,
            input=batch
        )
        embeddings.extend([e.embedding for e in response.data])

    return embeddings

def embed_query(query, model="text-embedding-3-small"):
    """Embed a single query."""
    response = client.embeddings.create(
        model=model,
        input=query
    )
    return response.data[0].embedding

Local Embedding (Cost-Free)

from sentence_transformers import SentenceTransformer

# Load model once
model = SentenceTransformer('all-MiniLM-L6-v2')

def embed_local(texts):
    """Free local embedding."""
    return model.encode(texts, convert_to_numpy=True)
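
Retrieval then reduces to cosine similarity between the query vector and the chunk vectors; a quick check using sentence-transformers' built-in helper:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

query_vec = model.encode("how do I reset my password")
chunk_vecs = model.encode([
    "Resetting your password from the login screen",
    "Quarterly revenue report",
])

# util.cos_sim returns a (1, n) tensor of cosine similarities
scores = util.cos_sim(query_vec, chunk_vecs)
print(scores)  # the password chunk should score much higher

Cosine similarity is the safe default metric, since not every model L2-normalizes its outputs.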

Step 4: Vector Database Selection

Database Comparison

| Database | Type | Scale | Features | Best For |
| --- | --- | --- | --- | --- |
| Pinecone | Managed | Billions | Metadata, namespaces | Production, scale |
| Weaviate | Self-hosted/Managed | Millions | Hybrid search, GraphQL | Flexibility |
| Qdrant | Self-hosted/Managed | Millions | Filtering, payloads | Performance |
| ChromaDB | Embedded | Thousands | Simple API | Prototyping |
| pgvector | Postgres extension | Millions | SQL integration | Existing Postgres |
| Milvus | Self-hosted | Billions | Enterprise features | Large scale |

ChromaDB (Quick Start)

import chromadb

# Initialize
client = chromadb.Client()
collection = client.create_collection("documents")

# Add documents
collection.add(
    documents=["Doc 1 content", "Doc 2 content"],
    metadatas=[{"source": "file1"}, {"source": "file2"}],
    ids=["doc1", "doc2"]
)

# Query
results = collection.query(
    query_texts=["search query"],
    n_results=5
)

Pinecone (Production)

from pinecone import Pinecone

pc = Pinecone(api_key="your-api-key")
index = pc.Index("your-index")

# Upsert vectors
index.upsert(
    vectors=[
        {
            "id": "doc1",
            "values": embedding_vector,
            "metadata": {"source": "file1", "category": "docs"}
        }
    ],
    namespace="production"
)

# Query with metadata filter
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={"category": {"$eq": "docs"}},
    include_metadata=True,
    namespace="production"
)

pgvector (Postgres)

-- Enable extension
CREATE EXTENSION vector;

-- Create table
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1536),
    metadata JSONB
);

-- Create index for fast search
CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Query
SELECT content, metadata,
       1 - (embedding <=> $1) AS similarity
FROM documents
ORDER BY embedding <=> $1
LIMIT 10;
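
From Python, the pgvector adapter registers the vector type with the driver; a sketch assuming psycopg 3 and the pgvector package (the connection string is illustrative, table as above):

import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=rag")
register_vector(conn)  # lets psycopg send numpy arrays as vectors

def pg_search(query_embedding, k=10):
    """Nearest-neighbor search over the documents table above.

    query_embedding should be a numpy array of dimension 1536.
    """
    with conn.cursor() as cur:
        cur.execute(
            """SELECT content, metadata,
                      1 - (embedding <=> %s) AS similarity
               FROM documents
               ORDER BY embedding <=> %s
               LIMIT %s""",
            (query_embedding, query_embedding, k),
        )
        return cur.fetchall()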

Step 5: Retrieval Strategies

Basic Similarity Search

def retrieve_similar(query, k=5):
    """Basic vector similarity search."""
    query_embedding = embed_query(query)

    # vector_db is whichever client you set up in Step 4
    results = vector_db.query(
        vector=query_embedding,
        top_k=k
    )

    return [r.content for r in results]

Hybrid Search (Vector + Keyword)

def hybrid_search(query, k=5, alpha=0.5):
    """Combine semantic and keyword search."""
    # Semantic search
    semantic_results = vector_search(query, k=k*2)

    # Keyword search (BM25)
    keyword_results = bm25_search(query, k=k*2)

    # Reciprocal Rank Fusion
    combined = reciprocal_rank_fusion(
        [semantic_results, keyword_results],
        weights=[alpha, 1-alpha]
    )

    return combined[:k]

def reciprocal_rank_fusion(result_lists, weights, k=60):
    """RRF scoring for combining rankings."""
    scores = {}

    for results, weight in zip(result_lists, weights):
        for rank, doc in enumerate(results):
            doc_id = doc.id
            if doc_id not in scores:
                scores[doc_id] = 0
            scores[doc_id] += weight * (1 / (k + rank + 1))

    sorted_docs = sorted(scores.items(), key=lambda x: -x[1])
    return [doc_id for doc_id, score in sorted_docs]
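
bm25_search is referenced above but never defined; a minimal sketch using the rank_bm25 package (an assumption; Elasticsearch or any keyword index works equally well, and corpus here stands for the same document list backing the vector index):

from rank_bm25 import BM25Okapi

# Build the keyword index once, alongside the vector index
tokenized = [doc.content.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

def bm25_search(query, k=10):
    """Return the top-k documents by BM25 score."""
    scores = bm25.get_scores(query.lower().split())
    top = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return [corpus[i] for i in top]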

Multi-Query Retrieval

def multi_query_retrieve(query, k=5):
    """Generate query variations for better recall."""
    # Generate variations
    variations = generate_query_variations(query)

    all_results = []
    for q in [query] + variations:
        results = vector_search(q, k=k)
        all_results.extend(results)

    # Dedupe and rank
    unique = deduplicate_by_id(all_results)
    return rerank(query, unique)[:k]

def generate_query_variations(query):
    """Use LLM to generate query variations."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Generate 3 alternative phrasings of this search query.
            Return only the queries, one per line.

            Query: {query}"""
        }]
    )
    return response.choices[0].message.content.strip().split('\n')

Contextual Retrieval

def contextual_retrieve(query, conversation_history, k=5):
    """Consider conversation context for retrieval."""
    # Rewrite query with context
    contextualized_query = rewrite_with_context(
        query,
        conversation_history
    )

    return vector_search(contextualized_query, k=k)

def rewrite_with_context(query, history):
    """Make query self-contained using conversation context."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": "Rewrite the query to be self-contained, incorporating relevant context from the conversation."
        }, {
            "role": "user",
            "content": f"Conversation:\n{history}\n\nQuery: {query}"
        }]
    )
    return response.choices[0].message.content

Step 6: Reranking

Cross-Encoder Reranking

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, documents, top_k=5):
    """Rerank retrieved documents for relevance."""
    pairs = [[query, doc.content] for doc in documents]
    scores = reranker.predict(pairs)

    ranked = sorted(
        zip(documents, scores),
        key=lambda x: x[1],
        reverse=True
    )

    return [doc for doc, score in ranked[:top_k]]

Cohere Rerank API

import cohere

co = cohere.Client("your-api-key")

def cohere_rerank(query, documents, top_k=5):
    """Use Cohere's rerank endpoint."""
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=[d.content for d in documents],
        top_n=top_k
    )

    return [documents[r.index] for r in response.results]

Step 7: Generation with Context

Basic RAG Prompt

def generate_response(query, retrieved_docs):
    """Generate response using retrieved context."""
    context = "\n\n".join([
        f"[Source: {doc.metadata['source']}]\n{doc.content}"
        for doc in retrieved_docs
    ])

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """Answer the question based on the provided context.
                If the context doesn't contain the answer, say so.
                Cite sources using [Source: filename] format."""
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}"
            }
        ]
    )

    return response.choices[0].message.content

Streaming Response

def stream_rag_response(query, retrieved_docs):
    """Stream the RAG response for better UX."""
    context = format_context(retrieved_docs)

    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": RAG_SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
        ],
        stream=True
    )

    for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
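
format_context and RAG_SYSTEM_PROMPT are assumed above; minimal versions consistent with the basic RAG prompt:

RAG_SYSTEM_PROMPT = """Answer the question based on the provided context.
If the context doesn't contain the answer, say so.
Cite sources using [Source: filename] format."""

def format_context(retrieved_docs):
    """Join retrieved chunks, labelling each with its source."""
    return "\n\n".join(
        f"[Source: {doc.metadata['source']}]\n{doc.content}"
        for doc in retrieved_docs
    )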

Advanced Patterns

Parent-Child Chunking

def parent_child_index(document):
    """Index small chunks but retrieve larger context."""
    # Create parent chunks (larger)
    parent_chunks = semantic_chunk(document, max_chunk_size=2000)

    # Create child chunks (smaller, for retrieval)
    for i, parent in enumerate(parent_chunks):
        child_chunks = fixed_size_chunk(parent, chunk_size=200)

        for child in child_chunks:
            index_chunk(
                content=child,
                metadata={
                    'parent_id': f'parent_{i}',
                    'parent_content': parent,  # store parent alongside child
                }
            )

def retrieve_with_parent(query, k=3):
    """Retrieve children, return parents."""
    results = vector_search(query, k=k*3)

    # Dedupe parents while preserving retrieval rank order
    # (a plain set() would scramble the ranking)
    parent_ids = list(dict.fromkeys(r.metadata['parent_id'] for r in results))

    # Return parent content
    return [get_parent_content(pid) for pid in parent_ids[:k]]

Self-Querying Retriever

def self_query_retrieve(natural_query):
    """Extract structured filters from natural language."""
    # Use LLM to parse query
    parsed = parse_query_with_llm(natural_query)

    # Example: "Show me Python tutorials from 2024"
    # parsed = {
    #     "query": "Python tutorials",
    #     "filters": {"language": "python", "year": 2024}
    # }

    return vector_search(
        query=parsed["query"],
        filter=parsed["filters"]
    )
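
parse_query_with_llm can be a single structured-output call; a sketch using the OpenAI client from Step 3 with JSON mode (the filter schema is illustrative):

import json

def parse_query_with_llm(natural_query):
    """Split a natural-language query into search text plus metadata filters."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{
            "role": "system",
            "content": ("Extract a search query and metadata filters from the "
                        "user's request. Return JSON with keys 'query' (string) "
                        "and 'filters' (object, may be empty).")
        }, {
            "role": "user",
            "content": natural_query,
        }]
    )
    return json.loads(response.choices[0].message.content)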

Evaluation Metrics

Retrieval Metrics

def evaluate_retrieval(queries, ground_truth):
    """Evaluate retrieval quality."""
    metrics = {
        'precision@k': [],
        'recall@k': [],
        'mrr': [],  # Mean Reciprocal Rank
    }

    for query, relevant_ids in zip(queries, ground_truth):
        retrieved = retrieve(query, k=10)
        retrieved_ids = [r.id for r in retrieved]

        # Precision@K
        relevant_retrieved = len(set(retrieved_ids) & set(relevant_ids))
        metrics['precision@k'].append(relevant_retrieved / len(retrieved_ids))

        # Recall@K
        metrics['recall@k'].append(relevant_retrieved / len(relevant_ids))

        # MRR
        for i, rid in enumerate(retrieved_ids):
            if rid in relevant_ids:
                metrics['mrr'].append(1 / (i + 1))
                break
        else:
            metrics['mrr'].append(0)

    return {k: sum(v)/len(v) for k, v in metrics.items()}

End-to-End Evaluation

def evaluate_rag(test_cases):
    """Evaluate full RAG pipeline."""
    results = []

    for case in test_cases:
        # The pipeline must return the retrieved chunks alongside the
        # answer so faithfulness can be judged against them
        response, retrieved_context = rag_pipeline(case['question'])

        results.append({
            'question': case['question'],
            'expected': case['expected_answer'],
            'actual': response,
            'relevance': judge_relevance(response, case['expected_answer']),
            'faithfulness': judge_faithfulness(response, retrieved_context),
        })

    return results
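
judge_relevance and judge_faithfulness are LLM-as-judge calls; a minimal sketch for one of them (the prompt and 1-5 scale are illustrative, and frameworks such as RAGAS package these metrics ready-made):

def judge_faithfulness(response, retrieved_context):
    """Ask an LLM whether the answer is supported by the retrieved context."""
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Rate 1-5 how well the answer is supported by the context. "
                "Reply with just the number.\n\n"
                f"Context:\n{retrieved_context}\n\nAnswer:\n{response}"
            )
        }]
    )
    return int(result.choices[0].message.content.strip())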

Common Mistakes

| Mistake | Impact | Fix |
| --- | --- | --- |
| Chunks too small | Lost context | Increase size, add overlap |
| Chunks too large | Noise in retrieval | Decrease size, semantic split |
| Wrong embedding model | Poor retrieval | Match model to content type |
| No metadata | Can't filter | Add source, date, category |
| Ignoring structure | Broken code/tables | Use format-aware chunking |
| No reranking | Irrelevant results | Add cross-encoder reranker |

Integration with Skills

Use with:

  • llm-integration - API patterns for embeddings and generation
  • agentic-design - RAG as a tool in agent systems
  • test-driven-development - Test retrieval quality

Checklist

Before deploying RAG:

  • Document formats properly handled
  • Chunking strategy tested with real queries
  • Embedding model appropriate for domain
  • Vector database can handle scale
  • Metadata indexed for filtering
  • Retrieval quality measured
  • Reranking improves results
  • Response cites sources
  • Latency acceptable
  • Cost estimated

Authority

Based on:

  • Anthropic RAG best practices
  • OpenAI cookbook patterns
  • LangChain documentation
  • Academic research on dense retrieval
  • Production systems at scale

Bottom Line: RAG quality = Retrieval quality. Invest in chunking, embeddings, and retrieval before optimizing prompts. Test with real queries, measure retrieval metrics, iterate.