Claude Code Plugins

Community-maintained marketplace


Implement retrieval-augmented generation systems. Use when building knowledge-intensive applications, document search, or Q&A systems, or when you need to ground LLM responses in external data. Covers embedding strategy, vector stores, retrieval pipelines, and evaluation.

Install Skill

1. Download the skill
2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reviewing its instructions before using it.

SKILL.md

name: RAG Implementer
description: Implement retrieval-augmented generation systems. Use when building knowledge-intensive applications, document search, or Q&A systems, or when you need to ground LLM responses in external data. Covers embedding strategy, vector stores, retrieval pipelines, and evaluation.
version: 1.0.0

RAG Implementer

Build production-ready retrieval-augmented generation systems.

Core Principle

RAG = Retrieval + Context Assembly + Generation

Use RAG when you need LLMs to access fresh, domain-specific, or proprietary knowledge that wasn't in their training data.
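
A minimal sketch of these three stages, assuming the OpenAI Python client for generation; retrieve() is a hypothetical placeholder for whatever vector-store lookup you build in the later phases.

# Minimal RAG flow: retrieve -> assemble context -> generate
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer(query: str) -> str:
    # 1. Retrieval: fetch candidate chunks (retrieve() is a hypothetical helper)
    chunks = retrieve(query, top_k=5)
    # 2. Context assembly: rank, deduplicate, and pack chunks into the prompt
    context = "\n\n".join(chunk["content"] for chunk in chunks)
    # 3. Generation: ground the answer in the retrieved context only
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": "Answer only from the provided context and cite sources."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content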


⚠️ Prerequisites & Cost Reality Check

STOP: Have You Validated the Need for RAG?

Before implementing RAG, confirm:

  • Problem validated - Completed product-strategist Phase 1 (problem discovery)
  • Users need AI search - Tested with simpler alternatives (see below)
  • ROI justified - Calculated cost vs benefit of RAG vs alternatives

Try These FIRST (Before RAG)

RAG is powerful but expensive. Try cheaper alternatives first:

1. FAQ Page / Documentation (1 day, $0)

  • Create well-organized FAQ or docs
  • Rely on the browser's find-in-page search (Cmd+F)
  • Works for: <50 common questions, static content
  • Test: Do users find answers? If yes, stop here.

2. Simple Keyword Search (2-3 days, $0-20/month)

  • Use Algolia, Typesense, or PostgreSQL full-text search
  • Good enough for 80% of use cases
  • Works for: <100k documents, keyword matching sufficient
  • Test: Do users get relevant results? If yes, stop here.

3. Manual Curation (Concierge MVP) (1 week, $0)

  • Manually answer user questions
  • Build FAQ from common questions
  • Works for: <100 users, validating if users want AI
  • Test: Do users value your answers enough to pay? If yes, consider RAG.

4. Simple Semantic Search (1 week, $30-50/month)

  • Use OpenAI embeddings + Postgres pgvector
  • Skip complex retrieval, re-ranking, etc.
  • Works for: <50k documents, basic semantic search
  • Test: Are embeddings better than keyword search? If no, stop here.
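
If you try alternative 4, here is a minimal sketch of query-time semantic search with OpenAI embeddings and pgvector; the documents table with content and embedding vector(1536) columns and the connection string are illustrative assumptions.

# Embed the query, then nearest-neighbour lookup against pgvector (cosine distance)
import psycopg
from openai import OpenAI

client = OpenAI()

def search(query: str, top_k: int = 5):
    emb = client.embeddings.create(model="text-embedding-3-small", input=query).data[0].embedding
    vector_literal = "[" + ",".join(str(x) for x in emb) + "]"
    with psycopg.connect("dbname=app") as conn:  # illustrative connection string
        return conn.execute(
            "SELECT content, 1 - (embedding <=> %s::vector) AS similarity "
            "FROM documents ORDER BY embedding <=> %s::vector LIMIT %s",
            (vector_literal, vector_literal, top_k),
        ).fetchall()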

Cost Reality Check

Naive RAG (Prototype):

  • Time: 1-2 weeks
  • Cost: $50-150/month (vector DB + embeddings + API calls)
  • When: Prototype, <10k documents, proof of concept

Advanced RAG (Production):

  • Time: 3-4 weeks
  • Cost: $200-500/month (hybrid search, re-ranking, monitoring)
  • When: Production, 10k-1M documents, validated demand

Modular RAG (Enterprise):

  • Time: 6-8 weeks
  • Cost: $500-2000+/month (multiple KBs, specialized modules)
  • When: Enterprise, 1M+ documents, mission-critical

Decision Tree: Do You Really Need RAG?

Do users need to search your content?
│
├─ No → Don't build RAG ❌
│
└─ Yes
   ├─ <50 items? → FAQ page ✅ ($0)
   │
   └─ >50 items?
      ├─ Keyword search enough? → Use Algolia ✅ ($0-20/mo)
      │
      └─ Need semantic understanding?
         ├─ <50k docs? → Simple semantic (pgvector) ✅ ($30/mo)
         │
         └─ >50k docs?
            ├─ Validated with users? → Build RAG ✅
            └─ Not validated? → Test with Concierge MVP first ⚠️

Validation Checklist

Only proceed with RAG implementation if:

  • Tested simpler alternatives (FAQ, keyword search, manual curation)
  • Users confirmed they need AI-powered search (not just your own assumption that they do)
  • Calculated ROI: cost of RAG < value users get
  • Have >50k documents OR complex semantic search requirements
  • Budget: $200-500/month for infrastructure
  • Time: 3-4 weeks for production implementation

If any item above is unchecked: go back to the product-strategist or mvp-builder skills to validate first.

See also: PLAYBOOKS/validation-first-development.md for step-by-step validation process.


8-Phase RAG Implementation

Phase 1: Knowledge Base Design

Goal: Create well-structured knowledge foundation

Actions:

  • Map data sources (internal: docs, databases, APIs / external: web, feeds)
  • Filter noise, select authoritative content (prevent "data dump fallacy")
  • Define chunking strategy: semantic chunking based on structure
  • Add metadata: tags, timestamps, source identifiers, categories

Validation:

  • All data sources catalogued and prioritized
  • Data quality assessed (accuracy, completeness, freshness)
  • Chunking strategy tested with sample documents
  • Metadata schema validated for search effectiveness

Common Chunking Strategies:

  • Fixed-size: 500-1000 tokens, 50-100 token overlap (a sketch follows this list)
  • Semantic: By paragraph, section headers, or topic boundaries
  • Recursive: Split by structure (markdown headers, code blocks)
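
A minimal sketch of the fixed-size strategy above, counting tokens with tiktoken; the encoding name, chunk size, and overlap are illustrative defaults.

# Fixed-size chunking with token overlap so neighbouring chunks share context
import tiktoken

def chunk_text(text: str, chunk_size: int = 800, overlap: int = 80) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(enc.decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
        start += chunk_size - overlap  # advance by chunk_size minus the overlap
    return chunks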

Phase 2: Embedding Strategy

Goal: Choose optimal embedding approach for semantic understanding

Actions:

  • Select embedding model: text-embedding-3-large (3072 dimensions) or text-embedding-3-small (1536 dimensions) for general use, domain-specific models for specialized content
  • Plan multi-modal needs (text, code, images, tables)
  • Decide on fine-tuning: use domain data if general embeddings underperform
  • Establish similarity benchmarks

Validation:

  • Embedding model benchmarked on domain data
  • Retrieval accuracy tested with known query-document pairs
  • Storage and compute costs validated

Model Selection:

  • General: OpenAI text-embedding-3-large, text-embedding-3-small
  • Code: code-search-babbage-code-001 or StarEncoder
  • Multilingual: multilingual-e5-large
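
To compare candidate models on your own domain, a minimal sketch that checks whether each known query ranks its labelled document first; the model name and the two test pairs are illustrative.

# Tiny sanity check: each query should rank its known-relevant document at position 1
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    resp = client.embeddings.create(model=model, input=texts)
    return np.array([d.embedding for d in resp.data])

pairs = [
    ("How do I reset my password?", "To reset your password, open Settings > Security ..."),
    ("What is the refund policy?", "Refunds are issued within 14 days of purchase ..."),
]
queries = embed([q for q, _ in pairs])
docs = embed([d for _, d in pairs])
sims = (queries @ docs.T) / (
    np.linalg.norm(queries, axis=1, keepdims=True) * np.linalg.norm(docs, axis=1)
)
hit_rate = float(np.mean(sims.argmax(axis=1) == np.arange(len(pairs))))
print(f"Top-1 hit rate: {hit_rate:.2f}")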

Phase 3: Vector Store Architecture

Goal: Implement scalable vector database

Actions:

  • Choose vector DB (Pinecone, Weaviate, Qdrant, Chroma, pgvector)
  • Configure the index: HNSW for speed, IVF for scale (see the sketch after this list)
  • Plan scalability: data growth and query volume
  • Implement backup, recovery, security
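
One concrete instance of the index configuration above: creating an HNSW index with pgvector 0.5+; the table, column, and tuning values are illustrative, and the other stores listed below expose equivalent settings through their own APIs.

# HNSW index on a pgvector column; m and ef_construction trade recall for build time and memory
import psycopg

with psycopg.connect("dbname=app") as conn:  # illustrative connection string
    conn.execute(
        "CREATE INDEX IF NOT EXISTS documents_embedding_hnsw "
        "ON documents USING hnsw (embedding vector_cosine_ops) "
        "WITH (m = 16, ef_construction = 64)"
    )
    conn.commit()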

Validation:

  • Vector store benchmarked under expected load
  • Index optimized for retrieval speed and accuracy
  • Backup and recovery tested
  • Security controls implemented

Vector DB Decision:

  • Managed cloud → Pinecone
  • Self-hosted, feature-rich → Weaviate
  • Lightweight, local → Chroma
  • Cost-conscious → pgvector (Postgres extension)
  • High-performance → Qdrant

Phase 4: Retrieval Pipeline

Goal: Build sophisticated retrieval beyond simple similarity search

Actions:

  • Implement hybrid retrieval: semantic search + keyword (BM25); a fusion sketch follows this list
  • Add query enhancement: expansion, reformulation, multi-query
  • Apply contextual filtering: metadata, temporal constraints, relevance ranking
  • Design for query types: factual (precision), analytical (breadth), creative (diversity)
  • Handle edge cases: no relevant results found
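
A minimal sketch of the hybrid retrieval action above, merging the BM25 ranking and the vector-search ranking with reciprocal rank fusion; the document ids and the k constant are illustrative.

# Reciprocal rank fusion: combine two ranked lists of document ids into one hybrid ranking
def rrf(bm25_ranked: list[str], vector_ranked: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (bm25_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Documents that both retrievers rank highly float to the top
print(rrf(["d3", "d1", "d7"], ["d1", "d9", "d3"]))  # ['d1', 'd3', 'd9', 'd7']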

Advanced Techniques:

  • Re-ranking: Use a cross-encoder after initial retrieval (e.g., cross-encoder/ms-marco-MiniLM-L-12-v2); see the sketch after this list
  • Query routing: Route different query types to specialized strategies
  • Ensemble methods: Combine multiple retrieval approaches
  • Adaptive retrieval: Adjust top-k based on query complexity
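
A minimal sketch of the re-ranking technique above, using the sentence-transformers CrossEncoder with the model named in that bullet; the candidate passages come from whatever first-stage retriever you built.

# Cross-encoder re-ranking: score (query, passage) pairs jointly, keep the best top_k
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_k]]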

Validation:

  • Retrieval accuracy tested across diverse query types
  • Hybrid retrieval outperforms single-method baselines
  • Query latency meets requirements (<500ms ideal)
  • Edge cases and fallbacks tested

Phase 5: Context Assembly

Goal: Transform retrieved chunks into optimal LLM context

Actions:

  • Rank and select: prioritize by relevance score, recency, source authority
  • Synthesize: merge related chunks, avoid redundancy
  • Compress: use LLMLingua or similar for token optimization
  • Mitigate "lost in the middle": place critical info at start/end
  • Adapt dynamically: adjust context based on conversation history
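
A minimal sketch of the "lost in the middle" mitigation above: chunks already sorted by relevance are reordered so the strongest ones sit at the start and end of the assembled context.

# Put the most relevant chunks at the edges of the context, weakest in the middle
def order_for_context(chunks_by_relevance: list[str]) -> list[str]:
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):  # index 0 = most relevant
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Ranked chunks A > B > C > D > E are laid out as A, C, E, D, B (A and B at the edges)
print(order_for_context(["A", "B", "C", "D", "E"]))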

Context Engineering Integration:

  • Blend RAG results with system instructions and user prompts
  • Maintain conversation coherence across multi-turn interactions
  • Implement context persistence for follow-up queries
  • Balance context size vs. information density

Validation:

  • Context relevance validated against human judgments
  • Token optimization maintains accuracy
  • Multi-turn conversations maintain coherence
  • Assembly latency <200ms

Phase 6: Evaluation & Metrics

Goal: Measure RAG system performance comprehensively

Retrieval Quality:

  • Precision@K: Fraction of top-K results that are relevant
  • Recall@K: Fraction of all relevant docs that appear in the top-K
  • MRR (Mean Reciprocal Rank): Mean of 1/rank of the first relevant result across queries
  • NDCG: Ranking quality with graded relevance
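
A minimal sketch of the first three metrics for a single query, given the retrieved ids in rank order and the set of ids labelled relevant; NDCG needs graded relevance judgments and is left out.

# Precision@K, Recall@K, and reciprocal rank for one query
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(doc in relevant for doc in retrieved[:k]) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# MRR is the mean of reciprocal_rank over the whole evaluation query set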

Generation Quality:

  • Faithfulness: Generated content accuracy vs. sources
  • Answer Relevance: Response relevance to query
  • Context Utilization: How effectively LLM uses retrieved info
  • Hallucination Rate: Frequency of unsupported claims

System Performance:

  • End-to-End Latency: Query to answer (<3 seconds target)
  • Retrieval Latency: Time to retrieve and rank (<500ms)
  • Token Efficiency: Information density per token
  • Cost Per Query: Combined retrieval + generation costs

Validation:

  • Baseline metrics established
  • A/B testing framework for config comparisons
  • Automated evaluation pipeline deployed
  • Human evaluation protocols for ground truth

Phase 7: Production Deployment

Goal: Deploy with enterprise-grade reliability and security

Deployment:

  • Containerize with Docker/Kubernetes
  • Implement load balancing across RAG instances
  • Add caching for frequent queries
  • Graceful degradation: fallback to base model on component failure
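
A minimal sketch of the caching action above: an in-process TTL cache keyed on the normalised query; answer_query() is a hypothetical stand-in for the full retrieve-and-generate pipeline, and a shared cache such as Redis would replace the dict in a multi-instance deployment.

# Tiny TTL cache for frequent queries
import time

CACHE: dict[str, tuple[float, str]] = {}
TTL_SECONDS = 15 * 60

def cached_answer(query: str) -> str:
    key = " ".join(query.lower().split())  # normalise case and whitespace
    hit = CACHE.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]                      # cache hit: skip retrieval and generation
    answer = answer_query(query)           # hypothetical full RAG pipeline call
    CACHE[key] = (time.time(), answer)
    return answer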

Security:

  • Role-based access controls for knowledge base
  • Data masking and PII protection
  • Audit logging for compliance
  • Prompt injection defense

Monitoring:

  • Real-time metrics dashboard (latency, cost, accuracy)
  • Query analysis for patterns and failure modes
  • Cost tracking and optimization alerts
  • Performance profiling for bottlenecks

Validation:

  • Production handles expected traffic
  • Security prevents unauthorized access
  • Monitoring provides actionable insights
  • Incident response procedures tested

Phase 8: Continuous Improvement

Goal: Establish processes for ongoing enhancement

Data Pipeline:

  • Automated knowledge base updates (real-time or scheduled)
  • Quality monitoring: detect data drift and degradation
  • Source diversification: add new data sources
  • Feedback integration: user corrections and preferences

Model Evolution:

  • Evaluate and migrate to improved embeddings
  • Fine-tune on domain data regularly
  • Upgrade architecture: Naive → Advanced → Modular RAG
  • Expand multi-modal support (images, audio, video)

Optimization:

  • Analyze query patterns, optimize for common needs
  • Improve cache hit rates
  • Tune vector indices regularly
  • Balance performance vs. costs

Validation:

  • Automated improvement pipelines functioning
  • Performance trends show improvement
  • User satisfaction increasing
  • System adapts to changing needs

Key RAG Principles

1. Relevance Over Volume

  • Quality curation > massive datasets
  • Remove outdated/low-quality content continuously
  • Prioritize most relevant info to prevent "lost in the middle"

2. Semantic Understanding

  • Use embeddings for true semantic matching, not just keywords
  • Recognize query intent (factual, analytical, creative)
  • Adapt retrieval strategy based on context

3. Multi-Modal Intelligence

  • Handle text, images, code, tables, structured data
  • Enable cross-modal retrieval (text query → image results)
  • Preserve document structure and formatting

4. Temporal Awareness

  • Prioritize recent info for time-sensitive topics
  • Maintain historical access when relevant
  • Integrate real-time data feeds for dynamic domains

5. Transparency & Trust

  • Always provide source citations
  • Indicate confidence levels
  • Explain why specific information was selected

Standard RAG Response Format

{
  "answer": "Generated response incorporating retrieved information",
  "sources": [
    {
      "content": "Retrieved text chunk",
      "source": "Document/URL identifier",
      "relevance_score": 0.95,
      "chunk_id": "unique_identifier"
    }
  ],
  "confidence": 0.87,
  "retrieval_metadata": {
    "chunks_retrieved": 5,
    "retrieval_time_ms": 150,
    "generation_time_ms": 800
  }
}

Critical Success Rules

Non-Negotiable:

  1. ✅ Source attribution for every response
  2. ✅ Validate generated content against sources (prevent hallucination)
  3. ✅ Filter sensitive data before retrieval
  4. ✅ Respond within latency thresholds (<3 seconds)
  5. ✅ Monitor and optimize costs continuously
  6. ✅ Comply with security policies
  7. ✅ Graceful degradation on failures
  8. ✅ Comprehensive testing before production

Quality Gates:

  • Before Production: >85% accuracy on evaluation dataset
  • Ongoing: User satisfaction >4.0/5.0
  • Performance: 95th percentile <5 seconds
  • Reliability: 99.5% uptime
  • Cost: Within 10% of budget

Advanced Patterns

Modular RAG Architecture

  • Search Module: Query understanding and reformulation
  • Memory Module: Long-term conversation persistence
  • Routing Module: Query routing to specialized knowledge bases
  • Predict Module: Anticipatory pre-loading based on context

Hybrid RAG + Fine-tuning

  • RAG for dynamic, frequently changing knowledge
  • Fine-tuning for domain-specific reasoning patterns
  • Combine strengths for maximum effectiveness

Related Resources

Related Skills:

  • multi-agent-architect - For complex RAG orchestration
  • knowledge-graph-builder - For structured knowledge integration
  • performance-optimizer - For RAG system optimization

Related Patterns:

  • META/DECISION-FRAMEWORK.md - Vector DB and embedding selection
  • STANDARDS/architecture-patterns/rag-pattern.md - RAG architecture details (when created)

Related Playbooks:

  • PLAYBOOKS/deploy-rag-system.md - RAG deployment procedure (when created)