SKILL.md

name: contextual-chunking
description: Contextual Retrieval implementation for RAG - chunks clinical notes with LLM-generated context prepended to each chunk before embedding. Reduces retrieval failures by up to 49% per Anthropic research, improving citation accuracy.

Contextual Chunking Skill

Overview

This skill implements Anthropic's Contextual Retrieval pattern for RAG systems. It chunks clinical notes into fixed-size segments (1000 tokens, 200 token overlap) and generates 50-100 token contextual summaries for each chunk using Phi-4. The context is prepended to chunks before embedding, significantly improving retrieval accuracy for citation extraction.
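The core windowing step can be sketched in a few lines (a simplification of the skill's actual implementation in contextual_chunking.py; cl100k_base is a stand-in encoding, since tiktoken has no Phi-4 tokenizer):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # stand-in: tiktoken ships no Phi-4 encoding

def window_tokens(text: str, chunk_size: int = 1000, overlap: int = 200):
    """Yield overlapping fixed-size token windows, decoded back to text."""
    tokens = enc.encode(text)
    stride = chunk_size - overlap  # 800 tokens between consecutive chunk starts
    for start in range(0, max(len(tokens) - overlap, 1), stride):
        yield enc.decode(tokens[start:start + chunk_size])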

When to Use

Use this skill when:

  • Preparing clinical notes for RAG-based summarization
  • Creating embeddings for ChromaDB storage
  • Need to improve citation accuracy and reduce hallucinations
  • Processing multi-page clinical notes for semantic search

Research Background

Anthropic's Contextual Retrieval research: prepending chunk-specific context to each chunk before embedding reduces the top-20-chunk retrieval failure rate by up to 49% over standard RAG (when contextual embeddings are combined with contextual BM25). The context helps the embedding model understand each chunk's role within the larger document.
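To make the pattern concrete, a minimal illustration (the strings are invented clinical examples, not skill output):

chunk = "BP 148/92, HR 104, troponin pending; patient reports pain onset 2 hours ago."
context = ("This chunk is from the triage section of a clinical note for a patient "
           "presenting with acute chest pain.")
enriched = f"{context}\n\n{chunk}"  # the enriched text, not the bare chunk, is embedded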

Installation

IMPORTANT: This skill has its own isolated virtual environment (.venv) managed by uv. Do NOT use system Python.

Initialize the skill's environment:

# From the skill directory
cd .agent/skills/contextual-chunking
uv sync  # Creates .venv and installs dependencies from pyproject.toml

Dependencies are in pyproject.toml:

  • tiktoken - token counting (tiktoken ships OpenAI encodings such as cl100k_base, so Phi-4 counts are approximate)
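A quick way to confirm the environment resolved correctly (the exact command is illustrative):

# From the skill directory
uv run python -c "import tiktoken; print(tiktoken.get_encoding('cl100k_base').name)"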

Usage

CRITICAL: Always use uv run to execute code with this skill's .venv, NOT system Python.

Basic Chunking with Context

# Save this as a script inside .agent/skills/contextual-chunking/ and run it with:
#   uv run python your_script.py
# (__file__ below is undefined under `python -c`, so run from a file, not inline)
from contextual_chunking import ContextualChunker

# You'll need to import ollama-client separately
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent / "ollama-client"))
from ollama_client import OllamaClient

# Initialize
chunker = ContextualChunker(
    ollama_client=OllamaClient(),
    chunk_size=1000,        # Tokens per chunk
    chunk_overlap=200,      # Overlap between chunks (20%)
    context_size=75         # Context tokens (50-100 range)
)

# Chunk clinical note
clinical_note = "Patient presents with chest pain radiating to left arm..."

enriched_chunks = chunker.chunk_with_context(
    document_text=clinical_note,
    doc_id="note_123"
)

# Each enriched chunk contains:
for chunk in enriched_chunks:
    print(f"Chunk ID: {chunk['id']}")
    print(f"Original text: {chunk['original_text'][:100]}...")
    print(f"Context: {chunk['context']}")
    print(f"Enriched (context + text): {chunk['enriched_text'][:150]}...")
    print(f"Offsets: {chunk['start_offset']}-{chunk['end_offset']}")
    print("---")

Integration with ChromaDB

from src.skills.chroma_client.chroma_client import ChromaClient

# 1. Chunk with context
enriched_chunks = chunker.chunk_with_context(clinical_note, "note_123")

# 2. Store enriched chunks in ChromaDB
chroma_client = ChromaClient()
chroma_client.add_chunks(
    collection_name="clinical_note_session_456",
    chunks=[chunk['enriched_text'] for chunk in enriched_chunks],
    metadatas=[{
        'chunk_id': chunk['id'],
        'start_offset': chunk['start_offset'],
        'end_offset': chunk['end_offset'],
        'original_text': chunk['original_text']
    } for chunk in enriched_chunks],
    ids=[chunk['id'] for chunk in enriched_chunks]
)
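ChromaClient is this project's wrapper. For the retrieval side, a sketch using the chromadb library directly (collection name and query text are illustrative):

import chromadb

client = chromadb.Client()  # or chromadb.PersistentClient(path=...) for on-disk storage
collection = client.get_or_create_collection("clinical_note_session_456")

# query_texts are embedded automatically with the collection's embedding function
results = collection.query(query_texts=["chest pain radiating to left arm"], n_results=5)

for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    # The metadata carries the offsets needed later for citation extraction
    print(meta["chunk_id"], meta["start_offset"], meta["end_offset"])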

Context Generation Prompt

The LLM generates context with this prompt template:

Given the whole document context, provide succinct context (50-100 tokens) to situate this chunk for search retrieval purposes.

Document title/type: Clinical Note
Document context: [First 2000 chars of full document]

Chunk to contextualize:
{chunk_text}

Provide ONLY the context (no explanations):

Example Output:

Context: This section describes the patient's presenting symptoms during initial triage, specifically cardiovascular complaints requiring urgent evaluation.
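A sketch of how the template might be filled and sent to the model. Using the ollama Python package directly is an assumption for illustration (the skill routes through its OllamaClient wrapper); the model name and 2000-char truncation follow the template above:

import ollama

CONTEXT_PROMPT = """Given the whole document context, provide succinct context (50-100 tokens) to situate this chunk for search retrieval purposes.

Document title/type: Clinical Note
Document context: {doc_head}

Chunk to contextualize:
{chunk_text}

Provide ONLY the context (no explanations):"""

def generate_context(document_text: str, chunk_text: str) -> str:
    # Only the first 2000 chars of the document are sent, per the template
    prompt = CONTEXT_PROMPT.format(doc_head=document_text[:2000], chunk_text=chunk_text)
    response = ollama.generate(model="phi4", prompt=prompt)
    return response["response"].strip()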

Chunk Structure

Each enriched chunk dictionary contains:

{
    'id': 'note_123_chunk_0',
    'original_text': 'Patient presents with chest pain...',
    'context': 'This section describes presenting symptoms...',
    'enriched_text': 'This section describes presenting symptoms... Patient presents with chest pain...',
    'start_offset': 0,
    'end_offset': 1200,
    'token_count': 1000
}
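Since citation extraction depends on these offsets, one invariant is worth asserting after chunking (a sketch; enriched_chunks and clinical_note as produced in the usage example above):

for chunk in enriched_chunks:
    # The stored offsets must map back to the exact original span of the source note
    span = clinical_note[chunk["start_offset"]:chunk["end_offset"]]
    assert span == chunk["original_text"], f"offset drift in {chunk['id']}"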

Configuration

Parameters:

  • chunk_size: Tokens per chunk (default: 1000)

    • Too small: Context fragmentation, poor retrieval
    • Too large: Embedding quality degrades, slower search
  • chunk_overlap: Token overlap (default: 200, ~20%)

    • Prevents information loss at boundaries
    • Critical for accurate citation offsets
  • context_size: Context tokens (default: 75, range: 50-100)

    • Balances informativeness vs token cost
    • Generated by LLM for each chunk

Best Practices

  1. Token Counting: Use tiktoken for consistent token counts (its encodings approximate Phi-4's tokenizer rather than matching it exactly)
  2. Context Quality: Verify LLM generates succinct, relevant context
  3. Offset Tracking: Maintain character offsets for citation extraction
  4. Batch Processing: Generate contexts in batches for efficiency
  5. Cache Contexts: Store enriched chunks to avoid regeneration (a minimal caching sketch follows this list)
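A minimal in-memory sketch for practice 5, reusing the hypothetical generate_context helper from the prompt section above (a real implementation might persist the cache to disk):

import hashlib

_context_cache: dict[str, str] = {}

def cached_context(document_text: str, chunk_text: str) -> str:
    # Key on the chunk text itself so identical chunks never trigger a second LLM call
    key = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
    if key not in _context_cache:
        _context_cache[key] = generate_context(document_text, chunk_text)
    return _context_cache[key]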

Performance Considerations

Chunking a 10-page note (5000 tokens):

  • Chunks: ~6 (1000 tokens each with 200-token overlap, i.e. an 800-token stride)
  • Context generation: ~6 LLM calls (~6-12 seconds total)
  • Total time: ~10-15 seconds (acceptable for offline processing)
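The chunk count follows directly from the stride (a worked check of the numbers above):

import math

chunk_size, overlap, total_tokens = 1000, 200, 5000
stride = chunk_size - overlap                     # 800 tokens between chunk starts
n_chunks = 1 + math.ceil(max(total_tokens - chunk_size, 0) / stride)
print(n_chunks)                                   # 6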

Trade-offs:

  • Pro: up to 49% fewer retrieval failures
  • Pro: Fewer hallucinations, better citations
  • Con: Additional LLM inference time
  • Con: Slightly higher token usage

Error Handling

  • If LLM context generation fails, fall back to an empty context (still functional; sketched below)
  • If chunk exceeds token limit, split further
  • Preserve original text and offsets even if context fails
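A sketch of that fallback, again reusing the hypothetical generate_context helper (the broad except is deliberate in a sketch; a real implementation might catch the wrapper's specific errors):

def safe_context(document_text: str, chunk_text: str) -> str:
    try:
        return generate_context(document_text, chunk_text)
    except Exception:
        # Empty context: the chunk still embeds, and its original
        # text and offsets are preserved for citation extraction
        return ""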

Integration with RAG Pipeline

Workflow:

  1. Chunk: Use this skill to create enriched chunks
  2. Embed: Store in ChromaDB (automatic embedding)
  3. Retrieve: Query ChromaDB for relevant chunks
  4. Extract: Use citation-extraction skill to validate citations
  5. Cleanup: Clear the ChromaDB collection after the session (one-liner below)
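Step 5 is not shown elsewhere in this file; reusing the chromadb client from the retrieval sketch above, it is a single call (collection name as in the storage example):

# Remove the per-session collection once the summary is delivered
client.delete_collection("clinical_note_session_456")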

Implementation

See contextual_chunking.py for the full Python implementation.

References