---
name: contextual-chunking
description: Contextual Retrieval implementation for RAG - chunks clinical notes with LLM-generated context prepended to each chunk before embedding. Improves retrieval for citation extraction; Anthropic reports up to a 49% reduction in retrieval failure rate with this pattern.
---
# Contextual Chunking Skill

## Overview
This skill implements Anthropic's Contextual Retrieval pattern for RAG systems. It splits clinical notes into fixed-size segments (1000 tokens with a 200-token overlap) and uses Phi-4 to generate a 50-100-token contextual summary for each chunk. The context is prepended to the chunk before embedding, significantly improving retrieval accuracy for citation extraction.
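A minimal sketch of the fixed-size windowing itself, assuming `tiktoken`'s `cl100k_base` encoding as a stand-in for Phi-4 tokenization (the helper name is illustrative, not part of the skill's API); context generation is covered under Usage below.

```python
import tiktoken

def split_into_token_windows(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping token windows (illustrative helper only)."""
    enc = tiktoken.get_encoding("cl100k_base")  # assumption: approximation of Phi-4's tokenizer
    tokens = enc.encode(text)
    stride = chunk_size - overlap  # 800 tokens with the defaults
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(enc.decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return windows
```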
## When to Use
Use this skill when:
- Preparing clinical notes for RAG-based summarization
- Creating embeddings for ChromaDB storage
- You need to improve citation accuracy and reduce hallucinations
- Processing multi-page clinical notes for semantic search
## Research Background

Anthropic's Contextual Retrieval research: prepending chunk-specific context to each chunk before embedding reduces retrieval failure rates by up to 49% compared with standard RAG. The context helps the embedding model understand each chunk's role within the larger document.
## Installation

IMPORTANT: This skill has its own isolated virtual environment (`.venv`) managed by `uv`. Do NOT use system Python.

Initialize the skill's environment:

```bash
# From the skill directory
cd .agent/skills/contextual-chunking
uv sync  # Creates .venv and installs dependencies from pyproject.toml
```
Dependencies are in `pyproject.toml`:

- `tiktoken` - Token counting for Phi-4
## Usage

CRITICAL: Always use `uv run` to execute code with this skill's `.venv`, NOT system Python.

### Basic Chunking with Context
```python
# From the .agent/skills/contextual-chunking/ directory
# Run with: uv run python your_script.py
import sys
from pathlib import Path

from contextual_chunking import ContextualChunker

# ollama-client is a sibling skill, so add it to the import path
sys.path.insert(0, str(Path(__file__).parent.parent / "ollama-client"))
from ollama_client import OllamaClient

# Initialize
chunker = ContextualChunker(
    ollama_client=OllamaClient(),
    chunk_size=1000,     # Tokens per chunk
    chunk_overlap=200,   # Overlap between chunks (20%)
    context_size=75      # Context tokens (50-100 range)
)

# Chunk a clinical note
clinical_note = "Patient presents with chest pain radiating to left arm..."
enriched_chunks = chunker.chunk_with_context(
    document_text=clinical_note,
    doc_id="note_123"
)

# Each enriched chunk contains:
for chunk in enriched_chunks:
    print(f"Chunk ID: {chunk['id']}")
    print(f"Original text: {chunk['original_text'][:100]}...")
    print(f"Context: {chunk['context']}")
    print(f"Enriched (context + text): {chunk['enriched_text'][:150]}...")
    print(f"Offsets: {chunk['start_offset']}-{chunk['end_offset']}")
    print("---")
```
### Integration with ChromaDB

```python
from src.skills.chroma_client.chroma_client import ChromaClient

# 1. Chunk with context
enriched_chunks = chunker.chunk_with_context(clinical_note, "note_123")

# 2. Store enriched chunks in ChromaDB
chroma_client = ChromaClient()
chroma_client.add_chunks(
    collection_name="clinical_note_session_456",
    chunks=[chunk['enriched_text'] for chunk in enriched_chunks],
    metadatas=[{
        'chunk_id': chunk['id'],
        'start_offset': chunk['start_offset'],
        'end_offset': chunk['end_offset'],
        'original_text': chunk['original_text']
    } for chunk in enriched_chunks],
    ids=[chunk['id'] for chunk in enriched_chunks]
)
```
## Context Generation Prompt

The LLM generates context with this prompt template:

```
Given the whole document context, provide succinct context (50-100 tokens) to situate this chunk for search retrieval purposes.

Document title/type: Clinical Note
Document context: [First 2000 chars of full document]

Chunk to contextualize:
{chunk_text}

Provide ONLY the context (no explanations):
```
Example output:

```
This section describes the patient's presenting symptoms during initial triage, specifically cardiovascular complaints requiring urgent evaluation.
```
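A sketch of how the template might be assembled and sent to the model. The `generate` callable stands in for whatever method `OllamaClient` actually exposes (its interface isn't documented here), so treat the wiring as an assumption:

```python
def generate_chunk_context(generate, document_text: str, chunk_text: str) -> str:
    """Build the prompt above and return the model's 50-100 token context.

    `generate` is any callable taking a prompt string and returning the
    completion text (a stand-in for the real OllamaClient call).
    """
    prompt = (
        "Given the whole document context, provide succinct context (50-100 tokens) "
        "to situate this chunk for search retrieval purposes.\n\n"
        "Document title/type: Clinical Note\n"
        f"Document context: {document_text[:2000]}\n\n"
        "Chunk to contextualize:\n"
        f"{chunk_text}\n\n"
        "Provide ONLY the context (no explanations):"
    )
    return generate(prompt).strip()
```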
## Chunk Structure

Each enriched chunk dictionary contains:

```python
{
    'id': 'note_123_chunk_0',
    'original_text': 'Patient presents with chest pain...',
    'context': 'This section describes presenting symptoms...',
    'enriched_text': 'This section describes presenting symptoms... Patient presents with chest pain...',
    'start_offset': 0,
    'end_offset': 1200,
    'token_count': 1000
}
```
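For type-checked downstream code, the same shape can be expressed as a `TypedDict` (a sketch; the skill itself may simply return plain dicts):

```python
from typing import TypedDict

class EnrichedChunk(TypedDict):
    """Shape of one enriched chunk as documented above (illustrative only)."""
    id: str              # e.g. "note_123_chunk_0"
    original_text: str   # raw chunk text, used for citation offsets
    context: str         # LLM-generated 50-100 token summary
    enriched_text: str   # context + original_text; this is what gets embedded
    start_offset: int    # character offset into the source document
    end_offset: int
    token_count: int
```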
## Configuration

Parameters:

- `chunk_size`: Tokens per chunk (default: 1000)
  - Too small: context fragmentation, poor retrieval
  - Too large: embedding quality degrades, slower search
- `chunk_overlap`: Token overlap (default: 200, ~20%); see the stride sketch after this list
  - Prevents information loss at boundaries
  - Critical for accurate citation offsets
- `context_size`: Context tokens (default: 75, range: 50-100)
  - Balances informativeness vs. token cost
  - Generated by the LLM for each chunk
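A minimal sketch of how `chunk_size` and `chunk_overlap` translate into a stride and an expected chunk count (the helper is illustrative, not part of the skill's API):

```python
import math

def expected_chunk_count(total_tokens: int, chunk_size: int = 1000, overlap: int = 200) -> int:
    """Number of overlapping windows needed to cover total_tokens."""
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap  # 800 tokens with the defaults
    return 1 + math.ceil((total_tokens - chunk_size) / stride)

# A 5000-token note with the defaults: 1 + ceil(4000 / 800) = 6 chunks
print(expected_chunk_count(5000))  # -> 6
```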
## Best Practices

- **Token Counting**: Use `tiktoken` for accurate Phi-4 token counts
- **Context Quality**: Verify the LLM generates succinct, relevant context
- **Offset Tracking**: Maintain character offsets for citation extraction
- **Batch Processing**: Generate contexts in batches for efficiency
- **Cache Contexts**: Store enriched chunks to avoid regeneration (see the caching sketch after this list)
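One way to implement the caching advice above, keyed on a hash of the document so contexts are regenerated only when the note changes. The cache directory and helper name are assumptions, not part of the skill:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".chunk_cache")  # illustrative location

def cached_chunk_with_context(chunker, document_text: str, doc_id: str):
    """Reuse previously generated enriched chunks for identical documents."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(document_text.encode("utf-8")).hexdigest()
    cache_file = CACHE_DIR / f"{doc_id}_{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    chunks = chunker.chunk_with_context(document_text=document_text, doc_id=doc_id)
    cache_file.write_text(json.dumps(chunks))
    return chunks
```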
## Performance Considerations

Chunking a 10-page note (~5000 tokens):

- Chunks: ~6 (1000 tokens each, 800-token stride after the 200-token overlap)
- Context generation: one LLM call per chunk (~5-10 seconds total)
- Total time: 10-15 seconds (acceptable for offline processing)
Trade-offs:
- Pro: Substantially better retrieval (up to 49% lower retrieval failure rate in Anthropic's benchmarks)
- Pro: Fewer hallucinations, better citations
- Con: Additional LLM inference time
- Con: Slightly higher token usage
## Error Handling
- If LLM context generation fails, fall back to empty context (still functional)
- If a chunk exceeds the token limit, split it further
- Preserve original text and offsets even if context fails
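A sketch of that fallback, assuming context generation is wrapped per chunk; `generate_context` stands in for whatever call the skill actually makes:

```python
def safe_context(generate_context, document_text: str, chunk_text: str) -> str:
    """Return the LLM-generated context, or an empty string if generation fails."""
    try:
        return generate_context(document_text, chunk_text)
    except Exception:
        # The chunk's original text and offsets are preserved upstream;
        # only the contextual summary is lost, so retrieval still works.
        return ""
```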
## Integration with RAG Pipeline

Workflow:

- **Chunk**: Use this skill to create enriched chunks
- **Embed**: Store in ChromaDB (automatic embedding)
- **Retrieve**: Query ChromaDB for relevant chunks (see the sketch after this list)
- **Extract**: Use the `citation-extraction` skill to validate citations
- **Cleanup**: Clear the ChromaDB collection after the session
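A sketch of the retrieve and cleanup steps using the `chromadb` library directly; the project's `ChromaClient` wrapper may expose these operations differently, and the collection name and query text are illustrative:

```python
import chromadb

client = chromadb.Client()  # illustrative in-memory client
collection = client.get_or_create_collection("clinical_note_session_456")

# Retrieve: returns the enriched chunks plus the metadata stored earlier
results = collection.query(
    query_texts=["cardiac symptoms at presentation"],
    n_results=5,
)
for metadata in results["metadatas"][0]:
    # Cite against original_text and its offsets, not the enriched text
    print(metadata["start_offset"], metadata["end_offset"])

# Cleanup: clear the session's collection
client.delete_collection("clinical_note_session_456")
```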
## Implementation

See `contextual_chunking.py` for the full Python implementation.