---
name: contextual-chunking
description: Contextual Retrieval implementation for RAG - chunks clinical notes with LLM-generated context prepended to each chunk before embedding. Improves retrieval for citation extraction; Anthropic reports up to a 49% reduction in retrieval failure rate with this pattern.
---
# Contextual Chunking Skill

## Overview
This skill implements Anthropic's Contextual Retrieval pattern for RAG systems. It splits clinical notes into fixed-size segments (1000 tokens with a 200-token overlap) and uses Phi-4 to generate a 50-100-token contextual summary for each chunk. The context is prepended to the chunk before embedding, significantly improving retrieval accuracy for citation extraction.
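A minimal sketch of the fixed-size windowing itself, assuming `tiktoken`'s `cl100k_base` encoding as a stand-in for Phi-4 tokenization (the helper name is illustrative, not part of the skill's API); context generation is covered under Usage below.

```python
import tiktoken

def split_into_token_windows(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping token windows (illustrative helper only)."""
    enc = tiktoken.get_encoding("cl100k_base")  # assumption: approximation of Phi-4's tokenizer
    tokens = enc.encode(text)
    stride = chunk_size - overlap  # 800 tokens with the defaults
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(enc.decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return windows
```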
## When to Use
Use this skill when:
- Preparing clinical notes for RAG-based summarization
- Creating embeddings for ChromaDB storage
- You need to improve citation accuracy and reduce hallucinations
- Processing multi-page clinical notes for semantic search
## Research Background

Anthropic's Contextual Retrieval research: prepending chunk-specific context to each chunk before embedding reduces retrieval failure rates by up to 49% compared with standard RAG. The context helps the embedding model understand each chunk's role within the larger document.
## Installation

IMPORTANT: This skill has its own isolated virtual environment (`.venv`) managed by `uv`. Do NOT use system Python.

Initialize the skill's environment:

```bash
# From the skill directory
cd .agent/skills/contextual-chunking
uv sync  # Creates .venv and installs dependencies from pyproject.toml
```
Dependencies are in `pyproject.toml`:

- `tiktoken` - Token counting for Phi-4
## Usage

CRITICAL: Always use `uv run` to execute code with this skill's `.venv`, NOT system Python.

### Basic Chunking with Context
```python
# From the .agent/skills/contextual-chunking/ directory
# Run with: uv run python your_script.py
import sys
from pathlib import Path

from contextual_chunking import ContextualChunker

# ollama-client is a sibling skill, so add it to the import path
sys.path.insert(0, str(Path(__file__).parent.parent / "ollama-client"))
from ollama_client import OllamaClient

# Initialize
chunker = ContextualChunker(
    ollama_client=OllamaClient(),
    chunk_size=1000,     # Tokens per chunk
    chunk_overlap=200,   # Overlap between chunks (20%)
    context_size=75      # Context tokens (50-100 range)
)

# Chunk a clinical note
clinical_note = "Patient presents with chest pain radiating to left arm..."
enriched_chunks = chunker.chunk_with_context(
    document_text=clinical_note,
    doc_id="note_123"
)

# Each enriched chunk contains:
for chunk in enriched_chunks:
    print(f"Chunk ID: {chunk['id']}")
    print(f"Original text: {chunk['original_text'][:100]}...")
    print(f"Context: {chunk['context']}")
    print(f"Enriched (context + text): {chunk['enriched_text'][:150]}...")
    print(f"Offsets: {chunk['start_offset']}-{chunk['end_offset']}")
    print("---")
```
### Integration with ChromaDB

```python
from src.skills.chroma_client.chroma_client import ChromaClient

# 1. Chunk with context
enriched_chunks = chunker.chunk_with_context(clinical_note, "note_123")

# 2. Store enriched chunks in ChromaDB
chroma_client = ChromaClient()
chroma_client.add_chunks(
    collection_name="clinical_note_session_456",
    chunks=[chunk['enriched_text'] for chunk in enriched_chunks],
    metadatas=[{
        'chunk_id': chunk['id'],
        'start_offset': chunk['start_offset'],
        'end_offset': chunk['end_offset'],
        'original_text': chunk['original_text']
    } for chunk in enriched_chunks],
    ids=[chunk['id'] for chunk in enriched_chunks]
)
```
## Context Generation Prompt

The LLM generates context with this prompt template:

```
Given the whole document context, provide succinct context (50-100 tokens) to situate this chunk for search retrieval purposes.

Document title/type: Clinical Note
Document context: [First 2000 chars of full document]

Chunk to contextualize:
{chunk_text}

Provide ONLY the context (no explanations):
```
Example output:

```
This section describes the patient's presenting symptoms during initial triage, specifically cardiovascular complaints requiring urgent evaluation.
```
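A sketch of how the template might be assembled and sent to the model. The `generate` callable stands in for whatever method `OllamaClient` actually exposes (its interface isn't documented here), so treat the wiring as an assumption:

```python
def generate_chunk_context(generate, document_text: str, chunk_text: str) -> str:
    """Build the prompt above and return the model's 50-100 token context.

    `generate` is any callable taking a prompt string and returning the
    completion text (a stand-in for the real OllamaClient call).
    """
    prompt = (
        "Given the whole document context, provide succinct context (50-100 tokens) "
        "to situate this chunk for search retrieval purposes.\n\n"
        "Document title/type: Clinical Note\n"
        f"Document context: {document_text[:2000]}\n\n"
        "Chunk to contextualize:\n"
        f"{chunk_text}\n\n"
        "Provide ONLY the context (no explanations):"
    )
    return generate(prompt).strip()
```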
## Chunk Structure

Each enriched chunk dictionary contains:

```python
{
    'id': 'note_123_chunk_0',
    'original_text': 'Patient presents with chest pain...',
    'context': 'This section describes presenting symptoms...',
    'enriched_text': 'This section describes presenting symptoms... Patient presents with chest pain...',
    'start_offset': 0,
    'end_offset': 1200,
    'token_count': 1000
}
```
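For type-checked downstream code, the same shape can be expressed as a `TypedDict` (a sketch; the skill itself may simply return plain dicts):

```python
from typing import TypedDict

class EnrichedChunk(TypedDict):
    """Shape of one enriched chunk as documented above (illustrative only)."""
    id: str              # e.g. "note_123_chunk_0"
    original_text: str   # raw chunk text, used for citation offsets
    context: str         # LLM-generated 50-100 token summary
    enriched_text: str   # context + original_text; this is what gets embedded
    start_offset: int    # character offset into the source document
    end_offset: int
    token_count: int
```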
## Configuration

Parameters:

- `chunk_size`: Tokens per chunk (default: 1000)
  - Too small: context fragmentation, poor retrieval
  - Too large: embedding quality degrades, slower search
- `chunk_overlap`: Token overlap (default: 200, ~20%); see the stride sketch after this list
  - Prevents information loss at boundaries
  - Critical for accurate citation offsets
- `context_size`: Context tokens (default: 75, range: 50-100)
  - Balances informativeness vs. token cost
  - Generated by the LLM for each chunk
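A minimal sketch of how `chunk_size` and `chunk_overlap` translate into a stride and an expected chunk count (the helper is illustrative, not part of the skill's API):

```python
import math

def expected_chunk_count(total_tokens: int, chunk_size: int = 1000, overlap: int = 200) -> int:
    """Number of overlapping windows needed to cover total_tokens."""
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap  # 800 tokens with the defaults
    return 1 + math.ceil((total_tokens - chunk_size) / stride)

# A 5000-token note with the defaults: 1 + ceil(4000 / 800) = 6 chunks
print(expected_chunk_count(5000))  # -> 6
```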
## Best Practices

- **Token Counting**: Use `tiktoken` for accurate Phi-4 token counts
- **Context Quality**: Verify the LLM generates succinct, relevant context
- **Offset Tracking**: Maintain character offsets for citation extraction
- **Batch Processing**: Generate contexts in batches for efficiency
- **Cache Contexts**: Store enriched chunks to avoid regeneration (see the caching sketch after this list)
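One way to implement the caching advice above, keyed on a hash of the document so contexts are regenerated only when the note changes. The cache directory and helper name are assumptions, not part of the skill:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".chunk_cache")  # illustrative location

def cached_chunk_with_context(chunker, document_text: str, doc_id: str):
    """Reuse previously generated enriched chunks for identical documents."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(document_text.encode("utf-8")).hexdigest()
    cache_file = CACHE_DIR / f"{doc_id}_{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    chunks = chunker.chunk_with_context(document_text=document_text, doc_id=doc_id)
    cache_file.write_text(json.dumps(chunks))
    return chunks
```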
## Performance Considerations

Chunking a 10-page note (~5000 tokens):

- Chunks: ~6 (1000 tokens each, 800-token stride after the 200-token overlap)
- Context generation: one LLM call per chunk (~5-10 seconds total)
- Total time: 10-15 seconds (acceptable for offline processing)
Trade-offs:
- Pro: Substantially better retrieval (up to 49% lower retrieval failure rate in Anthropic's benchmarks)
- Pro: Fewer hallucinations, better citations
- Con: Additional LLM inference time
- Con: Slightly higher token usage
## Error Handling
- If LLM context generation fails, fall back to empty context (still functional)
- If a chunk exceeds the token limit, split it further
- Preserve original text and offsets even if context fails
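A sketch of that fallback, assuming context generation is wrapped per chunk; `generate_context` stands in for whatever call the skill actually makes:

```python
def safe_context(generate_context, document_text: str, chunk_text: str) -> str:
    """Return the LLM-generated context, or an empty string if generation fails."""
    try:
        return generate_context(document_text, chunk_text)
    except Exception:
        # The chunk's original text and offsets are preserved upstream;
        # only the contextual summary is lost, so retrieval still works.
        return ""
```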
## Integration with RAG Pipeline

Workflow:

- **Chunk**: Use this skill to create enriched chunks
- **Embed**: Store in ChromaDB (automatic embedding)
- **Retrieve**: Query ChromaDB for relevant chunks (see the sketch after this list)
- **Extract**: Use the `citation-extraction` skill to validate citations
- **Cleanup**: Clear the ChromaDB collection after the session
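A sketch of the retrieve and cleanup steps using the `chromadb` library directly; the project's `ChromaClient` wrapper may expose these operations differently, and the collection name and query text are illustrative:

```python
import chromadb

client = chromadb.Client()  # illustrative in-memory client
collection = client.get_or_create_collection("clinical_note_session_456")

# Retrieve: returns the enriched chunks plus the metadata stored earlier
results = collection.query(
    query_texts=["cardiac symptoms at presentation"],
    n_results=5,
)
for metadata in results["metadatas"][0]:
    # Cite against original_text and its offsets, not the enriched text
    print(metadata["start_offset"], metadata["end_offset"])

# Cleanup: clear the session's collection
client.delete_collection("clinical_note_session_456")
```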
## Implementation

See `contextual_chunking.py` for the full Python implementation.