---
name: chunking-strategies
description: Document chunking strategies for RAG systems. Use when implementing document processing pipelines to determine optimal chunking approaches based on document type and retrieval requirements.
---
# Chunking Strategies Skill

This skill provides chunking strategies for RAG document processing.
## Chunking Methods

### 1. Fixed-Size Chunking
```python
def fixed_size_chunk(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap  # step back so consecutive chunks share `overlap` chars
    return chunks
```
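The loop advances by `chunk_size - overlap` each iteration, so a text of length `n` yields roughly `ceil((n - overlap) / (chunk_size - overlap))` chunks, with consecutive chunks sharing `overlap` characters. A standalone check of that arithmetic (the function is repeated inline so the snippet runs on its own):

```python
def fixed_size_chunk(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # stride of 450 with the defaults
    return chunks

chunks = fixed_size_chunk("x" * 1000)
print(len(chunks))  # ceil((1000 - 50) / 450) = 3
```

Note that the tail of each chunk equals the head of the next, which is what lets retrieval survive sentences that straddle a boundary.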
### 2. Semantic Chunking

Split on natural boundaries (sentences, paragraphs).
```python
def count_tokens(text: str) -> int:
    # Whitespace approximation; use the embedding model's tokenizer for accuracy
    return len(text.split())

def semantic_chunk(text: str, max_tokens: int = 500) -> list[str]:
    paragraphs = text.split("\n\n")
    chunks = []
    current_chunk = []
    current_tokens = 0
    for para in paragraphs:
        para_tokens = count_tokens(para)
        # Start a new chunk when adding this paragraph would exceed the budget
        # (the `current_chunk` guard avoids emitting an empty first chunk)
        if current_chunk and current_tokens + para_tokens > max_tokens:
            chunks.append("\n\n".join(current_chunk))
            current_chunk = [para]
            current_tokens = para_tokens
        else:
            current_chunk.append(para)
            current_tokens += para_tokens
    if current_chunk:
        chunks.append("\n\n".join(current_chunk))
    return chunks
```
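The same greedy-packing idea applies at the sentence level. A minimal sketch using a regex sentence splitter (`sentence_chunk` is an illustrative helper, not a library function, and the regex is a rough heuristic that real pipelines would replace with a proper sentence segmenter):

```python
import re

def sentence_chunk(text: str, max_chars: int = 500) -> list[str]:
    """Split on sentence-ending punctuation, then greedily pack
    sentences into chunks no longer than max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks

print(sentence_chunk("One. Two. Three. Four.", max_chars=10))
# → ['One. Two.', 'Three.', 'Four.']
```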
### 3. Recursive Chunking

Hierarchical splitting: try coarse separators first, then progressively finer ones for any parts that are still too large.
```python
SEPARATORS = ["\n\n", "\n", ". ", " "]

def recursive_chunk(text: str, max_size: int, separators: list[str]) -> list[str]:
    if len(text) <= max_size:
        return [text]
    if not separators:
        # No separators left: hard-slice, keeping the remainder rather than dropping it
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for part in text.split(sep):
        if len(part) <= max_size:
            chunks.append(part)
        else:
            # Oversized part: retry with the next, finer-grained separator
            chunks.extend(recursive_chunk(part, max_size, rest))
    return chunks
```
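Splitting alone tends to leave many undersized fragments, so production splitters (e.g. LangChain's `RecursiveCharacterTextSplitter`) follow the split with a merge pass that packs adjacent parts back up to the size limit. A sketch of that merge step (`merge_parts` is an illustrative helper):

```python
def merge_parts(parts: list[str], max_size: int, sep: str = " ") -> list[str]:
    """Pack adjacent parts into chunks as large as possible without exceeding max_size."""
    chunks, current = [], ""
    for part in parts:
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) <= max_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks

print(merge_parts(["aa", "bb", "cc"], max_size=5))  # → ['aa bb', 'cc']
```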
## Chunking by Document Type
| Document Type | Recommended Strategy | Chunk Size |
|---|---|---|
| Technical docs | Semantic (headers) | 500-1000 tokens |
| Legal documents | Semantic (sections) | 1000-2000 tokens |
| Code | Function/class based | 200-500 tokens |
| Conversations | Message boundaries | 100-300 tokens |
| General text | Recursive | 300-500 tokens |
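For the "Code" row above, a function/class-based splitter for Python source can lean on the standard-library `ast` module (`chunk_python_source` is an illustrative helper):

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """One chunk per top-level function or class definition."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment recovers the node's exact original text (Python 3.8+)
            chunks.append(ast.get_source_segment(source, node))
    return chunks
```

Imports and other module-level statements are skipped here; a production splitter would typically attach them as shared context to every chunk.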
## Chunk Enrichment
```python
from dataclasses import dataclass

@dataclass
class EnrichedChunk:
    content: str
    metadata: dict      # e.g. source, position
    summary: str        # LLM-generated
    keywords: list[str]
    parent_id: str      # for hierarchical retrieval
```
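A sketch of populating that structure (the dataclass is repeated so the snippet is self-contained; `enrich` is a hypothetical helper, and the summary and keyword steps are crude stand-ins for the LLM calls a real pipeline would make):

```python
from dataclasses import dataclass

@dataclass
class EnrichedChunk:
    content: str
    metadata: dict
    summary: str
    keywords: list[str]
    parent_id: str

def enrich(chunks: list[str], source: str) -> list[EnrichedChunk]:
    enriched = []
    for i, chunk in enumerate(chunks):
        words = [w.strip(".,;:").lower() for w in chunk.split()]
        keywords = sorted({w for w in words if len(w) > 6})[:5]  # stand-in for LLM keywords
        enriched.append(EnrichedChunk(
            content=chunk,
            metadata={"source": source, "position": i},
            summary=chunk[:100],   # placeholder; an LLM summary call would go here
            keywords=keywords,
            parent_id=source,      # all chunks point back at the parent document
        ))
    return enriched
```

Storing `source` and `position` in the metadata is what lets retrieval results be cited and re-ordered later.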
## Best Practices
- Add overlap between chunks (10-20%)
- Preserve semantic boundaries
- Include metadata (source, position)
- Consider hierarchical chunking for long docs
- Test retrieval quality with different sizes