| name | Document Parser |
| description | Parse large documents that exceed context limits into structured sections with abstracts, metadata, and hierarchies. This skill should be used when encountering documents over the context limit (typically 25k+ tokens) or when a user explicitly requests document parsing. Apply layout-aware hierarchical chunking principles to preserve semantic structure. |
| when_to_use | When encountering documents over 25k tokens, when user mentions "parse document", "too large to read", "context limit exceeded", when analyzing research papers or technical documentation, when extracting structure from markdown files, when building RAG systems that need chunked content, when user requests "extract metadata" or "build document hierarchy" |
| version | 1.0.0 |
| languages | all |
Document Parser
Overview
This skill provides tools and workflows for parsing large documents that exceed context limits. It extracts hierarchical structure, generates section abstracts, and extracts metadata using layout-aware hierarchical chunking principles optimized for RAG systems.
Core principle: Preserve semantic structure while chunking documents into 400-900 token sections with rich metadata for retrieval and comprehension.
When to Use This Skill
Use this skill when:
- Document exceeds 25k+ tokens and can't fit in context
- User explicitly requests document parsing or structure extraction
- Building RAG systems that need semantically coherent chunks
- Analyzing research papers, technical docs, or long-form content
- Need to extract tables, code blocks, benchmarks, or key terms
- Want progressive reading (abstracts first, then deep-dives)
- Comparing multiple large documents
Don't use for:
- ❌ Documents under 10k tokens (read directly instead)
- ❌ Binary file formats (PDFs, Word docs) - convert to markdown first
- ❌ Simple text extraction (use grep/awk instead)
Core Capabilities
The document-parser skill provides four main capabilities:
Structure Analysis
- Extract markdown headers (H1-H6)
- Build hierarchical section tree
- Count tokens per section (target: 400-900)
- Generate section maps for navigation
Abstract Generation
- Create 100-200 token summaries for major sections
- Preserve key concepts and relationships
- Enable progressive reading workflows
Metadata Extraction
- Extract tables with structure preservation
- Capture code blocks with language tags
- Identify benchmarks (percentages, metrics)
- Extract key terms (techniques, models, acronyms)
Output Generation
- Machine-readable JSON (structure.json, metadata.json)
- Human-readable markdown (section_map.md)
- Full section content with metadata
Quick Reference
| Task | Command | Output |
|---|---|---|
| Parse structure | python3 scripts/parse_document_structure.py <file.md> |
structure.json, section_map.md |
| Extract metadata | python3 scripts/extract_metadata.py <file.md> |
metadata.json |
| Custom output path | --output <path> |
Specify output file |
| Section map | --map <path> |
Human-readable navigation |
Chunking Principles Reference
The skill implements RAG-optimized chunking principles:
The 400-900 Token Sweet Spot
- Too small (<400): Fragments semantic meaning, loses context
- Sweet spot (400-900): Complete thoughts, searchable, coherent
- Too large (>900): Dilutes relevance, adds noise
Layout-Aware Hierarchical Chunking
- Respect document structure (headers, sections)
- Never split mid-paragraph or mid-code-block
- Preserve parent-child relationships
- Include breadcrumb context (section path)
Dual-Storage Pattern
- Abstracts: Quick navigation, relevance filtering
- Full sections: Deep-dive when needed
- Metadata: Tables, benchmarks, key terms for targeted search
See references/chunking_principles.md for complete details.
Sandbox Configuration
IMPORTANT: This skill requires executing Python scripts. In read-only sandbox mode, you need to either:
Recommended: Configure sandbox allowlist in
~/.codex/config.toml:[sandbox] allowed_paths = ["~/.codex/skills/*/scripts"]Alternative: Use
dangerouslyDisableSandbox: truewhen calling Bash tool
See README.md in this skill directory for complete sandbox setup instructions.
Implementation Workflows
Workflow 1: Parse Single Large Document
Use case: User has a 47k token research paper
# Step 1: Parse document structure
cd ~/.codex/skills/document-parser
python3 scripts/parse_document_structure.py /path/to/document.md \
--output structure.json \
--map section_map.md
# Step 2: Review section map
cat section_map.md
# Shows hierarchical outline with token counts
# Step 3: Extract metadata
python3 scripts/extract_metadata.py /path/to/document.md \
--output metadata.json
# Step 4: Review extracted metadata
cat metadata.json | jq '.tables | length'
cat metadata.json | jq '.benchmarks | length'
cat metadata.json | jq '.key_terms | keys'
Expected output:
structure.json: Hierarchical section tree with token countssection_map.md: Human-readable outline for navigationmetadata.json: Tables, code blocks, benchmarks, key terms
Workflow 2: Comparative Analysis
Use case: Compare two research papers on similar topics
# Parse both documents
for doc in paper1.md paper2.md; do
python3 scripts/parse_document_structure.py "$doc" \
--output "${doc%.md}_structure.json"
python3 scripts/extract_metadata.py "$doc" \
--output "${doc%.md}_metadata.json"
done
# Compare structures
diff -u \
<(jq '.sections[] | .title' paper1_structure.json) \
<(jq '.sections[] | .title' paper2_structure.json)
# Compare key terms
diff -u \
<(jq '.key_terms.techniques[]' paper1_metadata.json | sort) \
<(jq '.key_terms.techniques[]' paper2_metadata.json | sort)
Workflow 3: Progressive Document Reading
Use case: Understand document before deep-dive
# Step 1: Get high-level structure
python3 scripts/parse_document_structure.py document.md --map outline.md
cat outline.md
# Review: What are the main sections?
# Step 2: Read abstracts (if available in structure.json)
jq '.sections[] | select(.abstract) | {title, abstract}' structure.json
# Step 3: Extract metadata for context
python3 scripts/extract_metadata.py document.md --output metadata.json
# Step 4: Review key terms to understand domain
jq '.key_terms' metadata.json
# Step 5: Deep-dive into specific sections
# Read full sections from original document based on structure
Script Documentation
parse_document_structure.py
Extracts markdown headers, builds hierarchical section tree, counts tokens.
Usage:
python3 scripts/parse_document_structure.py <file.md> [OPTIONS]
Options:
--output FILEPATH- Output JSON file (default: structure.json)--map FILEPATH- Output markdown section map (default: section_map.md)
Output structure.json format:
{
"sections": [
{
"id": "section-1",
"title": "Introduction",
"level": 1,
"token_count": 450,
"children": [
{
"id": "section-1.1",
"title": "Background",
"level": 2,
"token_count": 320,
"children": []
}
]
}
],
"total_sections": 56,
"total_tokens": 47000
}
Output section_map.md format:
# Document Structure
- Introduction (450 tokens)
- Background (320 tokens)
- Motivation (280 tokens)
- Methods (650 tokens)
- Data Collection (520 tokens)
- Analysis (580 tokens)
extract_metadata.py
Extracts tables, code blocks, benchmarks, and key terms.
Usage:
python3 scripts/extract_metadata.py <file.md> [OPTIONS]
Options:
--output FILEPATH- Output JSON file (default: metadata.json)
Output metadata.json format:
{
"tables": [
{
"id": "table-1",
"section": "Results",
"headers": ["Model", "Accuracy", "F1"],
"rows": [
["GPT-4", "95.2%", "0.94"],
["Claude", "94.8%", "0.93"]
]
}
],
"code_blocks": [
{
"id": "code-1",
"section": "Implementation",
"language": "python",
"content": "def parse_document(text):\n ..."
}
],
"benchmarks": [
{
"metric": "Accuracy",
"value": "95.2%",
"context": "GPT-4 on MMLU benchmark"
}
],
"key_terms": {
"techniques": ["RAG", "Fine-tuning", "Few-shot learning"],
"models": ["GPT-4", "Claude", "Llama-2"],
"acronyms": ["MMLU", "RAG", "NLP"]
}
}
Common Mistakes
❌ Sandbox permission errors when running scripts
Problem: Permission denied or scripts won't execute in read-only sandbox mode
Fix: Configure sandbox allowlist in ~/.codex/config.toml:
[sandbox]
allowed_paths = ["~/.codex/skills/*/scripts"]
Or use dangerouslyDisableSandbox: true flag when calling Bash tool (development only).
See README.md for complete setup instructions.
❌ Parsing non-markdown files
Problem: Scripts expect markdown format Fix: Convert PDFs/Word docs to markdown first using pandoc:
pandoc document.pdf -o document.md
❌ Ignoring token counts
Problem: Sections too large for embedding models Fix: Review section_map.md token counts, split sections >900 tokens manually
❌ Missing Python dependencies
Problem: Scripts require specific libraries Fix: Install dependencies:
pip install tiktoken markdown beautifulsoup4
❌ Not preserving structure
Problem: Flat extraction loses context Fix: Always use hierarchical parsing, maintain parent-child relationships
❌ Skipping metadata extraction
Problem: Lose valuable structured data Fix: Always run both scripts for complete analysis
Examples
Example 1: Research Paper (47k tokens)
Input: 47k token research paper on RAG systems
Commands:
python3 scripts/parse_document_structure.py rag_paper.md
python3 scripts/extract_metadata.py rag_paper.md
Results:
- 56 sections extracted
- 54 tables identified
- 145 benchmarks found
- 71 techniques cataloged
- Section map showing 3-level hierarchy
- Average section size: 839 tokens (within target range)
Example 2: Technical Documentation
Input: API documentation with code examples
Commands:
python3 scripts/parse_document_structure.py api_docs.md --map api_outline.md
python3 scripts/extract_metadata.py api_docs.md
Use results to:
- Navigate API structure via outline
- Extract all code examples for testing
- Catalog all endpoints from tables
- Build searchable knowledge base
Example 3: Multi-Document Comparison
Input: 3 papers on LLM evaluation
Workflow:
# Parse all documents
for doc in paper*.md; do
python3 scripts/parse_document_structure.py "$doc"
python3 scripts/extract_metadata.py "$doc"
done
# Compare methodologies
jq -r '.sections[] | select(.title | contains("Method")) | .title' *_structure.json
# Compare benchmarks
jq -r '.benchmarks[] | select(.metric == "Accuracy") | "\(.value) - \(.context)"' *_metadata.json
Testing Your Parsing
After parsing a document, verify quality:
Structure Checklist:
- All major sections captured
- Hierarchy preserved (H1 > H2 > H3)
- Token counts reasonable (400-900 target)
- Section map is human-readable
- JSON is valid (
jq . structure.json)
Metadata Checklist:
- Tables extracted with structure
- Code blocks include language tags
- Benchmarks capture value + context
- Key terms are domain-relevant
- JSON is valid (
jq . metadata.json)
Advanced Usage
Custom Section Splitting
If sections are too large (>900 tokens), split manually:
# In parse_document_structure.py, add target_size parameter
python3 scripts/parse_document_structure.py document.md \
--target-size 600 \
--max-size 900
Filtering by Section Level
Extract only top-level sections:
jq '.sections[] | select(.level == 1)' structure.json
Building RAG Index
Use parsed output for RAG system:
import json
# Load structure
with open('structure.json') as f:
structure = json.load(f)
# Load metadata
with open('metadata.json') as f:
metadata = json.load(f)
# Build embeddings for each section
for section in structure['sections']:
if 400 <= section['token_count'] <= 900:
# Optimal chunk size
embed_and_index(section)
Integration with Other Skills
This skill complements:
- skill-builder: Create new parsing strategies as skills
- time-awareness: Track document parsing timestamps
Proven Success
Tested successfully on:
- ✅ 47K token research document
- ✅ 56 sections extracted
- ✅ 54 tables preserved
- ✅ 145 benchmarks identified
- ✅ 71 techniques cataloged
- ✅ Hierarchical section maps generated
- ✅ Metadata JSON validated
References
references/chunking_principles.md- Complete RAG chunking methodology- Scripts in
scripts/directory - See skill-builder for creating document-specific parsing skills
Remember: Large documents are structured data. Parse the structure first, then read strategically.