name	Document Parser
description	Parse large documents that exceed context limits into structured sections with abstracts, metadata, and hierarchies. This skill should be used when encountering documents over the context limit (typically 25k+ tokens) or when a user explicitly requests document parsing. Apply layout-aware hierarchical chunking principles to preserve semantic structure.
when_to_use	When encountering documents over 25k tokens, when user mentions "parse document", "too large to read", "context limit exceeded", when analyzing research papers or technical documentation, when extracting structure from markdown files, when building RAG systems that need chunked content, when user requests "extract metadata" or "build document hierarchy"
version	1.0.0
languages	all

Document Parser

Overview

This skill provides tools and workflows for parsing large documents that exceed context limits. It extracts hierarchical structure, generates section abstracts, and extracts metadata using layout-aware hierarchical chunking principles optimized for RAG systems.

Core principle: Preserve semantic structure while chunking documents into 400-900 token sections with rich metadata for retrieval and comprehension.

When to Use This Skill

Use this skill when:

Document exceeds 25k tokens and can't fit in context
User explicitly requests document parsing or structure extraction
Building RAG systems that need semantically coherent chunks
Analyzing research papers, technical docs, or long-form content
Need to extract tables, code blocks, benchmarks, or key terms
Want progressive reading (abstracts first, then deep-dives)
Comparing multiple large documents

Don't use for:

❌ Documents under 10k tokens (read directly instead)
❌ Binary file formats (PDFs, Word docs) - convert to markdown first
❌ Simple text extraction (use grep/awk instead)

Core Capabilities

The document-parser skill provides four main capabilities:

Structure Analysis
- Extract markdown headers (H1-H6)
- Build hierarchical section tree
- Count tokens per section (target: 400-900)
- Generate section maps for navigation

Abstract Generation
- Create 100-200 token summaries for major sections
- Preserve key concepts and relationships
- Enable progressive reading workflows

Metadata Extraction
- Extract tables with structure preservation
- Capture code blocks with language tags
- Identify benchmarks (percentages, metrics)
- Extract key terms (techniques, models, acronyms)

Output Generation
- Machine-readable JSON (structure.json, metadata.json)
- Human-readable markdown (section_map.md)
- Full section content with metadata

Quick Reference

Task	Command	Output
Parse structure	`python3 scripts/parse_document_structure.py <file.md>`	structure.json, section_map.md
Extract metadata	`python3 scripts/extract_metadata.py <file.md>`	metadata.json
Custom output path	`--output <path>`	Specify output file
Section map	`--map <path>`	Human-readable navigation

Chunking Principles Reference

The skill implements RAG-optimized chunking principles:

The 400-900 Token Sweet Spot

Too small (<400): Fragments semantic meaning, loses context
Sweet spot (400-900): Complete thoughts, searchable, coherent
Too large (>900): Dilutes relevance, adds noise
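The three bands above can be expressed as a small check. This is a minimal sketch, not code taken from the skill's scripts: the function name `classify_chunk` and the label strings are illustrative assumptions, and the 400 and 900 boundaries are treated as inclusive.

```python
def classify_chunk(token_count: int, low: int = 400, high: int = 900) -> str:
    """Label a section by where its token count falls relative to the target range."""
    if token_count < low:
        return "too_small"    # likely fragments meaning, loses context
    if token_count > high:
        return "too_large"    # likely dilutes relevance, adds noise
    return "sweet_spot"       # complete thought, searchable, coherent


# Illustrative token counts, not measurements from a real document
for count in (320, 650, 1200):
    print(count, classify_chunk(count))
```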

Layout-Aware Hierarchical Chunking

Respect document structure (headers, sections)
Never split mid-paragraph or mid-code-block
Preserve parent-child relationships
Include breadcrumb context (section path)
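As a rough illustration of these rules, the sketch below splits markdown at headers, skips `#` lines that sit inside fenced code blocks, and records a breadcrumb for each section. It is an assumption about how such a splitter could look, not the implementation in scripts/parse_document_structure.py; `chunk_by_headers` and the dictionary keys are hypothetical names. Text before the first header is dropped in this version.

```python
import re
from typing import Dict, List, Optional

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE_RE = re.compile(r"^\s*(```|~~~)")


def chunk_by_headers(markdown: str) -> List[Dict]:
    """Split markdown at headers, ignoring '#' lines inside fenced code blocks."""
    chunks: List[Dict] = []
    path: List[str] = []              # breadcrumb: one title per header level
    current: Optional[Dict] = None
    in_fence = False
    for line in markdown.splitlines():
        if FENCE_RE.match(line):
            in_fence = not in_fence   # toggle on opening and closing fences
        match = None if in_fence else HEADER_RE.match(line)
        if match:
            level = len(match.group(1))
            title = match.group(2)
            path = path[: level - 1] + [title]
            current = {"title": title, "level": level,
                       "breadcrumb": " > ".join(path), "lines": []}
            chunks.append(current)
        elif current is not None:
            current["lines"].append(line)
    for chunk in chunks:
        chunk["text"] = "\n".join(chunk.pop("lines")).strip()
    return chunks


fence = "`" * 3  # built at runtime so this example contains no literal fence
sample = "\n".join([
    "# Guide", "Intro text.", "",
    "## Setup", fence + "bash", "# not a header", "pip install x", fence, "",
    "## Usage", "Run it.",
])
chunks = chunk_by_headers(sample)
for chunk in chunks:
    print(chunk["level"], chunk["breadcrumb"])
```

Because the split only happens at header lines, paragraphs and code blocks always stay whole inside their section.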

Dual-Storage Pattern

Abstracts: Quick navigation, relevance filtering
Full sections: Deep-dive when needed
Metadata: Tables, benchmarks, key terms for targeted search
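One way to picture the pattern is a store that keeps abstracts, full text, and metadata side by side under the same section id. The sketch below is illustrative only: `SectionStore` and its methods are hypothetical names, and the keyword match stands in for whatever retrieval (for example embedding search) a real system would use.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SectionStore:
    """Abstracts for cheap filtering, full text for deep-dives, keyed by section id."""
    abstracts: Dict[str, str] = field(default_factory=dict)
    full_text: Dict[str, str] = field(default_factory=dict)
    metadata: Dict[str, dict] = field(default_factory=dict)

    def add(self, section_id: str, abstract: str, text: str,
            meta: Optional[dict] = None) -> None:
        self.abstracts[section_id] = abstract
        self.full_text[section_id] = text
        self.metadata[section_id] = meta or {}

    def candidates(self, keyword: str) -> List[str]:
        """First pass: match against abstracts only (stand-in for real retrieval)."""
        needle = keyword.lower()
        return [sid for sid, abstract in self.abstracts.items()
                if needle in abstract.lower()]

    def deep_dive(self, section_id: str) -> str:
        """Second pass: fetch the full section once it looks relevant."""
        return self.full_text[section_id]


store = SectionStore()
store.add("section-1", "Overview of RAG chunking.", "Full text of section 1 ...",
          {"key_terms": ["RAG"]})
store.add("section-2", "Evaluation setup.", "Full text of section 2 ...")
for section_id in store.candidates("rag"):
    print(section_id, "->", store.deep_dive(section_id))
```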

See references/chunking_principles.md for complete details.

Sandbox Configuration

IMPORTANT: This skill requires executing Python scripts. In read-only sandbox mode, you need to either:

Recommended: Configure sandbox allowlist in ~/.codex/config.toml:
```
[sandbox]
allowed_paths = ["~/.codex/skills/*/scripts"]
```
Alternative: Use dangerouslyDisableSandbox: true when calling Bash tool

See README.md in this skill directory for complete sandbox setup instructions.

Implementation Workflows

Workflow 1: Parse Single Large Document

Use case: User has a 47k token research paper

# Step 1: Parse document structure
cd ~/.codex/skills/document-parser
python3 scripts/parse_document_structure.py /path/to/document.md \
  --output structure.json \
  --map section_map.md

# Step 2: Review section map
cat section_map.md
# Shows hierarchical outline with token counts

# Step 3: Extract metadata
python3 scripts/extract_metadata.py /path/to/document.md \
  --output metadata.json

# Step 4: Review extracted metadata
jq '.tables | length' metadata.json
jq '.benchmarks | length' metadata.json
jq '.key_terms | keys' metadata.json

Expected output:

structure.json: Hierarchical section tree with token counts
section_map.md: Human-readable outline for navigation
metadata.json: Tables, code blocks, benchmarks, key terms

Workflow 2: Comparative Analysis

Use case: Compare two research papers on similar topics

# Parse both documents
for doc in paper1.md paper2.md; do
  python3 scripts/parse_document_structure.py "$doc" \
    --output "${doc%.md}_structure.json"
  python3 scripts/extract_metadata.py "$doc" \
    --output "${doc%.md}_metadata.json"
done

# Compare structures
diff -u \
  <(jq '.sections[] | .title' paper1_structure.json) \
  <(jq '.sections[] | .title' paper2_structure.json)

# Compare key terms
diff -u \
  <(jq '.key_terms.techniques[]' paper1_metadata.json | sort) \
  <(jq '.key_terms.techniques[]' paper2_metadata.json | sort)

Workflow 3: Progressive Document Reading

Use case: Understand document before deep-dive

# Step 1: Get high-level structure
python3 scripts/parse_document_structure.py document.md --map outline.md
cat outline.md
# Review: What are the main sections?

# Step 2: Read abstracts (if available in structure.json)
jq '.sections[] | select(.abstract) | {title, abstract}' structure.json

# Step 3: Extract metadata for context
python3 scripts/extract_metadata.py document.md --output metadata.json

# Step 4: Review key terms to understand domain
jq '.key_terms' metadata.json

# Step 5: Deep-dive into specific sections
# Read full sections from original document based on structure

Script Documentation

parse_document_structure.py

Extracts markdown headers, builds hierarchical section tree, counts tokens.

Usage:

python3 scripts/parse_document_structure.py <file.md> [OPTIONS]

Options:

--output FILEPATH - Output JSON file (default: structure.json)
--map FILEPATH - Output markdown section map (default: section_map.md)

Output structure.json format:

{
  "sections": [
    {
      "id": "section-1",
      "title": "Introduction",
      "level": 1,
      "token_count": 450,
      "children": [
        {
          "id": "section-1.1",
          "title": "Background",
          "level": 2,
          "token_count": 320,
          "children": []
        }
      ]
    }
  ],
  "total_sections": 56,
  "total_tokens": 47000
}
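Because sections nest through `children`, consumers of this file usually need a recursive walk. The sketch below assumes the field names shown in the sample above and uses an inline copy of that sample; the helper name `iter_sections` is hypothetical.

```python
from typing import Dict, Iterator, List, Tuple


def iter_sections(sections: List[Dict],
                  trail: Tuple[str, ...] = ()) -> Iterator[Tuple[str, Dict]]:
    """Yield (breadcrumb, section) for every section, parents before children."""
    for section in sections:
        path = trail + (section["title"],)
        yield " > ".join(path), section
        yield from iter_sections(section.get("children", []), path)


# Inline copy of the sample format; in practice this would come from json.load()
structure = {
    "sections": [
        {"id": "section-1", "title": "Introduction", "level": 1, "token_count": 450,
         "children": [
             {"id": "section-1.1", "title": "Background", "level": 2,
              "token_count": 320, "children": []},
         ]},
    ],
}

for breadcrumb, section in iter_sections(structure["sections"]):
    print(f'{section["id"]:<12} {section["token_count"]:>5}  {breadcrumb}')
```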

Output section_map.md format:

# Document Structure

- Introduction (450 tokens)
  - Background (320 tokens)
  - Motivation (280 tokens)
- Methods (650 tokens)
  - Data Collection (520 tokens)
  - Analysis (580 tokens)

extract_metadata.py

Extracts tables, code blocks, benchmarks, and key terms.

Usage:

python3 scripts/extract_metadata.py <file.md> [OPTIONS]

Options:

--output FILEPATH - Output JSON file (default: metadata.json)

Output metadata.json format:

{
  "tables": [
    {
      "id": "table-1",
      "section": "Results",
      "headers": ["Model", "Accuracy", "F1"],
      "rows": [
        ["GPT-4", "95.2%", "0.94"],
        ["Claude", "94.8%", "0.93"]
      ]
    }
  ],
  "code_blocks": [
    {
      "id": "code-1",
      "section": "Implementation",
      "language": "python",
      "content": "def parse_document(text):\n    ..."
    }
  ],
  "benchmarks": [
    {
      "metric": "Accuracy",
      "value": "95.2%",
      "context": "GPT-4 on MMLU benchmark"
    }
  ],
  "key_terms": {
    "techniques": ["RAG", "Fine-tuning", "Few-shot learning"],
    "models": ["GPT-4", "Claude", "Llama-2"],
    "acronyms": ["MMLU", "RAG", "NLP"]
  }
}
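A short sketch of consuming this format from Python, assuming the field names in the sample above. The helpers `table_records` and `benchmarks_by_metric` are hypothetical and not part of the skill's scripts; the inline data is a trimmed copy of the sample.

```python
from collections import defaultdict
from typing import Dict, List


def table_records(table: Dict) -> List[Dict[str, str]]:
    """Turn a headers/rows table into one dict per row."""
    return [dict(zip(table["headers"], row)) for row in table["rows"]]


def benchmarks_by_metric(metadata: Dict) -> Dict[str, List[Dict]]:
    """Group benchmark entries by their metric name."""
    grouped: Dict[str, List[Dict]] = defaultdict(list)
    for benchmark in metadata.get("benchmarks", []):
        grouped[benchmark["metric"]].append(benchmark)
    return dict(grouped)


# Trimmed inline copy of the sample; in practice this would come from json.load()
metadata = {
    "tables": [
        {"id": "table-1", "section": "Results",
         "headers": ["Model", "Accuracy", "F1"],
         "rows": [["GPT-4", "95.2%", "0.94"], ["Claude", "94.8%", "0.93"]]},
    ],
    "benchmarks": [
        {"metric": "Accuracy", "value": "95.2%", "context": "GPT-4 on MMLU benchmark"},
    ],
}

for record in table_records(metadata["tables"][0]):
    print(record)
print(benchmarks_by_metric(metadata))
```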

Common Mistakes

❌ Sandbox permission errors when running scripts

Problem: Permission denied, or scripts won't execute in read-only sandbox mode.
Fix: Configure the sandbox allowlist in ~/.codex/config.toml:

[sandbox]
allowed_paths = ["~/.codex/skills/*/scripts"]

Or use dangerouslyDisableSandbox: true flag when calling Bash tool (development only). See README.md for complete setup instructions.

❌ Parsing non-markdown files

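Problem: Scripts expect markdown format.
Fix: Convert the file to markdown first. pandoc reads Word documents directly:

pandoc document.docx -o document.md

pandoc does not accept PDF as an input format, so extract PDF text with a separate tool first (for example pdftotext from poppler-utils). That output is plain text, so headers may need to be restored before parsing.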

❌ Ignoring token counts

Problem: Sections too large for embedding models.
Fix: Review the token counts in section_map.md and manually split sections over 900 tokens.

❌ Missing Python dependencies

Problem: Scripts require specific libraries.
Fix: Install dependencies:

pip install tiktoken markdown beautifulsoup4

❌ Not preserving structure

Problem: Flat extraction loses context.
Fix: Always use hierarchical parsing and maintain parent-child relationships.

❌ Skipping metadata extraction

Problem: Valuable structured data is lost.
Fix: Always run both scripts for complete analysis.

Examples

Example 1: Research Paper (47k tokens)

Input: 47k token research paper on RAG systems

Commands:

python3 scripts/parse_document_structure.py rag_paper.md
python3 scripts/extract_metadata.py rag_paper.md

Results:

56 sections extracted
54 tables identified
145 benchmarks found
71 techniques cataloged
Section map showing 3-level hierarchy
Average section size: 839 tokens (within target range)

Example 2: Technical Documentation

Input: API documentation with code examples

Commands:

python3 scripts/parse_document_structure.py api_docs.md --map api_outline.md
python3 scripts/extract_metadata.py api_docs.md

Use results to:

Navigate API structure via outline
Extract all code examples for testing
Catalog all endpoints from tables
Build searchable knowledge base

Example 3: Multi-Document Comparison

Input: 3 papers on LLM evaluation

Workflow:

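# Parse all documents, writing per-document files so the defaults
# (structure.json, section_map.md, metadata.json) are not overwritten
for doc in paper*.md; do
  python3 scripts/parse_document_structure.py "$doc" \
    --output "${doc%.md}_structure.json" \
    --map "${doc%.md}_section_map.md"
  python3 scripts/extract_metadata.py "$doc" \
    --output "${doc%.md}_metadata.json"
done

# Compare methodologies (top-level section titles only)
jq -r '.sections[] | select(.title | contains("Method")) | .title' *_structure.json

# Compare benchmarks
jq -r '.benchmarks[] | select(.metric == "Accuracy") | "\(.value) - \(.context)"' *_metadata.json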
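The jq queries above list values side by side. To see which techniques the documents share and which are unique to each, a set comparison over the `key_terms.techniques` field may be easier to read. This is a sketch under the assumption that each metadata file follows the format documented earlier; `compare_techniques` is a hypothetical helper, and the inline data is made up for illustration.

```python
from typing import Dict, Set, Tuple


def compare_techniques(
    metadata_by_doc: Dict[str, dict],
) -> Tuple[Set[str], Dict[str, Set[str]]]:
    """Return (techniques shared by all documents, techniques unique to each)."""
    term_sets = {name: set(meta["key_terms"]["techniques"])
                 for name, meta in metadata_by_doc.items()}
    shared = set.intersection(*term_sets.values()) if term_sets else set()
    unique = {}
    for name, terms in term_sets.items():
        others = set().union(*(t for n, t in term_sets.items() if n != name))
        unique[name] = terms - others
    return shared, unique


# Illustrative data; in practice load each *_metadata.json with json.load()
docs = {
    "paper1": {"key_terms": {"techniques": ["RAG", "Fine-tuning"]}},
    "paper2": {"key_terms": {"techniques": ["RAG", "Few-shot learning"]}},
}
shared, unique = compare_techniques(docs)
print("shared:", sorted(shared))
for name, terms in unique.items():
    print(name, "only:", sorted(terms))
```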

Testing Your Parsing

After parsing a document, verify quality:

Structure Checklist:

All major sections captured
Hierarchy preserved (H1 > H2 > H3)
Token counts reasonable (400-900 target)
Section map is human-readable
JSON is valid (jq . structure.json)

Metadata Checklist:

Tables extracted with structure
Code blocks include language tags
Benchmarks capture value + context
Key terms are domain-relevant
JSON is valid (jq . metadata.json)
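Parts of the structure checklist can be automated. The sketch below assumes the structure.json fields documented earlier and checks three things: required keys are present, each child sits at a deeper level than its parent, and token counts fall inside the 400-900 target. `check_structure` and `REQUIRED_KEYS` are hypothetical names, not part of the skill's scripts.

```python
from typing import Dict, List

REQUIRED_KEYS = ("id", "title", "level", "token_count", "children")


def check_structure(sections: List[Dict], parent_level: int = 0,
                    low: int = 400, high: int = 900) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems: List[str] = []
    for section in sections:
        missing = [key for key in REQUIRED_KEYS if key not in section]
        if missing:
            problems.append(f"{section.get('id', '?')}: missing keys {missing}")
            continue
        if section["level"] <= parent_level:
            problems.append(f"{section['id']}: level {section['level']} "
                            f"is not deeper than parent level {parent_level}")
        if not low <= section["token_count"] <= high:
            problems.append(f"{section['id']}: {section['token_count']} tokens "
                            f"is outside {low}-{high}")
        problems.extend(check_structure(section["children"], section["level"], low, high))
    return problems


# Inline sample; in practice this would come from json.load() on structure.json
sample_sections = [
    {"id": "section-1", "title": "Introduction", "level": 1, "token_count": 450,
     "children": [
         {"id": "section-1.1", "title": "Background", "level": 2,
          "token_count": 320, "children": []},
     ]},
]
for problem in check_structure(sample_sections):
    print(problem)
```

Out-of-range token counts are reported as problems here for simplicity; short subsections may be acceptable in practice, so treat those lines as prompts for review.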

Advanced Usage

Custom Section Splitting

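If sections are too large (>900 tokens), split them. The documented options for parse_document_structure.py are only --output and --map, so the flags below assume you have first added target-size and max-size parameters to the script:

# Requires adding --target-size and --max-size to parse_document_structure.py
python3 scripts/parse_document_structure.py document.md \
  --target-size 600 \
  --max-size 900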
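For reference, splitting logic of this kind could look like the sketch below, which greedily packs whole paragraphs into chunks near a target size. It is an illustration under stated assumptions, not the script's code: `split_section` and `approx_tokens` are hypothetical names, and the word-count tokenizer is a crude stand-in for a real one such as tiktoken. Paragraphs are detected by blank lines, so a fenced code block containing blank lines would need extra handling to stay whole.

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    """Rough stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def split_section(text: str, target_size: int = 600, max_size: int = 900,
                  count: Callable[[str], int] = approx_tokens) -> List[str]:
    """Pack whole paragraphs into chunks near target_size.

    A chunk exceeds max_size only when a single paragraph is itself larger.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_size = 0
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        size = count(paragraph)
        # Close the chunk once it reached the target or the next paragraph would overflow
        if current and (current_size >= target_size or current_size + size > max_size):
            chunks.append("\n\n".join(current))
            current, current_size = [], 0
        current.append(paragraph)
        current_size += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Four synthetic 300-word paragraphs
text = "\n\n".join(" ".join(["word"] * 300) for _ in range(4))
parts = split_section(text)
print([approx_tokens(part) for part in parts])
```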

Filtering by Section Level

Extract only top-level sections:

jq '.sections[] | select(.level == 1)' structure.json

Building RAG Index

Use parsed output for RAG system:

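import json

# Load structure
with open('structure.json') as f:
    structure = json.load(f)

# Load metadata (tables, benchmarks, key terms) for enriching the index
with open('metadata.json') as f:
    metadata = json.load(f)

def embed_and_index(section):
    # Placeholder: replace with your own embedding and indexing call
    print(f"indexing {section['id']}: {section['title']}")

def walk(sections):
    # Yield every section, including nested children
    for section in sections:
        yield section
        yield from walk(section.get('children', []))

# Build embeddings for each section
for section in walk(structure['sections']):
    if 400 <= section['token_count'] <= 900:
        # Optimal chunk size
        embed_and_index(section)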

Integration with Other Skills

This skill complements:

skill-builder: Create new parsing strategies as skills
time-awareness: Track document parsing timestamps

Proven Success

Tested successfully on:

✅ 47K token research document
✅ 56 sections extracted
✅ 54 tables preserved
✅ 145 benchmarks identified
✅ 71 techniques cataloged
✅ Hierarchical section maps generated
✅ Metadata JSON validated

References

references/chunking_principles.md - Complete RAG chunking methodology
Scripts in scripts/ directory
See skill-builder for creating document-specific parsing skills

Remember: Large documents are structured data. Parse the structure first, then read strategically.
