name	document-converter
description	Convert between Markdown, DOCX, and PDF formats bidirectionally. Handles text extraction from PDF/DOCX, markdown to document conversion. Use when converting document formats or extracting structured content from Word or PDF files.
allowed-tools	Bash, Read, Glob, Write
dependencies	pandoc>=2.0, python3>=3.8, markitdown (optional, recommended), pymupdf4llm (optional, recommended)
model	haiku-4.5
model-justification	Orchestrates external conversion tools with minimal AI reasoning required
fallback-model	sonnet-4.5

Document Converter Skill

Convert documents bidirectionally between Markdown, DOCX, and PDF formats. This skill detects which conversion tools are installed, picks the best available one for each format, processes whole directories in batches, and retries with a fallback tool when the primary one fails.

Core Capabilities

Conversion Directions

TO Markdown (text extraction from documents):

DOCX → Markdown (MarkItDown or Pandoc)
PDF → Markdown (MarkItDown or PyMuPDF4LLM)

FROM Markdown (document generation):

Markdown → DOCX (Pandoc)
Markdown → PDF (Pandoc with Typst or XeLaTeX)

Features

Automatic tool detection and selection
Cascading fallback mechanisms
Batch processing support
Image extraction and embedding
Filename sanitization (spaces to underscores)
Quality validation and reporting
Concurrent conversion support
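
As an illustration of the filename sanitization feature, a minimal sketch follows. The function name sanitize_filename is hypothetical and not taken from the conversion scripts; it only shows the spaces-to-underscores rule.

```shell
# Hypothetical helper: replace every space in a file name with an underscore.
sanitize_filename() {
    printf '%s\n' "$1" | tr ' ' '_'
}

sanitize_filename "Quarterly Report 2024.docx"
# prints: Quarterly_Report_2024.docx
```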

Tool Priority Matrix

The skill picks one tool per conversion direction, preferring the tool with the higher fidelity and falling back when it is missing or fails:

DOCX → Markdown

MarkItDown (primary) - 75-80% fidelity
- Perfect table preservation (pipe-style markdown)
- Excellent Unicode/emoji support
- Fast processing
Pandoc (fallback) - 68% fidelity
- Reliable baseline conversion
- Tables converted to grid format

PDF → Markdown

MarkItDown (primary) - Best for most PDFs
- Consistent quality across document types
- Easy to configure
PyMuPDF4LLM (fallback) - Fast alternative
- Zero configuration required
- Perfect Unicode preservation
- Good for simple PDFs

Markdown → DOCX

Pandoc (only option) - 95%+ quality preservation

Markdown → PDF

Pandoc + Typst (primary) - Fast, modern PDF engine (needs a Pandoc 3.x release with Typst support)
Pandoc + XeLaTeX (fallback) - Traditional LaTeX engine
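
The Markdown → PDF priority above could be expressed roughly as follows. This is a sketch, not the skill's actual code: pick_pdf_engine is a hypothetical name, and the only real CLI feature it relies on is Pandoc's --pdf-engine option.

```shell
# Hypothetical helper: print the first available PDF engine, preferring Typst
# over XeLaTeX. Returns non-zero when neither is installed.
pick_pdf_engine() {
    for engine in typst xelatex; do
        if command -v "$engine" >/dev/null 2>&1; then
            printf '%s\n' "$engine"
            return 0
        fi
    done
    return 1
}

# Usage (commented out so the sketch has no side effects):
# engine=$(pick_pdf_engine) && pandoc report.md -o report.pdf --pdf-engine="$engine"
```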

Usage Patterns

Basic Conversion

# Source conversion core library
source /home/benjamin/.config/.claude/lib/convert/convert-core.sh

# Detect available tools
detect_tools

# Convert DOCX/PDF to Markdown (direction is detected from the input files)
main_conversion /path/to/documents /path/to/output

# Convert Markdown to DOCX (same call; the input directory holds .md files)
main_conversion /path/to/markdown /path/to/output

Batch Processing

The conversion core automatically processes all files in the input directory:

Discovers all convertible files (.docx, .pdf, or .md)
Detects conversion direction automatically
Processes files concurrently (default: 4 parallel conversions)
Generates conversion.log with statistics
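
One common way to get bounded parallelism of this kind is find piped into xargs -P. The sketch below assumes that approach; run_batch is a hypothetical name, and the real convert-core.sh may schedule jobs differently.

```shell
# Hypothetical helper: run a command once per convertible file, at most
# <max_parallel> at a time.
# Usage: run_batch <input_dir> <max_parallel> <command> [args...]
# Note: GNU xargs runs the command once even when no file matches.
run_batch() {
    input_dir=$1
    max_parallel=$2
    shift 2
    find "$input_dir" -type f \
        \( -name '*.docx' -o -name '*.pdf' -o -name '*.md' \) -print0 |
        xargs -0 -n 1 -P "$max_parallel" "$@"
}

# Usage (commented out): list every convertible file, four at a time.
# run_batch ./documents 4 echo
```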

Progress Streaming

The conversion script emits PROGRESS markers:

[PROGRESS] Converting: file1.docx → file1.md
[PROGRESS] Converting: file2.pdf → file2.md (2/10)
[SUCCESS] Converted file1.docx → file1.md
[FAILED] file3.pdf: Conversion timeout after 300s
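
Because every marker starts its line, a caller can tally results with a few lines of awk. The helper below is a hypothetical sketch (summarize_progress is not part of the skill) that reads the stream from standard input.

```shell
# Hypothetical helper: count [SUCCESS] and [FAILED] markers read from stdin.
summarize_progress() {
    awk '
        /^\[SUCCESS\]/ { ok++ }
        /^\[FAILED\]/  { bad++ }
        END { printf "succeeded=%d failed=%d\n", ok, bad }
    '
}

# Usage (commented out): main_conversion ./in ./out | summarize_progress
```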

Conversion Workflow

Phase 1: Tool Detection

Check for MarkItDown availability (command -v markitdown)
Check for Pandoc availability (command -v pandoc)
Check for PyMuPDF4LLM availability (python3 -c "import pymupdf4llm")
Check for PDF engines (Typst, XeLaTeX)
Set availability flags for tool selection
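
The detection checks listed above might look roughly like this. The flag names and the detect_tools_sketch function are assumptions made for illustration; the real detect_tools in convert-core.sh may use different names.

```shell
# Hypothetical helper: true when a command is on PATH.
have_command() { command -v "$1" >/dev/null 2>&1; }

# Sketch of Phase 1: set one true/false flag per tool (flag names are assumed).
detect_tools_sketch() {
    MARKITDOWN_AVAILABLE=false
    PANDOC_AVAILABLE=false
    PYMUPDF4LLM_AVAILABLE=false
    TYPST_AVAILABLE=false
    XELATEX_AVAILABLE=false

    if have_command markitdown; then MARKITDOWN_AVAILABLE=true; fi
    if have_command pandoc; then PANDOC_AVAILABLE=true; fi
    # PyMuPDF4LLM is a Python package, so probe it with an import.
    if have_command python3 && python3 -c "import pymupdf4llm" >/dev/null 2>&1; then
        PYMUPDF4LLM_AVAILABLE=true
    fi
    if have_command typst; then TYPST_AVAILABLE=true; fi
    if have_command xelatex; then XELATEX_AVAILABLE=true; fi
}

detect_tools_sketch
echo "markitdown=$MARKITDOWN_AVAILABLE pandoc=$PANDOC_AVAILABLE pymupdf4llm=$PYMUPDF4LLM_AVAILABLE"
echo "typst=$TYPST_AVAILABLE xelatex=$XELATEX_AVAILABLE"
```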

Phase 2: File Discovery

Scan input directory for convertible files
Detect conversion direction (TO_MARKDOWN or FROM_MARKDOWN)
Reject mixed-mode input (a single run cannot mix conversion directions)
Create output directory structure
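
Direction detection and the mixed-mode check can be sketched as below. The TO_MARKDOWN and FROM_MARKDOWN labels come from the text above; the function name, error messages, and exit codes are assumptions.

```shell
# Hypothetical helper: print TO_MARKDOWN or FROM_MARKDOWN for a directory,
# or fail when it mixes both directions or holds no convertible files.
detect_direction() {
    input_dir=$1
    to_md=$(find "$input_dir" -type f \( -name '*.docx' -o -name '*.pdf' \) | wc -l | tr -d ' ')
    from_md=$(find "$input_dir" -type f -name '*.md' | wc -l | tr -d ' ')

    if [ "$to_md" -gt 0 ] && [ "$from_md" -gt 0 ]; then
        echo "Error: cannot mix conversion directions in one run" >&2
        return 1
    elif [ "$to_md" -gt 0 ]; then
        echo TO_MARKDOWN
    elif [ "$from_md" -gt 0 ]; then
        echo FROM_MARKDOWN
    else
        echo "Error: no convertible files found" >&2
        return 2
    fi
}
```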

Phase 3: Conversion Execution

Process files using optimal tool based on priority matrix
Apply timeout limits (60s DOCX, 300s PDF, 120s MD→PDF)
Handle collisions (overwrite or skip existing files)
Extract/embed images appropriately
Retry with fallback tools on failure
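
The timeout and retry steps can be combined in a small wrapper. This sketch assumes the GNU coreutils timeout command is installed (it is not part of stock macOS); convert_with_fallback and its message are hypothetical, not the skill's actual code.

```shell
# Hypothetical helper: run the primary command under a time limit and, if it
# fails or times out, run the fallback command under the same limit.
# Usage: convert_with_fallback <seconds> <primary_cmd> <fallback_cmd>
convert_with_fallback() {
    limit=$1
    primary=$2
    fallback=$3
    if timeout "$limit" sh -c "$primary"; then
        return 0
    fi
    echo "primary tool failed or timed out, retrying with fallback" >&2
    timeout "$limit" sh -c "$fallback"
}

# Usage (commented out): DOCX to Markdown with the 60s limit named above.
# convert_with_fallback 60 \
#     'markitdown in.docx > out.md' \
#     'pandoc in.docx -t gfm -o out.md'
```

Passing the commands as strings keeps the sketch short; a production script would more likely pass argument vectors to avoid quoting problems with unusual file names.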

Phase 4: Validation

Verify output file exists
Check for broken image links
Validate document structure (headings present)
Report any quality issues
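
A rough sketch of these checks for a Markdown output file is shown below. validate_markdown is a hypothetical name and the broken-image message is invented for illustration; the headings warning reuses the wording shown under Common Errors. Image links with a title, such as ![a](x.png "t"), are not handled.

```shell
# Hypothetical helper: fail if the output is missing or empty; otherwise print
# warnings for missing headings and for local image links that do not resolve.
validate_markdown() {
    md_file=$1
    if [ ! -s "$md_file" ]; then
        echo "[FAILED] $md_file: output missing or empty"
        return 1
    fi
    if ! grep -q '^#\{1,6\} ' "$md_file"; then
        echo "[WARNING] $md_file: No headings found (possible conversion issue)"
    fi
    base_dir=$(dirname "$md_file")
    # Extract the target of each ![alt](target) link and test that it exists.
    { grep -o '!\[[^]]*]([^)]*)' "$md_file" || true; } |
        sed 's/.*(\(.*\))$/\1/' |
        while read -r img; do
            case "$img" in http://*|https://*) continue ;; esac
            [ -e "$base_dir/$img" ] || echo "[WARNING] $md_file: broken image link: $img"
        done
    return 0   # warnings are reported but do not block the workflow
}
```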

Phase 5: Reporting

Generate conversion.log with statistics
Report success/failure counts by format
List timeout occurrences
Summarize validation issues

Quality Considerations

Fidelity Expectations

DOCX→Markdown: 75-80% with MarkItDown (best tables)
PDF→Markdown: Varies with PDF complexity (for scanned PDFs, scan quality is critical)
Markdown→DOCX: 95%+ with Pandoc (excellent preservation)
Markdown→PDF: High quality with Typst/XeLaTeX engines

Known Limitations

Scanned PDFs: OCR quality depends on scan resolution
Complex layouts: Multi-column or nested tables may degrade
Embedded fonts: PDF fonts may affect text extraction accuracy
Images: Large images may cause timeout issues

Best Practices

Use MarkItDown for DOCX/PDF when available (better quality)
Allow longer timeouts for large PDFs (300s default)
Review conversion.log for failure patterns
Test output files for critical conversions

Error Handling

Common Errors

Tool not available:

Error: No conversion tool available for DOCX→Markdown
Required: markitdown or pandoc

→ Install required tools

Conversion timeout:

[FAILED] large_document.pdf: Conversion timeout after 300s

→ Increase TIMEOUT_PDF_TO_MD or use a simpler PDF

Validation failure:

[WARNING] output.md: No headings found (possible conversion issue)

→ Check source document structure

Recovery Strategies

Failed conversions automatically retry with fallback tool
Timeouts skip to next file (batch processing continues)
Validation warnings don't block workflow (reported only)

Configuration Options

Environment variables to tune conversion behavior:

# Timeout multiplier (unitless factor applied to every timeout)
TIMEOUT_MULTIPLIER=1.5  # Increase all timeouts by 50%

# Disk usage limits
MAX_DISK_USAGE_GB=10  # Abort if output exceeds 10GB
MIN_FREE_SPACE_MB=500  # Require 500MB free space

# Concurrency
MAX_CONCURRENT_CONVERSIONS=4  # Parallel conversion limit
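
Since POSIX shell arithmetic is integer-only, a fractional TIMEOUT_MULTIPLIER has to be applied with an external tool. The sketch below uses awk and rounds up to whole seconds; scaled_timeout is a hypothetical name, and the real scripts may compute this differently.

```shell
# Hypothetical helper: scale a base timeout (seconds) by TIMEOUT_MULTIPLIER,
# defaulting to 1 when the variable is unset, and round up to a whole second.
scaled_timeout() {
    base=$1
    multiplier=${TIMEOUT_MULTIPLIER:-1}
    awk -v b="$base" -v m="$multiplier" \
        'BEGIN { t = b * m; r = int(t); if (r < t) r++; print r }'
}

TIMEOUT_MULTIPLIER=1.5
scaled_timeout 300
# prints: 450
```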

Integration Examples

From Claude Code Agents

When working within agent contexts, the skill automatically triggers when Claude detects conversion needs:

User: "Extract text from these PDF reports"
→ Skill auto-invokes: document-converter
→ Converts PDFs to Markdown
→ Returns structured text

From Slash Commands

The /convert-docs command delegates to this skill when available:

/convert-docs ./documents ./output
→ Checks skill availability
→ Delegates to document-converter skill
→ Falls back to script mode if skill unavailable

From Other Skills

Skills can compose with document-converter:

# research-specialist skill
dependencies:
  - document-converter  # Auto-loads for PDF analysis

Script Locations

The skill relies on conversion scripts in the project:

Core orchestration: .claude/lib/convert/convert-core.sh
DOCX functions: .claude/lib/convert/convert-docx.sh
PDF functions: .claude/lib/convert/convert-pdf.sh
Markdown utilities: .claude/lib/convert/convert-markdown.sh

Scripts are symlinked in the skill's scripts/ directory for easy access.

Testing

Test the skill with sample conversions:

# Test DOCX→Markdown
/convert-docs ./test/sample.docx ./output

# Test PDF→Markdown
/convert-docs ./test/report.pdf ./output

# Test Markdown→DOCX
/convert-docs ./test/document.md ./output

# Batch test
/convert-docs ./test/documents ./output

Verify:

conversion.log generated with statistics
Output files created with correct extensions
Image directories created when needed
Quality meets expectations (check tables, formatting)

Troubleshooting

Skill Not Triggering

If the skill doesn't auto-invoke when expected:

Check description includes trigger keywords (convert, document, PDF, DOCX, Markdown)
Test with explicit skill invocation: "Use document-converter skill"
Verify skill is in .claude/skills/ directory (project-level)

Tool Installation

MarkItDown (recommended):

pip install 'markitdown[all]'  # the [all] extra includes the PDF and DOCX converters

PyMuPDF4LLM (optional):

pip install pymupdf4llm

Pandoc (required):

# Ubuntu/Debian
sudo apt install pandoc

# macOS
brew install pandoc

Typst (optional, for MD→PDF):

# Linux (x86_64 release binary)
wget https://github.com/typst/typst/releases/latest/download/typst-x86_64-unknown-linux-musl.tar.xz
tar -xf typst-*.tar.xz && sudo mv typst-*/typst /usr/local/bin/

# macOS
brew install typst

Performance Issues

If conversions are slow:

Reduce MAX_CONCURRENT_CONVERSIONS if parallel jobs are competing for CPU, memory, or disk
Raise TIMEOUT_MULTIPLIER so large files can finish instead of timing out
Check disk I/O (slow storage may bottleneck)
Use PyMuPDF4LLM for simple PDFs (faster than MarkItDown)
