document-converter

@benbrastmckie/.config

Convert between Markdown, DOCX, and PDF formats bidirectionally. Handles text extraction from PDF/DOCX, markdown to document conversion. Use when converting document formats or extracting structured content from Word or PDF files.

Install Skill

  1. Download the skill.
  2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section.
  3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file.

Note: Verify the skill by reading through its instructions before using it.

SKILL.md

name: document-converter
description: Convert between Markdown, DOCX, and PDF formats bidirectionally. Handles text extraction from PDF/DOCX, markdown to document conversion. Use when converting document formats or extracting structured content from Word or PDF files.
allowed-tools: Bash, Read, Glob, Write
dependencies: pandoc>=2.0, python3>=3.8, markitdown (optional, recommended), pymupdf4llm (optional, recommended)
model: haiku-4.5
model-justification: Orchestrates external conversion tools with minimal AI reasoning required
fallback-model: sonnet-4.5

Document Converter Skill

Convert documents bidirectionally between Markdown, DOCX, and PDF formats. This skill automatically detects optimal conversion tools, handles batch processing, and ensures quality output with appropriate fallback mechanisms.

Core Capabilities

Conversion Directions

TO Markdown (text extraction from documents):

  • DOCX → Markdown (MarkItDown or Pandoc)
  • PDF → Markdown (MarkItDown or PyMuPDF4LLM)

FROM Markdown (document generation):

  • Markdown → DOCX (Pandoc)
  • Markdown → PDF (Pandoc with Typst or XeLaTeX)

Features

  • Automatic tool detection and selection
  • Cascading fallback mechanisms
  • Batch processing support
  • Image extraction and embedding
  • Filename sanitization (spaces to underscores)
  • Quality validation and reporting
  • Concurrent conversion support
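
The filename sanitization listed above (spaces to underscores) might look roughly like the following. This is a minimal sketch: `sanitize_filename` is a hypothetical helper name, not a function confirmed to exist in the skill's scripts, and it handles only spaces.

```shell
# Hypothetical helper (not taken from convert-core.sh):
# replace every space in a filename with an underscore.
sanitize_filename() {
  printf '%s\n' "$1" | tr ' ' '_'
}

sanitize_filename "Quarterly Report 2024.docx"   # prints Quarterly_Report_2024.docx
```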

Tool Priority Matrix

For each conversion direction, the skill tries the tools below in priority order and falls back to the next one when the preferred tool is unavailable or fails:

DOCX → Markdown

  1. MarkItDown (primary) - 75-80% fidelity
    • Perfect table preservation (pipe-style markdown)
    • Excellent Unicode/emoji support
    • Fast processing
  2. Pandoc (fallback) - 68% fidelity
    • Reliable baseline conversion
    • Tables converted to grid format

PDF → Markdown

  1. MarkItDown (primary) - Best for most PDFs
    • Consistent quality across document types
    • Easy to configure
  2. PyMuPDF4LLM (fallback) - Fast alternative
    • Zero configuration required
    • Perfect Unicode preservation
    • Good for simple PDFs

Markdown → DOCX

  1. Pandoc (only option) - 95%+ quality preservation

Markdown → PDF

  1. Pandoc + Typst (primary) - Fast, modern PDF engine
  2. Pandoc + XeLaTeX (fallback) - Traditional LaTeX engine
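
The priority matrix above can be read as a simple lookup. The sketch below is an illustration only: `pick_tool`, the direction labels, and the `HAVE_*` flags are assumed names, not identifiers from the skill's scripts.

```shell
# Hypothetical sketch of the priority matrix. HAVE_* flags are assumed to be
# 1 when a tool is installed and 0 (or unset) otherwise.
pick_tool() {
  case "$1" in
    docx_to_md)
      if [ "${HAVE_MARKITDOWN:-0}" = 1 ]; then echo markitdown
      elif [ "${HAVE_PANDOC:-0}" = 1 ]; then echo pandoc
      else return 1
      fi ;;
    pdf_to_md)
      if [ "${HAVE_MARKITDOWN:-0}" = 1 ]; then echo markitdown
      elif [ "${HAVE_PYMUPDF4LLM:-0}" = 1 ]; then echo pymupdf4llm
      else return 1
      fi ;;
    md_to_docx)
      if [ "${HAVE_PANDOC:-0}" = 1 ]; then echo pandoc; else return 1; fi ;;
    md_to_pdf)
      # Pandoc is required; the PDF engine is chosen in priority order.
      [ "${HAVE_PANDOC:-0}" = 1 ] || return 1
      if [ "${HAVE_TYPST:-0}" = 1 ]; then echo pandoc+typst
      elif [ "${HAVE_XELATEX:-0}" = 1 ]; then echo pandoc+xelatex
      else return 1
      fi ;;
    *) return 2 ;;   # unknown direction
  esac
}
```

A non-zero return status signals that no tool is available for that direction, which corresponds to the "Tool not available" error case.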

Usage Patterns

Basic Conversion

# Source the conversion core library
# (this absolute path is from the author's machine; adjust it for your setup)
source /home/benjamin/.config/.claude/lib/convert/convert-core.sh

# Detect available tools
detect_tools

# Convert DOCX/PDF to Markdown (direction is detected from the input files)
main_conversion /path/to/documents /path/to/output

# Convert Markdown to DOCX (same call; the input directory holds .md files)
main_conversion /path/to/markdown /path/to/output

Batch Processing

The conversion core automatically processes all files in the input directory:

  • Discovers all convertible files (.docx, .pdf, or .md)
  • Detects conversion direction automatically
  • Processes files concurrently (default: 4 parallel conversions)
  • Generates conversion.log with statistics
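
One common way to get bounded parallelism like this in shell is `find` piped into `xargs -P`. The sketch below is an assumption about the general technique, not the skill's actual implementation: `batch_convert` is a hypothetical name and the `echo` stands in for the real per-file conversion.

```shell
# Hypothetical sketch: fan files out to a fixed number of parallel workers.
batch_convert() {
  input_dir="$1"
  jobs="${MAX_CONCURRENT_CONVERSIONS:-4}"
  # -print0 / -0 keep filenames with spaces intact; -P caps parallel workers;
  # -r skips the command entirely when no files are found.
  find "$input_dir" -type f \( -name '*.docx' -o -name '*.pdf' \) -print0 |
    xargs -0 -r -n 1 -P "$jobs" sh -c 'echo "[PROGRESS] Converting: $1"' _
}
```

Because workers run concurrently, the order of the emitted lines is not guaranteed to match the order in which files were discovered.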

Progress Streaming

The conversion script emits status markers ([PROGRESS], [SUCCESS], [FAILED]) as it works through the batch:

[PROGRESS] Converting: file1.docx → file1.md
[PROGRESS] Converting: file2.pdf → file2.md (2/10)
[SUCCESS] Converted file1.docx → file1.md
[FAILED] file3.pdf: Conversion timeout after 300s
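
Since each marker starts its line, a caller can tally results with a short filter. This is a sketch under the assumption that the markers appear exactly as in the sample above; `summarize_markers` is a hypothetical helper, not part of the skill.

```shell
# Hypothetical helper: read marker lines on stdin and print success/failure counts.
summarize_markers() {
  awk '
    /^\[SUCCESS\]/ { ok++ }
    /^\[FAILED\]/  { bad++ }
    END { printf "success=%d failed=%d\n", ok, bad }
  '
}
```

For the four sample lines shown above, this would report one success and one failure.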

Conversion Workflow

Phase 1: Tool Detection

  • Check for MarkItDown availability (command -v markitdown)
  • Check for Pandoc availability (command -v pandoc)
  • Check for PyMuPDF4LLM availability (python3 -c "import pymupdf4llm")
  • Check for PDF engines (Typst, XeLaTeX)
  • Set availability flags for tool selection
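
The checks in Phase 1 could be combined as follows. This is a minimal sketch: the probe commands are the ones listed above, but `detect_tools_sketch` and the `HAVE_*` flag names are assumptions and may differ from what `detect_tools` actually sets.

```shell
# Hypothetical sketch of tool detection; flag names are illustrative.
detect_tools_sketch() {
  HAVE_MARKITDOWN=0; HAVE_PANDOC=0; HAVE_PYMUPDF4LLM=0
  HAVE_TYPST=0; HAVE_XELATEX=0

  command -v markitdown >/dev/null 2>&1 && HAVE_MARKITDOWN=1
  command -v pandoc     >/dev/null 2>&1 && HAVE_PANDOC=1
  command -v typst      >/dev/null 2>&1 && HAVE_TYPST=1
  command -v xelatex    >/dev/null 2>&1 && HAVE_XELATEX=1
  # PyMuPDF4LLM is a Python package, so probe it with an import.
  python3 -c "import pymupdf4llm" >/dev/null 2>&1 && HAVE_PYMUPDF4LLM=1

  return 0   # a missing tool is not an error at detection time
}
```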

Phase 2: File Discovery

  • Scan input directory for convertible files
  • Detect conversion direction (TO_MARKDOWN or FROM_MARKDOWN)
  • Reject mixed-mode input (a single run cannot mix conversion directions)
  • Create output directory structure
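
Direction detection and the mixed-mode check might be sketched like this. The function name, the recursive scan, and the error wording are assumptions; only the TO_MARKDOWN and FROM_MARKDOWN labels come from the list above.

```shell
# Hypothetical sketch: infer the conversion direction from file extensions.
detect_direction() {
  scan_dir="$1"
  to_md=$(find "$scan_dir" -type f \( -name '*.docx' -o -name '*.pdf' \) | wc -l | tr -d ' ')
  from_md=$(find "$scan_dir" -type f -name '*.md' | wc -l | tr -d ' ')

  if [ "$to_md" -gt 0 ] && [ "$from_md" -gt 0 ]; then
    echo "Error: input mixes documents and Markdown files" >&2
    return 1
  elif [ "$to_md" -gt 0 ]; then
    echo TO_MARKDOWN
  elif [ "$from_md" -gt 0 ]; then
    echo FROM_MARKDOWN
  else
    echo "Error: no convertible files found" >&2
    return 2
  fi
}
```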

Phase 3: Conversion Execution

  • Process files using optimal tool based on priority matrix
  • Apply timeout limits (60s DOCX, 300s PDF, 120s MD→PDF)
  • Handle collisions (overwrite or skip existing files)
  • Extract/embed images appropriately
  • Retry with fallback tools on failure
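
The timeout-then-fallback behavior in Phase 3 can be approximated with the GNU coreutils `timeout` command, which exits with status 124 when the limit is hit. The wrapper below is a sketch with a hypothetical name (`run_limited`); the skill's own retry logic is not shown in this document.

```shell
# Hypothetical wrapper: run a command under a time limit and report timeouts.
# Requires GNU coreutils `timeout` (not installed by default on macOS).
run_limited() {
  limit="$1"; shift
  rc=0
  timeout "$limit" "$@" || rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "timed out after ${limit}s: $*" >&2
  fi
  return "$rc"
}

# Example chain for DOCX → Markdown with the 60s limit (needs the tools installed):
#   run_limited 60 markitdown input.docx -o output.md ||
#     run_limited 60 pandoc input.docx -t gfm -o output.md
```

Chaining with `||` means the fallback runs on any failure of the primary tool, including a timeout.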

Phase 4: Validation

  • Verify output file exists
  • Check for broken image links
  • Validate document structure (headings present)
  • Report any quality issues
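
The three checks in Phase 4 might be sketched as below for Markdown output. `validate_markdown` is a hypothetical name, the image check handles only plain `![alt](path)` links with local paths, and the broken-link message wording is an assumption.

```shell
# Hypothetical sketch of output validation for a converted Markdown file.
validate_markdown() {
  md_file="$1"
  # 1. The output must exist and be non-empty.
  [ -s "$md_file" ] || { echo "[FAILED] $md_file: output missing or empty"; return 1; }

  # 2. Warn if no ATX heading (# ... ######) is present.
  grep -q '^#\{1,6\} ' "$md_file" ||
    echo "[WARNING] $md_file: No headings found (possible conversion issue)"

  # 3. Warn about image links whose local target does not exist.
  base=$(dirname "$md_file")
  grep -o '!\[[^]]*]([^)]*)' "$md_file" |
    sed 's/^.*(\([^)]*\))$/\1/' |
    while IFS= read -r img; do
      case "$img" in http://*|https://*) continue ;; esac
      [ -e "$base/$img" ] || echo "[WARNING] $md_file: broken image link: $img"
    done
  return 0   # warnings are reported but do not fail validation
}
```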

Phase 5: Reporting

  • Generate conversion.log with statistics
  • Report success/failure counts by format
  • List timeout occurrences
  • Summarize validation issues

Quality Considerations

Fidelity Expectations

  • DOCX→Markdown: 75-80% with MarkItDown (best tables)
  • PDF→Markdown: Varies by PDF complexity (scan quality critical)
  • Markdown→DOCX: 95%+ with Pandoc (excellent preservation)
  • Markdown→PDF: High quality with Typst/XeLaTeX engines

Known Limitations

  • Scanned PDFs: OCR quality depends on scan resolution
  • Complex layouts: Multi-column or nested tables may degrade
  • Embedded fonts: PDF fonts may affect text extraction accuracy
  • Images: Large images may cause timeout issues

Best Practices

  • Use MarkItDown for DOCX/PDF when available (better quality)
  • Allow longer timeouts for large PDFs (300s default)
  • Review conversion.log for failure patterns
  • Test output files for critical conversions

Error Handling

Common Errors

Tool not available:

Error: No conversion tool available for DOCX→Markdown
Required: markitdown or pandoc

→ Install required tools

Conversion timeout:

[FAILED] large_document.pdf: Conversion timeout after 300s

→ Increase TIMEOUT_PDF_TO_MD or use a simpler PDF

Validation failure:

[WARNING] output.md: No headings found (possible conversion issue)

→ Check source document structure

Recovery Strategies

  • Failed conversions automatically retry with fallback tool
  • Timeouts skip to next file (batch processing continues)
  • Validation warnings don't block workflow (reported only)

Configuration Options

Environment variables to tune conversion behavior:

# Timeout multiplier (unitless factor applied to the per-format timeouts)
TIMEOUT_MULTIPLIER=1.5  # Increase all timeouts by 50%

# Disk usage limits
MAX_DISK_USAGE_GB=10  # Abort if output exceeds 10GB
MIN_FREE_SPACE_MB=500  # Require 500MB free space

# Concurrency
MAX_CONCURRENT_CONVERSIONS=4  # Parallel conversion limit
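
As a worked example of the multiplier, a factor of 1.5 would turn the 300s PDF default into 450s. The sketch below assumes the result is rounded to the nearest whole second; `scaled_timeout` is a hypothetical helper, and the skill's actual rounding is not documented here.

```shell
# Hypothetical helper: scale a base timeout (seconds) by TIMEOUT_MULTIPLIER.
# awk is used because POSIX shell arithmetic is integer-only.
scaled_timeout() {
  awk -v base="$1" -v mult="${TIMEOUT_MULTIPLIER:-1}" \
    'BEGIN { printf "%d\n", base * mult + 0.5 }'
}

TIMEOUT_MULTIPLIER=1.5
scaled_timeout 300   # prints 450
scaled_timeout 60    # prints 90
```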

Integration Examples

From Claude Code Agents

When working within agent contexts, the skill automatically triggers when Claude detects conversion needs:

User: "Extract text from these PDF reports"
→ Skill auto-invokes: document-converter
→ Converts PDFs to Markdown
→ Returns structured text

From Slash Commands

The /convert-docs command delegates to this skill when available:

/convert-docs ./documents ./output
→ Checks skill availability
→ Delegates to document-converter skill
→ Falls back to script mode if skill unavailable

From Other Skills

Skills can compose with document-converter:

# research-specialist skill
dependencies:
  - document-converter  # Auto-loads for PDF analysis

Script Locations

The skill relies on conversion scripts in the project:

  • Core orchestration: .claude/lib/convert/convert-core.sh
  • DOCX functions: .claude/lib/convert/convert-docx.sh
  • PDF functions: .claude/lib/convert/convert-pdf.sh
  • Markdown utilities: .claude/lib/convert/convert-markdown.sh

Scripts are symlinked in the skill's scripts/ directory for easy access.

Testing

Test the skill with sample conversions:

# Test DOCX→Markdown
/convert-docs ./test/sample.docx ./output

# Test PDF→Markdown
/convert-docs ./test/report.pdf ./output

# Test Markdown→DOCX
/convert-docs ./test/document.md ./output

# Batch test
/convert-docs ./test/documents ./output

Verify:

  • conversion.log generated with statistics
  • Output files created with correct extensions
  • Image directories created when needed
  • Quality meets expectations (check tables, formatting)

Troubleshooting

Skill Not Triggering

If the skill doesn't auto-invoke when expected:

  • Check that the skill's description includes trigger keywords (convert, document, PDF, DOCX, Markdown)
  • Test with explicit skill invocation: "Use document-converter skill"
  • Verify skill is in .claude/skills/ directory (project-level)

Tool Installation

MarkItDown (recommended):

pip install markitdown

PyMuPDF4LLM (optional):

pip install pymupdf4llm

Pandoc (required):

# Ubuntu/Debian
sudo apt install pandoc

# macOS
brew install pandoc

Typst (optional, for MD→PDF):

# Linux (x86_64, prebuilt release binary)
wget https://github.com/typst/typst/releases/latest/download/typst-x86_64-unknown-linux-musl.tar.xz
tar -xf typst-*.tar.xz && sudo mv typst-*/typst /usr/local/bin/

# macOS
brew install typst

Performance Issues

If conversions are slow:

  • Reduce MAX_CONCURRENT_CONVERSIONS (lower parallelism)
  • Increase timeout values for large files
  • Check disk I/O (slow storage may bottleneck)
  • Use PyMuPDF4LLM for simple PDFs (faster than MarkItDown)

See Also