| name | file-to-markdown |
| description | Convert any file to markdown format using the markitdown library. Use this skill when users need to convert documents (PDF, DOCX, XLSX, PPTX, images, HTML, CSV, JSON, XML, audio files, etc.) into markdown format for easier reading, editing, or integration into markdown-based workflows. |
| license | Complete terms in LICENSE.txt |
File to Markdown Converter
Convert files to markdown format using the markitdown library. This skill handles documents, images, audio, structured data, and more.
When to Use This Skill
Use this skill when the user needs to:
- Convert documents (PDF, DOCX, PPTX, XLSX) to markdown
- Extract text from images using OCR
- Transcribe audio files to text
- Convert structured data (CSV, JSON, XML) to markdown tables
- Process web content (HTML, MHTML) into markdown
- Batch convert multiple files to markdown
Supported Formats
Documents: PDF, DOCX, PPTX, XLSX
Web: HTML, MHTML
Images: PNG, JPG, JPEG, GIF (with OCR and description)
Audio: MP3, WAV (with transcription)
Data: CSV, JSON, XML
Archives: ZIP
Other: Plain text files
Decision Tree: Choosing Your Approach
User request → Single file or multiple files?
├─ Single file → Use helper script
│ └─ Run: python scripts/convert_file.py <input> [output]
│
└─ Multiple files → Use batch conversion
└─ Run: python scripts/batch_convert.py <input_dir> [output_dir] [--pattern PATTERN]
Installation Check
Before converting, verify markitdown is installed:
pip install markitdown
For full functionality (image OCR, audio transcription):
pip install markitdown[all]
Conversion Workflow
Single File Conversion
Use the helper script as your primary method:
python scripts/convert_file.py input_file.pdf output.md
The script handles:
- File validation
- Conversion with error handling
- Output file creation with proper encoding
- Progress reporting
If output filename is omitted, the script creates input_file.md automatically.
Batch Conversion
For multiple files, use the batch converter:
# Convert all files in a directory
python scripts/batch_convert.py ./documents
# Specify output directory
python scripts/batch_convert.py ./documents ./markdown_output
# Filter by pattern
python scripts/batch_convert.py ./documents ./output --pattern "*.pdf"
# Multiple extensions
python scripts/batch_convert.py ./documents ./output --pattern "*.{pdf,docx}"
The batch script:
- Automatically excludes
.mdfiles - Provides progress tracking
- Reports success/failure for each file
- Creates output directories as needed
Direct Python Integration
When helper scripts don't fit, use the markitdown library directly:
from markitdown import MarkItDown
# Initialize converter
md = MarkItDown()
# Convert file
try:
result = md.convert("path/to/file.pdf")
if result and result.text_content:
# Process or save markdown
with open("output.md", "w", encoding="utf-8") as f:
f.write(result.text_content)
else:
print("No content extracted")
except Exception as e:
print(f"Conversion failed: {e}")
Format-Specific Guidance
Images (PNG, JPG, GIF)
- markitdown performs OCR to extract text
- Can generate image descriptions using vision models
- Best results with clear, well-lit text
- May not preserve complex layouts perfectly
Audio (MP3, WAV)
- Automatically transcribed to text
- Requires good audio quality for accuracy
- Processing time increases with file length
- Output formatted as markdown text
Documents (PDF, DOCX, PPTX, XLSX)
- Text extraction maintains basic structure
- Tables converted to markdown tables
- Some complex formatting may be simplified
- XLSX: each sheet becomes a section with table
Structured Data (CSV, JSON, XML)
- CSV: converted to markdown tables
- JSON: formatted as readable text structure
- XML: converted to hierarchical markdown
Web Content (HTML, MHTML)
- Extracts main content
- Converts HTML to clean markdown
- Preserves links and basic formatting
Error Handling
Common errors and solutions:
ImportError: markitdown not installed
- Install with:
pip install markitdown - For full features:
pip install markitdown[all]
- Install with:
FileNotFoundError
- Verify file path is correct
- Use absolute paths when uncertain
No content extracted
- File may be corrupted or empty
- Format may not be supported
- Try with a different file to verify installation
Encoding errors
- Always use
encoding='utf-8'when writing output files - Helper scripts handle this automatically
- Always use
Best Practices
- Start with helper scripts: They handle common cases reliably
- Test with samples first: Verify conversion quality before batch processing
- Use batch converter for large sets: More efficient than individual conversions
- Handle errors gracefully: Not all files convert perfectly
- Preserve original files: Conversion is non-destructive, but verify output before deleting sources
- Check output quality: Some complex formatting may not translate perfectly
Reference Files
scripts/
- convert_file.py: Single file conversion with error handling
- batch_convert.py: Directory-based batch conversion with pattern matching
references/
- markitdown_api.md: Complete API reference for markitdown library
- format_guide.md: Format-specific conversion tips and limitations
Always run scripts with --help first to see current usage and options.