Docling Document Conversion
Convert documents (PDF, DOCX, PPTX, HTML, Markdown, etc.) to Markdown, JSON, or HTML, with optional extraction of embedded images.
Installation
pip install docling
# For page range processing (optional but recommended)
pip install pymupdf
# For Tesseract OCR (optional):
# macOS: brew install tesseract
# Ubuntu: apt-get install tesseract-ocr
Quick Start
CLI
# Basic conversion (writes to the current directory unless --output is given)
docling document.pdf --to md
# With OCR for scanned documents
docling scanned.pdf --ocr --ocr-engine easyocr --to md
# Batch conversion
docling file1.pdf file2.docx --output ./converted
Python
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("document.pdf")
print(result.document.export_to_markdown())
Using the Enhanced Conversion Script
Execute scripts/convert_document.py for advanced conversions with image extraction:
# Basic PDF to Markdown with image extraction
python scripts/convert_document.py document.pdf -o ./output
# All formats (Markdown, JSON, HTML) with images
python scripts/convert_document.py document.pdf -o ./output -f all
# With OCR (Japanese + English)
python scripts/convert_document.py scanned.pdf -o ./output --ocr --languages ja en
# High accuracy table extraction
python scripts/convert_document.py document.pdf -o ./output --table-mode accurate
# Process specific page range
python scripts/convert_document.py large.pdf -o ./output --pages 1-20
# Generate batch script for large files (50+ pages)
python scripts/convert_document.py large.pdf -o ./output --generate-script
Output Structure
output/
├── document.md          # Markdown with embedded image links
├── document.json        # Structured JSON data
├── document.html        # HTML output (optional)
└── images/
    ├── figure_001.png   # Diagrams, charts as PNG
    ├── figure_002.png
    ├── photo_001.jpg    # Photos as JPEG
    └── photo_002.jpg
Script Options
| Option | Description | Default |
| --- | --- | --- |
| -o, --output | Output directory (required) | - |
| -f, --format | Output format (markdown, json, html, all) | markdown |
| --ocr | Enable OCR for scanned documents | disabled |
| --ocr-engine | OCR engine (easyocr, tesseract) | easyocr |
| --languages | OCR languages | en ja |
| --table-mode | Table extraction (fast, accurate) | fast |
| --pages | Page range (e.g., 1-20) | all pages |
| --generate-script | Generate batch script for large files | - |
| --batch-size | Pages per batch | 20 |
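To make the table concrete, here is a minimal sketch of an argparse parser that would accept the same options and defaults. It is an illustration only: the function name `build_parser` and the positional `input` argument are assumptions, and the real scripts/convert_document.py may define its options differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options table (not the script's actual code)."""
    parser = argparse.ArgumentParser(description="Convert a document with Docling")
    parser.add_argument("input", help="Path to the source document")
    parser.add_argument("-o", "--output", required=True, help="Output directory")
    parser.add_argument("-f", "--format", default="markdown",
                        choices=["markdown", "json", "html", "all"])
    parser.add_argument("--ocr", action="store_true", help="Enable OCR")
    parser.add_argument("--ocr-engine", default="easyocr",
                        choices=["easyocr", "tesseract"])
    parser.add_argument("--languages", nargs="+", default=["en", "ja"])
    parser.add_argument("--table-mode", default="fast", choices=["fast", "accurate"])
    parser.add_argument("--pages", default=None, help="Page range such as 1-20")
    parser.add_argument("--generate-script", action="store_true")
    parser.add_argument("--batch-size", type=int, default=20)
    return parser


# Parse the same arguments as the OCR example in the usage section.
args = build_parser().parse_args(
    ["scanned.pdf", "-o", "./output", "--ocr", "--languages", "ja", "en"]
)
print(args.format, args.ocr_engine, args.languages, args.batch_size)
```

Options that are not passed fall back to the defaults listed in the table, so `args.format` is `markdown` and `args.batch_size` is 20 here.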
Large File Handling
For files with 50+ pages or 50+ MB:
Option 1: Page Range Processing
# Process pages 1-20
python scripts/convert_document.py large.pdf -o ./output --pages 1-20
# Then process pages 21-40
python scripts/convert_document.py large.pdf -o ./output2 --pages 21-40
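A page-range string such as 1-20 has to be turned into numbers before any pages can be selected. The sketch below shows one way to do that. It assumes the range is 1-based and inclusive, and the helper name `parse_page_range` is hypothetical; the script's own parsing may behave differently.

```python
def parse_page_range(spec: str) -> tuple[int, int]:
    """Parse a range such as "1-20" into (start, end).

    Assumption: pages are 1-based and the end page is included.
    """
    start_text, sep, end_text = spec.partition("-")
    if not sep:
        # A single number such as "7" is treated as a one-page range.
        end_text = start_text
    start, end = int(start_text), int(end_text)
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {spec!r}")
    return start, end


print(parse_page_range("1-20"))
print(parse_page_range("21-40"))
```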
Option 2: Generate Batch Script
# Generate a batch processing script
python scripts/convert_document.py large.pdf -o ./output --generate-script --batch-size 20
# Run the generated script
python ./output/batch_process.py
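The idea behind batching can be sketched in a few lines: split the page count into consecutive ranges of at most --batch-size pages, then convert each range separately. The helper `batch_ranges` is hypothetical and only illustrates the arithmetic; the generated batch_process.py may be organised differently.

```python
def batch_ranges(total_pages: int, batch_size: int = 20) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into consecutive inclusive (start, end) ranges."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size >= 1")
    return [
        # The last batch is clipped to the final page.
        (start, min(start + batch_size - 1, total_pages))
        for start in range(1, total_pages + 1, batch_size)
    ]


for start, end in batch_ranges(50, batch_size=20):
    # Each range maps onto one --pages invocation of the script.
    print(f"--pages {start}-{end}")
```

For a 50-page file with a batch size of 20, this yields three ranges: 1-20, 21-40, and 41-50.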
Advanced Configuration
OCR Setup
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, EasyOcrOptions
pipeline = PdfPipelineOptions()
pipeline.do_ocr = True
pipeline.ocr_options = EasyOcrOptions(
    lang=["ja", "en"],  # Languages for OCR
    confidence_threshold=0.5,
)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline)}
)
Table Extraction
from docling.datamodel.pipeline_options import TableFormerMode
pipeline.do_table_structure = True
pipeline.table_structure_options.mode = TableFormerMode.ACCURATE # or FAST
pipeline.table_structure_options.do_cell_matching = True
Image Extraction (Python API)
from pathlib import Path
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
# Enable image generation
pipeline = PdfPipelineOptions()
pipeline.generate_picture_images = True
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline)}
)
result = converter.convert("document.pdf")
# Save images
images_dir = Path("./output/images")
images_dir.mkdir(parents=True, exist_ok=True)
for idx, picture in enumerate(result.document.pictures):
    # Skip pictures for which no image data was generated
    if picture.image and picture.image.pil_image:
        pil_img = picture.image.pil_image
        pil_img.save(images_dir / f"image_{idx:03d}.png", "PNG")
Export Options
from docling_core.types.doc import ImageRefMode
# doc is the document returned by an earlier converter.convert() call
doc = result.document
# Markdown
markdown = doc.export_to_markdown()
# JSON (dict)
data = doc.export_to_dict()
# HTML
html = doc.export_to_html()
# Save with different image modes
doc.save_as_markdown("output.md", image_mode=ImageRefMode.REFERENCED)
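The dict export is plain Python data, so it can be written to disk with the standard json module. The sketch below uses a stand-in dictionary instead of a real Docling export so that it runs on its own; the helper name `save_json` and the sample keys are assumptions for illustration. Passing `ensure_ascii=False` keeps non-ASCII text, such as Japanese OCR output, readable in the file.

```python
import json
from pathlib import Path


def save_json(data: dict, path: Path) -> None:
    """Write a dict as UTF-8 JSON, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(data, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )


# Stand-in for doc.export_to_dict(); the keys here are illustrative only.
sample = {"name": "document", "texts": ["こんにちは", "hello"]}
save_json(sample, Path("./output/document.json"))
```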
Supported Formats
| Input | Output |
| --- | --- |
| PDF, DOCX, PPTX, XLSX | Markdown |
| HTML, Markdown, AsciiDoc | JSON |
| Images (PNG, JPG, TIFF) | HTML |
OCR Engines
| Engine | Install | Languages |
| --- | --- | --- |
| EasyOCR | pip install easyocr | 80+ languages |
| Tesseract | System package | 100+ languages |
Common Patterns
Batch Processing
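from pathlib import Path
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
for pdf in Path("docs").glob("*.pdf"):
    result = converter.convert(str(pdf))
    # Write the Markdown next to the source file (docs/a.pdf -> docs/a.md)
    output = pdf.with_suffix(".md")
    output.write_text(result.document.export_to_markdown(), encoding="utf-8")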
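In a long batch, one unreadable file should not stop the rest. The sketch below isolates failures per file and reports them at the end. It takes the conversion step as a callable so that it runs without Docling installed; the function name `convert_all` and its return shape are assumptions for illustration.

```python
from pathlib import Path
from typing import Callable, Iterable


def convert_all(
    paths: Iterable[Path],
    to_markdown: Callable[[Path], str],
) -> tuple[list[Path], dict[Path, str]]:
    """Write one .md file per input; return (written files, errors by input path)."""
    written: list[Path] = []
    failed: dict[Path, str] = {}
    for path in paths:
        try:
            text = to_markdown(path)
        except Exception as exc:
            # Record the failure and continue with the remaining files.
            failed[path] = str(exc)
            continue
        target = path.with_suffix(".md")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written, failed
```

With Docling, the callable could be `lambda p: converter.convert(str(p)).document.export_to_markdown()`, passed together with `Path("docs").glob("*.pdf")`.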
RAG Chunking
from docling.chunking import HierarchicalChunker
chunker = HierarchicalChunker()
chunks = list(chunker.chunk(result.document))
for chunk in chunks:
    print(chunk.text)
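Hierarchical chunks can be quite short (a single heading or list item), which some retrieval setups handle poorly. One possible post-processing step, sketched below, greedily joins consecutive chunk texts up to a character budget. The helper `merge_chunks` and the 1000-character default are assumptions, not part of Docling.

```python
def merge_chunks(texts: list[str], max_chars: int = 1000, sep: str = "\n\n") -> list[str]:
    """Greedily join consecutive texts so each passage stays within max_chars.

    A single text longer than max_chars is kept whole as its own passage.
    """
    merged: list[str] = []
    current = ""
    for text in texts:
        if not text:
            continue  # ignore empty chunk texts
        candidate = text if not current else current + sep + text
        if current and len(candidate) > max_chars:
            # Adding this text would exceed the budget: close the passage.
            merged.append(current)
            current = text
        else:
            current = candidate
    if current:
        merged.append(current)
    return merged


print(merge_chunks(["Intro", "First point", "Second point"], max_chars=20, sep=" "))
```

Applied to the chunks above, the call would be `merge_chunks([chunk.text for chunk in chunks])`.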
Extract Tables to CSV
for idx, table in enumerate(result.document.tables):
    df = table.export_to_dataframe()
    df.to_csv(f"table_{idx}.csv", index=False)
    print(df.to_markdown())  # Print as Markdown (requires the tabulate package)