| name | pdf-to-markdown |
| description | Extract text from scanned PDF documents and convert to clean markdown. Use when user asks to transcribe, extract text, OCR, or convert a PDF (especially scanned documents, historical documents, or image-based PDFs) to markdown format. Outputs markdown file saved next to the original PDF. |
| allowed-tools | Bash(pdfinfo:*), Bash(qpdf:*) |
PDF to Markdown Extraction
Extract text content from PDF documents and save as clean markdown.
How It Works
Claude is multimodal. Use the Read tool to visually read PDF pages and transcribe the text directly.
Workflow
- Check PDF size - Count pages using
pdfinfo <file>.pdf | grep Pages - Split if large (10+ pages) - Use qpdf to split into 4-page chunks
- Extract text - For small PDFs: read directly. For large PDFs: process chunks in parallel using Task agents
- Merge results - Combine extracted text in page order
- Format as markdown - Apply appropriate heading levels, lists, and formatting
- Review for errors - Check grammar/spelling, fix obvious OCR-style typos
- Save output - Write markdown file next to original PDF with same base name
- Cleanup - Remove temporary chunk files
Handling Large PDFs (10+ pages)
For PDFs with 10 or more pages, split into chunks for parallel processing.
Prerequisites
Ensure qpdf is installed:
# Check if installed
command -v qpdf
# Install if missing (Debian/Ubuntu)
sudo apt install qpdf
Splitting Process
# Create unique temp directory
TMPDIR=$(mktemp -d)
# Split into 4-page chunks
qpdf --split-pages=4 input.pdf "$TMPDIR/chunk-%d.pdf"
This creates files like chunk-1.pdf, chunk-2.pdf, etc.
Parallel Extraction
Launch multiple Task agents concurrently (one per chunk) to extract text. Each agent reads its assigned chunk and returns the extracted text. Collect and merge results in page order.
Cleanup
After merging, remove the temporary directory:
rm -rf "$TMPDIR"
Output Location
Save the .md file in the same directory as the source PDF:
- Input:
/path/to/document.pdf - Output:
/path/to/document.md
Language-Specific Notes
- Preserve diacritics accurately (háčky, čárky)
- Keep abbreviations as in original (e.g.,
čs.,použ.)
Handling Annotations
If the document contains handwritten annotations:
- Use
~~strikethrough~~for crossed-out text - Use
*italics*for handwritten additions - Note unclear annotations in the output, use
(and)to highlight
Quality Checklist
Before saving, verify:
- All pages transcribed
- Heading hierarchy makes sense
- Lists formatted consistently
- No obvious typos or garbled text
- Special characters rendered correctly