| name | |
| description | Process PDF files - extract text, create PDFs, merge documents. Use when user asks to read PDF, create PDF, or work with PDF files. |
PDF Processing Skill
You now have expertise in PDF manipulation. Follow these workflows:
Reading PDFs
Option 1: Quick text extraction (preferred)
# Using pdftotext (poppler-utils)
pdftotext input.pdf - # Output to stdout
pdftotext input.pdf output.txt # Output to file
# If pdftotext is not available, try:
python3 -c "
import fitz # PyMuPDF
doc = fitz.open('input.pdf')
for page in doc:
    print(page.get_text())
"
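The fallback described above can also be written as one Python helper. This is a minimal sketch, not a definitive implementation: the function name `extract_text` is hypothetical, and it assumes pdftotext writes UTF-8 (its usual default) and that PyMuPDF is installed whenever pdftotext is missing.

```python
import shutil
import subprocess
from pathlib import Path


def extract_text(pdf_path: str) -> str:
    """Return the text of a PDF, preferring pdftotext and falling back to PyMuPDF."""
    if not Path(pdf_path).is_file():
        raise FileNotFoundError(pdf_path)

    if shutil.which("pdftotext"):
        # "-" as the output file makes pdftotext write to stdout.
        done = subprocess.run(
            ["pdftotext", pdf_path, "-"],
            capture_output=True,
            encoding="utf-8",   # assumption: default UTF-8 output
            errors="replace",
            check=True,
        )
        return done.stdout

    try:
        import fitz  # PyMuPDF, imported lazily so the pdftotext path needs no packages
    except ImportError as exc:
        raise RuntimeError("Neither pdftotext nor PyMuPDF is available") from exc

    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text() for page in doc)
```

Checking the path first means a missing file produces the same error regardless of which backend is installed.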
Option 2: Page-by-page with metadata
import fitz # pip install pymupdf
doc = fitz.open("input.pdf")
print(f"Pages: {len(doc)}")
print(f"Metadata: {doc.metadata}")
for i, page in enumerate(doc):
    text = page.get_text()
    print(f"--- Page {i+1} ---")
    print(text)
Creating PDFs
Option 1: From Markdown (recommended)
# Using pandoc
pandoc input.md -o output.pdf
# With custom styling
pandoc input.md -o output.pdf --pdf-engine=xelatex -V geometry:margin=1in
Option 2: Programmatically
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
c = canvas.Canvas("output.pdf", pagesize=letter)
c.drawString(100, 750, "Hello, PDF!")
c.save()
Option 3: From HTML
# Using wkhtmltopdf
wkhtmltopdf input.html output.pdf
# Or with Python
python3 -c "
import pdfkit
pdfkit.from_file('input.html', 'output.pdf')
"
Merging PDFs
import fitz
result = fitz.open()
for pdf_path in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    doc = fitz.open(pdf_path)
    result.insert_pdf(doc)
    doc.close()  # release each source file once its pages are copied
result.save("merged.pdf")
Splitting PDFs
import fitz
doc = fitz.open("input.pdf")
for i in range(len(doc)):
    single = fitz.open()
    single.insert_pdf(doc, from_page=i, to_page=i)
    single.save(f"page_{i+1}.pdf")
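The loop above writes one file per page. To split into multi-page chunks instead, the page ranges can be computed first. The helper below is an illustrative sketch; `chunk_ranges` is a hypothetical name, and it returns zero-based, inclusive (from_page, to_page) pairs to match how insert_pdf is called above.

```python
from __future__ import annotations


def chunk_ranges(n_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return zero-based, inclusive (from_page, to_page) pairs covering n_pages."""
    if n_pages < 0 or chunk_size < 1:
        raise ValueError("n_pages must be >= 0 and chunk_size must be >= 1")
    return [
        # The last chunk is clipped so it never runs past the final page.
        (start, min(start + chunk_size, n_pages) - 1)
        for start in range(0, n_pages, chunk_size)
    ]


print(chunk_ranges(10, 4))  # [(0, 3), (4, 7), (8, 9)]
```

Each pair can be passed as from_page and to_page to insert_pdf, with len(doc) as n_pages.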
Key Libraries
| Task | Library | Install |
|---|---|---|
| Read/Write/Merge | PyMuPDF | pip install pymupdf |
| Create from scratch | ReportLab | pip install reportlab |
| HTML to PDF | pdfkit | pip install pdfkit (also requires the wkhtmltopdf binary) |
| Text extraction | pdftotext | brew install poppler / apt install poppler-utils |
Best Practices
- Always check if tools are installed before using them
- Handle encoding issues - PDFs may contain various character encodings
- Large PDFs: Process page by page to avoid memory issues
- OCR for scanned PDFs: Use pytesseract if text extraction returns empty
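The first practice, checking that tools are installed, can be sketched with the standard library alone. The function name `check_tools` and the two tuples are illustrative assumptions; the entries are the command-line tools and Python modules named in this skill (fitz is the import name for PyMuPDF).

```python
from __future__ import annotations

import importlib.util
import shutil

CLI_TOOLS = ("pdftotext", "pandoc", "wkhtmltopdf")
PY_MODULES = ("fitz", "reportlab", "pdfkit", "pytesseract")


def check_tools() -> dict[str, bool]:
    """Map each tool or module name to whether it can be found on this machine."""
    # shutil.which looks the executable up on PATH without running it.
    status = {name: shutil.which(name) is not None for name in CLI_TOOLS}
    # find_spec locates a module without importing it.
    status.update(
        {name: importlib.util.find_spec(name) is not None for name in PY_MODULES}
    )
    return status


if __name__ == "__main__":
    for name, available in check_tools().items():
        print(f"{name}: {'ok' if available else 'missing'}")
```

Running this before a workflow shows which of the options above are usable without triggering an import error or a failed command.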