name	document-processing
description	Extract content from documents (PDF, Word, PowerPoint, Excel) as Markdown. Use when working with document attachments, reading files, or when asked to analyze PDF, DOCX, PPTX, or XLSX files. Also covers the workflow for processing attachments before analysis.

Document Processing

This skill covers how to read and analyze document attachments in advisory work.

Supported Formats

Extension	Type	Best For
`.pdf`	PDF	Reports, contracts, board decks
`.docx`	Word	Memos, proposals, narratives
`.pptx`	PowerPoint	Presentations, slide decks
`.xlsx`	Excel	Financial models, data tables
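
As a rough illustration, the table above can be expressed as a lookup keyed on file extension. This is only a sketch: `classify` and the labels it returns are hypothetical names, not part of `src.doc_converter`, and the legacy set is taken from the Notes section of this skill.

```python
from pathlib import Path

# Extensions and types from the Supported Formats table.
SUPPORTED = {".pdf": "PDF", ".docx": "Word", ".pptx": "PowerPoint", ".xlsx": "Excel"}
# Legacy extensions that the Notes section describes as having limited support.
LEGACY = {".doc", ".xls", ".ppt"}


def classify(path: str) -> str:
    """Return the document type for a path, based only on its extension."""
    ext = Path(path).suffix.lower()
    if ext in SUPPORTED:
        return SUPPORTED[ext]
    if ext in LEGACY:
        return "legacy (limited support)"
    return "not listed"
```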

WORKFLOW: Processing Attachments Before Analysis

When you need to analyze a document attachment:

Step 1: Check if Already Processed

Look for a .md file alongside the original:

ls -la attachments/
# If report.pdf exists, check for report.md
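
The same check can be sketched in Python. It assumes, as the comment above suggests, that the cached file has the original's name with a `.md` extension; `cached_markdown_path` and `is_processed` are illustrative names rather than functions of the converter.

```python
from pathlib import Path


def cached_markdown_path(source: Path) -> Path:
    """Path where the cached Markdown for `source` is expected to live."""
    return source.with_suffix(".md")


def is_processed(source: Path) -> bool:
    """True if a .md file already sits alongside the original document."""
    return cached_markdown_path(source).is_file()
```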

Step 2: Convert if Needed

If no .md file exists, convert the document:

python -m src.doc_converter attachments/filename.pdf

This creates a cached .md file with:

A summary header describing what's in the file
Full markdown content extracted from the document

Step 3: Read the Markdown File

Use the generated .md file for your analysis:

cat attachments/filename.md

Or use grep to find specific content:

grep -n "revenue" attachments/filename.md

Conversion Options

# Standard conversion (recommended)
python -m src.doc_converter path/to/file.pdf

# Skip the summary header
python -m src.doc_converter file.pdf --no-summary

# Force re-conversion (ignore cache)
python -m src.doc_converter file.pdf --force

# Generate LLM-powered summary (costs tokens, use sparingly)
python -m src.doc_converter file.pdf --llm-summary
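
If conversions are launched from a script, the flags above could be assembled as follows. This is a minimal sketch: `build_convert_command` is a hypothetical helper, it uses only the flags listed in this section, and it assumes the command is run from a directory where `src.doc_converter` is importable.

```python
from typing import List


def build_convert_command(
    path: str,
    *,
    no_summary: bool = False,
    force: bool = False,
    llm_summary: bool = False,
) -> List[str]:
    """Build the argument list for one conversion run."""
    cmd = ["python", "-m", "src.doc_converter", path]
    if no_summary:
        cmd.append("--no-summary")   # skip the summary header
    if force:
        cmd.append("--force")        # ignore the cached .md file
    if llm_summary:
        cmd.append("--llm-summary")  # costs tokens, use sparingly
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which raises if the process exits with a non-zero status.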

When to Use --llm-summary

Only use --llm-summary when:

The user explicitly requests a smart summary
The document is complex and the heuristic summary isn't helpful
You need to quickly understand a large document before deep analysis

Cost note: the LLM summary uses Claude Haiku with hard limits (8K input characters, 150 output tokens).

Output Format

The converted markdown includes:

---
source: original-filename.pdf
converted: auto-generated markdown via MarkItDown
note: Text extraction may have some loss. For critical analysis, verify against original.
---

> **What's in this file:** [summary of content, structure, key sections]

---

[Full document content as markdown...]
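
When a script needs the pieces of this layout separately, the file can be split along the `---` markers. The sketch below assumes the layout is exactly as shown above (a header block, an optional summary line, a divider, then the body); `split_converted` is an illustrative name, and real output may differ in details.

```python
from typing import Dict, Optional, Tuple

SUMMARY_PREFIX = "> **What's in this file:**"


def split_converted(text: str) -> Tuple[Dict[str, str], Optional[str], str]:
    """Split converted Markdown into (header fields, summary, body)."""
    lines = text.splitlines()
    meta: Dict[str, str] = {}
    pos = 0
    # Header block: "key: value" lines between two '---' markers.
    if lines and lines[0].strip() == "---":
        pos = 1
        while pos < len(lines) and lines[pos].strip() != "---":
            key, sep, value = lines[pos].partition(":")
            if sep:
                meta[key.strip()] = value.strip()
            pos += 1
        pos += 1  # step past the closing '---'
    # Skip blank lines before the summary.
    while pos < len(lines) and not lines[pos].strip():
        pos += 1
    summary = None
    if pos < len(lines) and lines[pos].startswith(SUMMARY_PREFIX):
        summary = lines[pos][len(SUMMARY_PREFIX):].strip()
        pos += 1
        # Skip blank lines and the '---' divider that follows the summary.
        while pos < len(lines) and lines[pos].strip() in ("", "---"):
            pos += 1
    return meta, summary, "\n".join(lines[pos:])
```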

IMPORTANT: Handling Extraction Limitations

MarkItDown text extraction may have some information loss, especially for:

Complex layouts and multi-column documents
Embedded images and charts (only alt-text/captions extracted)
Scanned PDFs (limited OCR capability)
Heavily formatted tables

When to Verify Against Original

If your analysis requires high confidence on specific details:

Call out the limitation to the user:

"Note: This analysis is based on extracted text from the PDF. Some formatting or embedded content may not have been fully captured."

For critical verification (max 5 pages per analysis to control costs):

- Identify the specific pages with relevant content using grep/search
- You may request OCR-style analysis of those specific pages via the Claude API
- Only do this when high confidence is required AND the extracted text seems incomplete

Token Budget Awareness

When doing deep document analysis:

Start with the extracted .md file (free, fast)
Use grep to locate relevant sections (free, fast)
Only escalate to page-level OCR verification when necessary
Hard limit: Maximum 5 pages for OCR-style verification per analysis task
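
One way to keep the 5-page limit from being exceeded by accident is to cap the candidate list in code before any verification request is made. This is a sketch under assumptions: `pages_to_verify` is a hypothetical helper, and it presumes page numbers have already been identified by some other means, with the most relevant listed first.

```python
from typing import Iterable, List

MAX_VERIFY_PAGES = 5  # hard limit per analysis task, from this skill


def pages_to_verify(candidates: Iterable[int], limit: int = MAX_VERIFY_PAGES) -> List[int]:
    """Drop duplicate pages, keep first-seen order, and cap at `limit`."""
    unique = list(dict.fromkeys(candidates))
    return unique[:limit]
```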

Finding Content in Large Documents

For large documents, use search tools to locate relevant content:

# Find mentions of specific terms
grep -n "quarterly revenue" attachments/board-deck.md

# Find all headings
grep -n "^#" attachments/board-deck.md

# Find tables (matches the header separator row, e.g. |---| or | --- |)
grep -n "|[ :]*---" attachments/board-deck.md
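
Where a shell is not available, the same searches can be approximated in Python. This is an illustrative sketch: `grep_n` is a hypothetical helper that mimics `grep -n` over a string, and the two patterns are assumptions about how headings and table separator rows appear in the extracted Markdown.

```python
import re
from typing import List, Tuple

HEADINGS = r"^#"                      # lines starting with '#'
TABLE_RULE = r"^\s*\|\s*:?-{3,}"      # separator rows such as |---| or | --- |


def grep_n(text: str, pattern: str) -> List[Tuple[int, str]]:
    """Return (line number, line) pairs for lines matching `pattern`."""
    regex = re.compile(pattern)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]
```

For example, `grep_n(Path("attachments/board-deck.md").read_text(), "quarterly revenue")` would list the matching lines with their line numbers, given `from pathlib import Path`.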

Updating attachments/INDEX.md

After processing new attachments, update the index:

## Attachments Index

| File | Type | Summary | Processed |
|------|------|---------|-----------|
| Q3-board-deck.pdf | PDF | Q3 2025 board presentation | ✓ |
| financial-model.xlsx | Excel | 5-year financial projections | ✓ |
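
Index rows in this shape can also be generated rather than typed by hand. The helpers below are a sketch: `index_row` and `add_to_index` are hypothetical names, and the header they create for an empty index simply mirrors the example above.

```python
HEADER = (
    "## Attachments Index\n"
    "\n"
    "| File | Type | Summary | Processed |\n"
    "|------|------|---------|-----------|\n"
)


def index_row(filename: str, doc_type: str, summary: str, processed: bool = True) -> str:
    """Format one table row; pipes in the summary are escaped to keep the table intact."""
    mark = "✓" if processed else ""
    safe_summary = summary.replace("|", "\\|")
    return f"| {filename} | {doc_type} | {safe_summary} | {mark} |"


def add_to_index(index_text: str, row: str) -> str:
    """Append a row to the index text, creating the header if the index is empty."""
    if not index_text.strip():
        index_text = HEADER
    if not index_text.endswith("\n"):
        index_text += "\n"
    return index_text + row + "\n"
```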

Notes

Conversion is cached: the first run creates the .md file, and subsequent runs reuse it
Use --force if the source file has been updated
Legacy formats (.doc, .xls, .ppt) have limited support
Password-protected files will fail - inform the user
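
The caching notes above can be turned into a simple decision rule. This sketch compares modification times, which is an assumption: how the converter itself detects a stale cache is not documented here, and `conversion_action` and `action_for` are hypothetical names.

```python
from pathlib import Path
from typing import Optional


def conversion_action(source_mtime: float, cache_mtime: Optional[float]) -> str:
    """Decide what to do, given modification times in seconds."""
    if cache_mtime is None:
        return "convert"   # no cached .md yet: run a standard conversion
    if source_mtime > cache_mtime:
        return "force"     # source changed after caching: re-run with --force
    return "reuse"         # cache is at least as new as the source


def action_for(source: Path) -> str:
    """Apply the rule to a document on disk and its sibling .md file."""
    cache = source.with_suffix(".md")
    cache_mtime = cache.stat().st_mtime if cache.is_file() else None
    return conversion_action(source.stat().st_mtime, cache_mtime)
```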
