| name | document-processing |
| description | Extract content from documents (PDF, Word, PowerPoint, Excel) as Markdown. Use when working with document attachments, reading files, or when asked to analyze PDF, DOCX, PPTX, or XLSX files. Also covers the workflow for processing attachments before analysis. |
Document Processing
This skill covers how to read and analyze document attachments in advisory work.
Supported Formats
| Extension | Type | Best For |
|---|---|---|
| .pdf | PDF | Reports, contracts, board decks |
| .docx | Word | Memos, proposals, narratives |
| .pptx | PowerPoint | Presentations, slide decks |
| .xlsx | Excel | Financial models, data tables |
WORKFLOW: Processing Attachments Before Analysis
When you need to analyze a document attachment:
Step 1: Check if Already Processed
Look for a .md file alongside the original:
ls -la attachments/
# If report.pdf exists, check for report.md
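The same check can be scripted. This is a minimal sketch, assuming the cached Markdown file sits next to the original and shares its stem (as the `report.pdf` / `report.md` example suggests). `cached_markdown_path` is a hypothetical helper, not part of `src.doc_converter`.

```python
from pathlib import Path
from typing import Optional


def cached_markdown_path(source: Path) -> Optional[Path]:
    """Return the sibling .md file for `source` if it exists, else None.

    Assumption: the cache lives next to the original with the same stem,
    e.g. attachments/report.pdf -> attachments/report.md.
    """
    candidate = source.with_suffix(".md")
    return candidate if candidate.is_file() else None
```

A `None` result means the document still needs converting.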
Step 2: Convert if Needed
If no .md file exists, convert the document:
python -m src.doc_converter attachments/filename.pdf
This creates a cached .md file with:
- A summary header describing what's in the file
- Full markdown content extracted from the document
Step 3: Read the Markdown File
Use the generated .md file for your analysis:
cat attachments/filename.md
Or use grep to find specific content:
grep -n "revenue" attachments/filename.md
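The three steps above can be chained in one function. This is a sketch under stated assumptions: the converter writes its output to a sibling `.md` file, and the command runs from a directory where `src.doc_converter` is importable. `run_converter` and `load_markdown` are hypothetical names introduced for illustration.

```python
import subprocess
import sys
from pathlib import Path
from typing import Callable, Sequence


def run_converter(source: Path, extra_flags: Sequence[str] = ()) -> None:
    """Invoke the converter CLI as in Step 2 (python -m src.doc_converter <file>)."""
    cmd = [sys.executable, "-m", "src.doc_converter", str(source), *extra_flags]
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


def load_markdown(source: Path, convert: Callable[[Path], None] = run_converter) -> str:
    """Return the Markdown for `source`, converting only when no cache exists."""
    md_path = source.with_suffix(".md")  # assumed cache location
    if not md_path.is_file():            # Step 1: already processed?
        convert(source)                  # Step 2: convert if needed
    return md_path.read_text(encoding="utf-8")  # Step 3: read the Markdown
```

Passing `convert` as a parameter keeps the function testable without invoking the real converter.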
Conversion Options
# Standard conversion (recommended)
python -m src.doc_converter path/to/file.pdf
# Skip the summary header
python -m src.doc_converter file.pdf --no-summary
# Force re-conversion (ignore cache)
python -m src.doc_converter file.pdf --force
# Generate LLM-powered summary (costs tokens, use sparingly)
python -m src.doc_converter file.pdf --llm-summary
When to Use --llm-summary
Only use --llm-summary when:
- The user explicitly requests a smart summary
- The document is complex and the heuristic summary isn't helpful
- You need to quickly understand a large document before deep analysis
Cost note: The LLM summary uses Claude Haiku with hard limits (8K input characters, 150 output tokens).
Output Format
The converted markdown includes:
---
source: original-filename.pdf
converted: auto-generated markdown via MarkItDown
note: Text extraction may have some loss. For critical analysis, verify against original.
---
> **What's in this file:** [summary of content, structure, key sections]
---
[Full document content as markdown...]
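If you need the header fields programmatically, the layout above can be split with plain string handling. This is a rough sketch that assumes the file follows the format shown exactly, with a single-line summary. `split_converted_markdown` is a hypothetical helper; the converter may not expose anything like it.

```python
from typing import Dict, Tuple


def split_converted_markdown(text: str) -> Tuple[Dict[str, str], str, str]:
    """Split converted output into (metadata, summary, body)."""
    lines = text.splitlines()
    meta: Dict[str, str] = {}
    summary = ""
    i = 0
    # Metadata block: "key: value" lines between the first pair of '---' lines.
    if lines and lines[0].strip() == "---":
        i = 1
        while i < len(lines) and lines[i].strip() != "---":
            key, sep, value = lines[i].partition(":")
            if sep:
                meta[key.strip()] = value.strip()
            i += 1
        i += 1  # skip the closing '---'
    while i < len(lines) and not lines[i].strip():
        i += 1  # skip blank lines
    # Optional summary: one blockquote line followed by a '---' separator.
    # It is absent when the file was converted with --no-summary.
    if i < len(lines) and lines[i].lstrip().startswith(">"):
        summary = lines[i].lstrip()[1:].strip()
        i += 1
        while i < len(lines) and not lines[i].strip():
            i += 1
        if i < len(lines) and lines[i].strip() == "---":
            i += 1
    body = "\n".join(lines[i:]).strip()
    return meta, summary, body
```

One known limitation of this sketch: a document converted without a summary whose body begins with a blockquote would be misread as having a summary.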
IMPORTANT: Handling Extraction Limitations
MarkItDown text extraction may have some information loss, especially for:
- Complex layouts and multi-column documents
- Embedded images and charts (only alt-text/captions extracted)
- Scanned PDFs (limited OCR capability)
- Heavily formatted tables
When to Verify Against Original
If your analysis requires high confidence on specific details:
Call out the limitation to the user:
"Note: This analysis is based on extracted text from the PDF. Some formatting or embedded content may not have been fully captured."
For critical verification (max 5 pages per analysis to control costs):
- Identify the specific pages with relevant content using grep/search
- You may request OCR-style analysis of those specific pages via Claude API
- Only do this when high confidence is required AND the extracted text seems incomplete
Token Budget Awareness
When doing deep document analysis:
- Start with the extracted .md file (free, fast)
- Use grep to locate relevant sections (free, fast)
- Only escalate to page-level OCR verification when necessary
- Hard limit: Maximum 5 pages for OCR-style verification per analysis task
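The 5-page cap is easy to enforce in code. This is a small sketch, assuming you already have a list of candidate page numbers. How those are obtained is not specified here, and the extracted Markdown may not carry page markers. `pages_to_verify` is a hypothetical helper.

```python
from typing import Iterable, List

MAX_OCR_PAGES = 5  # hard limit per analysis task, as stated above


def pages_to_verify(candidates: Iterable[int], limit: int = MAX_OCR_PAGES) -> List[int]:
    """Deduplicate candidate pages in first-seen order and cap the result at `limit`."""
    selected: List[int] = []
    for page in candidates:
        if len(selected) >= limit:
            break
        if page not in selected:
            selected.append(page)
    return selected
```

Ordering the candidates by relevance before calling this keeps the most important pages inside the cap.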
Finding Content in Large Documents
For large documents, use search tools to locate relevant content:
# Find mentions of specific terms
grep -n "quarterly revenue" attachments/board-deck.md
# Find all headings
grep -n "^#" attachments/board-deck.md
# Find tables
grep -n -F "|---" attachments/board-deck.md
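When the Markdown is already loaded in Python, the same searches can be done without shelling out. This is a minimal sketch. `grep_lines` is a hypothetical helper that mimics `grep -n` using the standard `re` module.

```python
import re
from typing import List, Tuple


def grep_lines(text: str, pattern: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs matching `pattern`, 1-indexed like grep -n."""
    regex = re.compile(pattern)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]


# Patterns corresponding to the three searches above.
TERM = re.escape("quarterly revenue")  # literal phrase
HEADINGS = r"^#"                       # Markdown headings
TABLE_SEPARATORS = r"\|\s*-{3,}"       # table separator rows such as |---|---|
```

For example, `grep_lines(markdown_text, HEADINGS)` returns a quick outline of the document with line numbers.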
Updating attachments/INDEX.md
After processing new attachments, update the index:
## Attachments Index
| File | Type | Summary | Processed |
|------|------|---------|-----------|
| Q3-board-deck.pdf | PDF | Q3 2025 board presentation | ✓ |
| financial-model.xlsx | Excel | 5-year financial projections | ✓ |
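Appending a row can be automated. This is a sketch under stated assumptions: the index table is the last thing in `INDEX.md`, and rows use the four columns shown above. `add_index_row` is a hypothetical helper, not an existing project function.

```python
from pathlib import Path

INDEX_HEADER = (
    "## Attachments Index\n"
    "| File | Type | Summary | Processed |\n"
    "|------|------|---------|-----------|\n"
)


def add_index_row(index_path: Path, filename: str, doc_type: str, summary: str) -> bool:
    """Append a row for `filename` unless one exists. Returns True if a row was added."""
    if index_path.exists():
        text = index_path.read_text(encoding="utf-8")
    else:
        text = INDEX_HEADER  # start a fresh index with the table header
    if f"| {filename} |" in text:
        return False  # already indexed, leave the file untouched
    if not text.endswith("\n"):
        text += "\n"
    # Escape pipes so the summary cannot break the table row.
    safe_summary = summary.replace("|", "\\|")
    text += f"| {filename} | {doc_type} | {safe_summary} | ✓ |\n"
    index_path.write_text(text, encoding="utf-8")
    return True
```

Calling it twice with the same filename adds only one row, so it is safe to run after every conversion.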
Notes
- Conversion is cached: first run creates .md, subsequent reads use cache
- Use --force if the source file has been updated
- Legacy formats (.doc, .xls, .ppt) have limited support
- Password-protected files will fail - inform the user
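One way to decide when `--force` is warranted is to compare modification times. This is a heuristic sketch, not a description of how the converter's own cache works. `cache_is_stale` is a hypothetical helper, and the sibling `.md` location is an assumption.

```python
from pathlib import Path


def cache_is_stale(source: Path) -> bool:
    """True if the sibling .md is missing or older than the source file."""
    md_path = source.with_suffix(".md")  # assumed cache location
    if not md_path.is_file():
        return True
    return source.stat().st_mtime > md_path.stat().st_mtime
```

If this returns `True` for a file that already has a `.md`, re-run the conversion with `--force`.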