Claude Code Plugins

Community-maintained marketplace


document-processing

@ashwinlimaye/advisory-agent

Extract content from documents (PDF, Word, PowerPoint, Excel) as Markdown. Use when working with document attachments, reading files, or when asked to analyze PDF, DOCX, PPTX, or XLSX files. Also covers the workflow for processing attachments before analysis.

Install Skill

1. Download skill

2. Enable skills in Claude

Open claude.ai/settings/capabilities and find the "Skills" section

3. Upload to Claude

Click "Upload skill" and select the downloaded ZIP file

Note: Review the skill's instructions before using it.

SKILL.md

name: document-processing
description: Extract content from documents (PDF, Word, PowerPoint, Excel) as Markdown. Use when working with document attachments, reading files, or when asked to analyze PDF, DOCX, PPTX, or XLSX files. Also covers the workflow for processing attachments before analysis.

Document Processing

This skill covers how to read and analyze document attachments in advisory work.

Supported Formats

| Extension | Type | Best For |
|-----------|------|----------|
| .pdf | PDF | Reports, contracts, board decks |
| .docx | Word | Memos, proposals, narratives |
| .pptx | PowerPoint | Presentations, slide decks |
| .xlsx | Excel | Financial models, data tables |

WORKFLOW: Processing Attachments Before Analysis

When you need to analyze a document attachment:

Step 1: Check if Already Processed

Look for a .md file alongside the original:

ls -la attachments/
# If report.pdf exists, check for report.md
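To audit several attachments at once, a small loop can list which documents still lack a cached .md. This is a sketch, not part of the skill's tooling; it assumes the attachments/ layout and the extensions from the formats table above.

```shell
# Sketch: list attachments that have no cached .md yet.
for f in attachments/*.pdf attachments/*.docx attachments/*.pptx attachments/*.xlsx; do
  [ -e "$f" ] || continue                      # skip glob patterns that matched nothing
  [ -f "${f%.*}.md" ] || echo "needs conversion: $f"
done
```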

Step 2: Convert if Needed

If no .md file exists, convert the document:

python -m src.doc_converter attachments/filename.pdf

This creates a cached .md file with:

  • A summary header describing what's in the file
  • Full markdown content extracted from the document
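Steps 1 and 2 can be combined in one check-then-convert pattern. The sketch below is an illustration: the file path is a placeholder, and the echo stands in for the actual converter invocation.

```shell
# Sketch: convert only when no cached .md exists for this document.
f="attachments/report.pdf"
md="${f%.*}.md"                              # attachments/report.md
if [ -f "$md" ]; then
  echo "using cache: $md"
else
  echo "python -m src.doc_converter $f"      # run this command to create $md
fi
```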

Step 3: Read the Markdown File

Use the generated .md file for your analysis:

cat attachments/filename.md

Or use grep to find specific content:

grep -n "revenue" attachments/filename.md

Conversion Options

# Standard conversion (recommended)
python -m src.doc_converter path/to/file.pdf

# Skip the summary header
python -m src.doc_converter file.pdf --no-summary

# Force re-conversion (ignore cache)
python -m src.doc_converter file.pdf --force

# Generate LLM-powered summary (costs tokens, use sparingly)
python -m src.doc_converter file.pdf --llm-summary

When to Use --llm-summary

Only use --llm-summary when:

  • The user explicitly requests a smart summary
  • The document is complex and the heuristic summary isn't helpful
  • You need to quickly understand a large document before deep analysis

Cost note: the LLM summary uses Claude Haiku with hard caps (8K input characters, 150 output tokens).

Output Format

The converted markdown includes:

---
source: original-filename.pdf
converted: auto-generated markdown via MarkItDown
note: Text extraction may have some loss. For critical analysis, verify against original.
---

> **What's in this file:** [summary of content, structure, key sections]

---

[Full document content as markdown...]

IMPORTANT: Handling Extraction Limitations

MarkItDown text extraction may have some information loss, especially for:

  • Complex layouts and multi-column documents
  • Embedded images and charts (only alt-text/captions extracted)
  • Scanned PDFs (limited OCR capability)
  • Heavily formatted tables

When to Verify Against Original

If your analysis requires high confidence on specific details:

  1. Call out the limitation to the user:

    "Note: This analysis is based on extracted text from the PDF. Some formatting or embedded content may not have been fully captured."

  2. For critical verification (max 5 pages per analysis to control costs):

    • Identify the specific pages with relevant content using grep/search
    • You may request OCR-style analysis of those specific pages via Claude API
    • Only do this when high confidence is required AND the extracted text seems incomplete

Token Budget Awareness

When doing deep document analysis:

  • Start with the extracted .md file (free, fast)
  • Use grep to locate relevant sections (free, fast)
  • Only escalate to page-level OCR verification when necessary
  • Hard limit: Maximum 5 pages for OCR-style verification per analysis task

Finding Content in Large Documents

For large documents, use search tools to locate relevant content:

# Find mentions of specific terms
grep -n "quarterly revenue" attachments/board-deck.md

# Find all headings
grep -n "^#" attachments/board-deck.md

# Find tables (markdown separator rows)
grep -n '|---' attachments/board-deck.md
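Once grep has located a heading, the whole section under it can be pulled out with awk. This is a sketch: the heading text is a placeholder, and the sample file stands in for attachments/board-deck.md.

```shell
# Sketch: print everything from one "##" heading up to the next "##" heading.
cat > /tmp/deck.md <<'EOF'
# Board Deck
## Revenue
Q3 revenue grew 12%.
## Costs
Flat quarter over quarter.
EOF
awk '/^## Revenue$/ {p=1} /^## / && !/^## Revenue$/ {p=0} p' /tmp/deck.md
```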

Updating attachments/INDEX.md

After processing new attachments, update the index:

## Attachments Index

| File | Type | Summary | Processed |
|------|------|---------|-----------|
| Q3-board-deck.pdf | PDF | Q3 2025 board presentation | ✓ |
| financial-model.xlsx | Excel | 5-year financial projections | ✓ |
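Appending a row can be scripted. In this sketch the file name and summary are placeholders, and a /tmp path stands in for attachments/INDEX.md.

```shell
# Sketch: append one row to the attachments index.
INDEX=/tmp/INDEX.md                           # in practice: attachments/INDEX.md
printf '| %s | %s | %s | ✓ |\n' "memo.docx" "Word" "Hiring memo" >> "$INDEX"
tail -n 1 "$INDEX"
```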

Notes

  • Conversion is cached: the first run creates the .md file; subsequent runs reuse it
  • Use --force if the source file has been updated since conversion
  • Legacy formats (.doc, .xls, .ppt) have limited support
  • Password-protected files will fail to convert; inform the user
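The --force note can be automated by comparing modification times. A sketch, with /tmp paths standing in for attachments/; the touch lines only simulate a source file edited after its last conversion.

```shell
# Sketch: detect a stale cache by timestamp.
src=/tmp/report.pdf; md=/tmp/report.md
touch -t 202001010000 "$md"                  # old cached conversion
touch "$src"                                 # freshly edited source
if [ "$src" -nt "$md" ]; then
  echo "stale cache: re-run with --force"
fi
```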