Claude Code Plugins

Community-maintained marketplace

Install Skill

  1. Download skill
  2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
  3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Verify the skill by reading through its instructions before using it.

SKILL.md

name: pdf-extractor
description: Extract text, tables, and images from PDF files. Use when converting PDF documentation, manuals, or reports to searchable text.

PDF Extractor Skill

Purpose

Single responsibility: Extract structured content (text, tables, images) from PDF files into organized, searchable formats. (BP-4)

Grounding Checkpoint (Archetype 1 Mitigation)

Before executing, VERIFY:

  • PDF file exists and is readable (file <path> confirms PDF format)
  • PDF is not corrupted (pdfinfo <path> returns metadata)
  • Password known if encrypted
  • Output directory is writable
  • Required tools available (pdfplumber; pytesseract if OCR is needed)

DO NOT proceed without verification. Inspect PDF metadata first.
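
The checklist above can be sketched as a preflight function. This is a minimal illustration, not part of the skill itself: the name `preflight` and the wording of its messages are hypothetical, and it checks only the `%PDF-` header rather than running `pdfinfo`, so it does not detect corruption or encryption.

```python
import importlib.util
import os
import shutil
from pathlib import Path


def preflight(pdf_path: str, output_dir: str) -> list[str]:
    """Return a list of problems; an empty list means it is safe to proceed."""
    problems = []
    pdf = Path(pdf_path)

    if not pdf.is_file():
        problems.append(f"PDF not found: {pdf}")
    else:
        # A well-formed PDF starts with the magic bytes "%PDF-".
        with pdf.open("rb") as fh:
            if fh.read(5) != b"%PDF-":
                problems.append(f"Not a PDF (bad header): {pdf}")

    out = Path(output_dir)
    if not out.is_dir() or not os.access(out, os.W_OK):
        problems.append(f"Output directory missing or not writable: {out}")

    # pdfinfo ships with poppler-utils; pdfplumber is a Python package.
    if shutil.which("pdfinfo") is None:
        problems.append("pdfinfo not on PATH")
    if importlib.util.find_spec("pdfplumber") is None:
        problems.append("pdfplumber not installed")

    return problems
```

If the returned list is non-empty, the skill would report the problems and stop instead of starting extraction.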

Uncertainty Escalation (Archetype 2 Mitigation)

ASK USER instead of guessing when:

  • PDF appears to be scanned (needs OCR) but OCR tools unavailable
  • Multiple table formats detected - unclear which parser to use
  • Password-protected but no password provided
  • Image extraction quality unclear (resolution, format preferences)
  • Language detection needed for OCR

NEVER assume PDF structure without inspection.

Context Scope (Archetype 3 Mitigation)

| Context Type | Included | Excluded |
|---|---|---|
| RELEVANT | Target PDF, extraction options, output path | Other PDF files |
| PERIPHERAL | Similar PDF structure examples | Unrelated documents |
| DISTRACTOR | Previous extraction attempts | Other file formats |

Workflow Steps

Step 1: Inspect PDF (Grounding)

# Check file type
file document.pdf

# Get PDF metadata
pdfinfo document.pdf

# Check page count
pdfinfo document.pdf | grep Pages

# Check if encrypted
pdfinfo document.pdf | grep Encrypted

Step 2: Determine Extraction Strategy

| PDF Type | Detection | Strategy |
|---|---|---|
| Text-based | pdftotext produces readable text | Direct extraction |
| Scanned/Image | pdftotext produces empty/garbled output | OCR required |
| Mixed | Some pages text, some images | Hybrid approach |
| Tables | Visual grid patterns | Table extraction mode |
| Forms | Interactive fields | Form field extraction |
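
The first three rows can be approximated by counting how much text each page yields. The sketch below assumes the per-page text has already been extracted (for example with pdftotext); the function name `classify_pdf` and the 20-character threshold are illustrative assumptions, and the check detects empty pages only, not garbled ones.

```python
def classify_pdf(page_texts: list[str], min_chars: int = 20) -> str:
    """Classify a PDF from per-page extracted text.

    min_chars is an assumed cutoff: a page with fewer non-whitespace-trimmed
    characters than this is treated as an image-only (scanned) page.
    """
    if not page_texts:
        return "unknown"
    has_text = [len(text.strip()) >= min_chars for text in page_texts]
    if all(has_text):
        return "text-based"   # direct extraction
    if not any(has_text):
        return "scanned"      # OCR required
    return "mixed"            # hybrid approach
```

For example, `classify_pdf(["x" * 50, ""])` returns `"mixed"`, which would route the empty page to OCR and the other page to direct extraction.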

Step 3: Execute Extraction

Option A: With skill-seekers (if installed)

# Basic extraction
skill-seekers pdf --pdf document.pdf --name myskill

# With table extraction
skill-seekers pdf --pdf document.pdf --name myskill --extract-tables

# With OCR for scanned docs
skill-seekers pdf --pdf document.pdf --name myskill --ocr

# With parallel processing (large PDFs)
skill-seekers pdf --pdf document.pdf --name myskill --parallel --workers 8

# Password-protected
skill-seekers pdf --pdf document.pdf --name myskill --password "secret"

Option B: Manual extraction guidance

# Basic text extraction
pdftotext -layout document.pdf output.txt

# Extract with Unix line endings (pdftotext separates pages with form feed characters by default)
pdftotext -layout -eol unix document.pdf output.txt

# Extract images (the last argument is a filename prefix, so the directory must exist)
mkdir -p images
pdfimages -all document.pdf images/img

# OCR scanned PDF (requires tesseract): render at 300 DPI, then OCR one page at a time
pdftoppm -r 300 -png document.pdf page
for f in page-*.png; do tesseract "$f" "${f%.png}" -l eng; done
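
Unless `-nopgbrk` is given, pdftotext separates pages with a form feed character (`\f`), which makes it possible to recover per-page text from a single output file. A minimal sketch, with `split_pages` as a hypothetical helper name:

```python
def split_pages(text: str) -> dict[int, str]:
    """Map 1-based page numbers to page text, splitting on form feed."""
    pages = text.split("\f")
    # The last page is typically followed by a form feed too,
    # which leaves an empty trailing element after the split.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return {number: page for number, page in enumerate(pages, start=1)}
```

Reading `output.txt` and passing its contents to `split_pages` gives a dictionary that can be chunked, checked for empty pages, or written out with explicit page markers.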

Step 4: Validate Output

# Check extraction quality
head -100 output/<skill-name>/references/content.md

# Verify table extraction
grep -A 10 "| " output/<skill-name>/references/*.md

# Check image extraction
ls -la output/<skill-name>/assets/images/

Recovery Protocol (Archetype 4 Mitigation)

On error:

  1. PAUSE - Stop extraction, preserve partial output
  2. DIAGNOSE - Check error type:
    • File not found → Verify path
    • Password required → Ask user for password
    • Corrupt PDF → Check with qpdf --check, then attempt repair by rewriting the file (qpdf input.pdf repaired.pdf)
    • OCR failed → Check tesseract installation, language packs
    • Memory error → Process in chunks, reduce workers
  3. ADAPT - Switch strategy based on diagnosis
  4. RETRY - Resume with adapted approach (max 3 attempts)
  5. ESCALATE - If all attempts fail, ask user for guidance
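
One way to read the protocol is as a bounded loop over progressively adapted strategies. The sketch below is an assumption about how that could look, not the skill's implementation: `strategies` is a hypothetical list of zero-argument callables (for example direct extraction, then chunked extraction, then OCR), and `ask_user` is a hypothetical callback that receives the collected errors.

```python
from typing import Callable

MAX_ATTEMPTS = 3  # matches "max 3 attempts" in the protocol


def extract_with_recovery(
    strategies: list[Callable[[], str]],
    ask_user: Callable[[list[Exception]], str],
) -> str:
    """Try each strategy in order; escalate once the attempts are used up."""
    errors: list[Exception] = []
    for attempt in strategies[:MAX_ATTEMPTS]:
        try:
            return attempt()
        except Exception as exc:
            # PAUSE and DIAGNOSE: record the failure for later inspection.
            # ADAPT and RETRY: the next strategy in the list is the adapted approach.
            errors.append(exc)
    # ESCALATE: hand the collected errors to the user-facing callback.
    return ask_user(errors)
```

Keeping the errors in a list means the escalation message can show the user every diagnosis that was tried, not just the last failure.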

Checkpoint Support

State saved to: .aiwg/working/checkpoints/pdf-extractor/

For large PDFs, extraction saves progress per chunk:

checkpoints/pdf-extractor/
├── document_metadata.json
├── pages_1-50.json
├── pages_51-100.json
└── current_position.json
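
A per-chunk checkpoint of this shape could be written and resumed roughly as follows. The file names follow the tree above; the JSON layout, including the `next_page` key, and the function names are assumptions made for illustration.

```python
import json
from pathlib import Path


def save_chunk(root: Path, first: int, last: int, pages: dict[int, str]) -> None:
    """Persist one chunk of extracted pages, then advance the position marker."""
    root.mkdir(parents=True, exist_ok=True)
    (root / f"pages_{first}-{last}.json").write_text(json.dumps(pages))
    # Write the position last so it only advances once the chunk is on disk.
    (root / "current_position.json").write_text(json.dumps({"next_page": last + 1}))


def resume_from(root: Path) -> int:
    """Return the next page to extract, or 1 when no checkpoint exists."""
    position = root / "current_position.json"
    if not position.exists():
        return 1
    return json.loads(position.read_text())["next_page"]
```

After `save_chunk(root, 1, 50, pages)`, a restarted run would call `resume_from(root)` and continue at page 51.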

Output Structure

output/<skill-name>/
├── SKILL.md              # Skill description with PDF summary
├── references/
│   ├── index.md          # Table of contents
│   ├── chapter_1.md      # Content by section
│   ├── chapter_2.md
│   └── tables.md         # Extracted tables
└── assets/
    └── images/           # Extracted images (if enabled)
        ├── page_1_fig_1.png
        └── page_5_chart_1.png
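
A quick structural check against this layout might look like the sketch below. It assumes that `SKILL.md`, `references/index.md`, and at least one `chapter_*.md` file are always produced, which the tree suggests but does not state; `missing_outputs` is a hypothetical name.

```python
from pathlib import Path


def missing_outputs(skill_dir: Path) -> list[str]:
    """List expected output files that are absent from skill_dir."""
    expected = ["SKILL.md", "references/index.md"]
    missing = [rel for rel in expected if not (skill_dir / rel).is_file()]
    # At least one chapter file is assumed to exist after a successful run.
    if not any((skill_dir / "references").glob("chapter_*.md")):
        missing.append("references/chapter_*.md")
    return missing
```

An empty result means the basic layout is in place; it says nothing about the quality of the extracted content.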

Configuration Options

{
  "name": "mymanual",
  "description": "Product manual documentation",
  "pdf_path": "docs/manual.pdf",
  "extract_options": {
    "chunk_size": 10,
    "min_quality": 6.0,
    "extract_images": true,
    "min_image_size": 150,
    "ocr_enabled": false,
    "ocr_language": "eng",
    "table_extraction": true
  },
  "categories": {
    "getting_started": ["introduction", "setup", "installation"],
    "usage": ["using", "operation", "guide"],
    "reference": ["appendix", "specifications", "api"]
  }
}
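
A configuration like this could be loaded and sanity-checked before extraction starts. In the sketch below, the defaults are simply the values from the example above, and treating `name` and `pdf_path` as required is an assumption; `load_options` is a hypothetical helper, not part of any named tool.

```python
import json

# Assumed defaults, copied from the example configuration.
DEFAULTS = {
    "chunk_size": 10,
    "min_quality": 6.0,
    "extract_images": True,
    "min_image_size": 150,
    "ocr_enabled": False,
    "ocr_language": "eng",
    "table_extraction": True,
}


def load_options(raw: str) -> dict:
    """Parse a JSON config string and return extract_options with defaults applied."""
    config = json.loads(raw)
    for key in ("name", "pdf_path"):
        if key not in config:
            raise ValueError(f"missing required key: {key}")
    overrides = config.get("extract_options", {})
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown extract_options: {sorted(unknown)}")
    options = {**DEFAULTS, **overrides}
    if options["chunk_size"] < 1:
        raise ValueError("chunk_size must be at least 1")
    return options
```

Rejecting unknown keys catches typos such as `ocr_enable` early, before a long extraction runs with OCR silently disabled.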

Extraction Quality Metrics

| Metric | Good | Acceptable | Poor |
|---|---|---|---|
| Text extraction rate | >95% | 80-95% | <80% |
| Table accuracy | >90% | 70-90% | <70% |
| Image quality | >300 DPI | 150-300 DPI | <150 DPI |
| OCR confidence | >90% | 70-90% | <70% |
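
The thresholds in this table translate directly into a small rating function. The metric keys below are hypothetical identifiers chosen for the sketch; the boundary handling follows the table, so a value exactly on the upper threshold (for example 95%) rates as Acceptable.

```python
# (good_above, acceptable_from) per metric, taken from the table.
THRESHOLDS = {
    "text_extraction_rate": (95, 80),  # percent
    "table_accuracy": (90, 70),        # percent
    "image_dpi": (300, 150),           # DPI
    "ocr_confidence": (90, 70),        # percent
}


def rate(metric: str, value: float) -> str:
    """Return "Good", "Acceptable", or "Poor" for a measured value."""
    good_above, acceptable_from = THRESHOLDS[metric]
    if value > good_above:
        return "Good"
    if value >= acceptable_from:
        return "Acceptable"
    return "Poor"
```

A "Poor" rating on text extraction rate or OCR confidence would be a natural trigger for the recovery protocol or for asking the user how to proceed.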

Troubleshooting

| Issue | Diagnosis | Solution |
|---|---|---|
| Garbled text | Scanned PDF | Enable OCR mode |
| Missing tables | Complex layout | Use --extract-tables with pdfplumber |
| Poor OCR | Low resolution | Increase DPI, check language pack |
| Memory error | Large PDF | Use chunked extraction, reduce workers |
| Corrupt PDF | File damaged | Check with qpdf --check; attempt repair with mutool clean |

Dependencies

Required:

  • Python 3.10+
  • pdfplumber or pypdf

Optional (for advanced features):

  • pytesseract + tesseract-ocr (for OCR)
  • Pillow (for image processing)
  • camelot-py (for complex tables)
