| name | pdf-extractor |
| description | Expert in PDF content extraction and analysis. **Use whenever the user mentions PDFs, .pdf files, or requests to extract, read, parse, analyze, convert, or process PDF documents.** Handles text extraction, image extraction, converting PDFs to markdown or other formats, batch PDF processing, and analyzing PDF document structure for AI processing. Uses a fast Go binary with Vertex AI Gemini for intelligent image analysis. Supports two methods - preferred binary-based extraction (default) and alternative image-based extraction (when explicitly requested). (project, gitignored) |
PDF Extractor Skill
You are an expert in extracting and analyzing PDF content, converting it to AI-friendly formats with intelligent image analysis and classification.
Two extraction methods available:
- Method 1 (Preferred): Script-based extraction to markdown with AI image analysis
- Method 2 (Alternative): Page-by-page image conversion for complex layouts (only when explicitly requested)
⚠️ CRITICAL REQUIREMENT: ALWAYS USE FULL FILE PATHS
YOU MUST ALWAYS use absolute/full paths when working with PDF files.
✅ CORRECT:
~/.claude/skills/pdf-extractor/scripts/pdf-extractor /Users/sebastien.morand/Downloads/document.pdf
~/.claude/skills/pdf-extractor/scripts/pdf-extractor -output ~/Documents/extracted ~/Downloads/report.pdf
❌ INCORRECT:
~/.claude/skills/pdf-extractor/scripts/pdf-extractor document.pdf
~/.claude/skills/pdf-extractor/scripts/pdf-extractor ./document.pdf
Why: Always use full absolute paths or paths starting with ~/ for the input PDF file to ensure correct file resolution.
Extraction Methods
This skill supports two methods for extracting information from PDFs:
Method 1: Binary-Based Extraction (PREFERRED ✅)
This is the default and recommended method.
Uses the pdf-extractor Go binary to:
- Extract text with layout preservation to markdown
- Extract embedded images from the PDF
- Analyze images with AI-powered descriptions and classifications (via Vertex AI Gemini)
- Generate an intelligent images catalog (images.md)
When to use: All standard PDF extraction tasks (default behavior)
How to use: Simply ask to "extract PDF", "analyze PDF", "read PDF", etc.
Method 2: Image-Based Extraction (ALTERNATIVE)
Converts entire PDF pages to images using ImageMagick, then analyzes each page as an image.
When to use:
- Complex layouts that don't extract well as text
- Scanned documents or image-heavy PDFs
- When visual layout/formatting is critical
- When explicitly requested by user
How to trigger: User must explicitly request:
- "extract information from PDF using image mode"
- "extract PDF information through images"
- "convert PDF to images and analyze"
- "analyze PDF as images page by page"
Process:
- Convert PDF to PNG images (one per page) using ImageMagick:
magick convert -density 150 input.pdf output-%03d.png - Use Task tool to analyze each image page with Claude's vision capabilities
- Combine analysis from all pages
⚠️ Note: Image-based extraction is slower and more expensive (vision API calls per page) but preserves exact visual layout.
Core Capabilities (Method 1)
- PDF text extraction with layout preservation using Go-based pdfcpu library
- AI-powered image analysis and classification using Gemini 1.5 Flash via Vertex AI
- Automatic image description generation with type detection (photo, diagram, chart, table, banner, logo, etc.)
- Intelligent images catalog (images.md) with detailed metadata
- Markdown conversion with embedded image references
- Multi-page PDF processing
- Smart default output paths based on PDF filename
- Optional cleanup for temporary extractions
- Fast, standalone binary with no Python dependencies
Quick Start
Basic Usage
Binary Location: ~/.claude/skills/pdf-extractor/scripts/pdf-extractor
# Basic extraction (creates folder in PDF's directory)
~/.claude/skills/pdf-extractor/scripts/pdf-extractor /path/to/document.pdf
# Extract to specific directory
~/.claude/skills/pdf-extractor/scripts/pdf-extractor -output ~/Documents/extracted ~/Downloads/report.pdf
# Temporary analysis with automatic cleanup
~/.claude/skills/pdf-extractor/scripts/pdf-extractor -cleanup ~/Downloads/temp.pdf
# Custom GCP project and region
~/.claude/skills/pdf-extractor/scripts/pdf-extractor -project my-project -region europe-west1 ~/Downloads/doc.pdf
# Skip AI analysis (faster, no image descriptions)
~/.claude/skills/pdf-extractor/scripts/pdf-extractor -no-ai ~/Downloads/doc.pdf
# Use different Gemini model
~/.claude/skills/pdf-extractor/scripts/pdf-extractor -model gemini-1.5-pro ~/Downloads/doc.pdf
Default Output Paths
When no output directory specified:
- PDF:
/path/to/myfile.pdf - Output:
/path/to/myfile_extraction/(default suffix:_extraction)
When custom output directory specified:
- PDF:
/path/to/myfile.pdf - Custom output:
/target/ - Final output:
/target/(exact directory specified)
Examples:
# Extract ~/Downloads/report.pdf → Output: ~/Downloads/report_extraction/
~/.claude/skills/pdf-extractor/scripts/pdf-extractor ~/Downloads/report.pdf
# Extract to custom location → Output: ~/Documents/extracted/
~/.claude/skills/pdf-extractor/scripts/pdf-extractor -output ~/Documents/extracted ~/Downloads/report.pdf
Output Structure
Every extraction creates:
pdf_name/
├── document.md # Extracted text content with image references
├── images.md # AI-generated image catalog with descriptions
└── image-*.png # Extracted images
Common Workflows
1. Summarize PDF with Image Context
# Extract with AI image analysis
~/.claude/skills/pdf-extractor/scripts/pdf-extractor ~/Downloads/report.pdf
# Read content
cat ~/Downloads/report_extraction/document.md # Text content
cat ~/Downloads/report_extraction/images.md # Image descriptions
Process: Extract → Read document.md → Review images.md → Filter relevant images → Combine for comprehensive summary
2. Filter Images by Type
# Extract PDF
~/.claude/skills/pdf-extractor/scripts/pdf-extractor ~/Downloads/presentation.pdf
# Find charts and diagrams only
grep -A 15 "**Type:** chart" ~/Downloads/presentation_extraction/images.md
grep -A 15 "**Type:** diagram" ~/Downloads/presentation_extraction/images.md
# Exclude decorative elements
grep -v "banner\|logo" ~/Downloads/presentation_extraction/images.md
3. Batch Process Multiple PDFs
# Process all PDFs in directory
for pdf in ~/Downloads/*.pdf; do
~/.claude/skills/pdf-extractor/scripts/pdf-extractor "$pdf"
done
4. Temporary Analysis
# Quick analysis without keeping files
~/.claude/skills/pdf-extractor/scripts/pdf-extractor -cleanup ~/Downloads/temp.pdf
See references/workflows.md for more detailed workflows.
Binary Details
How It Works
The pdf-extractor is a standalone Go binary that:
- Extracts PDF text and images using the pdfcpu library
- Converts content to markdown format
- Uses Vertex AI Gemini API for AI-powered image analysis
- Authenticates via gcloud Application Default Credentials
- No dependencies or virtual environments required
- Fast execution with compiled Go performance
Command-Line Arguments
pdf-extractor [OPTIONS] <pdf-file>
Required:
<pdf-file>: Full path to the PDF file to process
Options:
-output string: Output directory for extracted content (default:pdf_name_extraction)-cleanup: Delete extracted images after processing (keeps markdown files)-no-ai: Skip AI image analysis (faster, but no image descriptions)-model string: Vertex AI model to use (default:gemini-1.5-flash)-project string: GCP project ID for Vertex AI (default:btdp-dta-gbl-0002-gen-ai-01)-region string: GCP region for Vertex AI (default:europe-west1)
See references/script-usage.md for detailed usage and examples.
Image Catalog (images.md)
The binary automatically generates an images.md file with AI-powered analysis (unless -no-ai is used).
For each image:
- Path: Relative path to image file
- Type: Classification (photo, diagram, chart, graph, table, screenshot, banner, logo, icon, illustration, map, or other)
- Contains Text: Whether image has readable text
- Description: AI-generated 2-3 sentence description
- Preview: Embedded image for reference
Benefits:
- Filter out banners, logos, decorative images
- Understand charts/diagrams without viewing
- Find specific images by type or description
- Provides accessibility descriptions
See references/output-format.md for complete output format details.
Prerequisites & Setup
Required
- gcloud CLI - Google Cloud SDK (for authentication)
- GCP project with Vertex AI API enabled
- No other dependencies - the binary is self-contained
Authentication Setup
Quick Setup:
# 1. Install gcloud CLI (macOS)
brew install google-cloud-sdk
# 2. Authenticate with Application Default Credentials
gcloud auth application-default login
# 3. Set default project (optional - can override with -project flag)
gcloud config set project YOUR_PROJECT_ID
# 4. Enable Vertex AI API
gcloud services enable aiplatform.googleapis.com
Override defaults via command-line flags:
# Specify project and region at runtime
pdf-extractor -project my-project -region us-central1 ~/Downloads/file.pdf
See references/authentication.md for detailed authentication setup and troubleshooting.
Response Approach
When helping with PDF extraction:
- Determine method:
- Default: Use Method 1 (binary-based extraction) unless user explicitly requests image mode
- Image mode: Only use Method 2 if user explicitly asks for "image mode", "through images", "as images page by page", etc.
- Understand task: What information needed? Are images important? Should AI analysis be skipped (-no-ai)?
- Locate PDF: Check mentioned locations, search common directories
- Extract content:
- Method 1: Use pdf-extractor binary with appropriate flags (-cleanup, -no-ai, -output, etc.)
- Method 2: Convert to images with ImageMagick, then analyze with Task tool
- Process:
- Method 1: Read document.md and images.md (if AI analysis was enabled), filter images by type if needed
- Method 2: Combine analysis from all page images
- Provide results: Summarize findings, highlight relevant visuals, suggest next steps
- Clean up: Note extraction location, provide commands for further analysis
Performance & Best Practices
Performance:
- Binary execution: Fast startup (no virtual environment setup)
- Text extraction: Very fast with compiled Go code
- Image analysis: ~1-2 seconds per image via Vertex AI Gemini API (when enabled)
- API costs: Uses Vertex AI credits (billed to GCP project)
- Default model:
gemini-1.5-flash(fast and cost-effective) - Alternative model:
gemini-1.5-pro(higher quality, slower, more expensive) - Default region: europe-west1 (optimized for EU users)
When to use default output path: Single documents, keeping extractions near source PDFs
When to use -output flag: Custom organization, multiple extractions, specific project directories
When to use -cleanup flag: Temporary analysis, limited disk space, only need text without images
When to use -no-ai flag: Fast extraction without image descriptions, cost savings, text-only analysis
See references/advanced.md for customization, performance tuning, and integration details.
Troubleshooting
Common Issues
"PDF file not found":
# Always use absolute paths
ls -lh ~/Downloads/file.pdf
# Find PDF files if location unknown
find ~ -name "*.pdf" -type f | grep -i "filename"
Authentication errors:
# Verify authentication
gcloud auth application-default print-access-token
# Re-authenticate if needed
gcloud auth application-default login
# Check current project
gcloud config get-value project
Vertex AI API errors:
# Enable Vertex AI API
gcloud services enable aiplatform.googleapis.com
# Verify project has access
gcloud projects describe YOUR_PROJECT_ID
Binary permission issues:
# Make binary executable
chmod +x ~/.claude/skills/pdf-extractor/scripts/pdf-extractor
See references/authentication.md for detailed troubleshooting.
Reference Documentation
- Binary Usage & Arguments - Complete binary usage with examples
- Output Format - Output structure, images.md format, file organization
- Workflows - Common workflows and use cases
- Authentication - GCP/Vertex AI setup and troubleshooting
- Advanced Usage - Performance tuning, model selection, integration
Building from Source
The binary is built from Go source code located at ~/projects/new/pdf-extractor/.
To rebuild after making changes:
# Build the binary
make -C ~/projects/new/pdf-extractor
# Deploy to skill directory
cp ~/projects/new/pdf-extractor/pdf-extractor ~/.claude/skills/pdf-extractor/scripts/
See CLAUDE.md for complete development workflow.