| name | extract-from-pdfs |
| description | This skill should be used when extracting structured data from scientific PDFs for systematic reviews, meta-analyses, or database creation. Use when working with collections of research papers that need to be converted into analyzable datasets with validation metrics. |
Extract Structured Data from Scientific PDFs
Purpose
Extract standardized, structured data from scientific PDF literature using Claude's vision capabilities. Transform PDF collections into validated databases ready for statistical analysis in Python, R, or other frameworks.
Core capabilities:
- Organize metadata from BibTeX, RIS, directories, or DOI lists
- Filter papers by abstract using Claude (Haiku/Sonnet) or local models (Ollama)
- Extract structured data from PDFs with customizable schemas
- Repair and validate JSON outputs automatically
- Enrich with external databases (GBIF, WFO, GeoNames, PubChem, NCBI)
- Calculate precision/recall metrics for quality assurance
- Export to Python, R, CSV, Excel, or SQLite
When to Use This Skill
Use when:
- Conducting systematic literature reviews requiring data extraction
- Building databases from scientific publications
- Converting PDF collections to structured datasets
- Validating extraction quality with ground truth metrics
- Comparing extraction approaches (different models, prompts)
Do not use for:
- Single PDF summarization (use basic PDF reading instead)
- Full-text PDF search (use document search tools)
- PDF editing or manipulation
Getting Started
1. Initial Setup
Read the setup guide for installation and configuration:
cat references/setup_guide.md
Key setup steps:
- Install dependencies: conda env create -f environment.yml
- Set API keys: export ANTHROPIC_API_KEY='your-key'
- Optional: Install Ollama for free local filtering
2. Define Extraction Requirements
Ask the user:
- Research domain and extraction goals
- How PDFs are organized (reference manager, directory, DOI list)
- Approximate collection size
- Preferred analysis environment (Python, R, etc.)
Request 2-3 example PDFs to analyze their structure and design the schema.
3. Design Extraction Schema
Create custom schema from template:
cp assets/schema_template.json my_schema.json
Customize for the specific domain:
- Set objective describing what to extract
- Define output_schema with field types and descriptions
- Add domain-specific instructions for Claude
- Provide output_example showing desired format
See assets/example_flower_visitors_schema.json for a real-world ecology example.
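A minimal sketch of what a customized schema might contain, using the top-level keys named above (objective, instructions, output_schema, output_example). The ecology values below are purely illustrative; assets/schema_template.json remains the authoritative reference for field names and structure:
import json

# Hypothetical ecology schema; keys follow the template fields listed above,
# but check assets/schema_template.json for the authoritative structure.
schema = {
    "objective": "Extract plant-visitor interaction records from ecology papers.",
    "instructions": [
        "Report each plant-visitor pair as a separate record.",
        "Copy species names verbatim; do not correct spelling.",
    ],
    "output_schema": {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "plant_species": {"type": "string"},
                "visitor_species": {"type": "string"},
                "location": {"type": ["string", "null"]},
            },
            "required": ["plant_species", "visitor_species"],
        },
    },
    "output_example": [
        {
            "plant_species": "Salvia pratensis",
            "visitor_species": "Apis mellifera",
            "location": "Thuringia, Germany",
        }
    ],
}

with open("my_schema.json", "w") as f:
    json.dump(schema, f, indent=2)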
Workflow Execution
Complete Pipeline
Run the 6-step pipeline (plus optional validation):
# Step 1: Organize metadata
python scripts/01_organize_metadata.py \
--source-type bibtex \
--source library.bib \
--pdf-dir pdfs/ \
--output metadata.json
# Step 2: Filter papers (optional but recommended)
# Choose backend: anthropic-haiku (cheap), anthropic-sonnet (accurate), ollama (free)
python scripts/02_filter_abstracts.py \
--metadata metadata.json \
--backend anthropic-haiku \
--use-batches \
--output filtered_papers.json
# Step 3: Extract from PDFs
python scripts/03_extract_from_pdfs.py \
--metadata filtered_papers.json \
--schema my_schema.json \
--method batches \
--output extracted_data.json
# Step 4: Repair JSON
python scripts/04_repair_json.py \
--input extracted_data.json \
--schema my_schema.json \
--output cleaned_data.json
# Step 5: Validate with APIs
python scripts/05_validate_with_apis.py \
--input cleaned_data.json \
--apis my_api_config.json \
--output validated_data.json
# Step 6: Export to analysis format
python scripts/06_export_database.py \
--input validated_data.json \
--format python \
--output results
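After Step 6, the exported data can be pulled straight into an analysis session. The snippet below is a sketch only: the files actually written under the results output depend on 06_export_database.py and the chosen --format, and the CSV filename here is hypothetical.
import pandas as pd

# Hypothetical loading step; replace the path with whatever file
# 06_export_database.py actually writes for your chosen --format.
df = pd.read_csv("results/extracted_records.csv")
print(df.head())
print(df.describe(include="all"))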
Validation (Optional but Recommended)
Calculate extraction quality metrics:
# Step 7: Sample papers for annotation
python scripts/07_prepare_validation_set.py \
--extraction-results cleaned_data.json \
--schema my_schema.json \
--sample-size 20 \
--strategy stratified \
--output validation_set.json
# Step 8: Manually annotate (edit validation_set.json)
# Fill ground_truth field for each sampled paper
# Step 9: Calculate metrics
python scripts/08_calculate_validation_metrics.py \
--annotations validation_set.json \
--output validation_metrics.json \
--report validation_report.txt
Validation produces precision, recall, and F1 metrics per field and overall.
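For intuition, the per-field numbers can be read as set comparisons between extracted and ground-truth values. The sketch below illustrates the definitions only; scripts/08_calculate_validation_metrics.py is the authoritative implementation.
def precision_recall_f1(extracted, ground_truth):
    # Precision: share of extracted items that are correct.
    # Recall: share of ground-truth items that were extracted.
    extracted, ground_truth = set(extracted), set(ground_truth)
    tp = len(extracted & ground_truth)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: 8 of 10 extracted values are correct and the ground truth has 12 values.
print(precision_recall_f1(range(10), range(2, 14)))  # (0.8, 0.667, 0.727)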
Detailed Documentation
Access comprehensive guides in the references/ directory:
Setup and installation:
cat references/setup_guide.md
Complete workflow with examples:
cat references/workflow_guide.md
Validation methodology:
cat references/validation_guide.md
API integration details:
cat references/api_reference.md
Customization
Schema Customization
Modify my_schema.json to match the research domain:
- Objective: Describe what data to extract
- Instructions: Step-by-step extraction guidance
- Output schema: JSON schema defining structure
- Important notes: Domain-specific rules
- Examples: Show desired output format
Use imperative language in instructions. Be specific about data types, required vs optional fields, and edge cases.
API Configuration
Configure external database validation in my_api_config.json:
Map extracted fields to validation APIs:
- gbif_taxonomy - Biological taxonomy
- wfo_plants - Plant names specifically
- geonames - Geographic locations
- geocode - Address to coordinates
- pubchem - Chemical compounds
- ncbi_gene - Gene identifiers
See assets/example_api_config_ecology.json for an ecology-specific example.
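To make the mapping concrete, the gbif_taxonomy check amounts to the kind of lookup shown below. This is an illustration against GBIF's public species-match endpoint, not the code in scripts/05_validate_with_apis.py, which reads the mapping from my_api_config.json.
import requests

# Illustrative direct call to GBIF's species-match API; the validation script
# performs an equivalent lookup for every field mapped to gbif_taxonomy.
resp = requests.get(
    "https://api.gbif.org/v1/species/match",
    params={"name": "Apis mellifera"},
    timeout=30,
)
match = resp.json()
print(match.get("scientificName"), match.get("rank"), match.get("confidence"))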
Filtering Customization
Edit filtering criteria in scripts/02_filter_abstracts.py (line 74):
Replace TODO section with domain-specific criteria:
- What constitutes primary data vs review?
- What data types are relevant?
- What scope (geographic, temporal, taxonomic) is needed?
Use conservative criteria (when in doubt, include the paper) to avoid false negatives.
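As a sketch, the replacement criteria might look like the block below. The variable name and wording are hypothetical; adapt them to the domain and to how the script actually builds its filtering prompt.
# Hypothetical inclusion criteria for the TODO section in 02_filter_abstracts.py;
# adapt the wording to your domain and to the script's existing prompt structure.
FILTER_CRITERIA = """
Include the paper if the abstract suggests ANY of the following:
- Reports primary field or experimental observations (not a review or meta-analysis)
- Mentions plant-visitor or plant-pollinator interactions
- Covers any geographic region or time period
When uncertain, INCLUDE the paper: false positives are cheaper than false negatives.
"""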
Cost Optimization
Backend selection for filtering (Step 2):
- Ollama (local): $0 - Best for privacy and high volume
- Haiku (API): ~$0.25/M tokens - Best balance of cost/quality
- Sonnet (API): ~$3/M tokens - Best for complex filtering
Typical costs for 100 papers:
- With filtering (Haiku + Sonnet): ~$4
- With local Ollama + Sonnet: ~$3.75
- Without filtering (Sonnet only): ~$7.50
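The figures above are rough. A quick way to sanity-check a budget for a specific collection is a back-of-the-envelope calculation like the one below; the token counts and prices are placeholders to be replaced with real numbers from a pilot run.
def estimate_cost(n_papers, tokens_per_paper, price_per_million_tokens):
    # Back-of-the-envelope input-token cost; ignores output tokens,
    # batch discounts, and prompt caching.
    return n_papers * tokens_per_paper / 1_000_000 * price_per_million_tokens

# Hypothetical pilot numbers: 100 abstracts of ~2,000 tokens filtered with Haiku.
print(round(estimate_cost(100, 2_000, 0.25), 2))  # ~$0.05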
Optimization strategies:
- Use abstract filtering to reduce PDF processing
- Use local Ollama for filtering (free)
- Enable prompt caching with --use-caching
- Process in batches with --use-batches
Quality Assurance
Validation workflow provides:
- Precision: % of extracted items that are correct
- Recall: % of true items that were extracted
- F1 score: Harmonic mean of precision and recall
- Per-field metrics: Identify weak fields
Use metrics to:
- Establish baseline extraction quality
- Compare different approaches (models, prompts, schemas)
- Identify areas for improvement
- Report extraction quality in publications
Recommended sample sizes:
- Small projects (<100 papers): 10-20 papers
- Medium projects (100-500 papers): 20-50 papers
- Large projects (>500 papers): 50-100 papers
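For intuition about the stratified strategy used in Step 7, the sketch below groups papers by how many records were extracted before sampling, so sparse and record-rich papers are both represented. The grouping rule and field names are assumptions; scripts/07_prepare_validation_set.py defines the real behavior.
import random
from collections import defaultdict

def stratified_sample(papers, sample_size, seed=42):
    # Illustrative grouping by extraction density; the real script's strata
    # and field names may differ.
    random.seed(seed)
    strata = defaultdict(list)
    for paper in papers:
        n = len(paper.get("records", []))
        strata["none" if n == 0 else "few" if n <= 5 else "many"].append(paper)
    per_stratum = max(1, sample_size // max(1, len(strata)))
    sample = []
    for group in strata.values():
        sample.extend(random.sample(group, min(per_stratum, len(group))))
    return sample[:sample_size]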
Iterative Improvement
- Run initial extraction with baseline schema
- Validate on sample using Steps 7-9
- Analyze field-level metrics and error patterns
- Revise schema, prompts, or model selection
- Re-extract and re-validate
- Compare metrics to verify improvement
- Repeat until acceptable quality achieved
See references/validation_guide.md for detailed guidance on interpreting metrics and improving extraction quality.
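When comparing metrics across iterations, a small diff of the metrics files is often enough. The layout assumed below (a fields mapping with per-field f1 values) is a guess; adjust the keys to match what 08_calculate_validation_metrics.py actually writes.
import json

# Hypothetical comparison of two validation runs; key names are assumed.
with open("validation_metrics_v1.json") as f:
    before = json.load(f)
with open("validation_metrics_v2.json") as f:
    after = json.load(f)

for field, metrics in after.get("fields", {}).items():
    old_f1 = before.get("fields", {}).get(field, {}).get("f1", 0.0)
    print(f"{field}: F1 {old_f1:.2f} -> {metrics.get('f1', 0.0):.2f}")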
Available Scripts
Data organization:
- scripts/01_organize_metadata.py - Standardize PDFs and metadata
Filtering:
- scripts/02_filter_abstracts.py - Filter by abstract (Haiku/Sonnet/Ollama)
Extraction:
- scripts/03_extract_from_pdfs.py - Extract from PDFs with Claude vision
Processing:
- scripts/04_repair_json.py - Repair and validate JSON
- scripts/05_validate_with_apis.py - Enrich with external databases
- scripts/06_export_database.py - Export to analysis formats
Validation:
- scripts/07_prepare_validation_set.py - Sample papers for annotation
- scripts/08_calculate_validation_metrics.py - Calculate precision/recall/F1 metrics
Assets
Templates:
- assets/schema_template.json - Blank extraction schema template
- assets/api_config_template.json - API validation configuration template
Examples:
- assets/example_flower_visitors_schema.json - Ecology extraction example
- assets/example_api_config_ecology.json - Ecology API validation example