name

pdf-form-converter

description

Convert flat/scanned PDF documents into interactive fillable PDF forms using OCR-based field detection with AWS Textract and AI vision validation. Use when users need to: (1) Transform static PDFs into editable forms, (2) Detect and convert form fields (text inputs, checkboxes, signatures) automatically, (3) Digitize paper-based forms for electronic completion, (4) Validate form field detection accuracy, or (5) Batch convert multiple similar forms while preserving original styling and layout.

PDF Form Converter

Overview

Convert static PDF documents into interactive fillable forms through automated field detection using OCR (AWS Textract) and AI vision validation. Preserve original document styling while creating editable form elements.

Workflow Decision Tree

Follow this decision tree to determine the appropriate conversion approach:

Single form conversion with standard layout? → Use Standard Conversion Workflow
Complex form with tables, nested fields, or unusual layout? → Use Advanced Conversion with Enhanced Validation
Multiple similar forms to process? → Use Batch Conversion Workflow
Need to validate existing converted form? → Use Validation-Only Workflow
Form has accessibility requirements? → Use Standard Conversion + Accessibility Enhancement

Standard Conversion Workflow

For typical forms with standard layouts (employment applications, intake forms, contact sheets).

Step 1: Analyze Source PDF

First, analyze the source PDF to understand its structure:

python scripts/analyze_pdf.py <input.pdf>

This script outputs:

Page count and dimensions
Detected text regions
Visual layout analysis
Preliminary field candidates

Step 2: OCR Field Detection

Run AWS Textract to detect form fields, labels, and structure:

python scripts/textract_detection.py <input.pdf> --output-json fields.json

This generates:

Field locations (bounding boxes)
Associated labels
Field types (text, checkbox, signature)
Confidence scores

Note: Requires AWS credentials configured. Set environment variables:

export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1

Step 3: AI Vision Validation

Use Claude's vision capabilities to validate and refine OCR results:

python scripts/vision_validation.py <input.pdf> fields.json --output validated_fields.json

This script:

Sends PDF pages as images to Claude
Cross-validates field boundaries
Corrects misaligned detections
Identifies missed fields
Improves label-to-field matching

Step 4: Generate Interactive PDF

Create the fillable PDF with detected fields:

python scripts/generate_fillable_pdf.py <input.pdf> validated_fields.json --output fillable_form.pdf

This produces an interactive PDF with:

Text input fields
Checkboxes and radio buttons
Dropdown menus
Signature fields
Preserved styling and layout

Step 5: Quality Check

Verify the conversion quality:

python scripts/validate_conversion.py <original.pdf> <fillable_form.pdf>

This compares original and converted PDFs, reporting:

Field alignment accuracy
Missing or extra fields
Style preservation score
Overall conversion quality

Advanced Conversion with Enhanced Validation

For complex forms with intricate layouts, tables, or nested structures.

Enhanced Detection Strategy

Run multi-pass OCR with different confidence thresholds
Apply table detection for forms with tabular sections
Use semantic analysis to group related fields
Validate with multiple vision passes for high-stakes forms

# Multi-pass detection for complex forms
python scripts/advanced_detection.py <input.pdf> \
  --multipass \
  --detect-tables \
  --semantic-grouping \
  --output complex_fields.json

Manual Field Refinement

For critical forms requiring perfect accuracy:

# Generate field mapping file for manual review/editing
python scripts/export_field_map.py validated_fields.json --output field_map.yaml

# After manual edits, regenerate PDF
python scripts/generate_fillable_pdf.py <input.pdf> field_map.yaml --output final_form.pdf

See references/field_mapping_spec.md for field mapping YAML format.

Batch Conversion Workflow

Process multiple similar forms efficiently.

Batch Processing

python scripts/batch_convert.py <input_directory> \
  --output-dir converted_forms \
  --template-learning \
  --parallel 4

The --template-learning flag uses the first form as a template to improve detection on subsequent forms with similar layouts.

Batch options:

--parallel N: Process N forms simultaneously
--validation-level [basic|full]: Trade speed for accuracy
--preserve-structure: Maintain directory structure in output
--report: Generate conversion summary report

Validation-Only Workflow

Validate existing converted forms without re-conversion.

python scripts/validate_conversion.py \
  <original.pdf> \
  <converted.pdf> \
  --detailed-report \
  --visual-diff \
  --output validation_report.pdf

Generates report with:

Side-by-side comparison
Highlighted discrepancies
Field accuracy metrics
Recommendations for improvement

Field Type Configuration

Configure how different field types are detected and created.

Common Field Types

Text Fields:

Single-line text inputs
Multi-line text areas
Numeric fields (with validation)
Date fields (with date picker)
Email fields (with validation)

Selection Fields:

Checkboxes (single or grouped)
Radio buttons (mutually exclusive groups)
Dropdown menus (from detected options)

Signature Fields:

Digital signature areas
Initial fields
Date-signed fields

Field Properties

Configure field properties in the field mapping:

fields:
  - label: "Patient Name"
    type: text
    required: true
    max_length: 100
    font_size: 12
  
  - label: "Insurance Provider"
    type: dropdown
    options: ["Blue Cross", "Aetna", "United", "Other"]
    required: true
  
  - label: "Consent Signature"
    type: signature
    required: true
    lock_after_signing: true

See references/field_properties.md for complete property specifications.

Style Preservation

Maintain original document appearance during conversion.

Preserved Elements

Typography: Fonts, sizes, weights, colors
Layout: Margins, spacing, alignment
Graphics: Logos, watermarks, background images
Borders: Form field borders and styling
Colors: Background colors, field highlights

Style Extraction

python scripts/extract_styles.py <input.pdf> --output styles.json

Use extracted styles when generating fillable forms:

python scripts/generate_fillable_pdf.py <input.pdf> fields.json \
  --styles styles.json \
  --output styled_form.pdf

Accessibility Enhancement

Ensure converted forms meet digital accessibility standards.

python scripts/add_accessibility.py <fillable_form.pdf> \
  --add-tags \
  --alt-text \
  --tab-order \
  --screen-reader-labels \
  --output accessible_form.pdf

Accessibility features added:

PDF/UA compliance tagging
Form field labels for screen readers
Logical tab order
Alt text for images
Keyboard navigation support

Troubleshooting

Low Detection Confidence

If field detection is poor:

Check PDF quality (increase scan resolution if needed)
Use --enhance-image flag to improve image quality
Run advanced detection with --multipass option
Manually review and edit field mapping

Misaligned Fields

If fields don't align properly:

Verify PDF page dimensions match original
Check for skewed/rotated source documents
Use vision validation to correct boundaries
Adjust field positions in field mapping YAML

Missing Fields

If fields are not detected:

Check OCR confidence thresholds (lower if needed)
Verify field labels are clear and readable
Use semantic analysis for unlabeled fields
Manually add missing fields to field mapping

Style Issues

If original styling is lost:

Ensure style extraction was successful
Verify font embedding in source PDF
Check for unsupported font types
Use --embed-fonts flag when generating

Resources

scripts/

Core conversion and validation scripts:

analyze_pdf.py - PDF structure analysis
textract_detection.py - AWS Textract field detection
vision_validation.py - AI vision cross-validation
generate_fillable_pdf.py - Interactive PDF creation
validate_conversion.py - Quality validation
advanced_detection.py - Complex form detection
batch_convert.py - Batch processing
extract_styles.py - Style preservation
add_accessibility.py - Accessibility compliance

references/

Detailed documentation:

field_mapping_spec.md - Field mapping YAML format specification
field_properties.md - Complete field property reference
textract_api.md - AWS Textract API patterns and best practices
accessibility_standards.md - PDF/UA compliance guidelines