Claude Code Plugins

Community-maintained marketplace


PDF Workbench — Extraction, Forms & Generation

@jscraik/Cortex-OS

Automate PDF analysis, form filling, and document assembly using Python toolkits and curated workflows for Cortex-OS.

Install Skill

1. Download skill
2. Enable skills in Claude

   Open claude.ai/settings/capabilities and find the "Skills" section

3. Upload to Claude

   Click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reviewing its instructions before using it.

SKILL.md

id: skill-pdf-workbench
name: PDF Workbench — Extraction, Forms & Generation
description: Automate PDF analysis, form filling, and document assembly using Python toolkits and curated workflows for Cortex-OS.
version: 1.0.0
author: brAInwav Documentation Guild
owner: @jamiescottcraik
category: documentation
difficulty: advanced
tags: pdf, automation, forms, extraction, document
estimatedTokens: 4200
license: Complete terms in LICENSE.txt
requiredTools: python, pandoc, pypdf, reportlab
prerequisites: Install pypdf, pdfplumber, reportlab, pandas, and PyPDF2 as needed; Pandoc installed for conversions; input PDFs accessible in the task workspace
relatedSkills: skill-docx-workbench, skill-creator
resources: ./resources/reference.md, ./resources/forms.md, ./resources/scripts/check_bounding_boxes.py, ./resources/scripts/check_bounding_boxes_test.py, ./resources/scripts/check_fillable_fields.py, ./resources/scripts/convert_pdf_to_images.py, ./resources/scripts/create_validation_image.py, ./resources/scripts/extract_form_field_info.py, ./resources/scripts/fill_fillable_fields.py, ./resources/scripts/fill_pdf_form_with_annotations.py, ./resources/LICENSE.txt
deprecated: false
replacedBy: null
impl: packages/doc-tools/src/pdf_workbench.ts#runPdfWorkflow
inputs: [object Object]
outputs: [object Object]
preconditions: Source documents cleared for processing and backed up; form field requirements documented (names, default values); OCR plan established if PDFs are image-based (use OCR preprocessing skill if necessary).
sideEffects: Generates intermediate JSON/CSV files for extracted content; writes processing logs and metadata reports to `logs/pdf-workbench/`.
estimatedCost: $0.003 / PDF workflow (~600 tokens across extraction + validation).
calls: skill-testing-evidence-triplet
requiresContext: memory://skills/skill-pdf-workbench/historical-runs
providesContext: memory://skills/skill-pdf-workbench/latest-report
monitoring: true
lifecycle: [object Object]
estimatedDuration: PT40M
i18n: [object Object]
persuasiveFraming: [object Object]
observability: [object Object]
governance: [object Object]
schemaStatus: [object Object]

PDF Workbench — Extraction, Forms & Generation

When to Use

  • Extracting tables or text from financial reports, statements, or regulatory filings.
  • Filling standardized PDF forms (e.g., onboarding, compliance attestations) programmatically.
  • Splitting or merging PDFs as part of document assembly workflows.
  • Generating branded or data-driven reports directly to PDF.

How to Apply

  1. Assess the document: text-based vs image-based, form fields required, expected outputs.
  2. Choose the workflow (extract, form-fill, merge, split, generate) and review resources/reference.md or forms.md for detailed steps.
  3. Use the bundled scripts (pypdf/pdfplumber/reportlab) to implement the operation; capture logs/output files.
  4. Validate results (compare counts, inspect sample pages, run Pandoc conversion if needed) and store artefacts.
  5. Log Local Memory outcomes and attach evidence to the task/PR.

Success Criteria

  • Extracted data matches expected record counts, headers, and formatting.
  • Form fields populated accurately with correct data types and character sets.
  • Generated/split/merged PDFs pass QC (manual spot-check or automated validation).
  • Evidence bundle includes logs, sample outputs, and markdown summary.
  • Local Memory entry records the workflow, effectiveness ≥0.8, and artefact IDs.

0) Mission Snapshot — What / Why / Where / How / Result

  • What: Provide reliable automation patterns for PDF manipulation in Cortex-OS engagements.
  • Why: Manual PDF processing is error-prone; automation yields repeatable outputs and audit trails.
  • Where: Finance, compliance, partner support, and operations tasks relying on PDF documentation.
  • How: Combine Python toolkits (pypdf, pdfplumber, reportlab) with curated scripts and validation steps.
  • Result: Reviewed PDF artefacts with traceable evidence, ready for downstream workflows.

1) Contract — Inputs → Outputs

Inputs: source PDF(s), workflow choice, field mapping (for forms), destination path.

Outputs: processed PDF(s), extracted datasets (CSV/XLSX), logs, markdown summary, Local Memory references.
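The contract can be expressed as typed structures; these field names are illustrative, not the skill's actual schema (which is not published here):

```python
from typing import List, TypedDict

class WorkbenchInputs(TypedDict):
    sources: List[str]     # source PDF paths
    workflow: str          # "extract" | "form-fill" | "merge" | "split" | "generate"
    field_mapping: dict    # form field name -> value (forms only)
    destination: str       # output directory

class WorkbenchOutputs(TypedDict):
    pdfs: List[str]        # processed PDF paths
    datasets: List[str]    # extracted CSV/XLSX paths
    logs: List[str]
    summary_md: str        # markdown summary path
    memory_refs: List[str] # Local Memory entry IDs

inputs = WorkbenchInputs(
    sources=["report.pdf"], workflow="extract",
    field_mapping={}, destination="out/",
)
```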

2) Preconditions & Safeguards

  • Verify document confidentiality and approvals before processing.
  • If PDFs require OCR, plan for it upfront (run Tesseract or another OCR pipeline before extraction).
  • Ensure file paths and encodings are sanitized to avoid injection issues.
  • Install dependencies in an isolated environment (`python -m venv` or `uv`) to avoid version drift.
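One way to satisfy the path-sanitization safeguard is to resolve every candidate against the workspace root and reject escapes; `safe_path` is a hypothetical helper, not part of the bundled scripts:

```python
from pathlib import Path

def safe_path(workspace: Path, candidate: str) -> Path:
    """Resolve a user-supplied path and refuse anything outside the workspace."""
    resolved = (workspace / candidate).resolve()
    if not resolved.is_relative_to(workspace.resolve()):
        raise ValueError(f"path escapes workspace: {candidate}")
    return resolved
```

This rejects `../../etc/passwd`-style traversal while leaving ordinary relative paths untouched.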

3) Implementation Playbook (RED→GREEN→REFACTOR)

  1. Recon (RED): Inspect the PDF (page count, structure), identify forms/tables, and list required outputs.
  2. Execute (GREEN): Run relevant scripts (extraction, form fill) with logging enabled; iterate on selectors/field mapping.
  3. Refine (REFACTOR): Validate outputs, handle edge cases (empty tables, merged cells), rerun until quality gates satisfied.

4) Observability & Telemetry Hooks

  • Log steps with [brAInwav] prefix, including page counts and success/failure per batch.
  • Capture metrics on records extracted and store summary JSON/CSV for downstream analytics.
  • Archive raw logs in logs/pdf-workbench/ for review.
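A stdlib sketch of the `[brAInwav]`-prefixed logging; the exact format string is an assumption:

```python
import logging

# Prefix every record with the [brAInwav] marker used across the skill's logs.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("[brAInwav] %(levelname)s %(message)s"))
log = logging.getLogger("pdf-workbench")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("batch 1: %d/%d pages extracted", 12, 12)
```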

5) Safety, Compliance & Governance

  • Preserve original documents untouched; work on copies.
  • Redact or mask sensitive data in outputs before sharing.
  • Keep audit trails (input hash, script version) for repeatability.
  • Respect form licensing/usage requirements before publishing generated PDFs.
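The audit-trail bullet (input hash, script version) can be implemented with stdlib hashing; `audit_record` is a hypothetical helper, not a bundled script:

```python
import hashlib
from pathlib import Path

def audit_record(path: str, script_version: str) -> dict:
    """SHA-256 the input so a rerun can prove it processed identical bytes."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return {"input": path, "sha256": digest, "script_version": script_version}

Path("source.pdf").write_bytes(b"%PDF-1.4 demo")  # stand-in input file
record = audit_record("source.pdf", "1.0.0")
```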

6) Success Criteria & Acceptance Tests

  • Automated scripts exit cleanly; tests comparing record counts or field completions pass.
  • Manual spot-check of PDF structure confirms correct pagination and formatting.
  • Evidence Triplet recorded (initial failure if any, passing run, artifact verification).
  • Reviewer checklist completed with no major issues outstanding.
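The record-count comparison from the first bullet can be sketched in a few lines; the CSV layout here is illustrative:

```python
import csv
from pathlib import Path

def check_record_count(csv_path: str, expected: int) -> None:
    """Fail loudly when an export's row count drifts from the source."""
    with open(csv_path, newline="") as f:
        actual = sum(1 for _ in csv.reader(f)) - 1  # minus header row
    assert actual == expected, f"expected {expected} records, got {actual}"

Path("export.csv").write_text("id,amount\n1,10\n2,20\n")  # sample export
check_record_count("export.csv", 2)
```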

7) Failure Modes & Recovery

  • Image-only PDF: Apply OCR pre-processing; fall back to manual handling if text extraction is impossible.
  • Misparsed tables: Adjust pdfplumber settings (edge tolerance), or fall back to tabula or OCR.
  • Form fields misnamed: Inspect field names (form-field helpers under resources/scripts) and map them correctly.
  • Large files: Process in batches and monitor memory usage.
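For the misparsed-tables case, pdfplumber exposes tolerance knobs; the values below are illustrative starting points, not tuned defaults:

```python
# Looser tolerances help when ruled lines are slightly misaligned.
TABLE_SETTINGS = {
    "vertical_strategy": "lines",    # use ruling lines for column boundaries
    "horizontal_strategy": "text",   # infer rows from text baselines
    "snap_tolerance": 5,             # merge nearly-coincident edges
    "join_tolerance": 5,
    "intersection_tolerance": 5,
}
# Usage: page.extract_tables(table_settings=TABLE_SETTINGS)
```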

8) Worked Examples & Snippets

  • resources/scripts/extract_tables.py — example using pdfplumber + pandas to export tables.
  • resources/scripts/fill_form.py — demonstrates populating AcroForm fields.
  • resources/scripts/create_pdf_report.py — builds multi-page report with reportlab.

9) Memory & Knowledge Integration

  • Log each run in Local Memory with skillUsed: "skill-pdf-workbench", success rate, and artefact IDs.
  • Link to related processes (docx workbench, OCR skills) for discoverability.
  • Reference memory IDs in PRs and review manifests.

10) Lifecycle & Versioning Notes

  • Track library versions (pypdf, pdfplumber, reportlab) and update scripts when APIs change.
  • Document differences between forms (AcroForm vs XFA) and expand scripts accordingly.
  • Review skill quarterly to incorporate new extraction techniques or vendor tools.

11) References & Evidence

  • resources/reference.md — extended PDF processing guide (Anthropic mirror).
  • resources/forms.md — form-specific workflows.
  • Bundled scripts for extraction, metadata, and report generation.
  • Artefacts: logs, CSV/XLSX exports, final PDFs.

12) Schema Gap Checklist

  • Add script to hash inputs/outputs for audit logs.
  • Automate OCR fallback integration.
  • Generate summary JSON for extracted tables to feed analytics pipelines.
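The summary-JSON gap could be closed with a small stdlib emitter; the summary shape below is a suggestion, not a defined schema:

```python
import json
from pathlib import Path

def write_summary(tables: list, out_path: str) -> dict:
    """Emit per-table row/column counts for downstream analytics."""
    summary = {
        "table_count": len(tables),
        "tables": [{"rows": len(t), "cols": len(t[0]) if t else 0} for t in tables],
    }
    Path(out_path).write_text(json.dumps(summary, indent=2))
    return summary

demo = [[["h1", "h2"], ["a", "b"], ["c", "d"]]]  # one 3x2 table
summary = write_summary(demo, "summary.json")
```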