id	skill-pdf-workbench
name	PDF Workbench — Extraction, Forms & Generation
description	Automate PDF analysis, form filling, and document assembly using Python toolkits and curated workflows for Cortex-OS.
version	1.0.0
author	brAInwav Documentation Guild
owner	@jamiescottcraik
category	documentation
difficulty	advanced
tags	pdf, automation, forms, extraction, document
estimatedTokens	4200
license	Complete terms in LICENSE.txt
requiredTools	python, pandoc, pypdf, reportlab
prerequisites	Install pypdf, pdfplumber, reportlab, pandas, and PyPDF2 as needed, Pandoc installed for conversions, Input PDFs accessible in the task workspace
relatedSkills	skill-docx-workbench, skill-creator
resources	./resources/reference.md, ./resources/forms.md, ./resources/scripts/check_bounding_boxes.py, ./resources/scripts/check_bounding_boxes_test.py, ./resources/scripts/check_fillable_fields.py, ./resources/scripts/convert_pdf_to_images.py, ./resources/scripts/create_validation_image.py, ./resources/scripts/extract_form_field_info.py, ./resources/scripts/fill_fillable_fields.py, ./resources/scripts/fill_pdf_form_with_annotations.py, ./resources/LICENSE.txt
deprecated	false
replacedBy	null
impl	packages/doc-tools/src/pdf_workbench.ts#runPdfWorkflow
inputs	[object Object]
outputs	[object Object]
preconditions	Source documents cleared for processing and backed up., Form field requirements documented (names, default values)., OCR plan established if PDFs are image-based (use OCR preprocessing skill if necessary).
sideEffects	Generates intermediate JSON/CSV files for extracted content., Writes processing logs and metadata reports to `logs/pdf-workbench/`.
estimatedCost	$0.003 / PDF workflow (~600 tokens across extraction + validation).
calls	skill-testing-evidence-triplet
requiresContext	memory://skills/skill-pdf-workbench/historical-runs
providesContext	memory://skills/skill-pdf-workbench/latest-report
monitoring	true
lifecycle	[object Object]
estimatedDuration	PT40M
i18n	[object Object]
persuasiveFraming	[object Object]
observability	[object Object]
governance	[object Object]
schemaStatus	[object Object]

PDF Workbench — Extraction, Forms & Generation

When to Use

Extracting tables or text from financial reports, statements, or regulatory filings.
Filling standardized PDF forms (e.g., onboarding, compliance attestations) programmatically.
Splitting or merging PDFs as part of document assembly workflows.
Generating branded or data-driven reports directly to PDF.

How to Apply

Assess the document: text-based vs image-based, form fields required, expected outputs.
Choose the workflow (extract, form-fill, merge, split, generate) and review resources/reference.md or forms.md for detailed steps.
Use the bundled scripts (pypdf/pdfplumber/reportlab) to implement the operation; capture logs/output files.
Validate results (compare counts, inspect sample pages, run Pandoc conversion if needed) and store artefacts.
Log outcomes in Local Memory and attach evidence to the task/PR.
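The first step, deciding whether a document is text-based or image-based, can be sketched as follows. This is a minimal sketch assuming pypdf is installed. The function names and the 20-character threshold are illustrative assumptions and are not taken from the bundled scripts.

```python
from typing import List


def classify_pages(page_texts: List[str], min_chars: int = 20) -> str:
    """Classify a document from the text extracted for each page.

    Returns "text" if every page has at least min_chars of text,
    "image" if no page does, and "mixed" otherwise.
    """
    if not page_texts:
        return "image"  # nothing extractable at all
    has_text = [len(text.strip()) >= min_chars for text in page_texts]
    if all(has_text):
        return "text"
    if not any(has_text):
        return "image"
    return "mixed"


def assess_pdf(path: str, min_chars: int = 20) -> str:
    """Read a PDF with pypdf and classify it (requires pypdf)."""
    # Imported lazily so classify_pages stays usable without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(path)
    texts = [page.extract_text() or "" for page in reader.pages]
    return classify_pages(texts, min_chars)
```

A result of "image" or "mixed" suggests routing the file through OCR before extraction.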

Success Criteria

Extracted data matches expected record counts, headers, and formatting.
Form fields populated accurately with correct data types and character sets.
Generated/split/merged PDFs pass QC (manual spot-check or automated validation).
Evidence bundle includes logs, sample outputs, and markdown summary.
Local Memory entry records the workflow, effectiveness ≥0.8, and artefact IDs.

0) Mission Snapshot — What / Why / Where / How / Result

What: Provide reliable automation patterns for PDF manipulation in Cortex-OS engagements.
Why: Manual PDF processing is error-prone; automation yields repeatable outputs and audit trails.
Where: Finance, compliance, partner support, and operations tasks relying on PDF documentation.
How: Combine Python toolkits (pypdf, pdfplumber, reportlab) with curated scripts and validation steps.
Result: Reviewed PDF artefacts, backed by traceable evidence and ready for downstream workflows.

1) Contract — Inputs → Outputs

Inputs: source PDF(s), workflow choice, field mapping (for forms), destination path.
Outputs: processed PDF(s), extracted datasets (CSV/XLSX), logs, markdown summary, Local Memory references.
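One way to make the input side of this contract checkable is a small typed request object. The sketch below is hypothetical: the class name, field names, and messages are assumptions for illustration and do not describe the actual runPdfWorkflow interface. The workflow names are the five listed under How to Apply.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List

# Workflow names as listed in "How to Apply".
WORKFLOWS = {"extract", "form-fill", "merge", "split", "generate"}


@dataclass
class WorkflowRequest:
    """Hypothetical container for the contract inputs."""

    sources: List[Path]
    workflow: str
    destination: Path
    field_mapping: Dict[str, str] = field(default_factory=dict)

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the request is usable."""
        problems = []
        if not self.sources:
            problems.append("no source PDFs given")
        if self.workflow not in WORKFLOWS:
            problems.append(f"unknown workflow: {self.workflow}")
        if self.workflow == "form-fill" and not self.field_mapping:
            problems.append("form-fill requires a field mapping")
        return problems
```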

2) Preconditions & Safeguards

Verify document confidentiality and approvals before processing.
If PDFs require OCR, plan upfront (use Tesseract or other OCR pipeline before extraction).
Ensure file paths and encodings are sanitized to avoid injection issues.
Install dependencies in an isolated environment (python -m venv or uv) to avoid version drift.
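The path-sanitization safeguard can be illustrated with a check that a requested file stays inside the task workspace. This is a minimal sketch using only the standard library. The function name and the rule that only .pdf files are accepted are assumptions.

```python
from pathlib import Path


def resolve_in_workspace(workspace: str, user_path: str) -> Path:
    """Resolve user_path under workspace and reject anything that escapes it."""
    root = Path(workspace).resolve()
    # An absolute user_path replaces root here; the containment check below catches it.
    candidate = (root / user_path).resolve()
    if root not in candidate.parents:
        raise ValueError(f"path escapes workspace: {user_path}")
    if candidate.suffix.lower() != ".pdf":
        raise ValueError(f"not a PDF path: {user_path}")
    return candidate
```

Resolving before comparing means "../" segments and symlinks are evaluated first, so the check runs against the real location rather than the raw string.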

3) Implementation Playbook (RED→GREEN→REFACTOR)

Recon (RED): Inspect the PDF (page count, structure), identify forms/tables, and list required outputs.
Execute (GREEN): Run relevant scripts (extraction, form fill) with logging enabled; iterate on selectors/field mapping.
Refine (REFACTOR): Validate outputs, handle edge cases (empty tables, merged cells), and rerun until the quality gates are satisfied.

4) Observability & Telemetry Hooks

Log steps with [brAInwav] prefix, including page counts and success/failure per batch.
Capture metrics on records extracted and store summary JSON/CSV for downstream analytics.
Archive raw logs in logs/pdf-workbench/ for review.
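The hooks above could look like the following sketch, which uses only the standard library. The logger name, log file name, and summary fields are assumptions. Only the [brAInwav] prefix and the logs/pdf-workbench/ directory come from this skill.

```python
import json
import logging
from pathlib import Path


def make_logger(log_dir: str = "logs/pdf-workbench") -> logging.Logger:
    """Return a file logger whose lines start with the [brAInwav] prefix."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(f"pdf_workbench:{log_dir}")
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler(Path(log_dir) / "run.log", encoding="utf-8")
        handler.setFormatter(logging.Formatter("[brAInwav] %(levelname)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        logger.propagate = False
    return logger


def record_batch(logger: logging.Logger, log_dir: str, batch_id: str,
                 page_count: int, records: int, ok: bool) -> Path:
    """Log one batch and write a summary JSON file for downstream analytics."""
    status = "success" if ok else "failure"
    logger.info("batch=%s pages=%d records=%d status=%s",
                batch_id, page_count, records, status)
    summary = {"batch": batch_id, "pages": page_count,
               "records": records, "status": status}
    out = Path(log_dir) / f"summary-{batch_id}.json"
    out.write_text(json.dumps(summary, indent=2), encoding="utf-8")
    return out
```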

5) Safety, Compliance & Governance

Preserve original documents untouched; work on copies.
Redact or mask sensitive data in outputs before sharing.
Keep audit trails (input hash, script version) for repeatability.
Respect form licensing/usage requirements before publishing generated PDFs.
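Working on copies and keeping an input hash can be combined in one step. The sketch below uses only the standard library. The function names and the shape of the audit record are assumptions rather than an existing script.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_working_copy(source, work_dir, script_version: str) -> dict:
    """Copy source into work_dir and return an audit record for the run."""
    src = Path(source)
    dest_dir = Path(work_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copy = dest_dir / src.name
    shutil.copy2(src, copy)  # the original is only read, never modified
    src_hash = sha256_of(src)
    if sha256_of(copy) != src_hash:
        raise IOError(f"copy of {src} does not match the original")
    return {
        "source": str(src),
        "working_copy": str(copy),
        "sha256": src_hash,
        "script_version": script_version,
    }
```

Storing the returned record next to the logs gives a reviewer enough to confirm which input and script version produced a given artefact.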

6) Success Criteria & Acceptance Tests

Automated scripts exit cleanly; tests comparing record counts or field completions pass.
Manual spot-check of PDF structure confirms correct pagination and formatting.
Evidence Triplet recorded (initial failure if any, passing run, artifact verification).
Reviewer checklist completed with no major issues outstanding.
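A record-count and header test of the kind described above might be written as follows. This is a minimal sketch for a CSV export. The function name and message wording are assumptions.

```python
import csv
from typing import List, Sequence


def check_extract(csv_path, expected_headers: Sequence[str],
                  expected_rows: int) -> List[str]:
    """Compare an exported CSV against the expected headers and row count."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        headers = next(reader, [])
        row_count = sum(1 for row in reader if row)  # ignore blank lines
    problems = []
    if list(headers) != list(expected_headers):
        problems.append(
            f"headers differ: got {headers}, expected {list(expected_headers)}")
    if row_count != expected_rows:
        problems.append(
            f"row count differs: got {row_count}, expected {expected_rows}")
    return problems
```

An empty list means the export passes. Any returned messages can be copied into the evidence bundle as the failing run.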

7) Failure Modes & Recovery

Image-only PDF: Apply OCR pre-processing; fall back to manual processing if text extraction is impossible.
Misparsed tables: Adjust pdfplumber table settings (for example snap_tolerance and join_tolerance), or fall back to tabula or OCR.
Form fields misnamed: Inspect field names with the helper scripts in resources/scripts/ (for example extract_form_field_info.py) and correct the mapping.
Large files: Process in batches and monitor memory usage.
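Two of these recoveries, batching large files and loosening table tolerances, can be sketched together. This assumes pdfplumber is installed. The function names, the batch size, and the tolerance values are illustrative assumptions and should be tuned per document.

```python
from typing import Iterator, Optional


def page_batches(total_pages: int, batch_size: int) -> Iterator[range]:
    """Yield ranges of zero-based page indices, batch_size pages at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, total_pages, batch_size):
        yield range(start, min(start + batch_size, total_pages))


def extract_tables_in_batches(path: str, batch_size: int = 25,
                              table_settings: Optional[dict] = None):
    """Yield (page_range, tables) per batch using pdfplumber (requires pdfplumber)."""
    # Imported lazily so page_batches stays usable without pdfplumber.
    import pdfplumber

    # Looser than pdfplumber's defaults; assumed values for misaligned rulings.
    settings = table_settings or {"snap_tolerance": 5, "join_tolerance": 5}
    with pdfplumber.open(path) as pdf:
        for batch in page_batches(len(pdf.pages), batch_size):
            tables = []
            for index in batch:
                page = pdf.pages[index]
                tables.extend(page.extract_tables(table_settings=settings))
            yield batch, tables
```

Because the function is a generator, each batch can be written to disk before the next one is parsed, which keeps the amount of extracted data held in memory bounded.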

8) Worked Examples & Snippets

resources/scripts/extract_tables.py — example using pdfplumber + pandas to export tables.
resources/scripts/fill_form.py — demonstrates populating AcroForm fields.
resources/scripts/create_pdf_report.py — builds multi-page report with reportlab.
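For orientation, populating AcroForm fields with pypdf can look like the sketch below. It is not the content of the bundled scripts. The function names are assumptions, and hierarchical forms may report qualified field names that need extra mapping.

```python
from typing import Dict, Iterable, List


def unknown_fields(available: Iterable[str], mapping: Dict[str, str]) -> List[str]:
    """Return mapping keys that do not match any field name in the form."""
    known = set(available)
    return sorted(name for name in mapping if name not in known)


def fill_form(src: str, dest: str, mapping: Dict[str, str]) -> None:
    """Fill AcroForm text fields with pypdf and write the result to dest."""
    # Imported lazily so unknown_fields stays usable without pypdf.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    fields = reader.get_fields() or {}  # get_fields() returns None if no form
    missing = unknown_fields(fields.keys(), mapping)
    if missing:
        raise KeyError(f"fields not present in form: {missing}")

    writer = PdfWriter()
    writer.append(reader)  # copy pages and form data into the writer
    for page in writer.pages:
        if "/Annots" in page:  # skip pages that carry no form widgets
            writer.update_page_form_field_values(
                page, mapping, auto_regenerate=False)
    with open(dest, "wb") as fh:
        writer.write(fh)
```

Checking the mapping against the form before writing turns a misnamed field into an explicit error instead of a silently empty box.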

9) Memory & Knowledge Integration

Log each run in Local Memory with skillUsed: "skill-pdf-workbench", success rate, and artefact IDs.
Link to related processes (docx workbench, OCR skills) for discoverability.
Reference memory IDs in PRs and review manifests.

10) Lifecycle & Versioning Notes

Track library versions (pypdf, pdfplumber, reportlab) and update scripts when APIs change.
Document differences between forms (AcroForm vs XFA) and expand scripts accordingly.
Review skill quarterly to incorporate new extraction techniques or vendor tools.

11) References & Evidence

resources/reference.md — extended PDF processing guide (Anthropic mirror).
resources/forms.md — form-specific workflows.
Bundled scripts for extraction, metadata, and report generation.
Artefacts: logs, CSV/XLSX exports, final PDFs.

12) Schema Gap Checklist

Add script to hash inputs/outputs for audit logs.
Automate OCR fallback integration.
Generate summary JSON for extracted tables to feed analytics pipelines.
