| id | skill-pdf-workbench |
| name | PDF Workbench — Extraction, Forms & Generation |
| description | Automate PDF analysis, form filling, and document assembly using Python toolkits and curated workflows for Cortex-OS. |
| version | 1.0.0 |
| author | brAInwav Documentation Guild |
| owner | @jamiescottcraik |
| category | documentation |
| difficulty | advanced |
| tags | pdf, automation, forms, extraction, document |
| estimatedTokens | 4200 |
| license | Complete terms in LICENSE.txt |
| requiredTools | python, pandoc, pypdf, reportlab |
| prerequisites | Install pypdf, pdfplumber, reportlab, pandas, and PyPDF2 as needed, Pandoc installed for conversions, Input PDFs accessible in the task workspace |
| relatedSkills | skill-docx-workbench, skill-creator |
| resources | ./resources/reference.md, ./resources/forms.md, ./resources/scripts/check_bounding_boxes.py, ./resources/scripts/check_bounding_boxes_test.py, ./resources/scripts/check_fillable_fields.py, ./resources/scripts/convert_pdf_to_images.py, ./resources/scripts/create_validation_image.py, ./resources/scripts/extract_form_field_info.py, ./resources/scripts/fill_fillable_fields.py, ./resources/scripts/fill_pdf_form_with_annotations.py, ./resources/LICENSE.txt |
| deprecated | false |
| replacedBy | null |
| impl | packages/doc-tools/src/pdf_workbench.ts#runPdfWorkflow |
| preconditions | Source documents cleared for processing and backed up., Form field requirements documented (names, default values)., OCR plan established if PDFs are image-based (use OCR preprocessing skill if necessary). |
| sideEffects | Generates intermediate JSON/CSV files for extracted content., Writes processing logs and metadata reports to `logs/pdf-workbench/`. |
| estimatedCost | $0.003 / PDF workflow (~600 tokens across extraction + validation). |
| calls | skill-testing-evidence-triplet |
| requiresContext | memory://skills/skill-pdf-workbench/historical-runs |
| providesContext | memory://skills/skill-pdf-workbench/latest-report |
| monitoring | true |
| estimatedDuration | PT40M |
PDF Workbench — Extraction, Forms & Generation
When to Use
- Extracting tables or text from financial reports, statements, or regulatory filings.
- Filling standardized PDF forms (e.g., onboarding, compliance attestations) programmatically.
- Splitting or merging PDFs as part of document assembly workflows.
- Generating branded or data-driven reports directly to PDF.
How to Apply
- Assess the document: text-based vs image-based, form fields required, expected outputs.
- Choose the workflow (`extract`, `form-fill`, `merge`, `split`, `generate`) and review `resources/reference.md` or `forms.md` for detailed steps.
- Use the bundled scripts (pypdf/pdfplumber/reportlab) to implement the operation; capture logs/output files.
- Validate results (compare counts, inspect sample pages, run Pandoc conversion if needed) and store artefacts.
- Log Local Memory outcomes and attach evidence to the task/PR.
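The assessment step can be sketched as follows. This is a minimal illustration, not one of the bundled scripts: `classify_pages`, `assess_pdf`, and the 20-character threshold are hypothetical names and values chosen for the example. It assumes pypdf is installed and uses only `PdfReader`, `extract_text()`, and `get_fields()`.

```python
from typing import Iterable

MIN_CHARS_PER_PAGE = 20  # hypothetical threshold; tune it for your documents


def classify_pages(page_texts: Iterable[str], min_chars: int = MIN_CHARS_PER_PAGE) -> str:
    """Return 'text', 'image', or 'mixed' from the text extracted per page."""
    flags = [len((text or "").strip()) >= min_chars for text in page_texts]
    if not flags or not any(flags):
        return "image"  # nothing extractable: plan for OCR
    if all(flags):
        return "text"
    return "mixed"


def assess_pdf(path: str) -> dict:
    """Collect page count, text/image classification, and form field names."""
    # Imported lazily so classify_pages stays usable without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(path)
    texts = [page.extract_text() or "" for page in reader.pages]
    fields = reader.get_fields() or {}
    return {
        "pages": len(reader.pages),
        "kind": classify_pages(texts),
        "form_fields": sorted(fields),
    }
```

A result of `"image"` or `"mixed"` suggests scheduling OCR before extraction; a non-empty `form_fields` list points toward the `form-fill` workflow.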
Success Criteria
- Extracted data matches expected record counts, headers, and formatting.
- Form fields populated accurately with correct data types and character sets.
- Generated/split/merged PDFs pass QC (manual spot-check or automated validation).
- Evidence bundle includes logs, sample outputs, and markdown summary.
- Local Memory entry records the workflow, effectiveness ≥0.8, and artefact IDs.
0) Mission Snapshot — What / Why / Where / How / Result
- What: Provide reliable automation patterns for PDF manipulation in Cortex-OS engagements.
- Why: Manual PDF processing is error-prone; automation yields repeatable outputs and audit trails.
- Where: Finance, compliance, partner support, and operations tasks relying on PDF documentation.
- How: Combine Python toolkits (pypdf, pdfplumber, reportlab) with curated scripts and validation steps.
- Result: Reviewed PDF artefacts, backed by traceable evidence and ready for downstream workflows.
1) Contract — Inputs → Outputs
Inputs: source PDF(s), workflow choice, field mapping (for forms), destination path. Outputs: processed PDF(s), extracted datasets (CSV/XLSX), logs, markdown summary, Local Memory references.
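One way to make this contract explicit is a pair of small data classes. The sketch below is an assumption about shape only: `WorkflowRequest`, `WorkflowResult`, and their field names are hypothetical and do not come from the skill's implementation.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List, Optional

WORKFLOWS = {"extract", "form-fill", "merge", "split", "generate"}


@dataclass
class WorkflowRequest:
    """Inputs: source PDF(s), workflow choice, destination, optional field mapping."""
    sources: List[Path]
    workflow: str
    destination: Path
    field_mapping: Dict[str, str] = field(default_factory=dict)

    def validate(self) -> None:
        if self.workflow not in WORKFLOWS:
            raise ValueError(f"unknown workflow: {self.workflow!r}")
        if not self.sources:
            raise ValueError("at least one source PDF is required")
        if self.workflow == "form-fill" and not self.field_mapping:
            raise ValueError("form-fill requires a field mapping")


@dataclass
class WorkflowResult:
    """Outputs: processed PDFs, datasets, logs, summary, memory references."""
    processed_pdfs: List[Path] = field(default_factory=list)
    datasets: List[Path] = field(default_factory=list)
    log_path: Optional[Path] = None
    summary_markdown: str = ""
    memory_refs: List[str] = field(default_factory=list)
```

Calling `validate()` before any processing surfaces a missing field mapping early rather than halfway through a run.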
2) Preconditions & Safeguards
- Verify document confidentiality and approvals before processing.
- If PDFs require OCR, plan upfront (use Tesseract or other OCR pipeline before extraction).
- Ensure file paths and encodings are sanitized to avoid injection issues.
- Install dependencies in an isolated environment (`python -m venv` or `uv`) to avoid version drift.
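The path-sanitization safeguard could look like the sketch below, which uses only the standard library. `resolve_in_workspace` is a hypothetical helper, and the rule that inputs must end in `.pdf` is an assumption made for the example.

```python
from pathlib import Path


def resolve_in_workspace(workspace: Path, user_path: str) -> Path:
    """Resolve user_path inside workspace and reject anything that escapes it."""
    root = workspace.resolve()
    # Joining with an absolute user_path discards root, so the check below catches it.
    candidate = (root / user_path).resolve()
    if root not in candidate.parents:
        raise ValueError(f"path escapes workspace: {user_path!r}")
    if candidate.suffix.lower() != ".pdf":
        raise ValueError(f"not a PDF path: {user_path!r}")
    return candidate
```

Resolving before comparing means `..` segments and symlinks are normalized first, so the containment check runs against the real target.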
3) Implementation Playbook (RED→GREEN→REFACTOR)
- Recon (RED): Inspect the PDF (page count, structure), identify forms/tables, and list required outputs.
- Execute (GREEN): Run relevant scripts (extraction, form fill) with logging enabled; iterate on selectors/field mapping.
- Refine (REFACTOR): Validate outputs, handle edge cases (empty tables, merged cells), and rerun until the quality gates are satisfied.
4) Observability & Telemetry Hooks
- Log steps with the `[brAInwav]` prefix, including page counts and success/failure per batch.
- Capture metrics on records extracted and store summary JSON/CSV for downstream analytics.
- Archive raw logs in `logs/pdf-workbench/` for review.
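A minimal sketch of these logging hooks with the standard `logging` module follows. The logger name, the `run.log` file name, and the message layout are assumptions; only the `[brAInwav]` prefix and the `logs/pdf-workbench/` directory come from the text above.

```python
import logging
from pathlib import Path

LOG_DIR = Path("logs/pdf-workbench")


def get_logger(log_dir: Path = LOG_DIR) -> logging.Logger:
    """Return a file logger whose records carry the [brAInwav] prefix."""
    log_dir.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger("pdf_workbench")  # hypothetical logger name
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler(log_dir / "run.log", encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s [brAInwav] %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


def log_batch(logger: logging.Logger, batch_id: int, pages: int, ok: bool) -> None:
    """Record page count and outcome for one batch."""
    status = "success" if ok else "failure"
    logger.info("batch=%d pages=%d status=%s", batch_id, pages, status)
```

Key-value pairs in the message keep the log easy to grep and to convert into summary metrics later.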
5) Safety, Compliance & Governance
- Preserve original documents untouched; work on copies.
- Redact or mask sensitive data in outputs before sharing.
- Keep audit trails (input hash, script version) for repeatability.
- Respect form licensing/usage requirements before publishing generated PDFs.
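The "work on copies" and audit-trail points can be combined in one step. The sketch below is an illustration under assumptions: `make_working_copy`, the record keys, and `SCRIPT_VERSION` are hypothetical names, not part of the bundled scripts.

```python
import hashlib
import shutil
from pathlib import Path

SCRIPT_VERSION = "0.0.0-example"  # hypothetical; substitute the real script version


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_working_copy(source: Path, work_dir: Path) -> dict:
    """Copy source into work_dir and return an audit record for the run."""
    work_dir.mkdir(parents=True, exist_ok=True)
    copy_path = work_dir / source.name
    shutil.copy2(source, copy_path)
    record = {
        "source": str(source),
        "working_copy": str(copy_path),
        "input_sha256": sha256_of(source),
        "script_version": SCRIPT_VERSION,
    }
    if sha256_of(copy_path) != record["input_sha256"]:
        raise OSError("working copy does not match the original")
    return record
```

Storing the record next to the run's logs lets a reviewer confirm later that the original file was never modified.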
6) Success Criteria & Acceptance Tests
- Automated scripts exit cleanly; tests comparing record counts or field completions pass.
- Manual spot-check of PDF structure confirms correct pagination and formatting.
- Evidence Triplet recorded (initial failure if any, passing run, artifact verification).
- Reviewer checklist completed with no major issues outstanding.
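An automated record-count test might be as small as the sketch below. It assumes the extraction step wrote a UTF-8 CSV with a header row; `check_extract` is a hypothetical helper.

```python
import csv
from pathlib import Path
from typing import List


def check_extract(csv_path: Path, expected_headers: List[str], expected_rows: int) -> List[str]:
    """Return a list of problems; an empty list means the check passed."""
    with csv_path.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))
    headers, records = (rows[0], rows[1:]) if rows else ([], [])
    problems = []
    if headers != expected_headers:
        problems.append(f"headers {headers} != expected {expected_headers}")
    if len(records) != expected_rows:
        problems.append(f"{len(records)} records != expected {expected_rows}")
    return problems
```

Returning every problem at once, rather than raising on the first, gives the reviewer a complete picture in a single run.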
7) Failure Modes & Recovery
- Image-only PDF: Apply OCR pre-processing; fall back to manual processing if text extraction is impossible.
- Misparsed tables: Adjust pdfplumber table settings (for example `snap_tolerance` or `join_tolerance`), or fall back to Tabula or OCR.
- Form fields misnamed: Inspect field names (`resources/scripts/forms` helper) and map correctly.
- Large files: Process in batches and monitor memory usage.
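Two of these recoveries, batching and tuned table settings, can be sketched together. The tolerance values and the batch size of 25 are illustrative assumptions. The sketch relies on pdfplumber's `open()` and `Page.extract_tables(table_settings=...)`; `page_batches` and `extract_tables_in_batches` are hypothetical helper names.

```python
from typing import Iterator, List, Tuple

# Illustrative values only; looser tolerances help when ruling lines are slightly misaligned.
TABLE_SETTINGS = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_tolerance": 5,
    "join_tolerance": 5,
}


def page_batches(total_pages: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, stop) page index ranges covering total_pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, total_pages, batch_size):
        yield start, min(start + batch_size, total_pages)


def extract_tables_in_batches(path: str, batch_size: int = 25) -> List[list]:
    """Extract tables batch by batch; each batch is a natural logging checkpoint."""
    import pdfplumber  # imported lazily so page_batches has no third-party dependency

    tables: List[list] = []
    with pdfplumber.open(path) as pdf:
        for start, stop in page_batches(len(pdf.pages), batch_size):
            for page in pdf.pages[start:stop]:
                tables.extend(page.extract_tables(table_settings=TABLE_SETTINGS))
    return tables
```

For very large files it may be preferable to write each batch's tables to disk instead of accumulating them in one list.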
8) Worked Examples & Snippets
- `resources/scripts/extract_tables.py`: example using pdfplumber + pandas to export tables.
- `resources/scripts/fill_form.py`: demonstrates populating AcroForm fields.
- `resources/scripts/create_pdf_report.py`: builds multi-page report with reportlab.
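For orientation, here is a hedged sketch of AcroForm filling with pypdf. It is not the content of the bundled `fill_form.py`; `unknown_fields` and `fill_form` are hypothetical names. It assumes a pypdf version that provides `PdfWriter.append` and `update_page_form_field_values`.

```python
from typing import Dict, Iterable, List


def unknown_fields(available: Iterable[str], mapping: Dict[str, str]) -> List[str]:
    """Return mapping keys that do not match any field name found in the form."""
    known = set(available)
    return sorted(name for name in mapping if name not in known)


def fill_form(src: str, dst: str, mapping: Dict[str, str]) -> None:
    """Fill AcroForm fields in src and write the result to dst."""
    from pypdf import PdfReader, PdfWriter  # lazy import keeps unknown_fields standalone

    reader = PdfReader(src)
    missing = unknown_fields(reader.get_fields() or {}, mapping)
    if missing:
        raise KeyError(f"fields not present in form: {missing}")

    writer = PdfWriter()
    writer.append(reader)  # copies pages and the form definition
    for page in writer.pages:
        writer.update_page_form_field_values(page, mapping)
    with open(dst, "wb") as handle:
        writer.write(handle)
```

Checking names before writing turns a silent no-op on a misnamed field into an explicit error, which matches the "Form fields misnamed" failure mode above.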
9) Memory & Knowledge Integration
- Log each run in Local Memory with `skillUsed: "skill-pdf-workbench"`, success rate, and artefact IDs.
- Link to related processes (docx workbench, OCR skills) for discoverability.
- Reference memory IDs in PRs and review manifests.
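A run record along these lines could be assembled before it is handed to Local Memory. The schema below is an assumption: only `skillUsed` and the 0.8 effectiveness threshold appear in this document, and treating effectiveness as the success rate is a simplification made for the example.

```python
from typing import List

MIN_EFFECTIVENESS = 0.8  # threshold stated in the success criteria


def build_memory_entry(workflow: str, succeeded: int, attempted: int,
                       artefact_ids: List[str]) -> dict:
    """Build a hypothetical Local Memory payload for one workflow run."""
    if attempted < 1:
        raise ValueError("attempted must be at least 1")
    effectiveness = round(succeeded / attempted, 2)
    return {
        "skillUsed": "skill-pdf-workbench",
        "workflow": workflow,
        "effectiveness": effectiveness,  # assumed equal to the success rate
        "meetsThreshold": effectiveness >= MIN_EFFECTIVENESS,
        "artefactIds": list(artefact_ids),
    }
```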
10) Lifecycle & Versioning Notes
- Track library versions (pypdf, pdfplumber, reportlab) and update scripts when APIs change.
- Document differences between forms (AcroForm vs XFA) and expand scripts accordingly.
- Review skill quarterly to incorporate new extraction techniques or vendor tools.
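Library versions can be captured at run time with the standard library's `importlib.metadata`, as in this minimal sketch. `installed_versions` is a hypothetical helper; a value of `None` means the package is not installed in the current environment.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

TRACKED = ("pypdf", "pdfplumber", "reportlab")


def installed_versions(packages: Iterable[str] = TRACKED) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

Writing this mapping into each run's log makes it easier to tie an output difference to a library upgrade.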
11) References & Evidence
- `resources/reference.md`: extended PDF processing guide (Anthropic mirror).
- `resources/forms.md`: form-specific workflows.
- Bundled scripts for extraction, metadata, and report generation.
- Artefacts: logs, CSV/XLSX exports, final PDFs.
12) Schema Gap Checklist
- Add script to hash inputs/outputs for audit logs.
- Automate OCR fallback integration.
- Generate summary JSON for extracted tables to feed analytics pipelines.
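The last item could start from something like the sketch below. It assumes tables arrive as lists of rows with the header in the first row, which is the shape pdfplumber's `extract_tables()` returns; `summarize_tables` and the key names are hypothetical.

```python
import json
from typing import List, Optional

Table = List[List[Optional[str]]]


def summarize_tables(tables: List[Table]) -> dict:
    """Summarize extracted tables; the first row of each table is treated as its header."""
    items = []
    for index, table in enumerate(tables):
        header = table[0] if table else []
        items.append({
            "table": index,
            "columns": len(header),
            "rows": max(len(table) - 1, 0),  # data rows, excluding the header
            "header": header,
        })
    return {
        "table_count": len(tables),
        "total_rows": sum(item["rows"] for item in items),
        "tables": items,
    }


def write_summary(tables: List[Table], path: str) -> None:
    """Write the summary as JSON for downstream analytics."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(summarize_tables(tables), handle, indent=2)
```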