name	pdf-split
description	PDF chapter splitting

PDF Chapter Splitting

Split a PDF into one file per chapter, using either its embedded table of contents (bookmarks) or chapter headings detected in the text.

Overview

This skill applies when a book or document needs to be divided by chapter and at least one of the following holds:

The PDF has embedded bookmarks/outlines
Chapter boundaries can be detected from text patterns (e.g., "Chapter 1:", "Part One")

Prerequisites

No manual install is needed. Each script declares pypdf as a uv inline script dependency, and uv run resolves it on first use:

```python
# /// script
# dependencies = ["pypdf"]
# ///
```

Workflow

Phase 1: Analyze PDF Structure

Run scripts/extract_toc.py to analyze the PDF:

```bash
uv run ~/.claude/skills/pdf-split/scripts/extract_toc.py <pdf_path>
```

Output includes:

Total page count
Embedded bookmarks/outline (if present)
Detected chapter patterns from text
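For orientation, the bookmark part of this analysis can be sketched with pypdf as below. This is a minimal sketch and not the contents of scripts/extract_toc.py: the helper names `flatten_outline` and `analyze` are hypothetical, and the sketch assumes the outline arrives as nested lists of entries that each carry a `title`.

```python
# /// script
# dependencies = ["pypdf"]
# ///
"""Hypothetical sketch: report page count and embedded bookmarks."""


def flatten_outline(outline, page_of, depth=0):
    """Flatten a nested outline into (depth, title, 1-based page) tuples.

    `outline` mixes entries and nested lists (child bookmarks);
    `page_of` maps an entry to a 0-based page index, or None if unresolved.
    """
    rows = []
    for item in outline:
        if isinstance(item, list):
            rows.extend(flatten_outline(item, page_of, depth + 1))
            continue
        index = page_of(item)
        if index is not None:  # skip bookmarks that point nowhere
            rows.append((depth, item.title, index + 1))
    return rows


def analyze(pdf_path):
    """Return (total_pages, bookmarks) for a PDF file."""
    # Imported lazily so flatten_outline can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    bookmarks = flatten_outline(reader.outline, reader.get_destination_page_number)
    return len(reader.pages), bookmarks
```

An empty bookmark list from `analyze` is the signal to fall back to text pattern detection.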

Phase 2: Define Chapter Boundaries

Based on Phase 1 output, define chapter boundaries as a list of tuples:

```python
chapters = [
    (start_page, end_page, "chapter_name"),
    # ...
]
```

If bookmarks exist: Use the bookmark page numbers directly as chapter start pages.
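Bookmarks give only start pages, so end pages have to be derived. A minimal sketch, assuming 1-based inclusive page numbers (as the Phase 3 example suggests) and a hypothetical helper name `boundaries_from_starts`:

```python
def boundaries_from_starts(starts, total_pages):
    """Turn [(start_page, name), ...] into [(start, end, name), ...].

    Pages are 1-based and inclusive. Each chapter ends on the page before
    the next one starts; the last chapter runs to the final page.
    """
    ordered = sorted(starts)
    chapters = []
    for i, (start, name) in enumerate(ordered):
        if i + 1 < len(ordered):
            end = ordered[i + 1][0] - 1
        else:
            end = total_pages
        # Two bookmarks on the same page would give end < start; clamp so
        # the chapter still contains its own start page.
        chapters.append((start, max(start, end), name))
    return chapters
```

For example, start pages 1 and 23 in a 45-page book give `(1, 22, ...)` and `(23, 45, ...)`.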

If no bookmarks:

Search for chapter heading patterns in text
Verify boundaries by checking page content
Present proposed boundaries for user confirmation
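Before presenting boundaries for confirmation, a mechanical sanity check can catch overlaps, gaps, and out-of-range pages. The sketch below is one possible check, not part of the bundled scripts; `check_boundaries` is a hypothetical name and pages are assumed to be 1-based and inclusive.

```python
def check_boundaries(chapters, total_pages):
    """Return a list of human-readable problems; empty means none were found."""
    problems = []
    previous_end = 0
    for start, end, name in sorted(chapters):
        if start < 1 or end > total_pages:
            problems.append(f"{name}: pages {start}-{end} fall outside 1-{total_pages}")
        if end < start:
            problems.append(f"{name}: end page {end} precedes start page {start}")
        if start <= previous_end:
            problems.append(f"{name}: overlaps the previous chapter")
        elif start > previous_end + 1:
            problems.append(f"{name}: pages {previous_end + 1}-{start - 1} are not covered")
        previous_end = max(previous_end, end)
    if previous_end < total_pages:
        problems.append(f"pages {previous_end + 1}-{total_pages} are not covered")
    return problems
```

Uncovered pages are not always errors (blank pages or an index may be skipped on purpose), so the messages are better treated as prompts for the user than as hard failures.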

Phase 3: Execute Split

Run scripts/split_by_chapters.py with the chapter definitions:

```bash
uv run ~/.claude/skills/pdf-split/scripts/split_by_chapters.py <pdf_path> <output_dir> --chapters '<json_chapters>'
```

Example:

```bash
uv run ~/.claude/skills/pdf-split/scripts/split_by_chapters.py \
  ~/book.pdf \
  ~/book_chapters \
  --chapters '[[1,22,"00_Intro"],[23,45,"01_Chapter1"]]'
```
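For custom operations, the core of such a split can be sketched with pypdf as follows. This is an illustration under stated assumptions, not the source of scripts/split_by_chapters.py: `parse_chapters` and `split_pdf` are hypothetical names, and page numbers are assumed to be 1-based and inclusive, as the example above suggests.

```python
# /// script
# dependencies = ["pypdf"]
# ///
"""Hypothetical sketch of a chapter split; not the bundled script."""
import json
from pathlib import Path


def parse_chapters(raw):
    """Parse the --chapters JSON into (start, end, name) tuples."""
    return [(int(start), int(end), str(name)) for start, end, name in json.loads(raw)]


def split_pdf(pdf_path, output_dir, chapters):
    """Write one PDF per chapter and return the paths written."""
    # Imported lazily so parse_chapters works without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(Path(pdf_path).expanduser())
    out = Path(output_dir).expanduser()
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for start, end, name in chapters:
        writer = PdfWriter()
        # Convert the 1-based inclusive range to 0-based page indices.
        for index in range(start - 1, end):
            writer.add_page(reader.pages[index])
        target = out / f"{name}.pdf"
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

Here the chapter name is used as the file name unchanged, which matches the example's pre-numbered names such as `00_Intro`; whether the bundled script does the same is not confirmed by this document.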

Common Chapter Patterns

| Pattern | Regex | Example |
| --- | --- | --- |
| Numbered | `Chapter\s+\d+` | "Chapter 1", "Chapter 12" |
| Part + Chapter | `Part\s+\w+.*Chapter` | "Part One: Chapter 1" |
| Section | `Section\s+\d+` | "Section 1.1" |
| Roman numerals | `Chapter\s+[IVXLC]+` | "Chapter IV" |
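The patterns in the table can be tried out with a small classifier. The sketch below is illustrative only: `classify_heading` is a hypothetical helper, and anchoring each pattern at the start of the line (via `re.match`) is a choice made here, not something the table specifies.

```python
import re

# Regexes copied from the table above, keyed by pattern name.
CHAPTER_PATTERNS = {
    "Numbered": r"Chapter\s+\d+",
    "Part + Chapter": r"Part\s+\w+.*Chapter",
    "Section": r"Section\s+\d+",
    "Roman numerals": r"Chapter\s+[IVXLC]+",
}


def classify_heading(line):
    """Return the name of the first pattern matching at the line start, else None."""
    text = line.strip()
    for name, pattern in CHAPTER_PATTERNS.items():
        if re.match(pattern, text):
            return name
    return None
```

Because `[IVXLC]+` also matches the first letter of ordinary words, a line such as "Chapter Introduction" is classified as a Roman numeral heading; appending `\b` to that regex is one way to tighten it.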

Edge Cases

Large Chapter Detection (100+ pages)

When a detected chapter exceeds 100 pages, verify the boundary:

Check if appendix content is included
Look for sub-sections that should be separate files
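Flagging such chapters is simple arithmetic on the boundary tuples. A minimal sketch, assuming 1-based inclusive pages and a hypothetical helper name `oversized_chapters`:

```python
def oversized_chapters(chapters, max_pages=100):
    """Return (name, page_count) for chapters longer than max_pages."""
    flagged = []
    for start, end, name in chapters:
        count = end - start + 1  # both ends are included
        if count > max_pages:
            flagged.append((name, count))
    return flagged
```

A chapter of exactly 100 pages is not flagged, in line with "exceeds 100 pages".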

Missing TOC

When no bookmarks or clear patterns exist:

Extract the text of the first 20 pages
Look for a printed TOC listing
Parse page numbers from TOC text
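Parsing a printed TOC usually comes down to matching "title, leader dots or spaces, page number" on each line. The sketch below assumes that layout; `TOC_LINE`, `parse_toc_text`, and the `page_offset` parameter are hypothetical names introduced for illustration.

```python
import re

# A title, then a run of at least two dots/spaces (the leader), then the page number.
TOC_LINE = re.compile(r"^(?P<title>.+?)[\s.]{2,}(?P<page>\d+)\s*$")


def parse_toc_text(text, page_offset=0):
    """Return [(title, page)] for lines that look like TOC entries.

    `page_offset` is added to every printed page number to account for
    unnumbered or separately numbered front matter.
    """
    entries = []
    for line in text.splitlines():
        match = TOC_LINE.match(line.strip())
        if match:
            entries.append((match["title"].strip(), int(match["page"]) + page_offset))
    return entries
```

Printed page numbers often differ from PDF page positions, so the offset should be confirmed against at least one known chapter start before the result is used as boundaries.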

Duplicate Pattern Matches

Filter results to keep only actual chapter starts:

Chapter headings typically appear at page top
Ignore references to chapters in body text (e.g., "see Chapter 3")
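One way to apply both rules is to test only the first few non-empty lines of each page and to require the heading at the start of a line. The sketch below is an assumption-laden illustration: `chapter_starts` and `top_lines` are hypothetical, and `page_texts` is taken to be a list of per-page strings, for instance from pypdf's `page.extract_text()`.

```python
import re

# Heading must begin the line: "Chapter 3" or "Chapter IV", not "see Chapter 3".
HEADING = re.compile(r"^Chapter\s+(\d+|[IVXLC]+)\b")


def chapter_starts(page_texts, top_lines=3):
    """Return [(1-based page, heading line)] for pages that open with a heading."""
    starts = []
    for number, text in enumerate(page_texts, start=1):
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        for line in lines[:top_lines]:
            if HEADING.match(line):
                starts.append((number, line))
                break  # at most one chapter start per page
    return starts
```

Allowing a few lines rather than only the first leaves room for running headers above the chapter title.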

Output Structure

```
output_dir/
├── 00_Front_Matter.pdf
├── 01_Chapter_Name.pdf
├── 02_Chapter_Name.pdf
├── ...
└── Appendix.pdf
```

Naming convention: {index:02d}_{sanitized_name}.pdf
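The naming convention can be reproduced with a small helper. The exact sanitization rule is not stated in this document, so the one below (collapse every run of characters other than ASCII letters and digits into a single underscore) is an assumption, and `chapter_filename` is a hypothetical name.

```python
import re


def chapter_filename(index, name):
    """Build '{index:02d}_{sanitized_name}.pdf' from a chapter title."""
    # Assumed rule: runs of anything but A-Z, a-z, 0-9 become one underscore.
    sanitized = re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_")
    return f"{index:02d}_{sanitized}.pdf"
```

Under this rule non-ASCII letters are also replaced, so titles in other scripts would need a more permissive character class.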

Integration Notes

For NotebookLM Upload

Split PDFs are suitable for NotebookLM sources:

Each chapter as separate source enables targeted queries
Recommended: Keep files under 500KB when possible
Large chapters may need further splitting

For RAG Systems

Chapter-level splitting provides natural semantic boundaries for:

Document chunking
Retrieval granularity
Citation accuracy

Scripts Reference

| Script | Purpose |
| --- | --- |
| `scripts/extract_toc.py` | Analyze PDF, extract bookmarks and detect chapter patterns |
| `scripts/split_by_chapters.py` | Execute split with provided chapter definitions |

Additional Resources

references/pypdf-guide.md - pypdf API quick reference for custom operations
