| name | pdf-split |
| description | PDF chapter splitting |
PDF Chapter Splitting
Split PDF documents into individual chapter files based on table of contents or text pattern detection.
Overview
This skill handles PDF splitting when a book or document needs to be divided by chapters and either:
- The PDF has embedded bookmarks/outlines, or
- Chapter boundaries can be detected from text patterns (e.g., "Chapter 1:", "Part One")
Prerequisites
Declare pypdf as a uv inline script dependency at the top of each script, so that uv run installs it automatically:
# /// script
# dependencies = ["pypdf"]
# ///
Workflow
Phase 1: Analyze PDF Structure
Run scripts/extract_toc.py to analyze the PDF:
uv run ~/.claude/skills/pdf-split/scripts/extract_toc.py <pdf_path>
Output includes:
- Total page count
- Embedded bookmarks/outline (if present)
- Detected chapter patterns from text
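The bookmark part of this analysis can be sketched as follows. This is a minimal illustration, not the actual contents of scripts/extract_toc.py: the function names `flatten_outline` and `analyze` are hypothetical, and the only pypdf calls used are `PdfReader`, `reader.pages`, `reader.outline`, and `reader.get_destination_page_number`.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def flatten_outline(
    outline: Iterable,
    page_index_of: Callable[[object], Optional[int]],
    depth: int = 0,
) -> List[Tuple[int, int, str]]:
    """Flatten a nested outline into (depth, 1-based page, title) rows."""
    rows: List[Tuple[int, int, str]] = []
    for item in outline:
        if isinstance(item, list):
            # pypdf represents child bookmarks as a nested list.
            rows.extend(flatten_outline(item, page_index_of, depth + 1))
            continue
        index = page_index_of(item)  # 0-based page index, or None if unresolved
        if index is None:
            continue
        rows.append((depth, index + 1, str(item.title)))
    return rows


def analyze(pdf_path: str) -> None:
    """Print the page count and any embedded bookmarks (hypothetical helper)."""
    from pypdf import PdfReader  # imported lazily; requires pypdf

    reader = PdfReader(pdf_path)
    print(f"Total pages: {len(reader.pages)}")
    rows = flatten_outline(reader.outline, reader.get_destination_page_number)
    if not rows:
        print("No embedded bookmarks found")
    for depth, page, title in rows:
        print(f"{'  ' * depth}p.{page}: {title}")
```

Passing the page lookup in as a callable keeps `flatten_outline` independent of pypdf, so it can be exercised with plain objects that have a `title` attribute.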
Phase 2: Define Chapter Boundaries
Based on Phase 1 output, define chapter boundaries as a list of tuples:
chapters = [
(start_page, end_page, "chapter_name"),
# ...
]
If bookmarks exist: Use bookmark page numbers directly.
If no bookmarks:
- Search for chapter heading patterns in text
- Verify boundaries by checking page content
- Present proposed boundaries for user confirmation
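Turning a list of chapter start pages into the tuple format above might look like the sketch below. It assumes page numbers are 1-based with inclusive end pages, as the Phase 3 example suggests; `boundaries_from_starts` is a hypothetical helper, not part of the skill's scripts.

```python
from typing import List, Tuple


def boundaries_from_starts(
    starts: List[Tuple[int, str]], total_pages: int
) -> List[Tuple[int, int, str]]:
    """Convert (start_page, name) pairs into (start_page, end_page, name) tuples.

    Assumption: pages are 1-based and each chapter runs up to the page
    before the next chapter starts; the last chapter runs to the final page.
    """
    ordered = sorted(starts)
    chapters = []
    for i, (start, name) in enumerate(ordered):
        end = ordered[i + 1][0] - 1 if i + 1 < len(ordered) else total_pages
        if start < 1 or end < start:
            raise ValueError(f"invalid range for {name!r}: {start}-{end}")
        chapters.append((start, end, name))
    return chapters


if __name__ == "__main__":
    print(boundaries_from_starts([(1, "00_Intro"), (23, "01_Chapter1")], total_pages=45))
```

Two entries that start on the same page (for example a "Part" bookmark and its first chapter) raise `ValueError` here, which is a prompt to drop one of them before splitting.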
Phase 3: Execute Split
Run scripts/split_by_chapters.py with the chapter definitions:
uv run ~/.claude/skills/pdf-split/scripts/split_by_chapters.py <pdf_path> <output_dir> --chapters '<json_chapters>'
Example:
uv run ~/.claude/skills/pdf-split/scripts/split_by_chapters.py \
~/book.pdf \
~/book_chapters \
--chapters '[[1,22,"00_Intro"],[23,45,"01_Chapter1"]]'
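For orientation, the core of such a split can be sketched with pypdf as shown below. This is not the actual scripts/split_by_chapters.py: `parse_chapters` and `split_pdf` are hypothetical names, and the 1-based inclusive page convention is an assumption inferred from the example above.

```python
import json
from pathlib import Path
from typing import List, Tuple


def parse_chapters(raw: str) -> List[Tuple[int, int, str]]:
    """Parse a --chapters JSON string into validated (start, end, name) tuples."""
    chapters = []
    for entry in json.loads(raw):
        start, end, name = entry
        if not (isinstance(start, int) and isinstance(end, int) and 1 <= start <= end):
            raise ValueError(f"invalid page range in {entry!r}")
        chapters.append((start, end, str(name)))
    return chapters


def split_pdf(pdf_path: str, output_dir: str, chapters: List[Tuple[int, int, str]]) -> None:
    """Write one PDF per chapter (hypothetical helper)."""
    from pypdf import PdfReader, PdfWriter  # imported lazily; requires pypdf

    reader = PdfReader(Path(pdf_path).expanduser())
    out = Path(output_dir).expanduser()
    out.mkdir(parents=True, exist_ok=True)
    for start, end, name in chapters:
        if end > len(reader.pages):
            raise ValueError(f"{name}: end page {end} exceeds {len(reader.pages)} pages")
        writer = PdfWriter()
        # Assumed convention: 1-based inclusive pages; pypdf indexes from 0.
        for index in range(start - 1, end):
            writer.add_page(reader.pages[index])
        with open(out / f"{name}.pdf", "wb") as fh:
            writer.write(fh)
```

Validating the JSON before opening the PDF means a malformed range fails early, before any output files are written.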
Common Chapter Patterns
| Pattern | Regex | Example |
|---|---|---|
| Numbered | Chapter\s+\d+ | "Chapter 1", "Chapter 12" |
| Part + Chapter | Part\s+\w+.*Chapter | "Part One: Chapter 1" |
| Section | Section\s+\d+ | "Section 1.1" |
| Roman numerals | Chapter\s+[IVXLC]+ | "Chapter IV" |
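As a quick way to try these patterns against a line of text, the table can be compiled with Python's `re` module. The dictionary and the `matching_patterns` helper are illustrative names only.

```python
import re

# The regexes from the table above, keyed by pattern name.
CHAPTER_PATTERNS = {
    "Numbered": re.compile(r"Chapter\s+\d+"),
    "Part + Chapter": re.compile(r"Part\s+\w+.*Chapter"),
    "Section": re.compile(r"Section\s+\d+"),
    "Roman numerals": re.compile(r"Chapter\s+[IVXLC]+"),
}


def matching_patterns(line: str) -> list:
    """Return the names of all patterns found anywhere in the line."""
    return [name for name, pattern in CHAPTER_PATTERNS.items() if pattern.search(line)]


if __name__ == "__main__":
    for sample in ["Chapter 12", "Part One: Chapter 1", "Section 1.1", "Chapter IV"]:
        print(sample, "->", matching_patterns(sample))
```

Note that the Roman numeral pattern as written also matches a word that merely begins with one of those letters (for example "Chapter Closing"); appending `\b` to the pattern is one way to tighten it.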
Edge Cases
Large Chapter Detection (100+ pages)
When a detected chapter exceeds 100 pages, verify the boundary:
- Check if appendix content is included
- Look for sub-sections that should be separate files
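A small check for this case, assuming the `(start_page, end_page, "chapter_name")` tuples from Phase 2 with inclusive end pages, could look like this (`oversized_chapters` is a hypothetical helper):

```python
def oversized_chapters(chapters, limit=100):
    """Return (name, page_count) for chapters longer than `limit` pages."""
    flagged = []
    for start, end, name in chapters:
        page_count = end - start + 1  # inclusive range
        if page_count > limit:
            flagged.append((name, page_count))
    return flagged


if __name__ == "__main__":
    print(oversized_chapters([(1, 22, "00_Intro"), (23, 150, "01_Chapter1")]))
```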
Missing TOC
When no bookmarks or clear patterns exist:
- Extract first 20 pages of text
- Look for manual TOC listing
- Parse page numbers from TOC text
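These steps might be sketched as below. The TOC line format (a title, then a run of dots or spaces, then a page number) is an assumption about how the printed TOC is laid out, and both helper names are hypothetical; `page.extract_text()` is the pypdf call used for text extraction.

```python
import re

# Assumed TOC layout: "<title> ..... <page>" or "<title>    <page>".
TOC_LINE = re.compile(r"^(?P<title>.+?)[\s.]{2,}(?P<page>\d+)\s*$")


def parse_toc_text(text: str) -> list:
    """Return (title, printed_page) pairs for lines that look like TOC entries."""
    entries = []
    for line in text.splitlines():
        match = TOC_LINE.match(line.strip())
        if match:
            entries.append((match.group("title").strip(), int(match.group("page"))))
    return entries


def first_pages_text(pdf_path: str, limit: int = 20) -> str:
    """Concatenate the text of the first `limit` pages (hypothetical helper)."""
    from pypdf import PdfReader  # imported lazily; requires pypdf

    reader = PdfReader(pdf_path)
    count = min(limit, len(reader.pages))
    return "\n".join(reader.pages[i].extract_text() or "" for i in range(count))
```

Page numbers printed in a TOC often differ from PDF page positions by a front-matter offset, so the parsed numbers should be checked against actual page content before being used as boundaries.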
Duplicate Pattern Matches
Filter results to keep only actual chapter starts:
- Chapter headings typically appear at page top
- Ignore references to chapters in body text (e.g., "see Chapter 3")
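One way to apply both rules is to test only the first few non-empty lines of each page and to anchor the pattern at the start of the line. The sketch below assumes page text has already been extracted into a list of strings; the `top_lines=3` cutoff and the helper name are illustrative choices, not values taken from the skill's scripts.

```python
import re

# Anchored at line start so "see Chapter 3" in running text does not match.
HEADING = re.compile(r"^Chapter\s+(\d+|[IVXLC]+)\b")


def chapter_start_pages(page_texts, top_lines=3):
    """Return (1-based page number, heading line) for pages that open with a heading."""
    starts = []
    for page_number, text in enumerate(page_texts, start=1):
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        for line in lines[:top_lines]:
            if HEADING.match(line):
                starts.append((page_number, line))
                break
    return starts


if __name__ == "__main__":
    pages = [
        "Chapter 1\nIt began on a Tuesday.",
        "More text, see Chapter 3 for details.",
        "Running header\nChapter 2: Next Steps\nBody text.",
    ]
    print(chapter_start_pages(pages))
```

This remains a heuristic: a body paragraph that happens to begin a page with "Chapter 3 explained..." would still be reported, so the results are best treated as candidates for confirmation.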
Output Structure
output_dir/
├── 00_Front_Matter.pdf
├── 01_Chapter_Name.pdf
├── 02_Chapter_Name.pdf
├── ...
└── Appendix.pdf
Naming convention: {index:02d}_{sanitized_name}.pdf
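The naming convention can be illustrated with a short helper. The sanitization rule shown here (collapse every run of non-alphanumeric ASCII characters into a single underscore) is an assumption; the source does not specify how names are sanitized.

```python
import re


def chapter_filename(index: int, name: str) -> str:
    """Build '{index:02d}_{sanitized_name}.pdf' from a chapter title."""
    # Assumed rule: runs of characters outside A-Z, a-z, 0-9 become one underscore.
    sanitized = re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_")
    return f"{index:02d}_{sanitized or 'untitled'}.pdf"


if __name__ == "__main__":
    print(chapter_filename(0, "Front Matter"))
    print(chapter_filename(12, "What's Next?"))
```

Because this rule is ASCII-only, titles in other scripts would collapse to "untitled"; a Unicode-aware rule would be needed for such documents.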
Integration Notes
For NotebookLM Upload
Split PDFs are suitable for NotebookLM sources:
- Each chapter as separate source enables targeted queries
- Recommended: Keep files under 500KB when possible
- Large chapters may need further splitting
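To find chapters that may need further splitting, the output directory can be scanned for files above the recommended size. This sketch treats 500KB as 500,000 bytes, which is an assumption, and `files_over_limit` is a hypothetical helper.

```python
from pathlib import Path


def files_over_limit(output_dir: str, limit_bytes: int = 500 * 1000) -> list:
    """Return sorted (filename, size_in_bytes) for PDFs larger than the limit."""
    oversized = []
    for path in Path(output_dir).expanduser().glob("*.pdf"):
        size = path.stat().st_size
        if size > limit_bytes:
            oversized.append((path.name, size))
    return sorted(oversized)
```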
For RAG Systems
Chapter-level splitting provides natural semantic boundaries for:
- Document chunking
- Retrieval granularity
- Citation accuracy
Scripts Reference
| Script | Purpose |
|---|---|
| scripts/extract_toc.py | Analyze PDF, extract bookmarks and detect chapter patterns |
| scripts/split_by_chapters.py | Execute split with provided chapter definitions |
Additional Resources
- references/pypdf-guide.md - pypdf API quick reference for custom operations