documentation-scraper

@ratacat/claude-skills

Use when needing to scrape documentation websites into markdown for AI context. Triggers on "scrape docs", "download documentation", "get docs for [library]", or creating local copies of online documentation. CRITICAL - always analyze sitemap first before scraping.

Install Skill

1. Download the skill ZIP.
2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section.
3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file.

Note: Verify the skill by reading through its instructions before using it.

SKILL.md

---
name: documentation-scraper
description: Use when needing to scrape documentation websites into markdown for AI context. Triggers on "scrape docs", "download documentation", "get docs for [library]", or creating local copies of online documentation. CRITICAL - always analyze sitemap first before scraping.
---

Documentation Scraper with slurp-ai

Overview

slurp-ai scrapes documentation websites and compiles them into a single markdown file optimized for AI agent context. It is lightweight, fast, and deterministic: it does NOT use AI to scrape; it produces output FOR AI consumption.

CRITICAL: Run Outside Sandbox

All commands in this skill MUST be run outside the sandbox. Use `dangerouslyDisableSandbox: true` for all Bash commands, including:

  • `which slurp` (installation check)
  • `node analyze-sitemap.js` (sitemap analysis)
  • `slurp` (scraping)
  • File inspection commands (`wc`, `head`, `cat`, etc.)

The sandbox blocks network access and file operations required for web scraping.

Pre-Flight: Check Installation

Before scraping, verify slurp-ai is installed:

```bash
which slurp || echo "NOT INSTALLED"
```

If not installed, ask the user to run:

```bash
npm install -g slurp-ai
```

Requires: Node.js v20+

Do NOT proceed with scraping until slurp-ai is confirmed installed.
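
A minimal pre-flight sketch combining the install check with the Node.js version requirement; the version-parsing logic is illustrative, not part of slurp-ai:

```bash
#!/usr/bin/env bash
# Pre-flight: slurp-ai installed and Node.js v20+ available
if ! command -v slurp >/dev/null 2>&1; then
  echo "slurp-ai NOT INSTALLED - ask the user to run: npm install -g slurp-ai"
  exit 1
fi
node_major=$(node -v | sed 's/^v//' | cut -d. -f1)   # e.g. v20.11.1 -> 20
if [ "${node_major:-0}" -lt 20 ]; then
  echo "Node.js v20+ required, found $(node -v)"
  exit 1
fi
echo "OK: $(command -v slurp) with Node $(node -v)"
```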

Commands

| Command | Purpose |
|---|---|
| `slurp <url>` | Fetch and compile in one step |
| `slurp fetch <url> [version]` | Download docs to partials only |
| `slurp compile` | Compile partials into a single file |
| `slurp read <package> [version]` | Read local documentation |

Output: Creates slurp_compiled/compiled_docs.md from partials in slurp_partials/.
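
A sketch of the two-step flow from the table above, useful when you want to inspect or prune partials before compiling (the URL is illustrative; per the table, fetch writes partials only):

```bash
# Step 1: download pages into slurp_partials/ without compiling
slurp fetch https://docs.example.com/docs/

# ...inspect slurp_partials/ and delete anything unwanted...

# Step 2: compile the remaining partials into one file
slurp compile
# -> slurp_compiled/compiled_docs.md
```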

CRITICAL: Analyze Sitemap First

Before running slurp, ALWAYS analyze the sitemap. This reveals the complete site structure and informs your --base-path and --max decisions.

Step 1: Run Sitemap Analysis

Use the included analyze-sitemap.js script:

```bash
node analyze-sitemap.js https://docs.example.com
```

This outputs:

  • Total page count (informs --max)
  • URLs grouped by section (informs --base-path)
  • Suggested slurp commands with appropriate flags
  • Sample URLs to understand naming patterns
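
If analyze-sitemap.js is not available, a rough manual pass can recover similar numbers, assuming the site publishes a standard sitemap.xml with one <loc> entry per line:

```bash
# Total URL count in the sitemap
curl -s https://docs.example.com/sitemap.xml \
  | sed -n 's:.*<loc>\(.*\)</loc>.*:\1:p' | wc -l

# URLs grouped by top-level section (candidate --base-path values)
curl -s https://docs.example.com/sitemap.xml \
  | sed -n 's:.*<loc>\(.*\)</loc>.*:\1:p' \
  | cut -d/ -f1-4 | sort | uniq -c | sort -rn
```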

Step 2: Interpret the Output

Example output:

```
📊 Total URLs in sitemap: 247

📁 URLs by top-level section:
   /docs                          182 pages
   /api                            45 pages
   /blog                           20 pages

🎯 Suggested --base-path options:
   https://docs.example.com/docs/guides/     (67 pages)
   https://docs.example.com/docs/reference/  (52 pages)
   https://docs.example.com/api/             (45 pages)

💡 Recommended slurp commands:

   # Just "/docs/guides" section (67 pages)
   slurp https://docs.example.com/docs/guides/ --base-path https://docs.example.com/docs/guides/ --max 80
```

Step 3: Choose Scope Based on Analysis

| Sitemap shows | Action |
|---|---|
| Under 50 pages total | Scrape the entire site: `slurp <url> --max 60` |
| 50-200 pages | Scope to the relevant section with `--base-path` |
| 200+ pages | Must scope down: pick a specific subsection |
| No sitemap found | Start with `--max 30`, inspect the partials, adjust (see the probe sketch below) |
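
For the no-sitemap row, a small probe run shows what the crawler actually finds before you commit to a larger budget (URL and limits are illustrative):

```bash
# Probe crawl with a small page budget
slurp https://docs.example.com/docs/ --max 30

# Inspect what came back, then adjust --base-path and --max
ls slurp_partials/ | wc -l
ls slurp_partials/ | head -20
```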

Step 4: Frame the Slurp Command

With sitemap data, you can now set accurate parameters:

```bash
# From sitemap: /docs/api has 45 pages
slurp https://docs.example.com/docs/api/intro \
  --base-path https://docs.example.com/docs/api/ \
  --max 55
```

Key insight: The starting URL is where crawling begins; the base path filters which links get followed. They can differ, which is useful when the base-path URL itself returns a 404.

Common Scraping Patterns

Library Documentation (versioned)

```bash
# Express.js 4.x docs
slurp https://expressjs.com/en/4x/api.html --base-path https://expressjs.com/en/4x/

# React docs (latest)
slurp https://react.dev/learn --base-path https://react.dev/learn
```

API Reference Only

```bash
slurp https://docs.example.com/api/introduction --base-path https://docs.example.com/api/
```

Full Documentation Site

```bash
slurp https://docs.example.com/
```

CLI Options

| Flag | Default | Purpose |
|---|---|---|
| `--max <n>` | 20 | Maximum pages to scrape |
| `--concurrency <n>` | 5 | Parallel page requests |
| `--headless <bool>` | true | Use a headless browser |
| `--base-path <url>` | start URL | Filter followed links to this prefix |
| `--output <dir>` | ./slurp_partials | Output directory for partials |
| `--retry-count <n>` | 3 | Retries for failed requests |
| `--retry-delay <ms>` | 1000 | Delay between retries (ms) |
| `--yes` | - | Skip confirmation prompts |
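
A fuller invocation exercising several of these flags together; the values are illustrative, not recommendations:

```bash
# Polite crawl of a large section: lower concurrency, patient retries
slurp https://docs.example.com/docs/ \
  --base-path https://docs.example.com/docs/ \
  --max 100 \
  --concurrency 3 \
  --retry-count 5 \
  --retry-delay 2000 \
  --yes
```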

Compile Options

| Flag | Default | Purpose |
|---|---|---|
| `--input <dir>` | ./slurp_partials | Input directory |
| `--output <file>` | ./slurp_compiled/compiled_docs.md | Output file |
| `--preserve-metadata` | true | Keep metadata blocks |
| `--remove-navigation` | true | Strip nav elements |
| `--remove-duplicates` | true | Eliminate duplicates |
| `--exclude <json>` | - | JSON array of regex patterns to exclude |
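
A compile sketch using --exclude; the regex patterns are illustrative assumptions about what you might want to strip:

```bash
# Recompile partials, dropping changelog and blog pages
slurp compile \
  --input ./slurp_partials \
  --output ./slurp_compiled/compiled_docs.md \
  --exclude '["changelog", "/blog/"]'
```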

When to Disable Headless Mode

Use `--headless false` for:

  • Static HTML documentation sites
  • Faster scraping when JS rendering is not needed

The default is headless (true), which works for most modern doc sites, including SPAs.
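
For a known static site, this is a one-flag change (URL illustrative):

```bash
# Static HTML docs: plain fetches, no browser rendering
slurp https://docs.example.com/ --headless false
```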

Output Structure

```
slurp_partials/              # Intermediate files
  ├── page1.md
  └── page2.md
slurp_compiled/              # Final output
  └── compiled_docs.md       # Compiled result
```

Quick Reference

```bash
# 1. ALWAYS analyze sitemap first
node analyze-sitemap.js https://docs.example.com

# 2. Scrape with informed parameters (from sitemap analysis)
slurp https://docs.example.com/docs/ --base-path https://docs.example.com/docs/ --max 80

# 3. Skip prompts for automation
slurp https://docs.example.com/ --yes

# 4. Check output
head -100 slurp_compiled/compiled_docs.md
```

Common Issues

| Problem | Cause | Solution |
|---|---|---|
| Wrong `--max` value | Guessing the page count | Run `analyze-sitemap.js` first |
| Too few pages scraped | `--max` limit (default 20) | Set `--max` based on sitemap analysis |
| Missing content | JS not rendering | Ensure `--headless true` (the default) |
| Crawl stuck or slow | Rate limiting | Reduce `--concurrency` to 3 |
| Duplicate sections | Similar content | Use `--remove-duplicates` (the default) |
| Wrong pages included | Base path too broad | Use the sitemap to find the correct `--base-path` |
| Prompts blocking automation | Interactive mode | Add the `--yes` flag |
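
A quick check for the "Too few pages scraped" row: count what actually landed in slurp_partials/ and compare it with the sitemap total from the analysis step:

```bash
# Pages actually scraped - compare with the sitemap's page count
ls slurp_partials/*.md 2>/dev/null | wc -l
```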

Post-Scrape Usage

The output markdown is designed for AI context injection:

```bash
# Check file size (context budget)
wc -c slurp_compiled/compiled_docs.md

# Preview structure
grep "^#" slurp_compiled/compiled_docs.md | head -30

# Use with Claude Code - reference in prompt or via @file
```
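
A rough context-budget estimate, using the common ~4-characters-per-token heuristic (an approximation, not a tokenizer):

```bash
# Approximate token count: bytes / 4
bytes=$(wc -c < slurp_compiled/compiled_docs.md)
echo "~$((bytes / 4)) tokens"
```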

When NOT to Use

  • API specs in OpenAPI/Swagger: Use dedicated parsers instead
  • GitHub READMEs: Fetch directly via raw.githubusercontent.com
  • npm package docs: Often better to read the source + README
  • Frequently updated docs: Consider a caching strategy