Claude Code Plugins

Community-maintained marketplace


Generic web scraper for extracting and organizing Snowflake documentation with intelligent caching and configurable spider depth. Install globally with uv for easy access, or use uvx for development. Scrapes any section of docs.snowflake.com, selected via --base-path.

Install Skill

1. Download skill
2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reviewing its instructions before using it.

SKILL.md

name: doc-scraper
description: Generic web scraper for extracting and organizing Snowflake documentation with intelligent caching and configurable spider depth. Install globally with uv for easy access, or use uvx for development. Scrapes any section of docs.snowflake.com, selected via --base-path.

Snowflake Documentation Scraper

Generic scraper for any section of docs.snowflake.com. Converts pages to Markdown with metadata and generates an indexed SKILL.md.

Features: SQLite caching (7-day expiration) • Multi-threaded (4 workers) • Configurable spider depth • Base path filtering

Install: uv tool install --editable . • Run: doc-scraper --output-dir=/path/to/output

Quick Start

One-time setup (installs globally):

cd .claude/skills/doc-scraper
uv tool install --editable .

Then run from anywhere:

doc-scraper --output-dir=/path/to/output

For development (picks up code/config changes without reinstall):

cd .claude/skills/doc-scraper
uvx --from . doc-scraper --output-dir=../../snowflake-docs

⚠️ Safety Guidelines for AI Agents

NEVER delete documentation or cache unless explicitly requested by the user with clear confirmation:

DO NOT:

  • Delete entire output directories (e.g., snowflake-docs/)
  • Delete documentation files (.md files in output directory)
  • Suggest clearing cache as a first step or routine operation
  • Use rm -rf commands without explicit user request and path verification
  • Delete cache to "fix" issues without diagnosing the actual problem

DO:

  • Run the scraper normally - cache handles itself automatically
  • Trust the 7-day cache expiration to keep content fresh
  • Only suggest cache clearing if user reports cache corruption errors
  • Ask user to verify paths before any deletion commands
  • Explain that cache clearing is rarely needed and may slow down subsequent runs

The scraper is designed to run without manual cache management. Cache expiration, updates, and maintenance are all automatic.

Command Options

| Option | Default | Description |
|--------|---------|-------------|
| --output-dir DIR | Required | Output directory for scraped docs (relative to CWD) |
| --base-path PATH | /en/migrations/ | URL filter controlling which section to scrape |
| --spider/--no-spider | --spider | Follow internal links |
| --spider-depth N | 1 | Depth: 0 = seeds only, 1 = + direct links, 2 = + 2nd-degree links |
| --limit N | None | Cap the number of URLs (for testing) |
| --dry-run | - | Preview without writing files |
| -v, --verbose | - | Debug logging |
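
To make the --spider-depth and --base-path semantics concrete, here is a minimal sketch of depth-limited link following with base-path filtering. This is not the scraper's actual implementation; the function names (crawl, extract_links) and traversal details are assumptions, but the depth meaning matches the table above (0 = seeds only, 1 = seeds plus their direct links, and so on).

```python
# Hypothetical sketch of depth-limited spidering with base-path filtering.
# Not doc-scraper's real code; names and traversal details are assumed.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def extract_links(html: str, page_url: str) -> set[str]:
    """Collect absolute links to docs.snowflake.com found on a page."""
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for a in soup.find_all("a", href=True):
        url = urljoin(page_url, a["href"]).split("#")[0]
        if urlparse(url).netloc == "docs.snowflake.com":
            links.add(url)
    return links


def crawl(seeds: list[str], base_path: str = "/en/migrations/", spider_depth: int = 1) -> set[str]:
    """Breadth-first crawl: depth 0 = seeds only, 1 = + direct links, 2 = + 2nd degree."""
    seen: set[str] = set()
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, depth = queue.popleft()
        if url in seen or not urlparse(url).path.startswith(base_path):
            continue  # base-path filter: skip anything outside the selected section
        seen.add(url)
        if depth >= spider_depth:
            continue  # stop following links beyond the configured depth
        resp = requests.get(url, timeout=30)
        for link in extract_links(resp.text, url):
            queue.append((link, depth + 1))
    return seen
```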

Common Tasks

Using the globally installed tool (recommended; run from anywhere):

# Initial scrape - migrations section (cache is built automatically)
doc-scraper --output-dir=/path/to/project/snowflake-docs

# Update existing (skips cached pages automatically - this is the normal workflow)
doc-scraper --output-dir=/path/to/project/snowflake-docs

# Scrape SQL reference instead of migrations
doc-scraper --output-dir=/path/to/project/snowflake-docs \
  --base-path="/en/sql-reference/"

# Scrape user guide section
doc-scraper --output-dir=/path/to/project/snowflake-docs \
  --base-path="/en/user-guide/"

# Deep spider (2 levels of link following)
doc-scraper --output-dir=/path/to/project/snowflake-docs \
  --spider-depth=2

# Test run (10 URLs max, preview only)
doc-scraper --output-dir=/path/to/project/snowflake-docs \
  --limit 10 --dry-run

For development (from the .claude/skills/doc-scraper/ directory):

# Use uvx to pick up code/config changes immediately
uvx --from . doc-scraper --output-dir=../../snowflake-docs
uvx --from . doc-scraper --output-dir=../../snowflake-docs --base-path="/en/sql-reference/"

Configuration

Auto-created on first run: {output-dir}/scraper_config.yaml

The file is created with these defaults:

rate_limiting:
  requests_per_second: 2        # Rate limit
  max_concurrent_threads: 4     # Parallel workers

spider:
  max_pages: 1000              # Stop after N pages
  max_queue_size: 500          # Queue limit
  allowed_paths: ["/en/"]      # URL filters

scraped_pages:
  expiration_days: 7           # Cache expiration

To customize: Edit {output-dir}/scraper_config.yaml directly. Changes take effect on next run.
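
As a rough illustration of how these settings could drive the scraper, the sketch below reads scraper_config.yaml with PyYAML and throttles requests with the ratelimit decorators from the listed tech stack. The loader and fetch function names are assumptions; doc-scraper's internals may differ.

```python
# Hypothetical sketch: load scraper_config.yaml and throttle requests accordingly.
# The real doc-scraper internals may differ; load_config/fetch are assumed names.
from pathlib import Path

import requests
import yaml  # PyYAML, assumed here for parsing the config file
from ratelimit import limits, sleep_and_retry


def load_config(output_dir: str) -> dict:
    """Read {output-dir}/scraper_config.yaml (the file the scraper auto-creates)."""
    with open(Path(output_dir) / "scraper_config.yaml") as f:
        return yaml.safe_load(f)


config = load_config("snowflake-docs")
rps = config["rate_limiting"]["requests_per_second"]  # default 2


@sleep_and_retry
@limits(calls=rps, period=1)  # at most `rps` calls per second; sleep when exceeded
def fetch(url: str) -> str:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.text
```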

Output Structure

snowflake-docs/
├── SKILL.md                    # Auto-generated index
└── en/migrations/              # Organized by URL path
    ├── page1.md               # With YAML frontmatter
    └── page2.md

Each markdown file includes:

---
title: "Page Title"
source_url: "https://docs.snowflake.com/..."
last_scraped: "2025-11-19T12:30:00+00:00"
---
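
For orientation, here is a minimal sketch of how a page could be turned into one of these files using libraries from the tech stack (requests, BeautifulSoup, markdownify). The selectors, path mapping, and helper name are assumptions, not the scraper's actual code.

```python
# Hypothetical sketch of producing a markdown file with YAML frontmatter.
# Selectors, path layout, and helper names are assumed, not taken from doc-scraper.
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md


def save_page(url: str, output_dir: str = "snowflake-docs") -> Path:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else url
    body_md = md(str(soup.body or soup))  # HTML -> Markdown

    # Mirror the URL path under the output directory,
    # e.g. /en/migrations/foo -> en/migrations/foo.md
    rel = urlparse(url).path.strip("/") or "index"
    out_path = Path(output_dir) / f"{rel}.md"
    out_path.parent.mkdir(parents=True, exist_ok=True)

    frontmatter = (
        "---\n"
        f'title: "{title}"\n'
        f'source_url: "{url}"\n'
        f'last_scraped: "{datetime.now(timezone.utc).isoformat()}"\n'
        "---\n\n"
    )
    out_path.write_text(frontmatter + body_md, encoding="utf-8")
    return out_path
```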

Cache Management

Cache location: {output-dir}/.cache/ (NOT the docs themselves!)

  • scraper.db - SQLite database with page metadata
  • sitemap_cache.json - Cached sitemap URLs

⚠️ IMPORTANT: The cache is designed to be persistent and should NOT be deleted during normal operation. The scraper automatically:

  • Uses cached pages that are less than 7 days old
  • Updates expired cache entries automatically
  • Skips re-scraping recent pages to save time

When cache clearing is needed (rare):

  • Cache corruption detected (scraper will show errors)
  • Testing cache behavior during development
  • Forcing a complete re-scrape of all pages

If you must clear the cache (verify output-dir path carefully):

# ONLY clear the .cache subdirectory, NEVER delete the docs themselves!
# Double-check the path before running!
rm -rf /path/to/output/.cache

# Example from project root (verify path first):
rm -rf snowflake-docs/.cache

Cache behavior: 7-day expiration, automatic pre-load, survives crashes, ~50KB per 1,000 pages.
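
To make the expiration behavior concrete, here is a minimal sketch of a 7-day freshness check against an SQLite table. The table name and schema are assumptions for illustration; the actual scraper.db layout may differ.

```python
# Hypothetical sketch of the 7-day cache check; the scraper.db schema is assumed
# (a `pages` table storing ISO-8601 UTC timestamps in `last_scraped`).
import sqlite3
from datetime import datetime, timedelta, timezone

EXPIRATION_DAYS = 7  # matches scraped_pages.expiration_days in scraper_config.yaml


def is_fresh(db_path: str, url: str) -> bool:
    """Return True if the page was scraped within the last EXPIRATION_DAYS."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT last_scraped FROM pages WHERE url = ?", (url,)
        ).fetchone()
    finally:
        conn.close()
    if row is None:
        return False  # never scraped -> fetch it
    last_scraped = datetime.fromisoformat(row[0])
    return datetime.now(timezone.utc) - last_scraped < timedelta(days=EXPIRATION_DAYS)
```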

Troubleshooting

| Issue | Solution |
|-------|----------|
| Too many pages scraped | Lower --spider-depth or add blocked patterns in the config |
| Missing pages | Increase --spider-depth or adjust allowed_paths |
| Slow performance | Check max_queue_size and max_pages in the config |
| Cache corruption errors | Verify {output-dir}/.cache/scraper.db exists and check permissions. Only delete it if corrupted. |
| ImportError (uv) | Use uvx --refresh or clear uv's own cache (NOT the project cache): rm -rf ~/.cache/uv/archive-v0/<hash> |
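
If you suspect cache corruption, you can inspect the database before deleting anything. This is a quick, non-destructive check using Python's built-in sqlite3 and a standard SQLite PRAGMA; nothing here is doc-scraper-specific.

```python
# Non-destructive integrity check on the cache database before considering deletion.
import sqlite3

conn = sqlite3.connect("snowflake-docs/.cache/scraper.db")  # adjust to your output dir
result = conn.execute("PRAGMA integrity_check").fetchone()[0]
conn.close()
print(result)  # "ok" means the cache is healthy; anything else indicates corruption
```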

Installation

Recommended: Global install with uv

cd .claude/skills/doc-scraper
uv tool install --editable .

Benefits:

  • ✅ Run from anywhere: doc-scraper --output-dir=/path/to/output
  • ✅ Simple command: no uvx or directory navigation
  • ✅ Editable mode: updates when you pull new code

For development: Use uvx

cd .claude/skills/doc-scraper
uvx --from . doc-scraper --output-dir=../../snowflake-docs

Benefits:

  • ✅ No installation needed
  • ✅ Picks up code changes immediately
  • ✅ Picks up config changes (scraper_config.yaml) immediately
  • ✅ Perfect for testing changes

Legacy: venv method

cd .claude/skills/doc-scraper/scripts
python -m venv ../venv
source ../venv/bin/activate
pip install -r ../requirements.txt
python doc_scraper.py --output-dir=../../snowflake-docs

⚠️ Path Tips:

  • Global install: Use absolute paths for --output-dir
  • uvx/venv: Can use relative paths from project directory

Technical Stack

Python 3.11+ • SQLite • BeautifulSoup • markdownify • requests • ratelimit • click


Version: 1.1.0 • Status: Production Ready ✅