Claude Code Plugins

Community-maintained marketplace

Efficient Web Scraping

@ClementWalter/rookie-marketplace

This skill should be used when agents need to extract data from websites, APIs, or web pages. Use when scraping content, fetching structured data, or processing web responses. Key principle: always use programmatic extraction with uv scripts instead of reading thousands of lines into context.

Install Skill

1. Download skill
2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by going through its instructions before using it.

SKILL.md


name: Efficient Web Scraping
description: This skill should be used when agents need to extract data from websites, APIs, or web pages. Use when scraping content, fetching structured data, or processing web responses. Key principle: always use programmatic extraction with uv scripts instead of reading thousands of lines into context.
version: 1.0.0

Efficient Web Scraping for Agents

Extract data from websites efficiently using uv scripts. Never bloat context by reading raw HTML/JSON into the conversation.

Core Principle

Programmatic extraction > Reading thousands of lines

| Approach | Tokens | Quality | Speed |
|---|---|---|---|
| Read raw HTML into context | 50,000+ | Poor | Slow |
| uv script + structured output | ~200 | Excellent | Fast |

When This Applies

  • Scraping website content (articles, profiles, feeds)
  • Fetching API responses
  • Extracting data from web services
  • Processing RSS/JSON feeds
  • Crawling multiple pages

The Pattern

1. Identify Target Data

Before writing code, specify exactly what you need:

**Target**: LinkedIn profile data
**Fields needed**: name, headline, recent posts (last 5)
**Output format**: JSON

2. Write a uv Script

Create a single-file Python script using uv inline metadata:

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = ["requests", "beautifulsoup4", "lxml"]
# ///
"""
Extract specific data from [target].
Output: Structured JSON to stdout.
"""
import json
import sys
import requests
from bs4 import BeautifulSoup

def extract_data(url: str) -> dict:
    """Extract target fields from URL."""
    response = requests.get(url, headers={
        "User-Agent": "Mozilla/5.0 (compatible; research)"
    })
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "lxml")

    # Extract ONLY the fields you need
    return {
        "title": soup.find("h1").get_text(strip=True) if soup.find("h1") else None,
        "description": soup.find("meta", {"name": "description"})["content"]
                       if soup.find("meta", {"name": "description"}) else None,
        # Add only needed fields
    }

if __name__ == "__main__":
    url = sys.argv[1] if len(sys.argv) > 1 else None
    if not url:
        print("Usage: script.py <url>", file=sys.stderr)
        sys.exit(1)

    result = extract_data(url)
    print(json.dumps(result, indent=2, ensure_ascii=False))

3. Run and Use Output

uv run script.py "https://example.com/page" > output.json

Then read only the small JSON output (200 tokens) instead of the full page (50,000 tokens).
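If only one or two fields are needed downstream, the output can be narrowed even further before it enters the conversation, for example with jq (assumed to be available; any JSON tool works):

jq -r '.title' output.json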

Common Patterns

Pattern A: API with Authentication

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = ["requests"]
# ///
import json
import os
import sys
import requests

def fetch_from_api(endpoint: str) -> dict:
    token = os.environ.get("API_TOKEN")
    response = requests.get(
        endpoint,
        headers={"Authorization": f"Bearer {token}"}
    )
    response.raise_for_status()

    # Extract only needed fields from potentially large response
    data = response.json()
    return {
        "items": [
            {"id": item["id"], "title": item["title"]}
            for item in data.get("items", [])[:10]  # Limit
        ],
        "total": data.get("total_count", 0)
    }

if __name__ == "__main__":
    print(json.dumps(fetch_from_api(sys.argv[1]), indent=2))
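A usage sketch, assuming the script above is saved as fetch_api.py (the script name and endpoint are illustrative placeholders):

API_TOKEN="..." uv run fetch_api.py "https://api.example.com/v1/items" > items.json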

Pattern B: Multiple Pages

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = ["requests", "beautifulsoup4"]
# ///
import json
import sys
import requests
from bs4 import BeautifulSoup

def scrape_list_page(url: str) -> list[dict]:
    """Scrape paginated list, return structured data."""
    results = []
    page = 1

    while len(results) < 50:  # Hard limit
        response = requests.get(f"{url}?page={page}")
        if response.status_code != 200:
            break

        soup = BeautifulSoup(response.text, "html.parser")
        items = soup.select(".item-class")  # Adjust selector

        if not items:
            break

        for item in items:
            results.append({
                "title": item.select_one(".title").get_text(strip=True),
                "link": item.select_one("a")["href"],
            })

        page += 1

    return results[:50]  # Enforce the hard limit

if __name__ == "__main__":
    print(json.dumps(scrape_list_page(sys.argv[1]), indent=2))
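A usage sketch, assuming the script is saved as scrape_list.py (the script name and URL are illustrative placeholders):

uv run scrape_list.py "https://example.com/articles" > articles.json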

Pattern C: Browser-Required Sites (Playwright)

For JavaScript-rendered content:

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = ["playwright"]
# ///
import json
import sys
from playwright.sync_api import sync_playwright
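# Note: Playwright needs a browser binary in addition to the Python package;
# run `playwright install chromium` once in the environment before first use.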

def extract_dynamic_content(url: str) -> dict:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        page.wait_for_selector(".content-loaded")

        # Extract only what you need
        result = {
            "title": page.title(),
            "content": page.locator(".main-content").text_content()[:1000],
        }

        browser.close()
        return result

if __name__ == "__main__":
    print(json.dumps(extract_dynamic_content(sys.argv[1]), indent=2))

Best Practices

DO

  • Specify exact fields needed before writing script
  • Limit results (e.g., [:10], < 50 items)
  • Output structured JSON for easy parsing
  • Handle errors gracefully with try/except
  • Add rate limiting for multiple requests (see the sketch after these lists)
  • Use selectors (CSS/XPath) not regex for HTML

DON'T

  • Read full page HTML into context
  • Fetch data you won't use
  • Skip error handling
  • Make unlimited requests
  • Parse HTML with regex
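For the rate-limiting point in the DO list, a minimal sketch of a fixed delay between sequential requests (the script, function name, and one-second default are illustrative, not part of this skill's scripts):

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = ["requests"]
# ///
"""Fetch several URLs with a fixed delay between requests."""
import sys
import time
import requests

def fetch_politely(urls: list[str], delay_seconds: float = 1.0) -> list[int]:
    """Request each URL in turn, pausing between requests; return status codes."""
    statuses = []
    for url in urls:
        response = requests.get(url, timeout=30)
        statuses.append(response.status_code)
        time.sleep(delay_seconds)  # Rate limit: pause before the next request
    return statuses

if __name__ == "__main__":
    print(fetch_politely(sys.argv[1:]))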

Script Location

Reusable scraping scripts are in this skill's scripts directory:

chief-of-staff/skills/efficient-scraping/scripts/
└── web-extract.py    # Generic web content extractor

Using web-extract.py

# Basic metadata extraction
uv run scripts/web-extract.py "https://example.com"

# Extract specific elements
uv run scripts/web-extract.py "https://example.com" --selector ".article" --fields "text,href"

Output is always minimal JSON: no raw HTML in context.

Error Handling

Always wrap scraping in try/except and return structured errors:

try:
    result = extract_data(url)
    print(json.dumps(result, indent=2))
except requests.exceptions.HTTPError as e:
    print(json.dumps({"error": f"HTTP {e.response.status_code}", "url": url}))
    sys.exit(1)
except Exception as e:
    print(json.dumps({"error": str(e), "type": type(e).__name__}))
    sys.exit(1)
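With this wrapper, a failed run still produces compact, machine-readable output; for example, a 404 would print something like (illustrative):

{"error": "HTTP 404", "url": "https://example.com/missing"}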

Token Savings Example

| Scenario | Without Script | With Script |
|---|---|---|
| LinkedIn profile page | ~45,000 tokens | ~150 tokens |
| Twitter feed (20 tweets) | ~30,000 tokens | ~800 tokens |
| API response (100 items) | ~20,000 tokens | ~500 tokens |

10-100x token reduction = faster, cheaper, better context for reasoning.

Quick Reference

## Scraping Checklist

- [ ] Define exactly what fields are needed
- [ ] Write uv script with inline dependencies
- [ ] Extract ONLY needed fields
- [ ] Output structured JSON
- [ ] Run script, read JSON output (not raw HTML)
- [ ] Handle errors gracefully