Claude Code Plugins

Community-maintained marketplace

Complete toolkit for web crawling and data extraction using Crawl4AI. This skill should be used when users need to scrape websites, extract structured data, handle JavaScript-heavy pages, crawl multiple URLs, or build automated web data pipelines. Includes optimized extraction patterns with schema generation for efficient, LLM-free extraction.

Install Skill

1. Download skill
2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reviewing its instructions before using it.

SKILL.md

name: crawl4ai
description: Complete toolkit for web crawling and data extraction using Crawl4AI. This skill should be used when users need to scrape websites, extract structured data, handle JavaScript-heavy pages, crawl multiple URLs, or build automated web data pipelines. Includes optimized extraction patterns with schema generation for efficient, LLM-free extraction.
version: 0.7.4
crawl4ai_version: ">=0.7.4"
last_updated: 2025-01-19

Crawl4AI

Overview

This skill provides comprehensive support for web crawling and data extraction using the Crawl4AI library, including the complete SDK reference, ready-to-use scripts for common patterns, and optimized workflows for efficient data extraction.

Quick Start

Running with uvx (Recommended)

Use uvx to run crawl4ai without installing it globally. This automatically manages the virtual environment:

# Run inline Python with crawl4ai
uvx --from crawl4ai python -c "
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun('https://example.com')
        print(result.markdown)

asyncio.run(main())
"

Basic First Crawl

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:500])  # First 500 chars

asyncio.run(main())

Using Provided Scripts

# Simple markdown extraction (using uvx)
uvx --from crawl4ai python scripts/basic_crawler.py https://example.com

# BM25 query-based filtering (extract only relevant content)
uvx --from crawl4ai python scripts/basic_crawler.py https://docs.python.org --bm25-query "async await"

# Adaptive crawling with information foraging (intelligent stopping)
uvx --from crawl4ai python scripts/adaptive_crawler.py https://docs.example.com --query "authentication API"

# Batch processing
uvx --from crawl4ai python scripts/batch_crawler.py urls.txt

# Site-wide crawling (crawl entire site to markdown)
uvx --from crawl4ai python scripts/site_crawler.py https://docs.example.com

# Data extraction
uvx --from crawl4ai python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"

Core Crawling Fundamentals

1. Basic Crawling

Understanding the core components for any crawl:

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

# Browser configuration (controls browser behavior)
browser_config = BrowserConfig(
    headless=True,  # Run without GUI
    viewport_width=1920,
    viewport_height=1080,
    user_agent="custom-agent"  # Optional custom user agent
)

# Crawler configuration (controls crawl behavior)
crawler_config = CrawlerRunConfig(
    page_timeout=30000,  # 30 seconds timeout
    screenshot=True,  # Take screenshot
    remove_overlay_elements=True  # Remove popups/overlays
)

# Execute crawl with arun()
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(
        url="https://example.com",
        config=crawler_config
    )

    # CrawlResult contains everything
    print(f"Success: {result.success}")
    print(f"HTML length: {len(result.html)}")
    print(f"Markdown length: {len(result.markdown)}")
    print(f"Links found: {len(result.links)}")

2. Configuration Deep Dive

BrowserConfig - Controls the browser instance:

  • headless: Run with/without GUI
  • viewport_width/height: Browser dimensions
  • user_agent: Custom user agent string
  • cookies: Pre-set cookies
  • headers: Custom HTTP headers

CrawlerRunConfig - Controls each crawl (a combined example follows this list):

  • page_timeout: Maximum page load/JS execution time (ms)
  • wait_for: CSS selector or JS condition to wait for (optional)
  • cache_mode: Control caching behavior
  • js_code: Execute custom JavaScript
  • screenshot: Capture page screenshot
  • session_id: Persist session across crawls

3. Content Processing

Basic content operations available in every crawl:

result = await crawler.arun(url)

# Access extracted content
markdown = result.markdown  # Clean markdown
html = result.html  # Raw HTML
text = result.cleaned_html  # Cleaned HTML

# Media and links
images = result.media["images"]
videos = result.media["videos"]
internal_links = result.links["internal"]
external_links = result.links["external"]

# Metadata
title = result.metadata["title"]
description = result.metadata["description"]

Markdown Generation (Primary Use Case)

1. Basic Markdown Extraction

Crawl4AI excels at generating clean, well-formatted markdown:

# Simple markdown extraction
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://docs.example.com")

    # High-quality markdown ready for LLMs
    with open("documentation.md", "w") as f:
        f.write(result.markdown)

2. Fit Markdown (Content Filtering)

Use content filters to get only relevant content:

from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Option 1: Pruning filter (removes low-quality content)
pruning_filter = PruningContentFilter(threshold=0.4, threshold_type="fixed")

# Option 2: BM25 filter (relevance-based filtering)
bm25_filter = BM25ContentFilter(user_query="machine learning tutorials", bm25_threshold=1.0)

md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)

config = CrawlerRunConfig(markdown_generator=md_generator)

result = await crawler.arun(url, config=config)
# Access filtered content
print(result.markdown.fit_markdown)  # Filtered markdown
print(result.markdown.raw_markdown)  # Original markdown

3. Markdown Customization

Control markdown generation with options:

config = CrawlerRunConfig(
    # Exclude elements from markdown
    excluded_tags=["nav", "footer", "aside"],

    # Focus on specific CSS selector
    css_selector=".main-content",

    # Clean up formatting
    remove_forms=True,
    remove_overlay_elements=True,

    # Control link handling
    exclude_external_links=True,
    exclude_internal_links=False
)

# Custom markdown generation
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

generator = DefaultMarkdownGenerator(
    options={
        "ignore_links": False,
        "ignore_images": False,
        "image_alt_text": True
    }
)

Data Extraction

1. Schema-Based Extraction (Most Efficient)

For repetitive patterns, generate schema once and reuse:

# Step 1: Generate schema with LLM (one-time)
uvx --from crawl4ai python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"

# Step 2: Use schema for fast extraction (no LLM)
uvx --from crawl4ai python scripts/extraction_pipeline.py --use-schema https://shop.com generated_schema.json
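
The --generate-schema step can also be reproduced in Python. A rough sketch, assuming the JsonCssExtractionStrategy.generate_schema helper and LLMConfig available in recent crawl4ai releases (treat the exact signature as an assumption and confirm it in references/complete-sdk-reference.md):

import json
import os
from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# One-time, LLM-assisted schema generation from a saved sample of the target page
sample_html = open("product_page.html").read()  # hypothetical saved HTML sample
schema = JsonCssExtractionStrategy.generate_schema(
    html=sample_html,
    llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))
)

# Save the schema; later crawls reuse it with zero LLM calls
with open("generated_schema.json", "w") as f:
    json.dump(schema, f, indent=2)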

2. Manual CSS/JSON Extraction

When you know the structure:

schema = {
    "name": "articles",
    "baseSelector": "article.post",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "date", "selector": ".date", "type": "text"},
        {"name": "content", "selector": ".content", "type": "text"}
    ]
}

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

extraction_strategy = JsonCssExtractionStrategy(schema=schema)
config = CrawlerRunConfig(extraction_strategy=extraction_strategy)
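
The strategy plugs into a normal crawl, and the extracted records come back as a JSON string on result.extracted_content. A minimal usage sketch (reusing the config defined above; the URL is a placeholder):

import json
from crawl4ai import AsyncWebCrawler

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://blog.example.com", config=config)
    if result.success:
        articles = json.loads(result.extracted_content)  # list of dicts matching the schema fields
        for article in articles:
            print(article["title"], article["date"])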

3. LLM-Based Extraction

For complex or irregular content:

extraction_strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",
    instruction="Extract key financial metrics and quarterly trends"
)
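
Recent crawl4ai releases wrap the provider settings in an LLMConfig object instead of passing provider directly. A hedged sketch of that form (parameter names as documented for recent versions; verify against the SDK reference):

import os
from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

extraction_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",
        api_token=os.getenv("OPENAI_API_KEY")  # any supported provider/key pair
    ),
    instruction="Extract key financial metrics and quarterly trends",
    extraction_type="block"  # or "schema" together with a JSON/Pydantic schema
)
config = CrawlerRunConfig(extraction_strategy=extraction_strategy)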

Advanced Patterns

1. Adaptive Crawling (Information Foraging)

Intelligent crawling that knows when to stop based on information sufficiency. Perfect for research tasks.

# Basic adaptive crawl with query
uvx --from crawl4ai python scripts/adaptive_crawler.py https://docs.python.org --query "async context managers"

# Higher confidence for exhaustive coverage
uvx --from crawl4ai python scripts/adaptive_crawler.py https://api.example.com/docs --query "authentication" --confidence 0.85

# Export knowledge base for AI consumption
uvx --from crawl4ai python scripts/adaptive_crawler.py https://scikit-learn.org --query "model evaluation" --max-pages 30 --export-kb

When to use:

  • Research tasks requiring comprehensive coverage
  • Question answering with sufficient context
  • Knowledge base building for AI/ML
  • API documentation discovery

How it works (see the Python sketch after this list):

  • Uses three metrics: Coverage, Consistency, Saturation
  • Automatically stops when sufficient information gathered
  • Prioritizes links based on relevance to query
  • Saves most relevant pages ranked by score

2. Site-Wide Crawling

Crawl an entire website and convert all pages to markdown:

# Crawl entire docs site with visible browser (for debugging)
uvx --from crawl4ai python scripts/site_crawler.py https://docs.example.com

# Headless mode with custom options (default: 250 max pages)
uvx --from crawl4ai python scripts/site_crawler.py https://docs.example.com \
    --headless --max-pages 100 --delay 1.0 -o ./output_dir

# Crawl entire domain (not just URLs under start path)
uvx --from crawl4ai python scripts/site_crawler.py https://example.com --no-stay-within-path

# Resume a previous crawl (continues from site_index.json)
uvx --from crawl4ai python scripts/site_crawler.py https://docs.example.com \
    -o ./output_dir --resume

The site crawler:

  • Discovers all internal links automatically (BFS traversal)
  • Converts each page to clean markdown
  • Generates site_index.json with crawl statistics and URL metadata
  • Supports rate limiting, max page limits, and path filtering
  • Supports resuming from previous crawl via --resume flag

3. Deep Crawling (Manual)

For custom link discovery and crawling logic:

# Basic link discovery
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url)

    # Extract and process discovered links
    internal_links = result.links.get("internal", [])
    external_links = result.links.get("external", [])

    # Crawl discovered internal links (each link is a dict with an "href" key)
    for link in internal_links:
        href = link.get("href", "")
        if "/blog/" in href and "/tag/" not in href:  # Filter by URL pattern
            sub_result = await crawler.arun(href)
            # Process sub-page

    # For advanced deep crawling, consider using URL seeding patterns
    # or custom crawl strategies (see complete-sdk-reference.md)
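
Crawl4AI also ships built-in deep-crawl strategies that automate this loop. A sketch assuming the BFSDeepCrawlStrategy interface from crawl4ai.deep_crawling (filters, scorers, and streaming options are covered in the SDK reference):

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=2,            # start page plus two levels of links
        include_external=False  # stay on the same domain
    )
)

async with AsyncWebCrawler() as crawler:
    results = await crawler.arun("https://docs.example.com", config=config)
    for result in results:
        print(result.url, result.metadata.get("depth"))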

4. Batch & Multi-URL Processing

Efficiently crawl multiple URLs:

urls = ["https://site1.com", "https://site2.com", "https://site3.com"]

async with AsyncWebCrawler() as crawler:
    # Concurrent crawling with arun_many()
    results = await crawler.arun_many(
        urls=urls,
        config=crawler_config,
        max_concurrent=5  # Control concurrency
    )

    for result in results:
        if result.success:
            print(f"✅ {result.url}: {len(result.markdown)} chars")

5. Session & Authentication

Handle login-required content:

# First crawl - establish session and login
login_config = CrawlerRunConfig(
    session_id="user_session",
    js_code="""
    document.querySelector('#username').value = 'myuser';
    document.querySelector('#password').value = 'mypass';
    document.querySelector('#submit').click();
    """,
    wait_for="css:.dashboard"  # Wait for post-login element
)

await crawler.arun("https://site.com/login", config=login_config)

# Subsequent crawls - reuse session
config = CrawlerRunConfig(session_id="user_session")
await crawler.arun("https://site.com/protected-content", config=config)

6. Dynamic Content Handling

For JavaScript-heavy sites:

config = CrawlerRunConfig(
    # Wait for dynamic content
    wait_for="css:.ajax-content",

    # Execute JavaScript
    js_code="""
    // Scroll to load content
    window.scrollTo(0, document.body.scrollHeight);

    // Click load more button
    document.querySelector('.load-more')?.click();
    """,

    # Note: For virtual scrolling (Twitter/Instagram-style),
    # use virtual_scroll_config parameter (see docs)

    # Extended timeout for slow loading
    page_timeout=60000
)
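
For feeds that recycle DOM nodes while scrolling (the virtual scrolling mentioned in the comment above), plain js_code scrolling is not enough. A hedged sketch, assuming the VirtualScrollConfig interface introduced in crawl4ai 0.7 (parameter names taken from its docs; verify in the SDK reference):

from crawl4ai import CrawlerRunConfig, VirtualScrollConfig

config = CrawlerRunConfig(
    virtual_scroll_config=VirtualScrollConfig(
        container_selector="#feed",    # assumed selector for the scrollable container
        scroll_count=20,               # number of scroll steps
        scroll_by="container_height",  # step size per scroll
        wait_after_scroll=0.5          # seconds to wait for new items to render
    )
)
result = await crawler.arun("https://social.example.com/feed", config=config)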

7. Anti-Detection & Proxies

Avoid bot detection:

# Proxy configuration
browser_config = BrowserConfig(
    headless=True,
    proxy_config={
        "server": "http://proxy.server:8080",
        "username": "user",
        "password": "pass"
    }
)

# For stealth/undetected browsing, consider:
# - Rotating user agents via user_agent parameter
# - Using different viewport sizes
# - Adding delays between requests

# Rate limiting
import asyncio
for url in urls:
    result = await crawler.arun(url)
    await asyncio.sleep(2)  # Delay between requests

Common Use Cases

Documentation to Markdown

# Crawl entire documentation site to markdown files
uvx --from crawl4ai python scripts/site_crawler.py https://docs.example.com -o ./docs_output

# Results in:
# docs_output/
# ├── site_index.json     # Crawl stats and URL metadata
# └── pages/
#     ├── 001_docs.md
#     ├── 002_docs_getting-started.md
#     └── ...

For single page extraction:

# Convert single documentation page to clean markdown
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://docs.example.com")

    # Save as markdown for LLM consumption
    with open("docs.md", "w") as f:
        f.write(result.markdown)

E-commerce Product Monitoring

# Generate schema once for product pages
# Then monitor prices/availability without LLM costs
import json
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

with open("product_schema.json") as f:
    schema = json.load(f)
products = await crawler.arun_many(
    product_urls,
    config=CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema)))

News Aggregation

# Crawl multiple news sources concurrently
news_urls = ["https://news1.com", "https://news2.com", "https://news3.com"]
results = await crawler.arun_many(news_urls, max_concurrent=5)

# Extract articles with Fit Markdown (requires a content filter on the
# markdown generator; see "Fit Markdown (Content Filtering)" above)
for result in results:
    if result.success:
        # Get only relevant content
        article = result.markdown.fit_markdown

Research & Data Collection

# Academic paper collection with focused extraction
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

md_generator = DefaultMarkdownGenerator(
    content_filter=BM25ContentFilter(user_query="machine learning transformers")
)
config = CrawlerRunConfig(markdown_generator=md_generator)

result = await crawler.arun(paper_url, config=config)
focused_text = result.markdown.fit_markdown  # only query-relevant content

Resources

scripts/

  • adaptive_crawler.py - Intelligent crawling with information foraging (knows when to stop)
  • site_crawler.py - Crawl entire sites and convert all pages to markdown
  • extraction_pipeline.py - Three extraction approaches with schema generation
  • basic_crawler.py - Simple markdown extraction with BM25/Pruning filtering
  • batch_crawler.py - Multi-URL concurrent processing

references/

  • complete-sdk-reference.md - Complete SDK documentation (23K words) with all parameters, methods, and advanced features

Example Code Repository

The Crawl4AI repository includes extensive examples in docs/examples/:

Core Examples

  • quickstart.py - Comprehensive starter with all basic patterns:
    • Simple crawling, JavaScript execution, CSS selectors
    • Content filtering, link analysis, media handling
    • LLM extraction, CSS extraction, dynamic content
    • Browser comparison, SSL certificates

Specialized Examples

  • amazon_product_extraction_*.py - Three approaches for e-commerce scraping
  • extraction_strategies_examples.py - All extraction strategies demonstrated
  • deepcrawl_example.py - Advanced deep crawling patterns
  • crypto_analysis_example.py - Complex data extraction with analysis
  • parallel_execution_example.py - High-performance concurrent crawling
  • session_management_example.py - Authentication and session handling
  • markdown_generation_example.py - Advanced markdown customization
  • hooks_example.py - Custom hooks for crawl lifecycle events
  • proxy_rotation_example.py - Proxy management and rotation
  • router_example.py - Request routing and URL patterns

Advanced Patterns

  • adaptive_crawling/ - Intelligent crawling strategies
  • c4a_script/ - C4A script examples
  • docker_*.py - Docker deployment patterns

To explore examples:

# Run any example directly with uvx:
uvx --from crawl4ai python docs/examples/quickstart.py

# The examples live in the Crawl4AI source repository (https://github.com/unclecode/crawl4ai)
# under the docs/examples/ directory; clone the repository to run them locally

# Start with quickstart.py for comprehensive patterns
# It includes: simple crawl, JS execution, CSS selectors,
# content filtering, LLM extraction, dynamic pages, and more

# For specific use cases:
# - E-commerce: amazon_product_extraction_*.py
# - High performance: parallel_execution_example.py
# - Authentication: session_management_example.py
# - Deep crawling: deepcrawl_example.py

Best Practices

  1. Start with basic crawling - Understand BrowserConfig, CrawlerRunConfig, and arun() before moving to advanced features
  2. Use markdown generation for documentation and content - Crawl4AI excels at clean markdown extraction
  3. Try schema generation first for structured data - 10-100x more efficient than LLM extraction
  4. Enable caching during development - cache_mode=CacheMode.ENABLED to avoid repeated requests (see the sketch after this list)
  5. Set appropriate timeouts - 30s for normal sites, 60s+ for JavaScript-heavy sites
  6. Respect rate limits - Use delays and max_concurrent parameter
  7. Reuse sessions for authenticated content instead of logging in again on every crawl
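
A small sketch of practice 4, using the CacheMode enum exported by crawl4ai:

from crawl4ai import CrawlerRunConfig, CacheMode

dev_config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)  # reuse cached pages while iterating
prod_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)  # always fetch fresh content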

Troubleshooting

JavaScript not loading:

config = CrawlerRunConfig(
    wait_for="css:.dynamic-content",  # Wait for specific element
    page_timeout=60000  # Increase timeout
)

Bot detection issues:

import asyncio
import random

browser_config = BrowserConfig(
    headless=False,  # Sometimes a visible browser helps
    viewport_width=1920,
    viewport_height=1080,
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)

# Add random delays between requests
await asyncio.sleep(random.uniform(2, 5))

Content extraction problems:

# Debug what's being extracted
result = await crawler.arun(url)
print(f"HTML length: {len(result.html)}")
print(f"Markdown length: {len(result.markdown)}")
print(f"Links found: {len(result.links)}")

# Try different wait strategies
config = CrawlerRunConfig(
    wait_for="js:document.querySelector('.content') !== null"
)

Session/auth issues:

# Verify session is maintained
config = CrawlerRunConfig(session_id="test_session")
result = await crawler.arun(url, config=config)
print(f"Session ID: {result.session_id}")
print(f"Cookies: {result.cookies}")

For more details on any topic, refer to references/complete-sdk-reference.md which contains comprehensive documentation of all features, parameters, and advanced usage patterns.