| name | web-scraper |
| description | This skill should be used when users need to scrape content from websites, extract text from web pages, crawl and follow links, or download documentation from online sources. It features concurrent URL processing, automatic deduplication, content filtering, domain restrictions, and proper directory hierarchy based on URL structure. Use for documentation gathering, content extraction, web archival, or research data collection. |
Web Scraper
Overview
Recursively scrape web pages with concurrent processing, extracting clean text content while following links. The scraper automatically handles URL deduplication, creates proper directory hierarchies based on URL structure, filters out unwanted content, and respects domain boundaries.
When to Use This Skill
Use this skill when users request:
- Scraping content from websites
- Downloading documentation from online sources
- Extracting text from web pages at scale
- Crawling websites to gather information
- Archiving web content locally
- Following and downloading linked pages
- Research data collection from web sources
- Building text datasets from websites
Prerequisites
Install required dependencies:
pip install aiohttp beautifulsoup4 lxml aiofiles
These libraries provide:
- aiohttp - Async HTTP client for concurrent requests
- beautifulsoup4 - HTML parsing and content extraction
- lxml - Fast HTML/XML parser
- aiofiles - Async file I/O
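For orientation, here is a minimal, hand-written sketch (not taken from scripts/scrape.py) of how these four libraries typically work together: aiohttp fetches a page, BeautifulSoup with the lxml backend extracts text, and aiofiles writes the result without blocking the event loop.

```python
import asyncio

import aiofiles
import aiohttp
from bs4 import BeautifulSoup


async def fetch_and_save(url: str, out_path: str) -> None:
    # Fetch the page asynchronously.
    async with aiohttp.ClientSession() as session:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
            html = await resp.text()
    # Parse with the fast lxml backend and pull out the visible text.
    soup = BeautifulSoup(html, "lxml")
    text = soup.get_text(separator="\n", strip=True)
    # Write the result without blocking the event loop.
    async with aiofiles.open(out_path, "w", encoding="utf-8") as f:
        await f.write(text)


asyncio.run(fetch_and_save("https://example.com", "example.txt"))
```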
Core Capabilities
1. Basic Single-Page Scraping
Scrape a single page without following links:
python scripts/scrape.py <URL> <output-directory> --depth 0
Example:
python scripts/scrape.py https://example.com/article output/ --depth 0
This downloads only the specified page, extracts clean text content, and saves it to output/example.com/article.txt.
2. Recursive Scraping with Link Following
Scrape a page and follow links up to a specified depth:
python scripts/scrape.py <URL> <output-directory> --depth <N>
Example:
python scripts/scrape.py https://docs.example.com output/ --depth 2
Depth levels:
- --depth 0 - Only the start URL(s)
- --depth 1 - Start URLs + all links on those pages
- --depth 2 - Start URLs + links + links found on those linked pages
- --depth 3+ - Continue following links to the specified depth
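Conceptually, depth limiting is a breadth-first crawl in which every queued URL carries the depth at which it was discovered. The sketch below is illustrative only (the get_links callback is hypothetical, and the actual logic in scripts/scrape.py may differ), but it shows the idea, including deduplication of already-seen URLs.

```python
from collections import deque


def crawl_order(start_urls, max_depth, get_links):
    """Yield URLs in breadth-first order, never deeper than max_depth."""
    queue = deque((url, 0) for url in start_urls)
    seen = set(start_urls)
    while queue:
        url, depth = queue.popleft()
        yield url
        if depth < max_depth:
            for link in get_links(url):   # get_links: hypothetical link extractor
                if link not in seen:      # deduplicate before queueing
                    seen.add(link)
                    queue.append((link, depth + 1))
```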
3. Limiting the Number of Pages
Prevent excessive scraping by setting a maximum page limit:
python scripts/scrape.py <URL> <output-directory> --depth 3 --max-pages 100
Example:
python scripts/scrape.py https://docs.example.com output/ --depth 3 --max-pages 50
Useful for:
- Testing scraper configuration before full run
- Limiting resource usage
- Sampling content from large sites
- Staying within rate limits
4. Concurrent Processing
Control the number of simultaneous requests for faster scraping:
python scripts/scrape.py <URL> <output-directory> --concurrent <N>
Example:
python scripts/scrape.py https://docs.example.com output/ --depth 2 --concurrent 20
Default is 10 concurrent requests. Increase for faster scraping, decrease for more conservative resource usage.
Guidelines:
- Small sites or slow servers: --concurrent 5
- Medium sites: --concurrent 10 (default)
- Large, fast sites: --concurrent 20-30
- Be respectful of server resources
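Under the hood, a concurrency cap like --concurrent is usually enforced with an asyncio.Semaphore. The sketch below shows that standard pattern; it is an assumption about the approach, not code copied from scripts/scrape.py.

```python
import asyncio

import aiohttp


async def fetch(session: aiohttp.ClientSession, url: str, sem: asyncio.Semaphore) -> str:
    async with sem:                        # at most N requests in flight at once
        async with session.get(url) as resp:
            return await resp.text()


async def fetch_all(urls, concurrent: int = 10):
    sem = asyncio.Semaphore(concurrent)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url, sem) for url in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)
```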
5. Domain Restrictions
By default, the scraper only follows links on the same domain as the start URL. This can be controlled:
Same domain only (default):
python scripts/scrape.py https://example.com output/ --depth 2
Follow external links:
python scripts/scrape.py https://example.com output/ --depth 2 --follow-external
Specify allowed domains:
python scripts/scrape.py https://example.com output/ --depth 2 --allowed-domains example.com docs.example.com blog.example.com
Use --allowed-domains when:
- Documentation is split across multiple subdomains
- Content spans related domains
- You want to limit to specific trusted domains
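The domain check itself amounts to comparing each link's host against an allow-list. A plausible version looks like the following; the exact rules in scripts/scrape.py (for example around subdomains or ports) may differ.

```python
from urllib.parse import urlparse


def is_allowed(link: str, start_url: str, allowed_domains=None, follow_external=False) -> bool:
    """Return True if a link may be queued under the current domain rules."""
    if follow_external:
        return True
    allowed = {urlparse(start_url).netloc.lower()}
    allowed.update(d.lower() for d in (allowed_domains or []))
    return urlparse(link).netloc.lower() in allowed
```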
6. Multiple Start URLs
Scrape from multiple starting points simultaneously:
python scripts/scrape.py <URL1> <URL2> <URL3> <output-directory>
Example:
python scripts/scrape.py https://example.com/docs https://example.com/guides https://example.com/tutorials output/ --depth 2
All start URLs are processed with the same configuration (depth, domain restrictions, etc.).
7. Request Configuration
Customize HTTP request behavior:
python scripts/scrape.py <URL> <output-directory> --user-agent "MyBot/1.0" --timeout 60
Options:
- --user-agent - Custom User-Agent header (default: "Mozilla/5.0 (compatible; WebScraper/1.0)")
- --timeout - Request timeout in seconds (default: 30)
Example:
python scripts/scrape.py https://example.com output/ --depth 2 --user-agent "MyResearchBot/1.0 (+https://mysite.com/bot)" --timeout 45
8. Verbose Output
Enable detailed logging to monitor scraping progress:
python scripts/scrape.py <URL> <output-directory> --verbose
Verbose mode shows:
- Each URL being fetched
- Successful saves with file paths
- Errors and timeouts
- Detailed error information
Output Structure
Directory Hierarchy
The scraper creates a directory hierarchy that mirrors the URL structure:
output/
├── example.com/
│ ├── index.txt # https://example.com/
│ ├── about.txt # https://example.com/about
│ ├── docs/
│ │ ├── index.txt # https://example.com/docs/
│ │ ├── getting-started.txt
│ │ └── api/
│ │ └── reference.txt
│ └── blog/
│ ├── post-1.txt
│ └── post-2.txt
├── docs.example.com/
│ └── guide.txt
└── _metadata.json
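The mapping from URL to file path follows directly from the tree above: the host becomes the top-level directory, path segments become subdirectories, and a URL ending in a slash is saved as index.txt. The helper below reproduces that convention as an illustration; the exact sanitization rules in scripts/scrape.py (e.g. for query strings or unsafe characters) are not shown.

```python
from pathlib import Path
from urllib.parse import urlparse


def url_to_path(url: str, output_dir: str) -> Path:
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    if not parts or parsed.path.endswith("/"):
        name = "index.txt"               # directory-style URLs map to index.txt
    else:
        name = parts.pop() + ".txt"      # last path segment becomes the file name
    return Path(output_dir, parsed.netloc, *parts, name)


# url_to_path("https://example.com/docs/", "output") -> output/example.com/docs/index.txt
# url_to_path("https://example.com/about", "output") -> output/example.com/about.txt
```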
File Format
Each scraped page is saved as a text file with the following structure:
URL: https://example.com/docs/guide
Title: Getting Started Guide
Scraped: 2025-10-21T14:30:00
================================================================================
[Clean extracted text content]
Metadata File
_metadata.json contains scraping session information:
{
"start_time": "2025-10-21T14:30:00",
"end_time": "2025-10-21T14:35:30",
"pages_scraped": 42,
"total_visited": 45,
"errors": {
"https://example.com/broken": "HTTP 404",
"https://example.com/slow": "Timeout"
}
}
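A quick way to review a finished run is to load this file and print the error map:

```python
import json

with open("output/_metadata.json", encoding="utf-8") as f:
    meta = json.load(f)

print(f"Scraped {meta['pages_scraped']} of {meta['total_visited']} visited URLs")
for url, error in meta.get("errors", {}).items():
    print(f"  {error}: {url}")
```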
Content Extraction and Filtering
What Gets Extracted
The scraper extracts clean text content by:
- Focusing on main content - Prioritizes <main>, <article>, or <body> tags
- Removing unwanted elements - Strips out:
  - Scripts and styles
  - Navigation menus
  - Headers and footers
  - Sidebars (aside tags)
  - Iframes and embedded content
  - SVG graphics
  - Comments
- Filtering common patterns - Removes:
  - Cookie consent messages
  - Privacy policy links
  - Terms of service boilerplate
  - UI elements (arrows, single numbers)
  - Very short lines (likely navigation items)
- Preserving structure - Maintains line breaks between content blocks
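A stripped-down version of this strategy, written here only to make the steps concrete (the real extraction code in scripts/scrape.py is more thorough), might look like:

```python
from bs4 import BeautifulSoup, Comment


def extract_text(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    # 1. Focus on the main content container.
    root = soup.find("main") or soup.find("article") or soup.body or soup
    # 2. Remove unwanted elements.
    for tag in root.find_all(["script", "style", "nav", "header", "footer",
                              "aside", "iframe", "svg"]):
        tag.decompose()
    for comment in root.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    # 3. Collect text, preserving line breaks between blocks.
    lines = (line.strip() for line in root.get_text(separator="\n").splitlines())
    return "\n".join(line for line in lines if line)
```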
What Gets Filtered Out
Common unwanted patterns automatically removed:
- "Accept cookies" / "Reject all"
- "Cookie settings"
- "Privacy policy"
- "Terms of service"
- Navigation arrows (←, →, ↑, ↓)
- Isolated numbers
- Lines shorter than 3 characters
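The pattern filter can be pictured as a line-level pass like the following; the pattern set here is hypothetical and the list actually used by scripts/scrape.py may be broader.

```python
import re

# Hypothetical pattern set; the real list may be broader.
BOILERPLATE = re.compile(
    r"^(accept cookies|reject all|cookie settings|privacy policy|terms of service|\d+)$",
    re.IGNORECASE,
)


def keep_line(line: str) -> bool:
    line = line.strip()
    if len(line) < 3:                 # drops very short lines, lone arrows, etc.
        return False
    return not BOILERPLATE.match(line)
```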
Common Usage Patterns
Download Documentation Site
Scrape an entire documentation site with reasonable limits:
python scripts/scrape.py https://docs.example.com docs-archive/ --depth 3 --max-pages 200 --concurrent 15
Archive a Blog
Download all blog posts from a blog (following pagination):
python scripts/scrape.py https://blog.example.com blog-archive/ --depth 2 --max-pages 500
Research Data Collection
Gather text content from multiple related sources:
python scripts/scrape.py https://research.edu/papers https://research.edu/publications research-data/ --depth 2 --allowed-domains research.edu --concurrent 20
Sample a Large Site
Test configuration on a small sample before full scrape:
python scripts/scrape.py https://largeSite.com sample/ --depth 2 --max-pages 20 --verbose
Then run full scrape after confirming results:
python scripts/scrape.py https://largeSite.com full-archive/ --depth 3 --max-pages 500 --concurrent 15
Multi-Domain Knowledge Base
Scrape across multiple authorized domains:
python scripts/scrape.py https://main.example.com knowledge-base/ --depth 3 --allowed-domains main.example.com docs.example.com wiki.example.com --max-pages 300
Implementation Approach
When users request web scraping:
Identify the scope:
- What URLs to start from?
- Should links be followed? How deep?
- Any domain restrictions needed?
- Is there a reasonable page limit?
Configure the scraper:
- Set appropriate depth (typically 1-3)
- Set max-pages to avoid runaway scraping
- Choose concurrent level based on site size
- Determine domain restrictions
Run with monitoring:
- Start with verbose mode or small sample
- Monitor output for errors or unexpected content
- Adjust configuration if needed
Verify output:
- Check the output directory structure
- Review _metadata.json for statistics
- Sample a few text files for quality
- Check for errors in metadata
Process the content:
- Text files are ready for loading into context
- Use Read tool to examine specific files
- Use Grep to search across all scraped content
- Load files as needed for analysis
Quick Reference
Command structure:
python scripts/scrape.py <URL> [URL2 ...] <output-dir> [options]
Essential options:
- -d, --depth N - Maximum link depth (default: 2)
- -m, --max-pages N - Maximum pages to scrape
- -c, --concurrent N - Concurrent requests (default: 10)
- -f, --follow-external - Follow external links
- -a, --allowed-domains - Specify allowed domains
- -v, --verbose - Detailed output
- -u, --user-agent - Custom User-Agent
- -t, --timeout - Request timeout in seconds
Get full help:
python scripts/scrape.py --help
Best Practices
- Start small - Test with --depth 1 --max-pages 10 before large scrapes
- Respect servers - Use reasonable concurrency and timeouts
- Set limits - Always use --max-pages for initial runs
- Check robots.txt - Manually verify the site allows scraping
- Use verbose mode - Monitor for errors and unexpected behavior
- Identify yourself - Use a descriptive User-Agent with contact info
- Monitor output - Check _metadata.json for errors and statistics
- Handle errors gracefully - Review the error log in metadata for problematic URLs
Troubleshooting
Common issues:
- "Missing required dependency": Run
pip install aiohttp beautifulsoup4 lxml aiofiles - Too many timeouts: Increase
--timeoutor reduce--concurrent - Scraping too slow: Increase
--concurrent(e.g., 20-30) - Memory issues with large scrapes: Reduce
--concurrentor use--max-pagesto chunk the work - Following too many links: Reduce
--depthor enable same-domain-only (default) - Missing content: Some sites may require JavaScript; this scraper only handles static HTML
- HTTP errors: Check
_metadata.jsonerrors section for specific issues
Limitations:
- Does not execute JavaScript (single-page apps may not work)
- Does not handle authentication or login
- Does not follow links in JavaScript or dynamically loaded content
- No built-in rate limiting (use --concurrent to control request rate; a throttling sketch follows below)
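If a target server needs gentler treatment than --concurrent alone provides, a small wrapper like the one below can add a fixed delay per worker. This is an external workaround sketched here as an assumption, not a feature of scripts/scrape.py.

```python
import asyncio

import aiohttp


async def polite_fetch(session: aiohttp.ClientSession, url: str,
                       sem: asyncio.Semaphore, delay: float = 1.0) -> str:
    async with sem:
        async with session.get(url) as resp:
            body = await resp.text()
        await asyncio.sleep(delay)    # crude per-worker rate limit
        return body
```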
Advanced Use Cases
Loading Scraped Content
After scraping, use the Read tool to load content into context:
# Read a specific scraped page
Read file_path: output/docs.example.com/guide.txt
# Search across all scraped content
Grep pattern: "API endpoint" path: output/ -r
Selective Re-scraping
The scraper tracks visited URLs in memory during a session but doesn't persist this between runs. To avoid re-downloading:
- Run initial scrape with limits
- Check output directory for what was downloaded
- Run additional scrapes with different start URLs or configurations
Combining with Other Tools
Chain the scraper with other processing:
# Scrape then process with custom script
python scripts/scrape.py https://example.com output/ --depth 2
python your_analysis_script.py output/
Resources
scripts/scrape.py
The main web scraping tool implementing concurrent crawling, content extraction, and intelligent filtering. Key features:
- Async/concurrent processing - Uses asyncio and aiohttp for high-performance concurrent requests
- URL normalization - Removes fragments and trailing slashes for proper deduplication (illustrated below)
- Visited tracking - Maintains visited_urls and queued_urls sets to prevent re-downloading
- Smart content extraction - Removes scripts, styles, navigation, and common unwanted patterns
- Directory hierarchy - Converts URLs to safe filesystem paths maintaining structure
- Error handling - Tracks and reports errors in the metadata file
- Metadata generation - Creates _metadata.json with scraping statistics and errors
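The normalization rule mentioned above is simple enough to show directly. This illustration drops the fragment and any trailing slash so that equivalent URLs dedupe to the same key; the exact behavior in scripts/scrape.py may differ slightly.

```python
from urllib.parse import urldefrag


def normalize(url: str) -> str:
    url, _fragment = urldefrag(url)   # drop "#section" fragments
    return url.rstrip("/")            # drop trailing slashes


# normalize("https://example.com/docs/#intro") == normalize("https://example.com/docs")
```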
The script can be executed directly and includes comprehensive command-line help via --help.