---
name: web-research
description: Fetch and analyze web page content from URLs. Use when asked to read, summarize, or analyze web links, check websites, or research online content. Converts web pages to readable Markdown format.
---
# Web Research
This skill enables fetching and analyzing web page content from URLs provided in emails or conversations.
## When to Use
Use this skill when:
- User provides a URL and asks you to read, summarize, or analyze it
- Email contains links and user asks "what does this say?" or "summarize this"
- User asks you to "check this website" or "look at this page"
- User wants information from a specific web page
## How to Fetch Web Content

```bash
python -m src.web_fetcher <url>
```
This will:
- Download the web page content
- Convert HTML to clean Markdown
- Add a summary header describing the content
- Output to stdout (you can read it directly)
- Automatically clean up any temporary files
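For orientation, the steps above correspond roughly to the sketch below. This is a hypothetical reconstruction, not the actual `src.web_fetcher` source; it assumes the tool wraps `requests` plus MarkItDown (the library named in the output header and in Notes), and `fetch_as_markdown` is an illustrative name.

```python
# Hypothetical sketch of the fetch-and-convert flow; the real module's
# internals may differ.
import os
import tempfile

import requests
from markitdown import MarkItDown  # same library used for document processing

def fetch_as_markdown(url: str, timeout: float = 10.0) -> str:
    # Download the static HTML (no JavaScript is executed).
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    # Convert via a temporary file, then clean it up automatically.
    with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as f:
        f.write(resp.content)
        path = f.name
    try:
        return MarkItDown().convert(path).text_content
    finally:
        os.unlink(path)
```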
## Example Usage

```bash
# Fetch and read a web page
python -m src.web_fetcher https://example.com/article

# Skip the summary header (get raw content only)
python -m src.web_fetcher https://example.com/article --no-summary
```
## Output Format

The fetched content includes:

```markdown
---
source: https://example.com/article
converted: auto-generated markdown from HTML via MarkItDown
note: JavaScript-rendered content not included. Some dynamic content may be missing.
---

> **What's in this page:** Article from example.com: "Article Title", 5 sections, includes tables, 15 links, ~2000 words

---

[Full page content as markdown...]
```
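If you need to separate the header from the body programmatically, a minimal sketch follows. `split_fetched_output` is a hypothetical helper, and the exact `---`-delimited layout is assumed from the example above (output with `--no-summary` would differ).

```python
# Hypothetical helper; assumes the "---"-delimited layout shown above.
def split_fetched_output(text: str) -> tuple[dict, str, str]:
    # Layout: --- metadata --- summary blockquote --- body
    _, meta_block, summary, body = text.split("---\n", 3)
    meta = dict(
        line.split(": ", 1) for line in meta_block.strip().splitlines()
    )
    return meta, summary.strip(), body
```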
## Important Limitations

### JavaScript-Rendered Content
- Only fetches static HTML (no JavaScript execution)
- Modern single-page apps (SPAs) may show minimal content
- Dynamic content loaded via JavaScript will be missing
- If content seems incomplete, inform the user of this limitation
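A crude way to flag this case is a word-count check on the converted body. The heuristic and its threshold are assumptions for illustration, not something the tool provides.

```python
# Assumed heuristic, not part of src.web_fetcher: very short output from a
# page that should have substantial content suggests JavaScript rendering.
def looks_javascript_rendered(markdown_body: str, min_words: int = 50) -> bool:
    return len(markdown_body.split()) < min_words
```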
### Authentication & Paywalls
- Cannot access content behind login walls
- Cannot bypass paywalls or subscription requirements
- If you encounter these, inform the user the content is not accessible
### Rate Limiting & Blocking
- Some websites block automated requests
- If fetch fails, the site may be blocking bots
- Inform user and suggest they copy/paste the content if needed
## Security Features
The web fetcher includes built-in security:
- Only HTTP/HTTPS URLs allowed (no file://, ftp://, etc.)
- Blocks internal/private IP addresses (localhost, 192.168.x.x, 10.x.x.x)
- 10-second timeout to prevent hanging
- 5MB content size limit
- Maximum 5 redirects
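For illustration, the scheme and IP checks above might look like the following sketch. This is an assumption about how such validation is typically written; the actual code in `src.web_fetcher` may differ, and the timeout, size, and redirect limits are omitted here.

```python
# Hypothetical sketch of the URL validation described above.
import ipaddress
import socket
from urllib.parse import urlparse

def validate_url(url: str) -> None:
    parsed = urlparse(url)
    # Only HTTP/HTTPS schemes are allowed (no file://, ftp://, etc.).
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"URL validation failed: scheme {parsed.scheme!r} not allowed")
    host = parsed.hostname or ""
    # Resolve the hostname and reject private, loopback, and link-local ranges
    # (covers localhost, 192.168.x.x, 10.x.x.x, and similar).
    try:
        addr = ipaddress.ip_address(socket.gethostbyname(host))
    except socket.gaierror as exc:
        raise ValueError(f"URL validation failed: cannot resolve {host!r}") from exc
    if addr.is_private or addr.is_loopback or addr.is_link_local:
        raise ValueError("URL validation failed: Blocked hostname")
```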
## Workflow Example

When a user forwards an email with a link:

> **User:** "Can you summarize this article? https://techcrunch.com/2024/article"

**You:**
1. Run: `python -m src.web_fetcher https://techcrunch.com/2024/article` (a programmatic wrapper is sketched below)
2. Read the Markdown output
3. Provide a summary based on the content
4. If the content seems incomplete, note the JavaScript limitation
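If you invoke the tool from code rather than a shell, a hypothetical wrapper might look like this; the `fetch` helper is illustrative, not part of the tool.

```python
# Hypothetical wrapper around the CLI; normally you just run step 1 directly.
import subprocess

def fetch(url: str) -> str:
    result = subprocess.run(
        ["python", "-m", "src.web_fetcher", url],
        capture_output=True, text=True, timeout=30,
    )
    if result.returncode != 0:
        # Surface the tool's error text so it can be relayed to the user.
        raise RuntimeError(result.stderr.strip() or "fetch failed")
    return result.stdout
```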
## Error Handling
Common errors and how to handle them:
| Error | Meaning | What to Tell User |
|---|---|---|
| "URL validation failed: Blocked hostname" | Trying to access localhost/internal IP | Cannot access internal/private addresses for security |
| "Request timed out" | Site too slow or unreachable | The website didn't respond in time, may be down or slow |
| "Failed to fetch URL: 403" | Site blocking automated access | The website is blocking automated requests |
| "Failed to fetch URL: 404" | Page not found | The URL doesn't exist or has been moved |
| "Content too large" | Page exceeds 5MB | The page is too large to fetch |
## Best Practices

- Always inform the user if content seems incomplete or if JavaScript rendering is likely needed
- Provide the source URL in your response so the user can verify it
- Handle errors gracefully: if a fetch fails, ask the user if they can copy/paste the content
- Don't retry failed fetches without user confirmation
- Be transparent about limitations (no authentication, no JavaScript, etc.)
## Notes
- Temporary files are automatically cleaned up (no manual deletion needed)
- The tool uses the same MarkItDown library as document processing
- Content is fetched fresh each time (no caching)
- User-Agent header is set to avoid basic bot detection