| name | web-to-markdown |
| description | Batch-process web pages via headless Playwright browser, extract HTML, convert to markdown using Turndown, and save to timestamped scratchpad file. Use when user asks to "capture these pages as markdown", "save web content", "fetch and convert webpages", or needs clean markdown from HTML. All URLs from one prompt → single file at docs/web-captures/<timestamp>.md. |
Web to Markdown
Overview
Captures web pages using headless Playwright browser automation (handles JavaScript-rendered content), converts HTML to clean markdown via Turndown library, and saves all URLs from a single request into one timestamped file.
Key features:
- Headless browser (no visible window)
- Handles JavaScript-rendered content (SPAs, React, Vue, etc.)
- Batch processing multiple URLs → single file
- Self-contained (no MCP dependency)
Prerequisites
- Node.js/pnpm for running TypeScript scripts
- Dependencies installed:
cd skills/web-to-markdown && pnpm install - Playwright browsers will auto-install on first run
Usage
When user provides URLs and asks to:
- "capture these pages as markdown"
- "save web content for documentation"
- "fetch and convert these webpages"
- "get markdown from these sites"
- "download and convert to markdown"
Output: docs/web-captures/YYYYMMDD_HHMMSS.md containing all pages.
Workflow
Single Command
cd skills/web-to-markdown
pnpm tsx scripts/scrape-and-convert.ts <url1> [url2] [url3] ...
That's it! Script handles:
- Creates timestamped output file with header
- Launches headless Playwright browser
- Scrapes each URL sequentially
- Converts HTML → markdown (Turndown)
- Appends to output file with formatted headers
- Closes browser and reports summary
Examples
Single URL:
cd skills/web-to-markdown
pnpm tsx scripts/scrape-and-convert.ts https://example.com/docs
# Output: docs/web-captures/20251103_143052.md
Multiple URLs (Batch):
cd skills/web-to-markdown
pnpm tsx scripts/scrape-and-convert.ts \
https://example.com/guide \
https://example.com/api \
https://example.com/faq
# Output: docs/web-captures/20251103_143052.md (all 3 pages)
From project root:
pnpm --filter @skills/web-to-markdown tsx scripts/scrape-and-convert.ts <urls...>
Output Format
# Web Captures - YYYY-MM-DD HH:MM:SS
Generated: YYYYMMDD_HHMMSS
URLs: N
---
## 📄 https://example.com/page1
[Converted markdown content...]
---
## 📄 https://example.com/page2
[Converted markdown content...]
---
Implementation Notes
TypeScript + FP Patterns:
- Pure functions (no classes except custom errors)
- Explicit error handling with typed errors:
BrowserError,FileError,HtmlConversionError - Small, focused functions
- Side effects isolated at edges
- CLI-style logging with chalk/ora
File Structure:
skills/web-to-markdown/
├── SKILL.md # This file (workflow instructions)
├── package.json # pnpm workspace config
├── tsconfig.json # TypeScript config
├── scripts/
│ ├── scrape-and-convert.ts # Main CLI (Playwright + Turndown)
│ ├── html-to-markdown.ts # Pure conversion function (Turndown wrapper)
│ └── convert-and-append.ts # Legacy CLI (deprecated, kept for reference)
└── tests/
└── html-to-markdown.test.ts # Unit tests
Error Handling
- Browser launch failure: Check Playwright installation, run
pnpm exec playwright install chromium - Page not found (404): Logs error, continues with other URLs
- Timeout (>30s): Reports slow page, continues to next URL
- Navigation error: Logs error, continues to next URL
- Conversion failure: Reports malformed HTML, skips page
Configuration
Default timeout: 30 seconds per page
To customize, edit DEFAULT_CONFIG in scripts/scrape-and-convert.ts:
const DEFAULT_CONFIG = {
outputDir: 'docs/web-captures',
timeout: 30000, // milliseconds
};
Performance
- Headless mode (no GUI overhead)
- Sequential processing (one URL at a time for stability)
- Browser reuse across URLs (faster than launching per-page)
- ~2-5 seconds per page (depends on site complexity)
Comparison with Alternatives
| Feature | web-to-markdown | scratchpad-fetch | Jina AI Reader |
|---|---|---|---|
| Transport | Playwright (headless) | curl (HTTP) | Cloud API |
| JavaScript | ✅ Full rendering | ❌ No | ✅ Server-side |
| Conversion | ✅ Turndown | ❌ Raw HTML | ✅ LLM-powered |
| Self-hosted | ✅ Yes | ✅ Yes | ❌ Cloud only |
| Setup | pnpm install | None | API key |
| Speed | Medium (2-5s/page) | Fast (<1s) | Fast (~2s) |
| Visible browser | ❌ No (headless) | N/A | N/A |
Troubleshooting
"Executable doesn't exist" error:
cd skills/web-to-markdown
pnpm exec playwright install chromium
Pages timing out:
- Increase timeout in
DEFAULT_CONFIG - Check network connectivity
- Some sites may block automated browsers (use Jina AI Reader alternative)
Empty markdown output:
- Site may use heavy client-side rendering
- Try waiting longer (increase timeout)
- Check if site blocks headless browsers (User-Agent detection)
Notes
- One request = one file (all URLs aggregated)
- Handles JavaScript-rendered content (React, Vue, Angular, etc.)
- Headless by default (no visible browser window)
- Browser auto-installs on first run
- Ideal for documentation scraping and archival
- No external API dependencies