| name | browser-content-capture |
| description | Capture content from JavaScript-rendered pages, login-protected sites, and multi-page documentation using Playwright MCP tools or Claude Chrome extension. Use when WebFetch fails on SPAs, dynamic content, or auth-required pages. |
| version | 1.1.0 |
| tags | browser, playwright, mcp, scraping, spa, authentication, chrome-extension, 2025 |
Browser Content Capture
Capture web content that traditional scrapers cannot access.
Overview
This skill enables content extraction from sources that require browser-level access:
- JavaScript-rendered SPAs (React, Vue, Angular apps)
- Login-protected documentation (private wikis, gated content)
- Dynamic content (infinite scroll, lazy loading, client-side routing)
- Multi-page site crawls (documentation trees, tutorial series)
When to Use This Skill
Use when:
WebFetchreturns empty or partial content- Page requires JavaScript execution to render
- Content is behind authentication
- Need to navigate multi-page structures
- Extracting from client-side routed apps
Do NOT use when:
- Static HTML pages (use
WebFetch- faster) - Public API endpoints (use direct HTTP calls)
- Simple RSS/Atom feeds
Quick Start
Check Available MCP Tools
MCPSearch: "select:mcp__playwright__browser_navigate"
Basic Capture Pattern
# 1. Navigate to URL
mcp__playwright__browser_navigate(url="https://docs.example.com")
# 2. Wait for content to render
mcp__playwright__browser_wait_for(selector=".main-content", timeout=5000)
# 3. Capture page snapshot
snapshot = mcp__playwright__browser_snapshot()
# 4. Extract text content
content = mcp__playwright__browser_evaluate(
script="document.querySelector('.main-content').innerText"
)
MCP Tools Reference
| Tool | Purpose | When to Use |
|---|---|---|
browser_navigate |
Go to URL | First step of any capture |
browser_snapshot |
Get DOM/accessibility tree | Understanding page structure |
browser_evaluate |
Run custom JS | Extract specific content |
browser_click |
Click elements | Navigate menus, pagination |
browser_fill_form |
Fill inputs | Authentication flows |
browser_wait_for |
Wait for selector | Dynamic content loading |
browser_take_screenshot |
Capture image | Visual verification |
browser_console_messages |
Read JS console | Debug extraction issues |
browser_network_requests |
Monitor XHR/fetch | Find API endpoints |
Full tool documentation: See references/mcp-tools.md
Capture Patterns
Pattern 1: SPA Content Extraction
For React/Vue/Angular apps where content renders client-side:
# Navigate and wait for hydration
mcp__playwright__browser_navigate(url="https://react-docs.example.com")
mcp__playwright__browser_wait_for(selector="[data-hydrated='true']", timeout=10000)
# Extract after React mounts
content = mcp__playwright__browser_evaluate(script="""
// Wait for React to finish rendering
await new Promise(r => setTimeout(r, 1000));
return document.querySelector('article').innerText;
""")
Details: See references/spa-extraction.md
Pattern 2: Authentication Flow
For login-protected content:
# Navigate to login
mcp__playwright__browser_navigate(url="https://docs.example.com/login")
# Fill credentials (prompt user for values)
mcp__playwright__browser_fill_form(
selector="#login-form",
values={"username": "...", "password": "..."}
)
# Click submit and wait for redirect
mcp__playwright__browser_click(selector="button[type='submit']")
mcp__playwright__browser_wait_for(selector=".dashboard", timeout=10000)
# Now navigate to protected content
mcp__playwright__browser_navigate(url="https://docs.example.com/private-docs")
Details: See references/auth-handling.md
Pattern 3: Multi-Page Crawl
For documentation with navigation trees:
# Get all page links from sidebar
links = mcp__playwright__browser_evaluate(script="""
return Array.from(document.querySelectorAll('nav a'))
.map(a => ({href: a.href, text: a.innerText}));
""")
# Iterate and capture each page
all_content = []
for link in links:
mcp__playwright__browser_navigate(url=link['href'])
mcp__playwright__browser_wait_for(selector=".content")
content = mcp__playwright__browser_evaluate(
script="document.querySelector('.content').innerText"
)
all_content.append({
'url': link['href'],
'title': link['text'],
'content': content
})
Details: See references/multi-page-crawl.md
Integration with RAG Systems
After capturing content, send it to your knowledge base or RAG pipeline:
import httpx
async def send_to_knowledge_base(content: str, source_url: str):
"""Send captured content to vector database or RAG system."""
async with httpx.AsyncClient() as client:
# Example: Send to Pinecone/Weaviate/custom endpoint
response = await client.post(
"https://api.yourapp.com/v1/knowledge/ingest",
json={
"content": content,
"source_url": source_url,
"metadata": {
"capture_method": "browser_playwright",
"timestamp": "2025-12-29T10:00:00Z"
}
},
headers={"Authorization": f"Bearer {API_KEY}"}
)
return response.json()
# Example usage
content = mcp__playwright__browser_evaluate(script="document.body.innerText")
result = await send_to_knowledge_base(content, "https://docs.example.com/page")
print(f"Ingested document ID: {result['document_id']}")
Fallback Strategy
Use this decision tree for content capture:
User requests content from URL
│
▼
┌─────────────┐
│ Try WebFetch│ ← Fast, no browser needed
└─────────────┘
│
Content OK? ──Yes──► Done
│
No (empty/partial)
│
▼
┌──────────────────┐
│ Check URL pattern│
└──────────────────┘
│
├─ Known SPA (react, vue, angular) ──► Playwright MCP
├─ Requires login ──► Chrome Extension (user session)
└─ Dynamic content ──► Playwright MCP with wait_for
Best Practices
1. Minimize Browser Usage
- Always try
WebFetchfirst (10x faster, no browser overhead) - Cache extracted content to avoid re-scraping
- Use
browser_evaluateto extract only needed content
2. Handle Dynamic Content
- Always use
wait_forafter navigation - Add delays for heavy SPAs:
await new Promise(r => setTimeout(r, 2000)) - Check for loading spinners before extracting
3. Respect Rate Limits
- Add delays between page navigations
- Don't crawl faster than a human would browse
- Honor robots.txt and terms of service
4. Clean Extracted Content
- Remove navigation, headers, footers
- Strip ads and promotional content
- Convert to clean markdown before storing
Troubleshooting
| Issue | Solution |
|---|---|
| Empty content | Add wait_for with appropriate selector |
| Partial render | Increase timeout or add explicit delay |
| Login required | Use Chrome extension with user session |
| CAPTCHA blocking | Manual intervention required |
| Content in iframe | Use browser_evaluate to access iframe content |
Real-World Examples
Example 1: Capturing React Documentation
# Navigate to React docs (client-side rendered)
mcp__playwright__browser_navigate(url="https://react.dev/learn")
# Wait for React to hydrate
mcp__playwright__browser_wait_for(selector="[data-react-hydrated='true']")
# Extract main article content
content = mcp__playwright__browser_evaluate(script="""
const article = document.querySelector('article');
return article ? article.innerText : '';
""")
# Send to knowledge base
await send_to_knowledge_base(content, "https://react.dev/learn")
Example 2: Private Notion Documentation
# User must be logged in via Chrome extension
# Content captures their authenticated session
mcp__playwright__browser_navigate(url="https://notion.so/private-team-docs")
mcp__playwright__browser_wait_for(selector=".notion-page-content")
content = mcp__playwright__browser_evaluate(script="""
return document.querySelector('.notion-page-content').innerText;
""")
Example 3: Multi-Page Tutorial Series
# Capture entire tutorial series
base_url = "https://docs.example.com/tutorials"
pages = []
# Get all tutorial links
mcp__playwright__browser_navigate(url=base_url)
links = mcp__playwright__browser_evaluate(script="""
return Array.from(document.querySelectorAll('.tutorial-link'))
.map(a => a.href);
""")
# Capture each page
for link in links:
mcp__playwright__browser_navigate(url=link)
mcp__playwright__browser_wait_for(selector=".tutorial-content")
content = mcp__playwright__browser_evaluate(script="""
return document.querySelector('.tutorial-content').innerText;
""")
pages.append({'url': link, 'content': content})
# Rate limiting
time.sleep(2)
# Batch ingest
for page in pages:
await send_to_knowledge_base(page['content'], page['url'])
Related Skills
webapp-testing- Playwright test automation patternsapi-integration-patterns- REST API data fetchingvector-search-optimization- Storing captured content in vector DBs
Version: 1.1.0 (December 2025) MCP Requirement: Playwright MCP server or Claude Chrome extension