Claude Code Plugins

Community-maintained marketplace

Feedback

browser-content-capture

@mattnigh/skills_collection
0
0

Capture content from JavaScript-rendered pages, login-protected sites, and multi-page documentation using Playwright MCP tools or Claude Chrome extension. Use when WebFetch fails on SPAs, dynamic content, or auth-required pages.

Install Skill

1Download skill
2Enable skills in Claude

Open claude.ai/settings/capabilities and find the "Skills" section

3Upload to Claude

Click "Upload skill" and select the downloaded ZIP file

Note: Please verify skill by going through its instructions before using it.

SKILL.md

name browser-content-capture
description Capture content from JavaScript-rendered pages, login-protected sites, and multi-page documentation using Playwright MCP tools or Claude Chrome extension. Use when WebFetch fails on SPAs, dynamic content, or auth-required pages.
version 1.1.0
tags browser, playwright, mcp, scraping, spa, authentication, chrome-extension, 2025

Browser Content Capture

Capture web content that traditional scrapers cannot access.

Overview

This skill enables content extraction from sources that require browser-level access:

  • JavaScript-rendered SPAs (React, Vue, Angular apps)
  • Login-protected documentation (private wikis, gated content)
  • Dynamic content (infinite scroll, lazy loading, client-side routing)
  • Multi-page site crawls (documentation trees, tutorial series)

When to Use This Skill

Use when:

  • WebFetch returns empty or partial content
  • Page requires JavaScript execution to render
  • Content is behind authentication
  • Need to navigate multi-page structures
  • Extracting from client-side routed apps

Do NOT use when:

  • Static HTML pages (use WebFetch - faster)
  • Public API endpoints (use direct HTTP calls)
  • Simple RSS/Atom feeds

Quick Start

Check Available MCP Tools

MCPSearch: "select:mcp__playwright__browser_navigate"

Basic Capture Pattern

# 1. Navigate to URL
mcp__playwright__browser_navigate(url="https://docs.example.com")

# 2. Wait for content to render
mcp__playwright__browser_wait_for(selector=".main-content", timeout=5000)

# 3. Capture page snapshot
snapshot = mcp__playwright__browser_snapshot()

# 4. Extract text content
content = mcp__playwright__browser_evaluate(
    script="document.querySelector('.main-content').innerText"
)

MCP Tools Reference

Tool Purpose When to Use
browser_navigate Go to URL First step of any capture
browser_snapshot Get DOM/accessibility tree Understanding page structure
browser_evaluate Run custom JS Extract specific content
browser_click Click elements Navigate menus, pagination
browser_fill_form Fill inputs Authentication flows
browser_wait_for Wait for selector Dynamic content loading
browser_take_screenshot Capture image Visual verification
browser_console_messages Read JS console Debug extraction issues
browser_network_requests Monitor XHR/fetch Find API endpoints

Full tool documentation: See references/mcp-tools.md


Capture Patterns

Pattern 1: SPA Content Extraction

For React/Vue/Angular apps where content renders client-side:

# Navigate and wait for hydration
mcp__playwright__browser_navigate(url="https://react-docs.example.com")
mcp__playwright__browser_wait_for(selector="[data-hydrated='true']", timeout=10000)

# Extract after React mounts
content = mcp__playwright__browser_evaluate(script="""
    // Wait for React to finish rendering
    await new Promise(r => setTimeout(r, 1000));
    return document.querySelector('article').innerText;
""")

Details: See references/spa-extraction.md

Pattern 2: Authentication Flow

For login-protected content:

# Navigate to login
mcp__playwright__browser_navigate(url="https://docs.example.com/login")

# Fill credentials (prompt user for values)
mcp__playwright__browser_fill_form(
    selector="#login-form",
    values={"username": "...", "password": "..."}
)

# Click submit and wait for redirect
mcp__playwright__browser_click(selector="button[type='submit']")
mcp__playwright__browser_wait_for(selector=".dashboard", timeout=10000)

# Now navigate to protected content
mcp__playwright__browser_navigate(url="https://docs.example.com/private-docs")

Details: See references/auth-handling.md

Pattern 3: Multi-Page Crawl

For documentation with navigation trees:

# Get all page links from sidebar
links = mcp__playwright__browser_evaluate(script="""
    return Array.from(document.querySelectorAll('nav a'))
        .map(a => ({href: a.href, text: a.innerText}));
""")

# Iterate and capture each page
all_content = []
for link in links:
    mcp__playwright__browser_navigate(url=link['href'])
    mcp__playwright__browser_wait_for(selector=".content")
    content = mcp__playwright__browser_evaluate(
        script="document.querySelector('.content').innerText"
    )
    all_content.append({
        'url': link['href'],
        'title': link['text'],
        'content': content
    })

Details: See references/multi-page-crawl.md


Integration with RAG Systems

After capturing content, send it to your knowledge base or RAG pipeline:

import httpx

async def send_to_knowledge_base(content: str, source_url: str):
    """Send captured content to vector database or RAG system."""
    async with httpx.AsyncClient() as client:
        # Example: Send to Pinecone/Weaviate/custom endpoint
        response = await client.post(
            "https://api.yourapp.com/v1/knowledge/ingest",
            json={
                "content": content,
                "source_url": source_url,
                "metadata": {
                    "capture_method": "browser_playwright",
                    "timestamp": "2025-12-29T10:00:00Z"
                }
            },
            headers={"Authorization": f"Bearer {API_KEY}"}
        )
        return response.json()

# Example usage
content = mcp__playwright__browser_evaluate(script="document.body.innerText")
result = await send_to_knowledge_base(content, "https://docs.example.com/page")
print(f"Ingested document ID: {result['document_id']}")

Fallback Strategy

Use this decision tree for content capture:

User requests content from URL
         │
         ▼
    ┌─────────────┐
    │ Try WebFetch│ ← Fast, no browser needed
    └─────────────┘
         │
    Content OK? ──Yes──► Done
         │
         No (empty/partial)
         │
         ▼
    ┌──────────────────┐
    │ Check URL pattern│
    └──────────────────┘
         │
    ├─ Known SPA (react, vue, angular) ──► Playwright MCP
    ├─ Requires login ──► Chrome Extension (user session)
    └─ Dynamic content ──► Playwright MCP with wait_for

Best Practices

1. Minimize Browser Usage

  • Always try WebFetch first (10x faster, no browser overhead)
  • Cache extracted content to avoid re-scraping
  • Use browser_evaluate to extract only needed content

2. Handle Dynamic Content

  • Always use wait_for after navigation
  • Add delays for heavy SPAs: await new Promise(r => setTimeout(r, 2000))
  • Check for loading spinners before extracting

3. Respect Rate Limits

  • Add delays between page navigations
  • Don't crawl faster than a human would browse
  • Honor robots.txt and terms of service

4. Clean Extracted Content

  • Remove navigation, headers, footers
  • Strip ads and promotional content
  • Convert to clean markdown before storing

Troubleshooting

Issue Solution
Empty content Add wait_for with appropriate selector
Partial render Increase timeout or add explicit delay
Login required Use Chrome extension with user session
CAPTCHA blocking Manual intervention required
Content in iframe Use browser_evaluate to access iframe content

Real-World Examples

Example 1: Capturing React Documentation

# Navigate to React docs (client-side rendered)
mcp__playwright__browser_navigate(url="https://react.dev/learn")

# Wait for React to hydrate
mcp__playwright__browser_wait_for(selector="[data-react-hydrated='true']")

# Extract main article content
content = mcp__playwright__browser_evaluate(script="""
    const article = document.querySelector('article');
    return article ? article.innerText : '';
""")

# Send to knowledge base
await send_to_knowledge_base(content, "https://react.dev/learn")

Example 2: Private Notion Documentation

# User must be logged in via Chrome extension
# Content captures their authenticated session

mcp__playwright__browser_navigate(url="https://notion.so/private-team-docs")
mcp__playwright__browser_wait_for(selector=".notion-page-content")

content = mcp__playwright__browser_evaluate(script="""
    return document.querySelector('.notion-page-content').innerText;
""")

Example 3: Multi-Page Tutorial Series

# Capture entire tutorial series
base_url = "https://docs.example.com/tutorials"
pages = []

# Get all tutorial links
mcp__playwright__browser_navigate(url=base_url)
links = mcp__playwright__browser_evaluate(script="""
    return Array.from(document.querySelectorAll('.tutorial-link'))
        .map(a => a.href);
""")

# Capture each page
for link in links:
    mcp__playwright__browser_navigate(url=link)
    mcp__playwright__browser_wait_for(selector=".tutorial-content")
    
    content = mcp__playwright__browser_evaluate(script="""
        return document.querySelector('.tutorial-content').innerText;
    """)
    
    pages.append({'url': link, 'content': content})
    
    # Rate limiting
    time.sleep(2)

# Batch ingest
for page in pages:
    await send_to_knowledge_base(page['content'], page['url'])

Related Skills

  • webapp-testing - Playwright test automation patterns
  • api-integration-patterns - REST API data fetching
  • vector-search-optimization - Storing captured content in vector DBs

Version: 1.1.0 (December 2025) MCP Requirement: Playwright MCP server or Claude Chrome extension