website-to-vite-scraper

@breverdbidder/life-os

Multi-provider website scraper that converts any website (including CSR/SPA) to deployable static sites. Uses Playwright, Apify RAG Browser, Crawl4AI, and Firecrawl for comprehensive scraping. Triggers on requests to clone, reverse-engineer, or convert websites.

Install Skill

  1. Download skill
  2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
  3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Verify the skill by reading through its instructions before using it.

SKILL.md

name: website-to-vite-scraper
description: Multi-provider website scraper that converts any website (including CSR/SPA) to deployable static sites. Uses Playwright, Apify RAG Browser, Crawl4AI, and Firecrawl for comprehensive scraping. Triggers on requests to clone, reverse-engineer, or convert websites.
version: 2.0

Website-to-Vite Scraper V2

Multi-provider website scraper with AI-powered extraction for any website type.

Scraping Methods

Method Best For Anti-Bot JS Rendering Cost
Playwright General sites, Next.js/React apps ✅ Full FREE
Apify RAG Browser LLM/RAG-optimized content ✅ Adaptive Credits
Crawl4AI AI training data, clean extraction Credits
Firecrawl Protected sites, anti-bot bypass ✅✅ $16/mo

Quick Start

GitHub Actions (Recommended)

# Go to: Actions → Website Scraper V2 → Run workflow
# Options:
#   - URL: https://www.reventure.app/
#   - Project name: reventure-clone
#   - Method: all (tries all providers)
#   - Deploy: true

API MEGA LIBRARY Integration

The following APIs from our library enhance this scraper:

| API | Purpose | Status |
|---|---|---|
| APIFY_API_TOKEN | RAG Browser, Crawl4AI, Web Scraper | ✅ Configured |
| FIRECRAWL_API_KEY | Anti-bot bypass, stealth mode | ✅ Configured |
| BROWSERLESS_API_KEY | Alternative headless browser | 🔄 Available |

MCP Server Integration

Connect Claude Desktop/Cursor to Apify MCP for AI-powered scraping:

{
  "mcpServers": {
    "apify": {
      "command": "npx",
      "args": ["@apify/actors-mcp-server"],
      "env": {
        "APIFY_TOKEN": "your-apify-api-token"
      }
    }
  }
}

Or use hosted: https://mcp.apify.com?token=YOUR_TOKEN

Apify Actors Used

apify/rag-web-browser

  • Purpose: LLM-optimized web content extraction
  • Output: Markdown, HTML, text
  • Features:
    • Playwright adaptive (handles JS)
    • Clean content extraction
    • Link following
    • Metadata extraction

raizen/ai-web-scraper (Crawl4AI)

  • Purpose: AI training data collection
  • Output: Cleaned markdown, structured links
  • Features:
    • Excludes boilerplate (headers, footers, nav)
    • Word count thresholding
    • External link filtering
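
To make the last two features concrete, here is a minimal stdlib sketch of word count thresholding and external link filtering. It illustrates the ideas only and is not the actor's implementation; the function names and the default threshold are assumptions.

```python
from urllib.parse import urlparse


def filter_blocks(blocks, min_words=10):
    """Keep only text blocks that contain at least `min_words` words."""
    return [block for block in blocks if len(block.split()) >= min_words]


def internal_links(links, base_url):
    """Drop links that point to a different host than `base_url`."""
    base_host = urlparse(base_url).netloc.lower()
    kept = []
    for link in links:
        parsed = urlparse(link)
        # Skip non-web schemes such as mailto: or tel:.
        if parsed.scheme not in ("", "http", "https"):
            continue
        # Relative links have an empty host and stay on the same site.
        if parsed.netloc == "" or parsed.netloc.lower() == base_host:
            kept.append(link)
    return kept
```

Short blocks such as menu labels fall below the threshold, which is why this kind of filter tends to remove navigation boilerplate.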

Firecrawl

  • Purpose: Anti-bot protected sites
  • Output: Markdown, HTML, screenshots
  • Features:
    • Anti-detection technology
    • JavaScript rendering
    • Main content extraction
    • 5-second wait for dynamic content
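
A request matching these features might be built as in the sketch below. The endpoint and the field names (`formats`, `onlyMainContent`, `waitFor`) reflect my understanding of Firecrawl's v1 REST API and are assumptions to verify against the current Firecrawl documentation; `build_firecrawl_request` is a hypothetical helper, not part of this skill.

```python
import json
import urllib.request

# Assumed v1 scrape endpoint; check the Firecrawl docs before relying on it.
FIRECRAWL_SCRAPE_URL = "https://api.firecrawl.dev/v1/scrape"


def build_firecrawl_request(target_url, api_key, wait_ms=5000):
    """Build (but do not send) a scrape request with a 5-second render wait."""
    payload = {
        "url": target_url,
        "formats": ["markdown", "html", "screenshot"],  # outputs listed above
        "onlyMainContent": True,                        # main content extraction
        "waitFor": wait_ms,                             # wait for dynamic content
    }
    return urllib.request.Request(
        FIRECRAWL_SCRAPE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it is then a matter of passing the request to `urllib.request.urlopen` with the value of FIRECRAWL_API_KEY and decoding the JSON response.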

Output Structure

project-name/
├── dist/
│   ├── index.html      # Best merged HTML
│   ├── screenshot.png  # Full page capture
│   ├── meta.json       # Scrape metadata
│   └── assets/
│       ├── images/     # Downloaded images
│       ├── css/        # Stylesheets
│       └── js/         # Scripts
└── results/
    ├── playwright/     # Raw Playwright output
    ├── apify-rag/      # RAG Browser output
    ├── crawl4ai/       # Crawl4AI output
    └── firecrawl/      # Firecrawl output
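
For reference, the layout above could be created up front with a few lines of `pathlib`. This is a sketch under assumptions: `create_layout` is a hypothetical helper, and the two `meta.json` fields are placeholders rather than the skill's actual metadata schema.

```python
import json
from pathlib import Path

PROVIDERS = ("playwright", "apify-rag", "crawl4ai", "firecrawl")


def create_layout(root, project_name, source_url):
    """Create the dist/ and results/ folders and a placeholder meta.json."""
    project = Path(root) / project_name
    for sub in ("images", "css", "js"):
        (project / "dist" / "assets" / sub).mkdir(parents=True, exist_ok=True)
    for provider in PROVIDERS:
        (project / "results" / provider).mkdir(parents=True, exist_ok=True)
    # Placeholder fields only; the real scrape metadata is not documented here.
    meta = {"url": source_url, "project": project_name}
    (project / "dist" / "meta.json").write_text(
        json.dumps(meta, indent=2), encoding="utf-8"
    )
    return project
```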

Handling CSR/SPA Sites

Sites built with Next.js, React, or Vue that render client-side require JavaScript execution:

  1. Playwright waits for networkidle + 5 seconds
  2. Apify RAG uses adaptive crawler (Playwright when needed)
  3. Firecrawl has built-in JS rendering
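
The Playwright step can be sketched as follows, assuming Playwright for Python is installed (`pip install playwright`, then `playwright install chromium`). The function name and output file names are illustrative, and the skill's own workflow script may differ.

```python
EXTRA_WAIT_MS = 5_000  # the extra 5 seconds after networkidle


def render_page(url, html_path="index.html", screenshot_path="screenshot.png"):
    """Render `url` in headless Chromium, then save the HTML and a screenshot."""
    # Imported lazily so this module loads even where Playwright is missing.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait until the network has gone quiet.
            page.goto(url, wait_until="networkidle")
            # Give late client-side rendering a little more time.
            page.wait_for_timeout(EXTRA_WAIT_MS)
            html = page.content()
            page.screenshot(path=screenshot_path, full_page=True)
        finally:
            browser.close()

    with open(html_path, "w", encoding="utf-8") as fh:
        fh.write(html)
    return html
```

Calling `render_page("https://example.com/")` writes the rendered DOM rather than the empty application shell that a plain HTTP fetch would return for a client-rendered site.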

For __NEXT_DATA__ extraction (Next.js sites):

  • Playwright automatically extracts and saves to next_data.json
  • Can be parsed to reconstruct static pages
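
Next.js embeds its page data in a `<script id="__NEXT_DATA__">` tag, so the extraction can be done on saved HTML without a browser. Below is a minimal stdlib sketch; the class and function names are illustrative and not taken from the skill.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collect the text inside <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    @property
    def payload(self):
        return "".join(self._chunks)


def extract_next_data(html):
    """Return the parsed __NEXT_DATA__ object, or None if the tag is absent."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    raw = parser.payload
    return json.loads(raw) if raw.strip() else None
```

The result can be written to next_data.json with `json.dump`; page content usually sits under `props.pageProps`, although the exact shape depends on the site.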

Workflow Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | required | Website URL to scrape |
| project_name | string | required | Output folder/Cloudflare project name |
| scrape_method | choice | playwright | Method to use |
| extract_assets | boolean | true | Download images/CSS/JS |
| deploy_cloudflare | boolean | true | Deploy to Cloudflare Pages |
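
The workflow can also be started programmatically through GitHub's workflow dispatch REST endpoint instead of the Actions UI. The sketch below applies the defaults from the table and builds the request without sending it. The owner, repository, workflow file name, and branch are placeholders you would replace, and the helper names are hypothetical.

```python
import json
import urllib.request

DEFAULTS = {
    "scrape_method": "playwright",
    "extract_assets": True,
    "deploy_cloudflare": True,
}


def build_inputs(url, project_name, **overrides):
    """Merge required values, table defaults, and overrides into string inputs."""
    if not url or not project_name:
        raise ValueError("url and project_name are required")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    merged = {"url": url, "project_name": project_name, **DEFAULTS, **overrides}
    # Inputs are sent as strings here, so booleans become "true" / "false".
    return {
        key: str(value).lower() if isinstance(value, bool) else str(value)
        for key, value in merged.items()
    }


def build_dispatch_request(owner, repo, workflow_file, token, inputs, ref="main"):
    """Build (but do not send) a workflow_dispatch request."""
    endpoint = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    body = json.dumps({"ref": ref, "inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
```

Passing the request to `urllib.request.urlopen` queues a run, provided the token is allowed to trigger workflows in that repository.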

Cost Optimization

| Scenario | Recommended Method |
|---|---|
| Simple static site | Playwright (FREE) |
| JS-heavy SPA | Playwright → Apify RAG fallback |
| Protected site (Cloudflare) | Firecrawl |
| AI/RAG pipeline | Apify RAG or Crawl4AI |
| Maximum coverage | all method |
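
The "Playwright → Apify RAG fallback" row amounts to trying providers from cheapest to most expensive and stopping at the first usable result. Here is a minimal sketch of that pattern. The provider callables are stand-ins you would supply, and the skill's actual fallback logic is not shown in this document.

```python
def scrape_with_fallback(url, providers, is_usable=lambda html: bool(html and html.strip())):
    """Try each (name, fetch) pair in order and return the first usable result.

    `providers` is an ordered list of (name, callable) pairs, cheapest first.
    Raises RuntimeError if every provider fails or returns unusable output.
    """
    errors = {}
    for name, fetch in providers:
        try:
            html = fetch(url)
        except Exception as exc:  # a blocked or timed-out provider is not fatal
            errors[name] = repr(exc)
            continue
        if is_usable(html):
            return name, html
        errors[name] = "empty result"
    raise RuntimeError(f"all providers failed for {url}: {errors}")
```

Ordering the list as Playwright, then Apify RAG, then Firecrawl keeps the free option first, so paid credits are spent only when it fails.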

Security Assessment

Per API_MEGA_LIBRARY guidelines:

| API | Security Score | Recommendation |
|---|---|---|
| Apify | 85/100 | ✅ ADOPT |
| Firecrawl | 82/100 | ✅ ADOPT |
| Playwright | 90/100 | ✅ ADOPT (local) |

Troubleshooting

Site returns blank page

  1. Try scrape_method: all to use multiple providers
  2. Increase wait time in Playwright
  3. Check if site blocks datacenter IPs → use Firecrawl
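
Before retrying, it helps to confirm that the result really is blank. One rough heuristic is to count visible words while ignoring scripts and styles, as in this stdlib sketch. The 20-word threshold and the names are assumptions, not part of the skill.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Count words outside script, style, noscript, and template elements."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.words = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words += len(data.split())


def looks_blank(html, min_words=20):
    """Return True when the page has fewer than `min_words` visible words."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return parser.words < min_words
```

An unrendered SPA shell, typically an empty root `div` plus script tags, counts as blank under this check, which is a reasonable signal to switch providers or lengthen the wait.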

Assets not downloading

  1. Some sites block direct asset requests
  2. Use relative paths from original HTML
  3. Check for CORS restrictions
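
When assets fail because of relative paths, resolving every reference against the page URL before downloading usually helps. The sketch below collects image, script, and stylesheet references and maps each absolute URL to a path under assets/. It is an illustration with hypothetical names, and it does not handle file name collisions or assets referenced from inside CSS.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

ASSET_DIRS = {
    ".css": "css", ".js": "js",
    ".png": "images", ".jpg": "images", ".jpeg": "images",
    ".gif": "images", ".svg": "images", ".webp": "images",
}


class _AssetCollector(HTMLParser):
    """Collect src/href values from img, script, and stylesheet link tags."""

    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.refs.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            if "stylesheet" in (attrs.get("rel") or "").lower():
                self.refs.append(attrs["href"])


def plan_asset_downloads(html, page_url):
    """Map each absolute asset URL to a relative path under assets/."""
    collector = _AssetCollector()
    collector.feed(html)
    collector.close()
    plan = {}
    for ref in collector.refs:
        absolute = urljoin(page_url, ref)  # resolves relative references
        path = urlparse(absolute).path     # drops any query string
        folder = ASSET_DIRS.get(posixpath.splitext(path)[1].lower())
        if folder:
            plan[absolute] = f"assets/{folder}/{posixpath.basename(path)}"
    return plan
```

The returned mapping can drive both the downloads and the rewrite of `src` and `href` attributes in the saved index.html.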

Cloudflare protection detected

  1. Use Firecrawl (has anti-bot bypass)
  2. Or use Apify with residential proxies

Related Skills

  • auction-results - Uses similar scraping for auction data
  • bcpao-scraper - BCPAO property data extraction
  • youtube-transcript - Video content extraction

Changelog

V2.0 (Dec 2025)

  • Added multi-provider support (Playwright, Apify, Firecrawl)
  • MCP server integration
  • Automatic provider fallback
  • Asset downloading
  • Cloudflare Pages deployment

V1.0 (Dec 2025)

  • Initial Playwright-only scraper
  • Basic HTML/CSS/JS extraction