name	website-to-vite-scraper
description	Multi-provider website scraper that converts any website (including CSR/SPA) to deployable static sites. Uses Playwright, Apify RAG Browser, Crawl4AI, and Firecrawl for comprehensive scraping. Triggers on requests to clone, reverse-engineer, or convert websites.
version	2.0

Website-to-Vite Scraper V2

Multi-provider website scraper with AI-powered extraction for any website type.

Scraping Methods

Method	Best For	Anti-Bot	JS Rendering	Cost
Playwright	General sites, Next.js/React apps	❌	✅ Full	FREE
Apify RAG Browser	LLM/RAG-optimized content	✅	✅ Adaptive	Credits
Crawl4AI	AI training data, clean extraction	✅	✅	Credits
Firecrawl	Protected sites, anti-bot bypass	✅✅	✅	$16/mo

Quick Start

GitHub Actions (Recommended)

# Go to: Actions → Website Scraper V2 → Run workflow
# Options:
#   - URL: https://www.reventure.app/
#   - Project name: reventure-clone
#   - Method: all (tries all providers)
#   - Deploy: true

API MEGA LIBRARY Integration

The following APIs from our library enhance this scraper:

API	Purpose	Status
`APIFY_API_TOKEN`	RAG Browser, Crawl4AI, Web Scraper	✅ Configured
`FIRECRAWL_API_KEY`	Anti-bot bypass, stealth mode	✅ Configured
`BROWSERLESS_API_KEY`	Alternative headless browser	🔄 Available

MCP Server Integration

Connect Claude Desktop/Cursor to Apify MCP for AI-powered scraping:

{
  "mcpServers": {
    "apify": {
      "command": "npx",
      "args": ["@apify/actors-mcp-server"],
      "env": {
        "APIFY_TOKEN": "your-apify-api-token"
      }
    }
  }
}

Or use the hosted server: https://mcp.apify.com?token=YOUR_TOKEN

Apify Actors Used

apify/rag-web-browser

Purpose: LLM-optimized web content extraction
Output: Markdown, HTML, text
Features:
- Adaptive Playwright crawling (handles JS)
- Clean content extraction
- Link following
- Metadata extraction

raizen/ai-web-scraper (Crawl4AI)

Purpose: AI training data collection
Output: Cleaned markdown, structured links
Features:
- Excludes boilerplate (headers, footers, nav)
- Word count thresholding
- External link filtering

Firecrawl

Purpose: Anti-bot protected sites
Output: Markdown, HTML, screenshots
Features:
- Anti-detection technology
- JavaScript rendering
- Main content extraction
- 5-second wait for dynamic content

Output Structure

project-name/
├── dist/
│   ├── index.html      # Best merged HTML
│   ├── screenshot.png  # Full page capture
│   ├── meta.json       # Scrape metadata
│   └── assets/
│       ├── images/     # Downloaded images
│       ├── css/        # Stylesheets
│       └── js/         # Scripts
└── results/
    ├── playwright/     # Raw Playwright output
    ├── apify-rag/      # RAG Browser output
    ├── crawl4ai/       # Crawl4AI output
    └── firecrawl/      # Firecrawl output
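
The layout above can be created up front so that every provider has a place to write its raw output. This is a minimal sketch, assuming only the directory names shown in the tree; the helper name `create_layout` is hypothetical and not part of the skill.

```python
from pathlib import Path

# Directory names copied from the tree above; the helper itself is illustrative.
DIST_ASSET_DIRS = ("images", "css", "js")
PROVIDER_DIRS = ("playwright", "apify-rag", "crawl4ai", "firecrawl")


def create_layout(root: Path, project_name: str) -> Path:
    """Create the empty dist/ and results/ skeleton and return the project path."""
    project = root / project_name
    for name in DIST_ASSET_DIRS:
        (project / "dist" / "assets" / name).mkdir(parents=True, exist_ok=True)
    for name in PROVIDER_DIRS:
        (project / "results" / name).mkdir(parents=True, exist_ok=True)
    return project
```

Calling `create_layout(Path("."), "reventure-clone")` would produce the skeleton in the current directory; files such as `index.html` and `meta.json` are left for the scrape step to write.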

Handling CSR/SPA Sites

Sites built with Next.js, React, or Vue that render client-side require JavaScript execution:

- Playwright waits for `networkidle`, then an additional 5 seconds
- Apify RAG uses an adaptive crawler (switching to Playwright when needed)
- Firecrawl has built-in JS rendering
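
The Playwright wait strategy described above might look roughly like the following. This is a sketch using Playwright's synchronous Python API, not the skill's actual script; the function name `render_page` and the constant `EXTRA_WAIT_MS` are assumptions introduced for illustration.

```python
EXTRA_WAIT_MS = 5_000  # the additional 5 seconds mentioned above


def render_page(url: str) -> str:
    """Return the DOM of `url` after network idle plus a fixed extra wait."""
    # Imported lazily so this module still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait until the network has gone quiet, then give late scripts time to run.
            page.goto(url, wait_until="networkidle")
            page.wait_for_timeout(EXTRA_WAIT_MS)
            return page.content()
        finally:
            browser.close()
```

The fixed delay is a blunt instrument: it covers hydration that starts after the network goes idle, at the cost of 5 seconds per page.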

For __NEXT_DATA__ extraction (Next.js sites):

- The Playwright method automatically extracts the payload and saves it to next_data.json
- The saved JSON can be parsed to reconstruct static pages
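
Next.js pages embed their initial data in a `<script id="__NEXT_DATA__">` tag, so the payload can be recovered from rendered HTML with the standard library alone. The sketch below is one way to do it; `extract_next_data` is a hypothetical helper, and the skill's own extraction code may differ.

```python
import json
from html.parser import HTMLParser


class _NextDataParser(HTMLParser):
    """Collect the text inside <script id="__NEXT_DATA__">."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_next_data(html: str):
    """Return the parsed __NEXT_DATA__ payload, or None if the page has none."""
    parser = _NextDataParser()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks).strip()
    return json.loads(raw) if raw else None
```

The returned dictionary can then be written out with `json.dump` to produce a file like next_data.json.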

Workflow Parameters

Parameter	Type	Default	Description
`url`	string	required	Website URL to scrape
`project_name`	string	required	Output folder/Cloudflare project name
`scrape_method`	choice	playwright	Method to use
`extract_assets`	boolean	true	Download images/CSS/JS
`deploy_cloudflare`	boolean	true	Deploy to Cloudflare Pages
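
To make the defaults in the table concrete, the parameters can be modeled as a small dataclass. This is an illustrative sketch only: the class name `WorkflowInputs` and the validation rules are assumptions, and the workflow itself may validate its inputs differently.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class WorkflowInputs:
    url: str                          # required
    project_name: str                 # required
    scrape_method: str = "playwright"
    extract_assets: bool = True
    deploy_cloudflare: bool = True

    def __post_init__(self) -> None:
        # Assumed rule: only absolute http(s) URLs are accepted.
        parsed = urlparse(self.url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"url must be an absolute http(s) URL: {self.url!r}")
        if not self.project_name:
            raise ValueError("project_name is required")
```

For example, `WorkflowInputs(url="https://www.reventure.app/", project_name="reventure-clone")` picks up the three defaults from the table.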

Cost Optimization

Scenario	Recommended Method
Simple static site	Playwright (FREE)
JS-heavy SPA	Playwright → Apify RAG fallback
Protected site (Cloudflare)	Firecrawl
AI/RAG pipeline	Apify RAG or Crawl4AI
Maximum coverage	`all` method
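
The recommendations above amount to an ordered list of providers per scenario, tried until one returns usable HTML. The sketch below shows that fallback pattern under stated assumptions: the scenario keys (`static`, `spa`, and so on) are hypothetical labels, the provider names are borrowed from the `results/` folder names, and the scraper callables are stand-ins for real provider clients. Treating "Apify RAG or Crawl4AI" as an ordered fallback is also a simplification.

```python
from typing import Callable, Optional

# Provider order per scenario, following the table above.
RECOMMENDED = {
    "static": ["playwright"],
    "spa": ["playwright", "apify-rag"],
    "protected": ["firecrawl"],
    "rag": ["apify-rag", "crawl4ai"],
    "max": ["playwright", "apify-rag", "crawl4ai", "firecrawl"],
}

Scraper = Callable[[str], Optional[str]]


def scrape_with_fallback(scenario: str, scrapers: dict[str, Scraper], url: str):
    """Try each recommended provider in order.

    Returns (provider_name, html) for the first non-blank result, or None.
    """
    for name in RECOMMENDED[scenario]:
        scraper = scrapers.get(name)
        if scraper is None:
            continue  # provider not configured
        try:
            html = scraper(url)
        except Exception:
            continue  # treat any provider error as "try the next one"
        if html and html.strip():
            return name, html
    return None
```

Putting the free provider first means paid credits are spent only when Playwright comes back blank or fails.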

Security Assessment

Per API_MEGA_LIBRARY guidelines:

API	Security Score	Recommendation
Apify	85/100	✅ ADOPT
Firecrawl	82/100	✅ ADOPT
Playwright	90/100	✅ ADOPT (local)

Troubleshooting

Site returns blank page

- Try `scrape_method: all` to use multiple providers
- Increase the wait time in Playwright
- Check whether the site blocks datacenter IPs; if so, use Firecrawl

Assets not downloading

- Some sites block direct asset requests
- Resolve the relative asset paths found in the original HTML against the page URL
- Check for CORS restrictions
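
Relative asset references only work once they are resolved against the page they came from. The sketch below resolves one reference to an absolute download URL and picks a local path under `dist/assets/`. The helper name `resolve_asset` and the extension mapping are assumptions; in particular, anything that is not `.css` or `.js` lands in `images/` here, and filename collisions are not handled.

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse

# Simplified mapping for this sketch; everything else goes to images/.
ASSET_DIRS = {".css": "css", ".js": "js"}


def resolve_asset(page_url: str, ref: str) -> tuple[str, str]:
    """Return (absolute download URL, local path under dist/) for one asset reference."""
    absolute = urljoin(page_url, ref)
    # Use only the URL path so query strings do not end up in the filename.
    filename = PurePosixPath(urlparse(absolute).path).name
    subdir = ASSET_DIRS.get(PurePosixPath(filename).suffix.lower(), "images")
    return absolute, f"assets/{subdir}/{filename}"
```

The second element of the tuple is the value that would replace the original `src` or `href` in the saved HTML.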

Cloudflare protection detected

- Use Firecrawl (has anti-bot bypass)
- Or use Apify with residential proxies

Related Skills

- auction-results: Uses similar scraping for auction data
- bcpao-scraper: BCPAO property data extraction
- youtube-transcript: Video content extraction

Changelog

V2.0 (Dec 2025)

- Added multi-provider support (Playwright, Apify, Firecrawl)
- Added MCP server integration
- Added automatic provider fallback
- Added asset downloading
- Added Cloudflare Pages deployment

V1.0 (Dec 2025)

- Initial Playwright-only scraper
- Basic HTML/CSS/JS extraction
