| name | event-scraper |
| description | Create new event scraping scripts for websites. Use when adding a new event source to the Asheville Event Feed. ALWAYS start by detecting the CMS/platform and trying known API endpoints first. Browser scraping is NOT supported (Vercel limitation). Handles API-based, HTML/JSON-LD, and hybrid patterns with comprehensive testing workflows. |
Event Scraper Skill
Create new event scrapers that integrate with the Asheville Event Feed codebase. This skill provides patterns and guidance for the full lifecycle: exploration, development, testing, and production integration.
⚠️ CRITICAL: API-First Approach
Scrapers run automatically on Vercel, which does NOT support browser automation.
You MUST find the site's API before considering any other approach. Modern websites almost always fetch event data from a backend API - your job is to find and use that same API.
Priority Order (STRICTLY follow this order):
- 🥇 Known CMS API - Check the Quick API Lookup table below FIRST
- 🥈 Internal JSON API - Site's own API endpoints (found via page analysis)
- 🥉 Public API - Official documented API (Ticketmaster, Eventbrite, etc.)
- 🏅 HTML with JSON-LD - Structured data embedded in HTML pages
- ❌ Browser scraping - NOT SUPPORTED on Vercel!
🚀 Quick API Endpoint Lookup (TRY THESE FIRST!)
Before doing any exploration, check if the site uses a known CMS/platform and try these endpoints directly:
| CMS/Plugin | Detection Signs | API Endpoint | Key Parameters |
|---|---|---|---|
| WordPress + Tribe Events | /wp-content/, "The Events Calendar" | /wp-json/tribe/events/v1/events | start_date, per_page, page |
| WordPress + All Events | "All-in-One Event Calendar" | /wp-json/osec/v1/events | start, end |
| WordPress REST | /wp-content/, /wp-admin/ | /wp-json/wp/v2/posts?type=event | per_page, page |
| Squarespace | squarespace.com, static1.squarespace.com | {any-page}?format=json | Append to URL |
| Next.js | /_next/, __NEXT_DATA__ | /_next/data/{buildId}/{page}.json | Check page source |
| Eventbrite | eventbrite.com | Internal API (see eventbrite.ts) | Complex - see example |
| Ticketmaster Venues | Venue ticket sales | Discovery API | venueId, apikey |
Example: Detecting and Using Tribe Events API
If you detect WordPress + Tribe Events, immediately try:
GET https://example.com/wp-json/tribe/events/v1/events?start_date=2025-01-01&per_page=50&page=1
This often returns rich JSON with all event data, proper timezone handling, and pagination.
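The detection signs in the lookup table can be checked mechanically before trying an endpoint. The following is a minimal sketch, not part of the codebase: the name `detectPlatform`, the `Detection` type, and the exact marker matching (including the `tribe-events` class-name check) are assumptions for illustration, while the endpoint paths are copied from the table.

```typescript
// Hypothetical helper: guess the platform from raw page HTML using the
// detection signs listed in the Quick API Endpoint Lookup table.
type Platform = 'tribe-events' | 'wordpress' | 'squarespace' | 'nextjs' | 'unknown';

interface Detection {
  platform: Platform;
  // Endpoint path (or URL suffix) to try first, when the table lists one.
  candidate?: string;
}

function detectPlatform(html: string): Detection {
  const has = (marker: string): boolean => html.includes(marker);

  // Most specific signature first: Tribe Events runs on top of WordPress.
  // The "tribe-events" marker is an assumption, not taken from the table.
  if (has('/wp-content/') && /the[ -]events[ -]calendar|tribe-events/i.test(html)) {
    return { platform: 'tribe-events', candidate: '/wp-json/tribe/events/v1/events' };
  }
  if (has('/wp-content/') || has('/wp-admin/')) {
    return { platform: 'wordpress', candidate: '/wp-json/wp/v2/posts?type=event' };
  }
  if (has('squarespace.com')) {
    return { platform: 'squarespace', candidate: '?format=json' };
  }
  if (has('/_next/') || has('__NEXT_DATA__')) {
    // The Next.js data URL needs a buildId from the page source, so no fixed candidate.
    return { platform: 'nextjs' };
  }
  return { platform: 'unknown' };
}
```

A result of `unknown` only means none of these markers matched; the site may still expose an internal JSON API worth looking for.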
Required Output Format
Every scraper MUST return ScrapedEvent[]:
interface ScrapedEvent {
sourceId: string; // Unique ID from source platform (prefix with source, e.g., "mx-123")
source: EventSource; // Add to types.ts if new source
title: string;
description?: string;
startDate: Date; // UTC Date object - see Timezone Decision Tree
location?: string; // Format: "Venue, Address, City, State"
zip?: string; // Zip code (from API or fallback utilities)
organizer?: string;
price?: string; // "Free", "$20", "$15 - $30", "Unknown"
url: string; // Unique event URL (used for deduplication)
imageUrl?: string;
interestedCount?: number;
goingCount?: number;
timeUnknown?: boolean; // True if source only provided date, no time
}
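Because every required field must be present, a small runtime check can catch malformed events before they reach the database. This is a sketch only: `RequiredEventFields` is a local stand-in for the required part of ScrapedEvent, `checkRequiredFields` is a hypothetical name, and the rule that `url` must be an absolute http(s) URL is an assumption rather than something the interface states.

```typescript
// Local stand-in for the required ScrapedEvent fields (optional fields omitted).
interface RequiredEventFields {
  sourceId: string;
  source: string; // EventSource in the real codebase
  title: string;
  startDate: Date;
  url: string;
}

// Hypothetical check: returns a list of problems, empty when the event looks valid.
function checkRequiredFields(ev: Partial<RequiredEventFields>): string[] {
  const problems: string[] = [];
  if (!ev.sourceId) problems.push('missing sourceId');
  if (!ev.source) problems.push('missing source');
  if (!ev.title || !ev.title.trim()) problems.push('missing title');
  if (!(ev.startDate instanceof Date) || isNaN(ev.startDate.getTime())) {
    problems.push('startDate is not a valid Date');
  }
  // Assumption: deduplication works best with absolute URLs.
  if (!ev.url || !/^https?:\/\//.test(ev.url)) problems.push('url is not absolute');
  return problems;
}
```

Calling it on each formatted event and logging non-empty results is usually enough during development.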
PHASE 1: EXPLORATION
Step 1.1: Detect CMS/Platform
Use WebFetch to analyze the target site:
WebFetch URL: https://example.com/events/
Prompt: "Analyze this page:
1. What CMS/platform is it? (WordPress, Squarespace, Next.js, custom)
2. Look for: wp-content, wp-json, squarespace, _next, __NEXT_DATA__
3. Is there JSON-LD structured data in script tags?
4. What event plugin is used? (Tribe Events, All Events Calendar, etc.)
5. Any hints about API endpoints in the HTML?"
Step 1.2: Try Known API Endpoints
Based on CMS detection, immediately try the known API endpoints from the Quick Lookup table:
WebFetch URL: https://example.com/wp-json/tribe/events/v1/events?per_page=5
Prompt: "Analyze this API response:
1. Is it returning JSON event data?
2. What fields are available? (title, start_date, venue, cost, etc.)
3. Is there timezone information?
4. What pagination mechanism is used?
5. List all available fields for each event"
Step 1.3: Test API Parameters
Once you find a working API, test common parameters:
| Parameter | Common Names | Purpose |
|---|---|---|
| Future filter | start_date, after, from, startDate | Only get future events |
| Page size | per_page, limit, count, pageSize | Control results per page |
| Pagination | page, offset, cursor, skip | Navigate pages |
| Sort | orderby, sort, sortValue | Order results |
WebFetch URL: https://example.com/wp-json/tribe/events/v1/events?start_date=2025-01-01&per_page=50
Prompt: "Does this API support:
1. start_date parameter for filtering future events?
2. per_page parameter for controlling page size?
3. What's the maximum per_page allowed?
4. How does pagination work (page number, next_url, etc.)?"
Step 1.4: Document Field Mapping
Create a mental map of API fields to ScrapedEvent fields:
| API Field | ScrapedEvent Field | Transform Needed |
|---|---|---|
| id | sourceId | Prefix: "mx-${id}" |
| title | title | decodeHtmlEntities() |
| utc_start_date | startDate | new Date(utc.replace(' ', 'T') + 'Z') |
| cost | price | Use directly or "Unknown" |
| venue.venue | location | Build string, decode entities |
| venue.zip | zip | Use directly or fallback |
| url | url | Use directly |
⏰ Timezone Decision Tree (CRITICAL!)
Getting timezone right is crucial. Follow this decision tree:
Does the API provide a UTC field (utc_start_date, utc_time, etc.)?
├─ YES → Use directly: new Date(utcField.replace(' ', 'T') + 'Z')
│ This is the SIMPLEST and most reliable approach.
│
└─ NO → Does the API provide ISO 8601 with offset? (e.g., "2025-12-16T19:00:00-05:00")
├─ YES → Use directly: new Date(isoString)
│
└─ NO → Does the API provide local time + timezone name? (e.g., "America/New_York")
├─ YES → Use parseAsEastern(dateStr, timeStr)
│
└─ NO → DANGER! Ambiguous local time.
- Assume Eastern for NC events
- Use parseAsEastern(dateStr, timeStr)
- Verify with test insertion!
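The decision tree can be expressed as one function that tries each branch in order. This is a self-contained sketch under stated assumptions: `resolveStartDate`, `ApiTimes`, `easternOffsetMinutes`, and `parseLocalAsEastern` are hypothetical names, and `parseLocalAsEastern` is only a stand-in for the project's `parseAsEastern`, whose real implementation is not shown here.

```typescript
// Offset of America/New_York from UTC in minutes at a given instant
// (-300 during EST, -240 during EDT).
function easternOffsetMinutes(instant: Date): number {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone: 'America/New_York',
    hourCycle: 'h23',
    year: 'numeric', month: 'numeric', day: 'numeric',
    hour: 'numeric', minute: 'numeric', second: 'numeric',
  }).formatToParts(instant);
  const get = (type: string): number =>
    Number(parts.find((p) => p.type === type)?.value ?? NaN);
  const wallClockAsUtc = Date.UTC(
    get('year'), get('month') - 1, get('day'),
    get('hour') % 24, get('minute'), get('second'),
  );
  return Math.round((wallClockAsUtc - instant.getTime()) / 60000);
}

// Stand-in for parseAsEastern: interpret a wall-clock date and time as Eastern.
function parseLocalAsEastern(dateStr: string, timeStr: string): Date {
  const naive = new Date(`${dateStr}T${timeStr}Z`); // wall clock read as if UTC
  if (isNaN(naive.getTime())) return naive;
  // Two passes so the offset is taken near the real instant, not the naive one.
  const firstGuess = new Date(naive.getTime() - easternOffsetMinutes(naive) * 60000);
  return new Date(naive.getTime() - easternOffsetMinutes(firstGuess) * 60000);
}

interface ApiTimes {
  utc?: string;       // e.g. "2025-12-17 00:00:00"
  iso?: string;       // e.g. "2025-12-16T19:00:00-05:00"
  localDate?: string; // e.g. "2025-12-16"
  localTime?: string; // e.g. "19:00:00"
}

// Walk the decision tree: UTC field, then ISO with offset, then local time as Eastern.
function resolveStartDate(t: ApiTimes): Date | null {
  let d: Date | null = null;
  if (t.utc) {
    d = new Date(t.utc.replace(' ', 'T') + 'Z');
  } else if (t.iso && /(Z|[+-]\d{2}:?\d{2})$/.test(t.iso)) {
    d = new Date(t.iso);
  } else if (t.localDate && t.localTime) {
    d = parseLocalAsEastern(t.localDate, t.localTime);
  }
  return d && !isNaN(d.getTime()) ? d : null;
}
```

For a 7 PM Eastern event on 2025-12-16, all three branches should produce the same instant, 2025-12-17T00:00:00.000Z. Wall-clock times that fall inside a daylight-saving transition hour remain ambiguous and are not handled specially here.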
Timezone Verification
ALWAYS verify timezone handling by comparing:
1. API's local time field (e.g., start_date: "2025-12-16 19:00:00")
2. API's UTC field (e.g., utc_start_date: "2025-12-17 00:00:00")
3. Your parsed Date displayed in Eastern (should match #1)
Example verification:
API local: 19:00:00 (7 PM Eastern)
API UTC: 00:00:00 next day (midnight UTC = 7 PM EST, correct!)
Our parsed: 7:00:00 PM Eastern ✓
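The same comparison can be automated when the API exposes both a local and a UTC field. The sketch below assumes both fields use the "YYYY-MM-DD HH:mm:ss" layout shown above; `formatEastern` and `timezoneMatches` are hypothetical helper names, not existing utilities.

```typescript
// Format an instant as "YYYY-MM-DD HH:mm:ss" wall-clock time in America/New_York.
function formatEastern(date: Date): string {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone: 'America/New_York',
    hourCycle: 'h23',
    year: 'numeric', month: 'numeric', day: 'numeric',
    hour: 'numeric', minute: 'numeric', second: 'numeric',
  }).formatToParts(date);
  const num = (type: string): number =>
    Number(parts.find((p) => p.type === type)?.value ?? NaN);
  const pad = (n: number): string => String(n).padStart(2, '0');
  return (
    `${num('year')}-${pad(num('month'))}-${pad(num('day'))} ` +
    `${pad(num('hour') % 24)}:${pad(num('minute'))}:${pad(num('second'))}`
  );
}

// True when the API's UTC field, shown in Eastern time, equals its local field.
function timezoneMatches(apiLocal: string, apiUtc: string): boolean {
  const parsed = new Date(apiUtc.replace(' ', 'T') + 'Z');
  if (isNaN(parsed.getTime())) return false;
  return formatEastern(parsed) === apiLocal;
}
```

With the example values above, `timezoneMatches('2025-12-16 19:00:00', '2025-12-17 00:00:00')` returns true. A false result for a venue known to be in Eastern time suggests the "UTC" field is not really UTC.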
📍 Location String Best Practices
Location strings often have issues. Follow these rules:
1. Always Decode HTML Entities
const venueName = decodeHtmlEntities(venue.venue); // "Rock & Roll" not "Rock &amp; Roll"
const address = decodeHtmlEntities(venue.address);
2. Avoid Duplicate City Names
APIs often include the city in both the address and the separate city field:
// BAD: "Turgua Brewing, Fairview, Fairview, NC"
// GOOD: "Turgua Brewing, 123 Main St, Fairview, NC"
if (venue.city && !venue.address?.includes(venue.city)) {
parts.push(venue.city);
}
3. Standard Format
// Format: "Venue, Address, City, State"
const parts = [venueName];
if (venue.address) parts.push(decodeHtmlEntities(venue.address));
if (venue.city && !venue.address?.includes(venue.city)) {
parts.push(venue.city);
}
if (venue.state) parts.push(venue.state);
location = parts.join(', ');
4. Zip Code Fallbacks
let zip = venue?.zip || undefined;
if (!zip && venue?.geo_lat && venue?.geo_lng) {
zip = getZipFromCoords(venue.geo_lat, venue.geo_lng);
}
if (!zip && venue?.city) {
zip = getZipFromCity(venue.city);
}
PHASE 2: DEVELOPMENT
Step 2.1: Add Source Type
Add to lib/scrapers/types.ts:
export type EventSource = 'AVL_TODAY' | ... | 'YOUR_SOURCE';
Step 2.2: Create Scraper
Create lib/scrapers/yoursource.ts:
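import { ScrapedEvent } from './types';
import { fetchWithRetry } from '@/lib/utils/retry';
import { isNonNCEvent, getZipFromCoords, getZipFromCity } from '@/lib/utils/geo';
import { decodeHtmlEntities } from '@/lib/utils/parsers';
import { getTodayStringEastern } from '@/lib/utils/timezone';
// Assumed response shape, inferred from the fields used below; adjust to the real API.
interface ApiVenue {
venue?: string;
address?: string;
city?: string;
state?: string;
zip?: string;
geo_lat?: number;
geo_lng?: number;
}
interface ApiEvent {
id: number | string;
title: string;
description?: string;
utc_start_date: string; // "YYYY-MM-DD HH:mm:ss" in UTC
url: string;
cost?: string;
all_day?: boolean;
venue?: ApiVenue;
organizer?: { organizer?: string }[];
image?: { url?: string };
}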
const API_BASE = 'https://example.com/wp-json/tribe/events/v1/events';
const PER_PAGE = 50;
const MAX_PAGES = 40;
const DELAY_MS = 200;
const API_HEADERS = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'Accept': 'application/json',
};
export async function scrapeYourSource(): Promise<ScrapedEvent[]> {
console.log('[YourSource] Starting scrape...');
const allEvents: ScrapedEvent[] = [];
const today = getTodayStringEastern();
let page = 1;
let hasMore = true;
while (hasMore && page <= MAX_PAGES) {
try {
const url = new URL(API_BASE);
url.searchParams.set('start_date', today);
url.searchParams.set('per_page', PER_PAGE.toString());
url.searchParams.set('page', page.toString());
console.log(`[YourSource] Fetching page ${page}...`);
const response = await fetchWithRetry(
url.toString(),
{ headers: API_HEADERS, cache: 'no-store' },
{ maxRetries: 3, baseDelay: 1000 }
);
const data = await response.json();
const events = data.events || [];
console.log(`[YourSource] Page ${page}: ${events.length} events`);
for (const event of events) {
const formatted = formatEvent(event);
if (formatted) allEvents.push(formatted);
}
hasMore = !!data.next_rest_url && page < data.total_pages;
page++;
if (hasMore) await new Promise(r => setTimeout(r, DELAY_MS));
} catch (error) {
console.error(`[YourSource] Error on page ${page}:`, error);
break;
}
}
// Filter non-NC events
const ncEvents = allEvents.filter(ev => !isNonNCEvent(ev.title, ev.location));
console.log(`[YourSource] Found ${ncEvents.length} NC events`);
return ncEvents;
}
function formatEvent(event: ApiEvent): ScrapedEvent | null {
// Parse UTC date (see Timezone Decision Tree)
const startDate = new Date(event.utc_start_date.replace(' ', 'T') + 'Z');
if (isNaN(startDate.getTime()) || startDate < new Date()) {
return null;
}
// Build location (see Location Best Practices)
const venue = event.venue;
let location: string | undefined;
if (venue?.venue) {
const parts = [decodeHtmlEntities(venue.venue)];
if (venue.address) parts.push(decodeHtmlEntities(venue.address));
if (venue.city && !venue.address?.includes(venue.city)) parts.push(venue.city);
if (venue.state) parts.push(venue.state);
location = parts.join(', ');
}
// Zip with fallbacks
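let zip = venue?.zip || undefined;
if (!zip && venue?.geo_lat && venue?.geo_lng) {
zip = getZipFromCoords(venue.geo_lat, venue.geo_lng);
}
// Last resort: derive the zip from the city name
if (!zip && venue?.city) {
zip = getZipFromCity(venue.city);
}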
return {
sourceId: `ys-${event.id}`,
source: 'YOUR_SOURCE',
title: decodeHtmlEntities(event.title),
description: event.description ? decodeHtmlEntities(event.description) : undefined,
startDate,
location,
zip,
organizer: event.organizer?.[0]?.organizer,
price: event.cost || 'Unknown',
url: event.url,
imageUrl: event.image?.url,
timeUnknown: event.all_day || false,
};
}
Step 2.3: Create Test Script
Create scripts/scrapers/test-yoursource.ts:
import 'dotenv/config';
import * as fs from 'fs';
import * as path from 'path';
const DEBUG_DIR = path.join(process.cwd(), 'debug-scraper-yoursource');
if (!fs.existsSync(DEBUG_DIR)) {
fs.mkdirSync(DEBUG_DIR, { recursive: true });
}
async function main() {
console.log('='.repeat(60));
console.log('SCRAPER TEST - YourSource');
console.log('='.repeat(60));
// Import scraper
// Path is relative to scripts/scrapers/, so go up two levels to reach lib/
const { scrapeYourSource } = await import('../../lib/scrapers/yoursource');
// Run scraper
const startTime = Date.now();
const events = await scrapeYourSource();
const duration = Date.now() - startTime;
// Save results
fs.writeFileSync(
path.join(DEBUG_DIR, 'events.json'),
JSON.stringify(events, null, 2)
);
// Display summary
console.log(`\nCompleted in ${(duration / 1000).toFixed(1)}s`);
console.log(`Found ${events.length} events`);
// Field completeness
const withImages = events.filter(e => e.imageUrl).length;
const withPrices = events.filter(e => e.price && e.price !== 'Unknown').length;
const withZips = events.filter(e => e.zip).length;
// Guard against division by zero when the scraper returns no events
const pct = (n: number) => (events.length ? Math.round((n / events.length) * 100) : 0);
console.log(`\nField Completeness:`);
console.log(` Images: ${withImages}/${events.length} (${pct(withImages)}%)`);
console.log(` Prices: ${withPrices}/${events.length} (${pct(withPrices)}%)`);
console.log(` Zips: ${withZips}/${events.length} (${pct(withZips)}%)`);
// Sample events with timezone verification
console.log(`\nSample Events (verify timezone!):`);
for (const e of events.slice(0, 5)) {
console.log(`\n${e.title}`);
console.log(` UTC: ${e.startDate.toISOString()}`);
console.log(` Eastern: ${e.startDate.toLocaleString('en-US', { timeZone: 'America/New_York' })}`);
console.log(` Location: ${e.location || 'N/A'}`);
console.log(` Price: ${e.price}`);
}
console.log(`\nDebug files saved to: ${DEBUG_DIR}`);
}
main().catch((error) => {
console.error(error);
process.exitCode = 1; // Report failure to npm and CI instead of exiting 0
});
Step 2.4: Add to package.json
"test:yoursource": "npx tsx scripts/scrapers/test-yoursource.ts"
PHASE 3: VALIDATION
Run the test script and verify output:
npm run test:yoursource
Validation Checklist
- Timezone correct: Eastern times match expected (7 PM event shows as 7 PM ET)
- No HTML entities: Titles/locations decoded (& not &amp;)
- No duplicate cities: Location format is clean
- Prices reasonable: Mix of Free, $X, Unknown
- Zip codes populated: Most events have zips
- URLs unique: No duplicates
- Future events only: No past dates
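Several items on this checklist can be checked by code before reviewing the output by eye. The following sketch is illustrative only: `findIssues` and `CheckableEvent` are hypothetical names, the entity regex is a heuristic, and the duplicate-city check merely looks for the same comma-separated part twice in a row.

```typescript
// Minimal event shape needed for the automated checks below.
interface CheckableEvent {
  title: string;
  location?: string;
  url: string;
  startDate: Date;
}

// Returns human-readable issues; an empty array means the automated checks passed.
function findIssues(events: CheckableEvent[], now: Date = new Date()): string[] {
  const issues: string[] = [];
  const seenUrls = new Set<string>();
  // Heuristic for undecoded entities such as &amp; or &#8211;
  const entity = /&(?:[a-z]+|#\d+|#x[0-9a-f]+);/i;

  for (const e of events) {
    if (seenUrls.has(e.url)) issues.push(`duplicate url: ${e.url}`);
    seenUrls.add(e.url);

    if (e.startDate.getTime() < now.getTime()) issues.push(`past date: ${e.title}`);

    const location = e.location ?? '';
    if (entity.test(e.title) || entity.test(location)) {
      issues.push(`undecoded entity: ${e.title}`);
    }

    // "Venue, Fairview, Fairview, NC" has the same part twice in a row.
    const parts = location.split(',').map((p) => p.trim().toLowerCase());
    if (parts.some((p, i) => i > 0 && p !== '' && p === parts[i - 1])) {
      issues.push(`repeated location part: ${e.title}`);
    }
  }
  return issues;
}
```

Timezone correctness and price plausibility still need a manual look, since they depend on knowing what the source site displays.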
PHASE 4: DATABASE TESTING
⚠️ MANDATORY: You MUST Complete This Phase
DO NOT declare production-ready until you have inserted test events into the real database and verified they display correctly.
Scraper output validation alone is NOT sufficient. Database insertion can reveal:
- Timezone conversion issues
- Field truncation
- Constraint violations
- Display problems
Step 4.1: Insert Test Events
// scripts/scrapers/test-yoursource-db.ts
import 'dotenv/config';
// Paths are relative to scripts/scrapers/, so go up two levels to reach lib/
import { db } from '../../lib/db';
import { events } from '../../lib/db/schema';
import { eq } from 'drizzle-orm';
import { scrapeYourSource } from '../../lib/scrapers/yoursource';
async function main() {
// Check existing
const existing = await db.select().from(events).where(eq(events.source, 'YOUR_SOURCE'));
console.log(`Existing YOUR_SOURCE events: ${existing.length}`);
// Scrape a few events
const scraped = await scrapeYourSource();
const testEvents = scraped.slice(0, 5);
// Insert
for (const event of testEvents) {
await db.insert(events).values({
...event,
tags: [],
lastSeenAt: new Date(),
}).onConflictDoUpdate({
target: events.url,
set: { lastSeenAt: new Date() },
});
console.log(`Inserted: ${event.title}`);
}
// Verify - THIS IS THE CRITICAL CHECK
console.log('\n=== VERIFICATION ===\n');
const inserted = await db.select().from(events).where(eq(events.source, 'YOUR_SOURCE'));
for (const e of inserted) {
console.log(`${e.title}`);
console.log(` DB Date: ${e.startDate}`);
console.log(` Eastern: ${e.startDate.toLocaleString('en-US', { timeZone: 'America/New_York' })}`);
console.log(` Location: ${e.location}`);
console.log(` Zip: ${e.zip}`);
console.log(` Price: ${e.price}`);
console.log('');
}
console.log('To cleanup: DELETE FROM events WHERE source = \'YOUR_SOURCE\';');
}
main().catch((error) => {
console.error(error);
process.exitCode = 1; // Report failure to npm and CI instead of exiting 0
});
Step 4.2: Verify Checklist
- Events inserted without errors
- Dates display correctly in Eastern time
- All fields populated as expected
- No HTML entities in text
- Zip codes present
Step 4.3: Cleanup Test Data
npx tsx -e "
import 'dotenv/config';
import { db } from './lib/db';
import { events } from './lib/db/schema';
import { eq } from 'drizzle-orm';
db.delete(events).where(eq(events.source, 'YOUR_SOURCE')).then(() => console.log('Cleaned up'));
"
PHASE 5: PRODUCTION INTEGRATION
Step 5.1: Update Cron Route
Edit app/api/cron/scrape/route.ts:
// Add import
import { scrapeYourSource } from '@/lib/scrapers/yoursource';
// Add to Promise.allSettled array
const [..., yourSourceResult] = await Promise.allSettled([
...,
scrapeYourSource(),
]);
// Extract results
const yourSourceEvents = yourSourceResult.status === 'fulfilled' ? yourSourceResult.value : [];
// Log failures
if (yourSourceResult.status === 'rejected')
console.error('[Scrape] YourSource failed:', yourSourceResult.reason);
// Add to stats
stats.scraping.total = ... + yourSourceEvents.length;
// Add to allEvents
const allEvents = [..., ...yourSourceEvents];
// Update log message
console.log(`... YourSource: ${yourSourceEvents.length} ...`);
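The point of Promise.allSettled in the cron route is that one failing scraper should not discard the results of the others. Here is a self-contained sketch of that pattern with stand-in scrapers; `collectAll` and the `Ev` type are hypothetical and do not mirror the real route's variable names.

```typescript
type Ev = { title: string };

// Run every scraper, keep results from the ones that succeed, and record the rest.
async function collectAll(
  scrapers: Record<string, () => Promise<Ev[]>>,
): Promise<{ events: Ev[]; failed: string[] }> {
  const names = Object.keys(scrapers);
  // Promise.resolve().then(...) also captures scrapers that throw synchronously.
  const results = await Promise.allSettled(
    names.map((name) => Promise.resolve().then(() => scrapers[name]())),
  );

  const events: Ev[] = [];
  const failed: string[] = [];
  results.forEach((result, i) => {
    if (result.status === 'fulfilled') {
      events.push(...result.value);
    } else {
      failed.push(names[i]);
      console.error(`[Scrape] ${names[i]} failed:`, result.reason);
    }
  });
  return { events, failed };
}
```

The real route destructures each result by position instead of looping, but the fulfilled/rejected handling is the same idea.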
Step 5.2: Verify TypeScript Compiles
npx tsc --noEmit
PHASE 6: CLEANUP
# Remove debug folder
rm -rf debug-scraper-yoursource
# Remove test DB script if created
rm scripts/scrapers/test-yoursource-db.ts
Integration Checklist
Exploration
- Detected CMS/platform
- Tried known API endpoints
- Tested API parameters (start_date, per_page, page)
- Documented field mapping
- Identified timezone handling approach
Development
- Added source to types.ts
- Created scraper file
- Created test script
- Added npm script
- Added source to
Validation
- Timezone verified (Eastern times correct)
- HTML entities decoded
- Location strings clean (no duplicates)
- Field completeness acceptable
Database Testing (MANDATORY)
- Inserted test events
- Verified dates in database
- Confirmed all fields correct
- Cleaned up test data
Production
- Added to cron route
- TypeScript compiles
- Ready for deployment
Common Utilities Reference
Timezone
import { getTodayStringEastern, parseAsEastern } from '@/lib/utils/timezone';
// Get today's date in Eastern (for API start_date param)
const today = getTodayStringEastern(); // "2025-12-16"
// Parse ambiguous local time as Eastern
const date = parseAsEastern('2025-12-25', '19:00:00');
Price Formatting
import { formatPrice } from '@/lib/utils/parsers';
formatPrice(0); // "Free"
formatPrice(25.50); // "$26"
formatPrice(null); // "Unknown"
HTML Entities
import { decodeHtmlEntities } from '@/lib/utils/parsers';
decodeHtmlEntities('Rock &amp; Roll &#8211; Live');
// "Rock & Roll – Live"
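When a field carries markup as well as entities (for example a description containing <br> or <p> tags), tags need stripping before entities are decoded. The sketch below is a standalone illustration and not the project's `decodeHtmlEntities`: `cleanText` is a hypothetical name and it only knows a handful of named entities.

```typescript
// Strip simple HTML tags, then decode numeric entities and a few named ones.
function cleanText(raw: string): string {
  const named: Record<string, string> = {
    amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: ' ',
  };
  // Ignore code points outside the valid Unicode range instead of throwing.
  const fromCode = (n: number): string =>
    Number.isInteger(n) && n >= 0 && n <= 0x10ffff ? String.fromCodePoint(n) : '';

  return raw
    .replace(/<br\s*\/?>|<\/p>/gi, ' ') // line-break-like tags become spaces
    .replace(/<[^>]+>/g, '')            // drop any remaining tags
    .replace(/&#(\d+);/g, (_m, dec) => fromCode(Number(dec)))
    .replace(/&#x([0-9a-f]+);/gi, (_m, hex) => fromCode(parseInt(hex, 16)))
    .replace(/&([a-z]+);/gi, (m, name) => named[String(name).toLowerCase()] ?? m)
    .replace(/\s+/g, ' ')
    .trim();
}
```

Unknown named entities are left untouched, so anything the small table misses stays visible in the output rather than silently disappearing.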
Location Filtering
import { isNonNCEvent } from '@/lib/utils/geo';
// Returns true if event should be EXCLUDED (not in NC)
if (isNonNCEvent(event.title, event.location)) continue;
Zip Code Fallbacks
import { getZipFromCoords, getZipFromCity } from '@/lib/utils/geo';
let zip = venue.zip || getZipFromCoords(lat, lng) || getZipFromCity(city);
Troubleshooting
API Returns 403/429
- Add realistic headers (User-Agent, Accept, Referer)
- Increase delays between requests (200-500ms)
- Some APIs require a Referer header matching the site
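For transient 429 responses, retrying with growing delays often helps more than a fixed pause. The project's `fetchWithRetry` takes `maxRetries` and `baseDelay` options, but its implementation is not shown here, so the following is a generic sketch of exponential backoff with a hypothetical `withRetry` name, not a description of that utility.

```typescript
// Retry an async operation, doubling the wait after each failed attempt.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break; // out of retries
      const delay = baseDelayMs * 2 ** attempt; // 1x, 2x, 4x, ...
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

A 403, unlike a 429, usually points at missing headers rather than rate limiting, so retrying without changing the request is unlikely to help.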
Dates Off by Hours
- Check Timezone Decision Tree above
- Verify API returns UTC vs local time
- Compare API local time with your parsed Eastern time
Duplicate Events
- Ensure url is unique per event
- For recurring events, append date to URL: ${url}#${date}
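A small helper keeps the fragment format consistent across occurrences. This sketch assumes the UTC calendar date is an acceptable discriminator; `occurrenceUrl` is a hypothetical name, and a series with several occurrences on the same UTC day would need the time included as well.

```typescript
// Build a per-occurrence URL by appending the UTC date as a fragment.
function occurrenceUrl(url: string, startDate: Date): string {
  const day = startDate.toISOString().slice(0, 10); // "YYYY-MM-DD" in UTC
  const base = url.split('#')[0]; // drop any existing fragment first
  return `${base}#${day}`;
}
```

Note that an evening event in Eastern time falls on the next UTC day, so the fragment is an identifier and should not be shown to users as the event date.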
Missing Events
- Check pagination (off-by-one errors)
- Verify start_date parameter format
- API may have max page limit
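Off-by-one pagination bugs are easiest to spot when the loop is separated from the HTTP call and exercised with a fake page source. The sketch below assumes 1-based page numbers and a total page count in each response; `fetchAllPages` and `Page` are hypothetical names.

```typescript
interface Page<T> {
  items: T[];
  totalPages: number;
}

// Fetch pages 1..totalPages (inclusive), stopping early on an empty page or at maxPages.
async function fetchAllPages<T>(
  fetchPage: (page: number) => Promise<Page<T>>,
  maxPages = 40,
): Promise<T[]> {
  const all: T[] = [];
  for (let page = 1; page <= maxPages; page++) {
    const { items, totalPages } = await fetchPage(page);
    all.push(...items);
    // The last page has already been collected above, so stopping here loses nothing.
    if (page >= totalPages || items.length === 0) break;
  }
  return all;
}
```

If the total reported by the API is larger than the number of items collected, the usual suspects are a loop that stops before the last page or a server-side cap on page size.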
HTML in Titles/Locations
- Apply decodeHtmlEntities() to ALL text fields
- Check for <br>, <p> tags that need stripping
Duplicate City in Location
- Check if city already in address before appending
- Common with APIs that include full address + separate city field