| name | comp-scout-scrape |
| description | Scrape competition websites, extract structured data, and auto-persist to GitHub issues. Creates issues for new competitions, adds comments for duplicates. |
Competition Scraper
Scrape creative writing competitions from Australian aggregator sites and automatically persist to GitHub.
What This Skill Does
- Scrapes competitions.com.au and netrewards.com.au
- Extracts structured data (dates, prompts, prizes)
- Checks for duplicates against existing GitHub issues (by URL and title similarity)
- Creates issues for NEW competitions only
- Adds comments to existing issues when same competition found on another site
- Skips competitions that are already tracked
The scraper already filters out sponsored/lottery ads. Your job is to check for duplicates, then persist only new competitions.
What Counts as "New"
A competition is NEW if:
- Its URL is not found in any existing issue body (check the full body text, not just the primary URL field)
- AND its normalized title is <80% similar to all existing issue titles
A competition is a DUPLICATE if:
- Its URL appears anywhere in an existing issue (body text, comments) → already tracked, skip
- Its normalized title is >80% similar to an existing issue title → likely same competition, skip
- Same competition found on a different aggregator site → add comment to existing issue noting the alternate URL
Note: An issue body may contain multiple URLs (one per aggregator site). When checking for duplicates, search the entire issue body for the scraped URL, not just a specific field.
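A minimal Python sketch of this check, assuming issues fetched via the gh command in Step 3 and the listing fields described under Output Fields (the classify helper name, the issues list, and the use of difflib for the 80% threshold are illustrative assumptions, not the scraper's actual implementation):
from difflib import SequenceMatcher

def classify(comp, issues):
    """Return ("duplicate", matching_issue) or ("new", None) for one scraped competition."""
    for issue in issues:
        if comp["url"] in issue["body"]:
            # URL anywhere in the issue body = already tracked
            return "duplicate", issue
        # Issue titles should ideally be normalized with the same rules as
        # normalized_title (see Title Normalization below); lowercasing alone
        # is a simplification in this sketch.
        similarity = SequenceMatcher(None, comp["normalized_title"], issue["title"].lower()).ratio()
        if similarity > 0.8:
            return "duplicate", issue
    return "new", None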
Word Limit Clarification
"25WOL" is a category name, NOT a filter. Competitions with 25, 50, or 100 word limits are all valid creative writing competitions - persist them all (if new).
Prerequisites
pip install playwright
playwright install chromium
Also requires:
- gh CLI authenticated
- Target repository for competition data (not this skills repo)
Workflow
Step 1: Determine Target Repository
The target repo stores competition issues. Specify or get from config:
# From workspace config (if hiivmind-pulse-gh initialized)
TARGET_REPO=$(yq '.repositories[0].full_name' .hiivmind/github/config.yaml 2>/dev/null)
# Or use default/specified
TARGET_REPO="${TARGET_REPO:-discreteds/competition-data}"
Step 2: Scrape Listings
Run the scraper to get structured competition data:
python skills/comp-scout-scrape/scraper.py listings
Output:
{
  "competitions": [
    {
      "url": "https://competitions.com.au/win-example/",
      "site": "competitions.com.au",
      "title": "Win a $500 Gift Card",
      "normalized_title": "500 gift card",
      "brand": "Example Brand",
      "prize_summary": "$500",
      "prize_value": 500,
      "closing_date": "2024-12-31"
    }
  ],
  "scrape_date": "2024-12-09",
  "errors": []
}
Step 3: Check for Existing Issues
For each scraped competition, check if it already exists:
# Get all open competition issues
gh issue list -R "$TARGET_REPO" \
--label "competition" \
--state open \
--json number,title,body \
--limit 200
Match by:
- URL in issue body (exact match = definite duplicate)
- Normalized title similarity (>80% = likely duplicate)
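A hedged sketch of feeding that gh output into the duplicate check (classify is the helper sketched under "What Counts as New"; target_repo and scrape_result are assumed to hold the repo name and the Step 2 output):
import json
import subprocess

# Fetch open competition issues with the same flags as the command above.
raw = subprocess.run(
    ["gh", "issue", "list", "-R", target_repo, "--label", "competition",
     "--state", "open", "--json", "number,title,body", "--limit", "200"],
    capture_output=True, text=True, check=True,
).stdout
issues = json.loads(raw)

new_comps, duplicates = [], []
for comp in scrape_result["competitions"]:
    status, issue = classify(comp, issues)
    if status == "new":
        new_comps.append(comp)
    else:
        duplicates.append((comp, issue))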
Step 4: Fetch Details for New Competitions
For competitions not already tracked, get full details:
python skills/comp-scout-scrape/scraper.py detail "https://competitions.com.au/win-example/"
For multiple new competitions, use batch mode:
echo '{"urls": ["url1", "url2", ...]}' | python skills/comp-scout-scrape/scraper.py details-batch
Step 4.5: Apply Auto-Tagging Rules (NOT Filtering)
IMPORTANT: Auto-tagging is for LABELING issues, not for skipping/excluding competitions.
Check competitions against user preferences from the data repo's CLAUDE.md to determine which labels to apply.
- Fetch preferences:
gh api repos/$TARGET_REPO/contents/CLAUDE.md -H "Accept: application/vnd.github.raw" 2>/dev/null
Parse the Detection Keywords section for tagging rules
For each competition, check if title/prize matches any keywords:
for tag_rule in tag_rules:    # e.g., the for-kids and cruise rules from CLAUDE.md
    for keyword in tag_rule["keywords"]:
        if keyword.lower() in (competition["title"] + " " + competition["prize_summary"]).lower():
            issue_labels.append(tag_rule["label"])
- ALL competitions are ALWAYS persisted as issues. Tagged competitions:
  - Get the relevant label applied (e.g., for-kids, cruise)
  - Are closed immediately with an explanation comment
  - But they ARE STILL CREATED as issues (for record-keeping and potential review)
Step 5: Auto-Persist Results
For New Competitions → Create Issue
gh issue create -R "$TARGET_REPO" \
--title "$TITLE" \
--label "competition" \
--label "25wol" \
--body "$(cat <<'EOF'
## Competition Details
**URL:** {url}
**Brand:** {brand}
**Prize:** {prize_summary}
**Word Limit:** {word_limit} words
**Closes:** {closing_date}
**Draw Date:** {draw_date}
**Winners Notified:** {notification_info}
## Prompt
> {prompt}
---
*Scraped from {site} on {scrape_date}*
EOF
)"
Then set milestone by closing month:
gh issue edit $ISSUE_NUMBER -R "$TARGET_REPO" --milestone "December 2024"
For Duplicates → Add Comment
If competition URL found on another site:
gh issue comment $EXISTING_ISSUE -R "$TARGET_REPO" --body "$(cat <<'EOF'
### Also found on {other_site}
**URL:** {url}
**Title on this site:** {title}
*Discovered: {date}*
EOF
)"
For Filtered Competitions → Create Issue + Close
If competition matched auto-filter keywords:
# Create the issue first (for record-keeping)
ISSUE_URL=$(gh issue create -R "$TARGET_REPO" \
--title "$TITLE" \
--label "competition" \
--label "25wol" \
--label "$FILTER_LABEL" \
--body "...")
# Extract issue number
ISSUE_NUMBER=$(echo "$ISSUE_URL" | grep -oE '[0-9]+$')
# Close with explanation (heredoc left unquoted so $KEYWORD and $FILTER_RULE expand)
gh issue close $ISSUE_NUMBER -R "$TARGET_REPO" --comment "$(cat <<EOF
Auto-filtered: matches '$KEYWORD' in $FILTER_RULE preferences.
See CLAUDE.md in this repository for filter settings.
EOF
)"
Step 6: Report Results
Present confirmation to user:
✅ Scrape complete!
**Created 3 new issues:**
- #42: Win a $500 Coles Gift Card (closes Dec 31)
- #43: Win a Trip to Bali (closes Jan 15)
- #44: Win a Year's Supply of Coffee (closes Dec 20)
**Auto-filtered 2 (created + closed):**
- #45: Win Lego Set (for-kids: matched "Lego")
- #46: Win P&O Cruise (cruise: matched "P&O")
**Found 2 duplicates (added as comments):**
- #38: Win Woolworths Gift Cards (also on netrewards.com.au)
- #39: Win Dreamworld Experience (also on netrewards.com.au)
**Skipped 7 already tracked**
IMPORTANT: Do NOT ask "Would you like me to analyze these?" at the end. When invoked by comp-scout-daily, the workflow will automatically invoke analyze/compose skills next. Report results and stop.
Output Fields
Listing Output
| Field | Type | Description |
|---|---|---|
| url | string | Full URL to competition detail page |
| site | string | Source site (competitions.com.au or netrewards.com.au) |
| title | string | Competition title as displayed |
| normalized_title | string | Lowercase, prefixes stripped, for matching |
| brand | string | Sponsor/brand name (if available) |
| prize_summary | string | Prize description or value badge |
| prize_value | int/null | Numeric value in dollars |
| closing_date | string/null | YYYY-MM-DD format |
Detail Output
All listing fields plus:
| Field | Type | Description |
|---|---|---|
| prompt | string | The actual competition question/prompt |
| word_limit | int | Maximum words (default 25) |
| entry_method | string | How to submit entry |
| winner_notification | object/null | Notification details from JSON-LD |
| scraped_at | string | ISO timestamp of scrape |
Winner Notification Object
| Field | Type | Description |
|---|---|---|
| notification_text | string | Raw notification text |
| notification_date | string/null | Specific date if mentioned |
| notification_days | int/null | Days after close/draw |
| selection_text | string | How winners are selected |
| selection_date | string/null | When judging occurs |
Title Normalization
Titles are normalized for deduplication:
- Lowercase
- Strip prefixes: "Win ", "Win a ", "Win an ", "Win the ", "Win 1 of "
- Remove punctuation
- Collapse whitespace
Example:
Original: "Win a $500 Coles Gift Card"
Normalized: "500 coles gift card"
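A minimal sketch of that normalization (regex-based; the helper name normalize_title is illustrative, and the prefix list mirrors the bullets above, longest first so "Win a " is not mistaken for "Win "):
import re

PREFIXES = ["win 1 of ", "win the ", "win an ", "win a ", "win "]

def normalize_title(title):
    t = title.lower()
    for prefix in PREFIXES:
        if t.startswith(prefix):
            t = t[len(prefix):]
            break
    t = re.sub(r"[^\w\s]", "", t)            # remove punctuation
    return re.sub(r"\s+", " ", t).strip()    # collapse whitespace

# normalize_title("Win a $500 Coles Gift Card") -> "500 coles gift card"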
Example Session
User: Scrape competitions
Claude: I'll scrape competitions and persist new ones to GitHub.
[Runs: python skills/comp-scout-scrape/scraper.py listings]
Found 12 competitions from both sites.
[Runs: gh issue list -R discreteds/competition-data --label competition --json number,title,body]
Checking against 45 existing issues...
- 3 are new
- 2 are duplicates (same competition, different source)
- 7 already tracked
Fetching details for 3 new competitions...
[Creates issues and adds comments]
✅ Scrape complete!
**Created 3 new issues:**
- #46: Win a $500 Coles Gift Card (closes Dec 31)
- Milestone: December 2024
- #47: Win a Trip to Bali (closes Jan 15)
- Milestone: January 2025
- #48: Win a Year's Supply of Coffee (closes Dec 20)
- Milestone: December 2024
**Added 2 duplicate comments:**
- #38: Also found on netrewards.com.au
- #39: Also found on netrewards.com.au
CLI Commands Reference
# Scrape all listing pages
python skills/comp-scout-scrape/scraper.py listings
# Get full details for one competition
python skills/comp-scout-scrape/scraper.py detail "URL"
# Get full details for multiple competitions (batch mode)
echo '{"urls": ["url1", "url2"]}' | python skills/comp-scout-scrape/scraper.py details-batch
# Debug: just get URLs
python skills/comp-scout-scrape/scraper.py urls
Batch Details Output
{
  "details": [
    {
      "url": "...",
      "title": "...",
      "prompt": "Tell us in 25 words...",
      "word_limit": 25,
      ...
    }
  ],
  "scrape_date": "2024-12-09",
  "errors": []
}
Persistence Details
This skill handles all GitHub persistence. The separate comp-scout-persist skill is deprecated; its functionality is merged here.
Issue Creation Template
## Competition Details
**URL:** {url}
**Brand:** {brand}
**Prize:** {prize_summary}
**Word Limit:** {word_limit} words
**Closes:** {closing_date}
**Draw Date:** {draw_date}
**Winners Notified:** {notification_info}
## Prompt
> {prompt}
---
*Scraped from {site} on {scrape_date}*
Labels
| Label | Description | Auto-applied |
|---|---|---|
| competition | All competition issues | Always |
| 25wol | 25 words or less type | Always |
| for-kids | Auto-filtered (kids competitions) | When keyword matches |
| cruise | Auto-filtered (cruise competitions) | When keyword matches |
| closing-soon | Closes within 3 days | By separate check |
| entry-drafted | Entry has been composed | By comp-scout-compose |
| entry-submitted | Entry has been submitted | Manually |
Milestones
Issues are assigned to milestones by closing date month:
- "December 2024"
- "January 2025"
- etc.
# Create milestone if needed
gh api repos/$TARGET_REPO/milestones \
--method POST \
--field title="$MONTH_YEAR" \
--field due_on="$LAST_DAY_OF_MONTH"
# Assign to issue
gh issue edit $ISSUE_NUMBER -R "$TARGET_REPO" --milestone "$MONTH_YEAR"
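A sketch of deriving both values from a competition's closing_date, assuming the YYYY-MM-DD format from the scraper output (milestone_for is an illustrative helper, not part of scraper.py):
import calendar
from datetime import date

def milestone_for(closing_date):
    """Return (milestone title, due_on timestamp) for a YYYY-MM-DD closing date."""
    d = date.fromisoformat(closing_date)
    last_day = calendar.monthrange(d.year, d.month)[1]
    title = d.strftime("%B %Y")    # e.g. "December 2024"
    due_on = f"{d.year:04d}-{d.month:02d}-{last_day:02d}T23:59:59Z"
    return title, due_on

# milestone_for("2024-12-31") -> ("December 2024", "2024-12-31T23:59:59Z")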
Duplicate Comment Template
### Also found on {other_site}
**URL:** {url}
**Title on this site:** {title}
*Discovered: {date}*
Filtered Issue Handling
When a competition matches filter keywords:
- Issue is created (for record-keeping)
- Filter label is applied (e.g., for-kids)
- Issue is immediately closed with an explanation
gh issue close $ISSUE_NUMBER -R "$TARGET_REPO" \
--comment "Auto-filtered: matches '$KEYWORD' in $FILTER_RULE preferences."
Integration
This skill is invoked by comp-scout-daily as the first step in the workflow.
After scraping, you can:
- Use comp-scout-analyze to generate entry strategies
- Use comp-scout-compose to write actual entries
- Both will auto-persist their results as comments on the issue