---
name: benchmark-fetcher
description: Fetch benchmark performance data from 6 leaderboard websites using Chrome DevTools MCP and update model manifests with the latest scores. Supports the SWE-bench, TerminalBench, SciCode, LiveCodeBench, MMMU, MMMU Pro, and WebDevArena benchmarks.
---
# Benchmark Fetcher Skill

Automates fetching benchmark performance data from leaderboard websites and updates model manifests with the latest scores using browser automation.
## Overview

This skill extends benchmark data collection by automating visits to 6 major AI model leaderboard websites, extracting performance scores, and updating the model manifests in `manifests/models/` with the latest benchmark data.

**Key Features:**

- **Automated Data Collection**: Uses Chrome DevTools MCP to visit and extract data from 6 leaderboard websites
- **Intelligent Model Mapping**: Maps website model names to manifest IDs using configurable mappings
- **Always Overwrite**: Updates manifests with the latest benchmark values
- **Error Resilient**: Retry logic with exponential backoff and graceful degradation
- **Comprehensive Reporting**: Detailed completion reports with unmapped models and update statistics
## Supported Benchmarks

| Benchmark | Website | Manifest Field | Format |
|---|---|---|---|
| SWE-bench | https://www.swebench.com | `sweBench` | Percentage (0-100) |
| TerminalBench | https://www.tbench.ai/leaderboard/terminal-bench/2.0 | `terminalBench` | Decimal (0-1) |
| MMMU | https://mmmu-benchmark.github.io/#leaderboard | `mmmu`, `mmmuPro` | Percentage (0-100) |
| SciCode | https://scicode-bench.github.io/leaderboard/ | `sciCode` | Percentage (0-100) |
| LiveCodeBench | https://livecodebench.github.io/leaderboard.html | `liveCodeBench` | Percentage (0-100) |
| WebDevArena | https://web.lmarena.ai/leaderboard | `webDevArena` | Percentage (0-100) |

**Note**: TerminalBench uses a decimal format (0-1 scale), while all other benchmarks use a percentage format (0-100 scale).
## Usage

### Fetch All Benchmarks
Update all model manifests with latest benchmark data from all 6 websites:
```bash
node .claude/skills/benchmark-fetcher/scripts/fetch-benchmarks.mjs
```
### Fetch Specific Benchmarks
Update only specific benchmarks:
```bash
# Fetch only SWE-bench and TerminalBench
node .claude/skills/benchmark-fetcher/scripts/fetch-benchmarks.mjs --benchmarks swebench,terminalBench

# Fetch only LiveCodeBench
node .claude/skills/benchmark-fetcher/scripts/fetch-benchmarks.mjs --benchmarks liveCodeBench
```
### Fetch for Specific Models
Update benchmarks for specific models only:
```bash
# Update only Claude Sonnet 4.5 and GPT-4o
node .claude/skills/benchmark-fetcher/scripts/fetch-benchmarks.mjs --models claude-sonnet-4-5,gpt-4o
```
### Dry Run Mode
Preview what would be updated without actually modifying manifests:
```bash
node .claude/skills/benchmark-fetcher/scripts/fetch-benchmarks.mjs --dry-run
```
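Internally, the flags above have to be turned into a fetch plan before any website is visited. A minimal sketch of how that parsing might look with Node's built-in `parseArgs` (the flag names match the documented CLI; the structure is illustrative, not the actual script):

```javascript
// Illustrative sketch of parsing the documented CLI flags (not the actual script).
import { parseArgs } from 'node:util';

const { values } = parseArgs({
  options: {
    benchmarks: { type: 'string' },          // e.g. "swebench,terminalBench"
    models: { type: 'string' },              // e.g. "claude-sonnet-4-5,gpt-4o"
    'dry-run': { type: 'boolean', default: false },
  },
});

const benchmarks = values.benchmarks ? values.benchmarks.split(',') : null; // null = all benchmarks
const models = values.models ? values.models.split(',') : null;             // null = all models
const dryRun = values['dry-run'];

console.log({ benchmarks, models, dryRun });
```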
## Model Name Mapping

### How Mapping Works

Each benchmark website uses different naming conventions for models. The `references/model-name-mappings.json` file maps website-specific model names to manifest IDs.
Example mapping:
```json
{
  "swebench": {
    "websiteModels": {
      "Claude Sonnet 4.5": "claude-sonnet-4-5",
      "GPT-4o": "gpt-4o",
      "Gemini 2.5 Pro": "gemini-2-5-pro"
    }
  }
}
```
### Mapping Strategy
The mapper uses a 3-tier fallback strategy:
1. **Exact match** (case-sensitive): `"Claude Sonnet 4.5"` → `claude-sonnet-4-5`
2. **Case-insensitive match**: `"claude sonnet 4.5"` → `claude-sonnet-4-5`
3. **Fuzzy match** (normalized): `"Claude-Sonnet-4.5"` → `claude-sonnet-4-5`

**Normalization**: Removes spaces, hyphens, and special characters for fuzzy matching.
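A minimal sketch of this fallback chain (the `websiteModels` shape comes from the mapping file above; the function itself is illustrative, not the skill's actual mapper):

```javascript
// Sketch of the 3-tier lookup described above; names and structure are illustrative.
const normalize = (name) => name.toLowerCase().replace(/[^a-z0-9]/g, '');

function mapToManifestId(websiteName, websiteModels) {
  // Tier 1: exact, case-sensitive match
  if (websiteModels[websiteName]) return websiteModels[websiteName];

  // Tier 2: case-insensitive match
  const ci = Object.keys(websiteModels)
    .find((key) => key.toLowerCase() === websiteName.toLowerCase());
  if (ci) return websiteModels[ci];

  // Tier 3: fuzzy match on normalized names (spaces, hyphens, punctuation stripped)
  const fuzzy = Object.keys(websiteModels)
    .find((key) => normalize(key) === normalize(websiteName));
  return fuzzy ? websiteModels[fuzzy] : null; // null => reported as unmapped
}

// "Claude-Sonnet-4.5" falls through to tier 3 and resolves to "claude-sonnet-4-5"
```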
### Adding New Mappings

When the script reports unmapped models, add them to `references/model-name-mappings.json`:
```json
{
  "swebench": {
    "websiteModels": {
      "New Model Name": "new-model-id"
    }
  }
}
```
## Data Extraction Process

### High-Level Workflow
1. **Load Configuration**: Read mappings, load all model manifests
2. **Initialize Browser**: Start a Chrome DevTools MCP browser instance
3. **Visit Websites**: Sequentially visit each benchmark website
4. **Extract Data**: Parse leaderboard tables from page snapshots
5. **Map Models**: Match website model names to manifest IDs
6. **Update Manifests**: Overwrite benchmark values in manifest files
7. **Generate Report**: Show updates, failures, and unmapped models
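A compressed sketch of that workflow as a single loop (assuming hypothetical helpers `extractWithRetry`, `mapToManifestId`, and `updateManifest`; this is illustrative, not the skill's actual code):

```javascript
// High-level orchestration sketch; helper names are placeholders, not the skill's exports.
async function fetchAllBenchmarks({ benchmarks, manifests, mappings, dryRun }) {
  const report = { updated: [], failed: [], unmapped: [] };

  for (const benchmark of benchmarks) {
    try {
      const scores = await extractWithRetry(benchmark);            // visit site, parse snapshot
      for (const [websiteName, score] of scores) {
        const id = mapToManifestId(websiteName, mappings[benchmark.key]?.websiteModels ?? {});
        if (!id) { report.unmapped.push({ benchmark: benchmark.key, websiteName }); continue; }
        if (!dryRun) await updateManifest(manifests[id], benchmark.field, score); // overwrite
        report.updated.push({ id, field: benchmark.field, score });
      }
    } catch (error) {
      report.failed.push({ benchmark: benchmark.key, reason: error.message });   // keep going
    }
  }
  return report;
}
```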
### Website-Specific Extractors

Each benchmark has a dedicated extractor function in `scripts/lib/benchmark-extractors.mjs`:

- `extractSWEBench()` - Extracts SWE-bench Verified scores
- `extractTerminalBench()` - Extracts TerminalBench 2.0 accuracy (decimal format)
- `extractMMMU()` - Extracts both MMMU and MMMU Pro scores
- `extractSciCode()` - Extracts SciCode benchmark scores
- `extractLiveCodeBench()` - Extracts LiveCodeBench Pass@1 scores
- `extractWebDevArena()` - Extracts WebDevArena scores
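Each extractor ultimately pulls "model name + score" rows out of a page snapshot. A generic sketch of that parsing step (the regex and function name are illustrative; the real extractors are site-specific):

```javascript
// Illustrative extractor shape: parse "<model name>  <score>%" rows out of a
// snapshot's text. The real extractors in benchmark-extractors.mjs are site-specific.
function extractScoresFromSnapshot(snapshotText) {
  const scores = new Map();
  for (const line of snapshotText.split('\n')) {
    const match = line.match(/^(.+?)\s+(\d+(?:\.\d+)?)\s*%\s*$/);
    if (match) scores.set(match[1].trim(), Number(match[2]));
  }
  return scores; // Map<websiteModelName, score>
}
```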
### Special Cases

**TerminalBench Format:**
- Website displays percentages (e.g., 42.8%)
- Must store as a decimal: `0.428` (not `42.8`)
- Extractor handles the conversion automatically
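A minimal sketch of that conversion (an illustrative helper, not the extractor's actual code):

```javascript
// Convert the displayed TerminalBench percentage into the 0-1 value stored in manifests.
function toTerminalBenchValue(displayedPercent) {
  // Website shows e.g. "42.8%"; the manifest stores 0.428 on a 0-1 scale.
  return Number((parseFloat(displayedPercent) / 100).toFixed(4));
}

toTerminalBenchValue('42.8%'); // => 0.428
```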
**MMMU Dual Benchmarks:**
- A single website hosts both the MMMU and MMMU Pro leaderboards
- Extractor returns both in one visit: `{ mmmu: Map<manifestId, score>, mmmuPro: Map<manifestId, score> }`
## Update Strategy

### Always Overwrite Policy
The skill uses an always overwrite strategy for benchmark values:
- Existing benchmark values are replaced with latest data from websites
- Null values are populated if found on websites
- Non-null values are updated with latest scores
- No confirmation or comparison - latest data always wins
**Rationale**: Benchmark scores represent the latest model performance, and the leaderboard websites are the authoritative source.
### What Gets Preserved
Only benchmark fields are updated. All other manifest fields are preserved:
- ✅ **Preserved**: `id`, `name`, `description`, `vendor`, `size`, `contextWindow`, etc.
- 🔄 **Updated**: `benchmarks.sweBench`, `benchmarks.terminalBench`, etc.
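A sketch of what that benchmark-only merge looks like in practice (illustrative; it assumes each manifest keeps its scores under a `benchmarks` object, as the field names above suggest):

```javascript
// Only keys under `benchmarks` are overwritten; every other manifest field passes through untouched.
function applyBenchmarkUpdates(manifest, updates) {
  return {
    ...manifest,                                        // id, name, vendor, contextWindow, ...
    benchmarks: { ...manifest.benchmarks, ...updates }, // e.g. { sweBench: 74.4 }
  };
}
```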
### Atomic Updates
Manifests are updated using atomic file writes:
1. Validate the JSON structure
2. Write to a temporary file (`.tmp`)
3. Atomically rename to the target file
4. No partial updates - all or nothing
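A minimal sketch of that write path using Node's `fs/promises` (illustrative, not the skill's actual implementation):

```javascript
// Minimal atomic-write sketch: validate, write to a temp file, then rename over the target.
import { writeFile, rename } from 'node:fs/promises';

async function writeManifestAtomically(path, manifest) {
  const json = JSON.stringify(manifest, null, 2) + '\n';
  JSON.parse(json);                        // validate before touching disk
  const tmpPath = `${path}.tmp`;
  await writeFile(tmpPath, json, 'utf8');  // write to temporary file
  await rename(tmpPath, path);             // atomic replace of the target file
}
```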
## Error Handling

### Retry Logic
Each benchmark extraction uses a 3-attempt retry strategy with exponential backoff:
- Attempt 1: Direct extraction (immediate)
- Attempt 2: Retry after 2 seconds
- Attempt 3: Final retry after 4 seconds

**After 3 failures:**
- Take a debug screenshot (`/tmp/benchmark-{id}-error.png`)
- Log error details
- Skip the benchmark and continue with the others
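A minimal sketch of that retry schedule (illustrative; `benchmark.extract()` stands in for whichever extractor is being run):

```javascript
// 3-attempt retry with exponential backoff: 2s before attempt 2, 4s before attempt 3.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function extractWithRetry(benchmark, attempts = 3) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await benchmark.extract();          // attempt 1 runs immediately
    } catch (error) {
      if (attempt === attempts) throw error;     // caller screenshots, logs, and skips
      await sleep(2000 * 2 ** (attempt - 1));    // 2s after attempt 1, 4s after attempt 2
    }
  }
}
```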
### Error Categories

**Website Access Errors:**
- **Cause**: Site down, network timeout, rate limiting
- **Handling**: Retry 3 times, then skip the benchmark

**Extraction Errors:**
- **Cause**: Page structure changed, data not found
- **Handling**: Screenshot for debugging, skip the benchmark

**Mapping Errors:**
- **Cause**: Model name not in the mapping configuration
- **Handling**: Log the unmapped model, continue with the others

**Manifest Update Errors:**
- **Cause**: File write errors, invalid JSON
- **Handling**: Atomic writes protect against corruption; roll back on error
### Graceful Degradation
The skill continues processing even when errors occur:
- If 1 benchmark fails, others still process
- If 1 model can't be mapped, others still update
- Partial success is better than no success
- Completion report shows exactly what succeeded/failed
## Completion Report

After execution, a detailed report shows:

### Summary Section
```
📊 Benchmark Fetch Report
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✅ Successfully Fetched (5/6 benchmarks)
   ✓ SWE-bench (swebench.com)
   ✓ TerminalBench (tbench.ai)
   ✓ MMMU + MMMU Pro (mmmu-benchmark.github.io)
   ✓ SciCode (scicode-bench.github.io)
   ✓ LiveCodeBench (livecodebench.github.io)

❌ Failed to Fetch (1/6 benchmarks)
   ✗ WebDevArena (web.lmarena.ai)
     Reason: Timeout after 3 retries
```
### Manifest Updates
```
📝 Manifest Updates

✅ Updated: 15 manifests
   • claude-sonnet-4-5: 3 benchmarks updated
     - sweBench: null → 74.4
     - terminalBench: 0.428 → 0.604
     - liveCodeBench: 47.1 → 52.3
   • gpt-4o: 2 benchmarks updated
     - sweBench: 21.62 → 23.5
     - sciCode: 1.5 → 2.1
```
### Unmapped Models
```
⚠️ Unmapped Models (require manual mapping)

SWE-bench:
   • "Qwen-Coder-2.5" → Add to model-name-mappings.json
   • "DeepSeek-Coder-V2" → Add to model-name-mappings.json

Suggestion: Update references/model-name-mappings.json
```
### Statistics
```
📈 Statistics

Total benchmarks fetched: 247 values
Total manifests updated: 15 files
Execution time: 45.2s
Average time per benchmark: 7.5s
```
### Next Steps
```
✅ Complete! Next steps:

1. Review updated manifests in manifests/models/
2. Add unmapped models to references/model-name-mappings.json
3. Retry failed benchmarks if needed
4. Run validation: npm run test:validate
5. Commit changes when satisfied
```
## Tool Integration

### Chrome DevTools MCP
The skill uses Chrome DevTools MCP tools for browser automation:
**Navigation:**

```javascript
await mcp__chrome-devtools__navigate_page({
  url: 'https://www.swebench.com',
  type: 'url'
})
```
**Wait for Content:**

```javascript
await mcp__chrome-devtools__wait_for({
  text: 'Leaderboard'
})
```
**Take Snapshot:**

```javascript
const snapshot = await mcp__chrome-devtools__take_snapshot()
// Parse snapshot.content for leaderboard data
```
**Debug Screenshots:**

```javascript
await mcp__chrome-devtools__take_screenshot({
  filePath: '/tmp/debug-screenshot.png'
})
```
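Putting these calls together, a single benchmark fetch follows roughly this shape (a sketch in the same pseudo-call style as the snippets above; `parseLeaderboard` is a stand-in for the site-specific extractor):

```javascript
// Illustrative composition of the tool calls above for one benchmark fetch.
async function fetchLeaderboard(url, readyText, debugId) {
  try {
    await mcp__chrome-devtools__navigate_page({ url, type: 'url' })
    await mcp__chrome-devtools__wait_for({ text: readyText })       // wait for the table to render
    const snapshot = await mcp__chrome-devtools__take_snapshot()
    return parseLeaderboard(snapshot.content)                       // site-specific extractor
  } catch (error) {
    await mcp__chrome-devtools__take_screenshot({
      filePath: `/tmp/benchmark-${debugId}-error.png`               // debug screenshot on failure
    })
    throw error
  }
}
```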
## Best Practices

### Running the Skill
- Run during off-peak hours to avoid rate limiting
- Review unmapped models and update mappings before the next run
- Validate manifests after updates: `npm run test:validate`
- Check for website changes if extraction fails repeatedly
- Keep mappings updated as new models appear on leaderboards
### Maintaining Mappings

- Check completion reports for unmapped models
- Add mappings immediately after discovering new models
- Use canonical manifest IDs as mapping targets
- Test mappings with the `--models` flag to verify
- Document special cases in mapping file comments
### Troubleshooting

**Extraction fails for a benchmark:**
- Check whether the website structure changed
- Review debug screenshots in `/tmp/`
- Update the extractor logic if needed

**Model not updating:**
- Verify the model exists in `manifests/models/`
- Check the mapping configuration
- Ensure the model appears on the leaderboard website

**TerminalBench shows wrong values:**
- Verify the decimal format (`0.428`, not `42.8`)
- Check the extractor conversion logic
- Validate against the website directly
## Files Modified
After running this skill:
- **Model manifests**: `manifests/models/*.json` - updated with the latest benchmark scores
- **No other files modified**: the skill only updates benchmark fields in manifests
## Validation
Always validate manifests after updates:
```bash
# Run schema validation
npm run test:validate

# Check that every manifest still parses as valid JSON
node -e 'process.argv.slice(1).forEach(f => JSON.parse(require("fs").readFileSync(f, "utf8")))' manifests/models/*.json
```
## Next Steps After Execution
1. **Review updates**: Check that the manifest changes make sense
2. **Update mappings**: Add newly discovered models to `model-name-mappings.json`
3. **Retry failures**: Re-run with `--benchmarks` for the failed benchmarks
4. **Validate**: Run `npm run test:validate` to ensure schema compliance
5. **Commit changes**: Commit the updated manifests to the repository