---
name: AILANG Post-Release Tasks
description: Run automated post-release workflow (eval baselines, dashboard, docs) for AILANG releases. Executes 19-benchmark agent suite + full standard eval with validation and progress reporting. Use when user says "post-release tasks for vX.X.X" or "update dashboard". Fully autonomous with pre-flight checks.
---
# AILANG Post-Release Tasks

Run post-release tasks for an AILANG release: evaluation baselines, dashboard updates, and documentation.

## Quick Start

Most common usage:

```bash
# User says: "Run post-release tasks for v0.3.14"
# This skill will:
# 1. Run eval baseline (ALL 6 PRODUCTION MODELS, both languages) - ALWAYS USE --full FOR RELEASES
# 2. Update website dashboard (markdown + JSON with history)
# 3. Extract metrics for CHANGELOG
# 4. Guide through design doc and public doc updates
```
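Concretely, the three automation scripts are usually run in this order (a sketch; replace 0.3.14 with the actual release version):

```bash
# 1. Full eval baseline (all 6 production models, AILANG + Python)
.claude/skills/post-release/scripts/run_eval_baseline.sh 0.3.14 --full

# 2. Regenerate the website dashboard (markdown + JSON with history)
.claude/skills/post-release/scripts/update_dashboard.sh 0.3.14

# 3. Extract a ready-to-paste CHANGELOG template
.claude/skills/post-release/scripts/extract_changelog_metrics.sh
```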
🚨 **CRITICAL**: For releases, ALWAYS use the `--full` flag by default.

- Dev models (without `--full`) are only for quick testing/validation, NOT releases
- Users expect full benchmark results when they say "post-release" or "update dashboard"
- Never start with dev models and then try to add production models later
## When to Use This Skill
Invoke this skill when:
- User says "post-release tasks", "update dashboard", "run benchmarks"
- After successful release (once GitHub release is published)
- User asks about eval baselines or benchmark results
- User wants to update documentation after a release
## Available Scripts

### `scripts/run_eval_baseline.sh <version> [--full]`

Run evaluation baseline for a release version.

🚨 **CRITICAL**: ALWAYS use `--full` for releases!

Usage:

```bash
# ✅ CORRECT - For releases (ALL 6 production models)
.claude/skills/post-release/scripts/run_eval_baseline.sh 0.3.14 --full
.claude/skills/post-release/scripts/run_eval_baseline.sh v0.3.14 --full

# ❌ WRONG - Only use without --full for quick testing/validation
.claude/skills/post-release/scripts/run_eval_baseline.sh 0.3.14
```

Output:

```
Running eval baseline for 0.3.14...
Mode: FULL (all 6 production models)
Expected cost: ~$0.50-1.00
Expected time: ~15-20 minutes
[Running benchmarks...]
✓ Baseline complete
Results: eval_results/baselines/0.3.14
Files: 264 result files
```
What it does:

- Step 1: Runs `make eval-baseline` (standard 0-shot + repair evaluation)
  - Tests both AILANG and Python implementations
  - Uses all 6 production models (`--full`) or 3 dev models (default)
  - Tests all benchmarks in `benchmarks/` directory
- Step 2: Runs agent eval on curated benchmark suite
  - Current suite (v0.4.0+): 19 benchmarks across 2 tiers
    - Tier 1 (Smoke Tests - 8): fizzbuzz, recursion_factorial, recursion_fibonacci, simple_print, records_person, list_operations, string_manipulation, nested_records
      - Expected: 95-100% success
    - Tier 2 (Differentiators - 11): higher_order_functions, pattern_matching_complex, record_update, effect_composition, effect_tracking_io_fs, effect_pure_separation, exhaustive_pattern_matching, type_safe_record_access, explicit_state_threading, deterministic_list_transform, referential_transparency
      - Expected: 60-80% success (agent outperforms 0-shot)
    - See BENCHMARK_AUDIT_ANALYSIS.md for detailed rationale
  - Uses haiku+sonnet (`--full`) or haiku only (default)
  - Tests both AILANG and Python implementations
- Saves combined results to `eval_results/baselines/X.X.X/`
- Accepts version with or without 'v' prefix
### `scripts/update_dashboard.sh <version>`
Update website benchmark dashboard with new release data.
Usage:
```bash
.claude/skills/post-release/scripts/update_dashboard.sh 0.3.14
```
Output:
```
Updating dashboard for 0.3.14...
1/5 Generating Docusaurus markdown...
✓ Written to docs/docs/benchmarks/performance.md
2/5 Generating dashboard JSON with history...
✓ Written to docs/static/benchmarks/latest.json (history preserved)
3/5 Validating JSON...
✓ Version: 0.3.14
✓ Success rate: 0.627
4/5 Clearing Docusaurus cache...
✓ Cache cleared
5/5 Summary
✓ Dashboard updated for 0.3.14
✓ Markdown: docs/docs/benchmarks/performance.md
✓ JSON: docs/static/benchmarks/latest.json

Next steps:
1. Test locally: cd docs && npm start
2. Visit: http://localhost:3000/ailang/docs/benchmarks/performance
3. Verify timeline shows 0.3.14
4. Commit: git add docs/docs/benchmarks/performance.md docs/static/benchmarks/latest.json
5. Commit: git commit -m 'Update benchmark dashboard for 0.3.14'
6. Push: git push
```
What it does:
- Generates Docusaurus-formatted markdown
- Updates dashboard JSON with history preservation
- Validates JSON structure (version matches input exactly)
- Clears Docusaurus build cache
- Provides next steps for testing and committing
- Accepts version with or without 'v' prefix
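The history-preserving merge can be pictured roughly as the jq sketch below. This is illustrative only: the `history` and `version` field names are assumptions about the `latest.json` schema, not the script's actual implementation.

```bash
# Illustrative sketch: append an entry for the new version to the history array
# unless it is already present, leaving earlier releases untouched.
# Field names ("history", "version") are assumed, not the real schema.
jq --arg v "0.3.14" \
  'if ([.history[]?.version] | index($v)) then . else .history += [{version: $v}] end' \
  docs/static/benchmarks/latest.json > latest.tmp.json \
  && mv latest.tmp.json docs/static/benchmarks/latest.json
```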
### `scripts/extract_changelog_metrics.sh [json_file]`
Extract benchmark metrics from dashboard JSON for CHANGELOG.
Usage:
```bash
.claude/skills/post-release/scripts/extract_changelog_metrics.sh

# Or specify JSON file:
.claude/skills/post-release/scripts/extract_changelog_metrics.sh docs/static/benchmarks/latest.json
```
Output:
```
Extracting metrics from docs/static/benchmarks/latest.json...

=== CHANGELOG.md Template ===

### Benchmark Results (M-EVAL)

**Overall Performance**: 59.1% success rate (399 total runs)

**By Language:**
- **AILANG**: 33.0% - New language, learning curve
- **Python**: 87.0% - Baseline for comparison
- **Gap**: 54.0 percentage points (expected for new language)

**Comparison**: -15.2% AILANG regression from 0.3.14 (48.2% → 33.0%)

=== End Template ===

Use this template in CHANGELOG.md for 0.3.15
```
What it does:
- Parses dashboard JSON for metrics
- Calculates percentages and gap between AILANG/Python
- Auto-compares with previous version from history
- Formats comparison text automatically (+X% improvement or -X% regression)
- Generates ready-to-paste CHANGELOG template with no manual work needed
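Under the hood, the auto-comparison amounts to something like the sketch below. Illustrative only: the `history` and `ailang_success_rate` field names are assumptions about the JSON schema, not the script's actual code.

```bash
# Hypothetical sketch of the version-over-version comparison.
# Field names are assumed; the real latest.json schema may differ.
curr=$(jq -r '.history[-1].ailang_success_rate' docs/static/benchmarks/latest.json)
prev=$(jq -r '.history[-2].ailang_success_rate' docs/static/benchmarks/latest.json)
awk -v c="$curr" -v p="$prev" 'BEGIN {
  d = (c - p) * 100
  printf "**Comparison**: %+.1f%% AILANG %s (%.1f%% → %.1f%%)\n",
         d, (d >= 0 ? "improvement" : "regression"), p * 100, c * 100
}'
```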
## Post-Release Workflow

### 1. Verify Release Exists

```bash
git tag -l vX.X.X
gh release view vX.X.X
```

If the release doesn't exist, run the release-manager skill first.
### 2. Run Eval Baseline

🚨 **CRITICAL**: ALWAYS use `--full` for releases!

Correct workflow:

```bash
# ✅ ALWAYS do this for releases - ONE command, let it complete
.claude/skills/post-release/scripts/run_eval_baseline.sh X.X.X --full
```

This runs all 6 production models with both AILANG and Python.
Cost: ~$0.50-1.00. Time: ~15-20 minutes.

❌ WRONG workflow (what happened with v0.3.22):

```bash
# DON'T do this for releases!
.claude/skills/post-release/scripts/run_eval_baseline.sh X.X.X  # Missing --full!

# Then try to add production models later with --skip-existing
# Result: Confusion, multiple processes, incomplete baseline
```

If the baseline times out or is interrupted:

```bash
# Resume with ALL 6 models (maintains --full semantics)
ailang eval-suite --full --langs python,ailang --parallel 5 \
  --output eval_results/baselines/X.X.X --skip-existing
```

The `--skip-existing` flag skips benchmarks that already have result files, allowing resumption of interrupted runs. But ONLY use this for recovery, not as a strategy to "add more models later".
### 3. Update Website Dashboard
Use the automation script:
```bash
.claude/skills/post-release/scripts/update_dashboard.sh X.X.X
```
IMPORTANT: This script automatically:
- Generates Docusaurus markdown (docs/docs/benchmarks/performance.md)
- Updates JSON with history preservation (docs/static/benchmarks/latest.json)
- Does NOT overwrite historical data - merges new version into existing history
- Validates JSON structure before writing
- Clears Docusaurus cache to prevent webpack errors
Test locally (optional but recommended):
```bash
cd docs && npm start
# Visit: http://localhost:3000/ailang/docs/benchmarks/performance
```
Verify:
- Timeline shows vX.X.X
- Success rate matches eval results
- No errors or warnings
Commit dashboard updates:
```bash
git add docs/docs/benchmarks/performance.md docs/static/benchmarks/latest.json
git commit -m "Update benchmark dashboard for vX.X.X"
git push
```
### 4. Extract Metrics for CHANGELOG
Use the automation script:
```bash
.claude/skills/post-release/scripts/extract_changelog_metrics.sh
```
This outputs a ready-to-paste template with:
- Overall success rate
- AILANG-only rate (important!)
- Python baseline rate
- Gap analysis
- Automatic comparison with previous version (no manual work!)
Update CHANGELOG.md:
- Paste the complete template into CHANGELOG.md under version section
- The comparison is already calculated and formatted
- Optionally add context about specific improvements or regressions
Example output (ready to paste):
```markdown
### Benchmark Results (M-EVAL)

**Overall Performance**: 59.1% success rate (399 total runs)

**By Language:**
- **AILANG**: 33.0% - New language, learning curve
- **Python**: 87.0% - Baseline for comparison
- **Gap**: 54.0 percentage points (expected for new language)

**Comparison**: -15.2% AILANG regression from 0.3.14 (48.2% → 33.0%)
```
### 4a. Analyze Agent Evaluation Results (NEW!)
Check agent efficiency metrics:
```bash
# Get KPIs (turns, tokens, cost by language)
.claude/skills/eval-analyzer/scripts/agent_kpis.sh eval_results/baselines/X.X.X
```
Key metrics to track:
- Avg Turns: AILANG vs Python (target: ≤1.5x gap)
- Avg Tokens: AILANG vs Python (target: ≤2.0x gap)
- Success Rate: Both should be 100% (agent mode corrects mistakes)
If AILANG is significantly worse than Python:

- Identify expensive benchmarks (from the "Most Expensive" section)
- View transcripts to understand issues:

  ```bash
  .claude/skills/eval-analyzer/scripts/agent_transcripts.sh eval_results/baselines/X.X.X <benchmark>
  ```

- File optimization tasks for the next release
- See the eval-analyzer skill's agent_optimization_guide.md for strategies
Example good result:
```
🐍 Python: 10.6 avg turns, 58k tokens, 100% success
🔷 AILANG: 12.0 avg turns, 72k tokens, 100% success
Gap: 1.13x turns, 1.24x tokens ✅ (within target!)
```
Example needs work:
```
🐍 Python: 10.6 avg turns, 58k tokens, 100% success
🔷 AILANG: 18.0 avg turns, 178k tokens, 100% success
Gap: 1.7x turns, 3.0x tokens ⚠️ (needs optimization!)
```
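The gap ratios are simply the AILANG averages divided by the Python averages. A quick sanity check against the targets, using the numbers from the first (good) example above:

```bash
# Turn/token gap check using the "good result" numbers above (pure arithmetic).
awk -v a_turns=12.0 -v p_turns=10.6 -v a_tok=72 -v p_tok=58 'BEGIN {
  printf "Turn gap:  %.2fx (target <= 1.5x)\n", a_turns / p_turns   # 1.13x
  printf "Token gap: %.2fx (target <= 2.0x)\n", a_tok / p_tok       # 1.24x
}'
```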
### 5. Update Design Docs

- Move completed design docs from `design_docs/planned/` to `design_docs/implemented/vX_Y/` (see the sketch below)
- Update design docs with what was actually implemented (vs planned)
- Create new design docs for deferred features
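A hypothetical migration for a single doc might look like this; the file name and target version directory are placeholders, not real paths:

```bash
# Placeholder names -- substitute the actual design doc and release directory.
mkdir -p design_docs/implemented/vX_Y
git mv design_docs/planned/some_feature.md design_docs/implemented/vX_Y/
git commit -m "docs: move implemented design docs to vX_Y"
```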
### 6. Update Public Documentation

- Update `prompts/` with latest AILANG syntax
- Update website docs (`docs/`) with latest features
- Remove outdated examples or references
- Add new examples to website
- Update `docs/guides/evaluation/` if significant benchmark improvements
- Update `docs/LIMITATIONS.md`:
  - Remove limitations that were fixed in this release
  - Add new known limitations discovered during development/testing
  - Update workarounds if they changed
  - Update version numbers in "Since" and "Fixed in" fields
  - Test examples: Verify that limitations listed still exist and workarounds still work

    ```bash
    # Test examples from LIMITATIONS.md

    # Example: Test Y-combinator still fails (should fail)
    echo 'let Y = \f. (\x. f(x(x)))(\x. f(x(x))) in Y' | ailang repl

    # Example: Test named recursion works (should succeed)
    ailang run examples/factorial.ail

    # Example: Test polymorphic operator limitation (should panic with floats)
    # Create test file and verify behavior matches documentation
    ```

- Commit changes:

  ```bash
  git add docs/LIMITATIONS.md && git commit -m "Update LIMITATIONS.md for vX.X.X"
  ```
## Resources

### Post-Release Checklist

See `resources/post_release_checklist.md` for the complete step-by-step checklist.

## Prerequisites
- Release vX.X.X completed successfully
- Git tag vX.X.X exists
- GitHub release published with all binaries
- `ailang` binary installed (for eval baseline)
- Node.js/npm installed (for dashboard, optional)
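A quick pre-flight check over these prerequisites might look like this (a sketch; substitute the actual version):

```bash
# Verify the release artifacts and local tooling before starting.
git tag -l vX.X.X                      # tag should be listed
gh release view vX.X.X > /dev/null && echo "GitHub release found"
command -v ailang > /dev/null || echo "missing: ailang binary"
command -v npm > /dev/null || echo "missing: npm (only needed for dashboard preview)"
```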
## Common Issues

### Eval Baseline Times Out

Solution: Use the `--skip-existing` flag to resume:

```bash
bin/ailang eval-suite --full --skip-existing --output eval_results/baselines/vX.X.X
```

### Dashboard Shows "null" for Aggregates

Cause: Wrong JSON file (performance matrix vs dashboard JSON)

Solution: Use the `update_dashboard.sh` script, not manual file copying

### Webpack/Cache Errors in Docusaurus

Cause: Stale build cache

Solution: Run `cd docs && npm run clear && rm -rf .docusaurus build`

### Dashboard Shows Old Version

Cause: Didn't run `update_dashboard.sh` with the correct version

Solution: Re-run `update_dashboard.sh X.X.X` with the correct version
## Progressive Disclosure
This skill loads information progressively:
- Always loaded: This SKILL.md file (YAML frontmatter + workflow overview)
- Execute as needed: Scripts in the `scripts/` directory (automation)
- Load on demand: `resources/post_release_checklist.md` (detailed checklist)
Scripts execute without loading into context window, saving tokens while providing powerful automation.
## Recent Improvements (v0.3.15)

### Version Format Handling
All scripts now accept version with or without 'v' prefix:
- ✅ `0.3.15` works
- ✅ `v0.3.15` works
- Scripts pass version to underlying tools exactly as given
- No more "directory not found" errors due to version format mismatch
### Auto-Comparison in Metrics

The `extract_changelog_metrics.sh` script now:
- Automatically finds previous version from history
- Calculates AILANG performance difference
- Formats comparison text ("+X% improvement" or "-X% regression")
- Shows before/after percentages: "(48.2% → 33.0%)"
- Zero manual jq queries needed!
### Eliminated Manual Work
Before v0.3.15, you needed to:
- Run 15+ manual jq queries to extract metrics
- Manually compare with previous version
- Format comparison text yourself
- Handle version prefix inconsistencies
After v0.3.15:
- 3 scripts do everything automatically
- No jq queries needed
- No manual calculations
- Version format doesn't matter
## Notes
- This skill follows Anthropic's Agent Skills specification (Oct 2025)
- Scripts handle 100% of automation (eval baseline, dashboard, metrics extraction)
- Can be run hours or even days after release
- Dashboard JSON preserves history - never overwrites historical data
- Always use the `--full` flag for release baselines (all production models)
- Design doc migration requires manual review