---
name: AILANG Post-Release Tasks
description: Run the automated post-release workflow (eval baselines, dashboard, docs) for AILANG releases. Executes the 46-benchmark suite (medium/hard/stretch) plus the full standard eval with validation and progress reporting. Use when the user says "post-release tasks for vX.X.X" or "update dashboard". Fully autonomous with pre-flight checks.
---
# AILANG Post-Release Tasks

Run post-release tasks for an AILANG release: evaluation baselines, dashboard updates, and documentation.
## Quick Start

Most common usage:

```bash
# User says: "Run post-release tasks for v0.3.14"
# This skill will:
#   1. Run eval baseline (ALL 6 PRODUCTION MODELS, both languages) - ALWAYS USE --full FOR RELEASES
#   2. Update website dashboard (JSON with history preservation)
#   3. Update axiom scorecard KPI (if features affect axiom compliance)
#   4. Extract metrics and update CHANGELOG.md automatically
#   5. Move design docs from planned/ to implemented/
#   6. Run docs-sync to verify website accuracy (version constants, PLANNED banners, examples)
#   7. Commit all changes to git
```
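The scripted steps chain into a single pass; a condensed sketch (version illustrative, each step detailed in the workflow below):

```bash
V=0.3.14  # illustrative version
.claude/skills/post-release/scripts/run_eval_baseline.sh "$V" --full
.claude/skills/post-release/scripts/update_dashboard.sh "$V"
.claude/skills/post-release/scripts/extract_changelog_metrics.sh   # paste the output into CHANGELOG.md
.claude/skills/post-release/scripts/cleanup_design_docs.sh "$V"
```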
🚨 **CRITICAL**: For releases, ALWAYS use the `--full` flag:

- Dev models (without `--full`) are only for quick testing/validation, NOT releases
- Users expect full benchmark results when they say "post-release" or "update dashboard"
- Never start with dev models and then try to add production models later
## When to Use This Skill

Invoke this skill when:

- User says "post-release tasks", "update dashboard", "run benchmarks"
- After a successful release (once the GitHub release is published)
- User asks about eval baselines or benchmark results
- User wants to update documentation after a release
## Available Scripts

### `scripts/run_eval_baseline.sh <version> [--full]`

Run the evaluation baseline for a release version.

🚨 **CRITICAL**: ALWAYS use `--full` for releases!
Usage:

```bash
# ✅ CORRECT - for releases (ALL 6 production models)
.claude/skills/post-release/scripts/run_eval_baseline.sh 0.3.14 --full
.claude/skills/post-release/scripts/run_eval_baseline.sh v0.3.14 --full

# ❌ WRONG - only use without --full for quick testing/validation
.claude/skills/post-release/scripts/run_eval_baseline.sh 0.3.14
```
Output:

```
Running eval baseline for 0.3.14...
Mode: FULL (all 6 production models)
Expected cost: ~$0.50-1.00
Expected time: ~15-20 minutes
[Running benchmarks...]
✓ Baseline complete
  Results: eval_results/baselines/0.3.14
  Files: 264 result files
```
What it does:

- Step 1: Runs `make eval-baseline` (standard 0-shot + repair evaluation)
  - Tests both AILANG and Python implementations
  - Uses all 6 production models (`--full`) or 3 dev models (default)
  - Tests all benchmarks in the `benchmarks/` directory
- Step 2: Runs agent eval on the full benchmark suite
  - Current suite (v0.4.8+): 46 benchmarks (trimmed from 56)
    - Removed: trivial benchmarks (print tests), most easy benchmarks
    - Kept: fizzbuzz (1 easy for validation), all medium/hard/stretch benchmarks
    - Stretch goals (6 new): symbolic_diff, mini_interpreter, lambda_calc, graph_bfs, type_unify, red_black_tree
  - Expected: ~55-70% success rate with haiku, higher with sonnet/opus
  - Uses haiku+sonnet (`--full`) or haiku only (default)
  - Tests both AILANG and Python implementations
- Saves combined results to `eval_results/baselines/X.X.X/`
- Accepts version with or without the 'v' prefix
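A quick completeness check after the run is to count result files; a sanity-check sketch (the 264 figure comes from the sample output above, so treat it as an assumption for other suite sizes):

```bash
# Sanity check - 264 is taken from the sample output above; adjust for your suite.
results=eval_results/baselines/0.3.14
count=$(find "$results" -type f | wc -l | tr -d ' ')
echo "Found $count result files (expected ~264 for a --full run)"
[ "$count" -lt 264 ] && echo "Looks incomplete - consider resuming with --skip-existing (see Step 2 below)"
```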
### `scripts/update_dashboard.sh <version>`

Update the website benchmark dashboard with new release data.

Usage:

```bash
.claude/skills/post-release/scripts/update_dashboard.sh 0.3.14
```
Output:

```
Updating dashboard for 0.3.14...

1/5 Generating Docusaurus markdown...
  ✓ Written to docs/docs/benchmarks/performance.md
2/5 Generating dashboard JSON with history...
  ✓ Written to docs/static/benchmarks/latest.json (history preserved)
3/5 Validating JSON...
  ✓ Version: 0.3.14
  ✓ Success rate: 0.627
4/5 Clearing Docusaurus cache...
  ✓ Cache cleared
5/5 Summary
  ✓ Dashboard updated for 0.3.14
  ✓ Markdown: docs/docs/benchmarks/performance.md
  ✓ JSON: docs/static/benchmarks/latest.json

Next steps:
  1. Test locally: cd docs && npm start
  2. Visit: http://localhost:3000/ailang/docs/benchmarks/performance
  3. Verify timeline shows 0.3.14
  4. Commit: git add docs/docs/benchmarks/performance.md docs/static/benchmarks/latest.json
  5. Commit: git commit -m 'Update benchmark dashboard for 0.3.14'
  6. Push: git push
```
What it does:

- Generates Docusaurus-formatted markdown
- Updates dashboard JSON with history preservation
- Validates JSON structure (version matches input exactly)
- Clears the Docusaurus build cache
- Provides next steps for testing and committing
- Accepts version with or without the 'v' prefix
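To spot-check the merged JSON by hand, a short `jq` query works; the field names here are assumptions inferred from the validation output above, not a documented schema:

```bash
# Field names inferred from the script's validation output - verify against the actual JSON.
jq '.version' docs/static/benchmarks/latest.json           # expect "0.3.14"
jq '.history | length' docs/static/benchmarks/latest.json  # should grow by one per release, never shrink
```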
### `scripts/extract_changelog_metrics.sh [json_file]`

Extract benchmark metrics from the dashboard JSON for the CHANGELOG.

Usage:

```bash
.claude/skills/post-release/scripts/extract_changelog_metrics.sh

# Or specify a JSON file:
.claude/skills/post-release/scripts/extract_changelog_metrics.sh docs/static/benchmarks/latest.json
```
Output:

```
Extracting metrics from docs/static/benchmarks/latest.json...

=== CHANGELOG.md Template ===

### Benchmark Results (M-EVAL)

**Overall Performance**: 59.1% success rate (399 total runs)

**By Language:**
- **AILANG**: 33.0% - New language, learning curve
- **Python**: 87.0% - Baseline for comparison
- **Gap**: 54.0 percentage points (expected for new language)

**Comparison**: -15.2% AILANG regression from 0.3.14 (48.2% → 33.0%)

=== End Template ===

Use this template in CHANGELOG.md for 0.3.15
```
What it does:

- Parses the dashboard JSON for metrics
- Calculates percentages and the gap between AILANG and Python
- Auto-compares with the previous version from history
- Formats the comparison text automatically (+X% improvement or -X% regression)
- Generates a ready-to-paste CHANGELOG template with no manual work needed
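The arithmetic behind the template is straightforward; a minimal sketch of it, using the sample rates from the output above (variable names hypothetical):

```bash
ailang_rate=33.0; python_rate=87.0; prev_ailang=48.2   # sample values from the output above

gap=$(echo "$python_rate - $ailang_rate" | bc)    # language gap in percentage points -> 54.0
delta=$(echo "$ailang_rate - $prev_ailang" | bc)  # change vs previous baseline -> -15.2

echo "Gap: ${gap} percentage points"
case $delta in
  -*) echo "Comparison: ${delta}% AILANG regression (${prev_ailang}% → ${ailang_rate}%)" ;;
  *)  echo "Comparison: +${delta}% AILANG improvement (${prev_ailang}% → ${ailang_rate}%)" ;;
esac
```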
### `scripts/cleanup_design_docs.sh <version> [--dry-run] [--force] [--check-only]`

Move design docs from planned/ to implemented/ after a release.

Features:

- Detects duplicates (docs already in implemented/)
- Detects misplaced docs (Target: field doesn't match the folder version)
- Only moves docs with "Status: Implemented" in their frontmatter
- Docs with other statuses are flagged for review
Usage:

```bash
# Check-only: report issues without making changes
.claude/skills/post-release/scripts/cleanup_design_docs.sh 0.5.9 --check-only

# Preview what would be moved/deleted/relocated
.claude/skills/post-release/scripts/cleanup_design_docs.sh 0.5.9 --dry-run

# Execute: move implemented docs, delete duplicates, relocate misplaced ones
.claude/skills/post-release/scripts/cleanup_design_docs.sh 0.5.9

# Force: move all docs regardless of status
.claude/skills/post-release/scripts/cleanup_design_docs.sh 0.5.9 --force
```
Output:

```
Design Doc Cleanup for v0.5.9
==================================

Checking 5 design doc(s) in design_docs/planned/v0_5_9/:

Phase 1: Detecting issues...
  [DUPLICATE] m-fix-if-else-let-block.md
    Already exists in design_docs/implemented/v0_5_9/
  [MISPLACED] m-codegen-value-types.md
    Target: v0.5.10 (folder: v0_5_9)
    Should be in: design_docs/planned/v0_5_10/

Issues found:
  - 1 duplicate(s) (can be deleted)
  - 1 misplaced doc(s) (wrong version folder)

Phase 2: Processing docs...
  [DELETED] m-fix-if-else-let-block.md (duplicate - already in implemented/)
  [RELOCATED] m-codegen-value-types.md → design_docs/planned/v0_5_10/
  [MOVED] m-dx11-cycles.md → design_docs/implemented/v0_5_9/
  [NEEDS REVIEW] m-unfinished-feature.md
    Found: **Status**: Planned

Summary:
  ✓ Deleted 1 duplicate(s)
  ✓ Relocated 1 doc(s) to correct version folder
  ✓ Moved 1 doc(s) to design_docs/implemented/v0_5_9/
  ⚠ 1 doc(s) need review (not marked as Implemented)
```
What it does:

- Phase 1 (Detection): identifies duplicates and misplaced docs
- Phase 2 (Processing):
  - Deletes duplicates (the same file already exists in implemented/)
  - Relocates misplaced docs (Target: version doesn't match the folder)
  - Moves docs with "Status: Implemented" to implemented/
  - Flags docs without Implemented status for review
- Creates target folders if needed
- Removes the empty planned folder after cleanup
- Use `--check-only` to only report issues, `--dry-run` to preview actions, `--force` to move all
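The status gate can be reproduced by hand to pre-check a doc; a sketch that assumes the `**Status**:` frontmatter convention visible in the sample output above (the path is hypothetical):

```bash
doc="design_docs/planned/v0_5_9/m-dx11-cycles.md"   # hypothetical path
if grep -q '\*\*Status\*\*: Implemented' "$doc"; then
  echo "ready to move to implemented/"
else
  grep '\*\*Status\*\*:' "$doc"   # show the current status line
fi
```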
## Post-Release Workflow

### 1. Verify Release Exists

```bash
git tag -l vX.X.X
gh release view vX.X.X
```

If the release doesn't exist, run the release-manager skill first.
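To make this a hard pre-flight gate rather than a visual check, a minimal sketch (version illustrative):

```bash
V=v0.3.14  # illustrative
git rev-parse -q --verify "refs/tags/$V" >/dev/null || {
  echo "Tag $V missing - run the release-manager skill first" >&2; exit 1; }
gh release view "$V" >/dev/null || {
  echo "GitHub release $V not published yet" >&2; exit 1; }
```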
### 2. Run Eval Baseline

🚨 **CRITICAL**: ALWAYS use `--full` for releases!

Correct workflow:

```bash
# ✅ ALWAYS do this for releases - ONE command, let it complete
.claude/skills/post-release/scripts/run_eval_baseline.sh X.X.X --full
```

This runs all 6 production models with both AILANG and Python.
Cost: ~$0.50-1.00. Time: ~15-20 minutes.

❌ WRONG workflow (what happened with v0.3.22):

```bash
# DON'T do this for releases!
.claude/skills/post-release/scripts/run_eval_baseline.sh X.X.X  # Missing --full!

# Then try to add production models later with --skip-existing
# Result: confusion, multiple processes, incomplete baseline
```
If the baseline times out or is interrupted:

```bash
# Resume with ALL 6 models (maintains --full semantics)
ailang eval-suite --full --langs python,ailang --parallel 5 \
  --output eval_results/baselines/X.X.X --skip-existing
```

The `--skip-existing` flag skips benchmarks that already have result files, allowing interrupted runs to be resumed. But ONLY use this for recovery, not as a strategy to "add more models later".
### 3. Update Website Dashboard

Use the automation script:

```bash
.claude/skills/post-release/scripts/update_dashboard.sh X.X.X
```

IMPORTANT: This script automatically:

- Generates Docusaurus markdown (docs/docs/benchmarks/performance.md)
- Updates JSON with history preservation (docs/static/benchmarks/latest.json)
- Does NOT overwrite historical data - merges the new version into the existing history
- Validates JSON structure before writing
- Clears the Docusaurus cache to prevent webpack errors
Test locally (optional but recommended):

```bash
cd docs && npm start
# Visit: http://localhost:3000/ailang/docs/benchmarks/performance
```

Verify:

- Timeline shows vX.X.X
- Success rate matches the eval results
- No errors or warnings

Commit the dashboard updates:

```bash
git add docs/docs/benchmarks/performance.md docs/static/benchmarks/latest.json
git commit -m "Update benchmark dashboard for vX.X.X"
git push
```
### 4. Update Axiom Scorecard

Review and update the axiom scorecard:

```bash
# View the current scorecard
ailang axioms

# The scorecard is at docs/static/benchmarks/axiom_scorecard.json
# Update scores if features were added/improved that affect axiom compliance
```

When to update scores:

- +1 → +2 if a partial implementation becomes complete
- New feature aligns with an axiom → update the evidence
- Gaps were fixed → remove them from the gaps array
- Add to the history array to track the KPI over time
Update the history entry:

```json
{
  "version": "vX.X.X",
  "date": "YYYY-MM-DD",
  "score": 18,
  "maxScore": 24,
  "percentage": 75.0,
  "notes": "Added capability budgets (A9 +1)"
}
```
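Appending the entry can be scripted with `jq`; this sketch assumes the scorecard keeps a top-level `history` array, which is an assumption rather than a documented schema:

```bash
# ASSUMPTION: axiom_scorecard.json has a top-level "history" array - verify before running.
f=docs/static/benchmarks/axiom_scorecard.json
jq --argjson entry '{
  "version": "vX.X.X", "date": "YYYY-MM-DD",
  "score": 18, "maxScore": 24, "percentage": 75.0,
  "notes": "Added capability budgets (A9 +1)"
}' '.history += [$entry]' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
```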
### 5. Extract Metrics for CHANGELOG

Generate the metrics template (reads docs/static/benchmarks/latest.json by default):

```bash
.claude/skills/post-release/scripts/extract_changelog_metrics.sh
```

This outputs a formatted template with:

- Overall success rate
- Standard eval metrics (0-shot, final with repair, repair effectiveness)
- Agent eval metrics by language
- Automatic comparison with the previous version (no manual work!)

Update CHANGELOG.md:

- Run the script to generate the template
- Insert the "Benchmark Results (M-EVAL)" section into CHANGELOG.md
- Place it after the feature/fix sections and before the next version

For the CHANGELOG template format, see `resources/version_notes.md`.
### 5a. Analyze Agent Evaluation Results

```bash
# Get KPIs (turns, tokens, cost by language)
.claude/skills/eval-analyzer/scripts/agent_kpis.sh eval_results/baselines/X.X.X
```

Target metrics: AILANG should average ≤1.5x Python's turns and ≤2.0x Python's tokens.

For a detailed agent analysis guide, see `resources/version_notes.md`.
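A quick check of the results against those targets (all numbers and variable names illustrative; read the real averages from the KPI script's output, whose exact format may differ):

```bash
# Illustrative values - substitute the real averages reported by agent_kpis.sh.
awk -v ail_turns=4.2 -v py_turns=3.1 -v ail_tok=9800 -v py_tok=4200 'BEGIN {
  tr = ail_turns / py_turns; kr = ail_tok / py_tok
  printf "turns ratio:  %.2fx (target <= 1.5x) %s\n", tr, (tr <= 1.5 ? "OK" : "OVER")
  printf "tokens ratio: %.2fx (target <= 2.0x) %s\n", kr, (kr <= 2.0 ? "OK" : "OVER")
}'
```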
### 6. Move Design Docs to Implemented

Step 1: Check for issues (duplicates, misplaced docs):

```bash
.claude/skills/post-release/scripts/cleanup_design_docs.sh X.X.X --check-only
```

Step 2: Preview all changes:

```bash
.claude/skills/post-release/scripts/cleanup_design_docs.sh X.X.X --dry-run
```

Step 3: Check any flagged docs:

- `[DUPLICATE]` - these will be deleted (already in implemented/)
- `[MISPLACED]` - these will be relocated to the correct version folder
- `[NEEDS REVIEW]` - update `**Status**:` to `Implemented` if done, or leave for the next version

Step 4: Execute the cleanup:

```bash
.claude/skills/post-release/scripts/cleanup_design_docs.sh X.X.X
```

Step 5: Commit the changes:

```bash
git add design_docs/
git commit -m "docs: cleanup design docs for vX_Y_Z"
```
The script automatically:
- Deletes duplicates (same file already in implemented/)
- Relocates misplaced docs (Target: version doesn't match folder)
- Moves docs with "Status: Implemented" to implemented/
- Flags remaining docs for manual review
### 7. Update Public Documentation

- Update `prompts/` with the latest AILANG syntax
- Update website docs (`docs/`) with the latest features
- Remove outdated examples or references
- Add new examples to the website
- Update `docs/guides/evaluation/` if there are significant benchmark improvements
- Update `docs/LIMITATIONS.md`:
  - Remove limitations that were fixed in this release
  - Add new known limitations discovered during development/testing
  - Update workarounds if they changed
  - Update version numbers in "Since" and "Fixed in" fields
  - Test examples: verify that the listed limitations still exist and the workarounds still work

```bash
# Test examples from LIMITATIONS.md

# Example: Test Y-combinator still fails (should fail)
echo 'let Y = \f. (\x. f(x(x)))(\x. f(x(x))) in Y' | ailang repl

# Example: Test named recursion works (should succeed)
ailang run examples/factorial.ail

# Example: Test polymorphic operator limitation (should panic with floats)
# Create a test file and verify the behavior matches the documentation
```

Commit the changes:

```bash
git add docs/LIMITATIONS.md && git commit -m "Update LIMITATIONS.md for vX.X.X"
```
### 8. Run Documentation Sync Check

Run docs-sync to verify website accuracy:

```bash
# Check version constants are correct
.claude/skills/docs-sync/scripts/check_versions.sh

# Audit design docs vs website claims
.claude/skills/docs-sync/scripts/audit_design_docs.sh

# Generate full sync report
.claude/skills/docs-sync/scripts/generate_report.sh
```

What docs-sync checks:

- Version constants in `docs/src/constants/version.js` match the git tag
- Teaching prompt references point to the latest version
- Architecture pages have PLANNED banners for unimplemented features
- Design docs status (planned vs implemented) matches website claims
- Examples referenced on the website actually work

If issues are found:

- Update version.js if stale
- Add PLANNED banners to theoretical feature pages
- Move implemented features from roadmap to current sections
- Fix broken example references

Commit the docs-sync fixes:

```bash
git add docs/
git commit -m "docs: sync website with vX.X.X implementation"
```

See the docs-sync skill for full documentation.
## Resources

### Post-Release Checklist

See `resources/post_release_checklist.md` for the complete step-by-step checklist.
## Prerequisites

- Release vX.X.X completed successfully
- Git tag vX.X.X exists
- GitHub release published with all binaries
- `ailang` binary installed (for eval baseline)
- Node.js/npm installed (for dashboard; optional)
## Common Issues

### Eval Baseline Times Out

Solution: Use the `--skip-existing` flag to resume:

```bash
bin/ailang eval-suite --full --skip-existing --output eval_results/baselines/vX.X.X
```

### Dashboard Shows "null" for Aggregates

Cause: Wrong JSON file (performance matrix vs dashboard JSON).
Solution: Use the update_dashboard.sh script, not manual file copying.

### Webpack/Cache Errors in Docusaurus

Cause: Stale build cache.
Solution: Run `cd docs && npm run clear && rm -rf .docusaurus build`

### Dashboard Shows Old Version

Cause: update_dashboard.sh wasn't run with the correct version.
Solution: Re-run `update_dashboard.sh X.X.X` with the correct version.
## Progressive Disclosure

This skill loads information progressively:

- Always loaded: this SKILL.md file (YAML frontmatter + workflow overview)
- Executed as needed: scripts in the `scripts/` directory (automation)
- Loaded on demand: `resources/post_release_checklist.md` (detailed checklist)

Scripts execute without being loaded into the context window, saving tokens while providing powerful automation.
## Version Notes

For historical improvements and lessons learned, see `resources/version_notes.md`.

Key points:

- All scripts accept the version with or without the 'v' prefix
- Use the `--validate` flag to check configuration before running
- Agent eval requires an explicit `--benchmarks` list (defined in the script)
## Notes

- This skill follows Anthropic's Agent Skills specification (Oct 2025)
- Scripts handle 100% of the automation (eval baseline, dashboard, metrics extraction, design docs)
- Can be run hours or even days after the release
- The dashboard JSON preserves history - it never overwrites historical data
- Always use the `--full` flag for release baselines (all production models)
- Design docs cleanup handles duplicates, misplaced docs, and status-based moves:
  - `--check-only` to report issues without changes
  - `--dry-run` to preview all actions
  - `--force` to move all regardless of status