Claude Code Plugins

Community-maintained marketplace

AILANG Post-Release Tasks

@sunholo-data/ailang

Run automated post-release workflow (eval baselines, dashboard, docs) for AILANG releases. Executes 19-benchmark agent suite + full standard eval with validation and progress reporting. Use when user says "post-release tasks for vX.X.X" or "update dashboard". Fully autonomous with pre-flight checks.

Install Skill

1. Download skill
2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reading through its instructions before using it.

SKILL.md

name: AILANG Post-Release Tasks
description: Run automated post-release workflow (eval baselines, dashboard, docs) for AILANG releases. Executes 19-benchmark agent suite + full standard eval with validation and progress reporting. Use when user says "post-release tasks for vX.X.X" or "update dashboard". Fully autonomous with pre-flight checks.

AILANG Post-Release Tasks

Run post-release tasks for an AILANG release: evaluation baselines, dashboard updates, and documentation.

Quick Start

Most common usage:

# User says: "Run post-release tasks for v0.3.14"
# This skill will:
# 1. Run eval baseline (ALL 6 PRODUCTION MODELS, both languages) - ALWAYS USE --full FOR RELEASES
# 2. Update website dashboard (markdown + JSON with history)
# 3. Extract metrics for CHANGELOG
# 4. Guide through design doc and public doc updates
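Concretely, step 1 maps to the single baseline invocation documented under "Available Scripts" below; for a v0.3.14 release it looks like this:

.claude/skills/post-release/scripts/run_eval_baseline.sh 0.3.14 --full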

🚨 CRITICAL: For releases, ALWAYS use the --full flag

  • Dev models (without --full) are only for quick testing/validation, NOT for releases
  • Users expect full benchmark results when they say "post-release" or "update dashboard"
  • Never start with dev models and then try to add production models later

When to Use This Skill

Invoke this skill when:

  • User says "post-release tasks", "update dashboard", "run benchmarks"
  • After successful release (once GitHub release is published)
  • User asks about eval baselines or benchmark results
  • User wants to update documentation after a release

Available Scripts

scripts/run_eval_baseline.sh <version> [--full]

Run evaluation baseline for a release version.

🚨 CRITICAL: ALWAYS use --full for releases!

Usage:

# ✅ CORRECT - For releases (ALL 6 production models)
.claude/skills/post-release/scripts/run_eval_baseline.sh 0.3.14 --full
.claude/skills/post-release/scripts/run_eval_baseline.sh v0.3.14 --full

# ❌ WRONG - Only use without --full for quick testing/validation
.claude/skills/post-release/scripts/run_eval_baseline.sh 0.3.14

Output:

Running eval baseline for 0.3.14...
Mode: FULL (all 6 production models)
Expected cost: ~$0.50-1.00
Expected time: ~15-20 minutes

[Running benchmarks...]

✓ Baseline complete
  Results: eval_results/baselines/0.3.14
  Files: 264 result files

What it does:

  • Step 1: Runs make eval-baseline (standard 0-shot + repair evaluation)
    • Tests both AILANG and Python implementations
    • Uses all 6 production models (--full) or 3 dev models (default)
    • Tests all benchmarks in benchmarks/ directory
  • Step 2: Runs agent eval on curated benchmark suite
    • Current suite (v0.4.0+): 19 benchmarks across 2 tiers
      • Tier 1 (Smoke Tests - 8): fizzbuzz, recursion_factorial, recursion_fibonacci, simple_print, records_person, list_operations, string_manipulation, nested_records
        • Expected: 95-100% success
      • Tier 2 (Differentiators - 11): higher_order_functions, pattern_matching_complex, record_update, effect_composition, effect_tracking_io_fs, effect_pure_separation, exhaustive_pattern_matching, type_safe_record_access, explicit_state_threading, deterministic_list_transform, referential_transparency
        • Expected: 60-80% success (agent outperforms 0-shot)
      • See BENCHMARK_AUDIT_ANALYSIS.md for detailed rationale
    • Uses haiku+sonnet (--full) or haiku only (default)
    • Tests both AILANG and Python implementations
  • Saves combined results to eval_results/baselines/X.X.X/
  • Accepts version with or without 'v' prefix
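
Once the baseline finishes, a quick sanity check is to count the result files in the output directory; the 264-file figure from the example output above is just a reference point and may vary between releases:

# Confirm the baseline directory is fully populated (example run above produced 264 files)
ls eval_results/baselines/0.3.14 | wc -l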

scripts/update_dashboard.sh <version>

Update website benchmark dashboard with new release data.

Usage:

.claude/skills/post-release/scripts/update_dashboard.sh 0.3.14

Output:

Updating dashboard for 0.3.14...

1/5 Generating Docusaurus markdown...
  ✓ Written to docs/docs/benchmarks/performance.md

2/5 Generating dashboard JSON with history...
  ✓ Written to docs/static/benchmarks/latest.json (history preserved)

3/5 Validating JSON...
  ✓ Version: 0.3.14
  ✓ Success rate: 0.627

4/5 Clearing Docusaurus cache...
  ✓ Cache cleared

5/5 Summary
  ✓ Dashboard updated for 0.3.14
  ✓ Markdown: docs/docs/benchmarks/performance.md
  ✓ JSON: docs/static/benchmarks/latest.json

Next steps:
  1. Test locally: cd docs && npm start
  2. Visit: http://localhost:3000/ailang/docs/benchmarks/performance
  3. Verify timeline shows 0.3.14
  4. Commit: git add docs/docs/benchmarks/performance.md docs/static/benchmarks/latest.json
  5. Commit: git commit -m 'Update benchmark dashboard for 0.3.14'
  6. Push: git push

What it does:

  • Generates Docusaurus-formatted markdown
  • Updates dashboard JSON with history preservation
  • Validates JSON structure (version matches input exactly; a manual spot-check is sketched after this list)
  • Clears Docusaurus build cache
  • Provides next steps for testing and committing
  • Accepts version with or without 'v' prefix
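
For a manual spot-check of the generated JSON, a single jq query against the .version field (shown in the validation output above) is usually enough; deeper field paths are not documented here, so treat anything beyond .version as an assumption:

# Verify the dashboard JSON picked up the new release version
jq -r '.version' docs/static/benchmarks/latest.json   # expect: 0.3.14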

scripts/extract_changelog_metrics.sh [json_file]

Extract benchmark metrics from dashboard JSON for CHANGELOG.

Usage:

.claude/skills/post-release/scripts/extract_changelog_metrics.sh
# Or specify JSON file:
.claude/skills/post-release/scripts/extract_changelog_metrics.sh docs/static/benchmarks/latest.json

Output:

Extracting metrics from docs/static/benchmarks/latest.json...

=== CHANGELOG.md Template ===

### Benchmark Results (M-EVAL)

**Overall Performance**: 59.1% success rate (399 total runs)

**By Language:**
- **AILANG**: 33.0% - New language, learning curve
- **Python**: 87.0% - Baseline for comparison
- **Gap**: 54.0 percentage points (expected for new language)

**Comparison**: -15.2% AILANG regression from 0.3.14 (48.2% → 33.0%)

=== End Template ===

Use this template in CHANGELOG.md for 0.3.15

What it does:

  • Parses dashboard JSON for metrics
  • Calculates percentages and the gap between AILANG and Python (the arithmetic is worked out after this list)
  • Auto-compares with previous version from history
  • Formats comparison text automatically (+X% improvement or -X% regression)
  • Generates ready-to-paste CHANGELOG template with no manual work needed
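
The arithmetic behind the template is plain percentage-point math. Using the example values from the output above (AILANG 33.0%, Python 87.0%, previous AILANG 48.2%), the gap and comparison lines work out as follows; awk is used here purely for illustration:

# Gap = Python rate - AILANG rate; Comparison = current AILANG rate - previous AILANG rate
awk 'BEGIN { printf "Gap: %.1f percentage points\n", 87.0 - 33.0 }'    # 54.0
awk 'BEGIN { printf "Comparison: %+.1f%%\n", 33.0 - 48.2 }'            # -15.2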

Post-Release Workflow

1. Verify Release Exists

git tag -l vX.X.X
gh release view vX.X.X

If the release doesn't exist, run the release-manager skill first.
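
To make this check non-interactive, a small guard can abort the workflow early. This is only a sketch: it swaps git tag -l for git rev-parse --verify so that a missing tag produces a non-zero exit code, and relies on gh release view failing when the release does not exist:

# Abort early if the tag or GitHub release is missing
git rev-parse --verify "refs/tags/vX.X.X" >/dev/null 2>&1 \
  || { echo "Tag vX.X.X not found - run the release-manager skill first"; exit 1; }
gh release view "vX.X.X" >/dev/null 2>&1 \
  || { echo "GitHub release vX.X.X not found"; exit 1; }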

2. Run Eval Baseline

🚨 CRITICAL: ALWAYS use --full for releases!

Correct workflow:

# ✅ ALWAYS do this for releases - ONE command, let it complete
.claude/skills/post-release/scripts/run_eval_baseline.sh X.X.X --full

This runs all 6 production models with both AILANG and Python.

Cost: ~$0.50-1.00
Time: ~15-20 minutes

❌ WRONG workflow (what happened with v0.3.22):

# DON'T do this for releases!
.claude/skills/post-release/scripts/run_eval_baseline.sh X.X.X  # Missing --full!
# Then try to add production models later with --skip-existing
# Result: Confusion, multiple processes, incomplete baseline

If baseline times out or is interrupted:

# Resume with ALL 6 models (maintains --full semantics)
ailang eval-suite --full --langs python,ailang --parallel 5 \
  --output eval_results/baselines/X.X.X --skip-existing

The --skip-existing flag skips benchmarks that already have result files, allowing resumption of interrupted runs. But ONLY use this for recovery, not as a strategy to "add more models later".

3. Update Website Dashboard

Use the automation script:

.claude/skills/post-release/scripts/update_dashboard.sh X.X.X

IMPORTANT: This script automatically:

  • Generates Docusaurus markdown (docs/docs/benchmarks/performance.md)
  • Updates JSON with history preservation (docs/static/benchmarks/latest.json)
  • Does NOT overwrite historical data - merges the new version into existing history (a quick preservation check follows this list)
  • Validates JSON structure before writing
  • Clears Docusaurus cache to prevent webpack errors

Test locally (optional but recommended):

cd docs && npm start
# Visit: http://localhost:3000/ailang/docs/benchmarks/performance

Verify:

  • Timeline shows vX.X.X
  • Success rate matches eval results
  • No errors or warnings
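
If you prefer to skip the browser check, the same verification can be done against the generated files directly (paths from the script output above); substitute the real version for X.X.X:

# The new version string should appear in both generated artifacts
grep -n "X.X.X" docs/docs/benchmarks/performance.md docs/static/benchmarks/latest.json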

Commit dashboard updates:

git add docs/docs/benchmarks/performance.md docs/static/benchmarks/latest.json
git commit -m "Update benchmark dashboard for vX.X.X"
git push

4. Extract Metrics for CHANGELOG

Use the automation script:

.claude/skills/post-release/scripts/extract_changelog_metrics.sh

This outputs a ready-to-paste template with:

  • Overall success rate
  • AILANG-only rate (important!)
  • Python baseline rate
  • Gap analysis
  • Automatic comparison with previous version (no manual work!)

Update CHANGELOG.md:

  • Paste the complete template into CHANGELOG.md under version section
  • The comparison is already calculated and formatted
  • Optionally add context about specific improvements or regressions

Example output (ready to paste):

### Benchmark Results (M-EVAL)

**Overall Performance**: 59.1% success rate (399 total runs)

**By Language:**
- **AILANG**: 33.0% - New language, learning curve
- **Python**: 87.0% - Baseline for comparison
- **Gap**: 54.0 percentage points (expected for new language)

**Comparison**: -15.2% AILANG regression from 0.3.14 (48.2% → 33.0%)

4a. Analyze Agent Evaluation Results (NEW!)

Check agent efficiency metrics:

# Get KPIs (turns, tokens, cost by language)
.claude/skills/eval-analyzer/scripts/agent_kpis.sh eval_results/baselines/X.X.X

Key metrics to track:

  • Avg Turns: AILANG vs Python (target: ≤1.5x gap; the ratio arithmetic is shown after this list)
  • Avg Tokens: AILANG vs Python (target: ≤2.0x gap)
  • Success Rate: Both should be 100% (agent mode corrects mistakes)
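
The gap targets are simple ratios of the AILANG and Python averages. Using the numbers from the "good result" example below, they work out like this (awk used only for illustration):

# Gap ratio = AILANG average / Python average, for turns and tokens
awk 'BEGIN { printf "turns: %.2fx  tokens: %.2fx\n", 12.0/10.6, 72.0/58.0 }'   # 1.13x turns, 1.24x tokens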

If AILANG is significantly worse than Python:

  1. Identify expensive benchmarks (from "Most Expensive" section)
  2. View transcripts to understand issues: .claude/skills/eval-analyzer/scripts/agent_transcripts.sh eval_results/baselines/X.X.X <benchmark>
  3. File optimization tasks for next release
  4. See eval-analyzer skill's agent_optimization_guide.md for strategies

Example good result:

🐍 Python: 10.6 avg turns, 58k tokens, 100% success
🔷 AILANG: 12.0 avg turns, 72k tokens, 100% success
Gap: 1.13x turns, 1.24x tokens ✅ (within target!)

Example needs work:

🐍 Python: 10.6 avg turns, 58k tokens, 100% success
🔷 AILANG: 18.0 avg turns, 178k tokens, 100% success
Gap: 1.7x turns, 3.0x tokens ⚠️ (needs optimization!)

5. Update Design Docs

  • Move completed design docs from design_docs/planned/ to design_docs/implemented/vX_Y/
  • Update design docs with what was actually implemented (vs planned)
  • Create new design docs for deferred features

6. Update Public Documentation

  • Update prompts/ with latest AILANG syntax
  • Update website docs (docs/) with latest features
  • Remove outdated examples or references
  • Add new examples to website
  • Update docs/guides/evaluation/ if significant benchmark improvements
  • Update docs/LIMITATIONS.md:
    • Remove limitations that were fixed in this release
    • Add new known limitations discovered during development/testing
    • Update workarounds if they changed
    • Update version numbers in "Since" and "Fixed in" fields
    • Test examples: Verify that limitations listed still exist and workarounds still work
      # Test examples from LIMITATIONS.md
      # Example: Test Y-combinator still fails (should fail)
      echo 'let Y = \f. (\x. f(x(x)))(\x. f(x(x))) in Y' | ailang repl
      
      # Example: Test named recursion works (should succeed)
      ailang run examples/factorial.ail
      
      # Example: Test polymorphic operator limitation (should panic with floats)
      # Create test file and verify behavior matches documentation
      
    • Commit changes: git add docs/LIMITATIONS.md && git commit -m "Update LIMITATIONS.md for vX.X.X"

Resources

Post-Release Checklist

See resources/post_release_checklist.md for complete step-by-step checklist.

Prerequisites

  • Release vX.X.X completed successfully
  • Git tag vX.X.X exists
  • GitHub release published with all binaries
  • ailang binary installed (for eval baseline)
  • Node.js/npm installed (for dashboard, optional)

Common Issues

Eval Baseline Times Out

Solution: Use --skip-existing flag to resume:

bin/ailang eval-suite --full --skip-existing --output eval_results/baselines/vX.X.X

Dashboard Shows "null" for Aggregates

Cause: Wrong JSON file (performance matrix vs dashboard JSON)
Solution: Use update_dashboard.sh script, not manual file copying

Webpack/Cache Errors in Docusaurus

Cause: Stale build cache
Solution: Run cd docs && npm run clear && rm -rf .docusaurus build

Dashboard Shows Old Version

Cause: Didn't run update_dashboard.sh with correct version
Solution: Re-run update_dashboard.sh X.X.X with correct version

Progressive Disclosure

This skill loads information progressively:

  1. Always loaded: This SKILL.md file (YAML frontmatter + workflow overview)
  2. Execute as needed: Scripts in scripts/ directory (automation)
  3. Load on demand: resources/post_release_checklist.md (detailed checklist)

Scripts execute without being loaded into the context window, saving tokens while providing powerful automation.

Recent Improvements (v0.3.15)

Version Format Handling

All scripts now accept version with or without 'v' prefix:

  • 0.3.15 works
  • v0.3.15 works
  • Scripts pass version to underlying tools exactly as given
  • No more "directory not found" errors due to version format mismatch
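
The scripts handle this internally; the only time it matters is when constructing paths by hand (the example output earlier shows eval_results/baselines/0.3.14 without the prefix). A minimal bash sketch for stripping the prefix yourself:

RAW="v0.3.15"
VERSION="${RAW#v}"                              # strips a leading 'v' if present: v0.3.15 -> 0.3.15
echo "eval_results/baselines/${VERSION}"        # -> eval_results/baselines/0.3.15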

Auto-Comparison in Metrics

The extract_changelog_metrics.sh script now:

  • Automatically finds previous version from history
  • Calculates AILANG performance difference
  • Formats comparison text ("+X% improvement" or "-X% regression")
  • Shows before/after percentages: "(48.2% → 33.0%)"
  • Zero manual jq queries needed!

Eliminated Manual Work

Before v0.3.15, you needed to:

  • Run 15+ manual jq queries to extract metrics
  • Manually compare with previous version
  • Format comparison text yourself
  • Handle version prefix inconsistencies

After v0.3.15:

  • 3 scripts do everything automatically
  • No jq queries needed
  • No manual calculations
  • Version format doesn't matter

Notes

  • This skill follows Anthropic's Agent Skills specification (Oct 2025)
  • Scripts handle 100% of automation (eval baseline, dashboard, metrics extraction)
  • Can be run hours or even days after release
  • Dashboard JSON preserves history - never overwrites historical data
  • Always use --full flag for release baselines (all production models)
  • Design doc migration requires manual review