---
name: experiment-analysis
description: Analyze GRPO training runs for learning dynamics and pipeline performance. Use when diagnosing training issues, reviewing Elo progression, checking throughput, or updating experiment results.
---
# Experiment Analysis
Diagnose GRPO training runs using WandB metrics and Axiom logs.
## Quick Reference
| Question | Command |
|---|---|
| Full Elo analysis | `uv run python .claude/skills/experiment-analysis/analyze_elo.py <run>` |
| Compare sweep runs | `uv run python .claude/skills/experiment-analysis/analyze_sweep.py --sweep <prefix>` |
| Is model learning? | `uv run python scripts/wandb_cli.py get-metrics -r <run> --all-metrics` |
| Rollout throughput? | `uv run python scripts/axiom_cli.py rollout-timing --last 6h` |
| Any errors? | `uv run python scripts/axiom_cli.py errors --last 1h` |
| Extraction rate? | `uv run python scripts/axiom_cli.py extraction-stats --last 24h` |
| System health? | `uv run python scripts/axiom_cli.py health --last 1h` |
## Tools Overview
### WandB CLI (`scripts/wandb_cli.py`)
Training metrics and Elo ratings. Use for:
- Elo trajectory analysis (learning signal)
- Reward/loss curves
- KL divergence and grad norm
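Once a metric series has been exported (e.g. via `get-metrics`), the KL thresholds from the Key Metrics table below (stable <0.1, spikes >0.2) can be applied directly. A minimal sketch, assuming the series is a plain Python list of floats (the function name and thresholds-as-defaults are illustrative, not part of the CLI):

```python
def kl_health(kl_series, stable_max=0.1, spike_max=0.2):
    """Classify a KL-divergence trace: stable below 0.1 is healthy;
    any spike above 0.2 suggests the policy is drifting too far
    from the reference."""
    if max(kl_series) > spike_max:
        return "spiking"
    if all(v < stable_max for v in kl_series):
        return "stable"
    return "elevated"

# A healthy run vs. one with a KL spike
print(kl_health([0.02, 0.04, 0.05, 0.06]))   # stable
print(kl_health([0.03, 0.05, 0.25, 0.08]))   # spiking
```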
### Axiom CLI (`scripts/axiom_cli.py`)
Real-time logs and events. Use for:
- Rollout timing and throughput
- Inference engine performance
- Error monitoring
- Order extraction stats
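The `rollout-timing` command reports p95 duration directly; for reference, here is how that percentile is computed from raw per-rollout durations (a sketch using the nearest-rank method, not the CLI's internal code):

```python
import math

def p95(durations_s):
    """Nearest-rank 95th percentile of rollout durations (seconds)."""
    ordered = sorted(durations_s)
    rank = math.ceil(0.95 * len(ordered))  # 1-indexed rank
    return ordered[rank - 1]

durations = [40, 55, 60, 62, 70, 75, 80, 90, 110, 300]
print(p95(durations))  # 300 — one straggler dominates the tail
```

A single slow straggler can blow out p95 while the mean looks fine, which is why the performance targets below are stated on p95 rather than the average.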
## Detailed Guides
- Learning Dynamics — Elo, rewards, KL analysis
- Pipeline Performance — throughput, timing, errors
- Experiment Tracker Guide — updating `docs/experiment-tracker.md`
- Examples — real analysis walkthrough
## Key Metrics
### Learning Signal (Fixed Reference Analysis)
**Key insight:** win rate against a dynamic league is meaningless, because the opponents improve along with the model. Use **fixed** references.
| Metric | Good Sign | Bad Sign |
|---|---|---|
| `base_model` Elo | Declining | Stable/rising |
| Baseline bot Elo | Declining (exploited) | Rising |
| Best checkpoint − `base_model` gap | Growing | Shrinking |
| Older checkpoint Elo | Declining | Stable |
| KL divergence | Stable <0.1 | Spikes >0.2 |
Fixed references (`base_model`, `chaos_bot`, etc.) never change, so any movement in their Elo reflects learning in the training population. The Elo gap (best checkpoint minus `base_model`) measures how much the trained model has improved over its starting point.
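The two signals above can be checked mechanically from per-checkpoint Elo snapshots. A minimal sketch, assuming a hypothetical list of dicts with `base_model` and `best_checkpoint` ratings (field names are assumptions, not the CLI's output format):

```python
def learning_signal(history):
    """Is the frozen base_model being pushed down, and is the gap to
    the best checkpoint growing? Both True => the model is learning."""
    first, last = history[0], history[-1]
    base_drop = first["base_model"] - last["base_model"]
    gap_first = first["best_checkpoint"] - first["base_model"]
    gap_last = last["best_checkpoint"] - last["base_model"]
    return {
        "base_model_declining": base_drop > 0,
        "gap_growing": gap_last > gap_first,
    }

history = [
    {"step": 0,    "base_model": 1000, "best_checkpoint": 1000},
    {"step": 5000, "base_model": 940,  "best_checkpoint": 1090},
]
print(learning_signal(history))
# → {'base_model_declining': True, 'gap_growing': True}
```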
### Performance
| Metric | Target | Action if Missed |
|---|---|---|
| Rollout p95 duration | <120s | Check inference engine |
| Extraction rate | >95% | Check logits processor |
| Error rate | <1% | Check Axiom errors |
| Grad norm | <50 | Investigate policy instability |
| Grad norm | <50 | Policy may be unstable |