experiment-analysis

@bglick13/diplomacy-v2

Analyze GRPO training runs for learning dynamics and pipeline performance. Use when diagnosing training issues, reviewing Elo progression, checking throughput, or updating experiment results.

Install Skill

1. Download the skill.
2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section.
3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file.

Note: Please verify the skill by reading through its instructions before using it.

SKILL.md

name: experiment-analysis
description: Analyze GRPO training runs for learning dynamics and pipeline performance. Use when diagnosing training issues, reviewing Elo progression, checking throughput, or updating experiment results.

Experiment Analysis

Diagnose GRPO training runs using WandB metrics and Axiom logs.

Quick Reference

| Question | Command |
| --- | --- |
| Full Elo analysis | `uv run python .claude/skills/experiment-analysis/analyze_elo.py <run>` |
| Compare sweep runs | `uv run python .claude/skills/experiment-analysis/analyze_sweep.py --sweep <prefix>` |
| Is model learning? | `uv run python scripts/wandb_cli.py get-metrics -r <run> --all-metrics` |
| Rollout throughput? | `uv run python scripts/axiom_cli.py rollout-timing --last 6h` |
| Any errors? | `uv run python scripts/axiom_cli.py errors --last 1h` |
| Extraction rate? | `uv run python scripts/axiom_cli.py extraction-stats --last 24h` |
| System health? | `uv run python scripts/axiom_cli.py health --last 1h` |
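
To script a quick diagnostic pass over several of these checks, a minimal sketch in Python, assuming the repo layout shown in the table (the run name is a placeholder):

```python
import subprocess

RUN = "my-run"  # placeholder: substitute your WandB run name

CHECKS = [
    # Is the model learning? Pull all training metrics for the run.
    ["uv", "run", "python", "scripts/wandb_cli.py", "get-metrics", "-r", RUN, "--all-metrics"],
    # Rollout throughput over the last 6 hours.
    ["uv", "run", "python", "scripts/axiom_cli.py", "rollout-timing", "--last", "6h"],
    # Any recent errors?
    ["uv", "run", "python", "scripts/axiom_cli.py", "errors", "--last", "1h"],
]

for cmd in CHECKS:
    print(f"\n$ {' '.join(cmd)}")
    subprocess.run(cmd, check=False)  # keep going even if one check fails
```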

Tools Overview

WandB CLI (scripts/wandb_cli.py)

Training metrics and Elo ratings. Use for:

  • Elo trajectory analysis (learning signal)
  • Reward/loss curves
  • KL divergence and grad norm
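
For ad-hoc queries outside the CLI, a minimal sketch using the public `wandb` API is below. The entity/project path and metric key names are assumptions; check `scripts/wandb_cli.py` for the real keys.

```python
import wandb

# Placeholder path: "<entity>/<project>/<run-id>" from the WandB run page.
api = wandb.Api()
run = api.run("my-entity/diplomacy-grpo/my-run")

# history() returns a pandas DataFrame sampled across training steps.
# The metric key names here are assumptions; wandb_cli.py knows the real ones.
hist = run.history(keys=["elo/base_model", "train/kl", "train/grad_norm"])

# Healthy pattern (see "Learning Signal" below): base_model Elo drifting
# down while KL stays below ~0.1.
print(hist.tail())
```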

Axiom CLI (scripts/axiom_cli.py)

Real-time logs and events. Use for:

  • Rollout timing and throughput
  • Inference engine performance
  • Error monitoring
  • Order extraction stats
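
The CLI handles the Axiom queries and aggregation; purely to illustrate what the rollout-timing and error reports boil down to, here is a minimal sketch over already-exported log events (the event field names are assumptions):

```python
import statistics

def rollout_report(events: list[dict]) -> dict:
    """Summarize rollout timing and error rate from exported log events.

    Assumes each event may carry a "duration_s" field (rollout duration)
    and a "level" field; the real field names may differ.
    """
    durations = [e["duration_s"] for e in events if "duration_s" in e]
    errors = [e for e in events if e.get("level") == "error"]
    # quantiles(n=20)[18] is the 95th-percentile rollout duration.
    p95 = statistics.quantiles(durations, n=20)[18] if len(durations) >= 2 else None
    return {
        "rollouts": len(durations),
        "p95_duration_s": p95,
        "error_rate": len(errors) / max(len(events), 1),
    }

# Example: two rollouts and one error event.
print(rollout_report([
    {"duration_s": 95.0}, {"duration_s": 110.0}, {"level": "error"},
]))
```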

Detailed Guides

Key Metrics

Learning Signal (Fixed Reference Analysis)

Key insight: Win rate against a dynamic league is meaningless, because the opponents are improving too. Use FIXED references.

| Metric | Good Sign | Bad Sign |
| --- | --- | --- |
| base_model Elo | Declining | Stable/Rising |
| Baseline bot Elo | Declining (exploited) | Rising |
| Best checkpoint - base_model gap | Growing | Shrinking |
| Older checkpoint Elo | Declining | Stable |
| KL divergence | Stable, <0.1 | Spikes >0.2 |

Fixed references (base_model, chaos_bot, etc.) don't change, so any movement in their Elo reflects learning by the training checkpoints. The Elo gap (best checkpoint minus base_model) measures how much better the trained model is.
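
As a worked illustration of that check, a minimal sketch is below; the snapshot format is an assumption, and `analyze_elo.py` produces the real report.

```python
def learning_signal(snapshots: list[dict]) -> dict:
    """Compare first and last Elo snapshots of the fixed references.

    Each snapshot is assumed to map a player name to its current Elo.
    """
    first, last = snapshots[0], snapshots[-1]
    # Fixed references only lose Elo if newer checkpoints are beating them.
    base_drop = first["base_model"] - last["base_model"]
    # The best-checkpoint-minus-base_model gap should grow over training.
    gap_growth = (last["best_checkpoint"] - last["base_model"]) - (
        first["best_checkpoint"] - first["base_model"]
    )
    return {
        "base_model_elo_drop": base_drop,
        "gap_growth": gap_growth,
        "learning": base_drop > 0 and gap_growth > 0,
    }

print(learning_signal([
    {"base_model": 1500, "best_checkpoint": 1500},
    {"base_model": 1440, "best_checkpoint": 1580},
]))
# base_model dropped 60 Elo and the gap grew by 140: a healthy learning signal.
```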

Performance

| Metric | Target | Action if Missed |
| --- | --- | --- |
| Rollout p95 duration | <120s | Check inference engine |
| Extraction rate | >95% | Check logits processor |
| Error rate | <1% | Check Axiom errors |
| Grad norm | <50 | Policy may be unstable |
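
A minimal sketch of turning this table into an automated check; the observed values below are made up, and in practice they come from the wandb_cli.py / axiom_cli.py output above.

```python
# Targets and actions copied from the table above.
TARGETS = {
    "rollout_p95_s":   (120.0, "max", "Check inference engine"),
    "extraction_rate": (0.95,  "min", "Check logits processor"),
    "error_rate":      (0.01,  "max", "Check Axiom errors"),
    "grad_norm":       (50.0,  "max", "Policy may be unstable"),
}

def check(observed: dict) -> list[str]:
    """Return the suggested action for every metric that misses its target."""
    actions = []
    for name, (target, kind, action) in TARGETS.items():
        value = observed[name]
        ok = value < target if kind == "max" else value > target
        if not ok:
            sign = "<" if kind == "max" else ">"
            actions.append(f"{name}={value} misses target {sign}{target}: {action}")
    return actions

# Made-up observation: rollouts are too slow, everything else is healthy.
print(check({"rollout_p95_s": 140, "extraction_rate": 0.97,
             "error_rate": 0.004, "grad_norm": 35}))
```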