Claude Code Plugins

Community-maintained marketplace


Design, configure, launch, and analyze ablation sweeps for GRPO training. Use for hypothesis testing, hyperparameter experiments, and systematic comparisons.

Install Skill

1. Download the skill
2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reviewing its instructions before using it.

SKILL.md

---
name: perform-sweep
description: Design, configure, launch, and analyze ablation sweeps for GRPO training. Use for hypothesis testing, hyperparameter experiments, and systematic comparisons.
---

Perform Sweep

End-to-end workflow for running ablation experiments on the Diplomacy GRPO training pipeline.

Quick Reference

| Phase | Action | Command |
|-------|--------|---------|
| Configure | Create sweep.yaml | See YAML Reference |
| Validate | Dry run | `python scripts/launch_sweep.py <path> --dry-run` |
| Info | Show config | `python scripts/launch_sweep.py <path> --info` |
| Launch | Start sweep | `python scripts/launch_sweep.py <path>` |
| Status | Check progress | `python scripts/launch_sweep.py <path> --status` |
| List | List all sweeps | `python scripts/launch_sweep.py --list` |
| Analyze | Compare results | Use experiment-analysis skill |

Workflow

1. Hypothesis Design

  • Review recent experiments in experiments/experiment-tracker.md
  • Identify one variable to test (e.g., horizon length, scoring function)
  • Predict expected outcome
  • Document reasoning in sweep.yaml hypothesis field

2. YAML Configuration

Create experiments/sweeps/<name>/sweep.yaml:

```yaml
metadata:
  name: "my-ablation"
  description: "Testing hypothesis X"
  hypothesis: "Longer horizons should improve strategic play"
  experiment_tag_prefix: "my-ablation"

defaults:
  total_steps: 100

runs:
  A:
    name: "control"
    description: "Baseline configuration"
    config:
      experiment_tag: "${metadata.experiment_tag_prefix}-A"
  B:
    name: "treatment"
    description: "With longer horizon"
    config:
      rollout_horizon_years: 8
      experiment_tag: "${metadata.experiment_tag_prefix}-B"
```

See YAML Reference for full schema.
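The `${metadata.experiment_tag_prefix}` syntax above suggests simple variable interpolation over the sweep config. As an illustrative sketch only (not the actual `launch_sweep.py` implementation, whose resolution rules are not documented here), such placeholders could be expanded like this:

```python
import re

def resolve_placeholders(value: str, context: dict) -> str:
    """Replace ${section.key} placeholders with values looked up in context.

    Hypothetical helper for illustration; the real sweep launcher may
    resolve variables differently.
    """
    def lookup(match: re.Match) -> str:
        node = context
        for part in match.group(1).split("."):
            node = node[part]  # raises KeyError on an unknown placeholder
        return str(node)

    return re.sub(r"\$\{([^}]+)\}", lookup, value)

context = {"metadata": {"experiment_tag_prefix": "my-ablation"}}
print(resolve_placeholders("${metadata.experiment_tag_prefix}-A", context))
# my-ablation-A
```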

3. Validate Configuration

```bash
# Show sweep info
python scripts/launch_sweep.py experiments/sweeps/<name>/ --info

# Dry run (validates config, shows what would run)
python scripts/launch_sweep.py experiments/sweeps/<name>/ --dry-run
```

4. Launch and Monitor

```bash
# Launch (fire-and-forget; runs in the cloud)
python scripts/launch_sweep.py experiments/sweeps/<name>/

# Check status anytime
python scripts/launch_sweep.py experiments/sweeps/<name>/ --status

# List all sweeps
python scripts/launch_sweep.py --list
```

5. Analysis

After the sweep completes, use the experiment-analysis skill:

```bash
# Full analysis for each run
uv run python .claude/skills/experiment-analysis/analyze_elo.py <run-name>

# Compare in WandB:
# filter by experiment_tag_prefix (e.g., "my-ablation")
```
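Beyond the WandB UI, a quick local comparison of exported per-run metrics might look like the sketch below. The records, field names, and Elo values are hypothetical stand-ins for whatever your runs actually log:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical exported records; real metrics live in WandB.
records = [
    {"experiment_tag": "my-ablation-A", "elo": 1012},
    {"experiment_tag": "my-ablation-A", "elo": 1031},
    {"experiment_tag": "my-ablation-B", "elo": 1065},
    {"experiment_tag": "my-ablation-B", "elo": 1048},
]

# Group metric values by experiment_tag (control vs. treatment).
by_run = defaultdict(list)
for rec in records:
    by_run[rec["experiment_tag"]].append(rec["elo"])

for tag, elos in sorted(by_run.items()):
    print(f"{tag}: mean elo {mean(elos):.1f} over {len(elos)} games")
```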

Key Features

  • Fire-and-forget: launch, then close your laptop; the sweep runs in the Modal cloud
  • Auto-resume: if a Modal run hits the 24-hour timeout, the sweep automatically respawns
  • Sequential execution: runs one training job at a time (infrastructure constraint)
  • Progress tracking: state is saved after each run for recovery
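The progress-tracking and auto-resume behavior above can be sketched as a checkpoint file of completed run IDs that a respawned process consults before launching anything. The file name and JSON format here are assumptions for illustration, not the real sweep state schema:

```python
import json
from pathlib import Path

def load_completed(state_file: Path) -> set:
    """Read the set of finished run IDs, or an empty set on first launch."""
    if state_file.exists():
        return set(json.loads(state_file.read_text()))
    return set()

def mark_completed(state_file: Path, run_id: str) -> None:
    """Record run_id so a respawned sweep process will skip it."""
    done = load_completed(state_file)
    done.add(run_id)
    state_file.write_text(json.dumps(sorted(done)))

def run_sweep(state_file: Path, run_ids: list) -> list:
    """Execute runs sequentially, skipping any already checkpointed."""
    executed = []
    for run_id in run_ids:
        if run_id in load_completed(state_file):
            continue  # finished before a timeout/respawn
        executed.append(run_id)  # stand-in for launching a training run
        mark_completed(state_file, run_id)
    return executed
```

On a respawn, `run_sweep` re-reads the state file and resumes with only the runs that never completed.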

Example Sweeps

See existing sweeps in experiments/sweeps/:

  • longer-horizon-inverted-weight-ablation/ - 2x2 ablation on horizon and scoring

Integration

  • Use experiment-analysis skill for post-sweep metrics analysis
  • Results logged to WandB with experiment_tag for filtering
  • Document findings in sweep directory's results.md