---
name: perform-sweep
description: Design, configure, launch, and analyze ablation sweeps for GRPO training. Use for hypothesis testing, hyperparameter experiments, and systematic comparisons.
---
# Perform Sweep

End-to-end workflow for running ablation experiments on the Diplomacy GRPO training pipeline.
## Quick Reference

| Phase | Action | Command |
|---|---|---|
| Configure | Create `sweep.yaml` | See YAML Reference |
| Validate | Dry run | `python scripts/launch_sweep.py <path> --dry-run` |
| Info | Show config | `python scripts/launch_sweep.py <path> --info` |
| Launch | Start sweep | `python scripts/launch_sweep.py <path>` |
| Status | Check progress | `python scripts/launch_sweep.py <path> --status` |
| List | List all sweeps | `python scripts/launch_sweep.py --list` |
| Analyze | Compare results | Use the experiment-analysis skill |
## Workflow

### 1. Hypothesis Design
- Review recent experiments in `experiments/experiment-tracker.md`
- Identify one variable to test (e.g., horizon length, scoring function)
- Predict the expected outcome
- Document the reasoning in the sweep.yaml `hypothesis` field
### 2. YAML Configuration

Create `experiments/sweeps/<name>/sweep.yaml`:
```yaml
metadata:
  name: "my-ablation"
  description: "Testing hypothesis X"
  hypothesis: "Longer horizons should improve strategic play"
  experiment_tag_prefix: "my-ablation"

defaults:
  total_steps: 100

runs:
  A:
    name: "control"
    description: "Baseline configuration"
    config:
      experiment_tag: "${metadata.experiment_tag_prefix}-A"
  B:
    name: "treatment"
    description: "With longer horizon"
    config:
      rollout_horizon_years: 8
      experiment_tag: "${metadata.experiment_tag_prefix}-B"
```
See YAML Reference for full schema.
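The launcher merges `defaults` into each run's `config` and resolves `${metadata.*}` placeholders. A minimal sketch of how that resolution could work, assuming a plain PyYAML load (the actual `launch_sweep.py` logic may differ):

```python
import re
import yaml
from pathlib import Path

def load_sweep(sweep_dir: str) -> dict:
    """Load sweep.yaml, merge defaults into each run, and resolve ${metadata.*} placeholders."""
    spec = yaml.safe_load(Path(sweep_dir, "sweep.yaml").read_text())
    metadata = spec.get("metadata", {})
    defaults = spec.get("defaults", {})

    def resolve(value):
        # Replace ${metadata.key} with the corresponding metadata value; leave unknown keys intact.
        if isinstance(value, str):
            return re.sub(
                r"\$\{metadata\.(\w+)\}",
                lambda m: str(metadata.get(m.group(1), m.group(0))),
                value,
            )
        return value

    runs = {}
    for run_id, run in spec.get("runs", {}).items():
        config = {**defaults, **run.get("config", {})}
        runs[run_id] = {**run, "config": {k: resolve(v) for k, v in config.items()}}
    return {"metadata": metadata, "runs": runs}
```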
### 3. Validate Configuration

```bash
# Show sweep info
python scripts/launch_sweep.py experiments/sweeps/<name>/ --info

# Dry run (validates config, shows what would run)
python scripts/launch_sweep.py experiments/sweeps/<name>/ --dry-run
```
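The dry run is where configuration mistakes should surface, before anything hits the cloud. As a rough illustration of the kind of checks involved (a sketch, not `launch_sweep.py`'s actual validation; field names follow the YAML schema above):

```python
def validate_sweep(spec: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the sweep looks launchable."""
    problems = []
    for field in ("name", "hypothesis", "experiment_tag_prefix"):
        if not spec.get("metadata", {}).get(field):
            problems.append(f"metadata.{field} is missing")
    if not spec.get("runs"):
        problems.append("sweep defines no runs")
    tags = [run.get("config", {}).get("experiment_tag") for run in spec.get("runs", {}).values()]
    if len(tags) != len(set(tags)):
        problems.append("experiment_tag values must be unique across runs")
    return problems
```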
### 4. Launch and Monitor

```bash
# Launch (fire-and-forget - runs in cloud)
python scripts/launch_sweep.py experiments/sweeps/<name>/

# Check status anytime
python scripts/launch_sweep.py experiments/sweeps/<name>/ --status

# List all sweeps
python scripts/launch_sweep.py --list
```
### 5. Analysis

After the sweep completes, use the experiment-analysis skill:

```bash
# Full analysis for each run
uv run python .claude/skills/experiment-analysis/analyze_elo.py <run-name>

# Compare in WandB
# Filter by experiment_tag_prefix (e.g., "my-ablation")
```
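Runs can also be compared programmatically rather than in the WandB UI by pulling every run whose `experiment_tag` starts with the sweep's prefix. A minimal sketch; the `entity/project` path and the `train/elo` summary key are assumptions, substitute your own:

```python
import wandb

def compare_sweep(prefix: str, project: str = "my-entity/diplomacy-grpo"):
    """Print the final summary metric for every run tagged with the sweep prefix."""
    api = wandb.Api()
    # WandB filters use MongoDB-style query operators.
    runs = api.runs(project, filters={"config.experiment_tag": {"$regex": f"^{prefix}"}})
    for run in sorted(runs, key=lambda r: r.config.get("experiment_tag", "")):
        print(run.config.get("experiment_tag"), run.summary.get("train/elo"))
```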
## Key Features

- Fire-and-forget: launch, close your laptop, and the sweep keeps running in Modal cloud
- Auto-resume: if Modal times out (24-hour max), the sweep automatically respawns
- Sequential execution: runs one training job at a time (infrastructure constraint)
- Progress tracking: state is saved after each run for recovery (see the sketch below)
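The recovery behavior above can be pictured as a small state file that the launcher updates after each run and consults on respawn. An illustrative sketch, not the actual implementation; the `sweep_state.json` filename is hypothetical and the training call is supplied by the caller:

```python
import json
from pathlib import Path
from typing import Callable

def run_sweep(sweep_dir: str, runs: dict, run_training: Callable[[dict], None]) -> None:
    """Run each configured training sequentially, skipping runs already marked complete."""
    state_path = Path(sweep_dir) / "sweep_state.json"  # hypothetical state-file name
    state = json.loads(state_path.read_text()) if state_path.exists() else {"completed": []}

    for run_id, run in runs.items():
        if run_id in state["completed"]:
            continue  # already finished before a previous timeout/respawn
        run_training(run["config"])  # caller-supplied function that blocks until one training job finishes
        state["completed"].append(run_id)
        state_path.write_text(json.dumps(state, indent=2))  # checkpoint progress for recovery
```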
## Example Sweeps

See existing sweeps in `experiments/sweeps/`:

- `longer-horizon-inverted-weight-ablation/` - 2x2 ablation on horizon and scoring
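A 2x2 ablation like that one crosses two factors into four runs; the grid can be written out by hand in sweep.yaml or generated. A sketch of the latter, where the `inverted_weight_scoring` parameter name and the horizon values are assumptions (only `rollout_horizon_years: 8` appears in the schema above):

```python
from itertools import product

def make_2x2_runs(prefix: str) -> dict:
    """Build the four run entries for a horizon x scoring ablation grid."""
    runs = {}
    for label, (horizon, inverted) in zip("ABCD", product([4, 8], [False, True])):
        runs[label] = {
            "name": f"horizon{horizon}-{'inverted' if inverted else 'standard'}",
            "config": {
                "rollout_horizon_years": horizon,
                "inverted_weight_scoring": inverted,  # hypothetical parameter name
                "experiment_tag": f"{prefix}-{label}",
            },
        }
    return runs
```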
## Integration

- Use the `experiment-analysis` skill for post-sweep metrics analysis
- Results are logged to WandB with `experiment_tag` for filtering
- Document findings in the sweep directory's `results.md`