Claude Code Plugins

Community-maintained marketplace


Execute the complete experimental workflow - model optimization followed by evaluation - for all runs in a scaffolded experiment. Use after scaffold-experiment to submit jobs to SLURM.

Install Skill

1. Download skill

2. Enable skills in Claude

   Open claude.ai/settings/capabilities and find the "Skills" section

3. Upload to Claude

   Click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reading through its instructions before using it.

SKILL.md

name: run-experiment
description: Execute the complete experimental workflow - model optimization followed by evaluation - for all runs in a scaffolded experiment. Use after scaffold-experiment to submit jobs to SLURM.

Your Task

Orchestrate experiment execution by reading tool specifications from experiment_summary.yaml and calling the appropriate tool modules sequentially:

  1. Read experiment_summary.yaml to identify tools being used
  2. Execute model optimization (fine-tuning) for all runs
  3. Wait for optimization to complete (REQUIRED)
  4. Execute model evaluation for all runs

This ensures the entire experiment runs from training through evaluation with proper dependency management.

Prerequisites

  • experiment_summary.yaml exists (from design-experiment skill)
  • Scaffolding complete (from scaffold-experiment skill)
  • SLURM cluster access

Workflow

High-Level Steps

  1. Locate experiment - Find experiment directory (current dir or ask user)
  2. Verify scaffolding - Ensure configs exist for optimization and evaluation
  3. Read tool specifications - Parse experiment_summary.yaml "tools" section
  4. Execute optimization - Call optimizer module (torchtune)
  5. Execute evaluation - Call evaluator module (inspect) - MUST wait for optimization
  6. Create orchestration log - Document process in run-experiment.log
  7. Report combined summary - Show complete status

Tool Modules

Optimizer modules: See optimizers/ for tool-specific execution logic

  • Currently supported: torchtune (fine-tuning)
  • Future: DSPy (prompt optimization), custom trainers

Evaluator modules: See evaluators/ for tool-specific execution logic

  • Currently supported: inspect-ai
  • Future: custom evaluation frameworks

Detailed Workflows

For step-by-step execution details, see the following sections.

Reading Tool Specifications

Parse experiment_summary.yaml "tools" section to identify frameworks:

Expected format:

tools:
  preparation: "torchtune"
  evaluation: "inspect-ai"

Tool to module mapping: each tool name maps to its execution module under optimizers/ (for preparation) or evaluators/ (for evaluation); see Tool Modules above.

If the tools section is missing: assume torchtune + inspect-ai (backward compatibility).
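A minimal parsing sketch, assuming PyYAML is available (the helper name is illustrative; the default mapping comes from the backward-compatibility rule above):

```python
# Sketch: read the "tools" section from experiment_summary.yaml.
# Falls back to torchtune + inspect-ai when the section is absent.
from pathlib import Path
import yaml

DEFAULT_TOOLS = {"preparation": "torchtune", "evaluation": "inspect-ai"}

def read_tools(experiment_dir: str) -> dict:
    summary_path = Path(experiment_dir) / "experiment_summary.yaml"
    if not summary_path.exists():
        # Per the error-handling rules below: suggest design-experiment and stop.
        raise FileNotFoundError(f"{summary_path} not found - run the design-experiment skill first")
    summary = yaml.safe_load(summary_path.read_text()) or {}
    return summary.get("tools", DEFAULT_TOOLS)
```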

Sequential Execution

CRITICAL: Evaluation MUST wait for optimization to complete.

Why? Evaluation jobs need optimized model checkpoints.

Implementation (a minimal polling sketch follows this list):

  1. Execute optimizer module (torchtune fine-tuning)
  2. Monitor until ALL optimization jobs complete
  3. Only then execute evaluator module (inspect evaluation)
  4. Monitor until ALL evaluation jobs complete
  5. Report combined results
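A minimal sketch of the wait step, assuming jobs were submitted with sbatch and their IDs captured (the polling interval and the submit_* helper names are illustrative, not part of this skill):

```python
# Sketch: block until every optimization job has left the SLURM queue,
# and only then submit evaluation jobs.
import subprocess
import time

def wait_for_jobs(job_ids: list[str], poll_seconds: int = 60) -> None:
    """Poll squeue until none of the given jobs are pending or running."""
    while job_ids:
        result = subprocess.run(
            ["squeue", "-h", "-j", ",".join(job_ids)],
            capture_output=True, text=True,
        )
        if not result.stdout.strip():
            return  # completed jobs drop out of the squeue listing
        time.sleep(poll_seconds)

# opt_jobs = submit_optimization_runs(...)   # hypothetical optimizer-module call
# wait_for_jobs(opt_jobs)                    # REQUIRED before evaluation
# eval_jobs = submit_evaluation_runs(...)    # hypothetical evaluator-module call
```

Note that leaving the squeue listing only means a job has finished; checking its final state (for example with sacct) is still needed to distinguish completed runs from failed ones before deciding whether to proceed to evaluation.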

Logging

Create orchestration log at {experiment_dir}/run-experiment.log:

Log format:

[YYYY-MM-DD HH:MM:SS] ACTION: Description
Details: {specifics}
Result: {outcome}

What to log (a small logging helper is sketched after this list):

  • Experiment discovery and validation
  • Scaffolding verification
  • Tool module invocations (timestamps, results, durations)
  • Completion status (successes/failures)
  • Errors or warnings
  • Final combined summary
  • Paths to results and module logs
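A small helper in the spirit of the format above (a sketch; the function name and arguments are illustrative):

```python
# Sketch: append one entry to run-experiment.log in the format shown above.
from datetime import datetime
from pathlib import Path

def log_action(experiment_dir: str, action: str, description: str,
               details: str, result: str) -> None:
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    entry = (
        f"[{stamp}] {action.upper()}: {description}\n"
        f"Details: {details}\n"
        f"Result: {result}\n\n"
    )
    with open(Path(experiment_dir) / "run-experiment.log", "a") as log_file:
        log_file.write(entry)
```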

Expected Outputs

After successful execution:

Logs created:

  • run-experiment.log - Orchestration log
  • Optimizer module logs (e.g., detailed fine-tuning execution)
  • Evaluator module logs (e.g., detailed evaluation execution)

Status updated:

  • Run tracking logs updated with job IDs, timestamps, states
  • All execution details recorded in module logs

Artifacts created:

  • Model checkpoints from optimization
  • Evaluation logs from evaluation

Logging: All actions logged to {experiment_dir}/run-{torchtune|inspect}.log (see logging.md)


Error Handling

If experiment_summary.yaml not found:

  • Suggest running design-experiment skill first
  • Do not proceed

If scaffolding incomplete:

  • Report which parts are missing
  • Suggest running scaffold-experiment skill
  • Can proceed with optimization only if just the evaluation configs are missing

If optimization fails:

  • Log failure details
  • Do NOT proceed to evaluation (missing model checkpoints)
  • Report failure and stop

If evaluation fails:

  • Log failure details
  • Optimization results still valid
  • Report partial success

If user cancels:

  • SLURM jobs continue running independently
  • Can resume monitoring by re-running skill

Validation Checklist

Before reporting success, verify (a minimal check is sketched after this list):

  • ✓ experiment_summary.yaml found and read
  • ✓ Scaffolding verified
  • ✓ Optimizer module executed and completed
  • ✓ Evaluator module executed and completed
  • ✓ Model checkpoints exist
  • ✓ Evaluation logs exist
  • ✓ Orchestration log created
  • ✓ All module logs exist
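One way such a check might look (a sketch; the checkpoint file extensions and directory layout are assumptions, not part of the scaffolded structure described here):

```python
# Sketch: verify key artifacts exist before reporting success.
from pathlib import Path

def validation_problems(experiment_dir: str) -> list[str]:
    exp = Path(experiment_dir)
    problems = []
    if not (exp / "experiment_summary.yaml").exists():
        problems.append("experiment_summary.yaml not found")
    if not (exp / "run-experiment.log").exists():
        problems.append("orchestration log not created")
    if not any(exp.rglob("*.safetensors")) and not any(exp.rglob("*.pt")):
        problems.append("no model checkpoints found")  # assumed extensions
    return problems  # an empty list means the checklist passed
```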

Output Summary

Provide comprehensive summary after completion:

## Run Experiment Complete

Experiment: `{experiment_dir}`

### Optimization Results

✓ {N}/{M} runs completed successfully
Duration: {duration}

**Completed runs:** [list with times]
**Failed runs:** [list with errors]
**Model checkpoints:** {paths}

### Evaluation Results

✓ {N}/{M} evaluations completed successfully
Duration: {duration}

**Completed evaluations:** [list with times]
**Failed evaluations:** [list with errors]
**Evaluation logs:** {paths}

### Total Time

Complete workflow: {total_duration}
- Optimization: {opt_duration}
- Evaluation: {eval_duration}

### Next Steps

1. View results: `inspect view --port=$(get_free_port)`
2. Export data: `inspect log export ...`
3. Analyze results (see experiment_summary.yaml for configuration)

Important Notes

Orchestration principles:

  • This skill orchestrates rather than implements
  • Each tool module maintains its own detailed log
  • Sequential execution is mandatory (evaluation requires optimization complete)
  • Partial success is acceptable (some runs succeed, others fail)
  • Tool modules can be executed independently if needed

Relationship to other skills:

  • Before: design-experiment, scaffold-experiment
  • After: analyze-experiment (planned)
  • Standalone: Individual tool modules can run independently

Resumability:

  • Re-running run-experiment is safe
  • Tool modules check for completed jobs (see the sketch below)
  • Won't re-submit successful jobs
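A resumability check might look roughly like this (a sketch; the per-run checkpoint layout is an assumption):

```python
# Sketch: skip runs whose checkpoints already exist, so re-running the
# skill never re-submits successful jobs.
from pathlib import Path

def runs_to_submit(experiment_dir: str, run_names: list[str]) -> list[str]:
    pending = []
    for run in run_names:
        ckpt_dir = Path(experiment_dir) / run / "checkpoints"  # assumed layout
        if not ckpt_dir.exists() or not any(ckpt_dir.iterdir()):
            pending.append(run)
    return pending
```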