| name | doe-suite-experiments |
| description | Run remote experiments (and process results) using the doe-suite framework. Use when running experiments remotely or when processing experiment results to create plots or other diagrams. |
This skill orchestrates the complete workflow for running DoE-Suite experiments:
- Validates suite configuration and launches remote infrastructure
- Executes experiments on AWS EC2, ETHZ Euler, or existing remote machines
- Monitors progress and automatically collects results
- Processes results with ETL (Extract-Transform-Load) pipelines
- Generates visualizations and aggregated data
- Creates and edits suite designs and ETL configurations
Key Concepts
DoE-Suite is a tool for remote experiment management following Design of Experiments (DoE) methodology:
- Suite: A collection of experiments defined in a YAML configuration file
- Experiment: A set of runs with different configurations
- Factor: A parameter that varies between runs (e.g., model, algorithm)
- Level: A specific value a factor takes in a run (e.g., mnist, resnet18)
- Run: A single execution with a specific combination of factor levels
- ETL Pipeline: Extract-Transform-Load pipeline that processes results locally
- Super-ETL: Aggregates and processes results across multiple suite runs
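To make these terms concrete, a minimal suite design might look like the following sketch (it follows the doe-suite design schema, but the experiment name, command, and factor values are hypothetical):

```yaml
# doe-suite-config/designs/example.yml (hypothetical suite)
exp_bench:                         # one experiment in this suite
  n_repetitions: 3                 # every run is repeated 3 times
  host_types:
    small:
      n: 1
      $CMD$: "echo model=[% my_run.model %]"
  base_experiment:
    algorithm: kzg                 # constant across all runs
    model: $FACTOR$                # a factor: varies between runs
  factor_levels:                   # levels: run 0 uses mnist, run 1 uses resnet18
    - model: mnist
    - model: resnet18
```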
Prerequisites
The doe-suite will report configuration issues when invoked. Key requirements:
- .envrc file with environment variables:
  - DOES_PROJECT_DIR: Project root directory
  - DOES_PROJECT_ID_SUFFIX: Project identifier suffix
  - DOES_SSH_KEY_NAME: AWS SSH key name
  - DOES_CLOUD: Cloud provider (aws)
  - AWS_PROFILE: AWS credentials profile
  - AWS_REGION: AWS region to deploy in
- AWS credentials properly configured
- SSH key must exist in the target AWS region
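A minimal .envrc sketch (all values are placeholders for your own setup):

```bash
export DOES_PROJECT_DIR="$(pwd)"        # project root directory
export DOES_PROJECT_ID_SUFFIX="alice"   # project identifier suffix
export DOES_SSH_KEY_NAME="my-aws-key"   # must exist in AWS_REGION
export DOES_CLOUD="aws"                 # cloud provider
export AWS_PROFILE="default"            # AWS credentials profile
export AWS_REGION="eu-central-1"        # region to deploy in
```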
Working Directory: All doe-suite commands must be run from the doe-suite directory within your project.
Available Suites
Suites are defined as YAML files in doe-suite-config/designs/. To discover available suites:
# List all suite design files
ls doe-suite-config/designs/*.yml
# Get information about configured suites
make info
# View what a specific suite will execute
make design suite=<suite-name>
Each suite typically defines:
- Host types: Machine sizes and configurations (e.g., small, medium, large)
- Factors: Parameters that vary between runs (e.g., model, algorithm, dataset)
- Experiments: Collections of runs with different factor combinations
- ETL pipelines: How to process and visualize results
Machine configurations are defined in doe-suite-config/group_vars/<host-type>/main.yml.
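For illustration, such a file might look roughly like this (variable names vary by doe-suite version; treat these keys as assumptions and check doe-suite/docs for the exact schema):

```yaml
# doe-suite-config/group_vars/small/main.yml (illustrative sketch)
instance_type: t3.medium              # EC2 instance size
ec2_image_id: ami-0123456789abcdef0   # hypothetical AMI; must exist in AWS_REGION
ec2_volume_size: 32                   # root volume size in GiB
```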
Workflow Steps
1. Design and Validate Suite (Optional)
Before running experiments, you can validate your suite design:
# View all commands that will be executed
make design suite=<suite-name>
# Validate design and show with default values
make design-validate suite=<suite-name>
This helps catch configuration errors before launching expensive cloud resources.
2. Launch Experiment
source .envrc && make run suite=<suite-name> id=new
What happens automatically:
- Creates a new suite ID (Unix timestamp, e.g., 1767685699)
- Sets up cloud networking (VPC, subnet, security groups, instance)
- Launches EC2 instance(s) with appropriate size
- Installs dependencies (Rust, AWS CLI, task spooler)
- Clones code from GitHub (main branch)
- Builds the project with cargo build --release
- Downloads model artifacts from S3
- Enqueues all experiment jobs to task spooler
- Starts execution sequentially
Expected Duration:
- Small suites: 15-30 minutes per model
- Medium suites: 1-3 hours per model
- Large suites: 4-12 hours per model
Options:
- id=new: Start a new run
- id=last: Continue the most recent run
- id=<suite_id>: Continue a specific run
- make run-keep: Keep instances running after completion (don't forget to clean up!)
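If you keep instances running with make run-keep and SSH into one, you can inspect the job queue directly. A minimal sketch, assuming the task spooler binary is available as ts (some distributions install it as tsp):

```bash
ts        # list queued/running/finished jobs
ts -c 0   # print the captured output of job 0
```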
3. Monitor Progress
Continuously monitor to detect issues early:
source .envrc && make run suite=<suite-name> id=last
This command exits once all results are in. Until then, it keeps checking the status of the experiments and downloading finished results, up to a timeout; after a timeout, re-invoke the command to continue monitoring.
Output at the end shows:
- Total jobs completed/remaining (e.g., "3/3 jobs")
- Progress percentage
- ETL pipeline warnings/errors
- Current job status (queued/running/finished)
4. Review Raw Results
Results are automatically downloaded to:
doe-suite-results/<suite-name>_<suite-id>/exp_<exp-name>/run_<N>/rep_<R>/<host-type>/host_<H>/
Directory structure:
- <suite-name>_<suite-id>: Suite name and timestamp ID
- exp_<exp-name>: Experiment name from the design
- run_<N>: Run index (starts at 0); each run is a different factor-level combination
- rep_<R>: Repetition index (starts at 0)
- <host-type>: Machine type (small/medium/large)
- host_<H>: Host index (starts at 0)
Key files:
- output.csv: Experiment metrics (prover time, verifier time, proof size)
- runtime_info.yaml: Wall time and max memory (RSS)
- stdout.log: Program output
- stderr.log: Compilation warnings (usually normal)
Example:
doe-suite-results/nocom-small_1767685699/exp_poly_kzg_small/run_0/rep_0/small/host_0/output.csv
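To inspect results quickly from the shell (paths follow the structure above):

```bash
# Find all metric files for a suite run and pretty-print one of them
find doe-suite-results/<suite-name>_<suite-id> -name output.csv
column -s, -t < doe-suite-results/<suite-name>_<suite-id>/exp_<exp-name>/run_0/rep_0/small/host_0/output.csv
```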
5. Process Results with ETL
Automatic ETL (Per-Suite)
ETL pipelines defined in the suite design run automatically when results are downloaded. They process individual suite results.
To manually re-run (useful for tweaking plots):
make etl suite=<suite-name> id=last
# Or develop/test ETL changes:
make etl-design suite=<suite-name> id=last
Super-ETL (Cross-Suite Aggregation)
Super-ETL configurations (doe-suite-config/super_etl/) aggregate results from multiple suite runs and generate comparative visualizations.
Configuration structure:
$SUITE_ID$:
  suite-name-1: "1767685699"
  suite-name-2: "1748250734"
  suite-name-3: "1748545862"
$PIPELINES$:
  pipeline-name:
    experiments: "*"  # or list specific experiments
    extractors:
      CsvExtractor: {file_regex: ["*.csv"]}
      IgnoreExtractor: {file_regex: ["*.log"]}
    transformers: []
    loaders:
      CsvSummaryLoader: {skip_empty: True}
To add new results:
- Update suite IDs in the super-ETL config file (doe-suite-config/super_etl/<config>.yml)
- Run the super-ETL pipeline:
source .envrc && make etl-super config=<config-name> out=<output-dir> pipelines="<pipeline1>,<pipeline2>"
Common output locations:
- <output-dir>/<config>/overview/overview.csv
- <output-dir>/<config>/<pipeline-name>/*.pdf
- <output-dir>/<config>/<pipeline-name>/*.html
- <output-dir>/<config>/<pipeline-name>/*.csv
When you add new results to the super-ETL config, verify that the outputs contain the new data by checking the csv files.
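For example, a quick spot-check from the shell (the grep pattern is a placeholder for a value unique to the new suite):

```bash
head <output-dir>/<config>/overview/overview.csv
grep -c "<value-unique-to-new-suite>" <output-dir>/<config>/overview/overview.csv
```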
6. Cleanup
Automatic cleanup: Instances are terminated automatically after suite completion (exit code 0).
Manual cleanup (if suite fails or is interrupted):
make clean
Always verify in AWS Console that instances are terminated!
Common Issues & Solutions
SSH Key Not Found
Error: InvalidKeyPair.NotFound: The key pair 'xxx' does not exist
Solution:
- Verify DOES_SSH_KEY_NAME in .envrc matches a key in the target AWS region
- Update .envrc to use the correct key name
- Ensure AWS_REGION matches where your key exists
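To list which key pairs actually exist in the configured region:

```bash
source .envrc
aws ec2 describe-key-pairs --region "$AWS_REGION" --profile "$AWS_PROFILE"
```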
Region Mismatch
Error: AMI or resources not available in region
Solution:
- Check AWS_REGION in .envrc matches your configuration
- Verify AMI IDs in doe-suite-config/group_vars/<host-type>/main.yml exist in the target region
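To check whether a given AMI is visible in the target region (the AMI ID below is a placeholder):

```bash
source .envrc
aws ec2 describe-images --image-ids ami-0123456789abcdef0 --region "$AWS_REGION" --profile "$AWS_PROFILE"
```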
Code Build Failures
Error: Compilation errors during setup
Solution:
- Ensure the GitHub repository's main branch is in a working state
- Check that all dependencies are specified in the setup role
- Review stderr.log from the failed experiment for specific errors
Experiment Panics
Error: thread 'main' panicked at src/bin/time_circuit.rs
Solution:
- Check that the test.zsh script passes the correct number of arguments to the binary
- Verify the argument order matches what time_circuit.rs expects
- Test locally first with the same parameters
Experiments may complete "successfully" even though the remote code errored; this should be visible in stderr, and the result files should provide more information. Double-check with me whether your proposed fix makes sense. Always commit and push the fix so it gets pulled on the remote server when doe-suite is re-invoked.
ETL Pipeline Errors
Error: file=runtime_info.yaml matches no extractor
Root cause: The ETL pipeline encounters a file that doesn't match any configured extractor.
Solution:
- Ignore the file: Add it to IgnoreExtractor in the suite design:
  $ETL$:
    overview:
      extractors:
        IgnoreExtractor: {file_regex: ["stdout.log", "stderr.log", "runtime_info.yaml"]}
- Extract the file: Add an appropriate extractor (e.g., YamlExtractor) if you need the data:
  extractors:
    YamlExtractor: {file_regex: ["runtime_info.yaml"]}
- Verify files exist: Check that the experiment actually produced the expected output files
Note: ETL runs both automatically (after each job) and during make status. A failing ETL doesn't block experiment execution but prevents result aggregation.
Understanding Results
Result Files
Each experiment produces files based on the commands executed and ETL extractors configured:
Common file types:
- CSV files: Structured data (extracted by CsvExtractor)
- JSON files: Structured data (extracted by JsonExtractor)
- YAML files: Configuration/metadata (extracted by YamlExtractor)
- Log files: stdout/stderr output (usually ignored or extracted separately)
- Custom files: Any output your experiment produces
Runtime metrics (if using /usr/bin/time):
- Wall time: Total elapsed time
- max_rss: Maximum resident set size (memory usage)
- CPU usage: User and system time
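A sketch of how these metrics are typically captured with GNU time (the binary path and arguments are placeholders; the suite's actual wrapper and output format may differ):

```bash
# -v prints wall time, max RSS, and user/system CPU time to stderr
/usr/bin/time -v ./target/release/time_circuit <args> 2> runtime_info.txt
```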
ETL Processing Stages
Extract: Aggregate raw result files into a pandas DataFrame
- Each file matches one extractor by regex pattern
- Files that don't match any extractor cause errors (add them to IgnoreExtractor)
Transform: Process and aggregate data
- RepAggTransformer: Aggregate over repetitions (mean, std, etc.)
- MergeRowsTransformer: Combine data from multiple sources
- Custom transformers: Project-specific data processing
Load: Generate outputs
- CsvSummaryLoader: Save processed data as CSV
- Plotting loaders: Generate visualizations
- Custom loaders: Project-specific outputs
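Putting the three stages together, a per-suite $ETL$ block might look like this (a sketch; the pipeline name, metric column, and transformer options are illustrative, so check doe-suite/docs for the exact schema):

```yaml
$ETL$:
  overview:
    experiments: "*"
    extractors:
      CsvExtractor: {file_regex: ["*.csv"]}
      IgnoreExtractor: {file_regex: ["stdout.log", "stderr.log"]}
    transformers:
      - name: RepAggTransformer
        data_columns: [prover_time]   # hypothetical metric column, aggregated over repetitions
    loaders:
      CsvSummaryLoader: {skip_empty: True}
```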
Important Notes
⚠️ Cost Warning: Running experiments on AWS incurs costs. EC2 instances are charged per hour.
- Check AWS EC2 pricing for your region and instance types
- Set up billing alerts in AWS Console
- Always verify instances are terminated after experiments
💡 Infrastructure Cleanup:
- Automatic: Instances terminate when suite completes successfully
- Manual: Use make clean if the suite fails or is interrupted
- Verify: Always check AWS Console that resources are terminated
🔄 Retrying Failed Runs:
- Fix code issues in your repository
- Re-run with id=new to create a fresh run
- Or continue with id=last if the suite partially completed
📊 Comparing Results:
- Use Super-ETL to aggregate results from multiple suite runs
- Update suite IDs in doe-suite-config/super_etl/<config>.yml
- Run make etl-super config=<config> out=<output-dir>
🐛 Debugging:
- Use make design suite=<suite-name> to validate before running
- Check doe-suite-results/<suite>_<id>/ for stdout/stderr logs
- Use make run-keep to keep instances alive for debugging (remember to clean up!). This is particularly useful when running new code that may have bugs requiring re-runs.
Other tips:
- Check the docs in doe-suite/docs/ for more details on editing ETL workflows, server provisioning, and other doe-suite syntax.
- It is possible to run multiple suites in parallel; just run them in different sub-processes or agents.
- Do not monitor with bash sleep loops; make run suite=<suite-name> id=last is the correct way to monitor progress.
Example Workflow
# 1. Navigate to doe-suite directory
cd doe-suite
source .envrc
# 2. Validate suite design
make design suite=<suite-name>
# 3. Launch experiment
make run suite=<suite-name> id=new
# 4. Monitor progress periodically
make status suite=<suite-name> id=last
# 5. Once complete (X/X jobs), review raw results
ls -lh ../doe-suite-results/<suite-name>_<id>/
# 6. Process results with Super-ETL (if configured)
make etl-super config=<config-name> out=../doe-suite-results-aggregated
# 7. View aggregated results
cat ../doe-suite-results-aggregated/<config>/overview/overview.csv
# 8. Verify cleanup (should be automatic)
# Check AWS Console or run: make clean
Additional Commands
# Get help and see all available commands
make help
# Get info about configured suites
make info
# Manually trigger ETL on existing results
make etl suite=<suite-name> id=<suite-id>
# Clean up all cloud resources
make clean
# Clean up only cloud resources (keep local results)
make clean-cloud