Use this skill when:
- Validating ML baselines before novel methods (Deep Research SOP Pipeline D)
- Reproducing experiments for NeurIPS/ICML challenges
- Verifying SOTA claims before building on prior work
- Creating reproducibility packages for Quality Gate 1
- Debugging performance gaps between published and reproduced results

Do not use for:
- Quick proof-of-concept implementations (use prototyping instead)
- Projects where exact reproducibility is not critical (demos)
- Papers without code or detailed methodology
- Industry benchmarks without academic rigor

Success criteria:
- Results within ±1% of published metrics
- 3+ successful reproductions in fresh Docker
- All 47+ hyperparameters documented
- Reproducibility package tested independently
- Dataset checksums verified (SHA256)
- Random seeds documented, deterministic results
Rules:
- NEVER claim reproduction without statistical validation (t-test, CI)
- ALWAYS document exact framework versions (pip freeze, conda export)
- NEVER skip random seed validation (test 3+ runs)
- ALWAYS verify dataset integrity (SHA256, sample counts, splits)
- NEVER assume default hyperparameters (extract from paper/code)
name: baseline-replication
description: "Replicate published ML baseline experiments with exact reproducibility (±1% tolerance) for Deep Research SOP Pipeline D. Use when validating baselines, reproducing experiments, verifying published results, or preparing for novel method development."
version: 1.0.0
category: research
tags:
  - research
  - analysis
  - planning
author: ruv
Baseline Replication
Overview
Replicates published machine learning baseline methods with exact reproducibility, ensuring results match within ±1% tolerance. This skill implements Deep Research SOP Pipeline D baseline validation, which is a prerequisite for developing novel methods.
Prerequisites
- Python 3.8+ with PyTorch/TensorFlow
- Docker (for reproducibility)
- Git and Git LFS
- Access to datasets (HuggingFace, academic repositories)
What This Skill Does
- Extracts methodology from papers and code repositories
- Validates datasets match baseline specifications exactly
- Implements baseline with exact hyperparameters
- Runs experiments with deterministic settings
- Validates results within ±1% statistical tolerance
- Creates reproducibility package tested in fresh Docker environment
- Generates Quality Gate 1 validation checklist
Quick Start (30 minutes)
Basic Replication
# 1. Specify baseline to replicate
BASELINE_PAPER="BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al., 2019)"
BASELINE_CODE="https://github.com/google-research/bert"
TARGET_METRIC="Accuracy on SQuAD 2.0"
PUBLISHED_RESULT=0.948
# 2. Run replication workflow
./scripts/replicate-baseline.sh \
--paper "$BASELINE_PAPER" \
--code "$BASELINE_CODE" \
--metric "$TARGET_METRIC" \
--expected "$PUBLISHED_RESULT"
# 3. Review results
cat output/baseline-bert/replication-report.md
Expected output:
✓ Paper analyzed: Extracted 47 hyperparameters
✓ Dataset validated: SQuAD 2.0 matches baseline
✓ Implementation complete: 12 BERT layers, 110M parameters
✓ Training complete: 3 epochs, 26.3 GPU hours
✓ Results validated: 0.945 vs 0.948 (within ±1% tolerance)
✓ Reproducibility verified: 3/3 fresh reproductions successful
→ Quality Gate 1: APPROVED
Step-by-Step Guide
Phase 1: Paper Analysis (15 minutes)
Extract Methodology
# Coordinate with researcher agent
./scripts/analyze-paper.sh --paper "arXiv:2103.00020"
The script extracts:
- Model architecture (layers, hidden sizes, attention heads)
- Training hyperparameters (learning rate, batch size, warmup)
- Optimization details (optimizer type, weight decay, dropout)
- Dataset specifications (splits, preprocessing, tokenization)
- Evaluation metrics (primary and secondary)
Output: baseline-specification.md with all extracted details
Validate Completeness
# Check for missing hyperparameters
./scripts/validate-spec.sh baseline-specification.md
Common Missing Details:
- Learning rate schedule (linear warmup, cosine decay)
- Random seeds (NumPy, PyTorch, Python)
- Hardware specifications (GPU type, memory)
- Framework versions (PyTorch 1.7 vs 1.13 numerical differences)
If details missing:
- Check paper supplements
- Check official code config files
- Check GitHub issues
- Contact authors
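Conceptually, the completeness check is a required-keys scan over the extracted specification. The sketch below is illustrative only; the field names and schema are assumptions, not the actual format consumed by validate-spec.sh.
```python
# Illustrative spec-completeness scan; field names are assumptions, not the real schema.
REQUIRED_FIELDS = [
    "num_layers", "hidden_size", "num_attention_heads",
    "learning_rate", "lr_schedule", "warmup_steps",
    "batch_size", "num_epochs", "weight_decay", "dropout",
    "optimizer", "random_seed", "framework_version",
]

def find_missing(spec: dict) -> list:
    """Return required hyperparameters that are absent or null in the extracted spec."""
    return [field for field in REQUIRED_FIELDS if spec.get(field) is None]

# Example: a partially extracted spec gets flagged before implementation starts.
spec = {"num_layers": 12, "hidden_size": 768, "learning_rate": 5e-5}
missing = find_missing(spec)
if missing:
    print(f"Missing {len(missing)} hyperparameters: {', '.join(missing)}")
```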
Phase 2: Dataset Validation (20 minutes)
Coordinate with data-steward Agent
# Validate dataset matches baseline specs
./scripts/validate-dataset.sh \
--dataset "SQuAD 2.0" \
--splits "train:130k,dev:12k" \
--preprocessing "WordPiece tokenization, max_length=384"
data-steward checks:
- Exact dataset version (v2.0, not v1.1)
- Sample counts match (training: 130,319 examples)
- Data splits match (80/10/10 vs 90/10)
- Preprocessing matches (lower-casing, accent stripping)
- Checksum validation (SHA256 hashes)
Output: dataset-validation-report.md
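A minimal integrity check combines a SHA256 hash with a sample count. The sketch below assumes the standard SQuAD 2.0 JSON layout; the file path and expected checksum are placeholders, not values from the actual data-steward implementation.
```python
# Illustrative dataset integrity check; path and expected checksum are placeholders.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

train_file = Path("data/squad_v2/train-v2.0.json")   # placeholder path
expected_sha256 = "<checksum recorded in MANIFEST>"   # placeholder value

actual = sha256_of(train_file)
if actual != expected_sha256:
    raise ValueError(f"Checksum mismatch for {train_file}: {actual}")

# Count question-answer examples (SQuAD JSON: data -> paragraphs -> qas).
with train_file.open() as f:
    squad = json.load(f)
n_examples = sum(len(p["qas"]) for article in squad["data"] for p in article["paragraphs"])
print(f"Training examples: {n_examples}")  # expected 130,319 for SQuAD 2.0 train
```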
Phase 3: Implementation (2 hours)
Coordinate with coder Agent
# Implement baseline with exact specifications
./scripts/implement-baseline.sh \
--spec baseline-specification.md \
--framework pytorch \
--template resources/templates/bert-base.py
coder creates:
# baseline-bert-implementation.py
import torch
import random
import numpy as np
# CRITICAL: Set all random seeds for reproducibility
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
np.random.seed(42)
random.seed(42)
# CRITICAL: Enable deterministic mode
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.use_deterministic_algorithms(True)
# Exact hyperparameters from paper
config = {
"num_layers": 12,
"hidden_size": 768,
"num_attention_heads": 12,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"learning_rate": 5e-5, # From paper section 4.2
"batch_size": 32, # Per-GPU batch size
"num_epochs": 3,
"warmup_steps": 10000, # 10% of training steps
"weight_decay": 0.01,
"dropout": 0.1
}
Unit Tests:
pytest baseline-bert-implementation_test.py -v
Output: Fully tested implementation matching baseline exactly
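A unit-test file for the seeding and configuration logic might look like the sketch below; the module import in the second test is an assumption about how the implementation exposes its config, so adjust it to the actual file layout.
```python
# Illustrative tests for determinism and config values; module name is an assumption.
import random

import numpy as np
import pytest
import torch


def seed_everything(seed: int = 42) -> None:
    """Re-seed every RNG the training script touches."""
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)


def test_seeding_is_deterministic():
    # Two independently seeded forward passes must agree bit-for-bit.
    seed_everything(42)
    out_a = torch.nn.Linear(768, 768)(torch.randn(4, 768))

    seed_everything(42)
    out_b = torch.nn.Linear(768, 768)(torch.randn(4, 768))

    assert torch.equal(out_a, out_b)


def test_config_matches_paper_spec():
    # Hypothetical import; rename to match the actual implementation module.
    from baseline_bert_implementation import config

    assert config["num_layers"] == 12
    assert config["hidden_size"] == 768
    assert config["learning_rate"] == pytest.approx(5e-5)
```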
Phase 4: Experiment Execution (4-8 hours)
Coordinate with tester Agent
# Run experiments with monitoring
./scripts/run-experiments.sh \
--implementation baseline-bert-implementation.py \
--config config/bert-squad.yaml \
--gpus 4 \
--monitor true
tester executes:
Environment Setup:
# Create deterministic environment
docker build -t baseline-bert:v1.0 -f Dockerfile .
docker run --gpus all -v $(pwd):/workspace baseline-bert:v1.0

Training with Monitoring:
# Log training curves
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('logs/baseline-bert')
for epoch in range(3):
    for batch in dataloader:
        loss = model(batch)
        writer.add_scalar('Loss/train', loss, global_step)
        writer.add_scalar('LR', optimizer.param_groups[0]['lr'], global_step)

Checkpoint Saving:
# Save best checkpoint
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
    'accuracy': accuracy
}, 'checkpoints/best-model.pt')
Output:
- training.log - Complete training logs
- best-model.pt - Best checkpoint
- metrics.json - All evaluation metrics
Phase 5: Results Validation (30 minutes)
Statistical Comparison
# Compare reproduced vs published results
./scripts/compare-results.sh \
--reproduced 0.945 \
--published 0.948 \
--tolerance 0.01
Validation checks:
import numpy as np
import scipy.stats as stats
# One-sample t-test against the published value
reproduced = [0.945, 0.946, 0.944]  # 3 runs
published = 0.948
difference = np.mean(reproduced) - published
percent_diff = (difference / published) * 100
# Within tolerance?
within_tolerance = abs(difference / published) <= 0.01
# Statistical significance
t_stat, p_value = stats.ttest_1samp(reproduced, published)
confidence_interval = stats.t.interval(0.95, len(reproduced)-1,
loc=np.mean(reproduced),
scale=stats.sem(reproduced))
print(f"Reproduced: {np.mean(reproduced):.3f} ± {np.std(reproduced):.3f}")
print(f"Published: {published:.3f}")
print(f"Difference: {difference:.3f} ({percent_diff:.2f}%)")
print(f"Within ±1% tolerance: {within_tolerance}")
print(f"95% CI: [{confidence_interval[0]:.3f}, {confidence_interval[1]:.3f}]")
If results differ > 1%:
# Debug systematically
./scripts/debug-divergence.sh \
--reproduced 0.932 \
--published 0.948
Common causes:
- Random seed not propagated to all libraries
- Framework version differences (PyTorch 1.7 vs 1.13)
- Hardware differences (V100 vs A100 numerical precision)
- Missing hyperparameter (learning rate schedule)
- Dataset preprocessing mismatch
Output: baseline-bert-comparison.md with statistical analysis
Phase 6: Reproducibility Packaging (30 minutes)
Coordinate with archivist Agent
# Create complete reproducibility package
./scripts/create-repro-package.sh \
--name baseline-bert \
--code baseline-bert-implementation.py \
--model best-model.pt \
--env requirements.txt
archivist creates:
baseline-bert-repro.tar.gz
├── README.md # ≤5 steps to reproduce
├── requirements.txt # Exact versions
├── Dockerfile # Exact environment
├── src/
│ ├── baseline-bert-implementation.py
│ ├── data_loader.py
│ └── train.py
├── data/
│ └── download_instructions.txt
├── models/
│ └── best-model.pt
├── logs/
│ └── training.log
├── results/
│ ├── metrics.json
│ └── comparison.csv
└── MANIFEST.txt # SHA256 checksums
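The MANIFEST.txt checksums could be generated with a short script along these lines; the package directory name is an assumption based on the layout above.
```python
# Illustrative MANIFEST generation; directory name assumed from the package layout above.
import hashlib
from pathlib import Path

package_root = Path("baseline-bert-repro")  # assumed unpacked package directory

with open(package_root / "MANIFEST.txt", "w") as manifest:
    for path in sorted(package_root.rglob("*")):
        if path.is_file() and path.name != "MANIFEST.txt":
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest.write(f"{digest}  {path.relative_to(package_root)}\n")
```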
README.md (≤5 steps):
# BERT SQuAD 2.0 Baseline Reproduction
## Quick Reproduction (3 steps)
1. Build Docker environment:
   ```bash
   docker build -t bert-squad:v1.0 .
   ```
2. Download SQuAD 2.0 dataset:
   ```bash
   ./download_data.sh
   ```
3. Run training:
   ```bash
   docker run --gpus all -v $(pwd):/workspace bert-squad:v1.0 python src/train.py
   ```

Expected result: 0.945 ± 0.001 accuracy (within ±1% of published 0.948)
#### Test Reproducibility
```bash
# Fresh Docker reproduction
./scripts/test-reproducibility.sh --package baseline-bert-repro.tar.gz --runs 3
```

Output: 3 successful reproductions with deterministic results
Phase 7: Quality Gate 1 Validation (15 minutes)
Coordinate with evaluator Agent
# Validate Quality Gate 1 requirements
./scripts/validate-gate-1.sh --baseline baseline-bert
evaluator checks:
- ✅ Baseline specification complete (47/47 hyperparameters)
- ✅ Dataset validation passed
- ✅ Implementation tested (100% unit test coverage)
- ✅ Results within ±1% tolerance (0.945 vs 0.948)
- ✅ Reproducibility verified (3/3 fresh reproductions)
- ✅ Code documented and archived
Decision Logic:
if results_within_tolerance and reproducibility_verified:
    decision = "APPROVED"
elif minor_gaps_fixable:
    decision = "CONDITIONAL"  # e.g., 1.2% difference but deterministic
else:
    decision = "REJECT"  # e.g., 5% difference, non-deterministic
Output: gate-1-validation-checklist.md
# Quality Gate 1: Baseline Validation
## Status: APPROVED
### Requirements
- [x] Baseline specification document complete
- [x] Dataset validation passed
- [x] Implementation tested and reviewed
- [x] Results within ±1% of published (0.945 vs 0.948)
- [x] Reproducibility package tested in fresh environment
- [x] Documentation complete
- [x] All artifacts archived
### Evidence
- Baseline spec: `baseline-bert-specification.md`
- Dataset validation: `dataset-validation-report.md`
- Implementation: `baseline-bert-implementation.py` (100% test coverage)
- Results comparison: `baseline-bert-comparison.md`
- Reproducibility package: `baseline-bert-repro.tar.gz` (3/3 successful)
### Approval
**Date**: 2025-11-01
**Approved By**: evaluator agent
**Next Step**: Proceed to Pipeline D novel method development
Advanced Features
Multi-Baseline Comparison
# Compare multiple baselines simultaneously
./scripts/compare-baselines.sh \
--baselines "bert-base,roberta-base,electra-base" \
--dataset "SQuAD 2.0" \
--metrics "accuracy,f1,em"
Ablation Study Integration
# Once baseline validated, run ablations
./scripts/run-ablations.sh \
--baseline baseline-bert \
--ablations "no-warmup,no-weight-decay,smaller-lr"
Continuous Validation
# Set up monitoring for baseline drift
./scripts/setup-monitoring.sh \
--baseline baseline-bert \
--schedule "weekly" \
--alert-threshold 0.02
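Under the hood, a drift check can be as simple as comparing freshly reproduced metrics against the archived baseline and alerting past the threshold. The sketch below is illustrative, not the actual setup-monitoring.sh implementation; the weekly metrics path is an assumption.
```python
# Illustrative drift check; not the actual monitoring implementation.
import json

def check_drift(reference_path: str, current_path: str, threshold: float = 0.02) -> bool:
    """Return True if any shared metric drifts beyond the relative threshold."""
    with open(reference_path) as f:
        reference = json.load(f)
    with open(current_path) as f:
        current = json.load(f)

    drifted = False
    for name, ref_value in reference.items():
        if name not in current:
            continue
        rel_diff = abs(current[name] - ref_value) / abs(ref_value)
        if rel_diff > threshold:
            print(f"ALERT: {name} drifted {rel_diff:.2%} (baseline {ref_value}, current {current[name]})")
            drifted = True
    return drifted

# Example: compare this week's rerun against the archived baseline metrics.
check_drift("results/metrics.json", "results/metrics-weekly.json")  # weekly path assumed
```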
Troubleshooting
Issue: Missing Hyperparameters
Symptoms: Specification validation fails with missing details
Cause: Paper doesn't document all hyperparameters
Solution:
# Check official code config files
grep -r "learning_rate\|batch_size\|warmup" ${BASELINE_CODE}/
# Check GitHub issues
gh issue list --repo ${BASELINE_REPO} --search "hyperparameter"
# Contact authors
./scripts/contact-authors.sh --paper "arXiv:2103.00020" --question "learning rate schedule"
Issue: Results Diverge > 1%
Symptoms: Reproduced 0.932, published 0.948 (1.7% difference)
Solution:
# Systematic debugging
./scripts/debug-divergence.sh --detailed
# Check random seeds
python -c "import torch; print(torch.initial_seed())"
# Check framework version
python -c "import torch; print(torch.__version__)"
# Enable detailed logging
python baseline-bert-implementation.py --debug --log-level DEBUG
Issue: Non-Deterministic Results
Symptoms: 3 runs produce 0.945, 0.951, 0.938 (high variance)
Solution:
# Force deterministic mode
import os
import torch

os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'  # required for deterministic cuBLAS kernels
torch.use_deterministic_algorithms(True)  # raises an error if a non-deterministic op is used
# (torch.set_deterministic is the deprecated alias of use_deterministic_algorithms)
Issue: Docker Environment Fails
Symptoms: Docker build or run errors
Solution:
# Check Docker resources
docker system df
docker system prune -a # Free up space if needed
# Use pre-built base image
docker pull pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime
# Debug interactively
docker run -it --gpus all pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime bash
Output Files
| File | Description | Size |
|---|---|---|
| `baseline-{method}-specification.md` | Extracted methodology | ~5KB |
| `dataset-validation-report.md` | Dataset validation results | ~2KB |
| `baseline-{method}-implementation.py` | Clean implementation | ~10KB |
| `baseline-{method}-implementation_test.py` | Unit tests | ~5KB |
| `training.log` | Complete training logs | ~100MB |
| `best-model.pt` | Best checkpoint | ~400MB |
| `metrics.json` | All evaluation metrics | ~1KB |
| `baseline-{method}-comparison.md` | Results comparison | ~3KB |
| `baseline-{method}-comparison.csv` | Metrics table | ~1KB |
| `baseline-{method}-repro.tar.gz` | Reproducibility package | ~450MB |
| `gate-1-validation-checklist.md` | Quality Gate 1 evidence | ~3KB |
Integration with Deep Research SOP
Pipeline D: Method Development
Baseline replication is mandatory before novel method development:
Pipeline D Flow:
1. Replicate Baseline (this skill) → Quality Gate 1
2. Develop Novel Method (method-development skill)
3. Ablation Studies (5+ ablations required)
4. Statistical Validation (p < 0.05)
5. Submit for Gate 2 review
Agent Coordination
Sequential workflow:
researcher:
- Analyze paper
- Extract methodology
- Identify data sources
↓
data-steward:
- Validate datasets
- Check integrity
- Verify preprocessing
↓
coder:
- Implement baseline
- Add unit tests
- Code review
↓
tester:
- Run experiments
- Monitor training
- Collect metrics
↓
archivist:
- Create repro package
- Test fresh reproduction
- Archive artifacts
↓
evaluator:
- Validate Gate 1
- Generate checklist
- Approve/Conditional/Reject
Memory MCP Integration
# Store baseline specification
memory-store --key "sop/pipeline-d/baseline-bert/specification" \
--value "$(cat baseline-bert-specification.md)" \
--layer long_term
# Store validation results
memory-store --key "sop/pipeline-d/baseline-bert/gate-1" \
--value "$(cat gate-1-validation-checklist.md)" \
--layer long_term
Related Skills
- method-development - Develop novel methods after baseline validation
- holistic-evaluation - Run HELM + CheckList evaluations (Pipeline E)
- gate-validation - Quality Gate approval workflow
- reproducibility-audit - Test reproducibility packages
- literature-synthesis - PRISMA systematic reviews
Resources
Official Standards
Tools
- Docker - Reproducible environments
- DVC - Data version control
- Weights & Biases - Experiment tracking
- MLflow - ML lifecycle management
Deep Research SOP Documentation
- Pipeline D specification: docs/deep-research-sop-gap-analysis.md
- Quality Gates overview: CLAUDE.md
- Agent definitions: agents/research/
- Command specifications: .claude/commands/research/
Created: 2025-11-01
Version: 1.0.0
Category: Deep Research SOP
Pipeline: D (Method Development)
Quality Gate: 1 (Baseline Validation)
Estimated Time: 8-12 hours (first baseline), 4-6 hours (subsequent)
Core Principles
Baseline Replication operates on 3 fundamental principles:
Principle 1: Exact Reproducibility (±1% Tolerance)
All baseline replications must match published results within ±1% statistical tolerance using identical hyperparameters, datasets, and deterministic settings. This validates that baselines are understood and provides a fair comparison foundation.
In practice:
- All 47+ hyperparameters extracted from paper, code, and supplements
- Deterministic mode enforced (torch.use_deterministic_algorithms(True))
- 3 independent runs with identical results verify determinism
- Statistical validation with a one-sample t-test against the published value confirms tolerance
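Both requirements can be checked in a few lines; the run values below are placeholders standing in for the metric produced by three seeded runs.
```python
# Placeholder check of determinism and ±1% tolerance; run values are illustrative.
import numpy as np

runs = [0.945, 0.945, 0.945]   # metric from 3 seeded runs (placeholder values)
published = 0.948

deterministic = len(set(runs)) == 1
within_tolerance = abs(np.mean(runs) - published) / published <= 0.01

print(f"Deterministic across runs: {deterministic}")
print(f"Within ±1% of published:   {within_tolerance}")
```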
Principle 2: Fresh Environment Reproducibility
Reproducibility packages must work in completely fresh Docker environments without cached dependencies, manual interventions, or undocumented setup steps. This ensures true reproducibility rather than "works on my machine" artifacts.
In practice:
- Docker containers with pinned dependencies (pip freeze, conda export)
- 3 successful fresh reproductions from scratch required
- README with 5 steps max to reproduce results
- SHA256 checksums verify dataset integrity
Principle 3: Complete Documentation Before Novel Development
No novel method development proceeds without Quality Gate 1 baseline validation. This prevents building on misunderstood foundations and ensures fair performance comparisons.
In practice:
- Baseline specification document with all hyperparameters
- Reproducibility package tested independently
- Statistical comparison report (reproduced vs published)
- Quality Gate 1 APPROVED status before method-development skill
Common Anti-Patterns
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Skipping Baseline Replication | Comparing novel methods to reported baseline results without replication introduces confounds (different hardware, frameworks, datasets) | Complete full baseline replication with ±1% validation before claiming improvements |
| Ignoring Non-Determinism | Running experiments without fixed seeds produces variance across runs, making ±1% tolerance impossible to verify | Force deterministic mode, test with 3 identical runs, document any remaining variance sources |
| Incomplete Hyperparameter Extraction | Missing hyperparameters like learning rate schedule or warmup steps causes silent performance gaps | Extract all 47+ hyperparameters from paper + code + supplements, contact authors for missing details |
Conclusion
Baseline Replication provides a rigorous methodology for reproducing published ML results with exact reproducibility (±1% tolerance), establishing validated foundations for novel method development. By enforcing deterministic implementations, fresh environment testing, and complete documentation, this skill ensures baselines are understood rather than assumed.
Use this skill before developing novel methods (Deep Research SOP Pipeline D prerequisite), when validating SOTA claims, or when preparing reproducibility packages for publication. The 7-phase workflow (paper analysis, dataset validation, implementation, experiments, validation, packaging, Quality Gate 1) produces independently verified baseline implementations with statistical confidence. The result is fair performance comparisons and solid foundations for research contributions.