Use this skill when:
- Validating ML baselines before novel methods (Deep Research SOP Pipeline D)
- Reproducing experiments for NeurIPS/ICML challenges
- Verifying SOTA claims before building on prior work
- Creating reproducibility packages for Quality Gate 1
- Debugging performance gaps between published and reproduced results

Do not use for:
- Quick proof-of-concept implementations (use prototyping instead)
- Projects where exact reproducibility is not critical (demos)
- Papers without code or detailed methodology
- Industry benchmarks without academic rigor

Success criteria:
- Results within ±1% of published metrics
- 3+ successful reproductions in fresh Docker
- All 47+ hyperparameters documented
- Reproducibility package tested independently
- Dataset checksums verified (SHA256)
- Random seeds documented, deterministic results
Rules:
- NEVER claim reproduction without statistical validation (t-test, CI)
- ALWAYS document exact framework versions (pip freeze, conda export)
- NEVER skip random seed validation (test 3+ runs)
- ALWAYS verify dataset integrity (SHA256, sample counts, splits)
- NEVER assume default hyperparameters (extract from paper/code)
name: baseline-replication
description: "Replicate published ML baseline experiments with exact reproducibility (±1% tolerance) for Deep Research SOP Pipeline D. Use when validating baselines, reproducing experiments, verifying published results, or preparing for novel method development."
version: 1.0.0
category: research
tags:
  - research
  - analysis
  - planning
author: ruv
Baseline Replication
Overview
Replicates published machine learning baseline methods with exact reproducibility, ensuring results match within ±1% tolerance. This skill implements Deep Research SOP Pipeline D baseline validation, which is a prerequisite for developing novel methods.
Prerequisites
- Python 3.8+ with PyTorch/TensorFlow
- Docker (for reproducibility)
- Git and Git LFS
- Access to datasets (HuggingFace, academic repositories)
What This Skill Does
- Extracts methodology from papers and code repositories
- Validates datasets match baseline specifications exactly
- Implements baseline with exact hyperparameters
- Runs experiments with deterministic settings
- Validates results within ±1% statistical tolerance
- Creates reproducibility package tested in fresh Docker environment
- Generates Quality Gate 1 validation checklist
Quick Start (30 minutes)
Basic Replication
# 1. Specify baseline to replicate
BASELINE_PAPER="BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al., 2019)"
BASELINE_CODE="https://github.com/google-research/bert"
TARGET_METRIC="Accuracy on SQuAD 2.0"
PUBLISHED_RESULT=0.948
# 2. Run replication workflow
./scripts/replicate-baseline.sh \
--paper "$BASELINE_PAPER" \
--code "$BASELINE_CODE" \
--metric "$TARGET_METRIC" \
--expected "$PUBLISHED_RESULT"
# 3. Review results
cat output/baseline-bert/replication-report.md
Expected output:
✓ Paper analyzed: Extracted 47 hyperparameters
✓ Dataset validated: SQuAD 2.0 matches baseline
✓ Implementation complete: 12 BERT layers, 110M parameters
✓ Training complete: 3 epochs, 26.3 GPU hours
✓ Results validated: 0.945 vs 0.948 (within ±1% tolerance)
✓ Reproducibility verified: 3/3 fresh reproductions successful
→ Quality Gate 1: APPROVED
Step-by-Step Guide
Phase 1: Paper Analysis (15 minutes)
Extract Methodology
# Coordinate with researcher agent
./scripts/analyze-paper.sh --paper "arXiv:2103.00020"
The script extracts:
- Model architecture (layers, hidden sizes, attention heads)
- Training hyperparameters (learning rate, batch size, warmup)
- Optimization details (optimizer type, weight decay, dropout)
- Dataset specifications (splits, preprocessing, tokenization)
- Evaluation metrics (primary and secondary)
Output: baseline-specification.md with all extracted details
Validate Completeness
# Check for missing hyperparameters
./scripts/validate-spec.sh baseline-specification.md
Common Missing Details:
- Learning rate schedule (linear warmup, cosine decay)
- Random seeds (NumPy, PyTorch, Python)
- Hardware specifications (GPU type, memory)
- Framework versions (PyTorch 1.7 vs 1.13 numerical differences)
If details missing:
- Check paper supplements
- Check official code config files
- Check GitHub issues
- Contact authors
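Conceptually, the completeness check is a required-keys scan over the extracted specification. The sketch below is illustrative only; the field names and schema are assumptions, not the actual format consumed by validate-spec.sh.
```python
# Illustrative spec-completeness scan; field names are assumptions, not the real schema.
REQUIRED_FIELDS = [
    "num_layers", "hidden_size", "num_attention_heads",
    "learning_rate", "lr_schedule", "warmup_steps",
    "batch_size", "num_epochs", "weight_decay", "dropout",
    "optimizer", "random_seed", "framework_version",
]

def find_missing(spec: dict) -> list:
    """Return required hyperparameters that are absent or null in the extracted spec."""
    return [field for field in REQUIRED_FIELDS if spec.get(field) is None]

# Example: a partially extracted spec gets flagged before implementation starts.
spec = {"num_layers": 12, "hidden_size": 768, "learning_rate": 5e-5}
missing = find_missing(spec)
if missing:
    print(f"Missing {len(missing)} hyperparameters: {', '.join(missing)}")
```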
Phase 2: Dataset Validation (20 minutes)
Coordinate with data-steward Agent
# Validate dataset matches baseline specs
./scripts/validate-dataset.sh \
--dataset "SQuAD 2.0" \
--splits "train:130k,dev:12k" \
--preprocessing "WordPiece tokenization, max_length=384"
data-steward checks:
- Exact dataset version (v2.0, not v1.1)
- Sample counts match (training: 130,319 examples)
- Data splits match (80/10/10 vs 90/10)
- Preprocessing matches (lower-casing, accent stripping)
- Checksum validation (SHA256 hashes)
Output: dataset-validation-report.md
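A minimal integrity check combines a SHA256 hash with a sample count. The sketch below assumes the standard SQuAD 2.0 JSON layout; the file path and expected checksum are placeholders, not values from the actual data-steward implementation.
```python
# Illustrative dataset integrity check; path and expected checksum are placeholders.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

train_file = Path("data/squad_v2/train-v2.0.json")   # placeholder path
expected_sha256 = "<checksum recorded in MANIFEST>"   # placeholder value

actual = sha256_of(train_file)
if actual != expected_sha256:
    raise ValueError(f"Checksum mismatch for {train_file}: {actual}")

# Count question-answer examples (SQuAD JSON: data -> paragraphs -> qas).
with train_file.open() as f:
    squad = json.load(f)
n_examples = sum(len(p["qas"]) for article in squad["data"] for p in article["paragraphs"])
print(f"Training examples: {n_examples}")  # expected 130,319 for SQuAD 2.0 train
```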
Phase 3: Implementation (2 hours)
Coordinate with coder Agent
# Implement baseline with exact specifications
./scripts/implement-baseline.sh \
--spec baseline-specification.md \
--framework pytorch \
--template resources/templates/bert-base.py
coder creates:
# baseline-bert-implementation.py
import torch
import random
import numpy as np
# CRITICAL: Set all random seeds for reproducibility
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
np.random.seed(42)
random.seed(42)
# CRITICAL: Enable deterministic mode
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.use_deterministic_algorithms(True)
# Exact hyperparameters from paper
config = {
"num_layers": 12,
"hidden_size": 768,
"num_attention_heads": 12,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"learning_rate": 5e-5, # From paper section 4.2
"batch_size": 32, # Per-GPU batch size
"num_epochs": 3,
"warmup_steps": 10000, # 10% of training steps
"weight_decay": 0.01,
"dropout": 0.1
}
Unit Tests:
pytest baseline-bert-implementation_test.py -v
Output: Fully tested implementation matching baseline exactly
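A unit-test file for the seeding and configuration logic might look like the sketch below; the module import in the second test is an assumption about how the implementation exposes its config, so adjust it to the actual file layout.
```python
# Illustrative tests for determinism and config values; module name is an assumption.
import random

import numpy as np
import pytest
import torch


def seed_everything(seed: int = 42) -> None:
    """Re-seed every RNG the training script touches."""
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)


def test_seeding_is_deterministic():
    # Two independently seeded forward passes must agree bit-for-bit.
    seed_everything(42)
    out_a = torch.nn.Linear(768, 768)(torch.randn(4, 768))

    seed_everything(42)
    out_b = torch.nn.Linear(768, 768)(torch.randn(4, 768))

    assert torch.equal(out_a, out_b)


def test_config_matches_paper_spec():
    # Hypothetical import; rename to match the actual implementation module.
    from baseline_bert_implementation import config

    assert config["num_layers"] == 12
    assert config["hidden_size"] == 768
    assert config["learning_rate"] == pytest.approx(5e-5)
```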
Phase 4: Experiment Execution (4-8 hours)
Coordinate with tester Agent
# Run experiments with monitoring
./scripts/run-experiments.sh \
--implementation baseline-bert-implementation.py \
--config config/bert-squad.yaml \
--gpus 4 \
--monitor true
tester executes:
Environment Setup:
# Create deterministic environment
docker build -t baseline-bert:v1.0 -f Dockerfile .
docker run --gpus all -v $(pwd):/workspace baseline-bert:v1.0

Training with Monitoring:
# Log training curves
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('logs/baseline-bert')
for epoch in range(3):
    for batch in dataloader:
        loss = model(batch)
        writer.add_scalar('Loss/train', loss, global_step)
        writer.add_scalar('LR', optimizer.param_groups[0]['lr'], global_step)

Checkpoint Saving:
# Save best checkpoint
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
    'accuracy': accuracy
}, 'checkpoints/best-model.pt')
Output:
- training.log - Complete training logs
- best-model.pt - Best checkpoint
- metrics.json - All evaluation metrics
Phase 5: Results Validation (30 minutes)
Statistical Comparison
# Compare reproduced vs published results
./scripts/compare-results.sh \
--reproduced 0.945 \
--published 0.948 \
--tolerance 0.01
Validation checks:
import numpy as np
import scipy.stats as stats
# One-sample t-test against the published value
reproduced = [0.945, 0.946, 0.944]  # 3 runs
published = 0.948
difference = np.mean(reproduced) - published
percent_diff = (difference / published) * 100
# Within tolerance?
within_tolerance = abs(difference / published) <= 0.01
# Statistical significance
t_stat, p_value = stats.ttest_1samp(reproduced, published)
confidence_interval = stats.t.interval(0.95, len(reproduced)-1,
loc=np.mean(reproduced),
scale=stats.sem(reproduced))
print(f"Reproduced: {np.mean(reproduced):.3f} ± {np.std(reproduced):.3f}")
print(f"Published: {published:.3f}")
print(f"Difference: {difference:.3f} ({percent_diff:.2f}%)")
print(f"Within ±1% tolerance: {within_tolerance}")
print(f"95% CI: [{confidence_interval[0]:.3f}, {confidence_interval[1]:.3f}]")
If results differ > 1%:
# Debug systematically
./scripts/debug-divergence.sh \
--reproduced 0.932 \
--published 0.948
Common causes:
- Random seed not propagated to all libraries
- Framework version differences (PyTorch 1.7 vs 1.13)
- Hardware differences (V100 vs A100 numerical precision)
- Missing hyperparameter (learning rate schedule)
- Dataset preprocessing mismatch
Output: baseline-bert-comparison.md with statistical analysis
Phase 6: Reproducibility Packaging (30 minutes)
Coordinate with archivist Agent
# Create complete reproducibility package
./scripts/create-repro-package.sh \
--name baseline-bert \
--code baseline-bert-implementation.py \
--model best-model.pt \
--env requirements.txt
archivist creates:
baseline-bert-repro.tar.gz
├── README.md # ≤5 steps to reproduce
├── requirements.txt # Exact versions
├── Dockerfile # Exact environment
├── src/
│ ├── baseline-bert-implementation.py
│ ├── data_loader.py
│ └── train.py
├── data/
│ └── download_instructions.txt
├── models/
│ └── best-model.pt
├── logs/
│ └── training.log
├── results/
│ ├── metrics.json
│ └── comparison.csv
└── MANIFEST.txt # SHA256 checksums
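The MANIFEST.txt checksums could be generated with a short script along these lines; the package directory name is an assumption based on the layout above.
```python
# Illustrative MANIFEST generation; directory name assumed from the package layout above.
import hashlib
from pathlib import Path

package_root = Path("baseline-bert-repro")  # assumed unpacked package directory

with open(package_root / "MANIFEST.txt", "w") as manifest:
    for path in sorted(package_root.rglob("*")):
        if path.is_file() and path.name != "MANIFEST.txt":
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest.write(f"{digest}  {path.relative_to(package_root)}\n")
```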
README.md (≤5 steps):
# BERT SQuAD 2.0 Baseline Reproduction
## Quick Reproduction (3 steps)
1. Build Docker environment:
   ```bash
   docker build -t bert-squad:v1.0 .
   ```
2. Download SQuAD 2.0 dataset:
   ```bash
   ./download_data.sh
   ```
3. Run training:
   ```bash
   docker run --gpus all -v $(pwd):/workspace bert-squad:v1.0 python src/train.py
   ```

Expected result: 0.945 ± 0.001 accuracy (within ±1% of published 0.948)
#### Test Reproducibility
```bash
# Fresh Docker reproduction
./scripts/test-reproducibility.sh --package baseline-bert-repro.tar.gz --runs 3
```

Output: 3 successful reproductions with deterministic results
Phase 7: Quality Gate 1 Validation (15 minutes)
Coordinate with evaluator Agent
# Validate Quality Gate 1 requirements
./scripts/validate-gate-1.sh --baseline baseline-bert
evaluator checks:
- ✅ Baseline specification complete (47/47 hyperparameters)
- ✅ Dataset validation passed
- ✅ Implementation tested (100% unit test coverage)
- ✅ Results within ±1% tolerance (0.945 vs 0.948)
- ✅ Reproducibility verified (3/3 fresh reproductions)
- ✅ Code documented and archived
Decision Logic:
if results_within_tolerance and reproducibility_verified:
    decision = "APPROVED"
elif minor_gaps_fixable:
    decision = "CONDITIONAL"  # e.g., 1.2% difference but deterministic
else:
    decision = "REJECT"  # e.g., 5% difference, non-deterministic
Output: gate-1-validation-checklist.md
# Quality Gate 1: Baseline Validation
## Status: APPROVED
### Requirements
- [x] Baseline specification document complete
- [x] Dataset validation passed
- [x] Implementation tested and reviewed
- [x] Results within ±1% of published (0.945 vs 0.948)
- [x] Reproducibility package tested in fresh environment
- [x] Documentation complete
- [x] All artifacts archived
### Evidence
- Baseline spec: `baseline-bert-specification.md`
- Dataset validation: `dataset-validation-report.md`
- Implementation: `baseline-bert-implementation.py` (100% test coverage)
- Results comparison: `baseline-bert-comparison.md`
- Reproducibility package: `baseline-bert-repro.tar.gz` (3/3 successful)
### Approval
**Date**: 2025-11-01
**Approved By**: evaluator agent
**Next Step**: Proceed to Pipeline D novel method development
Advanced Features
Multi-Baseline Comparison
# Compare multiple baselines simultaneously
./scripts/compare-baselines.sh \
--baselines "bert-base,roberta-base,electra-base" \
--dataset "SQuAD 2.0" \
--metrics "accuracy,f1,em"
Ablation Study Integration
# Once baseline validated, run ablations
./scripts/run-ablations.sh \
--baseline baseline-bert \
--ablations "no-warmup,no-weight-decay,smaller-lr"
Continuous Validation
# Set up monitoring for baseline drift
./scripts/setup-monitoring.sh \
--baseline baseline-bert \
--schedule "weekly" \
--alert-threshold 0.02
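Under the hood, a drift check can be as simple as comparing freshly reproduced metrics against the archived baseline and alerting past the threshold. The sketch below is illustrative, not the actual setup-monitoring.sh implementation; the weekly metrics path is an assumption.
```python
# Illustrative drift check; not the actual monitoring implementation.
import json

def check_drift(reference_path: str, current_path: str, threshold: float = 0.02) -> bool:
    """Return True if any shared metric drifts beyond the relative threshold."""
    with open(reference_path) as f:
        reference = json.load(f)
    with open(current_path) as f:
        current = json.load(f)

    drifted = False
    for name, ref_value in reference.items():
        if name not in current:
            continue
        rel_diff = abs(current[name] - ref_value) / abs(ref_value)
        if rel_diff > threshold:
            print(f"ALERT: {name} drifted {rel_diff:.2%} (baseline {ref_value}, current {current[name]})")
            drifted = True
    return drifted

# Example: compare this week's rerun against the archived baseline metrics.
check_drift("results/metrics.json", "results/metrics-weekly.json")  # weekly path assumed
```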
Troubleshooting
Issue: Missing Hyperparameters
Symptoms: Specification validation fails with missing details
Cause: Paper doesn't document all hyperparameters
Solution:
# Check official code config files
grep -r "learning_rate\|batch_size\|warmup" ${BASELINE_CODE}/
# Check GitHub issues
gh issue list --repo ${BASELINE_REPO} --search "hyperparameter"
# Contact authors
./scripts/contact-authors.sh --paper "arXiv:2103.00020" --question "learning rate schedule"
Issue: Results Diverge > 1%
Symptoms: Reproduced 0.932, published 0.948 (1.7% difference)
Solution:
# Systematic debugging
./scripts/debug-divergence.sh --detailed
# Check random seeds
python -c "import torch; print(torch.initial_seed())"
# Check framework version
python -c "import torch; print(torch.__version__)"
# Enable detailed logging
python baseline-bert-implementation.py --debug --log-level DEBUG
Issue: Non-Deterministic Results
Symptoms: 3 runs produce 0.945, 0.951, 0.938 (high variance)
Solution:
# Force deterministic mode
import os
import torch

os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'  # required for deterministic cuBLAS kernels
torch.use_deterministic_algorithms(True)  # raises an error if a non-deterministic op is used
# (torch.set_deterministic is the deprecated alias of use_deterministic_algorithms)
Issue: Docker Environment Fails
Symptoms: Docker build or run errors
Solution:
# Check Docker resources
docker system df
docker system prune -a # Free up space if needed
# Use pre-built base image
docker pull pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime
# Debug interactively
docker run -it --gpus all pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime bash
Output Files
| File | Description | Size |
|---|---|---|
| `baseline-{method}-specification.md` | Extracted methodology | ~5KB |
| `dataset-validation-report.md` | Dataset validation results | ~2KB |
| `baseline-{method}-implementation.py` | Clean implementation | ~10KB |
| `baseline-{method}-implementation_test.py` | Unit tests | ~5KB |
| `training.log` | Complete training logs | ~100MB |
| `best-model.pt` | Best checkpoint | ~400MB |
| `metrics.json` | All evaluation metrics | ~1KB |
| `baseline-{method}-comparison.md` | Results comparison | ~3KB |
| `baseline-{method}-comparison.csv` | Metrics table | ~1KB |
| `baseline-{method}-repro.tar.gz` | Reproducibility package | ~450MB |
| `gate-1-validation-checklist.md` | Quality Gate 1 evidence | ~3KB |
Integration with Deep Research SOP
Pipeline D: Method Development
Baseline replication is mandatory before novel method development:
Pipeline D Flow:
1. Replicate Baseline (this skill) → Quality Gate 1
2. Develop Novel Method (method-development skill)
3. Ablation Studies (5+ ablations required)
4. Statistical Validation (p < 0.05)
5. Submit for Gate 2 review
Agent Coordination
Sequential workflow:
researcher:
- Analyze paper
- Extract methodology
- Identify data sources
↓
data-steward:
- Validate datasets
- Check integrity
- Verify preprocessing
↓
coder:
- Implement baseline
- Add unit tests
- Code review
↓
tester:
- Run experiments
- Monitor training
- Collect metrics
↓
archivist:
- Create repro package
- Test fresh reproduction
- Archive artifacts
↓
evaluator:
- Validate Gate 1
- Generate checklist
- Approve/Conditional/Reject
Memory MCP Integration
# Store baseline specification
memory-store --key "sop/pipeline-d/baseline-bert/specification" \
--value "$(cat baseline-bert-specification.md)" \
--layer long_term
# Store validation results
memory-store --key "sop/pipeline-d/baseline-bert/gate-1" \
--value "$(cat gate-1-validation-checklist.md)" \
--layer long_term
Related Skills
- method-development - Develop novel methods after baseline validation
- holistic-evaluation - Run HELM + CheckList evaluations (Pipeline E)
- gate-validation - Quality Gate approval workflow
- reproducibility-audit - Test reproducibility packages
- literature-synthesis - PRISMA systematic reviews
Resources
Official Standards
Tools
- Docker - Reproducible environments
- DVC - Data version control
- Weights & Biases - Experiment tracking
- MLflow - ML lifecycle management
Deep Research SOP Documentation
- Pipeline D specification: docs/deep-research-sop-gap-analysis.md
- Quality Gates overview: CLAUDE.md
- Agent definitions: agents/research/
- Command specifications: .claude/commands/research/
Created: 2025-11-01
Version: 1.0.0
Category: Deep Research SOP
Pipeline: D (Method Development)
Quality Gate: 1 (Baseline Validation)
Estimated Time: 8-12 hours (first baseline), 4-6 hours (subsequent)
Core Principles
Baseline Replication operates on 3 fundamental principles:
Principle 1: Exact Reproducibility (±1% Tolerance)
All baseline replications must match published results within ±1% statistical tolerance using identical hyperparameters, datasets, and deterministic settings. This validates that baselines are understood and provides a fair comparison foundation.
In practice:
- All 47+ hyperparameters extracted from paper, code, and supplements
- Deterministic mode enforced (torch.use_deterministic_algorithms(True))
- 3 independent runs with identical results verify determinism
- Statistical validation with a one-sample t-test against the published value confirms tolerance
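Both requirements can be checked in a few lines; the run values below are placeholders standing in for the metric produced by three seeded runs.
```python
# Placeholder check of determinism and ±1% tolerance; run values are illustrative.
import numpy as np

runs = [0.945, 0.945, 0.945]   # metric from 3 seeded runs (placeholder values)
published = 0.948

deterministic = len(set(runs)) == 1
within_tolerance = abs(np.mean(runs) - published) / published <= 0.01

print(f"Deterministic across runs: {deterministic}")
print(f"Within ±1% of published:   {within_tolerance}")
```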
Principle 2: Fresh Environment Reproducibility
Reproducibility packages must work in completely fresh Docker environments without cached dependencies, manual interventions, or undocumented setup steps. This ensures true reproducibility rather than "works on my machine" artifacts.
In practice:
- Docker containers with pinned dependencies (pip freeze, conda export)
- 3 successful fresh reproductions from scratch required
- README with 5 steps max to reproduce results
- SHA256 checksums verify dataset integrity
Principle 3: Complete Documentation Before Novel Development
No novel method development proceeds without Quality Gate 1 baseline validation. This prevents building on misunderstood foundations and ensures fair performance comparisons.
In practice:
- Baseline specification document with all hyperparameters
- Reproducibility package tested independently
- Statistical comparison report (reproduced vs published)
- Quality Gate 1 APPROVED status before method-development skill
Common Anti-Patterns
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Skipping Baseline Replication | Comparing novel methods to reported baseline results without replication introduces confounds (different hardware, frameworks, datasets) | Complete full baseline replication with ±1% validation before claiming improvements |
| Ignoring Non-Determinism | Running experiments without fixed seeds produces variance across runs, making ±1% tolerance impossible to verify | Force deterministic mode, test with 3 identical runs, document any remaining variance sources |
| Incomplete Hyperparameter Extraction | Missing hyperparameters like learning rate schedule or warmup steps causes silent performance gaps | Extract all 47+ hyperparameters from paper + code + supplements, contact authors for missing details |
Conclusion
Baseline Replication provides a rigorous methodology for reproducing published ML results with exact reproducibility (±1% tolerance), establishing validated foundations for novel method development. By enforcing deterministic implementations, fresh environment testing, and complete documentation, this skill ensures baselines are understood rather than assumed.
Use this skill before developing novel methods (Deep Research SOP Pipeline D prerequisite), when validating SOTA claims, or when preparing reproducibility packages for publication. The 7-phase workflow (paper analysis, dataset validation, implementation, experiments, validation, packaging, Quality Gate 1) produces independently verified baseline implementations with statistical confidence. The result is fair performance comparisons and solid foundations for research contributions.