| name | hyperparameter-tuning |
| description | Hyperparameter search - grid/random/Bayesian strategies, learning rate tuning, early stopping |
Hyperparameter Tuning Skill
When to Use This Skill
Use this skill when:
- User wants to improve model accuracy but not sure what to tune
- Training plateaus or performance is suboptimal (e.g. stuck at 70% when 75% seems reachable)
- User asks "should I tune hyperparameters?" or "what should I tune first?"
- User wants to implement hyperparameter search (grid search, random search, Bayesian optimization)
- Deciding between Optuna, Ray Tune, W&B Sweeps, or manual tuning
- User asks "how many hyperparameters should I try?" or "how long will search take?"
- Model is underfitting (high train and val loss) vs overfitting (low train loss, high val loss)
- User is copying a paper's hyperparameters but results don't match
- Budget allocation question: "Should I train longer or try more configs?"
- User wants to understand learning rate importance relative to other hyperparameters
Do NOT use when:
- User has specific bugs unrelated to hyperparameters (training crashes, NaN losses)
- Only discussing optimizer choice without tuning questions
- Model is already converging well with current hyperparameters
- User is asking about data preprocessing or feature engineering
- Hyperparameter search is already set up and running (just report results)
Core Principles
1. Hyperparameter Importance Hierarchy (NOT All Equal)
The BIGGEST mistake users make: treating all hyperparameters as equally important.
Importance Ranking (for typical supervised learning):
Tier 1 - Critical (10x impact):
├─ Learning Rate (most important)
└─ Learning Rate Schedule
Tier 2 - High Impact (5x impact):
├─ Batch Size (affects LR, gradient noise, memory)
└─ Optimizer Type (Adam vs SGD affects LR ranges)
Tier 3 - Medium Impact (2x impact):
├─ Weight Decay (L2 regularization)
├─ Optimizer Parameters (momentum, beta)
└─ Warmup (critical for transformers)
Tier 4 - Low-Medium Impact (1.5x impact):
├─ Model Width/Depth (architectural)
├─ Dropout Rate (regularization)
└─ Gradient Clipping (stability)
Tier 5 - Low Impact (<1.2x):
├─ Activation Functions (ReLU vs GELU)
├─ LayerNorm Epsilon
└─ Adam Epsilon
What This Means:
- Learning rate alone can change accuracy from 50% → 80%
- Model width change from 128 → 256 typically gives 2-3% improvement
- Dropout 0.1 → 0.5 might give 2-4% improvement (if overfitting)
- Optimizer epsilon has almost no impact
Quantitative Example (CIFAR-10, ResNet18):
Effect on accuracy of individual changes:
LR from 0.001 → 0.01: 70% → 84% (+14%) ← HUGE
Batch size from 32 → 128: 84% → 82% (-2%) ← small impact
Width from 64 → 128: 84% → 86% (+2%) ← small impact
Dropout 0.0 → 0.3: 86% → 85% (-1%) ← tiny impact
Total tuning time SHOULD be allocated:
- 40% to learning rate (most important)
- 30% to learning rate schedule
- 15% to batch size and optimizer choice
- 10% to regularization (dropout, weight decay)
- 5% to everything else
Decision Rule: Tune in order of importance. Only move to next tier if current tier is optimized.
2. When to Tune vs When to Leave Defaults
Don't Tune When:
- ✗ Model converges well (val loss decreasing, no plateau)
- ✗ Time budget is <1 hour (manual tuning likely faster)
- ✗ Model underfits (both train and val loss are high) - add capacity instead
- ✗ Data is tiny (<1000 examples) - data collection beats tuning
- ✗ Using pre-trained models for fine-tuning - defaults often work
DO Tune When:
- ✓ Training plateaus early (loss stops improving by epoch 30)
- ✓ Train/val gap is large (overfitting, need better hyperparameters)
- ✓ Time budget is >1 hour and compute available
- ✓ Model has capacity but not using it (convergence too slow)
- ✓ Targeting SOTA or competition results (last 2-5% squeeze)
Diagnostic Tree:
Is performance acceptable?
├─ YES → Don't tune. Tuning won't help much.
└─ NO → Check the problem:
├─ High train loss, high val loss? → UNDERFITTING
│ └─ Solution: Increase model capacity, train longer
│ (Not a tuning problem)
│
├─ Low train loss, high val loss? → OVERFITTING
│ └─ Solution: Tune weight decay, dropout, LR schedule
│
├─ Training converging too slowly? → BAD LR
│ └─ Solution: Tune learning rate (critical!)
│
└─ Training unstable (losses spike)? → LR too high or batch too small
└─ Solution: Lower LR, increase batch size, add gradient clipping
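As a minimal sketch, the stability fixes in the last branch look like this in a PyTorch-style loop (model, loss function, and train_loader are assumed to exist already):
import torch
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # lowered from e.g. 0.1
for batch_x, batch_y in train_loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(batch_x), batch_y)
    loss.backward()
    # Clip the gradient norm to damp the loss spikes
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()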
3. Learning Rate is THE Hyperparameter to Tune First
Learning rate matters more than ANYTHING else. Here's why:
Impact on Training:
- LR too small: Glacial convergence, never reaches good minima (underfitting effect)
- LR too large: Oscillation or divergence, never converges (instability)
- LR just right: Fast convergence to good minima (optimal learning)
Typical LR Impact:
LR = 0.0001: Loss = 0.5, Acc = 60% (too small, underfitting)
LR = 0.001: Loss = 0.3, Acc = 75% (getting better)
LR = 0.01: Loss = 0.2, Acc = 85% (optimal)
LR = 0.1: Loss = 0.4, Acc = 70% (too large, oscillating)
LR = 1.0: Loss = NaN, Acc = 0% (diverging)
When to Tune LR First:
- Always. Before ANYTHING else.
- Even if you don't tune anything else, tune learning rate.
- Proper LR gives 5-10% improvement alone.
- Everything else: 2-5% improvement.
Default LR Ranges by Optimizer:
SGD with momentum: 0.01 - 0.1 (start at 0.01)
Adam: 0.0001 - 0.001 (start at 0.001)
AdamW: 0.0001 - 0.001 (start at 0.0005)
RMSprop: 0.0001 - 0.01 (start at 0.0005)
For transformers: usually 0.00005 - 0.0005 (MUCH smaller)
For fine-tuning: usually 0.0001 - 0.001 (smaller than training)
Pro Tip: Use learning rate finder (LRFinder, lr_find in fastai) to get good starting range in 1 epoch.
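A minimal sketch of the underlying LR range test (the idea, not the fastai API itself): ramp the LR exponentially over one pass, record the loss, and start near 1/10 of the LR where loss was lowest. create_model and train_loader are assumed helpers.
import math
import torch
model = create_model()  # assumed helper
optimizer = torch.optim.SGD(model.parameters(), lr=1e-6)
lrs, losses = [], []
num_steps = len(train_loader)
for i, (x, y) in enumerate(train_loader):
    lr = 1e-6 * (1e6 ** (i / num_steps))  # sweep 1e-6 → 1.0 exponentially
    for group in optimizer.param_groups:
        group["lr"] = lr
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    lrs.append(lr)
    losses.append(loss.item())
    if math.isnan(losses[-1]) or losses[-1] > 4 * min(losses):
        break  # loss blew up - stop the sweep
best_lr = lrs[losses.index(min(losses))] / 10  # common heuristic
print(f"Suggested starting LR: {best_lr:.2e}")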
Decision Framework: Which Search Strategy to Use
Criterion 1: Number of Hyperparameters to Tune
1-2 parameters → Grid search is fine
Example: Tuning just learning rate and weight decay
Effort: 5-25 configurations
Best tool: Manual or simple loop
3-4 parameters → Random search
Example: LR, batch size, weight decay, warmup
Effort: 50-200 configurations
Best tool: Optuna or Ray Tune
5+ parameters → Bayesian optimization (Optuna)
Example: LR, batch size, weight decay, warmup, dropout, LR schedule type
Effort: 100-500 configurations
Best tool: Optuna (required) or Ray Tune
When you don't know → Always use Random Search as default
Criterion 2: Time Budget Available
Budget = (GPU time available) / (Training time per epoch)
< 10 hours budget:
- Tune ONLY learning rate (1-2 hours search)
- Use learning rate finder + manual exploration
- 5-10 LR values, 1 seed each
10-100 hours budget:
- Random search over 3-4 hyperparameters
- 50-100 configurations
- Use Optuna or Ray Tune
- 1 seed per config (save repeats for later)
100-1000 hours budget:
- Bayesian optimization (Optuna) over 4-5 parameters
- 200-300 configurations
- Use ensembling: multiple runs of top 5 configs
- 2-3 seeds for final configs
1000+ hours budget:
- Full Bayesian optimization with early stopping
- 500+ configurations
- Can afford to try many promising directions
- 3+ seeds for final configs, ensemble for SOTA
Criterion 3: Search Strategy Decision Matrix
| Time Budget | Few Params (1-3) | Many Params (4-6) | Unknown Params |
|---|---|---|---|
| Short (<10 hrs) | Manual or Grid | Random Search | Random Search (narrow scope) |
| Medium (10-100 hrs) | Grid or Random | Random Search (Optuna) | Bayesian (Optuna) |
| Long (100+ hrs) | Grid or Random | Bayesian (Optuna) | Bayesian (Optuna) |
Search Strategy Details
Strategy 1: Grid Search (When to Use, When NOT to Use)
Grid Search: Try all combinations of predefined values.
PROS:
- Simple to understand and implement
- Guarantees checking all points in search space
- Results easily interpretable (best point is in grid)
- Good for visualization and analysis
CONS:
- Exponential complexity: O(k^n) where k=values, n=dimensions
- 5 params × 5 values each = 3,125 configurations (~65 days of compute at 30 min per trial!)
- Poor for high-dimensional spaces (5+ parameters)
- Wastes compute on unimportant dimensions
When to Use:
- ✓ 1-2 hyperparameters only
- ✓ <50 total configurations
- ✓ Quick experiments (1-10 hour budget)
- ✓ Final refinement near known good point
When NOT to Use:
- ✗ 4+ hyperparameters
- ✗ High-dimensional spaces
- ✗ Unknown optimal ranges
- ✗ Limited compute budget
Example: Grid Search (Good Use):
# GOOD: Only 2 parameters, 3×4=12 configurations
import itertools
learning_rates = [0.001, 0.01, 0.1]
weight_decays = [0.0, 0.0001, 0.001, 0.01]
best_acc = 0
for lr, wd in itertools.product(learning_rates, weight_decays):
model = create_model()
acc = train_and_evaluate(model, lr=lr, weight_decay=wd)
if acc > best_acc:
best_acc = acc
best_config = {"lr": lr, "wd": wd}
print(f"Best accuracy: {best_acc}")
print(f"Best config: {best_config}")
# 12 configurations × 30 min each = 6 hours total
# Very reasonable!
Anti-Example: Grid Search (Bad Use):
# WRONG: 5 parameters, 5^5 = 3,125 configurations
# At 30 minutes per config that is ~65 days of compute - completely impractical
learning_rates = [0.0001, 0.001, 0.01, 0.1, 1.0]
batch_sizes = [16, 32, 64, 128, 256]
weight_decays = [0.0, 0.0001, 0.001, 0.01, 0.1]
dropouts = [0.0, 0.2, 0.4, 0.6, 0.8]
warmup_steps = [0, 100, 500, 1000, 5000]
# DO NOT DO THIS - grid explosion is real
Strategy 2: Random Search (Default Choice for Most Cases)
Random Search: Sample hyperparameters randomly from search space.
PROS:
- Much better than grid in 4+ dimensions (Bergstra & Bengio 2012)
- 100-200 random samples often better than 100 grid points
- Easy to implement and parallelize
- Can sample continuous spaces naturally
- Efficient use of limited compute budget
CONS:
- Not systematic (might miss obvious points)
- Requires defining search space ranges (hard part)
- No exploitation of promising regions (unlike Bayesian)
- Results less deterministic than grid
When to Use:
- ✓ 3-5 hyperparameters
- ✓ 50-300 configurations available
- ✓ Unknown optimal ranges
- ✓ Want simple, efficient method
- ✓ Default choice when unsure
When NOT to Use:
- ✗ 1-2 hyperparameters (grid is simpler)
- ✗ Very large budgets (1000+ hrs, use Bayesian)
- ✗ Need guaranteed convergence to local optimum
Example: Random Search (Recommended):
# GOOD: 4 parameters, random sampling, efficient
import numpy as np
from scipy.stats import loguniform, uniform
# Define search space with proper scales
learning_rate_dist = loguniform(a=0.00001, b=0.1) # Log scale!
batch_size_dist = [16, 32, 64, 128, 256]
weight_decay_dist = loguniform(a=1e-6, b=1e-1) # Log scale! (lower bound must be > 0)
dropout_dist = uniform(loc=0.0, scale=0.8)
best_acc = 0
for trial in range(100): # 100 configurations, not 3,125
lr = learning_rate_dist.rvs()
batch_size = np.random.choice(batch_size_dist)
wd = weight_decay_dist.rvs()
dropout = dropout_dist.rvs()
model = create_model(dropout=dropout)
acc = train_and_evaluate(
model,
lr=lr,
batch_size=batch_size,
weight_decay=wd
)
if acc > best_acc:
best_acc = acc
best_config = {
"lr": lr,
"batch_size": batch_size,
"weight_decay": wd,
"dropout": dropout
}
print(f"Best accuracy: {best_acc}")
print(f"Best config: {best_config}")
# 100 configurations × 30 min each = 50 hours total
# 100 random trials typically match or beat a 5^4 = 625-point grid - far better scaling
Strategy 3: Bayesian Optimization (Best for Limited Budget)
Bayesian Optimization: Build probabilistic model of function, use to guide search.
How It Works:
- Start with 5-10 random trials (exploratory phase)
- Build surrogate model (Gaussian Process) of performance vs hyperparameters
- Use acquisition function to select next promising region to sample
- Train model, update surrogate, repeat
- Balance exploration (new regions) vs exploitation (known good regions)
PROS:
- Uses all prior information to guide next trial selection
- 2-10x more efficient than random search
- Handles many parameters well (5-10+)
- Built-in uncertainty estimates
CONS:
- More complex to implement and understand
- Surrogate model overhead (negligible vs training time)
- Requires tool like Optuna or Ray Tune
- Less interpretable than grid/random (can't show "grid")
When to Use:
- ✓ 5+ hyperparameters
- ✓ 200+ configurations budget
- ✓ Each trial is expensive (>1 hour)
- ✓ Want best results with limited budget
- ✓ Will use Optuna, Ray Tune, or W&B Sweeps
When NOT to Use:
- ✗ <20 configurations (overhead not worth it)
- ✗ Very cheap trials where random is simpler
- ✗ Need to explain exactly what was tested (use grid)
Example: Bayesian with Optuna (Industry Standard):
# GOOD: Professional hyperparameter search with Optuna
import optuna
from optuna.pruners import MedianPruner
def objective(trial):
# Suggest hyperparameters from search space
learning_rate = trial.suggest_float(
"learning_rate",
1e-5, 1e-1,
log=True # Log scale (CRITICAL!)
)
batch_size = trial.suggest_categorical(
"batch_size",
[16, 32, 64, 128, 256]
)
weight_decay = trial.suggest_float(
"weight_decay",
1e-5, 1e-1,
log=True # Log scale!
)
dropout = trial.suggest_float(
"dropout",
0.0, 0.8 # Linear scale
)
# Create and train model
model = create_model(dropout=dropout)
best_val_acc = 0
for epoch in range(100):
train(model, lr=learning_rate, batch_size=batch_size,
weight_decay=weight_decay)
val_acc = validate(model)
# CRITICAL: Early stopping in search (prune bad trials)
trial.report(val_acc, epoch)
if trial.should_prune(): # Stops bad trials early!
raise optuna.TrialPruned()
if val_acc > best_val_acc:
best_val_acc = val_acc
return best_val_acc
# Create study with pruning (saves 70% compute)
study = optuna.create_study(
direction="maximize",
pruner=MedianPruner()
)
# Run search: 200 trials with Bayesian guidance
study.optimize(objective, n_trials=200, n_jobs=4)
print(f"Best accuracy: {study.best_value}")
print(f"Best config: {study.best_params}")
# Early stopping + Bayesian optimization saves massive compute
# ~200 trials × ~30 epochs on average with pruning vs 200 × 100 epochs without
Search Space Design (Critical Details Often Missed)
1. Scale Selection for Continuous Parameters
Learning Rate and Weight Decay: USE LOG SCALE
# WRONG: Linear scale for learning rate
learning_rates_linear = [0.0001, 0.002, 0.004, 0.006, 0.008, 0.01]
# Ranges: 0.0001→0.002 is 20x, but only uses 1/5 of range
# Ranges: 0.008→0.01 is 1.25x, but uses 1/5 of range
# BROKEN: Unequal coverage of important ranges
# CORRECT: Log scale for learning rate
import numpy as np
learning_rates_log = np.logspace(-4, -2, 7) # 10^-4 to 10^-2, 7 values
# [0.0001, 0.000215, 0.000464, 0.001, 0.00215, 0.00464, 0.01]
# Each step is ~2.15x (equal importance)
# GOOD: Even coverage across exponential range
Why Log Scale for LR:
- Effect on loss is exponential, not linear
- 10x change in LR has similar impact anywhere in range
- Linear scale bunches tiny values together, wastes space on large values
- Log scale: 0.0001 to 0.01 gets fair representation
Parameters That Need Log Scale:
- Learning rate (most critical)
- Weight decay
- Learning rate schedule decay (gamma in step decay)
- Regularization strength
- Any parameter spanning >1 order of magnitude
Dropout, Warmup, Others: USE LINEAR SCALE
# CORRECT: Linear scale for dropout (0.0 to 0.8)
dropout_values = np.linspace(0.0, 0.8, 5)
# [0.0, 0.2, 0.4, 0.6, 0.8]
# GOOD: Each increase is meaningful
# CORRECT: Linear scale for warmup steps
warmup_steps = [0, 250, 500, 750, 1000]
# Linear relationships make sense here
2. Search Space Ranges (Common Mistakes)
Learning Rate Range Often Too Small:
# WRONG: Too narrow range
lr_range = [0.001, 0.0015, 0.002, 0.0025, 0.003]
# Optimal might be 0.01 or 0.0001, both outside range!
# CORRECT: Wider range covering multiple orders of magnitude
lr_range = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1] # Or use loguniform(1e-5, 1e-1)
Batch Size Range Considerations:
# Batch size affects memory AND gradient noise
# Small batch (16-32): Noisy gradients, good regularization, needs low LR
# Large batch (256+): Stable gradients, less regularization, can use high LR
# CORRECT: Include range of batch sizes
batch_sizes = [16, 32, 64, 128, 256]
# INTERACTION: Large batch + same LR usually worse than small batch
# This is WHY you need to search both together (not separately)
Weight Decay Range:
# Log scale, typically 0 to 0.1
# For well-regularized models: 1e-5 to 1e-1
# For barely regularized: 0.0 to 1e-3
# CORRECT: Use log scale
weight_decays = [0.0, 1e-5, 1e-4, 1e-3, 1e-2, 0.1]
Budget Allocation: Seeds vs Configurations
Key Decision: Should you train many configurations once or few configurations multiple times?
Answer: MANY CONFIGURATIONS, SINGLE SEED
Why:
Budget = 100 hours
Option A: Many configurations, 1 seed each
├─ 100 configurations × 1 seed = 100 trials
├─ Find best at 85% accuracy
└─ Top 5 can be rerun with 5 seeds for ensemble
Option B: Few configurations, 5 seeds each
├─ 20 configurations × 5 seeds = 100 trials
├─ Find best at 83% accuracy
└─ Know best is 82-84%, but suboptimal choice
Option A is ALWAYS better because:
- Finding good configuration is harder than averaging noise
- Top configuration with 1 seed > random configuration averaged 5x
- Can always rerun top 5 with multiple seeds if needed
- Larger exploration space finds fundamentally better hyperparameters
Recommended Allocation:
Total budget: 200 configurations × 30 min = 100 hours
Phase 1: Wide exploration (100 configurations, 1 seed each)
├─ Random or Bayesian over full search space
└─ Find top 10 candidates
Phase 2: Refinement (50 configurations, 1 seed each)
├─ Search near best from Phase 1
├─ Explore unexplored neighbors
└─ Find top 5 refined candidates
Phase 3: Validation (5 configurations, 3 seeds each)
├─ Run best from Phase 2 with multiple seeds
├─ Report mean ± std
└─ Ensemble predictions from 3 models
Total: 100 + 50 + 15 = 165 trials (realistic)
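A sketch of Phase 3, assuming Phases 1-2 ran as an Optuna study and a train_and_evaluate(params, seed) helper exists (both names are placeholders):
import numpy as np
import optuna
# Take the top-5 completed trials by validation accuracy
completed = [t for t in study.trials if t.state == optuna.trial.TrialState.COMPLETE]
top5 = sorted(completed, key=lambda t: t.value, reverse=True)[:5]
for rank, trial in enumerate(top5, start=1):
    accs = [train_and_evaluate(trial.params, seed=s) for s in range(3)]
    print(f"#{rank} {trial.params}: {np.mean(accs):.4f} ± {np.std(accs):.4f}")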
Early Stopping in Hyperparameter Search (Critical for Efficiency)
Key Concept: During hyperparameter search, stop trials that are clearly bad early.
NOT the Same As:
- Early stopping during training (regularization technique) - still do this!
- Stopping tuning when results plateau (quit tuning) - different concept
Early Stopping in Search: Abandon bad hyperparameter configurations before full training.
How It Works:
# With early stopping in search (accuracies in percentage points)
best_val_acc = 0
for trial in range(100):
    model = create_model()
    for epoch in range(100):
        train(model, epoch)
        val_acc = validate(model)
        best_val_acc = max(best_val_acc, val_acc)
        # Check if this trial is hopeless
        if val_acc < best_val_acc - 10:  # Way worse than the best seen so far
            break  # Stop and try the next configuration!
# Or use automated pruning (Optuna does this)
# Result: 100 trials × ~30 epochs on average = 3000 epoch-trials
# vs 100 trials × 100 epochs = 10000 epoch-trials
# Saves 70% compute, finds same best configuration!
When to Prune:
Trial accuracy worse than best by:
Epoch 5: >15% → PRUNE (hopeless, try next)
Epoch 10: >10% → PRUNE
Epoch 30: >5% → PRUNE
Epoch 50: >2% → DON'T PRUNE (still might recover)
Epoch 80+: Never prune (almost done training)
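As a sketch, the schedule above written out as a standalone pruning check (accuracies in percentage points):
def should_prune(epoch, val_acc, best_val_acc):
    """True if this trial is hopeless relative to the best seen so far."""
    gap = best_val_acc - val_acc
    if epoch >= 50:       # almost done training: never prune
        return False
    if epoch >= 30:
        return gap > 5
    if epoch >= 10:
        return gap > 10
    if epoch >= 5:
        return gap > 15
    return False          # too early to judge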
Optuna's Pruning Strategy:
import optuna
study = optuna.create_study(
direction="maximize",
pruner=optuna.pruners.MedianPruner(
n_startup_trials=5, # First 5 trials always complete
n_warmup_steps=10, # No pruning until epoch 10
interval_steps=1, # Check every epoch
)
)
# MedianPruner removes trials worse than median at each epoch
# Automatically saves ~50-70% compute
Tools and Frameworks Comparison
1. Manual Grid Search (DIY)
# Pros: Full control, simple, good for 1-2 parameters
# Cons: Doesn't scale to many parameters
import itertools
configs = itertools.product(
[0.001, 0.01, 0.1],
[0.0, 0.0001, 0.001]
)
best = None
for lr, wd in configs:
acc = train_and_evaluate(lr=lr, weight_decay=wd)
if best is None or acc > best['acc']:
best = {'lr': lr, 'wd': wd, 'acc': acc}
When to Use: <50 configurations, quick experiments
2. Optuna (Industry Standard)
# Pros: Bayesian optimization, pruning, very popular
# Cons: Slightly more complex
import optuna
def objective(trial):
lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
wd = trial.suggest_float("wd", 1e-5, 1e-1, log=True)
model = create_model()
for epoch in range(100):
train(model, lr=lr, weight_decay=wd)
val_acc = validate(model)
trial.report(val_acc, epoch)
if trial.should_prune():
raise optuna.TrialPruned()
return val_acc
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=200)
When to Use: 5+ parameters, 200+ trials, need efficiency
Why It's Best:
- Bayesian optimization guides search efficiently
- Pruning saves 50-70% compute
- Handles many parameters well
- Simple API once you understand it
3. Ray Tune (For Distributed Search)
# Pros: Distributed search, good for many trials in parallel
# Cons: More setup needed
from ray import tune
from ray.tune.schedulers import ASHAScheduler
def train_model(config):
model = create_model()
for epoch in range(100):
train(model, lr=config['lr'], batch_size=config['batch_size'])
val_acc = validate(model)
tune.report(accuracy=val_acc)
analysis = tune.run(
train_model,
config={
"lr": tune.loguniform(1e-5, 1e-1),
"batch_size": tune.choice([16, 32, 64, 128]),
},
num_samples=200,
scheduler=ASHAScheduler(
time_attr="training_iteration",
metric="accuracy",
mode="max",
max_t=100,
),
verbose=1,
)
When to Use: Distributed setup (multiple GPUs/machines), 500+ trials
4. Weights & Biases (W&B) Sweeps (For Collaboration)
# Pros: Visual dashboard, team collaboration, easy integration
# Cons: Requires W&B account, less control than Optuna
# sweep_config.yaml:
program: train.py
method: bayes
metric:
name: val_accuracy
goal: maximize
parameters:
learning_rate:
min: 0.00001
max: 0.1
distribution: log_uniform_values
weight_decay:
min: 0.00001
max: 0.1
distribution: log_uniform_values
# Then run: wandb sweep sweep_config.yaml
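For completeness, a minimal train.py sketch that such a sweep could drive; create_model, train_epoch, and validate are assumed helpers:
# train.py (sketch)
import wandb

def main():
    wandb.init()                  # the sweep agent injects hyperparameters
    cfg = wandb.config
    model = create_model()        # assumed helper
    for epoch in range(100):
        train_epoch(model, lr=cfg.learning_rate, weight_decay=cfg.weight_decay)
        val_acc = validate(model) # assumed helper
        wandb.log({"val_accuracy": val_acc})  # must match the sweep's metric name

if __name__ == "__main__":
    main()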
When to Use: Team settings, want visual results, corporate environment
When to Use Manual Tuning vs Automated Search
Manual Tuning (Sometimes Better Than You'd Think)
Process:
- Set learning rate with learning rate finder (1 epoch)
- Train with this LR, watch training curves
- If loss oscillates → lower LR by 2x → retrain
- If loss plateaus → lower LR by 3x → retrain
- Repeat until training stable and converging well
- Done!
When It's Actually Faster:
- Total experiments: 3-5 (vs 50+ for search)
- Time: 1-2 hours (vs 20+ hours for automated)
- Result: Often 80-85% (vs 85%+ for search)
# Manual tuning example (sketch)
import numpy as np

learning_rates = [0.0001]  # Start low and safe
for lr in learning_rates:
    model = create_model()
    losses = np.array(train(model, lr=lr))  # per-step training losses
    # If oscillating (recent losses noisier than before), halve LR
    if losses[-10:].std() > losses[-50:-10].std():
        learning_rates.append(lr * 0.5)
    # If plateauing (little recent improvement), cut LR by 3x
    elif losses[-50:-10].mean() - losses[-10:].mean() < 0.01:
        learning_rates.append(lr / 3)
    # If good convergence, done!
    else:
        print(f"Good LR found: {lr}")
        break
Pros:
- Fast for 1-2 hyperparameters
- Understand the hyperparameters better
- Good when compute is limited
- Better for quick iteration
Cons:
- Doesn't explore systematically
- Easy to get stuck in local view
- Not reproducible (depends on your intuition)
- Doesn't find global optimum
Use Manual When:
- ✓ Tuning only learning rate
- ✓ Quick experiments (< 1 hour)
- ✓ Testing ideas rapidly
- ✓ Compute very limited
- ✓ New problem/dataset (explore first)
Use Automated When:
- ✓ Tuning 3+ hyperparameters
- ✓ Targeting SOTA results
- ✓ Compute available (10+ hours)
- ✓ Want reproducible results
- ✓ Need best possible configuration
Common Pitfalls and How to Avoid Them
Pitfall 1: Not Using Log Scale for Learning Rate
Problem: Linear scale [0.0001, 0.002, 0.004, 0.006, 0.008, 0.01] misses the optimal value
Fix: Use a logarithmic scale, e.g. np.logspace(-4, -2, 7)
Impact: Can miss a 3-5% accuracy improvement
Pitfall 2: Tuning Too Many Hyperparameters at Once
Problem: 5 parameters × 5 values = 3,125 configs - impractical
Fix: Prioritize - tune LR first, then batch size, then the rest
Impact: Saves 100x compute while finding better results
Pitfall 3: Using Grid Search in High Dimensions
Problem: Grid search is O(k^n) and explodes quickly
Fix: Use random search for 4+ parameters, Bayesian for 5+
Impact: Random search is ~10x more efficient
Pitfall 4: Training All Trials to Completion
Problem: Bad trials waste compute when nothing stops them early
Fix: Use Optuna with MedianPruner to prune bad trials
Impact: Saves 50-70% compute with the same best result
Pitfall 5: Searching Over Architecture Before Optimizing Learning Rate
Problem: Changing model width 128 → 256 under a bad LR gives noisy, misleading comparisons
Fix: Fix the learning rate first, then tune architecture
Impact: Avoids confounding - LR tuning is worth 5-10%, width only ~2%
Pitfall 6: Single Seed for Final Configuration
Problem: One training run per config leaves the variance unknown
Fix: Rerun the top 5 configs with 3+ seeds
Impact: Known confidence intervals, and the runs can be ensembled
Pitfall 7: Search Space Too Narrow
Problem: An LR range like [0.005, 0.01] misses better values outside it
Fix: Start with a wide range (1e-5 to 1e-1) and narrow afterward
Impact: Finds better optima; you can always refine later
Pitfall 8: Not Checking for Interactions Between Hyperparameters
Problem: Treating hyperparameters as independent
Reality: Batch size and LR interact; warmup and the LR schedule interact
Fix: Bayesian optimization naturally handles interactions
Impact: Finds better combined configurations
Pitfall 9: Stopping Search Too Early
Problem: The first 20 trials don't converge, so the search is abandoned
Fix: Run at least 50-100 trials (Bayesian optimization improves with more data)
Impact: The sampler needs a warm-up phase and improves significantly afterward
Pitfall 10: Not Comparing to Baseline
Problem: The best found config hits 82%, with no idea whether that beats the default
Fix: Include the default hyperparameters as an explicit trial
Impact: You know whether the search is even helping (sometimes the default is good)
Hyperparameter Importance Empirical Results (Case Studies)
Case Study 1: CIFAR-10 ResNet-18
| Change | Accuracy Shift | Relative Importance |
|---|---|---|
| LR: 0.001 → 0.01 | +14% | 100% ← CRITICAL |
| Batch size: 32 → 128 | -2% | Low (but affects LR) |
| Weight decay: 0 → 0.0001 | +2% | 15% |
| Dropout: 0 → 0.3 | +1% | 7% |
| Model width: 64 → 128 | +2% | 15% |
Lesson: LR is 7-20x more important than individual architectural changes
Case Study 2: ImageNet Fine-tuning (Pretrained ResNet-50)
| Change | Accuracy Shift | Relative Importance |
|---|---|---|
| LR: 0.01 → 0.001 | +3% | 100% ← CRITICAL |
| Warmup: 0 → 1000 steps | +0.5% | 15% |
| Weight decay: 0 → 0.001 | +0.5% | 15% |
| Frozen layers: 0 → 3 | +1% | 30% |
Lesson: Fine-tuning is LR-dominated; architecture matters less for pretrained
Rationalization Table: How to Handle Common Arguments
| User Says | What They Mean | Reality | What to Do |
|---|---|---|---|
| "Grid search is most thorough" | Should check all combinations | Grid is O(k^n), explodes | Show random search beats grid in 5+ dims |
| "More hyperparameters = more flexibility" | Want to tune everything | Most don't matter | Show importance hierarchy, tune LR first |
| "I'll tune architecture first" | Want to find model size | Bad LR confounds results | Insist on fixing LR first |
| "Linear spacing is uniform" | Want equal coverage | Effect is exponential | Show log scale finds optimal 3-5% better |
| "Longer training gives better results" | Can't prune early | Bad config won't improve | Show early stopping pruning saves 70% |
| "I ran 5 configs and found best" | Early results seem good | Variance of 5 runs is high | Need 20+ to be confident |
| "This LR seems good" | One training run looks ok | Might just be lucky run | Run 3 seeds, report mean ± std |
| "My compute is limited" | Can't do full search | Limited budget favors random | Allocate to many configs × 1 seed |
Red Flags: When Something is Wrong
🚩 Red Flag 1: Training loss is extremely noisy (spikes up and down)
- Likely cause: Learning rate too high
- Fix: Reduce learning rate by 10x, try again
🚩 Red Flag 2: All trials have similar accuracy (within 0.5%)
- Likely cause: Search space too narrow or search space overlaps
- Fix: Expand search space, verify random sampling is working
🚩 Red Flag 3: Best trial is at edge of search space
- Likely cause: Search space is too small, optimal is outside
- Fix: Expand bounds in that direction
🚩 Red Flag 4: Early stopping pruned 95% of trials
- Likely cause: Initial configuration space very poor
- Fix: Expand search space, adjust pruning thresholds
🚩 Red Flag 5: Trial finished in 1 epoch (model crashed or diverged)
- Likely cause: Learning rate way too high or batch size incompatible
- Fix: Check LR bounds are reasonable, verify code works
🚩 Red Flag 6: Default hyperparameters beat tuned ones
- Likely cause: Search space poorly designed, not enough trials
- Fix: Expand search space, run more trials, check for bugs
🚩 Red Flag 7: Same "best" configuration found in two independent searches
- Positive indicator: Robust result, likely good hyperparameter
- Action: Can be confident in this configuration
Quick Reference: Decision Tree
Need to improve model performance?
│
├─ Model underfits (high train + val loss)?
│ └─ → Add capacity or train longer (not a tuning problem)
│
├─ Training converges too slowly?
│ └─ → Tune learning rate first (critical!)
│
├─ Training is unstable (losses spike)?
│ └─ → Lower learning rate or increase batch size
│
├─ Overfitting (low train loss, high val loss)?
│ └─ → Tune weight decay, dropout, learning rate schedule
│
├─ How many hyperparameters to tune?
│ ├─ 1-2 params → Use manual tuning or grid search
│ ├─ 3-4 params → Use random search
│ └─ 5+ params → Use Bayesian optimization (Optuna)
│
├─ How much compute available?
│ ├─ <10 hours → Tune only learning rate
│ ├─ 10-100 hours → Random search over 3-4 params
│ └─ 100+ hours → Bayesian optimization, multiple seeds
│
└─ Should you run multiple seeds?
├─ During search: NO (use compute for many configs instead)
└─ For final configs: YES (1-3 seeds per top-5 candidates)
Advanced Topics
Learning Rate Warmup (Critical for Transformers)
What It Is: Start with very small LR, gradually increase to target over N steps, then decay.
Why It Matters:
- Transformers are unstable without warmup
- Initial gradients can be very large (unstable)
- Gradual increase lets model stabilize
- Warmup is ESSENTIAL for BERT, GPT, ViT, etc.
Typical Warmup Schedule:
# Linear warmup then cosine decay
# Common: 10% of total steps for warmup
import math
def get_lr(step, total_steps, warmup_steps, max_lr):
if step < warmup_steps:
# Linear warmup: 0 → max_lr
return max_lr * (step / warmup_steps)
else:
# Cosine decay: max_lr → 0
progress = (step - warmup_steps) / (total_steps - warmup_steps)
return 0.5 * max_lr * (1 + math.cos(math.pi * progress))
# Example:
total_steps = 10000
warmup_steps = 1000 # 10% warmup
max_lr = 0.001
for step in range(total_steps):
lr = get_lr(step, total_steps, warmup_steps, max_lr)
# use lr for this step
When to Tune Warmup:
- Essential for transformers (BERT, GPT, ViT)
- Important for large models (ResNet-50+)
- Can skip for small models (ResNet-18)
- Typical: 5-10% of total steps
Warmup Parameters to Consider:
- warmup_steps: how many steps to warm up (typically 10% of total)
- warmup_schedule: linear vs exponential warmup
- Interaction with learning rate: must tune together!
Batch Size and Learning Rate Interaction (Critical)
Key Finding: Batch size and learning rate are NOT independent.
The Relationship:
Large batch size → Less gradient noise → Can use larger LR
Small batch size → More gradient noise → Need smaller LR
Rule of thumb: LR ∝ sqrt(batch_size)
Doubling batch size → can increase LR by ~1.4x
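A tiny helper encoding the square-root rule, as a sketch (the base values are yours to substitute):
import math

def scale_lr(base_lr, base_batch_size, new_batch_size):
    """Square-root LR scaling: LR ∝ sqrt(batch_size)."""
    return base_lr * math.sqrt(new_batch_size / base_batch_size)

# Tuned LR 0.01 at batch size 32, now training at batch size 128:
print(scale_lr(0.01, 32, 128))  # 0.02 (sqrt(128/32) = 2x)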
Example: CIFAR-10 ResNet18:
Batch Size 32, LR 0.01: Accuracy 84%
Batch Size 32, LR 0.05: Accuracy 81% (too high)
Batch Size 128, LR 0.01: Accuracy 82% (too low for large batch)
Batch Size 128, LR 0.02: Accuracy 84% (recovered!)
Batch Size 128, LR 0.03: Accuracy 85% (slightly better, larger batch benefits)
What This Means:
- Can't tune batch size and LR independently
- Must tune them together
- This is why Bayesian optimization is better (handles interactions)
- Grid search would need to search all combinations
Implication for Search:
- Include both batch size AND LR in search space
- Don't fix batch size, then tune LR
- Don't tune LR, then change batch size
- Search them together for best results
Momentum and Optimizer-Specific Parameters
SGD with Momentum:
# Momentum: accelerates gradient descent
# High momentum (0.9): Faster convergence, but overshoots minima
# Low momentum (0.5): Slower, but more stable
learning_rates = [0.01, 0.1] # Higher for SGD
momentums = [0.8, 0.9, 0.95]
# SGD works best with moderate LR + high momentum
# Default: momentum=0.9
Adam Parameters:
# Adam is more forgiving (less sensitive to hyperparameters)
# But still worth tuning learning rate
# Beta1 (exponential decay for 1st moment): usually 0.9 (don't change)
# Beta2 (exponential decay for 2nd moment): usually 0.999 (don't change)
# Epsilon: usually 1e-8 (don't bother tuning)
learning_rates = [0.0001, 0.001, 0.01] # Lower for Adam
weight_decays = [0.0, 0.0001, 0.001] # Adam needs this
# Adam is more robust, good default optimizer
Which Optimizer to Choose:
SGD + Momentum:
Pros: Better generalization, well-understood
Cons: More sensitive to LR, slower convergence
Use for: Vision (CNN), competitive results
Adam:
Pros: Faster convergence, less tuning, robust
Cons: Slightly worse generalization, adaptive complexity
Use for: NLP, transformers, quick experiments
AdamW:
Pros: Better weight decay, all advantages of Adam
Cons: None really
Use for: Modern default, transformers, NLP
RMSprop:
Pros: Good for RNNs, good convergence
Cons: Less popular, fewer resources
Use for: RNNs, rarely these days
Weight Decay and L2 Regularization
What's the Difference:
- L2 regularization (added to loss): Works with all optimizers
- Weight decay (parameter update): Works correctly only with SGD
- AdamW: Fixes Adam's weight decay issue
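The practical takeaway as a sketch (model is assumed to exist): with Adam-family optimizers, reach for AdamW so the decay is decoupled from the adaptive gradient.
import torch
# Adam + weight_decay folds L2 into the adaptive gradient (often too weak)
adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)
# AdamW applies weight decay directly to the weights (decoupled, correct)
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)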
Impact on Regularization:
# High weight decay: Strong regularization, lower capacity
weight_decay = 0.01
# Low weight decay: Weak regularization, higher capacity
weight_decay = 0.0001
# For overfitting: Start with weight_decay = 1e-4 to 1e-3
# For underfitting: Reduce to 1e-5 or 0.0
Tuning Weight Decay:
If overfitting (low train loss, high val loss):
├─ Try increasing weight decay (0.0001 → 0.001 → 0.01)
└─ Or reduce model capacity
└─ Or collect more data
If underfitting (high train loss):
└─ Reduce weight decay to 0.0
Typical Values:
Vision models (ResNet, etc): 1e-4 to 1e-3
Transformers (BERT, GPT): 0.01 to 0.1
Small networks: 1e-5 to 1e-4
Huge models (1B+): 0.0 or very small
Learning Rate Schedules Worth Tuning
Constant LR (no schedule):
- Pros: Simple, good for comparison baseline
- Cons: Suboptimal convergence
- Use when: Testing new architecture quickly
Step Decay (multiply by 0.1 every N epochs):
# Divide LR by 10 at specific epochs
milestones = [30, 60, 90] # For 100 epoch training
for epoch in range(100):
if epoch in milestones:
lr *= 0.1
Exponential Decay (multiply by factor each epoch):
# Gradual decay, smoother than step
decay_rate = 0.96
for epoch in range(100):
lr = initial_lr * (decay_rate ** epoch)
Cosine Annealing (cosine decay from max to min):
# Best for convergence, used in SOTA papers
import math
def cosine_annealing(epoch, total_epochs, min_lr, max_lr):
return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * epoch / total_epochs))
# Smooth decay, no discontinuities
OneCycleLR (up then down):
# Used in fastai and PyTorch (torch.optim.lr_scheduler.OneCycleLR), very effective
# Ramps from max_lr/25 up to max_lr, then anneals to far below the starting LR
# Over the entire training run
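A minimal PyTorch sketch, assuming optimizer and train_loader exist; the specific values are illustrative:
import torch
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.01,                           # peak LR
    total_steps=len(train_loader) * 100,   # OneCycleLR counts batches, not epochs
    pct_start=0.3,                         # fraction of the run spent ramping up
    div_factor=25.0,                       # initial LR = max_lr / 25
)
for epoch in range(100):
    for batch in train_loader:
        # ... forward / backward / optimizer.step() ...
        scheduler.step()                   # step once per batch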
Which to Choose:
Vision (CNN): Step decay or cosine annealing
Transformers: Warmup then cosine or constant
Fine-tuning: Linear decay (slowly reduce)
Quick experiments: Constant LR
SOTA results: Cosine annealing with warmup
Hyperparameter Interactions: Complex Cases
Interaction 1: Batch Size × Learning Rate
Already covered above - MUST tune together
Interaction 2: Model Capacity × Regularization
Large model + weak regularization → Overfitting
Large model + strong regularization → Good generalization
Small model + strong regularization → Underfitting
Don't increase regularization for small models!
Interaction 3: Warmup × Learning Rate
High LR needs more warmup steps
Low LR needs less warmup
For LR=0.001: warmup_steps = 500
For LR=0.1: warmup_steps = 5000 (higher LR = more warmup)
Interaction 4: Weight Decay × Optimizer
SGD: Weight decay works as specified
Adam: Weight decay doesn't work properly (use AdamW!)
AdamW: Weight decay works correctly
When Model Capacity is the Real Problem
Underfitting Signs:
Training accuracy: 50%
Validation accuracy: 48%
Gap: Small (not overfitting)
→ Model doesn't have capacity to learn
→ Add more parameters (wider/deeper)
→ Tuning hyperparameters won't help much
Fix for Underfitting (not tuning):
# WRONG: Tuning hyperparameters
for lr in learning_rates:
model = SmallModel() # Too small!
train(model, lr=lr) # Still won't converge
# CORRECT: Add model capacity
model = LargeModel() # More parameters
train(model, lr=0.01) # Now it converges well
Capacity Sizing Rules:
Dataset size 10K images: Small model ok (100K parameters)
Dataset size 100K images: Medium model (1M parameters)
Dataset size 1M+ images: Large model (10M+ parameters)
If training data < 10K: Use pre-trained, don't train from scratch
If training data > 1M: Larger models generally better
Debugging Hyperparameter Search
Debugging Checklist:
1. Are trials actually different?
# Check that suggested values are being used
for trial in study.trials[:5]:
    print(f"LR: {trial.params['lr']}")
    print(f"Batch size: {trial.params['batch_size']}")
# If all the same, check your suggest_* calls
2. Are results being recorded?
# Verify accuracy varies meaningfully across trials
for trial in study.trials:
    print(f"Params: {trial.params}, Value: {trial.value}")
# Should see a range of values, not all the same
3. Is pruning too aggressive?
# Check how many trials got pruned
n_pruned = sum(1 for t in study.trials if t.state == optuna.trial.TrialState.PRUNED)
print(f"Pruned {n_pruned}/{len(study.trials)}")
# If >90% pruned: expand the search space or adjust pruning thresholds
4. Are hyperparameters in the right range?
# Check if the best trial sits at a boundary of the search space
best = study.best_params
search_space = {...}  # your defined space
for param, value in best.items():
    if value == search_space[param][0] or value == search_space[param][-1]:
        print(f"WARNING: {param} at boundary!")
5. Is the search space reasonable?
# Quick sanity check: run 5 random configs
# Should see different accuracies (not all 50%, not all 95%)
Complete Optuna Workflow Example (Production Ready)
Full Example from Start to Finish:
import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import numpy as np  # used for the final mean/std report
# Step 1: Define the objective function
def objective(trial):
# Suggest hyperparameters
learning_rate = trial.suggest_float(
"learning_rate",
1e-5, 1e-1,
log=True # CRITICAL: Log scale for LR
)
batch_size = trial.suggest_categorical(
"batch_size",
[16, 32, 64, 128]
)
weight_decay = trial.suggest_float(
"weight_decay",
1e-6, 1e-2,
log=True # Log scale for weight decay
)
dropout_rate = trial.suggest_float(
"dropout_rate",
0.0, 0.5 # Linear scale for dropout
)
optimizer_type = trial.suggest_categorical(
"optimizer",
["adam", "sgd"]
)
# Build model with suggested hyperparameters
model = create_model(dropout=dropout_rate)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
# Create optimizer
if optimizer_type == "adam":
optimizer = torch.optim.Adam(
model.parameters(),
lr=learning_rate,
weight_decay=weight_decay
)
else: # sgd
optimizer = torch.optim.SGD(
model.parameters(),
lr=learning_rate,
momentum=0.9,
weight_decay=weight_decay
)
# Learning rate scheduler
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
optimizer,
T_max=100
)
# Training loop with pruning
best_val_acc = 0
for epoch in range(100):
# Train
model.train()
train_loss = 0
for batch_x, batch_y in train_loader:
batch_x, batch_y = batch_x.to(device), batch_y.to(device)
optimizer.zero_grad()
logits = model(batch_x)
loss = nn.CrossEntropyLoss()(logits, batch_y)
loss.backward()
optimizer.step()
train_loss += loss.item()
# Validate
model.eval()
val_correct = 0
val_total = 0
with torch.no_grad():
for batch_x, batch_y in val_loader:
batch_x, batch_y = batch_x.to(device), batch_y.to(device)
logits = model(batch_x)
predictions = logits.argmax(dim=1)
val_correct += (predictions == batch_y).sum().item()
val_total += batch_y.size(0)
val_acc = val_correct / val_total
if val_acc > best_val_acc:
best_val_acc = val_acc
# Step scheduler
scheduler.step()
# Report to trial and prune if needed (CRITICAL!)
trial.report(val_acc, epoch)
if trial.should_prune():
raise optuna.TrialPruned()
return best_val_acc
# Step 2: Create study with optimization
# TPESampler: Tree-structured Parzen Estimator (better than default)
sampler = TPESampler(seed=42)
study = optuna.create_study(
direction="maximize",
sampler=sampler,
pruner=MedianPruner(
n_startup_trials=5, # First 5 trials always complete
n_warmup_steps=10, # No pruning until epoch 10
interval_steps=1 # Check every epoch
)
)
# Step 3: Optimize (run search)
study.optimize(
objective,
n_trials=200, # Run 200 configurations
n_jobs=4, # 4 trials in parallel (device/GPU assignment per trial is up to you)
show_progress_bar=True
)
# Step 4: Analyze results
print(f"Best accuracy: {study.best_value:.4f}")
print(f"Best hyperparameters: {study.best_params}")
# Step 5: Visualize results (optional but useful)
try:
    fig = optuna.visualization.plot_optimization_history(study)  # requires plotly
    fig.show()
except ImportError:
    pass  # plotting backend not installed
# Step 6: Run final validation with best config
# (With 3 seeds, report mean ± std)
best_params = study.best_params
final_accuracies = []
for seed in range(3):
model = create_model(dropout=best_params['dropout_rate'])
# ... train with best_params ...
final_acc = validate(model) # Your validation function
final_accuracies.append(final_acc)
print(f"Final result: {np.mean(final_accuracies):.4f} ± {np.std(final_accuracies):.4f}")
Key Points in This Example:
- Log scale for learning rate and weight decay (CRITICAL)
- Linear scale for dropout (CORRECT)
- Trial pruning to save compute (ESSENTIAL)
- LR scheduler with optimizer
- Running final validation with multiple seeds
- Clear reporting of best config
Grid Search at Scale: When It Breaks Down
Small Grid (Works Fine):
3 params × 3 values each = 27 configs
Time: 27 × 30 min = 810 minutes = 13.5 hours
Practical? YES
Medium Grid (Getting Expensive):
4 params × 4 values each = 256 configs
Time: 256 × 30 min = 7680 minutes = 128 hours = 5.3 days
Practical? MAYBE (if you have the compute)
Large Grid (Impractical):
5 params × 5 values each = 3,125 configs
Time: 3,125 × 30 min = 93,750 minutes = 65 days
Practical? NO
Random search: 200 configs × 30 min = 6,000 minutes ≈ 4.2 days
→ 15x FASTER, BETTER RESULTS
Always Use Random When Grid > 100 Configs
Common Search Space Mistakes (With Fixes)
Mistake 1: LR range too narrow
# WRONG: Only covers small range
lr_values = [0.008, 0.009, 0.01, 0.011, 0.012]
# CORRECT: Covers multiple orders of magnitude
lr_values = np.logspace(-4, -1, 4) # [1e-4, 1e-3, 1e-2, 1e-1]
Mistake 2: Batch size without corresponding LR adjustment
# WRONG: Searches batch size but LR fixed at 0.001
batch_sizes = [32, 64, 128, 256]
learning_rate = 0.001 # Fixed!
# CORRECT: Search both batch size AND LR together
# Large batch needs larger LR
batch_sizes = [32, 64, 128, 256]
learning_rates = [0.001, 0.002, 0.003, 0.005, 0.01]
Mistake 3: Linear spacing for exponential parameters
# WRONG: Linear spacing for weight decay
wd_values = [0.0, 0.025, 0.05, 0.075, 0.1]
# CORRECT: Log spacing for weight decay
wd_values = np.logspace(-5, -1, 5) # [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
Mistake 4: Dropout range that's too wide
# WRONG: Including 0.9 dropout (destroys model)
dropout_values = [0.0, 0.3, 0.6, 0.9]
# CORRECT: Reasonable regularization range
dropout_values = [0.0, 0.2, 0.4, 0.6]
When to Stop Searching and Go With What You Have
Stop Conditions:
Diminishing Returns
- First 50 trials: Found 80% of best accuracy
- Next 50 trials: Found 15% improvement
- Next 50 trials: Found 4% improvement
- → Stop when improvement/trial drops below 0.1%
Time Budget Exhausted
- Planned for 100 hours, used 100 hours
- → Run final validation and ship results
Best Config Appears Stable
- Same best configuration in last 20 trials
- Different search random seeds find same optimum
- → Confidence in result, safe to stop
No Config Improvement
- Last 30 trials all worse than current best
- Pruning catching most trials
- → Search converged, time to stop
Decision Rule - stop at whichever comes first:
- Budget exhausted: total_budget // cost_per_trial trials run
- Diminishing returns: improvement per trial drops below 0.1%
- Converged: the same best configuration has held for 20 trials
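A sketch of the "same best for 20 trials" condition as an Optuna callback; study.stop() is Optuna's API, while the patience value is an assumption to adjust:
import optuna

def stop_when_stale(patience=20):
    """Stop the study once the best trial hasn't changed for `patience` trials."""
    def callback(study, trial):
        try:
            best_number = study.best_trial.number
        except ValueError:
            return  # no completed trials yet
        if trial.number - best_number >= patience:
            study.stop()
    return callback

# study.optimize(objective, n_trials=500, callbacks=[stop_when_stale(20)])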
Summary: Best Practices
- Prioritize Learning Rate - Most important hyperparameter by far (7-20x impact)
- Use Log Scale - For LR, weight decay, regularization strength
- Avoid Grid Search - Exponential complexity O(k^n), use random for 4+ params
- Allocate for Many Configs - Broad exploration > Multiple runs of few configs (5-10x better)
- Enable Early Stopping - In search itself (pruning bad trials), saves 50-70% compute
- Use Optuna - Industry standard with Bayesian optimization + pruning
- Run Multiple Seeds - Only for final top-5 candidates (3 seeds), not all trials
- Start With Defaults - Only tune if underperforming (don't waste compute)
- Check for Interactions - Batch size and LR interact strongly (tune together)
- Compare to Baseline - Include default config to verify search helps
- Tune Warmup with LR - Critical for transformers, must co-tune
- Match Optimizer to Task - SGD for vision/SOTA, Adam/AdamW for NLP/transformers
- Use Log Scale for Exponential Parameters - Critical for finding optimal
- Stop When Returns Diminish - Once improvement <0.1% per trial, stop searching
- Debug Search Systematically - Check bounds, pruning rates, parameter suggestions