| name | learning-rate-scheduling |
| description | Learning rate scheduling - warmup, schedulers, decay strategies, modern best practices |
Learning Rate Scheduling Skill
When to Use This Skill
Use this skill when:
- User asks "should I use a learning rate scheduler?"
- Training plateaus or loss stops improving
- Training transformers or large models (warmup critical)
- User wants to implement OneCycleLR or specific scheduler
- Training is unstable in early epochs
- User asks "what learning rate should I use?"
- Deciding between constant LR and scheduled LR
- User is copying a paper's training recipe
- Implementing modern training pipelines (vision, NLP, RL)
- User suggests "just use constant LR" (rationalization)
Do NOT use when:
- User has specific bugs unrelated to scheduling
- Only discussing optimizer choice (no schedule questions)
- Training already working well and no LR questions asked
Core Principles
1. Why Learning Rate Scheduling Matters
Learning rate scheduling is one of the MOST IMPACTFUL hyperparameters:
High LR Early (Exploration):
- Fast initial progress through parameter space
- Escape poor local minima
- Rapid loss reduction in early epochs
Low LR Late (Exploitation):
- Fine-tune to sharper, better minima
- Improve generalization (test accuracy)
- Stable convergence without oscillation
Quantitative Impact:
- Proper scheduling improves final test accuracy by 2-5% (SIGNIFICANT)
- Standard practice in all SOTA papers (ResNet, EfficientNet, ViT, BERT, GPT)
- Not optional for competitive performance
When Constant LR Fails:
- Can't explore quickly AND converge precisely
- Either too high (never converges) or too low (too slow)
- Leaves 2-5% performance on the table
2. Decision Framework: When to Schedule vs Constant LR
Use Scheduler When:
✅ Long training (>30 epochs)
- Scheduling essential for multi-stage training
- Different LR regimes needed across training
- Example: 90-epoch ImageNet training
✅ Large model on large dataset
- Training from scratch on ImageNet, COCO, etc.
- Benefits from exploration → exploitation strategy
- Example: ResNet-50 on ImageNet
✅ Training plateaus or loss stops improving
- Current LR too high for current parameter regime
- Reducing LR breaks plateau
- Example: Validation loss stuck for 10+ epochs
✅ Following established training recipes
- Papers publish schedules for reproducibility
- Vision models typically use MultiStepLR or Cosine
- Example: ResNet paper specifies drop at epochs 30, 60, 90
✅ Want competitive SOTA performance
- Squeezing out last 2-5% accuracy
- Required for benchmarks and competitions
- Example: Targeting SOTA on CIFAR-10
Maybe Don't Need Scheduler When:
❌ Very short training (<10 epochs)
- Not enough time for multi-stage scheduling
- Constant LR or simple linear decay sufficient
- Example: Quick fine-tuning for 5 epochs
❌ OneCycle is the strategy itself
- OneCycleLR IS the training strategy (not separate)
- Don't combine OneCycle with another scheduler
- Example: FastAI-style 20-epoch training
❌ Hyperparameter search phase
- Constant LR simpler to compare across runs
- Add scheduling after finding good architecture/optimizer
- Example: Running 50 architecture trials
❌ Transfer learning fine-tuning
- Small number of epochs on pretrained model
- Constant small LR often sufficient
- Example: Fine-tuning BERT for 3 epochs
❌ Reinforcement learning
- RL typically uses constant LR (exploration/exploitation balance different)
- Some exceptions (PPO sometimes uses linear decay)
- Example: DQN, A3C usually constant LR
Default Recommendation:
- For >30 epoch training: USE A SCHEDULER (typically CosineAnnealingLR)
- For <10 epoch training: constant LR is usually fine
- For 10-30 epochs: try both; the scheduler usually wins
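As a rough illustration of these defaults, here is a minimal helper that picks a scheduler from the planned training length; the make_scheduler name and the exact thresholds are assumptions for this sketch, not a standard API.
from torch.optim.lr_scheduler import CosineAnnealingLR, OneCycleLR

def make_scheduler(optimizer, total_epochs, steps_per_epoch=None):
    # Heuristic defaults: constant LR for very short runs, OneCycle for
    # short-to-medium runs (needs per-batch stepping), cosine otherwise.
    if total_epochs < 10:
        return None  # constant LR is usually fine
    if total_epochs <= 30 and steps_per_epoch is not None:
        return OneCycleLR(optimizer, max_lr=optimizer.param_groups[0]['lr'],
                          steps_per_epoch=steps_per_epoch, epochs=total_epochs)
    return CosineAnnealingLR(optimizer, T_max=total_epochs, eta_min=1e-5)
Remember that OneCycleLR must be stepped every batch while CosineAnnealingLR is stepped every epoch (see the pitfalls section).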
3. Major Scheduler Types - Complete Comparison
StepLR / MultiStepLR (Classic Vision)
Use When:
- Training CNNs (ResNet, VGG, etc.)
- Following established recipe from paper
- Want simple, interpretable schedule
How It Works:
- Drop LR by constant factor at specific epochs
- StepLR: every N epochs
- MultiStepLR: at specified milestone epochs
Implementation:
# StepLR: Drop every 30 epochs
scheduler = torch.optim.lr_scheduler.StepLR(
optimizer,
step_size=30, # Drop every 30 epochs
gamma=0.1 # Multiply LR by 0.1 (10x reduction)
)
# MultiStepLR: Drop at specific milestones (more control)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
optimizer,
milestones=[30, 60, 90], # Drop at these epochs
gamma=0.1 # Multiply by 0.1 each time
)
# Training loop
for epoch in range(100):
train_one_epoch(model, train_loader, optimizer)
scheduler.step() # Call AFTER each epoch
Example Schedule (initial_lr=0.1):
- Epochs 0-29: LR = 0.1
- Epochs 30-59: LR = 0.01 (dropped by 10x)
- Epochs 60-89: LR = 0.001 (dropped by 10x again)
- Epochs 90-99: LR = 0.0001
Pros:
- Simple and interpretable
- Well-established in papers (easy to reproduce)
- Works well for vision models
Cons:
- Requires manual milestone selection
- Sharp LR drops can cause temporary instability
- Need to know total training epochs in advance
Best For: Classical CNN training (ResNet, VGG) following paper recipes
CosineAnnealingLR (Modern Default)
Use When:
- Training modern vision models (ViT, EfficientNet)
- Want smooth decay without manual milestones
- Don't want to tune milestone positions
How It Works:
- Smooth cosine curve from initial_lr to eta_min
- Gradual decay, no sharp drops
- LR = eta_min + (initial_lr - eta_min) * (1 + cos(π * epoch / T_max)) / 2
Implementation:
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
optimizer,
T_max=100, # Total epochs (LR reaches eta_min at epoch 100)
eta_min=1e-5 # Minimum LR (default: 0)
)
# Training loop
for epoch in range(100):
train_one_epoch(model, train_loader, optimizer)
scheduler.step() # Call AFTER each epoch
Example Schedule (initial_lr=0.1, eta_min=1e-5, T_max=100):
- Epoch 0: LR = 0.1
- Epoch 25: LR ≈ 0.085
- Epoch 50: LR ≈ 0.05
- Epoch 75: LR ≈ 0.015
- Epoch 100: LR = 0.00001 (eta_min)
Note the cosine shape: decay is slow near the start and end, fastest in the middle.
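These values follow directly from the closed-form expression above; a quick sanity-check sketch:
import math

def cosine_lr(epoch, initial_lr=0.1, eta_min=1e-5, T_max=100):
    # Closed-form value of CosineAnnealingLR at a given epoch
    return eta_min + (initial_lr - eta_min) * (1 + math.cos(math.pi * epoch / T_max)) / 2

for e in (0, 25, 50, 75, 100):
    print(e, round(cosine_lr(e), 5))  # 0.1, ~0.085, ~0.05, ~0.015, 1e-05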
Pros:
- No milestone tuning required
- Smooth decay (no instability from sharp drops)
- Widely used in modern papers
- Works well across many domains
Cons:
- Must know total epochs in advance
- Can't adjust schedule during training
Best Practice: ALWAYS COMBINE WITH WARMUP for large models:
# Warmup for 5 epochs, then cosine for 95 epochs
warmup = torch.optim.lr_scheduler.LinearLR(
optimizer,
start_factor=0.01, # Start at 1% of base LR
end_factor=1.0, # Ramp to 100%
total_iters=5 # Over 5 epochs
)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
optimizer,
T_max=95, # 95 epochs after warmup
eta_min=1e-5
)
scheduler = torch.optim.lr_scheduler.SequentialLR(
optimizer,
schedulers=[warmup, cosine],
milestones=[5] # Switch to cosine after 5 epochs
)
Best For: Modern vision models, transformers, default choice for most problems
ReduceLROnPlateau (Adaptive)
Use When:
- Don't know optimal schedule in advance
- Want adaptive approach based on validation performance
- Training plateaus and you want automatic LR reduction
How It Works:
- Monitors validation metric (loss or accuracy)
- Reduces LR when metric stops improving
- Requires passing metric to scheduler.step()
Implementation:
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
optimizer,
mode='min', # 'min' for loss, 'max' for accuracy
factor=0.1, # Reduce LR by 10x when plateau detected
patience=10, # Wait 10 epochs before reducing
threshold=1e-4, # Minimum change to count as improvement
threshold_mode='rel', # 'rel' or 'abs'
cooldown=0, # Epochs to wait after LR reduction
min_lr=1e-6, # Don't reduce below this
verbose=True # Print when LR reduced
)
# Training loop
for epoch in range(100):
train_loss = train_one_epoch(model, train_loader, optimizer)
val_loss = validate(model, val_loader)
# IMPORTANT: Pass validation metric to step()
scheduler.step(val_loss) # NOT scheduler.step() alone!
Example Behavior (patience=10, factor=0.1):
- Epochs 0-30: Val loss improving, LR = 0.001
- Epochs 31-40: Val loss plateaus at 0.15, patience counting
- Epoch 41: Plateau detected, LR reduced to 0.0001
- Epochs 42-60: Val loss improving again with lower LR
- Epoch 61: Plateau again, LR reduced to 0.00001
Pros:
- Adaptive - no manual tuning required
- Based on actual training progress
- Good for unknown optimal schedule
Cons:
- Can be too conservative (waits long before reducing)
- Requires validation metric (can't use train loss alone)
- May reduce LR too late or not enough
Tuning Tips:
- Smaller patience (5-10) for faster adaptation
- Larger patience (10-20) for more conservative
- Factor of 0.1 (10x) is standard, but 0.5 (2x) more gradual
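For example, a faster-reacting but more gradual configuration might look like this (the values are illustrative, not universal recommendations):
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode='min',
    factor=0.5,   # Halve the LR instead of a 10x drop
    patience=5,   # React after 5 stagnant epochs instead of 10
    min_lr=1e-6
)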
Best For: Exploratory training, unknown optimal schedule, adaptive pipelines
OneCycleLR (Fast Training)
Use When:
- Limited compute budget (want fast convergence)
- Training for relatively few epochs (10-30)
- Following FastAI-style training
- Want aggressive schedule for quick results
How It Works:
- Ramps UP from low LR to max_lr (first 30% by default)
- Ramps DOWN from max_lr to very low LR (remaining 70%)
- Steps EVERY BATCH (not every epoch) - CRITICAL DIFFERENCE
Implementation:
scheduler = torch.optim.lr_scheduler.OneCycleLR(
optimizer,
max_lr=0.1, # Peak learning rate (TUNE THIS!)
steps_per_epoch=len(train_loader), # Batches per epoch
epochs=20, # Total epochs
pct_start=0.3, # Ramp up for first 30%
anneal_strategy='cos', # 'cos' or 'linear'
div_factor=25, # initial_lr = max_lr / 25
final_div_factor=10000 # final_lr = initial_lr / 10000 (= max_lr / 250000)
)
# Training loop - NOTE: step() EVERY BATCH
for epoch in range(20):
for batch in train_loader:
loss = train_step(model, batch, optimizer)
optimizer.step()
scheduler.step() # CALL EVERY BATCH, NOT EVERY EPOCH!
Example Schedule (max_lr=0.1, 20 epochs, 400 batches/epoch, default div factors):
- Batches 0-2400 (first ~6 epochs): LR ramps from 0.004 → 0.1
- Batches 2400-8000 (remaining ~14 epochs): LR anneals from 0.1 → ~4e-7 (initial_lr / final_div_factor)
CRITICAL: Tuning max_lr:
OneCycleLR is VERY sensitive to max_lr choice. Too high = instability.
Method 1 - LR Finder (RECOMMENDED):
# Run LR finder first (see LR Finder section)
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
optimal_lr = suggest_lr_from_finder(lrs, losses)  # e.g., 0.01
max_lr = optimal_lr * 10 # Use 5-10x optimal as max_lr
Method 2 - Manual tuning:
- Start with max_lr = 0.1
- If training unstable, try 0.03, 0.01
- If training too slow, try 0.3, 1.0
Pros:
- Very fast convergence (fewer epochs needed)
- Strong final performance
- Popular in FastAI community
Cons:
- Sensitive to max_lr (requires tuning)
- Steps every batch (easy to mess up)
- Not ideal for very long training (>50 epochs)
Common Mistakes:
- Calling scheduler.step() per epoch instead of per batch
- Not tuning max_lr (using default blindly)
- Using for very long training (OneCycle designed for shorter cycles)
Best For: FastAI-style training, limited compute budget, 10-30 epoch training
Advanced OneCycleLR Tuning
If lowering max_lr doesn't resolve instability, try these advanced tuning options:
1. Adjust pct_start (warmup fraction):
# Default: 0.3 (30% warmup, 70% cooldown)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
steps_per_epoch=len(train_loader),
pct_start=0.3) # Default
# If unstable at peak: Increase to 0.4 or 0.5 (longer warmup)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
steps_per_epoch=len(train_loader),
pct_start=0.5) # Gentler ramp to peak
# If unstable in cooldown: Decrease to 0.2 (shorter warmup, gentler descent)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
steps_per_epoch=len(train_loader),
pct_start=0.2)
2. Adjust div_factor (initial LR):
# Default: 25 (initial_lr = max_lr / 25)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
steps_per_epoch=len(train_loader),
div_factor=25) # Start at 0.004
# If unstable at start: Increase to 50 or 100 (start even lower)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
steps_per_epoch=len(train_loader),
div_factor=100) # Start at 0.001
3. Adjust final_div_factor (final LR):
# Default: 10000 (final_lr = initial_lr / 10000, i.e. max_lr / 250000 with div_factor=25)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
steps_per_epoch=len(train_loader),
final_div_factor=10000) # End at ~4e-7
# If unstable at end: Decrease to 1000 (end at a higher LR)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
steps_per_epoch=len(train_loader),
final_div_factor=1000) # End at ~4e-6
4. Add gradient clipping:
# In training loop
for batch in train_loader:
loss = train_step(model, batch, optimizer)
loss.backward()
# Clip gradients to prevent instability
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
5. Consider OneCycle may not be right for your problem:
- Very deep networks (>100 layers): May be too unstable for OneCycle's aggressive schedule
- Large models (>100M params): May need gentler schedule (Cosine + warmup)
- Sensitive architectures (some transformers): OneCycle's rapid LR changes can destabilize
Alternative: Use CosineAnnealing + warmup for more stable training:
# More stable alternative to OneCycle
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=15, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
LinearLR (Warmup)
Use When:
- Need warmup at training start
- Ramping up LR gradually over first few epochs
- Combining with another scheduler (SequentialLR)
How It Works:
- Linearly interpolates LR from start_factor to end_factor
- Typically used for warmup: start_factor=0.01, end_factor=1.0
Implementation:
# Standalone linear warmup
scheduler = torch.optim.lr_scheduler.LinearLR(
optimizer,
start_factor=0.01, # Start at 1% of base LR
end_factor=1.0, # End at 100% of base LR
total_iters=5 # Over 5 epochs
)
# More common: Combine with main scheduler
warmup = torch.optim.lr_scheduler.LinearLR(
optimizer,
start_factor=0.01,
total_iters=5
)
main = torch.optim.lr_scheduler.CosineAnnealingLR(
optimizer,
T_max=95
)
scheduler = torch.optim.lr_scheduler.SequentialLR(
optimizer,
schedulers=[warmup, main],
milestones=[5] # Switch after 5 epochs
)
# Training loop
for epoch in range(100):
train_one_epoch(model, train_loader, optimizer)
scheduler.step()
Example Schedule (base_lr=0.1):
- Epoch 0: LR = 0.001 (1%)
- Epoch 1: LR = 0.0208 (20.8%)
- Epoch 2: LR = 0.0406 (40.6%)
- Epoch 3: LR = 0.0604 (60.4%)
- Epoch 4: LR = 0.0802 (80.2%)
- Epoch 5: LR = 0.1 (100%, then switch to main scheduler)
Best For: Warmup phase for transformers and large models
ExponentialLR (Continuous Decay)
Use When:
- Want smooth, continuous decay
- Simpler alternative to Cosine
- Prefer exponential over linear decay
How It Works:
- Multiply LR by gamma every epoch
- LR(epoch) = initial_lr * gamma^epoch
Implementation:
scheduler = torch.optim.lr_scheduler.ExponentialLR(
optimizer,
gamma=0.95 # Multiply by 0.95 each epoch
)
# Training loop
for epoch in range(100):
train_one_epoch(model, train_loader, optimizer)
scheduler.step()
Example Schedule (initial_lr=0.1, gamma=0.95):
- Epoch 0: LR = 0.1
- Epoch 10: LR = 0.0599
- Epoch 50: LR = 0.0077
- Epoch 100: LR ≈ 0.0006
Tuning gamma:
- Want 10x decay over 100 epochs: gamma = 0.977 (0.1^(1/100))
- Want 100x decay over 100 epochs: gamma = 0.955 (0.01^(1/100))
- General formula: gamma = (target_lr / initial_lr)^(1/epochs)
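A worked example of that formula (plain arithmetic, no scheduler required):
def gamma_for(initial_lr, target_lr, epochs):
    # gamma = (target_lr / initial_lr) ** (1 / epochs)
    return (target_lr / initial_lr) ** (1.0 / epochs)

print(gamma_for(0.1, 0.01, 100))   # ~0.977 -> 10x decay over 100 epochs
print(gamma_for(0.1, 0.001, 100))  # ~0.955 -> 100x decay over 100 epochs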
Pros:
- Very smooth decay
- Simple to implement
Cons:
- Hard to intuit gamma value for desired final LR
- Less popular than Cosine (Cosine is better default)
Best For: Cases where you want exponential decay specifically
LambdaLR (Custom Schedules)
Use When:
- Need custom schedule not provided by standard schedulers
- Implementing paper-specific schedule
- Advanced use cases (e.g., transformer inverse sqrt schedule)
How It Works:
- Provide function that computes LR multiplier for each epoch
- LR(epoch) = initial_lr * lambda(epoch)
Implementation:
# Example: Warmup then constant
def warmup_lambda(epoch):
if epoch < 5:
return (epoch + 1) / 5 # Linear warmup
else:
return 1.0 # Constant after warmup
scheduler = torch.optim.lr_scheduler.LambdaLR(
optimizer,
lr_lambda=warmup_lambda
)
# Example: Transformer inverse square root schedule
# (step this scheduler every BATCH, not every epoch - warmup_steps is in batches)
def transformer_schedule(step):
    warmup_steps = 4000
    step = step + 1
    return min(step ** (-0.5), step * warmup_steps ** (-1.5))
scheduler = torch.optim.lr_scheduler.LambdaLR(
optimizer,
lr_lambda=transformer_schedule
)
# Example: Polynomial decay
def polynomial_decay(epoch):
return (1 - epoch / 100) ** 0.9 # Decay to 0 at epoch 100
scheduler = torch.optim.lr_scheduler.LambdaLR(
optimizer,
lr_lambda=polynomial_decay
)
Best For: Custom schedules, implementing specific papers, advanced users
4. Warmup Strategies - CRITICAL FOR TRANSFORMERS
Why Warmup is Essential
Problem at Training Start:
- Weights are randomly initialized
- Gradients can be very large and unstable
- BatchNorm statistics are uninitialized
- High LR can cause immediate divergence (NaN loss)
Solution: Gradual LR Increase
- Start with very low LR (1% of target)
- Linearly increase to target LR over first few epochs
- Allows model to stabilize before aggressive learning
Quantitative Impact:
- Transformers WITHOUT warmup: Often diverge or train very unstably
- Transformers WITH warmup: Stable training, better final performance
- Vision models: Warmup improves stability, sometimes +0.5-1% accuracy
When Warmup is MANDATORY
ALWAYS use warmup when:
✅ Training transformers (ViT, BERT, GPT, T5, etc.)
- Transformers REQUIRE warmup - not optional
- Without warmup, training often diverges
- Standard practice in all transformer papers
✅ Large batch sizes (>512)
- Large batches → larger effective learning rate
- Warmup prevents early instability
- Standard for distributed training
✅ High initial learning rates
- If starting with LR > 0.001, use warmup
- Warmup allows higher peak LR safely
✅ Training from scratch (not fine-tuning)
- Random initialization needs gentle start
- Fine-tuning can often skip warmup (weights already good)
Usually use warmup when:
✅ Large models (>100M parameters)
✅ Using AdamW optimizer (common with transformers)
✅ Following modern training recipes
May skip warmup when:
❌ Fine-tuning pretrained models (weights already trained)
❌ Small learning rates (<0.0001)
❌ Small models (<10M parameters)
❌ Established recipe without warmup (e.g., some CNN papers)
Warmup Implementation Patterns
Pattern 1: Linear Warmup + Cosine Decay (Most Common)
import torch.optim.lr_scheduler as lr_scheduler
# Warmup for 5 epochs
warmup = lr_scheduler.LinearLR(
optimizer,
start_factor=0.01, # Start at 1% of base LR
end_factor=1.0, # End at 100% of base LR
total_iters=5 # Over 5 epochs
)
# Cosine decay for remaining 95 epochs
cosine = lr_scheduler.CosineAnnealingLR(
optimizer,
T_max=95, # 95 epochs after warmup
eta_min=1e-5 # Final LR
)
# Combine sequentially
scheduler = lr_scheduler.SequentialLR(
optimizer,
schedulers=[warmup, cosine],
milestones=[5] # Switch to cosine after epoch 5
)
# Training loop
for epoch in range(100):
train_one_epoch(model, train_loader, optimizer)
scheduler.step()
Schedule Visualization (base_lr=0.001):
- Epochs 0-4: Linear ramp from 0.00001 → 0.001 (warmup)
- Epochs 5-99: Cosine decay from 0.001 → 0.00001
Use For: Vision transformers, modern CNNs, most large-scale training
Pattern 2: Linear Warmup + MultiStepLR
# Warmup for 5 epochs
warmup = lr_scheduler.LinearLR(
optimizer,
start_factor=0.01,
total_iters=5
)
# Step decay at 30, 60, 90
steps = lr_scheduler.MultiStepLR(
optimizer,
milestones=[30, 60, 90],
gamma=0.1
)
# Combine
scheduler = lr_scheduler.SequentialLR(
optimizer,
schedulers=[warmup, steps],
milestones=[5]
)
Use For: Classical CNN training with warmup
Pattern 3: Manual Warmup (More Control)
import math

def get_lr_schedule(epoch, total_epochs, base_lr, warmup_epochs=5):
"""
Custom schedule with warmup and cosine decay.
"""
if epoch < warmup_epochs:
# Linear warmup
return base_lr * (epoch + 1) / warmup_epochs
else:
# Cosine decay
progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
return base_lr * 0.5 * (1 + math.cos(math.pi * progress))
# Training loop
for epoch in range(100):
lr = get_lr_schedule(epoch, total_epochs=100, base_lr=0.001)
for param_group in optimizer.param_groups:
param_group['lr'] = lr
train_one_epoch(model, train_loader, optimizer)
Use For: Custom schedules, research, maximum control
Pattern 4: Transformer-Style Warmup (Inverse Square Root)
def transformer_lr_schedule(step, d_model, warmup_steps):
"""
Transformer schedule from "Attention is All You Need".
LR increases during warmup, then decreases proportionally to inverse sqrt of step.
"""
step = step + 1 # 1-indexed
return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)
scheduler = lr_scheduler.LambdaLR(
optimizer,
lr_lambda=lambda step: transformer_lr_schedule(step, d_model=512, warmup_steps=4000)
)
# Training loop - NOTE: step every BATCH for this schedule
for epoch in range(epochs):
for batch in train_loader:
train_step(model, batch, optimizer)
optimizer.step()
scheduler.step() # Step every batch
Use For: Transformer models (BERT, GPT), following original papers
Warmup Duration Guidelines
How many warmup epochs?
- Transformers: 5-20 epochs (or 5-10% of total training)
- Vision models: 5-10 epochs
- Very large models (>1B params): 10-20 epochs
- Small models: 3-5 epochs
Rule of thumb: 5-10% of total training epochs
Examples:
- 100-epoch training: 5-10 epoch warmup
- 20-epoch training: 2-3 epoch warmup
- 300-epoch training: 15-30 epoch warmup
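A minimal helper encoding this rule of thumb; the 5% fraction and the clamping bounds are assumptions for the sketch, adjust to taste.
def warmup_epochs(total_epochs, fraction=0.05, minimum=2, maximum=20):
    # Roughly 5% of training, clamped to a sensible range
    return max(minimum, min(maximum, round(fraction * total_epochs)))

print(warmup_epochs(100))  # 5
print(warmup_epochs(20))   # 2
print(warmup_epochs(300))  # 15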
"But My Transformer Trained Fine Without Warmup"
Some users report training transformers without warmup successfully. Here's the reality:
What "fine" actually means:
- Training didn't diverge (NaN loss) - that's a low bar
- Got reasonable accuracy - but NOT optimal accuracy
- One successful run doesn't mean it's optimal or reliable
What you're missing without warmup:
1. Performance gap (1-3% accuracy):
Without warmup: Training works, achieves 85% accuracy
With warmup: Same model achieves 87-88% accuracy
That 2-3% is SIGNIFICANT:
- Difference between competitive and SOTA
- Difference between accepted and rejected paper
- Difference between passing and failing business metrics
2. Training stability:
Without warmup:
- Some runs diverge → need to restart with lower LR
- Sensitive to initialization seed
- Requires careful LR tuning
- Success rate: 60-80% of runs
With warmup:
- Stable training → consistent results
- Robust to initialization
- Wider stable LR range
- Success rate: 95-100% of runs
3. Hyperparameter sensitivity:
Without warmup:
- Very sensitive to initial LR choice (0.001 works, 0.0015 diverges)
- Sensitive to batch size
- Sensitive to optimizer settings
With warmup:
- More forgiving LR range (0.0005-0.002 all work)
- Less sensitive to batch size
- Robust optimizer configuration
Empirical Evidence - Published Papers:
Check transformer papers - ALL use warmup:
| Model | Paper | Warmup |
|---|---|---|
| ViT | Dosovitskiy et al., 2020 | ✅ Linear, 10k steps |
| DeiT | Touvron et al., 2021 | ✅ Linear, 5 epochs |
| Swin | Liu et al., 2021 | ✅ Linear, 20 epochs |
| BERT | Devlin et al., 2018 | ✅ Linear, 10k steps |
| GPT-2 | Radford et al., 2019 | ✅ Linear warmup |
| GPT-3 | Brown et al., 2020 | ✅ Linear warmup |
| T5 | Raffel et al., 2020 | ✅ Inverse sqrt warmup |
Every competitive transformer model uses warmup - there's a reason.
"But I got 85% accuracy without warmup!"
Great! Now try with warmup and see if you get 87-88%. You probably will.
The cost-benefit analysis:
# Cost: One line of code
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
scheduler = SequentialLR(optimizer, [warmup, main], [5])
# Benefit:
# - 1-3% better accuracy
# - More stable training
# - Higher success rate
# - Wider stable hyperparameter range
Recommendation:
- Run ablation study: Train your model with and without warmup
- Compare: Final test accuracy, training stability, number of failed runs
- You'll find warmup gives better results with minimal complexity
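A minimal sketch of that ablation, assuming you already have build_model, train, and evaluate helpers (hypothetical names here) that build a freshly initialized model, run the full training loop with a given scheduler, and return test accuracy:
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

results = {}
for use_warmup in (False, True):
    model = build_model()                                     # hypothetical helper
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    if use_warmup:
        warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
        cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
        scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
    else:
        scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-5)
    train(model, optimizer, scheduler, epochs=100)            # hypothetical helper
    results[use_warmup] = evaluate(model)                     # hypothetical helper

print(results)  # compare final accuracy with vs. without warmup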
Bottom line: Just because something "works" doesn't mean it's optimal. Warmup is standard practice for transformers because it consistently improves results.
5. LR Finder - Finding Optimal Initial LR
What is LR Finder?
Method from Leslie Smith (2015): Cyclical Learning Rates paper
Core Idea:
- Start with very small LR (1e-8)
- Gradually increase LR (multiply by ~1.1 each batch)
- Train for a few hundred steps, record loss at each LR
- Plot loss vs LR
- Choose LR where loss decreases fastest (steepest descent)
Why It Works:
- Too low LR: Loss decreases very slowly
- Optimal LR: Loss decreases rapidly (steepest slope)
- Too high LR: Loss plateaus or increases (instability)
Typical Findings:
- Loss decreases fastest at some LR (e.g., 0.01)
- Loss starts increasing at higher LR (e.g., 0.1)
- Choose LR slightly below fastest descent point (e.g., 0.003-0.01)
LR Finder Implementation
import copy
import torch
import matplotlib.pyplot as plt
import numpy as np
def find_lr(model, train_loader, optimizer, loss_fn, device,
start_lr=1e-8, end_lr=10, num_iter=100, smooth_f=0.05):
"""
LR Finder: Sweep learning rates and plot loss curve.
Args:
model: PyTorch model
train_loader: Training data loader
optimizer: Optimizer (will be modified)
loss_fn: Loss function
device: Device to train on
start_lr: Starting learning rate (default: 1e-8)
end_lr: Ending learning rate (default: 10)
num_iter: Number of iterations (default: 100)
smooth_f: Smoothing factor for loss (default: 0.05)
Returns:
lrs: List of learning rates tested
losses: List of losses at each LR
"""
# Save initial model state to restore later (deep copy - optimizer.step()
# mutates the tensors that state_dict() references in place)
model.train()
initial_state = copy.deepcopy(model.state_dict())
# Calculate LR multiplier for exponential increase
lr_mult = (end_lr / start_lr) ** (1 / num_iter)
lrs = []
losses = []
best_loss = float('inf')
avg_loss = 0
lr = start_lr
# Iterate through training data
iterator = iter(train_loader)
for iteration in range(num_iter):
try:
data, target = next(iterator)
except StopIteration:
# Restart iterator if we run out of data
iterator = iter(train_loader)
data, target = next(iterator)
# Set learning rate
for param_group in optimizer.param_groups:
param_group['lr'] = lr
# Forward pass
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
output = model(data)
loss = loss_fn(output, target)
# Compute smoothed loss (exponential moving average)
if iteration == 0:
avg_loss = loss.item()
else:
avg_loss = smooth_f * loss.item() + (1 - smooth_f) * avg_loss
# Record
lrs.append(lr)
losses.append(avg_loss)
# Track best loss
if avg_loss < best_loss:
best_loss = avg_loss
# Stop if loss explodes (>4x best loss)
if avg_loss > 4 * best_loss:
print(f"Stopping early at iteration {iteration}: loss exploded")
break
# Backward pass
loss.backward()
optimizer.step()
# Increase learning rate
lr *= lr_mult
if lr > end_lr:
break
# Restore model to initial state
model.load_state_dict(initial_state)
# Plot results
plt.figure(figsize=(10, 6))
plt.plot(lrs, losses)
plt.xscale('log')
plt.xlabel('Learning Rate (log scale)')
plt.ylabel('Loss')
plt.title('LR Finder')
plt.grid(True, alpha=0.3)
# Mark suggested LR (10x below minimum loss)
min_loss_idx = np.argmin(losses)
suggested_lr = lrs[max(0, min_loss_idx - 5)] # A bit before minimum
plt.axvline(suggested_lr, color='red', linestyle='--',
label=f'Suggested LR: {suggested_lr:.2e}')
plt.legend()
plt.show()
print(f"\nLR Finder Results:")
print(f" Minimum loss at LR: {lrs[min_loss_idx]:.2e}")
print(f" Suggested starting LR: {suggested_lr:.2e}")
print(f" (Choose LR where loss decreases fastest, before minimum)")
return lrs, losses
def suggest_lr_from_finder(lrs, losses):
"""
Suggest optimal learning rate from LR finder results.
Strategy: Find LR where loss gradient is steepest (fastest decrease).
"""
# Compute gradient of loss w.r.t. log(LR)
log_lrs = np.log10(lrs)
gradients = np.gradient(losses, log_lrs)
# Find steepest descent (most negative gradient)
steepest_idx = np.argmin(gradients)
# Suggested LR is at steepest point or slightly before
suggested_lr = lrs[steepest_idx]
return suggested_lr
Using LR Finder
Basic Usage:
# Setup model, optimizer, loss
model = YourModel().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1) # LR will be overridden
loss_fn = torch.nn.CrossEntropyLoss()
# Run LR finder
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
# Manually inspect plot and choose LR
# Look for: steepest descent point (fastest loss decrease)
# Typically: 10x lower than loss minimum
# Example: If minimum is at 0.1, choose 0.01 as starting LR
base_lr = 0.01 # Based on plot inspection
Automated LR Selection:
# Run LR finder
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
# Get suggested LR
suggested_lr = suggest_lr_from_finder(lrs, losses)
# Use suggested LR
optimizer = torch.optim.SGD(model.parameters(), lr=suggested_lr)
Using with OneCycleLR:
# Find optimal LR
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
optimal_lr = suggest_lr_from_finder(lrs, losses) # e.g., 0.01
# OneCycleLR: Use 5-10x optimal as max_lr
max_lr = optimal_lr * 10 # e.g., 0.1
scheduler = torch.optim.lr_scheduler.OneCycleLR(
optimizer,
max_lr=max_lr,
steps_per_epoch=len(train_loader),
epochs=20
)
Interpreting LR Finder Results
Typical Plot Patterns:
Loss
|
| X <-- Loss explodes (LR too high)
| X
| X
| X <-- Loss minimum (still too high)
| X
| X <-- CHOOSE HERE (steepest descent)
| X
| X
| X
|X___________
1e-8 1e-4 1e-2 0.1 1.0 10
Learning Rate
How to Choose:
Steepest Descent (BEST):
- Find where loss decreases fastest (steepest downward slope)
- This is optimal LR for rapid convergence
- Example: If steepest at 0.01, choose 0.01
Before Minimum (SAFE):
- Find minimum loss LR (e.g., 0.1)
- Choose 10x lower (e.g., 0.01)
- More conservative, safer choice
Avoid:
- Don't choose minimum itself (often too high)
- Don't choose where loss is flat (too low, slow progress)
- Don't choose where loss increases (way too high)
Guidelines:
- For SGD: Choose at steepest descent
- For Adam: Choose 10x below steepest (Adam more sensitive)
- For OneCycle: Use steepest as optimal, 5-10x as max_lr
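Using the lrs/losses from the finder run above, those guidelines can be written down directly; the division and multiplication factors here are simply the rules of thumb from this list:
suggested = suggest_lr_from_finder(lrs, losses)  # steepest-descent LR, e.g. 0.01

sgd_lr = suggested                # SGD: use the steepest-descent point directly
adam_lr = suggested / 10          # Adam: more sensitive, start ~10x lower
onecycle_max_lr = suggested * 10  # OneCycleLR: peak at 5-10x the steepest point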
When to Use LR Finder
Use LR Finder When:
✅ Starting new project (unknown optimal LR)
✅ New architecture or dataset
✅ Tuning OneCycleLR (finding max_lr)
✅ Transitioning between optimizers
✅ Having training instability issues
Can Skip When:
❌ Following established paper recipe (LR already known)
❌ Fine-tuning (small LR like 1e-5 typically works)
❌ Very constrained time/resources
❌ Using adaptive methods (ReduceLROnPlateau)
Best Practice:
- Run LR finder once at project start
- Use found LR for all subsequent runs
- Re-run if changing optimizer, architecture, or batch size significantly
6. Scheduler Selection Guide
Selection Flowchart
1. What's your training duration?
- <10 epochs: Constant LR or simple linear decay
- 10-30 epochs: OneCycleLR (fast) or CosineAnnealingLR
- >30 epochs: CosineAnnealingLR or MultiStepLR
2. What's your model type?
- Transformer (ViT, BERT, GPT): CosineAnnealing + WARMUP (mandatory)
- CNN (ResNet, EfficientNet): MultiStepLR or CosineAnnealing + optional warmup
- Small model: Simpler schedulers (StepLR) or constant LR
3. Do you know optimal schedule?
- Yes (from paper): Use paper's schedule (MultiStepLR usually)
- No (exploring): ReduceLROnPlateau or CosineAnnealing
- Want fast results: OneCycleLR + LR finder
4. What's your compute budget?
- High budget (100+ epochs): CosineAnnealing or MultiStepLR
- Low budget (10-20 epochs): OneCycleLR
- Adaptive budget: ReduceLROnPlateau (stops when plateau)
Paper Recipe vs Modern Best Practices
If goal is EXACT REPRODUCTION:
Use paper's exact schedule (down to every detail):
# Example: Reproducing ResNet paper (He et al., 2015)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)
# No warmup (paper didn't use it)
# Train for 100 epochs
Rationale:
- Reproduce results exactly
- Enable apples-to-apples comparison
- Validate paper's claims
- Establish baseline before improvements
If goal is BEST PERFORMANCE:
Use modern recipe (benefit from years of community learning):
# Modern equivalent: ResNet with modern practices
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
# Train for 100 epochs
Rationale:
- Typically +0.5-2% better accuracy than original paper
- More stable training
- Reflects 5-10 years of community improvements
- SOTA competitive performance
Evolution of LR Scheduling Practices:
Early Deep Learning (2012-2016):
- Scheduler: StepLR with manual milestones
- Warmup: Not used (not yet discovered)
- Optimizer: SGD with momentum
- Examples: AlexNet, VGG, ResNet, Inception
Mid Period (2017-2019):
- Scheduler: CosineAnnealing introduced, OneCycleLR popular
- Warmup: Starting to be used for large batch training
- Optimizer: SGD still dominant, Adam increasingly common
- Examples: ResNeXt, DenseNet, MobileNet
Modern Era (2020-2025):
- Scheduler: CosineAnnealing default, OneCycle for fast training
- Warmup: Standard practice (mandatory for transformers)
- Optimizer: AdamW increasingly preferred for transformers
- Examples: ViT, EfficientNet, ConvNeXt, Swin, DeiT
Practical Workflow:
Step 1: Reproduce paper recipe
# Use exact paper settings
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)
# Should match paper's reported accuracy (e.g., 76.5%)
Step 2: Validate reproduction
- If you get 76.5% (matches paper): ✅ Reproduction successful
- If you get 74% (2% worse): ❌ Implementation bug, fix first
- If you get 78% (2% better): ✅ Great! Proceed to modern recipe
Step 3: Try modern recipe
# Add warmup + cosine
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
# Expect +0.5-2% improvement (e.g., 77-78.5%)
Step 4: Compare results
| Version | Accuracy | Notes |
|---|---|---|
| Paper recipe | 76.5% | Baseline (reproduces paper) |
| Modern recipe | 78.0% | +1.5% from warmup + cosine |
When to Use Which:
Use Paper Recipe:
- Publishing reproduction study
- Comparing to paper's baseline
- Validating implementation correctness
- Research requiring exact reproducibility
Use Modern Recipe:
- Building production system (want best performance)
- Competing in benchmark (need SOTA results)
- Publishing new method (should use modern baseline)
- Limited compute (modern practices more efficient)
Trade-off Table:
| Aspect | Paper Recipe | Modern Recipe |
|---|---|---|
| Reproducibility | ✅ Exact | ⚠️ Better but different |
| Performance | ⚠️ Good (for its time) | ✅ Better (+0.5-2%) |
| Comparability | ✅ To paper | ✅ To SOTA |
| Compute efficiency | ⚠️ May be suboptimal | ✅ Modern optimizations |
| Training stability | ⚠️ Variable | ✅ More stable (warmup) |
Bottom Line:
Both are valid depending on your goal:
- Research/reproduction: Start with paper recipe
- Production/competition: Use modern recipe
- Best practice: Validate with paper recipe, deploy with modern recipe
Domain-Specific Recommendations
Image Classification (CNNs)
Standard Recipe (ResNet, VGG):
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)
# Train for 100 epochs
Modern Recipe (EfficientNet, RegNet):
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-5)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
# Train for 100 epochs
Vision Transformers (ViT, Swin, DeiT)
Standard Recipe:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=10)
cosine = CosineAnnealingLR(optimizer, T_max=290, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [10])
# Train for 300 epochs
# WARMUP IS MANDATORY
NLP Transformers (BERT, GPT, T5)
Standard Recipe:
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.01)
# Linear warmup + linear decay
def lr_lambda(step):
warmup_steps = 10000
total_steps = 100000
if step < warmup_steps:
return step / warmup_steps
else:
return max(0.0, (total_steps - step) / (total_steps - warmup_steps))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Step every batch, not epoch
# WARMUP IS MANDATORY
Object Detection (Faster R-CNN, YOLO)
Standard Recipe:
optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[16, 22], gamma=0.1)
# Train for 26 epochs
Fast Training (Limited Compute)
FastAI Recipe:
# Run LR finder first
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
max_lr = suggest_lr_from_finder(lrs, losses) * 10
optimizer = torch.optim.SGD(model.parameters(), lr=max_lr)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
optimizer,
max_lr=max_lr,
steps_per_epoch=len(train_loader),
epochs=20,
pct_start=0.3
)
# Train for 20 epochs
# Step every batch
7. Common Scheduling Pitfalls
Pitfall 1: No Warmup for Transformers
WRONG:
# Training Vision Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
# ❌ No warmup - training will be very unstable or diverge
RIGHT:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
# ✅ Warmup prevents early instability
Why It Matters:
- Transformers with high LR at start → NaN loss, divergence
- Random initialization needs gradual LR ramp
- 5-10 epoch warmup is STANDARD practice
How to Detect:
- Loss is NaN or explodes in first few epochs
- Training very unstable early, stabilizes later
- Gradients extremely large at start
Pitfall 2: Wrong scheduler.step() Placement
WRONG (Most Schedulers):
for epoch in range(epochs):
for batch in train_loader:
loss = train_step(model, batch, optimizer)
optimizer.step()
scheduler.step() # ❌ Stepping every batch, not every epoch
RIGHT:
for epoch in range(epochs):
for batch in train_loader:
loss = train_step(model, batch, optimizer)
optimizer.step()
scheduler.step() # ✅ Step AFTER each epoch
EXCEPTION (OneCycleLR):
for epoch in range(epochs):
for batch in train_loader:
loss = train_step(model, batch, optimizer)
optimizer.step()
scheduler.step() # ✅ OneCycle steps EVERY BATCH
Why It Matters:
- CosineAnnealing with T_max=100 expects 100 steps (epochs)
- Stepping every batch: If 390 batches/epoch, LR decays in <1 epoch
- LR reaches minimum way too fast
How to Detect:
- LR decays to minimum in first epoch
- Print LR each step: print(optimizer.param_groups[0]['lr'])
- Check if LR changes every batch (wrong) vs every epoch (right)
Rule:
- Most schedulers (Step, Cosine, Exponential): Step per epoch
- OneCycleLR only: Step per batch
- ReduceLROnPlateau: Step per epoch with validation metric
Pitfall 3: scheduler.step() Before optimizer.step()
WRONG:
loss.backward()
scheduler.step() # ❌ Wrong order
optimizer.step()
RIGHT:
loss.backward()
optimizer.step() # ✅ Update weights first
scheduler.step() # Then update LR
Why It Matters:
- Scheduler updates LR based on current epoch/step
- Should update weights with current LR, THEN move to next LR
- Wrong order = off-by-one error in schedule
How to Detect:
- Usually subtle, hard to notice
- Best practice: always optimizer.step() then scheduler.step()
Pitfall 4: Not Passing Metric to ReduceLROnPlateau
WRONG:
scheduler = ReduceLROnPlateau(optimizer)
for epoch in range(epochs):
train_loss = train_one_epoch(model, train_loader, optimizer)
scheduler.step() # ❌ No metric passed
RIGHT:
scheduler = ReduceLROnPlateau(optimizer, mode='min')
for epoch in range(epochs):
train_loss = train_one_epoch(model, train_loader, optimizer)
val_loss = validate(model, val_loader)
scheduler.step(val_loss) # ✅ Pass validation metric
Why It Matters:
- ReduceLROnPlateau NEEDS metric to detect plateau
- Without metric, scheduler doesn't know when to reduce LR
- Will get error or incorrect behavior
How to Detect:
- Error message: "ReduceLROnPlateau needs a metric"
- LR never reduces even when training plateaus
Pitfall 5: Using OneCycle for Long Training
SUBOPTIMAL:
# Training for 200 epochs
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=200, steps_per_epoch=len(train_loader))
# ❌ OneCycle designed for shorter training (10-30 epochs)
BETTER:
# For long training, use Cosine
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=10)
cosine = CosineAnnealingLR(optimizer, T_max=190, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [10])
# ✅ Cosine better suited for long training
Why It Matters:
- OneCycle's aggressive up-then-down profile works for short training
- For long training, gentler cosine decay more stable
- OneCycle typically used for 10-30 epochs in FastAI style
When to Use Each:
- OneCycle: 10-30 epochs, limited compute, want fast results
- Cosine: 50+ epochs, full training, want best final performance
Pitfall 6: Not Tuning max_lr for OneCycle
WRONG:
# Just guessing max_lr
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20, steps_per_epoch=len(train_loader))
# ❌ Random max_lr without tuning
# Might be too high (unstable) or too low (slow)
RIGHT:
# Step 1: Run LR finder
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
optimal_lr = suggest_lr_from_finder(lrs, losses) # e.g., 0.01
# Step 2: Use 5-10x optimal as max_lr
max_lr = optimal_lr * 10 # e.g., 0.1
scheduler = OneCycleLR(optimizer, max_lr=max_lr, epochs=20, steps_per_epoch=len(train_loader))
# ✅ Tuned max_lr based on LR finder
Why It Matters:
- OneCycle is VERY sensitive to max_lr
- Too high: Training unstable, loss explodes
- Too low: Slow training, underperforms
- LR finder finds optimal, use 5-10x as max_lr
How to Tune:
- Run LR finder (see LR Finder section)
- Find optimal LR (steepest descent point)
- Use 5-10x optimal as max_lr for OneCycle
- If still unstable, reduce max_lr (try 3x, 2x)
Pitfall 7: Forgetting to Adjust T_max After Adding Warmup
WRONG:
# Want 100 epoch training
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=100) # ❌ Should be 95
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
RIGHT:
# Want 100 epoch training
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95) # ✅ 100 - 5 = 95
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
Why It Matters:
- Total training is warmup + main schedule
- If warmup is 5 epochs and cosine is 100, total is 105 epochs
- T_max should be (total_epochs - warmup_epochs)
How to Calculate:
total_epochs = 100
warmup_epochs = 5
T_max = total_epochs - warmup_epochs # 95
Pitfall 8: Using Same LR for All Param Groups
SUBOPTIMAL:
# Fine-tuning: applying same LR to all layers
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# ❌ Backbone and head both use 1e-3
BETTER:
# Fine-tuning: lower LR for pretrained backbone, higher for new head
optimizer = torch.optim.Adam([
{'params': model.backbone.parameters(), 'lr': 1e-4}, # Lower LR for pretrained
{'params': model.head.parameters(), 'lr': 1e-3} # Higher LR for random init
])
scheduler = CosineAnnealingLR(optimizer, T_max=100)
# ✅ Scheduler applies to all param groups proportionally
Why It Matters:
- Pretrained layers need smaller LR (already trained)
- New layers need higher LR (random initialization)
- Schedulers work with param groups automatically
Note: Schedulers multiply all param groups by same factor, preserving relative ratios
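A quick way to see the ratio being preserved is to step a scheduler on a toy optimizer with two param groups and print both LRs; this is a standalone check, not training code.
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

backbone = torch.nn.Linear(8, 8)
head = torch.nn.Linear(8, 2)
optimizer = torch.optim.Adam([
    {'params': backbone.parameters(), 'lr': 1e-4},  # pretrained part
    {'params': head.parameters(), 'lr': 1e-3},      # new head
])
scheduler = CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(3):
    optimizer.step()   # dummy step (no gradients) so the step-order check passes
    scheduler.step()
    lrs = [g['lr'] for g in optimizer.param_groups]
    print(epoch, lrs, lrs[1] / lrs[0])  # the 10x ratio is preserved every epoch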
Pitfall 9: Not Monitoring LR During Training
PROBLEM:
- Schedule not behaving as expected
- Hard to debug without visibility into LR
SOLUTION:
# Log LR every epoch
for epoch in range(epochs):
current_lr = optimizer.param_groups[0]['lr']
print(f"Epoch {epoch}: LR = {current_lr:.6f}")
train_one_epoch(model, train_loader, optimizer)
scheduler.step()
# Or use TensorBoard
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
for epoch in range(epochs):
current_lr = optimizer.param_groups[0]['lr']
writer.add_scalar('Learning Rate', current_lr, epoch)
train_one_epoch(model, train_loader, optimizer)
scheduler.step()
Best Practice:
- Always log LR to console or TensorBoard
- Plot LR schedule before training (see next section)
- Verify schedule matches expectations
Pitfall 10: Not Validating Schedule Before Training
PROBLEM:
- Run full training, discover schedule was wrong
- Waste compute on incorrect schedule
SOLUTION: Dry-run the schedule:
def plot_schedule(scheduler_fn, num_epochs):
"""
Plot LR schedule before training to verify it's correct.
"""
# Create dummy model and optimizer
model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = scheduler_fn(optimizer)
lrs = []
for epoch in range(num_epochs):
lrs.append(optimizer.param_groups[0]['lr'])
optimizer.step() # Dummy step
scheduler.step()
# Plot
plt.figure(figsize=(10, 6))
plt.plot(lrs)
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.title('LR Schedule')
plt.grid(True, alpha=0.3)
plt.show()
# Usage
def my_scheduler(opt):
warmup = LinearLR(opt, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(opt, T_max=95)
return SequentialLR(opt, [warmup, cosine], [5])
plot_schedule(my_scheduler, num_epochs=100)
# Verify plot looks correct BEFORE training
Best Practice:
- Plot schedule before every major training run
- Verify warmup duration, decay shape, final LR
- Catch mistakes early (T_max wrong, step placement, etc.)
8. Modern Best Practices (2024-2025)
Vision Models (CNNs, ResNets, ConvNeXt)
Standard Recipe:
# Optimizer
optimizer = torch.optim.SGD(
model.parameters(),
lr=0.1,
momentum=0.9,
weight_decay=1e-4
)
# Scheduler: MultiStepLR or CosineAnnealing
# Option 1: MultiStepLR (classical)
scheduler = MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)
# Option 2: CosineAnnealing (modern)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
# Training
epochs = 100
for epoch in range(epochs):
train_one_epoch(model, train_loader, optimizer)
scheduler.step()
Key Points:
- SGD with momentum (0.9) is standard for CNNs
- LR = 0.1 for batch size 256 (scale linearly for other batch sizes)
- Warmup optional but beneficial (5 epochs)
- CosineAnnealing increasingly preferred over MultiStepLR
Vision Transformers (ViT, Swin, DeiT)
Standard Recipe:
# Optimizer
optimizer = torch.optim.AdamW(
model.parameters(),
lr=1e-3,
weight_decay=0.05,
betas=(0.9, 0.999)
)
# Scheduler: MUST include warmup
warmup_epochs = 10
cosine_epochs = 290
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs)
cosine = CosineAnnealingLR(optimizer, T_max=cosine_epochs, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [warmup_epochs])
# Training
epochs = 300
for epoch in range(epochs):
train_one_epoch(model, train_loader, optimizer)
scheduler.step()
Key Points:
- AdamW optimizer (not SGD)
- Warmup is MANDATORY (10-20 epochs)
- Long training (300 epochs typical)
- LR = 1e-3 for batch size 512 (scale for other sizes)
- Cosine decay to very small LR (1e-5)
Why Warmup is Critical for ViT:
- Self-attention layers highly sensitive to initialization
- High LR at start causes gradient explosion
- Warmup allows attention patterns to stabilize
NLP Transformers (BERT, GPT, T5)
Standard Recipe:
# Optimizer
optimizer = torch.optim.AdamW(
model.parameters(),
lr=5e-4,
weight_decay=0.01,
betas=(0.9, 0.999)
)
# Scheduler: Linear warmup + linear decay (or inverse sqrt)
total_steps = len(train_loader) * epochs
warmup_steps = int(0.1 * total_steps) # 10% warmup
def lr_lambda(step):
if step < warmup_steps:
return step / warmup_steps
else:
return max(0.0, (total_steps - step) / (total_steps - warmup_steps))
scheduler = LambdaLR(optimizer, lr_lambda)
# Training: step EVERY BATCH
for epoch in range(epochs):
for batch in train_loader:
train_step(model, batch, optimizer)
optimizer.step()
scheduler.step() # Step every batch, not epoch
Key Points:
- AdamW optimizer
- Warmup is MANDATORY (typically 10% of training)
- Linear warmup + linear decay (BERT, GPT-2 style)
- Step scheduler EVERY BATCH (not every epoch)
- LR typically 1e-4 to 5e-4
Alternative: Inverse Square Root (Original Transformer):
d_model = 512  # model hidden size (512 in the original Transformer)
def transformer_schedule(step):
    warmup_steps = 4000
    step = step + 1
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)
scheduler = LambdaLR(optimizer, transformer_schedule)
Object Detection (Faster R-CNN, YOLO, DETR)
Standard Recipe (Two-stage detectors):
# Optimizer
optimizer = torch.optim.SGD(
model.parameters(),
lr=0.02,
momentum=0.9,
weight_decay=1e-4
)
# Scheduler: MultiStepLR with short schedule
scheduler = MultiStepLR(optimizer, milestones=[16, 22], gamma=0.1)
# Training
epochs = 26 # Shorter than classification
for epoch in range(epochs):
train_one_epoch(model, train_loader, optimizer)
scheduler.step()
Standard Recipe (Transformer detectors like DETR):
# Optimizer
optimizer = torch.optim.AdamW(
[
{'params': model.backbone.parameters(), 'lr': 1e-5}, # Lower for backbone
{'params': model.transformer.parameters(), 'lr': 1e-4} # Higher for transformer
],
weight_decay=1e-4
)
# Scheduler: Step decay
scheduler = MultiStepLR(optimizer, milestones=[200], gamma=0.1)
# Training: Long schedule for DETR
epochs = 300
Key Points:
- Detection typically shorter training than classification
- Lower LR (0.02 vs 0.1) due to task difficulty
- DETR needs very long training (300 epochs)
Semantic Segmentation (U-Net, DeepLab, SegFormer)
Standard Recipe (CNN-based):
# Optimizer
optimizer = torch.optim.SGD(
model.parameters(),
lr=0.01,
momentum=0.9,
weight_decay=1e-4
)
# Scheduler: Polynomial decay (common in segmentation)
total_epochs = 100
def poly_lr_lambda(epoch):
    return (1 - epoch / total_epochs) ** 0.9
scheduler = LambdaLR(optimizer, poly_lr_lambda)
# Training
epochs = 100
for epoch in range(epochs):
train_one_epoch(model, train_loader, optimizer)
scheduler.step()
Key Points:
- Polynomial decay common in segmentation (DeepLab papers)
- Lower initial LR (0.01) than classification
- Power of 0.9 standard
Fast Training / Limited Compute (FastAI Style)
OneCycle Recipe:
# Step 1: Find optimal LR
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
optimal_lr = suggest_lr_from_finder(lrs, losses) # e.g., 0.01
max_lr = optimal_lr * 10 # e.g., 0.1
# Step 2: OneCycleLR
optimizer = torch.optim.SGD(model.parameters(), lr=max_lr, momentum=0.9)
scheduler = OneCycleLR(
optimizer,
max_lr=max_lr,
steps_per_epoch=len(train_loader),
epochs=20,
pct_start=0.3, # 30% warmup, 70% cooldown
anneal_strategy='cos'
)
# Step 3: Train (step every batch)
for epoch in range(20):
for batch in train_loader:
train_step(model, batch, optimizer)
optimizer.step()
scheduler.step() # Every batch
Key Points:
- Use LR finder to tune max_lr (CRITICAL)
- Train for fewer epochs (10-30)
- Step scheduler every batch
- Often achieves 90-95% of full training performance in 20-30% of time
Fine-Tuning Pretrained Models
Standard Recipe:
# Optimizer: Different LRs for backbone vs head
optimizer = torch.optim.AdamW([
{'params': model.backbone.parameters(), 'lr': 1e-5}, # Very low for pretrained
{'params': model.head.parameters(), 'lr': 1e-3} # Higher for new head
])
# Scheduler: Simple cosine or even constant
# Option 1: Constant LR (fine-tuning often doesn't need scheduling)
scheduler = None
# Option 2: Gentle cosine decay
scheduler = CosineAnnealingLR(optimizer, T_max=10, eta_min=1e-6)
# Training: Short duration
epochs = 10 # Fine-tuning is quick
for epoch in range(epochs):
train_one_epoch(model, train_loader, optimizer)
if scheduler:
scheduler.step()
Key Points:
- Much lower LR for pretrained parts (1e-5)
- Higher LR for new/random parts (1e-3)
- Short training (3-10 epochs)
- Scheduling often optional (constant LR works)
- No warmup needed (weights already good)
Large Batch Training (Batch Size > 1024)
Standard Recipe:
# Linear LR scaling rule: LR scales with batch size
base_lr = 0.1 # For batch size 256
batch_size = 2048
scaled_lr = base_lr * (batch_size / 256) # 0.8 for batch 2048
# Optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr, momentum=0.9)
# Scheduler: MUST include warmup (critical for large batch)
warmup_epochs = 5
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [warmup_epochs])
# Training
epochs = 100
for epoch in range(epochs):
train_one_epoch(model, train_loader, optimizer)
scheduler.step()
Key Points:
- Scale LR linearly with batch size (LR = base_lr * batch_size / base_batch_size)
- Warmup is MANDATORY for large batch (5-10 epochs minimum)
- Longer warmup for very large batches (>4096: use 10-20 epochs)
Why Warmup Critical for Large Batch:
- Large batch = larger effective LR
- High effective LR at start causes instability
- Warmup prevents divergence
Modern Defaults by Domain (2025)
| Domain | Optimizer | Scheduler | Warmup | Epochs |
|---|---|---|---|---|
| Vision (CNN) | SGD (0.9) | Cosine or MultiStep | Optional (5) | 100-200 |
| Vision (ViT) | AdamW | Cosine | MANDATORY (10-20) | 300 |
| NLP (BERT/GPT) | AdamW | Linear | MANDATORY (10%) | Varies |
| Detection | SGD | MultiStep | Optional | 26-300 |
| Segmentation | SGD | Polynomial | Optional | 100 |
| Fast/OneCycle | SGD | OneCycle | Built-in | 10-30 |
| Fine-tuning | AdamW | Constant/Cosine | No | 3-10 |
| Large Batch | SGD | Cosine | MANDATORY (5-20) | 100-200 |
9. Debugging Scheduler Issues
Issue: Training Unstable / Loss Spikes
Symptoms:
- Loss increases suddenly during training
- NaN or Inf loss
- Training was stable, then becomes unstable
Likely Causes:
No warmup (transformers, large models)
- Solution: Add 5-10 epoch warmup
LR too high at start
- Solution: Lower initial LR or extend warmup
LR drop too sharp (MultiStepLR)
- Solution: Use gentler scheduler (Cosine) or smaller gamma
Debugging Steps:
# 1. Print LR every epoch
for epoch in range(epochs):
current_lr = optimizer.param_groups[0]['lr']
print(f"Epoch {epoch}: LR = {current_lr:.6e}")
# 2. Check if loss spike correlates with LR change
loss = train_one_epoch(model, train_loader, optimizer)
print(f" Loss = {loss:.4f}")
scheduler.step()
# 3. Plot LR and loss together
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(lr_history)
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.subplot(1, 2, 2)
plt.plot(loss_history)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
Solutions:
- Add/extend warmup: LinearLR(optimizer, start_factor=0.01, total_iters=10)
- Lower initial LR: lr = 0.01 instead of lr = 0.1
- Gentler scheduler: CosineAnnealingLR instead of MultiStepLR
- Gradient clipping: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
Issue: Training Plateaus Too Early
Symptoms:
- Loss stops decreasing after 20-30 epochs
- Validation accuracy flat
- Training seems stuck
Likely Causes:
Not using scheduler (constant LR too high for current regime)
- Solution: Add scheduler (CosineAnnealing or ReduceLROnPlateau)
Scheduler reducing LR too early
- Solution: Push back milestones or increase patience
LR already too low
- Solution: Check current LR, may need to restart with higher initial LR
Debugging Steps:
# Check current LR
current_lr = optimizer.param_groups[0]['lr']
print(f"Current LR: {current_lr:.6e}")
# If LR very low (<1e-6), plateau might be due to other issues (architecture, data, etc.)
# If LR still high (>1e-3), should reduce LR to break plateau
Solutions:
- Add ReduceLROnPlateau (automatically reduces LR when a plateau is detected):
  scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10)
- Manual LR reduction (if at epoch 30 and plateaued, reduce LR by 10x now):
  for param_group in optimizer.param_groups:
      param_group['lr'] *= 0.1
- Use a scheduler from the start next time:
  scheduler = CosineAnnealingLR(optimizer, T_max=100)
Issue: Poor Final Performance (Train > Val Gap)
Symptoms:
- Training accuracy high (95%), validation lower (88%)
- Model overfitting
- Test performance disappointing
Likely Causes (Scheduling Related):
LR not low enough at end
- Solution: Lower eta_min or extend training
Not using scheduler (constant LR doesn't fine-tune)
- Solution: Add scheduler to reduce LR in late training
Scheduler ending too early
- Solution: Extend training or adjust T_max
Debugging Steps:
# Check final LR
final_lr = optimizer.param_groups[0]['lr']
print(f"Final LR: {final_lr:.6e}")
# Final LR should be very low (1e-5 to 1e-6)
# If final LR still high (>1e-3), model didn't fine-tune properly
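Before (or instead of) rerunning training, you can check the end-of-schedule LR with a dry run on a throwaway optimizer; a sketch assuming a 100-epoch cosine schedule:

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR

# Dry run on a dummy parameter -- no model or data involved
dummy_opt = SGD([torch.zeros(1, requires_grad=True)], lr=0.1)
dummy_sched = CosineAnnealingLR(dummy_opt, T_max=100, eta_min=1e-6)

for _ in range(100):
    dummy_opt.step()      # keeps the "scheduler before optimizer" warning quiet
    dummy_sched.step()

print(f"LR after 100 epochs: {dummy_opt.param_groups[0]['lr']:.2e}")  # should be ~eta_min
```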
Solutions:
- Lower eta_min: `CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)`
- Extend training: train for more epochs so the LR can decay further
- Add late-stage fine-tuning:
  # After main training, do 10 more epochs with very low LR
  for param_group in optimizer.param_groups:
      param_group['lr'] = 1e-5
  for epoch in range(10):
      train_one_epoch(model, train_loader, optimizer)
Note: If train-val gap large, may also need regularization (not scheduling issue)
Issue: LR Decays Too Fast
Symptoms:
- LR reaches minimum in first few epochs
- Training very slow after initial epochs
- Looks like constant very low LR
Likely Causes:
scheduler.step() called every batch instead of epoch
- Solution: Move scheduler.step() outside batch loop
T_max too small (e.g., T_max=10 but training for 100 epochs)
- Solution: Set T_max = total_epochs
Using OneCycle unintentionally
- Solution: Verify scheduler type
Debugging Steps:
# Print LR first few epochs
for epoch in range(10):
print(f"Epoch {epoch}: LR = {optimizer.param_groups[0]['lr']:.6e}")
for batch in train_loader:
train_step(model, batch, optimizer)
# scheduler.step() # ❌ If this is here, that's the bug
scheduler.step() # ✅ Should be here
Solutions:
- Move scheduler.step() to the correct location (after each epoch, not after each batch)
- Fix T_max: `T_max = total_epochs` or `T_max = total_epochs - warmup_epochs`
- Verify scheduler type: `print(type(scheduler))`
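Alternatively, if you prefer the smoother per-batch stepping, keep scheduler.step() in the inner loop but express T_max in batches; a sketch assuming the same `train_loader`, `optimizer`, `epochs`, and `train_step` names used in the snippets above:

```python
from torch.optim.lr_scheduler import CosineAnnealingLR

# Per-batch stepping is fine *if* T_max counts batches rather than epochs
steps_per_epoch = len(train_loader)
scheduler = CosineAnnealingLR(optimizer, T_max=epochs * steps_per_epoch, eta_min=1e-5)

for epoch in range(epochs):
    for batch in train_loader:
        train_step(model, batch, optimizer)
        scheduler.step()  # matches the batch-level T_max
```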
Issue: OneCycleLR Not Working
Symptoms:
- Training with OneCycle becomes unstable around peak LR
- Loss increases during ramp-up phase
- Worse performance than expected
Likely Causes:
max_lr too high
- Solution: Run LR finder, use lower max_lr
scheduler.step() placement wrong (should be per batch)
- Solution: Call scheduler.step() every batch
Not tuning max_lr
- Solution: Use LR finder to find optimal, use 5-10x as max_lr
Debugging Steps:
# Plot LR schedule
lrs = []
for epoch in range(epochs):
for batch in train_loader:
lrs.append(optimizer.param_groups[0]['lr'])
scheduler.step()
plt.plot(lrs)
plt.xlabel('Batch')
plt.ylabel('Learning Rate')
plt.title('OneCycle LR Schedule')
plt.show()
# Should see: ramp up to max_lr, then ramp down
# If doesn't look like that, scheduler.step() placement wrong
Solutions:
- Run LR finder first:
  optimal_lr = find_lr(model, train_loader, optimizer, loss_fn, device)
  max_lr = optimal_lr * 10  # Or try 5x, 3x if 10x is unstable
- Lower max_lr manually:
  # If max_lr=0.1 is unstable, try 0.03 or 0.01
  scheduler = OneCycleLR(optimizer, max_lr=0.03, ...)
- Verify scheduler.step() is called every batch:
  for epoch in range(epochs):
      for batch in train_loader:
          train_step(model, batch, optimizer)
          scheduler.step()  # ✅ Every batch
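A sketch of a complete OneCycleLR setup where the cycle length is derived from the loader; it assumes the same `model`, `train_loader`, `epochs`, and `train_step` helper as the snippets above, and `max_lr=0.03` is only a stand-in for your LR-finder result:

```python
from torch.optim import SGD
from torch.optim.lr_scheduler import OneCycleLR

optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)  # initial LR is overridden by the cycle
scheduler = OneCycleLR(
    optimizer,
    max_lr=0.03,                        # illustrative; set from the LR finder (e.g. optimal_lr * 10)
    epochs=epochs,
    steps_per_epoch=len(train_loader),  # cycle spans exactly the whole run
    pct_start=0.3,                      # 30% of steps ramp up, the rest ramp down (default)
)

for epoch in range(epochs):
    for batch in train_loader:
        train_step(model, batch, optimizer)
        scheduler.step()                # OneCycleLR steps every batch
```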
Issue: Warmup Not Working
Symptoms:
- Training still unstable in first few epochs despite warmup
- Loss spikes even with warmup
- NaN loss at start
Likely Causes:
Warmup too short (need longer ramp-up)
- Solution: Extend warmup from 5 to 10-20 epochs
start_factor too high (not starting low enough)
- Solution: Use start_factor=0.001 instead of 0.01
Warmup not actually being used (SequentialLR bug)
- Solution: Verify warmup scheduler is active early
Debugging Steps:
# Print LR first 10 epochs
for epoch in range(10):
current_lr = optimizer.param_groups[0]['lr']
print(f"Epoch {epoch}: LR = {current_lr:.6e}")
# Should see gradual increase from low to high
# If jumps immediately to high, warmup not working
train_one_epoch(model, train_loader, optimizer)
scheduler.step()
Solutions:
- Extend warmup:
  warmup = LinearLR(optimizer, start_factor=0.01, total_iters=20)  # 20 epochs
- Lower start_factor:
  warmup = LinearLR(optimizer, start_factor=0.001, total_iters=5)  # Start at 0.1% of base LR
- Verify the SequentialLR milestone:
  # Milestone should match warmup duration
  scheduler = SequentialLR(optimizer, [warmup, cosine], milestones=[20])
- Add gradient clipping as an additional safeguard:
  torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
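To check the warmup shape without spending any training time, you can also dry-run the combined scheduler on a throwaway optimizer; a sketch using the warmup + cosine combination from above:

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

dummy_opt = SGD([torch.zeros(1, requires_grad=True)], lr=0.1)
warmup = LinearLR(dummy_opt, start_factor=0.001, total_iters=5)
cosine = CosineAnnealingLR(dummy_opt, T_max=95, eta_min=1e-5)
dummy_sched = SequentialLR(dummy_opt, [warmup, cosine], milestones=[5])

for epoch in range(10):
    print(f"Epoch {epoch}: LR = {dummy_opt.param_groups[0]['lr']:.6e}")
    dummy_opt.step()
    dummy_sched.step()
# Expect ~1e-4 at epoch 0, rising to ~1e-1 by epoch 5, then slowly decaying
```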
Issue: ReduceLROnPlateau Never Reduces LR
Symptoms:
- Using ReduceLROnPlateau for 50+ epochs
- Validation loss clearly plateaued
- Learning rate never reduces
Debugging Steps:
1. Verify metric is being passed:
val_loss = validate(model, val_loader)
print(f"Epoch {epoch}: val_loss = {val_loss:.6f}") # Print metric
scheduler.step(val_loss) # Ensure passing metric
2. Check mode is correct:
# For loss (want to minimize):
scheduler = ReduceLROnPlateau(optimizer, mode='min')
# For accuracy (want to maximize):
scheduler = ReduceLROnPlateau(optimizer, mode='max')
With the wrong mode the scheduler reads the metric backwards (e.g. mode='max' on a loss treats increases as improvement), so its reductions no longer track real plateaus.
3. Check the threshold isn't too lenient:
# Default threshold=1e-4: a relative improvement of 0.01% counts as "better"
# Example: val_loss 0.5000 → 0.4999 is a 0.02% improvement, so it counts,
# num_bad_epochs resets, and the LR never gets reduced
# Solution: Raise threshold so tiny fluctuations don't reset patience
scheduler = ReduceLROnPlateau(optimizer, threshold=1e-3)  # Require ~0.1% improvement
4. Enable verbose logging:
scheduler = ReduceLROnPlateau(optimizer, verbose=True)
# Prints "Epoch 00042: reducing learning rate of group 0 to 1.0000e-04" when it reduces
# Note: verbose is deprecated in recent PyTorch releases; printing
# optimizer.param_groups[0]['lr'] each epoch works everywhere
5. Verify plateau is real:
# Plot validation loss over time
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot(val_losses)
plt.xlabel('Epoch')
plt.ylabel('Validation Loss')
plt.title('Validation Loss Over Time')
plt.grid(True, alpha=0.3)
plt.show()
# Check: Is loss truly flat, or still slowly improving?
# Tiny improvements (0.4500 → 0.4499) count as progress
6. Check cooldown isn't preventing reduction:
# Default cooldown=0, but if set higher, prevents reduction after recent reduction
scheduler = ReduceLROnPlateau(optimizer, cooldown=0) # No cooldown
Common Causes Table:
| Problem | Symptom | Solution |
|---|---|---|
| Not passing metric | Error or no reduction | scheduler.step(val_loss) |
| Wrong mode | Never reduces | mode='min' for loss, mode='max' for accuracy |
| Threshold too lenient | Tiny improvements keep resetting patience | Raise threshold (e.g. 1e-3) |
| Metric still improving | Not actually plateaued | Increase patience or accept slow progress |
| Cooldown active | Reducing but waiting | Set cooldown=0 |
| Min_lr reached | Can't reduce further | Check current LR, may be at min_lr |
Example Fix:
scheduler = ReduceLROnPlateau(
optimizer,
mode='min', # For loss minimization
factor=0.1, # Reduce by 10x
patience=10, # Wait 10 epochs
threshold=1e-3, # Ignore improvements smaller than ~0.1% so they don't reset patience
threshold_mode='rel',
cooldown=0, # No cooldown period
min_lr=1e-6, # Minimum LR allowed
verbose=True # Print when reducing
)
# Training loop
for epoch in range(epochs):
train_loss = train_one_epoch(model, train_loader, optimizer)
val_loss = validate(model, val_loader)
print(f"Epoch {epoch}: train_loss={train_loss:.4f}, val_loss={val_loss:.4f}")
scheduler.step(val_loss) # Pass validation loss
# Print current LR
current_lr = optimizer.param_groups[0]['lr']
print(f" Current LR: {current_lr:.6e}")
Advanced Debugging:
If still not reducing, manually check scheduler logic:
# Get scheduler state
print(f"Best metric so far: {scheduler.best}")
print(f"Epochs without improvement: {scheduler.num_bad_epochs}")
print(f"Patience: {scheduler.patience}")
# If num_bad_epochs < patience, it's still waiting
# If num_bad_epochs >= patience, should reduce next step
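You can also replay a logged validation-loss history through a fresh ReduceLROnPlateau on a throwaway optimizer to see exactly when (or whether) it would fire; the synthetic `val_losses` list below is a stand-in for your own history:

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Synthetic stand-in for your logged validation losses
val_losses = [0.90, 0.70, 0.60, 0.55] + [0.52] * 30

dummy_opt = SGD([torch.zeros(1, requires_grad=True)], lr=0.1)
dummy_sched = ReduceLROnPlateau(dummy_opt, mode='min', factor=0.1,
                                patience=10, threshold=1e-3)

for epoch, loss in enumerate(val_losses):
    dummy_sched.step(loss)
    print(f"Epoch {epoch}: val_loss={loss:.4f}, "
          f"LR={dummy_opt.param_groups[0]['lr']:.1e}, "
          f"bad_epochs={dummy_sched.num_bad_epochs}")
# The LR should drop 10x once the flat stretch of 0.52 outlasts patience
```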
10. Rationalization Table
When users rationalize away proper LR scheduling, counter with:
| Rationalization | Reality | Counter-Argument |
|---|---|---|
| "Constant LR is simpler" | Leaves 2-5% performance on table | "One line of code for 2-5% better accuracy is excellent ROI" |
| "Warmup seems optional" | MANDATORY for transformers | "Without warmup, transformers diverge or train unstably" |
| "I don't know which scheduler to use" | CosineAnnealing is great default | "CosineAnnealingLR works well for most cases, zero tuning" |
| "Scheduling is too complicated" | Modern frameworks make it trivial | "scheduler = CosineAnnealingLR(optimizer, T_max=100) - that's it" |
| "Papers don't mention scheduling" | They do, in implementation details | "Check paper's code repo or appendix - scheduling always there" |
| "My model is too small to need scheduling" | Even small models benefit | "Scheduling helps all models converge to better minima" |
| "Just use Adam, it adapts automatically" | Adam still benefits from scheduling | "SOTA transformers use AdamW + scheduling (BERT, GPT, ViT)" |
| "I'll tune it later" | Scheduling should be from start | "Scheduling is core hyperparameter, not optional add-on" |
| "OneCycle always best" | Only for specific scenarios | "OneCycle great for fast training (<30 epochs), not long training" |
| "I don't have time to run LR finder" | Takes 5 minutes, saves hours | "LR finder runs in minutes, prevents wasted training runs" |
| "Warmup adds complexity" | One extra line of code | "SequentialLR([warmup, cosine], [5]) - that's the complexity" |
| "My training is already good enough" | Could be 2-5% better | "SOTA papers all use scheduling - it's standard practice" |
| "Reducing LR will slow training" | Reduces LR when high LR hurts | "High LR early (fast), low LR late (fine-tune) = best of both" |
| "I don't know what T_max to use" | T_max = total_epochs | "Just set T_max to your total training epochs" |
11. Red Flags Checklist
Watch for these warning signs that indicate scheduling problems:
Critical Red Flags (Fix Immediately):
🚨 Training transformer without warmup
- Impact: High risk of divergence, NaN loss
- Fix: Add 5-10 epoch warmup immediately
🚨 Loss NaN or exploding in first few epochs
- Impact: Training failed
- Fix: Add warmup, lower initial LR, gradient clipping
🚨 scheduler.step() called every batch for Cosine/Step schedulers
- Impact: LR decays 100x too fast
- Fix: Move scheduler.step() outside batch loop
🚨 Not passing metric to ReduceLROnPlateau
- Impact: Scheduler doesn't work at all
- Fix: scheduler.step(val_loss)
Important Red Flags (Should Fix):
⚠️ Training >30 epochs without scheduler
- Impact: Leaving 2-5% performance on table
- Fix: Add CosineAnnealingLR or MultiStepLR
⚠️ OneCycle with random max_lr (not tuned)
- Impact: Unstable training or suboptimal performance
- Fix: Run LR finder, tune max_lr
⚠️ Large batch (>512) without warmup
- Impact: Training instability
- Fix: Add 5-10 epoch warmup
⚠️ Vision transformer with constant LR
- Impact: Poor convergence, unstable training
- Fix: Add warmup + cosine schedule
⚠️ Training plateaus but no scheduler to reduce LR
- Impact: Stuck at local minimum
- Fix: Add ReduceLROnPlateau or manually reduce LR
Minor Red Flags (Consider Fixing):
⚡ CNN training without any scheduling
- Impact: Missing 1-3% accuracy
- Fix: Add MultiStepLR or CosineAnnealingLR
⚡ Not monitoring LR during training
- Impact: Hard to debug schedule issues
- Fix: Log LR every epoch
⚡ T_max doesn't match training duration
- Impact: Schedule ends too early/late
- Fix: Set T_max = total_epochs - warmup_epochs
⚡ Using same LR for pretrained and new layers (fine-tuning)
- Impact: Suboptimal fine-tuning
- Fix: Use different LRs for param groups
⚡ Not validating schedule before full training
- Impact: Risk wasting compute on wrong schedule
- Fix: Plot schedule dry-run before training
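A sketch of that dry-run: build the exact scheduler you plan to train with on a dummy optimizer, advance it for the full run, and plot the curve before spending compute. The warmup + cosine combination here is just an example schedule:

```python
import matplotlib.pyplot as plt
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

epochs, warmup_epochs = 100, 5
dummy_opt = SGD([torch.zeros(1, requires_grad=True)], lr=0.1)
warmup = LinearLR(dummy_opt, start_factor=0.01, total_iters=warmup_epochs)
cosine = CosineAnnealingLR(dummy_opt, T_max=epochs - warmup_epochs, eta_min=1e-5)
dummy_sched = SequentialLR(dummy_opt, [warmup, cosine], milestones=[warmup_epochs])

lrs = []
for _ in range(epochs):
    lrs.append(dummy_opt.param_groups[0]['lr'])
    dummy_opt.step()
    dummy_sched.step()

plt.plot(lrs)
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.title('Planned LR Schedule (dry run)')
plt.show()
```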
12. Quick Reference
Scheduler Selection Cheatsheet
Q: What should I use for...
Vision CNN (100 epochs)?
→ CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-5)
Vision Transformer?
→ LinearLR(warmup 5) + CosineAnnealingLR(T_max=95) [WARMUP MANDATORY]
NLP Transformer?
→ LinearLR(warmup 10%) + LinearLR(decay) [WARMUP MANDATORY]
Fast training (<30 epochs)?
→ OneCycleLR(max_lr=tune_with_LR_finder)
Don't know optimal schedule?
→ ReduceLROnPlateau(mode='min', patience=10)
Training plateaued?
→ Add ReduceLROnPlateau or manually reduce LR by 10x now
Following paper recipe?
→ Use paper's exact schedule (usually MultiStepLR)
Fine-tuning pretrained model?
→ Constant low LR (1e-5) or gentle CosineAnnealing (see the param-group sketch below)
Large batch (>512)?
→ LinearLR(warmup 5-10) + CosineAnnealingLR [WARMUP MANDATORY]
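For the fine-tuning entry above (and the param-group red flag in section 11), a sketch of discriminative learning rates; the `backbone`/`head` split is a hypothetical stand-in for your pretrained and newly added layers:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Tiny stand-in with the same structure as a pretrained backbone + new head
class FineTuneModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = torch.nn.Linear(32, 16)  # pretend this is pretrained
        self.head = torch.nn.Linear(16, 10)      # freshly initialized

model = FineTuneModel()
optimizer = AdamW([
    {"params": model.backbone.parameters(), "lr": 1e-5},  # pretrained: small LR
    {"params": model.head.parameters(),     "lr": 1e-3},  # new layers: larger LR
], weight_decay=0.01)

# Any scheduler scales every param group by the same factor, preserving the ratio
scheduler = CosineAnnealingLR(optimizer, T_max=10, eta_min=1e-6)
```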
Step Placement Quick Reference
# Most schedulers (Step, Cosine, Exponential)
for epoch in range(epochs):
for batch in train_loader:
train_step(...)
scheduler.step() # AFTER epoch
# OneCycleLR (EXCEPTION)
for epoch in range(epochs):
for batch in train_loader:
train_step(...)
scheduler.step() # AFTER each batch
# ReduceLROnPlateau (pass metric)
for epoch in range(epochs):
for batch in train_loader:
train_step(...)
val_loss = validate(...)
scheduler.step(val_loss) # Pass metric
Warmup Quick Reference
# Pattern: Warmup + Cosine (most common)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
# When warmup is MANDATORY:
# ✅ Transformers (ViT, BERT, GPT)
# ✅ Large batch (>512)
# ✅ High initial LR
# ✅ Training from scratch
# When warmup is optional:
# ❌ Fine-tuning
# ❌ Small LR (<1e-4)
# ❌ Small models
LR Finder Quick Reference
# Run LR finder
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
# Find optimal (steepest descent)
optimal_lr = suggest_lr_from_finder(lrs, losses)
# Use cases:
# - Direct use: optimizer = SGD(params, lr=optimal_lr)
# - OneCycle: max_lr = optimal_lr * 10
# - Conservative: base_lr = optimal_lr * 0.1
Summary
Learning rate scheduling is CRITICAL for competitive model performance:
Key Takeaways:
- Scheduling improves final accuracy by 2-5% - not optional for SOTA
- Warmup is MANDATORY for transformers - prevents divergence
- CosineAnnealingLR is best default - works well, zero tuning
- Use LR finder for new problems - finds optimal initial LR in minutes
- OneCycleLR needs max_lr tuning - run LR finder first
- Watch scheduler.step() placement - most per epoch, OneCycle per batch
- Always monitor LR during training - log to console or TensorBoard
- Plot schedule before training - catch mistakes early
Modern Defaults (2025):
- Vision CNNs: SGD + CosineAnnealingLR (optional warmup)
- Vision Transformers: AdamW + Warmup + CosineAnnealingLR (warmup mandatory)
- NLP Transformers: AdamW + Warmup + Linear decay (warmup mandatory)
- Fast Training: SGD + OneCycleLR (tune max_lr with LR finder)
When In Doubt:
- Use CosineAnnealingLR with T_max = total_epochs
- Add 5-epoch warmup for large models
- Run LR finder if unsure about initial LR
- Log LR every epoch to monitor schedule
Learning rate scheduling is one of the highest-ROI hyperparameters - master it for significantly better model performance.