SKILL.md

name: learning-rate-scheduling
description: Learning rate scheduling - warmup, schedulers, decay strategies, modern best practices

Learning Rate Scheduling Skill

When to Use This Skill

Use this skill when:

  • User asks "should I use a learning rate scheduler?"
  • Training plateaus or loss stops improving
  • Training transformers or large models (warmup critical)
  • User wants to implement OneCycleLR or specific scheduler
  • Training is unstable in early epochs
  • User asks "what learning rate should I use?"
  • Deciding between constant LR and scheduled LR
  • User is copying a paper's training recipe
  • Implementing modern training pipelines (vision, NLP, RL)
  • User suggests "just use constant LR" (rationalization)

Do NOT use when:

  • User has specific bugs unrelated to scheduling
  • Only discussing optimizer choice (no schedule questions)
  • Training already working well and no LR questions asked

Core Principles

1. Why Learning Rate Scheduling Matters

Learning rate scheduling is one of the MOST IMPACTFUL hyperparameters:

High LR Early (Exploration):

  • Fast initial progress through parameter space
  • Escape poor local minima
  • Rapid loss reduction in early epochs

Low LR Late (Exploitation):

  • Fine-tune to sharper, better minima
  • Improve generalization (test accuracy)
  • Stable convergence without oscillation

Quantitative Impact:

  • Proper scheduling improves final test accuracy by 2-5% (SIGNIFICANT)
  • Standard practice in all SOTA papers (ResNet, EfficientNet, ViT, BERT, GPT)
  • Not optional for competitive performance

When Constant LR Fails:

  • Can't explore quickly AND converge precisely
  • Either too high (never converges) or too low (too slow)
  • Leaves 2-5% performance on the table
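
A minimal sketch of the exploration-to-exploitation curve (plain cosine decay; the function name and constants are illustrative, not a library API):

import math

def cosine_lr(epoch, total_epochs, base_lr=0.1, eta_min=1e-5):
    # High LR early (exploration), decaying smoothly to low LR late (exploitation)
    progress = epoch / total_epochs
    return eta_min + (base_lr - eta_min) * 0.5 * (1 + math.cos(math.pi * progress))

for epoch in [0, 25, 50, 75, 100]:
    print(epoch, round(cosine_lr(epoch, 100), 4))
# 0 → 0.1 | 25 → 0.0854 | 50 → 0.05 | 75 → 0.0146 | 100 → eta_min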

2. Decision Framework: When to Schedule vs Constant LR

Use Scheduler When:

Long training (>30 epochs)

  • Scheduling essential for multi-stage training
  • Different LR regimes needed across training
  • Example: 90-epoch ImageNet training

Large model on large dataset

  • Training from scratch on ImageNet, COCO, etc.
  • Benefits from exploration → exploitation strategy
  • Example: ResNet-50 on ImageNet

Training plateaus or loss stops improving

  • Current LR too high for current parameter regime
  • Reducing LR breaks plateau
  • Example: Validation loss stuck for 10+ epochs

Following established training recipes

  • Papers publish schedules for reproducibility
  • Vision models typically use MultiStepLR or Cosine
  • Example: ResNet paper specifies drop at epochs 30, 60, 90

Want competitive SOTA performance

  • Squeezing out last 2-5% accuracy
  • Required for benchmarks and competitions
  • Example: Targeting SOTA on CIFAR-10

Maybe Don't Need Scheduler When:

Very short training (<10 epochs)

  • Not enough time for multi-stage scheduling
  • Constant LR or simple linear decay sufficient
  • Example: Quick fine-tuning for 5 epochs

OneCycle is the strategy itself

  • OneCycleLR IS the training strategy (not separate)
  • Don't combine OneCycle with another scheduler
  • Example: FastAI-style 20-epoch training

Hyperparameter search phase

  • Constant LR simpler to compare across runs
  • Add scheduling after finding good architecture/optimizer
  • Example: Running 50 architecture trials

Transfer learning fine-tuning

  • Small number of epochs on pretrained model
  • Constant small LR often sufficient
  • Example: Fine-tuning BERT for 3 epochs

Reinforcement learning

  • RL typically uses constant LR (exploration/exploitation balance different)
  • Some exceptions (PPO sometimes uses linear decay)
  • Example: DQN, A3C usually constant LR

Default Recommendation:

  • For >30 epoch training: USE A SCHEDULER (typically CosineAnnealingLR)
  • For <10 epoch training: Constant LR usually fine
  • For 10-30 epochs: Try both; the scheduler usually wins


3. Major Scheduler Types - Complete Comparison

StepLR / MultiStepLR (Classic Vision)

Use When:

  • Training CNNs (ResNet, VGG, etc.)
  • Following established recipe from paper
  • Want simple, interpretable schedule

How It Works:

  • Drop LR by constant factor at specific epochs
  • StepLR: every N epochs
  • MultiStepLR: at specified milestone epochs

Implementation:

# StepLR: Drop every 30 epochs
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer,
    step_size=30,    # Drop every 30 epochs
    gamma=0.1        # Multiply LR by 0.1 (10x reduction)
)

# MultiStepLR: Drop at specific milestones (more control)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer,
    milestones=[30, 60, 90],  # Drop at these epochs
    gamma=0.1                  # Multiply by 0.1 each time
)

# Training loop
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()  # Call AFTER each epoch

Example Schedule (initial_lr=0.1):

  • Epochs 0-29: LR = 0.1
  • Epochs 30-59: LR = 0.01 (dropped by 10x)
  • Epochs 60-89: LR = 0.001 (dropped by 10x again)
  • Epochs 90-99: LR = 0.0001

Pros:

  • Simple and interpretable
  • Well-established in papers (easy to reproduce)
  • Works well for vision models

Cons:

  • Requires manual milestone selection
  • Sharp LR drops can cause temporary instability
  • Need to know total training epochs in advance

Best For: Classical CNN training (ResNet, VGG) following paper recipes


CosineAnnealingLR (Modern Default)

Use When:

  • Training modern vision models (ViT, EfficientNet)
  • Want smooth decay without manual milestones
  • Don't want to tune milestone positions

How It Works:

  • Smooth cosine curve from initial_lr to eta_min
  • Gradual decay, no sharp drops
  • LR = eta_min + (initial_lr - eta_min) * (1 + cos(π * epoch / T_max)) / 2

Implementation:

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=100,      # Total epochs (LR reaches eta_min at epoch 100)
    eta_min=1e-5    # Minimum LR (default: 0)
)

# Training loop
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()  # Call AFTER each epoch

Example Schedule (initial_lr=0.1, eta_min=1e-5):

  • Epoch 0: LR = 0.1
  • Epoch 25: LR ≈ 0.085
  • Epoch 50: LR ≈ 0.05
  • Epoch 75: LR ≈ 0.015
  • Epoch 100: LR = 0.00001

(Note the spacing: cosine decay is flattest near the start and end, steepest in the middle.)

Pros:

  • No milestone tuning required
  • Smooth decay (no instability from sharp drops)
  • Widely used in modern papers
  • Works well across many domains

Cons:

  • Must know total epochs in advance
  • Can't adjust schedule during training

Best Practice: ALWAYS COMBINE WITH WARMUP for large models:

# Warmup for 5 epochs, then cosine for 95 epochs
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.01,  # Start at 1% of base LR
    end_factor=1.0,     # Ramp to 100%
    total_iters=5       # Over 5 epochs
)

cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=95,          # 95 epochs after warmup
    eta_min=1e-5
)

scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup, cosine],
    milestones=[5]  # Switch to cosine after 5 epochs
)

Best For: Modern vision models, transformers, default choice for most problems


ReduceLROnPlateau (Adaptive)

Use When:

  • Don't know optimal schedule in advance
  • Want adaptive approach based on validation performance
  • Training plateaus and you want automatic LR reduction

How It Works:

  • Monitors validation metric (loss or accuracy)
  • Reduces LR when metric stops improving
  • Requires passing metric to scheduler.step()

Implementation:

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode='min',          # 'min' for loss, 'max' for accuracy
    factor=0.1,          # Reduce LR by 10x when plateau detected
    patience=10,         # Wait 10 epochs before reducing
    threshold=1e-4,      # Minimum change to count as improvement
    threshold_mode='rel', # 'rel' or 'abs'
    cooldown=0,          # Epochs to wait after LR reduction
    min_lr=1e-6,         # Don't reduce below this
    verbose=True         # Print when LR reduced
)

# Training loop
for epoch in range(100):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    val_loss = validate(model, val_loader)

    # IMPORTANT: Pass validation metric to step()
    scheduler.step(val_loss)  # NOT scheduler.step() alone!

Example Behavior (patience=10, factor=0.1):

  • Epochs 0-30: Val loss improving, LR = 0.001
  • Epochs 31-40: Val loss plateaus at 0.15, patience counting
  • Epoch 41: Plateau detected, LR reduced to 0.0001
  • Epochs 42-60: Val loss improving again with lower LR
  • Epoch 61: Plateau again, LR reduced to 0.00001

Pros:

  • Adaptive - no manual tuning required
  • Based on actual training progress
  • Good for unknown optimal schedule

Cons:

  • Can be too conservative (waits long before reducing)
  • Requires validation metric (can't use train loss alone)
  • May reduce LR too late or not enough

Tuning Tips:

  • Smaller patience (5-10) for faster adaptation
  • Larger patience (10-20) for more conservative
  • Factor of 0.1 (10x) is standard, but 0.5 (2x) more gradual

Best For: Exploratory training, unknown optimal schedule, adaptive pipelines


OneCycleLR (Fast Training)

Use When:

  • Limited compute budget (want fast convergence)
  • Training for relatively few epochs (10-30)
  • Following FastAI-style training
  • Want aggressive schedule for quick results

How It Works:

  • Ramps UP from low LR to max_lr (first 30% by default)
  • Ramps DOWN from max_lr to very low LR (remaining 70%)
  • Steps EVERY BATCH (not every epoch) - CRITICAL DIFFERENCE

Implementation:

scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,                    # Peak learning rate (TUNE THIS!)
    steps_per_epoch=len(train_loader),  # Batches per epoch
    epochs=20,                     # Total epochs
    pct_start=0.3,                 # Ramp up for first 30%
    anneal_strategy='cos',         # 'cos' or 'linear'
    div_factor=25,                 # initial_lr = max_lr / 25
    final_div_factor=10000         # final_lr = initial_lr / 10000 (= max_lr / 250000)
)

# Training loop - NOTE: step() EVERY BATCH
for epoch in range(20):
    for batch in train_loader:
        loss = train_step(model, batch, optimizer)
        optimizer.step()
        scheduler.step()  # CALL EVERY BATCH, NOT EVERY EPOCH!

Example Schedule (max_lr=0.1, 20 epochs, 400 batches/epoch):

  • Batches 0-2400 (epochs 0-6): LR ramps from 0.004 → 0.1
  • Batches 2400-8000 (epochs 6-20): LR anneals from 0.1 → ~4e-7 (initial_lr / final_div_factor)

CRITICAL: Tuning max_lr:

OneCycleLR is VERY sensitive to max_lr choice. Too high = instability.

Method 1 - LR Finder (RECOMMENDED):

# Run LR finder first (see LR Finder section)
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
optimal_lr = suggest_lr_from_finder(lrs, losses)  # e.g., 0.01
max_lr = optimal_lr * 10  # Use 10x optimal as max_lr

Method 2 - Manual tuning:

  • Start with max_lr = 0.1
  • If training unstable, try 0.03, 0.01
  • If training too slow, try 0.3, 1.0

Pros:

  • Very fast convergence (fewer epochs needed)
  • Strong final performance
  • Popular in FastAI community

Cons:

  • Sensitive to max_lr (requires tuning)
  • Steps every batch (easy to mess up)
  • Not ideal for very long training (>50 epochs)

Common Mistakes:

  1. Calling scheduler.step() per epoch instead of per batch
  2. Not tuning max_lr (using default blindly)
  3. Using for very long training (OneCycle designed for shorter cycles)

Best For: FastAI-style training, limited compute budget, 10-30 epoch training


Advanced OneCycleLR Tuning

If lowering max_lr doesn't resolve instability, try these advanced tuning options:

1. Adjust pct_start (warmup fraction):

# Default: 0.3 (30% warmup, 70% cooldown)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader),
                       pct_start=0.3)  # Default

# If unstable at peak: Increase to 0.4 or 0.5 (longer warmup)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader),
                       pct_start=0.5)  # Gentler ramp to peak

# If unstable in cooldown: Decrease to 0.2 (shorter warmup, gentler descent)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader),
                       pct_start=0.2)

2. Adjust div_factor (initial LR):

# Default: 25 (initial_lr = max_lr / 25)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader),
                       div_factor=25)  # Start at 0.004

# If unstable at start: Increase to 50 or 100 (start even lower)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader),
                       div_factor=100)  # Start at 0.001

3. Adjust final_div_factor (final LR):

# Default: 10000 (final_lr = initial_lr / 10000)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader),
                       final_div_factor=10000)  # End at ~4e-7

# If unstable at end: Decrease to 1000 (end at higher LR)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader),
                       final_div_factor=1000)  # End at ~4e-6

4. Add gradient clipping:

# In training loop (train_step here runs the forward pass only and
# returns the loss; backward happens explicitly below)
for batch in train_loader:
    optimizer.zero_grad()
    loss = train_step(model, batch, optimizer)
    loss.backward()

    # Clip gradients to prevent instability
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    optimizer.step()
    scheduler.step()

5. Consider OneCycle may not be right for your problem:

  • Very deep networks (>100 layers): May be too unstable for OneCycle's aggressive schedule
  • Large models (>100M params): May need gentler schedule (Cosine + warmup)
  • Sensitive architectures (some transformers): OneCycle's rapid LR changes can destabilize

Alternative: Use CosineAnnealing + warmup for more stable training:

# More stable alternative to OneCycle
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=15, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])

LinearLR (Warmup)

Use When:

  • Need warmup at training start
  • Ramping up LR gradually over first few epochs
  • Combining with another scheduler (SequentialLR)

How It Works:

  • Linearly interpolates LR from start_factor to end_factor
  • Typically used for warmup: start_factor=0.01, end_factor=1.0

Implementation:

# Standalone linear warmup
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.01,  # Start at 1% of base LR
    end_factor=1.0,     # End at 100% of base LR
    total_iters=5       # Over 5 epochs
)

# More common: Combine with main scheduler
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.01,
    total_iters=5
)

main = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=95
)

scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup, main],
    milestones=[5]  # Switch after 5 epochs
)

# Training loop
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Example Schedule (base_lr=0.1):

  • Epoch 0: LR = 0.001 (1%)
  • Epoch 1: LR = 0.0208 (20.8%)
  • Epoch 2: LR = 0.0406 (40.6%)
  • Epoch 3: LR = 0.0604 (60.4%)
  • Epoch 4: LR = 0.0802 (80.2%)
  • Epoch 5: LR = 0.1 (100%, then switch to main scheduler)

Best For: Warmup phase for transformers and large models


ExponentialLR (Continuous Decay)

Use When:

  • Want smooth, continuous decay
  • Simpler alternative to Cosine
  • Prefer exponential over linear decay

How It Works:

  • Multiply LR by gamma every epoch
  • LR(epoch) = initial_lr * gamma^epoch

Implementation:

scheduler = torch.optim.lr_scheduler.ExponentialLR(
    optimizer,
    gamma=0.95  # Multiply by 0.95 each epoch
)

# Training loop
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Example Schedule (initial_lr=0.1, gamma=0.95):

  • Epoch 0: LR = 0.1
  • Epoch 10: LR = 0.0599
  • Epoch 50: LR = 0.0077
  • Epoch 100: LR ≈ 0.00059

Tuning gamma:

  • Want 10x decay over 100 epochs: gamma = 0.977 (0.1^(1/100))
  • Want 100x decay over 100 epochs: gamma = 0.955 (0.01^(1/100))
  • General formula: gamma = (target_lr / initial_lr)^(1/epochs)
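
The general formula is easy to sanity-check in a few lines (the helper name is illustrative):

def gamma_for_decay(initial_lr, target_lr, epochs):
    # gamma = (target_lr / initial_lr) ** (1 / epochs)
    return (target_lr / initial_lr) ** (1 / epochs)

print(gamma_for_decay(0.1, 0.01, 100))   # ≈ 0.977 (10x decay over 100 epochs)
print(gamma_for_decay(0.1, 0.001, 100))  # ≈ 0.955 (100x decay over 100 epochs)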

Pros:

  • Very smooth decay
  • Simple to implement

Cons:

  • Hard to intuit gamma value for desired final LR
  • Less popular than Cosine (Cosine is better default)

Best For: Cases where you want exponential decay specifically


LambdaLR (Custom Schedules)

Use When:

  • Need custom schedule not provided by standard schedulers
  • Implementing paper-specific schedule
  • Advanced use cases (e.g., transformer inverse sqrt schedule)

How It Works:

  • Provide function that computes LR multiplier for each epoch
  • LR(epoch) = initial_lr * lambda(epoch)

Implementation:

# Example: Warmup then constant
def warmup_lambda(epoch):
    if epoch < 5:
        return (epoch + 1) / 5  # Linear warmup
    else:
        return 1.0  # Constant after warmup

scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=warmup_lambda
)

# Example: Transformer inverse square root schedule
# (step-based: call scheduler.step() every BATCH; with the optimizer's
# base LR set to 1.0, the LR peaks at warmup_steps ** -0.5 ≈ 0.016)
def transformer_schedule(step):
    warmup_steps = 4000
    step = step + 1
    return min(step ** (-0.5), step * warmup_steps ** (-1.5))

scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=transformer_schedule
)

# Example: Polynomial decay
def polynomial_decay(epoch):
    return (1 - epoch / 100) ** 0.9  # Decay to 0 at epoch 100

scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=polynomial_decay
)

Best For: Custom schedules, implementing specific papers, advanced users


4. Warmup Strategies - CRITICAL FOR TRANSFORMERS

Why Warmup is Essential

Problem at Training Start:

  • Weights are randomly initialized
  • Gradients can be very large and unstable
  • BatchNorm statistics are uninitialized
  • High LR can cause immediate divergence (NaN loss)

Solution: Gradual LR Increase

  • Start with very low LR (1% of target)
  • Linearly increase to target LR over first few epochs
  • Allows model to stabilize before aggressive learning

Quantitative Impact:

  • Transformers WITHOUT warmup: Often diverge or train very unstably
  • Transformers WITH warmup: Stable training, better final performance
  • Vision models: Warmup improves stability, sometimes +0.5-1% accuracy

When Warmup is MANDATORY

ALWAYS use warmup when:

Training transformers (ViT, BERT, GPT, T5, etc.)

  • Transformers REQUIRE warmup - not optional
  • Without warmup, training often diverges
  • Standard practice in all transformer papers

Large batch sizes (>512)

  • Large batches → larger effective learning rate
  • Warmup prevents early instability
  • Standard for distributed training

High initial learning rates

  • If starting with LR > 0.001, use warmup
  • Warmup allows higher peak LR safely

Training from scratch (not fine-tuning)

  • Random initialization needs gentle start
  • Fine-tuning can often skip warmup (weights already good)

Usually use warmup when:

✅ Large models (>100M parameters)
✅ Using AdamW optimizer (common with transformers)
✅ Following modern training recipes

May skip warmup when:

❌ Fine-tuning pretrained models (weights already trained)
❌ Small learning rates (< 0.0001)
❌ Small models (<10M parameters)
❌ Established recipe without warmup (e.g., some CNN papers)


Warmup Implementation Patterns

Pattern 1: Linear Warmup + Cosine Decay (Most Common)

import torch.optim.lr_scheduler as lr_scheduler

# Warmup for 5 epochs
warmup = lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.01,  # Start at 1% of base LR
    end_factor=1.0,     # End at 100% of base LR
    total_iters=5       # Over 5 epochs
)

# Cosine decay for remaining 95 epochs
cosine = lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=95,          # 95 epochs after warmup
    eta_min=1e-5       # Final LR
)

# Combine sequentially
scheduler = lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup, cosine],
    milestones=[5]  # Switch to cosine after epoch 5
)

# Training loop
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Schedule Visualization (base_lr=0.001):

  • Epochs 0-4: Linear ramp from 0.00001 → 0.001 (warmup)
  • Epochs 5-99: Cosine decay from 0.001 → 0.00001

Use For: Vision transformers, modern CNNs, most large-scale training


Pattern 2: Linear Warmup + MultiStepLR

# Warmup for 5 epochs
warmup = lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.01,
    total_iters=5
)

# Step decay at 30, 60, 90
steps = lr_scheduler.MultiStepLR(
    optimizer,
    milestones=[30, 60, 90],
    gamma=0.1
)

# Combine
scheduler = lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup, steps],
    milestones=[5]
)

Use For: Classical CNN training with warmup


Pattern 3: Manual Warmup (More Control)

import math

def get_lr_schedule(epoch, total_epochs, base_lr, warmup_epochs=5):
    """
    Custom schedule with warmup and cosine decay.
    """
    if epoch < warmup_epochs:
        # Linear warmup
        return base_lr * (epoch + 1) / warmup_epochs
    else:
        # Cosine decay
        progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
        return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

# Training loop
for epoch in range(100):
    lr = get_lr_schedule(epoch, total_epochs=100, base_lr=0.001)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    train_one_epoch(model, train_loader, optimizer)

Use For: Custom schedules, research, maximum control


Pattern 4: Transformer-Style Warmup (Inverse Square Root)

def transformer_lr_schedule(step, d_model, warmup_steps):
    """
    Transformer schedule from "Attention is All You Need".
    LR increases during warmup, then decreases proportionally to inverse sqrt of step.
    """
    step = step + 1  # 1-indexed
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

# Set the optimizer's base LR to 1.0 so this lambda fully determines the LR
scheduler = lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: transformer_lr_schedule(step, d_model=512, warmup_steps=4000)
)

# Training loop - NOTE: step every BATCH for this schedule
for epoch in range(epochs):
    for batch in train_loader:
        train_step(model, batch, optimizer)
        optimizer.step()
        scheduler.step()  # Step every batch

Use For: Transformer models (BERT, GPT), following original papers


Warmup Duration Guidelines

How many warmup epochs?

  • Transformers: 5-20 epochs (or 5-10% of total training)
  • Vision models: 5-10 epochs
  • Very large models (>1B params): 10-20 epochs
  • Small models: 3-5 epochs

Rule of thumb: 5-10% of total training epochs

Examples:

  • 100-epoch training: 5-10 epoch warmup
  • 20-epoch training: 2-3 epoch warmup
  • 300-epoch training: 15-30 epoch warmup
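
A tiny helper capturing the rule of thumb (the function and its 10% default are illustrative assumptions):

def warmup_epochs(total_epochs, frac=0.1):
    # 5-10% of total training, but at least 1 epoch
    return max(1, round(frac * total_epochs))

print(warmup_epochs(100))  # 10
print(warmup_epochs(20))   # 2
print(warmup_epochs(300))  # 30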

"But My Transformer Trained Fine Without Warmup"

Some users report training transformers without warmup successfully. Here's the reality:

What "fine" actually means:

  • Training didn't diverge (NaN loss) - that's a low bar
  • Got reasonable accuracy - but NOT optimal accuracy
  • One successful run doesn't mean it's optimal or reliable

What you're missing without warmup:

1. Performance gap (1-3% accuracy):

Without warmup: Training works, achieves 85% accuracy
With warmup: Same model achieves 87-88% accuracy

That 2-3% is SIGNIFICANT:

  • Difference between competitive and SOTA
  • Difference between accepted and rejected paper
  • Difference between passing and failing business metrics

2. Training stability:

Without warmup:
- Some runs diverge → need to restart with lower LR
- Sensitive to initialization seed
- Requires careful LR tuning
- Success rate: 60-80% of runs

With warmup:
- Stable training → consistent results
- Robust to initialization
- Wider stable LR range
- Success rate: 95-100% of runs

3. Hyperparameter sensitivity:

Without warmup:

  • Very sensitive to initial LR choice (0.001 works, 0.0015 diverges)
  • Sensitive to batch size
  • Sensitive to optimizer settings

With warmup:

  • More forgiving LR range (0.0005-0.002 all work)
  • Less sensitive to batch size
  • Robust optimizer configuration

Empirical Evidence - Published Papers:

Check transformer papers - ALL use warmup:

| Model | Paper | Warmup |
|-------|-------|--------|
| ViT | Dosovitskiy et al., 2020 | ✅ Linear, 10k steps |
| DeiT | Touvron et al., 2021 | ✅ Linear, 5 epochs |
| Swin | Liu et al., 2021 | ✅ Linear, 20 epochs |
| BERT | Devlin et al., 2018 | ✅ Linear, 10k steps |
| GPT-2 | Radford et al., 2019 | ✅ Linear warmup |
| GPT-3 | Brown et al., 2020 | ✅ Linear warmup |
| T5 | Raffel et al., 2020 | ✅ Inverse sqrt warmup |

Every competitive transformer model uses warmup - there's a reason.

"But I got 85% accuracy without warmup!"

Great! Now try with warmup and see if you get 87-88%. You probably will.

The cost-benefit analysis:

# Cost: two lines of code
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
scheduler = SequentialLR(optimizer, [warmup, main], [5])

# Benefit:
# - 1-3% better accuracy
# - More stable training
# - Higher success rate
# - Wider stable hyperparameter range

Recommendation:

  1. Run ablation study: Train your model with and without warmup
  2. Compare: Final test accuracy, training stability, number of failed runs
  3. You'll find warmup gives better results with minimal complexity

Bottom line: Just because something "works" doesn't mean it's optimal. Warmup is standard practice for transformers because it consistently improves results.


5. LR Finder - Finding Optimal Initial LR

What is LR Finder?

Method from Leslie Smith (2015): Cyclical Learning Rates paper

Core Idea:

  1. Start with very small LR (1e-8)
  2. Gradually increase LR (multiply by ~1.1 each batch)
  3. Train for a few hundred steps, record loss at each LR
  4. Plot loss vs LR
  5. Choose LR where loss decreases fastest (steepest descent)

Why It Works:

  • Too low LR: Loss decreases very slowly
  • Optimal LR: Loss decreases rapidly (steepest slope)
  • Too high LR: Loss plateaus or increases (instability)

Typical Findings:

  • Loss decreases fastest at some LR (e.g., 0.01)
  • Loss starts increasing at higher LR (e.g., 0.1)
  • Choose LR slightly below fastest descent point (e.g., 0.003-0.01)

LR Finder Implementation

import copy

import torch
import matplotlib.pyplot as plt
import numpy as np

def find_lr(model, train_loader, optimizer, loss_fn, device,
            start_lr=1e-8, end_lr=10, num_iter=100, smooth_f=0.05):
    """
    LR Finder: Sweep learning rates and plot loss curve.

    Args:
        model: PyTorch model
        train_loader: Training data loader
        optimizer: Optimizer (will be modified)
        loss_fn: Loss function
        device: Device to train on
        start_lr: Starting learning rate (default: 1e-8)
        end_lr: Ending learning rate (default: 10)
        num_iter: Number of iterations (default: 100)
        smooth_f: Smoothing factor for loss (default: 0.05)

    Returns:
        lrs: List of learning rates tested
        losses: List of losses at each LR
    """
    # Save initial model state to restore later
    # (deepcopy, since state_dict() returns references that training mutates in place)
    model.train()
    initial_state = copy.deepcopy(model.state_dict())

    # Calculate LR multiplier for exponential increase
    lr_mult = (end_lr / start_lr) ** (1 / num_iter)

    lrs = []
    losses = []
    best_loss = float('inf')
    avg_loss = 0

    lr = start_lr

    # Iterate through training data
    iterator = iter(train_loader)
    for iteration in range(num_iter):
        try:
            data, target = next(iterator)
        except StopIteration:
            # Restart iterator if we run out of data
            iterator = iter(train_loader)
            data, target = next(iterator)

        # Set learning rate
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr

        # Forward pass
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = loss_fn(output, target)

        # Compute smoothed loss (exponential moving average)
        if iteration == 0:
            avg_loss = loss.item()
        else:
            avg_loss = smooth_f * loss.item() + (1 - smooth_f) * avg_loss

        # Record
        lrs.append(lr)
        losses.append(avg_loss)

        # Track best loss
        if avg_loss < best_loss:
            best_loss = avg_loss

        # Stop if loss explodes (>4x best loss)
        if avg_loss > 4 * best_loss:
            print(f"Stopping early at iteration {iteration}: loss exploded")
            break

        # Backward pass
        loss.backward()
        optimizer.step()

        # Increase learning rate
        lr *= lr_mult
        if lr > end_lr:
            break

    # Restore model to initial state
    model.load_state_dict(initial_state)

    # Plot results
    plt.figure(figsize=(10, 6))
    plt.plot(lrs, losses)
    plt.xscale('log')
    plt.xlabel('Learning Rate (log scale)')
    plt.ylabel('Loss')
    plt.title('LR Finder')
    plt.grid(True, alpha=0.3)

    # Mark suggested LR (a few iterations before the loss minimum)
    min_loss_idx = np.argmin(losses)
    suggested_lr = lrs[max(0, min_loss_idx - 5)]  # A bit before minimum
    plt.axvline(suggested_lr, color='red', linestyle='--',
                label=f'Suggested LR: {suggested_lr:.2e}')
    plt.legend()
    plt.show()

    print(f"\nLR Finder Results:")
    print(f"  Minimum loss at LR: {lrs[min_loss_idx]:.2e}")
    print(f"  Suggested starting LR: {suggested_lr:.2e}")
    print(f"  (Choose LR where loss decreases fastest, before minimum)")

    return lrs, losses


def suggest_lr_from_finder(lrs, losses):
    """
    Suggest optimal learning rate from LR finder results.

    Strategy: Find LR where loss gradient is steepest (fastest decrease).
    """
    # Compute gradient of loss w.r.t. log(LR)
    log_lrs = np.log10(lrs)
    gradients = np.gradient(losses, log_lrs)

    # Find steepest descent (most negative gradient)
    steepest_idx = np.argmin(gradients)

    # Suggested LR is at steepest point or slightly before
    suggested_lr = lrs[steepest_idx]

    return suggested_lr

Using LR Finder

Basic Usage:

# Setup model, optimizer, loss
model = YourModel().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # LR will be overridden
loss_fn = torch.nn.CrossEntropyLoss()

# Run LR finder
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)

# Manually inspect plot and choose LR
# Look for: steepest descent point (fastest loss decrease)
# Typically: 10x lower than loss minimum

# Example: If minimum is at 0.1, choose 0.01 as starting LR
base_lr = 0.01  # Based on plot inspection

Automated LR Selection:

# Run LR finder
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)

# Get suggested LR
suggested_lr = suggest_lr_from_finder(lrs, losses)

# Use suggested LR
optimizer = torch.optim.SGD(model.parameters(), lr=suggested_lr)

Using with OneCycleLR:

# Find optimal LR
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
optimal_lr = suggest_lr_from_finder(lrs, losses)  # e.g., 0.01

# OneCycleLR: Use 5-10x optimal as max_lr
max_lr = optimal_lr * 10  # e.g., 0.1

scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=max_lr,
    steps_per_epoch=len(train_loader),
    epochs=20
)

Interpreting LR Finder Results

Typical Plot Patterns:

Loss
|
|         X  <-- Loss explodes (LR too high)
|        X
|       X
|      X     <-- Loss minimum (still too high)
|     X
|    X       <-- CHOOSE HERE (steepest descent)
|   X
|  X
| X
|X___________
  1e-8  1e-4  1e-2  0.1  1.0  10
              Learning Rate

How to Choose:

  1. Steepest Descent (BEST):

    • Find where loss decreases fastest (steepest downward slope)
    • This is optimal LR for rapid convergence
    • Example: If steepest at 0.01, choose 0.01
  2. Before Minimum (SAFE):

    • Find minimum loss LR (e.g., 0.1)
    • Choose 10x lower (e.g., 0.01)
    • More conservative, safer choice
  3. Avoid:

    • Don't choose minimum itself (often too high)
    • Don't choose where loss is flat (too low, slow progress)
    • Don't choose where loss increases (way too high)

Guidelines:

  • For SGD: Choose at steepest descent
  • For Adam: Choose 10x below steepest (Adam more sensitive)
  • For OneCycle: Use steepest as optimal, 5-10x as max_lr
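
These rules of thumb map directly onto the finder output; the scaling factors below just restate the list above (heuristics, not part of the finder):

# Using find_lr / suggest_lr_from_finder defined earlier
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
steepest = suggest_lr_from_finder(lrs, losses)

sgd_lr = steepest                 # SGD: steepest-descent LR directly
adam_lr = steepest / 10           # Adam: more sensitive, back off 10x
onecycle_max_lr = steepest * 10   # OneCycle: steepest as optimal, 5-10x as max_lr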

When to Use LR Finder

Use LR Finder When:

✅ Starting new project (unknown optimal LR)
✅ New architecture or dataset
✅ Tuning OneCycleLR (finding max_lr)
✅ Transitioning between optimizers
✅ Having training instability issues

Can Skip When:

❌ Following established paper recipe (LR already known)
❌ Fine-tuning (small LR like 1e-5 typically works)
❌ Very constrained time/resources
❌ Using adaptive methods (ReduceLROnPlateau)

Best Practice:

  • Run LR finder once at project start
  • Use found LR for all subsequent runs
  • Re-run if changing optimizer, architecture, or batch size significantly

6. Scheduler Selection Guide

Selection Flowchart

1. What's your training duration?

  • <10 epochs: Constant LR or simple linear decay
  • 10-30 epochs: OneCycleLR (fast) or CosineAnnealingLR
  • >30 epochs: CosineAnnealingLR or MultiStepLR

2. What's your model type?

  • Transformer (ViT, BERT, GPT): CosineAnnealing + WARMUP (mandatory)
  • CNN (ResNet, EfficientNet): MultiStepLR or CosineAnnealing + optional warmup
  • Small model: Simpler schedulers (StepLR) or constant LR

3. Do you know optimal schedule?

  • Yes (from paper): Use paper's schedule (MultiStepLR usually)
  • No (exploring): ReduceLROnPlateau or CosineAnnealing
  • Want fast results: OneCycleLR + LR finder

4. What's your compute budget?

  • High budget (100+ epochs): CosineAnnealing or MultiStepLR
  • Low budget (10-20 epochs): OneCycleLR
  • Adaptive budget: ReduceLROnPlateau (stops when plateau)
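
The flowchart condenses into a few lines of code (a heuristic sketch; the function name and return strings are illustrative):

def pick_scheduler(total_epochs, is_transformer=False, have_paper_recipe=False):
    if have_paper_recipe:
        return "paper's schedule (often MultiStepLR)"
    if total_epochs < 10:
        return "constant LR or simple linear decay"
    if is_transformer:
        return "LinearLR warmup + CosineAnnealingLR (warmup mandatory)"
    if total_epochs <= 30:
        return "OneCycleLR (tune max_lr with the LR finder)"
    return "CosineAnnealingLR (+ optional warmup), or MultiStepLR"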

Paper Recipe vs Modern Best Practices

If goal is EXACT REPRODUCTION:

Use paper's exact schedule (down to every detail):

# Example: Reproducing ResNet paper (He et al., 2015)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)
# No warmup (paper didn't use it)
# Train for 100 epochs

Rationale:

  • Reproduce results exactly
  • Enable apples-to-apples comparison
  • Validate paper's claims
  • Establish baseline before improvements

If goal is BEST PERFORMANCE:

Use modern recipe (benefit from years of community learning):

# Modern equivalent: ResNet with modern practices
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
# Train for 100 epochs

Rationale:

  • Typically +0.5-2% better accuracy than original paper
  • More stable training
  • Reflects 5-10 years of community improvements
  • SOTA competitive performance

Evolution of LR Scheduling Practices:

Early Deep Learning (2012-2016):

  • Scheduler: StepLR with manual milestones
  • Warmup: Not used (not yet discovered)
  • Optimizer: SGD with momentum
  • Examples: AlexNet, VGG, ResNet, Inception

Mid Period (2017-2019):

  • Scheduler: CosineAnnealing introduced, OneCycleLR popular
  • Warmup: Starting to be used for large batch training
  • Optimizer: SGD still dominant, Adam increasingly common
  • Examples: ResNeXt, DenseNet, MobileNet

Modern Era (2020-2025):

  • Scheduler: CosineAnnealing default, OneCycle for fast training
  • Warmup: Standard practice (mandatory for transformers)
  • Optimizer: AdamW increasingly preferred for transformers
  • Examples: ViT, EfficientNet, ConvNeXt, Swin, DeiT

Practical Workflow:

Step 1: Reproduce paper recipe

# Use exact paper settings
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)
# Should match paper's reported accuracy (e.g., 76.5%)

Step 2: Validate reproduction

  • If you get 76.5% (matches paper): ✅ Reproduction successful
  • If you get 74% (2% worse): ❌ Implementation bug, fix first
  • If you get 78% (2% better): ✅ Great! Proceed to modern recipe

Step 3: Try modern recipe

# Add warmup + cosine
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
# Expect +0.5-2% improvement (e.g., 77-78.5%)

Step 4: Compare results

| Version | Accuracy | Notes |
|---------|----------|-------|
| Paper recipe | 76.5% | Baseline (reproduces paper) |
| Modern recipe | 78.0% | +1.5% from warmup + cosine |

When to Use Which:

Use Paper Recipe:

  • Publishing reproduction study
  • Comparing to paper's baseline
  • Validating implementation correctness
  • Research requiring exact reproducibility

Use Modern Recipe:

  • Building production system (want best performance)
  • Competing in benchmark (need SOTA results)
  • Publishing new method (should use modern baseline)
  • Limited compute (modern practices more efficient)

Trade-off Table:

| Aspect | Paper Recipe | Modern Recipe |
|--------|--------------|---------------|
| Reproducibility | ✅ Exact | ⚠️ Better but different |
| Performance | ⚠️ Good (for its time) | ✅ Better (+0.5-2%) |
| Comparability | ✅ To paper | ✅ To SOTA |
| Compute efficiency | ⚠️ May be suboptimal | ✅ Modern optimizations |
| Training stability | ⚠️ Variable | ✅ More stable (warmup) |

Bottom Line:

Both are valid depending on your goal:

  • Research/reproduction: Start with paper recipe
  • Production/competition: Use modern recipe
  • Best practice: Validate with paper recipe, deploy with modern recipe

Domain-Specific Recommendations

Image Classification (CNNs)

Standard Recipe (ResNet, VGG):

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)
# Train for 100 epochs

Modern Recipe (EfficientNet, RegNet):

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-5)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
# Train for 100 epochs

Vision Transformers (ViT, Swin, DeiT)

Standard Recipe:

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=10)
cosine = CosineAnnealingLR(optimizer, T_max=290, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [10])
# Train for 300 epochs
# WARMUP IS MANDATORY

NLP Transformers (BERT, GPT, T5)

Standard Recipe:

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.01)

# Linear warmup + linear decay
def lr_lambda(step):
    warmup_steps = 10000
    total_steps = 100000
    if step < warmup_steps:
        return step / warmup_steps
    else:
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Step every batch, not epoch
# WARMUP IS MANDATORY

Object Detection (Faster R-CNN, YOLO)

Standard Recipe:

optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[16, 22], gamma=0.1)
# Train for 26 epochs

Fast Training (Limited Compute)

FastAI Recipe:

# Run LR finder first
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
optimal_lr = suggest_lr_from_finder(lrs, losses)
max_lr = optimal_lr * 10

optimizer = torch.optim.SGD(model.parameters(), lr=max_lr)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=max_lr,
    steps_per_epoch=len(train_loader),
    epochs=20,
    pct_start=0.3
)
# Train for 20 epochs
# Step every batch

7. Common Scheduling Pitfalls

Pitfall 1: No Warmup for Transformers

WRONG:

# Training Vision Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
# ❌ No warmup - training will be very unstable or diverge

RIGHT:

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
# ✅ Warmup prevents early instability

Why It Matters:

  • Transformers with high LR at start → NaN loss, divergence
  • Random initialization needs gradual LR ramp
  • 5-10 epoch warmup is STANDARD practice

How to Detect:

  • Loss is NaN or explodes in first few epochs
  • Training very unstable early, stabilizes later
  • Gradients extremely large at start

Pitfall 2: Wrong scheduler.step() Placement

WRONG (Most Schedulers):

for epoch in range(epochs):
    for batch in train_loader:
        loss = train_step(model, batch, optimizer)
        optimizer.step()
        scheduler.step()  # ❌ Stepping every batch, not every epoch

RIGHT:

for epoch in range(epochs):
    for batch in train_loader:
        loss = train_step(model, batch, optimizer)
        optimizer.step()

    scheduler.step()  # ✅ Step AFTER each epoch

EXCEPTION (OneCycleLR):

for epoch in range(epochs):
    for batch in train_loader:
        loss = train_step(model, batch, optimizer)
        optimizer.step()
        scheduler.step()  # ✅ OneCycle steps EVERY BATCH

Why It Matters:

  • CosineAnnealing with T_max=100 expects 100 steps (epochs)
  • Stepping every batch: If 390 batches/epoch, LR decays in <1 epoch
  • LR reaches minimum way too fast

How to Detect:

  • LR decays to minimum in first epoch
  • Print LR each step: print(optimizer.param_groups[0]['lr'])
  • Check if LR changes every batch (wrong) vs every epoch (right)

Rule:

  • Most schedulers (Step, Cosine, Exponential): Step per epoch
  • OneCycleLR only: Step per batch
  • ReduceLROnPlateau: Step per epoch with validation metric
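
The rule as a compact lookup (a summary of the list above, not an API):

STEP_GRANULARITY = {
    "StepLR": "per epoch",
    "MultiStepLR": "per epoch",
    "ExponentialLR": "per epoch",
    "CosineAnnealingLR": "per epoch",
    "OneCycleLR": "per batch",
    "ReduceLROnPlateau": "per epoch, with scheduler.step(val_metric)",
}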

Pitfall 3: scheduler.step() Before optimizer.step()

WRONG:

loss.backward()
scheduler.step()      # ❌ Wrong order
optimizer.step()

RIGHT:

loss.backward()
optimizer.step()      # ✅ Update weights first
scheduler.step()      # Then update LR

Why It Matters:

  • Scheduler updates LR based on current epoch/step
  • Should update weights with current LR, THEN move to next LR
  • Wrong order = off-by-one error in schedule

How to Detect:

  • Recent PyTorch versions emit a UserWarning when scheduler.step() is called before optimizer.step()
  • Otherwise the off-by-one is subtle and hard to notice
  • Best practice: always call optimizer.step(), then scheduler.step()

Pitfall 4: Not Passing Metric to ReduceLROnPlateau

WRONG:

scheduler = ReduceLROnPlateau(optimizer)
for epoch in range(epochs):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    scheduler.step()  # ❌ No metric passed

RIGHT:

scheduler = ReduceLROnPlateau(optimizer, mode='min')
for epoch in range(epochs):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    val_loss = validate(model, val_loader)
    scheduler.step(val_loss)  # ✅ Pass validation metric

Why It Matters:

  • ReduceLROnPlateau NEEDS metric to detect plateau
  • Without metric, scheduler doesn't know when to reduce LR
  • Will get error or incorrect behavior

How to Detect:

  • Error message: "ReduceLROnPlateau needs a metric"
  • LR never reduces even when training plateaus

Pitfall 5: Using OneCycle for Long Training

SUBOPTIMAL:

# Training for 200 epochs
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=200, steps_per_epoch=len(train_loader))
# ❌ OneCycle designed for shorter training (10-30 epochs)

BETTER:

# For long training, use Cosine
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=10)
cosine = CosineAnnealingLR(optimizer, T_max=190, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [10])
# ✅ Cosine better suited for long training

Why It Matters:

  • OneCycle's aggressive up-then-down profile works for short training
  • For long training, gentler cosine decay more stable
  • OneCycle typically used for 10-30 epochs in FastAI style

When to Use Each:

  • OneCycle: 10-30 epochs, limited compute, want fast results
  • Cosine: 50+ epochs, full training, want best final performance

Pitfall 6: Not Tuning max_lr for OneCycle

WRONG:

# Just guessing max_lr
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20, steps_per_epoch=len(train_loader))
# ❌ Random max_lr without tuning
# Might be too high (unstable) or too low (slow)

RIGHT:

# Step 1: Run LR finder
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
optimal_lr = suggest_lr_from_finder(lrs, losses)  # e.g., 0.01

# Step 2: Use 5-10x optimal as max_lr
max_lr = optimal_lr * 10  # e.g., 0.1

scheduler = OneCycleLR(optimizer, max_lr=max_lr, epochs=20, steps_per_epoch=len(train_loader))
# ✅ Tuned max_lr based on LR finder

Why It Matters:

  • OneCycle is VERY sensitive to max_lr
  • Too high: Training unstable, loss explodes
  • Too low: Slow training, underperforms
  • LR finder finds optimal, use 5-10x as max_lr

How to Tune:

  1. Run LR finder (see LR Finder section)
  2. Find optimal LR (steepest descent point)
  3. Use 5-10x optimal as max_lr for OneCycle
  4. If still unstable, reduce max_lr (try 3x, 2x)

Pitfall 7: Forgetting to Adjust T_max After Adding Warmup

WRONG:

# Want 100 epoch training
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=100)  # ❌ Should be 95
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])

RIGHT:

# Want 100 epoch training
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95)  # ✅ 100 - 5 = 95
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])

Why It Matters:

  • Total schedule length is warmup + main schedule
  • With a 5-epoch warmup and cosine T_max=100, the cosine phase only completes at epoch 105, so the LR never reaches eta_min within a 100-epoch run
  • T_max should be (total_epochs - warmup_epochs)

How to Calculate:

total_epochs = 100
warmup_epochs = 5
T_max = total_epochs - warmup_epochs  # 95

Pitfall 8: Using Same LR for All Param Groups

SUBOPTIMAL:

# Fine-tuning: applying same LR to all layers
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# ❌ Backbone and head both use 1e-3

BETTER:

# Fine-tuning: lower LR for pretrained backbone, higher for new head
optimizer = torch.optim.Adam([
    {'params': model.backbone.parameters(), 'lr': 1e-4},  # Lower LR for pretrained
    {'params': model.head.parameters(), 'lr': 1e-3}       # Higher LR for random init
])
scheduler = CosineAnnealingLR(optimizer, T_max=100)
# ✅ Scheduler applies to all param groups proportionally

Why It Matters:

  • Pretrained layers need smaller LR (already trained)
  • New layers need higher LR (random initialization)
  • Schedulers work with param groups automatically

Note: Schedulers multiply all param groups by same factor, preserving relative ratios
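
A quick check of that behavior (stand-in modules; in practice these would be your pretrained backbone and new head):

import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR

backbone, head = nn.Linear(8, 8), nn.Linear(8, 2)
optimizer = torch.optim.Adam([
    {'params': backbone.parameters(), 'lr': 1e-4},
    {'params': head.parameters(), 'lr': 1e-3},
])
scheduler = CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(3):
    # Both groups decay by the same factor, so the 10x ratio is preserved
    print([f"{g['lr']:.2e}" for g in optimizer.param_groups])
    optimizer.step()
    scheduler.step()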


Pitfall 9: Not Monitoring LR During Training

PROBLEM:

  • Schedule not behaving as expected
  • Hard to debug without visibility into LR

SOLUTION:

# Log LR every epoch
for epoch in range(epochs):
    current_lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch}: LR = {current_lr:.6f}")

    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

# Or use TensorBoard
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()

for epoch in range(epochs):
    current_lr = optimizer.param_groups[0]['lr']
    writer.add_scalar('Learning Rate', current_lr, epoch)

    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Best Practice:

  • Always log LR to console or TensorBoard
  • Plot LR schedule before training (see next section)
  • Verify schedule matches expectations

Pitfall 10: Not Validating Schedule Before Training

PROBLEM:

  • Run full training, discover schedule was wrong
  • Waste compute on incorrect schedule

SOLUTION: Dry-run the schedule:

def plot_schedule(scheduler_fn, num_epochs):
    """
    Plot LR schedule before training to verify it's correct.
    """
    # Create dummy model and optimizer
    model = torch.nn.Linear(1, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = scheduler_fn(optimizer)

    lrs = []
    for epoch in range(num_epochs):
        lrs.append(optimizer.param_groups[0]['lr'])
        optimizer.step()  # Dummy step
        scheduler.step()

    # Plot
    plt.figure(figsize=(10, 6))
    plt.plot(lrs)
    plt.xlabel('Epoch')
    plt.ylabel('Learning Rate')
    plt.title('LR Schedule')
    plt.grid(True, alpha=0.3)
    plt.show()

# Usage
def my_scheduler(opt):
    warmup = LinearLR(opt, start_factor=0.01, total_iters=5)
    cosine = CosineAnnealingLR(opt, T_max=95)
    return SequentialLR(opt, [warmup, cosine], [5])

plot_schedule(my_scheduler, num_epochs=100)
# Verify plot looks correct BEFORE training

Best Practice:

  • Plot schedule before every major training run
  • Verify warmup duration, decay shape, final LR
  • Catch mistakes early (T_max wrong, step placement, etc.)

8. Modern Best Practices (2024-2025)

Vision Models (CNNs, ResNets, ConvNeXt)

Standard Recipe:

# Optimizer
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4
)

# Scheduler: MultiStepLR or CosineAnnealing
# Option 1: MultiStepLR (classical)
scheduler = MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)

# Option 2: CosineAnnealing (modern)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])

# Training
epochs = 100
for epoch in range(epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Key Points:

  • SGD with momentum (0.9) is standard for CNNs
  • LR = 0.1 for batch size 256 (scale linearly for other batch sizes)
  • Warmup optional but beneficial (5 epochs)
  • CosineAnnealing increasingly preferred over MultiStepLR

Vision Transformers (ViT, Swin, DeiT)

Standard Recipe:

# Optimizer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=0.05,
    betas=(0.9, 0.999)
)

# Scheduler: MUST include warmup
warmup_epochs = 10
cosine_epochs = 290
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs)
cosine = CosineAnnealingLR(optimizer, T_max=cosine_epochs, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [warmup_epochs])

# Training
epochs = 300
for epoch in range(epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Key Points:

  • AdamW optimizer (not SGD)
  • Warmup is MANDATORY (10-20 epochs)
  • Long training (300 epochs typical)
  • LR = 1e-3 for batch size 512 (scale for other sizes)
  • Cosine decay to very small LR (1e-5)

Why Warmup is Critical for ViT:

  • Self-attention layers highly sensitive to initialization
  • High LR at start causes gradient explosion
  • Warmup allows attention patterns to stabilize

NLP Transformers (BERT, GPT, T5)

Standard Recipe:

# Optimizer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-4,
    weight_decay=0.01,
    betas=(0.9, 0.999)
)

# Scheduler: Linear warmup + linear decay (or inverse sqrt)
total_steps = len(train_loader) * epochs
warmup_steps = int(0.1 * total_steps)  # 10% warmup

def lr_lambda(step):
    if step < warmup_steps:
        return step / warmup_steps
    else:
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)

# Training: step EVERY BATCH
for epoch in range(epochs):
    for batch in train_loader:
        train_step(model, batch, optimizer)
        optimizer.step()
        scheduler.step()  # Step every batch, not epoch

Key Points:

  • AdamW optimizer
  • Warmup is MANDATORY (typically 10% of training)
  • Linear warmup + linear decay (BERT, GPT-2 style)
  • Step scheduler EVERY BATCH (not every epoch)
  • LR typically 1e-4 to 5e-4

Alternative: Inverse Square Root (Original Transformer):

d_model = 512  # model hidden size (assumed here; match your model's embedding dim)

def transformer_schedule(step):
    warmup_steps = 4000
    step = step + 1
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

scheduler = LambdaLR(optimizer, transformer_schedule)

Object Detection (Faster R-CNN, YOLO, DETR)

Standard Recipe (Two-stage detectors):

# Optimizer
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.02,
    momentum=0.9,
    weight_decay=1e-4
)

# Scheduler: MultiStepLR with short schedule
scheduler = MultiStepLR(optimizer, milestones=[16, 22], gamma=0.1)

# Training
epochs = 26  # Shorter than classification
for epoch in range(epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Standard Recipe (Transformer detectors like DETR):

# Optimizer
optimizer = torch.optim.AdamW(
    [
        {'params': model.backbone.parameters(), 'lr': 1e-5},  # Lower for backbone
        {'params': model.transformer.parameters(), 'lr': 1e-4}  # Higher for transformer
    ],
    weight_decay=1e-4
)

# Scheduler: Step decay
scheduler = MultiStepLR(optimizer, milestones=[200], gamma=0.1)

# Training: Long schedule for DETR
epochs = 300

Key Points:

  • Detection typically shorter training than classification
  • Lower LR (0.02 vs 0.1) due to task difficulty
  • DETR needs very long training (300 epochs)

Semantic Segmentation (U-Net, DeepLab, SegFormer)

Standard Recipe (CNN-based):

# Optimizer
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    weight_decay=1e-4
)

# Scheduler: Polynomial decay (common in segmentation)
total_epochs = 100  # define before use in the lambda

def poly_lr_lambda(epoch):
    return (1 - epoch / total_epochs) ** 0.9

scheduler = LambdaLR(optimizer, poly_lr_lambda)

# Training
for epoch in range(total_epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Key Points:

  • Polynomial decay common in segmentation (DeepLab papers)
  • Lower initial LR (0.01) than classification
  • Power of 0.9 standard

Fast Training / Limited Compute (FastAI Style)

OneCycle Recipe:

# Step 1: Find optimal LR
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
optimal_lr = suggest_lr_from_finder(lrs, losses)  # e.g., 0.01
max_lr = optimal_lr * 10  # e.g., 0.1

# Step 2: OneCycleLR
optimizer = torch.optim.SGD(model.parameters(), lr=max_lr, momentum=0.9)
scheduler = OneCycleLR(
    optimizer,
    max_lr=max_lr,
    steps_per_epoch=len(train_loader),
    epochs=20,
    pct_start=0.3,        # 30% warmup, 70% cooldown
    anneal_strategy='cos'
)

# Step 3: Train (step every batch)
for epoch in range(20):
    for batch in train_loader:
        train_step(model, batch, optimizer)
        optimizer.step()
        scheduler.step()  # Every batch

Key Points:

  • Use LR finder to tune max_lr (CRITICAL)
  • Train for fewer epochs (10-30)
  • Step scheduler every batch
  • Often achieves 90-95% of full training performance in 20-30% of time

Fine-Tuning Pretrained Models

Standard Recipe:

# Optimizer: Different LRs for backbone vs head
optimizer = torch.optim.AdamW([
    {'params': model.backbone.parameters(), 'lr': 1e-5},  # Very low for pretrained
    {'params': model.head.parameters(), 'lr': 1e-3}       # Higher for new head
])

# Scheduler: Simple cosine or even constant
# Option 1: Constant LR (fine-tuning often doesn't need scheduling)
scheduler = None

# Option 2: Gentle cosine decay
scheduler = CosineAnnealingLR(optimizer, T_max=10, eta_min=1e-6)

# Training: Short duration
epochs = 10  # Fine-tuning is quick
for epoch in range(epochs):
    train_one_epoch(model, train_loader, optimizer)
    if scheduler:
        scheduler.step()

Key Points:

  • Much lower LR for pretrained parts (1e-5)
  • Higher LR for new/random parts (1e-3)
  • Short training (3-10 epochs)
  • Scheduling often optional (constant LR works)
  • No warmup needed (weights already good)

Large Batch Training (Batch Size > 1024)

Standard Recipe:

# Linear LR scaling rule: LR scales with batch size
base_lr = 0.1  # For batch size 256
batch_size = 2048
scaled_lr = base_lr * (batch_size / 256)  # 0.8 for batch 2048

# Optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr, momentum=0.9)

# Scheduler: MUST include warmup (critical for large batch)
warmup_epochs = 5
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [warmup_epochs])

# Training
epochs = 100
for epoch in range(epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Key Points:

  • Scale LR linearly with batch size (LR = base_lr * batch_size / base_batch_size)
  • Warmup is MANDATORY for large batch (5-10 epochs minimum)
  • Longer warmup for very large batches (>4096: use 10-20 epochs)

Why Warmup Critical for Large Batch:

  • Linear scaling gives large batches a much higher starting LR
  • A high LR on freshly initialized weights causes early instability
  • Warmup ramps the LR up gradually and prevents divergence
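
To make the ramp concrete, here is a quick dry run of the warmup above on a throwaway parameter (no training involved); the exact values assume the batch-2048 recipe:

import torch
from torch.optim.lr_scheduler import LinearLR

# Dry run: watch the warmup ramp without training anything
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.SGD(params, lr=0.8)  # scaled LR for batch 2048
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)

for epoch in range(6):
    print(f"Epoch {epoch}: LR = {optimizer.param_groups[0]['lr']:.4f}")
    optimizer.step()  # keeps PyTorch's step-order warning quiet
    warmup.step()
# Ramps linearly: 0.0080 at epoch 0 up to 0.8000 by epoch 5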

Modern Defaults by Domain (2025)

| Domain         | Optimizer          | Scheduler           | Warmup            | Epochs  |
|----------------|--------------------|---------------------|-------------------|---------|
| Vision (CNN)   | SGD (momentum 0.9) | Cosine or MultiStep | Optional (5)      | 100-200 |
| Vision (ViT)   | AdamW              | Cosine              | MANDATORY (10-20) | 300     |
| NLP (BERT/GPT) | AdamW              | Linear              | MANDATORY (10%)   | Varies  |
| Detection      | SGD                | MultiStep           | Optional          | 26-300  |
| Segmentation   | SGD                | Polynomial          | Optional          | 100     |
| Fast/OneCycle  | SGD                | OneCycle            | Built-in          | 10-30   |
| Fine-tuning    | AdamW              | Constant/Cosine     | No                | 3-10    |
| Large Batch    | SGD                | Cosine              | MANDATORY (5-20)  | 100-200 |
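
Most rows of this table reduce to the same warmup-then-cosine pattern. A small hypothetical helper (warmup_cosine is our name, not a library function) packages it:

from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

def warmup_cosine(optimizer, total_epochs, warmup_epochs=5, eta_min=1e-5):
    # Linear warmup followed by cosine decay over the remaining epochs
    warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs)
    cosine = CosineAnnealingLR(optimizer, T_max=total_epochs - warmup_epochs, eta_min=eta_min)
    return SequentialLR(optimizer, [warmup, cosine], milestones=[warmup_epochs])

# e.g. ViT-style: scheduler = warmup_cosine(optimizer, total_epochs=300, warmup_epochs=10)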

9. Debugging Scheduler Issues

Issue: Training Unstable / Loss Spikes

Symptoms:

  • Loss increases suddenly during training
  • NaN or Inf loss
  • Training was stable, then becomes unstable

Likely Causes:

  1. No warmup (transformers, large models)

    • Solution: Add 5-10 epoch warmup
  2. LR too high at start

    • Solution: Lower initial LR or extend warmup
  3. LR drop too sharp (MultiStepLR)

    • Solution: Use gentler scheduler (Cosine) or smaller gamma

Debugging Steps:

# 1. Print LR every epoch (and record it for plotting)
lr_history, loss_history = [], []
for epoch in range(epochs):
    current_lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch}: LR = {current_lr:.6e}")
    lr_history.append(current_lr)

    # 2. Check whether loss spikes correlate with LR changes
    loss = train_one_epoch(model, train_loader, optimizer)
    print(f"  Loss = {loss:.4f}")
    loss_history.append(loss)

    scheduler.step()

# 3. Plot LR and loss together
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(lr_history)
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.subplot(1, 2, 2)
plt.plot(loss_history)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()

Solutions:

  • Add/extend warmup: LinearLR(optimizer, start_factor=0.01, total_iters=10)
  • Lower initial LR: lr = 0.01 instead of lr = 0.1
  • Gentler scheduler: CosineAnnealingLR instead of MultiStepLR
  • Gradient clipping: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
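
These fixes compose. A minimal sketch of a stabilized loop combining extended warmup and gradient clipping (model, optimizer, loss_fn, and train_loader as in the recipes above):

import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

warmup = LinearLR(optimizer, start_factor=0.01, total_iters=10)  # extended warmup
cosine = CosineAnnealingLR(optimizer, T_max=90, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], milestones=[10])

for epoch in range(100):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        # Clip BEFORE optimizer.step() so no single batch can blow up the weights
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
    scheduler.step()  # per epoch for warmup + cosine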

Issue: Training Plateaus Too Early

Symptoms:

  • Loss stops decreasing after 20-30 epochs
  • Validation accuracy flat
  • Training seems stuck

Likely Causes:

  1. Not using scheduler (constant LR too high for current regime)

    • Solution: Add scheduler (CosineAnnealing or ReduceLROnPlateau)
  2. Scheduler reducing LR too early

    • Solution: Push back milestones or increase patience
  3. LR already too low

    • Solution: Check current LR, may need to restart with higher initial LR

Debugging Steps:

# Check current LR
current_lr = optimizer.param_groups[0]['lr']
print(f"Current LR: {current_lr:.6e}")

# If LR very low (<1e-6), plateau might be due to other issues (architecture, data, etc.)
# If LR still high (>1e-3), should reduce LR to break plateau

Solutions:

  • Add ReduceLROnPlateau: Automatically reduces when plateau detected

    scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10)
    
  • Manual LR reduction: If at epoch 30 and plateaued, reduce LR by 10x now

    for param_group in optimizer.param_groups:
        param_group['lr'] *= 0.1
    
  • Use scheduler from start next time:

    scheduler = CosineAnnealingLR(optimizer, T_max=100)
    

Issue: Poor Final Performance (Train > Val Gap)

Symptoms:

  • Training accuracy high (95%), validation lower (88%)
  • Model overfitting
  • Test performance disappointing

Likely Causes (Scheduling Related):

  1. LR not low enough at end

    • Solution: Lower eta_min or extend training
  2. Not using scheduler (constant LR doesn't fine-tune)

    • Solution: Add scheduler to reduce LR in late training
  3. Scheduler ending too early

    • Solution: Extend training or adjust T_max

Debugging Steps:

# Check final LR
final_lr = optimizer.param_groups[0]['lr']
print(f"Final LR: {final_lr:.6e}")

# Final LR should be very low (1e-5 to 1e-6)
# If final LR still high (>1e-3), model didn't fine-tune properly

Solutions:

  • Lower eta_min: CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)
  • Extend training: Train for more epochs to allow LR to decay further
  • Add late-stage fine-tuning:
    # After main training, do 10 more epochs with very low LR
    for param_group in optimizer.param_groups:
        param_group['lr'] = 1e-5
    for epoch in range(10):
        train_one_epoch(model, train_loader, optimizer)
    

Note: If train-val gap large, may also need regularization (not scheduling issue)


Issue: LR Decays Too Fast

Symptoms:

  • LR reaches minimum in first few epochs
  • Training very slow after initial epochs
  • Looks like constant very low LR

Likely Causes:

  1. scheduler.step() called every batch instead of epoch

    • Solution: Move scheduler.step() outside batch loop
  2. T_max too small (e.g., T_max=10 but training for 100 epochs)

    • Solution: Set T_max = total_epochs
  3. Using OneCycle unintentionally

    • Solution: Verify scheduler type

Debugging Steps:

# Print LR first few epochs
for epoch in range(10):
    print(f"Epoch {epoch}: LR = {optimizer.param_groups[0]['lr']:.6e}")
    for batch in train_loader:
        train_step(model, batch, optimizer)
        # scheduler.step()  # ❌ If this is here, that's the bug
    scheduler.step()  # ✅ Should be here

Solutions:

  • Move scheduler.step() to correct location (after epoch, not after batch)
  • Fix T_max: T_max = total_epochs or T_max = total_epochs - warmup_epochs
  • Verify scheduler type: print(type(scheduler))
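
A cheap way to catch all three causes before burning compute is to dry-run the schedule on a throwaway optimizer and plot it; nothing here touches your real model:

import matplotlib.pyplot as plt
import torch

dummy = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.SGD(dummy, lr=0.1)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100, eta_min=1e-5)

lrs = []
for epoch in range(100):
    lrs.append(opt.param_groups[0]['lr'])
    opt.step()    # keeps PyTorch's step-order warning quiet
    sched.step()  # step exactly where your real training loop would

plt.plot(lrs)
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.title('Schedule dry run: should reach eta_min at the FINAL epoch')
plt.show()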

Issue: OneCycleLR Not Working

Symptoms:

  • Training with OneCycle becomes unstable around peak LR
  • Loss increases during ramp-up phase
  • Worse performance than expected

Likely Causes:

  1. max_lr too high

    • Solution: Run LR finder, use lower max_lr
  2. scheduler.step() placement wrong (should be per batch)

    • Solution: Call scheduler.step() every batch
  3. Not tuning max_lr

    • Solution: Use LR finder to find optimal, use 5-10x as max_lr

Debugging Steps:

# Plot LR schedule
lrs = []
for epoch in range(epochs):
    for batch in train_loader:
        lrs.append(optimizer.param_groups[0]['lr'])
        scheduler.step()

plt.plot(lrs)
plt.xlabel('Batch')
plt.ylabel('Learning Rate')
plt.title('OneCycle LR Schedule')
plt.show()

# Should see: ramp up to max_lr, then ramp down
# If doesn't look like that, scheduler.step() placement wrong

Solutions:

  • Run LR finder first:

    lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
    optimal_lr = suggest_lr_from_finder(lrs, losses)
    max_lr = optimal_lr * 10  # Or try 5x, 3x if 10x unstable
    
  • Lower max_lr manually:

    # If max_lr=0.1 unstable, try 0.03 or 0.01
    scheduler = OneCycleLR(optimizer, max_lr=0.03, ...)
    
  • Verify step() every batch:

    for epoch in range(epochs):
        for batch in train_loader:
            train_step(model, batch, optimizer)
            optimizer.step()
            scheduler.step()  # ✅ Every batch
    

Issue: Warmup Not Working

Symptoms:

  • Training still unstable in first few epochs despite warmup
  • Loss spikes even with warmup
  • NaN loss at start

Likely Causes:

  1. Warmup too short (need longer ramp-up)

    • Solution: Extend warmup from 5 to 10-20 epochs
  2. start_factor too high (not starting low enough)

    • Solution: Use start_factor=0.001 instead of 0.01
  3. Warmup not actually being used (SequentialLR bug)

    • Solution: Verify warmup scheduler is active early

Debugging Steps:

# Print LR first 10 epochs
for epoch in range(10):
    current_lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch}: LR = {current_lr:.6e}")
    # Should see gradual increase from low to high
    # If jumps immediately to high, warmup not working

    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Solutions:

  • Extend warmup:

    warmup = LinearLR(optimizer, start_factor=0.01, total_iters=20)  # 20 epochs
    
  • Lower start_factor:

    warmup = LinearLR(optimizer, start_factor=0.001, total_iters=5)  # Start at 0.1%
    
  • Verify SequentialLR milestone:

    # Milestone should match warmup duration
    scheduler = SequentialLR(optimizer, [warmup, cosine], milestones=[20])
    
  • Add gradient clipping as additional safeguard:

    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    

Issue: ReduceLROnPlateau Never Reduces LR

Symptoms:

  • Using ReduceLROnPlateau for 50+ epochs
  • Validation loss clearly plateaued
  • Learning rate never reduces

Debugging Steps:

1. Verify metric is being passed:

val_loss = validate(model, val_loader)
print(f"Epoch {epoch}: val_loss = {val_loss:.6f}")  # Print metric
scheduler.step(val_loss)  # Ensure passing metric

2. Check mode is correct:

# For loss (want to minimize):
scheduler = ReduceLROnPlateau(optimizer, mode='min')

# For accuracy (want to maximize):
scheduler = ReduceLROnPlateau(optimizer, mode='max')

Wrong mode means scheduler waits for opposite direction (loss increasing instead of decreasing).

3. Check the threshold isn't too lenient:

# Default threshold=1e-4 in 'rel' mode = improvements over 0.01% count as progress
# Example: val_loss 0.5000 → 0.4999 is a 0.02% improvement, which EXCEEDS the
# default threshold, counts as progress, and resets patience.
# If the loss keeps creeping down by tiny amounts, the LR never reduces.

# Solution: RAISE the threshold so only meaningful improvements reset patience
scheduler = ReduceLROnPlateau(optimizer, threshold=1e-3)  # require 0.1% improvement

4. Enable verbose logging:

scheduler = ReduceLROnPlateau(optimizer, verbose=True)
# Prints: "Epoch 00042: reducing learning rate of group 0 to 1.0000e-04" when it reduces
# (verbose is deprecated in recent PyTorch releases; if unavailable, print
# optimizer.param_groups[0]['lr'] each epoch instead)

5. Verify plateau is real:

# Plot validation loss over time
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot(val_losses)
plt.xlabel('Epoch')
plt.ylabel('Validation Loss')
plt.title('Validation Loss Over Time')
plt.grid(True, alpha=0.3)
plt.show()

# Check: Is loss truly flat, or still slowly improving?
# Tiny improvements (0.4500 → 0.4499) count as progress

6. Check cooldown isn't preventing reduction:

# Default cooldown=0, but if set higher, prevents reduction after recent reduction
scheduler = ReduceLROnPlateau(optimizer, cooldown=0)  # No cooldown

Common Causes Table:

| Problem                | Symptom                          | Solution                                     |
|------------------------|----------------------------------|----------------------------------------------|
| Not passing metric     | Error or no reduction            | scheduler.step(val_loss)                     |
| Wrong mode             | Never reduces                    | mode='min' for loss, mode='max' for accuracy |
| Threshold too lenient  | Tiny improvements reset patience | Raise to threshold=1e-3                      |
| Metric still improving | Not actually plateaued           | Increase patience or accept slow progress    |
| Cooldown active        | Reducing but waiting             | Set cooldown=0                               |
| min_lr reached         | Can't reduce further             | Check current LR, may already be at min_lr   |

Example Fix:

scheduler = ReduceLROnPlateau(
    optimizer,
    mode='min',          # For loss minimization
    factor=0.1,          # Reduce by 10x
    patience=10,         # Wait 10 epochs
    threshold=1e-3,      # Ignore improvements smaller than 0.1%
    threshold_mode='rel',
    cooldown=0,          # No cooldown period
    min_lr=1e-6,         # Minimum LR allowed
    verbose=True         # Print when reducing (deprecated in recent PyTorch)
)

# Training loop
for epoch in range(epochs):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    val_loss = validate(model, val_loader)

    print(f"Epoch {epoch}: train_loss={train_loss:.4f}, val_loss={val_loss:.4f}")

    scheduler.step(val_loss)  # Pass validation loss

    # Print current LR
    current_lr = optimizer.param_groups[0]['lr']
    print(f"  Current LR: {current_lr:.6e}")

Advanced Debugging:

If still not reducing, manually check scheduler logic:

# Get scheduler state
print(f"Best metric so far: {scheduler.best}")
print(f"Epochs without improvement: {scheduler.num_bad_epochs}")
print(f"Patience: {scheduler.patience}")

# If num_bad_epochs < patience, it's still waiting
# If num_bad_epochs >= patience, should reduce next step

10. Rationalization Table

When users rationalize away proper LR scheduling, counter with:

| Rationalization | Reality | Counter-Argument |
|---|---|---|
| "Constant LR is simpler" | Leaves 2-5% performance on the table | "One line of code for 2-5% better accuracy is excellent ROI" |
| "Warmup seems optional" | MANDATORY for transformers | "Without warmup, transformers diverge or train unstably" |
| "I don't know which scheduler to use" | CosineAnnealing is a great default | "CosineAnnealingLR works well for most cases, zero tuning" |
| "Scheduling is too complicated" | Modern frameworks make it trivial | "scheduler = CosineAnnealingLR(optimizer, T_max=100) - that's it" |
| "Papers don't mention scheduling" | They do, in implementation details | "Check the paper's code repo or appendix - the schedule is always there" |
| "My model is too small to need scheduling" | Even small models benefit | "Scheduling helps all models converge to better minima" |
| "Just use Adam, it adapts automatically" | Adam still benefits from scheduling | "SOTA transformers use AdamW + scheduling (BERT, GPT, ViT)" |
| "I'll tune it later" | Scheduling should be there from the start | "Scheduling is a core hyperparameter, not an optional add-on" |
| "OneCycle always best" | Only for specific scenarios | "OneCycle is great for fast training (<30 epochs), not long training" |
| "I don't have time to run LR finder" | Takes 5 minutes, saves hours | "The LR finder runs in minutes and prevents wasted training runs" |
| "Warmup adds complexity" | One extra line of code | "SequentialLR([warmup, cosine], [5]) - that's the complexity" |
| "My training is already good enough" | Could be 2-5% better | "SOTA papers all use scheduling - it's standard practice" |
| "Reducing LR will slow training" | Reduces LR only when a high LR hurts | "High LR early (fast), low LR late (fine-tune) = best of both" |
| "I don't know what T_max to use" | T_max = total_epochs | "Just set T_max to your total training epochs" |

11. Red Flags Checklist

Watch for these warning signs that indicate scheduling problems:

Critical Red Flags (Fix Immediately):

🚨 Training transformer without warmup

  • Impact: High risk of divergence, NaN loss
  • Fix: Add 5-10 epoch warmup immediately

🚨 Loss NaN or exploding in first few epochs

  • Impact: Training failed
  • Fix: Add warmup, lower initial LR, gradient clipping

🚨 scheduler.step() called every batch for Cosine/Step schedulers

  • Impact: LR decays steps_per_epoch times too fast (often 100x or more)
  • Fix: Move scheduler.step() outside batch loop

🚨 Not passing metric to ReduceLROnPlateau

  • Impact: Scheduler doesn't work at all
  • Fix: scheduler.step(val_loss)

Important Red Flags (Should Fix):

⚠️ Training >30 epochs without scheduler

  • Impact: Leaving 2-5% performance on table
  • Fix: Add CosineAnnealingLR or MultiStepLR

⚠️ OneCycle with an arbitrary max_lr (not tuned)

  • Impact: Unstable training or suboptimal performance
  • Fix: Run LR finder, tune max_lr

⚠️ Large batch (>512) without warmup

  • Impact: Training instability
  • Fix: Add 5-10 epoch warmup

⚠️ Vision transformer with constant LR

  • Impact: Poor convergence, unstable training
  • Fix: Add warmup + cosine schedule

⚠️ Training plateaus but no scheduler to reduce LR

  • Impact: Stuck at local minimum
  • Fix: Add ReduceLROnPlateau or manually reduce LR

Minor Red Flags (Consider Fixing):

⚡ CNN training without any scheduling

  • Impact: Missing 1-3% accuracy
  • Fix: Add MultiStepLR or CosineAnnealingLR

⚡ Not monitoring LR during training

  • Impact: Hard to debug schedule issues
  • Fix: Log LR every epoch

⚡ T_max doesn't match training duration

  • Impact: Schedule ends too early/late
  • Fix: Set T_max = total_epochs - warmup_epochs

⚡ Using same LR for pretrained and new layers (fine-tuning)

  • Impact: Suboptimal fine-tuning
  • Fix: Use different LRs for param groups

⚡ Not validating schedule before full training

  • Impact: Risk wasting compute on wrong schedule
  • Fix: Plot schedule dry-run before training

12. Quick Reference

Scheduler Selection Cheatsheet

Q: What should I use for...

Vision CNN (100 epochs)?
→ CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-5)

Vision Transformer?
→ LinearLR(warmup 5) + CosineAnnealingLR(T_max=95) [WARMUP MANDATORY]

NLP Transformer?
→ LinearLR(warmup 10%) + LinearLR(decay) [WARMUP MANDATORY]

Fast training (<30 epochs)?
→ OneCycleLR(max_lr=tune_with_LR_finder)

Don't know optimal schedule?
→ ReduceLROnPlateau(mode='min', patience=10)

Training plateaued?
→ Add ReduceLROnPlateau or manually reduce LR by 10x now

Following paper recipe?
→ Use paper's exact schedule (usually MultiStepLR)

Fine-tuning pretrained model?
→ Constant low LR (1e-5) or gentle CosineAnnealing

Large batch (>512)?
→ LinearLR(warmup 5-10) + CosineAnnealingLR [WARMUP MANDATORY]

Step Placement Quick Reference

# Most schedulers (Step, Cosine, Exponential)
for epoch in range(epochs):
    for batch in train_loader:
        train_step(...)
    scheduler.step()  # AFTER epoch

# OneCycleLR (EXCEPTION)
for epoch in range(epochs):
    for batch in train_loader:
        train_step(...)
        scheduler.step()  # AFTER each batch

# ReduceLROnPlateau (pass metric)
for epoch in range(epochs):
    for batch in train_loader:
        train_step(...)
    val_loss = validate(...)
    scheduler.step(val_loss)  # Pass metric

Warmup Quick Reference

# Pattern: Warmup + Cosine (most common)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])

# When warmup is MANDATORY:
# ✅ Transformers (ViT, BERT, GPT)
# ✅ Large batch (>512)
# ✅ High initial LR
# ✅ Training from scratch

# When warmup is optional:
# ❌ Fine-tuning
# ❌ Small LR (<1e-4)
# ❌ Small models

LR Finder Quick Reference

# Run LR finder
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)

# Find optimal (steepest descent)
optimal_lr = suggest_lr_from_finder(lrs, losses)

# Use cases:
# - Direct use: optimizer = SGD(params, lr=optimal_lr)
# - OneCycle: max_lr = optimal_lr * 10
# - Conservative: base_lr = optimal_lr * 0.1
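
find_lr and suggest_lr_from_finder are the helper names assumed throughout this skill. If you need the suggestion step standalone, one common heuristic (steepest descent on a smoothed loss curve) looks roughly like this sketch:

import numpy as np

def suggest_lr_from_finder(lrs, losses):
    # Pick the LR where the loss is falling fastest (steepest negative slope).
    # Smoothing first avoids latching onto a noisy single-batch dip.
    losses = np.convolve(losses, np.ones(5) / 5, mode='valid')
    lrs = np.asarray(lrs)[:len(losses)]
    return float(lrs[np.argmin(np.gradient(losses))])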

Summary

Learning rate scheduling is CRITICAL for competitive model performance:

Key Takeaways:

  1. Scheduling improves final accuracy by 2-5% - not optional for SOTA
  2. Warmup is MANDATORY for transformers - prevents divergence
  3. CosineAnnealingLR is best default - works well, zero tuning
  4. Use LR finder for new problems - finds optimal initial LR in minutes
  5. OneCycleLR needs max_lr tuning - run LR finder first
  6. Watch scheduler.step() placement - most per epoch, OneCycle per batch
  7. Always monitor LR during training - log to console or TensorBoard
  8. Plot schedule before training - catch mistakes early

Modern Defaults (2025):

  • Vision CNNs: SGD + CosineAnnealingLR (optional warmup)
  • Vision Transformers: AdamW + Warmup + CosineAnnealingLR (warmup mandatory)
  • NLP Transformers: AdamW + Warmup + Linear decay (warmup mandatory)
  • Fast Training: SGD + OneCycleLR (tune max_lr with LR finder)

When In Doubt:

  • Use CosineAnnealingLR with T_max = total_epochs
  • Add 5-epoch warmup for large models
  • Run LR finder if unsure about initial LR
  • Log LR every epoch to monitor schedule

Learning rate scheduling is one of the highest-ROI hyperparameters - master it for significantly better model performance.