SKILL.md

name: learning-rate-scheduling
description: Learning rate scheduling - warmup, schedulers, decay strategies, modern best practices

Learning Rate Scheduling Skill

When to Use This Skill

Use this skill when:

  • User asks "should I use a learning rate scheduler?"
  • Training plateaus or loss stops improving
  • Training transformers or large models (warmup critical)
  • User wants to implement OneCycleLR or specific scheduler
  • Training is unstable in early epochs
  • User asks "what learning rate should I use?"
  • Deciding between constant LR and scheduled LR
  • User is copying a paper's training recipe
  • Implementing modern training pipelines (vision, NLP, RL)
  • User suggests "just use constant LR" (rationalization)

Do NOT use when:

  • User has specific bugs unrelated to scheduling
  • Only discussing optimizer choice (no schedule questions)
  • Training already working well and no LR questions asked

Core Principles

1. Why Learning Rate Scheduling Matters

Learning rate scheduling is one of the MOST IMPACTFUL hyperparameters:

High LR Early (Exploration):

  • Fast initial progress through parameter space
  • Escape poor local minima
  • Rapid loss reduction in early epochs

Low LR Late (Exploitation):

  • Fine-tune to sharper, better minima
  • Improve generalization (test accuracy)
  • Stable convergence without oscillation

Quantitative Impact:

  • Proper scheduling improves final test accuracy by 2-5% (SIGNIFICANT)
  • Standard practice in all SOTA papers (ResNet, EfficientNet, ViT, BERT, GPT)
  • Not optional for competitive performance

When Constant LR Fails:

  • Can't explore quickly AND converge precisely
  • Either too high (never converges) or too low (too slow)
  • Leaves 2-5% performance on the table
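
A minimal sketch of the exploration-to-exploitation curve (plain cosine decay; the function name and constants are illustrative, not a library API):

import math

def cosine_lr(epoch, total_epochs, base_lr=0.1, eta_min=1e-5):
    # High LR early (exploration), decaying smoothly to low LR late (exploitation)
    progress = epoch / total_epochs
    return eta_min + (base_lr - eta_min) * 0.5 * (1 + math.cos(math.pi * progress))

for epoch in [0, 25, 50, 75, 100]:
    print(epoch, round(cosine_lr(epoch, 100), 4))
# 0 → 0.1 | 25 → 0.0854 | 50 → 0.05 | 75 → 0.0146 | 100 → eta_min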

2. Decision Framework: When to Schedule vs Constant LR

Use Scheduler When:

Long training (>30 epochs)

  • Scheduling essential for multi-stage training
  • Different LR regimes needed across training
  • Example: 90-epoch ImageNet training

Large model on large dataset

  • Training from scratch on ImageNet, COCO, etc.
  • Benefits from exploration → exploitation strategy
  • Example: ResNet-50 on ImageNet

Training plateaus or loss stops improving

  • Current LR too high for current parameter regime
  • Reducing LR breaks plateau
  • Example: Validation loss stuck for 10+ epochs

Following established training recipes

  • Papers publish schedules for reproducibility
  • Vision models typically use MultiStepLR or Cosine
  • Example: ResNet paper specifies drop at epochs 30, 60, 90

Want competitive SOTA performance

  • Squeezing out last 2-5% accuracy
  • Required for benchmarks and competitions
  • Example: Targeting SOTA on CIFAR-10

Maybe Don't Need Scheduler When:

Very short training (<10 epochs)

  • Not enough time for multi-stage scheduling
  • Constant LR or simple linear decay sufficient
  • Example: Quick fine-tuning for 5 epochs

OneCycle is the strategy itself

  • OneCycleLR IS the training strategy (not separate)
  • Don't combine OneCycle with another scheduler
  • Example: FastAI-style 20-epoch training

Hyperparameter search phase

  • Constant LR simpler to compare across runs
  • Add scheduling after finding good architecture/optimizer
  • Example: Running 50 architecture trials

Transfer learning fine-tuning

  • Small number of epochs on pretrained model
  • Constant small LR often sufficient
  • Example: Fine-tuning BERT for 3 epochs

Reinforcement learning

  • RL typically uses constant LR (exploration/exploitation balance different)
  • Some exceptions (PPO sometimes uses linear decay)
  • Example: DQN, A3C usually constant LR

Default Recommendation:

  • For >30 epoch training: USE A SCHEDULER (typically CosineAnnealingLR)
  • For <10 epoch training: Constant LR usually fine
  • For 10-30 epochs: Try both; the scheduler usually wins


3. Major Scheduler Types - Complete Comparison

StepLR / MultiStepLR (Classic Vision)

Use When:

  • Training CNNs (ResNet, VGG, etc.)
  • Following established recipe from paper
  • Want simple, interpretable schedule

How It Works:

  • Drop LR by constant factor at specific epochs
  • StepLR: every N epochs
  • MultiStepLR: at specified milestone epochs

Implementation:

# StepLR: Drop every 30 epochs
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer,
    step_size=30,    # Drop every 30 epochs
    gamma=0.1        # Multiply LR by 0.1 (10x reduction)
)

# MultiStepLR: Drop at specific milestones (more control)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer,
    milestones=[30, 60, 90],  # Drop at these epochs
    gamma=0.1                  # Multiply by 0.1 each time
)

# Training loop
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()  # Call AFTER each epoch

Example Schedule (initial_lr=0.1):

  • Epochs 0-29: LR = 0.1
  • Epochs 30-59: LR = 0.01 (dropped by 10x)
  • Epochs 60-89: LR = 0.001 (dropped by 10x again)
  • Epochs 90-99: LR = 0.0001

Pros:

  • Simple and interpretable
  • Well-established in papers (easy to reproduce)
  • Works well for vision models

Cons:

  • Requires manual milestone selection
  • Sharp LR drops can cause temporary instability
  • Need to know total training epochs in advance

Best For: Classical CNN training (ResNet, VGG) following paper recipes


CosineAnnealingLR (Modern Default)

Use When:

  • Training modern vision models (ViT, EfficientNet)
  • Want smooth decay without manual milestones
  • Don't want to tune milestone positions

How It Works:

  • Smooth cosine curve from initial_lr to eta_min
  • Gradual decay, no sharp drops
  • LR = eta_min + (initial_lr - eta_min) * (1 + cos(π * epoch / T_max)) / 2

Implementation:

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=100,      # Total epochs (LR reaches eta_min at epoch 100)
    eta_min=1e-5    # Minimum LR (default: 0)
)

# Training loop
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()  # Call AFTER each epoch

Example Schedule (initial_lr=0.1, eta_min=1e-5):

  • Epoch 0: LR = 0.1
  • Epoch 25: LR ≈ 0.085
  • Epoch 50: LR ≈ 0.05
  • Epoch 75: LR ≈ 0.015
  • Epoch 100: LR = 0.00001

(Note the spacing: cosine decay is flattest near the start and end, steepest in the middle.)

Pros:

  • No milestone tuning required
  • Smooth decay (no instability from sharp drops)
  • Widely used in modern papers
  • Works well across many domains

Cons:

  • Must know total epochs in advance
  • Can't adjust schedule during training

Best Practice: ALWAYS COMBINE WITH WARMUP for large models:

# Warmup for 5 epochs, then cosine for 95 epochs
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.01,  # Start at 1% of base LR
    end_factor=1.0,     # Ramp to 100%
    total_iters=5       # Over 5 epochs
)

cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=95,          # 95 epochs after warmup
    eta_min=1e-5
)

scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup, cosine],
    milestones=[5]  # Switch to cosine after 5 epochs
)

Best For: Modern vision models, transformers, default choice for most problems


ReduceLROnPlateau (Adaptive)

Use When:

  • Don't know optimal schedule in advance
  • Want adaptive approach based on validation performance
  • Training plateaus and you want automatic LR reduction

How It Works:

  • Monitors validation metric (loss or accuracy)
  • Reduces LR when metric stops improving
  • Requires passing metric to scheduler.step()

Implementation:

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode='min',          # 'min' for loss, 'max' for accuracy
    factor=0.1,          # Reduce LR by 10x when plateau detected
    patience=10,         # Wait 10 epochs before reducing
    threshold=1e-4,      # Minimum change to count as improvement
    threshold_mode='rel', # 'rel' or 'abs'
    cooldown=0,          # Epochs to wait after LR reduction
    min_lr=1e-6,         # Don't reduce below this
    verbose=True         # Print when LR reduced
)

# Training loop
for epoch in range(100):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    val_loss = validate(model, val_loader)

    # IMPORTANT: Pass validation metric to step()
    scheduler.step(val_loss)  # NOT scheduler.step() alone!

Example Behavior (patience=10, factor=0.1):

  • Epochs 0-30: Val loss improving, LR = 0.001
  • Epochs 31-40: Val loss plateaus at 0.15, patience counting
  • Epoch 41: Plateau detected, LR reduced to 0.0001
  • Epochs 42-60: Val loss improving again with lower LR
  • Epoch 61: Plateau again, LR reduced to 0.00001

Pros:

  • Adaptive - no manual tuning required
  • Based on actual training progress
  • Good for unknown optimal schedule

Cons:

  • Can be too conservative (waits long before reducing)
  • Requires validation metric (can't use train loss alone)
  • May reduce LR too late or not enough

Tuning Tips:

  • Smaller patience (5-10) for faster adaptation
  • Larger patience (10-20) for more conservative
  • Factor of 0.1 (10x) is standard, but 0.5 (2x) more gradual

Best For: Exploratory training, unknown optimal schedule, adaptive pipelines


OneCycleLR (Fast Training)

Use When:

  • Limited compute budget (want fast convergence)
  • Training for relatively few epochs (10-30)
  • Following FastAI-style training
  • Want aggressive schedule for quick results

How It Works:

  • Ramps UP from low LR to max_lr (first 30% by default)
  • Ramps DOWN from max_lr to very low LR (remaining 70%)
  • Steps EVERY BATCH (not every epoch) - CRITICAL DIFFERENCE

Implementation:

scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,                    # Peak learning rate (TUNE THIS!)
    steps_per_epoch=len(train_loader),  # Batches per epoch
    epochs=20,                     # Total epochs
    pct_start=0.3,                 # Ramp up for first 30%
    anneal_strategy='cos',         # 'cos' or 'linear'
    div_factor=25,                 # initial_lr = max_lr / 25
    final_div_factor=10000         # final_lr = initial_lr / 10000 (= max_lr / 250000)
)

# Training loop - NOTE: step() EVERY BATCH
for epoch in range(20):
    for batch in train_loader:
        loss = train_step(model, batch, optimizer)
        optimizer.step()
        scheduler.step()  # CALL EVERY BATCH, NOT EVERY EPOCH!

Example Schedule (max_lr=0.1, 20 epochs, 400 batches/epoch):

  • Batches 0-2400 (epochs 0-6): LR ramps from 0.004 → 0.1
  • Batches 2400-8000 (epochs 6-20): LR anneals from 0.1 → ~4e-7 (initial_lr / final_div_factor)

CRITICAL: Tuning max_lr:

OneCycleLR is VERY sensitive to max_lr choice. Too high = instability.

Method 1 - LR Finder (RECOMMENDED):

# Run LR finder first (see LR Finder section)
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
optimal_lr = suggest_lr_from_finder(lrs, losses)  # e.g., 0.01
max_lr = optimal_lr * 10  # Use 10x optimal as max_lr

Method 2 - Manual tuning:

  • Start with max_lr = 0.1
  • If training unstable, try 0.03, 0.01
  • If training too slow, try 0.3, 1.0

Pros:

  • Very fast convergence (fewer epochs needed)
  • Strong final performance
  • Popular in FastAI community

Cons:

  • Sensitive to max_lr (requires tuning)
  • Steps every batch (easy to mess up)
  • Not ideal for very long training (>50 epochs)

Common Mistakes:

  1. Calling scheduler.step() per epoch instead of per batch
  2. Not tuning max_lr (using default blindly)
  3. Using for very long training (OneCycle designed for shorter cycles)

Best For: FastAI-style training, limited compute budget, 10-30 epoch training


Advanced OneCycleLR Tuning

If lowering max_lr doesn't resolve instability, try these advanced tuning options:

1. Adjust pct_start (warmup fraction):

# Default: 0.3 (30% warmup, 70% cooldown)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader),
                       pct_start=0.3)  # Default

# If unstable at peak: Increase to 0.4 or 0.5 (longer warmup)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader),
                       pct_start=0.5)  # Gentler ramp to peak

# If unstable in cooldown: Decrease to 0.2 (shorter warmup, gentler descent)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader),
                       pct_start=0.2)

2. Adjust div_factor (initial LR):

# Default: 25 (initial_lr = max_lr / 25)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader),
                       div_factor=25)  # Start at 0.004

# If unstable at start: Increase to 50 or 100 (start even lower)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader),
                       div_factor=100)  # Start at 0.001

3. Adjust final_div_factor (final LR):

# Default: 10000 (final_lr = initial_lr / 10000)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader),
                       final_div_factor=10000)  # End at ~4e-7

# If unstable at end: Decrease to 1000 (end at higher LR)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader),
                       final_div_factor=1000)  # End at ~4e-6

4. Add gradient clipping:

# In training loop (train_step here runs the forward pass only and
# returns the loss; backward happens explicitly below)
for batch in train_loader:
    optimizer.zero_grad()
    loss = train_step(model, batch, optimizer)
    loss.backward()

    # Clip gradients to prevent instability
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    optimizer.step()
    scheduler.step()

5. Consider OneCycle may not be right for your problem:

  • Very deep networks (>100 layers): May be too unstable for OneCycle's aggressive schedule
  • Large models (>100M params): May need gentler schedule (Cosine + warmup)
  • Sensitive architectures (some transformers): OneCycle's rapid LR changes can destabilize

Alternative: Use CosineAnnealing + warmup for more stable training:

# More stable alternative to OneCycle
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=15, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])

LinearLR (Warmup)

Use When:

  • Need warmup at training start
  • Ramping up LR gradually over first few epochs
  • Combining with another scheduler (SequentialLR)

How It Works:

  • Linearly interpolates LR from start_factor to end_factor
  • Typically used for warmup: start_factor=0.01, end_factor=1.0

Implementation:

# Standalone linear warmup
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.01,  # Start at 1% of base LR
    end_factor=1.0,     # End at 100% of base LR
    total_iters=5       # Over 5 epochs
)

# More common: Combine with main scheduler
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.01,
    total_iters=5
)

main = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=95
)

scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup, main],
    milestones=[5]  # Switch after 5 epochs
)

# Training loop
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Example Schedule (base_lr=0.1):

  • Epoch 0: LR = 0.001 (1%)
  • Epoch 1: LR = 0.0208 (20.8%)
  • Epoch 2: LR = 0.0406 (40.6%)
  • Epoch 3: LR = 0.0604 (60.4%)
  • Epoch 4: LR = 0.0802 (80.2%)
  • Epoch 5: LR = 0.1 (100%, then switch to main scheduler)

Best For: Warmup phase for transformers and large models


ExponentialLR (Continuous Decay)

Use When:

  • Want smooth, continuous decay
  • Simpler alternative to Cosine
  • Prefer exponential over linear decay

How It Works:

  • Multiply LR by gamma every epoch
  • LR(epoch) = initial_lr * gamma^epoch

Implementation:

scheduler = torch.optim.lr_scheduler.ExponentialLR(
    optimizer,
    gamma=0.95  # Multiply by 0.95 each epoch
)

# Training loop
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Example Schedule (initial_lr=0.1, gamma=0.95):

  • Epoch 0: LR = 0.1
  • Epoch 10: LR = 0.0599
  • Epoch 50: LR = 0.0077
  • Epoch 100: LR ≈ 0.00059

Tuning gamma:

  • Want 10x decay over 100 epochs: gamma = 0.977 (0.1^(1/100))
  • Want 100x decay over 100 epochs: gamma = 0.955 (0.01^(1/100))
  • General formula: gamma = (target_lr / initial_lr)^(1/epochs)
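
The general formula is easy to sanity-check in a few lines (the helper name is illustrative):

def gamma_for_decay(initial_lr, target_lr, epochs):
    # gamma = (target_lr / initial_lr) ** (1 / epochs)
    return (target_lr / initial_lr) ** (1 / epochs)

print(gamma_for_decay(0.1, 0.01, 100))   # ≈ 0.977 (10x decay over 100 epochs)
print(gamma_for_decay(0.1, 0.001, 100))  # ≈ 0.955 (100x decay over 100 epochs)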

Pros:

  • Very smooth decay
  • Simple to implement

Cons:

  • Hard to intuit gamma value for desired final LR
  • Less popular than Cosine (Cosine is better default)

Best For: Cases where you want exponential decay specifically


LambdaLR (Custom Schedules)

Use When:

  • Need custom schedule not provided by standard schedulers
  • Implementing paper-specific schedule
  • Advanced use cases (e.g., transformer inverse sqrt schedule)

How It Works:

  • Provide function that computes LR multiplier for each epoch
  • LR(epoch) = initial_lr * lambda(epoch)

Implementation:

# Example: Warmup then constant
def warmup_lambda(epoch):
    if epoch < 5:
        return (epoch + 1) / 5  # Linear warmup
    else:
        return 1.0  # Constant after warmup

scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=warmup_lambda
)

# Example: Transformer inverse square root schedule
# (step-based: call scheduler.step() every BATCH; with the optimizer's
# base LR set to 1.0, the LR peaks at warmup_steps ** -0.5 ≈ 0.016)
def transformer_schedule(step):
    warmup_steps = 4000
    step = step + 1
    return min(step ** (-0.5), step * warmup_steps ** (-1.5))

scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=transformer_schedule
)

# Example: Polynomial decay
def polynomial_decay(epoch):
    return (1 - epoch / 100) ** 0.9  # Decay to 0 at epoch 100

scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=polynomial_decay
)

Best For: Custom schedules, implementing specific papers, advanced users


4. Warmup Strategies - CRITICAL FOR TRANSFORMERS

Why Warmup is Essential

Problem at Training Start:

  • Weights are randomly initialized
  • Gradients can be very large and unstable
  • BatchNorm statistics are uninitialized
  • High LR can cause immediate divergence (NaN loss)

Solution: Gradual LR Increase

  • Start with very low LR (1% of target)
  • Linearly increase to target LR over first few epochs
  • Allows model to stabilize before aggressive learning

Quantitative Impact:

  • Transformers WITHOUT warmup: Often diverge or train very unstably
  • Transformers WITH warmup: Stable training, better final performance
  • Vision models: Warmup improves stability, sometimes +0.5-1% accuracy

When Warmup is MANDATORY

ALWAYS use warmup when:

Training transformers (ViT, BERT, GPT, T5, etc.)

  • Transformers REQUIRE warmup - not optional
  • Without warmup, training often diverges
  • Standard practice in all transformer papers

Large batch sizes (>512)

  • Large batches → larger effective learning rate
  • Warmup prevents early instability
  • Standard for distributed training

High initial learning rates

  • If starting with LR > 0.001, use warmup
  • Warmup allows higher peak LR safely

Training from scratch (not fine-tuning)

  • Random initialization needs gentle start
  • Fine-tuning can often skip warmup (weights already good)

Usually use warmup when:

✅ Large models (>100M parameters)
✅ Using AdamW optimizer (common with transformers)
✅ Following modern training recipes

May skip warmup when:

❌ Fine-tuning pretrained models (weights already trained)
❌ Small learning rates (< 0.0001)
❌ Small models (<10M parameters)
❌ Established recipe without warmup (e.g., some CNN papers)


Warmup Implementation Patterns

Pattern 1: Linear Warmup + Cosine Decay (Most Common)

import torch.optim.lr_scheduler as lr_scheduler

# Warmup for 5 epochs
warmup = lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.01,  # Start at 1% of base LR
    end_factor=1.0,     # End at 100% of base LR
    total_iters=5       # Over 5 epochs
)

# Cosine decay for remaining 95 epochs
cosine = lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=95,          # 95 epochs after warmup
    eta_min=1e-5       # Final LR
)

# Combine sequentially
scheduler = lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup, cosine],
    milestones=[5]  # Switch to cosine after epoch 5
)

# Training loop
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Schedule Visualization (base_lr=0.001):

  • Epochs 0-4: Linear ramp from 0.00001 → 0.001 (warmup)
  • Epochs 5-99: Cosine decay from 0.001 → 0.00001

Use For: Vision transformers, modern CNNs, most large-scale training


Pattern 2: Linear Warmup + MultiStepLR

# Warmup for 5 epochs
warmup = lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.01,
    total_iters=5
)

# Step decay at 30, 60, 90
steps = lr_scheduler.MultiStepLR(
    optimizer,
    milestones=[30, 60, 90],
    gamma=0.1
)

# Combine
scheduler = lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup, steps],
    milestones=[5]
)

Use For: Classical CNN training with warmup


Pattern 3: Manual Warmup (More Control)

import math

def get_lr_schedule(epoch, total_epochs, base_lr, warmup_epochs=5):
    """
    Custom schedule with warmup and cosine decay.
    """
    if epoch < warmup_epochs:
        # Linear warmup
        return base_lr * (epoch + 1) / warmup_epochs
    else:
        # Cosine decay
        progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
        return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

# Training loop
for epoch in range(100):
    lr = get_lr_schedule(epoch, total_epochs=100, base_lr=0.001)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    train_one_epoch(model, train_loader, optimizer)

Use For: Custom schedules, research, maximum control


Pattern 4: Transformer-Style Warmup (Inverse Square Root)

def transformer_lr_schedule(step, d_model, warmup_steps):
    """
    Transformer schedule from "Attention is All You Need".
    LR increases during warmup, then decreases proportionally to inverse sqrt of step.
    """
    step = step + 1  # 1-indexed
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

# Set the optimizer's base LR to 1.0 so this lambda fully determines the LR
scheduler = lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: transformer_lr_schedule(step, d_model=512, warmup_steps=4000)
)

# Training loop - NOTE: step every BATCH for this schedule
for epoch in range(epochs):
    for batch in train_loader:
        train_step(model, batch, optimizer)
        optimizer.step()
        scheduler.step()  # Step every batch

Use For: Transformer models (BERT, GPT), following original papers


Warmup Duration Guidelines

How many warmup epochs?

  • Transformers: 5-20 epochs (or 5-10% of total training)
  • Vision models: 5-10 epochs
  • Very large models (>1B params): 10-20 epochs
  • Small models: 3-5 epochs

Rule of thumb: 5-10% of total training epochs

Examples:

  • 100-epoch training: 5-10 epoch warmup
  • 20-epoch training: 2-3 epoch warmup
  • 300-epoch training: 15-30 epoch warmup
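
A tiny helper capturing the rule of thumb (the function and its 10% default are illustrative assumptions):

def warmup_epochs(total_epochs, frac=0.1):
    # 5-10% of total training, but at least 1 epoch
    return max(1, round(frac * total_epochs))

print(warmup_epochs(100))  # 10
print(warmup_epochs(20))   # 2
print(warmup_epochs(300))  # 30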

"But My Transformer Trained Fine Without Warmup"

Some users report training transformers without warmup successfully. Here's the reality:

What "fine" actually means:

  • Training didn't diverge (NaN loss) - that's a low bar
  • Got reasonable accuracy - but NOT optimal accuracy
  • One successful run doesn't mean it's optimal or reliable

What you're missing without warmup:

1. Performance gap (1-3% accuracy):

Without warmup: Training works, achieves 85% accuracy
With warmup: Same model achieves 87-88% accuracy

That 2-3% is SIGNIFICANT:

  • Difference between competitive and SOTA
  • Difference between accepted and rejected paper
  • Difference between passing and failing business metrics

2. Training stability:

Without warmup:
- Some runs diverge → need to restart with lower LR
- Sensitive to initialization seed
- Requires careful LR tuning
- Success rate: 60-80% of runs

With warmup:
- Stable training → consistent results
- Robust to initialization
- Wider stable LR range
- Success rate: 95-100% of runs

3. Hyperparameter sensitivity:

Without warmup:

  • Very sensitive to initial LR choice (0.001 works, 0.0015 diverges)
  • Sensitive to batch size
  • Sensitive to optimizer settings

With warmup:

  • More forgiving LR range (0.0005-0.002 all work)
  • Less sensitive to batch size
  • Robust optimizer configuration

Empirical Evidence - Published Papers:

Check transformer papers - ALL use warmup:

| Model | Paper | Warmup |
|-------|-------|--------|
| ViT | Dosovitskiy et al., 2020 | ✅ Linear, 10k steps |
| DeiT | Touvron et al., 2021 | ✅ Linear, 5 epochs |
| Swin | Liu et al., 2021 | ✅ Linear, 20 epochs |
| BERT | Devlin et al., 2018 | ✅ Linear, 10k steps |
| GPT-2 | Radford et al., 2019 | ✅ Linear warmup |
| GPT-3 | Brown et al., 2020 | ✅ Linear warmup |
| T5 | Raffel et al., 2020 | ✅ Inverse sqrt warmup |

Every competitive transformer model uses warmup - there's a reason.

"But I got 85% accuracy without warmup!"

Great! Now try with warmup and see if you get 87-88%. You probably will.

The cost-benefit analysis:

# Cost: two lines of code
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
scheduler = SequentialLR(optimizer, [warmup, main], [5])

# Benefit:
# - 1-3% better accuracy
# - More stable training
# - Higher success rate
# - Wider stable hyperparameter range

Recommendation:

  1. Run ablation study: Train your model with and without warmup
  2. Compare: Final test accuracy, training stability, number of failed runs
  3. You'll find warmup gives better results with minimal complexity

Bottom line: Just because something "works" doesn't mean it's optimal. Warmup is standard practice for transformers because it consistently improves results.


5. LR Finder - Finding Optimal Initial LR

What is LR Finder?

Method from Leslie Smith (2015): Cyclical Learning Rates paper

Core Idea:

  1. Start with very small LR (1e-8)
  2. Gradually increase LR (multiply by ~1.1 each batch)
  3. Train for a few hundred steps, record loss at each LR
  4. Plot loss vs LR
  5. Choose LR where loss decreases fastest (steepest descent)

Why It Works:

  • Too low LR: Loss decreases very slowly
  • Optimal LR: Loss decreases rapidly (steepest slope)
  • Too high LR: Loss plateaus or increases (instability)

Typical Findings:

  • Loss decreases fastest at some LR (e.g., 0.01)
  • Loss starts increasing at higher LR (e.g., 0.1)
  • Choose LR slightly below fastest descent point (e.g., 0.003-0.01)

LR Finder Implementation

import copy

import torch
import matplotlib.pyplot as plt
import numpy as np

def find_lr(model, train_loader, optimizer, loss_fn, device,
            start_lr=1e-8, end_lr=10, num_iter=100, smooth_f=0.05):
    """
    LR Finder: Sweep learning rates and plot loss curve.

    Args:
        model: PyTorch model
        train_loader: Training data loader
        optimizer: Optimizer (will be modified)
        loss_fn: Loss function
        device: Device to train on
        start_lr: Starting learning rate (default: 1e-8)
        end_lr: Ending learning rate (default: 10)
        num_iter: Number of iterations (default: 100)
        smooth_f: Smoothing factor for loss (default: 0.05)

    Returns:
        lrs: List of learning rates tested
        losses: List of losses at each LR
    """
    # Save initial model state to restore later
    # (deepcopy, since state_dict() returns references that training mutates in place)
    model.train()
    initial_state = copy.deepcopy(model.state_dict())

    # Calculate LR multiplier for exponential increase
    lr_mult = (end_lr / start_lr) ** (1 / num_iter)

    lrs = []
    losses = []
    best_loss = float('inf')
    avg_loss = 0

    lr = start_lr

    # Iterate through training data
    iterator = iter(train_loader)
    for iteration in range(num_iter):
        try:
            data, target = next(iterator)
        except StopIteration:
            # Restart iterator if we run out of data
            iterator = iter(train_loader)
            data, target = next(iterator)

        # Set learning rate
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr

        # Forward pass
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = loss_fn(output, target)

        # Compute smoothed loss (exponential moving average)
        if iteration == 0:
            avg_loss = loss.item()
        else:
            avg_loss = smooth_f * loss.item() + (1 - smooth_f) * avg_loss

        # Record
        lrs.append(lr)
        losses.append(avg_loss)

        # Track best loss
        if avg_loss < best_loss:
            best_loss = avg_loss

        # Stop if loss explodes (>4x best loss)
        if avg_loss > 4 * best_loss:
            print(f"Stopping early at iteration {iteration}: loss exploded")
            break

        # Backward pass
        loss.backward()
        optimizer.step()

        # Increase learning rate
        lr *= lr_mult
        if lr > end_lr:
            break

    # Restore model to initial state
    model.load_state_dict(initial_state)

    # Plot results
    plt.figure(figsize=(10, 6))
    plt.plot(lrs, losses)
    plt.xscale('log')
    plt.xlabel('Learning Rate (log scale)')
    plt.ylabel('Loss')
    plt.title('LR Finder')
    plt.grid(True, alpha=0.3)

    # Mark suggested LR (a few iterations before the loss minimum)
    min_loss_idx = np.argmin(losses)
    suggested_lr = lrs[max(0, min_loss_idx - 5)]  # A bit before minimum
    plt.axvline(suggested_lr, color='red', linestyle='--',
                label=f'Suggested LR: {suggested_lr:.2e}')
    plt.legend()
    plt.show()

    print(f"\nLR Finder Results:")
    print(f"  Minimum loss at LR: {lrs[min_loss_idx]:.2e}")
    print(f"  Suggested starting LR: {suggested_lr:.2e}")
    print(f"  (Choose LR where loss decreases fastest, before minimum)")

    return lrs, losses


def suggest_lr_from_finder(lrs, losses):
    """
    Suggest optimal learning rate from LR finder results.

    Strategy: Find LR where loss gradient is steepest (fastest decrease).
    """
    # Compute gradient of loss w.r.t. log(LR)
    log_lrs = np.log10(lrs)
    gradients = np.gradient(losses, log_lrs)

    # Find steepest descent (most negative gradient)
    steepest_idx = np.argmin(gradients)

    # Suggested LR is at steepest point or slightly before
    suggested_lr = lrs[steepest_idx]

    return suggested_lr

Using LR Finder

Basic Usage:

# Setup model, optimizer, loss
model = YourModel().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # LR will be overridden
loss_fn = torch.nn.CrossEntropyLoss()

# Run LR finder
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)

# Manually inspect plot and choose LR
# Look for: steepest descent point (fastest loss decrease)
# Typically: 10x lower than loss minimum

# Example: If minimum is at 0.1, choose 0.01 as starting LR
base_lr = 0.01  # Based on plot inspection

Automated LR Selection:

# Run LR finder
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)

# Get suggested LR
suggested_lr = suggest_lr_from_finder(lrs, losses)

# Use suggested LR
optimizer = torch.optim.SGD(model.parameters(), lr=suggested_lr)

Using with OneCycleLR:

# Find optimal LR
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
optimal_lr = suggest_lr_from_finder(lrs, losses)  # e.g., 0.01

# OneCycleLR: Use 5-10x optimal as max_lr
max_lr = optimal_lr * 10  # e.g., 0.1

scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=max_lr,
    steps_per_epoch=len(train_loader),
    epochs=20
)

Interpreting LR Finder Results

Typical Plot Patterns:

Loss
|
|         X  <-- Loss explodes (LR too high)
|        X
|       X
|      X     <-- Loss minimum (still too high)
|     X
|    X       <-- CHOOSE HERE (steepest descent)
|   X
|  X
| X
|X___________
  1e-8  1e-4  1e-2  0.1  1.0  10
              Learning Rate

How to Choose:

  1. Steepest Descent (BEST):

    • Find where loss decreases fastest (steepest downward slope)
    • This is optimal LR for rapid convergence
    • Example: If steepest at 0.01, choose 0.01
  2. Before Minimum (SAFE):

    • Find minimum loss LR (e.g., 0.1)
    • Choose 10x lower (e.g., 0.01)
    • More conservative, safer choice
  3. Avoid:

    • Don't choose minimum itself (often too high)
    • Don't choose where loss is flat (too low, slow progress)
    • Don't choose where loss increases (way too high)

Guidelines:

  • For SGD: Choose at steepest descent
  • For Adam: Choose 10x below steepest (Adam more sensitive)
  • For OneCycle: Use steepest as optimal, 5-10x as max_lr
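
These rules of thumb map directly onto the finder output; the scaling factors below just restate the list above (heuristics, not part of the finder):

# Using find_lr / suggest_lr_from_finder defined earlier
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
steepest = suggest_lr_from_finder(lrs, losses)

sgd_lr = steepest                 # SGD: steepest-descent LR directly
adam_lr = steepest / 10           # Adam: more sensitive, back off 10x
onecycle_max_lr = steepest * 10   # OneCycle: steepest as optimal, 5-10x as max_lr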

When to Use LR Finder

Use LR Finder When:

✅ Starting new project (unknown optimal LR)
✅ New architecture or dataset
✅ Tuning OneCycleLR (finding max_lr)
✅ Transitioning between optimizers
✅ Having training instability issues

Can Skip When:

❌ Following established paper recipe (LR already known)
❌ Fine-tuning (small LR like 1e-5 typically works)
❌ Very constrained time/resources
❌ Using adaptive methods (ReduceLROnPlateau)

Best Practice:

  • Run LR finder once at project start
  • Use found LR for all subsequent runs
  • Re-run if changing optimizer, architecture, or batch size significantly

6. Scheduler Selection Guide

Selection Flowchart

1. What's your training duration?

  • <10 epochs: Constant LR or simple linear decay
  • 10-30 epochs: OneCycleLR (fast) or CosineAnnealingLR
  • >30 epochs: CosineAnnealingLR or MultiStepLR

2. What's your model type?

  • Transformer (ViT, BERT, GPT): CosineAnnealing + WARMUP (mandatory)
  • CNN (ResNet, EfficientNet): MultiStepLR or CosineAnnealing + optional warmup
  • Small model: Simpler schedulers (StepLR) or constant LR

3. Do you know optimal schedule?

  • Yes (from paper): Use paper's schedule (MultiStepLR usually)
  • No (exploring): ReduceLROnPlateau or CosineAnnealing
  • Want fast results: OneCycleLR + LR finder

4. What's your compute budget?

  • High budget (100+ epochs): CosineAnnealing or MultiStepLR
  • Low budget (10-20 epochs): OneCycleLR
  • Adaptive budget: ReduceLROnPlateau (stops when plateau)
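
The flowchart condenses into a few lines of code (a heuristic sketch; the function name and return strings are illustrative):

def pick_scheduler(total_epochs, is_transformer=False, have_paper_recipe=False):
    if have_paper_recipe:
        return "paper's schedule (often MultiStepLR)"
    if total_epochs < 10:
        return "constant LR or simple linear decay"
    if is_transformer:
        return "LinearLR warmup + CosineAnnealingLR (warmup mandatory)"
    if total_epochs <= 30:
        return "OneCycleLR (tune max_lr with the LR finder)"
    return "CosineAnnealingLR (+ optional warmup), or MultiStepLR"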

Paper Recipe vs Modern Best Practices

If goal is EXACT REPRODUCTION:

Use paper's exact schedule (down to every detail):

# Example: Reproducing ResNet paper (He et al., 2015)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)
# No warmup (paper didn't use it)
# Train for 100 epochs

Rationale:

  • Reproduce results exactly
  • Enable apples-to-apples comparison
  • Validate paper's claims
  • Establish baseline before improvements

If goal is BEST PERFORMANCE:

Use modern recipe (benefit from years of community learning):

# Modern equivalent: ResNet with modern practices
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
# Train for 100 epochs

Rationale:

  • Typically +0.5-2% better accuracy than original paper
  • More stable training
  • Reflects 5-10 years of community improvements
  • SOTA competitive performance

Evolution of LR Scheduling Practices:

Early Deep Learning (2012-2016):

  • Scheduler: StepLR with manual milestones
  • Warmup: Not used (not yet discovered)
  • Optimizer: SGD with momentum
  • Examples: AlexNet, VGG, ResNet, Inception

Mid Period (2017-2019):

  • Scheduler: CosineAnnealing introduced, OneCycleLR popular
  • Warmup: Starting to be used for large batch training
  • Optimizer: SGD still dominant, Adam increasingly common
  • Examples: ResNeXt, DenseNet, MobileNet

Modern Era (2020-2025):

  • Scheduler: CosineAnnealing default, OneCycle for fast training
  • Warmup: Standard practice (mandatory for transformers)
  • Optimizer: AdamW increasingly preferred for transformers
  • Examples: ViT, EfficientNet, ConvNeXt, Swin, DeiT

Practical Workflow:

Step 1: Reproduce paper recipe

# Use exact paper settings
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)
# Should match paper's reported accuracy (e.g., 76.5%)

Step 2: Validate reproduction

  • If you get 76.5% (matches paper): ✅ Reproduction successful
  • If you get 74% (2% worse): ❌ Implementation bug, fix first
  • If you get 78% (2% better): ✅ Great! Proceed to modern recipe

Step 3: Try modern recipe

# Add warmup + cosine
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
# Expect +0.5-2% improvement (e.g., 77-78.5%)

Step 4: Compare results

| Version | Accuracy | Notes |
|---------|----------|-------|
| Paper recipe | 76.5% | Baseline (reproduces paper) |
| Modern recipe | 78.0% | +1.5% from warmup + cosine |

When to Use Which:

Use Paper Recipe:

  • Publishing reproduction study
  • Comparing to paper's baseline
  • Validating implementation correctness
  • Research requiring exact reproducibility

Use Modern Recipe:

  • Building production system (want best performance)
  • Competing in benchmark (need SOTA results)
  • Publishing new method (should use modern baseline)
  • Limited compute (modern practices more efficient)

Trade-off Table:

| Aspect | Paper Recipe | Modern Recipe |
|--------|--------------|---------------|
| Reproducibility | ✅ Exact | ⚠️ Better but different |
| Performance | ⚠️ Good (for its time) | ✅ Better (+0.5-2%) |
| Comparability | ✅ To paper | ✅ To SOTA |
| Compute efficiency | ⚠️ May be suboptimal | ✅ Modern optimizations |
| Training stability | ⚠️ Variable | ✅ More stable (warmup) |

Bottom Line:

Both are valid depending on your goal:

  • Research/reproduction: Start with paper recipe
  • Production/competition: Use modern recipe
  • Best practice: Validate with paper recipe, deploy with modern recipe

Domain-Specific Recommendations

Image Classification (CNNs)

Standard Recipe (ResNet, VGG):

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)
# Train for 100 epochs

Modern Recipe (EfficientNet, RegNet):

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-5)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
# Train for 100 epochs

Vision Transformers (ViT, Swin, DeiT)

Standard Recipe:

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=10)
cosine = CosineAnnealingLR(optimizer, T_max=290, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [10])
# Train for 300 epochs
# WARMUP IS MANDATORY

NLP Transformers (BERT, GPT, T5)

Standard Recipe:

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.01)

# Linear warmup + linear decay
def lr_lambda(step):
    warmup_steps = 10000
    total_steps = 100000
    if step < warmup_steps:
        return step / warmup_steps
    else:
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Step every batch, not epoch
# WARMUP IS MANDATORY

Object Detection (Faster R-CNN, YOLO)

Standard Recipe:

optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[16, 22], gamma=0.1)
# Train for 26 epochs

Fast Training (Limited Compute)

FastAI Recipe:

# Run LR finder first
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
optimal_lr = suggest_lr_from_finder(lrs, losses)
max_lr = optimal_lr * 10

optimizer = torch.optim.SGD(model.parameters(), lr=max_lr)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=max_lr,
    steps_per_epoch=len(train_loader),
    epochs=20,
    pct_start=0.3
)
# Train for 20 epochs
# Step every batch

7. Common Scheduling Pitfalls

Pitfall 1: No Warmup for Transformers

WRONG:

# Training Vision Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
# ❌ No warmup - training will be very unstable or diverge

RIGHT:

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
# ✅ Warmup prevents early instability

Why It Matters:

  • Transformers with high LR at start → NaN loss, divergence
  • Random initialization needs gradual LR ramp
  • 5-10 epoch warmup is STANDARD practice

How to Detect:

  • Loss is NaN or explodes in first few epochs
  • Training very unstable early, stabilizes later
  • Gradients extremely large at start

Pitfall 2: Wrong scheduler.step() Placement

WRONG (Most Schedulers):

for epoch in range(epochs):
    for batch in train_loader:
        loss = train_step(model, batch, optimizer)
        optimizer.step()
        scheduler.step()  # ❌ Stepping every batch, not every epoch

RIGHT:

for epoch in range(epochs):
    for batch in train_loader:
        loss = train_step(model, batch, optimizer)
        optimizer.step()

    scheduler.step()  # ✅ Step AFTER each epoch

EXCEPTION (OneCycleLR):

for epoch in range(epochs):
    for batch in train_loader:
        loss = train_step(model, batch, optimizer)
        optimizer.step()
        scheduler.step()  # ✅ OneCycle steps EVERY BATCH

Why It Matters:

  • CosineAnnealing with T_max=100 expects 100 steps (epochs)
  • Stepping every batch: If 390 batches/epoch, LR decays in <1 epoch
  • LR reaches minimum way too fast

How to Detect:

  • LR decays to minimum in first epoch
  • Print LR each step: print(optimizer.param_groups[0]['lr'])
  • Check if LR changes every batch (wrong) vs every epoch (right)

Rule:

  • Most schedulers (Step, Cosine, Exponential): Step per epoch
  • OneCycleLR only: Step per batch
  • ReduceLROnPlateau: Step per epoch with validation metric
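
The rule as a compact lookup (a summary of the list above, not an API):

STEP_GRANULARITY = {
    "StepLR": "per epoch",
    "MultiStepLR": "per epoch",
    "ExponentialLR": "per epoch",
    "CosineAnnealingLR": "per epoch",
    "OneCycleLR": "per batch",
    "ReduceLROnPlateau": "per epoch, with scheduler.step(val_metric)",
}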

Pitfall 3: scheduler.step() Before optimizer.step()

WRONG:

loss.backward()
scheduler.step()      # ❌ Wrong order
optimizer.step()

RIGHT:

loss.backward()
optimizer.step()      # ✅ Update weights first
scheduler.step()      # Then update LR

Why It Matters:

  • Scheduler updates LR based on current epoch/step
  • Should update weights with current LR, THEN move to next LR
  • Wrong order = off-by-one error in schedule

How to Detect:

  • Recent PyTorch versions emit a UserWarning when scheduler.step() is called before optimizer.step()
  • Otherwise the off-by-one is subtle and hard to notice
  • Best practice: always call optimizer.step(), then scheduler.step()

Pitfall 4: Not Passing Metric to ReduceLROnPlateau

WRONG:

scheduler = ReduceLROnPlateau(optimizer)
for epoch in range(epochs):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    scheduler.step()  # ❌ No metric passed

RIGHT:

scheduler = ReduceLROnPlateau(optimizer, mode='min')
for epoch in range(epochs):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    val_loss = validate(model, val_loader)
    scheduler.step(val_loss)  # ✅ Pass validation metric

Why It Matters:

  • ReduceLROnPlateau NEEDS metric to detect plateau
  • Without metric, scheduler doesn't know when to reduce LR
  • Will get error or incorrect behavior

How to Detect:

  • Error message: "ReduceLROnPlateau needs a metric"
  • LR never reduces even when training plateaus

Pitfall 5: Using OneCycle for Long Training

SUBOPTIMAL:

# Training for 200 epochs
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=200, steps_per_epoch=len(train_loader))
# ❌ OneCycle designed for shorter training (10-30 epochs)

BETTER:

# For long training, use Cosine
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=10)
cosine = CosineAnnealingLR(optimizer, T_max=190, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [10])
# ✅ Cosine better suited for long training

Why It Matters:

  • OneCycle's aggressive up-then-down profile works for short training
  • For long training, gentler cosine decay more stable
  • OneCycle typically used for 10-30 epochs in FastAI style

When to Use Each:

  • OneCycle: 10-30 epochs, limited compute, want fast results
  • Cosine: 50+ epochs, full training, want best final performance

Pitfall 6: Not Tuning max_lr for OneCycle

WRONG:

# Just guessing max_lr
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20, steps_per_epoch=len(train_loader))
# ❌ Random max_lr without tuning
# Might be too high (unstable) or too low (slow)

RIGHT:

# Step 1: Run LR finder
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
optimal_lr = suggest_lr_from_finder(lrs, losses)  # e.g., 0.01

# Step 2: Use 5-10x optimal as max_lr
max_lr = optimal_lr * 10  # e.g., 0.1

scheduler = OneCycleLR(optimizer, max_lr=max_lr, epochs=20, steps_per_epoch=len(train_loader))
# ✅ Tuned max_lr based on LR finder

Why It Matters:

  • OneCycle is VERY sensitive to max_lr
  • Too high: Training unstable, loss explodes
  • Too low: Slow training, underperforms
  • LR finder finds optimal, use 5-10x as max_lr

How to Tune:

  1. Run LR finder (see LR Finder section)
  2. Find optimal LR (steepest descent point)
  3. Use 5-10x optimal as max_lr for OneCycle
  4. If still unstable, reduce max_lr (try 3x, 2x)

Pitfall 7: Forgetting to Adjust T_max After Adding Warmup

WRONG:

# Want 100 epoch training
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=100)  # ❌ Should be 95
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])

RIGHT:

# Want 100 epoch training
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95)  # ✅ 100 - 5 = 95
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])

Why It Matters:

  • Total schedule length is warmup + main schedule
  • With a 5-epoch warmup and cosine T_max=100, the cosine phase only completes at epoch 105, so the LR never reaches eta_min within a 100-epoch run
  • T_max should be (total_epochs - warmup_epochs)

How to Calculate:

total_epochs = 100
warmup_epochs = 5
T_max = total_epochs - warmup_epochs  # 95

Pitfall 8: Using Same LR for All Param Groups

SUBOPTIMAL:

# Fine-tuning: applying same LR to all layers
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# ❌ Backbone and head both use 1e-3

BETTER:

# Fine-tuning: lower LR for pretrained backbone, higher for new head
optimizer = torch.optim.Adam([
    {'params': model.backbone.parameters(), 'lr': 1e-4},  # Lower LR for pretrained
    {'params': model.head.parameters(), 'lr': 1e-3}       # Higher LR for random init
])
scheduler = CosineAnnealingLR(optimizer, T_max=100)
# ✅ Scheduler applies to all param groups proportionally

Why It Matters:

  • Pretrained layers need smaller LR (already trained)
  • New layers need higher LR (random initialization)
  • Schedulers work with param groups automatically

Note: Schedulers multiply all param groups by same factor, preserving relative ratios
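
A quick check of that behavior (stand-in modules; in practice these would be your pretrained backbone and new head):

import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR

backbone, head = nn.Linear(8, 8), nn.Linear(8, 2)
optimizer = torch.optim.Adam([
    {'params': backbone.parameters(), 'lr': 1e-4},
    {'params': head.parameters(), 'lr': 1e-3},
])
scheduler = CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(3):
    # Both groups decay by the same factor, so the 10x ratio is preserved
    print([f"{g['lr']:.2e}" for g in optimizer.param_groups])
    optimizer.step()
    scheduler.step()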


Pitfall 9: Not Monitoring LR During Training

PROBLEM:

  • Schedule not behaving as expected
  • Hard to debug without visibility into LR

SOLUTION:

# Log LR every epoch
for epoch in range(epochs):
    current_lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch}: LR = {current_lr:.6f}")

    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

# Or use TensorBoard
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()

for epoch in range(epochs):
    current_lr = optimizer.param_groups[0]['lr']
    writer.add_scalar('Learning Rate', current_lr, epoch)

    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Best Practice:

  • Always log LR to console or TensorBoard
  • Plot LR schedule before training (see next section)
  • Verify schedule matches expectations

Pitfall 10: Not Validating Schedule Before Training

PROBLEM:

  • Run full training, discover schedule was wrong
  • Waste compute on incorrect schedule

SOLUTION: Dry-run the schedule:

def plot_schedule(scheduler_fn, num_epochs):
    """
    Plot LR schedule before training to verify it's correct.
    """
    # Create dummy model and optimizer
    model = torch.nn.Linear(1, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = scheduler_fn(optimizer)

    lrs = []
    for epoch in range(num_epochs):
        lrs.append(optimizer.param_groups[0]['lr'])
        optimizer.step()  # Dummy step
        scheduler.step()

    # Plot
    plt.figure(figsize=(10, 6))
    plt.plot(lrs)
    plt.xlabel('Epoch')
    plt.ylabel('Learning Rate')
    plt.title('LR Schedule')
    plt.grid(True, alpha=0.3)
    plt.show()

# Usage
def my_scheduler(opt):
    warmup = LinearLR(opt, start_factor=0.01, total_iters=5)
    cosine = CosineAnnealingLR(opt, T_max=95)
    return SequentialLR(opt, [warmup, cosine], [5])

plot_schedule(my_scheduler, num_epochs=100)
# Verify plot looks correct BEFORE training

Best Practice:

  • Plot schedule before every major training run
  • Verify warmup duration, decay shape, final LR
  • Catch mistakes early (T_max wrong, step placement, etc.)

8. Modern Best Practices (2024-2025)

Vision Models (CNNs, ResNets, ConvNeXt)

Standard Recipe:

# Optimizer
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4
)

# Scheduler: MultiStepLR or CosineAnnealing
# Option 1: MultiStepLR (classical)
scheduler = MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)

# Option 2: CosineAnnealing (modern)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])

# Training
epochs = 100
for epoch in range(epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Key Points:

  • SGD with momentum (0.9) is standard for CNNs
  • LR = 0.1 for batch size 256 (scale linearly for other batch sizes)
  • Warmup optional but beneficial (5 epochs)
  • CosineAnnealing increasingly preferred over MultiStepLR

Vision Transformers (ViT, Swin, DeiT)

Standard Recipe:

# Optimizer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=0.05,
    betas=(0.9, 0.999)
)

# Scheduler: MUST include warmup
warmup_epochs = 10
cosine_epochs = 290
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs)
cosine = CosineAnnealingLR(optimizer, T_max=cosine_epochs, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [warmup_epochs])

# Training
epochs = 300
for epoch in range(epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Key Points:

  • AdamW optimizer (not SGD)
  • Warmup is MANDATORY (10-20 epochs)
  • Long training (300 epochs typical)
  • LR = 1e-3 for batch size 512 (scale for other sizes)
  • Cosine decay to very small LR (1e-5)

Why Warmup is Critical for ViT:

  • Self-attention layers highly sensitive to initialization
  • High LR at start causes gradient explosion
  • Warmup allows attention patterns to stabilize

NLP Transformers (BERT, GPT, T5)

Standard Recipe:

# Optimizer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-4,
    weight_decay=0.01,
    betas=(0.9, 0.999)
)

# Scheduler: Linear warmup + linear decay (or inverse sqrt)
total_steps = len(train_loader) * epochs
warmup_steps = int(0.1 * total_steps)  # 10% warmup

def lr_lambda(step):
    if step < warmup_steps:
        return step / warmup_steps
    else:
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)

# Training: step EVERY BATCH
for epoch in range(epochs):
    for batch in train_loader:
        train_step(model, batch, optimizer)
        optimizer.step()
        scheduler.step()  # Step every batch, not epoch

Key Points:

  • AdamW optimizer
  • Warmup is MANDATORY (typically 10% of training)
  • Linear warmup + linear decay (BERT, GPT-2 style)
  • Step scheduler EVERY BATCH (not every epoch)
  • LR typically 1e-4 to 5e-4

Alternative: Inverse Square Root (Original Transformer):

d_model = 512  # model hidden size (assumed here; match your model's embedding dim)

def transformer_schedule(step):
    warmup_steps = 4000
    step = step + 1
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

scheduler = LambdaLR(optimizer, transformer_schedule)

Object Detection (Faster R-CNN, YOLO, DETR)

Standard Recipe (Two-stage detectors):

# Optimizer
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.02,
    momentum=0.9,
    weight_decay=1e-4
)

# Scheduler: MultiStepLR with short schedule
scheduler = MultiStepLR(optimizer, milestones=[16, 22], gamma=0.1)

# Training
epochs = 26  # Shorter than classification
for epoch in range(epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Standard Recipe (Transformer detectors like DETR):

# Optimizer
optimizer = torch.optim.AdamW(
    [
        {'params': model.backbone.parameters(), 'lr': 1e-5},  # Lower for backbone
        {'params': model.transformer.parameters(), 'lr': 1e-4}  # Higher for transformer
    ],
    weight_decay=1e-4
)

# Scheduler: Step decay
scheduler = MultiStepLR(optimizer, milestones=[200], gamma=0.1)

# Training: Long schedule for DETR
epochs = 300

Key Points:

  • Detection typically shorter training than classification
  • Lower LR (0.02 vs 0.1) due to task difficulty
  • DETR needs very long training (300 epochs)

Semantic Segmentation (U-Net, DeepLab, SegFormer)

Standard Recipe (CNN-based):

# Optimizer
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    weight_decay=1e-4
)

# Scheduler: Polynomial decay (common in segmentation)
total_epochs = 100  # define before use in the lambda

def poly_lr_lambda(epoch):
    return (1 - epoch / total_epochs) ** 0.9

scheduler = LambdaLR(optimizer, poly_lr_lambda)

# Training
for epoch in range(total_epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Key Points:

  • Polynomial decay common in segmentation (DeepLab papers)
  • Lower initial LR (0.01) than classification
  • Power of 0.9 standard

Fast Training / Limited Compute (FastAI Style)

OneCycle Recipe:

# Step 1: Find optimal LR
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
optimal_lr = suggest_lr_from_finder(lrs, losses)  # e.g., 0.01
max_lr = optimal_lr * 10  # e.g., 0.1

# Step 2: OneCycleLR
optimizer = torch.optim.SGD(model.parameters(), lr=max_lr, momentum=0.9)
scheduler = OneCycleLR(
    optimizer,
    max_lr=max_lr,
    steps_per_epoch=len(train_loader),
    epochs=20,
    pct_start=0.3,        # 30% warmup, 70% cooldown
    anneal_strategy='cos'
)

# Step 3: Train (step every batch)
for epoch in range(20):
    for batch in train_loader:
        train_step(model, batch, optimizer)
        optimizer.step()
        scheduler.step()  # Every batch

Key Points:

  • Use LR finder to tune max_lr (CRITICAL)
  • Train for fewer epochs (10-30)
  • Step scheduler every batch
  • Often achieves 90-95% of full training performance in 20-30% of time

Fine-Tuning Pretrained Models

Standard Recipe:

# Optimizer: Different LRs for backbone vs head
optimizer = torch.optim.AdamW([
    {'params': model.backbone.parameters(), 'lr': 1e-5},  # Very low for pretrained
    {'params': model.head.parameters(), 'lr': 1e-3}       # Higher for new head
])

# Scheduler: Simple cosine or even constant
# Option 1: Constant LR (fine-tuning often doesn't need scheduling)
scheduler = None

# Option 2: Gentle cosine decay
scheduler = CosineAnnealingLR(optimizer, T_max=10, eta_min=1e-6)

# Training: Short duration
epochs = 10  # Fine-tuning is quick
for epoch in range(epochs):
    train_one_epoch(model, train_loader, optimizer)
    if scheduler:
        scheduler.step()

Key Points:

  • Much lower LR for pretrained parts (1e-5)
  • Higher LR for new/random parts (1e-3)
  • Short training (3-10 epochs)
  • Scheduling often optional (constant LR works)
  • No warmup needed (weights already good)

Large Batch Training (Batch Size > 1024)

Standard Recipe:

# Linear LR scaling rule: LR scales with batch size
base_lr = 0.1  # For batch size 256
batch_size = 2048
scaled_lr = base_lr * (batch_size / 256)  # 0.8 for batch 2048

# Optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr, momentum=0.9)

# Scheduler: MUST include warmup (critical for large batch)
warmup_epochs = 5
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [warmup_epochs])

# Training
epochs = 100
for epoch in range(epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Key Points:

  • Scale LR linearly with batch size (LR = base_lr * batch_size / base_batch_size)
  • Warmup is MANDATORY for large batch (5-10 epochs minimum)
  • Longer warmup for very large batches (>4096: use 10-20 epochs)

Why Warmup Critical for Large Batch:

  • Linear scaling gives large batches a much higher starting LR
  • A high LR on freshly initialized weights causes early instability
  • Warmup ramps the LR up gradually and prevents divergence
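
To make the ramp concrete, here is a quick dry run of the warmup above on a throwaway parameter (no training involved); the exact values assume the batch-2048 recipe:

import torch
from torch.optim.lr_scheduler import LinearLR

# Dry run: watch the warmup ramp without training anything
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.SGD(params, lr=0.8)  # scaled LR for batch 2048
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)

for epoch in range(6):
    print(f"Epoch {epoch}: LR = {optimizer.param_groups[0]['lr']:.4f}")
    optimizer.step()  # keeps PyTorch's step-order warning quiet
    warmup.step()
# Ramps linearly: 0.0080 at epoch 0 up to 0.8000 by epoch 5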

Modern Defaults by Domain (2025)

| Domain         | Optimizer          | Scheduler           | Warmup            | Epochs  |
|----------------|--------------------|---------------------|-------------------|---------|
| Vision (CNN)   | SGD (momentum 0.9) | Cosine or MultiStep | Optional (5)      | 100-200 |
| Vision (ViT)   | AdamW              | Cosine              | MANDATORY (10-20) | 300     |
| NLP (BERT/GPT) | AdamW              | Linear              | MANDATORY (10%)   | Varies  |
| Detection      | SGD                | MultiStep           | Optional          | 26-300  |
| Segmentation   | SGD                | Polynomial          | Optional          | 100     |
| Fast/OneCycle  | SGD                | OneCycle            | Built-in          | 10-30   |
| Fine-tuning    | AdamW              | Constant/Cosine     | No                | 3-10    |
| Large Batch    | SGD                | Cosine              | MANDATORY (5-20)  | 100-200 |
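
Most rows of this table reduce to the same warmup-then-cosine pattern. A small hypothetical helper (warmup_cosine is our name, not a library function) packages it:

from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

def warmup_cosine(optimizer, total_epochs, warmup_epochs=5, eta_min=1e-5):
    # Linear warmup followed by cosine decay over the remaining epochs
    warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs)
    cosine = CosineAnnealingLR(optimizer, T_max=total_epochs - warmup_epochs, eta_min=eta_min)
    return SequentialLR(optimizer, [warmup, cosine], milestones=[warmup_epochs])

# e.g. ViT-style: scheduler = warmup_cosine(optimizer, total_epochs=300, warmup_epochs=10)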

9. Debugging Scheduler Issues

Issue: Training Unstable / Loss Spikes

Symptoms:

  • Loss increases suddenly during training
  • NaN or Inf loss
  • Training was stable, then becomes unstable

Likely Causes:

  1. No warmup (transformers, large models)

    • Solution: Add 5-10 epoch warmup
  2. LR too high at start

    • Solution: Lower initial LR or extend warmup
  3. LR drop too sharp (MultiStepLR)

    • Solution: Use gentler scheduler (Cosine) or smaller gamma

Debugging Steps:

# 1. Print LR every epoch (and record it for plotting)
lr_history, loss_history = [], []
for epoch in range(epochs):
    current_lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch}: LR = {current_lr:.6e}")
    lr_history.append(current_lr)

    # 2. Check whether loss spikes correlate with LR changes
    loss = train_one_epoch(model, train_loader, optimizer)
    print(f"  Loss = {loss:.4f}")
    loss_history.append(loss)

    scheduler.step()

# 3. Plot LR and loss together
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(lr_history)
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.subplot(1, 2, 2)
plt.plot(loss_history)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()

Solutions:

  • Add/extend warmup: LinearLR(optimizer, start_factor=0.01, total_iters=10)
  • Lower initial LR: lr = 0.01 instead of lr = 0.1
  • Gentler scheduler: CosineAnnealingLR instead of MultiStepLR
  • Gradient clipping: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
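
These fixes compose. A minimal sketch of a stabilized loop combining extended warmup and gradient clipping (model, optimizer, loss_fn, and train_loader as in the recipes above):

import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

warmup = LinearLR(optimizer, start_factor=0.01, total_iters=10)  # extended warmup
cosine = CosineAnnealingLR(optimizer, T_max=90, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], milestones=[10])

for epoch in range(100):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        # Clip BEFORE optimizer.step() so no single batch can blow up the weights
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
    scheduler.step()  # per epoch for warmup + cosine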

Issue: Training Plateaus Too Early

Symptoms:

  • Loss stops decreasing after 20-30 epochs
  • Validation accuracy flat
  • Training seems stuck

Likely Causes:

  1. Not using scheduler (constant LR too high for current regime)

    • Solution: Add scheduler (CosineAnnealing or ReduceLROnPlateau)
  2. Scheduler reducing LR too early

    • Solution: Push back milestones or increase patience
  3. LR already too low

    • Solution: Check current LR, may need to restart with higher initial LR

Debugging Steps:

# Check current LR
current_lr = optimizer.param_groups[0]['lr']
print(f"Current LR: {current_lr:.6e}")

# If LR very low (<1e-6), plateau might be due to other issues (architecture, data, etc.)
# If LR still high (>1e-3), should reduce LR to break plateau

Solutions:

  • Add ReduceLROnPlateau: Automatically reduces when plateau detected

    scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10)
    
  • Manual LR reduction: If at epoch 30 and plateaued, reduce LR by 10x now

    for param_group in optimizer.param_groups:
        param_group['lr'] *= 0.1
    
  • Use scheduler from start next time:

    scheduler = CosineAnnealingLR(optimizer, T_max=100)
    

Issue: Poor Final Performance (Train > Val Gap)

Symptoms:

  • Training accuracy high (95%), validation lower (88%)
  • Model overfitting
  • Test performance disappointing

Likely Causes (Scheduling Related):

  1. LR not low enough at end

    • Solution: Lower eta_min or extend training
  2. Not using scheduler (constant LR doesn't fine-tune)

    • Solution: Add scheduler to reduce LR in late training
  3. Scheduler ending too early

    • Solution: Extend training or adjust T_max

Debugging Steps:

# Check final LR
final_lr = optimizer.param_groups[0]['lr']
print(f"Final LR: {final_lr:.6e}")

# Final LR should be very low (1e-5 to 1e-6)
# If final LR still high (>1e-3), model didn't fine-tune properly

Solutions:

  • Lower eta_min: CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)
  • Extend training: Train for more epochs to allow LR to decay further
  • Add late-stage fine-tuning:
    # After main training, do 10 more epochs with very low LR
    for param_group in optimizer.param_groups:
        param_group['lr'] = 1e-5
    for epoch in range(10):
        train_one_epoch(model, train_loader, optimizer)
    

Note: If train-val gap large, may also need regularization (not scheduling issue)


Issue: LR Decays Too Fast

Symptoms:

  • LR reaches minimum in first few epochs
  • Training very slow after initial epochs
  • Looks like constant very low LR

Likely Causes:

  1. scheduler.step() called every batch instead of epoch

    • Solution: Move scheduler.step() outside batch loop
  2. T_max too small (e.g., T_max=10 but training for 100 epochs)

    • Solution: Set T_max = total_epochs
  3. Using OneCycle unintentionally

    • Solution: Verify scheduler type

Debugging Steps:

# Print LR first few epochs
for epoch in range(10):
    print(f"Epoch {epoch}: LR = {optimizer.param_groups[0]['lr']:.6e}")
    for batch in train_loader:
        train_step(model, batch, optimizer)
        # scheduler.step()  # ❌ If this is here, that's the bug
    scheduler.step()  # ✅ Should be here

Solutions:

  • Move scheduler.step() to correct location (after epoch, not after batch)
  • Fix T_max: T_max = total_epochs or T_max = total_epochs - warmup_epochs
  • Verify scheduler type: print(type(scheduler))
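
A cheap way to catch all three causes before burning compute is to dry-run the schedule on a throwaway optimizer and plot it; nothing here touches your real model:

import matplotlib.pyplot as plt
import torch

dummy = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.SGD(dummy, lr=0.1)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100, eta_min=1e-5)

lrs = []
for epoch in range(100):
    lrs.append(opt.param_groups[0]['lr'])
    opt.step()    # keeps PyTorch's step-order warning quiet
    sched.step()  # step exactly where your real training loop would

plt.plot(lrs)
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.title('Schedule dry run: should reach eta_min at the FINAL epoch')
plt.show()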

Issue: OneCycleLR Not Working

Symptoms:

  • Training with OneCycle becomes unstable around peak LR
  • Loss increases during ramp-up phase
  • Worse performance than expected

Likely Causes:

  1. max_lr too high

    • Solution: Run LR finder, use lower max_lr
  2. scheduler.step() placement wrong (should be per batch)

    • Solution: Call scheduler.step() every batch
  3. Not tuning max_lr

    • Solution: Use LR finder to find optimal, use 5-10x as max_lr

Debugging Steps:

# Plot LR schedule
lrs = []
for epoch in range(epochs):
    for batch in train_loader:
        lrs.append(optimizer.param_groups[0]['lr'])
        scheduler.step()

plt.plot(lrs)
plt.xlabel('Batch')
plt.ylabel('Learning Rate')
plt.title('OneCycle LR Schedule')
plt.show()

# Should see: ramp up to max_lr, then ramp down
# If doesn't look like that, scheduler.step() placement wrong

Solutions:

  • Run LR finder first:

    lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
    optimal_lr = suggest_lr_from_finder(lrs, losses)
    max_lr = optimal_lr * 10  # Or try 5x, 3x if 10x unstable
    
  • Lower max_lr manually:

    # If max_lr=0.1 unstable, try 0.03 or 0.01
    scheduler = OneCycleLR(optimizer, max_lr=0.03, ...)
    
  • Verify step() every batch:

    for epoch in range(epochs):
        for batch in train_loader:
            train_step(model, batch, optimizer)
            optimizer.step()
            scheduler.step()  # ✅ Every batch
    

Issue: Warmup Not Working

Symptoms:

  • Training still unstable in first few epochs despite warmup
  • Loss spikes even with warmup
  • NaN loss at start

Likely Causes:

  1. Warmup too short (need longer ramp-up)

    • Solution: Extend warmup from 5 to 10-20 epochs
  2. start_factor too high (not starting low enough)

    • Solution: Use start_factor=0.001 instead of 0.01
  3. Warmup not actually being used (SequentialLR bug)

    • Solution: Verify warmup scheduler is active early

Debugging Steps:

# Print LR first 10 epochs
for epoch in range(10):
    current_lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch}: LR = {current_lr:.6e}")
    # Should see gradual increase from low to high
    # If jumps immediately to high, warmup not working

    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Solutions:

  • Extend warmup:

    warmup = LinearLR(optimizer, start_factor=0.01, total_iters=20)  # 20 epochs
    
  • Lower start_factor:

    warmup = LinearLR(optimizer, start_factor=0.001, total_iters=5)  # Start at 0.1%
    
  • Verify SequentialLR milestone:

    # Milestone should match warmup duration
    scheduler = SequentialLR(optimizer, [warmup, cosine], milestones=[20])
    
  • Add gradient clipping as additional safeguard:

    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    

Issue: ReduceLROnPlateau Never Reduces LR

Symptoms:

  • Using ReduceLROnPlateau for 50+ epochs
  • Validation loss clearly plateaued
  • Learning rate never reduces

Debugging Steps:

1. Verify metric is being passed:

val_loss = validate(model, val_loader)
print(f"Epoch {epoch}: val_loss = {val_loss:.6f}")  # Print metric
scheduler.step(val_loss)  # Ensure passing metric

2. Check mode is correct:

# For loss (want to minimize):
scheduler = ReduceLROnPlateau(optimizer, mode='min')

# For accuracy (want to maximize):
scheduler = ReduceLROnPlateau(optimizer, mode='max')

Wrong mode means scheduler waits for opposite direction (loss increasing instead of decreasing).

3. Check the threshold isn't too lenient:

# Default threshold=1e-4 in 'rel' mode = improvements over 0.01% count as progress
# Example: val_loss 0.5000 → 0.4999 is a 0.02% improvement, which EXCEEDS the
# default threshold, counts as progress, and resets patience.
# If the loss keeps creeping down by tiny amounts, the LR never reduces.

# Solution: RAISE the threshold so only meaningful improvements reset patience
scheduler = ReduceLROnPlateau(optimizer, threshold=1e-3)  # require 0.1% improvement

4. Enable verbose logging:

scheduler = ReduceLROnPlateau(optimizer, verbose=True)
# Prints: "Epoch 00042: reducing learning rate of group 0 to 1.0000e-04" when it reduces
# (verbose is deprecated in recent PyTorch releases; if unavailable, print
# optimizer.param_groups[0]['lr'] each epoch instead)

5. Verify plateau is real:

# Plot validation loss over time
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot(val_losses)
plt.xlabel('Epoch')
plt.ylabel('Validation Loss')
plt.title('Validation Loss Over Time')
plt.grid(True, alpha=0.3)
plt.show()

# Check: Is loss truly flat, or still slowly improving?
# Tiny improvements (0.4500 → 0.4499) count as progress

6. Check cooldown isn't preventing reduction:

# Default cooldown=0, but if set higher, prevents reduction after recent reduction
scheduler = ReduceLROnPlateau(optimizer, cooldown=0)  # No cooldown

Common Causes Table:

| Problem                | Symptom                          | Solution                                     |
|------------------------|----------------------------------|----------------------------------------------|
| Not passing metric     | Error or no reduction            | scheduler.step(val_loss)                     |
| Wrong mode             | Never reduces                    | mode='min' for loss, mode='max' for accuracy |
| Threshold too lenient  | Tiny improvements reset patience | Raise to threshold=1e-3                      |
| Metric still improving | Not actually plateaued           | Increase patience or accept slow progress    |
| Cooldown active        | Reducing but waiting             | Set cooldown=0                               |
| min_lr reached         | Can't reduce further             | Check current LR, may already be at min_lr   |

Example Fix:

scheduler = ReduceLROnPlateau(
    optimizer,
    mode='min',          # For loss minimization
    factor=0.1,          # Reduce by 10x
    patience=10,         # Wait 10 epochs
    threshold=1e-3,      # Ignore improvements smaller than 0.1%
    threshold_mode='rel',
    cooldown=0,          # No cooldown period
    min_lr=1e-6,         # Minimum LR allowed
    verbose=True         # Print when reducing (deprecated in recent PyTorch)
)

# Training loop
for epoch in range(epochs):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    val_loss = validate(model, val_loader)

    print(f"Epoch {epoch}: train_loss={train_loss:.4f}, val_loss={val_loss:.4f}")

    scheduler.step(val_loss)  # Pass validation loss

    # Print current LR
    current_lr = optimizer.param_groups[0]['lr']
    print(f"  Current LR: {current_lr:.6e}")

Advanced Debugging:

If still not reducing, manually check scheduler logic:

# Get scheduler state
print(f"Best metric so far: {scheduler.best}")
print(f"Epochs without improvement: {scheduler.num_bad_epochs}")
print(f"Patience: {scheduler.patience}")

# If num_bad_epochs < patience, it's still waiting
# If num_bad_epochs >= patience, should reduce next step

10. Rationalization Table

When users rationalize away proper LR scheduling, counter with:

| Rationalization | Reality | Counter-Argument |
|---|---|---|
| "Constant LR is simpler" | Leaves 2-5% performance on the table | "One line of code for 2-5% better accuracy is excellent ROI" |
| "Warmup seems optional" | MANDATORY for transformers | "Without warmup, transformers diverge or train unstably" |
| "I don't know which scheduler to use" | CosineAnnealing is a great default | "CosineAnnealingLR works well for most cases, zero tuning" |
| "Scheduling is too complicated" | Modern frameworks make it trivial | "scheduler = CosineAnnealingLR(optimizer, T_max=100) - that's it" |
| "Papers don't mention scheduling" | They do, in implementation details | "Check the paper's code repo or appendix - the schedule is always there" |
| "My model is too small to need scheduling" | Even small models benefit | "Scheduling helps all models converge to better minima" |
| "Just use Adam, it adapts automatically" | Adam still benefits from scheduling | "SOTA transformers use AdamW + scheduling (BERT, GPT, ViT)" |
| "I'll tune it later" | Scheduling should be there from the start | "Scheduling is a core hyperparameter, not an optional add-on" |
| "OneCycle always best" | Only for specific scenarios | "OneCycle is great for fast training (<30 epochs), not long training" |
| "I don't have time to run LR finder" | Takes 5 minutes, saves hours | "The LR finder runs in minutes and prevents wasted training runs" |
| "Warmup adds complexity" | One extra line of code | "SequentialLR([warmup, cosine], [5]) - that's the complexity" |
| "My training is already good enough" | Could be 2-5% better | "SOTA papers all use scheduling - it's standard practice" |
| "Reducing LR will slow training" | Reduces LR only when a high LR hurts | "High LR early (fast), low LR late (fine-tune) = best of both" |
| "I don't know what T_max to use" | T_max = total_epochs | "Just set T_max to your total training epochs" |

11. Red Flags Checklist

Watch for these warning signs that indicate scheduling problems:

Critical Red Flags (Fix Immediately):

🚨 Training transformer without warmup

  • Impact: High risk of divergence, NaN loss
  • Fix: Add 5-10 epoch warmup immediately

🚨 Loss NaN or exploding in first few epochs

  • Impact: Training failed
  • Fix: Add warmup, lower initial LR, gradient clipping

🚨 scheduler.step() called every batch for Cosine/Step schedulers

  • Impact: LR decays steps_per_epoch times too fast (often 100x or more)
  • Fix: Move scheduler.step() outside batch loop

🚨 Not passing metric to ReduceLROnPlateau

  • Impact: Scheduler doesn't work at all
  • Fix: scheduler.step(val_loss)

Important Red Flags (Should Fix):

⚠️ Training >30 epochs without scheduler

  • Impact: Leaving 2-5% performance on table
  • Fix: Add CosineAnnealingLR or MultiStepLR

⚠️ OneCycle with an arbitrary max_lr (not tuned)

  • Impact: Unstable training or suboptimal performance
  • Fix: Run LR finder, tune max_lr

⚠️ Large batch (>512) without warmup

  • Impact: Training instability
  • Fix: Add 5-10 epoch warmup

⚠️ Vision transformer with constant LR

  • Impact: Poor convergence, unstable training
  • Fix: Add warmup + cosine schedule

⚠️ Training plateaus but no scheduler to reduce LR

  • Impact: Stuck at local minimum
  • Fix: Add ReduceLROnPlateau or manually reduce LR

Minor Red Flags (Consider Fixing):

⚡ CNN training without any scheduling

  • Impact: Missing 1-3% accuracy
  • Fix: Add MultiStepLR or CosineAnnealingLR

⚡ Not monitoring LR during training

  • Impact: Hard to debug schedule issues
  • Fix: Log LR every epoch

⚡ T_max doesn't match training duration

  • Impact: Schedule ends too early/late
  • Fix: Set T_max = total_epochs - warmup_epochs

⚡ Using same LR for pretrained and new layers (fine-tuning)

  • Impact: Suboptimal fine-tuning
  • Fix: Use different LRs for param groups

⚡ Not validating schedule before full training

  • Impact: Risk wasting compute on wrong schedule
  • Fix: Plot schedule dry-run before training

12. Quick Reference

Scheduler Selection Cheatsheet

Q: What should I use for...

Vision CNN (100 epochs)?
→ CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-5)

Vision Transformer?
→ LinearLR(warmup 5) + CosineAnnealingLR(T_max=95) [WARMUP MANDATORY]

NLP Transformer?
→ LinearLR(warmup 10%) + LinearLR(decay) [WARMUP MANDATORY]

Fast training (<30 epochs)?
→ OneCycleLR(max_lr=tune_with_LR_finder)

Don't know optimal schedule?
→ ReduceLROnPlateau(mode='min', patience=10)

Training plateaued?
→ Add ReduceLROnPlateau or manually reduce LR by 10x now

Following paper recipe?
→ Use paper's exact schedule (usually MultiStepLR)

Fine-tuning pretrained model?
→ Constant low LR (1e-5) or gentle CosineAnnealing

Large batch (>512)?
→ LinearLR(warmup 5-10) + CosineAnnealingLR [WARMUP MANDATORY]

Step Placement Quick Reference

# Most schedulers (Step, Cosine, Exponential)
for epoch in range(epochs):
    for batch in train_loader:
        train_step(...)
    scheduler.step()  # AFTER epoch

# OneCycleLR (EXCEPTION)
for epoch in range(epochs):
    for batch in train_loader:
        train_step(...)
        scheduler.step()  # AFTER each batch

# ReduceLROnPlateau (pass metric)
for epoch in range(epochs):
    for batch in train_loader:
        train_step(...)
    val_loss = validate(...)
    scheduler.step(val_loss)  # Pass metric

Warmup Quick Reference

# Pattern: Warmup + Cosine (most common)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])

# When warmup is MANDATORY:
# ✅ Transformers (ViT, BERT, GPT)
# ✅ Large batch (>512)
# ✅ High initial LR
# ✅ Training from scratch

# When warmup is optional:
# ❌ Fine-tuning
# ❌ Small LR (<1e-4)
# ❌ Small models

LR Finder Quick Reference

# Run LR finder
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)

# Find optimal (steepest descent)
optimal_lr = suggest_lr_from_finder(lrs, losses)

# Use cases:
# - Direct use: optimizer = SGD(params, lr=optimal_lr)
# - OneCycle: max_lr = optimal_lr * 10
# - Conservative: base_lr = optimal_lr * 0.1
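
find_lr and suggest_lr_from_finder are the helper names assumed throughout this skill. If you need the suggestion step standalone, one common heuristic (steepest descent on a smoothed loss curve) looks roughly like this sketch:

import numpy as np

def suggest_lr_from_finder(lrs, losses):
    # Pick the LR where the loss is falling fastest (steepest negative slope).
    # Smoothing first avoids latching onto a noisy single-batch dip.
    losses = np.convolve(losses, np.ones(5) / 5, mode='valid')
    lrs = np.asarray(lrs)[:len(losses)]
    return float(lrs[np.argmin(np.gradient(losses))])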

Summary

Learning rate scheduling is CRITICAL for competitive model performance:

Key Takeaways:

  1. Scheduling improves final accuracy by 2-5% - not optional for SOTA
  2. Warmup is MANDATORY for transformers - prevents divergence
  3. CosineAnnealingLR is best default - works well, zero tuning
  4. Use LR finder for new problems - finds optimal initial LR in minutes
  5. OneCycleLR needs max_lr tuning - run LR finder first
  6. Watch scheduler.step() placement - most per epoch, OneCycle per batch
  7. Always monitor LR during training - log to console or TensorBoard
  8. Plot schedule before training - catch mistakes early

Modern Defaults (2025):

  • Vision CNNs: SGD + CosineAnnealingLR (optional warmup)
  • Vision Transformers: AdamW + Warmup + CosineAnnealingLR (warmup mandatory)
  • NLP Transformers: AdamW + Warmup + Linear decay (warmup mandatory)
  • Fast Training: SGD + OneCycleLR (tune max_lr with LR finder)

When In Doubt:

  • Use CosineAnnealingLR with T_max = total_epochs
  • Add 5-epoch warmup for large models
  • Run LR finder if unsure about initial LR
  • Log LR every epoch to monitor schedule

Learning rate scheduling is one of the highest-ROI hyperparameters - master it for significantly better model performance.