| name | rl-debugging |
| description | Systematic RL debugging - diagnose learning failures, common bugs, diagnosis trees |
RL Debugging Methodology
When to Use This Skill
Invoke this skill when you encounter:
- Agent Won't Learn: Reward stuck at baseline, not improving
- Training Unstable: Loss bouncing, reward highly variable
- Suboptimal Policy: Agent learned something but worse than expected
- Reward Hacking: Agent gaming the reward function
- Exploration Issues: Agent stuck in local optimum or exploring poorly
- Hyperparameter Sensitivity: Small changes break training
- Learning Rate Tuning: Not sure what value is right
- Convergence Problems: Loss doesn't decrease or decreases then stops
- Environment vs Algorithm: Unsure if problem is environment or RL algorithm
- Logging Confusion: Not sure what metrics to monitor
- Gradual Performance Degradation: Early training good, later poor
- Sparse Reward Challenge: Agent never finds reward signal
Core Problem: RL debugging often becomes random hyperparameter tweaking. Agents are complex systems with many failure modes. Systematic diagnosis finds root causes; random tweaking wastes time and leads to contradictory findings.
Do NOT Use This Skill For
- Learning RL theory (route to rl-foundations for MDPs, Bellman equations, policy gradients)
- Implementing new algorithms (route to algorithm-specific skills like value-based-methods, policy-gradient-methods, actor-critic-methods)
- Environment API questions (route to rl-environments for Gym/Gymnasium API, custom environments, wrappers)
- Evaluation methodology (route to rl-evaluation for rigorous statistical testing, generalization assessment)
- Initial algorithm selection (route to using-deep-rl router or rl-foundations for choosing the right algorithm family)
Core Principle: The 80/20 Rule
The most important insight in RL debugging:
80% of RL failures are in:
1. Environment design (agent can't see true state)
2. Reward function (misaligned or wrong scale)
3. Observation/action representation (missing information)
15% are in:
4. Hyperparameters (learning rate, batch size, etc.)
5. Exploration strategy (too much or too little)
5% are in:
6. Algorithm selection (wrong algorithm for problem)
Consequence: If training fails, check environment and reward FIRST. Change the algorithm last.
Why This Order?
Scenario 1: Broken Environment
# BROKEN ENVIRONMENT: Agent can't win no matter what algorithm
class BrokenEnv:
    def reset(self):
        self.state = random_state()  # Agent can't control this
        return self.state
    def step(self, action):
        # Reward independent of action!
        reward = random.random()
        return self.state, reward
# No amount of PPO, DQN, or SAC can learn from a random reward
# CORRECT ENVIRONMENT: Agent can win with the right policy
class CorrectEnv:
    def reset(self):
        self.state = initial_state
        return self.state
    def step(self, action):
        # Reward depends on action
        reward = compute_reward(self.state, action)
        self.state = compute_next_state(self.state, action)
        return self.state, reward
If environment is broken, no algorithm will learn.
Scenario 2: Reward Scale Issue
# WRONG SCALE: Reward in [0, 1000000]
# Algorithm gradient updates: param = param - lr * grad
# If gradient huge (due to reward scale), single step breaks everything
# CORRECT SCALE: Reward in [-1, 1]
# Gradients are reasonable, learning stable
# Fix is simple: divide reward by scale factor
# But if you don't know to check reward scale, you'll try 10 learning rates instead
Consequence: Always check reward scale before tuning learning rate.
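For example, reward scaling can be applied with a small wrapper. This is a minimal sketch using the classic Gym RewardWrapper interface; the scale value is a placeholder you would set from the reward statistics gathered in Part 3:
import gym

class ScaleReward(gym.RewardWrapper):
    """Divide every reward by a fixed constant so it lands roughly in [-1, 1]."""
    def __init__(self, env, scale):
        super().__init__(env)
        self.scale = scale  # e.g. the max |reward| observed under a random policy
    def reward(self, reward):
        return reward / self.scale

# env = ScaleReward(make_your_env(), scale=1_000_000.0)  # hypothetical usage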
Part 1: Systematic Debugging Framework
The Debugging Process (Not Random Tweaking)
START: Agent not learning (or training unstable, or suboptimal)
Step 1: ENVIRONMENT CHECK (Does agent have what it needs?)
├─ Can agent see the state? (Is observation sufficient?)
├─ Is environment deterministic or stochastic? (Affects algorithm choice)
├─ Can agent actually win? (Does optimal policy exist?)
└─ Is environment reset working? (Fresh episode each reset?)
Step 2: REWARD SCALE CHECK (Is reward in reasonable range?)
├─ What's the range of rewards? (Min, max, typical)
├─ Are rewards normalized? (Should be ≈ [-1, 1])
├─ Is reward aligned with desired behavior? (No reward hacking)
└─ Are rewards sparse or dense? (Affects exploration strategy)
Step 3: OBSERVATION REPRESENTATION (Is information preserved?)
├─ Are observations normalized? (Images: [0, 255] or [0, 1]?)
├─ Is temporal information included? (Frame stacking for Atari?)
├─ Are observations consistent? (Same format each episode?)
└─ Is observation sufficient to solve problem? (Can human win from this info?)
Step 4: BASIC ALGORITHM CHECK (Is the RL algorithm working at all?)
├─ Run on simple environment (CartPole, simple task)
├─ Can algorithm learn on simple env? (If not: algorithm issue)
├─ Can algorithm beat random baseline? (If not: something is broken)
└─ Does loss decrease? (If not: learning not happening)
Step 5: HYPERPARAMETER TUNING (Only after above passed)
├─ Is learning rate in reasonable range? (1e-5 to 1e-3 typical)
├─ Is batch size appropriate? (Power of 2: 32, 64, 128, 256)
├─ Is exploration sufficient? (Epsilon decaying? Entropy positive?)
└─ Are network layers reasonable? (3 hidden layers typical)
Step 6: LOGGING ANALYSIS (What do the metrics say?)
├─ Policy loss: decreasing? exploding? zero?
├─ Value loss: decreasing? stable?
├─ Reward curve: trending up? flat? oscillating?
├─ Entropy: decreasing over time? (Exploration → exploitation)
└─ Gradient norms: reasonable? exploding? vanishing?
Step 7: IDENTIFY ROOT CAUSE (Synthesize findings)
└─ Where is the actual problem? (Environment, reward, algorithm, hyperparameters)
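Steps 1 and 2 can be automated with a short pre-flight script. This is a sketch assuming a Gym-style environment with the classic 4-tuple step API; the thresholds are rough defaults, not universal rules, and the full diagnostic scripts appear in Part 3:
import numpy as np

def preflight_check(env, episodes=20, max_steps=1000):
    """Quick pass over Steps 1-2: does reset() vary, and is the reward scale sane?"""
    first_states, rewards = [], []
    for _ in range(episodes):
        state = env.reset()
        first_states.append(np.asarray(state).copy())
        for _ in range(max_steps):
            state, reward, done, _ = env.step(env.action_space.sample())
            rewards.append(reward)
            if done:
                break
    rewards = np.asarray(rewards)
    if all(np.array_equal(first_states[0], s) for s in first_states):
        print("⚠️ reset() always returns the same state")
    if rewards.std() < 1e-8:
        print("⚠️ reward is constant under a random policy: no learning signal")
    if np.abs(rewards).max() > 10:
        print("⚠️ reward scale is large: normalize to roughly [-1, 1]")
    print(f"Reward range: [{rewards.min():.3g}, {rewards.max():.3g}], std: {rewards.std():.3g}")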
Why This Order Matters
Common mistake: Jump to Step 5 (hyperparameter tuning)
# Agent not learning. Frustration sets in.
# "I'll try learning rate 1e-4" (Step 5, skipped 1-4)
# Doesn't work.
# "I'll try batch size 64" (more Step 5 tweaking)
# Doesn't work.
# "I'll try a bigger network" (still Step 5)
# Doesn't work.
# Hours wasted.
# Correct approach: Follow Steps 1-4 first.
# Step 1: Oh! Environment reset is broken, always same initial state
# Fix environment.
# Now agent learns immediately with default hyperparameters.
The order reflects probability: it is more likely that the environment is broken than that the algorithm is, and more likely that the reward scale is wrong than that the learning rate is.
Part 2: Diagnosis Trees by Symptom
Diagnosis Tree 1: "Agent Won't Learn"
Symptom: Reward stuck near random baseline. Loss doesn't decrease meaningfully.
START: Agent Won't Learn
├─ STEP 1: Can agent beat random baseline?
│ ├─ YES → Skip to STEP 4
│ └─ NO → Environment issue likely
│ ├─ Check 1A: Is environment output sane?
│ │ ├─ Print first 5 episodes: state, action, reward, next_state
│ │ ├─ Verify types match (shapes, ranges, dtypes)
│ │ └─ Is reward always same? Always zero? (Red flag: no signal)
│ ├─ Check 1B: Can you beat it manually?
│ │ ├─ Play environment by hand (hardcode a policy)
│ │ ├─ Can you get >0 reward? (If not: environment is broken)
│ │ └─ If yes: Agent is missing something
│ └─ Check 1C: Is reset working?
│ ├─ Call reset() twice, check states differ
│ └─ If states same: reset is broken, fix it
├─ STEP 2: Is reward scale reasonable?
│ ├─ Compute: min, max, mean, std of rewards from random policy
│ ├─ If range >> 1 (e.g., [0, 10000]):
│ │ ├─ Action: Normalize rewards to [-1, 1]
│ │ ├─ Code: reward = reward / max_possible_reward
│ │ └─ Retest: Usually fixes "won't learn"
│ ├─ If range << 1 (e.g., [0, 0.001]):
│ │ ├─ Action: Scale up rewards
│ │ ├─ Code: reward = reward * 1000
│ │ └─ Or increase network capacity (more signal needed)
│ └─ If reward is [0, 1] (looks fine):
│ └─ Continue to STEP 3
├─ STEP 3: Is observation sufficient?
│ ├─ Check 3A: Are observations normalized?
│ │ ├─ If images [0, 255]: normalize to [0, 1] or [-1, 1]
│ │ ├─ Code: observation = observation / 255.0
│ │ └─ Retest
│ ├─ Check 3B: Is temporal info included? (For vision: frame stacking)
│ │ ├─ If using images: last 4 frames stacked?
│ │ ├─ If using states: includes velocity/derivatives?
│ │ └─ Missing temporal info → agent can't infer velocity
│ └─ Check 3C: Is observation Markovian?
│ ├─ Can optimal policy be derived from this observation?
│ ├─ If not: observation insufficient (red flag)
│ └─ Example: Only position, not velocity → agent can't control
├─ STEP 4: Run sanity check on simple environment
│ ├─ Switch to CartPole or equivalent simple env
│ ├─ Train with default hyperparameters
│ ├─ Does simple env learn? (Should learn in 1000-5000 steps)
│ ├─ YES → Your algorithm works, issue is your env/hyperparameters
│ └─ NO → Algorithm itself broken (rare, check algorithm implementation)
├─ STEP 5: Check exploration
│ ├─ Is agent exploring or stuck?
│ ├─ Log entropy (for stochastic policies)
│ ├─ If entropy → 0 early: agent exploiting before exploring
│ │ └─ Solution: Increase entropy regularization or ε
│ ├─ If entropy always high: too much exploration
│ │ └─ Solution: Decay entropy or ε more aggressively
│ └─ Visualize: Plot policy actions over time, should see diversity early
├─ STEP 6: Check learning rate
│ ├─ Is learning rate in [1e-5, 1e-3]? (typical range)
│ ├─ If > 1e-3: Try reducing (might be too aggressive)
│ ├─ If < 1e-5: Try increasing (might be too conservative)
│ ├─ Watch loss first step: If loss increases → LR too high
│ └─ Safe default: 3e-4
└─ STEP 7: Check network architecture
├─ For continuous control: small networks ok (1-2 hidden layers, 64-256 units)
├─ For vision: use CNN (don't use FC on pixels)
├─ Check if network has enough capacity
└─ Tip: Start with simple, add complexity if needed
ROOT CAUSES in order of likelihood:
- Reward scale wrong (40% of cases)
- Environment broken (25% of cases)
- Observation insufficient (15% of cases)
- Learning rate too high/low (12% of cases)
- Algorithm issue (8% of cases)
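Step 5 of the tree above asks you to log entropy. A minimal sketch for a discrete (categorical) policy, assuming your policy network outputs logits; the usage lines are hypothetical:
import torch
import torch.nn.functional as F

def policy_entropy(logits):
    """Mean entropy of a categorical policy: high = exploring, near zero = (over)exploiting."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1).mean()

# logits = policy_net(observations)                    # hypothetical forward pass
# logger.log_losses(p_loss, v_loss, policy_entropy(logits).item())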
Diagnosis Tree 2: "Training Unstable"
Symptom: Loss bounces wildly, reward spikes then crashes, training oscillates.
START: Training Unstable
├─ STEP 1: Characterize the instability
│ ├─ Plot loss curve: Does it bounce at same magnitude or grow?
│ ├─ Plot reward curve: Does it oscillate around mean or trend down?
│ ├─ Compute: reward variance over 100 episodes
│ └─ This tells you: Is it normal variance or pathological instability?
├─ STEP 2: Check if environment is deterministic
│ ├─ Deterministic environment + stochastic policy = normal variance
│ ├─ Stochastic environment + any policy = high variance (expected)
│ ├─ If stochastic: Can you reduce randomness? Or accept higher variance?
│ └─ Some instability is normal; distinguish from pathological
├─ STEP 3: Check reward scale
│ ├─ If rewards >> 1: Gradient updates too large
│ │ ├─ Single step might overshoot optimum
│ │ ├─ Solution: Normalize rewards to [-1, 1]
│ │ └─ This often fixes instability immediately
│ ├─ If reward has outliers: Single large reward breaks training
│ │ ├─ Solution: Reward clipping or scaling
│ │ └─ Example: r = np.clip(reward, -1, 1)
│ └─ Check: Is reward scale consistent?
├─ STEP 4: Check learning rate (LR often causes instability)
│ ├─ If loss oscillates: LR likely too high
│ │ ├─ Try reducing by 2-5× (e.g., 1e-3 → 3e-4)
│ │ ├─ Watch first 100 steps: Loss should decrease monotonically
│ │ └─ If still oscillates: try 10× reduction
│ ├─ If you have LR scheduler: Check if it's too aggressive
│ │ ├─ Scheduler reducing LR too fast can cause steps
│ │ └─ Solution: Slower schedule (more steps to final LR)
│ └─ Test: Set LR very low (1e-5), see if training is smooth
│ ├─ YES → Increase LR gradually until instability starts
│ └─ This bracketing finds safe LR range
├─ STEP 5: Check batch size
│ ├─ Small batch (< 32): High gradient variance, bouncy updates
│ │ ├─ Solution: Increase batch size (32, 64, 128)
│ │ └─ But not too large: training becomes slow
│ ├─ Large batch (> 512): Might overfit, large gradient steps
│ │ ├─ Solution: Use gradient accumulation
│ │ └─ Or reduce learning rate slightly
│ └─ Start with batch_size=64, adjust if needed
├─ STEP 6: Check gradient clipping
│ ├─ Are gradients exploding? (Check max gradient norm)
│ │ ├─ If max grad norm > 100: Likely exploding gradients
│ │ ├─ Solution: Enable gradient clipping (max_norm=1.0)
│ │ └─ Code: torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
│ ├─ If max grad norm reasonable (< 10): Skip this step
│ └─ Watch grad norm over training: Should stay roughly constant
├─ STEP 7: Check algorithm-specific parameters
│ ├─ For PPO: Is clipping epsilon reasonable? (0.2 default)
│ │ ├─ Too high: Over-clips, doesn't update
│ │ └─ Too low: Allows large updates, instability
│ ├─ For DQN: Is target network update frequency appropriate?
│ │ ├─ Update too often: Target constantly changing
│ │ └─ Update too rarely: Stale targets
│ └─ For A3C/A2C: Check entropy coefficient
│ ├─ Too high: Too much exploration, policy noisy
│ └─ Too low: Premature convergence
└─ STEP 8: Check exploration decay
├─ Is exploration decaying too fast? (Policy becomes deterministic)
│ └─ If entropy→0 early: Agent exploits before exploring
├─ Is exploration decaying too slow? (Policy stays noisy)
│ └─ If entropy stays high: Too much randomness in later training
└─ Entropy should decay: high early, low late
└─ Plot entropy over training: should show clear decay curve
ROOT CAUSES in order of likelihood:
- Learning rate too high (35% of cases)
- Reward scale too large (25% of cases)
- Batch size too small (15% of cases)
- Gradient explosion (10% of cases)
- Algorithm parameters (10% of cases)
- Environment stochasticity (5% of cases)
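The bracketing test from Step 4 of the tree above can be scripted. A sketch, assuming a hypothetical train_for_steps(lr, n_steps) helper that runs your training loop and returns the loss history:
import numpy as np

def bracket_learning_rate(lrs=(1e-5, 3e-5, 1e-4, 3e-4, 1e-3)):
    """Sweep LR from low to high; the safe range ends where the loss starts bouncing."""
    for lr in lrs:
        losses = np.asarray(train_for_steps(lr, n_steps=2_000))  # hypothetical helper
        trend = losses[-100:].mean() - losses[:100].mean()       # negative means loss went down
        bounce = np.abs(np.diff(losses)).mean()                  # step-to-step oscillation
        print(f"lr={lr:.0e}  trend={trend:+.4f}  bounce={bounce:.4f}")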
Diagnosis Tree 3: "Suboptimal Policy"
Symptom: Agent learned something but performs worse than expected. Better than random baseline, but not good enough.
START: Suboptimal Policy
├─ STEP 1: How suboptimal? (Quantify the gap)
│ ├─ Compute: Agent reward vs theoretical optimal
│ ├─ If 80% of optimal: Normal (RL usually gets 80-90% optimal)
│ ├─ If 50% of optimal: Significantly suboptimal
│ ├─ If 20% of optimal: Very bad
│ └─ This tells you: Is it "good enough" or truly broken?
├─ STEP 2: Is it stuck in local optimum?
│ ├─ Run multiple seeds: Do you get similar reward each seed?
│ ├─ If rewards similar across seeds: Consistent local optimum
│ ├─ If rewards vary wildly: High variance, need more training
│ └─ Solution if local optimum: More exploration or better reward shaping
├─ STEP 3: Check reward hacking
│ ├─ Visualize agent behavior: Does it match intent?
│ ├─ Example: Cart-pole reward is [0, 1] per timestep
│ │ ├─ Agent might learn: "Stay in center, don't move"
│ │ ├─ Policy is suboptimal but still gets reward
│ │ └─ Solution: Reward engineering (bonus for progress)
│ └─ Hacking signs:
│ ├─ Agent does something weird but gets reward
│ ├─ Behavior makes no intuitive sense
│ └─ Reward increases but performance bad
├─ STEP 4: Is exploration sufficient?
│ ├─ Check entropy: Does policy explore initially?
│ ├─ Check epsilon decay (if using ε-greedy): Does it decay appropriately?
│ ├─ Is agent exploring broadly or stuck in small region?
│ ├─ Solution: Slower exploration decay or intrinsic motivation
│ └─ Use RND/curiosity if environment has sparse rewards
├─ STEP 5: Check network capacity
│ ├─ Is network too small to represent optimal policy?
│ ├─ For vision: Use standard CNN (not tiny network)
│ ├─ For continuous control: 2-3 hidden layers, 128-256 units
│ ├─ Test: Double network size, does performance improve?
│ └─ If yes: Original network was too small
├─ STEP 6: Check data efficiency
│ ├─ Is agent training long enough?
│ ├─ RL usually needs: simple tasks 100k steps, complex tasks 1M+ steps
│ ├─ If training only 10k steps: Too short, agent didn't converge
│ ├─ Solution: Train longer (but check reward curve first)
│ └─ If reward plateaus early: Extend training won't help
├─ STEP 7: Check observation and action spaces
│ ├─ Is action space continuous or discrete?
│ ├─ Is action discretization appropriate?
│ │ ├─ Too coarse: Can't express fine control
│ │ ├─ Too fine: Huge action space, hard to learn
│ │ └─ Example: 100 actions for simple control = too many
│ ├─ Is observation sufficient? (See Diagnosis Tree 1, Step 3)
│ └─ Missing information in observation = impossible to be optimal
├─ STEP 8: Check reward structure
│ ├─ Is reward dense or sparse?
│ ├─ Sparse reward + suboptimal policy: Agent might not be exploring to good region
│ │ ├─ Solution: Reward shaping (bonus for progress)
│ │ └─ Or: Intrinsic motivation (RND/curiosity)
│ ├─ Dense reward + suboptimal: Possible misalignment with intent
│ └─ Can you improve by reshaping reward?
└─ STEP 9: Compare with baseline algorithm
├─ Run reference implementation on same env
├─ Does reference get better reward?
├─ YES → Your implementation has a bug
├─ NO → Problem is inherent to algorithm or environment
└─ This isolates: Implementation issue vs fundamental difficulty
ROOT CAUSES in order of likelihood:
- Exploration insufficient (30% of cases)
- Training not long enough (25% of cases)
- Reward hacking (20% of cases)
- Network too small (12% of cases)
- Observation insufficient (8% of cases)
- Algorithm mismatch (5% of cases)
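For Step 9 of the tree above, a well-tested reference implementation gives you a yardstick. A sketch using Stable-Baselines3 PPO; it assumes stable-baselines3 is installed and that env follows the Gym API (swap in whichever reference library you trust):
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

model = PPO("MlpPolicy", env, verbose=0)   # default hyperparameters on purpose
model.learn(total_timesteps=200_000)
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=20)
print(f"Reference PPO: {mean_reward:.1f} +/- {std_reward:.1f}")
# Much better than your agent -> suspect your implementation
# Similar to your agent       -> the difficulty is in the environment or task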
Part 3: What to Check First
Critical Checks (Do These First)
Check 1: Reward Scale Analysis
Why: Reward scale is the MOST COMMON source of RL failures.
# DIAGNOSTIC SCRIPT
import numpy as np
# Collect rewards from a random policy
rewards = []
for episode in range(100):
    state = env.reset()
    for step in range(1000):
        action = env.action_space.sample()  # Random action
        state, reward, done, _ = env.step(action)
        rewards.append(reward)
        if done:
            break
rewards = np.array(rewards)
print("Reward statistics from random policy:")
print(f"  Min: {rewards.min()}")
print(f"  Max: {rewards.max()}")
print(f"  Mean: {rewards.mean()}")
print(f"  Std: {rewards.std()}")
print(f"  Range: [{rewards.min()}, {rewards.max()}]")
# RED FLAGS
if abs(rewards.max()) > 100 or abs(rewards.min()) > 100:
    print("⚠️ RED FLAG: Rewards >> 1, normalize them!")
if rewards.std() > 10:
    print("⚠️ RED FLAG: High reward variance, normalize or clip")
if rewards.mean() == rewards.max():
    print("⚠️ RED FLAG: Constant rewards, no signal to learn from!")
if (rewards >= -1).all() and (rewards <= 1).all():
    print("✓ Reward scale looks reasonable ([-1, 1] range)")
Action if scale is wrong:
# Normalize to [-1, 1]
reward = reward / max(abs(rewards.max()), abs(rewards.min()))
# Or clip
reward = np.clip(reward, -1, 1)
# Or shift and scale
reward = 2 * (reward - rewards.min()) / (rewards.max() - rewards.min()) - 1
Check 2: Environment Sanity Check
Why: Broken environment → no algorithm will work.
# DIAGNOSTIC SCRIPT
def sanity_check_env(env, num_episodes=5):
    """Quick check if environment is sane."""
    for episode in range(num_episodes):
        state = env.reset()
        print(f"\nEpisode {episode}:")
        print(f"  Initial state shape: {state.shape}, dtype: {state.dtype}")
        print(f"  Initial state range: [{state.min()}, {state.max()}]")
        for step in range(10):
            action = env.action_space.sample()
            next_state, reward, done, info = env.step(action)
            print(f"  Step {step}: action={action}, reward={reward}, done={done}")
            print(f"    State shape: {next_state.shape}, range: [{next_state.min()}, {next_state.max()}]")
            # Check for NaN
            if np.isnan(next_state).any() or np.isnan(reward):
                print("  ⚠️ NaN detected!")
            # Check for reasonable values
            if np.abs(next_state).max() > 1e6:
                print("  ⚠️ State explosion (values > 1e6)")
            if done:
                break
    print("\n✓ Environment check complete")
sanity_check_env(env)
RED FLAGS:
- NaN or inf in observations/rewards
- State values exploding (> 1e6)
- Reward always same (no signal)
- Done flag never true (infinite episodes)
- State never changes despite actions
Check 3: Can You Beat It Manually?
Why: If human can't solve it, agent won't either (unless reward hacking).
# Manual policy: Hardcoded behavior
def manual_policy(state):
    # Example for CartPole: if pole tilting right, push right
    if state[2] > 0:  # angle > 0
        return 1  # Push right
    else:
        return 0  # Push left
# Test manual policy
total_reward = 0
for episode in range(10):
    state = env.reset()
    for step in range(500):
        action = manual_policy(state)
        state, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
avg_reward = total_reward / 10
print(f"Manual policy average reward: {avg_reward}")
# If avg_reward > 0: Environment is learnable
# If avg_reward ≤ 0: Environment is broken or impossible
Check 4: Observation Normalization
Why: Non-normalized observations cause learning problems.
# Check if observations are normalized
for episode in range(10):
    state = env.reset()
    print(f"Episode {episode}: state range [{state.min()}, {state.max()}]")
    # For images: should be [0, 1] or [-1, 1]
    # For physical states: should be roughly [-1, 1]
    if state.min() < -10 or state.max() > 10:
        print("⚠️ Observations not normalized!")
        # Solution:
        state = state / np.abs(state).max()  # Normalize
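If observations are not already in a fixed range, a running-statistics normalizer is a common fix. A minimal sketch (Gymnasium also provides a NormalizeObservation wrapper that does this for you); the usage lines are hypothetical:
import numpy as np

class RunningObsNormalizer:
    """Keep a running mean/variance of observations and standardize on the fly."""
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = eps
    def __call__(self, obs):
        obs = np.asarray(obs, dtype=np.float64)
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.var += (delta * (obs - self.mean) - self.var) / self.count
        return (obs - self.mean) / np.sqrt(self.var + 1e-8)

# normalize = RunningObsNormalizer(env.observation_space.shape)
# state = normalize(env.reset())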
Part 4: Common RL Bugs Catalog
Bug 1: Reward Scale > 1
Symptom: Training unstable, loss spikes, agent doesn't learn
Root Cause: Gradients too large due to reward scale
Code Example:
# WRONG: Reward in [0, 1000]
reward = success_count * 1000
# CORRECT: Normalize to [-1, 1]
reward = success_count * 1000
reward = reward / max_possible_reward  # Result: bounded by 1 in magnitude
Fix: Divide rewards by max possible value
Detection:
# Collect per-step rewards from ~100 random-policy episodes (see Check 1 in Part 3)
if max(abs(r) for r in rewards) > 1:
    print("⚠️ Reward scale issue detected")
Bug 2: Environment Reset Broken
Symptom: Agent learns initial state but can't adapt
Root Cause: Reset doesn't randomize initial state or returns same state
Code Example:
# WRONG: Reset always same state
def reset(self):
    self.state = np.array([0, 0, 0, 0])  # Always [0, 0, 0, 0]
    return self.state
# CORRECT: Reset randomizes initial state
def reset(self):
    self.state = np.random.uniform(-0.1, 0.1, size=4)  # Random
    return self.state
Fix: Make reset() randomize initial state
Detection:
states = [env.reset() for _ in range(10)]
if len(set(map(tuple, states))) == 1:
    print("⚠️ Reset broken, always same state")
Bug 3: Observation Insufficient (Partial Observability)
Symptom: Agent can't learn because it doesn't see enough
Root Cause: Observation missing velocity, derivatives, or temporal info
Code Example:
# WRONG: Only position, no velocity
state = np.array([position])  # Can't infer velocity from position alone
# CORRECT: Position + velocity
state = np.array([position, velocity])
# WRONG for images: Single frame
observation = env.render()  # Single frame, no temporal info
# CORRECT for images: Stacked frames
frames = [frame_t_minus_3, frame_t_minus_2, frame_t_minus_1, frame_t]  # Last 4 frames
observation = np.stack(frames, axis=-1)  # Shape: (84, 84, 4)
Fix: Add missing information to observation
Detection:
# If agent converges to bad performance despite long training
# Check: Can you compute optimal action from observation?
# If no: Observation is insufficient
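A minimal frame-stacking sketch using a deque (Gymnasium also ships a built-in frame-stacking wrapper you can use instead); the usage lines are hypothetical:
from collections import deque
import numpy as np

class FrameStacker:
    """Keep the last k observations and return them stacked along the last axis."""
    def __init__(self, k=4):
        self.frames = deque(maxlen=k)
    def reset(self, first_frame):
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return np.stack(self.frames, axis=-1)
    def step(self, frame):
        self.frames.append(frame)
        return np.stack(self.frames, axis=-1)

# stacker = FrameStacker(k=4)
# obs = stacker.reset(env.reset())   # shape e.g. (84, 84, 4) for 84x84 grayscale frames
# obs = stacker.step(next_frame)     # call after each env.step(...)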
Bug 4: Reward Always Same (No Signal)
Symptom: Loss decreases but doesn't improve over time, reward flat
Root Cause: Reward is constant or nearly constant
Code Example:
# WRONG: Constant reward
reward = 1.0  # Every step gets +1, no differentiation
# CORRECT: Differentiate good and bad outcomes
if reached_goal:
    reward = 1.0
else:
    reward = 0.0  # Or -0.1 for living cost
Fix: Ensure reward differentiates outcomes
Detection:
# Collect rewards from a random policy into a NumPy array (as in Check 1, Part 3)
if rewards.std() < 0.01:
    print("⚠️ Reward has no variance, no signal to learn")
Bug 5: Learning Rate Too High
Symptom: Loss oscillates or explodes, training unstable
Root Cause: Gradient updates too large, overshooting optimum
Code Example:
# WRONG: Learning rate 1e-2 (too high)
optimizer = Adam(model.parameters(), lr=1e-2)
# CORRECT: Learning rate 3e-4 (safe default)
optimizer = Adam(model.parameters(), lr=3e-4)
Fix: Reduce learning rate by 2-5×
Detection:
# Watch loss first 100 steps
# If loss increases first step: LR too high
# If loss decreases but oscillates: LR probably high
Bug 6: Learning Rate Too Low
Symptom: Agent learns very slowly, training takes forever
Root Cause: Gradient updates too small, learning crawls
Code Example:
# WRONG: Learning rate 1e-6 (too low)
optimizer = Adam(model.parameters(), lr=1e-6)
# CORRECT: Learning rate 3e-4
optimizer = Adam(model.parameters(), lr=3e-4)
Fix: Increase learning rate by 2-5×
Detection:
# Training curve increases very slowly
# If training 1M steps and reward barely improved: LR too low
Bug 7: No Exploration Decay
Symptom: Agent learns but remains noisy, doesn't fully exploit
Root Cause: Exploration (epsilon or entropy) not decaying
Code Example:
# WRONG: Constant epsilon
epsilon = 0.3 # Forever
# CORRECT: Decay epsilon
epsilon = epsilon_linear(step, total_steps=1_000_000,
epsilon_start=1.0, epsilon_end=0.01)
Fix: Add exploration decay schedule
Detection:
# Plot entropy or epsilon over training
# Should show clear decay from high to low
# If flat: not decaying
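epsilon_linear above is not a library function; a minimal sketch of the linear schedule it refers to (the decay_fraction default is an assumption, chosen so epsilon reaches its final value well before training ends):
def epsilon_linear(step, total_steps, epsilon_start=1.0, epsilon_end=0.01, decay_fraction=0.8):
    """Linearly anneal epsilon over the first decay_fraction of training, then hold."""
    progress = min(step / (decay_fraction * total_steps), 1.0)
    return epsilon_start + progress * (epsilon_end - epsilon_start)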
Bug 8: Exploration Decay Too Fast
Symptom: Agent plateaus early, stuck in local optimum
Root Cause: Exploration stops before finding good policy
Code Example:
# WRONG: Effectively zero within the first ~5% of a 1M-step run
epsilon = 0.99 ** (step / 100)  # ~0.01 by step ~46k, ~0 long before training ends
# CORRECT: Decays over full training
epsilon = epsilon_linear(step, total_steps=1_000_000,
                         epsilon_start=1.0, epsilon_end=0.01)
Fix: Use longer decay schedule
Detection:
# Plot epsilon over training
# Should reach final value at 50-80% through training
# Not at 5%
Bug 9: Reward Hacking
Symptom: Agent achieves high reward but behavior is useless
Root Cause: Agent found way to game reward not aligned with intent
Code Example:
# WRONG: Reward for just staying alive
reward = 1.0 # Every timestep
# Agent learns: Stay in corner, don't move, get infinite reward
# CORRECT: Reward for progress + living cost
position_before = self.state[0]
self.state = compute_next_state(...)
position_after = self.state[0]
progress = position_after - position_before
reward = progress - 0.01 # Progress bonus, living cost
Fix: Reshape reward to align with intent
Detection:
# Visualize agent behavior
# If behavior weird but reward high: hacking
# If reward increases but task performance bad: hacking
Bug 10: Testing with Exploration
Symptom: Test performance much worse than training, high variance
Root Cause: Using stochastic policy at test time
Code Example:
# WRONG: Test with epsilon > 0
for test_episode in range(100):
    action = epsilon_greedy(q_values, epsilon=0.05)  # Wrong!
    # Agent still explores at test time
# CORRECT: Test greedy
for test_episode in range(100):
    action = np.argmax(q_values)  # Deterministic
Fix: Use greedy/deterministic policy at test time
Detection:
# Test reward variance high?
# Test reward < train reward?
# Check: Are you using exploration at test time?
Part 5: Logging and Monitoring
What Metrics to Track
# Minimal set of metrics for RL debugging
class RLLogger:
    def __init__(self):
        self.episode_rewards = []
        self.policy_losses = []
        self.value_losses = []
        self.entropies = []
        self.gradient_norms = []
    def log_episode(self, episode_reward):
        self.episode_rewards.append(episode_reward)
    def log_losses(self, policy_loss, value_loss, entropy):
        self.policy_losses.append(policy_loss)
        self.value_losses.append(value_loss)
        self.entropies.append(entropy)
    def log_gradient_norm(self, norm):
        self.gradient_norms.append(norm)
    def plot_training(self):
        """Visualize training progress."""
        # Plot 1: Episode rewards over time (smoothed)
        # Plot 2: Policy and value losses
        # Plot 3: Entropy (should decay)
        # Plot 4: Gradient norms
        pass
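A quick demonstration of how the logger is fed; the values below are synthetic stand-ins, and in practice they come from your training loop:
import numpy as np

logger = RLLogger()
for step in range(1000):
    # Synthetic demo values, not real training output
    logger.log_losses(policy_loss=1.0 / (step + 1), value_loss=2.0 / (step + 1), entropy=np.exp(-step / 300))
    logger.log_gradient_norm(np.random.uniform(0.5, 2.0))
    if step % 10 == 0:
        logger.log_episode(episode_reward=step / 10 + np.random.randn())
print(f"Logged {len(logger.episode_rewards)} episodes, last entropy: {logger.entropies[-1]:.3f}")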
What Each Metric Means
Metric 1: Episode Reward
What to look for:
- Should trend upward over time
- Should have decreasing variance (less oscillation)
- Slight noise is normal
Red flags:
- Flat line: Not learning
- Downward trend: Getting worse
- Wild oscillations: Instability or unlucky randomness
Code:
rewards = agent.get_episode_rewards()
reward_smoothed = np.convolve(rewards, np.ones(100)/100, mode='valid')
plt.plot(reward_smoothed) # Smooth to see trend
Metric 2: Policy Loss
What to look for:
- Should decrease over training
- Decrease should smooth out (not oscillating)
Red flags:
- Loss increasing: Learning rate too high
- Loss oscillating: Learning rate too high or reward scale wrong
- Loss = 0: Policy not updating
Code:
if policy_loss > policy_loss_prev:
    print("⚠️ Policy loss increased, LR might be too high")
Metric 3: Value Loss (for critic-based methods)
What to look for:
- Should decrease initially, then plateau
- Should not oscillate heavily
Red flags:
- Loss exploding: LR too high
- Loss not changing: Not updating
Code:
value_loss_smoothed = np.convolve(value_losses, np.ones(100)/100, mode='valid')
if value_loss_smoothed[-1] > value_loss_smoothed[-100]:
    print("⚠️ Value loss increasing recently")
Metric 4: Entropy (Policy Randomness)
What to look for:
- Should start high (exploring)
- Should decay to low (exploiting)
- Clear downward trend
Red flags:
- Entropy always high: Too much exploration
- Entropy drops to zero: Over-exploiting
- No decay: Entropy not decreasing
Code:
if entropy[-1] > entropy[-100]:
    print("⚠️ Entropy increasing, exploration not decaying")
Metric 5: Gradient Norms
What to look for:
- Should stay roughly constant over training
- Typical range: 0.1 to 10
Red flags:
- Gradient norms > 100: Exploding gradients
- Gradient norms < 0.001: Vanishing gradients
- Sudden spikes: Outlier data or numerical issue
Code:
total_norm = 0.0
for p in model.parameters():
    if p.grad is not None:
        param_norm = p.grad.norm(2)
        total_norm += param_norm.item() ** 2
total_norm = total_norm ** 0.5
if total_norm > 100:
    print("⚠️ Gradient explosion detected")
Visualization Script
import matplotlib.pyplot as plt
import numpy as np
def plot_rl_training(rewards, policy_losses, value_losses, entropies):
    """Plot training metrics for RL debugging."""
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    # Plot 1: Episode rewards
    ax = axes[0, 0]
    ax.plot(rewards, alpha=0.3, label='Episode reward')
    reward_smooth = np.convolve(rewards, np.ones(100)/100, mode='valid')
    ax.plot(range(99, len(rewards)), reward_smooth, label='Smoothed (100 episodes)')
    ax.set_xlabel('Episode')
    ax.set_ylabel('Reward')
    ax.set_title('Episode Rewards Over Time')
    ax.legend()
    ax.grid()
    # Plot 2: Policy loss
    ax = axes[0, 1]
    ax.plot(policy_losses, alpha=0.3)
    loss_smooth = np.convolve(policy_losses, np.ones(100)/100, mode='valid')
    ax.plot(range(99, len(policy_losses)), loss_smooth, label='Smoothed')
    ax.set_xlabel('Step')
    ax.set_ylabel('Policy Loss')
    ax.set_title('Policy Loss Over Time')
    ax.legend()
    ax.grid()
    # Plot 3: Entropy
    ax = axes[1, 0]
    ax.plot(entropies, label='Policy entropy')
    ax.set_xlabel('Step')
    ax.set_ylabel('Entropy')
    ax.set_title('Policy Entropy (Should Decrease)')
    ax.legend()
    ax.grid()
    # Plot 4: Value loss
    ax = axes[1, 1]
    ax.plot(value_losses, alpha=0.3)
    loss_smooth = np.convolve(value_losses, np.ones(100)/100, mode='valid')
    ax.plot(range(99, len(value_losses)), loss_smooth, label='Smoothed')
    ax.set_xlabel('Step')
    ax.set_ylabel('Value Loss')
    ax.set_title('Value Loss Over Time')
    ax.legend()
    ax.grid()
    plt.tight_layout()
    plt.show()
Part 6: Common Pitfalls and Red Flags
Pitfall 1: "Bigger Network = Better Learning"
Wrong: Oversized networks overfit and learn slowly
Right: Start with small network (2-3 hidden layers, 64-256 units)
Red Flag: Network has > 10M parameters for simple task
Fix:
# Too big
model = nn.Sequential(
    nn.Linear(4, 1024),
    nn.ReLU(),
    nn.Linear(1024, 1024),
    nn.ReLU(),
    nn.Linear(1024, 2),
)
# Right size
model = nn.Sequential(
    nn.Linear(4, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, 2),
)
Pitfall 2: "Random Seed Doesn't Matter"
Wrong: Different seeds give very different results (indicates instability)
Right: Results should be consistent across seeds (within reasonable variance)
Red Flag: Reward varies by 50%+ across 5 seeds
Fix:
# Test across multiple seeds
rewards_by_seed = []
for seed in range(5):
    np.random.seed(seed)
    torch.manual_seed(seed)
    reward = train_agent(seed)
    rewards_by_seed.append(reward)
print(f"Mean: {np.mean(rewards_by_seed)}, Std: {np.std(rewards_by_seed)}")
if np.std(rewards_by_seed) > 0.5 * np.mean(rewards_by_seed):
    print("⚠️ High variance across seeds, training unstable")
Pitfall 3: "Skip Observation Normalization"
Wrong: Non-normalized observations (scale [-1e6, 1e6])
Right: Normalized observations (scale [-1, 1])
Red Flag: State values > 100 or < -100
Fix:
# Normalize images
observation = observation.astype(np.float32) / 255.0
# Normalize states
observation = (observation - observation_mean) / observation_std
# Or standardize on-the-fly
normalized_obs = (obs - running_mean) / (running_std + 1e-8)
Pitfall 4: "Ignore the Reward Curve Shape"
Wrong: Only look at final reward, ignore curve shape
Right: Curve shape tells you what's wrong
Red Flag: Curve shapes indicate:
- Flat then sudden jump: Long exploration, then found policy
- Oscillating: Unstable learning
- Decreasing after peak: Catastrophic forgetting
Fix:
# Interpret the (smoothed) reward curve shape:
#   Flat the whole time   → not learning: check environment and reward
#   Oscillating           → unstable: check learning rate or reward scale
#   Peaks, then drops     → catastrophic forgetting or exploration decay gone wrong
Pitfall 5: "Skip the Random Baseline Check"
Wrong: Train agent without knowing what random baseline is
Right: Always compute random baseline first
Red Flag: Agent barely beats random (within 5% of baseline)
Fix:
# Compute random baseline
random_rewards = []
for _ in range(100):
state = env.reset()
episode_reward = 0
for step in range(1000):
action = env.action_space.sample()
state, reward, done, _ = env.step(action)
episode_reward += reward
if done:
break
random_rewards.append(episode_reward)
random_baseline = np.mean(random_rewards)
print(f"Random baseline: {random_baseline}")
# Compare agent
agent_reward = train_agent()
improvement = (agent_reward - random_baseline) / random_baseline
print(f"Agent improvement: {improvement*100}%")
Pitfall 6: "Changing Multiple Hyperparameters at Once"
Wrong: Change 5 things, training breaks, don't know which caused it
Right: Change one thing at a time, test, measure, iterate
Red Flag: Code has "TUNING" comments with 10 simultaneous changes
Fix:
# Scientific method for debugging
def debug_lr():
    for lr in [1e-5, 1e-4, 1e-3, 1e-2]:
        reward = train_with_lr(lr)
        print(f"LR={lr}: Reward={reward}")
    # Only change LR, keep everything else the same
def debug_batch_size():
    for batch in [32, 64, 128, 256]:
        reward = train_with_batch(batch)
        print(f"Batch={batch}: Reward={reward}")
    # Only change batch size, keep everything else the same
Pitfall 7: "Using Training Metrics to Judge Performance"
Wrong: Trust training reward, test once at the end
Right: Monitor test reward during training (with exploration off)
Red Flag: Training reward high, test reward low (overfitting)
Fix:
# Evaluate with greedy policy (no exploration)
def evaluate(agent, num_episodes=10):
    episode_rewards = []
    for _ in range(num_episodes):
        state = env.reset()
        episode_reward = 0
        for step in range(1000):
            action = agent.act(state, explore=False)  # Greedy
            state, reward, done, _ = env.step(action)
            episode_reward += reward
            if done:
                break
        episode_rewards.append(episode_reward)
    return np.mean(episode_rewards)
# Monitor during training
for step in range(total_steps):
    train_agent_step()
    if step % 10000 == 0:
        test_reward = evaluate(agent)  # Evaluate periodically
        print(f"Step {step}: Test reward={test_reward}")
Part 7: Red Flags Checklist
CRITICAL RED FLAGS (Stop and debug immediately):
[ ] NaN in loss or rewards
→ Check: reward scale, gradients, network outputs
[ ] Gradient norms > 100 (exploding)
→ Check: Enable gradient clipping, reduce LR
[ ] Gradient norms < 1e-4 (vanishing)
→ Check: Increase LR, check network initialization
[ ] Reward always same
→ Check: Is reward function broken? No differentiation?
[ ] Agent never improves beyond random baseline
→ Check: Reward scale, environment, observation, exploration
[ ] Loss oscillates wildly
→ Check: Learning rate (likely too high), reward scale
[ ] Episode length decreases over training
→ Check: Agent learning bad behavior, poor reward shaping
[ ] Training reward >> test reward
→ Check: Training run got lucky or overfit; the greedy test evaluation is representative
[ ] Training gets worse after improving
→ Check: Catastrophic forgetting, stability issue
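For the NaN flag at the top of this list, it is cheap to fail fast rather than discover the problem hours later. A minimal guard for a PyTorch training loop, called right after loss.backward():
import torch

def assert_finite(loss, model):
    """Stop training immediately if the loss or any gradient has gone NaN/inf."""
    if not torch.isfinite(loss):
        raise RuntimeError(f"Non-finite loss: {loss.item()}")
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            raise RuntimeError(f"Non-finite gradient in {name}")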
IMPORTANT RED FLAGS (Debug within a few training runs):
[ ] Entropy not decaying (always high)
→ Check: Entropy regularization, exploration decay
[ ] Entropy goes to zero early
→ Check: Entropy coefficient too low, exploration too aggressive
[ ] Variance across seeds > 50% of mean
→ Check: Training is unstable or lucky, try more seeds
[ ] Network weights not changing
→ Check: Gradient zero, LR zero, network not connected
[ ] Loss = 0 (perfect fit)
→ Check: Network overfitting, reward too easy
MINOR RED FLAGS (Watch for patterns):
[ ] Training slower than expected
→ Check: LR too low, batch size too small, network too small
[ ] Occasional loss spikes
→ Check: Outlier data, reward outliers, clipping needed
[ ] Reward variance high
→ Check: Normal if environment stochastic, check if aligns with intent
[ ] Agent behavior seems random even late in training
→ Check: Entropy not decaying, exploration not stopping
Part 8: Rationalization Resistance
| Rationalization | Reality | Counter-Guidance |
|---|---|---|
| "Higher learning rate will speed up learning" | Can cause instability, often slows learning | Start with 3e-4, measure effect, don't assume |
| "Bigger network always learns better" | Oversized networks overfit, slow training | Start small (64-256 units), increase only if needed |
| "Random seed doesn't matter, RL is random anyway" | High variance indicates instability, not inherent randomness | Run 5+ seeds, variance should be low, not high |
| "I'll try all hyperparameters (grid search)" | Combinatorial explosion, wastes time, no diagnosis | Check environment/reward FIRST, then tune one param at a time |
| "Adding regularization helps unstable training" | Regularization is for overfitting, not instability | Instability usually LR or reward scale, not overfitting |
| "My algorithm is broken" | 80% chance environment, reward, or observation is broken | Check those FIRST before blaming algorithm |
| "More training always helps" | If reward plateaus, more training won't help | Check if training converged, if not why |
| "Skip observation normalization, network will learn to normalize" | Network should not spend capacity learning normalization | Normalize observations before network |
| "Test with epsilon > 0 to reduce variance" | Test should use learned policy, exploration harms test | Use greedy policy at test time |
| "If loss doesn't decrease, algorithm is broken" | More likely: reward scale wrong, gradient clipping needed | Check reward scale, enable gradient clipping before changing algorithm |
Key Takeaways
Follow the systematic process: Don't random tweak. Check environment → reward → observation → algorithm.
80/20 rule: Most failures are in environment, reward, or observation. Check those first.
Reward scale is critical: Most common bug. Normalize to [-1, 1].
Diagnosis trees: Use them. Different symptoms have different root causes.
Metrics tell you everything: Loss, entropy, gradient norms reveal what's wrong.
Rationalization is the enemy: Don't assume, measure. Plot curves, check outputs, verify.
Simple environment first: If agent can't learn CartPole, bigger environment won't help.
One seed is not enough: Run 5+ seeds, look at variance, not just mean.
This skill is about systematic debugging, not random tweaking. Apply the framework, follow diagnosis trees, and you'll find the bug.