| name | exploration-strategies |
| description | Master ε-greedy, UCB, curiosity-driven, RND, intrinsic motivation exploration |
Exploration Strategies in Deep RL
When to Use This Skill
Invoke this skill when you encounter:
- Exploration-Exploitation Problem: Agent stuck in local optimum, not finding sparse rewards
- ε-Greedy Tuning: Designing or debugging epsilon decay schedules
- Sparse Reward Environments: Montezuma's Revenge, goal-conditioned tasks, minimal feedback
- Large State Spaces: Too many states for random exploration to be effective
- Curiosity-Driven Learning: Implementing or understanding intrinsic motivation
- RND (Random Network Distillation): Novelty-based exploration for sparse rewards
- Count-Based Exploration: Encouraging discovery in discrete/tabular domains
- Exploration Stability: Agent explores too much/little, inconsistent performance
- Method Selection: Which exploration strategy for this problem?
- Computational Cost: Balancing exploration sophistication vs overhead
- Boltzmann Exploration: Softmax-based action selection and temperature tuning
Core Problem: Many RL agents get stuck exploiting a local optimum, never finding sparse rewards or exploring high-dimensional state spaces effectively. Choosing the right exploration strategy is fundamental to success.
Do NOT Use This Skill For
- Algorithm selection (route to rl-foundations or specific algorithm skills like value-based-methods, policy-gradient-methods)
- Reward design issues (route to reward-shaping-engineering)
- Environment bugs causing poor exploration (route to rl-debugging first to verify environment works correctly)
- Basic RL concepts (route to rl-foundations for MDPs, value functions, Bellman equations)
- Training instability unrelated to exploration (route to appropriate algorithm skill or rl-debugging)
Core Principle: The Exploration-Exploitation Tradeoff
The Fundamental Tension
In reinforcement learning, every action selection is a decision:
- Exploit: Take the action with highest estimated value (maximize immediate reward)
- Explore: Try a different action to learn about its value (find better actions)
Exploitation Extreme:
- Only take the best-known action
- High immediate reward (in training)
- BUT: Stuck in local optimum if initial action wasn't optimal
- Risk: Never find the actual best reward
Exploration Extreme:
- Take random actions uniformly
- Will eventually find any reward
- BUT: Wasting resources on clearly bad actions
- Risk: No learning because too much randomness
Optimal Balance:
- Explore enough to find good actions
- Exploit enough to benefit from learning
Why Exploration Matters
Scenario 1: Sparse Reward Environment
Imagine an agent in Montezuma's Revenge (classic exploration benchmark):
- Most states give reward = 0
- First coin gives +1 (at step 500+)
- Without exploring systematically, random actions won't find that coin in millions of steps
Without an exploration strategy:
- Random actions produce no reward signal for thousands (often millions) of steps
- Problem: there is nothing to learn from until the coin is finally stumbled upon, if it ever is
With smart exploration (e.g., RND):
- Novelty bonuses flag unexplored states and steer the agent toward them
- Result: the coin is found in a small fraction of the steps random exploration would need
Scenario 2: Local Optimum Trap
Agent finds a small reward (+1) from a simple policy, while a better +5 reward exists elsewhere:
Without decay:
- Agent learns exploit_policy achieves +1
- ε-greedy with ε=0.3: Still 30% random (good, explores)
- BUT: 70% exploiting suboptimal policy indefinitely
With decay:
- Step 0: ε=1.0, 100% explore
- Step 100k: ε=0.05, 5% explore
- Step 500k: ε=0.01, 1% explore
- Result: Enough exploration to find +5 reward, then exploit it
Core Rule
Exploration is an investment with declining returns.
- Early training: Exploration critical (don't know anything yet)
- Mid training: Balanced (learning but not confident)
- Late training: Exploitation dominant (confident in good actions)
Part 1: ε-Greedy Exploration
The Baseline Method
ε-Greedy is the simplest exploration strategy: with probability ε, take a random action; otherwise, take the greedy (best-known) action.
import numpy as np

def epsilon_greedy_action(q_values, epsilon):
    """
    Select action using ε-greedy.

    Args:
        q_values: Q(s, ·) - values for all actions
        epsilon: exploration probability in [0, 1]

    Returns:
        action: int (0 to num_actions-1)
    """
    if np.random.random() < epsilon:
        # Explore: uniform random action
        return np.random.randint(len(q_values))
    else:
        # Exploit: best-known action
        return np.argmax(q_values)
Why ε-Greedy Works
- Simple: Easy to implement and understand
- Guaranteed Convergence: Will eventually visit all states (if ε > 0)
- Effective Baseline: Works surprisingly well for many tasks
- Interpretable: ε has clear meaning (probability of random action)
When ε-Greedy Fails
Problem Space → Exploration Effectiveness:
Small discrete spaces (< 100 actions):
- ε-greedy: Excellent ✓
- Reason: Random exploration covers space quickly
Large discrete spaces (100-10,000 actions):
- ε-greedy: Poor ✗
- Reason: Random action is almost always bad
- Example: With 500 actions, a random action has only a 1/500 chance of being the right one
Continuous action spaces:
- ε-greedy: Terrible ✗
- Reason: Random action in [-∞, ∞] is meaningless noise
- Alternative: Gaussian noise on action (not true ε-greedy)
Sparse rewards, large state spaces:
- ε-greedy: Hopeless ✗
- Reason: Random exploration won't find rare reward before heat death
- Alternative: Curiosity, RND, intrinsic motivation
ε-Decay Schedules
The key insight: ε should decay over time. Explore early, exploit late.
Linear Decay
def epsilon_linear(step, total_steps, epsilon_start=1.0, epsilon_end=0.1):
    """
    Linear decay from epsilon_start to epsilon_end.

    ε(t) = ε_start - (ε_start - ε_end) * t / T
    """
    t = min(step, total_steps)
    return epsilon_start - (epsilon_start - epsilon_end) * t / total_steps
Properties:
- Simple, predictable, easy to tune
- Equal exploration reduction per step
- Good for most tasks
Guidance:
- Use if no special knowledge about task
- epsilon_start = 1.0 (explore fully initially)
- epsilon_end = 0.01 to 0.1 (small residual exploration)
- total_steps = 1,000,000 (typical for deep RL)
Exponential Decay
def epsilon_exponential(step, decay_rate=0.9995):
    """
    Exponential decay with a constant per-step rate.

    ε(t) = ε_0 * decay_rate^t
    """
    return 1.0 * (decay_rate ** step)
Properties:
- Fast initial decay, slow tail
- Aggressive early exploration cutoff
- Exploration drops exponentially
Guidance:
- Use if task rewards are found quickly
- decay_rate = 0.9995 is gentle (ε halves roughly every 1,400 steps)
- decay_rate = 0.999 is aggressive (ε halves roughly every 700 steps)
- Watch for premature convergence to a local optimum; a half-life helper is sketched below
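If it is easier to think in terms of a half-life (how many steps until ε halves), a small helper can convert that into a per-step decay rate. This is a minimal sketch; the helper name is illustrative, not part of the schedules above.
```python
def decay_rate_from_half_life(half_life_steps):
    """Per-step multiplicative rate such that epsilon halves every half_life_steps."""
    return 0.5 ** (1.0 / half_life_steps)

# Example: halve epsilon every 100k steps of a 1M-step run
rate = decay_rate_from_half_life(100_000)     # ≈ 0.999993
epsilon_at_end = 1.0 * rate ** 1_000_000      # = 0.5 ** 10 ≈ 0.001
```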
Polynomial Decay
def epsilon_polynomial(step, total_steps, epsilon_start=1.0,
                       epsilon_end=0.01, power=2.0):
    """
    Polynomial decay: ε(t) = ε_end + (ε_start - ε_end) * (1 - t/T)^p

    power=1:   Linear
    power=2:   Quadratic (faster early decay)
    power=0.5: Slower decay
    """
    t = min(step, total_steps)
    fraction = t / total_steps
    return epsilon_end + (epsilon_start - epsilon_end) * (1 - fraction) ** power
Properties:
- Smooth, tunable decay curve
- Power > 1: Fast early decay, slow tail
- Power < 1: Slow early decay, fast tail
Guidance:
- power = 2.0: Quadratic (balanced, common)
- power = 3.0: Cubic (aggressive early decay)
- power = 0.5: Slower (gentle early decay)
Practical Guidance: Choosing Epsilon Parameters
Rule of Thumb:
- epsilon_start = 1.0 (explore uniformly initially)
- epsilon_end = 0.01 to 0.1 (maintain minimal exploration)
- 0.01: For large action spaces (need some exploration)
- 0.05: Default choice
- 0.1: For small action spaces (can afford random actions)
- total_steps: Based on training duration
- Usually 500k to 1M steps
- Longer if rewards are sparse or delayed
Task-Specific Adjustments:
- Sparse rewards: Longer decay (explore for more steps)
- Dense rewards: Shorter decay (can exploit earlier)
- Large action space: Higher epsilon_end (maintain exploration)
- Small action space: Lower epsilon_end (exploitation is cheap)
ε-Greedy Pitfall 1: Decay Too Fast
# WRONG: Decays to 0 in just 10k steps
epsilon_final = 0.01
decay_steps = 10_000
epsilon = epsilon_final ** (step / decay_steps) # ← BUG
# CORRECT: Decays gently over training
total_steps = 1_000_000
epsilon_linear(step, total_steps, epsilon_start=1.0, epsilon_end=0.01)
Symptom: Agent plateaus early, never improves past initial local optimum
Fix: Use longer decay schedule, ensure epsilon_end > 0
ε-Greedy Pitfall 2: Never Decays (Constant ε)
# WRONG: Fixed epsilon forever
epsilon = 0.3 # Constant
# CORRECT: Decay epsilon over time
epsilon = epsilon_linear(step, total_steps=1_000_000)
Symptom: Agent learns but performance noisy, can't fully exploit learned policy
Fix: Add epsilon decay schedule
ε-Greedy Pitfall 3: Epsilon on Continuous Actions
# WRONG: Discrete epsilon-greedy on continuous actions
action = np.random.uniform(-1, 1) if random() < epsilon else greedy_action
# CORRECT: Gaussian noise on continuous actions
def continuous_exploration(action, exploration_std=0.1):
    return action + np.random.normal(0, exploration_std, action.shape)
Symptom: Continuous action spaces don't benefit from ε-greedy (random action is meaningless)
Fix: Use Gaussian noise or other continuous exploration methods
Part 2: Boltzmann Exploration
Temperature-Based Action Selection
Instead of deterministic greedy action, select actions proportional to their Q-values using softmax with temperature T.
def boltzmann_exploration(q_values, temperature=1.0):
    """
    Select action using the Boltzmann (softmax) distribution.

    P(a) = exp(Q(s,a) / T) / Σ_a' exp(Q(s,a') / T)

    Args:
        q_values: Q(s, ·) - values for all actions
        temperature: Exploration parameter
            T → 0: approaches deterministic (greedy)
            T → ∞: approaches uniform random

    Returns:
        action: int (sampled from the distribution)
    """
    # Subtract max for numerical stability
    q_shifted = q_values - np.max(q_values)

    # Compute probabilities
    probabilities = np.exp(q_shifted / temperature)
    probabilities = probabilities / np.sum(probabilities)

    # Sample action
    return np.random.choice(len(q_values), p=probabilities)
Properties vs ε-Greedy
| Feature | ε-Greedy | Boltzmann |
|---|---|---|
| Best-known action | Probability: 1-ε+ε/n | Probability: highest (scales with exp(Q/T)) |
| Other actions | Probability: ε/n each | Probability: lower (scales with exp(Q/T)) |
| Action selection | Deterministic or random | Stochastic distribution |
| Exploration | Uniform random | Biased toward better actions |
| Tuning | ε (1 parameter) | T (1 parameter) |
Key Advantage: Boltzmann balances better—good actions are preferred but still get chances.
Example: Three actions with Q=[10, 0, -10]
ε-Greedy (ε=0.2, 3 actions):
- Action 0: P ≈ 0.87 (0.8 exploit + 0.2/3 random)
- Action 1: P ≈ 0.07 (random only)
- Action 2: P ≈ 0.07 (random only)
- Problem: The plausible runner-up (Q=0) is sampled no more often than the clearly bad action (Q=-10)
Boltzmann (T=5):
- Action 0: P ≈ 0.87 (exp(10/5) = e^2 ≈ 7.4)
- Action 1: P ≈ 0.12 (exp(0/5) = 1)
- Action 2: P ≈ 0.02 (exp(-10/5) = e^-2 ≈ 0.14)
- Better: Action 1 still gets ~12% (not negligible), while the clearly bad action is nearly ignored (verified numerically below)
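These probabilities can be checked directly with the softmax used in boltzmann_exploration; a quick numerical sketch:
```python
import numpy as np

q = np.array([10.0, 0.0, -10.0])
T = 5.0

# Same computation as boltzmann_exploration, without the sampling step
logits = (q - q.max()) / T
probs = np.exp(logits) / np.exp(logits).sum()
print(probs.round(3))  # [0.867 0.117 0.016]
```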
Temperature Decay Schedule
Like epsilon, temperature should decay: start high (explore), end low (exploit).
def temperature_decay(step, total_steps, temp_start=1.0, temp_end=0.1):
    """
    Linear temperature decay.

    T(t) = T_start - (T_start - T_end) * t / total_steps
    """
    t = min(step, total_steps)
    return temp_start - (temp_start - temp_end) * t / total_steps

# Usage in training loop
for step in range(total_steps):
    T = temperature_decay(step, total_steps)
    action = boltzmann_exploration(q_values, temperature=T)
    # ...
When to Use Boltzmann vs ε-Greedy
Choose ε-Greedy if:
- Simple implementation preferred
- Discrete action space
- Task has clear good/bad actions (wide Q-value spread)
Choose Boltzmann if:
- Actions have similar Q-values (nuanced exploration)
- Want to bias exploration toward promising actions
- Fine-grained control over exploration desired
Part 3: UCB (Upper Confidence Bound)
Theoretical Optimality
UCB is provably optimal for the multi-armed bandit problem:
def ucb_action(q_values, action_counts, total_visits, c=1.0):
    """
    Select action using the Upper Confidence Bound.

    UCB(a) = Q(a) + c * sqrt(ln(N) / N(a))

    Args:
        q_values: Current Q-value estimates
        action_counts: N(a) - times each action was selected
        total_visits: N - total visits to the state
        c: Exploration constant (usually 1.0 or sqrt(2))

    Returns:
        action: int (maximizing UCB)
    """
    # Avoid division by zero
    action_counts = np.maximum(action_counts, 1)

    # Compute exploration bonus
    exploration_bonus = c * np.sqrt(np.log(total_visits) / action_counts)

    # Upper confidence bound
    ucb = q_values + exploration_bonus
    return np.argmax(ucb)
Why UCB Works
UCB balances exploitation and exploration via optimism under uncertainty:
- If Q(a) is high → exploit it
- If Q(a) is uncertain (rarely visited) → exploration bonus makes UCB high
Example: Bandit with 2 arms, using c = sqrt(2) ≈ 1.41
- Arm A: Visited 100 times, estimated Q=2.0
- Arm B: Visited 10 times, estimated Q=1.5
UCB(A) = 2.0 + 1.41 * sqrt(ln(110) / 100) ≈ 2.0 + 0.31 = 2.31
UCB(B) = 1.5 + 1.41 * sqrt(ln(110) / 10) ≈ 1.5 + 0.97 = 2.47
Result: Try Arm B despite its lower Q estimate (it is less certain); checked numerically below
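The same numbers fall out of the ucb_action logic above; a quick check (with c = sqrt(2)):
```python
import numpy as np

q_values = np.array([2.0, 1.5])       # Arm A, Arm B
action_counts = np.array([100, 10])
total_visits = 110
c = np.sqrt(2)

bonus = c * np.sqrt(np.log(total_visits) / action_counts)
print((q_values + bonus).round(2))    # [2.31 2.47] → argmax selects Arm B
```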
Critical Limitation: Doesn't Scale to Deep RL
UCB assumes tabular setting (small, discrete state space where you can count visits):
# WORKS: Tabular Q-learning
from collections import defaultdict
state_action_counts = defaultdict(int)  # N(s, a)
state_counts = defaultdict(int)         # N(s)
# BREAKS in deep RL:
# With function approximation, states don't repeat exactly
# Can't count "how many times visited state X" in continuous/image observations
Practical Issue:
In image-based RL (Atari, vision), never see the same pixel image twice. State counting is impossible.
When UCB Applies
Use UCB if:
✓ Discrete action space (< 100 actions)
✓ Discrete state space (< 10,000 states)
✓ Tabular Q-learning (no function approximation)
✓ Rewards come quickly (don't need long-term planning)
Examples: Simple bandits, small Gridworlds, discrete card games
DO NOT use UCB if:
✗ Using neural networks (state approximation)
✗ Continuous actions or large state space
✗ Image observations (pixel space too large)
✗ Sparse rewards (need different methods)
Connection to Deep RL
For deep RL, need to estimate uncertainty without explicit counts:
def deep_ucb_approximation(mean_q, uncertainty, c=1.0):
    """
    Approximate UCB using learned uncertainty (not action counts).

    Used in methods like:
    - Deep Ensembles: use ensemble variance as uncertainty
    - MC-Dropout: use dropout variance as uncertainty
    - Bootstrapped DQN: an ensemble of Q-networks gives uncertainty estimates

    UCB ≈ Q(s,a) + c * uncertainty(s,a)
    """
    return mean_q + c * uncertainty
Modern Approach: Instead of counting visits, learn uncertainty through:
- Ensemble Methods: Train multiple Q-networks, use disagreement
- Bayesian Methods: Learn posterior over Q-values
- Bootstrap DQN: Separate Q-networks give uncertainty estimates
These adapt UCB principles to deep RL.
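As one concrete illustration, ensemble disagreement can play the role of the count-based bonus. This is a minimal sketch under the assumption that you already maintain a list of independently initialized Q-networks; it is not a specific library API.
```python
import torch

def ensemble_ucb_action(q_networks, state, c=1.0):
    """UCB-style action selection using ensemble disagreement as uncertainty."""
    with torch.no_grad():
        # Shape: (n_members, n_actions) - one row of Q-values per ensemble member
        all_q = torch.stack([q_net(state) for q_net in q_networks])
    mean_q = all_q.mean(dim=0)          # Consensus value estimate
    uncertainty = all_q.std(dim=0)      # Disagreement ≈ epistemic uncertainty
    return int(torch.argmax(mean_q + c * uncertainty))
```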
Part 4: Curiosity-Driven Exploration (ICM)
The Core Insight
Prediction Error as Exploration Signal
Agent is "curious" about states where it can't predict the next state well:
Intuition: If I can't predict what will happen, I probably
haven't learned about this state yet. Let me explore here!
Intrinsic Reward = ||next_state - predicted_next_state||^2
Intrinsic Curiosity Module (ICM)
import torch
import torch.nn as nn

class IntrinsicCuriosityModule(nn.Module):
    """
    ICM = Forward Model + Inverse Model

    Forward model: predicts the next state from (state, action)
    - Input: current state + action taken
    - Output: predicted next state
    - Its prediction error is the "surprise" used as intrinsic reward

    Inverse model: predicts the action from (state, next_state)
    - Input: current state and next state
    - Output: predicted action
    - Purpose: force a representation that distinguishes states
    """
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        # Inverse model: (s, s') → a
        self.inverse_model = nn.Sequential(
            nn.Linear(2 * state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )
        # Forward model: (s, a) → s'
        # (named forward_model so it does not shadow nn.Module.forward)
        self.forward_model = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, state_dim)
        )

    def compute_intrinsic_reward(self, state, action, next_state):
        """
        Curiosity reward = prediction error of the forward model.

        High error → unfamiliar transition → reward exploration
        Low error  → familiar transition   → ignore (already learned)
        """
        with torch.no_grad():
            predicted_next = self.forward_model(torch.cat([state, action], dim=-1))
        # L2 prediction error per transition (exploration bonus)
        return torch.norm(next_state - predicted_next, dim=-1)

    def loss(self, state, action, next_state):
        """
        Combined training loss.

        Forward loss: forward model prediction error
        Inverse loss: inverse model action prediction error
        (MSE on the action vector; use cross-entropy for discrete action logits)
        """
        # Forward loss
        predicted_next = self.forward_model(torch.cat([state, action], dim=-1))
        forward_loss = torch.mean((next_state - predicted_next) ** 2)

        # Inverse loss
        predicted_action = self.inverse_model(torch.cat([state, next_state], dim=-1))
        inverse_loss = torch.mean((action - predicted_action) ** 2)

        return forward_loss + inverse_loss
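A minimal usage sketch for the module above (the batch of transitions and one-hot action encoding are made up for illustration; the optimizer settings are not tuned):
```python
import torch

icm = IntrinsicCuriosityModule(state_dim=8, action_dim=4)
icm_optimizer = torch.optim.Adam(icm.parameters(), lr=1e-4)

# Illustrative batch of transitions with one-hot actions
state = torch.randn(32, 8)
action = torch.nn.functional.one_hot(torch.randint(0, 4, (32,)), 4).float()
next_state = torch.randn(32, 8)

# Exploration bonus to add to the task reward
r_intrinsic = icm.compute_intrinsic_reward(state, action, next_state)

# Train forward + inverse models on the same batch
icm_loss = icm.loss(state, action, next_state)
icm_optimizer.zero_grad()
icm_loss.backward()
icm_optimizer.step()
```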
Why Both Forward and Inverse Models?
Forward model alone:
- Can predict next state without learning features
- Might latch onto trivially predictable details (e.g., which pixels change when action X is pressed)
- Doesn't necessarily learn task-relevant state representation
Inverse model:
- Forces feature learning that distinguishes states
- Can only predict action if states are well-represented
- Improves forward model's learned representation
Together: Forward + Inverse
- Better feature learning (inverse helps)
- Better prediction (forward is primary)
Critical Pitfall: Random Environment Trap
# WRONG: Using curiosity in stochastic environment
# Environment: Atari with pixel randomness/motion artifacts
# Agent gets reward for predicting pixel noise
# Prediction error = pixels changed randomly
# Intrinsic reward goes to the noisiest state!
# Result: Agent learns nothing about task, just explores random pixels
# CORRECT: Use RND instead (next section)
# RND uses FROZEN random network, doesn't get reward for actual noise
Key Distinction:
- ICM: Learns to predict environment (breaks if environment has noise/randomness)
- RND: Uses frozen random network (robust to environment randomness)
Computational Cost
# ICM adds significant overhead:
# - Forward model network (encoder + layers + output)
# - Inverse model network (encoder + layers + output)
# - Training both networks every step
# Overhead estimate:
# Base agent: 1 network (policy/value)
# With ICM: 3+ networks (policy + forward + inverse)
# Training time: ~2-3× longer
# Memory: ~3× larger
# When justified:
# - Sparse rewards (ICM critical)
# - Large state spaces (ICM helps)
#
# When NOT justified:
# - Dense rewards (environment signal sufficient)
# - Continuous control with simple rewards (ε-greedy enough)
Part 5: RND (Random Network Distillation)
The Elegant Solution
RND is simpler and more robust than ICM:
class RandomNetworkDistillation(nn.Module):
    """
    RND: intrinsic reward = prediction error against a random target network.

    Key innovation: the target network is RANDOM and FROZEN (never updated).

    Two networks:
    1. Target (random, frozen): f_target(s) - fixed throughout training
    2. Predictor (trained):     f_predict(s) - learns to predict the target

    Intrinsic reward = ||f_target(s) - f_predict(s)||

    New state (not yet seen) → high prediction error → reward exploration
    Familiar state           → low prediction error  → ignore
    """
    def __init__(self, state_dim, embedding_dim=128):
        super().__init__()
        # Target network: random, never updated
        self.target = nn.Sequential(
            nn.Linear(state_dim, embedding_dim),
            nn.ReLU(),
            nn.Linear(embedding_dim, embedding_dim)
        )
        # Predictor network: learns to mimic the target
        self.predictor = nn.Sequential(
            nn.Linear(state_dim, embedding_dim),
            nn.ReLU(),
            nn.Linear(embedding_dim, embedding_dim)
        )
        # Freeze target network
        for param in self.target.parameters():
            param.requires_grad = False

    def compute_intrinsic_reward(self, state, scale=1.0):
        """
        Intrinsic reward = prediction error of the predictor vs the frozen target.

        Args:
            state: Current observation
            scale: Scale factor for the reward (often 0.01-1.0, tuned per task)

        Returns:
            Intrinsic reward (novelty signal)
        """
        # No gradients needed: the reward is a fixed target for the agent
        with torch.no_grad():
            target_features = self.target(state)
            predicted_features = self.predictor(state)
        # L2 prediction error
        prediction_error = torch.norm(
            target_features - predicted_features,
            p=2,
            dim=-1
        )
        return scale * prediction_error

    def predictor_loss(self, state):
        """
        Loss for the predictor: minimize prediction error.
        Only the predictor is updated (the target stays frozen).
        """
        with torch.no_grad():
            target_features = self.target(state)
        predicted_features = self.predictor(state)
        # MSE loss
        return torch.mean((target_features - predicted_features) ** 2)
Why RND is Elegant
- No Environment Model: Doesn't need to model dynamics (unlike ICM)
- Robust to Randomness: Random network isn't trying to predict anything real, so environment noise doesn't fool it
- Simple: Just predict random features
- Fast: Train only predictor (target frozen)
RND vs ICM Comparison
| Aspect | ICM | RND |
|---|---|---|
| Networks | Forward + Inverse | Target (frozen) + Predictor |
| Learns | Environment dynamics | Random feature prediction |
| Robust to noise | No (breaks with stochastic envs) | Yes (random target immune) |
| Complexity | High (3+ networks, 2 losses) | Medium (2 networks, 1 loss) |
| Computation | 2-3× base agent | 1.5-2× base agent |
| When to use | Dense features, clean env | Sparse rewards, noisy env |
RND Pitfall: Training Instability
# WRONG: High learning rate, large reward scale
rnd_loss = rnd.predictor_loss(state)
optimizer.zero_grad()
rnd_loss.backward()
optimizer.step() # ← high learning rate causes divergence
# CORRECT: Careful hyperparameter tuning
from torch.optim import Adam
rnd_lr = 1e-4  # Much smaller than the main agent's learning rate
rnd_optimizer = Adam(rnd.predictor.parameters(), lr=rnd_lr)
# Scale intrinsic reward appropriately
intrinsic_reward = rnd.compute_intrinsic_reward(state, scale=0.01)
Symptom: RND rewards explode, agent overfits to novelty
Fix: Lower the learning rate for RND and scale or normalize intrinsic rewards carefully (see the normalization sketch below)
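One widely used stabilizer is to normalize intrinsic rewards by a running estimate of their standard deviation, so their scale stays roughly constant as the predictor improves. A minimal sketch (Welford's online variance; the class name is illustrative):
```python
import numpy as np

class RunningRewardNormalizer:
    """Divides intrinsic rewards by a running estimate of their std."""

    def __init__(self, epsilon=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0          # Sum of squared deviations (Welford's algorithm)
        self.epsilon = epsilon

    def normalize(self, reward):
        # Update running statistics
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)
        if self.count < 2:
            return reward      # Not enough data for a std estimate yet
        std = np.sqrt(self.m2 / (self.count - 1)) + self.epsilon
        return reward / std

# Usage: r_intrinsic = normalizer.normalize(rnd.compute_intrinsic_reward(s).item())
```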
Part 6: Count-Based Exploration
State Visitation Counts
For discrete/tabular environments, track how many times each state visited:
from collections import defaultdict

class CountBasedExploration:
    """
    Count-based exploration: encourage visiting rarely-seen states.

    Works for:
    ✓ Tabular settings (small discrete state space)
    ✓ Gridworlds, simple games

    Doesn't work for:
    ✗ Continuous spaces
    ✗ Image observations (never see the same image twice)
    ✗ Large state spaces
    """
    def __init__(self):
        self.state_counts = defaultdict(int)

    def compute_intrinsic_reward(self, state, reward_scale=1.0):
        """
        Intrinsic reward inversely proportional to state visitation.

        intrinsic_reward = reward_scale / sqrt(N(s))

        Rarely visited states (small N)     → high intrinsic reward
        Frequently visited states (large N) → low intrinsic reward
        """
        count = max(self.state_counts[state], 1)  # Avoid division by zero
        return reward_scale / np.sqrt(count)

    def update_counts(self, state):
        """Increment the visitation count for a state."""
        self.state_counts[state] += 1
Example: Gridworld with Sparse Reward
# Gridworld: 10×10 grid, reward at (9, 9), start at (0, 0)
# Without exploration: random walking takes a very long time
# With count-based bonus: directed toward unexplored cells

# Pseudocode (tabular Q-learning with an exploration bonus):
intrinsic_coef = 0.1  # λ, weight on the exploration bonus
for episode in range(episodes):
    state = env.reset()
    for step in range(max_steps):
        # Act, then observe the task reward
        action = epsilon_greedy_action(q_values[state], epsilon)
        next_state, env_reward = env.step(action)

        # Exploration bonus for the state we just reached
        intrinsic_reward = count_explorer.compute_intrinsic_reward(next_state)
        combined_reward = env_reward + intrinsic_coef * intrinsic_reward

        # Q-learning update with the combined reward
        q_values[state][action] += alpha * (
            combined_reward + gamma * max(q_values[next_state]) - q_values[state][action]
        )

        # Update visitation counts
        count_explorer.update_counts(next_state)
        state = next_state
Critical Limitation: Doesn't Scale
# Works: small state space
state_space_size = 100  # 10×10 grid
# Can track counts for all states

# Fails: large/continuous state space
state_space_size = 10**18  # e.g., image observations
# Can't track visitation counts for 10^18 unique states!
Part 7: When Exploration is Critical
Decision Framework
Exploration matters when:
Sparse Rewards (rewards rare, hard to find)
- Examples: Montezuma's Revenge, goal-conditioned tasks, real robotics
- No dense reward signal to guide learning
- Agent must explore to find any reward
- Solution: Intrinsic motivation (curiosity, RND)
Large State Spaces (too many possible states)
- Examples: Image-based RL, continuous control
- Random exploration covers infinitesimal fraction
- Systematic exploration essential
- Solution: Curiosity-driven or RND
Long Horizons (many steps before reward)
- Examples: Multi-goal tasks, planning problems
- Temporal credit assignment hard
- Need to explore systematically to connect actions to delayed rewards
- Solution: Sophisticated exploration strategy
Deceptive Reward Landscape (local optima common)
- Examples: Multiple solutions, trade-offs
- Easy to get stuck in suboptimal policy
- Exploration helps escape local optima
- Solution: Slow decay schedule, maintain exploration
Decision Framework (Quick Check)
Do you have SPARSE rewards?
YES → Use intrinsic motivation (curiosity, RND)
NO → Continue
Is state space large (images, continuous)?
YES → Use curiosity-driven or RND
NO → Continue
Is exploration reasonably efficient with ε-greedy?
YES → Use ε-greedy + appropriate decay schedule
NO → Use curiosity-driven or RND
Example: Reward Structure Analysis
def analyze_reward_structure(rewards):
    """Determine whether a dedicated exploration strategy is needed."""
    # Check sparsity
    nonzero_rewards = np.count_nonzero(rewards)
    sparsity = 1 - (nonzero_rewards / len(rewards))
    if sparsity > 0.95:
        print("SPARSE REWARDS detected")
        print("  → Use: Intrinsic motivation (RND or curiosity)")
        print("  → Why: Reward signal too rare to guide learning")

    # Check reward magnitude
    reward_std = np.std(rewards)
    if reward_std < 0.1:
        print("WEAK REWARD SIGNAL detected")
        print("  → Use: Intrinsic motivation")
        print("  → Why: Reward variation too small to learn from")

    # Check episode length (proxy for horizon)
    episode_length = len(rewards)
    if episode_length > 1000:
        print("LONG HORIZON detected")
        print("  → Use: Slow exploration decay or intrinsic motivation")
        print("  → Why: Temporal credit assignment is difficult")
Part 8: Combining Exploration with Task Rewards
Combining Intrinsic and Extrinsic Rewards
When using intrinsic motivation, balance with task reward:
def combine_rewards(extrinsic_reward, intrinsic_reward,
                    intrinsic_scale=0.01):
    """
    Combine extrinsic (task) and intrinsic (curiosity) rewards.

    r_total = r_extrinsic + λ * r_intrinsic

    λ controls the tradeoff:
    - λ = 0:    ignore intrinsic reward (no exploration bonus)
    - λ = 0.01: curiosity helps, task reward primary (typical)
    - λ = 0.1:  curiosity significant
    - λ = 1.0:  curiosity dominates (may ignore the task)
    """
    return extrinsic_reward + intrinsic_scale * intrinsic_reward
Challenges: Reward Hacking
# PROBLEM: Intrinsic reward encourages anything novel
# Even if novel thing is useless for task
# Example: Atari with RND
# If game has pixel randomness, RND rewards exploring random pixels
# Instead of exploring to find coins/power-ups
# SOLUTION: Scale intrinsic reward carefully
# Make it significant but not dominant
# SOLUTION 2: Curriculum learning
# Start with high intrinsic reward (discover environment)
# Gradually reduce as agent finds reward signals
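A minimal sketch of that curriculum: anneal the intrinsic coefficient λ the same way ε is annealed (the start/end values here are illustrative defaults, not recommendations).
```python
def intrinsic_scale_schedule(step, total_steps, scale_start=0.1, scale_end=0.001):
    """Linearly anneal the intrinsic-reward coefficient λ over training."""
    t = min(step, total_steps)
    return scale_start - (scale_start - scale_end) * t / total_steps

# r_total = r_task + intrinsic_scale_schedule(step, total_steps) * r_intrinsic
```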
Intrinsic Reward Scale Tuning
# Quick tuning procedure:
for intrinsic_scale in [0.001, 0.01, 0.1, 1.0]:
    agent = RL_Agent(intrinsic_reward_scale=intrinsic_scale)
    for episode in episodes:
        performance = train_episode(agent)
    print(f"Scale={intrinsic_scale}: Performance={performance}")
# Find scale where agent learns task well AND explores
# Usually 0.01-0.1 is sweet spot
Part 9: Common Pitfalls and Debugging
Pitfall 1: Epsilon Decay Too Fast
Symptom: Agent plateaus at poor performance early in training
Root Cause: Epsilon decays to near-zero before agent finds good actions
# WRONG: Decays to ~0 within 10k steps
epsilon_final = 0.0
epsilon_decay = 0.999  # Per-step multiplicative decay
# After 10k steps: ε ≈ 0.999^10,000 ≈ 0.00005 — almost no exploration left

# CORRECT: Decay over the full training run
total_training_steps = 1_000_000
epsilon_linear(step, total_training_steps,
               epsilon_start=1.0, epsilon_end=0.01)
Diagnosis:
- Plot epsilon over training: does it reach 0 too early?
- Check if performance improves after epsilon reaches low values
Fix:
- Use longer decay (more steps)
- Use higher epsilon_end (never decay to pure exploitation); the schedule check below helps catch this early
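A quick way to catch this before a long run: print the schedule at a few checkpoints (using the epsilon_linear helper from Part 1) and confirm there is still meaningful exploration at the midpoint. A sketch:
```python
total_steps = 1_000_000
for checkpoint in [0, 100_000, 250_000, 500_000, 750_000, 1_000_000]:
    eps = epsilon_linear(checkpoint, total_steps,
                         epsilon_start=1.0, epsilon_end=0.01)
    print(f"step {checkpoint:>9,}: epsilon = {eps:.3f}")
# A healthy schedule still explores noticeably at the midpoint (≈0.5 here).
```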
Pitfall 2: Intrinsic Reward Too Strong
Symptom: Agent explores forever, ignores task reward
Root Cause: Intrinsic reward scale too high
# WRONG: Intrinsic reward dominates
r_total = r_task + 1.0 * r_intrinsic
# Agent optimizes novelty, ignores task
# CORRECT: Intrinsic reward is small bonus
r_total = r_task + 0.01 * r_intrinsic
# Task reward primary, intrinsic helps exploration
Diagnosis:
- Agent explores everywhere but doesn't collect task rewards
- Intrinsic reward signal going to seemingly useless states
Fix:
- Reduce intrinsic_reward_scale (try 0.01, 0.001)
- Verify agent eventually starts collecting task rewards
Pitfall 3: ε-Greedy on Continuous Actions
Symptom: Exploration ineffective, agent doesn't learn
Root Cause: Random action in continuous space is meaningless
# WRONG: ε-greedy on continuous actions
if np.random.random() < epsilon:
    action = np.random.uniform(-1, 1)   # Random point in the action space
else:
    action = network(state)             # Policy network's action
# The random action is far from the learned policy, so it is completely unhelpful
# CORRECT: Gaussian noise on action
action = network(state)
noisy_action = action + np.random.normal(0, exploration_std)
noisy_action = np.clip(noisy_action, -1, 1)
Diagnosis:
- Continuous action space and using ε-greedy
- Agent not learning effectively
Fix:
- Use Gaussian noise: action + N(0, σ)
- Decay exploration_std over time, just like epsilon (see the schedule sketch below)
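The "Typical Configurations" section at the end references an exploration_std_linear schedule; a minimal sketch of it follows (the start/end std values are illustrative and depend on the action scale):
```python
import numpy as np

def exploration_std_linear(step, total_steps, std_start=0.3, std_end=0.01):
    """Linearly anneal the Gaussian exploration noise, mirroring epsilon decay."""
    t = min(step, total_steps)
    return std_start - (std_start - std_end) * t / total_steps

def noisy_continuous_action(policy_action, step, total_steps, low=-1.0, high=1.0):
    """Add decaying Gaussian noise to a continuous action and clip to bounds."""
    std = exploration_std_linear(step, total_steps)
    noise = np.random.normal(0, std, size=np.shape(policy_action))
    return np.clip(policy_action + noise, low, high)
```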
Pitfall 4: Forgetting to Decay Exploration
Symptom: Training loss decreases but policy doesn't improve, noisy behavior
Root Cause: Agent keeps exploring randomly instead of exploiting learned policy
# WRONG: Constant exploration forever
epsilon = 0.3
# CORRECT: Decaying exploration
epsilon = epsilon_linear(step, total_steps)
Diagnosis:
- No epsilon decay schedule mentioned in code
- Agent behaves randomly even after many training steps
Fix:
- Add decay schedule (linear, exponential, polynomial)
Pitfall 5: Using Exploration at Test Time
Symptom: Test performance worse than training, highly variable
Root Cause: Applying exploration strategy (ε > 0) at test time
# WRONG: Test with exploration
for test_episode in test_episodes:
    action = epsilon_greedy_action(q_values, epsilon=0.05)  # Still exploring at test time!

# CORRECT: Test with the greedy policy
for test_episode in test_episodes:
    action = np.argmax(q_values)  # Deterministic, no exploration
Diagnosis:
- Test performance has high variance
- Test performance < training performance (exploration hurts)
Fix:
- At test time, use a greedy/deterministic policy (a small evaluation helper is sketched below)
- No ε-greedy, no Boltzmann, no exploration noise
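A small evaluation helper that makes the distinction explicit. This is a sketch: env and q_function are placeholders, and env.step is assumed to return (state, reward, done).
```python
import numpy as np

def evaluate_greedy(env, q_function, num_episodes=10):
    """Run evaluation episodes with a purely greedy (no-exploration) policy."""
    returns = []
    for _ in range(num_episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            action = int(np.argmax(q_function(state)))  # epsilon = 0 at test time
            state, reward, done = env.step(action)
            total += reward
        returns.append(total)
    return float(np.mean(returns))
```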
Pitfall 6: RND Predictor Overfitting
Symptom: RND loss decreases but intrinsic rewards still large everywhere
Root Cause: Predictor overfits to training data, doesn't generalize to new states
# WRONG: High learning rate, no regularization
rnd_optimizer = Adam(rnd.predictor.parameters(), lr=0.001)
rnd_loss.backward()
rnd_optimizer.step()
# Predictor fits perfectly to seen states but doesn't generalize
# CORRECT: Lower learning rate, regularization
rnd_optimizer = Adam(rnd.predictor.parameters(), lr=0.0001)
# Add weight decay for regularization
Diagnosis:
- RND training loss is low (close to 0)
- But intrinsic rewards still high for most states
- Suggests predictor fitted to training states but not generalizing
Fix:
- Reduce RND learning rate
- Add weight decay (L2 regularization); a concrete optimizer setup is sketched below
- Use batch normalization in predictor
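A concrete version of the first two fixes (values are typical starting points, not tuned for any specific task):
```python
import torch

rnd_optimizer = torch.optim.Adam(
    rnd.predictor.parameters(),
    lr=1e-4,            # Lower than the main agent's learning rate
    weight_decay=1e-5,  # L2 regularization to discourage memorizing seen states
)
```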
Pitfall 7: Count-Based on Non-Tabular Problems
Symptom: Exploration ineffective, agent keeps revisiting similar states
Root Cause: State counting doesn't work for continuous/image spaces
# WRONG: Counting state IDs in image-based RL
state = env.render(mode='rgb_array')    # 84x84 image
state_id = hash(state.tobytes())        # Slight pixel differences → a new ID almost every frame
count_based_explorer.update_counts(state_id)
# Nearly every frame counts as "new", so the bonus never decreases
# State counting is broken here

# CORRECT: Use RND or curiosity instead
rnd = RandomNetworkDistillation(state_dim)
# RND handles high-dimensional observations
Diagnosis:
- Using count-based exploration with images/continuous observations
- Exploration not working effectively
Fix:
- Switch to RND or curiosity-driven methods
- Count-based only for small discrete state spaces
Part 10: Red Flags and Pressure Tests
Red Flags Checklist
- Constant epsilon (no decay): Agent explores forever, can't exploit
- Epsilon decays to zero (pure exploitation): Agent stuck in local optimum
- ε-greedy on continuous actions: Exploration unguided, learning slow
- Intrinsic reward >> task reward: Agent ignores task
- No exploration specified: Might be using default (often insufficient)
- Using RND in random/stochastic environment: Breaks because predicting noise
- Count-based on images: Can't count unique images, worthless
- Exploration at test time: Test performance degrades from exploration noise
- ICM on pixel-randomness environment: Forward model tries to predict noise
- Ignoring sparse rewards: Using ε-greedy only, no intrinsic motivation
Pressure Test Scenario 1: "My Agent Stuck in Local Optimum"
User Query: "I trained my agent and it found a solution giving +10 reward. But I know the optimal solution is +100. Agent is stuck at +10. How do I fix it?"
What Skill Must Provide:
- Diagnosis: Likely exploration problem (local optimum)
- Solution: Increase exploration (slower decay, or intrinsic motivation)
- Prevention: Design exploration strategy before training
Skill Response:
"Your agent found a local optimum (+10) and exploitation is too strong. This is a classic exploration-exploitation problem.
Quick Fixes:
- Check epsilon decay: Is epsilon decaying too fast? Use slower decay schedule.
- Check epsilon_end: Is epsilon_end = 0? Change to 0.01-0.05.
- Add intrinsic motivation: Your environment might need curiosity-driven exploration.
Systematic Approach:
- If rewards are dense (frequent signals): ε-greedy with slow decay
- If rewards are sparse (rare signals): Add RND or curiosity
- If state space huge: Definitely use RND
Example Fix:
```python
# Instead of epsilon decaying to 0.01 in 100k steps,
# decay over 1M steps with epsilon_end=0.05
epsilon = epsilon_linear(step, total_steps=1_000_000,
                         epsilon_start=1.0, epsilon_end=0.05)
```"
### Pressure Test Scenario 2: "Which Exploration for Montezuma's Revenge?"
**User Query:** "I'm training on Montezuma's Revenge (classic sparse reward benchmark). ε-Greedy isn't working (agent never finds first coin). What exploration strategy should I use?"
**What Skill Must Provide:**
1. Recognition: This is THE benchmark for intrinsic motivation
2. Solution: RND is standard, ICM is alternative
3. Implementation: Code/guidance for using RND
**Skill Response:**
"Montezuma's Revenge is the canonical example where ε-greedy fails. The first coin is at step 500+ with no reward signal. Random exploration won't find it.
**Recommended:** RND (Random Network Distillation)
Why RND:
- Robust to stochasticity (doesn't try to predict environment)
- Simpler than ICM (no environment model needed)
- Proven to work on Montezuma's Revenge
**Implementation:**
```python
rnd = RandomNetworkDistillation(state_dim=...)
rnd_optimizer = torch.optim.Adam(rnd.predictor.parameters(), lr=1e-4)
lambda_intrinsic = 0.01  # Scale on the intrinsic reward

state = env.reset()
for step in range(training_steps):
    # Act in the environment
    action = agent.select_action(state)
    next_state, r_task = env.step(action)

    # Intrinsic reward for the newly reached state
    r_intrinsic = rnd.compute_intrinsic_reward(next_state)

    # Combine with the task reward and train the agent on r_total
    r_total = r_task + lambda_intrinsic * r_intrinsic

    # Update the RND predictor
    rnd_loss = rnd.predictor_loss(next_state)
    rnd_optimizer.zero_grad()
    rnd_loss.backward()
    rnd_optimizer.step()

    state = next_state
```
Alternative: Curiosity-driven (ICM) also works if the environment is deterministic.
What to tune:
- lambda_intrinsic: Start 0.01, increase if agent doesn't explore enough
- RND learning rate: Usually 1e-4 (much smaller than main agent)"
Pressure Test Scenario 3: "ε-Greedy Decay Schedule"
User Query: "I'm using ε-greedy but not sure about the decay schedule. I have 100k training steps. Should I decay epsilon linearly? Over how many steps? To what final value?"
What Skill Must Provide:
- Formula: Linear, exponential, polynomial decay
- Guidance: Typical values for different problem types
- Tuning: How to adjust based on task
Skill Response:
"Quick Answer:
total_steps = 100_000
epsilon = epsilon_linear(step, total_steps,
epsilon_start=1.0, epsilon_end=0.05)
Reasoning:
epsilon_start=1.0: Explore fully initially (know nothing)epsilon_end=0.05: Keep 5% exploration always (helps with stochasticity)- Linear decay: Simple, works well for most tasks
Decision Tree:
How dense are rewards?
- Sparse: Use slower decay (maybe 0.1 epsilon_end)
- Dense: Use faster decay (0.01 epsilon_end)
How many steps?
- 100k steps: Decay over 50k-100k (half to full training)
- 1M steps: Decay over 500k-1M
- Rule: Decay over 50-100% of training time
Which decay curve?
- Linear (ε_linear): Safe default
- Exponential (ε_exponential): Faster early decay
- Polynomial with power=2: Balanced
Example Tuning:
```python
# Sparse rewards: more exploration needed
epsilon = epsilon_linear(step, total_steps=100_000,
                         epsilon_start=1.0, epsilon_end=0.1)

# Very sparse rewards: even more
epsilon = epsilon_linear(step, total_steps=100_000,
                         epsilon_start=1.0, epsilon_end=0.2)

# Dense rewards: faster decay to exploitation
epsilon = epsilon_linear(step, total_steps=50_000,
                         epsilon_start=1.0, epsilon_end=0.01)
```"
### Pressure Test Scenario 4: "ICM vs RND vs ε-Greedy"
**User Query:** "I have sparse rewards and a deterministic environment. Should I use ε-greedy, ICM curiosity, or RND? How do I decide?"
**What Skill Must Provide:**
1. Comparison table: Pros/cons of each
2. Decision framework: Which for which situation
3. Implementation difficulty: Is it worth the complexity?
**Skill Response:**
"**Comparison:**
| Method | Complexity | Robustness | When to Use |
|--------|-----------|-----------|------------|
| ε-Greedy | Very Simple | Good on dense rewards | Dense rewards, simple env |
| ICM | Complex (2 models) | Poor on stochastic | Sparse, deterministic, learning important |
| RND | Medium (1 model) | Excellent | Sparse, any environment |
**Decision Framework:**
1. **Sparse rewards + Deterministic + Lots of learning:**
→ Try ICM first (forward model learns useful features)
2. **Sparse rewards + Any environment:**
→ Use RND (robust, simpler)
3. **Sparse rewards + Large pixel space:**
→ RND (better for high-dimensional observations)
4. **Dense rewards:**
→ Just use ε-greedy (intrinsic motivation not needed)
5. **Not sure, want safe choice:**
→ Use RND (works almost everywhere)
**Implementation Difficulty:**
- ε-Greedy: 5 lines of code
- ICM: 100+ lines (forward + inverse models, 2 losses)
- RND: 50 lines (target + predictor)
**My Recommendation:**
Start ε-greedy. If not working after tuning decay schedules, switch to RND. Only use ICM if environment is fully deterministic and you care about learning dynamics."
### Pressure Test Scenario 5: "Computational Cost of Curiosity"
**User Query:** "I want to add curiosity-driven exploration (ICM) to my agent. What's the computational overhead? Will it slow down training significantly?"
**What Skill Must Provide:**
1. Overhead estimate: Exact computation cost
2. Tradeoff: When overhead is justified
3. Alternatives: Lighter-weight options
**Skill Response:**
"**Computational Overhead Estimate:**
Base Agent:
- 1 policy/value network
- Training time: 1.0×
With ICM:
- Policy/value network + Forward model + Inverse model
- Forward model: Encoder + hidden layers + state-dim output
- Inverse model: Concatenate states + hidden + action-dim output
- Training time: ~2.5-3.0×
**Memory:**
- Base: X MB
- With ICM: ~3-4× X MB
- GPU memory often bottleneck
**When Overhead is Justified:**
- Sparse rewards (ICM critical to success)
- Large state space (intrinsic motivation helps)
- Willing to wait longer for better exploration
**When Overhead is NOT Justified:**
- Dense rewards (ε-greedy sufficient)
- Real-time training constraints
- Limited GPU memory
**Lighter Alternative:**
Use RND instead of ICM:
- ~1.5-2.0× overhead (vs 2.5-3.0× for ICM)
- Same exploration benefits
- Simpler to implement
**Scaling to Large Models:**
```python
# ICM with huge state encoders can be prohibitive.
# Example: a vision-transformer encoder feeding ICM is very expensive.

# RND scales better: the predictor can be small and
# doesn't need a sophisticated encoder.
```
**Bottom Line:** ICM costs 2-3× training time. If you can afford it and rewards are very sparse, it's worth it. Otherwise try RND, or even ε-greedy with a slower decay, first."
Part 11: Rationalization Resistance Table
| Rationalization | Reality | Counter-Guidance | Red Flag |
|---|---|---|---|
| "ε-Greedy works everywhere" | Fails on sparse rewards, large spaces | Use ε-greedy for dense/small, intrinsic motivation for sparse/large | Applying ε-greedy to Montezuma's Revenge |
| "Higher epsilon is better" | High ε → too random, doesn't exploit | Use decay schedule (ε high early, low late) | Using constant ε=0.5 throughout training |
| "Decay epsilon to zero" | Agent needs residual exploration | Keep ε_end=0.01-0.1 always | Setting ε_final=0 (pure exploitation) |
| "Curiosity always helps" | Can break with stochasticity (model tries to predict noise) | Use RND for stochastic, ICM for deterministic | Agent learns to explore random noise instead of task |
| "RND is just ICM simplified" | RND is fundamentally different (frozen random vs learned model) | Understand frozen network prevents overfitting/noise | Not grasping why RND frozen network matters |
| "More intrinsic reward = faster exploration" | Too much intrinsic reward drowns out task signal | Balance with λ=0.01-0.1, tune on task performance | Agent explores forever, ignores task |
| "Count-based works anywhere" | Only works tabular (can't count unique images) | Use RND for continuous/high-dimensional spaces | Trying count-based on Atari images |
| "Boltzmann is always better than ε-greedy" | Boltzmann smoother but harder to tune | Use ε-greedy for simplicity (it works well) | Switching to Boltzmann without clear benefit |
| "Test with ε>0 for exploration" | Test should use learned policy, not explore | ε=0 or greedy policy at test time | Variable test performance from exploration |
| "Longer decay is always better" | Very slow decay wastes time in early training | Match decay to task difficulty (faster for easy, slower for hard) | Decaying over 10M steps when training only 1M |
| "Skip exploration, increase learning rate" | Learning rate is for optimization, exploration for coverage | Use both: exploration strategy + learning rate | Agent oscillates without exploration |
| "ICM is the SOTA exploration" | RND simpler and more robust | Use RND unless you need environment model | Implementing ICM when RND would suffice |
Part 12: Summary and Decision Framework
Quick Decision Tree
START: Need an exploration strategy?
├─ Are rewards sparse? (rare reward signal)
│  ├─ YES → Need intrinsic motivation
│  │   ├─ Environment stochastic?
│  │   │   ├─ YES → RND
│  │   │   └─ NO → ICM (or RND for simplicity)
│  │   └─ If unsure, choose RND for safety
│  └─ NO → Dense rewards
│      └─ Use ε-greedy + decay schedule
├─ Is the state space large? (images, continuous)
│  ├─ YES → Intrinsic motivation (RND/curiosity)
│  └─ NO → ε-greedy usually sufficient
└─ Choosing the decay schedule:
   ├─ Sparse rewards → slower decay (ε_end=0.05-0.1)
   ├─ Dense rewards → faster decay (ε_end=0.01)
   └─ Default: linear decay over ~50% of training
Implementation Checklist
- Define reward structure (dense vs sparse)
- Estimate state space size (discrete vs continuous)
- Choose exploration method (ε-greedy, curiosity, RND, UCB, count-based)
- Set epsilon/temperature parameters (start, end)
- Choose decay schedule (linear, exponential, polynomial)
- If using intrinsic motivation: set λ (usually 0.01)
- Use greedy policy at test time (ε=0)
- Monitor exploration vs exploitation (plot epsilon decay)
- Tune hyperparameters (decay schedule, λ) based on task performance
Typical Configurations
Dense Rewards, Small Action Space (e.g., simple game)
epsilon = epsilon_linear(step, total_steps=100_000,
                         epsilon_start=1.0, epsilon_end=0.01)
# Fast exploitation, low exploration needed
Sparse Rewards, Discrete Actions (e.g., Atari)
rnd = RandomNetworkDistillation(...)
epsilon = epsilon_linear(step, total_steps=1_000_000,
                         epsilon_start=1.0, epsilon_end=0.05)
r_total = r_task + 0.01 * r_intrinsic
# Intrinsic motivation + slow decay
Continuous Control, Sparse (e.g., Robotics)
rnd = RandomNetworkDistillation(...)
action = policy(state) + gaussian_noise(std=exploration_std)
exploration_std = exploration_std_linear(..., std_end=0.01)
r_total = r_task + 0.01 * r_intrinsic
# Gaussian noise + RND
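To tie the checklist and configurations together, here is a schematic training loop combining ε-greedy decay with an optional RND bonus. The agent, env, and buffer objects are placeholders with assumed interfaces, not a specific library; epsilon_linear, epsilon_greedy_action, and RandomNetworkDistillation are the helpers defined earlier.
```python
import torch

def train(agent, env, buffer, rnd=None, total_steps=1_000_000, intrinsic_scale=0.01):
    """Schematic loop: ε-greedy with linear decay, plus an optional RND bonus."""
    state = env.reset()
    for step in range(total_steps):
        # 1. Exploration schedule: explore early, exploit late
        epsilon = epsilon_linear(step, total_steps,
                                 epsilon_start=1.0, epsilon_end=0.05)
        action = epsilon_greedy_action(agent.q_values(state), epsilon)

        # 2. Environment step and reward combination
        next_state, r_task, done = env.step(action)
        r_intrinsic = 0.0
        if rnd is not None:
            obs = torch.as_tensor(next_state, dtype=torch.float32)
            r_intrinsic = rnd.compute_intrinsic_reward(obs).item()
        r_total = r_task + intrinsic_scale * r_intrinsic

        # 3. Learn from the combined reward
        # (RND predictor update omitted here; see predictor_loss above)
        buffer.add(state, action, r_total, next_state, done)
        agent.update(buffer.sample())

        state = env.reset() if done else next_state
```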
Key Takeaways
Exploration is fundamental: Don't ignore it. Design exploration strategy before training.
Match method to problem:
- Dense rewards → ε-greedy
- Sparse rewards → Intrinsic motivation (RND preferred)
- Large state space → Intrinsic motivation
Decay exploration over time: Explore early, exploit late.
Avoid common pitfalls:
- Don't decay to zero (ε_end > 0)
- Don't use ε-greedy on continuous actions
- Don't forget decay schedule
- Don't use exploration at test time
Balance intrinsic and extrinsic: If using intrinsic rewards, don't let them dominate.
RND is the safe choice: Works for most exploration problems, simpler than ICM.
Test exploration hypothesis: Plot epsilon or intrinsic rewards, verify exploration strategy is active.
This skill is about systematic exploration design, not just tuning one hyperparameter.