| name | value-based-methods |
| description | Master DQN, Double DQN, Dueling DQN, Rainbow - value-based methods for discrete actions |
Value-Based Methods
When to Use This Skill
Invoke this skill when you encounter:
- Algorithm Selection: "Should I use DQN or policy gradient for my problem?"
- DQN Implementation: User implementing DQN and needs guidance on architecture
- Training Issues: "DQN is diverging", "Q-values too high", "slow to learn"
- Variant Questions: "What's Double DQN?", "Should I use Dueling?", "Is Rainbow worth it?"
- Discrete Action RL: User has discrete action space and implementing value method
- Hyperparameter Tuning: Debugging learning rates, replay buffer size, network architecture
- Implementation Bugs: Target network missing, frame stacking wrong, reward scaling issues
- Custom Environments: Designing states, rewards, action spaces for DQN
This skill provides practical implementation guidance for discrete action RL.
Do NOT use this skill for:
- Continuous action spaces (route to actor-critic-methods)
- Policy gradients (route to policy-gradient-methods)
- Model-based RL (route to model-based-rl)
- Offline RL (route to offline-rl-methods)
- Theory foundations (route to rl-foundations)
Core Principle
Value-based methods solve discrete action RL by learning Q(s,a) = expected return from taking action a in state s, then acting greedily. They're powerful for discrete spaces but require careful implementation to avoid instability.
Key insight: Value methods assume you can enumerate and compare all action values. This breaks down with continuous actions (infinite actions to compare). Use them for:
- Games (Atari, Chess)
- Discrete control (robot navigation, discrete movement)
- Dialog systems (discrete utterances)
- Combinatorial optimization
Do not use for:
- Continuous control (robot arm angles, vehicle acceleration)
- Settings that require a stochastic policy (e.g., multi-agent games where a predictable greedy policy is exploitable)
- Very large discrete action spaces (too slow to learn a value for every action)
Part 1: Q-Learning Foundation
From TD Learning to Q-Learning
You understand TD learning from rl-foundations. Q-learning extends it to action-values.
TD(0) for V(s):
V[s] ← V[s] + α(r + γV[s'] - V[s])
Q-Learning for Q(s,a):
Q[s,a] ← Q[s,a] + α(r + γ max_a' Q[s',a'] - Q[s,a])
Key difference: Q-learning has max over next actions (off-policy).
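A minimal tabular sketch of this update loop (the gymnasium import and the FrozenLake environment are illustrative choices here, not requirements of the method):
# Tabular Q-learning sketch, assuming the gymnasium package is installed.
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
num_states = env.observation_space.n
num_actions = env.action_space.n

Q = np.zeros((num_states, num_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy behavior policy
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-learning update: bootstrap from the max over next actions (off-policy)
        target = reward + gamma * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state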
Off-Policy Learning
Q-learning learns the optimal greedy policy π*(s) = argmax_a Q*(s,a) regardless of the exploration policy.
Example: Cliff Walking
Agent follows epsilon-greedy (explores 10% random)
But Q-learning learns: "Take safe path away from cliff" (optimal)
NOT: "Walk along cliff edge" (what exploring policy does sometimes)
Q-learning separates:
- Behavior policy: ε-greedy (for exploration)
- Target policy: greedy (what we're learning toward)
Why This Matters: Off-policy learning is sample-efficient (it can learn from any exploration strategy). On-policy methods like SARSA bake the exploration behavior into the learned policy.
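The two updates differ only in the bootstrap target. A side-by-side sketch, assuming a tabular Q array as in the sketch above:
# SARSA (on-policy): bootstrap from the action the behavior policy actually takes next.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    target = r + gamma * Q[s_next, a_next]      # a_next sampled by ε-greedy
    Q[s, a] += alpha * (target - Q[s, a])

# Q-learning (off-policy): bootstrap from the greedy action,
# regardless of what the behavior policy does next.
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    target = r + gamma * Q[s_next].max()        # max over next actions
    Q[s, a] += alpha * (target - Q[s, a])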
Convergence Guarantee
Theorem: Q-learning converges to Q*(s,a) if:
- All state-action pairs are visited infinitely often (e.g., ε-greedy with ε > 0)
- Learning rates satisfy Σ α_t = ∞ and Σ α_t² < ∞ (e.g., α = 1/N(s,a))
Practical: Use an ε-decay schedule with a nonzero floor so every action keeps being tried.
epsilon = max(epsilon_min, epsilon * decay_rate)
# Start: ε=1.0, decay to ε=0.01
# Ensures: all actions eventually tried, then exploitation takes over
Q-Learning Pitfall #1: Small State Spaces Only
Scenario: User implements tabular Q-learning for Atari.
Problem:
Atari frame: 210×160 pixels × 3 color channels = 100,800 values
Possible states: 256^100,800 (astronomical)
Tabular Q-learning: impossible
Solution: Use function approximation (neural networks) → Deep Q-Networks
Red Flag: Tabular Q-learning works only for small state spaces (<10,000 unique states).
Part 2: Deep Q-Networks (DQN)
What DQN Adds to Q-Learning
DQN = Q-learning + neural network + two critical stability mechanisms:
- Experience Replay: Break temporal correlation
- Target Network: Prevent moving target problem
Mechanism 1: Experience Replay
Problem without replay:
# Naive approach (WRONG)
state = env.reset()
for t in range(1000):
action = epsilon_greedy(state)
next_state, reward = env.step(action)
# Update Q from this single transition
Q(state, action) += α(reward + γ max Q(next_state) - Q(state, action))
state = next_state
Why this fails:
- Consecutive transitions are highly correlated (state_t and state_{t+1} very similar)
- Neural network gradient updates are unstable with correlated data
- Network overfits to recent trajectory
Experience Replay Solution:
# Collect experiences in a bounded buffer (old experiences eventually drop out)
replay_buffer = deque(maxlen=buffer_size)
for episode in range(num_episodes):
    state = env.reset()
    for t in range(max_steps):
        action = epsilon_greedy(state)
        next_state, reward, done = env.step(action)
        # Store the experience (don't learn from it yet)
        replay_buffer.append((state, action, reward, next_state, done))
        # Sample a random batch and learn
        if len(replay_buffer) > batch_size:
            batch = random.sample(replay_buffer, batch_size)
            for (s, a, r, s_next, d) in batch:
                if d:
                    target = r
                else:
                    target = r + gamma * max(Q(s_next))
                loss = (Q(s, a) - target)^2
            # Update network weights on the batch loss
            optimizer.step(loss)
        state = next_state
        if done:
            break
Why this works:
- Breaks correlation: Random sampling decorrelates gradient updates
- Sample efficiency: Reuse old experiences (learn more from same env interactions)
- Stability: Averaged gradients are smoother
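A concrete buffer is only a few lines. A minimal sketch (the deque-based design is one common choice, not the only one):
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer of (state, action, reward, next_state, done) transitions."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions drop off automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)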
Mechanism 2: Target Network
Problem without target network:
# Moving target problem (WRONG)
loss = (Q(s,a) - [r + γ max Q(s_next, a_next)])^2
# ^^^^ ^^^^
# Same network computing both target and prediction
Issue: Network updates move both the prediction AND the target, creating instability.
Analogy: Trying to hit a moving target that moves whenever you aim.
Target Network Solution:
# Separate networks
main_network = create_network() # Learning network
target_network = create_network() # Stable target (frozen)
# Training loop
loss = (main_network(s,a) - [r + γ max target_network(s_next)])^2
#                                      ^^^^^^^^^^^^^^
#                  Target network doesn't update every step
# Periodically synchronize
if t % update_frequency == 0:
target_network = copy(main_network) # Freeze for N steps
Why this works:
- Stability: Target doesn't move as much (frozen for many steps)
- Bellman consistency: Gives network time to learn, then adjusts target
- Convergence: Bootstrapping no longer destabilized by moving target
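In PyTorch, the periodic copy is usually a hard copy of the state dict; a soft (Polyak) update is a common alternative. A minimal sketch:
import torch

# Hard update: clone the online network's weights into the target every N steps.
# (Create the target once with copy.deepcopy(main_network), then keep it frozen.)
def sync_target(main_network: torch.nn.Module, target_network: torch.nn.Module):
    target_network.load_state_dict(main_network.state_dict())

# Soft (Polyak) update: blend a small fraction tau of the online weights in every
# step -- an alternative to periodic hard copies.
def soft_update(main_network, target_network, tau=0.005):
    with torch.no_grad():
        for p_main, p_target in zip(main_network.parameters(), target_network.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p_main)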
DQN Architecture Pattern
import torch
import torch.nn as nn
class DQN(nn.Module):
    def __init__(self, input_size, num_actions):
        super().__init__()
        # For Atari: CNN backbone over a stack of 4 grayscale frames
        # (input_size is kept for API symmetry; the conv stack assumes (4, 84, 84) inputs)
        self.conv1 = nn.Conv2d(4, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        # Flatten and FC layers
        self.fc1 = nn.Linear(64 * 7 * 7, 512)          # 7x7 spatial size after the convolutions
        self.fc_actions = nn.Linear(512, num_actions)  # one Q-value per action
        # (a separate value stream is added in the Dueling variant, Part 4)

    def forward(self, x):
        # x shape: (batch, 4, 84, 84) for Atari
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.conv3(x))
        x = x.flatten(start_dim=1)
        x = torch.relu(self.fc1(x))
        # Basic DQN: just the action values
        q_values = self.fc_actions(x)
        return q_values
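A minimal sketch of how this network is used for ε-greedy action selection and the TD loss (the names main_net/target_net and the batch shapes are illustrative assumptions):
import random
import torch
import torch.nn.functional as F

num_actions = 4                                       # e.g., Breakout
main_net = DQN(input_size=(4, 84, 84), num_actions=num_actions)
target_net = DQN(input_size=(4, 84, 84), num_actions=num_actions)
target_net.load_state_dict(main_net.state_dict())    # start identical

def select_action(state, epsilon):
    # state: float tensor of shape (4, 84, 84), pixels scaled to [0, 1]
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        q = main_net(state.unsqueeze(0))              # (1, num_actions)
    return int(q.argmax(dim=1).item())

def dqn_loss(states, actions, rewards, next_states, dones, gamma=0.99):
    # states/next_states: (B, 4, 84, 84); actions: int64 (B, 1); rewards/dones: float (B, 1)
    q_sa = main_net(states).gather(1, actions)
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1, keepdim=True).values
    targets = rewards + gamma * (1 - dones) * max_next_q
    return F.smooth_l1_loss(q_sa, targets)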
Hyperparameter Guidance
| Parameter | Value Range | Effect | Guidance |
|---|---|---|---|
| Replay buffer size | 10k-1M | Memory, sample diversity | Start 100k, increase for slow learning |
| Batch size | 32-256 | Stability vs memory | 32-64 common; larger = more stable |
| Learning rate α | 0.0001-0.001 | Convergence speed | Start 0.0001, increase if too slow |
| Target update freq | 1k-10k steps | Stability | Update every 1000-5000 steps |
| ε initial | 0.5-1.0 | Exploration | Start 1.0 (random) |
| ε final | 0.01-0.05 | Late exploitation | 0.01-0.05 typical |
| ε decay | 10k-1M steps | Exploration → Exploitation | Tune to problem (larger env → longer decay) |
DQN Pitfall #1: Missing Target Network
Symptom: "DQN loss explodes immediately, Q-values diverge to ±infinity"
Root cause: No target network (or target updates too frequently)
# WRONG - target network updates every step
loss = (Q(s,a) - [r + γ max Q(s_next)])^2 # Both from same network
# CORRECT - target network frozen for steps
loss = (Q_main(s,a) - [r + γ max Q_target(s_next)])^2
# Update target: if step % 1000 == 0: Q_target = copy(Q_main)
Fix: Verify target network update frequency (1000-5000 steps typical).
DQN Pitfall #2: Replay Buffer Too Small
Symptom: "Sample efficiency very poor, agent takes millions of steps to learn"
Root cause: Small replay buffer = replay many recent correlated experiences
# WRONG
replay_buffer_size = 10_000
# After 10k steps, only seeing recent experience (no diversity)
# CORRECT
replay_buffer_size = 100_000  # or 1_000_000 for long runs
# See diverse experiences from long history
Rule of Thumb: Replay buffer ≥ 10 × episode length as a floor; ~100k is typical for Atari (more is usually better)
Memory vs Sample Efficiency Tradeoff:
- 10k buffer: Low memory, high correlation (bad)
- 100k buffer: Moderate memory, good diversity (usually sufficient)
- 1M buffer: High memory, excellent diversity (overkill unless long episodes)
DQN Pitfall #3: No Frame Stacking
Symptom: "Learning very slow or doesn't converge"
Root cause: Single frame doesn't show velocity (violates Markov property)
# WRONG - single frame
state = current_frame # No velocity information
# Network cannot infer: is ball moving left or right?
# CORRECT - stack the last 4 frames (oldest to newest)
state = np.stack([frame_tm3, frame_tm2, frame_tm1, frame_t])
# Velocity: differences between consecutive frames
Implementation:
import numpy as np
from collections import deque

class FrameBuffer:
    def __init__(self, num_frames=4):
        self.buffer = deque(maxlen=num_frames)

    def add_frame(self, frame):
        self.buffer.append(frame)

    def get_state(self):
        # Assumes the buffer has been filled with num_frames frames
        return np.stack(list(self.buffer))   # (4, 84, 84)
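Typical usage at the start of an episode is to fill the buffer with copies of the first preprocessed frame so get_state() always returns 4 frames (a common convention, shown here as a sketch):
# Hypothetical usage with an 84x84 grayscale frame.
import numpy as np

frames = FrameBuffer(num_frames=4)
first_frame = np.zeros((84, 84), dtype=np.float32)   # stand-in for a preprocessed frame
for _ in range(4):
    frames.add_frame(first_frame)                    # fill so get_state() is always (4, 84, 84)

state = frames.get_state()
assert state.shape == (4, 84, 84)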
DQN Pitfall #4: Reward Clipping Wrong
Symptom: "Training unstable" or "Learned policy much worse than Q-values suggest"
Context: Atari papers clip rewards to {-1, 0, +1} for stability.
Misunderstanding: Clipping destroys reward information.
# WRONG - unthinking clip
reward = np.clip(reward, -1, 1) # All rewards become -1,0,+1
# In custom env with rewards in [-100, 1000], loses critical information
# CORRECT - Normalize instead
reward = (reward - reward_mean) / reward_std
# Preserves differences, stabilizes scale
When to clip: Only if rewards are naturally in {-1, 0, +1} (like Atari).
When to normalize: Custom environments with arbitrary scales.
Part 3: Double DQN
The Overestimation Bias Problem
Max operator bias: In stochastic environments, max over noisy estimates is biased upward.
Example:
True Q*(s,a) values:   [10.0, 5.0, 8.0]
Noisy estimates:       [11.0, 4.0, 9.0]
                         ↑ true Q = 10, estimate = 11
Standard DQN takes the max: max(Q_estimates) = 11
But the true value of that action is Q*(s, best_action) = 10
Systematic overestimation! The agent thinks actions are better than they really are.
Consequence:
- Inflated Q-values during training
- Learned policy (greedy) performs worse than Q-values suggest
- Especially bad early in training when estimates very noisy
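You can see the bias directly with a small numpy simulation of the example above (the noise level is an illustrative assumption); it also previews the Double DQN fix from the next part:
import numpy as np

rng = np.random.default_rng(0)
true_q = np.array([10.0, 5.0, 8.0])          # true Q*(s, a) from the example above
noise = lambda: rng.normal(0.0, 3.0, size=(100_000, 3))

# Standard DQN target: max over one set of noisy estimates -> biased upward.
est = true_q + noise()
print("max of noisy estimates:", est.max(axis=1).mean())   # noticeably above true_q.max() == 10.0

# Double-DQN-style target: select with one estimate, evaluate with an independent one.
select, evaluate = true_q + noise(), true_q + noise()
best = select.argmax(axis=1)
print("decoupled estimate:", evaluate[np.arange(len(best)), best].mean())   # at or below 10.0: no upward bias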
Double DQN Solution
Insight: Use one network to select best action, another to evaluate it.
# Standard DQN (overestimates)
target = r + γ max_a Q_target(s_next, a)
# ^^^^
# Both selecting and evaluating with same network
# Double DQN (unbiased)
best_action = argmax_a Q_main(s_next, a) # Select with main network
target = r + γ Q_target(s_next, best_action) # Evaluate with target network
Why it works:
- Decouples action selection from evaluation
- Removes the systematic upward bias of the max operator
- In practice, estimates track the true Q* much more closely (it can even underestimate slightly)
Implementation
import torch.nn.functional as F

class DoubleDQN(DQNAgent):   # DQNAgent: your base agent holding the main/target networks
    def compute_loss(self, batch):
        states, actions, rewards, next_states, dones = batch
        # Main network Q-values for the actions actually taken
        q_values = self.main_network(states)
        q_values_current = q_values.gather(1, actions)       # actions shape: (batch, 1), int64
        # Double DQN: select the next action with the main network...
        next_q_main = self.main_network(next_states)
        best_actions = next_q_main.argmax(1, keepdim=True)
        # ...but evaluate it with the target network
        next_q_target = self.target_network(next_states)
        max_next_q = next_q_target.gather(1, best_actions).detach()
        # TD target (dones zeroes out bootstrapping at episode end)
        targets = rewards + (1 - dones) * self.gamma * max_next_q
        loss = F.smooth_l1_loss(q_values_current, targets)
        return loss
When to Use Double DQN
Use Double DQN if:
- Training a medium-complexity task (Atari)
- Suspicious that Q-values are too optimistic
- Want slightly better sample efficiency
Standard DQN is OK if:
- Small action space (less overestimation)
- Training is otherwise stable
- Sample efficiency not critical
Takeaway: Double DQN costs a few lines of code, rarely hurts, and usually helps. Use it.
Part 4: Dueling DQN
Dueling Architecture: Separating Value and Advantage
Insight: Q(s,a) = V(s) + A(s,a) where:
- V(s): How good is this state? (independent of action)
- A(s,a): How much better is action a than average? (action-specific advantage)
Why separate:
- Better feature learning: Network learns state features independently from action value
- Stabilization: Value stream sees many states (more gradient signal)
- Generalization: Advantage stream learns which actions matter
Example:
Atari Breakout:
V(s) = "Ball in good position, paddle ready" (state value)
A(s,LEFT) = -2 (moving left here hurts)
A(s,RIGHT) = +3 (moving right here helps)
A(s,NOOP) = 0 (staying still is neutral)
Q(s,LEFT) = V + A = 5 + (-2) = 3
Q(s,RIGHT) = V + A = 5 + 3 = 8 ← Best action
Q(s,NOOP) = V + A = 5 + 0 = 5
Architecture
class DuelingDQN(nn.Module):
def __init__(self, input_size, num_actions):
super().__init__()
# Shared feature backbone
self.conv1 = nn.Conv2d(4, 32, kernel_size=8, stride=4)
self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
self.fc = nn.Linear(64*7*7, 512)
# Value stream (single output)
self.value_fc = nn.Linear(512, 256)
self.value = nn.Linear(256, 1)
# Advantage stream (num_actions outputs)
self.advantage_fc = nn.Linear(512, 256)
self.advantage = nn.Linear(256, num_actions)
def forward(self, x):
# Shared backbone
x = torch.relu(self.conv1(x))
x = torch.relu(self.conv2(x))
x = torch.relu(self.conv3(x))
x = x.flatten(start_dim=1)
x = torch.relu(self.fc(x))
# Value stream
v = torch.relu(self.value_fc(x))
v = self.value(v)
# Advantage stream
a = torch.relu(self.advantage_fc(x))
a = self.advantage(a)
# Combine: Q = V + (A - mean(A))
# Subtract mean(A) for normalization (prevents instability)
q = v + (a - a.mean(dim=1, keepdim=True))
return q
Why Subtract Mean of Advantages?
# Without mean subtraction
q = v + a
# Problem: V and A not separately identifiable
# V could be 100 + A = -90 or V = 50 + A = -40 (same Q)
# With mean subtraction
q = v + (a - mean(a))
# Mean advantage = 0 on average
# Forces: V learns state value, A learns relative advantage
# More stable training
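A quick numeric illustration of the identifiability point, using the advantages from the Breakout example:
import numpy as np

a = np.array([-2.0, 3.0, 0.0])          # advantages from the Breakout example
# Without mean subtraction, (V, A) is not unique: these give identical Q-values
print(5.0 + a)                           # V = 5 with the original advantages
print(100.0 + (a - 95.0))                # V = 100 with shifted advantages -> same Q
# With mean subtraction, A is forced to average zero, which pins V down
print(5.0 + (a - a.mean()))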
When to Use Dueling DQN
Use Dueling if:
- Training complex environments (Atari)
- Want better feature learning
- Training is unstable (helps stabilization)
Standard DQN is OK if:
- Simple environments
- Computational budget tight
Takeaway: Dueling usually helps and rarely hurts, at minimal cost; a sensible default for Atari-scale problems.
Part 5: Prioritized Experience Replay
Problem with Uniform Sampling
Issue: All transitions equally likely to be sampled.
# Uniform sampling
batch = random.sample(replay_buffer, batch_size)
# Includes: boring transitions, important transitions, rare transitions
# All mixed together with equal weight
Problem:
- Wasted learning on transitions already understood
- Rare important transitions sampled rarely
- Sample inefficiency
Example:
Atari agent learns mostly: "Move paddle left-right in routine positions"
Rarely: "What happens when ball is in corner?" (rare, important)
Uniform replay: 95% learning about paddle, 5% about corners
Should be: More focus on corners (rarer, more surprising)
Prioritized Experience Replay Solution
Insight: Sample transitions proportional to TD error (surprise).
# Compute TD error (surprise)
td_error = |r + γ max Q(s_next) - Q(s,a)|
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# How wrong was our prediction?
# Probability ∝ TD error^α
# High error transitions sampled more
batch = sample_proportional_to_priority(replay_buffer, priorities)
Implementation
import numpy as np
class PrioritizedReplayBuffer:
    def __init__(self, size, alpha=0.6, beta=0.4):
        self.buffer = []
        self.priorities = []
        self.size = size
        self.pos = 0              # next write position once the buffer is full
        self.alpha = alpha        # how much to prioritize (0 = uniform, 1 = full priority)
        self.beta = beta          # importance-sampling correction strength (anneal toward 1)
        self.epsilon = 1e-6       # small value to avoid zero priority

    def add(self, experience):
        # New experiences get max priority so they are sampled at least once (important!)
        max_priority = np.max(self.priorities) if self.priorities else 1.0
        if len(self.buffer) < self.size:
            self.buffer.append(experience)
            self.priorities.append(max_priority)
        else:
            # Overwrite the oldest entry (circular buffer)
            self.buffer[self.pos] = experience
            self.priorities[self.pos] = max_priority
            self.pos = (self.pos + 1) % self.size

    def sample(self, batch_size):
        # Sampling probabilities ∝ priority^alpha
        probs = np.array(self.priorities) ** self.alpha
        probs = probs / np.sum(probs)
        # Sample indices
        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        batch = [self.buffer[i] for i in indices]
        # Importance sampling weights (correct for the bias from prioritized sampling)
        weights = (1 / (len(self.buffer) * probs[indices])) ** self.beta
        weights = weights / np.max(weights)   # normalize so weights ≤ 1
        return batch, indices, weights

    def update_priorities(self, indices, td_errors):
        # Store |TD error| + ε as the raw priority; ^alpha is applied in sample()
        for idx, td_error in zip(indices, td_errors):
            self.priorities[idx] = np.abs(td_error) + self.epsilon
Importance Sampling Weights
Problem: Prioritized sampling introduces bias (samples important transitions more).
Solution: Reweight gradients by inverse probability.
# Uniform sampling: each transition contributes equally
loss = mean((r + γ max Q(s_next) - Q(s,a))^2)
# Prioritized sampling: bias toward high TD error
# Correct with importance weight (large TD error → small weight)
loss = mean(weights * (r + γ max Q(s_next) - Q(s,a))^2)
# ^^^^^^^
# Importance sampling correction
# weights ∝ 1/priority (inverse)
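Putting it together: a sketch of one training step with the buffer above (main_network, target_network, and optimizer are assumed to exist elsewhere, and all tensors are assumed to live on the CPU):
import numpy as np
import torch

def prioritized_train_step(buffer, main_network, target_network, optimizer,
                           gamma=0.99, batch_size=32):
    batch, indices, weights = buffer.sample(batch_size)
    states, actions, rewards, next_states, dones = map(
        lambda xs: torch.as_tensor(np.array(xs), dtype=torch.float32), zip(*batch))
    actions = actions.long().unsqueeze(1)

    q_sa = main_network(states).gather(1, actions).squeeze(1)
    with torch.no_grad():
        max_next_q = target_network(next_states).max(dim=1).values
        targets = rewards + gamma * (1 - dones) * max_next_q

    td_errors = q_sa - targets
    # Importance-sampling correction: weight each sample's squared error
    weights_t = torch.as_tensor(weights, dtype=torch.float32)
    loss = (weights_t * td_errors.pow(2)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Feed the new TD errors back so priorities stay current
    buffer.update_priorities(indices, td_errors.detach().abs().numpy())
    return loss.item()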
When to Use Prioritized Replay
Use if:
- Training large environments (Atari)
- Sample efficiency critical
- Have computational budget for priority updates
Use standard uniform if:
- Small environments
- Computational budget tight
- Standard training is working fine
Note: Adds complexity (priority updates), minimal empirical gain in many cases.
Part 6: Rainbow DQN
Combining All Improvements
Rainbow = Double DQN + Dueling DQN + Prioritized Replay + 3 more innovations:
- Double DQN: Reduce overestimation bias
- Dueling DQN: Separate value and advantage
- Prioritized Replay: Sample important transitions
- Noisy Networks: Exploration through network parameters
- Distributional RL: Learn Q distribution not just mean
- Multi-step Returns: n-step TD learning instead of 1-step
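Of these, multi-step returns are the simplest to bolt onto an existing DQN: the 1-step target is replaced by an n-step target. A sketch of the target computation (the reward/done window and the bootstrap value are assumed to come from the replay buffer):
# n-step TD target: sum n discounted rewards, then bootstrap once.
def n_step_target(rewards, dones, bootstrap_value, gamma=0.99):
    # rewards, dones: lists of length n for steps t .. t+n-1
    # bootstrap_value: max_a Q_target(s_{t+n}, a)
    target, discount = 0.0, 1.0
    for r, done in zip(rewards, dones):
        target += discount * r
        if done:                     # episode ended inside the window: no bootstrap
            return target
        discount *= gamma
    return target + discount * bootstrap_value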
When to Use Rainbow
Use Rainbow if:
- Need state-of-the-art Atari performance
- Have weeks of compute for tuning
- Paper requires it
Use Double + Dueling DQN if:
- Standard DQN training unstable
- Want good performance with less tuning
- Typical development
Use Basic DQN if:
- Learning the method
- Sample efficiency not critical
- Simple environments
Lesson: Understand components separately before combining.
Learning progression:
1. Q-learning (understand basics)
2. Basic DQN (add neural networks)
3. Double DQN (fix overestimation)
4. Dueling DQN (improve architecture)
5. Add prioritized replay (sample efficiency)
6. Rainbow (combine all)
Part 7: Common Bugs and Debugging
Bug #1: Training Divergence (Q-values explode)
Diagnosis Tree:
1. Check target network:
   # WRONG - both prediction and target from the same network
   loss = (Q_main(s,a) - [r + γ max Q_main(s_next)])^2
   # FIX - use a separate target network
   loss = (Q_main(s,a) - [r + γ max Q_target(s_next)])^2

2. Check learning rate:
   # WRONG - too high
   optimizer = torch.optim.Adam(network.parameters(), lr=0.1)
   # FIX - reduce learning rate
   optimizer = torch.optim.Adam(network.parameters(), lr=0.0001)

3. Check reward scale:
   # WRONG - rewards too large
   reward = 1000 * indicator   # values explode
   # FIX - scale down or normalize
   reward = 10 * indicator
   # Or: reward = (reward - reward_mean) / reward_std

4. Check replay buffer:
   # WRONG - too small
   replay_buffer_size = 1000
   # FIX - increase size
   replay_buffer_size = 100_000
Bug #2: Poor Sample Efficiency (Slow Learning)
Diagnosis Tree:
1. Check replay buffer size:
   # Too small → high correlation
   if len(replay_buffer) < 100_000:
       print("WARNING: Replay buffer too small for Atari")

2. Check target network update frequency:
   # Too frequent → moving target
   # Too infrequent → slow target adjustment
   # Good: every 1000-5000 steps
   if update_frequency > 10_000:
       print("Target updates too infrequent")

3. Check batch size:
   # Too small → noisy gradients
   # Too large → slow training
   # Good: 32-64
   if batch_size < 16 or batch_size > 256:
       print("Consider adjusting batch size")

4. Check epsilon decay:
   # Decaying too fast → premature exploitation
   # Decaying too slow → wastes steps exploring
   # Typical: decay over ~10% of total steps
   if decay_steps < total_steps * 0.05:
       print("Epsilon decays too quickly")
Bug #3: Q-Values Too Optimistic (Learned Policy << Training Q)
Diagnosis:
Red Flag: Policy performance much worse than max Q-value during training.
# Symptom
max_q_value = 100.0
actual_episode_return = 5.0
# 20x gap suggests overestimation
# Solutions (try in order)
1. Use Double DQN (reduces overestimation)
2. Reduce learning rate (slower updates → less optimistic)
3. Increase target network update frequency (more stable target)
4. Check reward function (might be wrong)
Bug #4: Frame Stacking Wrong
Symptoms:
- Very slow learning despite "correct" implementation
- Network can't learn velocity-dependent behaviors
Diagnosis:
# WRONG - single RGB frame
state_shape = (3, 84, 84)
# Network sees only position, not velocity
# CORRECT - stack of the last 4 (grayscale) frames
state_shape = (4, 84, 84)
# Last 4 frames show motion
# Check frame stacking implementation
frame_stack = deque(maxlen=4)
for frame in frames:
frame_stack.append(frame)
state = np.stack(list(frame_stack)) # (4, 84, 84)
Bug #5: Network Architecture Mismatch
Symptoms:
- CNN on non-image input (or vice versa)
- Output layer wrong number of actions
- Input preprocessing wrong
Diagnosis:
# Image input → use CNN
if input_type == 'image':
network = CNN(num_actions)
# Vector input → use FC
elif input_type == 'vector':
network = FullyConnected(input_size, num_actions)
# Output layer MUST have num_actions outputs
assert network.output_size == num_actions
Part 8: Hyperparameter Tuning
Learning Rate
Too high (α > 0.001):
- Divergence, unstable training
- Q-values explode
Too low (α < 0.00001):
- Very slow learning
- May not converge in reasonable time
Start: α = 0.0001, adjust if needed
# Adaptive strategy
if max_q_value > 1000:
print("Reduce learning rate")
alpha = alpha / 2
if learning_curve_flat:
print("Increase learning rate")
alpha = alpha * 1.1
Replay Buffer Size
Too small (< 10k for Atari):
- High correlation in gradients
- Slow learning, poor sample efficiency
Too large (> 10M):
- Excessive memory
- Stale experiences dominate
- Diminishing returns
Rule of thumb: at least 10 × episode length as a floor; in practice ~100 × for Atari
episode_length = 1000   # typical
ideal_buffer = 100_000  # ~100 × a typical Atari episode
# Can increase if GPU memory available and learning slow
if learning_slow:
buffer_size = 500_000 # More diversity
Epsilon Decay
Too fast (decay in 10k steps):
- Agent exploits before learning
- Suboptimal policy
Too slow (decay in 1M steps):
- Wasted exploration time
- Slow performance improvement
Rule: Decay over ~10% of total training steps
total_steps = 1_000_000
epsilon_decay_steps = total_steps * 0.1   # 100k steps
# Linear decay from 1.0 to epsilon_min over epsilon_decay_steps
epsilon = max(epsilon_min, 1.0 - (1.0 - epsilon_min) * current_step / epsilon_decay_steps)
Target Network Update Frequency
Too frequent (every 100 steps):
- Target still moves rapidly
- Less stabilization benefit
Too infrequent (every 100k steps):
- Network drifts far from target
- Large jumps in learning
Sweet spot: Every 1k-5k steps (1000 typical)
update_frequency = 1000 # steps between target updates
if update_frequency < 500:
print("Target updates might be too frequent")
if update_frequency > 10_000:
print("Target updates might be too infrequent")
Reward Scaling
No scaling (raw rewards vary wildly):
- Learning rate effects vary by task
- Convergence issues
Clipping (clip to {-1, 0, +1}):
- Good for Atari, loses information in custom envs
Normalization (zero-mean, unit variance):
- General solution
- Preserves reward differences
# Track running statistics
running_mean = 0.0
running_var = 1.0
def normalize_reward(reward):
global running_mean, running_var
running_mean = 0.99 * running_mean + 0.01 * reward
running_var = 0.99 * running_var + 0.01 * (reward - running_mean)**2
return (reward - running_mean) / np.sqrt(running_var + 1e-8)
Part 9: When to Use Each Method
DQN Selection Matrix
| Situation | Method | Why |
|---|---|---|
| Learning method | Basic DQN | Understand target network, replay buffer |
| Medium task | Double DQN | Fix overestimation, minimal overhead |
| Complex task | Double + Dueling | Better architecture + bias reduction |
| Sample critical | Add Prioritized | Focus on important transitions |
| State-of-art | Rainbow | Best Atari performance |
| Simple Atari | DQN | Sufficient, faster to debug |
| Non-Atari discrete | DQN/Double | Adapt architecture to input type |
Action Space Check
Before implementing DQN, ask:
if action_space == 'continuous':
print("ERROR: Use actor-critic or policy gradient")
print("Value methods only for discrete actions")
redirect_to_actor_critic_methods()
elif action_space == 'discrete' and len(actions) <= 1000:
print("✓ DQN appropriate")
elif action_space == 'discrete' and len(actions) > 1000:
print("⚠ Large action space, consider policy gradient")
print("Or: hierarchical RL, action abstraction")
Part 10: Red Flags Checklist
When you see these, suspect bugs:
- Single frame input: No velocity info, add frame stacking
- No target network: Divergence expected, add it
- Small replay buffer (< 10k): Poor efficiency, increase
- High learning rate (> 0.001): Instability likely, decrease
- No frame preprocessing: Raw image pixels, normalize to [0,1]
- Updating target every step: Moving target problem, freeze it
- No exploration decay: Explores forever, add epsilon decay
- Continuous actions: Wrong method, use actor-critic
- Very large rewards (> 100): Scaling issues, normalize
- Only one environment: Slow, low-diversity data collection; consider frame skipping or parallel environments
- Immediate best performance: Overfitting to initial conditions, likely divergence later
- Q-values >> rewards: Overestimation, try Double DQN
- All Q-values zero: Network not learning, check learning rate
- Training loss increasing: Learning rate too high, divergence
Part 11: Pitfall Rationalization
| Rationalization | Reality | Counter-Guidance | Red Flag |
|---|---|---|---|
| "I'll skip target network, save memory" | Causes instability/divergence | Target network critical, minimal memory cost | "Target network optional" |
| "DQN works for continuous actions" | Breaks fundamental assumption (enumerate all actions) | Value methods discrete-only, use SAC/TD3 for continuous | Continuous action DQN attempt |
| "Uniform replay is fine" | Wastes learning on boring transitions | Prioritized replay better, but uniform adequate for many tasks | Always recommending prioritized |
| "I'll use tiny replay buffer, it's faster" | High correlation, poor learning | 100k+ buffer typical, speed tradeoff acceptable | Buffer < 10k for Atari |
| "Frame stacking unnecessary, CNN sees motion" | Single frame Markov-violating | Frame stacking required for velocity from pixels | Single frame policy |
| "Rainbow is just DQN + tricks" | Missing that components solve specific problems | Each component fixes identified issue (overestimation, architecture, sampling) | Jumping to Rainbow without understanding |
| "Clip rewards, I saw it in a paper" | Clips away important reward information | Only clip for {-1,0,+1} Atari-style, normalize otherwise | Blind reward clipping |
| "Larger network will learn faster" | Overfitting, slower gradients, memory issues | Standard architecture (32-64-64 CNN) works, don't over-engineer | Unreasonably large networks |
| "Policy gradient would be simpler here" | Value methods discrete-only right choice | Know when each applies (discrete → value, continuous → policy) | Wrong method choice for action space |
| "Epsilon decay is a hyperparameter like any other" | decay schedule should match task complexity | Tune decay to problem (game length), not arbitrary | Epsilon decay without reasoning |
Part 12: Pressure Test Scenarios
Scenario 1: Continuous Action Space
User: "I have a robot with continuous action space (joint angles in ℝ^7). Can I use DQN?"
Wrong Response: "Sure, discretize the actions" (Combinatorial explosion, inefficient)
Correct Response: "No, value methods are discrete-only. Use actor-critic (SAC) or policy gradient (PPO). They handle continuous actions naturally. Discretization would create 7-dimensional action space explosion (e.g., 10 values per joint = 10^7 actions)."
Scenario 2: Training Unstable
User: "My DQN is diverging immediately, loss explodes. Implementation looks right. What's wrong?"
Systematic Debug:
1. Check target network
- Print: "Is target_network separate from main_network?"
- Likely cause: updating together
2. Check learning rate
- Print: "Learning rate = ?"
- If > 0.001, reduce
3. Check reward scale
- Print: "max(rewards) = ?"
- If > 100, normalize
4. Check initial Q-values
- Print: "mean(Q-values) = ?"
- Should start near zero
Answer: Target network most likely culprit. Verify separate networks with proper update frequency.
Scenario 3: Rainbow vs Double DQN
User: "Should I implement Rainbow or just Double DQN? Is Rainbow worth the complexity?"
Guidance:
Double DQN:
+ Fixes overestimation bias
+ Simple to implement
+ 90% of Rainbow benefits in many cases
- Missing other optimizations
Rainbow:
+ Best Atari performance
+ State-of-the-art
- Complex (6 components)
- Harder to debug
- More hyperparameters
Recommendation:
Start: Double DQN
If unstable: Add Dueling
If slow: Add Prioritized
Only go to Rainbow: If need SotA and have time
Scenario 4: Frame Stacking Issue
User: "My agent trains on Atari but learning is slow. How many frames should I stack?"
Diagnosis:
# Check if frame stacking implemented
if state.shape != (4, 84, 84):
print("ERROR: Not using frame stacking")
print("Single frame (1, 84, 84) violates Markov property")
print("Add frame stacking: stack last 4 frames")
# Frame count
4 frames: Standard for Atari (enough recent history to infer velocity and acceleration)
3 frames: OK, slightly less velocity info
2 frames: Minimum, just barely Markovian
1 frame: WRONG, not Markovian
8+ frames: Too many, outdated states in stack
Scenario 5: Hyperparameter Tuning
User: "I've tuned learning rate, buffer size, epsilon. What else affects performance?"
Guidance:
Priority 1 (Critical):
- Target network update frequency (1000-5000 steps)
- Replay buffer size (100k+ typical)
- Frame stacking (4 frames)
Priority 2 (Important):
- Learning rate (0.0001-0.0005)
- Epsilon decay schedule (over ~10% of steps)
- Batch size (32-64)
Priority 3 (Nice to have):
- Network architecture (32-64-64 CNN standard)
- Reward normalization (helps but not required)
- Double/Dueling DQN (improvements, not essentials)
Start with Priority 1, only adjust Priority 2-3 if unstable.
Part 13: When to Route Elsewhere
Route to rl-foundations if
- User confused about Bellman equations
- Unclear on value function definition
- Needs theory behind Q-learning convergence
Route to actor-critic-methods if
- Continuous action space
- Need deterministic policy gradients
- Stochastic policy required
Route to policy-gradient-methods if
- Large discrete action space (> 1000 actions)
- Need policy regularization
- Exploration by stochasticity useful
Route to offline-rl-methods if
- No environment access (batch learning)
- Learning from logged data only
Route to rl-debugging if
- General training issues
- Need systematic debugging methodology
- Credit assignment problems
Route to reward-shaping if
- Sparse rewards
- Reward design affecting learning
- Potential-based shaping questions
Summary
You now understand:
- Q-Learning: TD learning for action values, off-policy convergence guarantee
- DQN: Add neural networks + experience replay + target network for stability
- Stability Mechanisms:
- Replay buffer: Break correlation
- Target network: Prevent moving target problem
- Common Variants:
- Double DQN: Fix overestimation bias
- Dueling DQN: Separate value and advantage
- Prioritized Replay: Focus on important transitions
- Rainbow: Combine improvements
- When to Use: Discrete action spaces only, not continuous
- Common Bugs: Divergence, poor efficiency, overoptimism, frame issues
- Hyperparameter Tuning: Buffer size, learning rate, epsilon decay, target frequency
- Debugging Strategy: Systematic diagnosis (target network → learning rate → reward scale)
Key Takeaways:
- Value methods are for discrete actions ONLY
- DQN requires target network and experience replay
- Frame stacking needed for video inputs (Markov property)
- Double DQN fixes overestimation, use it
- Start simple, add Dueling/Prioritized only if needed
- Systematic debugging beats random tuning
Next: Implement on simple environment first (CartPole or small custom task), then scale to Atari.