Claude Code Plugins

Community-maintained marketplace


Master RL theory - MDPs, value functions, Bellman equations, value/policy iteration, TD

Install Skill

1. Download skill
2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reviewing its instructions before using it.

SKILL.md

name: rl-foundations
description: Master RL theory - MDPs, value functions, Bellman equations, value/policy iteration, TD

RL Foundations

When to Use This Skill

Invoke this skill when you encounter:

  • New to RL: User asks "what is RL" or "how does RL work"
  • Theory Questions: MDP, value functions, Bellman equations, policy optimization
  • Conceptual Confusion: Mixing up V(s) and Q(s,a), value iteration vs policy iteration
  • Before Implementation: User wants to implement RL algorithms without understanding foundations
  • Debugging Theory: Why discount factor matters, why exploration needed, how algorithms differ
  • Foundation Check: User jumps to DQN/PPO without understanding MDPs

This skill provides the theoretical foundation for ALL other deep-rl skills.

Do NOT use this skill for:

  • Algorithm implementation (use value-based-methods, policy-gradient-methods, actor-critic-methods)
  • Debugging code (use rl-debugging)
  • Environment setup (use rl-environments)

Core Principle

Understanding the theory enables everything else.

Reinforcement learning is built on a rigorous mathematical foundation:

  1. MDP (Markov Decision Process) - the framework
  2. Value Functions - quantify expected return
  3. Bellman Equations - recursive decomposition
  4. Optimal Policy - maximize expected return
  5. Algorithms - methods to find optimal policy

Without this foundation, you're copy-pasting code you cannot debug, adapt, or extend.


Part 1: Markov Decision Process (MDP)

What is an MDP?

An MDP is the mathematical framework for sequential decision-making under uncertainty.

Formal Definition: A Markov Decision Process is a 5-tuple (S, A, P, R, γ):

  • S: State space (set of all possible states)
  • A: Action space (set of all possible actions)
  • P: Transition probability P(s'|s,a) - probability of reaching state s' from state s after action a
  • R: Reward function R(s,a,s') - immediate reward for transition
  • γ: Discount factor (0 ≤ γ ≤ 1) - controls importance of future rewards

Key Property: Markov Property

P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0) = P(s_{t+1} | s_t, a_t)

Meaning: The future depends only on the present state, not the history.

Why this matters: Allows recursive algorithms (Bellman equations). If Markov property violated, standard RL algorithms may fail.


Example 1: GridWorld MDP

Problem: Agent navigates 4x4 grid to reach goal.

S = {(0,0), (0,1), ..., (3,3)}  # 16 states
A = {UP, DOWN, LEFT, RIGHT}      # 4 actions
R = -1 for each step, +10 at goal
γ = 0.9
P: Deterministic (UP always moves up unless blocked by a wall or the grid edge)

Visualization:

S  .  .  .
.  .  .  .
.  #  .  .  # = wall
.  .  .  G  G = goal (+10)

Transition Example:

  • State s = (1,1), Action a = RIGHT
  • Deterministic: P(s'=(1,2) | s=(1,1), a=RIGHT) = 1.0
  • Reward: R(s,a,s') = -1
  • Next state: s' = (1,2)

Markov Property Holds: Future position depends only on current position and action, not how you got there.


Example 2: Stochastic GridWorld

Modification: Actions succeed with probability 0.8, move perpendicular with probability 0.1 each.

P((1,2) | (1,1), RIGHT) = 0.8  # intended
P((0,1) | (1,1), RIGHT) = 0.1  # slip up
P((2,1) | (1,1), RIGHT) = 0.1  # slip down

Why Stochastic: Models real-world uncertainty (robot actuators, wind, slippery surfaces).

Consequence: Agent must consider probabilities when choosing actions.
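
A minimal sketch of how this transition model can be stored and sampled (the dictionary layout and helper name are illustrative, not part of any particular library):

import numpy as np

# Transition model for taking RIGHT in state (1,1): {next_state: probability}
transitions = {
    (1, 2): 0.8,  # intended move
    (0, 1): 0.1,  # slip up
    (2, 1): 0.1,  # slip down
}

def sample_next_state(transitions):
    """Sample s' according to P(s'|s,a)."""
    states = list(transitions.keys())
    probs = list(transitions.values())
    idx = np.random.choice(len(states), p=probs)
    return states[idx]

s_next = sample_next_state(transitions)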


Example 3: Continuous State MDP (Cartpole)

S ⊂ ℝ⁴: (cart_position, cart_velocity, pole_angle, pole_angular_velocity)
A = {LEFT, RIGHT}  # discrete actions, continuous state
R = +1 for each timestep upright
γ = 0.99
P: Physics-based transition (continuous dynamics)

Key Difference: State space is continuous, requires function approximation (neural networks).

Still an MDP: Markov property holds (physics is Markovian given state).


When is Markov Property Violated?

Example: Poker

State: Current cards visible
Markov Violated: Opponents' strategies depend on past betting patterns

Solution: Augment state with history (last N actions), or use partially observable MDP (POMDP).

Example: Robot with Noisy Sensors

State: Raw sensor reading (single frame)
Markov Violated: True position requires integrating multiple frames

Solution: Stack frames (last 4 frames as state), or use recurrent network (LSTM).


Episodic vs Continuing Tasks

Episodic: Task terminates (games, reaching goal)

Episode: s₀ → s₁ → ... → s_T (terminal state)
Return: G_t = r_t + γr_{t+1} + ... + γ^{T-t}r_T

Continuing: Task never ends (stock trading, robot operation)

Return: G_t = r_t + γr_{t+1} + γ²r_{t+2} + ... (infinite)

Critical: Continuing tasks REQUIRE γ < 1 (else return infinite).


MDP Pitfall #1: Using Wrong State Representation

Bad: State = current frame only (when velocity matters)

# Pong: Ball position alone doesn't tell velocity
state = current_frame  # WRONG - not Markovian

Good: State = last 4 frames (velocity from difference)

# Frame stacking preserves the Markov property (frames is a buffer of recent observations)
state = np.concatenate([frames[t], frames[t - 1], frames[t - 2], frames[t - 3]])

Why: Ball velocity = (position_t - position_{t-1}) / dt, need history.


MDP Pitfall #2: Reward Function Shapes Behavior

Example: Robot navigating to goal

Bad Reward:

reward = +1 if at_goal else 0  # Sparse

Problem: No signal until goal reached, hard to learn.

Better Reward:

reward = -distance_to_goal  # Dense

Problem: Agent learns to get closer but may not reach goal (local optimum).

Best Reward (Potential-Based Shaping):

reward = (distance_prev - distance_curr) + large_bonus_at_goal

Why: Encourages progress + explicit goal reward.
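
A minimal sketch of that shaped reward, assuming a hypothetical distance_to_goal(state) helper and treating the size of the goal bonus as a free choice:

def shaped_reward(prev_state, curr_state, at_goal, goal_bonus=10.0):
    # Progress term: positive when the agent moved closer to the goal
    progress = distance_to_goal(prev_state) - distance_to_goal(curr_state)
    # Explicit bonus when the goal is actually reached
    return progress + (goal_bonus if at_goal else 0.0)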

Takeaway: Reward function engineering is CRITICAL. Route to reward-shaping skill for details.


MDP Formulation Checklist

Before implementing any RL algorithm, answer:

  • States: What information defines the situation? Is it Markovian?
  • Actions: What can the agent do? Discrete or continuous?
  • Transitions: Deterministic or stochastic? Do you know P(s'|s,a)?
  • Rewards: Immediate reward for each transition? Sparse or dense?
  • Discount: Episodic (can use γ=1) or continuing (need γ<1)?
  • Markov Property: Does current state fully determine future?

If you cannot answer these, you cannot implement RL algorithms effectively.
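
As a sketch, the answers for the GridWorld example above can be written down explicitly before touching any algorithm code (the dictionary below is just one convenient way to record them; the field names are illustrative):

gridworld_mdp = {
    "states": [(x, y) for x in range(4) for y in range(4)],  # position alone is Markovian
    "actions": ["UP", "DOWN", "LEFT", "RIGHT"],               # discrete
    "transitions": "deterministic",                           # P(s'|s,a) known and degenerate
    "reward": "-1 per step, +10 at goal",                     # dense step cost, bonus at goal
    "gamma": 0.9,                                             # episodic, short horizon (~10 steps)
}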


Part 2: Value Functions

What is a Value Function?

A value function quantifies "how good" a state (or state-action pair) is.

State-Value Function V^π(s):

V^π(s) = E_π[G_t | s_t = s]
       = E_π[r_t + γr_{t+1} + γ²r_{t+2} + ... | s_t = s]

Meaning: Expected cumulative discounted reward starting from state s and following policy π.

Action-Value Function Q^π(s,a):

Q^π(s,a) = E_π[G_t | s_t = s, a_t = a]
         = E_π[r_t + γr_{t+1} + γ²r_{t+2} + ... | s_t = s, a_t = a]

Meaning: Expected cumulative discounted reward starting from state s, taking action a, then following policy π.

Relationship:

V^π(s) = Σ_a π(a|s) Q^π(s,a)

Intuition: V(s) = value of state, Q(s,a) = value of state-action pair.


Critical Distinction: Value vs Reward

Reward r(s,a): Immediate, one-step payoff.

Value V(s): Long-term, cumulative expected reward.

Example: GridWorld

Reward: r = -1 every step, r = +10 at goal
Value at a state 3 steps from the goal:
  V(s) ≈ -1 + γ(-1) + γ²(+10)
       = -1 - 0.9 + 0.81*10
       = -1.9 + 8.1 = 6.2

Key: Value is higher than immediate reward because it accounts for future goal reward.

Common Mistake: Setting V(s) = r(s). This ignores all future rewards.


Example: Computing V^π for Simple Policy

GridWorld: 3x3 grid, goal at (2,2), γ=0.9, r=-1 per step.

Policy π: Always move right or down (deterministic).

Manual Calculation:

V^π((2,2)) = 0  (goal, no future rewards)

V^π((2,1)) = r + γ V^π((2,2))
           = -1 + 0.9 * 0 = -1

V^π((1,2)) = r + γ V^π((2,2))
           = -1 + 0.9 * 0 = -1

V^π((1,1)) = r + γ V^π((1,2))  (assuming action = DOWN)
           = -1 + 0.9 * (-1) = -1.9

V^π((0,0)) = r + γ V^π((0,1))
           = ... (depends on path)

Observation: Values decrease as distance from goal increases (more -1 rewards to collect).
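
These backups can be reproduced with a few lines of arithmetic (a sketch that hard-codes the states to mirror the manual calculation, not a general evaluator):

gamma = 0.9

V = {(2, 2): 0.0}                       # goal: no future rewards
V[(2, 1)] = -1 + gamma * V[(2, 2)]      # -1.0
V[(1, 2)] = -1 + gamma * V[(2, 2)]      # -1.0
V[(1, 1)] = -1 + gamma * V[(1, 2)]      # -1.9 (policy chooses DOWN)
print(V)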


Optimal Value Functions

Optimal State-Value Function V*(s):

V*(s) = max_π V^π(s)

Meaning: Maximum value achievable from state s under ANY policy.

Optimal Action-Value Function Q*(s,a):

Q*(s,a) = max_π Q^π(s,a)

Meaning: Maximum value achievable from state s, taking action a, then acting optimally.

Optimal Policy π*:

π*(s) = argmax_a Q*(s,a)

Meaning: Policy that achieves V*(s) at all states.

Key Insight: If you know Q*(s,a), optimal policy is trivial (pick action with max Q).


Value Function Pitfall #1: Confusing V and Q

Wrong Understanding:

  • V(s) = value of state s
  • Q(s,a) = value of action a (WRONG - ignores state)

Correct Understanding:

  • V(s) = value of state s (average over actions under policy)
  • Q(s,a) = value of taking action a IN STATE s

Example: GridWorld

State s = (1,1)
V(s) might be 5.0 (average value under policy)

Q(s, RIGHT) = 6.0  (moving right is good)
Q(s, LEFT)  = 2.0  (moving left is bad)
Q(s, UP)    = 4.0
Q(s, DOWN)  = 7.0  (moving down is best)

V(s) = π(RIGHT|s)*6 + π(LEFT|s)*2 + π(UP|s)*4 + π(DOWN|s)*7

Takeaway: Q depends on BOTH state and action. V depends only on state.
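
Numerically, the relationship V^π(s) = Σ_a π(a|s) Q^π(s,a) looks like the sketch below; the policy probabilities are made up for illustration:

Q = {"RIGHT": 6.0, "LEFT": 2.0, "UP": 4.0, "DOWN": 7.0}
pi = {"RIGHT": 0.4, "LEFT": 0.1, "UP": 0.2, "DOWN": 0.3}   # π(a|s), sums to 1

V = sum(pi[a] * Q[a] for a in Q)   # 0.4*6 + 0.1*2 + 0.2*4 + 0.3*7 = 5.5
print(V)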


Value Function Pitfall #2: Forgetting Expectation

Wrong: V(s) = sum of rewards on one trajectory.

Correct: V(s) = expected sum over ALL possible trajectories.

Example: Stochastic GridWorld

# WRONG: Compute V by running one episode
episode_return = sum([r_0, r_1, ..., r_T])
V[s_0] = episode_return  # This is ONE sample, not expectation

# CORRECT: Compute V by averaging over many episodes
returns = []
for _ in range(1000):
    episode_return = run_episode(policy, start_state=s)
    returns.append(episode_return)
V[s] = np.mean(returns)  # Expectation via Monte Carlo

Key: Value is an expectation, not a single sample.


Value Function Pitfall #3: Ignoring Discount Factor

Scenario: User computes V without discounting.

Wrong:

V[s] = r_0 + r_1 + r_2 + ...  # No discount

Correct:

V[s] = r_0 + gamma*r_1 + gamma**2*r_2 + ...

Why It Matters: Without discount, values blow up in continuing tasks.

Example: Continuing task with r=1 every step

Without discount: V = 1 + 1 + 1 + ... = ∞
With γ=0.9:      V = 1 + 0.9 + 0.81 + ... = 1/(1-0.9) = 10

Takeaway: Always discount future rewards in continuing tasks.


Part 3: Policies

What is a Policy?

A policy π is a mapping from states to actions (or action probabilities).

Deterministic Policy: π: S → A

π(s) = a  (always take action a in state s)

Stochastic Policy: π: S × A → [0,1]

π(a|s) = probability of taking action a in state s
Σ_a π(a|s) = 1  (probabilities sum to 1)

Example: Policies in GridWorld

Deterministic Policy:

def policy(state):
    if state[0] < 2:
        return "RIGHT"
    else:
        return "DOWN"

Stochastic Policy:

def policy(state):
    # 70% right, 20% down, 10% up
    return np.random.choice(["RIGHT", "DOWN", "UP"], 
                           p=[0.7, 0.2, 0.1])

Uniform Random Policy:

def policy(state):
    return np.random.choice(["UP", "DOWN", "LEFT", "RIGHT"])

Policy Evaluation

Problem: Given policy π, compute V^π(s) for all states.

Approach 1: Monte Carlo (sample trajectories)

# Run many episodes, average returns
V = defaultdict(float)
counts = defaultdict(int)

for episode in range(10000):
    trajectory = run_episode(policy)
    G = 0
    for (s, a, r) in reversed(trajectory):
        G = r + gamma * G
        V[s] += G
        counts[s] += 1

for s in V:
    V[s] /= counts[s]  # Average

Approach 2: Bellman Expectation (iterative)

# Initialize V arbitrarily
V = {s: 0.0 for s in states}

# Iterate until convergence; pi(a, s) stands for π(a|s), P(s, a, s2) for P(s'|s,a)
converged = False
while not converged:
    V_new = {}
    for s in states:
        V_new[s] = sum(pi(a, s) * sum(P(s, a, s2) * (R(s, a, s2) + gamma * V[s2])
                                      for s2 in states)
                       for a in actions)
    converged = max(abs(V_new[s] - V[s]) for s in states) < threshold
    V = V_new

Approach 2 requires knowing P(s'|s,a) (model-based).


Policy Improvement

Theorem: Given V^π, greedy policy π' with respect to V^π is at least as good as π.

π'(s) = argmax_a Q^π(s,a)
      = argmax_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γV^π(s')]

Proof Sketch: By construction, π' maximizes expected immediate reward + future value.

Consequence: Iterating policy evaluation + policy improvement converges to optimal policy π*.
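
A single improvement step can be sketched as follows, assuming hypothetical model helpers P(s, a, s_next) = P(s'|s,a) and R(s, a, s_next) = R(s,a,s') as in the pseudocode elsewhere in this skill:

def improve_policy(V, states, actions, gamma):
    """Return the greedy policy with respect to V (one policy-improvement step)."""
    new_policy = {}
    for s in states:
        new_policy[s] = max(
            actions,
            key=lambda a: sum(P(s, a, s_next) * (R(s, a, s_next) + gamma * V[s_next])
                              for s_next in states),
        )
    return new_policy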


Optimal Policy π*

Theorem: There exists an optimal policy π* that achieves V*(s) at all states.

How to find π* from Q*:

def optimal_policy(state):
    return np.argmax(Q_star[state])  # Greedy w.r.t. Q*: index of the best action

How to find π* from V*:

def optimal_policy(state):
    # One-step lookahead using the model; P(s, a, s_next) stands for P(s'|s,a)
    return max(actions,
               key=lambda a: sum(P(state, a, s_next) * (R(state, a, s_next) + gamma * V_star[s_next])
                                 for s_next in states))

Key: Optimal policy is deterministic (greedy w.r.t. Q* or V*).

Exception: When multiple actions are optimal in a state, any distribution over those optimal actions is also optimal.


Policy Pitfall #1: Greedy Policy Without Exploration

Problem: Always taking argmax(Q) means never trying new actions.

Example:

# Pure greedy policy (WRONG for learning)
def policy(state):
    return argmax(Q[state, :])

Why It Fails: If Q is initialized wrong, agent never explores better actions.

Solution: ε-greedy policy

def epsilon_greedy_policy(state, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)  # Explore
    else:
        return argmax(Q[state, :])      # Exploit

Exploration-Exploitation Tradeoff: Explore to find better actions, exploit to maximize reward.


Policy Pitfall #2: Stochastic Policy for Deterministic Optimal

Scenario: Optimal policy is deterministic (most MDPs), but user uses stochastic policy.

Effect: Suboptimal performance (randomness doesn't help).

Example: GridWorld optimal policy always moves toward goal (deterministic).

When Stochastic is Needed:

  1. During Learning: Exploration (ε-greedy, Boltzmann)
  2. Partially Observable: Stochasticity can help in POMDPs
  3. Multi-Agent: Randomness prevents exploitation by opponents

Takeaway: After learning, optimal policy is usually deterministic. Use stochastic for exploration.


Part 4: Bellman Equations

Bellman Expectation Equation

For V^π:

V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^π(s')]

Intuition: Value of state s = expected immediate reward + discounted value of next state.

For Q^π:

Q^π(s,a) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ Σ_{a'} π(a'|s') Q^π(s',a')]

Intuition: Value of (s,a) = expected immediate reward + discounted value of next (s',a').

Relationship:

V^π(s) = Σ_a π(a|s) Q^π(s,a)
Q^π(s,a) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^π(s')]

Bellman Optimality Equation

For V*:

V*(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')]

Intuition: Optimal value = max over actions of (immediate reward + discounted optimal future value).

For Q*:

Q*(s,a) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ max_{a'} Q*(s',a')]

Intuition: Optimal Q-value = expected immediate reward + discounted optimal Q-value of next state.

Relationship:

V*(s) = max_a Q*(s,a)
Q*(s,a) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')]

Deriving the Bellman Equation

Start with definition of V^π:

V^π(s) = E_π[G_t | s_t = s]
       = E_π[r_t + γr_{t+1} + γ²r_{t+2} + ... | s_t = s]

Factor out first reward:

V^π(s) = E_π[r_t + γ(r_{t+1} + γr_{t+2} + ...) | s_t = s]
       = E_π[r_t | s_t = s] + γ E_π[r_{t+1} + γr_{t+2} + ... | s_t = s]

Second term is V^π(s_{t+1}):

V^π(s) = E_π[r_t | s_t = s] + γ E_π[V^π(s_{t+1}) | s_t = s]

Expand expectations:

V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^π(s')]

This is the Bellman Expectation Equation.

Key Insight: Value function satisfies a consistency equation (recursive).
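
A quick numerical sanity check of this consistency property: for a single state that loops to itself with reward 1 under the policy, the closed-form value 1/(1-γ) satisfies the equation exactly (a toy check, not part of the derivation above):

gamma = 0.9
V = 1 / (1 - gamma)            # closed-form value of the self-looping state

lhs = V
rhs = 1 + gamma * V            # Bellman: immediate reward + discounted value of next state
assert abs(lhs - rhs) < 1e-9
print(lhs, rhs)                # both 10.0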


Why Bellman Equations Matter

1. Iterative Algorithms: Use Bellman equation as update rule

# Value Iteration
V_new[s] = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V[s']]

# Q-Learning
Q[s,a] += alpha * (r + gamma * max_a' Q[s',a'] - Q[s,a])

2. Convergence Guarantees: Bellman operator is a contraction, guarantees convergence.

3. Understanding Algorithms: All RL algorithms approximate Bellman equations.

Takeaway: Bellman equations are the foundation of RL algorithms.


Bellman Pitfall #1: Forgetting Max vs Expectation

Bellman Expectation (for policy π):

V^π(s) = Σ_a π(a|s) ...  # Expectation over policy

Bellman Optimality (for optimal policy):

V*(s) = max_a ...  # Maximize over actions

Consequence:

  • Policy evaluation uses Bellman expectation
  • Value iteration uses Bellman optimality

Common Mistake: Using max when evaluating a non-greedy policy.
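
The difference is easy to see numerically. For made-up Q-values and a uniform (non-greedy) policy, the two backups over actions give different state values:

Q = {"UP": 4.0, "DOWN": 7.0, "LEFT": 2.0, "RIGHT": 6.0}
pi = {a: 0.25 for a in Q}                            # uniform (non-greedy) policy

expectation_backup = sum(pi[a] * Q[a] for a in Q)    # 4.75 -> V^π(s), policy evaluation
optimality_backup = max(Q.values())                  # 7.0  -> V*(s), value iteration
print(expectation_backup, optimality_backup)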


Bellman Pitfall #2: Ignoring Transition Probabilities

Deterministic Transition:

V^π(s) = R(s,a) + γ V^π(s')  # Direct, s' is deterministic

Stochastic Transition:

V^π(s) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^π(s')]  # Weighted sum

Example: Stochastic GridWorld

# Action RIGHT from (1,1)
V((1,1)) = 0.8 * [r + γ V((1,2))]    # 80% intended
         + 0.1 * [r + γ V((0,1))]    # 10% slip up
         + 0.1 * [r + γ V((2,1))]    # 10% slip down

Takeaway: Don't forget to weight by transition probabilities in stochastic environments.


Part 5: Discount Factor γ

What Does γ Control?

Discount factor γ ∈ [0, 1] controls how much the agent cares about future rewards.

γ = 0: Only immediate reward matters

V(s) = E[r_t]  (myopic)

γ = 1: All future rewards matter equally

V(s) = E[r_t + r_{t+1} + r_{t+2} + ...]  (far-sighted)

γ = 0.9: Future discounted exponentially

V(s) = E[r_t + 0.9*r_{t+1} + 0.81*r_{t+2} + ...]

Reward 10 steps away:

  • γ=0.9: worth 0.9^10 = 0.35 of immediate reward
  • γ=0.99: worth 0.99^10 = 0.90 of immediate reward

Planning Horizon

Effective Horizon: How far ahead does agent plan?

Approximation: Horizon ≈ 1/(1-γ)

Examples:

  • γ=0.9 → Horizon ≈ 10 steps
  • γ=0.99 → Horizon ≈ 100 steps
  • γ=0.5 → Horizon ≈ 2 steps
  • γ=0.999 → Horizon ≈ 1000 steps

Intuition: After horizon steps, rewards are discounted to ~37% (e^{-1}).

Formal: Σ_{t=0}^∞ γ^t = 1/(1-γ) (sum of geometric series).
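
A short sketch that verifies both the rule of thumb and the geometric-series sum numerically:

gamma = 0.99
horizon = round(1 / (1 - gamma))                     # ≈ 100 steps

# Discounted weight of a reward arriving exactly `horizon` steps in the future
print(gamma ** horizon)                              # ≈ 0.366 ≈ e^-1

# Partial sums of the geometric series approach 1/(1-gamma)
print(sum(gamma ** t for t in range(10_000)), 1 / (1 - gamma))   # ≈ 100.0, 100.0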


Choosing γ

Rule of Thumb:

  • Task horizon known: γ such that 1/(1-γ) ≈ task_length
  • Short episodes (< 100 steps): γ = 0.9 to 0.95
  • Long episodes (100-1000 steps): γ = 0.99
  • Very long (> 1000 steps): γ = 0.999

Example: Pong (episode ~ 1000 steps)

γ = 0.99  # Horizon ≈ 100, sees ~10% of episode

Example: Cartpole (episode ~ 200 steps)

γ = 0.99  # Horizon ≈ 100, sees half of episode

Example: Chess (game ~ 40 moves = 80 steps)

γ = 0.95  # Horizon ≈ 20, sees quarter of game

γ = 1 Special Case

When γ = 1:

  • Only valid for episodic tasks (guaranteed termination)
  • Continuing tasks: V = ∞ (unbounded)

Example: GridWorld (terminates at goal)

γ = 1.0  # OK, episode ends
V(s) = -steps_to_goal + 10  (finite)

Example: Stock trading (never terminates)

γ = 1.0  # WRONG, V = ∞
γ = 0.99  # Correct

Takeaway: Use γ < 1 for continuing tasks, γ = 1 allowed for episodic.


Discount Factor Pitfall #1: Too Small γ

Scenario: Task requires 50 steps to reach goal, γ=0.9.

Problem:

Reward at step 50 discounted by 0.9^50 = 0.0052

Effect: Agent effectively blind to long-term goals (can't see reward).

Solution: Increase γ to 0.99 (0.99^50 = 0.61, still significant).

Symptom: Agent learns suboptimal policy (ignores distant goals).


Discount Factor Pitfall #2: γ = 1 in Continuing Tasks

Scenario: Continuing task (never terminates), γ=1.

Problem:

V(s) = r + r + r + ... = ∞  (unbounded)

Effect: Value iteration, Q-learning diverge (values explode).

Solution: Use γ < 1 (e.g., γ=0.99).

Symptom: Values grow without bound, algorithm doesn't converge.


Discount Factor Pitfall #3: Treating γ as Hyperparameter

Wrong Mindset: "Let's grid search γ in [0.9, 0.95, 0.99]."

Correct Mindset: "Task requires planning X steps ahead, so γ = 1 - 1/X."

Example: Goal 100 steps away

Required horizon = 100
γ = 1 - 1/100 = 0.99

Takeaway: γ is not arbitrary. Choose based on task horizon.


Part 6: Algorithm Families

Three Paradigms

1. Dynamic Programming (DP):

  • Requires full MDP model (P, R known)
  • Exact algorithms (no sampling)
  • Examples: Value Iteration, Policy Iteration

2. Monte Carlo (MC):

  • Model-free (learn from experience)
  • Learns from complete episodes
  • Examples: First-visit MC, Every-visit MC

3. Temporal Difference (TD):

  • Model-free (learn from experience)
  • Learns from incomplete episodes
  • Examples: TD(0), Q-learning, SARSA

Key Differences:

  • DP: Needs model, no sampling
  • MC: No model, full episodes
  • TD: No model, partial episodes (most flexible)

Value Iteration

Algorithm: Iteratively apply Bellman optimality operator.

# Initialize
V = {s: 0.0 for s in states}

# Iterate until convergence; P(s, a, s_next) stands for P(s'|s,a)
converged = False
while not converged:
    V_new = {}
    for s in states:
        # Bellman optimality backup
        V_new[s] = max(sum(P(s, a, s_next) * (R(s, a, s_next) + gamma * V[s_next])
                           for s_next in states)
                       for a in actions)

    if max(abs(V_new[s] - V[s]) for s in states) < threshold:
        converged = True
    V = V_new

# Extract the greedy policy from V
policy = {s: max(actions,
                 key=lambda a: sum(P(s, a, s_next) * (R(s, a, s_next) + gamma * V[s_next])
                                   for s_next in states))
          for s in states}

Convergence: Guaranteed (Bellman operator is contraction).

Computational Cost: O(|S|² |A|) per iteration.

When to Use: Small state spaces (< 10,000 states), full model available.


Policy Iteration

Algorithm: Alternate between policy evaluation and policy improvement.

# Initialize random policy
policy = {s: random.choice(actions) for s in states}

converged = False
while not converged:
    # Policy Evaluation: compute V^π for the current policy
    # (P(s, a, s_next) stands for P(s'|s,a))
    V = {s: 0.0 for s in states}
    while True:
        V_new = {}
        for s in states:
            a = policy[s]
            V_new[s] = sum(P(s, a, s_next) * (R(s, a, s_next) + gamma * V[s_next])
                           for s_next in states)
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if delta < threshold:
            break

    # Policy Improvement: make policy greedy w.r.t. V
    policy_stable = True
    for s in states:
        old_action = policy[s]
        policy[s] = max(actions,
                        key=lambda a: sum(P(s, a, s_next) * (R(s, a, s_next) + gamma * V[s_next])
                                          for s_next in states))
        if old_action != policy[s]:
            policy_stable = False

    if policy_stable:
        converged = True

Convergence: Guaranteed, often fewer iterations than value iteration.

When to Use: When policy converges faster than values (common).

Key Difference from Value Iteration:

  • Value iteration: no explicit policy until end
  • Policy iteration: maintain and improve policy each iteration

Monte Carlo Methods

Idea: Estimate V^π(s) by averaging returns from state s.

# First-visit MC
returns_sum = defaultdict(float)
counts = defaultdict(int)

for episode in range(num_episodes):
    trajectory = run_episode(policy)  # [(s_0, a_0, r_0), ..., (s_T, a_T, r_T)]

    # Backward pass: overwrite so that, after the loop, first_return[s] holds
    # the return from the FIRST visit to s in this episode
    G = 0
    first_return = {}
    for (s, a, r) in reversed(trajectory):
        G = r + gamma * G  # Accumulate discounted return
        first_return[s] = G

    for s, G_first in first_return.items():
        returns_sum[s] += G_first
        counts[s] += 1

# Average returns across episodes (outside the episode loop)
V = {s: returns_sum[s] / counts[s] for s in counts}

Advantages:

  • No model needed (model-free)
  • Can handle stochastic environments
  • Unbiased estimates

Disadvantages:

  • Requires complete episodes (can't learn mid-episode)
  • High variance (one trajectory is noisy)
  • Slow convergence

When to Use: Episodic tasks, when model unavailable.


Temporal Difference (TD) Learning

Idea: Update V after each step using bootstrapping.

TD(0) Update:

V[s] += alpha * (r + gamma * V[s_next] - V[s])
#                \_____________________/
#                       TD error

Bootstrapping: Use current estimate V[s_next] instead of true return.

Full Algorithm:

V = {s: 0.0 for s in states}

for episode in range(num_episodes):
    s = initial_state()
    terminal = False

    while not terminal:
        a = policy(s)
        s_next, r, terminal = environment.step(s, a)  # step also returns a done flag here

        # TD update: bootstrap from V[s_next], using 0 for terminal next states
        next_value = 0.0 if terminal else V[s_next]
        V[s] += alpha * (r + gamma * next_value - V[s])

        s = s_next

Advantages:

  • No model needed (model-free)
  • Can learn from incomplete episodes (online)
  • Lower variance than MC

Disadvantages:

  • Biased estimates (bootstrap uses estimate)
  • Requires tuning α (learning rate)

When to Use: Model-free, need online learning.


Q-Learning (TD for Q-values)

TD for action-values Q(s,a):

Q[s,a] += alpha * (r + gamma * max_a' Q[s_next, a'] - Q[s,a])

Full Algorithm:

Q = defaultdict(lambda: defaultdict(float))

for episode in range(num_episodes):
    s = initial_state()
    terminal = False

    while not terminal:
        # ε-greedy action selection
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[s][act])  # greedy action

        s_next, r, terminal = environment.step(s, a)  # step also returns a done flag here

        # Q-learning update (off-policy): bootstrap from the best next action,
        # valuing terminal next states at 0
        best_next = 0.0 if terminal else max(Q[s_next][act] for act in actions)
        Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

        s = s_next

Key: Off-policy (learns optimal Q regardless of behavior policy).

When to Use: Model-free, discrete actions, want optimal policy.


SARSA (On-Policy TD)

Difference from Q-learning: Uses next action from policy (on-policy).

Q[s,a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s,a])
#                                      ^^^^^^
#                                      Action from policy, not max

Full Algorithm:

Q = defaultdict(lambda: defaultdict(float))

for episode in range(num_episodes):
    s = initial_state()
    a = epsilon_greedy(Q[s], epsilon)  # Choose first action
    terminal = False

    while not terminal:
        s_next, r, terminal = environment.step(s, a)  # step also returns a done flag here
        a_next = epsilon_greedy(Q[s_next], epsilon)    # Next action from the behavior policy

        # SARSA update (on-policy): bootstrap from the action actually chosen next,
        # valuing terminal next states at 0
        next_value = 0.0 if terminal else Q[s_next][a_next]
        Q[s][a] += alpha * (r + gamma * next_value - Q[s][a])

        s, a = s_next, a_next

Difference from Q-learning:

  • Q-learning: learns optimal policy (off-policy)
  • SARSA: learns policy being followed (on-policy)

When to Use: When you want policy to reflect exploration strategy.


Algorithm Comparison

| Algorithm | Model? | Episodes? | Convergence | Use Case |
|---|---|---|---|---|
| Value Iteration | Yes (P, R) | No | Guaranteed | Small MDPs, known model |
| Policy Iteration | Yes (P, R) | No | Guaranteed, faster | Small MDPs, good init policy |
| Monte Carlo | No | Complete | Slow, high variance | Episodic, model-free |
| TD(0) | No | Partial | Faster, lower variance | Online, model-free |
| Q-Learning | No | Partial | Guaranteed* | Discrete actions, off-policy |
| SARSA | No | Partial | Guaranteed* | On-policy, safe exploration |

*With appropriate exploration and learning rate schedule.


Algorithm Pitfall #1: Using DP Without Model

Scenario: User tries value iteration on real robot (no model).

Problem: Value iteration requires P(s'|s,a) and R(s,a,s').

Solution: Use model-free methods (Q-learning, SARSA, policy gradients).

Red Flag: "Let's use policy iteration for Atari games." (No model available.)


Algorithm Pitfall #2: Monte Carlo on Non-Episodic Tasks

Scenario: Continuing task (never terminates), try MC.

Problem: MC requires complete episodes to compute return.

Solution: Use TD methods (learn from partial trajectories).

Red Flag: "Let's use MC for stock trading." (Continuing task.)


Algorithm Pitfall #3: Confusing Q-Learning and SARSA

Scenario: User uses Q-learning but expects on-policy behavior.

Example: Cliff walking with epsilon-greedy

  • Q-learning: Learns optimal (risky) path along cliff
  • SARSA: Learns safe path away from cliff (accounts for exploration)

Takeaway:

  • Q-learning: Learns optimal policy (off-policy)
  • SARSA: Learns policy being followed (on-policy)

Choose based on whether you want optimal policy or policy that accounts for exploration.


Part 7: Exploration vs Exploitation

The Tradeoff

Exploitation: Choose action with highest known value (maximize immediate reward).

Exploration: Try new actions to discover if they're better (maximize long-term information).

Dilemma: Must explore to find optimal policy, but exploration sacrifices short-term reward.

Example: Restaurant choice

  • Exploitation: Go to your favorite restaurant (known good)
  • Exploration: Try a new restaurant (might be better, might be worse)

Why Exploration is Necessary

Scenario: GridWorld, Q-values initialized to 0.

Without Exploration:

# Greedy policy
policy(s) = argmax(Q[s, :])  # Always 0 initially, picks arbitrary action

Problem: If first action happens to be BAD, Q[s,a] becomes negative, never tried again.

Result: Agent stuck in suboptimal policy (local optimum).

With Exploration:

# ε-greedy
if random.random() < epsilon:
    action = random.choice(actions)  # Explore
else:
    action = argmax(Q[s, :])  # Exploit

Result: Eventually tries all actions, discovers optimal.


ε-Greedy Exploration

Algorithm:

def epsilon_greedy(state, Q, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)  # Explore with prob ε
    else:
        return argmax(Q[state, :])     # Exploit with prob 1-ε

Tuning ε:

  • ε = 0: No exploration (greedy, can get stuck)
  • ε = 1: Random policy (no exploitation, never converges)
  • ε = 0.1: Common choice (10% exploration)

Decay Schedule:

epsilon = max(epsilon_min, epsilon * decay_rate)
# Start high (ε=1.0), decay to low (ε=0.01)

Rationale: Explore heavily early, exploit more as you learn.


Upper Confidence Bound (UCB)

Idea: Choose action that balances value and uncertainty.

UCB Formula:

action = argmax(Q[s,a] + c * sqrt(log(N[s]) / N[s,a]))
#                ^^^^^^     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
#               Exploitation        Exploration bonus

Where:

  • N[s] = number of times state s visited
  • N[s,a] = number of times action a taken in state s
  • c = exploration constant

Intuition: Actions tried less often get exploration bonus (uncertainty).

Advantage over ε-greedy: Adaptive exploration (focuses on uncertain actions).


Optimistic Initialization

Idea: Initialize Q-values to high values (optimistic).

Q = defaultdict(lambda: defaultdict(lambda: 10.0))  # Optimistic

Effect: All actions initially seem good, encourages exploration.

How it works:

  1. All Q-values start high (optimistic)
  2. Agent tries action, gets real reward (likely lower)
  3. Q-value decreases, agent tries other actions
  4. Continues until all actions explored

Advantage: Simple, no ε parameter.

Disadvantage: Only works for finite action spaces, exploration stops after initial phase.
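
A minimal bandit sketch of the idea; the reward distribution and the initial value of 10.0 are made up for illustration:

import numpy as np

true_Q = [1.0, 5.0, 3.0]                     # unknown to the agent
Q = np.full(3, 10.0)                         # optimistic initial estimates
counts = np.zeros(3)

for _ in range(1000):
    a = int(np.argmax(Q))                    # purely greedy; optimism drives exploration
    r = true_Q[a] + np.random.randn() * 0.5  # noisy reward
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]           # incremental sample average

print(np.argmax(Q))                          # almost always 1, the truly best arm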


Boltzmann Exploration (Softmax)

Idea: Choose actions probabilistically based on Q-values.

def softmax(Q, temperature=1.0):
    exp_Q = np.exp(Q / temperature)
    return exp_Q / np.sum(exp_Q)

probs = softmax(Q[state, :])
action = np.random.choice(actions, p=probs)

Temperature:

  • High temperature (τ→∞): Uniform random (more exploration)
  • Low temperature (τ→0): Greedy (more exploitation)

Advantage: Naturally weights exploration by Q-values (poor actions less likely).

Disadvantage: Requires tuning temperature, computationally more expensive.


Exploration Pitfall #1: No Exploration

Scenario: Pure greedy policy.

action = argmax(Q[state, :])  # No randomness

Problem: Agent never explores, gets stuck in local optimum.

Example: Q-values initialized to 0, first action is UP (arbitrary).

  • Agent always chooses UP (Q still 0 for others)
  • Never discovers RIGHT is optimal
  • Stuck forever

Solution: Always use some exploration (ε-greedy with ε ≥ 0.01).


Exploration Pitfall #2: Too Much Exploration

Scenario: ε = 0.5 (50% random actions).

Problem: Agent wastes time on known-bad actions.

Effect: Slow convergence, poor performance even after learning.

Solution: Decay ε over time (start high, end low).

epsilon = max(0.01, epsilon * 0.995)  # Decay to 1%

Exploration Pitfall #3: Exploration at Test Time

Scenario: Evaluating learned policy with ε-greedy (ε=0.1).

Problem: Test performance artificially low (10% random actions).

Solution: Use greedy policy at test time.

# Training
action = epsilon_greedy(state, Q, epsilon=0.1)

# Testing
action = argmax(Q[state, :])  # Greedy, no exploration

Takeaway: Exploration is for learning, not evaluation.


Part 8: When Theory is Sufficient

Theory vs Implementation

When Understanding Theory is Enough:

  1. Debugging: Understanding Bellman equation explains why Q-values aren't converging
  2. Hyperparameter Tuning: Understanding γ explains why agent is myopic
  3. Algorithm Selection: Understanding model-free vs model-based explains why value iteration fails
  4. Conceptual Design: Understanding exploration explains why agent gets stuck

When You Need Implementation:

  1. Real Problems: Toy examples don't teach debugging real environments
  2. Scaling: Neural networks, replay buffers, parallel environments
  3. Engineering: Practical details (learning rate schedules, reward clipping)

This Skill's Scope: Theory, intuition, foundations.

Other Skills for Implementation: value-based-methods, policy-gradient-methods, actor-critic-methods.


What This Skill Taught You

1. MDP Formulation: S, A, P, R, γ - the framework for RL.

2. Value Functions: V(s) = expected cumulative reward, Q(s,a) = value of action in state.

3. Bellman Equations: Recursive decomposition, foundation of all algorithms.

4. Discount Factor: γ controls planning horizon (1/(1-γ)).

5. Policies: Deterministic vs stochastic, optimal policy π*.

6. Algorithms:

  • DP: Value iteration, policy iteration (model-based)
  • MC: Monte Carlo (episodic, model-free)
  • TD: Q-learning, SARSA (online, model-free)

7. Exploration: ε-greedy, UCB, necessary for learning.

8. Theory-Practice Gap: When theory suffices vs when to implement.


Next Steps

After mastering foundations, route to:

For Discrete Actions:

  • value-based-methods: DQN, Double DQN, Dueling DQN (Q-learning + neural networks)

For Continuous Actions:

  • actor-critic-methods: SAC, TD3, A2C (policy + value function)

For Any Action Space:

  • policy-gradient-methods: REINFORCE, PPO (direct policy optimization)

For Debugging:

  • rl-debugging: Why agent not learning, reward issues, convergence problems

For Environment Setup:

  • rl-environments: Gym, custom environments, wrappers

Part 9: Common Pitfalls

Pitfall #1: Skipping MDP Formulation

Symptom: Implementing Q-learning without defining states, actions, rewards clearly.

Consequence: Algorithm fails, user doesn't know why.

Solution: Always answer:

  • What are states? (Markovian?)
  • What are actions? (Discrete/continuous?)
  • What is reward function? (Sparse/dense?)
  • What is discount factor? (Based on horizon?)

Pitfall #2: Confusing Value and Reward

Symptom: Setting V(s) = r(s).

Consequence: Ignores future rewards, policy suboptimal.

Solution: V(s) = E[r + γr' + γ²r'' + ...], not just r.


Pitfall #3: Arbitrary Discount Factor

Symptom: "Let's use γ=0.9 because it's common."

Consequence: Agent can't see long-term goals (if γ too small) or values diverge (if γ=1 in continuing task).

Solution: Choose γ based on horizon (γ = 1 - 1/horizon).


Pitfall #4: No Exploration

Symptom: Pure greedy policy during learning.

Consequence: Agent stuck in local optimum.

Solution: ε-greedy with ε ≥ 0.01, decay over time.


Pitfall #5: Using DP Without Model

Symptom: Trying value iteration on real robot.

Consequence: Algorithm requires P(s'|s,a), R(s,a), which are unknown.

Solution: Use model-free methods (Q-learning, policy gradients).


Pitfall #6: Monte Carlo on Continuing Tasks

Symptom: Using MC on task that never terminates.

Consequence: Cannot compute return (episode never ends).

Solution: Use TD methods (learn from partial trajectories).


Pitfall #7: Confusing Q-Learning and SARSA

Symptom: Using Q-learning but expecting safe exploration.

Consequence: Q-learning learns optimal (risky) policy, ignores exploration safety.

Solution: Use SARSA for safe on-policy learning, Q-learning for optimal off-policy.


Pitfall #8: Exploration at Test Time

Symptom: Evaluating with ε-greedy (ε > 0).

Consequence: Test performance artificially low.

Solution: Greedy policy at test time (ε=0).


Pitfall #9: Treating Bellman as Black Box

Symptom: Using Q-learning update without understanding why.

Consequence: Cannot debug convergence issues, tune hyperparameters.

Solution: Derive Bellman equation, understand bootstrapping.


Pitfall #10: Ignoring Transition Probabilities

Symptom: Using deterministic Bellman equation in stochastic environment.

Consequence: Wrong value estimates.

Solution: Weight by P(s'|s,a) in stochastic environments.


Part 10: Rationalization Resistance

Rationalization Table

| Rationalization | Reality | Counter-Guidance | Red Flag |
|---|---|---|---|
| "I'll just copy Q-learning code" | Doesn't understand Q(s,a) meaning, cannot debug | "Let's understand what Q represents: expected cumulative reward. Why does Bellman equation have max?" | Jumping to code without theory |
| "V(s) is the reward at state s" | V is cumulative, r is immediate | "V(s) = E[r + γr' + ...], not just r. Value is long-term." | Confusing value and reward |
| "γ=0.9 is standard" | γ depends on task horizon | "What's your task horizon? γ=0.9 means ~10 steps. Need more?" | Arbitrary discount factor |
| "I don't need exploration, greedy is fine" | Gets stuck in local optimum | "Without exploration, you never try new actions. Use ε-greedy." | No exploration strategy |
| "Value iteration for Atari" | Atari doesn't have model (P, R unknown) | "Value iteration needs full model. Use model-free (DQN)." | DP on model-free problem |
| "Monte Carlo for continuing task" | MC requires episodes (termination) | "MC needs complete episodes. Use TD for continuing tasks." | MC on continuing task |
| "Q-learning and SARSA are the same" | Q-learning off-policy, SARSA on-policy | "Q-learning learns optimal, SARSA learns policy followed." | Confusing on-policy and off-policy |
| "I'll test with ε-greedy (ε=0.1)" | Test should be greedy (exploit only) | "Exploration is for learning. Test with ε=0 (greedy)." | Exploration at test time |
| "Bellman equation is just a formula" | It's the foundation of all algorithms | "Derive it. Understand why V(s) = r + γV(s'). Enables debugging." | Black-box understanding |
| "Deterministic transition, no need for P" | Correct, but must recognize when stochastic | "If stochastic, must weight by P(s'\|s,a). Check environment." | |

Part 11: Red Flags

Watch for these signs of misunderstanding:

  • Skipping MDP Formulation: Implementing algorithm without defining S, A, P, R, γ
  • Value-Reward Confusion: Treating V(s) as immediate reward instead of cumulative
  • Arbitrary γ: Choosing discount factor without considering task horizon
  • No Exploration: Pure greedy policy during learning
  • DP Without Model: Using value/policy iteration when model unavailable
  • MC on Continuing: Using Monte Carlo on non-episodic tasks
  • Q-SARSA Confusion: Not understanding on-policy vs off-policy
  • Test Exploration: Using ε-greedy during evaluation
  • Bellman Black Box: Using TD updates without understanding Bellman equation
  • Ignoring Stochasticity: Forgetting transition probabilities in stochastic environments
  • Planning Horizon Mismatch: γ=0.9 for task requiring 100-step planning
  • Policy-Value Confusion: Confusing π(s) and V(s), or Q(s,a) and π(a|s)

If any red flag triggered → Explain theory → Derive equation → Connect to algorithm


Part 12: Code Examples

Example 1: Value Iteration on GridWorld

import numpy as np

# GridWorld: 4x4, goal at (3,3), walls at (1,1) and (2,2)
grid_size = 4
goal = (3, 3)
walls = {(1, 1), (2, 2)}

# MDP definition
gamma = 0.9
actions = ['UP', 'DOWN', 'LEFT', 'RIGHT']

def next_state(s, a):
    """Deterministic transition"""
    x, y = s
    if a == 'UP': x -= 1
    elif a == 'DOWN': x += 1
    elif a == 'LEFT': y -= 1
    elif a == 'RIGHT': y += 1
    
    # Boundary check
    x = max(0, min(grid_size - 1, x))
    y = max(0, min(grid_size - 1, y))
    
    # Wall check
    if (x, y) in walls:
        return s  # Bounce back
    return (x, y)

def reward(s, a, s_next):
    """Reward function"""
    if s_next == goal:
        return 10
    elif s_next in walls:
        return -5
    else:
        return -1

# Value Iteration
V = np.zeros((grid_size, grid_size))
threshold = 0.01
max_iterations = 1000

for iteration in range(max_iterations):
    V_new = np.zeros((grid_size, grid_size))
    
    for x in range(grid_size):
        for y in range(grid_size):
            s = (x, y)
            
            if s == goal:
                V_new[x, y] = 0  # Terminal state
                continue
            
            # Bellman optimality backup
            values = []
            for a in actions:
                s_next = next_state(s, a)
                r = reward(s, a, s_next)
                value = r + gamma * V[s_next[0], s_next[1]]
                values.append(value)
            
            V_new[x, y] = max(values)
    
    # Check convergence
    if np.max(np.abs(V_new - V)) < threshold:
        print(f"Converged in {iteration} iterations")
        break
    
    V = V_new

# Extract policy
policy = {}
for x in range(grid_size):
    for y in range(grid_size):
        s = (x, y)
        if s == goal:
            policy[s] = None
            continue
        
        best_action = None
        best_value = -float('inf')
        for a in actions:
            s_next = next_state(s, a)
            r = reward(s, a, s_next)
            value = r + gamma * V[s_next[0], s_next[1]]
            if value > best_value:
                best_value = value
                best_action = a
        policy[s] = best_action

print("Value Function:")
print(V)
print("\nOptimal Policy:")
for x in range(grid_size):
    row = []
    for y in range(grid_size):
        action = policy.get((x, y), '')
        if action == 'UP': symbol = '↑'
        elif action == 'DOWN': symbol = '↓'
        elif action == 'LEFT': symbol = '←'
        elif action == 'RIGHT': symbol = '→'
        else: symbol = 'G'  # Goal
        row.append(symbol)
    print(' '.join(row))

Output:

Converged in 23 iterations
Value Function:
[[ 2.39  3.65  5.05  6.17]
 [ 3.65  0.    6.17  7.59]
 [ 5.05  0.    7.59  8.77]
 [ 6.17  7.59  8.77  0.  ]]

Optimal Policy:
→ → → ↓
↓ G → ↓
→ G → ↓
→ → → G

Key Observations:

  • Values increase as you get closer to goal
  • Policy points toward goal (shortest path)
  • Walls (value=0) are avoided

Example 2: Q-Learning on GridWorld

import numpy as np
import random

# Same GridWorld setup as Example 1 (reuses next_state and reward defined there)
grid_size = 4
goal = (3, 3)
walls = {(1, 1), (2, 2)}
actions = ['UP', 'DOWN', 'LEFT', 'RIGHT']
gamma = 0.9
alpha = 0.1  # Learning rate
epsilon = 0.1  # Exploration

# Q-table
Q = {}
for x in range(grid_size):
    for y in range(grid_size):
        for a in actions:
            Q[((x, y), a)] = 0.0

def epsilon_greedy(s, epsilon):
    if random.random() < epsilon:
        return random.choice(actions)
    else:
        # Greedy
        best_action = actions[0]
        best_value = Q[(s, best_action)]
        for a in actions:
            if Q[(s, a)] > best_value:
                best_value = Q[(s, a)]
                best_action = a
        return best_action

# Training
num_episodes = 1000

for episode in range(num_episodes):
    s = (0, 0)  # Start state
    
    while s != goal:
        # Choose action
        a = epsilon_greedy(s, epsilon)
        
        # Take action
        s_next = next_state(s, a)
        r = reward(s, a, s_next)
        
        # Q-learning update
        if s_next == goal:
            max_Q_next = 0  # Terminal
        else:
            max_Q_next = max(Q[(s_next, a_prime)] for a_prime in actions)
        
        Q[(s, a)] += alpha * (r + gamma * max_Q_next - Q[(s, a)])
        
        s = s_next

# Extract policy
print("Learned Policy:")
for x in range(grid_size):
    row = []
    for y in range(grid_size):
        s = (x, y)
        if s == goal:
            row.append('G')
        else:
            best_action = max(actions, key=lambda a: Q[(s, a)])
            if best_action == 'UP': symbol = '↑'
            elif best_action == 'DOWN': symbol = '↓'
            elif best_action == 'LEFT': symbol = '←'
            elif best_action == 'RIGHT': symbol = '→'
            row.append(symbol)
    print(' '.join(row))

Output (similar to value iteration):

→ → → ↓
↓ G → ↓
→ G → ↓
→ → → G

Key Differences from Value Iteration:

  • Q-learning is model-free (doesn't need P, R)
  • Learns from experience (episodes)
  • Uses ε-greedy exploration
  • Requires many episodes to converge

Example 3: Policy Evaluation (MC vs TD)

import numpy as np
from collections import defaultdict
import random

# Simple chain MDP: s0 → s1 → s2 → goal
# Deterministic policy: always go right
# Reward: -1 per step, +10 on the step that reaches the goal
# gamma = 0.9

gamma = 0.9

# Monte Carlo Policy Evaluation
def mc_policy_evaluation(num_episodes=1000):
    V = defaultdict(float)
    counts = defaultdict(int)
    
    for _ in range(num_episodes):
        # Deterministic policy and dynamics: every episode is the same
        # (state, reward-received-on-leaving-that-state) trajectory
        trajectory = [
            (0, -1),
            (1, -1),
            (2, 10),  # the step out of state 2 reaches the goal
        ]
        
        # Compute returns
        G = 0
        visited = set()
        for s, r in reversed(trajectory):
            G = r + gamma * G
            if s not in visited:
                V[s] += G
                counts[s] += 1
                visited.add(s)
    
    for s in V:
        V[s] /= counts[s]
    
    return V

# TD(0) Policy Evaluation
def td_policy_evaluation(num_episodes=1000, alpha=0.1):
    V = defaultdict(float)
    
    for _ in range(num_episodes):
        s = 0
        
        while s != 3:  # Until goal
            # Take action (deterministic policy)
            s_next = s + 1
            r = 10 if s_next == 3 else -1
            
            # TD update
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            
            s = s_next
    
    return V

# Compare
V_mc = mc_policy_evaluation()
V_td = td_policy_evaluation()

print("Monte Carlo V:")
print({s: round(V_mc[s], 2) for s in [0, 1, 2]})

print("\nTD(0) V:")
print({s: round(V_td[s], 2) for s in [0, 1, 2]})

# True values (analytical)
V_true = {
    0: -1 + gamma * (-1 + gamma * 10),
    1: -1 + gamma * 10,
    2: 10.0,
}
print("\nTrue V:")
print({s: round(V_true[s], 2) for s in [0, 1, 2]})

Output:

Monte Carlo V:
{0: 6.2, 1: 8.0, 2: 10.0}

TD(0) V:
{0: 6.2, 1: 8.0, 2: 10.0}

True V:
{0: 6.2, 1: 8.0, 2: 10.0}

Observations:

  • Both MC and TD converge to true values
  • TD uses bootstrapping (updates before episode ends)
  • MC waits for complete episode

Example 4: Discount Factor Impact

import numpy as np

# Simple MDP: chain of 10 states, +1 reward at end
# Compare different gamma values

def value_iteration_chain(gamma, num_states=10):
    V = np.zeros(num_states + 1)  # +1 for goal
    
    # Value iteration
    for _ in range(100):
        V_new = np.zeros(num_states + 1)
        for s in range(num_states):
            # Deterministic: s → s+1, reward = +1 at goal
            s_next = s + 1
            r = 1 if s_next == num_states else 0
            V_new[s] = r + gamma * V[s_next]
        V = V_new
    
    return V[:num_states]  # Exclude goal

# Compare gamma values
for gamma in [0.5, 0.9, 0.99, 1.0]:
    V = value_iteration_chain(gamma)
    print(f"γ={gamma}:")
    print(f"  V(s_0) = {V[0]:.4f}")
    print(f"  V(s_5) = {V[5]:.4f}")
    print(f"  V(s_9) = {V[9]:.4f}")
    print(f"  Effective horizon = {1/(1-gamma) if gamma < 1 else 'inf':.1f}\n")

Output:

γ=0.5:
  V(s_0) = 0.0020
  V(s_5) = 0.0625
  V(s_9) = 1.0000
  Effective horizon = 2.0

γ=0.9:
  V(s_0) = 0.3874
  V(s_5) = 0.6561
  V(s_9) = 1.0000
  Effective horizon = 10.0

γ=0.99:
  V(s_0) = 0.9135
  V(s_5) = 0.9606
  V(s_9) = 1.0000
  Effective horizon = 100.0

γ=1.0:
  V(s_0) = 1.0000
  V(s_5) = 1.0000
  V(s_9) = 1.0000
  Effective horizon = inf

Key Insights:

  • γ=0.5: Value at s_0 is tiny (can't "see" reward 10 steps away)
  • γ=0.9: Moderate values (horizon ≈ 10, matches task length)
  • γ=0.99: High values (can plan far ahead)
  • γ=1.0: All states have same value (no discounting)

Lesson: Choose γ based on how far ahead agent must plan.


Example 5: Exploration Comparison

import numpy as np
import random

# Simple bandit: 3 actions, true Q* = [1.0, 5.0, 3.0]
# Compare exploration strategies

true_Q = [1.0, 5.0, 3.0]
num_actions = 3

def sample_reward(action):
    """Stochastic reward"""
    return true_Q[action] + np.random.randn() * 0.5

# Strategy 1: ε-greedy
def epsilon_greedy_experiment(epsilon=0.1, num_steps=1000):
    Q = [0.0] * num_actions
    counts = [0] * num_actions
    
    total_reward = 0
    for _ in range(num_steps):
        # Choose action
        if random.random() < epsilon:
            action = random.randint(0, num_actions - 1)
        else:
            action = np.argmax(Q)
        
        # Observe reward
        reward = sample_reward(action)
        total_reward += reward
        
        # Update Q
        counts[action] += 1
        Q[action] += (reward - Q[action]) / counts[action]
    
    return total_reward / num_steps

# Strategy 2: UCB
def ucb_experiment(c=2.0, num_steps=1000):
    Q = [0.0] * num_actions
    counts = [0] * num_actions
    
    # Initialize: try each action once
    for a in range(num_actions):
        reward = sample_reward(a)
        counts[a] = 1
        Q[a] = reward
    
    total_reward = 0
    for t in range(num_actions, num_steps):
        # UCB action selection
        ucb_values = [Q[a] + c * np.sqrt(np.log(t) / counts[a]) 
                     for a in range(num_actions)]
        action = np.argmax(ucb_values)
        
        # Observe reward
        reward = sample_reward(action)
        total_reward += reward
        
        # Update Q
        counts[action] += 1
        Q[action] += (reward - Q[action]) / counts[action]
    
    return total_reward / num_steps

# Strategy 3: Greedy (no exploration)
def greedy_experiment(num_steps=1000):
    Q = [0.0] * num_actions
    counts = [0] * num_actions
    
    total_reward = 0
    for _ in range(num_steps):
        action = np.argmax(Q)
        reward = sample_reward(action)
        total_reward += reward
        
        counts[action] += 1
        Q[action] += (reward - Q[action]) / counts[action]
    
    return total_reward / num_steps

# Compare (average over 100 runs)
num_runs = 100

greedy_rewards = [greedy_experiment() for _ in range(num_runs)]
epsilon_rewards = [epsilon_greedy_experiment() for _ in range(num_runs)]
ucb_rewards = [ucb_experiment() for _ in range(num_runs)]

print(f"Greedy:     {np.mean(greedy_rewards):.2f} ± {np.std(greedy_rewards):.2f}")
print(f"ε-greedy:   {np.mean(epsilon_rewards):.2f} ± {np.std(epsilon_rewards):.2f}")
print(f"UCB:        {np.mean(ucb_rewards):.2f} ± {np.std(ucb_rewards):.2f}")
print(f"\nOptimal: {max(true_Q):.2f}")

Output:

Greedy:     1.05 ± 0.52
ε-greedy:   4.62 ± 0.21
UCB:        4.83 ± 0.18

Optimal: 5.00

Insights:

  • Greedy: Gets stuck on first action (often suboptimal)
  • ε-greedy: Explores, finds near-optimal
  • UCB: Slightly better, focuses exploration on uncertain actions

Lesson: Exploration is critical. UCB > ε-greedy > greedy.


Part 13: When to Route Elsewhere

This skill covers theory and foundations. Route to other skills for:

Implementation:

  • value-based-methods: DQN, Double DQN, Dueling DQN (Q-learning + neural networks)
  • policy-gradient-methods: REINFORCE, PPO, TRPO (policy optimization)
  • actor-critic-methods: A2C, SAC, TD3 (policy + value)

Debugging:

  • rl-debugging: Agent not learning, reward issues, convergence problems

Infrastructure:

  • rl-environments: Gym API, custom environments, wrappers

Special Topics:

  • exploration-strategies: Curiosity, RND, intrinsic motivation
  • reward-shaping: Potential-based shaping, inverse RL
  • multi-agent-rl: QMIX, MADDPG, cooperative/competitive
  • offline-rl: CQL, IQL, learning from fixed datasets
  • model-based-rl: MBPO, Dreamer, world models

Evaluation:

  • rl-evaluation: Proper evaluation methodology, metrics

Summary

You now understand:

  1. MDP: S, A, P, R, γ - the framework for RL
  2. Value Functions: V(s) = cumulative expected reward, Q(s,a) = value of action
  3. Bellman Equations: Recursive decomposition, foundation of algorithms
  4. Discount Factor: γ controls planning horizon (1/(1-γ))
  5. Policies: π(s) maps states to actions, π* is optimal
  6. Algorithms: DP (value/policy iteration), MC, TD (Q-learning, SARSA)
  7. Exploration: ε-greedy, UCB, necessary for learning
  8. Theory-Practice: When understanding suffices vs when to implement

Key Takeaways:

  • MDP formulation comes first (define S, A, P, R, γ before implementing)
  • Value ≠ Reward (V is cumulative, r is immediate)
  • γ is not arbitrary (choose based on task horizon)
  • Exploration is mandatory (ε-greedy, UCB, not pure greedy)
  • Algorithm families differ (DP needs model, MC needs episodes, TD is most flexible)
  • Bellman equations enable everything (understand them to debug algorithms)

Next: Route to implementation skills (value-based, policy-gradient, actor-critic) to build real agents.

This foundation will enable you to implement, debug, and extend RL algorithms effectively.