SKILL.md

name: reward-shaping-engineering
description: Master reward function design - potential-based shaping, hacking patterns, validation

Reward-Shaping Engineering

When to Use This Skill

Invoke this skill when you encounter:

  • Reward Design: "How do I design the reward function for my task?"
  • Slow Learning: "Training is extremely slow with sparse rewards" or "Dense rewards cause weird behavior"
  • Reward Hacking: "My agent learned a trick that works in training but fails on test", "Agent oscillating instead of balancing"
  • Potential-Based Shaping: "How to add shaping without breaking the optimal policy?"
  • Distance-Based Rewards: "How to reward progress toward goal without hacking?"
  • Inverse RL: "I have expert demonstrations, can I learn reward from them?"
  • Auxiliary Rewards: "Should I add helper rewards like action smoothness?"
  • Reward Scale Issues: "Training feels different when rewards change magnitude"
  • Sparse vs Dense: "When should I use sparse vs dense rewards?"
  • Reward Validation: "How do I verify my reward function is correct?"
  • Credit Assignment: "How to help agent understand which actions led to success?"
  • Normalization: "Should I clip or normalize rewards?"

This skill provides systematic frameworks and concrete patterns for reward engineering.

Do NOT use for:

  • Algorithm selection (route to rl-foundations or specific algorithm skill)
  • General RL debugging (route to rl-debugging-methodology)
  • Exploration strategies (route to exploration-strategies)
  • Environment design (route to environment-design-patterns)

Core Principle

Reward design is often the hardest part of RL. The reward function defines the entire objective the agent optimizes. A poorly designed reward typically produces an agent that:

  1. Learns something unintended (reward hacking)
  2. Learns slowly because the signal is sparse or noisy (credit assignment crisis)
  3. Learns correctly but unstably because of scale/normalization issues

The key insight: you're solving an inverse problem. You want an agent that achieves behavior X, so you must specify a function R(s,a,s') whose optimal policy produces behavior X. This is much harder than it sounds because:

  • Agents optimize expected return, not your intentions (they find loopholes)
  • Credit assignment requires a clear reward signal (sparse rewards fail)
  • Scale/normalization matters (reward magnitude affects gradient magnitude)
  • Shaping can help or hurt (it must preserve the optimal policy)

Part 1: Reward Design Principles

Principle 1: Reward Must Align With Task

The Problem: You want agent to do X, but reward incentivizes Y.

Example (CartPole):

  • Task: Balance the pole near the center for as long as possible
  • Bad reward: +1 per step (the standard CartPole reward) → Agent learns to oscillate side-to-side (unintended, but it still collects +1 every step)
  • Better reward: +1 per step plus a penalty for deviation from center

Example (Robotics):

  • Task: Grasp object efficiently
  • Bad reward: Just +1 when grasped → Agent grasps sloppily, jerky movements
  • Better reward: +1 for grasp + small penalty per action (reward efficiency)

Pattern: Specify WHAT success looks like, not HOW to achieve it. Let agent find the HOW.

# Anti-pattern: Specify HOW (no success term, only a movement penalty)
bad_reward = -0.1 * np.sum(np.abs(action))  # Optimal policy: don't move at all

# Pattern: Specify WHAT
good_reward = (1.0 if grasp_success else 0.0) + (-0.01 * np.sum(action**2))
# Says: Success is good, movements have small cost
# Agent figures out efficient movements to minimize action cost

Principle 2: Reward Should Enable Credit Assignment

The Problem: Sparse rewards mean agent can't learn which actions led to success.

Example (Goal Navigation):

  • Sparse: Only +1 on reaching the goal (which random exploration might manage once in 1000 episodes)
  • Agent can't tell: Did action 10 steps ago help or action 5 steps ago?
  • Solution: Add shaping reward based on progress

Credit Assignment Window:

Short window (< 10 steps):    Need dense rewards every 1-2 steps
Medium window (10-100 steps): Reward every 5-10 steps OK
Long window (> 100 steps):    Sparse rewards very hard, need shaping

When to Add Shaping:

  • Episode length > 50 steps AND sparse rewards
  • Agent can't achieve >10% success after exploring
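
These two conditions can be captured in a small programmatic check; the thresholds below are the illustrative ones from the bullets above, not universal constants:

def needs_shaping(avg_episode_length, success_rate):
    """Rule of thumb: long episodes plus near-zero success → add shaping."""
    return avg_episode_length > 50 and success_rate < 0.10

# Example: 200-step navigation episodes with 2% success → add shaping
print(needs_shaping(avg_episode_length=200, success_rate=0.02))  # True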

Principle 3: Reward Should Prevent Hacking

The Problem: Agent finds unintended loopholes.

Classic Hacking Patterns:

  1. Shortcut Exploitation: Taking unintended path to goal

    • Example: Quadruped learns to flip instead of walk
    • Solution: Specify movement requirements in reward
  2. Side-Effect Exploitation: Achieving side-effect that gives reward

    • Example: Robotic arm oscillating (gets +1 per step for oscillation)
    • Solution: Add penalty for suspicious behavior
  3. Scale Exploitation: Abusing unbounded reward dimension

    • Example: Agent learns to get reward signal to spike → oscillates
    • Solution: Use clipped/normalized rewards

Prevention Framework:

def design_robust_reward(s, a, s_next):
    # Core task reward
    task_reward = compute_task_reward(s_next)

    # Anti-hacking penalties
    action_penalty = -0.01 * np.sum(a**2)    # Penalize unnecessary action
    posture_penalty = check_posture(s_next)  # Penalize weird postures (e.g., flipping); returns 0 or a negative value

    return task_reward + action_penalty + posture_penalty

Principle 4: Reward Scale and Normalization Matter

The Problem: Reward magnitude affects gradient flow.

Example:

Task A rewards:  0 to 1000
Task B rewards:  0 to 1
Same optimizer with fixed learning rate:
  Task A: Step sizes huge, diverges
  Task B: Step sizes tiny, barely learns

Solution: Normalize both to [-1, 1]

Standard Normalization Pipeline:

def normalize_reward(r):
    # 1. Clip the raw reward to a reasonable range (prevents scale explosions)
    r_clipped = np.clip(r, -1000, 1000)

    # 2. Standardize using running statistics
    #    (running_mean / running_std are online estimators; see RewardNormalizer in Part 7)
    reward_mean = running_mean(r_clipped)
    reward_std = running_std(r_clipped)
    r_normalized = (r_clipped - reward_mean) / (reward_std + 1e-8)

    # 3. Clip again to [-1, 1] for stability
    return np.clip(r_normalized, -1.0, 1.0)

Part 2: Potential-Based Shaping (The Theorem)

The Fundamental Problem

You want to:

  • Help agent learn faster (add shaping rewards)
  • Preserve the optimal policy (so shaping doesn't change what's best)

The Solution: Potential-Based Shaping

The theorem states: If you add shaping reward of form

F(s, a, s') = γ * Φ(s') - Φ(s)

where Φ(s) is ANY function of state, then:

  1. Optimal policy remains unchanged
  2. Optimal value function shifts by -Φ: V'(s) = V(s) - Φ(s)
  3. Learning often accelerates because the shaped signal is denser

Why This Matters: You can safely add rewards like distance-to-goal without worrying you're changing what the agent should do.

Mathematical Foundation

Original MDP has Q-function: Q^π(s,a) = E[R(s,a,s') + γV^π(s')]

With potential-based shaping, the shaped reward is R'(s,a,s') = R(s,a,s') + γΦ(s') - Φ(s), and the Φ terms telescope along any trajectory:

Q'^π(s,a) = E[ Σ_t γ^t (R_t + γΦ(s_{t+1}) - Φ(s_t)) ]
          = E[ Σ_t γ^t R_t ] - Φ(s)
          = Q^π(s,a) - Φ(s)

The key insight: at a fixed state s, every action's Q-value is shifted by the same constant -Φ(s). The relative ordering of actions (which one is best) is therefore unchanged.

Proof Sketch:

  • The policy compares Q'(s,a₁) vs Q'(s,a₂) to pick an action
  • Both equal the original Q-values minus the same offset Φ(s)
  • Relative ordering preserved → same optimal action (see the numeric check of the telescoping sum below)
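
A quick numeric sanity check of the telescoping argument (a standalone sketch; the potential values below are arbitrary and chosen only for illustration):

def shaped_return(potentials, gamma=0.99):
    """Sum of gamma^t * (gamma * Phi(s_{t+1}) - Phi(s_t)) along one trajectory.

    By the telescoping argument above this equals gamma^T * Phi(s_T) - Phi(s_0),
    i.e. exactly -Phi(s_0) when the terminal potential is zero.
    """
    total = 0.0
    for t in range(len(potentials) - 1):
        total += gamma**t * (gamma * potentials[t + 1] - potentials[t])
    return total

# Arbitrary potentials along a trajectory; terminal potential = 0
phis = [-3.0, -2.5, -1.0, -0.2, 0.0]
print(shaped_return(phis))  # ≈ 3.0 = -Phi(s_0): total shaping is path-independent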

Practical Implementation

def potential_based_shaping(s, a, s_next, gamma=0.99):
    """
    Compute shaping reward that preserves optimal policy.

    Args:
        s: current state
        a: action taken
        s_next: next state (result of action)
        gamma: discount factor

    Returns:
        Shaping reward to ADD to environment reward
    """
    # Define potential function (e.g., negative distance to goal)
    phi = compute_potential(s)
    phi_next = compute_potential(s_next)

    # Potential-based shaping formula
    shaping_reward = gamma * phi_next - phi

    return shaping_reward

def compute_potential(s):
    """
    Potential function: usually negative distance to goal.

    - States farther from the goal have lower potential
    - Moving closer increases potential (positive shaping reward)
    - Reaching the goal gives the highest potential
    """
    if goal_reached(s):  # task-specific success check
        return 0.0  # Peak potential
    else:
        distance = np.linalg.norm(s['position'] - s['goal'])
        return -distance  # Negative distance

Critical Error: NOT Using Potential-Based Shaping

Common Mistake:

# WRONG: This changes the optimal policy!
shaping_reward = -0.1 * distance_to_goal

# WHY WRONG: This is not potential-based. It adds -0.1 * d to EVERY step's
# reward, so the total shaping over an episode depends on how long the agent
# spends at each distance, not just on where it started and ended. The agent
# can now profit from, e.g., terminating episodes early to stop the penalty,
# so the optimal policy can change. The potential-based form below instead
# telescopes to γ^T Φ(s_T) - Φ(s_0), which is path-independent.

Right Way:

# CORRECT: Potential-based shaping
def shaping(s, a, s_next, gamma=0.99):
    phi_s = -distance(s, goal)        # Potential = negative distance
    phi_s_next = -distance(s_next, goal)

    return gamma * phi_s_next - phi_s

# Moving from d=1 to d=0.5:
#   shaping = 0.99 * (-0.5) - (-1.0) = +0.505
# Moving from d=3 to d=2.5:
#   shaping = 0.99 * (-2.5) - (-3.0) = +0.525
# Summed over a whole trajectory these terms telescope to γ^T Φ(s_T) - Φ(s_0),
# so the shaping cannot change which policy is optimal.

Using Potential-Based Shaping

def compute_total_reward(s, a, s_next, env_reward, gamma=0.99):
    """
    Combine environment reward with potential-based shaping.

    Pattern: R_total = R_env + R_shaping
    """
    # 1. Get reward from environment
    task_reward = env_reward

    # 2. Compute potential-based shaping (safe to add)
    potential = -distance_to_goal(s_next)
    potential_prev = -distance_to_goal(s)
    shaping_reward = gamma * potential - potential_prev

    # 3. Combine
    total_reward = task_reward + shaping_reward

    return total_reward

Part 3: Sparse vs Dense Rewards

The Fundamental Tradeoff

| Aspect | Sparse Rewards | Dense Rewards |
|---|---|---|
| Credit assignment | Hard (credit window huge) | Easy (immediate feedback) |
| Learning speed | Slow (few positive examples) | Fast (constant signal) |
| Reward hacking | Less likely (fewer targets) | More likely (many targets to exploit) |
| Convergence | Can converge to suboptimal | May not converge if hacking |
| Real-world fit | Matches reality (goals are sparse) | Artificial but helps learning |

Decision Framework

Use SPARSE when:

  • Task naturally has sparse rewards (goal-reaching, game win/loss)
  • Episode short (< 20 steps)
  • You want solution robust to reward hacking
  • Final performance matters more than learning speed

Use DENSE when:

  • Episode long (> 50 steps) and no natural sub-goals
  • Learning speed critical (limited training budget)
  • You can design safe auxiliary rewards
  • You'll validate extensively against hacking

Use HYBRID when:

  • Combine sparse task reward with dense shaping
  • Example: +1 for reaching goal (sparse) + negative distance shaping (dense)
  • This is the most practical approach for long-horizon tasks

Design Pattern: Sparse Task + Dense Shaping

def reward_function(s, a, s_next, done):
    """
    Standard pattern: sparse task reward + potential-based shaping.

    This gets the best of both worlds:
    - Sparse task reward prevents hacking on main objective
    - Dense shaping prevents credit assignment crisis
    """
    # 1. Sparse task reward (what we truly care about)
    if goal_reached(s_next):
        task_reward = 1.0
    else:
        task_reward = 0.0

    # 2. Dense potential-based shaping (helps learning)
    gamma = 0.99
    phi_s = -np.linalg.norm(s['position'] - s['goal'])
    phi_s_next = -np.linalg.norm(s_next['position'] - s_next['goal'])
    shaping_reward = gamma * phi_s_next - phi_s

    # 3. Combine: Sparse main objective + dense guidance
    total = task_reward + 0.1 * shaping_reward
    # Scale shaping (0.1) relative to task (1.0) so task dominates

    return total

Validation: Confirming Sparse/Dense Choice

def validate_reward_choice(sparse_reward_fn, dense_reward_fn, env, n_trials=10):
    """
    Compare sparse vs dense by checking:
    1. Learning speed (how fast does agent improve?)
    2. Final performance (does dense cause hacking?)
    3. Stability (does one diverge?)
    """
    results = {
        'sparse': train_agent(sparse_reward_fn, env, n_trials),
        'dense': train_agent(dense_reward_fn, env, n_trials)
    }

    # Check learning curves
    print("Sparse learning speed:", results['sparse']['steps_to_50pct'])
    print("Dense learning speed:", results['dense']['steps_to_50pct'])

    # Check if dense causes hacking
    print("Sparse final score:", results['sparse']['final_score'])
    print("Dense final score:", results['dense']['final_score'])

    # If dense learned faster AND achieved the same or higher score: prefer dense (with hacking validation)
    # If sparse achieved a higher final score: reward hacking is likely in the dense variant
    return results

Part 4: Reward Hacking - Patterns and Detection

Common Hacking Patterns

Pattern 1: Shortcut Exploitation

Agent finds unintended path to success.

Example (Quadruped):

  • Task: Walk forward 10 meters
  • Intended: Gait pattern that moves forward
  • Hack: Agent learns to flip upside down (center of mass moves forward during flip!)

Detection:

# Test on distribution shift (train_performance = score on the training terrain;
# the 0.5x cutoff is an illustrative "much worse" threshold)
if test_on_different_terrain(agent) < 0.5 * train_performance:
    print("ALERT: Shortcut exploitation detected")
    print("Agent doesn't generalize → learned a specific trick")

Prevention:

def robust_reward(s, a, s_next):
    # Forward progress
    progress = s_next['x'] - s['x']

    # Requirement: Stay upright (prevents flipping hack)
    upright_penalty = -1.0 if not is_upright(s_next) else 0.0

    # Requirement: Reasonable movement (prevents wiggling)
    movement_penalty = -0.1 * np.sum(a**2)

    return progress + upright_penalty + movement_penalty

Pattern 2: Reward Signal Exploitation

Agent exploits direct reward signal rather than task.

Example (Oscillation):

  • Task: Balance pole in center
  • Intended: Keep pole balanced
  • Hack: Agent oscillates rapidly (each oscillation = +1 reward per step)

Detection:

def detect_oscillation(trajectory):
    positions = [s['pole_angle'] for s in trajectory]
    # Count zero crossings
    crossings = sum(1 for i in range(len(positions)-1)
                    if positions[i] * positions[i+1] < 0)

    if crossings > len(trajectory) / 3:
        print("ALERT: Oscillation detected")

Prevention:

def non_hackable_reward(s, a, s_next):
    # Task: Balanced pole
    balance_penalty = -(s_next['pole_angle']**2)  # Reward being centered

    # Prevent oscillation
    angle_velocity = s_next['pole_angle'] - s['pole_angle']
    oscillation_penalty = -0.1 * abs(angle_velocity)

    return balance_penalty + oscillation_penalty

Pattern 3: Unbounded Reward Exploitation

Agent maximizes dimension without bound.

Example (Camera Hack):

  • Task: Detect object (reward for correct detection)
  • Hack: Agent learns to point camera lens at bright light source (always triggers detection)

Detection:

def detect_unbounded_exploitation(training_history):
    rewards = np.asarray(training_history['episode_returns'])

    # Check whether recent returns dwarf early-training returns (2x is an illustrative threshold)
    if rewards[-100:].mean() > 2 * rewards[100:200].mean():
        print("ALERT: Rewards diverging")
        print("Possible unbounded exploitation")

Prevention:

# Use reward clipping
def clipped_reward(r):
    return np.clip(r, -1.0, 1.0)

# Or normalize
def normalized_reward(r, running_mean, running_std):
    r_norm = (r - running_mean) / (running_std + 1e-8)
    return np.clip(r_norm, -1.0, 1.0)

Systematic Hacking Detection Framework

def check_for_hacking(agent, train_env, test_envs, holdout_env):
    """
    Comprehensive hacking detection.
    """
    # 1. Distribution shift test
    train_perf = evaluate(agent, train_env)
    test_perf = evaluate(agent, test_envs)  # Variations of train

    if test_perf < 0.7 * train_perf:  # same 0.7 generalization threshold as the Part 8 validator
        print("HACKING: Agent doesn't generalize to distribution shift")
        return "shortcut_exploitation"

    # 2. Behavioral inspection
    trajectory = run_episode(agent, holdout_env)
    if has_suspicious_pattern(trajectory):
        print("HACKING: Suspicious behavior detected")
        return "pattern_exploitation"

    # 3. Reward curve analysis
    if rewards_diverging(agent.training_history):
        print("HACKING: Unbounded reward exploitation")
        return "reward_signal_exploitation"

    return "no_obvious_hacking"

Part 5: Auxiliary Rewards and Shaping Examples

Example 1: Distance-Based Shaping

Most common shaping pattern. Safe when done with potential-based formula.

def distance_shaping(s, a, s_next, gamma=0.99):
    """
    Reward agent for getting closer to goal.

    CRITICAL: Use potential-based formula to preserve optimal policy.
    """
    goal_position = s['goal']
    curr_pos = s['position']
    next_pos = s_next['position']

    # Potential function: negative distance
    phi = -np.linalg.norm(curr_pos - goal_position)
    phi_next = -np.linalg.norm(next_pos - goal_position)

    # Potential-based shaping (preserves optimal policy)
    shaping_reward = gamma * phi_next - phi

    return shaping_reward

Example 2: Auxiliary Smoothness Reward

Encourage smooth actions. Note that a smoothness penalty is not potential-based, so it does slightly alter the objective; keep its weight small so the task reward dominates.

def smoothness_shaping(a, a_prev):
    """
    Penalize jittery/jerky actions.
    Helps with efficiency and generalization.
    """
    # Difference between consecutive actions
    action_jerk = np.linalg.norm(a - a_prev)

    # Penalty (small, doesn't dominate task reward)
    smoothness_penalty = -0.01 * action_jerk

    return smoothness_penalty

Example 3: Energy/Control Efficiency

Encourage efficient control.

def efficiency_reward(a):
    """
    Penalize excessive control effort.
    Makes solutions more robust.
    """
    # L2 norm of action (total control magnitude)
    effort = np.sum(a**2)

    # Small penalty
    return -0.001 * effort

Example 4: Staying Safe Reward

Prevent dangerous states (without hard constraints).

def safety_reward(s):
    """
    Soft penalty for dangerous states.
    Better than hard constraints (more learnable).
    """
    danger_score = 0.0

    # Example: Prevent collision
    min_clearance = np.min(s['collision_distances'])
    if min_clearance < 0.1:
        danger_score += 10.0 * (0.1 - min_clearance)

    # Example: Prevent extreme states
    if np.abs(s['position']).max() > 5.0:
        danger_score += 1.0

    return -danger_score

When to Add Auxiliary Rewards

Add auxiliary reward if:

  • It's potential-based (safe)
  • Task reward already roughly works (agent > 10% success)
  • Auxiliary targets clear sub-goals
  • You validate with/without

Don't add if:

  • Task reward doesn't work at all (fix that first)
  • Creates new exploitation opportunities
  • Makes reward engineering too complex
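
To make the "validate with/without" item concrete, here is a minimal ablation sketch. It assumes train_agent and evaluate helpers like those used elsewhere in this skill, and the 0.01 auxiliary weight is illustrative:

import numpy as np

def compare_with_without_auxiliary(base_reward_fn, aux_fn, env,
                                   aux_weight=0.01, n_seeds=3):
    """Train with and without one auxiliary term; compare mean evaluation scores."""
    def reward_with_aux(s, a, s_next, done):
        return base_reward_fn(s, a, s_next, done) + aux_weight * aux_fn(s, a, s_next)

    base_scores = [evaluate(train_agent(base_reward_fn, env, seed=i), env)
                   for i in range(n_seeds)]
    aux_scores = [evaluate(train_agent(reward_with_aux, env, seed=i), env)
                  for i in range(n_seeds)]

    # Keep the auxiliary only if it helps (or at least does not hurt)
    return np.mean(base_scores), np.mean(aux_scores)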

Part 6: Inverse RL - Learning Rewards from Demonstrations

The Problem

You have expert demonstrations but no explicit reward function. How to learn?

Options:

  1. Behavioral cloning: Copy actions directly (doesn't learn why)
  2. Reward learning (inverse RL): Infer reward structure from demonstrations
  3. Imitation learning: Match expert behavior distribution (GAIL style)

Inverse RL Concept

Idea: Expert is optimal under some reward function. Infer what reward structure makes expert optimal.

Expert demonstrations → Infer reward function → Train agent on learned reward

Key insight: If the expert is optimal under reward R, then trajectories from the expert should earn a higher return under R than trajectories from any other policy.

Practical Inverse RL (Simplified, MaxEnt-Style)

import numpy as np
import torch
import torch.nn as nn

class InverseRLLearner:
    """
    Learn a reward function from expert demonstrations.

    Assumes the expert acts near-optimally under the true (unknown) reward.
    This is a simplified ranking-style update; full MaxEnt IRL additionally
    estimates a partition function over trajectories.
    """

    def __init__(self, state_dim, action_dim):
        # Reward function (small neural network over state-action pairs)
        self.reward_net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )
        self.optimizer = torch.optim.Adam(self.reward_net.parameters())

    def compute_reward(self, s, a):
        """Learned reward; returns a tensor so gradients flow during training."""
        sa = torch.as_tensor(np.concatenate([s, a]), dtype=torch.float32)
        return self.reward_net(sa).squeeze()

    def train_step(self, expert_trajectories, agent_trajectories):
        """
        Update the reward so expert trajectories score higher than agent ones.

        Principle: maximize expert returns, minimize agent returns under the
        current reward estimate.
        """
        # Expert return under the current learned reward
        expert_returns = sum(
            sum(self.compute_reward(s, a) for s, a in traj)
            for traj in expert_trajectories
        )

        # Agent return under the current learned reward
        agent_returns = sum(
            sum(self.compute_reward(s, a) for s, a in traj)
            for traj in agent_trajectories
        )

        # Loss: push expert returns above agent returns
        loss = agent_returns - expert_returns

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return loss.item()

When to Use Inverse RL

Use when:

  • Reward is hard to specify but easy to demonstrate
  • You have expert demonstrations (human, reference controller)
  • Task complex enough that behavior != objective
  • Training budget allows for two-stage process

Don't use when:

  • Reward is easy to specify (just specify it!)
  • No expert demonstrations available
  • Demonstration quality varies
  • Need fast learning (inverse RL is slow)

Part 7: Reward Normalization and Clipping

Why Normalize?

Reward scale directly affects gradient magnitude and training stability.

# Without normalization
reward_taskA = 1000 * task_metric  # Large magnitude
loss = -policy_gradient * reward_taskA  # Huge gradients

# With normalization
reward_normalized = reward_taskA / reward_std  # Unit magnitude
loss = -policy_gradient * reward_normalized  # Reasonable gradients

Standard Normalization Pipeline

class RewardNormalizer:
    def __init__(self, epsilon=1e-8):
        self.mean = 0.0
        self.var = 1.0
        self.epsilon = epsilon

    def update_statistics(self, rewards):
        """Update running mean and variance."""
        rewards = np.array(rewards)
        # Exponential moving average (online update)
        alpha = 0.01
        self.mean = (1 - alpha) * self.mean + alpha * rewards.mean()
        self.var = (1 - alpha) * self.var + alpha * rewards.var()

    def normalize(self, reward):
        """Apply standardization then clipping."""
        # 1. Standardize (zero mean, unit variance)
        normalized = (reward - self.mean) / np.sqrt(self.var + self.epsilon)

        # 2. Clip to [-1, 1] for stability
        clipped = np.clip(normalized, -1.0, 1.0)

        return clipped

Clipping Strategy

def clip_reward(r, clip_range=(-1.0, 1.0)):
    """
    Clip reward to fixed range.

    Prevents large reward spikes from destabilizing training.
    """
    return np.clip(r, clip_range[0], clip_range[1])

# Usage
def total_reward(task_r, shaping_r):
    # Combine rewards
    combined = task_r + shaping_r

    # Clip combined
    clipped = clip_reward(combined)

    return clipped

Part 8: Validating Reward Functions

Validation Checklist

def validate_reward_function(reward_fn, env, agent_class, n_trials=5):
    """
    Systematic validation of reward design.
    """
    results = {}

    # 1. Learning speed test
    agent = train_agent(agent_class, env, reward_fn, steps=100000)
    success_rate = evaluate(agent, env, n_episodes=100)
    results['learning_speed'] = success_rate

    if success_rate < 0.3:
        print("WARNING: Agent can't learn → reward signal too sparse")
        return False

    # 2. Generalization test
    test_variants = [modify_env(env) for _ in range(5)]
    test_rates = [evaluate(agent, test_env, 20) for test_env in test_variants]

    if np.mean(test_rates) < 0.7 * success_rate:
        print("WARNING: Hacking detected → Agent doesn't generalize")
        return False

    # 3. Stability test
    agents = [train_agent(...) for _ in range(n_trials)]
    variance = np.var([evaluate(a, env, 20) for a in agents])

    if variance > 0.3:
        print("WARNING: Training unstable → Reward scale issue?")
        return False

    # 4. Behavioral inspection
    trajectory = run_episode(agent, env)
    if suspicious_behavior(trajectory):
        print("WARNING: Agent exhibiting strange behavior")
        return False

    print("PASSED: Reward function validated")
    return True

Red Flags During Validation

| Red Flag | Likely Cause | Fix |
|---|---|---|
| Success rate < 10% after 50k steps | Reward too sparse | Add shaping |
| High variance across seeds | Reward scale/noise | Normalize/clip |
| Passes train but fails test | Reward hacking | Add anti-hacking penalties |
| Rewards diverging to infinity | Unbounded reward | Use clipping |
| Agent oscillates/twitches | Per-step reward exploitation | Penalize action change |
| Learning suddenly stops | Reward scale issue | Check normalization |
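
A compact triage sketch that encodes this table; the metric names and thresholds are illustrative assumptions about what your training loop logs:

def triage_red_flags(metrics):
    """Map logged training metrics to the likely causes and fixes in the table above."""
    flags = []
    if metrics['success_rate_at_50k'] < 0.10:
        flags.append(("Reward too sparse", "Add shaping"))
    if metrics['seed_variance'] > 0.3:
        flags.append(("Reward scale/noise", "Normalize/clip"))
    if metrics['test_score'] < 0.7 * metrics['train_score']:
        flags.append(("Reward hacking", "Add anti-hacking penalties"))
    if metrics['returns_diverging']:
        flags.append(("Unbounded reward", "Use clipping"))
    if metrics['action_jitter_high']:
        flags.append(("Per-step reward exploitation", "Penalize action change"))
    return flags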

Part 9: Common Pitfalls and Rationalizations

Pitfall 1: "Let me just add distance reward"

Rationalization: "I'll add reward for getting closer to goal, it can't hurt" Problem: Without potential-based formula, changes optimal policy Reality Check: Measure policy difference with/without shaping

Pitfall 2: "Sparse rewards are always better"

Rationalization: "Sparse rewards prevent hacking" Problem: Agent can't learn in long-horizon tasks (credit assignment crisis) Reality Check: 10+ steps without reward → need shaping or fail training

Pitfall 3: "Normalize everything"

Rationalization: "I'll normalize all rewards to [-1, 1]" Problem: Over-normalization loses task structure (goal vs near-goal now equal) Reality Check: Validate that normalized reward still trains well

Pitfall 4: "Inverse RL is the answer"

Rationalization: "I don't know how to specify rewards, I'll learn from demos" Problem: Inverse RL is slow and requires good demonstrations Reality Check: If you can specify reward clearly, just do it

Pitfall 5: "More auxiliary rewards = faster learning"

Rationalization: "I'll add smoothness, energy, safety rewards" Problem: Each auxiliary reward is another hacking target Reality Check: Validate each auxiliary independently

Pitfall 6: "This should work, why doesn't it?"

Rationalization: "The reward looks right, must be algorithm issue" Problem: Reward design is usually the bottleneck, not algorithm Reality Check: Systematically validate reward using test framework

Pitfall 7: "Agent learned the task, my reward was right"

Rationalization: "Agent succeeded, so reward design was good" Problem: Agent might succeed on hacked solution, not true task Reality Check: Test on distribution shift / different environment variants

Pitfall 8: "Dense rewards cause overfitting"

Rationalization: "Sparse rewards generalize better" Problem: Sparse rewards just fail to learn in long episodes Reality Check: Compare learning curves and final policy generalization

Pitfall 9: "Clipping breaks the signal"

Rationalization: "If I clip rewards, I lose information" Problem: Unbounded rewards cause training instability Reality Check: Relative ordering preserved after clipping, information retained

Pitfall 10: "Potential-based shaping doesn't matter"

Rationalization: "A reward penalty is a reward penalty" Problem: Non-potential-based shaping CAN change optimal policy Reality Check: Prove mathematically that Φ(s') - Φ(s) structure used


Part 10: Reward Engineering Patterns for Common Tasks

Pattern 1: Goal-Reaching Tasks

def reaching_reward(s, a, s_next, gamma=0.99):
    """
    Task: Reach target location.
    """
    goal = s['goal']

    # Sparse task reward
    if np.linalg.norm(s_next['position'] - goal) < 0.1:
        task_reward = 1.0
    else:
        task_reward = 0.0

    # Dense potential-based shaping
    distance = np.linalg.norm(s_next['position'] - goal)
    distance_prev = np.linalg.norm(s['position'] - goal)

    phi = -distance
    phi_prev = -distance_prev
    shaping = gamma * phi - phi_prev

    # Efficiency penalty (optional)
    efficiency = -0.001 * np.sum(a**2)

    return task_reward + 0.1 * shaping + efficiency

Pattern 2: Locomotion Tasks

def locomotion_reward(s, a, s_next):
    """
    Task: Move forward efficiently.
    """
    # Forward progress this step (dense: per-step displacement)
    forward_reward = s_next['x_pos'] - s['x_pos']

    # Staying alive (don't fall)
    alive_bonus = 1.0 if is_alive(s_next) else 0.0

    # Energy efficiency
    action_penalty = -0.0001 * np.sum(a**2)

    return forward_reward + alive_bonus + action_penalty

Pattern 3: Multi-Objective Tasks

def multi_objective_reward(s, a, s_next):
    """
    Task: Multiple objectives (e.g., reach goal AND minimize energy).
    """
    goal_reward = 10.0 * (goal_progress(s, s_next))
    energy_reward = -0.01 * np.sum(a**2)
    safety_reward = -1.0 * collision_risk(s_next)

    # Weight objectives
    return 1.0 * goal_reward + 0.1 * energy_reward + 0.5 * safety_reward

Summary: Reward Engineering Workflow

  1. Specify what success looks like (task reward)
  2. Choose sparse or dense based on episode length
  3. If dense, use potential-based shaping (preserves policy)
  4. Add anti-hacking penalties if needed
  5. Normalize and clip for stability
  6. Validate systematically (generalization, hacking, stability)
  7. Iterate based on validation results
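
A compact sketch composing steps 1-5 for a goal-reaching task, reusing the patterns from earlier parts; distance_to_goal and is_upright are assumed task-specific helpers, and normalizer is a RewardNormalizer instance from Part 7:

import numpy as np

def workflow_reward(env_reward, s, a, s_next, normalizer, gamma=0.99):
    # Steps 1-2: sparse task reward from the environment (what success looks like)
    task_reward = env_reward

    # Step 3: dense potential-based shaping (preserves the optimal policy)
    phi, phi_next = -distance_to_goal(s), -distance_to_goal(s_next)
    shaping = gamma * phi_next - phi

    # Step 4: anti-hacking penalties (action cost, posture requirement)
    penalties = -0.01 * np.sum(a**2) + (0.0 if is_upright(s_next) else -1.0)

    # Step 5: normalize and clip for stability
    return normalizer.normalize(task_reward + 0.1 * shaping + penalties)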

Key Equations Reference

Potential-Based Shaping:
F(s,a,s') = γΦ(s') - Φ(s)

Value Function Shift (with shaping):
V'(s) = V(s) - Φ(s)

Optimal Policy Preservation:
argmax_a Q'(s,a) = argmax_a Q(s,a)  (same action, different Q-values)

Reward Normalization:
r_norm = (r - μ) / (σ + ε)

Clipping:
r_clipped = clip(r_norm, -1, 1)

Testing Scenarios (13+)

The skill addresses these scenarios:

  1. Detecting reward hacking from test set failure
  2. Implementing potential-based shaping correctly
  3. Choosing sparse vs dense based on episode length
  4. Designing distance-based rewards without changing policy
  5. Adding auxiliary rewards without hacking
  6. Normalizing rewards across task variants
  7. Validating that shaping preserves optimal policy
  8. Applying inverse RL to expert demonstrations
  9. Debugging when reward signal causes oscillation
  10. Engineering rewards for specific task families
  11. Recognizing when reward is bottleneck vs algorithm
  12. Explaining reward hacking in principal-agent terms
  13. Implementing end-to-end reward validation pipeline

Practical Checklist

  • Task reward clearly specifies success
  • Reward function can't be exploited by shortcuts
  • Episode length < 20 steps → sparse OK
  • Episode length > 50 steps → need shaping
  • Using potential-based formula F = γΦ(s') - Φ(s)
  • Clipping/normalizing rewards to [-1, 1]
  • Tested on distribution shift (different env variant)
  • Behavioral inspection (is agent doing what you expect?)
  • Training stability across seeds (variance < 0.3)
  • Learning curves look reasonable (no sudden divergence)
  • Final policy generalizes to test distribution