| name | rl-evaluation |
| description | Rigorous RL evaluation - statistical protocols, train/test discipline, metrics, generalization |
RL Evaluation: Rigorous Methodology for Agent Assessment
Without rigorous evaluation, you will:
- Draw conclusions from statistical noise
- Report results that don't generalize
- Deploy agents that fail in production
- Waste resources on false improvements
This skill provides systematic evaluation protocols that ensure statistical validity, generalization measurement, and deployment-ready assessment.
When to Use This Skill
Use this skill when:
- ✅ Evaluating RL agent performance
- ✅ Comparing multiple RL algorithms
- ✅ Reporting results for publication or deployment
- ✅ Making algorithm selection decisions
- ✅ Assessing readiness for production deployment
- ✅ Debugging training (need accurate performance estimates)
DO NOT use for:
- ❌ Quick sanity checks during development (use informal evaluation)
- ❌ Monitoring training progress (use running averages)
- ❌ Initial hyperparameter sweeps (use coarse evaluation)
When in doubt: If the evaluation result will inform a decision (publish, deploy, choose algorithm), use this skill.
Core Principles
Principle 1: Statistical Rigor is Non-Negotiable
Reality: RL has inherently high variance. Single runs are meaningless.
Enforcement:
- Minimum 5-10 random seeds for any performance claim
- Report mean ± std or 95% confidence intervals
- Statistical significance testing when comparing algorithms
- Never report single-seed results as representative
Principle 2: Train/Test Discipline Prevents Overfitting
Reality: Agents exploit environment quirks. Training performance ≠ generalization.
Enforcement:
- Separate train/test environment instances
- Different random seeds for train/eval
- Test on distribution shifts (new instances, physics, appearances)
- Report both training and generalization performance
Principle 3: Sample Efficiency Matters
Reality: Final performance ignores cost. Samples are often expensive.
Enforcement:
- Report sample efficiency curves (reward vs steps)
- Include "reward at X steps" for multiple budgets
- Consider deployment constraints
- Compare at SAME sample budget, not just asymptotic
Principle 4: Evaluation Mode Must Match Deployment
Reality: Stochastic vs deterministic evaluation changes results by 10-30%.
Enforcement:
- Specify evaluation mode (stochastic/deterministic)
- Match evaluation to deployment scenario
- Report both if ambiguous
- Explain choice in methodology
Principle 5: Offline RL Requires Special Care
Reality: Offline RL policies cannot be evaluated accurately without online rollouts.
Enforcement:
- Acknowledge evaluation limitations
- Use conservative metrics (in-distribution performance)
- Quantify uncertainty
- Staged deployment (offline → small online trial → full)
Statistical Evaluation Protocol
Multi-Seed Evaluation (MANDATORY)
Minimum Requirements:
- Exploration/research: 5-10 seeds minimum
- Publication: 10-20 seeds
- Production deployment: 20-50 seeds (depending on variance)
Protocol:
import numpy as np
from scipy import stats
def evaluate_multi_seed(algorithm, env_name, seeds, total_steps):
"""
Evaluate algorithm across multiple random seeds.
Args:
algorithm: RL algorithm class
env_name: Environment name
seeds: List of random seeds
total_steps: Training steps per seed
Returns:
Dictionary with statistics
"""
final_rewards = []
sample_efficiency_curves = []
for seed in seeds:
        # Train agent; evaluate on a separate env with a different seed (train/test discipline)
        env = gym.make(env_name, seed=seed)
        eval_env = gym.make(env_name, seed=seed + 10_000)  # offset is arbitrary, just different
        agent = algorithm(env, seed=seed)
        # Track performance during training
        eval_points = np.linspace(0, total_steps, num=20, dtype=int)
        curve = []
        steps_done = 0
        for step in eval_points:
            # Train only the increment needed to reach this evaluation point
            agent.train(steps=step - steps_done)
            steps_done = step
            reward = evaluate_deterministic(agent, eval_env, episodes=10)
            curve.append((step, reward))
sample_efficiency_curves.append(curve)
final_rewards.append(curve[-1][1]) # Final performance
final_rewards = np.array(final_rewards)
return {
'mean': np.mean(final_rewards),
'std': np.std(final_rewards),
'median': np.median(final_rewards),
'min': np.min(final_rewards),
'max': np.max(final_rewards),
'iqr': (np.percentile(final_rewards, 75) -
np.percentile(final_rewards, 25)),
'confidence_interval_95': stats.t.interval(
0.95,
len(final_rewards) - 1,
loc=np.mean(final_rewards),
scale=stats.sem(final_rewards)
),
'all_seeds': final_rewards,
'curves': sample_efficiency_curves
}
# Usage
results = evaluate_multi_seed(
algorithm=PPO,
env_name="HalfCheetah-v3",
seeds=range(10), # 10 seeds
total_steps=1_000_000
)
print(f"Performance: {results['mean']:.1f} ± {results['std']:.1f}")
print(f"95% CI: [{results['confidence_interval_95'][0]:.1f}, "
f"{results['confidence_interval_95'][1]:.1f}]")
print(f"Median: {results['median']:.1f}")
print(f"Range: [{results['min']:.1f}, {results['max']:.1f}]")
Reporting Template:
Algorithm: PPO
Environment: HalfCheetah-v3
Seeds: 10
Total Steps: 1M
Final Performance:
- Mean: 4,523 ± 387
- Median: 4,612
- 95% CI: [4,246, 4,800]
- Range: [3,812, 5,201]
Sample Efficiency:
- Reward at 100k steps: 1,234 ± 156
- Reward at 500k steps: 3,456 ± 289
- Reward at 1M steps: 4,523 ± 387
Statistical Significance Testing
When comparing algorithms:
def compare_algorithms(results_A, results_B, alpha=0.05):
"""
Compare two algorithms with statistical rigor.
Args:
results_A: Array of final rewards for algorithm A (multiple seeds)
results_B: Array of final rewards for algorithm B (multiple seeds)
alpha: Significance level (default 0.05)
Returns:
Dictionary with comparison statistics
"""
    # Welch's t-test for difference in means (does not assume equal variances)
    t_statistic, p_value = stats.ttest_ind(results_A, results_B, equal_var=False)
# Effect size (Cohen's d)
pooled_std = np.sqrt((np.std(results_A)**2 + np.std(results_B)**2) / 2)
cohens_d = (np.mean(results_A) - np.mean(results_B)) / pooled_std
# Bootstrap confidence interval for difference
def bootstrap_diff(n_bootstrap=10000):
diffs = []
for _ in range(n_bootstrap):
sample_A = np.random.choice(results_A, size=len(results_A))
sample_B = np.random.choice(results_B, size=len(results_B))
diffs.append(np.mean(sample_A) - np.mean(sample_B))
return np.percentile(diffs, [2.5, 97.5])
ci_diff = bootstrap_diff()
return {
'mean_A': np.mean(results_A),
'mean_B': np.mean(results_B),
'difference': np.mean(results_A) - np.mean(results_B),
'p_value': p_value,
'significant': p_value < alpha,
'cohens_d': cohens_d,
'ci_difference': ci_diff,
        'conclusion': (
            f"Algorithm {'A' if np.mean(results_A) > np.mean(results_B) else 'B'} has the higher mean; "
            f"the difference is {'statistically significant' if p_value < alpha else 'NOT statistically significant'} "
            f"(p={p_value:.4f})"
        )
}
# Usage
ppo_results = np.array([4523, 4612, 4201, 4789, 4456, 4390, 4678, 4234, 4567, 4498])
sac_results = np.array([4678, 4890, 4567, 4923, 4712, 4645, 4801, 4556, 4734, 4689])
comparison = compare_algorithms(ppo_results, sac_results)
print(comparison['conclusion'])
print(f"Effect size (Cohen's d): {comparison['cohens_d']:.3f}")
print(f"95% CI for difference: [{comparison['ci_difference'][0]:.1f}, "
f"{comparison['ci_difference'][1]:.1f}]")
Interpreting Effect Size (Cohen's d):
- d < 0.2: Negligible difference
- 0.2 ≤ d < 0.5: Small effect
- 0.5 ≤ d < 0.8: Medium effect
- d ≥ 0.8: Large effect
Red Flag: If p-value < 0.05 but Cohen's d < 0.2, the difference is statistically significant but practically negligible. Don't claim "better" without practical significance.
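To keep the practical-significance check next to the p-value, a small helper along these lines can apply the thresholds above (a sketch; the function name is ours):

def interpret_effect_size(cohens_d, p_value, alpha=0.05):
    """Label Cohen's d using the thresholds above and flag 'significant but negligible' results."""
    d = abs(cohens_d)
    if d < 0.2:
        label = "negligible"
    elif d < 0.5:
        label = "small"
    elif d < 0.8:
        label = "medium"
    else:
        label = "large"
    if p_value < alpha and label == "negligible":
        return f"{label} effect (d={cohens_d:.2f}): statistically significant but practically irrelevant"
    return f"{label} effect (d={cohens_d:.2f})"

print(interpret_effect_size(comparison['cohens_d'], comparison['p_value']))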
Power Analysis: How Many Seeds Needed?
def required_seeds_for_precision(std_estimate, mean_estimate,
desired_precision=0.1, confidence=0.95):
"""
Calculate number of seeds needed for desired precision.
Args:
std_estimate: Estimated standard deviation (from pilot runs)
mean_estimate: Estimated mean performance
desired_precision: Desired precision as fraction of mean (0.1 = ±10%)
confidence: Confidence level (0.95 = 95% CI)
Returns:
Required number of seeds
"""
# Z-score for confidence level
z = stats.norm.ppf(1 - (1 - confidence) / 2)
# Desired margin of error
margin = desired_precision * mean_estimate
# Required sample size
n = (z * std_estimate / margin) ** 2
return int(np.ceil(n))
# Example: You ran 3 pilot seeds
pilot_results = [4500, 4200, 4700]
std_est = np.std(pilot_results, ddof=1)  # ~252 (sample std)
mean_est = np.mean(pilot_results)        # ~4467
# How many seeds for ±10% precision at 95% confidence?
n_required = required_seeds_for_precision(std_est, mean_est,
                                          desired_precision=0.1)
print(f"Need {n_required} seeds for ±10% precision")  # ~2 seeds for this pilot
# How many for ±5% precision?
n_tight = required_seeds_for_precision(std_est, mean_est,
                                       desired_precision=0.05)
print(f"Need {n_tight} seeds for ±5% precision")  # ~5 seeds for this pilot
# Caution: a 3-seed pilot usually underestimates the true variance, so treat these
# counts as lower bounds and re-run the calculation after collecting more seeds.
Practical Guidelines:
- Quick comparison: 5 seeds (±20% precision)
- Standard evaluation: 10 seeds (±10% precision)
- Publication: 20 seeds (±7% precision)
- Production deployment: 50+ seeds (±5% precision)
Train/Test Discipline
Environment Instance Separation
CRITICAL: Never evaluate on the same environment instances used for training.
# WRONG: Single environment for both training and evaluation
env = gym.make("CartPole-v1", seed=42)
agent.train(env)
performance = evaluate(agent, env) # BIASED!
# CORRECT: Separate environments
train_env = gym.make("CartPole-v1", seed=42)
eval_env = gym.make("CartPole-v1", seed=999) # Different seed
agent.train(train_env)
performance = evaluate(agent, eval_env) # Unbiased
Train/Test Split for Custom Environments
For environments with multiple instances (levels, objects, configurations):
def create_train_test_split(all_instances, test_ratio=0.2, seed=42):
"""
Split environment instances into train and test sets.
Args:
all_instances: List of environment configurations
test_ratio: Fraction for test set (default 0.2)
seed: Random seed for reproducibility
Returns:
(train_instances, test_instances)
"""
np.random.seed(seed)
n_test = int(len(all_instances) * test_ratio)
indices = np.random.permutation(len(all_instances))
test_indices = indices[:n_test]
train_indices = indices[n_test:]
train_instances = [all_instances[i] for i in train_indices]
test_instances = [all_instances[i] for i in test_indices]
return train_instances, test_instances
# Example: Maze environments
all_mazes = [MazeLayout(seed=i) for i in range(100)]
train_mazes, test_mazes = create_train_test_split(all_mazes, test_ratio=0.2)
print(f"Training on {len(train_mazes)} mazes") # 80
print(f"Testing on {len(test_mazes)} mazes") # 20
# Train only on training set
agent.train(train_mazes)
# Evaluate on BOTH train and test (measure generalization gap)
train_performance = evaluate(agent, train_mazes)
test_performance = evaluate(agent, test_mazes)
generalization_gap = train_performance - test_performance
print(f"Train: {train_performance:.1f}")
print(f"Test: {test_performance:.1f}")
print(f"Generalization gap: {generalization_gap:.1f}")
# Red flag: If gap > 20% of train performance, agent is overfitting
if generalization_gap > 0.2 * train_performance:
print("WARNING: Significant overfitting detected!")
Randomization Protocol
Ensure independent randomization for train/eval:
class EvaluationProtocol:
def __init__(self, env_name, train_seed=42, eval_seed=999):
"""
Proper train/eval environment management.
Args:
env_name: Gym environment name
train_seed: Seed for training environment
eval_seed: Seed for evaluation environment (DIFFERENT)
"""
self.env_name = env_name
self.train_seed = train_seed
self.eval_seed = eval_seed
# Separate environments
self.train_env = gym.make(env_name)
self.train_env.seed(train_seed)
self.train_env.action_space.seed(train_seed)
self.train_env.observation_space.seed(train_seed)
self.eval_env = gym.make(env_name)
self.eval_env.seed(eval_seed)
self.eval_env.action_space.seed(eval_seed)
self.eval_env.observation_space.seed(eval_seed)
def train_step(self, agent):
"""Training step on training environment."""
return agent.step(self.train_env)
def evaluate(self, agent, episodes=100):
"""Evaluation on SEPARATE evaluation environment."""
rewards = []
for _ in range(episodes):
state = self.eval_env.reset()
episode_reward = 0
done = False
while not done:
action = agent.act_deterministic(state)
state, reward, done, _ = self.eval_env.step(action)
episode_reward += reward
rewards.append(episode_reward)
return np.mean(rewards), np.std(rewards)
# Usage
protocol = EvaluationProtocol("HalfCheetah-v3", train_seed=42, eval_seed=999)
# Training
agent = SAC()
for step in range(1_000_000):
protocol.train_step(agent)
if step % 10_000 == 0:
mean_reward, std_reward = protocol.evaluate(agent, episodes=10)
print(f"Step {step}: {mean_reward:.1f} ± {std_reward:.1f}")
Sample Efficiency Metrics
Sample Efficiency Curves
Report performance at multiple sample budgets, not just final:
def compute_sample_efficiency_curve(agent_class, env_name, seed,
max_steps, eval_points=20):
"""
Compute sample efficiency curve (reward vs steps).
Args:
agent_class: RL algorithm class
env_name: Environment name
seed: Random seed
max_steps: Maximum training steps
eval_points: Number of evaluation points
Returns:
List of (steps, reward) tuples
"""
env = gym.make(env_name, seed=seed)
agent = agent_class(env, seed=seed)
eval_steps = np.logspace(3, np.log10(max_steps), num=eval_points, dtype=int)
# [1000, 1500, 2200, ..., max_steps] (logarithmic spacing)
curve = []
current_step = 0
for target_step in eval_steps:
# Train until target_step
steps_to_train = target_step - current_step
agent.train(steps=steps_to_train)
current_step = target_step
# Evaluate
reward = evaluate_deterministic(agent, env, episodes=10)
curve.append((target_step, reward))
return curve
# Compare sample efficiency of multiple algorithms
algorithms = [PPO, SAC, TD3]
env_name = "HalfCheetah-v3"
max_steps = 1_000_000
for algo in algorithms:
# Average across 5 seeds
all_curves = []
for seed in range(5):
curve = compute_sample_efficiency_curve(algo, env_name, seed, max_steps)
all_curves.append(curve)
# Aggregate
steps = [point[0] for point in all_curves[0]]
rewards_at_step = [[curve[i][1] for curve in all_curves]
for i in range(len(steps))]
mean_rewards = [np.mean(rewards) for rewards in rewards_at_step]
std_rewards = [np.std(rewards) for rewards in rewards_at_step]
    # Report at the nearest logged evaluation points (the logarithmic schedule
    # may not contain these exact step counts)
    for target in [100_000, 500_000, 1_000_000]:
        idx = int(np.argmin(np.abs(np.array(steps) - target)))
        print(f"{algo.__name__} at {target} steps: "
              f"{mean_rewards[idx]:.1f} ± {std_rewards[idx]:.1f}")
Sample Output:
PPO at 100k steps: 1,234 ± 156
PPO at 500k steps: 3,456 ± 289
PPO at 1M steps: 4,523 ± 387
SAC at 100k steps: 891 ± 178
SAC at 500k steps: 3,789 ± 245
SAC at 1M steps: 4,912 ± 312
TD3 at 100k steps: 756 ± 134
TD3 at 500k steps: 3,234 ± 298
TD3 at 1M steps: 4,678 ± 276
Analysis:
- PPO is most sample-efficient early (1,234 at 100k)
- SAC has best final performance (4,912 at 1M)
- If sample budget is 100k → PPO is best choice
- If sample budget is 1M → SAC is best choice
Area Under Curve (AUC) Metric
Single metric for sample efficiency:
def compute_auc(curve):
"""
Compute area under sample efficiency curve.
Args:
curve: List of (steps, reward) tuples
Returns:
AUC value (higher = more sample efficient)
"""
steps = np.array([point[0] for point in curve])
rewards = np.array([point[1] for point in curve])
# Trapezoidal integration
auc = np.trapz(rewards, steps)
return auc
# Compare algorithms by AUC
for algo in algorithms:
all_aucs = []
for seed in range(5):
curve = compute_sample_efficiency_curve(algo, env_name, seed, max_steps)
auc = compute_auc(curve)
all_aucs.append(auc)
print(f"{algo.__name__} AUC: {np.mean(all_aucs):.2e} ± {np.std(all_aucs):.2e}")
Note: AUC is sensitive to evaluation point spacing. Use consistent evaluation points across algorithms.
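One way to make the number easier to interpret, assuming all algorithms share the same logged (steps, reward) points, is to divide the AUC by the step range, giving the average reward over the training budget (a sketch):

def compute_normalized_auc(curve):
    """AUC divided by the step range: average reward over the training budget."""
    steps = np.array([point[0] for point in curve])
    rewards = np.array([point[1] for point in curve])
    return np.trapz(rewards, steps) / (steps[-1] - steps[0])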
Generalization Testing
Distribution Shift Evaluation
Test on environment variations to measure robustness:
def evaluate_generalization(agent, env_name, shifts):
"""
Evaluate agent on distribution shifts.
Args:
agent: Trained RL agent
env_name: Base environment name
shifts: Dictionary of shift types and parameters
Returns:
Dictionary of performance on each shift
"""
results = {}
# Baseline (no shift)
baseline_env = gym.make(env_name)
baseline_perf = evaluate(agent, baseline_env, episodes=50)
results['baseline'] = baseline_perf
# Test shifts
for shift_name, shift_params in shifts.items():
shifted_env = apply_shift(env_name, shift_params)
shift_perf = evaluate(agent, shifted_env, episodes=50)
results[shift_name] = shift_perf
# Compute degradation
degradation = (baseline_perf - shift_perf) / baseline_perf
results[f'{shift_name}_degradation'] = degradation
return results
# Example: Robotic grasping
shifts = {
'lighting_dim': {'lighting_scale': 0.5},
'lighting_bright': {'lighting_scale': 1.5},
'camera_angle_15deg': {'camera_rotation': 15},
'table_height_+5cm': {'table_height_offset': 0.05},
'object_mass_+50%': {'mass_scale': 1.5},
'object_friction_-30%': {'friction_scale': 0.7}
}
gen_results = evaluate_generalization(agent, "RobotGrasp-v1", shifts)
print(f"Baseline: {gen_results['baseline']:.2%} success")
for shift_name in shifts.keys():
perf = gen_results[shift_name]
deg = gen_results[f'{shift_name}_degradation']
print(f"{shift_name}: {perf:.2%} success ({deg:.1%} degradation)")
# Red flag: If any degradation > 50%, agent is brittle
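The apply_shift helper above is left abstract. For a custom environment it is often just a constructor call with modified parameters; a minimal sketch, assuming the environment accepts these keyword arguments:

def apply_shift(env_name, shift_params):
    """Build an environment variant with modified parameters (kwargs are hypothetical)."""
    # gym.make forwards extra keyword arguments to the environment constructor,
    # e.g. gym.make("RobotGrasp-v1", lighting_scale=0.5)
    return gym.make(env_name, **shift_params)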
Zero-Shot Transfer Evaluation
Test on completely new environments:
def zero_shot_transfer(agent, train_env_name, test_env_names):
"""
Evaluate zero-shot transfer to related environments.
Args:
agent: Agent trained on train_env_name
train_env_name: Training environment
test_env_names: List of related test environments
Returns:
Transfer performance dictionary
"""
results = {}
# Source performance
source_env = gym.make(train_env_name)
source_perf = evaluate(agent, source_env, episodes=50)
results['source'] = source_perf
# Target performances
for target_env_name in test_env_names:
target_env = gym.make(target_env_name)
target_perf = evaluate(agent, target_env, episodes=50)
results[target_env_name] = target_perf
# Transfer efficiency
transfer_ratio = target_perf / source_perf
results[f'{target_env_name}_transfer_ratio'] = transfer_ratio
return results
# Example: Locomotion transfer
agent_trained_on_cheetah = train(PPO, "HalfCheetah-v3")
transfer_results = zero_shot_transfer(
agent_trained_on_cheetah,
train_env_name="HalfCheetah-v3",
test_env_names=["Hopper-v3", "Walker2d-v3", "Ant-v3"]
)
print(f"Source (HalfCheetah): {transfer_results['source']:.1f}")
for env in ["Hopper-v3", "Walker2d-v3", "Ant-v3"]:
perf = transfer_results[env]
ratio = transfer_results[f'{env}_transfer_ratio']
print(f"{env}: {perf:.1f} ({ratio:.1%} of source)")
Robustness to Adversarial Perturbations
Test against worst-case scenarios:
def adversarial_evaluation(agent, env, perturbation_types,
perturbation_magnitudes):
"""
Evaluate robustness to adversarial perturbations.
Args:
agent: RL agent to evaluate
env: Environment
perturbation_types: List of perturbation types
perturbation_magnitudes: List of magnitudes to test
Returns:
Robustness curve for each perturbation type
"""
results = {}
for perturb_type in perturbation_types:
results[perturb_type] = []
for magnitude in perturbation_magnitudes:
# Apply perturbation
perturbed_env = add_perturbation(env, perturb_type, magnitude)
# Evaluate
perf = evaluate(agent, perturbed_env, episodes=20)
results[perturb_type].append((magnitude, perf))
return results
# Example: Vision-based control
perturbation_types = ['gaussian_noise', 'occlusion', 'brightness']
magnitudes = [0.0, 0.1, 0.2, 0.3, 0.5]
robustness = adversarial_evaluation(
agent, env, perturbation_types, magnitudes
)
for perturb_type, curve in robustness.items():
print(f"\n{perturb_type}:")
for magnitude, perf in curve:
print(f" Magnitude {magnitude}: {perf:.1f} reward")
Evaluation Protocols
Stochastic vs Deterministic Evaluation
Decision Tree:
Is the policy inherently deterministic?
├─ YES (DQN, DDPG without noise)
│ └─ Use deterministic evaluation
└─ NO (PPO, SAC, stochastic policies)
├─ Will deployment use stochastic policy?
│ ├─ YES (dialogue, exploration needed)
│ │ └─ Use stochastic evaluation
│ └─ NO (control, deterministic deployment)
│ └─ Use deterministic evaluation
└─ Unsure?
└─ Report BOTH stochastic and deterministic
Implementation:
class EvaluationMode:
@staticmethod
def deterministic(agent, env, episodes=100):
"""
Deterministic evaluation (use mean/argmax action).
"""
rewards = []
for _ in range(episodes):
state = env.reset()
episode_reward = 0
done = False
while not done:
# Use mean action (no sampling)
if hasattr(agent, 'act_deterministic'):
action = agent.act_deterministic(state)
else:
action = agent.policy.mean(state) # Or argmax for discrete
state, reward, done, _ = env.step(action)
episode_reward += reward
rewards.append(episode_reward)
return np.mean(rewards), np.std(rewards)
@staticmethod
def stochastic(agent, env, episodes=100):
"""
Stochastic evaluation (sample from policy).
"""
rewards = []
for _ in range(episodes):
state = env.reset()
episode_reward = 0
done = False
while not done:
# Sample from policy distribution
action = agent.policy.sample(state)
state, reward, done, _ = env.step(action)
episode_reward += reward
rewards.append(episode_reward)
return np.mean(rewards), np.std(rewards)
@staticmethod
def report_both(agent, env, episodes=100):
"""
Report both evaluation modes for transparency.
"""
det_mean, det_std = EvaluationMode.deterministic(agent, env, episodes)
sto_mean, sto_std = EvaluationMode.stochastic(agent, env, episodes)
return {
'deterministic': {'mean': det_mean, 'std': det_std},
'stochastic': {'mean': sto_mean, 'std': sto_std},
'difference': det_mean - sto_mean
}
# Usage
sac_agent = SAC(env)
sac_agent.train(steps=1_000_000)
eval_results = EvaluationMode.report_both(sac_agent, env, episodes=100)
print(f"Deterministic: {eval_results['deterministic']['mean']:.1f} "
f"± {eval_results['deterministic']['std']:.1f}")
print(f"Stochastic: {eval_results['stochastic']['mean']:.1f} "
f"± {eval_results['stochastic']['std']:.1f}")
print(f"Difference: {eval_results['difference']:.1f}")
Interpretation:
- If difference < 5% of mean: Evaluation mode doesn't matter much
- If difference > 15% of mean: Evaluation mode significantly affects results
- Must clearly specify which mode used
- Ensure fair comparison across algorithms (same mode)
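Applied to the report_both output above, these thresholds become a quick check (a sketch):

diff_frac = abs(eval_results['difference']) / max(abs(eval_results['deterministic']['mean']), 1e-8)
if diff_frac < 0.05:
    print("Evaluation mode barely matters here; still state which mode you used.")
elif diff_frac > 0.15:
    print("Evaluation mode changes results substantially; report both and match deployment.")
else:
    print("Moderate mode sensitivity; specify the mode and keep it consistent across algorithms.")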
Episode Count Selection
How many evaluation episodes needed?
def required_eval_episodes(env, agent, desired_sem, max_episodes=1000):
"""
Determine number of evaluation episodes for desired standard error.
Args:
env: Environment
agent: Agent to evaluate
desired_sem: Desired standard error of mean
max_episodes: Maximum episodes to test
Returns:
Required number of episodes
"""
# Run initial episodes to estimate variance
initial_episodes = min(20, max_episodes)
rewards = []
for _ in range(initial_episodes):
state = env.reset()
episode_reward = 0
done = False
while not done:
action = agent.act_deterministic(state)
state, reward, done, _ = env.step(action)
episode_reward += reward
rewards.append(episode_reward)
# Estimate standard deviation
std_estimate = np.std(rewards)
# Required episodes: n = (std / desired_sem)^2
required = int(np.ceil((std_estimate / desired_sem) ** 2))
return min(required, max_episodes)
# Usage
agent = PPO(env)
agent.train(steps=1_000_000)
# Want standard error < 10 reward units
n_episodes = required_eval_episodes(env, agent, desired_sem=10)
print(f"Need {n_episodes} episodes for SEM < 10")
# Evaluate with required episodes
final_eval = evaluate(agent, env, episodes=n_episodes)
Rule of Thumb:
- Quick check: 10 episodes
- Standard evaluation: 50-100 episodes
- Publication/deployment: 100-200 episodes
- High-variance environments: 500+ episodes
Evaluation Frequency During Training
How often to evaluate during training?
def adaptive_evaluation_schedule(total_steps, early_freq=1000,
late_freq=10000, transition_step=100000):
"""
Create adaptive evaluation schedule.
Early training: Frequent evaluations (detect divergence early)
Late training: Infrequent evaluations (policy more stable)
Args:
total_steps: Total training steps
early_freq: Evaluation frequency in early training
late_freq: Evaluation frequency in late training
transition_step: Step to transition from early to late
Returns:
List of evaluation timesteps
"""
eval_steps = []
# Early phase
current_step = 0
while current_step < transition_step:
eval_steps.append(current_step)
current_step += early_freq
# Late phase
while current_step < total_steps:
eval_steps.append(current_step)
current_step += late_freq
# Always evaluate at end
if eval_steps[-1] != total_steps:
eval_steps.append(total_steps)
return eval_steps
# Usage
schedule = adaptive_evaluation_schedule(
total_steps=1_000_000,
early_freq=1_000, # Every 1k steps for first 100k
late_freq=10_000, # Every 10k steps after 100k
transition_step=100_000
)
print(f"Total evaluations: {len(schedule)}")
print(f"First 10 eval steps: {schedule[:10]}")
print(f"Last 10 eval steps: {schedule[-10:]}")
# Training loop
agent = PPO(env)
for step in range(1_000_000):
agent.train_step()
if step in schedule:
eval_perf = evaluate(agent, eval_env, episodes=10)
log(step, eval_perf)
Guidelines:
- Evaluation is expensive (10-100 episodes × episode length)
- Early training: Evaluate frequently to detect divergence
- Late training: Evaluate less frequently (policy stabilizes)
- Don't evaluate every step (wastes compute)
- Save checkpoints at evaluation steps (for later analysis)
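To see the cost concretely, you can estimate the environment steps spent purely on evaluation for a given schedule (a rough sketch; the episode length is an assumption you should replace with your environment's typical horizon):

episodes_per_eval = 10
avg_episode_length = 1000  # assumed horizon

eval_env_steps = len(schedule) * episodes_per_eval * avg_episode_length
print(f"Evaluation consumes ~{eval_env_steps:,} extra env steps "
      f"on top of {1_000_000:,} training steps")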
Offline RL Evaluation
The Offline RL Evaluation Problem
CRITICAL: You cannot accurately evaluate offline RL policies without online rollouts.
Why:
- Learned Q-values are only accurate for data distribution
- Policy wants to visit out-of-distribution states
- Q-values for OOD states are extrapolated (unreliable)
- Dataset doesn't contain policy's trajectories
What to do:
class OfflineRLEvaluation:
"""
Conservative offline RL evaluation protocol.
"""
@staticmethod
def in_distribution_performance(offline_dataset, policy):
"""
Evaluate policy on dataset trajectories (lower bound).
This measures: "How well does policy match best trajectories
in dataset?" NOT "How good is the policy?"
"""
returns = []
for trajectory in offline_dataset:
# Check if policy would generate this trajectory
policy_match = True
for (state, action) in trajectory:
policy_action = policy(state)
if not actions_match(policy_action, action):
policy_match = False
break
if policy_match:
                returns.append(trajectory.total_return)  # assumed attribute; 'return' is a reserved word in Python
if len(returns) == 0:
return None # Policy doesn't match any dataset trajectory
return np.mean(returns)
@staticmethod
def behavioral_cloning_baseline(offline_dataset):
"""
Train behavior cloning on dataset (baseline).
Offline RL should outperform BC, otherwise it's not learning.
"""
bc_policy = BehaviorCloning(offline_dataset)
bc_policy.train()
return bc_policy
@staticmethod
def model_based_evaluation(offline_dataset, policy, model):
"""
Use learned dynamics model for evaluation (if available).
CAUTION: Model errors compound. Short rollouts only.
"""
# Train dynamics model on dataset
model.train(offline_dataset)
# Generate short rollouts (5-10 steps)
rollout_returns = []
for _ in range(100):
state = sample_initial_state(offline_dataset)
rollout_return = 0
for step in range(10): # Short rollouts only
action = policy(state)
next_state, reward = model.predict(state, action)
rollout_return += reward
state = next_state
rollout_returns.append(rollout_return)
# Heavy discount for model uncertainty
uncertainty = model.get_uncertainty(offline_dataset)
adjusted_return = np.mean(rollout_returns) * (1 - uncertainty)
return adjusted_return
@staticmethod
def state_coverage_metric(offline_dataset, policy, num_rollouts=100):
"""
Measure how much policy stays in-distribution.
Low coverage → policy goes OOD → evaluation unreliable
"""
# Get dataset state distribution
dataset_states = get_all_states(offline_dataset)
# Simulate policy rollouts
policy_states = []
for _ in range(num_rollouts):
trajectory = simulate_with_model(policy) # Needs model
policy_states.extend(trajectory.states)
# Compute coverage (fraction of policy states near dataset states)
coverage = compute_coverage(policy_states, dataset_states)
return coverage
@staticmethod
def full_offline_evaluation(offline_dataset, policy):
"""
Comprehensive offline evaluation (still conservative).
"""
results = {}
# 1. In-distribution performance
results['in_dist_perf'] = OfflineRLEvaluation.in_distribution_performance(
offline_dataset, policy
)
# 2. Compare to behavior cloning
bc_policy = OfflineRLEvaluation.behavioral_cloning_baseline(offline_dataset)
results['bc_baseline'] = evaluate(bc_policy, offline_dataset)
# 3. Model-based evaluation (if model available)
# model = train_dynamics_model(offline_dataset)
# results['model_eval'] = OfflineRLEvaluation.model_based_evaluation(
# offline_dataset, policy, model
# )
# 4. State coverage
# results['coverage'] = OfflineRLEvaluation.state_coverage_metric(
# offline_dataset, policy
# )
return results
# Usage
offline_dataset = load_offline_dataset("d4rl-halfcheetah-medium-v0")
offline_policy = CQL(offline_dataset)
offline_policy.train()
eval_results = OfflineRLEvaluation.full_offline_evaluation(
offline_dataset, offline_policy
)
print("Offline Evaluation (CONSERVATIVE):")
print(f"In-distribution performance: {eval_results['in_dist_perf']}")
print(f"BC baseline: {eval_results['bc_baseline']}")
print("\nWARNING: These are lower bounds. True performance unknown without online evaluation.")
Staged Deployment for Offline RL
Best practice: Gradually introduce online evaluation
def staged_offline_to_online_deployment(offline_policy, env,
                                        minimum_threshold, deployment_threshold):
"""
Staged deployment: Offline → Small online trial → Full deployment
Stage 1: Offline evaluation (conservative)
Stage 2: Small online trial (safety-constrained)
Stage 3: Full online evaluation
Stage 4: Deployment
"""
results = {}
# Stage 1: Offline evaluation
print("Stage 1: Offline evaluation")
offline_perf = offline_evaluation(offline_policy)
results['offline'] = offline_perf
if offline_perf < minimum_threshold:
print("Failed offline evaluation. Stop.")
return results
# Stage 2: Small online trial (100 episodes)
print("Stage 2: Small online trial (100 episodes)")
online_trial_perf = evaluate(offline_policy, env, episodes=100)
results['small_trial'] = online_trial_perf
# Check degradation
degradation = (offline_perf - online_trial_perf) / offline_perf
if degradation > 0.3: # >30% degradation
print(f"WARNING: {degradation:.1%} performance drop in online trial")
print("Policy may be overfitting to offline data. Investigate.")
return results
# Stage 3: Full online evaluation (1000 episodes)
print("Stage 3: Full online evaluation (1000 episodes)")
online_full_perf = evaluate(offline_policy, env, episodes=1000)
results['full_online'] = online_full_perf
# Stage 4: Deployment decision
if online_full_perf > deployment_threshold:
print("Passed all stages. Ready for deployment.")
results['deploy'] = True
else:
print("Failed online evaluation. Do not deploy.")
results['deploy'] = False
return results
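A usage sketch (both thresholds are application-specific placeholders you must choose yourself, e.g. relative to the behavior-cloning baseline and your deployment requirements):

results = staged_offline_to_online_deployment(
    offline_policy, env,
    minimum_threshold=0.8 * bc_baseline_return,  # placeholder: at least approach the BC baseline
    deployment_threshold=target_return           # placeholder: your deployment bar
)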
Common Pitfalls
Pitfall 1: Single Seed Reporting
Symptom: Reporting one training run as "the result"
Why it's wrong: RL has high variance. Single seed is noise.
Detection:
- Paper shows single training curve
- No variance/error bars
- No mention of multiple seeds
Fix: Minimum 5-10 seeds, report mean ± std
Pitfall 2: Cherry-Picking Results
Symptom: Running many experiments, reporting best
Why it's wrong: Creates false positives (p-hacking)
Detection:
- Results seem too good
- No mention of failed runs
- "We tried many seeds and picked a representative one"
Fix: Report ALL runs. Pre-register experiments.
Pitfall 3: Evaluating on Training Set
Symptom: Agent evaluated on same environment instances used for training
Why it's wrong: Measures memorization, not generalization
Detection:
- No mention of train/test split
- Same random seed for training and evaluation
- Perfect performance on specific instances
Fix: Separate train/test environments with different seeds
Pitfall 4: Ignoring Sample Efficiency
Symptom: Comparing algorithms only on final performance
Why it's wrong: Final performance ignores cost to achieve it
Detection:
- No sample efficiency curves
- No "reward at X steps" metrics
- Only asymptotic performance reported
Fix: Report sample efficiency curves, compare at multiple budgets
Pitfall 5: Conflating Train and Eval Performance
Symptom: Using training episode returns as evaluation
Why it's wrong: Training uses exploration noise; evaluation should not
Detection:
- "Training reward" used for algorithm comparison
- No separate evaluation protocol
- Same environment instance for both
Fix: Separate training (with exploration) and evaluation (without)
Pitfall 6: Insufficient Evaluation Episodes
Symptom: Evaluating with 5-10 episodes
Why it's wrong: High variance → unreliable estimates
Detection:
- Large error bars
- Inconsistent results across runs
- SEM > 10% of mean
Fix: 50-100 episodes minimum, power analysis for exact number
Pitfall 7: Reporting Peak Instead of Final
Symptom: Selecting best checkpoint during training
Why it's wrong: Selecting the peak overfits to noise in the evaluation estimates
Detection:
- "Best performance during training" reported
- Early stopping based on eval performance
- No mention of final performance
Fix: Report final performance, or use validation set for model selection
Pitfall 8: No Generalization Testing
Symptom: Only evaluating on single environment configuration
Why it's wrong: Doesn't measure robustness to distribution shift
Detection:
- No mention of distribution shifts
- Only one environment configuration tested
- No transfer/zero-shot evaluation
Fix: Test on held-out environments, distribution shifts, adversarial cases
Pitfall 9: Inconsistent Evaluation Mode
Symptom: Comparing stochastic and deterministic evaluations
Why it's wrong: Evaluation mode affects results by 10-30%
Detection:
- No mention of evaluation mode
- Comparing algorithms with different modes
- Unclear if sampling or mean action used
Fix: Specify evaluation mode, ensure consistency across comparisons
Pitfall 10: Offline RL Without Online Validation
Symptom: Deploying offline RL policy based on Q-values alone
Why it's wrong: Q-values extrapolate OOD, unreliable
Detection:
- No online rollouts before deployment
- Claiming performance based on learned values
- Ignoring distribution shift
Fix: Staged deployment (offline → small online trial → full deployment)
Red Flags
| Red Flag | Implication | Action |
|---|---|---|
| Only one training curve shown | Single seed, cherry-picked | Demand multi-seed results |
| No error bars or confidence intervals | No variance accounting | Require statistical rigor |
| "We picked a representative seed" | Cherry-picking | Reject, require all seeds |
| No train/test split mentioned | Likely overfitting | Check evaluation protocol |
| No sample efficiency curves | Ignoring cost | Request curves or AUC |
| Evaluation mode not specified | Unclear methodology | Ask: stochastic or deterministic? |
| < 20 evaluation episodes | High variance | Require more episodes |
| Only final performance reported | Missing sample efficiency | Request performance at multiple steps |
| No generalization testing | Narrow evaluation | Request distribution shift tests |
| Offline RL with no online validation | Unreliable estimates | Require online trial |
| Results too good to be true | Probably cherry-picked or overfitting | Deep investigation |
| p-value reported without effect size | Statistically significant but practically irrelevant | Check Cohen's d |
Rationalization Table
| Rationalization | Why It's Wrong | Counter |
|---|---|---|
| "RL papers commonly use single seed, so it's acceptable" | Common ≠ correct. Field is improving standards. | "Newer venues require multi-seed. Improve rigor." |
| "Our algorithm is deterministic, variance is low" | Algorithm determinism ≠ environment/initialization determinism | "Environment randomness still causes variance." |
| "We don't have compute for 10 seeds" | Then don't make strong performance claims | "Report 3-5 seeds with caveats, or wait for compute." |
| "Evaluation on training set is faster" | Speed < correctness | "Fast wrong answer is worse than slow right answer." |
| "We care about final performance, not sample efficiency" | Depends on application, often sample efficiency matters | "Clarify deployment constraints. Samples usually matter." |
| "Stochastic/deterministic doesn't matter" | 10-30% difference is common | "Specify mode, ensure fair comparison." |
| "10 eval episodes is enough" | Standard error likely > 10% of mean | "Compute SEM, use power analysis." |
| "Our environment is simple, doesn't need generalization testing" | Deployment is rarely identical to training | "Test at least 2-3 distribution shifts." |
| "Offline RL Q-values are accurate" | Only for in-distribution, not OOD | "Q-values extrapolate. Need online validation." |
| "We reported the best run, but all were similar" | Then report all and show they're similar | "Show mean ± std to prove similarity." |
Decision Trees
Decision Tree 1: How Many Seeds?
What is the use case?
├─ Quick internal comparison
│ └─ 3-5 seeds (caveat: preliminary results)
├─ Algorithm selection for production
│ └─ 10-20 seeds
├─ Publication
│ └─ 10-20 seeds (depends on venue)
└─ Safety-critical deployment
└─ 20-50 seeds (need tight confidence intervals)
Decision Tree 2: Evaluation Mode?
Is policy inherently deterministic?
├─ YES (DQN, deterministic policies)
│ └─ Deterministic evaluation
└─ NO (PPO, SAC, stochastic policies)
├─ Will deployment use stochastic policy?
│ ├─ YES
│ │ └─ Stochastic evaluation
│ └─ NO
│ └─ Deterministic evaluation
└─ Unsure?
└─ Report BOTH, explain trade-offs
Decision Tree 3: How Many Evaluation Episodes?
What is variance estimate?
├─ Unknown
│ └─ Start with 20 episodes, estimate variance, use power analysis
├─ Known (σ)
│ ├─ Low variance (σ < 0.1 * μ)
│ │ └─ 20-50 episodes sufficient
│ ├─ Medium variance (0.1 * μ ≤ σ < 0.3 * μ)
│ │ └─ 50-100 episodes
│ └─ High variance (σ ≥ 0.3 * μ)
│ └─ 100-500 episodes (or use variance reduction techniques)
Decision Tree 4: Generalization Testing?
Is environment parameterized or procedurally generated?
├─ YES (multiple instances possible)
│ ├─ Use train/test split (80/20)
│ └─ Report both train and test performance
└─ NO (single environment)
├─ Can you create distribution shifts?
│ ├─ YES (modify dynamics, observations, etc.)
│ │ └─ Test on 3-5 distribution shifts
│ └─ NO
│ └─ At minimum, use different random seed for eval
Decision Tree 5: Offline RL Evaluation?
Can you do online rollouts?
├─ YES
│ └─ Use staged deployment (offline → small trial → full online)
├─ NO (completely offline)
│ ├─ Use conservative offline metrics
│ ├─ Compare to behavior cloning baseline
│ ├─ Clearly state limitations
│ └─ Do NOT claim actual performance, only lower bounds
└─ PARTIAL (limited online budget)
└─ Use model-based evaluation + small online trial
Workflow
Standard Evaluation Workflow
1. Pre-Experiment Planning
☐ Define evaluation protocol BEFORE running experiments
☐ Select number of seeds (minimum 5-10)
☐ Define train/test split if applicable
☐ Specify evaluation mode (stochastic/deterministic)
☐ Define sample budgets for efficiency curves
☐ Pre-register experiments (commit to protocol)
2. Training Phase
☐ Train on training environments ONLY
☐ Use separate eval environments with different seeds
☐ Evaluate at regular intervals (adaptive schedule)
☐ Save checkpoints at evaluation points
☐ Log both training and evaluation performance
3. Evaluation Phase
☐ Final evaluation on test set (never seen during training)
☐ Use sufficient episodes (50-100 minimum)
☐ Evaluate across all seeds
☐ Compute statistics (mean, std, CI, median, IQR)
☐ Test generalization (distribution shifts, zero-shot transfer)
4. Analysis Phase
☐ Compute sample efficiency metrics (AUC, reward at budgets)
☐ Statistical significance testing if comparing algorithms
☐ Check effect size (Cohen's d), not just p-value
☐ Identify failure cases and edge cases
☐ Measure robustness to perturbations
5. Reporting Phase
☐ Report all seeds, not selected subset
☐ Include mean ± std or 95% CI
☐ Show sample efficiency curves
☐ Report both training and generalization performance
☐ Specify evaluation mode
☐ Include negative results and failure analysis
☐ Provide reproducibility details (seeds, hyperparameters)
Checklist for Publication/Deployment
Statistical Rigor:
☐ Minimum 10 seeds
☐ Mean ± std or 95% CI reported
☐ Statistical significance testing (if comparing algorithms)
☐ Effect size reported (Cohen's d)
Train/Test Discipline:
☐ Separate train/test environments
☐ Different random seeds for train/eval
☐ No evaluation on training data
☐ Generalization gap reported (train vs test performance)
Comprehensive Metrics:
☐ Final performance
☐ Sample efficiency curves
☐ Performance at multiple sample budgets
☐ Evaluation mode specified (stochastic/deterministic)
Generalization:
☐ Tested on distribution shifts
☐ Zero-shot transfer evaluation (if applicable)
☐ Robustness to perturbations
Methodology:
☐ Sufficient evaluation episodes (50-100+)
☐ Evaluation protocol clearly described
☐ Reproducibility details provided
☐ Negative results included
Offline RL (if applicable):
☐ Conservative offline metrics used
☐ Online validation included (or limitations clearly stated)
☐ Comparison to behavior cloning baseline
☐ Distribution shift acknowledged
Integration with rl-debugging
RL evaluation and debugging are closely related:
Use rl-debugging when:
- Evaluation reveals poor performance
- Need to diagnose WHY agent fails
- Debugging training issues
Use rl-evaluation when:
- Agent seems to work, need to measure HOW WELL
- Comparing multiple algorithms
- Preparing for deployment
Combined workflow:
- Train agent
- Evaluate (rl-evaluation skill)
- If performance poor → Debug (rl-debugging skill)
- Fix issues
- Re-evaluate
- Repeat until satisfactory
- Final rigorous evaluation for deployment
Summary
RL evaluation is NOT just "run the agent and see what happens." It requires:
- Statistical rigor: Multi-seed, confidence intervals, significance testing
- Train/test discipline: Separate environments, no overfitting
- Comprehensive metrics: Sample efficiency, generalization, robustness
- Appropriate protocols: Evaluation mode, episode count, frequency
- Offline RL awareness: Conservative estimates, staged deployment
Without rigorous evaluation:
- You will draw wrong conclusions from noise
- You will deploy agents that fail in production
- You will waste resources on false improvements
- You will make scientifically invalid claims
With rigorous evaluation:
- Reliable performance estimates
- Valid algorithm comparisons
- Deployment-ready agents
- Reproducible research
When in doubt: More seeds, more episodes, more generalization tests.
END OF SKILL