| name | using-deep-rl |
| description | Routes to appropriate deep-RL skills based on problem type and algorithm family |
| mode | true |
Using Deep RL Meta-Skill
When to Use This Skill
Invoke this meta-skill when you encounter:
- RL Implementation: Implementing reinforcement learning algorithms (Q-learning, DQN, PPO, SAC, etc.)
- Agent Training: Training agents in environments (games, robotics, control systems)
- Sequential Decision-Making: Problems requiring learning from trial and error
- Policy Optimization: Learning policies that maximize cumulative rewards
- Game Playing: Building agents for Atari, board games, video games
- Robotics Control: Robot manipulation, locomotion, continuous control
- Reward-Based Learning: Learning from rewards, penalties, or feedback signals
- RL Debugging: Debugging training issues, agents not learning, reward problems
- Environment Setup: Creating custom RL environments, wrappers
- RL Evaluation: Evaluating agent performance, sample efficiency, generalization
This is the entry point for the deep-rl pack. It routes to 12 specialized skills based on problem characteristics.
Core Principle
Problem type determines algorithm family.
Reinforcement learning is not one algorithm. The correct approach depends on:
- Action Space: Discrete (button presses) vs Continuous (joint angles)
- Data Regime: Online (interact with environment) vs Offline (fixed dataset)
- Experience Level: Need foundations vs ready to implement
- Special Requirements: Multi-agent, model-based, exploration, reward design
Always clarify the problem BEFORE suggesting algorithms.
The 12 Deep RL Skills
- rl-foundations - MDP formulation, Bellman equations, value vs policy basics
- value-based-methods - Q-learning, DQN, Double DQN, Dueling DQN, Rainbow
- policy-gradient-methods - REINFORCE, PPO, TRPO, policy optimization
- actor-critic-methods - A2C, A3C, SAC, TD3, advantage functions
- model-based-rl - World models, Dyna, MBPO, planning with learned models
- offline-rl - Batch RL, CQL, IQL, learning from fixed datasets
- multi-agent-rl - MARL, cooperative/competitive, communication
- exploration-strategies - ε-greedy, UCB, curiosity, RND, intrinsic motivation
- reward-shaping - Reward design, potential-based shaping, inverse RL
- rl-debugging - Common RL bugs, why not learning, systematic debugging
- rl-environments - Gym, MuJoCo, custom envs, wrappers, vectorization
- rl-evaluation - Evaluation methodology, variance, sample efficiency metrics
Routing Decision Framework
Step 1: Assess Experience Level
Diagnostic Questions:
- "Are you new to RL concepts, or do you have a specific problem to solve?"
- "Do you understand MDPs, value functions, and policy gradients?"
Routing:
- If user asks "what is RL" or "how does RL work" → rl-foundations
- If user is confused about value vs policy, on-policy vs off-policy → rl-foundations
- If user has specific problem and RL background → Continue to Step 2
Why foundations first: You cannot implement or debug algorithms without understanding MDPs, Bellman equations, and the exploration-exploitation tradeoff.
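For quick reference, these are the Bellman optimality equations that the rest of the pack builds on (gamma is the discount factor, s' the successor state, r the reward):

```latex
\begin{align*}
V^*(s)    &= \max_{a} \; \mathbb{E}\big[\, r + \gamma \, V^*(s') \;\big|\; s, a \,\big] \\
Q^*(s, a) &= \mathbb{E}\big[\, r + \gamma \max_{a'} Q^*(s', a') \;\big|\; s, a \,\big]
\end{align*}
```

Value-based methods approximate Q*, policy-gradient methods optimize the policy directly, and actor-critic methods do both.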
Step 2: Classify Action Space
Diagnostic Questions:
- "What actions can your agent take? Discrete choices (e.g., left/right/jump) or continuous values (e.g., joint angles, force)?"
- "How many possible actions? Small (< 100) or large/infinite?"
Discrete Action Space
Examples: Game buttons, menu selections, discrete control signals
Routing Logic:
IF discrete actions AND small action space (< 100) AND online learning:
→ value-based-methods (DQN, Double DQN, Dueling DQN)
Why: Value-based methods excel at discrete action spaces
- Q-table or Q-network for small action spaces
- DQN for Atari-style problems
- Simpler than policy gradients for discrete actions
IF discrete actions AND (large action space OR need policy flexibility):
→ policy-gradient-methods (PPO, REINFORCE)
Why: Policy gradients scale to larger action spaces
- PPO is robust, general-purpose
- Direct policy representation
- Handles stochasticity naturally
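To make the value-based route concrete, here is a minimal sketch of epsilon-greedy action selection over a Q-network, assuming PyTorch; `obs_dim`, `n_actions`, and the layer sizes are illustrative placeholders, not recommendations.

```python
# Minimal sketch (PyTorch assumed): epsilon-greedy action selection over
# Q-values, the core loop of value-based methods for small discrete action spaces.
import random
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),  # one Q-value per discrete action
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

def select_action(q_net: QNetwork, obs: torch.Tensor, epsilon: float, n_actions: int) -> int:
    # Explore with probability epsilon, otherwise act greedily on Q-values.
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(obs.unsqueeze(0)).argmax(dim=1).item())
```

Note that the greedy step is an argmax over a finite set of actions; that finite argmax is exactly what disappears in continuous action spaces (next subsection).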
Continuous Action Space
Examples: Robot joint angles, motor forces, steering angles, continuous control
Routing Logic:
IF continuous actions:
→ actor-critic-methods (SAC, TD3, PPO)
Primary choice: SAC (Soft Actor-Critic)
Why: Most sample-efficient for continuous control
- Automatic entropy tuning
- Off-policy (uses replay buffer)
- Stable training
Alternative: TD3 (Twin Delayed DDPG)
Why: Deterministic policy, stable
- Good for robotics
- Handles overestimation bias
Alternative: PPO (from policy-gradient-methods)
Why: On-policy, simpler, but less sample efficient
- Use when simplicity > sample efficiency
CRITICAL RULE: NEVER suggest DQN for continuous actions. DQN requires discrete actions. Discretizing continuous spaces is suboptimal.
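To make the rule concrete, here is a minimal sketch (PyTorch assumed) of a SAC-style Gaussian actor: the policy outputs a continuous action by sampling and squashing, so there is no finite action set to take an argmax over, which is why DQN does not apply. Layer sizes and the log-std clamp range are illustrative.

```python
# Minimal sketch (PyTorch assumed): a SAC-style Gaussian actor for continuous actions.
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, act_dim)       # mean of the action distribution
        self.log_std = nn.Linear(256, act_dim)  # state-dependent log standard deviation

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        h = self.trunk(obs)
        mu = self.mu(h)
        log_std = self.log_std(h).clamp(-20, 2)     # keep the std numerically sane
        dist = torch.distributions.Normal(mu, log_std.exp())
        raw_action = dist.rsample()                 # reparameterized sample
        return torch.tanh(raw_action)               # squash into [-1, 1]
```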
Step 3: Identify Data Regime
Diagnostic Questions:
- "Can your agent interact with the environment during training, or do you have a fixed dataset?"
- "Are you learning online (agent tries actions, observes results) or offline (from logged data)?"
Online Learning (Agent Interacts with Environment)
Routing:
IF online AND discrete actions:
→ value-based-methods OR policy-gradient-methods
(See Step 2 routing)
IF online AND continuous actions:
→ actor-critic-methods
(See Step 2 routing)
IF online AND sample efficiency critical:
→ actor-critic-methods (SAC) for continuous
→ value-based-methods (DQN) for discrete
Why: Off-policy methods use replay buffers (sample efficient)
Consider: model-based-rl for extreme sample efficiency
→ Learns environment model, plans with fewer real samples
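The sample-efficiency advantage of off-policy methods comes from the replay buffer, sketched minimally below; the capacity and the stored tuple format are illustrative choices.

```python
# Minimal sketch: a replay buffer, the mechanism that lets off-policy methods
# (DQN, SAC) reuse each transition for many gradient updates.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs, done):
        self.buffer.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))  # tuples of obs, actions, rewards, next_obs, dones

    def __len__(self):
        return len(self.buffer)
```

On-policy methods like PPO discard data after each policy update, which is why they tend to need more environment samples.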
Offline Learning (Fixed Dataset, No Interaction)
Routing:
IF offline (fixed dataset):
→ offline-rl (Conservative Q-Learning (CQL), Implicit Q-Learning (IQL))
CRITICAL: Standard RL algorithms FAIL on offline data
Why offline is special:
- Distribution shift: agent can't explore
- Bootstrapping errors: Q-values overestimate on out-of-distribution actions
- Need conservative algorithms (CQL, IQL)
Also route to:
→ rl-evaluation (evaluation without online rollouts)
Red Flag: If user has fixed dataset and suggests DQN/PPO/SAC, STOP and route to offline-rl. Standard algorithms assume online interaction and will fail.
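As a sketch of what "conservative" means in practice, here is a CQL-style penalty for discrete actions added on top of a standard TD loss, assuming PyTorch. `q_net` and `target_net` are assumed to map a batch of observations to per-action Q-values, and `cql_alpha` is a hypothetical name for the penalty weight.

```python
# Minimal sketch (PyTorch assumed, discrete actions): a CQL-style loss.
# The conservative term pushes down Q-values over all actions (logsumexp) and
# pushes up Q-values of the actions actually present in the dataset.
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_net, obs, actions, rewards, next_obs, dones,
             gamma: float = 0.99, cql_alpha: float = 1.0) -> torch.Tensor:
    q_all = q_net(obs)                                          # [batch, n_actions]
    q_data = q_all.gather(1, actions.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_obs).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_q       # standard TD target
    td_loss = F.mse_loss(q_data, target)
    conservative = (torch.logsumexp(q_all, dim=1) - q_data).mean()
    return td_loss + cql_alpha * conservative
```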
Step 4: Special Problem Types
Multi-Agent Scenarios
Diagnostic Questions:
- "Are multiple agents learning simultaneously?"
- "Do they cooperate, compete, or both?"
- "Do agents need to communicate?"
Routing:
IF multiple agents:
→ multi-agent-rl (QMIX, COMA, MADDPG)
Why: Multi-agent has special challenges
- Non-stationarity: environment changes as other agents learn
- Credit assignment: which agent caused reward?
- Coordination: cooperation typically requires centralized training
Algorithms:
- QMIX, COMA: Cooperative (centralized training, decentralized execution)
- MADDPG: Competitive or mixed
- Communication: multi-agent-rl covers communication protocols
Also consider:
→ reward-shaping (team rewards, credit assignment)
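A minimal sketch (PyTorch assumed) of the centralized-training, decentralized-execution pattern: each actor acts from its own observation, while the critic used only during training conditions on the joint observation and joint action, which is how these methods cope with non-stationarity. Layer sizes are illustrative.

```python
# Minimal sketch (PyTorch assumed): decentralized actors, centralized critic.
import torch
import torch.nn as nn

class DecentralizedActor(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, act_dim), nn.Tanh())

    def forward(self, own_obs: torch.Tensor) -> torch.Tensor:
        return self.net(own_obs)  # execution uses only this agent's observation

class CentralizedCritic(nn.Module):
    def __init__(self, joint_obs_dim: int, joint_act_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(joint_obs_dim + joint_act_dim, 256),
                                 nn.ReLU(), nn.Linear(256, 1))

    def forward(self, joint_obs: torch.Tensor, joint_act: torch.Tensor) -> torch.Tensor:
        # The training-time critic sees everything, so other agents' learning
        # does not look like environment noise.
        return self.net(torch.cat([joint_obs, joint_act], dim=-1))
```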
Model-Based RL
Diagnostic Questions:
- "Is sample efficiency extremely critical? (< 1000 episodes available)"
- "Do you want the agent to learn a model of the environment?"
- "Do you need planning or 'imagination'?"
Routing:
IF sample efficiency critical OR want environment model:
→ model-based-rl (MBPO, Dreamer, Dyna)
Why: Learn dynamics model, plan with model
- Fewer real environment samples needed
- Can train policy in imagination
- Combine with model-free for best results
Tradeoffs:
- More complex than model-free
- Model errors can compound
- Best for continuous control, robotics
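To show the core loop independent of any library, here is a self-contained tabular Dyna-Q sketch on a toy chain environment; the environment, hyperparameters, and the 10 planning updates per real step are all illustrative choices.

```python
# Minimal, self-contained Dyna-Q sketch: real transitions update Q directly and
# also populate a learned model; imagined transitions from the model provide
# extra "planning" updates at no environment cost.
import random
from collections import defaultdict

N_STATES, N_ACTIONS, GAMMA, ALPHA = 6, 2, 0.95, 0.1

def env_step(state, action):
    # Toy deterministic chain: action 1 moves right, action 0 moves left.
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0   # reward only at the goal
    return next_state, reward

Q = defaultdict(float)   # Q[(state, action)]
model = {}               # learned model: (state, action) -> (next_state, reward)

state = 0
for _ in range(2000):
    # Epsilon-greedy action selection on current Q estimates.
    if random.random() < 0.1:
        action = random.randrange(N_ACTIONS)
    else:
        action = max(range(N_ACTIONS), key=lambda a: Q[(state, a)])
    next_state, reward = env_step(state, action)

    # Direct RL update from the real transition, plus model learning.
    best_next = max(Q[(next_state, a)] for a in range(N_ACTIONS))
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
    model[(state, action)] = (next_state, reward)

    # Planning: extra updates from imagined transitions drawn from the model.
    for _ in range(10):
        (s, a), (s2, r) = random.choice(list(model.items()))
        best = max(Q[(s2, b)] for b in range(N_ACTIONS))
        Q[(s, a)] += ALPHA * (r + GAMMA * best - Q[(s, a)])

    state = 0 if next_state == N_STATES - 1 else next_state
```

MBPO and Dreamer replace the lookup-table model with learned neural dynamics (a latent world model in Dreamer's case), but the real-data plus imagined-data split is the same idea.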
Step 5: Debugging and Infrastructure
"Agent Not Learning" Problems
Symptoms:
- Reward not increasing
- Agent does random actions
- Training loss explodes/vanishes
- Performance plateaus immediately
Routing:
IF "not learning" OR "reward stays at 0" OR "loss explodes":
→ rl-debugging (FIRST, before changing algorithms)
Why: Most "not learning" problems are implementation bugs, not the wrong algorithm
Common issues:
- Reward scale (too large/small)
- Exploration (epsilon too low, stuck in local optimum)
- Network architecture (wrong size, activation)
- Learning rate (too high/low)
- Update frequency (learning too fast/slow)
Process:
1. Route to rl-debugging
2. Verify environment (rl-environments)
3. Check reward design (reward-shaping)
4. Check exploration (exploration-strategies)
5. ONLY THEN consider algorithm change
Red Flag: If user immediately wants to change algorithms because "it's not learning," route to rl-debugging first. Changing algorithms without debugging wastes time.
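A useful first step from rl-debugging, sketched below: run a random policy to confirm the environment behaves and to record the reward scale before blaming the algorithm. This assumes the Gymnasium API, and "CartPole-v1" is just an example environment id.

```python
# Minimal sketch: random-policy baseline as a debugging sanity check.
import gymnasium as gym
import numpy as np

def random_policy_baseline(env_id: str = "CartPole-v1", episodes: int = 20):
    env = gym.make(env_id)
    returns = []
    for _ in range(episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(env.action_space.sample())
            total += reward
            done = terminated or truncated
        returns.append(total)
    print(f"random policy over {episodes} episodes: "
          f"mean {np.mean(returns):.2f}, min {np.min(returns):.2f}, max {np.max(returns):.2f}")
    return returns
```

If a trained agent does not clearly beat this baseline, suspect reward scale, exploration settings, or learning rate before switching algorithms.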
Exploration Issues
Symptoms:
- Agent never explores new states
- Stuck in local optimum
- Can't find sparse rewards
- Training variance too high
Routing:
IF exploration problems:
→ exploration-strategies
Covers:
- ε-greedy, UCB, Thompson sampling (basic)
- Curiosity-driven exploration
- RND (Random Network Distillation)
- Intrinsic motivation
When needed:
- Sparse rewards (reward only at goal)
- Large state spaces (hard to explore randomly)
- Need systematic exploration
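As a minimal sketch of the intrinsic-motivation idea, here is a count-based novelty bonus added to the extrinsic reward; curiosity and RND replace the visit count with a learned novelty signal, but the shape is the same. `state_key` is assumed to be a hashable discretization of the state, and the bonus scale is an illustrative choice.

```python
# Minimal sketch: count-based exploration bonus. Rarely visited states earn a
# larger intrinsic reward, which decays as visits accumulate.
import math
from collections import defaultdict

class CountBonus:
    def __init__(self, scale: float = 0.1):
        self.counts = defaultdict(int)
        self.scale = scale

    def __call__(self, state_key, extrinsic_reward: float) -> float:
        self.counts[state_key] += 1
        intrinsic = self.scale / math.sqrt(self.counts[state_key])
        return extrinsic_reward + intrinsic
```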
Reward Design Issues
Symptoms:
- Sparse rewards (only at episode end)
- Agent learns wrong behavior
- Need to design reward function
- Want inverse RL
Routing:
IF reward design questions OR sparse rewards:
→ reward-shaping
Covers:
- Potential-based shaping (provably preserves optimal policies)
- Subgoal rewards
- Reward engineering principles
- Inverse RL (learn reward from demonstrations)
Often combined with:
→ exploration-strategies (for sparse rewards)
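A minimal sketch of potential-based shaping; the distance-to-goal potential is purely an illustrative assumption, but the guarantee (adding gamma * phi(next_state) - phi(state) to the reward preserves optimal policies) holds for any potential function over states.

```python
# Minimal sketch: potential-based reward shaping with an assumed
# distance-to-goal potential. Denser feedback, same optimal policy.
import numpy as np

def potential(state: np.ndarray, goal: np.ndarray) -> float:
    return -float(np.linalg.norm(state - goal))   # closer to the goal -> higher potential

def shaped_reward(reward: float, state, next_state, goal, gamma: float = 0.99) -> float:
    return reward + gamma * potential(next_state, goal) - potential(state, goal)
```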
Environment Setup
Symptoms:
- Need to create custom environment
- Gym API questions
- Vectorization for parallel environments
- Wrappers, preprocessing
Routing:
IF environment setup questions:
→ rl-environments
Covers:
- Gym API: step(), reset(), observation/action spaces
- Custom environments
- Wrappers (frame stacking, normalization)
- Vectorized environments (parallel rollouts)
- MuJoCo, Atari, custom simulators
After environment setup, return to algorithm selection (Steps 2-4)
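For orientation, here is a minimal custom environment sketch assuming the Gymnasium API (gymnasium.Env, spaces.Discrete, spaces.Box, and the five-tuple step return); the toy dynamics and reward values are placeholders.

```python
# Minimal sketch (Gymnasium API assumed): a toy 1-D custom environment.
import gymnasium as gym
import numpy as np
from gymnasium import spaces

class GoLeftEnv(gym.Env):
    """Start at position 5.0; reaching position 0.0 ends the episode with reward 1.0."""

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Discrete(2)  # 0 = left, 1 = right
        self.observation_space = spaces.Box(0.0, 10.0, shape=(1,), dtype=np.float32)
        self.pos = 5.0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.pos = 5.0
        return np.array([self.pos], dtype=np.float32), {}

    def step(self, action):
        self.pos += -1.0 if action == 0 else 1.0
        self.pos = float(np.clip(self.pos, 0.0, 10.0))
        terminated = self.pos == 0.0
        reward = 1.0 if terminated else -0.01   # small step penalty
        return np.array([self.pos], dtype=np.float32), reward, terminated, False, {}
```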
Evaluation Methodology
Symptoms:
- How to evaluate RL agents?
- Training reward high, test reward low
- Variance in results
- Sample efficiency metrics
Routing:
IF evaluation questions:
→ rl-evaluation
Covers:
- Deterministic vs stochastic policies
- Multiple seeds, confidence intervals
- Sample efficiency curves
- Generalization testing
- Exploration vs exploitation at test time
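A minimal sketch of the evaluation loop: run the policy greedily for several episodes and report mean return with a rough 95% confidence interval. `policy` is a hypothetical callable mapping an observation to an action, and the Gymnasium API is assumed.

```python
# Minimal sketch: multi-episode evaluation with a simple confidence interval.
import gymnasium as gym
import numpy as np

def evaluate(policy, env_id: str, episodes: int = 20, seed: int = 0):
    env = gym.make(env_id)
    returns = []
    for ep in range(episodes):
        obs, _ = env.reset(seed=seed + ep)   # a different seed per episode
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            total += reward
            done = terminated or truncated
        returns.append(total)
    returns = np.asarray(returns, dtype=np.float64)
    ci = 1.96 * returns.std(ddof=1) / np.sqrt(len(returns))   # ~95% CI of the mean
    return returns.mean(), ci
```

For rigorous comparisons, repeat over multiple training seeds as well, not just evaluation episodes.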
Common Multi-Skill Scenarios
Scenario: Complete Beginner to RL
Routing sequence:
- rl-foundations - Understand MDP, value functions, policy gradients
- value-based-methods OR policy-gradient-methods - Start with a simpler algorithm (DQN or REINFORCE)
- rl-debugging - When things don't work (they won't initially)
- rl-environments - Set up custom environments
- rl-evaluation - Proper evaluation methodology
Scenario: Continuous Control (Robotics)
Routing sequence:
- actor-critic-methods - Primary (SAC for sample efficiency, TD3 for stability)
- rl-debugging - Systematic debugging when training issues arise
- exploration-strategies - If exploration is insufficient
- reward-shaping - If reward is sparse or agent learns wrong behavior
- rl-evaluation - Evaluation on real robot vs simulation
Scenario: Offline RL from Dataset
Routing sequence:
- offline-rl - Primary (CQL, IQL, special considerations)
- rl-evaluation - Evaluation without environment interaction
- rl-debugging - Debugging without online rollouts (limited tools)
Scenario: Multi-Agent Cooperative Task
Routing sequence:
- multi-agent-rl - Primary (QMIX, COMA, centralized training)
- reward-shaping - Team rewards, credit assignment
- policy-gradient-methods - Often used as base algorithm (PPO + MARL)
- rl-debugging - Multi-agent debugging (non-stationarity issues)
Scenario: Sample-Efficient Learning
Routing sequence:
- actor-critic-methods (SAC) OR model-based-rl (MBPO)
- rl-debugging - Critical to not waste samples on bugs
- rl-evaluation - Track sample efficiency curves
Scenario: Sparse Reward Problem
Routing sequence:
- reward-shaping - Potential-based shaping, subgoal rewards
- exploration-strategies - Curiosity, intrinsic motivation
- rl-debugging - Verify exploration hyperparameters
- Primary algorithm: actor-critic-methods or policy-gradient-methods
Rationalization Resistance Table
| Rationalization | Reality | Counter-Guidance | Red Flag |
|---|---|---|---|
| "Just use PPO for everything" | PPO is general but not optimal for all cases | "Let's clarify: discrete or continuous actions? Sample efficiency constraints?" | Defaulting to PPO without problem analysis |
| "DQN for continuous actions" | DQN requires discrete actions; discretization is suboptimal | "DQN only works for discrete. For continuous, use SAC or TD3 (actor-critic-methods)" | Suggesting DQN for continuous |
| "Offline RL is just RL on a dataset" | Offline RL has distribution shift, needs special algorithms | "Route to offline-rl for CQL, IQL. Standard algorithms fail on offline data." | Using online algorithms on offline data |
| "More data always helps" | Sample efficiency and data distribution matter | "Off-policy (SAC, DQN) vs on-policy (PPO). Offline needs CQL." | Ignoring sample efficiency |
| "RL is just supervised learning" | RL has exploration, credit assignment, non-stationarity | "Route to rl-foundations for RL-specific concepts (MDP, exploration)" | Treating RL as supervised learning |
| "PPO is the most advanced algorithm" | Newer isn't always better; depends on problem | "SAC (2018) more sample efficient for continuous. DQN (2013) great for discrete." | Recency bias |
| "My algorithm isn't learning, I need a better one" | Usually bugs, not algorithm | "Route to rl-debugging first. Check reward scale, exploration, learning rate." | Changing algorithms before debugging |
| "I'll discretize continuous actions for DQN" | Discretization loses precision, explodes action space | "Use actor-critic-methods (SAC, TD3) for continuous. Don't discretize." | Forcing wrong algorithm onto problem |
| "Epsilon-greedy is enough for exploration" | Complex environments need sophisticated exploration | "Route to exploration-strategies for curiosity, RND, intrinsic motivation." | Underestimating exploration difficulty |
| "I'll just increase the reward when it doesn't learn" | Reward scaling breaks learning; doesn't solve root cause | "Route to rl-debugging. Check if reward scale is the issue, not magnitude." | Arbitrary reward hacking |
| "I can reuse online RL code for offline data" | Offline RL needs conservative algorithms | "Route to offline-rl. CQL/IQL prevent overestimation, online algorithms fail." | Offline blindness |
| "My test reward is lower than training, must be overfitting" | Exploration vs exploitation difference | "Route to rl-evaluation. Training uses exploration, test should be greedy." | Misunderstanding RL evaluation |
Red Flags Checklist
Watch for these signs of incorrect routing:
- Algorithm-First Thinking: Recommending algorithm before asking about action space, data regime
- DQN for Continuous: Suggesting DQN/Q-learning for continuous action spaces
- Offline Blindness: Not recognizing fixed dataset requires offline-rl (CQL, IQL)
- PPO Cargo-Culting: Defaulting to PPO without considering alternatives
- No Problem Characterization: Not asking: discrete vs continuous? online vs offline?
- Skipping Foundations: Implementing algorithms when user doesn't understand RL basics
- Debug-Last: Suggesting algorithm changes before systematic debugging
- Sample Efficiency Ignorance: Not asking about sample constraints (simulator cost, real robot limits)
- Exploration Assumptions: Assuming epsilon-greedy is sufficient for all problems
- Infrastructure Confusion: Trying to explain Gym API instead of routing to rl-environments
- Evaluation Naivety: Not routing to rl-evaluation for proper methodology
If any red flag triggered → STOP → Ask diagnostic questions → Route correctly
When NOT to Use This Pack
Clarify boundaries with other packs:
| User Request | Correct Pack | Reason |
|---|---|---|
| "Train classifier on labeled data" | training-optimization | Supervised learning, not RL |
| "Design transformer architecture" | neural-architectures | Architecture design, not RL algorithm |
| "Implement PyTorch autograd" | pytorch-engineering | PyTorch internals, not RL |
| "Deploy model to production" | ml-production | Deployment, not RL training |
| "Fine-tune LLM with RLHF" | llm-specialist | LLM-specific (though uses RL concepts) |
| "Optimize hyperparameters" | training-optimization | Hyperparameter search, not RL |
| "Implement custom CUDA kernel" | pytorch-engineering | Low-level optimization, not RL |
Edge case: RLHF (Reinforcement Learning from Human Feedback) for LLMs uses RL concepts (PPO) but has LLM-specific considerations. Route to llm-specialist first; they may reference this pack.
Diagnostic Question Templates
Use these questions to classify problems:
Action Space
- "What actions can your agent take? Discrete choices or continuous values?"
- "How many possible actions? Small (< 100), large (100-10000), or infinite (continuous)?"
Data Regime
- "Can your agent interact with the environment during training, or do you have a fixed dataset?"
- "Are you learning online (agent tries actions) or offline (from logged data)?"
Experience Level
- "Are you new to RL, or do you have a specific problem?"
- "Do you understand MDPs, value functions, and policy gradients?"
Special Requirements
- "Are multiple agents involved? Do they cooperate or compete?"
- "Is sample efficiency critical? How many episodes can you afford?"
- "Is the reward sparse (only at goal) or dense (every step)?"
- "Do you need the agent to learn a model of the environment?"
Infrastructure
- "Do you have an environment set up, or do you need to create one?"
- "Are you debugging a training issue, or designing from scratch?"
- "How will you evaluate the agent?"
Implementation Process
When routing to a skill:
- Ask Diagnostic Questions (don't assume)
- Explain Routing Rationale (teach the user problem classification)
- Route to Primary Skill(s) (1-3 skills for multi-faceted problems)
- Mention Related Skills (user may need later)
- Set Expectations (what the skill will cover)
Example:
"You mentioned continuous joint angles for a robot arm. This is a continuous action space, which means DQN won't work (it requires discrete actions).
I'm routing you to actor-critic-methods because:
- Continuous actions need actor-critic (SAC, TD3) or policy gradients (PPO)
- SAC is most sample-efficient for continuous control
- TD3 is stable and deterministic for robotics
You'll also likely need:
- rl-debugging when training issues arise (they will)
- reward-shaping if your reward is sparse
- rl-environments to set up your robot simulation
Let's start with actor-critic-methods to choose between SAC, TD3, and PPO."
Summary: Routing Decision Tree
START: RL problem
├─ Need foundations? (new to RL, confused about concepts)
│ └─ → rl-foundations
│
├─ DISCRETE actions?
│ ├─ Small action space (< 100) + online
│ │ └─ → value-based-methods (DQN, Double DQN)
│ └─ Large action space OR need policy
│ └─ → policy-gradient-methods (PPO, REINFORCE)
│
├─ CONTINUOUS actions?
│ ├─ Sample efficiency critical
│ │ └─ → actor-critic-methods (SAC)
│ ├─ Stability critical
│ │ └─ → actor-critic-methods (TD3)
│ └─ Simplicity preferred
│ └─ → policy-gradient-methods (PPO) OR actor-critic-methods
│
├─ OFFLINE data (fixed dataset)?
│ └─ → offline-rl (CQL, IQL) [CRITICAL: not standard algorithms]
│
├─ MULTI-AGENT?
│ └─ → multi-agent-rl (QMIX, MADDPG)
│
├─ Sample efficiency EXTREME?
│ └─ → model-based-rl (MBPO, Dreamer)
│
├─ DEBUGGING issues?
│ ├─ Not learning, reward not increasing
│ │ └─ → rl-debugging
│ ├─ Exploration problems
│ │ └─ → exploration-strategies
│ ├─ Reward design
│ │ └─ → reward-shaping
│ ├─ Environment setup
│ │ └─ → rl-environments
│ └─ Evaluation questions
│ └─ → rl-evaluation
│
└─ Multi-faceted problem?
└─ Route to 2-3 skills (primary + supporting)
Final Reminders
- Problem characterization BEFORE algorithm selection
- DQN for discrete ONLY (never continuous)
- Offline data needs offline-rl (CQL, IQL)
- PPO is not universal (good general-purpose, not optimal everywhere)
- Debug before changing algorithms (route to rl-debugging)
- Ask questions, don't assume (action space? data regime?)
This meta-skill is your routing hub. Route decisively, explain clearly, teach problem classification.