| name | account-aware-training |
| description | Add account state (P&L, win rate, drawdown) to RL observations + drawdown penalty in rewards. Trigger when: (1) model needs account awareness, (2) training should penalize drawdowns, (3) upgrading obs_dim 5300→5600. |
| author | Claude Code |
| date | 2024-12-26 |
Account-Aware RL Training (v2.4)
Experiment Overview
| Item | Details |
|---|---|
| Date | 2024-12-26 |
| Goal | Make RL model learn from account state (P&L, win rate, drawdown) |
| Environment | vectorized_env.py, inference_obs_builder.py, training notebook |
| Status | Success |
Context
Prior to v2.4, the RL model was "blind" to account performance. It received:
- 53 features: price action, technicals, regime probabilities, calendar effects
- No information about cumulative P&L, win rate, or drawdown
Problem: The model could generate signals that were individually good but led to excessive drawdowns at the account level. It had no incentive to trade conservatively after losses.
Solution: Add 3 account-level features + drawdown penalty in rewards.
Verified Workflow
1. Config Parameters (GPUEnvConfig)
# In vectorized_env.py GPUEnvConfig dataclass (~line 405)
# Account-aware training (v2.4)
drawdown_penalty_threshold: float = 0.15 # Penalize when drawdown > 15%
drawdown_penalty_weight: float = 0.10 # Weight in reward function
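A minimal, self-contained sketch of how these two knobs could sit in a config dataclass; the real GPUEnvConfig has many more fields, and the class name below is illustrative only:

```python
# Minimal sketch only -- the real GPUEnvConfig in vectorized_env.py has many more fields.
from dataclasses import dataclass

@dataclass
class AccountAwareConfigSketch:
    n_features: int = 56                      # 53 market features + 3 account features (v2.4)
    window: int = 100                         # observation window length
    drawdown_penalty_threshold: float = 0.15  # penalty starts once drawdown exceeds 15%
    drawdown_penalty_weight: float = 0.10     # weight of the drawdown penalty in the reward mix
```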
2. Equity Tracking Tensors
# In _init_state_tensors() after line 712
# Account-level equity tracking (v2.4)
self.initial_equity = torch.ones(self.n_envs, dtype=self.dtype, device=self.device)
self.peak_equity = torch.ones(self.n_envs, dtype=self.dtype, device=self.device)
self.current_equity = torch.ones(self.n_envs, dtype=self.dtype, device=self.device)
3. Reset Equity Tensors
# In reset() after line 850
# Reset account-level equity tracking
self.initial_equity[env_ids] = 1.0
self.peak_equity[env_ids] = 1.0
self.current_equity[env_ids] = 1.0
4. Update Equity in step()
# In step() after line 926
# Update account-level equity tracking (v2.4)
self.current_equity = self.initial_equity + self.total_pnl / (current_prices + 1e-8)
self.peak_equity = torch.maximum(self.peak_equity, self.current_equity)
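To see what the two lines above do, here is a toy run of the same high-water-mark logic on a single hypothetical equity path (tensor names mirror the ones used in step(); the P&L values are illustrative):

```python
# Toy check of the high-water-mark logic above on one hypothetical equity path.
import torch

initial_equity = torch.ones(1)
peak_equity = torch.ones(1)
# Cumulative P&L expressed as a fraction of price (illustrative values only).
for pnl_frac in (0.05, 0.12, -0.03, -0.10, 0.02):
    current_equity = initial_equity + torch.tensor([pnl_frac])
    peak_equity = torch.maximum(peak_equity, current_equity)  # high-water mark never decreases
    drawdown = (peak_equity - current_equity) / (peak_equity + 1e-8)
    print(f"equity={current_equity.item():.3f}  peak={peak_equity.item():.3f}  drawdown={drawdown.item():.3f}")
# Once the peak reaches 1.12 it stays there, so later drawdowns are measured from that high-water mark.
```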
5. Feature Count Update
# In _calculate_obs_features() line 682
# Add account features
account = 3 # total_pnl_pct, rolling_win_rate, current_drawdown_pct
return base + technical + intraday + temporal + markov + extended + multi_window + account
# Result: 53 + 3 = 56 features
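For reference, the observation tensor those 56 features feed into has shape (n_envs, window, n_features); flattening the last two dimensions gives the 5600-dim observation per environment (n_envs below is arbitrary):

```python
# Shape sketch for the per-environment observation; n_envs is arbitrary here.
import torch

n_envs, window, n_features = 8, 100, 53 + 3
obs = torch.zeros(n_envs, window, n_features)
print(obs.shape)                       # torch.Size([8, 100, 56])
print(obs.reshape(n_envs, -1).shape)   # torch.Size([8, 5600]) -> obs_dim = 5600
```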
6. Account Features in Observations
# In _get_observations() after line 1258, before sanitization
# === ACCOUNT-LEVEL FEATURES (3) - v2.4 ===
# Feature 1: Total P&L % (normalized to [-1, 1])
total_pnl_pct = self.total_pnl / (self.initial_equity + 1e-8)
total_pnl_pct_norm = torch.tanh(total_pnl_pct * 10)
obs[:, :, feat_idx] = total_pnl_pct_norm[env_ids].unsqueeze(1).expand(-1, self.config.window)
feat_idx += 1
# Feature 2: Rolling win rate (0.5 if no trades)
win_rate = torch.where(
self.n_trades[env_ids] > 0,
self.n_wins[env_ids].float() / self.n_trades[env_ids].float(),
torch.full((n_envs,), 0.5, dtype=self.dtype, device=self.device)
)
obs[:, :, feat_idx] = win_rate.unsqueeze(1).expand(-1, self.config.window)
feat_idx += 1
# Feature 3: Current drawdown % [0, 1]
drawdown = (self.peak_equity[env_ids] - self.current_equity[env_ids]) / (self.peak_equity[env_ids] + 1e-8)
drawdown = torch.clamp(drawdown, 0.0, 1.0)
obs[:, :, feat_idx] = drawdown.unsqueeze(1).expand(-1, self.config.window)
feat_idx += 1
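A quick look at the normalization used for feature 1: tanh(pnl_pct * 10) keeps the feature in [-1, 1] and saturates for large swings (illustrative values):

```python
# Behavior of the tanh normalization used for the total P&L feature.
import torch

pnl_pct = torch.tensor([-0.50, -0.10, -0.02, 0.00, 0.02, 0.10, 0.50])
print(torch.tanh(pnl_pct * 10))
# tensor([-0.9999, -0.7616, -0.1974,  0.0000,  0.1974,  0.7616,  0.9999])
```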
7. Drawdown Penalty in Rewards
# In _calculate_rewards() after line 1618
# COMPONENT 7: Drawdown penalty (v2.4)
current_drawdown = (self.peak_equity - self.current_equity) / (self.peak_equity + 1e-8)
current_drawdown = torch.clamp(current_drawdown, 0.0, 1.0)
# Quadratic penalty when over threshold
drawdown_over_threshold = torch.clamp(current_drawdown - self.config.drawdown_penalty_threshold, min=0.0)
drawdown_penalty = -drawdown_over_threshold ** 2 * 10
# Add to reward combination:
reward = (
self.config.direction_weight * direction_reward +
self.config.magnitude_weight * magnitude_reward +
self.config.pnl_weight * pnl_reward +
self.config.stop_tp_weight * stop_tp_reward +
self.config.exploration_weight * exploration_bonus +
self.config.slippage_weight * slippage_penalty +
self.config.drawdown_penalty_weight * drawdown_penalty # NEW
) * risk_adjustment
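To check how the quadratic penalty scales (threshold 0.15, internal scale 10, before the 0.10 reward weight is applied), the numbers below show it is negligible just past the threshold and grows quickly for deep drawdowns:

```python
# Numerical look at the quadratic drawdown penalty before the reward weight is applied.
import torch

threshold = 0.15
drawdowns = torch.tensor([0.10, 0.16, 0.20, 0.25, 0.35])
over = torch.clamp(drawdowns - threshold, min=0.0)
penalty = -over ** 2 * 10
for dd, p in zip(drawdowns.tolist(), penalty.tolist()):
    print(f"drawdown={dd:.2f} -> penalty={p:.4f}")
# Zero penalty at or below the threshold; roughly -0.001 at 16%, -0.025 at 20%,
# -0.1 at 25%, and -0.4 at 35% drawdown.
```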
8. Inference Observation Builder
# In inference_obs_builder.py get_target_features_from_obs_dim()
if features == 56:
return 56 # v2.4 with account awareness
elif features == 53:
return 53 # v2.3
# ... legacy support
# In build_inference_observation() after line 624
# === ACCOUNT-LEVEL FEATURES (3) - v2.4 ===
# Use neutral defaults during inference
if target_features >= 56:
obs[:, feat_idx] = 0.0 # total_pnl_pct (no prior trades)
feat_idx += 1
obs[:, feat_idx] = 0.5 # win_rate (neutral prior)
feat_idx += 1
obs[:, feat_idx] = 0.0 # drawdown (no drawdown)
feat_idx += 1
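A hedged sketch of the padding idea: when only the 53 market features are available at inference time, the three neutral account defaults (0, 0.5, 0) can be appended to reach the v2.4 layout. The helper name and array layout below are illustrative, not the actual inference_obs_builder.py code:

```python
# Illustrative only -- shows the neutral-default padding idea, not the actual builder code.
import numpy as np

NEUTRAL_ACCOUNT_FEATURES = (0.0, 0.5, 0.0)  # total_pnl_pct, win_rate, drawdown

def append_account_defaults(obs_53: np.ndarray) -> np.ndarray:
    """Pad a (window, 53) v2.3-style observation to the (window, 56) v2.4 layout."""
    window = obs_53.shape[0]
    account = np.tile(NEUTRAL_ACCOUNT_FEATURES, (window, 1)).astype(obs_53.dtype)
    return np.concatenate([obs_53, account], axis=1)

obs_v24 = append_account_defaults(np.zeros((100, 53), dtype=np.float32))
assert obs_v24.shape == (100, 56)
```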
Failed Attempts (Critical)
| Attempt | Why it Failed | Lesson Learned |
|---|---|---|
| Account features with raw P&L values | P&L scale varies by price level | Use P&L percentage normalized with tanh |
| Win rate = 0 when no trades | Invalid input during initial episodes | Default to 0.5 (neutral prior) |
| Peak equity never decreasing | Logical error in update | Use torch.maximum() to track high-water mark |
| Drawdown penalty linear in drawdown | Too harsh at moderate drawdown levels | Quadratic scaling stays gentle just above the threshold and ramps up for deep drawdowns |
| Live inference with account state | Would need real account connection | Use neutral defaults (0, 0.5, 0) for inference |
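A minimal illustration of the win-rate lesson from the table: environments with no trades get a neutral 0.5 rather than 0, so the model is not fed a false "losing" signal early in an episode:

```python
# Win-rate default: 0.5 for environments with no trades yet (mirrors the torch.where in step 6).
import torch

n_trades = torch.tensor([0, 4, 10])
n_wins = torch.tensor([0, 3, 4])
win_rate = torch.where(
    n_trades > 0,
    n_wins.float() / n_trades.float().clamp(min=1),  # clamp only silences the unused 0/0 branch
    torch.full_like(n_trades, 0.5, dtype=torch.float32),
)
print(win_rate)  # tensor([0.5000, 0.7500, 0.4000])
```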
Final Parameters
# GPUEnvConfig (v2.4)
n_features: 56 # Was 53 in v2.3
drawdown_penalty_threshold: 0.15 # 15% drawdown starts penalty
drawdown_penalty_weight: 0.10 # Moderate weight in reward
# Feature breakdown (56 total)
base_features: 7 # price action basics
technical_features: 4 # intraday technicals
temporal_features: 7 # calendar features
markov_features: 12 # 4-chain regime probabilities
extended_features: 14 # extended technicals
multi_window_features: 9 # 20/50/100 bar windows
account_features: 3 # P&L %, win rate, drawdown %
# obs_dim = n_features * window = 56 * 100 = 5600
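A short sanity check that the breakdown above actually sums to the advertised feature count and observation size:

```python
# Sanity check: the per-group counts sum to 56 features and 5600 observation dims.
feature_groups = {"base": 7, "technical": 4, "temporal": 7, "markov": 12,
                  "extended": 14, "multi_window": 9, "account": 3}
n_features = sum(feature_groups.values())
assert n_features == 56
assert n_features * 100 == 5600  # window = 100
```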
Key Insights
- Breaking Change: obs_dim 5300 → 5600 means v2.3 models CANNOT be used with v2.4 environments (see the compatibility sketch after this list)
- Neutral Inference: Live trading uses neutral defaults (0, 0.5, 0) since account state isn't tracked per-prediction
- Quadratic Penalty: The `** 2` makes the penalty gentle at 16% drawdown but harsh at 25%+
- Normalized P&L: `tanh(pnl * 10)` keeps values in [-1, 1] even for large P&L swings
- 0.5 Win Rate Prior: Prevents model confusion during initial trades with no history
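As referenced in the Breaking Change insight, a hypothetical guard like the one below could catch the obs_dim mismatch early; the function and message are illustrative and not part of the existing codebase:

```python
# Hypothetical compatibility guard -- not part of the existing codebase.
def check_obs_dim_compatibility(model_obs_dim: int, env_obs_dim: int) -> None:
    """Refuse to pair a checkpoint with an environment of a different observation size."""
    if model_obs_dim != env_obs_dim:
        raise ValueError(
            f"Model expects obs_dim={model_obs_dim} but environment provides {env_obs_dim}; "
            "v2.3 (5300) and v2.4 (5600) checkpoints are not interchangeable."
        )

check_obs_dim_compatibility(5600, 5600)      # OK
# check_obs_dim_compatibility(5300, 5600)    # raises ValueError
```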
Expected Model Behavior
With account awareness, the model should learn to:
- Reduce position sizing after losses (sees drawdown feature)
- Be more selective after poor win rate (sees win rate feature)
- Avoid compounding losses (drawdown penalty kicks in at 15%)
- Trade more aggressively when profitable (sees positive P&L)
References
- alpaca_trading/gpu/vectorized_env.py: lines 405 (config), 712 (tensors), 850 (reset), 926 (step), 1258 (obs)
- alpaca_trading/gpu/inference_obs_builder.py: lines 61-108 (feature detection), 624+ (account features)
- notebooks/VSCode_Colab_Training_NATIVE.ipynb: training notebook with v2.4 settings