| name | account-aware-training |
| description | Add account state (P&L, win rate, drawdown) to RL observations + drawdown penalty in rewards. Trigger when: (1) model needs account awareness, (2) training should penalize drawdowns, (3) upgrading obs_dim 5300→5600. |
| author | Claude Code |
| date | 2024-12-26 |
Account-Aware RL Training (v2.4)
Experiment Overview
| Item | Details |
|---|---|
| Date | 2024-12-26 |
| Goal | Make RL model learn from account state (P&L, win rate, drawdown) |
| Environment | vectorized_env.py, inference_obs_builder.py, training notebook |
| Status | Success |
Context
Prior to v2.4, the RL model was "blind" to account performance. It received:
- 53 features: price action, technicals, regime probabilities, calendar effects
- No information about cumulative P&L, win rate, or drawdown
Problem: The model could generate signals that were individually good but led to excessive drawdowns at the account level. It had no incentive to trade conservatively after losses.
Solution: Add 3 account-level features + drawdown penalty in rewards.
Verified Workflow
1. Config Parameters (GPUEnvConfig)
# In vectorized_env.py GPUEnvConfig dataclass (~line 405)
# Account-aware training (v2.4)
drawdown_penalty_threshold: float = 0.15 # Penalize when drawdown > 15%
drawdown_penalty_weight: float = 0.10 # Weight in reward function
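A minimal, self-contained sketch of how these two knobs could sit in a config dataclass; the real GPUEnvConfig has many more fields, and the class name below is illustrative only:

```python
# Minimal sketch only -- the real GPUEnvConfig in vectorized_env.py has many more fields.
from dataclasses import dataclass

@dataclass
class AccountAwareConfigSketch:
    n_features: int = 56                      # 53 market features + 3 account features (v2.4)
    window: int = 100                         # observation window length
    drawdown_penalty_threshold: float = 0.15  # penalty starts once drawdown exceeds 15%
    drawdown_penalty_weight: float = 0.10     # weight of the drawdown penalty in the reward mix
```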
2. Equity Tracking Tensors
# In _init_state_tensors() after line 712
# Account-level equity tracking (v2.4)
self.initial_equity = torch.ones(self.n_envs, dtype=self.dtype, device=self.device)
self.peak_equity = torch.ones(self.n_envs, dtype=self.dtype, device=self.device)
self.current_equity = torch.ones(self.n_envs, dtype=self.dtype, device=self.device)
3. Reset Equity Tensors
# In reset() after line 850
# Reset account-level equity tracking
self.initial_equity[env_ids] = 1.0
self.peak_equity[env_ids] = 1.0
self.current_equity[env_ids] = 1.0
4. Update Equity in step()
# In step() after line 926
# Update account-level equity tracking (v2.4)
self.current_equity = self.initial_equity + self.total_pnl / (current_prices + 1e-8)
self.peak_equity = torch.maximum(self.peak_equity, self.current_equity)
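To see what the two lines above do, here is a toy run of the same high-water-mark logic on a single hypothetical equity path (tensor names mirror the ones used in step(); the P&L values are illustrative):

```python
# Toy check of the high-water-mark logic above on one hypothetical equity path.
import torch

initial_equity = torch.ones(1)
peak_equity = torch.ones(1)
# Cumulative P&L expressed as a fraction of price (illustrative values only).
for pnl_frac in (0.05, 0.12, -0.03, -0.10, 0.02):
    current_equity = initial_equity + torch.tensor([pnl_frac])
    peak_equity = torch.maximum(peak_equity, current_equity)  # high-water mark never decreases
    drawdown = (peak_equity - current_equity) / (peak_equity + 1e-8)
    print(f"equity={current_equity.item():.3f}  peak={peak_equity.item():.3f}  drawdown={drawdown.item():.3f}")
# Once the peak reaches 1.12 it stays there, so later drawdowns are measured from that high-water mark.
```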
5. Feature Count Update
# In _calculate_obs_features() line 682
# Add account features
account = 3 # total_pnl_pct, rolling_win_rate, current_drawdown_pct
return base + technical + intraday + temporal + markov + extended + multi_window + account
# Result: 53 + 3 = 56 features
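For reference, the observation tensor those 56 features feed into has shape (n_envs, window, n_features); flattening the last two dimensions gives the 5600-dim observation per environment (n_envs below is arbitrary):

```python
# Shape sketch for the per-environment observation; n_envs is arbitrary here.
import torch

n_envs, window, n_features = 8, 100, 53 + 3
obs = torch.zeros(n_envs, window, n_features)
print(obs.shape)                       # torch.Size([8, 100, 56])
print(obs.reshape(n_envs, -1).shape)   # torch.Size([8, 5600]) -> obs_dim = 5600
```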
6. Account Features in Observations
# In _get_observations() after line 1258, before sanitization
# === ACCOUNT-LEVEL FEATURES (3) - v2.4 ===
# Feature 1: Total P&L % (normalized to [-1, 1])
total_pnl_pct = self.total_pnl / (self.initial_equity + 1e-8)
total_pnl_pct_norm = torch.tanh(total_pnl_pct * 10)
obs[:, :, feat_idx] = total_pnl_pct_norm[env_ids].unsqueeze(1).expand(-1, self.config.window)
feat_idx += 1
# Feature 2: Rolling win rate (0.5 if no trades)
win_rate = torch.where(
self.n_trades[env_ids] > 0,
self.n_wins[env_ids].float() / self.n_trades[env_ids].float(),
torch.full((n_envs,), 0.5, dtype=self.dtype, device=self.device)
)
obs[:, :, feat_idx] = win_rate.unsqueeze(1).expand(-1, self.config.window)
feat_idx += 1
# Feature 3: Current drawdown % [0, 1]
drawdown = (self.peak_equity[env_ids] - self.current_equity[env_ids]) / (self.peak_equity[env_ids] + 1e-8)
drawdown = torch.clamp(drawdown, 0.0, 1.0)
obs[:, :, feat_idx] = drawdown.unsqueeze(1).expand(-1, self.config.window)
feat_idx += 1
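A quick look at the normalization used for feature 1: tanh(pnl_pct * 10) keeps the feature in [-1, 1] and saturates for large swings (illustrative values):

```python
# Behavior of the tanh normalization used for the total P&L feature.
import torch

pnl_pct = torch.tensor([-0.50, -0.10, -0.02, 0.00, 0.02, 0.10, 0.50])
print(torch.tanh(pnl_pct * 10))
# tensor([-0.9999, -0.7616, -0.1974,  0.0000,  0.1974,  0.7616,  0.9999])
```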
7. Drawdown Penalty in Rewards
# In _calculate_rewards() after line 1618
# COMPONENT 7: Drawdown penalty (v2.4)
current_drawdown = (self.peak_equity - self.current_equity) / (self.peak_equity + 1e-8)
current_drawdown = torch.clamp(current_drawdown, 0.0, 1.0)
# Quadratic penalty when over threshold
drawdown_over_threshold = torch.clamp(current_drawdown - self.config.drawdown_penalty_threshold, min=0.0)
drawdown_penalty = -drawdown_over_threshold ** 2 * 10
# Add to reward combination:
reward = (
self.config.direction_weight * direction_reward +
self.config.magnitude_weight * magnitude_reward +
self.config.pnl_weight * pnl_reward +
self.config.stop_tp_weight * stop_tp_reward +
self.config.exploration_weight * exploration_bonus +
self.config.slippage_weight * slippage_penalty +
self.config.drawdown_penalty_weight * drawdown_penalty # NEW
) * risk_adjustment
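To check how the quadratic penalty scales (threshold 0.15, internal scale 10, before the 0.10 reward weight is applied), the numbers below show it is negligible just past the threshold and grows quickly for deep drawdowns:

```python
# Numerical look at the quadratic drawdown penalty before the reward weight is applied.
import torch

threshold = 0.15
drawdowns = torch.tensor([0.10, 0.16, 0.20, 0.25, 0.35])
over = torch.clamp(drawdowns - threshold, min=0.0)
penalty = -over ** 2 * 10
for dd, p in zip(drawdowns.tolist(), penalty.tolist()):
    print(f"drawdown={dd:.2f} -> penalty={p:.4f}")
# Zero penalty at or below the threshold; roughly -0.001 at 16%, -0.025 at 20%,
# -0.1 at 25%, and -0.4 at 35% drawdown.
```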
8. Inference Observation Builder
# In inference_obs_builder.py get_target_features_from_obs_dim()
if features == 56:
return 56 # v2.4 with account awareness
elif features == 53:
return 53 # v2.3
# ... legacy support
# In build_inference_observation() after line 624
# === ACCOUNT-LEVEL FEATURES (3) - v2.4 ===
# Use neutral defaults during inference
if target_features >= 56:
obs[:, feat_idx] = 0.0 # total_pnl_pct (no prior trades)
feat_idx += 1
obs[:, feat_idx] = 0.5 # win_rate (neutral prior)
feat_idx += 1
obs[:, feat_idx] = 0.0 # drawdown (no drawdown)
feat_idx += 1
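A hedged sketch of the padding idea: when only the 53 market features are available at inference time, the three neutral account defaults (0, 0.5, 0) can be appended to reach the v2.4 layout. The helper name and array layout below are illustrative, not the actual inference_obs_builder.py code:

```python
# Illustrative only -- shows the neutral-default padding idea, not the actual builder code.
import numpy as np

NEUTRAL_ACCOUNT_FEATURES = (0.0, 0.5, 0.0)  # total_pnl_pct, win_rate, drawdown

def append_account_defaults(obs_53: np.ndarray) -> np.ndarray:
    """Pad a (window, 53) v2.3-style observation to the (window, 56) v2.4 layout."""
    window = obs_53.shape[0]
    account = np.tile(NEUTRAL_ACCOUNT_FEATURES, (window, 1)).astype(obs_53.dtype)
    return np.concatenate([obs_53, account], axis=1)

obs_v24 = append_account_defaults(np.zeros((100, 53), dtype=np.float32))
assert obs_v24.shape == (100, 56)
```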
Failed Attempts (Critical)
| Attempt | Why it Failed | Lesson Learned |
|---|---|---|
| Account features with raw P&L values | P&L scale varies by price level | Use P&L percentage normalized with tanh |
| Win rate = 0 when no trades | Invalid input during initial episodes | Default to 0.5 (neutral prior) |
| Peak equity never decreasing | Logical error in update | Use torch.maximum() to track high-water mark |
| Drawdown penalty linear in drawdown | Too harsh at moderate drawdown levels | Quadratic scaling stays gentle just above the threshold and ramps up for deep drawdowns |
| Live inference with account state | Would need real account connection | Use neutral defaults (0, 0.5, 0) for inference |
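A minimal illustration of the win-rate lesson from the table: environments with no trades get a neutral 0.5 rather than 0, so the model is not fed a false "losing" signal early in an episode:

```python
# Win-rate default: 0.5 for environments with no trades yet (mirrors the torch.where in step 6).
import torch

n_trades = torch.tensor([0, 4, 10])
n_wins = torch.tensor([0, 3, 4])
win_rate = torch.where(
    n_trades > 0,
    n_wins.float() / n_trades.float().clamp(min=1),  # clamp only silences the unused 0/0 branch
    torch.full_like(n_trades, 0.5, dtype=torch.float32),
)
print(win_rate)  # tensor([0.5000, 0.7500, 0.4000])
```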
Final Parameters
# GPUEnvConfig (v2.4)
n_features: 56 # Was 53 in v2.3
drawdown_penalty_threshold: 0.15 # 15% drawdown starts penalty
drawdown_penalty_weight: 0.10 # Moderate weight in reward
# Feature breakdown (56 total)
base_features: 7 # price action basics
technical_features: 4 # intraday technicals
temporal_features: 7 # calendar features
markov_features: 12 # 4-chain regime probabilities
extended_features: 14 # extended technicals
multi_window_features: 9 # 20/50/100 bar windows
account_features: 3 # P&L %, win rate, drawdown %
# obs_dim = n_features * window = 56 * 100 = 5600
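A short sanity check that the breakdown above actually sums to the advertised feature count and observation size:

```python
# Sanity check: the per-group counts sum to 56 features and 5600 observation dims.
feature_groups = {"base": 7, "technical": 4, "temporal": 7, "markov": 12,
                  "extended": 14, "multi_window": 9, "account": 3}
n_features = sum(feature_groups.values())
assert n_features == 56
assert n_features * 100 == 5600  # window = 100
```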
Key Insights
- Breaking Change: obs_dim 5300 → 5600 means v2.3 models CANNOT be used with v2.4 environments (see the compatibility sketch after this list)
- Neutral Inference: Live trading uses neutral defaults (0, 0.5, 0) since account state isn't tracked per-prediction
- Quadratic Penalty: The `** 2` makes the penalty gentle at 16% drawdown but harsh at 25%+
- Normalized P&L: `tanh(pnl * 10)` keeps values in [-1, 1] even for large P&L swings
- 0.5 Win Rate Prior: Prevents model confusion during initial trades with no history
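As referenced in the Breaking Change insight, a hypothetical guard like the one below could catch the obs_dim mismatch early; the function and message are illustrative and not part of the existing codebase:

```python
# Hypothetical compatibility guard -- not part of the existing codebase.
def check_obs_dim_compatibility(model_obs_dim: int, env_obs_dim: int) -> None:
    """Refuse to pair a checkpoint with an environment of a different observation size."""
    if model_obs_dim != env_obs_dim:
        raise ValueError(
            f"Model expects obs_dim={model_obs_dim} but environment provides {env_obs_dim}; "
            "v2.3 (5300) and v2.4 (5600) checkpoints are not interchangeable."
        )

check_obs_dim_compatibility(5600, 5600)      # OK
# check_obs_dim_compatibility(5300, 5600)    # raises ValueError
```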
Expected Model Behavior
With account awareness, the model should learn to:
- Reduce position sizing after losses (sees drawdown feature)
- Be more selective after poor win rate (sees win rate feature)
- Avoid compounding losses (drawdown penalty kicks in at 15%)
- Trade more aggressively when profitable (sees positive P&L)
References
- alpaca_trading/gpu/vectorized_env.py: lines 405 (config), 712 (tensors), 850 (reset), 926 (step), 1258 (obs)
- alpaca_trading/gpu/inference_obs_builder.py: lines 61-108 (feature detection), 624+ (account features)
- notebooks/VSCode_Colab_Training_NATIVE.ipynb: training notebook with v2.4 settings