| name | training-improvements-v245 |
| description | Training improvements: LR warmup, validation intervals, reward weights. Trigger when: (1) training unstable in early epochs, (2) need more validation visibility, (3) model too conservative. |
| author | Claude Code |
| date | 2024-12-27 |
# Training Improvements v2.4.5

## Experiment Overview
| Item | Details |
|---|---|
| Date | 2024-12-27 |
| Goal | Improve training stability and visibility based on 20251226 results analysis |
| Environment | GPU-native PPO, A100-40GB, 500M timesteps |
| Status | Success |
## Context
Analysis of training results from 20251226 showed:
- Training was working (PF 1.55-2.61, consistency 100%)
- But low reward magnitude (0.05-0.10) suggested conservative behavior
- Negative skew (-0.2 to -0.5) indicated more negative outliers
- Only 5 validation points for 477 updates (limited visibility)
These improvements stabilize early training and provide better monitoring.
## Verified Workflow

### 1. LR Warmup (Stabilizes Early Training)

The first 5% of training uses a linear warmup from lr/10 to the full lr. This prevents large updates before the network has settled.
```python
# In ppo_trainer_native.py, NativePPOConfig
warmup_fraction: float = 0.05  # First 5% of training
min_lr_fraction: float = 0.01  # Don't decay to 0 - maintain 1% of initial LR

# Scheduler setup with warmup
warmup_steps = max(1, int(n_updates * self.config.warmup_fraction))
main_steps = max(1, n_updates - warmup_steps)

# Warmup scheduler: lr/10 -> lr
warmup_scheduler = torch.optim.lr_scheduler.LinearLR(
    self.optimizer,
    start_factor=0.1,  # Start at lr/10
    end_factor=1.0,    # Warm up to full lr
    total_iters=warmup_steps,
)

# Main scheduler: lr -> lr*min_lr_fraction
main_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    self.optimizer,
    T_max=main_steps,
    eta_min=self.config.learning_rate * self.config.min_lr_fraction,
)

# Chain them: warmup first, then cosine decay
self.lr_scheduler = torch.optim.lr_scheduler.SequentialLR(
    self.optimizer,
    schedulers=[warmup_scheduler, main_scheduler],
    milestones=[warmup_steps],
)
```
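
To sanity-check the chained schedule, a minimal standalone sketch is shown below. The base LR of 3e-4, 400 updates, and the dummy Adam optimizer are assumptions for illustration only, not the trainer's actual setup; the point is to confirm the LR starts at lr/10, reaches the full lr after the 5% warmup, and bottoms out near the 1% floor rather than zero.

```python
# Standalone sanity check of the warmup + cosine schedule (illustrative only:
# base_lr, n_updates, and the dummy optimizer are assumed values).
import torch

n_updates, base_lr = 400, 3e-4
warmup_fraction, min_lr_fraction = 0.05, 0.01

params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.Adam(params, lr=base_lr)

warmup_steps = max(1, int(n_updates * warmup_fraction))
main_steps = max(1, n_updates - warmup_steps)

warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.1, end_factor=1.0, total_iters=warmup_steps)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=main_steps, eta_min=base_lr * min_lr_fraction)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

lrs = []
for _ in range(n_updates):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()   # dummy step so the optimizer/scheduler ordering is valid
    scheduler.step()

print(f"start={lrs[0]:.1e}  after_warmup={lrs[warmup_steps]:.1e}  end={lrs[-1]:.1e}")
# Expected: start=3.0e-05 (lr/10), after_warmup=3.0e-04 (full lr), end~3.0e-06 (1% floor)
```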
### 2. Validation Interval (More Visibility)

Validation frequency was increased so that each mode yields roughly 10 validations per training run:
```python
# In ppo_trainer_native.py, mode_configs
mode_configs = {
    'quick_test': {'validation_interval': 5},    # ~8 validations for ~40 updates
    'standard':   {'validation_interval': 20},   # ~10 validations for ~200 updates
    'production': {'validation_interval': 40},   # ~10 validations for ~400 updates
    'extended':   {'validation_interval': 100},  # ~10 validations for ~1000 updates
    'auto':       {'validation_interval': 40},   # Match production
}
```
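
For context, here is a minimal sketch of how a `validation_interval` is typically consumed in the update loop. `update_policy()` and `run_validation()` are hypothetical stand-ins, not the trainer's actual method names.

```python
# Minimal sketch of gating validation by update count (illustrative only;
# update_policy() and run_validation() are hypothetical stand-ins).
n_updates = 400
validation_interval = mode_configs['production']['validation_interval']  # 40 -> ~10 validations

for update_idx in range(1, n_updates + 1):
    update_policy()  # one PPO update

    # Validate every `validation_interval` updates, plus once at the very end
    if update_idx % validation_interval == 0 or update_idx == n_updates:
        run_validation()
```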
### 3. Reward Weights (Less Conservative)

With account-aware training (v2.4), drawdown is already part of the observations. Penalty weights were reduced to avoid making the model too conservative:
```python
# In vectorized_env.py, GPUEnvConfig
# Reward weights (v2.4.5 - reduced magnitude and drawdown penalty)
# NOTE: With account-aware training, drawdown is in observations - less penalty needed in reward
direction_weight: float = 0.35    # Keep - primary signal
magnitude_weight: float = 0.10    # Reduced from 0.15 - noisy component
pnl_weight: float = 0.25          # Keep - P&L matters
stop_tp_weight: float = 0.15      # Keep - risk management
exploration_weight: float = 0.10  # Keep - exploration

# Account-aware training (v2.4)
drawdown_penalty_threshold: float = 0.15  # Penalize when drawdown exceeds 15%
drawdown_penalty_weight: float = 0.05     # Reduced from 0.10 - DD already in observations
```
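
For intuition, the sketch below shows one way these weights and the thresholded drawdown penalty could combine into a scalar reward. The component names (`direction_score`, `magnitude_score`, etc.) and the `combine_reward` function are illustrative assumptions, not the env's actual reward code.

```python
# Rough sketch of combining weighted reward components with a thresholded
# drawdown penalty (illustrative; not the actual vectorized_env.py code).
import torch

def combine_reward(direction_score, magnitude_score, pnl_score,
                   stop_tp_score, exploration_score, drawdown, cfg):
    reward = (
        cfg.direction_weight * direction_score
        + cfg.magnitude_weight * magnitude_score
        + cfg.pnl_weight * pnl_score
        + cfg.stop_tp_weight * stop_tp_score
        + cfg.exploration_weight * exploration_score
    )
    # Penalize only the drawdown in excess of the threshold; the small weight
    # avoids double-penalizing, since drawdown is already in the observation.
    excess_dd = torch.clamp(drawdown - cfg.drawdown_penalty_threshold, min=0.0)
    return reward - cfg.drawdown_penalty_weight * excess_dd
```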
## Failed Attempts (Critical)
| Attempt | Why it Failed | Lesson Learned |
|---|---|---|
| Warmup 10% of training | Too long warmup, wasted training time | 5% is sufficient for stabilization |
| Decay to lr=0 | Training stalled at end | Maintain 1% of initial LR |
| Validation every 10 updates | Too frequent, slowed training | ~10 validations total is enough |
| Keep high drawdown penalty | Model became too conservative (only 0.05 rewards) | Reduce penalty when DD is in observations |
## Final Parameters

```python
# ppo_trainer_native.py - NativePPOConfig
warmup_fraction = 0.05          # First 5% of training
min_lr_fraction = 0.01          # Don't decay to 0
max_drawdown_threshold = 0.15   # Reduced from 0.30
drawdown_penalty_weight = 0.2   # Reduced from 0.3

# Validation intervals (per mode)
# quick_test: 5, standard: 20, production: 40, extended: 100

# vectorized_env.py - GPUEnvConfig reward weights
magnitude_weight = 0.10         # Reduced from 0.15
drawdown_penalty_weight = 0.05  # Reduced from 0.10
```
## Key Insights

- **Warmup prevents early instability.** A large LR in early training can cause divergence; starting at lr/10 lets the network find a stable region first.
- **Don't decay to zero.** Training at lr=0 is just noise. Maintaining 1% of the initial LR allows continued learning at the end of training.
- **More validations = better visibility.** With only 5 validations you can't see the training dynamics; ~10 validations show the learning curve clearly.
- **Avoid double-penalizing.** Account-aware training already shows the model its drawdown; a heavy reward penalty on top makes it too conservative.
- **Magnitude is noisy.** The magnitude component (how much price moved) is noisy and less predictive than direction, so reducing its weight helps.
## References

- `alpaca_trading/gpu/ppo_trainer_native.py`: lines 100-102 (warmup config), 526-572 (scheduler setup), 1708-1747 (mode configs)
- `alpaca_trading/gpu/vectorized_env.py`: lines 388-409 (reward weights)
- Training analysis: 20251226 results showing low reward magnitude and negative skew
- Commit: `14d07c3` (training improvements)