| name | network-architecture-sizing |
| description | PPO network architecture sizing for trading models. Trigger: (1) model files are unexpectedly small/large, (2) choosing hidden_dims for training, (3) balancing model capacity vs inference speed. |
| author | Claude Code |
| date | 2025-12-18 |
# Network Architecture Sizing - Research Notes

## Experiment Overview
| Item | Details |
|---|---|
| Date | 2025-12-18 |
| Goal | Understand relationship between hidden_dims and model file size |
| Environment | Google Colab A100, PyTorch 2.x, NativePPOTrainer |
| Status | Documented |
## Context

Training runs produced models at ~72 MB instead of the expected ~148 MB. Investigation showed that the hidden_dims configuration determines model size, and that the first layer dominates the total parameter count because its weight matrix multiplies the full observation dimension.
## Architecture Comparison

### Model Size vs Architecture
| Architecture | hidden_dims | Layers | Total Params | File Size | First Layer Params |
|---|---|---|---|---|---|
| Large (v2.2) | (2048, 1024, 512, 256) | 4 | 12.6M | ~148 MB | 2048 × obs_dim |
| Medium (v2.3) | (1024, 512, 256) | 3 | 6.1M | ~72 MB | 1024 × obs_dim |
| Small | (512, 256, 128) | 3 | ~2.9M | ~33 MB | 512 × obs_dim |
| Tiny | (256, 128, 64) | 3 | ~1.4M | ~16 MB | 256 × obs_dim |
### Why First Layer Dominates

With 53 features × 100 lookback = 5,300 input dimensions:

- Large: 2048 × 5300 = 10.9M params (86% of network)
- Medium: 1024 × 5300 = 5.4M params (89% of network)
- Small: 512 × 5300 = 2.7M params (94% of network)
Key insight: The width of the first hidden layer has far more impact on model size than deeper layers do, because its weight matrix is multiplied by the large observation dimension.
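A minimal sketch (weights only; biases and the actor/critic heads are ignored) that makes the dominance concrete:

```python
# First-layer weights vs. all deeper hidden layers combined, for obs_dim = 5300.
obs_dim = 53 * 100  # 53 features × 100 lookback

for name, dims in [("Large", (2048, 1024, 512, 256)),
                   ("Medium", (1024, 512, 256)),
                   ("Small", (512, 256, 128))]:
    first = obs_dim * dims[0]
    deeper = sum(dims[i] * dims[i + 1] for i in range(len(dims) - 1))
    print(f"{name}: first layer {first:,} vs all deeper layers {deeper:,}")
```

For the Large network, the first layer alone (~10.9M weights) is roughly 4× everything that follows it (~2.75M).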
## Configuration Locations

Current defaults in `ppo_trainer_native.py`:
| Function | GPU Tier | hidden_dims |
|---|---|---|
| `get_auto_config()` | H100 | (1024, 512, 256) |
| `get_auto_config()` | A100 | (1024, 512, 256) |
| `get_auto_config()` | high (40GB+) | (512, 256, 128) |
| `get_auto_config()` | medium (20-40GB) | (512, 256, 128) |
| `get_auto_config()` | low (<20GB) | (256, 128, 64) |
| `get_a100_config()` | A100-80GB | (1024, 512, 256) |
| `get_a100_config()` | A100-40GB | (512, 256, 128) |
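For illustration, the tier logic in the table can be mirrored in a small helper. This is a sketch only; `pick_hidden_dims` and its VRAM thresholds are hypothetical, not the actual `get_auto_config()` implementation:

```python
def pick_hidden_dims(vram_gb: float, gpu_name: str = "") -> tuple:
    # Hypothetical helper mirroring the defaults table above;
    # not the real get_auto_config() logic.
    if "H100" in gpu_name or "A100" in gpu_name:
        return (1024, 512, 256)
    if vram_gb >= 40:        # "high" tier
        return (512, 256, 128)
    if vram_gb >= 20:        # "medium" tier
        return (512, 256, 128)
    return (256, 128, 64)    # "low" tier

print(pick_hidden_dims(24.0))  # -> (512, 256, 128)
```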
## Verified Workflow

To use the larger architecture (~148 MB models):

```python
from alpaca_trading.gpu.ppo_trainer_native import NativePPOTrainer, get_auto_config

config = get_auto_config(total_timesteps=200_000_000, training_mode='production')
config.hidden_dims = (2048, 1024, 512, 256)  # Override to 4-layer large
trainer = NativePPOTrainer(env, config)  # env: the prepared trading environment
```
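Note that the override has to happen before the trainer is constructed; presumably NativePPOTrainer builds the network from the config at construction time, so later changes to config.hidden_dims would not take effect.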
To verify the expected model size before training:

```python
# Estimate parameter count and on-disk size from hidden_dims.
obs_dim = 5300  # 53 features × 100 lookback
hidden_dims = (2048, 1024, 512, 256)

params = obs_dim * hidden_dims[0]  # First (dominant) layer
for i in range(len(hidden_dims) - 1):
    params += hidden_dims[i] * hidden_dims[i + 1]
params += hidden_dims[-1] * 64 * 2  # Actor + critic heads

print(f"Expected params: {params:,}")
print(f"Expected size: ~{params * 4 * 3 / 1024 / 1024:.0f} MB")  # float32 × 3 (weights + optimizer state)
```
To inspect an existing model:

```python
import torch

ckpt = torch.load('model.pt', map_location='cpu', weights_only=False)
print(f"hidden_dims: {ckpt['config'].hidden_dims}")
print(f"Total params: {sum(v.numel() for v in ckpt['policy_state_dict'].values()):,}")
```
## Failed Attempts (Critical)
| Attempt | Why it Failed | Lesson Learned |
|---|---|---|
| Assuming all configs use same architecture | Different GPU tiers have different defaults | Always check hidden_dims in config before training |
| Only checking layer count | 3-layer (1024, 512, 256) and 4-layer (2048, 1024, 512, 256) differ far more in first-layer width than in depth | First-layer width matters more than depth |
| Not saving config with model | Couldn't reproduce training | Always save full config in checkpoint |
| Using large architecture on small GPU | OOM errors | Match architecture to available VRAM |
| Assuming bigger = better | Overfitting on small datasets | Larger models need more data/regularization |
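One way to honor the "save full config" lesson, as a sketch: `save_checkpoint` below is a hypothetical helper (the trainer's real checkpointing code may differ), with keys chosen to match the inspection snippet above.

```python
import torch

def save_checkpoint(policy, optimizer, config, path):
    # Persist everything needed to rebuild and resume the run:
    # the config (including hidden_dims), the policy weights,
    # and the optimizer state.
    torch.save({
        'config': config,
        'policy_state_dict': policy.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, path)
```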
## Performance Considerations

### Larger Architecture (2048, 1024, 512, 256)
Pros:
- Higher model capacity for complex patterns
- Better for symbols with rich feature interactions
- May capture longer-term dependencies
Cons:
- 2x file size (~148 MB vs ~72 MB)
- Slower inference (~1.5-2x)
- Higher VRAM usage during training
- More prone to overfitting with limited data
### Smaller Architecture (1024, 512, 256)
Pros:
- Faster inference (important for live trading)
- Lower VRAM requirements
- Faster training iterations
- Better generalization on limited data
Cons:
- May underfit complex market dynamics
- Less capacity for feature interactions
## Recommended Architecture by Use Case
| Use Case | Recommended hidden_dims | Rationale |
|---|---|---|
| Quick iteration/testing | (512, 256, 128) | Fast training, low memory |
| Standard production | (1024, 512, 256) | Good balance |
| Complex symbols (crypto) | (2048, 1024, 512, 256) | Extra capacity for volatile, pattern-rich dynamics |
| Limited training data (<1 year) | (512, 256, 128) | Reduce overfitting |
| Extended training (500M+ steps) | (2048, 1024, 512, 256) | Capacity for more learning |
## Key Insights
- First layer width dominates model size - doubling first layer ~doubles total params
- File size ≈ params × 12 bytes (float32 weights + Adam optimizer moments)
- Current v2.3 defaults favor smaller models - optimized for speed over capacity
- Architecture mismatch = inference failure - models trained with different hidden_dims are incompatible
- Always log hidden_dims - critical for reproducibility and debugging
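Since checkpoints carry their config (see the inspection snippet above), a cheap guard against the architecture-mismatch failure is to refuse incompatible checkpoints before inference; `EXPECTED_HIDDEN_DIMS` below is a placeholder for whatever the serving code was built against:

```python
import torch

EXPECTED_HIDDEN_DIMS = (1024, 512, 256)  # placeholder: what the serving code expects

ckpt = torch.load('model.pt', map_location='cpu', weights_only=False)
actual = tuple(ckpt['config'].hidden_dims)
if actual != EXPECTED_HIDDEN_DIMS:
    raise ValueError(
        f"Checkpoint hidden_dims {actual} != expected {EXPECTED_HIDDEN_DIMS}"
    )
```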
## Diagnostic Commands

```python
import torch

# Compare the architectures and parameter counts of two checkpoints
def compare_models(path1, path2):
    m1 = torch.load(path1, map_location='cpu', weights_only=False)
    m2 = torch.load(path2, map_location='cpu', weights_only=False)
    print(f"Model 1: {m1['config'].hidden_dims}")
    print(f"Model 2: {m2['config'].hidden_dims}")
    print(f"Params 1: {sum(v.numel() for v in m1['policy_state_dict'].values()):,}")
    print(f"Params 2: {sum(v.numel() for v in m2['policy_state_dict'].values()):,}")
```
## References

- `alpaca_trading/gpu/ppo_trainer_native.py`: lines 1314, 1339, 1726, 1760
- `alpaca_trading/gpu/ppo_trainer_native.py`: `NativeActorCritic` class (line 305)
- `CLAUDE.md`: GPU Optimized Settings table