| name | network-architecture-sizing |
| description | PPO network architecture sizing for trading models. Trigger: (1) model files are unexpectedly small/large, (2) choosing hidden_dims for training, (3) balancing model capacity vs inference speed. |
| author | Claude Code |
| date | 2025-12-18 |
# Network Architecture Sizing - Research Notes

## Experiment Overview
| Item | Details |
|---|---|
| Date | 2025-12-18 |
| Goal | Understand relationship between hidden_dims and model file size |
| Environment | Google Colab A100, PyTorch 2.x, NativePPOTrainer |
| Status | Documented |
## Context

Training runs produced models at ~72 MB instead of the expected ~148 MB. Investigation showed that the hidden_dims configuration determines model size, and that the first layer dominates the total parameter count because its weight matrix multiplies the full observation dimension.
## Architecture Comparison

### Model Size vs Architecture
| Architecture | hidden_dims | Layers | Total Params | File Size | First Layer Params |
|---|---|---|---|---|---|
| Large (v2.2) | (2048, 1024, 512, 256) | 4 | 12.6M | ~148 MB | 2048 × obs_dim |
| Medium (v2.3) | (1024, 512, 256) | 3 | 6.1M | ~72 MB | 1024 × obs_dim |
| Small | (512, 256, 128) | 3 | ~2.9M | ~33 MB | 512 × obs_dim |
| Tiny | (256, 128, 64) | 3 | ~1.4M | ~16 MB | 256 × obs_dim |
### Why First Layer Dominates

With 53 features × 100 lookback = 5,300 input dimensions:

- Large: 2048 × 5300 = 10.9M params (86% of network)
- Medium: 1024 × 5300 = 5.4M params (89% of network)
- Small: 512 × 5300 = 2.7M params (94% of network)
Key insight: The width of the first hidden layer has far more impact on model size than deeper layers do, because its weight matrix is multiplied by the large observation dimension.
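A minimal sketch (weights only; biases and the actor/critic heads are ignored) that makes the dominance concrete:

```python
# First-layer weights vs. all deeper hidden layers combined, for obs_dim = 5300.
obs_dim = 53 * 100  # 53 features × 100 lookback

for name, dims in [("Large", (2048, 1024, 512, 256)),
                   ("Medium", (1024, 512, 256)),
                   ("Small", (512, 256, 128))]:
    first = obs_dim * dims[0]
    deeper = sum(dims[i] * dims[i + 1] for i in range(len(dims) - 1))
    print(f"{name}: first layer {first:,} vs all deeper layers {deeper:,}")
```

For the Large network, the first layer alone (~10.9M weights) is roughly 4× everything that follows it (~2.75M).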
## Configuration Locations

Current defaults in `ppo_trainer_native.py`:
| Function | GPU Tier | hidden_dims |
|---|---|---|
| `get_auto_config()` | H100 | (1024, 512, 256) |
| `get_auto_config()` | A100 | (1024, 512, 256) |
| `get_auto_config()` | high (40GB+) | (512, 256, 128) |
| `get_auto_config()` | medium (20-40GB) | (512, 256, 128) |
| `get_auto_config()` | low (<20GB) | (256, 128, 64) |
| `get_a100_config()` | A100-80GB | (1024, 512, 256) |
| `get_a100_config()` | A100-40GB | (512, 256, 128) |
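For illustration, the tier logic in the table can be mirrored in a small helper. This is a sketch only; `pick_hidden_dims` and its VRAM thresholds are hypothetical, not the actual `get_auto_config()` implementation:

```python
def pick_hidden_dims(vram_gb: float, gpu_name: str = "") -> tuple:
    # Hypothetical helper mirroring the defaults table above;
    # not the real get_auto_config() logic.
    if "H100" in gpu_name or "A100" in gpu_name:
        return (1024, 512, 256)
    if vram_gb >= 40:        # "high" tier
        return (512, 256, 128)
    if vram_gb >= 20:        # "medium" tier
        return (512, 256, 128)
    return (256, 128, 64)    # "low" tier

print(pick_hidden_dims(24.0))  # -> (512, 256, 128)
```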
## Verified Workflow

To use the larger architecture (~148 MB models):

```python
from alpaca_trading.gpu.ppo_trainer_native import NativePPOTrainer, get_auto_config

config = get_auto_config(total_timesteps=200_000_000, training_mode='production')
config.hidden_dims = (2048, 1024, 512, 256)  # Override to 4-layer large
trainer = NativePPOTrainer(env, config)  # env: the prepared trading environment
```
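Note that the override has to happen before the trainer is constructed; presumably NativePPOTrainer builds the network from the config at construction time, so later changes to config.hidden_dims would not take effect.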
To verify the expected model size before training:

```python
# Estimate parameter count and on-disk size from hidden_dims.
obs_dim = 5300  # 53 features × 100 lookback
hidden_dims = (2048, 1024, 512, 256)

params = obs_dim * hidden_dims[0]  # First (dominant) layer
for i in range(len(hidden_dims) - 1):
    params += hidden_dims[i] * hidden_dims[i + 1]
params += hidden_dims[-1] * 64 * 2  # Actor + critic heads

print(f"Expected params: {params:,}")
print(f"Expected size: ~{params * 4 * 3 / 1024 / 1024:.0f} MB")  # float32 × 3 (weights + optimizer state)
```
To inspect an existing model:

```python
import torch

ckpt = torch.load('model.pt', map_location='cpu', weights_only=False)
print(f"hidden_dims: {ckpt['config'].hidden_dims}")
print(f"Total params: {sum(v.numel() for v in ckpt['policy_state_dict'].values()):,}")
```
## Failed Attempts (Critical)
| Attempt | Why it Failed | Lesson Learned |
|---|---|---|
| Assuming all configs use same architecture | Different GPU tiers have different defaults | Always check hidden_dims in config before training |
| Only checking layer count | 3-layer (1024, 512, 256) and 4-layer (2048, 1024, 512, 256) differ far more in first-layer width than in depth | First-layer width matters more than depth |
| Not saving config with model | Couldn't reproduce training | Always save full config in checkpoint |
| Using large architecture on small GPU | OOM errors | Match architecture to available VRAM |
| Assuming bigger = better | Overfitting on small datasets | Larger models need more data/regularization |
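One way to honor the "save full config" lesson, as a sketch: `save_checkpoint` below is a hypothetical helper (the trainer's real checkpointing code may differ), with keys chosen to match the inspection snippet above.

```python
import torch

def save_checkpoint(policy, optimizer, config, path):
    # Persist everything needed to rebuild and resume the run:
    # the config (including hidden_dims), the policy weights,
    # and the optimizer state.
    torch.save({
        'config': config,
        'policy_state_dict': policy.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, path)
```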
## Performance Considerations

### Larger Architecture (2048, 1024, 512, 256)
Pros:
- Higher model capacity for complex patterns
- Better for symbols with rich feature interactions
- May capture longer-term dependencies
Cons:
- 2x file size (~148 MB vs ~72 MB)
- Slower inference (~1.5-2x)
- Higher VRAM usage during training
- More prone to overfitting with limited data
### Smaller Architecture (1024, 512, 256)
Pros:
- Faster inference (important for live trading)
- Lower VRAM requirements
- Faster training iterations
- Better generalization on limited data
Cons:
- May underfit complex market dynamics
- Less capacity for feature interactions
## Recommended Architecture by Use Case
| Use Case | Recommended hidden_dims | Rationale |
|---|---|---|
| Quick iteration/testing | (512, 256, 128) | Fast training, low memory |
| Standard production | (1024, 512, 256) | Good balance |
| Complex symbols (crypto) | (2048, 1024, 512, 256) | Extra capacity for volatile, pattern-rich dynamics |
| Limited training data (<1 year) | (512, 256, 128) | Reduce overfitting |
| Extended training (500M+ steps) | (2048, 1024, 512, 256) | Capacity for more learning |
## Key Insights
- First layer width dominates model size - doubling first layer ~doubles total params
- File size ≈ params × 12 bytes (float32 weights + Adam optimizer moments)
- Current v2.3 defaults favor smaller models - optimized for speed over capacity
- Architecture mismatch = inference failure - models trained with different hidden_dims are incompatible
- Always log hidden_dims - critical for reproducibility and debugging
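Since checkpoints carry their config (see the inspection snippet above), a cheap guard against the architecture-mismatch failure is to refuse incompatible checkpoints before inference; `EXPECTED_HIDDEN_DIMS` below is a placeholder for whatever the serving code was built against:

```python
import torch

EXPECTED_HIDDEN_DIMS = (1024, 512, 256)  # placeholder: what the serving code expects

ckpt = torch.load('model.pt', map_location='cpu', weights_only=False)
actual = tuple(ckpt['config'].hidden_dims)
if actual != EXPECTED_HIDDEN_DIMS:
    raise ValueError(
        f"Checkpoint hidden_dims {actual} != expected {EXPECTED_HIDDEN_DIMS}"
    )
```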
## Diagnostic Commands

```python
import torch

# Compare the architectures and parameter counts of two checkpoints
def compare_models(path1, path2):
    m1 = torch.load(path1, map_location='cpu', weights_only=False)
    m2 = torch.load(path2, map_location='cpu', weights_only=False)
    print(f"Model 1: {m1['config'].hidden_dims}")
    print(f"Model 2: {m2['config'].hidden_dims}")
    print(f"Params 1: {sum(v.numel() for v in m1['policy_state_dict'].values()):,}")
    print(f"Params 2: {sum(v.numel() for v in m2['policy_state_dict'].values()):,}")
```
## References

- `alpaca_trading/gpu/ppo_trainer_native.py`: lines 1314, 1339, 1726, 1760
- `alpaca_trading/gpu/ppo_trainer_native.py`: `NativeActorCritic` class (line 305)
- `CLAUDE.md`: GPU Optimized Settings table