Testing & Benchmarking Infrastructure - Research Notes
Experiment Overview
| Item |
Details |
| Date |
2026-01-02 |
| Goal |
Implement comprehensive testing with benchmarking for version comparison |
| Environment |
Python 3.10/3.11, pytest 8.4.2, pytest-benchmark 4.0+ |
| Status |
Success |
Context
Trading systems evolve through versions (v2.3 -> v2.7) with different feature counts and action spaces. Need to:
- Track performance regressions between versions
- Ensure backwards compatibility for model loading
- Validate feature computation consistency
- Support both GPU and CPU testing (CI compatibility)
Without proper benchmarking, performance regressions go unnoticed until production.
Verified Workflow
1. Directory Structure
tests/
conftest.py # Shared fixtures, GPU mocking
test_benchmarks.py # pytest-benchmark tests
test_feature_regression.py # Version compatibility tests
integration/
__init__.py
test_training_pipeline.py
test_inference_pipeline.py
benchmark_results/ # Stores version-tagged results
v2.6.0/
v2.7.0/
v2.7.0_report.md
pytest.ini # Test configuration
.coveragerc # Coverage thresholds
2. Shared Fixtures with GPU Mocking (conftest.py)
@pytest.fixture
def mock_cuda():
"""Mock CUDA environment for GPU tests on CPU-only machines."""
mock_props = MagicMock()
mock_props.total_memory = 42949672960 # 40GB (A100)
mock_props.name = "NVIDIA A100-PCIE-40GB"
with patch('torch.cuda.is_available', return_value=True), \
patch('torch.cuda.get_device_properties', return_value=mock_props), \
patch('torch.cuda.get_device_name', return_value='NVIDIA A100-PCIE-40GB'), \
patch('torch.cuda.get_device_capability', return_value=(8, 0)), \
patch('torch.cuda.synchronize'):
yield
@pytest.fixture(scope="module")
def sample_prices():
"""Deterministic price data for reproducibility."""
np.random.seed(42)
returns = np.random.randn(10000) * 0.01
return 100 * np.exp(np.cumsum(returns))
3. Benchmark Categories
# Feature Engineering benchmarks
def test_markov_chain_step_benchmark(benchmark, sample_prices):
"""Measures Markov state transition speed."""
result = benchmark(lambda: markov.get_regime_state(prices))
# GPU Environment benchmarks (with mock fallback)
@pytest.mark.gpu
def test_env_step_throughput_benchmark(benchmark, mock_cuda):
"""Measures environment steps/sec."""
# Validation Metrics benchmarks
def test_profit_factor_benchmark(benchmark, benchmark_rewards):
"""Measures profit factor calculation on 100k samples."""
4. Version Comparison Script
# Save benchmarks for current version
python scripts/run_version_benchmarks.py --save
# Compare two versions
python scripts/run_version_benchmarks.py --compare v2.6.0 v2.7.0
# Generate detailed report
python scripts/run_version_benchmarks.py --report v2.7.0
5. Feature Regression Tests
@pytest.mark.parametrize("version,n_features,n_actions,obs_dim", [
("2.3.0", 53, 3, 5300),
("2.4.0", 56, 3, 5600),
("2.7.0", 59, 7, 5900),
])
def test_version_spec_consistency(self, version, n_features, n_actions, obs_dim):
"""Verify version specs match documented values."""
spec = MODEL_SPECS[version]
assert spec.n_features == n_features
assert spec.n_actions == n_actions
6. pytest.ini Configuration
[pytest]
testpaths = tests
markers =
benchmark: performance benchmark test
slow: marks tests as slow (> 10s)
gpu: requires CUDA GPU (mocked if unavailable)
integration: integration test
addopts = -v --tb=short --strict-markers
7. Coverage Configuration (.coveragerc)
[run]
source = alpaca_trading
branch = True
[report]
# Per DEVELOPMENT_PRACTICES.md
# 70% minimum, 85% for risk/signals modules
fail_under = 70
Failed Attempts (Critical)
| Attempt |
Why it Failed |
Lesson Learned |
| Benchmark config in pytest.ini |
pytest-benchmark options not recognized without plugin |
Pass via command line or use helper script |
| Running GPU tests without mock |
Tests skip entirely on CPU machines |
Use mock_cuda fixture for CI compatibility |
| Single benchmark file with all categories |
Hard to run specific categories |
Organize by class: TestFeatureEngineering, TestGPU, etc. |
| Fixtures without scope="module" |
Slow test execution, re-creating data |
Use module scope for expensive fixtures |
| Hardcoded version specs in tests |
Breaks when versions change |
Import from model_version.py |
| time.time() for performance tests |
Not statistically rigorous |
Use pytest-benchmark for proper stats |
Final Parameters
# conftest.py fixture scopes
scope="session" # benchmark_output_dir, model_version
scope="module" # sample_prices, sample_ohlcv, gpu_prices
scope="function" # mock_cuda, mock_broker
# pytest markers for test selection
markers = ["benchmark", "slow", "gpu", "live", "integration"]
# Coverage thresholds (per DEVELOPMENT_PRACTICES.md)
minimum_coverage = 70 # Standard modules
critical_coverage = 85 # risk/, signals/ modules
# Benchmark report columns
columns = ["min", "max", "mean", "stddev", "median", "iqr", "rounds"]
Key Insights
- GPU mocking is essential - Tests must run on CPU-only CI machines
- Fixtures reduce duplication - Shared conftest.py cuts test code by ~30%
- Version comparison needs automation - Helper script makes it easy
- Regression tests catch breaking changes - Parametrize across versions
- Integration tests go in separate directory - Per DEVELOPMENT_PRACTICES.md
- Markers enable selective execution - Run fast tests on commit, full on PR
- Deterministic seeds required - Use np.random.seed(42) for reproducibility
Usage Commands
# Run all tests with coverage
pytest tests/ --cov=alpaca_trading --cov-report=html
# Run fast tests only (pre-commit)
pytest tests/ -m "not (benchmark or slow or gpu or integration or live)"
# Run benchmarks and save
python scripts/run_version_benchmarks.py --save
# Compare versions
python scripts/run_version_benchmarks.py --compare v2.6.0 v2.7.0 --threshold 10
# Run integration tests
pytest tests/integration/ -v
Testing Checklist
1. [x] conftest.py with shared fixtures
2. [x] GPU mocking for CI compatibility
3. [x] pytest.ini with custom markers
4. [x] .coveragerc with 70%/85% thresholds
5. [x] Benchmark tests (21 tests, 4 categories)
6. [x] Feature regression tests (28 tests)
7. [x] Integration tests (23 tests)
8. [x] Version comparison script
9. [x] pre-commit hook excludes slow tests
Files Created
| File |
Purpose |
| tests/conftest.py |
Shared fixtures, GPU mocking |
| pytest.ini |
Test configuration |
| .coveragerc |
Coverage thresholds |
| tests/test_benchmarks.py |
21 benchmark tests |
| tests/test_feature_regression.py |
28 regression tests |
| tests/integration/ |
Pipeline integration tests |
| scripts/run_version_benchmarks.py |
Version comparison tool |
| benchmark_results/.gitkeep |
Results storage |
References