---
name: policyengine-data-testing
description: Testing patterns for PolicyEngine data generation pipelines (policyengine-us-data, policyengine-uk-data)
---

# PolicyEngine Data Testing Patterns
Testing patterns and optimization strategies for PolicyEngine data generation repositories.
## Quick Reference

### Test Mode Pattern

```python
import os

TESTING = os.environ.get("TESTING") == "1"

# Reduce expensive parameters in test mode
epochs = 32 if TESTING else 512
batch_size = 256 if TESTING else 1024
```

### CI Configuration

```yaml
# .github/workflows/test.yml
env:
  TESTING: 1  # Enable fast test mode
```
## 1. Test Mode Environment Variable

### Pattern
Data generation pipelines often involve expensive operations:
- Neural network training (calibration, imputation)
- Large-scale data processing
- Multiple iterations/epochs
For CI tests, use a `TESTING` environment variable to reduce runtime:
```python
import os

TESTING = os.environ.get("TESTING") == "1"


def create_dataset():
    # Use reduced parameters in test mode
    if TESTING:
        epochs = 32
        sample_size = 1000
        iterations = 10
    else:
        epochs = 512
        sample_size = 100000
        iterations = 100
    # Rest of implementation...
```
### Where to Apply

Use `TESTING` mode for:

- **Neural network training** - Reduce epochs from 512 to 32-64
- **Calibration iterations** - Reduce from thousands to hundreds
- **Sample sizes** - Use smaller representative samples
- **Data validation** - Check a subset instead of the full dataset (see the sketch after the next list)
### Don't Use For

- **Data correctness logic** - Always validate fully
- **Critical calculations** - Never skip important steps
- **File I/O operations** - These are usually fast enough
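The boundary between the two lists can be encoded directly: correctness assertions always run on the full dataset, while expensive statistical spot checks may use a subset in test mode. A minimal sketch; `household_weight`, `household_id`, and `employment_income` are illustrative column names, not a fixed PolicyEngine schema:

```python
import os

import pandas as pd

TESTING = os.environ.get("TESTING") == "1"


def validate_dataset(data: pd.DataFrame) -> None:
    # Correctness logic: always runs on the full dataset, in every mode.
    assert (data["household_weight"] >= 0).all(), "negative weights"
    assert not data["household_id"].duplicated().any(), "duplicate IDs"

    # Expensive statistical checks: a subset is acceptable in test mode.
    subset = data.sample(frac=0.01, random_state=42) if TESTING else data
    assert subset["employment_income"].notna().mean() > 0.99
```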
## 2. CI/CD Configuration

### GitHub Actions

Set `TESTING=1` in workflow files:

```yaml
name: Test

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    env:
      TESTING: 1  # Enable fast test mode
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -e .
      - name: Run tests
        run: make test
```
### Local Testing

Developers can enable test mode locally:

```bash
# Fast test mode
TESTING=1 pytest

# Production mode (full training)
pytest
```
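To make `pytest` default to fast mode without typing the variable each time, a root-level `conftest.py` can set it before any pipeline module is imported. This file is a suggestion, not something these repositories necessarily ship:

```python
# conftest.py (repository root) - sets the default before test modules import
# the pipeline code; an explicit TESTING=0 on the command line still wins.
import os

os.environ.setdefault("TESTING", "1")
```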
## 3. Common Data Pipeline Operations

### Neural Network Training

```python
import os

from microcalibrate import Calibrator

TESTING = os.environ.get("TESTING") == "1"


def calibrate_weights(data, targets):
    calibrator = Calibrator(
        data=data,
        targets=targets,
        epochs=32 if TESTING else 512,  # Reduce training time
        batch_size=256 if TESTING else 1024,
        learning_rate=0.01,
        early_stopping=TESTING,  # Stop early in tests
    )
    return calibrator.fit()
```
### Data Imputation

```python
import os

from microimpute import Imputer

TESTING = os.environ.get("TESTING") == "1"


def impute_variables(data):
    imputer = Imputer(
        method="random_forest",
        n_estimators=10 if TESTING else 100,  # Fewer trees
        max_depth=5 if TESTING else 20,  # Shallower trees
        n_jobs=-1,
    )
    return imputer.fit_transform(data)
```
### Sample Size Reduction

```python
import os

import pandas as pd

TESTING = os.environ.get("TESTING") == "1"


def load_and_process_data():
    data = pd.read_csv("raw_data.csv")
    if TESTING:
        # Use a 1% sample for testing
        data = data.sample(frac=0.01, random_state=42)
    # Process the full or sampled data
    return process(data)
```
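A uniform random sample can drop small subgroups that calibration targets rely on (small states, rare income bands). When that matters, stratify the sample so every group survives; a sketch, where the `state` column is an assumption about your data:

```python
import pandas as pd


def sample_stratified(data: pd.DataFrame, frac: float = 0.01) -> pd.DataFrame:
    # Take frac of each state, but keep at least one row per stratum.
    return data.groupby("state", group_keys=False).apply(
        lambda g: g.sample(n=max(1, int(len(g) * frac)), random_state=42)
    )
```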
## 4. Runtime Impact Examples

### Before: 40+ minute CI tests

```python
# create_datasets.py
def create_enhanced_cps():
    # Always uses 512 epochs
    calibrate_weights(data, targets, epochs=512)
    # Result: CI timeouts and slow feedback
```

### After: 5-10 minute CI tests

```python
# create_datasets.py
import os

TESTING = os.environ.get("TESTING") == "1"


def create_enhanced_cps():
    epochs = 32 if TESTING else 512
    calibrate_weights(data, targets, epochs=epochs)
    # Result: fast CI, quick feedback, full training in production
```
### Typical Time Savings

| Operation | Production | Test Mode | Savings |
|---|---|---|---|
| Calibration (512 epochs) | 30 min | 2 min | 93% |
| Imputation (100 trees) | 10 min | 1 min | 90% |
| Full pipeline | 45 min | 5 min | 89% |
## 5. Best Practices

### Do's ✅

- ✅ **Use for expensive operations** - Training, large-scale processing
- ✅ **Document the difference** - Comment what changes in test mode
- ✅ **Keep logic identical** - Only change hyperparameters, not algorithms
- ✅ **Set in CI configuration** - Always enable for automated tests
- ✅ **Make it optional** - Default to production mode if not set
### Don'ts ❌

- ❌ **Skip validation** - Always validate correctness
- ❌ **Change algorithms** - Same method, different scale
- ❌ **Hide errors** - Test mode should catch real issues
- ❌ **Make tests meaningless** - Keep tests representative (see the sketch after this list)
- ❌ **Forget documentation** - Explain the pattern in the README
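One way to keep a reduced run representative is to assert invariants that must hold at any scale, so test mode still exercises the real pipeline end to end. A sketch, assuming the `create_enhanced_dataset` function from section 6; the column name is illustrative:

```python
# tests/test_pipeline.py - a sketch, not an existing test file.
from create_datasets import create_enhanced_dataset


def test_pipeline_invariants():
    data = create_enhanced_dataset()
    # These must hold in TESTING mode and in production alike.
    assert len(data) > 0
    assert (data["household_weight"] >= 0).all()
    assert data["household_weight"].sum() > 0
```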
## 6. Example: Complete Implementation

### create_datasets.py

```python
"""Enhanced dataset creation pipeline.

Set TESTING=1 to use reduced parameters for faster CI tests.
"""
import os

from microcalibrate import Calibrator
from microimpute import Imputer

# Detect test mode
TESTING = os.environ.get("TESTING") == "1"

# Configure parameters based on mode
CONFIG = {
    "epochs": 32 if TESTING else 512,
    "batch_size": 256 if TESTING else 1024,
    "n_trees": 10 if TESTING else 100,
    "sample_frac": 0.01 if TESTING else 1.0,
}

if TESTING:
    print("Running in TESTING mode with reduced parameters")
    print(f"Config: {CONFIG}")


def create_enhanced_dataset():
    """Create an enhanced dataset with imputation and calibration."""
    # load_raw_data, load_targets, and save_dataset are project-specific
    # helpers assumed to be defined elsewhere in the repository.
    data = load_raw_data()

    # Sample if in test mode
    if TESTING:
        data = data.sample(frac=CONFIG["sample_frac"], random_state=42)

    # Impute missing values
    imputer = Imputer(
        method="random_forest",
        n_estimators=CONFIG["n_trees"],
        n_jobs=-1,
    )
    data = imputer.fit_transform(data)

    # Calibrate weights to targets
    calibrator = Calibrator(
        data=data,
        targets=load_targets(),
        epochs=CONFIG["epochs"],
        batch_size=CONFIG["batch_size"],
    )
    data = calibrator.fit_transform(data)

    # Save results
    save_dataset(data)
    return data


if __name__ == "__main__":
    create_enhanced_dataset()
```
### .github/workflows/test.yml

```yaml
name: Test Data Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    env:
      TESTING: 1  # Enable fast test mode for CI
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install -e .
          pip install pytest pytest-cov
      - name: Run data pipeline tests
        run: |
          python create_datasets.py
          pytest tests/
```
### README.md Addition

````markdown
## Testing

The data pipeline supports a fast test mode for CI:

```bash
# Fast test mode (reduced epochs, smaller samples)
TESTING=1 python create_datasets.py

# Production mode (full training)
python create_datasets.py
```

In test mode:

- Epochs reduced from 512 to 32
- Sample size reduced to 1%
- Tree count reduced from 100 to 10
- Runtime: ~5 minutes vs ~45 minutes
````
---
## 7. Repository-Specific Notes
### policyengine-us-data
- Primary bottleneck: CPS calibration with neural networks
- Test mode reduces 512 epochs → 32 epochs
- Savings: ~40 minutes → ~5 minutes in CI
### policyengine-uk-data
- Primary bottleneck: FRS data processing and calibration
- Apply same pattern for neural network training
- Consider sample size reduction for large datasets
---
## 8. When to Use This Pattern
### Use When
- Repository has data generation scripts
- CI tests take >10 minutes
- Pipeline includes ML training (calibration, imputation)
- Tests timeout or are too slow for rapid iteration
### Don't Use When
- Tests already run quickly (<5 minutes)
- No expensive operations (just file I/O)
- Correctness depends on full-scale processing
- Repository is not a data pipeline
---
## For Agents
When working on `policyengine-*-data` repositories:
1. **Check for slow CI** - Look at workflow run times
2. **Identify bottlenecks** - Usually neural network training
3. **Add TESTING variable** - Check `os.environ.get("TESTING") == "1"`
4. **Reduce expensive parameters** - Epochs, trees, sample sizes
5. **Update CI config** - Set `TESTING: 1` in workflow env
6. **Document the change** - Explain in comments and README
7. **Test both modes** - Verify test mode catches real issues (see the sketch below)
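Because `TESTING` is captured at module import time, checking both modes in one pytest run means reloading the module after changing the environment. A sketch, assuming the `create_datasets.py` module and its `CONFIG` dict from section 6:

```python
# tests/test_modes.py - a sketch; reload picks up the changed env var.
import importlib

import pytest


@pytest.mark.parametrize("mode,expected_epochs", [("1", 32), ("0", 512)])
def test_config_tracks_mode(monkeypatch, mode, expected_epochs):
    monkeypatch.setenv("TESTING", mode)
    import create_datasets

    # Module-level TESTING/CONFIG are re-evaluated on reload.
    importlib.reload(create_datasets)
    assert create_datasets.CONFIG["epochs"] == expected_epochs
```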