| name | microimpute |
| description | ML-based variable imputation for survey data - used in policyengine-us-data to fill missing values |
MicroImpute
MicroImpute imputes missing survey variables using a range of statistical and machine-learning methods, with built-in tools for comparing and benchmarking them.
For Users 👥
What is MicroImpute?
When PolicyEngine calculates population impacts, the underlying survey data has missing information. MicroImpute uses machine learning to fill in those gaps intelligently.
What imputation does:
- Fills missing data in surveys
- Uses machine learning to predict missing values
- Maintains statistical relationships
- Improves PolicyEngine accuracy
Example:
- Survey asks about income but not capital gains breakdown
- MicroImpute predicts short-term vs long-term capital gains
- Based on patterns from IRS data
- Result: More accurate tax calculations
You benefit from imputation when:
- PolicyEngine calculates capital gains tax accurately
- Benefits eligibility uses complete household information
- State-specific calculations have all needed data
For Analysts 📊
Installation
pip install microimpute
# With image export (for plots)
pip install microimpute[images]
What MicroImpute Does
Imputation problem:
- Donor dataset has complete information (e.g., IRS tax records)
- Recipient dataset has missing variables (e.g., CPS survey)
- Imputation predicts missing values in recipient using donor patterns
Methods available:
- Linear regression
- Random forest
- Quantile forest (preserves full distribution)
- XGBoost
- Hot deck (traditional matching)
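Hot deck matching, the traditional baseline, is simple enough to sketch without microimpute: each recipient record borrows the observed value of its nearest donor on the shared variables. A minimal nearest-neighbor version (the data values are illustrative):

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

donor = pd.DataFrame({'income': [50000, 60000, 70000],
                      'capital_gains': [5000, 8000, 12000]})
recipient = pd.DataFrame({'income': [54000, 69000]})

# Match each recipient row to its closest donor row on income
nn = NearestNeighbors(n_neighbors=1).fit(donor[['income']])
_, idx = nn.kneighbors(recipient[['income']])
recipient['capital_gains'] = donor['capital_gains'].to_numpy()[idx.ravel()]
```

Unlike model-based methods, hot deck can only hand out values that actually appear in the donor data.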
Quick Example
from microimpute import Imputer
import pandas as pd
# Donor data (complete)
donor = pd.DataFrame({
    'income': [50000, 60000, 70000],
    'age': [30, 40, 50],
    'capital_gains': [5000, 8000, 12000]  # Variable to impute
})
# Recipient data (missing capital_gains)
recipient = pd.DataFrame({
    'income': [55000, 65000],
    'age': [35, 45],
    # capital_gains is missing
})
# Impute using quantile forest
imputer = Imputer(method='quantile_forest')
imputer.fit(
    donor=donor,
    donor_target='capital_gains',
    common_vars=['income', 'age']
)
recipient_imputed = imputer.predict(recipient)
# Now recipient has predicted capital_gains
Method Comparison
from microimpute import compare_methods
# Compare different imputation methods
results = compare_methods(
    donor=donor,
    recipient=recipient,
    target_var='capital_gains',
    common_vars=['income', 'age'],
    methods=['linear', 'random_forest', 'quantile_forest']
)
# Shows quantile loss for each method
print(results)
Quantile Loss (Quality Metric)
Why quantile loss:
- Measures how well imputation preserves the distribution
- Not just mean accuracy, but full distribution shape
- Lower is better
Interpretation:
# Quantile loss around 0.1 = good
# Quantile loss around 0.5 = poor
# Compare across methods to choose best
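Quantile loss is the pinball loss averaged over a grid of quantiles. A minimal numpy version (the helper function is illustrative, not microimpute's API):

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Pinball loss: under-prediction costs q, over-prediction costs 1 - q."""
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 310.0])
quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
# Average over quantiles for one overall score (lower is better)
score = np.mean([pinball_loss(y_true, y_pred, q) for q in quantiles])
```

At q = 0.5 the pinball loss is half the mean absolute error, which is why it rewards methods that get the whole distribution right rather than just the mean.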
For Contributors 💻
Repository
Location: PolicyEngine/microimpute
Clone:
git clone https://github.com/PolicyEngine/microimpute
cd microimpute
Current Implementation
To see structure:
tree microimpute/
# Key modules:
ls microimpute/
# - imputer.py - Main Imputer class
# - methods/ - Different imputation methods
# - comparison.py - Method benchmarking
# - utils/ - Utilities
To see specific methods:
# Quantile forest implementation
cat microimpute/methods/quantile_forest.py
# Random forest
cat microimpute/methods/random_forest.py
# Linear regression
cat microimpute/methods/linear.py
Dependencies
Required:
- numpy, pandas (data handling)
- scikit-learn (ML models)
- quantile-forest (distributional imputation)
- optuna (hyperparameter tuning)
- statsmodels (statistical methods)
- scipy (statistical functions)
To see all dependencies:
cat pyproject.toml
Adding New Imputation Methods
Pattern:
# microimpute/methods/my_method.py
class MyMethodImputer:
    def fit(self, X_train, y_train):
        """Train on donor data."""
        # Fit your model
        pass

    def predict(self, X_test):
        """Impute on recipient data."""
        # Return predictions
        pass

    def get_quantile_loss(self, X_val, y_val):
        """Compute validation loss."""
        # Evaluate quality
        pass
Usage in policyengine-us-data
To see how data pipeline uses microimpute:
cd ../policyengine-us-data
# Find usage
grep -r "microimpute" policyengine_us_data/
grep -r "Imputer" policyengine_us_data/
Typical workflow:
- Load CPS (has demographics, missing capital gains details)
- Load IRS PUF (has complete tax data)
- Use microimpute to predict missing CPS variables from PUF patterns
- Validate imputation quality
- Save enhanced dataset
Testing
Run tests:
make test
# Or
pytest tests/ -v --cov=microimpute
To see test patterns:
cat tests/test_imputer.py
cat tests/test_methods.py
Common Patterns
Pattern 1: Basic Imputation
from microimpute import Imputer
# Create imputer
imputer = Imputer(method='quantile_forest')
# Fit on donor (complete data)
imputer.fit(
    donor=donor_df,
    donor_target='target_variable',
    common_vars=['age', 'income', 'state']
)
# Predict on recipient (missing target_variable)
recipient_imputed = imputer.predict(recipient_df)
Pattern 2: Choosing Best Method
from microimpute import compare_methods
# Test multiple methods
methods = ['linear', 'random_forest', 'quantile_forest', 'xgboost']
results = compare_methods(
    donor=donor,
    recipient=recipient,
    target_var='target',
    common_vars=common_vars,
    methods=methods
)
# Use method with lowest quantile loss
best_method = results.sort_values('quantile_loss').iloc[0]['method']
Pattern 3: Multiple Variable Imputation
# Impute several variables
variables_to_impute = [
    'short_term_capital_gains',
    'long_term_capital_gains',
    'qualified_dividends'
]
for var in variables_to_impute:
    imputer = Imputer(method='quantile_forest')
    imputer.fit(donor=irs_puf, donor_target=var, common_vars=common_vars)
    cps[var] = imputer.predict(cps)
Advanced Features
Hyperparameter Tuning
Built-in Optuna integration:
from microimpute import tune_hyperparameters
# Automatically find best hyperparameters
best_params, study = tune_hyperparameters(
    donor=donor,
    target_var='target',
    common_vars=common_vars,
    method='quantile_forest',
    n_trials=100
)
# Use tuned parameters
imputer = Imputer(method='quantile_forest', **best_params)
Cross-Validation
Validate imputation quality:
from sklearn.model_selection import train_test_split
# Hold out part of the donor, where the true target is known
train, val = train_test_split(donor, test_size=0.2, random_state=42)
imputer = Imputer(method='quantile_forest')
imputer.fit(donor=train, donor_target='capital_gains', common_vars=['income', 'age'])
val_imputed = imputer.predict(val.drop(columns='capital_gains'))
# Compare val_imputed against the held-out val['capital_gains']
Visualization
Plot imputation results:
import plotly.express as px
# Compare imputed vs actual (on donor validation set)
fig = px.scatter(
    x=actual_values,
    y=imputed_values,
    labels={'x': 'Actual', 'y': 'Imputed'}
)
# 45-degree reference line (px.line returns a figure, not a trace)
lo, hi = min(actual_values), max(actual_values)
fig.add_shape(type='line', x0=lo, y0=lo, x1=hi, y1=hi)
Statistical Background
Imputation preserves:
- Marginal distributions (imputed variable distribution matches donor)
- Conditional relationships (imputation depends on common variables)
- Uncertainty (quantile methods preserve full distribution)
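One way to check the marginal-distribution claim is a two-sample Kolmogorov-Smirnov test between donor values and imputed values. A sketch with synthetic stand-ins for both:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
donor_values = rng.lognormal(mean=8, sigma=1, size=5000)
# Stand-in for imputed output; in practice, use the imputed recipient column
imputed_values = rng.lognormal(mean=8, sigma=1, size=5000)

# Small KS statistic (and large p-value) = distributions are close
stat, pvalue = ks_2samp(donor_values, imputed_values)
```

A large KS statistic on real output is a warning sign that the imputation collapsed or distorted the distribution.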
Trade-offs:
- Linear: Fast, but assumes linear relationships
- Random forest: Handles non-linearity, may overfit
- Quantile forest: Preserves full distribution, slower
- XGBoost: High accuracy, requires tuning
Integration with PolicyEngine
Full pipeline (policyengine-us-data):
1. Load CPS survey data
↓
2. microimpute: Fill missing variables from IRS PUF
↓
3. microcalibrate: Adjust weights to match benchmarks
↓
4. Validation: Check against administrative totals
↓
5. Package: Distribute enhanced dataset
↓
6. PolicyEngine: Use for population simulations
Comparison to Other Methods
MicroImpute vs traditional imputation:
Traditional (mean imputation):
- Fast but destroys distribution
- All missing values get same value
- Underestimates variance
MicroImpute (ML methods):
- Preserves relationships
- Different predictions per record
- Maintains distribution shape
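The variance point is easy to demonstrate: mean imputation gives every missing record the same fill value, so the imputed values carry zero variance.

```python
import numpy as np

observed = np.array([1000.0, 3000.0, 9000.0, 27000.0])
# Mean imputation: every missing record gets the same fill value
mean_imputed = np.full(observed.shape, observed.mean())

spread_real = observed.std()         # > 0: real data varies
spread_imputed = mean_imputed.std()  # 0.0: variance destroyed
```

Any downstream statistic that depends on dispersion (poverty rates, tax brackets, inequality measures) is biased by that collapse.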
Quantile forest advantage:
- Predicts full conditional distribution
- Not just point estimates
- Can sample from predicted distribution
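The sampling idea can be sketched with numpy alone: given the conditional quantiles a quantile model predicts for one record, draw values by inverse-CDF interpolation (the quantile values here are made up):

```python
import numpy as np

# Hypothetical conditional quantiles predicted for a single record
quantiles = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
predicted = np.array([1000.0, 2500.0, 5000.0, 9000.0, 15000.0])

# Inverse-CDF sampling: uniform draws mapped through the quantile curve
rng = np.random.default_rng(0)
u = rng.uniform(0.1, 0.9, size=1000)
samples = np.interp(u, quantiles, predicted)
# samples now spread across the predicted distribution instead of one point
```

Drawing one such sample per record keeps realistic person-to-person variation in the enhanced dataset.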
Performance Tips
For large datasets:
# Use random forest (faster than quantile forest)
imputer = Imputer(method='random_forest')
# Or subsample donor
donor_sample = donor.sample(n=10000, random_state=42)
imputer.fit(donor=donor_sample, ...)
For high accuracy:
# Use quantile forest with tuning
best_params, _ = tune_hyperparameters(...)
imputer = Imputer(method='quantile_forest', **best_params)
Related Skills
- l0-skill - Regularization techniques
- microcalibrate-skill - Survey calibration (next step after imputation)
- policyengine-us-data-skill - Complete data pipeline
- microdf-skill - Working with imputed/calibrated data
Resources
Repository: https://github.com/PolicyEngine/microimpute
PyPI: https://pypi.org/project/microimpute/
Documentation: See README and docstrings in source