name	mcmc-sampling-stan
description	Guide for performing Markov Chain Monte Carlo (MCMC) sampling using RStan or PyStan. This skill should be used when implementing Bayesian statistical models, fitting hierarchical models, working with Stan modeling language, or running MCMC diagnostics. Applies to tasks involving posterior sampling, Bayesian inference, and probabilistic programming with Stan.

MCMC Sampling with Stan

Overview

This skill provides guidance for implementing Bayesian models and running MCMC sampling using Stan (via RStan or PyStan). It covers model specification, prior implementation, sampling configuration, and critical diagnostic checks that must be performed to validate results.

Workflow

Phase 1: Environment Setup

Before writing any Stan code:

Verify Stan installation and version
- Check that the required version of RStan/PyStan is installed
- RStan requires a C++ toolchain; verify compilation works before proceeding
- For RStan: Check with packageVersion("rstan") and test compilation with a simple model
Check system dependencies first
- Stan requires compilation; missing system libraries cause cryptic errors
- On Linux: Ensure g++, make, and development libraries are installed
- On macOS: Xcode command line tools required
- Verify the toolchain before attempting package installation
Common installation pitfalls
- R's install.packages() does not accept a version parameter for CRAN packages
- To install a specific version, use remotes::install_version() or install from source
- RStan compilation can fail silently; always test with a minimal model first

Phase 2: Data Exploration

Before implementing the model:

Analyze the data structure
- Examine dimensions: number of observations, groups, variables
- Check for missing values, zeros, or extreme values
- Understand the range and distribution of variables
Identify potential issues
- Zero counts in binomial/Poisson models (valid but worth noting)
- Extreme values that might cause numerical issues
- Small sample sizes that may not support complex models
Document data characteristics
- Record any data-specific considerations that affect model specification
- Note whether the data provides enough information for the chosen priors

Phase 3: Model Implementation

When writing the Stan model:

Prior specification
- Translate mathematical priors correctly to Stan syntax
- Use target += for log-probability contributions (e.g., target += -2.5 * log(alpha + beta) for p(α,β) ∝ (α+β)^(-5/2))
- Document the reasoning behind prior choices
Parameter constraints
- Use appropriate bounds: real<lower=0>, real<lower=0, upper=1>, etc.
- Consider whether improper priors will yield proper posteriors given the data
- Add small lower bounds if numerical stability is a concern (e.g., real<lower=0.001>)
Model block structure
- Follow Stan's block order: data, transformed data, parameters, transformed parameters, model, generated quantities
- Keep transformations in appropriate blocks for efficiency
Numerical stability considerations
- Improper priors (e.g., (α+β)^(-5/2)) may not integrate to finite values
- The data must provide sufficient information for posterior propriety
- Consider adding soft constraints if parameters can drift to extreme values

Phase 4: Sampling Configuration

Configure MCMC sampling appropriately:

Iteration and warmup settings
- RStan default: half of iter for warmup, half for sampling
- Explicitly set warmup parameter for clarity
- Example: iter = 100000 with default settings = 50,000 warmup + 50,000 samples per chain
Control parameters
- adapt_delta: Increase (e.g., 0.95 or 0.99) if divergent transitions occur
- max_treedepth: Increase (e.g., 15) if hitting tree depth limits
- Document why non-default values are chosen
Chain configuration
- Multiple chains (typically 4) enable convergence diagnostics
- Set seeds for reproducibility
- Consider computational resources vs. chain count tradeoffs

Phase 5: Diagnostic Checks (CRITICAL)

Never skip diagnostics. Successful sampling completion does not guarantee valid results.

Convergence diagnostics (MANDATORY)
- R-hat (Gelman-Rubin statistic): Must be < 1.01 for all parameters
- Effective Sample Size (ESS): Should be > 400 for reliable estimates; > 100 minimum
- Check both bulk-ESS and tail-ESS
Sampling diagnostics (MANDATORY)
- Divergent transitions: Must be 0; any divergences indicate model problems
- Tree depth: Check for max treedepth warnings
- Energy diagnostics: E-BFMI should be > 0.3
Visual diagnostics (recommended)
- Trace plots: Chains should mix well ("fuzzy caterpillar" appearance)
- Density plots: Posteriors should be smooth and reasonable
- Pairs plots: Check for problematic correlations

How to access diagnostics in RStan

# Summary with R-hat and ESS
print(fit, pars = c("alpha", "beta"))

# Check for divergences
sampler_params <- get_sampler_params(fit, inc_warmup = FALSE)
sum(sapply(sampler_params, function(x) sum(x[, "divergent__"])))

# Check tree depth
sum(sapply(sampler_params, function(x) sum(x[, "treedepth__"] >= 15)))

Interpreting diagnostic failures
- Divergences: Reparameterize model or increase adapt_delta
- Low ESS: Run longer chains or reparameterize
- High R-hat: Chains haven't converged; run longer or check model specification

Phase 6: Results Extraction

After diagnostics pass:

Extract posterior summaries
- Posterior means, medians, and credible intervals
- Standard deviations and quantiles
Validate against expectations
- Compare to known results for standard models (e.g., Bayesian Data Analysis examples)
- Check that posteriors are in reasonable ranges
- Verify that constraints are respected
Output file handling
- Use appropriate functions for clean file output
- In R: writeLines() or cat() with explicit formatting
- Avoid artifacts like trailing newlines or formatting characters

Common Pitfalls

Installation Issues

Attempting to specify version in install.packages() (use remotes::install_version() instead)
Missing C++ toolchain or system libraries
Not testing compilation before running full models

Model Specification Errors

Incorrect translation of mathematical priors to Stan code
Missing or incorrect parameter bounds
Not considering posterior propriety with improper priors

Sampling Problems

Not explicitly setting warmup period
Using default control parameters when model requires tuning
Running insufficient iterations for convergence

Diagnostic Omissions

Assuming successful sampling means valid results
Not checking R-hat, ESS, or divergent transitions
Ignoring warnings about tree depth or energy diagnostics

Output Errors

Bash command parsing issues with redirection operators in R scripts
Not verifying output file format and content
Missing error handling for file operations

Verification Checklist

Before considering the task complete, verify:

Stan/RStan/PyStan version matches requirements
Model compiles without errors
Priors are correctly implemented (verify mathematical translation)
Parameter bounds are appropriate
Sampling completes without errors
R-hat < 1.01 for all parameters
ESS > 400 (or reasonable for the application)
Zero divergent transitions
No max treedepth warnings (or addressed if present)
Posterior summaries are reasonable
Output files are correctly formatted

References

For detailed information on Stan diagnostics and model reparameterization, consult:

references/stan_diagnostics.md - Detailed diagnostic interpretation guide

mcmc-sampling-stan

Install Skill

SKILL.md