| name | mcmc-sampling-stan |
| description | Guidance for Bayesian MCMC sampling tasks using RStan. This skill applies when implementing hierarchical Bayesian models, configuring Stan/RStan for MCMC inference, or working with posterior distributions. Use for tasks involving Stan model specification, RStan installation, MCMC diagnostics, and Bayesian hierarchical modeling. |
MCMC Sampling with Stan
Overview
This skill provides procedural guidance for implementing Markov Chain Monte Carlo (MCMC) sampling using Stan and RStan. It covers the complete workflow from environment setup through model specification to convergence validation, with emphasis on avoiding common pitfalls in hierarchical Bayesian modeling.
When to Use This Skill
- Installing and configuring RStan in R environments
- Writing Stan models for Bayesian inference
- Implementing hierarchical/multilevel models
- Configuring MCMC sampling parameters
- Diagnosing and validating MCMC convergence
Pre-Implementation Checklist
Before writing any code, verify the following in order:
1. System Dependencies First
Critical: Always check system dependencies before attempting R package installation.
# Check for C++ compiler and build tools
which g++ || which clang++
dpkg -l | grep -E "(build-essential|r-base-dev)" # Debian/Ubuntu
Required system packages for RStan:
build-essentialor equivalent C++ toolchainr-base-devfor R development headers- Sufficient memory (Stan compilation is memory-intensive)
2. RStan Installation Strategy
Install RStan with explicit version pinning when specified:
# If specific version required (e.g., 2.32.7)
install.packages("rstan", repos = "https://cloud.r-project.org/")
# Verify installation and version
library(rstan)
packageVersion("rstan")
Common installation failures:
- Missing C++ compiler → Install build-essential first
- Compilation errors → Check g++ version compatibility
- Memory exhaustion → Close other applications, increase swap
3. Environment Verification
Before proceeding with model development:
# Verify RStan loads without errors
library(rstan)
# Test Stan compilation with minimal model
stan_model(model_code = "parameters { real x; } model { x ~ normal(0,1); }")
Stan Model Specification
Hierarchical Model Structure
For hierarchical Bayesian models, follow this block structure:
data {
// Declare all input data with types and constraints
int<lower=0> N; // Number of observations
int<lower=0> y[N]; // Observed counts
int<lower=0> n[N]; // Trial sizes
}
parameters {
// Hyperparameters (population-level)
real<lower=0> alpha;
real<lower=0> beta;
// Group-level parameters
real<lower=0, upper=1> theta[N];
}
model {
// Hyperprior (if using custom/improper priors)
// Example: p(α,β) ∝ (α+β)^(-5/2)
target += -2.5 * log(alpha + beta);
// Prior for group parameters
theta ~ beta(alpha, beta);
// Likelihood
y ~ binomial(n, theta);
}
Custom Prior Implementation
When implementing custom or improper priors:
| Prior Form | Stan Implementation |
|---|---|
| p(x) ∝ x^a | target += a * log(x); |
| p(x) ∝ (a+b)^c | target += c * log(a + b); |
| Flat/improper | No explicit statement needed |
Document the mathematical transformation: Always comment the relationship between the mathematical prior and the target += statement.
MCMC Sampling Configuration
Recommended Control Parameters
For hierarchical models with potential sampling difficulties:
fit <- sampling(
model,
data = stan_data,
chains = 4, # Multiple chains for convergence assessment
iter = 100000, # Total iterations (including warmup)
seed = 1, # Reproducibility
control = list(
adapt_delta = 0.95, # Increase for divergent transitions
max_treedepth = 15 # Increase for complex posteriors
)
)
Parameter Selection Rationale
| Parameter | Default | When to Increase | Why |
|---|---|---|---|
adapt_delta |
0.8 | Divergent transitions | Smaller step sizes improve exploration |
max_treedepth |
10 | "Maximum treedepth" warnings | Allows longer trajectories |
iter |
2000 | Low effective sample size | More samples for inference |
warmup |
iter/2 | Slow convergence | More adaptation time |
Warmup Considerations
- Default warmup is 50% of
iter - For 100,000 iterations: 50,000 warmup + 50,000 sampling per chain
- Explicitly specify if different warmup proportion needed:
sampling(..., iter = 100000, warmup = 25000) # 25% warmup
Convergence Diagnostics
Required Validation Steps
Always verify these after sampling:
# 1. Check Rhat (potential scale reduction factor)
summary(fit)$summary[, "Rhat"]
# Target: All Rhat < 1.01 (ideally < 1.005)
# 2. Check effective sample size
summary(fit)$summary[, "n_eff"]
# Target: n_eff > 400 for reliable inference
# 3. Check for divergent transitions
get_num_divergent(fit)
# Target: 0 divergent transitions
# 4. Check treedepth saturation
get_num_max_treedepth(fit)
# Target: 0 or minimal saturation
Diagnostic Interpretation
| Issue | Indicator | Resolution |
|---|---|---|
| Poor mixing | Rhat > 1.01 | Increase iter, reparameterize |
| Inefficient sampling | Low n_eff | Increase iter, adjust adapt_delta |
| Geometric problems | Divergent transitions | Increase adapt_delta, reparameterize |
| Complex posterior | Max treedepth hits | Increase max_treedepth |
Common Pitfalls and Solutions
1. Installation Order Errors
Wrong: Install RStan → Installation fails → Check dependencies Right: Check dependencies → Install prerequisites → Install RStan
2. Missing Convergence Checks
Wrong: Run sampling → Extract means → Report results Right: Run sampling → Check Rhat/n_eff/divergences → Validate → Extract results
3. Improper Prior Issues
When using improper priors like p(α,β) ∝ (α+β)^(-5/2):
- Ensure parameters are constrained (
<lower=0>) - Verify posterior is proper (converges)
- Check for boundary behavior near zero
4. Command Syntax in Shell
Avoid piping R output with shell redirection that causes parsing errors:
# Problematic
Rscript -e "code" 2>&1 | grep ...
# Safer
Rscript -e "code" > output.txt 2>&1
grep ... output.txt
Verification Strategy
Stepwise Verification Approach
- Environment: Verify RStan installation before model development
- Compilation: Test Stan model compiles without errors
- Sampling: Run with reduced iterations first to catch issues
- Diagnostics: Check all convergence metrics before trusting results
- Results: Extract posteriors only after validation passes
Sanity Checks for Results
- Posterior means should be plausible given prior and data
- 95% credible intervals should have reasonable width
- Multiple chains should show similar posteriors
- Results should be consistent across different seeds (approximately)
References
For detailed Stan documentation and examples, consult:
- Stan User's Guide: Model specification and best practices
- RStan Getting Started: Installation and basic usage
- Stan Functions Reference: Available distributions and functions