name	mcmc-sampling-stan
description	Guidance for Bayesian MCMC sampling tasks using RStan. This skill applies when implementing hierarchical Bayesian models, configuring Stan/RStan for MCMC inference, or working with posterior distributions. Use for tasks involving Stan model specification, RStan installation, MCMC diagnostics, and Bayesian hierarchical modeling.

MCMC Sampling with Stan

Overview

This skill provides procedural guidance for implementing Markov Chain Monte Carlo (MCMC) sampling using Stan and RStan. It covers the complete workflow from environment setup through model specification to convergence validation, with emphasis on avoiding common pitfalls in hierarchical Bayesian modeling.

When to Use This Skill

Installing and configuring RStan in R environments
Writing Stan models for Bayesian inference
Implementing hierarchical/multilevel models
Configuring MCMC sampling parameters
Diagnosing and validating MCMC convergence

Pre-Implementation Checklist

Before writing any code, verify the following in order:

1. System Dependencies First

Critical: Always check system dependencies before attempting R package installation.

# Check for C++ compiler and build tools
which g++ || which clang++
dpkg -l | grep -E "(build-essential|r-base-dev)" # Debian/Ubuntu

Required system packages for RStan:

build-essential or equivalent C++ toolchain
r-base-dev for R development headers
Sufficient memory (Stan compilation is memory-intensive)

2. RStan Installation Strategy

Install RStan with explicit version pinning when specified:

# If specific version required (e.g., 2.32.7)
install.packages("rstan", repos = "https://cloud.r-project.org/")

# Verify installation and version
library(rstan)
packageVersion("rstan")

Common installation failures:

Missing C++ compiler → Install build-essential first
Compilation errors → Check g++ version compatibility
Memory exhaustion → Close other applications, increase swap

3. Environment Verification

Before proceeding with model development:

# Verify RStan loads without errors
library(rstan)

# Test Stan compilation with minimal model
stan_model(model_code = "parameters { real x; } model { x ~ normal(0,1); }")

Stan Model Specification

Hierarchical Model Structure

For hierarchical Bayesian models, follow this block structure:

data {
  // Declare all input data with types and constraints
  int<lower=0> N;           // Number of observations
  int<lower=0> y[N];        // Observed counts
  int<lower=0> n[N];        // Trial sizes
}

parameters {
  // Hyperparameters (population-level)
  real<lower=0> alpha;
  real<lower=0> beta;

  // Group-level parameters
  real<lower=0, upper=1> theta[N];
}

model {
  // Hyperprior (if using custom/improper priors)
  // Example: p(α,β) ∝ (α+β)^(-5/2)
  target += -2.5 * log(alpha + beta);

  // Prior for group parameters
  theta ~ beta(alpha, beta);

  // Likelihood
  y ~ binomial(n, theta);
}

Custom Prior Implementation

When implementing custom or improper priors:

Prior Form	Stan Implementation
p(x) ∝ x^a	`target += a * log(x);`
p(x) ∝ (a+b)^c	`target += c * log(a + b);`
Flat/improper	No explicit statement needed

Document the mathematical transformation: Always comment the relationship between the mathematical prior and the target += statement.

MCMC Sampling Configuration

Recommended Control Parameters

For hierarchical models with potential sampling difficulties:

fit <- sampling(
  model,
  data = stan_data,
  chains = 4,                    # Multiple chains for convergence assessment
  iter = 100000,                 # Total iterations (including warmup)
  seed = 1,                      # Reproducibility
  control = list(
    adapt_delta = 0.95,          # Increase for divergent transitions
    max_treedepth = 15           # Increase for complex posteriors
  )
)

Parameter Selection Rationale

Parameter	Default	When to Increase	Why
`adapt_delta`	0.8	Divergent transitions	Smaller step sizes improve exploration
`max_treedepth`	10	"Maximum treedepth" warnings	Allows longer trajectories
`iter`	2000	Low effective sample size	More samples for inference
`warmup`	iter/2	Slow convergence	More adaptation time

Warmup Considerations

Default warmup is 50% of iter
For 100,000 iterations: 50,000 warmup + 50,000 sampling per chain
Explicitly specify if different warmup proportion needed:

sampling(..., iter = 100000, warmup = 25000)  # 25% warmup

Convergence Diagnostics

Required Validation Steps

Always verify these after sampling:

# 1. Check Rhat (potential scale reduction factor)
summary(fit)$summary[, "Rhat"]
# Target: All Rhat < 1.01 (ideally < 1.005)

# 2. Check effective sample size
summary(fit)$summary[, "n_eff"]
# Target: n_eff > 400 for reliable inference

# 3. Check for divergent transitions
get_num_divergent(fit)
# Target: 0 divergent transitions

# 4. Check treedepth saturation
get_num_max_treedepth(fit)
# Target: 0 or minimal saturation

Diagnostic Interpretation

Issue	Indicator	Resolution
Poor mixing	Rhat > 1.01	Increase iter, reparameterize
Inefficient sampling	Low n_eff	Increase iter, adjust adapt_delta
Geometric problems	Divergent transitions	Increase adapt_delta, reparameterize
Complex posterior	Max treedepth hits	Increase max_treedepth

Common Pitfalls and Solutions

1. Installation Order Errors

Wrong: Install RStan → Installation fails → Check dependencies Right: Check dependencies → Install prerequisites → Install RStan

2. Missing Convergence Checks

Wrong: Run sampling → Extract means → Report results Right: Run sampling → Check Rhat/n_eff/divergences → Validate → Extract results

3. Improper Prior Issues

When using improper priors like p(α,β) ∝ (α+β)^(-5/2):

Ensure parameters are constrained (<lower=0>)
Verify posterior is proper (converges)
Check for boundary behavior near zero

4. Command Syntax in Shell

Avoid piping R output with shell redirection that causes parsing errors:

# Problematic
Rscript -e "code" 2>&1 | grep ...

# Safer
Rscript -e "code" > output.txt 2>&1
grep ... output.txt

Verification Strategy

Stepwise Verification Approach

Environment: Verify RStan installation before model development
Compilation: Test Stan model compiles without errors
Sampling: Run with reduced iterations first to catch issues
Diagnostics: Check all convergence metrics before trusting results
Results: Extract posteriors only after validation passes

Sanity Checks for Results

Posterior means should be plausible given prior and data
95% credible intervals should have reasonable width
Multiple chains should show similar posteriors
Results should be consistent across different seeds (approximately)

References

For detailed Stan documentation and examples, consult:

Stan User's Guide: Model specification and best practices
RStan Getting Started: Installation and basic usage
Stan Functions Reference: Available distributions and functions

mcmc-sampling-stan

Install Skill

SKILL.md