12. Sampling and Monte Carlo Methods¶

Learning Objectives¶

Understand the fundamental principles of Monte Carlo methods and their convergence rate
Implement basic sampling methods (inverse transform, rejection sampling) and recognize their limitations
Master variance reduction techniques through importance sampling
Understand the principles and implementation of MCMC (Metropolis-Hastings, Gibbs sampling)
Grasp the mathematical background of the reparameterization trick and its role in VAEs
Learn various use cases of sampling in machine learning

1. Why Sampling?¶

1.1 Integrals that Cannot Be Computed Analytically¶

Many machine learning problems require computing integrals of the form:

$$ \mathbb{E}_{p(x)}[f(x)] = \int f(x) p(x) dx $$

For example: - Bayesian inference: Normalizing constant of the posterior $\int p(x|\theta) p(\theta) d\theta$ - Reinforcement learning: Expected reward of a policy $\mathbb{E}_{\pi}[R]$ - VAE: Expected value in ELBO $\mathbb{E}_{q(z|x)}[\log p(x|z)]$

These integrals are intractable analytically in high dimensions.

1.2 Monte Carlo Estimation¶

Monte Carlo principle: If we draw $N$ samples $x_1, \ldots, x_N$ from distribution $p(x)$:

$$ \mathbb{E}_{p(x)}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i) $$

This is guaranteed by the Law of Large Numbers.

1.3 Convergence Rate¶

The standard error of Monte Carlo estimation is:

$$ \text{SE} = \frac{\sigma}{\sqrt{N}} $$

where $\sigma^2 = \text{Var}_{p(x)}[f(x)]$.

Key observation: - Error decreases as $O(1/\sqrt{N})$ → to increase accuracy by 10x, need 100x more samples - Dimension-independent: applicable to high-dimensional integrals (grid methods suffer from the curse of dimensionality)

import numpy as np
import matplotlib.pyplot as plt

# 몬테카를로로 원주율 π 추정
def estimate_pi(n_samples):
    """단위 사각형 내 랜덤 점을 생성하고 단위 원 내부 비율로 π 추정"""
    x = np.random.uniform(-1, 1, n_samples)
    y = np.random.uniform(-1, 1, n_samples)
    inside_circle = (x**2 + y**2) <= 1
    pi_estimate = 4 * np.mean(inside_circle)
    return pi_estimate

# 샘플 수에 따른 수렴
sample_sizes = [10, 100, 1000, 10000, 100000]
estimates = [estimate_pi(n) for n in sample_sizes]

print("몬테카를로 π 추정:")
for n, est in zip(sample_sizes, estimates):
    error = abs(est - np.pi)
    print(f"N={n:6d}: π ≈ {est:.6f}, 오차 = {error:.6f}")

# 수렴 시각화
n_trials = 100
sample_range = np.logspace(1, 5, 50, dtype=int)
errors = []

for n in sample_range:
    trial_estimates = [estimate_pi(n) for _ in range(n_trials)]
    errors.append(np.std(trial_estimates))

plt.figure(figsize=(10, 6))
plt.loglog(sample_range, errors, 'b-', label='실제 표준 오차')
plt.loglog(sample_range, 1/np.sqrt(sample_range), 'r--', label=r'$O(1/\sqrt{N})$')
plt.xlabel('샘플 수 (N)')
plt.ylabel('표준 오차')
plt.title('몬테카를로 추정의 수렴 속도')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

2. Basic Sampling Methods¶

2.1 Inverse Transform Sampling¶

Theorem: If $U \sim \text{Uniform}(0, 1)$ and $F$ is a cumulative distribution function (CDF), then $X = F^{-1}(U)$ follows CDF $F$.

Algorithm: 1. Generate $u \sim \text{Uniform}(0, 1)$ 2. Compute $x = F^{-1}(u)$

# 역변환 샘플링 예제: 지수분포
def inverse_transform_exponential(n_samples, lambda_param=1.0):
    """지수분포 Exp(λ)에서 샘플링"""
    u = np.random.uniform(0, 1, n_samples)
    # F^(-1)(u) = -ln(1-u)/λ
    x = -np.log(1 - u) / lambda_param
    return x

# 검증
samples = inverse_transform_exponential(10000, lambda_param=2.0)

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.hist(samples, bins=50, density=True, alpha=0.7, label='샘플링 결과')
x_range = np.linspace(0, 5, 100)
plt.plot(x_range, 2 * np.exp(-2 * x_range), 'r-', linewidth=2, label='이론 PDF')
plt.xlabel('x')
plt.ylabel('밀도')
plt.title('역변환 샘플링: 지수분포')
plt.legend()

plt.subplot(1, 2, 2)
# Q-Q plot
from scipy import stats
stats.probplot(samples, dist=stats.expon(scale=0.5), plot=plt)
plt.title('Q-Q Plot')
plt.tight_layout()
plt.show()

Limitation: Computing $F^{-1}$ is often difficult.

2.2 Rejection Sampling¶

When direct sampling from the target distribution $p(x)$ is difficult, use a proposal distribution $q(x)$.

Algorithm: 1. Choose constant $M$ such that $M q(x) \geq p(x)$ (for all $x$) 2. Generate $x \sim q(x)$ 3. Generate $u \sim \text{Uniform}(0, 1)$ 4. Accept $x$ if $u \leq \frac{p(x)}{M q(x)}$, otherwise reject

Acceptance rate: $\frac{1}{M}$

# 기각 샘플링 예제: 베타분포 Beta(2, 5)
def target_pdf(x):
    """목표 분포: Beta(2, 5)"""
    if 0 <= x <= 1:
        return 30 * x * (1-x)**4  # 정규화된 Beta(2,5)
    return 0

def proposal_pdf(x):
    """제안 분포: Uniform(0, 1)"""
    return 1.0 if 0 <= x <= 1 else 0

# M 찾기 (최대값)
x_grid = np.linspace(0, 1, 1000)
M = max(target_pdf(x) / proposal_pdf(x) for x in x_grid)
print(f"M = {M:.2f}")

def rejection_sampling(n_samples):
    """기각 샘플링 구현"""
    samples = []
    n_rejected = 0

    while len(samples) < n_samples:
        x = np.random.uniform(0, 1)  # 제안 분포에서 샘플
        u = np.random.uniform(0, 1)

        if u <= target_pdf(x) / (M * proposal_pdf(x)):
            samples.append(x)
        else:
            n_rejected += 1

    acceptance_rate = n_samples / (n_samples + n_rejected)
    print(f"수락률: {acceptance_rate:.2%} (이론값: {1/M:.2%})")
    return np.array(samples)

samples = rejection_sampling(10000)

plt.figure(figsize=(10, 6))
plt.hist(samples, bins=50, density=True, alpha=0.7, label='샘플링 결과')
x_range = np.linspace(0, 1, 100)
plt.plot(x_range, [target_pdf(x) for x in x_range], 'r-', linewidth=2, label='목표 PDF')
plt.xlabel('x')
plt.ylabel('밀도')
plt.title('기각 샘플링: Beta(2, 5)')
plt.legend()
plt.show()

2.3 Limitations in High Dimensions¶

The acceptance rate of rejection sampling is $1/M$. In high dimensions, $M$ grows exponentially, making it inefficient.

Example: For a 10-dimensional Gaussian, $M \sim e^{10}$ → acceptance rate $< 0.01\%$

3. Importance Sampling¶

3.1 Basic Principle¶

Goal: Compute $\mathbb{E}_{p(x)}[f(x)]$

Importance sampling identity:

$$ \mathbb{E}_{p(x)}[f(x)] = \int f(x) p(x) dx = \int f(x) \frac{p(x)}{q(x)} q(x) dx = \mathbb{E}_{q(x)}\left[f(x) \frac{p(x)}{q(x)}\right] $$

$q(x)$ is called the proposal distribution, and $w(x) = \frac{p(x)}{q(x)}$ is the importance weight.

Monte Carlo estimate: $$ \mathbb{E}_{p(x)}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i) w(x_i), \quad x_i \sim q(x) $$

3.2 Choosing the Proposal Distribution¶

Good proposal distribution: - If $q(x)$ is proportional to $|f(x)| p(x)$, variance is minimized - In practice, $q(x)$ should cover heavy tails of $p(x)$

Bad choice: If $q(x)$ has lighter tails than $p(x)$ → weights explode, leading to high variance

3.3 Self-Normalized Importance Sampling¶

When $p(x)$ is known only up to a normalizing constant ($p(x) = \tilde{p}(x)/Z$):

$$ \mathbb{E}_{p(x)}[f(x)] \approx \frac{\sum_{i=1}^{N} f(x_i) \tilde{w}(x_i)}{\sum_{i=1}^{N} \tilde{w}(x_i)}, \quad \tilde{w}(x_i) = \frac{\tilde{p}(x_i)}{q(x_i)} $$

# 중요도 샘플링 예제
def target_unnormalized(x):
    """정규화되지 않은 목표 분포: 혼합 가우스"""
    return 0.3 * np.exp(-0.5 * ((x-2)/0.5)**2) + \
           0.7 * np.exp(-0.5 * ((x+2)/1.0)**2)

def proposal(x):
    """제안 분포: N(0, 3)"""
    return np.exp(-0.5 * (x/3)**2) / (3 * np.sqrt(2*np.pi))

def f(x):
    """계산하려는 함수: x^2"""
    return x**2

# 일반 몬테카를로 (목표 분포에서 직접 샘플링이 가능하다고 가정)
def naive_monte_carlo(n_samples):
    # 기각 샘플링으로 목표 분포에서 샘플
    samples = []
    while len(samples) < n_samples:
        x = np.random.normal(0, 3)
        u = np.random.uniform(0, target_unnormalized(x))
        if u <= target_unnormalized(x):
            samples.append(x)
    samples = np.array(samples)
    return np.mean(f(samples))

# 중요도 샘플링
def importance_sampling(n_samples):
    # 제안 분포에서 샘플
    samples = np.random.normal(0, 3, n_samples)
    weights = target_unnormalized(samples) / proposal(samples)
    weights = weights / np.sum(weights)  # 자기정규화
    return np.sum(f(samples) * weights * n_samples)

# 비교
n_trials = 100
n_samples = 1000

naive_results = [naive_monte_carlo(n_samples) for _ in range(n_trials)]
is_results = [importance_sampling(n_samples) for _ in range(n_trials)]

print(f"일반 MC: 평균 = {np.mean(naive_results):.4f}, 표준편차 = {np.std(naive_results):.4f}")
print(f"중요도 샘플링: 평균 = {np.mean(is_results):.4f}, 표준편차 = {np.std(is_results):.4f}")
print(f"분산 감소율: {np.var(naive_results) / np.var(is_results):.2f}x")

3.4 Connection to Reinforcement Learning¶

Importance sampling is central to policy gradient methods:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\pi_{\theta_{\text{old}}}} \left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} \nabla_\theta \log \pi_\theta(a|s) Q(s,a) \right] $$

This enables off-policy learning (PPO, TRPO).

4. Markov Chain Monte Carlo (MCMC)¶

4.1 Markov Chain Basics¶

Markov chain: Stochastic state transitions $P(x_{t+1} | x_t)$

Stationary distribution: $\pi(x) = \int \pi(x') P(x|x') dx'$

Detailed balance condition: $$ \pi(x) P(x'|x) = \pi(x') P(x|x') $$

If detailed balance is satisfied, $\pi$ is the stationary distribution.

4.2 Metropolis-Hastings Algorithm¶

Goal: Sample from target distribution $\pi(x)$ (normalizing constant not needed)

Algorithm: 1. Start from current state $x_t$ 2. Generate candidate $x'$ from proposal distribution $q(x'|x_t)$ 3. Compute acceptance probability: $$ \alpha(x', x_t) = \min\left(1, \frac{\pi(x') q(x_t|x')}{\pi(x_t) q(x'|x_t)}\right) $$ 4. Accept $x_{t+1} = x'$ with probability $\alpha$, otherwise $x_{t+1} = x_t$

# 메트로폴리스-헤이스팅스 구현
def target_distribution(x):
    """목표 분포: 이봉 분포 (bimodal)"""
    return np.exp(-0.5 * ((x-3)/0.8)**2) + np.exp(-0.5 * ((x+3)/0.8)**2)

def metropolis_hastings(n_samples, proposal_std=1.0):
    """MH 알고리즘 (대칭 제안 분포)"""
    samples = np.zeros(n_samples)
    x = 0.0  # 초기 상태
    n_accepted = 0

    for i in range(n_samples):
        # 제안 생성 (대칭 가우스)
        x_proposed = x + np.random.normal(0, proposal_std)

        # 수락 확률 계산 (대칭 제안이므로 q 항 소거)
        acceptance_prob = min(1, target_distribution(x_proposed) / target_distribution(x))

        # 수락/거부 결정
        if np.random.uniform() < acceptance_prob:
            x = x_proposed
            n_accepted += 1

        samples[i] = x

    acceptance_rate = n_accepted / n_samples
    print(f"수락률: {acceptance_rate:.2%}")
    return samples

# 샘플링
samples = metropolis_hastings(50000, proposal_std=2.0)

# 번인(burn-in) 제거
burn_in = 5000
samples_burned = samples[burn_in:]

plt.figure(figsize=(15, 5))

# 궤적
plt.subplot(1, 3, 1)
plt.plot(samples[:1000])
plt.xlabel('반복')
plt.ylabel('x')
plt.title('MCMC 궤적 (처음 1000개)')
plt.axvline(burn_in, color='r', linestyle='--', label='번인 종료')

# 히스토그램
plt.subplot(1, 3, 2)
plt.hist(samples_burned, bins=50, density=True, alpha=0.7, label='MCMC 샘플')
x_range = np.linspace(-6, 6, 200)
plt.plot(x_range, target_distribution(x_range) /
         (np.sqrt(2*np.pi*0.8**2) * 2), 'r-', linewidth=2, label='목표 분포')
plt.xlabel('x')
plt.ylabel('밀도')
plt.title('샘플 분포')
plt.legend()

# 자기상관
plt.subplot(1, 3, 3)
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(samples_burned[::10], lags=50, ax=plt.gca())
plt.title('자기상관 함수 (ACF)')
plt.tight_layout()
plt.show()

4.3 Gibbs Sampling¶

Used when conditional distributions $p(x_i | x_{-i})$ are known for multivariate distribution $p(x_1, \ldots, x_d)$.

Algorithm: 1. Iterate over each variable 2. Sample $x_i^{(t+1)} \sim p(x_i | x_1^{(t+1)}, \ldots, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, \ldots, x_d^{(t)})$

# 깁스 샘플링: 이변량 가우스
def gibbs_sampling_bivariate_gaussian(n_samples, rho=0.9):
    """이변량 가우스 N([0,0], [[1,ρ],[ρ,1]])에서 샘플링"""
    samples = np.zeros((n_samples, 2))
    x, y = 0.0, 0.0  # 초기값

    for i in range(n_samples):
        # x | y ~ N(ρy, 1-ρ²)
        x = np.random.normal(rho * y, np.sqrt(1 - rho**2))
        # y | x ~ N(ρx, 1-ρ²)
        y = np.random.normal(rho * x, np.sqrt(1 - rho**2))
        samples[i] = [x, y]

    return samples

samples = gibbs_sampling_bivariate_gaussian(10000, rho=0.9)
samples = samples[1000:]  # 번인 제거

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.scatter(samples[:, 0], samples[:, 1], alpha=0.3, s=1)
plt.xlabel('x')
plt.ylabel('y')
plt.title('깁스 샘플링: 이변량 가우스 (ρ=0.9)')
plt.axis('equal')

plt.subplot(1, 2, 2)
# 이론적 등고선
from scipy.stats import multivariate_normal
x_grid = np.linspace(-4, 4, 100)
y_grid = np.linspace(-4, 4, 100)
X, Y = np.meshgrid(x_grid, y_grid)
pos = np.dstack((X, Y))
rv = multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]])
plt.contour(X, Y, rv.pdf(pos), levels=10)
plt.scatter(samples[:, 0], samples[:, 1], alpha=0.1, s=1, c='red')
plt.xlabel('x')
plt.ylabel('y')
plt.title('이론 분포와 비교')
plt.axis('equal')
plt.tight_layout()
plt.show()

4.4 Burn-in and Autocorrelation¶

Burn-in: Initial samples haven't reached the stationary distribution, so discard them
Autocorrelation: Consecutive samples are correlated → thin by taking every k-th sample
Effective sample size: $n_{\text{eff}} = \frac{n}{1 + 2\sum_{k=1}^{\infty} \rho_k}$

5. Reparameterization Trick¶

5.1 Problem: Backpropagating Through Stochastic Nodes¶

In VAE, the encoder outputs $q_\phi(z|x)$ and samples $z \sim q_\phi(z|x)$. The loss function:

$$ \mathcal{L}(\phi) = \mathbb{E}_{z \sim q_\phi(z|x)}[f(z)] $$

Problem: How to compute $\frac{\partial}{\partial \phi} \mathbb{E}_{z \sim q_\phi}[f(z)]$?

Sampling operations are not differentiable.

5.2 Reparameterization Solution¶

Idea: Separate randomness into noise independent of $\phi$.

Gaussian distribution: $z \sim \mathcal{N}(\mu_\phi, \sigma_\phi^2)$

Reparameterization: $z = \mu_\phi + \sigma_\phi \cdot \epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$

Now: $$ \frac{\partial}{\partial \phi} \mathbb{E}_{\epsilon \sim \mathcal{N}(0,1)}[f(\mu_\phi + \sigma_\phi \epsilon)] = \mathbb{E}_{\epsilon}\left[\frac{\partial f(z)}{\partial z} \frac{\partial z}{\partial \phi}\right] $$

Gradients can flow through the sampling operation!

import torch
import torch.nn as nn

# 재매개변수화 트릭 구현
class VAEEncoder(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.fc = nn.Linear(input_dim, 128)
        self.fc_mu = nn.Linear(128, latent_dim)
        self.fc_logvar = nn.Linear(128, latent_dim)

    def forward(self, x):
        h = torch.relu(self.fc(x))
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        """재매개변수화 트릭"""
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)  # ε ~ N(0, 1)
        z = mu + std * eps           # z = μ + σε
        return z

# 기울기 흐름 테스트
encoder = VAEEncoder(input_dim=10, latent_dim=2)
x = torch.randn(32, 10)

mu, logvar = encoder(x)
z = encoder.reparameterize(mu, logvar)

# 임의의 손실 함수
loss = (z ** 2).sum()
loss.backward()

print("기울기가 성공적으로 계산됨:")
print(f"  mu의 기울기: {encoder.fc_mu.weight.grad is not None}")
print(f"  logvar의 기울기: {encoder.fc_logvar.weight.grad is not None}")
print(f"  기울기 norm: {encoder.fc_mu.weight.grad.norm():.4f}")

5.3 Reparameterization for Other Distributions¶

Bernoulli: Gumbel-Softmax trick $$ z = \text{one-hot}\left(\arg\max_i \left[\log \pi_i + G_i\right]\right) $$ where $G_i \sim \text{Gumbel}(0, 1)$

Gamma distribution: Shape augmentation

General principle: Reparameterizable if expressible as $z = g(\epsilon; \theta)$

5.4 VAE ELBO¶

VAE loss function (ELBO):

$$ \mathcal{L} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \| p(z)) $$

Applying reparameterization: - First term: Compute gradients using reparameterization trick - Second term: Analytically computable under Gaussian prior assumption

def vae_loss(x, x_recon, mu, logvar):
    """VAE 손실 함수 (ELBO)"""
    # 재구성 손실 (log p(x|z))
    recon_loss = nn.functional.binary_cross_entropy(x_recon, x, reduction='sum')

    # KL 발산 (가우스 가정 하에 해석적 계산)
    # KL(N(μ,σ²) || N(0,1)) = -0.5 * Σ(1 + log(σ²) - μ² - σ²)
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return recon_loss + kl_loss

# 전체 VAE 학습 루프 (간략화)
def train_vae(encoder, decoder, data_loader, optimizer, epochs):
    for epoch in range(epochs):
        for x in data_loader:
            # 인코더
            mu, logvar = encoder(x)
            z = encoder.reparameterize(mu, logvar)

            # 디코더
            x_recon = decoder(z)

            # 손실 계산 및 역전파
            loss = vae_loss(x, x_recon, mu, logvar)
            optimizer.zero_grad()
            loss.backward()  # 재매개변수화 덕분에 기울기 계산 가능!
            optimizer.step()

6. Machine Learning Applications¶

6.1 VAE: ELBO Optimization¶

Without the reparameterization trick, VAEs would be impossible. REINFORCE (score function estimation) has too high variance.

6.2 Reinforcement Learning: REINFORCE = Monte Carlo¶

Policy gradient theorem: $$ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau)\right] $$

This is estimated using Monte Carlo sampling.

Variance reduction: Use baselines, GAE (Generalized Advantage Estimation)

6.3 Bayesian Inference: MCMC vs Variational Inference¶

MCMC: - Pros: Theoretically exact samples - Cons: Slow, difficult to diagnose convergence

Variational Inference: - Pros: Fast, scalable - Cons: Approximation quality limited by expressiveness of $q$

6.4 Dropout = Bernoulli Sampling¶

Dropout is Bernoulli sampling during training: $$ h_{\text{drop}} = h \odot \text{Bernoulli}(p) $$

Bayesian interpretation: Dropout can be viewed as approximate Bayesian inference (Gal & Ghahramani, 2016).

# MC Dropout으로 불확실성 추정
class MCDropoutModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 50)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(50, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)  # 추론 시에도 dropout 적용
        x = self.fc2(x)
        return x

def predict_with_uncertainty(model, x, n_samples=100):
    """MC Dropout으로 예측 불확실성 추정"""
    model.train()  # Dropout 활성화
    predictions = []

    with torch.no_grad():
        for _ in range(n_samples):
            pred = model(x)
            predictions.append(pred)

    predictions = torch.cat(predictions, dim=1)
    mean = predictions.mean(dim=1)
    std = predictions.std(dim=1)
    return mean, std

# 테스트
model = MCDropoutModel()
x_test = torch.randn(10, 10)
mean, std = predict_with_uncertainty(model, x_test, n_samples=100)
print(f"예측 평균: {mean[:5]}")
print(f"예측 표준편차 (불확실성): {std[:5]}")

Practice Problems¶

Problem 1: Monte Carlo Integration¶

Estimate the following integral using Monte Carlo method: $$ I = \int_0^1 e^{-x^2} dx $$

(a) Estimate using 10, 100, 1000, 10000 samples and calculate errors. (b) Verify on a log-log plot that error decreases as $O(1/\sqrt{N})$. (c) Compare with results from scipy.integrate.quad.

Problem 2: Importance Sampling¶

You want to compute $\mathbb{E}[x^2]$ from target distribution $p(x) \propto e^{-|x|^3}$.

(a) Implement importance sampling using proposal distribution $q(x) = \mathcal{N}(0, 1)$. (b) Compare variance with the case using $q(x) = \text{Laplace}(0, 1)$ as proposal. (c) Which proposal distribution is more efficient? Explain why.

Problem 3: MCMC Diagnostics¶

Perform Metropolis-Hastings sampling from bimodal distribution $p(x) \propto e^{-(x-3)^2/2} + e^{-(x+3)^2/2}$.

(a) Observe acceptance rates as you vary proposal distribution standard deviation: 0.1, 1.0, 10.0. (b) Plot the autocorrelation function (ACF) for each case and estimate effective sample size. (c) What is the optimal proposal distribution standard deviation?

Problem 4: Reparameterization Trick¶

Implement a simple VAE and verify the effect of the reparameterization trick.

(a) Train a small VAE on MNIST dataset (latent_dim=2). (b) Compare with training using REINFORCE without reparameterization trick. (c) Visualize the latent space with 2D plots.

Problem 5: Gibbs Sampling Extension¶

Implement Gibbs sampling for a 3-variate Gaussian distribution.

$$ \mathbf{x} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} 1 & 0.8 & 0.5 \\ 0.8 & 1 & 0.7 \\ 0.5 & 0.7 & 1 \end{bmatrix}\right) $$

(a) Derive conditional distributions $p(x_i | x_{-i})$ (use Gaussian conditional formula). (b) Perform 10,000 iterations of Gibbs sampling and compute sample covariance. (c) Compare with theoretical covariance.

References¶

Books¶

Pattern Recognition and Machine Learning (Bishop, 2006) - Chapter 11: Sampling Methods
Monte Carlo Statistical Methods (Robert & Casella, 2004)
Deep Learning (Goodfellow et al., 2016) - Chapter 17: Monte Carlo Methods

Papers¶

Kingma & Welling (2013), "Auto-Encoding Variational Bayes" - VAE and reparameterization trick
Metropolis et al. (1953), "Equation of State Calculations by Fast Computing Machines"
Hastings (1970), "Monte Carlo Sampling Methods Using Markov Chains"
Gal & Ghahramani (2016), "Dropout as a Bayesian Approximation"

Online Resources¶

MCMC Interactive Gallery
Distill.pub: Visualizing Sampling Methods
PyMC Documentation - Practical Bayesian inference library

Libraries¶

PyMC3/PyMC4: Bayesian inference and MCMC
Stan: Powerful MCMC inference engine
PyTorch: VAE implementation
NumPyro: JAX-based probabilistic programming