12. Sampling and Monte Carlo Methods

12. Sampling and Monte Carlo Methods

Learning Objectives

  • Understand the fundamental principles of Monte Carlo methods and their convergence rate
  • Implement basic sampling methods (inverse transform, rejection sampling) and recognize their limitations
  • Master variance reduction techniques through importance sampling
  • Understand the principles and implementation of MCMC (Metropolis-Hastings, Gibbs sampling)
  • Grasp the mathematical background of the reparameterization trick and its role in VAEs
  • Learn various use cases of sampling in machine learning

1. Why Sampling?

1.1 Integrals that Cannot Be Computed Analytically

Many machine learning problems require computing integrals of the form:

$$ \mathbb{E}_{p(x)}[f(x)] = \int f(x) p(x) dx $$

For example: - Bayesian inference: Normalizing constant of the posterior $\int p(x|\theta) p(\theta) d\theta$ - Reinforcement learning: Expected reward of a policy $\mathbb{E}_{\pi}[R]$ - VAE: Expected value in ELBO $\mathbb{E}_{q(z|x)}[\log p(x|z)]$

These integrals are intractable analytically in high dimensions.

1.2 Monte Carlo Estimation

Monte Carlo principle: If we draw $N$ samples $x_1, \ldots, x_N$ from distribution $p(x)$:

$$ \mathbb{E}_{p(x)}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i) $$

This is guaranteed by the Law of Large Numbers.

1.3 Convergence Rate

The standard error of Monte Carlo estimation is:

$$ \text{SE} = \frac{\sigma}{\sqrt{N}} $$

where $\sigma^2 = \text{Var}_{p(x)}[f(x)]$.

Key observation: - Error decreases as $O(1/\sqrt{N})$ โ†’ to increase accuracy by 10x, need 100x more samples - Dimension-independent: applicable to high-dimensional integrals (grid methods suffer from the curse of dimensionality)

import numpy as np
import matplotlib.pyplot as plt

# ๋ชฌํ…Œ์นด๋ฅผ๋กœ๋กœ ์›์ฃผ์œจ ฯ€ ์ถ”์ •
def estimate_pi(n_samples):
    """๋‹จ์œ„ ์‚ฌ๊ฐํ˜• ๋‚ด ๋žœ๋ค ์ ์„ ์ƒ์„ฑํ•˜๊ณ  ๋‹จ์œ„ ์› ๋‚ด๋ถ€ ๋น„์œจ๋กœ ฯ€ ์ถ”์ •"""
    x = np.random.uniform(-1, 1, n_samples)
    y = np.random.uniform(-1, 1, n_samples)
    inside_circle = (x**2 + y**2) <= 1
    pi_estimate = 4 * np.mean(inside_circle)
    return pi_estimate

# ์ƒ˜ํ”Œ ์ˆ˜์— ๋”ฐ๋ฅธ ์ˆ˜๋ ด
sample_sizes = [10, 100, 1000, 10000, 100000]
estimates = [estimate_pi(n) for n in sample_sizes]

print("๋ชฌํ…Œ์นด๋ฅผ๋กœ ฯ€ ์ถ”์ •:")
for n, est in zip(sample_sizes, estimates):
    error = abs(est - np.pi)
    print(f"N={n:6d}: ฯ€ โ‰ˆ {est:.6f}, ์˜ค์ฐจ = {error:.6f}")

# ์ˆ˜๋ ด ์‹œ๊ฐํ™”
n_trials = 100
sample_range = np.logspace(1, 5, 50, dtype=int)
errors = []

for n in sample_range:
    trial_estimates = [estimate_pi(n) for _ in range(n_trials)]
    errors.append(np.std(trial_estimates))

plt.figure(figsize=(10, 6))
plt.loglog(sample_range, errors, 'b-', label='์‹ค์ œ ํ‘œ์ค€ ์˜ค์ฐจ')
plt.loglog(sample_range, 1/np.sqrt(sample_range), 'r--', label=r'$O(1/\sqrt{N})$')
plt.xlabel('์ƒ˜ํ”Œ ์ˆ˜ (N)')
plt.ylabel('ํ‘œ์ค€ ์˜ค์ฐจ')
plt.title('๋ชฌํ…Œ์นด๋ฅผ๋กœ ์ถ”์ •์˜ ์ˆ˜๋ ด ์†๋„')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

2. Basic Sampling Methods

2.1 Inverse Transform Sampling

Theorem: If $U \sim \text{Uniform}(0, 1)$ and $F$ is a cumulative distribution function (CDF), then $X = F^{-1}(U)$ follows CDF $F$.

Algorithm: 1. Generate $u \sim \text{Uniform}(0, 1)$ 2. Compute $x = F^{-1}(u)$

# ์—ญ๋ณ€ํ™˜ ์ƒ˜ํ”Œ๋ง ์˜ˆ์ œ: ์ง€์ˆ˜๋ถ„ํฌ
def inverse_transform_exponential(n_samples, lambda_param=1.0):
    """์ง€์ˆ˜๋ถ„ํฌ Exp(ฮป)์—์„œ ์ƒ˜ํ”Œ๋ง"""
    u = np.random.uniform(0, 1, n_samples)
    # F^(-1)(u) = -ln(1-u)/ฮป
    x = -np.log(1 - u) / lambda_param
    return x

# ๊ฒ€์ฆ
samples = inverse_transform_exponential(10000, lambda_param=2.0)

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.hist(samples, bins=50, density=True, alpha=0.7, label='์ƒ˜ํ”Œ๋ง ๊ฒฐ๊ณผ')
x_range = np.linspace(0, 5, 100)
plt.plot(x_range, 2 * np.exp(-2 * x_range), 'r-', linewidth=2, label='์ด๋ก  PDF')
plt.xlabel('x')
plt.ylabel('๋ฐ€๋„')
plt.title('์—ญ๋ณ€ํ™˜ ์ƒ˜ํ”Œ๋ง: ์ง€์ˆ˜๋ถ„ํฌ')
plt.legend()

plt.subplot(1, 2, 2)
# Q-Q plot
from scipy import stats
stats.probplot(samples, dist=stats.expon(scale=0.5), plot=plt)
plt.title('Q-Q Plot')
plt.tight_layout()
plt.show()

Limitation: Computing $F^{-1}$ is often difficult.

2.2 Rejection Sampling

When direct sampling from the target distribution $p(x)$ is difficult, use a proposal distribution $q(x)$.

Algorithm: 1. Choose constant $M$ such that $M q(x) \geq p(x)$ (for all $x$) 2. Generate $x \sim q(x)$ 3. Generate $u \sim \text{Uniform}(0, 1)$ 4. Accept $x$ if $u \leq \frac{p(x)}{M q(x)}$, otherwise reject

Acceptance rate: $\frac{1}{M}$

# ๊ธฐ๊ฐ ์ƒ˜ํ”Œ๋ง ์˜ˆ์ œ: ๋ฒ ํƒ€๋ถ„ํฌ Beta(2, 5)
def target_pdf(x):
    """๋ชฉํ‘œ ๋ถ„ํฌ: Beta(2, 5)"""
    if 0 <= x <= 1:
        return 30 * x * (1-x)**4  # ์ •๊ทœํ™”๋œ Beta(2,5)
    return 0

def proposal_pdf(x):
    """์ œ์•ˆ ๋ถ„ํฌ: Uniform(0, 1)"""
    return 1.0 if 0 <= x <= 1 else 0

# M ์ฐพ๊ธฐ (์ตœ๋Œ€๊ฐ’)
x_grid = np.linspace(0, 1, 1000)
M = max(target_pdf(x) / proposal_pdf(x) for x in x_grid)
print(f"M = {M:.2f}")

def rejection_sampling(n_samples):
    """๊ธฐ๊ฐ ์ƒ˜ํ”Œ๋ง ๊ตฌํ˜„"""
    samples = []
    n_rejected = 0

    while len(samples) < n_samples:
        x = np.random.uniform(0, 1)  # ์ œ์•ˆ ๋ถ„ํฌ์—์„œ ์ƒ˜ํ”Œ
        u = np.random.uniform(0, 1)

        if u <= target_pdf(x) / (M * proposal_pdf(x)):
            samples.append(x)
        else:
            n_rejected += 1

    acceptance_rate = n_samples / (n_samples + n_rejected)
    print(f"์ˆ˜๋ฝ๋ฅ : {acceptance_rate:.2%} (์ด๋ก ๊ฐ’: {1/M:.2%})")
    return np.array(samples)

samples = rejection_sampling(10000)

plt.figure(figsize=(10, 6))
plt.hist(samples, bins=50, density=True, alpha=0.7, label='์ƒ˜ํ”Œ๋ง ๊ฒฐ๊ณผ')
x_range = np.linspace(0, 1, 100)
plt.plot(x_range, [target_pdf(x) for x in x_range], 'r-', linewidth=2, label='๋ชฉํ‘œ PDF')
plt.xlabel('x')
plt.ylabel('๋ฐ€๋„')
plt.title('๊ธฐ๊ฐ ์ƒ˜ํ”Œ๋ง: Beta(2, 5)')
plt.legend()
plt.show()

2.3 Limitations in High Dimensions

The acceptance rate of rejection sampling is $1/M$. In high dimensions, $M$ grows exponentially, making it inefficient.

Example: For a 10-dimensional Gaussian, $M \sim e^{10}$ โ†’ acceptance rate $< 0.01\%$


3. Importance Sampling

3.1 Basic Principle

Goal: Compute $\mathbb{E}_{p(x)}[f(x)]$

Importance sampling identity:

$$ \mathbb{E}_{p(x)}[f(x)] = \int f(x) p(x) dx = \int f(x) \frac{p(x)}{q(x)} q(x) dx = \mathbb{E}_{q(x)}\left[f(x) \frac{p(x)}{q(x)}\right] $$

$q(x)$ is called the proposal distribution, and $w(x) = \frac{p(x)}{q(x)}$ is the importance weight.

Monte Carlo estimate: $$ \mathbb{E}_{p(x)}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i) w(x_i), \quad x_i \sim q(x) $$

3.2 Choosing the Proposal Distribution

Good proposal distribution: - If $q(x)$ is proportional to $|f(x)| p(x)$, variance is minimized - In practice, $q(x)$ should cover heavy tails of $p(x)$

Bad choice: If $q(x)$ has lighter tails than $p(x)$ โ†’ weights explode, leading to high variance

3.3 Self-Normalized Importance Sampling

When $p(x)$ is known only up to a normalizing constant ($p(x) = \tilde{p}(x)/Z$):

$$ \mathbb{E}_{p(x)}[f(x)] \approx \frac{\sum_{i=1}^{N} f(x_i) \tilde{w}(x_i)}{\sum_{i=1}^{N} \tilde{w}(x_i)}, \quad \tilde{w}(x_i) = \frac{\tilde{p}(x_i)}{q(x_i)} $$

# ์ค‘์š”๋„ ์ƒ˜ํ”Œ๋ง ์˜ˆ์ œ
def target_unnormalized(x):
    """์ •๊ทœํ™”๋˜์ง€ ์•Š์€ ๋ชฉํ‘œ ๋ถ„ํฌ: ํ˜ผํ•ฉ ๊ฐ€์šฐ์Šค"""
    return 0.3 * np.exp(-0.5 * ((x-2)/0.5)**2) + \
           0.7 * np.exp(-0.5 * ((x+2)/1.0)**2)

def proposal(x):
    """์ œ์•ˆ ๋ถ„ํฌ: N(0, 3)"""
    return np.exp(-0.5 * (x/3)**2) / (3 * np.sqrt(2*np.pi))

def f(x):
    """๊ณ„์‚ฐํ•˜๋ ค๋Š” ํ•จ์ˆ˜: x^2"""
    return x**2

# ์ผ๋ฐ˜ ๋ชฌํ…Œ์นด๋ฅผ๋กœ (๋ชฉํ‘œ ๋ถ„ํฌ์—์„œ ์ง์ ‘ ์ƒ˜ํ”Œ๋ง์ด ๊ฐ€๋Šฅํ•˜๋‹ค๊ณ  ๊ฐ€์ •)
def naive_monte_carlo(n_samples):
    # ๊ธฐ๊ฐ ์ƒ˜ํ”Œ๋ง์œผ๋กœ ๋ชฉํ‘œ ๋ถ„ํฌ์—์„œ ์ƒ˜ํ”Œ
    samples = []
    while len(samples) < n_samples:
        x = np.random.normal(0, 3)
        u = np.random.uniform(0, target_unnormalized(x))
        if u <= target_unnormalized(x):
            samples.append(x)
    samples = np.array(samples)
    return np.mean(f(samples))

# ์ค‘์š”๋„ ์ƒ˜ํ”Œ๋ง
def importance_sampling(n_samples):
    # ์ œ์•ˆ ๋ถ„ํฌ์—์„œ ์ƒ˜ํ”Œ
    samples = np.random.normal(0, 3, n_samples)
    weights = target_unnormalized(samples) / proposal(samples)
    weights = weights / np.sum(weights)  # ์ž๊ธฐ์ •๊ทœํ™”
    return np.sum(f(samples) * weights * n_samples)

# ๋น„๊ต
n_trials = 100
n_samples = 1000

naive_results = [naive_monte_carlo(n_samples) for _ in range(n_trials)]
is_results = [importance_sampling(n_samples) for _ in range(n_trials)]

print(f"์ผ๋ฐ˜ MC: ํ‰๊ท  = {np.mean(naive_results):.4f}, ํ‘œ์ค€ํŽธ์ฐจ = {np.std(naive_results):.4f}")
print(f"์ค‘์š”๋„ ์ƒ˜ํ”Œ๋ง: ํ‰๊ท  = {np.mean(is_results):.4f}, ํ‘œ์ค€ํŽธ์ฐจ = {np.std(is_results):.4f}")
print(f"๋ถ„์‚ฐ ๊ฐ์†Œ์œจ: {np.var(naive_results) / np.var(is_results):.2f}x")

3.4 Connection to Reinforcement Learning

Importance sampling is central to policy gradient methods:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\pi_{\theta_{\text{old}}}} \left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} \nabla_\theta \log \pi_\theta(a|s) Q(s,a) \right] $$

This enables off-policy learning (PPO, TRPO).


4. Markov Chain Monte Carlo (MCMC)

4.1 Markov Chain Basics

Markov chain: Stochastic state transitions $P(x_{t+1} | x_t)$

Stationary distribution: $\pi(x) = \int \pi(x') P(x|x') dx'$

Detailed balance condition: $$ \pi(x) P(x'|x) = \pi(x') P(x|x') $$

If detailed balance is satisfied, $\pi$ is the stationary distribution.

4.2 Metropolis-Hastings Algorithm

Goal: Sample from target distribution $\pi(x)$ (normalizing constant not needed)

Algorithm: 1. Start from current state $x_t$ 2. Generate candidate $x'$ from proposal distribution $q(x'|x_t)$ 3. Compute acceptance probability: $$ \alpha(x', x_t) = \min\left(1, \frac{\pi(x') q(x_t|x')}{\pi(x_t) q(x'|x_t)}\right) $$ 4. Accept $x_{t+1} = x'$ with probability $\alpha$, otherwise $x_{t+1} = x_t$

# ๋ฉ”ํŠธ๋กœํด๋ฆฌ์Šค-ํ—ค์ด์ŠคํŒ…์Šค ๊ตฌํ˜„
def target_distribution(x):
    """๋ชฉํ‘œ ๋ถ„ํฌ: ์ด๋ด‰ ๋ถ„ํฌ (bimodal)"""
    return np.exp(-0.5 * ((x-3)/0.8)**2) + np.exp(-0.5 * ((x+3)/0.8)**2)

def metropolis_hastings(n_samples, proposal_std=1.0):
    """MH ์•Œ๊ณ ๋ฆฌ์ฆ˜ (๋Œ€์นญ ์ œ์•ˆ ๋ถ„ํฌ)"""
    samples = np.zeros(n_samples)
    x = 0.0  # ์ดˆ๊ธฐ ์ƒํƒœ
    n_accepted = 0

    for i in range(n_samples):
        # ์ œ์•ˆ ์ƒ์„ฑ (๋Œ€์นญ ๊ฐ€์šฐ์Šค)
        x_proposed = x + np.random.normal(0, proposal_std)

        # ์ˆ˜๋ฝ ํ™•๋ฅ  ๊ณ„์‚ฐ (๋Œ€์นญ ์ œ์•ˆ์ด๋ฏ€๋กœ q ํ•ญ ์†Œ๊ฑฐ)
        acceptance_prob = min(1, target_distribution(x_proposed) / target_distribution(x))

        # ์ˆ˜๋ฝ/๊ฑฐ๋ถ€ ๊ฒฐ์ •
        if np.random.uniform() < acceptance_prob:
            x = x_proposed
            n_accepted += 1

        samples[i] = x

    acceptance_rate = n_accepted / n_samples
    print(f"์ˆ˜๋ฝ๋ฅ : {acceptance_rate:.2%}")
    return samples

# ์ƒ˜ํ”Œ๋ง
samples = metropolis_hastings(50000, proposal_std=2.0)

# ๋ฒˆ์ธ(burn-in) ์ œ๊ฑฐ
burn_in = 5000
samples_burned = samples[burn_in:]

plt.figure(figsize=(15, 5))

# ๊ถค์ 
plt.subplot(1, 3, 1)
plt.plot(samples[:1000])
plt.xlabel('๋ฐ˜๋ณต')
plt.ylabel('x')
plt.title('MCMC ๊ถค์  (์ฒ˜์Œ 1000๊ฐœ)')
plt.axvline(burn_in, color='r', linestyle='--', label='๋ฒˆ์ธ ์ข…๋ฃŒ')

# ํžˆ์Šคํ† ๊ทธ๋žจ
plt.subplot(1, 3, 2)
plt.hist(samples_burned, bins=50, density=True, alpha=0.7, label='MCMC ์ƒ˜ํ”Œ')
x_range = np.linspace(-6, 6, 200)
plt.plot(x_range, target_distribution(x_range) /
         (np.sqrt(2*np.pi*0.8**2) * 2), 'r-', linewidth=2, label='๋ชฉํ‘œ ๋ถ„ํฌ')
plt.xlabel('x')
plt.ylabel('๋ฐ€๋„')
plt.title('์ƒ˜ํ”Œ ๋ถ„ํฌ')
plt.legend()

# ์ž๊ธฐ์ƒ๊ด€
plt.subplot(1, 3, 3)
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(samples_burned[::10], lags=50, ax=plt.gca())
plt.title('์ž๊ธฐ์ƒ๊ด€ ํ•จ์ˆ˜ (ACF)')
plt.tight_layout()
plt.show()

4.3 Gibbs Sampling

Used when conditional distributions $p(x_i | x_{-i})$ are known for multivariate distribution $p(x_1, \ldots, x_d)$.

Algorithm: 1. Iterate over each variable 2. Sample $x_i^{(t+1)} \sim p(x_i | x_1^{(t+1)}, \ldots, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, \ldots, x_d^{(t)})$

# ๊น์Šค ์ƒ˜ํ”Œ๋ง: ์ด๋ณ€๋Ÿ‰ ๊ฐ€์šฐ์Šค
def gibbs_sampling_bivariate_gaussian(n_samples, rho=0.9):
    """์ด๋ณ€๋Ÿ‰ ๊ฐ€์šฐ์Šค N([0,0], [[1,ฯ],[ฯ,1]])์—์„œ ์ƒ˜ํ”Œ๋ง"""
    samples = np.zeros((n_samples, 2))
    x, y = 0.0, 0.0  # ์ดˆ๊ธฐ๊ฐ’

    for i in range(n_samples):
        # x | y ~ N(ฯy, 1-ฯยฒ)
        x = np.random.normal(rho * y, np.sqrt(1 - rho**2))
        # y | x ~ N(ฯx, 1-ฯยฒ)
        y = np.random.normal(rho * x, np.sqrt(1 - rho**2))
        samples[i] = [x, y]

    return samples

samples = gibbs_sampling_bivariate_gaussian(10000, rho=0.9)
samples = samples[1000:]  # ๋ฒˆ์ธ ์ œ๊ฑฐ

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.scatter(samples[:, 0], samples[:, 1], alpha=0.3, s=1)
plt.xlabel('x')
plt.ylabel('y')
plt.title('๊น์Šค ์ƒ˜ํ”Œ๋ง: ์ด๋ณ€๋Ÿ‰ ๊ฐ€์šฐ์Šค (ฯ=0.9)')
plt.axis('equal')

plt.subplot(1, 2, 2)
# ์ด๋ก ์  ๋“ฑ๊ณ ์„ 
from scipy.stats import multivariate_normal
x_grid = np.linspace(-4, 4, 100)
y_grid = np.linspace(-4, 4, 100)
X, Y = np.meshgrid(x_grid, y_grid)
pos = np.dstack((X, Y))
rv = multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]])
plt.contour(X, Y, rv.pdf(pos), levels=10)
plt.scatter(samples[:, 0], samples[:, 1], alpha=0.1, s=1, c='red')
plt.xlabel('x')
plt.ylabel('y')
plt.title('์ด๋ก  ๋ถ„ํฌ์™€ ๋น„๊ต')
plt.axis('equal')
plt.tight_layout()
plt.show()

4.4 Burn-in and Autocorrelation

  • Burn-in: Initial samples haven't reached the stationary distribution, so discard them
  • Autocorrelation: Consecutive samples are correlated โ†’ thin by taking every k-th sample
  • Effective sample size: $n_{\text{eff}} = \frac{n}{1 + 2\sum_{k=1}^{\infty} \rho_k}$

5. Reparameterization Trick

5.1 Problem: Backpropagating Through Stochastic Nodes

In VAE, the encoder outputs $q_\phi(z|x)$ and samples $z \sim q_\phi(z|x)$. The loss function:

$$ \mathcal{L}(\phi) = \mathbb{E}_{z \sim q_\phi(z|x)}[f(z)] $$

Problem: How to compute $\frac{\partial}{\partial \phi} \mathbb{E}_{z \sim q_\phi}[f(z)]$?

Sampling operations are not differentiable.

5.2 Reparameterization Solution

Idea: Separate randomness into noise independent of $\phi$.

Gaussian distribution: $z \sim \mathcal{N}(\mu_\phi, \sigma_\phi^2)$

Reparameterization: $z = \mu_\phi + \sigma_\phi \cdot \epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$

Now: $$ \frac{\partial}{\partial \phi} \mathbb{E}_{\epsilon \sim \mathcal{N}(0,1)}[f(\mu_\phi + \sigma_\phi \epsilon)] = \mathbb{E}_{\epsilon}\left[\frac{\partial f(z)}{\partial z} \frac{\partial z}{\partial \phi}\right] $$

Gradients can flow through the sampling operation!

import torch
import torch.nn as nn

# ์žฌ๋งค๊ฐœ๋ณ€์ˆ˜ํ™” ํŠธ๋ฆญ ๊ตฌํ˜„
class VAEEncoder(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.fc = nn.Linear(input_dim, 128)
        self.fc_mu = nn.Linear(128, latent_dim)
        self.fc_logvar = nn.Linear(128, latent_dim)

    def forward(self, x):
        h = torch.relu(self.fc(x))
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        """์žฌ๋งค๊ฐœ๋ณ€์ˆ˜ํ™” ํŠธ๋ฆญ"""
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)  # ฮต ~ N(0, 1)
        z = mu + std * eps           # z = ฮผ + ฯƒฮต
        return z

# ๊ธฐ์šธ๊ธฐ ํ๋ฆ„ ํ…Œ์ŠคํŠธ
encoder = VAEEncoder(input_dim=10, latent_dim=2)
x = torch.randn(32, 10)

mu, logvar = encoder(x)
z = encoder.reparameterize(mu, logvar)

# ์ž„์˜์˜ ์†์‹ค ํ•จ์ˆ˜
loss = (z ** 2).sum()
loss.backward()

print("๊ธฐ์šธ๊ธฐ๊ฐ€ ์„ฑ๊ณต์ ์œผ๋กœ ๊ณ„์‚ฐ๋จ:")
print(f"  mu์˜ ๊ธฐ์šธ๊ธฐ: {encoder.fc_mu.weight.grad is not None}")
print(f"  logvar์˜ ๊ธฐ์šธ๊ธฐ: {encoder.fc_logvar.weight.grad is not None}")
print(f"  ๊ธฐ์šธ๊ธฐ norm: {encoder.fc_mu.weight.grad.norm():.4f}")

5.3 Reparameterization for Other Distributions

Bernoulli: Gumbel-Softmax trick $$ z = \text{one-hot}\left(\arg\max_i \left[\log \pi_i + G_i\right]\right) $$ where $G_i \sim \text{Gumbel}(0, 1)$

Gamma distribution: Shape augmentation

General principle: Reparameterizable if expressible as $z = g(\epsilon; \theta)$

5.4 VAE ELBO

VAE loss function (ELBO):

$$ \mathcal{L} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \| p(z)) $$

Applying reparameterization: - First term: Compute gradients using reparameterization trick - Second term: Analytically computable under Gaussian prior assumption

def vae_loss(x, x_recon, mu, logvar):
    """VAE ์†์‹ค ํ•จ์ˆ˜ (ELBO)"""
    # ์žฌ๊ตฌ์„ฑ ์†์‹ค (log p(x|z))
    recon_loss = nn.functional.binary_cross_entropy(x_recon, x, reduction='sum')

    # KL ๋ฐœ์‚ฐ (๊ฐ€์šฐ์Šค ๊ฐ€์ • ํ•˜์— ํ•ด์„์  ๊ณ„์‚ฐ)
    # KL(N(ฮผ,ฯƒยฒ) || N(0,1)) = -0.5 * ฮฃ(1 + log(ฯƒยฒ) - ฮผยฒ - ฯƒยฒ)
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return recon_loss + kl_loss

# ์ „์ฒด VAE ํ•™์Šต ๋ฃจํ”„ (๊ฐ„๋žตํ™”)
def train_vae(encoder, decoder, data_loader, optimizer, epochs):
    for epoch in range(epochs):
        for x in data_loader:
            # ์ธ์ฝ”๋”
            mu, logvar = encoder(x)
            z = encoder.reparameterize(mu, logvar)

            # ๋””์ฝ”๋”
            x_recon = decoder(z)

            # ์†์‹ค ๊ณ„์‚ฐ ๋ฐ ์—ญ์ „ํŒŒ
            loss = vae_loss(x, x_recon, mu, logvar)
            optimizer.zero_grad()
            loss.backward()  # ์žฌ๋งค๊ฐœ๋ณ€์ˆ˜ํ™” ๋•๋ถ„์— ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐ ๊ฐ€๋Šฅ!
            optimizer.step()

6. Machine Learning Applications

6.1 VAE: ELBO Optimization

Without the reparameterization trick, VAEs would be impossible. REINFORCE (score function estimation) has too high variance.

6.2 Reinforcement Learning: REINFORCE = Monte Carlo

Policy gradient theorem: $$ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau)\right] $$

This is estimated using Monte Carlo sampling.

Variance reduction: Use baselines, GAE (Generalized Advantage Estimation)

6.3 Bayesian Inference: MCMC vs Variational Inference

MCMC: - Pros: Theoretically exact samples - Cons: Slow, difficult to diagnose convergence

Variational Inference: - Pros: Fast, scalable - Cons: Approximation quality limited by expressiveness of $q$

6.4 Dropout = Bernoulli Sampling

Dropout is Bernoulli sampling during training: $$ h_{\text{drop}} = h \odot \text{Bernoulli}(p) $$

Bayesian interpretation: Dropout can be viewed as approximate Bayesian inference (Gal & Ghahramani, 2016).

# MC Dropout์œผ๋กœ ๋ถˆํ™•์‹ค์„ฑ ์ถ”์ •
class MCDropoutModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 50)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(50, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)  # ์ถ”๋ก  ์‹œ์—๋„ dropout ์ ์šฉ
        x = self.fc2(x)
        return x

def predict_with_uncertainty(model, x, n_samples=100):
    """MC Dropout์œผ๋กœ ์˜ˆ์ธก ๋ถˆํ™•์‹ค์„ฑ ์ถ”์ •"""
    model.train()  # Dropout ํ™œ์„ฑํ™”
    predictions = []

    with torch.no_grad():
        for _ in range(n_samples):
            pred = model(x)
            predictions.append(pred)

    predictions = torch.cat(predictions, dim=1)
    mean = predictions.mean(dim=1)
    std = predictions.std(dim=1)
    return mean, std

# ํ…Œ์ŠคํŠธ
model = MCDropoutModel()
x_test = torch.randn(10, 10)
mean, std = predict_with_uncertainty(model, x_test, n_samples=100)
print(f"์˜ˆ์ธก ํ‰๊ท : {mean[:5]}")
print(f"์˜ˆ์ธก ํ‘œ์ค€ํŽธ์ฐจ (๋ถˆํ™•์‹ค์„ฑ): {std[:5]}")

Practice Problems

Problem 1: Monte Carlo Integration

Estimate the following integral using Monte Carlo method: $$ I = \int_0^1 e^{-x^2} dx $$

(a) Estimate using 10, 100, 1000, 10000 samples and calculate errors. (b) Verify on a log-log plot that error decreases as $O(1/\sqrt{N})$. (c) Compare with results from scipy.integrate.quad.

Problem 2: Importance Sampling

You want to compute $\mathbb{E}[x^2]$ from target distribution $p(x) \propto e^{-|x|^3}$.

(a) Implement importance sampling using proposal distribution $q(x) = \mathcal{N}(0, 1)$. (b) Compare variance with the case using $q(x) = \text{Laplace}(0, 1)$ as proposal. (c) Which proposal distribution is more efficient? Explain why.

Problem 3: MCMC Diagnostics

Perform Metropolis-Hastings sampling from bimodal distribution $p(x) \propto e^{-(x-3)^2/2} + e^{-(x+3)^2/2}$.

(a) Observe acceptance rates as you vary proposal distribution standard deviation: 0.1, 1.0, 10.0. (b) Plot the autocorrelation function (ACF) for each case and estimate effective sample size. (c) What is the optimal proposal distribution standard deviation?

Problem 4: Reparameterization Trick

Implement a simple VAE and verify the effect of the reparameterization trick.

(a) Train a small VAE on MNIST dataset (latent_dim=2). (b) Compare with training using REINFORCE without reparameterization trick. (c) Visualize the latent space with 2D plots.

Problem 5: Gibbs Sampling Extension

Implement Gibbs sampling for a 3-variate Gaussian distribution.

$$ \mathbf{x} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} 1 & 0.8 & 0.5 \\ 0.8 & 1 & 0.7 \\ 0.5 & 0.7 & 1 \end{bmatrix}\right) $$

(a) Derive conditional distributions $p(x_i | x_{-i})$ (use Gaussian conditional formula). (b) Perform 10,000 iterations of Gibbs sampling and compute sample covariance. (c) Compare with theoretical covariance.


References

Books

  • Pattern Recognition and Machine Learning (Bishop, 2006) - Chapter 11: Sampling Methods
  • Monte Carlo Statistical Methods (Robert & Casella, 2004)
  • Deep Learning (Goodfellow et al., 2016) - Chapter 17: Monte Carlo Methods

Papers

  • Kingma & Welling (2013), "Auto-Encoding Variational Bayes" - VAE and reparameterization trick
  • Metropolis et al. (1953), "Equation of State Calculations by Fast Computing Machines"
  • Hastings (1970), "Monte Carlo Sampling Methods Using Markov Chains"
  • Gal & Ghahramani (2016), "Dropout as a Bayesian Approximation"

Online Resources

Libraries

  • PyMC3/PyMC4: Bayesian inference and MCMC
  • Stan: Powerful MCMC inference engine
  • PyTorch: VAE implementation
  • NumPyro: JAX-based probabilistic programming
to navigate between lessons