33. Diffusion Models (DDPM)


Overview

Denoising Diffusion Probabilistic Models (DDPM; Ho et al., 2020) are generative models that learn to synthesize data by reversing a gradual noising process: Gaussian noise is added to training data over many small steps, and a network is trained to undo those steps one at a time.


Mathematical Background

1. Forward Diffusion Process

Goal: Gradually add Gaussian noise to data xβ‚€

q(xβ‚œ|xβ‚œβ‚‹β‚) = N(xβ‚œ; √(1-Ξ²β‚œ)xβ‚œβ‚‹β‚, Ξ²β‚œI)

Where:
- xβ‚€: original data
- xβ‚œ: noisy data at timestep t
- Ξ²β‚œ: noise schedule (β₁, ..., Ξ²_T)
- T: total timesteps (typically 1000)

Closed form (using Ξ±β‚œ = 1 - Ξ²β‚œ, αΎ±β‚œ = βˆα΅’β‚Œβ‚α΅— Ξ±α΅’):
q(xβ‚œ|xβ‚€) = N(xβ‚œ; βˆšαΎ±β‚œ xβ‚€, (1-αΎ±β‚œ)I)

xβ‚œ = βˆšαΎ±β‚œ xβ‚€ + √(1-αΎ±β‚œ) Ξ΅,  Ξ΅ ~ N(0, I)

As t β†’ T: xβ‚œ β†’ N(0, I) (pure noise)
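A quick numeric sketch of the closed form, using the linear schedule values given later in this lesson: the signal coefficient βˆšαΎ±β‚œ decays toward 0 and the noise coefficient √(1-αΎ±β‚œ) grows toward 1, so x_T is essentially pure noise.

```python
import math

# Linear beta schedule from beta_1 = 1e-4 to beta_T = 0.02 (Ho et al., 2020).
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# Accumulate alpha_bar_t = prod_{i<=t} (1 - beta_i).
alpha_bar = 1.0
alpha_bars = []
for beta in betas:
    alpha_bar *= 1.0 - beta
    alpha_bars.append(alpha_bar)

signal_T = math.sqrt(alpha_bars[-1])        # coefficient of x0 at t = T
noise_T = math.sqrt(1.0 - alpha_bars[-1])   # coefficient of eps at t = T
print(signal_T, noise_T)  # signal ~ 0, noise ~ 1: x_T is (almost) pure noise
```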

2. Reverse Diffusion Process

Goal: Learn to denoise p(xβ‚œβ‚‹β‚|xβ‚œ)

True posterior (tractable only when conditioned on xβ‚€):
q(xβ‚œβ‚‹β‚|xβ‚œ, xβ‚€) = N(xβ‚œβ‚‹β‚; ΞΌΜƒβ‚œ(xβ‚œ, xβ‚€), Ξ²Μƒβ‚œI)

Where:
ΞΌΜƒβ‚œ(xβ‚œ, xβ‚€) = (βˆšαΎ±β‚œβ‚‹β‚ Ξ²β‚œ)/(1-αΎ±β‚œ) xβ‚€ + (βˆšΞ±β‚œ(1-αΎ±β‚œβ‚‹β‚))/(1-αΎ±β‚œ) xβ‚œ
Ξ²Μƒβ‚œ = (1-αΎ±β‚œβ‚‹β‚)/(1-αΎ±β‚œ) Β· Ξ²β‚œ

Learned reverse process:
pΞΈ(xβ‚œβ‚‹β‚|xβ‚œ) = N(xβ‚œβ‚‹β‚; ΞΌΞΈ(xβ‚œ, t), Σθ(xβ‚œ, t))

Simplified: predict noise Ξ΅ instead of mean
Ρθ(xβ‚œ, t) β‰ˆ Ξ΅

3. Training Objective

Simplified objective (derived from the variational lower bound / ELBO):
L = Eβ‚œ,xβ‚€,Ξ΅[||Ξ΅ - Ρθ(xβ‚œ, t)||Β²]

Where:
- t ~ Uniform(1, T)
- xβ‚€ ~ q(xβ‚€)
- Ξ΅ ~ N(0, I)
- xβ‚œ = βˆšαΎ±β‚œ xβ‚€ + √(1-αΎ±β‚œ) Ξ΅

Simple MSE loss on predicted noise!

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Training:                              β”‚
β”‚  1. Sample xβ‚€, t, Ξ΅                     β”‚
β”‚  2. Create xβ‚œ = βˆšαΎ±β‚œ xβ‚€ + √(1-αΎ±β‚œ) Ξ΅     β”‚
β”‚  3. Predict Ξ΅Μ‚ = Ρθ(xβ‚œ, t)              β”‚
β”‚  4. Loss = ||Ξ΅ - Ξ΅Μ‚||Β²                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

4. Sampling (Generation)

Start from x_T ~ N(0, I)

For t = T, T-1, ..., 1:
    z ~ N(0, I) if t > 1, else z = 0

    Ξ΅Μ‚ = Ρθ(xβ‚œ, t)

    xβ‚œβ‚‹β‚ = 1/βˆšΞ±β‚œ (xβ‚œ - (1-Ξ±β‚œ)/√(1-αΎ±β‚œ) Ξ΅Μ‚) + Οƒβ‚œz

Where:
Οƒβ‚œ = βˆšΞ²Μƒβ‚œ or βˆšΞ²β‚œ (variance schedule)

Final: xβ‚€ is the generated sample

DDPM Architecture

UNet with Time Embedding

Time Embedding (Sinusoidal Positional Encoding):
t (scalar)
    ↓
PE(t, dim) = [sin(t/10000^(0/d)), cos(t/10000^(0/d)),
              sin(t/10000^(2/d)), cos(t/10000^(2/d)), ...]
    ↓
Linear(dim→4*dim) + SiLU + Linear(4*dim→4*dim)
    ↓
time_emb (broadcast to spatial dimensions)
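The embedding pipeline above can be sketched as a small PyTorch module. Layer sizes here are illustrative, not the exact paper configuration.

```python
import math
import torch
import torch.nn as nn

class TimeEmbedding(nn.Module):
    """Sinusoidal timestep embedding followed by a small MLP, as diagrammed above."""
    def __init__(self, dim: int):
        super().__init__()
        self.dim = dim
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, 4 * dim)
        )

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        half = self.dim // 2
        # Geometric frequency ladder from 1 down to 1/10000.
        freqs = torch.exp(
            -math.log(10000) * torch.arange(half, dtype=torch.float32) / half
        )
        args = t.float()[:, None] * freqs[None, :]                   # (B, half)
        emb = torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (B, dim)
        return self.mlp(emb)                                         # (B, 4*dim)

emb = TimeEmbedding(64)(torch.tensor([0, 500, 999]))
print(emb.shape)  # torch.Size([3, 256])
```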


UNet Structure (e.g., 32Γ—32Γ—3 images):

Input xβ‚œ (32Γ—32Γ—3) + time_emb
    ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Encoder (Downsampling)                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Conv(3β†’64) + TimeEmb + ResBlock         β”‚ β†’ skip1
β”‚     ↓ Downsample                        β”‚
β”‚ Conv(64β†’128) + TimeEmb + ResBlock       β”‚ β†’ skip2
β”‚     ↓ Downsample                        β”‚
β”‚ Conv(128β†’256) + TimeEmb + ResBlock      β”‚ β†’ skip3
β”‚     ↓ Downsample                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Bottleneck                             β”‚
β”‚  Conv(256β†’512) + Attention + ResBlock   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Decoder (Upsampling)                   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚     ↑ Upsample + Concat(skip3)          β”‚
β”‚ Conv(512+256β†’256) + TimeEmb + ResBlock  β”‚
β”‚     ↑ Upsample + Concat(skip2)          β”‚
β”‚ Conv(256+128β†’128) + TimeEmb + ResBlock  β”‚
β”‚     ↑ Upsample + Concat(skip1)          β”‚
β”‚ Conv(128+64β†’64) + TimeEmb + ResBlock    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    ↓
GroupNorm + SiLU + Conv(64β†’3)
    ↓
Output Ρθ(xβ‚œ, t) (32Γ—32Γ—3)

ResBlock with Time Embedding

x, time_emb β†’ ResBlock β†’ out

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  GroupNorm β†’ SiLU β†’ Conv                β”‚
β”‚       ↓                                 β”‚
β”‚  + time_emb (broadcast)                 β”‚
β”‚       ↓                                 β”‚
β”‚  GroupNorm β†’ SiLU β†’ Conv                β”‚
β”‚       ↓                                 β”‚
β”‚  + skip connection (with projection)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
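The diagram maps to a compact module like the following sketch; group counts and `time_dim` are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block with additive time conditioning, mirroring the diagram above."""
    def __init__(self, in_ch: int, out_ch: int, time_dim: int):
        super().__init__()
        self.norm1 = nn.GroupNorm(8, in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.time_proj = nn.Linear(time_dim, out_ch)
        self.norm2 = nn.GroupNorm(8, out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        # 1x1 projection on the skip path when channel counts differ.
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
        self.act = nn.SiLU()

    def forward(self, x, time_emb):
        h = self.conv1(self.act(self.norm1(x)))
        h = h + self.time_proj(time_emb)[:, :, None, None]  # broadcast over H, W
        h = self.conv2(self.act(self.norm2(h)))
        return h + self.skip(x)

block = ResBlock(64, 128, time_dim=256)
out = block(torch.randn(2, 64, 16, 16), torch.randn(2, 256))
print(out.shape)  # torch.Size([2, 128, 16, 16])
```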

Noise Schedule

Linear Schedule

# Linear schedule (Ho et al., 2020)
β₁ = 1e-4
Ξ²_T = 0.02
Ξ²β‚œ = β₁ + (Ξ²_T - β₁) Β· (t-1)/(T-1)   # linear interpolation from β₁ to Ξ²_T

# Precompute for efficiency
Ξ±β‚œ = 1 - Ξ²β‚œ
αΎ±β‚œ = βˆα΅’β‚Œβ‚α΅— Ξ±α΅’
βˆšαΎ±β‚œ, √(1-αΎ±β‚œ)  # Used in forward process
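The whole table can be precomputed once with tensor ops. A minimal sketch; the names `sqrt_alpha_bar` and `sqrt_one_minus_alpha_bar` match the training loop shown later.

```python
import torch

# Precompute schedule tensors (0-based indexing; values from Ho et al., 2020).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

sqrt_alpha_bar = alpha_bars.sqrt()                    # coefficient of x0
sqrt_one_minus_alpha_bar = (1.0 - alpha_bars).sqrt()  # coefficient of eps

def q_sample(x0, t, noise):
    """Closed-form forward process q(x_t | x_0); t holds per-sample indices."""
    a = sqrt_alpha_bar[t].view(-1, 1, 1, 1)
    b = sqrt_one_minus_alpha_bar[t].view(-1, 1, 1, 1)
    return a * x0 + b * noise
```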

Cosine Schedule (Improved)

# Cosine schedule (Nichol & Dhariwal, 2021)
s = 0.008
f(t) = cosΒ²((t/T + s)/(1 + s) Β· Ο€/2)
αΎ±β‚œ = f(t) / f(0)
Ξ²β‚œ = 1 - αΎ±β‚œ/αΎ±β‚œβ‚‹β‚  (clipped to at most 0.999 for numerical stability)

# Smoother noise schedule, better for high resolution
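In code, the cosine schedule is usually built by computing αΎ±β‚œ directly and then recovering per-step Ξ²β‚œ from consecutive ratios, as sketched here.

```python
import math
import torch

def cosine_alpha_bars(T: int, s: float = 0.008) -> torch.Tensor:
    """Cosine schedule (Nichol & Dhariwal, 2021): alpha_bar_t = f(t) / f(0)."""
    t = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((t / T + s) / (1 + s)) * math.pi / 2) ** 2
    return f / f[0]

abar = cosine_alpha_bars(1000)
# Recover per-step betas; clip to avoid a singular step near t = T.
betas = (1 - abar[1:] / abar[:-1]).clamp(max=0.999)
```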

File Structure

13_Diffusion/
β”œβ”€β”€ README.md
β”œβ”€β”€ pytorch_lowlevel/
β”‚   β”œβ”€β”€ ddpm_mnist.py         # DDPM on MNIST (28Γ—28)
β”‚   └── ddpm_cifar.py         # DDPM on CIFAR-10 (32Γ—32)
β”œβ”€β”€ paper/
β”‚   β”œβ”€β”€ ddpm_paper.py         # Full DDPM implementation
β”‚   β”œβ”€β”€ ddim_sampling.py      # DDIM faster sampling
β”‚   └── cosine_schedule.py    # Improved noise schedule
└── exercises/
    β”œβ”€β”€ 01_noise_schedule.md  # Visualize noise schedules
    └── 02_sampling_steps.md  # Compare DDPM vs DDIM

Core Concepts

1. DDPM vs DDIM Sampling

DDPM (Ho et al., 2020):
- Stochastic sampling (adds noise z at each step)
- Requires T steps (e.g., 1000 steps)
- High quality but slow

DDIM (Song et al., 2020):
- Deterministic sampling (z = 0)
- Skip timesteps: use subset [τ₁, Ο„β‚‚, ..., Ο„β‚›]
- 10-50x faster (e.g., 50 steps)
- Slight quality drop

DDIM update:
xβ‚œβ‚‹β‚ = βˆšαΎ±β‚œβ‚‹β‚ xΜ‚β‚€ + √(1-αΎ±β‚œβ‚‹β‚) Ρθ(xβ‚œ, t)

Where xΜ‚β‚€ = (xβ‚œ - √(1-αΎ±β‚œ)Ρθ(xβ‚œ, t))/βˆšαΎ±β‚œ
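The two equations above combine into a single deterministic update. A sanity property worth knowing: if Ρθ returns the exact noise used to build xβ‚œ, one DDIM step lands exactly on the closed-form xβ‚œβ‚‹β‚ built from that same noise.

```python
import torch

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0), following the equations above.
    abar_t / abar_prev are the scalar alpha_bar values at the current and
    previous (possibly non-adjacent) timesteps."""
    # Estimate x0, then re-noise to the previous timestep's marginal.
    x0_hat = (x_t - (1 - abar_t) ** 0.5 * eps_pred) / abar_t ** 0.5
    return abar_prev ** 0.5 * x0_hat + (1 - abar_prev) ** 0.5 * eps_pred

# With the true noise, the step is exact:
x0 = torch.randn(2, 3, 8, 8)
eps = torch.randn_like(x0)
abar_t, abar_prev = 0.5, 0.8
x_t = abar_t ** 0.5 * x0 + (1 - abar_t) ** 0.5 * eps
x_prev = ddim_step(x_t, eps, abar_t, abar_prev)
```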

2. Classifier Guidance

Goal: Generate samples conditioned on class y

Conditional score:
βˆ‡β‚“ log p(xβ‚œ|y) β‰ˆ βˆ‡β‚“ log p(xβ‚œ) + sΒ·βˆ‡β‚“ log p(y|xβ‚œ)
                  ─────────────   ─────────────────
                  Unconditional   Classifier gradient

Guided noise prediction:
Ξ΅Μ‚ = Ρθ(xβ‚œ, t) - s·√(1-αΎ±β‚œ)Β·βˆ‡β‚“ log pΟ†(y|xβ‚œ)

s: guidance scale (s > 1 β†’ stronger conditioning)
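The guided prediction can be sketched with autograd; the classifier interface here (logits of shape (B, num_classes)) and the toy classifier in the usage example are illustrative assumptions.

```python
import torch
import torch.nn as nn

def classifier_guided_eps(eps_pred, x_t, y, classifier, s, sqrt_one_minus_abar_t):
    """Shift the predicted noise by the classifier gradient, per the formula above."""
    x = x_t.detach().requires_grad_(True)
    logits = classifier(x)
    log_prob = logits.log_softmax(dim=-1)[torch.arange(len(y)), y].sum()
    grad = torch.autograd.grad(log_prob, x)[0]  # grad_x log p(y | x_t)
    return eps_pred - s * sqrt_one_minus_abar_t * grad

# Usage with a toy stand-in classifier:
toy_clf = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
x_t = torch.randn(2, 3, 8, 8)
eps = torch.randn_like(x_t)
y = torch.tensor([3, 7])
guided = classifier_guided_eps(eps, x_t, y, toy_clf, s=1.0, sqrt_one_minus_abar_t=0.8)
```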

3. Classifier-Free Guidance

No separate classifier needed!

Train model to handle both conditional and unconditional:
Ρθ(xβ‚œ, t, c) with probability p
Ρθ(xβ‚œ, t, βˆ…) with probability 1-p (βˆ… = null class; typically 1-p β‰ˆ 0.1–0.2)

Guided prediction:
Ξ΅Μ‚ = Ρθ(xβ‚œ, t, βˆ…) + wΒ·(Ρθ(xβ‚œ, t, c) - Ρθ(xβ‚œ, t, βˆ…))

w: guidance weight (w=0 β†’ unconditional, w>1 β†’ stronger)

Used in: Stable Diffusion, DALL-E 2, Imagen
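The guidance formula is two forward passes and a weighted difference. The `model(x, t, c)` signature below is an assumption; the dummy model only illustrates the weighting behavior.

```python
import torch

def cfg_eps(model, x_t, t, c, null_c, w):
    """Classifier-free guided prediction, implementing the formula above."""
    eps_cond = model(x_t, t, c)        # conditional prediction
    eps_uncond = model(x_t, t, null_c) # unconditional prediction (null class)
    return eps_uncond + w * (eps_cond - eps_uncond)

# Dummy model: adds 1 when conditioned, so the weighting is easy to see.
dummy = lambda x, t, c: x + (0.0 if c is None else 1.0)
x = torch.zeros(2, 3, 4, 4)
assert torch.allclose(cfg_eps(dummy, x, 0, "cat", None, w=0.0), x)        # unconditional
assert torch.allclose(cfg_eps(dummy, x, 0, "cat", None, w=1.0), x + 1.0)  # conditional
```

With w > 1 the prediction extrapolates past the conditional one, trading diversity for stronger conditioning.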

4. Training Tips

1. EMA (Exponential Moving Average):
   - Maintain ΞΈ_ema = 0.9999Β·ΞΈ_ema + 0.0001Β·ΞΈ
   - Use ΞΈ_ema for sampling

2. Progressive Training:
   - Start with smaller resolution
   - Gradually increase (8Γ—8 β†’ 16Γ—16 β†’ 32Γ—32)

3. Data Augmentation:
   - Random horizontal flip
   - Normalize to [-1, 1]

4. Learning Rate:
   - 2e-4 for MNIST/CIFAR
   - 1e-4 for high resolution

5. Batch Size:
   - 128-256 for small images
   - 32-64 for large images
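Tip 1 (EMA) is simple enough to show directly: blend the current weights into a slow-moving copy after every optimizer step and sample with the copy. A minimal sketch:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(ema_model, model, decay: float = 0.9999):
    """theta_ema = decay * theta_ema + (1 - decay) * theta, per parameter."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

# Usage: initialize the EMA copy from the model, then call after each step.
net = nn.Linear(4, 4)
ema_net = nn.Linear(4, 4)
ema_net.load_state_dict(net.state_dict())
ema_update(ema_net, net, decay=0.9999)
```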

Implementation Levels

Level 2: PyTorch Low-Level (pytorch_lowlevel/)

  • Implement forward/reverse diffusion
  • Implement noise schedule (linear)
  • Build UNet with time embedding
  • Train on MNIST (28Γ—28) and CIFAR-10 (32Γ—32)

Level 3: Paper Implementation (paper/)

  • Full DDPM with cosine schedule
  • DDIM sampling (faster inference)
  • Classifier-free guidance
  • FID/IS evaluation metrics

Training Loop

# Pseudocode (t is a 0-based index into the precomputed schedule tensors)
for epoch in range(epochs):
    for x0, _ in dataloader:
        # Sample a random timestep per example
        t = torch.randint(0, T, (x0.shape[0],))

        # Sample noise
        noise = torch.randn_like(x0)

        # Forward diffusion: create noisy image (broadcast over C, H, W)
        xt = (sqrt_alpha_bar[t].view(-1, 1, 1, 1) * x0
              + sqrt_one_minus_alpha_bar[t].view(-1, 1, 1, 1) * noise)

        # Predict noise
        noise_pred = model(xt, t)

        # MSE loss against the true noise
        loss = F.mse_loss(noise_pred, noise)

        # Backprop
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Sampling Loop

# DDPM sampling (0-based schedule indices; run under torch.no_grad())
x = torch.randn(batch_size, 3, 32, 32)  # Start from pure noise

for t in reversed(range(T)):
    # Predict noise
    t_batch = torch.full((batch_size,), t)
    noise_pred = model(x, t_batch)

    # Compute posterior mean
    alpha_t = alpha[t]
    alpha_bar_t = alpha_bar[t]
    mean = (x - (1 - alpha_t) / (1 - alpha_bar_t) ** 0.5 * noise_pred) / alpha_t ** 0.5

    # Add noise (except at the final step)
    if t > 0:
        noise = torch.randn_like(x)
        sigma_t = beta[t] ** 0.5
        x = mean + sigma_t * noise
    else:
        x = mean

# x is the generated image

Learning Checklist

  • [ ] Understand forward diffusion closed-form
  • [ ] Derive reverse diffusion from ELBO
  • [ ] Implement noise schedules (linear, cosine)
  • [ ] Build UNet with time embedding
  • [ ] Understand DDPM vs DDIM sampling
  • [ ] Implement classifier-free guidance
  • [ ] Calculate FID score for evaluation

References

  • Ho et al. (2020). "Denoising Diffusion Probabilistic Models"
  • Song et al. (2020). "Denoising Diffusion Implicit Models"
  • Nichol & Dhariwal (2021). "Improved Denoising Diffusion Probabilistic Models"
  • Ho & Salimans (2022). "Classifier-Free Diffusion Guidance"