33. Diffusion Models (DDPM)
Overview
Denoising Diffusion Probabilistic Models (DDPMs) are generative models that learn to produce data by reversing a gradual noising process: Gaussian noise is added to training data step by step, and a network is trained to undo each step. Introduced in "Denoising Diffusion Probabilistic Models" (Ho et al., 2020).
Mathematical Background
1. Forward Diffusion Process
Goal: Gradually add Gaussian noise to data x₀
q(xₜ|xₜ₋₁) = N(xₜ; √(1−βₜ) xₜ₋₁, βₜI)
Where:
- x₀: original data
- xₜ: noisy data at timestep t
- βₜ: noise schedule (β₁, ..., β_T)
- T: total timesteps (typically 1000)
Closed form (using αₜ = 1 − βₜ, ᾱₜ = ∏ᵢ₌₁ᵗ αᵢ):
q(xₜ|x₀) = N(xₜ; √ᾱₜ x₀, (1−ᾱₜ)I)
xₜ = √ᾱₜ x₀ + √(1−ᾱₜ) ε,  ε ~ N(0, I)
As t → T: xₜ → N(0, I) (pure noise)
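The closed form can be sanity-checked numerically: with unit-variance data, Var(xₜ) = ᾱₜ·1 + (1 − ᾱₜ) = 1 at every t, and ᾱ_T is near zero. A minimal sketch (schedule values follow the linear schedule discussed later; variable names are illustrative):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # linear schedule (Ho et al., 2020)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # ᾱ_t = ∏ α_i

def q_sample(x0, t, eps):
    """Draw x_t ~ q(x_t | x_0) in one shot via the closed form."""
    return alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * eps

x0 = torch.randn(10_000)                    # toy unit-variance "data"
xt = q_sample(x0, t=500, eps=torch.randn_like(x0))
print(xt.var())                             # ≈ 1 for any t
print(alpha_bars[-1])                       # ≈ 0: x_T is (almost) pure noise
```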
2. Reverse Diffusion Process
Goal: Learn the denoising distribution p(xₜ₋₁|xₜ)
True posterior (tractable only when conditioned on x₀):
q(xₜ₋₁|xₜ, x₀) = N(xₜ₋₁; μ̃ₜ(xₜ, x₀), β̃ₜI)
Where:
μ̃ₜ(xₜ, x₀) = (√ᾱₜ₋₁ βₜ)/(1−ᾱₜ) x₀ + (√αₜ (1−ᾱₜ₋₁))/(1−ᾱₜ) xₜ
β̃ₜ = (1−ᾱₜ₋₁)/(1−ᾱₜ) · βₜ
Learned reverse process:
p_θ(xₜ₋₁|xₜ) = N(xₜ₋₁; μ_θ(xₜ, t), Σ_θ(xₜ, t))
Simplified parameterization: predict the noise ε instead of the mean
ε_θ(xₜ, t) ≈ ε
3. Training Objective
The variational lower bound (ELBO) reduces, after Ho et al.'s reweighting, to a simple denoising objective:
L_simple = E_{t,x₀,ε} [‖ε − ε_θ(xₜ, t)‖²]
Where:
- t ~ Uniform(1, T)
- x₀ ~ q(x₀)
- ε ~ N(0, I)
- xₜ = √ᾱₜ x₀ + √(1−ᾱₜ) ε
A simple MSE loss on the predicted noise!
┌──────────────────────────────────────────┐
│ Training:                                │
│  1. Sample x₀, t, ε                      │
│  2. Create xₜ = √ᾱₜ x₀ + √(1−ᾱₜ) ε       │
│  3. Predict ε̂ = ε_θ(xₜ, t)               │
│  4. Loss = ‖ε − ε̂‖²                      │
└──────────────────────────────────────────┘
4. Sampling (Generation)
Start from x_T ~ N(0, I)
For t = T, T−1, ..., 1:
    z ~ N(0, I) if t > 1, else z = 0
    ε̂ = ε_θ(xₜ, t)
    xₜ₋₁ = (1/√αₜ) (xₜ − (1−αₜ)/√(1−ᾱₜ) · ε̂) + σₜ z
Where:
σₜ = √β̃ₜ or √βₜ (variance choice)
Final: x₀ is the generated sample
DDPM Architecture
UNet with Time Embedding
Time Embedding (Sinusoidal Positional Encoding):
t (scalar)
  ↓
PE(t, dim) = [sin(t/10000^(0/d)), cos(t/10000^(0/d)),
              sin(t/10000^(2/d)), cos(t/10000^(2/d)), ...]
  ↓
Linear(dim→4·dim) + SiLU + Linear(4·dim→4·dim)
  ↓
time_emb (broadcast to spatial dimensions)
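The pipeline above can be sketched as a module; the frequency spacing follows the usual Transformer convention, and the exact dimensions here are illustrative:

```python
import math
import torch
import torch.nn as nn

class TimeEmbedding(nn.Module):
    """Sinusoidal encoding of the timestep, followed by the MLP sketched above."""
    def __init__(self, dim):
        super().__init__()
        self.dim = dim
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, 4 * dim)
        )

    def forward(self, t):  # t: (batch,) integer timesteps
        half = self.dim // 2
        freqs = torch.exp(
            -math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half
        )
        args = t.float()[:, None] * freqs[None, :]        # (batch, dim/2)
        pe = torch.cat([args.sin(), args.cos()], dim=-1)  # (batch, dim)
        return self.mlp(pe)                               # (batch, 4*dim)

emb = TimeEmbedding(64)
out = emb(torch.tensor([0, 10, 999]))
print(out.shape)  # torch.Size([3, 256])
```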
UNet Structure (e.g., 32×32×3 images):
Input xₜ (32×32×3) + time_emb
  ↓
┌──────────────────────────────────────────┐
│ Encoder (Downsampling)                   │
├──────────────────────────────────────────┤
│ Conv(3→64) + TimeEmb + ResBlock          │ → skip1
│   ↓ Downsample                           │
│ Conv(64→128) + TimeEmb + ResBlock        │ → skip2
│   ↓ Downsample                           │
│ Conv(128→256) + TimeEmb + ResBlock       │ → skip3
│   ↓ Downsample                           │
└──────────────────────────────────────────┘
  ↓
┌──────────────────────────────────────────┐
│ Bottleneck                               │
│ Conv(256→512) + Attention + ResBlock     │
└──────────────────────────────────────────┘
  ↓
┌──────────────────────────────────────────┐
│ Decoder (Upsampling)                     │
├──────────────────────────────────────────┤
│   ↑ Upsample + Concat(skip3)             │
│ Conv(512+256→256) + TimeEmb + ResBlock   │
│   ↑ Upsample + Concat(skip2)             │
│ Conv(256+128→128) + TimeEmb + ResBlock   │
│   ↑ Upsample + Concat(skip1)             │
│ Conv(128+64→64) + TimeEmb + ResBlock     │
└──────────────────────────────────────────┘
  ↓
GroupNorm + SiLU + Conv(64→3)
  ↓
Output ε_θ(xₜ, t) (32×32×3)
ResBlock with Time Embedding
x, time_emb → ResBlock → out
┌──────────────────────────────────────────┐
│ GroupNorm → SiLU → Conv                  │
│   ↓                                      │
│ + time_emb (broadcast over H, W)         │
│   ↓                                      │
│ GroupNorm → SiLU → Conv                  │
│   ↓                                      │
│ + skip connection (with projection)      │
└──────────────────────────────────────────┘
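A sketch of this block in PyTorch. The group count, the 1×1 skip projection, and the layer sizes are assumptions, not fixed by the diagram:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block that injects the time embedding between the two convs."""
    def __init__(self, in_ch, out_ch, time_dim):
        super().__init__()
        self.norm1 = nn.GroupNorm(8, in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.time_proj = nn.Linear(time_dim, out_ch)
        self.norm2 = nn.GroupNorm(8, out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        # 1×1 projection on the skip path when channel counts differ
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
        self.act = nn.SiLU()

    def forward(self, x, t_emb):
        h = self.conv1(self.act(self.norm1(x)))
        h = h + self.time_proj(t_emb)[:, :, None, None]  # broadcast over H, W
        h = self.conv2(self.act(self.norm2(h)))
        return h + self.skip(x)

block = ResBlock(64, 128, time_dim=256)
x = torch.randn(2, 64, 16, 16)
t_emb = torch.randn(2, 256)
print(block(x, t_emb).shape)  # torch.Size([2, 128, 16, 16])
```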
Noise Schedule
Linear Schedule
# Linear schedule (Ho et al., 2020)
β₁ = 1e-4
β_T = 0.02
βₜ = linear_interpolate(β₁, β_T, t/T)
# Precompute for efficiency
αₜ = 1 − βₜ
ᾱₜ = ∏ᵢ₌₁ᵗ αᵢ
√ᾱₜ, √(1−ᾱₜ)  # Used in forward process
Cosine Schedule (Improved)
# Cosine schedule (Nichol & Dhariwal, 2021)
s = 0.008
f(t) = cos²((t/T + s)/(1 + s) · π/2)
ᾱₜ = f(t) / f(0)
βₜ = 1 − ᾱₜ/ᾱₜ₋₁
# Smoother noise schedule, better for high resolution
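Both schedules side by side as a small NumPy sketch; the clip on βₜ follows Nichol & Dhariwal, and the variable names are illustrative:

```python
import numpy as np

T = 1000

# Linear schedule (Ho et al., 2020)
betas_linear = np.linspace(1e-4, 0.02, T)

# Cosine schedule (Nichol & Dhariwal, 2021)
s = 0.008
t = np.arange(T + 1)
f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
alpha_bar = f / f[0]                              # ᾱ_t = f(t)/f(0)
betas_cosine = 1 - alpha_bar[1:] / alpha_bar[:-1]
betas_cosine = np.clip(betas_cosine, 0, 0.999)    # clip, as in the paper

# ᾱ_T should be near zero under either schedule
ab_linear = np.cumprod(1 - betas_linear)
print(ab_linear[-1], alpha_bar[-1])
```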
File Structure
13_Diffusion/
├── README.md
├── pytorch_lowlevel/
│   ├── ddpm_mnist.py        # DDPM on MNIST (28×28)
│   └── ddpm_cifar.py        # DDPM on CIFAR-10 (32×32)
├── paper/
│   ├── ddpm_paper.py        # Full DDPM implementation
│   ├── ddim_sampling.py     # DDIM faster sampling
│   └── cosine_schedule.py   # Improved noise schedule
└── exercises/
    ├── 01_noise_schedule.md # Visualize noise schedules
    └── 02_sampling_steps.md # Compare DDPM vs DDIM
Core Concepts
1. DDPM vs DDIM Sampling
DDPM (Ho et al., 2020):
- Stochastic sampling (adds noise z at each step)
- Requires T steps (e.g., 1000 steps)
- High quality but slow
DDIM (Song et al., 2020):
- Deterministic sampling (z = 0)
- Skips timesteps: uses a subset [τ₁, τ₂, ..., τ_S]
- 10-50× faster (e.g., 50 steps)
- Slight quality drop
DDIM update:
xₜ₋₁ = √ᾱₜ₋₁ x̂₀ + √(1−ᾱₜ₋₁) ε_θ(xₜ, t)
Where x̂₀ = (xₜ − √(1−ᾱₜ) ε_θ(xₜ, t))/√ᾱₜ
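A single deterministic DDIM update can be written directly from the two formulas above; `ab_t` and `ab_prev` stand for ᾱ at the current and previous (possibly skipped-to) timestep:

```python
import torch

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """One deterministic DDIM update (η = 0) from ᾱ_t to ᾱ_{t_prev}."""
    # x̂₀ = (x_t − √(1−ᾱ_t)·ε̂) / √ᾱ_t
    x0_hat = (x_t - (1 - ab_t).sqrt() * eps_pred) / ab_t.sqrt()
    # x_{t_prev} = √ᾱ_{t_prev}·x̂₀ + √(1−ᾱ_{t_prev})·ε̂
    return ab_prev.sqrt() * x0_hat + (1 - ab_prev).sqrt() * eps_pred

# Sanity check: given the true noise, jumping straight to ᾱ = 1 recovers x₀
x0 = torch.randn(4, 3, 8, 8)
eps = torch.randn_like(x0)
ab_t = torch.tensor(0.5)
x_t = ab_t.sqrt() * x0 + (1 - ab_t).sqrt() * eps
x_rec = ddim_step(x_t, eps, ab_t, torch.tensor(1.0))
print(torch.allclose(x_rec, x0, atol=1e-5))  # True
```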
2. Classifier Guidance
Goal: Generate samples conditioned on class y
Conditional score:
∇ₓ log p(xₜ|y) ≈ ∇ₓ log p(xₜ) + s·∇ₓ log p(y|xₜ)
                 (unconditional)  (classifier gradient)
Guided noise prediction:
ε̂ = ε_θ(xₜ, t) − s·√(1−ᾱₜ)·∇ₓ log p_φ(y|xₜ)
s: guidance scale (s > 1 → stronger conditioning)
3. Classifier-Free Guidance
No separate classifier needed!
Train one model to handle both conditional and unconditional inputs:
ε_θ(xₜ, t, c) with probability p
ε_θ(xₜ, t, ∅) with probability 1−p (∅ = null class)
Guided prediction:
ε̂ = ε_θ(xₜ, t, ∅) + w·(ε_θ(xₜ, t, c) − ε_θ(xₜ, t, ∅))
w: guidance weight (w = 0 → unconditional, w > 1 → stronger conditioning)
Used in: Stable Diffusion, DALL-E 2, Imagen
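The guided prediction is just two forward passes and a linear combination. A sketch, where the `model(x, t, c)` call signature is an assumption:

```python
import torch

def cfg_eps(model, x_t, t, cond, null_cond, w):
    """Classifier-free guided prediction: ε̂ = ε(∅) + w·(ε(c) − ε(∅))."""
    eps_uncond = model(x_t, t, null_cond)
    eps_cond = model(x_t, t, cond)
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy check with a stand-in "model": w = 1 recovers the conditional prediction
fake = lambda x, t, c: x * (2.0 if c == "cond" else 1.0)
x = torch.ones(2, 3)
out = cfg_eps(fake, x, 0, "cond", "null", w=1.0)
print(torch.equal(out, fake(x, 0, "cond")))  # True
```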
4. Training Tips
1. EMA (Exponential Moving Average):
   - Maintain θ_ema = 0.9999·θ_ema + 0.0001·θ
   - Use θ_ema for sampling
2. Progressive Training:
   - Start with a smaller resolution
   - Gradually increase (8×8 → 16×16 → 32×32)
3. Data Augmentation:
   - Random horizontal flip
   - Normalize inputs to [-1, 1]
4. Learning Rate:
   - 2e-4 for MNIST/CIFAR
   - 1e-4 for high resolution
5. Batch Size:
   - 128-256 for small images
   - 32-64 for large images
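The EMA update from tip 1 is a few lines in PyTorch; `decay=0.9` below is only to make the toy check visible (0.9999 in practice):

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(ema_model, model, decay=0.9999):
    """θ_ema ← decay·θ_ema + (1 − decay)·θ, applied parameter-wise."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1 - decay)

model = nn.Linear(4, 4)
ema = copy.deepcopy(model)         # sample with this copy, train the other
with torch.no_grad():
    model.weight.add_(1.0)         # stand-in for an optimizer step
ema_update(ema, model, decay=0.9)  # ema.weight moved 10% toward the new weights
```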
Implementation Levels
Level 2: PyTorch Low-Level (pytorch_lowlevel/)
- Implement forward/reverse diffusion
- Implement the linear noise schedule
- Build UNet with time embedding
- Train on MNIST (28×28) and CIFAR-10 (32×32)
Level 3: Paper Implementation (paper/)
- Full DDPM with cosine schedule
- DDIM sampling (faster inference)
- Classifier-free guidance
- FID/IS evaluation metrics
Training Loop
# Pseudocode (schedule tables are 0-indexed: entry i holds values for t = i+1)
for epoch in range(num_epochs):
    for x0, _ in dataloader:
        # Sample a random timestep per example
        t = torch.randint(0, T, (x0.shape[0],))
        # Sample noise
        noise = torch.randn_like(x0)
        # Forward diffusion: create the noisy image (reshape for broadcasting)
        xt = (sqrt_alpha_bar[t].view(-1, 1, 1, 1) * x0
              + sqrt_one_minus_alpha_bar[t].view(-1, 1, 1, 1) * noise)
        # Predict the noise
        noise_pred = model(xt, t)
        # MSE loss between true and predicted noise
        loss = F.mse_loss(noise_pred, noise)
        # Backprop
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
Sampling Loop
# DDPM sampling (pseudocode, schedule tables 0-indexed as above)
x = torch.randn(batch_size, 3, 32, 32)  # Start from pure noise x_T
for t in reversed(range(T)):
    # Predict the noise at this timestep
    t_batch = torch.full((batch_size,), t)
    noise_pred = model(x, t_batch)
    # Compute the posterior mean
    alpha_t = alpha[t]
    alpha_bar_t = alpha_bar[t]
    mean = (x - (1 - alpha_t) / (1 - alpha_bar_t) ** 0.5 * noise_pred) / alpha_t ** 0.5
    # Add noise (except at the last step)
    if t > 0:
        noise = torch.randn_like(x)
        sigma_t = beta[t] ** 0.5
        x = mean + sigma_t * noise
    else:
        x = mean
# x is the generated image
Learning Checklist
- [ ] Understand forward diffusion closed-form
- [ ] Derive reverse diffusion from ELBO
- [ ] Implement noise schedules (linear, cosine)
- [ ] Build UNet with time embedding
- [ ] Understand DDPM vs DDIM sampling
- [ ] Implement classifier-free guidance
- [ ] Calculate FID score for evaluation
References
- Ho et al. (2020). "Denoising Diffusion Probabilistic Models"
- Song et al. (2020). "Denoising Diffusion Implicit Models"
- Nichol & Dhariwal (2021). "Improved Denoising Diffusion Probabilistic Models"
- Ho & Salimans (2022). "Classifier-Free Diffusion Guidance"
- 32_Diffusion_Models.md