Diffusion models are a class of generative models that learn to generate data by reversing a gradual noising process. They have become the state of the art for high-quality image generation, powering systems like DALL-E 2, Stable Diffusion, and Imagen.
Diffusion models have roots in non-equilibrium thermodynamics and were formalized for deep learning by Sohl-Dickstein et al. (2015). The breakthrough came with Denoising Diffusion Probabilistic Models (DDPM) by Ho et al. (2020), which simplified training and achieved remarkable results.
Key milestones:
Advantages over GANs:
Trade-offs:
Phase 1: Forward Diffusion (Fixed, No Learning)
Gradually add Gaussian noise to data over T timesteps until it becomes pure noise. This process is:
Phase 2: Reverse Denoising (Learned)
Train a neural network to reverse the forward process, denoising step-by-step:
Imagine dropping ink into a glass of water:
Now imagine you could reverse this process: starting with murky water and somehow un-diffusing it back to a clear drop of ink. This is exactly what diffusion models learn to do!
Once we've learned to denoise (reverse the diffusion), we can generate new samples:
| Model | Approach | Generation Process |
|---|---|---|
| GAN | Adversarial | One-shot: noise → image (single forward pass) |
| VAE | Latent encoding | One-shot: latent code → image (decode) |
| Diffusion | Iterative denoising | Multi-step: noise → ... → image (T steps) |
| Autoregressive | Sequential prediction | Pixel-by-pixel: predict each pixel given previous |
Diffusion models can be viewed as Markov chains:
Forward chain:
Reverse chain:
At each timestep t, we add Gaussian noise to the previous state:
Components:
Sampling equation:
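In the standard DDPM notation (Ho et al., 2020), a single forward step and its sampling form can be written as follows. Here ε denotes fresh standard Gaussian noise drawn at each step, and I is the identity matrix.

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)

x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)
```

The factor sqrt(1 − β_t) shrinks the previous state slightly, while β_t sets the variance of the noise that is added.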
The noise schedule controls how quickly we add noise. It's a sequence of values from β_1 to β_T.
Common schedules:
A powerful property: we can jump directly from x_0 to x_t in one step, without computing intermediate steps!
Define cumulative products:
Closed-form sampling:
Reparameterization for sampling:
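Written out in the usual DDPM notation, with α_t = 1 − β_t, the cumulative product, the closed-form marginal, and its reparameterized sample are as follows. The last line is the same expression the `q_sample` code in this section implements.

```latex
\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s

q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,I\right)

x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)
```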
1. Variance preservation:
If x_0 is normalized to unit variance, the variance of x_t stays approximately constant at 1 for every t:
2. Convergence to pure noise:
As t → T, ᾱ_t → 0, so:
This means x_T is approximately pure Gaussian noise, regardless of the original x_0.
3. Information destruction:
The signal-to-noise ratio decreases monotonically:
As t increases, SNR decreases, and the image loses structure.
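These three properties can be checked numerically. The sketch below uses only the Python standard library and assumes the linear schedule with β_1 = 1e-4, β_T = 0.02 and T = 1000. It takes the signal-to-noise ratio to be SNR(t) = ᾱ_t / (1 − ᾱ_t), which follows from the closed-form marginal. The helper names are illustrative only.

```python
def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def cumulative_alphas(beta_values):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, running = [], 1.0
    for b in beta_values:
        running *= 1.0 - b
        out.append(running)
    return out


beta_values = linear_betas()
alpha_bar = cumulative_alphas(beta_values)

# Signal-to-noise ratio of x_t: squared signal scale over noise variance
snr = [a / (1.0 - a) for a in alpha_bar]

# If Var(x_0) = 1, then Var(x_t) = alpha_bar_t * 1 + (1 - alpha_bar_t) = 1
var_x_t = [a * 1.0 + (1.0 - a) for a in alpha_bar]

for t in (1, 250, 500, 1000):
    print(f"t={t:4d}  alpha_bar={alpha_bar[t - 1]:.5f}  SNR={snr[t - 1]:.5f}")
```

With this schedule, ᾱ_T ends up well below 1e-4, so x_T carries almost no trace of x_0.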
import torch

# Noise schedule (linear)
def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
    return torch.linspace(beta_start, beta_end, timesteps)

# Precompute alpha values
betas = linear_beta_schedule(timesteps=1000)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)
alphas_cumprod_prev = torch.cat([torch.tensor([1.0]), alphas_cumprod[:-1]])

# Forward diffusion (closed form)
def q_sample(x_0, t, noise=None):
    """
    Sample from q(x_t | x_0) - add noise to x_0 directly
    Args:
        x_0: Clean image (batch_size, channels, height, width)
        t: Timestep (batch_size,)
        noise: Optional pre-sampled noise
    Returns:
        Noisy image x_t
    """
    if noise is None:
        noise = torch.randn_like(x_0)
    # Get sqrt(alpha_cumprod) for each sample in batch
    sqrt_alphas_cumprod_t = extract(torch.sqrt(alphas_cumprod), t, x_0.shape)
    sqrt_one_minus_alphas_cumprod_t = extract(
        torch.sqrt(1.0 - alphas_cumprod), t, x_0.shape
    )
    # Apply noise: x_t = sqrt(alpha_cumprod_t) * x_0 + sqrt(1 - alpha_cumprod_t) * noise
    return sqrt_alphas_cumprod_t * x_0 + sqrt_one_minus_alphas_cumprod_t * noise

def extract(a, t, x_shape):
    """
    Extract values from 'a' at indices 't' and reshape for broadcasting
    """
    batch_size = t.shape[0]
    # The schedule tensors live on the CPU, so index there and move the result to t's device
    out = a.gather(-1, t.to(a.device))
    return out.reshape(batch_size, *((1,) * (len(x_shape) - 1))).to(t.device)
The reverse process learns to go backward through the diffusion chain, removing noise step by step.
If we knew the true data distribution:
This is intractable to compute exactly. However, if β_t is small enough, the reverse process is also approximately Gaussian!
We model the reverse process as a Gaussian:
Simplification (DDPM):
Keep variance fixed, only learn the mean:
Where σ_t² is chosen to be either β_t or β̃_t (posterior variance).
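In symbols, the learned reverse step and its fixed-variance simplification can be written as below. Σ_θ denotes a general learned covariance, a symbol introduced here for illustration.

```latex
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)

\text{DDPM simplification:}\qquad
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)
```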
There are multiple equivalent parameterizations:
Option 1: Predict the mean μ_θ(x_t, t) directly
Option 2: Predict the noise ε_θ(x_t, t)
Option 3: Predict the clean image x̂_0
DDPM found that predicting the noise (Option 2) works best empirically.
Given noisy image x_t and timestep t, predict the noise ε:
Where ε is the actual noise that was added to create x_t from x_0:
Once we have predicted noise, we can compute the mean:
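With α_t = 1 − β_t and ᾱ_t the cumulative product of the α values, the mean implied by a noise prediction is given below. This is the same expression the `p_sample` function in the sampling section computes as `model_mean`.

```latex
\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}
\left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\;\epsilon_\theta(x_t, t) \right)

x_{t-1} = \mu_\theta(x_t, t) + \sigma_t\, z,
\qquad z \sim \mathcal{N}(0, I)\ \text{for } t > 1,\quad z = 0 \text{ at the final step}
```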
To sample x_{t-1} given x_t, we:
The posterior q(x_{t-1} | x_t, x_0) has variance:
Two choices for σ_t²:
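Ho et al. (2020) report that both choices gave similar sample quality in their experiments. The sampling code in this guide uses β̃_t.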
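For reference, the posterior variance is β̃_t = (1 − ᾱ_{t−1}) / (1 − ᾱ_t) · β_t, with ᾱ_0 taken to be 1. The standard-library sketch below compares it with β_t, assuming the linear schedule from this guide. The function name is illustrative.

```python
def posterior_variances(beta_values):
    """beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t, with alpha_bar_0 = 1."""
    out, prev = [], 1.0
    for b in beta_values:
        cur = prev * (1.0 - b)          # alpha_bar_t
        out.append((1.0 - prev) / (1.0 - cur) * b)
        prev = cur
    return out


T = 1000
beta_values = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]
beta_tilde = posterior_variances(beta_values)

for t in (1, 2, 10, 100, 1000):
    print(f"t={t:4d}  beta={beta_values[t - 1]:.6f}  beta_tilde={beta_tilde[t - 1]:.6f}")
```

The two choices differ mostly at small t: β̃_1 is exactly 0, while for large t the ratio β̃_t / β_t approaches 1.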
Think of x_t as a mixture of signal and noise:
The network learns to identify and remove the noise component, leaving us with a slightly cleaner signal. Repeating this T times progressively reveals the original image.
The goal is to train the neural network ε_θ(x_t, t) to predict the noise that was added at each timestep.
Simplified training loss (DDPM):
Where:
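In the notation used so far, the simplified objective from Ho et al. (2020) can be written as below. The training loop in this section samples t from {0, ..., T−1} because it uses zero-based indexing.

```latex
L_{\text{simple}}(\theta) =
\mathbb{E}_{t,\,x_0,\,\epsilon}
\left[ \left\| \epsilon - \epsilon_\theta\!\left(
\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t \right) \right\|^2 \right],
\qquad t \sim \mathrm{Uniform}\{1,\dots,T\},\quad
x_0 \sim q(x_0),\quad \epsilon \sim \mathcal{N}(0, I)
```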
import torch
import torch.nn.functional as F

# Simplified DDPM training loop
def train_ddpm(model, dataloader, optimizer, timesteps=1000, epochs=100):
    """
    Train a diffusion model to predict noise
    Args:
        model: Neural network (U-Net) that predicts noise
        dataloader: Training data loader
        optimizer: Optimizer (e.g., Adam)
        timesteps: Number of diffusion steps (T)
        epochs: Training epochs
    """
    # Use whichever device the model's parameters live on
    device = next(model.parameters()).device
    for epoch in range(epochs):
        model.train()  # generate_samples() below leaves the model in eval mode
        for batch_idx, (x_0, _) in enumerate(dataloader):
            x_0 = x_0.to(device)
            batch_size = x_0.size(0)
            # Sample random timesteps
            t = torch.randint(0, timesteps, (batch_size,), device=device).long()
            # Sample noise
            noise = torch.randn_like(x_0)
            # Create noisy images using closed-form forward process
            x_t = q_sample(x_0, t, noise=noise)
            # Predict noise
            noise_pred = model(x_t, t)
            # Compute MSE loss
            loss = F.mse_loss(noise_pred, noise)
            # Backprop and update
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if batch_idx % 100 == 0:
                print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}')
        # Optionally generate samples for monitoring
        if epoch % 10 == 0:
            generate_samples(model, timesteps, num_samples=16)
1. Direct supervision:
Unlike GANs, we have a clear ground truth (the actual noise ε). The network learns a supervised denoising task.
2. All timesteps contribute:
By randomly sampling t, the network learns to denoise at all noise levels. This makes training stable and comprehensive.
3. Variance reduction:
The loss is an expectation over t, x_0, and ε. Each training step estimates it with one random (t, ε) pair per image, and in practice these Monte Carlo gradients are stable enough for standard optimizers such as Adam.
4. Scalability:
The loss scales well: we can train on high-resolution images by sampling small patches or using progressive training.
| Parameter | Typical Value | Purpose |
|---|---|---|
| T (timesteps) | 1000 | Number of diffusion steps |
| Learning rate | 1e-4 to 2e-4 | Adam optimizer LR |
| Batch size | 32-128 | Depends on image size and GPU memory |
| β_1 | 1e-4 | Starting noise schedule value |
| β_T | 0.02 | Final noise schedule value |
| EMA decay | 0.9999 | Exponential moving average of weights |
Simple loss (DDPM):
Weighted loss:
Where w_t can emphasize certain timesteps.
Hybrid loss (predict x_0):
Improved DDPM uses hybrid objectives for better results.
Once trained, we can generate new images by sampling from the reverse process.
import torch

# Posterior variance beta_tilde_t of q(x_{t-1} | x_t, x_0), used as sigma_t^2 below
posterior_variance = betas * (1.0 - alphas_cumprod_prev) / (1.0 - alphas_cumprod)

# Sampling (generation) algorithm
@torch.no_grad()
def p_sample(model, x_t, t, t_index):
    """
    Single reverse diffusion step: x_t -> x_{t-1}
    Args:
        model: Trained noise prediction network
        x_t: Current noisy image
        t: Current timestep (tensor)
        t_index: Timestep as integer for indexing
    Returns:
        x_{t-1}: Less noisy image
    """
    # Precomputed values
    betas_t = extract(betas, t, x_t.shape)
    sqrt_one_minus_alphas_cumprod_t = extract(
        torch.sqrt(1. - alphas_cumprod), t, x_t.shape
    )
    sqrt_recip_alphas_t = extract(torch.sqrt(1.0 / alphas), t, x_t.shape)
    # Predict noise
    predicted_noise = model(x_t, t)
    # Compute mean
    model_mean = sqrt_recip_alphas_t * (
        x_t - betas_t * predicted_noise / sqrt_one_minus_alphas_cumprod_t
    )
    if t_index == 0:
        # No noise at final step
        return model_mean
    else:
        # Add noise
        posterior_variance_t = extract(posterior_variance, t, x_t.shape)
        noise = torch.randn_like(x_t)
        return model_mean + torch.sqrt(posterior_variance_t) * noise

@torch.no_grad()
def p_sample_loop(model, shape, timesteps=1000):
    """
    Full sampling loop: x_T -> x_0
    Args:
        model: Trained diffusion model
        shape: Shape of images to generate (batch, channels, height, width)
        timesteps: Number of diffusion steps
    Returns:
        Generated images
    """
    device = next(model.parameters()).device
    # Start from pure noise
    x = torch.randn(shape, device=device)
    # Iteratively denoise
    for i in reversed(range(0, timesteps)):
        t = torch.full((shape[0],), i, device=device, dtype=torch.long)
        x = p_sample(model, x, t, i)
    return x

# Generate samples
@torch.no_grad()
def generate_samples(model, timesteps, num_samples=16, channels=3,
                     img_size=32):
    """
    Generate new samples from trained model
    """
    model.eval()
    shape = (num_samples, channels, img_size, img_size)
    samples = p_sample_loop(model, shape, timesteps)
    # Denormalize from [-1, 1] to [0, 1]
    samples = (samples + 1) / 2
    samples = torch.clamp(samples, 0, 1)
    return samples
The main drawback of diffusion models is slow sampling:
Example timing (single image):
Denoising Diffusion Implicit Models (DDIM) enable faster sampling by skipping timesteps.
Key idea:
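As a rough illustration of the idea, the sketch below shows a deterministic DDIM update (the η = 0 case) on scalar values, together with an evenly strided timestep subsequence. It uses only the standard library. The function names are hypothetical, `eps_pred` stands in for the network output ε_θ(x_t, t), and the even striding is just one simple way to pick the subsequence.

```python
import math


def ddim_timesteps(num_train_steps, num_sample_steps):
    """Evenly strided subsequence of training timesteps, in descending order.

    Assumes num_train_steps is divisible by num_sample_steps.
    """
    stride = num_train_steps // num_sample_steps
    return list(range(0, num_train_steps, stride))[::-1]


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) for a scalar value."""
    # Estimate the clean sample implied by the current noise prediction
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    # Re-noise that estimate to the earlier (less noisy) level, reusing the same noise direction
    return math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1.0 - alpha_bar_prev) * eps_pred


print(ddim_timesteps(1000, 50)[:5])  # first few timesteps of a 50-step schedule
```

Because no fresh noise is injected, the same starting noise x_T always maps to the same output, and the update can jump between non-adjacent timesteps.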
Diffusion models can be easily conditioned on additional information (text, class labels, images):
Classifier-free guidance:
Where:
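One common way to write the guided prediction is shown below. Conventions differ between papers and libraries; this form assumes w is the guidance scale, c is the conditioning input, and ∅ is the "null" (unconditional) input.

```latex
\tilde{\epsilon}_\theta(x_t, t, c) =
\epsilon_\theta(x_t, t, \varnothing)
+ w \left( \epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing) \right)
```

With w = 1 this reduces to the ordinary conditional prediction, and w > 1 pushes samples further toward the condition.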
This technique is used in DALL-E 2 and Stable Diffusion for text-to-image generation.
The standard architecture for diffusion models is a U-Net with time embeddings. This architecture:
The timestep t is encoded into a high-dimensional embedding that's injected into the network.
Sinusoidal positional encoding:
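A common form, borrowed from the Transformer positional encoding, is sketched below with i indexing frequency pairs. The code in this section uses a close variant: it divides the exponent by d/2 − 1 and concatenates all sines followed by all cosines instead of interleaving them.

```latex
\mathrm{PE}(t, 2i) = \sin\!\left(\frac{t}{10000^{\,2i/d}}\right),
\qquad
\mathrm{PE}(t, 2i+1) = \cos\!\left(\frac{t}{10000^{\,2i/d}}\right),
\qquad i = 0, \dots, \tfrac{d}{2} - 1
```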
Where d is the embedding dimension (typically 256 or 512).
Why this works:
import math
import torch
import torch.nn as nn

# Time embedding (sinusoidal)
class SinusoidalPositionEmbeddings(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, time):
        device = time.device
        half_dim = self.dim // 2
        # Geometric sequence of frequencies from 1 down to 1/10000
        embeddings = math.log(10000) / (half_dim - 1)
        embeddings = torch.exp(torch.arange(half_dim, device=device) * -embeddings)
        embeddings = time[:, None] * embeddings[None, :]
        embeddings = torch.cat((embeddings.sin(), embeddings.cos()), dim=-1)
        return embeddings

# Simplified U-Net for diffusion
# Down, Up, and SelfAttention are helper modules assumed to be defined elsewhere:
# Down halves and Up doubles the spatial resolution, and both are called as block(x, t_emb).
class SimpleUNet(nn.Module):
    def __init__(self, channels=3, time_emb_dim=256):
        super().__init__()
        # Time embedding
        self.time_mlp = nn.Sequential(
            SinusoidalPositionEmbeddings(time_emb_dim),
            nn.Linear(time_emb_dim, time_emb_dim),
            nn.GELU(),
            nn.Linear(time_emb_dim, time_emb_dim)
        )
        # Encoder (downsampling)
        self.conv1 = nn.Conv2d(channels, 64, 3, padding=1)
        self.down1 = Down(64, 128, time_emb_dim)
        self.down2 = Down(128, 256, time_emb_dim)
        # Bottleneck
        self.bottleneck = nn.Sequential(
            nn.Conv2d(256, 512, 3, padding=1),
            nn.GroupNorm(8, 512),
            nn.SiLU(),
            SelfAttention(512),  # Attention at lowest resolution
            nn.Conv2d(512, 256, 3, padding=1)
        )
        # Decoder (upsampling)
        self.up1 = Up(256 + 256, 128, time_emb_dim)  # +256 from skip connection
        self.up2 = Up(128 + 128, 64, time_emb_dim)
        # Output
        self.out = nn.Conv2d(64 + 64, channels, 1)

    def forward(self, x, t):
        # Get time embedding
        t_emb = self.time_mlp(t)
        # Encoder with skip connections
        x1 = self.conv1(x)
        x2 = self.down1(x1, t_emb)
        x3 = self.down2(x2, t_emb)
        # Bottleneck
        x = self.bottleneck(x3)
        # Decoder with skip connections
        x = self.up1(torch.cat([x, x3], dim=1), t_emb)
        x = self.up2(torch.cat([x, x2], dim=1), t_emb)
        x = self.out(torch.cat([x, x1], dim=1))
        return x
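The `Down`, `Up`, and `SelfAttention` modules used by `SimpleUNet` are not defined in this guide. The sketch below is one hypothetical way to fill them in so the shapes line up: `Down` halves the spatial size, `Up` doubles it, both add a projected time embedding, and `SelfAttention` attends over flattened spatial positions. The layer choices (strided and transposed convolutions, four attention heads) are assumptions made for illustration, not the original author's definitions.

```python
import torch
import torch.nn as nn


class Down(nn.Module):
    """Conv block with time conditioning, followed by 2x spatial downsampling."""

    def __init__(self, in_ch, out_ch, time_emb_dim):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.norm = nn.GroupNorm(8, out_ch)
        self.act = nn.SiLU()
        self.time_proj = nn.Linear(time_emb_dim, out_ch)
        self.resample = nn.Conv2d(out_ch, out_ch, 4, stride=2, padding=1)

    def forward(self, x, t_emb):
        h = self.act(self.norm(self.conv(x)))
        # Add the time embedding as a per-channel bias, broadcast over H and W
        h = h + self.time_proj(t_emb)[:, :, None, None]
        return self.resample(h)


class Up(Down):
    """Same block, but with 2x upsampling via a transposed convolution."""

    def __init__(self, in_ch, out_ch, time_emb_dim):
        super().__init__(in_ch, out_ch, time_emb_dim)
        self.resample = nn.ConvTranspose2d(out_ch, out_ch, 4, stride=2, padding=1)


class SelfAttention(nn.Module):
    """Residual multi-head self-attention over spatial positions."""

    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.norm = nn.GroupNorm(8, channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        seq = self.norm(x).flatten(2).transpose(1, 2)  # (B, H*W, C)
        out, _ = self.attn(seq, seq, seq)
        return x + out.transpose(1, 2).reshape(b, c, h, w)
```

With these definitions, a 32×32 input passes through `SimpleUNet` at resolutions 32 → 16 → 8 and back up to 32, so each skip connection matches its decoder input in spatial size.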
1. ResNet blocks:
2. Self-attention layers:
3. Skip connections:
4. Adaptive Group Normalization:
The theoretical foundation of diffusion models comes from maximizing the log-likelihood of the data.
Goal: Maximize
This is intractable, so we optimize a variational lower bound (ELBO):
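In the DDPM formulation (Ho et al., 2020), the bound and its per-timestep decomposition can be written as below. The labels L_T, L_{t−1}, and L_0 follow that paper's naming.

```latex
\log p_\theta(x_0) \;\ge\;
\mathbb{E}_{q}\!\left[ \log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)} \right]

-\,\mathrm{ELBO} = \mathbb{E}_{q}\Big[
\underbrace{D_{\mathrm{KL}}\!\left(q(x_T \mid x_0)\,\|\,p(x_T)\right)}_{L_T}
+ \sum_{t=2}^{T}
\underbrace{D_{\mathrm{KL}}\!\left(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\right)}_{L_{t-1}}
\underbrace{-\,\log p_\theta(x_0 \mid x_1)}_{L_0}
\Big]
```

Each L_{t−1} term compares two Gaussians, which is why the objective reduces to a squared error between the true and predicted noise.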
Diffusion models are closely related to score-based generative models.
The score:
This is the gradient of the log density. Predicting the noise ε is equivalent to predicting the score:
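Concretely, differentiating the Gaussian q(x_t | x_0) = N(√ᾱ_t x_0, (1 − ᾱ_t) I) with respect to x_t gives the relation below. Here s_θ denotes a learned score network, a symbol introduced for illustration.

```latex
\nabla_{x_t} \log q(x_t \mid x_0)
= -\frac{x_t - \sqrt{\bar{\alpha}_t}\,x_0}{1-\bar{\alpha}_t}
= -\frac{\epsilon}{\sqrt{1-\bar{\alpha}_t}}

s_\theta(x_t, t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}
```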
Both approaches learn the same thing from different perspectives!
The diffusion process can be viewed as discretizing a continuous SDE.
Forward SDE:
Where dw is Brownian motion. The reverse-time SDE is:
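For the variance-preserving case that corresponds to DDPM, the two equations are usually written as below, following Song et al. (2021). β(t) is the continuous-time noise schedule, w̄ is Brownian motion running backward in time, and p_t is the marginal density of x at time t.

```latex
\text{Forward:}\qquad
dx = -\tfrac{1}{2}\,\beta(t)\,x\,dt + \sqrt{\beta(t)}\,dw

\text{Reverse:}\qquad
dx = \left[ -\tfrac{1}{2}\,\beta(t)\,x - \beta(t)\,\nabla_x \log p_t(x) \right] dt
+ \sqrt{\beta(t)}\,d\bar{w}
```

The only unknown in the reverse equation is the score ∇_x log p_t(x), which is what the network approximates.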
This continuous perspective enables:
1. Small diffusion steps:
When β_t is small, the reverse process is approximately Gaussian, making it learnable.
2. Markov property:
Each step only depends on the previous one, simplifying the distribution.
3. Convergence to Gaussian:
After enough steps, x_T is approximately distributed as N(0, I) regardless of the data distribution, because the signal coefficient √ᾱ_t decays toward 0 while the accumulated Gaussian noise dominates.
4. Tractable training:
The closed-form forward process enables efficient training without running the full chain.
Diffusion models have fundamentally stable training:
Quality metrics:
Diversity:
Unlike GANs, diffusion models can compute likelihood:
Diffusion models are easier to condition:
Text-to-image (like DALL-E 2, Stable Diffusion):
Inpainting and editing:
| Aspect | Diffusion Models | GANs |
|---|---|---|
| Training Stability | Very stable, no collapse | Unstable, mode collapse common |
| Sample Quality | State-of-the-art (FID ~2-5) | Excellent (FID ~5-20) |
| Sample Diversity | Excellent coverage | Can miss modes |
| Sampling Speed | Slow (1000 steps, ~10s) | Fast (1 step, ~0.01s) |
| Training Objective | Single loss (MSE) | Minimax game (adversarial) |
| Likelihood | Can compute log p(x) | Cannot compute likelihood |
| Hyperparameter Tuning | Robust, less sensitive | Sensitive, careful tuning needed |
| Conditioning | Easy and flexible | More difficult |
| Architecture | U-Net standard | Various (DCGAN, StyleGAN) |
| Applications | Text-to-image, super-resolution | Face generation, style transfer |
Trade-offs:
Rule of thumb: Start with T=1000, reduce with DDIM if speed matters.
Linear schedule:
beta_start = 1e-4
beta_end = 0.02
betas = torch.linspace(beta_start, beta_end, T)
Cosine schedule (often better):
def cosine_beta_schedule(timesteps, s=0.008):
    steps = timesteps + 1
    x = torch.linspace(0, timesteps, steps)
    alphas_cumprod = torch.cos(((x / timesteps) + s) / (1 + s) * torch.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clip(betas, 0.0001, 0.9999)
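To see how the two schedules differ, the standard-library sketch below compares ᾱ_t at a few timesteps. It assumes T = 1000, the linear endpoints 1e-4 and 0.02, and s = 0.008. It evaluates the cosine curve directly and ignores the clipping of β values, so it is an approximation of what `cosine_beta_schedule` produces. The function names are illustrative.

```python
import math


def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    out, running = [], 1.0
    for i in range(T):
        beta = beta_start + i * (beta_end - beta_start) / (T - 1)
        running *= 1.0 - beta
        out.append(running)
    return out


def cosine_alpha_bar(T=1000, s=0.008):
    """alpha_bar_t defined directly by the squared-cosine curve, normalized at t = 0."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]


linear, cosine = linear_alpha_bar(), cosine_alpha_bar()
for t in (250, 500, 750):
    print(f"t={t}: linear={linear[t - 1]:.4f}  cosine={cosine[t - 1]:.4f}")
```

At these timesteps the cosine schedule keeps noticeably more signal than the linear one, so the middle of the chain is not spent on inputs that are already close to pure noise.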
Normalize images to [-1, 1] so that the data has roughly the same scale as the unit-variance Gaussian noise:
# Normalize to [-1, 1]
images = (images - 0.5) * 2
# Denormalize for visualization
images = (images + 1) / 2
Using EMA of model weights significantly improves sample quality:
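class EMA:
    def __init__(self, model, decay=0.9999):
        self.model = model
        self.decay = decay
        self.shadow = {}
        self.backup = {}
        for name, param in model.named_parameters():
            if param.requires_grad:
                self.shadow[name] = param.data.clone()

    def update(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                self.shadow[name].mul_(self.decay).add_(
                    param.data, alpha=1 - self.decay
                )

    def apply_shadow(self):
        # One possible way to swap in the EMA weights before sampling
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                self.backup[name] = param.data.clone()
                param.data.copy_(self.shadow[name])

    def restore(self):
        # Put the raw training weights back after sampling
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                param.data.copy_(self.backup[name])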
Use the EMA weights for sampling, not the raw training weights.
For high-resolution images:
Key metrics to track:
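# Generate samples for monitoring (runs inside the epoch loop)
# save_image is torchvision.utils.save_image; compute_fid and the two feature
# tensors are placeholders that are not defined in this guide.
if epoch % 10 == 0:
    model.eval()
    samples = generate_samples(model, timesteps=1000, num_samples=16)
    save_image(samples, f'samples/epoch_{epoch}.png')
    # Compute FID
    fid = compute_fid(real_features, generated_features)
    print(f'Epoch {epoch}, FID: {fid:.2f}')
    model.train()  # switch back to training mode before the next epoch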
| Problem | Symptom | Solution |
|---|---|---|
| Blurry samples | Generated images lack detail | Train longer, use cosine schedule, add attention |
| NaN losses | Loss becomes NaN during training | Reduce LR, check normalization, clip gradients |
| Slow convergence | Loss decreases very slowly | Increase LR, increase batch size, use EMA |
| OOM errors | Out of memory | Reduce batch size, use FP16, gradient checkpointing |
| Color shift | Generated images wrong colors | Check normalization, verify data preprocessing |
DDIM accelerates sampling by using a deterministic process:
Run diffusion in a compressed latent space instead of pixel space:
Advantages:
Strengthen conditioning (e.g., text prompts) without a classifier:
How it works:
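The two halves of the technique can be sketched with the standard library alone. During training the condition is randomly replaced by a null input, so a single network learns both conditional and unconditional predictions. During sampling the two predictions are combined. The names `NULL_COND`, `maybe_drop_condition`, and `guided_noise` are hypothetical, the 10% drop rate is a commonly used value but an assumption here, and the combination rule is one widespread convention among several.

```python
import random

NULL_COND = None  # hypothetical placeholder for the "no conditioning" input


def maybe_drop_condition(cond, p_uncond=0.1, rng=random):
    """Training time: replace the condition with the null input with probability p_uncond."""
    return NULL_COND if rng.random() < p_uncond else cond


def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Sampling time: extrapolate from the unconditional toward the conditional prediction."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions for two "pixels"
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]
for scale in (0.0, 1.0, 2.0):
    print(scale, guided_noise(eps_uncond, eps_cond, scale))
```

A scale of 0 ignores the condition, 1 reproduces the conditional prediction, and larger values trade sample diversity for stronger adherence to the condition.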
Generate images at progressively higher resolutions:
This approach is used in DALL-E 2 and Imagen for high-resolution generation.
1. Text-to-Image:
2. Super-Resolution:
3. Inpainting:
4. Image-to-Image Translation:
5. Video Generation:
6. 3D Generation:
Next Steps:
Implement a simple diffusion model in PyTorch •
Compare with GANs on the same dataset •
Explore conditional generation with text embeddings •
Review the Anki flashcards to reinforce key concepts