DIFFUSION MODELS - COMPLETE GUIDE

CMPUT 328 - Deep Learning | Denoising Diffusion Probabilistic Models (DDPM)
From Forward Diffusion to Reverse Denoising and Beyond

1. INTRODUCTION TO DIFFUSION MODELS

What are Diffusion Models?

Diffusion Models are a class of generative models that learn to generate data by reversing a gradual noising process. They have emerged as state-of-the-art for high-quality image generation, powering systems like DALL-E 2, Stable Diffusion, and Imagen.

Core Idea: If we can learn to reverse the process of adding noise to data, we can generate new data by starting from pure noise and progressively denoising it.

Historical Context

Diffusion models have roots in non-equilibrium thermodynamics and were formalized for deep learning by Sohl-Dickstein et al. (2015). The breakthrough came with Denoising Diffusion Probabilistic Models (DDPM) by Ho et al. (2020), which simplified training and achieved remarkable results.

Key milestones:

Why Diffusion Models Matter

Advantages over GANs:

Trade-offs:

┌─────────────────────────────────────────────────────────────────┐
│                    DIFFUSION MODEL OVERVIEW                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  FORWARD PROCESS (Fixed, Adds Noise)                            │
│  ═══════════════════════════════════                            │
│                                                                 │
│  x₀ ────> x₁ ────> x₂ ────> ... ────> x_T                       │
│  │        │        │                  │                         │
│  Clean    Slightly More               Pure Noise                │
│  Image    Noisy    Noisy              (Gaussian)                │
│                                                                 │
│  ─────────────────────────────────────────────────────────      │
│                                                                 │
│  REVERSE PROCESS (Learned, Removes Noise)                       │
│  ════════════════════════════════════════                       │
│                                                                 │
│  x₀ <──── x₁ <──── x₂ <──── ... <──── x_T                       │
│  │        │        │                  │                         │
│  Clean    Denoise  Denoise            Start Here                │
│  Image    Step     Step               (Random)                  │
│           ↑        ↑                                            │
│           └────────┴──── Neural Network Predicts                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

The Two-Phase Process

Phase 1: Forward Diffusion (Fixed, No Learning)

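Gradually add Gaussian noise to data over T timesteps until it becomes pure noise. This process is:

  1. Fixed: it has no learnable parameters; the noise schedule is chosen in advance
  2. Markovian: each x_t depends only on x_{t-1}
  3. Gaussian: every step adds Gaussian noise, so x_t given x_0 is available in closed form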

Phase 2: Reverse Denoising (Learned)

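Train a neural network to reverse the forward process, denoising step-by-step:

  1. Start from pure noise x_T ~ N(0, I)
  2. At each step, use the network's noise prediction to move from x_t to a slightly cleaner x_{t-1}
  3. After T steps, arrive at a sample x_0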

The key insight is that adding noise is easy (forward process), but if we can learn to reverse it (backward process), we have a powerful generative model.

2. THE DIFFUSION PROCESS INTUITION

The Ink Drop Analogy

Imagine dropping ink into a glass of water:

  1. t=0: Clear water with a drop of concentrated ink (clean image)
  2. t=1 to t=T-1: Ink gradually diffuses, spreading throughout the water (adding noise)
  3. t=T: Ink completely dispersed, water looks uniformly murky (pure noise)

Now imagine you could reverse this process: starting with murky water and somehow un-diffusing it back to a clear drop of ink. This is exactly what diffusion models learn to do!

FORWARD DIFFUSION (Ink dispersing in water)

 t=0          t=20         t=50         t=100
┌────┐       ┌────┐       ┌────┐       ┌────┐
│ ●  │       │░●░ │       │░░░░│       │░░░░│
│    │  -->  │░░░ │  -->  │░░░░│  -->  │░░░░│
│    │       │ ░  │       │░░░░│       │░░░░│
└────┘       └────┘       └────┘       └────┘
Clear        Slightly     Very Noisy   Pure Noise
Drop         Diffused                  (Uniform)

REVERSE DENOISING (Learning to un-diffuse)

 t=100        t=50         t=20         t=0
┌────┐       ┌────┐       ┌────┐       ┌────┐
│░░░░│       │░░░░│       │░●░ │       │ ●  │
│░░░░│  <--  │░░░░│  <--  │░░░ │  <--  │    │
│░░░░│       │░░░░│       │ ░  │       │    │
└────┘       └────┘       └────┘       └────┘

Why This Works for Generation

Once we've learned to denoise (reverse the diffusion), we can generate new samples:

  1. Sample random noise from N(0, I) - like murky water
  2. Apply the learned denoising process step by step
  3. Gradually reveal a clean image - like un-diffusing ink
  4. The final result is a new sample from the learned distribution

Key Insight: We're not directly learning to generate images. We're learning to denoise, which indirectly gives us generation capability.

Comparison to Other Generative Models

Model           Approach               Generation Process
--------------  ---------------------  -------------------------------------------------
GAN             Adversarial            One-shot: noise → image (single forward pass)
VAE             Latent encoding        One-shot: latent code → image (decode)
Diffusion       Iterative denoising    Multi-step: noise → ... → image (T steps)
Autoregressive  Sequential prediction  Pixel-by-pixel: predict each pixel given previous

The Markov Chain Perspective

Diffusion models can be viewed as Markov chains:

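Forward chain: q(x_t | x_{t-1}), fixed Gaussian transitions in which each state depends only on the previous one.

Reverse chain: p_θ(x_{t-1} | x_t), learned Gaussian transitions in which each state depends only on the one after it.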

Forward: x_0 → x_1 → x_2 → ... → x_T (destroy structure)
Reverse: x_T → x_{T-1} → x_{T-2} → ... → x_0 (create structure)

3. FORWARD DIFFUSION MATHEMATICS

Single-Step Forward Process

At each timestep t, we add Gaussian noise to the previous state:

q(x_t | x_{t-1}) = N(x_t; √(1 - β_t) · x_{t-1}, β_t · I)

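Components:

  1. β_t: the variance of the noise added at step t, with 0 < β_t < 1
  2. √(1 - β_t): the factor that shrinks the previous state x_{t-1} so the total variance does not grow
  3. I: the identity matrix, meaning the noise is independent across pixels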

Sampling equation:

x_t = √(1 - β_t) · x_{t-1} + √β_t · ε, where ε ~ N(0, I)

The Noise Schedule β_t

The noise schedule controls how quickly we add noise. It's a sequence of values from β_1 to β_T.

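Common schedules:

  1. Linear: β_t increases linearly from β_1 = 1e-4 to β_T = 0.02 (the original DDPM choice)
  2. Cosine: ᾱ_t follows a squared-cosine curve, which destroys information more gradually (see Section 10)

The choice of noise schedule significantly affects training stability and sample quality. The linear schedule is the one used in the original DDPM.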

Multi-Step Forward Process (Closed Form)

A powerful property: we can jump directly from x_0 to x_t in one step, without computing intermediate steps!

Define cumulative products:

α_t = 1 - β_t
ᾱ_t = ∏_{s=1}^t α_s (product of all alphas up to t)

Closed-form sampling:

q(x_t | x_0) = N(x_t; √ᾱ_t · x_0, (1 - ᾱ_t) · I)

Reparameterization for sampling:

x_t = √ᾱ_t · x_0 + √(1 - ᾱ_t) · ε, where ε ~ N(0, I)
Key Advantage: This closed form allows us to train on any timestep t without running the full forward process. We can directly sample noisy versions of x_0 at any noise level.
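
The closed form can be checked numerically. The sketch below uses plain Python and the linear schedule values quoted in this guide (T = 1000, β from 1e-4 to 0.02). It composes the single-step update one step at a time and compares the result with √ᾱ_t and 1 - ᾱ_t. The function names are illustrative and do not come from any library.

```python
import math

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule beta_1..beta_T (stored 0-indexed)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def compose_single_steps(betas, t):
    """Track x_t = signal_coef * x_0 + N(0, noise_var) through t single steps."""
    signal_coef, noise_var = 1.0, 0.0
    for beta in betas[:t]:
        # One step: x_s = sqrt(1 - beta) * x_{s-1} + sqrt(beta) * eps
        signal_coef *= math.sqrt(1.0 - beta)
        noise_var = (1.0 - beta) * noise_var + beta
    return signal_coef, noise_var

def closed_form(betas, t):
    """Closed form: signal coefficient sqrt(alpha_bar_t), noise variance 1 - alpha_bar_t."""
    alpha_bar = 1.0
    for beta in betas[:t]:
        alpha_bar *= 1.0 - beta
    return math.sqrt(alpha_bar), 1.0 - alpha_bar

betas = linear_betas()
for t in (1, 100, 500, 1000):
    step_coef, step_var = compose_single_steps(betas, t)
    cf_coef, cf_var = closed_form(betas, t)
    print(t, round(step_coef, 6), round(cf_coef, 6), round(step_var, 6), round(cf_var, 6))
```

The two pairs of columns should agree up to floating-point rounding, which is why training can sample x_t directly from x_0.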

Properties of the Forward Process

1. Variance preservation:

If the data is normalized so that Var(x_0) ≈ 1, the variance of x_t stays approximately 1 at every timestep:

Var(x_t) = ᾱ_t · Var(x_0) + (1 - ᾱ_t) ≈ 1

2. Convergence to pure noise:

As t → T, ᾱ_t → 0 (for the linear schedule with T = 1000, ᾱ_T is roughly 4e-5), so:

x_T = √ᾱ_T · x_0 + √(1 - ᾱ_T) · ε ≈ ε, where ε ~ N(0, I)

This means x_T is approximately pure Gaussian noise, regardless of the original x_0.

3. Information destruction:

The signal-to-noise ratio decreases monotonically:

SNR(t) = ᾱ_t / (1 - ᾱ_t)

As t increases, SNR decreases, and the image loses structure.
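
To make the decay concrete, the following sketch tabulates ᾱ_t and SNR(t) for the linear schedule with the values used in this guide (T = 1000, β from 1e-4 to 0.02). It is plain Python; the helper name snr is illustrative.

```python
T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]

# alpha_bars[t - 1] holds the cumulative product alpha_bar_t for t = 1..T
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

def snr(t):
    """Signal-to-noise ratio at 1-indexed timestep t."""
    a = alpha_bars[t - 1]
    return a / (1.0 - a)

for t in (1, 250, 500, 750, 1000):
    print(f"t={t:4d}  alpha_bar={alpha_bars[t - 1]:.6f}  SNR={snr(t):.6g}")
```

Because every factor (1 - β_t) is below 1, ᾱ_t shrinks at every step and the SNR falls with it.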

PyTorch Implementation of Forward Process

import torch

# Noise schedule (linear)
def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
    return torch.linspace(beta_start, beta_end, timesteps)

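# Precompute alpha values
betas = linear_beta_schedule(timesteps=1000)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)
alphas_cumprod_prev = torch.cat([torch.tensor([1.0]), alphas_cumprod[:-1]])

# Posterior variance β̃_t of q(x_{t-1} | x_t, x_0), used by p_sample during sampling
posterior_variance = betas * (1.0 - alphas_cumprod_prev) / (1.0 - alphas_cumprod)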

# Forward diffusion (closed form)
def q_sample(x_0, t, noise=None):
    """
    Sample from q(x_t | x_0) - add noise to x_0 directly

    Args:
        x_0: Clean image (batch_size, channels, height, width)
        t: Timestep (batch_size,)
        noise: Optional pre-sampled noise

    Returns:
        Noisy image x_t
    """
    if noise is None:
        noise = torch.randn_like(x_0)

    # Get sqrt(alpha_cumprod) for each sample in batch
    sqrt_alphas_cumprod_t = extract(torch.sqrt(alphas_cumprod), t, x_0.shape)
    sqrt_one_minus_alphas_cumprod_t = extract(
        torch.sqrt(1.0 - alphas_cumprod), t, x_0.shape
    )

    # Apply noise: x_t = sqrt(alpha_cumprod_t) * x_0 + sqrt(1 - alpha_cumprod_t) * noise
    return sqrt_alphas_cumprod_t * x_0 + sqrt_one_minus_alphas_cumprod_t * noise

def extract(a, t, x_shape):
    """
    Extract values from 'a' at indices 't' and reshape for broadcasting
    """
    batch_size = t.shape[0]
    # Move the schedule tensor to t's device so gather also works when t is on the GPU
    out = a.to(t.device).gather(-1, t)
    return out.reshape(batch_size, *((1,) * (len(x_shape) - 1)))

4. REVERSE DENOISING PROCESS

The Goal: Learn p(x_{t-1} | x_t)

The reverse process learns to go backward through the diffusion chain, removing noise step by step.

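The true reverse conditional follows from Bayes' rule:

q(x_{t-1} | x_t) = q(x_t | x_{t-1}) · q(x_{t-1}) / q(x_t)

This is intractable to compute exactly, because the marginals q(x_{t-1}) and q(x_t) depend on the unknown data distribution. However, if β_t is small enough, the reverse conditional is also approximately Gaussian!

Key Result: When β_t is small, the true reverse conditional q(x_{t-1} | x_t) is approximately Gaussian, so a Gaussian p_θ(x_{t-1} | x_t) with a learned mean is a good model for it. Fixing the variance is a DDPM design choice, not part of the result.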

Parameterizing the Reverse Process

We model the reverse process as a Gaussian:

p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))

Simplification (DDPM):

Keep variance fixed, only learn the mean:

p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), σ_t² · I)

Where σ_t² is chosen to be either β_t or β̃_t (posterior variance).

What Should the Neural Network Predict?

There are multiple equivalent parameterizations:

Option 1: Predict the mean μ_θ(x_t, t) directly

Option 2: Predict the noise ε_θ(x_t, t)

Option 3: Predict the clean image x̂_0

DDPM found that predicting the noise (Option 2) works best empirically.

Standard Choice: Train a neural network ε_θ(x_t, t) to predict the noise that was added at timestep t.

Noise Prediction Network

Given noisy image x_t and timestep t, predict the noise ε:

ε_θ(x_t, t) ≈ ε

Where ε is the actual noise that was added to create x_t from x_0:

x_t = √ᾱ_t · x_0 + √(1 - ᾱ_t) · ε

Once we have predicted noise, we can compute the mean:

μ_θ(x_t, t) = (1/√α_t) · (x_t - (β_t/√(1 - ᾱ_t)) · ε_θ(x_t, t))
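
The formula above is the mean of the posterior q(x_{t-1} | x_t, x_0) from Ho et al. (2020), rewritten with x_0 eliminated in favor of ε. The sketch below checks that equivalence numerically for one scalar "pixel". The posterior-mean expression in terms of x_0 is taken from the DDPM paper and does not appear elsewhere in this guide; the numbers are made up.

```python
import math

def posterior_mean_from_x0(x_t, x_0, alpha_t, alpha_bar_t, alpha_bar_prev):
    """Mean of q(x_{t-1} | x_t, x_0), written in terms of x_0 and x_t."""
    beta_t = 1.0 - alpha_t
    coef_x0 = math.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    return coef_x0 * x_0 + coef_xt * x_t

def mean_from_noise(x_t, eps, alpha_t, alpha_bar_t):
    """The same mean written in terms of the noise, as in the formula above."""
    beta_t = 1.0 - alpha_t
    return (x_t - beta_t / math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_t)

# Made-up scalar values for one pixel at one timestep.
alpha_bar_prev, alpha_t = 0.60, 0.98
alpha_bar_t = alpha_bar_prev * alpha_t
x_0, eps = 0.3, -1.2
x_t = math.sqrt(alpha_bar_t) * x_0 + math.sqrt(1.0 - alpha_bar_t) * eps

mean_a = posterior_mean_from_x0(x_t, x_0, alpha_t, alpha_bar_t, alpha_bar_prev)
mean_b = mean_from_noise(x_t, eps, alpha_t, alpha_bar_t)
print(mean_a, mean_b)
```

Both functions return the same value when ε is the true noise, so a network that predicts ε well also yields a good estimate of the posterior mean.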

Denoising Step Formula

To sample x_{t-1} given x_t, we:

  1. Predict the noise using the network: ε̂ = ε_θ(x_t, t)
  2. Compute the mean using the formula above
  3. Sample with added randomness (except at t=1):
x_{t-1} = μ_θ(x_t, t) + σ_t · z, where z ~ N(0, I) if t > 1, else z = 0
At the final step (t=1), we don't add noise - we just use the predicted mean to get x_0.

Posterior Variance

The posterior q(x_{t-1} | x_t, x_0) has variance:

β̃_t = ((1 - ᾱ_{t-1}) / (1 - ᾱ_t)) · β_t

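Two choices for σ_t²:

  1. σ_t² = β_t: the forward-process variance
  2. σ_t² = β̃_t: the posterior variance above, which is never larger than β_t

Ho et al. (2020) report similar sample quality for both choices. The sampling code in this guide uses β̃_t.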
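
A quick way to see how the two candidate variances differ is to tabulate them. The sketch below assumes the linear schedule from this guide and the usual convention ᾱ_0 = 1; the function name posterior_variance is illustrative.

```python
T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]

alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

def posterior_variance(t):
    """beta_tilde_t for 1-indexed t, using the convention alpha_bar_0 = 1."""
    alpha_bar_prev = alpha_bars[t - 2] if t > 1 else 1.0
    return (1.0 - alpha_bar_prev) / (1.0 - alpha_bars[t - 1]) * betas[t - 1]

for t in (1, 2, 10, 100, 1000):
    print(f"t={t:4d}  beta_t={betas[t - 1]:.6f}  beta_tilde_t={posterior_variance(t):.6f}")
```

The two values differ most at small t (β̃_1 is exactly 0 under this convention) and become nearly identical at large t, which is consistent with the choice mattering little in practice.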

Intuition: Separating Signal and Noise

Think of x_t as a mixture of signal and noise:

x_t = signal_t + noise_t

The network learns to identify and remove the noise component, leaving us with a slightly cleaner signal. Repeating this T times progressively reveals the original image.

REVERSE DENOISING VISUALIZATION

t=1000 (Pure Noise)               t=500 (Vague Shapes)
┌──────────────┐                  ┌──────────────┐
│ ▓▓▒▒░░▓▓▒▒░░ │                  │ ░░  ▓▓  ░░   │
│ ▒▒░░▓▓▒▒░░▓▓ │     Denoise      │   ░░▓▓░░     │
│ ░░▓▓▒▒░░▓▓▒▒ │     ──────>      │ ░░  ░░  ▓▓   │
│ ▓▓▒▒░░▓▓▒▒░░ │                  │   ▓▓    ░░░░ │
└──────────────┘                  └──────────────┘

t=100 (Clear Structure)           t=0 (Clean Image)
┌──────────────┐                  ┌──────────────┐
│              │                  │              │
│    ▓▓▓▓▓▓    │     Denoise      │   ████████   │
│    ▓▓  ▓▓    │     ──────>      │   ██    ██   │
│    ▓▓▓▓▓▓    │                  │   ████████   │
└──────────────┘                  └──────────────┘

5. TRAINING ALGORITHM

Training Objective

The goal is to train the neural network ε_θ(x_t, t) to predict the noise that was added at each timestep.

Simplified training loss (DDPM):

L_simple = E_{t, x_0, ε} [ ||ε - ε_θ(x_t, t)||² ]

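Where:

  1. t is sampled uniformly from {1, ..., T}
  2. x_0 is a clean sample from the training data
  3. ε ~ N(0, I) is the noise used to build x_t = √ᾱ_t · x_0 + √(1 - ᾱ_t) · ε

Training is simple: just predict what noise was added! There is no adversarial game as in GANs and no separate KL term to balance as in VAEs.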

Training Algorithm Pseudocode

DDPM Training Loop:

REPEAT until converged:
  1. Sample clean image x_0 from training data
  2. Sample timestep t ~ Uniform({1, ..., T})
  3. Sample noise ε ~ N(0, I)
  4. Create noisy image: x_t = √ᾱ_t · x_0 + √(1 - ᾱ_t) · ε
  5. Predict noise: ε̂ = ε_θ(x_t, t)
  6. Compute loss: L = ||ε - ε̂||²
  7. Update θ using gradient descent: θ ← θ - η · ∇_θ L

PyTorch Training Implementation

import torch
import torch.nn.functional as F

# Simplified DDPM training loop
def train_ddpm(model, dataloader, optimizer, timesteps=1000, epochs=100):
    """
    Train a diffusion model to predict noise

    Args:
        model: Neural network (U-Net) that predicts noise
        dataloader: Training data loader
        optimizer: Optimizer (e.g., Adam)
        timesteps: Number of diffusion steps (T)
        epochs: Training epochs
    """

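    device = next(model.parameters()).device

    for epoch in range(epochs):
        model.train()  # generate_samples() below switches the model to eval mode
        for batch_idx, (x_0, _) in enumerate(dataloader):
            x_0 = x_0.to(device)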
            batch_size = x_0.size(0)

            # Sample random timesteps
            t = torch.randint(0, timesteps, (batch_size,), device=device).long()

            # Sample noise
            noise = torch.randn_like(x_0)

            # Create noisy images using closed-form forward process
            x_t = q_sample(x_0, t, noise=noise)

            # Predict noise
            noise_pred = model(x_t, t)

            # Compute MSE loss
            loss = F.mse_loss(noise_pred, noise)

            # Backprop and update
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if batch_idx % 100 == 0:
                print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}')

        # Optionally generate samples for monitoring
        if epoch % 10 == 0:
            generate_samples(model, timesteps, num_samples=16)

Why This Training Works

1. Direct supervision:

Unlike GANs, we have a clear ground truth (the actual noise ε). The network learns a supervised denoising task.

2. All timesteps contribute:

By randomly sampling t, the network learns to denoise at all noise levels. This makes training stable and comprehensive.

3. Variance reduction:

The loss is an expectation over many random variables (t, x_0, ε), but in practice we get low-variance gradients.

4. Scalability:

The loss scales well - we can train on high-resolution images by sampling small patches or using progressive training.

Training Hyperparameters

Parameter        Typical Value   Purpose
---------------  --------------  -------------------------------------
T (timesteps)    1000            Number of diffusion steps
Learning rate    1e-4 to 2e-4    Adam optimizer LR
Batch size       32-128          Depends on image size and GPU memory
β_1              1e-4            Starting noise schedule value
β_T              0.02            Final noise schedule value
EMA decay        0.9999          Exponential moving average of weights

Many implementations use Exponential Moving Average (EMA) of model weights for sampling, which improves sample quality significantly.

Loss Variations

Simple loss (DDPM):

L_simple = ||ε - ε_θ(x_t, t)||²

Weighted loss:

L_weighted = E[w_t · ||ε - ε_θ(x_t, t)||²]

Where w_t can emphasize certain timesteps.

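Hybrid loss (Improved DDPM):

L_hybrid = L_simple + λ · L_vlb

Here L_vlb is the variational bound from Section 8 and λ is small (0.001 in Nichol & Dhariwal, 2021), so L_vlb mainly trains a learned variance Σ_θ while L_simple still drives the mean. An x_0-prediction loss ||x_0 - x̂_0||² is a separate reparameterization of the same regression problem, not the Improved DDPM hybrid.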

6. SAMPLING (INFERENCE) ALGORITHM

Generation Process

Once trained, we can generate new images by sampling from the reverse process.

Sampling Algorithm (DDPM):

1. Sample x_T ~ N(0, I) (pure random noise)

2. FOR t = T, T-1, ..., 1:
   a. Predict noise: ε̂ = ε_θ(x_t, t)
   b. Compute mean: μ = (1/√α_t) · (x_t - (β_t/√(1 - ᾱ_t)) · ε̂)
   c. Sample z ~ N(0, I) if t > 1, else z = 0
   d. Compute x_{t-1} = μ + σ_t · z

3. RETURN x_0 (generated image)
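
The loop above can be run end to end on a toy problem without training anything. The sketch below assumes 1-D data x_0 ~ N(2, 0.5²), for which the ideal noise prediction E[ε | x_t] has a closed form, and uses that in place of a trained ε_θ. The data distribution, the ideal_eps stand-in, and all names are assumptions made for illustration; σ_t² is set to β̃_t.

```python
import math
import random

T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bars, running = [], 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)

DATA_MEAN, DATA_STD = 2.0, 0.5  # toy data distribution: x_0 ~ N(2, 0.5^2)

def ideal_eps(x_t, i):
    """E[eps | x_t] for Gaussian data; stands in for a trained eps_theta."""
    ab = alpha_bars[i]
    marginal_var = ab * DATA_STD ** 2 + (1.0 - ab)
    return math.sqrt(1.0 - ab) * (x_t - math.sqrt(ab) * DATA_MEAN) / marginal_var

def sample_one(rng):
    x = rng.gauss(0.0, 1.0)                       # step 1: x_T ~ N(0, 1)
    for i in reversed(range(T)):                  # i = t - 1 (0-indexed)
        eps_hat = ideal_eps(x, i)                 # step 2a: predict noise
        scaled = betas[i] / math.sqrt(1.0 - alpha_bars[i]) * eps_hat
        mu = (x - scaled) / math.sqrt(alphas[i])  # step 2b: compute mean
        if i == 0:
            return mu                             # no noise on the last step
        # sigma_t = sqrt(beta_tilde_t), the posterior standard deviation
        sigma = math.sqrt((1.0 - alpha_bars[i - 1]) / (1.0 - alpha_bars[i]) * betas[i])
        x = mu + sigma * rng.gauss(0.0, 1.0)      # steps 2c-2d

rng = random.Random(0)
samples = [sample_one(rng) for _ in range(300)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
print(f"sample mean={sample_mean:.3f}, sample std={sample_std:.3f}")
```

Starting from standard normal noise, the generated values should land close to the data mean and spread, which shows that the denoising steps alone carry the distribution.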

PyTorch Sampling Implementation

# Sampling (generation) algorithm
@torch.no_grad()
def p_sample(model, x_t, t, t_index):
    """
    Single reverse diffusion step: x_t -> x_{t-1}

    Args:
        model: Trained noise prediction network
        x_t: Current noisy image
        t: Current timestep (tensor)
        t_index: Timestep as integer for indexing

    Returns:
        x_{t-1}: Less noisy image
    """
    # Precomputed values
    betas_t = extract(betas, t, x_t.shape)
    sqrt_one_minus_alphas_cumprod_t = extract(
        torch.sqrt(1. - alphas_cumprod), t, x_t.shape
    )
    sqrt_recip_alphas_t = extract(torch.sqrt(1.0 / alphas), t, x_t.shape)

    # Predict noise
    predicted_noise = model(x_t, t)

    # Compute mean
    model_mean = sqrt_recip_alphas_t * (
        x_t - betas_t * predicted_noise / sqrt_one_minus_alphas_cumprod_t
    )

    if t_index == 0:
        # No noise at final step
        return model_mean
    else:
        # Add noise
        posterior_variance_t = extract(posterior_variance, t, x_t.shape)
        noise = torch.randn_like(x_t)
        return model_mean + torch.sqrt(posterior_variance_t) * noise

@torch.no_grad()
def p_sample_loop(model, shape, timesteps=1000):
    """
    Full sampling loop: x_T -> x_0

    Args:
        model: Trained diffusion model
        shape: Shape of images to generate (batch, channels, height, width)
        timesteps: Number of diffusion steps

    Returns:
        Generated images
    """
    device = next(model.parameters()).device

    # Start from pure noise
    x = torch.randn(shape, device=device)

    # Iteratively denoise
    for i in reversed(range(0, timesteps)):
        t = torch.full((shape[0],), i, device=device, dtype=torch.long)
        x = p_sample(model, x, t, i)

    return x

# Generate samples
@torch.no_grad()
def generate_samples(model, timesteps, num_samples=16, channels=3,
                    img_size=32):
    """
    Generate new samples from trained model
    """
    model.eval()
    shape = (num_samples, channels, img_size, img_size)
    samples = p_sample_loop(model, shape, timesteps)

    # Denormalize from [-1, 1] to [0, 1]
    samples = (samples + 1) / 2
    samples = torch.clamp(samples, 0, 1)

    return samples

Sampling Time Complexity

The main drawback of diffusion models is slow sampling:

Example timing (single image):

This has motivated research into faster sampling methods like DDIM, which can reduce steps to 50-100 with minimal quality loss.

Accelerated Sampling: DDIM

Denoising Diffusion Implicit Models (DDIM) enable faster sampling by skipping timesteps.

Key idea:

DDPM: 1000 steps, stochastic (adds noise at each step)
DDIM: 50-100 steps, deterministic when η = 0 (no added noise), roughly 10-20× fewer network evaluations

Conditional Generation

Diffusion models can be easily conditioned on additional information (text, class labels, images):

Classifier-free guidance:

ε̂_guided = ε̂_uncond + w · (ε̂_cond - ε̂_uncond)

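Where:

  1. ε̂_cond = ε_θ(x_t, t, c) is the prediction given the condition c (e.g., a text embedding)
  2. ε̂_uncond = ε_θ(x_t, t, ∅) is the prediction with the condition dropped
  3. w is the guidance scale: w = 0 is unconditional, w = 1 is plain conditional sampling, and w > 1 pushes samples harder toward the condition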

This technique is used in DALL-E 2 and Stable Diffusion for text-to-image generation.
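
As a minimal sketch of the guidance formula, the function below combines two noise predictions element-wise. The prediction values are hypothetical numbers rather than outputs of a real model, and the function name is illustrative.

```python
def classifier_free_guidance(eps_uncond, eps_cond, w):
    """Combine unconditional and conditional noise predictions element-wise."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Hypothetical per-pixel predictions from the same network,
# once without and once with the condition.
eps_uncond = [0.10, -0.40, 0.25]
eps_cond = [0.30, -0.10, 0.05]

for w in (0.0, 1.0, 7.5):
    print(w, classifier_free_guidance(eps_uncond, eps_cond, w))
```

With w = 0 the result is the unconditional prediction, with w = 1 it is the conditional one, and larger w extrapolates past the conditional prediction along the same direction.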

7. ARCHITECTURE: U-NET WITH TIME EMBEDDING

Why U-Net?

The standard architecture for diffusion models is a U-Net with time embeddings. This architecture:

U-NET ARCHITECTURE FOR DIFFUSION

Input: Noisy Image x_t + Time Embedding t

┌─────────────────────────────────────────────────────────┐
│  ENCODER (Downsample)                                   │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  x_t (64×64) ──> Conv+GN+SiLU ──> 64×64, C=64 ────┐     │
│       │                              (skip conn)  │     │
│       ↓                                           │     │
│  Downsample ──> 32×32, C=128 ──────────────────┐  │     │
│       │                           (skip conn)  │  │     │
│       ↓                                        │  │     │
│  Downsample ──> 16×16, C=256 ───────────┐      │  │     │
│       │                    (skip conn)  │      │  │     │
│       ↓                                 │      │  │     │
├─────────────────────────────────────────┼──────┼──┼─────┤
│  BOTTLENECK                             │      │  │     │
├─────────────────────────────────────────┼──────┼──┼─────┤
│  Self-Attention + ResBlocks             │      │  │     │
│  8×8, C=512                             │      │  │     │
├─────────────────────────────────────────┼──────┼──┼─────┤
│  DECODER (Upsample)                     │      │  │     │
├─────────────────────────────────────────┼──────┼──┼─────┤
│  Upsample ──> 16×16 <───────────────────┘      │  │     │
│       │       + skip                           │  │     │
│       ↓                                        │  │     │
│  Upsample ──> 32×32 <──────────────────────────┘  │     │
│       │       + skip                              │     │
│       ↓                                           │     │
│  Upsample ──> 64×64 <─────────────────────────────┘     │
│       │       + skip                                    │
│       ↓                                                 │
│  Output Conv ──> 64×64, C=3                             │
│                                                         │
└─────────────────────────────────────────────────────────┘

Output: Predicted Noise ε̂

Time Embedding t is injected at each resolution level

Time Embedding

The timestep t is encoded into a high-dimensional embedding that's injected into the network.

Sinusoidal positional encoding:

emb[2i] = sin(t / 10000^(2i/d))
emb[2i+1] = cos(t / 10000^(2i/d))

Where d is the embedding dimension (typically 256 or 512).

Why this works:

import math
import torch
import torch.nn as nn

# Time embedding (sinusoidal)
class SinusoidalPositionEmbeddings(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, time):
        device = time.device
        half_dim = self.dim // 2
        embeddings = math.log(10000) / (half_dim - 1)
        embeddings = torch.exp(torch.arange(half_dim, device=device) * -embeddings)
        embeddings = time[:, None] * embeddings[None, :]
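        # sin terms fill the first half and cos terms the second half
        # (concatenated rather than interleaved as in the formula above)
        embeddings = torch.cat((embeddings.sin(), embeddings.cos()), dim=-1)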
        return embeddings

Simplified U-Net Implementation

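# Simplified U-Net for diffusion
# Note: Down, Up, and SelfAttention are helper modules that are not defined in this guide.
# Down is assumed to halve the spatial resolution and Up to double it; both take (x, t_emb).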
class SimpleUNet(nn.Module):
    def __init__(self, channels=3, time_emb_dim=256):
        super().__init__()

        # Time embedding
        self.time_mlp = nn.Sequential(
            SinusoidalPositionEmbeddings(time_emb_dim),
            nn.Linear(time_emb_dim, time_emb_dim),
            nn.GELU(),
            nn.Linear(time_emb_dim, time_emb_dim)
        )

        # Encoder (downsampling)
        self.conv1 = nn.Conv2d(channels, 64, 3, padding=1)
        self.down1 = Down(64, 128, time_emb_dim)
        self.down2 = Down(128, 256, time_emb_dim)

        # Bottleneck
        self.bottleneck = nn.Sequential(
            nn.Conv2d(256, 512, 3, padding=1),
            nn.GroupNorm(8, 512),
            nn.SiLU(),
            SelfAttention(512),  # Attention at lowest resolution
            nn.Conv2d(512, 256, 3, padding=1)
        )

        # Decoder (upsampling)
        self.up1 = Up(256 + 256, 128, time_emb_dim)  # +256 from skip connection
        self.up2 = Up(128 + 128, 64, time_emb_dim)

        # Output
        self.out = nn.Conv2d(64 + 64, channels, 1)

    def forward(self, x, t):
        # Get time embedding
        t_emb = self.time_mlp(t)

        # Encoder with skip connections
        x1 = self.conv1(x)
        x2 = self.down1(x1, t_emb)
        x3 = self.down2(x2, t_emb)

        # Bottleneck
        x = self.bottleneck(x3)

        # Decoder with skip connections
        x = self.up1(torch.cat([x, x3], dim=1), t_emb)
        x = self.up2(torch.cat([x, x2], dim=1), t_emb)
        x = self.out(torch.cat([x, x1], dim=1))

        return x

Key Components

1. ResNet blocks:

2. Self-attention layers:

3. Skip connections:

4. Adaptive Group Normalization:

8. MATHEMATICAL FOUNDATIONS

Variational Lower Bound

The theoretical foundation of diffusion models comes from maximizing the log-likelihood of the data.

Goal: Maximize

log p_θ(x_0) = log ∫ p_θ(x_{0:T}) dx_{1:T}

This is intractable, so we optimize a variational lower bound (ELBO):

log p_θ(x_0) ≥ E_q [ log ( p_θ(x_{0:T}) / q(x_{1:T} | x_0) ) ]

In practice, the simplified loss L_simple gives better sample quality than directly optimizing the ELBO, but the ELBO provides the theoretical justification.

Connection to Score Matching

Diffusion models are closely related to score-based generative models.

The score:

s_θ(x, t) = ∇_x log p_t(x)

This is the gradient of the log density. Predicting the noise ε is equivalent to predicting the score:

ε_θ(x_t, t) = -√(1 - ᾱ_t) · s_θ(x_t, t)

Both approaches learn the same thing from different perspectives!
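
The relation between noise and score can be checked on the simplest case: a single fixed x_0, where q(x_t | x_0) is a 1-D Gaussian whose score is available analytically. The sketch below estimates that score by finite differences and converts it back to the noise. It illustrates the conditional case only, uses made-up scalar values, and introduces log_q as an illustrative helper.

```python
import math

def log_q(x_t, x_0, alpha_bar):
    """Log density of q(x_t | x_0) = N(sqrt(alpha_bar) * x_0, 1 - alpha_bar) in 1-D."""
    var = 1.0 - alpha_bar
    diff = x_t - math.sqrt(alpha_bar) * x_0
    return -0.5 * math.log(2.0 * math.pi * var) - diff * diff / (2.0 * var)

alpha_bar, x_0, eps = 0.4, 0.7, 1.3   # made-up scalar values
x_t = math.sqrt(alpha_bar) * x_0 + math.sqrt(1.0 - alpha_bar) * eps

# Score = d/dx_t log q(x_t | x_0), estimated with a central finite difference.
h = 1e-5
score = (log_q(x_t + h, x_0, alpha_bar) - log_q(x_t - h, x_0, alpha_bar)) / (2.0 * h)

# Apply eps = -sqrt(1 - alpha_bar) * score and compare with the noise that was used.
eps_from_score = -math.sqrt(1.0 - alpha_bar) * score
print(eps, eps_from_score)
```

The recovered value matches the noise that built x_t, which is the sense in which a noise predictor and a score estimator carry the same information.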

Stochastic Differential Equations (SDEs)

The diffusion process can be viewed as discretizing a continuous SDE.

Forward SDE:

dx = f(x, t)dt + g(t)dw

Where dw is Brownian motion. The reverse-time SDE is:

dx = [f(x, t) - g(t)² ∇_x log p_t(x)] dt + g(t)dw̄

This continuous perspective enables:

Why the Math Works

1. Small diffusion steps:

When β_t is small, the reverse process is approximately Gaussian, making it learnable.

2. Markov property:

Each step only depends on the previous one, simplifying the distribution.

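3. Convergence to Gaussian:

After enough steps, x_T is approximately N(0, I) for any data distribution: the signal coefficient √ᾱ_T shrinks toward 0 while the accumulated Gaussian noise has variance 1 - ᾱ_T ≈ 1. This follows from the closed-form forward process rather than from the central limit theorem.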

4. Tractable training:

The closed-form forward process enables efficient training without running the full chain.

9. ADVANTAGES OVER GANS

Training Stability

Diffusion models have fundamentally stable training:

Key Difference: Diffusion models optimize a single objective (predict noise), while GANs balance two competing objectives (generator vs discriminator).

Sample Quality and Diversity

Quality metrics:

Diversity:

Likelihood-Based Framework

Unlike GANs, diffusion models can compute likelihood:

Conditioning and Control

Diffusion models are easier to condition:

Text-to-image (like DALL-E 2, Stable Diffusion):

Inpainting and editing:

Comparison Table: Diffusion Models vs GANs

Aspect                 Diffusion Models                  GANs
---------------------  --------------------------------  --------------------------------
Training Stability     Very stable, no collapse          Unstable, mode collapse common
Sample Quality         State-of-the-art (FID ~2-5)       Excellent (FID ~5-20)
Sample Diversity       Excellent coverage                Can miss modes
Sampling Speed         Slow (1000 steps, ~10s)           Fast (1 step, ~0.01s)
Training Objective     Single loss (MSE)                 Minimax game (adversarial)
Likelihood             Can compute log p(x)              Cannot compute likelihood
Hyperparameter Tuning  Robust, less sensitive            Sensitive, careful tuning needed
Conditioning           Easy and flexible                 More difficult
Architecture           U-Net standard                    Various (DCGAN, StyleGAN)
Applications           Text-to-image, super-resolution   Face generation, style transfer

While diffusion models excel in most areas, GANs still have the advantage for real-time applications due to their much faster sampling speed.

10. PRACTICAL CONSIDERATIONS

Choosing Timesteps (T)

Trade-offs:

Rule of thumb: Start with T=1000, reduce with DDIM if speed matters.

Noise Schedule Selection

Linear schedule:

beta_start = 1e-4
beta_end = 0.02
betas = torch.linspace(beta_start, beta_end, T)

Cosine schedule (often better):

def cosine_beta_schedule(timesteps, s=0.008):
    steps = timesteps + 1
    x = torch.linspace(0, timesteps, steps)
    alphas_cumprod = torch.cos(((x / timesteps) + s) / (1 + s) * torch.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clip(betas, 0.0001, 0.9999)

Image Normalization

Scale images to [-1, 1] so that pixel values are centered at zero and have a range comparable to the unit-variance Gaussian noise:

# Normalize from [0, 1] to [-1, 1]
images = (images - 0.5) * 2

# Denormalize for visualization
images = (images + 1) / 2

EMA (Exponential Moving Average)

Using EMA of model weights significantly improves sample quality:

class EMA:
    def __init__(self, model, decay=0.9999):
        self.model = model
        self.decay = decay
        self.shadow = {}
        self.backup = {}

        for name, param in model.named_parameters():
            if param.requires_grad:
                self.shadow[name] = param.data.clone()

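    def update(self):
        # Call after every optimizer.step()
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                self.shadow[name].mul_(self.decay).add_(
                    param.data, alpha=1 - self.decay
                )

    def apply_shadow(self):
        # Swap in the EMA weights (e.g., before sampling); keep the raw weights in backup
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                self.backup[name] = param.data.clone()
                param.data.copy_(self.shadow[name])

    def restore(self):
        # Put the raw training weights back after sampling
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                param.data.copy_(self.backup[name])
        self.backup = {}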

Use the EMA weights for sampling, not the raw training weights.
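
To get a feel for what a decay of 0.9999 means, the sketch below applies the same update rule to a single scalar "weight" that has already settled at 1.0 and tracks how long the average takes to catch up. It is plain Python with illustrative names and is not part of the EMA class above.

```python
def ema_update(shadow, param, decay):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * param."""
    return decay * shadow + (1.0 - decay) * param

decay = 0.9999
shadow = 0.0   # EMA copy of a single scalar weight, starting far from the target
target = 1.0   # pretend the raw weight has settled at this value

history = {}
for step in range(1, 50001):
    shadow = ema_update(shadow, target, decay)
    if step in (1000, 10000, 50000):
        history[step] = shadow

print(history)
```

After n updates the average has covered a fraction 1 - decay^n of the gap, so with decay = 0.9999 it needs on the order of 10,000 steps to catch up. This is why EMA samples taken very early in training can look worse than samples from the raw weights.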

Memory Optimization

For high-resolution images:

Monitoring Training

Key metrics to track:

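# Generate samples for monitoring
if epoch % 10 == 0:
    model.eval()
    samples = generate_samples(model, timesteps=1000, num_samples=16)
    save_image(samples, f'samples/epoch_{epoch}.png')  # from torchvision.utils
    model.train()  # switch back before the next training epoch

    # Compute FID. compute_fid, real_features, and generated_features are
    # placeholders for an FID implementation of your choice; they are not defined here.
    fid = compute_fid(real_features, generated_features)
    print(f'Epoch {epoch}, FID: {fid:.2f}')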

Common Issues and Solutions

Problem           Symptom                           Solution
----------------  --------------------------------  ---------------------------------------------------
Blurry samples    Generated images lack detail      Train longer, use cosine schedule, add attention
NaN losses        Loss becomes NaN during training  Reduce LR, check normalization, clip gradients
Slow convergence  Loss decreases very slowly        Increase LR, increase batch size, use EMA
OOM errors        Out of memory                     Reduce batch size, use FP16, gradient checkpointing
Color shift       Generated images wrong colors     Check normalization, verify data preprocessing

11. ADVANCED TOPICS & VARIANTS

DDIM: Denoising Diffusion Implicit Models

DDIM accelerates sampling by using a deterministic process:

x_{t-1} = √ᾱ_{t-1} · predicted_x_0 + √(1 - ᾱ_{t-1}) · predicted_ε
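
Here predicted_x_0 is the clean-image estimate implied by the noise prediction: predicted_x_0 = (x_t - √(1 - ᾱ_t) · predicted_ε) / √ᾱ_t. The sketch below applies one deterministic DDIM update (η = 0) to a scalar, with the true noise standing in for a perfect network. The values and the function name ddim_step are illustrative.

```python
import math

def ddim_step(x_t, eps_hat, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) from noise level t to an earlier level."""
    # Estimate of the clean sample implied by the noise prediction.
    x0_hat = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_hat) / math.sqrt(alpha_bar_t)
    # Re-noise that estimate to the earlier level, reusing the same predicted noise.
    return math.sqrt(alpha_bar_prev) * x0_hat + math.sqrt(1.0 - alpha_bar_prev) * eps_hat

# Made-up scalar example; eps_hat is set to the true noise to mimic a perfect network.
x_0, eps = 0.8, -0.5
alpha_bar_t, alpha_bar_prev = 0.05, 0.50   # a large jump that skips many DDPM steps
x_t = math.sqrt(alpha_bar_t) * x_0 + math.sqrt(1.0 - alpha_bar_t) * eps

x_prev = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev)
expected = math.sqrt(alpha_bar_prev) * x_0 + math.sqrt(1.0 - alpha_bar_prev) * eps
print(x_prev, expected)
```

Because the update only needs ᾱ at the two endpoints, the jump can span any number of original timesteps. With a perfect noise prediction it lands exactly on the less noisy version of the same (x_0, ε) pair.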

Latent Diffusion Models (Stable Diffusion)

Run diffusion in a compressed latent space instead of pixel space:

  1. Encode: VAE encoder: image → latent (8× smaller along each spatial side in Stable Diffusion, e.g., 512×512 → 64×64)
  2. Diffuse: Run diffusion in latent space
  3. Decode: VAE decoder: latent → image

Advantages:

Classifier-Free Guidance

Strengthen conditioning (e.g., text prompts) without a classifier:

ε̂ = ε̂_uncond + w · (ε̂_cond - ε̂_uncond)

How it works:

Cascaded Diffusion

Generate images at progressively higher resolutions:

  1. Base model: Generate 64×64 image
  2. Super-resolution 1: Upsample to 256×256
  3. Super-resolution 2: Upsample to 1024×1024

This approach is used in DALL-E 2 and Imagen for high-resolution generation.

Applications

1. Text-to-Image:

2. Super-Resolution:

3. Inpainting:

4. Image-to-Image Translation:

5. Video Generation:

6. 3D Generation:

Future Directions

END OF LESSON

You have completed the comprehensive guide to Diffusion Models.

Topics Covered:
Introduction • Diffusion Process • Forward Process • Reverse Process •
Training • Sampling • U-Net Architecture • Mathematics •
Advantages over GANs • Practical Tips • Advanced Variants

Next Steps:
Implement a simple diffusion model in PyTorch •
Compare with GANs on the same dataset •
Explore conditional generation with text embeddings •
Review the Anki flashcards to reinforce key concepts
