Diffusion Models - Complete Guide

1. INTRODUCTION TO DIFFUSION MODELS

What are Diffusion Models?

Diffusion Models are a class of generative models that learn to generate data by reversing a gradual noising process. They have emerged as state-of-the-art for high-quality image generation, powering systems like DALL-E 2, Stable Diffusion, and Imagen.

            Core Idea: If we can learn to reverse the process of adding noise to data, we can generate new data by starting from pure noise and progressively denoising it.
        

Historical Context

Diffusion models have roots in non-equilibrium thermodynamics and were formalized for deep learning by Sohl-Dickstein et al. (2015). The breakthrough came with Denoising Diffusion Probabilistic Models (DDPM) by Ho et al. (2020), which simplified training and achieved remarkable results.

Key milestones:

2015: Deep Unsupervised Learning using Nonequilibrium Thermodynamics
2020: DDPM - Practical training procedure, high-quality samples
2021: Improved DDPM - Faster sampling, better quality
2022: DALL-E 2, Stable Diffusion - Mainstream applications
2023+: Video diffusion, 3D generation, text-to-anything

Why Diffusion Models Matter

Advantages over GANs:

Training stability: No adversarial dynamics, no mode collapse
Sample quality: State-of-the-art FID scores
Likelihood-based: Trained with a variational bound on the log-likelihood, so likelihoods can be bounded (GANs provide none)
Flexibility: Easy to condition on text, images, etc.
Scalability: Straightforward to scale to high resolutions

Trade-offs:

Sampling speed: Requires many denoising steps (100-1000)
Computational cost: Slower inference than GANs
Memory: Training large U-Nets on high-resolution images is memory-intensive (the noise schedule itself is only T scalars)

┌─────────────────────────────────────────────────────────────────┐
│                    DIFFUSION MODEL OVERVIEW                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  FORWARD PROCESS (Fixed, Adds Noise)                            │
│  ═══════════════════════════════════                            │
│                                                                 │
│  x₀ ────> x₁ ────> x₂ ────> ... ────> x_T                       │
│  │        │        │                  │                         │
│  Clean    Slightly More               Pure Noise                │
│  Image    Noisy    Noisy              (Gaussian)                │
│                                                                 │
│  ─────────────────────────────────────────────────────────      │
│                                                                 │
│  REVERSE PROCESS (Learned, Removes Noise)                       │
│  ════════════════════════════════════════                       │
│                                                                 │
│  x₀ <──── x₁ <──── x₂ <──── ... <──── x_T                       │
│  │        │        │                  │                         │
│  Clean    Denoise  Denoise            Start Here                │
│  Image    Step     Step               (Random)                  │
│           ↑        ↑                                            │
│           └────────┴──── Neural Network Predicts                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

The Two-Phase Process

Phase 1: Forward Diffusion (Fixed, No Learning)

Gradually add Gaussian noise to data over T timesteps until it becomes pure noise. This process is:

Predetermined by a noise schedule
Mathematically simple (closed-form)
No neural networks involved
Converts any data to standard Gaussian noise

Phase 2: Reverse Denoising (Learned)

Train a neural network to reverse the forward process, denoising step-by-step:

Neural network learns to predict noise at each step
Start from random noise x_T ~ N(0, I)
Iteratively denoise: x_T → x_{T-1} → ... → x_1 → x_0
Final output x_0 is a generated sample

The key insight is that adding noise is easy (forward process), and once we learn to undo it (reverse process), we have a powerful generative model.

2. THE DIFFUSION PROCESS INTUITION

The Ink Drop Analogy

Imagine dropping ink into a glass of water:

t=0: Clear water with a drop of concentrated ink (clean image)
t=1 to t=T-1: Ink gradually diffuses, spreading throughout the water (adding noise)
t=T: Ink completely dispersed, water looks uniformly murky (pure noise)

Now imagine you could reverse this process: starting with murky water and somehow un-diffusing it back to a clear drop of ink. This is exactly what diffusion models learn to do!

FORWARD DIFFUSION (Ink dispersing in water)

 t=0          t=20         t=50         t=100
 ┌────┐       ┌────┐       ┌────┐       ┌────┐
 │ ●  │       │░●░ │       │░░░░│       │░░░░│
 │    │  -->  │░░░ │  -->  │░░░░│  -->  │░░░░│
 │    │       │ ░  │       │░░░░│       │░░░░│
 └────┘       └────┘       └────┘       └────┘
 Clear        Slightly     Very Noisy   Pure Noise
 Drop         Diffused                  (Uniform)

REVERSE DENOISING (Learning to un-diffuse)

 t=100        t=50         t=20         t=0
 ┌────┐       ┌────┐       ┌────┐       ┌────┐
 │░░░░│       │░░░░│       │░●░ │       │ ●  │
 │░░░░│  <--  │░░░░│  <--  │░░░ │  <--  │    │
 │░░░░│       │░░░░│       │ ░  │       │    │
 └────┘       └────┘       └────┘       └────┘

Why This Works for Generation

Once we've learned to denoise (reverse the diffusion), we can generate new samples:

Sample random noise from N(0, I) - like murky water
Apply the learned denoising process step by step
Gradually reveal a clean image - like un-diffusing ink
The final result is a new sample from the learned distribution

            Key Insight: We're not directly learning to generate images. We're learning to denoise, which indirectly gives us generation capability.
        

Comparison to Other Generative Models

Model	Approach	Generation Process
GAN	Adversarial	One-shot: noise → image (single forward pass)
VAE	Latent encoding	One-shot: latent code → image (decode)
Diffusion	Iterative denoising	Multi-step: noise → ... → image (T steps)
Autoregressive	Sequential prediction	Pixel-by-pixel: predict each pixel given previous

The Markov Chain Perspective

Diffusion models can be viewed as Markov chains:

Forward chain:

Each step x_t only depends on x_{t-1} (Markov property)
q(x_t | x_{t-1}) is a simple Gaussian transition
The chain gradually destroys information

Reverse chain:

Each step x_{t-1} depends on x_t (going backward)
p(x_{t-1} | x_t) is approximated by a neural network
The chain gradually creates information

Forward: x_0 → x_1 → x_2 → ... → x_T (destroy structure)
Reverse: x_T → x_{T-1} → x_{T-2} → ... → x_0 (create structure)

3. FORWARD DIFFUSION MATHEMATICS

Single-Step Forward Process

At each timestep t, we add Gaussian noise to the previous state:

q(x_t | x_{t-1}) = N(x_t; √(1 - β_t) · x_{t-1}, β_t · I)

Components:

x_t: Noisy image at timestep t
x_{t-1}: Image from previous timestep
β_t: Noise schedule parameter (controls how much noise to add)
√(1 - β_t): Scaling factor to preserve variance
I: Identity matrix (isotropic noise)

Sampling equation:

x_t = √(1 - β_t) · x_{t-1} + √β_t · ε, where ε ~ N(0, I)
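
As a minimal sketch of the sampling equation above, the following plain-Python function applies one forward step to a flattened image. The name `forward_step`, the list-of-floats representation, and the example values are assumptions made for illustration. The noise is passed in explicitly so the result is reproducible.

```python
import math

def forward_step(x_prev, beta_t, eps):
    """One step of q(x_t | x_{t-1}): scale the signal down, then mix noise in."""
    keep = math.sqrt(1.0 - beta_t)   # signal coefficient
    mix = math.sqrt(beta_t)          # noise coefficient
    return [keep * x + mix * e for x, e in zip(x_prev, eps)]

# Hypothetical 3-pixel "image" and a fixed noise draw
x_prev = [1.0, -0.5, 0.25]
eps = [0.3, -1.2, 0.8]
x_t = forward_step(x_prev, beta_t=0.02, eps=eps)

# The squared coefficients sum to 1, which is what keeps the variance stable
coeff_sum = (1.0 - 0.02) + 0.02
```

Because the two squared coefficients always sum to 1, a unit-variance input stays unit-variance after the step.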

The Noise Schedule β_t

The noise schedule controls how quickly we add noise. It's a sequence of values from β_1 to β_T.

Common schedules:

Linear: β_t increases linearly from β_1 = 1e-4 to β_T = 0.02
Cosine: Smoother increase, better preserves structure
Quadratic: Faster increase in noise

The choice of noise schedule significantly affects training stability and sample quality. The linear schedule is the one used in the original DDPM paper.

Multi-Step Forward Process (Closed Form)

A powerful property: we can jump directly from x_0 to x_t in one step, without computing intermediate steps!

Define cumulative products:

α_t = 1 - β_t
ᾱ_t = ∏_{s=1}^t α_s (product of all alphas up to t)

Closed-form sampling:

q(x_t | x_0) = N(x_t; √ᾱ_t · x_0, (1 - ᾱ_t) · I)

Reparameterization for sampling:

x_t = √ᾱ_t · x_0 + √(1 - ᾱ_t) · ε, where ε ~ N(0, I)

            Key Advantage: This closed form allows us to train on any timestep t without running the full forward process. We can directly sample noisy versions of x_0 at any noise level.
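
To see why the closed form is valid, one can track the signal coefficient and the noise variance through the single-step recursion and compare them with √ᾱ_t and 1 - ᾱ_t. The sketch below does this in plain Python for scalars. The helper names are assumptions for illustration, and the schedule is the linear one described earlier.

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def iterate_coefficients(betas):
    """Track x_t = m * x_0 + sqrt(v) * eps through the single-step recursion."""
    m, v = 1.0, 0.0
    for beta in betas:
        m = math.sqrt(1.0 - beta) * m   # signal shrinks by sqrt(alpha_t)
        v = (1.0 - beta) * v + beta     # old noise shrinks, new noise is added
    return m, v

def closed_form_coefficients(betas):
    alpha_bar = 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
    return math.sqrt(alpha_bar), 1.0 - alpha_bar

betas = linear_betas(1000)
t = 300
m_iter, v_iter = iterate_coefficients(betas[:t])
m_closed, v_closed = closed_form_coefficients(betas[:t])
```

Both routes give the same pair of coefficients, so training can sample x_t directly instead of running t separate steps.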
        

Properties of the Forward Process

1. Variance preservation:

If the data is scaled so that Var(x_0) ≈ 1, the variance of x_t stays approximately 1 at every timestep:

Var(x_t) = ᾱ_t · Var(x_0) + (1 - ᾱ_t) ≈ 1

2. Convergence to pure noise:

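As t → T, ᾱ_t becomes very small (about 4 × 10⁻⁵ for the linear schedule with T = 1000), so the signal term √ᾱ_T · x_0 nearly vanishes:

x_T = √ᾱ_T · x_0 + √(1 - ᾱ_T) · ε ≈ ε, where ε ~ N(0, I)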

This means x_T is approximately pure Gaussian noise, regardless of the original x_0.

3. Information destruction:

The signal-to-noise ratio decreases monotonically:

SNR(t) = ᾱ_t / (1 - ᾱ_t)

As t increases, SNR decreases, and the image loses structure.
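
The three properties above can be checked numerically. This is a small sketch in plain Python, assuming the linear schedule with T = 1000 and unit-variance data. The helper names are illustrative and not part of any library.

```python
def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def alpha_bars(betas):
    """Cumulative products of alpha_t = 1 - beta_t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def snr(alpha_bar):
    return alpha_bar / (1.0 - alpha_bar)

betas = linear_betas(1000)
abar = alpha_bars(betas)

var_xt = [a * 1.0 + (1.0 - a) for a in abar]   # property 1, assuming Var(x_0) = 1
abar_final = abar[-1]                          # property 2: close to zero
snrs = [snr(a) for a in abar]                  # property 3: strictly decreasing
```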

PyTorch Implementation of Forward Process

import torch

# Noise schedule (linear)
def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
    return torch.linspace(beta_start, beta_end, timesteps)

# Precompute alpha values
betas = linear_beta_schedule(timesteps=1000)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)
alphas_cumprod_prev = torch.cat([torch.tensor([1.0]), alphas_cumprod[:-1]])

# Forward diffusion (closed form)
def q_sample(x_0, t, noise=None):
    """
    Sample from q(x_t | x_0) - add noise to x_0 directly

    Args:
        x_0: Clean image (batch_size, channels, height, width)
        t: Timestep (batch_size,)
        noise: Optional pre-sampled noise

    Returns:
        Noisy image x_t
    """
    if noise is None:
        noise = torch.randn_like(x_0)

    # Get sqrt(alpha_cumprod) for each sample in batch
    sqrt_alphas_cumprod_t = extract(torch.sqrt(alphas_cumprod), t, x_0.shape)
    sqrt_one_minus_alphas_cumprod_t = extract(
        torch.sqrt(1.0 - alphas_cumprod), t, x_0.shape
    )

    # Apply noise: x_t = sqrt(alpha_cumprod_t) * x_0 + sqrt(1 - alpha_cumprod_t) * noise
    return sqrt_alphas_cumprod_t * x_0 + sqrt_one_minus_alphas_cumprod_t * noise

def extract(a, t, x_shape):
    """
    Extract values from 'a' at indices 't' and reshape for broadcasting
    """
    batch_size = t.shape[0]
    # gather() needs both tensors on the same device, so move the schedule to t's device
    out = a.to(t.device).gather(-1, t)
    return out.reshape(batch_size, *((1,) * (len(x_shape) - 1)))

4. REVERSE DENOISING PROCESS

The Goal: Learn p(x_{t-1} | x_t)

The reverse process learns to go backward through the diffusion chain, removing noise step by step.

If we knew the true data distribution:

p(x_{t-1} | x_t) = ?

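This is intractable to compute exactly, because it depends on the entire data distribution. However, if β_t is small enough, the true reverse transition is itself approximately Gaussian.

            Key Result: When β_t is small, the reverse transition p(x_{t-1} | x_t) is approximately Gaussian, so a Gaussian with a learned mean is a good model for it. Keeping the variance fixed is a DDPM design choice rather than part of this result.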
        

Parameterizing the Reverse Process

We model the reverse process as a Gaussian:

p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))

Simplification (DDPM):

Keep variance fixed, only learn the mean:

p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), σ_t² · I)

Where σ_t² is chosen to be either β_t or β̃_t (posterior variance).

What Should the Neural Network Predict?

There are multiple equivalent parameterizations:

Option 1: Predict the mean μ_θ(x_t, t) directly

Option 2: Predict the noise ε_θ(x_t, t)

Option 3: Predict the clean image x̂_0

DDPM found that predicting the noise (Option 2) works best empirically.

            Standard Choice: Train a neural network ε_θ(x_t, t) to predict the noise ε that was mixed into x_0 to produce x_t.
        

Noise Prediction Network

Given noisy image x_t and timestep t, predict the noise ε:

ε_θ(x_t, t) ≈ ε

Where ε is the actual noise that was added to create x_t from x_0:

x_t = √ᾱ_t · x_0 + √(1 - ᾱ_t) · ε

Once we have predicted noise, we can compute the mean:

μ_θ(x_t, t) = (1/√α_t) · (x_t - (β_t/√(1 - ᾱ_t)) · ε_θ(x_t, t))
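
The three parameterizations are interchangeable because each can be computed from the others. The scalar sketch below illustrates this under stated assumptions: the function names and example numbers are hypothetical, and the true noise stands in for the network output. It recovers x_0 from ε, then checks that the mean formula above agrees with the mean of the posterior q(x_{t-1} | x_t, x_0).

```python
import math

def predict_x0_from_eps(x_t, eps, abar_t):
    """Invert x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    return (x_t - math.sqrt(1.0 - abar_t) * eps) / math.sqrt(abar_t)

def mean_from_eps(x_t, eps, beta_t, abar_t):
    """Reverse-step mean written in terms of the noise."""
    alpha_t = 1.0 - beta_t
    return (x_t - beta_t / math.sqrt(1.0 - abar_t) * eps) / math.sqrt(alpha_t)

def posterior_mean_from_x0(x_t, x0, beta_t, abar_t, abar_prev):
    """Mean of q(x_{t-1} | x_t, x_0), written in terms of the clean sample."""
    alpha_t = 1.0 - beta_t
    coef_x0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    return coef_x0 * x0 + coef_xt * x_t

# Hypothetical scalar example
beta_t, abar_prev = 0.01, 0.6
abar_t = abar_prev * (1.0 - beta_t)
x0, eps = 0.7, -0.4
x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps

x0_hat = predict_x0_from_eps(x_t, eps, abar_t)
mu_eps = mean_from_eps(x_t, eps, beta_t, abar_t)
mu_x0 = posterior_mean_from_x0(x_t, x0, beta_t, abar_t, abar_prev)
```

With a trained network, `eps` would be replaced by ε_θ(x_t, t), and the same conversions apply to its prediction.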

Denoising Step Formula

To sample x_{t-1} given x_t, we:

Predict the noise using the network: ε̂ = ε_θ(x_t, t)
Compute the mean using the formula above
Sample with added randomness (except at t=1):

x_{t-1} = μ_θ(x_t, t) + σ_t · z, where z ~ N(0, I) if t > 1, else z = 0

At the final step (t=1), we don't add noise - we just use the predicted mean to get x_0.

Posterior Variance

The posterior q(x_{t-1} | x_t, x_0) has variance:

β̃_t = ((1 - ᾱ_{t-1}) / (1 - ᾱ_t)) · β_t

Two choices for σ_t²:

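σ_t² = β_t: Upper bound on the reverse-process variance (optimal when x_0 ~ N(0, I)), more random sampling
σ_t² = β̃_t: Lower bound (optimal when x_0 is a single fixed point), less random sampling

Ho et al. (2020) report similar sample quality for both choices. The sampling code in this guide uses β̃_t.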
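
To get a feel for how the two variance choices differ, the sketch below computes β̃_t for the linear schedule in plain Python. The helper names are assumptions for illustration, and ᾱ_0 = 1 is used as the convention for the first step.

```python
def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def posterior_variances(betas):
    """beta_tilde_t = (1 - abar_{t-1}) / (1 - abar_t) * beta_t, with abar_0 = 1."""
    out = []
    abar_prev = 1.0
    for beta in betas:
        abar = abar_prev * (1.0 - beta)
        out.append((1.0 - abar_prev) / (1.0 - abar) * beta)
        abar_prev = abar
    return out

betas = linear_betas(1000)
beta_tilde = posterior_variances(betas)
ratio_last = beta_tilde[-1] / betas[-1]   # how close the two choices are at t = T
```

The posterior variance is exactly zero at the first timestep and always below β_t. The two choices nearly coincide at large t, so they differ mainly in the last few denoising steps.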

Intuition: Separating Signal and Noise

Think of x_t as a mixture of signal and noise:

x_t = signal_t + noise_t

The network learns to identify and remove the noise component, leaving us with a slightly cleaner signal. Repeating this T times progressively reveals the original image.

REVERSE DENOISING VISUALIZATION

t=1000 (Pure Noise)              t=500 (Vague Shapes)
┌──────────────┐                 ┌──────────────┐
│ ▓▓▒▒░░▓▓▒▒░░ │                 │ ░░   ▓▓   ░░ │
│ ▒▒░░▓▓▒▒░░▓▓ │     Denoise     │    ░░▓▓░░    │
│ ░░▓▓▒▒░░▓▓▒▒ │     ──────>     │ ░░  ░░  ▓▓   │
│ ▓▓▒▒░░▓▓▒▒░░ │                 │   ▓▓    ░░░░ │
└──────────────┘                 └──────────────┘

t=100 (Clear Structure)          t=0 (Clean Image)
┌──────────────┐                 ┌──────────────┐
│              │                 │              │
│    ▓▓▓▓▓▓    │     Denoise     │   ████████   │
│    ▓▓  ▓▓    │     ──────>     │   ██    ██   │
│    ▓▓▓▓▓▓    │                 │   ████████   │
└──────────────┘                 └──────────────┘

5. TRAINING ALGORITHM

Training Objective

The goal is to train the neural network ε_θ(x_t, t) to predict the noise that was added at each timestep.

Simplified training loss (DDPM):

L_simple = E_{t, x_0, ε} [ ||ε - ε_θ(x_t, t)||² ]

Where:

t: Random timestep sampled uniformly from {1, ..., T}
x_0: Clean image from training data
ε: Random noise sampled from N(0, I)
x_t: Noisy image created by adding ε to x_0
ε_θ(x_t, t): Network's prediction of the noise

            Training is simple: Just predict what noise was added! This is much simpler than GANs (no adversarial training) or VAEs (no KL divergence).
        

Training Algorithm Pseudocode

            DDPM Training Loop:

            REPEAT until converged:

              1. Sample clean image x_0 from training data

              2. Sample timestep t ~ Uniform({1, ..., T})

              3. Sample noise ε ~ N(0, I)

              4. Create noisy image: x_t = √ᾱ_t · x_0 + √(1 - ᾱ_t) · ε

              5. Predict noise: ε̂ = ε_θ(x_t, t)

              6. Compute loss: L = ||ε - ε̂||²

              7. Update θ using gradient descent: θ ← θ - η · ∇_θ L

PyTorch Training Implementation

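import torch
import torch.nn.functional as F

# Simplified DDPM training loop
# Relies on q_sample() from the forward-process code and generate_samples() from the sampling code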
def train_ddpm(model, dataloader, optimizer, timesteps=1000, epochs=100):
    """
    Train a diffusion model to predict noise

    Args:
        model: Neural network (U-Net) that predicts noise
        dataloader: Training data loader
        optimizer: Optimizer (e.g., Adam)
        timesteps: Number of diffusion steps (T)
        epochs: Training epochs
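    """
    # Run on whatever device the model's parameters live on
    device = next(model.parameters()).device
    model.train()

    for epoch in range(epochs):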
        for batch_idx, (x_0, _) in enumerate(dataloader):
            x_0 = x_0.to(device)
            batch_size = x_0.size(0)

            # Sample random timesteps (0-indexed: index i stands for t = i + 1 in the math)
            t = torch.randint(0, timesteps, (batch_size,), device=device).long()

            # Sample noise
            noise = torch.randn_like(x_0)

            # Create noisy images using closed-form forward process
            x_t = q_sample(x_0, t, noise=noise)

            # Predict noise
            noise_pred = model(x_t, t)

            # Compute MSE loss
            loss = F.mse_loss(noise_pred, noise)

            # Backprop and update
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if batch_idx % 100 == 0:
                print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}')

        # Optionally generate samples for monitoring
        if epoch % 10 == 0:
            generate_samples(model, timesteps, num_samples=16)
            model.train()  # generate_samples() switches the model to eval mode

Why This Training Works

1. Direct supervision:

Unlike GANs, we have a clear ground truth (the actual noise ε). The network learns a supervised denoising task.

2. All timesteps contribute:

By randomly sampling t, the network learns to denoise at all noise levels. This makes training stable and comprehensive.

3. Variance reduction:

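The loss is an expectation over several random variables (t, x_0, ε). Each minibatch gives an unbiased Monte Carlo estimate of it, and in practice the resulting gradients are stable enough for standard optimizers such as Adam.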

4. Scalability:

The loss scales well - we can train on high-resolution images by sampling small patches or using progressive training.

Training Hyperparameters

Parameter	Typical Value	Purpose
T (timesteps)	1000	Number of diffusion steps
Learning rate	1e-4 to 2e-4	Adam optimizer LR
Batch size	32-128	Depends on image size and GPU memory
β_1	1e-4	Starting noise schedule value
β_T	0.02	Final noise schedule value
EMA decay	0.9999	Exponential moving average of weights

Many implementations use Exponential Moving Average (EMA) of model weights for sampling, which improves sample quality significantly.

Loss Variations

Simple loss (DDPM):

L_simple = ||ε - ε_θ(x_t, t)||²

Weighted loss:

L_weighted = E[w_t · ||ε - ε_θ(x_t, t)||²]

Where w_t can emphasize certain timesteps.

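Hybrid loss (Improved DDPM):

L_hybrid = L_simple + λ · L_vlb

Improved DDPM (Nichol & Dhariwal, 2021) adds the variational bound L_vlb with a small weight (λ = 0.001) so that the network can also learn the variances Σ_θ, while L_simple continues to drive the mean.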

6. SAMPLING (INFERENCE) ALGORITHM

Generation Process

Once trained, we can generate new images by sampling from the reverse process.

            Sampling Algorithm (DDPM):

            1. Sample x_T ~ N(0, I) (pure random noise)

            2. FOR t = T, T-1, ..., 1:

               a. Predict noise: ε̂ = ε_θ(x_t, t)

               b. Compute mean: μ = (1/√α_t) · (x_t - (β_t/√(1 - ᾱ_t)) · ε̂)

               c. Sample z ~ N(0, I) if t > 1, else z = 0

               d. Compute x_{t-1} = μ + σ_t · z

            3. RETURN x_0 (generated image)

PyTorch Sampling Implementation

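# Sampling (generation) algorithm
# Reuses betas, alphas, alphas_cumprod, alphas_cumprod_prev and extract() from the forward-process code

# Posterior variance: β̃_t = β_t · (1 - ᾱ_{t-1}) / (1 - ᾱ_t)
posterior_variance = betas * (1.0 - alphas_cumprod_prev) / (1.0 - alphas_cumprod)

@torch.no_grad()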
def p_sample(model, x_t, t, t_index):
    """
    Single reverse diffusion step: x_t -> x_{t-1}

    Args:
        model: Trained noise prediction network
        x_t: Current noisy image
        t: Current timestep (tensor)
        t_index: Timestep as integer for indexing

    Returns:
        x_{t-1}: Less noisy image
    """
    # Precomputed values
    betas_t = extract(betas, t, x_t.shape)
    sqrt_one_minus_alphas_cumprod_t = extract(
        torch.sqrt(1. - alphas_cumprod), t, x_t.shape
    )
    sqrt_recip_alphas_t = extract(torch.sqrt(1.0 / alphas), t, x_t.shape)

    # Predict noise
    predicted_noise = model(x_t, t)

    # Compute mean
    model_mean = sqrt_recip_alphas_t * (
        x_t - betas_t * predicted_noise / sqrt_one_minus_alphas_cumprod_t
    )

    if t_index == 0:
        # No noise at the final step (index 0 corresponds to t = 1 in the math)
        return model_mean
    else:
        # Add noise
        posterior_variance_t = extract(posterior_variance, t, x_t.shape)
        noise = torch.randn_like(x_t)
        return model_mean + torch.sqrt(posterior_variance_t) * noise

@torch.no_grad()
def p_sample_loop(model, shape, timesteps=1000):
    """
    Full sampling loop: x_T -> x_0

    Args:
        model: Trained diffusion model
        shape: Shape of images to generate (batch, channels, height, width)
        timesteps: Number of diffusion steps

    Returns:
        Generated images
    """
    device = next(model.parameters()).device

    # Start from pure noise
    x = torch.randn(shape, device=device)

    # Iteratively denoise
    for i in reversed(range(0, timesteps)):
        t = torch.full((shape[0],), i, device=device, dtype=torch.long)
        x = p_sample(model, x, t, i)

    return x

# Generate samples
@torch.no_grad()
def generate_samples(model, timesteps, num_samples=16, channels=3,
                    img_size=32):
    """
    Generate new samples from trained model
    """
    model.eval()
    shape = (num_samples, channels, img_size, img_size)
    samples = p_sample_loop(model, shape, timesteps)

    # Denormalize from [-1, 1] to [0, 1]
    samples = (samples + 1) / 2
    samples = torch.clamp(samples, 0, 1)

    return samples
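
The sampling loop can be sanity-checked without training a network when the data distribution is simple enough that the optimal noise prediction has a closed form. The sketch below is a toy under stated assumptions, not the implementation above: the data is a 1-D Gaussian N(2, 0.5²), `exact_eps` returns E[ε | x_t] and stands in for the trained model, and everything is plain Python. It then runs the same ancestral sampling steps, with posterior variance and no noise on the last step.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

T = 1000
betas = linear_betas(T)
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

DATA_MEAN, DATA_STD = 2.0, 0.5   # assumed toy data distribution

def exact_eps(x_t, i):
    """Optimal noise prediction E[eps | x_t] for Gaussian data (replaces the network)."""
    abar = alpha_bars[i]
    marginal_var = abar * DATA_STD ** 2 + (1.0 - abar)
    return math.sqrt(1.0 - abar) * (x_t - math.sqrt(abar) * DATA_MEAN) / marginal_var

def sample_one(rng):
    x = rng.gauss(0.0, 1.0)                       # start from x_T ~ N(0, 1)
    for i in reversed(range(T)):                  # i = T-1, ..., 0
        beta, abar = betas[i], alpha_bars[i]
        mean = (x - beta / math.sqrt(1.0 - abar) * exact_eps(x, i)) / math.sqrt(1.0 - beta)
        if i == 0:
            return mean                           # no noise on the final step
        abar_prev = alpha_bars[i - 1]
        sigma = math.sqrt((1.0 - abar_prev) / (1.0 - abar) * beta)  # sqrt of posterior variance
        x = mean + sigma * rng.gauss(0.0, 1.0)

rng = random.Random(0)
samples = [sample_one(rng) for _ in range(500)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
```

If the sampler is wired correctly, the generated values should have roughly the mean and standard deviation of the toy data, even though every sample started as pure noise.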

Sampling Time Complexity

The main drawback of diffusion models is slow sampling:

T forward passes: Typically T = 1000 steps
Sequential: Must run steps in order, can't parallelize
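Time cost: Roughly T× the cost of a GAN whose generator is about as expensive as one denoising pass (~1000× for T = 1000)

Illustrative timing (single image, order of magnitude only):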

GAN: ~0.01 seconds (1 forward pass)
DDPM: ~10 seconds (1000 forward passes)

This has motivated research into faster sampling methods like DDIM, which can reduce steps to 50-100 with minimal quality loss.

Accelerated Sampling: DDIM

Denoising Diffusion Implicit Models (DDIM) enable faster sampling by skipping timesteps.

Key idea:

Instead of using all T steps, use only a subset (e.g., every 20th step)
Modify the sampling equation to be deterministic (remove noise)
Can reduce from 1000 steps to 50 steps with minimal quality loss

DDPM: 1000 steps, stochastic (adds noise at each step)
DDIM: 50-100 steps, deterministic (no added noise), ~20× faster
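
Choosing the subset of timesteps is the simplest part of accelerated sampling. The sketch below shows one way to do it, with evenly spaced indices. `strided_timesteps` is a hypothetical helper, and other spacings are possible.

```python
def strided_timesteps(T, num_steps):
    """Evenly spaced timestep indices from T-1 downward (hypothetical helper)."""
    stride = T // num_steps
    return list(range(T - 1, -1, -stride))

steps = strided_timesteps(T=1000, num_steps=50)

# Pair each timestep with the one it jumps to; None marks the final jump to the clean sample
pairs = list(zip(steps, steps[1:] + [None]))

speedup = 1000 / len(steps)
```

Each pair (t, t_prev) then drives one sampler update, so the network is evaluated 50 times instead of 1000.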

Conditional Generation

Diffusion models can be easily conditioned on additional information (text, class labels, images):

Classifier-free guidance:

ε̂_guided = ε̂_uncond + w · (ε̂_cond - ε̂_uncond)

Where:

ε̂_cond: Noise predicted with conditioning (e.g., text prompt)
ε̂_uncond: Noise predicted without conditioning
w: Guidance weight (w = 0 ignores the condition, w = 1 recovers the plain conditional prediction, w > 1 follows the condition more closely)

This technique is used in DALL-E 2 and Stable Diffusion for text-to-image generation.
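
The guidance formula is a per-element linear extrapolation, which the following plain-Python sketch makes explicit. The function name and the three-element noise vectors are illustrative assumptions. In a real model these would be the two network outputs for the same x_t.

```python
def guided_noise(eps_uncond, eps_cond, w):
    """Classifier-free guidance: move from the unconditional prediction toward the conditional one."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Hypothetical predictions for one noisy input
eps_uncond = [0.1, -0.2, 0.4]
eps_cond = [0.3, -0.1, 0.0]

no_guidance = guided_noise(eps_uncond, eps_cond, w=0.0)   # equals eps_uncond
conditional = guided_noise(eps_uncond, eps_cond, w=1.0)   # equals eps_cond
amplified = guided_noise(eps_uncond, eps_cond, w=3.0)     # extrapolates past eps_cond
```

Values of w above 1 push the prediction further in the direction the condition suggests, which typically trades sample diversity for closer adherence to the prompt.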

7. ARCHITECTURE: U-NET WITH TIME EMBEDDING

Why U-Net?

The standard architecture for diffusion models is a U-Net with time embeddings. This architecture:

Preserves spatial resolution: Skip connections maintain fine details
Multi-scale processing: Downsampling and upsampling capture different levels of structure
Time-aware: Embeddings tell the network which noise level to denoise
Proven effective: Originally for image segmentation, works great for denoising

U-NET ARCHITECTURE FOR DIFFUSION

Input: Noisy Image x_t + Time Embedding t

┌─────────────────────────────────────────────────────┐
│ ENCODER (Downsample)                                │
├─────────────────────────────────────────────────────┤
│  x_t (64×64) ──> Conv+GN+SiLU ──> 64×64, C=64 ───┐  │
│        ↓                                         │  │
│  Downsample ──> 32×32, C=128 ─────────────────┐  │  │
│        ↓                                      │  │  │
│  Downsample ──> 16×16, C=256 ──────────────┐  │  │  │
│        ↓                                   │  │  │  │
├────────────────────────────────────────────┼──┼──┼──┤
│ BOTTLENECK                                 │  │  │  │
├────────────────────────────────────────────┼──┼──┼──┤
│  Self-Attention + ResBlocks                │  │  │  │
│  8×8, C=512                                │  │  │  │
│        ↓                                   │  │  │  │
├────────────────────────────────────────────┼──┼──┼──┤
│ DECODER (Upsample)                         │  │  │  │
├────────────────────────────────────────────┼──┼──┼──┤
│  Upsample ──> 16×16 + skip <───────────────┘  │  │  │
│        ↓                                      │  │  │
│  Upsample ──> 32×32 + skip <──────────────────┘  │  │
│        ↓                                         │  │
│  Upsample ──> 64×64 + skip <─────────────────────┘  │
│        ↓                                            │
│  Output Conv ──> 64×64, C=3                         │
└─────────────────────────────────────────────────────┘

Output: Predicted Noise ε̂

Time Embedding t is injected at each resolution level

Time Embedding

The timestep t is encoded into a high-dimensional embedding that's injected into the network.

Sinusoidal positional encoding:

emb[2i] = sin(t / 10000^(2i/d))
emb[2i+1] = cos(t / 10000^(2i/d))

Where d is the embedding dimension (typically 256 or 512).

Why this works:

Different frequencies capture different timescales
Similar to positional encodings in Transformers
Network learns to condition behavior on noise level

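import math
import torch
import torch.nn as nn

# Time embedding (sinusoidal)
# This common variant concatenates all sin terms followed by all cos terms,
# rather than interleaving them as in the formula above.
class SinusoidalPositionEmbeddings(nn.Module):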
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, time):
        device = time.device
        half_dim = self.dim // 2
        embeddings = math.log(10000) / (half_dim - 1)
        embeddings = torch.exp(torch.arange(half_dim, device=device) * -embeddings)
        embeddings = time[:, None] * embeddings[None, :]
        embeddings = torch.cat((embeddings.sin(), embeddings.cos()), dim=-1)
        return embeddings

Simplified U-Net Implementation

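# Simplified U-Net for diffusion
# Down, Up, and SelfAttention are helper modules that this guide does not define.
# The channel arithmetic below assumes Down halves the spatial resolution,
# Up doubles it, and both are called as module(x, t_emb).
class SimpleUNet(nn.Module):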
    def __init__(self, channels=3, time_emb_dim=256):
        super().__init__()

        # Time embedding
        self.time_mlp = nn.Sequential(
            SinusoidalPositionEmbeddings(time_emb_dim),
            nn.Linear(time_emb_dim, time_emb_dim),
            nn.GELU(),
            nn.Linear(time_emb_dim, time_emb_dim)
        )

        # Encoder (downsampling)
        self.conv1 = nn.Conv2d(channels, 64, 3, padding=1)
        self.down1 = Down(64, 128, time_emb_dim)
        self.down2 = Down(128, 256, time_emb_dim)

        # Bottleneck
        self.bottleneck = nn.Sequential(
            nn.Conv2d(256, 512, 3, padding=1),
            nn.GroupNorm(8, 512),
            nn.SiLU(),
            SelfAttention(512),  # Attention at lowest resolution
            nn.Conv2d(512, 256, 3, padding=1)
        )

        # Decoder (upsampling)
        self.up1 = Up(256 + 256, 128, time_emb_dim)  # +256 from skip connection
        self.up2 = Up(128 + 128, 64, time_emb_dim)

        # Output
        self.out = nn.Conv2d(64 + 64, channels, 1)

    def forward(self, x, t):
        # Get time embedding
        t_emb = self.time_mlp(t)

        # Encoder with skip connections
        x1 = self.conv1(x)
        x2 = self.down1(x1, t_emb)
        x3 = self.down2(x2, t_emb)

        # Bottleneck
        x = self.bottleneck(x3)

        # Decoder with skip connections
        x = self.up1(torch.cat([x, x3], dim=1), t_emb)
        x = self.up2(torch.cat([x, x2], dim=1), t_emb)
        x = self.out(torch.cat([x, x1], dim=1))

        return x

Key Components

1. ResNet blocks:

Residual connections for gradient flow
Group normalization (more stable than BatchNorm)
SiLU (Swish) activation function

2. Self-attention layers:

Applied at lower resolutions (e.g., 16×16)
Captures long-range dependencies
Crucial for image coherence

3. Skip connections:

Concatenate encoder features with decoder features
Preserve fine-grained spatial information
Enable accurate noise prediction

4. Adaptive Group Normalization:

Normalizes features based on time embedding
Allows network to adapt to different noise levels
Replaces BatchNorm for better conditioning

8. MATHEMATICAL FOUNDATIONS

Variational Lower Bound

The theoretical foundation of diffusion models comes from maximizing the log-likelihood of the data.

Goal: Maximize

log p_θ(x_0) = log ∫ p_θ(x_{0:T}) dx_{1:T}

This is intractable, so we optimize a variational lower bound (ELBO):

log p_θ(x_0) ≥ E_q [ log ( p_θ(x_{0:T}) / q(x_{1:T} | x_0) ) ]

In practice, the simplified loss L_simple works better than directly optimizing the ELBO, but the ELBO provides theoretical justification.
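
For reference, the bound can be rewritten as a sum of per-timestep terms, which is what links it to the noise-prediction loss. The decomposition below follows the standard DDPM derivation. The labels L_T, L_{t-1}, and L_0 are the conventional names and are not defined elsewhere in this guide.

```latex
\begin{aligned}
-\log p_\theta(x_0) \;\le\; \mathbb{E}_q\Big[\,
  & \underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{L_T} \\
  & + \sum_{t=2}^{T}
    \underbrace{D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}} \\
  & \underbrace{-\,\log p_\theta(x_0 \mid x_1)}_{L_0}
\,\Big]
\end{aligned}
```

Each L_{t-1} compares two Gaussians. With a fixed variance it reduces to a weighted squared error between their means, and in the noise parameterization that becomes a weighted version of ||ε - ε_θ(x_t, t)||². L_simple is the same expression with the weights dropped.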

Connection to Score Matching

Diffusion models are closely related to score-based generative models.

The score:

s_θ(x, t) ≈ ∇_x log p_t(x)

This is the gradient of the log density of the noisy data at noise level t, approximated by a network s_θ. Predicting the noise ε is equivalent to predicting the score, up to a known scale factor:

ε_θ(x_t, t) = -√(1 - ᾱ_t) · s_θ(x_t, t)

Both approaches learn the same thing from different perspectives!
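
The scale factor can be verified in the simplest case, where x_0 is known and q(x_t | x_0) is a single Gaussian. The scalar sketch below uses hypothetical values and helper names. It computes that Gaussian's score at x_t and converts it back to the noise that produced x_t.

```python
import math

def gaussian_score(x, mean, var):
    """d/dx of log N(x; mean, var)."""
    return -(x - mean) / var

abar_t = 0.35          # hypothetical cumulative alpha at timestep t
x0, eps = 1.2, -0.8    # hypothetical clean value and noise draw
x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps

# Score of q(x_t | x_0) = N(sqrt(abar_t) * x_0, 1 - abar_t), evaluated at x_t
score = gaussian_score(x_t, math.sqrt(abar_t) * x0, 1.0 - abar_t)

# Apply the conversion: eps = -sqrt(1 - abar_t) * score
eps_from_score = -math.sqrt(1.0 - abar_t) * score
```

A trained network sees only x_t, so it learns the average of these conditional quantities over all x_0 consistent with x_t. The same scale factor relates that averaged noise to the marginal score.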

Stochastic Differential Equations (SDEs)

The diffusion process can be viewed as discretizing a continuous SDE.

Forward SDE:

dx = f(x, t)dt + g(t)dw

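Where w is a standard Wiener process (Brownian motion), f is the drift, and g is the diffusion coefficient. With w̄ denoting a Wiener process running backward in time, the reverse-time SDE is: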

dx = [f(x, t) - g(t)² ∇_x log p_t(x)] dt + g(t)dw̄

This continuous perspective enables:

Flexible sampling trajectories
Connections to neural ODEs
Probability flow ODEs for deterministic sampling
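
As an illustration of the forward SDE, the sketch below simulates it with the Euler-Maruyama method in plain Python. The specific drift f(x, t) = -½ β(t) x and diffusion g(t) = √β(t), often called the variance-preserving SDE, and the linear β(t) from 0.1 to 20 are assumptions taken from the score-based SDE literature, not from this guide.

```python
import math
import random

BETA_MIN, BETA_MAX = 0.1, 20.0   # assumed continuous-time schedule on t in [0, 1]

def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def simulate_forward(x0, num_steps, rng):
    """Euler-Maruyama for dx = -0.5 * beta(t) * x dt + sqrt(beta(t)) dw."""
    dt = 1.0 / num_steps
    x = x0
    for k in range(num_steps):
        b = beta(k * dt)
        drift = -0.5 * b * x        # f(x, t)
        diffusion = math.sqrt(b)    # g(t)
        x = x + drift * dt + diffusion * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
finals = [simulate_forward(3.0, 500, rng) for _ in range(1000)]
mean_T = sum(finals) / len(finals)
var_T = sum((x - mean_T) ** 2 for x in finals) / len(finals)
```

Every path starts at x = 3, yet the endpoints are distributed approximately as N(0, 1). This is the continuous-time counterpart of the discrete chain converging to pure noise.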

Why the Math Works

1. Small diffusion steps:

When β_t is small, the reverse process is approximately Gaussian, making it learnable.

2. Markov property:

Each step only depends on the previous one, simplifying the distribution.

3. Convergence to Gaussian:

After enough steps, ᾱ_t → 0, so the signal term √ᾱ_t · x_0 vanishes and the injected Gaussian noise dominates. x_T is therefore approximately N(0, I) regardless of the data distribution.

4. Tractable training:

The closed-form forward process enables efficient training without running the full chain.

9. ADVANTAGES OVER GANS

Training Stability

Diffusion models have fundamentally stable training:

No adversarial dynamics: No discriminator to balance
No mode collapse: Cover full distribution naturally
Steady improvement: Loss trends downward without the oscillations typical of adversarial training
Robust to hyperparameters: Less sensitive tuning

            Key Difference: Diffusion models optimize a single objective (predict noise), while GANs balance two competing objectives (generator vs discriminator).
        

Sample Quality and Diversity

Quality metrics:

State-of-the-art FID scores (better than most GANs)
Sharp, detailed images
Consistent quality across samples

Diversity:

No mode collapse - naturally cover full distribution
Can generate rare examples
Better precision-recall trade-off

Likelihood-Based Framework

Unlike GANs, diffusion models are trained with a likelihood-based objective:

Bound log p(x): The ELBO gives a lower bound on how likely a sample is
Anomaly detection: Low likelihood indicates outliers
Compression: Use as a learned compression codec
Theoretical guarantees: Provable convergence properties

Conditioning and Control

Diffusion models are easier to condition:

Text-to-image (like DALL-E 2, Stable Diffusion):

Inject text embeddings through cross-attention layers (as in Stable Diffusion) or by combining them with the time embedding
Classifier-free guidance for strong conditioning
More controllable than GAN conditioning

Inpainting and editing:

Easy to constrain certain pixels
Iterative refinement enables editing
Better than GAN inversion
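
One common way to constrain known pixels is to overwrite them, after every reverse step, with a copy of the original image noised to the current level. The sketch below shows that replacement step in plain Python. The function name, the flat pixel lists, and the mask convention (1 = known, 0 = to be generated) are assumptions for illustration.

```python
import math

def constrain_known(x_model, x_known, mask, abar, eps):
    """Keep generated values where mask is 0; elsewhere use the known image noised to level abar."""
    out = []
    for xm, xk, m, e in zip(x_model, x_known, mask, eps):
        noised_known = math.sqrt(abar) * xk + math.sqrt(1.0 - abar) * e
        out.append(m * noised_known + (1 - m) * xm)
    return out

x_known = [0.9, -0.3, 0.0, 0.5]   # original image (only the masked-in part is trusted)
mask = [1, 1, 0, 0]               # first two pixels are known
x_model = [0.1, 0.2, 0.3, 0.4]    # output of one reverse step
eps = [0.0, 1.0, -1.0, 0.5]       # fixed noise draw for reproducibility

x_next = constrain_known(x_model, x_known, mask, abar=0.64, eps=eps)
```

Repeating this after each denoising step keeps the known region consistent with the original, while the model fills in the rest.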

Comparison Table: Diffusion Models vs GANs

Aspect	Diffusion Models	GANs
Training Stability	Very stable, no collapse	Unstable, mode collapse common
Sample Quality	State-of-the-art (FID ~2-5, dataset-dependent)	Excellent (FID ~5-20, dataset-dependent)
Sample Diversity	Excellent coverage	Can miss modes
Sampling Speed	Slow (1000 steps, ~10s)	Fast (1 step, ~0.01s)
Training Objective	Single loss (MSE)	Minimax game (adversarial)
Likelihood	Lower bound on log p(x)	No likelihood available
Hyperparameter Tuning	Robust, less sensitive	Sensitive, careful tuning needed
Conditioning	Easy and flexible	More difficult
Architecture	U-Net standard	Various (DCGAN, StyleGAN)
Applications	Text-to-image, super-resolution	Face generation, style transfer

While diffusion models excel in most areas, GANs still have the advantage for real-time applications due to their much faster sampling speed.

10. PRACTICAL CONSIDERATIONS

Choosing Timesteps (T)

Trade-offs:

T = 100: Fast sampling, but may miss fine details
T = 1000: Standard choice, good quality-speed balance
T = 4000: Very high quality, but very slow

Rule of thumb: Start with T=1000, reduce with DDIM if speed matters.

Noise Schedule Selection

Linear schedule:

beta_start = 1e-4
beta_end = 0.02
betas = torch.linspace(beta_start, beta_end, T)

Cosine schedule (often better):

def cosine_beta_schedule(timesteps, s=0.008):
    steps = timesteps + 1
    x = torch.linspace(0, timesteps, steps)
    alphas_cumprod = torch.cos(((x / timesteps) + s) / (1 + s) * torch.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clip(betas, 0.0001, 0.9999)

Image Normalization

Scale images to [-1, 1] so the data is roughly zero-centered, like the Gaussian noise it is mixed with:

# Normalize from [0, 1] to [-1, 1]
images = (images - 0.5) * 2

# Denormalize for visualization
images = (images + 1) / 2

EMA (Exponential Moving Average)

Using EMA of model weights significantly improves sample quality:

class EMA:
    def __init__(self, model, decay=0.9999):
        self.model = model
        self.decay = decay
        self.shadow = {}
        self.backup = {}

        for name, param in model.named_parameters():
            if param.requires_grad:
                self.shadow[name] = param.data.clone()

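    def update(self):
        # Call once after every optimizer.step()
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                self.shadow[name].mul_(self.decay).add_(
                    param.data, alpha=1 - self.decay
                )

    def apply_shadow(self):
        # Swap the EMA weights in before sampling, keeping a backup of the raw weights
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                self.backup[name] = param.data.clone()
                param.data.copy_(self.shadow[name])

    def restore(self):
        # Put the raw training weights back after sampling
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                param.data.copy_(self.backup[name])
        self.backup = {}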

Use the EMA weights for sampling, not the raw training weights.

Memory Optimization

For high-resolution images:

Gradient checkpointing: Trade compute for memory
Mixed precision (FP16/BF16): Roughly halves activation memory
Patch-based training: Train on image patches
Progressive training: Start at low resolution

Monitoring Training

Key metrics to track:

Training loss: Should decrease smoothly
Sample quality: Generate images every N epochs
FID score: Compute periodically on validation set
Gradient norms: Ensure no exploding/vanishing gradients

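# Generate samples for monitoring
# save_image comes from torchvision.utils. compute_fid, real_features and
# generated_features are placeholders for your own FID pipeline.
if epoch % 10 == 0:
    samples = generate_samples(model, timesteps=1000, num_samples=16)  # sets model.eval()
    save_image(samples, f'samples/epoch_{epoch}.png')

    # Compute FID
    fid = compute_fid(real_features, generated_features)
    print(f'Epoch {epoch}, FID: {fid:.2f}')

    model.train()  # switch back before resuming training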

Common Issues and Solutions

Problem	Symptom	Solution
Blurry samples	Generated images lack detail	Train longer, use cosine schedule, add attention
NaN losses	Loss becomes NaN during training	Reduce LR, check normalization, clip gradients
Slow convergence	Loss decreases very slowly	Increase LR, increase batch size, use EMA
OOM errors	Out of memory	Reduce batch size, use FP16, gradient checkpointing
Color shift	Generated images wrong colors	Check normalization, verify data preprocessing

11. ADVANCED TOPICS & VARIANTS

DDIM: Denoising Diffusion Implicit Models

DDIM accelerates sampling by using a deterministic process:

Skip timesteps: Use only 50-100 steps instead of 1000
Deterministic: Remove stochastic noise term
Same quality: Minimal degradation in sample quality
~20× speedup: Much faster inference

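x_{t-1} = √ᾱ_{t-1} · predicted_x_0 + √(1 - ᾱ_{t-1}) · predicted_ε

Here predicted_ε = ε_θ(x_t, t), predicted_x_0 = (x_t - √(1 - ᾱ_t) · predicted_ε) / √ᾱ_t, and t-1 stands for the previous timestep in the (possibly strided) sampling sequence.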
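
A scalar sketch of this update in plain Python is shown below. The function name and the example values are hypothetical, and the true noise is used in place of a network prediction so the result can be checked exactly.

```python
import math

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """Deterministic DDIM update (no added noise) for a scalar."""
    x0_pred = (x_t - math.sqrt(1.0 - abar_t) * eps_pred) / math.sqrt(abar_t)
    x_prev = math.sqrt(abar_prev) * x0_pred + math.sqrt(1.0 - abar_prev) * eps_pred
    return x_prev, x0_pred

# Hypothetical clean value, noise draw, and noise levels
x0, eps = 0.5, 1.3
abar_t, abar_prev = 0.2, 0.6
x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps

# One jump to a less noisy level
x_prev, x0_pred = ddim_step(x_t, eps, abar_t, abar_prev)
expected_prev = math.sqrt(abar_prev) * x0 + math.sqrt(1.0 - abar_prev) * eps

# Jumping straight to the clean level (abar = 1) returns the predicted x_0
x_final, _ = ddim_step(x_t, eps, abar_t, abar_prev=1.0)
```

Because the update depends on the noise levels only through ᾱ, the same function works for any pair of timesteps, which is what makes skipping steps possible.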

Latent Diffusion Models (Stable Diffusion)

Run diffusion in a compressed latent space instead of pixel space:

Encode: VAE encoder: image → latent (8× downsampling along each spatial dimension)
Diffuse: Run diffusion in latent space
Decode: VAE decoder: latent → image

Advantages:

Much faster (8× downsampling per side means 64× fewer spatial positions to process)
Lower memory requirements
Can handle high-resolution images (512×512, 1024×1024)
Powers Stable Diffusion
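
The savings are easy to quantify. The arithmetic below is a sketch that assumes a 512×512 RGB image and a latent with 4 channels at 1/8 resolution. The 4-channel latent matches the original Stable Diffusion configuration but is not stated in this guide.

```python
def tensor_size(channels, height, width):
    return channels * height * width

pixel = tensor_size(3, 512, 512)              # RGB image in pixel space
latent = tensor_size(4, 512 // 8, 512 // 8)   # assumed 4-channel latent, 8x smaller per side

spatial_reduction = (512 * 512) / (64 * 64)   # fewer positions for convolutions and attention
value_reduction = pixel / latent              # fewer numbers overall
```

Here the denoiser works on 64× fewer spatial positions and 48× fewer values than a pixel-space model at the same image size.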

Classifier-Free Guidance

Strengthen conditioning (e.g., text prompts) without a classifier:

ε̂ = ε̂_uncond + w · (ε̂_cond - ε̂_uncond)

How it works:

Train one model both with and without conditioning, by randomly dropping the condition for a fraction of training examples
At inference, blend unconditional and conditional predictions
Higher w = follow condition more strongly
Used in DALL-E 2, Imagen, Stable Diffusion
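
On the training side, a single model learns both predictions by randomly replacing the condition with a "null" value. The sketch below shows that dropout step in plain Python. The name `maybe_drop_condition`, the use of `None` as the null condition, and the 10% drop probability are illustrative assumptions.

```python
import random

NULL_CONDITION = None   # hypothetical stand-in for "no conditioning"

def maybe_drop_condition(cond, p_uncond, rng):
    """With probability p_uncond, train this example as unconditional."""
    return NULL_CONDITION if rng.random() < p_uncond else cond

rng = random.Random(0)
batch = ["a photo of a cat"] * 10000
dropped = [maybe_drop_condition(c, p_uncond=0.1, rng=rng) for c in batch]
drop_rate = sum(c is NULL_CONDITION for c in dropped) / len(dropped)
```

At inference, the model is evaluated twice per step, once with the prompt and once with the null condition, and the two outputs are blended with the guidance formula.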

Cascaded Diffusion

Generate images at progressively higher resolutions:

Base model: Generate 64×64 image
Super-resolution 1: Upsample to 256×256
Super-resolution 2: Upsample to 1024×1024

This approach is used in DALL-E 2 and Imagen for high-resolution generation.

Applications

1. Text-to-Image:

DALL-E 2, Stable Diffusion, Imagen, Midjourney
Condition diffusion on embeddings from a text encoder (CLIP in DALL-E 2 and Stable Diffusion, T5 in Imagen)
State-of-the-art quality and controllability

2. Super-Resolution:

Condition on low-resolution images
Generate plausible high-resolution details
Better than traditional upsampling methods

3. Inpainting:

Fill in missing regions of images
Constrain known pixels during diffusion
Natural and coherent completions

4. Image-to-Image Translation:

Style transfer, colorization, etc.
Condition on source image
More flexible than CycleGAN

5. Video Generation:

Extend to 3D U-Nets (spatial + temporal)
Generate coherent video sequences
Emerging area of research

6. 3D Generation:

DreamFusion: text-to-3D using diffusion
Generate 3D shapes, NeRFs, meshes
Rapidly advancing field

Future Directions

Faster sampling: Reduce to single-step generation
Continuous-time models: Neural ODEs/SDEs
Better conditioning: More controllable generation
Multi-modal: Audio, video, 3D, etc.
Efficient training: Reduce computational requirements