GENERATIVE ADVERSARIAL NETWORKS (GANs)

CMPUT 328 - Deep Learning | Complete Educational Guide
From Vanilla GAN to WGAN-GP and Beyond

1. INTRODUCTION TO GANS

What are Generative Models?

Generative models are a class of machine learning models that learn to create new data samples that resemble the training data. Unlike discriminative models that learn to classify or predict labels, generative models learn the underlying probability distribution of the data itself.

Key Concept: A generative model learns P(X), the probability distribution of data X, allowing it to generate new samples that look like they came from the same distribution.

The GAN Innovation

Generative Adversarial Networks (GANs), introduced by Ian Goodfellow and colleagues in 2014, revolutionized generative modeling by framing it as a competitive game between two neural networks.

Before GANs, generative models like Variational Autoencoders (VAEs) struggled to produce sharp, realistic images. GANs changed this by introducing an adversarial training process that pushes both networks to improve simultaneously.

The Two-Player Game Analogy

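Think of a GAN as a game between a counterfeiter (the generator), who produces fakes, and a detective (the discriminator), who tries to tell fakes from the real thing.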

As the detective gets better at spotting fakes, the counterfeiter must improve their technique. As the counterfeiter produces more convincing fakes, the detective must become more discerning. This back-and-forth competition drives both to excellence.

GAN GAME DYNAMICS

  Generator (G)                         Discriminator (D)
  ┌───────────┐                         ┌───────────────┐
  │  Noise z  │                         │               │
  └─────┬─────┘                         │               │
        │                               │               │
        v                               │               │
  ┌───────────┐        Fake x           │   Real/Fake   │
  │   G(z)    │────────────────────────>│  Classifier   │
  │ Generator │                         │               │
  └───────────┘                         │               │
        ^                               └───────┬───────┘
        │          Gradient Signal              │
        └───────────────────────────────────────┘

  Real Data ──────────────────────────────> D

Generator vs Discriminator Roles

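Generator (G): takes random noise z as input and produces fake samples G(z), aiming to make the discriminator classify them as real.

Discriminator (D): a binary classifier that receives both real and generated samples and outputs the probability that its input is real.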

The discriminator is trained on both real and fake data, while the generator never sees real data directly - it only learns from the discriminator's feedback.

2. VANILLA GAN FUNDAMENTALS

Architecture Overview

Generator Architecture: Noise → Image

The generator transforms a low-dimensional random noise vector into a high-dimensional data sample (e.g., an image). This is an upsampling process.

Generator: z (100D) → Image (28×28 = 784D)

┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐
│ Noise  │────>│Linear  │────>│Linear  │────>│Linear  │────> Image
│   z    │     │+ ReLU  │     │+ ReLU  │     │+ Tanh  │
│ (100)  │     │ (256)  │     │ (512)  │     │ (784)  │
└────────┘     └────────┘     └────────┘     └────────┘

Discriminator Architecture: Image → Probability

The discriminator is essentially a binary classifier that outputs a single probability value indicating whether the input is real or fake.

Discriminator: Image (784D) → Probability (1D)

┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐
│ Image  │────>│Linear  │────>│Linear  │────>│Linear  │────> P(real)
│ (784)  │     │+LeakyR │     │+LeakyR │     │+Sigmoid│
│        │     │ (512)  │     │ (256)  │     │  (1)   │
└────────┘     └────────┘     └────────┘     └────────┘
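The two diagrams above can be sketched as PyTorch modules. This is a minimal sketch that follows the layer widths in the diagrams; the class names `VanillaGenerator` and `VanillaDiscriminator` and the LeakyReLU slope of 0.2 are assumptions for illustration, not taken from the diagrams.

```python
import torch
import torch.nn as nn


class VanillaGenerator(nn.Module):
    """Maps a 100-D noise vector to a flattened 28x28 image in [-1, 1]."""

    def __init__(self, latent_dim=100, img_dim=784):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.ReLU(),
            nn.Linear(512, img_dim),
            nn.Tanh(),  # bounded output, matching the diagram's final block
        )

    def forward(self, z):
        return self.model(z)


class VanillaDiscriminator(nn.Module):
    """Maps a flattened image to a single probability of being real."""

    def __init__(self, img_dim=784):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(img_dim, 512),
            nn.LeakyReLU(0.2),  # slope 0.2 is an assumed, commonly used value
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
            nn.Sigmoid(),  # squashes the score into (0, 1)
        )

    def forward(self, x):
        return self.model(x)


z = torch.randn(16, 100)               # batch of 16 noise vectors
fake = VanillaGenerator()(z)           # shape (16, 784)
p_real = VanillaDiscriminator()(fake)  # shape (16, 1)
print(fake.shape, p_real.shape)
```

Passing a batch of noise through both networks, as in the last lines, is a quick way to confirm that the generator's output width matches the discriminator's input width.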

The Minimax Objective

The GAN training process is formalized as a minimax game where the generator tries to minimize what the discriminator tries to maximize.

min_G max_D V(D, G) = E_{x ~ p_data}[log D(x)] + E_{z ~ p_z}[log(1 - D(G(z)))]

Breaking down the objective:

Intuitive Explanation: The discriminator wants to output D(x_real) ≈ 1 and D(x_fake) ≈ 0, maximizing the objective. The generator wants D(G(z)) ≈ 1, which minimizes the second term and thus the overall objective.
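To make the objective concrete, the sketch below estimates V(D, G) from a handful of discriminator outputs. The function name `value_fn` and the sample probabilities are made up for illustration; only the formula comes from the text above.

```python
import math


def value_fn(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs.

    d_real: D(x) on real samples; d_fake: D(G(z)) on generated samples.
    All values are probabilities strictly between 0 and 1.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A confident, correct discriminator pushes V toward its maximum of 0.
v_strong_d = value_fn([0.99, 0.98], [0.01, 0.02])
# A fooled discriminator (D(G(z)) close to 1) drives V strongly negative.
v_fooled_d = value_fn([0.99, 0.98], [0.99, 0.98])
# When D cannot tell real from fake it outputs 0.5 everywhere: V = -log 4.
v_equilibrium = value_fn([0.5, 0.5], [0.5, 0.5])

print(v_strong_d, v_fooled_d, v_equilibrium)
```

The ordering of the three values mirrors the game: the discriminator pulls V up toward 0, the generator pulls it down, and the balanced case sits at -log 4.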

Binary Cross-Entropy Loss

In practice, the minimax objective is implemented using Binary Cross-Entropy (BCE) loss, which measures the difference between predicted and actual binary labels.

BCE(ŷ, y) = -[y log(ŷ) + (1-y) log(1-ŷ)], where ŷ is the predicted probability and y ∈ {0, 1} is the label

For the Discriminator:

# Real samples (label = 1)
loss_real = BCE(D(x_real), 1)

# Fake samples (label = 0)
loss_fake = BCE(D(G(z)), 0)

# Total discriminator loss
loss_D = loss_real + loss_fake

For the Generator:

# Generator wants D to output 1 for fake samples
loss_G = BCE(D(G(z)), 1)

The minimax objective has the generator minimize log(1 - D(G(z))), but this term saturates when the discriminator confidently rejects fakes. In practice (and as the original GAN paper itself recommends), the generator instead maximizes log(D(G(z))), which provides stronger gradients early in training.
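The gradient argument can be checked numerically. The sketch below assumes the discriminator output is a sigmoid of a logit a, so D(G(z)) = sigmoid(a), and compares the derivative of each generator loss with respect to that logit. The function names are illustrative.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the original minimax generator loss."""
    return -sigmoid(a)


def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), the non-saturating generator loss."""
    return sigmoid(a) - 1.0


# Very negative logits mean the discriminator confidently rejects the fake.
for a in (-6.0, -2.0, 0.0, 2.0):
    print(f"logit={a:+.1f}  D(G(z))={sigmoid(a):.4f}  "
          f"saturating={grad_saturating(a):+.4f}  "
          f"non-saturating={grad_non_saturating(a):+.4f}")
```

When the discriminator is confident that a sample is fake (large negative logit), the saturating gradient shrinks toward 0 while the non-saturating gradient stays close to -1, which is why the second form keeps the generator learning early in training.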

Training Algorithm

GANs use alternating optimization: train the discriminator for one or more steps, then train the generator for one step, and repeat.

Training Loop (Vanilla GAN):

FOR each training iteration:
  1. DISCRIMINATOR TRAINING STEP
    a. Sample minibatch of real data x from dataset
    b. Sample minibatch of noise z from prior p(z)
    c. Generate fake data: x_fake = G(z)
    d. Compute loss: L_D = BCE(D(x_real), 1) + BCE(D(x_fake), 0)
    e. Update D parameters by descending the gradient of L_D (equivalently, ascending V(D, G))

  2. GENERATOR TRAINING STEP
    a. Sample minibatch of noise z from prior p(z)
    b. Generate fake data: x_fake = G(z)
    c. Compute loss: L_G = BCE(D(G(z)), 1)
    d. Update G parameters by descending gradient

PyTorch Implementation Example

# Simplified vanilla GAN training loop
for epoch in range(num_epochs):
    for real_images, _ in dataloader:
        batch_size = real_images.size(0)

        # ==================
        # Train Discriminator
        # ==================
        optimizer_D.zero_grad()

        # Real images
        real_labels = torch.ones(batch_size, 1)
        real_output = discriminator(real_images)
        loss_D_real = criterion(real_output, real_labels)

        # Fake images
        noise = torch.randn(batch_size, latent_dim)
        fake_images = generator(noise)
        fake_labels = torch.zeros(batch_size, 1)
        fake_output = discriminator(fake_images.detach())
        loss_D_fake = criterion(fake_output, fake_labels)

        # Total discriminator loss
        loss_D = loss_D_real + loss_D_fake
        loss_D.backward()
        optimizer_D.step()

        # ==================
        # Train Generator
        # ==================
        optimizer_G.zero_grad()

        # Generator wants D to output 1 for fake images
        noise = torch.randn(batch_size, latent_dim)
        fake_images = generator(noise)
        fake_output = discriminator(fake_images)
        real_labels = torch.ones(batch_size, 1)

        loss_G = criterion(fake_output, real_labels)
        loss_G.backward()
        optimizer_G.step()

3. TRAINING INSTABILITY & PROBLEMS

Despite their success, vanilla GANs are notoriously difficult to train. Several fundamental problems arise from the adversarial training dynamics.

Mode Collapse

What it is:

Mode collapse occurs when the generator learns to produce only a limited variety of samples, ignoring much of the data distribution. Instead of generating diverse outputs, it "collapses" to producing a few safe samples that fool the discriminator.

Why it happens:

Mode Collapse Visualization (MNIST digits):

Full Distribution:          Mode Collapse:
┌───────────────┐           ┌───────────────┐
│   0 1 2 3 4   │           │   7 7 7 7 7   │
│   5 6 7 8 9   │    -->    │   7 7 7 7 7   │
│   3 1 5 2 8   │           │   7 7 7 7 7   │
│   9 4 6 0 1   │           │   7 7 7 7 7   │
└───────────────┘           └───────────────┘
(Diverse samples)           (Only generates 7s)

In severe mode collapse, the generator might produce identical or near-identical samples regardless of the input noise vector z.
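One rough way to notice this kind of collapse is to check how different generated samples are from each other. The sketch below is a simple heuristic rather than an established metric; `mean_pairwise_distance` is a hypothetical helper, and the random vectors stand in for generator outputs.

```python
import math
import random


def mean_pairwise_distance(samples):
    """Average Euclidean distance over all pairs of sample vectors."""
    total, count = 0.0, 0
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            total += math.dist(samples[i], samples[j])
            count += 1
    return total / count


rng = random.Random(0)
# Stand-in for a healthy generator: outputs vary with the noise input.
diverse = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(20)]
# Stand-in for a collapsed generator: nearly the same output every time.
collapsed = [[0.3 + rng.uniform(-1e-3, 1e-3) for _ in range(8)]
             for _ in range(20)]

print(mean_pairwise_distance(diverse), mean_pairwise_distance(collapsed))
```

A value near zero for a batch generated from different noise vectors suggests severe collapse. A healthy value does not rule out partial collapse, for example a generator that covers only a few classes.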

Vanishing/Exploding Gradients

When discriminator becomes too strong:

If the discriminator becomes very good at distinguishing real from fake, it outputs values very close to 0 or 1. This causes the gradient of log(1 - D(G(z))) to vanish, leaving the generator with no learning signal.

When D(G(z)) ≈ 0 (saturated sigmoid): ∇_θG log(1 - D(G(z))) ≈ 0, where θG denotes the generator parameters

This is known as the vanishing gradient problem. The generator stops learning because the discriminator is so confident that the samples are fake.

Loss of learning signal:

Gradient Flow Problem:

Strong Discriminator (D ≈ 1 for real, D ≈ 0 for fake)
        │
        v
Saturated Sigmoid (flat regions)
        │
        v
Near-zero gradients to Generator
        │
        v
Generator stops learning

Non-Convergence Issues

Unlike standard supervised training, where a single loss is minimized and generally trends downward, GAN training involves two competing objectives that may never reach equilibrium.

Oscillating behavior:

Epoch   G Loss   D Loss   Observation
1       2.45     0.89     D too strong
2       1.12     1.23     Better balance
3       3.78     0.45     Oscillation
4       0.67     2.01     G too strong
5       2.89     0.91     Instability

Lack of Meaningful Metrics

BCE loss doesn't correlate with quality:

The discriminator and generator losses provide little information about the actual quality of generated samples. You can have:

The Problem: You cannot look at loss curves alone to determine if your GAN is training well. You must visually inspect generated samples, which makes debugging and hyperparameter tuning extremely difficult.

Diagnostic challenges:

These fundamental problems motivated the development of improved GAN variants like Wasserstein GAN (WGAN), which addresses many of these issues through a different distance metric and training procedure.

4. WASSERSTEIN GAN (WGAN)

Motivation: Why We Need Better Distance Metrics

Much of the instability of vanilla GANs is attributed to the Jensen-Shannon (JS) divergence that the BCE loss implicitly minimizes. When the real and generated distributions have disjoint (or nearly disjoint) supports, the JS divergence saturates at its maximum value, log 2, no matter how far apart the distributions are, so it provides no useful gradient.

The core problem:

Solution: Wasserstein GAN uses Earth Mover's Distance (Wasserstein distance), which provides meaningful gradients even when distributions don't overlap.
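A small worked example makes the contrast concrete. The sketch below places a "real" point mass at position 0 on a 1-D grid and a "fake" point mass at increasing distances, then computes both quantities. The helper names and the grid setup are assumptions for illustration; the 1-D earth mover's distance is computed from cumulative sums, which is valid only in one dimension.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as lists; 0*log(0) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def wasserstein_1d(p, q):
    """Earth mover's distance on a unit-spaced 1-D grid via CDF differences."""
    cdf_gap, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap)
    return total


def point_mass(position, size=10):
    """All probability on a single grid cell."""
    return [1.0 if i == position else 0.0 for i in range(size)]


real = point_mass(0)
for shift in (1, 3, 6):
    fake = point_mass(shift)
    print(shift, js_divergence(real, fake), wasserstein_1d(real, fake))
```

For every shift the JS divergence is the same constant, log 2, so it cannot tell the generator which direction reduces the gap. The Wasserstein distance equals the shift itself, so it shrinks steadily as the fake distribution moves toward the real one.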

Earth Mover's Distance (Wasserstein Distance)

Intuitive explanation (moving piles of earth):

Imagine you have two piles of earth with different shapes. The Wasserstein distance measures the minimum amount of "work" needed to transform one pile into the other, where work = amount of earth × distance moved.

Earth Mover's Distance Intuition:

Distribution A:        Distribution B:
    ███                     ███
   █████                   █████
  ███████                 ███████
 █████████      -->      █████████
▀▀▀▀▀▀▀▀▀▀▀             ▀▀▀▀▀▀▀▀▀▀▀

W(A,B) = minimum cost to move earth from A to match B

Key property: W(A,B) is continuous and provides gradients even when A and B don't overlap

Mathematical definition:

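W(p_r, p_g) = inf_{γ ∈ Π(p_r, p_g)} E_{(x,y) ~ γ}[||x - y||]

Where:

Π(p_r, p_g) is the set of all joint distributions γ(x, y) whose marginals are p_r and p_g. Each γ is a transport plan that says how much mass moves from x to y, and ||x - y|| is the cost of moving one unit of mass.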

Why Wasserstein distance is better:

From Discriminator to Critic

WGAN replaces the discriminator with a "critic" that outputs raw scores instead of probabilities.

Aspect                  Vanilla GAN Discriminator          WGAN Critic
Output activation       Sigmoid (0 to 1)                   None (any real number)
Output interpretation   Probability of being real          Raw score (higher = more real)
Training objective      Maximize classification accuracy   Maximize separation between real and fake scores
Loss function           Binary Cross-Entropy               Wasserstein loss

Critic architecture (no sigmoid):

class Critic(nn.Module):
    def __init__(self):
        super(Critic, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(784, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1)
            # NO SIGMOID - outputs raw scores
        )

    def forward(self, x):
        return self.model(x)

Kantorovich-Rubinstein Duality

Computing the Wasserstein distance directly is intractable. The Kantorovich-Rubinstein duality theorem provides a practical way to compute it using a critic function.

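W(p_r, p_g) = sup_{||f||_L ≤ 1} ( E_{x ~ p_r}[f(x)] - E_{x ~ p_g}[f(x)] )

Where:

The supremum is taken over all 1-Lipschitz functions f (||f||_L ≤ 1 means the Lipschitz constant of f is at most 1), p_r is the real data distribution, and p_g is the generator's distribution.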

Key Insight: The critic network approximates the optimal f. By training the critic to maximize the difference between scores on real and fake data (while maintaining the Lipschitz constraint), we approximate the Wasserstein distance.

Lipschitz Constraint

A function f is 1-Lipschitz if for all x₁ and x₂:

|f(x₁) - f(x₂)| ≤ ||x₁ - x₂||

This means the function cannot change faster than the input changes - it has bounded gradients.
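The definition can be probed empirically by sampling pairs of inputs and recording the steepest slope seen. This is a minimal sketch for scalar functions; `estimate_lipschitz` is a hypothetical helper, and because it only samples, it gives a lower bound on the true Lipschitz constant rather than a proof.

```python
import math
import random


def estimate_lipschitz(f, low, high, num_pairs=10_000, seed=0):
    """Largest observed |f(a) - f(b)| / |a - b| over random input pairs."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(num_pairs):
        a, b = rng.uniform(low, high), rng.uniform(low, high)
        if a != b:
            worst = max(worst, abs(f(a) - f(b)) / abs(a - b))
    return worst


# sin is 1-Lipschitz: its derivative cos never exceeds 1 in magnitude.
print(estimate_lipschitz(math.sin, -5, 5))
# 3x has slope 3 everywhere, so it violates the 1-Lipschitz condition.
print(estimate_lipschitz(lambda x: 3 * x, -5, 5))
```

A critic network is a function of many inputs, so the same idea applies with vector norms in place of absolute values, which is what the gradient-based constraints in WGAN variants target.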

Weight clipping (original WGAN approach):

The original WGAN paper enforced the Lipschitz constraint by clipping critic weights to a small range [-c, c] after each update.

# Weight clipping (original WGAN): apply after each critic optimizer step
with torch.no_grad():
    for param in critic.parameters():
        param.clamp_(-0.01, 0.01)  # Clip to [-0.01, 0.01]

Problems with weight clipping:

These limitations of weight clipping led to the development of WGAN-GP (Gradient Penalty), which enforces the Lipschitz constraint in a more principled way.

WGAN Loss Functions

Critic loss (maximize):

L_C = E_{x ~ p_r}[C(x)] - E_{z ~ p_z}[C(G(z))]

Generator loss (minimize):

L_G = -E_{z ~ p_z}[C(G(z))]

PyTorch implementation:

# WGAN losses (without gradient penalty yet)
fake_images = generator(noise)

# Critic loss
critic_real = critic(real_images).mean()
critic_fake = critic(fake_images.detach()).mean()  # detach: no gradients into G
loss_C = -(critic_real - critic_fake)  # Maximize → minimize negative

# Generator loss
loss_G = -critic(fake_images).mean()  # Maximize critic score on fakes

Key Difference: Notice there's no BCE loss, no labels (0 or 1), and no sigmoid activation. We're directly optimizing the difference between real and fake scores.

5. WGAN WITH GRADIENT PENALTY (WGAN-GP)

Gradient Penalty Concept

WGAN-GP improves upon WGAN by replacing weight clipping with a gradient penalty term that directly enforces the Lipschitz constraint by penalizing the gradient norm of the critic.

The idea:

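A 1-Lipschitz function must have gradients with norm at most 1 everywhere. Instead of clipping weights, we add a penalty term that encourages ||∇f(x)||₂ = 1. The target is exactly 1 rather than "at most 1" because the optimal critic has unit gradient norm almost everywhere along straight lines between real and generated samples.

GP = E_{x̂ ~ p_x̂}[(||∇_x̂ C(x̂)||₂ - 1)²]

Where:

x̂ is a point sampled on a straight line between a real and a generated sample, p_x̂ is the distribution of such points, and ∇_x̂ C(x̂) is the gradient of the critic score with respect to its input. The penalty enters the critic loss scaled by a coefficient λ (typically 10).

Enforcing 1-Lipschitz constraint: By penalizing gradients that deviate from norm 1, we encourage (but do not strictly guarantee) a critic that doesn't change too rapidly, approximately satisfying the Lipschitz constraint needed for the Wasserstein distance approximation.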

Implementation Details

Interpolated samples:

We compute the gradient penalty on random interpolations between real and fake samples, not on real/fake data directly.

x̂ = ε · x_real + (1 - ε) · x_fake, where ε ~ Uniform(0, 1)

Why interpolations?

Gradient Computation

Full gradient penalty implementation (PyTorch):

def compute_gradient_penalty(critic, real_images, fake_images, device):
    """
    Compute gradient penalty for WGAN-GP

    Args:
        critic: Critic network
        real_images: Batch of real images
        fake_images: Batch of generated images
        device: 'cuda' or 'cpu'

    Returns:
        gradient_penalty: Scalar penalty value
    """
    batch_size = real_images.size(0)

    # Random weight term for interpolation
    epsilon = torch.rand(batch_size, 1, 1, 1, device=device)
    epsilon = epsilon.expand_as(real_images)

    # Interpolated samples
    interpolated = epsilon * real_images + (1 - epsilon) * fake_images
    interpolated.requires_grad_(True)

    # Critic scores for interpolated samples
    critic_interpolated = critic(interpolated)

    # Compute gradients of critic scores w.r.t. interpolated samples
    gradients = torch.autograd.grad(
        outputs=critic_interpolated,
        inputs=interpolated,
        grad_outputs=torch.ones_like(critic_interpolated),
        create_graph=True,
        retain_graph=True,
        only_inputs=True
    )[0]

    # Flatten gradients
    gradients = gradients.view(batch_size, -1)

    # Compute gradient norm
    gradient_norm = gradients.norm(2, dim=1)

    # Penalty for deviation from norm = 1
    gradient_penalty = ((gradient_norm - 1) ** 2).mean()

    return gradient_penalty

The create_graph=True flag is crucial - it allows us to backpropagate through the gradient computation itself, which is necessary for training the critic to have norm-1 gradients.
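A linear critic gives a case where the penalty can be verified by hand: for f(x) = w · x the gradient with respect to x is w at every point, so the penalty should equal (||w||₂ - 1)². The sketch below assumes a toy 4-dimensional input and hand-picked weights; it repeats the autograd call inline so that it runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Linear critic f(x) = w . x, so the gradient w.r.t. x is w for every input.
critic = nn.Linear(4, 1, bias=False)
with torch.no_grad():
    critic.weight.copy_(torch.tensor([[3.0, 0.0, 4.0, 0.0]]))  # ||w||_2 = 5

real = torch.randn(8, 4)
fake = torch.randn(8, 4)
eps = torch.rand(8, 1)  # one interpolation coefficient per sample
x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)

scores = critic(x_hat)
grads = torch.autograd.grad(
    outputs=scores,
    inputs=x_hat,
    grad_outputs=torch.ones_like(scores),
    create_graph=True,  # keep the graph so the penalty can be backpropagated
)[0]
grad_norm = grads.norm(2, dim=1)
penalty = ((grad_norm - 1) ** 2).mean()

print(grad_norm)       # every entry equals ||w||_2 = 5
print(penalty.item())  # (5 - 1)^2 = 16

# Because of create_graph=True the penalty is differentiable w.r.t. w.
penalty.backward()
print(critic.weight.grad)
```

The weight gradient is nonzero, which confirms that minimizing the penalty pulls the critic's weights toward unit gradient norm.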

Training Algorithm

WGAN-GP Training Procedure:

FOR each training iteration:

  FOR n_critic iterations (typically 5):
    1. Sample real data x_real and noise z
    2. Generate fake data: x_fake = G(z)
    3. Compute critic scores: C(x_real), C(x_fake)
    4. Compute gradient penalty on interpolated samples
    5. Total critic loss: L_C = C(x_fake) - C(x_real) + λ · GP
    6. Update critic parameters

  Generator update:
    1. Sample noise z
    2. Generate fake data: x_fake = G(z)
    3. Compute generator loss: L_G = -C(G(z))
    4. Update generator parameters

Key hyperparameters:

Parameter       Standard Value   Purpose
λ (lambda)      10               Gradient penalty coefficient
n_critic        5                Critic updates per generator update
Learning rate   1e-4 (0.0001)    Lower than vanilla GAN for stability
β₁ (Adam)       0.5              First-moment (momentum) decay rate; the WGAN-GP paper's default is 0
β₂ (Adam)       0.9              Second-moment (RMSprop-style) decay rate

Complete Training Loop

# WGAN-GP Training Loop
for epoch in range(num_epochs):
    for i, (real_images, _) in enumerate(dataloader):
        real_images = real_images.to(device)
        batch_size = real_images.size(0)

        # ==================
        # Train Critic (n_critic times)
        # ==================
        for _ in range(n_critic):
            optimizer_C.zero_grad()

            # Generate fake images
            noise = torch.randn(batch_size, latent_dim, device=device)
            fake_images = generator(noise)

            # Critic scores
            critic_real = critic(real_images).mean()
            critic_fake = critic(fake_images.detach()).mean()  # detach: the critic step needs no generator gradients

            # Gradient penalty
            gp = compute_gradient_penalty(critic, real_images,
                                         fake_images.detach(), device)

            # Total critic loss
            loss_C = critic_fake - critic_real + lambda_gp * gp

            loss_C.backward()
            optimizer_C.step()

        # ==================
        # Train Generator
        # ==================
        optimizer_G.zero_grad()

        noise = torch.randn(batch_size, latent_dim, device=device)
        fake_images = generator(noise)

        # Generator loss
        loss_G = -critic(fake_images).mean()

        loss_G.backward()
        optimizer_G.step()

        # Log Wasserstein distance estimate (as a float, so no graph is kept alive)
        wasserstein_distance = (critic_real - critic_fake).item()

Stability Improvements

Smoother loss curves:

WGAN-GP typically trains much more stably than vanilla GAN: its losses tend to change smoothly instead of oscillating, although they are not guaranteed to decrease monotonically.

[Figure: Loss Comparison Over Training. Two panels plot loss against iteration. Vanilla GAN: erratic, oscillating loss. WGAN-GP: smooth decrease.]

Meaningful Wasserstein distance metric:

Better sample quality:

Empirical Results: In practice, WGAN-GP often shows a several-fold (roughly 2-5x) reduction in loss volatility, more consistent convergence across random seeds, and improved sample quality compared to vanilla GAN with the same architecture and training time. The exact gains depend on the dataset and hyperparameters.

6. CYCLEGAN

Unpaired Image Translation Problem

Traditional image-to-image translation methods (like pix2pix) require paired training examples: input image A and corresponding output image B. CycleGAN solves the harder problem of translation without paired data.

The challenge:

Paired vs Unpaired Data:

Paired (pix2pix):            Unpaired (CycleGAN):
Domain X   Domain Y          Domain X   Domain Y
┌──────┐   ┌──────┐          ┌──────┐   ┌──────┐
│ Cat  │───│ Edge │          │ Cat  │   │ Edge │
└──────┘   └──────┘          └──────┘   └──────┘
┌──────┐   ┌──────┐          ┌──────┐   ┌──────┐
│ Dog  │───│ Edge │          │ Horse│   │ Edge │
└──────┘   └──────┘          └──────┘   └──────┘
┌──────┐   ┌──────┐          ┌──────┐   ┌──────┐
│Horse │───│ Edge │          │ Dog  │   │ Edge │
└──────┘   └──────┘          └──────┘   └──────┘
(Aligned pairs)              (Unaligned collections)

Cycle Consistency Loss

The key innovation of CycleGAN is cycle consistency: if we translate from domain X to Y and back to X, we should get back the original image.

Forward cycle: X → Y → X

If we map image x from domain X to domain Y using generator G, then map it back using generator F, we should recover x.

Forward Cycle: x → G(x) → F(G(x)) ≈ x

Backward cycle: Y → X → Y

Similarly, mapping y from Y to X and back should recover y.

Backward Cycle: y → F(y) → G(F(y)) ≈ y

Cycle Consistency Visualization:

Forward Cycle
    x ────────> G(x) ────────> F(G(x))
 (photo)     (painting)       (photo')
    │                            │
    └──────── x ≈ F(G(x)) ───────┘
            (should match!)

Backward Cycle
    y ────────> F(y) ────────> G(F(y))
(painting)    (photo)       (painting')
    │                            │
    └──────── y ≈ G(F(y)) ───────┘
            (should match!)

Cycle consistency loss:

L_cyc(G, F) = E_{x ~ X}[||F(G(x)) - x||₁] + E_{y ~ Y}[||G(F(y)) - y||₁]
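The loss is easy to evaluate on toy mappings. In the sketch below the "translators" are simple functions on short vectors rather than neural networks, and the L1 term is averaged over elements (as a mean-reduced L1 loss would do) instead of summed. All names and values are illustrative assumptions.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equal-length vectors."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b)) / len(a)


def cycle_consistency_loss(G, F, batch_x, batch_y):
    """Batch average of ||F(G(x)) - x||_1 plus ||G(F(y)) - y||_1."""
    forward = sum(l1_distance(F(G(x)), x) for x in batch_x) / len(batch_x)
    backward = sum(l1_distance(G(F(y)), y) for y in batch_y) / len(batch_y)
    return forward + backward


# Toy translators on vectors: G doubles, F_inverse halves, so F undoes G.
G = lambda x: [2.0 * v for v in x]
F_inverse = lambda y: [0.5 * v for v in y]
# A mismatched F that does not undo G (it adds an offset).
F_wrong = lambda y: [0.5 * v + 1.0 for v in y]

xs = [[1.0, 2.0], [3.0, 4.0]]
ys = [[2.0, 6.0], [8.0, 4.0]]

loss_consistent = cycle_consistency_loss(G, F_inverse, xs, ys)
loss_inconsistent = cycle_consistency_loss(G, F_wrong, xs, ys)
print(loss_consistent, loss_inconsistent)
```

With the exact inverse the loss is 0. With the mismatched mapping, F_wrong(G(x)) = x + 1 and G(F_wrong(y)) = y + 2, so the two cycle terms contribute 1 and 2 for a total of 3.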

Architecture

Two generators:

Two discriminators:

CycleGAN Architecture:

Domain X (Photos)                      Domain Y (Paintings)

     x     ──────────── G ───────────>     G(x)  ───> D_Y
   F(y)    <─────────── F ────────────      y
     │
     v
    D_X

Cycle: x → G(x) → F(G(x)) ≈ x
Cycle: y → F(y) → G(F(y)) ≈ y

Total Loss Function

The CycleGAN objective combines adversarial losses (to make translations realistic) with cycle consistency losses (to preserve content).

Adversarial losses:

L_GAN(G, D_Y) = E_{y ~ Y}[log D_Y(y)] + E_{x ~ X}[log(1 - D_Y(G(x)))]

L_GAN(F, D_X) = E_{x ~ X}[log D_X(x)] + E_{y ~ Y}[log(1 - D_X(F(y)))]

Full objective:

L(G, F, D_X, D_Y) = L_GAN(G, D_Y) + L_GAN(F, D_X) + λ · L_cyc(G, F)

Where λ controls the relative importance of cycle consistency (typically λ = 10).

Training Procedure

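# CycleGAN Training (simplified)
# Assumes G, F, D_X, D_Y are nn.Module instances, L1 = nn.L1Loss(),
# and optimizer_G holds the parameters of both G and F.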
for epoch in range(num_epochs):
    for real_X, real_Y in dataloader:

        # ==================
        # Train Generators
        # ==================
        optimizer_G.zero_grad()

        # Forward cycle: X -> Y -> X
        fake_Y = G(real_X)
        reconstructed_X = F(fake_Y)
        loss_cycle_X = L1(reconstructed_X, real_X)

        # Backward cycle: Y -> X -> Y
        fake_X = F(real_Y)
        reconstructed_Y = G(fake_X)
        loss_cycle_Y = L1(reconstructed_Y, real_Y)

        # Adversarial losses (least-squares form, which the CycleGAN paper
        # uses in practice in place of the log loss): push D(fake) toward 1
        loss_G_adv = ((D_Y(fake_Y) - 1) ** 2).mean()
        loss_F_adv = ((D_X(fake_X) - 1) ** 2).mean()

        # Total generator loss
        loss_G = (loss_G_adv + loss_F_adv +
                 lambda_cyc * (loss_cycle_X + loss_cycle_Y))
        loss_G.backward()
        optimizer_G.step()

        # ==================
        # Train Discriminators
        # ==================
        # D_Y discriminates Y domain: real -> 1, fake -> 0
        optimizer_D_Y.zero_grad()
        loss_D_Y_real = ((D_Y(real_Y) - 1) ** 2).mean()
        loss_D_Y_fake = (D_Y(fake_Y.detach()) ** 2).mean()
        loss_D_Y = 0.5 * (loss_D_Y_real + loss_D_Y_fake)
        loss_D_Y.backward()
        optimizer_D_Y.step()

        # D_X discriminates X domain: real -> 1, fake -> 0
        optimizer_D_X.zero_grad()
        loss_D_X_real = ((D_X(real_X) - 1) ** 2).mean()
        loss_D_X_fake = (D_X(fake_X.detach()) ** 2).mean()
        loss_D_X = 0.5 * (loss_D_X_real + loss_D_X_fake)
        loss_D_X.backward()
        optimizer_D_X.step()

Applications

Style transfer:

Object transfiguration:

Domain adaptation:

CycleGAN works best when the geometric structure is preserved between domains. It can change appearance, style, and texture, but cannot handle transformations that significantly alter object shape or layout.

7. EVALUATION METRICS

Evaluating GANs is challenging because we care about both sample quality (do images look realistic?) and sample diversity (do we cover the full distribution?). No single metric captures both perfectly.

Inception Score (IS)

What it measures:

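Inception Score uses a pre-trained Inception network to evaluate generated images based on two criteria: quality (each image should yield a confident class prediction, i.e. a low-entropy p(y|x)) and diversity (the predictions across all images should cover many classes, i.e. a high-entropy p(y)).

IS(G) = exp(E_{x ~ p_g}[KL(p(y|x) || p(y))])

Where:

p(y|x) is the Inception network's predicted class distribution for a generated image x, and p(y) is the marginal class distribution, obtained by averaging p(y|x) over the generated images.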

Interpretation:

Limitations:

Inception Score should not be used in isolation. It's possible to have a high IS but poor actual quality, or to have mode collapse within classes alongside a good IS: a generator that produces a single convincing image per class still scores highly.
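The formula can be worked through on a tiny example. The sketch below computes IS directly from a table of class probabilities; in a real evaluation those rows would come from an Inception network, whereas here they are hand-written 3-class predictions chosen to hit the extreme cases. The function name is illustrative.

```python
import math


def inception_score(probs):
    """IS from per-image class probabilities p(y|x); probs is a list of rows."""
    n, k = len(probs), len(probs[0])
    # Marginal p(y): average the per-image predictions.
    marginal = [sum(row[c] for row in probs) / n for c in range(k)]
    mean_kl = 0.0
    for row in probs:
        mean_kl += sum(p * math.log(p / marginal[c])
                       for c, p in enumerate(row) if p > 0) / n
    return math.exp(mean_kl)


# Confident predictions spread evenly over 3 classes: best case, IS = 3.
sharp_and_diverse = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Confident predictions that all land on one class: IS = 1.
sharp_but_collapsed = [[1, 0, 0], [1, 0, 0], [1, 0, 0]]
# Maximally uncertain prediction on every image: IS = 1.
blurry = [[1 / 3, 1 / 3, 1 / 3]] * 3

for name, table in [("diverse", sharp_and_diverse),
                    ("collapsed", sharp_but_collapsed),
                    ("blurry", blurry)]:
    print(name, inception_score(table))
```

The score ranges from 1 up to the number of classes. Note that `sharp_and_diverse` reaches the maximum with only one image per class, which illustrates why IS cannot detect a lack of variety within a class.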

Fréchet Inception Distance (FID)

What it measures:

FID compares the distribution of generated images to real images by looking at their features in the Inception network's feature space.

How it works:

  1. Pass real and generated images through Inception network
  2. Extract features from intermediate layer (before classification)
  3. Model both feature distributions as multivariate Gaussians
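  4. Compute Fréchet distance between the two Gaussians

FID = ||μ_r - μ_g||² + Tr(Σ_r + Σ_g - 2(Σ_r Σ_g)^(1/2))

Where:

μ_r and Σ_r are the mean and covariance of the real-image features, μ_g and Σ_g are those of the generated-image features, and Tr denotes the matrix trace.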

Interpretation:

[Figure: FID Visualization (Feature Space). The real distribution (mean μ_r, spread Σ_r) and the generated distribution (mean μ_g, spread Σ_g) appear as two clusters; FID measures the distance between these distributions in Inception feature space.]
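With a single feature dimension the covariance matrices become scalar variances, and the formula reduces to (μ_r - μ_g)² + (σ_r - σ_g)², which is easy to check by hand. The sketch below is only an intuition aid under that one-dimensional assumption; `fid_1d` and the sample values are illustrative, and real FID uses 2048-dimensional Inception features.

```python
import math
import statistics


def fid_1d(real_values, fake_values):
    """FID formula specialised to one feature dimension (scalar variance)."""
    mu_r = statistics.fmean(real_values)
    mu_g = statistics.fmean(fake_values)
    var_r = statistics.variance(real_values)  # sample variance (n - 1)
    var_g = statistics.variance(fake_values)
    return (mu_r - mu_g) ** 2 + var_r + var_g - 2.0 * math.sqrt(var_r * var_g)


real = [0.0, 1.0, 2.0, 3.0, 4.0]     # mean 2, variance 2.5
shifted = [3.0, 4.0, 5.0, 6.0, 7.0]  # same spread, mean moved by 3
narrow = [1.0, 1.5, 2.0, 2.5, 3.0]   # same mean, half the spread

print(fid_1d(real, real))     # identical distributions: 0
print(fid_1d(real, shifted))  # mean term only: 3^2 = 9
print(fid_1d(real, narrow))   # spread term only: 0.625
```

The second case shows FID penalizing samples that are off-target, and the third shows it penalizing reduced diversity even when the mean is right, which is the property that lets FID detect mode collapse.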

Advantages over IS:

Limitations:

Comparison of Metrics

Aspect                  Inception Score (IS)             Fréchet Inception Distance (FID)
What it measures        Quality + diversity of classes   Distance to real distribution
Better value            Higher is better                 Lower is better
Uses real data          No (only generator samples)      Yes (compares to real)
Detects mode collapse   Poorly                           Well
Samples needed          ~5,000                           ~10,000+
Computational cost      Low                              Medium
Human correlation       Moderate                         Better

Best Practice: Use FID as the primary metric for GAN evaluation, supplemented by visual inspection of samples and potentially IS for additional validation. No metric replaces human evaluation of sample quality and diversity.

PyTorch Implementation (FID)

import numpy as np
import torch
import torchvision
from scipy import linalg

def calculate_fid(real_features, fake_features):
    """
    Calculate Fréchet Inception Distance

    Args:
        real_features: Features from real images (N x D)
        fake_features: Features from generated images (M x D)

    Returns:
        fid_score: Scalar FID value (lower is better)
    """
    # Calculate mean and covariance
    mu_real = np.mean(real_features, axis=0)
    mu_fake = np.mean(fake_features, axis=0)

    sigma_real = np.cov(real_features, rowvar=False)
    sigma_fake = np.cov(fake_features, rowvar=False)

    # Calculate squared difference of means
    diff = mu_real - mu_fake
    mean_diff = diff.dot(diff)

    # Calculate sqrt of product of covariances
    covmean, _ = linalg.sqrtm(sigma_real.dot(sigma_fake), disp=False)

    # Handle numerical errors
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    # Calculate FID
    fid = mean_diff + np.trace(sigma_real + sigma_fake - 2*covmean)

    return fid

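# Extract features using pre-trained Inception (weights API needs torchvision >= 0.13)
# Inputs must be 3-channel batches resized to 299x299.
# Note: published FID numbers use the original TensorFlow Inception weights,
# so scores from torchvision's weights are not directly comparable to them.
inception_model = torchvision.models.inception_v3(
    weights=torchvision.models.Inception_V3_Weights.DEFAULT)
inception_model.fc = torch.nn.Identity()  # Remove final layer -> 2048-D features
inception_model.eval()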

def get_features(images):
    with torch.no_grad():
        features = inception_model(images)
    return features.cpu().numpy()

# Compute FID
real_features = get_features(real_images)
fake_features = get_features(generated_images)
fid_score = calculate_fid(real_features, fake_features)
print(f"FID Score: {fid_score:.2f}")

8. PRACTICAL IMPLEMENTATION

Key Takeaways from Assignment 8

The Assignment 8 implementation on FashionMNIST provides valuable insights into the practical differences between Vanilla GAN and WGAN-GP.

Vanilla GAN instability observed:

WGAN-GP stability improvements:

Batch-level volatility analysis:

Examining loss at the batch level reveals the extent of training instability:

Metric                              Vanilla GAN   WGAN-GP
Generator loss std dev              ~0.45         ~0.12
Discriminator/Critic loss std dev   ~0.38         ~0.09
Stability improvement               Baseline      Roughly 4× less volatile (0.45/0.12 ≈ 3.8, 0.38/0.09 ≈ 4.2)

These volatility metrics provide quantitative evidence for WGAN-GP's superior stability, complementing qualitative visual assessment.
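A volatility figure of this kind can be computed from logged per-batch losses with a standard deviation. The sketch below shows one plausible way to do it; the helper names and both loss traces are made up for illustration and are not the assignment's measurements.

```python
import statistics


def loss_volatility(batch_losses):
    """Sample standard deviation of per-batch loss values."""
    return statistics.stdev(batch_losses)


def volatility_ratio(baseline_losses, other_losses):
    """How many times more volatile the baseline run is than the other run."""
    return loss_volatility(baseline_losses) / loss_volatility(other_losses)


# Made-up loss traces for illustration only; not measurements.
erratic = [2.0, 0.5, 3.0, 0.5, 2.0, 1.0]
steady = [1.2, 1.1, 1.0, 1.0, 0.9, 0.8]

print(loss_volatility(erratic))           # 1.0
print(volatility_ratio(erratic, steady))  # about 7.07
```

One caveat on interpretation: BCE and Wasserstein losses live on different scales, so a ratio like this compares how much each curve fluctuates rather than how good the two models are.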

Complete Generator Architecture (PyTorch)

import torch
import torch.nn as nn

class Generator(nn.Module):
    """
    Generator network for FashionMNIST (28x28 grayscale images)
    Maps 64-dimensional noise to 784-dimensional image
    """
    def __init__(self, latent_dim=64, img_dim=784):
        super(Generator, self).__init__()

        self.model = nn.Sequential(
            # Input: latent_dim (64)
            nn.Linear(latent_dim, 256),
            nn.LeakyReLU(0.2, inplace=True),
            nn.BatchNorm1d(256),

            # Hidden layer 1
            nn.Linear(256, 512),
            nn.LeakyReLU(0.2, inplace=True),
            nn.BatchNorm1d(512),

            # Hidden layer 2
            nn.Linear(512, 1024),
            nn.LeakyReLU(0.2, inplace=True),
            nn.BatchNorm1d(1024),

            # Output: img_dim (784)
            nn.Linear(1024, img_dim),
            nn.Tanh()  # Output in [-1, 1] to match normalized images
        )

    def forward(self, z):
        """
        Args:
            z: Noise vector (batch_size, latent_dim)
        Returns:
            Generated image (batch_size, img_dim)
        """
        img = self.model(z)
        return img

Complete Critic Architecture (PyTorch)

class Critic(nn.Module):
    """
    Critic network for WGAN-GP
    NO sigmoid activation - outputs raw scores
    """
    def __init__(self, img_dim=784):
        super(Critic, self).__init__()

        self.model = nn.Sequential(
            # Input: img_dim (784)
            nn.Linear(img_dim, 512),
            nn.LeakyReLU(0.2, inplace=True),

            # Hidden layer 1
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2, inplace=True),

            # Output: single score (no sigmoid!)
            nn.Linear(256, 1)
        )

    def forward(self, img):
        """
        Args:
            img: Image (batch_size, img_dim)
        Returns:
            Critic score (batch_size, 1) - raw unbounded value
        """
        score = self.model(img)
        return score

Gradient Penalty Function (Complete)

def compute_gradient_penalty(critic, real_images, fake_images, device='cuda'):
    """
    Compute gradient penalty for WGAN-GP

    Enforces 1-Lipschitz constraint by penalizing gradients
    that deviate from norm = 1

    Args:
        critic: Critic network
        real_images: Batch of real images (B, 784)
        fake_images: Batch of generated images (B, 784)
        device: 'cuda' or 'cpu'

    Returns:
        gradient_penalty: Scalar penalty value
    """
    batch_size = real_images.size(0)

    # Random interpolation coefficient for each sample
    epsilon = torch.rand(batch_size, 1, device=device)

    # Interpolated samples between real and fake
    interpolated = epsilon * real_images + (1 - epsilon) * fake_images
    interpolated.requires_grad_(True)

    # Get critic scores for interpolated samples
    critic_interpolated = critic(interpolated)

    # Compute gradients of scores w.r.t. interpolated inputs
    gradients = torch.autograd.grad(
        outputs=critic_interpolated,
        inputs=interpolated,
        grad_outputs=torch.ones_like(critic_interpolated),
        create_graph=True,      # Allow backprop through this operation
        retain_graph=True,      # Don't free computation graph
        only_inputs=True        # Only compute w.r.t. inputs
    )[0]

    # Compute L2 norm of gradients for each sample
    gradients = gradients.view(batch_size, -1)
    gradient_norm = gradients.norm(2, dim=1)

    # Penalize deviation from norm = 1
    gradient_penalty = ((gradient_norm - 1) ** 2).mean()

    return gradient_penalty

Training Loop Structure (WGAN-GP)

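# Hyperparameters
latent_dim = 64
img_dim = 28 * 28
lr = 1e-4
beta1 = 0.5
beta2 = 0.9
n_critic = 5
lambda_gp = 10
num_epochs = 5
batch_size = 128

# Assumes `dataloader` yields (image, label) batches normalized to [-1, 1]
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")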

# Initialize models
generator = Generator(latent_dim, img_dim).to(device)
critic = Critic(img_dim).to(device)

# Optimizers
optimizer_G = torch.optim.Adam(generator.parameters(),
                               lr=lr, betas=(beta1, beta2))
optimizer_C = torch.optim.Adam(critic.parameters(),
                               lr=lr, betas=(beta1, beta2))

# Training loop
for epoch in range(num_epochs):
    for batch_idx, (real_images, _) in enumerate(dataloader):
        real_images = real_images.view(-1, img_dim).to(device)
        batch_size = real_images.size(0)

        # ==================
        # Train Critic (n_critic times per generator update)
        # ==================
        for _ in range(n_critic):
            optimizer_C.zero_grad()

            # Sample noise and generate fake images
            noise = torch.randn(batch_size, latent_dim, device=device)
            fake_images = generator(noise)

            # Critic scores on real and fake
            critic_real = critic(real_images).mean()
            critic_fake = critic(fake_images.detach()).mean()

            # Gradient penalty
            gp = compute_gradient_penalty(critic, real_images,
                                         fake_images.detach(), device)

            # Wasserstein loss with gradient penalty
            loss_C = critic_fake - critic_real + lambda_gp * gp

            loss_C.backward()
            optimizer_C.step()

        # ==================
        # Train Generator (once per n_critic critic updates)
        # ==================
        optimizer_G.zero_grad()

        # Generate fake images
        noise = torch.randn(batch_size, latent_dim, device=device)
        fake_images = generator(noise)

        # Generator wants critic to output high scores for fakes
        loss_G = -critic(fake_images).mean()

        loss_G.backward()
        optimizer_G.step()

        # ==================
        # Logging
        # ==================
        if batch_idx % 100 == 0:
            wasserstein_dist = (critic_real - critic_fake).item()
            print(f"Epoch [{epoch + 1}/{num_epochs}] Batch [{batch_idx}] "
                  f"Loss_G: {loss_G.item():.4f} "
                  f"Loss_C: {loss_C.item():.4f} "
                  f"W-dist: {wasserstein_dist:.4f} "
                  f"GP: {gp.item():.4f}")
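
To make the n_critic = 5 ratio concrete, the sketch below counts optimizer steps for the hyperparameters above. The dataset size of 60,000 images (the MNIST training split) is an assumption suggested by img_dim = 28 * 28; the lesson's dataloader may differ. It also assumes the dataloader keeps the final partial batch.

```python
import math

num_train_images = 60_000  # assumption: MNIST training split
batch_size = 128
num_epochs = 5
n_critic = 5

# One generator step per batch, n_critic critic steps per batch.
batches_per_epoch = math.ceil(num_train_images / batch_size)
generator_steps = batches_per_epoch * num_epochs
critic_steps = generator_steps * n_critic

print(f"batches per epoch: {batches_per_epoch}")
print(f"generator steps:   {generator_steps}")
print(f"critic steps:      {critic_steps}")
```

Note that in the loop as written, all n_critic critic steps within one iteration reuse the same real batch and only the noise is resampled.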

Common Pitfalls and Solutions

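Problem                         | Symptom                                                | Solution
--------------------------------+--------------------------------------------------------+----------------------------------------------------------------------------
Forgot to remove sigmoid        | Critic outputs always in [0, 1]                        | Ensure the critic has no sigmoid activation
Wrong sign in losses            | Losses move in the wrong direction                     | Critic minimizes critic_fake - critic_real; generator minimizes -critic_fake
Gradient penalty weight too low | Training unstable, mode collapse                       | Use λ = 10 (standard value)
Not enough critic updates       | Generator dominates, poor samples                      | Use n_critic = 5
Learning rate too high          | Oscillating losses, instability                        | Use lr = 1e-4 (conservative)
Forgot create_graph=True        | Penalty adds no gradient to the critic, often silently | Set create_graph=True in autograd.grad for GP

Debugging Checklist:
  1. Verify the critic has no sigmoid activation
  2. Check loss signs: the W-dist estimate (critic_real - critic_fake) should be positive and shrink as samples improve
  3. Monitor the gradient penalty: gradient norms should stay near 1 (roughly 0.5-2.0), so the penalty itself stays small
  4. Ensure n_critic > 1 (typically 5)
  5. Use a conservative learning rate (1e-4)
  6. Visualize samples every epoch to catch mode collapse early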
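
The sign conventions behind this checklist can be verified by hand with a few made-up numbers. The sketch below mirrors the loss expressions from the training loop in plain Python; the scores are hypothetical values chosen for illustration, not measurements from any run, and wgan_gp_losses is an illustrative helper name.

```python
def wgan_gp_losses(critic_real, critic_fake, gp, lambda_gp=10.0):
    """Return (loss_C, loss_G, W-distance estimate) from mean critic scores."""
    loss_c = critic_fake - critic_real + lambda_gp * gp  # critic minimizes this
    # The training loop scores a fresh fake batch for loss_G; the same mean
    # score is reused here to keep the arithmetic simple.
    loss_g = -critic_fake                                # generator minimizes this
    w_estimate = critic_real - critic_fake               # logged as "W-dist"
    return loss_c, loss_g, w_estimate

# Hypothetical healthy case: the critic scores real images above fakes.
loss_c, loss_g, w_est = wgan_gp_losses(critic_real=2.0, critic_fake=-1.5, gp=0.05)
print(f"loss_C={loss_c:.2f}  loss_G={loss_g:.2f}  W-dist={w_est:.2f}")

# Hypothetical sign bug: the critic has learned to score fakes above reals,
# so the logged W-dist estimate comes out negative.
_, _, w_swapped = wgan_gp_losses(critic_real=-1.5, critic_fake=2.0, gp=0.05)
print(f"W-dist with swapped scores={w_swapped:.2f}")
```

In the healthy case loss_C is negative while the W-dist estimate is positive, so a negative critic loss is expected and is not by itself a sign of a bug.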

9. COMPARISON SUMMARY

Vanilla GAN vs WGAN vs WGAN-GP

Feature              | Vanilla GAN              | WGAN                 | WGAN-GP
---------------------+--------------------------+----------------------+----------------------
Distance Metric      | JS Divergence (implicit) | Wasserstein Distance | Wasserstein Distance
Loss Function        | Binary Cross-Entropy     | Wasserstein loss     | Wasserstein loss + GP
Output Activation    | Sigmoid (0-1)            | None (raw scores)    | None (raw scores)
Network Name         | Discriminator            | Critic               | Critic
Lipschitz Constraint | None                     | Weight clipping      | Gradient penalty
Training Stability   | Unstable                 | More stable          | Very stable
Mode Collapse        | Common                   | Less common          | Rare
Meaningful Metric    | No                       | Yes (W-distance)     | Yes (W-distance)
Learning Rate        | ~2e-4                    | ~1e-4                | ~1e-4
Update Ratio (D/C:G) | 1:1                      | 5:1                  | 5:1
Sample Quality       | Good (if stable)         | Better               | Best
Training Time        | Fastest                  | Moderate             | Slowest (GP overhead)
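
The "Loss Function" and "Output Activation" rows can be illustrated on a single pair of raw scores. The sketch below is plain Python with made-up scores; bce_discriminator_loss and wasserstein_critic_loss are illustrative names, and the BCE form used is the standard discriminator loss -[log D(x) + log(1 - D(G(z)))] for one real and one fake sample. Scores are kept modest so that math.exp does not overflow.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def bce_discriminator_loss(score_real, score_fake):
    """Vanilla GAN: squash raw scores with sigmoid, then binary cross-entropy."""
    d_real = sigmoid(score_real)
    d_fake = sigmoid(score_fake)
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def wasserstein_critic_loss(score_real, score_fake):
    """WGAN / WGAN-GP (penalty term omitted): difference of raw scores."""
    return score_fake - score_real

# Hypothetical raw scores with a widening gap between real and fake.
score_pairs = [(1.0, -1.0), (5.0, -5.0), (20.0, -20.0)]
bce_values = [bce_discriminator_loss(r, f) for r, f in score_pairs]
w_values = [wasserstein_critic_loss(r, f) for r, f in score_pairs]

for (r, f), bce, w in zip(score_pairs, bce_values, w_values):
    print(f"real={r:>5.1f} fake={f:>6.1f}  BCE={bce:.6f}  Wasserstein={w:.1f}")
```

As the score gap widens, the BCE loss flattens toward 0 because the sigmoid saturates, while the Wasserstein loss keeps changing linearly with the gap. This is one way to see why the critic's raw scores stay informative when a sigmoid discriminator has become confident.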

When to Use Each Variant

Use Vanilla GAN when:

Use WGAN when:

Use WGAN-GP when:

Recommendation: For new GAN projects, start with WGAN-GP. It provides the best balance of stability, sample quality, and ease of training. Only fall back to Vanilla GAN if computational constraints demand it.

Evolution of GAN Training

Evolution Timeline:

2014: Vanilla GAN
  │
  ├─> Breakthrough: Adversarial training
  ├─> Problem: Unstable training, mode collapse
  │
  v
2017: WGAN
  │
  ├─> Innovation: Wasserstein distance
  ├─> Improvement: Meaningful metrics, smoother training
  ├─> Problem: Weight clipping limitations
  │
  v
2017: WGAN-GP
  │
  ├─> Innovation: Gradient penalty
  ├─> Improvement: Stable training, best quality
  ├─> Current: Standard for many applications
  │
  v
2017+: CycleGAN, StyleGAN, Progressive GAN, BigGAN...

Key Lessons Learned

From Vanilla GAN:

From WGAN:

From WGAN-GP:

The progression from Vanilla GAN to WGAN-GP demonstrates how theoretical insights (Wasserstein distance, Lipschitz continuity) combined with practical engineering (gradient penalty implementation) can dramatically improve model performance and training reliability.

END OF LESSON

You have completed the comprehensive guide to Generative Adversarial Networks.

Topics Covered:
Introduction • Vanilla GAN • Training Problems • Wasserstein GAN •
WGAN-GP • CycleGAN • Evaluation Metrics • Implementation • Comparison

Next Steps:
Practice implementing these architectures in PyTorch •
Study the Diffusion Models lesson for state-of-the-art generative modeling •
Review the Anki flashcards to reinforce key concepts
