Generative Adversarial Networks (GANs)

1. INTRODUCTION TO GANS

What are Generative Models?

Generative models are a class of machine learning models that learn to create new data samples that resemble the training data. Unlike discriminative models that learn to classify or predict labels, generative models learn the underlying probability distribution of the data itself.

            Key Concept: A generative model learns P(X), the probability distribution of data X, allowing it to generate new samples that look like they came from the same distribution.
        

The GAN Innovation

Generative Adversarial Networks (GANs), introduced by Ian Goodfellow and colleagues in 2014, revolutionized generative modeling by framing it as a competitive game between two neural networks.

Before GANs, generative models like Variational Autoencoders (VAEs) struggled to produce sharp, realistic images. GANs changed this by introducing an adversarial training process that pushes both networks to improve simultaneously.

The Two-Player Game Analogy

Think of a GAN as a game between a counterfeiter and a detective:

The Counterfeiter (Generator): Tries to create fake money that looks real
The Detective (Discriminator): Tries to distinguish real money from counterfeit

As the detective gets better at spotting fakes, the counterfeiter must improve their technique. As the counterfeiter produces more convincing fakes, the detective must become more discerning. This back-and-forth competition drives both to excellence.

┌─────────────────────────────────────────────────────────────────┐
│                        GAN GAME DYNAMICS                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Generator (G)                        Discriminator (D)        │
│   ┌───────────┐                        ┌───────────────┐        │
│   │  Noise z  │                        │               │        │
│   └─────┬─────┘                        │               │        │
│         │                              │               │        │
│         v                              │               │        │
│   ┌───────────┐         Fake x         │   Real/Fake   │        │
│   │   G(z)    │───────────────────────>│  Classifier   │        │
│   │ Generator │                        │               │        │
│   └───────────┘                        │               │        │
│         ^                              └───────┬───────┘        │
│         │                                      │                │
│         │           Gradient Signal            │                │
│         └──────────────────────────────────────┘                │
│                                                                 │
│   Real Data ──────────────────────────────> D                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Generator vs Discriminator Roles

Generator (G):

Input: Random noise vector z (typically from Gaussian distribution)
Output: Synthetic data x_fake = G(z)
Goal: Fool the discriminator into thinking fake data is real
Training objective: Maximize D(G(z))

Discriminator (D):

Input: Either real data x_real or fake data x_fake
Output: Probability that input is real (0 to 1)
Goal: Correctly classify real vs fake data
Training objective: Maximize D(x_real) and minimize D(x_fake)

The discriminator is trained on both real and fake data, while the generator never sees real data directly - it only learns from the discriminator's feedback.

2. VANILLA GAN FUNDAMENTALS

Architecture Overview

Generator Architecture: Noise → Image

The generator transforms a low-dimensional random noise vector into a high-dimensional data sample (e.g., an image). This is an upsampling process.

Generator: z (100D) → Image (28×28 = 784D)

┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐
│ Noise  │────>│Linear  │────>│Linear  │────>│Linear  │────> Image
│   z    │     │+ ReLU  │     │+ ReLU  │     │+ Tanh  │
│ (100)  │     │ (256)  │     │ (512)  │     │ (784)  │
└────────┘     └────────┘     └────────┘     └────────┘

Discriminator Architecture: Image → Probability

The discriminator is essentially a binary classifier that outputs a single probability value indicating whether the input is real or fake.

Discriminator: Image (784D) → Probability (1D)

┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐
│ Image  │────>│Linear  │────>│Linear  │────>│Linear  │────> P(real)
│ (784)  │     │+LeakyR │     │+LeakyR │     │+Sigmoid│
│        │     │ (512)  │     │ (256)  │     │  (1)   │
└────────┘     └────────┘     └────────┘     └────────┘
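The two diagrams above can be sketched as PyTorch modules. This is a minimal sketch under stated assumptions: inputs are flattened 28×28 images, the latent vector is 100-dimensional, and the LeakyReLU slope of 0.2 is borrowed from the critic shown later in these notes (the diagrams do not specify it). The class names Generator and Discriminator are illustrative.

```python
import torch
import torch.nn as nn


class Generator(nn.Module):
    """Maps a latent vector z (100-D) to a flattened image (784-D) in [-1, 1]."""

    def __init__(self, latent_dim=100, image_dim=784):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.ReLU(),
            nn.Linear(512, image_dim),
            nn.Tanh(),  # outputs lie in [-1, 1]; real images should be scaled to match
        )

    def forward(self, z):
        return self.model(z)


class Discriminator(nn.Module):
    """Maps a flattened image (784-D) to the probability that it is real."""

    def __init__(self, image_dim=784):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(image_dim, 512),
            nn.LeakyReLU(0.2),  # slope 0.2 is an assumption
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
            nn.Sigmoid(),  # probability in (0, 1)
        )

    def forward(self, x):
        return self.model(x)


z = torch.randn(16, 100)          # batch of 16 noise vectors
fake = Generator()(z)             # shape (16, 784)
prob = Discriminator()(fake)      # shape (16, 1)
print(fake.shape, prob.shape)
```

Because the generator ends in Tanh, real training images would need to be normalized to [-1, 1] before being shown to the discriminator.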

The Minimax Objective

The GAN training process is formalized as a minimax game where the generator tries to minimize what the discriminator tries to maximize.

min_G max_D V(D,G) = E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]

Breaking down the objective:

E_{x~p_data}[log D(x)]: Expected log probability that D correctly identifies real data
E_{z~p_z}[log(1 - D(G(z)))]: Expected log probability that D correctly identifies fake data
Discriminator (max): Wants both terms to be large (correct classifications)
Generator (min): Wants the second term to be small (fool the discriminator)

            Intuitive Explanation: The discriminator wants to output D(x_real) ≈ 1 and D(x_fake) ≈ 0, maximizing the objective. The generator wants D(G(z)) ≈ 1, which minimizes the second term and thus the overall objective.
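To make the value function concrete, the sketch below evaluates a sample estimate of V(D, G) for three hand-picked scenarios. The discriminator outputs are made-up numbers chosen for illustration, and value_fn is a hypothetical helper, not part of any library.

```python
import math


def value_fn(d_real, d_fake):
    """Sample estimate of V(D, G).

    d_real: discriminator outputs D(x) on real samples
    d_fake: discriminator outputs D(G(z)) on generated samples
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


# Discriminator winning: confident and correct on both kinds of input.
v_d_wins = value_fn([0.99, 0.98], [0.01, 0.02])
# Generator winning: D believes the fakes are real, so the second term is very negative.
v_g_wins = value_fn([0.99, 0.98], [0.95, 0.97])
# Equilibrium: D outputs 0.5 everywhere, so V = log(0.5) + log(0.5) = -log(4).
v_equilibrium = value_fn([0.5, 0.5], [0.5, 0.5])

print(v_d_wins, v_g_wins, v_equilibrium)
```

V is close to its maximum of 0 when the discriminator wins and drops sharply when the generator fools it, which is the tug-of-war the minimax notation describes.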
        

Binary Cross-Entropy Loss

In practice, the minimax objective is implemented using Binary Cross-Entropy (BCE) loss, which measures the difference between predicted and actual binary labels.

BCE(ŷ, y) = -[y log(ŷ) + (1-y) log(1-ŷ)], where ŷ is the predicted probability and y ∈ {0, 1} is the label

For the Discriminator:

# Real samples (label = 1)
loss_real = BCE(D(x_real), 1)

# Fake samples (label = 0)
loss_fake = BCE(D(G(z)), 0)

# Total discriminator loss
loss_D = loss_real + loss_fake

For the Generator:

# Generator wants D to output 1 for fake samples
loss_G = BCE(D(G(z)), 1)

The original GAN paper proposed minimizing log(1 - D(G(z))) for the generator, but in practice, maximizing log(D(G(z))) works better because it provides stronger gradients early in training.
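The losses above can be checked by hand with a scalar BCE. This is a minimal sketch: bce is a hypothetical helper taking (prediction, label) in the same order as the pseudocode, and the two discriminator outputs are made-up values.

```python
import math


def bce(pred, label, eps=1e-12):
    """Binary cross-entropy for one prediction in (0, 1) and a label in {0, 1}."""
    pred = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(pred) + (1 - label) * math.log(1.0 - pred))


d_real, d_fake = 0.9, 0.2  # hypothetical values of D(x_real) and D(G(z))

# Discriminator: equals -[log D(x_real) + log(1 - D(G(z)))]
loss_d = bce(d_real, 1) + bce(d_fake, 0)

# Generator, non-saturating form used in practice: equals -log D(G(z))
loss_g_non_saturating = bce(d_fake, 1)

# Generator, original minimax form: log(1 - D(G(z))), to be minimized
loss_g_saturating = math.log(1.0 - d_fake)

print(loss_d, loss_g_non_saturating, loss_g_saturating)
```

Minimizing loss_d is the same as maximizing the sample estimate of V(D, G), which is why the BCE formulation and the minimax objective agree for the discriminator.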

Training Algorithm

GANs use alternating optimization: train the discriminator for one or more steps, then train the generator for one step, and repeat.

Training Loop (Vanilla GAN):

FOR each training iteration:

  1. DISCRIMINATOR TRAINING STEP
    a. Sample minibatch of real data x_real from dataset
    b. Sample minibatch of noise z from prior p(z)
    c. Generate fake data: x_fake = G(z)
    d. Compute loss: L_D = BCE(D(x_real), 1) + BCE(D(x_fake), 0)
    e. Update D parameters by descending the gradient of L_D (equivalent to ascending V(D, G))

  2. GENERATOR TRAINING STEP
    a. Sample minibatch of noise z from prior p(z)
    b. Generate fake data: x_fake = G(z)
    c. Compute loss: L_G = BCE(D(x_fake), 1)
    d. Update G parameters by descending the gradient of L_G

PyTorch Implementation Example

# Simplified vanilla GAN training loop
for epoch in range(num_epochs):
    for real_images, _ in dataloader:
        batch_size = real_images.size(0)

        # ==================
        # Train Discriminator
        # ==================
        optimizer_D.zero_grad()

        # Real images
        real_labels = torch.ones(batch_size, 1)
        real_output = discriminator(real_images)
        loss_D_real = criterion(real_output, real_labels)

        # Fake images
        noise = torch.randn(batch_size, latent_dim)
        fake_images = generator(noise)
        fake_labels = torch.zeros(batch_size, 1)
        fake_output = discriminator(fake_images.detach())
        loss_D_fake = criterion(fake_output, fake_labels)

        # Total discriminator loss
        loss_D = loss_D_real + loss_D_fake
        loss_D.backward()
        optimizer_D.step()

        # ==================
        # Train Generator
        # ==================
        optimizer_G.zero_grad()

        # Generator wants D to output 1 for fake images
        noise = torch.randn(batch_size, latent_dim)
        fake_images = generator(noise)
        fake_output = discriminator(fake_images)
        real_labels = torch.ones(batch_size, 1)

        loss_G = criterion(fake_output, real_labels)
        loss_G.backward()
        optimizer_G.step()

3. TRAINING INSTABILITY & PROBLEMS

Despite their success, vanilla GANs are notoriously difficult to train. Several fundamental problems arise from the adversarial training dynamics.

Mode Collapse

What it is:

Mode collapse occurs when the generator learns to produce only a limited variety of samples, ignoring much of the data distribution. Instead of generating diverse outputs, it "collapses" to producing a few safe samples that fool the discriminator.

Why it happens:

The generator finds a few samples that reliably fool the discriminator
It has no incentive to explore other parts of the data distribution
The minimax objective doesn't explicitly reward diversity
Once the discriminator updates, the generator may jump to a different mode

Mode Collapse Visualization (MNIST digits):

Full Distribution:         Mode Collapse:
┌───────────────┐          ┌───────────────┐
│ 0  1  2  3  4 │          │ 7  7  7  7  7 │
│ 5  6  7  8  9 │   -->    │ 7  7  7  7  7 │
│ 3  1  5  2  8 │          │ 7  7  7  7  7 │
│ 9  4  6  0  1 │          │ 7  7  7  7  7 │
└───────────────┘          └───────────────┘
(Diverse samples)          (Only generates 7s)

In severe mode collapse, the generator might produce identical or near-identical samples regardless of the input noise vector z.
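One rough way to notice this symptom numerically is to measure how far apart generated samples are from each other. The sketch below is an illustrative diagnostic only, not a method from these notes: mean_pairwise_distance is a hypothetical helper, and the two batches are synthetic stand-ins for generator outputs.

```python
import math
import random


def mean_pairwise_distance(samples):
    """Average Euclidean distance over all pairs of samples (lists of floats)."""
    total, count = 0.0, 0
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            total += math.dist(samples[i], samples[j])
            count += 1
    return total / count


random.seed(0)
# Stand-in for a healthy generator: different z gives clearly different outputs.
diverse = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(32)]
# Stand-in for a collapsed generator: every z gives almost the same output.
collapsed = [[0.3 + random.gauss(0, 1e-3) for _ in range(8)] for _ in range(32)]

diverse_spread = mean_pairwise_distance(diverse)
collapsed_spread = mean_pairwise_distance(collapsed)
print(diverse_spread, collapsed_spread)
```

A spread near zero across many different noise vectors suggests collapse, although a healthy-looking spread does not by itself prove that all modes are covered.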

Vanishing/Exploding Gradients

When discriminator becomes too strong:

If the discriminator becomes very good at distinguishing real from fake, it outputs values very close to 0 or 1. This causes the gradient of log(1 - D(G(z))) to vanish, leaving the generator with no learning signal.

When D(G(z)) ≈ 0: ∇_{θ_G} log(1 - D(G(z))) ≈ 0, where θ_G are the generator's parameters and D ends in a sigmoid

This is known as the vanishing gradient problem. The generator stops learning because the discriminator is so confident that the samples are fake.

Loss of learning signal:

Early in training: Discriminator easily spots fakes (D(G(z)) ≈ 0)
Generator receives near-zero gradients
Generator improvement stalls
This is why we maximize log(D(G(z))) instead in practice

Gradient Flow Problem:

Strong Discriminator (D ≈ 1 for real, D ≈ 0 for fake)
        │
        v
Saturated Sigmoid (flat regions)
        │
        v
Near-zero gradients to Generator
        │
        v
Generator stops learning
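The saturation can be verified with a short calculation. Assuming D(G(z)) = sigmoid(a), where a is the discriminator's pre-sigmoid output (logit), the derivative of log(1 - sigmoid(a)) with respect to a is -sigmoid(a), while the derivative of -log(sigmoid(a)) is -(1 - sigmoid(a)). The sketch below evaluates both for a confident discriminator; the function names are illustrative.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def grad_saturating(a):
    """d/da log(1 - sigmoid(a)) = -sigmoid(a): the original minimax generator loss."""
    return -sigmoid(a)


def grad_non_saturating(a):
    """d/da [-log sigmoid(a)] = -(1 - sigmoid(a)): the loss used in practice."""
    return -(1.0 - sigmoid(a))


# Very negative logits mean the discriminator is sure the sample is fake.
for logit in (-8.0, -4.0, 0.0):
    print(logit, sigmoid(logit), grad_saturating(logit), grad_non_saturating(logit))
```

At a logit of -8 the saturating gradient is nearly zero while the non-saturating gradient has magnitude close to 1, so the generator keeps receiving a learning signal exactly when it is losing.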

Non-Convergence Issues

Unlike typical neural network training, where a single loss trends downward, GAN training involves two competing objectives that may never reach equilibrium.

Oscillating behavior:

Generator and discriminator losses oscillate without clear convergence
Neither network reaches a stable optimum
Training can be stable for epochs, then suddenly destabilize
No clear stopping criterion based on loss values

Epoch	G Loss	D Loss	Observation
1	2.45	0.89	D too strong
2	1.12	1.23	Better balance
3	3.78	0.45	Oscillation
4	0.67	2.01	G too strong
5	2.89	0.91	Instability

Lack of Meaningful Metrics

BCE loss doesn't correlate with quality:

The discriminator and generator losses provide little information about the actual quality of generated samples. You can have:

Low generator loss but poor samples: Generator found a way to fool D with low-quality images
High generator loss but good samples: Discriminator got better, but generator is still producing quality outputs
Balanced losses but mode collapse: Both networks in equilibrium, but generator only produces a few modes

            The Problem: You cannot look at loss curves alone to determine if your GAN is training well. You must visually inspect generated samples, which makes debugging and hyperparameter tuning extremely difficult.
        

Diagnostic challenges:

Cannot determine convergence from loss values
Cannot set early stopping criteria
Cannot compare different model architectures objectively
Must manually inspect samples throughout training

These fundamental problems motivated the development of improved GAN variants like Wasserstein GAN (WGAN), which addresses many of these issues through a different distance metric and training procedure.

4. WASSERSTEIN GAN (WGAN)

Motivation: Why We Need Better Distance Metrics

The instability of vanilla GANs stems from using the Jensen-Shannon (JS) divergence implicitly through the BCE loss. When the real and fake distributions have minimal overlap, the JS divergence becomes constant, providing no useful gradient.

The core problem:

Real data distribution p_data and generated distribution p_g often have disjoint supports
JS divergence = log(2) when supports don't overlap
Gradient becomes zero, halting learning
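The constant value is easy to reproduce on a toy discrete example. In this sketch the distributions are made-up histograms over six bins, and kl and js are small helper functions written for illustration.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


p_data = [0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
p_g_near = [0.0, 0.0, 0.5, 0.5, 0.0, 0.0]  # disjoint support, close to p_data
p_g_far = [0.0, 0.0, 0.0, 0.0, 0.5, 0.5]   # disjoint support, far from p_data

print(js(p_data, p_g_near), js(p_data, p_g_far), math.log(2))
```

Both disjoint cases give log(2), so moving the generated mass closer to the data does not change the divergence at all until the supports actually overlap.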

            Solution: Wasserstein GAN uses Earth Mover's Distance (Wasserstein distance), which provides meaningful gradients even when distributions don't overlap.
        

Earth Mover's Distance (Wasserstein Distance)

Intuitive explanation (moving piles of earth):

Imagine you have two piles of earth with different shapes. The Wasserstein distance measures the minimum amount of "work" needed to transform one pile into the other, where work = amount of earth × distance moved.

Earth Mover's Distance Intuition:

Distribution A:            Distribution B:
     ███                        ███
    █████                      █████
   ███████                    ███████
  █████████       -->        █████████
 ▀▀▀▀▀▀▀▀▀▀▀                ▀▀▀▀▀▀▀▀▀▀▀

W(A,B) = minimum cost to move earth from A to match B

Key property: W(A,B) is continuous and provides gradients even when A and B don't overlap

Mathematical definition:

W(p_r, p_g) = inf_{γ∈Π(p_r,p_g)} E_{(x,y)~γ}[||x - y||]

Where:

p_r is the real data distribution
p_g is the generated data distribution
γ is a joint distribution with marginals p_r and p_g
The infimum is taken over all possible joint distributions

Why Wasserstein distance is better:

Is continuous and differentiable almost everywhere, so it provides usable gradients
Correlates with sample quality (lower W = better samples)
Works even when distributions are far apart
No saturation issues
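In one dimension the Wasserstein-1 distance between two equal-size samples has a simple closed form: sort both samples and average the absolute differences. The sketch below uses that special case to show the distance shrinking as a fake sample moves toward the real one. The function name wasserstein_1d and the sample values are illustrative.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples.

    In one dimension the optimal transport plan pairs the sorted samples.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles equal-size samples")
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


real = [0.0, 0.1, 0.2, 0.3]
distances = []
for shift in (5.0, 2.0, 0.5):
    fake = [x + shift for x in real]  # same shape, moved by `shift`
    distances.append(wasserstein_1d(real, fake))

print(distances)
```

Each distance matches the shift (up to floating-point rounding), so the metric keeps telling the generator how far it has to move even when the two samples do not overlap.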

From Discriminator to Critic

WGAN replaces the discriminator with a "critic" that outputs raw scores instead of probabilities.

Aspect	Vanilla GAN Discriminator	WGAN Critic
Output activation	Sigmoid (0 to 1)	None (any real number)
Output interpretation	Probability of being real	Raw score (higher = more real)
Training objective	Maximize classification accuracy	Maximize separation between real and fake scores
Loss function	Binary Cross-Entropy	Wasserstein loss

Critic architecture (no sigmoid):

class Critic(nn.Module):
    def __init__(self):
        super(Critic, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(784, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1)
            # NO SIGMOID - outputs raw scores
        )

    def forward(self, x):
        return self.model(x)

Kantorovich-Rubinstein Duality

Computing the Wasserstein distance directly is intractable. The Kantorovich-Rubinstein duality theorem provides a practical way to compute it using a critic function.

W(p_r, p_g) = sup_{||f||_L≤1} ( E_{x~p_r}[f(x)] - E_{x~p_g}[f(x)] )

Where:

f is a 1-Lipschitz function (the critic)
||f||_L ≤ 1 means the function has Lipschitz constant at most 1
The supremum is taken over all 1-Lipschitz functions

            Key Insight: The critic network approximates the optimal f. By training the critic to maximize the difference between scores on real and fake data (while maintaining the Lipschitz constraint), we approximate the Wasserstein distance.
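A toy version of the dual problem can be solved by brute force over a handful of candidate functions. In this sketch the fake samples are the real samples shifted by 2, so the true Wasserstein distance is 2; the candidate critics are hand-picked 1-Lipschitz functions on the real line, not a trained network, and dual_objective is a hypothetical helper.

```python
def dual_objective(f, real, fake):
    """Sample estimate of E_{x~p_r}[f(x)] - E_{x~p_g}[f(x)]."""
    return sum(map(f, real)) / len(real) - sum(map(f, fake)) / len(fake)


real = [2.0, 3.0, 4.0]
fake = [0.0, 1.0, 2.0]  # real shifted down by 2, so W(p_r, p_g) = 2

# Hand-picked candidate critics, each with Lipschitz constant at most 1.
candidates = {
    "f(x) = x": lambda x: x,
    "f(x) = -x": lambda x: -x,
    "f(x) = |x - 2|": lambda x: abs(x - 2.0),
    "f(x) = 0.5 x": lambda x: 0.5 * x,
}

scores = {name: dual_objective(f, real, fake) for name, f in candidates.items()}
best = max(scores, key=scores.get)
print(scores)
print(best, scores[best])
```

The best candidate, f(x) = x, attains the value 2, matching the true distance. A critic network plays the same role, except that it searches over a much richer family of functions by gradient ascent.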
        

Lipschitz Constraint

A function f is 1-Lipschitz if for all x₁ and x₂:

|f(x₁) - f(x₂)| ≤ ||x₁ - x₂||

This means the function cannot change faster than the input changes - it has bounded gradients.
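The definition can be probed empirically by sampling pairs of points and recording the steepest slope observed. This is only a sketch: max_slope is a hypothetical helper, and random sampling gives a lower bound on the true Lipschitz constant rather than a proof.

```python
import math
import random


def max_slope(f, num_pairs=10_000, low=-5.0, high=5.0, seed=0):
    """Largest observed |f(x1) - f(x2)| / |x1 - x2| over random pairs.

    This is a lower bound on the Lipschitz constant of f on [low, high].
    """
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(num_pairs):
        x1, x2 = rng.uniform(low, high), rng.uniform(low, high)
        if x1 != x2:
            worst = max(worst, abs(f(x1) - f(x2)) / abs(x1 - x2))
    return worst


slope_tanh = max_slope(math.tanh)             # tanh has derivative at most 1
slope_steep = max_slope(lambda x: 3.0 * x)    # slope 3 everywhere
print(slope_tanh, slope_steep)
```

tanh never produces a ratio above 1, consistent with being 1-Lipschitz, while the function 3x violates the bound for every pair of points.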

Weight clipping (original WGAN approach):

The original WGAN paper enforced the Lipschitz constraint by clipping critic weights to a small range [-c, c] after each update.

# Weight clipping (original WGAN), applied after each critic optimizer step
with torch.no_grad():
    for param in critic.parameters():
        param.clamp_(-0.01, 0.01)  # Clip to [-c, c] with c = 0.01

Problems with weight clipping:

Capacity limitation: Constraining weights to a small range reduces the critic's capacity
Gradient explosion/vanishing: Can lead to pathological value surfaces
Slow training: Critic needs many updates to reach optimum
Hyperparameter sensitivity: Clipping threshold c is difficult to tune

These limitations of weight clipping led to the development of WGAN-GP (Gradient Penalty), which enforces the Lipschitz constraint in a more principled way.

WGAN Loss Functions

Critic loss (maximize):

L_C = E_{x~p_r}[C(x)] - E_{z~p_z}[C(G(z))]

Generator loss (minimize):

L_G = -E_{z~p_z}[C(G(z))]

PyTorch implementation:

# WGAN losses (without gradient penalty yet)
# Critic loss: detach the fakes so this update does not reach the generator
fake_images = generator(noise)
critic_real = critic(real_images).mean()
critic_fake = critic(fake_images.detach()).mean()
loss_C = -(critic_real - critic_fake)  # Maximize → minimize negative

# Generator loss: score fresh fakes with the critic
fake_images = generator(noise)
loss_G = -critic(fake_images).mean()  # Maximize critic score on fakes

            Key Difference: Notice there's no BCE loss, no labels (0 or 1), and no sigmoid activation. We're directly optimizing the difference between real and fake scores.
        

5. WGAN WITH GRADIENT PENALTY (WGAN-GP)

Gradient Penalty Concept

WGAN-GP improves upon WGAN by replacing weight clipping with a gradient penalty term that directly enforces the Lipschitz constraint by penalizing the gradient norm of the critic.

The idea:

A 1-Lipschitz function must have gradients with norm at most 1 everywhere. Instead of clipping weights, we add a penalty term that encourages ||∇f(x)||₂ = 1.

GP = E_{x̂~p_x̂}[(||∇_x̂ C(x̂)||₂ - 1)²]

The critic loss adds this term weighted by λ: L_C = C(x_fake) - C(x_real) + λ · GP.

Where:

λ is the gradient penalty coefficient (typically 10)
x̂ are interpolated samples between real and fake data
||∇_x̂ C(x̂)||₂ is the L2 norm of the critic's gradient with respect to x̂
We penalize deviations from gradient norm = 1

            Enforcing 1-Lipschitz constraint: By penalizing gradients that deviate from norm 1, we encourage (as a soft constraint, not a guarantee) the critic not to change too rapidly, approximating the Lipschitz constraint needed for the Wasserstein distance estimate.
        

Implementation Details

Interpolated samples:

We compute the gradient penalty on random interpolations between real and fake samples, not on real/fake data directly.

x̂ = ε · x_real + (1 - ε) · x_fake, where ε ~ Uniform(0, 1)

Why interpolations?

The optimal critic has gradients with norm 1 almost everywhere between real and fake distributions
Interpolations sample the space between distributions
This is where the Lipschitz constraint matters most
More efficient than sampling uniformly over entire input space

Gradient Computation

Full gradient penalty implementation (PyTorch):

def compute_gradient_penalty(critic, real_images, fake_images, device):
    """
    Compute gradient penalty for WGAN-GP

    Args:
        critic: Critic network
        real_images: Batch of real images
        fake_images: Batch of generated images
        device: 'cuda' or 'cpu'

    Returns:
        gradient_penalty: Scalar penalty value
    """
    batch_size = real_images.size(0)

    # One interpolation weight per sample, shaped to broadcast over the
    # remaining dimensions: (N, 1) for flattened inputs, (N, 1, 1, 1) for
    # image tensors of shape (N, C, H, W)
    epsilon_shape = [batch_size] + [1] * (real_images.dim() - 1)
    epsilon = torch.rand(epsilon_shape, device=device)

    # Interpolated samples, detached from the generator's graph
    interpolated = epsilon * real_images + (1 - epsilon) * fake_images.detach()
    interpolated.requires_grad_(True)

    # Critic scores for interpolated samples
    critic_interpolated = critic(interpolated)

    # Compute gradients of critic scores w.r.t. interpolated samples
    gradients = torch.autograd.grad(
        outputs=critic_interpolated,
        inputs=interpolated,
        grad_outputs=torch.ones_like(critic_interpolated),
        create_graph=True,   # keep the graph so the penalty itself is differentiable
        retain_graph=True,
    )[0]

    # Flatten gradients to shape (N, D)
    gradients = gradients.reshape(batch_size, -1)

    # Compute per-sample gradient norm
    gradient_norm = gradients.norm(2, dim=1)

    # Penalty for deviation from norm = 1 (λ is applied by the caller)
    gradient_penalty = ((gradient_norm - 1) ** 2).mean()

    return gradient_penalty

The create_graph=True flag is crucial - it allows us to backpropagate through the gradient computation itself, which is necessary for training the critic to have norm-1 gradients.
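A quick sanity check of the penalty uses a critic whose gradient is known in closed form. For a linear critic C(x) = x · w, the gradient with respect to x is w at every point, so the penalty must equal (||w||₂ - 1)². The sketch below is self-contained and uses a hypothetical toy_critic with w = (3, 4); it mirrors the steps of the function above rather than calling it.

```python
import torch

torch.manual_seed(0)

# Toy linear critic C(x) = x @ w; its input gradient is w for every x.
w = torch.tensor([3.0, 4.0])  # ||w||_2 = 5


def toy_critic(x):
    return x @ w


real = torch.randn(8, 2)
fake = torch.randn(8, 2)

# Interpolate between real and fake samples, one epsilon per sample.
eps = torch.rand(8, 1)
x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)

scores = toy_critic(x_hat)
(grads,) = torch.autograd.grad(
    outputs=scores,
    inputs=x_hat,
    grad_outputs=torch.ones_like(scores),
    create_graph=True,
)

grad_norm = grads.norm(2, dim=1)          # 5 for every sample
penalty = ((grad_norm - 1) ** 2).mean()   # (5 - 1)^2 = 16
print(grad_norm, penalty)
```

A critic this steep is heavily penalized, and training would push its weights toward a gradient norm of 1.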

Training Algorithm

WGAN-GP Training Procedure:

FOR each training iteration:

  FOR n_critic iterations (typically 5):
    1. Sample real data x_real and noise z
    2. Generate fake data: x_fake = G(z)
    3. Compute critic scores: C(x_real), C(x_fake)
    4. Compute gradient penalty on interpolated samples
    5. Total critic loss: L_C = C(x_fake) - C(x_real) + λ · GP
    6. Update critic parameters

  Generator update:
    1. Sample noise z
    2. Generate fake data: x_fake = G(z)
    3. Compute generator loss: L_G = -C(G(z))
    4. Update generator parameters

Key hyperparameters:

Parameter	Standard Value	Purpose
λ (lambda)	10	Gradient penalty coefficient
n_critic	5	Critic updates per generator update
Learning rate	1e-4 (0.0001)	Lower than vanilla GAN for stability
β₁ (Adam)	0.5	Decay rate of the first-moment (momentum) estimate
β₂ (Adam)	0.9	Decay rate of the second-moment (RMSprop-style) estimate
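The table translates into an optimizer setup along these lines. This is a sketch only: the two nn.Linear layers are placeholders standing in for the real generator and critic, and the variable names follow the ones used elsewhere in these notes.

```python
import torch
import torch.nn as nn

# Placeholder networks; substitute the actual generator and critic.
generator = nn.Linear(100, 784)
critic = nn.Linear(784, 1)

lambda_gp = 10   # gradient penalty coefficient
n_critic = 5     # critic updates per generator update

# Adam with lr = 1e-4 and (beta1, beta2) = (0.5, 0.9), as listed in the table.
optimizer_G = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.9))
optimizer_C = torch.optim.Adam(critic.parameters(), lr=1e-4, betas=(0.5, 0.9))

print(optimizer_C.defaults["lr"], optimizer_C.defaults["betas"])
```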

Complete Training Loop

# WGAN-GP Training Loop
for epoch in range(num_epochs):
    for i, (real_images, _) in enumerate(dataloader):
        real_images = real_images.to(device)
        batch_size = real_images.size(0)

        # ==================
        # Train Critic (n_critic times)
        # ==================
        for _ in range(n_critic):
            optimizer_C.zero_grad()

            # Generate fake images (detached: the critic update should not
            # backpropagate into the generator)
            noise = torch.randn(batch_size, latent_dim, device=device)
            fake_images = generator(noise).detach()

            # Critic scores
            critic_real = critic(real_images).mean()
            critic_fake = critic(fake_images).mean()

            # Gradient penalty
            gp = compute_gradient_penalty(critic, real_images,
                                         fake_images.detach(), device)

            # Total critic loss
            loss_C = critic_fake - critic_real + lambda_gp * gp

            loss_C.backward()
            optimizer_C.step()

        # ==================
        # Train Generator
        # ==================
        optimizer_G.zero_grad()

        noise = torch.randn(batch_size, latent_dim, device=device)
        fake_images = generator(noise)

        # Generator loss
        loss_G = -critic(fake_images).mean()

        loss_G.backward()
        optimizer_G.step()

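        # Log Wasserstein distance estimate from the last critic step,
        # converted to a Python float so no autograd graph is kept alive
        wasserstein_distance = (critic_real - critic_fake).item()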

Stability Improvements

Smoother loss curves:

WGAN-GP typically trains much more stably than a vanilla GAN. Its losses tend to decrease smoothly rather than oscillate.

[Figure: Loss Comparison Over Training. Left panel, Vanilla GAN: loss vs. iteration, erratic. Right panel, WGAN-GP: loss vs. iteration, smooth decrease.]

Meaningful Wasserstein distance metric:

W-distance correlates with sample quality
Lower W-distance = generated distribution closer to real
Provides interpretable training progress
Can use for early stopping or model selection

Better sample quality:

Sharper, more realistic images
Reduced mode collapse
More stable training allows longer training
Less sensitive to hyperparameters

            Empirical Results: In practice, WGAN-GP demonstrates 2-5x reduction in loss volatility, more consistent convergence across random seeds, and significantly improved sample quality compared to vanilla GAN with the same architecture and training time.
        

6. CYCLEGAN

Unpaired Image Translation Problem

Traditional image-to-image translation methods (like pix2pix) require paired training examples: input image A and corresponding output image B. CycleGAN solves the harder problem of translation without paired data.

The challenge:

Paired data is expensive or impossible to obtain (e.g., photo ↔ painting)
Need to learn mapping between two domains X and Y
Only have separate collections of images from each domain
No one-to-one correspondence between images

Paired vs Unpaired Data:

Paired (pix2pix):            Unpaired (CycleGAN):
Domain X    Domain Y         Domain X    Domain Y
┌──────┐    ┌──────┐         ┌──────┐    ┌──────┐
│ Cat  │────│ Edge │         │ Cat  │    │ Edge │
└──────┘    └──────┘         └──────┘    └──────┘
┌──────┐    ┌──────┐         ┌──────┐    ┌──────┐
│ Dog  │────│ Edge │         │ Horse│    │ Edge │
└──────┘    └──────┘         └──────┘    └──────┘
┌──────┐    ┌──────┐         ┌──────┐    ┌──────┐
│Horse │────│ Edge │         │ Dog  │    │ Edge │
└──────┘    └──────┘         └──────┘    └──────┘
(Aligned pairs)              (Unaligned collections)

Cycle Consistency Loss

The key innovation of CycleGAN is cycle consistency: if we translate from domain X to Y and back to X, we should get back the original image.

Forward cycle: X → Y → X

If we map image x from domain X to domain Y using generator G, then map it back using generator F, we should recover x.

Forward Cycle: x → G(x) → F(G(x)) ≈ x

Backward cycle: Y → X → Y

Similarly, mapping y from Y to X and back should recover y.

Backward Cycle: y → F(y) → G(F(y)) ≈ y

Cycle Consistency Visualization:

Forward Cycle
   x ────────> G(x) ────────> F(G(x))
(photo)     (painting)        (photo')
   │                             │
   └──────── x ≈ F(G(x)) ────────┘
             (should match!)

Backward Cycle
   y ────────> F(y) ────────> G(F(y))
(painting)    (photo)       (painting')
   │                             │
   └──────── y ≈ G(F(y)) ────────┘
             (should match!)

Cycle consistency loss:

L_cyc(G, F) = E_{x~X}[||F(G(x)) - x||₁] + E_{y~Y}[||G(F(y)) - y||₁]
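The cycle loss can be illustrated with toy "generators" acting on short vectors. This is a sketch under stated assumptions: G, F_good, and F_bad are made-up functions standing in for image translators, and l1 averages the absolute differences over the vector (the same normalization that PyTorch's nn.L1Loss uses by default) rather than summing them.

```python
def l1(a, b):
    """Mean absolute difference between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_loss(G, F, xs, ys):
    """Mean ||F(G(x)) - x||_1 over xs plus mean ||G(F(y)) - y||_1 over ys."""
    forward = sum(l1(F(G(x)), x) for x in xs) / len(xs)
    backward = sum(l1(G(F(y)), y) for y in ys) / len(ys)
    return forward + backward


def G(x):
    """Toy translator X -> Y: doubles every value."""
    return [2.0 * v for v in x]


def F_good(y):
    """Exact inverse of G, so both cycles reconstruct their input."""
    return [v / 2.0 for v in y]


def F_bad(y):
    """Inverse of G with a constant offset, so the cycles do not close."""
    return [v / 2.0 + 0.5 for v in y]


xs = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]
ys = [[2.0, 2.0, 2.0], [6.0, 0.0, -2.0]]

loss_good = cycle_loss(G, F_good, xs, ys)
loss_bad = cycle_loss(G, F_bad, xs, ys)
print(loss_good, loss_bad)
```

The loss is zero only when F undoes G and G undoes F. That pressure is what keeps the translated image tied to the content of its input.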

Architecture

Two generators:

G: X → Y (e.g., photo → painting)
F: Y → X (e.g., painting → photo)

Two discriminators:

D_X: Distinguishes real X images from F(y)
D_Y: Distinguishes real Y images from G(x)

CycleGAN Architecture:

Domain X (Photos)                Domain Y (Paintings)
  ┌────┐                             ┌────┐
  │ x  │────────────G───────────────>│G(x)│───> D_Y
  └────┘                             └────┘
  ┌────┐                             ┌────┐
  │F(y)│<───────────F────────────────│ y  │
  └────┘                             └────┘
    │
    v
   D_X

Cycle: x → G(x) → F(G(x)) ≈ x
Cycle: y → F(y) → G(F(y)) ≈ y

Total Loss Function

The CycleGAN objective combines adversarial losses (to make translations realistic) with cycle consistency losses (to preserve content).

Adversarial losses:

L_GAN(G, D_Y) = E_{y~Y}[log D_Y(y)] + E_{x~X}[log(1 - D_Y(G(x)))]

L_GAN(F, D_X) = E_{x~X}[log D_X(x)] + E_{y~Y}[log(1 - D_X(F(y)))]

Full objective:

L(G, F, D_X, D_Y) = L_GAN(G, D_Y) + L_GAN(F, D_X) + λ · L_cyc(G, F)

Where λ controls the relative importance of cycle consistency (typically λ = 10).

Training Procedure

# CycleGAN Training (simplified)
# Assumes: L1 = nn.L1Loss(), bce = nn.BCELoss(), D_X and D_Y end in a sigmoid
# (matching the log-based adversarial losses above), and optimizer_G holds the
# parameters of both generators G and F
for epoch in range(num_epochs):
    for real_X, real_Y in dataloader:

        # ==================
        # Train Generators
        # ==================
        optimizer_G.zero_grad()

        # Forward cycle: X -> Y -> X
        fake_Y = G(real_X)
        reconstructed_X = F(fake_Y)
        loss_cycle_X = L1(reconstructed_X, real_X)

        # Backward cycle: Y -> X -> Y
        fake_X = F(real_Y)
        reconstructed_Y = G(fake_X)
        loss_cycle_Y = L1(reconstructed_Y, real_Y)

        # Adversarial losses: each generator wants its discriminator to output 1
        pred_fake_Y = D_Y(fake_Y)
        pred_fake_X = D_X(fake_X)
        loss_G_adv = bce(pred_fake_Y, torch.ones_like(pred_fake_Y))
        loss_F_adv = bce(pred_fake_X, torch.ones_like(pred_fake_X))

        # Total generator loss
        loss_G = (loss_G_adv + loss_F_adv +
                  lambda_cyc * (loss_cycle_X + loss_cycle_Y))
        loss_G.backward()
        optimizer_G.step()

        # ==================
        # Train Discriminators
        # ==================
        # D_Y: real Y images -> 1, translated images G(x) -> 0
        optimizer_D_Y.zero_grad()
        pred_real_Y = D_Y(real_Y)
        pred_fake_Y = D_Y(fake_Y.detach())
        loss_D_Y = (bce(pred_real_Y, torch.ones_like(pred_real_Y)) +
                    bce(pred_fake_Y, torch.zeros_like(pred_fake_Y)))
        loss_D_Y.backward()
        optimizer_D_Y.step()

        # D_X: real X images -> 1, translated images F(y) -> 0
        optimizer_D_X.zero_grad()
        pred_real_X = D_X(real_X)
        pred_fake_X = D_X(fake_X.detach())
        loss_D_X = (bce(pred_real_X, torch.ones_like(pred_real_X)) +
                    bce(pred_fake_X, torch.zeros_like(pred_fake_X)))
        loss_D_X.backward()
        optimizer_D_X.step()

Applications

Style transfer:

Photos to paintings (Monet, Van Gogh, Cezanne styles)
Paintings to photorealistic images
Day to night scenes
Summer to winter landscapes

Object transfiguration:

Horses ↔ Zebras
Apples ↔ Oranges
Dogs ↔ Cats

Domain adaptation:

Synthetic data to real-world appearance
Simulation to reality transfer
Medical image modality translation (MRI ↔ CT)

CycleGAN works best when the geometric structure is preserved between domains. It can change appearance, style, and texture, but cannot handle transformations that significantly alter object shape or layout.

7. EVALUATION METRICS

Evaluating GANs is challenging because we care about both sample quality (do images look realistic?) and sample diversity (do we cover the full distribution?). No single metric captures both perfectly.

Inception Score (IS)

What it measures:

Inception Score uses a pre-trained Inception network to evaluate generated images based on two criteria:

Quality: Each image should be confidently classified as a specific class
Diversity: Overall distribution should cover all classes uniformly

IS(G) = exp(E_{x~p_g}[KL(p(y|x) || p(y))])

Where:

p(y|x) is the conditional class distribution (quality)
p(y) is the marginal class distribution (diversity)
KL is Kullback-Leibler divergence

Interpretation:

Higher IS = better quality and diversity
Random noise: IS ≈ 1
Perfect model on ImageNet: IS ≈ 300+
Good GAN on CIFAR-10: IS ≈ 8-10
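The formula can be evaluated by hand on toy class-probability tables. In this sketch each row plays the role of p(y|x) for one generated image; the tables are made up, no Inception network is involved, and inception_score is a hypothetical helper.

```python
import math


def inception_score(probs):
    """exp of the mean KL(p(y|x) || p(y)), given one row of class probabilities per image."""
    n, k = len(probs), len(probs[0])
    marginal = [sum(row[j] for row in probs) / n for j in range(k)]  # p(y)
    total_kl = 0.0
    for row in probs:
        total_kl += sum(p * math.log(p / marginal[j])
                        for j, p in enumerate(row) if p > 0)
    return math.exp(total_kl / n)


num_classes = 4
# Classifier is unsure about every image (e.g. noise-like samples).
uniform = [[1.0 / num_classes] * num_classes for _ in range(8)]
# Every image is confidently classified and all classes appear equally often.
confident_diverse = [[1.0 if j == i % num_classes else 0.0
                      for j in range(num_classes)] for i in range(8)]
# Every image is confidently classified, but always as the same class.
collapsed = [[1.0, 0.0, 0.0, 0.0] for _ in range(8)]

is_uniform = inception_score(uniform)
is_diverse = inception_score(confident_diverse)
is_collapsed = inception_score(collapsed)
print(is_uniform, is_diverse, is_collapsed)
```

The score ranges from 1 up to the number of classes. It reaches the maximum only when predictions are both confident and spread evenly across classes, and it falls back to 1 when every sample lands in a single class.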

Limitations:

Only meaningful for images from the domain Inception was trained on (ImageNet classes)
Doesn't compare to real data distribution directly
Can be fooled by mode collapse (if generator produces one perfect image per class)
Not sensitive to intra-class diversity

Inception Score should not be used in isolation. It's possible to have high IS but poor actual quality, or to have mode collapse with good IS if each mode is high quality.

Fréchet Inception Distance (FID)

What it measures:

FID compares the distribution of generated images to real images by looking at their features in the Inception network's feature space.

How it works:

Pass real and generated images through Inception network
Extract features from intermediate layer (before classification)
Model both feature distributions as multivariate Gaussians
Compute Fréchet distance between the two Gaussians

FID = ||μ_r - μ_g||² + Tr(Σ_r + Σ_g - 2(Σ_r Σ_g)^{1/2})

Where:

μ_r, Σ_r = mean and covariance of real data features
μ_g, Σ_g = mean and covariance of generated data features
Tr = trace of a matrix

Interpretation:

Lower FID = better (closer to real distribution)
FID = 0 means identical distributions
Good GANs: FID roughly 10-50 or lower (dataset dependent)
Poor GANs: FID > 100
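With scalar features the formula collapses to something that can be checked by hand: the trace term becomes σ_r² + σ_g² - 2σ_rσ_g, so FID = (μ_r - μ_g)² + (σ_r - σ_g)². The sketch below uses that one-dimensional special case with made-up feature values; fid_1d is a hypothetical helper, and real FID uses high-dimensional Inception features.

```python
import math
import statistics


def fid_1d(real, fake):
    """FID for scalar features: (mu_r - mu_g)^2 + (sigma_r - sigma_g)^2."""
    mu_r, mu_g = statistics.fmean(real), statistics.fmean(fake)
    var_r, var_g = statistics.variance(real), statistics.variance(fake)
    # Tr(S_r + S_g - 2 (S_r S_g)^{1/2}) reduces to var_r + var_g - 2 sqrt(var_r * var_g)
    return (mu_r - mu_g) ** 2 + var_r + var_g - 2.0 * math.sqrt(var_r * var_g)


real = [0.0, 1.0, 2.0, 3.0, 4.0]
shifted = [x + 3.0 for x in real]                  # same spread, mean moved by 3
squeezed = [2.0 + 0.5 * (x - 2.0) for x in real]   # same mean, half the spread

fid_same = fid_1d(real, real)
fid_shifted = fid_1d(real, shifted)
fid_squeezed = fid_1d(real, squeezed)
print(fid_same, fid_shifted, fid_squeezed)
```

The shifted sample is penalized through the mean term and the squeezed sample through the covariance term. This is why FID reacts to reduced diversity even when the average sample still looks right.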

FID Visualization (Feature Space):

Real Distribution:          Generated Distribution:
    ╭─────╮                      ╭─────╮
   ╱       ╲                    ╱       ╲
  │   μ_r   │                  │   μ_g   │
   ╲       ╱                    ╲       ╱
    ╰─────╯                      ╰─────╯
       ↑                            ↑
  Σ_r (spread)                 Σ_g (spread)

FID measures distance between these distributions in Inception feature space

Advantages over IS:

Directly compares to real data distribution
More sensitive to mode collapse and lack of diversity
Correlates better with human judgment of quality
More robust and consistent across different datasets
Detects both quality issues and diversity problems

Limitations:

Still relies on Inception network (biased toward ImageNet)
Gaussian assumption may not hold for complex distributions
Requires many samples for stable estimation (typically 10k+)
Cannot tell you what's wrong, just that something is wrong

Comparison of Metrics

Aspect	Inception Score (IS)	Fréchet Inception Distance (FID)
What it measures	Quality + diversity of classes	Distance to real distribution
Better value	Higher is better	Lower is better
Uses real data	No (only generator samples)	Yes (compares to real)
Detects mode collapse	Poorly	Well
Samples needed	~5,000	~10,000+
Computational cost	Low	Medium
Human correlation	Moderate	Better

            Best Practice: Use FID as the primary metric for GAN evaluation, supplemented by visual inspection of samples and potentially IS for additional validation. No metric replaces human evaluation of sample quality and diversity.
        

PyTorch Implementation (FID)

import torch
import torchvision
from scipy import linalg
import numpy as np

def calculate_fid(real_features, fake_features):
    """
    Calculate Fréchet Inception Distance

    Args:
        real_features: Features from real images (N x D)
        fake_features: Features from generated images (M x D)

    Returns:
        fid_score: Scalar FID value (lower is better)
    """
    # Calculate mean and covariance
    mu_real = np.mean(real_features, axis=0)
    mu_fake = np.mean(fake_features, axis=0)

    sigma_real = np.cov(real_features, rowvar=False)
    sigma_fake = np.cov(fake_features, rowvar=False)

    # Calculate squared difference of means
    diff = mu_real - mu_fake
    mean_diff = diff.dot(diff)

    # Calculate sqrt of product of covariances
    covmean, _ = linalg.sqrtm(sigma_real.dot(sigma_fake), disp=False)

    # Handle numerical errors
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    # Calculate FID
    fid = mean_diff + np.trace(sigma_real + sigma_fake - 2*covmean)

    return fid

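# Extract features using pre-trained Inception v3
# (torchvision >= 0.13 takes `weights=`; older releases used pretrained=True)
inception_model = torchvision.models.inception_v3(
    weights=torchvision.models.Inception_V3_Weights.DEFAULT)
inception_model.fc = torch.nn.Identity()  # Remove final layer; output is the 2048-d pooled feature
inception_model.eval()

def get_features(images):
    # images: float tensor (N, 3, 299, 299), normalized with ImageNet mean/std
    with torch.no_grad():
        features = inception_model(images)
    return features.cpu().numpy()

# Compute FID
# real_images / generated_images are assumed to be preprocessed as described in get_features;
# grayscale data such as FashionMNIST must first be resized and repeated to 3 channels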
real_features = get_features(real_images)
fake_features = get_features(generated_images)
fid_score = calculate_fid(real_features, fake_features)
print(f"FID Score: {fid_score:.2f}")
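A quick way to sanity-check an FID routine is to feed it inputs whose answer is known in advance. The sketch below is illustrative only: it uses random vectors as a hypothetical stand-in for Inception features, and it swaps `linalg.sqrtm` for an eigenvalue identity (the trace of the matrix square root of Σ_a Σ_b equals the sum of the square roots of its eigenvalues) so that it depends on NumPy alone. The name `frechet_distance` is an assumption of this example, not part of the assignment code.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (N x D) feature arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    sigma_a = np.cov(feats_a, rowvar=False)
    sigma_b = np.cov(feats_b, rowvar=False)

    # Trace of sqrt(sigma_a @ sigma_b) via eigenvalues (swap-in for linalg.sqrtm).
    # Tiny negative or imaginary parts are numerical noise, so clip them away.
    eigvals = np.linalg.eigvals(sigma_a @ sigma_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(sigma_a) + np.trace(sigma_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
feats = rng.normal(size=(2000, 8))  # hypothetical stand-in for Inception features

fid_same = frechet_distance(feats, feats)           # identical feature sets
fid_shifted = frechet_distance(feats, feats + 0.5)  # same spread, mean moved by 0.5 per dimension
print(fid_same, fid_shifted)
```

Identical feature sets should give a value close to 0. Shifting every feature by 0.5 leaves the covariance unchanged, so only the mean term contributes and the result should be close to 8 × 0.5² = 2.0.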

8. PRACTICAL IMPLEMENTATION

Key Takeaways from Assignment 8

The Assignment 8 implementation on FashionMNIST provides valuable insights into the practical differences between Vanilla GAN and WGAN-GP.

Vanilla GAN instability observed:

Loss curves oscillate wildly without clear convergence
Discriminator scores D(real) and D(fake) diverge from ideal 0.5
Training can suddenly destabilize even after stable epochs
Sample quality varies unpredictably across epochs
High batch-level volatility (large standard deviation)

WGAN-GP stability improvements:

Smooth, decreasing loss curves indicate real progress
Wasserstein distance correlates with visual sample quality
Gradient penalty stabilizes around 0.5-2.0 range
Consistent improvement in sample sharpness over epochs
2-5× reduction in loss volatility compared to Vanilla GAN

Batch-level volatility analysis:

Examining loss at the batch level reveals the extent of training instability:

Metric	Vanilla GAN	WGAN-GP
Generator loss std dev	~0.45	~0.12
Discriminator/Critic loss std dev	~0.38	~0.09
Stability improvement	Baseline	~3.7-4.2× less volatile (generator: 0.45/0.12, critic: 0.38/0.09)

These volatility metrics provide quantitative evidence for WGAN-GP's superior stability, complementing qualitative visual assessment.
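As a rough illustration of how such volatility figures could be derived, the sketch below computes the standard deviation of logged per-batch losses and the ratio between two runs. The helper names (`loss_volatility`, `volatility_ratio`) and the short loss list are hypothetical; only the four standard deviations come from the table above.

```python
from statistics import stdev

def loss_volatility(batch_losses):
    """Sample standard deviation of per-batch loss values."""
    return stdev(batch_losses)

def volatility_ratio(baseline_std, improved_std):
    """How many times less volatile the improved run is than the baseline."""
    return baseline_std / improved_std

# Hypothetical logged losses, only to show the call
example_std = loss_volatility([1.0, 2.0, 3.0])

# Standard deviations reported in the table above
gen_ratio = volatility_ratio(0.45, 0.12)
critic_ratio = volatility_ratio(0.38, 0.09)
print(f"generator: {gen_ratio:.2f}x, critic: {critic_ratio:.2f}x")
# prints "generator: 3.75x, critic: 4.22x"
```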

Complete Generator Architecture (PyTorch)

import torch
import torch.nn as nn

class Generator(nn.Module):
    """
    Generator network for FashionMNIST (28x28 grayscale images)
    Maps 64-dimensional noise to 784-dimensional image
    """
    def __init__(self, latent_dim=64, img_dim=784):
        super(Generator, self).__init__()

        self.model = nn.Sequential(
            # Input: latent_dim (64)
            nn.Linear(latent_dim, 256),
            nn.LeakyReLU(0.2, inplace=True),
            nn.BatchNorm1d(256),

            # Hidden layer 1
            nn.Linear(256, 512),
            nn.LeakyReLU(0.2, inplace=True),
            nn.BatchNorm1d(512),

            # Hidden layer 2
            nn.Linear(512, 1024),
            nn.LeakyReLU(0.2, inplace=True),
            nn.BatchNorm1d(1024),

            # Output: img_dim (784)
            nn.Linear(1024, img_dim),
            nn.Tanh()  # Output in [-1, 1] to match normalized images
        )

    def forward(self, z):
        """
        Args:
            z: Noise vector (batch_size, latent_dim)
        Returns:
            Generated image (batch_size, img_dim)
        """
        img = self.model(z)
        return img
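Because the generator contains BatchNorm1d layers, sampling from it in training mode with a single noise vector raises an error, and small batches give noisy statistics. A minimal sketch of sampling in eval mode follows. It assumes a compact stand-in network (`toy_generator`) with the same 64-in, 784-out contract as the Generator above, and a hypothetical helper named `sample_images`.

```python
import torch
import torch.nn as nn

# Compact stand-in with the same input/output contract as the Generator above
# (64-d noise in, 784-d Tanh output); layer sizes are reduced for brevity.
toy_generator = nn.Sequential(
    nn.Linear(64, 128),
    nn.LeakyReLU(0.2),
    nn.BatchNorm1d(128),
    nn.Linear(128, 784),
    nn.Tanh(),
)

def sample_images(generator, num_samples, latent_dim=64):
    """Draw samples in eval mode and return them as (N, 1, 28, 28) in [0, 1]."""
    device = next(generator.parameters()).device
    was_training = generator.training
    generator.eval()                      # BatchNorm uses running statistics
    with torch.no_grad():
        z = torch.randn(num_samples, latent_dim, device=device)
        flat = generator(z)               # (N, 784), values in [-1, 1]
    generator.train(was_training)         # restore the previous mode
    return (flat.view(num_samples, 1, 28, 28) + 1) / 2

images = sample_images(toy_generator, num_samples=1)
print(images.shape)
```

The rescaling from [-1, 1] to [0, 1] undoes the normalization implied by the Tanh output, which is what most image-saving utilities expect.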

Complete Critic Architecture (PyTorch)

class Critic(nn.Module):
    """
    Critic network for WGAN-GP
    NO sigmoid activation - outputs raw scores
    """
    def __init__(self, img_dim=784):
        super(Critic, self).__init__()

        self.model = nn.Sequential(
            # Input: img_dim (784)
            nn.Linear(img_dim, 512),
            nn.LeakyReLU(0.2, inplace=True),

            # Hidden layer 1
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2, inplace=True),

            # Output: single score (no sigmoid!)
            nn.Linear(256, 1)
        )

    def forward(self, img):
        """
        Args:
            img: Image (batch_size, img_dim)
        Returns:
            Critic score (batch_size, 1) - raw unbounded value
        """
        score = self.model(img)
        return score

Gradient Penalty Function (Complete)

def compute_gradient_penalty(critic, real_images, fake_images, device='cuda'):
    """
    Compute gradient penalty for WGAN-GP

    Enforces 1-Lipschitz constraint by penalizing gradients
    that deviate from norm = 1

    Args:
        critic: Critic network
        real_images: Batch of real images (B, 784)
        fake_images: Batch of generated images (B, 784)
        device: 'cuda' or 'cpu'

    Returns:
        gradient_penalty: Scalar penalty value
    """
    batch_size = real_images.size(0)

    # Random interpolation coefficient for each sample
    epsilon = torch.rand(batch_size, 1, device=device)

    # Interpolated samples between real and fake
    interpolated = epsilon * real_images + (1 - epsilon) * fake_images
    interpolated.requires_grad_(True)

    # Get critic scores for interpolated samples
    critic_interpolated = critic(interpolated)

    # Compute gradients of scores w.r.t. interpolated inputs
    gradients = torch.autograd.grad(
        outputs=critic_interpolated,
        inputs=interpolated,
        grad_outputs=torch.ones_like(critic_interpolated),
        create_graph=True,      # Allow backprop through this operation
        retain_graph=True,      # Don't free computation graph
        only_inputs=True        # Only compute w.r.t. inputs
    )[0]

    # Compute L2 norm of gradients for each sample
    gradients = gradients.view(batch_size, -1)
    gradient_norm = gradients.norm(2, dim=1)

    # Penalize deviation from norm = 1
    gradient_penalty = ((gradient_norm - 1) ** 2).mean()

    return gradient_penalty
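One way to check a gradient penalty implementation is to run it on a critic whose input gradient is known exactly. For a bias-free linear critic f(x) = w·x, the gradient with respect to x is w everywhere, so the penalty should be (‖w‖ − 1)². The sketch below is a self-contained check under that assumption; `gradient_penalty` is a condensed restatement of the function above, and `linear_critic` is a hypothetical helper.

```python
import torch
import torch.nn as nn

def gradient_penalty(critic, real, fake):
    """Condensed version of the penalty above for flat (B, D) inputs."""
    eps = torch.rand(real.size(0), 1, device=real.device)
    mixed = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(mixed)
    grads = torch.autograd.grad(
        outputs=scores,
        inputs=mixed,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,
    )[0]
    norms = grads.view(grads.size(0), -1).norm(2, dim=1)
    return ((norms - 1) ** 2).mean()

def linear_critic(weight_norm, dim=784):
    """Bias-free linear critic whose weight vector has the given L2 norm."""
    layer = nn.Linear(dim, 1, bias=False)
    with torch.no_grad():
        w = torch.randn(1, dim)
        layer.weight.copy_(weight_norm * w / w.norm())
    return layer

torch.manual_seed(0)
real = torch.randn(32, 784)
fake = torch.randn(32, 784)

gp_unit = gradient_penalty(linear_critic(1.0), real, fake).item()    # expect about 0
gp_double = gradient_penalty(linear_critic(2.0), real, fake).item()  # expect about (2 - 1)^2 = 1
print(gp_unit, gp_double)
```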

Training Loop Structure (WGAN-GP)

# Hyperparameters
latent_dim = 64
img_dim = 28 * 28
lr = 1e-4
beta1 = 0.5
beta2 = 0.9
n_critic = 5
lambda_gp = 10
num_epochs = 5
batch_size = 128

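# Device and data
# (assumes `dataloader` yields (image, label) FashionMNIST batches normalized to [-1, 1])
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Initialize models
generator = Generator(latent_dim, img_dim).to(device)
critic = Critic(img_dim).to(device)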

# Optimizers
optimizer_G = torch.optim.Adam(generator.parameters(),
                               lr=lr, betas=(beta1, beta2))
optimizer_C = torch.optim.Adam(critic.parameters(),
                               lr=lr, betas=(beta1, beta2))

# Training loop
for epoch in range(num_epochs):
    for batch_idx, (real_images, _) in enumerate(dataloader):
        real_images = real_images.view(-1, img_dim).to(device)
        batch_size = real_images.size(0)

        # ==================
        # Train Critic (n_critic times per generator update)
        # ==================
        for _ in range(n_critic):
            optimizer_C.zero_grad()

            # Sample noise and generate fake images
            noise = torch.randn(batch_size, latent_dim, device=device)
            fake_images = generator(noise)

            # Critic scores on real and fake
            critic_real = critic(real_images).mean()
            critic_fake = critic(fake_images.detach()).mean()

            # Gradient penalty
            gp = compute_gradient_penalty(critic, real_images,
                                         fake_images.detach(), device)

            # Wasserstein loss with gradient penalty
            loss_C = critic_fake - critic_real + lambda_gp * gp

            loss_C.backward()
            optimizer_C.step()

        # ==================
        # Train Generator (once per n_critic critic updates)
        # ==================
        optimizer_G.zero_grad()

        # Generate fake images
        noise = torch.randn(batch_size, latent_dim, device=device)
        fake_images = generator(noise)

        # Generator wants critic to output high scores for fakes
        loss_G = -critic(fake_images).mean()

        loss_G.backward()
        optimizer_G.step()

        # ==================
        # Logging
        # ==================
        if batch_idx % 100 == 0:
            wasserstein_dist = (critic_real - critic_fake).item()
            print(f"Epoch [{epoch}/{num_epochs}] Batch [{batch_idx}] "
                  f"Loss_G: {loss_G.item():.4f} "
                  f"Loss_C: {loss_C.item():.4f} "
                  f"W-dist: {wasserstein_dist:.4f} "
                  f"GP: {gp.item():.4f}")

Common Pitfalls and Solutions

Problem	Symptom	Solution
Forgot to remove sigmoid	Critic outputs always in [0,1]	Ensure critic has no sigmoid activation
Wrong sign in losses	Losses increase instead of decrease	Critic minimizes D(fake) - D(real); generator minimizes -D(fake)
Gradient penalty too low	Training unstable, mode collapse	Use λ = 10 (standard value)
Not enough critic updates	Generator dominates, poor samples	Use n_critic = 5
Learning rate too high	Oscillating losses, instability	Use lr = 1e-4 (conservative)
Forgot create_graph=True	Error during backward pass	Enable in autograd.grad for GP

            Debugging Checklist:
            Verify critic has no sigmoid activation
Check loss signs (should decrease over time)
Monitor gradient penalty (should be 0.5-2.0)
Ensure n_critic > 1 (typically 5)
Use conservative learning rate (1e-4)
Visualize samples every epoch to catch mode collapse early
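The first checklist item can be automated in part. The sketch below is a hypothetical helper (`find_sigmoid_layers`) that lists any `nn.Sigmoid` modules registered in a model. It only catches sigmoid layers declared as modules; a functional `torch.sigmoid` call inside `forward` would go unnoticed, so treat it as a first check, not a guarantee.

```python
import torch.nn as nn

def find_sigmoid_layers(model):
    """Return the names of all nn.Sigmoid submodules in the model."""
    return [name for name, module in model.named_modules()
            if isinstance(module, nn.Sigmoid)]

# A critic shaped like the one above: raw scores, no sigmoid
critic_ok = nn.Sequential(
    nn.Linear(784, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
)

# The same network with a leftover discriminator-style sigmoid
critic_bad = nn.Sequential(
    nn.Linear(784, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

print(find_sigmoid_layers(critic_ok))   # []
print(find_sigmoid_layers(critic_bad))  # ['5']
```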

        

9. COMPARISON SUMMARY

Vanilla GAN vs WGAN vs WGAN-GP

Feature	Vanilla GAN	WGAN	WGAN-GP
Distance Metric	JS Divergence (implicit)	Wasserstein Distance	Wasserstein Distance
Loss Function	Binary Cross-Entropy	Wasserstein loss	Wasserstein loss + GP
Output Activation	Sigmoid (0-1)	None (raw scores)	None (raw scores)
Network Name	Discriminator	Critic	Critic
Lipschitz Constraint	None	Weight clipping	Gradient penalty
Training Stability	Unstable	More stable	Very stable
Mode Collapse	Common	Less common	Rare
Meaningful Metric	No	Yes (W-distance)	Yes (W-distance)
Learning Rate	~2e-4	~1e-4	~1e-4
Update Ratio (D/C:G)	1:1	5:1	5:1
Sample Quality	Good (if stable)	Better	Best
Training Time	Fastest	Moderate	Slowest (GP overhead)
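To make the "Lipschitz Constraint" row concrete, the sketch below shows what weight clipping in the original WGAN looks like: after each critic optimizer step, every parameter is clamped to a small interval. The helper name `clip_critic_weights` is an assumption of this example, and the clip value of 0.01 is the commonly cited default, shown here as a starting point only.

```python
import torch
import torch.nn as nn

def clip_critic_weights(critic, clip_value=0.01):
    """Clamp every critic parameter to [-clip_value, clip_value] in place."""
    with torch.no_grad():
        for param in critic.parameters():
            param.clamp_(-clip_value, clip_value)

torch.manual_seed(0)
critic = nn.Sequential(
    nn.Linear(784, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
)

# In a WGAN loop this call would follow optimizer_C.step()
clip_critic_weights(critic)
max_abs = max(p.abs().max().item() for p in critic.parameters())
print(max_abs)
```

WGAN-GP drops this step entirely and adds the gradient penalty term to the critic loss instead, which is why the two variants differ only in how the constraint is enforced.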

When to Use Each Variant

Use Vanilla GAN when:

You have extensive GAN training experience
Computational resources are very limited
You can carefully tune hyperparameters
The task is simple and well-studied

Use WGAN when:

You need more stable training than Vanilla GAN
Computational cost of gradient penalty is prohibitive
You're willing to accept weight clipping limitations
You need a baseline for historical or research comparison

Use WGAN-GP when:

You want a sensible default for most applications
Training stability is crucial
You need meaningful progress metrics
You want to minimize mode collapse risk
Computational resources allow gradient penalty

            Recommendation: For new GAN projects, start with WGAN-GP. It provides the best balance of stability, sample quality, and ease of training. Only fall back to Vanilla GAN if computational constraints demand it.
        

Evolution of GAN Training

Evolution Timeline:

2014: Vanilla GAN
  │
  ├─> Breakthrough: Adversarial training
  ├─> Problem: Unstable training, mode collapse
  │
  v
2017: WGAN
  │
  ├─> Innovation: Wasserstein distance
  ├─> Improvement: Meaningful metrics, smoother training
  ├─> Problem: Weight clipping limitations
  │
  v
2017: WGAN-GP
  │
  ├─> Innovation: Gradient penalty
  ├─> Improvement: Stable training, best quality
  ├─> Current: Standard for many applications
  │
  v
2017+: CycleGAN, StyleGAN, Progressive GAN, BigGAN...

Key Lessons Learned

From Vanilla GAN:

Adversarial training is powerful but challenging
Loss values alone don't indicate training progress
Visual inspection is crucial for GAN development
Balancing generator and discriminator is critical

From WGAN:

Choice of distance metric matters enormously
Continuous, meaningful metrics enable better training
Theoretical foundations guide practical improvements
Lipschitz constraint is key to stability

From WGAN-GP:

Enforcing constraints through regularization works better than imposing hard constraints
Gradient penalties are more flexible than weight clipping
Stability enables longer training and better results
Extra computation for stability is often worth it

The progression from Vanilla GAN to WGAN-GP demonstrates how theoretical insights (Wasserstein distance, Lipschitz continuity) combined with practical engineering (gradient penalty implementation) can dramatically improve model performance and training reliability.

END OF LESSON