CMPUT 328 Assignment 7: Autoencoders and Generative Models

1. Introduction to Autoencoders

What is an Autoencoder?

An autoencoder is a neural network designed to learn efficient representations of data in an unsupervised manner. It consists of two main parts:

Encoder: Compresses input data into a lower-dimensional latent representation
Decoder: Reconstructs the original input from the latent representation

Key Objective: Learn to reconstruct input $x$ such that $\hat{x} \approx x$

Why Use Autoencoders?

Unsupervised Learning: Learn from unlabeled data
Dimensionality Reduction: Compress high-dimensional data
Feature Learning: Extract meaningful representations
Denoising: Remove noise from corrupted data
Generation: Create new data samples (with VAE)
Pre-training: Initialize networks for downstream tasks

Historical Context

1980s: Basic autoencoder concept introduced
2006: Deep autoencoders with layer-wise pre-training (Hinton & Salakhutdinov)
2013: Variational Autoencoders (Kingma & Welling)
2017: VQ-VAE for discrete latent spaces
Modern: Foundation for diffusion models and large-scale generation

2. Representation Learning

What is Representation Learning?

Representation learning is the automatic discovery of representations needed for feature detection or classification from raw data.

Why Learn Representations?

Traditional approach:
1. Hand-craft features (edges, textures, SIFT, HOG)
2. Train classifier on features

Autoencoder approach:
1. Learn features automatically from unlabeled data
2. Use learned features for downstream tasks

Benefits

No manual feature engineering
Captures data structure automatically
Transferable to multiple tasks
Scales to large datasets

Example: MNIST

For MNIST digits (784-dimensional):
Raw pixels: 28×28 = 784 features
Autoencoder: Learn a 32-dimensional representation
Result: 32 features retain most of the information needed to reconstruct the digit

---

3. Autoencoder Architecture

Basic Architecture

Input → Encoder → Latent Code → Decoder → Reconstruction
  x   →    φ    →      z      →    ψ    →      x̂

Mathematical Formulation

Encoder: $z = f_\phi(x)$ where $z \in \mathbb{R}^d$ and $d < \text{input\_dim}$

Decoder: $\hat{x} = g_\psi(z)$

Loss Function: Reconstruction error, for example $L = \|x - \hat{x}\|^2$

Architecture Types

1. Vanilla Autoencoder
Fully connected layers
Simple architecture
Good for structured data

2. Convolutional Autoencoder
Conv layers in encoder
Transposed conv in decoder
Ideal for images

3. Sparse Autoencoder
Adds sparsity constraint on $z$
Forces learning of sparse representations

4. Contractive Autoencoder
Penalizes derivative of hidden layer
Makes representations robust to small input changes

Example: MNIST Autoencoder

```python
import torch
import torch.nn as nn


class VanillaAutoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()

        # Encoder: input_dim -> 256 -> 128 -> latent_dim
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, latent_dim),
        )

        # Decoder: mirror of the encoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Linear(256, input_dim),
            nn.Sigmoid(),  # Output in [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)
        x_recon = self.decoder(z)
        return x_recon, z
```

Common Dimensions

For MNIST (28×28 = 784):
Input: 784
Hidden layers: 512 → 256 → 128 (the example above uses the smaller 256 → 128)
Latent: 32 or 64
Mirror architecture for decoder

Compression ratio: 784 / 32 = 24.5× compression!

---

4. Dimensionality Reduction

Autoencoders vs PCA

PCA (Principal Component Analysis):
Linear transformation
Finds orthogonal directions of maximum variance
$z = W^T x$ where $W$ are principal components

Autoencoder:
Non-linear transformation
Can learn complex manifolds
More expressive than PCA

When Autoencoders Match PCA

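If encoder and decoder are linear with no activation:

$$z = W_e x$$

$$\hat{x} = W_d z$$

With MSE loss, the optimal solution spans the same subspace as the top $d$ principal components. The learned directions need not be orthogonal or ordered by variance, so the weights themselves generally differ from the PCA components.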

When Autoencoders Excel

With non-linear activations (ReLU, tanh), autoencoders can:
Learn curved manifolds
Capture complex relationships
Often achieve lower reconstruction error than PCA at the same latent size

Visualization Example

For 3D data compressed to 2D:

PCA: Projects onto best-fitting plane
Autoencoder: Can learn curved 2D surface embedded in 3D

Implementation

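```python
import numpy as np
import torch
from sklearn.decomposition import PCA

# X_train: float array of shape (n_samples, 784), scaled to [0, 1]

# PCA
pca = PCA(n_components=32)
z_pca = pca.fit_transform(X_train)
X_recon_pca = pca.inverse_transform(z_pca)

# Autoencoder (train it first; an untrained model gives a meaningless MSE)
model = VanillaAutoencoder(input_dim=784, latent_dim=32)
model.eval()
with torch.no_grad():
    X_recon_ae, z_ae = model(torch.tensor(X_train, dtype=torch.float32))

# Compare reconstruction error
mse_pca = np.mean((X_train - X_recon_pca) ** 2)
mse_ae = np.mean((X_train - X_recon_ae.numpy()) ** 2)

print(f"PCA MSE: {mse_pca:.4f}")
print(f"Autoencoder MSE: {mse_ae:.4f}")
```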

---

5. Convolutional Autoencoders

Why Convolutional?

For image data:
Preserve spatial structure
Parameter efficiency (weight sharing)
Translation equivariance (a shifted input gives a shifted feature map; pooling adds some invariance)
Better features for visual data

Architecture

Encoder: Conv + Pooling (downsample)
Decoder: Transposed Conv (upsample)

Example Architecture

```python
import torch.nn as nn


class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()

        # Encoder: 28x28 -> 14x14 -> 7x7 -> 1x1
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # 28->14
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 14->7
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=7),                       # 7->1
        )

        # Decoder: 1x1 -> 7x7 -> 14x14 -> 28x28
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=7),              # 1->7
            nn.ReLU(),
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2,
                               padding=1, output_padding=1),        # 7->14
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2,
                               padding=1, output_padding=1),        # 14->28
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        x_recon = self.decoder(z)
        return x_recon, z
```

Output Size Calculation

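Convolution:

$$H_{\text{out}} = \left\lfloor \frac{H_{\text{in}} + 2p - k}{s} \right\rfloor + 1$$

Transposed Convolution:

$$H_{\text{out}} = (H_{\text{in}} - 1)\, s - 2p + k + p_{\text{out}}$$

where $k$ is the kernel size, $s$ the stride, $p$ the padding, and $p_{\text{out}}$ the output padding (dilation 1).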
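The size comments in the example architecture can be checked with a small helper. This is a minimal sketch using the standard output-size formulas for convolution and transposed convolution with dilation 1; the function names `conv_out` and `conv_transpose_out` are illustrative and not part of PyTorch.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one spatial dimension (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride=1, padding=0, output_padding=0):
    """Output length of a transposed convolution along one dimension (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# Encoder path from the example architecture: 28 -> 14 -> 7 -> 1
encoder_sizes = [28]
for kernel, stride, padding in [(3, 2, 1), (3, 2, 1), (7, 1, 0)]:
    encoder_sizes.append(conv_out(encoder_sizes[-1], kernel, stride, padding))

# Decoder path: 1 -> 7 -> 14 -> 28
decoder_sizes = [1]
for kernel, stride, padding, out_pad in [(7, 1, 0, 0), (3, 2, 1, 1), (3, 2, 1, 1)]:
    decoder_sizes.append(
        conv_transpose_out(decoder_sizes[-1], kernel, stride, padding, out_pad)
    )

print(encoder_sizes)  # [28, 14, 7, 1]
print(decoder_sizes)  # [1, 7, 14, 28]
```

The `output_padding=1` in the decoder is what makes the stride-2 layers land exactly on 14 and 28 rather than 13 and 27.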

Skip Connections (U-Net Style)

For better reconstruction, add skip connections:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UNetAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()

        # Encoder
        self.enc1 = nn.Conv2d(1, 64, 3, padding=1)
        self.enc2 = nn.Conv2d(64, 128, 3, padding=1)
        self.pool = nn.MaxPool2d(2)

        # Decoder
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = nn.Conv2d(128, 64, 3, padding=1)  # 128 = 64 + 64 (skip)
        self.dec2 = nn.Conv2d(64, 1, 3, padding=1)

    def forward(self, x):
        # Encoder
        e1 = F.relu(self.enc1(x))
        e2 = F.relu(self.enc2(self.pool(e1)))

        # Decoder with skip connection
        d1 = self.up1(e2)
        d1 = torch.cat([d1, e1], dim=1)  # Skip connection
        d1 = F.relu(self.dec1(d1))
        out = torch.sigmoid(self.dec2(d1))

        return out, e2
```

---

6. Training Strategies

Standard Training

```python
for epoch in range(num_epochs):
    for batch in dataloader:
        x = batch.to(device)

        # Forward pass
        x_recon, z = model(x)
        loss = criterion(x_recon, x)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Phased Training (Greedy Layer-wise)

Idea: Train layers one at a time (Hinton & Salakhutdinov, 2006)

Phase 1: Train first encoder-decoder pair

x → E1 → h1 → D1 → x̂

Phase 2: Freeze E1, train second pair on the hidden codes h1

h1 → E2 → h2 → D2 → ĥ1

Benefits:
Easier optimization
Better initialization
Historical: Important before batch normalization and good initialization

Modern Usage: Less common now, but still useful for very deep networks

Loss Functions

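1. Mean Squared Error (L2):

$$L_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{x}_i)^2$$

Best for continuous values
Penalizes large errors heavily

2. Binary Cross-Entropy:

$$L_{\text{BCE}} = -\frac{1}{n} \sum_{i=1}^n \left[x_i \log \hat{x}_i + (1 - x_i) \log (1 - \hat{x}_i)\right]$$

Best for binary images or pixel values normalized to [0, 1]
Use with sigmoid output

3. L1 Loss:

$$L_{1} = \frac{1}{n} \sum_{i=1}^n |x_i - \hat{x}_i|$$

More robust to outliers
Encourages sparse residuals (many exactly-matched pixels, a few larger errors)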
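To make the difference between the three losses concrete, here is a minimal pure-Python sketch. The helper names and the toy pixel values are illustrative assumptions; in practice one would use `F.mse_loss`, `F.binary_cross_entropy`, and `F.l1_loss` from PyTorch.

```python
import math


def mse_loss(x, x_hat):
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def l1_loss(x, x_hat):
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)


def bce_loss(x, x_hat, eps=1e-12):
    """Binary cross-entropy; x_hat must lie in (0, 1), e.g. a sigmoid output."""
    total = 0.0
    for a, b in zip(x, x_hat):
        b = min(max(b, eps), 1 - eps)  # clamp to avoid log(0)
        total -= a * math.log(b) + (1 - a) * math.log(1 - b)
    return total / len(x)


x = [0.0, 1.0, 1.0, 0.0]
spread = [0.1, 0.9, 0.9, 0.1]   # small error on every pixel
single = [0.0, 1.0, 1.0, 0.4]   # same total absolute error, on one pixel

for name, x_hat in [("spread", spread), ("single", single)]:
    print(f"{name}: MSE={mse_loss(x, x_hat):.3f}  "
          f"L1={l1_loss(x, x_hat):.3f}  BCE={bce_loss(x, x_hat):.3f}")
```

Both reconstructions have the same L1 loss, but MSE is four times larger when the error is concentrated in one pixel, which is what "penalizes large errors heavily" means.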

Regularization

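Sparsity Constraint:

$$L = L_{\text{recon}} + \lambda \|z\|_1$$

Forces sparse latent codes.

Weight Decay:

$$L = L_{\text{recon}} + \lambda \sum_i \|W_i\|_2^2$$

Reduces overfitting by keeping weights small.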

Training Tips

Normalize inputs to [0, 1] or [-1, 1]
Use batch normalization in encoder/decoder
Learning rate: Start with 1e-3, reduce if loss plateaus
Monitor reconstruction quality visually
Check for mode collapse: Ensure varied reconstructions
Gradient clipping if training unstable

---

7. Semi-Supervised Learning with Autoencoders

The Setup

Problem: Have lots of unlabeled data, few labeled samples

Solution: Use autoencoder for pre-training

Two-Stage Approach

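Stage 1: Unsupervised Pre-training

Train the autoencoder on all available data (labeled and unlabeled) using only the reconstruction loss.

Stage 2: Supervised Fine-tuning

Reuse the pre-trained encoder and add a classification head:

```python
class Classifier(nn.Module):
    def __init__(self, encoder, num_classes, latent_dim=32):
        super().__init__()
        self.encoder = encoder  # pre-trained encoder from Stage 1
        # latent_dim must match the encoder's output size
        self.classifier = nn.Linear(latent_dim, num_classes)

    def forward(self, x):
        z = self.encoder(x)
        return self.classifier(z)


# Fine-tune on labeled data (train_classifier is a user-supplied training routine)
model = Classifier(autoencoder.encoder, num_classes=10)
train_classifier(model, labeled_data)
```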

Why This Works

Feature learning: Encoder learns general features from all data
Regularization: Pre-training prevents overfitting on small labeled set
Transfer learning: Features transfer to classification task

Joint Training Alternative

Train both objectives simultaneously: $$L = L_{\text{recon}} + \alpha L_{\text{classification}}$$

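```python
class JointAutoencoder(nn.Module):
    # __init__ (not shown) defines self.encoder, self.decoder, and self.classifier

    def forward(self, x):
        z = self.encoder(x)
        x_recon = self.decoder(z)
        logits = self.classifier(z)
        return x_recon, logits


# Training loop
for batch_x, batch_y in dataloader:
    x_recon, logits = model(batch_x)

    loss_recon = F.mse_loss(x_recon, batch_x)
    # Unlabeled batches (batch_y is None) contribute only the reconstruction term
    loss_class = F.cross_entropy(logits, batch_y) if batch_y is not None else 0.0

    loss = loss_recon + alpha * loss_class

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```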

Expected Results

With 100 labeled MNIST samples: - No pre-training: ~70% accuracy - With autoencoder pre-training: ~85% accuracy

Key insight: Unlabeled data provides structure of data manifold

---

8. Denoising Autoencoders

What is a Denoising Autoencoder (DAE)?

Idea: Train autoencoder to reconstruct clean input from corrupted version

$$\tilde{x} = \text{Corrupt}(x)$$ $$\hat{x} = \text{Decoder}(\text{Encoder}(\tilde{x}))$$ $$L = \|x - \hat{x}\|^2$$

Types of Corruption

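1. Gaussian Noise: $\tilde{x} = x + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$

2. Masking Noise (Dropout): $\tilde{x} = x \odot m$ where each mask entry $m_i$ is 0 with probability $p$ and 1 otherwise

3. Salt-and-Pepper Noise: each pixel is set to 0 or 1 with probability $p$ and left unchanged otherwise

4. Blur: $\tilde{x} = k * x$, a convolution of the image with a smoothing kernel $k$ (for example, Gaussian)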
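Gaussian and masking noise appear in the implementation further down; salt-and-pepper noise does not, so here is a minimal pure-Python sketch of it. The function name `salt_and_pepper` and the choice to split the corruption probability evenly between salt and pepper are assumptions made for illustration.

```python
import random


def salt_and_pepper(pixels, p=0.1, rng=None):
    """Set each pixel to 0 (pepper) or 1 (salt) with total probability p."""
    rng = rng or random.Random()
    noisy = []
    for value in pixels:
        u = rng.random()
        if u < p / 2:
            noisy.append(0.0)      # pepper
        elif u < p:
            noisy.append(1.0)      # salt
        else:
            noisy.append(value)    # unchanged
    return noisy


rng = random.Random(0)
clean = [0.5] * 10_000            # a flat gray "image", flattened
noisy = salt_and_pepper(clean, p=0.2, rng=rng)

corrupted_fraction = sum(v != 0.5 for v in noisy) / len(noisy)
print(f"corrupted fraction: {corrupted_fraction:.3f}")  # close to p = 0.2
```

With tensors the same idea is usually written with a random mask per pixel, as in the masking branch of `add_noise`.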

Why Denoising?

Robustness: Learn features invariant to noise
Better representations: Forces model to capture data structure
Prevents memorization: Can't just copy input
Implicit regularization: Acts as data augmentation

Implementation

```python
class DenoisingAutoencoder(nn.Module):
    # __init__ (not shown) defines self.encoder and self.decoder

    def add_noise(self, x, noise_type='gaussian', noise_level=0.3):
        if noise_type == 'gaussian':
            noise = torch.randn_like(x) * noise_level
            return torch.clamp(x + noise, 0, 1)

        elif noise_type == 'masking':
            # Keep each pixel with probability 1 - noise_level
            mask = torch.rand_like(x) > noise_level
            return x * mask.float()

        return x

    def forward(self, x, add_noise=True):
        # Corrupt only during training; evaluate on clean inputs
        if add_noise and self.training:
            x_noisy = self.add_noise(x)
        else:
            x_noisy = x

        z = self.encoder(x_noisy)
        x_recon = self.decoder(z)
        return x_recon, z


# Training
model = DenoisingAutoencoder().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for batch in dataloader:
    x = batch.to(device)
    x_recon, z = model(x, add_noise=True)

    # Loss: reconstruct CLEAN input from NOISY input
    loss = F.mse_loss(x_recon, x)  # Note: compare to original x, not noisy!

    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss.backward()
    optimizer.step()
```

Noise Level Selection

Too low (σ < 0.1): Model might still memorize
Optimal (σ = 0.2-0.4): Good balance
Too high (σ > 0.5): Task becomes too difficult

Tip: Start with 0.3, adjust based on reconstruction quality

Applications

Image denoising: Remove noise from photos
Inpainting: Fill in missing regions
Super-resolution: Upscale low-res images
Artifact removal: Remove compression artifacts

---

9. Probability Primer for Generative Models

Random Variables

Discrete Random Variable $X$:
Takes countable values: $X \in \{x_1, x_2, \ldots, x_n\}$
Example: Die roll, $X \in \{1, 2, 3, 4, 5, 6\}$

Continuous Random Variable $X$:
Takes values in continuous range: $X \in \mathbb{R}$ or $X \in [a, b]$
Example: Height, temperature, pixel intensity

Probability Mass Function (PMF)

For discrete $X$: $$P(X = x_i) = p_i$$

Properties:
$0 \leq p_i \leq 1$ for all $i$
$\sum_{i=1}^n p_i = 1$

Example: Fair die, $P(X = i) = \frac{1}{6}$ for $i \in \{1, 2, \ldots, 6\}$

Probability Density Function (PDF)

For continuous $X$: $$p(x) = \frac{dP(X \leq x)}{dx}$$

Properties:
$p(x) \geq 0$ for all $x$
$\int_{-\infty}^{\infty} p(x) \, dx = 1$

Note: $p(x)$ can be > 1! It's a density, not a probability.

Probability of interval:

$$P(a \leq X \leq b) = \int_a^b p(x) \, dx$$
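The point that a density can exceed 1 while still integrating to 1 is easy to check numerically. The sketch below uses a narrow Gaussian with $\sigma = 0.1$ as an illustrative choice and a simple midpoint rule; the helper names are not from any library.

```python
import math


def gaussian_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


sigma = 0.1
peak = gaussian_pdf(0.0, sigma=sigma)                               # density at the mean
total = integrate(lambda x: gaussian_pdf(x, sigma=sigma), -1, 1)    # +/- 10 sigma
inside = integrate(lambda x: gaussian_pdf(x, sigma=sigma), -0.1, 0.1)

print(f"peak density        = {peak:.3f}")    # about 3.989, larger than 1
print(f"total mass          = {total:.4f}")   # about 1
print(f"P(-0.1 <= X <= 0.1) = {inside:.4f}")  # about 0.6827
```

Only the integral over an interval is a probability; the height of the curve at a single point is not.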

Expectation

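Discrete:

$$\mathbb{E}[X] = \sum_{i=1}^n x_i \, p_i$$

Continuous:

$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x \, p(x) \, dx$$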

Variance

$$\text{Var}(X) = \mathbb{E}[(X - \mu)^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

where $\mu = \mathbb{E}[X]$
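As a worked example, the expectation and variance of a fair die can be computed exactly with rational arithmetic. This sketch also confirms that the two variance expressions above agree.

```python
from fractions import Fraction

values = range(1, 7)
p = Fraction(1, 6)  # fair die: every face has probability 1/6

mean = sum(x * p for x in values)
var_definition = sum((x - mean) ** 2 * p for x in values)   # E[(X - mu)^2]
var_shortcut = sum(x**2 * p for x in values) - mean**2       # E[X^2] - (E[X])^2

print(mean, var_definition, var_shortcut)  # 7/2 35/12 35/12
```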

Multiple Random Variables

Joint Distribution: $p(x, y)$

Marginal Distribution:

$$p(x) = \int p(x, y) \, dy$$

Conditional Distribution:

$$p(x \mid y) = \frac{p(x, y)}{p(y)}$$

Independence:

$$p(x, y) = p(x) \, p(y)$$

---

10. Gaussian Distributions

Univariate Gaussian (1D)

$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Parameters:
$\mu$: Mean (center)
$\sigma^2$: Variance (spread)
$\sigma$: Standard deviation

Notation: $X \sim \mathcal{N}(\mu, \sigma^2)$

Properties

Symmetric around $\mu$
68-95-99.7 rule:
68% of data within $\mu \pm \sigma$
95% within $\mu \pm 2\sigma$
99.7% within $\mu \pm 3\sigma$
Maximum at $x = \mu$
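The 68-95-99.7 rule follows from the Gaussian CDF: the probability of falling within $k$ standard deviations of the mean is $\text{erf}(k / \sqrt{2})$, independent of $\mu$ and $\sigma$. A short check using only the standard library (the helper name `prob_within` is illustrative):

```python
import math


def prob_within(k):
    """P(|X - mu| <= k * sigma) for any Gaussian X."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sigma: {prob_within(k):.4f}")
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```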

Standard Normal

Special case: $\mu = 0$, $\sigma^2 = 1$ $$p(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$

Notation: $X \sim \mathcal{N}(0, 1)$

Multivariate Gaussian (Vector)

For $\mathbf{x} \in \mathbb{R}^d$: $$p(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$

Parameters:
$\boldsymbol{\mu} \in \mathbb{R}^d$: Mean vector
$\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}$: Covariance matrix

Notation: $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$

Diagonal Covariance

Often assume diagonal covariance: $$\boldsymbol{\Sigma} = \text{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2)$$

Then: $$p(\mathbf{x}) = \prod_{i=1}^d \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}\right)$$

Simplification: Variables are independent!

Sampling from Gaussian

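1D and multivariate:

```python
import numpy as np

# 1D
mu, sigma = 0, 1
samples = np.random.normal(mu, sigma, size=1000)

# Multivariate (here: 2D standard normal)
mean = np.zeros(2)
cov = np.eye(2)
samples_2d = np.random.multivariate_normal(mean, cov, size=1000)
```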

Why Gaussian for VAE?

Mathematically tractable: Easy to compute KL-divergence
Central Limit Theorem: Sums of many independent random effects tend toward a Gaussian
Universal: Can approximate many distributions
Differentiable: Easy to optimize

---

11. Conditional Density Functions

Definition

Conditional density $p(x \mid y)$: Density of $x$ given $y$

Bayes' Rule:

$$p(x \mid y) = \frac{p(y \mid x) \, p(x)}{p(y)}$$

where:
$p(x \mid y)$: Posterior
$p(y \mid x)$: Likelihood
$p(x)$: Prior
$p(y)$: Evidence (marginal)
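A small discrete example shows how the four quantities in Bayes' rule fit together. The numbers below are made up for illustration: $x$ is the class of an image and $y$ is the observation "the animal has pointy ears".

```python
prior = {"cat": 0.3, "dog": 0.7}        # p(x), made-up numbers
likelihood = {"cat": 0.9, "dog": 0.2}   # p(y = "pointy ears" | x), made up

evidence = sum(likelihood[x] * prior[x] for x in prior)               # p(y)
posterior = {x: likelihood[x] * prior[x] / evidence for x in prior}   # p(x | y)

print(f"p(y) = {evidence:.2f}")                          # p(y) = 0.41
print({x: round(p, 3) for x, p in posterior.items()})    # {'cat': 0.659, 'dog': 0.341}
```

Although cats are the less likely class a priori, observing pointy ears raises their probability from 0.3 to about 0.66.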

Example: Image Generation

Goal: Generate image $x$ conditioned on class $y$

$$p(x \mid y) = \text{"probability of image } x \text{ given it's class } y\text{"}$$

Examples: - $p(\text{image} \mid y=\text{"cat"})$: Distribution of cat images - $p(\text{image} \mid y=\text{"dog"})$: Distribution of dog images

Conditional Gaussian

$$p(x \mid y) = \mathcal{N}(x \mid \mu(y), \sigma^2(y))$$

Mean and variance depend on condition $y$!

Example:
$p(\text{height} \mid \text{gender=male}) = \mathcal{N}(175, 100)$
$p(\text{height} \mid \text{gender=female}) = \mathcal{N}(162, 90)$

Chain Rule of Probability

For multiple variables: $$p(x_1, x_2, \ldots, x_n) = p(x_1) p(x_2 \mid x_1) p(x_3 \mid x_1, x_2) \cdots p(x_n \mid x_1, \ldots, x_{n-1})$$

Application: Autoregressive models (PixelCNN, Transformers)

In VAE Context

Decoder is conditional density:

$$p_\theta(x \mid z)$$

"Probability of image $x$ given latent code $z$"

Encoder is approximate posterior:

$$q_\phi(z \mid x)$$

"Probability of latent $z$ given image $x$"

Conditional Generation Example

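```python
class ConditionalVAE(nn.Module):
    def __init__(self, num_classes, latent_dim):
        super().__init__()

        # Encoder: q(z | x, y); must output 2 * latent_dim features (mu and logvar)
        self.encoder = nn.Sequential(...)

        # Decoder: p(x | z, y)
        self.decoder = nn.Sequential(...)

        # Embed class label; a separate embedding per use so the shapes match
        self.class_embed_enc = nn.Embedding(num_classes, 2 * latent_dim)
        self.class_embed = nn.Embedding(num_classes, latent_dim)

    def encode(self, x, y):
        h = self.encoder(x)
        h = h + self.class_embed_enc(y)  # Condition on class
        mu, logvar = torch.chunk(h, 2, dim=1)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + torch.randn_like(std) * std

    def decode(self, z, y):
        z_cond = z + self.class_embed(y)  # Condition on class
        return self.decoder(z_cond)

    def forward(self, x, y):
        mu, logvar = self.encode(x, y)
        z = self.reparameterize(mu, logvar)
        x_recon = self.decode(z, y)
        return x_recon, mu, logvar
```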

---

12. Introduction to Image Generation

Types of Generative Models

1. Generative Adversarial Networks (GANs)
Two networks: Generator $G$ and Discriminator $D$
Adversarial training: $\min_G \max_D V(D, G)$
Pros: High-quality images
Cons: Training instability, mode collapse

2. Variational Autoencoders (VAE)
Probabilistic encoder and decoder
Optimize ELBO (Evidence Lower Bound)
Pros: Stable training, principled framework
Cons: Blurry images (in basic form)

3. Flow-Based Models (Normalizing Flows)
Invertible transformations
Exact likelihood computation
Pros: Exact inference, stable training
Cons: Architecture constraints

4. Diffusion Models
Learn to reverse noise process
State-of-the-art quality (Stable Diffusion, DALL-E 2)
Pros: Best quality, stable training
Cons: Slow sampling

Why VAE for This Course?

Principled: Based on probabilistic inference
Stable: Easier to train than GANs
Interpretable: Latent space has structure
Versatile: Can add conditions, disentangle factors
Foundation: Understanding VAE helps with other models

Generative vs Discriminative Models

Discriminative: $p(y \mid x)$
Learns decision boundary
Examples: Classifiers (ResNet, VGGNet)

Generative: $p(x)$ or $p(x, y)$
Learns data distribution
Can generate new samples
Examples: VAE, GAN, Diffusion

The Generation Pipeline

1. Sample latent code: z ~ N(0, I)
2. Decode to image: x = Decoder(z)
3. Result: New image x

Key insight: Latent space is continuous, so can:
Interpolate: $z_{\text{new}} = 0.5 z_1 + 0.5 z_2$
Arithmetic: $z_{\text{smile}} = z_{\text{smiling}} - z_{\text{neutral}}$
Explore: Walk through latent space

Evaluation Metrics

1. Reconstruction Quality (for autoencoders):
MSE, PSNR, SSIM

2. Generation Quality:
FID (Fréchet Inception Distance): Measures distribution similarity
IS (Inception Score): Measures quality and diversity
Human evaluation: Still the gold standard

3. Latent Space Quality:
Disentanglement: Independent factors
Smoothness: Nearby points → similar images
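Of the reconstruction metrics listed, PSNR is a direct function of MSE: $\text{PSNR} = 10 \log_{10}(\text{MAX}^2 / \text{MSE})$, where MAX is the largest possible pixel value. A minimal sketch, assuming pixel values in [0, 1] and toy data chosen only for illustration:

```python
import math


def mse(x, x_hat):
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def psnr(x, x_hat, max_value=1.0):
    """Peak signal-to-noise ratio in dB; higher means a closer reconstruction."""
    error = mse(x, x_hat)
    if error == 0:
        return math.inf
    return 10 * math.log10(max_value**2 / error)


original = [0.0, 0.5, 1.0, 0.5]
good = [0.0, 0.5, 0.9, 0.5]    # one pixel off by 0.1
bad = [0.3, 0.2, 0.6, 0.9]     # every pixel off by 0.3 or more

print(f"good: {psnr(original, good):.2f} dB")
print(f"bad:  {psnr(original, bad):.2f} dB")
```

SSIM, FID, and IS need image structure or a pretrained network, so they are normally computed with an existing library rather than by hand.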

---

13. Variational Autoencoders (VAE)

Motivation

Problem with standard autoencoders:
Latent space can have "holes"
Can't generate new samples (only reconstruct)
No structure in latent space

VAE solution:
Learn probabilistic encoding
Force latent distribution to match prior (usually $\mathcal{N}(0, I)$)
Can sample from prior to generate!

Probabilistic Formulation

Goal: Model data distribution $p(x)$

Latent variable model:

$$p_\theta(x) = \int p_\theta(x \mid z) \, p(z) \, dz$$

where: - $p(z) = \mathcal{N}(0, I)$: Prior (simple Gaussian) - $p_\theta(x \mid z)$: Decoder (learned)

Problem: Computing $p(x)$ requires intractable integral!

Variational Inference

Idea: Approximate intractable posterior $p(z \mid x)$ with tractable $q_\phi(z \mid x)$

Encoder: $q_\phi(z \mid x) = \mathcal{N}(z \mid \mu_\phi(x), \sigma_\phi^2(x))$

Learn $\mu_\phi$ and $\sigma_\phi$ with neural networks!

Evidence Lower Bound (ELBO)

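Derivation:

$$\log p(x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p(x)\right] = \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{p(z|x)}\right]$$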

$$= \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)} \frac{q_\phi(z|x)}{p(z|x)}\right]$$

$$= \underbrace{\mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right]}_{\text{ELBO}} + \underbrace{D_{KL}(q_\phi(z|x) \| p(z|x))}_{\geq 0}$$

Since $D_{KL} \geq 0$: $$\log p(x) \geq \text{ELBO}$$

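ELBO (Evidence Lower Bound):

$$\mathcal{L} = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}(q_\phi(z|x) \| p(z))$$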

VAE Loss Function

$$\text{Loss} = -\mathcal{L} = \underbrace{-\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]}_{\text{Reconstruction Loss}} + \underbrace{D_{KL}(q_\phi(z|x) \| p(z))}_{\text{KL Divergence}}$$

Two terms:

Reconstruction: How well can we reconstruct $x$ from $z$?
In practice: $\|x - \hat{x}\|^2$ (MSE)

KL Divergence: How different is $q(z|x)$ from prior $p(z)$?
Forces latent distribution to be $\mathcal{N}(0, I)$

The Reparameterization Trick

Problem: Can't backprop through sampling $z \sim q_\phi(z|x)$

Solution: Reparameterize!

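Original:

$$z \sim \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x))$$

Reparameterized:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$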

where $\odot$ is element-wise multiplication.

Now: Randomness is in $\epsilon$ (fixed distribution), and $z$ is deterministic function of $\mu, \sigma, \epsilon$

Backprop works!
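The trick can be illustrated without any autograd library. The sketch below is a scalar, pure-Python illustration under assumed values $\mu = 2$, $\sigma = 0.5$: it checks that $z = \mu + \sigma \epsilon$ has the intended distribution and that, for a fixed $\epsilon$, the derivative with respect to $\sigma$ is just $\epsilon$.

```python
import random
import statistics


def reparameterize(mu, sigma, eps):
    """z = mu + sigma * eps: deterministic in (mu, sigma) once eps is drawn."""
    return mu + sigma * eps


rng = random.Random(0)
mu, sigma = 2.0, 0.5
eps_samples = [rng.gauss(0.0, 1.0) for _ in range(20_000)]
z_samples = [reparameterize(mu, sigma, eps) for eps in eps_samples]

# The samples follow N(mu, sigma^2) ...
z_mean = statistics.fmean(z_samples)
z_std = statistics.pstdev(z_samples)
print(f"mean ~ {z_mean:.2f}, std ~ {z_std:.2f}")

# ... and for a fixed eps the partial derivatives are simple:
# dz/dmu = 1 and dz/dsigma = eps. Check the second by central differences.
eps, h = eps_samples[0], 1e-6
dz_dsigma = (reparameterize(mu, sigma + h, eps)
             - reparameterize(mu, sigma - h, eps)) / (2 * h)
print(f"dz/dsigma = {dz_dsigma:.6f}, eps = {eps:.6f}")
```

Because the randomness enters only through `eps`, gradients with respect to `mu` and `sigma` are well defined, which is what lets an autograd framework backpropagate through the sampling step.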

VAE Architecture

```python
class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=20):
        super().__init__()

        # Encoder: x -> mu, logvar
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 400),
            nn.ReLU(),
            nn.Linear(400, 400),
            nn.ReLU(),
        )
        self.fc_mu = nn.Linear(400, latent_dim)
        self.fc_logvar = nn.Linear(400, latent_dim)

        # Decoder: z -> x
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 400),
            nn.ReLU(),
            nn.Linear(400, 400),
            nn.ReLU(),
            nn.Linear(400, input_dim),
            nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.encoder(x)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)  # sigma = exp(0.5 * log(sigma^2))
        eps = torch.randn_like(std)    # Sample epsilon ~ N(0, 1)
        z = mu + eps * std             # z = mu + sigma * epsilon
        return z

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        x_recon = self.decode(z)
        return x_recon, mu, logvar
```

VAE Loss Implementation

```python
def vae_loss(x, x_recon, mu, logvar):
    # Reconstruction loss (binary cross-entropy, summed over pixels and batch)
    recon_loss = F.binary_cross_entropy(x_recon, x, reduction='sum')

    # KL divergence: D_KL(N(mu, sigma) || N(0, 1))
    # Formula: -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return recon_loss + kl_loss


# Training loop
model = VAE().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(num_epochs):
    for batch in dataloader:
        x = batch.to(device)

        x_recon, mu, logvar = model(x)
        loss = vae_loss(x, x_recon, mu, logvar)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Why VAE Works for Generation

Structured latent space: KL term forces $q(z|x) \approx \mathcal{N}(0, I)$
Smooth: Nearby $z$ → similar $x$
Complete: No holes in latent space
Sampling: $z \sim \mathcal{N}(0, I)$, then $x = \text{Decoder}(z)$

Generating New Images

```python
import matplotlib.pyplot as plt

# Sample latent codes from the prior z ~ N(0, I)
model.eval()
z = torch.randn(64, latent_dim).to(device)

# Decode
with torch.no_grad():
    samples = model.decode(z)

# Visualize
fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for i, ax in enumerate(axes.flat):
    ax.imshow(samples[i].cpu().reshape(28, 28), cmap='gray')
    ax.axis('off')
plt.show()
```

Interpolation in Latent Space

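```python
# Encode two images (x1, x2: flattened inputs of shape (1, 784))
with torch.no_grad():
    mu1, _ = model.encode(x1)
    mu2, _ = model.encode(x2)

    # Interpolate
    alphas = torch.linspace(0, 1, 10)
    for alpha in alphas:
        z_interp = alpha * mu1 + (1 - alpha) * mu2
        img_interp = model.decode(z_interp)
        # Display img_interp
```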

---

14. KL-Divergence

Definition

KL-divergence (Kullback-Leibler divergence) measures how one probability distribution $Q$ differs from another $P$:

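Discrete:

$$D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

Continuous:

$$D_{KL}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$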

Properties

Non-negative: $D_{KL}(P \| Q) \geq 0$
Zero iff equal: $D_{KL}(P \| Q) = 0$ iff $P = Q$
Asymmetric: $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$ in general
Not a metric: Doesn't satisfy triangle inequality
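The properties above can be checked on a small discrete example. This is a minimal sketch with two made-up three-outcome distributions; it assumes $Q(x) > 0$ wherever $P(x) > 0$, otherwise the divergence is infinite.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)

print(f"KL(P||Q) = {kl_pq:.4f}")
print(f"KL(Q||P) = {kl_qp:.4f}")                  # differs from KL(P||Q)
print(f"KL(P||P) = {kl_divergence(p, p):.4f}")    # zero when the distributions match
```

Both directions are non-negative but unequal, which is why the order of arguments in $D_{KL}(q_\phi(z|x) \| p(z))$ matters.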

Intuition

$D_{KL}(P \| Q)$ = "How much information is lost when $Q$ is used to approximate $P$"

Large $D_{KL}$: $P$ and $Q$ are very different
Small $D_{KL}$: $P$ and $Q$ are similar

Example: Two Gaussians

$$P = \mathcal{N}(\mu_1, \sigma_1^2), \quad Q = \mathcal{N}(\mu_2, \sigma_2^2)$$

$$D_{KL}(P \| Q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$

Special case: $Q = \mathcal{N}(0, 1)$

$$D_{KL}(\mathcal{N}(\mu, \sigma^2) \| \mathcal{N}(0, 1)) = \frac{1}{2}\left(\mu^2 + \sigma^2 - \log \sigma^2 - 1\right)$$
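The general two-Gaussian formula and the special case against $\mathcal{N}(0, 1)$ should agree when $\mu_2 = 0$ and $\sigma_2 = 1$. The sketch below checks this for one arbitrary pair of values; the function names are illustrative.

```python
import math


def kl_gaussians(mu1, sigma1, mu2, sigma2):
    """Closed-form D_KL(N(mu1, sigma1^2) || N(mu2, sigma2^2))."""
    return (math.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2) ** 2) / (2 * sigma2**2)
            - 0.5)


def kl_to_standard_normal(mu, logvar):
    """Special case Q = N(0, 1), written in terms of logvar = log(sigma^2)."""
    return 0.5 * (mu**2 + math.exp(logvar) - logvar - 1)


mu, sigma = 0.7, 1.5
general = kl_gaussians(mu, sigma, 0.0, 1.0)
special = kl_to_standard_normal(mu, math.log(sigma**2))

print(f"general formula: {general:.6f}")
print(f"special case:    {special:.6f}")
```

The second function takes `logvar` rather than `sigma` because that is the quantity a VAE encoder typically outputs.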

Multivariate Case

For $P = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and $Q = \mathcal{N}(0, I)$:

$$D_{KL}(P \| Q) = \frac{1}{2}\left(\text{tr}(\boldsymbol{\Sigma}) + \boldsymbol{\mu}^T\boldsymbol{\mu} - k - \log \det(\boldsymbol{\Sigma})\right)$$

where $k$ is dimensionality.

With diagonal covariance $\boldsymbol{\Sigma} = \text{diag}(\sigma_1^2, \ldots, \sigma_k^2)$:

$$D_{KL}(P \| Q) = \frac{1}{2} \sum_{i=1}^k \left(\mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1\right)$$

Implementation

```python
def kl_divergence(mu, logvar):
    """
    KL divergence between N(mu, sigma^2) and N(0, I).

    Args:
        mu: Mean of shape (batch, latent_dim)
        logvar: Log variance of shape (batch, latent_dim)

    Returns:
        KL divergence summed over latent dimensions, averaged over the batch
    """
    # Formula: -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
    return kl.mean()  # Average over batch
```

Why This Formula?

Start with definition: $$D_{KL}(q(z|x) \| p(z)) = \mathbb{E}_{q(z|x)}\left[\log \frac{q(z|x)}{p(z)}\right]$$

For $q(z|x) = \mathcal{N}(\mu, \sigma^2)$ and $p(z) = \mathcal{N}(0, 1)$:

$$D_{KL} = \mathbb{E}\left[\log q(z|x) - \log p(z)\right]$$

$$= \mathbb{E}\left[-\frac{(z-\mu)^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2) + \frac{z^2}{2} + \frac{1}{2}\log(2\pi)\right]$$

After simplification (taking expectation over $z \sim \mathcal{N}(\mu, \sigma^2)$):

$$D_{KL} = \frac{1}{2}\left(\mu^2 + \sigma^2 - \log \sigma^2 - 1\right)$$

In code, we use $\log \sigma^2$ (logvar) directly:

$$D_{KL} = \frac{1}{2}\left(\mu^2 + \exp(\text{logvar}) - \text{logvar} - 1\right)$$

Role in VAE

KL term regularizes latent space:

Forces $q(z|x) \approx p(z)$: Latent codes follow standard normal
Prevents overfitting: Can't just memorize training data
Enables sampling: Can sample $z \sim \mathcal{N}(0, I)$ to generate

Trade-off:
High KL weight: Blurry reconstructions (too much regularization)
Low KL weight: Better reconstructions but worse generation (latent space not normalized)

β-VAE

Adjust KL weight: $$\text{Loss} = \text{Recon} + \beta \cdot D_{KL}$$

$\beta < 1$: Better reconstruction
$\beta > 1$: More disentangled latent space
$\beta = 1$: Standard VAE

---

15. Vector Quantized VAE (VQ-VAE)

Motivation

Problem with standard VAE:
Continuous latent space
Blurry reconstructions (due to sampling)
Difficult to model complex distributions

VQ-VAE solution:
Discrete latent space
Learn a codebook of embeddings
Quantize encoder output to nearest codebook entry

Architecture

Input → Encoder → Continuous z_e → Quantization → Discrete z_q → Decoder → Output

Key difference: Replace sampling with vector quantization

The Codebook

Codebook $\mathcal{C} = \{e_1, e_2, \ldots, e_K\}$ where $e_k \in \mathbb{R}^D$

$K$: Number of codebook entries (e.g., 512)
$D$: Embedding dimension (e.g., 64)

Quantization: Map encoder output $z_e$ to nearest codebook entry $e_k$

$$z_q = e_k \text{ where } k = \arg\min_j \|z_e - e_j\|_2$$
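For a single vector, the quantization step is a nearest-neighbor lookup. Here is a minimal pure-Python sketch with a toy codebook ($K = 3$, $D = 2$); the values and helper names are illustrative only.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(z_e, codebook):
    """Return (index k, codebook entry e_k) nearest to the encoder output z_e."""
    k = min(range(len(codebook)), key=lambda j: squared_distance(z_e, codebook[j]))
    return k, codebook[k]


# Toy codebook with K = 3 entries of dimension D = 2
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

k, z_q = quantize([0.8, 1.3], codebook)
print(k, z_q)  # 1 [1.0, 1.0]
```

Minimizing the squared distance gives the same index as minimizing the L2 norm, which is why implementations usually skip the square root.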

VQ-VAE Loss

Three components:

$$\text{Loss} = \underbrace{\|x - \hat{x}\|^2}_{\text{Reconstruction}} + \underbrace{\|\text{sg}[z_e] - e_k\|_2^2}_{\text{Codebook}} + \underbrace{\beta \|z_e - \text{sg}[e_k]\|_2^2}_{\text{Commitment}}$$

where $\text{sg}[\cdot]$ means stop gradient (no backprop through this term).

1. Reconstruction Loss: Standard autoencoder loss

2. Codebook Loss: Moves codebook entries $e_k$ towards encoder outputs $z_e$
Updates codebook
Gradient only flows to $e_k$

3. Commitment Loss: Forces encoder to commit to a codebook entry
Prevents encoder from growing arbitrarily
Gradient only flows to encoder

Straight-Through Estimator

Problem: Quantization is non-differentiable!

$$z_q = e_k, \quad k = \arg\min_j \|z_e - e_j\|_2$$

Solution: In backward pass, copy gradient from decoder to encoder

$$\frac{\partial \text{Loss}}{\partial z_e} = \frac{\partial \text{Loss}}{\partial z_q}$$

"Pretend" quantization is identity in backward pass.

Implementation

```python
class VectorQuantizer(nn.Module):
    def __init__(self, num_embeddings, embedding_dim, commitment_cost=0.25):
        super().__init__()
        self.num_embeddings = num_embeddings
        self.embedding_dim = embedding_dim
        self.commitment_cost = commitment_cost  # beta; 0.25 is an assumed default

        # Codebook
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)
        self.embedding.weight.data.uniform_(-1/num_embeddings, 1/num_embeddings)

    def forward(self, z_e):
        # z_e: (batch, embedding_dim, H, W)

        # Flatten spatial dimensions
        z_e_flat = z_e.permute(0, 2, 3, 1).reshape(-1, self.embedding_dim)

        # Squared distances to codebook entries: |a|^2 + |b|^2 - 2 a.b
        distances = (z_e_flat.pow(2).sum(1, keepdim=True)
                     + self.embedding.weight.pow(2).sum(1)
                     - 2 * z_e_flat @ self.embedding.weight.t())

        # Find nearest codebook entry
        encoding_indices = torch.argmin(distances, dim=1)

        # Quantize
        z_q_flat = self.embedding(encoding_indices)
        z_q = z_q_flat.view_as(z_e.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

        # Compute losses (stop-gradient = detach)
        codebook_loss = F.mse_loss(z_q, z_e.detach())    # gradient flows to codebook only
        commitment_loss = F.mse_loss(z_e, z_q.detach())  # gradient flows to encoder only

        # Straight-through estimator
        z_q = z_e + (z_q - z_e).detach()

        loss = codebook_loss + self.commitment_cost * commitment_loss

        return z_q, loss, encoding_indices


class VQVAE(nn.Module):
    # Encoder and Decoder are convolutional modules defined elsewhere
    def __init__(self, num_embeddings=512, embedding_dim=64):
        super().__init__()

        self.encoder = Encoder(embedding_dim)  # Output: (batch, 64, H/4, W/4)
        self.quantizer = VectorQuantizer(num_embeddings, embedding_dim)
        self.decoder = Decoder(embedding_dim)  # Input: (batch, 64, H/4, W/4)

    def forward(self, x):
        z_e = self.encoder(x)
        z_q, vq_loss, indices = self.quantizer(z_e)
        x_recon = self.decoder(z_q)

        return x_recon, vq_loss, indices
```

Training

```python
for batch in dataloader:
    x = batch.to(device)

    x_recon, vq_loss, _ = model(x)

    recon_loss = F.mse_loss(x_recon, x)
    loss = recon_loss + vq_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Advantages of VQ-VAE

Discrete latent space: No posterior collapse
Sharper reconstructions: No sampling blur
Better for autoregressive models: Can model $p(z)$ with PixelCNN
Interpretable: Codebook entries are learned visual concepts

Disadvantages

More complex: Three loss terms, straight-through estimator
Codebook collapse: Some entries may never be used
Hyperparameter sensitive: Need to tune commitment cost, codebook size

Codebook Collapse Prevention

Problem: Some codebook entries never get used

Solutions:

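EMA (Exponential Moving Average) updates: Replace the codebook loss with moving averages of the encoder outputs assigned to each entry, so each entry tracks the running mean of the vectors mapped to it.

Restart unused entries: Periodically re-initialize entries that have not been selected for a while, for example to encoder outputs drawn from the current batch.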

Increase commitment cost: Force encoder to use all entries

Applications

High-quality image generation: The original DALL-E uses a discrete VAE closely related to VQ-VAE
Audio generation: Jukebox (OpenAI)
Video compression: Discrete latent codes
Hierarchical models: VQ-VAE-2 with multiple levels

---

16. Implementation Guide

Complete VAE Training Script

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Hyperparameters
BATCH_SIZE = 128
LATENT_DIM = 20
LEARNING_RATE = 1e-3
NUM_EPOCHS = 50
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Data loading
transform = transforms.Compose([
    transforms.ToTensor(),
])

train_dataset = datasets.MNIST('data/', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST('data/', train=False, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)


# Model
class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super().__init__()

        # Encoder
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)

        # Decoder
        self.fc3 = nn.Linear(latent_dim, hidden_dim)
        self.fc4 = nn.Linear(hidden_dim, hidden_dim)
        self.fc5 = nn.Linear(hidden_dim, input_dim)

    def encode(self, x):
        h = F.relu(self.fc1(x))
        h = F.relu(self.fc2(h))
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        h = F.relu(self.fc3(z))
        h = F.relu(self.fc4(h))
        return torch.sigmoid(self.fc5(h))

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar


# Loss function
def vae_loss(x, x_recon, mu, logvar):
    # Reconstruction loss
    recon_loss = F.binary_cross_entropy(x_recon, x, reduction='sum')

    # KL divergence
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return recon_loss + kl_loss


# Training
model = VAE(latent_dim=LATENT_DIM).to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)


def train(epoch):
    model.train()
    train_loss = 0
    for batch_idx, (data, _) in enumerate(train_loader):
        data = data.view(-1, 784).to(DEVICE)

        optimizer.zero_grad()

        x_recon, mu, logvar = model(data)
        loss = vae_loss(data, x_recon, mu, logvar)

        loss.backward()
        train_loss += loss.item()
        optimizer.step()

        if batch_idx % 100 == 0:
            print(f'Epoch {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)}] '
                  f'Loss: {loss.item() / len(data):.4f}')

    avg_loss = train_loss / len(train_loader.dataset)
    print(f'====> Epoch {epoch} Average loss: {avg_loss:.4f}')


def test(epoch):
    model.eval()
    test_loss = 0
    with torch.no_grad():
        for data, _ in test_loader:
            data = data.view(-1, 784).to(DEVICE)
            x_recon, mu, logvar = model(data)
            loss = vae_loss(data, x_recon, mu, logvar)
            test_loss += loss.item()

    avg_loss = test_loss / len(test_loader.dataset)
    print(f'====> Test set loss: {avg_loss:.4f}')


# Main training loop
for epoch in range(1, NUM_EPOCHS + 1):
    train(epoch)
    test(epoch)

    # Save samples
    if epoch % 5 == 0:
        with torch.no_grad():
            z = torch.randn(64, LATENT_DIM).to(DEVICE)
            samples = model.decode(z).cpu().view(64, 1, 28, 28)

        fig, axes = plt.subplots(8, 8, figsize=(8, 8))
        for i, ax in enumerate(axes.flat):
            ax.imshow(samples[i, 0], cmap='gray')
            ax.axis('off')
        plt.savefig(f'samples_epoch_{epoch}.png')
        plt.close()

# Save model
torch.save(model.state_dict(), 'vae_mnist.pth')
```

Convolutional VAE

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


# Class name and default latent_dim are assumptions
class ConvVAE(nn.Module):
    def __init__(self, latent_dim=20):
        super().__init__()

        # Encoder
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1),   # 28 -> 14
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 14 -> 7
            nn.ReLU(),
            nn.Conv2d(64, 128, 7),                      # 7 -> 1
            nn.ReLU(),
        )

        self.fc_mu = nn.Linear(128, latent_dim)
        self.fc_logvar = nn.Linear(128, latent_dim)

        # Decoder
        self.fc_decode = nn.Linear(latent_dim, 128)

        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 7),                      # 1 -> 7
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),  # 7 -> 14
            nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),   # 14 -> 28
            nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.encoder(x)
        h = h.view(h.size(0), -1)  # flatten (B, 128, 1, 1) to (B, 128)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        h = F.relu(self.fc_decode(z))
        h = h.view(-1, 128, 1, 1)
        return self.decoder(h)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar
```

Latent Space Visualization

```python
import numpy as np
import matplotlib.pyplot as plt
import torch


def visualize_latent_space(model, dataloader, device):
    model.eval()
    latents, labels = [], []

    # Encode every batch and keep the posterior means
    with torch.no_grad():
        for data, label in dataloader:
            data = data.view(-1, 784).to(device)
            mu, _ = model.encode(data)
            latents.append(mu.cpu().numpy())
            labels.append(label.numpy())

    latents = np.concatenate(latents)
    labels = np.concatenate(labels)

    # Use t-SNE for visualization if latent_dim > 2
    if latents.shape[1] > 2:
        from sklearn.manifold import TSNE
        latents_2d = TSNE(n_components=2).fit_transform(latents)
    else:
        latents_2d = latents

    # Plot
    plt.figure(figsize=(10, 8))
    scatter = plt.scatter(latents_2d[:, 0], latents_2d[:, 1],
                          c=labels, cmap='tab10', alpha=0.6)
    plt.colorbar(scatter)
    plt.title('Latent Space Visualization')
    plt.xlabel('Dimension 1')
    plt.ylabel('Dimension 2')
    plt.show()


visualize_latent_space(model, test_loader, DEVICE)
```

Interpolation

```python
def interpolate(model, img1, img2, steps, device):
    model.eval()
    with torch.no_grad():
        mu1, _ = model.encode(img1.view(-1, 784).to(device))
        mu2, _ = model.encode(img2.view(-1, 784).to(device))

        interpolations = []
        for alpha in torch.linspace(0, 1, steps):
            # alpha = 0 decodes the code of img2, alpha = 1 the code of img1
            z = alpha * mu1 + (1 - alpha) * mu2
            img = model.decode(z)
            interpolations.append(img.cpu().view(28, 28))

    # Plot
    fig, axes = plt.subplots(1, steps, figsize=(15, 2))
    for i, ax in enumerate(axes):
        ax.imshow(interpolations[i], cmap='gray')
        ax.axis('off')
    plt.show()


# Example usage
img1 = test_dataset[0][0]
img2 = test_dataset[1][0]
interpolate(model, img1, img2, steps=10, device=DEVICE)
```

---

17. Common Pitfalls and Solutions

1. Posterior Collapse in VAE

Problem: KL divergence goes to zero, model ignores latent code

Symptoms:
KL loss ≈ 0
All samples look the same
Decoder becomes unconditional

Causes:
Decoder too powerful
KL weight too high early in training

Solutions:

A. KL Annealing:

```python
# In training loop
loss = recon_loss + kl_weight(epoch) * kl_loss
```
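
The body of `kl_weight` is not shown here. A minimal sketch of one possibility, assuming a linear warm-up from 0 to 1 and epochs counted from 1; the `warmup_epochs` parameter and its default of 10 are hypothetical choices, not values taken from this lesson:

```python
def kl_weight(epoch, warmup_epochs=10):
    """Linearly ramp the KL weight from 0 to 1 over `warmup_epochs` epochs.

    Assumes epochs are counted from 1; `warmup_epochs` is a hypothetical knob.
    """
    if warmup_epochs <= 0:
        return 1.0
    return min(1.0, max(0.0, (epoch - 1) / warmup_epochs))


# Weights for the first six epochs of a 4-epoch warm-up
schedule = [kl_weight(e, warmup_epochs=4) for e in range(1, 7)]
print(schedule)  # [0.0, 0.25, 0.5, 0.75, 1.0, 1.0]
```

Starting at zero lets the decoder learn to use the latent code before the KL penalty reaches full strength. Other schedules (cyclical, sigmoid) follow the same pattern.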

B. Free Bits:

```python
kl_loss = free_bits_kl(kl_loss, free_bits=2.0)
```
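
The definition of `free_bits_kl` is not shown here. One way it could look, as a sketch under the assumption that the KL is passed per latent dimension with shape `(batch, latent_dim)`; that input format and the parameter name `kl_per_dim` are assumptions:

```python
import torch


def free_bits_kl(kl_per_dim, free_bits=2.0):
    """Free-bits KL: every latent dimension contributes at least `free_bits`.

    kl_per_dim: tensor of shape (batch, latent_dim) with the KL of each
    latent dimension for each sample (an assumed input format).
    """
    kl_mean = kl_per_dim.mean(dim=0)  # average over the batch
    # Dimensions already below the threshold are clamped, so they receive
    # no gradient that would push their KL further toward zero
    return torch.clamp(kl_mean, min=free_bits).sum()


# Per-dimension KL of a diagonal Gaussian posterior against N(0, I)
mu = torch.tensor([[0.0, 3.0], [0.0, 3.0]])
logvar = torch.zeros(2, 2)
kl_per_dim = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())
kl_loss = free_bits_kl(kl_per_dim, free_bits=2.0)
print(kl_loss.item())  # 6.5: the first dimension is lifted from 0 to 2, the second stays at 4.5
```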

C. Weaken decoder:
Use fewer layers
Reduce hidden dimensions
Add dropout

2. Blurry Reconstructions

Problem: VAE produces blurry images

Causes:
MSE/BCE loss averages over all possible reconstructions
Sampling introduces noise

Solutions:

A. Increase latent dimension:

B. Use VQ-VAE: Discrete latents eliminate sampling noise

C. Perceptual loss:

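```python
class PerceptualLoss(nn.Module):
    # __init__ must set self.feature_extractor (for example, a frozen pretrained CNN)

    def forward(self, x, x_recon):
        # Compare the images in feature space instead of pixel space
        features_x = self.feature_extractor(x)
        features_recon = self.feature_extractor(x_recon)
        return F.mse_loss(features_recon, features_x)


# Use in training
perceptual_loss_fn = PerceptualLoss()
loss = recon_loss + 0.1 * perceptual_loss_fn(x, x_recon) + kl_loss
```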

3. Mode Collapse

Problem: Model generates limited variety of samples

Symptoms:
All samples look similar
Doesn't cover full data distribution

Solutions:

A. Increase KL weight (β-VAE):

B. Increase latent dimension

C. Architectural improvements:
Deeper encoder/decoder
More parameters

4. Numerical Instability

Problem: NaN or Inf in loss

Causes:
log(0) in KL divergence
Exponential overflow in exp(logvar)

Solutions:

A. Clamp logvar:
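
A minimal sketch of clamping, assuming bounds of ±10 (a commonly workable range, not a value prescribed by this lesson); `clamp_logvar` is a hypothetical helper name:

```python
import torch


def clamp_logvar(logvar, min_val=-10.0, max_val=10.0):
    """Keep logvar in a range where exp(logvar) stays finite and non-zero.

    The default bounds are assumed values; tune them for your model.
    """
    return torch.clamp(logvar, min=min_val, max=max_val)


logvar = torch.tensor([-200.0, 0.0, 200.0])
unsafe = logvar.exp()               # exp(200) overflows float32 to inf
safe = clamp_logvar(logvar).exp()   # every value stays finite
```

In the VAE above, the clamp would typically be applied to the output of `fc_logvar` inside `encode`, before `reparameterize` and the KL term use it.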

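B. Compute the loss from logits for stability (the same idea as log_softmax):

```python
# Pass raw decoder logits; the sigmoid is applied inside the loss
loss = F.binary_cross_entropy_with_logits(logits, x)
```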
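
The comparison below is a small sketch of why the logit-based loss is preferred. It assumes the decoder has been changed to return raw logits instead of sigmoid outputs; the tensors are made-up values for illustration:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[0.0, 1.0, 1.0]])
logits = torch.tensor([[-2.0, 0.5, 3.0]])

# Sigmoid followed by BCE: two separate steps
two_step = F.binary_cross_entropy(torch.sigmoid(logits), x, reduction='sum')

# Fused version: sigmoid and log are combined in one numerically stable op
fused = F.binary_cross_entropy_with_logits(logits, x, reduction='sum')

# For extreme logits the sigmoid saturates to exactly 1 in float32, so the
# log term can no longer be computed from it; the fused op works directly
# on the logit and still returns the correct loss
extreme = F.binary_cross_entropy_with_logits(
    torch.tensor([200.0]), torch.tensor([0.0]), reduction='sum'
)
```

For moderate logits both versions agree. The difference only shows up when the sigmoid saturates.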

C. Gradient clipping:
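
A minimal sketch of gradient-norm clipping with `torch.nn.utils.clip_grad_norm_`. The linear layer is a stand-in for the VAE, the loss is deliberately inflated to produce large gradients, and `max_norm=1.0` is an assumed threshold:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)  # stand-in for the VAE
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Deliberately huge loss so that the gradients are large
loss = (model(torch.randn(8, 4)) * 1e4).pow(2).sum()
optimizer.zero_grad()
loss.backward()

# Rescale all gradients so that their combined L2 norm is at most max_norm;
# the call returns the norm measured before clipping
max_norm = 1.0  # assumed threshold; tune per model
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

optimizer.step()
```

The clipping call goes between `loss.backward()` and `optimizer.step()`. Logging `norm_before` is a cheap way to spot exploding gradients.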

5. Training Time Issues

Problem: VAE trains slowly

Solutions:

A. Larger batch size:

B. Learning rate scheduling:
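
One possible schedule, sketched with `torch.optim.lr_scheduler.StepLR`. The linear layer stands in for the VAE, and `step_size=10` and `gamma=0.5` are assumed values rather than recommendations from this lesson:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the VAE
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Halve the learning rate every 10 epochs (step_size and gamma are assumed values)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

lrs = []
for epoch in range(1, 31):
    # One dummy update per epoch stands in for train(epoch)
    optimizer.zero_grad()
    loss = model(torch.randn(8, 4)).pow(2).mean()
    loss.backward()
    optimizer.step()

    scheduler.step()  # call once per epoch, after the optimizer updates
    lrs.append(optimizer.param_groups[0]['lr'])
```

After epochs 10, 20, and 30 the learning rate is 5e-4, 2.5e-4, and 1.25e-4 respectively.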

C. Use convolutional architecture: Much faster than fully connected for images

6. Evaluation Mode Errors

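Problem: Forgetting to call model.eval() before generation

Issue: Batch norm and dropout behave differently in training and evaluation modes, so samples generated in training mode can be noisy or inconsistent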

Solution:
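
A minimal sketch of generation in evaluation mode. The small decoder with dropout is a stand-in with arbitrary layer sizes, chosen only to make the train/eval difference visible:

```python
import torch
import torch.nn as nn

# Stand-in decoder with dropout (layer sizes are arbitrary assumptions)
decoder = nn.Sequential(
    nn.Linear(2, 16), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(16, 4)
)
z = torch.randn(1, 2)

decoder.eval()              # disable dropout, use batch-norm running statistics
with torch.no_grad():       # no gradients are needed for generation
    sample_a = decoder(z)
    sample_b = decoder(z)   # identical to sample_a in eval mode

decoder.train()             # switch back before resuming training
```

In training mode the two calls would generally differ, because dropout draws a fresh mask on every forward pass.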

7. Codebook Collapse (VQ-VAE)

Problem: Only a few codebook entries are used

Solutions:

A. Use EMA codebook updates instead of gradient-based updates

B. Monitor codebook usage:

```python
# In training loop (usage_count: zero-initialized tensor with one entry per codebook vector)
unique_indices, counts = torch.unique(encoding_indices, return_counts=True)
usage_count[unique_indices] += counts
```

C. Reset unused entries:
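
A sketch of one common heuristic, assumed here rather than taken from this lesson: codebook rows that were never selected are re-seeded with encoder outputs from a recent batch. The helper name `reset_unused_codes` and its arguments are hypothetical:

```python
import torch


def reset_unused_codes(codebook, usage_count, encoder_outputs):
    """Re-initialize codebook rows that were never selected.

    codebook:        (num_codes, dim) tensor, modified in place
    usage_count:     (num_codes,) selection counts since the last reset
    encoder_outputs: (n, dim) encoder vectors from a recent batch
    Returns the number of rows that were reset.
    """
    unused = usage_count == 0
    num_unused = int(unused.sum())
    if num_unused > 0:
        # Draw random encoder outputs (with replacement) as new code vectors
        idx = torch.randint(0, encoder_outputs.size(0), (num_unused,))
        with torch.no_grad():
            codebook[unused] = encoder_outputs[idx]
    usage_count.zero_()  # start a fresh counting window
    return num_unused


torch.manual_seed(0)
codebook = torch.zeros(4, 2)
usage_count = torch.tensor([5, 0, 3, 0])
encoder_outputs = torch.ones(10, 2)
num_reset = reset_unused_codes(codebook, usage_count, encoder_outputs)
```

Here rows 1 and 3 are replaced and rows 0 and 2 are left untouched. Calling the helper every few hundred steps, rather than every step, gives new codes time to be selected.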

8. Wrong Loss Scale

Problem: Reconstruction and KL losses have very different scales

Symptoms:
One dominates the other
Poor balance

Solution: Normalize losses or adjust weights:

```python
# Weight the KL term (beta is a tunable hyperparameter)
loss = recon_loss + beta * kl_loss
```
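
A sketch of the normalization option, assuming each term is averaged over its own dimensionality (pixels for reconstruction, latent dimensions for KL). This is equivalent to picking a particular β, so it changes the objective; the function name `normalized_vae_loss` is hypothetical:

```python
import torch
import torch.nn.functional as F


def normalized_vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """VAE loss with both terms on a per-element scale.

    Returns the total loss plus the two terms, so they can be logged separately.
    """
    recon = F.binary_cross_entropy(x_recon, x, reduction='mean')      # per pixel
    kl = torch.mean(-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()))   # per latent dim
    return recon + beta * kl, recon, kl


x = torch.full((4, 784), 0.5)
x_recon = torch.full((4, 784), 0.5)
mu = torch.ones(4, 20)
logvar = torch.zeros(4, 20)
loss, recon, kl = normalized_vae_loss(x, x_recon, mu, logvar, beta=0.5)
```

With these inputs `recon` is ln 2 ≈ 0.693 and `kl` is 0.5, so both terms sit on a similar scale. Logging them separately makes it easy to see when one dominates.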

9. Memory Issues

Problem: Out of memory errors

Solutions:

A. Reduce batch size

B. Use gradient checkpointing:

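```python
from torch.utils.checkpoint import checkpoint

# Method of the VAE class: encode/decode activations are recomputed in backward
def forward(self, x):
    # use_reentrant=False lets gradients reach the weights although x needs no grad
    mu, logvar = checkpoint(self.encode, x, use_reentrant=False)
    z = self.reparameterize(mu, logvar)
    x_recon = checkpoint(self.decode, z, use_reentrant=False)
    return x_recon, mu, logvar
```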

C. Use mixed precision training:

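```python
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

optimizer.zero_grad()
with autocast():
    x_recon, mu, logvar = model(x)

# F.binary_cross_entropy raises an error inside autocast regions, so the
# loss is computed outside the context in float32
loss = vae_loss(x, x_recon.float(), mu.float(), logvar.float())

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```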

10. Overfitting

Problem: Good training loss, poor test performance

Solutions:

A. Data augmentation:

B. Dropout in encoder/decoder:

C. Weight decay:
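
A minimal sketch of weight decay using `torch.optim.AdamW`. The linear layer stands in for the VAE, and the values `lr=0.1` and `weight_decay=0.1` are exaggerated assumptions chosen to make the effect visible (typical values are far smaller, e.g. 1e-5 to 1e-2):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)  # stand-in for the VAE
w_before = model.weight.detach().clone()

# AdamW applies decoupled weight decay on every step
optimizer = torch.optim.AdamW(model.parameters(), lr=0.1, weight_decay=0.1)

# With a zero gradient the only change is the decay itself:
# w <- w * (1 - lr * weight_decay) = 0.99 * w
for p in model.parameters():
    p.grad = torch.zeros_like(p)
optimizer.step()

shrink = (model.weight.detach() / w_before).mean().item()
```

Each step pulls the weights slightly toward zero, which discourages the large weights associated with memorizing the training set.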

---

Summary

Key Takeaways

Autoencoders learn compressed representations through reconstruction
Denoising improves robustness and representation quality
VAE adds probabilistic structure for generation
KL divergence regularizes latent space to match prior
Reparameterization trick enables backpropagation through sampling
VQ-VAE uses discrete latents for sharper reconstructions

Comparison Table

| Model | Latent Space | Generation | Reconstruction | Training |
|-------|-------------|------------|----------------|----------|
| Autoencoder | Continuous, unstructured | ❌ No | ✅ Sharp | Easy |
| VAE | Continuous, structured | ✅ Yes | ⚠️ Blurry | Medium |
| VQ-VAE | Discrete | ✅ Yes (with prior) | ✅ Sharp | Hard |

When to Use Each

Autoencoder:
Dimensionality reduction
Feature learning for classification
Denoising
Don't need generation

VAE:
Need to generate new samples
Want smooth latent space
Probabilistic modeling
Conditional generation

VQ-VAE:
Need sharp reconstructions
Discrete latent space useful (e.g., for autoregressive models)
Have computational resources for more complex training

Expected Performance on MNIST

Autoencoder: Reconstruction MSE ≈ 0.01-0.02
VAE (latent_dim=20): Reconstruction MSE ≈ 0.02-0.04, good generation
VQ-VAE (512 codes): Reconstruction MSE ≈ 0.01-0.02, sharp generation

---

References

Kingma & Welling (2013). Auto-Encoding Variational Bayes. [arXiv:1312.6114](https://arxiv.org/abs/1312.6114)
van den Oord et al. (2017). Neural Discrete Representation Learning (VQ-VAE). [arXiv:1711.00937](https://arxiv.org/abs/1711.00937)
Higgins et al. (2017). β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR 2017.
Vincent et al. (2008). Extracting and Composing Robust Features with Denoising Autoencoders. ICML 2008.
Hinton & Salakhutdinov (2006). Reducing the Dimensionality of Data with Neural Networks. Science.

---

End of Lesson

Good luck with Assignment 7!
