
CMPUT 328 Assignment 7: Autoencoders and Generative Models

CMPUT 328 - Visual Recognition - Assignment 7

Comprehensive Study Guide

1. Introduction to Autoencoders

What is an Autoencoder?

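An autoencoder is a neural network that learns efficient representations of data without labels. It consists of two main parts: an encoder, which maps the input to a compact latent code, and a decoder, which reconstructs the input from that code.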

Why Use Autoencoders?

Historical Context

2. Representation Learning

What is Representation Learning?

Why Learn Representations?

Traditional approach:
1. Hand-craft features (edges, textures, SIFT, HOG)
2. Train a classifier on those features

Autoencoder approach:
1. Learn features automatically from unlabeled data
2. Use the learned features for downstream tasks

Benefits

Example: MNIST

For MNIST digits (784-dimensional):
- Raw pixels: 28×28 = 784 features
- Autoencoder: learns a 32-dimensional representation
- Result: 32 features retain most of the information needed to reconstruct the digit

---

3. Autoencoder Architecture

Basic Architecture

Input → Encoder → Latent Code → Decoder → Reconstruction
  x   →    φ    →      z      →    ψ    →      x̂

Mathematical Formulation

Architecture Types

Example: MNIST Autoencoder

```python
import torch
import torch.nn as nn

class VanillaAutoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()

        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, latent_dim)
        )

        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Linear(256, input_dim),
            nn.Sigmoid()  # Output in [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)
        x_recon = self.decoder(z)
        return x_recon, z
```

Common Dimensions

For MNIST (28×28 = 784):
- Input: 784
- Hidden layers: 512 → 256 → 128
- Latent: 32 or 64
- Mirror architecture for decoder

---

4. Dimensionality Reduction

Autoencoders vs PCA

When Autoencoders Match PCA

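If the encoder and decoder are linear with no activation: $$z = W_e x$$ $$\hat{x} = W_d z$$

With MSE loss on centered data, the optimal solution spans the same subspace as the top principal components. The learned basis need not be orthogonal or ordered by variance, so the latent coordinates can differ from PCA's even though the reconstructions agree.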

When Autoencoders Excel

With non-linear activations (ReLU, tanh), autoencoders can:
- Learn curved manifolds
- Capture complex relationships
- Achieve better reconstruction

Visualization Example

For 3D data compressed to 2D:

Implementation

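```python
import numpy as np
import torch
from sklearn.decomposition import PCA

# X_train: float array of shape (n_samples, 784) with values in [0, 1]

# PCA
pca = PCA(n_components=32)
z_pca = pca.fit_transform(X_train)
X_recon_pca = pca.inverse_transform(z_pca)

# Autoencoder (train it first; an untrained model gives a meaningless MSE)
model = VanillaAutoencoder(input_dim=784, latent_dim=32)
model.eval()
with torch.no_grad():
    X_recon_ae, z_ae = model(torch.as_tensor(X_train, dtype=torch.float32))

# Compare reconstruction error
mse_pca = np.mean((X_train - X_recon_pca) ** 2)
mse_ae = np.mean((X_train - X_recon_ae.numpy()) ** 2)

print(f"PCA MSE: {mse_pca:.4f}")
print(f"Autoencoder MSE: {mse_ae:.4f}")
```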

---

5. Convolutional Autoencoders

Why Convolutional?

For image data:
- Preserve spatial structure
- Parameter efficiency (weight sharing)
- Translation equivariance (a shifted input produces a correspondingly shifted feature map)
- Better features for visual data

Architecture

Example Architecture

```python
import torch.nn as nn

class ConvAutoencoder(nn.Module):  # class name assumed; the original header was lost
    def __init__(self):
        super().__init__()

        # Encoder: 28x28 -> 14x14 -> 7x7 -> 1x1
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # 28->14
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 14->7
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=7)                        # 7->1
        )

        # Decoder: 1x1 -> 7x7 -> 14x14 -> 28x28
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=7),  # 1->7
            nn.ReLU(),
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2,
                               padding=1, output_padding=1),  # 7->14
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2,
                               padding=1, output_padding=1),  # 14->28
            nn.Sigmoid()
        )

    def forward(self, x):
        z = self.encoder(x)
        x_recon = self.decoder(z)
        return x_recon, z
```

Output Size Calculation
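The size comments in the example architecture (28 → 14 → 7 → 1 and back) follow from standard convolution arithmetic. For dilation 1, a convolution gives $\lfloor (n + 2p - k)/s \rfloor + 1$ and a transposed convolution gives $(n - 1)s - 2p + k + \text{output\_padding}$. The sketch below checks those numbers in plain Python; the helper names `conv_out` and `conv_transpose_out` are hypothetical, not part of PyTorch.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one spatial axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride=1, padding=0, output_padding=0):
    """Output length of a transposed convolution along one axis (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# Encoder path of the example architecture: 28 -> 14 -> 7 -> 1
encoder_sizes = [28]
for kernel, stride, padding in [(3, 2, 1), (3, 2, 1), (7, 1, 0)]:
    encoder_sizes.append(conv_out(encoder_sizes[-1], kernel, stride, padding))

# Decoder path: 1 -> 7 -> 14 -> 28
decoder_sizes = [1]
for kernel, stride, padding, out_pad in [(7, 1, 0, 0), (3, 2, 1, 1), (3, 2, 1, 1)]:
    decoder_sizes.append(
        conv_transpose_out(decoder_sizes[-1], kernel, stride, padding, out_pad)
    )

print(encoder_sizes)  # [28, 14, 7, 1]
print(decoder_sizes)  # [1, 7, 14, 28]
```

The `output_padding=1` in the stride-2 transposed convolutions is what makes 7 map back to 14 rather than 13.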

Skip Connections (U-Net Style)

For better reconstruction, add skip connections:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UNetAutoencoder(nn.Module):  # class name assumed; the original header was lost
    def __init__(self):
        super().__init__()

        # Encoder
        self.enc1 = nn.Conv2d(1, 64, 3, padding=1)
        self.enc2 = nn.Conv2d(64, 128, 3, padding=1)
        self.pool = nn.MaxPool2d(2)

        # Decoder
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = nn.Conv2d(128, 64, 3, padding=1)  # 128 = 64 + 64 (skip)
        self.dec2 = nn.Conv2d(64, 1, 3, padding=1)

    def forward(self, x):
        # Encoder
        e1 = F.relu(self.enc1(x))
        e2 = F.relu(self.enc2(self.pool(e1)))

        # Decoder with skip connection
        d1 = self.up1(e2)
        d1 = torch.cat([d1, e1], dim=1)  # Skip connection (concatenate channels)
        d1 = F.relu(self.dec1(d1))
        out = torch.sigmoid(self.dec2(d1))

        return out, e2
```

---

6. Training Strategies

Standard Training

```python
for epoch in range(num_epochs):
    for batch in dataloader:
        x = batch.to(device)

        # Forward pass
        x_recon, z = model(x)
        loss = criterion(x_recon, x)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Phased Training (Greedy Layer-wise)

Loss Functions

Regularization

Forces sparse latent codes.

Prevents overfitting.

Training Tips

---

7. Semi-Supervised Learning with Autoencoders

The Setup

Two-Stage Approach

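```python
import torch.nn as nn

class Classifier(nn.Module):
    # latent_dim=32 is an assumption matching the autoencoder above;
    # the original __init__ was lost
    def __init__(self, encoder, num_classes, latent_dim=32):
        super().__init__()
        self.encoder = encoder  # pre-trained on unlabeled data
        self.classifier = nn.Linear(latent_dim, num_classes)

    def forward(self, x):
        z = self.encoder(x)
        return self.classifier(z)

# Fine-tune on labeled data
model = Classifier(autoencoder.encoder, num_classes=10)
train_classifier(model, labeled_data)
```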

Why This Works

Joint Training Alternative

Train both objectives simultaneously: $$L = L_{\text{recon}} + \alpha L_{\text{classification}}$$

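```python
# forward() of the joint model (shared encoder, decoder head, classifier head)
def forward(self, x):
    z = self.encoder(x)
    x_recon = self.decoder(z)
    logits = self.classifier(z)
    return x_recon, logits

# Training loop
for batch_x, batch_y in dataloader:
    x_recon, logits = model(batch_x)

    loss_recon = F.mse_loss(x_recon, batch_x)
    # Unlabeled batches (batch_y is None) contribute only the reconstruction term
    loss_class = F.cross_entropy(logits, batch_y) if batch_y is not None else 0

    loss = loss_recon + alpha * loss_class
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```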

Expected Results

With 100 labeled MNIST samples:
- No pre-training: ~70% accuracy
- With autoencoder pre-training: ~85% accuracy

---

8. Denoising Autoencoders

What is a Denoising Autoencoder (DAE)?

$$\tilde{x} = \text{Corrupt}(x)$$ $$\hat{x} = \text{Decoder}(\text{Encoder}(\tilde{x}))$$ $$L = \|x - \hat{x}\|^2$$

Types of Corruption

Why Denoising?

Implementation

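```python
import torch
import torch.nn.functional as F

# Assumption: encoder and decoder are inherited from VanillaAutoencoder above;
# the original class header and __init__ were lost.
class DenoisingAutoencoder(VanillaAutoencoder):
    def add_noise(self, x, noise_type='gaussian', noise_level=0.3):
        if noise_type == 'gaussian':
            noise = torch.randn_like(x) * noise_level
            return torch.clamp(x + noise, 0, 1)

        elif noise_type == 'masking':
            # Zero out roughly a noise_level fraction of the inputs
            mask = torch.rand_like(x) > noise_level
            return x * mask.float()

        return x

    def forward(self, x, add_noise=True):
        # Corrupt only during training; evaluate on clean inputs
        if add_noise and self.training:
            x_noisy = self.add_noise(x)
        else:
            x_noisy = x

        z = self.encoder(x_noisy)
        x_recon = self.decoder(z)
        return x_recon, z

# Training
model = DenoisingAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for batch in dataloader:
    x = batch.to(device)
    x_recon, z = model(x, add_noise=True)

    # Loss: reconstruct CLEAN input from NOISY input
    loss = F.mse_loss(x_recon, x)  # Note: compare to original x, not noisy!

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```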

Noise Level Selection

Applications

---

9. Probability Primer for Generative Models

Random Variables

Probability Mass Function (PMF)

For discrete $X$: $$P(X = x_i) = p_i$$

Probability Density Function (PDF)

For continuous $X$: $$p(x) = \frac{dP(X \leq x)}{dx}$$

Expectation

Variance

$$\text{Var}(X) = \mathbb{E}[(X - \mu)^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

where $\mu = \mathbb{E}[X]$
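As a quick check of the two variance formulas, the sketch below evaluates both for a fair six-sided die using exact fractions. The die is an illustrative choice, not an example from the course notes.

```python
from fractions import Fraction

# Fair six-sided die: each outcome has probability 1/6
values = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6


def expectation(f):
    """E[f(X)] for the discrete distribution defined above."""
    return sum(p * f(x) for x, p in zip(values, probs))


mean = expectation(lambda x: x)                            # E[X]
var_definition = expectation(lambda x: (x - mean) ** 2)    # E[(X - mu)^2]
var_shortcut = expectation(lambda x: x * x) - mean ** 2    # E[X^2] - (E[X])^2

print(mean, var_definition, var_shortcut)  # 7/2 35/12 35/12
```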

Multiple Random Variables

---

10. Gaussian Distributions

Univariate Gaussian (1D)

$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Properties

Standard Normal

Special case: $\mu = 0$, $\sigma^2 = 1$ $$p(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$

Multivariate Gaussian (Vector)

For $\mathbf{x} \in \mathbb{R}^d$: $$p(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$

Diagonal Covariance

Often assume diagonal covariance: $$\boldsymbol{\Sigma} = \text{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2)$$

Then: $$p(\mathbf{x}) = \prod_{i=1}^d \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}\right)$$
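Because the density factorizes, the log-density of a diagonal Gaussian is a sum of univariate log-densities. A minimal plain-Python sketch of that fact (the function names are hypothetical):

```python
import math


def log_normal_1d(x, mu, var):
    """Log-density of a univariate Gaussian N(mu, var) at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def log_normal_diag(x, mu, var):
    """Log-density of a diagonal-covariance Gaussian: sum over dimensions."""
    return sum(log_normal_1d(xi, mi, vi) for xi, mi, vi in zip(x, mu, var))


# Two-dimensional example with different variances per dimension
x = [0.5, -1.0]
mu = [0.0, 0.0]
var = [1.0, 4.0]

log_p = log_normal_diag(x, mu, var)
# The product of the per-dimension densities gives the same value
product = math.exp(log_normal_1d(x[0], mu[0], var[0])) * \
          math.exp(log_normal_1d(x[1], mu[1], var[1]))
print(math.isclose(math.exp(log_p), product))  # True
```

Working in log space avoids underflow when the number of dimensions is large, which is why VAE losses are written as sums of log terms.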

Sampling from Gaussian

```python
import numpy as np

mu, sigma = 0, 1
samples = np.random.normal(mu, sigma, size=1000)
```

Why Gaussian for VAE?

---

11. Conditional Density Functions

Definition

$$p(x \mid y) = \frac{p(x, y)}{p(y)} = \frac{p(y \mid x)\, p(x)}{p(y)}$$

where:
- $p(x \mid y)$: Posterior
- $p(y \mid x)$: Likelihood
- $p(x)$: Prior
- $p(y)$: Evidence (marginal)

Example: Image Generation

$$p(x \mid y) = \text{"probability of image } x \text{ given its class } y\text{"}$$

Examples:
- $p(\text{image} \mid y=\text{"cat"})$: Distribution of cat images
- $p(\text{image} \mid y=\text{"dog"})$: Distribution of dog images

Conditional Gaussian

$$p(x \mid y) = \mathcal{N}(x \mid \mu(y), \sigma^2(y))$$

Mean and variance depend on condition $y$!

Chain Rule of Probability

For multiple variables: $$p(x_1, x_2, \ldots, x_n) = p(x_1) p(x_2 \mid x_1) p(x_3 \mid x_1, x_2) \cdots p(x_n \mid x_1, \ldots, x_{n-1})$$

In VAE Context

"Probability of image $x$ given latent code $z$"

"Probability of latent $z$ given image $x$"

Conditional Generation Example

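```python
import torch
import torch.nn as nn

class ConditionalVAE(nn.Module):  # class name assumed; the original header was lost
    def __init__(self, num_classes, latent_dim):
        super().__init__()

        # Encoder: q(z | x, y); assumed to output 2 * latent_dim features (mu and logvar)
        self.encoder = nn.Sequential(...)

        # Decoder: p(x | z, y)
        self.decoder = nn.Sequential(...)

        # Embed class label
        self.class_embed = nn.Embedding(num_classes, latent_dim)

    def encode(self, x, y):
        h = self.encoder(x)
        y_embed = self.class_embed(y)
        # Condition on class; the embedding is tiled so it matches both halves of h
        h = h + y_embed.repeat(1, 2)
        mu, logvar = torch.chunk(h, 2, dim=1)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + torch.randn_like(std) * std

    def decode(self, z, y):
        y_embed = self.class_embed(y)
        z_cond = z + y_embed  # Condition on class
        return self.decoder(z_cond)

    def forward(self, x, y):
        mu, logvar = self.encode(x, y)
        z = self.reparameterize(mu, logvar)
        x_recon = self.decode(z, y)
        return x_recon, mu, logvar
```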

---

12. Introduction to Image Generation

Types of Generative Models

Why VAE for This Course?

Generative vs Discriminative Models

The Generation Pipeline

1. Sample latent code: z ~ N(0, I)
2. Decode to image: x = Decoder(z)
3. Result: New image x

Evaluation Metrics

---

13. Variational Autoencoders (VAE)

Motivation

Probabilistic Formulation

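The marginal likelihood of an image integrates over the latent code:

$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$$

where:
- $p(z) = \mathcal{N}(0, I)$: Prior (simple Gaussian)
- $p_\theta(x \mid z)$: Decoder (learned)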

Variational Inference

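The true posterior $p(z \mid x)$ is intractable, so approximate it with a diagonal Gaussian:

$$q_\phi(z \mid x) = \mathcal{N}\left(z \mid \mu_\phi(x), \text{diag}(\sigma_\phi^2(x))\right)$$

Learn $\mu_\phi$ and $\sigma_\phi$ with neural networks!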

Evidence Lower Bound (ELBO)

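$$\log p(x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p(x)\right] = \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{p(z|x)}\right]$$

$$= \mathbb{E}_{q_\phi(z|x)}\left[\log \left(\frac{p_\theta(x, z)}{q_\phi(z|x)} \cdot \frac{q_\phi(z|x)}{p(z|x)}\right)\right]$$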

$$= \underbrace{\mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right]}_{\text{ELBO}} + \underbrace{D_{KL}(q_\phi(z|x) \| p(z|x))}_{\geq 0}$$

Since $D_{KL} \geq 0$: $$\log p(x) \geq \text{ELBO}$$

VAE Loss Function

$$\text{Loss} = -\mathcal{L} = \underbrace{-\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]}_{\text{Reconstruction Loss}} + \underbrace{D_{KL}(q_\phi(z|x) \| p(z))}_{\text{KL Divergence}}$$

The Reparameterization Trick

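Sampling $z \sim q_\phi(z|x)$ directly is not differentiable with respect to $\phi$. Instead, sample the noise separately and transform it:

$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

where $\odot$ is element-wise multiplication. Gradients then flow through $\mu$ and $\sigma$, while the randomness stays in $\epsilon$.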
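A plain-Python sketch of the trick, $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, using lists instead of tensors so it runs without PyTorch. The chosen `mu` and `logvar` values are illustrative; the empirical mean and variance of the samples should land close to the requested ones.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps, with eps ~ N(0, 1) per dimension."""
    std = [math.exp(0.5 * lv) for lv in logvar]   # sigma = exp(0.5 * log sigma^2)
    eps = [rng.gauss(0.0, 1.0) for _ in mu]       # noise independent of mu, sigma
    return [m + s * e for m, s, e in zip(mu, std, eps)]


rng = random.Random(0)                 # fixed seed for reproducibility
mu = [2.0, -1.0]
logvar = [math.log(0.25), 0.0]         # variances 0.25 and 1.0

samples = [reparameterize(mu, logvar, rng) for _ in range(20000)]

# Empirical mean and variance of the first latent dimension
mean0 = sum(s[0] for s in samples) / len(samples)
var0 = sum((s[0] - mean0) ** 2 for s in samples) / len(samples)
print(f"mean ~ {mean0:.3f}, variance ~ {var0:.3f}")
```

In a real VAE the same arithmetic runs on tensors, so automatic differentiation can propagate gradients into `mu` and `logvar`.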

VAE Architecture

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=20):
        super().__init__()

        # Encoder: x -> mu, logvar
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 400),
            nn.ReLU(),
            nn.Linear(400, 400),
            nn.ReLU()
        )
        self.fc_mu = nn.Linear(400, latent_dim)
        self.fc_logvar = nn.Linear(400, latent_dim)

        # Decoder: z -> x
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 400),
            nn.ReLU(),
            nn.Linear(400, 400),
            nn.ReLU(),
            nn.Linear(400, input_dim),
            nn.Sigmoid()
        )

    def encode(self, x):
        h = self.encoder(x)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)  # sigma = exp(0.5 * log(sigma^2))
        eps = torch.randn_like(std)    # Sample epsilon ~ N(0, 1)
        z = mu + eps * std             # z = mu + sigma * epsilon
        return z

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        x_recon = self.decode(z)
        return x_recon, mu, logvar
```

VAE Loss Implementation

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    # Reconstruction loss (binary cross-entropy, summed over pixels and batch)
    recon_loss = F.binary_cross_entropy(x_recon, x, reduction='sum')

    # KL divergence: D_KL(N(mu, sigma^2) || N(0, I))
    # Formula: -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return recon_loss + kl_loss

# Training loop
model = VAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(num_epochs):
    for batch in dataloader:
        x = batch.to(device)

        x_recon, mu, logvar = model(x)
        loss = vae_loss(x, x_recon, mu, logvar)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Why VAE Works for Generation

Generating New Images

```python
import torch
import matplotlib.pyplot as plt

# Sample latent codes from the prior, z ~ N(0, I)
model.eval()
z = torch.randn(64, latent_dim).to(device)

# Decode
with torch.no_grad():
    samples = model.decode(z)

# Visualize
fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for i, ax in enumerate(axes.flat):
    ax.imshow(samples[i].cpu().reshape(28, 28), cmap='gray')
    ax.axis('off')
plt.show()
```

Interpolation in Latent Space

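```python
# Encode the two endpoint images
# (x1, x2 are placeholder names for flattened inputs of shape (1, 784))
with torch.no_grad():
    mu1, _ = model.encode(x1)
    mu2, _ = model.encode(x2)

    # Interpolate
    alphas = torch.linspace(0, 1, 10)
    for alpha in alphas:
        z_interp = alpha * mu1 + (1 - alpha) * mu2
        img_interp = model.decode(z_interp)
        # Display img_interp
```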

---

14. KL-Divergence

Definition

Properties

Intuition

$D_{KL}(P \| Q)$ = "How much information is lost when $Q$ is used to approximate $P$"

Example: Two Gaussians

$$P = \mathcal{N}(\mu_1, \sigma_1^2), \quad Q = \mathcal{N}(\mu_2, \sigma_2^2)$$

$$D_{KL}(P \| Q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$

$$D_{KL}(\mathcal{N}(\mu, \sigma^2) \| \mathcal{N}(0, 1)) = \frac{1}{2}\left(\mu^2 + \sigma^2 - \log \sigma^2 - 1\right)$$
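The two closed forms above can be checked against each other numerically. The sketch below implements both in plain Python (function names are hypothetical) and confirms that setting $Q = \mathcal{N}(0, 1)$ in the general formula recovers the special case, and that swapping the arguments changes the value.

```python
import math


def kl_gaussians(mu1, sigma1, mu2, sigma2):
    """D_KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) for univariate Gaussians."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)


def kl_to_standard_normal(mu, sigma):
    """D_KL(N(mu, sigma^2) || N(0, 1)), the special case used in the VAE loss."""
    return 0.5 * (mu ** 2 + sigma ** 2 - math.log(sigma ** 2) - 1)


general = kl_gaussians(1.0, 2.0, 0.0, 1.0)
special = kl_to_standard_normal(1.0, 2.0)
reverse = kl_gaussians(0.0, 1.0, 1.0, 2.0)   # arguments swapped

print(math.isclose(general, special))  # True: the two formulas agree
print(math.isclose(general, reverse))  # False: KL is not symmetric
```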

Multivariate Case

For $P = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and $Q = \mathcal{N}(0, I)$:

$$D_{KL}(P \| Q) = \frac{1}{2}\left(\text{tr}(\boldsymbol{\Sigma}) + \boldsymbol{\mu}^T\boldsymbol{\mu} - k - \log \det(\boldsymbol{\Sigma})\right)$$

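where $k$ is the dimensionality of the latent space. For a diagonal covariance $\boldsymbol{\Sigma} = \text{diag}(\sigma_1^2, \ldots, \sigma_k^2)$, the trace and log-determinant become sums over dimensions: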

$$D_{KL}(P \| Q) = \frac{1}{2} \sum_{i=1}^k \left(\mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1\right)$$

Implementation

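```python
import torch

def kl_divergence(mu, logvar):  # function name assumed; the original header was lost
    """
    KL divergence D_KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian.

    Args:
        mu: Mean of shape (batch, latent_dim)
        logvar: Log variance of shape (batch, latent_dim)

    Returns:
        KL divergence summed over latent dimensions, averaged over the batch
    """
    # Formula: -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
    return kl.mean()  # Average over batch
```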

Why This Formula?

Start with definition: $$D_{KL}(q(z|x) \| p(z)) = \mathbb{E}_{q(z|x)}\left[\log \frac{q(z|x)}{p(z)}\right]$$

For $q(z|x) = \mathcal{N}(\mu, \sigma^2)$ and $p(z) = \mathcal{N}(0, 1)$:

$$D_{KL} = \mathbb{E}\left[\log q(z|x) - \log p(z)\right]$$

$$= \mathbb{E}\left[-\frac{(z-\mu)^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2) + \frac{z^2}{2} + \frac{1}{2}\log(2\pi)\right]$$

After simplification (taking expectation over $z \sim \mathcal{N}(\mu, \sigma^2)$):

$$D_{KL} = \frac{1}{2}\left(\mu^2 + \sigma^2 - \log \sigma^2 - 1\right)$$

In code, we use $\log \sigma^2$ (logvar) directly:

$$D_{KL} = \frac{1}{2}\left(\mu^2 + \exp(\text{logvar}) - \text{logvar} - 1\right)$$
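One way to sanity-check the derivation is to estimate $\mathbb{E}_{q}[\log q(z|x) - \log p(z)]$ by sampling and compare it with the closed form. The sketch below does this in plain Python for a single latent dimension; the values of `mu` and `logvar`, the sample count, and the function names are illustrative assumptions.

```python
import math
import random


def kl_closed_form(mu, logvar):
    """0.5 * (mu^2 + exp(logvar) - logvar - 1), the formula derived above."""
    return 0.5 * (mu ** 2 + math.exp(logvar) - logvar - 1)


def kl_monte_carlo(mu, logvar, n_samples, seed=0):
    """Estimate E_q[log q(z) - log p(z)] with z ~ q = N(mu, exp(logvar))."""
    rng = random.Random(seed)
    var = math.exp(logvar)
    sigma = math.sqrt(var)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(mu, sigma)
        log_q = -0.5 * (math.log(2 * math.pi) + logvar + (z - mu) ** 2 / var)
        log_p = -0.5 * (math.log(2 * math.pi) + z ** 2)
        total += log_q - log_p
    return total / n_samples


mu, logvar = 1.0, math.log(0.5)
exact = kl_closed_form(mu, logvar)
estimate = kl_monte_carlo(mu, logvar, n_samples=50000)
print(f"closed form: {exact:.4f}, Monte Carlo: {estimate:.4f}")
```

The closed form is what VAE code uses in practice, since it has no sampling noise and costs a few element-wise operations.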

Role in VAE

β-VAE

Adjust KL weight: $$\text{Loss} = \text{Recon} + \beta \cdot D_{KL}$$

---

15. Vector Quantized VAE (VQ-VAE)

Motivation

Architecture

Input → Encoder → Continuous z_e → Quantization → Discrete z_q → Decoder → Output

The Codebook

$$z_q = e_k \text{ where } k = \arg\min_j \|z_e - e_j\|_2$$
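The lookup rule above is a nearest-neighbor search over the codebook. A minimal plain-Python sketch for a single encoder output follows; the tiny three-entry codebook and the function name `quantize` are made up for illustration.

```python
def quantize(z_e, codebook):
    """Return (k, e_k) where e_k is the codebook entry closest to z_e."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # argmin over codebook indices; squared distance has the same argmin as L2
    k = min(range(len(codebook)), key=lambda j: squared_distance(z_e, codebook[j]))
    return k, codebook[k]


# Hypothetical codebook with K = 3 entries of dimension 2
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

k, z_q = quantize([0.9, 1.2], codebook)
print(k, z_q)  # 1 [1.0, 1.0]
```

In a full VQ-VAE this lookup runs independently at every spatial position of the encoder output, so an image becomes a grid of integer indices.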

VQ-VAE Loss

Three components:

$$\text{Loss} = \underbrace{\|x - \hat{x}\|^2}_{\text{Reconstruction}} + \underbrace{\|\text{sg}[z_e] - e_k\|_2^2}_{\text{Codebook}} + \underbrace{\beta \|z_e - \text{sg}[e_k]\|_2^2}_{\text{Commitment}}$$

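where $\text{sg}[\cdot]$ is the stop-gradient operator: it is the identity in the forward pass and has zero gradient in the backward pass, so its argument is treated as a constant. The codebook term therefore updates only the codebook entries, and the commitment term updates only the encoder.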

Straight-Through Estimator

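Forward pass (nearest-neighbor lookup, which is not differentiable):

$$z_q = e_k, \quad k = \arg\min_j \|z_e - e_j\|_2$$

Backward pass (copy the gradient from $z_q$ straight to $z_e$):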

$$\frac{\partial \text{Loss}}{\partial z_e} = \frac{\partial \text{Loss}}{\partial z_q}$$

"Pretend" quantization is identity in backward pass.

Implementation

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    # commitment_cost=0.25 is an assumed default (beta in the loss above);
    # the original class header and constructor were lost
    def __init__(self, num_embeddings, embedding_dim, commitment_cost=0.25):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.commitment_cost = commitment_cost

        # Codebook
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)
        self.embedding.weight.data.uniform_(-1/num_embeddings, 1/num_embeddings)

    def forward(self, z_e):
        # z_e: (batch, embedding_dim, H, W)

        # Flatten spatial dimensions
        z_e_flat = z_e.permute(0, 2, 3, 1).reshape(-1, self.embedding_dim)

        # Calculate squared distances to codebook entries:
        # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
        distances = (z_e_flat.pow(2).sum(1, keepdim=True)
                     + self.embedding.weight.pow(2).sum(1)
                     - 2 * z_e_flat @ self.embedding.weight.t())

        # Find nearest codebook entry
        encoding_indices = torch.argmin(distances, dim=1)

        # Quantize
        z_q_flat = self.embedding(encoding_indices)
        z_q = z_q_flat.view_as(z_e.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

        # Compute losses
        # Codebook loss ||sg[z_e] - e_k||^2: gradient reaches the codebook only
        codebook_loss = F.mse_loss(z_q, z_e.detach())
        # Commitment loss ||z_e - sg[e_k]||^2: gradient reaches the encoder only
        commitment_loss = F.mse_loss(z_e, z_q.detach())

        # Straight-through estimator
        z_q = z_e + (z_q - z_e).detach()

        loss = codebook_loss + self.commitment_cost * commitment_loss

        return z_q, loss, encoding_indices

class VQVAE(nn.Module):
    def __init__(self, num_embeddings=512, embedding_dim=64):
        super().__init__()

        # Encoder and Decoder are convolutional modules defined elsewhere
        self.encoder = Encoder(embedding_dim)   # Output: (batch, 64, H/4, W/4)
        self.quantizer = VectorQuantizer(num_embeddings, embedding_dim)
        self.decoder = Decoder(embedding_dim)   # Input: (batch, 64, H/4, W/4)

    def forward(self, x):
        z_e = self.encoder(x)
        z_q, vq_loss, indices = self.quantizer(z_e)
        x_recon = self.decoder(z_q)

        return x_recon, vq_loss, indices
```

Training

```python
for batch in dataloader:
    x = batch.to(device)

    x_recon, vq_loss, _ = model(x)

    recon_loss = F.mse_loss(x_recon, x)
    loss = recon_loss + vq_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Advantages of VQ-VAE

Disadvantages

Codebook Collapse Prevention

Applications

---

16. Implementation Guide

Complete VAE Training Script

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import matplotlib.pyplot as plt

# Hyperparameters
BATCH_SIZE = 128
LATENT_DIM = 20
LEARNING_RATE = 1e-3
NUM_EPOCHS = 50
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Data loading
transform = transforms.Compose([
    transforms.ToTensor(),
])

train_dataset = datasets.MNIST('data/', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST('data/', train=False, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

# Model
class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super().__init__()

        # Encoder
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)

        # Decoder
        self.fc3 = nn.Linear(latent_dim, hidden_dim)
        self.fc4 = nn.Linear(hidden_dim, hidden_dim)
        self.fc5 = nn.Linear(hidden_dim, input_dim)

    def encode(self, x):
        h = F.relu(self.fc1(x))
        h = F.relu(self.fc2(h))
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        h = F.relu(self.fc3(z))
        h = F.relu(self.fc4(h))
        return torch.sigmoid(self.fc5(h))

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar

# Loss function
def vae_loss(x, x_recon, mu, logvar):
    # Reconstruction loss (summed over pixels and batch)
    recon_loss = F.binary_cross_entropy(x_recon, x, reduction='sum')

    # KL divergence
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return recon_loss + kl_loss

# Training
model = VAE(latent_dim=LATENT_DIM).to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

def train(epoch):
    model.train()
    train_loss = 0
    for batch_idx, (data, _) in enumerate(train_loader):
        data = data.view(-1, 784).to(DEVICE)

        optimizer.zero_grad()

        x_recon, mu, logvar = model(data)
        loss = vae_loss(data, x_recon, mu, logvar)

        loss.backward()
        train_loss += loss.item()
        optimizer.step()

        if batch_idx % 100 == 0:
            print(f'Epoch {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)}] '
                  f'Loss: {loss.item() / len(data):.4f}')

    # The loss is summed over samples, so divide by the dataset size
    avg_loss = train_loss / len(train_loader.dataset)
    print(f'====> Epoch {epoch} Average loss: {avg_loss:.4f}')

def test(epoch):
    model.eval()
    test_loss = 0
    with torch.no_grad():
        for data, _ in test_loader:
            data = data.view(-1, 784).to(DEVICE)
            x_recon, mu, logvar = model(data)
            loss = vae_loss(data, x_recon, mu, logvar)
            test_loss += loss.item()

    avg_loss = test_loss / len(test_loader.dataset)
    print(f'====> Test set loss: {avg_loss:.4f}')

# Main training loop
for epoch in range(1, NUM_EPOCHS + 1):
    train(epoch)
    test(epoch)

    # Save samples
    if epoch % 5 == 0:
        with torch.no_grad():
            z = torch.randn(64, LATENT_DIM).to(DEVICE)
            samples = model.decode(z).cpu().view(64, 1, 28, 28)

        fig, axes = plt.subplots(8, 8, figsize=(8, 8))
        for i, ax in enumerate(axes.flat):
            ax.imshow(samples[i, 0], cmap='gray')
            ax.axis('off')
        plt.savefig(f'samples_epoch_{epoch}.png')
        plt.close()

# Save model
torch.save(model.state_dict(), 'vae_mnist.pth')
```

Convolutional VAE

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvVAE(nn.Module):  # class name and latent_dim default assumed; header was lost
    def __init__(self, latent_dim=20):
        super().__init__()

        # Encoder
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1),   # 28 -> 14
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 14 -> 7
            nn.ReLU(),
            nn.Conv2d(64, 128, 7),                      # 7 -> 1
            nn.ReLU()
        )

        self.fc_mu = nn.Linear(128, latent_dim)
        self.fc_logvar = nn.Linear(128, latent_dim)

        # Decoder
        self.fc_decode = nn.Linear(latent_dim, 128)

        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 7),                      # 1 -> 7
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),  # 7 -> 14
            nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),   # 14 -> 28
            nn.Sigmoid()
        )

    def encode(self, x):
        h = self.encoder(x)
        h = h.view(h.size(0), -1)  # (batch, 128, 1, 1) -> (batch, 128)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        h = F.relu(self.fc_decode(z))
        h = h.view(-1, 128, 1, 1)
        return self.decoder(h)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar
```

Latent Space Visualization

```python
import numpy as np
import torch
import matplotlib.pyplot as plt

def visualize_latent_space(model, dataloader, device):
    model.eval()
    latents, labels = [], []

    # Encode every image and keep the posterior mean as its latent code
    with torch.no_grad():
        for data, label in dataloader:
            data = data.view(-1, 784).to(device)
            mu, _ = model.encode(data)
            latents.append(mu.cpu().numpy())
            labels.append(label.numpy())

    latents = np.concatenate(latents)
    labels = np.concatenate(labels)

    # Use t-SNE for visualization if latent_dim > 2
    if latents.shape[1] > 2:
        from sklearn.manifold import TSNE
        latents_2d = TSNE(n_components=2).fit_transform(latents)
    else:
        latents_2d = latents

    # Plot
    plt.figure(figsize=(10, 8))
    scatter = plt.scatter(latents_2d[:, 0], latents_2d[:, 1],
                          c=labels, cmap='tab10', alpha=0.6)
    plt.colorbar(scatter)
    plt.title('Latent Space Visualization')
    plt.xlabel('Dimension 1')
    plt.ylabel('Dimension 2')
    plt.show()

visualize_latent_space(model, test_loader, DEVICE)
```

Interpolation

```python
import torch
import matplotlib.pyplot as plt

def interpolate(model, img1, img2, steps=10, device='cpu'):
    model.eval()
    with torch.no_grad():
        mu1, _ = model.encode(img1.view(-1, 784).to(device))
        mu2, _ = model.encode(img2.view(-1, 784).to(device))

        # alpha = 0 decodes img2's code, alpha = 1 decodes img1's code
        interpolations = []
        for alpha in torch.linspace(0, 1, steps):
            z = alpha * mu1 + (1 - alpha) * mu2
            img = model.decode(z)
            interpolations.append(img.cpu().view(28, 28))

    # Plot
    fig, axes = plt.subplots(1, steps, figsize=(15, 2))
    for i, ax in enumerate(axes):
        ax.imshow(interpolations[i], cmap='gray')
        ax.axis('off')
    plt.show()

# Example usage
img1 = test_dataset[0][0]
img2 = test_dataset[1][0]
interpolate(model, img1, img2, steps=10, device=DEVICE)
```

---

17. Common Pitfalls and Solutions

1. Posterior Collapse in VAE

```python
# In training loop
loss = recon_loss + kl_weight(epoch) * kl_loss
```
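The training-loop line above calls `kl_weight(epoch)`, whose definition did not survive in this guide. One common choice is a linear warm-up (KL annealing), sketched below. The `warmup_epochs` and `max_weight` parameters and their defaults are assumptions for illustration, not the course's values.

```python
def kl_weight(epoch, warmup_epochs=10, max_weight=1.0):
    """Linearly ramp the KL weight from 0 to max_weight over warmup_epochs."""
    if warmup_epochs <= 0:
        return max_weight
    return max_weight * min(1.0, epoch / warmup_epochs)


# Weight at a few epochs: starts at 0, reaches max_weight after the warm-up
schedule = [kl_weight(e) for e in (0, 5, 10, 50)]
print(schedule)  # [0.0, 0.5, 1.0, 1.0]
```

Starting with a small KL weight lets the decoder learn to use the latent code before the prior-matching term pushes the posterior toward $\mathcal{N}(0, I)$.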

```python
kl_loss = free_bits_kl(kl_loss, free_bits=2.0)
```

2. Blurry Reconstructions

```python
# forward() of PerceptualLoss; self.feature_extractor is defined in __init__ (not shown)
def forward(self, x, x_recon):
    features_x = self.feature_extractor(x)
    features_recon = self.feature_extractor(x_recon)
    return F.mse_loss(features_recon, features_x)

# Use in training
perceptual_loss_fn = PerceptualLoss()
loss = recon_loss + 0.1 * perceptual_loss_fn(x, x_recon) + kl_loss
```

3. Mode Collapse

4. Numerical Instability

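```python
# Use the logits-based loss: it fuses the sigmoid and the cross-entropy
# into one numerically stable operation (the decoder must output raw logits)
loss = F.binary_cross_entropy_with_logits(logits, x)
```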

5. Training Time Issues

6. Evaluation Mode Errors

7. Codebook Collapse (VQ-VAE)

```python
# usage_count: long tensor of shape (num_embeddings,), zero-initialised before training

# In training loop
unique_indices, counts = torch.unique(encoding_indices, return_counts=True)
usage_count[unique_indices] += counts
```
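Once usage counts are available, two summary numbers make collapse easy to spot: the fraction of codebook entries used at all, and the perplexity (the exponential of the entropy of the usage distribution). The plain-Python sketch below computes both from a list of indices; the function name `codebook_usage` and the toy index lists are assumptions for illustration.

```python
import math
from collections import Counter


def codebook_usage(encoding_indices, num_embeddings):
    """Return (fraction of entries used, perplexity of the usage distribution)."""
    counts = Counter(encoding_indices)
    total = len(encoding_indices)
    fraction_used = len(counts) / num_embeddings
    # Entropy of the empirical distribution over the codes that were selected
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return fraction_used, math.exp(entropy)


# Healthy case: indices spread evenly over a 4-entry codebook
healthy = codebook_usage([0, 1, 2, 3, 0, 1, 2, 3], num_embeddings=4)

# Collapsed case: every position maps to the same entry
collapsed = codebook_usage([2, 2, 2, 2, 2, 2, 2, 2], num_embeddings=4)

print(healthy)    # all entries used, perplexity close to 4
print(collapsed)  # a quarter of the entries used, perplexity 1
```

A perplexity far below the codebook size suggests that only a few entries carry the load, which is the symptom the usage counter is meant to reveal.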

8. Wrong Loss Scale

```python
# Or adjust weight
loss = recon_loss + beta * kl_loss
```

9. Memory Issues

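```python
from torch.utils.checkpoint import checkpoint

# forward() of the VAE; activations are recomputed in the backward pass to save memory.
# use_reentrant=False lets gradients flow even though the input x does not require grad.
def forward(self, x):
    mu, logvar = checkpoint(self.encode, x, use_reentrant=False)
    z = self.reparameterize(mu, logvar)
    x_recon = checkpoint(self.decode, z, use_reentrant=False)
    return x_recon, mu, logvar
```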

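```python
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

# Inside the training loop.
# Note: F.binary_cross_entropy is not autocast-safe; under mixed precision,
# compute the reconstruction loss from logits with F.binary_cross_entropy_with_logits.
optimizer.zero_grad()
with autocast():
    x_recon, mu, logvar = model(x)
    loss = vae_loss(x, x_recon, mu, logvar)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```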

10. Overfitting

---

Summary

Key Takeaways

Comparison Table

| Model | Latent Space | Generation | Reconstruction | Training |
|-------|-------------|------------|----------------|----------|
| Autoencoder | Continuous, unstructured | ❌ No | ✅ Sharp | Easy |
| VAE | Continuous, structured | ✅ Yes | ⚠️ Blurry | Medium |
| VQ-VAE | Discrete | ✅ Yes (with prior) | ✅ Sharp | Hard |

When to Use Each

Expected Performance on MNIST

---

References

---

End of Lesson

Good luck with Assignment 7!



Created: November 13, 2025
Source: Autoencoders (1).pdf (35 pages)
CMPUT 328 - Visual Recognition
