CMPUT 328 Assignment 4: Vision Transformers & LoRA

Complete Study Guide for Vision Transformers, Transfer Learning, and Low-Rank Adaptation

1. Introduction to Vision Transformers

What is a Vision Transformer (ViT)?

Vision Transformer (ViT) applies the transformer architecture (originally designed for NLP) directly to images by treating image patches as tokens.

Key innovation: Instead of using convolutions, ViT relies entirely on self-attention mechanisms to process visual information.

Why Vision Transformers?

ViT vs CNN: Philosophical Difference

Aspect | CNN | Vision Transformer
Inductive bias | Strong (locality, translation equivariance) | Minimal (learns from data)
Receptive field | Grows gradually layer-by-layer | Global from first layer
Data requirement | Works well with small datasets | Needs large datasets to excel
Computational cost | O(n) per layer (local operations) | O(n²) self-attention (global)

2. Image Patches & Tokenization

Converting Images to Sequences

Since transformers process sequences, ViT divides an image into fixed-size patches and treats each patch as a token.

Patch Extraction Process

For a 32×32 image with patch_size=4:
- Number of patches per dimension: 32 ÷ 4 = 8
- Total patches: 8 × 8 = 64 patches
- Each patch: 4×4×3 (RGB) = 48 values
- Flattened patch dimension: 48
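The arithmetic above generalizes to any square image. The following is a minimal sketch in plain Python; the helper name `patch_stats` is hypothetical, and the 224/16 setting is only an illustrative second input, not something the assignment uses.

```python
def patch_stats(image_size, patch_size, channels=3):
    """Return (num_patches, flattened_patch_dim) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    num_patches = per_side * per_side
    patch_dim = channels * patch_size * patch_size
    return num_patches, patch_dim


print(patch_stats(32, 4))    # (64, 48): the CIFAR-10 setting above
print(patch_stats(224, 16))  # (196, 768)
```

Halving the patch size quadruples the number of tokens, which matters because self-attention cost grows quadratically with sequence length.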
def img_to_patch(x, patch_size, flatten_channels=True):
    """
    Convert an image batch into a sequence of patches.

    Args:
        x: [B, C, H, W] image tensor
        patch_size: Size of each patch (e.g., 4)
        flatten_channels: If True, flatten to [B, num_patches, C*P*P]

    Returns:
        patches: [B, num_patches, patch_dim] if flatten_channels=True,
                 otherwise [B, num_patches, C, patch_size, patch_size]
    """
    b, c, h, w = x.shape
    patch_h = h // patch_size
    patch_w = w // patch_size

    # Unfold height, then width, into non-overlapping patches:
    # [B, C, patch_h, patch_w, patch_size, patch_size]
    patches = x.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)

    # Merge the patch grid: [B, C, num_patches, patch_size, patch_size]
    patches = patches.contiguous().view(
        b, c, patch_h * patch_w, patch_size, patch_size
    )
    # Move the patch axis forward: [B, num_patches, C, patch_size, patch_size]
    patches = patches.permute(0, 2, 1, 3, 4)

    if flatten_channels:
        # permute() returns a non-contiguous tensor, so view() would raise here;
        # reshape() copies when needed: [B, num_patches, C*patch_size*patch_size]
        patches = patches.reshape(b, patch_h * patch_w, -1)
    return patches

Linear Projection (Patch Embedding)

After extracting patches, a linear layer projects each patch to the embedding dimension:

Patch Embedding: patch_embed = Linear(C × P × P, embed_dim)

For CIFAR-10 (patch_size=4, embed_dim=256):
- Input: 3 × 4 × 4 = 48 dimensions
- Output: 256 dimensions (embedding)

Important: Positional Information

Unlike CNNs, transformers have no inherent spatial awareness. Positional encodings must be added to tell the model where each patch came from in the original image.

3. Self-Attention Mechanism

What is Self-Attention?

Self-attention allows each patch to attend to all other patches, learning which parts of the image are relevant to each other.

Attention Formula

Scaled Dot-Product Attention:

Q = x × W_Q (Query)
K = x × W_K (Key)
V = x × W_V (Value)

Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V

Where:
- d_k: dimension of key vectors (for scaling)
- Softmax normalizes attention scores to sum to 1
- Result: weighted combination of values
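To make the formula concrete, here is a small sketch of scaled dot-product attention using only the Python standard library. It is an illustration, not the assignment's implementation: the toy Q, K, V matrices are made up, and the projections W_Q, W_K, W_V are assumed to have been applied already.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d_k)) V; returns (output, weights)."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Three toy tokens with d_k = 2 (hypothetical values)
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, weights = attention(q, k, v)
for row in weights:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 3))
```

Each row of `weights` is one token's attention distribution over all tokens, so every row sums to 1, and each output row is the corresponding weighted average of the rows of `v`.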

Multi-Head Attention

Instead of a single attention operation, multi-head attention runs several attention heads in parallel:

Multi-Head Attention:

For num_heads = 8:
- Split embed_dim (256) into 8 heads of 32 dimensions each
- Each head learns different relationships
- Concatenate outputs from all heads
- Final linear projection

Benefit: Captures different types of relationships simultaneously
import torch.nn as nn


class AttentionBlock(nn.Module):
    def __init__(self, embed_dim, hidden_dim, num_heads, dropout=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(
            embed_dim=embed_dim,
            num_heads=num_heads,
            batch_first=True,  # inputs are [B, seq_len, embed_dim]
        )
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # Pre-norm architecture: normalize before attention
        attn_input = self.norm1(x)
        attn_output, _ = self.attn(attn_input, attn_input, attn_input)
        x = x + attn_output  # Residual connection

        # Feed-forward network
        ff_input = self.norm2(x)
        x = x + self.ff(ff_input)  # Residual connection
        return x

Why Self-Attention for Vision?

4. ViT Architecture

Complete ViT Pipeline

  1. Patch Extraction: Split image into patches
  2. Linear Projection: Embed patches to embed_dim
  3. Add [CLS] Token: Prepend learnable classification token
  4. Add Positional Embeddings: Encode spatial positions
  5. Transformer Encoder: Stack of attention blocks
  6. Classification Head: MLP on [CLS] token output
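The six steps above can be followed as a sequence of tensor shapes. The sketch below does only the shape bookkeeping in plain Python (no tensors are created); the function name `vit_shape_trace` and the batch size of 128 are assumptions made for illustration.

```python
def vit_shape_trace(batch, channels, image_size, patch_size, embed_dim, num_classes):
    """Return (stage, shape) pairs for the ViT pipeline steps."""
    num_patches = (image_size // patch_size) ** 2
    patch_dim = channels * patch_size * patch_size
    seq_len = num_patches + 1  # +1 for the prepended [CLS] token
    return [
        ("input image", (batch, channels, image_size, image_size)),
        ("1. patch extraction", (batch, num_patches, patch_dim)),
        ("2. linear projection", (batch, num_patches, embed_dim)),
        ("3. add [CLS] token", (batch, seq_len, embed_dim)),
        ("4. add positional embeddings", (batch, seq_len, embed_dim)),
        ("5. transformer encoder", (batch, seq_len, embed_dim)),
        ("6. classification head on [CLS]", (batch, num_classes)),
    ]


# CIFAR-10 setting: 32x32 RGB images, patch_size=4, embed_dim=256, 10 classes
for stage, shape in vit_shape_trace(128, 3, 32, 4, 256, 10):
    print(f"{stage:<32} {shape}")
```

Note that steps 4 and 5 leave the shape unchanged: positional embeddings are added element-wise, and every transformer block maps a sequence to a sequence of the same size.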

[CLS] Token

The [CLS] token is a learnable embedding prepended to the sequence. After passing through all transformer layers, its final representation is used for classification.

Intuition: The [CLS] token aggregates information from all patches through self-attention, creating a global image representation.

Positional Embeddings

Learnable Positional Embeddings:

pos_embedding = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

- Shape: [1, 65, 256] for 64 patches + 1 CLS token
- Learned during training (not fixed sinusoidal)
- Added element-wise to patch embeddings
import torch
import torch.nn as nn


class VisionTransformer(nn.Module):
    def __init__(self, embed_dim, hidden_dim, num_channels, num_heads,
                 num_layers, num_classes, patch_size, num_patches, dropout=0.0):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = num_patches
        patch_dim = num_channels * patch_size * patch_size

        # Patch embedding
        self.patch_embed = nn.Linear(patch_dim, embed_dim)

        # Transformer blocks (AttentionBlock is defined in Section 3)
        self.transformer = nn.ModuleList([
            AttentionBlock(embed_dim, hidden_dim, num_heads, dropout)
            for _ in range(num_layers)
        ])

        # Classification head
        self.mlp_head = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, num_classes),
        )

        # Learnable parameters
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embedding = nn.Parameter(
            torch.zeros(1, num_patches + 1, embed_dim)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Extract and embed patches (img_to_patch is defined in Section 2)
        patches = img_to_patch(x, self.patch_size)
        tokens = self.patch_embed(patches)

        # Add CLS token
        batch_size = tokens.size(0)
        cls_tokens = self.cls_token.expand(batch_size, -1, -1)
        x = torch.cat((cls_tokens, tokens), dim=1)

        # Add positional encoding
        x = x + self.pos_embedding[:, :x.size(1)]
        x = self.dropout(x)

        # Apply transformer blocks
        for block in self.transformer:
            x = block(x)

        # Classification from CLS token
        cls = x[:, 0]  # Extract [CLS] token
        out = self.mlp_head(cls)
        return out

ViT Configuration for CIFAR-10

Parameter | Value | Description
patch_size | 4 | Each patch is 4×4 pixels
num_patches | 64 | 32÷4 = 8, so 8×8 = 64 patches
embed_dim | 256 | Embedding dimension
hidden_dim | 512 | Feed-forward hidden size (2×embed_dim)
num_heads | 8 | Multi-head attention heads
num_layers | 6 | Number of transformer blocks
dropout | 0.1 | Dropout rate
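As a sanity check on model size, the parameter count for this configuration can be worked out by hand. The sketch below mirrors the layers of the VisionTransformer class shown earlier and assumes PyTorch's default settings (bias terms in every Linear and in nn.MultiheadAttention, affine LayerNorm); the helper names are hypothetical.

```python
def linear_params(n_in, n_out):
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out


def vit_param_count(embed_dim, hidden_dim, num_channels, num_layers,
                    num_classes, patch_size, num_patches):
    layer_norm = 2 * embed_dim  # scale and shift vectors
    # Q, K, V input projections plus the output projection
    attention = 4 * linear_params(embed_dim, embed_dim)
    feed_forward = (linear_params(embed_dim, hidden_dim)
                    + linear_params(hidden_dim, embed_dim))
    block = 2 * layer_norm + attention + feed_forward

    patch_embed = linear_params(num_channels * patch_size ** 2, embed_dim)
    head = layer_norm + linear_params(embed_dim, num_classes)
    cls_token = embed_dim
    pos_embedding = (num_patches + 1) * embed_dim
    return patch_embed + num_layers * block + head + cls_token + pos_embedding


total = vit_param_count(embed_dim=256, hidden_dim=512, num_channels=3,
                        num_layers=6, num_classes=10, patch_size=4,
                        num_patches=64)
print(f"{total:,}")  # 3,195,146
```

The number of heads does not appear in the count: splitting embed_dim across heads changes how the projections are used, not how many weights they contain. Almost all parameters sit in the six transformer blocks.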

5. Training Vision Transformers

Optimizer: AdamW

AdamW (Adam with decoupled weight decay) is the standard optimizer for training transformers.

Learning Rate Schedule: Cosine Annealing

Cosine Annealing LR:

η_t = η_min + (η_max - η_min) × (1 + cos(πt/T)) / 2

Where:
- η_max: initial learning rate (3e-4)
- η_min: minimum LR (typically 0)
- t: current epoch
- T: total epochs (T_max)

Effect: Gradually decreases LR following a cosine curve
import torch.optim as optim


# Method of the LightningModule (see Section 9)
def configure_optimizers(self):
    optimizer = optim.AdamW(
        self.parameters(),
        lr=self.lr,
        weight_decay=self.weight_decay,
    )
    # Anneal the learning rate over max_epochs
    scheduler = optim.lr_scheduler.CosineAnnealingLR(
        optimizer,
        T_max=self.max_epochs,
    )
    return {
        "optimizer": optimizer,
        "lr_scheduler": scheduler,
    }
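The cosine formula above can be evaluated directly to see what the schedule does. This is a plain-Python sketch of the closed-form expression, not a call into PyTorch's scheduler; the function name `cosine_lr` is hypothetical, and T = 10 is chosen to match a 10-epoch run.

```python
import math


def cosine_lr(t, T, eta_max=3e-4, eta_min=0.0):
    """Learning rate at epoch t under cosine annealing over T epochs."""
    return eta_min + (eta_max - eta_min) * (1 + math.cos(math.pi * t / T)) / 2


for epoch in (0, 2, 5, 8, 10):
    print(epoch, f"{cosine_lr(epoch, T=10):.2e}")
```

The rate starts at η_max, passes through the midpoint (η_max + η_min) / 2 at t = T/2, and reaches η_min at t = T. The decay is slow at both ends and fastest in the middle of training.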

Data Augmentation for CIFAR-10

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # Random crop with padding
    transforms.RandomHorizontalFlip(),      # 50% chance horizontal flip
    transforms.ToTensor(),
    transforms.Normalize(
        mean=(0.4914, 0.4822, 0.4465),      # CIFAR-10 mean
        std=(0.2023, 0.1994, 0.2010),       # CIFAR-10 std
    ),
])

Training Tips

Expected Performance on CIFAR-10

Model | Test Accuracy | Training Time
ViT-Small (from scratch) | ~75-80% | 10 epochs
CNN (Assignment 3) | ~85-90% | Comparable
Pre-trained ViT (fine-tuned) | ~90-95% | Much faster

6. ViT vs CNN Comparison

Why Does CNN Outperform ViT on Small Datasets?

Inductive biases built into CNNs (locality, translation equivariance) help with small datasets. ViT needs to learn these patterns from data.

Detailed Comparison

Aspect | CNN | ViT (from scratch) | ViT (pre-trained)
Small dataset (<50k) | Excellent | Mediocre | Excellent
Large dataset (>1M) | Good | Excellent | Excellent
Training time | Fast | Slow (quadratic attention) | Very fast (fine-tuning)
Transfer learning | Good | Excellent | Excellent
Interpretability | Moderate (feature maps) | Good (attention maps) | Good (attention maps)
Parameters | Fewer | More | More

Key Insight from Assignment

On CIFAR-10 (50k training samples), the CNN achieves ~85-90% accuracy, while the ViT trained from scratch achieves ~75-80%. This demonstrates the data-hungry nature of transformers.

However, a pre-trained ViT fine-tuned on CIFAR-10 can match or exceed CNN performance!

7. LoRA (Low-Rank Adaptation)

What is LoRA?

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable low-rank decomposition matrices.

LoRA Concept

Standard Fine-tuning:

W_new = W_pretrained + ΔW
All parameters updated → expensive!

LoRA:

W_new = W_pretrained + B × A

Where:
- W: [d × k] original weight matrix (FROZEN)
- A: [r × k] trainable matrix
- B: [d × r] trainable matrix
- r: rank (r << min(d, k))

Trainable parameters: r(d + k) instead of d×k
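The saving from r(d + k) versus d×k is easy to quantify. The sketch below compares the two counts for a single weight matrix; the 768×768 size is an assumed example (chosen as a typical transformer projection size), and `lora_param_counts` is a hypothetical helper name.

```python
def lora_param_counts(d, k, r):
    """Return (full fine-tuning params, LoRA params) for one d x k matrix."""
    full = d * k          # every entry of W is trainable
    lora = r * (d + k)    # B is d x r, A is r x k
    return full, lora


for r in (4, 16, 64):
    full, lora = lora_param_counts(768, 768, r)
    print(f"r={r:>2}: {lora:>6} trainable vs {full} ({100 * lora / full:.2f}%)")
```

The LoRA count grows linearly in r, so doubling the rank doubles the trainable parameters while the frozen matrix stays the same size.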

LoRA Benefits

LoRA Configuration

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                    # Rank (bottleneck dimension)
    lora_alpha=32,           # Scaling factor (typically 2×r)
    lora_dropout=0.05,       # Dropout for regularization
    bias="none",             # Don't adapt bias terms
    target_modules=[         # Which layers to adapt
        "attn.c_attn",       # Query, key, value projections
        "attn.c_proj",       # Output projection
        "mlp.c_fc",          # MLP first layer
        "mlp.c_proj",        # MLP second layer
    ],
    task_type="CAUSAL_LM",   # Task type
)

# Apply LoRA to model
model.decoder = get_peft_model(model.decoder, lora_config)
model.decoder.print_trainable_parameters()
# Output: trainable params: ~0.5M / total: ~100M (0.5%)
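To see how r and lora_alpha enter the computation, the following plain-Python sketch applies a LoRA-adapted layer to one input vector. It assumes the standard LoRA formulation, in which the low-rank update is scaled by lora_alpha / r and B starts at zero; the toy matrices and the function names are made up for illustration and are not taken from the peft library.

```python
def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def lora_forward(x, W, A, B, alpha, r):
    """Compute W x + (alpha / r) * B (A x) without forming B A explicitly."""
    base = matvec(W, x)                # frozen pre-trained path
    update = matvec(B, matvec(A, x))   # trainable low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes: d = 2 outputs, k = 3 inputs, rank r = 1
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen weight, d x k
A = [[1.0, 1.0, 1.0]]                    # r x k
x = [1.0, 2.0, 3.0]

B_init = [[0.0], [0.0]]                  # d x r, zero at initialization
print(lora_forward(x, W, A, B_init, alpha=2, r=1))     # [1.0, 2.0]

B_trained = [[0.5], [0.0]]               # pretend training moved B
print(lora_forward(x, W, A, B_trained, alpha=2, r=1))  # [7.0, 2.0]
```

With B at zero the adapted layer reproduces the pre-trained output exactly, so fine-tuning starts from the original model's behavior.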

LoRA for Image Captioning (Assignment Task)

In Assignment 4, LoRA is applied to a ViT-GPT2 image captioning model:

Choosing Rank (r)

8. CLIP & Zero-Shot Learning

What is CLIP?

CLIP (Contrastive Language-Image Pre-training) learns to match images with text descriptions by training on 400M image-text pairs from the internet.

CLIP Architecture

Zero-Shot Classification with CLIP

Zero-Shot Inference:

  1. Encode class names as text: text_features = encode_text(["a dog", "a cat", ...])
  2. Encode test image: image_features = encode_image(image)
  3. Compute similarity (cosine): similarity = image_features @ text_features.T
  4. Predict class with highest similarity: prediction = argmax(similarity)
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load pre-trained CLIP
model, preprocess = clip.load("ViT-B/32", device=device)

# Class prompts (list truncated here; fill in the remaining class names)
class_names = ["airplane", "automobile", "bird", ...]
text_inputs = clip.tokenize(class_names).to(device)

# Encode text once and L2-normalize; no gradients are needed for inference
with torch.no_grad():
    text_features = model.encode_text(text_inputs)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)


# Zero-shot prediction
@torch.no_grad()
def predict(image):
    image_input = preprocess(image).unsqueeze(0).to(device)
    image_features = model.encode_image(image_input)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    # Cosine similarity between unit vectors, scaled by 100
    logits = 100.0 * image_features @ text_features.T
    return logits.argmax(dim=-1)
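The decision rule itself (normalize, take dot products, pick the largest) does not depend on CLIP and can be checked on toy vectors. The sketch below uses plain Python with made-up 2-D features standing in for real encoder outputs; `zero_shot_predict` is a hypothetical name.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_predict(image_feature, text_features):
    """Return (index of best class, list of cosine similarities)."""
    img = normalize(image_feature)
    sims = [
        sum(a * b for a, b in zip(img, normalize(t)))
        for t in text_features
    ]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims


# Made-up features; real ones would come from the text and image encoders
class_names = ["a dog", "a cat"]
text_features = [[1.0, 0.1], [0.1, 1.0]]
image_feature = [0.9, 0.2]

best, sims = zero_shot_predict(image_feature, text_features)
print(class_names[best], [round(s, 3) for s in sims])
```

Because both sides are normalized, the dot product equals the cosine of the angle between the vectors, so the prediction depends only on direction and not on feature magnitude.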

CLIP Performance on CIFAR-10

Zero-shot CLIP (no training on CIFAR-10) achieves ~88-90% accuracy on the CIFAR-10 test set!

This demonstrates the power of large-scale pre-training and vision-language alignment.

9. Implementation Details

PyTorch Lightning for Training

Assignment 4 uses PyTorch Lightning for clean, modular training code:

import pytorch_lightning as pl
import torch.nn as nn
import torch.optim as optim
from torchmetrics.classification import MulticlassAccuracy


class ViT(pl.LightningModule):
    def __init__(self, model_kwargs, lr, weight_decay, max_epochs):
        super().__init__()
        # Store hyperparameters; configure_optimizers reads them later
        self.lr = lr
        self.weight_decay = weight_decay
        self.max_epochs = max_epochs

        self.model = VisionTransformer(**model_kwargs)
        self.criterion = nn.CrossEntropyLoss()
        self.train_acc = MulticlassAccuracy(num_classes=10)
        self.val_acc = MulticlassAccuracy(num_classes=10)

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        images, targets = batch
        logits = self(images)
        loss = self.criterion(logits, targets)

        preds = logits.argmax(dim=1)
        self.train_acc(preds, targets)
        self.log("train_loss", loss)
        self.log("train_acc", self.train_acc)
        return loss

    def configure_optimizers(self):
        optimizer = optim.AdamW(
            self.parameters(),
            lr=self.lr,
            weight_decay=self.weight_decay,
        )
        scheduler = optim.lr_scheduler.CosineAnnealingLR(
            optimizer, T_max=self.max_epochs
        )
        return {"optimizer": optimizer, "lr_scheduler": scheduler}

Training LoRA on Custom Dataset

import torch
from torch.nn.utils import clip_grad_norm_


def finetune_lora(model, dataloader, epochs, max_steps,
                  optimizer, scaler, grad_accum=1):
    global_step = 0
    for epoch in range(epochs):
        for step, batch in enumerate(dataloader):
            pixel_values, input_ids, attention_mask = batch

            # Mixed precision forward pass
            with torch.autocast(device_type="cuda", enabled=True):
                outputs = model(
                    pixel_values=pixel_values,
                    labels=input_ids,
                    decoder_attention_mask=attention_mask,
                )
                # Scale the loss so accumulated gradients average over micro-batches
                loss = outputs.loss / grad_accum

            # Backward with gradient scaling
            scaler.scale(loss).backward()

            # Update weights once every grad_accum micro-batches
            if (step + 1) % grad_accum == 0:
                scaler.unscale_(optimizer)  # unscale before clipping
                clip_grad_norm_(model.decoder.parameters(), max_norm=1.0)
                scaler.step(optimizer)
                scaler.update()
                optimizer.zero_grad()
                global_step += 1

                if global_step >= max_steps:
                    # return, not break: break would only leave the inner
                    # loop and training would resume in the next epoch
                    return
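The division by grad_accum in the loop above is what makes accumulated micro-batch gradients match a single large-batch gradient. The sketch below checks this on a one-parameter least-squares model in plain Python; the model, the data, and the helper name `mse_grad` are all made up for illustration.

```python
def mse_grad(w, xs, ys):
    """Gradient with respect to w of mean((w * x - y)^2) over the samples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Gradient from one batch containing all four samples
full_grad = mse_grad(w, xs, ys)

# Same samples as two micro-batches of two, each loss divided by grad_accum
grad_accum = 2
micro = len(xs) // grad_accum
accumulated = 0.0
for i in range(grad_accum):
    chunk = slice(i * micro, (i + 1) * micro)
    accumulated += mse_grad(w, xs[chunk], ys[chunk]) / grad_accum

print(full_grad, accumulated)  # -22.5 -22.5
```

The equality holds when the micro-batches have equal size; without the division the accumulated gradient would be grad_accum times too large, which acts like a silently increased learning rate.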

Key Implementation Choices

Component | Choice | Reason
Optimizer | AdamW | Standard for transformers, decoupled weight decay
LR Schedule | Cosine Annealing | Smooth decay, good for transformers
Normalization | Pre-LN (LayerNorm before attention) | More stable training than Post-LN
Activation | GELU | Standard for transformers (smoother than ReLU)
Gradient Clip | 1.0 | Prevents exploding gradients
Mixed Precision | Enabled (FP16) | Faster training, lower memory

Summary: Key Takeaways

Vision Transformers

LoRA (Low-Rank Adaptation)

CLIP

Assignment Results
