CMPUT 328 Assignment 4: Vision Transformers & LoRA

1. Introduction to Vision Transformers

What is a Vision Transformer (ViT)?

The Vision Transformer (ViT) applies the transformer architecture (originally designed for NLP) directly to images by treating image patches as tokens.

Key innovation: Instead of convolutions, ViT uses self-attention as its only mechanism for mixing information across spatial positions.

Why Vision Transformers?

Global receptive field: Self-attention can attend to any part of the image from layer 1
Scalability: Performance improves with more data and larger models
Architectural simplicity: No need for hand-crafted convolution hierarchies
Transfer learning: Pre-train on large datasets, then fine-tune on smaller tasks

ViT vs CNN: Philosophical Difference

Aspect	CNN	Vision Transformer
Inductive bias	Strong (locality, translation equivariance)	Minimal (learns from data)
Receptive field	Grows gradually layer-by-layer	Global from first layer
Data requirement	Works well with small datasets	Needs large datasets to excel
Computational cost	O(n) per layer in the number of pixels n (local operations)	O(n²) per layer in the number of patches n (global self-attention)

2. Image Patches & Tokenization

Converting Images to Sequences

Since transformers process sequences, ViT divides an image into fixed-size patches and treats each patch as a token.

Patch Extraction Process

For a 32×32 image with patch_size=4:
    Number of patches per dimension: 32 ÷ 4 = 8
    Total patches: 8 × 8 = 64 patches
    Each patch: 4×4×3 (RGB) = 48 values
    Flattened patch dimension: 48
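
The arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch; the helper name patch_stats is illustrative and is not part of the assignment code.

```python
def patch_stats(height, width, channels, patch_size):
    """Return (num_patches, patch_dim) for non-overlapping square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    num_patches = (height // patch_size) * (width // patch_size)
    patch_dim = channels * patch_size * patch_size
    return num_patches, patch_dim


# CIFAR-10 setting from the text: 32x32 RGB image, 4x4 patches
num_patches, patch_dim = patch_stats(32, 32, 3, 4)
print(num_patches, patch_dim)  # 64 48
```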

def img_to_patch(x, patch_size, flatten_channels=True):
    """
    Convert image to patches

    Args:
        x: [B, C, H, W] image tensor
        patch_size: Size of each patch (e.g., 4)
        flatten_channels: If True, flatten to [B, num_patches, C*P*P]

    Returns:
        patches: [B, num_patches, C*P*P] if flatten_channels=True,
                 otherwise [B, num_patches, C, P, P]
    """
    b, c, h, w = x.shape
    patch_h = h // patch_size
    patch_w = w // patch_size

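    # Unfold height, then width, into non-overlapping windows.
    # Result: [B, C, patch_h, patch_w, patch_size, patch_size]
    # The parentheses are required for the chained call to span two lines.
    patches = (
        x.unfold(2, patch_size, patch_size)
         .unfold(3, patch_size, patch_size)
    )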

    # Reshape: [B, C, patch_h, patch_w, patch_size, patch_size]
    # → [B, num_patches, C, patch_size, patch_size]
    patches = patches.contiguous().view(
        b, c, patch_h * patch_w, patch_size, patch_size
    )
    patches = patches.permute(0, 2, 1, 3, 4)

    if flatten_channels:
        # Flatten to [B, num_patches, C*patch_size*patch_size].
        # permute leaves the tensor non-contiguous, so use reshape, not view.
        patches = patches.reshape(b, patch_h * patch_w, -1)

    return patches
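
To see the order in which patches and their pixels come out, here is a dependency-free sketch of the same idea for a single-channel image stored as nested lists. The function extract_patches is a hypothetical helper for illustration only; it assumes the image sides are divisible by patch_size.

```python
def extract_patches(image, patch_size):
    """Split an H x W nested list into flattened patches, row-major over the grid."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):        # patch rows, top to bottom
        for left in range(0, width, patch_size):    # patch columns, left to right
            patch = [image[top + dy][left + dx]
                     for dy in range(patch_size)
                     for dx in range(patch_size)]
            patches.append(patch)
    return patches


# 4x4 image whose pixel value equals its flat index (0..15)
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = extract_patches(image, 2)
for index, patch in enumerate(patches):
    print(index, patch)
```

Patch 0 is the top-left 2×2 block and patch 1 is the block to its right, which is the same left-to-right, top-to-bottom order that the two unfold calls produce.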

Linear Projection (Patch Embedding)

After extracting patches, a linear layer projects each patch to the embedding dimension:

Patch Embedding:
    patch_embed = Linear(C × P × P, embed_dim)

For CIFAR-10 (patch_size=4, embed_dim=256):
    Input: 3 × 4 × 4 = 48 dimensions
    Output: 256 dimensions (embedding)

Important: Positional Information

Unlike CNNs, transformers have no inherent spatial awareness. Positional encodings must be added to tell the model where each patch came from in the original image.
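
One way to see why positional information is needed: without it, self-attention is permutation-equivariant, so shuffling the input tokens just shuffles the outputs the same way. The sketch below demonstrates this with a toy single-head attention that uses identity Q/K/V projections, a simplifying assumption made only to keep the example short.

```python
import math


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections."""
    d = len(tokens[0])
    outputs = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[i] for w, v in zip(weights, tokens))
                        for i in range(d)])
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
shuffled = [tokens[2], tokens[0], tokens[1]]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Shuffling the inputs only reorders the outputs; no token "knows" its position.
expected = [out[2], out[0], out[1]]
same = all(math.isclose(a, b, abs_tol=1e-12)
           for row_a, row_b in zip(out_shuffled, expected)
           for a, b in zip(row_a, row_b))
print(same)  # True
```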

3. Self-Attention Mechanism

What is Self-Attention?

Self-attention allows each patch to attend to all other patches, learning which parts of the image are relevant to each other.

Attention Formula

Scaled Dot-Product Attention:
    Q = x × W_Q   (Query)
    K = x × W_K   (Key)
    V = x × W_V   (Value)

    Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V

Where:
    - d_k: dimension of key vectors (for scaling)
    - Softmax normalizes attention scores to sum to 1
    - Result: weighted combination of values
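
The formula can be traced step by step in plain Python. This is an illustrative sketch, not the assignment implementation: matrices are lists of rows, and the identity projection matrices are an assumption chosen so the numbers are easy to follow.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    top = max(row)
    exps = [math.exp(s - top) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(x, w_q, w_k, w_v):
    """Return (output, weights) for a single attention head."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]          # K^T
    scores = matmul(q, k_t)                       # Q × K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights            # softmax(...) × V


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # 3 tokens, 2 features each
identity = [[1.0, 0.0], [0.0, 1.0]]               # stand-in for W_Q, W_K, W_V
output, weights = scaled_dot_product_attention(x, identity, identity, identity)

for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention distribution over all three tokens and sums to 1.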

Multi-Head Attention

Instead of a single attention operation, multi-head attention runs several attention heads in parallel:

Multi-Head Attention:
    For num_heads = 8:
    - Split embed_dim (256) into 8 heads of 32 dimensions each
    - Each head learns different relationships
    - Concatenate outputs from all heads
    - Final linear projection

Benefit: Captures different types of relationships simultaneously
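
The split and concatenate steps are just slicing. The sketch below shows them on a single 256-dimensional vector; the helper names are illustrative, and in a real layer the split is applied to the projected Q, K and V rather than to the raw embedding.

```python
def split_heads(vector, num_heads):
    """Slice one embedding vector into num_heads equal chunks."""
    if len(vector) % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    head_dim = len(vector) // num_heads
    return [vector[i * head_dim:(i + 1) * head_dim] for i in range(num_heads)]


def merge_heads(heads):
    """Concatenate per-head chunks back into one vector."""
    return [value for head in heads for value in head]


embedding = list(range(256))            # stand-in for one token's 256-dim vector
heads = split_heads(embedding, 8)
print(len(heads), len(heads[0]))        # 8 32
```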

import torch
import torch.nn as nn


class AttentionBlock(nn.Module):
    def __init__(self, embed_dim, hidden_dim, num_heads, dropout=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(
            embed_dim=embed_dim,
            num_heads=num_heads,
            batch_first=True
        )
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # Pre-norm architecture
        attn_input = self.norm1(x)
        attn_output, _ = self.attn(attn_input, attn_input, attn_input)
        x = x + attn_output  # Residual connection

        # Feed-forward network
        ff_input = self.norm2(x)
        x = x + self.ff(ff_input)  # Residual connection
        return x

Why Self-Attention for Vision?

Long-range dependencies: Can relate distant parts of the image in one step
Adaptive receptive fields: Learns what to attend to based on content
Interpretable: Attention maps show what the model focuses on

4. ViT Architecture

Complete ViT Pipeline

Patch Extraction: Split image into patches
Linear Projection: Embed patches to embed_dim
Add [CLS] Token: Prepend learnable classification token
Add Positional Embeddings: Encode spatial positions
Transformer Encoder: Stack of attention blocks
Classification Head: MLP on [CLS] token output
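
The pipeline is easiest to follow as a sequence of tensor shapes. The sketch below only does the shape bookkeeping (no tensors are created); the function name and the batch size of 128 are assumptions for illustration.

```python
def vit_shape_trace(batch, channels, image_size, patch_size, embed_dim, num_classes):
    """List the tensor shape after each stage of the ViT pipeline."""
    num_patches = (image_size // patch_size) ** 2
    patch_dim = channels * patch_size ** 2
    return [
        ("input image", (batch, channels, image_size, image_size)),
        ("patch extraction", (batch, num_patches, patch_dim)),
        ("linear projection", (batch, num_patches, embed_dim)),
        ("prepend [CLS] token", (batch, num_patches + 1, embed_dim)),
        ("add positional embeddings", (batch, num_patches + 1, embed_dim)),
        ("transformer encoder", (batch, num_patches + 1, embed_dim)),
        ("select [CLS] output", (batch, embed_dim)),
        ("classification head", (batch, num_classes)),
    ]


trace = vit_shape_trace(batch=128, channels=3, image_size=32,
                        patch_size=4, embed_dim=256, num_classes=10)
for stage, shape in trace:
    print(f"{stage:<28}{shape}")
```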

[CLS] Token

The [CLS] token is a learnable embedding prepended to the sequence. After passing through all transformer layers, its final representation is used for classification.

Intuition: The [CLS] token aggregates information from all patches through self-attention, creating a global image representation.

Positional Embeddings

Learnable Positional Embeddings:
    pos_embedding = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    - Shape: [1, 65, 256] for 64 patches + 1 CLS token
    - Learned during training (not fixed sinusoidal)
    - Added element-wise to patch embeddings

class VisionTransformer(nn.Module):
    def __init__(self, embed_dim, hidden_dim, num_channels,
                 num_heads, num_layers, num_classes,
                 patch_size, num_patches, dropout=0.0):
        super().__init__()

        self.patch_size = patch_size
        self.num_patches = num_patches
        patch_dim = num_channels * patch_size * patch_size

        # Patch embedding
        self.patch_embed = nn.Linear(patch_dim, embed_dim)

        # Transformer blocks
        self.transformer = nn.ModuleList([
            AttentionBlock(embed_dim, hidden_dim, num_heads, dropout)
            for _ in range(num_layers)
        ])

        # Classification head
        self.mlp_head = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, num_classes),
        )

        # Learnable parameters
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embedding = nn.Parameter(
            torch.zeros(1, num_patches + 1, embed_dim)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Extract and embed patches
        patches = img_to_patch(x, self.patch_size)
        tokens = self.patch_embed(patches)

        # Add CLS token
        batch_size = tokens.size(0)
        cls_tokens = self.cls_token.expand(batch_size, -1, -1)
        x = torch.cat((cls_tokens, tokens), dim=1)

        # Add positional encoding
        x = x + self.pos_embedding[:, :x.size(1)]
        x = self.dropout(x)

        # Apply transformer blocks
        for block in self.transformer:
            x = block(x)

        # Classification from CLS token
        cls = x[:, 0]  # Extract [CLS] token
        out = self.mlp_head(cls)
        return out

ViT Configuration for CIFAR-10

Parameter	Value	Description
patch_size	4	Each patch is 4×4 pixels
num_patches	64	32÷4 = 8, so 8×8 = 64 patches
embed_dim	256	Embedding dimension
hidden_dim	512	Feed-forward hidden size (2×embed_dim)
num_heads	8	Multi-head attention heads
num_layers	6	Number of transformer blocks
dropout	0.1	Dropout rate
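
As a rough size check, the parameter count for this configuration can be worked out by hand. The sketch below assumes the VisionTransformer layout shown earlier, 10 output classes, and bias terms on every linear layer and on the attention projections (the PyTorch defaults); the helper names are illustrative.

```python
def linear_params(in_features, out_features):
    """Weights plus bias of one fully connected layer."""
    return in_features * out_features + out_features


def layer_norm_params(dim):
    """Scale and shift vectors of one LayerNorm."""
    return 2 * dim


def vit_param_count(embed_dim, hidden_dim, num_channels, num_layers,
                    num_classes, patch_size, num_patches):
    patch_dim = num_channels * patch_size * patch_size
    patch_embed = linear_params(patch_dim, embed_dim)
    # cls_token plus pos_embedding
    tokens = embed_dim + (num_patches + 1) * embed_dim
    # fused Q/K/V input projection plus output projection
    attention = (linear_params(embed_dim, 3 * embed_dim)
                 + linear_params(embed_dim, embed_dim))
    feed_forward = (linear_params(embed_dim, hidden_dim)
                    + linear_params(hidden_dim, embed_dim))
    block = 2 * layer_norm_params(embed_dim) + attention + feed_forward
    head = layer_norm_params(embed_dim) + linear_params(embed_dim, num_classes)
    return patch_embed + tokens + num_layers * block + head


total = vit_param_count(embed_dim=256, hidden_dim=512, num_channels=3,
                        num_layers=6, num_classes=10, patch_size=4,
                        num_patches=64)
print(f"{total:,} parameters")
```

Under these assumptions the model has roughly 3.2 million parameters, almost all of them in the six transformer blocks. The number of heads does not change the count, because the heads share the same projection matrices.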

5. Training Vision Transformers

Optimizer: AdamW

AdamW (Adam with decoupled weight decay) is the standard optimizer for training transformers.

Learning rate: 3e-4 (typical for ViT)
Weight decay: 5e-5 (regularization)
Gradient clipping: max_norm=1.0 (stability)
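
Gradient clipping by norm rescales all gradients together when their combined L2 norm exceeds max_norm. The following is a plain-Python sketch of that idea on a flat list of gradient values; the function name is hypothetical, and the small eps in the denominator is an assumption added to avoid division by zero.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale gradients so their L2 norm is at most max_norm.

    Returns the (possibly scaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scale up
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, round(norm_after, 4))
```

The direction of the gradient is preserved; only its length changes. Gradients whose norm is already below max_norm pass through unchanged.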

Learning Rate Schedule: Cosine Annealing

Cosine Annealing LR:
    η_t = η_min + (η_max - η_min) × (1 + cos(πt/T)) / 2

Where:
    - η_max: initial learning rate (3e-4)
    - η_min: minimum LR (typically 0)
    - t: current epoch
    - T: total epochs (T_max)

Effect: Gradually decreases LR following a cosine curve
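
The schedule is a direct transcription of the formula. This sketch evaluates it for a 10-epoch run with η_max = 3e-4 and η_min = 0; the function name is illustrative, and 10 epochs is taken from the expected-performance table as an example.

```python
import math


def cosine_annealing_lr(t, total, lr_max, lr_min=0.0):
    """Learning rate at epoch t for a cosine schedule over `total` epochs."""
    return lr_min + (lr_max - lr_min) * (1 + math.cos(math.pi * t / total)) / 2


schedule = [cosine_annealing_lr(t, total=10, lr_max=3e-4) for t in range(11)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:>2}: lr = {lr:.2e}")
```

The rate starts at η_max, passes through the midpoint halfway through training, and reaches η_min at t = T. The decay is slow at both ends and fastest in the middle.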

def configure_optimizers(self):
    optimizer = optim.AdamW(
        self.parameters(),
        lr=self.lr,
        weight_decay=self.weight_decay
    )
    scheduler = optim.lr_scheduler.CosineAnnealingLR(
        optimizer,
        T_max=self.max_epochs
    )
    return {
        "optimizer": optimizer,
        "lr_scheduler": scheduler,
    }

Data Augmentation for CIFAR-10

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),  # Random crop with padding
    transforms.RandomHorizontalFlip(),      # 50% chance horizontal flip
    transforms.ToTensor(),
    transforms.Normalize(
        mean=(0.4914, 0.4822, 0.4465),  # CIFAR-10 mean
        std=(0.2023, 0.1994, 0.2010)     # CIFAR-10 std
    ),
])

Training Tips

Warm-up: Gradually increase LR for first few epochs (optional)
Gradient clipping: Prevents exploding gradients (max_norm=1.0)
LayerNorm: Use Pre-LN (normalize before attention) for stability
Initialization: Truncated normal for positional embeddings (the VisionTransformer code above initializes them to zeros for simplicity)

Expected Performance on CIFAR-10

Model	Test Accuracy	Training Time
ViT-Small (from scratch)	~75-80%	10 epochs
CNN (Assignment 3)	~85-90%	Comparable
Pre-trained ViT (fine-tuned)	~90-95%	Much faster

6. ViT vs CNN Comparison

Why Does CNN Outperform ViT on Small Datasets?

Inductive biases built into CNNs (locality, translation equivariance) help with small datasets. ViT needs to learn these patterns from data.

Detailed Comparison

Aspect	CNN	ViT (from scratch)	ViT (pre-trained)
Small dataset (≤50k)	Excellent	Mediocre	Excellent
Large dataset (>1M)	Good	Excellent	Excellent
Training time	Fast	Slow (quadratic attention)	Very fast (fine-tuning)
Transfer learning	Good	Excellent	Excellent
Interpretability	Moderate (feature maps)	Good (attention maps)	Good (attention maps)
Parameters	Fewer	More	More

Key Insight from Assignment

On CIFAR-10 (50k training samples), CNN achieves ~85-90% accuracy, while ViT from scratch achieves ~75-80%. This demonstrates the data-hungry nature of transformers.

However, a pre-trained ViT fine-tuned on CIFAR-10 can match or exceed CNN performance!

7. LoRA (Low-Rank Adaptation)

What is LoRA?

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable low-rank decomposition matrices.

LoRA Concept

Standard Fine-tuning:
    W_new = W_pretrained + ΔW
    All parameters updated → expensive!

LoRA:
    W_new = W_pretrained + B × A

Where:
    - W_pretrained: [d × k] original weight matrix (FROZEN)
    - A: [r × k] trainable matrix
    - B: [d × r] trainable matrix
    - r: rank (r << min(d, k))

Trainable parameters: r(d + k) instead of d×k

In practice the update B × A is also multiplied by a scaling factor lora_alpha / r (see lora_alpha in the configuration below).
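
To get a feel for the savings, the sketch below compares r(d + k) with d×k for one weight matrix at several ranks. The 768×768 size is an assumed example, not a value taken from the assignment, and the helper names are illustrative.

```python
def lora_trainable_params(d, k, r):
    """Parameters in A (r x k) plus B (d x r)."""
    return r * (d + k)


def full_finetune_params(d, k):
    """Parameters in the full d x k weight matrix."""
    return d * k


d, k = 768, 768   # assumed example layer size
full = full_finetune_params(d, k)
for r in (4, 8, 16, 32, 64):
    lora = lora_trainable_params(d, k, r)
    print(f"r={r:>2}: {lora:>6} trainable vs {full} full ({100 * lora / full:.2f}%)")
```

The trainable count grows linearly with r, so doubling the rank doubles the adapter size while the frozen matrix stays the same.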

LoRA Benefits

Memory efficient: Only store small A, B matrices per layer
Fast training: Fewer parameters to update
Modular: Can swap different LoRA adapters for different tasks
No inference overhead: Merge A×B into W at deployment
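
The "no inference overhead" point follows from linearity: applying the frozen weight and the adapter separately gives the same output as applying a single merged matrix. The sketch below checks this on a tiny example with made-up numbers. It assumes the common convention of scaling the update by lora_alpha / r, and all helper names are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def merge_lora(w, a, b, scale):
    """Return W + scale * (B × A) as a new d x k matrix."""
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


w = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # frozen weight, d=3, k=2
b = [[1.0], [0.0], [2.0]]                  # B: d x r with r=1
a = [[0.5, -1.0]]                          # A: r x k
scale = 2.0 / 1                            # lora_alpha / r (assumed alpha=2, r=1)
x = [[1.0, 2.0]]                           # one input row of size k

# Training-time path: frozen layer output plus scaled low-rank branch
base = matmul(x, transpose(w))
low_rank = matmul(matmul(x, transpose(a)), transpose(b))
y_adapter = [[p + scale * q for p, q in zip(base[0], low_rank[0])]]

# Deployment path: one matrix multiply with the merged weight
y_merged = matmul(x, transpose(merge_lora(w, a, b, scale)))

print(y_adapter, y_merged)
```

Both paths produce the same vector, so after merging the adapted layer costs exactly as much as the original one.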

LoRA Configuration

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                    # Rank (bottleneck dimension)
    lora_alpha=32,           # Scaling factor (typically 2×r)
    lora_dropout=0.05,       # Dropout for regularization
    bias="none",             # Don't adapt bias terms
    target_modules=[         # Which layers to adapt
        "attn.c_attn",       # Query, key, value projections
        "attn.c_proj",       # Output projection
        "mlp.c_fc",          # MLP first layer
        "mlp.c_proj",        # MLP second layer
    ],
    task_type="CAUSAL_LM",   # Task type
)

# Apply LoRA to model
model.decoder = get_peft_model(model.decoder, lora_config)
model.decoder.print_trainable_parameters()
# Output: trainable params: ~0.5M / total: ~100M (0.5%)

LoRA for Image Captioning (Assignment Task)

In Assignment 4, LoRA is applied to a ViT-GPT2 image captioning model:

Encoder (ViT): FROZEN - extracts image features
Decoder (GPT-2): LoRA adapters added - generates captions
Dataset: CIFAR-10 images with captions like "A photo of a dog"
Result: Model learns to caption CIFAR-10 classes with <1% of parameters

Choosing Rank (r)

r = 4-8: Very parameter-efficient, may underfit
r = 16-32: Good balance (common choice)
r = 64+: More capacity, diminishing returns

8. CLIP & Zero-Shot Learning

What is CLIP?

CLIP (Contrastive Language-Image Pre-training) learns to match images with text descriptions by training on 400M image-text pairs from the internet.

CLIP Architecture

Image Encoder: ViT or ResNet extracts image features
Text Encoder: Transformer encodes text descriptions
Training: Contrastive loss - match correct image-text pairs

Zero-Shot Classification with CLIP

Zero-Shot Inference:
    1. Encode class names as text:
       text_features = encode_text(["a dog", "a cat", ...])
    2. Encode test image:
       image_features = encode_image(image)
    3. Compute similarity (cosine):
       similarity = image_features @ text_features.T
    4. Predict class with highest similarity:
       prediction = argmax(similarity)
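
Steps 3 and 4 can be illustrated without loading CLIP. In the sketch below the feature vectors are made-up 3-dimensional stand-ins for real encoder outputs, and the function names are hypothetical. It shows only the normalize, compare, argmax logic.

```python
import math


def normalize(vector):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def zero_shot_predict(image_features, text_features, class_names):
    """Return the class whose text embedding is most similar to the image."""
    image = normalize(image_features)
    similarities = [sum(a * b for a, b in zip(image, normalize(text)))
                    for text in text_features]
    best = max(range(len(similarities)), key=similarities.__getitem__)
    return class_names[best], similarities


class_names = ["dog", "cat", "airplane"]
text_features = [[1.0, 0.1, 0.0],    # made-up embedding for "dog"
                 [0.1, 1.0, 0.0],    # made-up embedding for "cat"
                 [0.0, 0.0, 1.0]]    # made-up embedding for "airplane"
image_features = [0.9, 0.2, 0.1]     # made-up embedding for a test image

label, similarities = zero_shot_predict(image_features, text_features, class_names)
print(label, [round(s, 3) for s in similarities])
```

Because both sides are normalized to unit length, the dot product is the cosine similarity, which is why the CLIP code normalizes features before the matrix product.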

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load pre-trained CLIP
model, preprocess = clip.load("ViT-B/32", device=device)

# Class prompts ("..." stands for the remaining CIFAR-10 class names)
class_names = ["airplane", "automobile", "bird", ...]
text_inputs = clip.tokenize(class_names).to(device)

# Encode text once and L2-normalize so dot products are cosine similarities
with torch.no_grad():
    text_features = model.encode_text(text_inputs)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Zero-shot prediction (no gradients needed at inference time)
@torch.no_grad()
def predict(image):
    image_input = preprocess(image).unsqueeze(0).to(device)
    image_features = model.encode_image(image_input)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    # Cosine similarity, scaled by 100 before the argmax
    logits = 100.0 * image_features @ text_features.T
    return logits.argmax(dim=-1)

CLIP Performance on CIFAR-10

Zero-shot CLIP (no training on CIFAR-10) achieves ~88-90% accuracy on the CIFAR-10 test set!

This demonstrates the power of large-scale pre-training and vision-language alignment.

9. Implementation Details

PyTorch Lightning for Training

Assignment 4 uses PyTorch Lightning for clean, modular training code:

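import pytorch_lightning as pl
import torch.nn as nn
import torch.optim as optim
from torchmetrics.classification import MulticlassAccuracy


class ViT(pl.LightningModule):
    def __init__(self, model_kwargs, lr, weight_decay, max_epochs):
        super().__init__()
        # Store hyperparameters; configure_optimizers reads them back.
        self.lr = lr
        self.weight_decay = weight_decay
        self.max_epochs = max_epochs
        self.model = VisionTransformer(**model_kwargs)
        self.criterion = nn.CrossEntropyLoss()
        self.train_acc = MulticlassAccuracy(num_classes=10)
        self.val_acc = MulticlassAccuracy(num_classes=10)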

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        images, targets = batch
        logits = self(images)
        loss = self.criterion(logits, targets)
        preds = logits.argmax(dim=1)
        self.train_acc(preds, targets)
        self.log("train_loss", loss)
        self.log("train_acc", self.train_acc)
        return loss

    def configure_optimizers(self):
        optimizer = optim.AdamW(
            self.parameters(),
            lr=self.lr,
            weight_decay=self.weight_decay
        )
        scheduler = optim.lr_scheduler.CosineAnnealingLR(
            optimizer, T_max=self.max_epochs
        )
        return {"optimizer": optimizer, "lr_scheduler": scheduler}

Training LoRA on Custom Dataset

import torch
from torch.nn.utils import clip_grad_norm_


def finetune_lora(model, dataloader, epochs, max_steps,
                  optimizer, scaler, grad_accum=1):
    global_step = 0
    for epoch in range(epochs):
        for step, batch in enumerate(dataloader):
            pixel_values, input_ids, attention_mask = batch

            # Mixed precision forward pass
            with torch.autocast(device_type="cuda", enabled=True):
                outputs = model(
                    pixel_values=pixel_values,
                    labels=input_ids,
                    decoder_attention_mask=attention_mask
                )
                loss = outputs.loss / grad_accum

            # Backward with gradient scaling
            scaler.scale(loss).backward()

            if (step + 1) % grad_accum == 0:
                scaler.unscale_(optimizer)
                clip_grad_norm_(model.decoder.parameters(), max_norm=1.0)
                scaler.step(optimizer)
                scaler.update()
                optimizer.zero_grad()

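            global_step += 1
            if global_step >= max_steps:
                # Stop training entirely; a bare break would only exit
                # the inner loop and the next epoch would start anyway.
                return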

Key Implementation Choices

Component	Choice	Reason
Optimizer	AdamW	Standard for transformers, decoupled weight decay
LR Schedule	Cosine Annealing	Smooth decay, good for transformers
Normalization	Pre-LN (LayerNorm before attention)	More stable training than Post-LN
Activation	GELU	Standard for transformers (smoother than ReLU)
Gradient Clip	1.0	Prevents exploding gradients
Mixed Precision	Enabled (FP16)	Faster training, lower memory

Summary: Key Takeaways

Vision Transformers

Apply transformers to vision by treating image patches as tokens
Global receptive field from layer 1 via self-attention
Data-hungry: need large datasets or pre-training to excel
Strong transfer learning capabilities

LoRA (Low-Rank Adaptation)

Parameter-efficient fine-tuning: train <1% of parameters
Inject low-rank matrices A, B instead of full updates
Modular: swap adapters for different tasks
Ideal for adapting large pre-trained models

CLIP

Vision-language model trained on 400M image-text pairs
Zero-shot classification: no training on target dataset
Achieves ~88-90% on CIFAR-10 without seeing any CIFAR-10 training data
Demonstrates power of large-scale pre-training

Assignment Results

ViT from scratch: ~75-80% on CIFAR-10
CNN (Assignment 3): ~85-90% on CIFAR-10
Pre-trained ViT: ~90-95% on CIFAR-10
CLIP zero-shot: ~88-90% on CIFAR-10
LoRA captioning: Successfully adapted with ~0.5% trainable parameters