Vision Transformer (ViT) applies the transformer architecture (originally designed for NLP) directly to images by treating image patches as tokens.
Key innovation: Instead of using convolutions, ViT relies entirely on self-attention mechanisms to process visual information.
Why Vision Transformers?
Global receptive field: Self-attention can attend to any part of the image from layer 1
Scalability: Performance improves with more data and larger models
Architectural simplicity: No need for hand-crafted convolution hierarchies
Transfer learning: Pre-train on a large dataset, then fine-tune on smaller downstream tasks
ViT vs CNN: Philosophical Difference
| Aspect | CNN | Vision Transformer |
|---|---|---|
| Inductive bias | Strong (locality, translation equivariance) | Minimal (learns from data) |
| Receptive field | Grows gradually layer-by-layer | Global from first layer |
| Data requirement | Works well with small datasets | Needs large datasets to excel |
| Computational cost | O(n) per layer (local operations) | O(n²) self-attention (global) |
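To make the O(n²) entry concrete, here is a small sketch that counts tokens and pairwise attention scores for one attention head. It assumes n is the number of patch tokens (ignoring any extra classification token); the function name `attention_cost` is made up for this illustration.

```python
def attention_cost(image_size: int, patch_size: int) -> tuple:
    """Return (num_tokens, num_attention_scores) for one self-attention head."""
    patches_per_side = image_size // patch_size
    num_tokens = patches_per_side ** 2      # n
    return num_tokens, num_tokens ** 2      # every token scores every token: n^2


for patch_size in (8, 4, 2):
    n, scores = attention_cost(32, patch_size)
    print(f"patch_size={patch_size}: n={n} tokens, {scores} attention scores")
```

Halving the patch size quadruples the token count and multiplies the number of attention scores by sixteen, which is why small patches get expensive quickly.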
2. Image Patches & Tokenization
Converting Images to Sequences
Since transformers process sequences, ViT divides an image into fixed-size patches and treats each patch as a token.
Patch Extraction Process
For a 32×32 image with patch_size=4:
Number of patches per dimension: 32 ÷ 4 = 8
Total patches: 8 × 8 = 64 patches
Each patch: 4×4×3 (RGB) = 48 values
Flattened patch dimension: 48
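The arithmetic above can be checked with a minimal NumPy sketch. It assumes a channels-last (H, W, C) image layout and non-overlapping patches; `patchify` is a hypothetical helper name, not part of any library.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, C) -> (gh, gw, p, p, C) -> (gh * gw, p * p * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
patches = patchify(image, patch_size=4)
print(patches.shape)  # (64, 48)
```

Each row of `patches` is one token; a linear layer would then project the 48 raw values to the model's embedding dimension.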
Unlike CNNs, transformers have no inherent spatial awareness. Positional encodings must be added to tell the model where each patch came from in the original image.
3. Self-Attention Mechanism
What is Self-Attention?
Self-attention allows each patch to attend to all other patches, learning which parts of the image are relevant to each other.
Attention Formula
Scaled Dot-Product Attention:
Q = x × W_Q (Query)
K = x × W_K (Key)
V = x × W_V (Value)
Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V
Where:
- d_k: dimension of key vectors (for scaling)
- Softmax normalizes attention scores to sum to 1
- Result: weighted combination of values
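A minimal NumPy sketch of the formula above, for a single head and a single image. The sizes (64 tokens, 48 input features, d_k = 16) and the random weights are placeholders chosen for illustration, not values from the assignment.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(x, W_Q, W_K, W_V):
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n, n): every token scored against every token
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights               # weighted combination of values


rng = np.random.default_rng(0)
n_tokens, in_dim, d_k = 64, 48, 16
x = rng.normal(size=(n_tokens, in_dim))
W_Q, W_K, W_V = (rng.normal(size=(in_dim, d_k)) * 0.1 for _ in range(3))

out, weights = scaled_dot_product_attention(x, W_Q, W_K, W_V)
print(out.shape, weights.shape)  # (64, 16) (64, 64)
```

The (64, 64) weight matrix is what attention-map visualisations plot: row i shows how much patch i draws from every other patch.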
Multi-Head Attention
Instead of a single attention operation, multi-head attention runs several attention heads in parallel, each on its own slice of the embedding:
Multi-Head Attention:
For num_heads = 8:
- Split embed_dim (256) into 8 heads of 32 dimensions each
- Each head learns different relationships
- Concatenate outputs from all heads
- Final linear projection
Benefit: Captures different types of relationships simultaneously
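The split, attend, concatenate, project steps listed above might look like the following NumPy sketch. It uses the same sizes as the list (embed_dim = 256, 8 heads of 32 dimensions); the 65-token input, the random weights, and the helper names are assumptions for illustration, and biases are omitted.

```python
import numpy as np


def split_heads(x, num_heads):
    n, embed_dim = x.shape
    head_dim = embed_dim // num_heads
    return x.reshape(n, num_heads, head_dim).transpose(1, 0, 2)   # (heads, n, head_dim)


def merge_heads(x):
    heads, n, head_dim = x.shape
    return x.transpose(1, 0, 2).reshape(n, heads * head_dim)      # concatenate heads


def multi_head_attention(x, W_Q, W_K, W_V, W_O, num_heads):
    Q, K, V = (split_heads(x @ W, num_heads) for W in (W_Q, W_K, W_V))
    head_dim = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(head_dim)         # (heads, n, n)
    scores = scores - scores.max(axis=-1, keepdims=True)          # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    per_head = weights @ V                                        # (heads, n, head_dim)
    return merge_heads(per_head) @ W_O                            # final linear projection


rng = np.random.default_rng(0)
embed_dim, num_heads, n_tokens = 256, 8, 65
x = rng.normal(size=(n_tokens, embed_dim))
W_Q, W_K, W_V, W_O = (rng.normal(size=(embed_dim, embed_dim)) * 0.02 for _ in range(4))

out = multi_head_attention(x, W_Q, W_K, W_V, W_O, num_heads)
print(split_heads(x, num_heads).shape, out.shape)  # (8, 65, 32) (65, 256)
```

Because each head works on a 32-dimensional slice, the total cost is close to that of one full-width attention while allowing eight independent attention patterns.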
The [CLS] token is a learnable embedding prepended to the sequence. After passing through all transformer layers, its final representation is used for classification.
Intuition: The [CLS] token aggregates information from all patches through self-attention, creating a global image representation.
Positional Embeddings
Learnable Positional Embeddings:
pos_embedding = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
- Shape: [1, 65, 256] for 64 patches + 1 CLS token
- Learned during training (not fixed sinusoidal)
- Added element-wise to patch embeddings
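As a shape check, the sketch below prepends a [CLS] token to a batch of patch embeddings and adds the positional embedding. NumPy arrays stand in for the `nn.Parameter` tensors so the example has no framework dependency; the batch size of 2 and the random patch embeddings are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, num_patches, embed_dim = 2, 64, 256

patch_embeddings = rng.normal(size=(batch, num_patches, embed_dim))
cls_token = np.zeros((1, 1, embed_dim))                     # learnable in a real model
pos_embedding = np.zeros((1, num_patches + 1, embed_dim))   # learnable in a real model

# Repeat the single [CLS] token for every image in the batch, then prepend it.
cls = np.broadcast_to(cls_token, (batch, 1, embed_dim))
tokens = np.concatenate([cls, patch_embeddings], axis=1)    # (batch, 65, 256)

# Element-wise addition; the leading 1 broadcasts over the batch dimension.
tokens = tokens + pos_embedding
print(tokens.shape)  # (2, 65, 256)
```

After the transformer layers, `tokens[:, 0]` (the [CLS] position) is the vector a classification head would read.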
LayerNorm: Use Pre-LN (normalize before each attention and MLP sub-layer rather than after) for training stability
Initialization: Truncated normal for positional embeddings
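The Pre-LN ordering can be sketched as follows. This is an illustration under simplifying assumptions: `layer_norm` omits the learnable scale and shift, and the attention and MLP sub-layers are passed in as plain functions (simple stand-ins here, not a real attention layer).

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance (no learnable affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def pre_ln_block(x, attention_fn, mlp_fn):
    """Pre-LN transformer block: normalize, apply the sub-layer, then add the residual."""
    x = x + attention_fn(layer_norm(x))
    x = x + mlp_fn(layer_norm(x))
    return x


rng = np.random.default_rng(0)
x = rng.normal(size=(65, 256))
W = rng.normal(size=(256, 256)) * 0.02

out = pre_ln_block(
    x,
    attention_fn=lambda h: h @ W,                  # stand-in for self-attention
    mlp_fn=lambda h: np.maximum(h @ W, 0.0),       # stand-in for the MLP
)
print(out.shape)  # (65, 256)
```

The residual path never passes through a normalization layer in this arrangement, which is the usual explanation for why Pre-LN tends to train more stably than Post-LN.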
Expected Performance on CIFAR-10
| Model | Test Accuracy | Training Time |
|---|---|---|
| ViT-Small (from scratch) | ~75-80% | 10 epochs |
| CNN (Assignment 3) | ~85-90% | Comparable |
| Pre-trained ViT (fine-tuned) | ~90-95% | Much faster |
6. ViT vs CNN Comparison
Why Does CNN Outperform ViT on Small Datasets?
Inductive biases built into CNNs (locality, translation equivariance) help with small datasets. ViT needs to learn these patterns from data.
Detailed Comparison
| Aspect | CNN | ViT (from scratch) | ViT (pre-trained) |
|---|---|---|---|
| Small dataset (<50k) | Excellent | Mediocre | Excellent |
| Large dataset (>1M) | Good | Excellent | Excellent |
| Training time | Fast | Slow (quadratic attention) | Very fast (fine-tuning) |
| Transfer learning | Good | Excellent | Excellent |
| Interpretability | Moderate (feature maps) | Good (attention maps) | Good (attention maps) |
| Parameters | Fewer | More | More |
Key Insight from Assignment
On CIFAR-10 (50k training samples), CNN achieves ~85-90% accuracy, while ViT from scratch achieves ~75-80%. This demonstrates the data-hungry nature of transformers.
However, a pre-trained ViT fine-tuned on CIFAR-10 can match or exceed CNN performance!
7. LoRA (Low-Rank Adaptation)
What is LoRA?
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable low-rank decomposition matrices.
LoRA Concept
Standard Fine-tuning:
W_new = W_pretrained + ΔW
All parameters updated → expensive!
LoRA:
W_new = W_pretrained + B × A
Where:
- W: [d × k] original weight matrix (FROZEN)
- A: [r × k] trainable matrix
- B: [d × r] trainable matrix
- r: rank (r << min(d, k))
Trainable parameters: r(d + k) instead of d×k
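The parameter saving can be illustrated with a small NumPy sketch. The dimensions (d = k = 768, r = 16) are hypothetical, and initializing B to zero is the common LoRA convention rather than something stated above; the lora_alpha / r scaling that libraries apply is omitted to match the formula as written.

```python
import numpy as np

d, k, r = 768, 768, 16                    # hypothetical layer size and rank

rng = np.random.default_rng(0)
W = rng.normal(size=(d, k)) * 0.02        # frozen pre-trained weight
A = rng.normal(size=(r, k)) * 0.01        # trainable
B = np.zeros((d, r))                      # trainable; zero init means B @ A = 0 at the start

full_params = d * k                       # what standard fine-tuning would update
lora_params = r * (d + k)                 # what LoRA updates
print(full_params, lora_params, f"{lora_params / full_params:.2%}")
# 589824 24576 4.17%

x = rng.normal(size=(k,))
y = W @ x + B @ (A @ x)                   # LoRA forward pass without forming the d x k product
```

With B starting at zero the adapted layer initially reproduces the pre-trained layer exactly, so fine-tuning begins from the original model's behaviour.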
LoRA Benefits
Memory efficient: Only store small A, B matrices per layer
Fast training: Fewer parameters to update
Modular: Can swap different LoRA adapters for different tasks
No inference overhead: Merge B×A into W at deployment
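The merge step can be verified numerically. The sketch below uses small, made-up dimensions and random matrices; it only checks that folding the low-rank update into W gives the same output as keeping the adapter separate.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 64, 48, 4                       # small hypothetical sizes
W = rng.normal(size=(d, k))               # frozen pre-trained weight
A = rng.normal(size=(r, k))               # trained LoRA matrix
B = rng.normal(size=(d, r))               # trained LoRA matrix
x = rng.normal(size=(k,))

y_adapter = W @ x + B @ (A @ x)           # adapter kept separate (training-time form)
W_merged = W + B @ A                      # fold the update into one (d, k) matrix
y_merged = W_merged @ x                   # a single matmul at inference

print(np.allclose(y_adapter, y_merged))   # True
```

Subtracting `B @ A` from `W_merged` recovers the original weights, which is what makes adapters swappable between tasks.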
LoRA Configuration
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                      # Rank (bottleneck dimension)
    lora_alpha=32,             # Scaling factor (typically 2×r)
    lora_dropout=0.05,         # Dropout for regularization
    bias="none",               # Don't adapt bias terms
    target_modules=[           # Which layers to adapt
        "attn.c_attn",         # Query, key, value projections
        "attn.c_proj",         # Output projection
        "mlp.c_fc",            # MLP first layer
        "mlp.c_proj",          # MLP second layer
    ],
    task_type="CAUSAL_LM",     # Task type
)

# Apply LoRA to model
model.decoder = get_peft_model(model.decoder, lora_config)
model.decoder.print_trainable_parameters()
# Output: trainable params: ~0.5M / total: ~100M (0.5%)
LoRA for Image Captioning (Assignment Task)
In Assignment 4, LoRA is applied to a ViT-GPT2 image captioning model: