CNNs FOR IMAGE CLASSIFICATION
CMPUT 328 - ASSIGNMENT 3 STUDY GUIDE
A Convolutional Neural Network (CNN) is a specialized type of neural network designed for processing grid-like data, particularly images. CNNs are inspired by the visual cortex of animals and are the foundation of modern computer vision.
INPUT IMAGE (32×32×3)
↓
┌───────┐
│ CONV │ → Detect edges, colors
└───────┘
↓
┌───────┐
│ POOL │ → Reduce size
└───────┘
↓
┌───────┐
│ CONV │ → Detect shapes
└───────┘
↓
┌───────┐
│ POOL │ → Reduce size
└───────┘
↓
┌───────┐
│ CONV │ → Detect objects
└───────┘
↓
┌───────┐
│ FC │ → Classification
└───────┘
↓
OUTPUT (10 classes)
When we flatten an image for a fully connected network:
Input: 32×32×3 CIFAR-10 image
Flattened: 3,072 dimensional vector
Hidden layer: 1,024 neurons
Parameters: 3,072 × 1,024 = 3,145,728 weights (plus 1,024 biases) in the first layer alone!
| Problem | Description |
|---|---|
| Loss of spatial structure | Pixels that are spatially close are treated the same as pixels far apart |
| Huge parameter count | Millions of parameters in first layer alone |
| No translation invariance | Must learn same feature at every possible position |
| Overfitting | Too many parameters lead to poor generalization |
┌────────────────────────────────────────┐
│ STANDARD CNN ARCHITECTURE │
├────────────────────────────────────────┤
│ Input Image (32×32×3) │
│ ↓ │
│ ┌──────────────────────┐ │
│ │ Conv → ReLU → Pool │ ×N │
│ └──────────────────────┘ │
│ ↓ │
│ ┌──────────────────────┐ │
│ │ Conv → ReLU → Pool │ ×M │
│ └──────────────────────┘ │
│ ↓ │
│ Global Average Pooling │
│ ↓ │
│ Fully Connected → Output (10 classes) │
└────────────────────────────────────────┘
Convolution is a mathematical operation that slides a small filter (kernel) over an input to produce a feature map.
| Parameter | Description | Common Values |
|---|---|---|
| Kernel size (k) | Size of the sliding window | 3×3, 5×5, 7×7 |
| Stride (s) | How many pixels to slide | 1, 2 |
| Padding (p) | Add zeros around border | 0, 1, 2 |
| Filters (out_channels) | Number of output feature maps | 32, 64, 128, 256 |
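To make the sliding-window idea and the parameters in the table concrete, here is a minimal sketch in plain Python (no PyTorch needed). The names `conv_output_size` and `conv2d_single_channel` are illustrative, not library functions. The sketch assumes one input channel, a square kernel, and no bias, and like `nn.Conv2d` it does not flip the kernel (strictly speaking, it computes a cross-correlation).

```python
def conv_output_size(n, k, s=1, p=0):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1


def conv2d_single_channel(image, kernel, stride=1, padding=0):
    """Slide `kernel` over a zero-padded 2D `image` given as lists of lists."""
    h, w = len(image), len(image[0])
    k = len(kernel)  # assumes a square k x k kernel
    # Build the zero-padded copy of the input.
    padded = [[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for i in range(h):
        for j in range(w):
            padded[i + padding][j + padding] = image[i][j]
    out_h = conv_output_size(h, k, stride, padding)
    out_w = conv_output_size(w, k, stride, padding)
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Element-wise product of the kernel with the current window, summed.
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    acc += padded[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(acc)
        out.append(row)
    return out


image = [[1, 2, 3, 0],
         [4, 5, 6, 0],
         [7, 8, 9, 0],
         [0, 0, 0, 0]]
vertical_edge = [[1, 0, -1],
                 [1, 0, -1],
                 [1, 0, -1]]
feature_map = conv2d_single_channel(image, vertical_edge)  # 2x2 output
```

With k=3, s=1, p=1 the formula gives `conv_output_size(32, 3, 1, 1) == 32`, which is why the 3×3 convolutions with padding=1 used later in this guide keep the 32×32 resolution.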
Parameters = (k × k × in_channels + 1) × out_channels, where the +1 is the bias of each filter.
Example: Conv2d(3, 64, kernel_size=3)
= (3 × 3 × 3 + 1) × 64 = 1,792 parameters
| Type | Operation | Use Case |
|---|---|---|
| MaxPool2d(2) | Takes maximum value | Most common, preserves strong activations |
| AvgPool2d(2) | Takes average value | Smoother downsampling |
| AdaptiveAvgPool2d(1) | Global average pooling | Before final classifier; shrinks each feature map to 1×1 so the FC layer stays small |
Size Reduction with MaxPool(2):
32×32 → MaxPool → 16×16
16×16 → MaxPool → 8×8
8×8 → MaxPool → 4×4
4×4 → MaxPool → 2×2
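The size reduction above can be reproduced with a small plain-Python sketch. `max_pool2d` here is a hypothetical helper written for illustration (not the PyTorch function of the same name); it assumes non-overlapping windows, i.e. stride equal to the window size, which matches `MaxPool2d(2)`.

```python
def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling (stride == size) on a 2D list of lists."""
    h, w = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[i + a][j + b]
                for a in range(size) for b in range(size))
            for j in range(0, w - size + 1, size)
        ]
        for i in range(0, h - size + 1, size)
    ]


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 9, 8],
        [2, 6, 7, 3]]
pooled = max_pool2d(fmap)  # each 2x2 block is replaced by its maximum

# Spatial size after repeated 2x2 pooling, starting from a 32x32 input.
sizes = [32]
while sizes[-1] > 2:
    sizes.append(sizes[-1] // 2)
```

Each pooling step halves the height and width, so the number of activations per channel drops by a factor of four.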
Without activation functions, stacking linear layers adds no expressive power:
Linear → Linear → Linear ≡ Single Linear Layer
Activations introduce non-linearity, allowing networks to learn complex patterns.
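A tiny numeric check illustrates the collapse. This is a sketch in plain Python with made-up 2×2 weight matrices and no biases (an assumption for brevity; with biases the composition is still a single affine map).

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def relu(m):
    """Apply ReLU element-wise to a matrix."""
    return [[max(0, v) for v in row] for row in m]


W1 = [[1, -2], [3, 1]]   # first "layer" (illustrative weights)
W2 = [[2, 0], [-1, 1]]   # second "layer"
x = [[1], [1]]           # input as a column vector

two_layers = matmul(W2, matmul(W1, x))        # W2 (W1 x)
one_layer = matmul(matmul(W2, W1), x)         # (W2 W1) x, a single matrix
with_relu = matmul(W2, relu(matmul(W1, x)))   # non-linearity in between
```

Here `two_layers` and `one_layer` are identical, while `with_relu` differs, so only the version with an activation cannot be replaced by one linear layer.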
| Function | Formula | Use Case |
|---|---|---|
| ReLU | max(0, x) | Default choice for hidden layers |
| Leaky ReLU | max(0.01x, x) | Mitigates dying neurons |
| Sigmoid | 1 / (1 + e^(-x)) | Binary classification output |
| Tanh | (e^x - e^(-x)) / (e^x + e^(-x)) | Zero-centered alternative to sigmoid |
| Softmax | e^(x_i) / Σ e^(x_j) | Multi-class classification output |
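The formulas in the table translate directly into code. The following is a plain-Python sketch for scalar inputs (plain ReLU, max(0, x), is included for comparison); the max-subtraction inside `softmax` is a standard numerical-stability trick and does not change the result.

```python
import math


def relu(x):
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    return max(slope * x, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def softmax(logits):
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
confidence = max(probs)  # the "confidence" metric: max(softmax(logits))
```

Note that `nn.CrossEntropyLoss` expects raw logits and applies the (log-)softmax internally, so a softmax layer is normally not added to the model itself.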
Purpose: Normalize each channel's activations over the mini-batch to mean ≈ 0 and std ≈ 1, then apply a learned scale and shift
Conv → BatchNorm → ReLU
nn.Conv2d(64, 128, 3, padding=1),
nn.BatchNorm2d(128),
nn.ReLU(inplace=True)
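As a rough sketch of what batch normalization computes in training mode, the plain-Python function below normalizes the activations of a single channel across a mini-batch. The name `batch_norm_1d` and its arguments are illustrative; `gamma` and `beta` stand in for the learned scale and shift. In `nn.BatchNorm2d` the statistics are taken per channel over the batch and all spatial positions, and running averages are used instead at evaluation time.

```python
import math


def batch_norm_1d(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel's activations across a mini-batch."""
    n = len(values)
    mean = sum(values) / n
    # Biased variance (divide by n), as used for normalization in training mode.
    var = sum((v - mean) ** 2 for v in values) / n
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm_1d(activations)

out_mean = sum(normalized) / len(normalized)
out_var = sum((v - out_mean) ** 2 for v in normalized) / len(normalized)
```

After normalization `out_mean` is approximately 0 and `out_var` approximately 1 (slightly below 1 because of `eps`).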
Purpose: Prevent overfitting by randomly dropping activations
# Higher dropout in FC layers
nn.Dropout(0.5)
# Lower dropout after conv layers
nn.Dropout2d(0.25)
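The mechanism behind `nn.Dropout` can be sketched in a few lines of plain Python. This is an illustrative "inverted dropout" helper (the name `dropout` and the `rng` argument are assumptions of this sketch): during training each value is zeroed with probability p and the survivors are scaled by 1/(1-p), so nothing needs to change at evaluation time.

```python
import random


def dropout(values, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(values)  # evaluation mode: identity
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - p)  # requires p < 1
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


x = [1.0] * 8
train_out = dropout(x, p=0.5, rng=random.Random(0))  # entries are 0.0 or 2.0
eval_out = dropout(x, p=0.5, training=False)         # unchanged
```

`nn.Dropout2d` follows the same idea but drops entire feature-map channels rather than individual activations, which suits convolutional outputs where neighbouring pixels are strongly correlated.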
optimizer = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=5e-4,  # λ = 0.0005
)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
Discourages overconfident predictions and often improves calibration
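To see what label smoothing does to the loss, here is a plain-Python sketch for a single example. It assumes the usual formulation in which the one-hot target is mixed with a uniform distribution: the true class gets 1 - ε + ε/K and every other class gets ε/K. The function names are illustrative.

```python
import math


def log_softmax(logits):
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy against a smoothed target distribution."""
    k = len(logits)
    log_probs = log_softmax(logits)
    # (1 - eps) on the true class plus eps / K spread over all K classes.
    q = [smoothing / k + (1.0 - smoothing if i == target else 0.0)
         for i in range(k)]
    return -sum(qi * lp for qi, lp in zip(q, log_probs))


logits = [4.0, 1.0, 0.5]  # a confident, correct prediction for class 0
plain = smoothed_cross_entropy(logits, target=0, smoothing=0.0)
smoothed = smoothed_cross_entropy(logits, target=0, smoothing=0.1)
```

For this confident prediction `smoothed` is larger than `plain`: the smoothed target penalizes pushing the probabilities of the other classes all the way to zero.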
| Augmentation | Effect | Expected Gain |
|---|---|---|
| RandomCrop(32, padding=4) | Translation invariance | +2-3% |
| RandomHorizontalFlip() | Left-right symmetry | +1-2% |
| ColorJitter(0.2, 0.2, 0.2, 0.1) | Lighting robustness | +1-2% |
| RandomErasing(p=0.15) | Occlusion robustness | +0.5-1% |
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.2, 0.2, 0.2, 0.1),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=(0.4914, 0.4822, 0.4465),
        std=(0.2470, 0.2435, 0.2616),
    ),
    transforms.RandomErasing(p=0.15),
])
criterion = nn.CrossEntropyLoss(label_smoothing=0.05)
| Optimizer | Pros | Cons |
|---|---|---|
| Adam / AdamW | Fast convergence, adaptive LR | Can generalize slightly worse |
| SGD + Momentum | Better generalization | Requires careful tuning |
# Pick ONE of the schedulers below.
# Cosine Annealing (smooth decrease)
scheduler = optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs, eta_min=1e-6
)
# Step LR (decrease by factor every N epochs)
scheduler = optim.lr_scheduler.StepLR(
    optimizer, step_size=30, gamma=0.1
)
# ReduceLROnPlateau (adaptive): call scheduler.step(val_loss) each epoch
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=5
)
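The first two schedules have simple closed forms, sketched below in plain Python so the curves can be inspected without training anything. The helper names are hypothetical, and the sketch assumes the scheduler is stepped once per epoch; the values of `base_lr`, `t_max`, and `step_size` are example settings.

```python
import math


def cosine_annealing_lr(epoch, base_lr, t_max, eta_min=0.0):
    """eta_min + 0.5 * (base_lr - eta_min) * (1 + cos(pi * epoch / t_max))."""
    return eta_min + 0.5 * (base_lr - eta_min) * (
        1.0 + math.cos(math.pi * epoch / t_max)
    )


def step_lr(epoch, base_lr, step_size, gamma=0.1):
    """Multiply the learning rate by gamma every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)


cosine = [cosine_annealing_lr(e, 1e-3, 50, 1e-6) for e in range(51)]
stepped = [step_lr(e, 1e-3, 30) for e in range(61)]
```

The cosine schedule starts at the base learning rate, passes the midpoint between `base_lr` and `eta_min` halfway through, and ends at `eta_min`; the step schedule stays flat and drops by a factor of 10 at epochs 30 and 60.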
torch.nn.utils.clip_grad_norm_(
    model.parameters(),
    max_norm=1.0,
)
Rescales gradients whose total norm exceeds max_norm, which limits exploding gradients; especially useful with high learning rates
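Norm-based clipping amounts to one rescaling step, sketched here in plain Python on a flat list of gradient values. `clip_by_global_norm` is an illustrative helper, not a PyTorch function; the small `eps` in the denominator is an assumption that mirrors the common implementation detail of avoiding division by zero.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by the same factor if their L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
unchanged, small_norm = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length is capped; gradients already within the limit pass through untouched.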
| Property | Value |
|---|---|
| Total Images | 60,000 (50k train, 10k test) |
| Resolution | 32×32 pixels |
| Channels | 3 (RGB) |
| Classes | 10 (balanced) |
# CIFAR-10 statistics
mean = (0.4914, 0.4822, 0.4465) # RGB
std = (0.2470, 0.2435, 0.2616)
normalize = transforms.Normalize(mean=mean, std=std)
# Typical split: 45k train / 5k val / 10k test
val_size = 5000
train_indices = list(range(0, 50000 - val_size))
val_indices = list(range(50000 - val_size, 50000))
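The split above takes the last 5,000 training images as the validation set. A shuffled split with a fixed seed is a common alternative; the sketch below uses only the standard library, and `split_indices` is a hypothetical helper name.

```python
import random


def split_indices(num_samples=50000, val_size=5000, seed=0):
    """Return (train_indices, val_indices) from a seeded shuffle."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    return indices[val_size:], indices[:val_size]


train_indices, val_indices = split_indices()
```

Either pair of index lists can then be passed to `torch.utils.data.Subset` to build the train and validation datasets. If the validation set should not be augmented, it needs to wrap a copy of the dataset created with the test-time transform.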
| Metric | Formula | Interpretation |
|---|---|---|
| Accuracy | correct / total | Overall correctness |
| Loss | CrossEntropyLoss | Confidence-aware error |
| Confidence | max(softmax(logits)) | Model certainty |
def evaluate(model, dataloader, criterion, device):
    model.eval()  # IMPORTANT!
    running_loss = 0.0
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, targets in dataloader:
            inputs = inputs.to(device)
            targets = targets.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            running_loss += loss.item() * inputs.size(0)
            predictions = outputs.argmax(dim=1)
            correct += (predictions == targets).sum().item()
            total += targets.size(0)
    return running_loss / total, correct / total
| Network Type | First Layer | Parameters |
|---|---|---|
| Fully Connected | 3,072 → 1,024 | 3,145,728 |
| CNN | 3→64, kernel=3×3 | 1,792 |
| Ratio | FC : CNN | ≈1,755:1 |
Fully Connected:
CNN:
| Model | Parameters | Test Accuracy |
|---|---|---|
| Random Guessing | - | 10% |
| 3-layer FC | 3.8M | 55-60% |
| Simple CNN | 0.6M | 75-80% |
| CNN + Augmentation | 1.2M | 85-90% |
| ResNet-18 | 11M | 92-95% |
class CIFAR10CNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 32×32 → 16×16
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Dropout(0.25),
            # Block 2: 16×16 → 8×8
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Dropout(0.35),
            # Block 3: 8×8 → 1×1
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.4),
            nn.Linear(256, 128),
            nn.ReLU(inplace=True),
            nn.Dropout(0.2),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x
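As a sanity check on the size of this architecture, the layer-by-layer parameter count can be tallied by hand. The sketch below is plain Python with illustrative helper names; it assumes every convolution and linear layer has a bias and that each BatchNorm layer contributes two learnable values per channel (scale and shift), which are the PyTorch defaults.

```python
def conv_params(c_in, c_out, k=3):
    return (k * k * c_in + 1) * c_out   # weights + one bias per filter


def bn_params(channels):
    return 2 * channels                 # learnable scale and shift


def linear_params(n_in, n_out):
    return (n_in + 1) * n_out           # weights + biases


layers = [
    ("conv 3->64", conv_params(3, 64)), ("bn 64", bn_params(64)),
    ("conv 64->64", conv_params(64, 64)), ("bn 64", bn_params(64)),
    ("conv 64->128", conv_params(64, 128)), ("bn 128", bn_params(128)),
    ("conv 128->128", conv_params(128, 128)), ("bn 128", bn_params(128)),
    ("conv 128->256", conv_params(128, 256)), ("bn 256", bn_params(256)),
    ("fc 256->128", linear_params(256, 128)),
    ("fc 128->10", linear_params(128, 10)),
]
total = sum(count for _, count in layers)

for name, count in layers:
    print(f"{name:>14}: {count:>8,}")
print(f"{'total':>14}: {total:>8,}")
```

The total comes to roughly 0.59M parameters. With the model instantiated, the same number should be obtainable from `sum(p.numel() for p in model.parameters())`.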
# Model, loss, optimizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_epochs = 50  # assumed value; keep it equal to the scheduler's T_max
model = CIFAR10CNN().to(device)
criterion = nn.CrossEntropyLoss(label_smoothing=0.05)
optimizer = optim.AdamW(model.parameters(), lr=2e-3, weight_decay=5e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
# Training loop (train_loader is the DataLoader over the training split)
for epoch in range(num_epochs):
    model.train()
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
    scheduler.step()  # once per epoch, after the inner loop
WRONG:
def forward(self, x):
    x = x.view(x.size(0), -1)  # DESTROYS SPATIAL STRUCTURE!
    x = self.conv1(x)  # This will crash: Conv2d expects [batch, C, H, W]
CORRECT:
def forward(self, x):
    # x shape: [batch, 3, 32, 32] - keep spatial structure!
    x = self.features(x)
    x = self.classifier(x)
    return x
WRONG:
test_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),  # NO!
    transforms.ToTensor(),
])
CORRECT:
test_transform = transforms.Compose([
    transforms.ToTensor(),  # No augmentation!
    transforms.Normalize(mean, std),
])
WRONG:
with torch.no_grad():
    for inputs, targets in test_loader:
        outputs = model(inputs)  # BatchNorm/Dropout still active!
CORRECT:
model.eval()  # IMPORTANT!
with torch.no_grad():
    for inputs, targets in test_loader:
        outputs = model(inputs)
| Symptom | Problem | Solution |
|---|---|---|
| Loss = NaN | Learning rate too high | Lower LR, add gradient clipping |
| Train >> Val accuracy | Overfitting | More dropout, augmentation, weight decay |
| Both train/val low | Underfitting | Increase capacity, train longer |
| Loss oscillates wildly | LR too high or batch size too small | Lower LR, increase batch size |
| Configuration | Expected Test Accuracy |
|---|---|
| Simple CNN, no augmentation | 70-75% |
| CNN + basic augmentation | 80-85% |
| CNN + full augmentation pipeline | 85-90% |
END OF LESSON
CMPUT 328 - VISUAL RECOGNITION
ASSIGNMENT 3: CNNs FOR IMAGE CLASSIFICATION