
CONVOLUTIONAL NEURAL NETWORKS

FOR IMAGE CLASSIFICATION

CMPUT 328 - ASSIGNMENT 3 STUDY GUIDE


1. INTRODUCTION TO CNNs

What is a Convolutional Neural Network?

A Convolutional Neural Network (CNN) is a specialized type of neural network designed for processing grid-like data, particularly images. CNNs are inspired by the visual cortex of animals and are the foundation of modern computer vision.

Key Characteristics

INPUT IMAGE (32×32×3)
        ↓
    ┌───────┐
    │  CONV │ → Detect edges, colors
    └───────┘
        ↓
    ┌───────┐
    │  POOL │ → Reduce size
    └───────┘
        ↓
    ┌───────┐
    │  CONV │ → Detect shapes
    └───────┘
        ↓
    ┌───────┐
    │  POOL │ → Reduce size
    └───────┘
        ↓
    ┌───────┐
    │  CONV │ → Detect objects
    └───────┘
        ↓
    ┌───────┐
    │   FC  │ → Classification
    └───────┘
        ↓
    OUTPUT (10 classes)
            

2. WHY CNNs FOR IMAGES?

The Problem with Fully Connected Networks

When we flatten an image for a fully connected network:

Input: 32×32×3 CIFAR-10 image
Flattened: 3,072 dimensional vector
Hidden layer: 1,024 neurons
Parameters: 3,072 × 1,024 = 3,145,728 parameters!
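The arithmetic above can be reproduced in a few lines. This is only a sketch; the variable names are illustrative, and the bias term is an addition that the figure above does not count.

```python
# Weight count of a dense layer applied to a flattened CIFAR-10 image.
height, width, channels = 32, 32, 3
hidden_units = 1024

flattened_size = height * width * channels    # length of the flattened input vector
weight_count = flattened_size * hidden_units  # one weight per (input, neuron) pair
bias_count = hidden_units                     # one bias per neuron (not in the figure above)

print(f"flattened inputs: {flattened_size:,}")
print(f"weights:          {weight_count:,}")
print(f"weights + biases: {weight_count + bias_count:,}")
```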

Problems with FC Networks

Problem                   | Description
Loss of spatial structure | Pixels that are spatially close treated same as pixels far apart
Huge parameter count      | Millions of parameters in first layer alone
No translation invariance | Must learn same feature at every possible position
Overfitting               | Too many parameters lead to poor generalization

How CNNs Solve These Problems

3. CNN ARCHITECTURE COMPONENTS

┌────────────────────────────────────────┐
│          STANDARD CNN ARCHITECTURE      │
├────────────────────────────────────────┤
│  Input Image (32×32×3)                 │
│           ↓                             │
│  ┌──────────────────────┐               │
│  │ Conv → ReLU → Pool   │ ×N            │
│  └──────────────────────┘               │
│           ↓                             │
│  ┌──────────────────────┐               │
│  │ Conv → ReLU → Pool   │ ×M            │
│  └──────────────────────┘               │
│           ↓                             │
│  Global Average Pooling                │
│           ↓                             │
│  Fully Connected → Output (10 classes) │
└────────────────────────────────────────┘
            

Layer Types

  1. Convolutional layers: Extract spatial features
  2. Activation layers: Introduce non-linearity (ReLU)
  3. Pooling layers: Downsample spatial dimensions
  4. Normalization layers: Stabilize training (BatchNorm)
  5. Dropout layers: Regularization
  6. Fully connected layers: Final classification

4. CONVOLUTIONAL LAYERS

What is Convolution?

Convolution is a mathematical operation that slides a small filter (kernel) over an input to produce a feature map.

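Output[i,j] = Σ_m Σ_n Input[i+m, j+n] × Kernel[m,n]

Here m and n range over the kernel's rows and columns. Deep learning libraries compute this un-flipped form (strictly a cross-correlation); for multi-channel inputs the sum also runs over the input channels.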
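A minimal pure-Python sketch of the sliding-window sum, assuming a single channel, stride 1, and no padding. The function name and the edge-detecting kernel are illustrative choices, not part of the assignment.

```python
def conv2d_single_channel(image, kernel):
    """Sliding-window sum for one channel (stride 1, no padding)."""
    in_h, in_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1
    output = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            total = 0.0
            for m in range(k_h):
                for n in range(k_w):
                    total += image[i + m][j + n] * kernel[m][n]
            output[i][j] = total
    return output

# 4x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 3x3 kernel: right column minus left column.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = conv2d_single_channel(image, kernel)  # 2x2 feature map
```

Every window in this example straddles the edge, so every entry of the feature map responds with the same positive value.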

Key Parameters

Parameter              | Description                   | Common Values
Kernel size (k)        | Size of the sliding window    | 3×3, 5×5, 7×7
Stride (s)             | How many pixels to slide      | 1, 2
Padding (p)            | Add zeros around border       | 0, 1, 2
Filters (out_channels) | Number of output feature maps | 32, 64, 128, 256

Output Size Formula

output_size = ⌊(input_size + 2×padding - kernel_size) / stride⌋ + 1
For a 3×3 kernel with padding=1 and stride=1, the output size equals input size!
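The formula translates directly into integer arithmetic. A small sketch follows; the helper name `conv_output_size` is illustrative.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size along one dimension, using floor division."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

same_size = conv_output_size(32, kernel_size=3, stride=1, padding=1)   # 3x3, pad 1
no_padding = conv_output_size(32, kernel_size=3, stride=1, padding=0)  # 3x3, pad 0
strided = conv_output_size(32, kernel_size=3, stride=2, padding=1)     # stride 2
print(same_size, no_padding, strided)
```

The same formula also gives the output size of a 2×2 pooling layer with stride 2 and no padding.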

How Filters Work

Parameter Count

Parameters = (kernel_h × kernel_w × in_channels + 1) × out_channels
Example: Conv2d(3, 64, kernel_size=3)
= (3 × 3 × 3 + 1) × 64 = 1,792 parameters
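The count can be checked with a short helper. This is a sketch with an illustrative function name; the `bias` flag is an assumption mirroring the "+ 1" in the formula.

```python
def conv2d_param_count(in_channels, out_channels, kernel_h, kernel_w, bias=True):
    """Learnable parameters of a 2D convolution layer."""
    weights_per_filter = kernel_h * kernel_w * in_channels
    bias_per_filter = 1 if bias else 0
    return (weights_per_filter + bias_per_filter) * out_channels

first_layer = conv2d_param_count(3, 64, 3, 3)    # the Conv2d(3, 64, kernel_size=3) example
deeper_layer = conv2d_param_count(64, 128, 3, 3)
print(first_layer, deeper_layer)
```

Note that the count does not depend on the image size, only on the kernel size and channel counts.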

5. POOLING LAYERS

Purpose of Pooling

  1. Reduce spatial dimensions → decrease computational cost
  2. Increase receptive field → each neuron "sees" more
  3. Add translation invariance → small shifts don't change output
  4. Reduce overfitting → fewer parameters in subsequent layers

Types of Pooling

Type                 | Operation              | Use Case
MaxPool2d(2)         | Takes maximum value    | Most common, preserves strong activations
AvgPool2d(2)         | Takes average value    | Smoother downsampling
AdaptiveAvgPool2d(1) | Global average pooling | Before final classifier, replaces flatten

Size Reduction with MaxPool(2):
32×32 → MaxPool → 16×16
16×16 → MaxPool → 8×8
 8×8  → MaxPool → 4×4
 4×4  → MaxPool → 2×2
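A pure-Python sketch of 2×2 max pooling with stride 2, assuming a single-channel feature map stored as nested lists. The function name and sample values are illustrative.

```python
def max_pool_2x2(feature_map):
    """Keep the maximum of each non-overlapping 2x2 window."""
    pooled = []
    for i in range(0, len(feature_map) - 1, 2):
        row = []
        for j in range(0, len(feature_map[0]) - 1, 2):
            row.append(max(
                feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1],
            ))
        pooled.append(row)
    return pooled

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(feature_map)  # 4x4 -> 2x2

# Each pooling step halves the side length, as in the chain above.
sizes = [32]
while sizes[-1] > 2:
    sizes.append(sizes[-1] // 2)
```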
            

6. ACTIVATION FUNCTIONS

Why Activation Functions?

Without activation functions, stacking layers is useless:

Linear → Linear → Linear ≡ Single Linear Layer

Activations introduce non-linearity, allowing networks to learn complex patterns.
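The collapse of stacked linear layers can be seen even with scalar layers. The sketch below uses hypothetical weights and biases chosen only for illustration.

```python
def linear(weight, bias):
    """Return a one-dimensional linear layer x -> weight * x + bias."""
    return lambda x: weight * x + bias

def relu(x):
    return max(0.0, x)

layer1 = linear(2.0, 1.0)
layer2 = linear(-3.0, 0.5)

# layer2(layer1(x)) = -3 * (2x + 1) + 0.5 = -6x - 2.5, i.e. one linear layer.
collapsed = linear(-3.0 * 2.0, -3.0 * 1.0 + 0.5)

inputs = [-2.0, -0.5, 0.0, 1.0, 3.0]
stacked_outputs = [layer2(layer1(x)) for x in inputs]
collapsed_outputs = [collapsed(x) for x in inputs]

# With a ReLU in between, no single linear layer reproduces the outputs.
nonlinear_outputs = [layer2(relu(layer1(x))) for x in inputs]
```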

ReLU (Rectified Linear Unit)

ReLU(x) = max(0, x)

Advantages:

Disadvantages:

Other Activations

Function   | Formula                         | Use Case
Leaky ReLU | max(0.01x, x)                   | Prevents dying neurons
Sigmoid    | 1 / (1 + e^(-x))                | Binary classification output
Tanh       | (e^x - e^(-x)) / (e^x + e^(-x)) | Zero-centered alternative to sigmoid
Softmax    | e^(x_i) / Σ_j e^(x_j)           | Multi-class classification output
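The formulas above can be written as scalar Python functions. This is a sketch using only the standard library; subtracting the maximum logit inside `softmax` is a common numerical-stability step added here, not something the formulas require.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Same as max(slope * x, x) for 0 < slope < 1.
    return x if x > 0 else slope * x

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def softmax(logits):
    shift = max(logits)  # does not change the result, avoids overflow
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([1.0, 2.0, 3.0])  # sums to 1, largest logit wins
```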

7. NORMALIZATION TECHNIQUES

Batch Normalization

Purpose: Normalize each channel's activations over the mini-batch to mean=0, std=1, then rescale by a learnable γ and shift by a learnable β

BatchNorm(x) = γ × (x - μ_batch) / √(σ²_batch + ε) + β
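A sketch of the formula for one channel, treating the batch as a flat list of values. The function name is illustrative, and the default ε of 1e-5 is an assumption.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of scalars, then apply scale gamma and shift beta."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)  # batch variance
    return [gamma * (v - mean) / math.sqrt(variance + eps) + beta for v in values]

batch = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(batch)                       # roughly zero mean, unit variance
rescaled = batch_norm(batch, gamma=2.0, beta=3.0)    # mean moves to beta
```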

Benefits:

Typical Placement

Conv → BatchNorm → ReLU

nn.Conv2d(64, 128, 3, padding=1),
nn.BatchNorm2d(128),
nn.ReLU(inplace=True)
During training: uses batch statistics
During inference: uses running averages computed during training

8. REGULARIZATION IN CNNs

Dropout

Purpose: Prevent overfitting by randomly dropping activations
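A sketch of "inverted" dropout on a flat list of activations: surviving values are scaled by 1/(1-p) during training so that nothing needs rescaling at evaluation time. The function name and the `rng` parameter are illustrative.

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Zero each value with probability p and rescale the survivors."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)  # identity at evaluation time
    keep_prob = 1.0 - p
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
train_out = dropout([1.0] * 8, p=0.5, training=True, rng=rng)
eval_out = dropout([1.0] * 8, p=0.5, training=False)
```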

Typical Usage

# Higher dropout in FC layers
nn.Dropout(0.5)

# Lower dropout after conv layers
nn.Dropout2d(0.25)

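Weight Decay (L2 Regularization)

Loss_total = Loss_original + λ × Σ(w²)

# Note: AdamW applies decoupled weight decay (weights are shrunk directly
# at each step) instead of adding the L2 term above to the loss. The two
# are equivalent only for plain SGD without momentum, up to a constant
# factor in λ.
optimizer = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=5e-4  # decay coefficient λ = 0.0005
)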
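The penalty term in the Loss_total formula, and the gradient it contributes, can be computed by hand. This is a sketch over a flat list of weights with illustrative function names.

```python
def l2_penalty(weights, lam):
    """lam * sum of squared weights, the term added to the loss."""
    return lam * sum(w * w for w in weights)

def l2_penalty_grad(weights, lam):
    """Gradient of the penalty with respect to each weight: 2 * lam * w."""
    return [2.0 * lam * w for w in weights]

weights = [1.0, -2.0, 3.0]
penalty = l2_penalty(weights, lam=5e-4)
penalty_grad = l2_penalty_grad(weights, lam=5e-4)
```

Each gradient component points away from zero in proportion to the weight, so a gradient step pulls every weight toward zero.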

Label Smoothing

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

Prevents overconfident predictions and improves calibration
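A sketch of what label smoothing does to the target distribution, assuming the convention where the smoothing mass ε is spread uniformly over all K classes (target = (1 - ε) × one-hot + ε / K). Function names are illustrative.

```python
import math

def smoothed_targets(label, num_classes, smoothing=0.1):
    """Soft target distribution for one example."""
    uniform_share = smoothing / num_classes
    return [uniform_share + (1.0 - smoothing) if k == label else uniform_share
            for k in range(num_classes)]

def cross_entropy(logits, target_dist):
    """Cross-entropy between softmax(logits) and a target distribution."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in logits))
    return -sum(t * (v - log_sum_exp) for v, t in zip(logits, target_dist))

targets = smoothed_targets(label=3, num_classes=10, smoothing=0.1)
logits = [0.0] * 10
logits[3] = 8.0  # a very confident, correct prediction
hard_loss = cross_entropy(logits, smoothed_targets(3, 10, smoothing=0.0))
smooth_loss = cross_entropy(logits, targets)
```

With smoothing, the very confident prediction is penalized more than with hard labels, which is the mechanism that discourages overconfidence.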

9. DATA AUGMENTATION

Why Data Augmentation?

Common Augmentations for CIFAR-10

Augmentation                    | Effect                 | Expected Gain
RandomCrop(32, padding=4)       | Translation invariance | +2-3%
RandomHorizontalFlip()          | Left-right symmetry    | +1-2%
ColorJitter(0.2, 0.2, 0.2, 0.1) | Lighting robustness    | +1-2%
RandomErasing(p=0.15)           | Occlusion robustness   | +0.5-1%

Complete Pipeline

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.2, 0.2, 0.2, 0.1),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=(0.4914, 0.4822, 0.4465),
        std=(0.2470, 0.2435, 0.2616)
    ),
    transforms.RandomErasing(p=0.15),
])
NEVER augment the test set! Evaluate on clean data only: apply just ToTensor and the same Normalize used for training.

10. TRAINING CNNs

Loss Functions

criterion = nn.CrossEntropyLoss(label_smoothing=0.05)

Optimizers

Optimizer      | Pros                          | Cons
Adam / AdamW   | Fast convergence, adaptive LR | Can generalize slightly worse
SGD + Momentum | Better generalization         | Requires careful tuning

Learning Rate Schedules

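# Pick ONE scheduler; the three assignments below are alternatives.

# Cosine Annealing (smooth decrease over T_max epochs)
# `epochs` is the total number of training epochs.
scheduler = optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs, eta_min=1e-6
)

# Step LR (multiply LR by gamma every step_size epochs)
scheduler = optim.lr_scheduler.StepLR(
    optimizer, step_size=30, gamma=0.1
)

# ReduceLROnPlateau (adaptive): halve LR when the monitored value stalls.
# Unlike the others, call scheduler.step(val_loss) with the metric each epoch.
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=5
)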
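The cosine and step schedules have simple closed forms. The sketch below computes the learning rate for a given epoch, assuming one scheduler step per epoch; the function names are illustrative.

```python
import math

def cosine_annealing_lr(epoch, base_lr, t_max, eta_min=0.0):
    """Cosine decay from base_lr (epoch 0) to eta_min (epoch t_max)."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))

def step_lr(epoch, base_lr, step_size, gamma):
    """Multiply the learning rate by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

cosine_schedule = [cosine_annealing_lr(e, base_lr=1e-3, t_max=50, eta_min=1e-6)
                   for e in range(51)]
step_schedule = [step_lr(e, base_lr=0.1, step_size=30, gamma=0.1)
                 for e in range(90)]
```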

Gradient Clipping

torch.nn.utils.clip_grad_norm_(
    model.parameters(),
    max_norm=1.0
)

Prevents exploding gradients, especially useful with high learning rates. Call it after loss.backward() and before optimizer.step().
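Clipping by norm rescales all gradients together so their combined L2 norm does not exceed max_norm. This sketch works on a flat list of gradient values; the function name and the small `eps` guard against division by zero are assumptions.

```python
import math

def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Return (clipped gradients, norm before clipping)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale gradients up
    return [g * scale for g in grads], total_norm

large_clipped, large_norm = clip_grad_norm([3.0, 4.0], max_norm=1.0)   # norm 5 -> about 1
small_clipped, small_norm = clip_grad_norm([0.3, 0.4], max_norm=1.0)   # already within limit
```

The direction of the gradient is preserved; only its length changes.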

11. CIFAR-10 DATASET

Dataset Overview

Property     | Value
Total Images | 60,000 (50k train, 10k test)
Resolution   | 32×32 pixels
Channels     | 3 (RGB)
Classes      | 10 (balanced)

Class Distribution

  1. Airplane
  2. Automobile
  3. Bird
  4. Cat
  5. Deer
  6. Dog
  7. Frog
  8. Horse
  9. Ship
  10. Truck

Data Normalization

# CIFAR-10 statistics
mean = (0.4914, 0.4822, 0.4465)  # RGB
std = (0.2470, 0.2435, 0.2616)

normalize = transforms.Normalize(mean=mean, std=std)
Why normalize? Zero-centered inputs speed up convergence and prevent activation saturation
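Per channel, normalization computes (value - mean) / std on pixel values already scaled to [0, 1]. A sketch for a single RGB pixel, with an illustrative helper name:

```python
mean = (0.4914, 0.4822, 0.4465)  # CIFAR-10 per-channel means (RGB)
std = (0.2470, 0.2435, 0.2616)   # CIFAR-10 per-channel standard deviations

def normalize_pixel(rgb, mean, std):
    """Apply (value - mean) / std to each channel of one pixel in [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, mean, std))

average_pixel = normalize_pixel(mean, mean, std)           # maps to zero in every channel
white_pixel = normalize_pixel((1.0, 1.0, 1.0), mean, std)  # positive in every channel
black_pixel = normalize_pixel((0.0, 0.0, 0.0), mean, std)  # negative in every channel
```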

Train/Val Split

# Typical split: 45k train / 5k val / 10k test
val_size = 5000
train_indices = list(range(0, 50000 - val_size))
val_indices = list(range(50000 - val_size, 50000))
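A seeded, shuffled variant of the split above, as a sketch. The helper name and default seed are illustrative; shuffling is an optional alternative to taking the last 5,000 indices.

```python
import random

def split_indices(num_samples=50000, val_size=5000, seed=0):
    """Shuffle indices with a fixed seed, then split into train and val."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # local generator, global state untouched
    return indices[val_size:], indices[:val_size]

train_indices, val_indices = split_indices()
print(len(train_indices), len(val_indices))
```

The resulting index lists can be passed to torch.utils.data.Subset to build the train and validation datasets.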

12. MODEL EVALUATION

Key Metrics

Metric     | Formula              | Interpretation
Accuracy   | correct / total      | Overall correctness
Loss       | CrossEntropyLoss     | Confidence-aware error
Confidence | max(softmax(logits)) | Model certainty
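Accuracy and confidence can be computed from raw logits as follows. This is a sketch on nested lists with hypothetical logits; the function names are illustrative.

```python
import math

def softmax(logits):
    shift = max(logits)  # for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def accuracy_and_confidence(batch_logits, targets):
    """Return (fraction correct, mean of max softmax probability)."""
    correct = 0
    confidences = []
    for logits, target in zip(batch_logits, targets):
        probs = softmax(logits)
        prediction = max(range(len(probs)), key=probs.__getitem__)  # argmax
        correct += int(prediction == target)
        confidences.append(probs[prediction])
    return correct / len(targets), sum(confidences) / len(confidences)

batch_logits = [[2.0, 0.5, 0.1], [0.1, 0.2, 3.0], [1.0, 1.5, 0.2]]
targets = [0, 2, 0]  # the third example is misclassified as class 1
accuracy, mean_confidence = accuracy_and_confidence(batch_logits, targets)
```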

Evaluation Function

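import torch

def evaluate(model, dataloader, criterion, device):
    """Return (mean loss per sample, accuracy) over the dataloader."""
    model.eval()  # IMPORTANT: disables dropout, BatchNorm uses running stats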
    running_loss = 0.0
    correct = 0
    total = 0

    with torch.no_grad():
        for inputs, targets in dataloader:
            inputs = inputs.to(device)
            targets = targets.to(device)

            outputs = model(inputs)
            loss = criterion(outputs, targets)

            running_loss += loss.item() * inputs.size(0)
            predictions = outputs.argmax(dim=1)
            correct += (predictions == targets).sum().item()
            total += targets.size(0)

    return running_loss / total, correct / total
Always call model.eval() before evaluation to disable dropout and switch BatchNorm to eval mode!

13. CNN vs FULLY CONNECTED NETWORKS

Parameter Comparison

Network Type    | First Layer      | Parameters
Fully Connected | 3,072 → 1,024    | 3,145,728
CNN             | 3→64, kernel=3×3 | 1,792
Ratio           |                  | ≈1,755:1

Spatial Awareness

Fully Connected:

CNN:

Typical CIFAR-10 Results

Model              | Parameters | Test Accuracy
Random Guessing    | -          | 10%
3-layer FC         | 3.8M       | 55-60%
Simple CNN         | 0.6M       | 75-80%
CNN + Augmentation | 1.2M       | 85-90%
ResNet-18          | 11M        | 92-95%

14. IMPLEMENTATION GUIDE

Model Architecture

import torch.nn as nn

class CIFAR10CNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()

        self.features = nn.Sequential(
            # Block 1: 32×32 → 16×16
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Dropout(0.25),

            # Block 2: 16×16 → 8×8
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Dropout(0.35),

            # Block 3: 8×8 → 1×1
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )

        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.4),
            nn.Linear(256, 128),
            nn.ReLU(inplace=True),
            nn.Dropout(0.2),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x
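The size of this model can be estimated by hand from its layer shapes. The sketch below counts learnable parameters only; BatchNorm running statistics are buffers, not parameters, so they are excluded. Helper names are illustrative.

```python
def conv_params(in_channels, out_channels, kernel_size=3):
    return (kernel_size * kernel_size * in_channels + 1) * out_channels

def batchnorm_params(channels):
    return 2 * channels  # learnable scale (gamma) and shift (beta)

def linear_params(in_features, out_features):
    return (in_features + 1) * out_features

# (in_channels, out_channels) of the five conv layers in self.features.
conv_shapes = [(3, 64), (64, 64), (64, 128), (128, 128), (128, 256)]

feature_params = sum(conv_params(c_in, c_out) + batchnorm_params(c_out)
                     for c_in, c_out in conv_shapes)
classifier_params = linear_params(256, 128) + linear_params(128, 10)
total_params = feature_params + classifier_params
print(f"features: {feature_params:,}  classifier: {classifier_params:,}  total: {total_params:,}")
```

The total comes to roughly 0.59M parameters, nearly all of them in the convolutional layers.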

Training Setup

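import torch
import torch.nn as nn
import torch.optim as optim

# Setup (train_loader is assumed to be a DataLoader over the training split)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_epochs = 50  # also used as T_max so the cosine schedule spans the whole run

# Model, loss, optimizer
model = CIFAR10CNN().to(device)
criterion = nn.CrossEntropyLoss(label_smoothing=0.05)
optimizer = optim.AdamW(model.parameters(), lr=2e-3, weight_decay=5e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)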

# Training loop
for epoch in range(num_epochs):
    model.train()
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()

    scheduler.step()

15. COMMON PITFALLS AND SOLUTIONS

Pitfall 1: Flattening at Input

WRONG:

def forward(self, x):
    x = x.view(x.size(0), -1)  # DESTROYS SPATIAL STRUCTURE!
    x = self.conv1(x)  # This will crash

CORRECT:

def forward(self, x):
    # x shape: [batch, 3, 32, 32] - keep spatial structure!
    x = self.features(x)
    x = self.classifier(x)
    return x

Pitfall 2: Augmenting Test Data

WRONG:

test_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),  # NO!
    transforms.ToTensor(),
])

CORRECT:

test_transform = transforms.Compose([
    transforms.ToTensor(),  # No augmentation!
    transforms.Normalize(mean, std),
])

Pitfall 3: Not Using .eval()

WRONG:

with torch.no_grad():
    for inputs, targets in test_loader:
        outputs = model(inputs)  # BatchNorm/Dropout still active!

CORRECT:

model.eval()  # IMPORTANT!
with torch.no_grad():
    for inputs, targets in test_loader:
        outputs = model(inputs)

Common Training Issues

Symptom                | Problem                             | Solution
Loss = NaN             | Learning rate too high              | Lower LR, add gradient clipping
Train >> Val accuracy  | Overfitting                         | More dropout, augmentation, weight decay
Both train/val low     | Underfitting                        | Increase capacity, train longer
Loss oscillates wildly | LR too high or batch size too small | Lower LR, increase batch size

SUMMARY

Key Takeaways

  1. CNNs preserve spatial structure - NEVER flatten at input
  2. Convolutions are parameter-efficient - same filter reused across image
  3. Use BatchNorm + ReLU after Conv layers for stable training
  4. Data augmentation is crucial - often +5-10% accuracy
  5. Always normalize inputs using dataset statistics
  6. Don't augment test set - evaluate on clean data
  7. Use proper train/val/test splits - 45k/5k/10k for CIFAR-10
  8. Monitor both train and val metrics - detect overfitting
  9. Save best model, not last epoch
  10. Set seeds for reproducibility

Expected Results

Configuration                    | Expected Test Accuracy
Simple CNN, no augmentation      | 70-75%
CNN + basic augmentation         | 80-85%
CNN + full augmentation pipeline | 85-90%

END OF LESSON

CMPUT 328 - VISUAL RECOGNITION

ASSIGNMENT 3: CNNs FOR IMAGE CLASSIFICATION
