← Back to Topics

MACHINE LEARNING & BACKPROPAGATION

COMPLETE TUTORIAL

CMPUT 328 - VISUAL RECOGNITION

TABLE OF CONTENTS

1. INTRODUCTION TO MACHINE LEARNING

What is Machine Learning?

Machine Learning is a method of teaching computers to:

Core Philosophy

According to Rich Sutton (ACM A.M. Turing Award 2024 recipient):

"General methods that leverage computation and learning ultimately prove more effective than approaches relying on human knowledge and domain-specific expertise."

This is known as The Bitter Lesson - we should not assume models (e.g., linear or not), we should just have a general learning method leveraging computation (e.g., Neural Networks + Gradient Descent).

2. WHAT PROBLEMS CAN ML SOLVE?

Machine Learning excels when three conditions are met:

  1. Many labeled examples - Abundant training data
  2. Patterns in the data - Reusable regularities across samples
  3. Complex relationships - Rules too intricate for manual programming

Examples

3. MODELS: SIMPLIFICATION OF REALITY

What is a Model?

Definition: A model is a simplification of a real process or system.

Usually we want to know what is the process that generates the data.

The Gravity Example

Given measurements of mass and force:

x (mass) y (force)
2 19.6
4.5 44.1
10 98

Goal: Find the relationship $y = ax$

Guesses:

We discovered: $F = 9.8m$ (gravitational constant $g \approx 9.8 \text{ m/s}^2$)

Model Notation

Key Insight

It suffices to approximate reality. We don't need Einstein's relativity formula:

$$y = \left(1 + \frac{1}{1 - v^2/c^2} \cdot \frac{v \cdot v^T}{c^2}\right)\left(1 - v^2/c^2\right)^{-1/2} \cdot ax$$

where $c \approx 2.99 \times 10^8$ m/s

For most purposes, $\hat{y} = 9.8x$ is good enough!

4. LOSS FUNCTIONS: MEASURING GOODNESS

What is a Good Approximation?

Question: Is it possible to evaluate "goodness"?

Answer: Yes! Using a loss function.

L2 (Euclidean) Loss

$$L = \sum_{i} \left(y_i - \hat{y}(x_i)\right)^2$$

Interpretation: Distance from the data to the model.

If $L = 0$, then $y_i = \hat{y}(x_i)$ for all $i$ (perfect fit).

L2 Loss Example

Using our gravity data:

Guess Formula Loss
a = 8 $\hat{y} = 8x$ L = 402.57
a = 9 $\hat{y} = 9x$ L = 79.52
a = 9.8 $\hat{y} = 9.8x$ L = 4.97

Lower loss = better approximation!

Vector Form

For vector outputs:

$$L = \frac{1}{2} ||\hat{y} - y||^2$$

Gradient with respect to predictions:

$$\frac{\partial L}{\partial \hat{y}} = \hat{y} - y$$

5. GRADIENT DESCENT: FINDING THE MINIMUM

The Problem

Question: How do we find the value for parameter $a$ that minimizes loss?

Answer: Gradient descent!

How Derivatives Show Influence

The derivative $\frac{\partial L}{\partial a}$ tells us:

The Gradient Descent Update Rule

$$a \leftarrow a - \eta \cdot \frac{\partial L}{\partial a}$$

where:

Example Iterations

Starting with $a = 8$:

Iteration a ∂L/∂a Update
0 8.00 -447 $a \leftarrow 8 - \eta(-447)$
1 9.34 -114 $a \leftarrow 9.34 - \eta(-114)$
2 9.68 -29 $a \leftarrow 9.68 - \eta(-29)$
3 9.77 -7.4 $a \leftarrow 9.77 - \eta(-7.4)$
4 9.79 -1.9 $a \leftarrow 9.79 - \eta(-1.9)$

Converges to $a \approx 9.8$!

Learning Rate Importance

Multiple Parameters

For multiple parameters (e.g., $y = ax + b$):

$$a \leftarrow a - \eta \cdot \frac{\partial L}{\partial a}$$ $$b \leftarrow b - \eta \cdot \frac{\partial L}{\partial b}$$

6. STOCHASTIC GRADIENT DESCENT

The Data Loading Problem

Challenge: size of dataset $\gg$ computer memory

Solution: Batching!

Three Flavors of Gradient Descent

Type Update When to Use
Full Batch $\theta \leftarrow \theta - \eta \nabla_\theta L(\theta; \text{all data})$ Convex problems, tiny datasets
Stochastic (SGD) $\theta \leftarrow \theta - \eta \nabla_\theta L(\theta; x_i)$ Huge datasets, escape sharp minima
Mini-Batch $\theta \leftarrow \theta - \eta \nabla_\theta \left(\frac{1}{B} \sum L(x_b)\right)$ Practical default - balances GPU utilization and noise

Mini-Batch SGD Checklist

  1. Shuffle samples each epoch for unbiased batches
  2. Normalize inputs so magnitudes are comparable
  3. Track both training and validation metrics to detect overfitting

7. CHOOSING MODELS

The Spectrum of Models

For fitting data, we could choose:

The Overfitting Problem

High-degree polynomial:

For Complex Data (Images, Text, etc.)

Question: What model for CIFAR-10 images (cats, dogs, etc.)?

Answer: Neural Networks!

Why?

8. NEURAL NETWORKS

Basic Architecture

Input (n features)
    ↓
Dense Layer: Z₁ = X·W₁ + b₁
    ↓
Activation: Z₂ = σ(Z₁)    [Non-linearity!]
    ↓
Dense Layer: Z₃ = Z₂·W₂ + b₂
    ↓
Output (m classes)
            

Why Non-linearities?

Without activation functions:

Linear → Linear → Linear ≡ Single Linear Layer

With activation functions (ReLU, sigmoid, etc.):

Key Components

  1. Weights (W): Learned parameters
  2. Biases (b): Learned offsets
  3. Activations (σ): Non-linear functions (ReLU, sigmoid, tanh)
  4. Loss function (L): Measures error
  5. Optimizer: Updates parameters (SGD, Adam, etc.)

9. BACKPROPAGATION: THE CHAIN RULE

Computational Graph

For a simple scalar neural network:

X → [·W₁] → Z₁ → [+b₁] → Z₂ → [σ] → Z₃ → [·W₂] → Z₄ → [+b₂] → Yₚ → [Loss] → L
                                                                    ↑
                                                                    Y (true)
            

Forward Pass

Compute outputs from inputs:

  1. $Z_1 = X \cdot W_1$
  2. $Z_2 = Z_1 + b_1$
  3. $Z_3 = \sigma(Z_2)$
  4. $Z_4 = Z_3 \cdot W_2$
  5. $Y_p = Z_4 + b_2$
  6. $L = \frac{1}{2}(Y_p - Y)^2$

Backward Pass (Backpropagation)

Apply chain rule right-to-left:

$$\frac{\partial L}{\partial Y_p} = Y_p - Y$$ $$\frac{\partial L}{\partial Z_4} = \frac{\partial L}{\partial Y_p} \cdot \frac{\partial Y_p}{\partial Z_4} = (Y_p - Y) \cdot 1$$ $$\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial Z_4} \cdot \frac{\partial Z_4}{\partial W_2} = (Y_p - Y) \cdot Z_3$$ $$\frac{\partial L}{\partial Z_3} = \frac{\partial L}{\partial Z_4} \cdot \frac{\partial Z_4}{\partial Z_3} = (Y_p - Y) \cdot W_2$$ $$\frac{\partial L}{\partial Z_2} = \frac{\partial L}{\partial Z_3} \cdot \frac{\partial Z_3}{\partial Z_2} = \frac{\partial L}{\partial Z_3} \cdot \sigma'(Z_2)$$ $$\frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial b_1} = \frac{\partial L}{\partial Z_2} \cdot 1$$ $$\frac{\partial L}{\partial Z_1} = \frac{\partial L}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial Z_1} = \frac{\partial L}{\partial Z_2} \cdot 1$$ $$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial Z_1} \cdot \frac{\partial Z_1}{\partial W_1} = \frac{\partial L}{\partial Z_1} \cdot X$$

Matrix/Vector Extension

For matrices and vectors, use:

Key formulas:

$$\frac{\partial L}{\partial W_1} = X^T \cdot \delta Z_1$$ $$\frac{\partial L}{\partial W_2} = Z_3^T \cdot \delta Z_4$$ $$\frac{\partial L}{\partial b} = \sum_k (\delta Z)_k \quad \text{[sum over batch]}$$

10. LOSS FUNCTION DERIVATIVES

Euclidean (MSE) Loss

Forward:

$$L = \frac{1}{2} ||Y_p - Y||^2 = \frac{1}{2} \sum_i (Y_{pi} - Y_i)^2$$

Backward:

$$\frac{\partial L}{\partial Y_{pi}} = Y_{pi} - Y_i$$

Vector notation:

$$\delta Y_p = Y_p - Y$$

Softmax + Cross-Entropy Loss

Forward:

Logits → Softmax → Probabilities → Cross-Entropy → Loss
            
$$y_{pi} = \frac{\exp(\text{logits}_i)}{\sum_j \exp(\text{logits}_j)}$$ $$\text{Loss} = -\sum_k y_k \cdot \log(y_{pk})$$

Backward (the beautiful simplification!):

$$\frac{\partial \text{Loss}}{\partial \text{logits}_i} = y_{pi} - y_i$$
Softmax + Cross-Entropy has elegant gradient: just the difference between predicted and true probabilities!

Comparison: L2 vs Cross-Entropy for Classification

Property L2 Loss Cross-Entropy
Symmetry Symmetric Asymmetric
Gradient behavior Shrinks when saturated Well-conditioned
Use case Regression Classification
Confidence penalty Weak Strong for wrong predictions

Sanity check: If you used L2 loss after softmax, gradients would shrink dramatically when probabilities saturate, slowing learning.

11. BACKPROPAGATION FOR CONVOLUTIONAL LAYERS

Setup

Assume: $H \geq h$ and $W \geq w$

Two Questions

Given $\frac{\partial L}{\partial J}$ (upstream gradient), find:

  1. $\frac{\partial L}{\partial K}$ - To update kernel by gradient descent
  2. $\frac{\partial L}{\partial I}$ - To backprop to previous layer

Derivation of ∂L/∂K

Forward:

$$J(i,j) = \sum_{l} \sum_{m} I(i+l-1, j+m-1) \cdot K(l,m)$$

Backward:

$$\frac{\partial J(i,j)}{\partial K(p,q)} = I(i+p-1, j+q-1)$$ $$\delta K(p,q) = \sum_i \sum_j I(i+p-1, j+q-1) \cdot \delta J(i,j)$$

Result (correlation):

$$\delta K = I \circledast \delta J$$

where $\circledast$ is correlation (slide input patches under upstream gradient and accumulate).

Derivation of ∂L/∂I

Backward:

$$\frac{\partial J(i,j)}{\partial I(p,q)} = \begin{cases} K(p-i+1, q-j+1), & \text{if } 0 \leq p-i \leq h-1 \text{ and } 0 \leq q-j \leq w-1 \\ 0, & \text{otherwise} \end{cases}$$ $$\delta I(p,q) = \sum_i \sum_j K(p-i+1, q-j+1) \cdot \delta J(i,j)$$

Result (full convolution with flipped kernel):

$$\delta I = \text{pad}(\delta J) * \text{flip}(K)$$

where:

Example of flip:

K = [1 3 5]    flip(K) = [6 4 2]
    [2 4 6]               [5 3 1]
            

Implementation Tips

  1. Cache input patches or use im2col to vectorize
  2. Reuse same stride/padding from forward pass
  3. Mismatched shapes are the #1 error source!

12. BACKPROPAGATION FOR MAX POOLING

Setup

$$\text{Max Pool: } [x_1, x_2, \ldots, x_n] \rightarrow y = \max(x_i)$$

Key Idea

In case of a tie, choose a single index deterministically:

$$i^* = \arg\max(x_k)$$

Backward Pass

$$\frac{\partial y}{\partial x_i} = \begin{cases} 1, & \text{if } i = \arg\max(x_k) \\ 0, & \text{otherwise} \end{cases}$$ $$\delta x_i = \begin{cases} \delta y, & \text{if } i = \arg\max(x_k) \\ 0, & \text{otherwise} \end{cases}$$

Route the gradient only to the index that held the maximum!

Example: 2×2 MaxPool, stride 2

Forward:

x = [2  -3]    MaxPool    y = [5  6]
    [4   5]      →            [0  8]
    [6   5]

    [6  -7]
    [0  -1]

    [-2 -6]
    [3  -4]

    [6   8]
            

Backward:

δy = [δy₁  δy₂]
     [δy₃  δy₄]

δx = [0    0  ]
     [0   δy₁]
     [δy₃  0  ]

     [δy₂  0  ]
     [0    0  ]

     [0    0  ]
     [0    0  ]

     [0   δy₄]

Implementation

Pseudo-code:

# Forward pass: store argmax indices
for (i, j), pooling_region in enumerate(regions):
    argmax_cache[i, j] = argmax(pooling_region)
    output[i, j] = max(pooling_region)

# Backward pass: route gradient
for (i, j), grad in enumerate(dL_dOutput):
    (ii, jj) = argmax_cache[i, j]
    dL_dInput[ii, jj] += grad
Never recompute argmax from mutated tensors - always cache!

13. BACKPROPAGATION FOR ReLU

Setup

$$\text{ReLU: } y = \max(0, x)$$

Backward Pass

$$\frac{\partial y}{\partial x} = \begin{cases} 1, & \text{if } x > 0 \\ 0, & \text{otherwise} \end{cases}$$ $$\delta x = \delta y \odot \mathbb{1}[x > 0]$$

where $\odot$ is element-wise multiplication and $\mathbb{1}[\cdot]$ is indicator function.

Example

Forward:

x = [-5  4]    ReLU    y = [0  4]
    [ 0  2]      →        [0  2]
            

Backward:

δy = [δy₁  δy₂]    →    δx = [0     δy₂]
     [δy₃  δy₄]              [0     δy₄]

Implementation

Keep the boolean mask from forward pass:

# Forward
mask = (x > 0)
y = x * mask

# Backward
dx = dy * mask
Do NOT recompute mask with mutated x!

14. PUTTING IT ALL TOGETHER

Example Network

Conv → ReLU → MaxPool → Loss
 ↓      ↓       ↓        ↓
 y      z       p        L
 ↑      ↑       ↑
 x      y       z
            

Training Algorithm

Initialize: Parameter $K$ (convolution kernel)

Iterate:

Step 1: Forward Pass

Step 2: Compute Loss

$$L = \frac{1}{2} ||p - q||^2$$

Step 3: Backward Pass (Backpropagation)

Step 4: Update Parameters

$$K \leftarrow K - \eta \cdot \delta K$$
We don't compute $\delta x$ because there's no layer before Conv in this example.

15. ADVANCED TOPICS

Example 1: Conv → ReLU → MaxPool with Numbers

Given:

I = [1  3  2  4  6  4]
    [4  8  3  1  0  2]
    [2  1  4  3  9  1]
    [4  7  2  3  9  2]

W = [-1  0  1]    (edge detection kernel)
    [-1  0  1]
    [-1  0  1]

Task: Compute $J$, $K$, $O$ and backpropagation gradients.

See Practice_Problem.ipynb and verify with PyTorch autograd!

Example 2: Shared Parameters

Network:

$$Y = \text{ReLU}(X \cdot W) \cdot W$$

Loss: L2 between Y and X

Note: Same parameter matrix $W$ appears twice - shared weights!

Gradient:

$$\delta W = Z_2^T \cdot \delta Y + X^T \cdot \delta Z_1$$
Add gradients from both computational nodes where W appears.

Residual Connections

Network:

$$Y = X + F(X)$$

Backward:

$$\delta X = \delta Y + \delta F$$

Key insight: Gradient flows through both identity and function paths.

Transformer Backpropagation

Complex architecture with:

Same chain rule, but more complex computational graph!

16. SUMMARY & KEY TAKEAWAYS

The Big Picture

Data → Model → Loss → Optimization
         ↑                 |
         |---- Backprop ----
            

Core Concepts

  1. Models approximate reality
    • Simplification of unknown data-generating process
    • Goal: generalize to unseen data
  2. Loss functions quantify error
    • L2 loss for regression
    • Cross-entropy for classification
  3. Gradient descent minimizes loss
    • Update rule: $\theta \leftarrow \theta - \eta \cdot \nabla L$
    • Mini-batch SGD is practical default
  4. Neural networks are universal approximators
    • Don't assume model form
    • Let network learn representation
    • Need: data, computation, backpropagation
  5. Backpropagation is chain rule
    • Forward: compute outputs
    • Backward: compute gradients right-to-left
    • Update: gradient descent

Layer-Specific Backprop Recipes

Layer Forward Backward
Linear $Z = X \cdot W + b$ $\delta W = X^T \cdot \delta Z$, $\delta X = \delta Z \cdot W^T$
ReLU $Y = \max(0, X)$ $\delta X = \delta Y \odot \mathbb{1}[X > 0]$
Conv $J = I * K$ $\delta K = I \circledast \delta J$, $\delta I = \text{pad}(\delta J) * \text{flip}(K)$
MaxPool $Y = \max(X)$ Route $\delta Y$ to argmax location
Softmax+CE $y_p = \text{softmax}(z)$, $L = -\sum y \cdot \log(y_p)$ $\delta z = y_p - y$

Practical Workflow

  1. Data sanity: Visualize random samples after augmentation
  2. Forward caches: Store intermediates for backward pass
  3. Gradient checks: Compare manual gradients vs autograd
  4. Learning rate sweeps: Try {1e-1, 1e-2, 1e-3}
  5. Monitor metrics: Track train/val loss and accuracy

Common Pitfalls

Symptom Cause Fix
Shape mismatch Forgot flip/pad Follow formulas exactly
Pooling gradients everywhere Lost argmax Cache switches in forward
Accuracy at random chance Wrong loss Use cross-entropy for classification
Train/val diverge Data leakage Separate augmentation for train/val
Exploding gradients Too large LR Gradient clipping, lower LR
Vanishing gradients Too many sigmoid layers Use ReLU, residual connections

Debugging Patterns

Lightweight instrumentation:

Autograd verification:

# Manual calculation
δW_manual = compute_gradient_manually()

# PyTorch autograd
loss.backward()
δW_auto = W.grad

# Compare
assert torch.allclose(δW_manual, δW_auto, atol=1e-5)

Next Steps

  1. Review Anki cards (tagged by topic)
  2. Re-implement Conv → ReLU → Pool example
  3. Verify gradients numerically before scaling to datasets
  4. Apply to Assignment 7 architecture

REFERENCES


END OF LESSON

CMPUT 328 - VISUAL RECOGNITION

MACHINE LEARNING & BACKPROPAGATION

Happy learning and backpropagating!

DOWNLOAD ANKI DECK