← Back to Topics
1. INTRODUCTION TO MACHINE LEARNING
What is Machine Learning?
Machine Learning is a method of teaching computers to:
- Learn from data - Extract patterns from examples
- Generalize to unseen data - Apply learned patterns to new situations
- Without explicit instructions - No need to manually program every rule
Core Philosophy
According to Rich Sutton (ACM A.M. Turing Award 2024 recipient):
"General methods that leverage computation and learning ultimately prove more effective than approaches relying on human knowledge and domain-specific expertise."
This is known as The Bitter Lesson - we should not assume models (e.g., linear or not), we should just have a general learning method leveraging computation (e.g., Neural Networks + Gradient Descent).
2. WHAT PROBLEMS CAN ML SOLVE?
Machine Learning excels when three conditions are met:
- Many labeled examples - Abundant training data
- Patterns in the data - Reusable regularities across samples
- Complex relationships - Rules too intricate for manual programming
Examples
- Image classification (cats vs dogs, CIFAR-10)
- Object detection and segmentation
- Natural language processing
- Recommendation systems
- Game playing (AlphaGo, Chess)
3. MODELS: SIMPLIFICATION OF REALITY
What is a Model?
Definition: A model is a simplification of a real process or system.
Usually we want to know what is the process that generates the data.
The Gravity Example
Given measurements of mass and force:
| x (mass) |
y (force) |
| 2 |
19.6 |
| 4.5 |
44.1 |
| 10 |
98 |
Goal: Find the relationship $y = ax$
Guesses:
- $a = 5$ → Poor fit
- $a = 12$ → Poor fit
- $a = 10$ → Better fit
- $a = 9.8$ → Best fit!
We discovered: $F = 9.8m$ (gravitational constant $g \approx 9.8 \text{ m/s}^2$)
Model Notation
- $y$ = true output (observed data)
- $\hat{y}$ = predicted output (model prediction)
- Goal: Make $\hat{y} \approx y$
Key Insight
It suffices to approximate reality. We don't need Einstein's relativity formula:
$$y = \left(1 + \frac{1}{1 - v^2/c^2} \cdot \frac{v \cdot v^T}{c^2}\right)\left(1 - v^2/c^2\right)^{-1/2} \cdot ax$$
where $c \approx 2.99 \times 10^8$ m/s
For most purposes, $\hat{y} = 9.8x$ is good enough!
4. LOSS FUNCTIONS: MEASURING GOODNESS
What is a Good Approximation?
Question: Is it possible to evaluate "goodness"?
Answer: Yes! Using a loss function.
L2 (Euclidean) Loss
$$L = \sum_{i} \left(y_i - \hat{y}(x_i)\right)^2$$
Interpretation: Distance from the data to the model.
If $L = 0$, then $y_i = \hat{y}(x_i)$ for all $i$ (perfect fit).
L2 Loss Example
Using our gravity data:
| Guess |
Formula |
Loss |
| a = 8 |
$\hat{y} = 8x$ |
L = 402.57 |
| a = 9 |
$\hat{y} = 9x$ |
L = 79.52 |
| a = 9.8 |
$\hat{y} = 9.8x$ |
L = 4.97 |
Lower loss = better approximation!
Vector Form
For vector outputs:
$$L = \frac{1}{2} ||\hat{y} - y||^2$$
Gradient with respect to predictions:
$$\frac{\partial L}{\partial \hat{y}} = \hat{y} - y$$
5. GRADIENT DESCENT: FINDING THE MINIMUM
The Problem
Question: How do we find the value for parameter $a$ that minimizes loss?
Answer: Gradient descent!
How Derivatives Show Influence
The derivative $\frac{\partial L}{\partial a}$ tells us:
- If positive: Increasing $a$ increases loss → decrease $a$
- If negative: Increasing $a$ decreases loss → increase $a$
- If zero: We're at a minimum (or maximum/saddle point)
The Gradient Descent Update Rule
$$a \leftarrow a - \eta \cdot \frac{\partial L}{\partial a}$$
where:
- $\eta$ = learning rate (chosen by us)
- $\frac{\partial L}{\partial a}$ = derivative of loss with respect to parameter
Example Iterations
Starting with $a = 8$:
| Iteration |
a |
∂L/∂a |
Update |
| 0 |
8.00 |
-447 |
$a \leftarrow 8 - \eta(-447)$ |
| 1 |
9.34 |
-114 |
$a \leftarrow 9.34 - \eta(-114)$ |
| 2 |
9.68 |
-29 |
$a \leftarrow 9.68 - \eta(-29)$ |
| 3 |
9.77 |
-7.4 |
$a \leftarrow 9.77 - \eta(-7.4)$ |
| 4 |
9.79 |
-1.9 |
$a \leftarrow 9.79 - \eta(-1.9)$ |
Converges to $a \approx 9.8$!
Learning Rate Importance
- Too large: Steps overshoot minimum, oscillate
- Too small: Training is slow, may get stuck
- Just right: Smooth convergence
Multiple Parameters
For multiple parameters (e.g., $y = ax + b$):
$$a \leftarrow a - \eta \cdot \frac{\partial L}{\partial a}$$
$$b \leftarrow b - \eta \cdot \frac{\partial L}{\partial b}$$
6. STOCHASTIC GRADIENT DESCENT
The Data Loading Problem
Challenge: size of dataset $\gg$ computer memory
Solution: Batching!
Three Flavors of Gradient Descent
| Type |
Update |
When to Use |
| Full Batch |
$\theta \leftarrow \theta - \eta \nabla_\theta L(\theta; \text{all data})$ |
Convex problems, tiny datasets |
| Stochastic (SGD) |
$\theta \leftarrow \theta - \eta \nabla_\theta L(\theta; x_i)$ |
Huge datasets, escape sharp minima |
| Mini-Batch |
$\theta \leftarrow \theta - \eta \nabla_\theta \left(\frac{1}{B} \sum L(x_b)\right)$ |
Practical default - balances GPU utilization and noise |
Mini-Batch SGD Checklist
- Shuffle samples each epoch for unbiased batches
- Normalize inputs so magnitudes are comparable
- Track both training and validation metrics to detect overfitting
7. CHOOSING MODELS
The Spectrum of Models
For fitting data, we could choose:
- Linear model: $y = ax + b$
- Quadratic model: $y = ax^2 + bx + c$
- Cosine model: $y = a \cdot \cos(bx + c) + d$
- Polynomial model: $y = \sum a_i x^i$
The Overfitting Problem
High-degree polynomial:
- Passes through all training points ($L = 0$)
- But fails on new data (poor generalization)
- This is overfitting
For Complex Data (Images, Text, etc.)
Question: What model for CIFAR-10 images (cats, dogs, etc.)?
Answer: Neural Networks!
Why?
- Don't assume linear/quadratic/etc.
- Let the network learn the right representation
- Just need:
- General learning method (backpropagation)
- Computation (GPUs)
- Data
8. NEURAL NETWORKS
Basic Architecture
Input (n features)
↓
Dense Layer: Z₁ = X·W₁ + b₁
↓
Activation: Z₂ = σ(Z₁) [Non-linearity!]
↓
Dense Layer: Z₃ = Z₂·W₂ + b₂
↓
Output (m classes)
Why Non-linearities?
Without activation functions:
Linear → Linear → Linear ≡ Single Linear Layer
With activation functions (ReLU, sigmoid, etc.):
- Can approximate any function (Universal Approximation Theorem)
- Learn complex patterns
- Hierarchical representations
Key Components
- Weights (W): Learned parameters
- Biases (b): Learned offsets
- Activations (σ): Non-linear functions (ReLU, sigmoid, tanh)
- Loss function (L): Measures error
- Optimizer: Updates parameters (SGD, Adam, etc.)
9. BACKPROPAGATION: THE CHAIN RULE
Computational Graph
For a simple scalar neural network:
X → [·W₁] → Z₁ → [+b₁] → Z₂ → [σ] → Z₃ → [·W₂] → Z₄ → [+b₂] → Yₚ → [Loss] → L
↑
Y (true)
Forward Pass
Compute outputs from inputs:
- $Z_1 = X \cdot W_1$
- $Z_2 = Z_1 + b_1$
- $Z_3 = \sigma(Z_2)$
- $Z_4 = Z_3 \cdot W_2$
- $Y_p = Z_4 + b_2$
- $L = \frac{1}{2}(Y_p - Y)^2$
Backward Pass (Backpropagation)
Apply chain rule right-to-left:
$$\frac{\partial L}{\partial Y_p} = Y_p - Y$$
$$\frac{\partial L}{\partial Z_4} = \frac{\partial L}{\partial Y_p} \cdot \frac{\partial Y_p}{\partial Z_4} = (Y_p - Y) \cdot 1$$
$$\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial Z_4} \cdot \frac{\partial Z_4}{\partial W_2} = (Y_p - Y) \cdot Z_3$$
$$\frac{\partial L}{\partial Z_3} = \frac{\partial L}{\partial Z_4} \cdot \frac{\partial Z_4}{\partial Z_3} = (Y_p - Y) \cdot W_2$$
$$\frac{\partial L}{\partial Z_2} = \frac{\partial L}{\partial Z_3} \cdot \frac{\partial Z_3}{\partial Z_2} = \frac{\partial L}{\partial Z_3} \cdot \sigma'(Z_2)$$
$$\frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial b_1} = \frac{\partial L}{\partial Z_2} \cdot 1$$
$$\frac{\partial L}{\partial Z_1} = \frac{\partial L}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial Z_1} = \frac{\partial L}{\partial Z_2} \cdot 1$$
$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial Z_1} \cdot \frac{\partial Z_1}{\partial W_1} = \frac{\partial L}{\partial Z_1} \cdot X$$
Matrix/Vector Extension
For matrices and vectors, use:
- Jacobians instead of derivatives
- Matrix transposes for proper dimensionality
Key formulas:
$$\frac{\partial L}{\partial W_1} = X^T \cdot \delta Z_1$$
$$\frac{\partial L}{\partial W_2} = Z_3^T \cdot \delta Z_4$$
$$\frac{\partial L}{\partial b} = \sum_k (\delta Z)_k \quad \text{[sum over batch]}$$
10. LOSS FUNCTION DERIVATIVES
Euclidean (MSE) Loss
Forward:
$$L = \frac{1}{2} ||Y_p - Y||^2 = \frac{1}{2} \sum_i (Y_{pi} - Y_i)^2$$
Backward:
$$\frac{\partial L}{\partial Y_{pi}} = Y_{pi} - Y_i$$
Vector notation:
$$\delta Y_p = Y_p - Y$$
Softmax + Cross-Entropy Loss
Forward:
Logits → Softmax → Probabilities → Cross-Entropy → Loss
$$y_{pi} = \frac{\exp(\text{logits}_i)}{\sum_j \exp(\text{logits}_j)}$$
$$\text{Loss} = -\sum_k y_k \cdot \log(y_{pk})$$
Backward (the beautiful simplification!):
$$\frac{\partial \text{Loss}}{\partial \text{logits}_i} = y_{pi} - y_i$$
Softmax + Cross-Entropy has elegant gradient: just the difference between predicted and true probabilities!
Comparison: L2 vs Cross-Entropy for Classification
| Property |
L2 Loss |
Cross-Entropy |
| Symmetry |
Symmetric |
Asymmetric |
| Gradient behavior |
Shrinks when saturated |
Well-conditioned |
| Use case |
Regression |
Classification |
| Confidence penalty |
Weak |
Strong for wrong predictions |
Sanity check: If you used L2 loss after softmax, gradients would shrink dramatically when probabilities saturate, slowing learning.
11. BACKPROPAGATION FOR CONVOLUTIONAL LAYERS
Setup
- Input: $I$ ($H \times W$ matrix)
- Kernel: $K$ ($h \times w$ matrix)
- Output: $J = I * K$ (convolution, size $(H-h+1) \times (W-w+1)$)
Assume: $H \geq h$ and $W \geq w$
Two Questions
Given $\frac{\partial L}{\partial J}$ (upstream gradient), find:
- $\frac{\partial L}{\partial K}$ - To update kernel by gradient descent
- $\frac{\partial L}{\partial I}$ - To backprop to previous layer
Derivation of ∂L/∂K
Forward:
$$J(i,j) = \sum_{l} \sum_{m} I(i+l-1, j+m-1) \cdot K(l,m)$$
Backward:
$$\frac{\partial J(i,j)}{\partial K(p,q)} = I(i+p-1, j+q-1)$$
$$\delta K(p,q) = \sum_i \sum_j I(i+p-1, j+q-1) \cdot \delta J(i,j)$$
Result (correlation):
$$\delta K = I \circledast \delta J$$
where $\circledast$ is correlation (slide input patches under upstream gradient and accumulate).
Derivation of ∂L/∂I
Backward:
$$\frac{\partial J(i,j)}{\partial I(p,q)} = \begin{cases}
K(p-i+1, q-j+1), & \text{if } 0 \leq p-i \leq h-1 \text{ and } 0 \leq q-j \leq w-1 \\
0, & \text{otherwise}
\end{cases}$$
$$\delta I(p,q) = \sum_i \sum_j K(p-i+1, q-j+1) \cdot \delta J(i,j)$$
Result (full convolution with flipped kernel):
$$\delta I = \text{pad}(\delta J) * \text{flip}(K)$$
where:
- pad: adds $(h-1)$ zero rows top/bottom and $(w-1)$ zero columns left/right
- flip: rotates kernel 180°
Example of flip:
K = [1 3 5] flip(K) = [6 4 2]
[2 4 6] [5 3 1]
Implementation Tips
- Cache input patches or use
im2col to vectorize
- Reuse same stride/padding from forward pass
- Mismatched shapes are the #1 error source!
12. BACKPROPAGATION FOR MAX POOLING
Setup
$$\text{Max Pool: } [x_1, x_2, \ldots, x_n] \rightarrow y = \max(x_i)$$
Key Idea
In case of a tie, choose a single index deterministically:
$$i^* = \arg\max(x_k)$$
Backward Pass
$$\frac{\partial y}{\partial x_i} = \begin{cases}
1, & \text{if } i = \arg\max(x_k) \\
0, & \text{otherwise}
\end{cases}$$
$$\delta x_i = \begin{cases}
\delta y, & \text{if } i = \arg\max(x_k) \\
0, & \text{otherwise}
\end{cases}$$
Route the gradient only to the index that held the maximum!
Example: 2×2 MaxPool, stride 2
Forward:
x = [2 -3] MaxPool y = [5 6]
[4 5] → [0 8]
[6 5]
[6 -7]
[0 -1]
[-2 -6]
[3 -4]
[6 8]
Backward:
δy = [δy₁ δy₂]
[δy₃ δy₄]
δx = [0 0 ]
[0 δy₁]
[δy₃ 0 ]
[δy₂ 0 ]
[0 0 ]
[0 0 ]
[0 0 ]
[0 δy₄]
Implementation
Pseudo-code:
# Forward pass: store argmax indices
for (i, j), pooling_region in enumerate(regions):
argmax_cache[i, j] = argmax(pooling_region)
output[i, j] = max(pooling_region)
# Backward pass: route gradient
for (i, j), grad in enumerate(dL_dOutput):
(ii, jj) = argmax_cache[i, j]
dL_dInput[ii, jj] += grad
Never recompute argmax from mutated tensors - always cache!
13. BACKPROPAGATION FOR ReLU
Setup
$$\text{ReLU: } y = \max(0, x)$$
Backward Pass
$$\frac{\partial y}{\partial x} = \begin{cases}
1, & \text{if } x > 0 \\
0, & \text{otherwise}
\end{cases}$$
$$\delta x = \delta y \odot \mathbb{1}[x > 0]$$
where $\odot$ is element-wise multiplication and $\mathbb{1}[\cdot]$ is indicator function.
Example
Forward:
x = [-5 4] ReLU y = [0 4]
[ 0 2] → [0 2]
Backward:
δy = [δy₁ δy₂] → δx = [0 δy₂]
[δy₃ δy₄] [0 δy₄]
Implementation
Keep the boolean mask from forward pass:
# Forward
mask = (x > 0)
y = x * mask
# Backward
dx = dy * mask
Do NOT recompute mask with mutated x!
14. PUTTING IT ALL TOGETHER
Example Network
Conv → ReLU → MaxPool → Loss
↓ ↓ ↓ ↓
y z p L
↑ ↑ ↑
x y z
Training Algorithm
Initialize: Parameter $K$ (convolution kernel)
Iterate:
Step 1: Forward Pass
- 1a. Randomly choose training example $x$ and ideal output $q$
- 1b. Pass $x$ through Conv to get $y$
- 1c. Pass $y$ through ReLU to get $z$
- 1d. Pass $z$ through MaxPool to get $p$
Step 2: Compute Loss
$$L = \frac{1}{2} ||p - q||^2$$
Step 3: Backward Pass (Backpropagation)
- 3a. Compute $\delta p = \frac{\partial L}{\partial p} = p - q$
- 3b. Compute $\delta z$ given $\delta p$ (MaxPool backprop)
- 3c. Compute $\delta y$ given $\delta z$ (ReLU backprop)
- 3d. Compute $\delta K$ given $\delta y$ (Conv backprop)
Step 4: Update Parameters
$$K \leftarrow K - \eta \cdot \delta K$$
We don't compute $\delta x$ because there's no layer before Conv in this example.
15. ADVANCED TOPICS
Example 1: Conv → ReLU → MaxPool with Numbers
Given:
I = [1 3 2 4 6 4]
[4 8 3 1 0 2]
[2 1 4 3 9 1]
[4 7 2 3 9 2]
W = [-1 0 1] (edge detection kernel)
[-1 0 1]
[-1 0 1]
Task: Compute $J$, $K$, $O$ and backpropagation gradients.
See Practice_Problem.ipynb and verify with PyTorch autograd!
Example 2: Shared Parameters
Network:
$$Y = \text{ReLU}(X \cdot W) \cdot W$$
Loss: L2 between Y and X
Note: Same parameter matrix $W$ appears twice - shared weights!
Gradient:
$$\delta W = Z_2^T \cdot \delta Y + X^T \cdot \delta Z_1$$
Add gradients from both computational nodes where W appears.
Residual Connections
Network:
$$Y = X + F(X)$$
Backward:
$$\delta X = \delta Y + \delta F$$
Key insight: Gradient flows through both identity and function paths.
Transformer Backpropagation
Complex architecture with:
- Multi-head attention
- Layer normalization
- Residual connections
- Feed-forward networks
Same chain rule, but more complex computational graph!
16. SUMMARY & KEY TAKEAWAYS
The Big Picture
Data → Model → Loss → Optimization
↑ |
|---- Backprop ----
Core Concepts
- Models approximate reality
- Simplification of unknown data-generating process
- Goal: generalize to unseen data
- Loss functions quantify error
- L2 loss for regression
- Cross-entropy for classification
- Gradient descent minimizes loss
- Update rule: $\theta \leftarrow \theta - \eta \cdot \nabla L$
- Mini-batch SGD is practical default
- Neural networks are universal approximators
- Don't assume model form
- Let network learn representation
- Need: data, computation, backpropagation
- Backpropagation is chain rule
- Forward: compute outputs
- Backward: compute gradients right-to-left
- Update: gradient descent
Layer-Specific Backprop Recipes
| Layer |
Forward |
Backward |
| Linear |
$Z = X \cdot W + b$ |
$\delta W = X^T \cdot \delta Z$, $\delta X = \delta Z \cdot W^T$ |
| ReLU |
$Y = \max(0, X)$ |
$\delta X = \delta Y \odot \mathbb{1}[X > 0]$ |
| Conv |
$J = I * K$ |
$\delta K = I \circledast \delta J$, $\delta I = \text{pad}(\delta J) * \text{flip}(K)$ |
| MaxPool |
$Y = \max(X)$ |
Route $\delta Y$ to argmax location |
| Softmax+CE |
$y_p = \text{softmax}(z)$, $L = -\sum y \cdot \log(y_p)$ |
$\delta z = y_p - y$ |
Practical Workflow
- Data sanity: Visualize random samples after augmentation
- Forward caches: Store intermediates for backward pass
- Gradient checks: Compare manual gradients vs autograd
- Learning rate sweeps: Try {1e-1, 1e-2, 1e-3}
- Monitor metrics: Track train/val loss and accuracy
Common Pitfalls
| Symptom |
Cause |
Fix |
| Shape mismatch |
Forgot flip/pad |
Follow formulas exactly |
| Pooling gradients everywhere |
Lost argmax |
Cache switches in forward |
| Accuracy at random chance |
Wrong loss |
Use cross-entropy for classification |
| Train/val diverge |
Data leakage |
Separate augmentation for train/val |
| Exploding gradients |
Too large LR |
Gradient clipping, lower LR |
| Vanishing gradients |
Too many sigmoid layers |
Use ReLU, residual connections |
Debugging Patterns
Lightweight instrumentation:
- Log
loss, ||∇θ||₂, max|∂L/∂θ| per batch
- Plot running averages for train and validation
- Catch silent NaNs early
Autograd verification:
# Manual calculation
δW_manual = compute_gradient_manually()
# PyTorch autograd
loss.backward()
δW_auto = W.grad
# Compare
assert torch.allclose(δW_manual, δW_auto, atol=1e-5)
Next Steps
- Review Anki cards (tagged by topic)
- Re-implement Conv → ReLU → Pool example
- Verify gradients numerically before scaling to datasets
- Apply to Assignment 7 architecture
REFERENCES
- Slides-1-ML.pdf by Paulius Sasnauskas
- BackPropPractices.pdf by Nilanjan Ray
- Example_Solutions-Backprop.pdf (worked autograd checks)
- Andrej Karpathy's micrograd and backpropagation lecture
- Rich Sutton's The Bitter Lesson (2019)
- Google Machine Learning Crash Course
END OF LESSON
CMPUT 328 - VISUAL RECOGNITION
MACHINE LEARNING & BACKPROPAGATION
Happy learning and backpropagating!