Machine Learning & Backpropagation

1. INTRODUCTION TO MACHINE LEARNING

What is Machine Learning?

Machine Learning is a method of teaching computers to:

Learn from data - Extract patterns from examples
Generalize to unseen data - Apply learned patterns to new situations
Without explicit instructions - No need to manually program every rule

Core Philosophy

According to Rich Sutton (ACM A.M. Turing Award 2024 recipient):

"General methods that leverage computation and learning ultimately prove more effective than approaches relying on human knowledge and domain-specific expertise."

This is known as The Bitter Lesson - we should not assume models (e.g., linear or not), we should just have a general learning method leveraging computation (e.g., Neural Networks + Gradient Descent).

2. WHAT PROBLEMS CAN ML SOLVE?

Machine Learning excels when three conditions are met:

Many labeled examples - Abundant training data
Patterns in the data - Reusable regularities across samples
Complex relationships - Rules too intricate for manual programming

Examples

Image classification (cats vs dogs, CIFAR-10)
Object detection and segmentation
Natural language processing
Recommendation systems
Game playing (AlphaGo, Chess)

3. MODELS: SIMPLIFICATION OF REALITY

What is a Model?

Definition: A model is a simplification of a real process or system.

Usually we want to know what is the process that generates the data.

The Gravity Example

Given measurements of mass and force:

x (mass)	y (force)
2	19.6
4.5	44.1
10	98

Goal: Find the relationship $y = ax$

Guesses:

$a = 5$ → Poor fit
$a = 12$ → Poor fit
$a = 10$ → Better fit
$a = 9.8$ → Best fit!

We discovered: $F = 9.8m$ (gravitational constant $g \approx 9.8 \text{ m/s}^2$)

Model Notation

$y$ = true output (observed data)
$\hat{y}$ = predicted output (model prediction)
Goal: Make $\hat{y} \approx y$

Key Insight

It suffices to approximate reality. We don't need Einstein's relativity formula:

$$y = \left(1 + \frac{1}{1 - v^2/c^2} \cdot \frac{v \cdot v^T}{c^2}\right)\left(1 - v^2/c^2\right)^{-1/2} \cdot ax$$

where $c \approx 2.99 \times 10^8$ m/s

For most purposes, $\hat{y} = 9.8x$ is good enough!

4. LOSS FUNCTIONS: MEASURING GOODNESS

What is a Good Approximation?

Question: Is it possible to evaluate "goodness"?

Answer: Yes! Using a loss function.

L2 (Euclidean) Loss

$$L = \sum_{i} \left(y_i - \hat{y}(x_i)\right)^2$$

Interpretation: Distance from the data to the model.

If $L = 0$, then $y_i = \hat{y}(x_i)$ for all $i$ (perfect fit).

L2 Loss Example

Using our gravity data:

Guess	Formula	Loss
a = 8	$\hat{y} = 8x$	L = 402.57
a = 9	$\hat{y} = 9x$	L = 79.52
a = 9.8	$\hat{y} = 9.8x$	L = 4.97

Lower loss = better approximation!

Vector Form

For vector outputs:

$$L = \frac{1}{2} ||\hat{y} - y||^2$$

Gradient with respect to predictions:

$$\frac{\partial L}{\partial \hat{y}} = \hat{y} - y$$

5. GRADIENT DESCENT: FINDING THE MINIMUM

The Problem

Question: How do we find the value for parameter $a$ that minimizes loss?

Answer: Gradient descent!

How Derivatives Show Influence

The derivative $\frac{\partial L}{\partial a}$ tells us:

If positive: Increasing $a$ increases loss → decrease $a$
If negative: Increasing $a$ decreases loss → increase $a$
If zero: We're at a minimum (or maximum/saddle point)

The Gradient Descent Update Rule

$$a \leftarrow a - \eta \cdot \frac{\partial L}{\partial a}$$

where:

$\eta$ = learning rate (chosen by us)
$\frac{\partial L}{\partial a}$ = derivative of loss with respect to parameter

Example Iterations

Starting with $a = 8$:

Iteration	a	∂L/∂a	Update
0	8.00	-447	$a \leftarrow 8 - \eta(-447)$
1	9.34	-114	$a \leftarrow 9.34 - \eta(-114)$
2	9.68	-29	$a \leftarrow 9.68 - \eta(-29)$
3	9.77	-7.4	$a \leftarrow 9.77 - \eta(-7.4)$
4	9.79	-1.9	$a \leftarrow 9.79 - \eta(-1.9)$

Converges to $a \approx 9.8$!

Learning Rate Importance

Too large: Steps overshoot minimum, oscillate
Too small: Training is slow, may get stuck
Just right: Smooth convergence

Multiple Parameters

For multiple parameters (e.g., $y = ax + b$):

$$a \leftarrow a - \eta \cdot \frac{\partial L}{\partial a}$$ $$b \leftarrow b - \eta \cdot \frac{\partial L}{\partial b}$$

6. STOCHASTIC GRADIENT DESCENT

The Data Loading Problem

Challenge: size of dataset $\gg$ computer memory

Solution: Batching!

Three Flavors of Gradient Descent

Type	Update	When to Use
Full Batch	$\theta \leftarrow \theta - \eta \nabla_\theta L(\theta; \text{all data})$	Convex problems, tiny datasets
Stochastic (SGD)	$\theta \leftarrow \theta - \eta \nabla_\theta L(\theta; x_i)$	Huge datasets, escape sharp minima
Mini-Batch	$\theta \leftarrow \theta - \eta \nabla_\theta \left(\frac{1}{B} \sum L(x_b)\right)$	Practical default - balances GPU utilization and noise

Mini-Batch SGD Checklist

Shuffle samples each epoch for unbiased batches
Normalize inputs so magnitudes are comparable
Track both training and validation metrics to detect overfitting

7. CHOOSING MODELS

The Spectrum of Models

For fitting data, we could choose:

Linear model: $y = ax + b$
Quadratic model: $y = ax^2 + bx + c$
Cosine model: $y = a \cdot \cos(bx + c) + d$
Polynomial model: $y = \sum a_i x^i$

The Overfitting Problem

High-degree polynomial:

Passes through all training points ($L = 0$)
But fails on new data (poor generalization)
This is overfitting

For Complex Data (Images, Text, etc.)

Question: What model for CIFAR-10 images (cats, dogs, etc.)?

Answer: Neural Networks!

Why?

Don't assume linear/quadratic/etc.
Let the network learn the right representation
Just need:
- General learning method (backpropagation)
- Computation (GPUs)
- Data

8. NEURAL NETWORKS

Basic Architecture

Input (n features)
    ↓
Dense Layer: Z₁ = X·W₁ + b₁
    ↓
Activation: Z₂ = σ(Z₁)    [Non-linearity!]
    ↓
Dense Layer: Z₃ = Z₂·W₂ + b₂
    ↓
Output (m classes)

Why Non-linearities?

Without activation functions:

Linear → Linear → Linear ≡ Single Linear Layer

With activation functions (ReLU, sigmoid, etc.):

Can approximate any function (Universal Approximation Theorem)
Learn complex patterns
Hierarchical representations

Key Components

Weights (W): Learned parameters
Biases (b): Learned offsets
Activations (σ): Non-linear functions (ReLU, sigmoid, tanh)
Loss function (L): Measures error
Optimizer: Updates parameters (SGD, Adam, etc.)

9. BACKPROPAGATION: THE CHAIN RULE

Computational Graph

For a simple scalar neural network:

X → [·W₁] → Z₁ → [+b₁] → Z₂ → [σ] → Z₃ → [·W₂] → Z₄ → [+b₂] → Yₚ → [Loss] → L
                                                                    ↑
                                                                    Y (true)

Forward Pass

Compute outputs from inputs:

$Z_1 = X \cdot W_1$
$Z_2 = Z_1 + b_1$
$Z_3 = \sigma(Z_2)$
$Z_4 = Z_3 \cdot W_2$
$Y_p = Z_4 + b_2$
$L = \frac{1}{2}(Y_p - Y)^2$

Backward Pass (Backpropagation)

Apply chain rule right-to-left:

$$\frac{\partial L}{\partial Y_p} = Y_p - Y$$ $$\frac{\partial L}{\partial Z_4} = \frac{\partial L}{\partial Y_p} \cdot \frac{\partial Y_p}{\partial Z_4} = (Y_p - Y) \cdot 1$$ $$\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial Z_4} \cdot \frac{\partial Z_4}{\partial W_2} = (Y_p - Y) \cdot Z_3$$ $$\frac{\partial L}{\partial Z_3} = \frac{\partial L}{\partial Z_4} \cdot \frac{\partial Z_4}{\partial Z_3} = (Y_p - Y) \cdot W_2$$ $$\frac{\partial L}{\partial Z_2} = \frac{\partial L}{\partial Z_3} \cdot \frac{\partial Z_3}{\partial Z_2} = \frac{\partial L}{\partial Z_3} \cdot \sigma'(Z_2)$$ $$\frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial b_1} = \frac{\partial L}{\partial Z_2} \cdot 1$$ $$\frac{\partial L}{\partial Z_1} = \frac{\partial L}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial Z_1} = \frac{\partial L}{\partial Z_2} \cdot 1$$ $$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial Z_1} \cdot \frac{\partial Z_1}{\partial W_1} = \frac{\partial L}{\partial Z_1} \cdot X$$

Matrix/Vector Extension

For matrices and vectors, use:

Jacobians instead of derivatives
Matrix transposes for proper dimensionality

Key formulas:

$$\frac{\partial L}{\partial W_1} = X^T \cdot \delta Z_1$$ $$\frac{\partial L}{\partial W_2} = Z_3^T \cdot \delta Z_4$$ $$\frac{\partial L}{\partial b} = \sum_k (\delta Z)_k \quad \text{[sum over batch]}$$

10. LOSS FUNCTION DERIVATIVES

Euclidean (MSE) Loss

Forward:

$$L = \frac{1}{2} ||Y_p - Y||^2 = \frac{1}{2} \sum_i (Y_{pi} - Y_i)^2$$

Backward:

$$\frac{\partial L}{\partial Y_{pi}} = Y_{pi} - Y_i$$

Vector notation:

$$\delta Y_p = Y_p - Y$$

Softmax + Cross-Entropy Loss

Forward:

Logits → Softmax → Probabilities → Cross-Entropy → Loss

$$y_{pi} = \frac{\exp(\text{logits}_i)}{\sum_j \exp(\text{logits}_j)}$$ $$\text{Loss} = -\sum_k y_k \cdot \log(y_{pk})$$

Backward (the beautiful simplification!):

$$\frac{\partial \text{Loss}}{\partial \text{logits}_i} = y_{pi} - y_i$$

Softmax + Cross-Entropy has elegant gradient: just the difference between predicted and true probabilities!

Comparison: L2 vs Cross-Entropy for Classification

Property	L2 Loss	Cross-Entropy
Symmetry	Symmetric	Asymmetric
Gradient behavior	Shrinks when saturated	Well-conditioned
Use case	Regression	Classification
Confidence penalty	Weak	Strong for wrong predictions

Sanity check: If you used L2 loss after softmax, gradients would shrink dramatically when probabilities saturate, slowing learning.

11. BACKPROPAGATION FOR CONVOLUTIONAL LAYERS

Setup

Input: $I$ ($H \times W$ matrix)
Kernel: $K$ ($h \times w$ matrix)
Output: $J = I * K$ (convolution, size $(H-h+1) \times (W-w+1)$)

Assume: $H \geq h$ and $W \geq w$

Two Questions

Given $\frac{\partial L}{\partial J}$ (upstream gradient), find:

$\frac{\partial L}{\partial K}$ - To update kernel by gradient descent
$\frac{\partial L}{\partial I}$ - To backprop to previous layer

Derivation of ∂L/∂K

Forward:

$$J(i,j) = \sum_{l} \sum_{m} I(i+l-1, j+m-1) \cdot K(l,m)$$

Backward:

$$\frac{\partial J(i,j)}{\partial K(p,q)} = I(i+p-1, j+q-1)$$ $$\delta K(p,q) = \sum_i \sum_j I(i+p-1, j+q-1) \cdot \delta J(i,j)$$

Result (correlation):

$$\delta K = I \circledast \delta J$$

where $\circledast$ is correlation (slide input patches under upstream gradient and accumulate).

Derivation of ∂L/∂I

Backward:

$$\frac{\partial J(i,j)}{\partial I(p,q)} = \begin{cases} K(p-i+1, q-j+1), & \text{if } 0 \leq p-i \leq h-1 \text{ and } 0 \leq q-j \leq w-1 \\ 0, & \text{otherwise} \end{cases}$$ $$\delta I(p,q) = \sum_i \sum_j K(p-i+1, q-j+1) \cdot \delta J(i,j)$$

Result (full convolution with flipped kernel):

$$\delta I = \text{pad}(\delta J) * \text{flip}(K)$$

where:

pad: adds $(h-1)$ zero rows top/bottom and $(w-1)$ zero columns left/right
flip: rotates kernel 180°

Example of flip:

K = [1 3 5]    flip(K) = [6 4 2]
    [2 4 6]               [5 3 1]

Implementation Tips

Cache input patches or use im2col to vectorize
Reuse same stride/padding from forward pass
Mismatched shapes are the #1 error source!

12. BACKPROPAGATION FOR MAX POOLING

Setup

$$\text{Max Pool: } [x_1, x_2, \ldots, x_n] \rightarrow y = \max(x_i)$$

Key Idea

In case of a tie, choose a single index deterministically:

$$i^* = \arg\max(x_k)$$

Backward Pass

$$\frac{\partial y}{\partial x_i} = \begin{cases} 1, & \text{if } i = \arg\max(x_k) \\ 0, & \text{otherwise} \end{cases}$$ $$\delta x_i = \begin{cases} \delta y, & \text{if } i = \arg\max(x_k) \\ 0, & \text{otherwise} \end{cases}$$

Route the gradient only to the index that held the maximum!

Example: 2×2 MaxPool, stride 2

Forward:

x = [2  -3]    MaxPool    y = [5  6]
    [4   5]      →            [0  8]
    [6   5]

    [6  -7]
    [0  -1]

    [-2 -6]
    [3  -4]

    [6   8]

Backward:

δy = [δy₁  δy₂]
     [δy₃  δy₄]

δx = [0    0  ]
     [0   δy₁]
     [δy₃  0  ]

     [δy₂  0  ]
     [0    0  ]

     [0    0  ]
     [0    0  ]

     [0   δy₄]

Implementation

Pseudo-code:

# Forward pass: store argmax indices
for (i, j), pooling_region in enumerate(regions):
    argmax_cache[i, j] = argmax(pooling_region)
    output[i, j] = max(pooling_region)

# Backward pass: route gradient
for (i, j), grad in enumerate(dL_dOutput):
    (ii, jj) = argmax_cache[i, j]
    dL_dInput[ii, jj] += grad

Never recompute argmax from mutated tensors - always cache!

13. BACKPROPAGATION FOR ReLU

Setup

$$\text{ReLU: } y = \max(0, x)$$

Backward Pass

$$\frac{\partial y}{\partial x} = \begin{cases} 1, & \text{if } x > 0 \\ 0, & \text{otherwise} \end{cases}$$ $$\delta x = \delta y \odot \mathbb{1}[x > 0]$$

where $\odot$ is element-wise multiplication and $\mathbb{1}[\cdot]$ is indicator function.

Example

Forward:

x = [-5  4]    ReLU    y = [0  4]
    [ 0  2]      →        [0  2]

Backward:

δy = [δy₁  δy₂]    →    δx = [0     δy₂]
     [δy₃  δy₄]              [0     δy₄]

Implementation

Keep the boolean mask from forward pass:

# Forward
mask = (x > 0)
y = x * mask

# Backward
dx = dy * mask

Do NOT recompute mask with mutated x!

14. PUTTING IT ALL TOGETHER

Example Network

Conv → ReLU → MaxPool → Loss
 ↓      ↓       ↓        ↓
 y      z       p        L
 ↑      ↑       ↑
 x      y       z

Training Algorithm

Initialize: Parameter $K$ (convolution kernel)

Iterate:

Step 1: Forward Pass

1a. Randomly choose training example $x$ and ideal output $q$
1b. Pass $x$ through Conv to get $y$
1c. Pass $y$ through ReLU to get $z$
1d. Pass $z$ through MaxPool to get $p$

Step 2: Compute Loss

$$L = \frac{1}{2} ||p - q||^2$$

Step 3: Backward Pass (Backpropagation)

3a. Compute $\delta p = \frac{\partial L}{\partial p} = p - q$
3b. Compute $\delta z$ given $\delta p$ (MaxPool backprop)
3c. Compute $\delta y$ given $\delta z$ (ReLU backprop)
3d. Compute $\delta K$ given $\delta y$ (Conv backprop)

Step 4: Update Parameters

$$K \leftarrow K - \eta \cdot \delta K$$

We don't compute $\delta x$ because there's no layer before Conv in this example.

15. ADVANCED TOPICS

Example 1: Conv → ReLU → MaxPool with Numbers

Given:

I = [1  3  2  4  6  4]
    [4  8  3  1  0  2]
    [2  1  4  3  9  1]
    [4  7  2  3  9  2]

W = [-1  0  1]    (edge detection kernel)
    [-1  0  1]
    [-1  0  1]

Task: Compute $J$, $K$, $O$ and backpropagation gradients.

See Practice_Problem.ipynb and verify with PyTorch autograd!

Example 2: Shared Parameters

Network:

$$Y = \text{ReLU}(X \cdot W) \cdot W$$

Loss: L2 between Y and X

Note: Same parameter matrix $W$ appears twice - shared weights!

Gradient:

$$\delta W = Z_2^T \cdot \delta Y + X^T \cdot \delta Z_1$$

Add gradients from both computational nodes where W appears.

Residual Connections

Network:

$$Y = X + F(X)$$

Backward:

$$\delta X = \delta Y + \delta F$$

Key insight: Gradient flows through both identity and function paths.

Transformer Backpropagation

Complex architecture with:

Multi-head attention
Layer normalization
Residual connections
Feed-forward networks

Same chain rule, but more complex computational graph!

16. SUMMARY & KEY TAKEAWAYS

The Big Picture

Data → Model → Loss → Optimization
         ↑                 |
         |---- Backprop ----

Core Concepts

Models approximate reality
- Simplification of unknown data-generating process
- Goal: generalize to unseen data
Loss functions quantify error
- L2 loss for regression
- Cross-entropy for classification
Gradient descent minimizes loss
- Update rule: $\theta \leftarrow \theta - \eta \cdot \nabla L$
- Mini-batch SGD is practical default
Neural networks are universal approximators
- Don't assume model form
- Let network learn representation
- Need: data, computation, backpropagation
Backpropagation is chain rule
- Forward: compute outputs
- Backward: compute gradients right-to-left
- Update: gradient descent

Layer-Specific Backprop Recipes

Layer	Forward	Backward
Linear	$Z = X \cdot W + b$	$\delta W = X^T \cdot \delta Z$, $\delta X = \delta Z \cdot W^T$
ReLU	$Y = \max(0, X)$	$\delta X = \delta Y \odot \mathbb{1}[X > 0]$
Conv	$J = I * K$	$\delta K = I \circledast \delta J$, $\delta I = \text{pad}(\delta J) * \text{flip}(K)$
MaxPool	$Y = \max(X)$	Route $\delta Y$ to argmax location
Softmax+CE	$y_p = \text{softmax}(z)$, $L = -\sum y \cdot \log(y_p)$	$\delta z = y_p - y$

Practical Workflow

Data sanity: Visualize random samples after augmentation
Forward caches: Store intermediates for backward pass
Gradient checks: Compare manual gradients vs autograd
Learning rate sweeps: Try {1e-1, 1e-2, 1e-3}
Monitor metrics: Track train/val loss and accuracy

Common Pitfalls

Symptom	Cause	Fix
Shape mismatch	Forgot flip/pad	Follow formulas exactly
Pooling gradients everywhere	Lost argmax	Cache switches in forward
Accuracy at random chance	Wrong loss	Use cross-entropy for classification
Train/val diverge	Data leakage	Separate augmentation for train/val
Exploding gradients	Too large LR	Gradient clipping, lower LR
Vanishing gradients	Too many sigmoid layers	Use ReLU, residual connections

Debugging Patterns

Lightweight instrumentation:

Log loss, ||∇θ||₂, max|∂L/∂θ| per batch
Plot running averages for train and validation
Catch silent NaNs early

Autograd verification:

# Manual calculation
δW_manual = compute_gradient_manually()

# PyTorch autograd
loss.backward()
δW_auto = W.grad

# Compare
assert torch.allclose(δW_manual, δW_auto, atol=1e-5)

Next Steps

Review Anki cards (tagged by topic)
Re-implement Conv → ReLU → Pool example
Verify gradients numerically before scaling to datasets
Apply to Assignment 7 architecture

REFERENCES

Slides-1-ML.pdf by Paulius Sasnauskas
BackPropPractices.pdf by Nilanjan Ray
Example_Solutions-Backprop.pdf (worked autograd checks)
Andrej Karpathy's micrograd and backpropagation lecture
Rich Sutton's The Bitter Lesson (2019)
Google Machine Learning Crash Course

MACHINE LEARNING & BACKPROPAGATION

TABLE OF CONTENTS

1. INTRODUCTION TO MACHINE LEARNING

What is Machine Learning?

Core Philosophy

2. WHAT PROBLEMS CAN ML SOLVE?

Examples

3. MODELS: SIMPLIFICATION OF REALITY

What is a Model?

The Gravity Example

Model Notation

Key Insight

4. LOSS FUNCTIONS: MEASURING GOODNESS

What is a Good Approximation?

L2 (Euclidean) Loss

L2 Loss Example

Vector Form

5. GRADIENT DESCENT: FINDING THE MINIMUM

The Problem

How Derivatives Show Influence

The Gradient Descent Update Rule

Example Iterations

Learning Rate Importance

Multiple Parameters

6. STOCHASTIC GRADIENT DESCENT

The Data Loading Problem

Three Flavors of Gradient Descent

Mini-Batch SGD Checklist

7. CHOOSING MODELS

The Spectrum of Models

The Overfitting Problem

For Complex Data (Images, Text, etc.)

8. NEURAL NETWORKS

Basic Architecture

Why Non-linearities?

Key Components

9. BACKPROPAGATION: THE CHAIN RULE

Computational Graph

Forward Pass

Backward Pass (Backpropagation)

Matrix/Vector Extension

10. LOSS FUNCTION DERIVATIVES

Euclidean (MSE) Loss

Softmax + Cross-Entropy Loss

Comparison: L2 vs Cross-Entropy for Classification

11. BACKPROPAGATION FOR CONVOLUTIONAL LAYERS

Setup

Two Questions

Derivation of ∂L/∂K

Derivation of ∂L/∂I

Implementation Tips

12. BACKPROPAGATION FOR MAX POOLING

Setup

Key Idea

Backward Pass

Example: 2×2 MaxPool, stride 2

Implementation

13. BACKPROPAGATION FOR ReLU

Setup

Backward Pass

Example

Implementation

14. PUTTING IT ALL TOGETHER

Example Network

Training Algorithm

Step 1: Forward Pass

Step 2: Compute Loss

Step 3: Backward Pass (Backpropagation)

Step 4: Update Parameters

15. ADVANCED TOPICS

Example 1: Conv → ReLU → MaxPool with Numbers

Example 2: Shared Parameters

Residual Connections

Transformer Backpropagation

16. SUMMARY & KEY TAKEAWAYS

The Big Picture

Core Concepts

Layer-Specific Backprop Recipes

Practical Workflow

Common Pitfalls