FULLY CONNECTED NEURAL NETWORKS FOR CIFAR-10 CLASSIFICATION
CMPUT 328 - ASSIGNMENT 2 STUDY GUIDE
A Feedforward Neural Network (FFNN), also called a Fully Connected Neural Network, is the simplest type of artificial neural network where information flows in one direction: from input to output.
Neural networks can learn hierarchical representations of data:
For a simple FFNN with one hidden layer:
h = f(W₁x + b₁)
y = g(W₂h + b₂)
Where:
x is the input vector
W₁, b₁ are the weights and biases for the hidden layer
f is a non-linear activation function
h is the hidden layer output
W₂, b₂ are the weights and biases for the output layer
g is the output activation function
y is the final prediction
CIFAR-10 is a benchmark dataset for image classification consisting of 60,000 32×32 color images in 10 classes.
Total images: 60,000
Training images: 50,000
Test images: 10,000
Image size: 32 × 32 × 3 (RGB)
Number of classes: 10
Images per class: 6,000
The 10 classes are:
0: airplane
1: automobile
2: bird
3: cat
4: deer
5: dog
6: frog
7: horse
8: ship
9: truck
For proper evaluation, split the data:
Training set: 45,000 images (90% of 50,000)
Validation set: 5,000 images (10% of 50,000)
Test set: 10,000 images (held out)
CIFAR-10 normalization values (empirically computed):
mean = (0.4914, 0.4822, 0.4465) # RGB channels
std = (0.2470, 0.2435, 0.2616) # RGB channels
Normalization formula (applied per channel):
x_norm = (x - mean) / std
This standardizes inputs to have mean ≈ 0 and std ≈ 1, which helps keep training stable.
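As a quick check of the normalization step, the sketch below applies it by hand to a single RGB pixel using the CIFAR-10 statistics quoted above. The helper name `normalize_pixel` is made up for illustration; in practice `transforms.Normalize` does this for whole image tensors.

```python
# Per-channel statistics quoted above (pixels already scaled to [0, 1]).
MEAN = (0.4914, 0.4822, 0.4465)
STD = (0.2470, 0.2435, 0.2616)

def normalize_pixel(rgb):
    """Apply x_norm = (x - mean) / std to each channel of one pixel."""
    return tuple((x - m) / s for x, m, s in zip(rgb, MEAN, STD))

# A pixel equal to the dataset mean maps to zero in every channel.
print(normalize_pixel(MEAN))
# A pure white pixel lands roughly two standard deviations above the mean.
print(normalize_pixel((1.0, 1.0, 1.0)))
```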
import torch  # needed for torch.Generator below
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, random_split
# Define transforms
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465),
(0.2470, 0.2435, 0.2616))
])
# Load datasets
train_val_ds = datasets.CIFAR10(root='./data', train=True,
download=True, transform=transform)
test_ds = datasets.CIFAR10(root='./data', train=False,
download=True, transform=transform)
# Split train into train/val
train_ds, val_ds = random_split(train_val_ds, [45000, 5000],
generator=torch.Generator().manual_seed(42))
# Create data loaders
train_loader = DataLoader(train_ds, batch_size=256, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=256, shuffle=False)
test_loader = DataLoader(test_ds, batch_size=256, shuffle=False)
A typical FFNN for CIFAR-10:
Input Layer: 3072 neurons (32×32×3 flattened)
Hidden Layer 1: 512 neurons + ReLU
Hidden Layer 2: 256 neurons + ReLU
Output Layer: 10 neurons (one per class)
┌──────────────┐
│ Input (3072) │
└──────┬───────┘
       │ W₁ (3072×512)
       ↓
┌──────────────┐
│ Hidden (512) │ + ReLU
└──────┬───────┘
       │ W₂ (512×256)
       ↓
┌──────────────┐
│ Hidden (256) │ + ReLU
└──────┬───────┘
       │ W₃ (256×10)
       ↓
┌──────────────┐
│ Output (10)  │ + Softmax
└──────────────┘
FFNNs require 1D input vectors:
Original shape: (32, 32, 3)
Flattened shape: (3072,)
Calculation: 32 × 32 × 3 = 3072
Note: This destroys spatial structure, which is why CNNs (Assignment 3) work better for images!
For the architecture above:
Layer 1: (3072 × 512) + 512 = 1,573,376
Layer 2: (512 × 256) + 256 = 131,328
Layer 3: (256 × 10) + 10 = 2,570
Total parameters: 1,707,274
Formula: For a layer with n_in inputs and n_out outputs:
parameters = (n_in × n_out) + n_out
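The per-layer count can be checked with a few lines of plain Python. This is a small sketch; `count_parameters` is an illustrative helper, not part of PyTorch.

```python
def count_parameters(layer_sizes):
    """Sum (n_in * n_out + n_out) over consecutive pairs of layer sizes."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weights + biases
    return total

# The 3072 -> 512 -> 256 -> 10 architecture from this section.
print(count_parameters([3072, 512, 256, 10]))  # 1707274
```

The result should agree with `sum(p.numel() for p in model.parameters())` for a PyTorch model built from `nn.Linear` layers of the same sizes.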
For a batch of B images, each with D features:
Input X: B × D matrix
Weights W: D × H matrix
Bias b: H vector (broadcasted)
Output Z: B × H matrix
Forward pass equation:
Z = XW + b
Where X, W, b, and Z have the shapes listed above. For example:
# Input: batch of 256 images, each 3072 pixels
X = torch.randn(256, 3072) # 256×3072
# Layer 1 weights
W1 = torch.randn(3072, 512) # 3072×512
b1 = torch.randn(512) # 512
# Forward pass
Z1 = X @ W1 + b1 # (256×3072) @ (3072×512) = 256×512
The full forward pass:
X (256×3072)
↓ × W₁
Z₁ (256×512)
↓ + b₁
Z₂ (256×512)
↓ ReLU
H₁ (256×512)
↓ × W₂
Z₃ (256×256)
↓ + b₂
Z₄ (256×256)
↓ ReLU
H₂ (256×256)
↓ × W₃
Z₅ (256×10)
↓ + b₃
Z₆ (256×10)
↓ Softmax
Ŷ (256×10)
Without non-linearity, stacking layers is pointless:
Layer 1: h₁ = W₁x + b₁
Layer 2: y = W₂h₁ + b₂
= W₂(W₁x + b₁) + b₂
= (W₂W₁)x + (W₂b₁ + b₂)
= W'x + b' ← Still linear!
Non-linear activations allow networks to learn complex patterns.
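The collapse of two linear layers into one can be verified numerically. The sketch below uses small hand-picked matrices (the values are arbitrary) and plain Python lists so that every step is visible.

```python
def matvec(W, x):
    """Matrix-vector product for nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for nested lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Arbitrary example weights: layer 1 maps 2 -> 3, layer 2 maps 3 -> 2.
W1, b1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]], [0.5, -0.5, 1.0]
W2, b2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]], [0.0, 2.0]
x = [1.0, -2.0]

# Two stacked linear layers with no activation in between.
two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The equivalent single layer: W' = W2 W1, b' = W2 b1 + b2.
W_prime = matmul(W2, W1)
b_prime = add(matvec(W2, b1), b2)
one_layer = add(matvec(W_prime, x), b_prime)

print(two_layers, one_layer)  # both are [1.5, 6.0]
```

Inserting a ReLU between the two layers breaks this equivalence, which is exactly what gives depth its extra expressive power.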
ReLU (Rectified Linear Unit) is the most common activation function for hidden layers.
Formula:
ReLU(x) = max(0, x)
Derivative:
ReLU'(x) = 1 if x > 0, else 0
PyTorch implementation:
import torch.nn as nn
import torch.nn.functional as F
# Option 1: Functional
output = F.relu(input)
# Option 2: Module
relu = nn.ReLU()
output = relu(input)
Rarely used in hidden layers, sometimes for output.
Formula:
Derivative:
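Softmax is used for multi-class classification. It turns a vector of logits z into probabilities that sum to 1.
Formula:
softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)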
PyTorch implementation:
# Softmax is typically combined with CrossEntropyLoss
# Don't apply softmax manually before nn.CrossEntropyLoss!
logits = model(x) # Raw scores
loss = nn.CrossEntropyLoss()(logits, targets)
# For inference only:
probs = F.softmax(logits, dim=1)
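To see what softmax computes, here is a minimal pure-Python sketch. Subtracting the maximum logit before exponentiating is a common stability trick assumed here (it does not change the result); the function name and example logits are illustrative only.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)                       # avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# Very large logits would overflow a naive exp(); the shifted version is fine.
print(softmax([1000.0, 1000.0]))
```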
Cross-entropy is the standard loss for classification.
Formula:
L = -Σᵢ yᵢ log(ŷᵢ)
Where:
y is the true label (one-hot encoded)
ŷ is the predicted probability
For a single correct class c:
L = -log(ŷ_c)
Beautiful property: The gradient simplifies!
∂L/∂z = ŷ - y
Where z is the logits (pre-softmax scores).
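The simplified gradient (predicted probabilities minus the one-hot label) can be sanity-checked against finite differences. The sketch below is plain Python with illustrative function names and arbitrary example logits.

```python
import math

def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, c):
    """Loss for logits z when the correct class index is c."""
    return -math.log(softmax(z)[c])

def analytic_grad(z, c):
    """Gradient w.r.t. the logits: softmax(z) minus the one-hot label."""
    return [p - (1.0 if i == c else 0.0) for i, p in enumerate(softmax(z))]

def numeric_grad(z, c, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grad = []
    for i in range(len(z)):
        up, down = z[:], z[:]
        up[i] += eps
        down[i] -= eps
        grad.append((cross_entropy(up, c) - cross_entropy(down, c)) / (2 * eps))
    return grad

logits, target = [1.5, -0.3, 0.8], 2
print(analytic_grad(logits, target))
print(numeric_grad(logits, target))
```

The two printed vectors should agree to several decimal places.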
Not recommended for classification, but useful to understand.
Formula:
Gradient:
Why not for classification?
# For classification (includes softmax internally)
criterion = nn.CrossEntropyLoss()
# Model outputs logits (raw scores), NOT probabilities!
logits = model(images) # Shape: (batch_size, 10)
targets = labels # Shape: (batch_size,) with values 0-9
loss = criterion(logits, targets)
Backpropagation is an algorithm to compute gradients of the loss with respect to all parameters using the chain rule.
Goal: Compute ∂L/∂W and ∂L/∂b for all weights and biases.
For composite functions:
If z = f(g(x)), then:
dz/dx = (dz/dg) × (dg/dx)
For neural networks with many layers:
∂L/∂W₁ = (∂L/∂Z₆) × (∂Z₆/∂Z₅) × ... × (∂Z₂/∂W₁)
Forward pass (left to right):
X → Z₁ → H₁ → Z₂ → H₂ → ... → Ŷ → L
Backward pass (right to left):
∂L/∂L ← ∂L/∂Ŷ ← ... ← ∂L/∂H₂ ← ∂L/∂Z₂ ← ∂L/∂H₁ ← ∂L/∂Z₁ ← ∂L/∂X
│ │ │ │
└─ ∂L/∂W₃ └─ ∂L/∂W₂ └─ ∂L/∂W₁ └─ (not needed)
For a layer: Z = f(XW + b)
Given: ∂L/∂Z (gradient from next layer)
Compute:
∂L/∂X = [f'(XW + b) ⊙ ∂L/∂Z] Wᵀ
∂L/∂W = Xᵀ [f'(XW + b) ⊙ ∂L/∂Z]
∂L/∂b = Σᵢ [f'(XW + b) ⊙ ∂L/∂Z]ᵢ
Where:
⊙ denotes element-wise multiplication
Wᵀ is the transpose of W
Xᵀ is the transpose of X
Σᵢ sums over the batch dimension
Forward:
Z = ReLU(X) = max(0, X)
Backward:
∂L/∂X = ∂L/∂Z ⊙ ReLU'(X)
= ∂L/∂Z ⊙ (X > 0) # Mask: 1 where X > 0, else 0
Forward:
Z = XW + b
Backward:
∂L/∂X = (∂L/∂Z) Wᵀ
∂L/∂W = Xᵀ (∂L/∂Z)
∂L/∂b = Σᵢ (∂L/∂Z)ᵢ
For the 3-layer FFNN:
# Forward pass
Z1 = X @ W1 + b1
H1 = relu(Z1)
Z2 = H1 @ W2 + b2
H2 = relu(Z2)
Z3 = H2 @ W3 + b3
Y_hat = softmax(Z3)
L = cross_entropy(Y_hat, Y)
# Backward pass
dZ3 = Y_hat - Y # Softmax + CrossEntropy gradient
dW3 = H2.T @ dZ3
db3 = dZ3.sum(dim=0)
dH2 = dZ3 @ W3.T
dZ2 = dH2 * (Z2 > 0) # ReLU gradient
dW2 = H1.T @ dZ2
db2 = dZ2.sum(dim=0)
dH1 = dZ2 @ W2.T
dZ1 = dH1 * (Z1 > 0) # ReLU gradient
dW1 = X.T @ dZ1
db1 = dZ1.sum(dim=0)
Note: PyTorch does this automatically with loss.backward()!
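One way to convince yourself that the manual formulas match autograd is to run both on a tiny network. The sketch below assumes a summed (not averaged) cross-entropy, because `dZ3 = Y_hat - Y` corresponds to `reduction="sum"`; with the default mean reduction every gradient would be divided by the batch size. Layer sizes are arbitrary and chosen small for speed.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, D, H1_DIM, H2_DIM, C = 4, 6, 5, 3, 2  # tiny illustrative sizes

X = torch.randn(B, D, dtype=torch.float64)
labels = torch.randint(0, C, (B,))
Y = F.one_hot(labels, C).to(torch.float64)

def make(*shape):
    return torch.randn(*shape, dtype=torch.float64, requires_grad=True)

W1, b1 = make(D, H1_DIM), make(H1_DIM)
W2, b2 = make(H1_DIM, H2_DIM), make(H2_DIM)
W3, b3 = make(H2_DIM, C), make(C)

# Forward pass, then let autograd compute the reference gradients.
Z1 = X @ W1 + b1
H1 = torch.relu(Z1)
Z2 = H1 @ W2 + b2
H2 = torch.relu(Z2)
Z3 = H2 @ W3 + b3
loss = F.cross_entropy(Z3, labels, reduction="sum")
loss.backward()

# Manual backward pass, following the formulas above.
with torch.no_grad():
    dZ3 = torch.softmax(Z3, dim=1) - Y
    dW3, db3 = H2.T @ dZ3, dZ3.sum(dim=0)
    dZ2 = (dZ3 @ W3.T) * (Z2 > 0)
    dW2, db2 = H1.T @ dZ2, dZ2.sum(dim=0)
    dZ1 = (dZ2 @ W2.T) * (Z1 > 0)
    dW1, db1 = X.T @ dZ1, dZ1.sum(dim=0)

manual = [dW1, db1, dW2, db2, dW3, db3]
auto = [p.grad for p in (W1, b1, W2, b2, W3, b3)]
print(all(torch.allclose(m, a) for m, a in zip(manual, auto)))
```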
Goal: Minimize loss L(θ) by adjusting parameters θ.
Key idea: Move in the direction opposite to the gradient.
θ ← θ - α∇L(θ)
Where:
α is the learning rate
∇L(θ) is the gradient of loss with respect to parameters
The gradient ∇L(θ) points in the direction of steepest increase of L.
Therefore, -∇L(θ) points in the direction of steepest decrease.
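Batch gradient descent: compute the gradient using the entire dataset.
Stochastic gradient descent (SGD): compute the gradient using one random example.
Mini-batch gradient descent: compute the gradient using a small batch. This gives the best of both worlds: cheaper steps than full-batch updates and less noise than single-example updates.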
Typical batch sizes: 32, 64, 128, 256, 512
Learning rate too high: loss oscillates or explodes
Learning rate too low: takes forever to converge
Learning rate well chosen: steady decrease to minimum
Typical values: 0.001, 0.0001, 0.01
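The effect of the learning rate is easy to reproduce on a toy problem. The sketch below runs plain gradient descent on L(θ) = θ², an assumed stand-in for a real loss surface, with three different rates; the specific rates are chosen only to make each regime visible.

```python
def run_gradient_descent(lr, steps=50, theta=1.0):
    """Minimize L(theta) = theta**2, whose gradient is 2 * theta."""
    for _ in range(steps):
        grad = 2.0 * theta
        theta = theta - lr * grad  # move against the gradient
    return theta

for lr in (1.1, 0.001, 0.1):
    print(f"lr={lr}: theta after 50 steps = {run_gradient_descent(lr):.6g}")
```

With the largest rate the iterate flips sign and grows every step, with the smallest it has barely moved from its starting point, and with the middle one it is very close to the minimum at 0.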
Adaptive Moment Estimation - most popular optimizer.
Formula (simplified):
m_t = β₁ m_{t-1} + (1-β₁) g_t # First moment (momentum)
v_t = β₂ v_{t-1} + (1-β₂) g_t² # Second moment (variance)
θ_t = θ_{t-1} - α m_t / (√v_t + ε) # Update
Default hyperparameters:
lr = 0.001
betas = (0.9, 0.999)
eps = 1e-8
PyTorch implementation:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
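To make the simplified update concrete, here is a scalar sketch of one Adam step in plain Python. It follows the three simplified equations above literally; note that `torch.optim.Adam` additionally applies bias correction to m_t and v_t, which this sketch omits. The function name `adam_step` and the toy loss θ² are illustrative.

```python
import math

def adam_step(theta, grad, m, v, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified Adam update (no bias correction) for a scalar parameter."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment
    theta = theta - lr * m / (math.sqrt(v) + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
for step in range(3):
    grad = 2.0 * theta  # gradient of theta**2
    theta, m, v = adam_step(theta, grad, m, v)
    print(step, theta)
```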
| Algorithm | Characteristics |
|---|---|
| SGD | Simple, well-understood. Requires careful LR tuning. Can escape sharp minima. |
| Adam | Adapts LR automatically. Works well with defaults. Faster convergence. More memory. |
for epoch in range(num_epochs):
    # Training phase
    model.train()
    for images, labels in train_loader:
        # 1. Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        # 2. Backward pass
        optimizer.zero_grad()
        loss.backward()
        # 3. Update weights
        optimizer.step()
    # Validation phase
    model.eval()
    with torch.no_grad():
        for images, labels in val_loader:
            outputs = model(images)
            # Compute validation metrics
model.train() # Enable dropout, batch norm training mode
model.eval() # Disable dropout, batch norm inference mode
PyTorch accumulates gradients. Must zero them each iteration!
optimizer.zero_grad() # Clear previous gradients
loss.backward() # Compute new gradients
optimizer.step() # Update weights
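The accumulation behaviour is easy to observe directly on a single scalar parameter. This is a minimal sketch; the toy function 3w is chosen only because its gradient is obviously 3.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(3.0 * w).backward()
first = w.grad.item()        # d(3w)/dw = 3

(3.0 * w).backward()         # no zeroing in between
accumulated = w.grad.item()  # 3 + 3 = 6, the old gradient was kept

w.grad.zero_()               # what optimizer.zero_grad() does for each parameter
(3.0 * w).backward()
after_zero = w.grad.item()   # back to 3

print(first, accumulated, after_zero)  # 3.0 6.0 3.0
```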
Don't compute gradients during validation (saves memory):
with torch.no_grad():
    outputs = model(images) # No gradient tracking
Problem: Model may overfit if trained too long.
Solution: Stop when validation performance stops improving.
Implementation:
best_val_acc = 0
patience = 5
epochs_without_improvement = 0
for epoch in range(max_epochs):
    train(...)
    val_acc = validate(...)
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        epochs_without_improvement = 0
        # Save best model (the file name is only an example)
        torch.save(model.state_dict(), "best_model.pt")
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print("Early stopping!")
            break
# Restore best model
model.load_state_dict(torch.load("best_model.pt"))
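The patience logic can be traced without training anything. The sketch below replays a made-up sequence of validation accuracies through the same rule; the function name and the numbers are purely illustrative.

```python
def early_stopping_epoch(val_accs, patience):
    """Return (best_epoch, stop_epoch) for a recorded list of validation accuracies."""
    best_acc, best_epoch, waited = 0.0, None, 0
    for epoch, acc in enumerate(val_accs):
        if acc > best_acc:
            best_acc, best_epoch, waited = acc, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch  # stopped early
    return best_epoch, len(val_accs) - 1  # ran to the end

# Hypothetical accuracies: the peak at epoch 2 is followed by three epochs without improvement.
accs = [0.40, 0.45, 0.50, 0.49, 0.50, 0.48, 0.47, 0.46, 0.55]
print(early_stopping_epoch(accs, patience=3))  # (2, 5)
```

Note that with `patience=3` the later jump to 0.55 is never reached, which is the usual trade-off: a smaller patience saves time but can stop before a late improvement.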
Gradually decrease learning rate during training.
Common schedules:
# Step decay: multiply by 0.1 every 30 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
step_size=30,
gamma=0.1)
# Cosine annealing: smooth decrease
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
T_max=100)
# Reduce on plateau: decrease when validation stops improving
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
patience=5)
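Usage:
for epoch in range(num_epochs):
    train(...)
    validate(...)
    scheduler.step() # Update learning rate (StepLR, CosineAnnealingLR)
    # ReduceLROnPlateau needs the monitored metric instead: scheduler.step(val_loss)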
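For intuition about step decay, the learning rate at any epoch follows a simple closed form. The sketch below computes it in plain Python instead of going through the scheduler; `step_decay_lr` is an illustrative helper, and the initial rate of 0.001 is just the Adam default used elsewhere in this guide.

```python
def step_decay_lr(initial_lr, epoch, step_size=30, gamma=0.1):
    """Learning rate after `epoch` epochs when decaying by `gamma` every `step_size` epochs."""
    return initial_lr * gamma ** (epoch // step_size)

for epoch in (0, 29, 30, 59, 60):
    print(epoch, step_decay_lr(0.001, epoch))
```

The rate stays constant within each 30-epoch window and drops by a factor of 10 at epochs 30 and 60.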
Track these metrics:
history = {
'train_loss': [],
'train_acc': [],
'val_loss': [],
'val_acc': []
}
# Each epoch
history['train_loss'].append(train_loss)
history['train_acc'].append(train_acc)
history['val_loss'].append(val_loss)
history['val_acc'].append(val_acc)
Most intuitive metric for classification.
Formula:
Accuracy = (number of correct predictions) / (total number of predictions)
PyTorch implementation:
def compute_accuracy(model, dataloader):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in dataloader:
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return correct / total
Compute accuracy for each class separately.
def per_class_accuracy(model, dataloader, num_classes=10):
    model.eval()
    class_correct = [0] * num_classes
    class_total = [0] * num_classes
    with torch.no_grad():
        for images, labels in dataloader:
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            for i in range(len(labels)):
                label = labels[i]
                class_correct[label] += (predicted[i] == label).item()
                class_total[label] += 1
    return [class_correct[i] / class_total[i] for i in range(num_classes)]
Measures if correct class is in top K predictions.
def top_k_accuracy(model, dataloader, k=5):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in dataloader:
            outputs = model(images)
            _, top_k = torch.topk(outputs, k, dim=1)
            for i in range(len(labels)):
                if labels[i] in top_k[i]:
                    correct += 1
                total += 1
    return correct / total
A confusion matrix shows the performance of a classification model by comparing predicted vs actual labels.
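                 Predicted Class
              0   1   2   3  ...  9
            ┌────────────────────────┐
          0 │ TP  FP  FP  FP ...  FP │
          1 │ FP  TP  FP  FP ...  FP │
Actual    2 │ FP  FP  TP  FP ...  FP │
Class     3 │ FP  FP  FP  TP ...  FP │
        ... │ ...................... │
          9 │ FP  FP  FP  FP ...  TP │
            └────────────────────────┘
Diagonal entries count correct predictions. The entry in row i, column j counts class-i images predicted as class j: a false positive for class j and, equally, a false negative for class i.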
def compute_confusion_matrix(model, dataloader, num_classes=10):
    model.eval()
    confusion_matrix = torch.zeros(num_classes, num_classes, dtype=torch.int64)
    with torch.no_grad():
        for images, labels in dataloader:
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            for true, pred in zip(labels, predicted):
                confusion_matrix[true, pred] += 1
    return confusion_matrix
What to look for:
Example insights:
If cm[3, 5] is high (cat predicted as dog):
→ Model confuses cats and dogs
→ Maybe add more training data for these classes
→ Or use data augmentation
If cm[2, 2] is low relative to cm[2, :].sum() (few bird images classified correctly):
→ Model struggles with birds overall
→ Check if bird images are underrepresented
Show percentages instead of counts:
def normalize_confusion_matrix(cm):
    # cm is the torch tensor returned by compute_confusion_matrix
    row_sums = cm.sum(dim=1, keepdim=True).clamp(min=1)  # guard against empty rows
    return cm.float() / row_sums
Each row sums to 1.0 (100%).
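A tiny worked example makes the row/column convention concrete. The sketch below builds a 3-class confusion matrix from made-up labels in plain Python and row-normalizes it; the helper names and data are illustrative, not part of the assignment code.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """cm[t][p] counts examples of true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        cm[t][p] += 1
    return cm

def normalize_rows(cm):
    """Divide each row by its sum so that entries become per-class rates."""
    return [[count / max(sum(row), 1) for count in row] for row in cm]

true = [0, 0, 0, 1, 1, 2, 2, 2, 2]
pred = [0, 0, 1, 1, 1, 2, 0, 2, 2]

cm = confusion_matrix(true, pred, num_classes=3)
print(cm)                  # [[2, 1, 0], [0, 2, 0], [1, 0, 3]]
print(normalize_rows(cm))  # the diagonal holds the per-class accuracy
```

Here class 2 has a per-class accuracy of 0.75, and its only error is being mistaken for class 0.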
baseline = {
'hidden_dims': (512, 256),
'dropout': 0.1,
'lr': 0.001,
'batch_size': 256,
'epochs': 20
}
# Experiment 1: Larger network
config1 = baseline.copy()
config1['hidden_dims'] = (1024, 512, 256)
# Experiment 2: Lower learning rate
config2 = baseline.copy()
config2['lr'] = 0.0001
# Experiment 3: More dropout
config3 = baseline.copy()
config3['dropout'] = 0.3
| Config | Val Acc | Time/Epoch |
|---|---|---|
| Baseline | 54.2% | 12s |
| Larger net | 56.0% | 18s |
| Lower LR | 51.8% | 12s |
| More dropout | 53.1% | 12s |
Start with: 0.001 (Adam) or 0.01 (SGD)
# Try: [0.1, 0.01, 0.001, 0.0001, 0.00001]
# Pick the largest that doesn't diverge
Rule of thumb:
Input size: 3072
Hidden 1: 512-2048 (smaller than input)
Hidden 2: 256-512 (smaller than Hidden 1)
Output: 10 (number of classes)
Typical values: 0.1 - 0.5
class FFNN(nn.Module):
    def __init__(self, dropout=0.2):
        super().__init__()
        self.fc1 = nn.Linear(3072, 512)
        self.dropout1 = nn.Dropout(dropout)
        self.fc2 = nn.Linear(512, 256)
        self.dropout2 = nn.Dropout(dropout)
        self.fc3 = nn.Linear(256, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = self.dropout1(x)
        x = F.relu(self.fc2(x))
        x = self.dropout2(x)
        x = self.fc3(x)
        return x
Common values: 32, 64, 128, 256, 512
For CIFAR-10: 256 is a good default
Problem: Model sees data in same order every epoch.
Solution:
train_loader = DataLoader(train_ds, batch_size=256,
shuffle=True) # ← IMPORTANT!
Problem: Raw pixel values [0, 255] cause unstable training.
Solution:
transform = transforms.Compose([
transforms.ToTensor(), # Scales to [0, 1]
transforms.Normalize((0.4914, 0.4822, 0.4465),
(0.2470, 0.2435, 0.2616)) # ← IMPORTANT!
])
Problem: Test set performance is overly optimistic.
Solution: Use validation set for tuning, test set ONLY for final evaluation.
Training set → Train model
Validation set → Tune hyperparameters
Test set → Report final performance (once!)
Problem: Dropout stays active, giving inconsistent results.
Solution:
model.eval() # ← Disables dropout
with torch.no_grad():
    ...  # Validation code
Problem: Gradients accumulate, causing wrong updates.
Solution:
for images, labels in train_loader:
    optimizer.zero_grad() # ← MUST come before backward()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
Problem: nn.CrossEntropyLoss applies softmax internally!
Wrong:
outputs = F.softmax(model(images), dim=1)
loss = nn.CrossEntropyLoss()(outputs, labels) # ✗ WRONG!
Correct:
logits = model(images) # Raw scores
loss = nn.CrossEntropyLoss()(logits, labels) # ✓ CORRECT!
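The damage done by a double softmax can be shown with plain arithmetic. The sketch below re-implements softmax and cross-entropy by hand (as stand-ins for what `nn.CrossEntropyLoss` does internally) and compares the loss on raw logits with the loss on already-softmaxed values; the example logits are arbitrary.

```python
import math

def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(z, c):
    """Softmax followed by negative log-likelihood of class c."""
    return -math.log(softmax(z)[c])

logits, target = [10.0, 0.0, 0.0], 0  # a confident, correct prediction

correct_loss = cross_entropy_from_logits(logits, target)
double_softmax_loss = cross_entropy_from_logits(softmax(logits), target)

print(correct_loss)         # close to 0, as it should be
print(double_softmax_loss)  # stuck above 0.5 despite the confident prediction
```

Because probabilities lie in [0, 1], a second softmax squashes them toward a uniform distribution, so the loss can never approach zero and the gradients become much weaker.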
Problem: FFNN expects flattened images, not 2D.
Wrong:
# images shape: (batch_size, 3, 32, 32)
outputs = model(images) # ✗ WRONG!
Correct:
# Flatten in forward()
def forward(self, x):
    x = x.view(x.size(0), -1) # (batch, 3, 32, 32) → (batch, 3072)
    ...
Symptoms:
Solutions:
# 1. Add dropout
model = FFNN(dropout=0.3)
# 2. Add weight decay
optimizer = torch.optim.Adam(model.parameters(),
lr=0.001,
weight_decay=1e-4)
# 3. Early stopping
if val_acc_not_improving_for_N_epochs:
    stop_training()
# 4. Get more training data (if possible)
Symptoms:
Solutions:
# 1. Larger network
model = FFNN(hidden_dims=(1024, 512, 256))
# 2. Train longer
num_epochs = 50
# 3. Lower learning rate (paradoxically can help)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
# 4. Remove/reduce regularization
model = FFNN(dropout=0.0) # No dropout
Symptoms:
Solutions:
# 1. Gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# 2. Lower learning rate
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
# 3. Better weight initialization (PyTorch does this by default)
# 4. Use batch normalization (for deeper networks)
import torch
import torch.nn as nn
import torch.nn.functional as F
class FFNN(nn.Module):
    def __init__(self, input_dim=3072, hidden_dims=(512, 256),
                 dropout=0.1, num_classes=10):
        super().__init__()
        layers = []
        in_dim = input_dim
        for hidden_dim in hidden_dims:
            layers.append(nn.Linear(in_dim, hidden_dim))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(dropout))
            in_dim = hidden_dim
        layers.append(nn.Linear(in_dim, num_classes))
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        # Flatten image: (batch, 3, 32, 32) → (batch, 3072)
        x = x.view(x.size(0), -1)
        return self.network(x)
# Create model
model = FFNN(hidden_dims=(512, 256), dropout=0.1)
print(f"Total parameters: {sum(p.numel() for p in model.parameters())}")
def train_one_epoch(model, train_loader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Metrics
        total_loss += loss.item() * images.size(0)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
    avg_loss = total_loss / total
    accuracy = correct / total
    return avg_loss, accuracy
def validate(model, val_loader, criterion, device):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            total_loss += loss.item() * images.size(0)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    avg_loss = total_loss / total
    accuracy = correct / total
    return avg_loss, accuracy
# Setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = FFNN(hidden_dims=(512, 256), dropout=0.1).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
# Training
num_epochs = 20
history = {'train_loss': [], 'train_acc': [], 'val_loss': [], 'val_acc': []}
for epoch in range(num_epochs):
    # Train
    train_loss, train_acc = train_one_epoch(model, train_loader,
                                            optimizer, criterion, device)
    # Validate
    val_loss, val_acc = validate(model, val_loader, criterion, device)
    # Record
    history['train_loss'].append(train_loss)
    history['train_acc'].append(train_acc)
    history['val_loss'].append(val_loss)
    history['val_acc'].append(val_acc)
    # Print
    print(f"Epoch {epoch+1}/{num_epochs}")
    print(f" Train Loss: {train_loss:.4f}, Train Acc: {train_acc*100:.2f}%")
    print(f" Val Loss: {val_loss:.4f}, Val Acc: {val_acc*100:.2f}%")
def gather_misclassifications(model, dataloader, device, max_samples=16):
    model.eval()
    misclassified_images = []
    misclassified_preds = []
    misclassified_labels = []
    with torch.no_grad():
        for images, labels in dataloader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            # Find misclassified
            mask = predicted != labels
            misclassified_images.extend(images[mask].cpu())
            misclassified_preds.extend(predicted[mask].cpu().tolist())
            misclassified_labels.extend(labels[mask].cpu().tolist())
            if len(misclassified_images) >= max_samples:
                break
    return (misclassified_images[:max_samples],
            misclassified_preds[:max_samples],
            misclassified_labels[:max_samples])
Architecture: (512, 256)
Validation Accuracy: 50-55%
Training Time: ~12s/epoch (GPU)
Parameters: ~1.7M
Architecture: (1024, 512, 256)
Validation Accuracy: 55-58%
Training Time: ~18s/epoch (GPU)
Parameters: ~3.8M
Note: FFNN performance on CIFAR-10 is limited! CNNs (Assignment 3) achieve 80-90%.