FULLY CONNECTED NEURAL NETWORKS

FOR CIFAR-10 CLASSIFICATION

CMPUT 328 - ASSIGNMENT 2 STUDY GUIDE

1. INTRODUCTION TO FEEDFORWARD NEURAL NETWORKS

What is a Feedforward Neural Network?

A Feedforward Neural Network (FFNN), called a Fully Connected Neural Network throughout this guide, is the simplest type of artificial neural network, in which information flows in one direction: from input to output.

Key Characteristics

Why Use Neural Networks for Images?

Neural networks can learn hierarchical representations of data:

Mathematical Representation

For a simple FFNN with one hidden layer:

h = f(W₁x + b₁)
y = g(W₂h + b₂)

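Where:

x: input vector (for CIFAR-10, the flattened image)
W₁, W₂: weight matrices of the hidden and output layers
b₁, b₂: bias vectors
f: hidden-layer activation function (e.g. ReLU)
g: output activation function (e.g. softmax)
h: hidden-layer activations
y: network output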

2. CIFAR-10 DATASET

Dataset Overview

CIFAR-10 is a benchmark dataset for image classification consisting of 60,000 32×32 color images in 10 classes.

Dataset Statistics

Total images:     60,000
Training images:  50,000
Test images:      10,000
Image size:       32 × 32 × 3 (RGB)
Number of classes: 10
Images per class: 6,000

Class Labels

The 10 classes are:

0: airplane
1: automobile
2: bird
3: cat
4: deer
5: dog
6: frog
7: horse
8: ship
9: truck

Data Splits

For proper evaluation, split the data:

Training set:   45,000 images (90% of 50,000)
Validation set:  5,000 images (10% of 50,000)
Test set:       10,000 images (held out)

Use the validation set for hyperparameter tuning, NOT the test set!

Normalization

CIFAR-10 normalization values (empirically computed):

mean = (0.4914, 0.4822, 0.4465)  # RGB channels
std  = (0.2470, 0.2435, 0.2616)  # RGB channels

Normalization formula:

normalized_pixel = (pixel - mean) / std
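
As a minimal sketch of the formula above, the per-channel arithmetic can be written by hand. The `normalize` helper and the synthetic `image` are illustrative assumptions; in practice `transforms.Normalize` performs the same operation.

```python
import torch

# Per-channel statistics quoted above (RGB order), shaped to broadcast over (3, H, W)
mean = torch.tensor([0.4914, 0.4822, 0.4465]).view(3, 1, 1)
std = torch.tensor([0.2470, 0.2435, 0.2616]).view(3, 1, 1)

def normalize(image):
    """Apply (pixel - mean) / std per channel to a (3, H, W) image scaled to [0, 1]."""
    return (image - mean) / std

# Hypothetical stand-in for a real image: every pixel equals its channel mean
image = mean.expand(3, 32, 32).clone()
normalized = normalize(image)
```

A pixel equal to its channel mean maps to 0, and a pixel one standard deviation above the mean maps to 1.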

This standardizes inputs to have mean ≈ 0 and std ≈ 1, which helps with:

Loading CIFAR-10 in PyTorch

import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, random_split

# Define transforms
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616))
])

# Load datasets
train_val_ds = datasets.CIFAR10(root='./data', train=True,
                                 download=True, transform=transform)
test_ds = datasets.CIFAR10(root='./data', train=False,
                           download=True, transform=transform)

# Split train into train/val
train_ds, val_ds = random_split(train_val_ds, [45000, 5000],
                                 generator=torch.Generator().manual_seed(42))

# Create data loaders
train_loader = DataLoader(train_ds, batch_size=256, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=256, shuffle=False)
test_loader = DataLoader(test_ds, batch_size=256, shuffle=False)

3. NEURAL NETWORK ARCHITECTURE

FFNN Architecture for CIFAR-10

A typical FFNN for CIFAR-10:

Input Layer:    3072 neurons (32×32×3 flattened)
Hidden Layer 1:  512 neurons + ReLU
Hidden Layer 2:  256 neurons + ReLU
Output Layer:     10 neurons (one per class)

Architecture Diagram

┌──────────────┐
│ Input (3072) │
└──────┬───────┘
       │ W₁ (3072×512)
       ↓
┌──────────────┐
│ Hidden (512) │ + ReLU
└──────┬───────┘
       │ W₂ (512×256)
       ↓
┌──────────────┐
│ Hidden (256) │ + ReLU
└──────┬───────┘
       │ W₃ (256×10)
       ↓
┌──────────────┐
│ Output (10)  │ + Softmax
└──────────────┘

Why Flatten Images?

FFNNs require 1D input vectors:

Original shape:  (32, 32, 3)
Flattened shape: (3072,)

Calculation: 32 × 32 × 3 = 3072

Note: This destroys spatial structure, which is why CNNs (Assignment 3) work better for images!
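
A short sketch of the flattening step, assuming a random batch in place of real images. Note that `transforms.ToTensor()` produces channel-first tensors of shape (3, 32, 32) rather than (32, 32, 3); either layout flattens to 3072 values.

```python
import torch

# Hypothetical batch of 4 images in PyTorch's channel-first layout (B, C, H, W)
images = torch.randn(4, 3, 32, 32)

flat_view = images.view(images.size(0), -1)    # keep the batch dim, merge the rest
flat_fn = torch.flatten(images, start_dim=1)   # equivalent built-in

print(flat_view.shape)  # torch.Size([4, 3072])
```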

Parameter Count

For the architecture above:

Layer 1: (3072 × 512) + 512 = 1,573,376
Layer 2: (512 × 256) + 256  = 131,328
Layer 3: (256 × 10) + 10    = 2,570

Total parameters: 1,707,274

Formula: For a layer with n_in inputs and n_out outputs:

parameters = (n_in × n_out) + n_out = n_out × (n_in + 1)
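
The formula can be checked with a few lines of plain Python. The helper name `count_parameters` is an illustrative assumption, not part of any library.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weights + biases
    return total

baseline = count_parameters([3072, 512, 256, 10])
print(f"{baseline:,}")  # 1,707,274
```

The same helper gives 3,805,450 for hidden sizes (1024, 512, 256).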

4. FORWARD PASS: MATRIX OPERATIONS

Batched Matrix Multiplication

For a batch of B images, each with D features:

Input X:     B × D matrix
Weights W:   D × H matrix
Bias b:      H vector (broadcast across the batch)
Output Z:    B × H matrix

Forward pass equation:

Z = XW + b

Where:

Example Calculation

# Input: batch of 256 images, each 3072 pixels
X = torch.randn(256, 3072)  # 256×3072

# Layer 1 weights
W1 = torch.randn(3072, 512)  # 3072×512
b1 = torch.randn(512)         # 512

# Forward pass
Z1 = X @ W1 + b1  # (256×3072) @ (3072×512) = 256×512

Computational Graph

The full forward pass:

X (256×3072)
    ↓ × W₁
Z₁ (256×512)
    ↓ + b₁
Z₂ (256×512)
    ↓ ReLU
H₁ (256×512)
    ↓ × W₂
Z₃ (256×256)
    ↓ + b₂
Z₄ (256×256)
    ↓ ReLU
H₂ (256×256)
    ↓ × W₃
Z₅ (256×10)
    ↓ + b₃
Z₆ (256×10)
    ↓ Softmax
Ŷ (256×10)

5. ACTIVATION FUNCTIONS

Why Non-Linear Activations?

Without non-linearity, stacking layers is pointless:

Layer 1: h₁ = W₁x + b₁
Layer 2: y = W₂h₁ + b₂
       = W₂(W₁x + b₁) + b₂
       = (W₂W₁)x + (W₂b₁ + b₂)
       = W'x + b'  ← Still linear!

Non-linear activations allow networks to learn complex patterns.
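
The collapse derived above can be confirmed numerically. This is a sketch with small, arbitrary sizes (8 → 5 → 3) chosen only for illustration.

```python
import torch

torch.manual_seed(0)
dt = torch.float64  # double precision keeps the comparison numerically tight

x = torch.randn(8, dtype=dt)
W1, b1 = torch.randn(5, 8, dtype=dt), torch.randn(5, dtype=dt)
W2, b2 = torch.randn(3, 5, dtype=dt), torch.randn(3, dtype=dt)

# Two stacked linear layers with no activation in between
y_stacked = W2 @ (W1 @ x + b1) + b2

# The single equivalent linear layer: W' = W2 W1, b' = W2 b1 + b2
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
y_single = W_prime @ x + b_prime
```

Both computations produce the same output, so the second layer adds no expressive power without an activation in between.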

ReLU (Rectified Linear Unit)

Most common activation function for hidden layers.

Formula:

ReLU(x) = max(0, x)

Properties

Derivative:

ReLU'(x) = 1 if x > 0, else 0

PyTorch implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(4, 512)  # Example pre-activations

# Option 1: Functional
output = F.relu(x)

# Option 2: Module
relu = nn.ReLU()
output = relu(x)

Sigmoid

Rarely used in hidden layers, sometimes for output.

Formula:

σ(x) = 1 / (1 + e⁻ˣ)

Properties

Derivative:

σ'(x) = σ(x)(1 - σ(x))
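
One way to sanity-check the derivative formula is to compare it with autograd on a few sample points. The input range below is an arbitrary choice for illustration.

```python
import torch

x = torch.linspace(-4.0, 4.0, steps=9, dtype=torch.float64, requires_grad=True)
s = torch.sigmoid(x)

# Autograd: the gradient of sum(sigmoid(x)) w.r.t. x is sigma'(x) element-wise
(autograd_grad,) = torch.autograd.grad(s.sum(), x)

# Closed-form derivative from the formula above
formula_grad = (s * (1 - s)).detach()

max_grad = formula_grad.max().item()  # largest slope, reached at x = 0
```

The slope never exceeds 0.25, which is one reason gradients shrink quickly when many sigmoid layers are stacked.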

Softmax (Output Layer)

Used for multi-class classification.

Formula:

Softmax(zᵢ) = e^zᵢ / Σⱼ e^zⱼ

Properties

PyTorch implementation:

# Softmax is typically combined with CrossEntropyLoss
# Don't apply softmax manually before nn.CrossEntropyLoss!

logits = model(x)  # Raw scores
loss = nn.CrossEntropyLoss()(logits, targets)

# For inference only:
probs = F.softmax(logits, dim=1)

6. LOSS FUNCTIONS

Cross-Entropy Loss

Standard loss for classification.

Formula:

L = -Σᵢ yᵢ log(ŷᵢ)

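Where:

yᵢ: true label for class i (one-hot: 1 for the correct class, 0 otherwise)
ŷᵢ: predicted probability for class i (softmax output)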

For a single correct class c:

L = -log(ŷ_c)

Properties

Cross-Entropy + Softmax Gradient

Beautiful property: The gradient simplifies!

∂L/∂z = ŷ - y

Where z is the logits (pre-softmax scores).
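
The identity can be verified against autograd. This sketch assumes a random batch of 4 samples and uses `reduction="sum"` so that the loss matches the unaveraged formula; with the default `reduction="mean"` the gradient is additionally divided by the batch size.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10, dtype=torch.float64, requires_grad=True)  # z
targets = torch.tensor([3, 0, 9, 5])                                   # class indices

# Sum-reduced cross-entropy on raw logits (softmax is applied internally)
loss = F.cross_entropy(logits, targets, reduction="sum")
loss.backward()

# Analytic gradient: softmax output minus one-hot target
y_hat = F.softmax(logits.detach(), dim=1)
y_onehot = F.one_hot(targets, num_classes=10).to(torch.float64)
analytic_grad = y_hat - y_onehot
```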

Mean Squared Error (MSE)

Not recommended for classification, but useful to understand.

Formula:

L = (1/2) Σᵢ (ŷᵢ - yᵢ)²

Gradient:

∂L/∂ŷ = ŷ - y

Why not for classification?

PyTorch Implementation

# For classification (includes softmax internally)
criterion = nn.CrossEntropyLoss()

# Model outputs logits (raw scores), NOT probabilities!
logits = model(images)  # Shape: (batch_size, 10)
targets = labels         # Shape: (batch_size,) with values 0-9

loss = criterion(logits, targets)

Don't apply softmax before CrossEntropyLoss!

7. BACKPROPAGATION ALGORITHM

What is Backpropagation?

Backpropagation is an algorithm to compute gradients of the loss with respect to all parameters using the chain rule.

Goal: Compute ∂L/∂W and ∂L/∂b for all weights and biases.

Chain Rule Review

For composite functions:

If z = f(g(x)), then:
dz/dx = (dz/dg) × (dg/dx)

For neural networks with many layers:

∂L/∂W₁ = (∂L/∂Z₆) × (∂Z₆/∂Z₅) × ... × (∂Z₂/∂Z₁) × (∂Z₁/∂W₁)

Computational Graph Approach

Forward pass (left to right):

X → Z₁ → H₁ → Z₂ → H₂ → ... → Ŷ → L

Backward pass (right to left):

∂L/∂L ← ∂L/∂Ŷ ← ... ← ∂L/∂H₂ ← ∂L/∂Z₂ ← ∂L/∂H₁ ← ∂L/∂Z₁ ← ∂L/∂X
       │              │              │              │
       └─ ∂L/∂W₃     └─ ∂L/∂W₂     └─ ∂L/∂W₁     └─ (not needed)

Backprop Formulas for One Layer

For a layer: Z = f(XW + b)

Given: ∂L/∂Z (gradient from next layer)

Compute:

∂L/∂X = [f'(XW + b) ⊙ ∂L/∂Z] Wᵀ
∂L/∂W = Xᵀ [f'(XW + b) ⊙ ∂L/∂Z]
∂L/∂b = Σᵢ [f'(XW + b) ⊙ ∂L/∂Z]ᵢ

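Where:

⊙: element-wise (Hadamard) product
f': derivative of the activation f, evaluated element-wise
Σᵢ: sum over the batch dimension (rows)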

Example: Backprop Through ReLU

Forward:

Z = ReLU(X) = max(0, X)

Backward:

∂L/∂X = ∂L/∂Z ⊙ ReLU'(X)
      = ∂L/∂Z ⊙ (X > 0)  # Mask: 1 where X > 0, else 0

Example: Backprop Through Linear Layer

Forward:

Z = XW + b

Backward:

∂L/∂X = (∂L/∂Z) Wᵀ
∂L/∂W = Xᵀ (∂L/∂Z)
∂L/∂b = Σᵢ (∂L/∂Z)ᵢ

Full Network Backpropagation

For the 3-layer FFNN:

# Forward pass
Z1 = X @ W1 + b1
H1 = relu(Z1)
Z2 = H1 @ W2 + b2
H2 = relu(Z2)
Z3 = H2 @ W3 + b3
Y_hat = softmax(Z3)
L = cross_entropy(Y_hat, Y)  # Y is one-hot; loss summed over the batch

# Backward pass
dZ3 = Y_hat - Y  # Softmax + CrossEntropy gradient (divide by batch size B if the loss is averaged)
dW3 = H2.T @ dZ3
db3 = dZ3.sum(dim=0)

dH2 = dZ3 @ W3.T
dZ2 = dH2 * (Z2 > 0)  # ReLU gradient
dW2 = H1.T @ dZ2
db2 = dZ2.sum(dim=0)

dH1 = dZ2 @ W2.T
dZ1 = dH1 * (Z1 > 0)  # ReLU gradient
dW1 = X.T @ dZ1
db1 = dZ1.sum(dim=0)

Note: PyTorch does this automatically with loss.backward()!
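
The manual gradients above can be checked against autograd. The following is a sketch under stated assumptions: the layer sizes are small and arbitrary, the `param` helper is hypothetical, and the loss is summed over the batch so that `dZ3 = Y_hat - Y` holds without a 1/B factor.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, D, H1_DIM, H2_DIM, C = 8, 12, 6, 5, 3   # hypothetical small sizes
dt = torch.float64

X = torch.randn(B, D, dtype=dt)
Y_idx = torch.randint(0, C, (B,))
Y = F.one_hot(Y_idx, num_classes=C).to(dt)

def param(*shape):
    """Create a small random parameter tensor that tracks gradients."""
    return (0.5 * torch.randn(*shape, dtype=dt)).requires_grad_()

W1, b1 = param(D, H1_DIM), param(H1_DIM)
W2, b2 = param(H1_DIM, H2_DIM), param(H2_DIM)
W3, b3 = param(H2_DIM, C), param(C)

# Forward pass (autograd records the graph)
Z1 = X @ W1 + b1
H1 = torch.relu(Z1)
Z2 = H1 @ W2 + b2
H2 = torch.relu(Z2)
Z3 = H2 @ W3 + b3
loss = F.cross_entropy(Z3, Y_idx, reduction="sum")
loss.backward()

# Manual backward pass, following the formulas above
with torch.no_grad():
    dZ3 = F.softmax(Z3, dim=1) - Y
    dW3 = H2.T @ dZ3
    db3 = dZ3.sum(dim=0)
    dZ2 = (dZ3 @ W3.T) * (Z2 > 0)
    dW2 = H1.T @ dZ2
    db2 = dZ2.sum(dim=0)
    dZ1 = (dZ2 @ W2.T) * (Z1 > 0)
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(dim=0)

manual = [dW1, db1, dW2, db2, dW3, db3]
auto = [p.grad for p in (W1, b1, W2, b2, W3, b3)]
all_match = all(torch.allclose(m, a) for m, a in zip(manual, auto))
print(all_match)
```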

8. GRADIENT DESCENT OPTIMIZATION

Gradient Descent Intuition

Goal: Minimize loss L(θ) by adjusting parameters θ.

Key idea: Move in the direction opposite to the gradient.

θ_new = θ_old - α ∇L(θ_old)

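Where:

θ: model parameters (all weights and biases)
α: learning rate (step size)
∇L(θ_old): gradient of the loss at the current parameters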

Gradient as Steepest Ascent

The gradient ∇L(θ) points in the direction of steepest increase of L.

Therefore, -∇L(θ) points in the direction of steepest decrease.

Batch Gradient Descent

Compute gradient using entire dataset:

∇L(θ) = (1/N) Σᵢ ∇L(θ; xᵢ, yᵢ)

Pros

Cons

Stochastic Gradient Descent (SGD)

Compute gradient using one random example:

∇L(θ) ≈ ∇L(θ; xᵢ, yᵢ)

Pros

Cons

Mini-Batch Gradient Descent

Compute gradient using a small batch:

∇L(θ) ≈ (1/B) Σᵢ₌₁ᴮ ∇L(θ; xᵢ, yᵢ)

Best of both worlds:

Typical batch sizes: 32, 64, 128, 256, 512

Learning Rate Selection

Too large: Overshoots, diverges

Loss oscillates or explodes

Too small: Slow convergence

Takes forever to converge

Just right: Smooth, fast convergence

Steady decrease to minimum

Typical values: 0.001, 0.0001, 0.01
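
The three regimes can be reproduced on a toy problem. This sketch minimizes L(θ) = θ² with plain gradient descent; the learning rates below are chosen to show each regime on this toy loss and are not recommendations for the CIFAR-10 network.

```python
def gradient_descent(lr, steps=50, theta=1.0):
    """Minimize L(theta) = theta**2 with plain gradient descent."""
    for _ in range(steps):
        grad = 2.0 * theta          # dL/dtheta
        theta = theta - lr * grad   # theta_new = theta_old - lr * grad
    return theta

too_large = gradient_descent(lr=1.1)     # each step multiplies theta by -1.2
too_small = gradient_descent(lr=0.001)   # each step multiplies theta by 0.998
reasonable = gradient_descent(lr=0.1)    # each step multiplies theta by 0.8
```

After 50 steps the large rate has diverged, the small rate has barely moved from the starting point, and the moderate rate is close to the minimum at 0.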

Adam Optimizer

Adaptive Moment Estimation - most popular optimizer.

Key ideas

Formula (simplified):

m_t = β₁ m_{t-1} + (1-β₁) g_t        # First moment (momentum)
v_t = β₂ v_{t-1} + (1-β₂) g_t²       # Second moment (variance)
θ_t = θ_{t-1} - α m_t / (√v_t + ε)   # Update

Default hyperparameters:

lr = 0.001
betas = (0.9, 0.999)
eps = 1e-8
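
The simplified formula omits the bias-correction terms that the full Adam algorithm applies to both moments. A scalar sketch with those terms included, using the defaults above (the function name `adam_step` and the toy loss are illustrative assumptions):

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment
    m_hat = m / (1 - beta1 ** t)                # bias correction for m
    v_hat = v / (1 - beta2 ** t)                # bias correction for v
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Toy problem: L(theta) = theta**2, so grad = 2 * theta
theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, grad=2.0 * theta, m=m, v=v, t=1)
```

On the first step the corrected ratio is grad / |grad|, so the parameter moves by almost exactly `lr` regardless of the gradient's magnitude.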

PyTorch implementation:

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

Optimization Algorithm Comparison

Algorithm   Characteristics
SGD         Simple, well-understood. Requires careful LR tuning. Can escape sharp minima.
Adam        Adapts LR automatically. Works well with defaults. Faster convergence. More memory.

9. TRAINING PROCESS

Training Loop Structure

for epoch in range(num_epochs):
    # Training phase
    model.train()
    for images, labels in train_loader:
        # 1. Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # 2. Backward pass
        optimizer.zero_grad()
        loss.backward()

        # 3. Update weights
        optimizer.step()

    # Validation phase
    model.eval()
    with torch.no_grad():
        for images, labels in val_loader:
            outputs = model(images)
            # Compute validation metrics

Important Steps Explained

1. model.train() vs model.eval()

model.train()  # Enable dropout, batch norm training mode
model.eval()   # Disable dropout, batch norm inference mode

2. optimizer.zero_grad()

PyTorch accumulates gradients. Must zero them each iteration!

optimizer.zero_grad()  # Clear previous gradients
loss.backward()        # Compute new gradients
optimizer.step()       # Update weights
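
The accumulation behaviour is easy to observe on a single tensor. A minimal sketch, assuming a toy loss whose gradient is 2w:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

def loss_fn(w):
    return (w ** 2).sum()    # gradient is 2 * w

loss_fn(w).backward()
first = w.grad.clone()        # gradient of a single backward pass

loss_fn(w).backward()         # no zeroing in between: gradients add up
accumulated = w.grad.clone()

w.grad.zero_()                # clear the stored gradient, as zero_grad() does per parameter
loss_fn(w).backward()
after_reset = w.grad.clone()
```

Without the reset the second gradient is twice the correct value, so the optimizer would take a step that is too large.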

3. torch.no_grad()

Don't compute gradients during validation (saves memory):

with torch.no_grad():
    outputs = model(images)  # No gradient tracking

Early Stopping

Problem: Model may overfit if trained too long.

Solution: Stop when validation performance stops improving.

Implementation:

import copy

best_val_acc = 0
best_state = None
patience = 5
epochs_without_improvement = 0

for epoch in range(max_epochs):
    train(...)
    val_acc = validate(...)

    if val_acc > best_val_acc:
        best_val_acc = val_acc
        epochs_without_improvement = 0
        best_state = copy.deepcopy(model.state_dict())  # Save best weights
    else:
        epochs_without_improvement += 1

    if epochs_without_improvement >= patience:
        print("Early stopping!")
        break

if best_state is not None:
    model.load_state_dict(best_state)  # Restore best weights

Learning Rate Scheduling

Gradually decrease learning rate during training.

Common schedules:

# Step decay: multiply by 0.1 every 30 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                             step_size=30,
                                             gamma=0.1)

# Cosine annealing: smooth decrease
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                        T_max=100)

# Reduce on plateau: decrease when validation stops improving
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
                                                        patience=5)

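Usage:

for epoch in range(num_epochs):
    train(...)
    val_loss = validate(...)
    scheduler.step()  # Update learning rate (StepLR, CosineAnnealingLR)
    # For ReduceLROnPlateau, pass the monitored metric instead:
    # scheduler.step(val_loss)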
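
`ReduceLROnPlateau` differs from the other schedulers in that its `step` needs the monitored metric. A small sketch with a dummy parameter and made-up validation losses (both are assumptions for illustration):

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=2)

# Hypothetical validation losses that stop improving after the first epoch
val_losses = [1.0, 1.0, 1.0, 1.0]
lrs = []
for val_loss in val_losses:
    optimizer.step()            # stands in for a full training epoch
    scheduler.step(val_loss)    # the monitored metric is passed explicitly
    lrs.append(optimizer.param_groups[0]["lr"])
```

With `patience=2` the learning rate stays at 0.1 for two stagnant epochs and is multiplied by `factor` on the third one.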

Monitoring Training

Track these metrics:

history = {
    'train_loss': [],
    'train_acc': [],
    'val_loss': [],
    'val_acc': []
}

# Each epoch
history['train_loss'].append(train_loss)
history['train_acc'].append(train_acc)
history['val_loss'].append(val_loss)
history['val_acc'].append(val_acc)

10. EVALUATION METRICS

Accuracy

Most intuitive metric for classification.

Formula:

Accuracy = (Number of correct predictions) / (Total predictions)

Binary case: Accuracy = (TP + TN) / (TP + TN + FP + FN)

PyTorch implementation:

def compute_accuracy(model, dataloader):
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for images, labels in dataloader:
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    return correct / total

Limitations

Per-Class Accuracy

Compute accuracy for each class separately.

def per_class_accuracy(model, dataloader, num_classes=10):
    model.eval()
    class_correct = [0] * num_classes
    class_total = [0] * num_classes

    with torch.no_grad():
        for images, labels in dataloader:
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)

            for i in range(len(labels)):
                label = labels[i]
                class_correct[label] += (predicted[i] == label).item()
                class_total[label] += 1

    return [class_correct[i] / class_total[i] for i in range(num_classes)]

Top-K Accuracy

Measures if correct class is in top K predictions.

def top_k_accuracy(model, dataloader, k=5):
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for images, labels in dataloader:
            outputs = model(images)
            _, top_k = torch.topk(outputs, k, dim=1)

            for i in range(len(labels)):
                if labels[i] in top_k[i]:
                    correct += 1
                total += 1

    return correct / total
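
The inner Python loop can be replaced by a tensor comparison. A possible vectorized sketch on a hand-made batch (the helper name `top_k_correct` and the scores are illustrative assumptions):

```python
import torch

def top_k_correct(outputs, labels, k):
    """Count samples whose true label is among the k highest-scoring classes."""
    _, top_k = torch.topk(outputs, k, dim=1)            # (batch, k) class indices
    hits = (top_k == labels.unsqueeze(1)).any(dim=1)    # (batch,) booleans
    return hits.sum().item()

# Hypothetical scores for 3 samples and 4 classes
outputs = torch.tensor([[0.1, 0.5, 0.3, 0.1],
                        [0.7, 0.1, 0.1, 0.1],
                        [0.2, 0.2, 0.5, 0.1]])
labels = torch.tensor([2, 0, 3])

top1 = top_k_correct(outputs, labels, k=1)
top2 = top_k_correct(outputs, labels, k=2)
```

Here only the second sample is correct at k=1, while the first sample also counts at k=2 because its true class has the second-highest score.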

11. CONFUSION MATRIX ANALYSIS

What is a Confusion Matrix?

A confusion matrix shows the performance of a classification model by comparing predicted vs actual labels.

Structure (for 10 classes)

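                Predicted Class
             0   1   2   3  ...  9
           ┌───────────────────────┐
         0 │ ✓   ✗   ✗   ✗  ...  ✗ │
         1 │ ✗   ✓   ✗   ✗  ...  ✗ │
Actual   2 │ ✗   ✗   ✓   ✗  ...  ✗ │
Class    3 │ ✗   ✗   ✗   ✓  ...  ✗ │
       ... │ ..................... │
         9 │ ✗   ✗   ✗   ✗  ...  ✓ │
           └───────────────────────┘

✓ = correct predictions (diagonal), ✗ = errors (off-diagonal).
Entry [i, j] counts images of actual class i predicted as class j, so an off-diagonal entry is a false negative for class i and a false positive for class j.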

Computing Confusion Matrix

def compute_confusion_matrix(model, dataloader, num_classes=10):
    model.eval()
    confusion_matrix = torch.zeros(num_classes, num_classes, dtype=torch.int64)

    with torch.no_grad():
        for images, labels in dataloader:
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)

            for true, pred in zip(labels, predicted):
                confusion_matrix[true, pred] += 1

    return confusion_matrix

Interpreting Confusion Matrix

What to look for:

  1. Strong diagonal: High accuracy
  2. Weak diagonal: Poor overall performance
  3. Hot spots off diagonal: Specific confusion patterns

Example insights:

If cm[3, 5] is high (cat predicted as dog):
→ Model confuses cats and dogs
→ Maybe add more training data for these classes
→ Or use data augmentation

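If cm[2, 2] / cm[2, :].sum() is low (few actual birds classified correctly):
→ Model struggles with birds overall
→ Check if bird images are underrepresented in the training split

Note: cm[2, :].sum() on its own is just the number of actual bird images evaluated.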

Normalized Confusion Matrix

Show percentages instead of counts:

def normalize_confusion_matrix(cm):
    # cm is the torch tensor returned by compute_confusion_matrix
    row_sums = cm.sum(dim=1, keepdim=True).clamp(min=1)  # avoid division by zero
    return cm.float() / row_sums

Each row sums to 1.0 (100%).
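
The diagonal of the row-normalized matrix is the per-class recall; dividing by column sums instead gives per-class precision. A sketch on a made-up 3-class matrix (the counts are purely illustrative):

```python
import torch

# Hypothetical 3-class confusion matrix: rows = actual, columns = predicted
cm = torch.tensor([[8, 2, 0],
                   [1, 6, 3],
                   [1, 1, 8]])

diag = cm.diag().float()
recall = diag / cm.sum(dim=1).float()      # correct / all actual examples of the class
precision = diag / cm.sum(dim=0).float()   # correct / all predictions of the class
accuracy = (diag.sum() / cm.sum()).item()  # overall fraction on the diagonal
```

Class 1 has the lowest recall here (6 of 10), and its row shows that the missed examples mostly land in class 2.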

12. HYPERPARAMETER TUNING

Key Hyperparameters

Architecture

Training

Regularization

Systematic Tuning Process

1. Start with baseline

baseline = {
    'hidden_dims': (512, 256),
    'dropout': 0.1,
    'lr': 0.001,
    'batch_size': 256,
    'epochs': 20
}

2. Change ONE variable at a time

# Experiment 1: Larger network
config1 = baseline.copy()
config1['hidden_dims'] = (1024, 512, 256)

# Experiment 2: Lower learning rate
config2 = baseline.copy()
config2['lr'] = 0.0001

# Experiment 3: More dropout
config3 = baseline.copy()
config3['dropout'] = 0.3

3. Compare results

Config         Val Acc   Time/Epoch
Baseline       54.2%     12s
Larger net     56.0%     18s
Lower LR       51.8%     12s
More dropout   53.1%     12s

Learning Rate Guidelines

Start with: 0.001 (Adam) or 0.01 (SGD)

Too high indicators

Too low indicators

Finding good LR

# Try: [0.1, 0.01, 0.001, 0.0001, 0.00001]
# Pick the largest that doesn't diverge

Hidden Layer Size Guidelines

Rule of thumb:

Input size: 3072
Hidden 1:   512-2048  (smaller than input)
Hidden 2:   256-512   (smaller than Hidden 1)
Output:     10        (number of classes)

More neurons

Fewer neurons

Dropout Guidelines

Typical values: 0.1 - 0.5

class FFNN(nn.Module):
    def __init__(self, dropout=0.2):
        super().__init__()
        self.fc1 = nn.Linear(3072, 512)
        self.dropout1 = nn.Dropout(dropout)
        self.fc2 = nn.Linear(512, 256)
        self.dropout2 = nn.Dropout(dropout)
        self.fc3 = nn.Linear(256, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = self.dropout1(x)
        x = F.relu(self.fc2(x))
        x = self.dropout2(x)
        x = self.fc3(x)
        return x

Higher dropout (0.4-0.5)

Lower dropout (0.1-0.2)

Batch Size Guidelines

Common values: 32, 64, 128, 256, 512

Larger batches

Smaller batches

For CIFAR-10: 256 is a good default

13. COMMON PITFALLS AND SOLUTIONS

Pitfall 1: Not Shuffling Training Data

Problem: Model sees data in same order every epoch.

Solution:

train_loader = DataLoader(train_ds, batch_size=256,
                          shuffle=True)  # ← IMPORTANT!

Pitfall 2: Forgetting to Normalize

Problem: Raw pixel values [0, 255] cause unstable training.

Solution:

transform = transforms.Compose([
    transforms.ToTensor(),  # Scales to [0, 1]
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616))  # ← IMPORTANT!
])

Pitfall 3: Using Test Set for Hyperparameter Tuning

Problem: Tuning on the test set leaks information into model selection, so the reported test performance is overly optimistic.

Solution: Use validation set for tuning, test set ONLY for final evaluation.

Training set   → Train model
Validation set → Tune hyperparameters
Test set       → Report final performance (once!)

Pitfall 4: Not Using model.eval() During Validation

Problem: Dropout stays active, giving inconsistent results.

Solution:

model.eval()  # ← Disables dropout
with torch.no_grad():
    # Validation code

Pitfall 5: Forgetting optimizer.zero_grad()

Problem: Gradients accumulate, causing wrong updates.

Solution:

for images, labels in train_loader:
    optimizer.zero_grad()  # ← MUST come before backward()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()

Pitfall 6: Applying Softmax Before CrossEntropyLoss

Problem: nn.CrossEntropyLoss applies softmax internally!

Wrong:

outputs = F.softmax(model(images), dim=1)
loss = nn.CrossEntropyLoss()(outputs, labels)  # ✗ WRONG!

Correct:

logits = model(images)  # Raw scores
loss = nn.CrossEntropyLoss()(logits, labels)  # ✓ CORRECT!
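
A small numerical sketch of what goes wrong, using a made-up 3-class example in which the model is confidently correct:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

criterion = nn.CrossEntropyLoss()
logits = torch.tensor([[4.0, 0.0, 0.0]])   # hypothetical confident, correct prediction
labels = torch.tensor([0])

correct_loss = criterion(logits, labels).item()
double_softmax_loss = criterion(F.softmax(logits, dim=1), labels).item()

# Manual check: cross-entropy on logits equals -log(softmax(logits)[label])
manual_loss = -torch.log(F.softmax(logits, dim=1)[0, 0]).item()
```

The code runs without error in both cases, which is what makes this bug easy to miss. Because probabilities lie in [0, 1], the second softmax flattens them: with three classes the loss cannot drop below about 0.55 no matter how confident the model is, so training stalls.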

Pitfall 7: Incorrect Input Shape

Problem: FFNN expects flattened vectors of length 3072, not image tensors of shape (3, 32, 32).

Wrong:

# images shape: (batch_size, 3, 32, 32)
outputs = model(images)  # ✗ WRONG if the model does not flatten its input!

Correct:

# Flatten in forward()
def forward(self, x):
    x = x.view(x.size(0), -1)  # (batch, 3, 32, 32) → (batch, 3072)
    ...

Pitfall 8: Overfitting

Symptoms:

Solutions:

# 1. Add dropout
model = FFNN(dropout=0.3)

# 2. Add weight decay
optimizer = torch.optim.Adam(model.parameters(),
                             lr=0.001,
                             weight_decay=1e-4)

# 3. Early stopping (see Section 9)
# Stop when validation accuracy has not improved for `patience` epochs

# 4. Get more training data (if possible)

Pitfall 9: Underfitting

Symptoms:

Solutions:

# 1. Larger network
model = FFNN(hidden_dims=(1024, 512, 256))

# 2. Train longer
num_epochs = 50

# 3. Lower learning rate (paradoxically can help)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

# 4. Remove/reduce regularization
model = FFNN(dropout=0.0)  # No dropout

Pitfall 10: Exploding/Vanishing Gradients

Symptoms:

Solutions:

# 1. Gradient clipping (call after loss.backward(), before optimizer.step())
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# 2. Lower learning rate
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

# 3. Better weight initialization (PyTorch does this by default)
# 4. Use batch normalization (for deeper networks)
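
Where the clipping call sits in a training step can be sketched as follows. The tiny model and the deliberately large input are assumptions chosen to force a large gradient.

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 1)                      # hypothetical tiny model
with torch.no_grad():
    model.weight.fill_(1.0)
    model.bias.zero_()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.tensor([[100.0, 100.0]])           # large input to provoke a large gradient
y = torch.tensor([[0.0]])

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Clip after backward() and before step(); the return value is the norm before clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

optimizer.step()
```

Clipping rescales all gradients by a common factor, so the update direction is preserved while its length is capped at `max_norm`.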

14. IMPLEMENTATION GUIDE

Complete FFNN Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNN(nn.Module):
    def __init__(self, input_dim=3072, hidden_dims=(512, 256),
                 dropout=0.1, num_classes=10):
        super().__init__()

        layers = []
        in_dim = input_dim

        for hidden_dim in hidden_dims:
            layers.append(nn.Linear(in_dim, hidden_dim))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(dropout))
            in_dim = hidden_dim

        layers.append(nn.Linear(in_dim, num_classes))
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        # Flatten image: (batch, 3, 32, 32) → (batch, 3072)
        x = x.view(x.size(0), -1)
        return self.network(x)

# Create model
model = FFNN(hidden_dims=(512, 256), dropout=0.1)
print(f"Total parameters: {sum(p.numel() for p in model.parameters())}")

Training Function

def train_one_epoch(model, train_loader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0

    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Metrics
        total_loss += loss.item() * images.size(0)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    avg_loss = total_loss / total
    accuracy = correct / total
    return avg_loss, accuracy

Validation Function

def validate(model, val_loader, criterion, device):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)

            outputs = model(images)
            loss = criterion(outputs, labels)

            total_loss += loss.item() * images.size(0)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    avg_loss = total_loss / total
    accuracy = correct / total
    return avg_loss, accuracy

Complete Training Loop

# Setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = FFNN(hidden_dims=(512, 256), dropout=0.1).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

# Training
num_epochs = 20
history = {'train_loss': [], 'train_acc': [], 'val_loss': [], 'val_acc': []}

for epoch in range(num_epochs):
    # Train
    train_loss, train_acc = train_one_epoch(model, train_loader,
                                             optimizer, criterion, device)

    # Validate
    val_loss, val_acc = validate(model, val_loader, criterion, device)

    # Record
    history['train_loss'].append(train_loss)
    history['train_acc'].append(train_acc)
    history['val_loss'].append(val_loss)
    history['val_acc'].append(val_acc)

    # Print
    print(f"Epoch {epoch+1}/{num_epochs}")
    print(f"  Train Loss: {train_loss:.4f}, Train Acc: {train_acc*100:.2f}%")
    print(f"  Val Loss:   {val_loss:.4f}, Val Acc:   {val_acc*100:.2f}%")

Gathering Misclassifications

def gather_misclassifications(model, dataloader, device, max_samples=16):
    model.eval()
    misclassified_images = []
    misclassified_preds = []
    misclassified_labels = []

    with torch.no_grad():
        for images, labels in dataloader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)

            # Find misclassified
            mask = predicted != labels

            misclassified_images.extend(images[mask].cpu())
            misclassified_preds.extend(predicted[mask].cpu().tolist())
            misclassified_labels.extend(labels[mask].cpu().tolist())

            if len(misclassified_images) >= max_samples:
                break

    return (misclassified_images[:max_samples],
            misclassified_preds[:max_samples],
            misclassified_labels[:max_samples])

15. ASSIGNMENT REQUIREMENTS SUMMARY

Part A: In-Lab (2%)

Requirements

1. Load CIFAR-10 (0.25%)

2. Implement FFNN (1.0%)

3. Plot Training Curves (0.25%)

4. Confusion Matrix (0.25%)

5. Misclassification Grid (0.25%)

Part B: Take-Home (3%)

Requirements

1. Modifications (1.0%)

2. Updated Plots (1.0%)

3. Short Report (1.0%)

Expected Results

Baseline FFNN Performance

Architecture: (512, 256)
Validation Accuracy: 50-55%
Training Time: ~12s/epoch (GPU)
Parameters: ~1.7M

Improved Configuration

Architecture: (1024, 512, 256)
Validation Accuracy: 55-58%
Training Time: ~18s/epoch (GPU)
Parameters: ~3.8M

Note: FFNN performance on CIFAR-10 is limited! CNNs (Assignment 3) achieve 80-90%.

Submission Checklist

