FULLY CONNECTED NEURAL NETWORKS

1. INTRODUCTION TO FEEDFORWARD NEURAL NETWORKS

What is a Feedforward Neural Network?

A Feedforward Neural Network (FFNN), also called a Fully Connected Neural Network, is the simplest type of artificial neural network where information flows in one direction: from input to output.

Key Characteristics

No cycles: Data flows forward through layers without loops
Fully connected: Every neuron in layer i connects to every neuron in layer i+1
Universal function approximators: With a non-linear activation and enough hidden neurons, can approximate any continuous function on a bounded input domain to arbitrary accuracy
Directed acyclic graph: Can be represented as a computational graph

Why Use Neural Networks for Images?

Neural networks can learn hierarchical representations of data:

Input layer: Raw pixel values
Hidden layers: Learn features (edges → shapes → objects)
Output layer: Class predictions

Mathematical Representation

For a simple FFNN with one hidden layer:

h = f(W₁x + b₁)
y = g(W₂h + b₂)

Where:

x is the input vector
W₁, b₁ are weights and biases for hidden layer
f is a non-linear activation function
h is the hidden layer output
W₂, b₂ are weights and biases for output layer
g is the output activation function
y is the final prediction
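
The two equations above can be traced by hand on a toy network. The sketch below is illustrative only: the sizes (3 inputs, 2 hidden units, 2 outputs), the weight values, and the choice of ReLU for f and softmax for g are assumptions made for the example, not values from this handout.

```python
import math

def relu(v):
    # Element-wise max(0, a)
    return [max(0.0, a) for a in v]

def softmax(v):
    # Subtract the max before exponentiating for numerical stability
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    s = sum(exps)
    return [e / s for e in exps]

def affine(W, x, b):
    # Computes W x + b for a matrix W (list of rows) and a vector x
    return [sum(w * xj for w, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

# Toy values (assumed for illustration)
x = [1.0, -2.0, 0.5]
W1 = [[0.2, -0.1, 0.4],
      [-0.3, 0.5, 0.1]]
b1 = [0.0, 0.1]
W2 = [[1.0, -1.0],
      [0.5, 0.5]]
b2 = [0.0, 0.0]

h = relu(affine(W1, x, b1))     # h = f(W1 x + b1)
y = softmax(affine(W2, h, b2))  # y = g(W2 h + b2)
print("hidden:", h)
print("output:", y)
```

The second hidden unit has a negative pre-activation, so ReLU sets it to 0 and only the first unit contributes to the output.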

2. CIFAR-10 DATASET

Dataset Overview

CIFAR-10 is a benchmark dataset for image classification consisting of 60,000 32×32 color images in 10 classes.

Dataset Statistics

Total images:     60,000
Training images:  50,000
Test images:      10,000
Image size:       32 × 32 × 3 (RGB)
Number of classes: 10
Images per class: 6,000

Class Labels

The 10 classes are:

0: airplane
1: automobile
2: bird
3: cat
4: deer
5: dog
6: frog
7: horse
8: ship
9: truck

Data Splits

For proper evaluation, split the data:

Training set:   45,000 images (90% of 50,000)
Validation set:  5,000 images (10% of 50,000)
Test set:       10,000 images (held out)

Use the validation set for hyperparameter tuning, NOT the test set!

Normalization

CIFAR-10 normalization values (empirically computed):

mean = (0.4914, 0.4822, 0.4465)  # RGB channels
std  = (0.2470, 0.2435, 0.2616)  # RGB channels

Normalization formula:

normalized_pixel = (pixel - mean) / std

This standardizes inputs to have mean ≈ 0 and std ≈ 1, which helps with:

Faster convergence
More stable gradients
Better generalization
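
As a worked example of the normalization formula, the sketch below standardizes a single RGB pixel in plain Python. It assumes the pixel has already been scaled to [0, 1] (which is what transforms.ToTensor() does); the helper name normalize_pixel and the sample pixel are made up for illustration.

```python
MEAN = (0.4914, 0.4822, 0.4465)  # per-channel mean (R, G, B)
STD = (0.2470, 0.2435, 0.2616)   # per-channel std (R, G, B)

def normalize_pixel(rgb):
    """Standardize one RGB pixel whose channels lie in [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, MEAN, STD))

raw = (255, 128, 0)                   # 8-bit pixel values
scaled = tuple(v / 255 for v in raw)  # scale to [0, 1] first
print(normalize_pixel(scaled))

# A pixel equal to the dataset mean maps to zero in every channel
print(normalize_pixel(MEAN))
```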

Loading CIFAR-10 in PyTorch

import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, random_split

# Define transforms
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616))
])

# Load datasets
train_val_ds = datasets.CIFAR10(root='./data', train=True,
                                 download=True, transform=transform)
test_ds = datasets.CIFAR10(root='./data', train=False,
                           download=True, transform=transform)

# Split train into train/val
train_ds, val_ds = random_split(train_val_ds, [45000, 5000],
                                 generator=torch.Generator().manual_seed(42))

# Create data loaders
train_loader = DataLoader(train_ds, batch_size=256, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=256, shuffle=False)
test_loader = DataLoader(test_ds, batch_size=256, shuffle=False)

3. NEURAL NETWORK ARCHITECTURE

FFNN Architecture for CIFAR-10

A typical FFNN for CIFAR-10:

Input Layer:    3072 neurons (32×32×3 flattened)
Hidden Layer 1:  512 neurons + ReLU
Hidden Layer 2:  256 neurons + ReLU
Output Layer:     10 neurons (one per class)

Architecture Diagram

┌──────────────┐
│ Input (3072) │
└──────┬───────┘
       │ W₁ (3072×512)
       ↓
┌──────────────┐
│ Hidden (512) │ + ReLU
└──────┬───────┘
       │ W₂ (512×256)
       ↓
┌──────────────┐
│ Hidden (256) │ + ReLU
└──────┬───────┘
       │ W₃ (256×10)
       ↓
┌──────────────┐
│ Output (10)  │ + Softmax
└──────────────┘

Why Flatten Images?

FFNNs require 1D input vectors:

Original shape:  (32, 32, 3)
Flattened shape: (3072,)

Calculation: 32 × 32 × 3 = 3072

Note: This destroys spatial structure, which is why CNNs (Assignment 3) work better for images!

Parameter Count

For the architecture above:

Layer 1: (3072 × 512) + 512 = 1,573,376
Layer 2: (512 × 256) + 256  = 131,328
Layer 3: (256 × 10) + 10    = 2,570

Total parameters: 1,707,274

Formula: For a layer with n_in inputs and n_out outputs:

parameters = (n_in × n_out) + n_out = n_out × (n_in + 1)
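
The formula can be checked with a few lines of Python. The helper name count_parameters is hypothetical; it simply applies n_out × (n_in + 1) to each pair of consecutive layer sizes.

```python
def count_parameters(layer_sizes):
    """Total weights + biases for a fully connected net with these layer sizes."""
    return sum(n_out * (n_in + 1)
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(count_parameters([3072, 512, 256, 10]))  # 1707274
```

The result should match the count PyTorch reports via sum(p.numel() for p in model.parameters()) for the same architecture.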

4. FORWARD PASS: MATRIX OPERATIONS

Batched Matrix Multiplication

For a batch of B images, each with D features:

Input X:     B × D matrix
Weights W:   D × H matrix
Bias b:      H vector (broadcasted)
Output Z:    B × H matrix

Forward pass equation:

Z = XW + b

Where:

Each row of X is one image (flattened)
Matrix multiplication: (B×D) @ (D×H) = (B×H)
Broadcasting adds b to each row
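
To make the shapes concrete, here is a minimal pure-Python sketch of Z = XW + b with B=2, D=3, H=2. The function name linear_forward and the numbers are illustrative assumptions; in practice torch performs this with X @ W + b.

```python
def linear_forward(X, W, b):
    """Z = XW + b for X of shape (B, D), W of shape (D, H), b of length H."""
    D, H = len(W), len(W[0])
    assert all(len(row) == D for row in X), "inner dimensions must match"
    # The same bias vector b is added to every row (broadcasting)
    return [[sum(row[d] * W[d][h] for d in range(D)) + b[h] for h in range(H)]
            for row in X]

X = [[1.0, 2.0, 3.0],
     [0.0, -1.0, 1.0]]   # B=2 rows, D=3 features
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]         # D=3, H=2
b = [0.5, -0.5]

Z = linear_forward(X, W, b)
print(Z)  # [[4.5, 4.5], [1.5, -0.5]]
```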

Example Calculation

# Input: batch of 256 images, each 3072 pixels
X = torch.randn(256, 3072)  # 256×3072

# Layer 1 weights
W1 = torch.randn(3072, 512)  # 3072×512
b1 = torch.randn(512)         # 512

# Forward pass
Z1 = X @ W1 + b1  # (256×3072) @ (3072×512) = 256×512

Computational Graph

The full forward pass:

X (256×3072)
    ↓ × W₁
Z₁ (256×512)
    ↓ + b₁
Z₂ (256×512)
    ↓ ReLU
H₁ (256×512)
    ↓ × W₂
Z₃ (256×256)
    ↓ + b₂
Z₄ (256×256)
    ↓ ReLU
H₂ (256×256)
    ↓ × W₃
Z₅ (256×10)
    ↓ + b₃
Z₆ (256×10)
    ↓ Softmax
Ŷ (256×10)

5. ACTIVATION FUNCTIONS

Why Non-Linear Activations?

Without non-linearity, stacking layers is pointless:

Layer 1: h₁ = W₁x + b₁
Layer 2: y = W₂h₁ + b₂
       = W₂(W₁x + b₁) + b₂
       = (W₂W₁)x + (W₂b₁ + b₂)
       = W'x + b'  ← Still linear!

Non-linear activations allow networks to learn complex patterns.
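
The collapse of two linear layers into one can be verified numerically. The sketch below uses small made-up matrices and checks that W₂(W₁x + b₁) + b₂ equals W'x + b' with W' = W₂W₁ and b' = W₂b₁ + b₂.

```python
def matvec(W, x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Toy values (assumed for illustration)
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # 3x2
b1 = [1.0, 0.0, -1.0]
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]    # 2x3
b2 = [0.5, 0.5]
x = [2.0, -3.0]

# Two stacked linear layers without an activation in between
two_layer = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# Equivalent single linear layer
W_prime = matmul(W2, W1)
b_prime = add(matvec(W2, b1), b2)
one_layer = add(matvec(W_prime, x), b_prime)

print(two_layer, one_layer)  # both [1.5, 6.5]
```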

ReLU (Rectified Linear Unit)

Most common activation function for hidden layers.

Formula:

ReLU(x) = max(0, x)

Properties

Simple: Computationally cheap
Non-saturating: No vanishing gradient for x > 0
Sparse activations: Units with negative pre-activations output exactly 0 (often around half of them at initialization)
Dead neurons: If neuron always outputs 0, it stops learning

Derivative:

ReLU'(x) = 1 if x > 0, else 0

PyTorch implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(4, 512)  # example pre-activations

# Option 1: Functional
output = F.relu(x)

# Option 2: Module
relu = nn.ReLU()
output = relu(x)

Sigmoid

Rarely used in hidden layers, sometimes for output.

Formula:

σ(x) = 1 / (1 + e⁻ˣ)

Properties

Output range: (0, 1)
Saturates: Gradients → 0 for large |x|
Not zero-centered: Causes zig-zagging during optimization

Derivative:

σ'(x) = σ(x)(1 - σ(x))
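
A quick way to check the derivative formula, and to see the saturation mentioned above, is to compare it with a finite-difference estimate. This is a minimal sketch; the helper names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Analytic derivative: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_grad(f, x, eps=1e-6):
    # Central finite difference
    return (f(x + eps) - f(x - eps)) / (2 * eps)

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(x, sigmoid_grad(x), numeric_grad(sigmoid, x))
```

The gradient peaks at 0.25 for x = 0 and is close to zero for |x| = 10, which is the saturation that slows learning in deep sigmoid networks.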

Softmax (Output Layer)

Used for multi-class classification.

Formula:

Softmax(zᵢ) = e^zᵢ / Σⱼ e^zⱼ

Properties

Outputs: Probabilities that sum to 1
Differentiable: Smooth gradients
Temperature: Can adjust confidence with scaling
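
The sketch below shows one common way to implement softmax with a temperature parameter: logits are divided by the temperature before the exponential. Subtracting the maximum is a standard numerical-stability trick and does not change the result. The function signature is an assumption made for this example, not a PyTorch API.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # shift for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax(logits, temperature=0.5)   # more confident
normal = softmax(logits, temperature=1.0)
flat = softmax(logits, temperature=5.0)    # closer to uniform
print(sharp, normal, flat)
```

Temperatures below 1 sharpen the distribution and temperatures above 1 flatten it; the ranking of the classes stays the same.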

PyTorch implementation:

# Softmax is typically combined with CrossEntropyLoss
# Don't apply softmax manually before nn.CrossEntropyLoss!

logits = model(x)  # Raw scores
loss = nn.CrossEntropyLoss()(logits, targets)

# For inference only:
probs = F.softmax(logits, dim=1)

6. LOSS FUNCTIONS

Cross-Entropy Loss

Standard loss for classification.

Formula:

L = -Σᵢ yᵢ log(ŷᵢ)

Where:

y is the true label (one-hot encoded)
ŷ is the predicted probability

For a single correct class c:

L = -log(ŷ_c)

Properties

Penalizes confident wrong predictions more than uncertain ones
Works well with softmax: Smooth gradients
Probabilistic interpretation: Maximizes likelihood
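
The first property can be seen directly from L = -log(ŷ_c). The probabilities below are made up to contrast a confident correct, an uncertain, and a confident wrong prediction for true class 1.

```python
import math

def cross_entropy(probs, target):
    """Loss for one example: negative log-probability of the true class."""
    return -math.log(probs[target])

target = 1
confident_right = [0.05, 0.90, 0.05]
uncertain = [0.30, 0.40, 0.30]
confident_wrong = [0.90, 0.05, 0.05]

for name, p in [("confident right", confident_right),
                ("uncertain", uncertain),
                ("confident wrong", confident_wrong)]:
    print(f"{name}: {cross_entropy(p, target):.3f}")
```

The loss grows without bound as the probability assigned to the true class approaches zero, so confident mistakes dominate the total loss.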

Cross-Entropy + Softmax Gradient

Beautiful property: The gradient simplifies!

∂L/∂z = ŷ - y

Where z is the logits (pre-softmax scores).
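
The identity ∂L/∂z = ŷ - y can be sanity-checked against finite differences. This is a minimal sketch for a single example with made-up logits; it is a numerical check, not a proof.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def loss(z, target):
    # Cross-entropy of softmax(z) against a single true class
    return -math.log(softmax(z)[target])

def analytic_grad(z, target):
    # y_hat - y, with y the one-hot vector for `target`
    return [p - (1.0 if i == target else 0.0)
            for i, p in enumerate(softmax(z))]

def numeric_grad(z, target, eps=1e-6):
    grads = []
    for i in range(len(z)):
        zp, zm = list(z), list(z)
        zp[i] += eps
        zm[i] -= eps
        grads.append((loss(zp, target) - loss(zm, target)) / (2 * eps))
    return grads

z, target = [1.5, -0.3, 0.8, 0.1], 2
print(analytic_grad(z, target))
print(numeric_grad(z, target))
```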

Mean Squared Error (MSE)

Not recommended for classification, but useful to understand.

Formula:

L = (1/2) Σᵢ (ŷᵢ - yᵢ)²

Gradient:

∂L/∂ŷ = ŷ - y

Why not for classification?

Cross-entropy has better gradients for probabilities
MSE doesn't penalize confident wrong predictions enough
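
To illustrate the gradient argument, the sketch below uses a simplified setting that is an assumption of this example: a single sigmoid output ŷ = σ(z) with target y = 1, rather than the 10-way softmax used for CIFAR-10. It compares the gradient with respect to the logit z under MSE and under (binary) cross-entropy when the model is confidently wrong.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_grad_wrt_logit(z, y):
    # L = 0.5 * (y_hat - y)**2 with y_hat = sigmoid(z)
    # dL/dz = (y_hat - y) * y_hat * (1 - y_hat)
    y_hat = sigmoid(z)
    return (y_hat - y) * y_hat * (1.0 - y_hat)

def ce_grad_wrt_logit(z, y):
    # L = -[y log(y_hat) + (1 - y) log(1 - y_hat)] with y_hat = sigmoid(z)
    # dL/dz = y_hat - y
    return sigmoid(z) - y

z, y = -8.0, 1.0  # predicted probability ~0.0003 for the true class
print("MSE gradient:", mse_grad_wrt_logit(z, y))
print("CE gradient: ", ce_grad_wrt_logit(z, y))
```

With MSE the extra σ'(z) factor shrinks the gradient exactly when the prediction is most wrong, while the cross-entropy gradient stays close to -1.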

PyTorch Implementation

# For classification (includes softmax internally)
criterion = nn.CrossEntropyLoss()

# Model outputs logits (raw scores), NOT probabilities!
logits = model(images)  # Shape: (batch_size, 10)
targets = labels         # Shape: (batch_size,) with values 0-9

loss = criterion(logits, targets)

Don't apply softmax before CrossEntropyLoss!

7. BACKPROPAGATION ALGORITHM

What is Backpropagation?

Backpropagation is an algorithm to compute gradients of the loss with respect to all parameters using the chain rule.

Goal: Compute ∂L/∂W and ∂L/∂b for all weights and biases.

Chain Rule Review

For composite functions:

If z = f(g(x)), then:
dz/dx = (dz/dg) × (dg/dx)

For neural networks with many layers:

∂L/∂W₁ = (∂L/∂Z₃) × (∂Z₃/∂H₂) × (∂H₂/∂Z₂) × (∂Z₂/∂H₁) × (∂H₁/∂Z₁) × (∂Z₁/∂W₁)

Here Zₖ is the pre-activation and Hₖ the activation of layer k.

Computational Graph Approach

Forward pass (left to right):

X → Z₁ → H₁ → Z₂ → H₂ → ... → Ŷ → L

Backward pass (right to left):

∂L/∂X ← ∂L/∂Z₁ ← ∂L/∂H₁ ← ∂L/∂Z₂ ← ∂L/∂H₂ ← ... ← ∂L/∂Ŷ ← ∂L/∂L
          │                 │                │
          └─ ∂L/∂W₁         └─ ∂L/∂W₂        └─ ∂L/∂W₃

∂L/∂X is not needed because the input is not a trainable parameter.

Backprop Formulas for One Layer

For a layer: Z = f(XW + b)

Given: ∂L/∂Z (gradient from next layer)

Compute:

∂L/∂X = [f'(XW + b) ⊙ ∂L/∂Z] Wᵀ
∂L/∂W = Xᵀ [f'(XW + b) ⊙ ∂L/∂Z]
∂L/∂b = Σᵢ [f'(XW + b) ⊙ ∂L/∂Z]ᵢ

Where:

⊙ denotes element-wise multiplication
Wᵀ is the transpose of W
Xᵀ is the transpose of X
Σᵢ sums over the batch dimension

Example: Backprop Through ReLU

Forward:

Z = ReLU(X) = max(0, X)

Backward:

∂L/∂X = ∂L/∂Z ⊙ ReLU'(X)
      = ∂L/∂Z ⊙ (X > 0)  # Mask: 1 where X > 0, else 0
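
A small numeric example of the mask, written in plain Python with made-up values. The gradient passes through unchanged where the input was positive and is zeroed elsewhere (including at exactly 0, following the convention above).

```python
def relu_forward(X):
    return [[max(0.0, v) for v in row] for row in X]

def relu_backward(dZ, X):
    # Keep the upstream gradient only where the forward input was > 0
    return [[g if v > 0 else 0.0 for g, v in zip(g_row, x_row)]
            for g_row, x_row in zip(dZ, X)]

X = [[-1.0, 2.0, 0.0],
     [3.0, -4.0, 5.0]]
dZ = [[0.1, 0.2, 0.3],
      [0.4, 0.5, 0.6]]

print(relu_forward(X))       # [[0.0, 2.0, 0.0], [3.0, 0.0, 5.0]]
print(relu_backward(dZ, X))  # [[0.0, 0.2, 0.0], [0.4, 0.0, 0.6]]
```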

Example: Backprop Through Linear Layer

Forward:

Z = XW + b

Backward:

∂L/∂X = (∂L/∂Z) Wᵀ
∂L/∂W = Xᵀ (∂L/∂Z)
∂L/∂b = Σᵢ (∂L/∂Z)ᵢ
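
The three formulas can be checked on a tiny example. The sketch below assumes a loss of the form L = Σ (Z ⊙ G) for a fixed matrix G, so that ∂L/∂Z = G exactly; the matrices and helper names are made up for illustration. One entry of ∂L/∂W is then compared against a finite-difference estimate.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def linear_backward(X, W, dZ):
    dX = matmul(dZ, transpose(W))        # (B, H) @ (H, D) -> (B, D)
    dW = matmul(transpose(X), dZ)        # (D, B) @ (B, H) -> (D, H)
    db = [sum(col) for col in zip(*dZ)]  # sum over the batch dimension
    return dX, dW, db

def loss(X, W, b, G):
    Z = [[z + bj for z, bj in zip(row, b)] for row in matmul(X, W)]
    return sum(z * g for z_row, g_row in zip(Z, G)
               for z, g in zip(z_row, g_row))

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # B=3, D=2
W = [[0.5, -1.0], [1.5, 2.0]]             # D=2, H=2
b = [0.1, -0.2]
G = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # upstream gradient dL/dZ

dX, dW, db = linear_backward(X, W, G)

# Finite-difference check for the entry W[0][1]
eps = 1e-5
W_plus = [row[:] for row in W]
W_minus = [row[:] for row in W]
W_plus[0][1] += eps
W_minus[0][1] -= eps
numeric = (loss(X, W_plus, b, G) - loss(X, W_minus, b, G)) / (2 * eps)
print(dW[0][1], numeric)
```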

Full Network Backpropagation

For the 3-layer FFNN:

# Forward pass
Z1 = X @ W1 + b1
H1 = relu(Z1)
Z2 = H1 @ W2 + b2
H2 = relu(Z2)
Z3 = H2 @ W3 + b3
Y_hat = softmax(Z3)
L = cross_entropy(Y_hat, Y)

# Backward pass
dZ3 = (Y_hat - Y) / X.shape[0]  # Softmax + CrossEntropy gradient (Y one-hot, loss averaged over the batch)
dW3 = H2.T @ dZ3
db3 = dZ3.sum(dim=0)

dH2 = dZ3 @ W3.T
dZ2 = dH2 * (Z2 > 0)  # ReLU gradient
dW2 = H1.T @ dZ2
db2 = dZ2.sum(dim=0)

dH1 = dZ2 @ W2.T
dZ1 = dH1 * (Z1 > 0)  # ReLU gradient
dW1 = X.T @ dZ1
db1 = dZ1.sum(dim=0)

Note: PyTorch does this automatically with loss.backward()!

8. GRADIENT DESCENT OPTIMIZATION

Gradient Descent Intuition

Goal: Minimize loss L(θ) by adjusting parameters θ.

Key idea: Move in the direction opposite to the gradient.

θ_new = θ_old - α ∇L(θ_old)

Where:

α is the learning rate
∇L(θ) is the gradient of loss with respect to parameters
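
The update rule in action on a one-dimensional toy problem. The loss L(θ) = (θ - 3)², the starting point, and the learning rate are assumptions chosen for illustration.

```python
def grad(theta):
    # Gradient of L(theta) = (theta - 3)**2
    return 2.0 * (theta - 3.0)

def gradient_descent(theta, lr, steps):
    for _ in range(steps):
        theta = theta - lr * grad(theta)  # theta_new = theta_old - lr * gradient
    return theta

theta = gradient_descent(theta=0.0, lr=0.1, steps=100)
print(theta)  # very close to the minimizer 3.0
```

Each step multiplies the distance to the minimum by (1 - 2·lr) = 0.8, so the error shrinks geometrically.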

Gradient as Steepest Ascent

The gradient ∇L(θ) points in the direction of steepest increase of L.

Therefore, -∇L(θ) points in the direction of steepest decrease.

Batch Gradient Descent

Compute gradient using entire dataset:

∇L(θ) = (1/N) Σᵢ ∇L(θ; xᵢ, yᵢ)

Pros

Accurate gradient
Smooth convergence

Cons

Slow for large datasets
Requires all data in memory

Stochastic Gradient Descent (SGD)

Compute gradient using one random example:

∇L(θ) ≈ ∇L(θ; xᵢ, yᵢ)

Pros

Fast updates
Can escape local minima (noise helps!)

Cons

Noisy gradient
Erratic convergence

Mini-Batch Gradient Descent

Compute gradient using a small batch:

∇L(θ) ≈ (1/B) Σᵢ₌₁ᴮ ∇L(θ; xᵢ, yᵢ)

Best of both worlds:

More accurate than SGD
Faster than full batch
Can use GPU parallelization

Typical batch sizes: 32, 64, 128, 256, 512
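
The sketch below illustrates mini-batch gradients on a toy problem: estimating a single parameter θ with per-example loss ½(θ - xᵢ)², whose gradient is θ - xᵢ. The synthetic data, seed, batch size, and learning rate are assumptions for the example. It shows that individual mini-batch gradients are noisy estimates whose average (over equal-sized batches) equals the full-batch gradient, and then runs a few epochs of mini-batch updates.

```python
import random

def batch_grad(theta, batch):
    """Mean gradient of 0.5 * (theta - x)**2 over the batch."""
    return sum(theta - x for x in batch) / len(batch)

rng = random.Random(0)
data = [rng.gauss(5.0, 2.0) for _ in range(1000)]
data_mean = sum(data) / len(data)

theta, lr, batch_size = 0.0, 0.1, 100
full_grad = batch_grad(theta, data)

# Equal-sized mini-batches: their gradients average to the full gradient
batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
mini_grads = [batch_grad(theta, b) for b in batches]
avg_mini_grad = sum(mini_grads) / len(mini_grads)
print(full_grad, avg_mini_grad)

# Mini-batch gradient descent: reshuffle each epoch, one update per batch
for epoch in range(5):
    rng.shuffle(data)
    for i in range(0, len(data), batch_size):
        theta -= lr * batch_grad(theta, data[i:i + batch_size])
print(theta, data_mean)
```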

Learning Rate Selection

Too large: Overshoots, diverges

Loss oscillates or explodes

Too small: Slow convergence

Takes forever to converge

Just right: Smooth, fast convergence

Steady decrease to minimum

Typical values: 0.01, 0.001, 0.0001
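
The three regimes can be reproduced on the toy loss L(θ) = θ², whose gradient is 2θ. The specific learning rates below are chosen only to make the behaviour visible on this problem; they are not recommendations for CIFAR-10.

```python
def run(lr, steps=50, theta=1.0):
    """Gradient descent on L(theta) = theta**2, starting from theta = 1."""
    for _ in range(steps):
        theta -= lr * 2.0 * theta
    return theta

too_large = run(lr=1.1)    # each step multiplies theta by -1.2: diverges
too_small = run(lr=0.001)  # each step multiplies theta by 0.998: barely moves
good = run(lr=0.3)         # each step multiplies theta by 0.4: converges fast

print(too_large, too_small, good)
```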

Adam Optimizer

Adaptive Moment Estimation: one of the most widely used optimizers.

Key ideas

Momentum: Use exponential moving average of gradients
Adaptive learning rates: Different rates for each parameter

Formula:

m_t = β₁ m_{t-1} + (1-β₁) g_t        # First moment (momentum)
v_t = β₂ v_{t-1} + (1-β₂) g_t²       # Second moment (uncentered variance)
m̂_t = m_t / (1-β₁ᵗ)                  # Bias correction
v̂_t = v_t / (1-β₂ᵗ)                  # Bias correction
θ_t = θ_{t-1} - α m̂_t / (√v̂_t + ε)   # Update

Default hyperparameters:

lr = 0.001
betas = (0.9, 0.999)
eps = 1e-8
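
For intuition, here is a minimal single-parameter sketch of the Adam update, including the standard bias-correction terms that compensate for the moments being initialized at zero. The function name adam_minimize and the toy loss (θ - 3)² are assumptions for illustration; in practice use torch.optim.Adam.

```python
import math

def adam_minimize(grad_fn, theta, lr=0.001, betas=(0.9, 0.999),
                  eps=1e-8, steps=1000):
    beta1, beta2 = betas
    m = 0.0  # first moment estimate
    v = 0.0  # second moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Toy problem: minimize (theta - 3)**2, gradient 2 * (theta - 3)
theta = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0, lr=0.1, steps=500)
print(theta)
```

Early steps have magnitude close to lr regardless of the gradient scale, which is one reason Adam tends to be less sensitive to the learning rate than plain SGD.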

PyTorch implementation:

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

Optimization Algorithm Comparison

Algorithm   Characteristics
SGD         Simple, well-understood. Requires careful LR tuning. Can escape sharp minima.
Adam        Adapts LR automatically. Works well with defaults. Faster convergence. More memory.

9. TRAINING PROCESS

Training Loop Structure

for epoch in range(num_epochs):
    # Training phase
    model.train()
    for images, labels in train_loader:
        # 1. Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # 2. Backward pass
        optimizer.zero_grad()
        loss.backward()

        # 3. Update weights
        optimizer.step()

    # Validation phase
    model.eval()
    with torch.no_grad():
        for images, labels in val_loader:
            outputs = model(images)
            # Compute validation metrics

Important Steps Explained

1. model.train() vs model.eval()

model.train()  # Enable dropout, batch norm training mode
model.eval()   # Disable dropout, batch norm inference mode

2. optimizer.zero_grad()

PyTorch accumulates gradients. Must zero them each iteration!

optimizer.zero_grad()  # Clear previous gradients
loss.backward()        # Compute new gradients
optimizer.step()       # Update weights

3. torch.no_grad()

Don't compute gradients during validation (saves memory):

with torch.no_grad():
    outputs = model(images)  # No gradient tracking

Early Stopping

Problem: Model may overfit if trained too long.

Solution: Stop when validation performance stops improving.

Implementation:

import copy

best_val_acc = 0.0
best_state = None
patience = 5
epochs_without_improvement = 0

for epoch in range(max_epochs):
    train(...)               # placeholder: run one training epoch
    val_acc = validate(...)  # placeholder: returns validation accuracy

    if val_acc > best_val_acc:
        best_val_acc = val_acc
        epochs_without_improvement = 0
        best_state = copy.deepcopy(model.state_dict())  # Save best weights
    else:
        epochs_without_improvement += 1

    if epochs_without_improvement >= patience:
        print("Early stopping!")
        break

if best_state is not None:
    model.load_state_dict(best_state)  # Restore best model
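
The stopping logic can be tested in isolation, without a model, by feeding it a list of validation accuracies. The helper name and the accuracy values below are hypothetical.

```python
def early_stopping_epoch(val_accs, patience=5):
    """Return (stop_epoch, best_epoch, best_acc); epochs are counted from 0."""
    best_acc = float("-inf")
    best_epoch = -1
    waited = 0
    for epoch, acc in enumerate(val_accs):
        if acc > best_acc:
            best_acc, best_epoch, waited = acc, epoch, 0
        else:
            waited += 1
        if waited >= patience:
            return epoch, best_epoch, best_acc
    # Never triggered: training ran for all epochs
    return len(val_accs) - 1, best_epoch, best_acc

# Hypothetical validation accuracies, one per epoch
val_accs = [0.40, 0.45, 0.50, 0.52, 0.51, 0.52, 0.50, 0.49, 0.51, 0.53]

print(early_stopping_epoch(val_accs, patience=3))  # (6, 3, 0.52)
print(early_stopping_epoch(val_accs, patience=5))  # (8, 3, 0.52)
```

Note that in this made-up run the best accuracy (0.53) occurs at epoch 9, after both stopping points: a small patience can stop training before a late improvement.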

Learning Rate Scheduling

Gradually decrease learning rate during training.

Common schedules:

# Step decay: multiply by 0.1 every 30 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                             step_size=30,
                                             gamma=0.1)

# Cosine annealing: smooth decrease
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                        T_max=100)

# Reduce on plateau: decrease when validation stops improving
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
                                                        patience=5)

Usage:

for epoch in range(num_epochs):
    train(...)
    val_loss = validate(...)
    scheduler.step()            # StepLR / CosineAnnealingLR
    # scheduler.step(val_loss)  # ReduceLROnPlateau needs the monitored metric
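
To see what the first two schedules do to the learning rate, the sketch below evaluates their closed-form expressions in plain Python. The function names are hypothetical; the formulas are the usual step-decay and cosine-annealing expressions for epochs up to T_max, using the same step_size, gamma, and T_max as above and a minimum learning rate of 0.

```python
import math

def step_lr(base_lr, epoch, step_size=30, gamma=0.1):
    # Multiply by gamma once every step_size epochs
    return base_lr * gamma ** (epoch // step_size)

def cosine_lr(base_lr, epoch, t_max=100, eta_min=0.0):
    # Smooth decay from base_lr (epoch 0) to eta_min (epoch t_max)
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2

for epoch in (0, 29, 30, 50, 60, 100):
    print(epoch, step_lr(0.001, epoch), cosine_lr(0.001, epoch))
```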

Monitoring Training

Track these metrics:

history = {
    'train_loss': [],
    'train_acc': [],
    'val_loss': [],
    'val_acc': []
}

# Each epoch
history['train_loss'].append(train_loss)
history['train_acc'].append(train_acc)
history['val_loss'].append(val_loss)
history['val_acc'].append(val_acc)

10. EVALUATION METRICS

Accuracy

Most intuitive metric for classification.

Formula:

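Accuracy = (Number of correct predictions) / (Total predictions)

For binary classification this equals (TP + TN) / (TP + TN + FP + FN); for the 10 CIFAR-10 classes it is the sum of the confusion-matrix diagonal divided by the total count.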

PyTorch implementation:

def compute_accuracy(model, dataloader):
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for images, labels in dataloader:
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    return correct / total

Limitations

Doesn't show per-class performance
Misleading for imbalanced datasets
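
The second limitation is easy to demonstrate with made-up labels. CIFAR-10 itself is balanced, so the 90/10 split below is purely hypothetical: a model that always predicts class 0 reaches 90% accuracy while never recognising class 1.

```python
def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def per_class_accuracy(preds, labels, num_classes):
    correct = [0] * num_classes
    total = [0] * num_classes
    for p, y in zip(preds, labels):
        total[y] += 1
        correct[y] += int(p == y)
    # NaN for classes that never appear in the labels
    return [c / t if t else float("nan") for c, t in zip(correct, total)]

labels = [0] * 90 + [1] * 10  # imbalanced: 90% class 0
preds = [0] * 100             # degenerate model: always predicts class 0

print(accuracy(preds, labels))               # 0.9
print(per_class_accuracy(preds, labels, 2))  # [1.0, 0.0]
```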

Per-Class Accuracy

Compute accuracy for each class separately.

def per_class_accuracy(model, dataloader, num_classes=10):
    model.eval()
    class_correct = [0] * num_classes
    class_total = [0] * num_classes

    with torch.no_grad():
        for images, labels in dataloader:
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)

            for i in range(len(labels)):
                label = labels[i].item()  # plain int for list indexing
                class_correct[label] += int(predicted[i].item() == label)
                class_total[label] += 1

    return [class_correct[i] / class_total[i] for i in range(num_classes)]

Top-K Accuracy

Measures if correct class is in top K predictions.

def top_k_accuracy(model, dataloader, k=5):
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for images, labels in dataloader:
            outputs = model(images)
            _, top_k = torch.topk(outputs, k, dim=1)

            for i in range(len(labels)):
                if labels[i] in top_k[i]:
                    correct += 1
                total += 1

    return correct / total

11. CONFUSION MATRIX ANALYSIS

What is a Confusion Matrix?

A confusion matrix shows the performance of a classification model by comparing predicted vs actual labels.

Structure (for 10 classes)

               Predicted Class
            0   1   2   3  ...  9
          ┌───────────────────────┐
        0 │ ✓   ✗   ✗   ✗  ...  ✗ │
        1 │ ✗   ✓   ✗   ✗  ...  ✗ │
Actual  2 │ ✗   ✗   ✓   ✗  ...  ✗ │
Class   3 │ ✗   ✗   ✗   ✓  ...  ✗ │
      ... │ ...                   │
        9 │ ✗   ✗   ✗   ✗  ...  ✓ │
          └───────────────────────┘

Diagonal (✓): Correct predictions
Off-diagonal (✗): Misclassifications. Entry (i, j) with i ≠ j counts images of class i predicted as class j: a false negative for class i and a false positive for class j.

Computing Confusion Matrix

def compute_confusion_matrix(model, dataloader, num_classes=10):
    model.eval()
    confusion_matrix = torch.zeros(num_classes, num_classes, dtype=torch.int64)

    with torch.no_grad():
        for images, labels in dataloader:
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)

            for true, pred in zip(labels, predicted):
                confusion_matrix[true, pred] += 1

    return confusion_matrix

Interpreting Confusion Matrix

What to look for:

Strong diagonal: High accuracy
Weak diagonal: Poor overall performance
Hot spots off diagonal: Specific confusion patterns

Example insights:

If cm[3, 5] is high (cat predicted as dog):
→ Model confuses cats and dogs
→ Maybe add more training data for these classes
→ Or use data augmentation

If cm[2, 2] / cm[2, :].sum() is low (few bird images classified correctly):
→ Model struggles with birds overall
→ Check if bird images are underrepresented (cm[2, :].sum() is the number of bird images)
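
These checks can be automated. The sketch below works on a confusion matrix given as a nested list (rows = actual class, columns = predicted class); the 3-class matrix and the helper names are made up for illustration.

```python
def per_class_recall(cm):
    """Fraction of each actual class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else float("nan")
            for i, row in enumerate(cm)]

def most_confused_pairs(cm, top=3):
    """Largest off-diagonal entries as (actual, predicted, count)."""
    n = len(cm)
    pairs = [(cm[i][j], i, j) for i in range(n) for j in range(n) if i != j]
    pairs.sort(reverse=True)
    return [(i, j, count) for count, i, j in pairs[:top]]

# Hypothetical 3-class confusion matrix
cm = [[50, 2, 3],
      [4, 30, 16],
      [5, 20, 25]]

print(per_class_recall(cm))
print(most_confused_pairs(cm))  # [(2, 1, 20), (1, 2, 16), (2, 0, 5)]
```

Here classes 1 and 2 are confused with each other in both directions, which is the kind of pattern cats and dogs often show on CIFAR-10.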

Normalized Confusion Matrix

Show percentages instead of counts:

def normalize_confusion_matrix(cm):
    # cm: integer tensor returned by compute_confusion_matrix
    row_sums = cm.sum(dim=1, keepdim=True).clamp(min=1)  # avoid division by zero
    return cm.float() / row_sums

Each non-empty row sums to 1.0 (100%).

12. HYPERPARAMETER TUNING

Key Hyperparameters

Architecture

Number of hidden layers
Hidden layer sizes
Activation functions

Training

Learning rate
Batch size
Number of epochs
Optimizer choice

Regularization

Dropout rate
Weight decay (L2 regularization)

Systematic Tuning Process

1. Start with baseline

baseline = {
    'hidden_dims': (512, 256),
    'dropout': 0.1,
    'lr': 0.001,
    'batch_size': 256,
    'epochs': 20
}

2. Change ONE variable at a time

# Experiment 1: Larger network
config1 = baseline.copy()
config1['hidden_dims'] = (1024, 512, 256)

# Experiment 2: Lower learning rate
config2 = baseline.copy()
config2['lr'] = 0.0001

# Experiment 3: More dropout
config3 = baseline.copy()
config3['dropout'] = 0.3

3. Compare results

Config        Val Acc  Time/Epoch
Baseline      54.2%    12s
Larger net    56.0%    18s
Lower LR      51.8%    12s
More dropout  53.1%    12s

Learning Rate Guidelines

Start with: 0.001 (Adam) or 0.01 (SGD)

Too high indicators

Loss increases or oscillates wildly
NaN values appear

Too low indicators

Very slow decrease in loss
Taking many epochs to converge

Finding good LR

# Try: [0.1, 0.01, 0.001, 0.0001, 0.00001]
# Pick the largest that doesn't diverge

Hidden Layer Size Guidelines

Rule of thumb:

Input size: 3072
Hidden 1:   512-2048  (smaller than input)
Hidden 2:   256-512   (smaller than Hidden 1)
Output:     10        (number of classes)

More neurons

✓ More capacity to learn
✗ More parameters → slower, more memory
✗ More prone to overfitting

Fewer neurons

✓ Faster training
✓ Less overfitting
✗ May underfit (can't learn complex patterns)

Dropout Guidelines

Typical values: 0.1 - 0.5

class FFNN(nn.Module):
    def __init__(self, dropout=0.2):
        super().__init__()
        self.fc1 = nn.Linear(3072, 512)
        self.dropout1 = nn.Dropout(dropout)
        self.fc2 = nn.Linear(512, 256)
        self.dropout2 = nn.Dropout(dropout)
        self.fc3 = nn.Linear(256, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = self.dropout1(x)
        x = F.relu(self.fc2(x))
        x = self.dropout2(x)
        x = self.fc3(x)
        return x

Higher dropout (0.4-0.5)

Use when overfitting is severe
More regularization

Lower dropout (0.1-0.2)

Use when model is underfitting
Less regularization
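
For intuition about what the dropout layer does, here is a minimal pure-Python sketch of "inverted" dropout, the variant nn.Dropout implements: during training each activation is zeroed with probability p and the survivors are scaled by 1/(1-p), so the expected value is unchanged; at evaluation time the input passes through untouched. The function signature and seed are assumptions for the example.

```python
import random

def dropout(values, p, training=True, rng=random):
    if not training or p == 0.0:
        return list(values)  # identity at evaluation time
    keep = 1.0 - p
    # Zero each value with probability p, rescale the rest by 1 / keep
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)
x = [1.0] * 10000

train_out = dropout(x, p=0.3, training=True, rng=rng)
eval_out = dropout(x, p=0.3, training=False)

print(sum(v == 0.0 for v in train_out) / len(x))  # roughly 0.3
print(sum(train_out) / len(x))                    # roughly 1.0
```

This is why model.eval() matters: it switches the layer from the random training behaviour to the deterministic identity.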

Batch Size Guidelines

Common values: 32, 64, 128, 256, 512

Larger batches

✓ More stable gradients
✓ Better GPU utilization
✗ Less noise → may converge to sharp minima
✗ More memory

Smaller batches

✓ More noise → better exploration
✓ Less memory
✗ Noisier gradients
✗ Slower convergence

For CIFAR-10: 256 is a good default

13. COMMON PITFALLS AND SOLUTIONS

Pitfall 1: Not Shuffling Training Data

Problem: Model sees data in same order every epoch.

Solution:

train_loader = DataLoader(train_ds, batch_size=256,
                          shuffle=True)  # ← IMPORTANT!

Pitfall 2: Forgetting to Normalize

Problem: Raw pixel values [0, 255] cause unstable training.

Solution:

transform = transforms.Compose([
    transforms.ToTensor(),  # Scales to [0, 1]
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616))  # ← IMPORTANT!
])

Pitfall 3: Using Test Set for Hyperparameter Tuning

Problem: Tuning on the test set leaks information into your choices, so the reported test performance becomes overly optimistic.

Solution: Use validation set for tuning, test set ONLY for final evaluation.

Training set   → Train model
Validation set → Tune hyperparameters
Test set       → Report final performance (once!)

Pitfall 4: Not Using model.eval() During Validation

Problem: Dropout stays active, giving inconsistent results.

Solution:

model.eval()  # ← Disables dropout
with torch.no_grad():
    # Validation code

Pitfall 5: Forgetting optimizer.zero_grad()

Problem: Gradients accumulate, causing wrong updates.

Solution:

for images, labels in train_loader:
    optimizer.zero_grad()  # ← MUST come before backward()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()

Pitfall 6: Applying Softmax Before CrossEntropyLoss

Problem: nn.CrossEntropyLoss applies log-softmax internally, so an explicit softmax squashes the scores twice and weakens the gradients.

Wrong:

outputs = F.softmax(model(images), dim=1)
loss = nn.CrossEntropyLoss()(outputs, labels)  # ✗ WRONG!

Correct:

logits = model(images)  # Raw scores
loss = nn.CrossEntropyLoss()(logits, labels)  # ✓ CORRECT!

Pitfall 7: Incorrect Input Shape

Problem: The first linear layer expects flattened vectors of length 3072, not (3, 32, 32) image tensors.

Wrong:

# images shape: (batch_size, 3, 32, 32)
outputs = model(images)  # ✗ WRONG if forward() does not flatten first

Correct:

# Flatten in forward()
def forward(self, x):
    x = x.view(x.size(0), -1)  # (batch, 3, 32, 32) → (batch, 3072)
    ...

Pitfall 8: Overfitting

Symptoms:

Training accuracy high, validation accuracy low
Large gap between training and validation loss

Solutions:

# 1. Add dropout
model = FFNN(dropout=0.3)

# 2. Add weight decay
optimizer = torch.optim.Adam(model.parameters(),
                             lr=0.001,
                             weight_decay=1e-4)

# 3. Early stopping
if val_acc_not_improving_for_N_epochs:
    stop_training()

# 4. Get more training data (if possible)

Pitfall 9: Underfitting

Symptoms:

Both training and validation accuracy low
Loss decreases very slowly

Solutions:

# 1. Larger network
model = FFNN(hidden_dims=(1024, 512, 256))

# 2. Train longer
num_epochs = 50

# 3. Retune the learning rate (a lower value can sometimes help if training is unstable)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

# 4. Remove/reduce regularization
model = FFNN(dropout=0.0)  # No dropout

Pitfall 10: Exploding/Vanishing Gradients

Symptoms:

Loss becomes NaN (exploding)
Loss stops decreasing (vanishing)

Solutions:

# 1. Gradient clipping (call after loss.backward() and before optimizer.step())
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# 2. Lower learning rate
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

# 3. Better weight initialization (PyTorch does this by default)
# 4. Use batch normalization (for deeper networks)
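
To show what clipping by norm does, the sketch below rescales a flat list of gradient values so that their overall L2 norm does not exceed max_norm. It mirrors the idea behind torch.nn.utils.clip_grad_norm_ but is a simplified stand-in written for illustration: the function name, the flat-list input, and the small 1e-6 stabiliser are assumptions of this example.

```python
import math

def clip_grad_norm(grads, max_norm):
    """Return (clipped gradients, original L2 norm)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale down only if the norm exceeds max_norm; never scale up
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

clipped, norm = clip_grad_norm([3.0, 4.0], max_norm=1.0)
print(norm)     # 5.0
print(clipped)  # approximately [0.6, 0.8]

unchanged, _ = clip_grad_norm([0.3, 0.4], max_norm=1.0)
print(unchanged)  # [0.3, 0.4]
```

Clipping preserves the direction of the gradient and only limits its length, which keeps a single bad batch from producing a huge update.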

14. IMPLEMENTATION GUIDE

Complete FFNN Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNN(nn.Module):
    def __init__(self, input_dim=3072, hidden_dims=(512, 256),
                 dropout=0.1, num_classes=10):
        super().__init__()

        layers = []
        in_dim = input_dim

        for hidden_dim in hidden_dims:
            layers.append(nn.Linear(in_dim, hidden_dim))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(dropout))
            in_dim = hidden_dim

        layers.append(nn.Linear(in_dim, num_classes))
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        # Flatten image: (batch, 3, 32, 32) → (batch, 3072)
        x = x.view(x.size(0), -1)
        return self.network(x)

# Create model
model = FFNN(hidden_dims=(512, 256), dropout=0.1)
print(f"Total parameters: {sum(p.numel() for p in model.parameters())}")

Training Function

def train_one_epoch(model, train_loader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0

    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Metrics
        total_loss += loss.item() * images.size(0)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    avg_loss = total_loss / total
    accuracy = correct / total
    return avg_loss, accuracy

Validation Function

def validate(model, val_loader, criterion, device):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)

            outputs = model(images)
            loss = criterion(outputs, labels)

            total_loss += loss.item() * images.size(0)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    avg_loss = total_loss / total
    accuracy = correct / total
    return avg_loss, accuracy

Complete Training Loop

# Setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = FFNN(hidden_dims=(512, 256), dropout=0.1).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

# Training
num_epochs = 20
history = {'train_loss': [], 'train_acc': [], 'val_loss': [], 'val_acc': []}

for epoch in range(num_epochs):
    # Train
    train_loss, train_acc = train_one_epoch(model, train_loader,
                                             optimizer, criterion, device)

    # Validate
    val_loss, val_acc = validate(model, val_loader, criterion, device)

    # Record
    history['train_loss'].append(train_loss)
    history['train_acc'].append(train_acc)
    history['val_loss'].append(val_loss)
    history['val_acc'].append(val_acc)

    # Print
    print(f"Epoch {epoch+1}/{num_epochs}")
    print(f"  Train Loss: {train_loss:.4f}, Train Acc: {train_acc*100:.2f}%")
    print(f"  Val Loss:   {val_loss:.4f}, Val Acc:   {val_acc*100:.2f}%")

Gathering Misclassifications

def gather_misclassifications(model, dataloader, device, max_samples=16):
    model.eval()
    misclassified_images = []
    misclassified_preds = []
    misclassified_labels = []

    with torch.no_grad():
        for images, labels in dataloader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)

            # Find misclassified
            mask = predicted != labels

            misclassified_images.extend(images[mask].cpu())
            misclassified_preds.extend(predicted[mask].cpu().tolist())
            misclassified_labels.extend(labels[mask].cpu().tolist())

            if len(misclassified_images) >= max_samples:
                break

    return (misclassified_images[:max_samples],
            misclassified_preds[:max_samples],
            misclassified_labels[:max_samples])

15. ASSIGNMENT REQUIREMENTS SUMMARY

Part A: In-Lab (2%)

Requirements

1. Load CIFAR-10 (0.25%)

✓ Create train/val/test splits
✓ Normalize images
✓ Use DataLoader

2. Implement FFNN (1.0%)

✓ Define fully connected network
✓ Train to at least 50% accuracy

3. Plot Training Curves (0.25%)

✓ Training vs validation loss
✓ Training vs validation accuracy

4. Confusion Matrix (0.25%)

✓ Compute on validation set
✓ Show CIFAR-10 class names
✓ Diagonal should be notably larger than off-diagonal

5. Misclassification Grid (0.25%)

✓ Display at least 16 misclassified examples
✓ Show predicted vs true labels

Part B: Take-Home (3%)

Requirements

1. Modifications (1.0%)

Change at least 1 hyperparameter
Justify your choice

2. Updated Plots (1.0%)

Same plots as Part A
Compare baseline vs modified

3. Short Report (1.0%)

1-2 pages
Discuss training dynamics
Analyze confusion patterns
Explain impact of your changes

Expected Results

Baseline FFNN Performance

Architecture: (512, 256)
Validation Accuracy: 50-55%
Training Time: ~12s/epoch (GPU)
Parameters: ~1.7M

Improved Configuration

Architecture: (1024, 512, 256)
Validation Accuracy: 55-58%
Training Time: ~18s/epoch (GPU)
Parameters: ~3.8M

Note: FFNN performance on CIFAR-10 is limited! CNNs (Assignment 3) achieve 80-90%.

Submission Checklist

[ ] Jupyter notebook with all code and outputs
[ ] PDF report (1-2 pages) for Part B
[ ] Name and student ID included
[ ] Submit within 1 week of lab
