CNNs for Image Classification

1. INTRODUCTION TO CNNs

What is a Convolutional Neural Network?

A Convolutional Neural Network (CNN) is a specialized type of neural network designed for processing grid-like data, particularly images. CNNs are inspired by the visual cortex of animals and are the foundation of modern computer vision.

Key Characteristics

Spatial hierarchy: CNNs learn hierarchical patterns from low-level features (edges) to high-level concepts (objects)
Parameter sharing: Same filters are applied across the entire image
Translation equivariance: A feature is detected wherever it appears in the image; pooling then adds a degree of invariance to small shifts
Sparse connectivity: Each neuron connects only to a local region of the input

INPUT IMAGE (32×32×3)
        ↓
    ┌───────┐
    │  CONV │ → Detect edges, colors
    └───────┘
        ↓
    ┌───────┐
    │  POOL │ → Reduce size
    └───────┘
        ↓
    ┌───────┐
    │  CONV │ → Detect shapes
    └───────┘
        ↓
    ┌───────┐
    │  POOL │ → Reduce size
    └───────┘
        ↓
    ┌───────┐
    │  CONV │ → Detect objects
    └───────┘
        ↓
    ┌───────┐
    │   FC  │ → Classification
    └───────┘
        ↓
    OUTPUT (10 classes)

2. WHY CNNs FOR IMAGES?

The Problem with Fully Connected Networks

When we flatten an image for a fully connected network:

Input: 32×32×3 CIFAR-10 image
Flattened: 3,072 dimensional vector
Hidden layer: 1,024 neurons
Parameters: 3,072 × 1,024 = 3,145,728 weights (plus 1,024 biases) in the first layer alone!

Problems with FC Networks

Problem	Description
Loss of spatial structure	Pixels that are spatially close are treated the same as pixels that are far apart
Huge parameter count	Millions of parameters in first layer alone
No translation invariance	Must learn same feature at every possible position
Overfitting	Too many parameters lead to poor generalization

How CNNs Solve These Problems

Preserve spatial structure: Input shape [batch, channels, height, width] - never flattened!
Parameter efficiency: A 3×3 conv with 3 input channels and 64 filters has only 1,792 parameters
Translation equivariance: The same filter detects edges wherever they appear
Better generalization: Fewer parameters usually means less overfitting

3. CNN ARCHITECTURE COMPONENTS

┌────────────────────────────────────────┐
│          STANDARD CNN ARCHITECTURE      │
├────────────────────────────────────────┤
│  Input Image (32×32×3)                 │
│           ↓                             │
│  ┌──────────────────────┐               │
│  │ Conv → ReLU → Pool   │ ×N            │
│  └──────────────────────┘               │
│           ↓                             │
│  ┌──────────────────────┐               │
│  │ Conv → ReLU → Pool   │ ×M            │
│  └──────────────────────┘               │
│           ↓                             │
│  Global Average Pooling                │
│           ↓                             │
│  Fully Connected → Output (10 classes) │
└────────────────────────────────────────┘

Layer Types

Convolutional layers: Extract spatial features
Activation layers: Introduce non-linearity (ReLU)
Pooling layers: Downsample spatial dimensions
Normalization layers: Stabilize training (BatchNorm)
Dropout layers: Regularization
Fully connected layers: Final classification

4. CONVOLUTIONAL LAYERS

What is Convolution?

Convolution is a mathematical operation that slides a small filter (kernel) over an input to produce a feature map.

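Output[i,j] = Σ_m Σ_n Input[i+m, j+n] × Kernel[m,n]

Strictly speaking, this is cross-correlation (the kernel is not flipped), which is what deep learning frameworks compute under the name "convolution". The sums run over the kernel's rows m and columns n.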
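
The sliding-window sum above can be sketched in plain Python for a single-channel input with stride 1 and no padding. The function name `conv2d_valid` and the example kernel are illustrative assumptions, not part of any library:

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            total = 0.0
            # Output[i,j] = sum over m, n of Input[i+m, j+n] * Kernel[m,n]
            for m in range(kh):
                for n in range(kw):
                    total += image[i + m][j + n] * kernel[m][n]
            output[i][j] = total
    return output


# A 4×4 image with a vertical edge between columns 1 and 2
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 2×2 kernel that responds to left-to-right intensity increases
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d_valid(image, kernel)
```

The feature map is non-zero only in the column where the window straddles the edge, which is the sense in which a filter "detects" a pattern.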

Key Parameters

Parameter	Description	Common Values
Kernel size (k)	Size of the sliding window	3×3, 5×5, 7×7
Stride (s)	How many pixels to slide	1, 2
Padding (p)	Add zeros around border	0, 1, 2
Filters (out_channels)	Number of output feature maps	32, 64, 128, 256

Output Size Formula

output_size = ⌊(input_size + 2×padding - kernel_size) / stride⌋ + 1

For a 3×3 kernel with padding=1 and stride=1, the output size equals the input size!
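
As a quick check of the formula, the helper below evaluates it for a few common settings. The name `conv_output_size` is an illustrative assumption; the arithmetic is the formula above applied along one spatial dimension:

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output size along one dimension for a convolution or pooling layer."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


same = conv_output_size(32, kernel_size=3, stride=1, padding=1)     # size preserved
no_pad = conv_output_size(32, kernel_size=3)                         # shrinks by 2
strided = conv_output_size(32, kernel_size=3, stride=2, padding=1)  # halved
pooled = conv_output_size(32, kernel_size=2, stride=2)               # 2×2 pooling
```

The same formula covers pooling layers, since a 2×2 pool is a window of size 2 moved with stride 2.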

How Filters Work

Early layers (low-level): Edge detectors, color blobs, simple textures
Middle layers (mid-level): Corners, curves, simple shapes
Deep layers (high-level): Object parts, complex patterns, semantic concepts

Parameter Count

Parameters = (kernel_h × kernel_w × in_channels + 1) × out_channels

Example: Conv2d(3, 64, kernel_size=3)
= (3 × 3 × 3 + 1) × 64 = 1,792 parameters
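
The count above, and the fully connected comparison from Section 2, can be reproduced with two small helpers. The function names are illustrative assumptions:

```python
def conv2d_params(in_channels, out_channels, kernel_h, kernel_w, bias=True):
    """Learnable parameters in a 2D convolution layer."""
    weights_per_filter = kernel_h * kernel_w * in_channels
    return (weights_per_filter + (1 if bias else 0)) * out_channels


def linear_params(in_features, out_features, bias=True):
    """Learnable parameters in a fully connected layer."""
    return (in_features + (1 if bias else 0)) * out_features


first_conv = conv2d_params(3, 64, 3, 3)              # Conv2d(3, 64, kernel_size=3)
second_conv = conv2d_params(64, 64, 3, 3)            # Conv2d(64, 64, kernel_size=3)
fc_weights = linear_params(3072, 1024, bias=False)   # weights only
fc_total = linear_params(3072, 1024)                 # weights plus biases
```

Note that the count depends on the number of input channels but not on the image size, which is why deeper conv layers (64 input channels) are larger than the first one while remaining far smaller than a dense layer.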

5. POOLING LAYERS

Purpose of Pooling

Reduce spatial dimensions → decrease computational cost
Increase receptive field → each neuron "sees" more
Add translation invariance → small shifts don't change output
Reduce overfitting → fewer parameters in subsequent layers

Types of Pooling

Type	Operation	Use Case
MaxPool2d(2)	Takes maximum value	Most common, preserves strong activations
AvgPool2d(2)	Takes average value	Smoother downsampling
AdaptiveAvgPool2d(1)	Global average pooling	Before the final classifier; collapses each feature map to 1×1 so only a small Linear layer is needed

Size Reduction with MaxPool(2):
32×32 → MaxPool → 16×16
16×16 → MaxPool → 8×8
 8×8  → MaxPool → 4×4
 4×4  → MaxPool → 2×2
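
A minimal sketch of 2×2 max pooling on a single feature map, written in plain Python for illustration (the helper `max_pool_2x2` is an assumption and requires even height and width):

```python
def max_pool_2x2(feature_map):
    """2×2 max pooling with stride 2 on a 2D list with even height and width."""
    pooled = []
    for i in range(0, len(feature_map), 2):
        row = []
        for j in range(0, len(feature_map[0]), 2):
            window = (
                feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1],
            )
            row.append(max(window))  # keep only the strongest activation
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(feature_map)

# Spatial size halves with each pooling step, starting from 32
sizes = [32]
while sizes[-1] > 2:
    sizes.append(sizes[-1] // 2)
```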

6. ACTIVATION FUNCTIONS

Why Activation Functions?

Without activation functions, stacking layers adds no expressive power, because a composition of linear maps is itself linear:

Linear → Linear → Linear ≡ Single Linear Layer

Activations introduce non-linearity, allowing networks to learn complex patterns.

ReLU (Rectified Linear Unit)

ReLU(x) = max(0, x)

Advantages:

Fast to compute
Helps with vanishing gradients
Sparse activations (many zeros)
Works well in practice

Disadvantages:

"Dying ReLU" problem: neurons can get stuck at 0
Not zero-centered

Other Activations

Function	Formula	Use Case
Leaky ReLU	max(0.01x, x)	Prevents dying neurons
Sigmoid	1 / (1 + e^(-x))	Binary classification output
Tanh	(e^x - e^(-x)) / (e^x + e^(-x))	Zero-centered alternative to sigmoid
Softmax	e^(x_i) / Σ e^(x_j)	Multi-class classification output
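
The formulas in this section can be written out directly. This is a plain-Python sketch for scalar inputs (and a list of logits for softmax); the max-subtraction in `softmax` is a standard numerical-stability step that does not change the result:

```python
import math


def relu(x):
    return max(0.0, x)


def leaky_relu(x, negative_slope=0.01):
    # Equivalent to max(negative_slope * x, x) for 0 < negative_slope < 1
    return x if x > 0 else negative_slope * x


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
```

Softmax outputs are positive and sum to 1, so they can be read as class probabilities; the ordering of the logits is preserved.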

7. NORMALIZATION TECHNIQUES

Batch Normalization

Purpose: Normalize each channel's activations over the batch to mean 0 and std 1, then apply a learnable scale (γ) and shift (β)

BatchNorm(x) = γ × (x - μ_batch) / √(σ²_batch + ε) + β

Benefits:

Faster convergence
Allows higher learning rates
Less sensitive to initialization
Acts as regularization

Typical Placement

Conv → BatchNorm → ReLU

nn.Conv2d(64, 128, 3, padding=1),
nn.BatchNorm2d(128),
nn.ReLU(inplace=True)

During training: uses batch statistics
During inference: uses running averages computed during training
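
The difference between the two modes can be sketched for a single channel in plain Python. The helper names are illustrative assumptions; the momentum of 0.1 and the initial running statistics (mean 0, variance 1) follow PyTorch's documented defaults. PyTorch also uses the unbiased variance when updating the running estimate, a detail this sketch ignores:

```python
def batch_norm_train(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel's values using the statistics of the current batch."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # biased variance
    normalized = [gamma * (v - mean) / (var + eps) ** 0.5 + beta for v in values]
    return normalized, mean, var


def update_running(running, batch_stat, momentum=0.1):
    """Exponential moving average used to track statistics for inference."""
    return (1 - momentum) * running + momentum * batch_stat


def batch_norm_eval(values, running_mean, running_var, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize with the stored running statistics instead of batch statistics."""
    return [gamma * (v - running_mean) / (running_var + eps) ** 0.5 + beta for v in values]


batch = [1.0, 2.0, 3.0, 4.0]
normalized, batch_mean, batch_var = batch_norm_train(batch)

# One training step moves the running statistics toward the batch statistics
running_mean = update_running(0.0, batch_mean)
running_var = update_running(1.0, batch_var)

eval_out = batch_norm_eval([2.5], running_mean, running_var)
```

After only one update the running statistics are still far from the batch statistics, which is why evaluating a model very early in training can give different results in train and eval mode.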

8. REGULARIZATION IN CNNs

Dropout

Purpose: Prevent overfitting by randomly dropping activations

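During training: randomly set activations to 0 with probability p and scale the survivors by 1/(1-p) (inverted dropout, as implemented by nn.Dropout)
During inference: pass activations through unchanged (the original dropout formulation instead scales by (1-p) at inference; the two are equivalent in expectation)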
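
A minimal sketch of the "inverted" dropout variant, which is the one PyTorch's nn.Dropout implements: surviving activations are scaled by 1/(1-p) during training so that inference needs no rescaling. The function signature and the seeded `rng` argument are assumptions made for illustration, and p is assumed to be strictly less than 1:

```python
import random


def dropout(activations, p=0.5, training=True, rng=None):
    """Inverted dropout on a list of activations."""
    if not training or p == 0.0:
        return list(activations)  # inference: identity
    if rng is None:
        rng = random.Random()
    keep_scale = 1.0 / (1.0 - p)
    # Drop each activation with probability p, rescale the rest
    return [0.0 if rng.random() < p else a * keep_scale for a in activations]


rng = random.Random(0)  # fixed seed so the example is repeatable
activations = [1.0] * 10000
train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False)
```

The rescaling keeps the expected activation the same in both modes: roughly half of `train_out` is zero and the rest is 2.0, so its mean stays close to 1.0.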

Typical Usage

# Higher dropout in FC layers
nn.Dropout(0.5)

# Lower dropout after conv layers
nn.Dropout2d(0.25)

Weight Decay (L2 Regularization)

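Loss_total = Loss_original + λ × Σ(w²)

With plain SGD, the weight_decay argument is equivalent to this L2 penalty up to a constant factor. AdamW instead applies the decay directly to the weights at each step (decoupled weight decay), so it is related to, but not identical to, adding λ × Σ(w²) to the loss.

optimizer = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=5e-4  # decay coefficient (decoupled from the loss gradient)
)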

Label Smoothing

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

Prevents overconfident predictions and improves calibration
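
To make the effect concrete, here is a sketch of the smoothed target distribution, assuming the common formulation in which the one-hot target is mixed with a uniform distribution over all classes (this is how the PyTorch documentation describes `label_smoothing`). The helper `smooth_labels` is an illustrative assumption:

```python
def smooth_labels(target_class, num_classes, smoothing=0.1):
    """Mix a one-hot target with a uniform distribution over all classes."""
    off_value = smoothing / num_classes
    distribution = [off_value] * num_classes
    distribution[target_class] += 1.0 - smoothing
    return distribution


# 10 classes, true class index 3, smoothing 0.1
soft_target = smooth_labels(3, 10, smoothing=0.1)
hard_target = smooth_labels(3, 10, smoothing=0.0)
```

With smoothing 0.1 the true class receives about 0.91 and every other class about 0.01, so the loss no longer rewards pushing the predicted probability all the way to 1.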

9. DATA AUGMENTATION

Why Data Augmentation?

Increase effective dataset size
Improve generalization
Reduce overfitting
Better calibration

Common Augmentations for CIFAR-10

Augmentation	Effect	Expected Gain
RandomCrop(32, padding=4)	Translation invariance	+2-3%
RandomHorizontalFlip()	Left-right symmetry	+1-2%
ColorJitter(0.2, 0.2, 0.2, 0.1)	Lighting robustness	+1-2%
RandomErasing(p=0.15)	Occlusion robustness	+0.5-1%

Complete Pipeline

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.2, 0.2, 0.2, 0.1),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=(0.4914, 0.4822, 0.4465),
        std=(0.2470, 0.2435, 0.2616)
    ),
    transforms.RandomErasing(p=0.15),
])

NEVER augment the test set! Evaluate on clean data only.

10. TRAINING CNNs

Loss Functions

criterion = nn.CrossEntropyLoss(label_smoothing=0.05)

Combines LogSoftmax + NLLLoss
Expects raw logits (before softmax)
Formula (without label smoothing): -log(softmax(logits)[target_class])
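
The LogSoftmax + NLLLoss combination can be sketched for a single example in plain Python. This is an illustration without label smoothing, using the log-sum-exp trick for numerical stability; the function names are assumptions:

```python
import math


def log_softmax(logits):
    """Numerically stable log(softmax(logits))."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in logits))
    return [v - log_sum_exp for v in logits]


def cross_entropy(logits, target_class):
    """Negative log-probability assigned to the correct class."""
    return -log_softmax(logits)[target_class]


# Uniform logits over 10 classes: the loss equals log(10)
uniform_loss = cross_entropy([0.0] * 10, target_class=4)

# A confident prediction is cheap when right and expensive when wrong
confident_right = cross_entropy([10.0, 0.0, 0.0], target_class=0)
confident_wrong = cross_entropy([10.0, 0.0, 0.0], target_class=1)
```

A freshly initialized 10-class model should therefore start near a loss of log(10) ≈ 2.30; a much higher starting loss is a useful warning sign.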

Optimizers

Optimizer	Pros	Cons
Adam / AdamW	Fast convergence, adaptive LR	Can generalize slightly worse
SGD + Momentum	Better generalization	Requires careful tuning

Learning Rate Schedules

# Cosine Annealing (smooth decrease)
scheduler = optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs, eta_min=1e-6
)

# Step LR (decrease by factor every N epochs)
scheduler = optim.lr_scheduler.StepLR(
    optimizer, step_size=30, gamma=0.1
)

# ReduceLROnPlateau (adaptive)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=5
)
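
To see what the first two schedules do to the learning rate, the sketch below evaluates their closed-form expressions per epoch, assuming the scheduler is stepped once per epoch. The function names are illustrative assumptions; the cosine expression is the closed form given in the PyTorch documentation for CosineAnnealingLR:

```python
import math


def cosine_annealing_lr(epoch, base_lr, t_max, eta_min=0.0):
    """Learning rate after `epoch` scheduler steps under cosine annealing."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


def step_lr(epoch, base_lr, step_size, gamma):
    """Learning rate after `epoch` scheduler steps under step decay."""
    return base_lr * gamma ** (epoch // step_size)


# Cosine schedule from 1e-3 down to 1e-6 over 50 epochs
cosine = [cosine_annealing_lr(e, base_lr=1e-3, t_max=50, eta_min=1e-6) for e in range(51)]

# Step schedule: multiply by 0.1 every 30 epochs
stepped = [step_lr(e, base_lr=1e-3, step_size=30, gamma=0.1) for e in (0, 29, 30, 60)]
```

The cosine schedule starts at the base rate, passes through the midpoint between base_lr and eta_min at epoch T_max/2, and ends at eta_min; the step schedule stays flat and then drops abruptly.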

Gradient Clipping

torch.nn.utils.clip_grad_norm_(
    model.parameters(),
    max_norm=1.0
)

Prevents exploding gradients, especially useful with high learning rates
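
Norm-based clipping rescales all gradients together so that their global L2 norm does not exceed max_norm, leaving their direction unchanged. The sketch below works on a flat list of gradient values and returns a new list; it is an illustration of the idea rather than PyTorch's implementation, and the small `eps` term is an assumption added to avoid division by zero:

```python
def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Scale `grads` so their global L2 norm is at most `max_norm`."""
    total_norm = sum(g * g for g in grads) ** 0.5
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale gradients up
    return [g * scale for g in grads], total_norm


# Norm is 5.0, so the gradients are scaled down to (almost exactly) norm 1.0
clipped, norm_before = clip_grad_norm([3.0, 4.0], max_norm=1.0)
norm_after = sum(g * g for g in clipped) ** 0.5

# Norm is 0.5, already below the threshold, so nothing changes
unchanged, _ = clip_grad_norm([0.3, 0.4], max_norm=1.0)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only the step length is limited.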

11. CIFAR-10 DATASET

Dataset Overview

Property	Value
Total Images	60,000 (50k train, 10k test)
Resolution	32×32 pixels
Channels	3 (RGB)
Classes	10 (balanced)

Class Distribution

Each class has 6,000 images (5,000 train, 1,000 test):

Airplane
Automobile
Bird
Cat
Deer
Dog
Frog
Horse
Ship
Truck

Data Normalization

# CIFAR-10 statistics
mean = (0.4914, 0.4822, 0.4465)  # RGB
std = (0.2470, 0.2435, 0.2616)

normalize = transforms.Normalize(mean=mean, std=std)

Why normalize? Zero-centered inputs speed up convergence and prevent activation saturation

Train/Val Split

# Typical split: 45k train / 5k val / 10k test
val_size = 5000
train_indices = list(range(0, 50000 - val_size))
val_indices = list(range(50000 - val_size, 50000))
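
The split above takes the last 5,000 training images in dataset order. An alternative, sketched here as an assumption rather than a requirement, is to shuffle the indices with a fixed seed so the validation set is a reproducible random sample. The helper `train_val_split` and the seed value are illustrative:

```python
import random


def train_val_split(num_samples=50000, val_size=5000, seed=42):
    """Return (train_indices, val_indices) from a seeded shuffle of all indices."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # local RNG: global random state untouched
    return indices[val_size:], indices[:val_size]


train_indices, val_indices = train_val_split()
```

The resulting index lists can be passed to torch.utils.data.Subset. The validation subset should use the clean test-time transform, not the augmented training transform.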

12. MODEL EVALUATION

Key Metrics

Metric	Formula	Interpretation
Accuracy	correct / total	Overall correctness
Loss	CrossEntropyLoss	Confidence-aware error
Confidence	max(softmax(logits))	Model certainty

Evaluation Function

def evaluate(model, dataloader, criterion, device):
    model.eval()  # IMPORTANT!
    running_loss = 0.0
    correct = 0
    total = 0

    with torch.no_grad():
        for inputs, targets in dataloader:
            inputs = inputs.to(device)
            targets = targets.to(device)

            outputs = model(inputs)
            loss = criterion(outputs, targets)

            running_loss += loss.item() * inputs.size(0)
            predictions = outputs.argmax(dim=1)
            correct += (predictions == targets).sum().item()
            total += targets.size(0)

    return running_loss / total, correct / total

Always call model.eval() before evaluation to disable dropout and switch BatchNorm to eval mode!

13. CNN vs FULLY CONNECTED NETWORKS

Parameter Comparison

Network Type	First Layer	Parameters
Fully Connected	3,072 → 1,024	3,145,728
CNN	3→64, kernel=3×3	1,792
Ratio		≈1,755:1

Spatial Awareness

Fully Connected:

Flattens image: [batch, 3, 32, 32] → [batch, 3072]
No notion of "nearby pixels"
Must relearn patterns at different positions

CNN:

Preserves structure: [batch, 3, 32, 32] → [batch, 64, 32, 32]
Adjacent pixels processed together
Translation equivariance built in (pooling adds approximate invariance)

Typical CIFAR-10 Results

Model	Parameters	Test Accuracy
Random Guessing	-	10%
3-layer FC	3.8M	55-60%
Simple CNN	0.6M	75-80%
CNN + Augmentation	1.2M	85-90%
ResNet-18	11M	92-95%

14. IMPLEMENTATION GUIDE

Model Architecture

import torch.nn as nn

class CIFAR10CNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()

        self.features = nn.Sequential(
            # Block 1: 32×32 → 16×16
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Dropout(0.25),

            # Block 2: 16×16 → 8×8
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Dropout(0.35),

            # Block 3: 8×8 → 1×1
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )

        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.4),
            nn.Linear(256, 128),
            nn.ReLU(inplace=True),
            nn.Dropout(0.2),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

Training Setup

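import torch
import torch.nn as nn
import torch.optim as optim

# Setup (train_loader is assumed to be defined elsewhere)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_epochs = 50

# Model, loss, optimizer
model = CIFAR10CNN().to(device)
criterion = nn.CrossEntropyLoss(label_smoothing=0.05)
optimizer = optim.AdamW(model.parameters(), lr=2e-3, weight_decay=5e-4)
# T_max matches num_epochs so the cosine schedule completes exactly once
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

# Training loop
for epoch in range(num_epochs):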
    model.train()
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()

    scheduler.step()

15. COMMON PITFALLS AND SOLUTIONS

Pitfall 1: Flattening at Input

WRONG:

def forward(self, x):
    x = x.view(x.size(0), -1)  # DESTROYS SPATIAL STRUCTURE!
    x = self.conv1(x)  # This will crash

CORRECT:

def forward(self, x):
    # x shape: [batch, 3, 32, 32] - keep spatial structure!
    x = self.features(x)
    x = self.classifier(x)

Pitfall 2: Augmenting Test Data

WRONG:

test_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),  # NO!
    transforms.ToTensor(),
])

CORRECT:

test_transform = transforms.Compose([
    transforms.ToTensor(),  # No augmentation!
    transforms.Normalize(mean, std),
])

Pitfall 3: Not Using .eval()

WRONG:

with torch.no_grad():
    for inputs, targets in test_loader:
        outputs = model(inputs)  # BatchNorm/Dropout still active!

CORRECT:

model.eval()  # IMPORTANT!
with torch.no_grad():
    for inputs, targets in test_loader:
        outputs = model(inputs)

Common Training Issues

Symptom	Problem	Solution
Loss = NaN	Learning rate too high	Lower LR, add gradient clipping
Train >> Val accuracy	Overfitting	More dropout, augmentation, weight decay
Both train/val low	Underfitting	Increase capacity, train longer
Loss oscillates wildly	LR too high or batch size too small	Lower LR, increase batch size

SUMMARY

Key Takeaways

CNNs preserve spatial structure - NEVER flatten at input
Convolutions are parameter-efficient - same filter reused across image
Use BatchNorm + ReLU after Conv layers for stable training
Data augmentation is crucial - easily +5-10% accuracy
Always normalize inputs using dataset statistics
Don't augment test set - evaluate on clean data
Use proper train/val/test splits - 45k/5k/10k for CIFAR-10
Monitor both train and val metrics - detect overfitting
Save best model, not last epoch
Set seeds for reproducibility

Expected Results

Configuration	Expected Test Accuracy
Simple CNN, no augmentation	70-75%
CNN + basic augmentation	80-85%
CNN + full augmentation pipeline	85-90%
