CMPUT 328 Assignment 1: Logistic Regression

Complete Study Guide for Linear & Logistic Regression on MNIST

1. Fundamentals of Linear Regression

What is Linear Regression?

Linear regression is a supervised learning algorithm that models the relationship between input features and a continuous output variable using a linear function.

Mathematical Formulation

Prediction: ŷ = w₁x₁ + w₂x₂ + ... + wₙxₙ + b

Matrix form: ŷ = Xw + b

Where:
- X: input features [batch_size × n_features]
- w: weights [n_features × 1]
- b: bias (scalar)
- ŷ: predictions
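The matrix form can be checked by hand on a tiny batch. The sketch below uses plain Python lists rather than tensors so that each multiply-add is visible; the function name `predict` and the sample numbers are illustrative assumptions, not part of the assignment.

```python
def predict(X, w, b):
    """Return y_hat = Xw + b for a batch X given as a list of feature rows."""
    return [sum(x_j * w_j for x_j, w_j in zip(row, w)) + b for row in X]


# Two samples with two features each (made-up numbers)
X = [[1.0, 2.0],
     [3.0, 4.0]]
w = [0.5, -1.0]
b = 2.0

y_hat = predict(X, w, b)
print(y_hat)  # one prediction per row of X
```

Each row of X yields one prediction, which is why the output has batch_size entries.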

Loss Function: Mean Squared Error (MSE)

MSE = (1/N) Σᵢ (yᵢ - ŷᵢ)²

Where:
- N: number of samples
- yᵢ: true value
- ŷᵢ: predicted value
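As a quick sanity check of the formula, here is a minimal plain-Python sketch. The helper name `mse` and the sample values are assumptions chosen for illustration.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


# Only the third prediction is off (by 2), so the squared errors are 0, 0, 4
loss = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(loss)
```

Because errors are squared, a single large miss dominates the average.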

Key Properties

2. Logistic Regression

What is Logistic Regression?

Logistic regression extends linear regression for classification by applying a sigmoid (or softmax) function to convert linear outputs into probabilities.

Binary Classification: Sigmoid Function

Sigmoid: σ(z) = 1 / (1 + e⁻ᶻ)

Properties:
- Output range: (0, 1)
- σ(0) = 0.5
- σ(z) → 1 as z → ∞
- σ(z) → 0 as z → -∞
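The properties above can be verified numerically. This is a sketch in plain Python; the two-branch form is one common way to avoid overflow in `math.exp` for large negative inputs, and the function name `sigmoid` is simply a label chosen here.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid for a single float."""
    if z >= 0:
        # exp(-z) cannot overflow when z is non-negative
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, rewrite as e^z / (1 + e^z) so exp never sees a large argument
    e = math.exp(z)
    return e / (1.0 + e)


for z in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    print(z, sigmoid(z))
```

The naive formula `1 / (1 + math.exp(-z))` raises OverflowError for z = -1000, which is why the branch is there.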

Multi-class Classification: Softmax Function

Softmax: softmax(z)ᵢ = e^zᵢ / Σⱼ e^zⱼ

For MNIST (10 classes):
P(y = k | x) = e^(wₖᵀx + bₖ) / Σⱼ₌₀⁹ e^(wⱼᵀx + bⱼ)

Properties:
- Outputs sum to 1
- Each output is a probability in (0, 1)
- Used for multi-class classification
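A minimal softmax sketch in plain Python follows. Subtracting the maximum logit before exponentiating is a standard stability trick that leaves the result unchanged; the function name and example logits are assumptions for illustration.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    m = max(logits)                       # shift so the largest exponent is 0
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The largest logit always receives the largest probability, so `argmax` over logits and over probabilities agree.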

Cross-Entropy Loss

Binary cross-entropy: L = -[y log(ŷ) + (1-y) log(1-ŷ)]

Multi-class (categorical) cross-entropy: L = -Σₖ yₖ log(ŷₖ)

For PyTorch: CrossEntropyLoss combines LogSoftmax + NLLLoss
- Input: raw logits (before softmax)
- Target: class indices (not one-hot)

Important: PyTorch CrossEntropyLoss

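PyTorch's nn.CrossEntropyLoss() expects raw logits (unnormalized scores), NOT probabilities. It internally applies log-softmax before computing the negative log-likelihood, so applying softmax yourself normalizes the scores twice and distorts the loss.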

Correct: output = nn.Linear(784, 10) → CrossEntropyLoss

Wrong: output = Softmax(nn.Linear(784, 10)) → CrossEntropyLoss
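To see why the "Wrong" pattern hurts, the sketch below reimplements cross-entropy from logits in plain Python (log-sum-exp minus the target logit) and feeds it either raw logits or already-softmaxed values. The helper names are assumptions; this mirrors the LogSoftmax + NLLLoss idea for one sample and is not PyTorch's actual code.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_logits(logits, target):
    """-log softmax(logits)[target], computed via a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


logits = [10.0, 0.0, 0.0]   # a very confident, correct prediction for class 0
correct_loss = cross_entropy_from_logits(logits, 0)
wrong_loss = cross_entropy_from_logits(softmax(logits), 0)  # softmax applied twice

print(correct_loss, wrong_loss)
```

With the extra softmax, the inputs to the loss are squeezed into [0, 1], so the loss stays well above zero even for a confident, correct prediction, and the gradients shrink accordingly.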

Logistic Regression Architecture for MNIST

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class LogisticRegression(nn.Module):
    def __init__(self, in_dim=28*28, out_dim=10):
        super().__init__()
        # Single linear layer: 784 inputs → 10 outputs
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        # Flatten 28×28 images to 784-dim vectors
        x = x.view(x.size(0), -1)  # [batch, 1, 28, 28] → [batch, 784]
        logits = self.fc(x)        # [batch, 784] → [batch, 10]
        return logits              # Raw scores (no softmax!)

# Usage
model = LogisticRegression().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

3. MNIST Dataset

Overview

MNIST (Modified National Institute of Standards and Technology) is a benchmark dataset of handwritten digits (0-9) commonly used for image classification tasks.

Dataset Characteristics

Property          Value
Training samples  60,000 images
Test samples      10,000 images
Image size        28×28 pixels (grayscale)
Classes           10 (digits 0-9)
Input features    784 (28×28 flattened)
Pixel range       0-255 (grayscale intensity)

Data Preprocessing

from torch.utils.data import random_split
from torchvision import datasets, transforms

# Normalization transform
transform = transforms.Compose([
    transforms.ToTensor(),                      # Converts to [0,1] range
    transforms.Normalize((0.1307,), (0.3081,))  # Mean & std of MNIST
])

# Load data
train_ds = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
test_ds = datasets.MNIST(root="./data", train=False, download=True, transform=transform)

# Split train into train (50k) + validation (10k)
train_ds, val_ds = random_split(train_ds, [50_000, 10_000])

Why Normalize?

MNIST normalization uses mean = 0.1307 and std = 0.3081, both computed over the training-set pixels after scaling to [0, 1].
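The two preprocessing steps amount to simple arithmetic per pixel. The sketch below applies them to single raw pixel values in plain Python; `normalize_pixel` is a hypothetical helper that imitates ToTensor followed by Normalize for one value, not a torchvision function.

```python
MEAN, STD = 0.1307, 0.3081   # MNIST statistics quoted in this guide


def normalize_pixel(raw):
    """Map a raw 0-255 intensity to the normalized value the model sees."""
    scaled = raw / 255.0            # what ToTensor does: [0, 255] -> [0, 1]
    return (scaled - MEAN) / STD    # what Normalize does: subtract mean, divide by std


black = normalize_pixel(0)
white = normalize_pixel(255)
print(black, white)
```

Background (black) pixels end up slightly negative and stroke (white) pixels end up a few standard deviations positive, so the inputs are roughly centered around zero.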

4. Implementation Details

Complete Training Loop

def train_epoch(model, loader, criterion, optimizer, device):
    model.train()  # Set to training mode
    total_loss = 0.0
    correct = 0
    total = 0

    for x, y in loader:
        # Move to device
        x, y = x.to(device), y.to(device)

        # Forward pass
        optimizer.zero_grad()        # Clear previous gradients
        logits = model(x)            # Get predictions
        loss = criterion(logits, y)  # Compute loss

        # Backward pass
        loss.backward()   # Compute gradients
        optimizer.step()  # Update weights

        # Track metrics
        total_loss += loss.item() * x.size(0)
        preds = logits.argmax(dim=1)  # Get class predictions
        correct += (preds == y).sum().item()
        total += y.size(0)

    avg_loss = total_loss / total
    accuracy = correct / total
    return avg_loss, accuracy


def validate(model, loader, criterion, device):
    model.eval()  # Set to evaluation mode
    total_loss = 0.0
    correct = 0
    total = 0

    with torch.no_grad():  # Disable gradient computation
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(x)
            loss = criterion(logits, y)

            total_loss += loss.item() * x.size(0)
            preds = logits.argmax(dim=1)
            correct += (preds == y).sum().item()
            total += y.size(0)

    avg_loss = total_loss / total
    accuracy = correct / total
    return avg_loss, accuracy
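The loops above accumulate `loss.item() * x.size(0)` rather than `loss.item()` alone. With the default mean reduction, the criterion returns a per-batch average, so weighting by batch size recovers the true per-sample average when the last batch is smaller. A plain-Python sketch with made-up batch losses and sizes:

```python
def epoch_average(batch_means, batch_sizes):
    """Per-sample average loss from per-batch mean losses."""
    total = sum(m * n for m, n in zip(batch_means, batch_sizes))
    return total / sum(batch_sizes)


# Hypothetical epoch: two full batches and a small final batch with a high loss
batch_means = [0.50, 0.40, 1.00]
batch_sizes = [64, 64, 8]

per_sample = epoch_average(batch_means, batch_sizes)
naive = sum(batch_means) / len(batch_means)   # treats every batch as equally large
print(per_sample, naive)
```

The naive mean over batches overweights the small final batch; the weighted version does not.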

Training Script

num_epochs = 10

for epoch in range(1, num_epochs + 1):
    # Train
    train_loss, train_acc = train_epoch(
        model, train_loader, criterion, optimizer, device
    )

    # Validate
    val_loss, val_acc = validate(
        model, val_loader, criterion, device
    )

    print(f"Epoch {epoch:02d}: "
          f"train_loss={train_loss:.4f} train_acc={train_acc:.4f} "
          f"val_acc={val_acc:.4f}")

Best Practices

5. Regularization (L1 & L2)

Why Regularization?

Regularization prevents overfitting by penalizing large weights, encouraging the model to learn simpler patterns that generalize better.

L2 Regularization (Ridge / Weight Decay)

Loss with L2: L = CrossEntropy + λ × Σ wᵢ²

Effect:
- Penalizes large weights
- Encourages weights to be small but non-zero
- Smoother decision boundaries

# L2 in PyTorch: use weight_decay parameter
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,
    weight_decay=1e-4  # L2 regularization strength
)
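To connect `weight_decay` to the L2 formula, the sketch below mimics one plain SGD step in pure Python. It assumes the documented behaviour of torch.optim.SGD, which adds `weight_decay * w` to each gradient before the update; that matches an explicit penalty of (weight_decay / 2) × Σ wᵢ², so the λ in the formula above corresponds to weight_decay / 2. Function names and numbers are illustrative.

```python
def sgd_step_with_weight_decay(w, grad, lr, weight_decay):
    """One SGD step where the decay term is folded into the gradient."""
    return [wi - lr * (gi + weight_decay * wi) for wi, gi in zip(w, grad)]


def sgd_step_with_explicit_penalty(w, grad, lr, weight_decay):
    """Same step, but derived from loss + (weight_decay / 2) * sum(w_i^2)."""
    penalty_grad = [weight_decay * wi for wi in w]   # d/dw of (wd/2) * w^2
    return [wi - lr * (gi + pi) for wi, gi, pi in zip(w, grad, penalty_grad)]


w = [1.0, -2.0, 0.5]
grad = [0.1, 0.0, -0.3]      # made-up data-loss gradients

a = sgd_step_with_weight_decay(w, grad, lr=0.1, weight_decay=1e-4)
b = sgd_step_with_explicit_penalty(w, grad, lr=0.1, weight_decay=1e-4)
print(a, b)
```

Note that `weight_decay` applies to every parameter handed to the optimizer, biases included, unless parameter groups are used to exclude them.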

L1 Regularization (Lasso)

Loss with L1: L = CrossEntropy + λ × Σ |wᵢ|

Effect:
- Penalizes absolute value of weights
- Encourages sparse weights (many weights → 0)
- Feature selection (removes irrelevant features)

# L1 in PyTorch: manual implementation
def train_with_l1(model, loader, criterion, optimizer, device, l1_lambda=1e-5):
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)

        optimizer.zero_grad()
        logits = model(x)
        loss = criterion(logits, y)

        # Add L1 penalty (summed over all parameters, including biases)
        l1_penalty = sum(param.abs().sum() for param in model.parameters())
        total_loss = loss + l1_lambda * l1_penalty

        total_loss.backward()
        optimizer.step()

Comparison: L1 vs L2

Aspect                  L2 (Ridge)              L1 (Lasso)
Penalty                 Σ wᵢ²                   Σ |wᵢ|
Weight behavior         Small, non-zero         Sparse (many zeros)
Feature selection       No                      Yes
Differentiability       Smooth everywhere       Not differentiable at 0
Use case                General regularization  High-dimensional, sparse data
PyTorch implementation  weight_decay parameter  Manual penalty in loss
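The different weight behavior in the table follows from the penalty gradients. A rough sketch in plain Python, looking only at the pull of the penalty term on a single weight (the data-loss gradient is ignored, and the helper names are assumptions):

```python
def shrink_l2(w, lr, lam):
    """One gradient step on lam * w^2; the gradient is 2 * lam * w."""
    return w - lr * 2 * lam * w


def shrink_l1(w, lr, lam):
    """One subgradient step on lam * |w|; the subgradient is lam * sign(w)."""
    sign = (w > 0) - (w < 0)   # 1, 0, or -1
    return w - lr * lam * sign


lr, lam = 0.1, 0.01
for w in (1.0, 0.01):
    pull_l2 = w - shrink_l2(w, lr, lam)
    pull_l1 = w - shrink_l1(w, lr, lam)
    print(f"w={w}: L2 pull={pull_l2:.6f}, L1 pull={pull_l1:.6f}")
```

The L2 pull is proportional to the weight, so small weights are barely touched and rarely reach exactly zero. The L1 pull has the same size for every non-zero weight, which keeps pushing small weights toward zero. Plain subgradient steps tend to hover around zero rather than land on it, so exact zeros usually need an additional thresholding step.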

Typical Regularization Strengths

6. Optimizers (SGD vs Adam)

Stochastic Gradient Descent (SGD)

Update rule: w ← w - η × ∇L(w)

Where:
- w: weights
- η: learning rate
- ∇L(w): gradient of loss w.r.t. weights

# Basic SGD
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# SGD with momentum
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9  # Accelerates convergence
)
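The update rule, with and without momentum, can be traced on a one-dimensional toy problem. The sketch below minimizes f(w) = (w - 3)² in plain Python. It assumes the velocity form used by torch.optim.SGD with default settings (velocity ← momentum × velocity + gradient); the function name and toy objective are made up for illustration.

```python
def sgd_momentum(grad_fn, w, lr, momentum, steps):
    """Run SGD with (optional) momentum on a scalar parameter."""
    velocity = 0.0
    for _ in range(steps):
        g = grad_fn(w)
        velocity = momentum * velocity + g   # running accumulation of gradients
        w = w - lr * velocity                # w <- w - lr * velocity
    return w


def grad(w):
    # Gradient of f(w) = (w - 3)^2
    return 2.0 * (w - 3.0)


w_plain = sgd_momentum(grad, w=0.0, lr=0.1, momentum=0.0, steps=300)
w_momentum = sgd_momentum(grad, w=0.0, lr=0.1, momentum=0.9, steps=300)
print(w_plain, w_momentum)
```

Setting momentum to 0 reduces the loop to the basic rule w ← w - η × ∇L(w). Both runs end near the minimum at w = 3.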

SGD Characteristics

Adam (Adaptive Moment Estimation)

Update rule (simplified, bias correction omitted):
m ← β₁m + (1-β₁)∇L        (first moment: momentum)
v ← β₂v + (1-β₂)(∇L)²     (second moment: adaptive learning rate)
w ← w - η × m / (√v + ε)

Default hyperparameters:
- β₁ = 0.9
- β₂ = 0.999
- ε = 1e-8

# Adam optimizer
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3  # Typical starting point for Adam
)
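For reference, here is a scalar sketch of one Adam step in plain Python, including the bias-correction terms that simplified write-ups usually leave out. It follows the standard published formulation; the function name `adam_step` and the test gradients are assumptions for illustration.

```python
import math


def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g      # second moment (squared gradients)
    m_hat = m / (1 - beta1 ** t)             # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# The first step has nearly the same size for a huge and a tiny gradient
for g in (100.0, 0.01):
    w_new, _, _ = adam_step(w=0.0, g=g, m=0.0, v=0.0, t=1)
    print(g, w_new)
```

Because the step is normalized by the gradient's running magnitude, its size is governed mainly by the learning rate rather than the raw gradient scale, which is one reason Adam's typical learning rates are smaller than SGD's.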

Adam Characteristics

Optimizer Comparison

Aspect             SGD                         Adam
Learning rate      Fixed (or scheduled)        Adaptive per parameter
Typical LR         0.01 - 0.1                  1e-4 - 1e-3
Convergence speed  Slower                      Faster
Tuning difficulty  Requires careful LR tuning  Works well with defaults
Generalization     Often better                May overfit more easily
Memory             Low                         Higher (two extra state tensors per parameter)
Best for           Well-tuned, final models    Rapid prototyping

Which to Use?

7. Model Evaluation

Confusion Matrix

A confusion matrix shows the counts of true vs predicted classes, revealing which classes the model confuses.

import matplotlib.pyplot as plt

# Compute confusion matrix (rows: true class, columns: predicted class)
num_classes = 10
confusion_matrix = torch.zeros(num_classes, num_classes, dtype=torch.int64)

model.eval()
with torch.no_grad():
    for x, y in test_loader:
        x, y = x.to(device), y.to(device)
        logits = model(x)
        preds = logits.argmax(dim=1)

        # Update confusion matrix (convert to Python ints; the matrix lives on the CPU)
        for true_label, pred_label in zip(y.tolist(), preds.tolist()):
            confusion_matrix[true_label, pred_label] += 1

# Visualize
plt.imshow(confusion_matrix.numpy(), cmap='Blues')
plt.colorbar()
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()
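Once the matrix is filled, a few summary numbers make it easier to read. The sketch below works on a nested list (for a tensor, `confusion_matrix.tolist()` would give the same shape) with rows as true classes and columns as predictions. The helper names and the 3-class matrix are made up for illustration and are not results from the assignment.

```python
def overall_accuracy(cm):
    """Fraction of samples on the diagonal."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def per_class_recall(cm):
    """For each true class, the fraction of its samples predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


def most_confused_pair(cm):
    """Return (true_class, predicted_class, count) for the largest off-diagonal cell."""
    pairs = [(cm[t][p], t, p)
             for t in range(len(cm)) for p in range(len(cm)) if t != p]
    count, true_cls, pred_cls = max(pairs)
    return true_cls, pred_cls, count


# Toy 3-class matrix with 10 samples per true class
cm = [[8, 1, 1],
      [0, 9, 1],
      [2, 0, 8]]

print(overall_accuracy(cm), per_class_recall(cm), most_confused_pair(cm))
```

Large off-diagonal cells point at the class pairs the model mixes up most often.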

Interpreting the Confusion Matrix

Visualizing Learned Weights

# Extract and visualize weights for each class
W = model.fc.weight.detach().cpu()  # [10, 784]

fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for cls in range(10):
    ax = axes[cls // 5, cls % 5]
    img = W[cls].view(28, 28)  # Reshape to image

    # Normalize for visualization
    vmax = img.abs().max().item()
    ax.imshow(img, cmap='seismic', vmin=-vmax, vmax=vmax)
    ax.set_title(f'Class {cls}')
    ax.axis('off')

What the Weights Show

Each class's weights form a template that the model learned:

Effect of Training Data Size

Training with less data typically reduces accuracy:

Key insight: more data helps, but returns diminish beyond roughly 50% of the training set.

Effect of Noisy Labels

import random
from torch.utils.data import Dataset

# Simulate 10% label noise
class NoisyLabels(Dataset):
    def __init__(self, base_dataset, noise_frac=0.1, num_classes=10):
        self.base = base_dataset
        self.noise_frac = noise_frac
        self.num_classes = num_classes

        # Randomly select indices to corrupt
        n = len(base_dataset)
        k = int(noise_frac * n)
        self.noisy_idx = set(random.sample(range(n), k))

    def __len__(self):
        # Required so DataLoader knows the dataset size
        return len(self.base)

    def __getitem__(self, idx):
        x, y = self.base[idx]
        if idx in self.noisy_idx:
            # Replace with a random incorrect label.
            # Note: drawn on every access, so the wrong label can change between epochs.
            true_y = y
            y = random.randint(0, self.num_classes - 1)
            while y == true_y:
                y = random.randint(0, self.num_classes - 1)
        return x, y
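If the corrupted labels should stay fixed across epochs and runs, one option is to precompute them once with a seeded generator. The sketch below is an alternative to drawing a new wrong label on each access, not the assignment's method; `corrupt_labels` and its `seed` parameter are hypothetical names. Adding a non-zero offset modulo the number of classes guarantees the new label differs from the original.

```python
import random


def corrupt_labels(labels, noise_frac=0.1, num_classes=10, seed=0):
    """Return a copy of labels with a fixed fraction replaced by wrong classes."""
    rng = random.Random(seed)            # local generator, so results are reproducible
    n = len(labels)
    k = int(noise_frac * n)
    noisy = list(labels)
    for idx in rng.sample(range(n), k):
        offset = rng.randrange(1, num_classes)           # 1 .. num_classes - 1
        noisy[idx] = (labels[idx] + offset) % num_classes
    return noisy


labels = [i % 10 for i in range(1000)]   # stand-in for real MNIST labels
noisy = corrupt_labels(labels)
changed = sum(a != b for a, b in zip(labels, noisy))
print(changed, "of", len(labels), "labels changed")
```

The resulting list could then be stored in a dataset wrapper and looked up by index, so every epoch sees the same noise.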

Impact of Label Noise

8. Common Challenges & Solutions

Challenge 1: Poor Convergence

Symptoms:

Solutions:

Challenge 2: Overfitting

Symptoms:

Solutions:

Challenge 3: Slow Training

Solutions:

Challenge 4: Incorrect Loss/Accuracy

Common Mistakes:

Checklist:

Summary: Key Takeaways

Logistic Regression for MNIST

Best Practices

Experimental Findings (Assignment)
