CMPUT 328 Assignment 1: Logistic Regression

1. Fundamentals of Linear Regression

What is Linear Regression?

Linear regression is a supervised learning algorithm that models the relationship between input features and a continuous output variable using a linear function.

Mathematical Formulation

Prediction: ŷ = w₁x₁ + w₂x₂ + ... + wₙxₙ + b

Matrix form: ŷ = Xw + b

Where:
- X: input features [batch_size × n_features]
- w: weights [n_features × 1]
- b: bias (scalar)
- ŷ: predictions [batch_size × 1]
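The matrix form can be checked on a toy batch. This is a minimal sketch; the numbers in X, w, and b are made up for illustration and are not from the assignment.

```python
import torch

# Toy batch: 3 samples, 2 features (illustrative values only)
X = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])        # [batch_size, n_features]
w = torch.tensor([[0.5], [-1.0]])     # [n_features, 1]
b = 2.0                               # scalar bias, broadcast over the batch

# Matrix form: one matrix product handles the whole batch
y_hat = X @ w + b                     # [batch_size, 1]

# Per-sample form: w1*x1 + w2*x2 + b, written out explicitly
y_hat_manual = 0.5 * X[:, 0] - 1.0 * X[:, 1] + b
```

Both forms give the same three predictions; the matrix form is simply the batched version of the per-sample sum.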

Loss Function: Mean Squared Error (MSE)

MSE = (1/N) Σ(yᵢ - ŷᵢ)²

Where:
- N: number of samples
- yᵢ: true value
- ŷᵢ: predicted value
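As a quick sanity check, the formula can be evaluated by hand and compared with PyTorch's built-in. The target and prediction values below are invented for the example.

```python
import torch
import torch.nn.functional as F

y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])
y_pred = torch.tensor([2.5,  0.0, 2.0, 8.0])

# MSE = (1/N) Σ (yᵢ - ŷᵢ)², computed directly from the definition
mse_manual = ((y_true - y_pred) ** 2).mean()

# Built-in equivalent; the default reduction is "mean"
mse_builtin = F.mse_loss(y_pred, y_true)
```

Here the squared errors are 0.25, 0.25, 0, and 1, so both computations give 1.5 / 4 = 0.375.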

Key Properties

Linear fit: Predicts with a straight line/hyperplane; the same linear form gives a linear decision boundary when used for classification
Differentiable: Can use gradient descent for optimization
Fast training: Efficient for large datasets
Interpretable: Weights show feature importance

2. Logistic Regression

What is Logistic Regression?

Logistic regression extends linear regression for classification by applying a sigmoid (or softmax) function to convert linear outputs into probabilities.

Binary Classification: Sigmoid Function

Sigmoid: σ(z) = 1 / (1 + e⁻ᶻ)

Properties:
- Output range: (0, 1)
- σ(0) = 0.5
- σ(z) → 1 as z → ∞
- σ(z) → 0 as z → -∞
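The listed properties can be verified numerically. The sketch below compares the formula with torch.sigmoid on a few hand-picked inputs (the inputs are arbitrary).

```python
import torch

z = torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0])

# Direct translation of σ(z) = 1 / (1 + e^(-z))
manual = 1.0 / (1.0 + torch.exp(-z))

# Built-in implementation
builtin = torch.sigmoid(z)

# Symmetry that follows from the definition: σ(-z) = 1 - σ(z)
mirrored = torch.sigmoid(-z)
```

The middle entry is exactly 0.5, and the outer entries approach 0 and 1 without reaching them.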

Multi-class Classification: Softmax Function

Softmax: softmax(z)ᵢ = e^zᵢ / Σⱼ e^zⱼ

For MNIST (10 classes): P(y = k | x) = e^(wₖᵀx + bₖ) / Σⱼ₌₀⁹ e^(wⱼᵀx + bⱼ)

Properties:
- Outputs sum to 1
- Each output is a probability in (0, 1)
- Used for multi-class classification
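A direct translation of the softmax formula overflows for large logits. A common remedy, sketched here with a hypothetical helper named softmax_stable, is to subtract the row-wise maximum before exponentiating; the result is unchanged because the common factor cancels between numerator and denominator.

```python
import torch

logits = torch.tensor([[2.0, 1.0, 0.1],
                       [1000.0, 1001.0, 1002.0]])  # second row overflows a naive exp

# Naive version: exp(1000) is inf in float32, and inf / inf is nan
naive = torch.exp(logits) / torch.exp(logits).sum(dim=1, keepdim=True)

def softmax_stable(z):
    # Shift each row so its largest entry is 0; exp() then stays in (0, 1]
    shifted = z - z.max(dim=1, keepdim=True).values
    exp = torch.exp(shifted)
    return exp / exp.sum(dim=1, keepdim=True)

probs = softmax_stable(logits)
reference = torch.softmax(logits, dim=1)  # built-in, also numerically stable
```

In practice torch.softmax (or the log-softmax inside CrossEntropyLoss) should be preferred over a hand-written version.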

Cross-Entropy Loss

Binary cross-entropy: L = -[y log(ŷ) + (1-y) log(1-ŷ)]

Multi-class (categorical) cross-entropy: L = -Σₖ yₖ log(ŷₖ)

In PyTorch, CrossEntropyLoss combines LogSoftmax + NLLLoss:
- Input: raw logits (before softmax)
- Target: class indices (not one-hot)

Important: PyTorch CrossEntropyLoss

PyTorch's nn.CrossEntropyLoss() expects raw logits (unnormalized scores), NOT probabilities. It applies log-softmax internally before computing the negative log-likelihood, so adding your own softmax normalizes the scores twice.

Correct: output = nn.Linear(784, 10) → CrossEntropyLoss

Wrong: output = Softmax(nn.Linear(784, 10)) → CrossEntropyLoss
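The difference between the two pipelines can be seen on random logits. In this sketch the logits tensor is a stand-in for the output of nn.Linear(784, 10), and the targets are arbitrary class indices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10) * 5        # stand-in for the linear layer's output
targets = torch.tensor([3, 0, 7, 9])   # class indices, not one-hot

criterion = nn.CrossEntropyLoss()

# Correct: pass raw logits
loss_correct = criterion(logits, targets)

# The same value computed in two explicit steps
loss_two_step = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Wrong: pass probabilities, so softmax is effectively applied twice
loss_double = criterion(F.softmax(logits, dim=1), targets)
```

With probabilities as input, every score lies in [0, 1], so for 10 classes the loss is squeezed into roughly 1.40 to 2.46 regardless of how confident the model is. No error is raised; training just becomes sluggish, which is why this mistake is easy to miss.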

Logistic Regression Architecture for MNIST

import torch
import torch.nn as nn

class LogisticRegression(nn.Module):
    def __init__(self, in_dim=28*28, out_dim=10):
        super().__init__()
        # Single linear layer: 784 inputs → 10 outputs
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        # Flatten 28×28 images to 784-dim vectors
        x = x.view(x.size(0), -1)  # [batch, 28, 28] → [batch, 784]
        logits = self.fc(x)        # [batch, 784] → [batch, 10]
        return logits              # Raw scores (no softmax!)

# Usage
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LogisticRegression().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

3. MNIST Dataset

Overview

MNIST (Modified National Institute of Standards and Technology) is a benchmark dataset of handwritten digits (0-9) commonly used for image classification tasks.

Dataset Characteristics

Property	Value
Training samples	60,000 images
Test samples	10,000 images
Image size	28×28 pixels (grayscale)
Classes	10 (digits 0-9)
Input features	784 (28×28 flattened)
Pixel range	0-255 (grayscale intensity)

Data Preprocessing

from torch.utils.data import random_split
from torchvision import datasets, transforms

# Normalization transform
transform = transforms.Compose([
    transforms.ToTensor(),  # Converts to [0,1] range
    transforms.Normalize((0.1307,), (0.3081,))  # Mean & std of MNIST
])

# Load data
train_ds = datasets.MNIST(root="./data", train=True,
                          download=True, transform=transform)
test_ds = datasets.MNIST(root="./data", train=False,
                         download=True, transform=transform)

# Split train into train (50k) + validation (10k)
train_ds, val_ds = random_split(train_ds, [50_000, 10_000])
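The training functions in Section 4 iterate over train_loader and val_loader, which have to be built from the datasets. The sketch below shows one way to do that. To stay runnable without a download it uses a random stand-in dataset with MNIST-like shapes; the batch size of 128 and the seed are assumptions, not values prescribed by the assignment.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in with MNIST-like shapes; in the assignment, the real
# train_ds / val_ds from the split above would be used instead.
fake_ds = TensorDataset(torch.randn(600, 1, 28, 28),
                        torch.randint(0, 10, (600,)))

# A seeded generator makes the split reproducible across runs
g = torch.Generator().manual_seed(0)
train_part, val_part = random_split(fake_ds, [500, 100], generator=g)

batch_size = 128  # assumed value
train_loader = DataLoader(train_part, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_part, batch_size=batch_size, shuffle=False)

# Each batch is a pair (images, labels)
x, y = next(iter(train_loader))
```

Shuffling is enabled only for training; validation order does not affect the metrics, so it is left fixed.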

Why Normalize?

Faster convergence: Keeps gradients in a reasonable range
Numerical stability: Prevents overflow/underflow
Better optimization: Helps gradient descent find minima faster

MNIST normalization: Mean=0.1307, Std=0.3081 (computed from training set)
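Statistics like these are obtained by averaging over every pixel of the training images after scaling to [0, 1]. The sketch below illustrates the computation on random stand-in data, so its mean and std will not match the MNIST values quoted above.

```python
import torch

# Stand-in for training images after ToTensor(): values in [0, 1]
torch.manual_seed(0)
images = torch.rand(1000, 1, 28, 28)

# One mean and one std over all pixels of all images (single channel)
mean = images.mean()
std = images.std()

# Per-pixel operation performed by transforms.Normalize((mean,), (std,))
normalized = (images - mean) / std
```

After normalization the pixel values have mean 0 and standard deviation 1 over the data the statistics were computed from.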

4. Implementation Details

Complete Training Loop

def train_epoch(model, loader, criterion, optimizer, device):
    model.train()  # Set to training mode
    total_loss = 0.0
    correct = 0
    total = 0

    for x, y in loader:
        # Move to device
        x, y = x.to(device), y.to(device)

        # Forward pass
        optimizer.zero_grad()  # Clear previous gradients
        logits = model(x)      # Get predictions
        loss = criterion(logits, y)  # Compute loss

        # Backward pass
        loss.backward()        # Compute gradients
        optimizer.step()       # Update weights

        # Track metrics
        total_loss += loss.item() * x.size(0)
        preds = logits.argmax(dim=1)  # Get class predictions
        correct += (preds == y).sum().item()
        total += y.size(0)

    avg_loss = total_loss / total
    accuracy = correct / total
    return avg_loss, accuracy

def validate(model, loader, criterion, device):
    model.eval()  # Set to evaluation mode
    total_loss = 0.0
    correct = 0
    total = 0

    with torch.no_grad():  # Disable gradient computation
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(x)
            loss = criterion(logits, y)

            total_loss += loss.item() * x.size(0)
            preds = logits.argmax(dim=1)
            correct += (preds == y).sum().item()
            total += y.size(0)

    avg_loss = total_loss / total
    accuracy = correct / total
    return avg_loss, accuracy

Training Script

num_epochs = 10

for epoch in range(1, num_epochs + 1):
    # Train
    train_loss, train_acc = train_epoch(
        model, train_loader, criterion, optimizer, device
    )

    # Validate
    val_loss, val_acc = validate(
        model, val_loader, criterion, device
    )

    print(f"Epoch {epoch:02d}: "
          f"train_loss={train_loss:.4f} train_acc={train_acc:.4f} "
          f"val_loss={val_loss:.4f} val_acc={val_acc:.4f}")

Best Practices

Always use validation set: Don't touch test set until final evaluation
Model modes: Use model.train() and model.eval()
No gradients in validation: Use with torch.no_grad():
Track metrics: Log loss and accuracy for both train and validation

5. Regularization (L1 & L2)

Why Regularization?

Regularization prevents overfitting by penalizing large weights, encouraging the model to learn simpler patterns that generalize better.

L2 Regularization (Ridge / Weight Decay)

Loss with L2: L = CrossEntropy + λ × Σ wᵢ²

Effect:
- Penalizes large weights
- Encourages weights to be small but non-zero
- Smoother decision boundaries

# L2 in PyTorch: use weight_decay parameter
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,
    weight_decay=1e-4  # L2 regularization strength
)
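One detail worth knowing: torch.optim.SGD implements weight_decay by adding weight_decay × w to each gradient, which corresponds to a penalty of (weight_decay / 2) × Σ wᵢ² over every parameter passed to the optimizer, bias included. The sketch below checks this equivalence for a single step on random data; the inputs and the value wd = 1e-2 are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x, y = torch.randn(8, 784), torch.randint(0, 10, (8,))
criterion = nn.CrossEntropyLoss()
wd, lr = 1e-2, 0.1

# Two models with identical initial parameters
model_a = nn.Linear(784, 10)
model_b = nn.Linear(784, 10)
model_b.load_state_dict(model_a.state_dict())

# (a) Built-in weight decay
opt_a = torch.optim.SGD(model_a.parameters(), lr=lr, weight_decay=wd)
opt_a.zero_grad()
criterion(model_a(x), y).backward()
opt_a.step()

# (b) Explicit penalty (wd / 2) * Σ p² added to the loss
opt_b = torch.optim.SGD(model_b.parameters(), lr=lr)
opt_b.zero_grad()
penalty = sum((p ** 2).sum() for p in model_b.parameters())
(criterion(model_b(x), y) + 0.5 * wd * penalty).backward()
opt_b.step()
```

After one step both models hold the same parameters, so a weight_decay of 1e-4 matches λ = 5e-5 in the formula L = CrossEntropy + λ × Σ wᵢ².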

L1 Regularization (Lasso)

Loss with L1: L = CrossEntropy + λ × Σ |wᵢ|

Effect:
- Penalizes absolute value of weights
- Encourages sparse weights (many weights → 0)
- Feature selection (removes irrelevant features)

# L1 in PyTorch: manual implementation
def train_with_l1(model, loader, criterion, optimizer, device, l1_lambda=1e-5):
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)

        optimizer.zero_grad()
        logits = model(x)
        loss = criterion(logits, y)

        # Add L1 penalty
        l1_penalty = 0.0
        for param in model.parameters():
            l1_penalty += param.abs().sum()

        total_loss = loss + l1_lambda * l1_penalty
        total_loss.backward()
        optimizer.step()

Comparison: L1 vs L2

Aspect	L2 (Ridge)	L1 (Lasso)
Penalty	Σ wᵢ²	Σ |wᵢ|
Weight behavior	Small, non-zero	Sparse (many zeros)
Feature selection	No	Yes
Differentiability	Smooth everywhere	Not differentiable at 0
Use case	General regularization	High-dimensional, sparse data
PyTorch implementation	weight_decay parameter	Manual penalty in loss

Typical Regularization Strengths

L2 (weight_decay): 1e-5 to 1e-3
L1 (lambda): 1e-6 to 1e-4
Start small and increase if overfitting persists

6. Optimizers (SGD vs Adam)

Stochastic Gradient Descent (SGD)

Update rule: w ← w - η × ∇L(w)

Where:
- w: weights
- η: learning rate
- ∇L(w): gradient of the loss w.r.t. the weights, estimated on a mini-batch
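The update rule can be applied by hand on a toy problem. This sketch minimizes L(w) = (w - 3)², a made-up objective with its minimum at w = 3. It uses the exact gradient rather than a mini-batch estimate, so strictly speaking it is plain gradient descent; the update line is the same one SGD applies.

```python
import torch

w = torch.tensor([0.0], requires_grad=True)
eta = 0.1  # learning rate η

for _ in range(100):
    loss = ((w - 3.0) ** 2).sum()
    loss.backward()               # fills w.grad with ∇L(w) = 2(w - 3)
    with torch.no_grad():
        w -= eta * w.grad         # w ← w - η × ∇L(w)
    w.grad.zero_()                # same role as optimizer.zero_grad()
```

Each step shrinks the distance to the minimum by a factor of 0.8, so w ends up very close to 3. torch.optim.SGD performs the same subtraction for every parameter in the model.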

# Basic SGD
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# SGD with momentum
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9  # Accelerates convergence
)

SGD Characteristics

Simple: Easy to understand and implement
Sensitive to LR: Requires careful tuning
Slow convergence: May take many epochs
Good generalization: Often generalizes better than adaptive methods

Adam (Adaptive Moment Estimation)

Update rule (simplified, bias correction omitted):

m ← β₁m + (1-β₁)∇L      (momentum: running mean of gradients)
v ← β₂v + (1-β₂)(∇L)²   (adaptive scale: running mean of squared gradients)
w ← w - η × m / (√v + ε)

Default hyperparameters:
- β₁ = 0.9
- β₂ = 0.999
- ε = 1e-8
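The full Adam update also rescales m and v by 1 - β₁ᵗ and 1 - β₂ᵗ (bias correction) and adds ε outside the square root. The sketch below writes those steps out by hand and compares them with torch.optim.Adam on a toy loss L(w) = Σ w²; the loss and the starting point are made up for the comparison.

```python
import torch

torch.manual_seed(0)
w_manual = torch.randn(5)
w_torch = w_manual.clone().requires_grad_(True)

lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
opt = torch.optim.Adam([w_torch], lr=lr, betas=(beta1, beta2), eps=eps)

m = torch.zeros(5)  # running mean of gradients
v = torch.zeros(5)  # running mean of squared gradients

for t in range(1, 4):
    # Hand-written Adam step; for L(w) = Σ w² the gradient is 2w
    grad = 2 * w_manual
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w_manual = w_manual - lr * m_hat / (v_hat.sqrt() + eps)

    # Same step through the library optimizer
    opt.zero_grad()
    (w_torch ** 2).sum().backward()
    opt.step()
```

On the first step m_hat / √v_hat equals ±1 for every coordinate, so each weight moves by about lr regardless of the gradient's magnitude. That scale invariance is what makes Adam less sensitive to the learning rate.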

# Adam optimizer
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3  # Typical starting point for Adam
)

Adam Characteristics

Adaptive: Adjusts learning rate per parameter
Fast convergence: Usually converges faster than SGD
Less sensitive to LR: Works well with default settings
Memory overhead: Stores running averages (m, v)

Optimizer Comparison

Aspect	SGD	Adam
Learning rate	Fixed (or scheduled)	Adaptive per parameter
Typical LR	0.01 - 0.1	1e-4 - 1e-3
Convergence speed	Slower	Faster
Tuning difficulty	Requires careful LR tuning	Works well with defaults
Generalization	Often better	May overfit more easily
Memory	Low	Higher (two extra buffers per parameter: m and v)
Best for	Well-tuned, final models	Rapid prototyping

Which to Use?

Start with Adam: Fast prototyping, easy to use
Fine-tune with SGD: Better final performance with proper tuning
For MNIST logistic regression: Both work well; Adam typically reaches 92-93%, and SGD with a well-chosen LR reaches 91-92%

7. Model Evaluation

Confusion Matrix

A confusion matrix shows the counts of true vs predicted classes, revealing which classes the model confuses.

# Compute confusion matrix
num_classes = 10
confusion_matrix = torch.zeros(num_classes, num_classes, dtype=torch.int64)

model.eval()
with torch.no_grad():
    for x, y in test_loader:
        x, y = x.to(device), y.to(device)
        logits = model(x)
        preds = logits.argmax(dim=1)

        # Update confusion matrix (tolist() yields plain ints on any device)
        for true_label, pred_label in zip(y.tolist(), preds.tolist()):
            confusion_matrix[true_label, pred_label] += 1

# Visualize
import matplotlib.pyplot as plt

plt.imshow(confusion_matrix, cmap='Blues')
plt.colorbar()
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

Interpreting the Confusion Matrix

Diagonal: Correct predictions
Off-diagonal: Errors (confusions)
Common confusions in MNIST:
- 4 ↔ 9 (similar shape)
- 3 ↔ 5 or 8 (curved digits)
- 7 ↔ 1 (both vertical)
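Beyond eyeballing the plot, the matrix can be summarized numerically. This sketch uses a small made-up 3-class matrix (not MNIST results) to show how per-class accuracy and the most frequent confusion can be read off; the same lines apply unchanged to a 10×10 matrix.

```python
import torch

# Made-up confusion matrix (rows: true class, columns: predicted class)
cm = torch.tensor([[50,  2,  3],
                   [ 4, 40,  6],
                   [ 1,  9, 45]])

# Diagonal / row sum = fraction of each true class predicted correctly
per_class_acc = cm.diag().float() / cm.sum(dim=1).float()
overall_acc = cm.diag().sum().item() / cm.sum().item()

# Most frequent confusion = largest off-diagonal entry
off_diag = cm.clone()
off_diag.fill_diagonal_(0)
flat_idx = off_diag.argmax().item()
true_cls, pred_cls = divmod(flat_idx, cm.size(1))
```

In this toy matrix the largest off-diagonal count is 9, at row 2 and column 1: true class 2 is most often mistaken for class 1.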

Visualizing Learned Weights

# Extract and visualize weights for each class
W = model.fc.weight.detach().cpu()  # [10, 784]

fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for cls in range(10):
    ax = axes[cls // 5, cls % 5]
    img = W[cls].view(28, 28)  # Reshape to image

    # Normalize for visualization
    vmax = img.abs().max().item()
    ax.imshow(img, cmap='seismic', vmin=-vmax, vmax=vmax)
    ax.set_title(f'Class {cls}')
    ax.axis('off')

What the Weights Show

Each class's weights form a template that the model learned:

Red pixels: Positive contribution (presence increases score)
Blue pixels: Negative contribution (presence decreases score)
White pixels: Neutral (don't affect classification)

Effect of Training Data Size

Training with less data typically reduces accuracy:

10% data: ~80-85% accuracy
25% data: ~87-89% accuracy
50% data: ~90-91% accuracy
100% data: ~92-93% accuracy

Key insight: More data helps, but with diminishing returns beyond ~50% of the training set
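One way to build the reduced training sets is to draw a random subset of indices. The helper take_fraction below is a hypothetical name, and the stand-in dataset replaces the real training split so the sketch runs without a download.

```python
import torch
from torch.utils.data import Subset, TensorDataset

# Stand-in for the training split
full = TensorDataset(torch.randn(1000, 1, 28, 28),
                     torch.randint(0, 10, (1000,)))

def take_fraction(dataset, frac, seed=0):
    """Return a random subset containing `frac` of the dataset."""
    g = torch.Generator().manual_seed(seed)   # fixed seed for reproducibility
    n_keep = int(frac * len(dataset))
    idx = torch.randperm(len(dataset), generator=g)[:n_keep].tolist()
    return Subset(dataset, idx)

subsets = {frac: take_fraction(full, frac) for frac in (0.10, 0.25, 0.50, 1.00)}
```

Because the same seed produces the same permutation, the smaller subsets are nested inside the larger ones, which keeps the comparison across data sizes consistent.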

Effect of Noisy Labels

import random
from torch.utils.data import Dataset

# Simulate 10% label noise
class NoisyLabels(Dataset):
    def __init__(self, base_dataset, noise_frac=0.1, num_classes=10):
        self.base = base_dataset
        self.noise_frac = noise_frac
        self.num_classes = num_classes

        # Randomly select indices to corrupt
        n = len(base_dataset)
        k = int(noise_frac * n)
        self.noisy_idx = set(random.sample(range(n), k))

    def __len__(self):
        # Required so DataLoader knows the dataset size
        return len(self.base)

    def __getitem__(self, idx):
        x, y = self.base[idx]
        if idx in self.noisy_idx:
            # Replace with a random incorrect label
            # (re-drawn on every access, so it can differ between epochs)
            wrong = [c for c in range(self.num_classes) if c != y]
            y = random.choice(wrong)
        return x, y

Impact of Label Noise

0% noise: ~92% accuracy (baseline)
10% noise: ~88-89% accuracy (3-4% drop)
Observation: Label noise is more harmful than missing data
Why? Model learns incorrect patterns from wrong labels

8. Common Challenges & Solutions

Challenge 1: Poor Convergence

Symptoms:

Loss not decreasing
Accuracy stuck at ~10% (random guessing)
NaN or Inf in loss

Solutions:

Lower learning rate: Try 0.01 or 0.001 instead of 0.1
Check normalization: Ensure inputs are normalized
Use Adam: More robust to LR choice
Gradient clipping: Prevent exploding gradients
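Gradient clipping fits between loss.backward() and optimizer.step(). The sketch below uses torch.nn.utils.clip_grad_norm_ on a single batch. The inputs are deliberately un-normalized to provoke large gradients, and max_norm = 1.0 is an assumed threshold rather than a recommended value.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(784, 10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(16, 784) * 100   # deliberately un-normalized inputs
y = torch.randint(0, 10, (16,))

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()

# Rescale gradients in place so their global L2 norm is at most max_norm;
# the return value is the norm measured before clipping.
max_norm = 1.0  # assumed threshold
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))

optimizer.step()
```

Logging norm_before over a few batches is a cheap way to tell whether exploding gradients are actually the problem before reaching for clipping.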

Challenge 2: Overfitting

Symptoms:

High train accuracy, low validation accuracy
Gap increases over epochs

Solutions:

Add regularization: L2 (weight_decay=1e-4)
More training data: Data augmentation
Early stopping: Stop when val accuracy plateaus
Simpler model: Logistic regression is already simple!
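Early stopping is usually implemented with a patience counter: training stops once validation accuracy has failed to improve for a set number of epochs. The function name, the patience of 3, and the validation curve below are all illustrative assumptions.

```python
def early_stopping_epoch(val_accs, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, given
    per-epoch validation accuracies, or None if patience never runs out."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, acc in enumerate(val_accs, start=1):
        if acc > best + min_delta:
            # New best: remember it and reset the counter
            best = acc
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up validation curve that plateaus after epoch 4
history = [0.88, 0.90, 0.915, 0.92, 0.919, 0.92, 0.918, 0.917]
stop_at = early_stopping_epoch(history, patience=3)
```

In a real training loop the same counter would live inside the epoch loop, typically together with saving the model's state_dict whenever a new best is reached so the best checkpoint can be restored.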

Challenge 3: Slow Training

Solutions:

Increase batch size: 128 or 256 for faster processing
Use GPU: Move model and data to CUDA
Reduce epochs: MNIST converges in 5-10 epochs
Use DataLoader workers: num_workers=4

Challenge 4: Incorrect Loss/Accuracy

Common Mistakes:

Applying softmax before CrossEntropyLoss (double softmax)
Not flattening images before linear layer
Computing accuracy on logits instead of predictions
Forgetting to call model.eval() during validation

Checklist:

✓ Use raw logits for CrossEntropyLoss (no softmax)
✓ Flatten: x.view(x.size(0), -1)
✓ Predictions: logits.argmax(dim=1)
✓ Use model.train() and model.eval() appropriately
✓ Use torch.no_grad() during validation

Summary: Key Takeaways

Logistic Regression for MNIST

Architecture: Single linear layer (784→10) that outputs logits; softmax is applied inside the loss
Loss: CrossEntropyLoss (combines log-softmax + negative log-likelihood)
Expected accuracy: 91-93% on MNIST
Training time: ~5-10 epochs to converge

Best Practices

Data: Normalize inputs, use train/val/test split
Regularization: L2 (weight_decay) prevents overfitting
Optimizer: Adam for fast prototyping, SGD for final tuning
Evaluation: Use confusion matrix to identify problem classes
Debugging: Visualize weights, track train/val metrics

Experimental Findings (Assignment)

L2 vs L1: L2 more stable, L1 for sparse weights
SGD vs Adam: Adam converges faster and reaches higher accuracy
Confusion: Digit 5 most confused (with 3, 8)
Data size: 50% data gives 90%+ accuracy
Label noise: 10% noise → 3-4% accuracy drop
