Linear regression is a supervised learning algorithm that models the relationship between input features and a continuous output variable using a linear function.
The usual training objective is the mean squared error (MSE):
MSE = (1/N) Σᵢ (yᵢ - ŷᵢ)²
Where:
- N: number of samples
- yᵢ: true value
- ŷᵢ: predicted value
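As a quick check of the formula above, here is a minimal pure-Python sketch. The function name `mse` and the sample values are illustrative assumptions, not part of the original notes.

```python
def mse(y_true, y_pred):
    """Mean squared error over two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n

# Hypothetical targets and predictions for N = 4 samples
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# Squared errors are 0.25, 0.25, 0.0 and 1.0, so the mean is 1.5 / 4
loss = mse(y_true, y_pred)
print(loss)
```

Because the errors are squared, the single prediction that is off by 1.0 contributes as much as four predictions that are off by 0.5.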
Key Properties
Linear model: Predictions are a weighted sum of the features; when the same form is used for classification, the decision boundary is a straight line/hyperplane
Differentiable: Can use gradient descent for optimization
Fast training: Efficient for large datasets
Interpretable: Weights indicate each feature's influence (comparable across features only when the inputs share a scale)
2. Logistic Regression
What is Logistic Regression?
Logistic regression extends linear regression for classification by applying a sigmoid (or softmax) function to convert linear outputs into probabilities.
Softmax:
softmax(zᵢ) = e^zᵢ / Σⱼ e^zⱼ
For MNIST (10 classes):
P(y = k | x) = e^(wₖᵀx + bₖ) / Σⱼ₌₀⁹ e^(wⱼᵀx + bⱼ)
Properties:
- Output sums to 1
- Each output is a probability
- Used for multi-class classification
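The properties listed above can be checked with a small pure-Python sketch. Subtracting the maximum score before exponentiating is a common stability trick and an assumption added here; it does not change the result because softmax is invariant to adding a constant to every score.

```python
import math

def softmax(z):
    """Softmax of a list of scores, shifted by max(z) to avoid overflow."""
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]            # hypothetical scores for 3 classes
probs = softmax(logits)             # positive values that sum to 1

# A naive e^1000 would overflow; the shifted version stays finite
large = softmax([1000.0, 999.0])
print(probs, large)
```

The largest score always receives the largest probability, so taking the argmax of the logits and of the probabilities gives the same class.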
Cross-Entropy Loss
Binary Cross-Entropy:
L = -[y log(ŷ) + (1-y) log(1-ŷ)]
Multi-class Cross-Entropy (Categorical):
L = -Σₖ yₖ log(ŷₖ)
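Both formulas can be evaluated by hand on small inputs. The sketch below is illustrative only; the function names and the example probabilities are assumptions.

```python
import math

def binary_cross_entropy(y, y_hat):
    """Loss for one sample with label y in {0, 1} and predicted P(y=1) = y_hat."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def categorical_cross_entropy(y_onehot, y_hat):
    """Loss for one sample; terms with a zero target are skipped."""
    return -sum(t * math.log(p) for t, p in zip(y_onehot, y_hat) if t > 0)

# Same true label (1), one confident correct and one confident wrong prediction
confident_right = binary_cross_entropy(1, 0.9)
confident_wrong = binary_cross_entropy(1, 0.1)

# With a one-hot target only the true class survives: loss = -log(0.7)
y_onehot = [0, 0, 1]
y_hat = [0.1, 0.2, 0.7]
multi = categorical_cross_entropy(y_onehot, y_hat)
print(confident_right, confident_wrong, multi)
```

With one-hot targets the categorical loss reduces to the negative log of the probability assigned to the true class, which is why class indices are enough as targets.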
For PyTorch:
CrossEntropyLoss combines LogSoftmax + NLLLoss
- Input: raw logits (before softmax)
- Target: class indices (not one-hot)
Important: PyTorch CrossEntropyLoss
PyTorch's nn.CrossEntropyLoss() expects raw logits (unnormalized scores), NOT probabilities. It applies log-softmax internally before computing the negative log-likelihood, so adding your own softmax layer would normalize the scores twice.
import torch
import torch.nn as nn

class LogisticRegression(nn.Module):
    def __init__(self, in_dim=28*28, out_dim=10):
        super().__init__()
        # Single linear layer: 784 inputs → 10 outputs
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        # Flatten 28×28 images to 784-dim vectors
        x = x.view(x.size(0), -1)   # [batch, 1, 28, 28] → [batch, 784]
        logits = self.fc(x)         # [batch, 784] → [batch, 10]
        return logits               # Raw scores (no softmax!)

# Usage
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LogisticRegression().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
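To see why the model returns raw logits, the following pure-Python sketch mimics the LogSoftmax + NLLLoss combination for a single sample and compares it with the "double softmax" mistake. It is an illustration under assumed values, not PyTorch's actual implementation; the helper names are hypothetical.

```python
import math

def log_softmax(z):
    """Log-softmax via the log-sum-exp trick."""
    z_max = max(z)
    lse = z_max + math.log(sum(math.exp(v - z_max) for v in z))
    return [v - lse for v in z]

def cross_entropy_from_logits(logits, target):
    """Negative log-likelihood of class index `target` given raw logits."""
    return -log_softmax(logits)[target]

logits = [4.0, 1.0, -2.0]   # hypothetical scores, class 0 clearly favoured
target = 0

# Correct usage: pass the raw logits
correct_loss = cross_entropy_from_logits(logits, target)

# Mistake: apply softmax first, then let the loss normalize again
probs = [math.exp(v) for v in log_softmax(logits)]
double_loss = cross_entropy_from_logits(probs, target)
print(correct_loss, double_loss)
```

Because probabilities lie in [0, 1], the second normalization flattens them, so the loss stays large even for a confident, correct prediction and the gradients shrink.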
3. MNIST Dataset
Overview
MNIST (Modified National Institute of Standards and Technology) is a benchmark dataset of handwritten digits (0-9) commonly used for image classification tasks.
Dataset Characteristics
Property          | Value
Training samples  | 60,000 images
Test samples      | 10,000 images
Image size        | 28×28 pixels (grayscale)
Classes           | 10 (digits 0-9)
Input features    | 784 (28×28 flattened)
Pixel range       | 0-255 (grayscale intensity)
Data Preprocessing
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Normalization transform
transform = transforms.Compose([
    transforms.ToTensor(),                      # Converts uint8 [0,255] to float [0,1]
    transforms.Normalize((0.1307,), (0.3081,))  # Mean & std of MNIST
])
# Load data
train_ds = datasets.MNIST(root="./data", train=True,
                          download=True, transform=transform)
test_ds = datasets.MNIST(root="./data", train=False,
                         download=True, transform=transform)
# Split train into train (50k) + validation (10k)
train_ds, val_ds = random_split(train_ds, [50_000, 10_000])
Why Normalize?
Faster convergence: Keeps gradients in a reasonable range
Numerical stability: Prevents overflow/underflow
Better optimization: Helps gradient descent find minima faster
MNIST normalization: Mean=0.1307, Std=0.3081 (computed from training set)
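To make the effect of the two transforms concrete, this small sketch applies the same arithmetic to single pixel values. The helper name `normalize_pixel` is an illustrative assumption; the mean and std are the values quoted above.

```python
MEAN, STD = 0.1307, 0.3081   # MNIST training-set statistics from the notes

def normalize_pixel(raw):
    """Map a raw 0-255 intensity to the normalized value the model sees."""
    scaled = raw / 255.0           # same scaling ToTensor performs
    return (scaled - MEAN) / STD   # same shift and scale Normalize performs

background = normalize_pixel(0)    # black background pixel
ink = normalize_pixel(255)         # fully white stroke pixel
print(background, ink)
```

Background pixels land at roughly -0.42 and full-intensity pixels at roughly 2.82, so the inputs are centred near zero with unit-scale spread instead of spanning 0-255.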
4. Implementation Details
Complete Training Loop
def train_epoch(model, loader, criterion, optimizer, device):
    model.train()  # Set to training mode
    total_loss = 0.0
    correct = 0
    total = 0
    for x, y in loader:
        # Move to device
        x, y = x.to(device), y.to(device)
        # Forward pass
        optimizer.zero_grad()         # Clear previous gradients
        logits = model(x)             # Get predictions
        loss = criterion(logits, y)   # Compute loss
        # Backward pass
        loss.backward()               # Compute gradients
        optimizer.step()              # Update weights
        # Track metrics
        total_loss += loss.item() * x.size(0)
        preds = logits.argmax(dim=1)  # Get class predictions
        correct += (preds == y).sum().item()
        total += y.size(0)
    avg_loss = total_loss / total
    accuracy = correct / total
    return avg_loss, accuracy

def validate(model, loader, criterion, device):
    model.eval()  # Set to evaluation mode
    total_loss = 0.0
    correct = 0
    total = 0
    with torch.no_grad():  # Disable gradient computation
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(x)
            loss = criterion(logits, y)
            total_loss += loss.item() * x.size(0)
            preds = logits.argmax(dim=1)
            correct += (preds == y).sum().item()
            total += y.size(0)
    avg_loss = total_loss / total
    accuracy = correct / total
    return avg_loss, accuracy
Always use validation set: Don't touch test set until final evaluation
Model modes: Use model.train() and model.eval()
No gradients in validation: Use with torch.no_grad():
Track metrics: Log loss and accuracy for both train and validation
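One detail of the metric tracking deserves a short illustration: the loops multiply each batch loss by the batch size because the criterion returns a per-batch mean (the default reduction) and the last batch is often smaller. The sketch below uses made-up batch losses and sizes to show the difference; `epoch_average` is a hypothetical helper.

```python
def epoch_average(batch_losses, batch_sizes):
    """Per-sample average loss computed from per-batch mean losses."""
    total = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)

# Hypothetical epoch: two full batches and a small final batch with a high loss
batch_losses = [0.50, 0.40, 1.00]
batch_sizes = [64, 64, 16]

weighted = epoch_average(batch_losses, batch_sizes)   # weights each sample equally
naive = sum(batch_losses) / len(batch_losses)         # over-weights the small batch
print(weighted, naive)
```

The naive mean of batch means lets 16 samples count as much as 64, which inflates the reported loss here; weighting by batch size avoids that.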
5. Regularization (L1 & L2)
Why Regularization?
Regularization prevents overfitting by penalizing large weights, encouraging the model to learn simpler patterns that generalize better.
L2 Regularization (Ridge / Weight Decay)
Loss with L2:
L = CrossEntropy + λ × Σ wᵢ²
Effect:
- Penalizes large weights
- Encourages weights to be small but non-zero
- Smoother decision boundaries
# L2 in PyTorch: use the weight_decay parameter.
# SGD adds weight_decay * w to each gradient, i.e. a penalty of (weight_decay / 2) × Σ wᵢ²
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,
    weight_decay=1e-4  # L2 regularization strength
)
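The name "weight decay" becomes clearer from a single update step. The sketch below is a pure-Python illustration of an SGD step with the L2 term folded into the gradient; the function name and the values are assumptions chosen so the shrinkage is easy to see.

```python
def sgd_step_with_weight_decay(w, grad, lr, weight_decay):
    """One SGD step where the L2 term adds weight_decay * w to each gradient."""
    return [wi - lr * (gi + weight_decay * wi) for wi, gi in zip(w, grad)]

w = [2.0, -1.0, 0.5]
zero_grad = [0.0, 0.0, 0.0]   # pretend the data loss has zero gradient

# With no data gradient, every weight is multiplied by (1 - lr * weight_decay)
w_next = sgd_step_with_weight_decay(w, zero_grad, lr=0.1, weight_decay=0.1)
print(w_next)
```

Each weight shrinks by the same factor (0.99 here) on every step, so large weights lose more in absolute terms but none is pushed exactly to zero.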
L1 Regularization (Lasso)
Loss with L1:
L = CrossEntropy + λ × Σ |wᵢ|
Effect:
- Penalizes absolute value of weights
- Encourages sparse weights (many weights → 0)
- Feature selection (removes irrelevant features)
# L1 in PyTorch: manual implementation
def train_with_l1(model, loader, criterion, optimizer, device, l1_lambda=1e-5):
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        logits = model(x)
        loss = criterion(logits, y)
        # Add L1 penalty (summed over all parameters, biases included)
        l1_penalty = 0.0
        for param in model.parameters():
            l1_penalty += param.abs().sum()
        total_loss = loss + l1_lambda * l1_penalty
        total_loss.backward()
        optimizer.step()
Comparison: L1 vs L2
Aspect                  | L2 (Ridge)              | L1 (Lasso)
Penalty                 | Σ wᵢ²                   | Σ |wᵢ|
Weight behavior         | Small, non-zero         | Sparse (many zeros)
Feature selection       | No                      | Yes
Differentiability       | Smooth everywhere       | Not differentiable at 0
Use case                | General regularization  | High-dimensional, sparse data
PyTorch implementation  | weight_decay parameter  | Manual penalty in loss
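The "small, non-zero" versus "sparse" behavior in the table follows from how each penalty pulls on a weight. The sketch below applies only the penalty (no data loss) to a single weight. For L1 it clamps at zero instead of overshooting, which is a soft-threshold style step assumed here for illustration; a plain subgradient step, as in the manual PyTorch loop, tends to hover around zero rather than land on it exactly.

```python
def shrink(w, steps, lr, lam, kind):
    """Repeatedly apply only the regularization pull to a single weight."""
    for _ in range(steps):
        if kind == "l2":
            w -= lr * 2 * lam * w        # pull proportional to w
        else:
            step = lr * lam              # constant pull, independent of |w|
            # Clamp at zero rather than crossing it (assumed soft-threshold step)
            w = max(w - step, 0.0) if w > 0 else min(w + step, 0.0)
    return w

w_l2 = shrink(1.0, steps=200, lr=0.1, lam=0.1, kind="l2")
w_l1 = shrink(1.0, steps=200, lr=0.1, lam=0.1, kind="l1")
print(w_l2, w_l1)
```

The L2 pull weakens as the weight shrinks, so the weight decays geometrically and stays positive; the L1 pull stays constant and drives the weight all the way to zero.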
Typical Regularization Strengths
L2 (weight_decay): 1e-5 to 1e-3
L1 (lambda): 1e-6 to 1e-4
Start small and increase if overfitting persists
6. Optimizers (SGD vs Adam)
Stochastic Gradient Descent (SGD)
Update Rule:
w ← w - η × ∇L(w)
Where:
- w: weights
- η: learning rate
- ∇L(w): gradient of loss w.r.t. weights
Good generalization: Often generalizes better than adaptive methods
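The update rule can be tried on a one-dimensional toy problem. This sketch uses the exact gradient of L(w) = (w - 3)², whereas SGD proper estimates the gradient from a mini-batch; the function name and learning rates are illustrative assumptions.

```python
def gradient_descent(grad_fn, w0, lr, steps):
    """Apply w ← w - lr × grad(w) for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return w

def grad(w):
    """Gradient of L(w) = (w - 3)^2, which is minimized at w = 3."""
    return 2.0 * (w - 3.0)

w_good = gradient_descent(grad, w0=0.0, lr=0.1, steps=50)       # converges toward 3
w_diverged = gradient_descent(grad, w0=0.0, lr=1.1, steps=50)   # overshoots further each step
print(w_good, w_diverged)
```

With lr=0.1 the distance to the minimum shrinks by a factor of 0.8 per step; with lr=1.1 it grows by a factor of 1.2 per step, which is the toy version of a loss that blows up when the learning rate is too high.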
Adam (Adaptive Moment Estimation)
Update Rule (simplified; bias correction of m and v omitted):
m ← β₁m + (1-β₁)∇L (momentum)
v ← β₂v + (1-β₂)(∇L)² (adaptive learning rate)
w ← w - η × m / (√v + ε)
Default hyperparameters:
- β₁ = 0.9
- β₂ = 0.999
- ε = 1e-8
# Adam optimizer
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3  # Typical starting point for Adam
)
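For intuition about the "adaptive" part, here is a single-parameter sketch of one Adam step, including the bias-correction terms that the simplified rule leaves out. It is a scalar illustration with hypothetical names, not the library's implementation.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# First step from the same weight with a tiny and a huge gradient
w_small, _, _ = adam_step(w=1.0, grad=0.01, m=0.0, v=0.0, t=1)
w_large, _, _ = adam_step(w=1.0, grad=100.0, m=0.0, v=0.0, t=1)
print(1.0 - w_small, 1.0 - w_large)
```

Both updates move the weight by about lr, even though the gradients differ by a factor of 10,000. This normalization by the gradient's own scale is why Adam is less sensitive to the learning rate than plain SGD.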
Adam Characteristics
Adaptive: Adjusts learning rate per parameter
Fast convergence: Usually converges faster than SGD
Less sensitive to LR: Works well with default settings
Memory overhead: Stores running averages (m, v)
Optimizer Comparison
Aspect             | SGD                         | Adam
Learning rate      | Fixed (or scheduled)        | Adaptive per parameter
Typical LR         | 0.01 - 0.1                  | 1e-4 - 1e-3
Convergence speed  | Slower                      | Faster
Tuning difficulty  | Requires careful LR tuning  | Works well with defaults
Generalization     | Often better                | May overfit more easily
Memory             | Low                         | Higher (two extra buffers, m and v, per parameter)
Best for           | Well-tuned, final models    | Rapid prototyping
Which to Use?
Start with Adam: Fast prototyping, easy to use
Fine-tune with SGD: Better final performance with proper tuning
For MNIST logistic regression: Both work well; Adam typically reaches 92-93% accuracy, and SGD with a well-chosen learning rate reaches 91-92%
7. Model Evaluation
Confusion Matrix
A confusion matrix shows the counts of true vs predicted classes, revealing which classes the model confuses.
import matplotlib.pyplot as plt

# Compute confusion matrix (rows: true class, columns: predicted class)
num_classes = 10
confusion_matrix = torch.zeros(num_classes, num_classes, dtype=torch.int64)
model.eval()
with torch.no_grad():
    for x, y in test_loader:
        x, y = x.to(device), y.to(device)
        logits = model(x)
        preds = logits.argmax(dim=1)
        # Update confusion matrix; tolist() copies labels to plain Python ints
        for true_label, pred_label in zip(y.tolist(), preds.tolist()):
            confusion_matrix[true_label, pred_label] += 1
# Visualize
plt.imshow(confusion_matrix, cmap='Blues')
plt.colorbar()
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()
Interpreting the Confusion Matrix
Diagonal: Correct predictions
Off-diagonal: Errors (confusions)
Common confusions in MNIST:
4 ↔ 9 (similar shape)
3 ↔ 5 or 8 (curved digits)
7 ↔ 1 (both vertical)
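Reading the matrix by eye works for 10 classes, but the same information can be pulled out programmatically. The sketch below uses a made-up 3-class matrix as nested lists (rows are true classes, columns are predictions); the helper names are assumptions, and a tensor such as the one computed above could be converted with `.tolist()` first.

```python
def per_class_recall(cm):
    """Fraction of each true class that was predicted correctly (diagonal / row sum)."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]

def top_confusions(cm, k=3):
    """Largest off-diagonal cells as (true_class, predicted_class, count)."""
    n = len(cm)
    pairs = [(cm[i][j], i, j) for i in range(n) for j in range(n) if i != j]
    pairs.sort(reverse=True)
    return [(i, j, count) for count, i, j in pairs[:k]]

# Hypothetical 3-class confusion matrix
cm = [
    [50, 2, 0],
    [3, 45, 2],
    [1, 9, 40],
]
recall = per_class_recall(cm)       # per-class accuracy on the diagonal
worst = top_confusions(cm, k=1)     # class 2 is most often mistaken for class 1
print(recall, worst)
```

Sorting the off-diagonal cells surfaces pairs like the ones listed above without scanning the heatmap.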
Visualizing Learned Weights
# Extract and visualize weights for each class
W = model.fc.weight.detach().cpu()  # [10, 784]
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for cls in range(10):
    ax = axes[cls // 5, cls % 5]
    img = W[cls].view(28, 28)  # Reshape to image
    # Normalize for visualization
    vmax = img.abs().max().item()
    ax.imshow(img, cmap='seismic', vmin=-vmax, vmax=vmax)
    ax.set_title(f'Class {cls}')
    ax.axis('off')
What the Weights Show
Each class's weights form a template that the model learned:
Red pixels: Positive contribution (presence increases score)
Blue pixels: Negative contribution (presence decreases score)
White pixels: Neutral (don't affect classification)
Effect of Training Data Size
Training with less data typically reduces accuracy:
10% data: ~80-85% accuracy
25% data: ~87-89% accuracy
50% data: ~90-91% accuracy
100% data: ~92-93% accuracy
Key insight: More data helps, but diminishing returns after ~50%
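One way to run such a data-size experiment is to draw a reproducible random subset of training indices for each fraction. The helper below is a hypothetical sketch using only the standard library; the resulting index list could then be passed to something like `torch.utils.data.Subset(train_ds, indices)`.

```python
import random

def subset_indices(n_total, fraction, seed=0):
    """Reproducible random sample of dataset indices covering `fraction` of the data."""
    rng = random.Random(seed)            # local RNG so the global state is untouched
    k = round(fraction * n_total)
    return sorted(rng.sample(range(n_total), k))

# Index sets for 10% and 25% of a 50,000-sample training split
idx_10 = subset_indices(50_000, 0.10)
idx_25 = subset_indices(50_000, 0.25)
print(len(idx_10), len(idx_25))
```

Fixing the seed keeps the subsets identical across runs, so accuracy differences reflect the amount of data rather than which samples happened to be drawn.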
Effect of Noisy Labels
import random
from torch.utils.data import Dataset

# Simulate 10% label noise
class NoisyLabels(Dataset):
    def __init__(self, base_dataset, noise_frac=0.1, num_classes=10):
        self.base = base_dataset
        self.noise_frac = noise_frac
        self.num_classes = num_classes
        # Randomly select indices to corrupt and fix a nonzero label shift for
        # each one, so the wrong label stays the same across epochs
        n = len(base_dataset)
        k = int(noise_frac * n)
        self.noisy_shift = {i: random.randint(1, num_classes - 1)
                            for i in random.sample(range(n), k)}

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        x, y = self.base[idx]
        if idx in self.noisy_shift:
            # Replace with a random incorrect label (the shift is never 0)
            y = (y + self.noisy_shift[idx]) % self.num_classes
        return x, y
Impact of Label Noise
0% noise: ~92% accuracy (baseline)
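10% noise: ~88-89% accuracy (a drop of 3-4 percentage points)
Observation: Label noise is more harmful than missing data; corrupting 10% of the labels costs more accuracy than discarding half of the training set (~88-89% vs ~90-91%)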
Why? Model learns incorrect patterns from wrong labels
8. Common Challenges & Solutions
Challenge 1: Poor Convergence
Symptoms:
Loss not decreasing
Accuracy stuck at ~10% (random guessing)
NaN or Inf in loss
Solutions:
Lower learning rate: Try 0.01 or 0.001 instead of 0.1
Check normalization: Ensure inputs are normalized
Use Adam: More robust to LR choice
Gradient clipping: Prevent exploding gradients
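As an illustration of the gradient-clipping item, the sketch below rescales a gradient vector when its L2 norm exceeds a threshold. It is a pure-Python stand-in with a hypothetical function name; in a PyTorch loop the built-in `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)` would be called between `loss.backward()` and `optimizer.step()`.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale grads so their L2 norm is at most max_norm; return (clipped, original_norm)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm or total_norm == 0.0:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

# A gradient with norm 5 is rescaled to norm 1; its direction is unchanged
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(clipped, norm_before)
```

Clipping bounds the size of any single update without changing its direction, which guards against the NaN or Inf losses listed under the symptoms.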
Challenge 2: Overfitting
Symptoms:
High train accuracy, low validation accuracy
Gap increases over epochs
Solutions:
Add regularization: L2 (weight_decay=1e-4)
More training data: Collect more samples or use data augmentation
Early stopping: Stop when val accuracy plateaus
Simpler model: Logistic regression is already simple!
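Early stopping can be sketched as a small helper that watches validation accuracy and signals when it has stopped improving. The class name, the `patience` and `min_delta` parameters, and the accuracy history below are all illustrative assumptions.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_accuracy):
        """Record one epoch's validation accuracy; return True when training should stop."""
        if val_accuracy > self.best + self.min_delta:
            self.best = val_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Hypothetical validation accuracies per epoch
stopper = EarlyStopping(patience=2)
history = [0.88, 0.90, 0.91, 0.91, 0.905, 0.92]
stopped_at = None
for epoch, acc in enumerate(history, start=1):
    if stopper.step(acc):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

In a real loop, `stopper.step(...)` would receive the accuracy returned by the validation function each epoch, and the model weights from the best epoch would typically be saved alongside. Note the trade-off: with a patience of 2, training halts at epoch 5 and never sees the later improvement at epoch 6.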
Challenge 3: Slow Training
Solutions:
Increase batch size: 128 or 256 for faster processing
Use GPU: Move model and data to CUDA
Reduce epochs: MNIST converges in 5-10 epochs
Use DataLoader workers: num_workers=4
Challenge 4: Incorrect Loss/Accuracy
Common Mistakes:
Applying softmax before CrossEntropyLoss (double softmax)
Not flattening images before linear layer
Comparing raw logits to labels instead of taking the argmax to get class predictions first
Forgetting to call model.eval() during validation
Checklist:
✓ Use raw logits for CrossEntropyLoss (no softmax)
✓ Flatten: x.view(x.size(0), -1)
✓ Predictions: logits.argmax(dim=1)
✓ Use model.train() and model.eval() appropriately
✓ Use torch.no_grad() during validation
Summary: Key Takeaways
Logistic Regression for MNIST
Architecture: Single linear layer (784→10) + softmax (applied inside CrossEntropyLoss during training)