Neural networks learn hierarchical features for object detection:
Input layer: Raw pixel values
Early layers: Edges, corners, textures
Middle layers: Parts of objects (wheels, faces)
Deep layers: Whole objects (cars, people)
Output layer: Bounding boxes + class predictions
2. HISTORY OF OBJECT DETECTION (R-CNN → FASTER R-CNN)
Evolution of Two-Stage Detectors
Selective Search (Pre-Deep Learning)
Selective Search: Algorithm to generate ~2000 region proposals per image
How it works:
Over-segment image into many small regions
Hierarchically group regions based on color, texture, size, fill
Generate bounding boxes around grouped regions
Output: ~2000 region proposals that likely contain objects
R-CNN (2014)
R-CNN Pipeline:
1. Input image
2. Selective Search → ~2000 region proposals
3. Warp each region to fixed size (227×227)
4. Pass each warped region through CNN (AlexNet)
5. Extract CNN features for each region
6. SVM classifier per class
7. Bounding box regressor to refine boxes
Problem: Very slow (~47 seconds per image)
- Must run CNN forward pass ~2000 times per image
Fast R-CNN (2015)
Key innovation: Pass image through CNN only once, then extract features for each region
Fast R-CNN Pipeline:
1. Input image → CNN backbone (single forward pass)
2. Get feature map from last conv layer
3. Selective Search → ~2000 region proposals (on original image)
4. Project each proposal onto feature map → RoI (Region of Interest)
5. **ROI Pooling**: Convert each RoI to fixed-size feature vector
6. Fully connected layers → (classification, bbox regression)
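Speed-up: ~0.3 seconds per image for the network itself, excluding region proposal time (down from ~47 s for R-CNN)
Bottleneck: Selective Search is still slow and now dominates the runtime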
ROI Pooling Explained
ROI Pooling converts variable-sized regions into fixed-size features
Example:
- Feature map: 512 × 20 × 15 (channels × height × width)
- RoI on feature map: Variable size (e.g., 7×5 region)
- Target output: 512 × 2 × 2 (fixed size)
Process:
1. Divide RoI into 2×2 grid (target output size)
2. Max pool within each grid cell
3. Result: 512 × 2 × 2 fixed-size feature
Benefits:
- Variable input → fixed output (required for FC layers)
- Differentiable (can backprop)
- Fast (just max pooling)
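The process above can be sketched for a single channel in plain Python. This is a minimal illustration, not the library implementation: the function name `roi_max_pool`, the toy feature map, and the bin-boundary rule (floor for the start, ceil for the end of each bin) are assumptions chosen for the example.

```python
import math

def roi_max_pool(feature, roi, out_h=2, out_w=2):
    """Max-pool one RoI of a single-channel feature map to out_h x out_w.

    feature: 2D list indexed [row][col]
    roi: (x1, y1, x2, y2) in integer feature-map cells, x2/y2 exclusive
    """
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Row range of bin i: floor for the start, ceil for the end
        r0 = y1 + math.floor(i * roi_h / out_h)
        r1 = y1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + math.floor(j * roi_w / out_w)
            c1 = x1 + math.ceil((j + 1) * roi_w / out_w)
            # Max over every cell that falls inside this bin
            row.append(max(feature[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled

# Toy map, 20 rows x 15 columns; each cell stores row * 100 + col
feature = [[r * 100 + c for c in range(15)] for r in range(20)]
# RoI that is 5 cells wide and 7 cells tall
pooled = roi_max_pool(feature, roi=(3, 4, 8, 11))
print(pooled)  # [[705, 707], [1005, 1007]]
```

With 512 channels the same pooling would simply be repeated per channel, giving the 512 × 2 × 2 output described above.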
Faster R-CNN (2015)
Key innovation: Replace selective search with Region Proposal Network (RPN)
RPN: Neural network that predicts region proposals
Faster R-CNN Pipeline:
1. Input image → CNN backbone (shared)
2. Feature map → **Region Proposal Network (RPN)**
- RPN outputs ~300 region proposals
- Much faster than selective search
3. ROI Pooling on proposed regions
4. Classification + bbox regression heads
Two-stage training:
Stage 1: Train backbone + RPN
Stage 2: Train backbone + detection heads (classification + bbox)
Speed: ~0.2 seconds per image (5 FPS)
Region Proposal Network (RPN) Details
RPN predicts whether each location contains an object
At each location in feature map:
1. Place K anchor boxes of different scales/aspect ratios
- Example: 3 scales × 3 aspect ratios = 9 anchors
2. For each anchor, predict:
- Objectness score (1 value): does anchor contain object?
- Box offsets (4 values): how to adjust anchor to fit object?
Output per location: K anchors × (1 objectness + 4 offsets) = K × 5 values
Full output for 20×15 feature map:
- Objectness: 20 × 15 × K
- Box transforms: 20 × 15 × K × 4
Post-processing:
1. Apply objectness threshold (e.g., > 0.5)
2. Apply box transforms to anchors
3. Apply NMS to remove duplicates
4. Keep top ~300 proposals
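Step 2 of the post-processing ("apply box transforms to anchors") can be sketched as follows. The notes do not spell out the parameterization, so this assumes the commonly used one: center shifts scaled by the anchor size, and log-space width/height scaling. The function name `decode_anchor` and the sample numbers are hypothetical.

```python
import math

def decode_anchor(anchor, deltas):
    """Apply predicted offsets to one anchor.

    anchor: (cx, cy, w, h) of the anchor box
    deltas: (tx, ty, tw, th) predicted by the network
    Returns the adjusted box as (x1, y1, x2, y2).
    """
    cx_a, cy_a, w_a, h_a = anchor
    tx, ty, tw, th = deltas
    cx = cx_a + tx * w_a      # shift center by a fraction of anchor width
    cy = cy_a + ty * h_a      # shift center by a fraction of anchor height
    w = w_a * math.exp(tw)    # scale width in log space
    h = h_a * math.exp(th)    # scale height in log space
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# A 64x64 anchor centered at (100, 100), moved right/up and made twice as tall
box = decode_anchor((100.0, 100.0, 64.0, 64.0),
                    (0.25, -0.5, 0.0, math.log(2.0)))
print(box)  # approximately (84, 4, 148, 132)
```

Zero deltas return the anchor unchanged, which is why anchors act as starting points for the regression.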
Why Anchors?
Problem: Neural networks need fixed-size outputs, but objects have variable sizes/shapes
Solution: Anchor boxes
Pre-define K box shapes at each location
Network predicts adjustments to these anchors
Anchor shapes are either hand-picked (Faster R-CNN) or derived by k-means clustering of training boxes (YOLOv2)
Without anchors (selective search): Can generate arbitrary proposals, but not differentiable
With anchors (RPN): Differentiable, learnable, but constrained to anchor shapes
Selective Search: Can generate proposals of any shape/size
Bottom-up approach based on image segmentation
Not constrained by pre-defined shapes
Downside: Not learnable, slow, hand-crafted heuristics
RPN with Anchors: Must predict from fixed set of shapes
Top-down approach using neural network
Anchors provide starting points
Upside: Learnable, fast, end-to-end trainable
Anchor Box Design
| Aspect Ratio (w:h) | Scale | Purpose |
|---|---|---|
| 1:1 | Small, Medium, Large | Square objects (faces, balls) |
| 1:2 | Small, Medium, Large | Tall objects (people, bottles) |
| 2:1 | Small, Medium, Large | Wide objects (cars, buses) |
Common configuration: 3 scales × 3 aspect ratios = 9 anchors per location
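A minimal sketch of how the 9 anchor shapes could be generated and counted. The scale values (128, 256, 512 pixels) and the function name `make_anchors` are assumptions for illustration; the notes only say "small, medium, large". Ratios are taken as width / height.

```python
import math

def make_anchors(scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return (w, h) anchor shapes with ratio = w / h and area = scale ** 2."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * math.sqrt(r)
            h = s / math.sqrt(r)
            anchors.append((w, h))
    return anchors

anchors = make_anchors()
num_locations = 20 * 15                  # feature map used in the RPN example
total = len(anchors) * num_locations
print(len(anchors), total)               # 9 2700
```

Keeping the area fixed per scale means the three ratios at one scale cover roughly the same amount of image, only in different shapes.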
ROI Pooling vs ROI Align
ROI Pooling Problem
Quantization: ROI pooling uses integer coordinates, causing misalignment
Example: RoI at (6.5, 4.7, 18.3, 12.9) → rounded to (6, 4, 18, 12)
Impact: Slight misalignment, especially bad for segmentation
ROI Align Solution (Mask R-CNN)
ROI Align: Use bilinear interpolation instead of rounding
Preserve exact spatial locations
Better for pixel-level tasks (segmentation)
Standard in modern detectors
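The bilinear interpolation that ROI Align relies on can be sketched in a few lines. This is a single-channel illustration with a hypothetical helper name, `bilinear_sample`; it assumes the sample point lies inside the feature map (0 ≤ x ≤ width − 1, 0 ≤ y ≤ height − 1).

```python
import math

def bilinear_sample(feature, x, y):
    """Sample a 2D feature map (indexed [row][col]) at fractional (x, y)."""
    x0, y0 = math.floor(x), math.floor(y)
    # Clamp the right/bottom neighbors so edge samples stay in bounds
    x1 = min(x0 + 1, len(feature[0]) - 1)
    y1 = min(y0 + 1, len(feature) - 1)
    dx, dy = x - x0, y - y0
    # Interpolate along x on the two rows, then along y between them
    top = (1 - dx) * feature[y0][x0] + dx * feature[y0][x1]
    bottom = (1 - dx) * feature[y1][x0] + dx * feature[y1][x1]
    return (1 - dy) * top + dy * bottom

feature = [[0.0, 10.0],
           [20.0, 30.0]]
value = bilinear_sample(feature, 0.5, 0.5)
print(value)  # 15.0
```

ROI pooling would have rounded (0.5, 0.5) to a single cell; here all four neighbors contribute according to their distance from the sample point.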
4. YOLO: SINGLE-STAGE DETECTION
YOLO Philosophy
You Only Look Once: Predict bounding boxes and classes in a single forward pass
Key insight: Frame detection as regression, not classification on proposals
Speed advantage: roughly an order of magnitude faster than Faster R-CNN (YOLOv1 reported ~45 FPS vs. ~5 FPS)
YOLOv1 (2016)
YOLOv1 Architecture:
1. Divide image into S × S grid (e.g., 7×7)
2. Each grid cell predicts:
- B bounding boxes (x, y, w, h, confidence)
- C class probabilities
Output tensor: S × S × (B×5 + C)
Example (S=7, B=2, C=20): 7 × 7 × 30
Grid cell responsible for object if:
- Object's center falls in that cell
Confidence score:
- Pr(Object) × IoU(pred, truth)
- 0 if no object in cell
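A small sketch of the grid bookkeeping described above: the output tensor shape and the cell responsible for an object. The 448×448 input size and the helper name `responsible_cell` are assumptions for the example.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell that contains the object center."""
    col = min(int(cx / img_w * S), S - 1)   # clamp centers on the right edge
    row = min(int(cy / img_h * S), S - 1)   # clamp centers on the bottom edge
    return row, col

S, B, C = 7, 2, 20
output_shape = (S, S, B * 5 + C)
print(output_shape)  # (7, 7, 30)

# Object centered at pixel (300, 100) in a 448x448 image
cell = responsible_cell(cx=300, cy=100, img_w=448, img_h=448)
print(cell)  # (1, 4)
```

Only the predictors in that one cell are trained to detect the object, which is why two objects whose centers share a cell are hard for YOLOv1.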
YOLO Evolution
| Version | Year | Key Improvements |
|---|---|---|
| YOLOv1 | 2016 | Single-stage, grid-based, real-time |
| YOLOv2 | 2017 | Batch norm, anchor boxes, multi-scale training |
| YOLOv3 | 2018 | FPN (3 scales), better for small objects |
| YOLOv4 | 2020 | CSPDarknet, Mish activation, mosaic augmentation |
| YOLOv5 | 2020 | PyTorch, auto-anchor, production-ready |
| YOLOv8 | 2023 | Anchor-free, C2f blocks, improved neck |
| YOLO11 | 2024 | Latest state-of-the-art |
YOLOv8 Architecture (Used in Assignment)
YOLOv8 Components:
Backbone: CSPDarknet
├─ Extracts features at multiple scales
├─ CSP (Cross Stage Partial) blocks
└─ SPPF (Spatial Pyramid Pooling - Fast)
Neck: PANet (Path Aggregation Network)
├─ Top-down: FPN for multi-scale fusion
└─ Bottom-up: PAN for feature enhancement
Head: Decoupled anchor-free head
├─ Classification head → class probabilities
└─ Regression head → bbox coordinates (direct prediction)
Key differences from earlier YOLO:
- No anchor boxes (anchor-free)
- Separate heads for classification and localization
- C2f modules instead of C3 (faster, better gradient flow)
Why Anchor-Free?
Simpler: No need to tune anchor sizes/ratios
Better generalization: Not constrained to pre-defined shapes
Fewer hyperparameters: Easier to use
Direct prediction: Predict box center and size directly
YOLO Loss Function
YOLOv1 Loss (Multi-part):
L_total = λ_coord × L_box + L_obj + λ_noobj × L_noobj + L_class
L_box: Localization loss (coordinates + size), weighted up by λ_coord (5 in the paper)
L_obj: Confidence loss (cells with objects)
L_noobj: Confidence loss (cells without objects), weighted down by λ_noobj (0.5 in the paper)
L_class: Classification loss
YOLOv8 Loss:
L_total = L_cls + L_box + L_dfl
L_cls: Classification loss (BCE)
L_box: Box loss (CIoU - Complete IoU)
L_dfl: Distribution Focal Loss (for bbox refinement)
5. MODERN APPROACHES (CENTERNET, DETR)
Anchor-less Object Detection
Motivation: Anchors add complexity and hyperparameters
Anchor-less approaches:
Keypoint-based: CenterNet - detect object centers as keypoints
Transformer-based: DETR - set prediction with transformers
CenterNet (2019)
CenterNet: Objects as Points
Key idea: Represent each object as a single point (its center)
Architecture:
1. Input image → Backbone CNN → Feature map
2. Three prediction heads:
a) Heatmap: Detect object centers (Gaussian peaks)
b) Size: Predict width and height at center
c) Offset: Sub-pixel offset (for quantization correction)
Training:
- Ground truth: Gaussian heatmap around object centers
- Loss: Focal loss for heatmap + L1 loss for size/offset
Inference:
1. Find local maxima in heatmap (object centers)
2. Read size and offset at each center
3. Reconstruct bounding boxes
Advantages:
- No anchors, no NMS needed (few duplicate detections)
- Simple and fast
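The three inference steps can be sketched in plain Python. This is an illustration under stated assumptions: a single class, a 3×3 neighborhood for the local-maximum test, a score threshold of 0.5, and boxes expressed in feature-map units. The function name `decode_centers` and the toy inputs are hypothetical.

```python
def decode_centers(heatmap, sizes, offsets, threshold=0.5):
    """Turn heatmap peaks into (x1, y1, x2, y2, score) boxes."""
    H, W = len(heatmap), len(heatmap[0])
    boxes = []
    for r in range(H):
        for c in range(W):
            score = heatmap[r][c]
            if score < threshold:
                continue
            # A peak must be >= every value in its 3x3 window
            window = [heatmap[rr][cc]
                      for rr in range(max(r - 1, 0), min(r + 2, H))
                      for cc in range(max(c - 1, 0), min(c + 2, W))]
            if score < max(window):
                continue
            w, h = sizes[r][c]         # predicted width and height
            dx, dy = offsets[r][c]     # sub-pixel center correction
            cx, cy = c + dx, r + dy
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, score))
    return boxes

heatmap = [[0.0, 0.1, 0.2, 0.0],
           [0.1, 0.6, 0.9, 0.1],
           [0.0, 0.2, 0.3, 0.0],
           [0.0, 0.0, 0.0, 0.0]]
sizes = [[(2.0, 4.0)] * 4 for _ in range(4)]
offsets = [[(0.5, 0.25)] * 4 for _ in range(4)]
boxes = decode_centers(heatmap, sizes, offsets)
print(boxes)  # [(1.5, -0.75, 3.5, 3.25, 0.9)]
```

The 0.6 cell next to the 0.9 peak is above the threshold but is not a local maximum, so it is dropped without any IoU-based suppression.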
DETR (2020) - End-to-End Detection with Transformers
DETR Philosophy
Set prediction: Predict a fixed-size set of objects in parallel
No hand-crafted components: No anchors, no NMS, learned end-to-end
DETR Architecture:
1. Input image → CNN backbone → Feature map
2. Flatten feature map → Sequence of features
3. Add positional encodings
4. Transformer encoder-decoder:
- Encoder: Process image features
- Decoder: N object queries → N predictions
5. FFN heads → (class, bbox) for each query
Object queries:
- N learned embeddings (e.g., N=100)
- Each query predicts one object (or "no object")
- Transformer learns to assign queries to objects
Hungarian Matching:
- Bipartite matching between predictions and ground truth
- Find optimal 1-to-1 assignment
- Box loss computed only on matched pairs; unmatched queries are trained to predict "no object"
Training:
- Classification loss: Cross-entropy
- Box loss: L1 + GIoU loss
- Hungarian matching provides assignment
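To make the 1-to-1 assignment concrete, here is a brute-force sketch that finds the same minimum-cost matching the Hungarian algorithm would, by trying every assignment. It is only practical for tiny inputs; real implementations use an efficient solver such as `scipy.optimize.linear_sum_assignment`. The function name `best_assignment` and the cost values are hypothetical.

```python
from itertools import permutations

def best_assignment(cost):
    """Minimum-cost 1-to-1 assignment of queries to ground truths.

    cost[q][g] is the matching cost of query q with ground truth g.
    Assumes at least as many queries as ground-truth objects.
    Returns (list of (gt_index, query_index) pairs, total cost).
    """
    n_queries, n_gt = len(cost), len(cost[0])
    best_total, best_match = float("inf"), None
    # Each permutation picks a distinct query for every ground truth
    for queries in permutations(range(n_queries), n_gt):
        total = sum(cost[q][g] for g, q in enumerate(queries))
        if total < best_total:
            best_total, best_match = total, list(enumerate(queries))
    return best_match, best_total

# 3 queries, 2 ground-truth objects
cost = [[0.9, 0.2],
        [0.1, 0.8],
        [0.5, 0.6]]
match, total = best_assignment(cost)
print(match)  # [(0, 1), (1, 0)]
```

Ground truth 0 is matched to query 1 and ground truth 1 to query 0; query 2 stays unmatched and would be supervised as "no object".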
import torch

def xyxy_to_xywh(boxes):
    """XYXY (corners) to XYWH (top-left corner + size)"""
    x1, y1, x2, y2 = boxes[..., 0], boxes[..., 1], boxes[..., 2], boxes[..., 3]
    x = x1
    y = y1
    w = x2 - x1
    h = y2 - y1
    return torch.stack([x, y, w, h], dim=-1)

def xywh_to_xyxy(boxes):
    """XYWH (top-left corner + size) to XYXY (corners)"""
    x, y, w, h = boxes[..., 0], boxes[..., 1], boxes[..., 2], boxes[..., 3]
    x1 = x
    y1 = y
    x2 = x + w
    y2 = y + h
    return torch.stack([x1, y1, x2, y2], dim=-1)

def xyxy_to_yolo(boxes, image_size):
    """XYXY to YOLO (normalized center format)"""
    width, height = image_size
    x1, y1, x2, y2 = boxes[..., 0], boxes[..., 1], boxes[..., 2], boxes[..., 3]
    # Box center, normalized by image size
    cx = (x1 + x2) / (2 * width)
    cy = (y1 + y2) / (2 * height)
    # Box size, normalized by image size
    w = (x2 - x1) / width
    h = (y2 - y1) / height
    return torch.stack([cx, cy, w, h], dim=-1)
Coordinate System
Origin (0,0): Top-left corner
x-axis: Left to right
y-axis: Top to bottom
Normalized coords: Divide by width/height → [0, 1]
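Going the other way, from normalized center format back to pixel corners, is a matter of undoing the division. A minimal plain-Python sketch for a single box follows; the function name `yolo_to_xyxy` is hypothetical and does not appear in the notes.

```python
def yolo_to_xyxy(box, image_size):
    """Convert normalized (cx, cy, w, h) to pixel (x1, y1, x2, y2)."""
    width, height = image_size
    cx, cy, w, h = box
    x1 = (cx - w / 2) * width
    y1 = (cy - h / 2) * height
    x2 = (cx + w / 2) * width
    y2 = (cy + h / 2) * height
    return (x1, y1, x2, y2)

corners = yolo_to_xyxy((0.5, 0.5, 0.25, 0.5), image_size=(64, 64))
print(corners)  # (24.0, 16.0, 40.0, 48.0)
```

Because the origin is the top-left corner, y1 is the top edge and y2 the bottom edge of the box.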
7. IoU AND NON-MAXIMUM SUPPRESSION
Intersection over Union (IoU)
IoU Formula:
IoU = Area of Intersection / Area of Union
Where:
- Intersection = Overlapping region
- Union = Total area covered by both boxes
- Range: [0, 1] (0 = no overlap, 1 = perfect match)
Mathematical:
IoU = |A ∩ B| / (|A| + |B| − |A ∩ B|), where |·| denotes area
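A worked example of the formula for two XYXY boxes, written in plain Python. The function name `iou_xyxy` is hypothetical; it handles one pair of boxes, whereas a batched helper would compute the same quantity for every pair.

```python
def iou_xyxy(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width/height clamp to 0 when the boxes do not overlap
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 10x10 boxes overlapping in a 5x5 square: 25 / (100 + 100 - 25)
overlap = iou_xyxy((0, 0, 10, 10), (5, 5, 15, 15))
print(round(overlap, 3))  # 0.143
```

Note how a seemingly large overlap (a quarter of each box) yields an IoU of only about 0.14, which is why 0.5 is already a fairly demanding threshold.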
NMS Algorithm:
1. Sort all detections by confidence score (high → low)
2. While detections remain:
a. Take highest confidence detection
b. Add to final output
c. Remove all detections with IoU > threshold (e.g., 0.45) with this detection
3. Return final output
When is NMS applied?
- Only during inference (not training)
- After confidence thresholding
- Per class (separately for each object class)
import torch

def nms(boxes, scores, iou_threshold=0.45):
    """
    Non-Maximum Suppression
    Args:
        boxes: (N, 4) in XYXY format
        scores: (N,) confidence scores
        iou_threshold: IoU threshold for suppression
    Returns:
        keep: Indices of boxes to keep
    """
    # Sort by score (highest first)
    sorted_indices = torch.argsort(scores, descending=True)
    keep = []
    while len(sorted_indices) > 0:
        # Take the highest-confidence remaining box
        current = sorted_indices[0].item()
        keep.append(current)
        if len(sorted_indices) == 1:
            break
        # IoU of the kept box with every remaining box
        # (box_iou_matrix: pairwise IoU helper, assumed to return a (1, M) tensor here)
        ious = box_iou_matrix(boxes[current:current + 1], boxes[sorted_indices[1:]])
        # Suppress boxes with IoU > threshold; keep the rest
        mask = ious[0] <= iou_threshold
        sorted_indices = sorted_indices[1:][mask]
    return keep
NMS Hyperparameters
Confidence threshold (e.g., 0.25): Filter low-confidence predictions before NMS
IoU threshold (e.g., 0.45): How much overlap allowed before suppression
Lower IoU threshold: More aggressive (fewer boxes, may remove valid detections)
Higher IoU threshold: Less aggressive (more boxes, may keep duplicates)
8. EVALUATION METRICS (mAP, PRECISION, RECALL)
Confusion Matrix for Object Detection
| Metric | Definition | Condition |
|---|---|---|
| True Positive (TP) | Correct detection | IoU ≥ threshold AND correct class |
| False Positive (FP) | Incorrect detection | IoU < threshold OR wrong class |
| False Negative (FN) | Missed object | No prediction matched this ground truth |
| True Negative (TN) | N/A | Not applicable (infinite background) |
Precision and Recall
Precision: Of all detections, how many are correct?
Precision = TP / (TP + FP)
Recall: Of all ground truth objects, how many detected?
Recall = TP / (TP + FN)
Trade-off:
- Lower confidence threshold → higher recall, lower precision
- Higher confidence threshold → lower recall, higher precision
F1 Score: Harmonic mean
F1 = 2 × (Precision × Recall) / (Precision + Recall)
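A short worked example of the three formulas, with made-up counts (8 correct detections, 2 false alarms, 4 missed objects). The function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.667 0.727
```

The guards return 0.0 when there are no detections or no ground-truth objects, which avoids a division by zero on empty images.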
Average Precision (AP)
AP Calculation
Sort all detections by confidence (high to low)
For each detection threshold, compute precision and recall
Plot precision-recall curve
AP = Area under the interpolated precision-recall curve
AP Formula:
AP = Σ (Rₙ - Rₙ₋₁) × Pₙ
Where:
- Rₙ = recall at nth threshold
- Pₙ = precision at nth threshold
11-point interpolation (PASCAL VOC):
AP = (1/11) × Σ P_interp(r) for r ∈ {0, 0.1, ..., 1.0}
All-point interpolation (PASCAL VOC 2010 onward):
Use all unique recall values (more accurate); COCO instead samples 101 evenly spaced recall points
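The AP steps above can be sketched for a single class using all recall points. This is a minimal illustration, assuming the detections have already been sorted by confidence and labeled TP/FP; the function name `average_precision` and the sample flags are hypothetical.

```python
def average_precision(is_tp, num_gt):
    """AP from TP/FP flags of detections sorted by descending confidence.

    is_tp: list of bools, one per detection
    num_gt: number of ground-truth objects of this class
    """
    recalls, precisions = [], []
    tp = 0
    for n, hit in enumerate(is_tp, start=1):
        tp += int(hit)
        recalls.append(tp / num_gt)
        precisions.append(tp / n)
    # Interpolate: each precision becomes the max precision to its right
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum (R_n - R_{n-1}) * P_n over all points
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

# 5 detections, 4 ground-truth objects (one is never found)
ap = average_precision([True, False, True, True, False], num_gt=4)
print(ap)  # 0.625
```

Recall tops out at 0.75 here because one object is never detected, so the missing quarter of the recall axis contributes nothing to the area.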
Mean Average Precision (mAP)
mAP Calculation:
1. Compute AP for each class
2. Average over all classes
mAP = (1/N) × Σ APᵢ (for N classes)
COCO Metrics:
- mAP or mAP@[0.5:0.95]: Average over IoU thresholds 0.5, 0.55, ..., 0.95
- mAP@0.5: mAP at IoU threshold = 0.5 (more lenient)
- mAP@0.75: mAP at IoU threshold = 0.75 (stricter)
- mAP_small: mAP for small objects (area < 32²)
- mAP_medium: mAP for medium objects (32² < area < 96²)
- mAP_large: mAP for large objects (area > 96²)
Pre-trained models: Trained on COCO (80 classes, 120k images)
Fine-tuning: Adapt to custom dataset
Benefits:
Faster convergence (fewer epochs)
Better performance with less data
Learned features transfer across domains
YOLO Training Configuration
from ultralytics import YOLO

# Load pre-trained model
model = YOLO('yolov8n.pt')  # nano (fastest)

# Train
results = model.train(
    data='dataset.yaml',
    epochs=50,
    imgsz=64,          # MNISTDD-RGB is 64×64
    batch=16,
    lr0=0.01,          # Initial learning rate
    device='cuda:0',
    # Data augmentation
    hsv_h=0.015,       # Hue
    hsv_s=0.7,         # Saturation
    hsv_v=0.4,         # Value
    degrees=0.0,       # Rotation (keep 0 for upright digits)
    translate=0.1,     # Translation
    scale=0.5,         # Scale
    fliplr=0.5,        # Horizontal flip
    mosaic=1.0,        # Mosaic augmentation
    # Loss weights
    box=7.5,           # Box loss gain
    cls=0.5,           # Class loss gain
    dfl=1.5,           # DFL loss gain
)
Dataset Format (YOLO)
# dataset.yaml structure
path: /path/to/dataset
train: images/train
val: images/val
names:
  0: digit_0
  1: digit_1
  # ... up to digit_9

# Label files (one per image, one line per object)
# images/train/img001.jpg → labels/train/img001.txt
# Format: class_id x_center y_center width height
# All coordinates normalized to [0, 1]
#
# Example label file content:
# 3 0.500 0.500 0.200 0.300
# 7 0.250 0.750 0.150 0.200
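As a rough sketch of how one label line could be produced from a pixel-space XYXY box, the helper below normalizes the box and formats it with three decimals, as in the example lines above. The function name `yolo_label_line` is hypothetical and the 64×64 image size is taken from the dataset description.

```python
def yolo_label_line(class_id, box_xyxy, image_size):
    """Format one object as a YOLO label line (normalized center format)."""
    width, height = image_size
    x1, y1, x2, y2 = box_xyxy
    cx = (x1 + x2) / (2 * width)    # normalized box center
    cy = (y1 + y2) / (2 * height)
    w = (x2 - x1) / width           # normalized box size
    h = (y2 - y1) / height
    return f"{class_id} {cx:.3f} {cy:.3f} {w:.3f} {h:.3f}"

line = yolo_label_line(3, (24, 16, 40, 48), image_size=(64, 64))
print(line)  # 3 0.500 0.500 0.250 0.500
```

Writing one such line per object into `labels/train/<image name>.txt` gives the layout described above.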
Data Augmentation
| Augmentation | Effect | Recommended Value |
|---|---|---|
| Mosaic | Combine 4 images | 1.0 (always apply) |
| Horizontal flip | Mirror image | 0.5 (50% chance) |
| HSV jitter | Color variation | h=0.015, s=0.7, v=0.4 |
| Scale | Zoom in/out | 0.5 (±50%) |
| Translation | Shift image | 0.1 (10%) |
| Rotation | Rotate | 0.0 (digits are upright) |
Fine-Tuning Strategies
Freeze backbone: Only train head (fast, less overfitting)
Lower learning rate: 0.001-0.01 for fine-tuning
Fewer epochs: 50-100 epochs usually enough
Monitor validation mAP: Stop when it plateaus
10. MNISTDD-RGB DATASET & ASSIGNMENT
MNISTDD-RGB Dataset
Purpose: Simple object detection dataset for learning
Images: 64×64 RGB
Objects: MNIST digits (0-9)
Per image: 1-3 digits at random positions
Format: NPZ files with images and bboxes
Dataset Structure
import numpy as np

# Load dataset
data = np.load('train.npz')
images = data['images']  # (N, 64, 64, 3)
bboxes = data['bboxes']  # (N, max_objs, 4) in XYXY format

# Example
img = images[0]    # 64×64×3 RGB image
boxes = bboxes[0]  # Multiple bounding boxes for this image
Assignment 5 Tasks
| Part | Task | Classes | Goal |
|---|---|---|---|
| Part A | Fine-tune YOLOv8n | 1 (generic "digit") | Detect any digit |
| Part B | Fine-tune YOLOv8n | 10 (digit 0-9) | Detect and classify which digit |
Manual IoU Evaluation (Assignment Metric)
Mean IoU (Matched):
For each image:
1. Run model to get predictions
2. Compute IoU matrix between predictions and ground truth
3. Greedy matching:
- For each ground truth, find best prediction (highest IoU)
- Match only if IoU ≥ 0.5
- Each prediction matched at most once
4. Compute mean IoU of matched pairs
5. If no matches, IoU = 0 for this image
Final metric: Average over all images
This measures localization quality of correct detections
def greedy_match_ious(iou_matrix, threshold=0.5):
    """
    Greedy matching algorithm
    Args:
        iou_matrix: (N_pred, N_gt) IoU values
        threshold: Minimum IoU to match
    Returns:
        matches: List of IoU values for matched pairs
    """
    matches = []
    used_preds = set()
    # No predictions or no ground truth: nothing to match
    # (argmax would fail on an empty column)
    if iou_matrix.numel() == 0:
        return matches
    # For each ground truth
    for gt_idx in range(iou_matrix.shape[1]):
        # Find best prediction
        best_pred_idx = torch.argmax(iou_matrix[:, gt_idx]).item()
        best_iou = iou_matrix[best_pred_idx, gt_idx].item()
        # Match if IoU >= threshold and prediction not used
        if best_iou >= threshold and best_pred_idx not in used_preds:
            matches.append(best_iou)
            used_preds.add(best_pred_idx)
    return matches
Complete Training Pipeline
import numpy as np
from ultralytics import YOLO

# 1. Convert NPZ to YOLO format
# (create images/labels folders with YOLO format annotations)

# 2. Train model
model = YOLO('yolov8n.pt')
results = model.train(
    data='mnistdd.yaml',
    epochs=50,
    imgsz=64,
    batch=16,
    device='cuda:0',
)

# 3. Validate (automatic mAP computation)
metrics = model.val()
print(f"mAP@0.5: {metrics.box.map50:.3f}")
print(f"mAP@0.5:0.95: {metrics.box.map:.3f}")

# 4. Predict on validation set
# (val_images: validation images or paths, prepared beforehand)
preds = model.predict(val_images, conf=0.25, iou=0.45)

# 5. Compute manual IoU
# (ground_truth: per-image XYXY boxes, in the same order as val_images)
mean_ious = []
for pred, gt_boxes in zip(preds, ground_truth):
    # Move predictions to the CPU so they can be compared with the ground truth
    iou_matrix = box_iou_matrix(pred.boxes.xyxy.cpu(), gt_boxes)
    matches = greedy_match_ious(iou_matrix, threshold=0.5)
    mean_ious.append(np.mean(matches) if matches else 0.0)
mean_iou_matched = np.mean(mean_ious)
print(f"Mean IoU (matched): {mean_iou_matched:.3f}")