
CMPUT 328 ASSIGNMENT 5: OBJECT DETECTION

Complete Study Guide - From R-CNN to YOLO, Evaluation Metrics, and Modern Architectures


1. INTRODUCTION TO OBJECT DETECTION

What is Object Detection?

Object detection localizes and classifies multiple objects in an image.

Output: For each object: (bounding box, class label, confidence score)

Object Detection vs Other Vision Tasks

Task | Input | Output | Example
Classification | Image | Single class label | "dog"
Object Detection | Image | Multiple (bbox, class, conf) | [(box, "dog", 0.95), (box, "cat", 0.87)]
Segmentation | Image | Pixel-wise masks | Per-pixel class labels

Key Challenge: Variable Output Length

Problem: The network doesn't know how many objects are in each image.

Solutions over time:

Hierarchical Representations

Neural networks learn hierarchical features for object detection:

2. HISTORY OF OBJECT DETECTION (R-CNN → FASTER R-CNN)

Evolution of Two-Stage Detectors

Selective Search (Pre-Deep Learning)

Selective Search: Algorithm to generate ~2000 region proposals per image

How it works:

R-CNN (2014)

R-CNN Pipeline:
  1. Input image
  2. Selective Search → ~2000 region proposals
  3. Warp each region to fixed size (227×227)
  4. Pass each warped region through CNN (AlexNet)
  5. Extract CNN features for each region
  6. SVM classifier per class
  7. Bounding box regressor to refine boxes

Problem: Very slow (~47 seconds per image), because the CNN forward pass must run ~2000 times per image.

Fast R-CNN (2015)

Key innovation: Pass image through CNN only once, then extract features for each region

Fast R-CNN Pipeline:
  1. Input image → CNN backbone (single forward pass)
  2. Get feature map from last conv layer
  3. Selective Search → ~2000 region proposals (on original image)
  4. Project each proposal onto feature map → RoI (Region of Interest)
  5. ROI Pooling: Convert each RoI to fixed-size feature vector
  6. Fully connected layers → (classification, bbox regression)

Speed up: ~0.3 seconds per image for the network itself (from 47s), not counting proposal generation
Bottleneck: Selective search is still slow and now dominates the per-image time

ROI Pooling Explained

ROI Pooling converts variable-sized regions into fixed-size features.

Example:
  - Feature map: 512 × 20 × 15 (channels × height × width)
  - RoI on feature map: Variable size (e.g., 7×5 region)
  - Target output: 512 × 2 × 2 (fixed size)

Process:
  1. Divide RoI into 2×2 grid (target output size)
  2. Max pool within each grid cell
  3. Result: 512 × 2 × 2 fixed-size feature

Benefits:
  - Variable input → fixed output (required for FC layers)
  - Differentiable (can backprop)
  - Fast (just max pooling)
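The process above can be sketched for a single channel in plain Python. This is a minimal illustration, not a library implementation: the function name `roi_max_pool`, the list-of-lists feature map, and the floor/ceil rule for bin boundaries are assumptions made for this example.

```python
def roi_max_pool(feature, roi, out_h=2, out_w=2):
    """Max-pool one channel of a feature map inside an RoI.

    feature: 2D list indexed as feature[row][col] (a single channel)
    roi: (x1, y1, x2, y2) integer feature-map coordinates, x2/y2 exclusive
    """
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Rows covered by bin i: floor for the start, ceil for the end,
        # so every bin contains at least one row.
        r0 = y1 + (i * roi_h) // out_h
        r1 = y1 + -((-(i + 1) * roi_h) // out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + (j * roi_w) // out_w
            c1 = x1 + -((-(j + 1) * roi_w) // out_w)
            row.append(max(feature[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Toy 8×6 feature map whose value encodes its position (10 * row + col).
feature = [[r * 10 + c for c in range(6)] for r in range(8)]
# RoI that is 5 wide and 7 tall, pooled to a fixed 2×2 output.
print(roi_max_pool(feature, (1, 1, 6, 8)))  # [[43, 45], [73, 75]]
```

Whatever the RoI size, the output is always `out_h × out_w`, which is what lets the fully connected layers follow.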

Faster R-CNN (2016)

Key innovation: Replace selective search with Region Proposal Network (RPN)

RPN: Neural network that predicts region proposals

Faster R-CNN Pipeline:
  1. Input image → CNN backbone (shared)
  2. Feature map → Region Proposal Network (RPN)
     - RPN outputs ~300 region proposals
     - Much faster than selective search
  3. ROI Pooling on proposed regions
  4. Classification + bbox regression heads

Two-stage training:
  - Stage 1: Train backbone + RPN
  - Stage 2: Train backbone + detection heads (classification + bbox)

Speed: ~0.2 seconds per image (5 FPS)

Region Proposal Network (RPN) Details

RPN predicts whether each location contains an object.

At each location in feature map:
  1. Place K anchor boxes of different scales/aspect ratios
     - Example: 3 scales × 3 aspect ratios = 9 anchors
  2. For each anchor, predict:
     - Objectness score (1 value): does anchor contain object?
     - Box offsets (4 values): how to adjust anchor to fit object?

Output per location: K anchors × (1 objectness + 4 offsets) = K × 5 values

Full output for 20×15 feature map:
  - Objectness: 20 × 15 × K
  - Box transforms: 20 × 15 × K × 4

Post-processing:
  1. Apply objectness threshold (e.g., > 0.5)
  2. Apply box transforms to anchors
  3. Apply NMS to remove duplicates
  4. Keep top ~300 proposals
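The anchor bookkeeping above can be checked with a short sketch. The helper names `make_anchors` and `rpn_output_sizes` are hypothetical, the scales 128/256/512 are illustrative values, and the ratio is taken here as height divided by width; none of these come from the assignment.

```python
import math


def make_anchors(scales, ratios):
    """Return (width, height) anchor shapes.

    Each anchor has area scale**2 and aspect ratio height / width = ratio.
    """
    shapes = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            shapes.append((w, h))
    return shapes


def rpn_output_sizes(feat_h, feat_w, k):
    """Count the values the RPN predicts for a feat_h × feat_w feature map."""
    return {
        "objectness": feat_h * feat_w * k,       # 1 score per anchor
        "box_offsets": feat_h * feat_w * k * 4,  # 4 offsets per anchor
    }


anchors = make_anchors(scales=[128, 256, 512], ratios=[0.5, 1.0, 2.0])
print(len(anchors))                             # 9 anchors per location
print(rpn_output_sizes(20, 15, len(anchors)))   # 2700 scores, 10800 offsets
```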

Why Anchors?

Problem: Neural networks need fixed-size outputs, but objects have variable sizes/shapes

Solution: Anchor boxes

Without anchors (selective search): Can generate arbitrary proposals, but not differentiable

With anchors (RPN): Differentiable, learnable, but constrained to anchor shapes

Bounding Box Regression in R-CNN

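Box Parameterization (x, y = box center; w, h = width and height):
  Anchor/proposal box: p = (p_x, p_y, p_w, p_h)
  Ground truth box: g = (g_x, g_y, g_w, g_h)

Network predicts transformations d(p):
  ĝ_x = p_w · d_x(p) + p_x
  ĝ_y = p_h · d_y(p) + p_y
  ĝ_w = p_w · exp(d_w(p))
  ĝ_h = p_h · exp(d_h(p))

Target transformations:
  t_x = (g_x - p_x) / p_w
  t_y = (g_y - p_y) / p_h
  t_w = log(g_w / p_w)
  t_h = log(g_h / p_h)

Loss:
  L_reg = Σ_{i ∈ {x, y, w, h}} (t_i - d_i(p))² + λ||w||²
  (in the regularizer, w denotes the regressor weights, not the box width)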
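A quick way to check the parameterization is to encode a ground-truth box against an anchor and decode it again. The sketch below assumes boxes are given as (center x, center y, width, height) tuples; the function names `encode` and `decode` are chosen for this example.

```python
import math


def encode(p, g):
    """Target transformations t for anchor p and ground truth g."""
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    return ((gx - px) / pw,
            (gy - py) / ph,
            math.log(gw / pw),
            math.log(gh / ph))


def decode(p, d):
    """Apply predicted transformations d to anchor p."""
    px, py, pw, ph = p
    dx, dy, dw, dh = d
    return (pw * dx + px,
            ph * dy + py,
            pw * math.exp(dw),
            ph * math.exp(dh))


anchor = (50.0, 50.0, 20.0, 40.0)
truth = (55.0, 60.0, 30.0, 30.0)
t = encode(anchor, truth)
print(t[:2])              # (0.25, 0.25): shifts in units of anchor size
print(decode(anchor, t))  # recovers the ground-truth box up to rounding
```

If the network predicted `d(p) = t` exactly, decoding would return the ground-truth box, which is why `t` is the regression target.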

Performance Comparison

Method | Year | Test time (sec/img, incl. proposals) | Region Proposals | mAP
R-CNN | 2014 | 49 | Selective Search | ~58%
SPP-Net | 2014 | 4.3 | Selective Search | ~59%
Fast R-CNN | 2015 | 2.3 | Selective Search | ~66%
Faster R-CNN | 2016 | 0.2 | RPN (neural net) | ~73%

3. ROI POOLING, ANCHORS, AND REGION PROPOSALS

Why Selective Search Doesn't Need Anchors

Selective Search: Can generate proposals of any shape/size

RPN with Anchors: Must predict from fixed set of shapes

Anchor Box Design

Aspect Ratio (w:h) | Scale | Purpose
1:1 | Small, Medium, Large | Square objects (faces, balls)
1:2 | Small, Medium, Large | Tall objects (people, bottles)
2:1 | Small, Medium, Large | Wide objects (cars, buses)

Common configuration: 3 scales × 3 aspect ratios = 9 anchors per location

ROI Pooling vs ROI Align

ROI Pooling Problem

Quantization: ROI pooling uses integer coordinates, causing misalignment

Example: RoI at (6.5, 4.7, 18.3, 12.9) → rounded to (6, 4, 18, 12)

Impact: Slight misalignment, especially bad for segmentation

ROI Align Solution (Mask R-CNN)

ROI Align: Use bilinear interpolation instead of rounding
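The difference between rounding and bilinear interpolation can be seen by sampling a feature map at the fractional corner (6.5, 4.7) from the example above. This is a minimal single-channel sketch; `bilinear_sample` is a name chosen here, and the point is assumed to lie inside the feature map.

```python
import math


def bilinear_sample(feature, x, y):
    """Sample feature[row][col] at continuous (x, y) with bilinear weights."""
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    # Clamp the far neighbours so points on the last row/column stay valid.
    x1 = min(x0 + 1, len(feature[0]) - 1)
    y1 = min(y0 + 1, len(feature) - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feature[y0][x0] + ax * feature[y0][x1]
    bottom = (1 - ax) * feature[y1][x0] + ax * feature[y1][x1]
    return (1 - ay) * top + ay * bottom


# Toy feature map whose value is 10 * row + col, so the exact value at
# (x=6.5, y=4.7) is 10 * 4.7 + 6.5 = 53.5.
feature = [[10 * r + c for c in range(8)] for r in range(6)]
print(feature[4][6])                       # 46: value after rounding to (6, 4)
print(bilinear_sample(feature, 6.5, 4.7))  # approximately 53.5
```

Rounding reads the value of a neighbouring cell, while interpolation blends the four surrounding cells, so the sampled feature follows the true RoI position.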

4. YOLO: SINGLE-STAGE DETECTION

YOLO Philosophy

You Only Look Once: Predict bounding boxes and classes in a single forward pass

Key insight: Frame detection as regression, not classification on proposals

Speed advantage: YOLOv1 reported about 45 FPS, roughly an order of magnitude faster than the ~5 FPS of Faster R-CNN

YOLOv1 (2016)

YOLOv1 Architecture:
  1. Divide image into S × S grid (e.g., 7×7)
  2. Each grid cell predicts:
     - B bounding boxes (x, y, w, h, confidence)
     - C class probabilities

Output tensor: S × S × (B×5 + C)
Example (S=7, B=2, C=20): 7 × 7 × 30

Grid cell responsible for object if:
  - Object's center falls in that cell

Confidence score:
  - Pr(Object) × IoU(pred, truth)
  - 0 if no object in cell
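The grid assignment and output size can be sketched as follows. The function names and the 448×448 image size in the usage lines are assumptions made for illustration.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell containing the object center."""
    col = min(int(cx / img_w * S), S - 1)  # clamp centers on the right edge
    row = min(int(cy / img_h * S), S - 1)  # clamp centers on the bottom edge
    return row, col


def yolo_v1_output_shape(S=7, B=2, C=20):
    """Each cell predicts B boxes of 5 values plus C class probabilities."""
    return (S, S, B * 5 + C)


print(responsible_cell(224, 100, 448, 448))  # (1, 3)
print(yolo_v1_output_shape())                # (7, 7, 30)
```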

YOLO Evolution

Version | Year | Key Improvements
YOLOv1 | 2016 | Single-stage, grid-based, real-time
YOLOv2 | 2017 | Batch norm, anchor boxes, multi-scale training
YOLOv3 | 2018 | FPN-style prediction at 3 scales, better for small objects
YOLOv4 | 2020 | CSPDarknet, Mish activation, mosaic augmentation
YOLOv5 | 2020 | PyTorch, auto-anchor, production-ready
YOLOv8 | 2023 | Anchor-free, C2f blocks, improved neck
YOLO11 | 2024 | Latest Ultralytics release (as of 2024)

YOLOv8 Architecture (Used in Assignment)

YOLOv8 Components:

Backbone: CSPDarknet
├─ Extracts features at multiple scales
├─ CSP (Cross Stage Partial) blocks
└─ SPPF (Spatial Pyramid Pooling - Fast)

Neck: PANet (Path Aggregation Network)
├─ Top-down: FPN for multi-scale fusion
└─ Bottom-up: PAN for feature enhancement

Head: Decoupled anchor-free head
├─ Classification head → class probabilities
└─ Regression head → bbox coordinates (direct prediction)

Key differences from earlier YOLO:
  - No anchor boxes (anchor-free)
  - Separate heads for classification and localization
  - C2f modules instead of C3 (faster, better gradient flow)

Why Anchor-Free?

YOLO Loss Function

YOLOv1 Loss (Multi-part):
  L_total = λ_coord × L_box + L_obj + λ_noobj × L_noobj + L_class
  - L_box: Localization loss (coordinates + size)
  - L_obj: Confidence loss (cells with objects)
  - L_noobj: Confidence loss (cells without objects)
  - L_class: Classification loss
  - The original paper uses λ_coord = 5 and λ_noobj = 0.5, so the many empty cells do not dominate the loss

YOLOv8 Loss:
  L_total = L_cls + L_box + L_dfl
  - L_cls: Classification loss (BCE)
  - L_box: Box loss (CIoU - Complete IoU)
  - L_dfl: Distribution Focal Loss (for bbox refinement)

5. MODERN APPROACHES (CENTERNET, DETR)

Anchor-less Object Detection

Motivation: Anchors add complexity and hyperparameters

Anchor-less approaches:

CenterNet (2019)

CenterNet: Objects as Points

Key idea: Represent each object as a single point (its center)

Architecture:
  1. Input image → Backbone CNN → Feature map
  2. Three prediction heads:
     a) Heatmap: Detect object centers (Gaussian peaks)
     b) Size: Predict width and height at center
     c) Offset: Sub-pixel offset (for quantization correction)

Training:
  - Ground truth: Gaussian heatmap around object centers
  - Loss: Focal loss for heatmap + L1 loss for size/offset

Inference:
  1. Find local maxima in heatmap (object centers)
  2. Read size and offset at each center
  3. Reconstruct bounding boxes

Advantages:
  - No anchors, no NMS needed (few duplicate detections)
  - Simple and fast
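The three inference steps can be sketched for a single class in plain Python. This is an illustrative simplification: the function name `decode_centers`, the nested-list inputs, the output stride of 4, the 0.3 threshold, and sizes expressed in input-image pixels are all assumptions for this example.

```python
def decode_centers(heatmap, sizes, offsets, threshold=0.3, stride=4):
    """Turn heatmap peaks into (x1, y1, x2, y2, score) boxes.

    heatmap[y][x]: center score; sizes[y][x]: (w, h) in input pixels;
    offsets[y][x]: (dx, dy) sub-cell offset of the true center.
    """
    H, W = len(heatmap), len(heatmap[0])
    boxes = []
    for y in range(H):
        for x in range(W):
            score = heatmap[y][x]
            if score < threshold:
                continue
            # Step 1: keep the cell only if no 8-neighbour is strictly larger.
            neighbours = [heatmap[j][i]
                          for j in range(max(0, y - 1), min(H, y + 2))
                          for i in range(max(0, x - 1), min(W, x + 2))
                          if (i, j) != (x, y)]
            if any(n > score for n in neighbours):
                continue
            # Step 2: read size and offset at the peak.
            w, h = sizes[y][x]
            dx, dy = offsets[y][x]
            # Step 3: map the center back to input pixels and build the box.
            cx, cy = (x + dx) * stride, (y + dy) * stride
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, score))
    return boxes


heat = [[0.0] * 4 for _ in range(4)]
heat[2][1] = 0.9   # peak
heat[2][2] = 0.5   # shoulder of the same Gaussian, suppressed by the peak
sizes = [[(8.0, 12.0)] * 4 for _ in range(4)]
offsets = [[(0.5, 0.25)] * 4 for _ in range(4)]
print(decode_centers(heat, sizes, offsets))  # [(2.0, 3.0, 10.0, 15.0, 0.9)]
```

The local-maximum test plays the role that NMS plays in anchor-based detectors: the shoulder cell is dropped because its neighbour scores higher.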

DETR (2020) - End-to-End Detection with Transformers

DETR Philosophy

Set prediction: Predict a fixed-size set of objects in parallel

No hand-crafted components: No anchors, no NMS, learned end-to-end

DETR Architecture:
  1. Input image → CNN backbone → Feature map
  2. Flatten feature map → Sequence of features
  3. Add positional encodings
  4. Transformer encoder-decoder:
     - Encoder: Process image features
     - Decoder: N object queries → N predictions
  5. FFN heads → (class, bbox) for each query

Object queries:
  - N learned embeddings (e.g., N=100)
  - Each query predicts one object (or "no object")
  - Transformer learns to assign queries to objects

Hungarian Matching:
  - Bipartite matching between predictions and ground truth
  - Find optimal 1-to-1 assignment
  - Loss computed only on matched pairs

Training:
  - Classification loss: Cross-entropy
  - Box loss: L1 + GIoU loss
  - Hungarian matching provides assignment
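To make the 1-to-1 assignment concrete, the sketch below finds the minimum-cost matching by brute force over permutations. It is a stand-in that returns the same answer as the Hungarian algorithm on tiny inputs, not the Hungarian algorithm itself; in practice a routine such as `scipy.optimize.linear_sum_assignment` is typically used. The function name and the cost values are made up for illustration.

```python
from itertools import permutations


def optimal_assignment(cost):
    """Minimum-cost 1-to-1 matching by exhaustive search.

    cost[i][j]: cost of assigning prediction i to ground truth j.
    Assumes there are at least as many predictions as ground truths.
    Returns (list of (pred_idx, gt_idx) pairs, total cost).
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best_total, best_pairs = float("inf"), []
    # Try every way of picking a distinct prediction for each ground truth.
    for preds in permutations(range(n_pred), n_gt):
        total = sum(cost[p][g] for g, p in enumerate(preds))
        if total < best_total:
            best_total = total
            best_pairs = [(p, g) for g, p in enumerate(preds)]
    return best_pairs, best_total


# 3 predictions (queries), 2 ground-truth objects.
cost = [[0.9, 0.1],
        [0.2, 0.8],
        [0.5, 0.5]]
pairs, total = optimal_assignment(cost)
print(pairs)  # [(1, 0), (0, 1)]: prediction 2 is left as "no object"
```

Exhaustive search grows factorially with the number of objects, which is why a polynomial-time matching algorithm is used for real training.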

Hungarian Matching: Non-Differentiable Step

Problem: Hungarian algorithm uses discrete argmin/argmax (not differentiable)

Training still works:

Issues:

Improvements to DETR

Method | Key Improvement
Deformable DETR | Multi-scale deformable attention (faster, better convergence)
Conditional DETR | Condition queries on spatial priors (reduce matching ambiguity)
DN-DETR | Add noised ground-truth queries (stabilize early training)
DINO | Contrastive denoising + mixed query selection
Group DETR / Co-DETR | Multi-group matching, current SOTA on COCO

Performance on COCO

Historical Progress (mAP on COCO):
  - Fast R-CNN (2015): ~21%
  - Faster R-CNN (2016): ~37%
  - Mask R-CNN (2017): ~42%
  - YOLO v3 (2018): ~33% (~58% at IoU 0.5; a speed-oriented single-stage model)
  - Cascade R-CNN (2019): ~50%
  - DetectoRS (2020): ~55%
  - DETR v2 (2023): ~66%
  - Co-DETR (2024): ~67%

Note: mAP = mAP@[0.5:0.95] (average over IoU 0.5 to 0.95)

6. BOUNDING BOX REPRESENTATIONS

Common Bounding Box Formats

Format | Representation | Description | Use Case
XYXY | (x1, y1, x2, y2) | Top-left + bottom-right corners | PyTorch, easy IoU calculation
XYWH | (x, y, w, h) | Top-left + width/height | Intuitive, COCO dataset
CXCYWH | (cx, cy, w, h) | Center + width/height | DETR, transformers
YOLO Format | (cx, cy, w, h) normalized | Center + size, all in [0,1] | YOLO training labels

Format Conversions

import torch


def xyxy_to_xywh(boxes):
    """XYXY to XYWH"""
    x1, y1, x2, y2 = boxes[..., 0], boxes[..., 1], boxes[..., 2], boxes[..., 3]
    x = x1
    y = y1
    w = x2 - x1
    h = y2 - y1
    return torch.stack([x, y, w, h], dim=-1)


def xywh_to_xyxy(boxes):
    """XYWH to XYXY"""
    x, y, w, h = boxes[..., 0], boxes[..., 1], boxes[..., 2], boxes[..., 3]
    x1 = x
    y1 = y
    x2 = x + w
    y2 = y + h
    return torch.stack([x1, y1, x2, y2], dim=-1)


def xyxy_to_yolo(boxes, image_size):
    """XYXY to YOLO (normalized center format)"""
    width, height = image_size
    x1, y1, x2, y2 = boxes[..., 0], boxes[..., 1], boxes[..., 2], boxes[..., 3]
    cx = (x1 + x2) / (2 * width)
    cy = (y1 + y2) / (2 * height)
    w = (x2 - x1) / width
    h = (y2 - y1) / height
    return torch.stack([cx, cy, w, h], dim=-1)
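The reverse direction, from YOLO format back to pixel corners, is needed when drawing or evaluating label files. The sketch below handles a single box as a plain tuple rather than a tensor; `yolo_to_xyxy` is a name chosen for this example.

```python
def yolo_to_xyxy(box, image_size):
    """YOLO (normalized cx, cy, w, h) to XYXY in pixels, for one box."""
    width, height = image_size
    cx, cy, w, h = box
    x1 = (cx - w / 2) * width
    y1 = (cy - h / 2) * height
    x2 = (cx + w / 2) * width
    y2 = (cy + h / 2) * height
    return (x1, y1, x2, y2)


# A box centered in a 64×64 image, 20% wide and 30% tall.
print(yolo_to_xyxy((0.5, 0.5, 0.2, 0.3), (64, 64)))
# approximately (25.6, 22.4, 38.4, 41.6)
```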

Coordinate System

7. IoU AND NON-MAXIMUM SUPPRESSION

Intersection over Union (IoU)

IoU Formula:
  IoU = Area of Intersection / Area of Union

Where:
  - Intersection = Overlapping region
  - Union = Total area covered by both boxes
  - Range: [0, 1] (0 = no overlap, 1 = perfect match)

Mathematical:
  IoU = |A ∩ B| / (|A| + |B| - |A ∩ B|)
def box_iou(box1, box2):
    """
    Compute IoU between two boxes (XYXY format)

    Args:
        box1, box2: (x1, y1, x2, y2)
    Returns:
        iou: float in [0, 1]
    """
    # Intersection rectangle
    inter_x1 = max(box1[0], box2[0])
    inter_y1 = max(box1[1], box2[1])
    inter_x2 = min(box1[2], box2[2])
    inter_y2 = min(box1[3], box2[3])

    # Intersection area (clamped to 0 when the boxes do not overlap)
    inter_w = max(0, inter_x2 - inter_x1)
    inter_h = max(0, inter_y2 - inter_y1)
    inter_area = inter_w * inter_h

    # Box areas
    box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
    box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])

    # Union area
    union_area = box1_area + box2_area - inter_area

    # IoU (epsilon avoids division by zero for degenerate boxes)
    iou = inter_area / (union_area + 1e-6)
    return iou
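Later sections call a pairwise helper named `box_iou_matrix` that this guide never defines. The sketch below is one plausible plain-Python version operating on lists of XYXY tuples; the name `pairwise_iou` and the list-based interface are assumptions, and a tensor-based version would follow the same logic.

```python
def pairwise_iou(preds, gts):
    """Return an (N_pred × N_gt) nested list of IoU values (XYXY boxes)."""
    def iou(a, b):
        inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = inter_w * inter_h
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    return [[iou(p, g) for g in gts] for p in preds]


preds = [(0, 0, 10, 10), (20, 20, 30, 30)]
gts = [(5, 0, 15, 10)]
print(pairwise_iou(preds, gts))
# first row: 50 / 150 ≈ 0.333 (half-overlapping boxes); second row: 0.0
```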

IoU Thresholds in Practice

IoU Range | Quality | Usage
IoU ≥ 0.5 | Good match | Standard threshold for TP
IoU ≥ 0.7 | Strong match | Stricter evaluation
IoU ≥ 0.9 | Excellent | Very precise localization
IoU < 0.5 | Poor match | False positive

Non-Maximum Suppression (NMS)

Why NMS?

Problem: Multiple detections for same object

Solution: Keep highest confidence, remove overlapping detections

NMS Algorithm:
  1. Sort all detections by confidence score (high → low)
  2. While detections remain:
     a. Take highest confidence detection
     b. Add to final output
     c. Remove all detections with IoU > threshold (e.g., 0.45) with this detection
  3. Return final output

When is NMS applied?
  - Only during inference (not training)
  - After confidence thresholding
  - Per class (separately for each object class)
import torch
# Assumption: torchvision's pairwise box_iou stands in for box_iou_matrix,
# which this guide uses but does not define.
from torchvision.ops import box_iou as box_iou_matrix


def nms(boxes, scores, iou_threshold=0.45):
    """
    Non-Maximum Suppression

    Args:
        boxes: (N, 4) in XYXY format
        scores: (N,) confidence scores
        iou_threshold: IoU threshold for suppression
    Returns:
        keep: Indices of boxes to keep
    """
    # Sort by score
    sorted_indices = torch.argsort(scores, descending=True)
    keep = []

    while len(sorted_indices) > 0:
        # Take highest confidence
        current = sorted_indices[0]
        keep.append(current.item())

        if len(sorted_indices) == 1:
            break

        # Compute IoU with remaining boxes
        ious = box_iou_matrix(boxes[current:current + 1], boxes[sorted_indices[1:]])

        # Keep boxes with IoU <= threshold (only IoU > threshold is suppressed)
        mask = ious[0] <= iou_threshold
        sorted_indices = sorted_indices[1:][mask]

    return keep
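The "per class" point above means a box is only suppressed by a higher-scoring box of the same class. A minimal plain-Python sketch of that idea follows; the function name `nms_per_class` and the list-based inputs are assumptions for illustration.

```python
def nms_per_class(boxes, scores, labels, iou_threshold=0.45):
    """Run NMS separately for each class; return kept indices by score."""
    def iou(a, b):
        inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = inter_w * inter_h
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    keep = []
    for cls in sorted(set(labels)):
        # Indices of this class, highest score first.
        idxs = sorted((i for i, l in enumerate(labels) if l == cls),
                      key=lambda i: scores[i], reverse=True)
        while idxs:
            best = idxs.pop(0)
            keep.append(best)
            # Drop same-class boxes that overlap the kept box too much.
            idxs = [i for i in idxs
                    if iou(boxes[best], boxes[i]) <= iou_threshold]
    return sorted(keep, key=lambda i: scores[i], reverse=True)


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7, 0.6]
labels = [0, 0, 1, 0]
print(nms_per_class(boxes, scores, labels))  # [0, 2, 3]
```

Box 1 is removed because it overlaps box 0 of the same class (IoU ≈ 0.68), while box 2 has the same coordinates as box 1 but survives because it belongs to a different class.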

NMS Hyperparameters

8. EVALUATION METRICS (mAP, PRECISION, RECALL)

Confusion Matrix for Object Detection

Metric | Definition | Condition
True Positive (TP) | Correct detection | IoU ≥ threshold AND correct class
False Positive (FP) | Incorrect detection | IoU < threshold OR wrong class OR duplicate of an already matched ground truth
False Negative (FN) | Missed object | No prediction matched this ground truth
True Negative (TN) | N/A | Not applicable (infinite background)

Precision and Recall

Precision: Of all detections, how many are correct?
  Precision = TP / (TP + FP)

Recall: Of all ground truth objects, how many were detected?
  Recall = TP / (TP + FN)

Trade-off:
  - Lower confidence threshold → higher recall, lower precision
  - Higher confidence threshold → lower recall, higher precision

F1 Score: Harmonic mean of precision and recall
  F1 = 2 × (Precision × Recall) / (Precision + Recall)
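The three formulas can be evaluated directly from TP/FP/FN counts. The helper name and the counts in the usage line are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from detection counts."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 3 correct detections, 1 wrong detection, 2 missed objects.
p, r, f1 = precision_recall_f1(tp=3, fp=1, fn=2)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.75 0.6 0.667
```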

Average Precision (AP)

AP Calculation

  1. Sort all detections by confidence (high to low)
  2. For each detection threshold, compute precision and recall
  3. Plot precision-recall curve
  4. AP = Area under the interpolated precision-recall curve
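AP Formula:
  AP = Σ (Rₙ - Rₙ₋₁) × Pₙ
Where:
  - Rₙ = recall at nth threshold
  - Pₙ = precision at nth threshold (after interpolation: the maximum precision at any recall ≥ Rₙ)

11-point interpolation (PASCAL VOC 2007): AP = (1/11) × Σ P_interp(r) for r ∈ {0, 0.1, ..., 1.0}
All-point interpolation (PASCAL VOC 2010 onward): Use all unique recall values (more accurate)
101-point interpolation (COCO): Sample P_interp(r) at r ∈ {0, 0.01, ..., 1.0}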

Mean Average Precision (mAP)

mAP Calculation:
  1. Compute AP for each class
  2. Average over all classes
  mAP = (1/N) × Σ APᵢ (for N classes)

COCO Metrics:
  - mAP or mAP@[0.5:0.95]: Average over IoU thresholds 0.5, 0.55, ..., 0.95
  - mAP@0.5: mAP at IoU threshold = 0.5 (more lenient)
  - mAP@0.75: mAP at IoU threshold = 0.75 (stricter)
  - mAP_small: mAP for small objects (area < 32²)
  - mAP_medium: mAP for medium objects (32² < area < 96²)
  - mAP_large: mAP for large objects (area > 96²)

mAP Interpretation

Example Calculation

Example: 10 detections for "cat" class (assume 5 ground truth cats total; a detection is a TP if IoU ≥ 0.5)

Detections sorted by confidence:
Detection | Conf | IoU | TP/FP | Precision | Recall
1 | 0.95 | 0.88 | TP | 1/1=1.00 | 1/5=0.20
2 | 0.90 | 0.67 | TP | 2/2=1.00 | 2/5=0.40
3 | 0.85 | 0.42 | FP | 2/3=0.67 | 2/5=0.40
4 | 0.80 | 0.73 | TP | 3/4=0.75 | 3/5=0.60
5 | 0.75 | 0.35 | FP | 3/5=0.60 | 3/5=0.60
6 | 0.70 | 0.81 | TP | 4/6=0.67 | 4/5=0.80
7 | 0.60 | 0.92 | TP | 5/7=0.71 | 5/5=1.00
8 | 0.55 | 0.23 | FP | 5/8=0.63 | 5/5=1.00
9 | 0.50 | 0.15 | FP | 5/9=0.56 | 5/5=1.00
10 | 0.45 | 0.08 | FP | 5/10=0.50 | 5/5=1.00

Precision-Recall pairs at each new recall level: (1.00, 0.20), (1.00, 0.40), (0.75, 0.60), (0.67, 0.80), (0.71, 1.00)

Interpolated precision (maximum precision at any equal or higher recall): 1.00 up to recall 0.40, 0.75 up to recall 0.60, 0.714 up to recall 1.00

AP = 0.40 × 1.00 + 0.20 × 0.75 + 0.40 × 0.714 ≈ 0.84 (for this class, all-point interpolation)
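The example can be reproduced with a short all-point-interpolation sketch. The function name `average_precision` is chosen here; under this convention the TP/FP sequence above gives about 0.84, and other integration conventions (for example a trapezoidal area through a subset of the points) give slightly different values.

```python
def average_precision(is_tp, num_gt):
    """All-point interpolated AP for detections sorted by confidence.

    is_tp: list of booleans, True where the detection is a true positive
    num_gt: number of ground-truth objects for this class
    """
    tp = 0
    precisions, recalls = [], []
    for rank, hit in enumerate(is_tp, start=1):
        tp += int(hit)
        precisions.append(tp / rank)
        recalls.append(tp / num_gt)

    # Interpolate: each precision becomes the max precision at any later rank.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas (recall step × interpolated precision).
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


flags = [True, True, False, True, False, True, True, False, False, False]
print(round(average_precision(flags, num_gt=5), 3))  # 0.836
```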

9. TRAINING & FINE-TUNING

Transfer Learning

Pre-trained models: Trained on COCO (80 classes, ~118k training images)

Fine-tuning: Adapt to custom dataset

Benefits:

YOLO Training Configuration

from ultralytics import YOLO

# Load pre-trained model
model = YOLO('yolov8n.pt')  # nano (fastest)

# Train
results = model.train(
    data='dataset.yaml',
    epochs=50,
    imgsz=64,         # MNISTDD-RGB is 64×64
    batch=16,
    lr0=0.01,         # Initial learning rate
    device='cuda:0',

    # Data augmentation
    hsv_h=0.015,      # Hue
    hsv_s=0.7,        # Saturation
    hsv_v=0.4,        # Value
    degrees=0.0,      # Rotation (keep 0 for upright digits)
    translate=0.1,    # Translation
    scale=0.5,        # Scale
    fliplr=0.5,       # Horizontal flip
    mosaic=1.0,       # Mosaic augmentation

    # Loss weights
    box=7.5,          # Box loss gain
    cls=0.5,          # Class loss gain
    dfl=1.5,          # DFL loss gain
)

Dataset Format (YOLO)

# dataset.yaml structure
path: /path/to/dataset
train: images/train
val: images/val
names:
  0: digit_0
  1: digit_1
  # ... up to digit_9

# Label files (one per image)
# images/train/img001.jpg → labels/train/img001.txt
# Format: class_id cx cy w h (one object per line)
# All coordinates normalized to [0, 1]
#
# Example label file content:
# 3 0.500 0.500 0.200 0.300
# 7 0.250 0.750 0.150 0.200
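Producing these label lines from pixel-space XYXY boxes might look like the sketch below. The function name `to_yolo_label_lines`, the three-decimal formatting, and the assumption that class ids are available as a separate list are illustrative choices, not part of the assignment code.

```python
def to_yolo_label_lines(boxes_xyxy, class_ids, image_size):
    """Format boxes as YOLO label lines: 'class cx cy w h', normalized."""
    width, height = image_size
    lines = []
    for (x1, y1, x2, y2), cls in zip(boxes_xyxy, class_ids):
        cx = (x1 + x2) / 2 / width
        cy = (y1 + y2) / 2 / height
        w = (x2 - x1) / width
        h = (y2 - y1) / height
        lines.append(f"{cls} {cx:.3f} {cy:.3f} {w:.3f} {h:.3f}")
    return lines


# One digit "3" in a 64×64 image.
lines = to_yolo_label_lines([(25.6, 22.4, 38.4, 41.6)], [3], (64, 64))
print("\n".join(lines))  # 3 0.500 0.500 0.200 0.300
```

Writing `"\n".join(lines)` to `labels/train/img001.txt` would give one label file for the matching image.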

Data Augmentation

Augmentation | Effect | Recommended Value
Mosaic | Combine 4 images | 1.0 (always apply)
Horizontal flip | Mirror image | 0.5 (50% chance)
HSV jitter | Color variation | h=0.015, s=0.7, v=0.4
Scale | Zoom in/out | 0.5 (±50%)
Translation | Shift image | 0.1 (10%)
Rotation | Rotate | 0.0 (digits are upright)

Fine-Tuning Strategies

10. MNISTDD-RGB DATASET & ASSIGNMENT

MNISTDD-RGB Dataset

Purpose: Simple object detection dataset for learning

Dataset Structure

import numpy as np

# Load dataset
data = np.load('train.npz')
images = data['images']  # (N, 64, 64, 3)
bboxes = data['bboxes']  # (N, max_objs, 4) in XYXY format

# Example
img = images[0]    # 64×64×3 RGB image
boxes = bboxes[0]  # Multiple bounding boxes for this image

Assignment 5 Tasks

Part | Task | Classes | Goal
Part A | Fine-tune YOLOv8n | 1 (generic "digit") | Detect any digit
Part B | Fine-tune YOLOv8n | 10 (digit 0-9) | Detect and classify which digit

Manual IoU Evaluation (Assignment Metric)

Mean IoU (Matched):

For each image:
  1. Run model to get predictions
  2. Compute IoU matrix between predictions and ground truth
  3. Greedy matching:
     - For each ground truth, find best prediction (highest IoU)
     - Match only if IoU ≥ 0.5
     - Each prediction matched at most once
  4. Compute mean IoU of matched pairs
  5. If no matches, IoU = 0 for this image

Final metric: Average over all images

This measures localization quality of correct detections.
import torch


def greedy_match_ious(iou_matrix, threshold=0.5):
    """
    Greedy matching algorithm

    Args:
        iou_matrix: (N_pred, N_gt) IoU values
        threshold: Minimum IoU to match
    Returns:
        matches: List of IoU values for matched pairs
    """
    matches = []
    used_preds = set()

    # No predictions: nothing can match (argmax would fail on an empty column)
    if iou_matrix.shape[0] == 0:
        return matches

    # For each ground truth
    for gt_idx in range(iou_matrix.shape[1]):
        # Find best prediction
        best_pred_idx = torch.argmax(iou_matrix[:, gt_idx]).item()
        best_iou = iou_matrix[best_pred_idx, gt_idx].item()

        # Match if IoU >= threshold and prediction not used
        if best_iou >= threshold and best_pred_idx not in used_preds:
            matches.append(best_iou)
            used_preds.add(best_pred_idx)

    return matches

Complete Training Pipeline

import numpy as np
import torch
from ultralytics import YOLO
# Assumption: torchvision's pairwise box_iou stands in for box_iou_matrix,
# which this guide uses but does not define.
from torchvision.ops import box_iou as box_iou_matrix

# 1. Convert NPZ to YOLO format
#    (create images/labels folders with YOLO format annotations)

# 2. Train model
model = YOLO('yolov8n.pt')
results = model.train(
    data='mnistdd.yaml',
    epochs=50,
    imgsz=64,
    batch=16,
    device='cuda:0'
)

# 3. Validate (automatic mAP computation)
metrics = model.val()
print(f"mAP@0.5: {metrics.box.map50:.3f}")
print(f"mAP@0.5:0.95: {metrics.box.map:.3f}")

# 4. Predict on validation set
#    (val_images and ground_truth are assumed to be prepared by the caller)
preds = model.predict(val_images, conf=0.25, iou=0.45)

# 5. Compute manual IoU
mean_ious = []
for pred, gt_boxes in zip(preds, ground_truth):
    gt = torch.as_tensor(gt_boxes, dtype=torch.float32)
    iou_matrix = box_iou_matrix(pred.boxes.xyxy.cpu(), gt)
    matches = greedy_match_ious(iou_matrix, threshold=0.5)
    mean_ious.append(np.mean(matches) if matches else 0.0)

mean_iou_matched = np.mean(mean_ious)
print(f"Mean IoU (matched): {mean_iou_matched:.3f}")

Expected Performance

Metric | Part A (1 class) | Part B (10 classes)
mAP@0.5 | 0.90-0.95 | 0.85-0.90
Mean IoU (matched) | 0.80-0.85 | 0.75-0.80

SUMMARY: KEY TAKEAWAYS

Evolution of Object Detection

Core Concepts

YOLO for Assignment 5

Evaluation Metrics
