CMPUT 328 Assignment 5: Object Detection

1. INTRODUCTION TO OBJECT DETECTION

What is Object Detection?

Object detection localizes and classifies multiple objects in an image.

Output: For each object: (bounding box, class label, confidence score)

Object Detection vs Other Vision Tasks

Task	Input	Output	Example
Classification	Image	Single class label	"dog"
Object Detection	Image	Multiple (bbox, class, conf)	[(box, "dog", 0.95), (box, "cat", 0.87)]
Segmentation	Image	Pixel-wise masks	Per-pixel class labels

Key Challenge: Variable Output Length

Problem: The network doesn't know how many objects are in each image.

Solutions over time:

Older methods: Sliding window, selective search (region proposals)
Modern methods: Anchor boxes, objectness score thresholding, NMS
Latest methods: Transformer-based set prediction (DETR), which outputs a fixed-size set of predictions in parallel

Hierarchical Representations

Neural networks learn hierarchical features for object detection:

Input layer: Raw pixel values
Early layers: Edges, corners, textures
Middle layers: Parts of objects (wheels, faces)
Deep layers: Whole objects (cars, people)
Output layer: Bounding boxes + class predictions

2. HISTORY OF OBJECT DETECTION (R-CNN → FASTER R-CNN)

Evolution of Two-Stage Detectors

Selective Search (Pre-Deep Learning)

Selective Search: Algorithm to generate ~2000 region proposals per image

How it works:

Over-segment image into many small regions
Hierarchically group regions based on color, texture, size, fill
Generate bounding boxes around grouped regions
Output: ~2000 region proposals that likely contain objects

R-CNN (2014)

R-CNN Pipeline:
1. Input image
2. Selective Search → ~2000 region proposals
3. Warp each region to fixed size (227×227)
4. Pass each warped region through CNN (AlexNet)
5. Extract CNN features for each region
6. SVM classifier per class
7. Bounding box regressor to refine boxes

Problem: Very slow (~47 seconds per image)
- Must run CNN forward pass ~2000 times per image

Fast R-CNN (2015)

Key innovation: Pass image through CNN only once, then extract features for each region

Fast R-CNN Pipeline:
1. Input image → CNN backbone (single forward pass)
2. Get feature map from last conv layer
3. Selective Search → ~2000 region proposals (on original image)
4. Project each proposal onto feature map → RoI (Region of Interest)
5. ROI Pooling: Convert each RoI to fixed-size feature vector
6. Fully connected layers → (classification, bbox regression)

Speed up: ~0.3 seconds per image for the network itself (from 47s), not counting proposal generation
Bottleneck: Selective search still slow (roughly 2 seconds per image, so about 2.3 s end to end)

ROI Pooling Explained

ROI Pooling converts variable-sized regions into fixed-size features

Example:
- Feature map: 512 × 20 × 15 (channels × height × width)
- RoI on feature map: Variable size (e.g., 7×5 region)
- Target output: 512 × 2 × 2 (fixed size)

Process:
1. Divide RoI into 2×2 grid (target output size)
2. Max pool within each grid cell
3. Result: 512 × 2 × 2 fixed-size feature

Benefits:
- Variable input → fixed output (required for FC layers)
- Differentiable (can backprop)
- Fast (just max pooling)
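
The process above can be sketched in plain Python for a single channel. This is a minimal illustration, not the Fast R-CNN implementation: the function name `roi_max_pool`, the exclusive `(x1, y1, x2, y2)` convention, and the floor/ceil bin edges are assumptions made for the example.

```python
def roi_max_pool(feature, roi, out_h=2, out_w=2):
    """Max-pool one RoI of a single-channel feature map to out_h x out_w.

    feature: list of rows (H x W).
    roi: (x1, y1, x2, y2) in feature-map cells, x2 and y2 exclusive (assumed).
    """
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Integer bin edges along the height: floor for start, ceil for end
        ys = y1 + (i * roi_h) // out_h
        ye = y1 + -(-((i + 1) * roi_h) // out_h)
        row = []
        for j in range(out_w):
            xs = x1 + (j * roi_w) // out_w
            xe = x1 + -(-((j + 1) * roi_w) // out_w)
            # Max over all cells that fall inside this bin
            row.append(max(feature[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# 20 x 15 (height x width) map with a unique value per cell
feature = [[y * 15 + x for x in range(15)] for y in range(20)]
# RoI that is 5 cells wide and 7 cells tall, pooled to a 2 x 2 output
pooled = roi_max_pool(feature, (2, 3, 7, 10))
```

Running the same function once per channel would give the 512 × 2 × 2 output described above.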

Faster R-CNN (2016)

Key innovation: Replace selective search with Region Proposal Network (RPN)

RPN: Neural network that predicts region proposals

Faster R-CNN Pipeline:
1. Input image → CNN backbone (shared)
2. Feature map → Region Proposal Network (RPN)
   - RPN outputs ~300 region proposals
   - Much faster than selective search
3. ROI Pooling on proposed regions
4. Classification + bbox regression heads

Two-stage training:
Stage 1: Train backbone + RPN
Stage 2: Train backbone + detection heads (classification + bbox)

Speed: ~0.2 seconds per image (5 FPS)

Region Proposal Network (RPN) Details

RPN predicts whether each location contains an object

At each location in feature map:
1. Place K anchor boxes of different scales/aspect ratios
   - Example: 3 scales × 3 aspect ratios = 9 anchors
2. For each anchor, predict:
   - Objectness score (1 value): does anchor contain object?
   - Box offsets (4 values): how to adjust anchor to fit object?

Output per location: K anchors × (1 objectness + 4 offsets) = K × 5 values

Full output for 20×15 feature map:
- Objectness: 20 × 15 × K
- Box transforms: 20 × 15 × K × 4

Post-processing:
1. Apply objectness threshold (e.g., > 0.5)
2. Apply box transforms to anchors
3. Apply NMS to remove duplicates
4. Keep top ~300 proposals

Why Anchors?

Problem: Neural networks need fixed-size outputs, but objects have variable sizes/shapes

Solution: Anchor boxes

Pre-define K box shapes at each location
Network predicts adjustments to these anchors
Anchor shapes are either hand-picked scales and aspect ratios (Faster R-CNN) or obtained by k-means clustering of training boxes (YOLOv2)

Without anchors (selective search): Can generate arbitrary proposals, but not differentiable

With anchors (RPN): Differentiable, learnable, but constrained to anchor shapes

Bounding Box Regression in R-CNN

Box Parameterization:
Anchor/proposal box: p = (p_x, p_y, p_w, p_h) (center x, center y, width, height)
Ground truth box: g = (g_x, g_y, g_w, g_h)

Network predicts transformations d(p):
ĝ_x = p_w · d_x(p) + p_x
ĝ_y = p_h · d_y(p) + p_y
ĝ_w = p_w · exp(d_w(p))
ĝ_h = p_h · exp(d_h(p))

Target transformations:
t_x = (g_x - p_x) / p_w
t_y = (g_y - p_y) / p_h
t_w = log(g_w / p_w)
t_h = log(g_h / p_h)

Loss: L_reg = Σ_i (t_i - d_i(p))² + λ||w||², for i ∈ {x, y, w, h} (in the last term, w denotes the regressor weights)
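
To make the parameterization concrete, here is a small sketch that computes the regression targets for one anchor and then inverts them. The function names `encode_box` and `decode_box` and the example boxes are illustrative assumptions; boxes are taken to be (center x, center y, width, height).

```python
import math


def encode_box(p, g):
    """Targets t that map anchor p onto ground truth g."""
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    return ((gx - px) / pw,       # shift in x, relative to anchor width
            (gy - py) / ph,       # shift in y, relative to anchor height
            math.log(gw / pw),    # log-scale change in width
            math.log(gh / ph))    # log-scale change in height


def decode_box(p, d):
    """Apply predicted transformations d to anchor p."""
    px, py, pw, ph = p
    dx, dy, dw, dh = d
    return (pw * dx + px,
            ph * dy + py,
            pw * math.exp(dw),
            ph * math.exp(dh))


anchor = (50.0, 50.0, 20.0, 40.0)   # hypothetical anchor
target = (55.0, 60.0, 30.0, 20.0)   # hypothetical ground truth
t = encode_box(anchor, target)
recovered = decode_box(anchor, t)   # should reproduce the ground truth box
```

Dividing the offsets by the anchor size and using a log for width and height keeps the targets in a similar numeric range for small and large anchors.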

Performance Comparison

Method	Year	Speed (sec/img, incl. proposals)	Region Proposals	mAP
R-CNN	2014	49	Selective Search	~58%
SPP-Net	2014	4.3	Selective Search	~59%
Fast R-CNN	2015	2.3	Selective Search	~66%
Faster R-CNN	2016	0.2	RPN (neural net)	~73%

3. ROI POOLING, ANCHORS, AND REGION PROPOSALS

Why Selective Search Doesn't Need Anchors

Selective Search: Can generate proposals of any shape/size

Bottom-up approach based on image segmentation
Not constrained by pre-defined shapes
Downside: Not learnable, slow, hand-crafted heuristics

RPN with Anchors: Must predict from fixed set of shapes

Top-down approach using neural network
Anchors provide starting points
Upside: Learnable, fast, end-to-end trainable

Anchor Box Design

Aspect Ratio	Scale	Purpose
1:1	Small, Medium, Large	Square objects (faces, balls)
1:2	Small, Medium, Large	Tall objects (people, bottles)
2:1	Small, Medium, Large	Wide objects (cars, buses)

Common configuration: 3 scales × 3 aspect ratios = 9 anchors per location
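
As a rough illustration of the 3 × 3 configuration, the sketch below enumerates anchor shapes in plain Python. The scale values (32, 64, 128 pixels) and the helper name `make_anchors` are assumptions chosen for the example, not values from any particular detector; ratios are width / height, so 0.5 corresponds to the 1:2 (tall) row in the table.

```python
import math


def make_anchors(scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Return (w, h) per scale/ratio pair; each anchor has area scale**2."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * math.sqrt(r)   # wider as the ratio grows
            h = s / math.sqrt(r)   # taller as the ratio shrinks
            anchors.append((w, h))
    return anchors


anchors = make_anchors()
anchors_per_location = len(anchors)
# Total anchors on a 20 x 15 feature map, one set per location
total_anchors = 20 * 15 * anchors_per_location
```

Keeping the area fixed per scale means the aspect ratio changes the shape of an anchor without changing how much of the image it covers.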

ROI Pooling vs ROI Align

ROI Pooling Problem

Quantization: ROI pooling uses integer coordinates, causing misalignment

Example: RoI at (6.5, 4.7, 18.3, 12.9) → quantized (here floored) to (6, 4, 18, 12)

Impact: Slight misalignment, especially bad for segmentation

ROI Align Solution (Mask R-CNN)

ROI Align: Use bilinear interpolation instead of rounding

Preserve exact spatial locations
Better for pixel-level tasks (segmentation)
Standard in modern detectors
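
The core operation behind ROI Align is reading the feature map at a fractional location. The sketch below shows only that bilinear sampling step for one channel, under the assumption that the point lies inside the map; real ROI Align implementations sample several such points per output bin and aggregate them. The name `bilinear_sample` is illustrative.

```python
import math


def bilinear_sample(feature, x, y):
    """Sample a single-channel feature map (list of rows) at fractional (x, y)."""
    height, width = len(feature), len(feature[0])
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    # Clamp the far neighbours so points on the last row/column stay in bounds
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    ax, ay = x - x0, y - y0          # fractional distance from the top-left cell
    top = (1 - ax) * feature[y0][x0] + ax * feature[y0][x1]
    bottom = (1 - ax) * feature[y1][x0] + ax * feature[y1][x1]
    return (1 - ay) * top + ay * bottom


feature = [[0.0, 10.0],
           [20.0, 30.0]]
center_value = bilinear_sample(feature, 0.5, 0.5)   # blend of all four cells
corner_value = bilinear_sample(feature, 1.0, 0.0)   # exactly on a cell
```

Because no coordinate is rounded, a small shift in the RoI produces a small change in the sampled value, which is what pixel-level tasks need.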

4. YOLO: SINGLE-STAGE DETECTION

YOLO Philosophy

You Only Look Once: Predict bounding boxes and classes in a single forward pass

Key insight: Frame detection as regression, not classification on proposals

Speed advantage: Roughly an order of magnitude faster than Faster R-CNN (YOLOv1 ran at about 45 FPS versus about 5 FPS)

YOLOv1 (2016)

YOLOv1 Architecture:
1. Divide image into S × S grid (e.g., 7×7)
2. Each grid cell predicts:
   - B bounding boxes (x, y, w, h, confidence)
   - C class probabilities

Output tensor: S × S × (B×5 + C)
Example (S=7, B=2, C=20): 7 × 7 × 30

Grid cell responsible for object if:
- Object's center falls in that cell

Confidence score:
- Pr(Object) × IoU(pred, truth)
- 0 if no object in cell
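
The output-shape arithmetic and the "responsible cell" rule can be checked with a few lines of Python. The helper names and the 448×448 image size in the example are assumptions for illustration.

```python
def yolo_v1_output_shape(S=7, B=2, C=20):
    """Each cell holds B boxes of 5 values plus C class probabilities."""
    return (S, S, B * 5 + C)


def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell that contains the object center."""
    col = min(int(cx / img_w * S), S - 1)   # clamp centers on the right edge
    row = min(int(cy / img_h * S), S - 1)   # clamp centers on the bottom edge
    return row, col


shape = yolo_v1_output_shape()
# Object centered at (224, 100) in a hypothetical 448 x 448 image
cell = responsible_cell(224, 100, 448, 448)
```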

YOLO Evolution

Version	Year	Key Improvements
YOLOv1	2016	Single-stage, grid-based, real-time
YOLOv2	2017	Batch norm, anchor boxes, multi-scale training
YOLOv3	2018	FPN (3 scales), better for small objects
YOLOv4	2020	CSPDarknet, Mish activation, mosaic augmentation
YOLOv5	2020	PyTorch, auto-anchor, production-ready
YOLOv8	2023	Anchor-free, C2f blocks, improved neck
YOLO11	2024	Latest state-of-the-art

YOLOv8 Architecture (Used in Assignment)

YOLOv8 Components:

Backbone: CSPDarknet
├─ Extracts features at multiple scales
├─ CSP (Cross Stage Partial) blocks
└─ SPPF (Spatial Pyramid Pooling - Fast)

Neck: PANet (Path Aggregation Network)
├─ Top-down: FPN for multi-scale fusion
└─ Bottom-up: PAN for feature enhancement

Head: Decoupled anchor-free head
├─ Classification head → class probabilities
└─ Regression head → bbox coordinates (direct prediction)

Key differences from earlier YOLO:
- No anchor boxes (anchor-free)
- Separate heads for classification and localization
- C2f modules instead of C3 (faster, better gradient flow)

Why Anchor-Free?

Simpler: No need to tune anchor sizes/ratios
Better generalization: Not constrained to pre-defined shapes
Fewer hyperparameters: Easier to use
Direct prediction: Predict box center and size directly

YOLO Loss Function

YOLOv1 Loss (Multi-part):
L_total = λ_coord × L_box + L_obj + λ_noobj × L_noobj + L_class
- L_box: Localization loss (coordinates + size)
- L_obj: Confidence loss (cells with objects)
- L_noobj: Confidence loss (cells without objects), down-weighted by λ_noobj because most cells are empty
- L_class: Classification loss

YOLOv8 Loss:
L_total = L_cls + L_box + L_dfl
- L_cls: Classification loss (BCE)
- L_box: Box loss (CIoU - Complete IoU)
- L_dfl: Distribution Focal Loss (for bbox refinement)

5. MODERN APPROACHES (CENTERNET, DETR)

Anchor-less Object Detection

Motivation: Anchors add complexity and hyperparameters

Anchor-less approaches:

Keypoint-based: CenterNet - detect object centers as keypoints
Transformer-based: DETR - set prediction with transformers

CenterNet (2019)

CenterNet: Objects as Points

Key idea: Represent each object as a single point (its center)

Architecture:
1. Input image → Backbone CNN → Feature map
2. Three prediction heads:
   a) Heatmap: Detect object centers (Gaussian peaks)
   b) Size: Predict width and height at center
   c) Offset: Sub-pixel offset (for quantization correction)

Training:
- Ground truth: Gaussian heatmap around object centers
- Loss: Focal loss for heatmap + L1 loss for size/offset

Inference:
1. Find local maxima in heatmap (object centers)
2. Read size and offset at each center
3. Reconstruct bounding boxes

Advantages:
- No anchors, no NMS needed (few duplicate detections)
- Simple and fast
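
The inference steps can be sketched for a single class in plain Python. This is a simplified illustration rather than the CenterNet code: the function name `decode_centers`, the 3×3 peak test, the stride of 4, and the convention that sizes are given in input pixels while offsets are in feature-map cells are all assumptions made for the example.

```python
def decode_centers(heatmap, sizes, offsets, threshold=0.3, stride=4):
    """Turn per-cell predictions into (x1, y1, x2, y2, score) boxes.

    heatmap[y][x]: center score, sizes[y][x]: (w, h), offsets[y][x]: (dx, dy).
    """
    H, W = len(heatmap), len(heatmap[0])
    boxes = []
    for y in range(H):
        for x in range(W):
            score = heatmap[y][x]
            if score < threshold:
                continue
            # Keep the cell only if it is the maximum of its 3x3 neighbourhood
            neighbours = [heatmap[ny][nx]
                          for ny in range(max(0, y - 1), min(H, y + 2))
                          for nx in range(max(0, x - 1), min(W, x + 2))]
            if score < max(neighbours):
                continue
            dx, dy = offsets[y][x]
            w, h = sizes[y][x]
            # Sub-cell offset corrects the quantization, stride maps to pixels
            cx, cy = (x + dx) * stride, (y + dy) * stride
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, score))
    return boxes


heatmap = [[0.1, 0.1, 0.1],
           [0.1, 0.9, 0.1],
           [0.1, 0.1, 0.1]]
sizes = [[(8, 4)] * 3 for _ in range(3)]
offsets = [[(0.5, 0.25)] * 3 for _ in range(3)]
boxes = decode_centers(heatmap, sizes, offsets)
```

The local-maximum test plays the role that NMS plays in anchor-based detectors: neighbouring cells with lower scores are simply never turned into boxes.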

DETR (2020) - End-to-End Detection with Transformers

DETR Philosophy

Set prediction: Predict a fixed-size set of objects in parallel

No hand-crafted components: No anchors, no NMS, learned end-to-end

DETR Architecture:
1. Input image → CNN backbone → Feature map
2. Flatten feature map → Sequence of features
3. Add positional encodings
4. Transformer encoder-decoder:
   - Encoder: Process image features
   - Decoder: N object queries → N predictions
5. FFN heads → (class, bbox) for each query

Object queries:
- N learned embeddings (e.g., N=100)
- Each query predicts one object (or "no object")
- Transformer learns to assign queries to objects

Hungarian Matching:
- Bipartite matching between predictions and ground truth
- Find optimal 1-to-1 assignment
- Loss computed only on matched pairs

Training:
- Classification loss: Cross-entropy
- Box loss: L1 + GIoU loss
- Hungarian matching provides assignment

Hungarian Matching: Non-Differentiable Step

Problem: Hungarian algorithm uses discrete argmin/argmax (not differentiable)

Training still works:

Matching done in torch.no_grad() (no gradient through matching)
Matching provides indices: which prediction matches which ground truth
Loss computed using these indices (loss is differentiable)
Gradients flow through predictions, not through matching
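
To show what "matching provides indices" means, the sketch below finds the minimum-cost one-to-one assignment for a tiny cost matrix. It uses brute-force enumeration as a stand-in for the Hungarian algorithm (same result, only practical for a handful of boxes); the function name `optimal_assignment` and the cost values are made up for illustration. In DETR the cost would combine class and box terms.

```python
from itertools import permutations


def optimal_assignment(cost):
    """cost[i][j]: cost of matching prediction i to ground truth j.

    Assumes at least as many predictions as ground truths.
    Returns ([(pred_idx, gt_idx), ...], total_cost).
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best_total, best_perm = float("inf"), None
    # Try every way of giving each ground truth a distinct prediction
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[perm[gt]][gt] for gt in range(n_gt))
        if total < best_total:
            best_total, best_perm = total, perm
    return [(best_perm[gt], gt) for gt in range(n_gt)], best_total


# 3 predictions (rows), 2 ground-truth boxes (columns)
cost = [[0.9, 0.1],
        [0.2, 0.8],
        [0.5, 0.5]]
pairs, total = optimal_assignment(cost)
```

The returned index pairs are plain integers, so no gradient passes through them; the loss is then evaluated on the matched predictions, and the unmatched prediction would be trained toward "no object".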

Issues:

Noisy early supervision (random matches initially)
Discontinuous loss surface when assignments flip
Slow convergence (hundreds of epochs)

Improvements to DETR

Method	Key Improvement
Deformable DETR	Multi-scale deformable attention (faster, better convergence)
Conditional DETR	Condition queries on spatial priors (reduce matching ambiguity)
DN-DETR	Add noised ground-truth queries (stabilize early training)
DINO	Contrastive denoising + mixed query selection
Group DETR / Co-DETR	Multi-group matching, current SOTA on COCO

Performance on COCO

Historical Progress (mAP on COCO, approximate; exact values depend on backbone and test settings):
Fast R-CNN (2015): ~21%
Faster R-CNN (2016): ~37%
Mask R-CNN (2017): ~42%
YOLO v3 (2018): ~33% (about 58% at IoU 0.5 only)
Cascade R-CNN (2019): ~50%
DetectoRS (2020): ~55%
DETR v2 (2023): ~66%
Co-DETR (2024): ~67%

Note: mAP = mAP@[0.5:0.95] (average over IoU 0.5 to 0.95)

6. BOUNDING BOX REPRESENTATIONS

Common Bounding Box Formats

Format	Representation	Description	Use Case
XYXY	(x1, y1, x2, y2)	Top-left + bottom-right corners	PyTorch, easy IoU calculation
XYWH	(x, y, w, h)	Top-left + width/height	Intuitive, COCO dataset
CXCYWH	(cx, cy, w, h)	Center + width/height	DETR, transformers
YOLO Format	(cx, cy, w, h) normalized	Center + size, all in [0,1]	YOLO training labels

Format Conversions

import torch

def xyxy_to_xywh(boxes):
    """XYXY to XYWH"""
    x1, y1, x2, y2 = boxes[..., 0], boxes[..., 1], boxes[..., 2], boxes[..., 3]
    x = x1
    y = y1
    w = x2 - x1
    h = y2 - y1
    return torch.stack([x, y, w, h], dim=-1)

def xywh_to_xyxy(boxes):
    """XYWH to XYXY"""
    x, y, w, h = boxes[..., 0], boxes[..., 1], boxes[..., 2], boxes[..., 3]
    x1 = x
    y1 = y
    x2 = x + w
    y2 = y + h
    return torch.stack([x1, y1, x2, y2], dim=-1)

def xyxy_to_yolo(boxes, image_size):
    """XYXY to YOLO (normalized center format)"""
    width, height = image_size
    x1, y1, x2, y2 = boxes[..., 0], boxes[..., 1], boxes[..., 2], boxes[..., 3]

    cx = (x1 + x2) / (2 * width)
    cy = (y1 + y2) / (2 * height)
    w = (x2 - x1) / width
    h = (y2 - y1) / height

    return torch.stack([cx, cy, w, h], dim=-1)
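
The table also lists the CXCYWH format and the inverse of the YOLO conversion, which can be sketched in the same spirit. For brevity this version works on a single box given as a plain tuple rather than a tensor; the function names are illustrative.

```python
def xyxy_to_cxcywh(box):
    """Corners -> center + size."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


def cxcywh_to_xyxy(box):
    """Center + size -> corners."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def yolo_to_xyxy(box, image_size):
    """Normalized YOLO (cx, cy, w, h) -> pixel corners."""
    width, height = image_size
    cx, cy, w, h = box
    # Undo the normalization first, then convert center format to corners
    return cxcywh_to_xyxy((cx * width, cy * height, w * width, h * height))


center_box = xyxy_to_cxcywh((16, 8, 48, 40))
round_trip = cxcywh_to_xyxy(center_box)
pixel_box = yolo_to_xyxy((0.5, 0.375, 0.5, 0.5), (64, 64))
```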

Coordinate System

Origin (0,0): Top-left corner
x-axis: Left to right
y-axis: Top to bottom
Normalized coords: Divide by width/height → [0, 1]

7. IoU AND NON-MAXIMUM SUPPRESSION

Intersection over Union (IoU)

IoU Formula:
IoU = Area of Intersection / Area of Union

Where:
- Intersection = Overlapping region
- Union = Total area covered by both boxes
- Range: [0, 1] (0 = no overlap, 1 = perfect match)

Mathematical: IoU = |A ∩ B| / (|A| + |B| - |A ∩ B|)

def box_iou(box1, box2):
    """
    Compute IoU between two boxes (XYXY format)

    Args:
        box1, box2: (x1, y1, x2, y2)

    Returns:
        iou: float in [0, 1]
    """
    # Intersection rectangle
    inter_x1 = max(box1[0], box2[0])
    inter_y1 = max(box1[1], box2[1])
    inter_x2 = min(box1[2], box2[2])
    inter_y2 = min(box1[3], box2[3])

    # Intersection area
    inter_w = max(0, inter_x2 - inter_x1)
    inter_h = max(0, inter_y2 - inter_y1)
    inter_area = inter_w * inter_h

    # Box areas
    box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
    box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])

    # Union area
    union_area = box1_area + box2_area - inter_area

    # IoU
    iou = inter_area / (union_area + 1e-6)
    return iou
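
Later snippets in these notes call a pairwise helper named `box_iou_matrix` that is not defined anywhere in the text. The following is one possible sketch of such a helper, assuming it takes two float tensors of XYXY boxes and returns an (N, M) matrix; `torchvision.ops.box_iou` offers the same kind of pairwise computation.

```python
import torch


def box_iou_matrix(boxes1, boxes2):
    """Pairwise IoU between boxes1 (N, 4) and boxes2 (M, 4), XYXY format."""
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])

    # Broadcast to (N, M, 2): top-left and bottom-right of each intersection
    top_left = torch.max(boxes1[:, None, :2], boxes2[None, :, :2])
    bottom_right = torch.min(boxes1[:, None, 2:], boxes2[None, :, 2:])

    # Negative extents mean no overlap, so clamp them to zero
    wh = (bottom_right - top_left).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]

    union = area1[:, None] + area2[None, :] - inter
    return inter / (union + 1e-6)


a = torch.tensor([[0.0, 0.0, 10.0, 10.0]])
b = torch.tensor([[0.0, 0.0, 10.0, 10.0],
                  [5.0, 5.0, 15.0, 15.0],
                  [20.0, 20.0, 30.0, 30.0]])
ious = box_iou_matrix(a, b)   # shape (1, 3)
```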

IoU Thresholds in Practice

IoU Range	Quality	Usage
IoU ≥ 0.5	Good match	Standard threshold for TP
IoU ≥ 0.7	Strong match	Stricter evaluation
IoU ≥ 0.9	Excellent	Very precise localization
IoU < 0.5	Poor match	False positive

Non-Maximum Suppression (NMS)

Why NMS?

Problem: Multiple detections for same object

Solution: Keep highest confidence, remove overlapping detections

NMS Algorithm:
1. Sort all detections by confidence score (high → low)
2. While detections remain:
   a. Take highest confidence detection
   b. Add to final output
   c. Remove all detections with IoU > threshold (e.g., 0.45) with this detection
3. Return final output

When is NMS applied?
- Only during inference (not training)
- After confidence thresholding
- Per class (separately for each object class)

import torch

def nms(boxes, scores, iou_threshold=0.45):
    """
    Non-Maximum Suppression

    Args:
        boxes: (N, 4) in XYXY format
        scores: (N,) confidence scores
        iou_threshold: IoU threshold for suppression

    Returns:
        keep: Indices of boxes to keep
    """
    # Sort by score
    sorted_indices = torch.argsort(scores, descending=True)

    keep = []
    while len(sorted_indices) > 0:
        # Take highest confidence
        current = sorted_indices[0]
        keep.append(current.item())

        if len(sorted_indices) == 1:
            break

        # Compute IoU with remaining boxes.
        # box_iou_matrix is assumed to return an (N, M) pairwise IoU matrix
        # for XYXY boxes (torchvision.ops.box_iou behaves this way).
        ious = box_iou_matrix(boxes[current:current+1], boxes[sorted_indices[1:]])

        # Keep boxes with IoU <= threshold (suppress only IoU > threshold)
        mask = ious[0] <= iou_threshold
        sorted_indices = sorted_indices[1:][mask]

    return keep

NMS Hyperparameters

Confidence threshold (e.g., 0.25): Filter low-confidence predictions before NMS
IoU threshold (e.g., 0.45): How much overlap allowed before suppression
Lower IoU threshold: More aggressive (fewer boxes, may remove valid detections)
Higher IoU threshold: Less aggressive (more boxes, may keep duplicates)
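
The effect of the IoU threshold is easy to see on a toy example. The sketch below is a framework-free NMS written only for this demonstration (the names `iou` and `nms_simple` and the three boxes are made up); the first two boxes overlap with IoU of about 0.67, and the third is far away.

```python
def iou(a, b):
    """IoU of two XYXY boxes given as tuples."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms_simple(boxes, scores, iou_threshold):
    """Greedy NMS: keep the best box, drop others overlapping it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Survivors are boxes whose overlap with the kept box is small enough
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (2, 0, 12, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
strict = nms_simple(boxes, scores, iou_threshold=0.45)   # suppresses the overlapping box
loose = nms_simple(boxes, scores, iou_threshold=0.70)    # keeps all three boxes
```

With the lower threshold the second box is treated as a duplicate of the first; with the higher threshold it survives, which is the trade-off described in the list above.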

8. EVALUATION METRICS (mAP, PRECISION, RECALL)

Confusion Matrix for Object Detection

Metric	Definition	Condition
True Positive (TP)	Correct detection	IoU ≥ threshold AND correct class
False Positive (FP)	Incorrect detection	IoU < threshold OR wrong class OR duplicate of an already-matched ground truth
False Negative (FN)	Missed object	No prediction matched this ground truth
True Negative (TN)	N/A	Not applicable (infinite background)

Precision and Recall

Precision: Of all detections, how many are correct?
Precision = TP / (TP + FP)

Recall: Of all ground truth objects, how many detected?
Recall = TP / (TP + FN)

Trade-off:
- Lower confidence threshold → higher recall, lower precision
- Higher confidence threshold → lower recall, higher precision

F1 Score: Harmonic mean
F1 = 2 × (Precision × Recall) / (Precision + Recall)
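
As a quick numeric check of these formulas, here is a small helper with made-up counts (3 true positives, 1 false positive, 2 missed objects). The function name and the zero-division guards are choices made for this sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 4 detections (3 correct), 5 ground-truth objects
precision, recall, f1 = precision_recall_f1(tp=3, fp=1, fn=2)
```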

Average Precision (AP)

AP Calculation

Sort all detections by confidence (high to low)
For each detection threshold, compute precision and recall
Plot precision-recall curve
AP = Area under the interpolated precision-recall curve

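AP Formula:
AP = Σ (Rₙ - Rₙ₋₁) × Pₙ
Where:
- Rₙ = recall at nth threshold
- Pₙ = precision at nth threshold
- P_interp(r) = maximum precision at any recall ≥ r

11-point interpolation (PASCAL VOC 2007): AP = (1/11) × Σ P_interp(r) for r ∈ {0, 0.1, ..., 1.0}
All-point interpolation (PASCAL VOC 2010 onward): Use all unique recall values (more accurate)
101-point interpolation (COCO): Sample P_interp(r) at r ∈ {0, 0.01, ..., 1.0}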

Mean Average Precision (mAP)

mAP Calculation:
1. Compute AP for each class
2. Average over all classes

mAP = (1/N) × Σ APᵢ (for N classes)

COCO Metrics:
- mAP or mAP@[0.5:0.95]: Average over IoU thresholds 0.5, 0.55, ..., 0.95
- mAP@0.5: mAP at IoU threshold = 0.5 (more lenient)
- mAP@0.75: mAP at IoU threshold = 0.75 (stricter)
- mAP_small: mAP for small objects (area < 32²)
- mAP_medium: mAP for medium objects (32² < area < 96²)
- mAP_large: mAP for large objects (area > 96²)

mAP Interpretation

mAP@0.5 = 0.5: Decent detector
mAP@0.5 = 0.7: Good detector
mAP@0.5 = 0.9+: Excellent (rare on complex datasets)
High mAP@0.5, low mAP@0.75: Finds objects but localizes poorly
High mAP@0.75: Accurate localization

Example Calculation

Example: 10 detections for "cat" class

Detections sorted by confidence:
Detection | Conf  | IoU  | TP/FP | Precision | Recall
    1     | 0.95  | 0.88 | TP    | 1/1=1.00  | 1/5=0.20
    2     | 0.90  | 0.67 | TP    | 2/2=1.00  | 2/5=0.40
    3     | 0.85  | 0.42 | FP    | 2/3=0.67  | 2/5=0.40
    4     | 0.80  | 0.73 | TP    | 3/4=0.75  | 3/5=0.60
    5     | 0.75  | 0.35 | FP    | 3/5=0.60  | 3/5=0.60
    6     | 0.70  | 0.81 | TP    | 4/6=0.67  | 4/5=0.80
    7     | 0.60  | 0.92 | TP    | 5/7=0.71  | 5/5=1.00
    8     | 0.55  | 0.23 | FP    | 5/8=0.63  | 5/5=1.00
    9     | 0.50  | 0.15 | FP    | 5/9=0.56  | 5/5=1.00
   10     | 0.45  | 0.08 | FP    | 5/10=0.50 | 5/5=1.00

(Assume 5 ground truth cats total)

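Precision-Recall pairs at each new recall level (TP means IoU ≥ 0.5): (1.00, 0.20), (1.00, 0.40), (0.75, 0.60), (0.67, 0.80), (0.71, 1.00)
Interpolated precision (max precision at any recall ≥ r): 1.000, 1.000, 0.750, 0.714, 0.714
AP = 0.2 × (1.000 + 1.000 + 0.750 + 0.714 + 0.714) ≈ 0.84 (all-point interpolation, for this class)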
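
The calculation for this example can be reproduced with a short script. It is a sketch of all-point interpolated AP, assuming the detections are already sorted by confidence and labelled TP/FP as in the table; the function name `average_precision` is illustrative. For the ten detections above it evaluates to about 0.836.

```python
def average_precision(is_tp, num_gt):
    """All-point interpolated AP from TP/FP flags sorted by confidence."""
    tp = fp = 0
    precisions, recalls = [], []
    for flag in is_tp:
        if flag:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)

    # Interpolate: precision at rank i becomes the max precision at any later rank
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas wherever recall increases
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# TP/FP column of the table, in confidence order; 5 ground-truth cats
flags = [True, True, False, True, False, True, True, False, False, False]
ap = average_precision(flags, num_gt=5)
```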

9. TRAINING & FINE-TUNING

Transfer Learning

Pre-trained models: Trained on COCO (80 classes, 120k images)

Fine-tuning: Adapt to custom dataset

Benefits:

Faster convergence (fewer epochs)
Better performance with less data
Learned features transfer across domains

YOLO Training Configuration

from ultralytics import YOLO

# Load pre-trained model
model = YOLO('yolov8n.pt')  # nano (fastest)

# Train
results = model.train(
    data='dataset.yaml',
    epochs=50,
    imgsz=64,              # MNISTDD-RGB is 64×64
    batch=16,
    lr0=0.01,              # Initial learning rate
    device='cuda:0',

    # Data augmentation
    hsv_h=0.015,           # Hue
    hsv_s=0.7,             # Saturation
    hsv_v=0.4,             # Value
    degrees=0.0,           # Rotation (keep 0 for upright digits)
    translate=0.1,         # Translation
    scale=0.5,             # Scale
    fliplr=0.5,            # Horizontal flip
    mosaic=1.0,            # Mosaic augmentation

    # Loss weights
    box=7.5,               # Box loss gain
    cls=0.5,               # Class loss gain
    dfl=1.5,               # DFL loss gain
)

Dataset Format (YOLO)

# dataset.yaml structure
path: /path/to/dataset
train: images/train
val: images/val

names:
  0: digit_0
  1: digit_1
  # ... up to digit_9

# Label files (one per image)
# images/train/img001.jpg → labels/train/img001.txt
# Format: <class_id> <x_center> <y_center> <width> <height>
# All coordinates normalized to [0, 1]
#
# Example label file content:
# 3 0.500 0.500 0.200 0.300
# 7 0.250 0.750 0.150 0.200
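
Producing one such label line from a pixel-space XYXY box might look like the sketch below. The helper name `yolo_label_line`, the six-decimal formatting, and the example box are assumptions; the 64×64 default matches the image size used in this assignment.

```python
def yolo_label_line(class_id, box_xyxy, img_w=64, img_h=64):
    """Format one object as a YOLO label line with normalized coordinates."""
    x1, y1, x2, y2 = box_xyxy
    cx = (x1 + x2) / 2 / img_w    # normalized box center
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w         # normalized box size
    h = (y2 - y1) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# Hypothetical digit "3" occupying pixels (22, 22) to (42, 42)
line = yolo_label_line(3, (22, 22, 42, 42))
```

One line per object would be written to the `.txt` file that shares its name with the image.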

Data Augmentation

Augmentation	Effect	Recommended Value
Mosaic	Combine 4 images	1.0 (always apply)
Horizontal flip	Mirror image	0.5 (50% chance)
HSV jitter	Color variation	h=0.015, s=0.7, v=0.4
Scale	Zoom in/out	0.5 (±50%)
Translation	Shift image	0.1 (10%)
Rotation	Rotate	0.0 (digits are upright)

Fine-Tuning Strategies

Freeze backbone: Only train head (fast, less overfitting)
Lower learning rate: 0.001-0.01 for fine-tuning
Fewer epochs: 50-100 epochs usually enough
Monitor validation mAP: Stop when it plateaus

10. MNISTDD-RGB DATASET & ASSIGNMENT

MNISTDD-RGB Dataset

Purpose: Simple object detection dataset for learning

Images: 64×64 RGB
Objects: MNIST digits (0-9)
Per image: 1-3 digits at random positions
Format: NPZ files with images and bboxes

Dataset Structure

import numpy as np

# Load dataset
data = np.load('train.npz')
images = data['images']    # (N, 64, 64, 3)
bboxes = data['bboxes']    # (N, max_objs, 4) in XYXY format

# Example
img = images[0]            # 64×64×3 RGB image
boxes = bboxes[0]          # Multiple bounding boxes for this image

Assignment 5 Tasks

Part	Task	Classes	Goal
Part A	Fine-tune YOLOv8n	1 (generic "digit")	Detect any digit
Part B	Fine-tune YOLOv8n	10 (digit 0-9)	Detect and classify which digit

Manual IoU Evaluation (Assignment Metric)

Mean IoU (Matched):
For each image:
1. Run model to get predictions
2. Compute IoU matrix between predictions and ground truth
3. Greedy matching:
   - For each ground truth, find best prediction (highest IoU)
   - Match only if IoU ≥ 0.5
   - Each prediction matched at most once
4. Compute mean IoU of matched pairs
5. If no matches, IoU = 0 for this image

Final metric: Average over all images

This measures localization quality of correct detections

def greedy_match_ious(iou_matrix, threshold=0.5):
    """
    Greedy matching algorithm

    Args:
        iou_matrix: (N_pred, N_gt) IoU values
        threshold: Minimum IoU to match

    Returns:
        matches: List of IoU values for matched pairs
    """
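    matches = []
    used_preds = set()

    # No predictions or no ground truth: nothing to match
    # (torch.argmax would fail on an empty column)
    if iou_matrix.numel() == 0:
        return matches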

    # For each ground truth
    for gt_idx in range(iou_matrix.shape[1]):
        # Find best prediction
        best_pred_idx = torch.argmax(iou_matrix[:, gt_idx]).item()
        best_iou = iou_matrix[best_pred_idx, gt_idx].item()

        # Match if IoU >= threshold and prediction not used
        if best_iou >= threshold and best_pred_idx not in used_preds:
            matches.append(best_iou)
            used_preds.add(best_pred_idx)

    return matches

Complete Training Pipeline

from ultralytics import YOLO
import numpy as np

# 1. Convert NPZ to YOLO format
# (create images/labels folders with YOLO format annotations)

# 2. Train model
model = YOLO('yolov8n.pt')
results = model.train(
    data='mnistdd.yaml',
    epochs=50,
    imgsz=64,
    batch=16,
    device='cuda:0'
)

# 3. Validate (automatic mAP computation)
metrics = model.val()
print(f"mAP@0.5: {metrics.box.map50:.3f}")
print(f"mAP@0.5:0.95: {metrics.box.map:.3f}")

# 4. Predict on validation set
# (val_images and ground_truth are assumed to come from the converted dataset)
preds = model.predict(val_images, conf=0.25, iou=0.45)

# 5. Compute manual IoU
mean_ious = []
for pred, gt_boxes in zip(preds, ground_truth):
    # Move predictions to CPU; gt_boxes is assumed to be a CPU float tensor (XYXY)
    iou_matrix = box_iou_matrix(pred.boxes.xyxy.cpu(), gt_boxes)
    matches = greedy_match_ious(iou_matrix, threshold=0.5)
    mean_ious.append(np.mean(matches) if matches else 0.0)

mean_iou_matched = np.mean(mean_ious)
print(f"Mean IoU (matched): {mean_iou_matched:.3f}")

Expected Performance

Metric	Part A (1 class)	Part B (10 classes)
mAP@0.5	0.90-0.95	0.85-0.90
Mean IoU (matched)	0.80-0.85	0.75-0.80

SUMMARY: KEY TAKEAWAYS

Evolution of Object Detection

R-CNN (2014): Selective search + CNN features (slow)
Fast R-CNN (2015): ROI pooling (faster)
Faster R-CNN (2016): RPN with anchors (end-to-end)
YOLO (2016): Single-stage, real-time (10× faster)
CenterNet (2019): Anchor-free keypoint detection
DETR (2020): Transformers, set prediction, no NMS

Core Concepts

ROI Pooling: Fixed-size features from variable regions
Anchors: Pre-defined boxes for neural network predictions
IoU: Measures overlap (0=none, 1=perfect)
NMS: Remove duplicate detections (inference only)
mAP: Average AP across classes (primary metric)

YOLO for Assignment 5

YOLOv8: Anchor-free, fast, accurate
Transfer learning: Pre-train on COCO, fine-tune on MNISTDD
Data augmentation: Mosaic, flip, color jitter
Evaluation: mAP@0.5, mean IoU (matched)

Evaluation Metrics

Precision: TP / (TP + FP) - how many detections correct?
Recall: TP / (TP + FN) - how many objects found?
AP: Area under precision-recall curve (per class)
mAP: Average AP over all classes
mAP@[0.5:0.95]: COCO standard (stricter)