38. Object Detection

Previous: Modern Deep Learning Architectures | Next: Practical Image Classification Project



Learning Objectives

  • Understand the difference between two-stage and one-stage detectors
  • Learn YOLO architecture and its evolution
  • Understand Faster R-CNN structure and RPN
  • Grasp DETR (Detection Transformer) concepts
  • Practice with PyTorch/Ultralytics

1. Object Detection Overview

1.1 Problem Definition

┌─────────────────────────────────────────────────────────────────┐
│                  Computer Vision Task Comparison                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. Image Classification                                        │
│     └─ Entire image → Single class                             │
│     └─ Output: "dog"                                            │
│                                                                 │
│  2. Object Detection                                            │
│     └─ Image → Multiple object locations + classes            │
│     └─ Output: [(x1,y1,x2,y2, "dog", 0.95), ...]              │
│                                                                 │
│  3. Semantic Segmentation                                       │
│     └─ Assign class to each pixel                              │
│     └─ Objects of the same class are not distinguished         │
│                                                                 │
│  4. Instance Segmentation                                       │
│     └─ Object detection + pixel mask for each object           │
│     └─ Distinguish individual objects even of the same class   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

1.2 Detector Classification

┌─────────────────────────────────────────────────────────────────┐
│                    Detector Classification                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Two-Stage Detectors                                            │
│    Stage 1: Region Proposal (generate candidates)               │
│    Stage 2: Classification + Regression                         │
│    Examples: R-CNN → Fast R-CNN → Faster R-CNN                  │
│    Pros: High accuracy                                          │
│    Cons: Slow speed                                             │
│                                                                 │
│  One-Stage Detectors                                            │
│    Single network predicts location + class together            │
│    Examples: YOLO, SSD, RetinaNet, CenterNet                    │
│    Pros: Fast, real-time processing possible                    │
│    Cons: Weaker on small objects (much improved in recent       │
│          versions)                                              │
│                                                                 │
│  Transformer-based Detectors                                    │
│    DETR, Deformable DETR, RT-DETR                               │
│    End-to-end training, no NMS required                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

1.3 Evaluation Metrics

"""
Object detection evaluation metrics
"""

def calculate_iou(box1, box2):
    """
    IoU (Intersection over Union) calculation

    Args:
        box1, box2: [x1, y1, x2, y2] format

    Returns:
        IoU value (0~1)
    """
    # Intersection coordinates
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])

    # Intersection area
    inter_area = max(0, x2 - x1) * max(0, y2 - y1)

    # Union area
    box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
    box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union_area = box1_area + box2_area - inter_area

    return inter_area / union_area if union_area > 0 else 0

# Example
pred_box = [100, 100, 200, 200]
gt_box = [120, 110, 210, 210]
print(f"IoU: {calculate_iou(pred_box, gt_box):.3f}")  # approximately 0.68


"""
mAP (mean Average Precision) calculation process:

1. For each class:
   - Sort predictions by confidence
   - TP if IoU > threshold, otherwise FP
   - Calculate Precision-Recall curve
   - AP = area under the curve

2. mAP = average of all class APs

COCO dataset metrics:
- mAP@0.5: IoU=0.5 threshold
- mAP@0.75: IoU=0.75 threshold (strict)
- mAP@[.5:.95]: average of IoU from 0.5 to 0.95
"""

2. R-CNN Family

2.1 R-CNN Evolution

┌─────────────────────────────────────────────────────────────────┐
│                    R-CNN Family Evolution                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  R-CNN (2014):                                                  │
│    Image → Selective Search (~2000 regions) → CNN (AlexNet)     │
│          → SVM Classifier                                       │
│    Problem: ~2000 CNN forward passes → very slow                │
│                                                                 │
│  Fast R-CNN (2015):                                             │
│    Image → CNN Feature Map → RoI Pooling → FC + Softmax         │
│    Improvement: single CNN pass, RoI Pooling extracts regions   │
│    Problem: Selective Search is still slow                      │
│                                                                 │
│  Faster R-CNN (2015):                                           │
│    Image → Backbone (ResNet) → RPN → Head                       │
│    Innovation: RPN makes region proposal learnable as well      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

2.2 Faster R-CNN Structure

import torch
import torch.nn as nn
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn_v2
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

class CustomFasterRCNN:
    """Custom Faster R-CNN model"""

    def __init__(self, num_classes: int, pretrained: bool = True):
        """
        Args:
            num_classes: Number of classes including background (e.g., 10 objects → 11)
            pretrained: Use COCO pretrained weights
        """
        # Load pretrained model
        self.model = fasterrcnn_resnet50_fpn_v2(
            weights="DEFAULT" if pretrained else None
        )

        # Replace box predictor (adjust for number of classes)
        in_features = self.model.roi_heads.box_predictor.cls_score.in_features
        self.model.roi_heads.box_predictor = FastRCNNPredictor(
            in_features, num_classes
        )

    def get_model(self):
        return self.model


def train_faster_rcnn():
    """Faster R-CNN training example"""

    # Create model (background + 10 classes)
    model = CustomFasterRCNN(num_classes=11).get_model()
    model.train()

    # Dummy data
    images = [torch.rand(3, 600, 800) for _ in range(2)]

    targets = [
        {
            "boxes": torch.tensor([[100, 100, 200, 200], [300, 300, 400, 400]]),
            "labels": torch.tensor([1, 2]),  # class ID
        },
        {
            "boxes": torch.tensor([[50, 50, 150, 150]]),
            "labels": torch.tensor([3]),
        }
    ]

    # Forward pass (returns loss in training mode)
    loss_dict = model(images, targets)

    # Loss types:
    # - loss_classifier: class classification loss
    # - loss_box_reg: box regression loss
    # - loss_objectness: RPN object/non-object classification
    # - loss_rpn_box_reg: RPN box regression

    total_loss = sum(loss for loss in loss_dict.values())
    print(f"Total loss: {total_loss.item():.4f}")

    return loss_dict


def inference_faster_rcnn(model, image, threshold=0.5):
    """Faster R-CNN inference"""

    model.eval()

    with torch.no_grad():
        predictions = model([image])

    pred = predictions[0]

    # Filter predictions above threshold
    keep = pred["scores"] > threshold

    result = {
        "boxes": pred["boxes"][keep],
        "labels": pred["labels"][keep],
        "scores": pred["scores"][keep],
    }

    return result
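
Note that torchvision's detection models already run NMS inside roi_heads, so no manual de-duplication is needed above. For reference, a minimal NMS sketch follows (the helper name simple_nms is ours; in practice, prefer torchvision.ops.nms):

def simple_nms(boxes, scores, iou_threshold=0.5):
    """
    Minimal NMS: greedily keep the highest-scoring box,
    then drop remaining boxes that overlap it too much

    Args:
        boxes: (N, 4) [x1, y1, x2, y2]
        scores: (N,)

    Returns:
        indices of kept boxes
    """
    order = scores.argsort(descending=True)
    keep = []

    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break

        # IoU between the top box and the rest
        top, rest = boxes[i], boxes[order[1:]]
        x1 = torch.max(top[0], rest[:, 0])
        y1 = torch.max(top[1], rest[:, 1])
        x2 = torch.min(top[2], rest[:, 2])
        y2 = torch.min(top[3], rest[:, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        area_top = (top[2] - top[0]) * (top[3] - top[1])
        area_rest = (rest[:, 2] - rest[:, 0]) * (rest[:, 3] - rest[:, 1])
        iou = inter / (area_top + area_rest - inter)

        # Keep only low-overlap boxes for the next round
        order = order[1:][iou <= iou_threshold]

    return torch.tensor(keep)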

2.3 RPN (Region Proposal Network)

"""
RPN key concepts:

1. Anchor Boxes:
   - Pre-defined boxes of multiple sizes/ratios at each location
   - Example: 3 scales × 3 ratios = 9 anchors

2. Outputs:
   - objectness score: object presence probability (2-class)
   - box regression: anchor → actual box transformation

3. Training:
   - Positive: anchors with IoU > 0.7
   - Negative: anchors with IoU < 0.3
   - Ignored: between 0.3~0.7
"""

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleRPN(nn.Module):
    """Simplified RPN implementation"""

    def __init__(
        self,
        in_channels: int = 256,
        num_anchors: int = 9,  # 3 scales × 3 ratios
    ):
        super().__init__()

        # 3×3 conv for feature processing
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)

        # objectness prediction (object/background)
        self.objectness = nn.Conv2d(in_channels, num_anchors * 2, 1)

        # bbox regression (dx, dy, dw, dh)
        self.bbox_reg = nn.Conv2d(in_channels, num_anchors * 4, 1)

    def forward(self, feature_map):
        """
        Args:
            feature_map: (B, C, H, W)

        Returns:
            objectness: (B, num_anchors*2, H, W)
            bbox_deltas: (B, num_anchors*4, H, W)
        """
        x = F.relu(self.conv(feature_map))

        objectness = self.objectness(x)
        bbox_deltas = self.bbox_reg(x)

        return objectness, bbox_deltas


def generate_anchors(
    feature_size: tuple,
    anchor_scales: list = [128, 256, 512],
    anchor_ratios: list = [0.5, 1.0, 2.0],
    stride: int = 16
):
    """
    Generate anchor boxes

    Args:
        feature_size: (H, W) feature map size
        anchor_scales: square root of anchor area
        anchor_ratios: width/height ratios
        stride: downsampling ratio from original image

    Returns:
        anchors: (H*W*num_anchors, 4) anchor coordinates
    """
    H, W = feature_size
    anchors = []

    for h in range(H):
        for w in range(W):
            # feature map position → original image coordinates
            cx = (w + 0.5) * stride
            cy = (h + 0.5) * stride

            for scale in anchor_scales:
                for ratio in anchor_ratios:
                    # width/height based on ratio
                    anchor_w = scale * (ratio ** 0.5)
                    anchor_h = scale / (ratio ** 0.5)

                    # (x1, y1, x2, y2) format
                    anchors.append([
                        cx - anchor_w / 2,
                        cy - anchor_h / 2,
                        cx + anchor_w / 2,
                        cy + anchor_h / 2
                    ])

    return torch.tensor(anchors)

# Example
anchors = generate_anchors((38, 50))  # 600×800 image, stride=16
print(f"Generated {len(anchors)} anchors")  # 38*50*9 = 17,100

3. YOLO (You Only Look Once)

3.1 YOLO Evolution

┌─────────────────────────────────────────────────────────────────┐
│                    YOLO Version Comparison                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  YOLOv1 (2016): Grid-based detection with a single CNN          │
│  YOLOv2 (2017): Batch Norm, Anchor Boxes introduced             │
│  YOLOv3 (2018): Darknet-53, FPN, 3-scale predictions            │
│  YOLOv4 (2020): CSPDarknet, SPP, PANet                          │
│  YOLOv5 (2020): PyTorch implementation, Ultralytics             │
│  YOLOv6 (2022): Speed optimization, EfficientRep                │
│  YOLOv7 (2022): E-ELAN, Auxiliary Head                          │
│  YOLOv8 (2023): Unified framework, anchor-free                  │
│  YOLOv9 (2024): GELAN, PGI                                      │
│  YOLOv10 (2024): NMS-free, Dual Assignments                     │
│  YOLO11 (2024): Further speed and accuracy improvements         │
│                                                                 │
│  Performance (COCO val2017)     mAP50-95    Speed (ms)          │
│  ─────────────────────────────────────────────────────          │
│  YOLOv8n                          37.3         1.2              │
│  YOLOv8s                          44.9         1.9              │
│  YOLOv8m                          50.2         4.3              │
│  YOLOv8l                          52.9         6.7              │
│  YOLOv8x                          53.9         9.8              │
│  YOLO11x                          54.7        11.3              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

3.2 Ultralytics YOLOv8 Practice

from ultralytics import YOLO
import torch

# ===============================
# 1. Model Loading
# ===============================

# Load pretrained model
model = YOLO("yolov8n.pt")  # nano version (fastest)
# model = YOLO("yolov8s.pt")  # small
# model = YOLO("yolov8m.pt")  # medium
# model = YOLO("yolov8l.pt")  # large
# model = YOLO("yolov8x.pt")  # extra-large

# Create empty model for training
# model = YOLO("yolov8n.yaml")


# ===============================
# 2. Inference
# ===============================

def detect_objects(image_path: str, conf_threshold: float = 0.25):
    """Object detection on image"""

    results = model(image_path, conf=conf_threshold)

    for result in results:
        boxes = result.boxes

        print(f"Number of detected objects: {len(boxes)}")

        for box in boxes:
            # Coordinates
            x1, y1, x2, y2 = box.xyxy[0].tolist()

            # Class and confidence
            cls = int(box.cls[0])
            conf = float(box.conf[0])
            class_name = model.names[cls]

            print(f"  {class_name}: {conf:.2f} at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")

    return results

# Usage example
# results = detect_objects("image.jpg")

# Visualize results
# results[0].show()  # Display image
# results[0].save("result.jpg")  # Save


# ===============================
# 3. Video/Webcam Detection
# ===============================

def detect_video(source=0):
    """
    Real-time detection on video or webcam

    Args:
        source: 0 = webcam index (int), or a video file path (str)
    """
    results = model(source, stream=True)  # return as generator

    for result in results:
        # Per-frame processing
        annotated_frame = result.plot()  # frame with boxes drawn

        # Can display with cv2.imshow() etc.
        yield annotated_frame


# ===============================
# 4. Custom Dataset Training
# ===============================

def train_custom_model():
    """Train YOLO on custom dataset"""

    # Dataset yaml file example (data.yaml):
    """
    path: /path/to/dataset
    train: images/train
    val: images/val

    names:
      0: cat
      1: dog
      2: bird
    """

    # Model training
    model = YOLO("yolov8n.pt")

    results = model.train(
        data="data.yaml",
        epochs=100,
        imgsz=640,
        batch=16,
        device=0,  # GPU 0, or "cpu"
        patience=50,  # Early stopping
        save=True,
        project="runs/detect",
        name="custom_model",
    )

    return results


# ===============================
# 5. Model Export
# ===============================

def export_model():
    """Export model to various formats"""

    model = YOLO("yolov8n.pt")

    # Export to ONNX
    model.export(format="onnx")

    # Export to TensorRT (GPU inference optimization)
    # model.export(format="engine")

    # Export to CoreML (Apple)
    # model.export(format="coreml")

    # Export to TFLite (mobile)
    # model.export(format="tflite")

3.3 YOLOv8 Loss Function

"""
YOLOv8 loss function components:

1. Box Loss (CIoU Loss):
   - Accuracy of box location and size
   - CIoU = IoU - (distance penalty + aspect ratio penalty)

2. Classification Loss (BCE):
   - Binary cross entropy for each class
   - Focal Loss variant can be used

3. DFL Loss (Distribution Focal Loss):
   - Predicts box boundaries as discrete probability distributions
   - Introduced in Generalized Focal Loss, adopted by YOLOv8

Total Loss = λ_box * L_box + λ_cls * L_cls + λ_dfl * L_dfl
"""

import torch
import torch.nn as nn
import torch.nn.functional as F

def ciou_loss(pred_boxes, target_boxes, eps=1e-7):
    """
    Complete IoU Loss

    Args:
        pred_boxes: (N, 4) predicted boxes [x1, y1, x2, y2]
        target_boxes: (N, 4) ground truth boxes

    Returns:
        CIoU loss
    """
    # IoU calculation
    inter_x1 = torch.max(pred_boxes[:, 0], target_boxes[:, 0])
    inter_y1 = torch.max(pred_boxes[:, 1], target_boxes[:, 1])
    inter_x2 = torch.min(pred_boxes[:, 2], target_boxes[:, 2])
    inter_y2 = torch.min(pred_boxes[:, 3], target_boxes[:, 3])

    inter_area = torch.clamp(inter_x2 - inter_x1, min=0) * \
                 torch.clamp(inter_y2 - inter_y1, min=0)

    pred_area = (pred_boxes[:, 2] - pred_boxes[:, 0]) * \
                (pred_boxes[:, 3] - pred_boxes[:, 1])
    target_area = (target_boxes[:, 2] - target_boxes[:, 0]) * \
                  (target_boxes[:, 3] - target_boxes[:, 1])

    union_area = pred_area + target_area - inter_area
    iou = inter_area / (union_area + eps)

    # Center point distance
    pred_cx = (pred_boxes[:, 0] + pred_boxes[:, 2]) / 2
    pred_cy = (pred_boxes[:, 1] + pred_boxes[:, 3]) / 2
    target_cx = (target_boxes[:, 0] + target_boxes[:, 2]) / 2
    target_cy = (target_boxes[:, 1] + target_boxes[:, 3]) / 2

    center_dist_sq = (pred_cx - target_cx) ** 2 + (pred_cy - target_cy) ** 2

    # Diagonal distance (enclosing box)
    enclose_x1 = torch.min(pred_boxes[:, 0], target_boxes[:, 0])
    enclose_y1 = torch.min(pred_boxes[:, 1], target_boxes[:, 1])
    enclose_x2 = torch.max(pred_boxes[:, 2], target_boxes[:, 2])
    enclose_y2 = torch.max(pred_boxes[:, 3], target_boxes[:, 3])

    enclose_diag_sq = (enclose_x2 - enclose_x1) ** 2 + \
                      (enclose_y2 - enclose_y1) ** 2

    # Aspect ratio consistency
    pred_w = pred_boxes[:, 2] - pred_boxes[:, 0]
    pred_h = pred_boxes[:, 3] - pred_boxes[:, 1]
    target_w = target_boxes[:, 2] - target_boxes[:, 0]
    target_h = target_boxes[:, 3] - target_boxes[:, 1]

    v = (4 / (torch.pi ** 2)) * \
        (torch.atan(target_w / (target_h + eps)) -
         torch.atan(pred_w / (pred_h + eps))) ** 2

    # Trade-off parameter (treated as a constant in the original paper)
    alpha = (v / (1 - iou + v + eps)).detach()

    # CIoU
    ciou = iou - (center_dist_sq / (enclose_diag_sq + eps)) - alpha * v

    return 1 - ciou
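
The DFL term described above can be sketched as a weighted cross-entropy over the two integer bins nearest the continuous target, following the Generalized Focal Loss formulation (the helper dfl_loss and reg_max=16 are illustrative assumptions):

def dfl_loss(pred_dist, target, reg_max=16):
    """
    Distribution Focal Loss (simplified sketch)

    Args:
        pred_dist: (N, reg_max) logits over discrete distance bins
        target: (N,) continuous regression target in [0, reg_max - 1]
    """
    tl = target.floor().long()            # left (lower) bin index
    tr = (tl + 1).clamp(max=reg_max - 1)  # right (upper) bin index
    wl = tr.float() - target              # weight toward the left bin
    wr = 1.0 - wl                         # weight toward the right bin

    # Cross entropy against both neighboring bins, linearly weighted
    loss = (F.cross_entropy(pred_dist, tl, reduction="none") * wl
            + F.cross_entropy(pred_dist, tr, reduction="none") * wr)
    return loss.mean()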

4. DETR (Detection Transformer)

4.1 DETR Concept

┌─────────────────────────────────────────────────────────────────┐
│                       DETR Architecture                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Traditional detector issues:                                   │
│  - Anchor design required                                       │
│  - NMS (Non-Maximum Suppression) post-processing required       │
│  - Complex pipeline                                             │
│                                                                 │
│  DETR innovations:                                              │
│  - End-to-end training                                          │
│  - Direct object prediction with Object Queries                 │
│  - Training with Hungarian Matching                             │
│  - No NMS required                                              │
│                                                                 │
│  Backbone (ResNet) → Transformer Encoder/Decoder → FFN Heads    │
│                                                    (class+box)  │
│                                                                 │
│  Feature Map + Positional Encoding → Encoder (self-attention)   │
│  Object Queries (100) → Decoder (cross-attention)               │
│                       → 100 prediction outputs                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

4.2 DETR Implementation

import torch
import torch.nn as nn
from torchvision.models import resnet50
import torch.nn.functional as F

class DETR(nn.Module):
    """
    Simplified DETR implementation
    """

    def __init__(
        self,
        num_classes: int,
        num_queries: int = 100,
        hidden_dim: int = 256,
        nheads: int = 8,
        num_encoder_layers: int = 6,
        num_decoder_layers: int = 6,
    ):
        super().__init__()

        # Backbone
        backbone = resnet50(weights="DEFAULT")
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])

        # Feature map → hidden_dim
        self.conv = nn.Conv2d(2048, hidden_dim, 1)

        # Transformer
        self.transformer = nn.Transformer(
            d_model=hidden_dim,
            nhead=nheads,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            batch_first=True,
        )

        # Object Queries (learned embeddings)
        self.query_embed = nn.Embedding(num_queries, hidden_dim)

        # Output heads
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for no-object
        self.bbox_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 4),  # cx, cy, w, h
            nn.Sigmoid(),
        )

        # Positional Encoding
        self.row_embed = nn.Embedding(50, hidden_dim // 2)
        self.col_embed = nn.Embedding(50, hidden_dim // 2)

    def forward(self, x):
        """
        Args:
            x: (B, 3, H, W) input image

        Returns:
            class_logits: (B, num_queries, num_classes+1)
            bbox_pred: (B, num_queries, 4)
        """
        B = x.shape[0]

        # Backbone feature extraction
        features = self.backbone(x)  # (B, 2048, H/32, W/32)
        features = self.conv(features)  # (B, 256, H/32, W/32)

        _, _, H, W = features.shape

        # Generate Positional Encoding
        pos_embed = self._get_positional_encoding(H, W, features.device)

        # Flatten for Transformer
        src = features.flatten(2).permute(0, 2, 1)  # (B, H*W, 256)
        src = src + pos_embed.flatten(0, 1).unsqueeze(0).expand(B, -1, -1)

        # Object Queries
        query_embed = self.query_embed.weight.unsqueeze(0).expand(B, -1, -1)

        # Transformer
        tgt = torch.zeros_like(query_embed)
        hs = self.transformer(src, tgt + query_embed)  # (B, num_queries, 256)

        # Predictions
        class_logits = self.class_head(hs)
        bbox_pred = self.bbox_head(hs)

        return class_logits, bbox_pred

    def _get_positional_encoding(self, H, W, device):
        """Generate 2D Positional Encoding"""
        i = torch.arange(W, device=device)
        j = torch.arange(H, device=device)

        x_embed = self.col_embed(i)  # (W, 128)
        y_embed = self.row_embed(j)  # (H, 128)

        pos = torch.cat([
            x_embed.unsqueeze(0).expand(H, -1, -1),
            y_embed.unsqueeze(1).expand(-1, W, -1),
        ], dim=-1)  # (H, W, 256)

        return pos
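

A quick shape check of the simplified model above (the forward pass downloads ResNet-50 weights on first use):

model = DETR(num_classes=91)
x = torch.rand(1, 3, 320, 320)
class_logits, bbox_pred = model(x)
print(class_logits.shape)  # torch.Size([1, 100, 92])
print(bbox_pred.shape)     # torch.Size([1, 100, 4])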


# Hungarian Matching Loss (simplified)
from scipy.optimize import linear_sum_assignment

class HungarianMatcher:
    """
    Optimal matching between predictions and GT

    Cost = λ_cls * L_cls + λ_box * L_box + λ_giou * L_giou
    (the GIoU term is omitted below for brevity)
    """

    def __init__(self, cost_class=1, cost_bbox=5, cost_giou=2):
        self.cost_class = cost_class
        self.cost_bbox = cost_bbox
        self.cost_giou = cost_giou

    @torch.no_grad()
    def __call__(self, pred_logits, pred_boxes, gt_labels, gt_boxes):
        """
        Bipartite matching for one image via
        scipy.optimize.linear_sum_assignment

        Args:
            pred_logits: (num_queries, num_classes+1)
            pred_boxes: (num_queries, 4) as cx, cy, w, h
            gt_labels: (num_gt,) class indices
            gt_boxes: (num_gt, 4) as cx, cy, w, h
        """
        prob = pred_logits.softmax(-1)
        # Class cost: negative probability assigned to each GT class
        cost_cls = -prob[:, gt_labels]                     # (num_queries, num_gt)
        # Box cost: L1 distance between box parameters
        cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)  # (num_queries, num_gt)

        C = self.cost_class * cost_cls + self.cost_bbox * cost_box
        pred_idx, gt_idx = linear_sum_assignment(C.cpu().numpy())
        return pred_idx, gt_idx

4.3 RT-DETR (Real-Time DETR)

from ultralytics import RTDETR

# RT-DETR usage (Ultralytics)
model = RTDETR("rtdetr-l.pt")

# Inference
results = model("image.jpg")

# Training
model.train(data="coco.yaml", epochs=100)

"""
RT-DETR features:
- Maintains DETR's end-to-end advantages
- Real-time inference possible (YOLO-level speed)
- Efficient Hybrid Encoder
- IoU-aware Query Selection
"""

5. Instance Segmentation

5.1 Mask R-CNN

import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn_v2
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def create_mask_rcnn(num_classes: int):
    """
    Create custom Mask R-CNN

    Mask R-CNN = Faster R-CNN + Mask Head
    """
    model = maskrcnn_resnet50_fpn_v2(weights="DEFAULT")

    # Replace box predictor
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    # Replace mask predictor
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    model.roi_heads.mask_predictor = MaskRCNNPredictor(
        in_features_mask, hidden_layer, num_classes
    )

    return model


def inference_mask_rcnn(model, image, threshold=0.5):
    """Mask R-CNN inference"""

    model.eval()

    with torch.no_grad():
        predictions = model([image])

    pred = predictions[0]
    keep = pred["scores"] > threshold

    result = {
        "boxes": pred["boxes"][keep],
        "labels": pred["labels"][keep],
        "scores": pred["scores"][keep],
        "masks": pred["masks"][keep],  # (N, 1, H, W) soft masks
    }

    # Convert to hard masks
    result["masks"] = (result["masks"] > 0.5).squeeze(1)  # (N, H, W)

    return result


# YOLOv8-seg usage
from ultralytics import YOLO

seg_model = YOLO("yolov8n-seg.pt")
results = seg_model("image.jpg")

# Extract masks from results
for result in results:
    if result.masks is not None:
        masks = result.masks.data  # (N, H, W)

5.2 SAM (Segment Anything Model)

import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load SAM model (download the ViT-H checkpoint separately)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# Set image as an RGB numpy array
image = cv2.cvtColor(cv2.imread("image.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # (H, W, 3)

# Segmentation with point prompt
input_point = np.array([[500, 375]])  # click coordinates
input_label = np.array([1])  # 1 = foreground

masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True,  # return 3 mask candidates
)

# Segmentation with box prompt
input_box = np.array([100, 100, 400, 400])  # x1, y1, x2, y2

masks, scores, logits = predictor.predict(
    box=input_box,
    multimask_output=False,
)

"""
SAM features:
- Promptable segmentation (point, box, text)
- Zero-shot generalization
- Very high segmentation quality
- Large and relatively slow → lightweight variants such as MobileSAM and FastSAM exist
"""

6. Practical Tips

6.1 Dataset Formats

"""
Major dataset formats:

1. COCO Format:
   - All annotations stored in annotations.json
   - Images and annotations separated

2. YOLO Format:
   - .txt file for each image
   - class x_center y_center width height (normalized)

3. Pascal VOC Format:
   - XML file for each image annotation
"""

# YOLO format example (labels/train/image001.txt)
"""
0 0.5 0.5 0.2 0.3
1 0.3 0.7 0.1 0.15
"""

# COCO to YOLO conversion
def coco_to_yolo(coco_box, img_width, img_height):
    """
    COCO: [x_min, y_min, width, height]
    YOLO: [x_center, y_center, width, height] (normalized)
    """
    x, y, w, h = coco_box

    x_center = (x + w / 2) / img_width
    y_center = (y + h / 2) / img_height
    w_norm = w / img_width
    h_norm = h / img_height

    return [x_center, y_center, w_norm, h_norm]
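
For example, a 50×80 box at (100, 150) in a 640×480 image:

print(coco_to_yolo([100, 150, 50, 80], 640, 480))
# ≈ [0.195, 0.396, 0.078, 0.167]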

6.2 Training Tips

"""
Object detection training checklist:

1. Data Quality
   - Verify label accuracy
   - Handle class imbalance (Focal Loss, oversampling)
   - Use appropriate augmentation

2. Hyperparameters
   - Learning rate: 1e-4 ~ 1e-3 (when fine-tuning from pretrained weights)
   - Batch size: as large as possible within GPU memory
   - Image size: use model default (YOLO: 640)

3. Augmentation Strategy
   - Mosaic: compose 4 images (YOLO)
   - MixUp: image blending
   - Basic: Flip, Scale, Color Jitter

4. Model Selection
   - Real-time: YOLO (YOLOv8n, YOLOv8s)
   - Accuracy: Faster R-CNN, DETR
   - Segmentation: YOLOv8-seg, Mask R-CNN
"""

# Ultralytics training example
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

model.train(
    data="data.yaml",
    epochs=100,
    imgsz=640,
    batch=16,

    # Augmentation
    mosaic=1.0,      # Mosaic probability
    mixup=0.0,       # MixUp probability
    hsv_h=0.015,     # Hue augmentation
    hsv_s=0.7,       # Saturation augmentation
    hsv_v=0.4,       # Value augmentation
    degrees=0.0,     # Rotation
    translate=0.1,   # Translation
    scale=0.5,       # Scale
    fliplr=0.5,      # Horizontal flip

    # Regularization
    weight_decay=0.0005,

    # Training schedule
    warmup_epochs=3,
    warmup_momentum=0.8,
    warmup_bias_lr=0.1,
    lr0=0.01,        # Initial learning rate
    lrf=0.01,        # Final learning rate ratio
)
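
After training, the saved weights can be evaluated on the validation split; the path below assumes Ultralytics' default runs/detect/train output directory:

best = YOLO("runs/detect/train/weights/best.pt")
metrics = best.val(data="data.yaml")
print(metrics.box.map)    # mAP@[.5:.95]
print(metrics.box.map50)  # mAP@0.5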

Summary

Detector Selection Guide

Requirements                Recommended Model
──────────────────────────────────────────────
Real-time (30+ FPS)         YOLOv8n/s
High accuracy               YOLOv8x, Faster R-CNN
Small objects               YOLO + SAHI, RetinaNet
Instance segmentation       YOLOv8-seg, Mask R-CNN
End-to-end                  DETR, RT-DETR
Zero-shot                   Grounding DINO, SAM

Key Concepts Summary

Concept       Description
──────────────────────────────────────────────────────
IoU           Box overlap degree (0~1)
mAP           mean Average Precision (accuracy metric)
NMS           Non-Maximum Suppression (duplicate box removal)
Anchor        Pre-defined reference boxes
FPN           Multi-scale feature extraction
GIoU/CIoU     Improved IoU loss functions

References

Papers

  • "Faster R-CNN" (Ren et al., 2015)
  • "YOLO: You Only Look Once" (Redmon et al., 2016)
  • "DETR: End-to-End Object Detection with Transformers" (Carion et al., 2020)
  • "Segment Anything" (Kirillov et al., 2023)

Code & Resources

to navigate between lessons