Deep Neural Network Module (DNN Module)

Overview

OpenCV's DNN module provides functionality to load and run inference with pre-trained deep learning models. It supports models from various frameworks including TensorFlow, Caffe, Darknet, and ONNX, and can run efficiently on both CPU and GPU.

Difficulty: ⭐⭐⭐⭐

Prerequisites: Deep learning fundamentals, object detection, image classification


Table of Contents

  1. cv2.dnn Module Overview
  2. readNet(): Model Loading
  3. blobFromImage(): Preprocessing
  4. YOLO Object Detection
  5. SSD (Single Shot Detector)
  6. DNN Face Detection
  7. Modern Object Detection with ONNX
  8. Exercises

1. cv2.dnn Module Overview

DNN Module Features

OpenCV DNN Module:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        cv2.dnn                                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  Supported Frameworks:                                          β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
β”‚  β”‚   Caffe     β”‚ TensorFlow  β”‚   Darknet   β”‚    ONNX     β”‚     β”‚
β”‚  β”‚ .caffemodel β”‚    .pb      β”‚  .weights   β”‚   .onnx     β”‚     β”‚
β”‚  β”‚ .prototxt   β”‚             β”‚  .cfg       β”‚             β”‚     β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
β”‚                                                                 β”‚
β”‚  Execution Backends:                                            β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
β”‚  β”‚    CPU      β”‚   OpenCL    β”‚    CUDA     β”‚   Vulkan    β”‚     β”‚
β”‚  β”‚  (default)  β”‚   (GPU)     β”‚   (NVIDIA)  β”‚    (GPU)    β”‚     β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
β”‚                                                                 β”‚
β”‚  Features:                                                      β”‚
β”‚  - Inference only (no training)                                 β”‚
β”‚  - Optimized operations                                         β”‚
β”‚  - Multiple hardware support                                    β”‚
β”‚  - Simple API                                                   β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Basic Workflow

DNN Inference Workflow:

1. Model Loading
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚  readNet()       β”‚ β†’ Load model file
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚
            β–Ό
2. Backend/Target Setup
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚ setPreferableBackend() β”‚ β†’ CPU/CUDA/OpenCL
   β”‚ setPreferableTarget()  β”‚
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚
            β–Ό
3. Input Preprocessing
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚ blobFromImage()  β”‚ β†’ Image β†’ Blob
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚
            β–Ό
4. Run Inference
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚ net.setInput()   β”‚
   β”‚ net.forward()    β”‚ β†’ Inference result
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚
            β–Ό
5. Post-processing
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚ NMS, visualization, etc. β”‚
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2. readNet(): Model Loading

Loading Various Model Formats

import cv2

# Loading Caffe model
net_caffe = cv2.dnn.readNetFromCaffe(
    'deploy.prototxt',      # Network structure
    'model.caffemodel'      # Weights
)

# Loading TensorFlow model
net_tf = cv2.dnn.readNetFromTensorflow(
    'frozen_inference_graph.pb',  # Frozen graph
    'graph.pbtxt'                 # Text graph (optional)
)

# Loading Darknet (YOLO) model
net_darknet = cv2.dnn.readNetFromDarknet(
    'yolov3.cfg',           # Config file
    'yolov3.weights'        # Weights
)

# Loading ONNX model
net_onnx = cv2.dnn.readNetFromONNX('model.onnx')

# Generic function (auto-detect)
net = cv2.dnn.readNet('model.weights', 'model.cfg')

Backend and Target Configuration

import cv2

net = cv2.dnn.readNet('model.weights', 'model.cfg')

# Backend options
# - cv2.dnn.DNN_BACKEND_OPENCV: OpenCV built-in (default)
# - cv2.dnn.DNN_BACKEND_CUDA: NVIDIA CUDA
# - cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE: Intel OpenVINO

# Target options
# - cv2.dnn.DNN_TARGET_CPU: CPU (default)
# - cv2.dnn.DNN_TARGET_OPENCL: OpenCL GPU
# - cv2.dnn.DNN_TARGET_CUDA: NVIDIA GPU
# - cv2.dnn.DNN_TARGET_CUDA_FP16: NVIDIA GPU (half precision)

# CPU execution (default)
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)

# CUDA GPU execution (if OpenCV is built with CUDA support)
# net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
# net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

# Check which targets the current build supports for a given backend
# (availability depends on how OpenCV was compiled)
print("Available targets:",
      cv2.dnn.getAvailableTargets(cv2.dnn.DNN_BACKEND_OPENCV))

Checking Layer Information

import cv2

net = cv2.dnn.readNet('yolov3.weights', 'yolov3.cfg')  # (model, config)

# Layer names list
layer_names = net.getLayerNames()
print(f"Total layers: {len(layer_names)}")
print("Layer list (partial):", layer_names[:10])

# Output layers (layers whose outputs are not consumed by any other layer)
# Note: getUnconnectedOutLayers() returns 1-based layer IDs
output_layers = net.getUnconnectedOutLayers()
output_layer_names = [layer_names[i - 1] for i in output_layers.flatten()]
print("Output layers:", output_layer_names)

# Specific layer info
layer = net.getLayer(0)
print(f"Layer type: {layer.type}")

3. blobFromImage(): Preprocessing

Blob Concept

Blob (Binary Large Object):
4-dimensional tensor for DNN model input

Dimension structure:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Blob Shape: (N, C, H, W)                                   β”‚
β”‚                                                             β”‚
β”‚  N: Batch Size                                              β”‚
β”‚     - Number of images to process at once                   β”‚
β”‚                                                             β”‚
β”‚  C: Number of Channels                                      β”‚
β”‚     - RGB: 3, Grayscale: 1                                  β”‚
β”‚                                                             β”‚
β”‚  H: Height                                                  β”‚
β”‚     - Input height required by model                        β”‚
β”‚                                                             β”‚
β”‚  W: Width                                                   β”‚
β”‚     - Input width required by model                         β”‚
β”‚                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Example: YOLO (416x416)
blob.shape = (1, 3, 416, 416)
             β”‚  β”‚   β”‚    β”‚
             β”‚  β”‚   β”‚    └── Width 416
             β”‚  β”‚   └── Height 416
             β”‚  └── RGB 3 channels
             └── 1 image

Using blobFromImage

import cv2
import numpy as np

img = cv2.imread('image.jpg')

# Basic Blob creation
blob = cv2.dnn.blobFromImage(
    img,                    # Input image
    scalefactor=1/255.0,    # Scale factor (normalization)
    size=(416, 416),        # Target size
    mean=(0, 0, 0),         # Mean subtraction
    swapRB=True,            # BGR β†’ RGB
    crop=False              # Crop flag
)

print(f"Blob shape: {blob.shape}")  # (1, 3, 416, 416)

# Various preprocessing options

# 1. ImageNet style (mean subtraction)
blob_imagenet = cv2.dnn.blobFromImage(
    img,
    scalefactor=1.0,
    size=(224, 224),
    mean=(104.0, 117.0, 123.0),  # ImageNet mean
    swapRB=True
)

# 2. YOLO style (0-1 normalization)
blob_yolo = cv2.dnn.blobFromImage(
    img,
    scalefactor=1/255.0,
    size=(416, 416),
    mean=(0, 0, 0),
    swapRB=True
)

# 3. SSD face detector style (res10 Caffe model; expects BGR input,
#    so no channel swap)
blob_ssd = cv2.dnn.blobFromImage(
    img,
    scalefactor=1.0,
    size=(300, 300),
    mean=(104.0, 177.0, 123.0),
    swapRB=False
)

Processing Multiple Images

import cv2
import numpy as np

def prepare_batch(images, size=(416, 416)):
    """Process multiple images as a batch"""

    # Method 1: Using blobFromImages
    blob = cv2.dnn.blobFromImages(
        images,
        scalefactor=1/255.0,
        size=size,
        mean=(0, 0, 0),
        swapRB=True,
        crop=False
    )

    return blob

# Usage example
images = [cv2.imread(f'image{i}.jpg') for i in range(4)]
batch_blob = prepare_batch(images)
print(f"Batch blob shape: {batch_blob.shape}")  # (4, 3, 416, 416)

# Input to network
net.setInput(batch_blob)
outputs = net.forward()

Visualizing Preprocessing

import cv2
import numpy as np
import matplotlib.pyplot as plt

def visualize_blob(blob):
    """Visualize blob contents"""

    # blob shape: (N, C, H, W)
    n, c, h, w = blob.shape

    for i in range(n):
        fig, axes = plt.subplots(1, c+1, figsize=(15, 4))

        # Display each channel
        for j in range(c):
            channel = blob[i, j, :, :]
            axes[j].imshow(channel, cmap='gray')
            axes[j].set_title(f'Channel {j}')
            axes[j].axis('off')

        # Combined image (recombine as RGB)
        if c == 3:
            combined = np.transpose(blob[i], (1, 2, 0))  # CHW β†’ HWC
            combined = (combined * 255).astype(np.uint8)
            axes[c].imshow(combined)
            axes[c].set_title('Combined (RGB)')
            axes[c].axis('off')

        plt.suptitle(f'Image {i}')
        plt.tight_layout()
        plt.show()

# Usage example
img = cv2.imread('image.jpg')
blob = cv2.dnn.blobFromImage(img, 1/255.0, (416, 416), swapRB=True)
visualize_blob(blob)

4. YOLO Object Detection

YOLO Overview

YOLO (You Only Look Once):
Real-time object detection algorithm

Features:
- Single pass detection (End-to-End)
- Fast speed (real-time capable)
- Uses full image context

Output structure:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                                             β”‚
β”‚  For each detection:                                        β”‚
β”‚  [center_x, center_y, width, height,                        β”‚
β”‚   confidence, class_scores...]                              β”‚
β”‚                                                             β”‚
β”‚  - center_x, center_y: Bounding box center (0-1 normalized) β”‚
β”‚  - width, height: Box size (0-1 normalized)                 β”‚
β”‚  - confidence: Object presence probability                  β”‚
β”‚  - class_scores: Probability for each class (80 classes)    β”‚
β”‚                                                             β”‚
β”‚  Example: COCO dataset (80 classes)                         β”‚
β”‚  Output vector length = 4 + 1 + 80 = 85                     β”‚
β”‚                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

YOLO version comparison:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Version     β”‚ Input    β”‚ mAP  β”‚ Speed (FPS) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ YOLOv3      β”‚ 416x416  β”‚ 33.0 β”‚ ~35 (GPU)   β”‚
β”‚ YOLOv3-tiny β”‚ 416x416  β”‚ 15.0 β”‚ ~220 (GPU)  β”‚
β”‚ YOLOv4      β”‚ 416x416  β”‚ 43.5 β”‚ ~38 (GPU)   β”‚
β”‚ YOLOv4-tiny β”‚ 416x416  β”‚ 21.7 β”‚ ~371 (GPU)  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
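The 85-element row format above can be decoded with plain NumPy before bringing in a real model. A minimal sketch on a synthetic detection row (the helper name decode_yolo_row is illustrative):

```python
import numpy as np

def decode_yolo_row(row, img_w, img_h):
    """Convert one 85-element YOLO output row to a pixel-space box.

    Row layout: [cx, cy, w, h, objectness, 80 class scores],
    with coordinates normalized to 0-1 relative to the input image.
    """
    cx, cy, w, h = row[:4]
    objectness = float(row[4])
    class_id = int(np.argmax(row[5:]))
    class_score = float(row[5 + class_id])

    # Convert center-format box to top-left pixel coordinates
    bw, bh = float(w * img_w), float(h * img_h)
    x = float(cx * img_w) - bw / 2
    y = float(cy * img_h) - bh / 2
    return (x, y, bw, bh), objectness, class_id, class_score

# Synthetic row: box centered at (0.5, 0.5), 25% of the image size,
# objectness 0.5, class 2 score 0.75
row = np.zeros(85, dtype=np.float32)
row[:5] = [0.5, 0.5, 0.25, 0.25, 0.5]
row[5 + 2] = 0.75

box, obj, cls, score = decode_yolo_row(row, 640, 480)
print(box, obj, cls, score)  # (240.0, 180.0, 160.0, 120.0) 0.5 2 0.75
```

The YOLOv3 implementation below performs exactly this conversion inside its detect() loop.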

YOLOv3 Implementation

import cv2
import numpy as np

class YOLODetector:
    """YOLOv3 Object Detector"""

    def __init__(self, config_path, weights_path, names_path,
                 conf_threshold=0.5, nms_threshold=0.4):
        # Load model
        self.net = cv2.dnn.readNetFromDarknet(config_path, weights_path)
        self.net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
        self.net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)

        # Load class names
        with open(names_path, 'r') as f:
            self.classes = [line.strip() for line in f.readlines()]

        # Output layer names
        layer_names = self.net.getLayerNames()
        self.output_layers = [layer_names[i - 1]
                              for i in self.net.getUnconnectedOutLayers()]

        # Thresholds
        self.conf_threshold = conf_threshold
        self.nms_threshold = nms_threshold

        # Colors (per class)
        np.random.seed(42)
        self.colors = np.random.randint(0, 255, size=(len(self.classes), 3),
                                        dtype=np.uint8)

    def detect(self, img):
        """Object detection"""
        height, width = img.shape[:2]

        # Create blob
        blob = cv2.dnn.blobFromImage(img, 1/255.0, (416, 416),
                                      swapRB=True, crop=False)

        # Inference
        self.net.setInput(blob)
        outputs = self.net.forward(self.output_layers)

        # Parse results
        boxes = []
        confidences = []
        class_ids = []

        for output in outputs:
            for detection in output:
                scores = detection[5:]
                class_id = np.argmax(scores)
                confidence = scores[class_id]

                if confidence > self.conf_threshold:
                    # Bounding box coordinates (normalized)
                    center_x = int(detection[0] * width)
                    center_y = int(detection[1] * height)
                    w = int(detection[2] * width)
                    h = int(detection[3] * height)

                    # Top-left coordinates
                    x = int(center_x - w / 2)
                    y = int(center_y - h / 2)

                    boxes.append([x, y, w, h])
                    confidences.append(float(confidence))
                    class_ids.append(class_id)

        # Non-Maximum Suppression
        indices = cv2.dnn.NMSBoxes(boxes, confidences,
                                    self.conf_threshold, self.nms_threshold)

        results = []
        # NMSBoxes returns a flat array in OpenCV >= 4.5.4 and a nested
        # one in earlier versions; flatten() handles both
        for i in np.array(indices).flatten():
            box = boxes[i]
            results.append({
                'box': box,
                'confidence': confidences[i],
                'class_id': class_ids[i],
                'class_name': self.classes[class_ids[i]]
            })

        return results

    def draw(self, img, results):
        """Visualize results"""
        for det in results:
            x, y, w, h = det['box']
            color = [int(c) for c in self.colors[det['class_id']]]

            cv2.rectangle(img, (x, y), (x + w, y + h), color, 2)

            label = f"{det['class_name']}: {det['confidence']:.2f}"
            cv2.putText(img, label, (x, y - 10),
                       cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)

        return img

# Usage example
# Model files download required:
# - yolov3.cfg: https://github.com/pjreddie/darknet/blob/master/cfg/yolov3.cfg
# - yolov3.weights: https://pjreddie.com/media/files/yolov3.weights
# - coco.names: https://github.com/pjreddie/darknet/blob/master/data/coco.names

detector = YOLODetector(
    'yolov3.cfg',
    'yolov3.weights',
    'coco.names'
)

img = cv2.imread('street.jpg')
results = detector.detect(img)
output = detector.draw(img, results)

print(f"Detected objects: {len(results)}")
for r in results:
    print(f"  - {r['class_name']}: {r['confidence']:.2%}")

cv2.imshow('YOLO Detection', output)
cv2.waitKey(0)

YOLOv3-tiny (Lightweight Version)

import cv2
import numpy as np

def yolo_tiny_detect(img, conf_threshold=0.5):
    """Fast detection using YOLOv3-tiny"""

    # Load YOLOv3-tiny
    net = cv2.dnn.readNetFromDarknet('yolov3-tiny.cfg', 'yolov3-tiny.weights')

    # Class names
    with open('coco.names', 'r') as f:
        classes = [line.strip() for line in f.readlines()]

    # Output layers
    layer_names = net.getLayerNames()
    output_layers = [layer_names[i - 1] for i in net.getUnconnectedOutLayers()]

    height, width = img.shape[:2]

    # Blob and inference
    blob = cv2.dnn.blobFromImage(img, 1/255.0, (416, 416), swapRB=True)
    net.setInput(blob)
    outputs = net.forward(output_layers)

    # Parse results
    boxes, confidences, class_ids = [], [], []

    for output in outputs:
        for detection in output:
            scores = detection[5:]
            class_id = np.argmax(scores)
            confidence = scores[class_id]

            if confidence > conf_threshold:
                center_x = int(detection[0] * width)
                center_y = int(detection[1] * height)
                w = int(detection[2] * width)
                h = int(detection[3] * height)

                x = int(center_x - w / 2)
                y = int(center_y - h / 2)

                boxes.append([x, y, w, h])
                confidences.append(float(confidence))
                class_ids.append(class_id)

    # NMS
    indices = cv2.dnn.NMSBoxes(boxes, confidences, conf_threshold, 0.4)

    # Draw results (flatten() handles both the flat array returned by
    # OpenCV >= 4.5.4 and the nested array from earlier versions)
    for i in np.array(indices).flatten():
        x, y, w, h = boxes[i]
        label = f"{classes[class_ids[i]]}: {confidences[i]:.2f}"
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(img, label, (x, y - 10),
                   cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

    return img

5. SSD (Single Shot Detector)

SSD Overview

SSD (Single Shot MultiBox Detector):
Object detection using multi-scale feature maps

Features:
- Strong at detecting objects of various sizes
- Better at detecting small objects than YOLO
- Various backbone networks available (VGG, MobileNet, etc.)

Architecture:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                                            β”‚
β”‚  Input Image (300x300)                                     β”‚
β”‚       β”‚                                                    β”‚
β”‚       β–Ό                                                    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                              β”‚
β”‚  β”‚ Backbone β”‚ VGG16, MobileNet, etc.                      β”‚
β”‚  β”‚ Network  β”‚                                              β”‚
β”‚  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜                                              β”‚
β”‚       β”‚                                                    β”‚
β”‚       β–Ό                                                    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ 38x38    β”‚  β”‚ 19x19    β”‚  β”‚ 10x10    β”‚  β”‚ 5x5 ...  β”‚   β”‚
β”‚  β”‚ Feature  β”‚  β”‚ Feature  β”‚  β”‚ Feature  β”‚  β”‚ Feature  β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜   β”‚
β”‚       β”‚             β”‚             β”‚             β”‚          β”‚
β”‚       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚
β”‚                         β”‚                                  β”‚
β”‚                         β–Ό                                  β”‚
β”‚                    NMS β†’ Final Detection                   β”‚
β”‚                                                            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
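SSD-style networks in OpenCV return a (1, 1, N, 7) tensor with one row per candidate detection: [batch_id, class_id, confidence, x1, y1, x2, y2], with corners normalized to 0-1. The decoding step can be sketched with plain NumPy on synthetic data (no model files needed):

```python
import numpy as np

# Synthetic SSD-style output: shape (1, 1, N, 7)
detections = np.zeros((1, 1, 2, 7), dtype=np.float32)
detections[0, 0, 0] = [0, 15, 0.9, 0.25, 0.25, 0.75, 0.5]  # confident hit
detections[0, 0, 1] = [0, 3, 0.2, 0.1, 0.1, 0.2, 0.2]      # below threshold

img_w, img_h = 400, 200
results = []
for i in range(detections.shape[2]):
    confidence = detections[0, 0, i, 2]
    if confidence > 0.5:
        class_id = int(detections[0, 0, i, 1])
        # Scale normalized corners to pixel coordinates
        x1 = int(detections[0, 0, i, 3] * img_w)
        y1 = int(detections[0, 0, i, 4] * img_h)
        x2 = int(detections[0, 0, i, 5] * img_w)
        y2 = int(detections[0, 0, i, 6] * img_h)
        results.append((class_id, (x1, y1, x2, y2)))

print(results)  # [(15, (100, 50, 300, 100))]
```

The MobileNet SSD implementation below applies this same loop to a real network's forward() output.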

MobileNet SSD Implementation

import cv2
import numpy as np

class SSDDetector:
    """MobileNet SSD Object Detector"""

    # COCO classes (MobileNet SSD v2)
    CLASSES = ["background", "person", "bicycle", "car", "motorcycle",
               "airplane", "bus", "train", "truck", "boat", "traffic light",
               "fire hydrant", "stop sign", "parking meter", "bench", "bird",
               "cat", "dog", "horse", "sheep", "cow", "elephant", "bear",
               "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie",
               "suitcase", "frisbee", "skis", "snowboard", "sports ball",
               "kite", "baseball bat", "baseball glove", "skateboard",
               "surfboard", "tennis racket", "bottle", "wine glass", "cup",
               "fork", "knife", "spoon", "bowl", "banana", "apple",
               "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza",
               "donut", "cake", "chair", "couch", "potted plant", "bed",
               "dining table", "toilet", "tv", "laptop", "mouse", "remote",
               "keyboard", "cell phone", "microwave", "oven", "toaster",
               "sink", "refrigerator", "book", "clock", "vase", "scissors",
               "teddy bear", "hair drier", "toothbrush"]

    def __init__(self, config_path, weights_path, conf_threshold=0.5):
        self.net = cv2.dnn.readNetFromTensorflow(weights_path, config_path)
        self.conf_threshold = conf_threshold

        np.random.seed(42)
        self.colors = np.random.randint(0, 255,
                                        size=(len(self.CLASSES), 3),
                                        dtype=np.uint8)

    def detect(self, img):
        """Object detection"""
        height, width = img.shape[:2]

        # Create blob (SSD uses 300x300 or 512x512 input)
        blob = cv2.dnn.blobFromImage(img, size=(300, 300),
                                      mean=(127.5, 127.5, 127.5),
                                      scalefactor=1/127.5,
                                      swapRB=True)

        self.net.setInput(blob)
        detections = self.net.forward()

        results = []

        # Output shape: (1, 1, N, 7)
        # Each detection: [batch_id, class_id, confidence, x1, y1, x2, y2]
        for i in range(detections.shape[2]):
            confidence = detections[0, 0, i, 2]

            if confidence > self.conf_threshold:
                class_id = int(detections[0, 0, i, 1])

                # Coordinates (normalized β†’ pixels)
                x1 = int(detections[0, 0, i, 3] * width)
                y1 = int(detections[0, 0, i, 4] * height)
                x2 = int(detections[0, 0, i, 5] * width)
                y2 = int(detections[0, 0, i, 6] * height)

                results.append({
                    'box': [x1, y1, x2 - x1, y2 - y1],
                    'confidence': confidence,
                    'class_id': class_id,
                    'class_name': self.CLASSES[class_id]
                })

        return results

    def draw(self, img, results):
        """Visualize results"""
        for det in results:
            x, y, w, h = det['box']
            color = [int(c) for c in self.colors[det['class_id']]]

            cv2.rectangle(img, (x, y), (x + w, y + h), color, 2)

            label = f"{det['class_name']}: {det['confidence']:.2f}"
            cv2.putText(img, label, (x, y - 10),
                       cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)

        return img

# Usage example
# Model download:
# ssd_mobilenet_v2_coco_2018_03_29.pbtxt
# frozen_inference_graph.pb

# detector = SSDDetector(
#     'ssd_mobilenet_v2_coco.pbtxt',
#     'frozen_inference_graph.pb'
# )

Caffe SSD (Lightweight)

import cv2
import numpy as np

def ssd_caffe_detect(img, prototxt, caffemodel, conf_threshold=0.5):
    """Caffe SSD detection (MobileNet backbone)"""

    CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat",
               "bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
               "dog", "horse", "motorbike", "person", "pottedplant",
               "sheep", "sofa", "train", "tvmonitor"]

    net = cv2.dnn.readNetFromCaffe(prototxt, caffemodel)

    height, width = img.shape[:2]

    # Create blob
    blob = cv2.dnn.blobFromImage(
        cv2.resize(img, (300, 300)),
        0.007843,  # 1/127.5
        (300, 300),
        127.5
    )

    net.setInput(blob)
    detections = net.forward()

    results = []

    for i in range(detections.shape[2]):
        confidence = detections[0, 0, i, 2]

        if confidence > conf_threshold:
            class_id = int(detections[0, 0, i, 1])

            box = detections[0, 0, i, 3:7] * np.array([width, height,
                                                       width, height])
            x1, y1, x2, y2 = box.astype(int)

            results.append({
                'box': [x1, y1, x2 - x1, y2 - y1],
                'confidence': float(confidence),
                'class_id': class_id,
                'class_name': CLASSES[class_id]
            })

    return results

6. DNN Face Detection

OpenCV DNN Face Detector

import cv2
import numpy as np

class DNNFaceDetector:
    """DNN-based Face Detector (res10_300x300)"""

    def __init__(self, model_path, config_path=None, conf_threshold=0.5):
        """
        Model download:
        - deploy.prototxt: https://github.com/opencv/opencv/blob/master/samples/dnn/face_detector/deploy.prototxt
        - res10_300x300_ssd_iter_140000.caffemodel
        """

        if config_path:
            # Caffe model
            self.net = cv2.dnn.readNetFromCaffe(config_path, model_path)
        else:
            # TensorFlow model
            self.net = cv2.dnn.readNetFromTensorflow(model_path)

        self.conf_threshold = conf_threshold

    def detect(self, img):
        """Face detection"""
        height, width = img.shape[:2]

        # Create blob
        blob = cv2.dnn.blobFromImage(
            img, 1.0, (300, 300),
            (104.0, 177.0, 123.0),  # Mean values
            swapRB=False,
            crop=False
        )

        self.net.setInput(blob)
        detections = self.net.forward()

        faces = []

        for i in range(detections.shape[2]):
            confidence = detections[0, 0, i, 2]

            if confidence > self.conf_threshold:
                box = detections[0, 0, i, 3:7] * np.array(
                    [width, height, width, height]
                )
                x1, y1, x2, y2 = box.astype(int)

                # Boundary check
                x1 = max(0, x1)
                y1 = max(0, y1)
                x2 = min(width, x2)
                y2 = min(height, y2)

                faces.append({
                    'box': (x1, y1, x2 - x1, y2 - y1),
                    'confidence': float(confidence)
                })

        return faces

    def draw(self, img, faces):
        """Visualize results"""
        for face in faces:
            x, y, w, h = face['box']
            conf = face['confidence']

            cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
            label = f"Face: {conf:.2f}"
            cv2.putText(img, label, (x, y - 10),
                       cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

        return img

# Usage example
detector = DNNFaceDetector(
    'res10_300x300_ssd_iter_140000.caffemodel',
    'deploy.prototxt'
)

img = cv2.imread('group_photo.jpg')
faces = detector.detect(img)
output = detector.draw(img, faces)

print(f"Detected faces: {len(faces)}")
cv2.imshow('DNN Face Detection', output)
cv2.waitKey(0)

Real-time DNN Face Detection

import cv2
import numpy as np
import time

def realtime_dnn_face_detection():
    """Real-time DNN face detection"""

    # Load model
    net = cv2.dnn.readNetFromCaffe(
        'deploy.prototxt',
        'res10_300x300_ssd_iter_140000.caffemodel'
    )

    cap = cv2.VideoCapture(0)

    fps_time = time.time()
    fps = 0
    frame_count = 0

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        h, w = frame.shape[:2]

        # Create blob
        blob = cv2.dnn.blobFromImage(
            frame, 1.0, (300, 300),
            (104.0, 177.0, 123.0), False, False
        )

        net.setInput(blob)
        detections = net.forward()

        # Process detection results
        for i in range(detections.shape[2]):
            confidence = detections[0, 0, i, 2]

            if confidence > 0.5:
                box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
                x1, y1, x2, y2 = box.astype(int)

                cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
                label = f"{confidence:.2%}"
                cv2.putText(frame, label, (x1, y1 - 10),
                           cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

        # Calculate FPS
        frame_count += 1
        if time.time() - fps_time >= 1.0:
            fps = frame_count
            frame_count = 0
            fps_time = time.time()

        cv2.putText(frame, f"FPS: {fps}", (10, 30),
                   cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)

        cv2.imshow('DNN Face Detection', frame)

        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()

# Run
realtime_dnn_face_detection()

Haar vs DNN Face Detection Comparison

import cv2
import time
import numpy as np

def compare_face_detectors(img):
    """Compare Haar and DNN face detection"""

    # Haar Cascade
    haar = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'
    )

    # DNN
    dnn_net = cv2.dnn.readNetFromCaffe(
        'deploy.prototxt',
        'res10_300x300_ssd_iter_140000.caffemodel'
    )

    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    h, w = img.shape[:2]

    # Haar detection
    start = time.time()
    haar_faces = haar.detectMultiScale(gray, 1.1, 5, minSize=(30, 30))
    haar_time = time.time() - start

    # DNN detection
    start = time.time()
    blob = cv2.dnn.blobFromImage(img, 1.0, (300, 300),
                                  (104.0, 177.0, 123.0), False, False)
    dnn_net.setInput(blob)
    dnn_detections = dnn_net.forward()
    dnn_time = time.time() - start

    # Visualize results
    img_haar = img.copy()
    img_dnn = img.copy()

    # Use distinct names (fw, fh) so the loop does not clobber the
    # image dimensions w, h still needed for the DNN boxes below
    for (x, y, fw, fh) in haar_faces:
        cv2.rectangle(img_haar, (x, y), (x + fw, y + fh), (0, 255, 0), 2)

    for i in range(dnn_detections.shape[2]):
        conf = dnn_detections[0, 0, i, 2]
        if conf > 0.5:
            box = dnn_detections[0, 0, i, 3:7] * np.array([w, h, w, h])
            x1, y1, x2, y2 = box.astype(int)
            cv2.rectangle(img_dnn, (x1, y1), (x2, y2), (0, 255, 0), 2)

    cv2.putText(img_haar, f"Haar: {len(haar_faces)} faces, {haar_time*1000:.1f}ms",
               (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0), 2)

    dnn_count = sum(1 for i in range(dnn_detections.shape[2])
                    if dnn_detections[0, 0, i, 2] > 0.5)
    cv2.putText(img_dnn, f"DNN: {dnn_count} faces, {dnn_time*1000:.1f}ms",
               (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0), 2)

    # Display results
    combined = np.hstack([img_haar, img_dnn])
    cv2.imshow('Haar vs DNN', combined)
    cv2.waitKey(0)

    return {
        'haar': {'count': len(haar_faces), 'time': haar_time},
        'dnn': {'count': dnn_count, 'time': dnn_time}
    }

# Run comparison
# result = compare_face_detectors(cv2.imread('group.jpg'))

7. Modern Object Detection with ONNX

7.1 Running YOLOv8 with OpenCV DNN

YOLOv8, released by Ultralytics in 2023, represents a significant advancement in the YOLO family. It can be exported to ONNX format and run efficiently with OpenCV's DNN module.

import cv2
import numpy as np

class YOLOv8Detector:
    """YOLOv8 ONNX Object Detector"""

    def __init__(self, onnx_model_path, conf_threshold=0.5, iou_threshold=0.4):
        """
        Initialize YOLOv8 detector

        To export YOLOv8 to ONNX:
        pip install ultralytics
        yolo export model=yolov8n.pt format=onnx
        """
        self.net = cv2.dnn.readNetFromONNX(onnx_model_path)
        self.conf_threshold = conf_threshold
        self.iou_threshold = iou_threshold

        # COCO class names (80 classes)
        self.classes = [
            "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train",
            "truck", "boat", "traffic light", "fire hydrant", "stop sign",
            "parking meter", "bench", "bird", "cat", "dog", "horse", "sheep", "cow",
            "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag",
            "tie", "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite",
            "baseball bat", "baseball glove", "skateboard", "surfboard",
            "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon",
            "bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot",
            "hot dog", "pizza", "donut", "cake", "chair", "couch", "potted plant",
            "bed", "dining table", "toilet", "tv", "laptop", "mouse", "remote",
            "keyboard", "cell phone", "microwave", "oven", "toaster", "sink",
            "refrigerator", "book", "clock", "vase", "scissors", "teddy bear",
            "hair drier", "toothbrush"
        ]

        # Generate random colors for each class
        np.random.seed(42)
        self.colors = np.random.randint(0, 255, size=(len(self.classes), 3),
                                        dtype=np.uint8)

    def detect(self, img):
        """
        YOLOv8 object detection

        YOLOv8 output format (different from YOLOv5):
        - Shape: (1, 84, 8400) for 640x640 input
        - First 4 rows: [x_center, y_center, width, height]
        - Rows 4-83: class probabilities (80 classes)
        - No objectness score (unlike YOLOv5)
        """
        height, width = img.shape[:2]

        # Preprocessing: plain resize to the model input size
        # (blobFromImage does not letterbox; aspect ratio is not preserved)
        input_size = 640
        blob = cv2.dnn.blobFromImage(
            img,
            scalefactor=1/255.0,
            size=(input_size, input_size),
            mean=(0, 0, 0),
            swapRB=True,
            crop=False
        )

        # Inference
        self.net.setInput(blob)
        outputs = self.net.forward()

        # YOLOv8 output shape: (1, 84, 8400)
        # Transpose to (8400, 84) for easier processing
        outputs = outputs[0].transpose()  # (8400, 84)

        boxes = []
        confidences = []
        class_ids = []

        # Scale factors for coordinate conversion
        x_scale = width / input_size
        y_scale = height / input_size

        for detection in outputs:
            # Extract class scores (rows 4-83)
            class_scores = detection[4:]
            class_id = np.argmax(class_scores)
            confidence = class_scores[class_id]

            if confidence > self.conf_threshold:
                # Extract bounding box (rows 0-3: cx, cy, w, h)
                cx = detection[0] * x_scale
                cy = detection[1] * y_scale
                w = detection[2] * x_scale
                h = detection[3] * y_scale

                # Convert to top-left corner format
                x = int(cx - w / 2)
                y = int(cy - h / 2)

                boxes.append([x, y, int(w), int(h)])
                confidences.append(float(confidence))
                class_ids.append(class_id)

        # Non-Maximum Suppression
        indices = cv2.dnn.NMSBoxes(
            boxes, confidences,
            self.conf_threshold, self.iou_threshold
        )

        results = []
        # NMSBoxes returns an Nx1 array in older OpenCV versions and a
        # flat array in >= 4.5.4; flatten for compatibility
        for i in np.array(indices).flatten():
            results.append({
                'box': boxes[i],
                'confidence': confidences[i],
                'class_id': class_ids[i],
                'class_name': self.classes[class_ids[i]]
            })

        return results

    def draw(self, img, results):
        """Visualize detection results"""
        for det in results:
            x, y, w, h = det['box']
            class_id = det['class_id']
            confidence = det['confidence']

            color = [int(c) for c in self.colors[class_id]]

            # Draw bounding box
            cv2.rectangle(img, (x, y), (x + w, y + h), color, 2)

            # Draw label
            label = f"{det['class_name']}: {confidence:.2f}"
            (label_w, label_h), _ = cv2.getTextSize(
                label, cv2.FONT_HERSHEY_SIMPLEX, 0.5, 1
            )
            cv2.rectangle(img, (x, y - label_h - 10), (x + label_w, y), color, -1)
            cv2.putText(
                img, label, (x, y - 5),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 1
            )

        return img

# Usage example
# First, export YOLOv8 model:
# pip install ultralytics
# yolo export model=yolov8n.pt format=onnx
# This creates yolov8n.onnx

detector = YOLOv8Detector('yolov8n.onnx', conf_threshold=0.5)
img = cv2.imread('street.jpg')
results = detector.detect(img)
output = detector.draw(img, results)

print(f"Detected {len(results)} objects:")
for r in results:
    print(f"  - {r['class_name']}: {r['confidence']:.2%}")

cv2.imshow('YOLOv8 Detection', output)
cv2.waitKey(0)
cv2.destroyAllWindows()

7.2 SAM (Segment Anything Model) with ONNX

Meta's Segment Anything Model (SAM) enables powerful image segmentation. While SAM is typically used with PyTorch, the encoder and decoder can be exported to ONNX and run with OpenCV DNN for inference.

import cv2
import numpy as np

class SAMONNXDetector:
    """
    Simplified SAM ONNX inference with OpenCV

    SAM consists of two components:
    1. Image Encoder: Encodes input image to embeddings
    2. Mask Decoder: Generates masks from embeddings and prompts

    To export SAM to ONNX, see:
    https://github.com/facebookresearch/segment-anything
    """

    def __init__(self, encoder_path, decoder_path):
        """
        Initialize SAM with ONNX models

        Args:
            encoder_path: Path to encoder ONNX model
            decoder_path: Path to decoder ONNX model
        """
        self.encoder = cv2.dnn.readNetFromONNX(encoder_path)
        self.decoder = cv2.dnn.readNetFromONNX(decoder_path)
        self.image_size = 1024  # SAM default size

    def preprocess(self, img):
        """Preprocess image for SAM encoder"""
        # Resize to 1024x1024
        h, w = img.shape[:2]
        scale = self.image_size / max(h, w)
        new_h, new_w = int(h * scale), int(w * scale)

        resized = cv2.resize(img, (new_w, new_h))

        # Pad to square
        padded = np.zeros((self.image_size, self.image_size, 3), dtype=np.uint8)
        padded[:new_h, :new_w] = resized

        # Convert BGR -> RGB before normalizing, so the ImageNet
        # statistics (given in RGB order) hit the right channels
        rgb = cv2.cvtColor(padded, cv2.COLOR_BGR2RGB)

        # Normalize (ImageNet style)
        normalized = rgb.astype(np.float32) / 255.0
        mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
        std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
        normalized = ((normalized - mean) / std).astype(np.float32)

        # Convert to blob (1, 3, 1024, 1024); channels are already RGB
        blob = cv2.dnn.blobFromImage(normalized, 1.0, swapRB=False)

        return blob, scale

    def encode_image(self, img):
        """Generate image embeddings using encoder"""
        blob, scale = self.preprocess(img)
        self.encoder.setInput(blob)
        embeddings = self.encoder.forward()
        return embeddings, scale

    def segment_with_point(self, img, point_coords, point_labels):
        """
        Segment image with point prompts

        Args:
            img: Input image
            point_coords: List of (x, y) coordinates
            point_labels: List of labels (1=foreground, 0=background)

        Returns:
            Segmentation mask
        """
        # Get image embeddings
        embeddings, scale = self.encode_image(img)

        # Scale point coordinates
        scaled_coords = np.array(point_coords) * scale

        # Prepare decoder inputs
        point_coords_input = scaled_coords.reshape(1, -1, 2).astype(np.float32)
        point_labels_input = np.array(point_labels).reshape(1, -1).astype(np.float32)

        # Run decoder
        # Note: Actual SAM ONNX decoder has specific input format
        # This is simplified - refer to official SAM ONNX export
        self.decoder.setInput(embeddings, 'image_embeddings')
        # Additional inputs for decoder would be set here

        mask = self.decoder.forward()

        return mask

# Conceptual usage (requires actual SAM ONNX models)
# sam = SAMONNXDetector('sam_vit_h_encoder.onnx', 'sam_vit_h_decoder.onnx')
# img = cv2.imread('image.jpg')
#
# # Segment with point prompt
# point_coords = [(100, 150)]  # Click location
# point_labels = [1]  # Foreground point
# mask = sam.segment_with_point(img, point_coords, point_labels)

Note: Running SAM with OpenCV DNN is complex due to its architecture. For production use, consider using the official SAM implementation or pre-built inference servers.

7.3 Model Zoo and Current Landscape (2025)

The object detection and segmentation landscape has evolved significantly. Here's an overview of popular models and their ONNX availability:

| Model Family | Latest Version | ONNX Support | Use Case | Performance |
|---|---|---|---|---|
| YOLOv8 | v8.1 (2024) | ✅ Native | Real-time detection | mAP 53.9 (YOLOv8x) |
| YOLOv9 | v9.0 (2024) | ✅ Export | Improved accuracy | mAP 55.6 |
| YOLOv10 | v10.0 (2024) | ✅ Export | NMS-free YOLO | Faster inference |
| YOLOv11 | v11.0 (2024) | ✅ Export | Latest Ultralytics | State-of-the-art |
| RT-DETR | v2 (2024) | ✅ Export | Transformer detector | mAP 53.1, no NMS |
| SAM | v1.0 (2023) | ✅ Complex | Universal segmentation | Zero-shot capable |
| SAM 2 | v2.0 (2024) | ✅ Complex | Video segmentation | Temporal tracking |
| Depth Anything | v2 (2024) | ✅ Export | Monocular depth | Fast, accurate |
| GroundingDINO | v1.5 (2024) | ⚠️ Limited | Text-prompted detection | Open-vocabulary |
| DINOv2 | v2.0 (2024) | ✅ Export | Self-supervised features | Strong backbone |

7.3.1 Quick Start: YOLOv11 with ONNX

# Install Ultralytics
# pip install ultralytics

# Export YOLOv11 to ONNX (Python)
from ultralytics import YOLO

model = YOLO('yolo11n.pt')  # official weight names drop the 'v': yolo11{n,s,m,l,x}
model.export(format='onnx', dynamic=False)  # Creates yolo11n.onnx

# Then use with OpenCV (same as YOLOv8 example above)
# detector = YOLOv8Detector('yolo11n.onnx')  # output format is compatible

7.3.2 RT-DETR: Transformer-based Detection

RT-DETR (Real-Time DEtection TRansformer) eliminates the need for NMS by using a transformer architecture:

import cv2
import numpy as np

class RTDETRDetector:
    """RT-DETR ONNX Detector (NMS-free)"""

    def __init__(self, onnx_path, conf_threshold=0.5):
        self.net = cv2.dnn.readNetFromONNX(onnx_path)
        self.conf_threshold = conf_threshold

        # COCO classes (same as YOLO)
        self.classes = ["person", "bicycle", "car", ...]  # 80 classes

    def detect(self, img):
        """
        RT-DETR detection (no NMS required)

        Output format: Direct bounding boxes and scores
        Shape: (1, 300, 6) - top 300 detections
        Each detection: [x1, y1, x2, y2, confidence, class_id]
        """
        height, width = img.shape[:2]

        # Preprocessing (RT-DETR uses 640x640)
        blob = cv2.dnn.blobFromImage(
            img, 1/255.0, (640, 640),
            mean=(0, 0, 0), swapRB=True, crop=False
        )

        self.net.setInput(blob)
        outputs = self.net.forward()

        # Parse outputs (RT-DETR is end-to-end: the model emits its
        # top detections directly, so no NMS step is needed)
        results = []
        for detection in outputs[0]:  # (300, 6)
            confidence = detection[4]
            if confidence > self.conf_threshold:
                class_id = int(detection[5])

                # Scale coordinates
                x1 = int(detection[0] * width)
                y1 = int(detection[1] * height)
                x2 = int(detection[2] * width)
                y2 = int(detection[3] * height)

                results.append({
                    'box': [x1, y1, x2-x1, y2-y1],
                    'confidence': float(confidence),
                    'class_id': class_id,
                    'class_name': self.classes[class_id]
                })

        return results

# Export RT-DETR to ONNX:
# pip install ultralytics
# yolo export model=rtdetr-l.pt format=onnx

7.3.3 Model Selection Guide

For Real-time Applications (Edge Devices, Mobile):
- YOLOv8n/s: Fastest, good for embedded systems
- YOLOv10n: No NMS overhead, even faster
- RT-DETR-s: Best accuracy/speed trade-off

For High Accuracy (Server, Cloud):
- YOLOv11x: State-of-the-art detection
- RT-DETR-x: Transformer advantages
- Ensemble multiple models

For Segmentation:
- YOLOv8-seg: Instance segmentation (ONNX native)
- SAM/SAM2: Universal segmentation (prompt-based)
- Depth Anything: Depth estimation

OpenCV DNN Compatibility:
- ✅ Full support: YOLOv8-11, RT-DETR, MobileNet-SSD
- ⚠️ Partial: SAM (complex multi-stage pipeline)
- ❌ Limited: Models requiring custom ops (GroundingDINO)

Performance Benchmarks (on RTX 4090, 2024):

| Model | Input Size | FPS (CUDA) | mAP | OpenCV DNN |
|---|---|---|---|---|
| YOLOv8n | 640 | 450 | 37.3 | ✅ Good |
| YOLOv11m | 640 | 200 | 51.5 | ✅ Good |
| RT-DETR-l | 640 | 110 | 53.1 | ✅ Excellent |
| YOLOv8x | 640 | 80 | 53.9 | ✅ Good |

8. Exercises

Exercise 1: Object Detection Performance Comparison

Write a program that compares the performance of YOLO and SSD.

Requirements:
- Use the same test image set
- Measure detection speed
- Compare detection accuracy (IoU-based)
- Visualize results

Hint
def compare_detectors(img, yolo_detector, ssd_detector):
    # YOLO detection
    yolo_start = time.time()
    yolo_results = yolo_detector.detect(img)
    yolo_time = time.time() - yolo_start

    # SSD detection
    ssd_start = time.time()
    ssd_results = ssd_detector.detect(img)
    ssd_time = time.time() - ssd_start

    return {
        'yolo': {'results': yolo_results, 'time': yolo_time},
        'ssd': {'results': ssd_results, 'time': ssd_time}
    }

Exercise 2: Custom Class Filtering

Implement filtering functionality to detect only specific classes.

Requirements:
- Specify list of classes to detect
- Ignore other classes
- Set class-specific colors
- Set class-specific confidence thresholds

Hint
class FilteredDetector:
    def __init__(self, base_detector, target_classes):
        self.detector = base_detector
        self.target_classes = target_classes
        self.class_thresholds = {cls: 0.5 for cls in target_classes}

    def detect(self, img):
        all_results = self.detector.detect(img)
        filtered = [r for r in all_results
                    if r['class_name'] in self.target_classes and
                    r['confidence'] > self.class_thresholds[r['class_name']]]
        return filtered

Exercise 3: Video Object Tracking + Detection

Implement stable video object recognition by combining detection and tracking.

Requirements:
- Perform detection every N frames
- Use tracking for intermediate frames
- Assign and maintain IDs
- Re-detect on tracking failure

Hint
class DetectionTracker:
    def __init__(self, detector, detect_every_n=5):
        self.detector = detector
        self.detect_every_n = detect_every_n
        self.trackers = {}
        self.frame_count = 0

    def process(self, frame):
        if self.frame_count % self.detect_every_n == 0:
            # New detection
            detections = self.detector.detect(frame)
            self.update_trackers(frame, detections)
        else:
            # Update existing trackers
            self.update_existing_trackers(frame)

        self.frame_count += 1
        return self.get_current_positions()

Exercise 4: Model Ensemble

Improve accuracy by combining results from multiple models.

Requirements:
- Use 2 or more models
- Merge detection results (Weighted NMS)
- Calibrate confidence scores
- Output final results

Hint
def ensemble_detection(img, detectors, weights=None):
    all_boxes = []
    all_scores = []
    all_classes = []

    for i, detector in enumerate(detectors):
        results = detector.detect(img)
        weight = weights[i] if weights else 1.0

        for r in results:
            all_boxes.append(r['box'])
            all_scores.append(r['confidence'] * weight)
            all_classes.append(r['class_id'])

    # Soft-NMS or Weighted Box Fusion
    final_results = weighted_nms(all_boxes, all_scores, all_classes)
    return final_results
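The hint leaves `weighted_nms` undefined. One possible pure-Python sketch: greedy per-class NMS that, instead of discarding suppressed boxes, merges each overlap cluster into a single score-weighted average box (a simplified form of Weighted Box Fusion; `[x, y, w, h]` format assumed):

```python
def weighted_nms(boxes, scores, class_ids, iou_thresh=0.5):
    """Greedy per-class NMS that merges each cluster of overlapping
    boxes into one box, averaged with scores as weights."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2 = min(a[0] + a[2], b[0] + b[2])
        iy2 = min(a[1] + a[3], b[1] + b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union > 0 else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    used = [False] * len(boxes)
    results = []
    for i in order:                      # highest score first
        if used[i]:
            continue
        used[i] = True
        cluster = [i]
        for j in order:                  # gather same-class overlaps
            if (not used[j] and class_ids[j] == class_ids[i]
                    and iou(boxes[i], boxes[j]) >= iou_thresh):
                used[j] = True
                cluster.append(j)
        total = sum(scores[k] for k in cluster)
        merged = [sum(boxes[k][c] * scores[k] for k in cluster) / total
                  for c in range(4)]     # score-weighted box average
        results.append({'box': merged, 'confidence': scores[i],
                        'class_id': class_ids[i]})
    return results
```

Plugging this into `ensemble_detection` above gives one merged box per object even when several detectors fire on it.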

Exercise 5: Real-time Object Counting System

Implement a system that counts specific objects (e.g., people, vehicles) in video.

Requirements:
- Real-time detection
- Set counting line/region
- Distinguish entry/exit
- Display statistics (hourly, cumulative)

Hint
class ObjectCounter:
    def __init__(self, detector, count_line_y):
        self.detector = detector
        self.count_line_y = count_line_y
        self.tracked_objects = {}  # {id: previous_y}
        self.count_in = 0
        self.count_out = 0

    def process(self, frame):
        results = self.detector.detect(frame)

        for obj in results:
            # Object center y coordinate
            _, y, _, h = obj['box']
            center_y = y + h // 2

            # Compare with previous position to check line crossing
            if obj['id'] in self.tracked_objects:
                prev_y = self.tracked_objects[obj['id']]
                if prev_y < self.count_line_y <= center_y:
                    self.count_out += 1
                elif prev_y > self.count_line_y >= center_y:
                    self.count_in += 1

            self.tracked_objects[obj['id']] = center_y
