Monocular Depth Estimation

Overview

Monocular depth estimation is the task of estimating per-pixel depth from a single 2D image. This lesson covers deep learning models such as MiDaS and DPT, as well as geometric approaches based on Structure from Motion (SfM).

Difficulty: ⭐⭐⭐⭐

Prerequisites: DNN module, feature detection/matching, camera calibration


Table of Contents

  1. Monocular Depth Estimation Overview
  2. MiDaS Model
  3. DPT (Dense Prediction Transformer)
  4. Structure from Motion (SfM)
  5. Depth Map Applications
  6. Exercises

1. Monocular Depth Estimation Overview

Why Monocular Depth Estimation?

Stereo vs Monocular Depth Estimation:

┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  Stereo Vision                                                  │
│  ┌───────────┐    ┌───────────┐                                 │
│  │    📷     │    │    📷     │                                 │
│  │   Left    │◄──►│   Right   │   Two cameras required          │
│  └───────────┘    └───────────┘                                 │
│                                                                 │
│  Pros: Geometrically accurate, absolute depth measurement       │
│  Cons: Two cameras required, calibration mandatory              │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Monocular Depth Estimation                                     │
│  ┌───────────┐                                                  │
│  │    📷     │   Single camera sufficient                       │
│  │  Single   │   Suitable for smartphones, drones, robots       │
│  └───────────┘                                                  │
│                                                                 │
│  Pros: Single camera, simple setup, suitable for mobile devices │
│  Cons: Relative depth, scale ambiguity, depends on training data│
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Challenges in Depth Estimation

Inherent Ambiguity in Monocular Depth Estimation:

Infinitely many 3D scenes can produce the same 2D image:

                   🎾  Small ball, close
                  /
    [📷] ────────<
                  \
                   🏀  Large ball, far

Both balls subtend the same visual angle, so they appear the same size!

Solutions:
1. Learned Prior Knowledge (Deep Learning)
   - Typical object sizes
   - Perspective rules
   - Texture gradients

2. Multiple Images (SfM)
   - Using viewpoint changes
   - Geometric constraints

3. Additional Sensors
   - LiDAR assistance
   - Structured light assistance
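
The ambiguity above follows directly from the pinhole projection model, where projected size = f × object_size / depth. A small numeric sketch (the focal length and ball diameters are illustrative values):

```python
# Pinhole projection: projected size = f * object_size / depth.
# A small near object and a large far object can project to the
# same image size, which is why a single view cannot fix scale.
f = 800.0        # focal length in pixels (illustrative)

tennis = 0.067   # tennis ball diameter in metres
basket = 0.24    # basketball diameter in metres

z_tennis = 1.0                           # tennis ball 1 m away
size_px = f * tennis / z_tennis          # projected size, ≈ 53.6 px

# Depth at which the basketball projects to the SAME size:
z_basket = f * basket / size_px          # ≈ 3.58 m
print(size_px, z_basket)
```

Any (size, depth) pair with the same ratio projects identically, so a single image constrains only the ratio, not the absolute values.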

Depth Estimation Methodologies

Depth Estimation Approaches:

┌─────────────────────────────────────────────────────────────────┐
│  1. Supervised Learning                                         │
│     - Train with RGB-D datasets                                 │
│     - Requires ground truth depth                               │
│     - Datasets: NYU Depth V2, KITTI, ScanNet                    │
│                                                                 │
│  2. Self-supervised Learning                                    │
│     - Train with stereo pairs or consecutive frames             │
│     - No ground truth required                                  │
│     - Monodepth2, PackNet-SfM                                   │
│                                                                 │
│  3. Zero-shot Learning (Cross-domain)                           │
│     - Pre-trained on diverse datasets                           │
│     - Generalize to new domains                                 │
│     - MiDaS, DPT, ZoeDepth                                      │
│                                                                 │
│  4. Geometric Methods                                           │
│     - Structure from Motion                                     │
│     - Multi-View Stereo                                         │
│     - Use explicit geometric constraints                        │
└─────────────────────────────────────────────────────────────────┘
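
Supervised depth networks commonly train with a scale-invariant loss so that a globally rescaled prediction is not penalized. A minimal NumPy sketch of the scale-invariant log loss in the style of Eigen et al. (the weighting `lam` and details vary between papers):

```python
import numpy as np

def scale_invariant_log_loss(pred, gt, lam=1.0):
    """Scale-invariant log loss: with lam=1, multiplying pred by any
    global factor shifts every log-error equally, and the mean-shift
    term cancels that shift, so only relative structure is penalized."""
    d = np.log(pred) - np.log(gt)
    return np.mean(d ** 2) - lam * np.mean(d) ** 2

rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 10.0, size=1000)
pred = 2.5 * gt  # correct structure, wrong global scale

loss = scale_invariant_log_loss(pred, gt)
print(loss)  # ≈ 0: the global scale error is ignored
```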

2. MiDaS Model

MiDaS Overview

MiDaS (Mixing Datasets for Monocular Depth Estimation):

┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  Key Idea: Improve generalization by mixing diverse datasets    │
│                                                                 │
│  Training Data:                                                 │
│  - ReDWeb (internet images)                                     │
│  - DIML (indoor)                                                │
│  - Movies (movie scenes)                                        │
│  - MegaDepth (outdoor)                                          │
│  - WSVD (video)                                                 │
│                                                                 │
│  Features:                                                      │
│  - Scale-invariant loss function                                │
│  - Relative depth prediction                                    │
│  - Various backbones (EfficientNet, ResNeXt, ViT)               │
│                                                                 │
│  Model Versions:                                                │
│  ┌──────────────────┬────────────┬───────────────────────────┐  │
│  │ Model            │ Input Size │ Features                  │  │
│  ├──────────────────┼────────────┼───────────────────────────┤  │
│  │ MiDaS v2.1 Large │ 384x384    │ High quality, slow        │  │
│  │ MiDaS v2.1 Small │ 256x256    │ Lightweight, fast         │  │
│  │ MiDaS v3 (DPT)   │ 384x384    │ Transformer-based         │  │
│  │ MiDaS v3.1 (DPT) │ Various    │ Latest, various backbones │  │
│  └──────────────────┴────────────┴───────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
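
Because MiDaS predicts relative depth only up to an unknown scale and shift, predictions are usually aligned to ground truth with a least-squares fit before computing error metrics. A minimal sketch of that alignment (`align_scale_shift` is an illustrative helper, not part of the MiDaS package):

```python
import numpy as np

def align_scale_shift(pred, target):
    """Least-squares scale s and shift t so that s*pred + t ≈ target.
    This is the standard way to compare a scale/shift-invariant
    prediction (e.g. MiDaS-style relative inverse depth) against
    ground truth."""
    p = pred.ravel()
    g = target.ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)   # columns [pred, 1]
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * pred + t, s, t

rng = np.random.default_rng(0)
gt = rng.uniform(0.1, 1.0, size=(4, 4))
pred = 3.0 * gt - 0.2           # same structure, different scale/shift

aligned, s, t = align_scale_shift(pred, gt)
print(np.abs(aligned - gt).max())  # ≈ 0 after alignment
```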

Using MiDaS

import cv2
import numpy as np
import torch

def load_midas_model(model_type='DPT_Large'):
    """Load MiDaS model (PyTorch Hub)"""

    # Model types:
    # - 'DPT_Large': Most accurate
    # - 'DPT_Hybrid': Balanced
    # - 'MiDaS_small': Fastest

    model = torch.hub.load('intel-isl/MiDaS', model_type)

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    model.eval()

    # Load preprocessing transforms
    midas_transforms = torch.hub.load('intel-isl/MiDaS', 'transforms')

    if model_type in ['DPT_Large', 'DPT_Hybrid']:
        transform = midas_transforms.dpt_transform
    else:
        transform = midas_transforms.small_transform

    return model, transform, device

def estimate_depth_midas(img, model, transform, device):
    """Estimate depth with MiDaS"""

    # BGR → RGB
    img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    # Preprocessing
    input_batch = transform(img_rgb).to(device)

    # Inference
    with torch.no_grad():
        prediction = model(input_batch)

        # Resize to original size
        prediction = torch.nn.functional.interpolate(
            prediction.unsqueeze(1),
            size=img.shape[:2],
            mode='bicubic',
            align_corners=False
        ).squeeze()

    depth_map = prediction.cpu().numpy()

    return depth_map

def normalize_depth(depth_map):
    """Normalize depth map to 0-255 (for visualization)"""

    depth_min = depth_map.min()
    depth_max = depth_map.max()

    # Guard against a constant depth map (division by zero)
    depth_range = max(depth_max - depth_min, 1e-8)

    depth_normalized = (depth_map - depth_min) / depth_range
    depth_normalized = (depth_normalized * 255).astype(np.uint8)

    return depth_normalized

def colorize_depth(depth_map, colormap=cv2.COLORMAP_INFERNO):
    """Apply colormap to depth map"""

    depth_norm = normalize_depth(depth_map)
    depth_colored = cv2.applyColorMap(depth_norm, colormap)

    return depth_colored

# Usage example
def main():
    # Load model
    print("Loading model...")
    model, transform, device = load_midas_model('DPT_Large')

    # Load image
    img = cv2.imread('sample.jpg')
    if img is None:
        raise FileNotFoundError('sample.jpg not found')

    # Estimate depth
    print("Estimating depth...")
    depth = estimate_depth_midas(img, model, transform, device)

    # Visualization
    depth_colored = colorize_depth(depth)

    cv2.imshow('Original', img)
    cv2.imshow('Depth', depth_colored)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

if __name__ == '__main__':
    main()

Running MiDaS with OpenCV DNN

import cv2
import numpy as np

class MiDaSDepthEstimator:
    """Run MiDaS with OpenCV DNN"""

    def __init__(self, model_path):
        """
        model_path: ONNX model path
        Download: https://github.com/isl-org/MiDaS/releases
        """
        self.net = cv2.dnn.readNetFromONNX(model_path)

        # CPU backend by default; switch to a CUDA backend/target here
        # if your OpenCV build has GPU support
        self.net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
        self.net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)

        # Input size (depends on model)
        self.input_size = (384, 384)  # DPT_Large
        # self.input_size = (256, 256)  # MiDaS_small

    def estimate(self, img):
        """Estimate depth"""

        h, w = img.shape[:2]

        # Preprocessing. Note: blobFromImage subtracts the mean BEFORE
        # applying scalefactor, so the ImageNet mean (given in 0-1 units)
        # must be applied manually after scaling, together with the std.
        blob = cv2.dnn.blobFromImage(
            img,
            scalefactor=1/255.0,
            size=self.input_size,
            swapRB=True,
            crop=False
        )

        # ImageNet mean/std normalization (manual)
        mean = np.array([0.485, 0.456, 0.406]).reshape(1, 3, 1, 1)
        std = np.array([0.229, 0.224, 0.225]).reshape(1, 3, 1, 1)
        blob = (blob - mean) / std

        # Inference
        self.net.setInput(blob)
        output = self.net.forward()

        # Post-processing
        depth = output[0, 0]

        # Resize to original size
        depth = cv2.resize(depth, (w, h), interpolation=cv2.INTER_CUBIC)

        return depth

    def visualize(self, depth, colormap=cv2.COLORMAP_MAGMA):
        """Visualize depth map"""

        # Normalization
        depth_norm = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX)
        depth_norm = depth_norm.astype(np.uint8)

        # Apply colormap
        depth_colored = cv2.applyColorMap(depth_norm, colormap)

        return depth_colored

# Usage example
estimator = MiDaSDepthEstimator('midas_v21_384.onnx')

img = cv2.imread('sample.jpg')
depth = estimator.estimate(img)
depth_vis = estimator.visualize(depth)

cv2.imshow('Depth', depth_vis)
cv2.waitKey(0)

3. DPT (Dense Prediction Transformer)

DPT Architecture

DPT (Dense Prediction Transformer):

┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  Vision Transformer (ViT)-based dense prediction model         │
│                                                                 │
│  Input: Image (H × W × 3)                                       │
│         │                                                       │
│         ▼                                                       │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  Patch Embedding                                        │    │
│  │  Split image into patches and embed                     │    │
│  │  Patch size: 16×16                                      │    │
│  └────────────────────────┬────────────────────────────────┘    │
│                           │                                     │
│                           ▼                                     │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  Transformer Encoder                                    │    │
│  │  ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐               │    │
│  │  │ Block │→│ Block │→│ Block │→│ Block │               │    │
│  │  └───────┘ └───────┘ └───────┘ └───────┘               │    │
│  │     │          │          │          │                  │    │
│  │     └──────────┼──────────┼──────────┘                  │    │
│  │                ▼          ▼          ▼                  │    │
│  │         Multi-scale feature extraction                  │    │
│  └─────────────────────────────────────────────────────────┘    │
│                           │                                     │
│                           ▼                                     │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  Reassemble + Fusion                                    │    │
│  │  Multi-scale feature fusion                             │    │
│  └────────────────────────┬────────────────────────────────┘    │
│                           │                                     │
│                           ▼                                     │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  Head (Conv Layers)                                     │    │
│  │  Final depth map output                                 │    │
│  └────────────────────────┬────────────────────────────────┘    │
│                           │                                     │
│                           ▼                                     │
│  Output: Depth Map (H × W)                                      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

DPT Implementation

import cv2
import numpy as np
import torch
from torchvision import transforms

class DPTDepthEstimator:
    """DPT Depth Estimator"""

    def __init__(self, model_type='DPT_Large'):
        """
        model_type: 'DPT_Large', 'DPT_Hybrid', 'DPT_SwinV2_L_384'
        """
        self.device = torch.device(
            'cuda' if torch.cuda.is_available() else 'cpu'
        )

        # Load model from PyTorch Hub
        self.model = torch.hub.load('intel-isl/MiDaS', model_type)
        self.model.to(self.device)
        self.model.eval()

        # Load preprocessing transforms
        midas_transforms = torch.hub.load('intel-isl/MiDaS', 'transforms')
        self.transform = midas_transforms.dpt_transform

    def estimate(self, img):
        """Estimate depth"""

        h, w = img.shape[:2]

        # BGR → RGB
        img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

        # Preprocessing and inference
        input_batch = self.transform(img_rgb).to(self.device)

        with torch.no_grad():
            prediction = self.model(input_batch)

            # Interpolate to original size
            prediction = torch.nn.functional.interpolate(
                prediction.unsqueeze(1),
                size=(h, w),
                mode='bicubic',
                align_corners=False
            ).squeeze()

        depth = prediction.cpu().numpy()

        return depth

    def get_metric_depth(self, depth, scale=10.0):
        """Relative depth → Metric depth conversion (approximation)"""

        # MiDaS/DPT outputs relative depth
        # Scale estimation needed for absolute depth conversion

        depth_metric = scale / (depth + 1e-6)

        return depth_metric

def estimate_depth_with_confidence(estimator, img, num_samples=5):
    """Estimate depth uncertainty with Monte Carlo dropout"""

    # Note: Actually requires a model with dropout
    # Here we substitute with data augmentation

    depths = []

    for _ in range(num_samples):
        # Slight image variation
        augmented = img.copy()

        # Brightness change
        factor = np.random.uniform(0.9, 1.1)
        augmented = np.clip(augmented * factor, 0, 255).astype(np.uint8)

        depth = estimator.estimate(augmented)
        depths.append(depth)

    depths = np.stack(depths, axis=0)

    # Mean and standard deviation
    mean_depth = np.mean(depths, axis=0)
    std_depth = np.std(depths, axis=0)

    return mean_depth, std_depth

Depth Anything Model

# Depth Anything: More recent SOTA model

class DepthAnythingEstimator:
    """Depth Anything Model (2024)"""

    def __init__(self, model_size='small'):
        """
        model_size: 'small', 'base', 'large'
        """
        from transformers import pipeline

        model_name = f"LiheYoung/depth-anything-{model_size}-hf"
        self.pipe = pipeline(
            task='depth-estimation',
            model=model_name
        )

    def estimate(self, img):
        """Estimate depth"""

        # BGR → RGB, PIL conversion
        from PIL import Image
        img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img_pil = Image.fromarray(img_rgb)

        # Inference
        result = self.pipe(img_pil)

        # Extract depth map
        depth = np.array(result['depth'])

        # Resize to original size
        if depth.shape[:2] != img.shape[:2]:
            depth = cv2.resize(depth, (img.shape[1], img.shape[0]))

        return depth

4. Structure from Motion (SfM)

SfM Overview

Structure from Motion (SfM):
Recover 3D structure using camera motion

┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  Input: Consecutive images (video or multi-view images)         │
│                                                                 │
│  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐                    │
│  │ t=1 │  │ t=2 │  │ t=3 │  │ t=4 │  │ t=5 │                    │
│  └──┬──┘  └──┬──┘  └──┬──┘  └──┬──┘  └──┬──┘                    │
│     └────────┴────────┼────────┴────────┘                       │
│                       ▼                                         │
│          ┌───────────────────────────┐                          │
│          │  1. Feature Detection     │                          │
│          │     and Matching          │                          │
│          │     SIFT, ORB, SuperPoint │                          │
│          └─────────────┬─────────────┘                          │
│                        ▼                                        │
│          ┌───────────────────────────┐                          │
│          │  2. Camera Pose Estimation│                          │
│          │     Essential Matrix      │                          │
│          │     PnP                   │                          │
│          └─────────────┬─────────────┘                          │
│                        ▼                                        │
│          ┌───────────────────────────┐                          │
│          │  3. Triangulation         │                          │
│          │     3D Point Recovery     │                          │
│          └─────────────┬─────────────┘                          │
│                        ▼                                        │
│          ┌───────────────────────────┐                          │
│          │  4. Bundle Adjustment     │                          │
│          │     Global Optimization   │                          │
│          └─────────────┬─────────────┘                          │
│                        ▼                                        │
│  Output: 3D Point Cloud + Camera Trajectory                     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

SfM Implementation (Simple Version)

import cv2
import numpy as np

class SimpleSfM:
    """Simple 2-view SfM implementation"""

    def __init__(self, K):
        """
        K: Camera intrinsic parameter matrix
        """
        self.K = K
        self.sift = cv2.SIFT_create()
        self.bf = cv2.BFMatcher()

    def detect_and_match(self, img1, img2):
        """Feature detection and matching"""

        # Feature detection
        kp1, desc1 = self.sift.detectAndCompute(img1, None)
        kp2, desc2 = self.sift.detectAndCompute(img2, None)

        # Matching
        matches = self.bf.knnMatch(desc1, desc2, k=2)

        # Ratio test
        good_matches = []
        for m, n in matches:
            if m.distance < 0.75 * n.distance:
                good_matches.append(m)

        # Match point coordinates
        pts1 = np.float32([kp1[m.queryIdx].pt for m in good_matches])
        pts2 = np.float32([kp2[m.trainIdx].pt for m in good_matches])

        return pts1, pts2, good_matches, kp1, kp2

    def estimate_pose(self, pts1, pts2):
        """Estimate pose from Essential Matrix"""

        E, mask = cv2.findEssentialMat(
            pts1, pts2, self.K,
            method=cv2.RANSAC,
            prob=0.999,
            threshold=1.0
        )

        # Recover R, t (reuse the RANSAC inlier mask from findEssentialMat)
        _, R, t, mask = cv2.recoverPose(E, pts1, pts2, self.K, mask=mask)

        return R, t, mask.ravel().astype(bool)

    def triangulate(self, pts1, pts2, R, t):
        """Triangulate to recover 3D points"""

        # Projection matrices
        P1 = self.K @ np.hstack([np.eye(3), np.zeros((3, 1))])
        P2 = self.K @ np.hstack([R, t])

        # Triangulation
        pts1_h = pts1.T  # (2, N)
        pts2_h = pts2.T

        points_4d = cv2.triangulatePoints(P1, P2, pts1_h, pts2_h)

        # Homogeneous → Euclidean coordinates
        points_3d = points_4d[:3] / points_4d[3]

        return points_3d.T  # (N, 3)

    def filter_points(self, pts1, pts2, points_3d, R, t):
        """Filter valid 3D points"""

        # Calculate reprojection error
        P2 = self.K @ np.hstack([R, t])

        projected = P2 @ np.hstack([points_3d, np.ones((len(points_3d), 1))]).T
        projected = projected[:2] / projected[2]
        projected = projected.T

        errors = np.linalg.norm(pts2 - projected, axis=1)

        # Check if in front of camera
        # First camera reference
        valid_depth1 = points_3d[:, 2] > 0

        # Second camera reference
        points_cam2 = (R @ points_3d.T + t).T
        valid_depth2 = points_cam2[:, 2] > 0

        # Reprojection error threshold
        valid_reproj = errors < 2.0

        valid = valid_depth1 & valid_depth2 & valid_reproj

        return points_3d[valid], valid

    def run(self, img1, img2):
        """Run complete SfM pipeline"""

        # 1. Feature matching
        pts1, pts2, matches, kp1, kp2 = self.detect_and_match(img1, img2)
        print(f"Match points: {len(pts1)}")

        # 2. Pose estimation
        R, t, inlier_mask = self.estimate_pose(pts1, pts2)
        pts1 = pts1[inlier_mask]
        pts2 = pts2[inlier_mask]
        print(f"Inliers: {len(pts1)}")

        # 3. Triangulation
        points_3d = self.triangulate(pts1, pts2, R, t)

        # 4. Filtering
        points_3d, valid = self.filter_points(pts1, pts2, points_3d, R, t)
        print(f"Valid 3D points: {len(points_3d)}")

        return points_3d, R, t

# Usage example
K = np.array([
    [800, 0, 320],
    [0, 800, 240],
    [0, 0, 1]
], dtype=np.float32)

sfm = SimpleSfM(K)
img1 = cv2.imread('image1.jpg')
img2 = cv2.imread('image2.jpg')
points_3d, R, t = sfm.run(img1, img2)

Multi-View SfM

class IncrementalSfM:
    """Incremental SfM"""

    def __init__(self, K):
        self.K = K
        self.sift = cv2.SIFT_create(nfeatures=8000)
        self.bf = cv2.BFMatcher()

        # Global data
        self.points_3d = None
        self.point_colors = None
        self.camera_poses = []
        self.keypoints_all = []
        self.descriptors_all = []

    def add_image(self, img):
        """Add new image"""

        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        kp, desc = self.sift.detectAndCompute(gray, None)

        self.keypoints_all.append(kp)
        self.descriptors_all.append(desc)

        return len(self.keypoints_all) - 1

    def initialize(self, idx1, idx2):
        """Initialize with first two images"""

        # Matching
        matches = self.bf.knnMatch(
            self.descriptors_all[idx1],
            self.descriptors_all[idx2],
            k=2
        )

        good = [m for m, n in matches if m.distance < 0.7 * n.distance]

        pts1 = np.float32([self.keypoints_all[idx1][m.queryIdx].pt for m in good])
        pts2 = np.float32([self.keypoints_all[idx2][m.trainIdx].pt for m in good])

        # Essential Matrix
        E, mask = cv2.findEssentialMat(pts1, pts2, self.K)
        _, R, t, mask = cv2.recoverPose(E, pts1, pts2, self.K)

        mask = mask.ravel().astype(bool)
        pts1 = pts1[mask]
        pts2 = pts2[mask]

        # Triangulation
        P1 = self.K @ np.hstack([np.eye(3), np.zeros((3, 1))])
        P2 = self.K @ np.hstack([R, t])

        points_4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
        self.points_3d = (points_4d[:3] / points_4d[3]).T

        # Store camera poses
        self.camera_poses = [
            {'R': np.eye(3), 't': np.zeros((3, 1))},
            {'R': R, 't': t}
        ]

        print(f"Initialization complete: {len(self.points_3d)} 3D points")

    def register_image(self, idx):
        """Register new image (PnP)"""

        if self.points_3d is None or len(self.points_3d) == 0:
            print("Initialization required first.")
            return False

        # Match with last added image
        last_idx = len(self.camera_poses) - 1

        matches = self.bf.knnMatch(
            self.descriptors_all[last_idx],
            self.descriptors_all[idx],
            k=2
        )

        good = [m for m, n in matches if m.distance < 0.7 * n.distance]

        if len(good) < 8:
            print("Insufficient matches")
            return False

        # 3D-2D correspondences (simplified: use previous image match indices)
        # In practice, track management is needed
        obj_points = []
        img_points = []

        for m in good[:len(self.points_3d)]:
            if m.queryIdx < len(self.points_3d):
                obj_points.append(self.points_3d[m.queryIdx])
                img_points.append(
                    self.keypoints_all[idx][m.trainIdx].pt
                )

        if len(obj_points) < 6:
            print("Insufficient correspondences")
            return False

        obj_points = np.array(obj_points, dtype=np.float32)
        img_points = np.array(img_points, dtype=np.float32)

        # PnP
        success, rvec, tvec, inliers = cv2.solvePnPRansac(
            obj_points, img_points, self.K, None
        )

        if not success:
            print("PnP failed")
            return False

        R, _ = cv2.Rodrigues(rvec)
        self.camera_poses.append({'R': R, 't': tvec})

        print(f"Image {idx} registered")
        return True

    def bundle_adjust(self):
        """Bundle adjustment (placeholder)"""

        # A full implementation would jointly refine all camera poses and
        # 3D points, e.g. with scipy.optimize.least_squares on the
        # reprojection errors. In practice, specialized libraries
        # (g2o, Ceres) are strongly recommended.
        print("Bundle adjustment: recommend specialized libraries (g2o, Ceres)")

    def get_point_cloud(self):
        """Return point cloud"""
        return self.points_3d

    def get_camera_trajectory(self):
        """Return camera trajectory"""
        positions = []
        for pose in self.camera_poses:
            R = pose['R']
            t = pose['t']
            # Camera position = -R^T * t
            pos = -R.T @ t
            positions.append(pos.ravel())

        return np.array(positions)
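
The reconstructed points are easiest to inspect in an external viewer (MeshLab, CloudCompare, Open3D). A minimal ASCII PLY writer is enough for that; `save_ply` is an illustrative helper, not part of OpenCV:

```python
import numpy as np

def save_ply(path, points, colors=None):
    """Write an Nx3 point array (optionally with Nx3 uint8 colors)
    to an ASCII PLY file readable by common point cloud viewers."""
    n = len(points)
    with open(path, 'w') as f:
        f.write('ply\nformat ascii 1.0\n')
        f.write(f'element vertex {n}\n')
        f.write('property float x\nproperty float y\nproperty float z\n')
        if colors is not None:
            f.write('property uchar red\nproperty uchar green\nproperty uchar blue\n')
        f.write('end_header\n')
        for i in range(n):
            x, y, z = points[i]
            if colors is not None:
                r, g, b = colors[i]
                f.write(f'{x} {y} {z} {int(r)} {int(g)} {int(b)}\n')
            else:
                f.write(f'{x} {y} {z}\n')

# e.g. save_ply('cloud.ply', sfm.get_point_cloud())
pts = np.array([[0.0, 0.0, 1.0], [0.5, 0.2, 2.0]])
save_ply('cloud.ply', pts)
```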

5. Depth Map Applications

Depth-based Image Effects

import cv2
import numpy as np

def apply_bokeh_effect(img, depth, focus_depth=0.5, aperture=0.1):
    """Depth-based bokeh effect (depth of field simulation)"""

    # Normalize depth (0-1)
    depth_norm = (depth - depth.min()) / (depth.max() - depth.min())

    # Calculate deviation from focus distance
    depth_diff = np.abs(depth_norm - focus_depth)

    # Blur strength (stronger farther from focus)
    blur_strength = (depth_diff / aperture * 30).astype(int)
    blur_strength = np.clip(blur_strength, 0, 31)

    # Apply blur (different strength per pixel)
    result = np.zeros_like(img, dtype=np.float32)

    for blur_level in range(0, 32, 2):
        mask = (blur_strength >= blur_level) & (blur_strength < blur_level + 2)

        if blur_level == 0:
            blurred = img.astype(np.float32)
        else:
            ksize = blur_level * 2 + 1
            blurred = cv2.GaussianBlur(img, (ksize, ksize), 0).astype(np.float32)

        result += blurred * mask[:, :, np.newaxis]

    return result.astype(np.uint8)

def create_depth_fog(img, depth, fog_color=(200, 200, 200), max_fog=0.8):
    """Depth-based fog effect"""

    # Normalize depth
    depth_norm = (depth - depth.min()) / (depth.max() - depth.min())

    # Fog strength (stronger farther away)
    fog_factor = depth_norm * max_fog

    # Apply fog
    fog = np.full_like(img, fog_color, dtype=np.float32)
    result = img.astype(np.float32) * (1 - fog_factor[:, :, np.newaxis])
    result += fog * fog_factor[:, :, np.newaxis]

    return result.astype(np.uint8)

def depth_based_segmentation(img, depth, num_layers=5):
    """Depth-based layer segmentation"""

    # Normalize depth
    depth_norm = (depth - depth.min()) / (depth.max() - depth.min())

    # Segment by depth intervals
    layers = []
    for i in range(num_layers):
        lower = i / num_layers
        upper = (i + 1) / num_layers
        mask = (depth_norm >= lower) & (depth_norm < upper)

        layer = np.zeros_like(img)
        layer[mask] = img[mask]
        layers.append(layer)

    return layers

def remove_background_with_depth(img, depth, threshold=0.5):
    """Depth-based background removal"""

    # Normalize depth
    depth_norm = (depth - depth.min()) / (depth.max() - depth.min())

    # Foreground mask (parts closer than threshold)
    # NOTE: assumes larger values = farther (metric-style depth);
    # invert the comparison for inverse-depth outputs such as MiDaS,
    # where larger values mean closer
    foreground_mask = depth_norm < threshold

    # Refine mask
    kernel = np.ones((5, 5), np.uint8)
    foreground_mask = cv2.morphologyEx(
        foreground_mask.astype(np.uint8),
        cv2.MORPH_CLOSE, kernel
    )
    foreground_mask = cv2.morphologyEx(
        foreground_mask,
        cv2.MORPH_OPEN, kernel
    )

    # Remove background
    result = np.zeros_like(img)
    result[foreground_mask == 1] = img[foreground_mask == 1]

    return result, foreground_mask

3D Effect Generation

def create_3d_ken_burns(img, depth, num_frames=60, zoom=0.1):
    """Ken Burns effect (3D camera movement)"""

    h, w = img.shape[:2]
    frames = []

    for i in range(num_frames):
        t = i / (num_frames - 1)

        # Zoom factor
        scale = 1 + zoom * t

        # Parallax by depth
        parallax = (depth - depth.mean()) * 0.001 * t

        # Calculate new coordinates
        y_coords, x_coords = np.meshgrid(range(h), range(w), indexing='ij')

        # Center-based scaling
        new_x = (x_coords - w/2) / scale + w/2 + parallax
        new_y = (y_coords - h/2) / scale + h/2

        # Remapping
        map_x = new_x.astype(np.float32)
        map_y = new_y.astype(np.float32)

        frame = cv2.remap(img, map_x, map_y, cv2.INTER_LINEAR)
        frames.append(frame)

    return frames

def depth_aware_zoom(img, depth, zoom_center, zoom_factor=2.0):
    """Depth-aware zoom"""

    h, w = img.shape[:2]
    cx, cy = zoom_center

    # Normalize depth
    depth_norm = (depth - depth.min()) / (depth.max() - depth.min())

    # Apply different zoom by depth (closer objects zoom more)
    depth_factor = 1 - depth_norm * 0.5  # 0.5 ~ 1.0

    # Coordinate grid
    y_coords, x_coords = np.meshgrid(range(h), range(w), indexing='ij')

    # Zoom transform (different scale per depth)
    effective_zoom = zoom_factor * depth_factor

    new_x = (x_coords - cx) / effective_zoom + cx
    new_y = (y_coords - cy) / effective_zoom + cy

    # Remapping
    map_x = new_x.astype(np.float32)
    map_y = new_y.astype(np.float32)

    result = cv2.remap(img, map_x, map_y, cv2.INTER_LINEAR)

    return result
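
A depth map plus camera intrinsics also back-projects directly to a 3D point cloud, which is the basis for depth-based 3D viewers. With pinhole intrinsics, X = (u − cx)·Z/fx and Y = (v − cy)·Z/fy. A minimal sketch (`depth_to_point_cloud` is an illustrative helper):

```python
import numpy as np

def depth_to_point_cloud(depth, K):
    """Back-project a depth map (H, W) to an (H*W, 3) point cloud
    using pinhole intrinsics K. For relative depth the cloud is
    correct only up to the unknown scale."""
    h, w = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    # Pixel coordinate grids (u = column, v = row)
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

K = np.array([[800, 0, 320],
              [0, 800, 240],
              [0, 0, 1]], dtype=np.float64)
depth = np.full((480, 640), 2.0)   # flat plane 2 m from the camera
points = depth_to_point_cloud(depth, K)
print(points.shape)  # (307200, 3)
```

The principal point back-projects onto the optical axis: the point for pixel (320, 240) is (0, 0, 2).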

6. Exercises

Exercise 1: MiDaS Depth Estimation

Estimate depth of an image using MiDaS.

Requirements:
- Load model and run inference
- Visualize depth map (colormap)
- Test on multiple images

Hint
import torch

model = torch.hub.load('intel-isl/MiDaS', 'DPT_Large')
midas_transforms = torch.hub.load('intel-isl/MiDaS', 'transforms')
transform = midas_transforms.dpt_transform

Exercise 2: Depth-based Background Blur

Blur only the background in a portrait photo.

Requirements:
- Depth estimation
- Foreground/background separation
- Apply blur only to background
- Natural boundary handling

Hint
# Depth-based mask generation
threshold = np.percentile(depth, 30)  # Treat closest 30% as foreground
foreground_mask = depth < threshold

# Blur mask (smooth boundaries)
mask_blur = cv2.GaussianBlur(
    foreground_mask.astype(np.float32), (21, 21), 0
)

# Background blur
background_blur = cv2.GaussianBlur(img, (25, 25), 0)

# Composite
result = img * mask_blur[..., None] + background_blur * (1 - mask_blur[..., None])

Exercise 3: 3D Reconstruction with SfM

Reconstruct a 3D point cloud from two images.

Requirements:
- Feature matching
- Essential Matrix calculation
- Triangulation
- Point cloud visualization

Hint
# Essential Matrix
E, mask = cv2.findEssentialMat(pts1, pts2, K)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

# Projection matrices
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])

# Triangulation
points_4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
points_3d = points_4d[:3] / points_4d[3]

Exercise 4: Real-time Depth Estimation

Implement real-time depth estimation using webcam.

Requirements:
- Use lightweight model (MiDaS small)
- Measure and display FPS
- Depth visualization

Hint
# Lightweight model
model = torch.hub.load('intel-isl/MiDaS', 'MiDaS_small')

while True:
    ret, frame = cap.read()

    start = time.time()
    depth = estimate_depth(frame, model, transform)
    fps = 1.0 / (time.time() - start)

    cv2.putText(depth_vis, f"FPS: {fps:.1f}", ...)

Exercise 5: Depth-based 3D Viewer

Create a simple 3D viewer using depth map.

Requirements:
- Depth map → Point cloud conversion
- Visualization with Open3D
- Mouse rotation/zoom

Hint
import open3d as o3d

# Create point cloud
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(points_3d)
pcd.colors = o3d.utility.Vector3dVector(colors / 255.0)

# Visualization
o3d.visualization.draw_geometries([pcd])
