38. Object Detection


ν•™μŠ΅ λͺ©ν‘œ

  • Two-stage vs One-stage 탐지기 차이 이해
  • YOLO μ•„ν‚€ν…μ²˜μ™€ λ°œμ „ κ³Όμ • ν•™μŠ΅
  • Faster R-CNN의 ꡬ쑰와 RPN 이해
  • DETR (Detection Transformer) κ°œλ… νŒŒμ•…
  • PyTorch/Ultralytics둜 μ‹€μŠ΅

1. Object Detection Overview

1.1 Problem Definition

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Computer Vision νƒœμŠ€ν¬ 비ꡐ                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  1. 이미지 λΆ„λ₯˜ (Image Classification)                           β”‚
β”‚     └─ 이미지 전체 β†’ ν•˜λ‚˜μ˜ 클래슀                                β”‚
β”‚     └─ 좜λ ₯: "κ°•μ•„μ§€"                                            β”‚
β”‚                                                                 β”‚
β”‚  2. 객체 탐지 (Object Detection)                                 β”‚
β”‚     └─ 이미지 β†’ μ—¬λŸ¬ 객체의 μœ„μΉ˜ + 클래슀                         β”‚
β”‚     └─ 좜λ ₯: [(x1,y1,x2,y2, "κ°•μ•„μ§€", 0.95), ...]              β”‚
β”‚                                                                 β”‚
β”‚  3. μ‹œλ§¨ν‹± λΆ„ν•  (Semantic Segmentation)                          β”‚
β”‚     └─ ν”½μ…€λ§ˆλ‹€ 클래슀 ν• λ‹Ή                                       β”‚
β”‚     └─ 같은 클래슀의 객체듀은 ꡬ뢄 μ•ˆ 됨                           β”‚
β”‚                                                                 β”‚
β”‚  4. μΈμŠ€ν„΄μŠ€ λΆ„ν•  (Instance Segmentation)                        β”‚
β”‚     └─ 객체 탐지 + 각 객체의 ν”½μ…€ 마슀크                          β”‚
β”‚     └─ 같은 ν΄λž˜μŠ€λΌλ„ κ°œλ³„ 객체 ꡬ뢄                              β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

1.2 Detector Taxonomy

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    탐지기 λΆ„λ₯˜ 체계                               β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚           Two-Stage Detectors                            β”‚   β”‚
β”‚  β”‚  1단계: Region Proposal (후보 μ˜μ—­ 생성)                   β”‚   β”‚
β”‚  β”‚  2단계: Classification + Regression                       β”‚   β”‚
β”‚  β”‚                                                           β”‚   β”‚
β”‚  β”‚  예: R-CNN β†’ Fast R-CNN β†’ Faster R-CNN                   β”‚   β”‚
β”‚  β”‚      μž₯점: 높은 정확도                                     β”‚   β”‚
β”‚  β”‚      단점: 느린 속도                                       β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚           One-Stage Detectors                            β”‚   β”‚
β”‚  β”‚  단일 λ„€νŠΈμ›Œν¬λ‘œ μœ„μΉ˜ + 클래슀 λ™μ‹œ 예츑                    β”‚   β”‚
β”‚  β”‚                                                           β”‚   β”‚
β”‚  β”‚  예: YOLO, SSD, RetinaNet, CenterNet                     β”‚   β”‚
β”‚  β”‚      μž₯점: λΉ λ₯Έ 속도, μ‹€μ‹œκ°„ 처리 κ°€λŠ₯                     β”‚   β”‚
β”‚  β”‚      단점: μž‘μ€ 객체 탐지 어렀움 (κ°œμ„ λ¨)                   β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚           Transformer-based Detectors                    β”‚   β”‚
β”‚  β”‚  DETR, Deformable DETR, RT-DETR                          β”‚   β”‚
β”‚  β”‚  End-to-end ν•™μŠ΅, NMS λΆˆν•„μš”                              β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

1.3 Evaluation Metrics

"""
객체 탐지 평가 μ§€ν‘œ
"""

def calculate_iou(box1, box2):
    """
    Compute IoU (Intersection over Union).

    Args:
        box1, box2: boxes in [x1, y1, x2, y2] format

    Returns:
        IoU value in [0, 1]
    """
    # Intersection coordinates
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])

    # Intersection area
    inter_area = max(0, x2 - x1) * max(0, y2 - y1)

    # Union area
    box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
    box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union_area = box1_area + box2_area - inter_area

    return inter_area / union_area if union_area > 0 else 0

# Example
pred_box = [100, 100, 200, 200]
gt_box = [120, 110, 210, 210]
print(f"IoU: {calculate_iou(pred_box, gt_box):.3f}")  # ≈ 0.610


"""
mAP (mean Average Precision) 계산 κ³Όμ •:

1. 각 ν΄λž˜μŠ€λ³„λ‘œ:
   - μ˜ˆμΈ‘μ„ confidence 순으둜 μ •λ ¬
   - IoU > threshold인 경우 TP, μ•„λ‹ˆλ©΄ FP
   - Precision-Recall 곑선 계산
   - AP = 곑선 μ•„λž˜ 면적

2. mAP = λͺ¨λ“  클래슀 AP의 평균

COCO 데이터셋 κΈ°μ€€:
- mAP@0.5: IoU=0.5 κΈ°μ€€
- mAP@0.75: IoU=0.75 κΈ°μ€€ (엄격)
- mAP@[.5:.95]: 0.5~0.95 IoU의 평균
"""

2. The R-CNN Family

2.1 Evolution of R-CNN

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    R-CNN Family λ°œμ „                             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  R-CNN (2014):                                                  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ 이미지    β”‚ β†’ β”‚ Selectiveβ”‚ β†’ β”‚ CNN      β”‚ β†’ β”‚ SVM     β”‚   β”‚
β”‚  β”‚          β”‚    β”‚ Search   β”‚    β”‚ (AlexNet)β”‚    β”‚ λΆ„λ₯˜κΈ°   β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚ (~2000개)β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                   β”‚
β”‚  문제점: ~2000번 CNN 톡과 β†’ 맀우 느림                            β”‚
β”‚                                                                 β”‚
β”‚  Fast R-CNN (2015):                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ 이미지    β”‚ β†’ β”‚ CNN      β”‚ β†’ β”‚ RoI      β”‚ β†’ β”‚ FC +    β”‚   β”‚
β”‚  β”‚          β”‚    β”‚ Feature  β”‚    β”‚ Pooling  β”‚    β”‚ Softmax β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚ Map      β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚  κ°œμ„ : CNN 1번만 톡과, RoI Pooling으둜 후보 μ˜μ—­ μΆ”μΆœ              β”‚
β”‚  문제점: Selective Search μ—¬μ „νžˆ 느림                            β”‚
β”‚                                                                 β”‚
β”‚  Faster R-CNN (2015):                                           β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ 이미지    β”‚ β†’ β”‚ Backbone β”‚ β†’ β”‚ RPN      β”‚ β†’ β”‚ Head    β”‚   β”‚
β”‚  β”‚          β”‚    β”‚ (ResNet) β”‚    β”‚          β”‚    β”‚         β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚  ν˜μ‹ : RPN으둜 Region Proposal도 ν•™μŠ΅                            β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2.2 Faster R-CNN Structure

import torch
import torch.nn as nn
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn_v2
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

class CustomFasterRCNN:
    """Custom Faster R-CNN model"""

    def __init__(self, num_classes: int, pretrained: bool = True):
        """
        Args:
            num_classes: number of classes including background
                         (e.g. 10 object classes → 11)
            pretrained: use COCO-pretrained weights
        """
        # Load the pretrained model
        self.model = fasterrcnn_resnet50_fpn_v2(
            weights="DEFAULT" if pretrained else None
        )

        # Replace the box predictor to match the class count
        in_features = self.model.roi_heads.box_predictor.cls_score.in_features
        self.model.roi_heads.box_predictor = FastRCNNPredictor(
            in_features, num_classes
        )

    def get_model(self):
        return self.model


def train_faster_rcnn():
    """Faster R-CNN training example"""

    # Create the model (background + 10 classes)
    model = CustomFasterRCNN(num_classes=11).get_model()
    model.train()

    # Dummy data (boxes must be float tensors)
    images = [torch.rand(3, 600, 800) for _ in range(2)]

    targets = [
        {
            "boxes": torch.tensor([[100, 100, 200, 200], [300, 300, 400, 400]],
                                  dtype=torch.float32),
            "labels": torch.tensor([1, 2]),  # class IDs
        },
        {
            "boxes": torch.tensor([[50, 50, 150, 150]], dtype=torch.float32),
            "labels": torch.tensor([3]),
        }
    ]

    # Forward pass (in train mode the model returns losses)
    loss_dict = model(images, targets)

    # Loss components:
    # - loss_classifier: class classification loss
    # - loss_box_reg: box regression loss
    # - loss_objectness: the RPN's object/background classification
    # - loss_rpn_box_reg: RPN box regression

    total_loss = sum(loss for loss in loss_dict.values())
    print(f"Total loss: {total_loss.item():.4f}")

    return loss_dict


def inference_faster_rcnn(model, image, threshold=0.5):
    """Faster R-CNN μΆ”λ‘ """

    model.eval()

    with torch.no_grad():
        predictions = model([image])

    pred = predictions[0]

    # Keep only predictions above the score threshold
    keep = pred["scores"] > threshold

    result = {
        "boxes": pred["boxes"][keep],
        "labels": pred["labels"][keep],
        "scores": pred["scores"][keep],
    }

    return result

2.3 RPN (Region Proposal Network)

"""
RPN 핡심 κ°œλ…:

1. Anchor Boxes:
   - 각 μœ„μΉ˜μ—μ„œ μ—¬λŸ¬ 크기/λΉ„μœ¨μ˜ λ°•μŠ€ 미리 μ •μ˜
   - 예: 3개 크기 Γ— 3개 λΉ„μœ¨ = 9개 anchor

2. 좜λ ₯:
   - objectness score: 객체 쑴재 ν™•λ₯  (2-class)
   - box regression: anchor β†’ μ‹€μ œ λ°•μŠ€ λ³€ν™˜

3. ν•™μŠ΅:
   - Positive: IoU > 0.7인 anchor
   - Negative: IoU < 0.3인 anchor
   - λ¬΄μ‹œ: 0.3~0.7 사이
"""

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleRPN(nn.Module):
    """Simplified RPN implementation"""

    def __init__(
        self,
        in_channels: int = 256,
        num_anchors: int = 9,  # 3 scales × 3 ratios
    ):
        super().__init__()

        # 3×3 conv to process the feature map
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)

        # objectness prediction (object vs. background)
        self.objectness = nn.Conv2d(in_channels, num_anchors * 2, 1)

        # bbox regression (dx, dy, dw, dh)
        self.bbox_reg = nn.Conv2d(in_channels, num_anchors * 4, 1)

    def forward(self, feature_map):
        """
        Args:
            feature_map: (B, C, H, W)

        Returns:
            objectness: (B, num_anchors*2, H, W)
            bbox_deltas: (B, num_anchors*4, H, W)
        """
        x = F.relu(self.conv(feature_map))

        objectness = self.objectness(x)
        bbox_deltas = self.bbox_reg(x)

        return objectness, bbox_deltas


def generate_anchors(
    feature_size: tuple,
    anchor_scales: list = [128, 256, 512],
    anchor_ratios: list = [0.5, 1.0, 2.0],
    stride: int = 16
):
    """
    Anchor λ°•μŠ€ 생성

    Args:
        feature_size: (H, W) feature map 크기
        anchor_scales: anchor 면적의 제곱근
        anchor_ratios: κ°€λ‘œ/μ„Έλ‘œ λΉ„μœ¨
        stride: 원본 이미지 λŒ€λΉ„ μΆ•μ†Œ λΉ„μœ¨

    Returns:
        anchors: (H*W*num_anchors, 4) ν˜•νƒœμ˜ anchor μ’Œν‘œ
    """
    H, W = feature_size
    anchors = []

    for h in range(H):
        for w in range(W):
            # feature map location → input-image coordinates
            cx = (w + 0.5) * stride
            cy = (h + 0.5) * stride

            for scale in anchor_scales:
                for ratio in anchor_ratios:
                    # λΉ„μœ¨μ— λ”°λ₯Έ λ„ˆλΉ„/높이
                    anchor_w = scale * (ratio ** 0.5)
                    anchor_h = scale / (ratio ** 0.5)

                    # (x1, y1, x2, y2) format
                    anchors.append([
                        cx - anchor_w / 2,
                        cy - anchor_h / 2,
                        cx + anchor_w / 2,
                        cy + anchor_h / 2
                    ])

    return torch.tensor(anchors)

# μ˜ˆμ‹œ
anchors = generate_anchors((38, 50))  # 600Γ—800 이미지, stride=16
print(f"Generated {len(anchors)} anchors")  # 38*50*9 = 17,100개
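To make the positive/negative assignment rule from the summary above concrete, here is a minimal plain-Python sketch (thresholds 0.7/0.3 as above; a real RPN additionally forces the highest-IoU anchor for each ground-truth box to be positive, which is omitted here).

```python
def box_iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Assign 1 (positive), 0 (negative), or -1 (ignored) per anchor."""
    labels = []
    for anchor in anchors:
        best = max((box_iou(anchor, gt) for gt in gt_boxes), default=0.0)
        if best > pos_thr:
            labels.append(1)    # confident object match
        elif best < neg_thr:
            labels.append(0)    # background
        else:
            labels.append(-1)   # ambiguous: excluded from the RPN loss
    return labels
```

An anchor half-covering a ground-truth box (IoU = 0.5) falls in the ignored band and contributes nothing to the loss.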

3. YOLO (You Only Look Once)

3.1 YOLO Version History

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    YOLO 버전 비ꡐ                                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  YOLOv1 (2016): 단일 CNN으둜 κ·Έλ¦¬λ“œ 기반 탐지                     β”‚
β”‚  YOLOv2 (2017): Batch Norm, Anchor Boxes λ„μž…                   β”‚
β”‚  YOLOv3 (2018): Darknet-53, FPN, 3개 μŠ€μΌ€μΌ 예츑                 β”‚
β”‚  YOLOv4 (2020): CSPDarknet, SPP, PANet                          β”‚
β”‚  YOLOv5 (2020): PyTorch κ΅¬ν˜„, Ultralytics                       β”‚
β”‚  YOLOv6 (2022): 속도 μ΅œμ ν™”, EfficientRep                       β”‚
β”‚  YOLOv7 (2022): E-ELAN, Auxiliary Head                          β”‚
β”‚  YOLOv8 (2023): Unified Framework, Anchor-free                  β”‚
β”‚  YOLOv9 (2024): GELAN, PGI                                      β”‚
β”‚  YOLOv10 (2024): NMS-free, Dual Assignments                     β”‚
β”‚  YOLO11 (2024): 더 λΉ λ₯΄κ³  μ •ν™•ν•œ 버전                             β”‚
β”‚                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚  μ„±λŠ₯ (COCO val2017)           mAP50-95   Speed (ms)   β”‚   β”‚
β”‚  β”‚  ─────────────────────────────────────────────────────  β”‚   β”‚
β”‚  β”‚  YOLOv8n                         37.3        1.2       β”‚   β”‚
β”‚  β”‚  YOLOv8s                         44.9        1.9       β”‚   β”‚
β”‚  β”‚  YOLOv8m                         50.2        4.3       β”‚   β”‚
β”‚  β”‚  YOLOv8l                         52.9        6.7       β”‚   β”‚
β”‚  β”‚  YOLOv8x                         53.9        9.8       β”‚   β”‚
β”‚  β”‚  YOLO11x                         54.7        11.3      β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

3.2 Hands-on with Ultralytics YOLOv8

from ultralytics import YOLO
import torch

# ===============================
# 1. Load a model
# ===============================

# Load a pretrained model
model = YOLO("yolov8n.pt")  # nano (fastest)
# model = YOLO("yolov8s.pt")  # small
# model = YOLO("yolov8m.pt")  # medium
# model = YOLO("yolov8l.pt")  # large
# model = YOLO("yolov8x.pt")  # extra-large

# Or build an untrained model and train from scratch
# model = YOLO("yolov8n.yaml")


# ===============================
# 2. Inference
# ===============================

def detect_objects(image_path: str, conf_threshold: float = 0.25):
    """Detect objects in an image"""

    results = model(image_path, conf=conf_threshold)

    for result in results:
        boxes = result.boxes

        print(f"Detected objects: {len(boxes)}")

        for box in boxes:
            # Coordinates
            x1, y1, x2, y2 = box.xyxy[0].tolist()

            # Class and confidence
            cls = int(box.cls[0])
            conf = float(box.conf[0])
            class_name = model.names[cls]

            print(f"  {class_name}: {conf:.2f} at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")

    return results

# Usage
# results = detect_objects("image.jpg")

# Visualize the results
# results[0].show()  # display the image
# results[0].save("result.jpg")  # save to disk


# ===============================
# 3. Video/webcam detection
# ===============================

def detect_video(source=0):
    """
    Real-time detection on a video file or webcam.

    Args:
        source: 0 for the webcam, or a path to a video file
    """
    results = model(source, stream=True)  # returns a generator

    for result in results:
        # Per-frame processing
        annotated_frame = result.plot()  # frame with boxes drawn

        # Could be displayed here with cv2.imshow(), etc.
        yield annotated_frame


# ===============================
# 4. Training on a custom dataset
# ===============================

def train_custom_model():
    """Train YOLO on a custom dataset"""

    # Example dataset yaml file (data.yaml):
    """
    path: /path/to/dataset
    train: images/train
    val: images/val

    names:
      0: cat
      1: dog
      2: bird
    """

    # Train the model
    model = YOLO("yolov8n.pt")

    results = model.train(
        data="data.yaml",
        epochs=100,
        imgsz=640,
        batch=16,
        device=0,  # GPU 0, or "cpu"
        patience=50,  # early stopping
        save=True,
        project="runs/detect",
        name="custom_model",
    )

    return results


# ===============================
# 5. Exporting the model
# ===============================

def export_model():
    """Export the model to various formats"""

    model = YOLO("yolov8n.pt")

    # Export to ONNX
    model.export(format="onnx")

    # Export to TensorRT (optimized GPU inference)
    # model.export(format="engine")

    # Export to CoreML (Apple)
    # model.export(format="coreml")

    # Export to TFLite (mobile)
    # model.export(format="tflite")

3.3 YOLOv8 Loss Functions

"""
YOLOv8 손싀 ν•¨μˆ˜ ꡬ성:

1. Box Loss (CIoU Loss):
   - λ°•μŠ€ μœ„μΉ˜μ™€ 크기의 정확도
   - CIoU = IoU - (거리 νŽ˜λ„ν‹° + μ’…νš‘λΉ„ νŽ˜λ„ν‹°)

2. Classification Loss (BCE):
   - 각 ν΄λž˜μŠ€μ— λŒ€ν•œ 이진 ꡐ차 μ—”νŠΈλ‘œν”Ό
   - Focal Loss λ³€ν˜• μ‚¬μš© κ°€λŠ₯

3. DFL Loss (Distribution Focal Loss):
   - λ°•μŠ€ κ²½κ³„μ˜ 뢄포 예츑
   - YOLOv8의 μƒˆλ‘œμš΄ νšŒκ·€ 방식

Total Loss = Ξ»_box * L_box + Ξ»_cls * L_cls + Ξ»_dfl * L_dfl
"""

import torch
import torch.nn as nn
import torch.nn.functional as F

def ciou_loss(pred_boxes, target_boxes, eps=1e-7):
    """
    Complete IoU (CIoU) Loss

    Args:
        pred_boxes: (N, 4) predicted boxes [x1, y1, x2, y2]
        target_boxes: (N, 4) ground-truth boxes

    Returns:
        CIoU loss per box pair
    """
    # IoU
    inter_x1 = torch.max(pred_boxes[:, 0], target_boxes[:, 0])
    inter_y1 = torch.max(pred_boxes[:, 1], target_boxes[:, 1])
    inter_x2 = torch.min(pred_boxes[:, 2], target_boxes[:, 2])
    inter_y2 = torch.min(pred_boxes[:, 3], target_boxes[:, 3])

    inter_area = torch.clamp(inter_x2 - inter_x1, min=0) * \
                 torch.clamp(inter_y2 - inter_y1, min=0)

    pred_area = (pred_boxes[:, 2] - pred_boxes[:, 0]) * \
                (pred_boxes[:, 3] - pred_boxes[:, 1])
    target_area = (target_boxes[:, 2] - target_boxes[:, 0]) * \
                  (target_boxes[:, 3] - target_boxes[:, 1])

    union_area = pred_area + target_area - inter_area
    iou = inter_area / (union_area + eps)

    # 쀑심점 거리
    pred_cx = (pred_boxes[:, 0] + pred_boxes[:, 2]) / 2
    pred_cy = (pred_boxes[:, 1] + pred_boxes[:, 3]) / 2
    target_cx = (target_boxes[:, 0] + target_boxes[:, 2]) / 2
    target_cy = (target_boxes[:, 1] + target_boxes[:, 3]) / 2

    center_dist_sq = (pred_cx - target_cx) ** 2 + (pred_cy - target_cy) ** 2

    # λŒ€κ°μ„  거리 (enclosing box)
    enclose_x1 = torch.min(pred_boxes[:, 0], target_boxes[:, 0])
    enclose_y1 = torch.min(pred_boxes[:, 1], target_boxes[:, 1])
    enclose_x2 = torch.max(pred_boxes[:, 2], target_boxes[:, 2])
    enclose_y2 = torch.max(pred_boxes[:, 3], target_boxes[:, 3])

    enclose_diag_sq = (enclose_x2 - enclose_x1) ** 2 + \
                      (enclose_y2 - enclose_y1) ** 2

    # μ’…νš‘λΉ„ 일관성
    pred_w = pred_boxes[:, 2] - pred_boxes[:, 0]
    pred_h = pred_boxes[:, 3] - pred_boxes[:, 1]
    target_w = target_boxes[:, 2] - target_boxes[:, 0]
    target_h = target_boxes[:, 3] - target_boxes[:, 1]

    v = (4 / (torch.pi ** 2)) * \
        (torch.atan(target_w / (target_h + eps)) -
         torch.atan(pred_w / (pred_h + eps))) ** 2

    alpha = v / (1 - iou + v + eps)

    # CIoU
    ciou = iou - (center_dist_sq / (enclose_diag_sq + eps)) - alpha * v

    return 1 - ciou
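The DFL component listed in the loss summary has no code above. The idea: each box edge is predicted as a distribution over discrete bins, and the loss is a cross-entropy that pushes probability mass onto the two integer bins bracketing the continuous target. A minimal single-edge sketch in plain Python (bin count and inputs are illustrative):

```python
import math

def dfl_loss(logits, target):
    """Distribution Focal Loss for one box edge.

    logits: scores over n discrete bins (0 .. n-1)
    target: continuous edge position, 0 <= target <= n-1
    """
    # Softmax over bins
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # The two integer bins bracketing the target
    left = min(int(math.floor(target)), len(logits) - 2)
    right = left + 1
    w_left = right - target    # closer target -> larger weight
    w_right = target - left

    # Cross-entropy against the soft two-hot target
    return -(w_left * math.log(probs[left])
             + w_right * math.log(probs[right]))
```

A distribution concentrated on the correct bin gives a near-zero loss; mass on the wrong bin is penalized heavily.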

4. DETR (Detection Transformer)

4.1 DETR Concept

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    DETR μ•„ν‚€ν…μ²˜                                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  κΈ°μ‘΄ 탐지기 문제점:                                              β”‚
β”‚  - Anchor 섀계 ν•„μš”                                              β”‚
β”‚  - NMS (Non-Maximum Suppression) ν›„μ²˜λ¦¬ ν•„μš”                     β”‚
β”‚  - λ³΅μž‘ν•œ νŒŒμ΄ν”„λΌμΈ                                             β”‚
β”‚                                                                 β”‚
β”‚  DETR ν˜μ‹ :                                                      β”‚
β”‚  - End-to-end ν•™μŠ΅                                               β”‚
β”‚  - Object Query둜 직접 객체 예츑                                  β”‚
β”‚  - Hungarian Matching으둜 ν•™μŠ΅                                   β”‚
β”‚  - NMS λΆˆν•„μš”                                                    β”‚
β”‚                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚ Backbone   β”‚ β†’ β”‚ Transformerβ”‚ β†’ β”‚ FFN Heads             β”‚ β”‚
β”‚  β”‚ (ResNet)   β”‚    β”‚ Encoder/   β”‚    β”‚ (class + box)         β”‚ β”‚
β”‚  β”‚            β”‚    β”‚ Decoder    β”‚    β”‚                        β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚       ↓                 ↓                      ↓                β”‚
β”‚  Feature Map    Object Queries (100개)    100개 예츑 좜λ ₯       β”‚
β”‚  + Positional   ↓                                               β”‚
β”‚    Encoding     Self-attention + Cross-attention                β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

4.2 DETR Implementation

import torch
import torch.nn as nn
from torchvision.models import resnet50
import torch.nn.functional as F

class DETR(nn.Module):
    """
    Simplified DETR implementation
    """

    def __init__(
        self,
        num_classes: int,
        num_queries: int = 100,
        hidden_dim: int = 256,
        nheads: int = 8,
        num_encoder_layers: int = 6,
        num_decoder_layers: int = 6,
    ):
        super().__init__()

        # Backbone
        backbone = resnet50(weights="DEFAULT")
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])

        # Project the feature map to hidden_dim
        self.conv = nn.Conv2d(2048, hidden_dim, 1)

        # Transformer
        self.transformer = nn.Transformer(
            d_model=hidden_dim,
            nhead=nheads,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            batch_first=True,
        )

        # Object queries (learned embeddings)
        self.query_embed = nn.Embedding(num_queries, hidden_dim)

        # Output heads
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for no-object
        self.bbox_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 4),  # cx, cy, w, h
            nn.Sigmoid(),
        )

        # Positional encoding
        self.row_embed = nn.Embedding(50, hidden_dim // 2)
        self.col_embed = nn.Embedding(50, hidden_dim // 2)

    def forward(self, x):
        """
        Args:
            x: (B, 3, H, W) input images

        Returns:
            class_logits: (B, num_queries, num_classes+1)
            bbox_pred: (B, num_queries, 4)
        """
        B = x.shape[0]

        # Extract backbone features
        features = self.backbone(x)  # (B, 2048, H/32, W/32)
        features = self.conv(features)  # (B, 256, H/32, W/32)

        _, _, H, W = features.shape

        # Build the positional encoding
        pos_embed = self._get_positional_encoding(H, W, features.device)

        # Flatten for the Transformer
        src = features.flatten(2).permute(0, 2, 1)  # (B, H*W, 256)
        src = src + pos_embed.flatten(0, 1).unsqueeze(0).expand(B, -1, -1)

        # Object queries
        query_embed = self.query_embed.weight.unsqueeze(0).expand(B, -1, -1)

        # Transformer
        tgt = torch.zeros_like(query_embed)
        hs = self.transformer(src, tgt + query_embed)  # (B, num_queries, 256)

        # Predictions
        class_logits = self.class_head(hs)
        bbox_pred = self.bbox_head(hs)

        return class_logits, bbox_pred

    def _get_positional_encoding(self, H, W, device):
        """Build a 2D positional encoding"""
        i = torch.arange(W, device=device)
        j = torch.arange(H, device=device)

        x_embed = self.col_embed(i)  # (W, 128)
        y_embed = self.row_embed(j)  # (H, 128)

        pos = torch.cat([
            x_embed.unsqueeze(0).expand(H, -1, -1),
            y_embed.unsqueeze(1).expand(-1, W, -1),
        ], dim=-1)  # (H, W, 256)

        return pos


# Hungarian matching loss (simplified)
class HungarianMatcher:
    """
    Optimally match predictions to ground truth.

    Cost = λ_cls * L_cls + λ_box * L_box + λ_giou * L_giou
    """

    def __init__(self, cost_class=1, cost_bbox=5, cost_giou=2):
        self.cost_class = cost_class
        self.cost_bbox = cost_bbox
        self.cost_giou = cost_giou

    def __call__(self, outputs, targets):
        """
        Perform bipartite matching using
        scipy.optimize.linear_sum_assignment.
        """
        # Implementation omitted (uses scipy)
        pass
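The matcher above delegates the optimization to scipy's `linear_sum_assignment`. For intuition, the same minimum-cost bipartite matching can be brute-forced over permutations on tiny inputs (illustration only; factorial cost, whereas the Hungarian algorithm is O(n³)):

```python
from itertools import permutations

def min_cost_matching(cost):
    """Exhaustive bipartite matching on a square cost matrix.

    Returns (assignment, total_cost), where assignment[i] is the
    column (ground-truth index) matched to row i (prediction index).
    Equivalent to scipy.optimize.linear_sum_assignment on tiny inputs.
    """
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = list(perm), total
    return best_perm, best_cost
```

In DETR, each cost entry is the weighted sum of classification, L1-box, and GIoU costs between one prediction and one ground-truth object; unmatched queries are trained toward the no-object class.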

4.3 RT-DETR (Real-Time DETR)

from ultralytics import RTDETR

# Using RT-DETR (Ultralytics)
model = RTDETR("rtdetr-l.pt")

# Inference
results = model("image.jpg")

# Training
model.train(data="coco.yaml", epochs=100)

"""
RT-DETR highlights:
- Keeps DETR's end-to-end advantages
- Real-time inference (YOLO-class speed)
- Efficient Hybrid Encoder
- IoU-aware Query Selection
"""

5. μΈμŠ€ν„΄μŠ€ λΆ„ν•  (Instance Segmentation)

5.1 Mask R-CNN

import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn_v2
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def create_mask_rcnn(num_classes: int):
    """
    Build a custom Mask R-CNN.

    Mask R-CNN = Faster R-CNN + mask head
    """
    model = maskrcnn_resnet50_fpn_v2(weights="DEFAULT")

    # Replace the box predictor
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    # Replace the mask predictor
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    model.roi_heads.mask_predictor = MaskRCNNPredictor(
        in_features_mask, hidden_layer, num_classes
    )

    return model


def inference_mask_rcnn(model, image, threshold=0.5):
    """Mask R-CNN μΆ”λ‘ """

    model.eval()

    with torch.no_grad():
        predictions = model([image])

    pred = predictions[0]
    keep = pred["scores"] > threshold

    result = {
        "boxes": pred["boxes"][keep],
        "labels": pred["labels"][keep],
        "scores": pred["scores"][keep],
        "masks": pred["masks"][keep],  # (N, 1, H, W) soft masks
    }

    # Convert to hard (binary) masks
    result["masks"] = (result["masks"] > 0.5).squeeze(1)  # (N, H, W)

    return result


# Using YOLOv8-seg
from ultralytics import YOLO

seg_model = YOLO("yolov8n-seg.pt")
results = seg_model("image.jpg")

# Extract masks from the results
for result in results:
    if result.masks is not None:
        masks = result.masks.data  # (N, H, W)

5.2 SAM (Segment Anything Model)

from segment_anything import sam_model_registry, SamPredictor
import numpy as np

# Load the SAM model
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# Set the image
predictor.set_image(image)  # (H, W, 3) numpy array

# Segment from a point prompt
input_point = np.array([[500, 375]])  # clicked coordinate
input_label = np.array([1])  # 1 = foreground

masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True,  # return 3 candidate masks
)

# Segment from a box prompt
input_box = np.array([100, 100, 400, 400])  # x1, y1, x2, y2

masks, scores, logits = predictor.predict(
    box=input_box,
    multimask_output=False,
)

"""
SAM highlights:
- Promptable segmentation (point, box, text)
- Zero-shot generalization
- Very high-quality masks
- Large and slow → lightweight variants exist (MobileSAM, FastSAM)
"""

6. μ‹€μ „ 팁

6.1 Dataset Formats

"""
μ£Όμš” 데이터셋 ν˜•μ‹:

1. COCO Format:
   - annotations.json에 λͺ¨λ“  μ–΄λ…Έν…Œμ΄μ…˜ μ €μž₯
   - 이미지와 μ–΄λ…Έν…Œμ΄μ…˜ 뢄리

2. YOLO Format:
   - 각 μ΄λ―Έμ§€λ§ˆλ‹€ .txt 파일
   - class x_center y_center width height (μ •κ·œν™”)

3. Pascal VOC Format:
   - XML 파일둜 각 이미지 μ–΄λ…Έν…Œμ΄μ…˜
"""

# YOLO format μ˜ˆμ‹œ (labels/train/image001.txt)
"""
0 0.5 0.5 0.2 0.3
1 0.3 0.7 0.1 0.15
"""

# COCO to YOLO conversion
def coco_to_yolo(coco_box, img_width, img_height):
    """
    COCO: [x_min, y_min, width, height]
    YOLO: [x_center, y_center, width, height] (normalized)
    """
    x, y, w, h = coco_box

    x_center = (x + w / 2) / img_width
    y_center = (y + h / 2) / img_height
    w_norm = w / img_width
    h_norm = h / img_height

    return [x_center, y_center, w_norm, h_norm]
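Box-format conversions are a classic source of off-by-half-width bugs. Writing the inverse (our own illustrative helper, with the converter restated so the snippet is self-contained) makes the conversion round-trip-testable:

```python
def coco_to_yolo(coco_box, img_width, img_height):
    """COCO [x_min, y_min, w, h] -> normalized YOLO [xc, yc, w, h]."""
    x, y, w, h = coco_box
    return [(x + w / 2) / img_width, (y + h / 2) / img_height,
            w / img_width, h / img_height]

def yolo_to_coco(yolo_box, img_width, img_height):
    """Inverse conversion: normalized YOLO box back to COCO format."""
    xc, yc, w, h = yolo_box
    w_abs, h_abs = w * img_width, h * img_height
    return [xc * img_width - w_abs / 2, yc * img_height - h_abs / 2,
            w_abs, h_abs]
```

Any box should survive the round trip up to floating-point error; checking a handful of boxes this way catches swapped axes and forgotten half-width offsets early.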

6.2 ν•™μŠ΅ 팁

"""
객체 탐지 ν•™μŠ΅ 체크리슀트:

1. 데이터 ν’ˆμ§ˆ
   - 라벨 정확도 확인
   - 클래슀 λΆˆκ· ν˜• 처리 (Focal Loss, μ˜€λ²„μƒ˜ν”Œλ§)
   - μ μ ˆν•œ 증강 μ‚¬μš©

2. ν•˜μ΄νΌνŒŒλΌλ―Έν„°
   - ν•™μŠ΅λ₯ : 1e-4 ~ 1e-3 (μ‚¬μ „ν•™μŠ΅ μ‹œμž‘)
   - 배치 크기: GPU λ©”λͺ¨λ¦¬μ— 맞좰 μ΅œλŒ€ν•œ 크게
   - 이미지 크기: λͺ¨λΈ κΈ°λ³Έκ°’ μ‚¬μš© (YOLO: 640)

3. 증강 μ „λž΅
   - Mosaic: 4개 이미지 ν•©μ„± (YOLO)
   - MixUp: 이미지 λΈ”λ Œλ”©
   - κΈ°λ³Έ: Flip, Scale, Color Jitter

4. λͺ¨λΈ 선택
   - μ‹€μ‹œκ°„: YOLO (YOLOv8n, YOLOv8s)
   - 정확도: Faster R-CNN, DETR
   - λΆ„ν• : YOLOv8-seg, Mask R-CNN
"""

# Ultralytics training example
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

model.train(
    data="data.yaml",
    epochs=100,
    imgsz=640,
    batch=16,

    # Augmentation
    mosaic=1.0,      # Mosaic probability
    mixup=0.0,       # MixUp probability
    hsv_h=0.015,     # hue augmentation
    hsv_s=0.7,       # saturation augmentation
    hsv_v=0.4,       # value augmentation
    degrees=0.0,     # rotation
    translate=0.1,   # translation
    scale=0.5,       # scaling
    fliplr=0.5,      # horizontal flip

    # Regularization
    weight_decay=0.0005,

    # Training schedule
    warmup_epochs=3,
    warmup_momentum=0.8,
    warmup_bias_lr=0.1,
    lr0=0.01,        # initial learning rate
    lrf=0.01,        # final LR as a fraction of lr0
)

Summary

Detector Selection Guide

μš”κ΅¬μ‚¬ν•­ μΆ”μ²œ λͺ¨λΈ
μ‹€μ‹œκ°„ (30+ FPS) YOLOv8n/s
높은 정확도 YOLOv8x, Faster R-CNN
μž‘μ€ 객체 YOLO + SAHI, RetinaNet
μΈμŠ€ν„΄μŠ€ λΆ„ν•  YOLOv8-seg, Mask R-CNN
End-to-end DETR, RT-DETR
Zero-shot Grounding DINO, SAM

핡심 κ°œλ… μš”μ•½

κ°œλ… μ„€λͺ…
IoU λ°•μŠ€ κ²ΉμΉ¨ 정도 (0~1)
mAP 평균 정밀도 (정확도 μ§€ν‘œ)
NMS 쀑볡 λ°•μŠ€ 제거 ν›„μ²˜λ¦¬
Anchor 사전 μ •μ˜λœ κΈ°μ€€ λ°•μŠ€
FPN 닀쀑 μŠ€μΌ€μΌ νŠΉμ§• μΆ”μΆœ
GIoU/CIoU κ°œμ„ λœ IoU 손싀 ν•¨μˆ˜
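Of these, NMS is compact enough to sketch in plain Python (the greedy variant; production code would use an optimized implementation such as torchvision.ops.nms):

```python
def nms(boxes, scores, iou_thr=0.5):
    """Greedy Non-Maximum Suppression.

    boxes: list of [x1, y1, x2, y2]; scores: matching confidences.
    Returns indices of kept boxes, highest score first.
    """
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop remaining boxes that overlap the kept one too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thr]
    return keep
```

Two heavily overlapping detections of the same object collapse to the higher-scoring one, while a distant detection survives.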

λ‹€μŒ 단계

  • Computer Vision ν† ν”½μ—μ„œ μ‹œλ§¨ν‹± λΆ„ν• (Semantic Segmentation)을 ν•™μŠ΅ν•©λ‹ˆλ‹€.
  • Computer_Vision/19_DNN_Module.md: OpenCV DNN

References

Papers

  • "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks" (Ren et al., 2015)
  • "You Only Look Once: Unified, Real-Time Object Detection" (Redmon et al., 2016)
  • "End-to-End Object Detection with Transformers" (Carion et al., 2020)
  • "Segment Anything" (Kirillov et al., 2023)

μ½”λ“œ & 자료

to navigate between lessons