Previous: Modern Deep Learning Architectures | Next: Hands-on Image Classification Project
38. Object Detection
Learning Objectives
- Understand the difference between two-stage and one-stage detectors
- Follow the evolution of the YOLO architecture
- Understand the structure of Faster R-CNN and its RPN
- Grasp the concept of DETR (Detection Transformer)
- Practice with PyTorch/Ultralytics
1. Object Detection Overview
1.1 Problem Definition
┌─ Computer Vision Task Comparison ─────────────────────────────
│ 1. Image Classification
│    - Whole image → a single class
│    - Output: "dog"
│
│ 2. Object Detection
│    - Image → location + class for multiple objects
│    - Output: [(x1, y1, x2, y2, "dog", 0.95), ...]
│
│ 3. Semantic Segmentation
│    - Assigns a class to every pixel
│    - Does not distinguish objects of the same class
│
│ 4. Instance Segmentation
│    - Object detection + a pixel mask per object
│    - Distinguishes individual objects even within the same class
└───────────────────────────────────────────────────────────────
1.2 Detector Taxonomy
┌─ Detector Taxonomy ───────────────────────────────────────────
│ Two-Stage Detectors
│   Stage 1: Region proposal (generate candidate regions)
│   Stage 2: Classification + box regression
│   Examples: R-CNN → Fast R-CNN → Faster R-CNN
│   Pros: high accuracy
│   Cons: slow
│
│ One-Stage Detectors
│   Predict locations and classes in a single network pass
│   Examples: YOLO, SSD, RetinaNet, CenterNet
│   Pros: fast, real-time capable
│   Cons: historically weaker on small objects (much improved)
│
│ Transformer-based Detectors
│   DETR, Deformable DETR, RT-DETR
│   End-to-end training, no NMS required
└───────────────────────────────────────────────────────────────
1.3 Evaluation Metrics
"""
κ°μ²΄ νμ§ νκ° μ§ν
"""
def calculate_iou(box1, box2):
"""
IoU (Intersection over Union) κ³μ°
Args:
box1, box2: [x1, y1, x2, y2] νμ
Returns:
IoU κ° (0~1)
"""
# κ΅μ§ν© μ’ν
x1 = max(box1[0], box2[0])
y1 = max(box1[1], box2[1])
x2 = min(box1[2], box2[2])
y2 = min(box1[3], box2[3])
# κ΅μ§ν© λ©΄μ
inter_area = max(0, x2 - x1) * max(0, y2 - y1)
# ν©μ§ν© λ©΄μ
box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])
union_area = box1_area + box2_area - inter_area
return inter_area / union_area if union_area > 0 else 0
# μμ
pred_box = [100, 100, 200, 200]
gt_box = [120, 110, 210, 210]
print(f"IoU: {calculate_iou(pred_box, gt_box):.3f}") # μ½ 0.68
"""
mAP (mean Average Precision) κ³μ° κ³Όμ :
1. κ° ν΄λμ€λ³λ‘:
- μμΈ‘μ confidence μμΌλ‘ μ λ ¬
- IoU > thresholdμΈ κ²½μ° TP, μλλ©΄ FP
- Precision-Recall 곑μ κ³μ°
- AP = 곑μ μλ λ©΄μ
2. mAP = λͺ¨λ ν΄λμ€ APμ νκ·
COCO λ°μ΄ν°μ
κΈ°μ€:
- mAP@0.5: IoU=0.5 κΈ°μ€
- mAP@0.75: IoU=0.75 κΈ°μ€ (μ격)
- mAP@[.5:.95]: 0.5~0.95 IoUμ νκ·
"""
2. The R-CNN Family
2.1 Evolution of R-CNN
┌─ R-CNN Family Evolution ──────────────────────────────────────
│ R-CNN (2014):
│   Image → Selective Search (~2,000 proposals) → CNN (AlexNet) → SVM classifier
│   Problem: ~2,000 CNN forward passes per image → very slow
│
│ Fast R-CNN (2015):
│   Image → CNN feature map → RoI Pooling → FC + Softmax
│   Improvement: one CNN pass; RoI Pooling extracts proposal features from the map
│   Problem: Selective Search is still slow
│
│ Faster R-CNN (2015):
│   Image → Backbone (ResNet) → RPN → Detection head
│   Innovation: region proposals are learned by the RPN
└───────────────────────────────────────────────────────────────
2.2 Faster R-CNN Structure
import torch
import torch.nn as nn
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn_v2
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

class CustomFasterRCNN:
    """Custom Faster R-CNN model."""

    def __init__(self, num_classes: int, pretrained: bool = True):
        """
        Args:
            num_classes: number of classes including background (e.g. 10 objects → 11)
            pretrained: use COCO pretrained weights
        """
        # Load the pretrained model
        self.model = fasterrcnn_resnet50_fpn_v2(
            weights="DEFAULT" if pretrained else None
        )
        # Replace the box predictor to match the class count
        in_features = self.model.roi_heads.box_predictor.cls_score.in_features
        self.model.roi_heads.box_predictor = FastRCNNPredictor(
            in_features, num_classes
        )

    def get_model(self):
        return self.model
def train_faster_rcnn():
    """Faster R-CNN training example."""
    # Create the model (background + 10 classes)
    model = CustomFasterRCNN(num_classes=11).get_model()
    model.train()
    # Dummy data (boxes must be float tensors)
    images = [torch.rand(3, 600, 800) for _ in range(2)]
    targets = [
        {
            "boxes": torch.tensor([[100, 100, 200, 200], [300, 300, 400, 400]],
                                  dtype=torch.float),
            "labels": torch.tensor([1, 2]),  # class IDs
        },
        {
            "boxes": torch.tensor([[50, 50, 150, 150]], dtype=torch.float),
            "labels": torch.tensor([3]),
        },
    ]
    # Forward pass (in train mode the model returns losses)
    loss_dict = model(images, targets)
    # Loss components:
    # - loss_classifier: class classification loss
    # - loss_box_reg: box regression loss
    # - loss_objectness: RPN object/background classification
    # - loss_rpn_box_reg: RPN box regression
    total_loss = sum(loss for loss in loss_dict.values())
    print(f"Total loss: {total_loss.item():.4f}")
    return loss_dict
def inference_faster_rcnn(model, image, threshold=0.5):
    """Faster R-CNN inference."""
    model.eval()
    with torch.no_grad():
        predictions = model([image])
    pred = predictions[0]
    # Keep only predictions above the threshold
    keep = pred["scores"] > threshold
    result = {
        "boxes": pred["boxes"][keep],
        "labels": pred["labels"][keep],
        "scores": pred["scores"][keep],
    }
    return result
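Raw detector outputs usually contain several overlapping boxes for the same object. NMS (Non-Maximum Suppression) keeps only the highest-scoring box in each overlapping group; torchvision ships this as `torchvision.ops.nms`. A minimal pure-Python sketch of the algorithm, with an inline IoU helper so it runs standalone:

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the best one too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [[100, 100, 200, 200], [105, 105, 205, 205], [300, 300, 400, 400]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] — box 1 is suppressed by box 0
```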
2.3 RPN (Region Proposal Network)
"""
RPN ν΅μ¬ κ°λ
:
1. Anchor Boxes:
- κ° μμΉμμ μ¬λ¬ ν¬κΈ°/λΉμ¨μ λ°μ€ 미리 μ μ
- μ: 3κ° ν¬κΈ° Γ 3κ° λΉμ¨ = 9κ° anchor
2. μΆλ ₯:
- objectness score: κ°μ²΄ μ‘΄μ¬ νλ₯ (2-class)
- box regression: anchor β μ€μ λ°μ€ λ³ν
3. νμ΅:
- Positive: IoU > 0.7μΈ anchor
- Negative: IoU < 0.3μΈ anchor
- 무μ: 0.3~0.7 μ¬μ΄
"""
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleRPN(nn.Module):
    """Simplified RPN implementation."""

    def __init__(
        self,
        in_channels: int = 256,
        num_anchors: int = 9,  # 3 scales × 3 ratios
    ):
        super().__init__()
        # Process features with a 3×3 conv
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        # Objectness prediction (object/background)
        self.objectness = nn.Conv2d(in_channels, num_anchors * 2, 1)
        # Box regression (dx, dy, dw, dh)
        self.bbox_reg = nn.Conv2d(in_channels, num_anchors * 4, 1)

    def forward(self, feature_map):
        """
        Args:
            feature_map: (B, C, H, W)
        Returns:
            objectness: (B, num_anchors*2, H, W)
            bbox_deltas: (B, num_anchors*4, H, W)
        """
        x = F.relu(self.conv(feature_map))
        objectness = self.objectness(x)
        bbox_deltas = self.bbox_reg(x)
        return objectness, bbox_deltas
def generate_anchors(
    feature_size: tuple,
    anchor_scales: tuple = (128, 256, 512),
    anchor_ratios: tuple = (0.5, 1.0, 2.0),
    stride: int = 16,
):
    """
    Generate anchor boxes.
    Args:
        feature_size: (H, W) feature map size
        anchor_scales: square roots of anchor areas
        anchor_ratios: width/height ratios
        stride: downsampling ratio relative to the input image
    Returns:
        anchors: (H*W*num_anchors, 4) anchor coordinates
    """
    H, W = feature_size
    anchors = []
    for h in range(H):
        for w in range(W):
            # Feature map position → input image coordinates
            cx = (w + 0.5) * stride
            cy = (h + 0.5) * stride
            for scale in anchor_scales:
                for ratio in anchor_ratios:
                    # Width/height for the given aspect ratio
                    anchor_w = scale * (ratio ** 0.5)
                    anchor_h = scale / (ratio ** 0.5)
                    # (x1, y1, x2, y2) format
                    anchors.append([
                        cx - anchor_w / 2,
                        cy - anchor_h / 2,
                        cx + anchor_w / 2,
                        cy + anchor_h / 2,
                    ])
    return torch.tensor(anchors)

# Example
anchors = generate_anchors((38, 50))  # 600×800 image, stride=16
print(f"Generated {len(anchors)} anchors")  # 38*50*9 = 17,100
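Given anchors and ground-truth boxes, the IoU thresholds from section 2.3 assign each anchor a training label. A minimal pure-Python sketch (simplification: the real Faster R-CNN additionally forces the best-matching anchor for each ground-truth box to be positive; the IoU helper is inlined so the block runs standalone):

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Label each anchor: 1 = positive, 0 = negative, -1 = ignored."""
    labels = []
    for anchor in anchors:
        best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
        if best > pos_thr:
            labels.append(1)    # clearly covers an object
        elif best < neg_thr:
            labels.append(0)    # clearly background
        else:
            labels.append(-1)   # ambiguous: excluded from the loss
    return labels

anchors = [[0, 0, 100, 100], [90, 90, 200, 200], [30, 30, 110, 110]]
gt_boxes = [[0, 0, 100, 100]]
print(label_anchors(anchors, gt_boxes))  # [1, 0, -1]
```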
3. YOLO (You Only Look Once)
3.1 YOLO Version History
┌─ YOLO Version Comparison ─────────────────────────────────────
│ YOLOv1 (2016): grid-based detection with a single CNN
│ YOLOv2 (2017): Batch Norm, anchor boxes introduced
│ YOLOv3 (2018): Darknet-53, FPN, predictions at 3 scales
│ YOLOv4 (2020): CSPDarknet, SPP, PANet
│ YOLOv5 (2020): PyTorch implementation, Ultralytics
│ YOLOv6 (2022): speed optimizations, EfficientRep
│ YOLOv7 (2022): E-ELAN, auxiliary head
│ YOLOv8 (2023): unified framework, anchor-free
│ YOLOv9 (2024): GELAN, PGI
│ YOLOv10 (2024): NMS-free, dual assignments
│ YOLO11 (2024): faster and more accurate successor
│
│ Performance (COCO val2017)    mAP50-95    Speed (ms)
│ ────────────────────────────────────────────────────
│ YOLOv8n                       37.3        1.2
│ YOLOv8s                       44.9        1.9
│ YOLOv8m                       50.2        4.3
│ YOLOv8l                       52.9        6.7
│ YOLOv8x                       53.9        9.8
│ YOLO11x                       54.7        11.3
└───────────────────────────────────────────────────────────────
3.2 Hands-on with Ultralytics YOLOv8
from ultralytics import YOLO
import torch

# ===============================
# 1. Load a model
# ===============================
# Load a pretrained model
model = YOLO("yolov8n.pt")  # nano version (fastest)
# model = YOLO("yolov8s.pt")  # small
# model = YOLO("yolov8m.pt")  # medium
# model = YOLO("yolov8l.pt")  # large
# model = YOLO("yolov8x.pt")  # extra-large

# Build an untrained model from a config, then train
# model = YOLO("yolov8n.yaml")

# ===============================
# 2. Inference
# ===============================
def detect_objects(image_path: str, conf_threshold: float = 0.25):
    """Detect objects in an image."""
    results = model(image_path, conf=conf_threshold)
    for result in results:
        boxes = result.boxes
        print(f"Detected objects: {len(boxes)}")
        for box in boxes:
            # Coordinates
            x1, y1, x2, y2 = box.xyxy[0].tolist()
            # Class and confidence
            cls = int(box.cls[0])
            conf = float(box.conf[0])
            class_name = model.names[cls]
            print(f"  {class_name}: {conf:.2f} at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
    return results

# Usage example
# results = detect_objects("image.jpg")

# Visualize results
# results[0].show()              # display the image
# results[0].save("result.jpg")  # save to disk
# ===============================
# 3. Video/webcam detection
# ===============================
def detect_video(source=0):
    """
    Real-time detection on a video or webcam stream.
    Args:
        source: 0 = webcam, or a path to a video file
    """
    results = model(source, stream=True)  # returns a generator
    for result in results:
        # Per-frame processing
        annotated_frame = result.plot()  # frame with boxes drawn
        # Display here with cv2.imshow() etc. if desired
        yield annotated_frame
# ===============================
# 4. Training on a custom dataset
# ===============================
def train_custom_model():
    """Train YOLO on a custom dataset."""
    # Example dataset yaml file (data.yaml):
    """
    path: /path/to/dataset
    train: images/train
    val: images/val
    names:
      0: cat
      1: dog
      2: bird
    """
    # Train the model
    model = YOLO("yolov8n.pt")
    results = model.train(
        data="data.yaml",
        epochs=100,
        imgsz=640,
        batch=16,
        device=0,      # GPU 0, or "cpu"
        patience=50,   # early stopping
        save=True,
        project="runs/detect",
        name="custom_model",
    )
    return results
# ===============================
# 5. Exporting the model
# ===============================
def export_model():
    """Export the model to various formats."""
    model = YOLO("yolov8n.pt")
    # Export to ONNX
    model.export(format="onnx")
    # Export to TensorRT (optimized GPU inference)
    # model.export(format="engine")
    # Export to CoreML (Apple)
    # model.export(format="coreml")
    # Export to TFLite (mobile)
    # model.export(format="tflite")
3.3 YOLOv8 Loss Functions
"""
YOLOv8 μμ€ ν¨μ ꡬμ±:
1. Box Loss (CIoU Loss):
- λ°μ€ μμΉμ ν¬κΈ°μ μ νλ
- CIoU = IoU - (거리 νλν° + μ’
ν‘λΉ νλν°)
2. Classification Loss (BCE):
- κ° ν΄λμ€μ λν μ΄μ§ κ΅μ°¨ μνΈλ‘νΌ
- Focal Loss λ³ν μ¬μ© κ°λ₯
3. DFL Loss (Distribution Focal Loss):
- λ°μ€ κ²½κ³μ λΆν¬ μμΈ‘
- YOLOv8μ μλ‘μ΄ νκ· λ°©μ
Total Loss = Ξ»_box * L_box + Ξ»_cls * L_cls + Ξ»_dfl * L_dfl
"""
import torch
import torch.nn as nn
import torch.nn.functional as F

def ciou_loss(pred_boxes, target_boxes, eps=1e-7):
    """
    Complete IoU loss.
    Args:
        pred_boxes: (N, 4) predicted boxes [x1, y1, x2, y2]
        target_boxes: (N, 4) ground-truth boxes
    Returns:
        CIoU loss per box
    """
    # IoU
    inter_x1 = torch.max(pred_boxes[:, 0], target_boxes[:, 0])
    inter_y1 = torch.max(pred_boxes[:, 1], target_boxes[:, 1])
    inter_x2 = torch.min(pred_boxes[:, 2], target_boxes[:, 2])
    inter_y2 = torch.min(pred_boxes[:, 3], target_boxes[:, 3])
    inter_area = torch.clamp(inter_x2 - inter_x1, min=0) * \
                 torch.clamp(inter_y2 - inter_y1, min=0)
    pred_area = (pred_boxes[:, 2] - pred_boxes[:, 0]) * \
                (pred_boxes[:, 3] - pred_boxes[:, 1])
    target_area = (target_boxes[:, 2] - target_boxes[:, 0]) * \
                  (target_boxes[:, 3] - target_boxes[:, 1])
    union_area = pred_area + target_area - inter_area
    iou = inter_area / (union_area + eps)
    # Center distance
    pred_cx = (pred_boxes[:, 0] + pred_boxes[:, 2]) / 2
    pred_cy = (pred_boxes[:, 1] + pred_boxes[:, 3]) / 2
    target_cx = (target_boxes[:, 0] + target_boxes[:, 2]) / 2
    target_cy = (target_boxes[:, 1] + target_boxes[:, 3]) / 2
    center_dist_sq = (pred_cx - target_cx) ** 2 + (pred_cy - target_cy) ** 2
    # Diagonal of the smallest enclosing box
    enclose_x1 = torch.min(pred_boxes[:, 0], target_boxes[:, 0])
    enclose_y1 = torch.min(pred_boxes[:, 1], target_boxes[:, 1])
    enclose_x2 = torch.max(pred_boxes[:, 2], target_boxes[:, 2])
    enclose_y2 = torch.max(pred_boxes[:, 3], target_boxes[:, 3])
    enclose_diag_sq = (enclose_x2 - enclose_x1) ** 2 + \
                      (enclose_y2 - enclose_y1) ** 2
    # Aspect-ratio consistency term
    pred_w = pred_boxes[:, 2] - pred_boxes[:, 0]
    pred_h = pred_boxes[:, 3] - pred_boxes[:, 1]
    target_w = target_boxes[:, 2] - target_boxes[:, 0]
    target_h = target_boxes[:, 3] - target_boxes[:, 1]
    v = (4 / (torch.pi ** 2)) * \
        (torch.atan(target_w / (target_h + eps)) -
         torch.atan(pred_w / (pred_h + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    # CIoU
    ciou = iou - (center_dist_sq / (enclose_diag_sq + eps)) - alpha * v
    return 1 - ciou
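The DFL term described above can be sketched as well: the model predicts logits over discrete distance bins, and the loss is cross-entropy spread over the two bins adjacent to the continuous target, weighted by proximity. This is a simplified sketch of the idea, not the exact Ultralytics implementation:

```python
import torch
import torch.nn.functional as F

def dfl_loss(pred_dist, target, reg_max=16):
    """Distribution Focal Loss sketch.

    pred_dist: (N, reg_max + 1) logits over discrete distance bins
    target:    (N,) continuous targets in [0, reg_max]
    """
    tl = target.long().clamp(max=reg_max - 1)  # left bin index
    tr = tl + 1                                # right bin index
    wl = tr.float() - target                   # weight toward the left bin
    wr = 1.0 - wl                              # weight toward the right bin
    # Cross-entropy against both neighboring bins, linearly weighted
    loss = (F.cross_entropy(pred_dist, tl, reduction="none") * wl
            + F.cross_entropy(pred_dist, tr, reduction="none") * wr)
    return loss.mean()

# A target of 3.7 is mostly explained by bin 4, partly by bin 3
pred = torch.randn(2, 17)  # 2 predictions, reg_max=16 → 17 bins
target = torch.tensor([3.7, 10.0])
print(dfl_loss(pred, target))
```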
4. DETR (Detection Transformer)
4.1 DETR Concept
┌─ DETR Architecture ───────────────────────────────────────────
│ Problems with conventional detectors:
│   - Anchors must be hand-designed
│   - NMS (Non-Maximum Suppression) post-processing required
│   - Complex pipelines
│
│ DETR's innovations:
│   - End-to-end training
│   - Objects predicted directly from object queries
│   - Trained with Hungarian matching
│   - No NMS needed
│
│ Backbone (ResNet) → Transformer Encoder/Decoder → FFN heads (class + box)
│   Feature map + positional encoding → encoder
│   Object queries (100) → decoder (self- + cross-attention) → 100 predictions
└───────────────────────────────────────────────────────────────
4.2 DETR Implementation
import torch
import torch.nn as nn
from torchvision.models import resnet50
import torch.nn.functional as F

class DETR(nn.Module):
    """
    Simplified DETR implementation.
    """

    def __init__(
        self,
        num_classes: int,
        num_queries: int = 100,
        hidden_dim: int = 256,
        nheads: int = 8,
        num_encoder_layers: int = 6,
        num_decoder_layers: int = 6,
    ):
        super().__init__()
        # Backbone
        backbone = resnet50(weights="DEFAULT")
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        # Feature map → hidden_dim
        self.conv = nn.Conv2d(2048, hidden_dim, 1)
        # Transformer
        self.transformer = nn.Transformer(
            d_model=hidden_dim,
            nhead=nheads,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            batch_first=True,
        )
        # Object queries (learned embeddings)
        self.query_embed = nn.Embedding(num_queries, hidden_dim)
        # Output heads
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for no-object
        self.bbox_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 4),  # cx, cy, w, h
            nn.Sigmoid(),
        )
        # Positional encoding
        self.row_embed = nn.Embedding(50, hidden_dim // 2)
        self.col_embed = nn.Embedding(50, hidden_dim // 2)

    def forward(self, x):
        """
        Args:
            x: (B, 3, H, W) input images
        Returns:
            class_logits: (B, num_queries, num_classes+1)
            bbox_pred: (B, num_queries, 4)
        """
        B = x.shape[0]
        # Backbone features
        features = self.backbone(x)     # (B, 2048, H/32, W/32)
        features = self.conv(features)  # (B, 256, H/32, W/32)
        _, _, H, W = features.shape
        # Positional encoding
        pos_embed = self._get_positional_encoding(H, W, features.device)
        # Flatten for the Transformer
        src = features.flatten(2).permute(0, 2, 1)  # (B, H*W, 256)
        src = src + pos_embed.flatten(0, 1).unsqueeze(0).expand(B, -1, -1)
        # Object queries
        query_embed = self.query_embed.weight.unsqueeze(0).expand(B, -1, -1)
        # Transformer
        tgt = torch.zeros_like(query_embed)
        hs = self.transformer(src, tgt + query_embed)  # (B, num_queries, 256)
        # Predictions
        class_logits = self.class_head(hs)
        bbox_pred = self.bbox_head(hs)
        return class_logits, bbox_pred

    def _get_positional_encoding(self, H, W, device):
        """Build a 2D positional encoding."""
        i = torch.arange(W, device=device)
        j = torch.arange(H, device=device)
        x_embed = self.col_embed(i)  # (W, 128)
        y_embed = self.row_embed(j)  # (H, 128)
        pos = torch.cat([
            x_embed.unsqueeze(0).expand(H, -1, -1),
            y_embed.unsqueeze(1).expand(-1, W, -1),
        ], dim=-1)  # (H, W, 256)
        return pos
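DETR's raw output is a fixed-size set of predictions; at inference time, queries whose most likely class is "no-object" are simply dropped, with no NMS. A post-processing sketch on dummy tensors (shapes as in the forward pass above; the 0.7 threshold is an illustrative choice):

```python
import torch

def postprocess_detr(class_logits, bbox_pred, conf_threshold=0.7):
    """Keep queries whose best real class is above the threshold.

    class_logits: (B, num_queries, num_classes + 1); last index = no-object
    bbox_pred:    (B, num_queries, 4) normalized (cx, cy, w, h)
    """
    probs = class_logits.softmax(-1)[..., :-1]  # drop the no-object column
    scores, labels = probs.max(-1)              # best real class per query
    results = []
    for b in range(class_logits.shape[0]):
        keep = scores[b] > conf_threshold
        results.append({
            "scores": scores[b][keep],
            "labels": labels[b][keep],
            "boxes": bbox_pred[b][keep],  # still normalized cx, cy, w, h
        })
    return results

# Dummy outputs: batch of 1, 100 queries, 91 classes + no-object
class_logits = torch.randn(1, 100, 92)
bbox_pred = torch.rand(1, 100, 4)
results = postprocess_detr(class_logits, bbox_pred)
print(len(results[0]["boxes"]), "queries kept")
```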
# Hungarian matching loss (simplified)
class HungarianMatcher:
    """
    Optimally match predictions to ground truth.
    Cost = λ_cls * L_cls + λ_box * L_box + λ_giou * L_giou
    """

    def __init__(self, cost_class=1, cost_bbox=5, cost_giou=2):
        self.cost_class = cost_class
        self.cost_bbox = cost_bbox
        self.cost_giou = cost_giou

    def __call__(self, outputs, targets):
        """
        Solve the bipartite matching with
        scipy.optimize.linear_sum_assignment.
        """
        # Implementation omitted (uses scipy)
        pass
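The omitted matching step can be sketched with scipy. The sketch below builds a simplified cost from the class probability and an L1 box distance only (the GIoU term is left out for brevity) and solves the assignment:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes,
                    cost_class=1.0, cost_bbox=5.0):
    """Match each ground-truth box to one prediction, minimizing total cost.

    pred_probs: (num_queries, num_classes) class probabilities
    pred_boxes: (num_queries, 4) predicted boxes
    gt_labels:  (num_gt,) ground-truth class indices
    gt_boxes:   (num_gt, 4) ground-truth boxes
    """
    # Classification cost: negative probability of the GT class
    c_class = -pred_probs[:, gt_labels]  # (num_queries, num_gt)
    # Box cost: L1 distance between boxes
    c_bbox = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    cost = cost_class * c_class + cost_bbox * c_bbox
    pred_idx, gt_idx = linear_sum_assignment(cost)  # minimizes total cost
    return list(zip(pred_idx, gt_idx))

# 3 queries, 2 classes, 2 GT objects
pred_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
pred_boxes = np.array([[0.1, 0.1, 0.2, 0.2],
                       [0.7, 0.7, 0.2, 0.2],
                       [0.4, 0.4, 0.2, 0.2]])
gt_labels = np.array([0, 1])
gt_boxes = np.array([[0.1, 0.1, 0.2, 0.2], [0.7, 0.7, 0.2, 0.2]])
matches = hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes)
print([(int(p), int(g)) for p, g in matches])  # [(0, 0), (1, 1)]
```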
4.3 RT-DETR (Real-Time DETR)
from ultralytics import RTDETR

# RT-DETR via Ultralytics
model = RTDETR("rtdetr-l.pt")
# Inference
results = model("image.jpg")
# Training
model.train(data="coco.yaml", epochs=100)

"""
RT-DETR highlights:
- Keeps DETR's end-to-end advantages
- Real-time inference (YOLO-level speed)
- Efficient hybrid encoder
- IoU-aware query selection
"""
5. Instance Segmentation
5.1 Mask R-CNN
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn_v2
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def create_mask_rcnn(num_classes: int):
    """
    Create a custom Mask R-CNN.
    Mask R-CNN = Faster R-CNN + mask head
    """
    model = maskrcnn_resnet50_fpn_v2(weights="DEFAULT")
    # Replace the box predictor
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    # Replace the mask predictor
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    model.roi_heads.mask_predictor = MaskRCNNPredictor(
        in_features_mask, hidden_layer, num_classes
    )
    return model
def inference_mask_rcnn(model, image, threshold=0.5):
    """Mask R-CNN inference."""
    model.eval()
    with torch.no_grad():
        predictions = model([image])
    pred = predictions[0]
    keep = pred["scores"] > threshold
    result = {
        "boxes": pred["boxes"][keep],
        "labels": pred["labels"][keep],
        "scores": pred["scores"][keep],
        "masks": pred["masks"][keep],  # (N, 1, H, W) soft masks
    }
    # Convert to hard masks
    result["masks"] = (result["masks"] > 0.5).squeeze(1)  # (N, H, W)
    return result

# Using YOLOv8-seg
from ultralytics import YOLO

seg_model = YOLO("yolov8n-seg.pt")
results = seg_model("image.jpg")
# Extract masks from the results
for result in results:
    if result.masks is not None:
        masks = result.masks.data  # (N, H, W)
5.2 SAM (Segment Anything Model)
from segment_anything import sam_model_registry, SamPredictor
import numpy as np

# Load a SAM model
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# Set the image (an (H, W, 3) RGB numpy array, loaded beforehand)
predictor.set_image(image)

# Segmentation from a point prompt
input_point = np.array([[500, 375]])  # clicked coordinate
input_label = np.array([1])           # 1 = foreground
masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True,  # returns 3 candidate masks
)

# Segmentation from a box prompt
input_box = np.array([100, 100, 400, 400])  # x1, y1, x2, y2
masks, scores, logits = predictor.predict(
    box=input_box,
    multimask_output=False,
)

"""
SAM highlights:
- Promptable segmentation (point, box, text)
- Zero-shot generalization
- Very high segmentation quality
- Large model, slow inference → lightweight variants exist (MobileSAM, FastSAM)
"""
6. Practical Tips
6.1 Dataset Formats
"""
μ£Όμ λ°μ΄ν°μ
νμ:
1. COCO Format:
- annotations.jsonμ λͺ¨λ μ΄λ
Έν
μ΄μ
μ μ₯
- μ΄λ―Έμ§μ μ΄λ
Έν
μ΄μ
λΆλ¦¬
2. YOLO Format:
- κ° μ΄λ―Έμ§λ§λ€ .txt νμΌ
- class x_center y_center width height (μ κ·ν)
3. Pascal VOC Format:
- XML νμΌλ‘ κ° μ΄λ―Έμ§ μ΄λ
Έν
μ΄μ
"""
# YOLO format μμ (labels/train/image001.txt)
"""
0 0.5 0.5 0.2 0.3
1 0.3 0.7 0.1 0.15
"""
# COCO to YOLO conversion
def coco_to_yolo(coco_box, img_width, img_height):
    """
    COCO: [x_min, y_min, width, height]
    YOLO: [x_center, y_center, width, height] (normalized)
    """
    x, y, w, h = coco_box
    x_center = (x + w / 2) / img_width
    y_center = (y + h / 2) / img_height
    w_norm = w / img_width
    h_norm = h / img_height
    return [x_center, y_center, w_norm, h_norm]
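The inverse conversion is just as common when exporting predictions back to COCO-style evaluation tools; a sketch of the reverse mapping:

```python
def yolo_to_coco(yolo_box, img_width, img_height):
    """
    YOLO: [x_center, y_center, width, height] (normalized)
    COCO: [x_min, y_min, width, height] (pixels)
    """
    x_c, y_c, w, h = yolo_box
    w_px = w * img_width
    h_px = h * img_height
    x_min = x_c * img_width - w_px / 2
    y_min = y_c * img_height - h_px / 2
    return [x_min, y_min, w_px, h_px]

# Round-trips with coco_to_yolo above
print(yolo_to_coco([0.5, 0.5, 0.25, 0.5], 640, 480))  # [240.0, 120.0, 160.0, 240.0]
```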
6.2 Training Tips
"""
κ°μ²΄ νμ§ νμ΅ μ²΄ν¬λ¦¬μ€νΈ:
1. λ°μ΄ν° νμ§
- λΌλ²¨ μ νλ νμΈ
- ν΄λμ€ λΆκ· ν μ²λ¦¬ (Focal Loss, μ€λ²μνλ§)
- μ μ ν μ¦κ° μ¬μ©
2. νμ΄νΌνλΌλ―Έν°
- νμ΅λ₯ : 1e-4 ~ 1e-3 (μ¬μ νμ΅ μμ)
- λ°°μΉ ν¬κΈ°: GPU λ©λͺ¨λ¦¬μ λ§μΆ° μ΅λν ν¬κ²
- μ΄λ―Έμ§ ν¬κΈ°: λͺ¨λΈ κΈ°λ³Έκ° μ¬μ© (YOLO: 640)
3. μ¦κ° μ λ΅
- Mosaic: 4κ° μ΄λ―Έμ§ ν©μ± (YOLO)
- MixUp: μ΄λ―Έμ§ λΈλ λ©
- κΈ°λ³Έ: Flip, Scale, Color Jitter
4. λͺ¨λΈ μ ν
- μ€μκ°: YOLO (YOLOv8n, YOLOv8s)
- μ νλ: Faster R-CNN, DETR
- λΆν : YOLOv8-seg, Mask R-CNN
"""
# Ultralytics training example
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.train(
    data="data.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
    # Augmentation
    mosaic=1.0,     # Mosaic probability
    mixup=0.0,      # MixUp probability
    hsv_h=0.015,    # hue augmentation
    hsv_s=0.7,      # saturation augmentation
    hsv_v=0.4,      # value augmentation
    degrees=0.0,    # rotation
    translate=0.1,  # translation
    scale=0.5,      # scaling
    fliplr=0.5,     # horizontal flip
    # Regularization
    weight_decay=0.0005,
    # Learning-rate schedule
    warmup_epochs=3,
    warmup_momentum=0.8,
    warmup_bias_lr=0.1,
    lr0=0.01,       # initial learning rate
    lrf=0.01,       # final learning-rate fraction
)
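The mosaic augmentation enabled above composites four training images into one. A toy sketch of the geometric idea (real implementations also remap the box labels into the new canvas and apply a random crop around the joint):

```python
import numpy as np

def mosaic4(imgs):
    """Tile four equally-sized images into one 2×2 mosaic.

    imgs: list of four (H, W, 3) arrays
    """
    top = np.concatenate([imgs[0], imgs[1]], axis=1)     # top-left | top-right
    bottom = np.concatenate([imgs[2], imgs[3]], axis=1)  # bottom-left | bottom-right
    return np.concatenate([top, bottom], axis=0)

imgs = [np.full((320, 320, 3), i, dtype=np.uint8) for i in range(4)]
print(mosaic4(imgs).shape)  # (640, 640, 3)
```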
Summary

Detector selection guide

| Requirement | Recommended models |
|---|---|
| Real-time (30+ FPS) | YOLOv8n/s |
| High accuracy | YOLOv8x, Faster R-CNN |
| Small objects | YOLO + SAHI, RetinaNet |
| Instance segmentation | YOLOv8-seg, Mask R-CNN |
| End-to-end | DETR, RT-DETR |
| Zero-shot | Grounding DINO, SAM |
Key concepts

| Concept | Description |
|---|---|
| IoU | Degree of box overlap (0 to 1) |
| mAP | Mean average precision (accuracy metric) |
| NMS | Post-processing that removes duplicate boxes |
| Anchor | Predefined reference boxes |
| FPN | Multi-scale feature extraction |
| GIoU/CIoU | Improved IoU-based loss functions |
Next Steps

References

Papers
- "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks" (Ren et al., 2015)
- "You Only Look Once: Unified, Real-Time Object Detection" (Redmon et al., 2016)
- "End-to-End Object Detection with Transformers" (Carion et al., 2020)
- "Segment Anything" (Kirillov et al., 2023)

Code & Resources