DINOv2 & Self-Supervised Vision

Learning Objectives

  β€’ Understand the self-distillation mechanism behind DINO/DINOv2
  β€’ Grasp the teacher-student training paradigm
  β€’ Learn how to use dense visual features
  β€’ Use DINOv2 as a vision foundation model

1. Self-Supervised Learning in Vision: Recap

1.1 Why Self-Supervised?

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Self-Supervised Learning in Vision                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  Limitations of supervised learning:                            β”‚
β”‚  β€’ ImageNet: ~1.3M images, 1000 classes                         β”‚
β”‚  β€’ Labeling is expensive                                        β”‚
β”‚  β€’ Class labels carry only limited information                  β”‚
β”‚                                                                 β”‚
β”‚  Self-supervised learning:                                      β”‚
β”‚  β€’ Trains without labels (via pretext tasks)                    β”‚
β”‚  β€’ Can exploit billions of images                               β”‚
β”‚  β€’ Learns richer representations                                β”‚
β”‚                                                                 β”‚
β”‚  Main families of methods:                                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ Contrastive   β”‚ SimCLR, MoCo  β”‚ similar/dissimilar pairs β”‚   β”‚
β”‚  β”‚ Distillation  β”‚ DINO, BYOL    β”‚ teacher-student          β”‚   β”‚
β”‚  β”‚ Masked        β”‚ MAE, BEiT     β”‚ mask and reconstruct     β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

1.2 Review from the Deep_Learning Folder

Prerequisites: Deep_Learning/21_Self_Supervised_Learning.md
  β€’ SimCLR: contrastive learning basics
  β€’ MoCo: Momentum Contrast
  β€’ BYOL: Bootstrap Your Own Latent
  β€’ MAE: Masked Autoencoders


2. DINO (2021)

2.1 Core Idea

DINO (self-DIstillation with NO labels) applies knowledge distillation in a self-supervised setting: the teacher is not a pretrained network but an exponential moving average of the student, so training needs no labels.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    DINO Architecture                             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚                        Input Image                              β”‚
β”‚                            β”‚                                    β”‚
β”‚              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                      β”‚
β”‚              β–Ό                           β–Ό                      β”‚
β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”‚
β”‚     β”‚  Global Crops   β”‚         β”‚  Local Crops    β”‚            β”‚
β”‚     β”‚   (224Γ—224)     β”‚         β”‚   (96Γ—96)       β”‚            β”‚
β”‚     β”‚    Γ— 2          β”‚         β”‚    Γ— 6+         β”‚            β”‚
β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚
β”‚              β”‚                           β”‚                      β”‚
β”‚              β–Ό                           β–Ό                      β”‚
β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”‚
β”‚     β”‚ Teacher Network β”‚         β”‚ Student Network β”‚            β”‚
β”‚     β”‚   (EMA update)  β”‚         β”‚   (Gradient)    β”‚            β”‚
β”‚     β”‚   [stop-grad]   β”‚         β”‚                 β”‚            β”‚
β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚
β”‚              β”‚                           β”‚                      β”‚
β”‚              β–Ό                           β–Ό                      β”‚
β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”‚
β”‚     β”‚  Teacher Head   β”‚         β”‚  Student Head   β”‚            β”‚
β”‚     β”‚  (Projection)   β”‚         β”‚  (Projection)   β”‚            β”‚
β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚
β”‚              β”‚                           β”‚                      β”‚
β”‚              β–Ό                           β–Ό                      β”‚
β”‚          P_teacher                   P_student                  β”‚
β”‚              β”‚                           β”‚                      β”‚
β”‚              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                      β”‚
β”‚                          β–Ό                                      β”‚
β”‚                  Cross-Entropy Loss                             β”‚
β”‚                  H(P_t, P_s) = -Ξ£ P_t log(P_s)                 β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2.2 Key Components

import torch
import torch.nn as nn
import torch.nn.functional as F

class DINOHead(nn.Module):
    """
    DINO projection head.

    Structure: 3-layer MLP (GELU) down to a bottleneck,
    then L2 normalization, then a weight-normalized linear layer
    projecting to K prototype dimensions (e.g. 65536).
    """
    def __init__(self, in_dim, out_dim=65536, hidden_dim=2048, bottleneck_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, bottleneck_dim),
        )
        # Weight-normalized output layer (its weight magnitude is frozen at 1)
        self.last_layer = nn.utils.weight_norm(
            nn.Linear(bottleneck_dim, out_dim, bias=False)
        )
        self.last_layer.weight_g.data.fill_(1)

    def forward(self, x):
        x = self.mlp(x)
        x = F.normalize(x, dim=-1, p=2)  # L2-normalize the bottleneck
        x = self.last_layer(x)
        return x

class DINOLoss(nn.Module):
    """
    DINO loss: cross-entropy between teacher and student distributions.

    Key points:
    - Teacher: centering + sharpening (temperature Ο„_t < Ο„_s)
    - Student: plain softmax at a higher temperature
    - Center: moving average of teacher outputs (prevents collapse)
    """
    def __init__(self, out_dim, teacher_temp=0.04, student_temp=0.1, center_momentum=0.9):
        super().__init__()
        self.teacher_temp = teacher_temp
        self.student_temp = student_temp
        self.center_momentum = center_momentum
        self.register_buffer("center", torch.zeros(1, out_dim))

    def forward(self, student_output, teacher_output):
        """
        Simplified version: assumes student and teacher outputs are
        already paired view-by-view, i.e. both are (n_views * batch, out_dim).
        The full DINO loss averages the cross-entropy over all
        student/teacher view pairs, skipping identical views.
        """
        # Teacher: centering + sharpening
        teacher_out = F.softmax(
            (teacher_output - self.center) / self.teacher_temp, dim=-1
        )
        teacher_out = teacher_out.detach()  # stop gradient

        # Student: softmax with higher temperature
        student_out = F.log_softmax(student_output / self.student_temp, dim=-1)

        # Cross-entropy loss
        loss = torch.sum(-teacher_out * student_out, dim=-1).mean()

        # Update center (EMA)
        self.update_center(teacher_output)

        return loss

    @torch.no_grad()
    def update_center(self, teacher_output):
        batch_center = teacher_output.mean(dim=0, keepdim=True)
        self.center = self.center * self.center_momentum + batch_center * (1 - self.center_momentum)
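
A quick smoke test with dummy tensors makes the shapes concrete. This is a minimal sketch against the simplified, view-aligned interface above; out_dim is kept small only for readability.

# Minimal sketch: exercise DINOLoss with random, view-aligned outputs
loss_fn = DINOLoss(out_dim=128)

student_logits = torch.randn(8, 128)  # 8 crops' worth of student outputs
teacher_logits = torch.randn(8, 128)  # matching teacher outputs

loss = loss_fn(student_logits, teacher_logits)
print(loss.item())             # scalar cross-entropy
print(loss_fn.center.norm())   # center buffer was updated via EMA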

2.3 Multi-crop Strategy

"""
Multi-crop Strategy:

Global crops (2개):
- 크기: 224Γ—224 (μ›λ³Έμ˜ 50-100%)
- Teacher와 Student λͺ¨λ‘μ— μž…λ ₯
- 전체 이미지 λ§₯락 ν•™μŠ΅

Local crops (μ—¬λŸ¬ 개, 보톡 6-8개):
- 크기: 96Γ—96 (μ›λ³Έμ˜ 5-50%)
- Studentμ—λ§Œ μž…λ ₯
- μ§€μ—­ νŒ¨ν„΄ ν•™μŠ΅

λͺ©μ :
- "Local-to-Global" λŒ€μ‘ ν•™μŠ΅
- μž‘μ€ μ˜μ—­μ΄ 전체 μ΄λ―Έμ§€μ˜ μ–΄λ–€ 뢀뢄인지 ν•™μŠ΅
- Semantic segmentation λŠ₯λ ₯ μžμ—°μŠ€λŸ½κ²Œ μŠ΅λ“
"""

from torchvision import transforms

class DINODataAugmentation:
    def __init__(self, global_crops_scale=(0.4, 1.0), local_crops_scale=(0.05, 0.4),
                 n_local_crops=8):
        # Global crops (224Γ—224)
        self.global_transform = transforms.Compose([
            transforms.RandomResizedCrop(224, scale=global_crops_scale),
            transforms.RandomHorizontalFlip(),
            transforms.ColorJitter(0.4, 0.4, 0.2, 0.1),
            transforms.RandomGrayscale(p=0.2),
            transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
            transforms.ToTensor(),
            transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
        ])

        # Local crops (96Γ—96)
        self.local_transform = transforms.Compose([
            transforms.RandomResizedCrop(96, scale=local_crops_scale),
            transforms.RandomHorizontalFlip(),
            transforms.ColorJitter(0.4, 0.4, 0.2, 0.1),
            transforms.RandomGrayscale(p=0.2),
            transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
            transforms.ToTensor(),
            transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
        ])

        self.n_local_crops = n_local_crops

    def __call__(self, image):
        crops = []
        # 2 global crops
        crops.append(self.global_transform(image))
        crops.append(self.global_transform(image))
        # n local crops
        for _ in range(self.n_local_crops):
            crops.append(self.local_transform(image))
        return crops
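
As a usage sketch (assuming Pillow is available), applying the augmentation to a single image yields 2 + n_local_crops tensors; the blank image below is just a stand-in for a real photo.

from PIL import Image

aug = DINODataAugmentation(n_local_crops=8)
image = Image.new("RGB", (480, 480))  # placeholder for a real image

crops = aug(image)
print(len(crops))        # 10 = 2 global + 8 local
print(crops[0].shape)    # torch.Size([3, 224, 224]) -- global crop
print(crops[2].shape)    # torch.Size([3, 96, 96])   -- local crop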

2.4 Teacher-Student Updates

class DINOTrainer:
    """
    DINO training loop.

    Key idea:
    - Student: updated by gradient descent
    - Teacher: EMA (exponential moving average) of the student
    """
    def __init__(self, student, teacher, optimizer, loss_fn, momentum=0.996):
        self.student = student
        self.teacher = teacher
        self.optimizer = optimizer
        self.loss_fn = loss_fn
        self.momentum = momentum

        # Initialize the teacher from the student
        self.teacher.load_state_dict(self.student.state_dict())
        # The teacher receives no gradients
        for p in self.teacher.parameters():
            p.requires_grad = False

    @torch.no_grad()
    def update_teacher(self):
        """EMA update: ΞΈ_t = m * ΞΈ_t + (1-m) * ΞΈ_s"""
        for param_s, param_t in zip(self.student.parameters(), self.teacher.parameters()):
            param_t.data.mul_(self.momentum).add_((1 - self.momentum) * param_s.data)

    def train_step(self, images):
        """
        images: list of crops [global1, global2, local1, ..., localN]
        """
        # Only the global crops go through the teacher
        teacher_output = self.teacher(torch.cat(images[:2]))

        # All crops go through the student
        student_output = self.student(torch.cat(images))

        # Loss (each student crop vs. each teacher crop)
        loss = self.loss_fn(student_output, teacher_output)

        # Update the student
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # EMA update of the teacher
        self.update_teacher()

        return loss.item()
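
Putting the pieces together, a minimal training step can be wired up as below. This is an illustrative sketch: timm is one convenient source of a ViT backbone (any module mapping images to an embedding works), and the model name, head dimensions, and learning rate are assumptions, not DINO's published recipe.

import copy
import timm

# Student = backbone + projection head; the teacher starts as a deep copy
backbone = timm.create_model("vit_small_patch16_224", num_classes=0)
student = nn.Sequential(backbone, DINOHead(in_dim=384, out_dim=4096))
teacher = copy.deepcopy(student)

optimizer = torch.optim.AdamW(student.parameters(), lr=5e-4)
loss_fn = DINOLoss(out_dim=4096)
trainer = DINOTrainer(student, teacher, optimizer, loss_fn)

# One step on a dummy batch of two global crops (local crops omitted here,
# since the simplified DINOLoss above expects view-aligned shapes)
crops = [torch.randn(4, 3, 224, 224), torch.randn(4, 3, 224, 224)]
print(trainer.train_step(crops))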

3. DINOv2 (2023)

3.1 What DINOv2 Improves

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                 DINO vs. DINOv2                                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  Aspect           β”‚ DINO (2021)      β”‚ DINOv2 (2023)          β”‚
β”‚  ─────────────────│──────────────────│───────────────────────  β”‚
β”‚  Data             β”‚ ImageNet (1.3M)  β”‚ LVD-142M (142M)        β”‚
β”‚  Data curation    β”‚ none             β”‚ automated pipeline     β”‚
β”‚  Model sizes      β”‚ ViT-S/B          β”‚ ViT-S/B/L/g            β”‚
β”‚  Objective        β”‚ DINO only        β”‚ DINO + iBOT (masked)   β”‚
β”‚  Regularization   β”‚ basic            β”‚ KoLeo + stronger reg.  β”‚
β”‚  Resolution       β”‚ 224              β”‚ 518 (high-res stage)   β”‚
β”‚  IN-1k k-NN       β”‚ ~74% (ViT-S/16)  β”‚ ~84% (ViT-g/14)        β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

3.2 The LVD-142M Dataset

"""
LVD-142M (Learning with large Visual Datasets)

μžλ™ νλ ˆμ΄μ…˜ νŒŒμ΄ν”„λΌμΈ:
1. μ›Ήμ—μ„œ 이미지 μˆ˜μ§‘ (billions)
2. 쀑볡 제거 (copy detection)
3. ν’ˆμ§ˆ 필터링
4. ImageNetκ³Ό μœ μ‚¬λ„ 기반 μƒ˜ν”Œλ§
5. μ΅œμ’… 142M 이미지

핡심 기술:
- Self-supervised copy detection
- Embedding 기반 ν΄λŸ¬μŠ€ν„°λ§
- Retrieval 기반 데이터 선택

μ™œ μ€‘μš”ν•œκ°€:
- 데이터 ν’ˆμ§ˆμ΄ λͺ¨λΈ μ„±λŠ₯의 핡심
- Scaling은 데이터 νλ ˆμ΄μ…˜μ΄ ν•„μˆ˜
- μžλ™ν™”λœ νŒŒμ΄ν”„λΌμΈμœΌλ‘œ ν™•μž₯ κ°€λŠ₯
"""

3.3 iBOT Integration

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                 DINOv2 = DINO + iBOT                            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  DINO loss (image level):                                       β”‚
β”‚  β€’ Consistency across global/local crops                        β”‚
β”‚  β€’ Computed on the CLS token                                    β”‚
β”‚                                                                 β”‚
β”‚  iBOT loss (patch level):                                       β”‚
β”‚  β€’ Predict masked patches                                       β”‚
β”‚  β€’ Similar to MAE, but targets come from the teacher            β”‚
β”‚                                                                 β”‚
β”‚                    Input Image                                  β”‚
β”‚                         β”‚                                       β”‚
β”‚          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                          β”‚
β”‚          β–Ό                           β–Ό                          β”‚
β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”                      β”‚
β”‚     β”‚ Teacher β”‚                β”‚ Student β”‚                      β”‚
β”‚     β”‚ (full)  β”‚                β”‚ (masked)β”‚ ← some patches masked β”‚
β”‚     β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜                β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜                      β”‚
β”‚          β”‚                          β”‚                           β”‚
β”‚     β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”                β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”                      β”‚
β”‚     β”‚CLSβ”‚Patchβ”‚                β”‚CLSβ”‚Patchβ”‚                      β”‚
β”‚     β””β”€β”¬β”€β”€β”€β”¬β”€β”€β”€β”˜                β””β”€β”¬β”€β”€β”€β”¬β”€β”€β”€β”˜                      β”‚
β”‚       β”‚   β”‚                      β”‚   β”‚                          β”‚
β”‚       β”‚   └──────────────────────│────                          β”‚
β”‚       β”‚          iBOT Loss       β”‚   β”‚                          β”‚
β”‚       β”‚     (masked patches)     β”‚   β”‚                          β”‚
β”‚       β”‚                          β”‚   β”‚                          β”‚
β”‚       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚                          β”‚
β”‚              DINO Loss               β”‚                          β”‚
β”‚           (CLS tokens)               β”‚                          β”‚
β”‚                                                                 β”‚
β”‚  Total Loss = L_DINO + Ξ» Γ— L_iBOT                               β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
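
In code, the combination is a weighted sum. The sketch below is illustrative: dino_loss_fn and ibot_loss_fn stand for DINO-style cross-entropy heads (an instance of DINOLoss from section 2.2 could serve as either), mask is a boolean (batch, n_patches) tensor, and lam is the weighting hyperparameter Ξ».

def dinov2_total_loss(student_cls, teacher_cls,
                      student_patches, teacher_patches,
                      mask, dino_loss_fn, ibot_loss_fn, lam=1.0):
    """Total loss = L_DINO (CLS level) + lambda * L_iBOT (masked patches)."""
    # Image-level term on the CLS tokens
    l_dino = dino_loss_fn(student_cls, teacher_cls)

    # Patch-level term: the student's predictions at masked positions
    # must match the teacher's outputs for the same (unmasked) patches
    l_ibot = ibot_loss_fn(student_patches[mask], teacher_patches[mask])

    return l_dino + lam * l_ibot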

3.4 λͺ¨λΈ ꡬ쑰

"""
DINOv2 λͺ¨λΈ 사양

Model      β”‚ Layers β”‚ Hidden β”‚ Heads β”‚ Params β”‚ Patch
──────────│────────│────────│───────│────────│───────
ViT-S/14  β”‚ 12     β”‚ 384    β”‚ 6     β”‚ 21M    β”‚ 14Γ—14
ViT-B/14  β”‚ 12     β”‚ 768    β”‚ 12    β”‚ 86M    β”‚ 14Γ—14
ViT-L/14  β”‚ 24     β”‚ 1024   β”‚ 16    β”‚ 300M   β”‚ 14Γ—14
ViT-g/14  β”‚ 40     β”‚ 1536   β”‚ 24    β”‚ 1.1B   β”‚ 14Γ—14

Notes:
- Patch size 14 (the original ViT used 16)
- Supports higher input resolutions
- Register tokens (fix attention artifacts; added in later releases)
"""

4. Using DINOv2

4.1 Loading via Hugging Face

import torch
from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import requests

# λͺ¨λΈ λ‘œλ“œ
model_name = "facebook/dinov2-base"
processor = AutoImageProcessor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Load an image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess and run inference
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Output structure
print(f"Last hidden state: {outputs.last_hidden_state.shape}")
# (1, 257, 768) = (batch, 1 CLS + 256 patches, hidden_dim)

# CLS token (global image representation)
cls_token = outputs.last_hidden_state[:, 0]
print(f"CLS token: {cls_token.shape}")  # (1, 768)

# Patch tokens (local representations)
patch_tokens = outputs.last_hidden_state[:, 1:]
print(f"Patch tokens: {patch_tokens.shape}")  # (1, 256, 768)

4.2 Feature Extraction and Use

import torch
import torch.nn.functional as F
from transformers import AutoImageProcessor, AutoModel
import numpy as np
from sklearn.neighbors import NearestNeighbors

class DINOv2FeatureExtractor:
    """DINOv2λ₯Ό μ΄μš©ν•œ 이미지 νŠΉμ§• μΆ”μΆœκΈ°"""

    def __init__(self, model_name="facebook/dinov2-base"):
        self.processor = AutoImageProcessor.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        self.model.eval()

    @torch.no_grad()
    def extract_features(self, images, return_patches=False):
        """
        μ΄λ―Έμ§€μ—μ„œ νŠΉμ§• μΆ”μΆœ

        Args:
            images: PIL Image λ˜λŠ” 리슀트
            return_patches: νŒ¨μΉ˜λ³„ νŠΉμ§•λ„ λ°˜ν™˜ν• μ§€

        Returns:
            cls_features: (n_images, hidden_dim)
            patch_features: (n_images, n_patches, hidden_dim) - optional
        """
        if not isinstance(images, list):
            images = [images]

        inputs = self.processor(images=images, return_tensors="pt")
        outputs = self.model(**inputs)

        cls_features = outputs.last_hidden_state[:, 0]

        if return_patches:
            patch_features = outputs.last_hidden_state[:, 1:]
            return cls_features, patch_features

        return cls_features

    def compute_similarity(self, image1, image2):
        """두 이미지 κ°„ μœ μ‚¬λ„ (코사인)"""
        feat1 = self.extract_features(image1)
        feat2 = self.extract_features(image2)
        similarity = F.cosine_similarity(feat1, feat2)
        return similarity.item()

# Usage example
extractor = DINOv2FeatureExtractor()

# Image retrieval
def build_image_index(images):
    """Build a k-NN index over image features"""
    features = []
    for img in images:
        feat = extractor.extract_features(img)
        features.append(feat.numpy())
    features = np.vstack(features)

    # k-NN index
    index = NearestNeighbors(n_neighbors=5, metric='cosine')
    index.fit(features)
    return index, features

def search_similar(query_image, index, features, k=5):
    """μœ μ‚¬ 이미지 검색"""
    query_feat = extractor.extract_features(query_image).numpy()
    distances, indices = index.kneighbors(query_feat, n_neighbors=k)
    return indices[0], distances[0]
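
A hypothetical end-to-end run with placeholder solid-color images (a real gallery would use photos):

from PIL import Image

gallery = [Image.new("RGB", (224, 224), color=(i * 25, 0, 0)) for i in range(10)]
index, features = build_image_index(gallery)

query = Image.new("RGB", (224, 224), color=(100, 0, 0))
idx, dist = search_similar(query, index, features, k=3)
print(idx, dist)  # indices and cosine distances of the closest gallery images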

4.3 Dense Prediction (Semantic Segmentation)

import torch
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def visualize_attention_maps(model, processor, image):
    """DINOv2의 attention map μ‹œκ°ν™”"""

    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)

    # Attention from the last layer
    attentions = outputs.attentions[-1]  # (1, n_heads, n_tokens, n_tokens)

    # Attention the CLS token pays to each patch
    cls_attn = attentions[0, :, 0, 1:]  # (n_heads, n_patches)

    # Average over heads
    cls_attn_mean = cls_attn.mean(dim=0)  # (n_patches,)

    # Reshape to 2D
    n_patches = int(np.sqrt(cls_attn_mean.shape[0]))
    attn_map = cls_attn_mean.reshape(n_patches, n_patches)

    return attn_map.numpy()

def visualize_patch_pca(model, processor, image, n_components=3):
    """패치 νŠΉμ§•μ˜ PCA μ‹œκ°ν™” (의미둠적 μ˜μ—­ 확인)"""

    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # Patch tokens
    patch_tokens = outputs.last_hidden_state[0, 1:].numpy()  # (n_patches, hidden)

    # PCA
    pca = PCA(n_components=n_components)
    patch_pca = pca.fit_transform(patch_tokens)

    # Normalize to [0, 1] for visualization
    patch_pca = (patch_pca - patch_pca.min()) / (patch_pca.max() - patch_pca.min())

    # Reshape
    n_patches = int(np.sqrt(patch_tokens.shape[0]))
    pca_image = patch_pca.reshape(n_patches, n_patches, n_components)

    return pca_image

# μ‹œκ°ν™”
# fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# axes[0].imshow(image)
# axes[0].set_title('Original')
# axes[1].imshow(visualize_attention_maps(model, processor, image), cmap='hot')
# axes[1].set_title('Attention Map')
# axes[2].imshow(visualize_patch_pca(model, processor, image))
# axes[2].set_title('PCA of Patches')

5. DINOv2 μ‘μš©

5.1 Zero-shot Semantic Segmentation

"""
DINOv2의 패치 νŠΉμ§•μ„ μ΄μš©ν•œ μ„Έκ·Έλ©˜ν…Œμ΄μ…˜

방법:
1. μ΄λ―Έμ§€μ—μ„œ DINOv2 패치 νŠΉμ§• μΆ”μΆœ
2. μ˜ˆμ‹œ μ΄λ―Έμ§€μ—μ„œ 관심 μ˜μ—­μ˜ νŠΉμ§• μΆ”μΆœ
3. 코사인 μœ μ‚¬λ„λ‘œ ν•΄λ‹Ή μ˜μ—­ μ°ΎκΈ°

μž₯점:
- ν•™μŠ΅ 없이 μ„Έκ·Έλ©˜ν…Œμ΄μ…˜ κ°€λŠ₯
- μƒˆλ‘œμš΄ 객체 ν΄λž˜μŠ€λ„ 처리 κ°€λŠ₯
"""

import torch
import torch.nn.functional as F
import numpy as np

def segment_with_reference(model, processor, target_image, reference_image, reference_mask):
    """
    Segment the target image using a mask drawn on a reference image.

    Args:
        target_image: image to segment
        reference_image: reference image
        reference_mask: binary mask of the region of interest in the reference image
    """
    # Extract features
    with torch.no_grad():
        target_inputs = processor(images=target_image, return_tensors="pt")
        target_outputs = model(**target_inputs)
        target_patches = target_outputs.last_hidden_state[0, 1:]  # (n_patches, hidden)

        ref_inputs = processor(images=reference_image, return_tensors="pt")
        ref_outputs = model(**ref_inputs)
        ref_patches = ref_outputs.last_hidden_state[0, 1:]  # (n_patches, hidden)

    # Average the features of the masked region in the reference
    n_patches = int(np.sqrt(ref_patches.shape[0]))
    mask_resized = F.interpolate(
        reference_mask.unsqueeze(0).unsqueeze(0).float(),
        size=(n_patches, n_patches),
        mode='nearest'
    ).squeeze().bool()

    foreground_features = ref_patches[mask_resized.flatten()].mean(dim=0)

    # Similarity between each target patch and the region feature
    similarities = F.cosine_similarity(
        target_patches,
        foreground_features.unsqueeze(0),
        dim=1
    )

    # Reshape to 2D
    similarity_map = similarities.reshape(n_patches, n_patches)

    return similarity_map.numpy()
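
Thresholding the similarity map gives a coarse, patch-resolution binary mask; the 0.5 cutoff and the nearest-neighbor upsampling below are arbitrary illustrations, shown in the same commented style as the visualization example above.

# similarity_map = segment_with_reference(model, processor, target, ref, ref_mask)
# binary_mask = similarity_map > 0.5                    # tune per task
# upsampled = np.kron(binary_mask, np.ones((14, 14)))   # patch grid -> pixel grid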

5.2 Depth Estimation

"""
DINOv2 + Linear Probe둜 Depth Estimation

방법:
1. DINOv2둜 패치 νŠΉμ§• μΆ”μΆœ
2. κ°„λ‹¨ν•œ Linear layer둜 depth 예츑
3. 적은 λ°μ΄ν„°λ‘œλ„ 쒋은 μ„±λŠ₯

이유:
- DINOv2κ°€ 이미 3D ꡬ쑰 정보λ₯Ό ν•™μŠ΅
- 패치 νŠΉμ§•μ— depth cueκ°€ 인코딩됨
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

class DepthEstimator(nn.Module):
    def __init__(self, dinov2_model, hidden_dim=768):
        super().__init__()
        self.backbone = dinov2_model
        self.backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad = False

        self.head = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        with torch.no_grad():
            features = self.backbone(x).last_hidden_state[:, 1:]  # patch tokens

        depth = self.head(features)  # (batch, n_patches, 1)

        # Reshape to image
        batch, n_patches, _ = depth.shape
        h = w = int(np.sqrt(n_patches))
        depth = depth.reshape(batch, h, w)

        return depth
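
Since only the head has trainable parameters, training reduces to a standard regression loop. A minimal sketch, assuming `model` is the Hugging Face model from section 4.1 and that a (hypothetical) loader yields (pixel_values, depth_map) pairs with depth at patch resolution:

# Minimal training sketch: only the head receives gradients
estimator = DepthEstimator(model)
optimizer = torch.optim.Adam(estimator.head.parameters(), lr=1e-3)

def train_depth_step(pixel_values, target_depth):
    """One regression step; target_depth is (batch, h, w) at patch resolution."""
    pred = estimator(pixel_values)
    loss = F.mse_loss(pred, target_depth)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()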

Summary

DINO/DINOv2 essentials

Concept           β”‚ Description
──────────────────│──────────────────────────────────────────────
Self-distillation β”‚ Teacher-student setup trained without labels
Multi-crop        β”‚ Global + local crops cover multiple scales
Centering         β”‚ Centering the teacher output prevents collapse
EMA teacher       β”‚ Momentum updates give stable targets
iBOT              β”‚ Adds masked patch prediction (DINOv2)

Applications

  β€’ Image retrieval: find similar images with the CLS token
  β€’ Semantic segmentation: zero-shot segmentation from patch features
  β€’ Depth estimation: depth prediction with a linear probe
  β€’ Fine-tuning: training on downstream tasks

λ‹€μŒ 단계


References

Papers

  • Caron et al. (2021). "Emerging Properties in Self-Supervised Vision Transformers" (DINO)
  • Oquab et al. (2023). "DINOv2: Learning Robust Visual Features without Supervision"
  • Zhou et al. (2021). "iBOT: Image BERT Pre-Training with Online Tokenizer"
