# DINOv2 & Self-Supervised Vision
## Learning Objectives

- Understand the self-distillation mechanism of DINO/DINOv2
- Grasp the teacher-student training paradigm
- Learn how to use dense visual features
- Use DINOv2 as a vision foundation model
## 1. Self-Supervised Learning in Vision: Review

### 1.1 Why Self-Supervised?
**Limits of supervised learning:**

- ImageNet: 1.4M images, 1000 classes
- Labeling is expensive
- A class label is a very limited training signal

**Self-supervised learning:**

- Trains without labels (via pretext tasks)
- Can exploit billions of unlabeled images
- Learns richer representations

**Main families of methods:**

| Family | Methods | Idea |
|---|---|---|
| Contrastive | SimCLR, MoCo | Learn from similar/dissimilar pairs |
| Distillation | DINO, BYOL | Teacher-student |
| Masked | MAE, BEiT | Mask, then reconstruct |
### 1.2 Review: Deep_Learning Folder

Prerequisite knowledge: Deep_Learning/21_Self_Supervised_Learning.md

- SimCLR: contrastive learning basics
- MoCo: momentum contrast
- BYOL: Bootstrap Your Own Latent
- MAE: masked autoencoders
## 2. DINO (2021)

### 2.1 Core Idea

DINO (self-distillation with no labels) applies knowledge distillation in a self-supervised setting: the "teacher" is not a separate pretrained model but a moving average of the student itself.
```
                      DINO Architecture

                        Input Image
                             │
             ┌───────────────┴───────────────┐
             ▼                               ▼
    ┌─────────────────┐             ┌─────────────────┐
    │  Global Crops   │             │   Local Crops   │
    │    (224×224)    │             │     (96×96)     │
    │       × 2       │             │      × 6+       │
    └────────┬────────┘             └────────┬────────┘
             ▼                               ▼
    ┌─────────────────┐             ┌─────────────────┐
    │ Teacher Network │             │ Student Network │
    │  (EMA update)   │             │   (Gradient)    │
    │   [stop-grad]   │             │                 │
    └────────┬────────┘             └────────┬────────┘
             ▼                               ▼
    ┌─────────────────┐             ┌─────────────────┐
    │  Teacher Head   │             │  Student Head   │
    │  (Projection)   │             │  (Projection)   │
    └────────┬────────┘             └────────┬────────┘
             ▼                               ▼
         P_teacher                       P_student
             │                               │
             └───────────────┬───────────────┘
                             ▼
                   Cross-Entropy Loss
              H(P_t, P_s) = -Σ P_t log(P_s)
```
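The cross-entropy at the bottom of the architecture can be sketched numerically. A minimal sketch on random logits, using the paper's default temperatures (teacher 0.04, student 0.1); the toy dimension `K = 8` stands in for the 65536-dimensional head output:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
K = 8                           # prototype dimension (65536 in the paper)
t_logits = torch.randn(4, K)    # teacher head outputs for a batch of 4
s_logits = torch.randn(4, K)    # student head outputs

# Teacher: sharpened softmax (low temperature), gradients stopped
P_t = F.softmax(t_logits / 0.04, dim=-1).detach()
# Student: softer log-softmax (higher temperature)
log_P_s = F.log_softmax(s_logits / 0.1, dim=-1)

# H(P_t, P_s) = -Σ P_t log P_s, averaged over the batch
loss = -(P_t * log_P_s).sum(dim=-1).mean()
print(loss.item())  # a positive scalar
```

The low teacher temperature sharpens `P_t` toward a near one-hot distribution, which the student is pushed to match.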
### 2.2 Key Components
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DINOHead(nn.Module):
    """
    DINO projection head.

    Structure: Linear → GELU → Linear → GELU → Linear → L2 norm → weight-normed Linear
    Output: K dimensions (e.g., 65536)
    """
    def __init__(self, in_dim, out_dim=65536, hidden_dim=2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )
        # Weight-normalized last layer; its magnitude is frozen at 1
        self.last_layer = nn.utils.weight_norm(
            nn.Linear(out_dim, out_dim, bias=False)
        )
        self.last_layer.weight_g.data.fill_(1)

    def forward(self, x):
        x = self.mlp(x)
        x = F.normalize(x, dim=-1, p=2)  # L2 normalization
        x = self.last_layer(x)
        return x
```
```python
class DINOLoss(nn.Module):
    """
    DINO loss: cross-entropy between teacher and student distributions.

    Key points:
    - Teacher: centering + sharpening (temperature tau_t < tau_s)
    - Student: plain softmax at a higher temperature
    - Center: moving average of all teacher outputs (prevents collapse)
    """
    def __init__(self, out_dim, teacher_temp=0.04, student_temp=0.1,
                 center_momentum=0.9):
        super().__init__()
        self.teacher_temp = teacher_temp
        self.student_temp = student_temp
        self.center_momentum = center_momentum
        self.register_buffer("center", torch.zeros(1, out_dim))

    def forward(self, student_output, teacher_output):
        """
        Args:
            student_output: (batch, n_crops, out_dim)
            teacher_output: (batch, n_global_crops, out_dim)
        """
        # Teacher: centering + sharpening
        teacher_out = F.softmax(
            (teacher_output - self.center) / self.teacher_temp, dim=-1
        )
        teacher_out = teacher_out.detach()  # stop gradient

        # Student: log-softmax at a higher temperature
        student_out = F.log_softmax(student_output / self.student_temp, dim=-1)

        # Cross-entropy loss
        loss = torch.sum(-teacher_out * student_out, dim=-1).mean()

        # Update the center (EMA)
        self.update_center(teacher_output)
        return loss

    @torch.no_grad()
    def update_center(self, teacher_output):
        batch_center = teacher_output.mean(dim=0, keepdim=True)
        self.center = self.center * self.center_momentum \
            + batch_center * (1 - self.center_momentum)
```
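The centering update can be checked on its own. A standalone sketch with momentum 0.9 and toy shapes: if the teacher's outputs are systematically biased, the running center absorbs that bias, so subtracting it before the softmax stops any single dimension from dominating:

```python
import torch

torch.manual_seed(0)
m = 0.9
center = torch.zeros(1, 8)

# Simulate teacher batches whose outputs are biased away from zero
for _ in range(100):
    teacher_output = torch.randn(16, 8) + 5.0
    batch_center = teacher_output.mean(dim=0, keepdim=True)
    center = center * m + batch_center * (1 - m)  # EMA update

print(center.mean().item())  # ≈ 5.0: the center has absorbed the bias
```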
### 2.3 Multi-crop Strategy
"""
Multi-crop Strategy:
Global crops (2κ°):
- ν¬κΈ°: 224Γ224 (μλ³Έμ 50-100%)
- Teacherμ Student λͺ¨λμ μ
λ ₯
- μ 체 μ΄λ―Έμ§ λ§₯λ½ νμ΅
Local crops (μ¬λ¬ κ°, λ³΄ν΅ 6-8κ°):
- ν¬κΈ°: 96Γ96 (μλ³Έμ 5-50%)
- Studentμλ§ μ
λ ₯
- μ§μ ν¨ν΄ νμ΅
λͺ©μ :
- "Local-to-Global" λμ νμ΅
- μμ μμμ΄ μ 체 μ΄λ―Έμ§μ μ΄λ€ λΆλΆμΈμ§ νμ΅
- Semantic segmentation λ₯λ ₯ μμ°μ€λ½κ² μ΅λ
"""
```python
from torchvision import transforms


class DINODataAugmentation:
    def __init__(self, global_crops_scale=(0.4, 1.0), local_crops_scale=(0.05, 0.4),
                 n_local_crops=8):
        # Global crops (224×224)
        self.global_transform = transforms.Compose([
            transforms.RandomResizedCrop(224, scale=global_crops_scale),
            transforms.RandomHorizontalFlip(),
            transforms.ColorJitter(0.4, 0.4, 0.2, 0.1),
            transforms.RandomGrayscale(p=0.2),
            transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
            transforms.ToTensor(),
            transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
        ])
        # Local crops (96×96)
        self.local_transform = transforms.Compose([
            transforms.RandomResizedCrop(96, scale=local_crops_scale),
            transforms.RandomHorizontalFlip(),
            transforms.ColorJitter(0.4, 0.4, 0.2, 0.1),
            transforms.RandomGrayscale(p=0.2),
            transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
            transforms.ToTensor(),
            transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
        ])
        self.n_local_crops = n_local_crops

    def __call__(self, image):
        crops = []
        # 2 global crops
        crops.append(self.global_transform(image))
        crops.append(self.global_transform(image))
        # n local crops
        for _ in range(self.n_local_crops):
            crops.append(self.local_transform(image))
        return crops
```
### 2.4 Teacher-Student Update
```python
class DINOTrainer:
    """
    DINO training loop.

    Key points:
    - Student: updated by gradient descent
    - Teacher: EMA (exponential moving average) of the student
    """
    def __init__(self, student, teacher, optimizer, loss_fn, momentum=0.996):
        self.student = student
        self.teacher = teacher
        self.optimizer = optimizer
        self.loss_fn = loss_fn
        self.momentum = momentum

        # Initialize the teacher from the student
        self.teacher.load_state_dict(self.student.state_dict())
        # No gradients for the teacher
        for p in self.teacher.parameters():
            p.requires_grad = False

    @torch.no_grad()
    def update_teacher(self):
        """EMA update: theta_t = m * theta_t + (1 - m) * theta_s"""
        for param_s, param_t in zip(self.student.parameters(),
                                    self.teacher.parameters()):
            param_t.data.mul_(self.momentum).add_((1 - self.momentum) * param_s.data)

    def train_step(self, images):
        """
        images: list of crops [global1, global2, local1, ..., localN]
        """
        # Only global crops go through the teacher
        teacher_output = self.teacher(torch.cat(images[:2]))

        # All crops go through the student
        student_output = self.student(torch.cat(images))

        # Loss (each student crop vs. each teacher crop)
        loss = self.loss_fn(student_output, teacher_output)

        # Student update
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Teacher EMA update
        self.update_teacher()
        return loss.item()
```
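The EMA rule in `update_teacher` can be verified in isolation. A minimal sketch with two one-layer networks (shapes illustrative): after one simulated optimizer step on the student, the teacher moves only a fraction `1 − m` of the way toward it:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
student = nn.Linear(4, 4, bias=False)
teacher = nn.Linear(4, 4, bias=False)
teacher.load_state_dict(student.state_dict())  # start identical

# Pretend one optimizer step shifted the student's weights
with torch.no_grad():
    student.weight += 1.0

    # EMA update: theta_t ← m·theta_t + (1−m)·theta_s
    m = 0.996
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(m).add_((1 - m) * p_s)

gap = (student.weight - teacher.weight).abs().mean().item()
print(gap)  # ≈ 0.996: the teacher moved only 0.4% of the way
```

With `m` close to 1 the teacher changes slowly, which is what gives the student a stable target.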
## 3. DINOv2 (2023)

### 3.1 Improvements in DINOv2
| Item | DINO (2021) | DINOv2 (2023) |
|---|---|---|
| Data | ImageNet (1.3M) | LVD-142M (142M) |
| Data curation | Manual | Automated curation pipeline |
| Model size | ViT-S/B | ViT-S/B/L/g |
| Training objective | DINO only | DINO + iBOT (masked) |
| Regularization | Basic | KoLeo + stronger normalization |
| Resolution | 224 | 518 (high resolution) |
| Performance (k-NN) | ~74% (IN-1K) | ~86% (IN-1K) |
### 3.2 The LVD-142M Dataset
"""
LVD-142M (Learning with large Visual Datasets)
μλ νλ μ΄μ
νμ΄νλΌμΈ:
1. μΉμμ μ΄λ―Έμ§ μμ§ (billions)
2. μ€λ³΅ μ κ±° (copy detection)
3. νμ§ νν°λ§
4. ImageNetκ³Ό μ μ¬λ κΈ°λ° μνλ§
5. μ΅μ’
142M μ΄λ―Έμ§
ν΅μ¬ κΈ°μ :
- Self-supervised copy detection
- Embedding κΈ°λ° ν΄λ¬μ€ν°λ§
- Retrieval κΈ°λ° λ°μ΄ν° μ ν
μ μ€μνκ°:
- λ°μ΄ν° νμ§μ΄ λͺ¨λΈ μ±λ₯μ ν΅μ¬
- Scalingμ λ°μ΄ν° νλ μ΄μ
μ΄ νμ
- μλνλ νμ΄νλΌμΈμΌλ‘ νμ₯ κ°λ₯
"""
### 3.3 iBOT Integration
DINOv2 combines two objectives: DINO + iBOT.

**DINO loss (image level):**

- Consistency across global/local crops
- Based on the CLS token

**iBOT loss (patch level):**

- Predict masked patches
- Similar in spirit to MAE, but the targets come from the teacher

```
                 Input Image
                      │
         ┌────────────┴────────────┐
         ▼                         ▼
    ┌─────────┐              ┌─────────┐
    │ Teacher │              │ Student │
    │ (full)  │              │ (masked)│ ← some patches masked
    └────┬────┘              └────┬────┘
         ▼                        ▼
    CLS │ Patch              CLS │ Patch
     │      │                 │      │
     │      └─── iBOT loss ───│──────┘
     │       (masked patches) │
     └───────── DINO loss ────┘
               (CLS tokens)

    Total loss: L = L_DINO + λ · L_iBOT
```
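A minimal sketch of the patch-level term, on toy shapes: the cross-entropy is computed only at the masked positions, against the teacher's distributions for those same patches. The mask ratio and λ here are illustrative:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, N, K = 2, 16, 8                      # batch, patches, prototype dim (toy sizes)
teacher_patch = torch.randn(B, N, K)    # teacher sees the full image
student_patch = torch.randn(B, N, K)    # student saw a masked input

mask = torch.rand(B, N) < 0.4           # ~40% of patches were masked

P_t = F.softmax(teacher_patch / 0.04, dim=-1)
log_P_s = F.log_softmax(student_patch / 0.1, dim=-1)

# Cross-entropy per patch, averaged over masked positions only
ce = -(P_t * log_P_s).sum(dim=-1)       # (B, N)
L_ibot = ce[mask].mean()

L_dino = torch.tensor(2.0)              # stand-in for the CLS-level loss
total = L_dino + 1.0 * L_ibot           # L = L_DINO + λ·L_iBOT, λ = 1 here
print(total.item())
```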
### 3.4 Model Architecture
"""
DINOv2 λͺ¨λΈ μ¬μ
Model β Layers β Hidden β Heads β Params β Patch
βββββββββββββββββββββββββββββββββββββββββββββββββββββ
ViT-S/14 β 12 β 384 β 6 β 21M β 14Γ14
ViT-B/14 β 12 β 768 β 12 β 86M β 14Γ14
ViT-L/14 β 24 β 1024 β 16 β 300M β 14Γ14
ViT-g/14 β 40 β 1536 β 24 β 1.1B β 14Γ14
νΉμ§:
- Patch size 14 (κΈ°μ‘΄ ViTλ 16)
- λ λμ ν΄μλ μ§μ
- Register tokens (attention artifact ν΄κ²°)
"""
## 4. Using DINOv2

### 4.1 Loading with HuggingFace
```python
import torch
from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import requests

# Load the model
model_name = "facebook/dinov2-base"
processor = AutoImageProcessor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Load an image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess and run inference
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Output structure
print(f"Last hidden state: {outputs.last_hidden_state.shape}")
# (1, 257, 768) = (batch, 1 CLS + 256 patches, hidden_dim)

# CLS token (whole-image representation)
cls_token = outputs.last_hidden_state[:, 0]
print(f"CLS token: {cls_token.shape}")  # (1, 768)

# Patch tokens (local representations)
patch_tokens = outputs.last_hidden_state[:, 1:]
print(f"Patch tokens: {patch_tokens.shape}")  # (1, 256, 768)
```
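The 257-token count follows directly from the patch size: the image is cut into a 16×16 grid of 14-pixel patches, plus one CLS token. The same arithmetic gives the token count at DINOv2's 518-pixel training resolution:

```python
def n_tokens(image_size, patch_size=14):
    grid = image_size // patch_size
    return grid * grid + 1  # +1 for the CLS token

print(n_tokens(224))  # 16×16 patches + CLS = 257
print(n_tokens(518))  # 37×37 patches + CLS = 1370
```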
### 4.2 Feature Extraction and Applications
```python
import torch
import torch.nn.functional as F
from transformers import AutoImageProcessor, AutoModel
import numpy as np
from sklearn.neighbors import NearestNeighbors


class DINOv2FeatureExtractor:
    """Image feature extractor built on DINOv2."""

    def __init__(self, model_name="facebook/dinov2-base"):
        self.processor = AutoImageProcessor.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        self.model.eval()

    @torch.no_grad()
    def extract_features(self, images, return_patches=False):
        """
        Extract features from one or more images.

        Args:
            images: a PIL Image or a list of them
            return_patches: whether to also return per-patch features

        Returns:
            cls_features: (n_images, hidden_dim)
            patch_features: (n_images, n_patches, hidden_dim) - optional
        """
        if not isinstance(images, list):
            images = [images]
        inputs = self.processor(images=images, return_tensors="pt")
        outputs = self.model(**inputs)
        cls_features = outputs.last_hidden_state[:, 0]
        if return_patches:
            patch_features = outputs.last_hidden_state[:, 1:]
            return cls_features, patch_features
        return cls_features

    def compute_similarity(self, image1, image2):
        """Cosine similarity between two images."""
        feat1 = self.extract_features(image1)
        feat2 = self.extract_features(image2)
        similarity = F.cosine_similarity(feat1, feat2)
        return similarity.item()


# Usage example
extractor = DINOv2FeatureExtractor()


# Image retrieval
def build_image_index(images):
    """Build an image index."""
    features = []
    for img in images:
        feat = extractor.extract_features(img)
        features.append(feat.numpy())
    features = np.vstack(features)
    # k-NN index
    index = NearestNeighbors(n_neighbors=5, metric='cosine')
    index.fit(features)
    return index, features


def search_similar(query_image, index, features, k=5):
    """Search for similar images."""
    query_feat = extractor.extract_features(query_image).numpy()
    distances, indices = index.kneighbors(query_feat, n_neighbors=k)
    return indices[0], distances[0]
```
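A quick self-contained check of the retrieval logic above, with random unit vectors standing in for DINOv2 features: querying with a vector that is already in the index should return that vector as its own nearest neighbor at cosine distance ~0:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
features = rng.normal(size=(20, 8))
features /= np.linalg.norm(features, axis=1, keepdims=True)  # unit vectors

index = NearestNeighbors(n_neighbors=3, metric="cosine")
index.fit(features)

# Query with an indexed vector: it should be its own top hit
distances, indices = index.kneighbors(features[7:8], n_neighbors=3)
print(indices[0][0], distances[0][0])  # 7, ~0.0
```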
### 4.3 Dense Prediction (Semantic Segmentation)
```python
import torch
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA


def visualize_attention_maps(model, processor, image):
    """Visualize DINOv2 attention maps."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)

    # Attention of the last layer
    attentions = outputs.attentions[-1]  # (1, n_heads, n_tokens, n_tokens)

    # Attention the CLS token pays to each patch
    cls_attn = attentions[0, :, 0, 1:]  # (n_heads, n_patches)

    # Average over heads
    cls_attn_mean = cls_attn.mean(dim=0)  # (n_patches,)

    # Reshape to 2D
    n_patches = int(np.sqrt(cls_attn_mean.shape[0]))
    attn_map = cls_attn_mean.reshape(n_patches, n_patches)
    return attn_map.numpy()


def visualize_patch_pca(model, processor, image, n_components=3):
    """PCA visualization of patch features (reveals semantic regions)."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Patch tokens
    patch_tokens = outputs.last_hidden_state[0, 1:].numpy()  # (n_patches, hidden)

    # PCA
    pca = PCA(n_components=n_components)
    patch_pca = pca.fit_transform(patch_tokens)

    # Normalize to [0, 1] for visualization
    patch_pca = (patch_pca - patch_pca.min()) / (patch_pca.max() - patch_pca.min())

    # Reshape
    n_patches = int(np.sqrt(patch_tokens.shape[0]))
    pca_image = patch_pca.reshape(n_patches, n_patches, n_components)
    return pca_image


# Visualization
# fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# axes[0].imshow(image)
# axes[0].set_title('Original')
# axes[1].imshow(visualize_attention_maps(model, processor, image), cmap='hot')
# axes[1].set_title('Attention Map')
# axes[2].imshow(visualize_patch_pca(model, processor, image))
# axes[2].set_title('PCA of Patches')
```
## 5. DINOv2 Applications

### 5.1 Zero-shot Semantic Segmentation
"""
DINOv2μ ν¨μΉ νΉμ§μ μ΄μ©ν μΈκ·Έλ©ν
μ΄μ
λ°©λ²:
1. μ΄λ―Έμ§μμ DINOv2 ν¨μΉ νΉμ§ μΆμΆ
2. μμ μ΄λ―Έμ§μμ κ΄μ¬ μμμ νΉμ§ μΆμΆ
3. μ½μ¬μΈ μ μ¬λλ‘ ν΄λΉ μμ μ°ΎκΈ°
μ₯μ :
- νμ΅ μμ΄ μΈκ·Έλ©ν
μ΄μ
κ°λ₯
- μλ‘μ΄ κ°μ²΄ ν΄λμ€λ μ²λ¦¬ κ°λ₯
"""
```python
import numpy as np
import torch
import torch.nn.functional as F


def segment_with_reference(model, processor, target_image, reference_image,
                           reference_mask):
    """
    Segment the target image using a reference image and its mask.

    Args:
        target_image: image to segment
        reference_image: reference image
        reference_mask: binary mask of the region of interest in the reference
    """
    # Feature extraction
    with torch.no_grad():
        target_inputs = processor(images=target_image, return_tensors="pt")
        target_outputs = model(**target_inputs)
        target_patches = target_outputs.last_hidden_state[0, 1:]  # (n_patches, hidden)

        ref_inputs = processor(images=reference_image, return_tensors="pt")
        ref_outputs = model(**ref_inputs)
        ref_patches = ref_outputs.last_hidden_state[0, 1:]  # (n_patches, hidden)

    # Average the features inside the reference mask
    n_patches = int(np.sqrt(ref_patches.shape[0]))
    mask_resized = F.interpolate(
        reference_mask.unsqueeze(0).unsqueeze(0).float(),
        size=(n_patches, n_patches),
        mode='nearest'
    ).squeeze().bool()
    foreground_features = ref_patches[mask_resized.flatten()].mean(dim=0)

    # Similarity between each target patch and the foreground feature
    similarities = F.cosine_similarity(
        target_patches,
        foreground_features.unsqueeze(0),
        dim=1
    )

    # Reshape to 2D
    similarity_map = similarities.reshape(n_patches, n_patches)
    return similarity_map.numpy()
```
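To turn the patch-level similarity map into a pixel-level mask, upsample it to the input resolution and threshold it. A sketch on a toy 16×16 map; the 0.5 threshold is illustrative and would be tuned in practice:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
similarity_map = torch.rand(16, 16)  # e.g. output of segment_with_reference

# Bilinear upsampling to the input resolution (224×224)
full = F.interpolate(similarity_map[None, None], size=(224, 224),
                     mode="bilinear", align_corners=False)[0, 0]
mask = full > 0.5                    # binary segmentation mask
print(mask.shape, mask.float().mean().item())
```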
### 5.2 Depth Estimation
"""
DINOv2 + Linear Probeλ‘ Depth Estimation
λ°©λ²:
1. DINOv2λ‘ ν¨μΉ νΉμ§ μΆμΆ
2. κ°λ¨ν Linear layerλ‘ depth μμΈ‘
3. μ μ λ°μ΄ν°λ‘λ μ’μ μ±λ₯
μ΄μ :
- DINOv2κ° μ΄λ―Έ 3D ꡬ쑰 μ 보λ₯Ό νμ΅
- ν¨μΉ νΉμ§μ depth cueκ° μΈμ½λ©λ¨
"""
```python
import numpy as np
import torch
import torch.nn as nn


class DepthEstimator(nn.Module):
    def __init__(self, dinov2_model, hidden_dim=768):
        super().__init__()
        self.backbone = dinov2_model
        self.backbone.eval()
        # Freeze the backbone; only the head is trained
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        with torch.no_grad():
            features = self.backbone(x).last_hidden_state[:, 1:]  # patch tokens
        depth = self.head(features)  # (batch, n_patches, 1)
        # Reshape to an image grid
        batch, n_patches, _ = depth.shape
        h = w = int(np.sqrt(n_patches))
        depth = depth.reshape(batch, h, w)
        return depth
```
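With the backbone frozen, training the head reduces to an ordinary regression loop. A standalone sketch on synthetic patch features (the linear target here is made up purely so the loop has something to fit):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
features = torch.randn(64, 256, 768)               # stand-in for DINOv2 patch features
w = torch.randn(768, 1) / 768 ** 0.5
target = torch.sigmoid(features @ w).squeeze(-1)   # synthetic per-patch "depth"

head = nn.Sequential(nn.Linear(768, 256), nn.ReLU(),
                     nn.Linear(256, 1), nn.Sigmoid())
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

losses = []
for step in range(50):
    pred = head(features).squeeze(-1)
    loss = nn.functional.mse_loss(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

print(f"{losses[0]:.4f} -> {losses[-1]:.4f}")  # loss decreases
```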
## Summary

### DINO/DINOv2 Essentials
| Concept | Description |
|---|---|
| Self-distillation | Teacher-student structure, trained without labels |
| Multi-crop | Global + local crops for multi-scale learning |
| Centering | Centering the teacher output to prevent collapse |
| EMA teacher | Momentum update provides a stable target |
| iBOT | Adds masked patch prediction (DINOv2) |
### Applications

- Image retrieval: similar-image search with the CLS token
- Semantic segmentation: zero-shot segmentation from patch features
- Depth estimation: depth prediction with a linear probe
- Fine-tuning: downstream task training
### Next Steps

- 13_Segment_Anything.md: promptable segmentation with SAM
- 14_Unified_Vision_Models.md: unified vision foundation models
## References

### Papers
- Caron et al. (2021). "Emerging Properties in Self-Supervised Vision Transformers" (DINO)
- Oquab et al. (2023). "DINOv2: Learning Robust Visual Features without Supervision"
- Zhou et al. (2021). "iBOT: Image BERT Pre-Training with Online Tokenizer"