이전: Self-Supervised Learning | 다음: 객체 탐지

37. 현대 딥러닝 아키텍처¶

학습 목표¶

최근 딥러닝 아키텍처 혁신(2020-2024) 살펴보기
ConvNeXt와 Transformer 시대의 순수 ConvNet의 진화 이해하기
EfficientNetV2와 점진적 학습 전략(progressive training strategies)에 대해 배우기
자기지도 학습 비전 파운데이션 모델인 DINOv2 탐구하기
빠른 확산 샘플링을 위한 잠재 일관성 모델(Latent Consistency Models, LCM) 이해하기
timm 및 transformers 라이브러리를 사용한 사전학습된 현대 아키텍처 적용하기

1. 아키텍처 진화 타임라인¶

딥러닝 아키텍처 환경은 빠르게 진화해왔습니다:

2017: ResNet/ResNeXt dominance
      └─ Bottleneck blocks, skip connections

2017: Transformer (NLP)
      └─ Self-attention, positional encoding

2020: Vision Transformer (ViT)
      └─ Pure attention for vision

2021: Swin Transformer
      └─ Hierarchical vision transformer with shifted windows

2022: ConvNeXt
      └─ Modernized ConvNet matching Transformers

2022: EfficientNetV2
      └─ Progressive training + Fused-MBConv

2023: DINOv2
      └─ Self-supervised vision foundation model

2023: Latent Consistency Models
      └─ Fast diffusion sampling (1-4 steps)

2024: ConvNeXt V2, Mamba, Hyena
      └─ Continued innovation in architectures

주요 트렌드¶

하이브리드 아키텍처: 합성곱(convolutions)과 어텐션(attention) 결합
자기지도 사전학습(Self-supervised pretraining): DINO, MAE, CLIP
스케일링 법칙(Scaling laws): 더 큰 모델, 더 많은 데이터, 더 긴 학습
효율성: FLOPs, 파라미터, 지연시간 감소
파운데이션 모델(Foundation models): 범용 사전학습 모델

2. ConvNeXt: ConvNet의 현대화¶

ConvNeXt (Liu et al., 2022)는 순수 ConvNet이 최신 설계 선택으로 현대화될 때 Transformer와 경쟁할 수 있음을 보여줍니다.

2.1 ResNet에서 ConvNeXt로의 설계 진화¶

ResNet-50에서 시작하여 단계별로 현대적 개선사항을 적용:

Step 1: Training procedure (90 → 300 epochs, AdamW, mixup, cutmix)
        Accuracy: 76.1% → 78.8%

Step 2: Macro design (stage ratio 3:4:6:3 → 3:3:9:3)
        Patchify stem (7×7 stride-2 → 4×4 stride-4)
        Accuracy: 78.8% → 79.4%

Step 3: ResNeXt-ify (grouped convolutions)
        Depthwise convolution (groups = channels)
        Accuracy: 79.4% → 80.5%

Step 4: Inverted bottleneck (narrow → wide → narrow)
        Expansion ratio 4× (similar to Transformers' MLP)
        Accuracy: 80.5% → 80.6%

Step 5: Large kernel sizes (3×3 → 7×7)
        Accuracy: 80.6% → 81.0%

Step 6: Micro design (ReLU → GELU, BN → LN, fewer layers)
        Accuracy: 81.0% → 82.0%

Final ConvNeXt-T: 82.0% (matches Swin-T)

2.2 ConvNeXt 블록 아키텍처¶

Input (C channels)
    |
    ├──────────────────┐  (Residual connection)
    |                  |
Depthwise Conv 7×7     |
    |                  |
LayerNorm              |
    |                  |
1×1 Conv (4C)          |  (Expansion)
    |                  |
GELU                   |
    |                  |
1×1 Conv (C)           |  (Projection)
    |                  |
    +──────────────────┘
    |
Output (C channels)

ResNet과의 주요 차이점: - 3×3 표준 합성곱 대신 Depthwise convolution (7×7) - Inverted bottleneck: 4C로 확장한 후 다시 C로 투영 - BatchNorm 대신 LayerNorm - ReLU 대신 GELU - 더 적은 활성화 함수: 블록당 하나만

2.3 PyTorch 구현¶

import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """ConvNeXt block with inverted bottleneck design."""

    def __init__(self, dim, expansion_ratio=4, kernel_size=7, layer_scale_init=1e-6):
        super().__init__()

        # Depthwise convolution
        self.dwconv = nn.Conv2d(
            dim, dim, kernel_size=kernel_size,
            padding=kernel_size // 2, groups=dim
        )

        # Normalization and projection
        self.norm = nn.LayerNorm(dim, eps=1e-6)
        self.pwconv1 = nn.Linear(dim, expansion_ratio * dim)  # Expansion
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(expansion_ratio * dim, dim)  # Projection

        # Layer scale (learned per-channel scaling)
        self.gamma = nn.Parameter(
            layer_scale_init * torch.ones(dim)
        ) if layer_scale_init > 0 else None

    def forward(self, x):
        shortcut = x

        # Depthwise conv
        x = self.dwconv(x)

        # Permute for LayerNorm and pointwise convs
        x = x.permute(0, 2, 3, 1)  # (N, C, H, W) -> (N, H, W, C)

        # Inverted bottleneck with LayerNorm
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)

        # Layer scale
        if self.gamma is not None:
            x = self.gamma * x

        # Permute back
        x = x.permute(0, 3, 1, 2)  # (N, H, W, C) -> (N, C, H, W)

        # Residual connection
        x = shortcut + x
        return x


class ConvNeXt(nn.Module):
    """ConvNeXt model."""

    def __init__(
        self,
        in_chans=3,
        num_classes=1000,
        depths=[3, 3, 9, 3],  # Number of blocks per stage
        dims=[96, 192, 384, 768],  # Channels per stage
        **kwargs
    ):
        super().__init__()

        # Stem: patchify with 4×4 conv, stride 4
        self.stem = nn.Sequential(
            nn.Conv2d(in_chans, dims[0], kernel_size=4, stride=4),
            nn.LayerNorm(dims[0], eps=1e-6, elementwise_affine=True)
        )

        # Build 4 stages
        self.stages = nn.ModuleList()
        for i in range(4):
            # Downsampling layer (except first stage)
            if i > 0:
                downsample = nn.Sequential(
                    nn.LayerNorm(dims[i-1], eps=1e-6),
                    nn.Conv2d(dims[i-1], dims[i], kernel_size=2, stride=2)
                )
            else:
                downsample = nn.Identity()

            # Stack ConvNeXt blocks
            blocks = nn.Sequential(*[
                ConvNeXtBlock(dims[i], **kwargs) for _ in range(depths[i])
            ])

            stage = nn.Sequential(downsample, blocks)
            self.stages.append(stage)

        # Head
        self.norm = nn.LayerNorm(dims[-1], eps=1e-6)
        self.head = nn.Linear(dims[-1], num_classes)

    def forward_features(self, x):
        # Stem
        x = self.stem(x)
        x = x.permute(0, 2, 3, 1)  # (N, C, H, W) -> (N, H, W, C)

        # Stages
        for stage in self.stages:
            x = x.permute(0, 3, 1, 2)  # -> (N, C, H, W)
            x = stage(x)
            x = x.permute(0, 2, 3, 1)  # -> (N, H, W, C)

        return self.norm(x.mean([1, 2]))  # Global average pooling

    def forward(self, x):
        x = self.forward_features(x)
        x = self.head(x)
        return x


# Example usage
model = ConvNeXt(
    depths=[3, 3, 9, 3],  # ConvNeXt-T
    dims=[96, 192, 384, 768]
)

x = torch.randn(2, 3, 224, 224)
output = model(x)
print(f"Output shape: {output.shape}")  # (2, 1000)

2.4 ConvNeXt V2 개선사항 (2023)¶

ConvNeXt V2는 다음을 도입했습니다: 1. 전역 응답 정규화(Global Response Normalization, GRN): 채널 간 특징 경쟁 강화 2. 완전 합성곱 MAE(Fully convolutional MAE): ConvNet을 위한 마스크 오토인코더 사전학습 3. 성능 향상: ImageNet-1K에서 87.3% (ConvNeXt V2-H)

class GRN(nn.Module):
    """Global Response Normalization layer."""

    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))

    def forward(self, x):
        # x: (N, H, W, C)
        # Compute global feature map
        Gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)
        # Normalize
        Nx = Gx / (Gx.mean(dim=-1, keepdim=True) + 1e-6)
        # Scale and shift
        return self.gamma * (x * Nx) + self.beta + x

3. EfficientNetV2¶

EfficientNetV2 (Tan & Le, 2021)는 다음을 통해 학습 속도와 파라미터 효율성을 개선합니다: 1. Fused-MBConv 블록: 확장(expansion)과 depthwise 합성곱 융합 2. 점진적 학습(Progressive training): 이미지 크기와 정규화를 점진적으로 증가 3. 신경망 아키텍처 탐색(Neural Architecture Search, NAS): 학습 속도에 최적화

3.1 Fused-MBConv vs. MBConv¶

MBConv (MobileNetV2):                Fused-MBConv:
  Input                                Input
    |                                    |
  1×1 Conv (expand)                    3×3 Conv (expand)
    |                                    |
  DW 3×3                               [Fused operation]
    |                                    |
  1×1 Conv (project)                   1×1 Conv (project)
    |                                    |
  Output                               Output

  3 separate ops                       2 ops (faster for small FLOPs)

트레이드오프: - MBConv: 더 큰 모델에 적합 (더 적은 파라미터) - Fused-MBConv: 더 작은 모델에 적합 (더 빠른 학습)

EfficientNetV2는 서로 다른 단계에서 둘 다 사용합니다.

3.2 점진적 학습¶

핵심 아이디어: 처음에는 더 작은 이미지와 약한 정규화로 학습하고, 점진적으로 증가시킵니다.

Stage 1 (epochs 0-50):
  - Image size: 128×128
  - RandAugment magnitude: 5
  - Mixup alpha: 0

Stage 2 (epochs 50-100):
  - Image size: 192×192
  - RandAugment magnitude: 10
  - Mixup alpha: 0.2

Stage 3 (epochs 100-150):
  - Image size: 256×256
  - RandAugment magnitude: 15
  - Mixup alpha: 0.4

이점: - 더 빠른 수렴: 더 작은 이미지로 최적화가 더 쉬움 - 더 나은 정규화: 더 큰 이미지에 더 강력한 증강 - 향상된 정확도: ImageNet에서 85.7% (EfficientNetV2-L)

3.3 timm으로 EfficientNetV2 사용하기¶

import timm
import torch

# List available EfficientNetV2 models
models = timm.list_models('*efficientnetv2*', pretrained=True)
print(models)
# ['tf_efficientnetv2_b0', 'tf_efficientnetv2_b1', ..., 'tf_efficientnetv2_l']

# Load pretrained EfficientNetV2-S
model = timm.create_model('tf_efficientnetv2_s', pretrained=True)
model.eval()

# Get model info
print(f"Parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")
print(f"Input size: {model.default_cfg['input_size']}")

# Inference
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

from PIL import Image
img = Image.open('cat.jpg')
x = transforms(img).unsqueeze(0)  # (1, 3, 384, 384)

with torch.no_grad():
    logits = model(x)
    probs = torch.softmax(logits, dim=1)
    top5_idx = torch.topk(probs, 5).indices[0]

# Print top-5 predictions
labels = timm.data.ImageNetInfo.label_names()
for idx in top5_idx:
    print(f"{labels[idx]}: {probs[0, idx]:.3f}")

4. DINOv2: 자기지도 학습 비전 파운데이션 모델¶

DINOv2 (Oquab et al., 2023)는 레이블 없이 1억 4200만 이미지로 사전학습된 자기지도 학습 Vision Transformer입니다.

4.1 주요 혁신¶

레이블 없는 자기 증류(Self-distillation with no labels) (DINO 프레임워크)
레지스터 토큰을 가진 ViT 백본
대규모 사전학습 (1억 4200만 이미지, LVD-142M 데이터셋)
멀티태스크 헤드: 분류, 분할, 깊이 추정

DINO Self-Distillation:

   Student (ViT-S)          Teacher (EMA of Student)
         |                          |
    [CLS] token               [CLS] token
         |                          |
    ┌─────────┐              ┌─────────┐
    │ Predict │              │ Target  │
    └─────────┘              └─────────┘
         |                          |
         └──────── Match ───────────┘
              (no labels!)

Augmentations:
  - Student: strong crops (multi-crop)
  - Teacher: weak crops (global views)

레지스터 토큰(Register tokens) (추가 학습 가능 토큰):
배경 아티팩트를 흡수하여 특징 품질 향상
[CLS] 토큰과 유사하지만 분류에는 사용되지 않음
동결된 백본 + 선형 프로브(linear probes):
동결된 DINOv2로 특징 추출
다운스트림 태스크를 위한 경량 헤드 학습

4.2 모델 변형¶

모델	파라미터	레이어	Hidden Dim	Patch Size
DINOv2-S	22M	12	384	14×14
DINOv2-B	86M	12	768	14×14
DINOv2-L	304M	24	1024	14×14
DINOv2-g	1.1B	40	1536	14×14

4.3 사전학습된 DINOv2 사용하기¶

import torch
from transformers import AutoImageProcessor, AutoModel
from PIL import Image

# Load pretrained DINOv2-base
processor = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
model = AutoModel.from_pretrained('facebook/dinov2-base')
model.eval()

# Load image
img = Image.open('cat.jpg')

# Extract features
inputs = processor(images=img, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# Get patch embeddings (excluding [CLS])
patch_embeddings = outputs.last_hidden_state[:, 1:, :]  # (1, num_patches, 768)
print(f"Patch embeddings shape: {patch_embeddings.shape}")

# Get [CLS] token (global image representation)
cls_token = outputs.last_hidden_state[:, 0, :]  # (1, 768)
print(f"CLS token shape: {cls_token.shape}")

# Use as feature extractor for downstream tasks
# Example: k-NN classification
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Assume we have a training set
train_features = []  # Extract from training images
train_labels = []

# Fit k-NN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(np.array(train_features), train_labels)

# Predict
pred = knn.predict(cls_token.numpy())

4.4 DINOv2를 사용한 다운스트림 태스크¶

1. 이미지 분류(Image Classification)

from transformers import Dinov2ForImageClassification

model = Dinov2ForImageClassification.from_pretrained(
    'facebook/dinov2-base',
    num_labels=10,  # Custom dataset
    ignore_mismatched_sizes=True
)

# Fine-tune on custom dataset
# ... training loop ...

2. 의미론적 분할(Semantic Segmentation)

# Use patch embeddings for dense prediction
B, N, D = patch_embeddings.shape
H = W = int(N ** 0.5)  # Assume square

# Reshape to spatial grid
spatial_features = patch_embeddings.reshape(B, H, W, D)
spatial_features = spatial_features.permute(0, 3, 1, 2)  # (B, D, H, W)

# Add segmentation head
seg_head = nn.Conv2d(D, num_classes, kernel_size=1)
logits = seg_head(spatial_features)  # (B, num_classes, H, W)

3. 깊이 추정(Depth Estimation)

# Similar to segmentation, but regress depth
depth_head = nn.Sequential(
    nn.Conv2d(D, 256, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(256, 1, kernel_size=1)
)
depth_map = depth_head(spatial_features)  # (B, 1, H, W)

5. 잠재 일관성 모델(Latent Consistency Models, LCM)¶

잠재 일관성 모델(Latent Consistency Models) (Luo et al., 2023)은 확산 모델에서 1-4 단계로 빠른 샘플링을 가능하게 합니다 (표준 확산의 25-50 단계 vs.).

5.1 일관성 증류(Consistency Distillation)¶

핵심 아이디어: 사전학습된 확산 모델을 노이즈가 있는 잠재 변수를 깨끗한 잠재 변수로 직접 매핑하는 일관성 모델로 증류합니다.

Standard Diffusion (DDPM):
  x_T (noise) → x_{T-1} → ... → x_1 → x_0 (clean)
  (50 steps, slow)

Latent Consistency Model:
  x_T (noise) ───────────────────────→ x_0 (clean)
  (1-4 steps, fast!)

Consistency property:
  For any t, t' ∈ [0, T]:
    f(x_t, t) ≈ f(x_{t'}, t')
  (all noisy latents map to same clean latent)

5.2 LCM 학습¶

사전학습된 확산 모델로 시작 (예: Stable Diffusion)
일관성 손실을 사용하여 LCM으로 증류:

Consistency loss:
  L = E_{x, t, t'} [ || f(x_t, t) - sg(f(x_{t'}, t')) ||^2 ]

  where:
    - x_t, x_{t'} are noisy latents at different timesteps
    - f is the consistency model (student)
    - sg is stop-gradient (teacher is EMA of student)

몇 단계 샘플링: 2-4 단계로 ODE 솔버 사용 (예: DDIM)

5.3 빠른 파인튜닝을 위한 LCM-LoRA¶

LCM-LoRA는 일관성 증류에 저랭크 적응(Low-Rank Adaptation)을 적용합니다: - 더 빠른 학습: LoRA 가중치만 학습 (~파라미터의 1-5%) - 조합 가능(Composable): 다른 LoRA(스타일, 캐릭터 등)와 결합 - 효율적: 단일 GPU에서 증류 가능

5.4 Diffusers로 LCM 사용하기¶

from diffusers import DiffusionPipeline, LCMScheduler
import torch

# Load LCM pipeline
pipe = DiffusionPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7",
    torch_dtype=torch.float16
)
pipe.to("cuda")

# LCM uses special scheduler
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

# Generate with 4 steps (vs. 50 for standard diffusion!)
prompt = "A beautiful sunset over mountains, highly detailed, 8k"
image = pipe(
    prompt=prompt,
    num_inference_steps=4,  # Very fast!
    guidance_scale=1.0,  # LCM works best with guidance_scale=1
).images[0]

image.save("sunset_lcm.png")

LCM-LoRA 사용하기:

from diffusers import StableDiffusionPipeline, LCMScheduler

# Load base model
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
)

# Load LCM-LoRA weights
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")

# Generate with 4-8 steps
image = pipe(
    prompt="Portrait of a cat, oil painting",
    num_inference_steps=8,
    guidance_scale=1.0
).images[0]

6. 아키텍처 비교 표¶

아키텍처	파라미터	FLOPs (G)	ImageNet 정확도	학습 데이터	사전학습 방법
ResNet-50	25M	4.1	76.2%	1.3M	Supervised
EfficientNet-B4	19M	4.5	82.9%	1.3M	Supervised + AutoAug
EfficientNetV2-S	24M	8.4	84.9%	1.3M	Supervised + Progressive
ViT-B/16	86M	17.6	84.5%	300M	Supervised (JFT-300M)
Swin-B	88M	15.4	85.2%	1.3M	Supervised
ConvNeXt-B	89M	15.4	85.8%	1.3M	Supervised
ConvNeXt V2-B	89M	15.4	86.8%	1.3M	FCMAE (self-supervised)
DINOv2-B	86M	17.6	84.5% (linear)	142M	Self-supervised (DINO)
DINOv2-g	1.1B	280	88.5% (linear)	142M	Self-supervised (DINO)

참고사항: - FLOPs: 224×224 해상도에서 측정 - ImageNet Acc: ImageNet-1K 검증 세트에서 Top-1 정확도 - DINOv2 (linear): 선형 프로브 평가 (동결된 특징 + 선형 분류기)

7. 사전학습된 모델 사용하기: 실용 가이드¶

7.1 timm 라이브러리 (PyTorch Image Models)¶

timm은 통합된 인터페이스로 700개 이상의 사전학습된 모델을 제공합니다.

import timm
import torch

# List all models
all_models = timm.list_models(pretrained=True)
print(f"Total models: {len(all_models)}")

# Search for specific architecture
convnext_models = timm.list_models('convnext*', pretrained=True)
print(convnext_models)

# Create model
model = timm.create_model(
    'convnext_base.fb_in22k_ft_in1k',  # Pretrained on ImageNet-22k, fine-tuned on 1k
    pretrained=True,
    num_classes=1000
)

# Inspect model
print(model.default_cfg)  # Config dict
print(f"Parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")

# Feature extraction mode
model = timm.create_model('convnext_base', pretrained=True, num_classes=0)
# Returns features instead of logits

# Get intermediate features
model = timm.create_model('convnext_base', pretrained=True, features_only=True)
x = torch.randn(1, 3, 224, 224)
features = model(x)
for i, feat in enumerate(features):
    print(f"Stage {i}: {feat.shape}")
# Stage 0: (1, 128, 56, 56)
# Stage 1: (1, 256, 28, 28)
# Stage 2: (1, 512, 14, 14)
# Stage 3: (1, 1024, 7, 7)

7.2 Hugging Face Transformers¶

transformers 라이브러리는 AutoModel을 통해 비전 모델을 지원합니다.

from transformers import AutoImageProcessor, AutoModelForImageClassification
import torch

# Load processor and model
processor = AutoImageProcessor.from_pretrained("microsoft/swin-base-patch4-window7-224")
model = AutoModelForImageClassification.from_pretrained("microsoft/swin-base-patch4-window7-224")

# Prepare input
from PIL import Image
img = Image.open("cat.jpg")
inputs = processor(images=img, return_tensors="pt")

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = logits.argmax(-1).item()

print(f"Predicted class: {model.config.id2label[predicted_class]}")

7.3 전이 학습 모범 사례¶

1. 특징 추출(Feature Extraction):

# Freeze pretrained weights
model = timm.create_model('convnext_base', pretrained=True)
for param in model.parameters():
    param.requires_grad = False

# Replace classifier head
num_classes = 10
model.head = torch.nn.Linear(model.head.in_features, num_classes)

# Only train the head
optimizer = torch.optim.Adam(model.head.parameters(), lr=1e-3)

2. 파인튜닝(Fine-tuning):

# Unfreeze all layers
for param in model.parameters():
    param.requires_grad = True

# Use lower learning rate for pretrained weights
optimizer = torch.optim.AdamW([
    {'params': model.stem.parameters(), 'lr': 1e-5},
    {'params': model.stages.parameters(), 'lr': 5e-5},
    {'params': model.head.parameters(), 'lr': 1e-3}
])

3. 점진적 동결 해제(Progressive unfreezing) (ULMFiT 전략):

# Epoch 0-5: Train head only
# Epoch 5-10: Unfreeze last stage
# Epoch 10+: Unfreeze all

def unfreeze_layers(model, epoch):
    if epoch < 5:
        # Freeze all except head
        for param in model.stem.parameters():
            param.requires_grad = False
        for param in model.stages.parameters():
            param.requires_grad = False
    elif epoch < 10:
        # Unfreeze last stage
        for param in model.stages[-1].parameters():
            param.requires_grad = True
    else:
        # Unfreeze all
        for param in model.parameters():
            param.requires_grad = True

8. 아키텍처 선택 가이드¶

8.1 의사 결정 트리¶

┌─ Need supervised pretraining?
│  ├─ Yes
│  │  ├─ Priority: Accuracy
│  │  │  └─ ConvNeXt V2, EfficientNetV2-L, Swin-L
│  │  └─ Priority: Speed
│  │     └─ EfficientNetV2-S, MobileNetV3
│  └─ No (self-supervised)
│     ├─ Vision foundation model
│     │  └─ DINOv2-L/g (best features)
│     └─ Custom dataset
│        └─ DINO, MAE, SimCLR

┌─ Need generative model?
│  ├─ Fast sampling (1-4 steps)
│  │  └─ Latent Consistency Models
│  └─ Best quality (25-50 steps)
│     └─ Stable Diffusion, DALL-E 3

┌─ Deployment constraints?
│  ├─ Edge device (mobile, IoT)
│  │  └─ MobileNetV3, EfficientNet-B0
│  ├─ Low latency (< 10ms)
│  │  └─ ConvNeXt-T, EfficientNetV2-S
│  └─ No constraints
│     └─ Any large model

8.2 실용적 권장사항¶

범용 비전 태스크 (분류, 탐지, 분할): - DINOv2: 퓨샷 학습(few-shot learning)을 위한 최고의 동결된 특징 - ConvNeXt V2: 최고의 파인튜닝 성능 - EfficientNetV2: 최고의 속도-정확도 트레이드오프

생성 태스크 (이미지 합성): - Stable Diffusion XL: 최고 품질 (50 단계) - LCM: 최고 속도 (4 단계) - LCM-LoRA: 최고 커스터마이징

자원 제약: - MobileNetV3: 모바일 배포 - EfficientNet-B0/B1: 엣지 디바이스에서 좋은 정확도

9. 연습 문제¶

문제 1: ConvNeXt 블록 구현¶

제공된 코드를 사용하지 않고 처음부터 ConvNeXt 블록을 구현하세요. 다음을 포함해야 합니다: - Depthwise 7×7 합성곱 - LayerNorm - Inverted bottleneck (4× 확장을 가진 1×1 합성곱) - GELU 활성화 - Layer scale - Residual connection

입력 shape (2, 64, 32, 32)로 테스트하세요.

솔루션

import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    def __init__(self, dim, expansion_ratio=4, layer_scale_init=1e-6):
        super().__init__()

        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, expansion_ratio * dim)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(expansion_ratio * dim, dim)
        self.gamma = nn.Parameter(layer_scale_init * torch.ones(dim))

    def forward(self, x):
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # (N, C, H, W) -> (N, H, W, C)
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        x = self.gamma * x
        x = x.permute(0, 3, 1, 2)  # (N, H, W, C) -> (N, C, H, W)
        return shortcut + x

# Test
block = ConvNeXtBlock(64)
x = torch.randn(2, 64, 32, 32)
out = block(x)
assert out.shape == (2, 64, 32, 32)
print("ConvNeXt block test passed!")

문제 2: 점진적 학습 스케줄¶

다음을 수행하는 EfficientNetV2를 위한 점진적 학습 스케줄러를 구현하세요: - 이미지 크기를 128 → 192 → 256으로 증가 - RandAugment magnitude를 5 → 10 → 15로 증가 - Mixup alpha를 0 → 0.2 → 0.4로 증가 - 각 단계는 50 에폭 지속

솔루션

class ProgressiveTrainingScheduler:
    def __init__(self, total_epochs=150):
        self.total_epochs = total_epochs
        self.stages = [
            {'epochs': (0, 50), 'img_size': 128, 'rand_aug_mag': 5, 'mixup_alpha': 0.0},
            {'epochs': (50, 100), 'img_size': 192, 'rand_aug_mag': 10, 'mixup_alpha': 0.2},
            {'epochs': (100, 150), 'img_size': 256, 'rand_aug_mag': 15, 'mixup_alpha': 0.4},
        ]

    def get_config(self, epoch):
        for stage in self.stages:
            if stage['epochs'][0] <= epoch < stage['epochs'][1]:
                return {
                    'img_size': stage['img_size'],
                    'rand_aug_mag': stage['rand_aug_mag'],
                    'mixup_alpha': stage['mixup_alpha']
                }
        return self.stages[-1]  # Return last stage config

    def __call__(self, epoch):
        return self.get_config(epoch)

# Usage
scheduler = ProgressiveTrainingScheduler()
for epoch in [0, 25, 50, 75, 100, 125]:
    config = scheduler(epoch)
    print(f"Epoch {epoch}: img_size={config['img_size']}, "
          f"rand_aug={config['rand_aug_mag']}, mixup={config['mixup_alpha']}")

문제 3: DINOv2 특징 추출¶

DINOv2에서 패치 레벨 특징을 추출하고 코사인 유사도를 사용하여 특징 유사도를 시각화하세요. 1. DINOv2-small 로드 2. 이미지의 패치 임베딩 추출 3. 패치 간 쌍별 코사인 유사도 계산 4. 히트맵으로 시각화

솔루션

import torch
from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np

# Load model
processor = AutoImageProcessor.from_pretrained('facebook/dinov2-small')
model = AutoModel.from_pretrained('facebook/dinov2-small')
model.eval()

# Load image
img = Image.open('cat.jpg')
inputs = processor(images=img, return_tensors='pt')

# Extract features
with torch.no_grad():
    outputs = model(**inputs)
    patch_embeddings = outputs.last_hidden_state[:, 1:, :]  # Exclude [CLS]

# Reshape to spatial grid
B, N, D = patch_embeddings.shape
H = W = int(N ** 0.5)
patches = patch_embeddings.reshape(B, H, W, D)[0]  # (H, W, D)

# Compute cosine similarity
patches_flat = patches.reshape(-1, D)  # (H*W, D)
# Normalize
patches_norm = patches_flat / patches_flat.norm(dim=1, keepdim=True)
# Cosine similarity matrix
sim_matrix = patches_norm @ patches_norm.T  # (H*W, H*W)

# Visualize
plt.figure(figsize=(10, 10))
plt.imshow(sim_matrix.numpy(), cmap='viridis')
plt.colorbar(label='Cosine Similarity')
plt.title('Patch-level Feature Similarity (DINOv2)')
plt.xlabel('Patch index')
plt.ylabel('Patch index')
plt.tight_layout()
plt.savefig('dinov2_similarity.png')

문제 4: LCM 빠른 생성¶

표준 DDIM (50 단계)과 LCM (4 단계) 간의 생성 속도와 품질을 비교하세요: 1. Stable Diffusion 1.5 로드 2. DDIM으로 생성 (50 단계) 3. LCM-LoRA 로드 4. LCM으로 생성 (4 단계) 5. 두 경우의 시간 측정

솔루션

from diffusers import StableDiffusionPipeline, LCMScheduler
import torch
import time

# Load base model
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
)
pipe.to("cuda")

prompt = "A serene lake with mountains in background, sunset, highly detailed"

# Standard DDIM
print("Generating with DDIM (50 steps)...")
start = time.time()
image_ddim = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
ddim_time = time.time() - start
print(f"DDIM time: {ddim_time:.2f}s")
image_ddim.save("ddim_50steps.png")

# Load LCM-LoRA
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

# LCM generation
print("Generating with LCM (4 steps)...")
start = time.time()
image_lcm = pipe(prompt, num_inference_steps=4, guidance_scale=1.0).images[0]
lcm_time = time.time() - start
print(f"LCM time: {lcm_time:.2f}s")
image_lcm.save("lcm_4steps.png")

# Speed comparison
speedup = ddim_time / lcm_time
print(f"\nSpeedup: {speedup:.1f}x faster with LCM")

문제 5: 모델 비교¶

사용자 정의 데이터셋에서 ConvNeXt-T, EfficientNetV2-S, DINOv2-S를 비교하세요: 1. timm/transformers에서 세 모델 모두 로드 2. 학습 세트에 대한 특징 추출 (동결) 3. 특징에 대해 선형 SVM 학습 4. 정확도 및 추론 시간 보고

솔루션

import timm
import torch
from transformers import AutoImageProcessor, AutoModel
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
import numpy as np
import time

# Assume we have a dataset loader
train_loader = ...  # DataLoader for training set
test_loader = ...   # DataLoader for test set

def extract_features(model, loader, is_dinov2=False):
    features, labels = [], []
    with torch.no_grad():
        for imgs, lbls in loader:
            imgs = imgs.cuda()
            if is_dinov2:
                # DINOv2 uses different interface
                outputs = model(imgs)
                feats = outputs.last_hidden_state[:, 0, :]  # [CLS]
            else:
                feats = model(imgs)  # timm feature extractor
            features.append(feats.cpu().numpy())
            labels.append(lbls.numpy())
    return np.concatenate(features), np.concatenate(labels)

# 1. ConvNeXt-T
print("Loading ConvNeXt-T...")
convnext = timm.create_model('convnext_tiny', pretrained=True, num_classes=0)
convnext = convnext.cuda().eval()

start = time.time()
train_feats_cn, train_labels = extract_features(convnext, train_loader)
test_feats_cn, test_labels = extract_features(convnext, test_loader)
cn_time = time.time() - start

# 2. EfficientNetV2-S
print("Loading EfficientNetV2-S...")
effnet = timm.create_model('tf_efficientnetv2_s', pretrained=True, num_classes=0)
effnet = effnet.cuda().eval()

start = time.time()
train_feats_eff, _ = extract_features(effnet, train_loader)
test_feats_eff, _ = extract_features(effnet, test_loader)
eff_time = time.time() - start

# 3. DINOv2-S
print("Loading DINOv2-S...")
dinov2 = AutoModel.from_pretrained('facebook/dinov2-small')
dinov2 = dinov2.cuda().eval()

start = time.time()
train_feats_dino, _ = extract_features(dinov2, train_loader, is_dinov2=True)
test_feats_dino, _ = extract_features(dinov2, test_loader, is_dinov2=True)
dino_time = time.time() - start

# Train linear SVM on each
for name, train_feats, test_feats, infer_time in [
    ('ConvNeXt-T', train_feats_cn, test_feats_cn, cn_time),
    ('EfficientNetV2-S', train_feats_eff, test_feats_eff, eff_time),
    ('DINOv2-S', train_feats_dino, test_feats_dino, dino_time)
]:
    svm = LinearSVC(max_iter=10000)
    svm.fit(train_feats, train_labels)
    preds = svm.predict(test_feats)
    acc = accuracy_score(test_labels, preds)
    print(f"{name}: Accuracy={acc:.3f}, Inference time={infer_time:.2f}s")

네비게이션¶

이전: 26. 정규화 레이어(Normalization Layers)
다음: 개요

추가 자료¶

ConvNeXt: A ConvNet for the 2020s (Liu et al., 2022)
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders (Woo et al., 2023)
EfficientNetV2: Smaller Models and Faster Training (Tan & Le, 2021)
DINOv2: Learning Robust Visual Features without Supervision (Oquab et al., 2023)
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference (Luo et al., 2023)
timm Documentation: https://timm.fast.ai/
Hugging Face Models: https://huggingface.co/models