# 22. Vision Transformer (ViT)
## Overview

Vision Transformer (ViT) applies the Transformer architecture to image classification. It splits an image into patches and treats each patch as a token. "An Image is Worth 16x16 Words" (Dosovitskiy et al., 2020)
## Mathematical Background

### 1. Image Patchification

Input image: x ∈ R^(H × W × C)
Patch size: P × P
Patch sequence:

x_p ∈ R^(N × (P²·C)), where N = (H × W) / P²
Example:
- Image: 224 × 224 × 3
- Patch: 16 × 16
- N = (224 × 224) / (16 × 16) = 196 patches
- Each patch: 16 × 16 × 3 = 768 dimensions
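The patchification above can be sketched in a few lines of PyTorch (`patchify` is an illustrative helper, not part of this repo):

```python
import torch
import torch.nn.functional as F

def patchify(x: torch.Tensor, P: int) -> torch.Tensor:
    """Split images (B, C, H, W) into flattened non-overlapping patches (B, N, P*P*C)."""
    # unfold extracts P x P blocks with stride P: output is (B, C*P*P, N)
    patches = F.unfold(x, kernel_size=P, stride=P)
    return patches.transpose(1, 2)  # (B, N, P*P*C)

x = torch.randn(1, 3, 224, 224)
tokens = patchify(x, 16)
print(tokens.shape)  # torch.Size([1, 196, 768])
```

Note that N = 196 and P²·C = 768 match the worked example above.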
### 2. Patch Embedding

Linear projection:

z_0 = [x_class; x_p¹E; x_p²E; ...; x_pᴺE] + E_pos

where:
- x_class: learnable [CLS] token
- E ∈ R^((P²·C) × D): patch embedding matrix
- E_pos ∈ R^((N+1) × D): position embedding
- z_0 ∈ R^((N+1) × D): initial embedding sequence
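A minimal sketch of this embedding step as a PyTorch module (names like `PatchEmbed`, `proj`, `pos_embed` are illustrative; the projection E is implemented as an equivalent strided convolution, a common trick):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patchify + linear projection + [CLS] token + position embedding -> z_0."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2                  # N
        # E as a strided conv: equivalent to flattening each patch and applying a Linear
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))             # x_class
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))  # E_pos

    def forward(self, x):
        B = x.shape[0]
        z = self.proj(x).flatten(2).transpose(1, 2)      # (B, N, D)
        cls = self.cls_token.expand(B, -1, -1)           # broadcast [CLS] over the batch
        z = torch.cat([cls, z], dim=1)                   # (B, N+1, D)
        return z + self.pos_embed                        # z_0

z0 = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(z0.shape)  # torch.Size([2, 197, 768])
```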
### 3. Transformer Encoder

Encoder block (repeated for l = 1 … L):

z'_l = MSA(LN(z_{l-1})) + z_{l-1}
z_l = MLP(LN(z'_l)) + z'_l

Final output:

y = LN(z_L⁰)  # use only the [CLS] token

where z_L⁰ is the [CLS] token of the final (L-th) layer.
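The two pre-norm equations above can be sketched as a single block, here using `nn.MultiheadAttention` for the MSA term (class and attribute names are illustrative):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Pre-norm ViT block: z' = MSA(LN(z)) + z ; z = MLP(LN(z')) + z'."""
    def __init__(self, dim=768, heads=12, mlp_dim=3072):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, z):
        h = self.ln1(z)                                      # LN before attention (pre-norm)
        z = z + self.attn(h, h, h, need_weights=False)[0]    # MSA + residual
        z = z + self.mlp(self.ln2(z))                        # MLP + residual
        return z

out = EncoderBlock()(torch.randn(2, 197, 768))
print(out.shape)  # torch.Size([2, 197, 768])
```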
## ViT Architecture Variants

| Variant          | Hidden size | Layers | Heads | MLP size | Patch size | Parameters |
|------------------|------------|--------|-------|----------|------------|------------|
| ViT-Base (B/16)  | 768        | 12     | 12    | 3072     | 16         | 86M        |
| ViT-Large (L/16) | 1024       | 24     | 16    | 4096     | 16         | 307M       |
| ViT-Huge (H/14)  | 1280       | 32     | 16    | 5120     | 14         | 632M       |
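The parameter counts can be sanity-checked with a rough per-layer formula: 4D² for the attention projections plus 2·D·MLP for the MLP (biases, LayerNorms, and embeddings omitted, so totals land slightly below the official numbers). A sketch, with the configs as a hypothetical dict:

```python
# Standard variant configs, transcribed from the table above.
VIT_CONFIGS = {
    "ViT-B/16": dict(dim=768,  layers=12, heads=12, mlp=3072, patch=16),
    "ViT-L/16": dict(dim=1024, layers=24, heads=16, mlp=4096, patch=16),
    "ViT-H/14": dict(dim=1280, layers=32, heads=16, mlp=5120, patch=14),
}

def approx_encoder_params(cfg):
    """Rough encoder count: per layer, 4*D^2 (Q/K/V/out) + 2*D*mlp (two Linear layers)."""
    d, m = cfg["dim"], cfg["mlp"]
    return cfg["layers"] * (4 * d * d + 2 * d * m)

for name, cfg in VIT_CONFIGS.items():
    print(name, f"~{approx_encoder_params(cfg) / 1e6:.0f}M encoder params")
# ViT-B/16 gives ~85M, close to the official 86M total
```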
## File Structure

10_ViT/
├── README.md
├── pytorch_lowlevel/
│   └── vit_lowlevel.py          # ViT from-scratch implementation
├── paper/
│   └── vit_paper.py             # paper reproduction
└── exercises/
    ├── 01_patch_embedding.md    # patch embedding visualization
    └── 02_attention_maps.md     # attention visualization
## Core Concepts

### 1. CNN vs ViT

CNN:
- Local receptive field
- Inductive bias: locality, translation equivariance
- Works well on small datasets

ViT:
- Global receptive field from the first layer
- Minimal inductive bias
- Works best with large-scale datasets (JFT-300M)
- Small datasets: pre-training is essential
### 2. Position Embedding

1D learnable (ViT default):
- N+1 learnable vectors
- Order information is learned from data

2D positional (variant):
- Separate (row, col) embeddings
- Reflects the 2D image structure

Sinusoidal:
- Fixed trigonometric functions
- Can extrapolate to unseen sequence lengths
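A minimal sketch of the fixed sinusoidal variant (standard Transformer-style sin/cos; the function name is illustrative):

```python
import torch

def sinusoidal_pos_embed(n_pos: int, dim: int) -> torch.Tensor:
    """Fixed sin/cos position embedding; needs no training and extends to any n_pos."""
    pos = torch.arange(n_pos, dtype=torch.float32).unsqueeze(1)           # (n_pos, 1)
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))   # (dim/2,)
    angles = pos * inv_freq                                               # (n_pos, dim/2)
    pe = torch.zeros(n_pos, dim)
    pe[:, 0::2] = torch.sin(angles)   # even dims: sine
    pe[:, 1::2] = torch.cos(angles)   # odd dims: cosine
    return pe

pe = sinusoidal_pos_embed(197, 768)   # N+1 = 197 positions for ViT-B/16
print(pe.shape)  # torch.Size([197, 768])
```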
### 3. [CLS] Token vs Global Average Pooling

[CLS] token:
- Prepended at the first position
- Aggregates a whole-image representation
- BERT style

Global average pooling:
- Average over all patch tokens
- CNN style
- Comparable performance in practice
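The two pooling choices differ by one line each; a sketch assuming the encoder output `z_L` has shape (B, N+1, D) with the [CLS] token at index 0:

```python
import torch

z_L = torch.randn(2, 197, 768)       # (B, N+1, D) encoder output

cls_repr = z_L[:, 0]                 # [CLS] token: take the first position (BERT style)
gap_repr = z_L[:, 1:].mean(dim=1)    # GAP: average the 196 patch tokens (CNN style)

print(cls_repr.shape, gap_repr.shape)  # both torch.Size([2, 768])
```

Either representation is then fed to the classification head.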
## Implementation Levels

### Level 2: PyTorch Low-Level (pytorch_lowlevel/)
- Uses F.linear, F.layer_norm
- No nn.TransformerEncoder
- Patchification implemented by hand
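A hypothetical taste of this low-level style: the encoder's MLP sub-block written with explicit parameter tensors and `F.linear` / `F.layer_norm` instead of `nn.Module` layers (weight names are illustrative):

```python
import torch
import torch.nn.functional as F

D, M = 768, 3072                                  # hidden size, MLP size (ViT-B)
W1, b1 = torch.randn(M, D) * 0.02, torch.zeros(M)  # expand weights, (out, in) layout
W2, b2 = torch.randn(D, M) * 0.02, torch.zeros(D)  # project-back weights
gamma, beta = torch.ones(D), torch.zeros(D)        # LayerNorm scale/shift

def mlp_block(z):
    h = F.layer_norm(z, (D,), gamma, beta)   # LN(z)
    h = F.gelu(F.linear(h, W1, b1))          # expand to MLP size, GELU
    return z + F.linear(h, W2, b2)           # project back + residual

out = mlp_block(torch.randn(2, 197, D))
print(out.shape)  # torch.Size([2, 197, 768])
```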
### Level 3: Paper Implementation (paper/)
- Faithful reproduction of the paper
- JFT/ImageNet pre-training
- Fine-tuning code
### Level 4: Code Analysis (separate)
- timm library analysis
- HuggingFace ViT analysis
## Learning Checklist

- [ ] Understand the patch embedding formulas
- [ ] Role of position embeddings
- [ ] Role of the [CLS] token
- [ ] Pros and cons vs CNNs
- [ ] Attention map visualization
- [ ] Fine-tuning strategy
## References
- Dosovitskiy et al. (2020). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"
- timm ViT
- 21_Vision_Transformer.md