35. CLIP (Contrastive Language-Image Pre-training)

이전: CLIPκ³Ό λ©€ν‹°λͺ¨λ‹¬ ν•™μŠ΅ | λ‹€μŒ: Self-Supervised Learning


35. CLIP (Contrastive Language-Image Pre-training)

κ°œμš”

CLIP maps images and text into a shared embedding space, enabling zero-shot image classification. "Learning Transferable Visual Models From Natural Language Supervision" (Radford et al., 2021)


μˆ˜ν•™μ  λ°°κ²½

1. Contrastive Learning

λͺ©ν‘œ: 이미지-ν…μŠ€νŠΈ 쌍의 μœ μ‚¬λ„ ν•™μŠ΅

Batch λ‚΄ N개의 (image, text) 쌍:
- λŒ€κ°μ„  (i, i): μΌμΉ˜ν•˜λŠ” 쌍 (positive)
- λΉ„λŒ€κ°μ„  (i, j): 뢈일치 쌍 (negative)

μœ μ‚¬λ„ ν–‰λ ¬ (N Γ— N):
S[i, j] = <image_i, text_j> / Ο„

μ—¬κΈ°μ„œ Ο„λŠ” temperature parameter

2. InfoNCE Loss

Image-to-Text Loss:
L_i2t = -1/N Ξ£α΅’ log(exp(S[i,i]) / Ξ£β±Ό exp(S[i,j]))

Text-to-Image Loss:
L_t2i = -1/N Ξ£α΅’ log(exp(S[i,i]) / Ξ£β±Ό exp(S[j,i]))

Total Loss:
L = (L_i2t + L_t2i) / 2

Intuition:
- Numerator: pulls the similarity of the matching pair up
- Denominator: pushes the similarity to every other pair down

3. Zero-shot Classification

μƒˆλ‘œμš΄ 이미지 λΆ„λ₯˜:

1. ν΄λž˜μŠ€λ³„ ν…μŠ€νŠΈ ν”„λ‘¬ν”„νŠΈ 생성:
   "A photo of a {class_name}"

2. ν…μŠ€νŠΈ μž„λ² λ”© 계산:
   T = [text_enc("A photo of a cat"),
        text_enc("A photo of a dog"),
        ...]

3. 이미지 μž„λ² λ”© 계산:
   I = image_enc(image)

4. μœ μ‚¬λ„λ‘œ λΆ„λ₯˜:
   probs = softmax(I @ T.T / Ο„)
   prediction = argmax(probs)

ν•™μŠ΅ 없이 μƒˆλ‘œμš΄ 클래슀 λΆ„λ₯˜ κ°€λŠ₯!

CLIP μ•„ν‚€ν…μ²˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         CLIP                                 β”‚
β”‚                                                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”‚
β”‚  β”‚   Image Encoder   β”‚         β”‚   Text Encoder    β”‚        β”‚
β”‚  β”‚                   β”‚         β”‚                   β”‚        β”‚
β”‚  β”‚  ViT-B/32         β”‚         β”‚  Transformer      β”‚        β”‚
β”‚  β”‚  or               β”‚         β”‚  (12 layers)      β”‚        β”‚
β”‚  β”‚  ResNet-50        β”‚         β”‚                   β”‚        β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β”‚
β”‚            β”‚                             β”‚                   β”‚
β”‚            β–Ό                             β–Ό                   β”‚
β”‚     Image Embedding              Text Embedding              β”‚
β”‚        (B, D)                       (B, D)                   β”‚
β”‚            β”‚                             β”‚                   β”‚
β”‚            β”‚      L2 Normalize           β”‚                   β”‚
β”‚            β–Ό                             β–Ό                   β”‚
β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”‚
β”‚     β”‚         Contrastive Loss                 β”‚            β”‚
β”‚     β”‚   maximize similarity of matching pairs   β”‚            β”‚
β”‚     β”‚   minimize similarity of non-matching    β”‚            β”‚
β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

λͺ¨λΈ λ³€ν˜•:
- CLIP ViT-B/32: 512 dim, 86M image + 63M text params
- CLIP ViT-B/16: 512 dim, 86M image + 63M text params
- CLIP ViT-L/14: 768 dim, 304M image + 123M text params
- CLIP RN50: ResNet-50 image encoder
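
A minimal sketch of how the two encoders meet, assuming image_encoder and text_encoder are arbitrary modules that emit (B, D) features; the learnable log-temperature initialization (τ = 0.07) and the clamp at 100 follow the paper:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPHead(nn.Module):
    """Ties two encoders together: normalize, scale, produce (B, B) logits."""

    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        # learnable log(1/tau), initialized so tau = 0.07 as in the paper
        self.logit_scale = nn.Parameter(torch.tensor(math.log(1 / 0.07)))

    def forward(self, images: torch.Tensor, texts: torch.Tensor) -> torch.Tensor:
        img = F.normalize(self.image_encoder(images), dim=-1)   # (B, D)
        txt = F.normalize(self.text_encoder(texts), dim=-1)     # (B, D)
        scale = self.logit_scale.exp().clamp(max=100.0)         # cap from the paper
        return scale * img @ txt.t()                            # similarity logits

Feeding these logits to the clip_loss sketch above gives the training objective.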

파일 ꡬ쑰

12_CLIP/
β”œβ”€β”€ README.md
β”œβ”€β”€ numpy/
β”‚   └── clip_forward.py       # NumPy forward pass
β”œβ”€β”€ pytorch_lowlevel/
β”‚   └── clip_lowlevel.py      # PyTorch Low-Level CLIP
β”œβ”€β”€ paper/
β”‚   └── clip_paper.py         # λ…Όλ¬Έ μž¬ν˜„
└── exercises/
    β”œβ”€β”€ 01_zero_shot.md       # Zero-shot classification
    └── 02_retrieval.md       # Image-text retrieval

핡심 κ°œλ…

1. λŒ€κ·œλͺ¨ 데이터셋

WebImageText (WIT) 데이터셋:
- 4μ–΅ 개의 (image, text) 쌍
- μΈν„°λ„·μ—μ„œ μˆ˜μ§‘
- μžμ—°μ–΄ supervision

데이터 μˆ˜μ§‘:
1. 이미지와 alt-text 쌍 μˆ˜μ§‘
2. 필터링 (ν’ˆμ§ˆ, 쀑볡 제거)
3. 클래슀 κ· ν˜• λ§žμΆ”κΈ°

2. Prompt Engineering

λ‹¨μˆœ ν”„λ‘¬ν”„νŠΈ:
"cat"  β†’  "A photo of a cat"

ν”„λ‘¬ν”„νŠΈ 앙상블:
templates = [
    "A photo of a {}",
    "A picture of a {}",
    "An image showing a {}",
    "A {} in the scene"
]

# μ—¬λŸ¬ ν…œν”Œλ¦Ώμ˜ 평균
text_embeddings = []
for template in templates:
    prompt = template.format(class_name)
    embedding = text_encoder(prompt)
    text_embeddings.append(embedding)
final_embedding = mean(text_embeddings)

3. μ‘μš© λΆ„μ•Ό

1. Zero-shot Classification
   - μƒˆλ‘œμš΄ 도메인에 λ°”λ‘œ 적용
   - ν”„λ‘¬ν”„νŠΈλ‘œ 클래슀 μ •μ˜

2. Image-Text Retrieval
   - ν…μŠ€νŠΈλ‘œ 이미지 검색
   - μ΄λ―Έμ§€λ‘œ ν…μŠ€νŠΈ 검색

3. Image Generation Guidance
   - DALL-E, Stable Diffusion의 guidance
   - CLIP score둜 생성 ν’ˆμ§ˆ μΈ‘μ •

4. Multimodal Embedding
   - 이미지와 ν…μŠ€νŠΈμ˜ 곡톡 ν‘œν˜„
   - λ‹€μš΄μŠ€νŠΈλ¦Ό νƒœμŠ€ν¬ 기반

κ΅¬ν˜„ 레벨

Level 2: PyTorch Low-Level (pytorch_lowlevel/)

  • Image encoder (ViT) 직접 κ΅¬ν˜„
  • Text encoder (Transformer) 직접 κ΅¬ν˜„
  • Contrastive loss κ΅¬ν˜„

Level 3: Paper Implementation (paper/)

  • 전체 ν•™μŠ΅ νŒŒμ΄ν”„λΌμΈ
  • Zero-shot 평가
  • Prompt engineering

Level 4: Code Analysis (별도)

  • OpenAI CLIP μ½”λ“œ 뢄석
  • open_clip 라이브러리 뢄석

ν•™μŠ΅ 체크리슀트

  • [ ] Contrastive learning 이해
  • [ ] InfoNCE loss μˆ˜μ‹ 이해
  • [ ] Zero-shot classification κ΅¬ν˜„
  • [ ] Temperature의 μ—­ν•  이해
  • [ ] Prompt engineering μ‹€μŠ΅
  • [ ] Image-text retrieval κ΅¬ν˜„

References

- Radford, A., Kim, J. W., Hallacy, C., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision." arXiv:2103.00020.