# 35. CLIP (Contrastive Language-Image Pre-training)
## Overview

CLIP maps images and text into a shared embedding space, enabling zero-shot image classification.

Paper: "Learning Transferable Visual Models From Natural Language Supervision" (Radford et al., 2021)
## Mathematical Background

### 1. Contrastive Learning

Goal: learn the similarity between image-text pairs.

Given N (image, text) pairs in a batch:
- Diagonal entries (i, i): matching pairs (positives)
- Off-diagonal entries (i, j): mismatched pairs (negatives)

Similarity matrix (N × N):

    S[i, j] = <image_i, text_j> / τ

where τ is the temperature parameter.
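A minimal sketch of building this matrix in PyTorch; `img_emb`, `txt_emb`, and the fixed `tau` are illustrative stand-ins (in the real model the temperature is learned):

```python
import torch
import torch.nn.functional as F

N, D = 8, 512
img_emb = torch.randn(N, D)      # stand-in for image encoder outputs
txt_emb = torch.randn(N, D)      # stand-in for text encoder outputs
tau = 0.07                       # the paper's initial temperature value

# L2-normalize so the inner product is cosine similarity
img_emb = F.normalize(img_emb, dim=-1)
txt_emb = F.normalize(txt_emb, dim=-1)

S = img_emb @ txt_emb.t() / tau  # (N, N); S[i, j] = <image_i, text_j> / tau
```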
### 2. InfoNCE Loss

Image-to-text loss:

    L_i2t = -1/N Σᵢ log(exp(S[i,i]) / Σⱼ exp(S[i,j]))

Text-to-image loss:

    L_t2i = -1/N Σᵢ log(exp(S[i,i]) / Σⱼ exp(S[j,i]))

Total loss:

    L = (L_i2t + L_t2i) / 2

Intuition:
- Numerator: pull up the similarity of the matching pair
- Denominator: push down the similarity to all other pairs
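Since each row (and each column) of S should put its probability mass on the diagonal entry, both terms reduce to ordinary cross-entropy with labels 0..N-1. A minimal sketch, reusing the S from above:

```python
import torch
import torch.nn.functional as F

def clip_loss(S):
    """Symmetric InfoNCE loss over an (N, N) similarity matrix S."""
    N = S.size(0)
    labels = torch.arange(N, device=S.device)  # the correct match for index i is i
    loss_i2t = F.cross_entropy(S, labels)       # rows: image -> text
    loss_t2i = F.cross_entropy(S.t(), labels)   # columns: text -> image
    return (loss_i2t + loss_t2i) / 2
```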
### 3. Zero-shot Classification

Classifying a new image:

1. Build a text prompt for each class:

       "A photo of a {class_name}"

2. Compute the text embeddings:

       T = [text_enc("A photo of a cat"),
            text_enc("A photo of a dog"),
            ...]

3. Compute the image embedding:

       I = image_enc(image)

4. Classify by similarity:

       probs = softmax(I @ T.T / τ)
       prediction = argmax(probs)

New classes can be classified without any additional training!
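The four steps above map directly onto the openai/CLIP package (installable with `pip install git+https://github.com/openai/CLIP.git`); the image path and class list here are placeholders:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)  # placeholder image
texts = clip.tokenize([f"A photo of a {c}" for c in ["cat", "dog", "bird"]]).to(device)

with torch.no_grad():
    # model(...) returns similarity logits already scaled by the learned temperature
    logits_per_image, logits_per_text = model(image, texts)
    probs = logits_per_image.softmax(dim=-1)

print(probs)  # the highest probability should land on the matching class
```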
## CLIP Architecture
```
┌──────────────────────────────────────────────────────────────┐
│                            CLIP                              │
│                                                              │
│   ┌───────────────────┐          ┌───────────────────┐      │
│   │   Image Encoder   │          │   Text Encoder    │      │
│   │                   │          │                   │      │
│   │     ViT-B/32      │          │    Transformer    │      │
│   │        or         │          │    (12 layers)    │      │
│   │     ResNet-50     │          │                   │      │
│   └─────────┬─────────┘          └─────────┬─────────┘      │
│             │                              │                │
│             ▼                              ▼                │
│      Image Embedding                Text Embedding          │
│          (B, D)                         (B, D)              │
│             │                              │                │
│             │        L2 Normalize          │                │
│             ▼                              ▼                │
│   ┌──────────────────────────────────────────────────┐      │
│   │               Contrastive Loss                   │      │
│   │    maximize similarity of matching pairs         │      │
│   │    minimize similarity of non-matching pairs     │      │
│   └──────────────────────────────────────────────────┘      │
└──────────────────────────────────────────────────────────────┘
```
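A minimal PyTorch skeleton of this two-tower layout; the encoder modules are placeholders for any networks that emit (B, D) features:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerCLIP(nn.Module):
    """Sketch of CLIP's two-tower structure with pluggable encoders."""
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        # Learnable temperature, stored as log(1/tau), initialized to tau = 0.07
        self.logit_scale = nn.Parameter(torch.log(torch.tensor(1 / 0.07)))

    def forward(self, images, tokens):
        img = F.normalize(self.image_encoder(images), dim=-1)  # (B, D)
        txt = F.normalize(self.text_encoder(tokens), dim=-1)   # (B, D)
        return self.logit_scale.exp() * img @ txt.t()          # (B, B) logits
```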
Model variants:
- CLIP ViT-B/32: 512 dim, 86M image + 63M text params
- CLIP ViT-B/16: 512 dim, 86M image + 63M text params
- CLIP ViT-L/14: 768 dim, 304M image + 123M text params
- CLIP RN50: ResNet-50 image encoder
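These figures can be sanity-checked against the released checkpoints; a small probe, assuming the openai/CLIP package is installed:

```python
import torch
import clip

print(clip.available_models())              # lists all released checkpoint names

model, preprocess = clip.load("ViT-B/32", device="cpu")
with torch.no_grad():
    dummy = torch.zeros(1, 3, 224, 224)     # ViT-B/32 expects 224x224 inputs
    print(model.encode_image(dummy).shape)  # torch.Size([1, 512])
```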
## File Structure

```
12_CLIP/
├── README.md
├── numpy/
│   └── clip_forward.py      # NumPy forward pass
├── pytorch_lowlevel/
│   └── clip_lowlevel.py     # PyTorch low-level CLIP
├── paper/
│   └── clip_paper.py        # Paper reproduction
└── exercises/
    ├── 01_zero_shot.md      # Zero-shot classification
    └── 02_retrieval.md      # Image-text retrieval
```
## Core Concepts

### 1. Large-Scale Dataset

WebImageText (WIT) dataset:
- 400 million (image, text) pairs
- Collected from the internet
- Natural-language supervision

Data collection (a toy sketch of the filtering step follows this list):
1. Collect image and alt-text pairs
2. Filter (quality, de-duplication)
3. Balance the distribution across queries/classes
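The paper describes this pipeline only at a high level; the sketch below is a toy illustration of step 2, and every threshold and helper in it is hypothetical:

```python
from hashlib import sha256

def filter_pairs(pairs):
    """Toy quality filter for (image_url, alt_text) pairs (hypothetical rules):
    drop very short captions and exact duplicates. A real pipeline would also
    check image quality, language, and near-duplicates."""
    seen, kept = set(), []
    for url, text in pairs:
        text = text.strip()
        if len(text.split()) < 3:         # hypothetical minimum caption length
            continue
        key = sha256((url + "|" + text).encode()).hexdigest()
        if key in seen:                    # exact de-duplication
            continue
        seen.add(key)
        kept.append((url, text))
    return kept
```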
### 2. Prompt Engineering

Simple prompt:

    "cat" → "A photo of a cat"

Prompt ensemble:

```python
templates = [
    "A photo of a {}",
    "A picture of a {}",
    "An image showing a {}",
    "A {} in the scene",
]

# Average the text embeddings over several templates.
# `class_name` and `text_encoder` are assumed to be defined elsewhere;
# CLIP's reference evaluation also re-normalizes the averaged embedding.
text_embeddings = []
for template in templates:
    prompt = template.format(class_name)
    text_embeddings.append(text_encoder(prompt))
final_embedding = sum(text_embeddings) / len(text_embeddings)
```
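For reference, the paper reports that ensembling 80 such templates on ImageNet improves zero-shot accuracy by roughly 3.5% over the single default prompt.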
### 3. Applications

1. Zero-shot classification
   - Apply directly to new domains
   - Define classes with prompts
2. Image-text retrieval (see the sketch after this list)
   - Search images with text
   - Search text with an image
3. Image-generation guidance
   - Guidance for DALL-E and Stable Diffusion
   - Measure generation quality with a CLIP score
4. Multimodal embedding
   - A shared representation for images and text
   - A foundation for downstream tasks
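A minimal sketch of the retrieval use case (item 2), assuming the query and gallery embeddings were already produced by a CLIP text/image encoder; all names are hypothetical:

```python
import torch
import torch.nn.functional as F

def retrieve_topk(query_emb, gallery_embs, k=5):
    """Rank gallery embeddings by cosine similarity to one query embedding."""
    q = F.normalize(query_emb, dim=-1)       # (D,)
    g = F.normalize(gallery_embs, dim=-1)    # (num_items, D)
    sims = g @ q                             # cosine similarities, (num_items,)
    return sims.topk(min(k, g.size(0)))      # top-k (values, indices)

# Text -> image search: embed the query text, rank precomputed image embeddings.
# Image -> text search is symmetric: swap the two embedding sets.
```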
## Implementation Levels

### Level 2: PyTorch Low-Level (pytorch_lowlevel/)
- Implement the image encoder (ViT) from scratch
- Implement the text encoder (Transformer) from scratch
- Implement the contrastive loss

### Level 3: Paper Implementation (paper/)
- Full training pipeline
- Zero-shot evaluation
- Prompt engineering

### Level 4: Code Analysis (separate)
- Analysis of the OpenAI CLIP code
- Analysis of the open_clip library
## Learning Checklist

- [ ] Understand contrastive learning
- [ ] Understand the InfoNCE loss formula
- [ ] Implement zero-shot classification
- [ ] Understand the role of the temperature τ
- [ ] Practice prompt engineering
- [ ] Implement image-text retrieval
## References
- Radford et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision"
- OpenAI CLIP: https://github.com/openai/CLIP
- 34_CLIP_Multimodal.md