35. CLIP (Contrastive Language-Image Pre-training)
Overview
CLIP maps images and text into a shared embedding space, enabling zero-shot image classification. Paper: "Learning Transferable Visual Models From Natural Language Supervision" (Radford et al., 2021)
Mathematical Background
1. Contrastive Learning
Goal: learn a similarity score between images and texts so that matched pairs score higher than mismatched ones.
For N (image, text) pairs in a batch:
- Diagonal (i, i): matching pairs (positive)
- Off-diagonal (i, j): non-matching pairs (negative)
Similarity matrix (N × N):
S[i, j] = <image_i, text_j> / τ
where τ is the temperature parameter
2. InfoNCE Loss
Image-to-Text Loss:
L_i2t = -1/N Σᵢ log(exp(S[i,i]) / Σⱼ exp(S[i,j]))
Text-to-Image Loss:
L_t2i = -1/N Σᵢ log(exp(S[i,i]) / Σⱼ exp(S[j,i]))
Total Loss:
L = (L_i2t + L_t2i) / 2
Intuition:
- Numerator: similarity of matching pairs ↑
- Denominator: similarity with other pairs ↓
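The symmetric loss above can be sketched in NumPy. This is a minimal illustration (the function name `clip_loss` is mine, not from the paper's code); embeddings are assumed already L2-normalized:

```python
import numpy as np

def clip_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric InfoNCE loss over a batch of matched (image, text) pairs.

    img_emb, txt_emb: (N, D) arrays with L2-normalized rows; row i of each
    is a matching pair. tau is the temperature.
    """
    S = img_emb @ txt_emb.T / tau  # (N, N) similarity matrix

    def log_softmax(x, axis):
        # numerically stable log-softmax
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    idx = np.arange(S.shape[0])
    l_i2t = -log_softmax(S, axis=1)[idx, idx].mean()  # rows: image -> text
    l_t2i = -log_softmax(S, axis=0)[idx, idx].mean()  # cols: text -> image
    return (l_i2t + l_t2i) / 2
```

With correctly matched pairs the diagonal dominates and the loss is small; shuffling one side raises it, which is a quick sanity check for an implementation.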
3. Zero-shot Classification
Classify new image:
1. Generate text prompts per class:
"A photo of a {class_name}"
2. Compute text embeddings:
T = [text_enc("A photo of a cat"),
text_enc("A photo of a dog"),
...]
3. Compute image embedding:
I = image_enc(image)
4. Classify by similarity:
probs = softmax(I @ T.T / τ)
prediction = argmax(probs)
Can classify new classes without training!
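The four steps above can be sketched in NumPy, assuming the image and per-class text embeddings are already L2-normalized (`zero_shot_classify` is an illustrative name, not part of the CLIP API):

```python
import numpy as np

def zero_shot_classify(img_emb, class_text_embs, tau=0.01):
    """Zero-shot classification by cosine similarity.

    img_emb: (D,) L2-normalized image embedding.
    class_text_embs: (C, D) L2-normalized text embeddings, one per class
                     prompt such as "A photo of a {class_name}".
    Returns (class probabilities, predicted class index).
    """
    logits = class_text_embs @ img_emb / tau   # (C,) scaled similarities
    logits = logits - logits.max()             # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs, int(np.argmax(probs))
```

New classes only require new prompts: swapping in a different `class_text_embs` matrix re-targets the classifier with no training.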
CLIP Architecture
┌──────────────────────────────────────────────────────────────┐
│                             CLIP                             │
│                                                              │
│   ┌───────────────────┐          ┌───────────────────┐       │
│   │   Image Encoder   │          │    Text Encoder   │       │
│   │                   │          │                   │       │
│   │     ViT-B/32      │          │    Transformer    │       │
│   │        or         │          │    (12 layers)    │       │
│   │     ResNet-50     │          │                   │       │
│   └─────────┬─────────┘          └─────────┬─────────┘       │
│             │                              │                 │
│             ▼                              ▼                 │
│       Image Embedding                Text Embedding          │
│           (B, D)                        (B, D)               │
│             │         L2 Normalize         │                 │
│             ▼                              ▼                 │
│   ┌────────────────────────────────────────────────┐         │
│   │              Contrastive Loss                  │         │
│   │   maximize similarity of matching pairs        │         │
│   │   minimize similarity of non-matching pairs    │         │
│   └────────────────────────────────────────────────┘         │
└──────────────────────────────────────────────────────────────┘
Model variants:
- CLIP ViT-B/32: 512 dim, 86M image + 63M text params
- CLIP ViT-B/16: 512 dim, 86M image + 63M text params
- CLIP ViT-L/14: 768 dim, 304M image + 123M text params
- CLIP RN50: ResNet-50 image encoder
File Structure
12_CLIP/
├── README.md
├── numpy/
│   └── clip_forward.py        # NumPy forward pass
├── pytorch_lowlevel/
│   └── clip_lowlevel.py       # PyTorch low-level CLIP
├── paper/
│   └── clip_paper.py          # Paper reproduction
└── exercises/
    ├── 01_zero_shot.md        # Zero-shot classification
    └── 02_retrieval.md        # Image-text retrieval
Core Concepts
1. Large-scale Dataset
WebImageText (WIT) dataset:
- 400 million (image, text) pairs
- Collected from the internet
- Natural language supervision
Data collection:
1. Collect image and alt-text pairs
2. Filtering (quality, deduplication)
3. Class balancing
2. Prompt Engineering
Simple prompt:
"cat" โ "A photo of a cat"
Prompt ensemble:
templates = [
    "A photo of a {}",
    "A picture of a {}",
    "An image showing a {}",
    "A {} in the scene"
]

# Average the embeddings of multiple templates
text_embeddings = []
for template in templates:
    prompt = template.format(class_name)
    embedding = text_encoder(prompt)
    text_embeddings.append(embedding)
final_embedding = mean(text_embeddings)  # then L2-normalize before use
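A runnable version of this ensembling step is sketched below. The `text_encoder` here is a stand-in (a deterministic seeded random projection), not CLIP's real text encoder; the point is the average-then-renormalize pattern:

```python
import zlib
import numpy as np

def text_encoder(prompt, dim=64):
    """Stand-in for CLIP's text encoder: a random projection seeded by a
    stable hash of the prompt, just so the sketch runs end to end."""
    rng = np.random.default_rng(zlib.crc32(prompt.encode("utf-8")))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def ensemble_embedding(class_name, templates):
    """Average the per-template embeddings, then L2-normalize the mean."""
    embs = np.stack([text_encoder(t.format(class_name)) for t in templates])
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)

emb = ensemble_embedding("cat", ["A photo of a {}", "A picture of a {}"])
```

Re-normalizing after the average matters: the mean of unit vectors is generally shorter than unit length, and downstream cosine-similarity math assumes unit-norm embeddings.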
3. Applications
1. Zero-shot Classification
- Directly apply to new domains
- Define classes with prompts
2. Image-Text Retrieval
- Search images with text
- Search text with images
3. Image Generation Guidance
- Guidance for DALL-E, Stable Diffusion
- Measure generation quality with CLIP score
4. Multimodal Embedding
- Common representation for images and text
- Foundation for downstream tasks
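For image-text retrieval (application 2 above), ranking reduces to a matrix-vector product over normalized embeddings. A minimal sketch, with an illustrative function name:

```python
import numpy as np

def retrieve_images(query_emb, image_embs, top_k=3):
    """Rank images by cosine similarity to a text query.

    query_emb: (D,) L2-normalized text embedding of the query.
    image_embs: (N, D) L2-normalized image embeddings.
    Returns (top-k image indices, their similarity scores), best first.
    """
    sims = image_embs @ query_emb      # (N,) cosine similarities
    order = np.argsort(-sims)          # descending
    return order[:top_k].tolist(), sims[order[:top_k]]
```

Text-to-image and image-to-text retrieval are symmetric: swapping which side supplies the query and which supplies the candidate set gives the other direction.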
Implementation Levels
Level 2: PyTorch Low-Level (pytorch_lowlevel/)
- Direct image encoder (ViT) implementation
- Direct text encoder (Transformer) implementation
- Implement contrastive loss
Level 3: Paper Implementation (paper/)
- Complete training pipeline
- Zero-shot evaluation
- Prompt engineering
Level 4: Code Analysis (separate)
- Analyze OpenAI CLIP code
- Analyze open_clip library
Learning Checklist
- [ ] Understand contrastive learning
- [ ] Understand InfoNCE loss formula
- [ ] Implement zero-shot classification
- [ ] Understand role of temperature
- [ ] Practice prompt engineering
- [ ] Implement image-text retrieval
References
- Radford et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision"
- OpenAI CLIP
- 34_CLIP_Multimodal.md