35. CLIP (Contrastive Language-Image Pre-training)




Overview

CLIP maps images and text into a shared embedding space, enabling zero-shot image classification. It was introduced in "Learning Transferable Visual Models From Natural Language Supervision" (Radford et al., 2021).


Mathematical Background

1. Contrastive Learning

Goal: learn an embedding space in which matching image-text pairs are similar and non-matching pairs are dissimilar

Given a batch of N (image, text) pairs:
- Diagonal (i, i): matching pairs (positives)
- Off-diagonal (i, j), i ≠ j: non-matching pairs (negatives)

Similarity matrix (N × N):
S[i, j] = <image_i, text_j> / τ

where τ is the temperature parameter (learned during training; the paper initializes it to 0.07)
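As a concrete sketch in NumPy, with random toy embeddings standing in for encoder outputs, the scaled similarity matrix can be computed as:

```python
import numpy as np

def similarity_matrix(image_emb, text_emb, tau=0.07):
    """S[i, j] = <image_i, text_j> / tau on L2-normalized embeddings."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return image_emb @ text_emb.T / tau  # shape (N, N)

# Toy batch: N = 4 pairs, D = 8 embedding dimensions
rng = np.random.default_rng(0)
S = similarity_matrix(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
print(S.shape)  # (4, 4)
```

Normalizing first makes the dot product a cosine similarity, so every entry of S * τ lies in [-1, 1].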

2. InfoNCE Loss

Image-to-Text Loss:
L_i2t = -1/N Σᵢ log(exp(S[i,i]) / Σⱼ exp(S[i,j]))

Text-to-Image Loss:
L_t2i = -1/N Σᵢ log(exp(S[i,i]) / Σⱼ exp(S[j,i]))

Total Loss:
L = (L_i2t + L_t2i) / 2

Intuition:
- Numerator: similarity of matching pairs ↑
- Denominator: similarity with other pairs ↓
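Under those definitions, the symmetric loss can be sketched directly from S (a NumPy illustration, not the paper's implementation):

```python
import numpy as np

def clip_loss(S):
    """Symmetric InfoNCE loss over a scaled similarity matrix S of shape (N, N)."""
    # Row-wise log-softmax: image i against all texts j (image-to-text)
    log_p_i2t = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
    # Column-wise log-softmax: text i against all images j (text-to-image)
    log_p_t2i = S - np.log(np.exp(S).sum(axis=0, keepdims=True))
    # Matching pairs sit on the diagonal
    L_i2t = -np.mean(np.diag(log_p_i2t))
    L_t2i = -np.mean(np.diag(log_p_t2i))
    return (L_i2t + L_t2i) / 2

print(clip_loss(np.eye(4) * 20.0))   # near 0: matching pairs dominate
print(clip_loss(np.zeros((4, 4))))   # log(4): all pairs equally similar
```

The second case shows the loss floor when the model carries no signal: a uniform softmax over N candidates gives log(N).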

3. Zero-shot Classification

To classify a new image:

1. Generate text prompts per class:
   "A photo of a {class_name}"

2. Compute text embeddings:
   T = [text_enc("A photo of a cat"),
        text_enc("A photo of a dog"),
        ...]

3. Compute image embedding:
   I = image_enc(image)

4. Classify by similarity:
   probs = softmax(I @ T.T / τ)
   prediction = argmax(probs)

New classes can be classified without any additional training: just write a prompt for each class.
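The four steps above, sketched in NumPy with made-up embeddings (a real pipeline would use CLIP's encoders to produce them):

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, tau=0.07):
    """Rank C classes by scaled cosine similarity between one image and C prompt embeddings."""
    I = image_emb / np.linalg.norm(image_emb)
    T = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    logits = I @ T.T / tau
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return probs, int(np.argmax(probs))

# Toy prompt embeddings for 3 classes; the image embedding is closest to class 1
T = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
probs, pred = zero_shot_classify(np.array([0.1, 0.9]), T)
print(pred)  # 1
```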

CLIP Architecture

┌──────────────────────────────────────────────────────────────┐
│                         CLIP                                 │
│                                                              │
│  ┌───────────────────┐         ┌───────────────────┐         │
│  │   Image Encoder   │         │   Text Encoder    │         │
│  │                   │         │                   │         │
│  │  ViT-B/32         │         │  Transformer      │         │
│  │  or               │         │  (12 layers)      │         │
│  │  ResNet-50        │         │                   │         │
│  └─────────┬─────────┘         └─────────┬─────────┘         │
│            │                             │                   │
│            ▼                             ▼                   │
│     Image Embedding              Text Embedding              │
│        (B, D)                       (B, D)                   │
│            │                             │                   │
│            │        L2 Normalize         │                   │
│            ▼                             ▼                   │
│     ┌──────────────────────────────────────────┐             │
│     │           Contrastive Loss               │             │
│     │  maximize similarity of matching pairs   │             │
│     │  minimize similarity of non-matching     │             │
│     └──────────────────────────────────────────┘             │
└──────────────────────────────────────────────────────────────┘

Model variants:
- CLIP ViT-B/32: 512 dim, 86M image + 63M text params
- CLIP ViT-B/16: 512 dim, 86M image + 63M text params
- CLIP ViT-L/14: 768 dim, 304M image + 123M text params
- CLIP RN50: ResNet-50 image encoder

File Structure

12_CLIP/
├── README.md
├── numpy/
│   └── clip_forward.py       # NumPy forward pass
├── pytorch_lowlevel/
│   └── clip_lowlevel.py      # PyTorch Low-Level CLIP
├── paper/
│   └── clip_paper.py         # Paper reproduction
└── exercises/
    ├── 01_zero_shot.md       # Zero-shot classification
    └── 02_retrieval.md       # Image-text retrieval

Core Concepts

1. Large-scale Dataset

WebImageText (WIT) dataset:
- 400 million (image, text) pairs
- Collected from the internet
- Natural language supervision

Data collection:
1. Collect image and alt-text pairs
2. Filter for quality and deduplicate
3. Balance across classes

2. Prompt Engineering

Simple prompt:
"cat"  โ†’  "A photo of a cat"

Prompt ensemble:
templates = [
    "A photo of a {}",
    "A picture of a {}",
    "An image showing a {}",
    "A {} in the scene",
]

# Embed each filled-in template, then average the embeddings
text_embeddings = []
for template in templates:
    prompt = template.format(class_name)
    embedding = text_encoder(prompt)       # L2-normalized text embedding
    text_embeddings.append(embedding)
final_embedding = mean(text_embeddings)    # re-normalize before use
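A runnable version of the averaging step, using a stand-in encoder (`toy_text_encoder` is hypothetical; CLIP's real text encoder is a transformer):

```python
import zlib
import numpy as np

def toy_text_encoder(prompt, dim=4):
    """Stand-in encoder: a deterministic pseudo-random, L2-normalized vector per prompt."""
    rng = np.random.default_rng(zlib.crc32(prompt.encode()))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

templates = ["A photo of a {}", "A picture of a {}", "An image showing a {}"]
embs = [toy_text_encoder(t.format("cat")) for t in templates]
final_embedding = np.mean(embs, axis=0)
final_embedding /= np.linalg.norm(final_embedding)  # re-normalize after averaging
print(final_embedding.shape)  # (4,)
```

The re-normalization matters: the mean of unit vectors is generally shorter than a unit vector, and the similarity computations assume unit length.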

3. Applications

1. Zero-shot Classification
   - Directly apply to new domains
   - Define classes with prompts

2. Image-Text Retrieval
   - Search images with text
   - Search text with images

3. Image Generation Guidance
   - Guidance for DALL-E, Stable Diffusion
   - Measure generation quality with CLIP score

4. Multimodal Embedding
   - Common representation for images and text
   - Foundation for downstream tasks
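Application 2 (retrieval) reduces to the same similarity ranking; a minimal sketch with toy embeddings:

```python
import numpy as np

def retrieve(query_emb, gallery_embs, k=2):
    """Return indices of the k gallery embeddings most cosine-similar to the query.

    Works for text-to-image (text query, image gallery) and image-to-text,
    since both modalities live in the shared embedding space.
    """
    q = query_emb / np.linalg.norm(query_emb)
    G = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    return np.argsort(-(G @ q))[:k]

gallery = np.array([[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]])
top = retrieve(np.array([1.0, 0.0]), gallery)
print(top)  # [1 2]: the identical row first, the 45-degree row second
```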

Implementation Levels

Level 2: PyTorch Low-Level (pytorch_lowlevel/)

  • Direct image encoder (ViT) implementation
  • Direct text encoder (Transformer) implementation
  • Implement contrastive loss

Level 3: Paper Implementation (paper/)

  • Complete training pipeline
  • Zero-shot evaluation
  • Prompt engineering

Level 4: Code Analysis (separate)

  • Analyze OpenAI CLIP code
  • Analyze open_clip library

Learning Checklist

  • [ ] Understand contrastive learning
  • [ ] Understand InfoNCE loss formula
  • [ ] Implement zero-shot classification
  • [ ] Understand role of temperature
  • [ ] Practice prompt engineering
  • [ ] Implement image-text retrieval

References

- Radford, A., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision." ICML 2021.