Foundation Model Paradigm

Learning Objectives

  • Understand the definition and characteristics of Foundation Models
  • Grasp the paradigm shift from traditional ML to Foundation Models
  • Learn the concepts of In-context Learning and Emergent Capabilities
  • Identify the major Foundation Model families and their lineage

1. What are Foundation Models?

1.1 Definition

The term "foundation model" was introduced in 2021 by Stanford HAI's Center for Research on Foundation Models (Bommasani et al.). It refers to models with the following characteristics:

┌─────────────────────────────────────────────────────────────────┐
│                    Foundation Model Definition                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. Pre-trained on broad data                                   │
│     - Billions to trillions of text tokens                      │
│     - Hundreds of millions to billions of images                │
│                                                                 │
│  2. Adaptable to many tasks                                     │
│     - Single model performs classification, generation, QA,     │
│       translation, etc.                                         │
│     - Adapted through fine-tuning or prompting                  │
│                                                                 │
│  3. General-purpose representations                             │
│     - Task-agnostic knowledge encoding                          │
│     - Maximizes transfer learning                               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

1.2 Traditional ML vs Foundation Model

┌─────────────────────────────────────────────────────────────────┐
│            Traditional Machine Learning Pipeline                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Task A ───► Data A ───► Model A ───► Deploy A                  │
│  Task B ───► Data B ───► Model B ───► Deploy B                  │
│  Task C ───► Data C ───► Model C ───► Deploy C                  │
│                                                                 │
│  • Separate data collection for each task                       │
│  • Separate model training for each task                        │
│  • Limited knowledge sharing between tasks                      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│              Foundation Model Pipeline                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│                    ┌──────────────────┐                         │
│  Massive Data ───► │ Foundation Model │                         │
│  (Web-scale)       └────────┬─────────┘                         │
│                             │                                   │
│              ┌──────────────┼──────────────┐                    │
│              ▼              ▼              ▼                    │
│         Adapt A        Adapt B        Adapt C                   │
│         (Fine-tune)    (Prompt)       (LoRA)                    │
│              │              │              │                    │
│              ▼              ▼              ▼                    │
│         Task A         Task B         Task C                    │
│                                                                 │
│  • Single large-scale pre-training                              │
│  • Lightweight adaptation for various tasks                     │
│  • Maximizes knowledge transfer between tasks                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
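
To make "lightweight adaptation" concrete, the sketch below compares the number of trainable parameters when a single weight matrix is fully fine-tuned versus adapted with a low-rank (LoRA-style) update. The layer size and rank are hypothetical values chosen only for illustration, and the helper names are not from any library.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole (d_out x d_in) weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for a low-rank update W + B @ A,
    where A has shape (rank, d_in) and B has shape (d_out, rank)."""
    return rank * d_in + d_out * rank


# Hypothetical layer size and rank, for illustration only.
d_in, d_out, rank = 4096, 4096, 8
full = full_finetune_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {rank}):   {lora:,} trainable parameters")
print(f"ratio: {lora / full:.4%}")
```

Under these assumptions the low-rank update trains 1/256 of the parameters of the full matrix, which is why one pre-trained model can be cheaply adapted to many tasks.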

1.3 Types of Foundation Models

Category          Representative Models       Input/Output
----------------  --------------------------  -------------------------------
Language Models   GPT-4, LLaMA, Claude        Text → Text
Vision Models     ViT, DINOv2, SAM            Image → Features/Segmentation
Multimodal        CLIP, LLaVA, GPT-4V         Text+Image → Text or embeddings
Generative        Stable Diffusion, DALL-E    Text → Image
Audio             Whisper, AudioLM            Audio ↔ Text
Code              Codex, CodeLlama            Text → Code

2. History of the Paradigm Shift

2.1 Timeline

┌──────────────────────────────────────────────────────────────────────┐
│                    Foundation Model History                          │
├──────┬───────────────────────────────────────────────────────────────┤
│ 2017 │ Transformer (Vaswani et al.) - Attention-only architecture    │
├──────┼───────────────────────────────────────────────────────────────┤
│ 2018 │ BERT (Google) - Bidirectional context via Masked LM           │
│      │ GPT-1 (OpenAI) - Generative pre-training + fine-tuning        │
├──────┼───────────────────────────────────────────────────────────────┤
│ 2019 │ GPT-2 (1.5B params) - "Too dangerous to release"              │
│      │ T5 - Text-to-Text Transfer Transformer                        │
├──────┼───────────────────────────────────────────────────────────────┤
│ 2020 │ GPT-3 (175B) - Demonstrated in-context learning at scale      │
│      │ Scaling Laws paper (Kaplan et al.)                            │
│      │ ViT - Applied Transformer to Vision                           │
├──────┼───────────────────────────────────────────────────────────────┤
│ 2021 │ CLIP - Connected Vision and Language                          │
│      │ DALL-E - Text-to-Image generation                             │
│      │ "Foundation Models" term coined (Stanford HAI)                │
├──────┼───────────────────────────────────────────────────────────────┤
│ 2022 │ ChatGPT - Popularization of LLMs                              │
│      │ Chinchilla - Compute-optimal Scaling                          │
│      │ Stable Diffusion - Open-source image generation               │
├──────┼───────────────────────────────────────────────────────────────┤
│ 2023 │ GPT-4 - Multimodal Foundation Model                           │
│      │ LLaMA - Open-source LLM revolution                            │
│      │ SAM - Promptable Vision Foundation Model                      │
├──────┼───────────────────────────────────────────────────────────────┤
│ 2024 │ GPT-4o, Claude 3, Gemini 1.5 - Performance competition        │
│      │ LLaMA 3, Mistral - Open-source advancement                    │
│      │ Sora - Video Foundation Model                                 │
└──────┴───────────────────────────────────────────────────────────────┘

2.2 Major Turning Points

(1) GPT-3's In-context Learning (2020)

GPT-3 showed that a sufficiently large model can perform a new task from a handful of examples placed in the prompt, with no gradient updates, which helped catalyze the paradigm shift:

# Traditional approach: fine-tune a separate model for each task
# (load_pretrained and fine_tune are placeholder helpers, not a real API)
model = load_pretrained("bert-base")
model = fine_tune(model, sentiment_dataset, epochs=3)
result = model.predict("This movie was great!")

# GPT-3 in-context learning: the task is specified entirely in the prompt
# (gpt3 is a placeholder for any text-completion client)
prompt = """
Classify the sentiment:
Text: "I love this product!" → Positive
Text: "Terrible experience." → Negative
Text: "This movie was great!" →
"""
result = gpt3.generate(prompt)  # expected completion: "Positive"

(2) CLIP's Vision-Language Connection (2021)

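CLIP enabled zero-shot classification by mapping images and text into a shared embedding space:

# Zero-shot image classification with CLIP (openai/CLIP package)
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Embed the image and the candidate captions in the same space
image = preprocess(Image.open("image.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a dog", "a cat", "a bird"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Classify by cosine similarity (no task-specific training)
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
# e.g. [[0.95, 0.03, 0.02]] → "a dog"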
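
The similarity step can also be followed without any model download. The sketch below is a dependency-free illustration of the same idea (normalize, take cosine similarities, scale, softmax). The 3-dimensional embeddings are made up for the example, and zero_shot_classify is a hypothetical helper, not part of the CLIP package.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_vec, text_vecs, labels, logit_scale=100.0):
    """Pick the label whose text embedding is most similar to the image embedding."""
    image_vec = normalize(image_vec)
    sims = [sum(a * b for a, b in zip(image_vec, normalize(t))) for t in text_vecs]
    probs = softmax([logit_scale * s for s in sims])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs


# Made-up embeddings; a real model would produce these vectors.
labels = ["a dog", "a cat", "a bird"]
text_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_vec = [0.8, 0.2, 0.1]

label, probs = zero_shot_classify(image_vec, text_vecs, labels)
print(label)
```

New classes can be added by appending a caption and its embedding; nothing is retrained, which is the sense in which the classifier is "zero-shot".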

(3) ChatGPT's RLHF (2022)

ChatGPT was trained with RLHF (Reinforcement Learning from Human Feedback) so that its responses better match human preferences:

┌─────────────────────────────────────────────────────────────────┐
│                    ChatGPT Training Process                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Step 1: Pre-training (GPT-3.5 base)                            │
│          Learn to predict next token from web text              │
│                         │                                       │
│                         ▼                                       │
│  Step 2: Supervised Fine-tuning (SFT)                           │
│          Train on high-quality human-written responses          │
│                         │                                       │
│                         ▼                                       │
│  Step 3: Reward Model Training                                  │
│          Learn to predict which response humans prefer          │
│                         │                                       │
│                         ▼                                       │
│  Step 4: RLHF with PPO                                          │
│          Optimize policy using Reward Model as reward           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
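
Step 3 is commonly trained with a pairwise preference loss: the reward model should score the human-preferred response above the rejected one. The sketch below shows one common formulation, -log(sigmoid(r_chosen - r_rejected)). The scalar rewards are assumed to come from some reward model, and the function name is illustrative rather than taken from a specific library.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), written as softplus(-diff) for stability."""
    diff = reward_chosen - reward_rejected
    if diff > 0:
        return math.log1p(math.exp(-diff))
    # For diff <= 0, log(1 + exp(-diff)) = -diff + log(1 + exp(diff))
    return -diff + math.log1p(math.exp(diff))


# The loss is small when the preferred response already scores higher,
# and large when the reward model ranks the pair the wrong way round.
print(pairwise_preference_loss(2.0, 0.0))
print(pairwise_preference_loss(0.0, 2.0))
```

Minimizing this loss over many labeled pairs pushes the reward model to reproduce human rankings; Step 4 then uses its scores as the reward signal.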

3. In-context Learning (ICL)

3.1 Concept

In-context Learning is the ability to perform tasks using only examples in the prompt, without updating model weights.

┌─────────────────────────────────────────────────────────────────┐
│                    Types of In-context Learning                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Zero-shot:  "Translate to French: Hello"                       │
│              → "Bonjour"                                        │
│                                                                 │
│  One-shot:   "English: Hello → French: Bonjour                  │
│               English: Goodbye →"                               │
│              → "Au revoir"                                      │
│                                                                 │
│  Few-shot:   "English: Hello → French: Bonjour                  │
│               English: Goodbye → French: Au revoir              │
│               English: Thank you → French: Merci                │
│               English: Good morning →"                          │
│              → "Bonjour" (or "Bon matin")                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
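
The three variants differ only in how many demonstrations precede the query. A minimal sketch of a prompt builder follows; build_prompt and its default labels are hypothetical and simply follow the "English: ... → French: ..." layout used above.

```python
def build_prompt(examples, query, source="English", target="French"):
    """Format k demonstrations followed by the unanswered query.

    k = 0 gives a zero-shot prompt, k = 1 one-shot, k > 1 few-shot.
    """
    lines = [f"{source}: {src} → {target}: {tgt}" for src, tgt in examples]
    lines.append(f"{source}: {query} → {target}:")
    return "\n".join(lines)


demos = [("Hello", "Bonjour"), ("Goodbye", "Au revoir"), ("Thank you", "Merci")]

print(build_prompt([], "Good morning"))         # zero-shot
print(build_prompt(demos[:1], "Good morning"))  # one-shot
print(build_prompt(demos, "Good morning"))      # few-shot
```

The model weights are identical in all three cases; only the text the model conditions on changes.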

3.2 Why ICL Works (Hypotheses)

"""
Hypothesis 1: Bayesian Inference
- Infer the latent task from the prompt examples
- P(task | examples) ∝ P(examples | task) × P(task)
- P(output | input, examples) = Σ_task P(output | input, task) × P(task | examples)

Hypothesis 2: Implicit Gradient Descent
- Transformer's attention implicitly performs gradient steps
- Mechanism similar to meta-learning

Hypothesis 3: Task Vector Retrieval
- Retrieve task vectors learned during pre-training
- Prompt activates appropriate task vectors
"""
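
Hypothesis 1 can be illustrated with a toy model. The sketch below assumes a tiny, hand-written set of candidate "tasks" (string functions), a uniform prior, and a fixed noise rate for examples that do not match a task. None of this is claimed to be how a Transformer computes internally; it only shows the Bayesian bookkeeping.

```python
def posterior_over_tasks(tasks, prior, examples, noise=0.05):
    """P(task | examples) ∝ P(task) * Π P(example | task)."""
    unnormalized = {}
    for name, fn in tasks.items():
        likelihood = 1.0
        for x, y in examples:
            # An example is likely under a task if the task reproduces its output.
            likelihood *= (1.0 - noise) if fn(x) == y else noise
        unnormalized[name] = prior[name] * likelihood
    total = sum(unnormalized.values())
    return {name: w / total for name, w in unnormalized.items()}


def predict(tasks, posterior, x):
    """Sum posterior mass over tasks that agree on each candidate output
    (each task is treated as deterministic at prediction time)."""
    scores = {}
    for name, fn in tasks.items():
        out = fn(x)
        scores[out] = scores.get(out, 0.0) + posterior[name]
    return max(scores, key=scores.get), scores


# Hypothetical task family, for illustration only.
tasks = {
    "uppercase": str.upper,
    "reverse": lambda s: s[::-1],
    "identity": lambda s: s,
}
prior = {name: 1 / len(tasks) for name in tasks}
examples = [("abc", "ABC"), ("dog", "DOG")]

posterior = posterior_over_tasks(tasks, prior, examples)
answer, scores = predict(tasks, posterior, "cat")
print(posterior)
print(answer)
```

Two demonstrations are enough to concentrate almost all posterior mass on one task, mirroring how a few prompt examples can disambiguate what the model is being asked to do.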

3.3 Few-shot Prompt Example

# Sentiment Analysis Few-shot
sentiment_prompt = """
Analyze the sentiment of the following reviews:

Review: "The food was delicious and the service was excellent!"
Sentiment: Positive

Review: "I waited for an hour and the waiter was rude."
Sentiment: Negative

Review: "It was okay, nothing special but not bad either."
Sentiment: Neutral

Review: "Best experience ever! Will definitely come back!"
Sentiment:"""

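# API call (openai>=1.0 client; reads OPENAI_API_KEY from the environment)
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": sentiment_prompt}]
)
print(response.choices[0].message.content)  # e.g. "Positive"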

4. Emergent Capabilities

4.1 Definition

Emergent capabilities are abilities that are absent (near chance-level performance) in smaller models but appear, often abruptly, once models exceed a certain scale.

┌─────────────────────────────────────────────────────────────────┐
│              Characteristics of Emergent Abilities              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Performance                                                    │
│       │                                                         │
│   100%┤                                    ○─────── Large       │
│       │                                  ╱          models      │
│       │                                ╱                        │
│       │                              ╱                          │
│    50%┤ · · · · · · · · · · · · · ·╱ · · · · · · · · · · · · ·  │
│       │                          ╱  ↑ Phase transition          │
│       │                        ╱    (sudden performance jump)   │
│       │──────────────────────○  Small models                    │
│     0%└───────┬───────┬───────┬───────┬───────┬──────▶          │
│              10B     50B     100B    200B    500B   Parameters  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

4.2 Representative Emergent Capabilities

Capability          Description                            Emergence Scale (approx.)
------------------  -------------------------------------  -------------------------
Arithmetic          Multi-digit addition/multiplication    ~10B params
Chain-of-Thought    Step-by-step reasoning                 ~60B params
Word Unscrambling   Restoring scrambled words              ~60B params
Multi-step Math     Complex math problems                  ~100B params
Code Generation     Complex code writing                   ~100B params

4.3 Chain-of-Thought (CoT) Prompting

# Without CoT - often fails
prompt_direct = """
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
   Each can has 3 tennis balls. How many tennis balls does he have now?
A:"""
# GPT-3 (small): "8" (incorrect)

# With CoT - improved accuracy through step-by-step reasoning
prompt_cot = """
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
   Each can has 3 tennis balls. How many tennis balls does he have now?
A: Let's think step by step.
   Roger started with 5 tennis balls.
   He bought 2 cans, each with 3 balls, so 2 × 3 = 6 balls.
   Total: 5 + 6 = 11 tennis balls.
   The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and
   bought 6 more, how many apples do they have?
A: Let's think step by step."""
# GPT-3: "They started with 23, used 20, so 23-20=3.
#         Then bought 6 more: 3+6=9. The answer is 9." (correct)
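
Because CoT completions contain reasoning text, the final answer usually has to be parsed out. The sketch below assumes completions end with a phrase of the form "The answer is N", as in the examples above, and adds a simple majority vote over several sampled completions (the idea behind self-consistency). The sample completions are hand-written stand-ins, not real model output.

```python
import re
from collections import Counter


def extract_answer(completion: str):
    """Return the number after the last 'The answer is', or None if absent."""
    matches = re.findall(r"The answer is\s*(-?\d+(?:\.\d+)?)", completion)
    return matches[-1] if matches else None


def majority_vote(completions):
    """Return the most frequent extracted answer across sampled completions."""
    answers = [a for a in map(extract_answer, completions) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Hand-written stand-ins for several sampled reasoning paths.
samples = [
    "They started with 23, used 20, so 23-20=3. Then 3+6=9. The answer is 9.",
    "23 - 20 = 3 apples left. 3 + 6 = 9. The answer is 9.",
    "23 + 6 = 29. The answer is 29.",
]

print(extract_answer(samples[0]))
print(majority_vote(samples))
```

Voting over several reasoning paths lets occasional arithmetic slips be outvoted by the paths that reach the correct result.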

5. Core Components of Foundation Models

5.1 Architecture Comparison

┌─────────────────────────────────────────────────────────────────┐
│                    Major Architecture Patterns                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Encoder-only (BERT, DINOv2)                                    │
│  ┌─────────────────────────────────────────┐                    │
│  │ [CLS] Token1 Token2 ... TokenN [SEP]    │                    │
│  │       ↓      ↓      ↓    ↓              │                    │
│  │    ┌──┴──────┴──────┴────┴──┐           │                    │
│  │    │   Bidirectional Attn   │           │                    │
│  │    └──┬──────┬──────┬────┬──┘           │                    │
│  │       ↓      ↓      ↓    ↓              │                    │
│  │    Pooled / Token Representations       │                    │
│  └─────────────────────────────────────────┘                    │
│  • Utilizes bidirectional context                               │
│  • Suitable for classification, embeddings                      │
│                                                                 │
│  Decoder-only (GPT, LLaMA)                                      │
│  ┌─────────────────────────────────────────┐                    │
│  │ Token1 → Token2 → Token3 → ...          │                    │
│  │   ↓        ↓        ↓                   │                    │
│  │ ┌─┴────────┴────────┴─┐                 │                    │
│  │ │ Causal (Masked) Attn│                 │                    │
│  │ └─┬────────┬────────┬─┘                 │                    │
│  │   ↓        ↓        ↓                   │                    │
│  │ Next     Next     Next                  │                    │
│  │ Token    Token    Token                 │                    │
│  └─────────────────────────────────────────┘                    │
│  • Autoregressive generation                                    │
│  • Optimized for text generation                                │
│                                                                 │
│  Encoder-Decoder (T5, BART)                                     │
│  ┌─────────────────────────────────────────┐                    │
│  │ ┌──────────┐    ┌──────────┐            │                    │
│  │ │ Encoder  │───▶│ Decoder  │            │                    │
│  │ │(Bi-dir)  │    │(Causal)  │            │                    │
│  │ └──────────┘    └──────────┘            │                    │
│  └─────────────────────────────────────────┘                    │
│  • Separate input understanding and output generation           │
│  • Suitable for translation, summarization                      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
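
The practical difference between bidirectional and causal attention is the mask applied before the softmax. The sketch below builds both masks for a short sequence and applies one row to a set of attention scores. It is a plain-Python illustration with hypothetical helper names, not code from any framework.

```python
import math


def attention_mask(seq_len: int, causal: bool):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]


def masked_softmax(scores, mask_row):
    """Softmax over allowed positions; masked positions get weight 0."""
    allowed = [s for s, ok in zip(scores, mask_row) if ok]
    m = max(allowed)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


bidirectional = attention_mask(3, causal=False)  # encoder-style: every token sees all tokens
causal = attention_mask(3, causal=True)          # decoder-style: no attention to the future

# Attention weights for the second token, given equal raw scores.
print(masked_softmax([1.0, 1.0, 1.0], bidirectional[1]))
print(masked_softmax([1.0, 1.0, 1.0], causal[1]))
```

With the causal mask the second token places no weight on the third, which is what allows a decoder to be trained on next-token prediction without seeing the answer.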

5.2 Key Components

"""
Core Components of Foundation Models:

1. Self-Attention
   - Query, Key, Value projections; each position attends to the others
   - Learns relationships between all positions in the sequence

2. Feed-Forward Network (FFN)
   - Often interpreted as key-value knowledge storage
   - Holds roughly two-thirds of the parameters in a standard block (d_ff = 4 × d_model)

3. Positional Encoding
   - Injects token-order information
   - Sinusoidal, learned, RoPE, etc.

4. Normalization
   - LayerNorm (BERT, GPT)
   - RMSNorm (LLaMA): skips mean-centering, so it is cheaper to compute

5. Activation Function
   - GELU (BERT, GPT)
   - SwiGLU (LLaMA): gated variant, reported to improve quality at similar compute
"""
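
The difference between the two normalization layers in item 4 is small enough to show directly. The sketch below omits the learned scale and bias parameters that real layers carry and uses plain Python lists; the function names are illustrative.

```python
import math


def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-5):
    """Divide by the root mean square only; no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


x = [1.0, 2.0, 3.0, 4.0]
ln = layer_norm(x)
rn = rms_norm(x)

print([round(v, 3) for v in ln])  # centered around zero
print([round(v, 3) for v in rn])  # rescaled, but not centered
```

RMSNorm keeps the relative proportions of the inputs and needs one fewer reduction over the vector, which is the efficiency argument usually given for it.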

6. Using Foundation Models

6.1 Getting Started with HuggingFace

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model (e.g., LLaMA-2-7B)
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # Save memory
    device_map="auto"           # Automatic GPU allocation
)

# Generate text
prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    do_sample=True,
    top_p=0.9
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
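
The temperature and top_p arguments above control how the next token is sampled. The sketch below is a conceptual, dependency-free version of temperature scaling followed by nucleus (top-p) filtering. It is not the library's internal implementation, and the logits are invented for the example.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Softmax of logits / temperature; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample(logits, temperature=0.7, top_p=0.9, rng=random):
    """Draw one token index after temperature scaling and top-p filtering."""
    probs = top_p_filter(apply_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


filtered = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9)
print(filtered)  # the least likely token is dropped, the rest renormalized
print(sample([2.0, 1.0, 0.1, -3.0], rng=random.Random(0)))
```

Lowering temperature or top_p makes output more deterministic; raising them increases diversity at the cost of more unlikely tokens.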

6.2 Using Vision Foundation Model

# DINOv2 - Universal image feature extraction
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base")

# Extract image embeddings
image = Image.open("image.jpg").convert("RGB")  # ensure a 3-channel input
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    features = outputs.last_hidden_state  # (1, num_patches+1, 768)
    cls_embedding = features[:, 0]        # CLS token (whole image representation)

# Use this embedding for classification, retrieval, segmentation, etc.

6.3 Using API (OpenAI)

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain foundation models in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)

7. Limitations and Challenges of Foundation Models

7.1 Current Limitations

Limitation          Description                                 Solution Attempts
------------------  ------------------------------------------  --------------------------
Hallucination       Generates false information                 RAG, Grounding
Outdated Knowledge  Unaware of post-training information        RAG, Fine-tuning
Reasoning Limits    Difficulty with complex logical reasoning   CoT, Self-consistency
High Compute Cost   Enormous training/inference costs           Quantization, Distillation
Safety/Alignment    Can generate harmful content                RLHF, Constitutional AI
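
As an illustration of the RAG entry above, the sketch below retrieves the most relevant document for a question and places it in the prompt, so the model can answer from supplied text rather than from memory. Word overlap stands in for the embedding search a real system would use, and all names and documents here are hypothetical.

```python
def tokenize(text):
    """Lowercase and split into a set of words, ignoring basic punctuation."""
    for ch in "?.,":
        text = text.replace(ch, " ")
    return set(text.lower().split())


def retrieve(query, documents, k=1):
    """Rank documents by word overlap with the query (a stand-in for embedding search)."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_rag_prompt(query, documents, k=1):
    """Assemble a prompt that grounds the answer in the retrieved context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\nAnswer:"
    )


documents = [
    "The Transformer architecture was introduced in 2017.",
    "CLIP maps images and text into a shared embedding space.",
    "RMSNorm normalizes by the root mean square of activations.",
]
question = "When was the Transformer architecture introduced?"
prompt = build_rag_prompt(question, documents)
print(prompt)
```

The resulting prompt would then be sent to any of the models from Section 6; updating the document store changes what the model can cite without retraining it.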

7.2 Research Directions

┌─────────────────────────────────────────────────────────────────┐
│                    Future Research Directions                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. Efficient Models                                            │
│     └─ Mixture of Experts, Sparse Attention, Quantization       │
│                                                                 │
│  2. Multimodal Integration                                      │
│     └─ Unified Vision + Language + Audio + Code                 │
│                                                                 │
│  3. Reasoning Enhancement                                       │
│     └─ Test-time Compute (o1), Tree of Thoughts                 │
│                                                                 │
│  4. Continual Learning                                          │
│     └─ Continuous learning, Solving Catastrophic Forgetting     │
│                                                                 │
│  5. Safety & Alignment                                          │
│     └─ Constitutional AI, Red-teaming, Interpretability         │
│                                                                 │
│  6. Agentic Systems                                             │
│     └─ Tool Use, Multi-Agent, Autonomous Planning               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Summary

Key Concepts

  • Foundation Model: General-purpose models trained on large-scale data and applicable to various tasks
  • Paradigm Shift: Task-specific → Pre-train & Adapt
  • In-context Learning: Learning through prompts without weight updates
  • Emergent Capabilities: Abilities that suddenly appear at scale

References

Key Papers

  • Bommasani et al. (2021). "On the Opportunities and Risks of Foundation Models"
  • Brown et al. (2020). "Language Models are Few-Shot Learners" (GPT-3)
  • Radford et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision" (CLIP)
  • Wei et al. (2022). "Emergent Abilities of Large Language Models"
