Foundation Model Paradigm
Learning Objectives
- Understand the definition and characteristics of Foundation Models
- Grasp the paradigm shift from traditional ML to Foundation Models
- Learn the concepts of In-context Learning and Emergent Capabilities
- Identify the major lineages of Foundation Models
1. What are Foundation Models?
1.1 Definition
The term "foundation model" was introduced in 2021 by researchers at Stanford HAI (Bommasani et al., 2021). It refers to models with the following characteristics:
1. Pre-trained on broad data
   - Billions to trillions of text tokens
   - Hundreds of millions to billions of images
2. Adaptable to many tasks
   - Single model performs classification, generation, QA, translation, etc.
   - Adapted through fine-tuning or prompting
3. General-purpose representations
   - Task-agnostic knowledge encoding
   - Maximizes transfer learning
1.2 Traditional ML vs Foundation Model
Traditional Machine Learning Pipeline

  Task A → Data A → Model A → Deploy A
  Task B → Data B → Model B → Deploy B
  Task C → Data C → Model C → Deploy C

- Separate data collection for each task
- Separate model training for each task
- Limited knowledge sharing between tasks
Foundation Model Pipeline

  Massive Data (web-scale) → Foundation Model
      → Adapt A (Fine-tune) → Task A
      → Adapt B (Prompt)    → Task B
      → Adapt C (LoRA)      → Task C

- Single large-scale pre-training
- Lightweight adaptation for various tasks
- Maximizes knowledge transfer between tasks
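The "lightweight adaptation" idea can be made concrete with LoRA, one of the adaptation routes named in the pipeline. The following is a minimal sketch in plain Python, not a reference implementation: the pre-trained weight W stays frozen and only two small matrices A and B would be trained, giving an adapted weight of W + (alpha / r) * B @ A. The matrix sizes, the `alpha` scaling factor, and all helper names are illustrative assumptions.

```python
# Minimal LoRA-style sketch (illustrative; names and sizes are assumptions).

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_weight(w, a, b, alpha, r):
    """Return W + (alpha / r) * (B @ A) without modifying the frozen W."""
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d_out, d_in, r):
    """Trainable parameters for one weight matrix: full fine-tuning vs LoRA."""
    return {"full": d_out * d_in, "lora": r * (d_out + d_in)}


# Toy 2x2 frozen weight with a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]    # r x d_in
B = [[0.5], [0.0]]  # d_out x r (nonzero here only so the update is visible)
W_adapted = lora_weight(W, A, B, alpha=1.0, r=1)

# A 4096 x 4096 projection with rank 8, as a rough sense of scale.
counts = trainable_params(d_out=4096, d_in=4096, r=8)
print(W_adapted, counts)
```

For the 4096 x 4096 example, the low-rank path trains 65,536 parameters instead of roughly 16.8 million, which is why one pre-trained model can serve many tasks with small per-task add-ons.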
1.3 Types of Foundation Models
| Category | Representative Models | Input/Output |
|---|---|---|
| Language Models | GPT-4, LLaMA, Claude | Text → Text |
| Vision Models | ViT, DINOv2, SAM | Image → Features/Segmentation |
| Multimodal | CLIP, LLaVA, GPT-4V | Text+Image → Text |
| Generative | Stable Diffusion, DALL-E | Text → Image |
| Audio | Whisper, AudioLM | Audio → Text |
| Code | Codex, CodeLlama | Text → Code |
2. History of the Paradigm Shift
2.1 Timeline
| Year | Milestone |
|---|---|
| 2017 | Transformer (Vaswani et al.) - Architecture built entirely on attention |
| 2018 | BERT (Google) - Bidirectional context via Masked LM |
| | GPT-1 (OpenAI) - Generative pre-training of an autoregressive Transformer LM |
| 2019 | GPT-2 (1.5B params) - Staged release, initially described as too dangerous to release in full |
| | T5 - Text-to-Text Transfer Transformer |
| 2020 | GPT-3 (175B) - Demonstrated In-context Learning at scale |
| | Scaling Laws paper (Kaplan et al.) |
| | ViT - Applied Transformer to Vision |
| 2021 | CLIP - Connected Vision and Language |
| | DALL-E - Text-to-Image generation |
| | "Foundation Models" term coined (Stanford HAI) |
| 2022 | ChatGPT - Popularization of LLMs |
| | Chinchilla - Compute-optimal Scaling |
| | Stable Diffusion - Open-source image generation |
| 2023 | GPT-4 - Multimodal Foundation Model |
| | LLaMA - Openly released weights that sparked the open LLM ecosystem |
| | SAM - Promptable Vision Foundation Model |
| 2024 | GPT-4o, Claude 3, Gemini 1.5 - Performance competition |
| | LLaMA 3, Mistral - Open-model advancement |
| | Sora - Video Foundation Model |
2.2 Major Turning Points
(1) GPT-3's In-context Learning (2020)
GPT-3 showed that a sufficiently large language model can perform a new task from a handful of examples placed in the prompt (few-shot learning), which became a catalyst for the paradigm shift:
# Pseudocode: load_pretrained, fine_tune, and gpt3 are illustrative placeholders,
# not real library calls.

# Traditional approach: fine-tuning is required for each task
model = load_pretrained("bert-base")
model = fine_tune(model, sentiment_dataset, epochs=3)
result = model.predict("This movie was great!")

# GPT-3 in-context learning: the task is specified only through the prompt
prompt = """
Classify the sentiment:
Text: "I love this product!" → Positive
Text: "Terrible experience." → Negative
Text: "This movie was great!" →
"""
result = gpt3.generate(prompt)  # "Positive"
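(2) CLIP's Vision-Language Connection (2021)
CLIP enabled zero-shot classification by mapping images and text into a shared embedding space:
# Zero-shot image classification with CLIP (OpenAI "clip" package)
import clip
import torch
from PIL import Image
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
# Embed the image and the candidate labels in the same space
image = preprocess(Image.open("image.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a dog", "a cat", "a bird"]).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
# Classify by cosine similarity (no task-specific training)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
# e.g. [[0.95, 0.03, 0.02]] → "a dog"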
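The scoring step can be illustrated without downloading a model. The sketch below uses made-up 3-dimensional embeddings (an assumption for illustration; real CLIP embeddings have hundreds of dimensions) and shows how cosine similarity followed by a softmax turns a shared embedding space into a zero-shot classifier. The `scale` value of 100 is a stand-in for CLIP's learned logit scale, and the function names are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, text_embs, labels, scale=100.0):
    """Pick the label whose text embedding is most similar to the image embedding."""
    image_unit = normalize(image_emb)
    sims = [
        sum(a * b for a, b in zip(image_unit, normalize(text_emb)))
        for text_emb in text_embs
    ]
    probs = softmax([scale * s for s in sims])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs


# Toy embeddings: the image vector points mostly in the "dog" direction.
labels = ["a dog", "a cat", "a bird"]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_emb = [0.8, 0.2, 0.1]

prediction, probs = zero_shot_classify(image_emb, text_embs, labels)
print(prediction, [round(p, 4) for p in probs])
```

Adding a new class only requires adding a new text label and its embedding; no classifier head is retrained, which is the sense in which the classification is zero-shot.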
(3) ChatGPT's RLHF (2022)
ChatGPT generates human-aligned responses using RLHF (Reinforcement Learning from Human Feedback):
ChatGPT Training Process
- Step 1: Pre-training (GPT-3.5 base)
  Learn to predict the next token from web text
- Step 2: Supervised Fine-tuning (SFT)
  Train on high-quality human-written responses
- Step 3: Reward Model Training
  Train a model to predict the preference between response pairs
- Step 4: RLHF with PPO
  Optimize the policy using the Reward Model as the reward
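Step 3 is commonly trained with a pairwise, Bradley-Terry style objective: the reward model should score the human-preferred response higher than the rejected one. Below is a minimal sketch in plain Python, assuming the two rewards are already computed as scalars. The function name and the example values are illustrative and are not taken from any ChatGPT training code.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Compute -log(sigmoid(r_chosen - r_rejected)) in a numerically stable way."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Reward model agrees with the human label (chosen scored higher): small loss.
loss_agree = pairwise_preference_loss(2.0, -1.0)
# Reward model is indifferent: loss equals log(2).
loss_tie = pairwise_preference_loss(0.5, 0.5)
# Reward model disagrees with the human label: large loss.
loss_disagree = pairwise_preference_loss(-1.0, 2.0)

print(loss_agree, loss_tie, loss_disagree)
```

Minimizing this loss over many labeled pairs pushes the reward margin in the direction of human preference; the resulting scalar reward is what the PPO step then optimizes against.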
3. In-context Learning (ICL)
3.1 Concept
In-context Learning is the ability to perform a task using only the instructions and examples in the prompt, without updating model weights.
Types of In-context Learning
- Zero-shot (no examples):
  Prompt: "Translate to French: Hello"
  Output: "Bonjour"
- One-shot (one example):
  Prompt: "English: Hello → French: Bonjour
           English: Goodbye →"
  Output: "Au revoir"
- Few-shot (several examples):
  Prompt: "English: Hello → French: Bonjour
           English: Goodbye → French: Au revoir
           English: Thank you → French: Merci
           English: Good morning →"
  Output: "Bonjour" (or "Bon matin")
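The three variants differ only in how many demonstrations precede the query, which a small prompt builder makes explicit. This is a sketch with a hypothetical helper, `build_prompt`; the "English: ... -> French: ..." template mirrors the examples above and is just one possible format.

```python
def build_prompt(examples, query, instruction=""):
    """Format k demonstrations followed by the unanswered query.

    k = 0 gives a zero-shot prompt, k = 1 one-shot, k > 1 few-shot.
    """
    lines = [instruction] if instruction else []
    for source, target in examples:
        lines.append(f"English: {source} -> French: {target}")
    lines.append(f"English: {query} -> French:")
    return "\n".join(lines)


demos = [("Hello", "Bonjour"), ("Goodbye", "Au revoir"), ("Thank you", "Merci")]

zero_shot = build_prompt([], "Hello", instruction="Translate to French.")
one_shot = build_prompt(demos[:1], "Goodbye")
few_shot = build_prompt(demos, "Good morning")

print(few_shot)
```

The model weights are identical in all three cases; only the text passed to the model changes.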
3.2 Why ICL Works (Hypotheses)
Hypothesis 1: Bayesian Inference
- The model infers a latent task from the prompt examples
- P(task | examples) ∝ P(examples | task) × P(task)
- P(output | input, examples) = Σ_task P(output | input, task) × P(task | examples)
Hypothesis 2: Implicit Gradient Descent
- The Transformer's attention implicitly performs gradient steps on the in-context examples
- The mechanism is similar to meta-learning
Hypothesis 3: Task Vector Retrieval
- The model retrieves task vectors learned during pre-training
- The prompt activates the appropriate task vectors
3.3 Few-shot Prompt Example
# Sentiment analysis with a few-shot prompt
# (uses the openai>=1.0 client; reads the API key from OPENAI_API_KEY)
from openai import OpenAI

sentiment_prompt = """
Analyze the sentiment of the following reviews:

Review: "The food was delicious and the service was excellent!"
Sentiment: Positive

Review: "I waited for an hour and the waiter was rude."
Sentiment: Negative

Review: "It was okay, nothing special but not bad either."
Sentiment: Neutral

Review: "Best experience ever! Will definitely come back!"
Sentiment:"""

# API call
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": sentiment_prompt}],
)
print(response.choices[0].message.content)  # e.g. "Positive"
4. Emergent Capabilities
4.1 Definition
Emergent Capabilities are abilities that are absent in smaller models but suddenly appear beyond a certain scale.
[Figure: Characteristics of Emergent Abilities. Schematic plot of performance (0% to 100%) against model size in parameters (10B, 50B, 100B, 200B, 500B). Small models stay near the bottom of the scale; past a threshold there is a phase transition (sudden performance jump), after which large models approach high performance.]
4.2 Representative Emergent Capabilities
| Capability | Description | Emergence Scale (approx.) |
|---|---|---|
| Arithmetic | Multi-digit addition/multiplication | ~10B params |
| Chain-of-Thought | Step-by-step reasoning | ~60B params |
| Word Unscrambling | Restoring scrambled words | ~60B params |
| Multi-step Math | Complex math problems | ~100B params |
| Code Generation | Complex code writing | ~100B params |
4.3 Chain-of-Thought (CoT) Prompting
# Without CoT - often fails
prompt_direct = """
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A:"""
# GPT-3 (small): "8" (incorrect)

# With CoT - improved accuracy through step-by-step reasoning
prompt_cot = """
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Let's think step by step.
Roger started with 5 tennis balls.
He bought 2 cans, each with 3 balls, so 2 × 3 = 6 balls.
Total: 5 + 6 = 11 tennis balls.
The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and
bought 6 more, how many apples do they have?
A: Let's think step by step."""
# GPT-3: "They started with 23, used 20, so 23-20=3.
#         Then bought 6 more: 3+6=9. The answer is 9." (correct)
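In practice the final answer has to be parsed out of the generated reasoning. Below is a small sketch, assuming completions end with the "The answer is N." pattern used in the demonstration above. `extract_answer` and `majority_answer` are hypothetical helper names, the sample completions are hand-written rather than real model output, and the majority vote is a simplified form of self-consistency over several sampled completions.

```python
import re
from collections import Counter

# Matches the closing sentence used in the CoT demonstration.
ANSWER_PATTERN = re.compile(r"The answer is\s+(-?\d+(?:\.\d+)?)")


def extract_answer(completion):
    """Return the last 'The answer is N' value as a string, or None if absent."""
    matches = ANSWER_PATTERN.findall(completion)
    return matches[-1] if matches else None


def majority_answer(completions):
    """Vote over the parsed answers of several sampled reasoning chains."""
    answers = [a for a in map(extract_answer, completions) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Hand-written stand-ins for sampled completions (two correct, one faulty).
samples = [
    "They started with 23, used 20, so 23-20=3. Then bought 6 more: 3+6=9. The answer is 9.",
    "23 - 20 = 3, and 3 + 6 = 9. The answer is 9.",
    "23 + 6 = 29. The answer is 29.",
]

final = majority_answer(samples)
print(final)
```

Voting over several chains tolerates an occasional faulty chain, at the cost of generating multiple completions per question.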
5. Core Components of Foundation Models
5.1 Architecture Comparison
Major Architecture Patterns

Encoder-only (BERT, DINOv2)
  [CLS] Token1 Token2 ... TokenN [SEP]
    → Bidirectional attention
    → Pooled / token representations
- Utilizes bidirectional context
- Suitable for classification, embeddings

Decoder-only (GPT, LLaMA)
  Token1 → Token2 → Token3 → ...
    → Causal (masked) attention
    → Next-token prediction at each position
- Autoregressive generation
- Optimized for text generation

Encoder-Decoder (T5, BART)
  Encoder (bidirectional) → Decoder (causal)
- Separate input understanding and output generation
- Suitable for translation, summarization
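The difference between bidirectional and causal attention comes down to which positions each token is allowed to attend to. The sketch below builds both masks as plain Python lists of booleans for a 4-token sequence; the helper names are illustrative, and real implementations apply such masks to the attention scores before the softmax.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position q may attend only to positions k <= q."""
    return [[key <= query for key in range(n)] for query in range(n)]


def visible_positions(mask, query):
    """List the key positions a given query position is allowed to see."""
    return [key for key, allowed in enumerate(mask[query]) if allowed]


encoder_mask = bidirectional_mask(4)
decoder_mask = causal_mask(4)

for query in range(4):
    print(query, visible_positions(encoder_mask, query), visible_positions(decoder_mask, query))
```

In an encoder-decoder model the decoder uses the causal mask for its own tokens while its cross-attention can see all encoder positions.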
5.2 Key Components
Core Components of Foundation Models:
1. Self-Attention
   - Query, Key, Value operations
   - Learns relationships between all positions
2. Feed-Forward Network (FFN)
   - Often interpreted as the model's knowledge storage
   - Accounts for most of the parameters in a typical Transformer block
3. Positional Encoding
   - Injects sequence-order information
   - Sinusoidal, Learnable, RoPE, etc.
4. Normalization
   - LayerNorm (BERT, GPT)
   - RMSNorm (LLaMA) - cheaper to compute because it skips mean subtraction
5. Activation Function
   - GELU (BERT, GPT)
   - SwiGLU (LLaMA) - reported to give better quality
6. Using Foundation Models
6.1 Getting Started with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model (e.g., LLaMA-2-7B)
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # Save memory
    device_map="auto",          # Automatic GPU allocation
)

# Generate text
prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    do_sample=True,
    top_p=0.9,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
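The `temperature` and `top_p` arguments control how the next token is sampled. The following plain-Python sketch shows the idea on a made-up 4-token vocabulary: temperature rescales the logits before the softmax, and top-p (nucleus) filtering keeps the smallest set of most likely tokens whose cumulative probability reaches `top_p`. The function names and logits are illustrative assumptions, and the library's actual implementation differs in detail.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (lower = sharper)."""
    scaled = [logit / temperature for logit in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


# Made-up logits for a 4-token vocabulary.
logits = [2.0, 1.0, 0.1, -1.0]
probs = apply_temperature(logits, temperature=0.7)
nucleus = top_p_filter(probs, top_p=0.9)

print([round(p, 3) for p in probs])
print({i: round(p, 3) for i, p in nucleus.items()})
```

With these values only the two most likely tokens survive the filter, so the unlikely tail can never be sampled; the next token would then be drawn from the renormalized `nucleus` distribution.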
6.2 Using a Vision Foundation Model
# DINOv2 - Universal image feature extraction
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base")

# Extract image embeddings
image = Image.open("image.jpg").convert("RGB")  # ensure 3 channels
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

features = outputs.last_hidden_state  # (1, num_patches+1, 768)
cls_embedding = features[:, 0]        # CLS token (whole image representation)
# Use this embedding for classification, retrieval, segmentation, etc.
6.3 Using API (OpenAI)
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain foundation models in simple terms."},
    ],
    temperature=0.7,
    max_tokens=500,
)
print(response.choices[0].message.content)
7. Limitations and Challenges of Foundation Models
7.1 Current Limitations
| Limitation | Description | Solution Attempts |
|---|---|---|
| Hallucination | Generates plausible but false information | RAG, Grounding |
| Outdated Knowledge | Unaware of information after the training cutoff | RAG, Fine-tuning |
| Reasoning Limits | Difficulty with complex logical reasoning | CoT, Self-consistency |
| High Compute Cost | Enormous training/inference costs | Quantization, Distillation |
| Safety/Alignment | Can generate harmful content | RLHF, Constitutional AI |
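Of the mitigations in the table, quantization is easy to illustrate in a few lines. The sketch below shows symmetric per-tensor int8 quantization in plain Python, assuming a non-empty list of float weights. Production libraries use more elaborate schemes (per-channel scales, calibration data, 4-bit formats), so treat this only as an illustration of the idea; the function names are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 codes."""
    return [value * scale for value in quantized]


weights = [0.50, -1.27, 0.03, 0.99]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored in 8 bits instead of 16 or 32, trading a small, bounded rounding error for lower memory use and cheaper inference.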
7.2 Research Directions
Future Research Directions
1. Efficient Models
   - Mixture of Experts, Sparse Attention, Quantization
2. Multimodal Integration
   - Unified Vision + Language + Audio + Code
3. Reasoning Enhancement
   - Test-time Compute (o1), Tree of Thoughts
4. Continual Learning
   - Continuous learning, Solving Catastrophic Forgetting
5. Safety & Alignment
   - Constitutional AI, Red-teaming, Interpretability
6. Agentic Systems
   - Tool Use, Multi-Agent, Autonomous Planning
Summary
Key Concepts
- Foundation Model: General-purpose models trained on large-scale data and applicable to various tasks
- Paradigm Shift: Task-specific → Pre-train & Adapt
- In-context Learning: Learning through prompts without weight updates
- Emergent Capabilities: Abilities that suddenly appear at scale
Next Steps
- 02_Scaling_Laws.md: Relationship between model size and performance
- 03_Emergent_Abilities.md: In-depth analysis of emergent abilities
References
Key Papers
- Bommasani et al. (2021). "On the Opportunities and Risks of Foundation Models"
- Brown et al. (2020). "Language Models are Few-Shot Learners" (GPT-3)
- Radford et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision" (CLIP)
- Wei et al. (2022). "Emergent Abilities of Large Language Models"