# Foundation Models Learning Guide
## Overview
Foundation Models refer to models that are pre-trained on large-scale data and can be adapted to various downstream tasks. This folder covers the paradigm, Scaling Laws, state-of-the-art architectures, and practical applications of Foundation Models.
## Prerequisites
- Deep_Learning folder: ViT, CLIP, Self-Supervised Learning, Transformer
- LLM_and_NLP folder: BERT, GPT, HuggingFace, Fine-tuning, RAG
## Learning Objectives
- Understand the Foundation Model paradigm and Scaling Laws
- Learn state-of-the-art model architectures like LLaMA, Mistral, DINOv2, SAM
- Master efficient adaptation (PEFT) and deployment strategies
- Understand the working principles of Multimodal Foundation Models
## File List

### Section 1: Foundation Model Paradigm (01-03)

### Section 2: Pre-training Deep Dive (04-07)
| File | Topic | Key Content | Difficulty |
|------|-------|-------------|------------|
| 04_Pretraining_Objectives.md | Objectives | Causal LM, Masked LM, Prefix LM, UL2 | ⭐⭐⭐ |
| 05_Data_Curation.md | Data Curation | The Pile, RedPajama, Deduplication, Quality Filtering | ⭐⭐⭐ |
| 06_Pretraining_Infrastructure.md | Training Infrastructure | FSDP, DeepSpeed ZeRO, Distributed Training | ⭐⭐⭐⭐ |
| 07_Tokenization_Advanced.md | Tokenization | BPE, Unigram, Multilingual, Tokenizer-free | ⭐⭐⭐ |
### Section 3: State-of-the-art LLM Architectures (08-11)

| File | Topic | Key Content | Difficulty |
|------|-------|-------------|------------|
| 08_LLaMA_Family.md | LLaMA | LLaMA 1/2/3, RoPE, RMSNorm, SwiGLU, GQA | ⭐⭐⭐ |
| 09_Mistral_MoE.md | Mistral & MoE | Mixtral, Sparse MoE, Router Design, Efficiency | ⭐⭐⭐⭐ |
| 10_Long_Context_Models.md | Long Context | Longformer, Ring Attention, YaRN, PI | ⭐⭐⭐ |
| 11_Small_Language_Models.md | Small LMs | Phi, Gemma, Qwen, TinyLlama, Knowledge Distillation | ⭐⭐⭐ |
### Section 4: Vision Foundation Models (12-15)

### Section 5: Multimodal Foundation Models (16-18)

### Section 6: Efficient Adaptation (19-21)

### Section 7: Deployment and Production (22-24)

### Section 8: Future Directions (25)

| File | Topic | Key Content | Difficulty |
|------|-------|-------------|------------|
| 25_Research_Frontiers.md | Research Frontiers | World Models, o1 Reasoning, Synthetic Data | ⭐⭐⭐⭐ |
## Learning Roadmap

```text
┌─────────────────────────────────────────────────────────────────────────┐
│ Foundation Models Learning Path │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ [Prerequisites] │
│ Deep_Learning (ViT, CLIP, Transformer) + LLM_and_NLP (BERT, GPT, RAG) │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Phase 1: Paradigm (Week 1) │ │
│ │ 01 → 02 → 03 │ │
│ │ (FM Definition → Scaling Laws → Emergence)│ │
│ └──────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────┴───────────────┐ │
│ ▼ ▼ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Path A: LLM Focus │ │ Path B: Vision Focus│ │
│ │ 04-11 (Pre-train │ │ 12-15 (DINOv2, │ │
│ │ + LLM Architecture)│ │ SAM, Image Gen) │ │
│ └─────────────────────┘ └─────────────────────┘ │
│ │ │ │
│ └───────────────┬───────────────┘ │
│ ▼ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Phase 3: Multimodal (Week 3-4) │ │
│ │ 16 → 17 → 18 │ │
│ │ (LLaVA → GPT-4V → Audio/Video) │ │
│ └──────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Phase 4: Practical Application (Week 5-6)│ │
│ │ 19 → 20 → 21 → 22 → 23 → 24 │ │
│ │ (PEFT → Instruction → Deploy → RAG) │ │
│ └──────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Phase 5: Future Directions (Week 7) │ │
│ │ 25 (Research Frontiers) │ │
│ └──────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
```
## Relationship with Existing Folders

### Connection with Deep_Learning Folder

| Deep_Learning Lesson | Foundation_Models Extension |
|----------------------|-----------------------------|
| 19_ViT | 10_Long_Context (ViT-based extensions) |
| 20_CLIP | 16_Vision_Language_Deep (LLaVA, etc.) |
| 21_Self_Supervised | 12_DINOv2 (Latest SSL) |
| 17_Diffusion | 15_Image_Generation_Advanced (SDXL, ControlNet) |
### Connection with LLM_and_NLP Folder

| LLM_and_NLP Lesson | Foundation_Models Extension |
|--------------------|------------------------------|
| 04-05_BERT_GPT | 08-09_LLaMA_Mistral (latest open-source models) |
| 07_Fine_Tuning | 19_PEFT_Unified (LoRA variants, unified) |
| 09_RAG | 23_Advanced_RAG (Agentic RAG, etc.) |
| 13_Quantization | 22_Inference_Optimization (vLLM, Speculative Decoding) |
## Recommended Learning Paths

### Quick Practical Application (2 weeks)
01 → 02 → 08 → 09 → 19 → 22
(Paradigm → Scaling → LLaMA → Mistral → PEFT → Inference)
### Vision Foundation Focus (2 weeks)
01 → 03 → 12 → 13 → 14 → 15
(Paradigm → Emergence → DINOv2 → SAM → Unified → Image Gen)
### Multimodal Specialist (3 weeks)
01 → 02 → 03 → 12 → 16 → 17 → 18 → 23
(Basics → Vision → VLM → GPT-4V → Audio/Video → RAG)
### Complete Learning (6-7 weeks)
Sequential learning of all lessons (01 → 25)
## Environment Setup

### Minimum Requirements
```bash
# Requires Python >= 3.10

# Core libraries (quote version specifiers so the shell doesn't parse ">=")
pip install "torch>=2.0" "transformers>=4.36" accelerate
pip install bitsandbytes peft  # for PEFT training
pip install vllm               # for inference optimization
```
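To confirm the setup, a quick sanity check (a minimal sketch, assuming a single-GPU machine) can print library versions and GPU visibility:

```python
# Environment sanity check: library versions and GPU visibility.
import torch
import transformers

print(f"torch        : {torch.__version__}")
print(f"transformers : {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)  # first visible GPU
    print(f"GPU: {props.name}, {props.total_memory / 1e9:.1f} GB VRAM")
```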
### Additional Libraries (per lesson)
```bash
# Vision Foundation Models
pip install timm segment-anything

# Multimodal
pip install open-clip-torch

# RAG
pip install langchain chromadb sentence-transformers
```
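As a glimpse of what the RAG stack is used for, the sketch below builds a tiny in-memory vector store with chromadb and queries it by text. The collection name and document strings are illustrative placeholders, and chromadb's default embedding function (a small MiniLM-class model) handles the embeddings:

```python
# Minimal retrieval sketch: embed two documents, then query by text.
# Collection name and documents are illustrative placeholders.
import chromadb

client = chromadb.Client()  # ephemeral in-memory instance
collection = client.create_collection(name="demo_docs")

collection.add(
    ids=["d1", "d2"],
    documents=[
        "LLaMA uses RoPE positional embeddings and RMSNorm.",
        "SAM performs promptable image segmentation.",
    ],
)

# The query text is embedded with the same default embedding function,
# and the nearest document is returned.
results = collection.query(query_texts=["Which model segments images?"], n_results=1)
print(results["documents"])  # expected: the SAM sentence
```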
### Recommended GPU Memory
| Workload | Minimum VRAM | Recommended VRAM |
|----------|--------------|------------------|
| Inference (7B model, 4-bit) | 6 GB | 8 GB |
| Inference (7B model, FP16) | 14 GB | 16 GB |
| Fine-tuning (LoRA) | 8 GB | 16 GB |
| SAM inference | 8 GB | 12 GB |
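To see where the 4-bit and LoRA rows come from in practice, here is a hedged QLoRA-style sketch: a 7B base model loaded in 4-bit via bitsandbytes, then wrapped with a LoRA adapter using peft. The model ID and the LoRA hyperparameters (`r`, `lora_alpha`, `target_modules`) are illustrative choices, not prescriptions.

```python
# QLoRA-style setup sketch: 4-bit frozen base model + trainable LoRA adapter.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-2-7b-hf"  # any 7B causal LM works similarly

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NF4 quantization from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                  # place layers on the available GPU(s)
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

Because the 4-bit base weights stay frozen and only the small adapter matrices receive gradients, this setup fits comfortably within the 8-16 GB band listed above.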
## References

### Key Papers
- Scaling Laws: Kaplan et al. (2020), Hoffmann et al. (2022, Chinchilla)
- LLaMA: Touvron et al. (2023)
- Mistral/Mixtral: Jiang et al. (2023, 2024)
- DINOv2: Oquab et al. (2023)
- SAM: Kirillov et al. (2023)
- LLaVA: Liu et al. (2023)
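The Chinchilla result (Hoffmann et al., 2022) is often reduced to two rules of thumb: training compute is roughly C ≈ 6ND FLOPs for N parameters and D training tokens, and the compute-optimal token budget is about D ≈ 20N. A back-of-the-envelope sketch:

```python
# Chinchilla rules of thumb: C ≈ 6*N*D FLOPs, compute-optimal D ≈ 20*N tokens.
def chinchilla_optimal_tokens(n_params: float) -> float:
    return 20.0 * n_params

def training_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

n = 7e9  # a 7B-parameter model
d = chinchilla_optimal_tokens(n)
print(f"Compute-optimal tokens: {d:.2e}")                     # ~1.4e11 (140B)
print(f"Training FLOPs:         {training_flops(n, d):.2e}")  # ~5.9e21
```

Note that the LLaMA models were deliberately trained far past this optimum (over 1T tokens for the 7B model), trading extra training compute for lower inference cost.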
## Next Steps
After completing this folder:
- Model_Implementations: Implement major models from scratch for deeper understanding
- MLOps: Build model deployment and operational pipelines
- Reinforcement_Learning: Study RLHF in greater depth