Scaling Laws

Scaling Laws

Learning Objectives

  • Understand the concept and mathematical form of Scaling Laws
  • Compare Kaplan et al. vs Chinchilla laws
  • Learn compute-optimal training strategies
  • Grasp how to apply Scaling Laws in practice

1. What are Scaling Laws?

1.1 Definition

Scaling Laws are empirical laws that describe the relationship between number of parameters (N), amount of data (D), compute (C), and performance (Loss) of models.

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│                    Core Relationships in Scaling Laws            │
ā”œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¤
│                                                                 │
│  Loss ā‰ˆ A/N^α + B/D^β + E                                       │
│                                                                 │
│  N = Number of model parameters                                 │
│  D = Number of training data tokens                             │
│  C = Compute (FLOPs) ā‰ˆ 6 Ɨ N Ɨ D                                │
│  E = Irreducible minimum loss (entropy of data)                 │
│                                                                 │
│  Key findings:                                                  │
│  • Loss decreases according to Power Law with respect to N, D   │
│  • When C is fixed, there exists an optimal ratio of N and D    │
│  • Larger models utilize data more efficiently                  │
│                                                                 │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

1.2 Why Important?

"""
Practical value of Scaling Laws:

1. Cost Prediction
   - Estimate required resources before training
   - "How much is needed to train a 10B model?"

2. Optimal Allocation
   - Decide model size vs data amount with fixed budget
   - "What's the best setup with $100M budget?"

3. Performance Prediction
   - Estimate large model performance from small models
   - "With current 7B model, how much better will 70B be?"

4. Research Planning
   - Determine research directions with high ROI
   - "Should we increase data or scale up model?"
"""

2. Kaplan Scaling Laws (2020)

2.1 OpenAI's Initial Research

Laws discovered in Kaplan et al.'s 2020 paper "Scaling Laws for Neural Language Models":

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│                    Kaplan Scaling Laws                           │
ā”œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¤
│                                                                 │
│  1. Loss vs Parameters                                          │
│     L(N) = (N_c / N)^α_N, where α_N ā‰ˆ 0.076                     │
│                                                                 │
│  2. Loss vs Data                                                │
│     L(D) = (D_c / D)^α_D, where α_D ā‰ˆ 0.095                     │
│                                                                 │
│  3. Loss vs Compute                                             │
│     L(C) = (C_c / C)^α_C, where α_C ā‰ˆ 0.050                     │
│                                                                 │
│  Key claims:                                                    │
│  • Parameter count is most important (α_N < α_D)                │
│  • For same compute, larger model + less data is better         │
│  • N āˆ C^0.73, D āˆ C^0.27 (Compute-optimal allocation)          │
│                                                                 │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

2.2 Visualization

   Loss (Log)
       │
   3.5 ā”œā”€ā—ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ 100M params
       │   ╲
   3.0 ā”œā”€ā”€ā”€ā”€ā”€ā—ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€  1B params
       │       ╲
   2.5 ā”œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā—ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ 10B params
       │           ╲
   2.0 ā”œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā—ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€100B params
       │               ╲
   1.5 ā”œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā—ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€  1T params (predicted)
       │
       └───┬───┬───┬───┬───┬───┬───┬──▶
          10^18  19   20   21   22   23   Compute (FLOPs)

   • Straight line = Power Law (linear in log scale)
   • Slope = α_C ā‰ˆ 0.05

2.3 Model Design Following Kaplan's Law

"""
Example application of Kaplan's law:

Compute budget: 10^21 FLOPs

Kaplan optimal allocation:
- N āˆ C^0.73 → N ā‰ˆ 10^15 (about 1 trillion parameters?!)
- D āˆ C^0.27 → D ā‰ˆ 10^9 (about 1 billion tokens)

Problem:
- Model becomes too large with insufficient data
- GPT-3 (175B) followed this law but...
- Chinchilla refuted this
"""

3. Chinchilla Scaling Laws (2022)

3.1 DeepMind's Rediscovery

Hoffmann et al.'s "Training Compute-Optimal Large Language Models" revised Kaplan's law:

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│                    Chinchilla Scaling Laws                       │
ā”œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¤
│                                                                 │
│  Key finding: Existing models are Under-trained!                │
│                                                                 │
│  Compute-optimal scaling:                                        │
│  • N āˆ C^0.5  (number of parameters)                            │
│  • D āˆ C^0.5  (number of data tokens)                           │
│  • i.e., N and D should increase at the same rate for optimality│
│                                                                 │
│  Practical rule:                                                │
│  D ā‰ˆ 20 Ɨ N  (tokens ā‰ˆ 20 Ɨ parameters)                         │
│                                                                 │
│  Examples:                                                      │
│  • 1B model → 20B tokens needed                                 │
│  • 7B model → 140B tokens needed                                │
│  • 70B model → 1.4T tokens needed                               │
│                                                                 │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

3.2 Chinchilla vs Gopher Comparison

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│               Chinchilla (70B) vs Gopher (280B)                  │
ā”œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¤
│                                                                 │
│  Model      │ Parameters│ Train Tokens│ Compute   │ Performance│
│  ───────────│──────────│─────────────│───────────│────────────│
│  Gopher     │ 280B     │ 300B        │ 5.0Ɨ10^23 │ Baseline   │
│  Chinchilla │ 70B      │ 1.4T        │ 5.0Ɨ10^23 │ +10% better│
│                                                                 │
│  Conclusion:                                                    │
│  • 4x smaller model performs better with same compute!          │
│  • Gopher is Under-trained (insufficient data)                  │
│  • Simply increasing model size is inefficient                  │
│                                                                 │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

3.3 Status of Existing Models

             Tokens (D)
                │
          10T   ā”œ                               ā— LLaMA 2 (2023)
                │                           ā—
           1T   ā”œ                       ā— Chinchilla (Optimal)
                │                   ╱
         100B   ā”œ               ╱       ā— GPT-3 (Under-trained)
                │           ╱
          10B   ā”œ       ╱
                │   ╱                   ā— Gopher (Very Under-trained)
           1B   ā”œā”€
                └───┬───┬───┬───┬───┬───┬───┬───▶
                   1B  10B 100B  1T  10T      Parameters (N)

             ╱ = Compute-optimal frontier (D ā‰ˆ 20N)

             Points below the line are Under-trained

4. Mathematical Formulation

4.1 Loss Function

"""
Mathematical form of Scaling Law:

1. Single Variable Scaling
   L(N) = (N_c / N)^α + L_āˆž     # Consider parameters only
   L(D) = (D_c / D)^β + L_āˆž     # Consider data only

2. Combined Scaling (Chinchilla)
   L(N, D) = E + A/N^α + B/D^β

   where:
   - E ā‰ˆ 1.69 (irreducible loss, data entropy)
   - A ā‰ˆ 406.4
   - B ā‰ˆ 410.7
   - α ā‰ˆ 0.34
   - β ā‰ˆ 0.28

3. Compute Perspective
   C ā‰ˆ 6 Ɨ N Ɨ D  (FLOPs for training)

   Optimization: min L(N, D) subject to C = 6ND

   Result: N* āˆ C^0.5, D* āˆ C^0.5
"""

4.2 Scaling Law Simulation with Python

import numpy as np
import matplotlib.pyplot as plt

def chinchilla_loss(N, D, A=406.4, B=410.7, alpha=0.34, beta=0.28, E=1.69):
    """
    Calculate Loss according to Chinchilla Scaling Law

    Args:
        N: Number of parameters (billions)
        D: Number of tokens (billions)

    Returns:
        Expected Loss (log of perplexity)
    """
    return E + A / (N ** alpha) + B / (D ** beta)

def optimal_allocation(compute_budget, flops_per_token=6):
    """
    Calculate optimal N, D for given compute budget

    Args:
        compute_budget: Total FLOPs (e.g., 10^23)
        flops_per_token: FLOPs per token (approximately 6N)

    Returns:
        optimal_N, optimal_D (in billions)
    """
    # Chinchilla optimal ratio: D ā‰ˆ 20N
    # C = 6 * N * D = 6 * N * 20N = 120 * N^2
    # N = sqrt(C / 120)

    optimal_N = np.sqrt(compute_budget / 120) / 1e9  # billions
    optimal_D = 20 * optimal_N                        # billions

    return optimal_N, optimal_D

# Example: 10^23 FLOPs budget
compute = 1e23
N_opt, D_opt = optimal_allocation(compute)
print(f"Compute budget: 10^23 FLOPs")
print(f"Optimal parameters: {N_opt:.1f}B")
print(f"Optimal tokens: {D_opt:.1f}B")
print(f"Expected loss: {chinchilla_loss(N_opt, D_opt):.3f}")

# Visualization: Loss according to N vs D
N_range = np.logspace(0, 3, 50)  # 1B to 1000B
D_range = np.logspace(0, 4, 50)  # 1B to 10000B

N_grid, D_grid = np.meshgrid(N_range, D_range)
Loss_grid = chinchilla_loss(N_grid, D_grid)

plt.figure(figsize=(10, 8))
plt.contour(N_grid, D_grid, Loss_grid, levels=20)
plt.xscale('log')
plt.yscale('log')
plt.xlabel('Parameters N (Billions)')
plt.ylabel('Tokens D (Billions)')
plt.title('Chinchilla Scaling Law: Loss Contours')
plt.colorbar(label='Loss')
plt.plot(N_range, 20*N_range, 'r--', label='Optimal ratio (D=20N)')
plt.legend()
plt.show()

5. Application in Real Models

5.1 Scaling Comparison of Major Models

Model Parameters (N) Tokens (D) D/N Ratio Status
GPT-3 175B 300B 1.7 Under-trained
Gopher 280B 300B 1.1 Very Under-trained
Chinchilla 70B 1.4T 20 Optimal
LLaMA 1 65B 1.4T 21.5 Near-optimal
LLaMA 2 70B 2T 28.6 Slight Over-trained
Mistral 7B 8T (est.) ~1000 Over-trained

5.2 Benefits of Over-training

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│                Over-training Strategy (LLaMA 2, Mistral)         │
ā”œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¤
│                                                                 │
│  Chinchilla is "training" optimal but not "deployment" optimal! │
│                                                                 │
│  From deployment perspective:                                   │
│  • Inference cost āˆ N (model size)                              │
│  • Training once, inference trillions of times                  │
│                                                                 │
│  Therefore:                                                     │
│  • Smaller model + more data = inference efficient              │
│  • "Inference-optimal" ≠ "Compute-optimal"                      │
│                                                                 │
│  LLaMA 2 strategy:                                              │
│  • 70B model with 2T tokens (D/N ā‰ˆ 29)                          │
│  • Train longer than Chinchilla                                 │
│  • Result: Better performance with smaller model                │
│                                                                 │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

5.3 Practical Guidelines

"""
Scaling strategies in practice:

1. Research/Experimentation Phase (Compute-limited)
   - Follow Chinchilla rule: D ā‰ˆ 20N
   - Iterate quickly with smaller models

2. Production Deployment (Inference-limited)
   - Consider over-training: D > 20N
   - Smaller model + more data
   - Example: Mistral 7B > LLaMA 2 13B (on some tasks)

3. Budget Planning
   - C = 6 * N * D (FLOPs)
   - GPU hours ā‰ˆ C / (GPU_FLOPS * utilization)
   - Example: A100 80GB = ~300 TFLOPS (effective)

4. Scale-up Strategy
   - Tune hyperparameters with small models
   - Predict large model performance with Scaling Law
   - Execute large-scale training after validation
"""

def estimate_training_cost(N_billions, D_billions, gpu_price_per_hour=2.0):
    """
    Estimate training cost

    Args:
        N_billions: Number of parameters (B)
        D_billions: Number of tokens (B)
        gpu_price_per_hour: GPU cost per hour (USD)

    Returns:
        dict: Expected cost information
    """
    N = N_billions * 1e9
    D = D_billions * 1e9

    # 6ND FLOPs for training
    total_flops = 6 * N * D

    # A100 80GB: ~300 TFLOPS effective
    gpu_tflops = 300
    gpu_flops = gpu_tflops * 1e12

    # Total GPU time
    total_gpu_seconds = total_flops / gpu_flops
    total_gpu_hours = total_gpu_seconds / 3600

    # Cost
    total_cost = total_gpu_hours * gpu_price_per_hour

    return {
        "total_flops": f"{total_flops:.2e}",
        "gpu_hours": f"{total_gpu_hours:,.0f}",
        "cost_usd": f"${total_cost:,.0f}",
        "cost_with_8gpus": f"${total_cost/8:,.0f} ({total_gpu_hours/8:,.0f} hours)"
    }

# Example: LLaMA 2 7B training cost
cost_7b = estimate_training_cost(7, 2000)
print("LLaMA 2 7B (2T tokens):")
for k, v in cost_7b.items():
    print(f"  {k}: {v}")

6. Extensions of Scaling Laws

6.1 Scaling in Other Domains

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│                    Scaling Laws by Domain                        │
ā”œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¤
│                                                                 │
│  Vision (ViT):                                                  │
│  • Similar power law observed                                   │
│  • α ā‰ˆ 0.05 (smaller than Language)                            │
│  • Data quality is more important                               │
│                                                                 │
│  Multimodal (CLIP):                                             │
│  • Separate optimization needed for image and text scaling      │
│  • Quality of data pairs is critical                            │
│                                                                 │
│  Code:                                                          │
│  • Steeper scaling (larger α)                                   │
│  • High-quality code data is scarce                             │
│                                                                 │
│  Reasoning:                                                     │
│  • Not smooth due to emergent behavior                          │
│  • Sudden performance improvements at specific thresholds       │
│                                                                 │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

6.2 Fine-tuning Scaling Laws

"""
Scaling Law also applies to fine-tuning:

Research findings:
- Larger base model = less fine-tuning data needed
- Fine-tuning data also scales according to power law
- PEFT like LoRA follows similar patterns

Practical rules:
- Base model size Ɨ 10 = Fine-tuning data amount (approximately)
- 7B model: ~1K-10K examples
- 70B model: ~100-1K examples (to achieve same performance)

However, quality > quantity:
- 100 high-quality examples > 10,000 low-quality examples
"""

6.3 Inference Scaling (Test-time Compute)

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│                    Inference Scaling (o1-style)                  │
ā”œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¤
│                                                                 │
│  Traditional Scaling: Increase compute during training          │
│  Inference Scaling: Increase compute during inference           │
│                                                                 │
│  Methods:                                                       │
│  • Generate longer Chain-of-Thought                             │
│  • Generate multiple answers and vote (Self-consistency)        │
│  • Tree of Thoughts / Beam Search                               │
│  • Iterative Verification/Refinement                            │
│                                                                 │
│  Effects:                                                       │
│  • Significantly improved accuracy on difficult problems        │
│  • Performance improvement possible without training            │
│  • Paradigm shift from GPT-4 → o1 (inference-time scaling)      │
│                                                                 │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

7. Limitations of Scaling

7.1 Physical Limits

"""
Real limitations of Scaling:

1. Data Limits
   - Total internet text: ~10-50T tokens
   - High-quality data is much less
   - As of 2024, data exhaustion discussion beginning

2. Compute Limits
   - Power consumption (MW scale)
   - Semiconductor supply
   - Cost (billions of dollars)

3. Architecture Limits
   - Attention's O(n²) complexity
   - Memory bandwidth bottleneck
   - Communication overhead in distributed training

4. Diminishing Returns
   - α ā‰ˆ 0.05 means 10x compute → ~12% loss reduction
   - Increasingly larger investments needed
"""

7.2 Improvement Directions Beyond Scaling

Direction Description Examples
Architecture More efficient structures Mamba, RWKV, Hyena
Data Quality High-quality data curation Phi, LIMA
Synthetic Data Generate training data with AI Self-Instruct
Efficient Training Improve training efficiency Flash Attention, ZeRO
Test-time Compute Increase compute during inference CoT, Self-consistency, o1

Summary

Key Concepts

  • Scaling Laws: Power law relationship between parameters, data, compute, and performance
  • Kaplan: Prioritize N (large model + less data)
  • Chinchilla: Balance N and D (D ā‰ˆ 20N)
  • Over-training: Train smaller models longer for inference efficiency

Practical Formulas

Compute-optimal: D ā‰ˆ 20 Ɨ N (tokens)
Training FLOPs: C ā‰ˆ 6 Ɨ N Ɨ D
Inference-optimal: Smaller N, larger D

Next Steps


References

Key Papers

  • Kaplan et al. (2020). "Scaling Laws for Neural Language Models"
  • Hoffmann et al. (2022). "Training Compute-Optimal Large Language Models" (Chinchilla)
  • Touvron et al. (2023). "LLaMA 2: Open Foundation and Fine-Tuned Chat Models"

Additional Resources

to navigate between lessons