07. Gradient Descent Theory

07. Gradient Descent Theory

Learning Objectives

  • Understand the basic principle and update rule of gradient descent and implement it
  • Theoretically analyze convergence rates for convex and strongly convex functions
  • Learn the principle of stochastic gradient descent (SGD) and the role of mini-batches
  • Understand the operating principle of momentum and Nesterov accelerated gradient with physical intuition
  • Learn the derivation process of adaptive learning rate methods like Adam and RMSProp
  • Understand and apply practical considerations in neural network optimization

1. Gradient Descent Basics

1.1 Basic Principle

Gradient descent (GD) is a first-order optimization algorithm that iteratively moves in the opposite direction of the gradient to minimize a function.

Update rule:

$$ x_{t+1} = x_t - \eta \nabla f(x_t) $$

  • $x_t$: parameter at time $t$
  • $\eta$: learning rate (step size)
  • $\nabla f(x_t)$: gradient at $x_t$

Intuition: - Gradient $\nabla f(x)$ is the direction of fastest increase - Negative gradient $-\nabla f(x)$ is the direction of fastest decrease (steepest descent) - Learning rate $\eta$ controls the step size

1.2 First-Order Taylor Approximation

Gradient descent is based on first-order Taylor approximation:

$$ f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x $$

Choosing $\Delta x = -\eta \nabla f(x)$:

$$ f(x - \eta \nabla f(x)) \approx f(x) - \eta \|\nabla f(x)\|^2 $$

If $\eta$ is sufficiently small, the function value decreases.

1.3 Implementation and Visualization

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

# ๋ชฉ์  ํ•จ์ˆ˜: f(x,y) = (x-1)^2 + 2(y-2)^2
def objective(x, y):
    return (x - 1)**2 + 2*(y - 2)**2

def gradient(x, y):
    df_dx = 2*(x - 1)
    df_dy = 4*(y - 2)
    return np.array([df_dx, df_dy])

# ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•
def gradient_descent(x0, learning_rate, n_iterations):
    """๊ธฐ๋ณธ ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•"""
    trajectory = [x0]
    x = x0.copy()

    for _ in range(n_iterations):
        grad = gradient(x[0], x[1])
        x = x - learning_rate * grad
        trajectory.append(x.copy())

    return np.array(trajectory)

# ์ดˆ๊ธฐ์ 
x0 = np.array([-2.0, -1.0])

# ์—ฌ๋Ÿฌ ํ•™์Šต๋ฅ ๋กœ ์‹คํ—˜
learning_rates = [0.1, 0.3, 0.5, 0.9]
n_iterations = 50

fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.flatten()

# ๋“ฑ๊ณ ์„  ๊ทธ๋ฆฌ๋“œ
x_vals = np.linspace(-3, 3, 300)
y_vals = np.linspace(-2, 4, 300)
X, Y = np.meshgrid(x_vals, y_vals)
Z = objective(X, Y)

for idx, lr in enumerate(learning_rates):
    ax = axes[idx]

    # ๋“ฑ๊ณ ์„ 
    contour = ax.contour(X, Y, Z, levels=20, alpha=0.6, cmap='viridis')
    ax.clabel(contour, inline=True, fontsize=8)

    # ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ• ๊ถค์ 
    trajectory = gradient_descent(x0, lr, n_iterations)
    ax.plot(trajectory[:, 0], trajectory[:, 1], 'ro-', linewidth=2,
            markersize=6, label='GD ๊ถค์ ')
    ax.plot(x0[0], x0[1], 'g*', markersize=20, label='์‹œ์ž‘์ ')
    ax.plot(1, 2, 'r*', markersize=20, label='์ตœ์†Ÿ๊ฐ’')

    # ๊ทธ๋ž˜๋””์–ธํŠธ ๋ฒกํ„ฐ ํ‘œ์‹œ (์ฒ˜์Œ ๋ช‡ ๊ฐœ)
    for i in range(0, min(5, len(trajectory)-1), 1):
        x_curr = trajectory[i]
        grad = gradient(x_curr[0], x_curr[1])
        ax.quiver(x_curr[0], x_curr[1], -grad[0], -grad[1],
                  angles='xy', scale_units='xy', scale=5,
                  color='blue', width=0.005, alpha=0.7)

    ax.set_xlabel('x', fontsize=12)
    ax.set_ylabel('y', fontsize=12)
    ax.set_title(f'ํ•™์Šต๋ฅ  ฮท = {lr} ({len(trajectory)-1} iterations)', fontsize=13)
    ax.legend(fontsize=10)
    ax.grid(True, alpha=0.3)
    ax.set_xlim(-3, 3)
    ax.set_ylim(-2, 4)

    # ์ˆ˜๋ ด ์ •๋ณด
    final_x = trajectory[-1]
    final_loss = objective(final_x[0], final_x[1])
    distance_to_optimum = np.linalg.norm(final_x - np.array([1, 2]))
    ax.text(0.05, 0.95, f'์ตœ์ข… ์†์‹ค: {final_loss:.4f}\n์ตœ์†Ÿ๊ฐ’๊นŒ์ง€ ๊ฑฐ๋ฆฌ: {distance_to_optimum:.4f}',
            transform=ax.transAxes, fontsize=10, verticalalignment='top',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.savefig('gradient_descent_learning_rates.png', dpi=150)
plt.show()

print("ํ•™์Šต๋ฅ ์˜ ์˜ํ–ฅ:")
print("  - ฮท ๋„ˆ๋ฌด ์ž‘์Œ: ์ˆ˜๋ ด ๋А๋ฆผ")
print("  - ฮท ์ ์ ˆํ•จ: ๋น ๋ฅธ ์ˆ˜๋ ด")
print("  - ฮท ๋„ˆ๋ฌด ํผ: ๋ฐœ์‚ฐ ๋˜๋Š” ์ง„๋™")

1.4 Choosing the Learning Rate

Learning rate too small: - Very slow convergence - Many iterations required

Learning rate too large: - Oscillation around minimum - Possible divergence

Appropriate learning rate: - Theoretical upper bound: $\eta \leq \frac{1}{L}$ (L: Lipschitz constant) - Practice: grid search or learning rate schedule

# ํ•™์Šต๋ฅ ์— ๋”ฐ๋ฅธ ์ˆ˜๋ ด ๊ณก์„ 
fig, ax = plt.subplots(figsize=(12, 6))

for lr in learning_rates:
    trajectory = gradient_descent(x0, lr, n_iterations)
    losses = [objective(x[0], x[1]) for x in trajectory]
    ax.plot(losses, linewidth=2, label=f'ฮท = {lr}')

ax.set_xlabel('Iteration', fontsize=12)
ax.set_ylabel('Loss', fontsize=12)
ax.set_title('ํ•™์Šต๋ฅ ์— ๋”ฐ๋ฅธ ์†์‹ค ๊ฐ์†Œ', fontsize=14)
ax.set_yscale('log')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('learning_rate_convergence.png', dpi=150)
plt.show()

2. Convergence Analysis

2.1 Lipschitz Continuous Gradient

A function $f$ has Lipschitz continuous gradient if there exists a constant $L > 0$ satisfying:

$$ \|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|, \quad \forall x, y $$

This is equivalent to $\nabla^2 f(x) \preceq LI$ (Hessian bounded by $LI$).

2.2 Convergence for Convex Functions

Theorem (Convex Function): If $f$ is convex and the gradient is $L$-Lipschitz continuous, with learning rate $\eta = \frac{1}{L}$:

$$ f(x_t) - f(x^*) \leq \frac{L\|x_0 - x^*\|^2}{2t} $$

That is, sublinear convergence: $O(1/t)$

2.3 Convergence for Strongly Convex Functions

A function $f$ is $m$-strongly convex if:

$$ f(y) \geq f(x) + \nabla f(x)^T(y-x) + \frac{m}{2}\|y-x\|^2 $$

or equivalently $\nabla^2 f(x) \succeq mI$.

Theorem (Strongly Convex Function): If $f$ is $m$-strongly convex and the gradient is $L$-Lipschitz continuous, with learning rate $\eta = \frac{1}{L}$:

$$ \|x_t - x^*\|^2 \leq \left(1 - \frac{m}{L}\right)^t \|x_0 - x^*\|^2 $$

That is, linear convergence: $O(\rho^t)$, where $\rho = 1 - \frac{m}{L} < 1$

Condition Number: $$\kappa = \frac{L}{m}$$

The larger the condition number (ill-conditioned), the slower the convergence.

2.4 Convergence Rate Simulation

import numpy as np
import matplotlib.pyplot as plt

# ์ด์ฐจ ํ˜•์‹: f(x) = 0.5 * x^T A x
# ๊ฐ•๋ณผ๋ก: eigenvalues(A) > 0

def create_quadratic(m, L, dim=10):
    """๊ฐ•๋ณผ๋ก ์ด์ฐจ ํ•จ์ˆ˜ ์ƒ์„ฑ (์กฐ๊ฑด์ˆ˜ ฮบ = L/m)"""
    # ๊ณ ์œ ๊ฐ’์„ m๊ณผ L ์‚ฌ์ด์— ๊ท ๋“ฑ ๋ถ„ํฌ
    eigenvalues = np.linspace(m, L, dim)
    # ๋žœ๋ค ์ง๊ต ํ–‰๋ ฌ
    Q, _ = np.linalg.qr(np.random.randn(dim, dim))
    # A = Q ฮ› Q^T
    A = Q @ np.diag(eigenvalues) @ Q.T
    return A

def quadratic_objective(x, A):
    return 0.5 * x @ A @ x

def quadratic_gradient(x, A):
    return A @ x

def gd_quadratic(A, x0, learning_rate, n_iterations):
    """์ด์ฐจ ํ•จ์ˆ˜์— ๋Œ€ํ•œ ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•"""
    x = x0.copy()
    trajectory = [np.linalg.norm(x)**2]  # ||x - x*||^2, x* = 0

    for _ in range(n_iterations):
        grad = quadratic_gradient(x, A)
        x = x - learning_rate * grad
        trajectory.append(np.linalg.norm(x)**2)

    return trajectory

# ์‹คํ—˜ ์„ค์ •
dim = 10
n_iterations = 100

# ์—ฌ๋Ÿฌ ์กฐ๊ฑด์ˆ˜๋กœ ์‹คํ—˜
condition_numbers = [1, 10, 100, 1000]
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

for kappa in condition_numbers:
    m = 1.0
    L = kappa * m
    A = create_quadratic(m, L, dim)

    # ์ดˆ๊ธฐ์ 
    x0 = np.random.randn(dim)

    # ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•
    learning_rate = 1 / L
    trajectory = gd_quadratic(A, x0, learning_rate, n_iterations)

    # ์„ ํ˜• ์Šค์ผ€์ผ
    axes[0].plot(trajectory, linewidth=2, label=f'ฮบ = {kappa}')

    # ๋กœ๊ทธ ์Šค์ผ€์ผ
    axes[1].semilogy(trajectory, linewidth=2, label=f'ฮบ = {kappa}')

    # ์ด๋ก ์  ์ˆ˜๋ ด ์†๋„
    rho = 1 - m/L
    theoretical = [trajectory[0] * (rho ** t) for t in range(n_iterations + 1)]
    axes[1].semilogy(theoretical, '--', linewidth=1, alpha=0.6,
                     label=f'์ด๋ก  (ฮบ={kappa})')

axes[0].set_xlabel('Iteration', fontsize=12)
axes[0].set_ylabel('$\|x_t - x^*\|^2$', fontsize=12)
axes[0].set_title('์ˆ˜๋ ด ๊ณก์„  (์„ ํ˜• ์Šค์ผ€์ผ)', fontsize=14)
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

axes[1].set_xlabel('Iteration', fontsize=12)
axes[1].set_ylabel('$\|x_t - x^*\|^2$ (log scale)', fontsize=12)
axes[1].set_title('์ˆ˜๋ ด ๊ณก์„  (๋กœ๊ทธ ์Šค์ผ€์ผ): ์„ ํ˜• ์ˆ˜๋ ด', fontsize=14)
axes[1].legend(fontsize=9)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('convergence_analysis_condition_number.png', dpi=150)
plt.show()

print("์กฐ๊ฑด์ˆ˜์˜ ์˜ํ–ฅ:")
print("  ฮบ = 1: ์™„๋ฒฝํ•œ ์กฐ๊ฑด (๋ชจ๋“  ๋ฐฉํ–ฅ ๋™์ผํ•œ ๊ณก๋ฅ )")
print("  ฮบ >> 1: ill-conditioned (์ผ๋ถ€ ๋ฐฉํ–ฅ ๋งค์šฐ ํ‰ํƒ„)")
print("  ์ˆ˜๋ ด ์†๋„: O((1 - 1/ฮบ)^t)")

3. Stochastic Gradient Descent (SGD)

3.1 Batch vs Mini-batch

Batch Gradient Descent: Compute gradient using entire dataset: $$x_{t+1} = x_t - \eta \nabla f(x_t) = x_t - \eta \frac{1}{n}\sum_{i=1}^n \nabla f_i(x_t)$$

Stochastic Gradient Descent (SGD): Estimate gradient using one randomly selected sample: $$x_{t+1} = x_t - \eta \nabla f_{i_t}(x_t)$$

Mini-batch SGD: Estimate gradient using mini-batch of $B$ samples: $$x_{t+1} = x_t - \eta \frac{1}{|B|}\sum_{i \in B} \nabla f_i(x_t)$$

3.2 Advantages and Disadvantages of SGD

Advantages: - Computational efficiency: No need for entire dataset per iteration - Memory efficiency: Suitable for large datasets - Regularization effect: Noise helps escape sharp minima - Online learning: Works even with streaming data

Disadvantages: - Noise: Unstable gradient estimation - Learning rate tuning: More sensitive than batch GD - Convergence speed: Theoretically slower (but faster in practice)

3.3 Effect of Mini-batch Size

import numpy as np
import matplotlib.pyplot as plt

# ํ•ฉ์„ฑ ๋ฐ์ดํ„ฐ: ์„ ํ˜• ํšŒ๊ท€
np.random.seed(42)
n_samples = 1000
n_features = 20

X = np.random.randn(n_samples, n_features)
true_w = np.random.randn(n_features)
y = X @ true_w + 0.1 * np.random.randn(n_samples)

def mse_loss(w, X, y):
    """MSE ์†์‹ค"""
    return 0.5 * np.mean((X @ w - y) ** 2)

def mse_gradient(w, X, y):
    """MSE ๊ทธ๋ž˜๋””์–ธํŠธ"""
    return X.T @ (X @ w - y) / len(y)

def sgd_minibatch(X, y, batch_size, learning_rate, n_epochs):
    """๋ฏธ๋‹ˆ๋ฐฐ์น˜ SGD"""
    n_samples = len(X)
    w = np.zeros(X.shape[1])
    losses = []

    for epoch in range(n_epochs):
        # ๋ฐ์ดํ„ฐ ์…”ํ”Œ
        indices = np.random.permutation(n_samples)
        X_shuffled = X[indices]
        y_shuffled = y[indices]

        # ๋ฏธ๋‹ˆ๋ฐฐ์น˜๋กœ ๋‚˜๋ˆ„๊ธฐ
        for i in range(0, n_samples, batch_size):
            X_batch = X_shuffled[i:i+batch_size]
            y_batch = y_shuffled[i:i+batch_size]

            # ๊ทธ๋ž˜๋””์–ธํŠธ ๊ณ„์‚ฐ ๋ฐ ์—…๋ฐ์ดํŠธ
            grad = mse_gradient(w, X_batch, y_batch)
            w = w - learning_rate * grad

        # ์—ํฌํฌ๋งˆ๋‹ค ์ „์ฒด ์†์‹ค ๊ธฐ๋ก
        loss = mse_loss(w, X, y)
        losses.append(loss)

    return w, losses

# ์—ฌ๋Ÿฌ ๋ฐฐ์น˜ ํฌ๊ธฐ๋กœ ์‹คํ—˜
batch_sizes = [1, 10, 50, 200, 1000]
n_epochs = 50
learning_rate = 0.01

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

for batch_size in batch_sizes:
    w_final, losses = sgd_minibatch(X, y, batch_size, learning_rate, n_epochs)

    # ์†์‹ค ๊ณก์„ 
    axes[0].plot(losses, linewidth=2, label=f'Batch size = {batch_size}')
    axes[1].semilogy(losses, linewidth=2, label=f'Batch size = {batch_size}')

axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('MSE Loss', fontsize=12)
axes[0].set_title('๋ฐฐ์น˜ ํฌ๊ธฐ์— ๋”ฐ๋ฅธ ์ˆ˜๋ ด', fontsize=14)
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('MSE Loss (log scale)', fontsize=12)
axes[1].set_title('๋ฐฐ์น˜ ํฌ๊ธฐ์— ๋”ฐ๋ฅธ ์ˆ˜๋ ด (๋กœ๊ทธ ์Šค์ผ€์ผ)', fontsize=14)
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('sgd_batch_size_effect.png', dpi=150)
plt.show()

print("๋ฐฐ์น˜ ํฌ๊ธฐ์˜ ์˜ํ–ฅ:")
print("  ์ž‘์€ ๋ฐฐ์น˜: ๋…ธ์ด์ฆˆ ํฌ๊ณ  ๋ถˆ์•ˆ์ •, ์ •๊ทœํ™” ํšจ๊ณผ, ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ")
print("  ํฐ ๋ฐฐ์น˜: ์•ˆ์ •์  ๊ทธ๋ž˜๋””์–ธํŠธ, ๋น ๋ฅธ ์ˆ˜๋ ด, ๊ณ„์‚ฐ ๋ณ‘๋ ฌํ™”")
print("  ์‹ค์ „: 32, 64, 128, 256 ๋“ฑ 2์˜ ๊ฑฐ๋“ญ์ œ๊ณฑ (GPU ์ตœ์ ํ™”)")

3.4 SGD Variance and Learning Rate

The SGD gradient has the same expectation as the true gradient, but with variance:

$$ \mathbb{E}[\nabla f_i(x)] = \nabla f(x), \quad \text{Var}[\nabla f_i(x)] = \sigma^2 $$

High variance slows convergence and causes instability. Solutions: - Learning rate decay: $\eta_t = \frac{\eta_0}{\sqrt{t}}$ - Increase mini-batch: variance $\propto 1/|B|$ - Adaptive methods: Adam, RMSProp

4. Momentum-based Methods

4.1 Momentum SGD

Basic momentum (Heavy Ball):

$$ \begin{align} v_t &= \beta v_{t-1} + \nabla f(x_t) \\ x_{t+1} &= x_t - \eta v_t \end{align} $$

where $\beta \in [0, 1)$ is the momentum coefficient (typically 0.9).

Physical intuition: - Like a ball rolling down a hill, maintains inertia from previous direction - Accelerates in consistent directions, dampens oscillations - Helps escape local minima and saddle points

4.2 Nesterov Accelerated Gradient (NAG)

Nesterov Accelerated Gradient:

$$ \begin{align} v_t &= \beta v_{t-1} + \nabla f(x_t - \beta v_{t-1}) \\ x_{t+1} &= x_t - \eta v_t \end{align} $$

Key idea: - "Look ahead" first ($x_t - \beta v_{t-1}$) and compute gradient there - Smarter look-ahead than momentum - Theoretically better convergence rate

import numpy as np
import matplotlib.pyplot as plt

# Rosenbrock ํ•จ์ˆ˜
def rosenbrock(x, y):
    return (1 - x)**2 + 100*(y - x**2)**2

def rosenbrock_gradient(x, y):
    df_dx = -2*(1 - x) - 400*x*(y - x**2)
    df_dy = 200*(y - x**2)
    return np.array([df_dx, df_dy])

# ๊ธฐ๋ณธ GD
def gd(x0, learning_rate, n_iterations):
    trajectory = [x0]
    x = x0.copy()

    for _ in range(n_iterations):
        grad = rosenbrock_gradient(x[0], x[1])
        x = x - learning_rate * grad
        trajectory.append(x.copy())

    return np.array(trajectory)

# ๋ชจ๋ฉ˜ํ…€ GD
def momentum_gd(x0, learning_rate, beta, n_iterations):
    trajectory = [x0]
    x = x0.copy()
    v = np.zeros_like(x)

    for _ in range(n_iterations):
        grad = rosenbrock_gradient(x[0], x[1])
        v = beta * v + grad
        x = x - learning_rate * v
        trajectory.append(x.copy())

    return np.array(trajectory)

# Nesterov GD
def nesterov_gd(x0, learning_rate, beta, n_iterations):
    trajectory = [x0]
    x = x0.copy()
    v = np.zeros_like(x)

    for _ in range(n_iterations):
        # Look-ahead
        x_lookahead = x - beta * v
        grad = rosenbrock_gradient(x_lookahead[0], x_lookahead[1])
        v = beta * v + grad
        x = x - learning_rate * v
        trajectory.append(x.copy())

    return np.array(trajectory)

# ์‹œ๊ฐํ™”
x0 = np.array([-1.0, 0.5])
learning_rate = 0.001
beta = 0.9
n_iterations = 200

trajectories = {
    'GD': gd(x0, learning_rate, n_iterations),
    'Momentum': momentum_gd(x0, learning_rate, beta, n_iterations),
    'Nesterov': nesterov_gd(x0, learning_rate, beta, n_iterations)
}

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# ๋“ฑ๊ณ ์„ 
x_vals = np.linspace(-1.5, 1.5, 300)
y_vals = np.linspace(-0.5, 1.5, 300)
X, Y = np.meshgrid(x_vals, y_vals)
Z = rosenbrock(X, Y)

# ์™ผ์ชฝ: ๊ถค์  ๋น„๊ต
ax = axes[0]
levels = np.logspace(-1, 3, 20)
contour = ax.contour(X, Y, Z, levels=levels, alpha=0.4, cmap='gray')

colors = {'GD': 'blue', 'Momentum': 'red', 'Nesterov': 'green'}
for name, traj in trajectories.items():
    ax.plot(traj[:, 0], traj[:, 1], '-', linewidth=2, color=colors[name],
            label=name, alpha=0.7)
    ax.plot(traj[0, 0], traj[0, 1], 'o', markersize=10, color=colors[name])

ax.plot(1, 1, 'k*', markersize=20, label='์ตœ์†Ÿ๊ฐ’ (1, 1)')
ax.set_xlabel('x', fontsize=12)
ax.set_ylabel('y', fontsize=12)
ax.set_title('Rosenbrock ํ•จ์ˆ˜: ๋ชจ๋ฉ˜ํ…€ vs Nesterov', fontsize=14)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

# ์˜ค๋ฅธ์ชฝ: ์†์‹ค ๊ณก์„ 
ax = axes[1]
for name, traj in trajectories.items():
    losses = [rosenbrock(x[0], x[1]) for x in traj]
    ax.semilogy(losses, linewidth=2, color=colors[name], label=name)

ax.set_xlabel('Iteration', fontsize=12)
ax.set_ylabel('Loss (log scale)', fontsize=12)
ax.set_title('์†์‹ค ๊ฐ์†Œ ๋น„๊ต', fontsize=14)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('momentum_vs_nesterov.png', dpi=150)
plt.show()

print("๋ชจ๋ฉ˜ํ…€์˜ ํšจ๊ณผ:")
print("  - ์ผ๊ด€๋œ ๋ฐฉํ–ฅ์œผ๋กœ ๊ฐ€์†")
print("  - ์ง„๋™ ๊ฐ์‡  (ํŠนํžˆ ill-conditioned ๋ฌธ์ œ)")
print("  - ์•ˆ์žฅ์  ๋” ๋น ๋ฅด๊ฒŒ ํ†ต๊ณผ")
print("\nNesterov์˜ ์žฅ์ :")
print("  - Look-ahead๋กœ ๋” ํ˜„๋ช…ํ•œ ์—…๋ฐ์ดํŠธ")
print("  - ์ด๋ก ์ ์œผ๋กœ ๋” ๋‚˜์€ ์ˆ˜๋ ด ์†๋„")

5. Adaptive Learning Rate Methods

5.1 AdaGrad

Adaptive Gradient Algorithm:

$$ \begin{align} G_t &= G_{t-1} + \nabla f(x_t) \odot \nabla f(x_t) \quad \text{(๋ˆ„์  ์ œ๊ณฑ)} \\ x_{t+1} &= x_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot \nabla f(x_t) \end{align} $$

  • $\odot$: element-wise multiplication (Hadamard product)
  • $\epsilon$: numerical stability ($10^{-8}$)

Characteristics: - Frequently updated parameters: learning rate decreases - Rarely updated parameters: learning rate maintained - Problem: $G_t$ keeps growing โ†’ learning rate becomes too small

5.2 RMSProp

Root Mean Square Propagation:

$$ \begin{align} G_t &= \beta G_{t-1} + (1-\beta) \nabla f(x_t) \odot \nabla f(x_t) \quad \text{(์ง€์ˆ˜ ์ด๋™ ํ‰๊ท )} \\ x_{t+1} &= x_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot \nabla f(x_t) \end{align} $$

AdaGrad improvement: - Exponential moving average (EMA) instead of accumulation - Decay of old gradient influence - Solves the too-small learning rate problem

5.3 Adam

Adaptive Moment Estimation:

$$ \begin{align} m_t &= \beta_1 m_{t-1} + (1-\beta_1) \nabla f(x_t) \quad \text{(1์ฐจ ๋ชจ๋ฉ˜ํŠธ, ํ‰๊ท )} \\ v_t &= \beta_2 v_{t-1} + (1-\beta_2) \nabla f(x_t) \odot \nabla f(x_t) \quad \text{(2์ฐจ ๋ชจ๋ฉ˜ํŠธ, ๋ถ„์‚ฐ)} \\ \hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \quad \text{(ํŽธํ–ฅ ๋ณด์ •)} \\ x_{t+1} &= x_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \odot \hat{m}_t \end{align} $$

Hyperparameters: - $\beta_1 = 0.9$ (first moment decay) - $\beta_2 = 0.999$ (second moment decay) - $\eta = 0.001$ (learning rate) - $\epsilon = 10^{-8}$

Characteristics: - Momentum + adaptive learning rate - Bias correction: removes bias in moment estimates at early stages - Works well for most deep learning problems

5.4 Implementation and Comparison

import numpy as np
import matplotlib.pyplot as plt

# ์ตœ์ ํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌํ˜„
class Optimizer:
    def __init__(self, learning_rate):
        self.lr = learning_rate

class SGD(Optimizer):
    def update(self, x, grad):
        return x - self.lr * grad

class Momentum(Optimizer):
    def __init__(self, learning_rate, beta=0.9):
        super().__init__(learning_rate)
        self.beta = beta
        self.v = None

    def update(self, x, grad):
        if self.v is None:
            self.v = np.zeros_like(x)
        self.v = self.beta * self.v + grad
        return x - self.lr * self.v

class AdaGrad(Optimizer):
    def __init__(self, learning_rate, epsilon=1e-8):
        super().__init__(learning_rate)
        self.epsilon = epsilon
        self.G = None

    def update(self, x, grad):
        if self.G is None:
            self.G = np.zeros_like(x)
        self.G += grad ** 2
        return x - self.lr * grad / (np.sqrt(self.G) + self.epsilon)

class RMSProp(Optimizer):
    def __init__(self, learning_rate, beta=0.9, epsilon=1e-8):
        super().__init__(learning_rate)
        self.beta = beta
        self.epsilon = epsilon
        self.G = None

    def update(self, x, grad):
        if self.G is None:
            self.G = np.zeros_like(x)
        self.G = self.beta * self.G + (1 - self.beta) * grad ** 2
        return x - self.lr * grad / (np.sqrt(self.G) + self.epsilon)

class Adam(Optimizer):
    def __init__(self, learning_rate, beta1=0.9, beta2=0.999, epsilon=1e-8):
        super().__init__(learning_rate)
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None
        self.v = None
        self.t = 0

    def update(self, x, grad):
        if self.m is None:
            self.m = np.zeros_like(x)
            self.v = np.zeros_like(x)

        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2

        # ํŽธํ–ฅ ๋ณด์ •
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)

        return x - self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)

# ํ…Œ์ŠคํŠธ ํ•จ์ˆ˜: Beale ํ•จ์ˆ˜
def beale(x, y):
    return (1.5 - x + x*y)**2 + (2.25 - x + x*y**2)**2 + (2.625 - x + x*y**3)**2

def beale_gradient(x, y):
    df_dx = 2*(1.5 - x + x*y)*(-1 + y) + 2*(2.25 - x + x*y**2)*(-1 + y**2) + \
            2*(2.625 - x + x*y**3)*(-1 + y**3)
    df_dy = 2*(1.5 - x + x*y)*x + 2*(2.25 - x + x*y**2)*(2*x*y) + \
            2*(2.625 - x + x*y**3)*(3*x*y**2)
    return np.array([df_dx, df_dy])

# ์ตœ์ ํ™” ์‹คํ–‰
x0 = np.array([3.0, 3.0])
n_iterations = 500

optimizers = {
    'SGD': SGD(0.001),
    'Momentum': Momentum(0.001, beta=0.9),
    'AdaGrad': AdaGrad(0.5),
    'RMSProp': RMSProp(0.01, beta=0.9),
    'Adam': Adam(0.01, beta1=0.9, beta2=0.999)
}

trajectories = {}
for name, opt in optimizers.items():
    x = x0.copy()
    traj = [x.copy()]

    for _ in range(n_iterations):
        grad = beale_gradient(x[0], x[1])
        x = opt.update(x, grad)
        traj.append(x.copy())

    trajectories[name] = np.array(traj)

# ์‹œ๊ฐํ™”
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# ๋“ฑ๊ณ ์„ 
x_vals = np.linspace(-1, 4, 300)
y_vals = np.linspace(-1, 4, 300)
X, Y = np.meshgrid(x_vals, y_vals)
Z = beale(X, Y)

# ์™ผ์ชฝ: ๊ถค์ 
ax = axes[0]
levels = np.logspace(0, 4, 20)
contour = ax.contour(X, Y, Z, levels=levels, alpha=0.3, cmap='gray')

colors = {'SGD': 'blue', 'Momentum': 'red', 'AdaGrad': 'green',
          'RMSProp': 'purple', 'Adam': 'orange'}

for name, traj in trajectories.items():
    ax.plot(traj[:, 0], traj[:, 1], '-', linewidth=2, color=colors[name],
            label=name, alpha=0.7)

ax.plot(3, 0.5, 'k*', markersize=20, label='์ตœ์†Ÿ๊ฐ’')
ax.set_xlabel('x', fontsize=12)
ax.set_ylabel('y', fontsize=12)
ax.set_title('Beale ํ•จ์ˆ˜: ์˜ตํ‹ฐ๋งˆ์ด์ € ๋น„๊ต', fontsize=14)
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)
ax.set_xlim(-1, 4)
ax.set_ylim(-1, 4)

# ์˜ค๋ฅธ์ชฝ: ์†์‹ค ๊ณก์„ 
ax = axes[1]
for name, traj in trajectories.items():
    losses = [beale(x[0], x[1]) for x in traj]
    ax.semilogy(losses, linewidth=2, color=colors[name], label=name)

ax.set_xlabel('Iteration', fontsize=12)
ax.set_ylabel('Loss (log scale)', fontsize=12)
ax.set_title('์†์‹ค ๊ฐ์†Œ ๋น„๊ต', fontsize=14)
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('optimizer_comparison.png', dpi=150)
plt.show()

print("์˜ตํ‹ฐ๋งˆ์ด์ € ํŠน์ง•:")
print("  SGD: ๋‹จ์ˆœ, ๋А๋ฆผ")
print("  Momentum: ๊ฐ€์†, ์ง„๋™ ๊ฐ์‡ ")
print("  AdaGrad: ํฌ์†Œ ๋ฐ์ดํ„ฐ ์ ํ•ฉ, ํ•™์Šต๋ฅ  ๊ฐ์†Œ ๋ฌธ์ œ")
print("  RMSProp: AdaGrad ๊ฐœ์„ , ์•ˆ์ •์ ")
print("  Adam: ๋Œ€๋ถ€๋ถ„ ์ƒํ™ฉ์—์„œ ์šฐ์ˆ˜, ์‚ฌ์‹ค์ƒ ํ‘œ์ค€")

5.5 Adam Bias Correction

In early stages, $m_t$ and $v_t$ are biased toward zero due to initialization. Bias correction fixes this.

$$ \mathbb{E}[m_t] = \mathbb{E}\left[\sum_{i=1}^t \beta_1^{t-i}(1-\beta_1)g_i\right] = \mathbb{E}[g_t](1 - \beta_1^t) $$

Therefore, $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$ is an unbiased estimator.

# ํŽธํ–ฅ ๋ณด์ • ํšจ๊ณผ ์‹œ๊ฐํ™”
fig, ax = plt.subplots(figsize=(12, 6))

beta1, beta2 = 0.9, 0.999
t_vals = np.arange(1, 101)

correction1 = 1 / (1 - beta1 ** t_vals)
correction2 = 1 / (1 - beta2 ** t_vals)

ax.plot(t_vals, correction1, linewidth=2, label=f'1์ฐจ ๋ชจ๋ฉ˜ํŠธ ๋ณด์ • (ฮฒโ‚={beta1})')
ax.plot(t_vals, correction2, linewidth=2, label=f'2์ฐจ ๋ชจ๋ฉ˜ํŠธ ๋ณด์ • (ฮฒโ‚‚={beta2})')
ax.axhline(y=1, color='black', linestyle='--', linewidth=1, label='๋ณด์ • ์—†์Œ')

ax.set_xlabel('Iteration t', fontsize=12)
ax.set_ylabel('๋ณด์ • ๊ณ„์ˆ˜', fontsize=12)
ax.set_title('Adam ํŽธํ–ฅ ๋ณด์ • ๊ณ„์ˆ˜', fontsize=14)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('adam_bias_correction.png', dpi=150)
plt.show()

print("ํŽธํ–ฅ ๋ณด์ •์˜ ์ค‘์š”์„ฑ:")
print("  - ์ดˆ๊ธฐ ์Šคํ…์—์„œ ๋ชจ๋ฉ˜ํŠธ๊ฐ€ 0์œผ๋กœ ํŽธํ–ฅ")
print("  - ๋ณด์ • ์—†์œผ๋ฉด ์ดˆ๊ธฐ ํ•™์Šต๋ฅ ์ด ๊ณผ๋„ํ•˜๊ฒŒ ์ž‘์Œ")
print("  - ์ˆ˜์‹ญ iteration ํ›„ ๋ณด์ • ํšจ๊ณผ ์‚ฌ๋ผ์ง")

6. Learning Rate Schedules

6.1 Major Scheduling Strategies

Step Decay: $$\eta_t = \eta_0 \cdot \gamma^{\lfloor t/k \rfloor}$$

Exponential Decay: $$\eta_t = \eta_0 \cdot e^{-\lambda t}$$

Cosine Annealing: $$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t}{T}\pi\right)\right)$$

1-Cycle Policy: - Early: learning rate increase (warm-up) - Middle: maintain maximum learning rate - Late: learning rate decrease (annealing)

import numpy as np
import matplotlib.pyplot as plt

def step_decay(t, eta0=0.1, gamma=0.5, k=50):
    return eta0 * (gamma ** (t // k))

def exponential_decay(t, eta0=0.1, lam=0.01):
    return eta0 * np.exp(-lam * t)

def cosine_annealing(t, eta_min=0.0, eta_max=0.1, T=100):
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + np.cos(np.pi * t / T))

def one_cycle(t, eta_min=0.001, eta_max=0.1, T=100, warmup_frac=0.3):
    if t < warmup_frac * T:
        # Warm-up
        return eta_min + (eta_max - eta_min) * t / (warmup_frac * T)
    else:
        # Annealing
        progress = (t - warmup_frac * T) / ((1 - warmup_frac) * T)
        return eta_min + 0.5 * (eta_max - eta_min) * (1 + np.cos(np.pi * progress))

# ์‹œ๊ฐํ™”
t_vals = np.arange(0, 200)
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

schedules = [
    ('Step Decay', step_decay, axes[0, 0]),
    ('Exponential Decay', exponential_decay, axes[0, 1]),
    ('Cosine Annealing', cosine_annealing, axes[1, 0]),
    ('1-Cycle Policy', one_cycle, axes[1, 1])
]

for name, func, ax in schedules:
    if name == 'Cosine Annealing':
        lr_vals = [func(t, T=200) for t in t_vals]
    elif name == '1-Cycle Policy':
        lr_vals = [func(t, T=200) for t in t_vals]
    else:
        lr_vals = [func(t) for t in t_vals]

    ax.plot(t_vals, lr_vals, linewidth=2, color='blue')
    ax.set_xlabel('Iteration', fontsize=12)
    ax.set_ylabel('Learning Rate', fontsize=12)
    ax.set_title(name, fontsize=14)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('learning_rate_schedules.png', dpi=150)
plt.show()

print("ํ•™์Šต๋ฅ  ์Šค์ผ€์ค„์˜ ์—ญํ• :")
print("  - ์ดˆ๊ธฐ: ํฐ ํ•™์Šต๋ฅ ๋กœ ๋น ๋ฅด๊ฒŒ ์ข‹์€ ์˜์—ญ ํƒ์ƒ‰")
print("  - ํ›„๋ฐ˜: ์ž‘์€ ํ•™์Šต๋ฅ ๋กœ ๋ฏธ์„ธ ์กฐ์ •")
print("  - Warm-up: ์ดˆ๊ธฐ ๋ถˆ์•ˆ์ •์„ฑ ๋ฐฉ์ง€")
print("  - Cosine/1-Cycle: ๋ถ€๋“œ๋Ÿฌ์šด ๊ฐ์†Œ, Transformer ๋“ฑ์—์„œ ์ธ๊ธฐ")

7. Practical Considerations in Neural Network Optimization

7.1 Geometry of Loss Landscapes

Neural network loss functions are: - Non-convex: multiple local minima - High-dimensional: millions to billions of parameters - Saddle Points: minimum in some directions, maximum in others - Plateaus: gradients nearly zero

7.2 Sharp Minima vs Flat Minima

Sharp Minima: - Narrow, sharp minimum - Poor generalization on test data - Often occurs with large batch sizes

Flat Minima: - Wide, flat minimum - Good generalization on test data - Preferred with small batch sizes + SGD noise

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# ์†์‹ค ์ง€ํ˜• ์‹œ๋ฎฌ๋ ˆ์ด์…˜
def sharp_minimum(x, y):
    """๋‚ ์นด๋กœ์šด ์ตœ์†Ÿ๊ฐ’"""
    return x**2 + y**2 + 0.01 * np.random.randn()

def flat_minimum(x, y):
    """ํ‰ํƒ„ํ•œ ์ตœ์†Ÿ๊ฐ’"""
    return 0.1 * x**2 + 0.1 * y**2 + 0.01 * np.random.randn()

fig = plt.figure(figsize=(16, 6))

x = np.linspace(-2, 2, 100)
y = np.linspace(-2, 2, 100)
X, Y = np.meshgrid(x, y)

# Sharp minimum
ax1 = fig.add_subplot(121, projection='3d')
Z_sharp = X**2 + Y**2
ax1.plot_surface(X, Y, Z_sharp, cmap='Reds', alpha=0.8, edgecolor='none')
ax1.set_title('Sharp Minimum (์ผ๋ฐ˜ํ™” ๋‚ฎ์Œ)', fontsize=14)
ax1.set_xlabel('wโ‚')
ax1.set_ylabel('wโ‚‚')
ax1.set_zlabel('Loss')

# Flat minimum
ax2 = fig.add_subplot(122, projection='3d')
Z_flat = 0.1 * X**2 + 0.1 * Y**2
ax2.plot_surface(X, Y, Z_flat, cmap='Blues', alpha=0.8, edgecolor='none')
ax2.set_title('Flat Minimum (์ผ๋ฐ˜ํ™” ๋†’์Œ)', fontsize=14)
ax2.set_xlabel('wโ‚')
ax2.set_ylabel('wโ‚‚')
ax2.set_zlabel('Loss')

plt.tight_layout()
plt.savefig('sharp_vs_flat_minima.png', dpi=150)
plt.show()

print("Sharp vs Flat Minima:")
print("  Sharp: ํŒŒ๋ผ๋ฏธํ„ฐ ์ž‘์€ ๋ณ€ํ™”์— ์†์‹ค ํฌ๊ฒŒ ๋ณ€ํ•จ โ†’ ๊ณผ์ ํ•ฉ")
print("  Flat: ํŒŒ๋ผ๋ฏธํ„ฐ ๋ณ€ํ™”์— ์†์‹ค ๋‘”๊ฐ โ†’ ์ผ๋ฐ˜ํ™”")
print("  SGD์˜ ๋…ธ์ด์ฆˆ๊ฐ€ Flat Minima ์„ ํ˜ธํ•˜๋„๋ก ์•”๋ฌต์  ์ •๊ทœํ™”")

7.3 Gradient Clipping

Prevents gradient explosion in RNN/Transformer:

Norm-based Clipping: $$ \tilde{g} = \begin{cases} g & \text{if } \|g\| \leq \theta \\ \theta \frac{g}{\|g\|} & \text{otherwise} \end{cases} $$

Value-based Clipping: $$ \tilde{g}_i = \max(-\theta, \min(\theta, g_i)) $$

import torch

def gradient_clipping_demo():
    # ๊ฐ„๋‹จํ•œ RNN ์‹œ๋ฎฌ๋ ˆ์ด์…˜
    torch.manual_seed(42)
    hidden_size = 10
    W = torch.randn(hidden_size, hidden_size, requires_grad=True) * 2  # ํฐ ๊ฐ€์ค‘์น˜

    # ์ˆœ๋ฐฉํ–ฅ (์—ฌ๋Ÿฌ ์‹œ๊ฐ„ ์Šคํ…)
    h = torch.randn(hidden_size)
    for _ in range(20):
        h = torch.tanh(W @ h)

    loss = h.sum()
    loss.backward()

    grad_norm = W.grad.norm().item()
    print(f"ํด๋ฆฌํ•‘ ์ „ ๊ทธ๋ž˜๋””์–ธํŠธ ๋…ธ๋ฆ„: {grad_norm:.4f}")

    # ๊ทธ๋ž˜๋””์–ธํŠธ ํด๋ฆฌํ•‘
    max_norm = 1.0
    torch.nn.utils.clip_grad_norm_([W], max_norm)

    clipped_grad_norm = W.grad.norm().item()
    print(f"ํด๋ฆฌํ•‘ ํ›„ ๊ทธ๋ž˜๋””์–ธํŠธ ๋…ธ๋ฆ„: {clipped_grad_norm:.4f}")

gradient_clipping_demo()

print("\n๊ทธ๋ž˜๋””์–ธํŠธ ํด๋ฆฌํ•‘:")
print("  - RNN/Transformer์—์„œ ํ•„์ˆ˜")
print("  - ์ผ๋ฐ˜์ ์œผ๋กœ max_norm = 1.0 ๋˜๋Š” 5.0")
print("  - ํ•™์Šต ์•ˆ์ •์„ฑ ํฌ๊ฒŒ ํ–ฅ์ƒ")

7.4 Practical Optimization Recipe

# PyTorch ์Šคํƒ€์ผ ์ตœ์ ํ™” ์„ค์ • ์˜ˆ์ œ
import torch
import torch.nn as nn
import torch.optim as optim

class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(10, 50),
            nn.ReLU(),
            nn.Linear(50, 1)
        )

    def forward(self, x):
        return self.layers(x)

model = SimpleNN()

# 1. ์˜ตํ‹ฐ๋งˆ์ด์ € ์„ ํƒ: Adam (๋Œ€๋ถ€๋ถ„์˜ ๊ฒฝ์šฐ)
optimizer = optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

# 2. ํ•™์Šต๋ฅ  ์Šค์ผ€์ค„๋Ÿฌ: Cosine Annealing
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# 3. ํ•™์Šต ๋ฃจํ”„ (๊ฐ€์ƒ)
for epoch in range(10):
    # Forward & Backward
    # loss.backward()

    # ๊ทธ๋ž˜๋””์–ธํŠธ ํด๋ฆฌํ•‘ (RNN/Transformer)
    # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # ์˜ตํ‹ฐ๋งˆ์ด์ € ์Šคํ…
    optimizer.step()
    optimizer.zero_grad()

    # ํ•™์Šต๋ฅ  ์—…๋ฐ์ดํŠธ
    scheduler.step()

    print(f"Epoch {epoch+1}, LR: {scheduler.get_last_lr()[0]:.6f}")

print("\n์‹ค์ „ ์ตœ์ ํ™” ํŒ:")
print("  1. Adam์„ ๊ธฐ๋ณธ์œผ๋กœ ์‹œ์ž‘ (lr=1e-3)")
print("  2. Cosine Annealing ๋˜๋Š” Step Decay ์ ์šฉ")
print("  3. Warm-up ์‚ฌ์šฉ (Transformer ๋“ฑ)")
print("  4. ๊ทธ๋ž˜๋””์–ธํŠธ ํด๋ฆฌํ•‘ (RNN/Transformer)")
print("  5. ๋ฐฐ์น˜ ํฌ๊ธฐ: 32-256 (GPU ๋ฉ”๋ชจ๋ฆฌ์— ๋”ฐ๋ผ)")
print("  6. Weight Decay (L2 ์ •๊ทœํ™”): 1e-4 ~ 1e-5")

Practice Problems

  1. Convergence Rate Analysis: For $f(x) = \frac{1}{2}x^T A x$ (quadratic form), simulate and compare convergence rates of gradient descent when condition number $\kappa = 10$ vs $\kappa = 100$. Verify if it matches the theoretical rate $O((1 - 1/\kappa)^t)$.

  2. SGD vs Batch GD: Train a simple 2-layer neural network on MNIST, comparing batch GD, mini-batch SGD (batch sizes 32, 128, 512), and full SGD (batch size 1). Compare convergence speed, final test accuracy, and computation time.

  3. Optimizer Implementation: Implement the Adam optimizer from scratch using NumPy and visualize the difference with and without bias correction. Test on Rosenbrock or Beale function.

  4. Momentum vs Nesterov: Compare convergence rates of momentum SGD and Nesterov accelerated gradient on an ill-conditioned quadratic function ($\kappa = 100$). Analyze situations where Nesterov is superior.

  5. Learning Rate Schedules: Train a simple CNN on CIFAR-10 and compare: (1) fixed learning rate, (2) Step Decay, (3) Cosine Annealing, (4) 1-Cycle Policy. Report learning curves and final test accuracy for each method.

References

  • Ruder, S. (2016). "An Overview of Gradient Descent Optimization Algorithms". arXiv:1609.04747
  • Comprehensive review of all major optimizers
  • Bottou, L., Curtis, F. E., & Nocedal, J. (2018). "Optimization Methods for Large-Scale Machine Learning". SIAM Review, 60(2), 223-311
  • Theory and practice of large-scale ML optimization
  • Kingma, D. P., & Ba, J. (2015). "Adam: A Method for Stochastic Optimization". ICLR
  • Original Adam paper
  • Sutskever, I., et al. (2013). "On the Importance of Initialization and Momentum in Deep Learning". ICML
  • Importance of momentum
  • Keskar, N. S., et al. (2017). "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima". ICLR
  • Sharp vs Flat Minima
  • Smith, L. N. (2017). "Cyclical Learning Rates for Training Neural Networks". WACV
  • 1-Cycle Policy
  • PyTorch Optimization Documentation: https://pytorch.org/docs/stable/optim.html
  • Stanford CS231n Lecture Notes: http://cs231n.github.io/neural-networks-3/
to navigate between lessons