05. Multivariate Calculus

Learning Objectives

  • Compute partial derivatives and understand their geometric meaning
  • Visualize and interpret the relationship between gradient vectors and contour lines
  • Calculate directional derivatives and understand their relationship to gradients
  • Apply the multivariate chain rule to differentiate complex composite functions
  • Understand the principles of function approximation and Newton's method using Taylor expansion
  • Visualize loss landscapes in machine learning and connect them to optimization problems

1. Partial Derivatives

1.1 Definition of Partial Derivative

For a multivariate function $f(x_1, x_2, \ldots, x_n)$, the partial derivative with respect to $x_i$ is the rate of change when only $x_i$ varies while other variables remain constant:

$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_i + h, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}$$
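
As a quick numerical sanity check of this limit definition, the sketch below evaluates the forward-difference quotient in $x$ for shrinking $h$ and compares it with the analytic value, using $f(x, y) = x^2 + 3xy + y^2$ from Example 1 below; it is a minimal illustration only.

import numpy as np

def f(x, y):
    return x**2 + 3*x*y + y**2

# Forward-difference quotient from the limit definition, at (x, y) = (1, 2)
x0, y0 = 1.0, 2.0
for h in [1e-1, 1e-3, 1e-5]:
    approx = (f(x0 + h, y0) - f(x0, y0)) / h
    print(f"h = {h:.0e}: (f(x+h, y) - f(x, y)) / h = {approx:.6f}")

# Analytic value: ∂f/∂x = 2x + 3y = 8 at (1, 2)
print("analytic ∂f/∂x(1, 2) =", 2*x0 + 3*y0)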

1.2 Examples of Partial Derivative Computation

Example 1: $f(x, y) = x^2 + 3xy + y^2$

$$\frac{\partial f}{\partial x} = 2x + 3y$$

$$\frac{\partial f}{\partial y} = 3x + 2y$$

Example 2: $f(x, y) = e^{x^2 + y^2}$

$$\frac{\partial f}{\partial x} = 2x e^{x^2 + y^2}$$

$$\frac{\partial f}{\partial y} = 2y e^{x^2 + y^2}$$

1.3 Symbolic Differentiation with SymPy

import numpy as np
import sympy as sp
from sympy import symbols, diff, exp, sin, cos, simplify

# ์‹ฌ๋ณผ ์ •์˜
x, y = symbols('x y')

# ํ•จ์ˆ˜ ์ •์˜
f = x**2 + 3*x*y + y**2

# ํŽธ๋ฏธ๋ถ„
df_dx = diff(f, x)
df_dy = diff(f, y)

print("ํ•จ์ˆ˜:", f)
print(f"โˆ‚f/โˆ‚x = {df_dx}")
print(f"โˆ‚f/โˆ‚y = {df_dy}")

# ํŠน์ • ์ ์—์„œ ๊ณ„์‚ฐ
point = {x: 1, y: 2}
print(f"\n์  (1, 2)์—์„œ:")
print(f"  f(1,2) = {f.subs(point)}")
print(f"  โˆ‚f/โˆ‚x(1,2) = {df_dx.subs(point)}")
print(f"  โˆ‚f/โˆ‚y(1,2) = {df_dy.subs(point)}")

# ๋ณต์žกํ•œ ํ•จ์ˆ˜
g = exp(x**2 + y**2) * sin(x*y)
dg_dx = simplify(diff(g, x))
dg_dy = simplify(diff(g, y))

print(f"\nํ•จ์ˆ˜: g(x,y) = {g}")
print(f"โˆ‚g/โˆ‚x = {dg_dx}")
print(f"โˆ‚g/โˆ‚y = {dg_dy}")

1.4 Geometric Meaning of Partial Derivatives

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# ํ•จ์ˆ˜ ์ •์˜: f(x, y) = x^2 + y^2
def f(x, y):
    return x**2 + y**2

# ๊ฒฉ์ž ์ƒ์„ฑ
x_vals = np.linspace(-3, 3, 100)
y_vals = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x_vals, y_vals)
Z = f(X, Y)

# ํŠน์ • ์ 
x0, y0 = 1.0, 1.5
z0 = f(x0, y0)

fig = plt.figure(figsize=(16, 6))

# 3D ํ‘œ๋ฉด
ax1 = fig.add_subplot(131, projection='3d')
ax1.plot_surface(X, Y, Z, cmap='viridis', alpha=0.7)
ax1.scatter([x0], [y0], [z0], c='red', s=100, label=f'์  ({x0}, {y0})')

# x ๋ฐฉํ–ฅ ์ ‘์„  (y ๊ณ ์ •)
x_line = np.linspace(x0-1, x0+1, 50)
y_line = np.full_like(x_line, y0)
z_line = f(x_line, y_line)
ax1.plot(x_line, y_line, z_line, 'r-', linewidth=3, label='y ๊ณ ์ • (โˆ‚f/โˆ‚x)')

# y ๋ฐฉํ–ฅ ์ ‘์„  (x ๊ณ ์ •)
x_line2 = np.full(50, x0)
y_line2 = np.linspace(y0-1, y0+1, 50)
z_line2 = f(x_line2, y_line2)
ax1.plot(x_line2, y_line2, z_line2, 'b-', linewidth=3, label='x ๊ณ ์ • (โˆ‚f/โˆ‚y)')

ax1.set_title('ํŽธ๋ฏธ๋ถ„์˜ ๊ธฐํ•˜ํ•™์  ์˜๋ฏธ', fontsize=12)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_zlabel('f(x,y)')
ax1.legend()

# x ๊ณ ์ • ๋‹จ๋ฉด
ax2 = fig.add_subplot(132)
y_slice = np.linspace(-3, 3, 100)
z_slice = f(x0, y_slice)
ax2.plot(y_slice, z_slice, 'b-', linewidth=2)
ax2.plot([y0], [z0], 'ro', markersize=10)
# ์ ‘์„ 
slope_y = 2*y0  # โˆ‚f/โˆ‚y = 2y
tangent_y = z0 + slope_y * (y_slice - y0)
ax2.plot(y_slice, tangent_y, 'r--', linewidth=2, label=f'์ ‘์„  (๊ธฐ์šธ๊ธฐ={slope_y:.1f})')
ax2.set_title(f'x={x0} ๊ณ ์ • ๋‹จ๋ฉด', fontsize=12)
ax2.set_xlabel('y')
ax2.set_ylabel('f(x,y)')
ax2.grid(True, alpha=0.3)
ax2.legend()

# y ๊ณ ์ • ๋‹จ๋ฉด
ax3 = fig.add_subplot(133)
x_slice = np.linspace(-3, 3, 100)
z_slice = f(x_slice, y0)
ax3.plot(x_slice, z_slice, 'r-', linewidth=2)
ax3.plot([x0], [z0], 'ro', markersize=10)
# ์ ‘์„ 
slope_x = 2*x0  # โˆ‚f/โˆ‚x = 2x
tangent_x = z0 + slope_x * (x_slice - x0)
ax3.plot(x_slice, tangent_x, 'b--', linewidth=2, label=f'์ ‘์„  (๊ธฐ์šธ๊ธฐ={slope_x:.1f})')
ax3.set_title(f'y={y0} ๊ณ ์ • ๋‹จ๋ฉด', fontsize=12)
ax3.set_xlabel('x')
ax3.set_ylabel('f(x,y)')
ax3.grid(True, alpha=0.3)
ax3.legend()

plt.tight_layout()
plt.savefig('partial_derivatives_geometry.png', dpi=150, bbox_inches='tight')
plt.close()

print("ํŽธ๋ฏธ๋ถ„ ๊ธฐํ•˜ํ•™ ์‹œ๊ฐํ™” ์ €์žฅ: partial_derivatives_geometry.png")

1.5 Higher-Order Partial Derivatives

Schwarz's theorem: if $f$ is of class $C^2$, the order of differentiation in mixed partial derivatives does not matter:

$$\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x}$$

# ์Šˆ๋ฐ”๋ฅด์ธ  ์ •๋ฆฌ ๊ฒ€์ฆ
x, y = symbols('x y')
f = x**3 * y**2 + x * y**3

# ํ˜ผํ•ฉ ํŽธ๋ฏธ๋ถ„
f_xy = diff(diff(f, x), y)
f_yx = diff(diff(f, y), x)

print("ํ•จ์ˆ˜:", f)
print(f"โˆ‚ยฒf/โˆ‚xโˆ‚y = {f_xy}")
print(f"โˆ‚ยฒf/โˆ‚yโˆ‚x = {f_yx}")
print(f"๋™์ผํ•œ๊ฐ€? {simplify(f_xy - f_yx) == 0}")

2. Gradient

2.1 Gradient Vector

The gradient is a vector containing all partial derivatives:

$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$

The gradient indicates:

  1. Direction: the direction of steepest increase of the function
  2. Magnitude: the rate of change in that direction

2.2 Contour Lines and Gradient

The gradient is perpendicular to contour lines (level sets).
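
A small numerical check of this claim, using the same $f(x, y) = x^2 + 2y^2$ as the visualization below: parameterize one level curve, take its tangent vector, and confirm that the dot product with $\nabla f$ vanishes. The parameterization is an assumption chosen for this particular $f$; this is a minimal sketch only.

import numpy as np

# Level set of f(x, y) = x^2 + 2y^2 at level c, parameterized as
# (sqrt(c) cos t, sqrt(c/2) sin t); its tangent should be orthogonal to ∇f.
c = 4.0
t = np.linspace(0, 2*np.pi, 9)

x = np.sqrt(c) * np.cos(t)
y = np.sqrt(c / 2) * np.sin(t)

# Tangent vector of the parameterization (derivative with respect to t)
tx = -np.sqrt(c) * np.sin(t)
ty = np.sqrt(c / 2) * np.cos(t)

# Gradient ∇f = (2x, 4y) at the same points
gx, gy = 2*x, 4*y

# Dot products are zero (up to floating-point error) at every sample point
print(np.round(gx*tx + gy*ty, 12))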

# ๋“ฑ๊ณ ์„ ๊ณผ ๊ทธ๋ž˜๋””์–ธํŠธ ์‹œ๊ฐํ™”
def f(x, y):
    """ํ•จ์ˆ˜: f(x, y) = x^2 + 2y^2"""
    return x**2 + 2*y**2

def grad_f(x, y):
    """๊ทธ๋ž˜๋””์–ธํŠธ: โˆ‡f = [2x, 4y]"""
    return np.array([2*x, 4*y])

# ๊ฒฉ์ž
x = np.linspace(-3, 3, 300)
y = np.linspace(-3, 3, 300)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

# ๊ทธ๋ž˜๋””์–ธํŠธ ๊ณ„์‚ฐ ์ง€์ 
n_points = 15
x_grad = np.linspace(-2.5, 2.5, n_points)
y_grad = np.linspace(-2.5, 2.5, n_points)
X_grad, Y_grad = np.meshgrid(x_grad, y_grad)

# ๊ฐ ์ ์—์„œ ๊ทธ๋ž˜๋””์–ธํŠธ
U = 2 * X_grad
V = 4 * Y_grad

# ๊ทธ๋ž˜๋””์–ธํŠธ ํฌ๊ธฐ๋กœ ์ •๊ทœํ™” (ํ™”์‚ดํ‘œ ๊ธธ์ด ์กฐ์ •)
norm = np.sqrt(U**2 + V**2)
U_norm = U / (norm + 1e-8) * 0.3
V_norm = V / (norm + 1e-8) * 0.3

plt.figure(figsize=(10, 8))
# ๋“ฑ๊ณ ์„ 
contours = plt.contour(X, Y, Z, levels=15, cmap='viridis', alpha=0.6)
plt.clabel(contours, inline=True, fontsize=8)
# ๊ทธ๋ž˜๋””์–ธํŠธ ๋ฒกํ„ฐ
plt.quiver(X_grad, Y_grad, U_norm, V_norm, norm, cmap='Reds',
           scale=1, scale_units='xy', width=0.004)
plt.colorbar(label='๊ทธ๋ž˜๋””์–ธํŠธ ํฌ๊ธฐ')
plt.title('๋“ฑ๊ณ ์„ ๊ณผ ๊ทธ๋ž˜๋””์–ธํŠธ (๊ทธ๋ž˜๋””์–ธํŠธ โŠฅ ๋“ฑ๊ณ ์„ )', fontsize=14)
plt.xlabel('x')
plt.ylabel('y')
plt.axis('equal')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('gradient_contour.png', dpi=150)
plt.close()

print("๋“ฑ๊ณ ์„ -๊ทธ๋ž˜๋””์–ธํŠธ ์‹œ๊ฐํ™” ์ €์žฅ: gradient_contour.png")

2.3 Gradient Descent Visualization

# ๊ทธ๋ž˜๋””์–ธํŠธ ํ•˜๊ฐ•๋ฒ• ๊ฒฝ๋กœ
def gradient_descent_2d(grad_f, x0, learning_rate=0.1, n_steps=50):
    """2D ๊ทธ๋ž˜๋””์–ธํŠธ ํ•˜๊ฐ•๋ฒ•"""
    path = [x0]
    x = x0.copy()

    for _ in range(n_steps):
        grad = grad_f(x[0], x[1])
        x = x - learning_rate * grad
        path.append(x.copy())

    return np.array(path)

# ์ดˆ๊ธฐ์ 
x0 = np.array([2.5, 2.0])

# ๊ฒฝ๋กœ ๊ณ„์‚ฐ
path = gradient_descent_2d(grad_f, x0, learning_rate=0.15, n_steps=30)

# ์‹œ๊ฐํ™”
plt.figure(figsize=(10, 8))
plt.contour(X, Y, Z, levels=20, cmap='viridis', alpha=0.6)
plt.plot(path[:, 0], path[:, 1], 'ro-', linewidth=2, markersize=6,
         label='๊ทธ๋ž˜๋””์–ธํŠธ ํ•˜๊ฐ• ๊ฒฝ๋กœ')
plt.plot(path[0, 0], path[0, 1], 'go', markersize=12, label='์‹œ์ž‘์ ')
plt.plot(path[-1, 0], path[-1, 1], 'r*', markersize=20, label='์ข…๋ฃŒ์ ')
plt.plot(0, 0, 'b*', markersize=20, label='์ตœ์†Ÿ๊ฐ’')
plt.colorbar(label='f(x, y)')
plt.title('๊ทธ๋ž˜๋””์–ธํŠธ ํ•˜๊ฐ•๋ฒ• ๊ฒฝ๋กœ', fontsize=14)
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.tight_layout()
plt.savefig('gradient_descent_path.png', dpi=150)
plt.close()

print("๊ทธ๋ž˜๋””์–ธํŠธ ํ•˜๊ฐ•๋ฒ• ๊ฒฝ๋กœ ์ €์žฅ: gradient_descent_path.png")
print(f"์‹œ์ž‘: {path[0]}, ์ข…๋ฃŒ: {path[-1]}, ์ตœ์†Ÿ๊ฐ’: [0, 0]")
print(f"์ตœ์ข… ํ•จ์ˆ˜ ๊ฐ’: {f(path[-1, 0], path[-1, 1]):.6f}")

2.4 Gradient Magnitude and Learning Rate

# ๋‹ค์–‘ํ•œ ํ•™์Šต๋ฅ ๋กœ ์‹คํ—˜
learning_rates = [0.05, 0.15, 0.3, 0.5]

fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, lr in enumerate(learning_rates):
    path = gradient_descent_2d(grad_f, x0, learning_rate=lr, n_steps=30)

    axes[idx].contour(X, Y, Z, levels=20, cmap='viridis', alpha=0.6)
    axes[idx].plot(path[:, 0], path[:, 1], 'ro-', linewidth=2, markersize=4)
    axes[idx].plot(path[0, 0], path[0, 1], 'go', markersize=12)
    axes[idx].plot(path[-1, 0], path[-1, 1], 'r*', markersize=15)
    axes[idx].plot(0, 0, 'b*', markersize=15)
    axes[idx].set_title(f'ํ•™์Šต๋ฅ  = {lr}\n์ตœ์ข… ๊ฐ’: {f(path[-1,0], path[-1,1]):.4f}',
                        fontsize=11)
    axes[idx].set_xlabel('x')
    axes[idx].set_ylabel('y')
    axes[idx].grid(True, alpha=0.3)
    axes[idx].axis('equal')

plt.tight_layout()
plt.savefig('gradient_descent_learning_rates.png', dpi=150)
plt.close()

print("ํ•™์Šต๋ฅ  ๋น„๊ต ์‹œ๊ฐํ™” ์ €์žฅ: gradient_descent_learning_rates.png")

3. Directional Derivative

3.1 Definition of Directional Derivative

The directional derivative in the direction of unit vector $\mathbf{u}$:

$$D_\mathbf{u} f(\mathbf{x}) = \lim_{h \to 0} \frac{f(\mathbf{x} + h\mathbf{u}) - f(\mathbf{x})}{h}$$

Relationship to gradient:

$$D_\mathbf{u} f = \nabla f \cdot \mathbf{u}$$

3.2 Direction of Maximum Change

The directional derivative is maximum when $\mathbf{u}$ points in the direction of $\nabla f$:

$$\max_{\|\mathbf{u}\|=1} D_\mathbf{u} f = \|\nabla f\|$$
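
As a minimal sketch, the snippet below compares the finite-difference quotient from the limit definition with the dot-product formula $\nabla f \cdot \mathbf{u}$ for a few directions of $f(x, y) = x^2 + 2y^2$ at $(1.5, 1.0)$, and checks that the largest possible value matches $\|\nabla f\|$; the choice of function and point mirrors the visualization that follows.

import numpy as np

def f(x, y):
    return x**2 + 2*y**2

# Gradient of f at the evaluation point
p = np.array([1.5, 1.0])
grad = np.array([2*p[0], 4*p[1]])

# Finite-difference directional derivative vs the dot-product formula
h = 1e-6
for theta in [0.0, np.pi/4, np.pi/2]:
    u = np.array([np.cos(theta), np.sin(theta)])   # unit direction
    fd = (f(*(p + h*u)) - f(*p)) / h                # limit definition
    print(f"theta={theta:.2f}: finite diff={fd:.4f}, grad.u={grad @ u:.4f}")

# The maximum over all unit directions equals ‖∇f‖
print("‖∇f‖ =", np.linalg.norm(grad))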

# ๋ฐฉํ–ฅ ๋„ํ•จ์ˆ˜ ์‹œ๊ฐํ™”
point = np.array([1.5, 1.0])
grad_at_point = grad_f(point[0], point[1])
grad_norm = np.linalg.norm(grad_at_point)

# ๋‹ค์–‘ํ•œ ๋ฐฉํ–ฅ
n_directions = 36
angles = np.linspace(0, 2*np.pi, n_directions)
directions = np.array([np.cos(angles), np.sin(angles)]).T

# ๊ฐ ๋ฐฉํ–ฅ์œผ๋กœ ๋ฐฉํ–ฅ ๋„ํ•จ์ˆ˜ ๊ณ„์‚ฐ
directional_derivatives = []
for u in directions:
    Du_f = np.dot(grad_at_point, u)
    directional_derivatives.append(Du_f)

directional_derivatives = np.array(directional_derivatives)

# ๊ทน์ขŒํ‘œ ํ”Œ๋กฏ
fig = plt.figure(figsize=(14, 6))

# ๊ทน์ขŒํ‘œ ํ”Œ๋กฏ
ax1 = fig.add_subplot(121, projection='polar')
ax1.plot(angles, directional_derivatives, 'b-', linewidth=2)
ax1.fill(angles, directional_derivatives, alpha=0.3)
ax1.set_title('๋ฐฉํ–ฅ ๋„ํ•จ์ˆ˜ ํฌ๊ธฐ\n(๊ฐ๋„๋ณ„)', fontsize=12)
ax1.grid(True)

# 2D ํ”Œ๋กฏ
ax2 = fig.add_subplot(122)
ax2.contour(X, Y, Z, levels=15, cmap='viridis', alpha=0.6)
ax2.plot(point[0], point[1], 'ro', markersize=12, label='ํ‰๊ฐ€ ์ ')

# ๊ทธ๋ž˜๋””์–ธํŠธ ๋ฐฉํ–ฅ (์ตœ๋Œ€)
grad_unit = grad_at_point / grad_norm
ax2.arrow(point[0], point[1], grad_unit[0]*0.5, grad_unit[1]*0.5,
          head_width=0.15, head_length=0.1, fc='red', ec='red',
          linewidth=2, label=f'๊ทธ๋ž˜๋””์–ธํŠธ (์ตœ๋Œ€: {grad_norm:.2f})')

# ๋ช‡ ๊ฐ€์ง€ ๋‹ค๋ฅธ ๋ฐฉํ–ฅ
sample_angles = [0, np.pi/4, np.pi/2, 3*np.pi/4]
colors = ['blue', 'green', 'orange', 'purple']
for angle, color in zip(sample_angles, colors):
    u = np.array([np.cos(angle), np.sin(angle)])
    Du = np.dot(grad_at_point, u)
    ax2.arrow(point[0], point[1], u[0]*0.5, u[1]*0.5,
              head_width=0.1, head_length=0.08, fc=color, ec=color,
              alpha=0.6, linewidth=1.5)

ax2.set_title('๋‹ค์–‘ํ•œ ๋ฐฉํ–ฅ์˜ ๋„ํ•จ์ˆ˜', fontsize=12)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.axis('equal')

plt.tight_layout()
plt.savefig('directional_derivative.png', dpi=150)
plt.close()

print("๋ฐฉํ–ฅ ๋„ํ•จ์ˆ˜ ์‹œ๊ฐํ™” ์ €์žฅ: directional_derivative.png")
print(f"์  {point}์—์„œ ๊ทธ๋ž˜๋””์–ธํŠธ: {grad_at_point}")
print(f"๊ทธ๋ž˜๋””์–ธํŠธ ํฌ๊ธฐ (์ตœ๋Œ€ ๋ฐฉํ–ฅ ๋„ํ•จ์ˆ˜): {grad_norm:.4f}")

4. Multivariate Chain Rule

4.1 Forms of Chain Rule

When $z = f(x, y)$ and $x = x(t)$, $y = y(t)$:

$$\frac{dz}{dt} = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt}$$

In general:

$$\frac{dz}{dt} = \nabla f \cdot \frac{d\mathbf{x}}{dt}$$

4.2 Chain Rule Example

# ์ฒด์ธ ๋ฃฐ: z = f(x, y) = x^2 + y^2, x(t) = cos(t), y(t) = sin(t)
t = symbols('t')
x_t = sp.cos(t)
y_t = sp.sin(t)

# ํ•จ์ˆ˜
z = x_t**2 + y_t**2

# ๋ฐฉ๋ฒ• 1: ์ง์ ‘ ๋ฏธ๋ถ„
dz_dt_direct = diff(z, t)
print("z(t) =", simplify(z))
print(f"dz/dt (์ง์ ‘) = {simplify(dz_dt_direct)}")

# ๋ฐฉ๋ฒ• 2: ์ฒด์ธ ๋ฃฐ
x_sym, y_sym = symbols('x y')
f = x_sym**2 + y_sym**2
df_dx = diff(f, x_sym)
df_dy = diff(f, y_sym)
dx_dt = diff(x_t, t)
dy_dt = diff(y_t, t)

dz_dt_chain = df_dx.subs(x_sym, x_t).subs(y_sym, y_t) * dx_dt + \
              df_dy.subs(x_sym, x_t).subs(y_sym, y_t) * dy_dt

print(f"dz/dt (์ฒด์ธ ๋ฃฐ) = {simplify(dz_dt_chain)}")
print(f"๋™์ผํ•œ๊ฐ€? {simplify(dz_dt_direct - dz_dt_chain) == 0}")

4.3 Backpropagation and Chain Rule

Backpropagation in neural networks is the repeated application of the multivariate chain rule.

import torch

# A simple computation graph
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

# z = f(x, y) = x^2 + xy + y^2
z = x**2 + x*y + y**2

# Automatic differentiation
z.backward()

print("Computation graph: z = x² + xy + y²")
print(f"x = {x.item()}, y = {y.item()}")
print(f"z = {z.item()}")
print(f"\nAutomatic differentiation:")
print(f"∂z/∂x = {x.grad.item()}")
print(f"∂z/∂y = {y.grad.item()}")

# Manual calculation
x_val, y_val = 2.0, 3.0
dz_dx_manual = 2*x_val + y_val  # ∂z/∂x = 2x + y
dz_dy_manual = x_val + 2*y_val  # ∂z/∂y = x + 2y

print(f"\nManual calculation:")
print(f"∂z/∂x = 2x + y = {dz_dx_manual}")
print(f"∂z/∂y = x + 2y = {dz_dy_manual}")

5. Taylor Expansion

5.1 First-Order Taylor Expansion (Linear Approximation)

Near point $\mathbf{x}_0$:

$$f(\mathbf{x}_0 + \boldsymbol{\delta}) \approx f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^T \boldsymbol{\delta}$$

5.2 Second-Order Taylor Expansion (Quadratic Approximation)

Including Hessian $H$:

$$f(\mathbf{x}_0 + \boldsymbol{\delta}) \approx f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^T \boldsymbol{\delta} + \frac{1}{2} \boldsymbol{\delta}^T H(\mathbf{x}_0) \boldsymbol{\delta}$$

5.3 Taylor Expansion Visualization

# ํ…Œ์ผ๋Ÿฌ ์ „๊ฐœ ๊ทผ์‚ฌ
def f_example(x, y):
    return np.exp(-(x**2 + y**2)) * np.sin(x) * np.cos(y)

def grad_f_example(x, y):
    """์ˆ˜์น˜ ๊ทธ๋ž˜๋””์–ธํŠธ"""
    h = 1e-5
    df_dx = (f_example(x+h, y) - f_example(x-h, y)) / (2*h)
    df_dy = (f_example(x, y+h) - f_example(x, y-h)) / (2*h)
    return np.array([df_dx, df_dy])

def hessian_f_example(x, y):
    """์ˆ˜์น˜ ํ—ค์‹œ์•ˆ"""
    h = 1e-5
    H = np.zeros((2, 2))
    grad_base = grad_f_example(x, y)

    # H[0,0] = โˆ‚ยฒf/โˆ‚xยฒ
    grad_x_plus = grad_f_example(x+h, y)
    H[0, 0] = (grad_x_plus[0] - grad_base[0]) / h

    # H[1,1] = โˆ‚ยฒf/โˆ‚yยฒ
    grad_y_plus = grad_f_example(x, y+h)
    H[1, 1] = (grad_y_plus[1] - grad_base[1]) / h

    # H[0,1] = H[1,0] = โˆ‚ยฒf/โˆ‚xโˆ‚y
    H[0, 1] = (grad_x_plus[1] - grad_base[1]) / h
    H[1, 0] = H[0, 1]

    return H

# ์ „๊ฐœ ์ค‘์‹ฌ์ 
x0, y0 = 0.5, 0.5
f0 = f_example(x0, y0)
grad0 = grad_f_example(x0, y0)
H0 = hessian_f_example(x0, y0)

# ๊ฒฉ์ž
x_range = np.linspace(-0.5, 1.5, 100)
y_range = np.linspace(-0.5, 1.5, 100)
X_grid, Y_grid = np.meshgrid(x_range, y_range)

# ์› ํ•จ์ˆ˜
Z_true = f_example(X_grid, Y_grid)

# 1์ฐจ ๊ทผ์‚ฌ
delta_x = X_grid - x0
delta_y = Y_grid - y0
Z_linear = f0 + grad0[0]*delta_x + grad0[1]*delta_y

# 2์ฐจ ๊ทผ์‚ฌ
Z_quadratic = np.zeros_like(Z_true)
for i in range(len(y_range)):
    for j in range(len(x_range)):
        delta = np.array([delta_x[i, j], delta_y[i, j]])
        Z_quadratic[i, j] = f0 + grad0 @ delta + 0.5 * delta @ H0 @ delta

# ์‹œ๊ฐํ™”
fig = plt.figure(figsize=(18, 5))

titles = ['์› ํ•จ์ˆ˜', '1์ฐจ ํ…Œ์ผ๋Ÿฌ ๊ทผ์‚ฌ (์„ ํ˜•)', '2์ฐจ ํ…Œ์ผ๋Ÿฌ ๊ทผ์‚ฌ (์ด์ฐจ)']
Z_list = [Z_true, Z_linear, Z_quadratic]

for idx, (title, Z) in enumerate(zip(titles, Z_list)):
    ax = fig.add_subplot(1, 3, idx+1, projection='3d')
    ax.plot_surface(X_grid, Y_grid, Z, cmap='viridis', alpha=0.8)
    ax.scatter([x0], [y0], [f0], c='red', s=100, label='์ „๊ฐœ ์ค‘์‹ฌ')
    ax.set_title(title, fontsize=12)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_zlabel('f(x,y)')
    ax.legend()

plt.tight_layout()
plt.savefig('taylor_expansion.png', dpi=150)
plt.close()

# ์˜ค์ฐจ ๊ณ„์‚ฐ
error_linear = np.abs(Z_true - Z_linear).max()
error_quad = np.abs(Z_true - Z_quadratic).max()

print("ํ…Œ์ผ๋Ÿฌ ์ „๊ฐœ ์‹œ๊ฐํ™” ์ €์žฅ: taylor_expansion.png")
print(f"1์ฐจ ๊ทผ์‚ฌ ์ตœ๋Œ€ ์˜ค์ฐจ: {error_linear:.6f}")
print(f"2์ฐจ ๊ทผ์‚ฌ ์ตœ๋Œ€ ์˜ค์ฐจ: {error_quad:.6f}")

5.4 Mathematical Foundation of Newton's Method

Newton's method is optimization using second-order Taylor expansion:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - H^{-1}(\mathbf{x}_k) \nabla f(\mathbf{x}_k)$$
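
This update can be read off from the second-order model of Section 5.2: approximate $f$ around $\mathbf{x}_k$ by

$$f(\mathbf{x}_k + \boldsymbol{\delta}) \approx f(\mathbf{x}_k) + \nabla f(\mathbf{x}_k)^T \boldsymbol{\delta} + \frac{1}{2} \boldsymbol{\delta}^T H(\mathbf{x}_k) \boldsymbol{\delta}$$

and set the gradient with respect to $\boldsymbol{\delta}$ to zero, giving $\nabla f(\mathbf{x}_k) + H(\mathbf{x}_k)\boldsymbol{\delta} = 0$, i.e. $\boldsymbol{\delta} = -H^{-1}(\mathbf{x}_k)\nabla f(\mathbf{x}_k)$, which is exactly the step applied below.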

def newton_method_2d(f, grad_f, hess_f, x0, tol=1e-6, max_iter=20):
    """Newton's method in 2D"""
    path = [x0]
    x = x0.copy()

    for i in range(max_iter):
        grad = grad_f(x[0], x[1])
        hess = hess_f(x[0], x[1])

        # Newton step
        try:
            delta = np.linalg.solve(hess, -grad)
        except np.linalg.LinAlgError:
            print(f"Singular Hessian at iteration {i}")
            break

        x = x + delta
        path.append(x.copy())

        if np.linalg.norm(delta) < tol:
            print(f"Converged at iteration {i}")
            break

    return np.array(path)

# Test function: f(x, y) = (x-1)^2 + 2(y-2)^2
def f_newton(x, y):
    return (x - 1)**2 + 2*(y - 2)**2

def grad_f_newton(x, y):
    return np.array([2*(x - 1), 4*(y - 2)])

def hess_f_newton(x, y):
    return np.array([[2, 0], [0, 4]])

# Newton's method vs gradient descent
x0_newton = np.array([0.0, 0.0])
path_newton = newton_method_2d(f_newton, grad_f_newton, hess_f_newton, x0_newton)
path_gd = gradient_descent_2d(grad_f_newton, x0_newton, learning_rate=0.2, n_steps=50)

# Visualization
x_plt = np.linspace(-0.5, 2.5, 200)
y_plt = np.linspace(-0.5, 3.5, 200)
X_plt, Y_plt = np.meshgrid(x_plt, y_plt)
Z_plt = f_newton(X_plt, Y_plt)

plt.figure(figsize=(10, 8))
plt.contour(X_plt, Y_plt, Z_plt, levels=25, cmap='viridis', alpha=0.6)
plt.plot(path_gd[:, 0], path_gd[:, 1], 'ro-', linewidth=2, markersize=5,
         label=f'Gradient descent ({len(path_gd)} steps)')
plt.plot(path_newton[:, 0], path_newton[:, 1], 'bo-', linewidth=2, markersize=7,
         label=f"Newton's method ({len(path_newton)} steps)")
plt.plot(1, 2, 'g*', markersize=20, label='Minimum')
plt.colorbar(label='f(x, y)')
plt.title("Newton's method vs gradient descent", fontsize=14)
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('newton_vs_gradient_descent.png', dpi=150)
plt.close()

print("Saved Newton vs GD figure: newton_vs_gradient_descent.png")
print(f"Newton steps: {len(path_newton)}, final: {path_newton[-1]}")
print(f"GD steps: {len(path_gd)}, final: {path_gd[-1]}")

6. Loss Landscape Visualization

6.1 Loss Function Landscape

# ๋น„๋“ฑ๋ฐฉ ์†์‹ค ํ•จ์ˆ˜ (์กฐ๊ฑด์ˆ˜๊ฐ€ ํฐ ๊ฒฝ์šฐ)
def loss_anisotropic(x, y):
    """๊ธธ์ญ‰ํ•œ ๊ณจ์งœ๊ธฐ ํ˜•ํƒœ์˜ ์†์‹ค ํ•จ์ˆ˜"""
    return 0.5 * x**2 + 10 * y**2

def grad_loss_anisotropic(x, y):
    return np.array([x, 20*y])

# ์‹œ๊ฐํ™”
x_loss = np.linspace(-10, 10, 300)
y_loss = np.linspace(-3, 3, 300)
X_loss, Y_loss = np.meshgrid(x_loss, y_loss)
Z_loss = loss_anisotropic(X_loss, Y_loss)

fig = plt.figure(figsize=(16, 6))

# 3D ํ‘œ๋ฉด
ax1 = fig.add_subplot(131, projection='3d')
ax1.plot_surface(X_loss, Y_loss, Z_loss, cmap='viridis', alpha=0.8,
                 vmin=0, vmax=50)
ax1.set_title('3D ์†์‹ค ์ง€ํ˜•\n(๋†’์€ ์กฐ๊ฑด์ˆ˜)', fontsize=12)
ax1.set_xlabel('$w_1$')
ax1.set_ylabel('$w_2$')
ax1.set_zlabel('Loss')
ax1.view_init(elev=25, azim=45)

# ๋“ฑ๊ณ ์„ 
ax2 = fig.add_subplot(132)
contours = ax2.contour(X_loss, Y_loss, Z_loss, levels=30, cmap='viridis')
ax2.clabel(contours, inline=True, fontsize=8)
ax2.set_title('๋“ฑ๊ณ ์„  (๊ธธ์ญ‰ํ•œ ๊ณจ์งœ๊ธฐ)', fontsize=12)
ax2.set_xlabel('$w_1$')
ax2.set_ylabel('$w_2$')
ax2.axis('equal')
ax2.grid(True, alpha=0.3)

# ๊ทธ๋ž˜๋””์–ธํŠธ ํ•˜๊ฐ• ๊ฒฝ๋กœ
ax3 = fig.add_subplot(133)
ax3.contour(X_loss, Y_loss, Z_loss, levels=30, cmap='viridis', alpha=0.6)

x0_loss = np.array([8.0, 2.5])
path_slow = gradient_descent_2d(grad_loss_anisotropic, x0_loss,
                                 learning_rate=0.05, n_steps=100)
ax3.plot(path_slow[:, 0], path_slow[:, 1], 'r.-', linewidth=1.5,
         markersize=3, label='๋А๋ฆฐ ์ˆ˜๋ ด')
ax3.plot(x0_loss[0], x0_loss[1], 'go', markersize=10, label='์‹œ์ž‘')
ax3.plot(0, 0, 'b*', markersize=15, label='์ตœ์†Ÿ๊ฐ’')
ax3.set_title('๊ทธ๋ž˜๋””์–ธํŠธ ํ•˜๊ฐ• ๊ฒฝ๋กœ\n(์ง€๊ทธ์žฌ๊ทธ)', fontsize=12)
ax3.set_xlabel('$w_1$')
ax3.set_ylabel('$w_2$')
ax3.legend()
ax3.grid(True, alpha=0.3)
ax3.axis('equal')

plt.tight_layout()
plt.savefig('loss_landscape_anisotropic.png', dpi=150)
plt.close()

print("๋น„๋“ฑ๋ฐฉ ์†์‹ค ์ง€ํ˜• ์‹œ๊ฐํ™” ์ €์žฅ: loss_landscape_anisotropic.png")

6.2 Saddle Point

# ์•ˆ์žฅ์  ํ•จ์ˆ˜: f(x, y) = x^2 - y^2
def saddle_function(x, y):
    return x**2 - y**2

def grad_saddle(x, y):
    return np.array([2*x, -2*y])

# ์‹œ๊ฐํ™”
x_saddle = np.linspace(-3, 3, 200)
y_saddle = np.linspace(-3, 3, 200)
X_saddle, Y_saddle = np.meshgrid(x_saddle, y_saddle)
Z_saddle = saddle_function(X_saddle, Y_saddle)

fig = plt.figure(figsize=(16, 6))

# 3D ํ‘œ๋ฉด
ax1 = fig.add_subplot(131, projection='3d')
ax1.plot_surface(X_saddle, Y_saddle, Z_saddle, cmap='coolwarm', alpha=0.8)
ax1.scatter([0], [0], [0], c='red', s=100, label='์•ˆ์žฅ์ ')
ax1.set_title('์•ˆ์žฅ์  ํ•จ์ˆ˜: $f(x,y) = x^2 - y^2$', fontsize=12)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_zlabel('f')
ax1.legend()

# ๋“ฑ๊ณ ์„ 
ax2 = fig.add_subplot(132)
contours = ax2.contour(X_saddle, Y_saddle, Z_saddle, levels=30, cmap='coolwarm')
ax2.clabel(contours, inline=True, fontsize=8)
ax2.plot(0, 0, 'ro', markersize=10, label='์•ˆ์žฅ์  (0, 0)')
ax2.set_title('๋“ฑ๊ณ ์„  (์Œ๊ณก์„  ํ˜•ํƒœ)', fontsize=12)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.legend()
ax2.axis('equal')
ax2.grid(True, alpha=0.3)

# ๋‹ค์–‘ํ•œ ์‹œ์ž‘์ ์—์„œ GD
ax3 = fig.add_subplot(133)
ax3.contour(X_saddle, Y_saddle, Z_saddle, levels=30, cmap='coolwarm', alpha=0.6)

start_points = [np.array([2, 0.1]), np.array([0.1, 2]),
                np.array([-2, -0.1]), np.array([-0.1, -2])]
colors = ['red', 'blue', 'green', 'orange']

for start, color in zip(start_points, colors):
    path = gradient_descent_2d(grad_saddle, start, learning_rate=0.1, n_steps=30)
    ax3.plot(path[:, 0], path[:, 1], 'o-', color=color, linewidth=1.5,
             markersize=4, alpha=0.7)

ax3.plot(0, 0, 'r*', markersize=20, label='์•ˆ์žฅ์ ')
ax3.set_title('์•ˆ์žฅ์  ๊ทผ์ฒ˜ GD ๊ฒฝ๋กœ\n(๋ถˆ์•ˆ์ •)', fontsize=12)
ax3.set_xlabel('x')
ax3.set_ylabel('y')
ax3.legend()
ax3.grid(True, alpha=0.3)
ax3.axis('equal')

plt.tight_layout()
plt.savefig('saddle_point.png', dpi=150)
plt.close()

print("์•ˆ์žฅ์  ์‹œ๊ฐํ™” ์ €์žฅ: saddle_point.png")

6.3 Condition Number and Convergence Rate

The condition number is the ratio of maximum to minimum eigenvalues of the Hessian:

$$\kappa(H) = \frac{\lambda_{\max}(H)}{\lambda_{\min}(H)}$$

A higher condition number makes gradient-based optimization harder: steps zigzag across the narrow valley and convergence slows.

# ์กฐ๊ฑด์ˆ˜ ๋น„๊ต
def well_conditioned(x, y):
    """์กฐ๊ฑด์ˆ˜ = 1"""
    return x**2 + y**2

def ill_conditioned(x, y):
    """์กฐ๊ฑด์ˆ˜ = 100"""
    return x**2 + 100*y**2

def grad_well(x, y):
    return np.array([2*x, 2*y])

def grad_ill(x, y):
    return np.array([2*x, 200*y])

# ํ—ค์‹œ์•ˆ๊ณผ ์กฐ๊ฑด์ˆ˜
H_well = np.array([[2, 0], [0, 2]])
H_ill = np.array([[2, 0], [0, 200]])

eigs_well = np.linalg.eigvals(H_well)
eigs_ill = np.linalg.eigvals(H_ill)

cond_well = np.max(eigs_well) / np.min(eigs_well)
cond_ill = np.max(eigs_ill) / np.min(eigs_ill)

print(f"Well-conditioned ์กฐ๊ฑด์ˆ˜: {cond_well:.2f}")
print(f"Ill-conditioned ์กฐ๊ฑด์ˆ˜:  {cond_ill:.2f}")

# GD ๊ฒฝ๋กœ ๋น„๊ต
x0_cond = np.array([3.0, 3.0])
path_well = gradient_descent_2d(grad_well, x0_cond, learning_rate=0.2, n_steps=30)
path_ill = gradient_descent_2d(grad_ill, x0_cond, learning_rate=0.005, n_steps=100)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Well-conditioned
x_grid = np.linspace(-4, 4, 200)
y_grid = np.linspace(-4, 4, 200)
X_grid, Y_grid = np.meshgrid(x_grid, y_grid)
Z_well = well_conditioned(X_grid, Y_grid)

axes[0].contour(X_grid, Y_grid, Z_well, levels=20, cmap='viridis', alpha=0.6)
axes[0].plot(path_well[:, 0], path_well[:, 1], 'ro-', linewidth=2, markersize=4)
axes[0].set_title(f'Well-conditioned (ฮบ={cond_well:.1f})\n{len(path_well)} ๋‹จ๊ณ„',
                  fontsize=12)
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].axis('equal')
axes[0].grid(True, alpha=0.3)

# Ill-conditioned
Z_ill = ill_conditioned(X_grid, Y_grid)
axes[1].contour(X_grid, Y_grid, Z_ill, levels=30, cmap='viridis', alpha=0.6)
axes[1].plot(path_ill[:, 0], path_ill[:, 1], 'ro-', linewidth=1.5, markersize=3)
axes[1].set_title(f'Ill-conditioned (ฮบ={cond_ill:.1f})\n{len(path_ill)} ๋‹จ๊ณ„',
                  fontsize=12)
axes[1].set_xlabel('x')
axes[1].set_ylabel('y')
axes[1].axis('equal')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('condition_number.png', dpi=150)
plt.close()

print("์กฐ๊ฑด์ˆ˜ ๋น„๊ต ์‹œ๊ฐํ™” ์ €์žฅ: condition_number.png")

7. ML Applications

7.1 Convexity Analysis

A twice-differentiable function is convex if and only if its Hessian is positive semi-definite everywhere.
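
For the MSE loss analyzed below, whose Hessian is $H = \frac{1}{n} X^T X$, positive semi-definiteness follows directly for any vector $\mathbf{v}$:

$$\mathbf{v}^T H \mathbf{v} = \frac{1}{n} \mathbf{v}^T X^T X \mathbf{v} = \frac{1}{n} \|X\mathbf{v}\|^2 \geq 0$$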

# Check convexity of the MSE loss
from sklearn.datasets import make_regression

X_reg, y_reg = make_regression(n_samples=100, n_features=2, noise=10, random_state=42)

def mse_loss(w):
    """MSE loss: L(w) = (1/2n) ||y - Xw||^2"""
    pred = X_reg @ w
    return 0.5 * np.mean((y_reg - pred)**2)

def grad_mse(w):
    """Gradient: ∇L = -(1/n) X^T (y - Xw)"""
    residual = y_reg - X_reg @ w
    return -X_reg.T @ residual / len(y_reg)

def hess_mse(w):
    """Hessian: H = (1/n) X^T X"""
    return X_reg.T @ X_reg / len(y_reg)

# Eigenvalues of the Hessian (convexity check)
w_sample = np.random.randn(2)
H = hess_mse(w_sample)
eigenvalues = np.linalg.eigvalsh(H)

print("Convexity analysis of the MSE loss:")
print(f"Eigenvalues of the Hessian: {eigenvalues}")
print(f"All non-negative? {np.all(eigenvalues >= -1e-10)}")
print("→ MSE is a convex function (global minimum guaranteed)")

7.2 Gradient-Based Optimization Comparison

# ๋‹ค์–‘ํ•œ ์ตœ์ ํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋น„๊ต
from scipy.optimize import minimize

# ์‹œ์ž‘์ 
w0 = np.array([2.0, -1.5])

# ์ตœ์ ํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜
methods = ['BFGS', 'CG', 'Newton-CG', 'L-BFGS-B']
results = {}

for method in methods:
    if method == 'Newton-CG':
        result = minimize(mse_loss, w0, method=method, jac=grad_mse,
                          hess=hess_mse, options={'disp': False})
    else:
        result = minimize(mse_loss, w0, method=method, jac=grad_mse,
                          options={'disp': False})

    results[method] = result

    print(f"\n{method}:")
    print(f"  ์ตœ์  w: {result.x}")
    print(f"  ์ตœ์ข… ์†์‹ค: {result.fun:.6f}")
    print(f"  ๋ฐ˜๋ณต ํšŸ์ˆ˜: {result.nit}")
    print(f"  ํ•จ์ˆ˜ ํ‰๊ฐ€: {result.nfev}")

# ํ•ด์„ํ•ด์™€ ๋น„๊ต
w_analytic = np.linalg.lstsq(X_reg, y_reg, rcond=None)[0]
print(f"\nํ•ด์„ํ•ด (Normal Equation): {w_analytic}")

Practice Problems

Problem 1: Critical Point Classification

For the function $f(x, y) = x^3 - 3xy^2$:

  1. Compute the gradient and find the critical points
  2. Compute the Hessian and classify each critical point (extremum or saddle point)
  3. Visualize in 3D to verify

Problem 2: Constrained Optimization

Visualize contour lines and the constraint $x^2 + y^2 = 1$, and use Lagrange multipliers to optimize:

$$\min_{x,y} f(x, y) = x^2 + 2y^2 \quad \text{s.t.} \quad x^2 + y^2 = 1$$

Compare analytical and numerical solutions.

Problem 3: Momentum Gradient Descent

Implement gradient descent with momentum:

$$\mathbf{v}_{t+1} = \beta \mathbf{v}_t - \alpha \nabla f(\mathbf{x}_t)$$

$$\mathbf{x}_{t+1} = \mathbf{x}_t + \mathbf{v}_{t+1}$$

Compare performance with regular GD on a function with high condition number.

Problem 4: Normal Equation for Linear Regression

Derive the normal equation for MSE loss using Taylor expansion:

$$\mathbf{w}^* = (X^T X)^{-1} X^T \mathbf{y}$$

Hint: Solve $\nabla L = 0$ and verify that the Hessian is positive definite.

Problem 5: Adam Optimizer Implementation

Implement the Adam optimizer update rules and compare regular GD, Momentum, and Adam on an anisotropic loss function. Visualize convergence rate and paths.

References

Textbooks

  • Stewart, Calculus: Early Transcendentals, Chapters 14-15
  • Strang, Calculus, Volume 3 (Multivariable)
  • Boyd & Vandenberghe, Convex Optimization, Appendix A

Papers and Tutorials

  • Ruder, An Overview of Gradient Descent Optimization Algorithms (2016)
  • Goodfellow et al., Deep Learning, Chapter 4 (Numerical Computation)
  • Nocedal & Wright, Numerical Optimization, Chapter 2 (Fundamentals)