05. Multivariate Calculus¶
Learning Objectives¶
- Understand the geometric meaning of partial derivatives and how to compute them
- Visualize and interpret the relationship between gradient vectors and contour lines
- Calculate directional derivatives and understand their relationship to gradients
- Apply the multivariate chain rule to differentiate complex composite functions
- Understand function approximation via Taylor expansion and the principle behind Newton's method
- Visualize loss landscapes in machine learning and connect them to optimization problems
1. Partial Derivatives¶
1.1 Definition of Partial Derivative¶
For a multivariate function $f(x_1, x_2, \ldots, x_n)$, the partial derivative with respect to $x_i$ is the rate of change when only $x_i$ varies while other variables remain constant:
$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_i + h, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}$$
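This limit is also easy to approximate numerically, which gives a quick way to check hand-computed partial derivatives. A minimal sketch (the helper `partial_derivative` and the test function are illustrative choices, not part of the text above):

```python
import numpy as np

def partial_derivative(f, x, i, h=1e-6):
    """Central-difference estimate of ∂f/∂x_i at the point x."""
    e = np.zeros_like(x, dtype=float)
    e[i] = h
    return (f(x + e) - f(x - e)) / (2 * h)

# Example: f(x, y) = x^2 + 3xy + y^2, so ∂f/∂x = 2x + 3y and ∂f/∂y = 3x + 2y
f_quad = lambda v: v[0]**2 + 3*v[0]*v[1] + v[1]**2
p0 = np.array([1.0, 2.0])
print(partial_derivative(f_quad, p0, 0))  # ≈ 2(1) + 3(2) = 8
print(partial_derivative(f_quad, p0, 1))  # ≈ 3(1) + 2(2) = 7
```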
1.2 Examples of Partial Derivative Computation¶
Example 1: $f(x, y) = x^2 + 3xy + y^2$
$$\frac{\partial f}{\partial x} = 2x + 3y$$
$$\frac{\partial f}{\partial y} = 3x + 2y$$
Example 2: $f(x, y) = e^{x^2 + y^2}$
$$\frac{\partial f}{\partial x} = 2x e^{x^2 + y^2}$$
$$\frac{\partial f}{\partial y} = 2y e^{x^2 + y^2}$$
1.3 Symbolic Differentiation with SymPy¶
import numpy as np
import sympy as sp
from sympy import symbols, diff, exp, sin, cos, simplify
# Define the symbols
x, y = symbols('x y')
# Define the function
f = x**2 + 3*x*y + y**2
# Partial derivatives
df_dx = diff(f, x)
df_dy = diff(f, y)
print("Function:", f)
print(f"∂f/∂x = {df_dx}")
print(f"∂f/∂y = {df_dy}")
# Evaluate at a specific point
point = {x: 1, y: 2}
print("\nAt the point (1, 2):")
print(f"  f(1,2) = {f.subs(point)}")
print(f"  ∂f/∂x(1,2) = {df_dx.subs(point)}")
print(f"  ∂f/∂y(1,2) = {df_dy.subs(point)}")
# A more complex function
g = exp(x**2 + y**2) * sin(x*y)
dg_dx = simplify(diff(g, x))
dg_dy = simplify(diff(g, y))
print(f"\nFunction: g(x,y) = {g}")
print(f"∂g/∂x = {dg_dx}")
print(f"∂g/∂y = {dg_dy}")
1.4 Geometric Meaning of Partial Derivatives¶
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Define the function: f(x, y) = x^2 + y^2
def f(x, y):
    return x**2 + y**2
# Create the grid
x_vals = np.linspace(-3, 3, 100)
y_vals = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x_vals, y_vals)
Z = f(X, Y)
# A specific point
x0, y0 = 1.0, 1.5
z0 = f(x0, y0)
fig = plt.figure(figsize=(16, 6))
# 3D surface
ax1 = fig.add_subplot(131, projection='3d')
ax1.plot_surface(X, Y, Z, cmap='viridis', alpha=0.7)
ax1.scatter([x0], [y0], [z0], c='red', s=100, label=f'point ({x0}, {y0})')
# Tangent curve in the x direction (y fixed)
x_line = np.linspace(x0-1, x0+1, 50)
y_line = np.full_like(x_line, y0)
z_line = f(x_line, y_line)
ax1.plot(x_line, y_line, z_line, 'r-', linewidth=3, label='y fixed (∂f/∂x)')
# Tangent curve in the y direction (x fixed)
x_line2 = np.full(50, x0)
y_line2 = np.linspace(y0-1, y0+1, 50)
z_line2 = f(x_line2, y_line2)
ax1.plot(x_line2, y_line2, z_line2, 'b-', linewidth=3, label='x fixed (∂f/∂y)')
ax1.set_title('Geometric meaning of partial derivatives', fontsize=12)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_zlabel('f(x,y)')
ax1.legend()
# Cross-section with x fixed
ax2 = fig.add_subplot(132)
y_slice = np.linspace(-3, 3, 100)
z_slice = f(x0, y_slice)
ax2.plot(y_slice, z_slice, 'b-', linewidth=2)
ax2.plot([y0], [z0], 'ro', markersize=10)
# Tangent line
slope_y = 2*y0  # ∂f/∂y = 2y
tangent_y = z0 + slope_y * (y_slice - y0)
ax2.plot(y_slice, tangent_y, 'r--', linewidth=2, label=f'tangent (slope={slope_y:.1f})')
ax2.set_title(f'Cross-section at x={x0}', fontsize=12)
ax2.set_xlabel('y')
ax2.set_ylabel('f(x,y)')
ax2.grid(True, alpha=0.3)
ax2.legend()
# Cross-section with y fixed
ax3 = fig.add_subplot(133)
x_slice = np.linspace(-3, 3, 100)
z_slice = f(x_slice, y0)
ax3.plot(x_slice, z_slice, 'r-', linewidth=2)
ax3.plot([x0], [z0], 'ro', markersize=10)
# Tangent line
slope_x = 2*x0  # ∂f/∂x = 2x
tangent_x = z0 + slope_x * (x_slice - x0)
ax3.plot(x_slice, tangent_x, 'b--', linewidth=2, label=f'tangent (slope={slope_x:.1f})')
ax3.set_title(f'Cross-section at y={y0}', fontsize=12)
ax3.set_xlabel('x')
ax3.set_ylabel('f(x,y)')
ax3.grid(True, alpha=0.3)
ax3.legend()
plt.tight_layout()
plt.savefig('partial_derivatives_geometry.png', dpi=150, bbox_inches='tight')
plt.close()
print("Saved partial-derivative geometry figure: partial_derivatives_geometry.png")
1.5 Higher-Order Partial Derivatives¶
Schwarz's theorem: if $f$ is of class $C^2$, the mixed partial derivatives are equal regardless of the order of differentiation:
$$\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x}$$
# Verify Schwarz's theorem
x, y = symbols('x y')
f = x**3 * y**2 + x * y**3
# Mixed partial derivatives
f_xy = diff(diff(f, x), y)
f_yx = diff(diff(f, y), x)
print("Function:", f)
print(f"∂²f/∂x∂y = {f_xy}")
print(f"∂²f/∂y∂x = {f_yx}")
print(f"Equal? {simplify(f_xy - f_yx) == 0}")
2. Gradient¶
2.1 Gradient Vector¶
The gradient is a vector containing all partial derivatives:
$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$
The gradient encodes two things:
1. Direction: the direction of steepest increase of the function
2. Magnitude: the rate of change in that direction
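Both claims can be verified numerically for the illustrative function $f(x, y) = x^2 + 2y^2$: sampling many unit directions, the largest rate of increase occurs along the gradient, and that maximal rate equals the gradient's magnitude. The evaluation point, step size, and direction grid below are arbitrary choices for this sketch:

```python
import numpy as np

f_bowl = lambda x, y: x**2 + 2*y**2
grad_bowl = lambda x, y: np.array([2*x, 4*y])   # ∇f = [2x, 4y]

px, py = 1.0, 0.5
g_pt = grad_bowl(px, py)

# Estimate the rate of change of f along many unit directions with a small step
step = 1e-6
angles_s = np.linspace(0, 2*np.pi, 3600, endpoint=False)
rates = np.array([(f_bowl(px + step*np.cos(a), py + step*np.sin(a))
                   - f_bowl(px, py)) / step for a in angles_s])

best_angle = angles_s[np.argmax(rates)]       # direction of steepest increase
grad_angle = np.arctan2(g_pt[1], g_pt[0])     # direction of ∇f
print(np.degrees(best_angle), np.degrees(grad_angle))   # nearly identical
print(rates.max(), np.linalg.norm(g_pt))                # nearly identical
```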
2.2 Contour Lines and Gradient¶
The gradient is perpendicular to contour lines (level sets).
# Visualize contours and the gradient field
def f(x, y):
    """Function: f(x, y) = x^2 + 2y^2"""
    return x**2 + 2*y**2
def grad_f(x, y):
    """Gradient: ∇f = [2x, 4y]"""
    return np.array([2*x, 4*y])
# Grid
x = np.linspace(-3, 3, 300)
y = np.linspace(-3, 3, 300)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
# Points at which to draw the gradient
n_points = 15
x_grad = np.linspace(-2.5, 2.5, n_points)
y_grad = np.linspace(-2.5, 2.5, n_points)
X_grad, Y_grad = np.meshgrid(x_grad, y_grad)
# Gradient at each point
U = 2 * X_grad
V = 4 * Y_grad
# Normalize by gradient magnitude (controls arrow length)
norm = np.sqrt(U**2 + V**2)
U_norm = U / (norm + 1e-8) * 0.3
V_norm = V / (norm + 1e-8) * 0.3
plt.figure(figsize=(10, 8))
# Contours
contours = plt.contour(X, Y, Z, levels=15, cmap='viridis', alpha=0.6)
plt.clabel(contours, inline=True, fontsize=8)
# Gradient vectors
plt.quiver(X_grad, Y_grad, U_norm, V_norm, norm, cmap='Reds',
scale=1, scale_units='xy', width=0.004)
plt.colorbar(label='gradient magnitude')
plt.title('Contours and gradient (gradient ⊥ contours)', fontsize=14)
plt.xlabel('x')
plt.ylabel('y')
plt.axis('equal')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('gradient_contour.png', dpi=150)
plt.close()
print("Saved contour-gradient figure: gradient_contour.png")
2.3 Gradient Descent Visualization¶
# Gradient descent path
def gradient_descent_2d(grad_f, x0, learning_rate=0.1, n_steps=50):
    """2D gradient descent"""
path = [x0]
x = x0.copy()
for _ in range(n_steps):
grad = grad_f(x[0], x[1])
x = x - learning_rate * grad
path.append(x.copy())
return np.array(path)
# Initial point
x0 = np.array([2.5, 2.0])
# Compute the path
path = gradient_descent_2d(grad_f, x0, learning_rate=0.15, n_steps=30)
# Visualization
plt.figure(figsize=(10, 8))
plt.contour(X, Y, Z, levels=20, cmap='viridis', alpha=0.6)
plt.plot(path[:, 0], path[:, 1], 'ro-', linewidth=2, markersize=6,
         label='gradient descent path')
plt.plot(path[0, 0], path[0, 1], 'go', markersize=12, label='start')
plt.plot(path[-1, 0], path[-1, 1], 'r*', markersize=20, label='end')
plt.plot(0, 0, 'b*', markersize=20, label='minimum')
plt.colorbar(label='f(x, y)')
plt.title('Gradient descent path', fontsize=14)
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.tight_layout()
plt.savefig('gradient_descent_path.png', dpi=150)
plt.close()
print("Saved gradient descent path figure: gradient_descent_path.png")
print(f"Start: {path[0]}, end: {path[-1]}, minimum: [0, 0]")
print(f"Final function value: {f(path[-1, 0], path[-1, 1]):.6f}")
2.4 Gradient Magnitude and Learning Rate¶
# Experiment with several learning rates
learning_rates = [0.05, 0.15, 0.3, 0.5]
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()
for idx, lr in enumerate(learning_rates):
path = gradient_descent_2d(grad_f, x0, learning_rate=lr, n_steps=30)
axes[idx].contour(X, Y, Z, levels=20, cmap='viridis', alpha=0.6)
axes[idx].plot(path[:, 0], path[:, 1], 'ro-', linewidth=2, markersize=4)
axes[idx].plot(path[0, 0], path[0, 1], 'go', markersize=12)
axes[idx].plot(path[-1, 0], path[-1, 1], 'r*', markersize=15)
axes[idx].plot(0, 0, 'b*', markersize=15)
    axes[idx].set_title(f'learning rate = {lr}\nfinal value: {f(path[-1,0], path[-1,1]):.4f}',
                        fontsize=11)
axes[idx].set_xlabel('x')
axes[idx].set_ylabel('y')
axes[idx].grid(True, alpha=0.3)
axes[idx].axis('equal')
plt.tight_layout()
plt.savefig('gradient_descent_learning_rates.png', dpi=150)
plt.close()
print("Saved learning-rate comparison figure: gradient_descent_learning_rates.png")
3. Directional Derivative¶
3.1 Definition of Directional Derivative¶
The directional derivative in the direction of unit vector $\mathbf{u}$:
$$D_\mathbf{u} f(\mathbf{x}) = \lim_{h \to 0} \frac{f(\mathbf{x} + h\mathbf{u}) - f(\mathbf{x})}{h}$$
Relationship to gradient:
$$D_\mathbf{u} f = \nabla f \cdot \mathbf{u}$$
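This identity is easy to sanity-check by comparing the difference quotient from the definition with the dot-product formula; the function and direction below are illustrative choices:

```python
import numpy as np

f_dd = lambda p: p[0]**2 + 2*p[1]**2               # f(x, y) = x^2 + 2y^2
grad_dd = lambda p: np.array([2*p[0], 4*p[1]])     # ∇f = [2x, 4y]

p_dd = np.array([1.5, 1.0])
u_dd = np.array([1.0, 1.0]) / np.sqrt(2)           # unit direction

h = 1e-6
limit_estimate = (f_dd(p_dd + h*u_dd) - f_dd(p_dd)) / h   # limit definition
dot_formula = grad_dd(p_dd) @ u_dd                        # ∇f · u
print(limit_estimate, dot_formula)   # both ≈ 7/√2 ≈ 4.9497
```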
3.2 Direction of Maximum Change¶
The directional derivative is maximum when $\mathbf{u}$ points in the direction of $\nabla f$:
$$\max_{\|\mathbf{u}\|=1} D_\mathbf{u} f = \|\nabla f\|$$
# Directional derivative visualization
point = np.array([1.5, 1.0])
grad_at_point = grad_f(point[0], point[1])
grad_norm = np.linalg.norm(grad_at_point)
# Sample directions
n_directions = 36
angles = np.linspace(0, 2*np.pi, n_directions)
directions = np.array([np.cos(angles), np.sin(angles)]).T
# Directional derivative along each direction
directional_derivatives = []
for u in directions:
Du_f = np.dot(grad_at_point, u)
directional_derivatives.append(Du_f)
directional_derivatives = np.array(directional_derivatives)
# Polar plot of the directional derivative by angle
fig = plt.figure(figsize=(14, 6))
ax1 = fig.add_subplot(121, projection='polar')
ax1.plot(angles, directional_derivatives, 'b-', linewidth=2)
ax1.fill(angles, directional_derivatives, alpha=0.3)
ax1.set_title('Directional derivative\n(by angle)', fontsize=12)
ax1.grid(True)
# 2D plot
ax2 = fig.add_subplot(122)
ax2.contour(X, Y, Z, levels=15, cmap='viridis', alpha=0.6)
ax2.plot(point[0], point[1], 'ro', markersize=12, label='evaluation point')
# Gradient direction (maximum increase)
grad_unit = grad_at_point / grad_norm
ax2.arrow(point[0], point[1], grad_unit[0]*0.5, grad_unit[1]*0.5,
          head_width=0.15, head_length=0.1, fc='red', ec='red',
          linewidth=2, label=f'gradient (max: {grad_norm:.2f})')
# A few other directions
sample_angles = [0, np.pi/4, np.pi/2, 3*np.pi/4]
colors = ['blue', 'green', 'orange', 'purple']
for angle, color in zip(sample_angles, colors):
u = np.array([np.cos(angle), np.sin(angle)])
Du = np.dot(grad_at_point, u)
ax2.arrow(point[0], point[1], u[0]*0.5, u[1]*0.5,
head_width=0.1, head_length=0.08, fc=color, ec=color,
alpha=0.6, linewidth=1.5)
ax2.set_title('Derivatives in various directions', fontsize=12)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.axis('equal')
plt.tight_layout()
plt.savefig('directional_derivative.png', dpi=150)
plt.close()
print("Saved directional derivative figure: directional_derivative.png")
print(f"Gradient at {point}: {grad_at_point}")
print(f"Gradient magnitude (maximum directional derivative): {grad_norm:.4f}")
4. Multivariate Chain Rule¶
4.1 Forms of Chain Rule¶
When $z = f(x, y)$ and $x = x(t)$, $y = y(t)$:
$$\frac{dz}{dt} = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt}$$
In general:
$$\frac{dz}{dt} = \nabla f \cdot \frac{d\mathbf{x}}{dt}$$
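As a numerical sanity check of this identity, take the circle parametrization $x(t) = \cos t$, $y(t) = \sin t$ with the illustrative choice $f(x, y) = x^2 - y$ (chosen so that $dz/dt$ is not identically zero):

```python
import numpy as np

# Illustrative choice: f(x, y) = x^2 - y
f_cr = lambda x, y: x**2 - y
grad_cr = lambda x, y: np.array([2*x, -1.0])   # ∇f = [2x, -1]

t0 = 0.7
xc, yc = np.cos(t0), np.sin(t0)
dxdt, dydt = -np.sin(t0), np.cos(t0)           # dx/dt, dy/dt

# Chain rule: dz/dt = ∇f · d(x, y)/dt
dz_chain = grad_cr(xc, yc) @ np.array([dxdt, dydt])

# Central-difference check on z(t) = f(x(t), y(t))
h = 1e-6
z_of_t = lambda t: f_cr(np.cos(t), np.sin(t))
dz_fd = (z_of_t(t0 + h) - z_of_t(t0 - h)) / (2*h)
print(dz_chain, dz_fd)  # both ≈ -2 cos(t) sin(t) - cos(t)
```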
4.2 Chain Rule Example¶
# Chain rule: z = f(x, y) = x^2 + y^2, x(t) = cos(t), y(t) = sin(t)
t = symbols('t')
x_t = sp.cos(t)
y_t = sp.sin(t)
# The composed function
z = x_t**2 + y_t**2
# Method 1: differentiate directly
dz_dt_direct = diff(z, t)
print("z(t) =", simplify(z))
print(f"dz/dt (direct) = {simplify(dz_dt_direct)}")
# Method 2: chain rule
x_sym, y_sym = symbols('x y')
f = x_sym**2 + y_sym**2
df_dx = diff(f, x_sym)
df_dy = diff(f, y_sym)
dx_dt = diff(x_t, t)
dy_dt = diff(y_t, t)
dz_dt_chain = df_dx.subs(x_sym, x_t).subs(y_sym, y_t) * dx_dt + \
df_dy.subs(x_sym, x_t).subs(y_sym, y_t) * dy_dt
print(f"dz/dt (chain rule) = {simplify(dz_dt_chain)}")
print(f"Equal? {simplify(dz_dt_direct - dz_dt_chain) == 0}")
4.3 Backpropagation and Chain Rule¶
Backpropagation in neural networks is the repeated application of the multivariate chain rule.
import torch
import torch.nn as nn
# A simple computation graph
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
# z = f(x, y) = x^2 + xy + y^2
z = x**2 + x*y + y**2
# Automatic differentiation
z.backward()
print("Computation graph: z = x² + xy + y²")
print(f"x = {x.item()}, y = {y.item()}")
print(f"z = {z.item()}")
print("\nAutomatic differentiation:")
print(f"∂z/∂x = {x.grad.item()}")
print(f"∂z/∂y = {y.grad.item()}")
# Manual computation
x_val, y_val = 2.0, 3.0
dz_dx_manual = 2*x_val + y_val  # ∂z/∂x = 2x + y
dz_dy_manual = x_val + 2*y_val  # ∂z/∂y = x + 2y
print("\nManual computation:")
print(f"∂z/∂x = 2x + y = {dz_dx_manual}")
print(f"∂z/∂y = x + 2y = {dz_dy_manual}")
5. Taylor Expansion¶
5.1 First-Order Taylor Expansion (Linear Approximation)¶
Near point $\mathbf{x}_0$:
$$f(\mathbf{x}_0 + \boldsymbol{\delta}) \approx f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^T \boldsymbol{\delta}$$
5.2 Second-Order Taylor Expansion (Quadratic Approximation)¶
Including Hessian $H$:
$$f(\mathbf{x}_0 + \boldsymbol{\delta}) \approx f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^T \boldsymbol{\delta} + \frac{1}{2} \boldsymbol{\delta}^T H(\mathbf{x}_0) \boldsymbol{\delta}$$
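The practical difference between the two approximations is the order of their error: as $\boldsymbol{\delta} \to 0$, the first-order error shrinks like $\|\boldsymbol{\delta}\|^2$ and the second-order error like $\|\boldsymbol{\delta}\|^3$. A minimal sketch with the illustrative function $f(x, y) = e^x \sin y$, whose gradient and Hessian are easy to write down exactly:

```python
import numpy as np

# Illustrative function with a simple exact gradient and Hessian
f_t = lambda x, y: np.exp(x) * np.sin(y)
x_e, y_e = 0.3, 0.8
f_e = f_t(x_e, y_e)
g_e = np.array([np.exp(x_e)*np.sin(y_e), np.exp(x_e)*np.cos(y_e)])     # ∇f
H_e = np.array([[np.exp(x_e)*np.sin(y_e),  np.exp(x_e)*np.cos(y_e)],
                [np.exp(x_e)*np.cos(y_e), -np.exp(x_e)*np.sin(y_e)]])  # Hessian

d0 = np.array([0.1, -0.2])
for scale in [1.0, 0.5, 0.25]:
    delta = scale * d0
    true_val = f_t(x_e + delta[0], y_e + delta[1])
    lin = f_e + g_e @ delta                      # first-order approximation
    quad = lin + 0.5 * delta @ H_e @ delta       # second-order approximation
    print(f"scale {scale}: linear error {abs(true_val - lin):.2e}, "
          f"quadratic error {abs(true_val - quad):.2e}")
```

Halving the step roughly quarters the linear error and cuts the quadratic error by about a factor of eight, consistent with the stated orders.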
5.3 Taylor Expansion Visualization¶
# Taylor expansion approximation
def f_example(x, y):
return np.exp(-(x**2 + y**2)) * np.sin(x) * np.cos(y)
def grad_f_example(x, y):
    """Numerical gradient via central differences"""
    h = 1e-5
    df_dx = (f_example(x+h, y) - f_example(x-h, y)) / (2*h)
    df_dy = (f_example(x, y+h) - f_example(x, y-h)) / (2*h)
    return np.array([df_dx, df_dy])
def hessian_f_example(x, y):
    """Numerical Hessian via forward differences on the gradient"""
    h = 1e-5
    H = np.zeros((2, 2))
    grad_base = grad_f_example(x, y)
    # H[0,0] = ∂²f/∂x²
    grad_x_plus = grad_f_example(x+h, y)
    H[0, 0] = (grad_x_plus[0] - grad_base[0]) / h
    # H[1,1] = ∂²f/∂y²
    grad_y_plus = grad_f_example(x, y+h)
    H[1, 1] = (grad_y_plus[1] - grad_base[1]) / h
    # H[0,1] = H[1,0] = ∂²f/∂x∂y
    H[0, 1] = (grad_x_plus[1] - grad_base[1]) / h
    H[1, 0] = H[0, 1]
    return H
# Expansion center
x0, y0 = 0.5, 0.5
f0 = f_example(x0, y0)
grad0 = grad_f_example(x0, y0)
H0 = hessian_f_example(x0, y0)
# Grid
x_range = np.linspace(-0.5, 1.5, 100)
y_range = np.linspace(-0.5, 1.5, 100)
X_grid, Y_grid = np.meshgrid(x_range, y_range)
# True function
Z_true = f_example(X_grid, Y_grid)
# First-order approximation
delta_x = X_grid - x0
delta_y = Y_grid - y0
Z_linear = f0 + grad0[0]*delta_x + grad0[1]*delta_y
# Second-order approximation
Z_quadratic = np.zeros_like(Z_true)
for i in range(len(y_range)):
for j in range(len(x_range)):
delta = np.array([delta_x[i, j], delta_y[i, j]])
Z_quadratic[i, j] = f0 + grad0 @ delta + 0.5 * delta @ H0 @ delta
# Visualization
fig = plt.figure(figsize=(18, 5))
titles = ['True function', 'First-order Taylor (linear)', 'Second-order Taylor (quadratic)']
Z_list = [Z_true, Z_linear, Z_quadratic]
for idx, (title, Z) in enumerate(zip(titles, Z_list)):
ax = fig.add_subplot(1, 3, idx+1, projection='3d')
ax.plot_surface(X_grid, Y_grid, Z, cmap='viridis', alpha=0.8)
    ax.scatter([x0], [y0], [f0], c='red', s=100, label='expansion center')
ax.set_title(title, fontsize=12)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('f(x,y)')
ax.legend()
plt.tight_layout()
plt.savefig('taylor_expansion.png', dpi=150)
plt.close()
# Approximation errors
error_linear = np.abs(Z_true - Z_linear).max()
error_quad = np.abs(Z_true - Z_quadratic).max()
print("Saved Taylor expansion figure: taylor_expansion.png")
print(f"Maximum first-order error: {error_linear:.6f}")
print(f"Maximum second-order error: {error_quad:.6f}")
5.4 Mathematical Foundation of Newton's Method¶
Newton's method is an optimization method that minimizes the second-order Taylor approximation at each step:
$$\mathbf{x}_{k+1} = \mathbf{x}_k - H^{-1}(\mathbf{x}_k) \nabla f(\mathbf{x}_k)$$
def newton_method_2d(f, grad_f, hess_f, x0, tol=1e-6, max_iter=20):
    """2D Newton's method"""
    path = [x0]
    x = x0.copy()
    for i in range(max_iter):
        grad = grad_f(x[0], x[1])
        hess = hess_f(x[0], x[1])
        # Newton step
        try:
            delta = np.linalg.solve(hess, -grad)
        except np.linalg.LinAlgError:
            print(f"Singular Hessian at iteration {i}")
            break
        x = x + delta
        path.append(x.copy())
        if np.linalg.norm(delta) < tol:
            print(f"Converged at iteration {i}")
            break
    return np.array(path)
# Test function: f(x, y) = (x-1)^2 + 2(y-2)^2
def f_newton(x, y):
return (x - 1)**2 + 2*(y - 2)**2
def grad_f_newton(x, y):
return np.array([2*(x - 1), 4*(y - 2)])
def hess_f_newton(x, y):
return np.array([[2, 0], [0, 4]])
# Newton's method vs gradient descent
x0_newton = np.array([0.0, 0.0])
path_newton = newton_method_2d(f_newton, grad_f_newton, hess_f_newton, x0_newton)
path_gd = gradient_descent_2d(grad_f_newton, x0_newton, learning_rate=0.2, n_steps=50)
# Visualization
x_plt = np.linspace(-0.5, 2.5, 200)
y_plt = np.linspace(-0.5, 3.5, 200)
X_plt, Y_plt = np.meshgrid(x_plt, y_plt)
Z_plt = f_newton(X_plt, Y_plt)
plt.figure(figsize=(10, 8))
plt.contour(X_plt, Y_plt, Z_plt, levels=25, cmap='viridis', alpha=0.6)
plt.plot(path_gd[:, 0], path_gd[:, 1], 'ro-', linewidth=2, markersize=5,
         label=f'gradient descent ({len(path_gd)} steps)')
plt.plot(path_newton[:, 0], path_newton[:, 1], 'bo-', linewidth=2, markersize=7,
         label=f"Newton's method ({len(path_newton)} steps)")
plt.plot(1, 2, 'g*', markersize=20, label='minimum')
plt.colorbar(label='f(x, y)')
plt.title("Newton's method vs gradient descent", fontsize=14)
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('newton_vs_gradient_descent.png', dpi=150)
plt.close()
print("Saved Newton vs GD figure: newton_vs_gradient_descent.png")
print(f"Newton steps: {len(path_newton)}, final: {path_newton[-1]}")
print(f"GD steps: {len(path_gd)}, final: {path_gd[-1]}")
6. Loss Landscape Visualization¶
6.1 Loss Function Landscape¶
# Anisotropic loss function (large condition number)
def loss_anisotropic(x, y):
    """Loss shaped like a long, narrow valley"""
    return 0.5 * x**2 + 10 * y**2
def grad_loss_anisotropic(x, y):
    return np.array([x, 20*y])
# Visualization
x_loss = np.linspace(-10, 10, 300)
y_loss = np.linspace(-3, 3, 300)
X_loss, Y_loss = np.meshgrid(x_loss, y_loss)
Z_loss = loss_anisotropic(X_loss, Y_loss)
fig = plt.figure(figsize=(16, 6))
# 3D surface
ax1 = fig.add_subplot(131, projection='3d')
ax1.plot_surface(X_loss, Y_loss, Z_loss, cmap='viridis', alpha=0.8,
vmin=0, vmax=50)
ax1.set_title('3D loss landscape\n(high condition number)', fontsize=12)
ax1.set_xlabel('$w_1$')
ax1.set_ylabel('$w_2$')
ax1.set_zlabel('Loss')
ax1.view_init(elev=25, azim=45)
# Contours
ax2 = fig.add_subplot(132)
contours = ax2.contour(X_loss, Y_loss, Z_loss, levels=30, cmap='viridis')
ax2.clabel(contours, inline=True, fontsize=8)
ax2.set_title('Contours (long, narrow valley)', fontsize=12)
ax2.set_xlabel('$w_1$')
ax2.set_ylabel('$w_2$')
ax2.axis('equal')
ax2.grid(True, alpha=0.3)
# Gradient descent path
ax3 = fig.add_subplot(133)
ax3.contour(X_loss, Y_loss, Z_loss, levels=30, cmap='viridis', alpha=0.6)
x0_loss = np.array([8.0, 2.5])
path_slow = gradient_descent_2d(grad_loss_anisotropic, x0_loss,
learning_rate=0.05, n_steps=100)
ax3.plot(path_slow[:, 0], path_slow[:, 1], 'r.-', linewidth=1.5,
         markersize=3, label='slow convergence')
ax3.plot(x0_loss[0], x0_loss[1], 'go', markersize=10, label='start')
ax3.plot(0, 0, 'b*', markersize=15, label='minimum')
ax3.set_title('Gradient descent path\n(zigzag)', fontsize=12)
ax3.set_xlabel('$w_1$')
ax3.set_ylabel('$w_2$')
ax3.legend()
ax3.grid(True, alpha=0.3)
ax3.axis('equal')
plt.tight_layout()
plt.savefig('loss_landscape_anisotropic.png', dpi=150)
plt.close()
print("Saved anisotropic loss landscape figure: loss_landscape_anisotropic.png")
6.2 Saddle Point¶
# Saddle-point function: f(x, y) = x^2 - y^2
def saddle_function(x, y):
return x**2 - y**2
def grad_saddle(x, y):
return np.array([2*x, -2*y])
# Visualization
x_saddle = np.linspace(-3, 3, 200)
y_saddle = np.linspace(-3, 3, 200)
X_saddle, Y_saddle = np.meshgrid(x_saddle, y_saddle)
Z_saddle = saddle_function(X_saddle, Y_saddle)
fig = plt.figure(figsize=(16, 6))
# 3D surface
ax1 = fig.add_subplot(131, projection='3d')
ax1.plot_surface(X_saddle, Y_saddle, Z_saddle, cmap='coolwarm', alpha=0.8)
ax1.scatter([0], [0], [0], c='red', s=100, label='saddle point')
ax1.set_title('Saddle function: $f(x,y) = x^2 - y^2$', fontsize=12)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_zlabel('f')
ax1.legend()
# Contours
ax2 = fig.add_subplot(132)
contours = ax2.contour(X_saddle, Y_saddle, Z_saddle, levels=30, cmap='coolwarm')
ax2.clabel(contours, inline=True, fontsize=8)
ax2.plot(0, 0, 'ro', markersize=10, label='saddle point (0, 0)')
ax2.set_title('Contours (hyperbolic)', fontsize=12)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.legend()
ax2.axis('equal')
ax2.grid(True, alpha=0.3)
# GD from several starting points
ax3 = fig.add_subplot(133)
ax3.contour(X_saddle, Y_saddle, Z_saddle, levels=30, cmap='coolwarm', alpha=0.6)
start_points = [np.array([2, 0.1]), np.array([0.1, 2]),
np.array([-2, -0.1]), np.array([-0.1, -2])]
colors = ['red', 'blue', 'green', 'orange']
for start, color in zip(start_points, colors):
path = gradient_descent_2d(grad_saddle, start, learning_rate=0.1, n_steps=30)
ax3.plot(path[:, 0], path[:, 1], 'o-', color=color, linewidth=1.5,
markersize=4, alpha=0.7)
ax3.plot(0, 0, 'r*', markersize=20, label='saddle point')
ax3.set_title('GD paths near the saddle point\n(unstable)', fontsize=12)
ax3.set_xlabel('x')
ax3.set_ylabel('y')
ax3.legend()
ax3.grid(True, alpha=0.3)
ax3.axis('equal')
plt.tight_layout()
plt.savefig('saddle_point.png', dpi=150)
plt.close()
print("Saved saddle point figure: saddle_point.png")
6.3 Condition Number and Convergence Rate¶
The condition number is the ratio of maximum to minimum eigenvalues of the Hessian:
$$\kappa(H) = \frac{\lambda_{\max}(H)}{\lambda_{\min}(H)}$$
A higher condition number makes gradient-based optimization converge more slowly and makes it harder to tune.
# Condition number comparison
def well_conditioned(x, y):
    """Condition number = 1"""
    return x**2 + y**2
def ill_conditioned(x, y):
    """Condition number = 100"""
    return x**2 + 100*y**2
def grad_well(x, y):
return np.array([2*x, 2*y])
def grad_ill(x, y):
return np.array([2*x, 200*y])
# Hessians and condition numbers
H_well = np.array([[2, 0], [0, 2]])
H_ill = np.array([[2, 0], [0, 200]])
eigs_well = np.linalg.eigvals(H_well)
eigs_ill = np.linalg.eigvals(H_ill)
cond_well = np.max(eigs_well) / np.min(eigs_well)
cond_ill = np.max(eigs_ill) / np.min(eigs_ill)
print(f"Well-conditioned condition number: {cond_well:.2f}")
print(f"Ill-conditioned condition number: {cond_ill:.2f}")
# Compare GD paths
x0_cond = np.array([3.0, 3.0])
path_well = gradient_descent_2d(grad_well, x0_cond, learning_rate=0.2, n_steps=30)
path_ill = gradient_descent_2d(grad_ill, x0_cond, learning_rate=0.005, n_steps=100)
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Well-conditioned
x_grid = np.linspace(-4, 4, 200)
y_grid = np.linspace(-4, 4, 200)
X_grid, Y_grid = np.meshgrid(x_grid, y_grid)
Z_well = well_conditioned(X_grid, Y_grid)
axes[0].contour(X_grid, Y_grid, Z_well, levels=20, cmap='viridis', alpha=0.6)
axes[0].plot(path_well[:, 0], path_well[:, 1], 'ro-', linewidth=2, markersize=4)
axes[0].set_title(f'Well-conditioned (κ={cond_well:.1f})\n{len(path_well)} steps',
                  fontsize=12)
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].axis('equal')
axes[0].grid(True, alpha=0.3)
# Ill-conditioned
Z_ill = ill_conditioned(X_grid, Y_grid)
axes[1].contour(X_grid, Y_grid, Z_ill, levels=30, cmap='viridis', alpha=0.6)
axes[1].plot(path_ill[:, 0], path_ill[:, 1], 'ro-', linewidth=1.5, markersize=3)
axes[1].set_title(f'Ill-conditioned (κ={cond_ill:.1f})\n{len(path_ill)} steps',
                  fontsize=12)
axes[1].set_xlabel('x')
axes[1].set_ylabel('y')
axes[1].axis('equal')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('condition_number.png', dpi=150)
plt.close()
print("Saved condition-number comparison figure: condition_number.png")
7. ML Applications¶
7.1 Convexity Analysis¶
A twice-differentiable convex function has a positive semi-definite Hessian everywhere.
# Check convexity of the MSE loss
from sklearn.datasets import make_regression
X_reg, y_reg = make_regression(n_samples=100, n_features=2, noise=10, random_state=42)
def mse_loss(w):
    """MSE loss: L(w) = (1/2n) ||y - Xw||^2"""
    pred = X_reg @ w
    return 0.5 * np.mean((y_reg - pred)**2)
def grad_mse(w):
    """Gradient: ∇L = -(1/n) X^T (y - Xw)"""
    residual = y_reg - X_reg @ w
    return -X_reg.T @ residual / len(y_reg)
def hess_mse(w):
    """Hessian: H = (1/n) X^T X"""
    return X_reg.T @ X_reg / len(y_reg)
# Eigenvalues of the Hessian (convexity check)
w_sample = np.random.randn(2)
H = hess_mse(w_sample)
eigenvalues = np.linalg.eigvalsh(H)
print("Convexity analysis of the MSE loss:")
print(f"Hessian eigenvalues: {eigenvalues}")
print(f"All nonnegative? {np.all(eigenvalues >= -1e-10)}")
print("→ MSE is convex (a global minimum is guaranteed)")
7.2 Gradient-Based Optimization Comparison¶
# Compare several optimization algorithms
from scipy.optimize import minimize
# Starting point
w0 = np.array([2.0, -1.5])
# Optimization algorithms
methods = ['BFGS', 'CG', 'Newton-CG', 'L-BFGS-B']
results = {}
for method in methods:
if method == 'Newton-CG':
result = minimize(mse_loss, w0, method=method, jac=grad_mse,
hess=hess_mse, options={'disp': False})
else:
result = minimize(mse_loss, w0, method=method, jac=grad_mse,
options={'disp': False})
results[method] = result
    print(f"\n{method}:")
    print(f"  optimal w: {result.x}")
    print(f"  final loss: {result.fun:.6f}")
    print(f"  iterations: {result.nit}")
    print(f"  function evaluations: {result.nfev}")
# Compare with the analytic solution
w_analytic = np.linalg.lstsq(X_reg, y_reg, rcond=None)[0]
print(f"\nAnalytic solution (normal equation): {w_analytic}")
Practice Problems¶
Problem 1: Critical Point Classification¶
For the function $f(x, y) = x^3 - 3xy^2$:
1. Compute the gradient and find the critical points
2. Compute the Hessian and classify each critical point (extremum/saddle point)
3. Verify with a 3D visualization
Problem 2: Constrained Optimization¶
Visualize contour lines and the constraint $x^2 + y^2 = 1$, and use Lagrange multipliers to optimize:
$$\min_{x,y} f(x, y) = x^2 + 2y^2 \quad \text{s.t.} \quad x^2 + y^2 = 1$$
Compare analytical and numerical solutions.
Problem 3: Momentum Gradient Descent¶
Implement gradient descent with momentum:
$$\mathbf{v}_{t+1} = \beta \mathbf{v}_t - \alpha \nabla f(\mathbf{x}_t)$$ $$\mathbf{x}_{t+1} = \mathbf{x}_t + \mathbf{v}_{t+1}$$
Compare performance with regular GD on a function with high condition number.
Problem 4: Normal Equation for Linear Regression¶
Derive the normal equation for MSE loss using Taylor expansion:
$$\mathbf{w}^* = (X^T X)^{-1} X^T \mathbf{y}$$
Hint: Solve $\nabla L = 0$ and verify that the Hessian is positive definite.
Problem 5: Adam Optimizer Implementation¶
Implement the Adam optimizer update rules and compare regular GD, Momentum, and Adam on an anisotropic loss function. Visualize convergence rate and paths.
References¶
Online Resources¶
- Multivariable Calculus - MIT OpenCourseWare
- Visual Calculus - Interactive visualization
- Gradient Descent Visualization - Distill.pub
Textbooks¶
- Stewart, Calculus: Early Transcendentals, Chapters 14-15
- Strang, Calculus, Volume 3 (Multivariable)
- Boyd & Vandenberghe, Convex Optimization, Appendix A
Papers and Tutorials¶
- Ruder, An Overview of Gradient Descent Optimization Algorithms (2016)
- Goodfellow et al., Deep Learning, Chapter 4 (Numerical Computation)
- Nocedal & Wright, Numerical Optimization, Chapter 2 (Fundamentals)