03. ํ–‰๋ ฌ ๋ฏธ๋ถ„ (Matrix Calculus)

ํ•™์Šต ๋ชฉํ‘œ

  • ์Šค์นผ๋ผ-๋ฒกํ„ฐ ๋ฏธ๋ถ„๊ณผ ๊ทธ๋ž˜๋””์–ธํŠธ์˜ ๊ฐœ๋…์„ ์ดํ•ดํ•˜๊ณ  ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ๋‹ค
  • ์•ผ์ฝ”๋น„์•ˆ ํ–‰๋ ฌ์˜ ์ •์˜์™€ ์ฒด์ธ ๋ฃฐ ์ ์šฉ ๋ฐฉ๋ฒ•์„ ํ•™์Šตํ•œ๋‹ค
  • ํ—ค์‹œ์•ˆ ํ–‰๋ ฌ์˜ ์˜๋ฏธ์™€ ์ตœ์ ํ™”์—์„œ์˜ ์—ญํ• ์„ ์ดํ•ดํ•œ๋‹ค
  • ์ฃผ์š” ํ–‰๋ ฌ ๋ฏธ๋ถ„ ํ•ญ๋“ฑ์‹์„ ์œ ๋„ํ•˜๊ณ  ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค
  • ๋จธ์‹ ๋Ÿฌ๋‹์˜ ์†์‹ค ํ•จ์ˆ˜ ๊ทธ๋ž˜๋””์–ธํŠธ๋ฅผ ์ง์ ‘ ์œ ๋„ํ•  ์ˆ˜ ์žˆ๋‹ค
  • PyTorch์˜ ์ž๋™ ๋ฏธ๋ถ„ ๊ธฐ๋Šฅ์„ ์ดํ•ดํ•˜๊ณ  ๊ฒ€์ฆ์— ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค

1. ์Šค์นผ๋ผ-๋ฒกํ„ฐ ๋ฏธ๋ถ„ (Scalar-by-Vector Derivatives)

1.1 ๊ทธ๋ž˜๋””์–ธํŠธ์˜ ์ •์˜

์Šค์นผ๋ผ ํ•จ์ˆ˜ $f: \mathbb{R}^n \to \mathbb{R}$์— ๋Œ€ํ•ด ๊ทธ๋ž˜๋””์–ธํŠธ๋Š” ๋ชจ๋“  ํŽธ๋ฏธ๋ถ„์„ ๋ชจ์•„๋†“์€ ๋ฒกํ„ฐ์ž…๋‹ˆ๋‹ค:

$$\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$

๊ทธ๋ž˜๋””์–ธํŠธ๋Š” ํ•จ์ˆ˜๊ฐ€ ๊ฐ€์žฅ ๊ฐ€ํŒŒ๋ฅด๊ฒŒ ์ฆ๊ฐ€ํ•˜๋Š” ๋ฐฉํ–ฅ์„ ๊ฐ€๋ฆฌํ‚ต๋‹ˆ๋‹ค.
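์ด ์„ฑ์งˆ(๊ทธ๋ž˜๋””์–ธํŠธ = ์ตœ๋Œ€ ์ฆ๊ฐ€ ๋ฐฉํ–ฅ)์€ ๊ฐ„๋‹จํ•œ ์ˆ˜์น˜ ์‹คํ—˜์œผ๋กœ ํ™•์ธํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ์ž„์˜๋กœ ๊ณ ๋ฅธ ์˜ˆ์‹œ ํ•จ์ˆ˜์™€ ์ดˆ๊ธฐ์ (์„ค๋ช…์šฉ ๊ฐ€์ •)์„ ์‚ฌ์šฉํ•œ ์Šค์ผ€์น˜๋กœ, ๊ทธ๋ž˜๋””์–ธํŠธ ๋ฐฉํ–ฅ์˜ ๋ฐฉํ–ฅ ๋„ํ•จ์ˆ˜๊ฐ€ ๋ฌด์ž‘์œ„ ๋‹จ์œ„ ๋ฐฉํ–ฅ๋“ค๋ณด๋‹ค ํฌ๊ฑฐ๋‚˜ ๊ฐ™์Œ์„ ๋ณด์ž…๋‹ˆ๋‹ค.

```python
# ์Šค์ผ€์น˜: ๊ทธ๋ž˜๋””์–ธํŠธ ๋ฐฉํ–ฅ์ด ๊ฐ€์žฅ ๊ฐ€ํŒŒ๋ฅธ ์ฆ๊ฐ€ ๋ฐฉํ–ฅ์ธ์ง€ ์ˆ˜์น˜์ ์œผ๋กœ ํ™•์ธ
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # ์ž„์˜๋กœ ๊ณ ๋ฅธ ์˜ˆ์‹œ ํ•จ์ˆ˜: f(x) = x1^2 + 3*x2^2 + x1*x2
    return x[0]**2 + 3*x[1]**2 + x[0]*x[1]

def grad_f(x):
    # ์† ๊ณ„์‚ฐ ๊ทธ๋ž˜๋””์–ธํŠธ: [2*x1 + x2, 6*x2 + x1]
    return np.array([2*x[0] + x[1], 6*x[1] + x[0]])

x0 = np.array([1.0, -2.0])
g = grad_f(x0)
g_unit = g / np.linalg.norm(g)

eps = 1e-6
# ๊ทธ๋ž˜๋””์–ธํŠธ ๋ฐฉํ–ฅ์˜ ๋ฐฉํ–ฅ ๋„ํ•จ์ˆ˜ (||โˆ‡f||์™€ ๊ฐ™์•„์•ผ ํ•จ)
d_grad = (f(x0 + eps*g_unit) - f(x0)) / eps

# ๋ฌด์ž‘์œ„ ๋‹จ์œ„ ๋ฐฉํ–ฅ 1000๊ฐœ์˜ ๋ฐฉํ–ฅ ๋„ํ•จ์ˆ˜ ์ค‘ ์ตœ๋Œ€๊ฐ’๊ณผ ๋น„๊ต
best = -np.inf
for _ in range(1000):
    u = rng.standard_normal(2)
    u /= np.linalg.norm(u)
    best = max(best, (f(x0 + eps*u) - f(x0)) / eps)

print(f"๊ทธ๋ž˜๋””์–ธํŠธ ๋ฐฉํ–ฅ ๋„ํ•จ์ˆ˜: {d_grad:.6f}")
print(f"๋ฌด์ž‘์œ„ ๋ฐฉํ–ฅ ์ตœ๋Œ€๊ฐ’:     {best:.6f}")  # ๊ทธ๋ž˜๋””์–ธํŠธ ๋ฐฉํ–ฅ์„ ๋„˜์ง€ ๋ชปํ•จ
```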

1.2 ๊ธฐ๋ณธ ์˜ˆ์ œ

์˜ˆ์ œ 1: $f(\mathbf{x}) = \mathbf{a}^T \mathbf{x}$

$$\frac{\partial}{\partial \mathbf{x}}(\mathbf{a}^T \mathbf{x}) = \mathbf{a}$$

์˜ˆ์ œ 2: $f(\mathbf{x}) = \mathbf{x}^T \mathbf{x}$

$$\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^T \mathbf{x}) = 2\mathbf{x}$$

์˜ˆ์ œ 3: $f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$ (์ด์ฐจ ํ˜•์‹)

$$\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^T A \mathbf{x}) = (A + A^T)\mathbf{x}$$

$A$๊ฐ€ ๋Œ€์นญ์ด๋ฉด $2A\mathbf{x}$๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.

1.3 Python ๊ตฌํ˜„: ์ด์ฐจ ํ˜•์‹์˜ ๋ฏธ๋ถ„

import numpy as np
import torch
from sympy import symbols, Matrix, diff, simplify

# SymPy๋กœ ์‹ฌ๋ณผ๋ฆญ ๊ณ„์‚ฐ
print("=== SymPy ์‹ฌ๋ณผ๋ฆญ ๋ฏธ๋ถ„ ===")
x1, x2 = symbols('x1 x2')
x = Matrix([x1, x2])
A = Matrix([[2, 1], [1, 3]])

# f(x) = x^T A x
f = (x.T * A * x)[0]
print(f"f(x) = {f}")

# ๊ทธ๋ž˜๋””์–ธํŠธ ๊ณ„์‚ฐ
grad_f = Matrix([diff(f, x1), diff(f, x2)])
print(f"โˆ‡f = {simplify(grad_f)}")
print(f"(A + A^T)x = {simplify((A + A.T) * x)}")

# PyTorch๋กœ ์ˆ˜์น˜ ๊ณ„์‚ฐ ๋ฐ ๊ฒ€์ฆ
print("\n=== PyTorch ์ž๋™ ๋ฏธ๋ถ„ ===")
x_val = torch.tensor([1.0, 2.0], requires_grad=True)
A_torch = torch.tensor([[2.0, 1.0], [1.0, 3.0]])

# ์ˆœ์ „ํŒŒ
f_val = x_val @ A_torch @ x_val
print(f"f(x) = {f_val.item():.4f}")

# ์—ญ์ „ํŒŒ
f_val.backward()
print(f"โˆ‡f (autograd) = {x_val.grad}")

# ๊ณต์‹์œผ๋กœ ๊ณ„์‚ฐ
grad_formula = (A_torch + A_torch.T) @ x_val.detach()
print(f"โˆ‡f (๊ณต์‹)      = {grad_formula}")
print(f"์ฐจ์ด: {torch.norm(x_val.grad - grad_formula).item():.2e}")

1.4 ๋ถ„์ž ๋ ˆ์ด์•„์›ƒ vs ๋ถ„๋ชจ ๋ ˆ์ด์•„์›ƒ

ํ–‰๋ ฌ ๋ฏธ๋ถ„์—๋Š” ๋‘ ๊ฐ€์ง€ ํ‘œ๊ธฐ๋ฒ•์ด ์žˆ์Šต๋‹ˆ๋‹ค:

  • ๋ถ„์ž ๋ ˆ์ด์•„์›ƒ (Numerator layout): $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$์˜ $(i,j)$ ์›์†Œ๊ฐ€ $\frac{\partial y_i}{\partial x_j}$
  • ๋ถ„๋ชจ ๋ ˆ์ด์•„์›ƒ (Denominator layout): $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$์˜ $(i,j)$ ์›์†Œ๊ฐ€ $\frac{\partial y_j}{\partial x_i}$

์ด ๋ฌธ์„œ์—์„œ๋Š” ๋ถ„์ž ๋ ˆ์ด์•„์›ƒ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

2. ๋ฒกํ„ฐ-๋ฒกํ„ฐ ๋ฏธ๋ถ„: ์•ผ์ฝ”๋น„์•ˆ (Jacobian)

2.1 ์•ผ์ฝ”๋น„์•ˆ ํ–‰๋ ฌ์˜ ์ •์˜

๋ฒกํ„ฐ ํ•จ์ˆ˜ $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$์— ๋Œ€ํ•ด ์•ผ์ฝ”๋น„์•ˆ ํ–‰๋ ฌ์€:

$$J = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$

ํฌ๊ธฐ๋Š” $m \times n$์ž…๋‹ˆ๋‹ค.

2.2 ์ฒด์ธ ๋ฃฐ with ์•ผ์ฝ”๋น„์•ˆ

$\mathbf{z} = \mathbf{g}(\mathbf{f}(\mathbf{x}))$์ผ ๋•Œ:

$$\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{z}}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{x}}$$

์—ฌ๊ธฐ์„œ $\mathbf{y} = \mathbf{f}(\mathbf{x})$์ด๊ณ , ์šฐ๋ณ€์€ ์•ผ์ฝ”๋น„์•ˆ ํ–‰๋ ฌ์˜ ๊ณฑ์ž…๋‹ˆ๋‹ค.

2.3 ์•ผ์ฝ”๋น„์•ˆ ๊ณ„์‚ฐ ์˜ˆ์ œ

import torch
import numpy as np

# ํ•จ์ˆ˜ ์ •์˜: f: R^2 -> R^3
def vector_function(x):
    """
    f([x1, x2]) = [x1^2 + x2,
                   x1 * x2,
                   sin(x1) + cos(x2)]
    """
    return torch.stack([
        x[0]**2 + x[1],
        x[0] * x[1],
        torch.sin(x[0]) + torch.cos(x[1])
    ])

x = torch.tensor([1.0, 2.0], requires_grad=True)

# PyTorch์˜ ์•ผ์ฝ”๋น„์•ˆ ๊ณ„์‚ฐ
from torch.autograd.functional import jacobian

J = jacobian(vector_function, x)
print("์•ผ์ฝ”๋น„์•ˆ ํ–‰๋ ฌ (3x2):")
print(J)

# ์ˆ˜๋™ ๊ณ„์‚ฐ์œผ๋กœ ๊ฒ€์ฆ
x1, x2 = x[0].item(), x[1].item()
J_manual = torch.tensor([
    [2*x1, 1],
    [x2, x1],
    [np.cos(x1), -np.sin(x2)]
])
print("\n์ˆ˜๋™ ๊ณ„์‚ฐ:")
print(J_manual)
print(f"\n์ฐจ์ด: {torch.norm(J - J_manual).item():.2e}")

2.4 ์ฒด์ธ ๋ฃฐ ์‹ค์Šต

# ํ•ฉ์„ฑ ํ•จ์ˆ˜์˜ ์•ผ์ฝ”๋น„์•ˆ: h(x) = g(f(x))
def f(x):
    """f: R^2 -> R^2"""
    return torch.stack([x[0]**2, x[0] + x[1]])

def g(y):
    """g: R^2 -> R^2"""
    return torch.stack([y[0] * y[1], y[0] - y[1]])

def h(x):
    """h = g โˆ˜ f"""
    return g(f(x))

x = torch.tensor([1.0, 2.0])

# ๋ฐฉ๋ฒ• 1: ์ง์ ‘ ๊ณ„์‚ฐ
J_h = jacobian(h, x)
print("J_h (์ง์ ‘):")
print(J_h)

# ๋ฐฉ๋ฒ• 2: ์ฒด์ธ ๋ฃฐ
J_f = jacobian(f, x)
y = f(x)
J_g = jacobian(g, y)
J_chain = J_g @ J_f
print("\nJ_g @ J_f (์ฒด์ธ ๋ฃฐ):")
print(J_chain)

print(f"\n์ฐจ์ด: {torch.norm(J_h - J_chain).item():.2e}")

3. ํ—ค์‹œ์•ˆ ํ–‰๋ ฌ (Hessian Matrix)

3.1 ํ—ค์‹œ์•ˆ์˜ ์ •์˜

์Šค์นผ๋ผ ํ•จ์ˆ˜ $f: \mathbb{R}^n \to \mathbb{R}$์˜ ํ—ค์‹œ์•ˆ ํ–‰๋ ฌ์€ 2์ฐจ ํŽธ๋ฏธ๋ถ„์œผ๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค:

$$H = \nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

์Šˆ๋ฐ”๋ฅด์ธ  ์ •๋ฆฌ์— ์˜ํ•ด $f$๊ฐ€ $C^2$ ํด๋ž˜์Šค๋ฉด $H$๋Š” ๋Œ€์นญ์ž…๋‹ˆ๋‹ค.
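์ด ๋Œ€์นญ์„ฑ์€ `torch.autograd.functional.hessian`์œผ๋กœ ์ˆ˜์น˜์ ์œผ๋กœ๋„ ํ™•์ธํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜ ํ•จ์ˆ˜๋Š” ์„ค๋ช…์šฉ์œผ๋กœ ์ž„์˜๋กœ ๊ณ ๋ฅธ $C^2$ ํ•จ์ˆ˜์ž…๋‹ˆ๋‹ค.

```python
# ์Šค์ผ€์น˜: ๋งค๋„๋Ÿฌ์šด ํ•จ์ˆ˜์—์„œ ํ—ค์‹œ์•ˆ ๋Œ€์นญ์„ฑ(์Šˆ๋ฐ”๋ฅด์ธ  ์ •๋ฆฌ)์„ ์ˆ˜์น˜์ ์œผ๋กœ ํ™•์ธ
import torch
from torch.autograd.functional import hessian

def f(x):
    # ์˜ˆ์‹œ๋กœ ๊ณ ๋ฅธ C^2 ํ•จ์ˆ˜
    return torch.sin(x[0] * x[1]) + x[0]**3 + torch.exp(x[1])

x = torch.tensor([0.7, -1.2])
H = hessian(f, x)

print(H)
print("๋Œ€์นญ ์—ฌ๋ถ€:", torch.allclose(H, H.T, atol=1e-5))
```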

3.2 ํ—ค์‹œ์•ˆ์˜ ์„ฑ์งˆ๊ณผ ์ตœ์ ํ™”

  • ์–‘์ •์น˜ (Positive Definite): ๋ชจ๋“  ๊ณ ์œ ๊ฐ’ > 0 โ†’ ๊ทน์†Ÿ๊ฐ’
  • ์Œ์ •์น˜ (Negative Definite): ๋ชจ๋“  ๊ณ ์œ ๊ฐ’ < 0 โ†’ ๊ทน๋Œ“๊ฐ’
  • ๋ถ€์ •๋ถ€ํ˜ธ (Indefinite): ์–‘์ˆ˜/์Œ์ˆ˜ ๊ณ ์œ ๊ฐ’ ํ˜ผ์žฌ โ†’ ์•ˆ์žฅ์ 

3.3 ๋‰ดํ„ด ๋ฐฉ๋ฒ•์—์„œ์˜ ์—ญํ• 

๋‰ดํ„ด ๋ฐฉ๋ฒ•์˜ ์—…๋ฐ์ดํŠธ ๊ทœ์น™:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - H^{-1}(\mathbf{x}_k) \nabla f(\mathbf{x}_k)$$

ํ—ค์‹œ์•ˆ์˜ ์—ญํ–‰๋ ฌ์„ ์‚ฌ์šฉํ•˜์—ฌ 2์ฐจ ์ •๋ณด๋ฅผ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค.

3.4 ํ—ค์‹œ์•ˆ ๊ณ„์‚ฐ ์˜ˆ์ œ

import torch
import numpy as np

# ํ•จ์ˆ˜ ์ •์˜: f(x, y) = x^2 + xy + 2y^2
def f(x):
    return x[0]**2 + x[0]*x[1] + 2*x[1]**2

x = torch.tensor([1.0, 2.0], requires_grad=True)

# ๊ทธ๋ž˜๋””์–ธํŠธ ๊ณ„์‚ฐ
y = f(x)
grad = torch.autograd.grad(y, x, create_graph=True)[0]
print("โˆ‡f =", grad)

# ํ—ค์‹œ์•ˆ ๊ณ„์‚ฐ (๊ฐ ๊ทธ๋ž˜๋””์–ธํŠธ ์„ฑ๋ถ„์„ ๋‹ค์‹œ ๋ฏธ๋ถ„)
hessian = torch.zeros(2, 2)
for i in range(2):
    hessian[i] = torch.autograd.grad(grad[i], x, retain_graph=True)[0]

print("\nํ—ค์‹œ์•ˆ ํ–‰๋ ฌ:")
print(hessian)

# ์ˆ˜๋™ ๊ณ„์‚ฐ: H = [[2, 1], [1, 4]]
H_manual = torch.tensor([[2.0, 1.0], [1.0, 4.0]])
print("\n์ˆ˜๋™ ๊ณ„์‚ฐ:")
print(H_manual)

# ๊ณ ์œ ๊ฐ’์œผ๋กœ ์ •๋ถ€ํ˜ธ ํŒ์ •
eigenvalues = torch.linalg.eigvalsh(hessian)
print(f"\n๊ณ ์œ ๊ฐ’: {eigenvalues}")
print("์–‘์ •์น˜ (๊ทน์†Ÿ๊ฐ’):", torch.all(eigenvalues > 0).item())

3.5 ํ—ค์‹œ์•ˆ๊ณผ ๋ณผ๋ก์„ฑ

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# ๋ณผ๋ก ํ•จ์ˆ˜: f(x, y) = x^2 + 2y^2
x = np.linspace(-3, 3, 50)
y = np.linspace(-3, 3, 50)
X, Y = np.meshgrid(x, y)
Z_convex = X**2 + 2*Y**2

# ์•ˆ์žฅ์  ํ•จ์ˆ˜: f(x, y) = x^2 - y^2
Z_saddle = X**2 - Y**2

fig = plt.figure(figsize=(14, 6))

# ๋ณผ๋ก ํ•จ์ˆ˜
ax1 = fig.add_subplot(121, projection='3d')
ax1.plot_surface(X, Y, Z_convex, cmap='viridis', alpha=0.8)
ax1.set_title('๋ณผ๋ก ํ•จ์ˆ˜: $f(x,y) = x^2 + 2y^2$\nํ—ค์‹œ์•ˆ ์–‘์ •์น˜', fontsize=12)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_zlabel('f(x,y)')

# ์•ˆ์žฅ์  ํ•จ์ˆ˜
ax2 = fig.add_subplot(122, projection='3d')
ax2.plot_surface(X, Y, Z_saddle, cmap='plasma', alpha=0.8)
ax2.set_title('์•ˆ์žฅ์ : $f(x,y) = x^2 - y^2$\nํ—ค์‹œ์•ˆ ๋ถ€์ •๋ถ€ํ˜ธ', fontsize=12)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_zlabel('f(x,y)')

plt.tight_layout()
plt.savefig('hessian_surfaces.png', dpi=150)
plt.close()

print("ํ—ค์‹œ์•ˆ๊ณผ ํ•จ์ˆ˜ ํ˜•ํƒœ ์‹œ๊ฐํ™” ์ €์žฅ: hessian_surfaces.png")

4. ํ–‰๋ ฌ ๋ฏธ๋ถ„ ํ•ญ๋“ฑ์‹

4.1 ์ฃผ์š” ํ•ญ๋“ฑ์‹ ๋ชฉ๋ก

| ํ•จ์ˆ˜ | ๋ฏธ๋ถ„ ๊ฒฐ๊ณผ |
|------|-----------|
| $\mathbf{a}^T \mathbf{x}$ | $\mathbf{a}$ |
| $\mathbf{x}^T \mathbf{x}$ | $2\mathbf{x}$ |
| $\mathbf{x}^T A \mathbf{x}$ | $(A + A^T)\mathbf{x}$ |
| $\mathbf{a}^T X \mathbf{b}$ | $\mathbf{a}\mathbf{b}^T$ (w.r.t. $X$) |
| $\text{tr}(AB)$ | $B^T$ (w.r.t. $A$) |
| $\log \lvert\det A\rvert$ | $A^{-T}$ (์—ญํ–‰๋ ฌ์˜ ์ „์น˜, w.r.t. $A$) |
| $\mathbf{x}^T A^{-1} \mathbf{x}$ | $-A^{-T}\mathbf{x}\mathbf{x}^T A^{-T}$ (w.r.t. $A$) |
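4.4์ ˆ์—์„œ ๋‹ค๋ฃจ์ง€ ์•Š๋Š” ๋‚˜๋จธ์ง€ ํ•ญ๋“ฑ์‹ ์ค‘ ๋‘ ๊ฐœ($\partial(\mathbf{a}^T X \mathbf{b})/\partial X = \mathbf{a}\mathbf{b}^T$, $\partial \log\lvert\det A\rvert/\partial A = A^{-T}$)๋„ autograd๋กœ ๊ฒ€์ฆํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋กœ๊ทธ ํ–‰๋ ฌ์‹์ด ์ •์˜๋˜๋„๋ก ์˜ˆ์‹œ์—์„œ๋Š” $A$๋ฅผ ์–‘์ •์น˜๋กœ ๋งŒ๋“ค์–ด ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค(์„ค๋ช…์šฉ ๊ฐ€์ •).

```python
# ์Šค์ผ€์น˜: ํ‘œ์˜ ๋‚˜๋จธ์ง€ ํ•ญ๋“ฑ์‹ ๋‘ ๊ฐœ๋ฅผ autograd๋กœ ๊ฒ€์ฆ
import torch

# ํ•ญ๋“ฑ์‹: โˆ‚(a^T X b)/โˆ‚X = a b^T
a = torch.randn(3)
b = torch.randn(4)
X = torch.randn(3, 4, requires_grad=True)
(a @ X @ b).backward()
print("a b^T ์ฐจ์ด:", torch.norm(X.grad - torch.outer(a, b)).item())

# ํ•ญ๋“ฑ์‹: โˆ‚ log|det A|/โˆ‚A = A^{-T}  (๊ฐ€์—ญ ํ–‰๋ ฌ A)
A = torch.randn(4, 4)
A = A @ A.T + 4 * torch.eye(4)   # ์–‘์ •์น˜๋กœ ๋งŒ๋“ค์–ด det > 0 ๋ณด์žฅ (์˜ˆ์‹œ์šฉ)
A.requires_grad_(True)
torch.logdet(A).backward()
print("A^{-T} ์ฐจ์ด:", torch.norm(A.grad - torch.inverse(A).T).item())
```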

4.2 ํ•ญ๋“ฑ์‹ ์œ ๋„: $\mathbf{x}^T A \mathbf{x}$

์ธ๋ฑ์Šค ํ‘œ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•œ ์œ ๋„:

$$f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x} = \sum_{i,j} x_i A_{ij} x_j$$

$x_k$๋กœ ๋ฏธ๋ถ„:

$$\frac{\partial f}{\partial x_k} = \sum_j A_{kj} x_j + \sum_i x_i A_{ik} = (A\mathbf{x})_k + (A^T\mathbf{x})_k$$

๋”ฐ๋ผ์„œ:

$$\nabla f = (A + A^T)\mathbf{x}$$

4.3 ํ•ญ๋“ฑ์‹ ์œ ๋„: $\text{tr}(AB)$

ํŠธ๋ ˆ์ด์Šค์˜ ์„ฑ์งˆ $\text{tr}(AB) = \sum_{ij} A_{ij} B_{ji}$๋ฅผ ์‚ฌ์šฉ:

$$\frac{\partial}{\partial A_{kl}} \text{tr}(AB) = \frac{\partial}{\partial A_{kl}} \sum_{ij} A_{ij} B_{ji} = B_{lk}$$

๋”ฐ๋ผ์„œ:

$$\frac{\partial \text{tr}(AB)}{\partial A} = B^T$$

4.4 ํ•ญ๋“ฑ์‹ ๊ฒ€์ฆ ์ฝ”๋“œ

import torch

# ํ•ญ๋“ฑ์‹ 1: โˆ‚(x^T a)/โˆ‚x = a
x = torch.randn(5, requires_grad=True)
a = torch.randn(5)
f = x @ a
f.backward()
print("ํ•ญ๋“ฑ์‹ 1: โˆ‚(x^T a)/โˆ‚x = a")
print(f"autograd: {x.grad}")
print(f"๊ณต์‹:     {a}")
print(f"์ฐจ์ด: {torch.norm(x.grad - a).item():.2e}\n")

# ํ•ญ๋“ฑ์‹ 2: โˆ‚(x^T A x)/โˆ‚x = (A + A^T)x
x = torch.randn(5, requires_grad=True)
A = torch.randn(5, 5)
f = x @ A @ x
f.backward()
print("ํ•ญ๋“ฑ์‹ 2: โˆ‚(x^T A x)/โˆ‚x = (A + A^T)x")
print(f"autograd: {x.grad}")
expected = (A + A.T) @ x.detach()
print(f"๊ณต์‹:     {expected}")
print(f"์ฐจ์ด: {torch.norm(x.grad - expected).item():.2e}\n")

# ํ•ญ๋“ฑ์‹ 3: โˆ‚tr(AB)/โˆ‚A = B^T
A = torch.randn(4, 4, requires_grad=True)
B = torch.randn(4, 4)
f = torch.trace(A @ B)
f.backward()
print("ํ•ญ๋“ฑ์‹ 3: โˆ‚tr(AB)/โˆ‚A = B^T")
print(f"autograd:\n{A.grad}")
print(f"๊ณต์‹:\n{B.T}")
print(f"์ฐจ์ด: {torch.norm(A.grad - B.T).item():.2e}")

5. ML์—์„œ์˜ ํ–‰๋ ฌ ๋ฏธ๋ถ„ ์‘์šฉ

5.1 MSE ์†์‹ค์˜ ๊ทธ๋ž˜๋””์–ธํŠธ ์œ ๋„

ํšŒ๊ท€ ๋ฌธ์ œ์—์„œ ์†์‹ค ํ•จ์ˆ˜:

$$L(\mathbf{w}) = \frac{1}{2n} \|\mathbf{y} - X\mathbf{w}\|^2$$

๊ทธ๋ž˜๋””์–ธํŠธ:

$$\nabla_\mathbf{w} L = -\frac{1}{n} X^T (\mathbf{y} - X\mathbf{w})$$

์œ ๋„ ๊ณผ์ •:

$$\nabla_\mathbf{w} L = \nabla_\mathbf{w} \frac{1}{2n}(\mathbf{y} - X\mathbf{w})^T(\mathbf{y} - X\mathbf{w})$$

$\mathbf{r} = \mathbf{y} - X\mathbf{w}$๋กœ ๋†“์œผ๋ฉด $\frac{\partial \mathbf{r}}{\partial \mathbf{w}} = -X$์ด๋ฏ€๋กœ:

$$\nabla_\mathbf{w} L = \frac{1}{2n} \cdot 2 \left(\frac{\partial \mathbf{r}}{\partial \mathbf{w}}\right)^T \mathbf{r} = -\frac{1}{n} X^T \mathbf{r} = -\frac{1}{n} X^T (\mathbf{y} - X\mathbf{w})$$

5.2 MSE ๊ทธ๋ž˜๋””์–ธํŠธ ๊ตฌํ˜„ ๋ฐ ๊ฒ€์ฆ

import torch
import torch.nn as nn

# ๋ฐ์ดํ„ฐ ์ƒ์„ฑ
n, d = 100, 10
X = torch.randn(n, d)
y = torch.randn(n)
w = torch.randn(d, requires_grad=True)

# ๋ฐฉ๋ฒ• 1: PyTorch autograd
pred = X @ w
loss = 0.5 * torch.mean((y - pred)**2)
loss.backward()
grad_autograd = w.grad.clone()

# ๋ฐฉ๋ฒ• 2: ์ˆ˜๋™ ์œ ๋„ ๊ณต์‹
residual = y - X @ w.detach()
grad_formula = -X.T @ residual / n

print("MSE ๊ทธ๋ž˜๋””์–ธํŠธ ๋น„๊ต:")
print(f"autograd: {grad_autograd[:5]}")
print(f"๊ณต์‹:     {grad_formula[:5]}")
print(f"์ฐจ์ด: {torch.norm(grad_autograd - grad_formula).item():.2e}")

5.3 ์†Œํ”„ํŠธ๋งฅ์Šค ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ ๊ทธ๋ž˜๋””์–ธํŠธ

์†Œํ”„ํŠธ๋งฅ์Šค ํ•จ์ˆ˜:

$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ ์†์‹ค:

$$L = -\sum_i y_i \log \sigma(\mathbf{z})_i$$

๊ทธ๋ž˜๋””์–ธํŠธ (์›-ํ•ซ ๋ ˆ์ด๋ธ” $\mathbf{y}$์— ๋Œ€ํ•ด):

$$\frac{\partial L}{\partial \mathbf{z}} = \sigma(\mathbf{z}) - \mathbf{y}$$

์ด ๊ฐ„๊ฒฐํ•œ ํ˜•ํƒœ๋Š” ์†Œํ”„ํŠธ๋งฅ์Šค์˜ ์•ผ์ฝ”๋น„์•ˆ ๊ณ„์‚ฐ์—์„œ ์œ ๋„๋ฉ๋‹ˆ๋‹ค.

5.4 ์†Œํ”„ํŠธ๋งฅ์Šค ๊ทธ๋ž˜๋””์–ธํŠธ ๊ฒ€์ฆ

import torch
import torch.nn.functional as F

# ๋กœ์ง“๊ณผ ํƒ€๊ฒŸ
logits = torch.randn(5, requires_grad=True)
target_class = 2  # ํด๋ž˜์Šค 2๊ฐ€ ์ •๋‹ต

# ๋ฐฉ๋ฒ• 1: PyTorch autograd
loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([target_class]))
loss.backward()
grad_autograd = logits.grad.clone()

# ๋ฐฉ๋ฒ• 2: ์ˆ˜๋™ ๊ณ„์‚ฐ
probs = F.softmax(logits.detach(), dim=0)
y_onehot = torch.zeros(5)
y_onehot[target_class] = 1.0
grad_formula = probs - y_onehot

print("์†Œํ”„ํŠธ๋งฅ์Šค ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ ๊ทธ๋ž˜๋””์–ธํŠธ:")
print(f"autograd: {grad_autograd}")
print(f"๊ณต์‹:     {grad_formula}")
print(f"์ฐจ์ด: {torch.norm(grad_autograd - grad_formula).item():.2e}")

5.5 ์—ญ์ „ํŒŒ: ์•ผ์ฝ”๋น„์•ˆ์˜ ์—ฐ์‡„ ๊ณฑ

์‹ ๊ฒฝ๋ง์—์„œ ์—ญ์ „ํŒŒ๋Š” ์ฒด์ธ ๋ฃฐ์˜ ๋ฐ˜๋ณต ์ ์šฉ์ž…๋‹ˆ๋‹ค:

$$\frac{\partial L}{\partial \mathbf{w}_1} = \frac{\partial L}{\partial \mathbf{z}_L} \frac{\partial \mathbf{z}_L}{\partial \mathbf{z}_{L-1}} \cdots \frac{\partial \mathbf{z}_2}{\partial \mathbf{z}_1} \frac{\partial \mathbf{z}_1}{\partial \mathbf{w}_1}$$

๊ฐ ํ•ญ์€ ์•ผ์ฝ”๋น„์•ˆ์ด๊ณ , ์˜ค๋ฅธ์ชฝ์—์„œ ์™ผ์ชฝ์œผ๋กœ ๊ณ„์‚ฐ (๋ฆฌ๋ฒ„์Šค ๋ชจ๋“œ).

5.6 ์„ ํ˜• ๋ ˆ์ด์–ด์˜ ๊ทธ๋ž˜๋””์–ธํŠธ ์œ ๋„

์„ ํ˜• ๋ ˆ์ด์–ด: $\mathbf{z} = W\mathbf{x} + \mathbf{b}$

์†์‹ค $L$์— ๋Œ€ํ•œ ๊ทธ๋ž˜๋””์–ธํŠธ:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \mathbf{z}} \mathbf{x}^T$$

$$\frac{\partial L}{\partial \mathbf{b}} = \frac{\partial L}{\partial \mathbf{z}}$$

$$\frac{\partial L}{\partial \mathbf{x}} = W^T \frac{\partial L}{\partial \mathbf{z}}$$

# ์„ ํ˜• ๋ ˆ์ด์–ด ๊ทธ๋ž˜๋””์–ธํŠธ ์ˆ˜๋™ ๊ตฌํ˜„
class LinearLayer:
    def __init__(self, in_dim, out_dim):
        self.W = torch.randn(out_dim, in_dim, requires_grad=False)
        self.b = torch.randn(out_dim, requires_grad=False)
        self.x = None
        self.dW = None
        self.db = None

    def forward(self, x):
        self.x = x
        return self.W @ x + self.b

    def backward(self, dL_dz):
        """dL_dz: ์†์‹ค์— ๋Œ€ํ•œ ์ถœ๋ ฅ์˜ ๊ทธ๋ž˜๋””์–ธํŠธ"""
        self.dW = torch.outer(dL_dz, self.x)  # (out_dim, in_dim)
        self.db = dL_dz  # (out_dim,)
        dL_dx = self.W.T @ dL_dz  # (in_dim,)
        return dL_dx

# ํ…Œ์ŠคํŠธ
layer = LinearLayer(5, 3)
x = torch.randn(5)
z = layer.forward(x)
dL_dz = torch.randn(3)  # ๊ฐ€์งœ ๊ทธ๋ž˜๋””์–ธํŠธ
dL_dx = layer.backward(dL_dz)

print("์„ ํ˜• ๋ ˆ์ด์–ด ์—ญ์ „ํŒŒ:")
print(f"dW ํ˜•ํƒœ: {layer.dW.shape}")
print(f"db ํ˜•ํƒœ: {layer.db.shape}")
print(f"dx ํ˜•ํƒœ: {dL_dx.shape}")

# PyTorch๋กœ ๊ฒ€์ฆ
W_torch = layer.W.clone().requires_grad_(True)
b_torch = layer.b.clone().requires_grad_(True)
x_torch = x.clone().requires_grad_(True)

z_torch = W_torch @ x_torch + b_torch
z_torch.backward(dL_dz)

print(f"\ndW ์ฐจ์ด: {torch.norm(layer.dW - W_torch.grad).item():.2e}")
print(f"db ์ฐจ์ด: {torch.norm(layer.db - b_torch.grad).item():.2e}")
print(f"dx ์ฐจ์ด: {torch.norm(dL_dx - x_torch.grad).item():.2e}")

6. ์ž๋™ ๋ฏธ๋ถ„ (Automatic Differentiation)

6.1 ํฌ์›Œ๋“œ ๋ชจ๋“œ vs ๋ฆฌ๋ฒ„์Šค ๋ชจ๋“œ

ํฌ์›Œ๋“œ ๋ชจ๋“œ (Forward Mode):

  • ์ž…๋ ฅ์—์„œ ์ถœ๋ ฅ ๋ฐฉํ–ฅ์œผ๋กœ ๋ฏธ๋ถ„ ์ „ํŒŒ
  • ์ž…๋ ฅ์ด ์ ๊ณ  ์ถœ๋ ฅ์ด ๋งŽ์„ ๋•Œ (์˜ˆ: 1๊ฐœ ์ž…๋ ฅ, $m$๊ฐœ ์ถœ๋ ฅ) ํšจ์œจ์ 
  • ๋ฐฉํ–ฅ ๋„ํ•จ์ˆ˜ ๊ณ„์‚ฐ์— ์œ ์šฉ

๋ฆฌ๋ฒ„์Šค ๋ชจ๋“œ (Reverse Mode):

  • ์ถœ๋ ฅ์—์„œ ์ž…๋ ฅ ๋ฐฉํ–ฅ์œผ๋กœ ๋ฏธ๋ถ„ ์ „ํŒŒ (์—ญ์ „ํŒŒ)
  • ์ถœ๋ ฅ์ด ์ ๊ณ  ์ž…๋ ฅ์ด ๋งŽ์„ ๋•Œ (์˜ˆ: $n$๊ฐœ ์ž…๋ ฅ, 1๊ฐœ ์ถœ๋ ฅ) ํšจ์œจ์ 
  • ๋”ฅ๋Ÿฌ๋‹์—์„œ ์‚ฌ์šฉ (์†์‹ค ํ•จ์ˆ˜๋Š” ์Šค์นผ๋ผ)
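๋‘ ๋ชจ๋“œ๋Š” `torch.autograd.functional`์˜ `jvp`(ํฌ์›Œ๋“œ ๋ชจ๋“œ, $J\mathbf{v}$)์™€ `vjp`(๋ฆฌ๋ฒ„์Šค ๋ชจ๋“œ, $\mathbf{u}^T J$)๋กœ ์ง์ ‘ ๋น„๊ตํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ์ž„์˜์˜ ์˜ˆ์‹œ ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•œ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค.

```python
# ์Šค์ผ€์น˜: ๊ฐ™์€ ํ•จ์ˆ˜์— ๋Œ€ํ•ด ํฌ์›Œ๋“œ ๋ชจ๋“œ(jvp)์™€ ๋ฆฌ๋ฒ„์Šค ๋ชจ๋“œ(vjp) ๋น„๊ต
import torch
from torch.autograd.functional import jvp, vjp, jacobian

def f(x):
    # ์˜ˆ์‹œ์šฉ ํ•จ์ˆ˜ f: R^2 -> R^2
    return torch.stack([x[0] * x[1], x[0]**2 + torch.sin(x[1])])

x = torch.tensor([2.0, 1.0])
J = jacobian(f, x)  # ๋น„๊ต์šฉ 2x2 ์•ผ์ฝ”๋น„์•ˆ

# ํฌ์›Œ๋“œ ๋ชจ๋“œ: ์ ‘์„  ๋ฒกํ„ฐ v์— ๋Œ€ํ•ด J v (์ž…๋ ฅ -> ์ถœ๋ ฅ ๋ฐฉํ–ฅ)
v_in = torch.tensor([1.0, 0.0])
_, Jv = jvp(f, x, v_in)
print("J v   =", Jv)

# ๋ฆฌ๋ฒ„์Šค ๋ชจ๋“œ: ์ฝ”ํƒ„์  ํŠธ ๋ฒกํ„ฐ u์— ๋Œ€ํ•ด u^T J (์ถœ๋ ฅ -> ์ž…๋ ฅ ๋ฐฉํ–ฅ)
u_out = torch.tensor([1.0, -1.0])
_, uJ = vjp(f, x, u_out)
print("u^T J =", uJ)
```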

6.2 ๊ณ„์‚ฐ ๊ทธ๋ž˜ํ”„

๊ณ„์‚ฐ ๊ทธ๋ž˜ํ”„๋Š” ์—ฐ์‚ฐ์„ ๋…ธ๋“œ๋กœ, ๋ฐ์ดํ„ฐ ํ๋ฆ„์„ ์—ฃ์ง€๋กœ ํ‘œํ˜„ํ•ฉ๋‹ˆ๋‹ค.

# ๊ณ„์‚ฐ ๊ทธ๋ž˜ํ”„ ์˜ˆ์‹œ: f(x, y) = (x + y) * (x - y)
import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)

# ์ค‘๊ฐ„ ๋ณ€์ˆ˜ ์ €์žฅ
a = x + y  # a = 5
b = x - y  # b = 1
f = a * b  # f = 5

print("๊ณ„์‚ฐ ๊ทธ๋ž˜ํ”„:")
print(f"x={x.item()}, y={y.item()}")
print(f"a = x + y = {a.item()}")
print(f"b = x - y = {b.item()}")
print(f"f = a * b = {f.item()}")

# ์—ญ์ „ํŒŒ
f.backward()
print(f"\nโˆ‚f/โˆ‚x = {x.grad.item()}")
print(f"โˆ‚f/โˆ‚y = {y.grad.item()}")

# ์ˆ˜๋™ ๊ณ„์‚ฐ ๊ฒ€์ฆ
# f = (x+y)(x-y) = x^2 - y^2
# โˆ‚f/โˆ‚x = 2x = 6
# โˆ‚f/โˆ‚y = -2y = -4
print(f"\n์ˆ˜๋™ ๊ณ„์‚ฐ: โˆ‚f/โˆ‚x = 2x = {2*x.item()}")
print(f"์ˆ˜๋™ ๊ณ„์‚ฐ: โˆ‚f/โˆ‚y = -2y = {-2*y.item()}")

6.3 PyTorch Autograd ๋‚ด๋ถ€ ๋™์ž‘

# ๊ณ„์‚ฐ ๊ทธ๋ž˜ํ”„ ์‹œ๊ฐํ™” (๊ฐ„๋‹จํ•œ ์˜ˆ)
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
w = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
b = torch.tensor([0.5, 1.0], requires_grad=True)

# ์ˆœ์ „ํŒŒ
z = w @ x + b  # ์„ ํ˜• ๋ณ€ํ™˜
a = torch.relu(z)  # ํ™œ์„ฑํ™”
loss = a.sum()  # ์†์‹ค

print("๊ณ„์‚ฐ ๊ทธ๋ž˜ํ”„ ์ถ”์ :")
print(f"grad_fn of z: {z.grad_fn}")
print(f"grad_fn of a: {a.grad_fn}")
print(f"grad_fn of loss: {loss.grad_fn}")

# ์—ญ์ „ํŒŒ
loss.backward()

print("\n๊ทธ๋ž˜๋””์–ธํŠธ:")
print(f"โˆ‚L/โˆ‚x: {x.grad}")
print(f"โˆ‚L/โˆ‚w:\n{w.grad}")
print(f"โˆ‚L/โˆ‚b: {b.grad}")

6.4 ๊ณ ์ฐจ ๋ฏธ๋ถ„

# 2์ฐจ ๋ฏธ๋ถ„ (ํ—ค์‹œ์•ˆ ๋Œ€๊ฐ์„ )
x = torch.tensor(2.0, requires_grad=True)
y = x**4

# 1์ฐจ ๋ฏธ๋ถ„
dy_dx = torch.autograd.grad(y, x, create_graph=True)[0]
print(f"f(x) = x^4, f'(x) = 4x^3")
print(f"f'(2) = {dy_dx.item()} (์˜ˆ์ƒ: {4*2**3})")

# 2์ฐจ ๋ฏธ๋ถ„
d2y_dx2 = torch.autograd.grad(dy_dx, x)[0]
print(f"f''(x) = 12x^2")
print(f"f''(2) = {d2y_dx2.item()} (์˜ˆ์ƒ: {12*2**2})")

6.5 ์ž๋™ ๋ฏธ๋ถ„์˜ ํ•œ๊ณ„์™€ ์ˆ˜๋™ ๊ตฌํ˜„

์ž๋™ ๋ฏธ๋ถ„์€ ํŽธ๋ฆฌํ•˜์ง€๋งŒ ๋•Œ๋กœ๋Š” ์ˆ˜๋™ ๊ตฌํ˜„์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

  • ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ (๊ทธ๋ž˜๋””์–ธํŠธ ์ฒดํฌํฌ์ธํŒ…)
  • ์ปค์Šคํ…€ ์—ญ์ „ํŒŒ ๋กœ์ง
  • ์ˆ˜์น˜ ์•ˆ์ •์„ฑ ๊ฐœ์„ 

# ์ปค์Šคํ…€ autograd ํ•จ์ˆ˜
import torch
class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[x < 0] = 0
        return grad_input

# ์‚ฌ์šฉ
x = torch.randn(5, requires_grad=True)
y = MyReLU.apply(x)
loss = y.sum()
loss.backward()

print("์ปค์Šคํ…€ ReLU ๊ทธ๋ž˜๋””์–ธํŠธ:")
print(f"x: {x.detach()}")
print(f"y: {y.detach()}")
print(f"โˆ‚L/โˆ‚x: {x.grad}")

์—ฐ์Šต ๋ฌธ์ œ

๋ฌธ์ œ 1: ํ–‰๋ ฌ ๋ฏธ๋ถ„ ํ•ญ๋“ฑ์‹ ์œ ๋„

$\mathbf{x} \in \mathbb{R}^n$, $A \in \mathbb{R}^{n \times n}$์ผ ๋•Œ, ๋‹ค์Œ์„ ์ฆ๋ช…ํ•˜์‹œ์˜ค:

$$\frac{\partial}{\partial \mathbf{x}} (\mathbf{x}^T A \mathbf{x}) = (A + A^T)\mathbf{x}$$

์ธ๋ฑ์Šค ํ‘œ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ๋‹จ๊ณ„๋ณ„๋กœ ์œ ๋„ํ•˜๊ณ , PyTorch๋กœ ๊ฒ€์ฆํ•˜๋Š” ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•˜์‹œ์˜ค.

๋ฌธ์ œ 2: ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€ ๊ทธ๋ž˜๋””์–ธํŠธ

๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€์˜ ์†์‹ค ํ•จ์ˆ˜๋Š”:

$$L(\mathbf{w}) = -\frac{1}{n} \sum_{i=1}^n \left[ y_i \log \sigma(\mathbf{w}^T \mathbf{x}_i) + (1-y_i) \log(1-\sigma(\mathbf{w}^T \mathbf{x}_i)) \right]$$

์—ฌ๊ธฐ์„œ $\sigma(z) = 1/(1+e^{-z})$. ๊ทธ๋ž˜๋””์–ธํŠธ $\nabla_\mathbf{w} L$์„ ์œ ๋„ํ•˜์‹œ์˜ค. ๊ฒฐ๊ณผ๊ฐ€ ๋‹ค์Œ๊ณผ ๊ฐ™์Œ์„ ๋ณด์ด์‹œ์˜ค:

$$\nabla_\mathbf{w} L = \frac{1}{n} X^T (\boldsymbol{\sigma} - \mathbf{y})$$

์—ฌ๊ธฐ์„œ $\boldsymbol{\sigma} = [\sigma(\mathbf{w}^T \mathbf{x}_1), \ldots, \sigma(\mathbf{w}^T \mathbf{x}_n)]^T$.

๋ฌธ์ œ 3: ๋ฐฐ์น˜ ์ •๊ทœํ™” ๊ทธ๋ž˜๋””์–ธํŠธ

๋ฐฐ์น˜ ์ •๊ทœํ™”๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜๋ฉ๋‹ˆ๋‹ค:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

์—ฌ๊ธฐ์„œ $\mu = \frac{1}{n}\sum_i x_i$, $\sigma^2 = \frac{1}{n}\sum_i (x_i - \mu)^2$.

์†์‹ค $L$์— ๋Œ€ํ•œ $x_i$์˜ ๊ทธ๋ž˜๋””์–ธํŠธ๋ฅผ ์œ ๋„ํ•˜์‹œ์˜ค. PyTorch์˜ BatchNorm1d์™€ ๋น„๊ตํ•˜์—ฌ ๊ฒ€์ฆํ•˜์‹œ์˜ค.

๋ฌธ์ œ 4: ์†Œํ”„ํŠธ๋งฅ์Šค ์•ผ์ฝ”๋น„์•ˆ

์†Œํ”„ํŠธ๋งฅ์Šค ํ•จ์ˆ˜ $\sigma(\mathbf{z})_i = e^{z_i} / \sum_j e^{z_j}$์˜ ์•ผ์ฝ”๋น„์•ˆ์„ ๊ณ„์‚ฐํ•˜์‹œ์˜ค.

$$\frac{\partial \sigma_i}{\partial z_j} = ?$$

ํžŒํŠธ: $i=j$์ธ ๊ฒฝ์šฐ์™€ $i \neq j$์ธ ๊ฒฝ์šฐ๋ฅผ ๋‚˜๋ˆ„์–ด ๊ณ„์‚ฐํ•˜์‹œ์˜ค. ๊ฒฐ๊ณผ๊ฐ€ ๋‹ค์Œ๊ณผ ๊ฐ™์Œ์„ ๋ณด์ด์‹œ์˜ค:

$$\frac{\partial \sigma_i}{\partial z_j} = \sigma_i(\delta_{ij} - \sigma_j)$$

๋ฌธ์ œ 5: L2 ์ •๊ทœํ™” ๊ทธ๋ž˜๋””์–ธํŠธ

๋ฆฟ์ง€ ํšŒ๊ท€์˜ ์†์‹ค ํ•จ์ˆ˜:

$$L(\mathbf{w}) = \frac{1}{2n}\|\mathbf{y} - X\mathbf{w}\|^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2$$

๊ทธ๋ž˜๋””์–ธํŠธ๋ฅผ ์œ ๋„ํ•˜๊ณ , ์ •๊ทœ ๋ฐฉ์ •์‹ (Normal Equation)์„ ๊ตฌํ•˜์‹œ์˜ค. PyTorch๋กœ ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•๊ณผ ํ•ด์„ํ•ด๋ฅผ ๋น„๊ตํ•˜์‹œ์˜ค.

์ฐธ๊ณ  ์ž๋ฃŒ

์˜จ๋ผ์ธ ์ž๋ฃŒ

๊ต์žฌ

  • Magnus & Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics
  • Goodfellow et al., Deep Learning, Chapter 4 (Numerical Computation)
  • Boyd & Vandenberghe, Convex Optimization, Appendix A

๋…ผ๋ฌธ

  • Griewank & Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (2008)
  • Baydin et al., Automatic Differentiation in Machine Learning: a Survey (JMLR 2018)