03. Matrix Calculus
Learning Objectives
- Understand and compute scalar-by-vector derivatives and the gradient
- Learn the definition of the Jacobian matrix and how to apply the chain rule
- Understand the meaning of the Hessian matrix and its role in optimization
- Derive and apply the key matrix calculus identities
- Derive the gradients of machine-learning loss functions by hand
- Understand PyTorch's automatic differentiation and use it for verification
1. Scalar-by-Vector Derivatives
1.1 Definition of the Gradient
For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the gradient is the vector of all partial derivatives:
$$\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$
The gradient points in the direction of steepest increase of the function.
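As a quick numerical illustration (a minimal sketch; the function $f(x_1,x_2) = x_1^2 + 3x_2^2$, the point, and the step size are chosen here purely for demonstration), a small step along the normalized gradient increases $f$ more than a step along an arbitrary direction:
import torch
# Illustrative example: f(x) = x1^2 + 3*x2^2 at the point (1, 1)
def f_demo(p):
    return p[0]**2 + 3 * p[1]**2
x0 = torch.tensor([1.0, 1.0], requires_grad=True)
f_demo(x0).backward()
g = x0.grad                            # gradient at x0: (2, 6)
with torch.no_grad():
    eps = 1e-3
    d_grad = g / torch.norm(g)         # unit step along the gradient
    d_e1 = torch.tensor([1.0, 0.0])    # unit step along an arbitrary axis
    print("increase along gradient:", (f_demo(x0 + eps * d_grad) - f_demo(x0)).item())
    print("increase along e1:      ", (f_demo(x0 + eps * d_e1) - f_demo(x0)).item())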
1.2 Basic Examples
Example 1: $f(\mathbf{x}) = \mathbf{a}^T \mathbf{x}$
$$\frac{\partial}{\partial \mathbf{x}}(\mathbf{a}^T \mathbf{x}) = \mathbf{a}$$
Example 2: $f(\mathbf{x}) = \mathbf{x}^T \mathbf{x}$
$$\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^T \mathbf{x}) = 2\mathbf{x}$$
Example 3: $f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$ (quadratic form)
$$\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^T A \mathbf{x}) = (A + A^T)\mathbf{x}$$
If $A$ is symmetric, this reduces to $2A\mathbf{x}$.
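Examples 1 and 2 can each be checked with a one-line autograd call (a quick sketch; Example 3 is verified in detail in the next subsection):
import torch
x = torch.randn(4, requires_grad=True)
a = torch.randn(4)
(a @ x).backward()
print(torch.allclose(x.grad, a))                 # Example 1: gradient of a^T x is a
x.grad = None                                    # reset the accumulated gradient
(x @ x).backward()
print(torch.allclose(x.grad, 2 * x.detach()))    # Example 2: gradient of x^T x is 2x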
1.3 Python Implementation: Differentiating a Quadratic Form
import numpy as np
import torch
from sympy import symbols, Matrix, diff, simplify
# Symbolic computation with SymPy
print("=== SymPy symbolic differentiation ===")
x1, x2 = symbols('x1 x2')
x = Matrix([x1, x2])
A = Matrix([[2, 1], [1, 3]])
# f(x) = x^T A x
f = (x.T * A * x)[0]
print(f"f(x) = {f}")
# Compute the gradient
grad_f = Matrix([diff(f, x1), diff(f, x2)])
print(f"∇f = {simplify(grad_f)}")
print(f"(A + A^T)x = {simplify((A + A.T) * x)}")
# Numerical computation and verification with PyTorch
print("\n=== PyTorch automatic differentiation ===")
x_val = torch.tensor([1.0, 2.0], requires_grad=True)
A_torch = torch.tensor([[2.0, 1.0], [1.0, 3.0]])
# Forward pass
f_val = x_val @ A_torch @ x_val
print(f"f(x) = {f_val.item():.4f}")
# Backward pass
f_val.backward()
print(f"∇f (autograd) = {x_val.grad}")
# Compute from the formula
grad_formula = (A_torch + A_torch.T) @ x_val.detach()
print(f"∇f (formula) = {grad_formula}")
print(f"Difference: {torch.norm(x_val.grad - grad_formula).item():.2e}")
1.4 Numerator Layout vs. Denominator Layout
There are two layout conventions for matrix derivatives:
- Numerator layout: the $(i,j)$ entry of $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ is $\frac{\partial y_i}{\partial x_j}$
- Denominator layout: the $(i,j)$ entry of $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ is $\frac{\partial y_j}{\partial x_i}$
This document uses the numerator layout.
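For example, torch.autograd.functional.jacobian follows the numerator layout: entry $(i, j)$ of the returned matrix is $\partial y_i / \partial x_j$ (a small check, assuming only torch; the function below is chosen just for illustration):
import torch
from torch.autograd.functional import jacobian
# f: R^3 -> R^2, so the numerator-layout Jacobian is 2 x 3
def f_layout(x):
    return torch.stack([x[0] * x[1], x[1] + x[2]])
J = jacobian(f_layout, torch.tensor([1.0, 2.0, 3.0]))
print(J.shape)   # torch.Size([2, 3]): rows index outputs, columns index inputs
print(J)         # the denominator layout would be the transpose (3 x 2)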
2. Vector-by-Vector Derivatives: the Jacobian
2.1 Definition of the Jacobian Matrix
For a vector function $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian matrix is:
$$J = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$
Its size is $m \times n$.
2.2 The Chain Rule with Jacobians
When $\mathbf{z} = \mathbf{g}(\mathbf{f}(\mathbf{x}))$:
$$\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{z}}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{x}}$$
where $\mathbf{y} = \mathbf{f}(\mathbf{x})$, and the right-hand side is a product of Jacobian matrices.
2.3 Jacobian Computation Example
import torch
import numpy as np
# Define a function f: R^2 -> R^3
def vector_function(x):
"""
f([x1, x2]) = [x1^2 + x2,
x1 * x2,
sin(x1) + cos(x2)]
"""
return torch.stack([
x[0]**2 + x[1],
x[0] * x[1],
torch.sin(x[0]) + torch.cos(x[1])
])
x = torch.tensor([1.0, 2.0], requires_grad=True)
# PyTorch์ ์ผ์ฝ๋น์ ๊ณ์ฐ
from torch.autograd.functional import jacobian
J = jacobian(vector_function, x)
print("Jacobian matrix (3x2):")
print(J)
# Verify by manual computation
x1, x2 = x[0].item(), x[1].item()
J_manual = torch.tensor([
[2*x1, 1],
[x2, x1],
[np.cos(x1), -np.sin(x2)]
])
print("\nManual computation:")
print(J_manual)
print(f"\nDifference: {torch.norm(J - J_manual).item():.2e}")
2.4 Chain Rule in Practice
# Jacobian of the composite function h(x) = g(f(x))
def f(x):
"""f: R^2 -> R^2"""
return torch.stack([x[0]**2, x[0] + x[1]])
def g(y):
"""g: R^2 -> R^2"""
return torch.stack([y[0] * y[1], y[0] - y[1]])
def h(x):
    """h = g ∘ f"""
    return g(f(x))
x = torch.tensor([1.0, 2.0])
# Method 1: compute directly
J_h = jacobian(h, x)
print("J_h (direct):")
print(J_h)
# Method 2: chain rule
J_f = jacobian(f, x)
y = f(x)
J_g = jacobian(g, y)
J_chain = J_g @ J_f
print("\nJ_g @ J_f (chain rule):")
print(J_chain)
print(f"\nDifference: {torch.norm(J_h - J_chain).item():.2e}")
3. The Hessian Matrix
3.1 Definition of the Hessian
For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian matrix collects the second-order partial derivatives:
$$H = \nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$
By Schwarz's theorem, $H$ is symmetric whenever $f$ is of class $C^2$.
3.2 Properties of the Hessian and Optimization
At a critical point ($\nabla f = \mathbf{0}$), the Hessian classifies the point:
- Positive definite: all eigenvalues > 0 → local minimum
- Negative definite: all eigenvalues < 0 → local maximum
- Indefinite: both positive and negative eigenvalues → saddle point
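As a quick numerical example (a small sketch using torch.autograd.functional.hessian; the saddle function here is the same one visualized in Section 3.5), the Hessian of $f(x, y) = x^2 - y^2$ has eigenvalues of both signs:
import torch
from torch.autograd.functional import hessian
def f_saddle(p):
    return p[0]**2 - p[1]**2
H = hessian(f_saddle, torch.tensor([0.0, 0.0]))
print(H)                          # [[2, 0], [0, -2]]
print(torch.linalg.eigvalsh(H))   # one negative and one positive eigenvalue -> indefinite -> saddle point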
3.3 Role in Newton's Method
The update rule of Newton's method:
$$\mathbf{x}_{k+1} = \mathbf{x}_k - H^{-1}(\mathbf{x}_k) \nabla f(\mathbf{x}_k)$$
Using the inverse of the Hessian, the update exploits second-order (curvature) information.
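Below is a minimal Newton's-method sketch on the quadratic $f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$ from Section 1.3 (the matrix and starting point are illustrative); because $f$ is quadratic, a single Newton step already reaches the minimizer $\mathbf{x} = \mathbf{0}$:
import torch
from torch.autograd.functional import hessian
A = torch.tensor([[2.0, 1.0], [1.0, 3.0]])
def f_quad(x):
    return x @ A @ x
x = torch.tensor([1.0, 2.0])
for k in range(3):
    x_req = x.clone().requires_grad_(True)
    grad = torch.autograd.grad(f_quad(x_req), x_req)[0]   # gradient at x_k
    H = hessian(f_quad, x)                                # Hessian at x_k
    x = x - torch.linalg.solve(H, grad)                   # x_{k+1} = x_k - H^{-1} grad
    print(f"step {k+1}: x = {x}, f(x) = {f_quad(x).item():.2e}")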
3.4 Hessian Computation Example
import torch
import numpy as np
# Define the function f(x, y) = x^2 + xy + 2y^2
def f(x):
return x[0]**2 + x[0]*x[1] + 2*x[1]**2
x = torch.tensor([1.0, 2.0], requires_grad=True)
# Compute the gradient
y = f(x)
grad = torch.autograd.grad(y, x, create_graph=True)[0]
print("∇f =", grad)
# Compute the Hessian (differentiate each gradient component again)
hessian = torch.zeros(2, 2)
for i in range(2):
hessian[i] = torch.autograd.grad(grad[i], x, retain_graph=True)[0]
print("\nHessian matrix:")
print(hessian)
# Manual computation: H = [[2, 1], [1, 4]]
H_manual = torch.tensor([[2.0, 1.0], [1.0, 4.0]])
print("\nManual computation:")
print(H_manual)
# Check definiteness via the eigenvalues
eigenvalues = torch.linalg.eigvalsh(hessian)
print(f"\nEigenvalues: {eigenvalues}")
print("Positive definite (local minimum):", torch.all(eigenvalues > 0).item())
3.5 Hessian and Convexity
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Convex function: f(x, y) = x^2 + 2y^2
x = np.linspace(-3, 3, 50)
y = np.linspace(-3, 3, 50)
X, Y = np.meshgrid(x, y)
Z_convex = X**2 + 2*Y**2
# Saddle function: f(x, y) = x^2 - y^2
Z_saddle = X**2 - Y**2
fig = plt.figure(figsize=(14, 6))
# Convex function
ax1 = fig.add_subplot(121, projection='3d')
ax1.plot_surface(X, Y, Z_convex, cmap='viridis', alpha=0.8)
ax1.set_title('Convex function: $f(x,y) = x^2 + 2y^2$\nPositive definite Hessian', fontsize=12)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_zlabel('f(x,y)')
# Saddle function
ax2 = fig.add_subplot(122, projection='3d')
ax2.plot_surface(X, Y, Z_saddle, cmap='plasma', alpha=0.8)
ax2.set_title('Saddle point: $f(x,y) = x^2 - y^2$\nIndefinite Hessian', fontsize=12)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_zlabel('f(x,y)')
plt.tight_layout()
plt.savefig('hessian_surfaces.png', dpi=150)
plt.close()
print("Saved Hessian/function-shape visualization: hessian_surfaces.png")
4. Matrix Calculus Identities
4.1 Table of Key Identities
| Function | Derivative |
|---|---|
| $\mathbf{a}^T \mathbf{x}$ | $\mathbf{a}$ |
| $\mathbf{x}^T \mathbf{x}$ | $2\mathbf{x}$ |
| $\mathbf{x}^T A \mathbf{x}$ | $(A + A^T)\mathbf{x}$ |
| $\mathbf{a}^T X \mathbf{b}$ | $\mathbf{a}\mathbf{b}^T$ (w.r.t. $X$) |
| $\text{tr}(AB)$ | $B^T$ (w.r.t. $A$) |
| $\log \lvert \det A \rvert$ | $A^{-T}$ (transpose of the inverse, w.r.t. $A$) |
| $\mathbf{x}^T A^{-1} \mathbf{x}$ | $-A^{-T}\mathbf{x}\mathbf{x}^T A^{-T}$ (w.r.t. $A$; equals $-A^{-1}\mathbf{x}\mathbf{x}^T A^{-1}$ when $A$ is symmetric) |
4.2 Deriving the Identity for $\mathbf{x}^T A \mathbf{x}$
Derivation using index notation:
$$f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x} = \sum_{i,j} x_i A_{ij} x_j$$
Differentiating with respect to $x_k$:
$$\frac{\partial f}{\partial x_k} = \sum_j A_{kj} x_j + \sum_i x_i A_{ik} = (A\mathbf{x})_k + (A^T\mathbf{x})_k$$
Therefore:
$$\nabla f = (A + A^T)\mathbf{x}$$
4.3 Deriving the Identity for $\text{tr}(AB)$
Using $\text{tr}(AB) = \sum_{ij} A_{ij} B_{ji}$:
$$\frac{\partial}{\partial A_{kl}} \text{tr}(AB) = \frac{\partial}{\partial A_{kl}} \sum_{ij} A_{ij} B_{ji} = B_{lk}$$
Therefore:
$$\frac{\partial \text{tr}(AB)}{\partial A} = B^T$$
4.4 Identity Verification Code
import torch
# Identity 1: ∂(x^T a)/∂x = a
x = torch.randn(5, requires_grad=True)
a = torch.randn(5)
f = x @ a
f.backward()
print("Identity 1: ∂(x^T a)/∂x = a")
print(f"autograd: {x.grad}")
print(f"formula: {a}")
print(f"Difference: {torch.norm(x.grad - a).item():.2e}\n")
# Identity 2: ∂(x^T A x)/∂x = (A + A^T)x
x = torch.randn(5, requires_grad=True)
A = torch.randn(5, 5)
f = x @ A @ x
f.backward()
print("Identity 2: ∂(x^T A x)/∂x = (A + A^T)x")
print(f"autograd: {x.grad}")
expected = (A + A.T) @ x.detach()
print(f"formula: {expected}")
print(f"Difference: {torch.norm(x.grad - expected).item():.2e}\n")
# Identity 3: ∂tr(AB)/∂A = B^T
A = torch.randn(4, 4, requires_grad=True)
B = torch.randn(4, 4)
f = torch.trace(A @ B)
f.backward()
print("Identity 3: ∂tr(AB)/∂A = B^T")
print(f"autograd:\n{A.grad}")
print(f"formula:\n{B.T}")
print(f"Difference: {torch.norm(A.grad - B.T).item():.2e}")
5. Matrix Calculus Applications in ML
5.1 Deriving the MSE Loss Gradient
The loss function for a regression problem:
$$L(\mathbf{w}) = \frac{1}{2n} \|\mathbf{y} - X\mathbf{w}\|^2$$
Its gradient:
$$\nabla_\mathbf{w} L = -\frac{1}{n} X^T (\mathbf{y} - X\mathbf{w})$$
Derivation:
$$\nabla_\mathbf{w} L = \nabla_\mathbf{w} \frac{1}{2n}(\mathbf{y} - X\mathbf{w})^T(\mathbf{y} - X\mathbf{w})$$
Setting $\mathbf{r} = \mathbf{y} - X\mathbf{w}$, whose Jacobian is $\frac{\partial \mathbf{r}}{\partial \mathbf{w}} = -X$:
$$\nabla_\mathbf{w} L = \frac{1}{2n} \nabla_\mathbf{w}(\mathbf{r}^T \mathbf{r}) = \frac{1}{2n} \cdot 2 \left(\frac{\partial \mathbf{r}}{\partial \mathbf{w}}\right)^T \mathbf{r} = -\frac{1}{n} X^T \mathbf{r} = -\frac{1}{n} X^T (\mathbf{y} - X\mathbf{w})$$
5.2 Implementing and Verifying the MSE Gradient
import torch
import torch.nn as nn
# Generate data
n, d = 100, 10
X = torch.randn(n, d)
y = torch.randn(n)
w = torch.randn(d, requires_grad=True)
# Method 1: PyTorch autograd
pred = X @ w
loss = 0.5 * torch.mean((y - pred)**2)
loss.backward()
grad_autograd = w.grad.clone()
# Method 2: hand-derived formula
residual = y - X @ w.detach()
grad_formula = -X.T @ residual / n
print("MSE gradient comparison:")
print(f"autograd: {grad_autograd[:5]}")
print(f"formula: {grad_formula[:5]}")
print(f"Difference: {torch.norm(grad_autograd - grad_formula).item():.2e}")
5.3 Softmax Cross-Entropy Gradient
The softmax function:
$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
The cross-entropy loss:
$$L = -\sum_i y_i \log \sigma(\mathbf{z})_i$$
Its gradient with respect to the logits (for a one-hot label $\mathbf{y}$):
$$\frac{\partial L}{\partial \mathbf{z}} = \sigma(\mathbf{z}) - \mathbf{y}$$
This remarkably compact form follows from the Jacobian of the softmax.
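That Jacobian has the closed form $\partial \sigma_i / \partial z_j = \sigma_i(\delta_{ij} - \sigma_j)$, i.e. $J_\sigma = \operatorname{diag}(\boldsymbol{\sigma}) - \boldsymbol{\sigma}\boldsymbol{\sigma}^T$ (see Problem 4). A quick numerical check of this form against autograd (a small sketch, assuming only torch):
import torch
from torch.autograd.functional import jacobian
z = torch.randn(5)
s = torch.softmax(z, dim=0)
J_formula = torch.diag(s) - torch.outer(s, s)               # diag(sigma) - sigma sigma^T
J_autograd = jacobian(lambda t: torch.softmax(t, dim=0), z)
print(f"Difference: {torch.norm(J_formula - J_autograd).item():.2e}")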
5.4 Verifying the Softmax Gradient
import torch
import torch.nn.functional as F
# Logits and target
logits = torch.randn(5, requires_grad=True)
target_class = 2  # class 2 is the correct label
# Method 1: PyTorch autograd
loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([target_class]))
loss.backward()
grad_autograd = logits.grad.clone()
# Method 2: manual computation
probs = F.softmax(logits.detach(), dim=0)
y_onehot = torch.zeros(5)
y_onehot[target_class] = 1.0
grad_formula = probs - y_onehot
print("Softmax cross-entropy gradient:")
print(f"autograd: {grad_autograd}")
print(f"formula: {grad_formula}")
print(f"Difference: {torch.norm(grad_autograd - grad_formula).item():.2e}")
5.5 Backpropagation: a Chain of Jacobians
In a neural network, backpropagation is the repeated application of the chain rule:
$$\frac{\partial L}{\partial \mathbf{w}_1} = \frac{\partial L}{\partial \mathbf{z}_L} \frac{\partial \mathbf{z}_L}{\partial \mathbf{z}_{L-1}} \cdots \frac{\partial \mathbf{z}_2}{\partial \mathbf{z}_1} \frac{\partial \mathbf{z}_1}{\partial \mathbf{w}_1}$$
Each factor is a Jacobian, and the product is evaluated from right to left (reverse mode), as sketched below.
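Reverse mode never materializes the full Jacobians; it propagates a row vector through successive vector-Jacobian products (VJPs). A minimal sketch with torch.autograd.functional.vjp, reusing the two small layers from Section 2.4 (the upstream vector v here is arbitrary):
import torch
from torch.autograd.functional import vjp, jacobian
def layer1(x):
    return torch.stack([x[0]**2, x[0] + x[1]])
def layer2(y):
    return torch.stack([y[0] * y[1], y[0] - y[1]])
x = torch.tensor([1.0, 2.0])
y = layer1(x)
v = torch.tensor([1.0, 0.0])               # stands in for dL/dz of a downstream scalar loss
_, v_y = vjp(layer2, y, v)                 # v^T J_g
_, v_x = vjp(layer1, x, v_y)               # (v^T J_g) J_f, evaluated right-to-left
v_x_full = v @ jacobian(layer2, y) @ jacobian(layer1, x)   # explicit Jacobians (only feasible when tiny)
print(v_x, v_x_full)                       # the two results agree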
5.6 Deriving the Gradients of a Linear Layer
Linear layer: $\mathbf{z} = W\mathbf{x} + \mathbf{b}$
Gradients with respect to a loss $L$:
$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \mathbf{z}} \mathbf{x}^T$$
$$\frac{\partial L}{\partial \mathbf{b}} = \frac{\partial L}{\partial \mathbf{z}}$$
$$\frac{\partial L}{\partial \mathbf{x}} = W^T \frac{\partial L}{\partial \mathbf{z}}$$
# Manual implementation of linear-layer gradients
class LinearLayer:
def __init__(self, in_dim, out_dim):
self.W = torch.randn(out_dim, in_dim, requires_grad=False)
self.b = torch.randn(out_dim, requires_grad=False)
self.x = None
self.dW = None
self.db = None
def forward(self, x):
self.x = x
return self.W @ x + self.b
    def backward(self, dL_dz):
        """dL_dz: gradient of the loss with respect to this layer's output"""
self.dW = torch.outer(dL_dz, self.x) # (out_dim, in_dim)
self.db = dL_dz # (out_dim,)
dL_dx = self.W.T @ dL_dz # (in_dim,)
return dL_dx
# Test
layer = LinearLayer(5, 3)
x = torch.randn(5)
z = layer.forward(x)
dL_dz = torch.randn(3)  # dummy upstream gradient
dL_dx = layer.backward(dL_dz)
print("Linear layer backward pass:")
print(f"dW shape: {layer.dW.shape}")
print(f"db shape: {layer.db.shape}")
print(f"dx shape: {dL_dx.shape}")
# Verify with PyTorch
W_torch = layer.W.clone().requires_grad_(True)
b_torch = layer.b.clone().requires_grad_(True)
x_torch = x.clone().requires_grad_(True)
z_torch = W_torch @ x_torch + b_torch
z_torch.backward(dL_dz)
print(f"\ndW difference: {torch.norm(layer.dW - W_torch.grad).item():.2e}")
print(f"db difference: {torch.norm(layer.db - b_torch.grad).item():.2e}")
print(f"dx difference: {torch.norm(dL_dx - x_torch.grad).item():.2e}")
6. Automatic Differentiation
6.1 Forward Mode vs. Reverse Mode
Forward mode:
- Propagates derivatives from the inputs toward the outputs
- Efficient when there are few inputs and many outputs (e.g., 1 input, $m$ outputs)
- Well suited to computing directional derivatives
Reverse mode:
- Propagates derivatives from the outputs toward the inputs (backpropagation)
- Efficient when there is a single scalar output and $n$ inputs
- The mode used in deep learning (the loss is a scalar)
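The two modes correspond to torch.autograd.functional.jvp (forward mode: a Jacobian-vector product per input direction) and vjp (reverse mode: a vector-Jacobian product per output direction, which is what loss.backward() performs). A small illustration (the function and shapes are chosen only to make the contrast visible):
import torch
from torch.autograd.functional import jvp, vjp
def f_modes(x):
    # f: R^3 -> R^2
    return torch.stack([x.sum(), (x**2).sum()])
x = torch.tensor([1.0, 2.0, 3.0])
_, Jv = jvp(f_modes, (x,), (torch.tensor([1.0, 0.0, 0.0]),))   # forward mode: J @ v
_, uJ = vjp(f_modes, x, torch.tensor([1.0, 0.0]))              # reverse mode: u^T @ J
print("J v   =", Jv)    # shape (2,): the column of J picked out by v
print("u^T J =", uJ)    # shape (3,): the row of J picked out by u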
6.2 The Computational Graph
A computational graph represents operations as nodes and data flow as edges.
# Computational graph example: f(x, y) = (x + y) * (x - y)
import torch
x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
# Store intermediate variables
a = x + y # a = 5
b = x - y # b = 1
f = a * b # f = 5
print("Computational graph:")
print(f"x={x.item()}, y={y.item()}")
print(f"a = x + y = {a.item()}")
print(f"b = x - y = {b.item()}")
print(f"f = a * b = {f.item()}")
# Backward pass
f.backward()
print(f"\n∂f/∂x = {x.grad.item()}")
print(f"∂f/∂y = {y.grad.item()}")
# Manual verification
# f = (x+y)(x-y) = x^2 - y^2
# ∂f/∂x = 2x = 6
# ∂f/∂y = -2y = -4
print(f"\nManual: ∂f/∂x = 2x = {2*x.item()}")
print(f"Manual: ∂f/∂y = -2y = {-2*y.item()}")
6.3 Inside PyTorch Autograd
# Inspecting the computational graph (a simple example)
import torch
x = torch.tensor([1.0, 2.0], requires_grad=True)
w = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
b = torch.tensor([0.5, 1.0], requires_grad=True)
# Forward pass
z = w @ x + b      # linear transformation
a = torch.relu(z)  # activation
loss = a.sum()     # loss
print("Tracing the computational graph:")
print(f"grad_fn of z: {z.grad_fn}")
print(f"grad_fn of a: {a.grad_fn}")
print(f"grad_fn of loss: {loss.grad_fn}")
# Backward pass
loss.backward()
print("\nGradients:")
print(f"∂L/∂x: {x.grad}")
print(f"∂L/∂w:\n{w.grad}")
print(f"∂L/∂b: {b.grad}")
6.4 Higher-Order Derivatives
# Second derivative (the 1-D analogue of the Hessian diagonal)
x = torch.tensor(2.0, requires_grad=True)
y = x**4
# First derivative
dy_dx = torch.autograd.grad(y, x, create_graph=True)[0]
print("f(x) = x^4, f'(x) = 4x^3")
print(f"f'(2) = {dy_dx.item()} (expected: {4*2**3})")
# Second derivative
d2y_dx2 = torch.autograd.grad(dy_dx, x)[0]
print("f''(x) = 12x^2")
print(f"f''(2) = {d2y_dx2.item()} (expected: {12*2**2})")
6.5 Limits of Automatic Differentiation and Manual Implementations
Automatic differentiation is convenient, but a manual implementation is sometimes needed for:
- Memory efficiency (gradient checkpointing; see the sketch at the end of this section)
- Custom backward-pass logic
- Improved numerical stability
# A custom autograd function
class MyReLU(torch.autograd.Function):
@staticmethod
def forward(ctx, x):
ctx.save_for_backward(x)
return x.clamp(min=0)
@staticmethod
def backward(ctx, grad_output):
x, = ctx.saved_tensors
grad_input = grad_output.clone()
grad_input[x < 0] = 0
return grad_input
# Usage
x = torch.randn(5, requires_grad=True)
y = MyReLU.apply(x)
loss = y.sum()
loss.backward()
print("Custom ReLU gradients:")
print(f"x: {x.detach()}")
print(f"y: {y.detach()}")
print(f"∂L/∂x: {x.grad}")
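As an example of the memory-efficiency point above, torch.utils.checkpoint trades compute for memory by recomputing a block's activations during the backward pass instead of storing them (a minimal sketch, assuming a recent PyTorch; the block below is a stand-in for a large sub-network):
import torch
from torch.utils.checkpoint import checkpoint
def expensive_block(x):
    # stand-in for a large sub-network whose intermediate activations we don't want to keep
    for _ in range(5):
        x = torch.tanh(x @ torch.ones(100, 100) / 100)
    return x
x = torch.randn(8, 100, requires_grad=True)
y = checkpoint(expensive_block, x, use_reentrant=False)   # activations are recomputed on backward
loss = y.sum()
loss.backward()
print(x.grad.shape)   # gradients still reach the input as usual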
Exercises
Problem 1: Deriving a Matrix Calculus Identity
For $\mathbf{x} \in \mathbb{R}^n$ and $A \in \mathbb{R}^{n \times n}$, prove that:
$$\frac{\partial}{\partial \mathbf{x}} (\mathbf{x}^T A \mathbf{x}) = (A + A^T)\mathbf{x}$$
Derive the result step by step using index notation, and write PyTorch code to verify it.
Problem 2: Logistic Regression Gradient
The loss function of logistic regression is:
$$L(\mathbf{w}) = -\frac{1}{n} \sum_{i=1}^n \left[ y_i \log \sigma(\mathbf{w}^T \mathbf{x}_i) + (1-y_i) \log(1-\sigma(\mathbf{w}^T \mathbf{x}_i)) \right]$$
where $\sigma(z) = 1/(1+e^{-z})$. Derive the gradient $\nabla_\mathbf{w} L$ and show that:
$$\nabla_\mathbf{w} L = \frac{1}{n} X^T (\boldsymbol{\sigma} - \mathbf{y})$$
where $\boldsymbol{\sigma} = [\sigma(\mathbf{w}^T \mathbf{x}_1), \ldots, \sigma(\mathbf{w}^T \mathbf{x}_n)]^T$.
Problem 3: Batch Normalization Gradient
Batch normalization is defined as:
$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
where $\mu = \frac{1}{n}\sum_i x_i$ and $\sigma^2 = \frac{1}{n}\sum_i (x_i - \mu)^2$.
Derive the gradient of a loss $L$ with respect to $x_i$, and verify your result against PyTorch's BatchNorm1d.
Problem 4: Softmax Jacobian
Compute the Jacobian of the softmax function $\sigma(\mathbf{z})_i = e^{z_i} / \sum_j e^{z_j}$:
$$\frac{\partial \sigma_i}{\partial z_j} = ?$$
Hint: treat the cases $i = j$ and $i \neq j$ separately, and show that:
$$\frac{\partial \sigma_i}{\partial z_j} = \sigma_i(\delta_{ij} - \sigma_j)$$
Problem 5: L2 Regularization Gradient
The loss function of ridge regression:
$$L(\mathbf{w}) = \frac{1}{2n}\|\mathbf{y} - X\mathbf{w}\|^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2$$
Derive the gradient and the normal equation. Using PyTorch, compare gradient descent with the closed-form solution.
References
Online Resources
- Matrix Calculus for Deep Learning - a detailed matrix calculus tutorial
- The Matrix Cookbook - a compendium of matrix calculus identities
- PyTorch Autograd Documentation
Textbooks
- Magnus & Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics
- Goodfellow et al., Deep Learning, Chapter 4 (Numerical Computation)
- Boyd & Vandenberghe, Convex Optimization, Appendix A
Papers
- Griewank & Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (2008)
- Baydin et al., Automatic Differentiation in Machine Learning: a Survey (JMLR 2018)