03. Matrix Calculus¶
Learning Objectives¶
- Understand and compute scalar-by-vector derivatives and gradients
- Learn the definition of the Jacobian matrix and how to apply the chain rule
- Understand the meaning of the Hessian matrix and its role in optimization
- Derive and utilize key matrix derivative identities
- Directly derive gradients of machine learning loss functions
- Understand PyTorch's automatic differentiation capabilities and use them for verification
1. Scalar-by-Vector Derivatives¶
1.1 Definition of Gradient¶
For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the gradient is a vector containing all partial derivatives:
$$\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$
The gradient points in the direction of steepest increase of the function.
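As a quick numerical check of this interpretation, the minimal sketch below (the function, point, and step size are chosen only for this example) compares the increase of $f(x, y) = x^2 + 3y^2$ along the normalized gradient direction with the increase along a few random unit directions; the gradient direction gives the largest increase.
import numpy as np
# f(x, y) = x^2 + 3y^2 and its gradient
f = lambda p: p[0]**2 + 3*p[1]**2
grad_f = lambda p: np.array([2*p[0], 6*p[1]])
p = np.array([1.0, 1.0])
eps = 1e-3
g_dir = grad_f(p) / np.linalg.norm(grad_f(p))  # unit vector along the gradient
rng = np.random.default_rng(0)
print(f"increase along gradient direction: {f(p + eps * g_dir) - f(p):.6e}")
for _ in range(3):
    d = rng.standard_normal(2)
    d /= np.linalg.norm(d)  # random unit direction
    print(f"increase along random direction:   {f(p + eps * d) - f(p):.6e}")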
1.2 Basic Examples¶
Example 1: $f(\mathbf{x}) = \mathbf{a}^T \mathbf{x}$
$$\frac{\partial}{\partial \mathbf{x}}(\mathbf{a}^T \mathbf{x}) = \mathbf{a}$$
Example 2: $f(\mathbf{x}) = \mathbf{x}^T \mathbf{x}$
$$\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^T \mathbf{x}) = 2\mathbf{x}$$
Example 3: $f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$ (quadratic form)
$$\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^T A \mathbf{x}) = (A + A^T)\mathbf{x}$$
If $A$ is symmetric, this becomes $2A\mathbf{x}$.
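Before the full quadratic-form example in the next subsection, here is a minimal sketch (with arbitrarily chosen values) checking Examples 1 and 2 with PyTorch autograd.
import torch
# Example 1: f(x) = a^T x, gradient should equal a
a = torch.tensor([1.0, -2.0, 3.0])
x = torch.tensor([0.5, 1.0, -1.0], requires_grad=True)
f = a @ x
f.backward()
print("grad of a^T x:", x.grad, "expected:", a)
# Example 2: f(x) = x^T x, gradient should equal 2x
x = torch.tensor([0.5, 1.0, -1.0], requires_grad=True)
f = x @ x
f.backward()
print("grad of x^T x:", x.grad, "expected:", 2 * x.detach())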
1.3 Python Implementation: Derivative of Quadratic Forms¶
import numpy as np
import torch
from sympy import symbols, Matrix, diff, simplify
# Symbolic computation with SymPy
print("=== SymPy symbolic differentiation ===")
x1, x2 = symbols('x1 x2')
x = Matrix([x1, x2])
A = Matrix([[2, 1], [1, 3]])
# f(x) = x^T A x
f = (x.T * A * x)[0]
print(f"f(x) = {f}")
# Gradient
grad_f = Matrix([diff(f, x1), diff(f, x2)])
print(f"∇f = {simplify(grad_f)}")
print(f"(A + A^T)x = {simplify((A + A.T) * x)}")
# Numerical computation and verification with PyTorch
print("\n=== PyTorch automatic differentiation ===")
x_val = torch.tensor([1.0, 2.0], requires_grad=True)
A_torch = torch.tensor([[2.0, 1.0], [1.0, 3.0]])
# Forward pass
f_val = x_val @ A_torch @ x_val
print(f"f(x) = {f_val.item():.4f}")
# Backward pass
f_val.backward()
print(f"∇f (autograd) = {x_val.grad}")
# Compute with the formula
grad_formula = (A_torch + A_torch.T) @ x_val.detach()
print(f"∇f (formula) = {grad_formula}")
print(f"difference: {torch.norm(x_val.grad - grad_formula).item():.2e}")
1.4 Numerator Layout vs Denominator Layout¶
There are two conventions in matrix calculus:
- Numerator layout: The $(i,j)$ element of $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ is $\frac{\partial y_i}{\partial x_j}$
- Denominator layout: The $(i,j)$ element of $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ is $\frac{\partial y_j}{\partial x_i}$
This document uses the numerator layout.
2. Vector-by-Vector Derivatives: Jacobian¶
2.1 Definition of Jacobian Matrix¶
For a vector function $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian matrix is:
$$J = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$
The size is $m \times n$.
2.2 Chain Rule with Jacobian¶
When $\mathbf{z} = \mathbf{g}(\mathbf{f}(\mathbf{x}))$:
$$\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{z}}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{x}}$$
where $\mathbf{y} = \mathbf{f}(\mathbf{x})$, and the right-hand side is the product of Jacobian matrices.
2.3 Jacobian Computation Example¶
import torch
import numpy as np
# Define a function f: R^2 -> R^3
def vector_function(x):
    """
    f([x1, x2]) = [x1^2 + x2,
                   x1 * x2,
                   sin(x1) + cos(x2)]
    """
    return torch.stack([
        x[0]**2 + x[1],
        x[0] * x[1],
        torch.sin(x[0]) + torch.cos(x[1])
    ])
x = torch.tensor([1.0, 2.0], requires_grad=True)
# Compute the Jacobian with PyTorch
from torch.autograd.functional import jacobian
J = jacobian(vector_function, x)
print("Jacobian matrix (3x2):")
print(J)
# Verify with a manual computation
x1, x2 = x[0].item(), x[1].item()
J_manual = torch.tensor([
    [2*x1, 1],
    [x2, x1],
    [np.cos(x1), -np.sin(x2)]
])
print("\nManual computation:")
print(J_manual)
print(f"\nDifference: {torch.norm(J - J_manual).item():.2e}")
2.4 Chain Rule Practice¶
# Jacobian of a composite function: h(x) = g(f(x))
def f(x):
    """f: R^2 -> R^2"""
    return torch.stack([x[0]**2, x[0] + x[1]])
def g(y):
    """g: R^2 -> R^2"""
    return torch.stack([y[0] * y[1], y[0] - y[1]])
def h(x):
    """h = g ∘ f"""
    return g(f(x))
x = torch.tensor([1.0, 2.0])
# Method 1: compute directly
J_h = jacobian(h, x)
print("J_h (direct):")
print(J_h)
# Method 2: chain rule
J_f = jacobian(f, x)
y = f(x)
J_g = jacobian(g, y)
J_chain = J_g @ J_f
print("\nJ_g @ J_f (chain rule):")
print(J_chain)
print(f"\nDifference: {torch.norm(J_h - J_chain).item():.2e}")
3. Hessian Matrix¶
3.1 Definition of Hessian¶
The Hessian matrix of a scalar function $f: \mathbb{R}^n \to \mathbb{R}$ consists of second-order partial derivatives:
$$H = \nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$
By Schwarz's theorem, if $f$ is of class $C^2$ (its second partial derivatives exist and are continuous), then $H$ is symmetric.
3.2 Properties of Hessian and Optimization¶
- Positive definite: all eigenvalues > 0 → local minimum
- Negative definite: all eigenvalues < 0 → local maximum
- Indefinite: both positive and negative eigenvalues → saddle point
3.3 Role in Newton's Method¶
Newton's method update rule:
$$\mathbf{x}_{k+1} = \mathbf{x}_k - H^{-1}(\mathbf{x}_k) \nabla f(\mathbf{x}_k)$$
It uses the inverse of the Hessian to leverage second-order information.
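As a minimal illustration (the quadratic objective and starting point below are chosen only for this example), Newton's method minimizes a positive definite quadratic in a single step, because the local quadratic model it uses is exact:
import torch
from torch.autograd.functional import hessian
# f(x) = (1/2) x^T A x - b^T x with A positive definite; the minimizer solves A x = b
A = torch.tensor([[3.0, 1.0], [1.0, 2.0]])
b = torch.tensor([1.0, -1.0])
def f(x):
    return 0.5 * x @ A @ x - b @ x
x = torch.tensor([5.0, -3.0], requires_grad=True)
for k in range(3):
    fx = f(x)
    g = torch.autograd.grad(fx, x)[0]   # gradient ∇f(x_k)
    H = hessian(f, x.detach())          # Hessian ∇²f(x_k)
    with torch.no_grad():
        x -= torch.linalg.solve(H, g)   # x_{k+1} = x_k - H^{-1} ∇f(x_k)
    print(f"step {k+1}: x = {x.detach()}, f(x) = {f(x).item():.6f}")
print("exact minimizer A^{-1} b =", torch.linalg.solve(A, b))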
3.4 Hessian Computation Example¶
import torch
import numpy as np
# Define the function f(x, y) = x^2 + xy + 2y^2
def f(x):
    return x[0]**2 + x[0]*x[1] + 2*x[1]**2
x = torch.tensor([1.0, 2.0], requires_grad=True)
# Gradient
y = f(x)
grad = torch.autograd.grad(y, x, create_graph=True)[0]
print("∇f =", grad)
# Hessian (differentiate each gradient component again)
hessian = torch.zeros(2, 2)
for i in range(2):
    hessian[i] = torch.autograd.grad(grad[i], x, retain_graph=True)[0]
print("\nHessian matrix:")
print(hessian)
# Manual computation: H = [[2, 1], [1, 4]]
H_manual = torch.tensor([[2.0, 1.0], [1.0, 4.0]])
print("\nManual computation:")
print(H_manual)
# Classify definiteness via eigenvalues
eigenvalues = torch.linalg.eigvalsh(hessian)
print(f"\nEigenvalues: {eigenvalues}")
print("Positive definite (local minimum):", torch.all(eigenvalues > 0).item())
3.5 Hessian and Convexity¶
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Convex function: f(x, y) = x^2 + 2y^2
x = np.linspace(-3, 3, 50)
y = np.linspace(-3, 3, 50)
X, Y = np.meshgrid(x, y)
Z_convex = X**2 + 2*Y**2
# Saddle-point function: f(x, y) = x^2 - y^2
Z_saddle = X**2 - Y**2
fig = plt.figure(figsize=(14, 6))
# Convex function
ax1 = fig.add_subplot(121, projection='3d')
ax1.plot_surface(X, Y, Z_convex, cmap='viridis', alpha=0.8)
ax1.set_title('Convex function: $f(x,y) = x^2 + 2y^2$\nPositive definite Hessian', fontsize=12)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_zlabel('f(x,y)')
# Saddle point
ax2 = fig.add_subplot(122, projection='3d')
ax2.plot_surface(X, Y, Z_saddle, cmap='plasma', alpha=0.8)
ax2.set_title('Saddle point: $f(x,y) = x^2 - y^2$\nIndefinite Hessian', fontsize=12)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_zlabel('f(x,y)')
plt.tight_layout()
plt.savefig('hessian_surfaces.png', dpi=150)
plt.close()
print("Saved Hessian/surface visualization: hessian_surfaces.png")
4. Matrix Derivative Identities¶
4.1 Key Identities¶
| Function | Derivative |
|---|---|
| $\mathbf{a}^T \mathbf{x}$ | $\mathbf{a}$ |
| $\mathbf{x}^T \mathbf{x}$ | $2\mathbf{x}$ |
| $\mathbf{x}^T A \mathbf{x}$ | $(A + A^T)\mathbf{x}$ |
| $\mathbf{a}^T X \mathbf{b}$ | $\mathbf{a}\mathbf{b}^T$ (w.r.t. $X$) |
| $\text{tr}(AB)$ | $B^T$ (w.r.t. $A$) |
| $\log\lvert\det A\rvert$ | $A^{-T}$ (w.r.t. $A$; transpose of the inverse) |
| $\mathbf{x}^T A^{-1} \mathbf{x}$ | $-A^{-T}\mathbf{x}\mathbf{x}^T A^{-T}$ (w.r.t. $A$) |
4.2 Derivation: $\mathbf{x}^T A \mathbf{x}$¶
Using index notation:
$$f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x} = \sum_{i,j} x_i A_{ij} x_j$$
Taking derivative with respect to $x_k$:
$$\frac{\partial f}{\partial x_k} = \sum_j A_{kj} x_j + \sum_i x_i A_{ik} = (A\mathbf{x})_k + (A^T\mathbf{x})_k$$
Therefore:
$$\nabla f = (A + A^T)\mathbf{x}$$
4.3 Derivation: $\text{tr}(AB)$¶
Using the trace property $\text{tr}(AB) = \sum_{ij} A_{ij} B_{ji}$:
$$\frac{\partial}{\partial A_{kl}} \text{tr}(AB) = \frac{\partial}{\partial A_{kl}} \sum_{ij} A_{ij} B_{ji} = B_{lk}$$
Therefore:
$$\frac{\partial \text{tr}(AB)}{\partial A} = B^T$$
4.4 Identity Verification Code¶
import torch
# Identity 1: ∂(x^T a)/∂x = a
x = torch.randn(5, requires_grad=True)
a = torch.randn(5)
f = x @ a
f.backward()
print("Identity 1: ∂(x^T a)/∂x = a")
print(f"autograd: {x.grad}")
print(f"formula: {a}")
print(f"difference: {torch.norm(x.grad - a).item():.2e}\n")
# Identity 2: ∂(x^T A x)/∂x = (A + A^T)x
x = torch.randn(5, requires_grad=True)
A = torch.randn(5, 5)
f = x @ A @ x
f.backward()
print("Identity 2: ∂(x^T A x)/∂x = (A + A^T)x")
print(f"autograd: {x.grad}")
expected = (A + A.T) @ x.detach()
print(f"formula: {expected}")
print(f"difference: {torch.norm(x.grad - expected).item():.2e}\n")
# Identity 3: ∂tr(AB)/∂A = B^T
A = torch.randn(4, 4, requires_grad=True)
B = torch.randn(4, 4)
f = torch.trace(A @ B)
f.backward()
print("Identity 3: ∂tr(AB)/∂A = B^T")
print(f"autograd:\n{A.grad}")
print(f"formula:\n{B.T}")
print(f"difference: {torch.norm(A.grad - B.T).item():.2e}")
5. Applications of Matrix Calculus in ML¶
5.1 MSE Loss Gradient Derivation¶
In regression problems, the loss function is:
$$L(\mathbf{w}) = \frac{1}{2n} \|\mathbf{y} - X\mathbf{w}\|^2$$
Gradient:
$$\nabla_\mathbf{w} L = -\frac{1}{n} X^T (\mathbf{y} - X\mathbf{w})$$
Derivation:
$$\nabla_\mathbf{w} L = \nabla_\mathbf{w} \frac{1}{2n}(\mathbf{y} - X\mathbf{w})^T(\mathbf{y} - X\mathbf{w})$$
Let $\mathbf{r} = \mathbf{y} - X\mathbf{w}$:
$$\nabla_\mathbf{w} L = \frac{1}{2n} \nabla_\mathbf{w} (\mathbf{r}^T \mathbf{r}) = \frac{1}{2n} \cdot 2 \left(\frac{\partial \mathbf{r}}{\partial \mathbf{w}}\right)^T \mathbf{r} = \frac{1}{n} (-X)^T \mathbf{r} = -\frac{1}{n} X^T (\mathbf{y} - X\mathbf{w})$$
5.2 MSE Gradient Implementation and Verification¶
import torch
import torch.nn as nn
# Generate data
n, d = 100, 10
X = torch.randn(n, d)
y = torch.randn(n)
w = torch.randn(d, requires_grad=True)
# Method 1: PyTorch autograd
pred = X @ w
loss = 0.5 * torch.mean((y - pred)**2)
loss.backward()
grad_autograd = w.grad.clone()
# Method 2: closed-form formula derived above
residual = y - X @ w.detach()
grad_formula = -X.T @ residual / n
print("MSE gradient comparison:")
print(f"autograd: {grad_autograd[:5]}")
print(f"formula: {grad_formula[:5]}")
print(f"difference: {torch.norm(grad_autograd - grad_formula).item():.2e}")
5.3 Softmax Cross-Entropy Gradient¶
Softmax function:
$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
Cross-entropy loss:
$$L = -\sum_i y_i \log \sigma(\mathbf{z})_i$$
Gradient (for one-hot label $\mathbf{y}$):
$$\frac{\partial L}{\partial \mathbf{z}} = \sigma(\mathbf{z}) - \mathbf{y}$$
This concise form is derived from computing the Jacobian of the softmax.
5.4 Softmax Gradient Verification¶
import torch
import torch.nn.functional as F
# Logits and target
logits = torch.randn(5, requires_grad=True)
target_class = 2  # class 2 is the correct label
# Method 1: PyTorch autograd
loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([target_class]))
loss.backward()
grad_autograd = logits.grad.clone()
# Method 2: manual computation
probs = F.softmax(logits.detach(), dim=0)
y_onehot = torch.zeros(5)
y_onehot[target_class] = 1.0
grad_formula = probs - y_onehot
print("Softmax cross-entropy gradient:")
print(f"autograd: {grad_autograd}")
print(f"formula: {grad_formula}")
print(f"difference: {torch.norm(grad_autograd - grad_formula).item():.2e}")
5.5 Backpropagation: Chain of Jacobians¶
Backpropagation in neural networks is the repeated application of the chain rule:
$$\frac{\partial L}{\partial \mathbf{w}_1} = \frac{\partial L}{\partial \mathbf{z}_L} \frac{\partial \mathbf{z}_L}{\partial \mathbf{z}_{L-1}} \cdots \frac{\partial \mathbf{z}_2}{\partial \mathbf{z}_1} \frac{\partial \mathbf{z}_1}{\partial \mathbf{w}_1}$$
Each term is a Jacobian, computed from right to left (reverse mode).
5.6 Linear Layer Gradient Derivation¶
Linear layer: $\mathbf{z} = W\mathbf{x} + \mathbf{b}$
Gradients with respect to loss $L$:
$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \mathbf{z}} \mathbf{x}^T$$
$$\frac{\partial L}{\partial \mathbf{b}} = \frac{\partial L}{\partial \mathbf{z}}$$
$$\frac{\partial L}{\partial \mathbf{x}} = W^T \frac{\partial L}{\partial \mathbf{z}}$$
# Manual implementation of linear-layer gradients
import torch
class LinearLayer:
    def __init__(self, in_dim, out_dim):
        self.W = torch.randn(out_dim, in_dim, requires_grad=False)
        self.b = torch.randn(out_dim, requires_grad=False)
        self.x = None
        self.dW = None
        self.db = None
    def forward(self, x):
        self.x = x
        return self.W @ x + self.b
    def backward(self, dL_dz):
        """dL_dz: gradient of the loss with respect to the output"""
        self.dW = torch.outer(dL_dz, self.x)  # (out_dim, in_dim)
        self.db = dL_dz                       # (out_dim,)
        dL_dx = self.W.T @ dL_dz              # (in_dim,)
        return dL_dx
# Test
layer = LinearLayer(5, 3)
x = torch.randn(5)
z = layer.forward(x)
dL_dz = torch.randn(3)  # dummy upstream gradient
dL_dx = layer.backward(dL_dz)
print("Linear layer backpropagation:")
print(f"dW shape: {layer.dW.shape}")
print(f"db shape: {layer.db.shape}")
print(f"dx shape: {dL_dx.shape}")
# Verify with PyTorch
W_torch = layer.W.clone().requires_grad_(True)
b_torch = layer.b.clone().requires_grad_(True)
x_torch = x.clone().requires_grad_(True)
z_torch = W_torch @ x_torch + b_torch
z_torch.backward(dL_dz)
print(f"\ndW difference: {torch.norm(layer.dW - W_torch.grad).item():.2e}")
print(f"db difference: {torch.norm(layer.db - b_torch.grad).item():.2e}")
print(f"dx difference: {torch.norm(dL_dx - x_torch.grad).item():.2e}")
6. Automatic Differentiation¶
6.1 Forward Mode vs Reverse Mode¶
Forward Mode:
- Propagates derivatives from inputs to outputs
- Efficient when there are few inputs and many outputs (the cost scales with the number of inputs)
- Computes Jacobian-vector products; useful for directional derivatives
Reverse Mode:
- Propagates derivatives from outputs to inputs (backpropagation)
- Efficient when there are many inputs and few outputs (the cost scales with the number of outputs)
- Used in deep learning, where the loss is a single scalar and there are many parameters
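To make the contrast concrete, here is a minimal sketch using torch.autograd.functional.jvp and vjp: forward mode pushes a tangent vector through $\mathbf{f}$ to obtain a Jacobian-vector product $J\mathbf{v}$, while reverse mode pulls a cotangent vector back to obtain a vector-Jacobian product $\mathbf{v}^T J$. The function and vectors below are arbitrary examples; note also that torch.autograd.functional.jvp computes $J\mathbf{v}$ via a double-backward trick rather than true forward-mode propagation.
import torch
from torch.autograd.functional import jvp, vjp
def f(x):
    # f: R^3 -> R^2
    return torch.stack([x[0] * x[1], x[1] + x[2]**2])
x = torch.tensor([1.0, 2.0, 3.0])
v_in = torch.tensor([1.0, 0.0, 0.0])   # tangent vector (forward mode)
v_out = torch.tensor([1.0, 0.0])       # cotangent vector (reverse mode)
# Forward mode: one Jacobian-vector product J v per call
_, Jv = jvp(f, x, v_in)
print("J v   (forward mode):", Jv)
# Reverse mode: one vector-Jacobian product v^T J per call
_, vJ = vjp(f, x, v_out)
print("v^T J (reverse mode):", vJ)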
6.2 Computational Graph¶
A computational graph represents operations as nodes and data flow as edges.
# Computational graph example: f(x, y) = (x + y) * (x - y)
import torch
x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
# Intermediate variables
a = x + y  # a = 5
b = x - y  # b = 1
f = a * b  # f = 5
print("Computational graph:")
print(f"x={x.item()}, y={y.item()}")
print(f"a = x + y = {a.item()}")
print(f"b = x - y = {b.item()}")
print(f"f = a * b = {f.item()}")
# Backward pass
f.backward()
print(f"\n∂f/∂x = {x.grad.item()}")
print(f"∂f/∂y = {y.grad.item()}")
# Manual verification
# f = (x+y)(x-y) = x^2 - y^2
# ∂f/∂x = 2x = 6
# ∂f/∂y = -2y = -4
print(f"\nManual: ∂f/∂x = 2x = {2*x.item()}")
print(f"Manual: ∂f/∂y = -2y = {-2*y.item()}")
6.3 PyTorch Autograd Internals¶
# Tracing the computational graph (simple example)
import torch
x = torch.tensor([1.0, 2.0], requires_grad=True)
w = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
b = torch.tensor([0.5, 1.0], requires_grad=True)
# Forward pass
z = w @ x + b      # linear transformation
a = torch.relu(z)  # activation
loss = a.sum()     # loss
print("Computational graph tracing:")
print(f"grad_fn of z: {z.grad_fn}")
print(f"grad_fn of a: {a.grad_fn}")
print(f"grad_fn of loss: {loss.grad_fn}")
# Backward pass
loss.backward()
print("\nGradients:")
print(f"∂L/∂x: {x.grad}")
print(f"∂L/∂w:\n{w.grad}")
print(f"∂L/∂b: {b.grad}")
6.4 Higher-Order Derivatives¶
# Second derivative (diagonal of the Hessian)
import torch
x = torch.tensor(2.0, requires_grad=True)
y = x**4
# First derivative
dy_dx = torch.autograd.grad(y, x, create_graph=True)[0]
print("f(x) = x^4, f'(x) = 4x^3")
print(f"f'(2) = {dy_dx.item()} (expected: {4*2**3})")
# Second derivative
d2y_dx2 = torch.autograd.grad(dy_dx, x)[0]
print("f''(x) = 12x^2")
print(f"f''(2) = {d2y_dx2.item()} (expected: {12*2**2})")
6.5 Limitations of Automatic Differentiation and Manual Implementation¶
While automatic differentiation is convenient, manual implementation is sometimes necessary:
- Memory efficiency (gradient checkpointing)
- Custom backpropagation logic
- Improved numerical stability
# Custom autograd function
import torch
class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)
    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[x < 0] = 0
        return grad_input
# Usage
x = torch.randn(5, requires_grad=True)
y = MyReLU.apply(x)
loss = y.sum()
loss.backward()
print("Custom ReLU gradient:")
print(f"x: {x.detach()}")
print(f"y: {y.detach()}")
print(f"∂L/∂x: {x.grad}")
Practice Problems¶
Problem 1: Derive Matrix Derivative Identity¶
For $\mathbf{x} \in \mathbb{R}^n$ and $A \in \mathbb{R}^{n \times n}$, prove:
$$\frac{\partial}{\partial \mathbf{x}} (\mathbf{x}^T A \mathbf{x}) = (A + A^T)\mathbf{x}$$
Derive step-by-step using index notation and write code to verify using PyTorch.
Problem 2: Logistic Regression Gradient¶
The loss function for logistic regression is:
$$L(\mathbf{w}) = -\frac{1}{n} \sum_{i=1}^n \left[ y_i \log \sigma(\mathbf{w}^T \mathbf{x}_i) + (1-y_i) \log(1-\sigma(\mathbf{w}^T \mathbf{x}_i)) \right]$$
where $\sigma(z) = 1/(1+e^{-z})$. Derive the gradient $\nabla_\mathbf{w} L$. Show that the result is:
$$\nabla_\mathbf{w} L = \frac{1}{n} X^T (\boldsymbol{\sigma} - \mathbf{y})$$
where $\boldsymbol{\sigma} = [\sigma(\mathbf{w}^T \mathbf{x}_1), \ldots, \sigma(\mathbf{w}^T \mathbf{x}_n)]^T$.
Problem 3: Batch Normalization Gradient¶
Batch normalization is defined as:
$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
where $\mu = \frac{1}{n}\sum_i x_i$ and $\sigma^2 = \frac{1}{n}\sum_i (x_i - \mu)^2$.
Derive the gradient of the loss $L$ with respect to the input $x_i$, i.e. $\partial L / \partial x_i$ in terms of $\partial L / \partial \hat{x}_i$. Verify by comparing with PyTorch's BatchNorm1d.
Problem 4: Softmax Jacobian¶
Compute the Jacobian of the softmax function $\sigma(\mathbf{z})_i = e^{z_i} / \sum_j e^{z_j}$:
$$\frac{\partial \sigma_i}{\partial z_j} = ?$$
Hint: Consider the cases $i=j$ and $i \neq j$ separately. Show that the result is:
$$\frac{\partial \sigma_i}{\partial z_j} = \sigma_i(\delta_{ij} - \sigma_j)$$
Problem 5: L2 Regularization Gradient¶
For ridge regression loss function:
$$L(\mathbf{w}) = \frac{1}{2n}\|\mathbf{y} - X\mathbf{w}\|^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2$$
Derive the gradient and obtain the normal equation. Compare gradient descent and the analytical solution using PyTorch.
References¶
Online Resources¶
- Matrix Calculus for Deep Learning - Detailed matrix calculus tutorial
- The Matrix Cookbook - Matrix derivative formula reference
- PyTorch Autograd Documentation
Textbooks¶
- Magnus & Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics
- Goodfellow et al., Deep Learning, Chapter 6 (Numerical Computation)
- Boyd & Vandenberghe, Convex Optimization, Appendix A
Papers¶
- Griewank & Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (2008)
- Baydin et al., Automatic Differentiation in Machine Learning: a Survey (JMLR 2018)