14. Convexity and Duality¶
Learning Objectives¶
- Understand advanced properties of convex sets and convex functions
- Grasp the principles of Lagrangian duality and weak/strong duality
- Fully understand the derivation of SVM dual problem and its connection to the kernel trick
- Master the concept of Fenchel conjugate and its applications
- Understand proximal operators and proximal gradient methods applied to LASSO
- Learn practical applications of convex optimization and duality in machine learning
1. Review and Deep Dive into Convex Optimization¶
1.1 Convex Sets¶
A set $C$ is convex if: $$ \forall x, y \in C, \forall \theta \in [0, 1]: \quad \theta x + (1-\theta) y \in C $$
Examples:
- Hyperplane: $\{x : a^T x = b\}$
- Halfspace: $\{x : a^T x \leq b\}$
- Norm ball: $\{x : \|x\| \leq r\}$
- Positive semidefinite cone: $\mathbb{S}_+^n$
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon
# Visualize a convex set vs. a non-convex set
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Convex set (polygon)
convex_points = np.array([[0, 0], [2, 0], [2.5, 1], [1.5, 2], [0.5, 1.5]])
axes[0].add_patch(Polygon(convex_points, fill=True, alpha=0.3, color='blue'))
axes[0].plot(convex_points[:, 0], convex_points[:, 1], 'bo-')

# Connect two points
p1, p2 = convex_points[0], convex_points[3]
axes[0].plot([p1[0], p2[0]], [p1[1], p2[1]], 'r-', linewidth=2, label='segment')
axes[0].scatter(*p1, color='red', s=100, zorder=5)
axes[0].scatter(*p2, color='red', s=100, zorder=5)
axes[0].set_title('Convex set: segment between two points stays inside')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
axes[0].set_aspect('equal')
# Non-convex set (crescent shape)
theta = np.linspace(0, 2*np.pi, 100)
outer = np.column_stack([np.cos(theta), np.sin(theta)])
inner = np.column_stack([0.5*np.cos(theta) + 0.3, 0.5*np.sin(theta)])
crescent = np.vstack([outer[:50], inner[50:0:-1]])
axes[1].add_patch(Polygon(crescent, fill=True, alpha=0.3, color='orange'))

# Show the convexity violation
p1, p2 = crescent[10], crescent[-10]
axes[1].plot([p1[0], p2[0]], [p1[1], p2[1]], 'r-', linewidth=2, label='segment (leaves the set)')
axes[1].scatter(*p1, color='red', s=100, zorder=5)
axes[1].scatter(*p2, color='red', s=100, zorder=5)
axes[1].set_title('Non-convex set: segment exits the set')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
axes[1].set_aspect('equal')
plt.tight_layout()
plt.show()
1.2 Convex Functions¶
A function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if: $$ f(\theta x + (1-\theta) y) \leq \theta f(x) + (1-\theta) f(y), \quad \forall x, y, \forall \theta \in [0, 1] $$
First-order condition (if differentiable): $$ f(y) \geq f(x) + \nabla f(x)^T (y - x) $$ (function always lies above its tangent)
Second-order condition (if twice differentiable): $$ \nabla^2 f(x) \succeq 0 \quad \text{(Hessian is positive semidefinite)} $$
# Convex vs. non-convex functions
x = np.linspace(-3, 3, 200)
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# 1. Convex function: f(x) = x^2
axes[0, 0].plot(x, x**2, 'b-', linewidth=2, label='$f(x) = x^2$')
x0 = 1.0
axes[0, 0].plot(x, x0**2 + 2*x0*(x - x0), 'r--', label=f'tangent at x={x0}')
axes[0, 0].scatter([x0], [x0**2], color='red', s=100, zorder=5)
axes[0, 0].set_title('Convex function: lies above its tangents')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# 2. Non-convex function: f(x) = x^3
axes[0, 1].plot(x, x**3, 'b-', linewidth=2, label='$f(x) = x^3$')
x0 = -1.0
axes[0, 1].plot(x, x0**3 + 3*x0**2*(x - x0), 'r--', label=f'tangent at x={x0}')
axes[0, 1].scatter([x0], [x0**3], color='red', s=100, zorder=5)
axes[0, 1].set_title('Non-convex function: dips below its tangent')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
# 3. Strongly convex function: f(x) = x^2 + 2x + 1
axes[1, 0].plot(x, x**2 + 2*x + 1, 'b-', linewidth=2, label='$f(x) = x^2 + 2x + 1$')
minimum = -1  # f'(x) = 2x + 2 = 0
axes[1, 0].scatter([minimum], [minimum**2 + 2*minimum + 1], color='green', s=100,
                   zorder=5, label='global minimum')
axes[1, 0].set_title('Strongly convex function: unique global minimum')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)
# 4. Hessian check
def hessian_demo():
    """Surface plot of a two-variable convex function"""
    x1 = np.linspace(-2, 2, 50)
    x2 = np.linspace(-2, 2, 50)
    X1, X2 = np.meshgrid(x1, x2)
    # Convex: f(x1, x2) = x1^2 + x2^2
    Z = X1**2 + X2**2
    ax = fig.add_subplot(2, 2, 4, projection='3d')
    ax.plot_surface(X1, X2, Z, cmap='viridis', alpha=0.7)
    ax.set_title('Bivariate convex function: $f(x_1, x_2) = x_1^2 + x_2^2$')
    ax.set_xlabel('$x_1$')
    ax.set_ylabel('$x_2$')
    ax.set_zlabel('$f(x_1, x_2)$')

hessian_demo()
plt.tight_layout()
plt.show()
# Check the signs of the Hessian eigenvalues
def check_convexity(hessian, x):
    """Check convexity via the second-order condition"""
    H = hessian(x)
    eigenvalues = np.linalg.eigvalsh(H)  # H is symmetric
    print(f"At point {x}:")
    print(f"  Hessian eigenvalues: {eigenvalues}")
    print(f"  All >= 0? {np.all(eigenvalues >= -1e-10)}")
    return np.all(eigenvalues >= -1e-10)

# Example: f(x) = x^T A x (A positive definite), Hessian = 2A
A = np.array([[2, 0], [0, 3]])
hessian = lambda x: 2 * A
x_test = np.array([1.0, 1.0])
is_convex = check_convexity(hessian, x_test)
print(f"\nThe function x^T A x is convex: {is_convex}")
1.3 Epigraph Perspective¶
The epigraph of function $f$: $$ \text{epi}(f) = \{(x, t) : f(x) \leq t\} $$
Theorem: $f$ is convex $\Leftrightarrow$ $\text{epi}(f)$ is a convex set
This perspective is useful for geometric understanding of convex functions.
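The theorem is easy to probe numerically: sample points in $\text{epi}(f)$ for the convex function $f(x) = x^2$ and check that random convex combinations stay in the epigraph. A small sketch:

```python
import numpy as np

# For f(x) = x^2, take random points (x, t) with t >= f(x) and verify
# that convex combinations of epigraph points remain in epi(f).
rng = np.random.default_rng(0)
f = lambda x: x**2

inside = True
for _ in range(1000):
    x1, x2 = rng.uniform(-3, 3, size=2)
    t1 = f(x1) + rng.uniform(0, 2)   # (x1, t1) in epi(f)
    t2 = f(x2) + rng.uniform(0, 2)   # (x2, t2) in epi(f)
    th = rng.uniform()
    xm = th * x1 + (1 - th) * x2     # convex combination
    tm = th * t1 + (1 - th) * t2
    inside = inside and (f(xm) <= tm + 1e-12)
print(inside)  # True: epi(f) is convex
```

The converse direction holds as well: running the same check on a non-convex $f$ (e.g. $x^3$) produces violations.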
1.4 Importance of Convex Optimization¶
Key property: Local minimum = Global minimum
from scipy.optimize import minimize

# Convex function: the same minimum from any starting point
def convex_func(x):
    return x[0]**2 + x[1]**2 + 2*x[0] + 3*x[1]

starting_points = [
    np.array([10.0, 10.0]),
    np.array([-5.0, 5.0]),
    np.array([0.0, -8.0])
]

print("Convex optimization:")
for i, x0 in enumerate(starting_points):
    result = minimize(convex_func, x0, method='BFGS')
    print(f"Start {i+1} {x0}: minimizer {result.x}, value {result.fun:.4f}")

# Non-convex function: result depends on the starting point
def nonconvex_func(x):
    return np.sin(x[0]) + np.cos(x[1]) + 0.1*(x[0]**2 + x[1]**2)

print("\nNon-convex optimization:")
for i, x0 in enumerate(starting_points):
    result = minimize(nonconvex_func, x0, method='BFGS')
    print(f"Start {i+1} {x0}: minimizer {result.x}, value {result.fun:.4f}")
2. Lagrangian Duality¶
2.1 Primal Problem¶
General optimization problem: $$ \begin{align} \min_{x} \quad & f(x) \\ \text{s.t.} \quad & g_i(x) \leq 0, \quad i = 1, \ldots, m \\ & h_j(x) = 0, \quad j = 1, \ldots, p \end{align} $$
- $f$: objective function
- $g_i$: inequality constraints
- $h_j$: equality constraints
2.2 Lagrangian¶
Lagrangian function: $$ \mathcal{L}(x, \lambda, \nu) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) + \sum_{j=1}^{p} \nu_j h_j(x) $$
- $\lambda_i \geq 0$: Lagrange multipliers for inequality constraints
- $\nu_j$: Lagrange multipliers for equality constraints
# Simple example: min x^2 + y^2 s.t. x + y = 1
def primal_objective(x):
    return x[0]**2 + x[1]**2

def equality_constraint(x):
    return x[0] + x[1] - 1

def lagrangian(x, nu):
    """Lagrangian: L(x, nu) = x^2 + y^2 + nu*(x + y - 1)"""
    return primal_objective(x) + nu * equality_constraint(x)

# Visualization
x_range = np.linspace(-1, 2, 100)
y_range = np.linspace(-1, 2, 100)
X, Y = np.meshgrid(x_range, y_range)
Z = X**2 + Y**2

plt.figure(figsize=(10, 8))
plt.contour(X, Y, Z, levels=20, alpha=0.6)
plt.plot(x_range, 1 - x_range, 'r-', linewidth=3, label='constraint: x + y = 1')

# Optimal solution (from the Lagrange conditions)
# dL/dx = 2x + nu = 0, dL/dy = 2y + nu = 0, x + y = 1
# => x = y = 1/2
optimal_x = np.array([0.5, 0.5])
plt.scatter(*optimal_x, color='green', s=200, zorder=5, label=f'optimum {optimal_x}')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Constrained optimization: method of Lagrange multipliers')
plt.legend()
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.show()

print(f"Optimal solution: x = {optimal_x}")
print(f"Optimal value: f(x*) = {primal_objective(optimal_x):.4f}")
2.3 Dual Function¶
Dual function: $$ g(\lambda, \nu) = \inf_{x} \mathcal{L}(x, \lambda, \nu) $$
Key property: $g(\lambda, \nu)$ is always concave (even if the primal problem is not convex!)
2.4 Weak Duality¶
Theorem: For any $\lambda \geq 0, \nu$ $$ g(\lambda, \nu) \leq p^* $$ where $p^*$ is the optimal value of the primal problem.
Proof: $$ g(\lambda, \nu) = \inf_{x} \mathcal{L}(x, \lambda, \nu) \leq \mathcal{L}(x^*, \lambda, \nu) = f(x^*) + \sum \lambda_i \underbrace{g_i(x^*)}_{\leq 0} + \sum \nu_j \underbrace{h_j(x^*)}_{=0} \leq f(x^*) = p^* $$
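Weak duality can be checked numerically on the toy problem from 2.2 ($\min x^2 + y^2$ s.t. $x + y = 1$, with $p^* = 1/2$): a sketch that evaluates $g(\nu)$ by solving the inner unconstrained minimization with scipy and confirms $g(\nu) \leq p^*$ everywhere, with equality at $\nu^* = -1$.

```python
import numpy as np
from scipy.optimize import minimize

p_star = 0.5  # primal optimum of min x^2 + y^2 s.t. x + y = 1 (at x = y = 1/2)

def dual_g(nu):
    # g(nu) = inf_{x,y} x^2 + y^2 + nu*(x + y - 1): an unconstrained
    # convex quadratic, solved numerically here (closed form: -nu^2/2 - nu)
    res = minimize(lambda v: v[0]**2 + v[1]**2 + nu*(v[0] + v[1] - 1), np.zeros(2))
    return res.fun

nus = np.linspace(-4, 2, 61)
g_vals = np.array([dual_g(nu) for nu in nus])
print(np.all(g_vals <= p_star + 1e-6))             # True: weak duality
print(np.isclose(g_vals.max(), p_star, atol=1e-4)) # True: strong duality at nu* = -1
```

Note that $g$ is concave in $\nu$ even though it is built from a pointwise infimum, matching the key property above.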
2.5 Strong Duality¶
Dual problem: $$ \max_{\lambda \geq 0, \nu} \quad g(\lambda, \nu) $$
Strong duality: $p^* = d^*$ (optimal values are equal)
Slater's condition (sufficient condition): - Primal problem is convex - There exists feasible $x$ such that all $g_i(x) < 0$ (strict inequality)
# Strong duality example: quadratic program (QP)
import cvxpy as cp

# Primal problem: min 1/2 x^T Q x + c^T x, s.t. Ax <= b
Q = np.array([[2, 0], [0, 2]])
c = np.array([1, 1])
A = np.array([[1, 1], [-1, 0], [0, -1]])
b = np.array([1, 0, 0])

# Solve the primal problem with CVXPY
x_primal = cp.Variable(2)
objective_primal = cp.Minimize(0.5 * cp.quad_form(x_primal, Q) + c @ x_primal)
constraints_primal = [A @ x_primal <= b]
prob_primal = cp.Problem(objective_primal, constraints_primal)
p_star = prob_primal.solve()

print("Primal problem:")
print(f"  optimal value: p* = {p_star:.4f}")
print(f"  optimal solution: x* = {x_primal.value}")

# Dual problem (derivation omitted): minimizing the Lagrangian over x gives
# x = -Q^{-1}(c + A^T lambda), so the dual is
#   max_{lambda >= 0}  -1/2 (c + A^T lambda)^T Q^{-1} (c + A^T lambda) - b^T lambda
lambda_dual = cp.Variable(3)
Q_inv = np.linalg.inv(Q)
objective_dual = cp.Maximize(
    -0.5 * cp.quad_form(c + A.T @ lambda_dual, Q_inv) - b @ lambda_dual
)
constraints_dual = [lambda_dual >= 0]
prob_dual = cp.Problem(objective_dual, constraints_dual)
d_star = prob_dual.solve()

print("\nDual problem:")
print(f"  optimal value: d* = {d_star:.4f}")
print(f"  optimal multipliers: lambda* = {lambda_dual.value}")
print(f"\nStrong duality check: p* - d* = {p_star - d_star:.6f} ≈ 0")
3. SVM Dual Problem¶
3.1 Primal Problem¶
Hard-margin SVM: $$ \begin{align} \min_{w, b} \quad & \frac{1}{2} \|w\|^2 \\ \text{s.t.} \quad & y_i (w^T x_i + b) \geq 1, \quad i = 1, \ldots, n \end{align} $$
Goal: Maximize margin $\frac{2}{\|w\|}$ $\Leftrightarrow$ Minimize $\|w\|^2$
3.2 Constructing the Lagrangian¶
$$ \mathcal{L}(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i [y_i(w^T x_i + b) - 1] $$
where $\alpha_i \geq 0$ are Lagrange multipliers.
3.3 KKT Conditions¶
1. Stationarity: $$ \nabla_w \mathcal{L} = w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0 \quad \Rightarrow \quad w = \sum_{i=1}^{n} \alpha_i y_i x_i $$
$$ \frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^{n} \alpha_i y_i = 0 \quad \Rightarrow \quad \sum_{i=1}^{n} \alpha_i y_i = 0 $$
2. Complementary Slackness: $$ \alpha_i [y_i(w^T x_i + b) - 1] = 0 $$
3. Primal/Dual Feasibility: $y_i(w^T x_i + b) \geq 1$, $\alpha_i \geq 0$
3.4 Deriving the Dual Problem¶
Substituting $w = \sum \alpha_i y_i x_i$ into the Lagrangian:
$$ \begin{align} \mathcal{L}(w, b, \alpha) &= \frac{1}{2} \left\|\sum_{i} \alpha_i y_i x_i\right\|^2 - \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i^T x_j) - b\sum_i \alpha_i y_i + \sum_i \alpha_i \\ &= -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i^T x_j) + \sum_i \alpha_i \end{align} $$
Dual problem: $$ \begin{align} \max_{\alpha} \quad & \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i^T x_j) \\ \text{s.t.} \quad & \sum_{i=1}^{n} \alpha_i y_i = 0 \\ & \alpha_i \geq 0, \quad i = 1, \ldots, n \end{align} $$
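The dual above is a quadratic program in $\alpha$. On a tiny hand-built dataset it can be solved with a generic solver and $w, b$ recovered from the KKT conditions. A minimal sketch (made-up four-point data, scipy's SLSQP instead of a dedicated QP solver; here the support vectors are $(-1,0)$ and $(1,0)$, so the max-margin hyperplane is $x_1 = 0$):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[-1.0, 0.0], [1.0, 0.0], [-2.0, 1.0], [2.0, -1.0]])
y = np.array([-1.0, 1.0, -1.0, 1.0])

K = X @ X.T             # Gram matrix of inner products x_i^T x_j
Q = np.outer(y, y) * K  # Q_ij = y_i y_j x_i^T x_j

def neg_dual(alpha):    # negative dual objective (to minimize)
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(neg_dual, 0.1 * np.ones(4), method='SLSQP',
               bounds=[(0, None)] * 4,
               constraints={'type': 'eq', 'fun': lambda a: a @ y})
alpha = res.x

w = (alpha * y) @ X              # stationarity: w = sum_i alpha_i y_i x_i
sv = alpha > 1e-5                # support vectors have alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)   # b from the support vectors
print(w, b)                      # approx. [1. 0.], 0.0
print(y * (X @ w + b))           # all margins >= 1
```

Only the two nearest points receive nonzero $\alpha_i$, illustrating complementary slackness: non-support vectors drop out of the solution.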
from sklearn.svm import SVC
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
# Simple 2D data
np.random.seed(42)
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, flip_y=0.1,
                           class_sep=1.5, random_state=42)
y = 2*y - 1  # {0, 1} -> {-1, 1}

# Train SVM (linear kernel)
svm = SVC(kernel='linear', C=1000)  # large C approximates a hard margin
svm.fit(X, y)
# Results
w = svm.coef_[0]
b = svm.intercept_[0]
support_vectors = svm.support_vectors_
alphas = svm.dual_coef_[0]  # alpha_i * y_i

print("SVM results:")
print(f"  weights w: {w}")
print(f"  intercept b: {b}")
print(f"  number of support vectors: {len(support_vectors)}")
print(f"  dual coefficients (alpha_i * y_i): {alphas[:5]}...")  # first 5 only
# Visualization
plt.figure(figsize=(10, 8))
plt.scatter(X[y==1, 0], X[y==1, 1], c='blue', marker='o', label='class +1')
plt.scatter(X[y==-1, 0], X[y==-1, 1], c='red', marker='s', label='class -1')
plt.scatter(support_vectors[:, 0], support_vectors[:, 1], s=200, linewidth=2,
            facecolors='none', edgecolors='green', label='support vectors')
# Decision boundary
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
x_plot = np.linspace(x_min, x_max, 100)
y_plot = -(w[0] * x_plot + b) / w[1]
plt.plot(x_plot, y_plot, 'k-', linewidth=2, label='decision boundary')

# Margin lines (vertical offset 1/|w_2|)
margin = 1 / np.linalg.norm(w)
y_plot_margin_up = y_plot + margin * np.sqrt(1 + (w[0]/w[1])**2)
y_plot_margin_down = y_plot - margin * np.sqrt(1 + (w[0]/w[1])**2)
plt.plot(x_plot, y_plot_margin_up, 'k--', linewidth=1, alpha=0.5)
plt.plot(x_plot, y_plot_margin_down, 'k--', linewidth=1, alpha=0.5)
plt.fill_between(x_plot, y_plot_margin_down, y_plot_margin_up, alpha=0.1, color='gray')
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.title(f'SVM: margin = {2*margin:.3f}')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
3.5 Natural Emergence of the Kernel Trick¶
In the dual problem, data appears only through inner products $x_i^T x_j$:
$$ \max_{\alpha} \quad \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \underbrace{(x_i^T x_j)}_{k(x_i, x_j)} $$
Kernel trick: $k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ (high-dimensional feature space)
Example: Polynomial kernel $k(x, x') = (x^T x' + 1)^2$
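For the 2D polynomial kernel this identity can be verified directly: expanding $(x^T z + 1)^2$ gives the explicit feature map $\phi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1 x_2)$. A quick numerical check:

```python
import numpy as np

def poly_kernel(x, z):
    return (x @ z + 1.0)**2

def phi(x):
    # Explicit feature map for the 2D polynomial kernel (x^T z + 1)^2
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2)*x1, np.sqrt(2)*x2,
                     x1**2, x2**2,
                     np.sqrt(2)*x1*x2])

rng = np.random.default_rng(0)
for _ in range(5):
    x, z = rng.normal(size=2), rng.normal(size=2)
    assert np.isclose(poly_kernel(x, z), phi(x) @ phi(z))
print("k(x, z) = phi(x)^T phi(z) verified")
```

The point of the trick is that the left-hand side costs $O(d)$ while the right-hand side works in a 6-dimensional (in general, much larger) feature space that is never constructed explicitly.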
# Nonlinear SVM (RBF kernel)
from sklearn.datasets import make_moons
X_nonlinear, y_nonlinear = make_moons(n_samples=200, noise=0.15, random_state=42)
y_nonlinear = 2*y_nonlinear - 1
svm_rbf = SVC(kernel='rbf', gamma=2, C=1.0)
svm_rbf.fit(X_nonlinear, y_nonlinear)
# Visualize the decision boundary
h = 0.02
x_min, x_max = X_nonlinear[:, 0].min() - 0.5, X_nonlinear[:, 0].max() + 0.5
y_min, y_max = X_nonlinear[:, 1].min() - 0.5, X_nonlinear[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = svm_rbf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(figsize=(10, 8))
plt.contourf(xx, yy, Z, levels=20, cmap='RdBu', alpha=0.6)
plt.contour(xx, yy, Z, levels=[0], colors='black', linewidths=2)
plt.scatter(X_nonlinear[y_nonlinear==1, 0], X_nonlinear[y_nonlinear==1, 1],
            c='blue', marker='o', edgecolors='k', label='class +1')
plt.scatter(X_nonlinear[y_nonlinear==-1, 0], X_nonlinear[y_nonlinear==-1, 1],
            c='red', marker='s', edgecolors='k', label='class -1')
plt.scatter(svm_rbf.support_vectors_[:, 0], svm_rbf.support_vectors_[:, 1],
            s=200, linewidth=2, facecolors='none', edgecolors='green', label='support vectors')
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.title('Nonlinear SVM (RBF kernel)')
plt.legend()
plt.show()
print(f"Number of support vectors: {len(svm_rbf.support_vectors_)}")
4. Fenchel Conjugate¶
4.1 Definition¶
The Fenchel conjugate of function $f: \mathbb{R}^n \to \mathbb{R}$: $$ f^*(y) = \sup_{x} \left( y^T x - f(x) \right) $$
Intuition: for each slope $y$, $f^*(y)$ is the largest gap $y^T x - f(x)$; equivalently, $-f^*(y)$ is the intercept of the supporting line of $f$ with slope $y$.
4.2 Geometric Interpretation¶
# Visualize the Fenchel conjugate
def fenchel_conjugate_demo():
    x = np.linspace(-3, 3, 200)
    f_x = x**2  # f(x) = x^2

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Original function
    axes[0].plot(x, f_x, 'b-', linewidth=2, label='$f(x) = x^2$')

    # Supporting lines for several slopes
    slopes = [-2, 0, 2, 4]
    colors = ['red', 'green', 'orange', 'purple']
    for y, color in zip(slopes, colors):
        # f*(y) = sup_x (yx - x^2)
        # maximize: d/dx (yx - x^2) = y - 2x = 0 => x* = y/2
        x_star = y / 2
        f_star_y = y * x_star - x_star**2  # = y^2/4
        # supporting line: yx - f*(y)
        tangent = y * x - f_star_y
        axes[0].plot(x, tangent, color=color, linestyle='--', alpha=0.7,
                     label=f'y={y}: $f^*({y})={f_star_y:.2f}$')
        axes[0].scatter([x_star], [x_star**2], color=color, s=100, zorder=5)

    axes[0].set_xlabel('x')
    axes[0].set_ylabel('f(x)')
    axes[0].set_title('Fenchel conjugate: supporting-line view')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    axes[0].set_ylim(-2, 10)

    # Conjugate function
    y_range = np.linspace(-4, 4, 100)
    f_star = y_range**2 / 4  # f*(y) = y^2/4 for f(x) = x^2
    axes[1].plot(y_range, f_star, 'b-', linewidth=2, label='$f^*(y) = y^2/4$')
    axes[1].scatter(slopes, [s**2/4 for s in slopes], color=colors, s=100, zorder=5)
    axes[1].set_xlabel('y (slope)')
    axes[1].set_ylabel('$f^*(y)$')
    axes[1].set_title('Fenchel conjugate function')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

fenchel_conjugate_demo()
4.3 Important Examples¶
1. Conjugate of the squared Euclidean norm: $f(x) = \frac{1}{2}\|x\|_2^2$ $$ f^*(y) = \frac{1}{2}\|y\|_2^2 \quad \text{(self-conjugate)} $$
2. Conjugate of indicator function: $f(x) = I_C(x)$ (indicator function of set $C$) $$ f^*(y) = \sup_{x \in C} y^T x \quad \text{(support function)} $$
3. Conjugate of log-sum-exp: $f(x) = \log(\sum_i e^{x_i})$ $$ f^*(y) = \sum_i y_i \log y_i \quad \text{(negative entropy, for } y \geq 0, \textstyle\sum_i y_i = 1\text{; } +\infty \text{ otherwise)} $$
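The log-sum-exp example can be spot-checked numerically: for a fixed $y$ in the probability simplex, the supremum defining $f^*(y)$ is a smooth concave maximization. A sketch (the probe point $y$ is arbitrary):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

# For y in the simplex, f*(y) = sup_x (y^T x - logsumexp(x)) should
# equal sum_i y_i log y_i (the negative entropy of y).
y = np.array([0.2, 0.3, 0.5])
res = minimize(lambda x: -(y @ x - logsumexp(x)), np.zeros(3))
f_star_numeric = -res.fun
f_star_closed = np.sum(y * np.log(y))
print(f_star_numeric, f_star_closed)  # both approx. -1.0297
```

The maximizer is not unique (the objective is invariant to adding a constant to all coordinates of $x$ precisely because $\sum_i y_i = 1$), but the optimal value is, and one maximizer is $x = \log y$.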
# Fenchel conjugate computation examples
def conjugate_squared_norm(y):
    """Conjugate of f(x) = 1/2 ||x||^2"""
    return 0.5 * np.linalg.norm(y)**2

def conjugate_indicator_ball(y, radius=1.0):
    """Conjugate of f(x) = I_{||x|| <= r} (the support function)"""
    return radius * np.linalg.norm(y)

# Verification: f**(x) = f(x) (biconjugation)
from scipy.optimize import minimize_scalar

x_test = np.array([1.0, 2.0])
f_x = 0.5 * np.linalg.norm(x_test)**2

def double_conjugate(x):
    """Biconjugate f**(x) = sup_y (x^T y - f*(y)), computed numerically"""
    # reduce to 1D: the optimal y is aligned with x
    def negative_objective(y_norm):
        y = y_norm * x / np.linalg.norm(x)  # fix the direction
        return -(np.dot(x, y) - conjugate_squared_norm(y))
    result = minimize_scalar(negative_objective, bounds=(0, 10), method='bounded')
    return -result.fun

f_double_star_x = double_conjugate(x_test)
print("Biconjugation theorem check:")
print(f"  f(x) = {f_x:.4f}")
print(f"  f**(x) = {f_double_star_x:.4f}")
print(f"  match: {np.isclose(f_x, f_double_star_x)}")
5. Proximal Operator¶
5.1 Definition¶
The proximal operator of function $f$: $$ \text{prox}_f(v) = \arg\min_{x} \left( f(x) + \frac{1}{2}\|x - v\|^2 \right) $$
Intuition: Find $x$ that is close to $v$ while minimizing $f$
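The definition can be sanity-checked against a function whose proximal operator is known in closed form: for $f(x) = \frac{\lambda}{2}x^2$, setting the derivative of $f(x) + \frac{1}{2}(x-v)^2$ to zero gives $\text{prox}_f(v) = v/(1+\lambda)$. A quick numerical check:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# prox of f(x) = (lam/2) x^2: the argmin of f(x) + 1/2 (x - v)^2
# has the closed form v / (1 + lam).
lam, v = 2.0, 3.0
res = minimize_scalar(lambda x: 0.5 * lam * x**2 + 0.5 * (x - v)**2)
print(res.x, v / (1 + lam))  # both approx. 1.0
```

The same pattern (numerical argmin vs. closed form) works for the L1 prox of the next subsection.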
5.2 Proximal Operator of L1 Norm (Soft Thresholding)¶
For $f(x) = \lambda \|x\|_1$: $$ [\text{prox}_{\lambda \|\cdot\|_1}(v)]_i = \text{sign}(v_i) \max(|v_i| - \lambda, 0) $$
This is called soft thresholding.
def soft_threshold(v, lambda_param):
    """L1 proximal operator (soft thresholding)"""
    return np.sign(v) * np.maximum(np.abs(v) - lambda_param, 0)

# Visualization
v = np.linspace(-3, 3, 200)
lambda_values = [0.5, 1.0, 1.5]

plt.figure(figsize=(10, 6))
plt.plot(v, v, 'k--', label='identity (λ=0)', alpha=0.5)
for lam in lambda_values:
    prox_v = soft_threshold(v, lam)
    plt.plot(v, prox_v, linewidth=2, label=f'λ={lam}')
plt.xlabel('v')
plt.ylabel('prox(v)')
plt.title('L1 proximal operator (soft thresholding)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.axhline(0, color='black', linewidth=0.5)
plt.axvline(0, color='black', linewidth=0.5)
plt.show()
5.3 Proximal Gradient Descent¶
Problem: $\min f(x) + g(x)$, where $f$ is smooth and $g$ is non-smooth (e.g., L1)
Algorithm: $$ x_{k+1} = \text{prox}_{\alpha g}(x_k - \alpha \nabla f(x_k)) $$
# LASSO problem: min 1/2 ||Ax - b||^2 + λ||x||_1
def lasso_proximal_gradient(A, b, lambda_param, max_iter=1000, alpha=0.01):
    """Solve LASSO with proximal gradient descent"""
    n = A.shape[1]
    x = np.zeros(n)
    losses = []
    for k in range(max_iter):
        # gradient of the smooth part: grad(1/2 ||Ax - b||^2) = A^T (Ax - b)
        grad_f = A.T @ (A @ x - b)
        # gradient step
        x_temp = x - alpha * grad_f
        # proximal step (L1)
        x = soft_threshold(x_temp, alpha * lambda_param)
        # loss at the new iterate
        loss = 0.5 * np.sum((A @ x - b)**2) + lambda_param * np.sum(np.abs(x))
        losses.append(loss)
        if k > 0 and abs(losses[-1] - losses[-2]) < 1e-6:
            break
    return x, losses

# Test: sparse regression
np.random.seed(42)
n, p = 100, 50
A = np.random.randn(n, p)
x_true = np.zeros(p)
x_true[:10] = np.random.randn(10)  # only 10 nonzero coefficients
b = A @ x_true + 0.1 * np.random.randn(n)

lambda_param = 0.1
x_lasso, losses = lasso_proximal_gradient(A, b, lambda_param, max_iter=1000, alpha=0.001)
# Results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss curve
axes[0].plot(losses)
axes[0].set_xlabel('iteration')
axes[0].set_ylabel('loss')
axes[0].set_title('LASSO proximal gradient descent: convergence')
axes[0].set_yscale('log')
axes[0].grid(True, alpha=0.3)

# Coefficient comparison
axes[1].stem(x_true, linefmt='b-', markerfmt='bo', basefmt=' ', label='true coefficients')
axes[1].stem(x_lasso, linefmt='r-', markerfmt='rs', basefmt=' ', label='estimated (LASSO)')
axes[1].set_xlabel('coefficient index')
axes[1].set_ylabel('value')
axes[1].set_title(f'LASSO result (λ={lambda_param})')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Number of nonzero coefficients:")
print(f"  true: {np.sum(x_true != 0)}")
print(f"  estimated: {np.sum(np.abs(x_lasso) > 1e-3)}")
6. Machine Learning Applications¶
6.1 Dual Interpretation of Logistic Regression¶
Logistic regression: $$ \min_{w} \sum_{i=1}^{n} \log(1 + e^{-y_i w^T x_i}) + \frac{\lambda}{2} \|w\|^2 $$
The dual problem connects to entropy maximization.
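A sketch of this connection, assuming the standard dual form from the literature (not derived in the text): the dual variables $\alpha_i \in (0, 1)$ maximize a sum of binary entropies minus a quadratic coupling term, and the primal weights are recovered as $w^* = \frac{1}{\lambda}\sum_i \alpha_i y_i x_i$. The data below is synthetic.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, lam = 40, 3, 0.5
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d) + 0.1 * rng.normal(size=n))

def primal(w):
    # sum_i log(1 + exp(-y_i w^T x_i)) + lam/2 ||w||^2
    return np.sum(np.logaddexp(0, -y * (X @ w))) + 0.5 * lam * w @ w

def neg_dual(a):
    # dual: sum_i H(a_i) - 1/(2 lam) || sum_i a_i y_i x_i ||^2,
    # where H is the binary entropy (maximizing entropy, as claimed above)
    v = X.T @ (a * y)
    ent = -np.sum(a * np.log(a) + (1 - a) * np.log(1 - a))
    return -(ent - v @ v / (2 * lam))

w_p = minimize(primal, np.zeros(d)).x
res = minimize(neg_dual, 0.5 * np.ones(n), method='L-BFGS-B',
               bounds=[(1e-9, 1 - 1e-9)] * n)
w_d = X.T @ (res.x * y) / lam    # recover w from the dual solution

print(primal(w_p), -res.fun)     # p* and d*: equal under strong duality
print(np.allclose(w_p, w_d, atol=1e-3))
```

The near-zero duality gap and matching weight vectors illustrate the claim: the regularized logistic primal and its entropy-maximization dual solve the same problem.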
6.2 ADMM (Alternating Direction Method of Multipliers)¶
Problem: $\min f(x) + g(z)$ s.t. $Ax + Bz = c$
ADMM iterations:
1. $x_{k+1} = \arg\min_x \mathcal{L}_\rho(x, z_k, \lambda_k)$
2. $z_{k+1} = \arg\min_z \mathcal{L}_\rho(x_{k+1}, z, \lambda_k)$
3. $\lambda_{k+1} = \lambda_k + \rho(Ax_{k+1} + Bz_{k+1} - c)$
where $\mathcal{L}_\rho$ is the augmented Lagrangian.
# Solve LASSO with ADMM
def lasso_admm(A, b, lambda_param, rho=1.0, max_iter=100):
    """ADMM for LASSO: min 1/2||Ax-b||^2 + λ||z||_1, s.t. x = z"""
    n, p = A.shape
    x = np.zeros(p)
    z = np.zeros(p)
    u = np.zeros(p)  # scaled dual variable

    # Precompute
    AtA = A.T @ A
    Atb = A.T @ b
    I = np.eye(p)

    losses = []
    for k in range(max_iter):
        # x-update: (A^T A + rho*I) x = A^T b + rho*(z - u)
        x = np.linalg.solve(AtA + rho * I, Atb + rho * (z - u))
        # z-update: soft thresholding
        z_old = z.copy()
        z = soft_threshold(x + u, lambda_param / rho)
        # u-update (dual variable)
        u = u + x - z
        # loss
        loss = 0.5 * np.linalg.norm(A @ x - b)**2 + lambda_param * np.linalg.norm(z, 1)
        losses.append(loss)
        # convergence check
        if np.linalg.norm(z - z_old) < 1e-4:
            break
    return x, z, losses
# Comparison: proximal gradient vs. ADMM
x_admm, z_admm, losses_admm = lasso_admm(A, b, lambda_param, rho=1.0, max_iter=100)

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(losses, label='proximal gradient descent')
plt.plot(losses_admm, label='ADMM')
plt.xlabel('iteration')
plt.ylabel('loss')
plt.title('Convergence speed comparison')
plt.legend()
plt.yscale('log')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.stem(x_lasso, linefmt='b-', markerfmt='bo', basefmt=' ', label='Proximal GD')
plt.stem(z_admm, linefmt='r-', markerfmt='rs', basefmt=' ', label='ADMM')
plt.xlabel('coefficient index')
plt.ylabel('value')
plt.title('Final coefficient comparison')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Coefficient difference (L2): {np.linalg.norm(x_lasso - z_admm):.6f}")
6.3 Real-World Applications¶
- SVM: Efficient solution for large-scale classification via dual problem
- L1 regularization: Proximal operators in LASSO, Elastic Net
- Distributed optimization: ADMM for large-scale parallel data processing
- Deep learning: Fenchel duality of f-divergences underlies f-GAN training
Practice Problems¶
Problem 1: Proving Convexity¶
Verify whether the following functions are convex:
(a) $f(x) = e^x$
(b) $f(x) = -\log(x)$ for $x > 0$
(c) $f(x, y) = x^2 + xy + y^2$
Prove each using the second-order condition (Hessian).
Problem 2: Lagrangian Duality¶
Derive the Lagrangian and dual function for the following problem: $$ \min x^2 + y^2 \quad \text{s.t.} \quad x + y \geq 1, \quad x, y \geq 0 $$
Apply KKT conditions to find the optimal solution.
Problem 3: SVM Implementation¶
Implement linear SVM from scratch (cvxpy allowed).
(a) Solve both the primal and dual problems
(b) Verify that the two solutions satisfy strong duality
(c) Identify and visualize the support vectors
Problem 4: Fenchel Conjugate¶
Derive the Fenchel conjugate of the following functions:
(a) $f(x) = \frac{1}{2}x^2 + x$ (1D)
(b) $f(x) = |x|$ (absolute value)
(c) $f(x) = \max(0, x)$ (ReLU)
Problem 5: ADMM Application¶
Solve a distributed linear regression problem using ADMM: $$ \min \sum_{i=1}^{N} \frac{1}{2}\|A_i x - b_i\|^2 $$
Assume data is distributed across $N$ nodes.
References¶
Books¶
- Convex Optimization (Boyd & Vandenberghe, 2004) - Standard textbook
- Proximal Algorithms (Parikh & Boyd, 2014) - Comprehensive treatment of proximal methods
- Online Convex Optimization (Hazan, 2016) - Connection to online learning
Papers¶
- Vapnik (1995), "The Nature of Statistical Learning Theory" - SVM foundations
- Boyd et al. (2011), "Distributed Optimization and Statistical Learning via ADMM"
- Rockafellar (1970), "Convex Analysis" - Theory of Fenchel conjugate
Online Resources¶
Libraries¶
- CVXPY: Python convex optimization
- scikit-learn: SVM, LASSO implementations
- ProximalOperators.jl: Julia proximal operator library