04. Norms and Distance Metrics

Learning Objectives

  • Understand and compute different types of vector norms (L1, L2, Lp, L∞)
  • Learn the meaning and applications of matrix norms (Frobenius, spectral, nuclear)
  • Understand the characteristics of various distance metrics (Euclidean, Mahalanobis, cosine)
  • Learn the relationship between regularization and norms, and the geometric interpretation of L1/L2 regularization
  • Understand through practice how norms and distances are utilized in machine learning
  • Be able to compute and visualize norms and distances using NumPy and scikit-learn

1. Vector Norms

1.1 Definition of Norm

A norm on a vector space is a function $\|\cdot\|: \mathbb{R}^n \to \mathbb{R}_+$ satisfying three properties:

  1. Positive definiteness: $\|\mathbf{x}\| \geq 0$ and $\|\mathbf{x}\| = 0 \Leftrightarrow \mathbf{x} = \mathbf{0}$
  2. Homogeneity: $\|\alpha \mathbf{x}\| = |\alpha| \|\mathbf{x}\|$ for all $\alpha \in \mathbb{R}$
  3. Triangle inequality: $\|\mathbf{x} + \mathbf{y}\| \leq \|\mathbf{x}\| + \|\mathbf{y}\|$

1.2 Lp Norm

The general $L_p$ norm is defined as:

$$\|\mathbf{x}\|_p = \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}$$

1.3 L1 Norm (Manhattan Distance)

$$\|\mathbf{x}\|_1 = \sum_{i=1}^n |x_i|$$

  • Sum of absolute values
  • Also called Manhattan distance or taxicab distance
  • Used in regularization to induce sparsity

1.4 L2 Norm (Euclidean Distance)

$$\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2}$$

  • Most common norm
  • Euclidean distance
  • Differentiable and easy to work with

1.5 L∞ Norm (Maximum Norm)

$$\|\mathbf{x}\|_\infty = \max_i |x_i|$$

  • Largest absolute value
  • Obtained as the limit $p \to \infty$
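The limit behavior can be checked numerically: as $p$ grows, the $L_p$ norm of a fixed vector approaches its largest absolute entry. A quick sketch:

```python
import numpy as np

x = np.array([3.0, 4.0])
for p in [1, 2, 4, 8, 16, 64]:
    # Lp norm computed directly from the definition
    print(p, np.sum(np.abs(x)**p)**(1/p))

# The values approach max|x_i| = 4 as p grows
print(np.max(np.abs(x)))
```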

1.6 Norm Computation Example

import numpy as np
import matplotlib.pyplot as plt

# Define a vector
x = np.array([3, 4])

# Compute various norms
l1_norm = np.linalg.norm(x, ord=1)
l2_norm = np.linalg.norm(x, ord=2)
l_inf_norm = np.linalg.norm(x, ord=np.inf)

# Verify against manual computation
l1_manual = np.sum(np.abs(x))
l2_manual = np.sqrt(np.sum(x**2))
l_inf_manual = np.max(np.abs(x))

print("Vector:", x)
print(f"\nL1 norm: {l1_norm:.4f} (manual: {l1_manual:.4f})")
print(f"L2 norm: {l2_norm:.4f} (manual: {l2_manual:.4f})")
print(f"L∞ norm: {l_inf_norm:.4f} (manual: {l_inf_manual:.4f})")

# Lp "norms" for p = 0.5, 1, 2, 3, 10
# (note: for p < 1 the triangle inequality fails, so this is not a true norm)
p_values = [0.5, 1, 2, 3, 10]
norms = [np.sum(np.abs(x)**p)**(1/p) for p in p_values]

print("\nLp norm for various p:")
for p, norm in zip(p_values, norms):
    print(f"  L{p} = {norm:.4f}")

1.7 Unit Sphere Visualization

# Visualize the unit "sphere" ||x||_p = 1 in 2D
theta = np.linspace(0, 2*np.pi, 1000)

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
p_values = [1, 2, 3, np.inf]

for ax, p in zip(axes, p_values):
    if p == np.inf:
        # L∞: a square
        square = np.array([[-1, -1], [1, -1], [1, 1], [-1, 1], [-1, -1]])
        ax.plot(square[:, 0], square[:, 1], 'b-', linewidth=2)
        title = r'$L_\infty$ norm'
    else:
        # Lp: parametric curve
        # x(t) = sign(cos(t)) * |cos(t)|^(2/p)
        # y(t) = sign(sin(t)) * |sin(t)|^(2/p)
        x_coords = np.sign(np.cos(theta)) * np.abs(np.cos(theta))**(2/p)
        y_coords = np.sign(np.sin(theta)) * np.abs(np.sin(theta))**(2/p)
        ax.plot(x_coords, y_coords, 'b-', linewidth=2)
        title = f'$L_{p}$ norm'

    ax.set_xlim(-1.5, 1.5)
    ax.set_ylim(-1.5, 1.5)
    ax.set_aspect('equal')
    ax.grid(True, alpha=0.3)
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)
    ax.set_title(title, fontsize=12)
    ax.set_xlabel('$x_1$')
    ax.set_ylabel('$x_2$')

plt.suptitle(r'Unit spheres: $\|x\|_p = 1$', fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig('unit_spheres.png', dpi=150, bbox_inches='tight')
plt.close()

print("Unit sphere visualization saved: unit_spheres.png")

1.8 Verification of Norm Properties

# Verify the triangle inequality
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

for p in [1, 2, np.inf]:
    norm_x = np.linalg.norm(x, ord=p)
    norm_y = np.linalg.norm(y, ord=p)
    norm_sum = np.linalg.norm(x + y, ord=p)

    print(f"\nL{p} norm triangle inequality:")
    print(f"  ||x|| = {norm_x:.4f}")
    print(f"  ||y|| = {norm_y:.4f}")
    print(f"  ||x+y|| = {norm_sum:.4f}")
    print(f"  ||x|| + ||y|| = {norm_x + norm_y:.4f}")
    print(f"  Inequality holds: {norm_sum <= norm_x + norm_y + 1e-10}")

2. Matrix Norms

2.1 Frobenius Norm

$$\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{\text{tr}(A^T A)}$$

  • Square root of sum of squared elements
  • Extension of L2 norm for vectors to matrices

2.2 Spectral Norm

$$\|A\|_2 = \sigma_{\max}(A) = \sqrt{\lambda_{\max}(A^T A)}$$

  • Maximum singular value
  • Induced norm: $\|A\|_2 = \max_{\|\mathbf{x}\|_2=1} \|A\mathbf{x}\|_2$
  • Measure of how much a matrix can stretch a vector
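The "stretch" interpretation can be checked numerically: for every vector $\mathbf{x}$, $\|A\mathbf{x}\|_2 / \|\mathbf{x}\|_2 \leq \sigma_{\max}(A)$, with equality at the top right singular vector. A minimal sketch (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))

# Spectral norm = largest singular value
sigma_max = np.linalg.norm(A, ord=2)

# Stretch factors ||Ax|| / ||x|| for random unit vectors never exceed sigma_max
xs = rng.standard_normal((3, 1000))
xs /= np.linalg.norm(xs, axis=0)
stretches = np.linalg.norm(A @ xs, axis=0)
print(stretches.max(), "<=", sigma_max)

# The top right singular vector attains the maximum stretch
U, s, Vt = np.linalg.svd(A)
v1 = Vt[0]
print(np.linalg.norm(A @ v1))  # equals sigma_max
```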

2.3 Nuclear Norm

$$\|A\|_* = \sum_i \sigma_i(A)$$

  • Sum of all singular values
  • Used in low-rank matrix approximation (matrix completion problems)
  • Matrix analogue of the L1 norm: induces sparsity in the singular values, i.e., low rank
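The low-rank structure the nuclear norm promotes can be seen directly via a truncated SVD: a nearly rank-2 matrix has only two large singular values, and keeping just those gives an accurate approximation. A short sketch (the matrix sizes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# A rank-2 matrix plus small noise: only two singular values are large
A_true = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))
A_noisy = A_true + 0.01 * rng.standard_normal((6, 5))

U, s, Vt = np.linalg.svd(A_noisy)
print("Singular values:", np.round(s, 3))  # two dominant values, rest near zero

# Truncated SVD: keeping the k largest singular values gives the best rank-k
# approximation; the nuclear norm s.sum() is the convex surrogate minimized
# in matrix completion to favor exactly this kind of structure
k = 2
A_low = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print("Rank-2 approximation error (Frobenius):",
      np.linalg.norm(A_noisy - A_low, ord='fro'))
```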

2.4 Matrix Norm Computation

import numpy as np

# ๋žœ๋ค ํ–‰๋ ฌ ์ƒ์„ฑ
A = np.random.randn(4, 3)

# ํ”„๋กœ๋ฒ ๋‹ˆ์šฐ์Šค ๋…ธ๋ฆ„
frobenius = np.linalg.norm(A, ord='fro')
frobenius_manual = np.sqrt(np.sum(A**2))

# ์ŠคํŽ™ํŠธ๋Ÿผ ๋…ธ๋ฆ„ (์ตœ๋Œ€ ํŠน์ด๊ฐ’)
spectral = np.linalg.norm(A, ord=2)
U, s, Vt = np.linalg.svd(A)
spectral_manual = s[0]

# ํ•ต ๋…ธ๋ฆ„ (ํŠน์ด๊ฐ’์˜ ํ•ฉ)
nuclear = np.sum(s)

print("ํ–‰๋ ฌ ํ˜•ํƒœ:", A.shape)
print(f"\nํ”„๋กœ๋ฒ ๋‹ˆ์šฐ์Šค ๋…ธ๋ฆ„: {frobenius:.4f}")
print(f"  ์ˆ˜๋™ ๊ณ„์‚ฐ:        {frobenius_manual:.4f}")
print(f"\n์ŠคํŽ™ํŠธ๋Ÿผ ๋…ธ๋ฆ„:     {spectral:.4f}")
print(f"  ์ตœ๋Œ€ ํŠน์ด๊ฐ’:      {spectral_manual:.4f}")
print(f"\nํ•ต ๋…ธ๋ฆ„:           {nuclear:.4f}")
print(f"  ํŠน์ด๊ฐ’: {s}")

2.5 Properties of Matrix Norms

# Inequalities between matrix norms
A = np.random.randn(5, 5)

frobenius = np.linalg.norm(A, ord='fro')
spectral = np.linalg.norm(A, ord=2)

# Inequality: ||A||_2 <= ||A||_F <= sqrt(rank(A)) * ||A||_2
rank_A = np.linalg.matrix_rank(A)

print("Matrix norm inequalities:")
print(f"Spectral norm:         {spectral:.4f}")
print(f"Frobenius norm:        {frobenius:.4f}")
print(f"sqrt(rank) * spectral: {np.sqrt(rank_A) * spectral:.4f}")
print(f"\n||A||_2 <= ||A||_F: {spectral <= frobenius + 1e-10}")
print(f"||A||_F <= sqrt(r)*||A||_2: {frobenius <= np.sqrt(rank_A) * spectral + 1e-10}")

3. Distance Metrics

3.1 Euclidean Distance

$$d(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_2 = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}$$

The most common distance metric.

3.2 Mahalanobis Distance

$$d_M(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^T \Sigma^{-1} (\mathbf{x} - \mathbf{y})}$$

  • Considers covariance matrix $\Sigma$
  • Accounts for correlations between variables and scale
  • Useful for outlier detection

3.3 Cosine Similarity

$$\text{sim}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x}^T \mathbf{y}}{\|\mathbf{x}\|_2 \|\mathbf{y}\|_2}$$

$$d_{\text{cosine}}(\mathbf{x}, \mathbf{y}) = 1 - \text{sim}(\mathbf{x}, \mathbf{y})$$

  • Compares only direction of vectors (ignores magnitude)
  • Used for text similarity and embedding comparison

3.4 Hamming Distance

$$d_H(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^n \mathbb{1}[x_i \neq y_i]$$

  • Number of different elements
  • Used for binary data and categorical data
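A quick sketch of both conventions, since they are easy to mix up: counting mismatched positions directly with NumPy versus SciPy's `distance.hamming`, which returns the *fraction* of mismatched positions rather than the count.

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1, 0, 1, 1, 0, 1])
y = np.array([1, 1, 1, 0, 0, 1])

count = int(np.sum(x != y))        # number of differing positions
fraction = distance.hamming(x, y)  # SciPy returns the fraction, not the count
print(count)               # 2
print(fraction * len(x))   # 2.0
```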

3.5 Distance Metric Comparison

from scipy.spatial import distance

# Sample data
x = np.array([1, 2, 3, 4])
y = np.array([2, 3, 4, 5])

# Compute various distances
euclidean = distance.euclidean(x, y)
manhattan = distance.cityblock(x, y)
chebyshev = distance.chebyshev(x, y)
cosine_dist = distance.cosine(x, y)

# Manual computation
euclidean_manual = np.linalg.norm(x - y)
manhattan_manual = np.sum(np.abs(x - y))
chebyshev_manual = np.max(np.abs(x - y))
cosine_manual = 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print("Distance metric comparison:")
print(f"Euclidean distance: {euclidean:.4f} (manual: {euclidean_manual:.4f})")
print(f"Manhattan distance: {manhattan:.4f} (manual: {manhattan_manual:.4f})")
print(f"Chebyshev distance: {chebyshev:.4f} (manual: {chebyshev_manual:.4f})")
print(f"Cosine distance:    {cosine_dist:.4f} (manual: {cosine_manual:.4f})")

3.6 Mahalanobis Distance Example

from scipy.spatial.distance import mahalanobis

# Sample from a multivariate normal distribution
mean = np.array([0, 0])
cov = np.array([[2, 1], [1, 2]])
np.random.seed(42)
samples = np.random.multivariate_normal(mean, cov, 500)

# Euclidean distance from the origin
euclidean_dists = np.linalg.norm(samples, axis=1)

# Mahalanobis distance from the origin
cov_inv = np.linalg.inv(cov)
mahal_dists = np.array([mahalanobis(s, mean, cov_inv) for s in samples])

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Euclidean distance
scatter1 = axes[0].scatter(samples[:, 0], samples[:, 1], c=euclidean_dists,
                           cmap='viridis', alpha=0.6, edgecolors='k', linewidth=0.5)
axes[0].set_title('Euclidean distance', fontsize=12)
axes[0].set_xlabel('$x_1$')
axes[0].set_ylabel('$x_2$')
axes[0].axis('equal')
axes[0].grid(True, alpha=0.3)
plt.colorbar(scatter1, ax=axes[0], label='distance')

# Mahalanobis distance
scatter2 = axes[1].scatter(samples[:, 0], samples[:, 1], c=mahal_dists,
                           cmap='viridis', alpha=0.6, edgecolors='k', linewidth=0.5)
axes[1].set_title('Mahalanobis distance (covariance-aware)', fontsize=12)
axes[1].set_xlabel('$x_1$')
axes[1].set_ylabel('$x_2$')
axes[1].axis('equal')
axes[1].grid(True, alpha=0.3)
plt.colorbar(scatter2, ax=axes[1], label='distance')

plt.tight_layout()
plt.savefig('mahalanobis_distance.png', dpi=150)
plt.close()

print("Mahalanobis distance visualization saved: mahalanobis_distance.png")
print(f"Mean Euclidean distance: {euclidean_dists.mean():.4f}")
print(f"Mean Mahalanobis distance: {mahal_dists.mean():.4f}")

4. Regularization and Norms

4.1 L1 Regularization (Lasso)

Adding L1 norm penalty to loss function:

$$L(\mathbf{w}) = L_{\text{data}}(\mathbf{w}) + \lambda \|\mathbf{w}\|_1$$

  • Induces sparsity (many weights become exactly 0)
  • Feature selection effect

4.2 L2 Regularization (Ridge)

Adding L2 norm penalty to loss function:

$$L(\mathbf{w}) = L_{\text{data}}(\mathbf{w}) + \lambda \|\mathbf{w}\|_2^2$$

  • Weight shrinkage
  • Prevents overfitting
  • Differentiable

4.3 Elastic Net

Combination of L1 and L2:

$$L(\mathbf{w}) = L_{\text{data}}(\mathbf{w}) + \lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2$$

4.4 Regularization Comparison (Regression)

from sklearn.linear_model import Lasso, Ridge, ElasticNet, LinearRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# ๋ฐ์ดํ„ฐ ์ƒ์„ฑ (ํฌ์†Œํ•œ ์‹ค์ œ ๊ฐ€์ค‘์น˜)
X, y, true_coef = make_regression(
    n_samples=200, n_features=50, n_informative=10,
    noise=10, coef=True, random_state=42
)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the models
models = {
    'No regularization': LinearRegression(),
    'L1 (Lasso)': Lasso(alpha=1.0),
    'L2 (Ridge)': Ridge(alpha=1.0),
    'Elastic Net': ElasticNet(alpha=1.0, l1_ratio=0.5)
}

results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    train_score = model.score(X_train_scaled, y_train)
    test_score = model.score(X_test_scaled, y_test)
    n_nonzero = np.sum(np.abs(model.coef_) > 1e-5)

    results[name] = {
        'train_r2': train_score,
        'test_r2': test_score,
        'n_nonzero': n_nonzero,
        'coef': model.coef_
    }

    print(f"\n{name}:")
    print(f"  Train R²: {train_score:.4f}")
    print(f"  Test R²: {test_score:.4f}")
    print(f"  Nonzero coefficients: {n_nonzero}/50")
    print(f"  Coefficient L1 norm: {np.linalg.norm(model.coef_, ord=1):.4f}")
    print(f"  Coefficient L2 norm: {np.linalg.norm(model.coef_, ord=2):.4f}")

# Visualize the coefficients
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, (name, result) in enumerate(results.items()):
    axes[idx].stem(range(50), result['coef'], basefmt=' ')
    axes[idx].axhline(y=0, color='k', linewidth=0.5)
    axes[idx].set_title(f'{name}\nNonzero coefficients: {result["n_nonzero"]}/50', fontsize=11)
    axes[idx].set_xlabel('Feature index')
    axes[idx].set_ylabel('Coefficient value')
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('regularization_comparison.png', dpi=150)
plt.close()

print("\nRegularization comparison visualization saved: regularization_comparison.png")

5. Geometry of Norms

5.1 Why Does L1 Produce Sparse Solutions?

This can be understood geometrically, from how the contours of the data loss meet the constraint region induced by the norm.

# Geometry of L1 vs L2 regularization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Create a grid
w1 = np.linspace(-2, 2, 300)
w2 = np.linspace(-2, 2, 300)
W1, W2 = np.meshgrid(w1, w2)

# Quadratic loss: L = (w1 - 1.5)^2 + (w2 - 0.3)^2
# (the unconstrained optimum is chosen so that the L1-constrained
#  optimum lands exactly on the corner (1, 0))
w_optimal = np.array([1.5, 0.3])
L = (W1 - w_optimal[0])**2 + (W2 - w_optimal[1])**2

# L1 constraint
axes[0].contour(W1, W2, L, levels=20, cmap='viridis', alpha=0.6)
theta = np.linspace(0, 2*np.pi, 1000)
constraint_size = 1.0
# L1: diamond |w1| + |w2| = 1, parametrized with exponent 2/p = 2
l1_x = constraint_size * np.sign(np.cos(theta)) * np.abs(np.cos(theta))**2
l1_y = constraint_size * np.sign(np.sin(theta)) * np.abs(np.sin(theta))**2
axes[0].plot(l1_x, l1_y, 'r-', linewidth=3, label=r'$\|w\|_1 = 1$')
axes[0].plot([1], [0], 'ro', markersize=10, label='optimum (sparse)')
axes[0].set_title('L1 regularization: sparse solution', fontsize=12)
axes[0].set_xlabel('$w_1$')
axes[0].set_ylabel('$w_2$')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
axes[0].set_aspect('equal')

# L2 constraint
axes[1].contour(W1, W2, L, levels=20, cmap='viridis', alpha=0.6)
# L2: circle
l2_x = constraint_size * np.cos(theta)
l2_y = constraint_size * np.sin(theta)
axes[1].plot(l2_x, l2_y, 'b-', linewidth=3, label=r'$\|w\|_2 = 1$')
# Find the optimum (tangency of the circle and a loss contour)
opt_w2 = w_optimal / np.linalg.norm(w_optimal) * constraint_size
axes[1].plot([opt_w2[0]], [opt_w2[1]], 'bo', markersize=10, label='optimum (dense)')
axes[1].set_title('L2 regularization: dense solution', fontsize=12)
axes[1].set_xlabel('$w_1$')
axes[1].set_ylabel('$w_2$')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
axes[1].set_aspect('equal')

plt.tight_layout()
plt.savefig('l1_l2_geometry.png', dpi=150)
plt.close()

print("L1/L2 regularization geometry visualization saved: l1_l2_geometry.png")

5.2 Constraints and Optimization

The two standard formulations are connected through the Lagrangian:

Constraint form: $$\min_\mathbf{w} L_{\text{data}}(\mathbf{w}) \quad \text{s.t.} \quad \|\mathbf{w}\|_p \leq t$$

Penalty form (equivalent for convex problems, with a correspondence between $t$ and $\lambda$): $$\min_\mathbf{w} L_{\text{data}}(\mathbf{w}) + \lambda \|\mathbf{w}\|_p$$

Sparsity arises because the L1 ball has corners on the coordinate axes: the loss contours typically first touch the constraint region at a corner, where some coordinates are exactly zero.
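The corner argument also has an algebraic counterpart: the proximal operator of the L1 penalty is soft-thresholding, which sets small coordinates to exactly zero, while the squared-L2 penalty only shrinks every coordinate toward zero. A minimal sketch (the function names are illustrative):

```python
import numpy as np

def prox_l1(w, lam):
    """Soft-thresholding: argmin_z 0.5*||z - w||^2 + lam*||z||_1."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def prox_l2_sq(w, lam):
    """Shrinkage: argmin_z 0.5*||z - w||^2 + lam*||z||_2^2."""
    return w / (1.0 + 2.0 * lam)

w = np.array([3.0, 0.4, -0.2, -2.0])
print(prox_l1(w, 0.5))     # entries with |w_i| <= 0.5 become exactly 0
print(prox_l2_sq(w, 0.5))  # all entries shrink, none hit 0
```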

5.3 3D Visualization

from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (enables 3D projection in older matplotlib)

# L1/L2 constraint sets in 3D
fig = plt.figure(figsize=(14, 6))

# L1 ball
ax1 = fig.add_subplot(121, projection='3d')
u = np.linspace(0, 2*np.pi, 50)
v = np.linspace(0, np.pi, 50)

# The L1 unit sphere is an octahedron
r = 1.0
vertices = np.array([
    [r, 0, 0], [-r, 0, 0], [0, r, 0],
    [0, -r, 0], [0, 0, r], [0, 0, -r]
])
# For simplicity, show only the vertices
ax1.scatter(vertices[:, 0], vertices[:, 1], vertices[:, 2],
            c='red', s=100, alpha=0.8)
ax1.set_title('$L_1$ unit sphere (octahedron)', fontsize=12)

# L2 ball (standard sphere)
ax2 = fig.add_subplot(122, projection='3d')
x = r * np.outer(np.cos(u), np.sin(v))
y = r * np.outer(np.sin(u), np.sin(v))
z = r * np.outer(np.ones(np.size(u)), np.cos(v))
ax2.plot_surface(x, y, z, cmap='viridis', alpha=0.8)
ax2.set_title('$L_2$ unit sphere', fontsize=12)

for ax in [ax1, ax2]:
    ax.set_xlabel('$w_1$')
    ax.set_ylabel('$w_2$')
    ax.set_zlabel('$w_3$')
    ax.set_xlim(-1.5, 1.5)
    ax.set_ylim(-1.5, 1.5)
    ax.set_zlim(-1.5, 1.5)

plt.tight_layout()
plt.savefig('l1_l2_3d.png', dpi=150)
plt.close()

print("3D unit sphere visualization saved: l1_l2_3d.png")

6. ML Applications

6.1 Distance Selection in k-NN

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# ๋ฐ์ดํ„ฐ ์ƒ์„ฑ
X, y = make_classification(n_samples=500, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)

# ๋‹ค์–‘ํ•œ ๊ฑฐ๋ฆฌ ์ธก๋„๋กœ k-NN ํ›ˆ๋ จ
distances = {
    'Euclidean (L2)': 'euclidean',
    'Manhattan (L1)': 'manhattan',
    'Chebyshev (Lโˆž)': 'chebyshev',
    'Minkowski (p=3)': 'minkowski'
}

print("k-NN ๊ฑฐ๋ฆฌ ์ธก๋„ ๋น„๊ต (5-fold CV):")
for name, metric in distances.items():
    if metric == 'minkowski':
        knn = KNeighborsClassifier(n_neighbors=5, metric=metric, p=3)
    else:
        knn = KNeighborsClassifier(n_neighbors=5, metric=metric)

    scores = cross_val_score(knn, X, y, cv=5)
    print(f"{name:20s}: {scores.mean():.4f} (+/- {scores.std():.4f})")

6.2 Embedding Similarity (Cosine vs Euclidean)

# Simulated text embeddings
np.random.seed(42)
n_docs = 100
embed_dim = 50

# Document embeddings (normalized to unit vectors)
embeddings = np.random.randn(n_docs, embed_dim)
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Query document
query = embeddings[0]

# Euclidean distance
euclidean_dists = np.linalg.norm(embeddings - query, axis=1)

# Cosine similarity
cosine_sims = embeddings @ query
cosine_dists = 1 - cosine_sims

# Top 10 most similar documents (index 0 is the query itself, so skip it)
top_k = 10
euclidean_top = np.argsort(euclidean_dists)[1:top_k+1]
cosine_top = np.argsort(cosine_dists)[1:top_k+1]

print("Embedding similarity comparison:")
print(f"Euclidean distance top {top_k}: {euclidean_top}")
print(f"Cosine distance top {top_k}:    {cosine_top}")
print(f"Overlapping documents: {len(set(euclidean_top) & set(cosine_top))}/{top_k}")

# Rank correlation (for unit vectors, d_euc^2 = 2 - 2*cos_sim, so the rankings agree)
from scipy.stats import spearmanr
corr, p_value = spearmanr(euclidean_dists, cosine_dists)
print(f"\nEuclidean-cosine distance correlation: {corr:.4f} (p={p_value:.4e})")

6.3 Batch Normalization and Gradient Norm

import torch
import torch.nn as nn

# A simple neural network
class SimpleNet(nn.Module):
    def __init__(self, use_batchnorm=False):
        super().__init__()
        self.fc1 = nn.Linear(10, 50)
        self.bn1 = nn.BatchNorm1d(50) if use_batchnorm else nn.Identity()
        self.fc2 = nn.Linear(50, 20)
        self.bn2 = nn.BatchNorm1d(20) if use_batchnorm else nn.Identity()
        self.fc3 = nn.Linear(20, 1)

    def forward(self, x):
        x = torch.relu(self.bn1(self.fc1(x)))
        x = torch.relu(self.bn2(self.fc2(x)))
        return self.fc3(x)

# Generate data
X = torch.randn(100, 10)
y = torch.randn(100, 1)

# Compare models
models = {
    'Without BatchNorm': SimpleNet(use_batchnorm=False),
    'With BatchNorm': SimpleNet(use_batchnorm=True)
}

print("Effect of batch normalization on the gradient norm:\n")
for name, model in models.items():
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Forward pass
    output = model(X)
    loss = nn.MSELoss()(output, y)

    # Backward pass
    optimizer.zero_grad()
    loss.backward()

    # Compute the total gradient L2 norm across all parameters
    total_norm = 0.0
    for p in model.parameters():
        if p.grad is not None:
            param_norm = p.grad.data.norm(2)
            total_norm += param_norm.item() ** 2
    total_norm = total_norm ** 0.5

    print(f"{name}:")
    print(f"  Loss: {loss.item():.4f}")
    print(f"  Gradient L2 norm: {total_norm:.4f}\n")

6.4 Gradient Clipping

# ๊ทธ๋ž˜๋””์–ธํŠธ ํญ๋ฐœ ๋ฐฉ์ง€
def train_step_with_clipping(model, X, y, optimizer, max_norm=1.0):
    """๊ทธ๋ž˜๋””์–ธํŠธ ํด๋ฆฌํ•‘์„ ์ ์šฉํ•œ ํ•™์Šต ์Šคํ…"""
    optimizer.zero_grad()
    output = model(X)
    loss = nn.MSELoss()(output, y)
    loss.backward()

    # ๊ทธ๋ž˜๋””์–ธํŠธ ๋…ธ๋ฆ„ ๊ณ„์‚ฐ
    total_norm = 0.0
    for p in model.parameters():
        if p.grad is not None:
            param_norm = p.grad.data.norm(2)
            total_norm += param_norm.item() ** 2
    total_norm = total_norm ** 0.5

    # ํด๋ฆฌํ•‘
    if total_norm > max_norm:
        clip_coef = max_norm / (total_norm + 1e-6)
        for p in model.parameters():
            if p.grad is not None:
                p.grad.data.mul_(clip_coef)
        clipped_norm = max_norm
    else:
        clipped_norm = total_norm

    optimizer.step()
    return loss.item(), total_norm, clipped_norm

# ํ…Œ์ŠคํŠธ
model = SimpleNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss, orig_norm, clipped_norm = train_step_with_clipping(
    model, X, y, optimizer, max_norm=1.0
)

print("๊ทธ๋ž˜๋””์–ธํŠธ ํด๋ฆฌํ•‘:")
print(f"์›๋ž˜ ๊ทธ๋ž˜๋””์–ธํŠธ ๋…ธ๋ฆ„: {orig_norm:.4f}")
print(f"ํด๋ฆฌํ•‘ ํ›„ ๋…ธ๋ฆ„:       {clipped_norm:.4f}")
print(f"ํด๋ฆฌํ•‘ ๋ฐœ์ƒ:          {orig_norm > 1.0}")

Practice Problems

Problem 1: Proof of Norm Properties

For $p \geq 1$, prove that the $L_p$ norm satisfies the triangle inequality:

$$\|\mathbf{x} + \mathbf{y}\|_p \leq \|\mathbf{x}\|_p + \|\mathbf{y}\|_p$$

Hint: This statement is Minkowski's inequality; prove it using Hölder's inequality, or verify it numerically.

Problem 2: Mahalanobis Distance Implementation

Write a function that computes the Mahalanobis distance for a given dataset and detects outliers. Set the threshold using the chi-squared distribution.

Problem 3: Regularization Path

Track the order in which coefficients become zero as $\lambda$ increases from 0 in Lasso regression. Use scikit-learn's lasso_path or implement it yourself.

Problem 4: Norm-Preserving Transformations

For an orthogonal matrix $Q$ (i.e., $Q^T Q = I$), prove:

$$\|Q\mathbf{x}\|_2 = \|\mathbf{x}\|_2$$

Verify numerically and also check for the Frobenius norm.

Problem 5: Distance-Based Clustering

Perform k-means clustering on given 2D data using Euclidean, Manhattan, and cosine distances separately, and compare the results. Evaluate cluster quality using the silhouette score.

References

Textbooks

  • Boyd & Vandenberghe, Convex Optimization, Chapter 3 (Norms)
  • Hastie et al., The Elements of Statistical Learning, Chapter 3.4 (Shrinkage Methods)
  • Murphy, Machine Learning: A Probabilistic Perspective, Chapter 13.3

Papers

  • Tibshirani, Regression Shrinkage and Selection via the Lasso (JRSS 1996)
  • Zou & Hastie, Regularization and Variable Selection via the Elastic Net (JRSS 2005)
  • Mahalanobis, On the Generalized Distance in Statistics (1936)