24. ์†์‹ค ํ•จ์ˆ˜(Loss Functions)

์ด์ „: ํ•™์Šต ์ตœ์ ํ™” | ๋‹ค์Œ: ์˜ตํ‹ฐ๋งˆ์ด์ €


24. ์†์‹ค ํ•จ์ˆ˜(Loss Functions)

ํ•™์Šต ๋ชฉํ‘œ

  • ์‹ ๊ฒฝ๋ง ํ›ˆ๋ จ์—์„œ ์†์‹ค ํ•จ์ˆ˜์˜ ์—ญํ• ๊ณผ ์ตœ์ ํ™”์™€์˜ ๊ด€๊ณ„ ์ดํ•ดํ•˜๊ธฐ
  • ํšŒ๊ท€ ์†์‹ค(MSE, MAE, Huber)๊ณผ ๋‹ค์–‘ํ•œ ์‹œ๋‚˜๋ฆฌ์˜ค์—์„œ์˜ ํ™œ์šฉ๋ฒ• ์ตํžˆ๊ธฐ
  • ๋ถ„๋ฅ˜ ์†์‹ค(BCE, Cross-Entropy, Focal Loss)๊ณผ ๊ท ํ˜•/๋ถˆ๊ท ํ˜• ๋ฐ์ดํ„ฐ์…‹์—์„œ์˜ ์‚ฌ์šฉ ์‚ฌ๋ก€ ํ•™์Šตํ•˜๊ธฐ
  • ํ‘œํ˜„ ํ•™์Šต์„ ์œ„ํ•œ ๋ฉ”ํŠธ๋ฆญ ํ•™์Šต ์†์‹ค(Contrastive, Triplet, InfoNCE) ํƒ๊ตฌํ•˜๊ธฐ
  • ์„ธ๊ทธ๋ฉ˜ํ…Œ์ด์…˜, ๊ฒ€์ถœ, ์ƒ์„ฑ ๋ชจ๋ธ์„ ์œ„ํ•œ ์ปค์Šคํ…€ ์†์‹ค ํ•จ์ˆ˜๋ฅผ PyTorch๋กœ ๊ตฌํ˜„ํ•˜๊ธฐ

๋‚œ์ด๋„: โญโญโญ


1. ์†์‹ค ํ•จ์ˆ˜ ์†Œ๊ฐœ

1.1 ์‹ ๊ฒฝ๋ง ํ›ˆ๋ จ์—์„œ์˜ ์—ญํ• 

์†์‹ค ํ•จ์ˆ˜(๋ชฉ์  ํ•จ์ˆ˜(objective function) ๋˜๋Š” ๋น„์šฉ ํ•จ์ˆ˜(cost function)๋ผ๊ณ ๋„ ํ•จ)๋Š” ์‹ ๊ฒฝ๋ง์˜ ์˜ˆ์ธก์ด ์ •๋‹ต๊ณผ ์–ผ๋งˆ๋‚˜ ์ž˜ ์ผ์น˜ํ•˜๋Š”์ง€๋ฅผ ์ธก์ •ํ•ฉ๋‹ˆ๋‹ค. ํ›ˆ๋ จ ์ค‘์—๋Š” SGD๋‚˜ Adam๊ณผ ๊ฐ™์€ ์ตœ์ ํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‚ฌ์šฉํ•˜์—ฌ ์ด ์†์‹ค์„ ์ตœ์†Œํ™”ํ•ฉ๋‹ˆ๋‹ค.

Training Loop:
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                                                         โ”‚
โ”‚  Input (x) โ”€โ”€โ–ถ Model(ฮธ) โ”€โ”€โ–ถ Prediction (ลท)            โ”‚
โ”‚                                  โ”‚                      โ”‚
โ”‚                                  โ–ผ                      โ”‚
โ”‚                         Loss = L(ลท, y)                 โ”‚
โ”‚                                  โ”‚                      โ”‚
โ”‚                                  โ–ผ                      โ”‚
โ”‚                         โˆ‚L/โˆ‚ฮธ (Backprop)               โ”‚
โ”‚                                  โ”‚                      โ”‚
โ”‚                                  โ–ผ                      โ”‚
โ”‚                    ฮธ โ† ฮธ - ฮทยทโˆ‚L/โˆ‚ฮธ (Update)            โ”‚
โ”‚                                  โ”‚                      โ”‚
โ”‚                                  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚                                                         โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

์ข‹์€ ์†์‹ค ํ•จ์ˆ˜์˜ ์ฃผ์š” ํŠน์„ฑ: - ๋ฏธ๋ถ„ ๊ฐ€๋Šฅ(Differentiable): ์—ญ์ „ํŒŒ๋ฅผ ์œ„ํ•œ ๊ธฐ์šธ๊ธฐ๊ฐ€ ์žˆ์–ด์•ผ ํ•จ - ๋ณผ๋ก(Convex, ์ด์ƒ์ ์œผ๋กœ): ๋‹จ์ผ ์ „์—ญ ์ตœ์†Ÿ๊ฐ’์ด ์žˆ์œผ๋ฉด ์ตœ์ ํ™”๊ฐ€ ๋” ์‰ฌ์›€ - ์ž‘์—… ์ •๋ ฌ(Task-aligned): ๊ฐ€๋Šฅํ•œ ๊ฒฝ์šฐ ์‹ค์ œ ํ‰๊ฐ€ ์ง€ํ‘œ๋ฅผ ๋ฐ˜์˜ - ์ˆ˜์น˜์ ์œผ๋กœ ์•ˆ์ •์ (Numerically stable): ์˜ค๋ฒ„ํ”Œ๋กœ/์–ธ๋”ํ”Œ๋กœ ๋ฐฉ์ง€

1.2 ์†์‹ค ๊ฒฝ๊ด€ ์‹œ๊ฐํ™”(Loss Landscape Visualization)

์†์‹ค ๊ฒฝ๊ด€์€ ๋ชจ๋ธ ํŒŒ๋ผ๋ฏธํ„ฐ์— ๋”ฐ๋ผ ์†์‹ค์ด ์–ด๋–ป๊ฒŒ ๋ณ€ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

import torch
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def visualize_loss_landscape():
    """Visualize a simple 2D loss landscape"""
    # Create a grid of parameter values
    w1 = np.linspace(-5, 5, 100)
    w2 = np.linspace(-5, 5, 100)
    W1, W2 = np.meshgrid(w1, w2)

    # Example loss: Rosenbrock function (non-convex)
    a, b = 1, 100
    Z = (a - W1)**2 + b * (W2 - W1**2)**2

    # Plot 3D surface
    fig = plt.figure(figsize=(12, 5))

    ax1 = fig.add_subplot(121, projection='3d')
    ax1.plot_surface(W1, W2, Z, cmap='viridis', alpha=0.8)
    ax1.set_xlabel('w1')
    ax1.set_ylabel('w2')
    ax1.set_zlabel('Loss')
    ax1.set_title('3D Loss Landscape')

    # Plot contour
    ax2 = fig.add_subplot(122)
    contour = ax2.contour(W1, W2, Z, levels=30, cmap='viridis')
    ax2.set_xlabel('w1')
    ax2.set_ylabel('w2')
    ax2.set_title('Contour Plot')
    plt.colorbar(contour, ax=ax2)

    plt.tight_layout()
    plt.savefig('loss_landscape.png', dpi=150, bbox_inches='tight')
    plt.show()

visualize_loss_landscape()

1.3 ์ตœ์ ํ™”์™€์˜ ๊ด€๊ณ„

๋‹ค์–‘ํ•œ ์†์‹ค ํ•จ์ˆ˜๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ์ตœ์ ํ™” ๋ฌธ์ œ๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค:

์†์‹ค ์œ ํ˜•       | ๊ฒฝ๊ด€                     | ์ตœ์ ํ™” ๊ณผ์ œ
MSE            | ๋ถ€๋“œ๋Ÿฝ๊ณ  ๋ณผ๋กํ•จ            | ์‰ฌ์›€, ์•ˆ์ •์ ์ธ ๊ธฐ์šธ๊ธฐ
Cross-Entropy  | ๋ถ€๋“œ๋Ÿฝ๊ณ  ๋ณผ๋กํ•จ            | ๊ธฐ์šธ๊ธฐ ์†Œ์‹ค ๊ฐ€๋Šฅ
Triplet Loss   | ๋น„๋ณผ๋ก, ๋งŽ์€ ์ง€์—ญ ์ตœ์†Ÿ๊ฐ’     | ์‹ ์ค‘ํ•œ ๋งˆ์ด๋‹ ํ•„์š”
GAN Loss       | ๋น„๋ณผ๋ก, ์•ˆ์žฅ์              | ๋ถˆ์•ˆ์ •, ๋ชจ๋“œ ๋ถ•๊ดด

2. ํšŒ๊ท€ ์†์‹ค(Regression Losses)

ํšŒ๊ท€ ์†์‹ค์€ ์—ฐ์†์ ์ธ ๊ฐ’์„ ์˜ˆ์ธกํ•  ๋•Œ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค(์˜ˆ: ์ฃผํƒ ๊ฐ€๊ฒฉ, ์˜จ๋„, ์ขŒํ‘œ).

2.1 ํ‰๊ท  ์ œ๊ณฑ ์˜ค์ฐจ(Mean Squared Error, L2 Loss)

๊ณต์‹:

MSE = (1/n) ฮฃ(ลทแตข - yแตข)ยฒ

ํŠน์„ฑ: - ํฐ ์˜ค์ฐจ๋ฅผ ํฌ๊ฒŒ ํŒจ๋„ํ‹ฐ(์ด์ฐจ ํ•จ์ˆ˜) - ์ด์ƒ์น˜์— ๋ฏผ๊ฐํ•จ - ๋ชจ๋“  ๊ณณ์—์„œ ๋ถ€๋“œ๋Ÿฌ์šด ๊ธฐ์šธ๊ธฐ

PyTorch ๊ตฌํ˜„:

import torch
import torch.nn as nn

# Built-in version
mse_loss = nn.MSELoss()

# Example usage
predictions = torch.tensor([2.5, 0.0, 2.1, 7.8])
targets = torch.tensor([3.0, -0.5, 2.0, 8.0])

loss = mse_loss(predictions, targets)
print(f"MSE Loss: {loss.item():.4f}")  # 0.0825

# Manual implementation
def mse_manual(pred, target):
    return torch.mean((pred - target) ** 2)

loss_manual = mse_manual(predictions, targets)
print(f"Manual MSE: {loss_manual.item():.4f}")  # 0.0825

์‚ฌ์šฉ ์‹œ๊ธฐ: - ํฐ ์˜ค์ฐจ๋ฅผ ํฌ๊ฒŒ ํŒจ๋„ํ‹ฐํ•ด์•ผ ํ•˜๋Š” ํšŒ๊ท€ ์ž‘์—… - ์‹ฌ๊ฐํ•œ ์ด์ƒ์น˜๊ฐ€ ์—†๋Š” ๋ฐ์ดํ„ฐ - ์ตœ์ ํ™”๋ฅผ ์œ„ํ•ด ๋ถ€๋“œ๋Ÿฌ์šด ๊ธฐ์šธ๊ธฐ๊ฐ€ ํ•„์š”ํ•  ๋•Œ

2.2 ํ‰๊ท  ์ ˆ๋Œ€ ์˜ค์ฐจ(Mean Absolute Error, L1 Loss)

๊ณต์‹:

MAE = (1/n) ฮฃ|ลทแตข - yแตข|

ํŠน์„ฑ: - ์„ ํ˜• ํŒจ๋„ํ‹ฐ(์ด์ƒ์น˜์— ๊ฐ•๊ฑดํ•จ) - ๊ธฐ์šธ๊ธฐ ํฌ๊ธฐ๊ฐ€ ์ผ์ •ํ•จ - 0์—์„œ ๋ถˆ์•ˆ์ •ํ•  ์ˆ˜ ์žˆ์Œ(๋ฏธ๋ถ„ ๋ถˆ๊ฐ€๋Šฅ)

PyTorch ๊ตฌํ˜„:

# Built-in version
mae_loss = nn.L1Loss()

predictions = torch.tensor([2.5, 0.0, 2.1, 7.8])
targets = torch.tensor([3.0, -0.5, 2.0, 8.0])

loss = mae_loss(predictions, targets)
print(f"MAE Loss: {loss.item():.4f}")  # 0.2250

# Manual implementation
def mae_manual(pred, target):
    return torch.mean(torch.abs(pred - target))

loss_manual = mae_manual(predictions, targets)
print(f"Manual MAE: {loss_manual.item():.4f}")  # 0.2250

์‚ฌ์šฉ ์‹œ๊ธฐ: - ์ด์ƒ์น˜๊ฐ€ ์žˆ๋Š” ๋ฐ์ดํ„ฐ - ๋ชจ๋“  ์˜ค์ฐจ๋ฅผ ๋™๋“ฑํ•˜๊ฒŒ ๊ฐ€์ค‘ํ•ด์•ผ ํ•  ๋•Œ - ๊ฐ•๊ฑดํ•œ ํšŒ๊ท€ ์ž‘์—…

2.3 ํ›„๋ฒ„ ์†์‹ค(Huber Loss, Smooth L1)

๊ณต์‹:

         โŽง  0.5(ลท - y)ยฒ          if |ลท - y| โ‰ค ฮด
L_ฮด(ลท,y) = โŽจ
         โŽฉ  ฮด|ลท - y| - 0.5ฮดยฒ    otherwise

ํŠน์„ฑ: - L1๊ณผ L2์˜ ์žฅ์ ์„ ๊ฒฐํ•ฉ - ์ž‘์€ ์˜ค์ฐจ์—๋Š” ์ด์ฐจ, ํฐ ์˜ค์ฐจ์—๋Š” ์„ ํ˜• - ฮด(์ „ํ™˜์ )๋กœ ์ œ์–ด๋จ

PyTorch ๊ตฌํ˜„:

# Built-in version (SmoothL1Loss uses ฮด=1.0)
huber_loss = nn.SmoothL1Loss(beta=1.0)  # beta is ฮด

predictions = torch.tensor([2.5, 0.0, 2.1, 10.0])  # Last value is outlier
targets = torch.tensor([3.0, -0.5, 2.0, 8.0])

loss = huber_loss(predictions, targets)
print(f"Huber Loss: {loss.item():.4f}")  # 0.4125

# Manual implementation
def huber_manual(pred, target, delta=1.0):
    error = pred - target
    abs_error = torch.abs(error)
    quadratic = torch.clamp(abs_error, max=delta)
    linear = abs_error - quadratic
    return torch.mean(0.5 * quadratic**2 + delta * linear)

loss_manual = huber_manual(predictions, targets, delta=1.0)
print(f"Manual Huber: {loss_manual.item():.4f}")  # 0.4125

์‚ฌ์šฉ ์‹œ๊ธฐ: - ์ผ๋ถ€ ์ด์ƒ์น˜๊ฐ€ ์žˆ์ง€๋งŒ ์—ฌ์ „ํžˆ ํฐ ์˜ค์ฐจ๋ฅผ ํŒจ๋„ํ‹ฐํ•˜๊ณ  ์‹ถ์„ ๋•Œ - ๊ฐ์ฒด ๊ฒ€์ถœ(๋ฐ”์šด๋”ฉ ๋ฐ•์Šค ํšŒ๊ท€) - ๋กœ๋ณดํ‹ฑ์Šค(์„ผ์„œ ํ“จ์ „)

2.4 ๋กœ๊ทธ-์ฝ”์‹œ ์†์‹ค(Log-Cosh Loss)

๊ณต์‹:

L(ลท, y) = ฮฃ log(cosh(ลทแตข - yแตข))

ํŠน์„ฑ: - ๋‘ ๋ฒˆ ๋ฏธ๋ถ„ ๊ฐ€๋Šฅ(Huber๋ณด๋‹ค ๋ถ€๋“œ๋Ÿฌ์›€) - ์ž‘์€ x์— ๋Œ€ํ•ด ๋Œ€๋žต (xยฒ/2), ํฐ x์— ๋Œ€ํ•ด |x| - MSE๋ณด๋‹ค ์ด์ƒ์น˜์— ๋œ ๋ฏผ๊ฐํ•จ

PyTorch ๊ตฌํ˜„:

def log_cosh_loss(pred, target):
    """Log-Cosh Loss"""
    error = pred - target
    return torch.mean(torch.log(torch.cosh(error)))

predictions = torch.tensor([2.5, 0.0, 2.1, 10.0])
targets = torch.tensor([3.0, -0.5, 2.0, 8.0])

loss = log_cosh_loss(predictions, targets)
print(f"Log-Cosh Loss: {loss.item():.4f}")

์‚ฌ์šฉ ์‹œ๊ธฐ: - ๋‘ ๋ฒˆ ๋ฏธ๋ถ„ ๊ฐ€๋Šฅํ•œ ์†์‹ค์ด ํ•„์š”ํ•  ๋•Œ(์˜ˆ: Hessian ๊ธฐ๋ฐ˜ ์ตœ์ ํ™”๊ธฐ) - XGBoost ๋ฐ ๊ธฐํƒ€ ๊ทธ๋ž˜๋””์–ธํŠธ ๋ถ€์ŠคํŒ… ๋ฐฉ๋ฒ•

2.5 ํšŒ๊ท€ ์†์‹ค ๋น„๊ต

import torch
import matplotlib.pyplot as plt

def compare_regression_losses():
    """Compare different regression losses"""
    errors = torch.linspace(-5, 5, 200)

    # Calculate losses
    mse = errors ** 2
    mae = torch.abs(errors)
    huber = torch.where(
        torch.abs(errors) <= 1.0,
        0.5 * errors ** 2,
        torch.abs(errors) - 0.5
    )
    log_cosh = torch.log(torch.cosh(errors))

    # Plot
    plt.figure(figsize=(12, 5))

    plt.subplot(1, 2, 1)
    plt.plot(errors.numpy(), mse.numpy(), label='MSE (L2)', linewidth=2)
    plt.plot(errors.numpy(), mae.numpy(), label='MAE (L1)', linewidth=2)
    plt.plot(errors.numpy(), huber.numpy(), label='Huber (ฮด=1)', linewidth=2)
    plt.plot(errors.numpy(), log_cosh.numpy(), label='Log-Cosh', linewidth=2)
    plt.xlabel('Prediction Error (ลท - y)')
    plt.ylabel('Loss')
    plt.title('Regression Losses')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.ylim(0, 10)

    # Gradient plot
    plt.subplot(1, 2, 2)
    grad_mse = 2 * errors
    grad_mae = torch.sign(errors)
    grad_huber = torch.where(
        torch.abs(errors) <= 1.0,
        errors,
        torch.sign(errors)
    )
    grad_log_cosh = torch.tanh(errors)

    plt.plot(errors.numpy(), grad_mse.numpy(), label='MSE grad', linewidth=2)
    plt.plot(errors.numpy(), grad_mae.numpy(), label='MAE grad', linewidth=2)
    plt.plot(errors.numpy(), grad_huber.numpy(), label='Huber grad', linewidth=2)
    plt.plot(errors.numpy(), grad_log_cosh.numpy(), label='Log-Cosh grad', linewidth=2)
    plt.xlabel('Prediction Error (ลท - y)')
    plt.ylabel('Gradient')
    plt.title('Loss Gradients')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.ylim(-5, 5)

    plt.tight_layout()
    plt.savefig('regression_losses.png', dpi=150, bbox_inches='tight')
    plt.show()

compare_regression_losses()

๋น„๊ต ํ‘œ:

์†์‹ค ์ด์ƒ์น˜ ๊ฐ•๊ฑด์„ฑ ๊ธฐ์šธ๊ธฐ ๋ถ€๋“œ๋Ÿฌ์›€ ์‚ฌ์šฉ ์‚ฌ๋ก€
MSE ๋‚ฎ์Œ ๋†’์Œ ๊นจ๋—ํ•œ ๋ฐ์ดํ„ฐ, ์•ˆ์ •์ ์ธ ํ›ˆ๋ จ
MAE ๋†’์Œ ๋‚ฎ์Œ (0์—์„œ ๋ถˆ์—ฐ์†) ์ด์ƒ์น˜๊ฐ€ ์žˆ๋Š” ๋ฐ์ดํ„ฐ
Huber ์ค‘๊ฐ„ ์ค‘๊ฐ„ MSE/MAE ์‚ฌ์ด ๊ท ํ˜•
Log-Cosh ์ค‘๊ฐ„-๋†’์Œ ๋†’์Œ (๋‘ ๋ฒˆ ๋ฏธ๋ถ„ ๊ฐ€๋Šฅ) ๊ณ ๊ธ‰ ์ตœ์ ํ™”๊ธฐ

3. ๋ถ„๋ฅ˜ ์†์‹ค(Classification Losses)

๋ถ„๋ฅ˜ ์†์‹ค์€ ์ด์‚ฐ ๋ ˆ์ด๋ธ” ์˜ˆ์ธก ์ž‘์—…์— ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

3.1 ์ด์ง„ ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ(Binary Cross-Entropy, BCE)

๊ณต์‹:

BCE = -(1/n) ฮฃ [yแตข log(ลทแตข) + (1-yแตข) log(1-ลทแตข)]

where:
- yแตข โˆˆ {0, 1} (true label)
- ลทแตข โˆˆ (0, 1) (predicted probability)

ํŠน์„ฑ: - ์ด์ง„ ๋ถ„๋ฅ˜์šฉ(2๊ฐœ ํด๋ž˜์Šค) - ํ™•๋ฅ ์„ ์ถœ๋ ฅํ•˜๊ธฐ ์œ„ํ•ด ์‹œ๊ทธ๋ชจ์ด๋“œ ํ™œ์„ฑํ™” ํ•„์š” - ๋ณผ๋ก ์†์‹ค ํ•จ์ˆ˜

PyTorch ๊ตฌํ˜„:

import torch
import torch.nn as nn

# Method 1: BCELoss (requires sigmoid applied first)
sigmoid = nn.Sigmoid()
bce_loss = nn.BCELoss()

logits = torch.tensor([0.5, 2.0, -1.0, 0.0])  # Raw outputs
targets = torch.tensor([1.0, 1.0, 0.0, 0.0])  # Binary labels

probabilities = sigmoid(logits)
loss = bce_loss(probabilities, targets)
print(f"BCE Loss: {loss.item():.4f}")

# Method 2: BCEWithLogitsLoss (numerically stable, combines sigmoid + BCE)
bce_with_logits = nn.BCEWithLogitsLoss()
loss_stable = bce_with_logits(logits, targets)
print(f"BCE with Logits: {loss_stable.item():.4f}")

# Manual implementation
def bce_manual(pred_probs, target):
    """Manual BCE (expects probabilities)"""
    epsilon = 1e-7  # For numerical stability
    pred_probs = torch.clamp(pred_probs, epsilon, 1 - epsilon)
    return -torch.mean(
        target * torch.log(pred_probs) +
        (1 - target) * torch.log(1 - pred_probs)
    )

loss_manual = bce_manual(probabilities, targets)
print(f"Manual BCE: {loss_manual.item():.4f}")

์ด์ง„ ๋ถ„๋ฅ˜ ์˜ˆ์ œ:

class BinaryClassifier(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)  # Single output

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)  # Raw logit

# Training setup
model = BinaryClassifier(input_dim=10)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Dummy data
inputs = torch.randn(32, 10)  # Batch of 32
targets = torch.randint(0, 2, (32, 1)).float()

# Training step
optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, targets)
loss.backward()
optimizer.step()

print(f"Training Loss: {loss.item():.4f}")

์‚ฌ์šฉ ์‹œ๊ธฐ: - ์ด์ง„ ๋ถ„๋ฅ˜(์ŠคํŒธ/์ŠคํŒธ ์•„๋‹˜, ๊ณ ์–‘์ด/๊ฐœ) - ๋‹ค์ค‘ ๋ ˆ์ด๋ธ” ๋ถ„๋ฅ˜(๊ฐ ๋ ˆ์ด๋ธ”์ด ๋…๋ฆฝ์ ) - ์‹œ๊ทธ๋ชจ์ด๋“œ ์ถœ๋ ฅ ํ™œ์„ฑํ™”

3.2 ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ ์†์‹ค(Cross-Entropy Loss, Multi-Class)

๊ณต์‹:

CE = -(1/n) ฮฃแตข ฮฃโฑผ yแตขโฑผ log(ลทแตขโฑผ)

where:
- yแตขโฑผ is 1 if sample i is class j, else 0
- ลทแตขโฑผ is predicted probability for class j

ํŠน์„ฑ: - ๋‹ค์ค‘ ํด๋ž˜์Šค ๋ถ„๋ฅ˜์šฉ(K > 2 ํด๋ž˜์Šค) - ์†Œํ”„ํŠธ๋งฅ์Šค ํ™œ์„ฑํ™” ํ•„์š” - PyTorch์˜ CrossEntropyLoss๋Š” softmax + NLLLoss๋ฅผ ๊ฒฐํ•ฉํ•จ

PyTorch ๊ตฌํ˜„:

# Method 1: CrossEntropyLoss (combines log_softmax + NLLLoss)
ce_loss = nn.CrossEntropyLoss()

logits = torch.tensor([
    [2.0, 1.0, 0.1],  # Sample 1
    [0.5, 2.5, 0.3],  # Sample 2
    [0.1, 0.2, 3.0],  # Sample 3
])
targets = torch.tensor([0, 1, 2])  # Class indices

loss = ce_loss(logits, targets)
print(f"CrossEntropy Loss: {loss.item():.4f}")

# Method 2: Manual with softmax + NLLLoss
log_softmax = nn.LogSoftmax(dim=1)
nll_loss = nn.NLLLoss()

log_probs = log_softmax(logits)
loss_manual = nll_loss(log_probs, targets)
print(f"Manual CE (LogSoftmax + NLL): {loss_manual.item():.4f}")

# Method 3: From scratch
def cross_entropy_manual(logits, targets):
    """Manual cross-entropy implementation"""
    # Compute log softmax
    max_logits = torch.max(logits, dim=1, keepdim=True)[0]
    exp_logits = torch.exp(logits - max_logits)  # Numerical stability
    log_probs = (logits - max_logits) - torch.log(torch.sum(exp_logits, dim=1, keepdim=True))

    # Gather log probabilities for correct classes
    batch_size = logits.size(0)
    loss = -log_probs[range(batch_size), targets].mean()
    return loss

loss_scratch = cross_entropy_manual(logits, targets)
print(f"From Scratch CE: {loss_scratch.item():.4f}")

๋‹ค์ค‘ ํด๋ž˜์Šค ๋ถ„๋ฅ˜ ์˜ˆ์ œ:

class MultiClassClassifier(nn.Module):
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, num_classes)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)  # Raw logits (no softmax)

# Training setup
num_classes = 10
model = MultiClassClassifier(input_dim=20, num_classes=num_classes)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Dummy data
inputs = torch.randn(64, 20)  # Batch of 64
targets = torch.randint(0, num_classes, (64,))  # Class indices

# Training step
optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, targets)
loss.backward()
optimizer.step()

print(f"Training Loss: {loss.item():.4f}")

# Inference
with torch.no_grad():
    test_input = torch.randn(1, 20)
    logits = model(test_input)
    probs = torch.softmax(logits, dim=1)
    predicted_class = torch.argmax(probs, dim=1)
    print(f"Predicted Class: {predicted_class.item()}")
    print(f"Class Probabilities: {probs[0].tolist()}")

3.3 ๋ ˆ์ด๋ธ” ์Šค๋ฌด๋”ฉ(Label Smoothing)

๊ฐœ๋…: ํ•˜๋“œ ๋ ˆ์ด๋ธ”(0 ๋˜๋Š” 1) ๋Œ€์‹  ์†Œํ”„ํŠธ ๋ ˆ์ด๋ธ” ์‚ฌ์šฉ:

y_smooth = y(1 - ฮต) + ฮต/K

where:
- ฮต is smoothing parameter (e.g., 0.1)
- K is number of classes

์žฅ์ : - ๊ณผ์‹ ๋ขฐ ๋ฐฉ์ง€ - ๋” ๋‚˜์€ ์ผ๋ฐ˜ํ™” - ์ •๊ทœํ™” ํšจ๊ณผ

PyTorch ๊ตฌํ˜„:

class LabelSmoothingCrossEntropy(nn.Module):
    def __init__(self, epsilon=0.1):
        super().__init__()
        self.epsilon = epsilon

    def forward(self, logits, targets):
        """
        Args:
            logits: (batch_size, num_classes)
            targets: (batch_size,) class indices
        """
        num_classes = logits.size(1)
        log_probs = torch.log_softmax(logits, dim=1)

        # Create smooth targets
        with torch.no_grad():
            true_dist = torch.zeros_like(log_probs)
            true_dist.fill_(self.epsilon / (num_classes - 1))
            true_dist.scatter_(1, targets.unsqueeze(1), 1.0 - self.epsilon)

        return torch.mean(torch.sum(-true_dist * log_probs, dim=1))

# Example usage
criterion = LabelSmoothingCrossEntropy(epsilon=0.1)

logits = torch.randn(4, 5)  # 4 samples, 5 classes
targets = torch.tensor([0, 2, 1, 4])

loss = criterion(logits, targets)
print(f"Label Smoothing CE: {loss.item():.4f}")

# Compare with standard CE
standard_ce = nn.CrossEntropyLoss()
loss_standard = standard_ce(logits, targets)
print(f"Standard CE: {loss_standard.item():.4f}")

์‚ฌ์šฉ ์‹œ๊ธฐ: - ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜(ImageNet, CIFAR) - ๊ณผ์‹ ๋ขฐ ๋ฐฉ์ง€ - ๋ชจ๋ธ ์บ˜๋ฆฌ๋ธŒ๋ ˆ์ด์…˜

3.4 ํฌ์ปฌ ์†์‹ค(Focal Loss)

๊ณต์‹:

FL(pโ‚œ) = -ฮฑโ‚œ(1 - pโ‚œ)^ฮณ log(pโ‚œ)

where:
- pโ‚œ = p if y=1, else 1-p
- ฮฑ balances class frequencies
- ฮณ focuses on hard examples (typically 2)

๋™๊ธฐ: - ํด๋ž˜์Šค ๋ถˆ๊ท ํ˜• ํ•ด๊ฒฐ(์˜ˆ: ๊ฐ์ฒด ๊ฒ€์ถœ์—์„œ 1:1000 ๋น„์œจ) - ์‰ฌ์šด ์˜ˆ์ œ์˜ ๊ฐ€์ค‘์น˜๋ฅผ ๋‚ฎ์ถ”๊ณ  ์–ด๋ ค์šด ๋„ค๊ฑฐํ‹ฐ๋ธŒ์— ์ง‘์ค‘ - RetinaNet ๋…ผ๋ฌธ์—์„œ ๋„์ž…

PyTorch ๊ตฌํ˜„:

class FocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2.0, reduction='mean'):
        """
        Args:
            alpha: Weighting factor for class imbalance (default 0.25)
            gamma: Focusing parameter (default 2.0)
        """
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.reduction = reduction

    def forward(self, logits, targets):
        """
        Args:
            logits: (N, C) raw predictions
            targets: (N,) class indices
        """
        ce_loss = nn.functional.cross_entropy(logits, targets, reduction='none')
        p = torch.exp(-ce_loss)  # Probability of correct class

        # Focal loss formula
        focal_weight = (1 - p) ** self.gamma
        focal_loss = self.alpha * focal_weight * ce_loss

        if self.reduction == 'mean':
            return focal_loss.mean()
        elif self.reduction == 'sum':
            return focal_loss.sum()
        else:
            return focal_loss

# Binary Focal Loss variant
class BinaryFocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, logits, targets):
        """
        Args:
            logits: (N,) or (N, 1) raw predictions
            targets: (N,) or (N, 1) binary labels {0, 1}
        """
        bce_loss = nn.functional.binary_cross_entropy_with_logits(
            logits, targets, reduction='none'
        )
        p = torch.sigmoid(logits)
        p_t = p * targets + (1 - p) * (1 - targets)

        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        focal_weight = (1 - p_t) ** self.gamma

        focal_loss = alpha_t * focal_weight * bce_loss
        return focal_loss.mean()

# Example: Imbalanced dataset
num_classes = 3
focal_loss = FocalLoss(alpha=0.25, gamma=2.0)

# Simulate imbalanced batch (mostly class 0)
logits = torch.randn(100, num_classes)
targets = torch.cat([
    torch.zeros(80, dtype=torch.long),   # 80% class 0
    torch.ones(15, dtype=torch.long),    # 15% class 1
    torch.full((5,), 2, dtype=torch.long)  # 5% class 2
])

loss_focal = focal_loss(logits, targets)
loss_ce = nn.CrossEntropyLoss()(logits, targets)

print(f"Focal Loss: {loss_focal.item():.4f}")
print(f"CE Loss: {loss_ce.item():.4f}")

ํฌ์ปฌ ์†์‹ค ์‹œ๊ฐํ™”:

import matplotlib.pyplot as plt
import numpy as np

def visualize_focal_loss():
    """Visualize how focal loss down-weights easy examples"""
    p = np.linspace(0.01, 1, 100)  # Probability of correct class

    # Standard CE
    ce = -np.log(p)

    # Focal loss with different ฮณ
    fl_gamma_0 = ce  # ฮณ=0 is same as CE
    fl_gamma_1 = (1 - p) * ce
    fl_gamma_2 = (1 - p)**2 * ce
    fl_gamma_5 = (1 - p)**5 * ce

    plt.figure(figsize=(10, 6))
    plt.plot(p, ce, label='CE (ฮณ=0)', linewidth=2)
    plt.plot(p, fl_gamma_1, label='FL (ฮณ=1)', linewidth=2)
    plt.plot(p, fl_gamma_2, label='FL (ฮณ=2)', linewidth=2)
    plt.plot(p, fl_gamma_5, label='FL (ฮณ=5)', linewidth=2)

    plt.xlabel('Probability of Correct Class (p)', fontsize=12)
    plt.ylabel('Loss', fontsize=12)
    plt.title('Focal Loss: Down-weighting Easy Examples', fontsize=14)
    plt.legend(fontsize=11)
    plt.grid(True, alpha=0.3)
    plt.xlim(0, 1)
    plt.ylim(0, 5)

    # Annotate easy vs hard examples
    plt.axvline(x=0.5, color='red', linestyle='--', alpha=0.5)
    plt.text(0.25, 4.5, 'Hard Examples\n(low confidence)', fontsize=10, ha='center')
    plt.text(0.75, 4.5, 'Easy Examples\n(high confidence)', fontsize=10, ha='center')

    plt.tight_layout()
    plt.savefig('focal_loss.png', dpi=150, bbox_inches='tight')
    plt.show()

visualize_focal_loss()

์‚ฌ์šฉ ์‹œ๊ธฐ: - ๊ฐ์ฒด ๊ฒ€์ถœ(RetinaNet, FCOS) - ๋ถˆ๊ท ํ˜• ๋ถ„๋ฅ˜(์‚ฌ๊ธฐ ํƒ์ง€, ์˜๋ฃŒ ์ง„๋‹จ) - ๋งŽ์€ ์‰ฌ์šด ๋„ค๊ฑฐํ‹ฐ๋ธŒ๊ฐ€ ์žˆ์„ ๋•Œ

3.5 BCE vs Cross-Entropy

์˜์‚ฌ๊ฒฐ์ • ๊ฐ€์ด๋“œ:

์ž‘์—… ์†์‹ค ํ™œ์„ฑํ™” ๋น„๊ณ 
์ด์ง„ ๋ถ„๋ฅ˜ BCEWithLogitsLoss None (ํฌํ•จ๋จ) 2๊ฐœ ํด๋ž˜์Šค, ์ƒํ˜ธ ๋ฐฐํƒ€์ 
๋‹ค์ค‘ ํด๋ž˜์Šค ๋ถ„๋ฅ˜ CrossEntropyLoss None (ํฌํ•จ๋จ) K๊ฐœ ํด๋ž˜์Šค, ์ƒํ˜ธ ๋ฐฐํƒ€์ 
๋‹ค์ค‘ ๋ ˆ์ด๋ธ” ๋ถ„๋ฅ˜ BCEWithLogitsLoss None (ํฌํ•จ๋จ) ์—ฌ๋Ÿฌ ๋…๋ฆฝ ๋ ˆ์ด๋ธ”
๋ถˆ๊ท ํ˜• ๋ถ„๋ฅ˜ FocalLoss Softmax ํด๋ž˜์Šค ๋ถˆ๊ท ํ˜•

์˜ˆ์ œ: ๋‹ค์ค‘ ๋ ˆ์ด๋ธ” ๋ถ„๋ฅ˜:

# Multi-label: Each sample can belong to multiple classes
class MultiLabelClassifier(nn.Module):
    def __init__(self, input_dim, num_labels):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 128)
        self.fc2 = nn.Linear(128, num_labels)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)  # Raw logits

model = MultiLabelClassifier(input_dim=50, num_labels=5)
criterion = nn.BCEWithLogitsLoss()  # Use BCE, not CE!

# Multi-label targets (sample can have multiple 1s)
inputs = torch.randn(32, 50)
targets = torch.tensor([
    [1, 0, 1, 0, 1],  # Sample 1: classes 0, 2, 4
    [0, 1, 1, 0, 0],  # Sample 2: classes 1, 2
    # ... more samples
]).float()

outputs = model(inputs[:2])
loss = criterion(outputs, targets)
print(f"Multi-Label Loss: {loss.item():.4f}")

4. ์ˆœ์œ„ ๋ฐ ๋ฉ”ํŠธ๋ฆญ ํ•™์Šต ์†์‹ค(Ranking and Metric Learning Losses)

์ด ์†์‹ค๋“ค์€ ์œ ์‚ฌํ•œ ํ•ญ๋ชฉ์€ ๊ฐ€๊น๊ณ  ๋น„์œ ์‚ฌํ•œ ํ•ญ๋ชฉ์€ ๋ฉ€๋ฆฌ ๋–จ์–ด์ง„ ์ž„๋ฒ ๋”ฉ์„ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.

4.1 ๋Œ€์กฐ ์†์‹ค(Contrastive Loss)

๊ณต์‹:

L = (1/2) * [y * dยฒ + (1-y) * max(0, m - d)ยฒ]

where:
- d = ||f(xโ‚) - f(xโ‚‚)||โ‚‚ (Euclidean distance)
- y = 1 if similar, 0 if dissimilar
- m is margin (e.g., 1.0)

์‚ฌ์šฉ ์‚ฌ๋ก€: - ์ƒด ๋„คํŠธ์›Œํฌ(Siamese networks) - ์–ผ๊ตด ๊ฒ€์ฆ - ์„œ๋ช… ๊ฒ€์ฆ

PyTorch ๊ตฌํ˜„:

class ContrastiveLoss(nn.Module):
    def __init__(self, margin=1.0):
        super().__init__()
        self.margin = margin

    def forward(self, output1, output2, label):
        """
        Args:
            output1: (N, embedding_dim) embeddings from first input
            output2: (N, embedding_dim) embeddings from second input
            label: (N,) 1 if similar, 0 if dissimilar
        """
        euclidean_distance = nn.functional.pairwise_distance(output1, output2)

        loss_contrastive = torch.mean(
            label * torch.pow(euclidean_distance, 2) +
            (1 - label) * torch.pow(
                torch.clamp(self.margin - euclidean_distance, min=0.0), 2
            )
        )
        return loss_contrastive

# Siamese Network Example
class SiameseNetwork(nn.Module):
    def __init__(self, input_dim, embedding_dim=128):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, embedding_dim)
        )

    def forward_one(self, x):
        return self.fc(x)

    def forward(self, x1, x2):
        out1 = self.forward_one(x1)
        out2 = self.forward_one(x2)
        return out1, out2

# Training
model = SiameseNetwork(input_dim=784, embedding_dim=128)
criterion = ContrastiveLoss(margin=1.0)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Dummy data
x1 = torch.randn(32, 784)
x2 = torch.randn(32, 784)
labels = torch.randint(0, 2, (32,)).float()  # 1=similar, 0=dissimilar

# Training step
optimizer.zero_grad()
output1, output2 = model(x1, x2)
loss = criterion(output1, output2, labels)
loss.backward()
optimizer.step()

print(f"Contrastive Loss: {loss.item():.4f}")

4.2 ํŠธ๋ฆฌํ”Œ๋ › ์†์‹ค(Triplet Loss)

๊ณต์‹:

L = max(0, d(a, p) - d(a, n) + margin)

where:
- a: anchor
- p: positive (same class as anchor)
- n: negative (different class)
- d(x, y) = ||f(x) - f(y)||โ‚‚

๋งˆ์ด๋‹ ์ „๋žต: - ํ•˜๋“œ ๋„ค๊ฑฐํ‹ฐ๋ธŒ: max d(a, p) - d(a, n) - ์„ธ๋ฏธ-ํ•˜๋“œ ๋„ค๊ฑฐํ‹ฐ๋ธŒ: d(a, p) < d(a, n) < d(a, p) + margin - ๋ฐฐ์น˜-์˜ฌ: ๋ฐฐ์น˜ ๋‚ด ๋ชจ๋“  ์œ ํšจํ•œ ํŠธ๋ฆฌํ”Œ๋ ›

PyTorch ๊ตฌํ˜„:

class TripletLoss(nn.Module):
    def __init__(self, margin=1.0):
        super().__init__()
        self.margin = margin

    def forward(self, anchor, positive, negative):
        """
        Args:
            anchor: (N, embedding_dim)
            positive: (N, embedding_dim)
            negative: (N, embedding_dim)
        """
        distance_positive = (anchor - positive).pow(2).sum(1)
        distance_negative = (anchor - negative).pow(2).sum(1)

        losses = torch.relu(distance_positive - distance_negative + self.margin)
        return losses.mean()

# Online Triplet Mining
class OnlineTripletLoss(nn.Module):
    def __init__(self, margin=1.0):
        super().__init__()
        self.margin = margin

    def forward(self, embeddings, labels):
        """
        Args:
            embeddings: (N, embedding_dim)
            labels: (N,) class labels
        """
        # Compute pairwise distances
        pairwise_dist = torch.cdist(embeddings, embeddings, p=2)

        # For each anchor, get hardest positive and negative
        batch_size = embeddings.size(0)
        triplet_loss = 0
        num_triplets = 0

        for i in range(batch_size):
            # Positive mask: same class as anchor
            pos_mask = labels == labels[i]
            pos_mask[i] = False  # Exclude anchor itself

            # Negative mask: different class
            neg_mask = labels != labels[i]

            if pos_mask.any() and neg_mask.any():
                # Hardest positive
                hardest_positive_dist = pairwise_dist[i][pos_mask].max()

                # Hardest negative (closest negative)
                hardest_negative_dist = pairwise_dist[i][neg_mask].min()

                loss = torch.relu(
                    hardest_positive_dist - hardest_negative_dist + self.margin
                )
                triplet_loss += loss
                num_triplets += 1

        return triplet_loss / max(num_triplets, 1)

# Example usage
embedding_dim = 128
criterion = TripletLoss(margin=1.0)
online_criterion = OnlineTripletLoss(margin=1.0)

# Method 1: Pre-mined triplets
anchor = torch.randn(32, embedding_dim)
positive = torch.randn(32, embedding_dim)
negative = torch.randn(32, embedding_dim)

loss = criterion(anchor, positive, negative)
print(f"Triplet Loss: {loss.item():.4f}")

# Method 2: Online mining
embeddings = torch.randn(64, embedding_dim)
labels = torch.randint(0, 10, (64,))  # 10 classes

loss_online = online_criterion(embeddings, labels)
print(f"Online Triplet Loss: {loss_online.item():.4f}")

ํ›ˆ๋ จ ์˜ˆ์ œ:

class EmbeddingNet(nn.Module):
    def __init__(self, input_dim, embedding_dim=128):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, embedding_dim)
        )

    def forward(self, x):
        return self.fc(x)

# Training loop
model = EmbeddingNet(input_dim=784, embedding_dim=128)
criterion = OnlineTripletLoss(margin=1.0)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(10):
    # Simulate batch
    inputs = torch.randn(64, 784)
    labels = torch.randint(0, 10, (64,))

    optimizer.zero_grad()
    embeddings = model(inputs)
    loss = criterion(embeddings, labels)
    loss.backward()
    optimizer.step()

    if epoch % 2 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

์‚ฌ์šฉ ์‹œ๊ธฐ: - ์–ผ๊ตด ์ธ์‹(FaceNet) - ์‚ฌ๋žŒ ์žฌ์‹๋ณ„ - ์ด๋ฏธ์ง€ ๊ฒ€์ƒ‰

4.3 InfoNCE / NT-Xent ์†์‹ค

๊ณต์‹:

L = -log [exp(sim(z_i, z_j)/ฯ„) / ฮฃโ‚– exp(sim(z_i, z_k)/ฯ„)]

where:
- z_i, z_j are positive pair embeddings
- ฯ„ is temperature parameter (e.g., 0.07)
- sim(u, v) = uยทv / (||u|| ||v||) (cosine similarity)

์‚ฌ์šฉ ์‚ฌ๋ก€: - ์ž๊ธฐ์ง€๋„ ํ•™์Šต(Self-supervised learning, SimCLR, MoCo) - ๋Œ€์กฐ์  ์–ธ์–ด-์ด๋ฏธ์ง€ ์‚ฌ์ „ ํ›ˆ๋ จ(CLIP)

PyTorch ๊ตฌํ˜„:

class InfoNCELoss(nn.Module):
    def __init__(self, temperature=0.07):
        super().__init__()
        self.temperature = temperature

    def forward(self, z_i, z_j):
        """
        Args:
            z_i: (N, embedding_dim) embeddings of view 1
            z_j: (N, embedding_dim) embeddings of view 2
        """
        batch_size = z_i.size(0)

        # Normalize embeddings
        z_i = nn.functional.normalize(z_i, dim=1)
        z_j = nn.functional.normalize(z_j, dim=1)

        # Compute similarity matrix
        representations = torch.cat([z_i, z_j], dim=0)  # (2N, dim)
        similarity_matrix = torch.mm(representations, representations.T)  # (2N, 2N)

        # Create labels: positive pairs are at (i, N+i) and (N+i, i)
        labels = torch.cat([
            torch.arange(batch_size) + batch_size,
            torch.arange(batch_size)
        ], dim=0).to(z_i.device)

        # Mask self-similarity: ๋Œ€๊ฐ์„ ์„ -inf๋กœ ์ฑ„์›Œ ์ž๊ธฐ ์ž์‹ ๊ณผ์˜ ์œ ์‚ฌ๋„๋ฅผ ์ œ์™ธ
        # (๋Œ€๊ฐ์„ ์„ ์ œ๊ฑฐํ•ด view๋กœ ์žฌ๋ฐฐ์—ดํ•˜๋ฉด ์—ด ์ธ๋ฑ์Šค๊ฐ€ ๋ฐ€๋ ค ์œ„ labels์™€ ์–ด๊ธ‹๋‚˜๋ฏ€๋กœ ํ–‰๋ ฌ ํฌ๊ธฐ๋Š” ๊ทธ๋Œ€๋กœ ์œ ์ง€)
        mask = torch.eye(2 * batch_size, dtype=torch.bool, device=z_i.device)
        similarity_matrix = similarity_matrix.masked_fill(mask, float('-inf'))

        # Compute loss
        similarity_matrix = similarity_matrix / self.temperature
        loss = nn.functional.cross_entropy(similarity_matrix, labels)

        return loss

# Simplified NT-Xent (for SimCLR)
class NTXentLoss(nn.Module):
    def __init__(self, temperature=0.5):
        super().__init__()
        self.temperature = temperature

    def forward(self, z_i, z_j):
        """Simplified version"""
        batch_size = z_i.size(0)

        # L2 normalize
        z_i = nn.functional.normalize(z_i, dim=1)
        z_j = nn.functional.normalize(z_j, dim=1)

        # Positive similarity
        pos_sim = (z_i * z_j).sum(dim=1) / self.temperature  # (N,)

        # All similarities
        z = torch.cat([z_i, z_j], dim=0)  # (2N, dim)
        sim_matrix = torch.mm(z, z.T) / self.temperature  # (2N, 2N)

        # Remove diagonal
        sim_matrix.fill_diagonal_(-float('inf'))

        # Compute loss for i -> j
        pos_sim_expanded = pos_sim.unsqueeze(1)  # (N, 1)
        negatives_i = sim_matrix[:batch_size]  # (N, 2N)
        logits_i = torch.cat([pos_sim_expanded, negatives_i], dim=1)  # (N, 2N+1)
        labels = torch.zeros(batch_size, dtype=torch.long, device=z_i.device)

        loss = nn.functional.cross_entropy(logits_i, labels)
        return loss

# Example usage
criterion = InfoNCELoss(temperature=0.07)

# Simulate augmented views
z_i = torch.randn(128, 256)  # View 1 embeddings
z_j = torch.randn(128, 256)  # View 2 embeddings

loss = criterion(z_i, z_j)
print(f"InfoNCE Loss: {loss.item():.4f}")

์‚ฌ์šฉ ์‹œ๊ธฐ: - ์ž๊ธฐ์ง€๋„ ์‚ฌ์ „ ํ›ˆ๋ จ(SimCLR) - ๋น„์ „-์–ธ์–ด ๋ชจ๋ธ(CLIP) - ๋Œ€์กฐ ํ•™์Šต


5. ์„ธ๊ทธ๋ฉ˜ํ…Œ์ด์…˜ ๋ฐ ๊ฒ€์ถœ ์†์‹ค(Segmentation and Detection Losses)

5.1 ๋‹ค์ด์Šค ์†์‹ค(Dice Loss)

๊ณต์‹:

Dice Loss = 1 - (2 * |X โˆฉ Y|) / (|X| + |Y|)

where:
- X is predicted segmentation
- Y is ground truth

ํŠน์„ฑ: - ํด๋ž˜์Šค ๋ถˆ๊ท ํ˜• ์ฒ˜๋ฆฌ(๋ฐฐ๊ฒฝ vs ์ „๊ฒฝ) - IoU์˜ ๋ฏธ๋ถ„ ๊ฐ€๋Šฅํ•œ ๊ทผ์‚ฌ - ๋ฒ”์œ„: [0, 1]

PyTorch ๊ตฌํ˜„:

class DiceLoss(nn.Module):
    def __init__(self, smooth=1.0):
        """
        Args:
            smooth: Smoothing constant to avoid division by zero
        """
        super().__init__()
        self.smooth = smooth

    def forward(self, pred, target):
        """
        Args:
            pred: (N, C, H, W) predicted probabilities
            target: (N, C, H, W) one-hot encoded ground truth
        """
        pred_flat = pred.view(-1)
        target_flat = target.view(-1)

        intersection = (pred_flat * target_flat).sum()
        dice_score = (2. * intersection + self.smooth) / (
            pred_flat.sum() + target_flat.sum() + self.smooth
        )

        return 1 - dice_score

# Multi-class Dice Loss
class MultiClassDiceLoss(nn.Module):
    def __init__(self, smooth=1.0):
        super().__init__()
        self.smooth = smooth

    def forward(self, pred, target):
        """
        Args:
            pred: (N, C, H, W) logits
            target: (N, H, W) class indices
        """
        num_classes = pred.size(1)
        pred_softmax = torch.softmax(pred, dim=1)

        # Convert target to one-hot
        target_one_hot = nn.functional.one_hot(target, num_classes)
        target_one_hot = target_one_hot.permute(0, 3, 1, 2).float()

        total_loss = 0
        for c in range(num_classes):
            pred_c = pred_softmax[:, c]
            target_c = target_one_hot[:, c]

            intersection = (pred_c * target_c).sum()
            dice = (2. * intersection + self.smooth) / (
                pred_c.sum() + target_c.sum() + self.smooth
            )
            total_loss += 1 - dice

        return total_loss / num_classes

# Example usage
batch_size, num_classes, H, W = 4, 3, 256, 256

# Binary segmentation
pred_binary = torch.sigmoid(torch.randn(batch_size, 1, H, W))
target_binary = torch.randint(0, 2, (batch_size, 1, H, W)).float()

dice_loss = DiceLoss()
loss_binary = dice_loss(pred_binary, target_binary)
print(f"Binary Dice Loss: {loss_binary.item():.4f}")

# Multi-class segmentation
pred_multi = torch.randn(batch_size, num_classes, H, W)
target_multi = torch.randint(0, num_classes, (batch_size, H, W))

dice_multi = MultiClassDiceLoss()
loss_multi = dice_multi(pred_multi, target_multi)
print(f"Multi-class Dice Loss: {loss_multi.item():.4f}")

5.2 IoU ์†์‹ค / GIoU ์†์‹ค

IoU (Intersection over Union):

IoU = Area(Intersection) / Area(Union)

GIoU (Generalized IoU):

GIoU = IoU - |C \ (A โˆช B)| / |C|

where C is smallest box enclosing A and B

PyTorch ๊ตฌํ˜„:

def iou_loss(pred_boxes, target_boxes):
    """
    Args:
        pred_boxes: (N, 4) [x1, y1, x2, y2]
        target_boxes: (N, 4) [x1, y1, x2, y2]
    """
    # Intersection coordinates
    x1_inter = torch.max(pred_boxes[:, 0], target_boxes[:, 0])
    y1_inter = torch.max(pred_boxes[:, 1], target_boxes[:, 1])
    x2_inter = torch.min(pred_boxes[:, 2], target_boxes[:, 2])
    y2_inter = torch.min(pred_boxes[:, 3], target_boxes[:, 3])

    # Intersection area
    inter_area = torch.clamp(x2_inter - x1_inter, min=0) * \
                 torch.clamp(y2_inter - y1_inter, min=0)

    # Union area
    pred_area = (pred_boxes[:, 2] - pred_boxes[:, 0]) * \
                (pred_boxes[:, 3] - pred_boxes[:, 1])
    target_area = (target_boxes[:, 2] - target_boxes[:, 0]) * \
                  (target_boxes[:, 3] - target_boxes[:, 1])
    union_area = pred_area + target_area - inter_area

    # IoU
    iou = inter_area / (union_area + 1e-6)

    return 1 - iou.mean()

def giou_loss(pred_boxes, target_boxes):
    """Generalized IoU Loss"""
    # Intersection
    x1_inter = torch.max(pred_boxes[:, 0], target_boxes[:, 0])
    y1_inter = torch.max(pred_boxes[:, 1], target_boxes[:, 1])
    x2_inter = torch.min(pred_boxes[:, 2], target_boxes[:, 2])
    y2_inter = torch.min(pred_boxes[:, 3], target_boxes[:, 3])

    inter_area = torch.clamp(x2_inter - x1_inter, min=0) * \
                 torch.clamp(y2_inter - y1_inter, min=0)

    # Union
    pred_area = (pred_boxes[:, 2] - pred_boxes[:, 0]) * \
                (pred_boxes[:, 3] - pred_boxes[:, 1])
    target_area = (target_boxes[:, 2] - target_boxes[:, 0]) * \
                  (target_boxes[:, 3] - target_boxes[:, 1])
    union_area = pred_area + target_area - inter_area

    # IoU
    iou = inter_area / (union_area + 1e-6)

    # Enclosing box
    x1_c = torch.min(pred_boxes[:, 0], target_boxes[:, 0])
    y1_c = torch.min(pred_boxes[:, 1], target_boxes[:, 1])
    x2_c = torch.max(pred_boxes[:, 2], target_boxes[:, 2])
    y2_c = torch.max(pred_boxes[:, 3], target_boxes[:, 3])

    enclosing_area = (x2_c - x1_c) * (y2_c - y1_c)

    # GIoU
    giou = iou - (enclosing_area - union_area) / (enclosing_area + 1e-6)

    return 1 - giou.mean()

# Example usage
pred_boxes = torch.tensor([
    [10, 10, 50, 50],
    [20, 20, 60, 60]
]).float()

target_boxes = torch.tensor([
    [15, 15, 55, 55],
    [25, 25, 65, 65]
]).float()

loss_iou = iou_loss(pred_boxes, target_boxes)
loss_giou = giou_loss(pred_boxes, target_boxes)

print(f"IoU Loss: {loss_iou.item():.4f}")
print(f"GIoU Loss: {loss_giou.item():.4f}")

5.3 ๊ฒฐํ•ฉ ์†์‹ค(Combined Losses)

์˜ˆ์ œ: ์„ธ๊ทธ๋ฉ˜ํ…Œ์ด์…˜์„ ์œ„ํ•œ CE + Dice:

class CombinedLoss(nn.Module):
    def __init__(self, ce_weight=0.5, dice_weight=0.5):
        super().__init__()
        self.ce_weight = ce_weight
        self.dice_weight = dice_weight
        self.ce = nn.CrossEntropyLoss()
        self.dice = MultiClassDiceLoss()

    def forward(self, pred, target):
        """
        Args:
            pred: (N, C, H, W) logits
            target: (N, H, W) class indices
        """
        ce_loss = self.ce(pred, target)
        dice_loss = self.dice(pred, target)

        return self.ce_weight * ce_loss + self.dice_weight * dice_loss

# Example
criterion = CombinedLoss(ce_weight=0.5, dice_weight=0.5)
pred = torch.randn(2, 3, 128, 128)
target = torch.randint(0, 3, (2, 128, 128))

loss = criterion(pred, target)
print(f"Combined Loss (CE + Dice): {loss.item():.4f}")

6. ์ƒ์„ฑ ๋ชจ๋ธ ์†์‹ค(Generative Model Losses)

6.1 ์ ๋Œ€์  ์†์‹ค(Adversarial Loss, GAN)

Minimax GAN:

min_G max_D V(D,G) = E[log D(x)] + E[log(1 - D(G(z)))]

PyTorch ๊ตฌํ˜„:

# Standard GAN loss
def gan_loss_discriminator(real_output, fake_output):
    """Discriminator loss"""
    real_loss = nn.functional.binary_cross_entropy_with_logits(
        real_output, torch.ones_like(real_output)
    )
    fake_loss = nn.functional.binary_cross_entropy_with_logits(
        fake_output, torch.zeros_like(fake_output)
    )
    return real_loss + fake_loss

def gan_loss_generator(fake_output):
    """Generator loss (non-saturating)"""
    return nn.functional.binary_cross_entropy_with_logits(
        fake_output, torch.ones_like(fake_output)
    )

# Wasserstein GAN loss
def wgan_loss_discriminator(real_output, fake_output):
    """WGAN discriminator loss"""
    return -(torch.mean(real_output) - torch.mean(fake_output))

def wgan_loss_generator(fake_output):
    """WGAN generator loss"""
    return -torch.mean(fake_output)

# Example usage
real_output = torch.randn(32, 1)  # Discriminator output for real images
fake_output = torch.randn(32, 1)  # Discriminator output for fake images

# Standard GAN
d_loss = gan_loss_discriminator(real_output, fake_output)
g_loss = gan_loss_generator(fake_output)
print(f"GAN D Loss: {d_loss.item():.4f}, G Loss: {g_loss.item():.4f}")

# WGAN
d_loss_wgan = wgan_loss_discriminator(real_output, fake_output)
g_loss_wgan = wgan_loss_generator(fake_output)
print(f"WGAN D Loss: {d_loss_wgan.item():.4f}, G Loss: {g_loss_wgan.item():.4f}")

6.2 VAE ์†์‹ค

๊ณต์‹:

L = Reconstruction Loss + KL Divergence
  = E[log p(x|z)] + KL(q(z|x) || p(z))

PyTorch ๊ตฌํ˜„:

def vae_loss(recon_x, x, mu, logvar, beta=1.0):
    """
    Args:
        recon_x: Reconstructed input
        x: Original input
        mu: Mean of latent distribution
        logvar: Log variance of latent distribution
        beta: Weight for KL term (ฮฒ-VAE)
    """
    # Reconstruction loss (BCE for binary images, MSE for continuous)
    recon_loss = nn.functional.binary_cross_entropy(
        recon_x, x, reduction='sum'
    )

    # KL divergence: -0.5 * ฮฃ(1 + log(ฯƒยฒ) - ฮผยฒ - ฯƒยฒ)
    kl_div = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return recon_loss + beta * kl_div

# Example VAE
class VAE(nn.Module):
    def __init__(self, input_dim, hidden_dim, latent_dim):
        super().__init__()
        # Encoder
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)

        # Decoder
        self.fc3 = nn.Linear(latent_dim, hidden_dim)
        self.fc4 = nn.Linear(hidden_dim, input_dim)

    def encode(self, x):
        h = torch.relu(self.fc1(x))
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        h = torch.relu(self.fc3(z))
        return torch.sigmoid(self.fc4(h))

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        recon_x = self.decode(z)
        return recon_x, mu, logvar

# Training
model = VAE(input_dim=784, hidden_dim=400, latent_dim=20)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

x = torch.randn(64, 784).sigmoid()  # Dummy data
recon_x, mu, logvar = model(x)

loss = vae_loss(recon_x, x, mu, logvar, beta=1.0)
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(f"VAE Loss: {loss.item():.4f}")

6.3 ์ง€๊ฐ์  ์†์‹ค(Perceptual Loss)

๊ฐœ๋…: ํ”ฝ์…€๋ณ„ ๋น„๊ต ๋Œ€์‹  ์‚ฌ์ „ ํ›ˆ๋ จ๋œ ๋„คํŠธ์›Œํฌ(์˜ˆ: VGG)์˜ ํŠน์ง•๋ณ„ ๋น„๊ต ์‚ฌ์šฉ.

PyTorch ๊ตฌํ˜„:

import torchvision.models as models

class PerceptualLoss(nn.Module):
    def __init__(self, layers=['relu1_2', 'relu2_2', 'relu3_3']):
        super().__init__()
        # Load pretrained VGG16
        vgg = models.vgg16(pretrained=True).features
        self.feature_extractor = nn.ModuleDict()

        # Map layer names to indices
        layer_map = {
            'relu1_2': 4, 'relu2_2': 9, 'relu3_3': 16,
            'relu4_3': 23, 'relu5_3': 30
        }

        for name in layers:
            idx = layer_map[name]
            self.feature_extractor[name] = nn.Sequential(*list(vgg[:idx+1]))

        # Freeze parameters
        for param in self.parameters():
            param.requires_grad = False

    def forward(self, pred, target):
        """
        Args:
            pred: (N, 3, H, W) predicted image
            target: (N, 3, H, W) target image
        """
        loss = 0
        for name, extractor in self.feature_extractor.items():
            pred_features = extractor(pred)
            target_features = extractor(target)
            loss += nn.functional.mse_loss(pred_features, target_features)

        return loss / len(self.feature_extractor)

# Example usage (requires torchvision)
# perceptual_loss = PerceptualLoss()
# pred_img = torch.randn(4, 3, 224, 224)
# target_img = torch.randn(4, 3, 224, 224)
# loss = perceptual_loss(pred_img, target_img)
# print(f"Perceptual Loss: {loss.item():.4f}")

์‚ฌ์šฉ ์‹œ๊ธฐ: - ์Šคํƒ€์ผ ์ „์ด - ์ดˆํ•ด์ƒ๋„(Super-resolution) - ์ด๋ฏธ์ง€ ๊ฐ„ ๋ณ€ํ™˜


7. ๊ณ ๊ธ‰ ์ฃผ์ œ(Advanced Topics)

7.1 ๋‹ค์ค‘ ์ž‘์—… ์†์‹ค ๊ฐ€์ค‘์น˜(Multi-Task Loss Weighting)

๋ฌธ์ œ: ์—ฌ๋Ÿฌ ์ž‘์—…์„ ๋™์‹œ์— ํ›ˆ๋ จํ•  ๋•Œ ์†์‹ค์˜ ๊ท ํ˜•์„ ์–ด๋–ป๊ฒŒ ๋งž์ถœ๊นŒ?

๋ฐฉ๋ฒ• 1: ์ˆ˜๋™ ๊ฐ€์ค‘์น˜

total_loss = w1 * task1_loss + w2 * task2_loss + w3 * task3_loss

๋ฐฉ๋ฒ• 2: ๋ถˆํ™•์‹ค์„ฑ ๊ฐ€์ค‘์น˜(Uncertainty Weighting)

"Multi-Task Learning Using Uncertainty to Weigh Losses" (Kendall et al., 2018) ๊ธฐ๋ฐ˜.

class MultiTaskLoss(nn.Module):
    def __init__(self, num_tasks):
        super().__init__()
        # Learnable log variance for each task
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, losses):
        """
        Args:
            losses: List of losses for each task
        """
        total_loss = 0
        for i, loss in enumerate(losses):
            precision = torch.exp(-self.log_vars[i])
            total_loss += precision * loss + self.log_vars[i]

        return total_loss

# Example
mtl = MultiTaskLoss(num_tasks=3)
optimizer = torch.optim.Adam(mtl.parameters(), lr=0.01)

# Simulate task losses
task_losses = [
    torch.tensor(2.5),  # Task 1
    torch.tensor(0.8),  # Task 2
    torch.tensor(1.2),  # Task 3
]

total_loss = mtl(task_losses)
print(f"Multi-task Loss: {total_loss.item():.4f}")
print(f"Learned weights: {torch.exp(-mtl.log_vars).detach()}")

๋ฐฉ๋ฒ• 3: GradNorm

๊ธฐ์šธ๊ธฐ๋ฅผ ์ •๊ทœํ™”ํ•˜์—ฌ ์ž‘์—… ์†์‹ค์˜ ๊ท ํ˜•์„ ๋งž์ถฅ๋‹ˆ๋‹ค.

class GradNorm:
    def __init__(self, model, num_tasks, alpha=1.5):
        """
        Args:
            model: Shared network
            num_tasks: Number of tasks
            alpha: Restoring force (typically 1.5)
        """
        self.model = model
        self.num_tasks = num_tasks
        self.alpha = alpha
        self.weights = nn.Parameter(torch.ones(num_tasks))
        self.initial_losses = None

    def compute_weights(self, losses, shared_params):
        """Update task weights based on gradient norms"""
        if self.initial_losses is None:
            self.initial_losses = losses.detach()

        # Compute weighted loss
        weighted_losses = losses * self.weights
        total_loss = weighted_losses.sum()

        # Compute gradients
        total_loss.backward(retain_graph=True)

        # Get gradient norms for shared layers
        grad_norms = []
        for i in range(self.num_tasks):
            # ... compute grad norm for task i
            pass

        # Update weights (simplified version)
        return self.weights

# Note: Full GradNorm implementation requires careful gradient manipulation

7.2 ์ปค๋ฆฌํ˜๋Ÿผ ์†์‹ค(Curriculum Loss)

๊ฐœ๋…: ์‰ฌ์šด ์˜ˆ์ œ๋กœ ํ›ˆ๋ จ์„ ์‹œ์ž‘ํ•˜๊ณ  ์ ์ฐจ ๋‚œ์ด๋„๋ฅผ ๋†’์ž…๋‹ˆ๋‹ค.

class CurriculumLoss(nn.Module):
    def __init__(self, base_criterion, total_epochs):
        super().__init__()
        self.base_criterion = base_criterion
        self.total_epochs = total_epochs
        self.current_epoch = 0

    def forward(self, pred, target, difficulty):
        """
        Args:
            pred: Predictions
            target: Ground truth
            difficulty: (N,) difficulty score for each sample [0, 1]
        """
        # Compute base loss
        losses = self.base_criterion(pred, target)

        # Curriculum weight: easier samples first
        progress = self.current_epoch / self.total_epochs
        threshold = progress  # Gradually increase difficulty threshold

        weights = (difficulty <= threshold).float()
        weights = weights / (weights.sum() + 1e-6) * weights.size(0)

        return (losses * weights).sum()

    def step_epoch(self):
        self.current_epoch += 1

# Example usage
criterion = CurriculumLoss(nn.CrossEntropyLoss(reduction='none'), total_epochs=100)

for epoch in range(100):
    pred = torch.randn(32, 10)
    target = torch.randint(0, 10, (32,))
    difficulty = torch.rand(32)  # Random difficulty scores

    loss = criterion(pred, target, difficulty)
    # ... backward, optimize ...

    criterion.step_epoch()

7.3 ์ปค์Šคํ…€ ์†์‹ค ํ•จ์ˆ˜

์ปค์Šคํ…€ ์†์‹ค ํ…œํ”Œ๋ฆฟ:

class CustomLoss(nn.Module):
    def __init__(self, param1, param2):
        super().__init__()
        self.param1 = param1
        self.param2 = param2

    def forward(self, pred, target):
        """
        Args:
            pred: Model predictions
            target: Ground truth

        Returns:
            loss: Scalar tensor
        """
        # Implement your loss computation
        loss = torch.mean((pred - target) ** 2)  # Example
        return loss

# Example: Asymmetric Loss (penalize overestimation more than underestimation)
class AsymmetricMSELoss(nn.Module):
    def __init__(self, over_penalty=2.0):
        super().__init__()
        self.over_penalty = over_penalty

    def forward(self, pred, target):
        error = pred - target

        # Penalize overestimation more
        loss = torch.where(
            error > 0,
            self.over_penalty * error ** 2,
            error ** 2
        )

        return loss.mean()

# Usage
asymmetric_loss = AsymmetricMSELoss(over_penalty=2.0)
pred = torch.tensor([2.5, 1.0, 3.0])
target = torch.tensor([2.0, 2.0, 2.0])

loss = asymmetric_loss(pred, target)
print(f"Asymmetric Loss: {loss.item():.4f}")

7.4 ์ˆ˜์น˜ ์•ˆ์ •์„ฑ ํŒ

๋ฌธ์ œ 1: Log-Sum-Exp ํŠธ๋ฆญ

# Numerically unstable (can overflow)
def unstable_softmax(x):
    return torch.exp(x) / torch.sum(torch.exp(x))

# Stable version
def stable_softmax(x):
    x_max = torch.max(x)
    exp_x = torch.exp(x - x_max)
    return exp_x / torch.sum(exp_x)

# Example
x = torch.tensor([1000.0, 1001.0, 1002.0])
# unstable_softmax(x)  # Would cause overflow
stable = stable_softmax(x)
print(f"Stable Softmax: {stable}")

๋ฌธ์ œ 2: log(0) ๋ฐฉ์ง€

# Bad: Can produce NaN
loss = -torch.log(pred)

# Good: Add small epsilon
epsilon = 1e-7
loss = -torch.log(pred + epsilon)

# Better: Use clamp
loss = -torch.log(torch.clamp(pred, min=epsilon))

# Best: Use built-in stable versions
loss = nn.functional.binary_cross_entropy_with_logits(logits, target)

๋ฌธ์ œ 3: ๊ธฐ์šธ๊ธฐ ํด๋ฆฌํ•‘(Gradient Clipping)

# Prevent exploding gradients: clip the global norm over all parameters (after loss.backward())
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Or clip by value
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)

8. ์‹ค์šฉ ๊ฐ€์ด๋“œ(Practical Guide)

8.1 ์†์‹ค ์„ ํƒ ์˜์‚ฌ๊ฒฐ์ • ํŠธ๋ฆฌ

Start
  โ”‚
  โ”œโ”€ Task: Regression?
  โ”‚    โ”‚
  โ”‚    โ”œโ”€ Clean data, no outliers โ†’ MSE
  โ”‚    โ”œโ”€ Data with outliers โ†’ MAE or Huber
  โ”‚    โ””โ”€ Need smooth gradients โ†’ Huber or Log-Cosh
  โ”‚
  โ”œโ”€ Task: Classification?
  โ”‚    โ”‚
  โ”‚    โ”œโ”€ Binary classification โ†’ BCEWithLogitsLoss
  โ”‚    โ”œโ”€ Multi-class (mutually exclusive) โ†’ CrossEntropyLoss
  โ”‚    โ”œโ”€ Multi-label (independent labels) โ†’ BCEWithLogitsLoss
  โ”‚    โ””โ”€ Class imbalance โ†’ FocalLoss or weighted CE
  โ”‚
  โ”œโ”€ Task: Segmentation?
  โ”‚    โ”‚
  โ”‚    โ”œโ”€ Small objects โ†’ DiceLoss
  โ”‚    โ”œโ”€ Class imbalance โ†’ DiceLoss or FocalLoss
  โ”‚    โ””โ”€ Balanced data โ†’ CrossEntropyLoss or CE + Dice
  โ”‚
  โ”œโ”€ Task: Object Detection?
  โ”‚    โ”‚
  โ”‚    โ”œโ”€ Classification head โ†’ CrossEntropyLoss or FocalLoss
  โ”‚    โ””โ”€ Bounding box regression โ†’ IoU Loss, GIoU Loss, or Smooth L1
  โ”‚
  โ”œโ”€ Task: Metric Learning?
  โ”‚    โ”‚
  โ”‚    โ”œโ”€ Pair verification โ†’ ContrastiveLoss
  โ”‚    โ”œโ”€ Triplet comparison โ†’ TripletLoss
  โ”‚    โ””โ”€ Self-supervised โ†’ InfoNCE (NT-Xent)
  โ”‚
  โ””โ”€ Task: Generative Model?
       โ”‚
       โ”œโ”€ GAN โ†’ Adversarial Loss (BCE or Wasserstein)
       โ”œโ”€ VAE โ†’ Reconstruction Loss + KL Divergence
       โ””โ”€ Image translation โ†’ Perceptual Loss + L1/L2

8.2 ๋น„๊ต ํ‘œ

์ž‘์—… ์†์‹ค ํ•จ์ˆ˜ ํ™œ์„ฑํ™” ๋น„๊ณ 
ํšŒ๊ท€ MSE None ๊นจ๋—ํ•œ ๋ฐ์ดํ„ฐ
MAE None ์ด์ƒ์น˜ ๊ฐ•๊ฑดํ•จ
Huber None L1/L2 ๊ท ํ˜•
์ด์ง„ ๋ถ„๋ฅ˜ BCEWithLogitsLoss None (๋‚ด๋ถ€ sigmoid) ์ˆ˜์น˜์ ์œผ๋กœ ์•ˆ์ •์ 
๋‹ค์ค‘ ํด๋ž˜์Šค CrossEntropyLoss None (๋‚ด๋ถ€ softmax) ์ƒํ˜ธ ๋ฐฐํƒ€์ 
๋‹ค์ค‘ ๋ ˆ์ด๋ธ” BCEWithLogitsLoss None (๋‚ด๋ถ€ sigmoid) ๋…๋ฆฝ ๋ ˆ์ด๋ธ”
๋ถˆ๊ท ํ˜• ๋ถ„๋ฅ˜ FocalLoss Softmax ๋ถˆ๊ท ํ˜• ํ•ด๊ฒฐ
์„ธ๊ทธ๋ฉ˜ํ…Œ์ด์…˜ DiceLoss Softmax ํด๋ž˜์Šค ๋ถˆ๊ท ํ˜• ์ฒ˜๋ฆฌ
CE + Dice Softmax ๊ฒฐํ•ฉ ์ ‘๊ทผ๋ฒ•
๊ฒ€์ถœ(bbox) IoU / GIoU Loss None ๋ฐ”์šด๋”ฉ ๋ฐ•์Šค ํšŒ๊ท€
๊ฒ€์ถœ(class) FocalLoss Softmax ์‰ฌ์šด ๋„ค๊ฑฐํ‹ฐ๋ธŒ ์ฒ˜๋ฆฌ
์–ผ๊ตด ๊ฒ€์ฆ ContrastiveLoss None ์ƒด ๋„คํŠธ์›Œํฌ
์–ผ๊ตด ์ธ์‹ TripletLoss None ํŠธ๋ฆฌํ”Œ๋ › ๋งˆ์ด๋‹
์ž๊ธฐ์ง€๋„ InfoNCE None ๋Œ€์กฐ ํ•™์Šต
GAN BCELoss (์ ๋Œ€์ ) Sigmoid Minimax ๊ฒŒ์ž„
VAE BCE/MSE + KL Sigmoid/None ์žฌ๊ตฌ์„ฑ + ์ •๊ทœํ™”

8.3 ํ”ํ•œ ํ•จ์ •

1. ์ž˜๋ชป๋œ reduction ์‚ฌ์šฉ

# Bad: Default reduction='mean' might not be what you want
criterion = nn.CrossEntropyLoss()

# Good: Explicit reduction
criterion = nn.CrossEntropyLoss(reduction='mean')  # or 'sum', 'none'

2. ํ™œ์„ฑํ™” ์ ์šฉ ์žŠ๊ธฐ

# Bad: Using BCELoss with raw logits
pred = model(x)  # Raw logits
loss = nn.BCELoss()(pred, target)  # Wrong! BCELoss expects probabilities

# Good: Use BCEWithLogitsLoss
loss = nn.BCEWithLogitsLoss()(pred, target)

# Or apply sigmoid first
loss = nn.BCELoss()(torch.sigmoid(pred), target)

3. ์ž˜๋ชป๋œ ํƒ€๊ฒŸ ํ˜•์‹

# CrossEntropyLoss expects class indices, not one-hot
# Bad
target = torch.tensor([[1, 0, 0], [0, 1, 0]])  # One-hot
loss = nn.CrossEntropyLoss()(pred, target)  # Error!

# Good
target = torch.tensor([0, 1])  # Class indices
loss = nn.CrossEntropyLoss()(pred, target)

4. ํด๋ž˜์Šค ๋ถˆ๊ท ํ˜• ์ฒ˜๋ฆฌ ์•ˆํ•จ

# Bad: Ignoring class imbalance (e.g., 95% class 0, 5% class 1)
criterion = nn.CrossEntropyLoss()

# Good: Use class weights
class_weights = torch.tensor([1.0, 19.0])  # Weight minority class higher
criterion = nn.CrossEntropyLoss(weight=class_weights)

# Or use FocalLoss
criterion = FocalLoss(alpha=0.25, gamma=2.0)
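
๊ฐ€์ค‘์น˜๋ฅผ ์†์œผ๋กœ ์ •ํ•˜๋Š” ๋Œ€์‹  ๋ ˆ์ด๋ธ” ๋นˆ๋„์—์„œ ๊ณ„์‚ฐํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ์—ญ๋นˆ๋„ ๊ธฐ์ค€์˜ ๊ฐ„๋‹จํ•œ ์Šค์ผ€์น˜๋กœ, ๋ ˆ์ด๋ธ”๊ณผ ๊ณ„์‚ฐ ๋ฐฉ์‹์€ ์„ค๋ช…์šฉ ๊ฐ€์ •์ž…๋‹ˆ๋‹ค:

# ๋ ˆ์ด๋ธ” ๋นˆ๋„ → ์—ญ๋นˆ๋„ ๊ฐ€์ค‘์น˜ (์†Œ์ˆ˜ ํด๋ž˜์Šค์ผ์ˆ˜๋ก ํฐ ๊ฐ€์ค‘์น˜)
labels = torch.randint(0, 2, (1000,))                # ๊ฐ€์ƒ์˜ ์ด์ง„ ๋ ˆ์ด๋ธ”
counts = torch.bincount(labels, minlength=2).float()
class_weights = counts.sum() / (2 * counts)
criterion = nn.CrossEntropyLoss(weight=class_weights)

# ์ด์ง„ ๋ถ„๋ฅ˜๋ผ๋ฉด BCEWithLogitsLoss์˜ pos_weight๋กœ ์–‘์„ฑ ํด๋ž˜์Šค๋ฅผ ๋” ํฌ๊ฒŒ ๋ฐ˜์˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
pos_weight = counts[0] / counts[1]                   # ์Œ์„ฑ ์ˆ˜ / ์–‘์„ฑ ์ˆ˜
criterion_binary = nn.BCEWithLogitsLoss(pos_weight=pos_weight)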

5. ์ž˜๋ชป๋œ ์†์‹ค ์Šค์ผ€์ผ๋ง

# Bad: Losses of different magnitudes
total_loss = loss1 + loss2  # loss1 ~ 0.01, loss2 ~ 100.0

# Good: Normalize or weight appropriately
total_loss = 100 * loss1 + loss2
# Or use learnable weights
total_loss = w1 * loss1 + w2 * loss2

8.4 ๋””๋ฒ„๊น… ํŒ

1. ์†์‹ค ๊ฐ’ ํ™•์ธ

# Monitor loss statistics
def check_loss(loss, name="Loss"):
    print(f"{name}: {loss.item():.4f}")
    assert not torch.isnan(loss), f"{name} is NaN!"
    assert not torch.isinf(loss), f"{name} is Inf!"
    assert loss >= 0, f"{name} is negative!"

# Use in training
loss = criterion(pred, target)
check_loss(loss, name="Training Loss")

2. ์†์‹ค ๊ฒฝ๊ด€ ์‹œ๊ฐํ™”

import matplotlib.pyplot as plt

# Track loss over training
losses = []
for epoch in range(num_epochs):
    # ... training ...
    losses.append(loss.item())

plt.plot(losses)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss Curve')
plt.yscale('log')  # Use log scale if loss varies widely
plt.show()

3. ๊ธฐ์ค€์„ ๊ณผ ๋น„๊ต

# Sanity check: random predictions should give expected loss
# For CrossEntropyLoss with C classes: expected loss โ‰ˆ log(C)

num_classes = 10
random_pred = torch.randn(100, num_classes)
random_target = torch.randint(0, num_classes, (100,))

loss = nn.CrossEntropyLoss()(random_pred, random_target)
expected = torch.log(torch.tensor(num_classes, dtype=torch.float))

print(f"Random Loss: {loss.item():.4f}")
print(f"Expected: {expected.item():.4f}")  # Should be โ‰ˆ 2.3026 for 10 classes

4. ๊ธฐ์šธ๊ธฐ ํ๋ฆ„ ํ™•์ธ

# Check if gradients are flowing
def check_gradients(model):
    for name, param in model.named_parameters():
        if param.grad is not None:
            grad_norm = param.grad.norm().item()
            print(f"{name}: grad norm = {grad_norm:.6f}")
            if grad_norm == 0:
                print(f"  WARNING: Zero gradient!")
        else:
            print(f"{name}: No gradient!")

# After backward
loss.backward()
check_gradients(model)

8.5 ์†์‹ค ํ•จ์ˆ˜๊ฐ€ ์ˆ˜๋ ด์— ๋ฏธ์น˜๋Š” ์˜ํ–ฅ

์˜ˆ์ œ: ์ˆ˜๋ ด ์†๋„ ๋น„๊ต

import matplotlib.pyplot as plt
import torch
import torch.nn as nn

# Simple model
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        return self.fc(x)

def train_with_loss(criterion, num_epochs=100):
    """Train and return loss history"""
    model = SimpleNet()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Dummy data
    X = torch.randn(1000, 10)
    y = torch.randn(1000, 1)

    losses = []
    for epoch in range(num_epochs):
        optimizer.zero_grad()
        pred = model(X)
        loss = criterion(pred, y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())

    return losses

# Compare different losses
mse_losses = train_with_loss(nn.MSELoss())
mae_losses = train_with_loss(nn.L1Loss())
huber_losses = train_with_loss(nn.SmoothL1Loss())

plt.figure(figsize=(10, 6))
plt.plot(mse_losses, label='MSE', linewidth=2)
plt.plot(mae_losses, label='MAE', linewidth=2)
plt.plot(huber_losses, label='Huber', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Convergence Speed Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.savefig('convergence_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

์—ฐ์Šต ๋ฌธ์ œ

์—ฐ์Šต ๋ฌธ์ œ 1: Tversky ์†์‹ค ๊ตฌํ˜„

Tversky ์†์‹ค์€ ์„ธ๊ทธ๋ฉ˜ํ…Œ์ด์…˜์„ ์œ„ํ•œ Dice ์†์‹ค์˜ ์ผ๋ฐ˜ํ™”๋กœ, ๊ฑฐ์ง“ ์–‘์„ฑ/์Œ์„ฑ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„๋ฅผ ์ œ์–ดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

Tversky = (TP) / (TP + ฮฑ*FP + ฮฒ*FN)

where:
- TP = true positives
- FP = false positives
- FN = false negatives
- ฮฑ, ฮฒ control the trade-off (typically ฮฑ + ฮฒ = 1)

๊ณผ์ œ:

  1. TverskyLoss๋ฅผ PyTorch nn.Module๋กœ ๊ตฌํ˜„ํ•˜๊ธฐ
  2. α=0.3, β=0.7๋กœ ํ…Œ์ŠคํŠธํ•˜๊ธฐ(๊ฑฐ์ง“ ์Œ์„ฑ ๊ฐ์†Œ์— ์ง‘์ค‘)
  3. ์ด์ง„ ์„ธ๊ทธ๋ฉ˜ํ…Œ์ด์…˜ ์ž‘์—…์—์„œ DiceLoss์™€ ๋น„๊ตํ•˜๊ธฐ

์Šคํƒ€ํ„ฐ ์ฝ”๋“œ:

class TverskyLoss(nn.Module):
    def __init__(self, alpha=0.5, beta=0.5, smooth=1.0):
        super().__init__()
        # TODO: Initialize parameters
        pass

    def forward(self, pred, target):
        # TODO: Implement Tversky loss
        # Hint: Calculate TP, FP, FN
        pass

์—ฐ์Šต ๋ฌธ์ œ 2: ๋‹ค์ค‘ ์Šค์ผ€์ผ ์ง€๊ฐ์  ์†์‹ค

์‚ฌ์ „ ํ›ˆ๋ จ๋œ ๋„คํŠธ์›Œํฌ์˜ ์—ฌ๋Ÿฌ ๋ ˆ์ด์–ด(๋‹ค์–‘ํ•œ ์Šค์ผ€์ผ)์—์„œ ํŠน์ง•์„ ๋น„๊ตํ•˜๋Š” ์ง€๊ฐ์  ์†์‹ค์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

๊ณผ์ œ:

  1. VGG16์˜ ['relu2_2', 'relu3_3', 'relu4_3'] ๋ ˆ์ด์–ด์—์„œ ํŠน์ง• ์ถ”์ถœ
  2. ๊ฐ ๋ ˆ์ด์–ด์—์„œ ์˜ˆ์ธก ํŠน์ง•๊ณผ ํƒ€๊ฒŸ ํŠน์ง• ๊ฐ„ MSE ๊ณ„์‚ฐ
  3. ํ•™์Šต ๊ฐ€๋Šฅํ•œ ๊ฐ€์ค‘์น˜๋กœ ์†์‹ค ๊ฒฐํ•ฉ
  4. ์ด๋ฏธ์ง€ ์žฌ๊ตฌ์„ฑ ์ž‘์—…์—์„œ ํ…Œ์ŠคํŠธ

์งˆ๋ฌธ:

  • ๋‹ค์–‘ํ•œ ๋ ˆ์ด์–ด๊ฐ€ ์†์‹ค์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋‚˜์š”?
  • ์ดˆ๊ธฐ ๋ ˆ์ด์–ด๋งŒ ์‚ฌ์šฉํ•  ๋•Œ์™€ ๊นŠ์€ ๋ ˆ์ด์–ด๋งŒ ์‚ฌ์šฉํ•  ๋•Œ ๋ฌด์Šจ ์ผ์ด ์ผ์–ด๋‚˜๋‚˜์š”?

์—ฐ์Šต ๋ฌธ์ œ 3: ์ ์‘์  ์†์‹ค ๊ท ํ˜•

๋‹ค์ค‘ ์ž‘์—… ํ•™์Šต ์‹œ๋‚˜๋ฆฌ์˜ค๋ฅผ ์œ„ํ•œ ์ ์‘์  ์†์‹ค ๊ท ํ˜• ์Šคํ‚ด์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

๊ณผ์ œ: ์„ธ ๊ฐ€์ง€ ์ž‘์—…์œผ๋กœ ์ž์œจ ์ฃผํ–‰ ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

  1. ์˜๋ฏธ์  ์„ธ๊ทธ๋ฉ˜ํ…Œ์ด์…˜(CrossEntropyLoss)
  2. ๊นŠ์ด ์ถ”์ •(L1Loss)
  3. ๊ฐ์ฒด ๊ฒ€์ถœ(FocalLoss)

๊ตฌํ˜„:

  1. ๋ถˆํ™•์‹ค์„ฑ ๊ธฐ๋ฐ˜ ๊ฐ€์ค‘์น˜ ์ ์šฉ(7.1์ ˆ)
  2. ํ›ˆ๋ จ ์ค‘ ํ•™์Šต๋œ ๊ฐ€์ค‘์น˜ ์ถ”์ 
  3. ํ›ˆ๋ จ์ด ์ง„ํ–‰๋จ์— ๋”ฐ๋ผ ๊ฐ€์ค‘์น˜๊ฐ€ ์–ด๋–ป๊ฒŒ ๋ณ€ํ•˜๋Š”์ง€ ์‹œ๊ฐํ™”

์Šคํƒ€ํ„ฐ ์ฝ”๋“œ:

class MultiTaskModel(nn.Module):
    def __init__(self):
        super().__init__()
        # TODO: Define three task heads
        pass

    def forward(self, x):
        # TODO: Return predictions for all three tasks
        pass

# Training loop
for epoch in range(num_epochs):
    # TODO:
    # 1. Forward pass
    # 2. Compute three losses
    # 3. Apply uncertainty weighting
    # 4. Backward and optimize
    pass

์งˆ๋ฌธ: - ์–ด๋–ค ์ž‘์—…์ด ๊ฐ€์žฅ ๋†’์€ ๊ฐ€์ค‘์น˜๋ฅผ ๋ฐ›๋‚˜์š”? ์™œ ๊ทธ๋Ÿด๊นŒ์š”? - ๊ฐ€์ค‘์น˜๊ฐ€ ์–ผ๋งˆ๋‚˜ ๋นจ๋ฆฌ ์•ˆ์ •ํ™”๋˜๋‚˜์š”? - ๊ฐ€์ค‘์น˜๋ฅผ ๋‹ค๋ฅด๊ฒŒ ์ดˆ๊ธฐํ™”ํ•˜๋ฉด ์–ด๋–ป๊ฒŒ ๋˜๋‚˜์š”?


์ฐธ๊ณ  ์ž๋ฃŒ

  1. ์†์‹ค ํ•จ์ˆ˜ ์„œ๋ฒ ์ด:
  2. Janocha, K., & Czarnecki, W. M. (2017). "On Loss Functions for Deep Neural Networks in Classification"

  3. ํšŒ๊ท€ ์†์‹ค:

  4. Huber, P. J. (1964). "Robust Estimation of a Location Parameter"

  5. ๋ถ„๋ฅ˜ ์†์‹ค:

  6. Lin, T. Y., et al. (2017). "Focal Loss for Dense Object Detection" (RetinaNet)
  7. Mรผller, R., et al. (2019). "When Does Label Smoothing Help?"

  8. ๋ฉ”ํŠธ๋ฆญ ํ•™์Šต:

  9. Schroff, F., et al. (2015). "FaceNet: A Unified Embedding for Face Recognition and Clustering" (Triplet Loss)
  10. Chopra, S., et al. (2005). "Learning a Similarity Metric Discriminatively" (Contrastive Loss)
  11. Chen, T., et al. (2020). "A Simple Framework for Contrastive Learning of Visual Representations" (SimCLR, NT-Xent)

  12. ์„ธ๊ทธ๋ฉ˜ํ…Œ์ด์…˜ ์†์‹ค:

  13. Milletari, F., et al. (2016). "V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation" (Dice Loss)
  14. Rezatofighi, H., et al. (2019). "Generalized Intersection over Union" (GIoU)

  15. ์ƒ์„ฑ ๋ชจ๋ธ:

  16. Goodfellow, I., et al. (2014). "Generative Adversarial Networks"
  17. Kingma, D. P., & Welling, M. (2013). "Auto-Encoding Variational Bayes"
  18. Johnson, J., et al. (2016). "Perceptual Losses for Real-Time Style Transfer and Super-Resolution"

  19. ๋‹ค์ค‘ ์ž‘์—… ํ•™์Šต:

  20. Kendall, A., et al. (2018). "Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics"
  21. Chen, Z., et al. (2018). "GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks"

  22. PyTorch ๋ฌธ์„œ:

  23. https://pytorch.org/docs/stable/nn.html#loss-functions
  24. https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html

  25. ์ถ”๊ฐ€ ๋ฆฌ์†Œ์Šค:

  26. Murphy, K. P. (2022). "Probabilistic Machine Learning: An Introduction" (Chapter on Loss Functions)
  27. Goodfellow, I., et al. (2016). "Deep Learning" (Chapter 8: Optimization for Training Deep Models)