04. Training Techniques

Previous: Backpropagation | Next: Linear & Logistic Regression


Learning Objectives

  • Understand gradient descent variants (SGD, Momentum, Adam)
  • Learning rate scheduling
  • Regularization techniques (Dropout, Weight Decay, Batch Norm)
  • Preventing overfitting and early stopping

1. Gradient Descent

๊ธฐ๋ณธ ์›๋ฆฌ

W(t+1) = W(t) - η × ∇L
  • η: learning rate
  • ∇L: gradient of the loss function
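
The update rule above translates directly into NumPy. Below is a minimal sketch; compute_gradient is a hypothetical helper that returns ∇L for the current weights.

import numpy as np

def gradient_descent_step(W, grad, lr=0.1):
    # Move the weights against the gradient, scaled by the learning rate
    return W - lr * grad

# Hypothetical usage
# for step in range(100):
#     grad = compute_gradient(W, X, y)   # hypothetical helper returning ∇L
#     W = gradient_descent_step(W, grad)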

Variants

Method | Update rule | Characteristics
SGD | W -= lr × g | Simple, slow
Momentum | v = βv + g; W -= lr × v | Adds inertia
AdaGrad | Adaptive learning rate | Useful for sparse data
RMSprop | Exponential moving average | Improves on AdaGrad
Adam | Momentum + RMSprop | Most widely used
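
Momentum and Adam are implemented below; AdaGrad and RMSprop are not, so here is a minimal NumPy sketch of the RMSprop update (a decaying average of squared gradients scales each step); the hyperparameter values are just common defaults.

def rmsprop(W, grad, s, lr=0.001, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * (grad ** 2)      # decaying average of squared gradients
    W = W - lr * grad / (np.sqrt(s) + eps)       # per-parameter scaled step
    return W, s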

2. Momentum

๊ด€์„ฑ์„ ์ถ”๊ฐ€ํ•˜์—ฌ ์ง„๋™์„ ์ค„์ž…๋‹ˆ๋‹ค.

v(t) = β × v(t-1) + ∇L
W(t+1) = W(t) - η × v(t)

NumPy Implementation

def sgd_momentum(W, grad, v, lr=0.01, beta=0.9):
    v = beta * v + grad          # update velocity
    W = W - lr * v               # update weights
    return W, v

PyTorch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

3. Adam Optimizer

Combines the strengths of Momentum and RMSprop.

m(t) = β₁ × m(t-1) + (1-β₁) × g      # first moment
v(t) = β₂ × v(t-1) + (1-β₂) × g²     # second moment
m̂ = m / (1 - β₁ᵗ)                    # bias correction
v̂ = v / (1 - β₂ᵗ)
W = W - η × m̂ / (√v̂ + ε)

NumPy Implementation

def adam(W, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # first moment (moving average of gradients)
    v = beta2 * v + (1 - beta2) * (grad ** 2)    # second moment (moving average of squared gradients)

    m_hat = m / (1 - beta1 ** t)                 # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)

    W = W - lr * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v

PyTorch

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

4. Learning Rate Scheduling

Adjusts the learning rate over the course of training.

Main Methods

Method | Characteristics
Step Decay | Multiply by γ every N epochs
Exponential | lr = lr₀ × γ^epoch
Cosine Annealing | Decays following a cosine curve
ReduceLROnPlateau | Reduces when the validation loss plateaus
Warmup | Gradually increases at the start of training

PyTorch Examples

# Step Decay
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# Cosine Annealing
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# ReduceLROnPlateau
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', patience=10, factor=0.5
)

# In the training loop
for epoch in range(epochs):
    train(...)
    scheduler.step()  # call at the end of each epoch (ReduceLROnPlateau needs scheduler.step(val_loss))
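
Warmup is not shown above. One common way to get it is a LambdaLR schedule whose multiplier ramps up linearly over the first few epochs; a minimal sketch, assuming a 5-epoch warmup:

# Linear warmup over the first 5 epochs, then a constant learning rate
warmup_epochs = 5
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: min(1.0, (epoch + 1) / warmup_epochs)
)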

5. Dropout

ํ•™์Šต ์ค‘ ๋žœ๋คํ•˜๊ฒŒ ๋‰ด๋Ÿฐ์„ ๋น„ํ™œ์„ฑํ™”ํ•ฉ๋‹ˆ๋‹ค.

Principle

Training:  y = x × mask / (1 - p)   # mask ~ Bernoulli(1-p)
Inference: y = x                    # no mask

NumPy Implementation

def dropout(x, p=0.5, training=True):
    if not training:
        return x
    # Inverted dropout: zero each unit with probability p, scale the rest by 1/(1-p)
    mask = (np.random.rand(*x.shape) > p).astype(float)
    return x * mask / (1 - p)
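
The 1 / (1 - p) scaling keeps the expected activation the same in training and inference; a quick sanity check (a sketch, not part of the original lesson):

x = np.ones((10000, 100))
out = dropout(x, p=0.5, training=True)
print(out.mean())  # ≈ 1.0: about half the units are zeroed, the rest scaled by 2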

PyTorch

class MLPWithDropout(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, dropout_p=0.5):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.dropout = nn.Dropout(p=dropout_p)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.dropout(x)  # active only during training
        x = self.fc2(x)
        return x

# At inference time
model.eval()  # disables dropout

6. Batch Normalization

๊ฐ ์ธต์˜ ์ž…๋ ฅ์„ ์ •๊ทœํ™”ํ•ฉ๋‹ˆ๋‹ค.

Formula

μ = mean(x)
σ² = var(x)
x̂ = (x - μ) / √(σ² + ε)
y = γ × x̂ + β   # learnable parameters

NumPy Implementation

def batch_norm(x, gamma, beta, eps=1e-5, training=True,
               running_mean=None, running_var=None, momentum=0.1):
    if training:
        mean = np.mean(x, axis=0)
        var = np.var(x, axis=0)

        # Update running statistics (PyTorch convention: momentum weights the new batch)
        if running_mean is not None:
            running_mean = momentum * mean + (1 - momentum) * running_mean
            running_var = momentum * var + (1 - momentum) * running_var
    else:
        # At inference, use the running statistics collected during training
        mean = running_mean
        var = running_var

    x_norm = (x - mean) / np.sqrt(var + eps)
    out = gamma * x_norm + beta
    # Return the updated running stats so the caller can keep them for inference
    return out, running_mean, running_var

PyTorch

class CNNWithBatchNorm(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3)
        self.bn1 = nn.BatchNorm2d(64)
        self.fc1 = nn.Linear(64, 10)
        self.bn_fc = nn.BatchNorm1d(10)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.adaptive_avg_pool2d(x, 1)  # global average pool so fc1 always sees 64 features
        x = x.flatten(1)
        x = self.bn_fc(self.fc1(x))
        return x
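
Like Dropout, BatchNorm behaves differently in training and evaluation mode: model.train() normalizes with batch statistics and updates the running averages, while model.eval() uses the stored running statistics. A small sketch (the shapes here are just an assumption):

bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4)

bn.train()
y_train = bn(x)   # batch mean/var; updates bn.running_mean and bn.running_var

bn.eval()
y_eval = bn(x)    # uses the running statistics collected so far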

7. Weight Decay (L2 Regularization)

๊ฐ€์ค‘์น˜ ํฌ๊ธฐ์— ํŒจ๋„ํ‹ฐ๋ฅผ ๋ถ€์—ฌํ•ฉ๋‹ˆ๋‹ค.

Formula

L_total = L_data + λ × ||W||²
∇L_total = ∇L_data + 2λW
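
In plain SGD this simply means adding 2λW to the data gradient before the update; a minimal sketch:

def sgd_weight_decay_step(W, grad, lr=0.01, weight_decay=1e-4):
    grad = grad + 2 * weight_decay * W   # gradient of λ||W||² is 2λW
    return W - lr * grad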

PyTorch

# Method 1: set it in the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

# Method 2: add the penalty to the loss directly
l2_lambda = 1e-4
l2_reg = sum(p.pow(2).sum() for p in model.parameters())
loss = criterion(output, target) + l2_lambda * l2_reg

8. Early Stopping

๊ฒ€์ฆ ์†์‹ค์ด ๊ฐœ์„ ๋˜์ง€ ์•Š์œผ๋ฉด ํ•™์Šต์„ ์ค‘๋‹จํ•ฉ๋‹ˆ๋‹ค.

PyTorch Implementation

class EarlyStopping:
    def __init__(self, patience=10, min_delta=0):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = None
        self.early_stop = False

    def __call__(self, val_loss):
        if self.best_loss is None:
            self.best_loss = val_loss
        elif val_loss > self.best_loss - self.min_delta:
            self.counter += 1                # no sufficient improvement
            if self.counter >= self.patience:
                self.early_stop = True
        else:
            self.best_loss = val_loss        # improved by at least min_delta
            self.counter = 0

# Usage
early_stopping = EarlyStopping(patience=10)
for epoch in range(epochs):
    train_loss = train(model, train_loader)
    val_loss = validate(model, val_loader)

    early_stopping(val_loss)
    if early_stopping.early_stop:
        print(f"Early stopping at epoch {epoch}")
        break
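
Early stopping is usually paired with checkpointing, so the final model is the one with the best validation loss; a minimal sketch (best_model.pt is an assumed filename):

best_val_loss = float('inf')
# ... inside the validation step of the loop above:
if val_loss < best_val_loss:
    best_val_loss = val_loss
    torch.save(model.state_dict(), 'best_model.pt')  # keep the best weights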

9. ๋ฐ์ดํ„ฐ ์ฆ๊ฐ• (Data Augmentation)

ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ๋ฅผ ๋ณ€ํ˜•ํ•˜์—ฌ ๋‹ค์–‘์„ฑ์„ ์ฆ๊ฐ€์‹œํ‚ต๋‹ˆ๋‹ค.

์ด๋ฏธ์ง€ ๋ฐ์ดํ„ฐ

from torchvision import transforms

transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
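
The transform runs each time a sample is loaded, so every epoch sees different random variants; a usage sketch assuming CIFAR-10 from torchvision (the 32×32 crop above matches it):

from torchvision import datasets
from torch.utils.data import DataLoader

train_set = datasets.CIFAR10(root='data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)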

10. NumPy vs PyTorch Comparison

Optimizer Implementation

# NumPy (์ˆ˜๋™ ๊ตฌํ˜„)
m = np.zeros_like(W)
v = np.zeros_like(W)
for t in range(1, epochs + 1):
    grad = compute_gradient(W, X, y)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    W -= lr * m_hat / (np.sqrt(v_hat) + eps)

# PyTorch (์ž๋™)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
for epoch in range(epochs):
    loss = criterion(model(X), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Summary

Key Concepts

  1. Optimizer: Adam is the default choice; SGD + Momentum is still effective
  2. Learning rate: proper scheduling improves convergence
  3. Regularization: combine Dropout, BatchNorm, and Weight Decay
  4. Early stopping: the baseline defense against overfitting

Recommended Starting Configuration

# Default setup
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5)
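
Putting the pieces of this lesson together, a training loop might look like the sketch below; train_one_epoch and evaluate are hypothetical helpers returning the average training and validation loss.

early_stopping = EarlyStopping(patience=10)

for epoch in range(100):
    train_loss = train_one_epoch(model, train_loader, optimizer, criterion)  # hypothetical helper
    val_loss = evaluate(model, val_loader, criterion)                        # hypothetical helper

    scheduler.step(val_loss)   # ReduceLROnPlateau monitors the validation loss
    early_stopping(val_loss)
    if early_stopping.early_stop:
        break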

๋‹ค์Œ ๋‹จ๊ณ„

07_CNN_Basics.md covers convolutional neural networks.
