04. Training Techniques
Learning Objectives
- Understand gradient descent variants (SGD, Momentum, Adam)
- Learn learning rate scheduling
- Learn regularization techniques (Dropout, Weight Decay, Batch Norm)
- Learn overfitting prevention and early stopping
1. Gradient Descent
Basic Principle
W(t+1) = W(t) - η × ∇L
- η: learning rate
- ∇L: gradient of loss function
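As a minimal NumPy sketch of this update rule (the quadratic toy loss and its gradient are made up purely for illustration):

import numpy as np

W = np.array([3.0])      # Initial weight
lr = 0.1
for _ in range(100):
    grad = 2 * W         # Gradient of the toy loss L(W) = W^2
    W = W - lr * grad    # W(t+1) = W(t) - lr * grad
print(W)                 # Approaches the minimum at W = 0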
Variants
| Method | Update Rule | Characteristics |
|---|---|---|
| SGD | W -= lr × g | Simple, but can converge slowly |
| Momentum | v = βv + g; W -= lr × v | Adds inertia to damp oscillations |
| AdaGrad | Per-parameter adaptive learning rate | Good for sparse data; rate keeps shrinking |
| RMSprop | Exponential moving average of g² | Fixes AdaGrad's vanishing learning rate |
| Adam | Momentum + RMSprop | Most commonly used default |
2. Momentum
Adds inertia to the updates, which damps oscillations and speeds up progress along consistent gradient directions.
v(t) = β × v(t-1) + ∇L
W(t+1) = W(t) - η × v(t)
NumPy Implementation
def sgd_momentum(W, grad, v, lr=0.01, beta=0.9):
v = beta * v + grad # Update velocity
W = W - lr * v # Update weights
return W, v
PyTorch
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
3. Adam Optimizer
Combines advantages of Momentum and RMSprop.
m(t) = β₁ × m(t-1) + (1-β₁) × g # 1st moment
v(t) = β₂ × v(t-1) + (1-β₂) × g² # 2nd moment
m̂ = m / (1 - β₁ᵗ) # Bias correction
v̂ = v / (1 - β₂ᵗ)
W = W - η × m̂ / (√v̂ + ε)
NumPy Implementation
def adam(W, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * (grad ** 2)
m_hat = m / (1 - beta1 ** t)
v_hat = v / (1 - beta2 ** t)
W = W - lr * m_hat / (np.sqrt(v_hat) + eps)
return W, m, v
PyTorch
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
4. Learning Rate Scheduling
Adjusting the learning rate over the course of training often improves convergence and final accuracy.
Main Methods
| Method | Characteristics |
|---|---|
| Step Decay | Reduce by γ every N epochs |
| Exponential | lr = lr₀ × γᵉᵖᵒᶜʰ |
| Cosine Annealing | Reduce following cosine function |
| ReduceLROnPlateau | Reduce when validation loss plateaus |
| Warmup | Gradual increase at beginning |
PyTorch Examples
# Step Decay
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
# Cosine Annealing
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
# ReduceLROnPlateau
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
optimizer, mode='min', patience=10, factor=0.5
)
# In training loop
for epoch in range(epochs):
    train(...)
    scheduler.step()  # StepLR / CosineAnnealingLR: call once per epoch
    # ReduceLROnPlateau instead needs the monitored metric: scheduler.step(val_loss)
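The table above also lists warmup, which is not shown in the examples. One way to sketch it (assuming PyTorch ≥ 1.10, where LinearLR and SequentialLR are available) is to chain a short linear ramp-up with a longer schedule:

# Warmup sketch: linear ramp over the first 5 epochs, then cosine annealing
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=5)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=95)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[5]
)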
5. Dropout
Randomly deactivates neurons during training.
Principle
Training: y = x × mask / (1 - p) # mask is Bernoulli(1-p)
Inference: y = x # No mask
NumPy Implementation
def dropout(x, p=0.5, training=True):
if not training:
return x
mask = (np.random.rand(*x.shape) > p).astype(float)
return x * mask / (1 - p)
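A quick sanity check of the inverted-dropout scaling: because surviving activations are divided by (1 - p), the expected activation is unchanged, so nothing needs to be rescaled at inference. A small sketch:

x = np.ones((10000, 1))
out = dropout(x, p=0.5, training=True)
print(out.mean())                                  # ≈ 1.0: expectation is preserved
print(dropout(x, p=0.5, training=False).mean())    # exactly 1.0 at inference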
PyTorch
class MLPWithDropout(nn.Module):
def __init__(self, input_dim, hidden_dim, output_dim, dropout_p=0.5):
super().__init__()
self.fc1 = nn.Linear(input_dim, hidden_dim)
self.dropout = nn.Dropout(p=dropout_p)
self.fc2 = nn.Linear(hidden_dim, output_dim)
def forward(self, x):
x = F.relu(self.fc1(x))
x = self.dropout(x) # Active only during training
x = self.fc2(x)
return x
# During inference
model.eval() # Disable dropout
6. Batch Normalization
Normalizes the inputs to each layer over the mini-batch, which stabilizes training and allows higher learning rates.
Formula
μ = mean(x)
σ² = var(x)
x̂ = (x - μ) / √(σ² + ε)
y = γ × x̂ + β # Learnable parameters
NumPy Implementation
def batch_norm(x, gamma, beta, eps=1e-5, training=True,
               running_mean=None, running_var=None, momentum=0.1):
    if training:
        mean = np.mean(x, axis=0)
        var = np.var(x, axis=0)
        # Update running averages (returned so the caller can keep them across steps)
        if running_mean is not None:
            running_mean = momentum * mean + (1 - momentum) * running_mean
            running_var = momentum * var + (1 - momentum) * running_var
    else:
        # At inference, use the accumulated statistics instead of batch statistics
        mean = running_mean
        var = running_var
    x_norm = (x - mean) / np.sqrt(var + eps)
    return gamma * x_norm + beta, running_mean, running_var
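A usage sketch showing how the running statistics are threaded through successive calls (the feature dimension and random data are made up for illustration):

D = 4
gamma, beta = np.ones(D), np.zeros(D)
running_mean, running_var = np.zeros(D), np.ones(D)

for _ in range(100):                                   # training steps
    batch = np.random.randn(32, D)
    out, running_mean, running_var = batch_norm(
        batch, gamma, beta, training=True,
        running_mean=running_mean, running_var=running_var)

# Inference uses the accumulated statistics
test_out, _, _ = batch_norm(np.random.randn(8, D), gamma, beta, training=False,
                            running_mean=running_mean, running_var=running_var)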
PyTorch
class CNNWithBatchNorm(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3)
        self.bn1 = nn.BatchNorm2d(64)
        self.fc1 = nn.Linear(64, 10)
        self.bn_fc = nn.BatchNorm1d(10)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.adaptive_avg_pool2d(x, 1)  # Pool to 1x1 so the 64 channels match fc1
        x = x.flatten(1)
        x = self.bn_fc(self.fc1(x))
        return x
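A quick shape check (the 32×32 RGB input and batch size are arbitrary). Note that BatchNorm, like Dropout, behaves differently under model.train() and model.eval():

model = CNNWithBatchNorm()
model.train()                              # Use batch statistics, update running stats
out = model(torch.randn(8, 3, 32, 32))     # -> shape (8, 10)
model.eval()                               # Use accumulated running statistics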
7. Weight Decay (L2 Regularization)
Penalizes large weights by adding their squared norm to the loss, which discourages overfitting.
Formula
L_total = L_data + λ × ||W||²
∇L_total = ∇L_data + 2λW
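The same idea as a minimal NumPy sketch (grad_data stands for the gradient of the data loss alone and is assumed to be computed elsewhere):

l2_lambda = 1e-4
grad_total = grad_data + 2 * l2_lambda * W   # Gradient of data loss plus L2 penalty
W = W - lr * grad_total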
PyTorch
# Method 1: Set in optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
# Method 2: Add directly to loss
l2_lambda = 1e-4
l2_reg = sum(p.pow(2).sum() for p in model.parameters())
loss = criterion(output, target) + l2_lambda * l2_reg
8. Early Stopping
Stop training when validation loss stops improving.
PyTorch Implementation
class EarlyStopping:
def __init__(self, patience=10, min_delta=0):
self.patience = patience
self.min_delta = min_delta
self.counter = 0
self.best_loss = None
self.early_stop = False
def __call__(self, val_loss):
if self.best_loss is None:
self.best_loss = val_loss
elif val_loss > self.best_loss - self.min_delta:
self.counter += 1
if self.counter >= self.patience:
self.early_stop = True
else:
self.best_loss = val_loss
self.counter = 0
# Usage
early_stopping = EarlyStopping(patience=10)
for epoch in range(epochs):
train_loss = train(model, train_loader)
val_loss = validate(model, val_loader)
early_stopping(val_loss)
if early_stopping.early_stop:
print(f"Early stopping at epoch {epoch}")
break
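In practice, early stopping is usually combined with keeping a copy of the best weights seen so far, so the final model comes from the best epoch rather than the last one. A minimal sketch of that pattern:

import copy

best_val, best_state = float('inf'), None
early_stopping = EarlyStopping(patience=10)
for epoch in range(epochs):
    train_loss = train(model, train_loader)
    val_loss = validate(model, val_loader)
    if val_loss < best_val:
        best_val = val_loss
        best_state = copy.deepcopy(model.state_dict())  # Remember the best weights
    early_stopping(val_loss)
    if early_stopping.early_stop:
        break
if best_state is not None:
    model.load_state_dict(best_state)                   # Restore the best checkpoint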
9. Data Augmentation
Applies random transformations to the training data to increase its effective diversity and reduce overfitting.
Image Data
from torchvision import transforms
transform = transforms.Compose([
transforms.RandomHorizontalFlip(p=0.5),
transforms.RandomRotation(10),
transforms.ColorJitter(brightness=0.2, contrast=0.2),
transforms.RandomCrop(32, padding=4),
transforms.ToTensor(),
transforms.Normalize((0.5,), (0.5,))
])
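A typical usage sketch, passing the transform to a dataset (CIFAR-10 and the ./data root are placeholders chosen only for illustration):

from torchvision import datasets
from torch.utils.data import DataLoader

train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)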
10. NumPy vs PyTorch Comparison
Optimizer Implementation
# NumPy (manual implementation)
m = np.zeros_like(W)
v = np.zeros_like(W)
for t in range(1, epochs + 1):
grad = compute_gradient(W, X, y)
m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad**2
m_hat = m / (1 - beta1**t)
v_hat = v / (1 - beta2**t)
W -= lr * m_hat / (np.sqrt(v_hat) + eps)
# PyTorch (automatic)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
for epoch in range(epochs):
loss = criterion(model(X), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
Summary
Core Concepts
- Optimizer: Adam is the usual default; SGD with momentum remains a strong choice
- Learning Rate: Improve convergence with proper scheduling
- Regularization: Combine Dropout, BatchNorm, Weight Decay
- Early Stopping: Basic overfitting prevention
Recommended Starting Settings
# Basic configuration
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5)
# Remember to call scheduler.step(val_loss) once per epoch
Next Steps
In 07_CNN_Basics.md, we'll learn convolutional neural networks.