03. Understanding Backpropagation
Learning Objectives¶
- Understand the principles of the backpropagation algorithm
- Learn gradient calculation using the chain rule
- Implement backpropagation directly with NumPy
1. What is Backpropagation?¶
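Backpropagation is an algorithm that efficiently computes the gradient of the loss with respect to every weight in a neural network by applying the chain rule. An optimizer then uses these gradients to update the weights.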
Forward Pass: Input ──▶ Hidden Layer ──▶ Output ──▶ Loss
Backward Pass: Input ◀── Hidden Layer ◀── Output ◀── Loss
Core Ideas¶
- Forward Pass: Compute values from input to output
- Loss Calculation: Measure the difference between prediction and ground truth
- Backward Pass: Propagate gradients from loss towards input
- Weight Update: Adjust weights using gradients
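The four steps above can be sketched as a loop on a one-weight model. This is a minimal illustration, not the lesson's implementation: the toy data (y = 3x), the learning rate `lr`, and all variable names are assumptions made for the example.

```python
import numpy as np

# Toy data (assumed for illustration): the true relationship is y = 3x
x = np.array([1.0, 2.0, 3.0])
y_true = 3.0 * x

w = 0.0    # single weight, initialized to zero
lr = 0.05  # hypothetical learning rate

losses = []
for step in range(50):
    y_pred = w * x                              # 1. forward pass
    loss = np.mean((y_pred - y_true) ** 2)      # 2. loss calculation (MSE)
    dw = np.mean(2.0 * (y_pred - y_true) * x)   # 3. backward pass: dL/dw
    w -= lr * dw                                # 4. weight update
    losses.append(loss)

print(w, losses[0], losses[-1])
```

With these settings the weight should approach 3 and the recorded loss should shrink from step to step.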
2. Chain Rule¶
The chain rule gives the derivative of a composite function as the product of the derivatives of its parts.
Formula¶
y = f(g(x))
dy/dx = (dy/dg) × (dg/dx)
Example¶
z = x²
y = sin(z)
L = y²
dL/dx = (dL/dy) × (dy/dz) × (dz/dx)
= 2y × cos(z) × 2x
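One way to gain confidence in the result above is to compare it with a finite-difference estimate. The sketch below assumes an arbitrary test point `x0 = 0.7` and step size `h`; the function names are illustrative.

```python
import numpy as np

def forward(x):
    """L = sin(x^2)^2, built from z = x^2, y = sin(z), L = y^2."""
    z = x ** 2
    y = np.sin(z)
    return y ** 2

def grad_chain_rule(x):
    """dL/dx = (dL/dy) * (dy/dz) * (dz/dx) = 2y * cos(z) * 2x."""
    z = x ** 2
    y = np.sin(z)
    dL_dy = 2 * y
    dy_dz = np.cos(z)
    dz_dx = 2 * x
    return dL_dy * dy_dz * dz_dx

x0 = 0.7   # arbitrary test point (assumption)
h = 1e-6   # finite-difference step (assumption)

numeric = (forward(x0 + h) - forward(x0 - h)) / (2 * h)
analytic = grad_chain_rule(x0)
print(numeric, analytic)
```

The two printed values should agree to many decimal places.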
3. Backpropagation for a Single Neuron¶
Forward Pass¶
z = w*x + b # Linear transformation
a = sigmoid(z) # Activation
L = (a - y)² # Loss (MSE)
Backward Pass (Gradient Calculation)¶
dL/da = 2(a - y) # Gradient of loss w.r.t. activation
da/dz = sigmoid(z) * (1 - sigmoid(z)) # Sigmoid derivative
dz/dw = x # Gradient of linear transform w.r.t. weight
dz/db = 1 # Gradient of linear transform w.r.t. bias
# Apply chain rule
dL/dw = (dL/da) × (da/dz) × (dz/dw)
dL/db = (dL/da) × (da/dz) × (dz/db)
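The equations above translate almost line by line into Python. The following is a minimal sketch for scalar inputs; the parameter values are arbitrary assumptions, and the finite-difference comparison is included only as a sanity check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_loss(w, b, x, y):
    """Forward pass: z = w*x + b, a = sigmoid(z), L = (a - y)^2."""
    a = sigmoid(w * x + b)
    return (a - y) ** 2

def neuron_grads(w, b, x, y):
    """Backward pass: returns (dL/dw, dL/db) via the chain rule."""
    z = w * x + b
    a = sigmoid(z)
    dL_da = 2 * (a - y)        # gradient of loss w.r.t. activation
    da_dz = a * (1 - a)        # sigmoid derivative
    dL_dz = dL_da * da_dz
    return dL_dz * x, dL_dz * 1.0   # dz/dw = x, dz/db = 1

# Arbitrary example values (assumptions)
w, b, x, y = 0.5, -0.2, 1.5, 1.0
dL_dw, dL_db = neuron_grads(w, b, x, y)

# Central finite differences for comparison
h = 1e-6
num_dw = (neuron_loss(w + h, b, x, y) - neuron_loss(w - h, b, x, y)) / (2 * h)
num_db = (neuron_loss(w, b + h, x, y) - neuron_loss(w, b - h, x, y)) / (2 * h)
print(dL_dw, num_dw)
print(dL_db, num_db)
```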
4. Loss Functions¶
MSE (Mean Squared Error)¶
L = (1/n) × Σ(y_pred - y_true)²
dL/dy_pred = (2/n) × (y_pred - y_true)
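As a small worked example of the two MSE formulas, the sketch below uses hypothetical helper names `mse` and `mse_grad` and takes n to be the number of elements in the prediction array.

```python
import numpy as np

def mse(y_pred, y_true):
    """L = (1/n) * sum((y_pred - y_true)^2)."""
    return np.mean((y_pred - y_true) ** 2)

def mse_grad(y_pred, y_true):
    """dL/dy_pred = (2/n) * (y_pred - y_true)."""
    n = y_pred.size
    return (2.0 / n) * (y_pred - y_true)

y_pred = np.array([1.0, 2.0, 3.0])
y_true = np.array([1.0, 1.0, 1.0])

loss = mse(y_pred, y_true)        # (0 + 1 + 4) / 3
grad = mse_grad(y_pred, y_true)   # (2/3) * [0, 1, 2]
print(loss, grad)
```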
Cross-Entropy (Classification)¶
L = -Σ y_true × log(y_pred)
dL/dy_pred = -y_true / y_pred # Simplified when combined with softmax
Softmax + Cross-Entropy Combined¶
# The softmax Jacobian cancels the 1/y_pred factor (assuming y_true sums to 1), leaving:
dL/dz = y_pred - y_true # Gradient w.r.t. softmax input
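The simplification can be checked numerically for a single example. This sketch assumes a one-hot `y_true` and arbitrary logits `z`; the helper names are illustrative.

```python
import numpy as np

def softmax(z):
    shifted = z - np.max(z)        # subtract the max for numerical stability
    e = np.exp(shifted)
    return e / np.sum(e)

def cross_entropy(z, y_true):
    """L = -sum(y_true * log(softmax(z)))."""
    return -np.sum(y_true * np.log(softmax(z)))

z = np.array([2.0, 1.0, 0.1])        # arbitrary logits (assumption)
y_true = np.array([1.0, 0.0, 0.0])   # one-hot label (assumption)

analytic = softmax(z) - y_true       # dL/dz = y_pred - y_true

# Central finite differences, one logit at a time
h = 1e-6
numeric = np.zeros_like(z)
for i in range(z.size):
    step = np.zeros_like(z)
    step[i] = h
    numeric[i] = (cross_entropy(z + step, y_true)
                  - cross_entropy(z - step, y_true)) / (2 * h)

print(analytic)
print(numeric)
```

Because both `softmax(z)` and `y_true` sum to 1, the components of the gradient sum to zero.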
5. MLP Backpropagation¶
This section walks through backpropagation for a 2-layer MLP.
Architecture¶
Input(x) → [W1, b1] → ReLU → [W2, b2] → Output(y)
Forward Pass¶
z1 = x @ W1 + b1
a1 = relu(z1)
z2 = a1 @ W2 + b2
y_pred = softmax(z2) # Or y_pred = z2 for a linear (regression) output
Backward Pass¶
# Output layer
dL/dz2 = y_pred - y_true # (for softmax + CE summed over the batch; divide by the batch size if the loss is a mean)
dL/dW2 = a1.T @ dL/dz2
dL/db2 = sum(dL/dz2, axis=0)
# Hidden layer
dL/da1 = dL/dz2 @ W2.T
dL/dz1 = dL/da1 * relu_derivative(z1)
dL/dW1 = x.T @ dL/dz1
dL/db1 = sum(dL/dz1, axis=0)
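The transposes in the backward pass are easiest to understand by tracking shapes: every gradient must have the same shape as the parameter it belongs to. The sketch below assumes arbitrary sizes (batch `N`, input `D`, hidden `H`, output `C`) and random data.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H, C = 4, 3, 5, 2   # batch, input, hidden, output sizes (arbitrary)

x = rng.normal(size=(N, D))
W1, b1 = rng.normal(size=(D, H)), np.zeros(H)
W2, b2 = rng.normal(size=(H, C)), np.zeros(C)
y_true = np.eye(C)[rng.integers(0, C, size=N)]   # one-hot labels

# Forward pass
z1 = x @ W1 + b1                                  # (N, H)
a1 = np.maximum(z1, 0)                            # (N, H)
z2 = a1 @ W2 + b2                                 # (N, C)
e = np.exp(z2 - z2.max(axis=1, keepdims=True))
y_pred = e / e.sum(axis=1, keepdims=True)         # (N, C), softmax per row

# Backward pass
dz2 = y_pred - y_true      # (N, C)
dW2 = a1.T @ dz2           # (H, N) @ (N, C) -> (H, C)
db2 = dz2.sum(axis=0)      # (C,)
da1 = dz2 @ W2.T           # (N, C) @ (C, H) -> (N, H)
dz1 = da1 * (z1 > 0)       # (N, H)
dW1 = x.T @ dz1            # (D, N) @ (N, H) -> (D, H)
db1 = dz1.sum(axis=0)      # (H,)

for name, p, g in [("W1", W1, dW1), ("b1", b1, db1),
                   ("W2", W2, dW2), ("b2", b2, db2)]:
    print(name, p.shape, g.shape)
```

The sum over `axis=0` for the biases reflects that each bias is shared by every example in the batch.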
6. NumPy Implementation Core¶
import numpy as np

class MLP:
    def backward(self, x, y_true, y_pred, cache):
        """Backpropagation: compute gradients of the loss w.r.t. each parameter."""
        a1, z1 = cache  # hidden activation and pre-activation saved by the forward pass
        # Output layer gradients (softmax + cross-entropy, summed over the batch)
        dz2 = y_pred - y_true
        dW2 = a1.T @ dz2
        db2 = np.sum(dz2, axis=0)
        # Hidden layer gradients (chain rule)
        da1 = dz2 @ self.W2.T
        dz1 = da1 * (z1 > 0)  # ReLU derivative: 1 where z1 > 0, else 0
        dW1 = x.T @ dz1
        db1 = np.sum(dz1, axis=0)
        return {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2}
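To show how such a backward method fits into a full loop, here is a self-contained sketch. Everything beyond the gradient formulas is an assumption made for illustration: the class name `TinyMLP`, the `params` dictionary, the `forward` method, the toy dataset, the mean cross-entropy loss (hence the division by the batch size), and the learning rate.

```python
import numpy as np

class TinyMLP:
    """Hypothetical 2-layer MLP: x -> ReLU(x @ W1 + b1) -> softmax(a1 @ W2 + b2)."""

    def __init__(self, d_in, d_hidden, d_out, seed=0):
        rng = np.random.default_rng(seed)
        self.params = {
            'W1': rng.normal(scale=0.5, size=(d_in, d_hidden)),
            'b1': np.zeros(d_hidden),
            'W2': rng.normal(scale=0.5, size=(d_hidden, d_out)),
            'b2': np.zeros(d_out),
        }

    def forward(self, x):
        p = self.params
        z1 = x @ p['W1'] + p['b1']
        a1 = np.maximum(z1, 0.0)
        z2 = a1 @ p['W2'] + p['b2']
        e = np.exp(z2 - z2.max(axis=1, keepdims=True))
        y_pred = e / e.sum(axis=1, keepdims=True)
        return y_pred, (a1, z1)          # cache what backward() needs

    def backward(self, x, y_true, y_pred, cache):
        a1, z1 = cache
        dz2 = (y_pred - y_true) / x.shape[0]   # mean cross-entropy over the batch
        dz1 = (dz2 @ self.params['W2'].T) * (z1 > 0)
        return {'W1': x.T @ dz1, 'b1': dz1.sum(axis=0),
                'W2': a1.T @ dz2, 'b2': dz2.sum(axis=0)}

def mean_cross_entropy(y_pred, y_true):
    return -np.mean(np.sum(y_true * np.log(y_pred + 1e-12), axis=1))

# Toy two-class problem (assumption): the label depends on the sign of the first feature
rng = np.random.default_rng(1)
x = rng.normal(size=(40, 2))
y_true = np.eye(2)[(x[:, 0] > 0).astype(int)]

model = TinyMLP(2, 8, 2)
lr = 0.2   # hypothetical learning rate
losses = []
for _ in range(200):
    y_pred, cache = model.forward(x)
    losses.append(mean_cross_entropy(y_pred, y_true))
    grads = model.backward(x, y_true, y_pred, cache)
    for name in model.params:
        model.params[name] -= lr * grads[name]   # plain gradient descent

print(losses[0], losses[-1])
```

Returning the gradients as a dictionary keyed like the parameters keeps the update step a simple loop over names.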
7. PyTorch's Automatic Differentiation¶
In PyTorch, autograd records the operations of the forward pass and computes all of these gradients automatically.
# Forward pass (assumes model, criterion, x, and y_true are already defined)
y_pred = model(x)
loss = criterion(y_pred, y_true)
# Backward pass (automatic!)
loss.backward()
# Access gradients
print(model.fc1.weight.grad)
Computational Graph¶
import torch

x = torch.tensor([2.0], requires_grad=True)
y = x ** 2
z = y * 3
z.backward()
# x.grad = dz/dx = dz/dy × dy/dx = 3 × 2x = 12
print(x.grad)  # tensor([12.])
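To make the idea of a computational graph concrete, the following is a toy reverse-mode sketch in plain Python. It is not how PyTorch is implemented: the `Value` class and its attributes are hypothetical, and only multiplication and integer powers are supported. Each node remembers its parents and the local derivative with respect to each parent, and `backward()` walks the graph from output to input.

```python
class Value:
    """Hypothetical scalar graph node: stores data, gradient, and how it was produced."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # nodes this value was computed from
        self.local_grads = local_grads  # d(self)/d(parent) for each parent

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def __pow__(self, k):
        return Value(self.data ** k, (self,), (k * self.data ** (k - 1),))

    def backward(self):
        # Build a topological order so each node is processed after its consumers
        order, seen = [], set()

        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v.parents:
                    visit(p)
                order.append(v)

        visit(self)
        self.grad = 1.0                  # d(output)/d(output)
        for v in reversed(order):
            for p, g in zip(v.parents, v.local_grads):
                p.grad += g * v.grad     # chain rule, accumulated per parent

x = Value(2.0)
y = x ** 2
z = y * 3
z.backward()
print(x.grad)   # 3 * 2x at x = 2
```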
8. Vanishing/Exploding Gradient Problems¶
Vanishing Gradient¶
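- Cause: The sigmoid derivative is at most 0.25, and the derivatives of both sigmoid and tanh approach 0 when a unit saturates, so the product of many such factors shrinks toward 0 in deep networks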
- Solution: ReLU, Residual Connections
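The effect of depth can be illustrated with a back-of-the-envelope calculation. The sketch below deliberately ignores the weight matrices and looks only at the activation-derivative factor that each layer contributes; the depth of 20 is an arbitrary assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    return float(z > 0)

depth = 20   # number of layers (assumption)

# Best case for sigmoid: z = 0, where the derivative reaches its maximum of 0.25
sigmoid_factor = sigmoid_grad(0.0) ** depth

# For an active ReLU unit the derivative is exactly 1
relu_factor = relu_grad(1.0) ** depth

print(sigmoid_factor, relu_factor)
```

Even in the best case, the sigmoid factor after 20 layers is 0.25 to the power 20, which is below 1e-11, while the ReLU factor stays at 1 along active paths.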
Exploding Gradient¶
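- Cause: Repeated multiplication by large weights or derivatives during the backward pass, which makes gradient magnitudes grow exponentially with depth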
- Solution: Gradient Clipping, Batch Normalization
# Gradient Clipping in PyTorch (call after loss.backward() and before optimizer.step())
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
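Conceptually, clipping by norm rescales all gradients together so that their combined L2 norm does not exceed `max_norm`. The NumPy sketch below illustrates that idea only; the function name `clip_by_global_norm` and the small epsilon are assumptions, and it is not PyTorch's actual code.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scale up
    return [g * scale for g in grads], total_norm

# Example gradients with joint norm sqrt(9 + 16 + 144) = 13
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Because every array is multiplied by the same factor, the direction of the overall gradient is preserved and only its length changes.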
9. Numerical Gradient Verification¶
Numerical gradient checking verifies a backpropagation implementation by comparing its analytical gradients against finite-difference estimates.
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Compute gradient using numerical differentiation (central differences)"""
    grad = np.zeros_like(x)
    for i in range(x.size):
        # Perturb one element at a time in both directions
        x_plus = x.copy()
        x_plus.flat[i] += h
        x_minus = x.copy()
        x_minus.flat[i] -= h
        grad.flat[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad
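# Verification
analytical_grad = backward(...) # Analytical gradient from backpropagation
numerical_grad = numerical_gradient(loss_fn, weights) # loss_fn maps weights to a scalar loss
# Relative error, so the threshold does not depend on the gradient scale
diff = np.linalg.norm(analytical_grad - numerical_grad) / (
    np.linalg.norm(analytical_grad) + np.linalg.norm(numerical_grad))
assert diff < 1e-5, "Gradient check failed!"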
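A complete, runnable version of this check on a function with a known gradient might look as follows. The test function f(w) = Σ w² + Σ sin(w), whose gradient is 2w + cos(w), and the random weights are assumptions chosen for illustration.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Central-difference estimate of the gradient of scalar function f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_plus = x.copy()
        x_plus.flat[i] += h
        x_minus = x.copy()
        x_minus.flat[i] -= h
        grad.flat[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad

def loss_fn(w):
    return np.sum(w ** 2) + np.sum(np.sin(w))

def analytical_gradient(w):
    return 2 * w + np.cos(w)

rng = np.random.default_rng(0)
weights = rng.normal(size=(3, 2))   # works for any array shape via .flat

analytical_grad = analytical_gradient(weights)
numerical_grad = numerical_gradient(loss_fn, weights)

rel_error = (np.linalg.norm(analytical_grad - numerical_grad)
             / (np.linalg.norm(analytical_grad) + np.linalg.norm(numerical_grad)))
print(rel_error)
```

A deliberately wrong analytical gradient (for example, dropping the cosine term) should produce a much larger relative error, which is what makes the check useful for catching bugs.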
Summary¶
Core of Backpropagation¶
- Chain Rule: Core of composite function differentiation
- Local Computation: Each layer needs only its own local derivative and the gradient arriving from the layer after it
- Gradient Propagation: Propagate from output towards input
What You Learn from NumPy¶
- Meaning of matrix transpose and multiplication
- Role of activation function derivatives
- Gradient summation in batch processing
Moving to PyTorch¶
- All gradients computed in one line with loss.backward()
- Automatic computational graph construction
- GPU acceleration
Next Steps¶
In 04_Training_Techniques.md, we'll learn methods for weight updates using gradients.