03. Understanding Backpropagation

Previous: Neural Network Basics | Next: Training Techniques


Learning Objectives

  • Understand the principles of the backpropagation algorithm
  • Learn gradient calculation using the chain rule
  • Implement backpropagation directly with NumPy

1. What is Backpropagation?

Backpropagation is the algorithm that computes the gradient of the loss with respect to every weight in a neural network, making gradient-based training possible.

Forward Pass:  Input ──▶ Hidden Layer ──▶ Output ──▶ Loss
Backward Pass: Input ◀── Hidden Layer ◀── Output ◀── Loss

Core Ideas

  1. Forward Pass: Compute values from input to output
  2. Loss Calculation: Measure the difference between the prediction and the ground truth
  3. Backward Pass: Propagate gradients from the loss back toward the input
  4. Weight Update: Adjust weights using the gradients (see the sketch below)
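
These four steps repeat until the loss stops decreasing. As a minimal runnable illustration, here is the loop for a toy one-weight linear model (the model and data are assumptions purely for demonstration; the full MLP version is built later in this lesson):

import numpy as np

# Toy model: learn w so that y_pred = w * x matches y_true = 2 * x
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y_true = 2.0 * x
w, lr = 0.0, 0.1

for step in range(100):
    y_pred = w * x                                 # 1. Forward pass
    loss = np.mean((y_pred - y_true) ** 2)         # 2. Loss calculation (MSE)
    grad_w = np.mean(2 * (y_pred - y_true) * x)    # 3. Backward pass: dL/dw
    w -= lr * grad_w                               # 4. Weight update

print(w)  # Converges to roughly 2.0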

2. Chain Rule

The differentiation rule for composite functions.

Formula

y = f(g(x))

dy/dx = (dy/dg) × (dg/dx)

Example

z = x²
y = sin(z)
L = y²

dL/dx = (dL/dy) × (dy/dz) × (dz/dx)
      = 2y × cos(z) × 2x
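
We can sanity-check this result numerically with NumPy (a quick verification at an arbitrary point, here x = 1.5):

import numpy as np

x = 1.5
z = x ** 2
y = np.sin(z)

# Analytical gradient from the chain rule: dL/dx = 2y * cos(z) * 2x
analytical = 2 * y * np.cos(z) * 2 * x

# Numerical check via central differences
h = 1e-6
L = lambda x: np.sin(x ** 2) ** 2
numerical = (L(x + h) - L(x - h)) / (2 * h)

print(analytical, numerical)  # The two values agree closely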

3. Backpropagation for a Single Neuron

Forward Pass

z = w * x + b         # Linear transformation
a = sigmoid(z)        # Activation
L = (a - y)²          # Loss (MSE)

Backward Pass (Gradient Calculation)

dL/da = 2(a - y)                    # Gradient of loss w.r.t. activation
da/dz = sigmoid(z) * (1 - sigmoid(z))  # Sigmoid derivative
dz/dw = x                           # Gradient of linear transform w.r.t. weight
dz/db = 1                           # Gradient of linear transform w.r.t. bias

# Apply chain rule
dL/dw = (dL/da) × (da/dz) × (dz/dw)
dL/db = (dL/da) × (da/dz) × (dz/db)
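
Translating these formulas directly into NumPy gives the following (a sketch, with arbitrary example values for w, b, x, and y):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w, b, x, y = 0.5, 0.1, 2.0, 1.0  # Arbitrary example values

# Forward pass
z = w * x + b
a = sigmoid(z)
L = (a - y) ** 2

# Backward pass (chain rule)
dL_da = 2 * (a - y)
da_dz = a * (1 - a)       # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
dL_dz = dL_da * da_dz
dL_dw = dL_dz * x
dL_db = dL_dz * 1.0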

4. Loss Functions

MSE (Mean Squared Error)

L = (1/n) × Σ(y_pred - y_true)²
dL/dy_pred = (2/n) × (y_pred - y_true)

Cross-Entropy (Classification)

L = -Σ y_true × log(y_pred)
dL/dy_pred = -y_true / y_pred  # Raw gradient; simplifies when combined with softmax (see below)

Softmax + Cross-Entropy Combined

# Remarkable result: the combined gradient is very simple
dL/dz = y_pred - y_true  # Gradient w.r.t. softmax input
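
This identity is easy to verify numerically (a sketch; the logits and the one-hot label below are arbitrary):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # Subtract max for numerical stability
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])         # Arbitrary logits
y_true = np.array([0.0, 1.0, 0.0])    # One-hot label

def ce_loss(z):
    return -np.sum(y_true * np.log(softmax(z)))

analytical = softmax(z) - y_true      # The claimed gradient: y_pred - y_true

# Central-difference gradient
h = 1e-6
numerical = np.zeros_like(z)
for i in range(z.size):
    zp, zm = z.copy(), z.copy()
    zp[i] += h
    zm[i] -= h
    numerical[i] = (ce_loss(zp) - ce_loss(zm)) / (2 * h)

print(np.allclose(analytical, numerical, atol=1e-6))  # True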

5. MLP Backpropagation

Backpropagation process for a 2-layer MLP.

Architecture

Input(x) → [W1, b1] → ReLU → [W2, b2] → Output(y)

Forward Pass

z1 = x @ W1 + b1
a1 = relu(z1)
z2 = a1 @ W2 + b2
y_pred = z2  # Or softmax(z2)

Backward Pass

# Output layer
dL/dz2 = y_pred - y_true  # (for softmax + CE)
dL/dW2 = a1.T @ dL/dz2
dL/db2 = sum(dL/dz2, axis=0)

# Hidden layer
dL/da1 = dL/dz2 @ W2.T
dL/dz1 = dL/da1 * relu_derivative(z1)
dL/dW1 = x.T @ dL/dz1
dL/db1 = sum(dL/dz1, axis=0)
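
Shape bookkeeping is a quick way to check these formulas. Assuming a batch of N samples, input dimension D, hidden dimension H, and output dimension C:

# x: (N, D)    W1: (D, H)    z1, a1: (N, H)
# W2: (H, C)   z2, y_pred: (N, C)
# dL/dz2: (N, C)  ->  a1.T @ dL/dz2 has shape (H, C), matching W2
# dL/dz1: (N, H)  ->  x.T  @ dL/dz1 has shape (D, H), matching W1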

6. NumPy Implementation Core

import numpy as np

class MLP:
    def backward(self, x, y_true, y_pred, cache):
        """Backpropagation: compute gradients"""
        a1, z1 = cache

        # Output layer gradients
        dz2 = y_pred - y_true
        dW2 = a1.T @ dz2
        db2 = np.sum(dz2, axis=0)

        # Hidden layer gradients (chain rule)
        da1 = dz2 @ self.W2.T
        dz1 = da1 * (z1 > 0)  # ReLU derivative
        dW1 = x.T @ dz1
        db1 = np.sum(dz1, axis=0)

        return {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2}
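
For completeness, a matching forward method might look like the following (a sketch: it assumes W1, b1, W2, and b2 were initialized elsewhere, and applies softmax because backward uses the combined softmax + cross-entropy gradient):

    def forward(self, x):
        """Forward pass: returns the prediction and the cache backward() expects."""
        z1 = x @ self.W1 + self.b1
        a1 = np.maximum(0, z1)                          # ReLU
        z2 = a1 @ self.W2 + self.b2
        e = np.exp(z2 - z2.max(axis=1, keepdims=True))  # Numerically stable softmax
        y_pred = e / e.sum(axis=1, keepdims=True)
        return y_pred, (a1, z1)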

7. PyTorch's Automatic Differentiation

In PyTorch, all of this is automatic.

# Forward pass
y_pred = model(x)
loss = criterion(y_pred, y_true)

# Backward pass (automatic!)
loss.backward()

# Access gradients
print(model.fc1.weight.grad)
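
The snippet above assumes model, criterion, x, and y_true already exist. A self-contained version of the same idea (a minimal sketch with arbitrary layer sizes and random data):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
criterion = nn.MSELoss()
x = torch.randn(16, 4)
y_true = torch.randn(16, 2)

y_pred = model(x)
loss = criterion(y_pred, y_true)
loss.backward()                    # Gradients for every parameter in one call

print(model[0].weight.grad.shape)  # torch.Size([8, 4])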

Computational Graph

x = torch.tensor([2.0], requires_grad=True)
y = x ** 2
z = y * 3
z.backward()

# x.grad = dz/dx = dz/dy × dy/dx = 3 × 2x = 12
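print(x.grad)  # tensor([12.]), matching the hand-computed value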

8. Vanishing/Exploding Gradient Problems

Vanishing Gradient

  • Cause: The derivatives of sigmoid/tanh are close to 0, so gradients shrink layer by layer (illustrated below)
  • Solution: ReLU, Residual Connections
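
To see why, note that sigmoid'(z) = σ(z)(1 − σ(z)) never exceeds 0.25, so each additional sigmoid layer can shrink the gradient by a factor of 4 or more (a back-of-the-envelope illustration):

# Best case: each sigmoid layer multiplies the gradient by at most 0.25
grad = 1.0
for layer in range(10):
    grad *= 0.25
print(grad)  # ~9.5e-07: nearly vanished after only 10 layers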

Exploding Gradient

  • Cause: Repeated multiplication by large weight or gradient values in deep networks
  • Solution: Gradient Clipping, Batch Normalization

# Gradient Clipping in PyTorch
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

9. Numerical Gradient Verification

A way to check whether a backpropagation implementation is correct: compare the analytical gradients against numerically approximated ones.

def numerical_gradient(f, x, h=1e-5):
    """Compute gradient using numerical differentiation"""
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_plus = x.copy()
        x_plus.flat[i] += h
        x_minus = x.copy()
        x_minus.flat[i] -= h
        grad.flat[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad

# Verification
analytical_grad = backward(...)  # Analytical gradient
numerical_grad = numerical_gradient(loss_fn, weights)
diff = np.linalg.norm(analytical_grad - numerical_grad)
assert diff < 1e-5, "Gradient check failed!"
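
A concrete end-to-end check on a toy function (a sketch; f below is a simple quadratic whose analytical gradient is 2x):

import numpy as np

f = lambda x: np.sum(x ** 2)       # L = Σ x_i², so dL/dx = 2x
x = np.random.randn(3, 4)

numerical = numerical_gradient(f, x)
analytical = 2 * x
print(np.max(np.abs(numerical - analytical)))  # Tiny (~1e-10): check passes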

Summary

Core of Backpropagation

  1. Chain Rule: Core of composite function differentiation
  2. Local Computation: Gradients computed independently at each layer
  3. Gradient Propagation: Propagate from output towards input

What You Learn from NumPy

  • Meaning of matrix transpose and multiplication
  • Role of activation function derivatives
  • Gradient summation in batch processing

Moving to PyTorch

  • All gradients computed in one line with loss.backward()
  • Automatic computational graph construction
  • GPU acceleration

Next Steps

In 04_Training_Techniques.md, we'll learn how to use these gradients to update weights effectively.
