06. Multi-Layer Perceptron (MLP)¶
Previous: Linear & Logistic Regression | Next: CNN Basics
Overview¶
The MLP is the fundamental building block of deep learning. The key is understanding how multiple layers are trained via backpropagation.
Learning Objectives¶
- Forward Pass: Understanding forward propagation in multi-layer structures
- Backward Pass: Backpropagation using the Chain Rule
- Activation Functions: Characteristics and derivatives of ReLU, Sigmoid, Tanh
- Weight Initialization: Importance of proper initialization
Mathematical Background¶
1. Forward Pass¶
Input: x ∈ ℝ^d₀
Layer 1: z₁ = W₁x + b₁, a₁ = σ(z₁)
Layer 2: z₂ = W₂a₁ + b₂, a₂ = σ(z₂)
...
Output: ŷ = a_L
Where:
- Wᵢ ∈ ℝ^(dᵢ × dᵢ₋₁): weight matrix
- bᵢ ∈ ℝ^dᵢ: bias
- σ: activation function
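The forward pass above can be sketched in NumPy. The layer sizes, the ReLU activation, and the He-scaled random weights are illustrative assumptions, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

# Layer dimensions d0 -> d1 -> d2 (illustrative values)
d0, d1, d2 = 4, 8, 3
W1 = rng.standard_normal((d1, d0)) * np.sqrt(2 / d0)  # W1 ∈ R^(d1 × d0)
b1 = np.zeros(d1)
W2 = rng.standard_normal((d2, d1)) * np.sqrt(2 / d1)
b2 = np.zeros(d2)

x = rng.standard_normal(d0)   # input x ∈ R^d0
z1 = W1 @ x + b1              # pre-activation of layer 1
a1 = relu(z1)                 # a1 = σ(z1)
z2 = W2 @ a1 + b2             # pre-activation of layer 2
y_hat = z2                    # linear output layer (assumption)
print(y_hat.shape)            # (3,)
```

Note how the shape of each Wᵢ, (dᵢ × dᵢ₋₁), determines which products are legal: `W1 @ x` maps ℝ^d₀ to ℝ^d₁.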
2. Backward Pass (Backpropagation)¶
Loss: L = Loss(y, ŷ)
Chain Rule:
∂L/∂Wᵢ = ∂L/∂aᵢ × ∂aᵢ/∂zᵢ × ∂zᵢ/∂Wᵢ
Backpropagation order:
1. ∂L/∂ŷ (derivative of loss w.r.t. output)
2. ∂L/∂z_L = ∂L/∂ŷ ⊙ σ'(z_L) (elementwise product)
3. ∂L/∂W_L = ∂L/∂z_L · a_{L-1}ᵀ
4. ∂L/∂a_{L-1} = W_Lᵀ · ∂L/∂z_L
5. Repeat for layers L-1, ..., 1
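A minimal sketch of these steps for a 2-layer network, assuming a squared-error loss and a linear output layer (both assumptions for illustration). The last lines verify one analytic gradient entry against a finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z): return np.maximum(0.0, z)
def relu_grad(z): return (z > 0).astype(z.dtype)

# Tiny 2-layer MLP with squared-error loss (illustrative sizes)
d0, d1, d2 = 3, 5, 2
W1 = rng.standard_normal((d1, d0)); b1 = np.zeros(d1)
W2 = rng.standard_normal((d2, d1)); b2 = np.zeros(d2)
x = rng.standard_normal(d0); y = rng.standard_normal(d2)

# Forward pass
z1 = W1 @ x + b1; a1 = relu(z1)
z2 = W2 @ a1 + b2; y_hat = z2
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass, following the numbered steps above
dy = y_hat - y              # 1. ∂L/∂ŷ for squared error
dz2 = dy                    # 2. output layer is linear, so σ' = 1
dW2 = np.outer(dz2, a1)     # 3. ∂L/∂W2 = ∂L/∂z2 · a1ᵀ
da1 = W2.T @ dz2            # 4. ∂L/∂a1 = W2ᵀ · ∂L/∂z2
dz1 = da1 * relu_grad(z1)   # 5. repeat step 2 for layer 1
dW1 = np.outer(dz1, x)

# Sanity check: finite-difference estimate of ∂L/∂W1[0,0]
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
a1p = relu(W1p @ x + b1)
loss_p = 0.5 * np.sum((W2 @ a1p + b2 - y) ** 2)
print(abs((loss_p - loss) / eps - dW1[0, 0]))  # should be tiny
```

The outer products in steps 3 and 5 are exactly why each ∂L/∂Wᵢ has the same (dᵢ × dᵢ₋₁) shape as Wᵢ itself.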
3. Activation Functions¶
ReLU: σ(z) = max(0, z)
σ'(z) = 1 if z > 0 else 0
Sigmoid: σ(z) = 1/(1 + e⁻ᶻ)
σ'(z) = σ(z)(1 - σ(z))
Tanh: σ(z) = (eᶻ - e⁻ᶻ)/(eᶻ + e⁻ᶻ)
σ'(z) = 1 - σ(z)²
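The three activations and their derivatives translate directly to NumPy; a sketch:

```python
import numpy as np

# Each activation paired with its derivative, matching the formulas above
def relu(z):         return np.maximum(0.0, z)
def relu_grad(z):    return (z > 0).astype(float)

def sigmoid(z):      return 1.0 / (1.0 + np.exp(-z))
def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # σ(z)(1 - σ(z))

def tanh(z):         return np.tanh(z)
def tanh_grad(z):    return 1.0 - np.tanh(z) ** 2

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))             # [0. 0. 2.]
print(sigmoid_grad(0.0))   # 0.25, the maximum of σ'(z)
```

The value σ'(0) = 0.25 is the per-layer bound that drives the vanishing-gradient discussion below.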
File Structure¶
02_MLP/
├── README.md
├── numpy/
│   ├── mlp_numpy.py          # Complete MLP implementation
│   ├── activations_numpy.py  # Activation functions
│   └── test_mlp.py           # Tests
├── pytorch_lowlevel/
│   └── mlp_lowlevel.py       # Implementation without nn.Linear
├── paper/
│   └── mlp_paper.py          # Clean nn.Module
└── exercises/
    ├── 01_add_dropout.md
    ├── 02_batch_norm.md
    └── 03_xor_problem.md
Core Concepts¶
1. Vanishing/Exploding Gradients¶
Problem: Gradients vanish or explode as layers get deeper
- Sigmoid: σ'(z) ≤ 0.25 → the product over layers converges to 0
- Solution: ReLU, proper initialization, BatchNorm, residual connections (ResNet)
Example:
10 layers with Sigmoid → gradient ≈ 0.25¹⁰ ≈ 10⁻⁶
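The estimate above can be checked numerically; 0.25 per layer is the worst (largest) case, so a real product of sigmoid derivatives is at least this small:

```python
import numpy as np

# Upper bound: every sigmoid unit contributes its maximum slope σ'(0) = 0.25,
# so 10 layers shrink the activation part of the gradient by 0.25**10.
factor = 0.25 ** 10
print(factor)   # ≈ 9.5e-07, i.e. on the order of 10⁻⁶

# Empirically: product of σ'(z) over 10 layers for random pre-activations
rng = np.random.default_rng(0)
def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

prod = np.prod([sigmoid_grad(rng.standard_normal()) for _ in range(10)])
print(prod <= factor)   # True: 0.25 bounds each term from above
```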
2. Xavier/He Initialization¶
import numpy as np

# Xavier (Glorot): for tanh, sigmoid
W = np.random.randn(in_dim, out_dim) * np.sqrt(1 / in_dim)
# Or the symmetric variant
W = np.random.randn(in_dim, out_dim) * np.sqrt(2 / (in_dim + out_dim))
# He (Kaiming): for ReLU
W = np.random.randn(in_dim, out_dim) * np.sqrt(2 / in_dim)
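Why the scale matters can be seen by pushing data through a deep ReLU stack. This demo (depth and width are arbitrary choices) compares the He scale √(2/d) with a √(1/d) scale, which halves the variance at every ReLU layer:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, dim, n = 20, 256, 512
x = rng.standard_normal((n, dim))

def forward_std(scale):
    """Push data through `depth` ReLU layers; return final activation std."""
    a = x
    for _ in range(depth):
        W = rng.standard_normal((dim, dim)) * scale
        a = np.maximum(0.0, a @ W)
    return a.std()

std_he    = forward_std(np.sqrt(2 / dim))  # He scale: std stays near 1
std_small = forward_std(np.sqrt(1 / dim))  # variance halves per ReLU layer
print(std_he, std_small)
```

With the smaller scale, each layer multiplies the variance by roughly 1/2, so after 20 layers the activations (and hence the gradients) are about 2¹⁰ ≈ 1000× smaller than with He initialization.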
3. Universal Approximation Theorem¶
A feedforward network with a single hidden layer can approximate any continuous function, given enough neurons.
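The theorem is existential, but even a tiny instance is instructive: a single hidden layer of two ReLU units computes XOR exactly, something no single linear layer can do. The weights below are a well-known hand-constructed solution:

```python
import numpy as np

# Hand-constructed 2-unit hidden layer computing XOR:
#   h1 = ReLU(x1 + x2), h2 = ReLU(x1 + x2 - 1), output = h1 - 2*h2
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])

def mlp_xor(x):
    h = np.maximum(0.0, W1 @ x + b1)   # hidden layer
    return w2 @ h                      # linear output

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, mlp_xor(np.array(x, dtype=float)))
# (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0
```

The second unit fires only when both inputs are on, and the output weight −2 subtracts that case away; this is the construction the `03_xor_problem.md` exercise asks you to rediscover by training.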
Practice Problems¶
Basic¶
- Solve XOR problem (2-layer MLP)
- Compare different activation functions
- Compare learning curves with different initialization methods
Intermediate¶
- Implement Dropout
- Implement Batch Normalization
- Implement Learning Rate Scheduler
Advanced¶
- MNIST classification (>98% accuracy)
- Implement Gradient Clipping
- Implement Weight Decay (L2 regularization)
References¶
- Rumelhart et al. (1986). "Learning representations by back-propagating errors"
- Glorot & Bengio (2010). "Understanding the difficulty of training deep feedforward neural networks"
- He et al. (2015). "Delving Deep into Rectifiers" (He initialization)