06. Multi-Layer Perceptron (MLP)

Previous: Linear & Logistic Regression | Next: CNN Basics


Overview

The MLP is the fundamental building block of deep learning. The key idea is training multiple layers jointly through backpropagation.

Learning Objectives

  1. Forward Pass: Understanding forward propagation in multi-layer structures
  2. Backward Pass: Backpropagation using the Chain Rule
  3. Activation Functions: Characteristics and derivatives of ReLU, Sigmoid, Tanh
  4. Weight Initialization: Importance of proper initialization

Mathematical Background

1. Forward Pass

Input: x ∈ ℝ^d₀

Layer 1: z₁ = W₁x + b₁,  a₁ = σ(z₁)
Layer 2: z₂ = W₂a₁ + b₂,  a₂ = σ(z₂)
...
Output:  ŷ = aₙ

Where:
- Wᵢ ∈ ℝ^(dᵢ × dᵢ₋₁): weight matrix
- bᵢ ∈ ℝ^dᵢ: bias vector
- σ: activation function (applied elementwise)
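The layer recurrence above can be sketched in a few lines of NumPy (the function and variable names here are illustrative, not taken from the repository's `mlp_numpy.py`):

```python
import numpy as np

def forward(x, weights, biases, act=lambda z: np.maximum(0.0, z)):
    """Run the forward pass; returns the output and cached (z_i, a_i) pairs."""
    a = x
    cache = [(None, a)]            # a_0 = x; cache is needed later for backprop
    for W, b in zip(weights, biases):
        z = W @ a + b              # z_i = W_i a_{i-1} + b_i
        a = act(z)                 # a_i = sigma(z_i)
        cache.append((z, a))
    return a, cache

# Tiny example: a 2 -> 3 -> 1 network with ReLU
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
bs = [np.zeros(3), np.zeros(1)]
y_hat, cache = forward(np.array([1.0, -1.0]), Ws, bs)
print(y_hat.shape)  # (1,)
```

Note how each weight matrix Wᵢ maps ℝ^(dᵢ₋₁) to ℝ^dᵢ, matching the shapes listed above.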

2. Backward Pass (Backpropagation)

Loss: L = Loss(y, ŷ)

Chain Rule:
∂L/∂Wᵢ = ∂L/∂aᵢ × ∂aᵢ/∂zᵢ × ∂zᵢ/∂Wᵢ

Backpropagation order:
1. ∂L/∂ŷ (derivative of loss w.r.t. output)
2. ∂L/∂zₙ = ∂L/∂ŷ ⊙ σ'(zₙ)  (⊙: elementwise product)
3. ∂L/∂Wₙ = ∂L/∂zₙ × aₙ₋₁ᵀ  (outer product)
4. ∂L/∂aₙ₋₁ = Wₙᵀ × ∂L/∂zₙ
5. Repeat for layers n-1, n-2, ..., 1
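As a sanity check, these steps can be implemented directly for a small 2-layer sigmoid network and compared against a finite-difference gradient. A hedged sketch: `loss_and_grads` is a made-up helper, and MSE is assumed as the loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(x, y, W1, b1, W2, b2):
    """MSE loss and weight gradients, following the backprop order above."""
    # Forward
    z1 = W1 @ x + b1; a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; y_hat = sigmoid(z2)
    L = 0.5 * np.sum((y_hat - y) ** 2)
    # Backward
    dy  = y_hat - y                      # step 1: dL/dy_hat
    dz2 = dy * y_hat * (1 - y_hat)       # step 2: elementwise sigma'(z2)
    dW2 = np.outer(dz2, a1)              # step 3: dL/dz2 . a1^T
    da1 = W2.T @ dz2                     # step 4: W2^T . dL/dz2
    dz1 = da1 * a1 * (1 - a1)            # step 5: repeat for layer 1
    dW1 = np.outer(dz1, x)
    return L, dW1, dW2

rng = np.random.default_rng(1)
x, y = rng.standard_normal(2), np.array([1.0])
W1, b1 = rng.standard_normal((3, 2)), np.zeros(3)
W2, b2 = rng.standard_normal((1, 3)), np.zeros(1)
L, dW1, dW2 = loss_and_grads(x, y, W1, b1, W2, b2)

# Finite-difference check on one entry of W1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
Lp, _, _ = loss_and_grads(x, y, W1p, b1, W2, b2)
print(abs((Lp - L) / eps - dW1[0, 0]) < 1e-4)  # True
```

A gradient check like this is the standard way to catch transposition bugs in steps 3 and 4.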

3. Activation Functions

ReLU:     σ(z) = max(0, z)
          σ'(z) = 1 if z > 0 else 0

Sigmoid:  σ(z) = 1/(1 + e⁻ᶻ)
          σ'(z) = σ(z)(1 - σ(z))

Tanh:     σ(z) = (eᶻ - e⁻ᶻ)/(eᶻ + e⁻ᶻ)
          σ'(z) = 1 - σ(z)²
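These definitions translate directly to NumPy; a minimal sketch (function names are my own, not a fixed API):

```python
import numpy as np

def relu(z):     return np.maximum(0.0, z)
def drelu(z):    return (z > 0).astype(float)

def sigmoid(z):  return 1.0 / (1.0 + np.exp(-z))
def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh(z):     return np.tanh(z)
def dtanh(z):    return 1.0 - np.tanh(z) ** 2

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))           # [0. 0. 2.]
print(dsigmoid(0.0))     # 0.25 -- sigmoid's derivative peaks at z = 0
```

The 0.25 peak of the sigmoid derivative is exactly what drives the vanishing-gradient problem discussed below.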

File Structure

02_MLP/
├── README.md
├── numpy/
│   ├── mlp_numpy.py           # Complete MLP implementation
│   ├── activations_numpy.py   # Activation functions
│   └── test_mlp.py            # Tests
├── pytorch_lowlevel/
│   └── mlp_lowlevel.py        # Implementation without nn.Linear
├── paper/
│   └── mlp_paper.py           # Clean nn.Module implementation
└── exercises/
    ├── 01_add_dropout.md
    ├── 02_batch_norm.md
    └── 03_xor_problem.md

Core Concepts

1. Vanishing/Exploding Gradients

Problem: Gradients vanish or explode as networks get deeper
- Sigmoid: σ'(z) ≤ 0.25 → the product across layers shrinks toward 0
- Solutions: ReLU, proper initialization, BatchNorm, residual connections (ResNet)

Example:
10 layers with Sigmoid → gradient ≈ 0.25¹⁰ ≈ 10⁻⁶
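The shrinkage can be simulated by backpropagating a unit gradient through a stack of random layers, using sigmoid's best-case derivative (0.25 everywhere) versus ReLU's 0/1 mask. This is an illustrative sketch with arbitrary dimensions, not a full training run:

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 64, 10

def grad_norm(act_deriv):
    """Norm of a unit gradient after backpropagating through `depth` layers."""
    g = np.ones(d)
    for _ in range(depth):
        W = rng.standard_normal((d, d)) * np.sqrt(1.0 / d)
        z = rng.standard_normal(d)
        g = (W.T @ g) * act_deriv(z)   # one backward step: W^T g, then sigma'
    return np.linalg.norm(g)

sig_d  = lambda z: 0.25 * np.ones_like(z)   # sigmoid's BEST case everywhere
relu_d = lambda z: (z > 0).astype(float)    # ReLU's 0/1 mask

n_sig, n_relu = grad_norm(sig_d), grad_norm(relu_d)
print(n_sig < n_relu)  # True: the sigmoid gradient is orders of magnitude smaller
```

Even granting sigmoid its maximum derivative at every unit, the gradient collapses relative to ReLU after only 10 layers.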

2. Xavier/He Initialization

import numpy as np

in_dim, out_dim = 256, 128  # example layer dimensions

# Xavier (Glorot): for tanh, sigmoid
W = np.random.randn(in_dim, out_dim) * np.sqrt(1 / in_dim)
# Or the symmetric form:
W = np.random.randn(in_dim, out_dim) * np.sqrt(2 / (in_dim + out_dim))

# He (Kaiming): for ReLU
W = np.random.randn(in_dim, out_dim) * np.sqrt(2 / in_dim)
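A quick way to see why the scale matters: push data through a deep stack of ReLU layers and watch the activation magnitude. With He scaling it stays O(1); with the smaller 1/√d scale it shrinks by roughly √2 per layer. Dimensions here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 256, 20
x = rng.standard_normal((d, 1000))       # 1000 random input vectors

def final_std(scale):
    """Std of activations after `depth` ReLU layers initialized at `scale`."""
    a = x
    for _ in range(depth):
        W = rng.standard_normal((d, d)) * scale
        a = np.maximum(0.0, W @ a)       # ReLU layer, no bias
    return a.std()

he_std = final_std(np.sqrt(2.0 / d))     # He: magnitude stays roughly constant
xv_std = final_std(np.sqrt(1.0 / d))     # 1/sqrt(d): shrinks ~2^(-depth/2)
print(he_std > 10 * xv_std)              # True
```

This is exactly the calculation behind the factor of 2 in He initialization: ReLU zeroes half the variance, so the weights must inject twice as much.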

3. Universal Approximation Theorem

A feedforward network with a single hidden layer can approximate any continuous function, given enough neurons.
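The theorem is about existence, not training. To make it concrete, one can construct (by hand, with no training at all) a single-hidden-layer network that approximates sin(x) as a sum of steep sigmoid "steps"; the knot count and steepness below are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))   # numerically stable form

# One hidden unit per knot: unit k switches on near x = knots[k] and
# contributes the increment sin(knots[k]) - sin(knots[k-1]).
knots = np.linspace(0.0, 2 * np.pi, 200)
incr  = np.diff(np.sin(knots), prepend=0.0)  # output-layer weights
steep = 500.0                                # hidden-layer slope

def net(x):
    h = sigmoid(steep * (x[:, None] - knots[None, :]))  # hidden activations
    return h @ incr                                     # linear output layer

xs  = np.linspace(0.1, 2 * np.pi - 0.1, 50)
err = np.max(np.abs(net(xs) - np.sin(xs)))
print(err < 0.1)  # True: 200 hidden units already land within 0.1 everywhere
```

More hidden units (finer knots) drive the error down arbitrarily, which is the theorem's claim; what it does not promise is that gradient descent will find such weights.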


Practice Problems

Basic

  1. Solve XOR problem (2-layer MLP)
  2. Compare different activation functions
  3. Compare learning curves with different initialization methods

Intermediate

  1. Implement Dropout
  2. Implement Batch Normalization
  3. Implement Learning Rate Scheduler

Advanced

  1. MNIST classification (>98% accuracy)
  2. Implement Gradient Clipping
  3. Implement Weight Decay (L2 regularization)

References

  • Rumelhart et al. (1986). "Learning representations by back-propagating errors"
  • Glorot & Bengio (2010). "Understanding the difficulty of training deep feedforward neural networks"
  • He et al. (2015). "Delving Deep into Rectifiers" (He initialization)