06. Multi-Layer Perceptron (MLP)
Previous: Linear & Logistic Regression | Next: CNN Basics
Overview
The MLP is the basic building block of deep learning. The key of this chapter is understanding how multiple layers are trained together through backpropagation.
Learning Objectives
- Forward Pass: understand forward propagation through a multi-layer network
- Backward Pass: backpropagation using the chain rule
- Activation Functions: properties and derivatives of ReLU, Sigmoid, and Tanh
- Weight Initialization: why proper initialization matters
Mathematical Background
1. Forward Pass
Input: x ∈ ℝ^d₀
Layer 1: z₁ = W₁x + b₁,  a₁ = σ(z₁)
Layer 2: z₂ = W₂a₁ + b₂,  a₂ = σ(z₂)
...
Output: ŷ = a_L
where:
- Wᵢ ∈ ℝ^(dᵢ × dᵢ₋₁): weight matrix
- bᵢ ∈ ℝ^dᵢ: bias vector
- σ: activation function
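A minimal NumPy sketch of this forward pass (the layer sizes, the tanh activation, and the He-style weight scaling are illustrative assumptions, not the repo's reference implementation):

import numpy as np

def forward(x, weights, biases, activation=np.tanh):
    # a_0 = x; then z_i = W_i a_{i-1} + b_i and a_i = sigma(z_i) for each layer
    a = x
    cache = []                       # store (z_i, a_i) pairs for the backward pass
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = activation(z)
        cache.append((z, a))
    return a, cache

# Example: d0=3 -> d1=4 -> d2=2 (dimensions chosen arbitrarily)
rng = np.random.default_rng(0)
dims = [3, 4, 2]
weights = [rng.standard_normal((dims[i + 1], dims[i])) * np.sqrt(2 / dims[i])
           for i in range(len(dims) - 1)]
biases = [np.zeros(dims[i + 1]) for i in range(len(dims) - 1)]
y_hat, _ = forward(rng.standard_normal(3), weights, biases)
print(y_hat.shape)   # (2,)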
2. Backward Pass (Backpropagation)
Loss: L = Loss(y, ŷ)
Chain rule:
∂L/∂Wᵢ = ∂L/∂aᵢ × ∂aᵢ/∂zᵢ × ∂zᵢ/∂Wᵢ
Backpropagation proceeds in this order (a worked example follows the list):
1. ∂L/∂ŷ (derivative of the loss with respect to the output)
2. ∂L/∂z_L = ∂L/∂ŷ × σ'(z_L)
3. ∂L/∂W_L = ∂L/∂z_L · a_{L−1}ᵀ
4. ∂L/∂a_{L−1} = W_Lᵀ · ∂L/∂z_L
5. Repeat for the layer below, down to layer 1.
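A worked example of these steps for a tiny two-layer sigmoid MLP; the squared-error loss and the layer sizes are assumptions for illustration, and the last lines check one gradient entry numerically:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny 2-layer MLP; squared-error loss L = 0.5 * ||y_hat - y||^2 is assumed here
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)) * 0.5, np.zeros(4)
W2, b2 = rng.standard_normal((2, 4)) * 0.5, np.zeros(2)
x, y = rng.standard_normal(3), rng.standard_normal(2)

# Forward pass
z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; y_hat = sigmoid(z2)

# Backward pass, following the numbered steps above
dL_dyhat = y_hat - y                        # 1. dL/dy_hat for squared error
dL_dz2 = dL_dyhat * y_hat * (1 - y_hat)     # 2. dL/dz_L = dL/dy_hat * sigma'(z_L)
dL_dW2 = np.outer(dL_dz2, a1)               # 3. dL/dW_L = dL/dz_L . a_{L-1}^T
dL_db2 = dL_dz2
dL_da1 = W2.T @ dL_dz2                      # 4. dL/da_{L-1} = W_L^T . dL/dz_L
dL_dz1 = dL_da1 * a1 * (1 - a1)             # 5. repeat for layer 1
dL_dW1 = np.outer(dL_dz1, x)
dL_db1 = dL_dz1

# Finite-difference check on a single weight (the two numbers should agree closely)
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
a1p = sigmoid(W1p @ x + b1)
y_hatp = sigmoid(W2 @ a1p + b2)
numeric = (0.5 * np.sum((y_hatp - y) ** 2) - 0.5 * np.sum((y_hat - y) ** 2)) / eps
print(numeric, dL_dW1[0, 0])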
3. Activation Functions
ReLU:    σ(z) = max(0, z)
         σ'(z) = 1 if z > 0 else 0
Sigmoid: σ(z) = 1/(1 + e⁻ᶻ)
         σ'(z) = σ(z)(1 − σ(z))
Tanh:    σ(z) = (eᶻ − e⁻ᶻ)/(eᶻ + e⁻ᶻ)
         σ'(z) = 1 − σ(z)²
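These formulas translate directly into NumPy; the function names below are illustrative and not necessarily the ones used in activations_numpy.py:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return np.where(z > 0, 1.0, 0.0)   # derivative taken as 0 at z == 0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2       # tanh itself is np.tanh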
File Structure
02_MLP/
├── README.md
├── numpy/
│   ├── mlp_numpy.py          # Complete MLP implementation
│   ├── activations_numpy.py  # Activation functions
│   └── test_mlp.py           # Tests
├── pytorch_lowlevel/
│   └── mlp_lowlevel.py       # Implementation without nn.Linear
├── paper/
│   └── mlp_paper.py          # Clean nn.Module
└── exercises/
    ├── 01_add_dropout.md
    ├── 02_batch_norm.md
    └── 03_xor_problem.md
Key Concepts
1. Vanishing/Exploding Gradients
Problem: as the network gets deeper, gradients vanish or explode.
- Sigmoid: σ'(z) ≤ 0.25 → the product of many such factors converges to 0
- Fixes: ReLU, proper initialization, BatchNorm, residual connections (ResNet)
Example:
10 layers with Sigmoid → gradient ≈ 0.25¹⁰ ≈ 10⁻⁶
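A quick back-of-the-envelope check of that number (pure arithmetic, no model needed):

import numpy as np

# Best case for sigmoid: sigma'(z) = 0.25 at z = 0. Ten such factors in a row:
print(0.25 ** 10)                 # 9.54e-07 -- the signal is essentially gone

# In practice |z| > 0 makes each factor even smaller, e.g. at z = 2:
s = 1 / (1 + np.exp(-2.0))
print((s * (1 - s)) ** 10)        # ~1.6e-10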
2. Xavier/He Initialization
import numpy as np

in_dim, out_dim = 256, 128  # example layer sizes (illustrative values)

# Xavier (Glorot): for tanh / sigmoid
W = np.random.randn(in_dim, out_dim) * np.sqrt(1 / in_dim)
# or
W = np.random.randn(in_dim, out_dim) * np.sqrt(2 / (in_dim + out_dim))

# He (Kaiming): for ReLU
W = np.random.randn(in_dim, out_dim) * np.sqrt(2 / in_dim)
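The factor of 2 in the He variant compensates for ReLU zeroing out roughly half of its inputs, keeping the variance of activations roughly constant from layer to layer; Xavier's 1/n_in (or 2/(n_in + n_out)) plays the same role for activations such as tanh that are approximately linear around zero.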
3. Universal Approximation Theorem
A feedforward network with a single hidden layer can approximate any continuous function (on a compact domain) to arbitrary accuracy, given enough hidden neurons.
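As a concrete illustration of the idea (not a proof), a one-hidden-layer ReLU network can be built by hand to reproduce the piecewise-linear interpolant of a function on a grid, and such interpolants approximate any continuous function arbitrarily well as the grid gets finer; everything below is an assumed toy construction:

import numpy as np

# One-hidden-layer ReLU network: hidden weights are 1, hidden biases are -knot_i,
# output weights are the slope changes, output bias is f(knot_0).
def build_relu_approximator(f, knots):
    slopes = np.diff(f(knots)) / np.diff(knots)   # slope of f on each segment
    out_w = np.diff(slopes, prepend=0.0)          # slope change at each knot
    def net(x):
        hidden = np.maximum(0.0, x[:, None] - knots[:-1][None, :])  # ReLU(x - knot_i)
        return f(knots[0]) + hidden @ out_w       # output layer: weighted sum + bias
    return net

knots = np.linspace(0.0, 2 * np.pi, 20)
approx = build_relu_approximator(np.sin, knots)
xs = np.linspace(0.0, 2 * np.pi, 200)
print(np.max(np.abs(approx(xs) - np.sin(xs))))    # small, and shrinks with more knots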
Exercises
Basic
- Solve the XOR problem with a 2-layer MLP
- Compare different activation functions
- Compare learning curves under different initialization methods
Intermediate
- Implement Dropout
- Implement Batch Normalization
- Implement a learning-rate scheduler
Advanced
- MNIST classification (98%+ accuracy)
- Implement gradient clipping
- Implement weight decay (L2 regularization)
References
- Rumelhart et al. (1986). "Learning representations by back-propagating errors"
- Glorot & Bengio (2010). "Understanding the difficulty of training deep feedforward neural networks"
- He et al. (2015). "Delving Deep into Rectifiers" (He initialization)