06. Multi-Layer Perceptron (MLP)

Previous: Linear & Logistic Regression | Next: CNN Basics


Overview

The MLP is the basic building block of deep learning. The key is understanding how backpropagation trains multiple layers at once.

Learning Objectives

  1. Forward Pass: understand forward propagation through a multi-layer network
  2. Backward Pass: backpropagation via the chain rule
  3. Activation Functions: properties and derivatives of ReLU, Sigmoid, and Tanh
  4. Weight Initialization: why proper initialization matters

μˆ˜ν•™μ  λ°°κ²½

1. Forward Pass

Input: x ∈ ℝ^dβ‚€

Layer 1: z₁ = W₁x + b₁,  a₁ = Οƒ(z₁)
Layer 2: zβ‚‚ = Wβ‚‚a₁ + bβ‚‚,  aβ‚‚ = Οƒ(zβ‚‚)
...
Output:  Ε· = aβ‚™

where:
- Wα΅’ ∈ ℝ^(dα΅’ Γ— dᡒ₋₁): weight matrix
- bα΅’ ∈ ℝ^dα΅’: bias vector
- Οƒ: activation function
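
As a concrete reference, here is a minimal NumPy sketch of this forward pass (the layer sizes and the choice of sigmoid for Οƒ are illustrative assumptions, not fixed by the math above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # Apply z_i = W_i a_{i-1} + b_i, then a_i = sigma(z_i), layer by layer.
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
    return a  # y_hat = a_n

# Example with d0 = 3, d1 = 4, d2 = 2
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [np.zeros(4), np.zeros(2)]
y_hat = forward(rng.standard_normal(3), weights, biases)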

2. Backward Pass (Backpropagation)

Loss: L = Loss(y, Ε·)

Chain Rule:
βˆ‚L/βˆ‚Wα΅’ = βˆ‚L/βˆ‚aα΅’ Γ— βˆ‚aα΅’/βˆ‚zα΅’ Γ— βˆ‚zα΅’/βˆ‚Wα΅’

Backpropagation order:
1. βˆ‚L/βˆ‚Ε· (derivative of the loss at the output)
2. βˆ‚L/βˆ‚zβ‚™ = βˆ‚L/βˆ‚Ε· βŠ™ Οƒ'(zβ‚™)  (elementwise)
3. βˆ‚L/βˆ‚Wβ‚™ = βˆ‚L/βˆ‚zβ‚™ Γ— aₙ₋₁ᡀ
4. βˆ‚L/βˆ‚aₙ₋₁ = Wβ‚™α΅€ Γ— βˆ‚L/βˆ‚zβ‚™
5. Repeat for layers n-1, n-2, ...
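
The same recursion in NumPy, as a sketch: it assumes sigmoid activations on every layer and a squared-error loss L = Β½β€–Ε· βˆ’ yβ€–Β², so that βˆ‚L/βˆ‚Ε· = Ε· βˆ’ y.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backward(x, y, weights, biases):
    # Forward pass, caching every a_i; for sigmoid, sigma'(z_i) = a_i * (1 - a_i).
    activations = [x]
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
        activations.append(a)

    # Steps 1-2: delta_n = dL/dz_n = (y_hat - y) * sigma'(z_n)
    delta = (activations[-1] - y) * activations[-1] * (1 - activations[-1])

    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    for i in reversed(range(len(weights))):
        grads_W[i] = np.outer(delta, activations[i])  # step 3: dL/dW_i = delta_i a_{i-1}^T
        grads_b[i] = delta
        if i > 0:  # step 4, then repeat one layer down
            delta = (weights[i].T @ delta) * activations[i] * (1 - activations[i])
    return grads_W, grads_b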

3. Activation Functions

ReLU:     Οƒ(z) = max(0, z)
          Οƒ'(z) = 1 if z > 0 else 0

Sigmoid:  Οƒ(z) = 1/(1 + e⁻ᢻ)
          Οƒ'(z) = Οƒ(z)(1 - Οƒ(z))

Tanh:     Οƒ(z) = (eαΆ» - e⁻ᢻ)/(eαΆ» + e⁻ᢻ)
          Οƒ'(z) = 1 - Οƒ(z)Β²
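
These translate directly to NumPy (a sketch; the ReLU derivative at exactly z = 0 is conventionally taken as 0 here):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)  # 1 if z > 0 else 0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # sigma'(z) = sigma(z)(1 - sigma(z))

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2  # sigma'(z) = 1 - tanh(z)^2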

File Structure

02_MLP/
β”œβ”€β”€ README.md
β”œβ”€β”€ numpy/
β”‚   β”œβ”€β”€ mlp_numpy.py          # Full MLP implementation
β”‚   β”œβ”€β”€ activations_numpy.py   # Activation functions
β”‚   └── test_mlp.py           # Tests
β”œβ”€β”€ pytorch_lowlevel/
β”‚   └── mlp_lowlevel.py       # Implementation without nn.Linear
β”œβ”€β”€ paper/
β”‚   └── mlp_paper.py          # Clean nn.Module
└── exercises/
    β”œβ”€β”€ 01_add_dropout.md
    β”œβ”€β”€ 02_batch_norm.md
    └── 03_xor_problem.md

Key Concepts

1. Vanishing/Exploding Gradients

Problem: as the network gets deeper, gradients shrink toward zero or blow up
- Sigmoid: Οƒ'(z) ≀ 0.25, so the product across layers decays toward 0
- Fixes: ReLU, proper initialization, BatchNorm, residual connections (ResNet)

Example:
10 layers with Sigmoid β†’ gradient β‰ˆ 0.25^10 β‰ˆ 10^-6
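
You can check this estimate directly; the snippet below just multiplies sigmoid's maximum slope across ten layers:

grad = 1.0
for _ in range(10):
    grad *= 0.25          # sigmoid's maximum slope, attained at z = 0
print(grad)               # 0.25**10 β‰ˆ 9.5e-07, on the order of 10^-6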

2. Xavier/He μ΄ˆκΈ°ν™”

import numpy as np

# Xavier (Glorot): for tanh / sigmoid
W = np.random.randn(in_dim, out_dim) * np.sqrt(1 / in_dim)
# or
W = np.random.randn(in_dim, out_dim) * np.sqrt(2 / (in_dim + out_dim))

# He (Kaiming): for ReLU
W = np.random.randn(in_dim, out_dim) * np.sqrt(2 / in_dim)
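
A small experiment (an illustrative sketch; the width and depth are arbitrary choices) shows why the scale matters: with He initialization the activation scale stays roughly constant across ReLU layers, while unit-variance weights make it explode.

import numpy as np

rng = np.random.default_rng(0)
dim, depth = 512, 20
for name, scale in [("he", np.sqrt(2.0 / dim)), ("unit", 1.0)]:
    a = rng.standard_normal(dim)
    for _ in range(depth):
        W = rng.standard_normal((dim, dim)) * scale
        a = np.maximum(0.0, W @ a)    # ReLU layer
    print(name, a.std())              # "he" stays O(1); "unit" blows up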

3. Universal Approximation Theorem

ν•˜λ‚˜μ˜ hidden layerλ₯Ό κ°€μ§„ feedforward λ„€νŠΈμ›Œν¬λŠ” μΆ©λΆ„ν•œ λ‰΄λŸ°μ΄ μžˆλ‹€λ©΄ μž„μ˜μ˜ 연속 ν•¨μˆ˜λ₯Ό 근사할 수 μžˆλ‹€.


Exercises

Basic

  1. Solve the XOR problem (2-layer MLP)
  2. Compare different activation functions
  3. Compare learning curves across initialization schemes

Intermediate

  1. Implement Dropout
  2. Implement Batch Normalization
  3. Implement a learning-rate scheduler

Advanced

  1. MNIST classification (β‰₯ 98% accuracy)
  2. Implement gradient clipping
  3. Implement weight decay (L2 regularization)

References

  • Rumelhart et al. (1986). "Learning representations by back-propagating errors"
  • Glorot & Bengio (2010). "Understanding the difficulty of training deep feedforward neural networks"
  • He et al. (2015). "Delving Deep into Rectifiers" (He initialization)