05. ResNet

Overview

ResNet (Residual Network) is the landmark model that won first place at ILSVRC 2015. The skip connection (residual connection) proposed by Kaiming He et al. made it possible to train networks with hundreds of layers or more.

"κΉŠμ΄κ°€ κΉŠμ–΄μ§ˆμˆ˜λ‘ μ„±λŠ₯이 λ–¨μ–΄μ§€λŠ” degradation 문제λ₯Ό ν•΄κ²°"


Mathematical Background

1. Degradation Problem

Problem: making a network deeper can actually degrade performance

Observations:
- A 56-layer network performs worse than a 20-layer network (CIFAR-10)
- This is not overfitting (the training error is higher as well)
- It is an optimization difficulty (vanishing/exploding gradients)

The ideal situation:
- A deeper network should do at least as well as a shallower one
- At a minimum, it must be able to learn the identity mapping

2. Residual Learning

Conventional approach:
  H(x) = desired output
  The network learns H(x) directly

Residual approach:
  F(x) = H(x) - x  (the residual)
  H(x) = F(x) + x  (the original target)

Why is this easier? (a code sketch follows this list)
- Identity mapping: the block only has to drive F(x) to 0
- Learning a small change is easier than learning a large one
- Gradient flow: the addition passes gradients through directly
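
A minimal sketch of a residual block in PyTorch (ResidualUnit is an
illustrative name, not something from this repo): the convolutions learn
F(x), and the final addition reconstructs H(x) = F(x) + x.

import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    # Assumes the block preserves shape, so the shortcut is a pure identity.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))  # first half of F(x)
        out = self.bn2(self.conv2(out))            # F(x)
        return torch.relu(out + x)                 # H(x) = F(x) + x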

3. Gradients Through the Skip Connection

Forward:
  y = F(x) + x

Backward:
  βˆ‚L/βˆ‚x = βˆ‚L/βˆ‚y Γ— (βˆ‚F/βˆ‚x + 1)
                              ↑
               the constant 1 term never vanishes

Result:
- The identity term gives the gradient a direct path back to x
- Gradients remain usable even across hundreds of layers
- Vanishing gradients are mitigated (see the quick check below)
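
A quick autograd check of this claim (a sketch; input_grad_norm is a
hypothetical helper, and the fuller analysis belongs in
analysis/gradient_flow.py): with the skip, the input gradient of a
100-block stack stays healthy; without it, the gradient collapses.

import torch
import torch.nn as nn

def input_grad_norm(depth, use_skip):
    # Small linear blocks with deliberately small weights, so the
    # plain (no-skip) stack visibly attenuates the gradient.
    layers = nn.ModuleList(nn.Linear(16, 16) for _ in range(depth))
    for layer in layers:
        nn.init.normal_(layer.weight, std=0.05)
    x = torch.randn(4, 16, requires_grad=True)
    h = x
    for layer in layers:
        h = h + torch.relu(layer(h)) if use_skip else torch.relu(layer(h))
    h.sum().backward()
    return x.grad.norm().item()

print("with skip:   ", input_grad_norm(100, True))   # stays well above 0
print("without skip:", input_grad_norm(100, False))  # collapses toward 0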

4. Matching Dimensions (Projection Shortcut)

When dimensions differ (stride=2 or a change in channels):

Option A: Zero Padding
  x_padded = pad(x, extra_channels)

Option B: 1Γ—1 Convolution (adopted in the paper)
  shortcut = Conv1Γ—1(x)

  x: (N, 64, 56, 56)
  ↓ stride=2, channels 64β†’128
  y: (N, 128, 28, 28)

  shortcut = Conv1Γ—1(64β†’128, stride=2)
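
A sketch of Option B in PyTorch (variable names are illustrative): the 1Γ—1
convolution matches both the channel count and the stride of the main path,
and torchvision-style implementations follow it with a BatchNorm.

import torch
import torch.nn as nn

shortcut = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False),  # 64->128, /2
    nn.BatchNorm2d(128),
)

x = torch.randn(1, 64, 56, 56)
print(shortcut(x).shape)  # torch.Size([1, 128, 28, 28])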

ResNet μ•„ν‚€ν…μ²˜

BasicBlock vs Bottleneck

BasicBlock (ResNet-18, 34):
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Conv 3Γ—3, BN, ReLU     β”‚
β”‚  Conv 3Γ—3, BN           β”‚
β”‚         ↓               β”‚
β”‚    + ← shortcut         β”‚
β”‚       ReLU              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Bottleneck (ResNet-50, 101, 152):
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Conv 1Γ—1, BN, ReLU     β”‚  ← reduce channels
β”‚  Conv 3Γ—3, BN, ReLU     β”‚  ← main computation
β”‚  Conv 1Γ—1, BN           β”‚  ← restore channels
β”‚         ↓               β”‚
β”‚    + ← shortcut         β”‚
β”‚       ReLU              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Bottleneck advantages (sketch below):
- Channels are reduced before the 3Γ—3 convolution β†’ less computation
- More layers for the same compute budget
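
A sketch of the bottleneck written from the diagram above (layer names like
reduce/expand are illustrative, not from paper/resnet_paper.py). The saving
is easy to check by hand: with a 256-d input and mid=64, the three convs cost
about 256Β·64 + 9Β·64Β·64 + 64Β·256 β‰ˆ 70K multiply-adds per pixel, versus
2Β·9Β·256Β·256 β‰ˆ 1.2M for two 3Γ—3 convs at 256 channels.

import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    expansion = 4  # output channels = mid_ch * 4, as in the paper

    def __init__(self, in_ch, mid_ch, stride=1):
        super().__init__()
        out_ch = mid_ch * self.expansion
        self.reduce = nn.Sequential(   # 1x1: shrink channels
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.conv3x3 = nn.Sequential(  # 3x3: the main computation
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.expand = nn.Sequential(   # 1x1: restore channels
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch))
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:  # projection when shapes differ
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        out = self.expand(self.conv3x3(self.reduce(x)))
        return torch.relu(out + self.shortcut(x))

blk = Bottleneck(in_ch=64, mid_ch=64)  # first block of layer1: 64 -> 256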

Comparison of ResNet Variants

λͺ¨λΈ λ ˆμ΄μ–΄ 블둝 블둝 수 Params
ResNet-18 18 Basic [2,2,2,2] 11.7M
ResNet-34 34 Basic [3,4,6,3] 21.8M
ResNet-50 50 Bottleneck [3,4,6,3] 25.6M
ResNet-101 101 Bottleneck [3,4,23,3] 44.5M
ResNet-152 152 Bottleneck [3,8,36,3] 60.2M
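
These counts are easy to verify against torchvision's reference models
(assumes torchvision is installed; no weights are downloaded):

import torchvision.models as models

for name in ["resnet18", "resnet34", "resnet50", "resnet101", "resnet152"]:
    model = getattr(models, name)()  # randomly initialized
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name:11s} {n_params / 1e6:.1f}M")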

ResNet-50 in Detail

Input: 224Γ—224Γ—3

Conv1: 7Γ—7, 64, stride=2, padding=3
  β†’ (112Γ—112Γ—64)
MaxPool: 3Γ—3, stride=2, padding=1
  β†’ (56Γ—56Γ—64)

Layer1: Bottleneck Γ— 3 (64β†’256)
  β†’ (56Γ—56Γ—256)

Layer2: Bottleneck Γ— 4 (128β†’512, stride=2)
  β†’ (28Γ—28Γ—512)

Layer3: Bottleneck Γ— 6 (256β†’1024, stride=2)
  β†’ (14Γ—14Γ—1024)

Layer4: Bottleneck Γ— 3 (512β†’2048, stride=2)
  β†’ (7Γ—7Γ—2048)

AdaptiveAvgPool: β†’ (1Γ—1Γ—2048)
FC: 2048 β†’ 1000
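
The stage-by-stage shapes can be reproduced by pushing a dummy input through
torchvision's resnet50 (conv1, layer1..layer4, and avgpool are torchvision's
actual attribute names):

import torch
import torchvision.models as models

model = models.resnet50().eval()
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    x = model.maxpool(model.relu(model.bn1(model.conv1(x))))
    print("stem:  ", x.shape)  # (1, 64, 56, 56)
    for i, stage in enumerate([model.layer1, model.layer2,
                               model.layer3, model.layer4], start=1):
        x = stage(x)
        print(f"layer{i}:", x.shape)  # 256 / 512 / 1024 / 2048 channels
    x = model.avgpool(x)
    print("pool:  ", x.shape)  # (1, 2048, 1, 1)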

File Structure

05_ResNet/
β”œβ”€β”€ README.md                      # this file
β”œβ”€β”€ pytorch_lowlevel/
β”‚   └── resnet_lowlevel.py        # F.conv2d, manual BN
β”œβ”€β”€ paper/
β”‚   └── resnet_paper.py           # faithful reproduction of the paper
β”œβ”€β”€ analysis/
β”‚   └── gradient_flow.py          # analyzing the skip connection's effect
└── exercises/
    β”œβ”€β”€ 01_gradient_analysis.md   # comparing gradient flow
    └── 02_ablation_study.md      # comparing shortcut types

Key Concepts

1. Why Identity Mapping Matters

# Pre-activation ResNet (v2): BN and ReLU come before each convolution.
# Both snippets are the forward methods of a residual block; conv1/conv2,
# bn1/bn2, and shortcut are assumed to be defined in __init__ (and the
# pre-activation block is assumed to preserve the input shape).
def forward(self, x):
    identity = x

    out = self.bn1(x)
    out = F.relu(out)
    out = self.conv1(out)

    out = self.bn2(out)
    out = F.relu(out)
    out = self.conv2(out)

    return out + identity  # clean identity path: nothing applied after the add

# Post-activation (original v1): ReLU comes after the addition.
def forward(self, x):
    identity = self.shortcut(x)

    out = self.conv1(x)
    out = self.bn1(out)
    out = F.relu(out)

    out = self.conv2(out)
    out = self.bn2(out)

    out = F.relu(out + identity)  # the ReLU distorts the identity path
    return out

2. The Ensemble View of ResNet

ResNet can be viewed as an ensemble of paths of varying depths.

n blocks β†’ 2^n possible paths
- paths that "skip" some of the blocks
- the path that passes through every block

Experiment: removing individual blocks after training barely hurts accuracy
β†’ paths of many different depths are trained together (sketch below)
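
A sketch of that lesion experiment with torchvision (downloads pretrained
weights; the before/after accuracy comparison is left to your own validation
loop):

import torch.nn as nn
import torchvision.models as models

model = models.resnet50(weights="IMAGENET1K_V1")

# Within a stage, every block after the first maps C -> C channels
# (here 1024 -> 1024), so swapping one for Identity keeps shapes valid.
model.layer3[3] = nn.Identity()

# Re-run validation: the ensemble view predicts only a modest accuracy
# drop from deleting a single block, unlike in a plain deep network.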

3. The Role of Batch Normalization

Why BN matters in ResNet (a minimal BN sketch follows this list):

1. Reduces internal covariate shift
   - stabilizes the distribution of each layer's inputs

2. Allows larger learning rates
   - faster convergence

3. Acts as a regularizer
   - mini-batch statistics inject noise

4. Improves gradient flow
   - normalization keeps gradients well-scaled
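
For reference, a minimal training-mode sketch of the BN computation itself
(running statistics are elided; the manual version in pytorch_lowlevel/ may
differ in detail):

import torch

def manual_batchnorm2d(x, gamma, beta, eps=1e-5):
    # Normalize each channel over the mini-batch (N, H, W) statistics,
    # then apply the learned scale (gamma) and shift (beta).
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)

x = torch.randn(8, 64, 56, 56)
out = manual_batchnorm2d(x, torch.ones(64), torch.zeros(64))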

4. Developments After ResNet

ResNeXt (2017):
- Introduces cardinality via grouped convolutions
- ResNeXt-50 reaches ResNet-101-level accuracy with fewer parameters

DenseNet (2017):
- Connects every layer to all subsequent layers
- Maximizes feature reuse

EfficientNet (2019):
- Scales width, depth, and resolution jointly
- Compound scaling

RegNet (2020):
- Searches for good network design spaces
- Simple, regular structures

Implementation Levels

Level 2: PyTorch Low-Level (pytorch_lowlevel/)

  • F.conv2d, μˆ˜λ™ BatchNorm
  • BasicBlock, Bottleneck μˆ˜λ™ κ΅¬ν˜„
  • Shortcut projection κ΅¬ν˜„
  • νŒŒλΌλ―Έν„° μˆ˜λ™ 관리

Level 3: Paper Implementation (paper/)

  • ResNet-18/34/50/101/152 전체
  • Pre-activation ResNet (v2)
  • Zero-padding vs Projection shortcut 비ꡐ

Level 4: Code Analysis (analysis/)

  • torchvision ResNet μ½”λ“œ 뢄석
  • Gradient flow μ‹œκ°ν™”
  • 쀑간 블둝 제거 μ‹€ν—˜

Study Checklist

  • [ ] Degradation problem 이해
  • [ ] Residual learning μˆ˜μ‹ μœ λ„
  • [ ] Skip connection의 gradient 이점
  • [ ] BasicBlock vs Bottleneck 차이
  • [ ] ResNet-50 μ•„ν‚€ν…μ²˜ μ•”κΈ°
  • [ ] Projection shortcut κ΅¬ν˜„ 방법
  • [ ] Pre/Post-activation 차이
  • [ ] ResNet의 앙상블 관점 이해

References

- He et al., "Deep Residual Learning for Image Recognition," CVPR 2016 (arXiv:1512.03385)
- He et al., "Identity Mappings in Deep Residual Networks," ECCV 2016 (arXiv:1603.05027)