# 12. ResNet
Previous: VGG | Next: RNN Basics
## Overview

ResNet (Residual Network) won 1st place in ILSVRC 2015 and marked a turning point in deep learning. Kaiming He et al. proposed skip connections (residual connections) that make it possible to train networks with hundreds of layers.

> "Solving the degradation problem where performance decreases as depth increases"
## Mathematical Background

### 1. Degradation Problem

Problem: performance degrades as the network gets deeper.

Observation:
- A plain 56-layer network performs worse than a 20-layer one (CIFAR-10)
- This is not overfitting (the training error is higher as well)
- It is an optimization difficulty (vanishing/exploding gradients)

Ideal situation:
- A deeper network should do at least as well as a shallower one
- The extra layers should at least be able to learn an identity mapping
### 2. Residual Learning

Traditional approach:

```
H(x) = desired output      → the network learns H(x) directly
```

Residual approach:

```
F(x) = H(x) - x    (residual)
H(x) = F(x) + x    (original goal)
```
Why is this easier? (see the minimal sketch below)
- Identity mapping: the block only needs to learn F(x) = 0
- Small changes: learning a small residual is easier than learning the full mapping
- Gradient flow: the addition passes gradients through directly
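A minimal residual block sketch in PyTorch (the layer sizes and structure here are illustrative, not the paper's exact configuration): the learned part computes F(x), and the output simply adds the input back.

```python
import torch
import torch.nn as nn

class MinimalResidualBlock(nn.Module):
    """Illustrative residual block: H(x) = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        # F(x): two 3x3 convolutions that preserve spatial size and channel count
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(self.f(x) + x)  # H(x) = F(x) + x

x = torch.randn(1, 64, 56, 56)
print(MinimalResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```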
### 3. Skip Connection Gradient

Forward:

```
y = F(x) + x
```

Backward:

```
∂L/∂x = ∂L/∂y × (∂F/∂x + 1)
```

The `+1` comes from the identity path, so the gradient always has a direct route back to x, no matter how small ∂F/∂x becomes.
Result (see the sketch below):
- The gradient always propagates through the identity path
- The gradient signal is preserved even across hundreds of layers
- This mitigates the vanishing gradient problem
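A small illustrative experiment (not the repo's analysis/gradient_flow.py; the layer counts and sizes are arbitrary) comparing the gradient magnitude at the input of a deep stack of blocks with and without skip connections:

```python
import torch
import torch.nn as nn

def make_block(dim):
    return nn.Sequential(nn.Linear(dim, dim), nn.Tanh())

def input_grad_norm(num_blocks, dim=64, use_skip=True):
    """Backpropagate through a stack of blocks and measure the gradient at the input."""
    torch.manual_seed(0)
    blocks = nn.ModuleList([make_block(dim) for _ in range(num_blocks)])
    x = torch.randn(8, dim, requires_grad=True)
    h = x
    for block in blocks:
        h = block(h) + h if use_skip else block(h)
    h.sum().backward()
    return x.grad.norm().item()

# With skip connections the input gradient stays well away from zero;
# without them it typically shrinks rapidly as depth grows.
for depth in (10, 50, 100):
    print(depth, input_grad_norm(depth, use_skip=True), input_grad_norm(depth, use_skip=False))
```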
### 4. Dimension Matching (Projection Shortcut)

When dimensions differ (stride=2 or a channel change), the shortcut must be adjusted:

Option A: Zero padding

```
x_padded = pad(x, extra_channels)
```

Option B: 1×1 convolution (the paper's choice)

```
x: (N, 64, 56, 56)
  ↓ stride=2, channels 64→128
y: (N, 128, 28, 28)

shortcut = Conv1×1(64→128, stride=2)
```
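A minimal sketch of option B (the variable names are illustrative): a 1×1 convolution with stride 2 that matches both the channel count and the spatial size of the main path.

```python
import torch
import torch.nn as nn

# Projection shortcut: 1x1 conv with stride 2, taking (N, 64, 56, 56) to (N, 128, 28, 28)
downsample = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm2d(128),
)

x = torch.randn(2, 64, 56, 56)
print(downsample(x).shape)  # torch.Size([2, 128, 28, 28])
```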
## ResNet Architecture

### BasicBlock vs Bottleneck

BasicBlock (ResNet-18, 34):

```
┌─────────────────────────┐
│  Conv 3×3, BN, ReLU     │
│  Conv 3×3, BN           │
│          ↓              │
│   +  ← shortcut         │
│  ReLU                   │
└─────────────────────────┘
```

Bottleneck (ResNet-50, 101, 152):

```
┌─────────────────────────┐
│  Conv 1×1, BN, ReLU     │ ← channel reduction
│  Conv 3×3, BN, ReLU     │ ← main operation
│  Conv 1×1, BN           │ ← channel expansion
│          ↓              │
│   +  ← shortcut         │
│  ReLU                   │
└─────────────────────────┘
```
Bottleneck advantages (see the sketch below):
- Channels are reduced before the 3×3 convolution → less computation
- More layers for the same compute budget
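A sketch of a Bottleneck block following the common torchvision-style conventions (the `downsample` argument and the expansion factor of 4 match the usual implementation; treat the details as illustrative):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    expansion = 4  # output channels = planes * 4

    def __init__(self, in_channels, planes, stride=1, downsample=None):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, planes, kernel_size=1, bias=False)  # channel reduction
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride,
                               padding=1, bias=False)                            # main 3x3 operation
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, planes * self.expansion, kernel_size=1,
                               bias=False)                                       # channel expansion
        self.bn3 = nn.BatchNorm2d(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample  # projection shortcut when shapes differ

    def forward(self, x):
        identity = self.downsample(x) if self.downsample is not None else x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + identity)
```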
### ResNet Variant Comparison

| Model | Layers | Block | Blocks per Stage | Params |
|---|---|---|---|---|
| ResNet-18 | 18 | Basic | [2,2,2,2] | 11.7M |
| ResNet-34 | 34 | Basic | [3,4,6,3] | 21.8M |
| ResNet-50 | 50 | Bottleneck | [3,4,6,3] | 25.6M |
| ResNet-101 | 101 | Bottleneck | [3,4,23,3] | 44.5M |
| ResNet-152 | 152 | Bottleneck | [3,8,36,3] | 60.2M |
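The per-stage block counts in the table map directly onto the reference torchvision constructor, which can also be used to sanity-check the parameter counts (this uses torchvision's own ResNet implementation, not the code in this repo):

```python
from torchvision.models.resnet import ResNet, BasicBlock, Bottleneck

# Build variants from (block type, blocks per stage)
variants = {
    "ResNet-18": ResNet(BasicBlock, [2, 2, 2, 2]),
    "ResNet-34": ResNet(BasicBlock, [3, 4, 6, 3]),
    "ResNet-50": ResNet(Bottleneck, [3, 4, 6, 3]),
}

for name, model in variants.items():
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")  # ≈ 11.7M, 21.8M, 25.6M
```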
### ResNet-50 Detailed Structure

```
Input: 224×224×3

Conv1:   7×7, 64, stride=2, padding=3   → 112×112×64
MaxPool: 3×3, stride=2, padding=1       → 56×56×64

Layer1: Bottleneck × 3  (64→256)             → 56×56×256
Layer2: Bottleneck × 4  (128→512, stride=2)  → 28×28×512
Layer3: Bottleneck × 6  (256→1024, stride=2) → 14×14×1024
Layer4: Bottleneck × 3  (512→2048, stride=2) → 7×7×2048

AdaptiveAvgPool: → 1×1×2048
FC: 2048 → 1000
```
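The stage shapes above can be checked against torchvision's reference ResNet-50 with forward hooks (an illustrative verification, assuming torchvision ≥ 0.13 for the `weights=None` argument):

```python
import torch
from torchvision.models import resnet50

model = resnet50(weights=None).eval()
shapes = {}

# Record the output shape of each stage
for name in ["maxpool", "layer1", "layer2", "layer3", "layer4", "avgpool"]:
    getattr(model, name).register_forward_hook(
        lambda mod, inp, out, name=name: shapes.setdefault(name, tuple(out.shape))
    )

with torch.no_grad():
    out = model(torch.randn(1, 3, 224, 224))

print(shapes)     # layer1: (1, 256, 56, 56), ..., layer4: (1, 2048, 7, 7)
print(out.shape)  # torch.Size([1, 1000])
```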
## File Structure

```
05_ResNet/
├── README.md                      # This file
├── pytorch_lowlevel/
│   └── resnet_lowlevel.py         # F.conv2d, manual BN
├── paper/
│   └── resnet_paper.py            # Exact paper reproduction
├── analysis/
│   └── gradient_flow.py           # Skip connection effect analysis
└── exercises/
    ├── 01_gradient_analysis.md    # Compare gradient flow
    └── 02_ablation_study.md       # Compare shortcut types
```
## Core Concepts

### 1. Why Identity Mapping Matters

```python
# Pre-activation ResNet (v2): BN → ReLU → Conv, the identity path stays untouched
def forward(self, x):
    identity = x
    out = self.bn1(x)
    out = F.relu(out)
    out = self.conv1(out)
    out = self.bn2(out)
    out = F.relu(out)
    out = self.conv2(out)
    return out + identity  # clean identity path

# Post-activation (original): Conv → BN → ReLU, the final ReLU acts on the sum
def forward(self, x):
    identity = self.shortcut(x)
    out = self.conv1(x)
    out = self.bn1(out)
    out = F.relu(out)
    out = self.conv2(out)
    out = self.bn2(out)
    out = F.relu(out + identity)  # ReLU transforms the identity path
    return out
```
### 2. ResNet as an Ensemble Perspective

ResNet can be viewed as an ensemble of paths of varying depth: with n blocks there are 2^n possible paths, since each block can either be skipped (identity path) or traversed (residual path).

Experiment: removing individual blocks from a trained ResNet barely degrades performance
→ paths of many different depths are learned together (see the toy illustration below)
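A toy illustration of why this is possible (the dimensions and block count are arbitrary): because each block computes F(x) + x, skipping a block at inference time still yields a valid forward pass, which is what allows the network to behave like an ensemble of shallower paths.

```python
import torch
import torch.nn as nn

class ToyResBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.f(x)

blocks = nn.ModuleList([ToyResBlock(32) for _ in range(4)])

def run(x, skip_indices=()):
    # Skipping block i means taking only its identity path
    for i, block in enumerate(blocks):
        if i not in skip_indices:
            x = block(x)
    return x

x = torch.randn(1, 32)
full = run(x)           # path through all 4 blocks
lesioned = run(x, {2})  # block 2 removed: still a valid forward pass
print((full - lesioned).abs().mean())
```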
### 3. Role of Batch Normalization

Why BN is important in ResNet:

1. Reduces internal covariate shift
   - Stabilizes the input distribution of each layer
2. Enables higher learning rates
   - Faster convergence
3. Regularization effect
   - Noise from mini-batch statistics
4. Improves gradient flow
   - Normalization stabilizes gradient magnitudes
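Since pytorch_lowlevel/resnet_lowlevel.py implements BatchNorm manually, here is a minimal sketch of the training-mode forward pass of 2D batch normalization (running statistics for inference are omitted; the function name is illustrative):

```python
import torch

def batch_norm_2d(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm: normalize each channel over (N, H, W) of an NCHW tensor."""
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)

x = torch.randn(8, 64, 56, 56)
gamma, beta = torch.ones(64), torch.zeros(64)
out = batch_norm_2d(x, gamma, beta)
print(out.mean().item(), out.std().item())  # ≈ 0 and ≈ 1
```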
### 4. Developments After ResNet
ResNeXt (2017):
- Introduces cardinality with grouped convolution
- ResNeXt-50: ResNet-101 performance, fewer parameters
DenseNet (2017):
- Connects all layers to all subsequent layers
- Maximizes feature reuse
EfficientNet (2019):
- Simultaneous scaling of width, depth, resolution
- Compound scaling
RegNet (2020):
- Optimal network structure search
- Simple and regular design
## Implementation Levels

### Level 2: PyTorch Low-Level (pytorch_lowlevel/)
- F.conv2d, manual BatchNorm
- Manual implementation of BasicBlock, Bottleneck
- Implement shortcut projection
- Manual parameter management
### Level 3: Paper Implementation (paper/)
- Complete ResNet-18/34/50/101/152
- Pre-activation ResNet (v2)
- Compare Zero-padding vs Projection shortcut
### Level 4: Code Analysis (analysis/)
- Analyze torchvision ResNet code
- Visualize gradient flow
- Experiment with removing intermediate blocks
## Learning Checklist

- [ ] Understand the degradation problem
- [ ] Derive the residual learning formulation
- [ ] Explain the gradient benefits of skip connections
- [ ] Explain the difference between BasicBlock and Bottleneck
- [ ] Memorize the ResNet-50 architecture
- [ ] Implement a projection shortcut
- [ ] Explain the difference between pre- and post-activation
- [ ] Understand the ensemble perspective on ResNet
## References
- He et al. (2015). "Deep Residual Learning for Image Recognition"
- He et al. (2016). "Identity Mappings in Deep Residual Networks" (v2)
- torchvision ResNet
- d2l.ai: ResNet
- ../04_VGG/README.md