11. VGG
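Overview
VGGNet was proposed by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group at Oxford. It placed 2nd in the classification task of ILSVRC 2014 (and 1st in localization). The paper, "Very Deep Convolutional Networks for Large-Scale Image Recognition", showed that increasing depth by stacking small 3x3 filters substantially improves accuracy.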
Mathematical Background
1. Effect of 3x3 Filter Stacking
Why stack multiple 3x3 filters?
Two 3x3 convs ≈ One 5x5 conv (same receptive field)
Three 3x3 convs ≈ One 7x7 conv
Advantages (C = number of input and output channels, biases ignored):
1. Reduced parameters:
- 7x7: 49C² parameters
- 3x3 × 3: 27C² parameters (45% reduction)
2. Increased non-linearity:
- 7x7: 1 ReLU
- 3x3 × 3: 3 ReLUs → can learn more complex functions
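The parameter comparison above is simple arithmetic, so it can be checked directly. The following is a minimal sketch; the helper name `conv_stack_params` and the channel width `C = 256` are illustrative choices, not part of the original material.

```python
def conv_stack_params(kernel_size: int, num_layers: int, channels: int) -> int:
    """Weights (biases ignored) in a stack of square convs with C in/out channels."""
    return num_layers * kernel_size * kernel_size * channels * channels


C = 256  # illustrative channel width; the ratio does not depend on C

single_7x7 = conv_stack_params(kernel_size=7, num_layers=1, channels=C)  # 49 * C^2
three_3x3 = conv_stack_params(kernel_size=3, num_layers=3, channels=C)   # 27 * C^2
reduction = 1 - three_3x3 / single_7x7

print(f"one 7x7:   {single_7x7:,} weights")
print(f"three 3x3: {three_3x3:,} weights")
print(f"reduction: {reduction:.1%}")  # 44.9%, rounded to 45% in the text
```

The same function gives 18C² versus 25C² for two 3x3 convs against one 5x5 conv, a 28% reduction.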
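2. Receptive Field Calculation
The receptive field (RF) grows as layers stack. Going from input to output:
RF = RF_prev + (kernel_size - 1) × jump_prev
jump = jump_prev × stride
where jump is the product of the strides of all earlier layers (RF = 1 and jump = 1 at the input).
Example (stride=1, kernel=3, so jump stays 1):
- Layer 1: RF = 3
- Layer 2: RF = 5
- Layer 3: RF = 7
- Layer 4: RF = 9
...
After MaxPool (kernel=2, stride=2):
- RF grows by one jump and jump doubles, so each later 3x3 conv adds twice as much to the RF as before the pool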
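To see how pooling affects the receptive field across a whole network, the calculation can be run layer by layer. This sketch uses the common forward recurrence (each layer adds `(kernel_size - 1)` times the accumulated stride), which reproduces the stride-1 example above (3, 5, 7, 9). The function name and the `(kernel_size, stride)` layer encoding are assumptions made for illustration.

```python
def receptive_fields(layers):
    """Return the receptive field after each layer.

    layers: iterable of (kernel_size, stride) pairs, ordered input to output.
    """
    rf, jump = 1, 1  # at the input, one pixel sees itself and neighbors are 1 apart
    result = []
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # growth is scaled by the accumulated stride
        jump *= stride                  # pooling doubles the step between outputs
        result.append(rf)
    return result


CONV, POOL = (3, 1), (2, 2)

# Four plain 3x3 convs, as in the example above
print(receptive_fields([CONV] * 4))  # [3, 5, 7, 9]

# VGG16 feature extractor: blocks of 2, 2, 3, 3, 3 convs, each followed by a pool
vgg16_layers = ([CONV] * 2 + [POOL]) * 2 + ([CONV] * 3 + [POOL]) * 3
rfs = receptive_fields(vgg16_layers)
print(rfs[:4])   # [3, 5, 6, 10]: the pool adds 1, the next conv adds 4
print(rfs[-1])   # 212 input pixels after the last pool
```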
3. Feature Map Size Changes
Conv (stride=1, padding=1, kernel=3):
H_out = H_in (maintains size)
MaxPool (kernel=2, stride=2):
H_out = H_in / 2 (halves size)
224 → [Conv×2] → 224 → Pool → 112 → [Conv×2] → 112 → Pool → 56 → ...
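Both size rules follow from the standard output-size formula `(H + 2*padding - kernel) // stride + 1`. Below is a small sketch that traces the spatial size through the five blocks; the function names and the default block layout (2, 2, 3, 3, 3 convs, as in VGG16) are chosen here for illustration.

```python
def out_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a conv or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def vgg_spatial_sizes(input_size: int = 224, convs_per_block=(2, 2, 3, 3, 3)):
    """Spatial size at the input and after each block's max pool."""
    sizes = [input_size]
    size = input_size
    for n_convs in convs_per_block:
        for _ in range(n_convs):
            size = out_size(size, kernel=3, stride=1, padding=1)  # unchanged
        size = out_size(size, kernel=2, stride=2)                 # halved
        sizes.append(size)
    return sizes


print(vgg_spatial_sizes())  # [224, 112, 56, 28, 14, 7]
```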
VGG Architecture
VGG Variant Comparison
| Configuration | VGG11 | VGG13 | VGG16 | VGG19 |
|---|---|---|---|---|
| Conv Layers | 8 | 10 | 13 | 16 |
| FC Layers | 3 | 3 | 3 | 3 |
| Total Layers | 11 | 13 | 16 | 19 |
| Parameters | 133M | 133M | 138M | 144M |
VGG16 Detailed Structure
Input: 224×224×3 RGB image
Block 1: [Conv3-64] × 2 + MaxPool
(224×224×3) → (224×224×64) → (112×112×64)
Block 2: [Conv3-128] × 2 + MaxPool
(112×112×64) → (112×112×128) → (56×56×128)
Block 3: [Conv3-256] × 3 + MaxPool
(56×56×128) → (56×56×256) → (28×28×256)
Block 4: [Conv3-512] × 3 + MaxPool
(28×28×256) → (28×28×512) → (14×14×512)
Block 5: [Conv3-512] × 3 + MaxPool
(14×14×512) → (14×14×512) → (7×7×512)
Classifier:
Flatten: 7×7×512 = 25,088
FC1: 25088 → 4096 + ReLU + Dropout
FC2: 4096 → 4096 + ReLU + Dropout
FC3: 4096 → 1000 (classes)
Parameter distribution:
- Conv layers: ~15M (11%)
- FC layers: ~124M (89%) ← the majority, mostly in FC1 (25088 × 4096 ≈ 103M)
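The conv/FC split can be reproduced by counting weights and biases layer by layer. This is a minimal sketch assuming every conv is 3×3 with a bias term and every FC layer has a bias; the constant and function names are illustrative.

```python
# Output channels of the 13 conv layers in VGG16, in order
VGG16_CONV_CHANNELS = [64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512]
# Classifier widths: flattened features -> FC1 -> FC2 -> FC3
FC_SIZES = [7 * 7 * 512, 4096, 4096, 1000]


def conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights plus biases of one k x k convolution."""
    return k * k * c_in * c_out + c_out


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


conv_total, c_in = 0, 3  # RGB input
for c_out in VGG16_CONV_CHANNELS:
    conv_total += conv_params(c_in, c_out)
    c_in = c_out

fc_total = sum(linear_params(a, b) for a, b in zip(FC_SIZES, FC_SIZES[1:]))
total = conv_total + fc_total

print(f"conv:  {conv_total:,} ({conv_total / total:.1%})")  # 14,714,688
print(f"fc:    {fc_total:,} ({fc_total / total:.1%})")      # 123,642,856
print(f"total: {total:,}")                                  # 138,357,544
```

FC1 alone accounts for 102,764,544 of these parameters, which is why later architectures replaced the large flatten-plus-FC head with global pooling.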
VGG Configuration
cfg = {
'VGG11': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
'VGG13': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
'VGG16': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
'VGG19': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}
# Each integer is the output channel count of a 3x3 conv (followed by ReLU); 'M' = 2x2 MaxPool with stride 2
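One way to read this configuration is to expand it into explicit layer records. The sketch below does that in plain Python (no framework needed) and checks the conv-layer counts against the variant comparison table; the `expand` helper and its tuple layout are hypothetical, introduced only for this illustration.

```python
cfg = {
    'VGG11': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'VGG13': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'VGG16': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
    'VGG19': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}


def expand(config, in_channels=3, size=224):
    """Turn a channel list into (kind, channels, spatial_size) tuples."""
    layers = []
    channels = in_channels
    for item in config:
        if item == 'M':
            size //= 2  # 2x2 max pool with stride 2 halves the spatial size
            layers.append(('pool', channels, size))
        else:
            channels = item  # 3x3 conv with padding 1 keeps the spatial size
            layers.append(('conv3', channels, size))
    return layers


conv_counts = {}
for name, config in cfg.items():
    layers = expand(config)
    conv_counts[name] = sum(kind == 'conv3' for kind, _, _ in layers)
    # weight layers = conv layers + 3 FC layers
    print(name, conv_counts[name], conv_counts[name] + 3, layers[-1])
```

Every variant ends with `('pool', 512, 7)`, so all four share the same 25,088-wide classifier input.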
File Structure
04_VGG/
├── README.md                          # This file
├── pytorch_lowlevel/
│   └── vgg_lowlevel.py                # Using F.conv2d, F.linear
├── paper/
│   └── vgg_paper.py                   # Exact paper architecture reproduction
└── exercises/
    ├── 01_feature_visualization.md    # Visualize feature maps per block
    └── 02_transfer_learning.md        # Use pretrained weights
Core Concepts
1. Deep & Narrow vs Shallow & Wide
Before VGG: Large filters + shallow networks
- AlexNet: 11×11 and 5×5 filters in the first two conv layers
- Few layers (AlexNet: 8 weight layers)
VGG: Small filters + deep networks
- Only 3×3 filters (plus 1×1 in configuration C)
- 16 to 19 weight layers
Conclusion: Depth is crucial for performance
2. Uniform Structure
VGG design principles:
1. All Conv are 3×3, stride=1, padding=1
2. All MaxPool are 2×2, stride=2
3. Channels double after each pool until they reach 512 (64→128→256→512→512)
4. Simple and regular → easy to understand/implement
3. VGG Limitations
Disadvantages:
1. Too many parameters (VGG16: 138M vs. ResNet-50: ~25M)
2. High memory consumption (the FC layers hold most of the weights)
3. Slow training
4. Vanishing gradients as the network gets deeper
Follow-up research:
- GoogLeNet: Efficiency with Inception modules
- ResNet: Deeper with skip connections
- MobileNet: Depthwise separable conv
4. VGG as Feature Extractor
VGG is widely used as a feature extractor:
1. Style Transfer
- Content: block4_conv2
- Style: block1_conv1 through block5_conv1
2. Perceptual Loss
- Compare VGG feature maps of two images instead of comparing pixels directly
3. Object Detection
- VGG backbone + detection head
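To pull out features at named layers, the names first have to be mapped to positions in the network. The sketch below derives those positions from the VGG16 configuration, assuming a flat sequence in which every conv is followed by a ReLU and every 'M' is one pooling layer. I believe this matches the layout of torchvision's `vgg16().features` (without batch norm), but treat that as an assumption to verify against the model you actually load. The helper name `layer_index_by_name` is hypothetical.

```python
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']


def layer_index_by_name(config):
    """Map names like 'block4_conv2' to positions in a conv/ReLU/pool sequence."""
    names = {}
    index, block, conv = 0, 1, 0
    for item in config:
        if item == 'M':
            names[f'block{block}_pool'] = index
            index += 1   # the pool occupies one slot
            block += 1
            conv = 0
        else:
            conv += 1
            names[f'block{block}_conv{conv}'] = index
            index += 2   # the conv and its ReLU occupy two slots
    return names


names = layer_index_by_name(VGG16_CFG)
content_layer = names['block4_conv2']
style_layers = [names[f'block{b}_conv1'] for b in range(1, 6)]

print(content_layer)  # 19
print(style_layers)   # [0, 5, 10, 17, 24]
```

With these indices, a feature extractor would run the input through the sequence and keep the activations at the listed positions. Whether to tap the conv output or the ReLU output (index + 1) is a design choice that varies between implementations.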
Implementation Levels
Level 2: PyTorch Low-Level (pytorch_lowlevel/)
- Use F.conv2d, F.max_pool2d, F.linear
- Don't use nn.Conv2d, nn.Linear
- Manual parameter initialization and management
- Block-wise modularization
Level 3: Paper Implementation (paper/)
- Reproduce all paper settings
- Add optional Batch Normalization (VGG-BN, a later variant that is not in the original paper)
- Support various VGG variants
Learning Checklist
- [ ] Understand advantages of 3×3 filter stacking
- [ ] Master receptive field calculation method
- [ ] Memorize VGG16 architecture
- [ ] Understand parameter distribution (Conv vs FC)
- [ ] Know how to use VGG as a feature extractor
- [ ] Compare VGG limitations with follow-up models
References
- Simonyan & Zisserman (2014). "Very Deep Convolutional Networks for Large-Scale Image Recognition"
- torchvision VGG
- CS231n: ConvNets
- ../03_CNN_LeNet/README.md