11. VGG


Overview

VGGNet, proposed by Karen Simonyan and Andrew Zisserman, placed 2nd in the ILSVRC 2014 classification task. Their paper, "Very Deep Convolutional Networks for Large-Scale Image Recognition", showed that accuracy improves with depth when a network is built from stacks of small 3x3 filters.


Mathematical Background

1. Effect of 3x3 Filter Stacking

Why stack multiple 3x3 filters?

Two 3x3 convs ≈ One 5x5 conv (same receptive field)
Three 3x3 convs ≈ One 7x7 conv

Advantages:
1. Reduced parameters (C input and C output channels, biases ignored):
   - 7x7: 49C² parameters
   - 3x3 × 3: 27C² parameters (45% reduction)

2. Increased non-linearity:
   - 7x7: 1 ReLU
   - 3x3 × 3: 3 ReLUs → can learn more complex functions
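
The parameter comparison can be checked with a few lines of arithmetic. This is a minimal sketch: the helper name `conv_weights` and the channel count `C = 256` are illustrative choices, not part of the original material, and biases are ignored as in the formulas.

```python
def conv_weights(kernel_size: int, channels: int, num_layers: int = 1) -> int:
    """Weights in `num_layers` stacked k x k convs with C input and C output channels."""
    return num_layers * kernel_size * kernel_size * channels * channels


C = 256  # illustrative channel count; the ratio does not depend on C

single_7x7 = conv_weights(7, C)                 # 49 * C^2
stacked_3x3 = conv_weights(3, C, num_layers=3)  # 27 * C^2
reduction = 1 - stacked_3x3 / single_7x7        # 22/49, about 45%

print(f"7x7: {single_7x7:,}  three 3x3: {stacked_3x3:,}  reduction: {reduction:.1%}")
```

Because both counts scale with C², the roughly 45% saving holds at every block of the network.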

2. Receptive Field Calculation

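The receptive field (RF) grows as layers stack. Track it together with the cumulative stride ("jump") of all earlier layers:

RF = RF_prev + (kernel_size - 1) × jump_prev
jump = jump_prev × stride

(start with RF = 1, jump = 1 at the input)

Example (stride=1, kernel=3):
- Layer 1: RF = 3
- Layer 2: RF = 5
- Layer 3: RF = 7
- Layer 4: RF = 9
...

After MaxPool (kernel=2, stride=2):
- RF grows by jump_prev, and jump doubles
- Each later 3×3 conv therefore adds twice as much RF as it did before the pool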
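
As a rough illustration, the sketch below tracks the receptive field layer by layer using the standard recursion RF += (kernel - 1) × jump, where jump is the product of all earlier strides. The function name `receptive_field` and the `(kernel, stride)` tuple encoding are assumptions made for this example.

```python
def receptive_field(layers):
    """Return the receptive field after each layer.

    layers: iterable of (kernel_size, stride) pairs, input side first.
    """
    rf, jump = 1, 1  # a single input pixel sees itself; no stride applied yet
    history = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # growth is scaled by the cumulative stride
        jump *= stride
        history.append(rf)
    return history


CONV = (3, 1)  # 3x3 conv, stride 1
POOL = (2, 2)  # 2x2 max pool, stride 2

# VGG16 feature extractor: blocks of 2, 2, 3, 3, 3 convs, each followed by a pool
vgg16_layers = [CONV] * 2 + [POOL] + [CONV] * 2 + [POOL] + ([CONV] * 3 + [POOL]) * 3
rfs = receptive_field(vgg16_layers)

print(receptive_field([CONV] * 4))  # [3, 5, 7, 9]
print(rfs[-1])                      # 212
```

With stride-1 convs only, the sequence matches the 3, 5, 7, 9 example. Across the full VGG16 stack the pooling layers increase the jump, so the final receptive field reaches 212 pixels on a 224-pixel input.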

3. Feature Map Size Changes

Conv (stride=1, padding=1, kernel=3):
  H_out = H_in  (maintains size)

MaxPool (kernel=2, stride=2):
  H_out = H_in / 2  (halves size)

224 → [Conv×2] → 224 → Pool → 112 → [Conv×2] → 112 → Pool → 56 → ...
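
The size changes follow from the usual output-size formula, floor((H + 2·padding − kernel) / stride) + 1. A small sketch, with `out_size` and `trace` as hypothetical helper names:

```python
def out_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a conv or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1


def trace(size: int, convs_per_block):
    """Return the spatial size at the input and after each block's MaxPool."""
    sizes = [size]
    for num_convs in convs_per_block:
        for _ in range(num_convs):
            size = out_size(size, kernel=3, stride=1, padding=1)  # keeps size
        size = out_size(size, kernel=2, stride=2)                 # halves size
        sizes.append(size)
    return sizes


print(trace(224, [2, 2, 3, 3, 3]))  # [224, 112, 56, 28, 14, 7]
```

The final 7×7 map is what the classifier flattens, which is why the first FC layer expects 7×7×512 inputs.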

VGG Architecture

VGG Variant Comparison

Configuration   VGG11   VGG13   VGG16   VGG19
Conv Layers     8       10      13      16
FC Layers       3       3       3       3
Total Layers    11      13      16      19
Parameters      133M    133M    138M    144M

VGG16 Detailed Structure

Input: 224×224×3 RGB image

Block 1: [Conv3-64] × 2 + MaxPool
  (224×224×3) → (224×224×64) → (112×112×64)

Block 2: [Conv3-128] × 2 + MaxPool
  (112×112×64) → (112×112×128) → (56×56×128)

Block 3: [Conv3-256] × 3 + MaxPool
  (56×56×128) → (56×56×256) → (28×28×256)

Block 4: [Conv3-512] × 3 + MaxPool
  (28×28×256) → (28×28×512) → (14×14×512)

Block 5: [Conv3-512] × 3 + MaxPool
  (14×14×512) → (14×14×512) → (7×7×512)

Classifier:
  Flatten: 7×7×512 = 25,088
  FC1: 25088 → 4096 + ReLU + Dropout
  FC2: 4096 → 4096 + ReLU + Dropout
  FC3: 4096 → 1000 (classes)

Parameter distribution:
- Conv layers: ~15M (11%)
- FC layers: ~124M (89%)  ← Most!
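
The split between conv and FC parameters can be reproduced by counting weights and biases layer by layer. This is a sketch under the stated architecture (3×3 convs, 224×224 input, 1000 classes); `conv_params` and `fc_params` are names introduced here for illustration.

```python
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def conv_params(cfg, in_channels=3, kernel=3):
    """Weights plus biases of all conv layers described by cfg."""
    total = 0
    for v in cfg:
        if v == "M":  # pooling has no parameters
            continue
        total += kernel * kernel * in_channels * v + v
        in_channels = v
    return total


def fc_params(sizes):
    """Weights plus biases of a chain of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


conv = conv_params(VGG16_CFG)
fc = fc_params([7 * 7 * 512, 4096, 4096, 1000])
total = conv + fc

print(f"conv:  {conv:,} ({conv / total:.1%})")
print(f"fc:    {fc:,} ({fc / total:.1%})")
print(f"total: {total:,}")
```

The first FC layer (25088 → 4096) alone holds over 100M parameters, which is most of the model.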

VGG Configuration

cfg = {
    'VGG11': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'VGG13': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'VGG16': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
    'VGG19': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}
# 'M' = MaxPool
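
To see how the configuration lists map onto the variant comparison, the sketch below expands each list into layer labels and counts weight layers. The helper names `describe` and `count_weight_layers` are hypothetical, and the three FC layers are assumed to be the same for every variant.

```python
cfg = {
    'VGG11': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'VGG13': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'VGG16': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
              512, 512, 512, 'M', 512, 512, 512, 'M'],
    'VGG19': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M',
              512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}


def describe(name):
    """Expand a config into labels: integers become Conv3-<channels>, 'M' becomes MaxPool."""
    return ['MaxPool' if v == 'M' else f'Conv3-{v}' for v in cfg[name]]


def count_weight_layers(name, num_fc=3):
    """Return (conv layers, conv + FC layers); pooling layers carry no weights."""
    convs = sum(1 for v in cfg[name] if v != 'M')
    return convs, convs + num_fc


for name in cfg:
    convs, total = count_weight_layers(name)
    print(f"{name}: {convs} conv + 3 FC = {total} weight layers")

print(describe('VGG11')[:4])  # ['Conv3-64', 'MaxPool', 'Conv3-128', 'MaxPool']
```

The number in each variant's name is the count of weight layers, so pooling layers are not included.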

File Structure

04_VGG/
├── README.md                      # This file
├── pytorch_lowlevel/
│   └── vgg_lowlevel.py           # Using F.conv2d, F.linear
├── paper/
│   └── vgg_paper.py              # Exact paper architecture reproduction
└── exercises/
    ├── 01_feature_visualization.md   # Visualize feature maps per block
    └── 02_transfer_learning.md       # Use pretrained weights

Core Concepts

1. Deep & Narrow vs Shallow & Wide

Before VGG: Large filters + shallow networks
  - AlexNet: 11×11, 5×5 filters
  - Few layers

VGG: Small filters + deep networks
  - Only 3×3 filters (+ some 1×1)
  - 16 to 19 weight layers

Conclusion: Depth is crucial for performance

2. Uniform Structure

VGG design principles:

1. All Conv layers are 3×3, stride=1, padding=1
2. All MaxPool layers are 2×2, stride=2
3. Channels double after each pool until they reach 512 (64→128→256→512→512)
4. Simple and regular → easy to understand/implement
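
Because the structure is so regular, a configuration list can be generated from the number of convs per block alone. The sketch below is one possible formulation; `build_cfg` and its parameters are illustrative names, and the cap at 512 channels reflects the fact that blocks 4 and 5 both use 512.

```python
def build_cfg(convs_per_block, base_channels=64, max_channels=512):
    """Generate a VGG-style config: N convs per block, then a pool ('M'),
    doubling the channel count after each block up to max_channels."""
    cfg, channels = [], base_channels
    for num_convs in convs_per_block:
        cfg.extend([channels] * num_convs)
        cfg.append("M")
        channels = min(channels * 2, max_channels)
    return cfg


vgg16 = build_cfg([2, 2, 3, 3, 3])
vgg11 = build_cfg([1, 1, 2, 2, 2])
print(vgg16)
```

Only the convs-per-block list differs between variants; filter size, pooling, and channel progression stay fixed.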

3. VGG Limitations

Disadvantages:
1. Too many parameters (138M for VGG16, versus about 25M for ResNet-50)
2. High memory consumption (the FC weights alone take roughly 0.5 GB in float32)
3. Slow training
4. Vanishing gradients as the network gets deeper
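
A back-of-the-envelope estimate gives a sense of the memory cost. This sketch assumes float32 storage, a single 224×224 input, and counts only the conv output feature maps of VGG16 (no gradients, optimizer state, or framework overhead). The exact parameter total 138,357,544 is the conv + FC count for 1000 classes.

```python
BYTES_PER_FLOAT32 = 4
VGG16_PARAMS = 138_357_544  # weights + biases, 1000-class classifier

# (spatial size, channels, number of conv layers) per block for a 224x224 input
BLOCKS = [(224, 64, 2), (112, 128, 2), (56, 256, 3), (28, 512, 3), (14, 512, 3)]

# Number of activation values produced by all conv layers for one image
conv_activations = sum(size * size * ch * n for size, ch, n in BLOCKS)

weight_mib = VGG16_PARAMS * BYTES_PER_FLOAT32 / 2**20
activation_mib = conv_activations * BYTES_PER_FLOAT32 / 2**20

print(f"weights:     {weight_mib:.0f} MiB")
print(f"activations: {activation_mib:.0f} MiB per image (forward pass, conv outputs only)")
```

Weight memory is fixed, while activation memory scales with batch size, so large training batches can be dominated by the feature maps of the early, high-resolution blocks.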

Follow-up research:
- GoogLeNet: Efficiency with Inception modules
- ResNet: Deeper with skip connections
- MobileNet: Depthwise separable conv

4. VGG as Feature Extractor

VGG is widely used as a feature extractor:

1. Style Transfer
   - Content: block4_conv2
   - Style: block1_conv1 through block5_conv1

2. Perceptual Loss
   - Compare VGG features instead of pixel loss

3. Object Detection
   - VGG backbone + detection head
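
When extracting features, names such as block4_conv2 must be mapped to positions in the model's layer list. The sketch below computes those positions under an assumption: the feature extractor is a flat sequence with one ReLU after each conv and one pooling layer per 'M', which is the layout a non-BN `nn.Sequential` built from the config would have. The function `conv_indices` is a hypothetical helper, not a library API.

```python
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def conv_indices(cfg):
    """Map 'block{b}_conv{c}' to its index in a flat [conv, ReLU, ..., pool] layout."""
    indices = {}
    idx, block, conv = 0, 1, 1
    for v in cfg:
        if v == "M":
            idx += 1    # the pooling layer occupies one slot
            block += 1  # the next conv starts a new block
            conv = 1
        else:
            indices[f"block{block}_conv{conv}"] = idx
            idx += 2    # conv + ReLU (assumed layout, no BatchNorm)
            conv += 1
    return indices


idx = conv_indices(VGG16_CFG)
content_layer = idx["block4_conv2"]
style_layers = [idx[f"block{b}_conv1"] for b in range(1, 6)]

print(content_layer)  # 19
print(style_layers)   # [0, 5, 10, 17, 24]
```

With BatchNorm variants each conv occupies three slots instead of two, so the indices shift; check the actual model's layer list before relying on them.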

Implementation Levels

Level 2: PyTorch Low-Level (pytorch_lowlevel/)

  • Use F.conv2d, F.max_pool2d, F.linear
  • Don't use nn.Conv2d, nn.Linear
  • Manual parameter initialization and management
  • Block-wise modularization

Level 3: Paper Implementation (paper/)

  • Reproduce all paper settings
  • Add Batch Normalization (VGG-BN)
  • Support various VGG variants

Learning Checklist

  • [ ] Understand advantages of 3×3 filter stacking
  • [ ] Master receptive field calculation method
  • [ ] Memorize VGG16 architecture
  • [ ] Understand parameter distribution (Conv vs FC)
  • [ ] Know how to use VGG as a feature extractor
  • [ ] Compare VGG limitations with follow-up models
