19. BERT

Previous: Transformer | Next: GPT


Overview

BERT (Bidirectional Encoder Representations from Transformers) is a model released by Google in 2018 that revolutionized NLP. Unlike left-to-right language models, it conditions on context from both directions, so each token's representation reflects the entire sentence.


Mathematical Background

1. Masked Language Modeling (MLM)

Objective function, summed over the set of masked positions M:
L_MLM = -Σ_{i ∈ M} log P(x_i | x_context)

Masking strategy (15% of tokens):
- 80%: replace with [MASK] token
- 10%: replace with random token
- 10%: keep original

Example:
Input: "The [MASK] sat on the mat"
Goal: predict "cat"
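
The masking rule above is straightforward to implement. Below is a minimal sketch, assuming integer token IDs and hypothetical mask_token_id / vocab_size arguments; the -100 label value matches PyTorch's default ignore_index for cross-entropy:

import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    # Select 15% of positions as MLM prediction targets
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100                      # ignored by cross_entropy

    # 80% of selected positions -> [MASK]
    masked = selected & (torch.rand(input_ids.shape) < 0.8)
    input_ids[masked] = mask_token_id

    # Half of the remaining 20% -> random token (i.e. 10% overall)
    randomized = selected & ~masked & (torch.rand(input_ids.shape) < 0.5)
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]

    # The last 10% keep their original token (no-op)
    return input_ids, labels                      # note: mutates input_ids in place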

2. Next Sentence Prediction (NSP)

50% IsNext:    Sentence A → Sentence B (actual continuation)
50% NotNext:   Sentence A → random sentence B

Input: [CLS] Sentence A [SEP] Sentence B [SEP]
Output: IsNext / NotNext classification
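
A minimal sketch of the 50/50 pair sampling, assuming the corpus is a hypothetical list of documents, each a list of at least two sentences:

import random

def make_nsp_pair(docs):
    doc = random.choice(docs)
    i = random.randrange(len(doc) - 1)
    if random.random() < 0.5:
        return doc[i], doc[i + 1], 1          # IsNext: actual continuation
    other = random.choice(docs)               # NotNext: sentence from a random doc
    return doc[i], random.choice(other), 0    # (a real pipeline would exclude `doc`)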

3. BERT Embedding

Token Embedding:     word meaning
Segment Embedding:   distinguish sentence A/B
Position Embedding:  position information

Input = Token_Emb + Segment_Emb + Position_Emb
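
A minimal sketch of this sum, assuming BERT-Base dimensions (vocab 30522, hidden 768, max length 512); as in the paper, position embeddings are learned and the summed embedding passes through LayerNorm and dropout:

import torch
import torch.nn as nn

class BertEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(n_segments, hidden)  # sentence A = 0, B = 1
        self.position = nn.Embedding(max_len, hidden)    # learned, not sinusoidal
        self.norm = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(0.1)

    def forward(self, input_ids, segment_ids):
        pos = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.token(input_ids) + self.segment(segment_ids) + self.position(pos)
        return self.dropout(self.norm(x))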

BERT Architecture

BERT-Base:
- Hidden size: 768
- Layers: 12
- Attention heads: 12
- Parameters: 110M

BERT-Large:
- Hidden size: 1024
- Layers: 24
- Attention heads: 16
- Parameters: 340M
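
The 110M figure can be sanity-checked from the BERT-Base dimensions; the rough count below ignores biases, LayerNorm parameters, and the pooler:

H, L, V, max_len = 768, 12, 30522, 512

embeddings = (V + max_len + 2) * H    # token + position + segment tables
per_layer = 4 * H * H                 # Q, K, V, and attention output projections
per_layer += 2 * (H * 4 * H)          # FFN: H -> 4H -> H
total = embeddings + L * per_layer
print(f"{total / 1e6:.0f}M")          # ~109M, close to the reported 110M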

Structure:
[CLS] Token1 Token2 ... [SEP] Token1 ... [SEP]
        ↓
Embedding Layer (Token + Segment + Position)
        ↓
Transformer Encoder × L layers
        ↓
[CLS]: classification / Tokens: token prediction
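
In code, the two output views are just different slices of the final hidden states (shapes assumed):

import torch

hidden = torch.randn(2, 128, 768)   # (batch, seq_len, hidden) from the encoder

cls_vector = hidden[:, 0]           # [CLS] position -> sequence-level classification
token_vectors = hidden              # all positions  -> MLM / token-level prediction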

File Structure

08_BERT/
├── README.md
├── pytorch_lowlevel/
│   └── bert_lowlevel.py        # Direct BERT Encoder implementation
├── paper/
│   └── bert_paper.py           # Paper reproduction
└── exercises/
    ├── 01_mlm_training.md      # MLM training practice
    └── 02_finetuning.md        # Classification fine-tuning

Core Concepts

1. Bidirectional Context

GPT (left-to-right):
"The cat sat" → attends only to the left context to predict the next token

BERT (bidirectional):
"The [MASK] sat on the mat" → attends to both sides to predict [MASK]

Advantage: richer contextual representations
Disadvantage: not directly usable for autoregressive text generation
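
The difference is ultimately an attention-mask choice, illustrated below: GPT applies a causal (lower-triangular) mask so position i sees only positions ≤ i, while BERT leaves attention unrestricted:

import torch

T = 5
causal = torch.tril(torch.ones(T, T))   # GPT: row i attends to columns 0..i
bidirectional = torch.ones(T, T)        # BERT: every token attends to every token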

2. Pre-training & Fine-tuning

Phase 1: Pre-training (large corpus)
- MLM + NSP tasks
- Wikipedia + BookCorpus (3.3B words)

Phase 2: Fine-tuning (downstream task)
- Classify with [CLS] token
- Or sequence labeling with all token outputs
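
A minimal sketch of the classification setup, assuming `bert` is a pre-trained encoder module that returns per-token hidden states:

import torch.nn as nn
import torch.nn.functional as F

class BertClassifier(nn.Module):
    def __init__(self, bert, hidden=768, num_labels=2):
        super().__init__()
        self.bert = bert                              # pre-trained encoder (assumed)
        self.head = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, segment_ids, labels=None):
        h = self.bert(input_ids, segment_ids)         # (batch, seq, hidden)
        logits = self.head(h[:, 0])                   # classify from the [CLS] output
        if labels is not None:
            return F.cross_entropy(logits, labels)
        return logits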

3. Input Format

Single sentence: [CLS] tokens [SEP]
Sentence pair:   [CLS] tokens_A [SEP] tokens_B [SEP]

Segment IDs:
[CLS] A A A [SEP] B B B [SEP]
  0   0 0 0   0   1 1 1   1
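
A sketch of assembling the pair format (string tokens for readability; real BERT maps WordPiece tokens to IDs):

def build_pair_input(tokens_a, tokens_b):
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segs = build_pair_input(["the", "cat"], ["it", "sat"])
# tokens: [CLS] the cat [SEP] it sat [SEP]
# segs:     0    0   0    0    1   1    1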

Implementation Levels

Level 2: PyTorch Low-Level (pytorch_lowlevel/)

  • Use F.linear, F.layer_norm (see the layer sketch after this list)
  • Don't use nn.TransformerEncoder
  • Manual embedding implementation
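
A single-layer sketch in that style, assuming `w` is a hypothetical dict of weight/bias tensors (wq/wk/wv/wo of shape (H, H), w1 of (4H, H), w2 of (H, 4H)); real BERT also has learned LayerNorm parameters:

import torch
import torch.nn.functional as F

def encoder_layer(x, w, n_heads=12):
    B, T, H = x.shape
    d = H // n_heads

    def split(t):                                    # (B, T, H) -> (B, heads, T, d)
        return t.view(B, T, n_heads, d).transpose(1, 2)

    q = split(F.linear(x, w["wq"], w["bq"]))
    k = split(F.linear(x, w["wk"], w["bk"]))
    v = split(F.linear(x, w["wv"], w["bv"]))

    att = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    out = (att @ v).transpose(1, 2).reshape(B, T, H)
    out = F.linear(out, w["wo"], w["bo"])
    x = F.layer_norm(x + out, (H,))                  # residual + post-LayerNorm

    ffn = F.linear(F.gelu(F.linear(x, w["w1"], w["b1"])), w["w2"], w["b2"])
    return F.layer_norm(x + ffn, (H,))               # residual + post-LayerNorm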

Level 3: Paper Implementation (paper/)

  • Reproduce exact paper specifications
  • MLM + NSP pre-training
  • Classification fine-tuning

Level 4: Code Analysis (separate document)

  • Analyze HuggingFace transformers code
  • BertModel, BertForSequenceClassification (see the usage sketch below)
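
A minimal usage sketch of those classes (model name and label count are illustrative):

from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
logits = model(**inputs).logits   # (1, num_labels)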

Learning Checklist

  • [ ] Understand MLM masking strategy
  • [ ] Understand NSP task
  • [ ] Understand Token/Segment/Position Embedding
  • [ ] Role of [CLS] token
  • [ ] Fine-tuning methods (classification, NER, QA)
  • [ ] Differences between BERT and GPT
