08. BERT

Overview

BERT (Bidirectional Encoder Representations from Transformers) is a model released by Google in 2018 that revolutionized the NLP field. It understands the meaning of a word by using context from both directions.


μˆ˜ν•™μ  λ°°κ²½

1. Masked Language Modeling (MLM)

λͺ©μ ν•¨μˆ˜:
L_MLM = -Ξ£ log P(x_mask | x_context)

Masking strategy (15% of tokens):
- 80%: replaced with the [MASK] token
- 10%: replaced with a random token
- 10%: left unchanged

Example:
Input:  "The [MASK] sat on the mat"
Target: predict "cat"
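
The 80/10/10 rule maps directly onto a few tensor operations. Below is a minimal sketch of the masking step (the function name mask_tokens and the -100 ignore-index convention are illustrative, not taken from this repo's code):

import torch

def mask_tokens(input_ids, vocab_size, mask_token_id, mlm_prob=0.15):
    # Modifies input_ids in place; labels keep the original tokens.
    labels = input_ids.clone()
    selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~selected] = -100                    # only masked positions are scored

    # 80% of the selected tokens -> [MASK]
    to_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[to_mask] = mask_token_id

    # 10% -> a random token (half of the remaining 20%)
    to_rand = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~to_mask
    input_ids[to_rand] = torch.randint(vocab_size, input_ids.shape)[to_rand]

    # Final 10%: the original token is kept unchanged.
    return input_ids, labels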

2. Next Sentence Prediction (NSP)

50% IsNext:    Sentence A β†’ Sentence B (true next sentence)
50% NotNext:   Sentence A β†’ Random B

Input:  [CLS] Sentence A [SEP] Sentence B [SEP]
Output: IsNext / NotNext classification
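
A sketch of how such pairs could be sampled (make_nsp_pair is a hypothetical helper; labels 1/0 stand for IsNext/NotNext):

import random

def make_nsp_pair(doc_sentences, all_sentences):
    # Pick a sentence and, with probability 0.5, its true successor (IsNext)
    # or a random sentence from the corpus (NotNext).
    # Assumes doc_sentences has at least two sentences.
    i = random.randrange(len(doc_sentences) - 1)
    sent_a = doc_sentences[i]
    if random.random() < 0.5:
        return sent_a, doc_sentences[i + 1], 1        # IsNext
    return sent_a, random.choice(all_sentences), 0    # NotNext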

3. BERT Embedding

Token Embedding:     word identity
Segment Embedding:   distinguishes sentence A from sentence B
Position Embedding:  positional information

Input = Token_Emb + Segment_Emb + Position_Emb
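
A minimal sketch of this sum as a PyTorch module, assuming BERT-Base sizes. Note that BERT applies LayerNorm and dropout after adding the three embeddings, and that its position embeddings are learned rather than sinusoidal:

import torch
import torch.nn as nn

class BertEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, max_len=512, dropout=0.1):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(2, hidden)         # sentence A (0) / B (1)
        self.position = nn.Embedding(max_len, hidden)  # learned, not sinusoidal
        self.norm = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_ids, segment_ids):
        pos = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.token(input_ids) + self.segment(segment_ids) + self.position(pos)
        return self.dropout(self.norm(x))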

BERT μ•„ν‚€ν…μ²˜

BERT-Base:
- Hidden size: 768
- Layers: 12
- Attention heads: 12
- Parameters: 110M

BERT-Large:
- Hidden size: 1024
- Layers: 24
- Attention heads: 16
- Parameters: 340M
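
The 110M figure can be sanity-checked from these numbers. A back-of-the-envelope count for BERT-Base (vocab 30522, 512 positions, FFN size 4 Γ— hidden, plus the pooler on top of [CLS]):

V, P, H, L = 30522, 512, 768, 12       # vocab, max positions, hidden, layers
emb   = V*H + P*H + 2*H + 2*H          # token + position + segment + LayerNorm
attn  = 4 * (H*H + H)                  # Q, K, V, output projections (+ biases)
ffn   = (H*4*H + 4*H) + (4*H*H + H)    # the two linear layers of the FFN
norms = 2 * 2*H                        # two LayerNorms per encoder layer
layer = attn + ffn + norms
pool  = H*H + H                        # pooler acting on [CLS]
print(emb + L*layer + pool)            # 109482240 β‰ˆ 110M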

Structure:
[CLS] Token1 Token2 ... [SEP] Token1 ... [SEP]
  ↓
Embedding Layer (Token + Segment + Position)
  ↓
Transformer Encoder Γ— L layers
  ↓
[CLS]: classification / Token: token prediction
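
A compact sketch of this pipeline. For brevity it uses nn.TransformerEncoder, which the low-level implementation in pytorch_lowlevel/ deliberately avoids; MiniBert and both head names are illustrative:

import torch
import torch.nn as nn

class MiniBert(nn.Module):
    def __init__(self, vocab=30522, hidden=768, layers=12, heads=12, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab, hidden)
        self.seg = nn.Embedding(2, hidden)
        self.pos = nn.Embedding(max_len, hidden)
        self.norm = nn.LayerNorm(hidden)
        block = nn.TransformerEncoderLayer(hidden, heads, 4 * hidden,
                                           activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)
        self.mlm_head = nn.Linear(hidden, vocab)   # per-token prediction
        self.nsp_head = nn.Linear(hidden, 2)       # [CLS] classification

    def forward(self, ids, seg_ids):
        pos = torch.arange(ids.size(1), device=ids.device)
        h = self.norm(self.tok(ids) + self.seg(seg_ids) + self.pos(pos))
        h = self.encoder(h)                              # (batch, seq, hidden)
        return self.mlm_head(h), self.nsp_head(h[:, 0])  # tokens, [CLS]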

File Structure

08_BERT/
β”œβ”€β”€ README.md
β”œβ”€β”€ pytorch_lowlevel/
β”‚   └── bert_lowlevel.py        # BERT encoder implemented from scratch
β”œβ”€β”€ paper/
β”‚   └── bert_paper.py           # Paper reproduction
└── exercises/
    β”œβ”€β”€ 01_mlm_training.md      # MLM training exercise
    └── 02_finetuning.md        # Classification fine-tuning exercise

Core Concepts

1. Bidirectional Context

GPT (Left-to-Right):
"The cat sat" β†’ predicts the next token from the left context only

BERT (Bidirectional):
"The [MASK] sat on the mat" β†’ predicts [MASK] from the context on both sides

Advantage: richer understanding of context
Drawback: ill-suited to text generation
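
The difference is visible in the attention mask itself: GPT uses a causal (lower-triangular) mask, while BERT lets every position attend to every other position. A tiny illustration:

import torch

seq_len = 5
# GPT: causal mask -- position i may attend only to positions <= i.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
# BERT: full mask -- all positions attend to all (non-padding) positions.
full = torch.ones(seq_len, seq_len, dtype=torch.bool)
print(causal.int())
print(full.int())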

2. Pre-training & Fine-tuning

Phase 1: Pre-training (large corpus)
- MLM + NSP tasks
- Wikipedia + BookCorpus (3.3B tokens)

Phase 2: Fine-tuning (downstream tasks)
- Classification via the [CLS] token
- Or sequence labeling using every token's output

3. Input Format

Single sentence: [CLS] tokens [SEP]
Sentence pair:   [CLS] tokens_A [SEP] tokens_B [SEP]

Segment IDs:
[CLS] A A A [SEP] B B B [SEP]
  0   0 0 0   0   1 1 1   1
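
With the HuggingFace tokenizer (assuming the transformers package is available), the special tokens and segment IDs above are produced automatically:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("The cat sat.", "It was tired.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'the', 'cat', 'sat', '.', '[SEP]', 'it', 'was', 'tired', '.', '[SEP]']
print(enc["token_type_ids"])
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  -- 0 up to the first [SEP], then 1 for B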

Implementation Levels

Level 2: PyTorch Low-Level (pytorch_lowlevel/)

  • Uses F.linear and F.layer_norm
  • No nn.TransformerEncoder
  • Embeddings implemented by hand

Level 3: Paper Implementation (paper/)

  • λ…Όλ¬Έμ˜ μ •ν™•ν•œ 사양 μž¬ν˜„
  • MLM + NSP pre-training
  • λΆ„λ₯˜ fine-tuning

Level 4: Code Analysis (separate document)

  • Analysis of the HuggingFace transformers source code
  • BertModel, BertForSequenceClassification
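
A typical starting point for that analysis, using the public transformers API (downloads pretrained weights on first use; the classification head starts untrained):

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

enc = tokenizer("A great movie.", return_tensors="pt")
out = model(**enc, labels=torch.tensor([1]))
print(out.loss, out.logits.shape)   # scalar loss, (1, 2) logits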

Learning Checklist

  • [ ] Understand the MLM masking strategy
  • [ ] Understand the NSP task
  • [ ] Understand Token/Segment/Position embeddings
  • [ ] Role of the [CLS] token
  • [ ] Fine-tuning methods (classification, NER, QA)
  • [ ] Differences between BERT and GPT
