# 08. BERT
## Overview

BERT (Bidirectional Encoder Representations from Transformers) is a model released by Google in 2018 that revolutionized NLP. It uses bidirectional context to understand the meaning of each word.
## Mathematical Background
### 1. Masked Language Modeling (MLM)
Objective function:

L_MLM = -Σ log P(x_masked | x_context)
Masking strategy (15% of tokens):
- 80%: replaced with the [MASK] token
- 10%: replaced with a random token
- 10%: kept unchanged
Example:

Input:  "The [MASK] sat on the mat"
Target: predict "cat"
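The 80/10/10 rule is usually applied directly to the token-id tensor. A minimal PyTorch sketch, assuming hypothetical `mask_token_id` and `vocab_size` arguments (not names from this repo):

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Apply BERT-style MLM masking and return (masked inputs, labels)."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Choose 15% of positions as prediction targets.
    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~masked] = -100  # -100 is ignored by cross_entropy

    # 80% of targets -> [MASK]
    replace_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[replace_mask] = mask_token_id

    # 10% of targets -> random token (half of the remaining 20%)
    random_mask = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replace_mask
    input_ids[random_mask] = torch.randint(vocab_size, input_ids.shape)[random_mask]

    # Remaining 10% of targets: keep the original token (nothing to do).
    return input_ids, labels
```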
### 2. Next Sentence Prediction (NSP)
50% IsNext: Sentence A → Sentence B (actually consecutive)
50% NotNext: Sentence A → random Sentence B

Input:  [CLS] Sentence A [SEP] Sentence B [SEP]
Output: IsNext / NotNext classification
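A minimal sketch of how such pairs can be sampled from a corpus of documents (each a list of sentences); the names `documents` and `make_nsp_pair` are illustrative, not from the repo:

```python
import random

def make_nsp_pair(documents, doc_idx, sent_idx):
    """Return (sentence_a, sentence_b, label) for NSP pre-training."""
    doc = documents[doc_idx]
    sent_a = doc[sent_idx]
    if random.random() < 0.5 and sent_idx + 1 < len(doc):
        sent_b, label = doc[sent_idx + 1], "IsNext"          # true next sentence
    else:
        # Real implementations usually sample from a *different* document.
        rand_doc = documents[random.randrange(len(documents))]
        sent_b, label = random.choice(rand_doc), "NotNext"   # random sentence
    return sent_a, sent_b, label
```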
### 3. BERT Embedding
Token Embedding: word meaning
Segment Embedding: distinguishes sentence A from sentence B
Position Embedding: positional information
Input = Token_Emb + Segment_Emb + Position_Emb
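A minimal sketch of this sum in PyTorch, assuming BERT-Base sizes (hidden 768, max length 512, 2 segment types) and learned position embeddings:

```python
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    def __init__(self, vocab_size, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(n_segments, hidden)
        self.position = nn.Embedding(max_len, hidden)  # learned, not sinusoidal
        self.norm = nn.LayerNorm(hidden)
        self.drop = nn.Dropout(0.1)

    def forward(self, input_ids, segment_ids):
        # Positions 0..seq_len-1, broadcast over the batch dimension.
        pos = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.token(input_ids) + self.segment(segment_ids) + self.position(pos)
        return self.drop(self.norm(x))
```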
## BERT Architecture
BERT-Base:
- Hidden size: 768
- Layers: 12
- Attention heads: 12
- Parameters: 110M
BERT-Large:
- Hidden size: 1024
- Layers: 24
- Attention heads: 16
- Parameters: 340M
ꡬ쑰:
[CLS] Token1 Token2 ... [SEP] Token1 ... [SEP]
β
Embedding Layer (Token + Segment + Position)
β
Transformer Encoder Γ L layers
β
[CLS]: λΆλ₯ / Token: ν ν° μμΈ‘
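A compact sketch of this flow; it uses nn.TransformerEncoder for brevity, whereas the low-level version in this repo builds the encoder by hand:

```python
import torch
import torch.nn as nn

class TinyBert(nn.Module):
    def __init__(self, vocab_size, hidden=768, layers=12, heads=12, max_len=512):
        super().__init__()
        # Token / segment / position embedding tables.
        self.tok = nn.Embedding(vocab_size, hidden)
        self.seg = nn.Embedding(2, hidden)
        self.pos = nn.Embedding(max_len, hidden)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=4 * hidden,
            activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, input_ids, segment_ids):
        pos = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.tok(input_ids) + self.seg(segment_ids) + self.pos(pos)
        h = self.encoder(x)
        return h[:, 0], h  # [CLS] vector for classification, per-token states for token tasks
```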
## File Structure
```
08_BERT/
├── README.md
├── pytorch_lowlevel/
│   └── bert_lowlevel.py        # Direct implementation of the BERT encoder
├── paper/
│   └── bert_paper.py           # Paper reproduction
└── exercises/
    ├── 01_mlm_training.md      # MLM training exercise
    └── 02_finetuning.md        # Classification fine-tuning
```
## Core Concepts
### 1. Bidirectional Context
GPT (left-to-right):
"The cat sat" → predicts the next token using only the left context

BERT (bidirectional):
"The [MASK] sat on the mat" → predicts [MASK] using context from both sides

Advantage: richer understanding of context
Disadvantage: ill-suited to text generation
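The difference comes down to the attention mask. A small sketch contrasting the causal mask of a left-to-right model with BERT's unrestricted attention (0 = attend, -inf = blocked):

```python
import torch

seq_len = 5
# Causal mask: position i may only attend to positions <= i.
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
# Bidirectional "mask": nothing is blocked; every token sees every token.
bidirectional_mask = torch.zeros(seq_len, seq_len)

print(causal_mask)
print(bidirectional_mask)
```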
### 2. Pre-training & Fine-tuning
Phase 1: Pre-training (large-scale corpus)
- MLM + NSP tasks
- Wikipedia + BookCorpus (3.3B tokens)

Phase 2: Fine-tuning (downstream task)
- Classification via the [CLS] token (see the sketch below)
- Or sequence labeling via all token outputs
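A minimal sketch of a classification head on top of the [CLS] vector; `encoder` stands in for any pre-trained BERT body with the interface of the earlier sketch (returning the [CLS] vector first):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BertClassifier(nn.Module):
    def __init__(self, encoder, hidden=768, num_labels=2):
        super().__init__()
        self.encoder = encoder                       # pre-trained BERT body
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, segment_ids, labels=None):
        cls_vec, _ = self.encoder(input_ids, segment_ids)   # [batch, hidden]
        logits = self.classifier(cls_vec)
        if labels is not None:
            return F.cross_entropy(logits, labels), logits
        return logits
```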
### 3. Input Format
Single sentence: [CLS] tokens [SEP]
Sentence pair:   [CLS] tokens_A [SEP] tokens_B [SEP]

Segment IDs:

```
[CLS]  A  A  A  [SEP]  B  B  B  [SEP]
  0    0  0  0    0    1  1  1    1
```
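To see these segment IDs in practice, the HuggingFace tokenizer returns them as `token_type_ids` (this assumes the `transformers` package is installed; the vocab is downloaded on first run):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("How old are you?", "I am six.", return_tensors="pt")

print(tok.convert_ids_to_tokens(enc["input_ids"][0]))
print(enc["token_type_ids"][0])  # 0s for [CLS] + sentence A + first [SEP], 1s for sentence B
```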
ꡬν λ 벨¶
### Level 2: PyTorch Low-Level (pytorch_lowlevel/)
- Uses F.linear and F.layer_norm (see the sketch below)
- Does not use nn.TransformerEncoder
- Embeddings implemented by hand
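As an illustration of this style, a single-head self-attention block written with functional calls only (a sketch, not the repo's actual code):

```python
import math
import torch
import torch.nn.functional as F

def self_attention_block(x, w_q, w_k, w_v, w_o, ln_weight, ln_bias):
    """x: [batch, seq, hidden]; weights are plain tensors of shape [hidden, hidden]."""
    q = F.linear(x, w_q)
    k = F.linear(x, w_k)
    v = F.linear(x, w_v)
    # Scaled dot-product attention.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    attn = F.softmax(scores, dim=-1)
    out = F.linear(attn @ v, w_o)
    # Residual connection + layer norm, as in an encoder block.
    return F.layer_norm(x + out, (x.size(-1),), ln_weight, ln_bias)
```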
### Level 3: Paper Implementation (paper/)
- Faithful reproduction of the paper's specification
- MLM + NSP pre-training
- Classification fine-tuning
### Level 4: Code Analysis (separate document)
- Analysis of the HuggingFace transformers source code
- BertModel, BertForSequenceClassification (usage example below)
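A short usage example of the class being analyzed; this assumes `transformers` is installed and downloads the pre-trained weights on first run (the classification head is randomly initialized until fine-tuned):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tok("This movie was great!", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)
print(out.logits.shape)  # [1, 2]; logits are meaningless until the head is fine-tuned
```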
## Learning Checklist
- [ ] Understand the MLM masking strategy
- [ ] Understand the NSP task
- [ ] Understand Token/Segment/Position Embeddings
- [ ] Role of the [CLS] token
- [ ] Fine-tuning methods (classification, NER, QA)
- [ ] Differences between BERT and GPT
## References
- Devlin et al. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
- HuggingFace BERT
- ../LLM_and_NLP/03_BERT_GPT_Architecture.md