04. Understanding BERT

Learning objectives
- Understand the BERT architecture
- Pre-training objectives (MLM, NSP)
- Input representation
- BERT variants

1. BERT Overview

BERT = a stack of Transformer encoders

Key properties:
- Bidirectional context understanding
- Pre-training + fine-tuning paradigm
- General-purpose across many NLP tasks
Model sizes

| Model      | Layers | d_model | Heads | Parameters |
|------------|--------|---------|-------|------------|
| BERT-base  | 12     | 768     | 12    | 110M       |
| BERT-large | 24     | 1024    | 16    | 340M       |
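As a rough sanity check on the table above, the parameter counts can be approximated from the layer count, d_model, and BERT's ~30K WordPiece vocabulary. This is an illustrative estimate (biases and LayerNorm parameters are omitted), not an exact count:

```python
def approx_bert_params(num_layers, d_model, vocab_size=30522, max_len=512):
    """Rough parameter estimate for a BERT-style encoder."""
    embeddings = (vocab_size + max_len + 2) * d_model  # token + position + segment
    attention = 4 * d_model * d_model                  # Q, K, V, output projections
    ffn = 2 * d_model * (4 * d_model)                  # two linear layers, d_ff = 4*d_model
    return embeddings + num_layers * (attention + ffn)

print(f"BERT-base  ~ {approx_bert_params(12, 768) / 1e6:.0f}M")   # close to 110M
print(f"BERT-large ~ {approx_bert_params(24, 1024) / 1e6:.0f}M")  # close to 340M
```

The estimates land within a few percent of the official 110M / 340M figures, confirming that the embeddings plus the per-layer attention and feed-forward weights dominate the count.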
2. Input Representation

The input is the sum of three embeddings.

Input: [CLS] I love NLP [SEP] It is fun [SEP]

Token Embedding:    [E_CLS, E_I, E_love, E_NLP, E_SEP, E_It, E_is, E_fun, E_SEP]
Segment Embedding:  [E_A,   E_A, E_A,    E_A,   E_A,   E_B,  E_B,  E_B,  E_B  ]
Position Embedding: [E_0,   E_1, E_2,    E_3,   E_4,   E_5,  E_6,  E_7,  E_8  ]
-------------------------------------------------------------------------------
                       = final input embedding (element-wise sum)
Special tokens

| Token  | Role                                      |
|--------|-------------------------------------------|
| [CLS]  | Aggregate token used for classification   |
| [SEP]  | Sentence separator                        |
| [PAD]  | Padding                                   |
| [MASK] | Masked token for MLM                      |
| [UNK]  | Out-of-vocabulary word                    |
Input implementation

```python
import torch
import torch.nn as nn

class BERTEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model=768, max_len=512, dropout=0.1):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_len, d_model)
        self.segment_embedding = nn.Embedding(2, d_model)
        self.layer_norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_ids, segment_ids):
        seq_len = input_ids.size(1)
        # Position indices 0..seq_len-1, on the same device as the input
        position_ids = torch.arange(seq_len, device=input_ids.device)
        # Sum of the three embeddings
        embeddings = (
            self.token_embedding(input_ids) +
            self.position_embedding(position_ids) +
            self.segment_embedding(segment_ids)
        )
        embeddings = self.layer_norm(embeddings)
        return self.dropout(embeddings)
```
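A quick shape check of the embedding sum can be done standalone, without the class; the snippet below mirrors it functionally (the toy sizes are illustrative) and relies on broadcasting to add the (seq, d_model) position table to the batched tensors:

```python
import torch
import torch.nn as nn

vocab_size, d_model, max_len = 100, 16, 32
tok = nn.Embedding(vocab_size, d_model)
pos = nn.Embedding(max_len, d_model)
seg = nn.Embedding(2, d_model)

input_ids = torch.randint(0, vocab_size, (2, 9))  # batch of 2, seq len 9
segment_ids = torch.tensor([[0] * 5 + [1] * 4] * 2)  # sentence A / sentence B
position_ids = torch.arange(9)

# (2, 9, 16) + (9, 16) + (2, 9, 16) -> (2, 9, 16) via broadcasting
x = tok(input_ids) + pos(position_ids) + seg(segment_ids)
print(x.shape)  # torch.Size([2, 9, 16])
```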
3. Pre-training Objectives

Masked Language Model (MLM)

15% of the tokens are selected; of those:
- 80%: replaced with [MASK]
- 10%: replaced with a random token
- 10%: left unchanged

Example:
Input: "The cat sat on the mat"
    → "The [MASK] sat on the mat"
Goal: predict [MASK] → "cat"
```python
import random

def create_mlm_data(tokens, vocab, mask_prob=0.15):
    """Create MLM training data using the 80/10/10 masking rule."""
    labels = [-100] * len(tokens)  # -100 is ignored during loss computation
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = vocab[token]  # original token ID
            rand = random.random()
            if rand < 0.8:
                tokens[i] = '[MASK]'
            elif rand < 0.9:
                tokens[i] = random.choice(list(vocab.keys()))
            # else: keep the token unchanged
    return tokens, labels
```
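To see the 80/10/10 split in action, the standalone sketch below (with a toy five-word vocabulary of my own choosing) applies the same selection rule to a long token stream and counts the outcomes:

```python
import random

random.seed(0)
words = ['the', 'cat', 'sat', 'on', 'mat']
tokens = [random.choice(words) for _ in range(100_000)]

counts = {'mask': 0, 'random': 0, 'keep': 0}
for token in tokens:
    if random.random() < 0.15:  # select 15% of positions
        r = random.random()
        if r < 0.8:
            counts['mask'] += 1    # 80% -> [MASK]
        elif r < 0.9:
            counts['random'] += 1  # 10% -> random token
        else:
            counts['keep'] += 1    # 10% -> unchanged

total = sum(counts.values())
print({k: round(v / total, 2) for k, v in counts.items()})
# roughly {'mask': 0.8, 'random': 0.1, 'keep': 0.1}
```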
Next Sentence Prediction (NSP)

Input: [CLS] sentence A [SEP] sentence B [SEP]
Goal: binary classification of whether sentence B actually follows sentence A

Example:
Positive (IsNext):
  A: "The man went to the store"
  B: "He bought a gallon of milk"
Negative (NotNext):
  A: "The man went to the store"
  B: "Penguins are flightless birds"
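NSP training pairs can be sampled from a sentence-segmented corpus as follows (a minimal sketch with a made-up two-document corpus; the 50/50 IsNext/NotNext split matches the description above):

```python
import random

def make_nsp_pair(documents, doc_idx, sent_idx):
    """Return (sentence_a, sentence_b, is_next) with a 50/50 split."""
    sent_a = documents[doc_idx][sent_idx]
    # Positive case: take the actual next sentence when one exists
    if random.random() < 0.5 and sent_idx + 1 < len(documents[doc_idx]):
        return sent_a, documents[doc_idx][sent_idx + 1], 1  # IsNext
    # Negative case: take a random sentence from a different document
    other = random.choice([d for i, d in enumerate(documents) if i != doc_idx])
    return sent_a, random.choice(other), 0  # NotNext

docs = [
    ["The man went to the store.", "He bought a gallon of milk."],
    ["Penguins are flightless birds.", "They live in the Southern Hemisphere."],
]
a, b, label = make_nsp_pair(docs, 0, 0)
```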
```python
class BERTPreTrainingHeads(nn.Module):
    def __init__(self, d_model, vocab_size):
        super().__init__()
        # MLM head: vocabulary distribution at every position
        self.mlm = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.LayerNorm(d_model),
            nn.Linear(d_model, vocab_size)
        )
        # NSP head: binary IsNext / NotNext classifier
        self.nsp = nn.Linear(d_model, 2)

    def forward(self, sequence_output, cls_output):
        mlm_scores = self.mlm(sequence_output)  # (batch, seq, vocab)
        nsp_scores = self.nsp(cls_output)       # (batch, 2)
        return mlm_scores, nsp_scores
```
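The two heads are trained jointly. Because unselected positions carry the label -100, the MLM loss is computed only at masked positions via CrossEntropyLoss's ignore_index. A sketch with random tensors standing in for the head outputs:

```python
import torch
import torch.nn as nn

vocab_size, batch, seq = 50, 2, 8
mlm_scores = torch.randn(batch, seq, vocab_size)  # stand-in for the MLM head output
nsp_scores = torch.randn(batch, 2)                # stand-in for the NSP head output

mlm_labels = torch.full((batch, seq), -100)  # -100 = ignored by the loss
mlm_labels[0, 3] = 7                         # one masked position, target id 7
nsp_labels = torch.tensor([1, 0])            # IsNext / NotNext

loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
mlm_loss = loss_fn(mlm_scores.view(-1, vocab_size), mlm_labels.view(-1))
nsp_loss = nn.CrossEntropyLoss()(nsp_scores, nsp_labels)
total_loss = mlm_loss + nsp_loss
```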
4. Full BERT Architecture

```python
class BERT(nn.Module):
    def __init__(self, vocab_size, d_model=768, num_heads=12,
                 num_layers=12, d_ff=3072, max_len=512, dropout=0.1):
        super().__init__()
        self.embedding = BERTEmbedding(vocab_size, d_model, max_len, dropout)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=num_heads,
            dim_feedforward=d_ff,
            dropout=dropout,
            batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)

    def forward(self, input_ids, segment_ids, attention_mask=None):
        # Embeddings
        x = self.embedding(input_ids, segment_ids)
        # Convert padding mask: PyTorch expects True at padded positions
        if attention_mask is not None:
            attention_mask = (attention_mask == 0)
        # Encoder stack
        output = self.encoder(x, src_key_padding_mask=attention_mask)
        return output  # (batch, seq, d_model)
```
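The padding-mask convention (True marks positions to ignore) can be verified with a tiny standalone encoder; the sketch below is independent of the class above and uses illustrative toy dimensions:

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=16, nhead=2,
                                   dim_feedforward=32, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

x = torch.randn(2, 5, 16)                        # (batch, seq, d_model)
attention_mask = torch.tensor([[1, 1, 1, 0, 0],  # 1 = real token, 0 = padding
                               [1, 1, 1, 1, 1]])
key_padding_mask = (attention_mask == 0)         # True at padded positions

out = encoder(x, src_key_padding_mask=key_padding_mask)
print(out.shape)  # torch.Size([2, 5, 16])
```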
```python
class BERTForPreTraining(nn.Module):
    def __init__(self, vocab_size, d_model=768, **kwargs):
        super().__init__()
        self.bert = BERT(vocab_size, d_model, **kwargs)
        self.heads = BERTPreTrainingHeads(d_model, vocab_size)

    def forward(self, input_ids, segment_ids, attention_mask=None):
        sequence_output = self.bert(input_ids, segment_ids, attention_mask)
        cls_output = sequence_output[:, 0]  # [CLS] token
        mlm_scores, nsp_scores = self.heads(sequence_output, cls_output)
        return mlm_scores, nsp_scores
```
5. Fine-tuning Patterns

Sentence classification (single sentence)

```python
class BERTForSequenceClassification(nn.Module):
    def __init__(self, bert, num_classes, dropout=0.1):
        super().__init__()
        self.bert = bert
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(
            bert.embedding.token_embedding.embedding_dim, num_classes)

    def forward(self, input_ids, segment_ids, attention_mask):
        output = self.bert(input_ids, segment_ids, attention_mask)
        cls_output = output[:, 0]  # [CLS]
        cls_output = self.dropout(cls_output)
        return self.classifier(cls_output)
```
Token classification (NER)

```python
class BERTForTokenClassification(nn.Module):
    def __init__(self, bert, num_labels, dropout=0.1):
        super().__init__()
        self.bert = bert
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(
            bert.embedding.token_embedding.embedding_dim, num_labels)

    def forward(self, input_ids, segment_ids, attention_mask):
        output = self.bert(input_ids, segment_ids, attention_mask)
        output = self.dropout(output)
        return self.classifier(output)  # (batch, seq, num_labels)
```
Question answering (QA)

```python
class BERTForQuestionAnswering(nn.Module):
    def __init__(self, bert):
        super().__init__()
        self.bert = bert
        hidden_size = bert.embedding.token_embedding.embedding_dim
        self.qa_outputs = nn.Linear(hidden_size, 2)  # start, end

    def forward(self, input_ids, segment_ids, attention_mask):
        output = self.bert(input_ids, segment_ids, attention_mask)
        logits = self.qa_outputs(output)  # (batch, seq, 2)
        start_logits = logits[:, :, 0]    # (batch, seq)
        end_logits = logits[:, :, 1]
        return start_logits, end_logits
```
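Given the start/end logits, the predicted answer span is typically taken by argmax over positions. A minimal sketch with random logits standing in for model output (real usage would also mask out the question tokens and enforce a maximum span length):

```python
import torch

seq_len = 10
start_logits = torch.randn(1, seq_len)  # stand-in for model output
end_logits = torch.randn(1, seq_len)

start = start_logits.argmax(dim=-1).item()
end = end_logits.argmax(dim=-1).item()
if end < start:  # simple fallback for an invalid (end-before-start) span
    end = start
answer_span = (start, end)  # token indices of the predicted answer
```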
6. BERT Variants

RoBERTa

Changes:
- NSP removed (MLM only)
- Dynamic masking (a different mask pattern each epoch)
- Larger batches, longer training
- Byte-level BPE tokenizer

Result: better performance than BERT

ALBERT

Changes:
- Factorized embedding (V×H → V×E + E×H, with E ≪ H)
- Cross-layer parameter sharing
- NSP → SOP (Sentence Order Prediction)

Result: far fewer parameters at similar performance

DistilBERT

Changes:
- Knowledge distillation (teacher: BERT → student: smaller model)
- 6 layers (half of BERT)

Result: 40% smaller, 60% faster, retains ~97% of performance
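DistilBERT's training adds a soft-target distillation loss between teacher and student logits to the usual MLM loss. A minimal sketch of the KL term, with random tensors standing in for the two models' outputs (the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

T = 2.0                               # softmax temperature
teacher_logits = torch.randn(4, 100)  # stand-in for teacher output (batch, vocab)
student_logits = torch.randn(4, 100)  # stand-in for student output

# Soften both distributions with the temperature, then take the KL divergence
soft_targets = F.softmax(teacher_logits / T, dim=-1)
log_probs = F.log_softmax(student_logits / T, dim=-1)
distill_loss = F.kl_div(log_probs, soft_targets, reduction='batchmean') * T * T
```

The T² factor compensates for the gradient scaling introduced by the temperature, a standard choice in distillation setups.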
Comparison

| Model       | Layers | Parameters | Speed | Highlight              |
|-------------|--------|------------|-------|------------------------|
| BERT-base   | 12     | 110M       | 1x    | baseline               |
| RoBERTa     | 12     | 125M       | 1x    | optimized training     |
| ALBERT-base | 12     | 12M        | 1x    | parameter sharing      |
| DistilBERT  | 6      | 66M        | 2x    | knowledge distillation |
7. Using BERT with HuggingFace

Basic usage

```python
from transformers import BertTokenizer, BertModel

# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Encode
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors='pt')

# Forward pass
outputs = model(**inputs)

# Outputs
last_hidden_state = outputs.last_hidden_state  # (1, seq, 768)
pooler_output = outputs.pooler_output          # (1, 768) - transformed [CLS]
```
Classification model

```python
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2
)

inputs = tokenizer("I love this movie!", return_tensors='pt')
outputs = model(**inputs)
logits = outputs.logits  # (1, 2)
```
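The logits are converted to a predicted class with softmax and argmax. A standalone sketch with hand-picked example logits (the values are illustrative, not real model output):

```python
import torch

logits = torch.tensor([[0.2, 1.5]])       # (1, num_labels), as returned above
probs = torch.softmax(logits, dim=-1)     # normalized class probabilities
pred = probs.argmax(dim=-1).item()        # index of the most likely class
print(pred)  # 1
```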
Attention visualization

```python
from transformers import BertModel

model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)

inputs = tokenizer("The cat sat on the mat", return_tensors='pt')
outputs = model(**inputs)

# Attention weights: tuple of num_layers tensors, each (batch, heads, seq, seq)
attentions = outputs.attentions

# First layer, first head
attn = attentions[0][0, 0].detach().numpy()
```
8. BERT Input Formats

Single sentence

[CLS] sentence [SEP]
segment_ids: [0, 0, 0, ..., 0]

Sentence pair

[CLS] sentence A [SEP] sentence B [SEP]
segment_ids: [0, 0, ..., 0, 1, 1, ..., 1]

Pair handling in HuggingFace

```python
# Two-sentence input
text_a = "How old are you?"
text_b = "I am 25 years old."

inputs = tokenizer(
    text_a, text_b,
    padding='max_length',
    max_length=32,
    truncation=True,
    return_tensors='pt'
)

print(inputs['token_type_ids'])  # segment_ids
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, ...]
```
Summary

Key concepts
- Bidirectional encoder: attends to the full context in both directions
- MLM: learns context by predicting masked tokens
- NSP: models sentence relationships (removed in RoBERTa)
- [CLS] token: sentence-level representation
- Segment embedding: distinguishes the two sentences

Key code

```python
# HuggingFace BERT
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Encode
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0]  # [CLS]
```
Next Steps

05_GPT_Understanding.md covers GPT and autoregressive language models.