04. Understanding BERT

Learning Objectives

  • Understand the BERT architecture
  • Pre-training objectives (MLM and NSP)
  • Input representation
  • BERT variants

1. BERT Overview

Bidirectional Encoder Representations from Transformers

BERT = a stack of Transformer encoders

Key properties:
- Bidirectional context understanding
- Pre-training + fine-tuning paradigm
- Applies broadly across NLP tasks

๋ชจ๋ธ ํฌ๊ธฐ

๋ชจ๋ธ ๋ ˆ์ด์–ด d_model ํ—ค๋“œ ํŒŒ๋ผ๋ฏธํ„ฐ
BERT-base 12 768 12 110M
BERT-large 24 1024 16 340M

2. Input Representation

Sum of Three Embeddings

Input: [CLS] I love NLP [SEP] It is fun [SEP]

Token Embedding:    [E_CLS, E_I, E_love, E_NLP, E_SEP, E_It, E_is, E_fun, E_SEP]
Segment Embedding:  [E_A,   E_A, E_A,    E_A,   E_A,   E_B,  E_B,  E_B,   E_B  ]
Position Embedding: [E_0,   E_1, E_2,    E_3,   E_4,   E_5,  E_6,  E_7,   E_8  ]
                    โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
                    = final input embedding (sum)

ํŠน์ˆ˜ ํ† ํฐ

ํ† ํฐ ์—ญํ• 
[CLS] ๋ถ„๋ฅ˜ ํƒœ์Šคํฌ์šฉ ์ง‘๊ณ„ ํ† ํฐ
[SEP] ๋ฌธ์žฅ ๊ตฌ๋ถ„์ž
[PAD] ํŒจ๋”ฉ
[MASK] MLM์—์„œ ๋งˆ์Šคํ‚น๋œ ํ† ํฐ
[UNK] ๋ฏธ๋“ฑ๋ก ๋‹จ์–ด

Input Implementation

import torch
import torch.nn as nn

class BERTEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model=768, max_len=512, dropout=0.1):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_len, d_model)
        self.segment_embedding = nn.Embedding(2, d_model)
        self.layer_norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_ids, segment_ids):
        seq_len = input_ids.size(1)

        # position indices
        position_ids = torch.arange(seq_len, device=input_ids.device)

        # sum the three embeddings
        embeddings = (
            self.token_embedding(input_ids) +
            self.position_embedding(position_ids) +
            self.segment_embedding(segment_ids)
        )

        embeddings = self.layer_norm(embeddings)
        return self.dropout(embeddings)
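
The three-way embedding sum can be checked with a standalone sketch (toy sizes chosen for illustration; the three tables mirror those in BERTEmbedding above):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model, max_len = 100, 16, 32

tok = nn.Embedding(vocab_size, d_model)  # token embedding
pos = nn.Embedding(max_len, d_model)     # position embedding
seg = nn.Embedding(2, d_model)           # segment embedding (A=0, B=1)

input_ids = torch.randint(0, vocab_size, (2, 9))     # (batch, seq)
segment_ids = torch.tensor([[0] * 5 + [1] * 4] * 2)  # sentence A / sentence B
position_ids = torch.arange(9)

# position embedding (seq, d_model) broadcasts across the batch dimension
x = tok(input_ids) + pos(position_ids) + seg(segment_ids)
print(x.shape)  # torch.Size([2, 9, 16])
```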

3. Pre-training Objectives

Masked Language Model (MLM)

15% of tokens are selected; of those:
- 80%: replaced with [MASK]
- 10%: replaced with a random token
- 10%: kept unchanged

Example:
Input: "The cat sat on the mat"
     → "The [MASK] sat on the mat"
Target: predict "cat" for the [MASK] position

import random

def create_mlm_data(tokens, vocab, mask_prob=0.15):
    """MLM ํ•™์Šต ๋ฐ์ดํ„ฐ ์ƒ์„ฑ"""
    labels = [-100] * len(tokens)  # -100์€ ์†์‹ค ๊ณ„์‚ฐ์—์„œ ๋ฌด์‹œ

    for i, token in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = vocab[token]  # original token ID

            rand = random.random()
            if rand < 0.8:
                tokens[i] = '[MASK]'
            elif rand < 0.9:
                tokens[i] = random.choice(list(vocab.keys()))
            # else: keep unchanged

    return tokens, labels
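
The -100 labels above work because PyTorch's CrossEntropyLoss can skip them via ignore_index; a minimal sketch with dummy logits:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 10
logits = torch.randn(1, 5, vocab_size)             # (batch, seq, vocab)
labels = torch.tensor([[-100, 3, -100, -100, 7]])  # only positions 1 and 4 are scored

loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
loss = loss_fn(logits.view(-1, vocab_size), labels.view(-1))
# the loss is averaged over the two non-ignored positions only
```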

Next Sentence Prediction (NSP)

Input: [CLS] sentence A [SEP] sentence B [SEP]
Target: binary classification of whether sentence B actually follows sentence A

Example:
Positive (IsNext):
    A: "The man went to the store"
    B: "He bought a gallon of milk"

Negative (NotNext):
    A: "The man went to the store"
    B: "Penguins are flightless birds"

class BERTPreTrainingHeads(nn.Module):
    def __init__(self, d_model, vocab_size):
        super().__init__()
        # MLM head
        self.mlm = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.LayerNorm(d_model),
            nn.Linear(d_model, vocab_size)
        )
        # NSP head
        self.nsp = nn.Linear(d_model, 2)

    def forward(self, sequence_output, cls_output):
        mlm_scores = self.mlm(sequence_output)  # (batch, seq, vocab)
        nsp_scores = self.nsp(cls_output)       # (batch, 2)
        return mlm_scores, nsp_scores

4. Full BERT Architecture

class BERT(nn.Module):
    def __init__(self, vocab_size, d_model=768, num_heads=12,
                 num_layers=12, d_ff=3072, max_len=512, dropout=0.1):
        super().__init__()

        self.embedding = BERTEmbedding(vocab_size, d_model, max_len, dropout)

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=num_heads,
            dim_feedforward=d_ff,
            dropout=dropout,
            batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)

    def forward(self, input_ids, segment_ids, attention_mask=None):
        # embeddings
        x = self.embedding(input_ids, segment_ids)

        # convert the HuggingFace-style mask (1 = token, 0 = padding)
        if attention_mask is not None:
            # src_key_padding_mask expects True at padding positions
            attention_mask = (attention_mask == 0)

        # ์ธ์ฝ”๋”
        output = self.encoder(x, src_key_padding_mask=attention_mask)

        return output  # (batch, seq, d_model)


class BERTForPreTraining(nn.Module):
    def __init__(self, vocab_size, d_model=768, **kwargs):
        super().__init__()
        self.bert = BERT(vocab_size, d_model, **kwargs)
        self.heads = BERTPreTrainingHeads(d_model, vocab_size)

    def forward(self, input_ids, segment_ids, attention_mask=None):
        sequence_output = self.bert(input_ids, segment_ids, attention_mask)
        cls_output = sequence_output[:, 0]  # [CLS] token

        mlm_scores, nsp_scores = self.heads(sequence_output, cls_output)
        return mlm_scores, nsp_scores
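
The two heads are trained with a summed loss; a sketch with dummy scores whose shapes match BERTForPreTraining's outputs:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 100
mlm_scores = torch.randn(2, 9, vocab_size)   # MLM head output (batch, seq, vocab)
nsp_scores = torch.randn(2, 2)               # NSP head output (batch, 2)

mlm_labels = torch.full((2, 9), -100, dtype=torch.long)  # -100 = not masked
mlm_labels[0, 2], mlm_labels[1, 5] = 42, 7               # two masked positions
nsp_labels = torch.tensor([0, 1])                        # IsNext / NotNext

ce = nn.CrossEntropyLoss(ignore_index=-100)
mlm_loss = ce(mlm_scores.view(-1, vocab_size), mlm_labels.view(-1))
nsp_loss = nn.CrossEntropyLoss()(nsp_scores, nsp_labels)
total_loss = mlm_loss + nsp_loss  # the pre-training objective is the sum
```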

5. ํŒŒ์ธํŠœ๋‹ ํŒจํ„ด

๋ฌธ์žฅ ๋ถ„๋ฅ˜ (Single Sentence)

class BERTForSequenceClassification(nn.Module):
    def __init__(self, bert, num_classes, dropout=0.1):
        super().__init__()
        self.bert = bert
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(bert.embedding.token_embedding.embedding_dim,
                                    num_classes)

    def forward(self, input_ids, segment_ids, attention_mask):
        output = self.bert(input_ids, segment_ids, attention_mask)
        cls_output = output[:, 0]  # [CLS]
        cls_output = self.dropout(cls_output)
        return self.classifier(cls_output)

Token Classification (NER)

class BERTForTokenClassification(nn.Module):
    def __init__(self, bert, num_labels, dropout=0.1):
        super().__init__()
        self.bert = bert
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(bert.embedding.token_embedding.embedding_dim,
                                    num_labels)

    def forward(self, input_ids, segment_ids, attention_mask):
        output = self.bert(input_ids, segment_ids, attention_mask)
        output = self.dropout(output)
        return self.classifier(output)  # (batch, seq, num_labels)

์งˆ์˜์‘๋‹ต (QA)

class BERTForQuestionAnswering(nn.Module):
    def __init__(self, bert):
        super().__init__()
        self.bert = bert
        hidden_size = bert.embedding.token_embedding.embedding_dim
        self.qa_outputs = nn.Linear(hidden_size, 2)  # start, end

    def forward(self, input_ids, segment_ids, attention_mask):
        output = self.bert(input_ids, segment_ids, attention_mask)
        logits = self.qa_outputs(output)  # (batch, seq, 2)

        start_logits = logits[:, :, 0]  # (batch, seq)
        end_logits = logits[:, :, 1]

        return start_logits, end_logits
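
At inference time the answer span is usually read off the two logit vectors with an argmax; a standalone sketch with random logits standing in for model output:

```python
import torch

torch.manual_seed(0)
start_logits = torch.randn(1, 12)  # (batch, seq)
end_logits = torch.randn(1, 12)

start = start_logits.argmax(dim=-1).item()
end = end_logits.argmax(dim=-1).item()
# answer = tokens[start : end + 1], valid only when start <= end;
# production code also masks out the question and special-token positions
```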

6. BERT Variants

RoBERTa

Changes:
- NSP removed (MLM only)
- Dynamic masking (a different mask each epoch)
- Larger batches, longer training
- Byte-level BPE tokenizer

Result: improved performance over BERT

ALBERT

Changes:
- Factorized embedding (V×H → V×E + E×H, with E ≪ H)
- Cross-layer parameter sharing
- NSP → SOP (Sentence Order Prediction)

Result: far fewer parameters, similar performance

DistilBERT

Changes:
- Knowledge distillation (teacher: BERT → student: a smaller model)
- 6 layers (half of BERT-base)

Result: 40% smaller, 60% faster, retains 97% of BERT's performance
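
The distillation objective can be sketched as a temperature-scaled KL term between teacher and student logits; random tensors here stand in for real model outputs:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T = 2.0  # temperature: softens both distributions

teacher_logits = torch.randn(4, 10)  # (batch, classes)
student_logits = torch.randn(4, 10)

# KL divergence between the softened student and teacher distributions,
# scaled by T^2 to keep gradient magnitudes comparable across temperatures
soft_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * (T * T)
```

In practice this soft-target term is combined with the ordinary hard-label cross-entropy loss.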

Comparison

Model        Layers  Parameters  Speed  Notes
BERT-base    12      110M        1x     baseline
RoBERTa      12      125M        1x     optimized training recipe
ALBERT-base  12      12M         1x     parameter sharing
DistilBERT   6       66M         2x     knowledge distillation

7. Using BERT with HuggingFace

Basic Usage

from transformers import BertTokenizer, BertModel

# ํ† ํฌ๋‚˜์ด์ €์™€ ๋ชจ๋ธ ๋กœ๋“œ
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# encode
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors='pt')

# forward pass
outputs = model(**inputs)

# outputs
last_hidden_state = outputs.last_hidden_state  # (1, seq, 768)
pooler_output = outputs.pooler_output          # (1, 768): [CLS] through a linear + tanh layer
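
Besides pooler_output, a common sentence vector is a mean over last_hidden_state that excludes padding; a sketch with dummy tensors in place of real model output:

```python
import torch

torch.manual_seed(0)
hidden = torch.randn(1, 7, 768)                       # last_hidden_state
mask = torch.tensor([[1, 1, 1, 1, 1, 0, 0]]).float()  # attention_mask

# zero out padded positions, then average over the real tokens only
pooled = (hidden * mask.unsqueeze(-1)).sum(dim=1) / mask.sum(dim=1, keepdim=True)
print(pooled.shape)  # torch.Size([1, 768])
```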

๋ถ„๋ฅ˜ ๋ชจ๋ธ

from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2
)

inputs = tokenizer("I love this movie!", return_tensors='pt')
outputs = model(**inputs)
logits = outputs.logits  # (1, 2)

Attention Visualization

from transformers import BertModel

model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)

inputs = tokenizer("The cat sat on the mat", return_tensors='pt')
outputs = model(**inputs)

# attentions: tuple of num_layers tensors, each (batch, heads, seq, seq)
attentions = outputs.attentions

# first layer, first head
attn = attentions[0][0, 0].detach().numpy()

8. BERT Input Formats

Single Sentence

[CLS] sentence [SEP]
segment_ids: [0, 0, 0, ..., 0]

Sentence Pair

[CLS] sentence A [SEP] sentence B [SEP]
segment_ids: [0, 0, ..., 0, 1, 1, ..., 1]

HuggingFace์—์„œ Pair ์ฒ˜๋ฆฌ

# ๋‘ ๋ฌธ์žฅ ์ž…๋ ฅ
text_a = "How old are you?"
text_b = "I am 25 years old."

inputs = tokenizer(
    text_a, text_b,
    padding='max_length',
    max_length=32,
    truncation=True,
    return_tensors='pt'
)

print(inputs['token_type_ids'])  # segment_ids
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, ...]

Summary

Key Concepts

  1. Bidirectional encoder: understands context in both directions
  2. MLM: learns context by predicting masked tokens
  3. NSP: learns sentence relationships (removed in RoBERTa)
  4. [CLS] token: sentence-level representation
  5. Segment embedding: distinguishes the two sentences

Key Code

# HuggingFace BERT
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# encode
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0]  # [CLS]

๋‹ค์Œ ๋‹จ๊ณ„

05_GPT_Understanding.md์—์„œ GPT ๋ชจ๋ธ๊ณผ ์ž๊ธฐํšŒ๊ท€ ์–ธ์–ด ๋ชจ๋ธ์„ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.
