02. Word2Vec and GloVe

Learning Objectives

  • The concept of distributed representations
  • Word2Vec (Skip-gram, CBOW)
  • GloVe embeddings
  • Using pretrained embeddings

1. Word Embedding Overview

One-Hot vs. Distributed Representations

One-hot (sparse representation):
    "king"  β†’ [1, 0, 0, 0, ...]  (V-dimensional)
    "queen" β†’ [0, 1, 0, 0, ...]

Problem: cannot express semantic similarity
      cosine_similarity(king, queen) = 0

Distributed representation (dense):
    "king"  β†’ [0.2, -0.5, 0.8, ...]  (d-dimensional, d << V)
    "queen" β†’ [0.3, -0.4, 0.7, ...]

Advantage: reflects semantic similarity
      cosine_similarity(king, queen) β‰ˆ 0.9
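
A quick numeric check of the claim above, as a minimal sketch (the dense vectors are made-up illustrative values, not trained embeddings):

import torch
import torch.nn.functional as F

# One-hot vectors are orthogonal, so their cosine similarity is exactly 0
king_onehot  = torch.tensor([1., 0., 0., 0.])
queen_onehot = torch.tensor([0., 1., 0., 0.])
print(F.cosine_similarity(king_onehot, queen_onehot, dim=0))  # tensor(0.)

# Dense vectors pointing in similar directions have a high cosine similarity
king_dense  = torch.tensor([0.2, -0.5, 0.8])
queen_dense = torch.tensor([0.3, -0.4, 0.7])
print(F.cosine_similarity(king_dense, queen_dense, dim=0))    # β‰ˆ 0.99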

The Distributional Hypothesis

"Words that occur in the same contexts tend to have similar meanings" ("You shall know a word by the company it keeps")

"The cat sat on the ___"  β†’ mat, floor, couch
"The dog lay on the ___"  β†’ mat, floor, couch

cat β‰ˆ dog (similar contexts)

2. Word2Vec

Skip-gram

Learns the center word's representation by predicting its surrounding words.

Input: center word β†’ Predict: context words

Sentence: "The quick brown fox jumps"
Center word: "brown" (window=2)
Prediction targets: ["The", "quick", "fox", "jumps"] (with window=1: ["quick", "fox"])

Model:
    "brown" β†’ embedding β†’ Softmax β†’ P(context | center)

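As a concrete illustration, here is a minimal sketch of how (center, context) training pairs could be generated from a tokenized sentence; the helper name below is illustrative, not part of the original Word2Vec implementation:

def make_skipgram_pairs(tokens, window=2):
    """Generate (center, context) pairs from a list of tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        # context = every word within `window` positions of the center
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "The quick brown fox jumps".split()
pairs = make_skipgram_pairs(tokens, window=2)
# includes ('brown', 'The'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps'), ...
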
CBOW (Continuous Bag of Words)

Predicts the center word from its surrounding words.

Input: context words β†’ Predict: center word

Sentence: "The quick brown fox jumps"
Context words: ["quick", "fox"] (window=1)
Prediction target: "brown"

Model:
    ["quick", "fox"] β†’ averaged embeddings β†’ Softmax β†’ P(center | context)
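
The matching (context words β†’ center word) pairs for CBOW could be built the same way; again, the helper name is illustrative:

def make_cbow_pairs(tokens, window=1):
    """Generate (context words, center word) pairs from a list of tokens."""
    pairs = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs

tokens = "The quick brown fox jumps".split()
pairs = make_cbow_pairs(tokens, window=1)
# [(['The', 'brown'], 'quick'), (['quick', 'fox'], 'brown'), (['brown', 'jumps'], 'fox')]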

Word2Vec Architecture

import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        # input embeddings (center words)
        self.center_embeddings = nn.Embedding(vocab_size, embed_dim)
        # output embeddings (context words)
        self.context_embeddings = nn.Embedding(vocab_size, embed_dim)

    def forward(self, center, context):
        # center: (batch,)
        # context: (batch,)
        center_emb = self.center_embeddings(center)   # (batch, embed)
        context_emb = self.context_embeddings(context)  # (batch, embed)

        # λ‚΄μ μœΌλ‘œ μœ μ‚¬λ„ 계산
        score = (center_emb * context_emb).sum(dim=1)  # (batch,)
        return score

class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.context_embeddings = nn.Embedding(vocab_size, embed_dim)
        self.center_embeddings = nn.Embedding(vocab_size, embed_dim)

    def forward(self, context, center):
        # context: (batch, window*2)
        # center: (batch,)
        context_emb = self.context_embeddings(context)  # (batch, window*2, embed)
        context_mean = context_emb.mean(dim=1)  # (batch, embed)

        center_emb = self.center_embeddings(center)  # (batch, embed)

        score = (context_mean * center_emb).sum(dim=1)
        return score

Negative Sampling

Computing a softmax over the entire vocabulary is expensive, so negative sampling instead contrasts each true context word against k randomly sampled words.

class SkipGramNegSampling(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.center_embeddings = nn.Embedding(vocab_size, embed_dim)
        self.context_embeddings = nn.Embedding(vocab_size, embed_dim)

    def forward(self, center, context, neg_context):
        # center: (batch,)
        # context: (batch,) - actual context words
        # neg_context: (batch, k) - randomly sampled negative words

        center_emb = self.center_embeddings(center)  # (batch, embed)

        # Positive: similarity with the actual context word
        pos_emb = self.context_embeddings(context)
        pos_score = (center_emb * pos_emb).sum(dim=1)  # (batch,)

        # Negative: similarity with the random words
        neg_emb = self.context_embeddings(neg_context)  # (batch, k, embed)
        neg_score = torch.bmm(neg_emb, center_emb.unsqueeze(2)).squeeze(-1)  # (batch, k)

        return pos_score, neg_score

# Loss function
def negative_sampling_loss(pos_score, neg_score):
    pos_loss = -torch.log(torch.sigmoid(pos_score) + 1e-10)
    neg_loss = -torch.log(torch.sigmoid(-neg_score) + 1e-10).sum(dim=1)
    return (pos_loss + neg_loss).mean()
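
A rough training-loop sketch for SkipGramNegSampling, assuming integer-encoded (center, context) pairs and uniform negative sampling (the original Word2Vec draws negatives from a smoothed unigram distribution; the sizes and dummy data below are purely illustrative):

vocab_size, embed_dim, k = 10000, 100, 5
model = SkipGramNegSampling(vocab_size, embed_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# centers / contexts: LongTensors of word indices, shape (num_pairs,)
centers = torch.randint(0, vocab_size, (1024,))    # dummy data for illustration
contexts = torch.randint(0, vocab_size, (1024,))

for epoch in range(5):
    # draw k random negative words per pair (uniform sampling for simplicity)
    neg = torch.randint(0, vocab_size, (centers.size(0), k))

    pos_score, neg_score = model(centers, contexts, neg)
    loss = negative_sampling_loss(pos_score, neg_score)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")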

3. GloVe

Concept

Uses global co-occurrence statistics.

Co-occurrence matrix X:
    X[i,j] = number of times words i and j appear together

Objective:
    w_i Β· w_j + b_i + b_j β‰ˆ log(X[i,j])
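
A small sketch of how X could be built from a tokenized corpus (word2idx is assumed to map tokens to indices; the 1/distance weighting of distant context words used in the original GloVe is omitted for simplicity):

import torch

def build_cooccurrence(corpus, word2idx, window=2):
    """corpus: list of token lists. Returns a dense (V, V) co-occurrence matrix."""
    V = len(word2idx)
    X = torch.zeros(V, V)
    for tokens in corpus:
        ids = [word2idx[t] for t in tokens if t in word2idx]
        for i, wi in enumerate(ids):
            for j in range(max(0, i - window), min(len(ids), i + window + 1)):
                if j != i:
                    X[wi, ids[j]] += 1.0
    return X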

GloVe Loss Function

def glove_loss(w_i, w_j, b_i, b_j, X_ij, x_max=100, alpha=0.75):
    """
    w_i, w_j: word embeddings
    b_i, b_j: biases
    X_ij: co-occurrence counts
    """
    # weighting function (damps very frequent co-occurrences)
    weight = torch.clamp(X_ij / x_max, max=1.0) ** alpha

    # difference between prediction and target
    prediction = (w_i * w_j).sum(dim=1) + b_i + b_j
    target = torch.log(X_ij + 1e-10)

    loss = weight * (prediction - target) ** 2
    return loss.mean()

GloVe Implementation

class GloVe(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        # two embedding matrices (word and context)
        self.w_embeddings = nn.Embedding(vocab_size, embed_dim)
        self.c_embeddings = nn.Embedding(vocab_size, embed_dim)
        self.w_bias = nn.Embedding(vocab_size, 1)
        self.c_bias = nn.Embedding(vocab_size, 1)

    def forward(self, i, j, cooccur):
        w_i = self.w_embeddings(i)
        w_j = self.c_embeddings(j)
        b_i = self.w_bias(i).squeeze()
        b_j = self.c_bias(j).squeeze()

        return glove_loss(w_i, w_j, b_i, b_j, cooccur)

    def get_embedding(self, word_idx):
        # final embedding: average of the two embedding matrices
        return (self.w_embeddings.weight[word_idx] +
                self.c_embeddings.weight[word_idx]) / 2
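
A rough training sketch over the nonzero entries of a co-occurrence matrix X (such as the one built above), using Adagrad as in the original GloVe paper; the sizes and epoch count here are placeholders, not a tuned recipe:

vocab_size, embed_dim = 5000, 100
model = GloVe(vocab_size, embed_dim)
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.05)

# train only on nonzero co-occurrence entries (i, j, X[i, j])
i_idx, j_idx = X.nonzero(as_tuple=True)
counts = X[i_idx, j_idx]

for epoch in range(20):
    loss = model(i_idx, j_idx, counts)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()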

4. Using Pretrained Embeddings

Gensim Word2Vec

from gensim.models import Word2Vec

# Training
sentences = [["I", "love", "NLP"], ["NLP", "is", "fun"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

# Most similar words
similar = model.wv.most_similar("NLP", topn=5)

# Get a word vector
vector = model.wv["NLP"]

# Save/load
model.save("word2vec.model")
model = Word2Vec.load("word2vec.model")

Pretrained GloVe

import numpy as np

def load_glove(path, embed_dim=100):
    """Load a GloVe text file into a dict of word β†’ vector"""
    embeddings = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

# Usage
glove = load_glove('glove.6B.100d.txt')
vector = glove.get('king', np.zeros(100))

Applying Pretrained Vectors to a PyTorch Embedding Layer

import torch
import torch.nn as nn

def create_embedding_layer(vocab, glove, embed_dim=100, freeze=True):
    """Initialize an nn.Embedding layer with pretrained vectors"""
    vocab_size = len(vocab)
    embedding_matrix = torch.zeros(vocab_size, embed_dim)

    found = 0
    for word, idx in vocab.word2idx.items():
        if word in glove:
            embedding_matrix[idx] = torch.from_numpy(glove[word])
            found += 1
        else:
            # random initialization for out-of-vocabulary words
            embedding_matrix[idx] = torch.randn(embed_dim) * 0.1

    print(f"Pretrained embeddings applied: {found}/{vocab_size}")

    embedding = nn.Embedding.from_pretrained(
        embedding_matrix,
        freeze=freeze,  # if True, the embeddings are not updated during training
        padding_idx=vocab.word2idx.get('<pad>', 0)
    )
    return embedding

# λͺ¨λΈμ— 적용
class TextClassifier(nn.Module):
    def __init__(self, vocab, glove, num_classes):
        super().__init__()
        self.embedding = create_embedding_layer(vocab, glove, freeze=False)
        self.fc = nn.Linear(100, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)  # (batch, seq, 100)
        pooled = embedded.mean(dim=1)  # mean pooling
        return self.fc(pooled)
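
A quick usage sketch with dummy inputs; vocab and glove are assumed to come from the loaders above, and the batch and labels are made up for illustration:

model = TextClassifier(vocab, glove, num_classes=2)

x = torch.randint(0, len(vocab), (4, 12))   # batch of 4 padded index sequences of length 12
logits = model(x)                           # (4, 2)
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 1, 0]))
loss.backward()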

5. Embedding Operations

μœ μ‚¬λ„ 계산

import torch
import torch.nn.functional as F

def cosine_similarity(v1, v2):
    return F.cosine_similarity(v1.unsqueeze(0), v2.unsqueeze(0))

# Find the most similar words
def most_similar(word, embeddings, vocab, topk=5):
    word_vec = embeddings[vocab[word]]
    similarities = F.cosine_similarity(word_vec.unsqueeze(0), embeddings)
    values, indices = similarities.topk(topk + 1)

    results = []
    for val, idx in zip(values[1:], indices[1:]):  # skip the word itself
        results.append((vocab.idx2word[idx.item()], val.item()))
    return results

Word Vector Arithmetic

def word_analogy(a, b, c, embeddings, vocab, topk=5):
    """
    a : b = c : ?
    e.g., king : queen = man : woman

    vector(?) = vector(b) - vector(a) + vector(c)
    """
    vec_a = embeddings[vocab[a]]
    vec_b = embeddings[vocab[b]]
    vec_c = embeddings[vocab[c]]

    # analogy vector
    target_vec = vec_b - vec_a + vec_c

    # find the most similar words
    similarities = F.cosine_similarity(target_vec.unsqueeze(0), embeddings)
    values, indices = similarities.topk(topk + 3)

    # exclude a, b, c
    exclude = {vocab[a], vocab[b], vocab[c]}
    results = []
    for val, idx in zip(values, indices):
        if idx.item() not in exclude:
            results.append((vocab.idx2word[idx.item()], val.item()))
        if len(results) == topk:
            break
    return results

# Example
# word_analogy("king", "queen", "man", embeddings, vocab)
# β†’ [("woman", 0.85), ...]

6. μ‹œκ°ν™”

t-SNE Visualization

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def visualize_embeddings(embeddings, words, vocab):
    # μ„ νƒν•œ λ‹¨μ–΄μ˜ μž„λ² λ”©
    indices = [vocab[w] for w in words]
    vectors = embeddings[indices].numpy()

    # t-SNE
    tsne = TSNE(n_components=2, random_state=42, perplexity=min(30, len(words)-1))
    reduced = tsne.fit_transform(vectors)

    # μ‹œκ°ν™”
    plt.figure(figsize=(12, 8))
    plt.scatter(reduced[:, 0], reduced[:, 1])

    for i, word in enumerate(words):
        plt.annotate(word, (reduced[i, 0], reduced[i, 1]))

    plt.title('Word Embeddings (t-SNE)')
    plt.savefig('embeddings_tsne.png')
    plt.close()

# Usage
words = ['king', 'queen', 'man', 'woman', 'dog', 'cat', 'apple', 'orange']
visualize_embeddings(embeddings, words, vocab)

7. Word2Vec vs. GloVe

Aspect           Word2Vec                      GloVe
Approach         Prediction-based              Count/statistics-based
Training signal  Words within a local window   Global co-occurrence counts
Memory           Small                         Needs the co-occurrence matrix
Training speed   Fast with negative sampling   Fast once the matrix is built
Performance      Comparable                    Comparable

Summary

Key Concepts

  1. Distributed representations: represent words as dense vectors
  2. Skip-gram: predict context words from the center word
  3. CBOW: predict the center word from its context words
  4. GloVe: leverage global co-occurrence statistics
  5. Word vector arithmetic: queen - king + man β‰ˆ woman

Key Code

# Gensim Word2Vec
from gensim.models import Word2Vec
model = Word2Vec(sentences, vector_size=100, window=5)

# Apply pretrained embeddings
embedding = nn.Embedding.from_pretrained(pretrained_matrix, freeze=False)

# μœ μ‚¬λ„
similarity = F.cosine_similarity(vec1, vec2)

λ‹€μŒ 단계

03_Transformer_Review.mdμ—μ„œ Transformer μ•„ν‚€ν…μ²˜λ₯Ό NLP κ΄€μ μ—μ„œ λ³΅μŠ΅ν•©λ‹ˆλ‹€.
