07. Advanced Tokenization

Overview

Tokenization์€ ํ…์ŠคํŠธ๋ฅผ ๋ชจ๋ธ์ด ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋Š” ํ† ํฐ ์‹œํ€€์Šค๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๊ณผ์ •์ž…๋‹ˆ๋‹ค. Foundation Model์˜ ์„ฑ๋Šฅ๊ณผ ํšจ์œจ์„ฑ์— ์ง์ ‘์ ์ธ ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” ์ค‘์š”ํ•œ ์ „์ฒ˜๋ฆฌ ๋‹จ๊ณ„์ž…๋‹ˆ๋‹ค.


1. Tokenization Paradigms

1.1 Historical Evolution

┌──────────────────────────────────────────────────────────────┐
│                  Evolution of Tokenization                   │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  Word-level (traditional)                                    │
│  "I love NLP" → ["I", "love", "NLP"]                         │
│  Problems: OOV (out-of-vocabulary) words, huge vocabulary    │
│                                                              │
│       ↓                                                      │
│                                                              │
│  Character-level                                             │
│  "I love NLP" → ["I", " ", "l", "o", "v", "e", " ", ...]     │
│  Problems: very long sequences, loss of semantic units       │
│                                                              │
│       ↓                                                      │
│                                                              │
│  Subword (current mainstream)                                │
│  "I love NLP" → ["I", "Ġlove", "ĠN", "LP"]                   │
│  Benefits: no OOV, moderate sequence length,                 │
│            preserves morpheme-level meaning                  │
│                                                              │
│       ↓ (future)                                             │
│                                                              │
│  Byte-level / Tokenizer-free                                 │
│  Raw bytes, or no learned tokenization at all                │
│                                                              │
└──────────────────────────────────────────────────────────────┘

1.2 Comparison of Major Algorithms

Algorithm       Approach                  Representative models    Notes
BPE             Frequency-based merges    GPT, RoBERTa, LLaMA      Most widely used
WordPiece       Likelihood-based merges   BERT, DistilBERT         Score-normalized merge choice
Unigram         Probabilistic LM          T5, ALBERT, XLNet        Searches for optimal segmentation
SentencePiece   Language-independent      Multilingual models      Implements BPE/Unigram

2. BPE (Byte-Pair Encoding)

2.1 Algorithm

BPE training procedure:

1. Initial vocabulary = all characters + special tokens
2. Repeat:
   a. Find the most frequent adjacent token pair
   b. Merge that pair into a new token
   c. Add the new token to the vocabulary
3. Stop once the target vocabulary size is reached

Example:
Initial: ['l', 'o', 'w', 'e', 'r', 'n', 'i', 'g', 'h', 't']

Step 1: 'l' + 'o' → 'lo' (most frequent)
Step 2: 'lo' + 'w' → 'low'
Step 3: 'e' + 'r' → 'er'
Step 4: 'n' + 'i' → 'ni'
Step 5: 'ni' + 'g' → 'nig'
Step 6: 'nig' + 'h' → 'nigh'
Step 7: 'nigh' + 't' → 'night'
...

Final: "lower" → ['low', 'er'], "night" → ['night']

2.2 Implementation

from collections import Counter, defaultdict
from typing import Dict, List, Tuple
import re

class BPETokenizer:
    """Byte-Pair Encoding Tokenizer"""

    def __init__(self, vocab_size: int = 10000):
        self.vocab_size = vocab_size
        self.vocab = {}
        self.merges = {}
        self.special_tokens = ['<pad>', '<unk>', '<bos>', '<eos>']

    def train(self, texts: List[str]):
        """BPE ํ•™์Šต"""
        # 1. ๋‹จ์–ด ๋นˆ๋„ ๊ณ„์‚ฐ
        word_freqs = self._count_words(texts)

        # 2. ์ดˆ๊ธฐ ์–ดํœ˜ (๋ฌธ์ž ๋‹จ์œ„)
        self.vocab = {char: i for i, char in enumerate(self.special_tokens)}
        for word in word_freqs:
            for char in word:
                if char not in self.vocab:
                    self.vocab[char] = len(self.vocab)

        # 3. Split each word into a list of characters
        splits = {word: list(word) for word in word_freqs}

        # 4. Merge loop
        while len(self.vocab) < self.vocab_size:
            # Find the most frequent adjacent pair
            pair_freqs = self._count_pairs(splits, word_freqs)
            if not pair_freqs:
                break

            best_pair = max(pair_freqs, key=pair_freqs.get)

            # Merge the pair in all word splits
            splits = self._merge_pair(splits, best_pair)

            # Add the merged token to the vocabulary
            new_token = ''.join(best_pair)
            self.vocab[new_token] = len(self.vocab)
            self.merges[best_pair] = new_token

            if len(self.vocab) % 1000 == 0:
                print(f"Vocab size: {len(self.vocab)}")

    def _count_words(self, texts: List[str]) -> Dict[str, int]:
        """๋‹จ์–ด ๋นˆ๋„ ๊ณ„์‚ฐ"""
        word_freqs = Counter()
        for text in texts:
            words = text.split()
            word_freqs.update(words)
        return dict(word_freqs)

    def _count_pairs(
        self,
        splits: Dict[str, List[str]],
        word_freqs: Dict[str, int]
    ) -> Dict[Tuple[str, str], int]:
        """์ธ์ ‘ ํ† ํฐ ์Œ ๋นˆ๋„ ๊ณ„์‚ฐ"""
        pair_freqs = defaultdict(int)
        for word, freq in word_freqs.items():
            split = splits[word]
            for i in range(len(split) - 1):
                pair = (split[i], split[i + 1])
                pair_freqs[pair] += freq
        return pair_freqs

    def _merge_pair(
        self,
        splits: Dict[str, List[str]],
        pair: Tuple[str, str]
    ) -> Dict[str, List[str]]:
        """์Œ์„ ๋ณ‘ํ•ฉ"""
        new_splits = {}
        for word, split in splits.items():
            new_split = []
            i = 0
            while i < len(split):
                if i < len(split) - 1 and (split[i], split[i + 1]) == pair:
                    new_split.append(split[i] + split[i + 1])
                    i += 2
                else:
                    new_split.append(split[i])
                    i += 1
            new_splits[word] = new_split
        return new_splits

    def encode(self, text: str) -> List[int]:
        """ํ…์ŠคํŠธ โ†’ ํ† ํฐ ID"""
        words = text.split()
        ids = []

        for word in words:
            # ๋ฌธ์ž๋กœ ๋ถ„ํ• 
            tokens = list(word)

            # ํ•™์Šต๋œ ๋ณ‘ํ•ฉ ์ ์šฉ
            for pair, merged in self.merges.items():
                i = 0
                while i < len(tokens) - 1:
                    if (tokens[i], tokens[i + 1]) == pair:
                        tokens = tokens[:i] + [merged] + tokens[i + 2:]
                    else:
                        i += 1

            # Convert tokens to IDs
            for token in tokens:
                if token in self.vocab:
                    ids.append(self.vocab[token])
                else:
                    ids.append(self.vocab['<unk>'])

        return ids

    def decode(self, ids: List[int]) -> str:
        """ํ† ํฐ ID โ†’ ํ…์ŠคํŠธ"""
        id_to_token = {v: k for k, v in self.vocab.items()}
        tokens = [id_to_token.get(id, '<unk>') for id in ids]
        return ''.join(tokens)


# Usage example
tokenizer = BPETokenizer(vocab_size=5000)

texts = [
    "the quick brown fox jumps over the lazy dog",
    "machine learning is transforming the world",
    # ... more texts
]

tokenizer.train(texts * 1000)  # repeat the toy corpus to get sufficient counts

text = "the transformer model"
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)

print(f"Original: {text}")
print(f"IDs: {ids}")
print(f"Decoded: {decoded}")

3. WordPiece

3.1 Differences from BPE

BPE: ๋นˆ๋„ ๊ธฐ๋ฐ˜
- ๊ฐ€์žฅ ๋นˆ๋ฒˆํ•œ ์Œ์„ ๋ณ‘ํ•ฉ
- count(ab)๊ฐ€ ์ตœ๋Œ€์ธ (a, b) ์„ ํƒ

WordPiece: ์šฐ๋„ ๊ธฐ๋ฐ˜
- ๋ณ‘ํ•ฉ ์‹œ ์ „์ฒด ์šฐ๋„ ์ฆ๊ฐ€๊ฐ€ ์ตœ๋Œ€์ธ ์Œ ์„ ํƒ
- score(a, b) = count(ab) / (count(a) * count(b))
- ํฌ๊ท€ ์Œ์ด๋”๋ผ๋„ ๊ตฌ์„ฑ ์š”์†Œ๊ฐ€ ํฌ๊ท€ํ•˜๋ฉด ์„ ํƒ๋  ์ˆ˜ ์žˆ์Œ
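
For intuition, here is the score with made-up counts (the numbers are hypothetical, chosen only to show why normalization matters):

def wordpiece_score(count_ab: int, count_a: int, count_b: int) -> float:
    # score(a, b) = count(ab) / (count(a) * count(b))
    return count_ab / (count_a * count_b)

print(wordpiece_score(900, 1200, 1000))  # frequent pair: 7.5e-04
print(wordpiece_score(50, 60, 70))       # rare but cohesive pair: ~1.2e-02, preferred

Under pure BPE the first pair would win on raw count; WordPiece prefers the second because its parts rarely occur apart.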

3.2 Implementation

class WordPieceTokenizer:
    """WordPiece Tokenizer (BERT ์Šคํƒ€์ผ)"""

    def __init__(self, vocab_size: int = 30000):
        self.vocab_size = vocab_size
        self.vocab = {}
        self.prefix = "##"  # marks word-internal tokens

    def _count_words(self, texts: List[str]) -> Dict[str, int]:
        """Count word frequencies (same as in BPETokenizer)."""
        word_freqs = Counter()
        for text in texts:
            word_freqs.update(text.split())
        return dict(word_freqs)

    def train(self, texts: List[str]):
        """Train the WordPiece vocabulary."""
        word_freqs = self._count_words(texts)

        # Initial vocabulary: characters, plus ##-prefixed word-internal variants
        self.vocab = {'[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3, '[MASK]': 4}

        chars = set()
        for word in word_freqs:
            for i, char in enumerate(word):
                if i == 0:
                    chars.add(char)
                else:
                    chars.add(self.prefix + char)

        for char in sorted(chars):
            self.vocab[char] = len(self.vocab)

        # Initialize splits
        splits = {}
        for word in word_freqs:
            split = [word[0]] + [self.prefix + c for c in word[1:]]
            splits[word] = split

        # Merge loop (likelihood-based)
        while len(self.vocab) < self.vocab_size:
            pair_scores = self._compute_pair_scores(splits, word_freqs)
            if not pair_scores:
                break

            best_pair = max(pair_scores, key=pair_scores.get)
            splits = self._merge_pair(splits, best_pair)

            # Strip the ## prefix only at the start of the second piece
            suffix = best_pair[1]
            if suffix.startswith(self.prefix):
                suffix = suffix[len(self.prefix):]
            new_token = best_pair[0] + suffix
            self.vocab[new_token] = len(self.vocab)

    def _compute_pair_scores(
        self,
        splits: Dict[str, List[str]],
        word_freqs: Dict[str, int]
    ) -> Dict[Tuple[str, str], float]:
        """WordPiece ์ ์ˆ˜ ๊ณ„์‚ฐ"""
        # Individual token frequencies
        token_freqs = defaultdict(int)
        for word, freq in word_freqs.items():
            for token in splits[word]:
                token_freqs[token] += freq

        # Pair frequencies
        pair_freqs = defaultdict(int)
        for word, freq in word_freqs.items():
            split = splits[word]
            for i in range(len(split) - 1):
                pair = (split[i], split[i + 1])
                pair_freqs[pair] += freq

        # Score: count(ab) / (count(a) * count(b))
        scores = {}
        for pair, freq in pair_freqs.items():
            score = freq / (token_freqs[pair[0]] * token_freqs[pair[1]])
            scores[pair] = score

        return scores

    def _merge_pair(
        self,
        splits: Dict[str, List[str]],
        pair: Tuple[str, str]
    ) -> Dict[str, List[str]]:
        """์Œ ๋ณ‘ํ•ฉ"""
        new_splits = {}
        merged = pair[0] + pair[1].replace(self.prefix, '')

        for word, split in splits.items():
            new_split = []
            i = 0
            while i < len(split):
                if i < len(split) - 1 and (split[i], split[i + 1]) == pair:
                    new_split.append(merged)
                    i += 2
                else:
                    new_split.append(split[i])
                    i += 1
            new_splits[word] = new_split

        return new_splits

    def encode(self, text: str) -> List[int]:
        """Greedy longest-match tokenization"""
        words = text.lower().split()
        ids = []

        for word in words:
            tokens = self._tokenize_word(word)
            for token in tokens:
                ids.append(self.vocab.get(token, self.vocab['[UNK]']))

        return ids

    def _tokenize_word(self, word: str) -> List[str]:
        """๋‹จ์–ด๋ฅผ WordPiece ํ† ํฐ์œผ๋กœ ๋ถ„ํ• """
        tokens = []
        start = 0

        while start < len(word):
            end = len(word)
            found = False

            while start < end:
                substr = word[start:end]
                if start > 0:
                    substr = self.prefix + substr

                if substr in self.vocab:
                    tokens.append(substr)
                    found = True
                    break

                end -= 1

            if not found:
                tokens.append('[UNK]')
                start += 1
            else:
                start = end

        return tokens
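
A quick smoke test of the class above (toy corpus; the exact splits depend on which merges the small vocabulary happens to learn):

wp = WordPieceTokenizer(vocab_size=200)
wp.train(["the hugging face course teaches tokenizers"] * 100)

print(wp._tokenize_word("hugging"))       # e.g. ['hugging'] after enough merges
print(wp.encode("the tokenizers course"))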

4. Unigram LM

4.1 Concept

Unigram: probabilistic tokenization

1. Start from a large seed vocabulary (e.g., frequent substrings)
2. Estimate a probability P(token) for each token
3. Use the Viterbi algorithm to find the optimal segmentation
   (see the sketch below):
   argmax P(x_1) * P(x_2) * ... * P(x_n)
4. Prune the vocabulary: remove tokens whose removal costs the least likelihood
5. Repeat until the target vocabulary size is reached

Advantages:
- Unlike BPE/WordPiece, can sample several alternative segmentations (subword regularization)
- More robust tokenization

4.2 Using SentencePiece

import sentencepiece as spm

# Train SentencePiece (BPE or Unigram)
def train_sentencepiece(
    input_file: str,
    model_prefix: str,
    vocab_size: int = 32000,
    model_type: str = 'unigram'  # 'bpe' or 'unigram'
):
    """SentencePiece ๋ชจ๋ธ ํ•™์Šต"""
    spm.SentencePieceTrainer.train(
        input=input_file,
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        model_type=model_type,
        character_coverage=0.9995,  # high coverage for multilingual corpora
        num_threads=16,
        split_digits=True,  # split digits into individual tokens
        byte_fallback=True,  # fall back to bytes for unknown characters
        # special tokens
        pad_id=0,
        unk_id=1,
        bos_id=2,
        eos_id=3,
        pad_piece='<pad>',
        unk_piece='<unk>',
        bos_piece='<s>',
        eos_piece='</s>',
    )


# Usage
def use_sentencepiece(model_path: str):
    """SentencePiece ์‚ฌ์šฉ"""
    sp = spm.SentencePieceProcessor()
    sp.load(model_path)

    text = "Hello, how are you doing today?"

    # Encode
    ids = sp.encode(text, out_type=int)
    pieces = sp.encode(text, out_type=str)

    print(f"Text: {text}")
    print(f"Pieces: {pieces}")
    print(f"IDs: {ids}")

    # Decode
    decoded = sp.decode(ids)
    print(f"Decoded: {decoded}")

    # Subword sampling (Unigram models only)
    for _ in range(3):
        sampled = sp.encode(text, out_type=str, enable_sampling=True, alpha=0.1)
        print(f"Sampled: {sampled}")


# Training example
# train_sentencepiece('corpus.txt', 'tokenizer', vocab_size=32000, model_type='unigram')
# use_sentencepiece('tokenizer.model')

5. Byte-Level BPE

5.1 GPT-2/3/4 Style

Byte-Level BPE:
- Base vocabulary = 256 bytes
- Handles any UTF-8 text (no OOV)
- Used since GPT-2

Special handling:
- The space character is rendered as 'Ġ' (U+0120, G with dot above)
- "Hello world" → ["Hello", "Ġworld"]
- Word boundaries are represented explicitly
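
The 'Ġ' comes from GPT-2's byte-to-unicode table, which remaps all 256 byte values onto printable characters so the BPE merge machinery never sees raw control bytes. The sketch below follows the mapping in the released GPT-2 code:

def bytes_to_unicode():
    """Map every byte 0-255 to a printable unicode character."""
    # Printable byte ranges keep their own character ...
    bs = (list(range(ord('!'), ord('~') + 1))
          + list(range(ord('¡'), ord('¬') + 1))
          + list(range(ord('®'), ord('ÿ') + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:  # ... everything else is shifted above U+0100
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

print(bytes_to_unicode()[ord(' ')])  # 'Ġ' — the leading-space marker in GPT-2 tokens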

5.2 HuggingFace Tokenizers

from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders
from tokenizers.processors import TemplateProcessing

def create_byte_level_bpe(
    files: List[str],
    vocab_size: int = 50257
) -> Tokenizer:
    """GPT-2 ์Šคํƒ€์ผ Byte-Level BPE ์ƒ์„ฑ"""

    # 1. Create an empty BPE tokenizer
    tokenizer = Tokenizer(models.BPE())

    # 2. Byte-level pre-tokenization
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)

    # 3. Byte-level decoder
    tokenizer.decoder = decoders.ByteLevel()

    # 4. Train
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=['<|endoftext|>', '<|padding|>'],
        show_progress=True,
    )

    tokenizer.train(files, trainer)

    # 5. Post-processing
    tokenizer.post_processor = TemplateProcessing(
        single="$A <|endoftext|>",
        special_tokens=[("<|endoftext|>", tokenizer.token_to_id("<|endoftext|>"))],
    )

    return tokenizer


# Usage
def demonstrate_byte_level():
    """Byte-Level BPE ๋ฐ๋ชจ"""
    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

    texts = [
        "Hello, world!",
        "์•ˆ๋…•ํ•˜์„ธ์š”",  # ํ•œ๊ตญ์–ด
        "๐ŸŽ‰ Party time!",  # ์ด๋ชจ์ง€
        "The cafรฉ serves naรฏve croissants",  # ํŠน์ˆ˜๋ฌธ์ž
    ]

    for text in texts:
        tokens = tokenizer.tokenize(text)
        ids = tokenizer.encode(text)

        print(f"\nText: {text}")
        print(f"Tokens: {tokens}")
        print(f"IDs: {ids}")
        print(f"Decoded: {tokenizer.decode(ids)}")


demonstrate_byte_level()

6. Multilingual Tokenization

6.1 ๋„์ „ ๊ณผ์ œ

๋ฌธ์ œ์ :
1. Fertility ๋ถˆ๊ท ํ˜•: ๊ฐ™์€ ์˜๋ฏธ๋ผ๋„ ์–ธ์–ด๋ณ„ ํ† ํฐ ์ˆ˜ ์ฐจ์ด
   - "hello" (1 token) vs "ไฝ ๅฅฝ" (2-3 tokens) vs "์•ˆ๋…•" (2-4 tokens)

2. ์ €์ž์› ์–ธ์–ด under-representation:
   - ์˜์–ด ์ค‘์‹ฌ ํ•™์Šต โ†’ ๋‹ค๋ฅธ ์–ธ์–ด ์–ดํœ˜ ๋ถ€์กฑ

3. ์ฝ”๋“œ ์Šค์œ„์นญ:
   - "I love ๊น€์น˜" โ†’ ์˜์–ด/ํ•œ๊ตญ์–ด ํ˜ผ์šฉ ์ฒ˜๋ฆฌ

ํ•ด๊ฒฐ์ฑ…:
1. Character coverage: 99.95% ์ด์ƒ
2. ์–ธ์–ด๋ณ„ ์ƒ˜ํ”Œ๋ง ๋น„์œจ ์กฐ์ •
3. Byte fallback ํ™œ์„ฑํ™”

6.2 Building a Multilingual Tokenizer

import sentencepiece as spm

class MultilingualTokenizerConfig:
    """๋‹ค๊ตญ์–ด ํ† ํฌ๋‚˜์ด์ € ์„ค์ •"""

    # Per-language sampling weights (BLOOM-style)
    LANGUAGE_WEIGHTS = {
        'en': 0.30,   # English
        'zh': 0.15,   # Chinese
        'code': 0.15, # programming code
        'fr': 0.08,   # French
        'es': 0.07,   # Spanish
        'pt': 0.05,   # Portuguese
        'de': 0.05,   # German
        'ar': 0.05,   # Arabic
        'hi': 0.03,   # Hindi
        'ko': 0.02,   # Korean
        'ja': 0.02,   # Japanese
        'other': 0.03,
    }

    @staticmethod
    def estimate_fertility(tokenizer, texts_by_lang: dict) -> dict:
        """
        ์–ธ์–ด๋ณ„ Fertility ์ธก์ •

        Fertility = ํ† ํฐ ์ˆ˜ / ๋ฌธ์ž ์ˆ˜
        ๋‚ฎ์„์ˆ˜๋ก ํšจ์œจ์ 
        """
        fertility = {}

        for lang, texts in texts_by_lang.items():
            total_chars = 0
            total_tokens = 0

            for text in texts:
                chars = len(text)
                tokens = len(tokenizer.encode(text))

                total_chars += chars
                total_tokens += tokens

            fertility[lang] = total_tokens / max(total_chars, 1)

        return fertility
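
Hypothetical usage with a pretrained tokenizer (the model name and sample sentences are just placeholders):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('gpt2')
samples = {
    'en': ["The quick brown fox jumps over the lazy dog."],
    'ko': ["빠른 갈색 여우가 게으른 개를 뛰어넘습니다."],
}
print(MultilingualTokenizerConfig.estimate_fertility(tok, samples))
# Expect a noticeably higher value for 'ko' with an English-centric vocabulary.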


def create_multilingual_tokenizer(
    corpus_files: dict,  # {language: file_path}
    vocab_size: int = 100000
):
    """๋‹ค๊ตญ์–ด SentencePiece ํ† ํฌ๋‚˜์ด์ €"""

    # 1. Merge per-language data, applying the sampling weights
    merged_file = 'merged_corpus.txt'
    weights = MultilingualTokenizerConfig.LANGUAGE_WEIGHTS

    with open(merged_file, 'w') as out:
        for lang, file_path in corpus_files.items():
            weight = weights.get(lang, 0.01)
            sample_ratio = weight / sum(weights.values())

            with open(file_path, 'r') as f:
                lines = f.readlines()
                n_samples = int(len(lines) * sample_ratio * 10)  # oversampling factor

                for line in lines[:n_samples]:
                    out.write(line)

    # 2. Train SentencePiece
    spm.SentencePieceTrainer.train(
        input=merged_file,
        model_prefix='multilingual',
        vocab_size=vocab_size,
        model_type='bpe',
        character_coverage=0.9995,  # high character coverage
        byte_fallback=True,
        split_digits=True,
        # special tokens
        user_defined_symbols=['<code>', '</code>', '<math>', '</math>'],
    )

    return 'multilingual.model'

7. Tokenizer-Free Models

7.1 ByT5 (Byte-level T5)

class ByteLevelModel:
    """
    ByT5 ์Šคํƒ€์ผ Byte-Level ๋ชจ๋ธ

    ํŠน์ง•:
    - ํ† ํฌ๋‚˜์ด์ € ์—†์Œ
    - ์ž…๋ ฅ: raw UTF-8 bytes (0-255)
    - ์žฅ์ : ์–ธ์–ด ๋…๋ฆฝ์ , ๋…ธ์ด์ฆˆ์— ๊ฐ•ํ•จ
    - ๋‹จ์ : ๊ธด ์‹œํ€€์Šค (3-4๋ฐฐ)
    """

    VOCAB_SIZE = 259  # 256 bytes + 3 special tokens

    def __init__(self):
        self.pad_id = 256
        self.eos_id = 257
        self.unk_id = 258

    def encode(self, text: str) -> List[int]:
        """ํ…์ŠคํŠธ โ†’ bytes"""
        return list(text.encode('utf-8'))

    def decode(self, ids: List[int]) -> str:
        """bytes โ†’ ํ…์ŠคํŠธ"""
        # ํŠน์ˆ˜ ํ† ํฐ ์ œ๊ฑฐ
        bytes_list = [b for b in ids if b < 256]
        return bytes(bytes_list).decode('utf-8', errors='replace')
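
A quick roundtrip check of the class above; Korean roughly triples in length because each Hangul syllable is 3 bytes in UTF-8:

blm = ByteLevelModel()
for s in ["hello", "안녕하세요"]:
    print(f"{s!r}: {len(s)} chars -> {len(blm.encode(s))} byte tokens")
print(blm.decode(blm.encode("안녕 🎉")))  # lossless roundtrip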


# ByT5 ์‚ฌ์šฉ ์˜ˆ์‹œ
from transformers import AutoTokenizer, T5ForConditionalGeneration

def use_byt5():
    """ByT5 ์‚ฌ์šฉ"""
    tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
    model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

    # Byte-level encoding
    text = "translate English to German: Hello, how are you?"
    inputs = tokenizer(text, return_tensors="pt")

    print(f"Text length: {len(text)} chars")
    print(f"Token length: {inputs['input_ids'].shape[1]} tokens")
    # roughly similar, since tokens are individual bytes

    # ์ƒ์„ฑ
    outputs = model.generate(**inputs, max_length=100)
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Result: {result}")

7.2 MEGABYTE

MEGABYTE architecture:
- Patch-based byte modeling
- Global model: a large transformer operating at the patch level
- Local model: a small transformer operating at the byte level within a patch

Advantages:
- Efficient handling of long byte sequences
- Attention cost drops from O(n²) to O(n²/p² + n·p), where p is the patch size
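
A back-of-the-envelope check of that complexity claim (counting only attention score pairs, ignoring constants and feed-forward cost):

def attention_pairs(n: int, p: int):
    full = n ** 2                   # vanilla byte-level transformer
    global_cost = (n // p) ** 2     # one large model over n/p patches
    local_cost = (n // p) * p ** 2  # one small model per patch = n * p
    return full, global_cost + local_cost

n = 1_000_000  # 1 MB of raw bytes
for p in (64, 256, 1024):
    full, patched = attention_pairs(n, p)
    print(f"p={p}: {full / patched:,.0f}x fewer attention pairs")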

8. Tokenization for Code

8.1 Code-Specific Strategies

class CodeTokenizer:
    """
    ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์ฝ”๋“œ์šฉ ํ† ํฌ๋‚˜์ด์ €

    ๊ณ ๋ ค์‚ฌํ•ญ:
    1. ๋“ค์—ฌ์“ฐ๊ธฐ ๋ณด์กด
    2. ์‹๋ณ„์ž ๋ถ„ํ•  (camelCase, snake_case)
    3. ์ˆซ์ž ๋ฆฌํ„ฐ๋Ÿด
    4. ํŠน์ˆ˜ ๋ฌธ์ž (==, !=, <=, etc.)
    """

    def preprocess_code(self, code: str) -> str:
        """์ฝ”๋“œ ์ „์ฒ˜๋ฆฌ"""
        # ๋“ค์—ฌ์“ฐ๊ธฐ๋ฅผ ํŠน์ˆ˜ ํ† ํฐ์œผ๋กœ
        lines = code.split('\n')
        processed = []

        for line in lines:
            # Measure leading whitespace
            indent = len(line) - len(line.lstrip())
            indent_tokens = '<INDENT>' * (indent // 4)

            processed.append(indent_tokens + line.lstrip())

        return '\n'.join(processed)

    def split_identifier(self, identifier: str) -> List[str]:
        """์‹๋ณ„์ž ๋ถ„ํ• """
        # camelCase
        import re
        tokens = re.sub('([A-Z])', r' \1', identifier).split()

        # snake_case
        result = []
        for token in tokens:
            result.extend(token.split('_'))

        return [t for t in result if t]
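
A quick check of the helpers above:

ct = CodeTokenizer()
print(ct.split_identifier("getUserName"))     # ['get', 'User', 'Name']
print(ct.split_identifier("max_batch_size"))  # ['max', 'batch', 'size']
print(ct.preprocess_code("def f():\n    return 1"))
# def f():
# <INDENT>return 1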


# Codex/StarCoder style
def create_code_tokenizer():
    """์ฝ”๋“œ์šฉ ํ† ํฌ๋‚˜์ด์ € (StarCoder ์Šคํƒ€์ผ)"""
    from tokenizers import Tokenizer, models, trainers, pre_tokenizers

    tokenizer = Tokenizer(models.BPE())

    # Code-oriented pre-tokenization
    tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
        pre_tokenizers.ByteLevel(add_prefix_space=False),
        # split digits into individual tokens
        pre_tokenizers.Digits(individual_digits=True),
    ])

    # special tokens
    special_tokens = [
        '<|endoftext|>',
        '<fim_prefix>',  # Fill-in-the-middle
        '<fim_middle>',
        '<fim_suffix>',
        '<filename>',
        '<gh_stars>',
        '<issue_start>',
        '<issue_comment>',
        '<issue_closed>',
        '<jupyter_start>',
        '<jupyter_code>',
        '<jupyter_output>',
        '<empty_output>',
        '<commit_before>',
        '<commit_msg>',
        '<commit_after>',
    ]

    trainer = trainers.BpeTrainer(
        vocab_size=49152,
        special_tokens=special_tokens,
    )

    return tokenizer, trainer

9. Hands-On: Tokenizer Analysis

from transformers import AutoTokenizer
import matplotlib.pyplot as plt

def analyze_tokenizers():
    """๋‹ค์–‘ํ•œ ํ† ํฌ๋‚˜์ด์ € ๋น„๊ต ๋ถ„์„"""

    tokenizers = {
        'GPT-2': AutoTokenizer.from_pretrained('gpt2'),
        'BERT': AutoTokenizer.from_pretrained('bert-base-uncased'),
        'T5': AutoTokenizer.from_pretrained('t5-base'),
        'LLaMA': AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf'),
    }

    test_texts = {
        'English': "The quick brown fox jumps over the lazy dog.",
        'Korean': "๋น ๋ฅธ ๊ฐˆ์ƒ‰ ์—ฌ์šฐ๊ฐ€ ๊ฒŒ์œผ๋ฅธ ๊ฐœ๋ฅผ ๋›ฐ์–ด๋„˜์Šต๋‹ˆ๋‹ค.",
        'Code': "def hello_world():\n    print('Hello, World!')",
        'Math': "The equation e^(iπ) + 1 = 0 is beautiful.",
        'Mixed': "I love eating 김치 with rice 🍚",
    }

    # Analyze
    results = {}
    for tok_name, tokenizer in tokenizers.items():
        results[tok_name] = {}

        for text_name, text in test_texts.items():
            try:
                tokens = tokenizer.tokenize(text)
                ids = tokenizer.encode(text)

                results[tok_name][text_name] = {
                    'n_tokens': len(tokens),
                    'n_chars': len(text),
                    'fertility': len(tokens) / len(text),
                    'tokens': tokens[:10],  # first 10 only
                }
            except Exception:  # e.g., a gated model that is not available
                results[tok_name][text_name] = None

    # Print results
    for tok_name, tok_results in results.items():
        print(f"\n{'='*50}")
        print(f"Tokenizer: {tok_name}")
        print('='*50)

        for text_name, result in tok_results.items():
            if result:
                print(f"\n{text_name}:")
                print(f"  Tokens: {result['n_tokens']}")
                print(f"  Chars: {result['n_chars']}")
                print(f"  Fertility: {result['fertility']:.3f}")
                print(f"  Sample: {result['tokens']}")

    # Plot fertility comparison
    fig, ax = plt.subplots(figsize=(10, 6))

    x = list(test_texts.keys())
    width = 0.2
    positions = range(len(x))

    for i, (tok_name, tok_results) in enumerate(results.items()):
        fertilities = [
            tok_results[text_name]['fertility'] if tok_results.get(text_name) else 0
            for text_name in x
        ]
        offset = (i - len(results) / 2) * width
        ax.bar([p + offset for p in positions], fertilities, width, label=tok_name)

    ax.set_xlabel('Text Type')
    ax.set_ylabel('Fertility (tokens/chars)')
    ax.set_title('Tokenizer Fertility Comparison')
    ax.set_xticks(positions)
    ax.set_xticklabels(x)
    ax.legend()

    plt.tight_layout()
    plt.savefig('tokenizer_comparison.png')
    plt.show()


if __name__ == "__main__":
    analyze_tokenizers()

์ฐธ๊ณ  ์ž๋ฃŒ

Papers

  • Sennrich et al. (2016). "Neural Machine Translation of Rare Words with Subword Units" (BPE)
  • Kudo & Richardson (2018). "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"
  • Xue et al. (2021). "ByT5: Towards a token-free future with pre-trained byte-to-byte models"
