07. Advanced Tokenization
Overview
Tokenization is the process of converting text into a sequence of tokens that a model can process. It is a critical preprocessing step with a direct impact on the performance and efficiency of Foundation Models.
1. Tokenization Paradigms
1.1 Historical Development
Evolution of tokenization:

Word-level (traditional)
    "I love NLP" → ["I", "love", "NLP"]
    Problems: OOV (out-of-vocabulary) words, enormous vocabulary size
        ↓
Character-level
    "I love NLP" → ["I", " ", "l", "o", "v", "e", " ", ...]
    Problems: very long sequences, loss of semantic units
        ↓
Subword (current mainstream)
    "I love NLP" → ["I", "Ġlove", "ĠN", "LP"]
    Advantages: no OOV, moderate sequence length, preserves morpheme-level meaning
        ↓ (future)
Byte-level / Tokenizer-free
    Raw bytes, or processing without a learned tokenizer
1.2 Comparison of Major Algorithms
| Algorithm | Method | Representative models | Notes |
|---|---|---|---|
| BPE | frequency-based merging | GPT, RoBERTa, LLaMA | most widely used |
| WordPiece | likelihood-based merging | BERT, DistilBERT | probabilistic selection criterion |
| Unigram | probabilistic model | T5, ALBERT, XLNet | searches for the optimal segmentation |
| SentencePiece | language-independent | multilingual models | implements BPE/Unigram |
2. BPE (Byte-Pair Encoding)
2.1 Algorithm
BPE training procedure:

1. Initial vocabulary = all characters + special tokens
2. Repeat:
   a. Find the most frequent adjacent token pair
   b. Merge that pair into a new token
   c. Add the new token to the vocabulary
3. Stop when the target vocabulary size is reached

Example:

Initial: ['l', 'o', 'w', 'e', 'r', 'n', 'i', 'g', 'h', 't']
Step 1: 'l' + 'o' → 'lo' (most frequent pair)
Step 2: 'lo' + 'w' → 'low'
Step 3: 'e' + 'r' → 'er'
Step 4: 'n' + 'i' → 'ni'
Step 5: 'ni' + 'g' → 'nig'
Step 6: 'nig' + 'h' → 'nigh'
Step 7: 'nigh' + 't' → 'night'
...
Final: "lower" → ['low', 'er'], "night" → ['night']
2.2 Implementation
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


class BPETokenizer:
    """Byte-Pair Encoding tokenizer."""

    def __init__(self, vocab_size: int = 10000):
        self.vocab_size = vocab_size
        self.vocab = {}
        self.merges = {}
        self.special_tokens = ['<pad>', '<unk>', '<bos>', '<eos>']

    def train(self, texts: List[str]):
        """Train the BPE merges."""
        # 1. Count word frequencies
        word_freqs = self._count_words(texts)

        # 2. Initial vocabulary (character level)
        self.vocab = {char: i for i, char in enumerate(self.special_tokens)}
        for word in word_freqs:
            for char in word:
                if char not in self.vocab:
                    self.vocab[char] = len(self.vocab)

        # 3. Split each word into a list of characters
        splits = {word: list(word) for word in word_freqs}

        # 4. Merge loop
        while len(self.vocab) < self.vocab_size:
            # Find the most frequent adjacent pair
            pair_freqs = self._count_pairs(splits, word_freqs)
            if not pair_freqs:
                break
            best_pair = max(pair_freqs, key=pair_freqs.get)

            # Merge it everywhere
            splits = self._merge_pair(splits, best_pair)

            # Add the merged token to the vocabulary
            new_token = ''.join(best_pair)
            self.vocab[new_token] = len(self.vocab)
            self.merges[best_pair] = new_token

            if len(self.vocab) % 1000 == 0:
                print(f"Vocab size: {len(self.vocab)}")

    def _count_words(self, texts: List[str]) -> Dict[str, int]:
        """Count word frequencies over the corpus."""
        word_freqs = Counter()
        for text in texts:
            word_freqs.update(text.split())
        return dict(word_freqs)

    def _count_pairs(
        self,
        splits: Dict[str, List[str]],
        word_freqs: Dict[str, int],
    ) -> Dict[Tuple[str, str], int]:
        """Count frequencies of adjacent token pairs."""
        pair_freqs = defaultdict(int)
        for word, freq in word_freqs.items():
            split = splits[word]
            for i in range(len(split) - 1):
                pair_freqs[(split[i], split[i + 1])] += freq
        return pair_freqs

    def _merge_pair(
        self,
        splits: Dict[str, List[str]],
        pair: Tuple[str, str],
    ) -> Dict[str, List[str]]:
        """Apply one merge to every word split."""
        new_splits = {}
        for word, split in splits.items():
            new_split = []
            i = 0
            while i < len(split):
                if i < len(split) - 1 and (split[i], split[i + 1]) == pair:
                    new_split.append(split[i] + split[i + 1])
                    i += 2
                else:
                    new_split.append(split[i])
                    i += 1
            new_splits[word] = new_split
        return new_splits

    def encode(self, text: str) -> List[int]:
        """Text -> token IDs."""
        ids = []
        for word in text.split():
            # Start from characters
            tokens = list(word)
            # Apply the learned merges in training order
            for pair, merged in self.merges.items():
                i = 0
                while i < len(tokens) - 1:
                    if (tokens[i], tokens[i + 1]) == pair:
                        tokens = tokens[:i] + [merged] + tokens[i + 2:]
                    else:
                        i += 1
            # Map tokens to IDs
            for token in tokens:
                ids.append(self.vocab.get(token, self.vocab['<unk>']))
        return ids

    def decode(self, ids: List[int]) -> str:
        """Token IDs -> text (word boundaries are lost in this simplified version)."""
        id_to_token = {v: k for k, v in self.vocab.items()}
        tokens = [id_to_token.get(i, '<unk>') for i in ids]
        return ''.join(tokens)
# Usage example
tokenizer = BPETokenizer(vocab_size=5000)
texts = [
    "the quick brown fox jumps over the lazy dog",
    "machine learning is transforming the world",
    # ... more texts
]
tokenizer.train(texts * 1000)  # repeat to get sufficient frequencies

text = "the transformer model"
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)
print(f"Original: {text}")
print(f"IDs: {ids}")
print(f"Decoded: {decoded}")
3. WordPiece
3.1 Differences from BPE
BPE: frequency-based
- merges the most frequent adjacent pair
- selects the pair (a, b) with the highest count(ab)

WordPiece: likelihood-based
- selects the pair whose merge most increases the corpus likelihood
- score(a, b) = count(ab) / (count(a) * count(b))
- even a rare pair can be selected if its constituent tokens are themselves rare
3.2 Implementation
class WordPieceTokenizer:
    """WordPiece tokenizer (BERT style)."""

    def __init__(self, vocab_size: int = 30000):
        self.vocab_size = vocab_size
        self.vocab = {}
        self.prefix = "##"  # marks word-internal tokens

    def train(self, texts: List[str]):
        """Train the WordPiece vocabulary."""
        word_freqs = self._count_words(texts)

        # Initial vocabulary: characters plus their ##-prefixed variants
        self.vocab = {'[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3, '[MASK]': 4}
        chars = set()
        for word in word_freqs:
            for i, char in enumerate(word):
                chars.add(char if i == 0 else self.prefix + char)
        for char in sorted(chars):
            self.vocab[char] = len(self.vocab)

        # Initialize splits
        splits = {
            word: [word[0]] + [self.prefix + c for c in word[1:]]
            for word in word_freqs
        }

        # Merge loop (likelihood-based)
        while len(self.vocab) < self.vocab_size:
            pair_scores = self._compute_pair_scores(splits, word_freqs)
            if not pair_scores:
                break
            best_pair = max(pair_scores, key=pair_scores.get)
            splits = self._merge_pair(splits, best_pair)
            new_token = best_pair[0] + best_pair[1].removeprefix(self.prefix)
            self.vocab[new_token] = len(self.vocab)

    def _count_words(self, texts: List[str]) -> Dict[str, int]:
        """Count word frequencies over the corpus."""
        word_freqs = Counter()
        for text in texts:
            word_freqs.update(text.lower().split())
        return dict(word_freqs)

    def _compute_pair_scores(
        self,
        splits: Dict[str, List[str]],
        word_freqs: Dict[str, int],
    ) -> Dict[Tuple[str, str], float]:
        """WordPiece score: count(ab) / (count(a) * count(b))."""
        # Individual token frequencies
        token_freqs = defaultdict(int)
        for word, freq in word_freqs.items():
            for token in splits[word]:
                token_freqs[token] += freq

        # Pair frequencies
        pair_freqs = defaultdict(int)
        for word, freq in word_freqs.items():
            split = splits[word]
            for i in range(len(split) - 1):
                pair_freqs[(split[i], split[i + 1])] += freq

        return {
            pair: freq / (token_freqs[pair[0]] * token_freqs[pair[1]])
            for pair, freq in pair_freqs.items()
        }

    def _merge_pair(
        self,
        splits: Dict[str, List[str]],
        pair: Tuple[str, str],
    ) -> Dict[str, List[str]]:
        """Apply one merge to every word split."""
        new_splits = {}
        merged = pair[0] + pair[1].removeprefix(self.prefix)
        for word, split in splits.items():
            new_split = []
            i = 0
            while i < len(split):
                if i < len(split) - 1 and (split[i], split[i + 1]) == pair:
                    new_split.append(merged)
                    i += 2
                else:
                    new_split.append(split[i])
                    i += 1
            new_splits[word] = new_split
        return new_splits

    def encode(self, text: str) -> List[int]:
        """Greedy longest-match tokenization."""
        ids = []
        for word in text.lower().split():
            for token in self._tokenize_word(word):
                ids.append(self.vocab.get(token, self.vocab['[UNK]']))
        return ids

    def _tokenize_word(self, word: str) -> List[str]:
        """Split a word into WordPiece tokens by greedy longest match."""
        tokens = []
        start = 0
        while start < len(word):
            end = len(word)
            found = False
            while start < end:
                substr = word[start:end]
                if start > 0:
                    substr = self.prefix + substr
                if substr in self.vocab:
                    tokens.append(substr)
                    found = True
                    break
                end -= 1
            if not found:
                # No known subword here: emit [UNK] and skip one character
                tokens.append('[UNK]')
                start += 1
            else:
                start = end
        return tokens
4. Unigram LM
4.1 Concept
Unigram: probabilistic tokenization

1. Start from a large initial vocabulary (frequent substrings)
2. Estimate a probability for each token: P(token)
3. Use the Viterbi algorithm to find the optimal segmentation:
   argmax P(x_1) * P(x_2) * ... * P(x_n)
4. Prune the vocabulary: remove the tokens whose removal increases the loss least
5. Repeat until the target vocabulary size is reached

Advantages:
- unlike BPE/WordPiece, can sample multiple candidate segmentations
- more robust tokenization (subword regularization)
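Step 3 above, the Viterbi search for the highest-probability segmentation, can be sketched in a few lines. The `log_probs` vocabulary below is a hypothetical toy; a real Unigram tokenizer estimates these probabilities with EM over a corpus.

```python
import math

def viterbi_segment(text, log_probs):
    """Find the segmentation maximizing the product of token probabilities.

    log_probs: dict mapping token -> log P(token) (toy values here).
    """
    n = len(text)
    # best[i] = best log-prob of segmenting text[:i]; back[i] = start of the last token
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    # Recover the tokens by walking the back-pointers
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

# Toy vocabulary: "un" + "related" beats any character-by-character split
vocab = {'u': -6.0, 'n': -6.0, 'un': -3.0, 'related': -4.0,
         'r': -6.0, 'e': -6.0, 'l': -6.0, 'a': -6.0, 't': -6.0, 'd': -6.0}
print(viterbi_segment('unrelated', vocab))  # ['un', 'related']
```

The same dynamic program, run with sampling instead of argmax, is what enables the multiple candidate segmentations mentioned above.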
4.2 Using SentencePiece
import sentencepiece as spm


# Train a SentencePiece model (BPE or Unigram)
def train_sentencepiece(
    input_file: str,
    model_prefix: str,
    vocab_size: int = 32000,
    model_type: str = 'unigram',  # 'bpe' or 'unigram'
):
    """Train a SentencePiece model."""
    spm.SentencePieceTrainer.train(
        input=input_file,
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        model_type=model_type,
        character_coverage=0.9995,  # high coverage for multilingual corpora
        num_threads=16,
        split_digits=True,          # split numbers into individual digits
        byte_fallback=True,         # fall back to bytes for OOV characters
        # Special tokens
        pad_id=0,
        unk_id=1,
        bos_id=2,
        eos_id=3,
        pad_piece='<pad>',
        unk_piece='<unk>',
        bos_piece='<s>',
        eos_piece='</s>',
    )
# Usage
def use_sentencepiece(model_path: str):
    """Load and use a trained SentencePiece model."""
    sp = spm.SentencePieceProcessor()
    sp.load(model_path)

    text = "Hello, how are you doing today?"

    # Encoding
    ids = sp.encode(text, out_type=int)
    pieces = sp.encode(text, out_type=str)
    print(f"Text: {text}")
    print(f"Pieces: {pieces}")
    print(f"IDs: {ids}")

    # Decoding
    decoded = sp.decode(ids)
    print(f"Decoded: {decoded}")

    # Probabilistic sampling (Unigram models only)
    for _ in range(3):
        sampled = sp.encode(text, out_type=str, enable_sampling=True, alpha=0.1)
        print(f"Sampled: {sampled}")


# Training example:
# train_sentencepiece('corpus.txt', 'tokenizer', vocab_size=32000, model_type='unigram')
# use_sentencepiece('tokenizer.model')
5. Byte-Level BPE
5.1 GPT-2/3/4 Style
Byte-Level BPE:
- base vocabulary = 256 bytes
- can handle any UTF-8 text (no OOV)
- used since GPT-2

Space handling:
- a leading space is marked with 'Ġ' (Latin capital G with dot above)
- "Hello world" → ["Hello", "Ġworld"]
- makes word boundaries explicit in the token strings
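The 'Ġ' marker comes from GPT-2's byte-to-unicode table: every byte is assigned a visible character so that BPE merges operate on printable strings, and invisible bytes, including the space (0x20), are shifted up by 256 into the range starting at U+0100. A minimal reconstruction of that table, mirroring the `bytes_to_unicode` helper in the GPT-2 reference code:

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a visible Unicode character."""
    # Bytes that already render as visible characters keep their codepoint
    bs = (list(range(ord('!'), ord('~') + 1))
          + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # shift invisible bytes into a printable range
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

mapping = bytes_to_unicode()
print(mapping[0x20])      # 'Ġ': the space byte, hence "Ġworld" for " world"
print(mapping[ord('A')])  # 'A': printable bytes map to themselves
```

Decoding simply inverts this table and then interprets the recovered bytes as UTF-8, which is why byte-level BPE can round-trip any input.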
5.2 HuggingFace Tokenizers
from typing import List

from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders
from tokenizers.processors import TemplateProcessing


def create_byte_level_bpe(
    files: List[str],
    vocab_size: int = 50257,
) -> Tokenizer:
    """Build a GPT-2-style byte-level BPE tokenizer."""
    # 1. Empty BPE tokenizer
    tokenizer = Tokenizer(models.BPE())

    # 2. Byte-level pre-tokenization
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)

    # 3. Matching byte-level decoder
    tokenizer.decoder = decoders.ByteLevel()

    # 4. Training
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=['<|endoftext|>', '<|padding|>'],
        show_progress=True,
    )
    tokenizer.train(files, trainer)

    # 5. Post-processing: append the end-of-text token
    tokenizer.post_processor = TemplateProcessing(
        single="$A <|endoftext|>",
        special_tokens=[("<|endoftext|>", tokenizer.token_to_id("<|endoftext|>"))],
    )
    return tokenizer
# Usage
def demonstrate_byte_level():
    """Demonstrate byte-level BPE on varied inputs."""
    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    texts = [
        "Hello, world!",
        "안녕하세요",                          # Korean
        "🎉 Party time!",                    # emoji
        "The café serves naïve croissants",  # accented characters
    ]
    for text in texts:
        tokens = tokenizer.tokenize(text)
        ids = tokenizer.encode(text)
        print(f"\nText: {text}")
        print(f"Tokens: {tokens}")
        print(f"IDs: {ids}")
        print(f"Decoded: {tokenizer.decode(ids)}")


demonstrate_byte_level()
6. Multilingual Tokenization
6.1 Challenges
Problems:
1. Fertility imbalance: the same meaning costs a different number of tokens per language
   - "hello" (1 token) vs "你好" (2-3 tokens) vs "안녕" (2-4 tokens)
2. Under-representation of low-resource languages:
   - English-centric training → insufficient vocabulary coverage for other languages
3. Code-switching:
   - "I love 김치" → mixed English/Korean input

Solutions:
1. Character coverage of 99.95% or higher
2. Adjusting per-language sampling ratios
3. Enabling byte fallback
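The per-language sampling adjustment above is commonly implemented with temperature scaling of raw corpus proportions, as in XLM-R-style multilingual training: each proportion p_i is re-weighted to p_i^α / Σ_j p_j^α with α < 1, which over-samples low-resource languages. A minimal sketch with hypothetical corpus sizes:

```python
def temperature_sampling_weights(corpus_sizes: dict, alpha: float = 0.3) -> dict:
    """Re-weight per-language sampling probabilities with temperature alpha.

    alpha = 1.0 reproduces the raw corpus proportions; alpha < 1 flattens
    the distribution, boosting low-resource languages.
    """
    total = sum(corpus_sizes.values())
    probs = {lang: size / total for lang, size in corpus_sizes.items()}
    scaled = {lang: p ** alpha for lang, p in probs.items()}
    z = sum(scaled.values())
    return {lang: s / z for lang, s in scaled.items()}

# Hypothetical corpus sizes (in sentences), for illustration only
sizes = {'en': 90_000_000, 'ko': 5_000_000, 'sw': 1_000_000}
for lang, w in temperature_sampling_weights(sizes, alpha=0.3).items():
    print(f"{lang}: {w:.3f}")  # 'sw' gets far more than its raw ~1% share
```

The fixed LANGUAGE_WEIGHTS table in the next section can be seen as a hand-tuned version of the same idea.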
6.2 Building a Multilingual Tokenizer
class MultilingualTokenizerConfig:
    """Configuration for a multilingual tokenizer."""

    # Per-language sampling weights (BLOOM style)
    LANGUAGE_WEIGHTS = {
        'en': 0.30,    # English
        'zh': 0.15,    # Chinese
        'code': 0.15,  # programming code
        'fr': 0.08,    # French
        'es': 0.07,    # Spanish
        'pt': 0.05,    # Portuguese
        'de': 0.05,    # German
        'ar': 0.05,    # Arabic
        'hi': 0.03,    # Hindi
        'ko': 0.02,    # Korean
        'ja': 0.02,    # Japanese
        'other': 0.03,
    }

    @staticmethod
    def estimate_fertility(tokenizer, texts_by_lang: dict) -> dict:
        """Measure per-language fertility.

        Fertility here = tokens / characters; lower means more efficient.
        """
        fertility = {}
        for lang, texts in texts_by_lang.items():
            total_chars = 0
            total_tokens = 0
            for text in texts:
                total_chars += len(text)
                total_tokens += len(tokenizer.encode(text))
            fertility[lang] = total_tokens / max(total_chars, 1)
        return fertility
import sentencepiece as spm


def create_multilingual_tokenizer(
    corpus_files: dict,  # {language: file_path}
    vocab_size: int = 100000,
):
    """Train a multilingual SentencePiece tokenizer."""
    # 1. Merge per-language data with the sampling weights applied
    merged_file = 'merged_corpus.txt'
    weights = MultilingualTokenizerConfig.LANGUAGE_WEIGHTS
    with open(merged_file, 'w', encoding='utf-8') as out:
        for lang, file_path in corpus_files.items():
            weight = weights.get(lang, 0.01)
            sample_ratio = weight / sum(weights.values())
            with open(file_path, 'r', encoding='utf-8') as f:
                lines = f.readlines()
            n_samples = int(len(lines) * sample_ratio * 10)  # oversampling factor
            for line in lines[:n_samples]:
                out.write(line)

    # 2. Train SentencePiece
    spm.SentencePieceTrainer.train(
        input=merged_file,
        model_prefix='multilingual',
        vocab_size=vocab_size,
        model_type='bpe',
        character_coverage=0.9995,  # high coverage
        byte_fallback=True,
        split_digits=True,
        # User-defined special tokens
        user_defined_symbols=['<code>', '</code>', '<math>', '</math>'],
    )
    return 'multilingual.model'
7. Tokenizer-Free Models
7.1 ByT5 (Byte-level T5)
from typing import List


class ByteLevelModel:
    """ByT5-style byte-level model.

    Characteristics:
    - no tokenizer
    - input: raw UTF-8 bytes (0-255)
    - advantages: language-independent, robust to noise
    - disadvantage: sequences are 3-4x longer
    """

    VOCAB_SIZE = 259  # 256 bytes + 3 special tokens

    def __init__(self):
        self.pad_id = 256
        self.eos_id = 257
        self.unk_id = 258

    def encode(self, text: str) -> List[int]:
        """Text -> byte IDs."""
        return list(text.encode('utf-8'))

    def decode(self, ids: List[int]) -> str:
        """Byte IDs -> text."""
        # Drop special tokens (IDs >= 256)
        byte_values = [b for b in ids if b < 256]
        return bytes(byte_values).decode('utf-8', errors='replace')
# ByT5 usage example
from transformers import AutoTokenizer, T5ForConditionalGeneration


def use_byt5():
    """Run a small ByT5 model on a byte-level input."""
    tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
    model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

    # Byte-level encoding
    text = "translate English to German: Hello, how are you?"
    inputs = tokenizer(text, return_tensors="pt")
    print(f"Text length: {len(text)} chars")
    print(f"Token length: {inputs['input_ids'].shape[1]} tokens")
    # Roughly the same, since each token is a byte

    # Generation
    outputs = model.generate(**inputs, max_length=100)
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Result: {result}")
7.2 MEGABYTE
MEGABYTE architecture:
- patch-based byte modeling
- global model: a large transformer operating at the patch level
- local model: a small transformer operating at the byte level

Advantages:
- efficient processing of long byte sequences
- attention cost drops from O(n²) to O(n²/p² + n·p) (p = patch size)
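Assuming pairwise attention dominates the cost, the savings can be checked with a few lines of arithmetic: the global model attends over n/p patch embeddings, and the local model runs n/p independent blocks of p bytes each. The sizes below are illustrative, not from the paper.

```python
def attention_cost(n: int, p: int) -> tuple:
    """Pairwise attention cost of a MEGABYTE-style patch hierarchy.

    Global model: self-attention over n/p patch embeddings -> (n/p)^2 pairs.
    Local model: n/p independent blocks of p bytes -> (n/p) * p^2 = n*p pairs.
    (Only attention score pairs are counted; feed-forward cost is ignored.)
    """
    global_cost = (n // p) ** 2
    local_cost = (n // p) * p ** 2
    return global_cost, local_cost

n, p = 1_000_000, 1_000  # 1 MB of bytes, patch size 1000
g, l = attention_cost(n, p)
print(f"vanilla: {n**2:.2e}, megabyte: {g + l:.2e}")  # ~1e12 vs ~1e9
```

With p around n^(1/3) the two terms balance, which is where the architecture's sub-quadratic scaling comes from.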
8. Tokenization for Code
8.1 Code-Specific Strategies
import re
from typing import List


class CodeTokenizer:
    """Tokenizer helpers for programming code.

    Considerations:
    1. preserving indentation
    2. splitting identifiers (camelCase, snake_case)
    3. numeric literals
    4. multi-character operators (==, !=, <=, etc.)
    """

    def preprocess_code(self, code: str) -> str:
        """Replace leading indentation with explicit tokens."""
        processed = []
        for line in code.split('\n'):
            # One <INDENT> token per 4 spaces of indentation
            indent = len(line) - len(line.lstrip())
            indent_tokens = '<INDENT>' * (indent // 4)
            processed.append(indent_tokens + line.lstrip())
        return '\n'.join(processed)

    def split_identifier(self, identifier: str) -> List[str]:
        """Split an identifier on camelCase and snake_case boundaries."""
        # camelCase: insert a space before each uppercase letter
        tokens = re.sub('([A-Z])', r' \1', identifier).split()
        # snake_case: split the remaining pieces on underscores
        result = []
        for token in tokens:
            result.extend(token.split('_'))
        return [t for t in result if t]
# Codex/StarCoder style
def create_code_tokenizer():
    """Build a code tokenizer (StarCoder style)."""
    from tokenizers import Tokenizer, models, trainers, pre_tokenizers

    tokenizer = Tokenizer(models.BPE())

    # Code-oriented pre-tokenization
    tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
        pre_tokenizers.ByteLevel(add_prefix_space=False),
        # Split numbers into individual digits
        pre_tokenizers.Digits(individual_digits=True),
    ])

    # Special tokens
    special_tokens = [
        '<|endoftext|>',
        '<fim_prefix>',  # fill-in-the-middle
        '<fim_middle>',
        '<fim_suffix>',
        '<filename>',
        '<gh_stars>',
        '<issue_start>',
        '<issue_comment>',
        '<issue_closed>',
        '<jupyter_start>',
        '<jupyter_code>',
        '<jupyter_output>',
        '<empty_output>',
        '<commit_before>',
        '<commit_msg>',
        '<commit_after>',
    ]
    trainer = trainers.BpeTrainer(
        vocab_size=49152,
        special_tokens=special_tokens,
    )
    return tokenizer, trainer
9. Hands-On: Tokenizer Analysis
from transformers import AutoTokenizer
import matplotlib.pyplot as plt


def analyze_tokenizers():
    """Compare several tokenizers across varied text types."""
    tokenizers = {
        'GPT-2': AutoTokenizer.from_pretrained('gpt2'),
        'BERT': AutoTokenizer.from_pretrained('bert-base-uncased'),
        'T5': AutoTokenizer.from_pretrained('t5-base'),
        'LLaMA': AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf'),
    }
    test_texts = {
        'English': "The quick brown fox jumps over the lazy dog.",
        'Korean': "빠른 갈색 여우가 게으른 개를 뛰어넘습니다.",
        'Code': "def hello_world():\n    print('Hello, World!')",
        'Math': "The equation e^(iπ) + 1 = 0 is beautiful.",
        'Mixed': "I love eating 김치 with rice 🍚",
    }

    # Analysis
    results = {}
    for tok_name, tokenizer in tokenizers.items():
        results[tok_name] = {}
        for text_name, text in test_texts.items():
            try:
                tokens = tokenizer.tokenize(text)
                results[tok_name][text_name] = {
                    'n_tokens': len(tokens),
                    'n_chars': len(text),
                    'fertility': len(tokens) / len(text),
                    'tokens': tokens[:10],  # first 10 only
                }
            except Exception:
                results[tok_name][text_name] = None

    # Report
    for tok_name, tok_results in results.items():
        print(f"\n{'=' * 50}")
        print(f"Tokenizer: {tok_name}")
        print('=' * 50)
        for text_name, result in tok_results.items():
            if result:
                print(f"\n{text_name}:")
                print(f"  Tokens: {result['n_tokens']}")
                print(f"  Chars: {result['n_chars']}")
                print(f"  Fertility: {result['fertility']:.3f}")
                print(f"  Sample: {result['tokens']}")

    # Fertility visualization
    fig, ax = plt.subplots(figsize=(10, 6))
    x = list(test_texts.keys())
    width = 0.2
    positions = range(len(x))
    for i, (tok_name, tok_results) in enumerate(results.items()):
        fertilities = [
            tok_results[text_name]['fertility'] if tok_results.get(text_name) else 0
            for text_name in x
        ]
        offset = (i - len(results) / 2) * width
        ax.bar([p + offset for p in positions], fertilities, width, label=tok_name)
    ax.set_xlabel('Text Type')
    ax.set_ylabel('Fertility (tokens/chars)')
    ax.set_title('Tokenizer Fertility Comparison')
    ax.set_xticks(positions)
    ax.set_xticklabels(x)
    ax.legend()
    plt.tight_layout()
    plt.savefig('tokenizer_comparison.png')
    plt.show()


if __name__ == "__main__":
    analyze_tokenizers()
References
Papers
- Sennrich et al. (2016). "Neural Machine Translation of Rare Words with Subword Units" (BPE)
- Kudo & Richardson (2018). "SentencePiece: A simple and language independent subword tokenizer"
- Xue et al. (2021). "ByT5: Towards a token-free future with pre-trained byte-to-byte models"
Tools
- HuggingFace Tokenizers
- SentencePiece
- tiktoken (OpenAI)