06. HuggingFace Basics

Learning Objectives

  • Understand the Transformers library
  • Use the Pipeline API
  • Load tokenizers and models
  • Perform a variety of NLP tasks

1. The HuggingFace Ecosystem

Main Components

HuggingFace
├── Transformers   # Model library
├── Datasets       # Datasets
├── Tokenizers     # Tokenizers
├── Hub            # Model/data repository
├── Accelerate     # Distributed training
└── Evaluate       # Evaluation metrics

Installation

pip install transformers datasets tokenizers accelerate evaluate
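
To confirm the installation, a quick sanity check is to import each package and print its version; a minimal sketch (the version strings depend on your environment):

import transformers
import datasets
import tokenizers
import accelerate
import evaluate

# Version strings depend on when you installed the packages
for lib in (transformers, datasets, tokenizers, accelerate, evaluate):
    print(lib.__name__, lib.__version__)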

2. Pipeline API

The Simplest Usage

from transformers import pipeline

# ๊ฐ์„ฑ ๋ถ„์„
classifier = pipeline("sentiment-analysis")
result = classifier("I love this movie!")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Batch processing
results = classifier([
    "I love this movie!",
    "This is terrible."
])

์ง€์› ํƒœ์Šคํฌ

ํƒœ์Šคํฌ Pipeline ์ด๋ฆ„ ์„ค๋ช…
๊ฐ์„ฑ ๋ถ„์„ sentiment-analysis ๊ธ์ •/๋ถ€์ • ๋ถ„๋ฅ˜
ํ…์ŠคํŠธ ๋ถ„๋ฅ˜ text-classification ์ผ๋ฐ˜ ๋ถ„๋ฅ˜
NER ner ๊ฐœ์ฒด๋ช… ์ธ์‹
QA question-answering ์งˆ์˜์‘๋‹ต
์š”์•ฝ summarization ํ…์ŠคํŠธ ์š”์•ฝ
๋ฒˆ์—ญ translation ์–ธ์–ด ๋ฒˆ์—ญ
ํ…์ŠคํŠธ ์ƒ์„ฑ text-generation ๋ฌธ์žฅ ์ƒ์„ฑ
Fill-Mask fill-mask ๋งˆ์Šคํฌ ์˜ˆ์ธก
Zero-shot zero-shot-classification ๋ ˆ์ด๋ธ” ์—†๋Š” ๋ถ„๋ฅ˜

More Pipeline Examples

# ์งˆ์˜์‘๋‹ต
qa = pipeline("question-answering")
result = qa(
    question="What is the capital of France?",
    context="Paris is the capital and most populous city of France."
)
# {'answer': 'Paris', 'score': 0.99, 'start': 0, 'end': 5}

# Summarization
summarizer = pipeline("summarization")
text = "Very long article text here..."
summary = summarizer(text, max_length=50, min_length=10)

# Translation
translator = pipeline("translation_en_to_fr")
result = translator("Hello, how are you?")
# [{'translation_text': 'Bonjour, comment allez-vous?'}]

# ํ…์ŠคํŠธ ์ƒ์„ฑ
generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a time", max_length=50)

# NER
ner = pipeline("ner", grouped_entities=True)
result = ner("My name is John and I work at Google in New York")
# [{'entity_group': 'PER', 'word': 'John', ...},
#  {'entity_group': 'ORG', 'word': 'Google', ...},
#  {'entity_group': 'LOC', 'word': 'New York', ...}]

# Zero-shot classification
classifier = pipeline("zero-shot-classification")
result = classifier(
    "I want to go to the beach",
    candidate_labels=["travel", "cooking", "technology"]
)
# {'labels': ['travel', 'cooking', 'technology'], 'scores': [0.95, 0.03, 0.02]}
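
Fill-mask appears in the task table above but not in the examples; a minimal sketch is below (bert-base-uncased is an arbitrary choice here, and the mask token string depends on the model, e.g. <mask> for RoBERTa):

# Fill-Mask: predict the masked token
unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("Paris is the [MASK] of France.")
# Returns the top candidates for the masked position; 'capital' should rank highly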

ํŠน์ • ๋ชจ๋ธ ์ง€์ •

# Korean model
classifier = pipeline(
    "sentiment-analysis",
    model="beomi/kcbert-base"
)

# Multilingual model
qa = pipeline(
    "question-answering",
    model="deepset/xlm-roberta-large-squad2"
)
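
Pipelines run on the CPU by default. Passing the device argument moves the model to an accelerator; a small sketch assuming a single CUDA GPU at index 0:

# Run on GPU 0 (omit device, or pass device=-1, to stay on CPU)
classifier = pipeline(
    "sentiment-analysis",
    device=0
)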

3. ํ† ํฌ๋‚˜์ด์ €

AutoTokenizer

from transformers import AutoTokenizer

# ์ž๋™์œผ๋กœ ์ ํ•ฉํ•œ ํ† ํฌ๋‚˜์ด์ € ๋กœ๋“œ
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encoding
text = "Hello, how are you?"
encoded = tokenizer(text)
print(encoded)
# {'input_ids': [101, 7592, ...], 'attention_mask': [1, 1, ...], ...}

# Return as tensors
encoded = tokenizer(text, return_tensors='pt')

์ฃผ์š” ํŒŒ๋ผ๋ฏธํ„ฐ

encoded = tokenizer(
    text,
    padding=True,              # Add padding
    truncation=True,           # Truncate to the maximum length
    max_length=128,            # Maximum length
    return_tensors='pt',       # Return PyTorch tensors
    return_attention_mask=True,
    return_token_type_ids=True
)

๋ฐฐ์น˜ ์ธ์ฝ”๋”ฉ

texts = ["Hello world", "How are you?", "I'm fine"]

# ๋™์  ํŒจ๋”ฉ
encoded = tokenizer(
    texts,
    padding=True,     # Pad to the longest sequence in the batch
    truncation=True,
    return_tensors='pt'
)

print(encoded['input_ids'].shape)  # (3, max_len)

Decoding

# Decode back to text
decoded = tokenizer.decode(encoded['input_ids'][0])
print(decoded)  # "[CLS] hello world [SEP] [PAD] [PAD]"

# ํŠน์ˆ˜ ํ† ํฐ ์ œ๊ฑฐ
decoded = tokenizer.decode(encoded['input_ids'][0], skip_special_tokens=True)
print(decoded)  # "hello world"
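
To decode every row of a batch at once, batch_decode applies the same logic across the whole tensor; a short sketch reusing the batch encoded above (the exact strings depend on the tokenizer's clean-up rules):

# Decode all sequences in the batch in one call
decoded_batch = tokenizer.batch_decode(encoded['input_ids'], skip_special_tokens=True)
print(decoded_batch)  # roughly: ["hello world", "how are you?", "i'm fine"]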

ํ† ํฐ ํ™•์ธ

# Token list
tokens = tokenizer.tokenize("Hello, how are you?")
print(tokens)  # ['hello', ',', 'how', 'are', 'you', '?']

# Tokens → IDs
ids = tokenizer.convert_tokens_to_ids(tokens)

# IDs → tokens
tokens = tokenizer.convert_ids_to_tokens(ids)
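
Tokenizers also expose their special tokens and vocabulary size as attributes, which is useful when building inputs by hand; a short sketch:

# Special tokens and vocabulary size
print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token)  # [CLS] [SEP] [PAD]
print(tokenizer.pad_token_id)  # 0 for bert-base-uncased
print(len(tokenizer))          # vocabulary size (30522 for bert-base-uncased)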

4. ๋ชจ๋ธ ๋กœ๋“œ

AutoModel

from transformers import AutoModel, AutoModelForSequenceClassification

# ๊ธฐ๋ณธ ๋ชจ๋ธ (์ถœ๋ ฅ: ์€๋‹‰ ์ƒํƒœ)
model = AutoModel.from_pretrained("bert-base-uncased")

# ๋ถ„๋ฅ˜ ๋ชจ๋ธ (์ถœ๋ ฅ: ๋กœ์ง“)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2
)

ํƒœ์Šคํฌ๋ณ„ AutoModel

from transformers import (
    AutoModelForSequenceClassification,  # Sequence classification
    AutoModelForTokenClassification,      # Token classification (NER)
    AutoModelForQuestionAnswering,        # Question answering
    AutoModelForCausalLM,                 # GPT-style generation
    AutoModelForSeq2SeqLM,                # Encoder-decoder (translation, summarization)
    AutoModelForMaskedLM                  # BERT-style masked LM
)
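
As a concrete use of one of these heads, the sketch below generates text with AutoModelForCausalLM (gpt2 is just an illustrative checkpoint, and sampling makes the output non-deterministic):

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Encode a prompt and sample a continuation
inputs = tokenizer("Once upon a time", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30, do_sample=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))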

Inference

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Encode the input
inputs = tokenizer("I love this movie!", return_tensors="pt")

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Prediction
predictions = torch.softmax(logits, dim=-1)
predicted_class = predictions.argmax().item()
print(f"Class: {predicted_class}, Confidence: {predictions[0][predicted_class]:.4f}")

5. The Datasets Library

๋ฐ์ดํ„ฐ์…‹ ๋กœ๋“œ

from datasets import load_dataset

# Load from the HuggingFace Hub
dataset = load_dataset("imdb")
print(dataset)
# DatasetDict({
#     train: Dataset({features: ['text', 'label'], num_rows: 25000})
#     test: Dataset({features: ['text', 'label'], num_rows: 25000})
# })

# Specify a split
train_data = load_dataset("imdb", split="train")
test_data = load_dataset("imdb", split="test[:1000]")  # first 1,000 examples

# Inspect a sample
print(train_data[0])
# {'text': '...', 'label': 1}
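
Dataset objects also support shuffling and subsampling, which helps with quick experiments; a small sketch (the seed and subset size are arbitrary):

# Take a random 1,000-example subset for fast iteration
small_train = train_data.shuffle(seed=42).select(range(1000))
print(len(small_train))  # 1000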

๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ

def preprocess(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        padding='max_length',
        max_length=256
    )

# Apply with map
tokenized_dataset = dataset.map(preprocess, batched=True)

# Remove columns the model does not need
tokenized_dataset = tokenized_dataset.remove_columns(['text'])

# Output PyTorch tensors
tokenized_dataset.set_format('torch')

DataLoader ์ƒ์„ฑ

from torch.utils.data import DataLoader

train_loader = DataLoader(
    tokenized_dataset['train'],
    batch_size=16,
    shuffle=True
)

for batch in train_loader:
    print(batch['input_ids'].shape)  # (16, 256)
    break
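
The preprocessing above pads every example to max_length, so batches carry a lot of padding. An alternative sketch uses DataCollatorWithPadding for per-batch (dynamic) padding; it assumes the map step only truncated and did not pad:

from transformers import DataCollatorWithPadding

# Pads each batch only to the length of its longest example
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

train_loader = DataLoader(
    tokenized_dataset['train'],
    batch_size=16,
    shuffle=True,
    collate_fn=data_collator
)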

6. Trainer API

Basic Training

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)
from datasets import load_dataset

# ๋ฐ์ดํ„ฐ
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length', max_length=256)

tokenized = dataset.map(tokenize, batched=True)

# ๋ชจ๋ธ
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2
)

# Training configuration
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized['train'],
    eval_dataset=tokenized['test'],
)

# Train
trainer.train()

# Evaluate
results = trainer.evaluate()
print(results)

Custom Metrics

import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized['train'],
    eval_dataset=tokenized['test'],
    compute_metrics=compute_metrics
)
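
compute_metrics can return any number of values; the sketch below reports accuracy and F1 together (both metrics ship with the evaluate library):

import evaluate

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=predictions, references=labels)["accuracy"],
        "f1": f1.compute(predictions=predictions, references=labels)["f1"],
    }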

7. ๋ชจ๋ธ ์ €์žฅ/๋กœ๋“œ

Saving Locally

# Save
model.save_pretrained("./my_model")
tokenizer.save_pretrained("./my_model")

# Load
model = AutoModelForSequenceClassification.from_pretrained("./my_model")
tokenizer = AutoTokenizer.from_pretrained("./my_model")

Uploading to the Hub

# ๋กœ๊ทธ์ธ
from huggingface_hub import login
login(token="your_token")

# Upload
model.push_to_hub("my-username/my-model")
tokenizer.push_to_hub("my-username/my-model")

# Or push via the Trainer
trainer.push_to_hub("my-model")

8. End-to-End Example: Sentiment Classification

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)
from datasets import load_dataset
import evaluate

# 1. Load the data
dataset = load_dataset("imdb")

# 2. Tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, padding='max_length', max_length=256)

tokenized = dataset.map(tokenize, batched=True)
tokenized.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

# 3. Model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2
)

# 4. Metrics
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions = eval_pred.predictions.argmax(axis=-1)
    return accuracy.compute(predictions=predictions, references=eval_pred.label_ids)

# 5. Training configuration
args = TrainingArguments(
    output_dir="./imdb_classifier",
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=torch.cuda.is_available(),  # Mixed Precision
)

# 6. Trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized['train'],
    eval_dataset=tokenized['test'],
    compute_metrics=compute_metrics,
)

# 7. Train
trainer.train()

# 8. Inference
def predict(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    return "Positive" if probs[0][1] > 0.5 else "Negative", probs[0][1].item()

print(predict("This movie was amazing!"))
# ('Positive', 0.9876)

Summary

Key Classes

Class | Purpose
pipeline | Quick inference
AutoTokenizer | Automatically loads the right tokenizer
AutoModel* | Automatically loads the right model
Trainer | Automates the training loop
TrainingArguments | Training configuration

Core Code

from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# Quick inference
classifier = pipeline("sentiment-analysis")
result = classifier("I love this!")

# Custom inference
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
inputs = tokenizer("Hello", return_tensors="pt")
outputs = model(**inputs)

๋‹ค์Œ ๋‹จ๊ณ„

07_Fine_Tuning.md์—์„œ ๋‹ค์–‘ํ•œ ํƒœ์Šคํฌ์— ๋Œ€ํ•œ ํŒŒ์ธํŠœ๋‹ ๊ธฐ๋ฒ•์„ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.
