11. Small Language Models

Overview

Large models (100B+ parameters) are impressive, but in real production environments Small Language Models (SLMs) are often the more practical choice. This lesson covers the architectures, training strategies, and deployment of models at 7B parameters and below.
1. The Importance of SLMs
1.1 Why a Small Model?
SLM vs LLM comparison:

| | SLM (1-7B) | LLM (70B+) |
|---|---|---|
| Cost | Low | High |
| Latency | Low (<100 ms) | High (>500 ms) |
| Hardware | Single GPU/CPU | Multiple GPUs required |
| Edge deployment | Feasible | Difficult |
| Privacy | On-premises friendly | Difficult |
| Specialized tasks | Cost-efficient | Overkill |

Typical SLM use cases:

- Mobile apps (on-device)
- Embedded systems
- High-traffic API services
- Cost-sensitive startups
- Privacy-critical domains
1.2 SLM Model Comparison
| Model | Parameters | Training Tokens | Notes |
|---|---|---|---|
| Phi-3 | 3.8B | 3.3T | Microsoft; reasoning-focused |
| Gemma 2 | 2B / 9B | 8T | Google; strong at code |
| Qwen 2.5 | 0.5B - 7B | 18T | Multilingual; strong at math |
| Llama 3.2 | 1B / 3B | 15T | Mobile-optimized |
| TinyLlama | 1.1B | 3T | Efficient training |
| StableLM 2 | 1.6B | 2T | Stability AI |
| SmolLM | 135M - 1.7B | 1T | Hugging Face |
2. Architecture Optimization
2.1 The Phi Series (Microsoft)
"""
Phi-3: "Textbooks Are All You Need" ์ฒ ํ
ํต์ฌ ์์ด๋์ด:
1. ๋ฐ์ดํฐ ํ์ง > ๋ฐ์ดํฐ ์
2. ํฉ์ฑ ๋ฐ์ดํฐ ํ์ฉ (GPT-4๋ก ์์ฑ)
3. ๊ต๊ณผ์๊ธ ํ์ง์ ๋ฐ์ดํฐ๋ง ์ฌ์ฉ
๊ฒฐ๊ณผ: 3.8B๋ก GPT-3.5๊ธ ์ถ๋ก ๋ฅ๋ ฅ
"""
class Phi3Config:
    """Phi-3 architecture settings"""
    # Phi-3-mini (3.8B)
    hidden_size = 3072
    num_layers = 32
    num_attention_heads = 32
    num_key_value_heads = 32        # no GQA
    intermediate_size = 8192        # FFN expansion ratio ~2.7x
    vocab_size = 32064
    max_position_embeddings = 4096  # extensible (128K variant exists)

    # Notable design choices:
    # - LongRoPE position scaling (in the 128K-context variant)
    # - RMSNorm
    # - SwiGLU FFN
# Phi-3 usage example
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def use_phi3():
    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/Phi-3-mini-4k-instruct",
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True
    )
    tokenizer = AutoTokenizer.from_pretrained(
        "microsoft/Phi-3-mini-4k-instruct"
    )

    # Inference
    messages = [
        {"role": "user", "content": "Explain the Pythagorean theorem."}
    ]
    inputs = tokenizer.apply_chat_template(
        messages, return_tensors="pt", return_dict=True
    ).to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        do_sample=True
    )
    return tokenizer.decode(outputs[0])
2.2 Gemma 2 (Google)
"""
Gemma 2: efficient architecture design

Key features:
1. Alternating local-global attention
2. Soft-capping (logits & attention)
3. Pre-norm + post-norm hybrid
4. Knowledge distillation from larger models
"""
class Gemma2Config:
    """Gemma 2 architecture"""
    # Gemma 2 2B
    hidden_size = 2304
    num_layers = 26
    num_attention_heads = 8
    num_key_value_heads = 4    # uses GQA
    intermediate_size = 9216
    vocab_size = 256128        # large vocabulary

    # Gemma 2 9B
    # hidden_size = 3584
    # num_layers = 42
    # num_attention_heads = 16
    # num_key_value_heads = 8
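A quick back-of-envelope sketch (pure Python, plugging in the 2B config values above) of why GQA matters for deployment: the KV cache scales with `num_key_value_heads`, not `num_attention_heads`, so halving the KV heads halves inference memory for long contexts. The `kv_cache_bytes` helper is illustrative, not part of any library.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x seq_len."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Gemma 2 2B: hidden 2304 over 8 query heads -> head_dim 288; 4 KV heads
head_dim = 2304 // 8
with_gqa = kv_cache_bytes(26, 4, head_dim, seq_len=8192)     # GQA (actual config)
without_gqa = kv_cache_bytes(26, 8, head_dim, seq_len=8192)  # hypothetical full MHA
print(f"GQA: {with_gqa / 1e6:.0f} MB, MHA: {without_gqa / 1e6:.0f} MB")
```

At an 8K context this is roughly 1 GB vs 2 GB of cache in FP16, a meaningful saving for a model whose weights are only ~4.6 GB.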
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GemmaAttentionWithSoftCap(nn.Module):
    """Gemma 2-style soft-capping attention (simplified, single-head sketch)"""

    def __init__(self, config, layer_idx: int):
        super().__init__()
        self.config = config
        self.layer_idx = layer_idx

        # Alternating local / global attention:
        # even layers: local (sliding window)
        # odd layers: global (full attention)
        self.is_local = (layer_idx % 2 == 0)
        self.sliding_window = 4096 if self.is_local else None

        # Soft-cap value
        self.attn_logit_softcap = 50.0

        # Projections (K/V at half width to mimic GQA)
        self.q_proj = nn.Linear(config.hidden_size, config.hidden_size)
        self.k_proj = nn.Linear(config.hidden_size, config.hidden_size // 2)
        self.v_proj = nn.Linear(config.hidden_size, config.hidden_size // 2)
        self.o_proj = nn.Linear(config.hidden_size, config.hidden_size)

    def forward(self, hidden_states, attention_mask=None):
        batch, seq_len, _ = hidden_states.shape
        Q = self.q_proj(hidden_states)
        K = self.k_proj(hidden_states)
        V = self.v_proj(hidden_states)

        # GQA: expand the shared K, V back to query width (simplified)
        K = K.repeat_interleave(2, dim=-1)
        V = V.repeat_interleave(2, dim=-1)

        # Attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1))
        scores = scores / math.sqrt(Q.shape[-1])

        # Soft-capping: bound the logits with tanh
        scores = self.attn_logit_softcap * torch.tanh(scores / self.attn_logit_softcap)

        # Sliding-window mask (local attention)
        if self.is_local and self.sliding_window:
            mask = self._create_sliding_window_mask(seq_len)
            scores = scores + mask.to(scores.device)

        # Causal mask
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len) * float('-inf'),
            diagonal=1
        ).to(scores.device)
        scores = scores + causal_mask

        weights = F.softmax(scores, dim=-1)
        output = torch.matmul(weights, V)
        return self.o_proj(output)

    def _create_sliding_window_mask(self, seq_len):
        """Sliding-window attention mask"""
        mask = torch.ones(seq_len, seq_len) * float('-inf')
        for i in range(seq_len):
            start = max(0, i - self.sliding_window)
            mask[i, start:i + 1] = 0
        return mask
2.3 Qwen 2.5 (Alibaba)
"""
Qwen 2.5: multilingual & strong at math

Features:
1. Large-scale multilingual training (29 languages)
2. Code/math-specialized data
3. Long context (128K)
4. Wide size range (0.5B - 72B)
"""
class Qwen25Config:
    """Qwen 2.5 architecture"""
    # Qwen2.5-0.5B (smallest version)
    hidden_size = 896
    num_layers = 24
    num_attention_heads = 14
    num_key_value_heads = 2    # aggressive GQA
    intermediate_size = 4864
    vocab_size = 151936

    # Qwen2.5-7B
    # hidden_size = 3584
    # num_layers = 28
    # num_attention_heads = 28
    # num_key_value_heads = 4
# Qwen usage example
def use_qwen():
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen2.5-0.5B-Instruct",
        torch_dtype="auto",
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

    # Multilingual test
    prompts = [
        "Explain machine learning in simple terms.",
        "用简单的话解释机器学习",  # Chinese
        "기계 학습을 쉽게 설명해주세요",  # Korean
    ]

    for prompt in prompts:
        messages = [{"role": "user", "content": prompt}]
        text = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = tokenizer([text], return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=128)
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))
        print("-" * 50)
3. Training Strategies
3.1 Data Quality vs. Quantity
"""
The key to SLM training: high-quality data

Lessons from Phi:
- Web-crawled data (low quality) < textbook-style data
- Synthetic data (GPT-4 generated) is effective
- Filtering matters a great deal
"""
class HighQualityDataPipeline:
    """High-quality data pipeline"""

    def __init__(self, quality_model):
        self.quality_model = quality_model

    def filter_data(self, texts: list, threshold: float = 0.8):
        """Quality-based filtering"""
        filtered = []
        for text in texts:
            score = self.quality_model.score(text)
            if score > threshold:
                filtered.append(text)
        print(f"Filtered: {len(texts)} -> {len(filtered)}")
        return filtered

    def generate_synthetic_data(
        self,
        teacher_model,
        topics: list,
        n_samples: int = 10000
    ):
        """Generate synthetic data"""
        synthetic_data = []
        for topic in topics:
            prompt = f"""Create an educational explanation about {topic}.
The explanation should be:
1. Clear and concise
2. Include examples
3. Suitable for learning"""

            for _ in range(n_samples // len(topics)):
                response = teacher_model.generate(prompt)
                # Quality check
                if self._validate_response(response):
                    synthetic_data.append({
                        'topic': topic,
                        'content': response
                    })
        return synthetic_data

    def _validate_response(self, response: str) -> bool:
        """Validate response quality"""
        # Length check
        if len(response.split()) < 50:
            return False
        # Repetition check
        sentences = response.split('.')
        if len(set(sentences)) / len(sentences) < 0.8:
            return False
        return True
3.2 Knowledge Distillation
"""
Knowledge distillation: large model -> small model

Transfer the teacher's (large model's) knowledge to the student (SLM).
"""
import torch
import torch.nn.functional as F

class DistillationTrainer:
    """KD-based SLM training"""

    def __init__(
        self,
        teacher_model,              # e.g. Llama 70B
        student_model,              # e.g. a 3B model
        temperature: float = 2.0,
        alpha: float = 0.5          # soft/hard loss mixing ratio
    ):
        self.teacher = teacher_model
        self.student = student_model
        self.temperature = temperature
        self.alpha = alpha

        # The teacher stays frozen during training
        self.teacher.eval()
        for param in self.teacher.parameters():
            param.requires_grad = False

    def distillation_loss(
        self,
        student_logits: torch.Tensor,
        teacher_logits: torch.Tensor,
        labels: torch.Tensor
    ) -> torch.Tensor:
        """
        Distillation loss = alpha * soft loss + (1 - alpha) * hard loss

        Soft loss: KL(teacher_soft || student_soft)
        Hard loss: CrossEntropy(student, labels)
        """
        T = self.temperature

        # Soft targets (temperature scaling)
        teacher_soft = F.softmax(teacher_logits / T, dim=-1)
        student_soft = F.log_softmax(student_logits / T, dim=-1)

        # KL divergence (soft loss)
        soft_loss = F.kl_div(
            student_soft,
            teacher_soft,
            reduction='batchmean'
        ) * (T ** 2)  # compensate for temperature scaling

        # Cross-entropy (hard loss)
        hard_loss = F.cross_entropy(
            student_logits.view(-1, student_logits.size(-1)),
            labels.view(-1),
            ignore_index=-100
        )

        # Combined loss
        loss = self.alpha * soft_loss + (1 - self.alpha) * hard_loss
        return loss

    def train_step(self, batch):
        """One training step"""
        input_ids = batch['input_ids']
        labels = batch['labels']

        # Teacher forward (no grad)
        with torch.no_grad():
            teacher_outputs = self.teacher(input_ids)
            teacher_logits = teacher_outputs.logits

        # Student forward
        student_outputs = self.student(input_ids)
        student_logits = student_outputs.logits

        # Distillation loss
        loss = self.distillation_loss(
            student_logits, teacher_logits, labels
        )
        return loss
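The soft term can be sanity-checked with a tiny hand-rolled version (pure Python, no torch; the helper names are illustrative): when student and teacher logits agree, the temperature-scaled KL term vanishes exactly, regardless of T.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a plain list of logits."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_soft_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) with temperature scaling, times T^2."""
    p = softmax(teacher_logits, T)  # teacher soft targets
    q = softmax(student_logits, T)  # student soft predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * T * T

print(kd_soft_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))  # 0.0 for identical logits
print(kd_soft_loss([0.0, 0.0, 0.0], [2.0, 0.5, -1.0]))   # positive when they disagree
```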
# Response-level distillation (often more effective)
class ResponseDistillation:
    """Response-level KD"""

    def __init__(self, teacher_model, student_model):
        self.teacher = teacher_model
        self.student = student_model

    def generate_training_data(self, prompts: list):
        """Build training data from teacher responses"""
        training_data = []
        for prompt in prompts:
            # Generate a teacher response
            teacher_response = self.teacher.generate(
                prompt,
                max_new_tokens=512,
                temperature=0.7
            )
            training_data.append({
                'prompt': prompt,
                'response': teacher_response
            })
        return training_data

    def train_on_responses(self, training_data):
        """Train the student on teacher responses"""
        # Standard SFT (supervised fine-tuning)
        for item in training_data:
            full_text = f"{item['prompt']}\n{item['response']}"
            # ... run SFT on full_text
3.3 Efficient Training Techniques
"""
Techniques for making SLM training efficient
"""
# 1. Gradient accumulation (large effective batch from small batches)
import torch

def train_with_grad_accumulation(
    model,
    dataloader,
    accumulation_steps: int = 8
):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for i, batch in enumerate(dataloader):
        outputs = model(**batch)
        loss = outputs.loss / accumulation_steps
        loss.backward()

        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
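A sanity check for the `loss / accumulation_steps` pattern above, as a pure-Python sketch with made-up data: because the per-batch loss is a mean, accumulating scaled gradients over equal-sized micro-batches reproduces the full-batch gradient exactly for a 1-D linear least-squares model.

```python
def grad(w, batch):
    """Gradient of mean squared error for y_hat = w * x over a batch."""
    return sum(2 * x * (w * x - y) for x, y in batch) / len(batch)

# Illustrative data: 8 (x, y) pairs split into 2 micro-batches of 4
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0),
        (0.5, 1.0), (1.5, 2.0), (2.5, 4.0), (3.5, 6.0)]
w = 0.3

full = grad(w, data)                       # one big batch
micro_batches = [data[:4], data[4:]]
accumulated = sum(grad(w, mb) / len(micro_batches) for mb in micro_batches)
print(full, accumulated)  # the two gradients match
```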
# 2. Parameter-efficient fine-tuning with LoRA
from peft import LoraConfig, get_peft_model

def setup_lora_training(model):
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        lora_dropout=0.1,
        bias="none"
    )
    model = get_peft_model(model, lora_config)

    # Report trainable parameters
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)")
    return model
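The printout above can be predicted on paper: a rank-r adapter on a d_in x d_out projection adds r * (d_in + d_out) parameters (matrices A and B). A stdlib sketch using Qwen2.5-0.5B-like shapes, simplified by treating all four attention projections as hidden x hidden (the real k/v projections are narrower under GQA):

```python
def lora_params(d_in, d_out, r):
    """LoRA adds A (d_in x r) and B (r x d_out)."""
    return r * (d_in + d_out)

hidden = 896  # Qwen2.5-0.5B hidden size
r = 16

# q/k/v/o projections, all treated as hidden x hidden for simplicity
adapter = 4 * lora_params(hidden, hidden, r)
full = 4 * hidden * hidden
print(f"{adapter:,} adapter params vs {full:,} full ({100 * adapter / full:.1f}%)")
```

This is why LoRA routinely reports training well under 5% of the attention weights.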
# 3. QLoRA (quantization + LoRA)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def setup_qlora_training(model_name):
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto"
    )
    # Add LoRA adapters
    return setup_lora_training(model)
4. Deployment Optimization
4.1 Quantization
"""
SLM quantization: optimizing memory & speed
"""
# 1. GPTQ (post-training quantization)
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

def quantize_with_gptq(model_name):
    gptq_config = GPTQConfig(
        bits=4,
        dataset="c4",
        tokenizer=AutoTokenizer.from_pretrained(model_name)
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=gptq_config,
        device_map="auto"
    )
    return model
# 2. AWQ (Activation-aware Weight Quantization)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

def quantize_with_awq(model_path, output_path):
    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Quantize
    model.quantize(
        tokenizer,
        quant_config={
            "zero_point": True,
            "q_group_size": 128,
            "w_bit": 4,
            "version": "GEMM"
        }
    )
    # Save
    model.save_quantized(output_path)
# 3. llama.cpp (GGUF format)
"""
llama.cpp quantization levels:
- Q2_K: 2-bit (very small, quality loss)
- Q4_K_M: 4-bit (recommended; quality/size balance)
- Q5_K_M: 5-bit (higher quality)
- Q8_0: 8-bit (near-original quality)

Command:
./quantize model.gguf model-q4_k_m.gguf Q4_K_M
"""
# Memory usage comparison
def compare_memory_usage():
    """Memory by parameter count and precision"""
    configs = [
        ("3B FP16", 3e9 * 2),    # 6 GB
        ("3B Q8",   3e9 * 1),    # 3 GB
        ("3B Q4",   3e9 * 0.5),  # 1.5 GB
        ("7B FP16", 7e9 * 2),    # 14 GB
        ("7B Q4",   7e9 * 0.5),  # 3.5 GB
    ]

    print("Model\t\tMemory (GB)")
    print("-" * 30)
    for name, memory in configs:
        print(f"{name}\t\t{memory / 1e9:.1f}")
4.2 On-Device Deployment
"""
Deploying to mobile/edge devices
"""
# 1. ONNX conversion
def convert_to_onnx(model, tokenizer, output_path):
    from optimum.onnxruntime import ORTModelForCausalLM

    # Convert to ONNX and optimize
    ort_model = ORTModelForCausalLM.from_pretrained(
        model,
        export=True,
        provider="CPUExecutionProvider"
    )
    ort_model.save_pretrained(output_path)
# 2. TensorRT-LLM (NVIDIA GPUs)
"""
Using TensorRT-LLM:
1. Convert the model: python convert_checkpoint.py
2. Build the engine: trtllm-build
3. Run inference: python run.py
"""
# 3. llama.cpp (CPU inference)
"""
Using llama.cpp:
1. Convert to GGUF
2. Run llama-cli
./llama-cli -m model.gguf \
    -n 256 \
    -p "Hello, how are you?" \
    -t 4  # threads
"""
# 4. MLC-LLM (multi-platform)
"""
MLC-LLM: iOS, Android, WebGPU, CUDA
Mobile deployment is possible via the mlc_chat app
"""
5. Benchmarks & Evaluation
5.1 SLM Benchmark Results

SLM benchmark comparison (as of 2024-10):

| Model | Params | MMLU | GSM8K | HumanEval | TriviaQA |
|---|---|---|---|---|---|
| Phi-3-mini | 3.8B | 69.9% | 82.5% | 57.9% | 63.5% |
| Gemma-2-9B | 9B | 71.3% | 68.6% | 54.3% | 73.5% |
| Qwen2.5-7B | 7B | 74.2% | 82.6% | 75.6% | 71.4% |
| Llama-3.2-3B | 3B | 63.4% | 44.4% | 36.0% | 63.4% |
| SmolLM-1.7B | 1.7B | 42.3% | 18.2% | 28.7% | 42.1% |
| GPT-4 (reference) | - | 86.4% | 92.0% | 67.0% | 87.6% |

Notes:
- Phi-3 shows outstanding reasoning ability for its small size.
- Qwen2.5 is strong at code (HumanEval).
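The table also invites a size-normalized reading. A quick illustrative sketch, ranking MMLU points per billion parameters with the numbers copied from the table above (a crude metric, since capability does not scale linearly with size):

```python
# (params in billions, MMLU %) copied from the benchmark table
results = {
    "Phi-3-mini":   (3.8, 69.9),
    "Gemma-2-9B":   (9.0, 71.3),
    "Qwen2.5-7B":   (7.0, 74.2),
    "Llama-3.2-3B": (3.0, 63.4),
    "SmolLM-1.7B":  (1.7, 42.3),
}

efficiency = sorted(
    ((name, mmlu / params) for name, (params, mmlu) in results.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in efficiency:
    print(f"{name}: {score:.1f} MMLU pts per B params")
```

The smallest models top this per-parameter ranking while the largest sit at the bottom, which is exactly the cost/quality trade-off the task guide below navigates.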
5.2 Choosing an SLM by Task
"""
Per-task SLM recommendations
"""
TASK_MODEL_RECOMMENDATIONS = {
    # General chat
    "general_chat": {
        "best": "Qwen2.5-7B-Instruct",
        "budget": "Qwen2.5-1.5B-Instruct",
        "mobile": "Qwen2.5-0.5B-Instruct"
    },
    # Code generation
    "code_generation": {
        "best": "Qwen2.5-Coder-7B",
        "budget": "CodeGemma-2B",
        "mobile": "Phi-3-mini"
    },
    # Math / reasoning
    "math_reasoning": {
        "best": "Qwen2.5-Math-7B",
        "budget": "Phi-3-mini",
        "mobile": "Phi-3-mini"
    },
    # Korean
    "korean": {
        "best": "Qwen2.5-7B-Instruct",  # strong multilingual
        "budget": "EXAONE-3.0-7.8B-Instruct",
        "mobile": "Qwen2.5-1.5B-Instruct"
    },
    # RAG / retrieval
    "rag": {
        "best": "Gemma-2-9B",
        "budget": "Llama-3.2-3B",
        "mobile": "Phi-3-mini"
    },
    # Summarization
    "summarization": {
        "best": "Qwen2.5-7B-Instruct",
        "budget": "Gemma-2-2B",
        "mobile": "SmolLM-1.7B"
    }
}

def select_model(task: str, constraint: str = "best"):
    """Pick a model matching the task and constraint"""
    if task in TASK_MODEL_RECOMMENDATIONS:
        return TASK_MODEL_RECOMMENDATIONS[task].get(constraint)
    return "Qwen2.5-7B-Instruct"  # default
6. Hands-On: SLM Fine-tuning
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset

def finetune_slm():
    """SLM QLoRA fine-tuning example"""

    # 1. Load the model (4-bit quantized)
    model_name = "Qwen/Qwen2.5-1.5B-Instruct"
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    # 2. LoRA setup
    model = prepare_model_for_kbit_training(model)
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    # 3. Dataset
    dataset = load_dataset("timdettmers/openassistant-guanaco")

    def preprocess(examples):
        texts = []
        for text in examples['text']:
            # Append EOS to each sample
            texts.append(text + tokenizer.eos_token)
        tokenized = tokenizer(
            texts,
            truncation=True,
            max_length=1024,
            padding="max_length"
        )
        tokenized['labels'] = tokenized['input_ids'].copy()
        return tokenized

    tokenized_dataset = dataset['train'].map(
        preprocess,
        batched=True,
        remove_columns=dataset['train'].column_names
    )

    # 4. Training
    training_args = TrainingArguments(
        output_dir="./qwen-finetuned",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        logging_steps=10,
        save_steps=500,
        bf16=True,
        optim="paged_adamw_8bit"
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset,
        tokenizer=tokenizer,
    )
    trainer.train()

    # 5. Save
    model.save_pretrained("./qwen-lora-adapter")
    print("Fine-tuning complete!")

if __name__ == "__main__":
    finetune_slm()
References

Papers

- Gunasekar et al. (2023). "Textbooks Are All You Need" (Phi)
- Gemma Team (2024). "Gemma 2: Improving Open Language Models at a Practical Size"
- Yang et al. (2024). "Qwen2 Technical Report"