11. Small Language Models¶
Overview¶
While large models (100B+) are making headlines, Small Language Models (SLMs) are often more practical for real production environments. This lesson covers the architecture, training strategies, and deployment of models with 7B parameters or fewer.
1. Importance of SLMs¶
1.1 Why Small Models?¶
SLM vs LLM Comparison

| | SLM (1-7B) | LLM (70B+) |
|---|---|---|
| Cost | Low | High |
| Latency | Low (<100ms) | High (>500ms) |
| Hardware | Single GPU/CPU | Multi-GPU required |
| Edge deployment | Possible | Difficult |
| Privacy | Easy on-premise | Difficult |
| Specialized tasks | Cost-effective | Overkill |

Use cases:

- Mobile apps (on-device)
- Embedded systems
- High-frequency API services
- Cost-sensitive startups
- Privacy-critical domains
1.2 SLM Model Comparison¶
| Model | Parameters | Training Tokens | Features |
|---|---|---|---|
| Phi-3 | 3.8B | 3.3T | MS, reasoning-focused |
| Gemma 2 | 2B / 9B | 8T | Google, strong at code |
| Qwen 2.5 | 0.5B - 7B | 18T | Multilingual, math |
| Llama 3.2 | 1B / 3B | 15T | Mobile-optimized |
| TinyLlama | 1.1B | 3T | Efficient training |
| StableLM 2 | 1.6B | 2T | Stability AI |
| SmolLM | 135M - 1.7B | 1T | HuggingFace |
2. Architecture Optimization¶
2.1 Phi Series (Microsoft)¶
"""
Phi-3: "Textbooks Are All You Need" Philosophy
Core Ideas:
1. Data quality > Data quantity
2. Utilize synthetic data (generated by GPT-4)
3. Use only textbook-quality data
Result: GPT-3.5-level reasoning with 3.8B parameters
"""
class Phi3Config:
    """Phi-3 Architecture Configuration"""
    # Phi-3-mini (3.8B)
    hidden_size = 3072
    num_layers = 32
    num_attention_heads = 32
    num_key_value_heads = 32  # No GQA
    intermediate_size = 8192  # FFN expansion ratio ~2.7x
    vocab_size = 32064
    max_position_embeddings = 4096  # Extendable

    # Features
    # - SuRoPE (Scaled RoPE)
    # - RMSNorm (Llama-style block structure)
    # - SwiGLU FFN
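As a sanity check on these numbers, the parameter count can be roughly reproduced from the config alone. This is a back-of-envelope sketch (mine, not the official Phi-3 accounting): it counts the embedding table, the four attention projections, and a SwiGLU FFN, ignoring norms and biases.

```python
def estimate_params(hidden, layers, inter, vocab):
    """Rough decoder-only transformer parameter count."""
    embed = vocab * hidden            # token embedding table
    attn = 4 * hidden * hidden        # Q, K, V, O projections (no GQA here)
    ffn = 3 * hidden * inter          # SwiGLU: gate, up, down matrices
    return embed + layers * (attn + ffn)

# Phi-3-mini values from the config above
total = estimate_params(hidden=3072, layers=32, inter=8192, vocab=32064)
print(f"~{total / 1e9:.1f}B parameters")  # ~3.7B, close to the advertised 3.8B
```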
# Phi-3 Usage Example
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def use_phi3():
    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/Phi-3-mini-4k-instruct",
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True
    )
    tokenizer = AutoTokenizer.from_pretrained(
        "microsoft/Phi-3-mini-4k-instruct"
    )

    # Inference
    messages = [
        {"role": "user", "content": "Explain the Pythagorean theorem."}
    ]
    inputs = tokenizer.apply_chat_template(
        messages, return_tensors="pt", return_dict=True,
        add_generation_prompt=True
    ).to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7
    )
    return tokenizer.decode(outputs[0])
2.2 Gemma 2 (Google)¶
"""
Gemma 2: Efficient Architecture Design
Key Features:
1. Alternating Local-Global Attention
2. Soft-Capping (Logits & Attention)
3. Pre-Norm + Post-Norm hybrid
4. Knowledge Distillation from larger models
"""
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class Gemma2Config:
    """Gemma 2 Architecture"""
    # Gemma 2 2B
    hidden_size = 2304
    num_layers = 26
    num_attention_heads = 8
    num_key_value_heads = 4  # Uses GQA
    intermediate_size = 9216
    vocab_size = 256128  # Large vocab

    # Gemma 2 9B
    # hidden_size = 3584
    # num_layers = 42
    # num_attention_heads = 16
    # num_key_value_heads = 8

class GemmaAttentionWithSoftCap(nn.Module):
    """Gemma 2 Style Soft-Capping Attention (simplified single-head view)"""

    def __init__(self, config, layer_idx: int):
        super().__init__()
        self.config = config
        self.layer_idx = layer_idx

        # Alternating local vs global attention
        # Even layers: local (sliding window)
        # Odd layers: global (full attention)
        self.is_local = (layer_idx % 2 == 0)
        self.sliding_window = 4096 if self.is_local else None

        # Soft-cap value
        self.attn_logit_softcap = 50.0

        # Projections (K/V are half-width to mimic GQA's smaller KV heads)
        self.q_proj = nn.Linear(config.hidden_size, config.hidden_size)
        self.k_proj = nn.Linear(config.hidden_size, config.hidden_size // 2)
        self.v_proj = nn.Linear(config.hidden_size, config.hidden_size // 2)
        self.o_proj = nn.Linear(config.hidden_size, config.hidden_size)

    def forward(self, hidden_states, attention_mask=None):
        batch, seq_len, _ = hidden_states.shape
        Q = self.q_proj(hidden_states)
        K = self.k_proj(hidden_states)
        V = self.v_proj(hidden_states)

        # GQA: expand K, V back to full width (simplified; real GQA repeats heads)
        K = K.repeat_interleave(2, dim=-1)
        V = V.repeat_interleave(2, dim=-1)

        # Attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1))
        scores = scores / math.sqrt(Q.shape[-1])

        # Soft-capping: limit range with tanh
        scores = self.attn_logit_softcap * torch.tanh(scores / self.attn_logit_softcap)

        # Sliding window mask (local attention)
        if self.is_local and self.sliding_window:
            mask = self._create_sliding_window_mask(seq_len).to(scores.device)
            scores = scores + mask

        # Causal mask
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len) * float('-inf'),
            diagonal=1
        ).to(scores.device)
        scores = scores + causal_mask

        weights = F.softmax(scores, dim=-1)
        output = torch.matmul(weights, V)
        return self.o_proj(output)

    def _create_sliding_window_mask(self, seq_len):
        """Sliding window attention mask"""
        mask = torch.ones(seq_len, seq_len) * float('-inf')
        for i in range(seq_len):
            start = max(0, i - self.sliding_window)
            mask[i, start:i+1] = 0
        return mask
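To see what soft-capping does in isolation, here is a minimal plain-Python illustration of the `cap * tanh(x / cap)` transform used in the attention above: small logits pass through almost unchanged, while large ones saturate just below the cap.

```python
import math

cap = 50.0  # Gemma 2's attention logit soft-cap
logits = [1.0, 30.0, 100.0, 500.0]
capped = [cap * math.tanh(x / cap) for x in logits]
for x, c in zip(logits, capped):
    print(f"{x:6.1f} -> {c:6.2f}")  # 1.0 stays ~1.0; 500.0 saturates near 50
```

Because tanh is smooth and monotonic, this bounds the logits without the hard clipping that would zero out gradients.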
2.3 Qwen 2.5 (Alibaba)¶
"""
Qwen 2.5: Strong at Multilingual & Math
Features:
1. Large-scale multilingual training (29 languages)
2. Specialized code/math data
3. Long context (128K)
4. Various sizes (0.5B ~ 72B)
"""
class Qwen25Config:
    """Qwen 2.5 Architecture"""
    # Qwen2.5-0.5B (smallest version)
    hidden_size = 896
    num_layers = 24
    num_attention_heads = 14
    num_key_value_heads = 2  # Efficient GQA
    intermediate_size = 4864
    vocab_size = 151936

    # Qwen2.5-7B
    # hidden_size = 3584
    # num_layers = 28
    # num_attention_heads = 28
    # num_key_value_heads = 4
# Qwen Usage Example
def use_qwen():
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen2.5-0.5B-Instruct",
        torch_dtype="auto",
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

    # Multilingual test
    prompts = [
        "Explain machine learning in simple terms.",
        "用简单的话解释机器学习。",  # Chinese: "Explain machine learning in simple terms."
        "기계 학습을 쉽게 설명해 주세요.",  # Korean: "Please explain machine learning simply."
    ]

    for prompt in prompts:
        messages = [{"role": "user", "content": prompt}]
        text = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = tokenizer([text], return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=128)
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))
        print("-" * 50)
3. Training Strategies¶
3.1 Data Quality vs Quantity¶
"""
Key to SLM Training: High-Quality Data
Lessons from Phi:
- Web crawl data (low quality) < Textbook-quality data
- Synthetic data (GPT-4 generated) is effective
- Filtering is extremely important
"""
class HighQualityDataPipeline:
    """High-Quality Data Pipeline"""

    def __init__(self, quality_model):
        self.quality_model = quality_model

    def filter_data(self, texts: list, threshold: float = 0.8):
        """Quality-based filtering"""
        filtered = []
        for text in texts:
            score = self.quality_model.score(text)
            if score > threshold:
                filtered.append(text)
        print(f"Filtered: {len(texts)} → {len(filtered)}")
        return filtered

    def generate_synthetic_data(
        self,
        teacher_model,
        topics: list,
        n_samples: int = 10000
    ):
        """Generate synthetic data"""
        synthetic_data = []
        for topic in topics:
            prompt = f"""Create an educational explanation about {topic}.
The explanation should be:
1. Clear and concise
2. Include examples
3. Suitable for learning"""
            for _ in range(n_samples // len(topics)):
                response = teacher_model.generate(prompt)
                # Quality verification
                if self._validate_response(response):
                    synthetic_data.append({
                        'topic': topic,
                        'content': response
                    })
        return synthetic_data

    def _validate_response(self, response: str) -> bool:
        """Validate response quality"""
        # Length check
        if len(response.split()) < 50:
            return False
        # Repetition check
        sentences = response.split('.')
        if len(set(sentences)) / len(sentences) < 0.8:
            return False
        return True
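The pipeline above takes a `quality_model` with a `score()` method as a dependency. As a hypothetical stand-in for quick experiments (my own sketch, not part of the Phi recipe), a crude heuristic scorer could reward length and lexical diversity:

```python
class HeuristicQualityScorer:
    """Toy quality scorer: rewards reasonable length and vocabulary diversity."""

    def score(self, text: str) -> float:
        words = text.split()
        if not words:
            return 0.0
        length_score = min(len(words) / 200, 1.0)   # saturates at 200 words
        diversity = len(set(words)) / len(words)    # unique-word ratio
        return 0.5 * length_score + 0.5 * diversity

scorer = HeuristicQualityScorer()
print(scorer.score("A short but varied sentence about transformers."))
```

A real pipeline would replace this with a trained classifier, as the Phi work describes.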
3.2 Knowledge Distillation¶
"""
Knowledge Distillation: Large Model â Small Model
Transfer knowledge from Teacher (large model) to Student (SLM)
"""
import torch
import torch.nn.functional as F

class DistillationTrainer:
    """KD-based SLM Training"""

    def __init__(
        self,
        teacher_model,   # e.g., Llama 70B
        student_model,   # e.g., 3B model
        temperature: float = 2.0,
        alpha: float = 0.5  # soft/hard loss ratio
    ):
        self.teacher = teacher_model
        self.student = student_model
        self.temperature = temperature
        self.alpha = alpha

        # Teacher is not trained
        self.teacher.eval()
        for param in self.teacher.parameters():
            param.requires_grad = False

    def distillation_loss(
        self,
        student_logits: torch.Tensor,
        teacher_logits: torch.Tensor,
        labels: torch.Tensor
    ) -> torch.Tensor:
        """
        Distillation Loss = α × Soft Loss + (1-α) × Hard Loss
        Soft Loss: KL(student_soft || teacher_soft)
        Hard Loss: CrossEntropy(student, labels)
        """
        T = self.temperature

        # Soft targets (temperature scaling)
        teacher_soft = F.softmax(teacher_logits / T, dim=-1)
        student_soft = F.log_softmax(student_logits / T, dim=-1)

        # KL Divergence (soft loss)
        soft_loss = F.kl_div(
            student_soft,
            teacher_soft,
            reduction='batchmean'
        ) * (T ** 2)  # Temperature scaling correction

        # Cross Entropy (hard loss)
        hard_loss = F.cross_entropy(
            student_logits.view(-1, student_logits.size(-1)),
            labels.view(-1),
            ignore_index=-100
        )

        # Combined loss
        loss = self.alpha * soft_loss + (1 - self.alpha) * hard_loss
        return loss

    def train_step(self, batch):
        """Training step"""
        input_ids = batch['input_ids']
        labels = batch['labels']

        # Teacher forward (no grad)
        with torch.no_grad():
            teacher_outputs = self.teacher(input_ids)
            teacher_logits = teacher_outputs.logits

        # Student forward
        student_outputs = self.student(input_ids)
        student_logits = student_outputs.logits

        # Distillation loss
        loss = self.distillation_loss(
            student_logits, teacher_logits, labels
        )
        return loss
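A quick hand-worked check of the soft-loss term: when student and teacher logits are identical, the KL divergence is exactly zero, so the combined loss reduces to the hard cross-entropy term alone. This pure-Python sketch mirrors the temperature-scaled math in `distillation_loss`:

```python
import math

def softmax(logits, T):
    """Temperature-scaled softmax."""
    exps = [math.exp(x / T) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 1.0, 0.1]
student = [2.0, 1.0, 0.1]  # student matches teacher exactly
T = 2.0
soft_loss = kl(softmax(student, T), softmax(teacher, T)) * T**2
print(soft_loss)  # 0.0: identical distributions give zero KL
```

Higher temperatures flatten both distributions, which is why the `T**2` factor is needed to keep the soft-loss gradient magnitudes comparable to the hard loss.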
# Response-level Distillation (more effective)
class ResponseDistillation:
    """Response-level KD"""

    def __init__(self, teacher_model, student_model):
        self.teacher = teacher_model
        self.student = student_model

    def generate_training_data(self, prompts: list):
        """Generate training data from Teacher responses"""
        training_data = []
        for prompt in prompts:
            # Generate Teacher response
            teacher_response = self.teacher.generate(
                prompt,
                max_new_tokens=512,
                temperature=0.7
            )
            training_data.append({
                'prompt': prompt,
                'response': teacher_response
            })
        return training_data

    def train_on_responses(self, training_data):
        """Train Student on Teacher responses"""
        # Standard SFT (Supervised Fine-Tuning)
        for item in training_data:
            full_text = f"{item['prompt']}\n{item['response']}"
            # ... SFT training
3.3 Efficient Training Techniques¶
"""
SLM Training Efficiency Techniques
"""
# 1. Gradient Accumulation (large effective batch with small batches)
import torch

def train_with_grad_accumulation(
    model,
    dataloader,
    accumulation_steps: int = 8
):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for i, batch in enumerate(dataloader):
        outputs = model(**batch)
        loss = outputs.loss / accumulation_steps
        loss.backward()

        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
# 2. Efficient fine-tuning with LoRA
from peft import LoraConfig, get_peft_model

def setup_lora_training(model):
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        lora_dropout=0.1,
        bias="none"
    )
    model = get_peft_model(model, lora_config)

    # Check trainable parameters
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)")
    return model
# 3. QLoRA (Quantization + LoRA)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def setup_qlora_training(model_name):
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto"
    )
    # Add LoRA
    return setup_lora_training(model)
4. Deployment Optimization¶
4.1 Quantization¶
"""
SLM Quantization: Memory & Speed Optimization
"""
# 1. GPTQ (Post-Training Quantization)
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

def quantize_with_gptq(model_name):
    gptq_config = GPTQConfig(
        bits=4,
        dataset="c4",
        tokenizer=AutoTokenizer.from_pretrained(model_name)
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=gptq_config,
        device_map="auto"
    )
    return model
# 2. AWQ (Activation-aware Weight Quantization)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

def quantize_with_awq(model_path, output_path):
    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Quantize
    model.quantize(
        tokenizer,
        quant_config={
            "zero_point": True,
            "q_group_size": 128,
            "w_bit": 4,
            "version": "GEMM"
        }
    )
    # Save
    model.save_quantized(output_path)
# 3. llama.cpp (GGUF format)
"""
llama.cpp quantization levels:
- Q2_K: 2-bit (very small, quality degradation)
- Q4_K_M: 4-bit (recommended, quality/size balance)
- Q5_K_M: 5-bit (high quality)
- Q8_0: 8-bit (near-original quality)
Command:
./quantize model.gguf model-q4_k_m.gguf Q4_K_M
"""
# Memory usage comparison
def compare_memory_usage():
    """Approximate weight memory by parameter count and precision"""
    configs = [
        ("3B FP16", 3e9 * 2),    # 6GB
        ("3B Q8", 3e9 * 1),      # 3GB
        ("3B Q4", 3e9 * 0.5),    # 1.5GB
        ("7B FP16", 7e9 * 2),    # 14GB
        ("7B Q4", 7e9 * 0.5),    # 3.5GB
    ]
    print("Model\t\tMemory (GB)")
    print("-" * 30)
    for name, memory in configs:
        print(f"{name}\t\t{memory / 1e9:.1f}")
4.2 On-Device Deployment¶
"""
Mobile/Edge Device Deployment
"""
# 1. ONNX Conversion
def convert_to_onnx(model_id: str, output_path: str):
    from optimum.onnxruntime import ORTModelForCausalLM

    # ONNX conversion and optimization (from_pretrained expects a model id/path)
    ort_model = ORTModelForCausalLM.from_pretrained(
        model_id,
        export=True,
        provider="CPUExecutionProvider"
    )
    ort_model.save_pretrained(output_path)
# 2. TensorRT-LLM (NVIDIA GPU)
"""
TensorRT-LLM usage:
1. Model conversion: python convert_checkpoint.py
2. Engine build: trtllm-build
3. Inference: python run.py
"""
# 3. llama.cpp (CPU inference)
"""
llama.cpp usage:
1. Convert to GGUF
2. Run llama-cli
./llama-cli -m model.gguf \
-n 256 \
-p "Hello, how are you?" \
-t 4 # threads
"""
# 4. MLC-LLM (various platforms)
"""
MLC-LLM: iOS, Android, WebGPU, CUDA
Mobile deployment possible with mlc_chat app
"""
5. Benchmarks & Evaluation¶
5.1 SLM Benchmark Results¶
SLM Benchmark Comparison (as of 2024.10)

| Model | Params | MMLU | GSM8K | HumanEval | TriviaQA |
|---|---|---|---|---|---|
| Phi-3-mini | 3.8B | 69.9% | 82.5% | 57.9% | 63.5% |
| Gemma-2-9B | 9B | 71.3% | 68.6% | 54.3% | 73.5% |
| Qwen2.5-7B | 7B | 74.2% | 82.6% | 75.6% | 71.4% |
| Llama-3.2-3B | 3B | 63.4% | 44.4% | 36.0% | 63.4% |
| SmolLM-1.7B | 1.7B | 42.3% | 18.2% | 28.7% | 42.1% |
| GPT-4 (reference) | - | 86.4% | 92.0% | 67.0% | 87.6% |

Notes:

- Phi-3 shows excellent reasoning for its size
- Qwen2.5 excels at code (HumanEval)
5.2 Task-specific SLM Selection Guide¶
"""
Task-specific SLM Recommendations
"""
TASK_MODEL_RECOMMENDATIONS = {
    # General chat
    "general_chat": {
        "best": "Qwen2.5-7B-Instruct",
        "budget": "Qwen2.5-1.5B-Instruct",
        "mobile": "Qwen2.5-0.5B-Instruct"
    },
    # Code generation
    "code_generation": {
        "best": "Qwen2.5-Coder-7B",
        "budget": "CodeGemma-2B",
        "mobile": "Phi-3-mini"
    },
    # Math/reasoning
    "math_reasoning": {
        "best": "Qwen2.5-Math-7B",
        "budget": "Phi-3-mini",
        "mobile": "Phi-3-mini"
    },
    # Korean
    "korean": {
        "best": "Qwen2.5-7B-Instruct",  # Strong multilingual
        "budget": "EXAONE-3.0-7.8B-Instruct",
        "mobile": "Qwen2.5-1.5B-Instruct"
    },
    # RAG/search
    "rag": {
        "best": "Gemma-2-9B",
        "budget": "Llama-3.2-3B",
        "mobile": "Phi-3-mini"
    },
    # Summarization
    "summarization": {
        "best": "Qwen2.5-7B-Instruct",
        "budget": "Gemma-2-2B",
        "mobile": "SmolLM-1.7B"
    }
}

def select_model(task: str, constraint: str = "best"):
    """Select model for task and constraints"""
    if task in TASK_MODEL_RECOMMENDATIONS:
        return TASK_MODEL_RECOMMENDATIONS[task].get(constraint)
    return "Qwen2.5-7B-Instruct"  # Default
6. Hands-on: SLM Fine-tuning¶
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset

def finetune_slm():
    """SLM QLoRA Fine-tuning Example"""
    # 1. Load model (4-bit quantization)
    model_name = "Qwen/Qwen2.5-1.5B-Instruct"
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    # 2. LoRA configuration
    model = prepare_model_for_kbit_training(model)
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    # 3. Dataset
    dataset = load_dataset("timdettmers/openassistant-guanaco")

    def preprocess(examples):
        texts = []
        for text in examples['text']:
            # Append EOS so the model learns where responses end
            texts.append(text + tokenizer.eos_token)
        tokenized = tokenizer(
            texts,
            truncation=True,
            max_length=1024,
            padding="max_length"
        )
        tokenized['labels'] = tokenized['input_ids'].copy()
        return tokenized

    tokenized_dataset = dataset['train'].map(
        preprocess,
        batched=True,
        remove_columns=dataset['train'].column_names
    )

    # 4. Training
    training_args = TrainingArguments(
        output_dir="./qwen-finetuned",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        logging_steps=10,
        save_steps=500,
        bf16=True,
        optim="paged_adamw_8bit"
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset,
        tokenizer=tokenizer,
    )
    trainer.train()

    # 5. Save
    model.save_pretrained("./qwen-lora-adapter")
    print("Fine-tuning complete!")

if __name__ == "__main__":
    finetune_slm()
References¶
Papers¶
- Gunasekar et al. (2023). "Textbooks Are All You Need" (Phi)
- Gemma Team (2024). "Gemma 2: Improving Open Language Models"
- Yang et al. (2024). "Qwen2 Technical Report"