11. Small Language Models¶
Overview¶
While large models (100B+) are making headlines, Small Language Models (SLMs) are often more practical for real production environments. This lesson covers the architecture, training strategies, and deployment of models with 7B parameters or fewer.
1. Importance of SLMs¶
1.1 Why Small Models?¶
SLM vs LLM Comparison

| | SLM (1-7B) | LLM (70B+) |
|---|---|---|
| Cost | Low | High |
| Latency | Low (<100ms) | High (>500ms) |
| Hardware | Single GPU/CPU | Multi-GPU required |
| Edge deployment | Possible | Difficult |
| Privacy | Easy on-premise | Difficult |
| Specialized tasks | Cost-effective | Overkill |

Use cases:

- Mobile apps (on-device)
- Embedded systems
- High-frequency API services
- Cost-sensitive startups
- Privacy-critical domains
1.2 SLM Model Comparison¶
| Model | Parameters | Training Tokens | Features |
|---|---|---|---|
| Phi-3 | 3.8B | 3.3T | MS, reasoning-focused |
| Gemma 2 | 2B / 9B | 8T | Google, strong at code |
| Qwen 2.5 | 0.5B - 7B | 18T | Multilingual, math |
| Llama 3.2 | 1B / 3B | 15T | Mobile-optimized |
| TinyLlama | 1.1B | 3T | Efficient training |
| StableLM 2 | 1.6B | 2T | Stability AI |
| SmolLM | 135M - 1.7B | 1T | HuggingFace |
2. Architecture Optimization¶
2.1 Phi Series (Microsoft)¶
"""
Phi-3: "Textbooks Are All You Need" Philosophy
Core Ideas:
1. Data quality > Data quantity
2. Utilize synthetic data (generated by GPT-4)
3. Use only textbook-quality data
Result: GPT-3.5-level reasoning with 3.8B parameters
"""
class Phi3Config:
    """Phi-3 Architecture Configuration"""
    # Phi-3-mini (3.8B)
    hidden_size = 3072
    num_layers = 32
    num_attention_heads = 32
    num_key_value_heads = 32  # No GQA
    intermediate_size = 8192  # FFN expansion ratio ~2.7x
    vocab_size = 32064
    max_position_embeddings = 4096  # Extendable

    # Features
    # - SuRoPE (Scaled RoPE)
    # - RMSNorm (Llama-style block structure)
    # - SwiGLU FFN
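As a sanity check on these numbers, the parameter count can be roughly reproduced from the config alone. This is a back-of-envelope sketch (mine, not the official Phi-3 accounting): it counts the embedding table, the four attention projections, and a SwiGLU FFN, ignoring norms and biases.

```python
def estimate_params(hidden, layers, inter, vocab):
    """Rough decoder-only transformer parameter count."""
    embed = vocab * hidden            # token embedding table
    attn = 4 * hidden * hidden        # Q, K, V, O projections (no GQA here)
    ffn = 3 * hidden * inter          # SwiGLU: gate, up, down matrices
    return embed + layers * (attn + ffn)

# Phi-3-mini values from the config above
total = estimate_params(hidden=3072, layers=32, inter=8192, vocab=32064)
print(f"~{total / 1e9:.1f}B parameters")  # ~3.7B, close to the advertised 3.8B
```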
# Phi-3 Usage Example
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def use_phi3():
    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/Phi-3-mini-4k-instruct",
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True
    )
    tokenizer = AutoTokenizer.from_pretrained(
        "microsoft/Phi-3-mini-4k-instruct"
    )

    # Inference
    messages = [
        {"role": "user", "content": "Explain the Pythagorean theorem."}
    ]
    inputs = tokenizer.apply_chat_template(
        messages, return_tensors="pt", return_dict=True,
        add_generation_prompt=True
    ).to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7
    )
    return tokenizer.decode(outputs[0])
2.2 Gemma 2 (Google)¶
"""
Gemma 2: Efficient Architecture Design
Key Features:
1. Alternating Local-Global Attention
2. Soft-Capping (Logits & Attention)
3. Pre-Norm + Post-Norm hybrid
4. Knowledge Distillation from larger models
"""
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class Gemma2Config:
    """Gemma 2 Architecture"""
    # Gemma 2 2B
    hidden_size = 2304
    num_layers = 26
    num_attention_heads = 8
    num_key_value_heads = 4  # Uses GQA
    intermediate_size = 9216
    vocab_size = 256128  # Large vocab

    # Gemma 2 9B
    # hidden_size = 3584
    # num_layers = 42
    # num_attention_heads = 16
    # num_key_value_heads = 8

class GemmaAttentionWithSoftCap(nn.Module):
    """Gemma 2 Style Soft-Capping Attention (simplified single-head view)"""

    def __init__(self, config, layer_idx: int):
        super().__init__()
        self.config = config
        self.layer_idx = layer_idx

        # Alternating local vs global attention
        # Even layers: local (sliding window)
        # Odd layers: global (full attention)
        self.is_local = (layer_idx % 2 == 0)
        self.sliding_window = 4096 if self.is_local else None

        # Soft-cap value
        self.attn_logit_softcap = 50.0

        # Projections (K/V are half-width to mimic GQA's smaller KV heads)
        self.q_proj = nn.Linear(config.hidden_size, config.hidden_size)
        self.k_proj = nn.Linear(config.hidden_size, config.hidden_size // 2)
        self.v_proj = nn.Linear(config.hidden_size, config.hidden_size // 2)
        self.o_proj = nn.Linear(config.hidden_size, config.hidden_size)

    def forward(self, hidden_states, attention_mask=None):
        batch, seq_len, _ = hidden_states.shape
        Q = self.q_proj(hidden_states)
        K = self.k_proj(hidden_states)
        V = self.v_proj(hidden_states)

        # GQA: expand K, V back to full width (simplified; real GQA repeats heads)
        K = K.repeat_interleave(2, dim=-1)
        V = V.repeat_interleave(2, dim=-1)

        # Attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1))
        scores = scores / math.sqrt(Q.shape[-1])

        # Soft-capping: limit range with tanh
        scores = self.attn_logit_softcap * torch.tanh(scores / self.attn_logit_softcap)

        # Sliding window mask (local attention)
        if self.is_local and self.sliding_window:
            mask = self._create_sliding_window_mask(seq_len).to(scores.device)
            scores = scores + mask

        # Causal mask
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len) * float('-inf'),
            diagonal=1
        ).to(scores.device)
        scores = scores + causal_mask

        weights = F.softmax(scores, dim=-1)
        output = torch.matmul(weights, V)
        return self.o_proj(output)

    def _create_sliding_window_mask(self, seq_len):
        """Sliding window attention mask"""
        mask = torch.ones(seq_len, seq_len) * float('-inf')
        for i in range(seq_len):
            start = max(0, i - self.sliding_window)
            mask[i, start:i+1] = 0
        return mask
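To see what soft-capping does in isolation, here is a minimal plain-Python illustration of the `cap * tanh(x / cap)` transform used in the attention above: small logits pass through almost unchanged, while large ones saturate just below the cap.

```python
import math

cap = 50.0  # Gemma 2's attention logit soft-cap
logits = [1.0, 30.0, 100.0, 500.0]
capped = [cap * math.tanh(x / cap) for x in logits]
for x, c in zip(logits, capped):
    print(f"{x:6.1f} -> {c:6.2f}")  # 1.0 stays ~1.0; 500.0 saturates near 50
```

Because tanh is smooth and monotonic, this bounds the logits without the hard clipping that would zero out gradients.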
2.3 Qwen 2.5 (Alibaba)¶
"""
Qwen 2.5: Strong at Multilingual & Math
Features:
1. Large-scale multilingual training (29 languages)
2. Specialized code/math data
3. Long context (128K)
4. Various sizes (0.5B ~ 72B)
"""
class Qwen25Config:
    """Qwen 2.5 Architecture"""
    # Qwen2.5-0.5B (smallest version)
    hidden_size = 896
    num_layers = 24
    num_attention_heads = 14
    num_key_value_heads = 2  # Efficient GQA
    intermediate_size = 4864
    vocab_size = 151936

    # Qwen2.5-7B
    # hidden_size = 3584
    # num_layers = 28
    # num_attention_heads = 28
    # num_key_value_heads = 4
# Qwen Usage Example
def use_qwen():
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen2.5-0.5B-Instruct",
        torch_dtype="auto",
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

    # Multilingual test
    prompts = [
        "Explain machine learning in simple terms.",
        "用简单的话解释机器学习。",  # Chinese: "Explain machine learning in simple terms."
        "기계 학습을 쉽게 설명해 주세요.",  # Korean: "Please explain machine learning simply."
    ]

    for prompt in prompts:
        messages = [{"role": "user", "content": prompt}]
        text = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = tokenizer([text], return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=128)
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))
        print("-" * 50)
3. Training Strategies¶
3.1 Data Quality vs Quantity¶
"""
Key to SLM Training: High-Quality Data
Lessons from Phi:
- Web crawl data (low quality) < Textbook-quality data
- Synthetic data (GPT-4 generated) is effective
- Filtering is extremely important
"""
class HighQualityDataPipeline:
    """High-Quality Data Pipeline"""

    def __init__(self, quality_model):
        self.quality_model = quality_model

    def filter_data(self, texts: list, threshold: float = 0.8):
        """Quality-based filtering"""
        filtered = []
        for text in texts:
            score = self.quality_model.score(text)
            if score > threshold:
                filtered.append(text)
        print(f"Filtered: {len(texts)} → {len(filtered)}")
        return filtered

    def generate_synthetic_data(
        self,
        teacher_model,
        topics: list,
        n_samples: int = 10000
    ):
        """Generate synthetic data"""
        synthetic_data = []
        for topic in topics:
            prompt = f"""Create an educational explanation about {topic}.
The explanation should be:
1. Clear and concise
2. Include examples
3. Suitable for learning"""
            for _ in range(n_samples // len(topics)):
                response = teacher_model.generate(prompt)
                # Quality verification
                if self._validate_response(response):
                    synthetic_data.append({
                        'topic': topic,
                        'content': response
                    })
        return synthetic_data

    def _validate_response(self, response: str) -> bool:
        """Validate response quality"""
        # Length check
        if len(response.split()) < 50:
            return False
        # Repetition check
        sentences = response.split('.')
        if len(set(sentences)) / len(sentences) < 0.8:
            return False
        return True
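The pipeline above takes a `quality_model` with a `score()` method as a dependency. As a hypothetical stand-in for quick experiments (my own sketch, not part of the Phi recipe), a crude heuristic scorer could reward length and lexical diversity:

```python
class HeuristicQualityScorer:
    """Toy quality scorer: rewards reasonable length and vocabulary diversity."""

    def score(self, text: str) -> float:
        words = text.split()
        if not words:
            return 0.0
        length_score = min(len(words) / 200, 1.0)   # saturates at 200 words
        diversity = len(set(words)) / len(words)    # unique-word ratio
        return 0.5 * length_score + 0.5 * diversity

scorer = HeuristicQualityScorer()
print(scorer.score("A short but varied sentence about transformers."))
```

A real pipeline would replace this with a trained classifier, as the Phi work describes.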
3.2 Knowledge Distillation¶
"""
Knowledge Distillation: Large Model â Small Model
Transfer knowledge from Teacher (large model) to Student (SLM)
"""
import torch
import torch.nn.functional as F

class DistillationTrainer:
    """KD-based SLM Training"""

    def __init__(
        self,
        teacher_model,   # e.g., Llama 70B
        student_model,   # e.g., 3B model
        temperature: float = 2.0,
        alpha: float = 0.5  # soft/hard loss ratio
    ):
        self.teacher = teacher_model
        self.student = student_model
        self.temperature = temperature
        self.alpha = alpha

        # Teacher is not trained
        self.teacher.eval()
        for param in self.teacher.parameters():
            param.requires_grad = False

    def distillation_loss(
        self,
        student_logits: torch.Tensor,
        teacher_logits: torch.Tensor,
        labels: torch.Tensor
    ) -> torch.Tensor:
        """
        Distillation Loss = α × Soft Loss + (1-α) × Hard Loss
        Soft Loss: KL(student_soft || teacher_soft)
        Hard Loss: CrossEntropy(student, labels)
        """
        T = self.temperature

        # Soft targets (temperature scaling)
        teacher_soft = F.softmax(teacher_logits / T, dim=-1)
        student_soft = F.log_softmax(student_logits / T, dim=-1)

        # KL Divergence (soft loss)
        soft_loss = F.kl_div(
            student_soft,
            teacher_soft,
            reduction='batchmean'
        ) * (T ** 2)  # Temperature scaling correction

        # Cross Entropy (hard loss)
        hard_loss = F.cross_entropy(
            student_logits.view(-1, student_logits.size(-1)),
            labels.view(-1),
            ignore_index=-100
        )

        # Combined loss
        loss = self.alpha * soft_loss + (1 - self.alpha) * hard_loss
        return loss

    def train_step(self, batch):
        """Training step"""
        input_ids = batch['input_ids']
        labels = batch['labels']

        # Teacher forward (no grad)
        with torch.no_grad():
            teacher_outputs = self.teacher(input_ids)
            teacher_logits = teacher_outputs.logits

        # Student forward
        student_outputs = self.student(input_ids)
        student_logits = student_outputs.logits

        # Distillation loss
        loss = self.distillation_loss(
            student_logits, teacher_logits, labels
        )
        return loss
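A quick hand-worked check of the soft-loss term: when student and teacher logits are identical, the KL divergence is exactly zero, so the combined loss reduces to the hard cross-entropy term alone. This pure-Python sketch mirrors the temperature-scaled math in `distillation_loss`:

```python
import math

def softmax(logits, T):
    """Temperature-scaled softmax."""
    exps = [math.exp(x / T) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 1.0, 0.1]
student = [2.0, 1.0, 0.1]  # student matches teacher exactly
T = 2.0
soft_loss = kl(softmax(student, T), softmax(teacher, T)) * T**2
print(soft_loss)  # 0.0: identical distributions give zero KL
```

Higher temperatures flatten both distributions, which is why the `T**2` factor is needed to keep the soft-loss gradient magnitudes comparable to the hard loss.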
# Response-level Distillation (more effective)
class ResponseDistillation:
    """Response-level KD"""

    def __init__(self, teacher_model, student_model):
        self.teacher = teacher_model
        self.student = student_model

    def generate_training_data(self, prompts: list):
        """Generate training data from Teacher responses"""
        training_data = []
        for prompt in prompts:
            # Generate Teacher response
            teacher_response = self.teacher.generate(
                prompt,
                max_new_tokens=512,
                temperature=0.7
            )
            training_data.append({
                'prompt': prompt,
                'response': teacher_response
            })
        return training_data

    def train_on_responses(self, training_data):
        """Train Student on Teacher responses"""
        # Standard SFT (Supervised Fine-Tuning)
        for item in training_data:
            full_text = f"{item['prompt']}\n{item['response']}"
            # ... SFT training
3.3 Efficient Training Techniques¶
"""
SLM Training Efficiency Techniques
"""
# 1. Gradient Accumulation (large effective batch with small batches)
import torch

def train_with_grad_accumulation(
    model,
    dataloader,
    accumulation_steps: int = 8
):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for i, batch in enumerate(dataloader):
        outputs = model(**batch)
        loss = outputs.loss / accumulation_steps
        loss.backward()

        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
# 2. Efficient fine-tuning with LoRA
from peft import LoraConfig, get_peft_model

def setup_lora_training(model):
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        lora_dropout=0.1,
        bias="none"
    )
    model = get_peft_model(model, lora_config)

    # Check trainable parameters
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)")
    return model
# 3. QLoRA (Quantization + LoRA)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def setup_qlora_training(model_name):
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto"
    )
    # Add LoRA
    return setup_lora_training(model)
4. Deployment Optimization¶
4.1 Quantization¶
"""
SLM Quantization: Memory & Speed Optimization
"""
# 1. GPTQ (Post-Training Quantization)
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

def quantize_with_gptq(model_name):
    gptq_config = GPTQConfig(
        bits=4,
        dataset="c4",
        tokenizer=AutoTokenizer.from_pretrained(model_name)
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=gptq_config,
        device_map="auto"
    )
    return model
# 2. AWQ (Activation-aware Weight Quantization)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

def quantize_with_awq(model_path, output_path):
    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Quantize
    model.quantize(
        tokenizer,
        quant_config={
            "zero_point": True,
            "q_group_size": 128,
            "w_bit": 4,
            "version": "GEMM"
        }
    )
    # Save
    model.save_quantized(output_path)
# 3. llama.cpp (GGUF format)
"""
llama.cpp quantization levels:
- Q2_K: 2-bit (very small, quality degradation)
- Q4_K_M: 4-bit (recommended, quality/size balance)
- Q5_K_M: 5-bit (high quality)
- Q8_0: 8-bit (near-original quality)
Command:
./quantize model.gguf model-q4_k_m.gguf Q4_K_M
"""
# Memory usage comparison
def compare_memory_usage():
    """Approximate weight memory by parameter count and precision"""
    configs = [
        ("3B FP16", 3e9 * 2),    # 6GB
        ("3B Q8", 3e9 * 1),      # 3GB
        ("3B Q4", 3e9 * 0.5),    # 1.5GB
        ("7B FP16", 7e9 * 2),    # 14GB
        ("7B Q4", 7e9 * 0.5),    # 3.5GB
    ]
    print("Model\t\tMemory (GB)")
    print("-" * 30)
    for name, memory in configs:
        print(f"{name}\t\t{memory / 1e9:.1f}")
4.2 On-Device Deployment¶
"""
Mobile/Edge Device Deployment
"""
# 1. ONNX Conversion
def convert_to_onnx(model_id: str, output_path: str):
    from optimum.onnxruntime import ORTModelForCausalLM

    # ONNX conversion and optimization (from_pretrained expects a model id/path)
    ort_model = ORTModelForCausalLM.from_pretrained(
        model_id,
        export=True,
        provider="CPUExecutionProvider"
    )
    ort_model.save_pretrained(output_path)
# 2. TensorRT-LLM (NVIDIA GPU)
"""
TensorRT-LLM usage:
1. Model conversion: python convert_checkpoint.py
2. Engine build: trtllm-build
3. Inference: python run.py
"""
# 3. llama.cpp (CPU inference)
"""
llama.cpp usage:
1. Convert to GGUF
2. Run llama-cli
./llama-cli -m model.gguf \
-n 256 \
-p "Hello, how are you?" \
-t 4 # threads
"""
# 4. MLC-LLM (various platforms)
"""
MLC-LLM: iOS, Android, WebGPU, CUDA
Mobile deployment possible with mlc_chat app
"""
5. Benchmarks & Evaluation¶
5.1 SLM Benchmark Results¶
SLM Benchmark Comparison (as of 2024.10)

| Model | Params | MMLU | GSM8K | HumanEval | TriviaQA |
|---|---|---|---|---|---|
| Phi-3-mini | 3.8B | 69.9% | 82.5% | 57.9% | 63.5% |
| Gemma-2-9B | 9B | 71.3% | 68.6% | 54.3% | 73.5% |
| Qwen2.5-7B | 7B | 74.2% | 82.6% | 75.6% | 71.4% |
| Llama-3.2-3B | 3B | 63.4% | 44.4% | 36.0% | 63.4% |
| SmolLM-1.7B | 1.7B | 42.3% | 18.2% | 28.7% | 42.1% |
| GPT-4 (reference) | - | 86.4% | 92.0% | 67.0% | 87.6% |

Notes:

- Phi-3 shows excellent reasoning for its size
- Qwen2.5 excels at code (HumanEval)
5.2 Task-specific SLM Selection Guide¶
"""
Task-specific SLM Recommendations
"""
TASK_MODEL_RECOMMENDATIONS = {
    # General chat
    "general_chat": {
        "best": "Qwen2.5-7B-Instruct",
        "budget": "Qwen2.5-1.5B-Instruct",
        "mobile": "Qwen2.5-0.5B-Instruct"
    },
    # Code generation
    "code_generation": {
        "best": "Qwen2.5-Coder-7B",
        "budget": "CodeGemma-2B",
        "mobile": "Phi-3-mini"
    },
    # Math/reasoning
    "math_reasoning": {
        "best": "Qwen2.5-Math-7B",
        "budget": "Phi-3-mini",
        "mobile": "Phi-3-mini"
    },
    # Korean
    "korean": {
        "best": "Qwen2.5-7B-Instruct",  # Strong multilingual
        "budget": "EXAONE-3.0-7.8B-Instruct",
        "mobile": "Qwen2.5-1.5B-Instruct"
    },
    # RAG/search
    "rag": {
        "best": "Gemma-2-9B",
        "budget": "Llama-3.2-3B",
        "mobile": "Phi-3-mini"
    },
    # Summarization
    "summarization": {
        "best": "Qwen2.5-7B-Instruct",
        "budget": "Gemma-2-2B",
        "mobile": "SmolLM-1.7B"
    }
}

def select_model(task: str, constraint: str = "best"):
    """Select model for task and constraints"""
    if task in TASK_MODEL_RECOMMENDATIONS:
        return TASK_MODEL_RECOMMENDATIONS[task].get(constraint)
    return "Qwen2.5-7B-Instruct"  # Default
6. Hands-on: SLM Fine-tuning¶
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset

def finetune_slm():
    """SLM QLoRA Fine-tuning Example"""
    # 1. Load model (4-bit quantization)
    model_name = "Qwen/Qwen2.5-1.5B-Instruct"
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    # 2. LoRA configuration
    model = prepare_model_for_kbit_training(model)
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    # 3. Dataset
    dataset = load_dataset("timdettmers/openassistant-guanaco")

    def preprocess(examples):
        texts = []
        for text in examples['text']:
            # Append EOS so the model learns where responses end
            texts.append(text + tokenizer.eos_token)
        tokenized = tokenizer(
            texts,
            truncation=True,
            max_length=1024,
            padding="max_length"
        )
        tokenized['labels'] = tokenized['input_ids'].copy()
        return tokenized

    tokenized_dataset = dataset['train'].map(
        preprocess,
        batched=True,
        remove_columns=dataset['train'].column_names
    )

    # 4. Training
    training_args = TrainingArguments(
        output_dir="./qwen-finetuned",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        logging_steps=10,
        save_steps=500,
        bf16=True,
        optim="paged_adamw_8bit"
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset,
        tokenizer=tokenizer,
    )
    trainer.train()

    # 5. Save
    model.save_pretrained("./qwen-lora-adapter")
    print("Fine-tuning complete!")

if __name__ == "__main__":
    finetune_slm()
References¶
Papers¶
- Gunasekar et al. (2023). "Textbooks Are All You Need" (Phi)
- Gemma Team (2024). "Gemma 2: Improving Open Language Models"
- Yang et al. (2024). "Qwen2 Technical Report"