11. Small Language Models


๊ฐœ์š”

๋Œ€ํ˜• ๋ชจ๋ธ(100B+)์ด ํ™”์ œ์ง€๋งŒ, ์‹ค์ œ ํ”„๋กœ๋•์…˜ ํ™˜๊ฒฝ์—์„œ๋Š” Small Language Models (SLM)์ด ๋” ์‹ค์šฉ์ ์ž…๋‹ˆ๋‹ค. ์ด ๋ ˆ์Šจ์—์„œ๋Š” 7B ์ดํ•˜ ๋ชจ๋ธ์˜ ์•„ํ‚คํ…์ฒ˜, ํ•™์Šต ์ „๋žต, ํ™œ์šฉ ๋ฐฉ๋ฒ•์„ ๋‹ค๋ฃน๋‹ˆ๋‹ค.


1. SLM์˜ ์ค‘์š”์„ฑ

1.1 ์™œ ์ž‘์€ ๋ชจ๋ธ์ธ๊ฐ€?

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                   SLM vs LLM ๋น„๊ต                               โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                                  โ”‚
โ”‚                    SLM (1-7B)              LLM (70B+)            โ”‚
โ”‚                                                                  โ”‚
โ”‚  ๐Ÿ’ฐ ๋น„์šฉ          ๋‚ฎ์Œ                      ๋†’์Œ                 โ”‚
โ”‚  โšก ์ง€์—ฐ์‹œ๊ฐ„      ๋‚ฎ์Œ (<100ms)             ๋†’์Œ (>500ms)        โ”‚
โ”‚  ๐Ÿ–ฅ๏ธ ํ•˜๋“œ์›จ์–ด     ๋‹จ์ผ GPU/CPU             ๋‹ค์ค‘ GPU ํ•„์ˆ˜        โ”‚
โ”‚  ๐Ÿ“ฑ ์—ฃ์ง€ ๋ฐฐํฌ    ๊ฐ€๋Šฅ                      ์–ด๋ ค์›€               โ”‚
โ”‚  ๐Ÿ”’ ํ”„๋ผ์ด๋ฒ„์‹œ   ์˜จํ”„๋ ˆ๋ฏธ์Šค ์‰ฌ์›€           ์–ด๋ ค์›€               โ”‚
โ”‚  ๐ŸŽฏ ํŠนํ™” ํƒœ์Šคํฌ  ๋น„์šฉ ํšจ์œจ์                ๊ณผ์ž‰                 โ”‚
โ”‚                                                                  โ”‚
โ”‚  ์‚ฌ์šฉ ์‚ฌ๋ก€:                                                      โ”‚
โ”‚  - ๋ชจ๋ฐ”์ผ ์•ฑ (On-device)                                        โ”‚
โ”‚  - ์ž„๋ฒ ๋””๋“œ ์‹œ์Šคํ…œ                                              โ”‚
โ”‚  - ๊ณ ๋นˆ๋„ API ์„œ๋น„์Šค                                            โ”‚
โ”‚  - ๋น„์šฉ ๋ฏผ๊ฐํ•œ ์Šคํƒ€ํŠธ์—…                                         โ”‚
โ”‚  - ๊ฐœ์ธ์ •๋ณด ๋ณดํ˜ธ๊ฐ€ ์ค‘์š”ํ•œ ๋„๋ฉ”์ธ                                โ”‚
โ”‚                                                                  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

1.2 SLM Model Comparison

๋ชจ๋ธ ํŒŒ๋ผ๋ฏธํ„ฐ ํ•™์Šต ํ† ํฐ ํŠน์ง•
Phi-3 3.8B 3.3T MS, ์ถ”๋ก  ํŠนํ™”
Gemma 2 2B / 9B 8T Google, ์ฝ”๋“œ ๊ฐ•์ 
Qwen 2.5 0.5B - 7B 18T ๋‹ค๊ตญ์–ด, ์ˆ˜ํ•™
Llama 3.2 1B / 3B 15T ๋ชจ๋ฐ”์ผ ์ตœ์ ํ™”
TinyLlama 1.1B 3T ํšจ์œจ์  ํ•™์Šต
StableLM 2 1.6B 2T Stability AI
SmolLM 135M - 1.7B 1T HuggingFace

2. ์•„ํ‚คํ…์ฒ˜ ์ตœ์ ํ™”

2.1 The Phi Series (Microsoft)

"""
Phi-3: "Textbooks Are All You Need" ์ฒ ํ•™

ํ•ต์‹ฌ ์•„์ด๋””์–ด:
1. ๋ฐ์ดํ„ฐ ํ’ˆ์งˆ > ๋ฐ์ดํ„ฐ ์–‘
2. ํ•ฉ์„ฑ ๋ฐ์ดํ„ฐ ํ™œ์šฉ (GPT-4๋กœ ์ƒ์„ฑ)
3. ๊ต๊ณผ์„œ๊ธ‰ ํ’ˆ์งˆ์˜ ๋ฐ์ดํ„ฐ๋งŒ ์‚ฌ์šฉ

๊ฒฐ๊ณผ: 3.8B๋กœ GPT-3.5๊ธ‰ ์ถ”๋ก  ๋Šฅ๋ ฅ
"""

class Phi3Config:
    """Phi-3 architecture settings"""

    # Phi-3-mini (3.8B)
    hidden_size = 3072
    num_layers = 32
    num_attention_heads = 32
    num_key_value_heads = 32  # no GQA (full multi-head attention)
    intermediate_size = 8192  # FFN expansion ratio ~2.7x
    vocab_size = 32064
    max_position_embeddings = 4096  # a 128K long-context variant exists

    # Notable choices
    # - RoPE (the long-context variant uses scaled "SuRoPE"/LongRoPE)
    # - RMSNorm
    # - SwiGLU FFN
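
To see how these hyperparameters add up to roughly 3.8B parameters, here is a back-of-the-envelope estimator (a sketch: it ignores biases and norm weights, and assumes a SwiGLU FFN as above):

def estimate_params(hidden, layers, heads, kv_heads, ffn, vocab):
    """Rough transformer parameter count (weights only)."""
    head_dim = hidden // heads
    # Attention: Q and O are hidden x hidden; K and V are hidden x (kv_heads * head_dim)
    attn = 2 * hidden * hidden + 2 * hidden * (kv_heads * head_dim)
    # SwiGLU FFN: gate, up, and down projections
    mlp = 3 * hidden * ffn
    embed = vocab * hidden  # input embeddings
    return layers * (attn + mlp) + embed

# Phi-3-mini values from the config above:
# ~3.72B, plus another vocab*hidden if the LM head is untied -> ~3.8B total
print(estimate_params(3072, 32, 32, 32, 8192, 32064) / 1e9)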


# Phi-3 ์‚ฌ์šฉ ์˜ˆ์‹œ
from transformers import AutoModelForCausalLM, AutoTokenizer

def use_phi3():
    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/Phi-3-mini-4k-instruct",
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True
    )
    tokenizer = AutoTokenizer.from_pretrained(
        "microsoft/Phi-3-mini-4k-instruct"
    )

    # ์ถ”๋ก 
    messages = [
        {"role": "user", "content": "Explain the Pythagorean theorem."}
    ]

    inputs = tokenizer.apply_chat_template(
        messages, return_tensors="pt", return_dict=True
    ).to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,  # required for temperature to take effect
        temperature=0.7
    )

    return tokenizer.decode(outputs[0])

2.2 Gemma 2 (Google)

"""
Gemma 2: ํšจ์œจ์ ์ธ ์•„ํ‚คํ…์ฒ˜ ์„ค๊ณ„

ํ•ต์‹ฌ ํŠน์ง•:
1. Alternating Local-Global Attention
2. Soft-Capping (Logits & Attention)
3. Pre-Norm + Post-Norm hybrid
4. Knowledge Distillation from larger models
"""

class Gemma2Config:
    """Gemma 2 architecture"""

    # Gemma 2 2B
    hidden_size = 2304
    num_layers = 26
    num_attention_heads = 8
    num_key_value_heads = 4  # uses GQA
    intermediate_size = 9216
    vocab_size = 256000  # large vocabulary

    # Gemma 2 9B
    # hidden_size = 3584
    # num_layers = 42
    # num_attention_heads = 16
    # num_key_value_heads = 8


import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class GemmaAttentionWithSoftCap(nn.Module):
    """Gemma 2-style soft-capping attention (simplified single-head sketch)"""

    def __init__(self, config, layer_idx: int):
        super().__init__()
        self.config = config
        self.layer_idx = layer_idx

        # Alternate local and global attention:
        # even layers: local (sliding window)
        # odd layers: global (full attention)
        self.is_local = (layer_idx % 2 == 0)
        self.sliding_window = 4096 if self.is_local else None

        # Soft-cap value for attention logits
        self.attn_logit_softcap = 50.0

        # Projections (K/V at half width to mimic GQA)
        self.q_proj = nn.Linear(config.hidden_size, config.hidden_size)
        self.k_proj = nn.Linear(config.hidden_size, config.hidden_size // 2)
        self.v_proj = nn.Linear(config.hidden_size, config.hidden_size // 2)
        self.o_proj = nn.Linear(config.hidden_size, config.hidden_size)

    def forward(self, hidden_states, attention_mask=None):
        batch, seq_len, _ = hidden_states.shape

        Q = self.q_proj(hidden_states)
        K = self.k_proj(hidden_states)
        V = self.v_proj(hidden_states)

        # GQA: K, V ํ™•์žฅ
        K = K.repeat_interleave(2, dim=-1)  # ๊ฐ„์†Œํ™”
        V = V.repeat_interleave(2, dim=-1)

        # Attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1))
        scores = scores / math.sqrt(Q.shape[-1])

        # Soft-capping: tanh๋กœ ๋ฒ”์œ„ ์ œํ•œ
        scores = self.attn_logit_softcap * torch.tanh(scores / self.attn_logit_softcap)

        # Sliding-window mask (local attention layers only)
        if self.is_local and self.sliding_window:
            mask = self._create_sliding_window_mask(seq_len).to(scores.device)
            scores = scores + mask

        # Causal mask
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len) * float('-inf'),
            diagonal=1
        ).to(scores.device)
        scores = scores + causal_mask

        weights = F.softmax(scores, dim=-1)
        output = torch.matmul(weights, V)

        return self.o_proj(output)

    def _create_sliding_window_mask(self, seq_len):
        """Sliding window attention mask"""
        mask = torch.ones(seq_len, seq_len) * float('-inf')
        for i in range(seq_len):
            start = max(0, i - self.sliding_window)
            mask[i, start:i+1] = 0
        return mask
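
A quick smoke test of the module above (a sketch; `SimpleNamespace` stands in for a real config object):

from types import SimpleNamespace

def smoke_test_gemma_attention():
    config = SimpleNamespace(hidden_size=128)
    x = torch.randn(2, 16, 128)  # (batch, seq_len, hidden)
    for layer_idx in (0, 1):  # layer 0 is local, layer 1 is global
        attn = GemmaAttentionWithSoftCap(config, layer_idx)
        out = attn(x)
        assert out.shape == x.shape
        print(f"layer {layer_idx}: local={attn.is_local}, output {tuple(out.shape)}")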

2.3 Qwen 2.5 (Alibaba)

"""
Qwen 2.5: ๋‹ค๊ตญ์–ด & ์ˆ˜ํ•™ ๊ฐ•์ 

ํŠน์ง•:
1. ๋Œ€๊ทœ๋ชจ ๋‹ค๊ตญ์–ด ํ•™์Šต (29๊ฐœ ์–ธ์–ด)
2. ์ฝ”๋“œ/์ˆ˜ํ•™ ํŠนํ™” ๋ฐ์ดํ„ฐ
3. ๊ธด ์ปจํ…์ŠคํŠธ (128K)
4. ๋‹ค์–‘ํ•œ ํฌ๊ธฐ (0.5B ~ 72B)
"""

class Qwen25Config:
    """Qwen 2.5 architecture"""

    # Qwen2.5-0.5B (smallest version)
    hidden_size = 896
    num_layers = 24
    num_attention_heads = 14
    num_key_value_heads = 2  # aggressive GQA
    intermediate_size = 4864
    vocab_size = 151936

    # Qwen2.5-7B
    # hidden_size = 3584
    # num_layers = 28
    # num_attention_heads = 28
    # num_key_value_heads = 4


# Qwen ์‚ฌ์šฉ ์˜ˆ์‹œ
def use_qwen():
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen2.5-0.5B-Instruct",
        torch_dtype="auto",
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

    # ๋‹ค๊ตญ์–ด ํ…Œ์ŠคํŠธ
    prompts = [
        "Explain machine learning in simple terms.",
        "็”จ็ฎ€ๅ•็š„่ฏ่งฃ้‡Šๆœบๅ™จๅญฆไน ",  # ์ค‘๊ตญ์–ด
        "๊ธฐ๊ณ„ ํ•™์Šต์„ ์‰ฝ๊ฒŒ ์„ค๋ช…ํ•ด์ฃผ์„ธ์š”",  # ํ•œ๊ตญ์–ด
    ]

    for prompt in prompts:
        messages = [{"role": "user", "content": prompt}]
        text = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = tokenizer([text], return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=128)
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))
        print("-" * 50)

3. ํ•™์Šต ์ „๋žต

3.1 ๋ฐ์ดํ„ฐ ํ’ˆ์งˆ vs ์–‘

"""
SLM ํ•™์Šต์˜ ํ•ต์‹ฌ: ๊ณ ํ’ˆ์งˆ ๋ฐ์ดํ„ฐ

Phi์˜ ๊ตํ›ˆ:
- ์›น ํฌ๋กค๋ง ๋ฐ์ดํ„ฐ (ํ’ˆ์งˆ ๋‚ฎ์Œ) < ๊ต๊ณผ์„œ๊ธ‰ ๋ฐ์ดํ„ฐ
- ํ•ฉ์„ฑ ๋ฐ์ดํ„ฐ (GPT-4 ์ƒ์„ฑ)๊ฐ€ ํšจ๊ณผ์ 
- ํ•„ํ„ฐ๋ง์ด ๋งค์šฐ ์ค‘์š”
"""

class HighQualityDataPipeline:
    """High-quality data pipeline"""

    def __init__(self, quality_model):
        self.quality_model = quality_model

    def filter_data(self, texts: list, threshold: float = 0.8):
        """Quality-based filtering"""
        filtered = []
        for text in texts:
            score = self.quality_model.score(text)
            if score > threshold:
                filtered.append(text)

        print(f"Filtered: {len(texts)} โ†’ {len(filtered)}")
        return filtered

    def generate_synthetic_data(
        self,
        teacher_model,
        topics: list,
        n_samples: int = 10000
    ):
        """ํ•ฉ์„ฑ ๋ฐ์ดํ„ฐ ์ƒ์„ฑ"""
        synthetic_data = []

        for topic in topics:
            prompt = f"""Create an educational explanation about {topic}.
            The explanation should be:
            1. Clear and concise
            2. Include examples
            3. Suitable for learning"""

            for _ in range(n_samples // len(topics)):
                response = teacher_model.generate(prompt)

                # ํ’ˆ์งˆ ๊ฒ€์ฆ
                if self._validate_response(response):
                    synthetic_data.append({
                        'topic': topic,
                        'content': response
                    })

        return synthetic_data

    def _validate_response(self, response: str) -> bool:
        """Validate response quality"""
        # length check
        if len(response.split()) < 50:
            return False

        # repetition check
        sentences = response.split('.')
        if len(set(sentences)) / len(sentences) < 0.8:
            return False

        return True
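
The pipeline above assumes a `quality_model` exposing a `.score(text)` method. A minimal stand-in (a crude heuristic sketch, not the classifier-based filters used in practice) could look like:

class HeuristicQualityScorer:
    """Toy quality scorer: rewards vocabulary diversity and low symbol noise."""

    def score(self, text: str) -> float:
        words = text.split()
        if len(words) < 20:
            return 0.0
        diversity = len(set(words)) / len(words)  # type/token ratio
        alpha_ratio = sum(c.isalpha() or c.isspace() for c in text) / max(len(text), 1)
        return 0.5 * diversity + 0.5 * alpha_ratio

pipeline = HighQualityDataPipeline(HeuristicQualityScorer())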

3.2 Knowledge Distillation

"""
Knowledge Distillation: ํฐ ๋ชจ๋ธ โ†’ ์ž‘์€ ๋ชจ๋ธ

Teacher (๋Œ€ํ˜• ๋ชจ๋ธ)์˜ ์ง€์‹์„ Student (SLM)์—๊ฒŒ ์ „๋‹ฌ
"""

class DistillationTrainer:
    """KD-based SLM training"""

    def __init__(
        self,
        teacher_model,  # e.g., Llama 70B
        student_model,  # e.g., a 3B model
        temperature: float = 2.0,
        alpha: float = 0.5  # soft/hard loss mix
    ):
        self.teacher = teacher_model
        self.student = student_model
        self.temperature = temperature
        self.alpha = alpha

        # The teacher is frozen
        self.teacher.eval()
        for param in self.teacher.parameters():
            param.requires_grad = False

    def distillation_loss(
        self,
        student_logits: torch.Tensor,
        teacher_logits: torch.Tensor,
        labels: torch.Tensor
    ) -> torch.Tensor:
        """
        Distillation Loss = ฮฑ ร— Soft Loss + (1-ฮฑ) ร— Hard Loss

        Soft Loss: KL(student_soft || teacher_soft)
        Hard Loss: CrossEntropy(student, labels)
        """
        T = self.temperature

        # Soft targets (temperature scaling)
        teacher_soft = F.softmax(teacher_logits / T, dim=-1)
        student_soft = F.log_softmax(student_logits / T, dim=-1)

        # KL Divergence (soft loss)
        soft_loss = F.kl_div(
            student_soft,
            teacher_soft,
            reduction='batchmean'
        ) * (T ** 2)  # compensate for the temperature scaling

        # Cross Entropy (hard loss)
        hard_loss = F.cross_entropy(
            student_logits.view(-1, student_logits.size(-1)),
            labels.view(-1),
            ignore_index=-100
        )

        # Combined loss
        loss = self.alpha * soft_loss + (1 - self.alpha) * hard_loss

        return loss

    def train_step(self, batch):
        """ํ•™์Šต ์Šคํ…"""
        input_ids = batch['input_ids']
        labels = batch['labels']

        # Teacher forward (no grad)
        with torch.no_grad():
            teacher_outputs = self.teacher(input_ids)
            teacher_logits = teacher_outputs.logits

        # Student forward
        student_outputs = self.student(input_ids)
        student_logits = student_outputs.logits

        # Distillation loss
        loss = self.distillation_loss(
            student_logits, teacher_logits, labels
        )

        return loss
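
A quick sanity check of `distillation_loss` with random logits (a sketch; the tiny `nn.Linear` modules exist only to satisfy the constructor):

def sanity_check_distillation():
    vocab, batch, seq = 100, 2, 8
    teacher = nn.Linear(4, vocab)  # placeholder "models"
    student = nn.Linear(4, vocab)
    trainer = DistillationTrainer(teacher, student, temperature=2.0, alpha=0.5)

    student_logits = torch.randn(batch, seq, vocab, requires_grad=True)
    teacher_logits = torch.randn(batch, seq, vocab)
    labels = torch.randint(0, vocab, (batch, seq))

    loss = trainer.distillation_loss(student_logits, teacher_logits, labels)
    loss.backward()  # gradients flow only through the student logits
    print(f"distillation loss: {loss.item():.4f}")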


# Response-level distillation (often more effective in practice)
class ResponseDistillation:
    """Response-level KD"""

    def __init__(self, teacher_model, student_model):
        self.teacher = teacher_model
        self.student = student_model

    def generate_training_data(self, prompts: list):
        """Teacher ์‘๋‹ต์œผ๋กœ ํ•™์Šต ๋ฐ์ดํ„ฐ ์ƒ์„ฑ"""
        training_data = []

        for prompt in prompts:
            # Teacher ์‘๋‹ต ์ƒ์„ฑ
            teacher_response = self.teacher.generate(
                prompt,
                max_new_tokens=512,
                temperature=0.7
            )

            training_data.append({
                'prompt': prompt,
                'response': teacher_response
            })

        return training_data

    def train_on_responses(self, training_data, tokenizer, optimizer):
        """Train the student on teacher responses (standard SFT)"""
        self.student.train()
        for item in training_data:
            full_text = f"{item['prompt']}\n{item['response']}"
            inputs = tokenizer(full_text, return_tensors="pt", truncation=True)
            inputs = {k: v.to(self.student.device) for k, v in inputs.items()}
            # causal-LM SFT: labels are the input ids
            outputs = self.student(**inputs, labels=inputs["input_ids"])
            outputs.loss.backward()
            optimizer.step()
            optimizer.zero_grad()

3.3 ํšจ์œจ์  ํ•™์Šต ๊ธฐ๋ฒ•

"""
SLM ํ•™์Šต ํšจ์œจํ™” ๊ธฐ๋ฒ•
"""

# 1. Gradient accumulation (large effective batch from small batches)
def train_with_grad_accumulation(
    model,
    dataloader,
    accumulation_steps: int = 8
):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for i, batch in enumerate(dataloader):
        outputs = model(**batch)
        loss = outputs.loss / accumulation_steps
        loss.backward()

        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()


# 2. LoRA๋กœ ํšจ์œจ์  fine-tuning
from peft import LoraConfig, get_peft_model

def setup_lora_training(model):
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        lora_dropout=0.1,
        bias="none"
    )

    model = get_peft_model(model, lora_config)

    # ํ•™์Šต ๊ฐ€๋Šฅ ํŒŒ๋ผ๋ฏธํ„ฐ ํ™•์ธ
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)")

    return model


# 3. QLoRA (quantization + LoRA)
from transformers import BitsAndBytesConfig

def setup_qlora_training(model_name):
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto"
    )

    # LoRA ์ถ”๊ฐ€
    return setup_lora_training(model)

4. ๋ฐฐํฌ ์ตœ์ ํ™”

4.1 ์–‘์žํ™”

"""
SLM ์–‘์žํ™”: ๋ฉ”๋ชจ๋ฆฌ & ์†๋„ ์ตœ์ ํ™”
"""

# 1. GPTQ (Post-Training Quantization)
from transformers import GPTQConfig

def quantize_with_gptq(model_name):
    gptq_config = GPTQConfig(
        bits=4,
        dataset="c4",
        tokenizer=AutoTokenizer.from_pretrained(model_name)
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=gptq_config,
        device_map="auto"
    )

    return model


# 2. AWQ (Activation-aware Weight Quantization)
from awq import AutoAWQForCausalLM

def quantize_with_awq(model_path, output_path):
    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # ์–‘์žํ™”
    model.quantize(
        tokenizer,
        quant_config={
            "zero_point": True,
            "q_group_size": 128,
            "w_bit": 4,
            "version": "GEMM"
        }
    )

    # ์ €์žฅ
    model.save_quantized(output_path)


# 3. llama.cpp (GGUF format)
"""
llama.cpp quantization levels:
- Q2_K: 2-bit (very small, noticeable quality loss)
- Q4_K_M: 4-bit (recommended quality/size balance)
- Q5_K_M: 5-bit (higher quality)
- Q8_0: 8-bit (near-original quality)

Command:
./quantize model.gguf model-q4_k_m.gguf Q4_K_M
"""


# ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰ ๋น„๊ต
def compare_memory_usage():
    """ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜์— ๋”ฐ๋ฅธ ๋ฉ”๋ชจ๋ฆฌ"""
    configs = [
        ("3B FP16", 3e9 * 2),       # 6GB
        ("3B Q8", 3e9 * 1),         # 3GB
        ("3B Q4", 3e9 * 0.5),       # 1.5GB
        ("7B FP16", 7e9 * 2),       # 14GB
        ("7B Q4", 7e9 * 0.5),       # 3.5GB
    ]

    print("Model\t\tMemory (GB)")
    print("-" * 30)
    for name, memory in configs:
        print(f"{name}\t\t{memory / 1e9:.1f}")
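
Weight size is only half the story at serving time: the KV cache grows linearly with context length. A rough estimate (a sketch assuming an FP16 cache and no paging):

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """KV cache size: one K and one V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Example: a Qwen2.5-7B-like shape (28 layers, 4 KV heads, head_dim 128) at 32K context
print(kv_cache_bytes(28, 4, 128, 32768) / 1e9, "GB")  # ≈1.9 GB per sequence

Note how aggressive GQA (4 KV heads instead of 28) is what keeps this manageable on small devices.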

4.2 On-Device Deployment

"""
Deploying to mobile/edge devices
"""

# 1. ONNX conversion
def convert_to_onnx(model_id, output_path):
    from optimum.onnxruntime import ORTModelForCausalLM

    # export to ONNX and optimize for CPU execution
    ort_model = ORTModelForCausalLM.from_pretrained(
        model_id,
        export=True,
        provider="CPUExecutionProvider"
    )

    ort_model.save_pretrained(output_path)


# 2. TensorRT-LLM (NVIDIA GPUs)
"""
TensorRT-LLM workflow:
1. Convert the checkpoint: python convert_checkpoint.py
2. Build the engine: trtllm-build
3. Run inference: python run.py
"""


# 3. llama.cpp (CPU inference)
"""
Using llama.cpp:
1. Convert to GGUF
2. Run llama-cli

./llama-cli -m model.gguf \
    -n 256 \
    -p "Hello, how are you?" \
    -t 4  # threads
"""


# 4. MLC-LLM (multi-platform)
"""
MLC-LLM: iOS, Android, WebGPU, CUDA

The mlc_chat app enables mobile on-device deployment
"""

5. ๋ฒค์น˜๋งˆํฌ & ํ‰๊ฐ€

5.1 SLM ๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚            SLM ๋ฒค์น˜๋งˆํฌ ๋น„๊ต (2024.10 ๊ธฐ์ค€)                      โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                                  โ”‚
โ”‚  Model          Params  MMLU    GSM8K   HumanEval  TriviaQA     โ”‚
โ”‚  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€  โ”‚
โ”‚  Phi-3-mini     3.8B    69.9%   82.5%   57.9%      63.5%        โ”‚
โ”‚  Gemma-2-9B     9B      71.3%   68.6%   54.3%      73.5%        โ”‚
โ”‚  Qwen2.5-7B     7B      74.2%   82.6%   75.6%      71.4%        โ”‚
โ”‚  Llama-3.2-3B   3B      63.4%   44.4%   36.0%      63.4%        โ”‚
โ”‚  SmolLM-1.7B    1.7B    42.3%   18.2%   28.7%      42.1%        โ”‚
โ”‚                                                                  โ”‚
โ”‚  ์ฐธ๊ณ : GPT-4    -       86.4%   92.0%   67.0%      87.6%        โ”‚
โ”‚                                                                  โ”‚
โ”‚  โ€ป Phi-3์€ ์ž‘์€ ํฌ๊ธฐ ๋Œ€๋น„ ๋›ฐ์–ด๋‚œ ์ถ”๋ก  ๋Šฅ๋ ฅ                       โ”‚
โ”‚  โ€ป Qwen2.5๋Š” ์ฝ”๋“œ(HumanEval)์—์„œ ๊ฐ•์                             โ”‚
โ”‚                                                                  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

5.2 ํƒœ์Šคํฌ๋ณ„ SLM ์„ ํƒ ๊ฐ€์ด๋“œ

"""
ํƒœ์Šคํฌ๋ณ„ SLM ์ถ”์ฒœ
"""

TASK_MODEL_RECOMMENDATIONS = {
    # ์ผ๋ฐ˜ ๋Œ€ํ™”
    "general_chat": {
        "best": "Qwen2.5-7B-Instruct",
        "budget": "Qwen2.5-1.5B-Instruct",
        "mobile": "Qwen2.5-0.5B-Instruct"
    },

    # ์ฝ”๋“œ ์ƒ์„ฑ
    "code_generation": {
        "best": "Qwen2.5-Coder-7B",
        "budget": "CodeGemma-2B",
        "mobile": "Phi-3-mini"
    },

    # ์ˆ˜ํ•™/์ถ”๋ก 
    "math_reasoning": {
        "best": "Qwen2.5-Math-7B",
        "budget": "Phi-3-mini",
        "mobile": "Phi-3-mini"
    },

    # ํ•œ๊ตญ์–ด
    "korean": {
        "best": "Qwen2.5-7B-Instruct",  # ๋‹ค๊ตญ์–ด ๊ฐ•์ 
        "budget": "EXAONE-3.0-7.8B-Instruct",
        "mobile": "Qwen2.5-1.5B-Instruct"
    },

    # RAG/๊ฒ€์ƒ‰
    "rag": {
        "best": "Gemma-2-9B",
        "budget": "Llama-3.2-3B",
        "mobile": "Phi-3-mini"
    },

    # ์š”์•ฝ
    "summarization": {
        "best": "Qwen2.5-7B-Instruct",
        "budget": "Gemma-2-2B",
        "mobile": "SmolLM-1.7B"
    }
}


def select_model(task: str, constraint: str = "best"):
    """ํƒœ์Šคํฌ์™€ ์ œ์•ฝ์— ๋งž๋Š” ๋ชจ๋ธ ์„ ํƒ"""
    if task in TASK_MODEL_RECOMMENDATIONS:
        return TASK_MODEL_RECOMMENDATIONS[task].get(constraint)
    return "Qwen2.5-7B-Instruct"  # ๊ธฐ๋ณธ๊ฐ’

6. ์‹ค์Šต: SLM Fine-tuning

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset

def finetune_slm():
    """SLM QLoRA fine-tuning example"""

    # 1. Load the model (4-bit quantization)
    model_name = "Qwen/Qwen2.5-1.5B-Instruct"

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto"
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    # 2. LoRA ์„ค์ •
    model = prepare_model_for_kbit_training(model)

    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    # 3. ๋ฐ์ดํ„ฐ์…‹
    dataset = load_dataset("timdettmers/openassistant-guanaco")

    def preprocess(examples):
        texts = []
        for text in examples['text']:
            # append EOS (the dataset text is already formatted as a dialogue)
            texts.append(text + tokenizer.eos_token)

        tokenized = tokenizer(
            texts,
            truncation=True,
            max_length=1024,
            padding="max_length"
        )
        tokenized['labels'] = tokenized['input_ids'].copy()
        return tokenized

    tokenized_dataset = dataset['train'].map(
        preprocess,
        batched=True,
        remove_columns=dataset['train'].column_names
    )

    # 4. ํ•™์Šต
    training_args = TrainingArguments(
        output_dir="./qwen-finetuned",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        logging_steps=10,
        save_steps=500,
        bf16=True,
        optim="paged_adamw_8bit"
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset,
        tokenizer=tokenizer,
    )

    trainer.train()

    # 5. ์ €์žฅ
    model.save_pretrained("./qwen-lora-adapter")

    print("Fine-tuning complete!")


if __name__ == "__main__":
    finetune_slm()
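
To use the result, load the base model and attach the saved adapter (a sketch; merging is optional but simplifies deployment):

def load_finetuned():
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen2.5-1.5B-Instruct",
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    model = PeftModel.from_pretrained(base, "./qwen-lora-adapter")
    return model.merge_and_unload()  # fold the LoRA weights into the base model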

์ฐธ๊ณ  ์ž๋ฃŒ

๋…ผ๋ฌธ

  • Gunasekar et al. (2023). "Textbooks Are All You Need" (Phi)
  • Gemma Team (2024). "Gemma 2: Improving Open Language Models at a Practical Size"
  • Yang et al. (2024). "Qwen2 Technical Report"

๋ชจ๋ธ

๊ด€๋ จ ๋ ˆ์Šจ

to navigate between lessons