22. Inference Optimization

๊ฐœ์š”

LLM ์ถ”๋ก (inference) ์ตœ์ ํ™”๋Š” ํ”„๋กœ๋•์…˜ ํ™˜๊ฒฝ์—์„œ ๋น„์šฉ๊ณผ ์ง€์—ฐ ์‹œ๊ฐ„์„ ์ค„์ด๋Š” ํ•ต์‹ฌ ๊ธฐ์ˆ ์ž…๋‹ˆ๋‹ค. ์ด ๋ ˆ์Šจ์—์„œ๋Š” vLLM, TGI, Speculative Decoding ๋“ฑ์„ ๋‹ค๋ฃน๋‹ˆ๋‹ค.


1. LLM Inference Bottlenecks

1.1 Memory Bottleneck

LLM ์ถ”๋ก  ํŠน์„ฑ:
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  KV Cache ํฌ๊ธฐ ๊ณ„์‚ฐ:                                    โ”‚
โ”‚                                                         โ”‚
โ”‚  Memory = 2 ร— n_layers ร— n_heads ร— head_dim ร— seq_len  โ”‚
โ”‚                       ร— batch_size ร— dtype_size        โ”‚
โ”‚                                                         โ”‚
โ”‚  ์˜ˆ: LLaMA-7B, batch=1, seq=2048, FP16                 โ”‚
โ”‚  = 2 ร— 32 ร— 32 ร— 128 ร— 2048 ร— 1 ร— 2 bytes             โ”‚
โ”‚  = 1.07 GB per sequence                                โ”‚
โ”‚                                                         โ”‚
โ”‚  batch=32์ผ ๊ฒฝ์šฐ: ~34 GB (KV cache๋งŒ)                   โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

๋ฌธ์ œ:
1. GPU ๋ฉ”๋ชจ๋ฆฌ ์ œํ•œ
2. ๊ฐ€๋ณ€ ๊ธธ์ด ์‹œํ€€์Šค โ†’ ๋ฉ”๋ชจ๋ฆฌ ๋‹จํŽธํ™”
3. ๋ฐฐ์น˜ ํฌ๊ธฐ ์ œํ•œ โ†’ ๋‚ฎ์€ ์ฒ˜๋ฆฌ๋Ÿ‰

1.2 Compute Bottleneck

Inefficiency of autoregressive generation:
┌────────────────────────────────────────────────────────┐
│  Step 1: [prompt] → token_1                            │
│  Step 2: [prompt, token_1] → token_2                   │
│  Step 3: [prompt, token_1, token_2] → token_3          │
│  ...                                                   │
│                                                        │
│  At each step:                                         │
│  - the entire KV cache is loaded                       │
│  - only a single token is produced                     │
│  - GPU utilization stays low (memory-bound)            │
└────────────────────────────────────────────────────────┘

2. vLLM

2.1 PagedAttention

The core idea of PagedAttention:
┌────────────────────────────────────────────────────────────┐
│  Conventional approach: contiguous allocation              │
│                                                            │
│  Sequence A: [████████████████████░░░░░░]  (wasted padding)│
│  Sequence B: [██████████░░░░░░░░░░░░░░░░] (even more waste)│
│                                                            │
│  PagedAttention: non-contiguous block allocation           │
│                                                            │
│  Block Pool: [B1][B2][B3][B4][B5][B6][B7][B8]...           │
│                                                            │
│  Sequence A → [B1, B3, B5, B7] (only as needed)            │
│  Sequence B → [B2, B4] (efficient)                         │
│                                                            │
│  Advantages:                                               │
│  - minimal memory waste                                    │
│  - dynamic allocation and release                          │
│  - copy-on-write support (efficient beam search)           │
└────────────────────────────────────────────────────────────┘

2.2 Using vLLM

from vllm import LLM, SamplingParams

class VLLMInference:
    """vLLM inference engine"""

    def __init__(
        self,
        model_name: str,
        tensor_parallel_size: int = 1,
        gpu_memory_utilization: float = 0.9
    ):
        self.model_name = model_name  # kept for the async streaming engine
        self.llm = LLM(
            model=model_name,
            tensor_parallel_size=tensor_parallel_size,
            gpu_memory_utilization=gpu_memory_utilization,
            trust_remote_code=True
        )

    def generate(
        self,
        prompts: list,
        max_tokens: int = 256,
        temperature: float = 0.7,
        top_p: float = 0.9
    ):
        """Batched generation"""
        sampling_params = SamplingParams(
            temperature=temperature,
            top_p=top_p,
            max_tokens=max_tokens
        )

        outputs = self.llm.generate(prompts, sampling_params)

        results = []
        for output in outputs:
            generated_text = output.outputs[0].text
            results.append({
                "prompt": output.prompt,
                "generated": generated_text,
                "tokens": len(output.outputs[0].token_ids)
            })

        return results

    def streaming_generate(self, prompt: str, **kwargs):
        """Streaming generation"""
        from vllm import AsyncLLMEngine, AsyncEngineArgs

        # Streaming requires the async engine
        engine_args = AsyncEngineArgs(model=self.model_name)
        engine = AsyncLLMEngine.from_engine_args(engine_args)

        # Actual token streaming must consume engine.generate() from async code


# Launching the vLLM server
"""
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --tensor-parallel-size 2 \
    --port 8000
"""

# Using the OpenAI-compatible API
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Hello!"}]
)

3. Text Generation Inference (TGI)

3.1 TGI Features

TGI (HuggingFace):
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  ํ•ต์‹ฌ ๊ธฐ๋Šฅ:                                                โ”‚
โ”‚  - Continuous batching                                     โ”‚
โ”‚  - Flash Attention 2                                       โ”‚
โ”‚  - Tensor parallelism                                      โ”‚
โ”‚  - Token streaming                                         โ”‚
โ”‚  - Quantization (GPTQ, AWQ, EETQ)                         โ”‚
โ”‚  - Watermarking                                            โ”‚
โ”‚                                                            โ”‚
โ”‚  ์ง€์› ๋ชจ๋ธ:                                                โ”‚
โ”‚  - LLaMA, Mistral, Falcon                                 โ”‚
โ”‚  - GPT-2, BLOOM, StarCoder                                โ”‚
โ”‚  - T5, BART                                               โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

3.2 Using TGI

# Docker๋กœ TGI ์‹คํ–‰
"""
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-7b-chat-hf \
    --num-shard 2 \
    --quantize awq
"""

from huggingface_hub import InferenceClient

class TGIClient:
    """TGI client"""

    def __init__(self, endpoint: str = "http://localhost:8080"):
        self.client = InferenceClient(endpoint)

    def generate(
        self,
        prompt: str,
        max_new_tokens: int = 256,
        temperature: float = 0.7,
        stream: bool = False
    ):
        """์ƒ์„ฑ"""
        if stream:
            return self._stream_generate(prompt, max_new_tokens, temperature)

        response = self.client.text_generation(
            prompt,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            details=True
        )

        return response

    def _stream_generate(self, prompt: str, max_new_tokens: int, temperature: float):
        """์ŠคํŠธ๋ฆฌ๋ฐ ์ƒ์„ฑ"""
        for token in self.client.text_generation(
            prompt,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            stream=True
        ):
            yield token

    def get_model_info(self):
        """๋ชจ๋ธ ์ •๋ณด"""
        return self.client.get_model_info()


# ์‚ฌ์šฉ ์˜ˆ์‹œ
def tgi_example():
    client = TGIClient()

    # ์ผ๋ฐ˜ ์ƒ์„ฑ
    response = client.generate(
        "Write a short poem about AI:",
        max_new_tokens=100
    )
    print(response.generated_text)

    # ์ŠคํŠธ๋ฆฌ๋ฐ
    print("\nStreaming:")
    for token in client.generate("Once upon a time,", stream=True):
        print(token, end="", flush=True)

4. Speculative Decoding

4.1 ๊ฐœ๋…

Speculative Decoding:
┌────────────────────────────────────────────────────────────┐
│  Idea: draft with a small model → verify with a large one  │
│                                                            │
│  Standard decoding:                                        │
│  Large Model: t1 → t2 → t3 → t4 → t5  (5 forward passes)   │
│                                                            │
│  Speculative decoding:                                     │
│  Draft Model: [t1, t2, t3, t4, t5]  (fast guesses)         │
│  Large Model: verify all at once    (1 forward pass)       │
│                                                            │
│  Result: t1 ✓, t2 ✓, t3 ✗ → resample from t3               │
│                                                            │
│  Speedup: 2-3x (depending on the acceptance rate)          │
└────────────────────────────────────────────────────────────┘

4.2 ๊ตฌํ˜„

import torch
from typing import Tuple

class SpeculativeDecoder:
    """Speculative decoding implementation"""

    def __init__(
        self,
        target_model,  # large (target) model
        draft_model,   # small (draft) model
        tokenizer,
        num_speculative_tokens: int = 5
    ):
        self.target = target_model
        self.draft = draft_model
        self.tokenizer = tokenizer
        self.k = num_speculative_tokens

    @torch.no_grad()
    def generate(
        self,
        input_ids: torch.Tensor,
        max_new_tokens: int = 100,
        temperature: float = 1.0
    ) -> torch.Tensor:
        """Generate with speculative decoding"""
        generated = input_ids.clone()

        while generated.shape[1] - input_ids.shape[1] < max_new_tokens:
            # 1. Draft model proposes k tokens
            draft_tokens, draft_probs = self._draft_tokens(
                generated, self.k, temperature
            )

            # 2. Verify them with the target model
            accepted, target_probs = self._verify_tokens(
                generated, draft_tokens, temperature
            )

            # 3. Append the accepted tokens
            generated = torch.cat([generated, accepted], dim=1)

            # 4. On a rejection, sample the next token from the target model
            if accepted.shape[1] < self.k:
                next_token = self._sample_from_target(
                    generated, target_probs, temperature
                )
                generated = torch.cat([generated, next_token], dim=1)

        return generated

    def _draft_tokens(
        self,
        context: torch.Tensor,
        k: int,
        temperature: float
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """Generate k tokens with the draft model"""
        draft_tokens = []
        draft_probs = []
        current = context

        for _ in range(k):
            outputs = self.draft(current)
            logits = outputs.logits[:, -1] / temperature
            probs = torch.softmax(logits, dim=-1)

            # Sample
            token = torch.multinomial(probs, num_samples=1)
            draft_tokens.append(token)
            draft_probs.append(probs)

            current = torch.cat([current, token], dim=1)

        return torch.cat(draft_tokens, dim=1), torch.stack(draft_probs, dim=1)

    def _verify_tokens(
        self,
        context: torch.Tensor,
        draft_tokens: torch.Tensor,
        temperature: float
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """Verify with the target model"""
        # Single forward pass over the full sequence
        full_seq = torch.cat([context, draft_tokens], dim=1)
        outputs = self.target(full_seq)

        # Target probabilities
        target_logits = outputs.logits[:, context.shape[1]-1:-1] / temperature
        target_probs = torch.softmax(target_logits, dim=-1)

        # Draft probabilities (recomputed in one pass here for simplicity;
        # a real implementation would reuse the probs from the drafting loop)
        draft_probs = self._get_draft_probs(context, draft_tokens, temperature)

        # Acceptance probability: min(1, p_target / p_draft)
        accepted = []
        for i in range(draft_tokens.shape[1]):
            token = draft_tokens[:, i]
            p_target = target_probs[:, i].gather(1, token.unsqueeze(1))
            p_draft = draft_probs[:, i].gather(1, token.unsqueeze(1))

            accept_prob = torch.clamp(p_target / p_draft, max=1.0)

            if torch.rand(1) < accept_prob:
                accepted.append(token)
            else:
                break

        if accepted:
            return torch.stack(accepted, dim=1), target_probs
        else:
            # Nothing accepted: return an empty *long* tensor so torch.cat works
            empty = torch.empty(
                (context.shape[0], 0), dtype=torch.long, device=context.device
            )
            return empty, target_probs

    def _get_draft_probs(self, context, draft_tokens, temperature):
        """Recompute draft probs"""
        full_seq = torch.cat([context, draft_tokens], dim=1)
        outputs = self.draft(full_seq)
        logits = outputs.logits[:, context.shape[1]-1:-1] / temperature
        return torch.softmax(logits, dim=-1)

    def _sample_from_target(self, context, target_probs, temperature):
        """Sample from the target model after a rejection"""
        # Simplification: samples from the last verified position. The full
        # algorithm samples from the normalized residual (p_target - p_draft)+
        # at the rejection position.
        probs = target_probs[:, -1]
        return torch.multinomial(probs, num_samples=1)

5. ์–‘์žํ™” (Quantization)

5.1 ์–‘์žํ™” ๋ฐฉ๋ฒ• ๋น„๊ต

๋ฐฉ๋ฒ• ์ •๋ฐ€๋„ ์†๋„ ํ’ˆ์งˆ ๋ฉ”๋ชจ๋ฆฌ
FP16 16-bit 1x 100% 1x
GPTQ 4-bit ~1.5x 98-99% 0.25x
AWQ 4-bit ~2x 98-99% 0.25x
GGUF 2-8bit ~2x 95-99% 0.15-0.5x
bitsandbytes 4/8-bit ~1.2x 97-99% 0.25-0.5x
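
The memory column follows directly from the bit width. A quick sketch (the helper name `weight_memory_gb` is made up; it counts weights only, ignoring the KV cache, activations, and quantization metadata overhead):

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Approximate weight memory in GB at a given bit width
    (weights only; excludes KV cache, activations, and scales/zeros)."""
    return n_params * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"7B model @ {bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
# 16-bit: 14.0 GB, 8-bit: 7.0 GB, 4-bit: 3.5 GB
```

This is why a 4-bit 7B model fits comfortably on a single consumer GPU where the FP16 version does not.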

5.2 ์–‘์žํ™” ์‚ฌ์šฉ

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# bitsandbytes 4-bit
def load_4bit_model(model_name: str):
    """Load a model with 4-bit quantization"""
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto"
    )

    return model


# GPTQ
def load_gptq_model(model_name: str):
    """Load a GPTQ-quantized model"""
    from auto_gptq import AutoGPTQForCausalLM

    model = AutoGPTQForCausalLM.from_quantized(
        model_name,
        device_map="auto",
        use_safetensors=True
    )

    return model


# AWQ
def load_awq_model(model_name: str):
    """Load an AWQ-quantized model"""
    from awq import AutoAWQForCausalLM

    model = AutoAWQForCausalLM.from_quantized(
        model_name,
        fuse_layers=True,
        device_map="auto"
    )

    return model

6. ๋ฐฐ์น˜ ์ฒ˜๋ฆฌ ์ตœ์ ํ™”

6.1 Continuous Batching

class ContinuousBatcher:
    """Continuous batching implementation"""

    def __init__(
        self,
        model,
        tokenizer,
        max_batch_size: int = 32,
        max_seq_len: int = 2048
    ):
        self.model = model
        self.tokenizer = tokenizer
        self.max_batch_size = max_batch_size
        self.max_seq_len = max_seq_len

        # Active requests
        self.active_requests = {}
        self.request_queue = []

    def add_request(self, request_id: str, prompt: str, max_tokens: int):
        """Add a new request"""
        tokens = self.tokenizer.encode(prompt)
        self.request_queue.append({
            "id": request_id,
            "tokens": tokens,
            "generated": [],
            "max_tokens": max_tokens
        })

    def step(self) -> dict:
        """Process one decoding step"""
        # 1. Admit queued requests into the batch
        while (len(self.active_requests) < self.max_batch_size and
               self.request_queue):
            req = self.request_queue.pop(0)
            self.active_requests[req["id"]] = req

        if not self.active_requests:
            return {}

        # 2. Build the batch
        batch_ids, batch_tokens, attention_mask, lengths = self._prepare_batch()

        # 3. Forward pass (greedy decoding); index each sequence's last
        #    real token rather than the padded final position
        with torch.no_grad():
            outputs = self.model(batch_tokens, attention_mask=attention_mask)
            last_idx = torch.tensor(lengths) - 1
            next_tokens = outputs.logits[
                torch.arange(len(batch_ids)), last_idx
            ].argmax(dim=-1)

        # 4. Update results
        results = {}
        completed = []

        for i, req_id in enumerate(batch_ids):
            req = self.active_requests[req_id]
            token = next_tokens[i].item()
            req["generated"].append(token)

            # Completion check
            if (len(req["generated"]) >= req["max_tokens"] or
                token == self.tokenizer.eos_token_id):
                results[req_id] = self.tokenizer.decode(req["generated"])
                completed.append(req_id)

        # 5. Evict completed requests so their slots free up immediately
        for req_id in completed:
            del self.active_requests[req_id]

        return results

    def _prepare_batch(self):
        """Prepare the batch (right-padding plus an attention mask)"""
        batch_ids = list(self.active_requests.keys())
        sequences = []

        for req_id in batch_ids:
            req = self.active_requests[req_id]
            seq = req["tokens"] + req["generated"]
            sequences.append(seq)

        # Padding
        max_len = max(len(s) for s in sequences)
        lengths = [len(s) for s in sequences]
        padded, mask = [], []
        for seq in sequences:
            pad = max_len - len(seq)
            padded.append(seq + [self.tokenizer.pad_token_id] * pad)
            mask.append([1] * len(seq) + [0] * pad)

        return batch_ids, torch.tensor(padded), torch.tensor(mask), lengths
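
The scheduling behaviour itself can be shown without any model. The following toy simulation (a sketch; the function name is made up and each step simply emits one token per active request) demonstrates the key property of continuous batching: finished requests leave mid-flight and queued ones immediately take their slot.

```python
from collections import deque

def simulate_continuous_batching(request_lengths, max_batch_size):
    """Toy simulation of the step() loop above, with no model:
    one token per active request per step; completed requests are
    evicted immediately and queued requests are admitted in their place."""
    queue = deque(enumerate(request_lengths))  # (req_id, tokens still needed)
    active = {}
    steps = 0
    finished_at = {}
    while queue or active:
        # Admit queued requests into free batch slots
        while len(active) < max_batch_size and queue:
            rid, need = queue.popleft()
            active[rid] = need
        steps += 1
        for rid in list(active):
            active[rid] -= 1              # emit one token for this request
            if active[rid] == 0:
                finished_at[rid] = steps  # slot frees up this very step
                del active[rid]
    return steps, finished_at

steps, done = simulate_continuous_batching([3, 10, 4, 2], max_batch_size=2)
print(steps, done)  # short requests finish early; slots are never idle
```

With static batching the short requests would be held hostage by the 10-token request; here they stream out as soon as they are done, which is the throughput win vLLM and TGI exploit.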

7. ์„ฑ๋Šฅ ๋ฒค์น˜๋งˆํ‚น

import time
from dataclasses import dataclass
from typing import List

@dataclass
class BenchmarkResult:
    throughput: float  # tokens/second
    latency_p50: float  # ms
    latency_p99: float  # ms
    memory_gb: float

def benchmark_inference(
    model,
    tokenizer,
    prompts: List[str],
    max_tokens: int = 100,
    num_runs: int = 10
) -> BenchmarkResult:
    """์ถ”๋ก  ๋ฒค์น˜๋งˆํฌ"""
    latencies = []
    total_tokens = 0

    # Warmup
    for prompt in prompts[:2]:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        _ = model.generate(**inputs, max_new_tokens=10)

    # Benchmark
    for _ in range(num_runs):
        for prompt in prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

            start = time.perf_counter()
            outputs = model.generate(**inputs, max_new_tokens=max_tokens)
            end = time.perf_counter()

            latencies.append((end - start) * 1000)  # ms
            total_tokens += outputs.shape[1] - inputs["input_ids"].shape[1]

    # ๋ฉ”๋ชจ๋ฆฌ
    if torch.cuda.is_available():
        memory_gb = torch.cuda.max_memory_allocated() / 1e9
    else:
        memory_gb = 0

    latencies.sort()

    return BenchmarkResult(
        throughput=total_tokens / (sum(latencies) / 1000),
        latency_p50=latencies[len(latencies) // 2],
        latency_p99=latencies[int(len(latencies) * 0.99)],
        memory_gb=memory_gb
    )

ํ•ต์‹ฌ ์ •๋ฆฌ

์ถ”๋ก  ์ตœ์ ํ™” ๊ธฐ๋ฒ•

1. PagedAttention (vLLM): KV cache ํšจ์œจํ™”
2. Continuous Batching: ๋™์  ๋ฐฐ์น˜ ์ฒ˜๋ฆฌ
3. Speculative Decoding: Draft+Verify
4. Quantization: 4-bit/8-bit ์••์ถ•
5. Flash Attention: Memory-efficient attention
6. Tensor Parallelism: ๋‹ค์ค‘ GPU ๋ถ„์‚ฐ

๋„๊ตฌ ์„ ํƒ ๊ฐ€์ด๋“œ

- ๊ณ ์ฒ˜๋ฆฌ๋Ÿ‰ ์„œ๋น™: vLLM
- HuggingFace ํ†ตํ•ฉ: TGI
- ์—ฃ์ง€ ๋””๋ฐ”์ด์Šค: llama.cpp + GGUF
- ๊ฐœ๋ฐœ/์‹คํ—˜: Transformers + bitsandbytes

์ฐธ๊ณ  ์ž๋ฃŒ

  1. Kwon et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention"
  2. Leviathan et al. (2023). "Fast Inference from Transformers via Speculative Decoding"
  3. vLLM Documentation
  4. TGI Documentation