22. Inference Optimization¶
Overview¶
LLM inference optimization covers the core techniques for reducing cost and latency when serving models in production. This lesson covers vLLM, TGI, speculative decoding, and related methods.
1. Bottlenecks in LLM Inference¶
1.1 Memory Bottleneck¶
Characteristics of LLM inference:
───────────────────────────────────────────────────────────
 KV cache size:

 Memory = 2 × n_layers × n_heads × head_dim × seq_len
          × batch_size × dtype_size

 Example: LLaMA-7B, batch=1, seq=2048, FP16
 = 2 × 32 × 32 × 128 × 2048 × 1 × 2 bytes
 ≈ 1.07 GB per sequence

 With batch=32: ~34 GB (KV cache alone)
───────────────────────────────────────────────────────────
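The formula above can be sanity-checked in a few lines of Python (the LLaMA-7B constants are the ones from the example):

```python
def kv_cache_bytes(n_layers: int, n_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, dtype_size: int = 2) -> int:
    # Factor of 2: one tensor for keys and one for values per layer
    return 2 * n_layers * n_heads * head_dim * seq_len * batch_size * dtype_size

# LLaMA-7B: 32 layers, 32 heads, head_dim 128, seq 2048, FP16 (2 bytes)
per_seq = kv_cache_bytes(32, 32, 128, 2048, batch_size=1)
print(f"{per_seq / 1e9:.2f} GB")       # → 1.07 GB per sequence
print(f"{32 * per_seq / 1e9:.1f} GB")  # batch=32 → 34.4 GB
```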
Problems:
1. Limited GPU memory
2. Variable-length sequences → memory fragmentation
3. Restricted batch size → low throughput
1.2 Compute Bottleneck¶
Inefficiency of autoregressive generation:
──────────────────────────────────────────────────────────
 Step 1: [prompt] → token_1
 Step 2: [prompt, token_1] → token_2
 Step 3: [prompt, token_1, token_2] → token_3
 ...

 At each step:
 - the entire KV cache is loaded
 - only 1 token is produced
 - GPU utilization is low (memory-bound)
──────────────────────────────────────────────────────────
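Because decode is memory-bound, a rough upper bound on single-sequence decode speed follows from memory bandwidth alone. A back-of-the-envelope sketch (the 2 TB/s figure is an assumption, roughly A100-class HBM):

```python
def decode_tokens_per_sec(weight_bytes: float, kv_bytes: float,
                          bandwidth: float) -> float:
    # Each decode step streams all weights plus the KV cache from HBM,
    # so step time ≈ bytes moved / memory bandwidth (batch size 1).
    return bandwidth / (weight_bytes + kv_bytes)

# 7B params in FP16 ≈ 14 GB of weights, ~1 GB KV cache, ~2 TB/s HBM (assumed)
tps = decode_tokens_per_sec(14e9, 1e9, 2e12)
print(f"~{tps:.0f} tokens/s per sequence")  # ~133 tokens/s
```

This is why batching matters: the same weight traffic can serve many sequences per step.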
2. vLLM¶
2.1 PagedAttention¶
Core idea of PagedAttention:
──────────────────────────────────────────────────────────────
 Conventional approach: contiguous memory allocation

 Sequence A: [████████░░░░░░░░░░░░░░░░░░] (wasted padding)
 Sequence B: [███░░░░░░░░░░░░░░░░░░░░░░░] (even more waste)

 PagedAttention: non-contiguous block allocation

 Block Pool: [B1][B2][B3][B4][B5][B6][B7][B8]...

 Sequence A → [B1, B3, B5, B7] (only as much as needed)
 Sequence B → [B2, B4] (efficient)

 Advantages:
 - Minimal memory waste
 - Dynamic allocation and release
 - Copy-on-Write support (efficient beam search)
──────────────────────────────────────────────────────────────
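The block bookkeeping can be illustrated with a toy allocator. This is a hypothetical sketch of the mapping only, not vLLM's actual data structures:

```python
class BlockAllocator:
    """Toy PagedAttention-style block table (illustrative, not vLLM's API)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # physical block ids in the pool
        self.tables = {}                     # seq_id -> list of block ids
        self.lengths = {}                    # seq_id -> tokens stored

    def append_token(self, seq_id: str):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:         # last block full (or first token)
            self.tables.setdefault(seq_id, []).append(self.free.pop(0))
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: str):
        """Return a finished sequence's blocks to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = BlockAllocator(num_blocks=8, block_size=4)
for _ in range(10):
    alloc.append_token("A")                  # 10 tokens → 3 blocks
for _ in range(3):
    alloc.append_token("B")                  # 3 tokens → 1 block
print(alloc.tables)                          # {'A': [0, 1, 2], 'B': [3]}
```

Each sequence holds exactly ⌈tokens / block_size⌉ blocks, and releasing a sequence returns its blocks for immediate reuse, which is where the fragmentation savings come from.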
2.2 Using vLLM¶
from vllm import LLM, SamplingParams

class VLLMInference:
    """vLLM inference engine"""

    def __init__(
        self,
        model_name: str,
        tensor_parallel_size: int = 1,
        gpu_memory_utilization: float = 0.9
    ):
        self.model_name = model_name  # kept for the async engine below
        self.llm = LLM(
            model=model_name,
            tensor_parallel_size=tensor_parallel_size,
            gpu_memory_utilization=gpu_memory_utilization,
            trust_remote_code=True
        )

    def generate(
        self,
        prompts: list,
        max_tokens: int = 256,
        temperature: float = 0.7,
        top_p: float = 0.9
    ):
        """Batch generation"""
        sampling_params = SamplingParams(
            temperature=temperature,
            top_p=top_p,
            max_tokens=max_tokens
        )
        outputs = self.llm.generate(prompts, sampling_params)

        results = []
        for output in outputs:
            generated_text = output.outputs[0].text
            results.append({
                "prompt": output.prompt,
                "generated": generated_text,
                "tokens": len(output.outputs[0].token_ids)
            })
        return results

    def streaming_generate(self, prompt: str, **kwargs):
        """Streaming generation (sketch: requires the async engine)"""
        from vllm import AsyncLLMEngine, AsyncEngineArgs

        engine_args = AsyncEngineArgs(model=self.model_name)
        engine = AsyncLLMEngine.from_engine_args(engine_args)
        # Actual streaming requires separate async code around the engine

# Running the vLLM server:
"""
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --tensor-parallel-size 2 \
    --port 8000
"""

# Using the OpenAI-compatible API:
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Hello!"}]
)
3. Text Generation Inference (TGI)¶
3.1 TGI Features¶
TGI (HuggingFace):
──────────────────────────────────────────────────────────────
 Key features:
 - Continuous batching
 - Flash Attention 2
 - Tensor parallelism
 - Token streaming
 - Quantization (GPTQ, AWQ, EETQ)
 - Watermarking

 Supported models:
 - LLaMA, Mistral, Falcon
 - GPT-2, BLOOM, StarCoder
 - T5, BART
──────────────────────────────────────────────────────────────
3.2 Using TGI¶
# Run TGI with Docker:
"""
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-7b-chat-hf \
    --num-shard 2 \
    --quantize awq
"""

from huggingface_hub import InferenceClient

class TGIClient:
    """TGI client"""

    def __init__(self, endpoint: str = "http://localhost:8080"):
        self.client = InferenceClient(endpoint)

    def generate(
        self,
        prompt: str,
        max_new_tokens: int = 256,
        temperature: float = 0.7,
        stream: bool = False
    ):
        """Generate"""
        if stream:
            return self._stream_generate(prompt, max_new_tokens, temperature)

        response = self.client.text_generation(
            prompt,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            details=True
        )
        return response

    def _stream_generate(self, prompt: str, max_new_tokens: int, temperature: float):
        """Streaming generation"""
        for token in self.client.text_generation(
            prompt,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            stream=True
        ):
            yield token

    def get_model_info(self):
        """Endpoint/model information"""
        return self.client.get_endpoint_info()

# Usage example
def tgi_example():
    client = TGIClient()

    # Plain generation
    response = client.generate(
        "Write a short poem about AI:",
        max_new_tokens=100
    )
    print(response.generated_text)

    # Streaming
    print("\nStreaming:")
    for token in client.generate("Once upon a time,", stream=True):
        print(token, end="", flush=True)
4. Speculative Decoding¶
4.1 Concept¶
Speculative decoding:
──────────────────────────────────────────────────────────────
 Idea: draft tokens with a small model → verify with the large model

 Standard decoding:
 Large Model: t1 → t2 → t3 → t4 → t5 (5 forward passes)

 Speculative decoding:
 Draft Model: [t1, t2, t3, t4, t5] (fast guesses)
 Large Model: verify all at once (1 forward pass)

 Result: t1 ✓, t2 ✓, t3 ✗ → resample from t3

 Speedup: 2-3x (depending on acceptance rate)
──────────────────────────────────────────────────────────────
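The dependence on acceptance rate can be made precise. Per Leviathan et al. (2023), with a per-token acceptance rate α and k draft tokens, one target-model pass yields (1 − α^(k+1)) / (1 − α) tokens on average:

```python
def expected_tokens_per_target_pass(alpha: float, k: int) -> float:
    # Geometric series 1 + alpha + alpha^2 + ... + alpha^k:
    # each extra draft token survives only if all earlier ones were accepted.
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.6, 0.8, 0.9):
    e = expected_tokens_per_target_pass(alpha, k=5)
    print(f"alpha={alpha}: {e:.2f} tokens per target forward pass")
```

With α = 0.8 and k = 5 drafts, each target pass yields about 3.7 tokens instead of 1, which is where the 2-3x speedup figure comes from (minus draft-model overhead).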
4.2 Implementation¶
import torch
from typing import Tuple

class SpeculativeDecoder:
    """Speculative decoding implementation (sketch; assumes batch size 1)"""

    def __init__(
        self,
        target_model,  # large model
        draft_model,   # small model
        tokenizer,
        num_speculative_tokens: int = 5
    ):
        self.target = target_model
        self.draft = draft_model
        self.tokenizer = tokenizer
        self.k = num_speculative_tokens

    @torch.no_grad()
    def generate(
        self,
        input_ids: torch.Tensor,
        max_new_tokens: int = 100,
        temperature: float = 1.0
    ) -> torch.Tensor:
        """Generate with speculative decoding"""
        generated = input_ids.clone()

        while generated.shape[1] - input_ids.shape[1] < max_new_tokens:
            # 1. Guess k tokens with the draft model
            draft_tokens, draft_probs = self._draft_tokens(
                generated, self.k, temperature
            )

            # 2. Verify with the target model
            accepted, target_probs = self._verify_tokens(
                generated, draft_tokens, temperature
            )

            # 3. Append the accepted tokens
            generated = torch.cat([generated, accepted], dim=1)

            # 4. Some tokens rejected → sample the next token from the target
            if accepted.shape[1] < self.k:
                next_token = self._sample_from_target(
                    generated, target_probs, temperature
                )
                generated = torch.cat([generated, next_token], dim=1)

        return generated

    def _draft_tokens(
        self,
        context: torch.Tensor,
        k: int,
        temperature: float
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """Generate k tokens with the draft model"""
        draft_tokens = []
        draft_probs = []
        current = context

        for _ in range(k):
            outputs = self.draft(current)
            logits = outputs.logits[:, -1] / temperature
            probs = torch.softmax(logits, dim=-1)

            # Sample one token
            token = torch.multinomial(probs, num_samples=1)
            draft_tokens.append(token)
            draft_probs.append(probs)
            current = torch.cat([current, token], dim=1)

        return torch.cat(draft_tokens, dim=1), torch.stack(draft_probs, dim=1)

    def _verify_tokens(
        self,
        context: torch.Tensor,
        draft_tokens: torch.Tensor,
        temperature: float
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """Verify with the target model"""
        # One forward pass over the whole sequence
        full_seq = torch.cat([context, draft_tokens], dim=1)
        outputs = self.target(full_seq)

        # Target probabilities at the draft positions
        target_logits = outputs.logits[:, context.shape[1]-1:-1] / temperature
        target_probs = torch.softmax(target_logits, dim=-1)

        # Draft probabilities (recomputed here for simplicity)
        draft_probs = self._get_draft_probs(context, draft_tokens, temperature)

        # Acceptance probability: min(1, p_target / p_draft)
        accepted = []
        for i in range(draft_tokens.shape[1]):
            token = draft_tokens[:, i]
            p_target = target_probs[:, i].gather(1, token.unsqueeze(1))
            p_draft = draft_probs[:, i].gather(1, token.unsqueeze(1))

            accept_prob = torch.clamp(p_target / p_draft, max=1.0)
            if torch.rand(1) < accept_prob:
                accepted.append(token)
            else:
                break

        if accepted:
            return torch.stack(accepted, dim=1), target_probs
        else:
            # No tokens accepted: return an empty (1, 0) integer tensor
            return torch.empty((1, 0), dtype=draft_tokens.dtype), target_probs

    def _get_draft_probs(self, context, draft_tokens, temperature):
        """Recompute draft probs over the full sequence"""
        full_seq = torch.cat([context, draft_tokens], dim=1)
        outputs = self.draft(full_seq)
        logits = outputs.logits[:, context.shape[1]-1:-1] / temperature
        return torch.softmax(logits, dim=-1)

    def _sample_from_target(self, context, target_probs, temperature):
        """Sample the next token from the target at the rejection position.

        Simplified: the full algorithm samples from the residual
        distribution max(0, p_target - p_draft), renormalized.
        """
        probs = target_probs[:, -1]
        return torch.multinomial(probs, num_samples=1)
5. Quantization¶
5.1 Comparison of Quantization Methods¶
| Method | Precision | Speed | Quality | Memory |
|---|---|---|---|---|
| FP16 | 16-bit | 1x | 100% | 1x |
| GPTQ | 4-bit | ~1.5x | 98-99% | 0.25x |
| AWQ | 4-bit | ~2x | 98-99% | 0.25x |
| GGUF | 2-8 bit | ~2x | 95-99% | 0.15-0.5x |
| bitsandbytes | 4/8-bit | ~1.2x | 97-99% | 0.25-0.5x |
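The memory column can be sanity-checked against parameter counts. A rough calculator (the 10% overhead for scales and zero-points is an assumption, and real footprints vary by format):

```python
def weight_memory_gb(n_params: float, bits: float, overhead: float = 1.1) -> float:
    # bits/8 bytes per parameter, plus overhead for quantization metadata
    return n_params * bits / 8 / 1e9 * overhead

for name, bits in [("FP16", 16), ("INT8", 8), ("4-bit (GPTQ/AWQ)", 4)]:
    print(f"7B {name}: ~{weight_memory_gb(7e9, bits):.1f} GB")
```

A 7B model drops from ~15 GB in FP16 to under 4 GB at 4-bit, which is what makes single-consumer-GPU serving feasible.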
5.2 Using Quantization¶
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# bitsandbytes 4-bit
def load_4bit_model(model_name: str):
    """Load a model with 4-bit quantization"""
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto"
    )
    return model

# GPTQ
def load_gptq_model(model_name: str):
    """Load a GPTQ-quantized model"""
    from auto_gptq import AutoGPTQForCausalLM

    model = AutoGPTQForCausalLM.from_quantized(
        model_name,
        device_map="auto",
        use_safetensors=True
    )
    return model

# AWQ
def load_awq_model(model_name: str):
    """Load an AWQ-quantized model"""
    from awq import AutoAWQForCausalLM

    model = AutoAWQForCausalLM.from_quantized(
        model_name,
        fuse_layers=True,
        device_map="auto"
    )
    return model
6. Batching Optimization¶
6.1 Continuous Batching¶
import torch

class ContinuousBatcher:
    """Continuous batching implementation (simplified sketch)"""

    def __init__(
        self,
        model,
        tokenizer,
        max_batch_size: int = 32,
        max_seq_len: int = 2048
    ):
        self.model = model
        self.tokenizer = tokenizer
        self.max_batch_size = max_batch_size
        self.max_seq_len = max_seq_len

        # Active requests
        self.active_requests = {}
        self.request_queue = []

    def add_request(self, request_id: str, prompt: str, max_tokens: int):
        """Add a new request"""
        tokens = self.tokenizer.encode(prompt)
        self.request_queue.append({
            "id": request_id,
            "tokens": tokens,
            "generated": [],
            "max_tokens": max_tokens
        })

    def step(self) -> dict:
        """Process one decoding step"""
        # 1. Admit new requests into the batch
        while (len(self.active_requests) < self.max_batch_size and
               self.request_queue):
            req = self.request_queue.pop(0)
            self.active_requests[req["id"]] = req

        if not self.active_requests:
            return {}

        # 2. Build the batch
        batch_ids, batch_tokens = self._prepare_batch()

        # 3. Forward pass (greedy decoding)
        with torch.no_grad():
            outputs = self.model(batch_tokens)
            next_tokens = outputs.logits[:, -1].argmax(dim=-1)

        # 4. Update results
        results = {}
        completed = []

        for i, req_id in enumerate(batch_ids):
            req = self.active_requests[req_id]
            token = next_tokens[i].item()
            req["generated"].append(token)

            # Completion check
            if (len(req["generated"]) >= req["max_tokens"] or
                token == self.tokenizer.eos_token_id):
                results[req_id] = self.tokenizer.decode(req["generated"])
                completed.append(req_id)

        # 5. Remove completed requests
        for req_id in completed:
            del self.active_requests[req_id]

        return results

    def _prepare_batch(self):
        """Prepare a padded batch.

        Simplified: a real implementation would also pass an attention
        mask (or use left padding) so padding does not shift the last
        position for shorter sequences.
        """
        batch_ids = list(self.active_requests.keys())
        sequences = []

        for req_id in batch_ids:
            req = self.active_requests[req_id]
            seq = req["tokens"] + req["generated"]
            sequences.append(seq)

        # Right-pad to the longest sequence
        max_len = max(len(s) for s in sequences)
        padded = []
        for seq in sequences:
            padded.append(seq + [self.tokenizer.pad_token_id] * (max_len - len(seq)))

        return batch_ids, torch.tensor(padded)
7. Performance Benchmarking¶
import time
import torch
from dataclasses import dataclass
from typing import List

@dataclass
class BenchmarkResult:
    throughput: float    # tokens/second
    latency_p50: float   # ms
    latency_p99: float   # ms
    memory_gb: float

def benchmark_inference(
    model,
    tokenizer,
    prompts: List[str],
    max_tokens: int = 100,
    num_runs: int = 10
) -> BenchmarkResult:
    """Inference benchmark"""
    latencies = []
    total_tokens = 0

    # Warmup
    for prompt in prompts[:2]:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        _ = model.generate(**inputs, max_new_tokens=10)

    # Benchmark
    for _ in range(num_runs):
        for prompt in prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

            start = time.perf_counter()
            outputs = model.generate(**inputs, max_new_tokens=max_tokens)
            end = time.perf_counter()

            latencies.append((end - start) * 1000)  # ms
            total_tokens += outputs.shape[1] - inputs["input_ids"].shape[1]

    # Peak GPU memory
    if torch.cuda.is_available():
        memory_gb = torch.cuda.max_memory_allocated() / 1e9
    else:
        memory_gb = 0

    latencies.sort()
    return BenchmarkResult(
        throughput=total_tokens / (sum(latencies) / 1000),
        latency_p50=latencies[len(latencies) // 2],
        latency_p99=latencies[int(len(latencies) * 0.99)],
        memory_gb=memory_gb
    )
Key Takeaways¶
Inference Optimization Techniques¶
1. PagedAttention (vLLM): efficient KV cache management
2. Continuous batching: dynamic batch scheduling
3. Speculative decoding: draft + verify
4. Quantization: 4-bit/8-bit compression
5. Flash Attention: memory-efficient attention
6. Tensor parallelism: multi-GPU distribution
Tool Selection Guide¶
- High-throughput serving: vLLM
- HuggingFace integration: TGI
- Edge devices: llama.cpp + GGUF
- Development/experimentation: Transformers + bitsandbytes
References¶
- Kwon et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention"
- Leviathan et al. (2023). "Fast Inference from Transformers via Speculative Decoding"
- vLLM Documentation
- TGI Documentation