19. PEFT (Parameter-Efficient Fine-Tuning) Integration
Overview
PEFT methods train only a small set of parameters instead of the full model, enabling efficient adaptation. This lesson covers the major PEFT techniques in an integrated way.
1. PEFT Overview
1.1 Why PEFT?
Problems with full fine-tuning:
┌──────────────────────────┐
│ LLaMA-7B                 │
│ - Parameters: 7B         │
│ - FP16 memory: 14GB      │
│ - Optimizer states: 56GB │
│ - Gradients: 14GB        │
│ - Total: ~84GB           │
└──────────────────────────┘
Advantages of PEFT:
┌────────────────────────────────────────┐
│ LoRA (rank=8)                          │
│ - Trainable parameters: ~0.1%          │
│ - Extra memory: ~100MB                 │
│ - Performance: 90-95% of full FT       │
│ - Storage: base model + small adapter  │
└────────────────────────────────────────┘
1.2 A Taxonomy of PEFT Methods
┌───────────────────────────────────────────────────────────────┐
│                          PEFT Methods                         │
├────────────────────┬─────────────────────┬────────────────────┤
│ Additive           │ Reparameterization  │ Selective          │
│ • Adapters         │ • LoRA              │ • BitFit           │
│ • Prefix Tuning    │ • DoRA              │ • Diff Pruning     │
│ • Prompt Tuning    │ • AdaLoRA           │ • Partial FT       │
│ • IA³              │ • QLoRA             │                    │
└────────────────────┴─────────────────────┴────────────────────┘
2. LoRA (Low-Rank Adaptation)
2.1 Mathematical Formulation
Basic idea:
- The weight update ΔW can be approximated by a low-rank decomposition
- ΔW = BA, where B ∈ R^(d×r), A ∈ R^(r×k)
- r << min(d, k)
Forward pass:
h = W₀x + ΔWx = W₀x + BAx
Trainable parameters:
- W₀: frozen
- A, B: trainable
- Parameter count: r(d + k) vs dk (with r << min(d, k))
Example (d = 4096, k = 4096, r = 8):
- Full: 16.7M params
- LoRA: 65K params (0.4%)
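As a quick sanity check of the counts above, plain Python arithmetic reproduces them:
d, k, r = 4096, 4096, 8
full_params = d * k          # 16,777,216  (~16.7M)
lora_params = r * (d + k)    # 65,536      (~65K)
print(f"Full: {full_params:,}  LoRA: {lora_params:,}  "
      f"ratio: {100 * lora_params / full_params:.2f}%")  # ~0.39%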
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional
class LoRALayer(nn.Module):
"""LoRA λ μ΄μ΄"""
def __init__(
self,
in_features: int,
out_features: int,
rank: int = 8,
alpha: float = 16.0,
dropout: float = 0.0
):
super().__init__()
self.rank = rank
self.alpha = alpha
self.scaling = alpha / rank
# Low-rank matrices
self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
self.dropout = nn.Dropout(dropout) if dropout > 0 else nn.Identity()
        # Initialization: A ~ Kaiming uniform, B = 0, so ΔW starts at zero
nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
nn.init.zeros_(self.lora_B)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""LoRA delta: BA * scaling"""
return self.scaling * (self.dropout(x) @ self.lora_A.T @ self.lora_B.T)
class LinearWithLoRA(nn.Module):
"""LoRAκ° μ μ©λ Linear λ μ΄μ΄"""
def __init__(
self,
linear: nn.Linear,
rank: int = 8,
alpha: float = 16.0,
dropout: float = 0.0
):
super().__init__()
self.linear = linear
self.lora = LoRALayer(
linear.in_features,
linear.out_features,
rank, alpha, dropout
)
# Original weights frozen
for param in self.linear.parameters():
param.requires_grad = False
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.linear(x) + self.lora(x)
def merge_weights(self):
"""LoRA weightsλ₯Ό originalμ λ³ν©"""
with torch.no_grad():
self.linear.weight += (
self.lora.lora_B @ self.lora.lora_A
) * self.lora.scaling
def apply_lora_to_model(
    model: nn.Module,
    rank: int = 8,
    alpha: float = 16.0,
    target_modules: tuple = ("q_proj", "v_proj")
):
    """Apply LoRA to every targeted nn.Linear in the model."""
    # Collect matches first so replacement does not interfere with iteration
    targets = [
        name for name, module in model.named_modules()
        if isinstance(module, nn.Linear)
        and any(target in name for target in target_modules)
    ]
    for name in targets:
        # Locate the parent module
        parent_name = ".".join(name.split(".")[:-1])
        child_name = name.split(".")[-1]
        parent = model.get_submodule(parent_name) if parent_name else model
        # Swap in the LoRA-wrapped linear
        lora_linear = LinearWithLoRA(model.get_submodule(name), rank, alpha)
        setattr(parent, child_name, lora_linear)
    return model
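A usage sketch with a toy module; the q_proj/k_proj/v_proj names below are placeholders that mimic attention projection names, not a real LLM:
class ToyAttention(nn.Module):
    """Toy stand-in with attention-style projection names."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.q_proj(x) + self.k_proj(x) + self.v_proj(x)

toy = ToyAttention()
for p in toy.parameters():            # freeze the whole base model first
    p.requires_grad = False
toy = apply_lora_to_model(toy, rank=4, alpha=8.0)

trainable = sum(p.numel() for p in toy.parameters() if p.requires_grad)
total = sum(p.numel() for p in toy.parameters())
print(f"Trainable: {trainable:,} / {total:,}")  # only the LoRA A/B matrices remain trainable
out = toy(torch.randn(2, 64))                   # forward still works: (2, 64)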
2.2 QLoRA (Quantized LoRA)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
def setup_qlora(model_name: str, rank: int = 64):
"""QLoRA μ€μ """
    # 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # Normal Float 4
bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True  # double quantization
)
    # Load the model
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
trust_remote_code=True
)
    # Prepare the model for k-bit training
model = prepare_model_for_kbit_training(model)
    # LoRA config
lora_config = LoraConfig(
r=rank,
lora_alpha=rank * 2,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
    # Report trainable parameter counts
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
all_params = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable_params:,} / {all_params:,} ({100*trainable_params/all_params:.2f}%)")
return model
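A minimal usage sketch, assuming a CUDA GPU and the bitsandbytes/peft packages; the model id is only an illustrative (gated) example:
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"   # illustrative model id
model = setup_qlora(model_id, rank=64)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# From here, the model plugs into the usual Trainer / SFT workflow shown in section 5.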
2.3 DoRA (Weight-Decomposed Low-Rank Adaptation)
class DoRALayer(nn.Module):
"""
DoRA: Weight = m * (W + BA) / ||W + BA||
    Decomposes the weight into a magnitude and a direction component.
"""
def __init__(
self,
in_features: int,
out_features: int,
rank: int = 8,
alpha: float = 16.0
):
super().__init__()
self.rank = rank
self.scaling = alpha / rank
# LoRA components
self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
# Magnitude vector (learnable)
self.magnitude = nn.Parameter(torch.ones(out_features))
nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
nn.init.zeros_(self.lora_B)
def forward(
self,
x: torch.Tensor,
original_weight: torch.Tensor
) -> torch.Tensor:
"""
        W' = m * (W + ΔW) / ||W + ΔW||
"""
        # ΔW = B @ A
        delta_w = (self.lora_B @ self.lora_A) * self.scaling
        # W + ΔW
adapted_weight = original_weight + delta_w
# Normalize direction
weight_norm = adapted_weight.norm(dim=1, keepdim=True)
normalized_weight = adapted_weight / (weight_norm + 1e-8)
# Apply magnitude
final_weight = self.magnitude.unsqueeze(1) * normalized_weight
return F.linear(x, final_weight)
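A minimal sketch of wiring DoRALayer onto a frozen nn.Linear; this wrapper is an assumption for illustration (the bias term is ignored for simplicity):
class LinearWithDoRA(nn.Module):
    """Frozen Linear + DoRALayer (illustrative wrapper, bias omitted)."""
    def __init__(self, linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad = False
        self.dora = DoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # DoRALayer rebuilds the full weight from W, BA, and the magnitude vector
        return self.dora(x, self.linear.weight)

layer = LinearWithDoRA(nn.Linear(512, 512), rank=8)
out = layer(torch.randn(2, 512))   # (2, 512)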
3. Adapter Methods
3.1 Bottleneck Adapters
Transformer block with adapters:
┌────────────────────────────────────────────┐
│ Multi-Head Attention                       │
│          ↓                                 │
│ ┌────────────────────────────────────────┐ │
│ │ Adapter (bottleneck)                   │ │
│ │ Linear(d → r) → GELU                   │ │
│ │ Linear(r → d) + residual               │ │
│ └────────────────────────────────────────┘ │
│          ↓                                 │
│ Feed-Forward Network                       │
│          ↓                                 │
│ ┌────────────────────────────────────────┐ │
│ │ Adapter (bottleneck)                   │ │
│ └────────────────────────────────────────┘ │
└────────────────────────────────────────────┘
class Adapter(nn.Module):
"""Bottleneck Adapter"""
def __init__(
self,
hidden_size: int,
bottleneck_size: int,
adapter_scalar: float = 1.0
):
super().__init__()
self.down_proj = nn.Linear(hidden_size, bottleneck_size)
self.up_proj = nn.Linear(bottleneck_size, hidden_size)
self.act = nn.GELU()
self.scalar = adapter_scalar
        # Near-identity initialization
nn.init.normal_(self.down_proj.weight, std=1e-3)
nn.init.zeros_(self.down_proj.bias)
nn.init.normal_(self.up_proj.weight, std=1e-3)
nn.init.zeros_(self.up_proj.bias)
def forward(self, x: torch.Tensor) -> torch.Tensor:
residual = x
x = self.down_proj(x)
x = self.act(x)
x = self.up_proj(x)
return residual + self.scalar * x
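A sketch of where these adapters sit in a (heavily simplified) transformer block, mirroring the diagram above; layer norms and masking are omitted, and the attn/ffn modules are stand-ins for the real sub-layers:
class BlockWithAdapters(nn.Module):
    """Simplified block: one adapter after attention, one after the FFN."""
    def __init__(self, hidden_size: int = 512, bottleneck: int = 64):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_size, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )
        self.adapter_attn = Adapter(hidden_size, bottleneck)
        self.adapter_ffn = Adapter(hidden_size, bottleneck)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)
        x = x + self.adapter_attn(attn_out)    # adapter on the attention output
        x = x + self.adapter_ffn(self.ffn(x))  # adapter on the FFN output
        return x

block = BlockWithAdapters()
out = block(torch.randn(2, 16, 512))   # (2, 16, 512)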
3.2 IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations)
class IA3Layer(nn.Module):
"""
    IA³: learns only per-dimension scaling vectors
    - element-wise rescaling of keys, values, and FFN activations
    - extremely few trainable parameters
"""
def __init__(self, dim: int):
super().__init__()
# Learnable scaling vectors
self.l_k = nn.Parameter(torch.ones(dim)) # key scaling
self.l_v = nn.Parameter(torch.ones(dim)) # value scaling
self.l_ff = nn.Parameter(torch.ones(dim)) # ffn scaling
def scale_key(self, k: torch.Tensor) -> torch.Tensor:
return k * self.l_k
def scale_value(self, v: torch.Tensor) -> torch.Tensor:
return v * self.l_v
def scale_ffn(self, h: torch.Tensor) -> torch.Tensor:
return h * self.l_ff
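A sketch of where the IA³ scalings enter attention, using a single-head scaled dot-product for clarity; real implementations apply l_k/l_v inside each attention layer and l_ff on the intermediate FFN activation:
def ia3_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, ia3: IA3Layer) -> torch.Tensor:
    """Single-head attention with IA³ key/value scalings (sketch)."""
    k = ia3.scale_key(k)      # element-wise l_k * K
    v = ia3.scale_value(v)    # element-wise l_v * V
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

ia3 = IA3Layer(dim=64)
q = k = v = torch.randn(2, 10, 64)
out = ia3_attention(q, k, v, ia3)   # (2, 10, 64)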
4. Prompt-based Methods
4.1 Prefix Tuning
┌──────────────────────────────────────────────────────────────┐
│ Prefix Tuning                                                │
│                                                              │
│ Input: [P₁, P₂, ..., Pₘ, x₁, x₂, ..., xₙ]                    │
│                                                              │
│ - Pᵢ: learnable prefix tokens (keys/values at every layer)  │
│ - xᵢ: actual input tokens                                    │
│                                                              │
│ Attention:                                                   │
│   softmax(Q · [P_keys; X_keys]ᵀ) · [P_values; X_values]      │
└──────────────────────────────────────────────────────────────┘
class PrefixTuning(nn.Module):
"""Prefix Tuning"""
def __init__(
self,
num_layers: int,
num_heads: int,
head_dim: int,
prefix_length: int = 10,
hidden_size: int = 512
):
super().__init__()
self.prefix_length = prefix_length
self.num_layers = num_layers
self.num_heads = num_heads
self.head_dim = head_dim
# Prefix embeddings (through MLP for stability)
self.prefix_embedding = nn.Embedding(prefix_length, hidden_size)
# Layer-specific projections
self.prefix_mlp = nn.Sequential(
nn.Linear(hidden_size, hidden_size),
nn.Tanh(),
nn.Linear(hidden_size, num_layers * 2 * num_heads * head_dim)
)
def forward(self, batch_size: int) -> tuple:
"""
Returns:
prefix_keys: (num_layers, batch_size, num_heads, prefix_len, head_dim)
prefix_values: (num_layers, batch_size, num_heads, prefix_len, head_dim)
"""
# Prefix indices
        prefix_idx = torch.arange(self.prefix_length, device=self.prefix_embedding.weight.device)
prefix_embed = self.prefix_embedding(prefix_idx) # (prefix_len, hidden)
# Project to key/value pairs for all layers
prefix_kv = self.prefix_mlp(prefix_embed) # (prefix_len, num_layers*2*num_heads*head_dim)
# Reshape
prefix_kv = prefix_kv.view(
self.prefix_length,
self.num_layers, 2,
self.num_heads, self.head_dim
)
        prefix_kv = prefix_kv.permute(1, 2, 3, 0, 4)  # (layers, 2, heads, prefix_len, head_dim)
# Expand for batch
prefix_keys = prefix_kv[:, 0].unsqueeze(1).expand(-1, batch_size, -1, -1, -1)
prefix_values = prefix_kv[:, 1].unsqueeze(1).expand(-1, batch_size, -1, -1, -1)
return prefix_keys, prefix_values
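A sketch of how the returned prefix keys/values are consumed inside one layer's attention; shapes follow the docstring above, and the input-side Q/K/V here are random stand-ins:
num_layers, num_heads, head_dim = 2, 8, 32
prefix = PrefixTuning(num_layers, num_heads, head_dim, prefix_length=10,
                      hidden_size=num_heads * head_dim)

batch, seq = 4, 16
p_keys, p_values = prefix(batch)                   # (layers, batch, heads, prefix, dim)
q = torch.randn(batch, num_heads, seq, head_dim)   # queries come only from the real input
x_keys = torch.randn(batch, num_heads, seq, head_dim)
x_values = torch.randn(batch, num_heads, seq, head_dim)

layer = 0
keys = torch.cat([p_keys[layer], x_keys], dim=2)       # (batch, heads, prefix+seq, dim)
values = torch.cat([p_values[layer], x_values], dim=2)
scores = q @ keys.transpose(-2, -1) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ values           # (batch, heads, seq, dim)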
4.2 Prompt Tuning
class PromptTuning(nn.Module):
"""
    Prompt Tuning: prepend learnable soft-prompt embeddings to the input.
    Simple but effective, especially for large models.
"""
def __init__(
self,
num_tokens: int,
embed_dim: int,
init_from_vocab: bool = False,
vocab_embeddings: Optional[nn.Embedding] = None
):
super().__init__()
self.num_tokens = num_tokens
# Soft prompt embeddings
self.prompt_embeddings = nn.Parameter(torch.zeros(num_tokens, embed_dim))
if init_from_vocab and vocab_embeddings is not None:
            # Initialize from real vocabulary token embeddings
indices = torch.randint(0, vocab_embeddings.num_embeddings, (num_tokens,))
self.prompt_embeddings.data = vocab_embeddings.weight[indices].clone()
else:
nn.init.normal_(self.prompt_embeddings, std=0.02)
def forward(self, input_embeddings: torch.Tensor) -> torch.Tensor:
"""
Args:
input_embeddings: (batch, seq_len, embed_dim)
Returns:
(batch, prompt_len + seq_len, embed_dim)
"""
batch_size = input_embeddings.shape[0]
# Expand prompt for batch
prompt = self.prompt_embeddings.unsqueeze(0).expand(batch_size, -1, -1)
# Concatenate
return torch.cat([prompt, input_embeddings], dim=1)
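A usage sketch: prepend the soft prompt to the input embeddings and extend the attention mask to match (the embedding table here is a random stand-in for the frozen LM's input embeddings):
embed_dim, vocab_size, prompt_len = 768, 32000, 20
embedding = nn.Embedding(vocab_size, embed_dim)          # stand-in for the LM's embedding table
prompt = PromptTuning(prompt_len, embed_dim,
                      init_from_vocab=True, vocab_embeddings=embedding)

input_ids = torch.randint(0, vocab_size, (2, 16))
attention_mask = torch.ones(2, 16, dtype=torch.long)

inputs_embeds = prompt(embedding(input_ids))             # (2, 20 + 16, 768)
prompt_mask = torch.ones(2, prompt_len, dtype=torch.long)
attention_mask = torch.cat([prompt_mask, attention_mask], dim=1)
# Both tensors are then fed to the frozen model via inputs_embeds=... / attention_mask=...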
5. Using HuggingFace PEFT
from peft import (
LoraConfig, PrefixTuningConfig, PromptTuningConfig,
get_peft_model, TaskType
)
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
def setup_peft_training(
model_name: str,
method: str = "lora",
output_dir: str = "./output"
):
"""λ€μν PEFT λ°©λ² μ€μ """
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
    # PEFT config
if method == "lora":
peft_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM
)
elif method == "prefix":
peft_config = PrefixTuningConfig(
num_virtual_tokens=20,
task_type=TaskType.CAUSAL_LM
)
elif method == "prompt":
peft_config = PromptTuningConfig(
num_virtual_tokens=20,
prompt_tuning_init="TEXT",
prompt_tuning_init_text="Classify the sentiment of this text: ",
tokenizer_name_or_path=model_name,
task_type=TaskType.CAUSAL_LM
)
    else:
        raise ValueError(f"Unknown PEFT method: {method}")
    model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
return model, tokenizer
def train_with_peft(model, tokenizer, train_dataset):
"""PEFT λͺ¨λΈ νμ΅"""
training_args = TrainingArguments(
output_dir="./peft-output",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
warmup_steps=100,
logging_steps=10,
save_strategy="epoch",
fp16=True
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
tokenizer=tokenizer
)
trainer.train()
    # Save only the adapter (the base model is not duplicated)
model.save_pretrained("./peft-adapter")
def load_and_merge_adapter(base_model_name: str, adapter_path: str):
"""Adapter λ‘λ λ° λ³ν©"""
from peft import PeftModel
# Base model
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
    # Load the adapter
model = PeftModel.from_pretrained(base_model, adapter_path)
    # Merge (improves inference speed)
merged_model = model.merge_and_unload()
return merged_model
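A usage sketch with illustrative paths and model id:
merged = load_and_merge_adapter("meta-llama/Llama-2-7b-hf", "./peft-adapter")
merged.save_pretrained("./merged-model")  # a plain transformers checkpoint; peft is not needed at inference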
6. Method Comparison
6.1 Parameter Efficiency
| Method | Trainable params (7B model) | Memory overhead |
|---|---|---|
| Full FT | 7B (100%) | ~84GB |
| LoRA (r=8) | ~4M (0.06%) | ~200MB |
| LoRA (r=64) | ~30M (0.4%) | ~1GB |
| QLoRA (r=64) | ~30M | ~6GB (4-bit base) |
| Prefix Tuning | ~1M | ~100MB |
| Prompt Tuning | ~100K | ~10MB |
| IA³ | ~300K | ~30MB |
6.2 Performance Comparison
Typical performance ordering on downstream tasks:
Full FT > LoRA ≈ QLoRA > Adapters > Prefix > Prompt
This varies with model size and task, however:
- Large models (>10B): Prompt Tuning also works well
- Small models (<1B): LoRA/Adapters recommended
- Tight memory budget: QLoRA is effectively required
6.3 Selection Guide
def recommend_peft_method(
    model_size_b: float,   # model size in billions of parameters
    gpu_memory_gb: float,  # available GPU memory in GB
    task_type: str,        # "classification", "generation", "qa"
    num_examples: int      # number of training examples
) -> str:
    """Recommend a PEFT method."""
    # Memory-based decision
    if gpu_memory_gb < model_size_b * 2:
        # FP16 weights alone would not fit: 4-bit quantization needed
        return "QLoRA"
    # Data-size-based decision
    if num_examples < 1000:
        # Little data: prompt tuning for large models
        if model_size_b > 10:
            return "Prompt Tuning"
        else:
            return "LoRA (small rank)"
    # General case
    if task_type == "classification":
        return "LoRA or Adapters"
    elif task_type == "generation":
        return "LoRA (target all projections)"
    else:
        return "LoRA"
Key Takeaways
Core PEFT concepts
1. LoRA: low-rank weight update W₀ + BA
2. QLoRA: 4-bit quantization + LoRA
3. DoRA: separates magnitude and direction
4. Adapters: inserts bottleneck modules
5. Prefix: learnable key/value prefixes
6. Prompt: soft prompt embeddings
7. IA³: trains only scaling vectors
Practical pointers
- Not enough GPU memory → use QLoRA
- Inference speed matters → merge_and_unload()
- Multiple tasks → save/load one adapter per task (see the sketch below)
- Large model + little data → Prompt Tuning
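For the multi-task point, the peft library can host several adapters on a single base model and switch between them; a sketch, assuming the adapter directories already exist (load_adapter/set_adapter are PeftModel methods):
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # illustrative
model = PeftModel.from_pretrained(base, "./adapters/task-a", adapter_name="task-a")
model.load_adapter("./adapters/task-b", adapter_name="task-b")

model.set_adapter("task-a")   # route forward passes through task-a's weights
# ... run task A ...
model.set_adapter("task-b")   # switch tasks without reloading the base model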
References
- Hu et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models"
- Dettmers et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs"
- Liu et al. (2024). "DoRA: Weight-Decomposed Low-Rank Adaptation"
- Houlsby et al. (2019). "Parameter-Efficient Transfer Learning for NLP"
- HuggingFace PEFT