16. Advanced Vision-Language

Overview

Vision-Language Models (VLMs) understand images and text jointly. This lesson covers recent VLM architectures such as LLaVA and Qwen-VL, along with Visual Instruction Tuning.


1. The VLM Paradigm

1.1 Evolution

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     VLM Evolution                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                      β”‚
β”‚  2021: CLIP                                          β”‚
β”‚  - Image-text contrastive learning                   β”‚
β”‚  - Zero-shot classification                          β”‚
β”‚                                                      β”‚
β”‚  2022: Flamingo                                      β”‚
β”‚  - Injects visual tokens into an LLM                 β”‚
β”‚  - Few-shot vision-language learning                 β”‚
β”‚                                                      β”‚
β”‚  2023: LLaVA                                         β”‚
β”‚  - Visual Instruction Tuning                         β”‚
β”‚  - Open-source alternative to GPT-4V                 β”‚
β”‚                                                      β”‚
β”‚  2024: LLaVA-NeXT, Qwen-VL, Phi-3-Vision             β”‚
β”‚  - High resolution, multi-image, video               β”‚
β”‚  - Commercial-grade performance                      β”‚
β”‚                                                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
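The contrastive idea behind CLIP can be sketched with toy embeddings: score one image embedding against a text embedding per class prompt and pick the best match. The tensors below are illustrative stand-ins, not the real CLIP model:

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for CLIP's two towers: in real CLIP these are L2-normalized
# embeddings of an image and of prompts like "a photo of a {class}".
image_embed = F.normalize(torch.tensor([[0.9, 0.1, 0.2]]), dim=-1)   # (1, d)
text_embeds = F.normalize(torch.tensor([
    [0.8, 0.2, 0.1],   # "a photo of a cat"
    [0.1, 0.9, 0.3],   # "a photo of a dog"
]), dim=-1)                                                          # (C, d)

# Cosine similarity scaled by a learned temperature, softmax over classes
logit_scale = 100.0  # CLIP's learned scale is around this value after training
logits = logit_scale * image_embed @ text_embeds.T                   # (1, C)
probs = logits.softmax(dim=-1)

pred = probs.argmax(dim=-1).item()  # index of the best-matching class prompt
```

This is exactly how CLIP does zero-shot classification: no task-specific training, just a similarity search over class prompts.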

1.2 μ•„ν‚€ν…μ²˜ 비ꡐ

λͺ¨λΈ Vision Encoder LLM μ—°κ²° 방식
LLaVA CLIP ViT-L Vicuna/LLaMA Linear Projection
Qwen-VL ViT-G Qwen Cross-Attention
InternVL InternViT InternLM MLP
Phi-3-Vision CLIP ViT Phi-3 Linear
GPT-4V Unknown GPT-4 Unknown
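The connection modules in the table above differ mainly in capacity: LLaVA-1.0 used a single linear layer per patch token, while LLaVA-1.5 and InternVL use a small MLP (Qwen-VL instead uses a cross-attention resampler that compresses patches into a fixed set of queries). A sketch of the two projector styles, with illustrative hidden sizes:

```python
import torch
import torch.nn as nn

VISION_DIM, LLM_DIM = 1024, 4096  # e.g. CLIP ViT-L width -> 7B-LLM width

# Linear projector (LLaVA-1.0 style): one matrix applied to every patch token
linear_proj = nn.Linear(VISION_DIM, LLM_DIM)

# Two-layer MLP projector (LLaVA-1.5 / InternVL style): slightly more capacity
mlp_proj = nn.Sequential(
    nn.Linear(VISION_DIM, LLM_DIM),
    nn.GELU(),
    nn.Linear(LLM_DIM, LLM_DIM),
)

patches = torch.randn(2, 256, VISION_DIM)   # (batch, patch tokens, vision dim)
linear_out = linear_proj(patches)
mlp_out = mlp_proj(patches)
```

Both map every patch token into the LLM's embedding space; neither changes the token count, which is why connector choice matters for sequence length.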

2. LLaVA (Large Language and Vision Assistant)

2.1 μ•„ν‚€ν…μ²˜

LLaVA ꡬ쑰:

이미지 β†’ CLIP ViT-L/14 β†’ Visual Features (576 tokens)
                ↓
         Linear Projection
                ↓
         Visual Tokens
                ↓
[System] [Visual Tokens] [User Query] β†’ LLaMA/Vicuna β†’ Response

Training stages:
1. Pre-training: image-text feature alignment (CC3M-based captions)
2. Fine-tuning: Visual Instruction Tuning (158K samples)
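The two stages differ only in which parameters train: stage 1 updates just the vision-language projection, stage 2 unfreezes the LLM as well (the vision encoder stays frozen throughout). A minimal sketch with tiny stand-in modules; the component names mirror the model class below but the layers are illustrative:

```python
import torch.nn as nn

# Tiny stand-ins for the three components of a LLaVA-style model
model = nn.ModuleDict({
    "vision_encoder": nn.Linear(8, 8),
    "projection":     nn.Linear(8, 8),
    "llm":            nn.Linear(8, 8),
})

def set_stage(model: nn.ModuleDict, stage: int) -> None:
    """Stage 1: train only the projection. Stage 2: projection + LLM."""
    trainable = {"projection"} if stage == 1 else {"projection", "llm"}
    for name, module in model.items():
        for p in module.parameters():
            p.requires_grad = name in trainable

set_stage(model, 1)
stage1 = [n for n, m in model.items()
          if all(p.requires_grad for p in m.parameters())]
set_stage(model, 2)
stage2 = [n for n, m in model.items()
          if all(p.requires_grad for p in m.parameters())]
```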

2.2 Implementation

import torch
import torch.nn as nn
from transformers import CLIPVisionModel, LlamaForCausalLM, LlamaTokenizer

class LLaVAModel(nn.Module):
    """LLaVA-style Vision-Language Model"""

    def __init__(
        self,
        vision_encoder: str = "openai/clip-vit-large-patch14",
        llm: str = "lmsys/vicuna-7b-v1.5",
        freeze_vision: bool = True,
        freeze_llm: bool = False
    ):
        super().__init__()

        # Vision Encoder
        self.vision_encoder = CLIPVisionModel.from_pretrained(vision_encoder)
        self.vision_hidden_size = self.vision_encoder.config.hidden_size

        # Language Model
        self.llm = LlamaForCausalLM.from_pretrained(llm)
        self.llm_hidden_size = self.llm.config.hidden_size

        # Vision-Language Projection
        self.vision_projection = nn.Linear(
            self.vision_hidden_size,
            self.llm_hidden_size
        )

        # Freeze encoders
        if freeze_vision:
            for param in self.vision_encoder.parameters():
                param.requires_grad = False

        if freeze_llm:
            for param in self.llm.parameters():
                param.requires_grad = False

    def encode_images(self, images: torch.Tensor) -> torch.Tensor:
        """
        Encode images into LLM-space visual tokens.

        Args:
            images: (B, C, H, W)

        Returns:
            visual_tokens: (B, num_patches, llm_hidden_size)
        """
        # CLIP encoding (ViT-L/14 at 224 px: 256 patches + [CLS])
        vision_outputs = self.vision_encoder(images)
        image_features = vision_outputs.last_hidden_state  # (B, 257, 1024)

        # Drop the [CLS] token, keeping only the patch features
        image_features = image_features[:, 1:, :]  # (B, 256, 1024)

        # Project to LLM space
        visual_tokens = self.vision_projection(image_features)

        return visual_tokens

    def forward(
        self,
        input_ids: torch.Tensor,
        attention_mask: torch.Tensor,
        images: torch.Tensor = None,
        image_positions: torch.Tensor = None,
        labels: torch.Tensor = None
    ):
        """
        Forward pass.

        Args:
            input_ids: (B, seq_len) text token ids
            attention_mask: (B, seq_len)
            images: (B, C, H, W) input images
            image_positions: where image tokens would be spliced in
                (unused in this simplified version)
            labels: (B, seq_len) target ids for training
        """
        B, seq_len = input_ids.shape

        # Text embeddings
        text_embeds = self.llm.model.embed_tokens(input_ids)

        # Image embeddings
        if images is not None:
            visual_tokens = self.encode_images(images)  # (B, num_patches, hidden)

            # Simplification: prepend all visual tokens to the text.
            # (The real LLaVA splices them in at the <image> placeholder.)
            combined_embeds = torch.cat([visual_tokens, text_embeds], dim=1)

            # Extend the attention mask to cover the visual tokens
            visual_mask = torch.ones(B, visual_tokens.shape[1], device=attention_mask.device)
            combined_mask = torch.cat([visual_mask, attention_mask], dim=1)

            # Pad labels with -100 so no loss is computed on visual positions
            if labels is not None:
                ignore = torch.full(
                    (B, visual_tokens.shape[1]), -100,
                    dtype=labels.dtype, device=labels.device
                )
                labels = torch.cat([ignore, labels], dim=1)
        else:
            combined_embeds = text_embeds
            combined_mask = attention_mask

        # LLM forward
        outputs = self.llm(
            inputs_embeds=combined_embeds,
            attention_mask=combined_mask,
            labels=labels,
            return_dict=True
        )

        return outputs


class VisualInstructionDataset:
    """Visual Instruction Tuning dataset"""

    INSTRUCTION_TEMPLATES = [
        "Describe this image in detail.",
        "What can you see in this image?",
        "Explain what is happening in this picture.",
        "<question>",  # VQA
    ]

    def __init__(self, data_path: str):
        """
        Expected data format:
        {
            "image": "path/to/image.jpg",
            "conversations": [
                {"from": "human", "value": "<image>\nDescribe this image."},
                {"from": "gpt", "value": "This image shows..."}
            ]
        }
        """
        import json
        with open(data_path, 'r') as f:
            self.data = json.load(f)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]

        # Load the image
        from PIL import Image
        image = Image.open(item['image']).convert('RGB')

        # Build the instruction/response pair
        conversations = item['conversations']
        human_input = conversations[0]['value']
        assistant_output = conversations[1]['value']

        return {
            'image': image,
            'human': human_input,
            'assistant': assistant_output
        }

2.3 LLaVA-NeXT Improvements

class LLaVANeXTConfig:
    """
    LLaVA-NeXT improvements:

    1. High-resolution support (AnyRes)
    2. Stronger vision encoders (SigLIP)
    3. Larger LLMs (Llama 3, Qwen 2)
    """

    # AnyRes: candidate (height, width) resolutions built from 336-px tiles
    SUPPORTED_RESOLUTIONS = [
        (336, 336),
        (672, 336),
        (336, 672),
        (672, 672),
        (1008, 336),
        (336, 1008),
    ]

    @staticmethod
    def select_best_resolution(image_size: tuple, resolutions: list):
        """Pick the candidate resolution whose aspect ratio best matches the image"""
        img_h, img_w = image_size
        img_ratio = img_w / img_h

        best_res = None
        best_ratio_diff = float('inf')

        for res in resolutions:
            res_ratio = res[1] / res[0]
            ratio_diff = abs(img_ratio - res_ratio)

            if ratio_diff < best_ratio_diff:
                best_ratio_diff = ratio_diff
                best_res = res

        return best_res
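As a standalone check of the selection rule: a 1920Γ—1080 landscape image (aspect ratio about 1.78) is closest to the one-row, two-column grid. Re-stating the same ratio comparison inline:

```python
# Candidate (height, width) grids, as in SUPPORTED_RESOLUTIONS above
resolutions = [(336, 336), (672, 336), (336, 672),
               (672, 672), (1008, 336), (336, 1008)]

img_h, img_w = 1080, 1920
img_ratio = img_w / img_h  # ~1.78

# Same rule as select_best_resolution: minimize the aspect-ratio difference
best = min(resolutions, key=lambda res: abs(img_ratio - res[1] / res[0]))
# best == (336, 672): one tile row, two tile columns
```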


def anyres_processing(image, base_resolution=336):
    """
    AnyRes image processing.

    Splits a high-resolution image into base-resolution tiles
    (local detail) plus a downscaled copy of the whole image
    (global context).
    """
    from PIL import Image

    # 1. Resize the full image for global context
    global_image = image.resize((base_resolution, base_resolution))

    # 2. Split into tiles for local detail
    W, H = image.size
    num_tiles_w = (W + base_resolution - 1) // base_resolution
    num_tiles_h = (H + base_resolution - 1) // base_resolution

    tiles = []
    for i in range(num_tiles_h):
        for j in range(num_tiles_w):
            left = j * base_resolution
            top = i * base_resolution
            right = min(left + base_resolution, W)
            bottom = min(top + base_resolution, H)

            tile = image.crop((left, top, right, bottom))
            # Pad edge tiles up to the base resolution
            padded_tile = Image.new('RGB', (base_resolution, base_resolution))
            padded_tile.paste(tile, (0, 0))
            tiles.append(padded_tile)

    # [global_image] + [tile1, tile2, ...]
    all_images = [global_image] + tiles

    return all_images
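The number of views this produces is just ceiling division per axis, plus the one global view. For example, a 1000Γ—800 image with 336-px tiles:

```python
def num_anyres_views(width: int, height: int, base: int = 336) -> int:
    """Number of images anyres_processing returns: tiles + 1 global view."""
    tiles_w = (width + base - 1) // base    # ceil(width / base)
    tiles_h = (height + base - 1) // base   # ceil(height / base)
    return tiles_w * tiles_h + 1

views = num_anyres_views(1000, 800)  # 3 x 3 tiles + 1 global = 10
```

This count matters in practice: each extra view costs a full pass through the vision encoder and adds hundreds of visual tokens to the LLM's context.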

3. Qwen-VL

3.1 μ•„ν‚€ν…μ²˜

Qwen-VL highlights:

1. Vision encoder: ViT-bigG (1.9B params)
2. High resolution: 448Γ—448 (variable)
3. Grounding: can output bounding boxes
4. Strong OCR: good at reading text in images

Input format:
<img>image_path</img> User question
<ref>object name</ref><box>(x1,y1),(x2,y2)</box>
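Grounded responses in this format can be parsed back into structured boxes with a small regex. The coordinates in Qwen-VL outputs are, to my understanding, normalized to a 0-1000 grid over the image:

```python
import re
from typing import List, Tuple

BOX_RE = re.compile(
    r"<ref>(?P<name>.+?)</ref>"
    r"<box>\((?P<x1>\d+),(?P<y1>\d+)\),\((?P<x2>\d+),(?P<y2>\d+)\)</box>"
)

def parse_grounding(text: str) -> List[Tuple[str, Tuple[int, int, int, int]]]:
    """Extract (object, (x1, y1, x2, y2)) pairs from a grounded response."""
    return [(m["name"], (int(m["x1"]), int(m["y1"]), int(m["x2"]), int(m["y2"])))
            for m in BOX_RE.finditer(text)]

boxes = parse_grounding("<ref>cat</ref><box>(100,200),(300,400)</box>")
# boxes == [("cat", (100, 200, 300, 400))]
```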

3.2 Usage Example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def use_qwen_vl():
    """Run Qwen-VL-Chat: VQA, grounding, and multi-image queries"""

    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen-VL-Chat",
        trust_remote_code=True,
        torch_dtype=torch.float16
    ).to("cuda")

    tokenizer = AutoTokenizer.from_pretrained(
        "Qwen/Qwen-VL-Chat",
        trust_remote_code=True
    )

    # Basic VQA
    query = tokenizer.from_list_format([
        {'image': 'path/to/image.jpg'},
        {'text': 'What is in this image?'},
    ])

    response, history = model.chat(tokenizer, query=query, history=None)
    print(response)

    # Grounding (locating objects)
    query = tokenizer.from_list_format([
        {'image': 'path/to/image.jpg'},
        {'text': 'Find all the cats in this image and output their bounding boxes.'},
    ])

    response, history = model.chat(tokenizer, query=query, history=None)
    # Example output: <ref>cat</ref><box>(100,200),(300,400)</box>

    # Multiple images
    query = tokenizer.from_list_format([
        {'image': 'image1.jpg'},
        {'image': 'image2.jpg'},
        {'text': 'What is the difference between these two images?'},
    ])

    response, history = model.chat(tokenizer, query=query, history=None)

    return response

4. Visual Instruction Tuning

4.1 Data Generation

class VisualInstructionGenerator:
    """Generate visual-instruction training data with a teacher model"""

    def __init__(self, teacher_model="gpt-4-vision-preview"):
        from openai import OpenAI
        self.client = OpenAI()
        self.teacher_model = teacher_model

    def generate_conversation(
        self,
        image_path: str,
        task_type: str = "detailed_description"
    ):
        """Generate one training example with GPT-4V"""
        import base64

        # Base64-encode the image
        with open(image_path, "rb") as f:
            image_data = base64.b64encode(f.read()).decode()

        task_prompts = {
            "detailed_description": "Describe this image in detail.",
            "reasoning": "What conclusions can you draw from this image? Explain your reasoning.",
            "conversation": "Generate a multi-turn conversation about this image.",
            "creative": "Write a creative story inspired by this image."
        }

        prompt = task_prompts.get(task_type, task_prompts["detailed_description"])

        response = self.client.chat.completions.create(
            model=self.teacher_model,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}}
                    ]
                }
            ],
            max_tokens=1024
        )

        return {
            "image": image_path,
            "task": task_type,
            "question": prompt,
            "answer": response.choices[0].message.content
        }

    def generate_dataset(
        self,
        image_paths: list,
        output_path: str,
        tasks: list = None
    ):
        """Generate a dataset over many images"""
        import json
        from tqdm import tqdm

        if tasks is None:
            tasks = ["detailed_description", "reasoning", "conversation"]

        dataset = []

        for image_path in tqdm(image_paths):
            for task in tasks:
                try:
                    data = self.generate_conversation(image_path, task)
                    dataset.append(data)
                except Exception as e:
                    print(f"Error processing {image_path}: {e}")

        with open(output_path, 'w') as f:
            json.dump(dataset, f, indent=2)

        return dataset

4.2 Training Strategy

from transformers import Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

def finetune_vlm():
    """VLM Fine-tuning"""

    # λͺ¨λΈ λ‘œλ“œ
    model = LLaVAModel(
        freeze_vision=True,  # Vision encoder κ³ μ •
        freeze_llm=False     # LLM fine-tune
    )

    # Apply LoRA for parameter-efficient training
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        lora_dropout=0.05,
    )

    model.llm = get_peft_model(model.llm, lora_config)

    # Training configuration
    training_args = TrainingArguments(
        output_dir="./llava-finetuned",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-5,
        warmup_ratio=0.03,
        lr_scheduler_type="cosine",
        bf16=True,
        logging_steps=10,
        save_steps=500,
        dataloader_num_workers=4,
    )

    # Trainer (train_dataset: a VisualInstructionDataset prepared beforehand)
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        data_collator=vlm_data_collator,
    )

    trainer.train()


def vlm_data_collator(features):
    """Collate VLM features into batched tensors (assumes equal-length sequences)"""
    batch = {
        'input_ids': torch.stack([f['input_ids'] for f in features]),
        'attention_mask': torch.stack([f['attention_mask'] for f in features]),
        'images': torch.stack([f['image'] for f in features]),
        'labels': torch.stack([f['labels'] for f in features]),
    }
    return batch
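The collator above assumes every example has already been padded to the same length; with variable-length conversations the usual approach is to pad to the batch maximum. A sketch using pad_sequence, where the pad id of 0 and the image size are illustrative (-100 is the label value HF models ignore in the loss):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def padded_vlm_collator(features, pad_token_id: int = 0):
    """Pad variable-length text fields; stack fixed-size image tensors."""
    input_ids = pad_sequence([f["input_ids"] for f in features],
                             batch_first=True, padding_value=pad_token_id)
    attention_mask = pad_sequence([f["attention_mask"] for f in features],
                                  batch_first=True, padding_value=0)
    labels = pad_sequence([f["labels"] for f in features],
                          batch_first=True, padding_value=-100)
    images = torch.stack([f["image"] for f in features])
    return {"input_ids": input_ids, "attention_mask": attention_mask,
            "images": images, "labels": labels}

batch = padded_vlm_collator([
    {"input_ids": torch.tensor([1, 2, 3]),
     "attention_mask": torch.ones(3, dtype=torch.long),
     "labels": torch.tensor([1, 2, 3]),
     "image": torch.zeros(3, 336, 336)},
    {"input_ids": torch.tensor([4, 5]),
     "attention_mask": torch.ones(2, dtype=torch.long),
     "labels": torch.tensor([4, 5]),
     "image": torch.zeros(3, 336, 336)},
])
```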

5. Evaluation Benchmarks

5.1 Major Benchmarks

Common VLM evaluation benchmarks:

1. VQA-v2: general visual QA
2. GQA: compositional reasoning QA
3. TextVQA: reading text inside images
4. POPE: hallucination probing
5. MME: 14 subtasks, combined score
6. MMBench: 20 ability dimensions
7. SEED-Bench: 19K multiple-choice questions
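VQA-v2 in the list above does not use exact-match accuracy: each prediction is scored against 10 human answers, and an answer agreed on by at least 3 annotators counts as fully correct. The sketch below is the commonly used simplified form of the metric (the official version also averages over annotator subsets and applies answer normalization):

```python
def vqa_soft_accuracy(prediction: str, human_answers: list) -> float:
    """min(#humans who gave this answer / 3, 1): simplified VQA-v2 accuracy."""
    matches = sum(a.strip().lower() == prediction.strip().lower()
                  for a in human_answers)
    return min(matches / 3.0, 1.0)

score_full = vqa_soft_accuracy("dog", ["dog"] * 4 + ["cat"] * 6)   # 1.0
score_part = vqa_soft_accuracy("dog", ["dog"] * 2 + ["cat"] * 8)   # 2/3
```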

5.2 Evaluation Code

def evaluate_vlm(model, dataset_name: str = "vqav2"):
    """Dispatch to a benchmark-specific evaluator"""

    if dataset_name == "vqav2":
        return evaluate_vqa_v2(model)
    elif dataset_name == "textvqa":
        return evaluate_textvqa(model)
    elif dataset_name == "pope":
        return evaluate_pope(model)
    else:
        raise ValueError(f"Unknown benchmark: {dataset_name}")


def evaluate_pope(model):
    """
    POPE: Polling-based Object Probing Evaluation

    Hallucination probe: asks "Is there a [object] in the image?"
    """
    from datasets import load_dataset

    dataset = load_dataset("lmms-lab/POPE")

    correct = 0
    total = 0

    for item in dataset['test']:
        image = item['image']
        question = item['question']  # "Is there a dog in the image?"
        answer = item['answer']      # "yes" or "no"

        # λͺ¨λΈ 예츑
        prediction = model.generate(image, question)
        pred_answer = "yes" if "yes" in prediction.lower() else "no"

        if pred_answer == answer:
            correct += 1
        total += 1

    accuracy = correct / total
    print(f"POPE Accuracy: {accuracy:.4f}")

    return accuracy
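POPE results are usually reported with precision, recall, F1, and the yes-ratio alongside accuracy, because a model that answers "yes" to everything still scores around 50% accuracy on the balanced split. A sketch over yes/no predictions:

```python
def pope_metrics(predictions: list, answers: list) -> dict:
    """Binary metrics for POPE, treating 'yes' as the positive class."""
    pairs = list(zip(predictions, answers))
    tp = sum(p == "yes" and a == "yes" for p, a in pairs)
    fp = sum(p == "yes" and a == "no" for p, a in pairs)
    fn = sum(p == "no" and a == "yes" for p, a in pairs)
    tn = sum(p == "no" and a == "no" for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(answers),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(answers),  # high value = "yes" bias
    }

m = pope_metrics(["yes", "yes", "no", "no"], ["yes", "no", "no", "yes"])
```

A low precision with a high yes-ratio is the classic signature of object hallucination.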

6. μ‹€μ „ μ‘μš©

6.1 Document Understanding

def document_understanding():
    """Document understanding with a VLM"""

    model = load_qwen_vl()  # assumed helper wrapping the Qwen-VL setup above

    # Analyze a rendered PDF page
    def analyze_document_page(image_path: str, questions: list):
        results = []

        for question in questions:
            query = f"<img>{image_path}</img>{question}"
            answer = model.generate(query)
            results.append({
                'question': question,
                'answer': answer
            })

        return results

    # Example questions
    questions = [
        "What is the title of this document?",
        "Summarize the main points.",
        "Extract all dates mentioned.",
        "What tables are present? Describe their contents.",
    ]

    results = analyze_document_page("document_page.png", questions)


def chart_understanding():
    """Chart/graph understanding (model and chart_image assumed loaded)"""

    prompts = [
        "What type of chart is this?",
        "What is the trend shown in this chart?",
        "What are the maximum and minimum values?",
        "Describe the relationship between X and Y.",
    ]

    # Analyze the chart with the VLM
    for prompt in prompts:
        response = model.generate(chart_image, prompt)
        print(f"Q: {prompt}")
        print(f"A: {response}\n")

References

Papers

  • Liu et al. (2023). "Visual Instruction Tuning" (LLaVA)
  • Liu et al. (2024). "LLaVA-NeXT: Improved reasoning, OCR, and world knowledge"
  • Bai et al. (2023). "Qwen-VL: A Versatile Vision-Language Model"

λͺ¨λΈ

κ΄€λ ¨ 레슨

to navigate between lessons