14. Unified Vision Models

Overview

Unified vision models are a paradigm in which a single model handles many different vision tasks (classification, detection, segmentation, and so on). Instead of one model per task, the goal is a general-purpose vision model.


1. Paradigm Shift

1.1 Traditional vs. Unified Approaches

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     Vision Model Paradigms                      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  Traditional (Task-Specific):                                   β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”‚
β”‚  β”‚ ResNet       β”‚  β”‚ Faster R-CNN β”‚  β”‚ DeepLab        β”‚            β”‚
β”‚  β”‚ (classify)   β”‚  β”‚ (detect)     β”‚  β”‚ (segmentation) β”‚            β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚
β”‚                                                                 β”‚
β”‚  Unified (Task-Agnostic):                                       β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”‚
β”‚  β”‚              Unified Vision Model             β”‚            β”‚
β”‚  β”‚  "Classify this"       β†’ class label          β”‚            β”‚
β”‚  β”‚  "Find the objects"    β†’ bounding boxes       β”‚            β”‚
β”‚  β”‚  "Segment this"        β†’ masks                β”‚            β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚
β”‚                                                                 β”‚
β”‚  Advantages: knowledge sharing, easier maintenance,             β”‚
β”‚              zero-shot transfer                                 β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

1.2 Comparison of Major Models

λͺ¨λΈ 개발 νŠΉμ§• 지원 νƒœμŠ€ν¬
Florence Microsoft λŒ€κ·œλͺ¨ Image-Text λΆ„λ₯˜, κ²€μΆœ, 캑셔닝, VQA
PaLI Google λ‹€κ΅­μ–΄ VLM 캑셔닝, VQA, OCR
Unified-IO Allen AI λͺ¨λ“  λͺ¨λ‹¬λ¦¬ν‹° 이미지, μ˜€λ””μ˜€, ν…μŠ€νŠΈ
OFA Alibaba Seq2Seq 톡합 λ‹€μ–‘ν•œ λΉ„μ „-μ–Έμ–΄
GPT-4V OpenAI μƒμš© λ©€ν‹°λͺ¨λ‹¬ λ²”μš© λΉ„μ „ 이해

2. Florence: Foundation Model for Vision

2.1 Architecture

Florence μ•„ν‚€ν…μ²˜:

이미지 인코더: CoSwin Transformer (Hierarchical)
ν…μŠ€νŠΈ 인코더: UniCL (Unified Contrastive Learning)

ν•™μŠ΅:
1. Image-Text Contrastive (CLIP μŠ€νƒ€μΌ)
2. Image-Text Matching
3. Masked Language Modeling

νŠΉμ§•:
- 9μ–΅ Image-Text 쌍으둜 ν•™μŠ΅
- λ‹€μ–‘ν•œ granularity (이미지 β†’ μ˜μ—­ β†’ ν”½μ…€)
- Dynamic Head둜 νƒœμŠ€ν¬ 적응

2.2 Implementation Example

import torch
import torch.nn as nn
from transformers import CLIPProcessor, CLIPModel

class FlorenceStyleModel(nn.Module):
    """
    Florence μŠ€νƒ€μΌ 톡합 λΉ„μ „ λͺ¨λΈ (κ°„μ†Œν™”)

    핡심: CLIP λ°±λ³Έ + Task-specific Heads
    """

    def __init__(
        self,
        clip_model_name: str = "openai/clip-vit-large-patch14",
        num_classes: int = 1000,
        num_detection_classes: int = 80
    ):
        super().__init__()

        # CLIP backbone (image + text encoders)
        self.clip = CLIPModel.from_pretrained(clip_model_name)
        self.processor = CLIPProcessor.from_pretrained(clip_model_name)

        hidden_size = self.clip.config.vision_config.hidden_size

        # Task Heads
        self.classification_head = nn.Linear(hidden_size, num_classes)
        self.detection_head = DetectionHead(hidden_size, num_detection_classes)
        self.segmentation_head = SegmentationHead(hidden_size)
        self.caption_head = CaptionHead(hidden_size, self.clip.config.text_config)

    def forward(
        self,
        images: torch.Tensor,
        task: str = "classification",
        text_prompts: list = None
    ):
        """
        Args:
            images: (B, 3, H, W)
            task: "classification", "detection", "segmentation", "caption"
            text_prompts: ν…μŠ€νŠΈ ν”„λ‘¬ν”„νŠΈ (zero-shot용)
        """
        # Image features
        vision_outputs = self.clip.vision_model(images)
        image_features = vision_outputs.last_hidden_state  # (B, num_patches+1, hidden)
        pooled_features = vision_outputs.pooler_output  # (B, hidden)

        if task == "classification":
            if text_prompts:
                # Zero-shot classification (CLIP-style)
                return self._zero_shot_classify(pooled_features, text_prompts)
            else:
                return self.classification_head(pooled_features)

        elif task == "detection":
            return self.detection_head(image_features)

        elif task == "segmentation":
            return self.segmentation_head(image_features)

        elif task == "caption":
            return self.caption_head(pooled_features)

    def _zero_shot_classify(
        self,
        image_features: torch.Tensor,
        text_prompts: list
    ) -> torch.Tensor:
        """Zero-shot classification with text prompts"""
        # Text encoding
        text_inputs = self.processor(
            text=text_prompts,
            return_tensors="pt",
            padding=True
        ).to(image_features.device)

        text_features = self.clip.get_text_features(**text_inputs)

        # Project pooled vision features into CLIP's shared embedding space
        # (get_text_features already applies the text-side projection)
        image_features = self.clip.visual_projection(image_features)

        # Normalize
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)

        # Similarity
        similarity = image_features @ text_features.T
        return similarity


class DetectionHead(nn.Module):
    """Object Detection Head (DETR μŠ€νƒ€μΌ)"""

    def __init__(self, hidden_size: int, num_classes: int, num_queries: int = 100):
        super().__init__()
        self.num_queries = num_queries

        # Object queries
        self.query_embed = nn.Embedding(num_queries, hidden_size)

        # Transformer decoder
        decoder_layer = nn.TransformerDecoderLayer(hidden_size, 8, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

        # Prediction heads
        self.class_head = nn.Linear(hidden_size, num_classes + 1)  # +1 for no-object
        self.bbox_head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 4)  # (cx, cy, w, h)
        )

    def forward(self, image_features: torch.Tensor):
        B = image_features.size(0)

        # Query embedding
        queries = self.query_embed.weight.unsqueeze(0).repeat(B, 1, 1)

        # Decoder
        hs = self.decoder(queries, image_features)

        # Predictions
        class_logits = self.class_head(hs)
        bbox_pred = self.bbox_head(hs).sigmoid()

        return {
            'class_logits': class_logits,
            'bbox_pred': bbox_pred
        }


class SegmentationHead(nn.Module):
    """Semantic Segmentation Head"""

    def __init__(self, hidden_size: int, num_classes: int = 150):
        super().__init__()

        # FPN-style decoder
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hidden_size, 256, 4, stride=2, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, num_classes, 1)
        )

    def forward(self, image_features: torch.Tensor):
        # Reshape patches to spatial
        B, N, C = image_features.shape
        H = W = int((N - 1) ** 0.5)  # -1 for CLS token
        features = image_features[:, 1:, :].transpose(1, 2).view(B, C, H, W)

        return self.decoder(features)


class CaptionHead(nn.Module):
    """Image Captioning Head"""

    def __init__(self, hidden_size: int, text_config):
        super().__init__()
        self.vocab_size = text_config.vocab_size

        # Embedding for target caption tokens (needed for teacher forcing)
        self.token_embed = nn.Embedding(self.vocab_size, hidden_size)

        # Cross-attention decoder
        decoder_layer = nn.TransformerDecoderLayer(hidden_size, 8, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

        self.lm_head = nn.Linear(hidden_size, self.vocab_size)

    def forward(
        self,
        image_features: torch.Tensor,
        target_ids: torch.Tensor = None
    ):
        # Generation would be autoregressive; training uses teacher forcing.
        # Minimal teacher-forcing sketch:
        if target_ids is None:
            raise NotImplementedError("autoregressive generation omitted in this sketch")

        memory = image_features.unsqueeze(1)   # (B, 1, hidden) used as cross-attention memory
        tgt = self.token_embed(target_ids)     # (B, T, hidden)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        hs = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(hs)                # (B, T, vocab_size)

3. PaLI (Pathways Language and Image model)

3.1 Architecture

PaLI structure:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                          PaLI                          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                        β”‚
β”‚  Image Encoder: ViT-e (4B parameters)                  β”‚
β”‚       ↓                                                β”‚
β”‚  Visual Tokens: [IMG1] [IMG2] ... [IMGn]               β”‚
β”‚       ↓                                                β”‚
β”‚  Text Encoder-Decoder: mT5 (multilingual)              β”‚
β”‚       ↓                                                β”‚
β”‚  Output: text (multilingual)                           β”‚
β”‚                                                        β”‚
β”‚  Input format:                                         β”‚
β”‚  "<image> 이 이미지λ₯Ό μ„€λͺ…ν•΄μ£Όμ„Έμš”" β†’ "고양이가..."  (ko)  β”‚
β”‚  "<image> What is in the image?" β†’ "A cat..."  (en)    β”‚
β”‚                                                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
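
The key interface in the diagram is that ViT patch features become "visual tokens" that are simply prepended to the text token embeddings before the mT5 encoder. A minimal sketch of that step is below; the dimensions and module names are illustrative assumptions, not PaLI's actual code.

import torch
import torch.nn as nn

class VisualTokenPrefix(nn.Module):
    """Project ViT patch features into the text model's embedding space and prepend them."""

    def __init__(self, vit_dim: int = 1024, text_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(vit_dim, text_dim)  # visual tokens -> text embedding space

    def forward(self, patch_features: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # patch_features: (B, N_img, vit_dim), text_embeds: (B, N_txt, text_dim)
        visual_tokens = self.proj(patch_features)
        # Encoder input = [IMG1 ... IMGn, text tokens], as shown in the diagram
        return torch.cat([visual_tokens, text_embeds], dim=1)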

3.2 Task Unification

class PaLITaskFormats:
    """PaLI νƒœμŠ€ν¬λ³„ μž…λ ₯ ν˜•μ‹"""

    TASK_FORMATS = {
        # Classification
        "classification": "What is in this image?",
        "fine_grained": "What species of bird is this?",

        # Captioning
        "caption_en": "Generate a caption for this image.",
        "caption_ko": "이 이미지에 λŒ€ν•œ μ„€λͺ…을 μž‘μ„±ν•˜μ„Έμš”.",

        # VQA
        "vqa": "Question: {question} Answer:",

        # OCR
        "ocr": "What text is in this image?",

        # Detection (expressed as text)
        "detection": "Detect all objects in this image.",
        # Output: "cat [100, 200, 300, 400]; dog [50, 60, 150, 200]"

        # Referring segmentation
        "referring": "Segment the {object}.",
    }

    @staticmethod
    def format_input(task: str, **kwargs) -> str:
        template = PaLITaskFormats.TASK_FORMATS.get(task, "")
        return template.format(**kwargs)


# Usage example
def process_with_pali(model, image, task, **kwargs):
    """PaLI-style processing (model.prepare_inputs / model.generate are assumed interfaces)"""

    # Task-specific prompt
    prompt = PaLITaskFormats.format_input(task, **kwargs)

    # Visual tokens + Text tokens
    inputs = model.prepare_inputs(image, prompt)

    # Generate
    outputs = model.generate(**inputs)

    # Parse output based on task
    if task == "detection":
        # parse_detection_output: see the sketch after this function
        return parse_detection_output(outputs)
    return outputs
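
parse_detection_output is referenced above but never defined. A minimal sketch, assuming the textual output format shown in the PaLITaskFormats comment ("cat [100, 200, 300, 400]; dog [50, 60, 150, 200]"), could look like this:

import re

def parse_detection_output(text: str):
    """Parse 'label [x1, y1, x2, y2]; ...' strings into (label, box) pairs."""
    detections = []
    for chunk in text.split(";"):
        match = re.match(r"\s*([\w\s]+?)\s*\[([\d.,\s]+)\]", chunk)
        if match:
            label = match.group(1).strip()
            box = [float(v) for v in match.group(2).split(",")]
            detections.append((label, box))
    return detections

# parse_detection_output("cat [100, 200, 300, 400]; dog [50, 60, 150, 200]")
# -> [('cat', [100.0, 200.0, 300.0, 400.0]), ('dog', [50.0, 60.0, 150.0, 200.0])]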

4. Unified-IO

4.1 True Unification: All Modalities

Unified-IO: a single model handles all inputs and outputs

Input/output formats:
- Images β†’ VQ-VAE tokens
- Text β†’ subword tokens
- Bounding boxes β†’ coordinate tokens (discretized)
- Masks β†’ VQ-VAE tokens
- Audio β†’ spectrogram VQ-VAE tokens

Everything is converted into token sequences β†’ Seq2Seq Transformer

4.2 Implementation Concept

class UnifiedIOTokenizer:
    """Unified-IO μŠ€νƒ€μΌ 토큰화"""

    def __init__(self, vocab_size: int = 50000, image_vocab_size: int = 16384):
        self.vocab_size = vocab_size
        self.image_vocab_size = image_vocab_size

        # Special tokens
        self.SPECIAL_TOKENS = {
            '<image>': vocab_size,
            '</image>': vocab_size + 1,
            '<box>': vocab_size + 2,
            '</box>': vocab_size + 3,
            '<mask>': vocab_size + 4,
            '</mask>': vocab_size + 5,
            '<audio>': vocab_size + 6,
            '</audio>': vocab_size + 7,
        }

        # Number of bins for coordinate discretization
        self.num_bins = 1000

    def tokenize_image(self, image: torch.Tensor) -> torch.Tensor:
        """VQ-VAE둜 이미지 토큰화"""
        # VQ-VAE μΈμ½”λ”λ‘œ discrete codes μΆ”μΆœ
        # codes shape: (H', W')
        codes = self.vqvae.encode(image)

        # Flatten + offset
        tokens = codes.flatten() + self.vocab_size + len(self.SPECIAL_TOKENS)

        return tokens

    def tokenize_bbox(self, bbox: torch.Tensor) -> torch.Tensor:
        """
        Convert a bounding box into discrete tokens

        bbox: (x1, y1, x2, y2) normalized to [0, 1]
        """
        # Discretize each coordinate into a bin index
        bins = (bbox * self.num_bins).long()

        # Special tokens + bin tokens
        tokens = torch.tensor([
            self.SPECIAL_TOKENS['<box>'],
            bins[0], bins[1], bins[2], bins[3],
            self.SPECIAL_TOKENS['</box>']
        ])

        return tokens

    def decode_bbox(self, tokens: torch.Tensor) -> torch.Tensor:
        """ν† ν°μ—μ„œ λ°”μš΄λ”© λ°•μŠ€ 볡원"""
        # <box> 토큰 μœ„μΉ˜ μ°ΎκΈ°
        # 4개의 숫자 토큰 μΆ”μΆœ
        # μ •κ·œν™” ν•΄μ œ
        pass
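
A quick round trip through the coordinate tokens shows the effect of the discretization: each coordinate is recovered to within 1/num_bins. This is a hypothetical usage sketch of the tokenizer above (the VQ-VAE image path is not exercised).

# Coordinate discretization round trip
tokenizer = UnifiedIOTokenizer()
box = torch.tensor([0.1234, 0.25, 0.75, 0.9])    # normalized (x1, y1, x2, y2)
tokens = tokenizer.tokenize_bbox(box)            # [<box>, 123, 250, 750, 900, </box>]
recovered = tokenizer.decode_bbox(tokens)        # tensor([0.1230, 0.2500, 0.7500, 0.9000])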


class UnifiedIOModel(nn.Module):
    """Unified-IO μŠ€νƒ€μΌ λͺ¨λΈ"""

    def __init__(self, config):
        super().__init__()

        # Unified Embedding
        self.embeddings = nn.ModuleDict({
            'text': nn.Embedding(config.text_vocab_size, config.hidden_size),
            'image': nn.Embedding(config.image_vocab_size, config.hidden_size),
            'coord': nn.Embedding(config.num_bins, config.hidden_size),
        })

        # Encoder-decoder Transformer (TransformerEncoder / TransformerDecoder are assumed to be defined elsewhere)
        self.encoder = TransformerEncoder(config)
        self.decoder = TransformerDecoder(config)

        # Unified LM Head
        self.lm_head = nn.Linear(config.hidden_size, config.total_vocab_size)

    def forward(self, input_tokens, output_tokens=None):
        """
        Seq2Seq forward

        input_tokens: mixed-modality token sequence
        output_tokens: target output tokens
        """
        # Embed tokens according to their type
        embeddings = self._get_embeddings(input_tokens)

        # Encoder
        encoder_output = self.encoder(embeddings)

        # Decoder
        if output_tokens is not None:
            decoder_input = self._get_embeddings(output_tokens)
            decoder_output = self.decoder(decoder_input, encoder_output)
            logits = self.lm_head(decoder_output)
            return logits

        return encoder_output

    def _get_embeddings(self, tokens):
        """토큰 νƒ€μž…μ— 따라 μ μ ˆν•œ μž„λ² λ”© 선택"""
        # 토큰 λ²”μœ„μ— 따라 text/image/coord ꡬ뢄
        pass


# Examples of different tasks
def unified_io_examples():
    """Unified-IO task examples"""

    examples = {
        # Image Captioning
        "caption": {
            "input": "<image> {image_tokens} </image> Describe this image.",
            "output": "A cat sitting on a windowsill."
        },

        # Object Detection
        "detection": {
            "input": "<image> {image_tokens} </image> Detect all objects.",
            "output": "cat <box> 100 200 300 400 </box> dog <box> 50 60 150 200 </box>"
        },

        # Segmentation
        "segmentation": {
            "input": "<image> {image_tokens} </image> Segment the cat.",
            "output": "<mask> {mask_tokens} </mask>"
        },

        # Image Generation (reverse direction)
        "generation": {
            "input": "Generate an image of a sunset over mountains.",
            "output": "<image> {image_tokens} </image>"
        },

        # VQA
        "vqa": {
            "input": "<image> {image_tokens} </image> How many cats are there?",
            "output": "2"
        }
    }

    return examples

5. Practical Use

5.1 Using Florence-2 (Hugging Face)

from transformers import AutoProcessor, AutoModelForCausalLM

def use_florence2():
    """Florence-2 μ‹€μ „ μ‚¬μš©"""

    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/Florence-2-large",
        trust_remote_code=True
    )
    processor = AutoProcessor.from_pretrained(
        "microsoft/Florence-2-large",
        trust_remote_code=True
    )

    from PIL import Image
    import requests

    url = "https://example.com/image.jpg"
    image = Image.open(requests.get(url, stream=True).raw)

    # λ‹€μ–‘ν•œ νƒœμŠ€ν¬
    tasks = {
        "<CAPTION>": "짧은 μΊ‘μ…˜",
        "<DETAILED_CAPTION>": "상세 μΊ‘μ…˜",
        "<MORE_DETAILED_CAPTION>": "맀우 μƒμ„Έν•œ μΊ‘μ…˜",
        "<OD>": "객체 κ²€μΆœ",
        "<DENSE_REGION_CAPTION>": "μ˜μ—­λ³„ μΊ‘μ…˜",
        "<REGION_PROPOSAL>": "μ˜μ—­ μ œμ•ˆ",
        "<CAPTION_TO_PHRASE_GROUNDING>": "ν…μŠ€νŠΈβ†’μ˜μ—­ κ·ΈλΌμš΄λ”©",
        "<REFERRING_EXPRESSION_SEGMENTATION>": "μ°Έμ‘° ν‘œν˜„ μ„Έκ·Έλ©˜ν…Œμ΄μ…˜",
        "<OCR>": "OCR",
        "<OCR_WITH_REGION>": "μ˜μ—­λ³„ OCR",
    }

    for task_prompt, description in tasks.items():
        inputs = processor(text=task_prompt, images=image, return_tensors="pt")

        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            num_beams=3
        )

        generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
        parsed = processor.post_process_generation(generated_text, task=task_prompt, image_size=image.size)

        print(f"\n{description} ({task_prompt}):")
        print(parsed)


# Run
use_florence2()

5.2 μ»€μŠ€ν…€ νƒœμŠ€ν¬ ν•™μŠ΅

from transformers import Trainer, TrainingArguments
from datasets import Dataset

def finetune_unified_vision():
    """톡합 λΉ„μ „ λͺ¨λΈ fine-tuning"""

    # λ©€ν‹°νƒœμŠ€ν¬ 데이터셋 μ€€λΉ„
    def create_multitask_dataset():
        """μ—¬λŸ¬ νƒœμŠ€ν¬λ₯Ό ν•˜λ‚˜μ˜ λ°μ΄ν„°μ…‹μœΌλ‘œ"""
        samples = []

        # λΆ„λ₯˜ μƒ˜ν”Œ
        for img_path, label in classification_data:
            samples.append({
                'image': img_path,
                'task': '<CLASSIFICATION>',
                'input_text': '<CLASSIFICATION>',
                'output_text': label
            })

        # Caption samples
        for img_path, caption in caption_data:
            samples.append({
                'image': img_path,
                'task': '<CAPTION>',
                'input_text': '<CAPTION>',
                'output_text': caption
            })

        # VQA samples
        for img_path, question, answer in vqa_data:
            samples.append({
                'image': img_path,
                'task': '<VQA>',
                'input_text': f'<VQA> {question}',
                'output_text': answer
            })

        return Dataset.from_list(samples)

    dataset = create_multitask_dataset()

    # Training
    training_args = TrainingArguments(
        output_dir="./unified-vision-finetuned",
        per_device_train_batch_size=8,
        num_train_epochs=3,
        learning_rate=1e-5,
        # νƒœμŠ€ν¬ μƒ˜ν”Œλ§ μ „λž΅
        dataloader_drop_last=True,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
    )

    trainer.train()
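
The TrainingArguments above only hint at a task sampling strategy. One simple option, sketched here under the assumption that each sample carries a 'task' field as in create_multitask_dataset, is to weight samples so every task is drawn with roughly equal probability; plugging it into Trainer would require overriding its train dataloader (e.g. via a custom get_train_dataloader).

from collections import Counter
from torch.utils.data import WeightedRandomSampler

def make_task_balanced_sampler(dataset):
    """Sample each task with equal probability so large tasks don't dominate training."""
    task_counts = Counter(sample['task'] for sample in dataset)
    weights = [1.0 / task_counts[sample['task']] for sample in dataset]
    return WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)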

6. Future Directions

6.1 World Models

λ‹€μŒ 단계: World Models

λΉ„μ „ λͺ¨λΈ + 물리 이해 + 행동 예츑

μ˜ˆμ‹œ:
- μ΄λ―Έμ§€μ—μ„œ 물리 법칙 이해
- "곡을 λ˜μ§€λ©΄ μ–΄λ””λ‘œ 갈까?"
- λΉ„λ””μ˜€μ˜ λ‹€μŒ ν”„λ ˆμž„ 예츑
- λ‘œλ΄‡ μ‘°μž‘ κ³„νš

6.2 ν†΅ν•©μ˜ ν•œκ³„μ™€ νŠΈλ ˆμ΄λ“œμ˜€ν”„

Advantages:
βœ“ Knowledge sharing across tasks
βœ“ A single model to maintain
βœ“ Zero-shot transfer
βœ“ Easy adaptation to new tasks

Disadvantages:
βœ— Falls short of the best task-specific performance
βœ— Training complexity
βœ— Interference between tasks
βœ— Large model size

Trade-offs:
- Generality vs. specialization
- Convenience vs. peak performance

References

Papers

  • Yuan et al. (2021). "Florence: A New Foundation Model for Computer Vision"
  • Chen et al. (2022). "PaLI: A Jointly-Scaled Multilingual Language-Image Model"
  • Lu et al. (2022). "Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks"

λͺ¨λΈ

κ΄€λ ¨ 레슨

to navigate between lessons