18. Audio/Video Foundation Models

Overview

Foundation models in the audio and video domains handle diverse multimedia tasks, such as speech recognition, music generation, and video understanding, in a unified way.


1. Speech Foundation Models

1.1 Whisper

OpenAI's general-purpose speech recognition model:

Whisper μ•„ν‚€ν…μ²˜:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Audio Input (30-second segments)           β”‚
β”‚       ↓                                      β”‚
β”‚  Log-Mel Spectrogram (80 bins)              β”‚
β”‚       ↓                                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                   β”‚
β”‚  β”‚   Audio Encoder      β”‚                   β”‚
β”‚  β”‚   (Transformer)      β”‚                   β”‚
β”‚  β”‚   - Conv1d stem      β”‚                   β”‚
β”‚  β”‚   - Sinusoidal pos   β”‚                   β”‚
β”‚  β”‚   - N layers         β”‚                   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                   β”‚
β”‚       ↓                                      β”‚
β”‚  Audio Features                              β”‚
β”‚       ↓                                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                   β”‚
β”‚  β”‚   Text Decoder       β”‚                   β”‚
β”‚  β”‚   (Transformer)      β”‚                   β”‚
β”‚  β”‚   - Cross-attention  β”‚                   β”‚
β”‚  β”‚   - Causal masking   β”‚                   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                   β”‚
β”‚       ↓                                      β”‚
β”‚  Text Output (Transcription/Translation)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

λͺ¨λΈ 크기:
- tiny:   39M params,  ~32x realtime
- base:   74M params,  ~16x realtime
- small:  244M params, ~6x realtime
- medium: 769M params, ~2x realtime
- large:  1.55B params, ~1x realtime
import torch
import whisper
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Using the OpenAI whisper package
def transcribe_with_whisper():
    """Speech recognition with OpenAI Whisper"""
    model = whisper.load_model("base")

    # Basic transcription
    result = model.transcribe("audio.mp3")
    print(result["text"])

    # Language detection and translation
    result = model.transcribe(
        "audio.mp3",
        task="translate",  # translate into English
        language=None,     # auto-detect
        fp16=torch.cuda.is_available()
    )

    # Include word-level timestamps
    result = model.transcribe(
        "audio.mp3",
        word_timestamps=True
    )

    for segment in result["segments"]:
        print(f"[{segment['start']:.2f} - {segment['end']:.2f}] {segment['text']}")

    return result


# Using HuggingFace Transformers
def transcribe_with_hf_whisper():
    """Whisper via HuggingFace Transformers"""
    processor = WhisperProcessor.from_pretrained("openai/whisper-base")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")

    # Load the audio at 16 kHz
    import librosa
    audio, sr = librosa.load("audio.mp3", sr=16000)

    # Feature extraction
    input_features = processor(
        audio,
        sampling_rate=16000,
        return_tensors="pt"
    ).input_features

    # Generate
    predicted_ids = model.generate(
        input_features,
        language="korean",
        task="transcribe"
    )

    # Decode
    transcription = processor.batch_decode(
        predicted_ids,
        skip_special_tokens=True
    )[0]

    return transcription


# Whisper Fine-tuning
class WhisperFineTuner:
    """Domain-specific Whisper fine-tuning"""

    def __init__(self, model_name: str = "openai/whisper-small"):
        from transformers import (
            WhisperForConditionalGeneration,
            WhisperProcessor,
            Seq2SeqTrainingArguments,
            Seq2SeqTrainer
        )

        self.processor = WhisperProcessor.from_pretrained(model_name)
        self.model = WhisperForConditionalGeneration.from_pretrained(model_name)

        # Freeze the encoder (optional)
        for param in self.model.model.encoder.parameters():
            param.requires_grad = False

    def prepare_dataset(self, dataset):
        """Dataset preprocessing"""
        def prepare_example(example):
            audio = example["audio"]["array"]

            # Extract input features
            input_features = self.processor(
                audio,
                sampling_rate=16000,
                return_tensors="pt"
            ).input_features[0]

            # λ ˆμ΄λΈ” 토큰화
            labels = self.processor.tokenizer(
                example["transcription"]
            ).input_ids

            return {
                "input_features": input_features,
                "labels": labels
            }

        return dataset.map(prepare_example)

    def train(self, train_dataset, eval_dataset):
        """Run fine-tuning"""
        from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

        def data_collator(features):
            # Pad input features and labels separately; mask label padding with -100
            input_features = [{"input_features": f["input_features"]} for f in features]
            batch = self.processor.feature_extractor.pad(
                input_features, return_tensors="pt"
            )
            label_features = [{"input_ids": f["labels"]} for f in features]
            labels_batch = self.processor.tokenizer.pad(
                label_features, return_tensors="pt"
            )
            batch["labels"] = labels_batch["input_ids"].masked_fill(
                labels_batch["attention_mask"].ne(1), -100
            )
            return batch

        training_args = Seq2SeqTrainingArguments(
            output_dir="./whisper-finetuned",
            per_device_train_batch_size=8,
            gradient_accumulation_steps=2,
            learning_rate=1e-5,
            warmup_steps=500,
            num_train_epochs=3,
            evaluation_strategy="steps",
            eval_steps=500,
            save_steps=1000,
            fp16=True,
            predict_with_generate=True,
            generation_max_length=225
        )

        trainer = Seq2SeqTrainer(
            model=self.model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            data_collator=data_collator,
            tokenizer=self.processor.feature_extractor,
        )

        trainer.train()

1.2 Speech Synthesis (TTS)

# VITS/Coqui TTS
def text_to_speech_coqui():
    """Speech synthesis with Coqui TTS"""
    from TTS.api import TTS

    # Multilingual TTS
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

    # μŒμ„± ν•©μ„±
    tts.tts_to_file(
        text="μ•ˆλ…•ν•˜μ„Έμš”, Foundation Model ν•™μŠ΅ μžλ£Œμž…λ‹ˆλ‹€.",
        file_path="output.wav",
        speaker_wav="reference_voice.wav",  # Voice cloning
        language="ko"
    )


# Bark (Suno AI)
def text_to_speech_bark():
    """Speech synthesis with Bark (supports nonverbal expressions)"""
    from transformers import AutoProcessor, BarkModel
    import scipy.io.wavfile as wavfile

    processor = AutoProcessor.from_pretrained("suno/bark")
    model = BarkModel.from_pretrained("suno/bark")

    # ν…μŠ€νŠΈ (비언어적 ν‘œν˜„ 포함 κ°€λŠ₯)
    text = "[laughs] Hello! This is amazing. [sighs]"

    inputs = processor(text, return_tensors="pt")
    audio_array = model.generate(**inputs)
    audio_array = audio_array.cpu().numpy().squeeze()

    # Save
    wavfile.write("bark_output.wav", 24000, audio_array)

2. Audio Generation Models

2.1 AudioLM

Google's audio language model:

AudioLM structure:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   Audio Input                       β”‚
β”‚                       ↓                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚           Semantic Tokens (w2v-BERT)          β”‚  β”‚
β”‚  β”‚           - High-level content               β”‚  β”‚
β”‚  β”‚           - ~25 tokens/second                β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                       ↓                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚         Coarse Acoustic Tokens (SoundStream) β”‚  β”‚
β”‚  β”‚           - Medium-level details             β”‚  β”‚
β”‚  β”‚           - ~50 tokens/second                β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                       ↓                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚          Fine Acoustic Tokens (SoundStream)  β”‚  β”‚
β”‚  β”‚           - Fine-grained details             β”‚  β”‚
β”‚  β”‚           - ~100 tokens/second               β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                       ↓                             β”‚
β”‚                 SoundStream Decoder                 β”‚
β”‚                       ↓                             β”‚
β”‚                   Audio Output                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Three-stage generation:
1. Semantic β†’ Semantic (continuation)
2. Semantic β†’ Coarse Acoustic
3. Coarse β†’ Fine Acoustic
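The three stages can be sketched as a chain of toy autoregressive models, each conditioned on the coarser token stream above it. The GRU stages, vocabulary sizes, and greedy decoding below are illustrative stand-ins for AudioLM's actual Transformer stages; only the relative token rates follow the figures above.

```python
import torch
import torch.nn as nn

class ToyStage(nn.Module):
    """One autoregressive stage: predicts target tokens given conditioning tokens."""
    def __init__(self, vocab: int, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def generate(self, cond: torch.Tensor, out_len: int) -> torch.Tensor:
        # Summarize the conditioning stream, then decode greedily
        h = self.rnn(self.embed(cond))[1]
        tok = torch.zeros(cond.size(0), 1, dtype=torch.long)
        out = []
        for _ in range(out_len):
            y, h = self.rnn(self.embed(tok), h)
            tok = self.head(y[:, -1:]).argmax(-1)
            out.append(tok)
        return torch.cat(out, dim=1)

# Stage 1: semantic continuation; stage 2: semantic -> coarse; stage 3: coarse -> fine
semantic_lm = ToyStage(vocab=512)
coarse_lm = ToyStage(vocab=1024)
fine_lm = ToyStage(vocab=1024)

prompt_semantic = torch.randint(0, 512, (1, 25))      # ~1 s at 25 tokens/s
semantic = semantic_lm.generate(prompt_semantic, 50)  # continue the semantics
coarse = coarse_lm.generate(semantic, 100)            # 50 tokens/s rate
fine = fine_lm.generate(coarse, 200)                  # 100 tokens/s rate
print(semantic.shape, coarse.shape, fine.shape)
```

The key design point survives the simplification: each stage only ever conditions on coarser tokens, so long-range structure is decided before acoustic detail.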

2.2 MusicGen

Meta's music generation model:

import torch
import numpy as np
from transformers import AutoProcessor, MusicgenForConditionalGeneration
import scipy.io.wavfile as wavfile

class MusicGenerator:
    """Music generation with MusicGen"""

    def __init__(self, model_size: str = "small"):
        """
        model_size: "small" (300M), "medium" (1.5B), "large" (3.3B)
        """
        model_name = f"facebook/musicgen-{model_size}"
        self.processor = AutoProcessor.from_pretrained(model_name)
        self.model = MusicgenForConditionalGeneration.from_pretrained(model_name)

        if torch.cuda.is_available():
            self.model = self.model.to("cuda")

    def generate_from_text(
        self,
        prompt: str,
        duration: float = 10.0,
        temperature: float = 1.0,
        guidance_scale: float = 3.0
    ):
        """ν…μŠ€νŠΈ ν”„λ‘¬ν”„νŠΈλ‘œ μŒμ•… 생성"""
        inputs = self.processor(
            text=[prompt],
            padding=True,
            return_tensors="pt"
        )

        if torch.cuda.is_available():
            inputs = {k: v.to("cuda") for k, v in inputs.items()}

        # Compute the token count (32 kHz audio, 50 tokens/second)
        max_new_tokens = int(duration * 50)

        audio_values = self.model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            guidance_scale=guidance_scale
        )

        return audio_values[0, 0].cpu().numpy()

    def generate_with_melody(
        self,
        prompt: str,
        melody_audio: torch.Tensor,
        duration: float = 10.0
    ):
        """Melody-conditioned generation (melody checkpoints only)"""
        inputs = self.processor(
            text=[prompt],
            audio=melody_audio,
            sampling_rate=32000,
            padding=True,
            return_tensors="pt"
        )

        max_new_tokens = int(duration * 50)

        audio_values = self.model.generate(
            **inputs,
            max_new_tokens=max_new_tokens
        )

        return audio_values[0, 0].cpu().numpy()

    def save_audio(self, audio: np.ndarray, path: str):
        """Save audio to disk"""
        wavfile.write(path, 32000, audio)


# Usage examples
def music_generation_examples():
    """Assorted music generation examples"""
    generator = MusicGenerator("small")

    # ν…μŠ€νŠΈ 기반 생성
    prompts = [
        "A calm piano melody with soft strings in the background",
        "Upbeat electronic dance music with heavy bass drops",
        "Traditional Korean music with gayageum and janggu",
        "Jazz trio improvisation with drums, bass, and piano"
    ]

    for i, prompt in enumerate(prompts):
        audio = generator.generate_from_text(
            prompt,
            duration=15.0,
            temperature=0.9,
            guidance_scale=3.5
        )
        generator.save_audio(audio, f"music_{i}.wav")
        print(f"Generated: {prompt[:50]}...")
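The guidance_scale parameter above implements classifier-free guidance: the model is run with and without the text condition, and the two predictions are combined. A minimal sketch of just that combination step, on toy logits rather than MusicGen's internals:

```python
import torch

def cfg_logits(cond_logits: torch.Tensor,
               uncond_logits: torch.Tensor,
               guidance_scale: float) -> torch.Tensor:
    """Classifier-free guidance: push the conditional prediction
    away from the unconditional one by guidance_scale."""
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

# With scale 1.0 we recover the conditional logits; larger scales
# sharpen prompt adherence at the cost of diversity.
cond = torch.tensor([[2.0, 0.0, -1.0]])
uncond = torch.tensor([[1.0, 0.5, -0.5]])
print(cfg_logits(cond, uncond, 1.0))  # equals cond
print(cfg_logits(cond, uncond, 3.0))
```

This is why guidance_scale around 3-4 is a common operating point: it amplifies the direction the text condition pulls in without collapsing the distribution entirely.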

2.3 AudioCraft (AudioGen)

# AudioGen (sound effect generation)
def generate_sound_effects():
    """Generate sound effects with AudioGen"""
    from audiocraft.models import AudioGen
    from audiocraft.data.audio import audio_write

    model = AudioGen.get_pretrained("facebook/audiogen-medium")
    model.set_generation_params(duration=5)

    descriptions = [
        "Dog barking in the distance",
        "Thunder and heavy rain",
        "Car engine starting and driving away"
    ]

    wav = model.generate(descriptions)

    for i, one_wav in enumerate(wav):
        audio_write(f"sound_{i}", one_wav.cpu(), model.sample_rate)

3. Video Understanding Models

3.1 Video-LLaMA / VideoLLaMA 2

VideoLLaMA μ•„ν‚€ν…μ²˜:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Video Input                                         β”‚
β”‚  [Frame1, Frame2, ..., FrameN]                      β”‚
β”‚          ↓                                           β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚         Visual Encoder (ViT/CLIP)              β”‚ β”‚
β”‚  β”‚         - Frame-level features                 β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚          ↓                                           β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚       Video Q-Former                           β”‚ β”‚
β”‚  β”‚       - Temporal aggregation                   β”‚ β”‚
β”‚  β”‚       - Cross-attention with queries           β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚          ↓                                           β”‚
β”‚  Video Embeddings                                    β”‚
β”‚          +                                           β”‚
β”‚  Audio Embeddings (ImageBind)                        β”‚
β”‚          ↓                                           β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚              LLM Backbone                       β”‚ β”‚
β”‚  β”‚           (LLaMA/Vicuna)                        β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚          ↓                                           β”‚
β”‚  Text Response                                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
import torch
import numpy as np
from transformers import AutoProcessor, AutoModelForVision2Seq

class VideoUnderstanding:
    """Video understanding model"""

    def __init__(self, model_name: str = "DAMO-NLP-SG/Video-LLaMA-2-7B"):
        self.processor = AutoProcessor.from_pretrained(model_name)
        self.model = AutoModelForVision2Seq.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )

    def extract_frames(
        self,
        video_path: str,
        num_frames: int = 8,
        uniform: bool = True
    ):
        """λΉ„λ””μ˜€μ—μ„œ ν”„λ ˆμž„ μΆ”μΆœ"""
        import cv2

        cap = cv2.VideoCapture(video_path)
        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

        if uniform:
            # Uniform sampling
            indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
        else:
            # Random sampling
            indices = sorted(np.random.choice(total_frames, num_frames, replace=False))

        frames = []
        for idx in indices:
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ret, frame = cap.read()
            if ret:
                frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                frames.append(frame)

        cap.release()
        return frames

    def analyze_video(
        self,
        video_path: str,
        question: str,
        num_frames: int = 8
    ):
        """Analyze a video and answer a question"""
        frames = self.extract_frames(video_path, num_frames)

        # Prepare inputs
        inputs = self.processor(
            text=question,
            images=frames,
            return_tensors="pt"
        ).to(self.model.device)

        # Generate
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=True,
            temperature=0.7
        )

        response = self.processor.decode(outputs[0], skip_special_tokens=True)
        return response


# Video Captioning
class VideoCaptioner:
    """Video captioning"""

    def __init__(self):
        from transformers import Blip2Processor, Blip2ForConditionalGeneration

        self.processor = Blip2Processor.from_pretrained(
            "Salesforce/blip2-opt-2.7b"
        )
        self.model = Blip2ForConditionalGeneration.from_pretrained(
            "Salesforce/blip2-opt-2.7b",
            torch_dtype=torch.float16
        )

    def caption_video(
        self,
        video_path: str,
        num_frames: int = 5
    ):
        """Generate a caption for a video"""
        frames = self._extract_frames(video_path, num_frames)

        captions = []
        for frame in frames:
            inputs = self.processor(images=frame, return_tensors="pt")
            output = self.model.generate(**inputs, max_new_tokens=50)
            caption = self.processor.decode(output[0], skip_special_tokens=True)
            captions.append(caption)

        # Combine the per-frame captions
        return self._summarize_captions(captions)

    def _extract_frames(self, video_path: str, num_frames: int):
        """Uniformly sample frames from the video"""
        import cv2
        import numpy as np

        cap = cv2.VideoCapture(video_path)
        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)

        frames = []
        for idx in indices:
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
            ret, frame = cap.read()
            if ret:
                frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        cap.release()
        return frames

    def _summarize_captions(self, captions: list) -> str:
        """Combine per-frame captions into a video-level summary"""
        # Simple concatenation; in practice an LLM summary works better
        return " β†’ ".join(captions)

3.2 Video Generation Concepts (Sora)

Sora core concepts:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     Text Prompt                         β”‚
β”‚  "A cat playing piano in a cozy room with warm light"  β”‚
β”‚                         ↓                               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚
β”‚  β”‚              Text Encoder (T5/CLIP)                β”‚β”‚
β”‚  β”‚              - Semantic understanding              β”‚β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚
β”‚                         ↓                               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚
β”‚  β”‚          Spacetime Latent Patches                  β”‚β”‚
β”‚  β”‚          - Video as 3D patches                     β”‚β”‚
β”‚  β”‚          - Compress HΓ—WΓ—T into latent             β”‚β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚
β”‚                         ↓                               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚
β”‚  β”‚              Diffusion Transformer                 β”‚β”‚
β”‚  β”‚              - DiT backbone                        β”‚β”‚
β”‚  β”‚              - Attention over spacetime            β”‚β”‚
β”‚  β”‚              - Variable resolution/duration        β”‚β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚
β”‚                         ↓                               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚
β”‚  β”‚              VAE Decoder                           β”‚β”‚
β”‚  β”‚              - Latent β†’ Pixel space               β”‚β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚
β”‚                         ↓                               β”‚
β”‚                    Video Output                         β”‚
β”‚              (Variable length, up to 1 min)            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key techniques:
1. Spacetime Patches: partition spacetime into patches
2. DiT (Diffusion Transformer): Transformer-based diffusion
3. Variable Resolution: supports diverse resolutions/durations
4. Recaptioning: retraining on detailed generated captions
# A simplified conceptual implementation of video diffusion
import math
import torch
import torch.nn as nn
from einops import rearrange

class SpacetimePatchEmbed(nn.Module):
    """Spacetime patch embedding"""

    def __init__(
        self,
        img_size: int = 256,
        patch_size: int = 16,
        num_frames: int = 16,
        temporal_patch: int = 2,
        in_channels: int = 4,  # VAE latent
        embed_dim: int = 768
    ):
        super().__init__()
        self.patch_size = patch_size
        self.temporal_patch = temporal_patch

        # 3D patch embedding
        self.proj = nn.Conv3d(
            in_channels, embed_dim,
            kernel_size=(temporal_patch, patch_size, patch_size),
            stride=(temporal_patch, patch_size, patch_size)
        )

        # Compute patch counts
        self.num_spatial_patches = (img_size // patch_size) ** 2
        self.num_temporal_patches = num_frames // temporal_patch
        self.num_patches = self.num_spatial_patches * self.num_temporal_patches

    def forward(self, x):
        """
        Args:
            x: (B, C, T, H, W) - video latent
        Returns:
            patches: (B, N, D) - spacetime patches
        """
        # (B, D, t, h, w)
        x = self.proj(x)
        # (B, D, N) -> (B, N, D)
        x = x.flatten(2).transpose(1, 2)
        return x


class VideoTransformerBlock(nn.Module):
    """Transformer block for video"""

    def __init__(
        self,
        dim: int,
        num_heads: int,
        mlp_ratio: float = 4.0,
        num_spatial_patches: int = 256,
        num_temporal_patches: int = 8
    ):
        super().__init__()
        self.num_spatial = num_spatial_patches
        self.num_temporal = num_temporal_patches

        # Spatial attention
        self.spatial_norm = nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        # Temporal attention
        self.temporal_norm = nn.LayerNorm(dim)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        # FFN
        self.ffn_norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim)
        )

    def forward(self, x, t_emb=None):
        """
        Args:
            x: (B, T*S, D) - spacetime patches
            t_emb: (B, D) - timestep embedding
        """
        B, N, D = x.shape
        T, S = self.num_temporal, self.num_spatial

        # Spatial attention (within each frame)
        x_spatial = rearrange(x, 'b (t s) d -> (b t) s d', t=T, s=S)
        x_spatial = self.spatial_norm(x_spatial)
        attn_out, _ = self.spatial_attn(x_spatial, x_spatial, x_spatial)
        x_spatial = rearrange(attn_out, '(b t) s d -> b (t s) d', b=B, t=T)
        x = x + x_spatial

        # Temporal attention (across frames at the same spatial position)
        x_temporal = rearrange(x, 'b (t s) d -> (b s) t d', t=T, s=S)
        x_temporal = self.temporal_norm(x_temporal)
        attn_out, _ = self.temporal_attn(x_temporal, x_temporal, x_temporal)
        x_temporal = rearrange(attn_out, '(b s) t d -> b (t s) d', b=B, s=S)
        x = x + x_temporal

        # FFN
        x = x + self.ffn(self.ffn_norm(x))

        return x


class SimpleDiT(nn.Module):
    """λ‹¨μˆœν™”λœ Diffusion Transformer"""

    def __init__(
        self,
        img_size: int = 256,
        patch_size: int = 16,
        num_frames: int = 16,
        in_channels: int = 4,
        hidden_size: int = 768,
        depth: int = 12,
        num_heads: int = 12
    ):
        super().__init__()

        # Patch embedding
        self.patch_embed = SpacetimePatchEmbed(
            img_size, patch_size, num_frames,
            temporal_patch=2, in_channels=in_channels,
            embed_dim=hidden_size
        )

        num_spatial = (img_size // patch_size) ** 2
        num_temporal = num_frames // 2

        # Position embedding
        self.pos_embed = nn.Parameter(
            torch.zeros(1, num_spatial * num_temporal, hidden_size)
        )

        # Timestep embedding
        self.time_embed = nn.Sequential(
            nn.Linear(hidden_size, hidden_size * 4),
            nn.SiLU(),
            nn.Linear(hidden_size * 4, hidden_size)
        )

        # Transformer blocks
        self.blocks = nn.ModuleList([
            VideoTransformerBlock(
                hidden_size, num_heads,
                num_spatial_patches=num_spatial,
                num_temporal_patches=num_temporal
            )
            for _ in range(depth)
        ])

        # Output
        self.final_norm = nn.LayerNorm(hidden_size)
        self.final_proj = nn.Linear(
            hidden_size,
            patch_size * patch_size * 2 * in_channels
        )

    def forward(self, x, t, cond=None):
        """
        Args:
            x: (B, C, T, H, W) - noisy video latent
            t: (B,) - diffusion timestep
            cond: (B, L, D) - text conditioning
        """
        # Patch embedding
        x = self.patch_embed(x) + self.pos_embed

        # Timestep embedding (sinusoidal)
        t_emb = self._sinusoidal_embedding(t, x.shape[-1])
        t_emb = self.time_embed(t_emb)

        # Transformer blocks
        for block in self.blocks:
            x = block(x, t_emb)

        # Output
        x = self.final_norm(x)
        x = self.final_proj(x)

        return x

    def _sinusoidal_embedding(self, t, dim):
        """Sinusoidal timestep embedding"""
        half = dim // 2
        freqs = torch.exp(
            -math.log(10000) * torch.arange(half, device=t.device) / half
        )
        args = t[:, None] * freqs[None]
        embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
        return embedding
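A property of convolutional patchification worth noting, since it underlies Sora's variable-resolution claim: the same projection weights apply to any input whose dimensions are divisible by the patch sizes, so only the token count changes. A toy check (illustrative channel/embedding sizes, mirroring the Conv3d used in SpacetimePatchEmbed):

```python
import torch
import torch.nn as nn

# One Conv3d patchifier handles any resolution/duration divisible by
# the patch sizes; only the number of tokens N varies.
proj = nn.Conv3d(4, 32, kernel_size=(2, 16, 16), stride=(2, 16, 16))

for shape in [(1, 4, 16, 256, 256), (1, 4, 8, 128, 192)]:
    x = torch.randn(*shape)
    tokens = proj(x).flatten(2).transpose(1, 2)  # (B, N, D)
    print(shape, "->", tuple(tokens.shape))
```

The fixed pos_embed in SimpleDiT above would then need to be interpolated or replaced by a resolution-aware scheme for variable inputs to actually work end to end.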

4. Practical Applications

4.1 Multimodal Pipeline

class MultimodalPipeline:
    """Integrated audio/video pipeline"""

    def __init__(self):
        # Speech recognition
        self.whisper = whisper.load_model("base")

        # Music generation
        self.music_processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
        self.music_model = MusicgenForConditionalGeneration.from_pretrained(
            "facebook/musicgen-small"
        )

        # TTS
        self.tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

    def transcribe_and_translate(
        self,
        audio_path: str,
        target_language: str = "en"
    ):
        """μŒμ„± 인식 및 λ²ˆμ—­"""
        # μŒμ„± 인식
        result = self.whisper.transcribe(audio_path)
        original_text = result["text"]
        source_language = result["language"]

        # λ²ˆμ—­ (μ˜μ–΄λ‘œ)
        if source_language != target_language:
            translation = self.whisper.transcribe(
                audio_path,
                task="translate"
            )["text"]
        else:
            translation = original_text

        return {
            "original": original_text,
            "source_language": source_language,
            "translation": translation
        }

    def generate_soundtrack(
        self,
        video_description: str,
        mood: str,
        duration: float = 30.0
    ):
        """Generate background music from a video description"""
        prompt = f"{mood} music for: {video_description}"

        inputs = self.music_processor(
            text=[prompt],
            padding=True,
            return_tensors="pt"
        )

        max_new_tokens = int(duration * 50)

        audio_values = self.music_model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            guidance_scale=4.0
        )

        return audio_values[0, 0].numpy()

    def create_voiceover(
        self,
        script: str,
        reference_voice: str,
        language: str = "en"
    ):
        """λ³΄μ΄μŠ€μ˜€λ²„ 생성"""
        output_path = "voiceover.wav"

        self.tts.tts_to_file(
            text=script,
            file_path=output_path,
            speaker_wav=reference_voice,
            language=language
        )

        return output_path


# Usage example
def demo_pipeline():
    """Pipeline demo"""
    pipeline = MultimodalPipeline()

    # 1. μŒμ„± 파일 전사 및 λ²ˆμ—­
    result = pipeline.transcribe_and_translate(
        "korean_speech.mp3",
        target_language="en"
    )
    print(f"Original: {result['original']}")
    print(f"Translation: {result['translation']}")

    # 2. Generate background music for the video
    music = pipeline.generate_soundtrack(
        video_description="A documentary about ocean wildlife",
        mood="Calm and majestic",
        duration=60.0
    )

    # 3. λ‚˜λ ˆμ΄μ…˜ 생성
    voiceover = pipeline.create_voiceover(
        script="Welcome to our exploration of the deep ocean.",
        reference_voice="narrator_sample.wav",
        language="en"
    )

4.2 Real-time Processing

import asyncio
import numpy as np
import whisper
from collections import deque

class RealTimeAudioProcessor:
    """Real-time audio processing"""

    def __init__(self, buffer_size: float = 3.0):
        self.buffer_size = buffer_size
        self.sample_rate = 16000
        self.audio_buffer = deque(maxlen=int(buffer_size * self.sample_rate))

        # Whisper λͺ¨λΈ (μž‘μ€ 버전 μ‚¬μš©)
        self.model = whisper.load_model("tiny")

    async def process_stream(self, audio_stream):
        """Process an audio stream"""
        while True:
            # Receive an audio chunk
            chunk = await audio_stream.receive()
            self.audio_buffer.extend(chunk)

            # Process once the buffer holds enough samples
            if len(self.audio_buffer) >= self.sample_rate * 2:
                audio_array = np.array(self.audio_buffer, dtype=np.float32)

                # Transcribe off the event loop
                result = await asyncio.to_thread(
                    self.model.transcribe,
                    audio_array,
                    fp16=False
                )

                yield result["text"]

                # Keep part of the buffer for overlap
                self.audio_buffer = deque(
                    list(self.audio_buffer)[self.sample_rate:],
                    maxlen=int(self.buffer_size * self.sample_rate)
                )


class StreamingVideoAnalyzer:
    """Streaming video analysis"""

    def __init__(self, frame_interval: int = 30):
        self.frame_interval = frame_interval
        self.frame_count = 0

        # CLIP for quick frame analysis
        from transformers import CLIPProcessor, CLIPModel
        self.processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
        self.model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

    def analyze_frame(self, frame, categories: list):
        """Classify a single frame"""
        inputs = self.processor(
            text=categories,
            images=frame,
            return_tensors="pt",
            padding=True
        )

        outputs = self.model(**inputs)
        probs = outputs.logits_per_image.softmax(dim=1)

        return {cat: prob.item() for cat, prob in zip(categories, probs[0])}

    def process_video_stream(self, video_stream, categories: list):
        """Process a video stream"""
        import cv2

        while True:
            ret, frame = video_stream.read()
            if not ret:
                break

            self.frame_count += 1

            # Analyze only every frame_interval-th frame
            if self.frame_count % self.frame_interval == 0:
                frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                analysis = self.analyze_frame(frame_rgb, categories)
                yield self.frame_count, analysis

5. λͺ¨λΈ 비ꡐ

5.1 Speech Models

λͺ¨λΈ νŒŒλΌλ―Έν„° νŠΉμ§• μš©λ„
Whisper Large 1.55B λ‹€κ΅­μ–΄, λ²ˆμ—­ λ²”μš© ASR
Whisper Large-v3 1.55B κ°œμ„ λœ 정확도 ν”„λ‘œλ•μ…˜
wav2vec 2.0 300M Self-supervised Fine-tuning 베이슀
HuBERT 300M-1B Masked prediction Speech representation
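The masked-prediction objective behind wav2vec 2.0 and HuBERT can be illustrated with a toy version: mask a span of latent frames, encode with a Transformer, and compute the loss only at masked positions. The dimensions, learned mask embedding, and cluster targets below are illustrative, not either model's actual configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, D, K = 2, 50, 64, 100          # batch, frames, dim, cluster targets
frames = torch.randn(B, T, D)        # stand-in for CNN feature-extractor output
targets = torch.randint(0, K, (B, T))  # stand-in for k-means cluster ids (HuBERT-style)

# Mask a contiguous span of frames and replace it with a learned embedding
mask = torch.zeros(B, T, dtype=torch.bool)
mask[:, 10:20] = True
mask_embed = nn.Parameter(torch.randn(D))
inp = torch.where(mask.unsqueeze(-1), mask_embed.expand(B, T, D), frames)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=2
)
head = nn.Linear(D, K)

# Loss is computed only where the input was masked
logits = head(encoder(inp))
loss = nn.functional.cross_entropy(logits[mask], targets[mask])
print(loss.item())
```

Restricting the loss to masked positions forces the encoder to infer content from surrounding context, which is what makes the learned representations useful as a fine-tuning base.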

5.2 Audio Generation

λͺ¨λΈ 크기 νŠΉμ§• 좜λ ₯
MusicGen Small 300M λΉ λ₯Έ 생성 μŒμ•…
MusicGen Large 3.3B κ³ ν’ˆμ§ˆ μŒμ•…
AudioGen 300M-1.5B μ‚¬μš΄λ“œ 효과 μ˜€λ””μ˜€
Bark 1B+ λΉ„μ–Έμ–΄ ν‘œν˜„ TTS

5.3 Video Models

λͺ¨λΈ μ•„ν‚€ν…μ²˜ μž…λ ₯ νƒœμŠ€ν¬
VideoLLaMA LLaMA + Q-Former Video + Audio VQA, Captioning
Video-ChatGPT LLaVA variant Video Conversation
TimeSformer Divided attention Video Classification
ViViT Factorized Video Classification

Key Takeaways

Audio Foundation Models

Whisper: general-purpose ASR + translation
β”œβ”€β”€ Encoder-Decoder Transformer
β”œβ”€β”€ 680K hours of training data
└── Multilingual (99 languages)

MusicGen: text β†’ music
β”œβ”€β”€ Autoregressive Transformer
β”œβ”€β”€ EnCodec tokenization
└── Text/melody conditioning

Video Foundation Models

Video Understanding:
β”œβ”€β”€ Frame sampling β†’ Visual encoder
β”œβ”€β”€ Temporal aggregation (Q-Former/pooling)
└── LLM backbone for reasoning

Video Generation (Sora concept):
β”œβ”€β”€ Spacetime patches (3D tokenization)
β”œβ”€β”€ Diffusion Transformer (DiT)
└── Variable resolution/duration

Practical Points

  1. Whisper: can be specialized to a domain via fine-tuning
  2. MusicGen: trade quality against diversity with guidance_scale
  3. Video: the frame sampling strategy matters
  4. Real-time: small models + a streaming buffer

References

  1. Radford et al. (2022). "Robust Speech Recognition via Large-Scale Weak Supervision" (Whisper)
  2. Copet et al. (2023). "Simple and Controllable Music Generation" (MusicGen)
  3. Borsos et al. (2023). "AudioLM: a Language Modeling Approach to Audio Generation"
  4. Zhang et al. (2023). "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model"
  5. OpenAI Sora Technical Report (2024)