18. Audio/Video Foundation Models¶
Overview¶
Foundation models in the audio and video domains handle a wide range of multimedia tasks, such as speech recognition, music generation, and video understanding, in a unified way.
1. Speech Foundation Models¶
1.1 Whisper¶
OpenAI's general-purpose speech recognition model:
Whisper architecture:
+-----------------------------------------------+
|       Audio Input (30-second segment)         |
|                     |                         |
|       Log-Mel Spectrogram (80 bins)           |
|                     |                         |
|        +------------------------+             |
|        |     Audio Encoder      |             |
|        |     (Transformer)      |             |
|        |     - Conv1d stem      |             |
|        |     - Sinusoidal pos   |             |
|        |     - N layers         |             |
|        +------------------------+             |
|                     |                         |
|               Audio Features                  |
|                     |                         |
|        +------------------------+             |
|        |     Text Decoder       |             |
|        |     (Transformer)      |             |
|        |     - Cross-attention  |             |
|        |     - Causal masking   |             |
|        +------------------------+             |
|                     |                         |
|  Text Output (Transcription / Translation)    |
+-----------------------------------------------+
Model sizes:
- tiny: 39M params, ~32x realtime
- base: 74M params, ~16x realtime
- small: 244M params, ~6x realtime
- medium: 769M params, ~2x realtime
- large: 1.55B params, ~1x realtime
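These throughput figures all assume Whisper's fixed 30-second input window. Per the Whisper paper, audio is resampled to 16 kHz and converted to 80-bin log-Mel frames with a 10 ms hop, and the convolutional stem halves the time axis. The resulting sequence lengths:

```python
# Sequence-length arithmetic for Whisper's fixed 30-second input window.
SAMPLE_RATE = 16_000      # Hz
SEGMENT_SECONDS = 30
HOP_LENGTH = 160          # samples per mel frame (10 ms at 16 kHz)

n_samples = SAMPLE_RATE * SEGMENT_SECONDS     # 480,000 samples
n_mel_frames = n_samples // HOP_LENGTH        # 3,000 spectrogram frames
n_encoder_positions = n_mel_frames // 2       # conv stem strides by 2 -> 1,500

print(n_samples, n_mel_frames, n_encoder_positions)
```

Every input, regardless of actual speech length, is padded or trimmed to this 1,500-position encoder sequence.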
import torch
import whisper
from transformers import WhisperProcessor, WhisperForConditionalGeneration
# Using the OpenAI whisper package
def transcribe_with_whisper():
    """Speech recognition with OpenAI Whisper."""
    model = whisper.load_model("base")

    # Basic transcription
    result = model.transcribe("audio.mp3")
    print(result["text"])

    # Language detection and translation
    result = model.transcribe(
        "audio.mp3",
        task="translate",    # translate into English
        language=None,       # auto-detect the source language
        fp16=torch.cuda.is_available()
    )

    # Word-level timestamps
    result = model.transcribe(
        "audio.mp3",
        word_timestamps=True
    )
    for segment in result["segments"]:
        print(f"[{segment['start']:.2f} - {segment['end']:.2f}] {segment['text']}")
    return result
# Using HuggingFace Transformers
def transcribe_with_hf_whisper():
    """Speech recognition with the HuggingFace Whisper port."""
    processor = WhisperProcessor.from_pretrained("openai/whisper-base")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")

    # Load audio at 16 kHz (Whisper's expected sampling rate)
    import librosa
    audio, sr = librosa.load("audio.mp3", sr=16000)

    # Preprocess input
    input_features = processor(
        audio,
        sampling_rate=16000,
        return_tensors="pt"
    ).input_features

    # Generate
    predicted_ids = model.generate(
        input_features,
        language="korean",
        task="transcribe"
    )

    # Decode
    transcription = processor.batch_decode(
        predicted_ids,
        skip_special_tokens=True
    )[0]
    return transcription
# Whisper fine-tuning
class WhisperFineTuner:
    """Domain-specific Whisper fine-tuning."""

    def __init__(self, model_name: str = "openai/whisper-small"):
        from transformers import WhisperForConditionalGeneration, WhisperProcessor

        self.processor = WhisperProcessor.from_pretrained(model_name)
        self.model = WhisperForConditionalGeneration.from_pretrained(model_name)

        # Freeze the encoder (optional; trains the decoder only)
        for param in self.model.model.encoder.parameters():
            param.requires_grad = False

    def prepare_dataset(self, dataset):
        """Dataset preprocessing."""
        def prepare_example(example):
            audio = example["audio"]["array"]

            # Extract input features
            input_features = self.processor(
                audio,
                sampling_rate=16000,
                return_tensors="pt"
            ).input_features[0]

            # Tokenize labels
            labels = self.processor.tokenizer(
                example["transcription"]
            ).input_ids
            return {
                "input_features": input_features,
                "labels": labels
            }
        return dataset.map(prepare_example)

    def train(self, train_dataset, eval_dataset):
        """Run fine-tuning."""
        from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

        training_args = Seq2SeqTrainingArguments(
            output_dir="./whisper-finetuned",
            per_device_train_batch_size=8,
            gradient_accumulation_steps=2,
            learning_rate=1e-5,
            warmup_steps=500,
            num_train_epochs=3,
            evaluation_strategy="steps",
            eval_steps=500,
            save_steps=1000,
            fp16=True,
            predict_with_generate=True,
            generation_max_length=225
        )
        # NOTE: real training also needs a data collator that pads label
        # sequences (with -100); it is omitted here for brevity.
        trainer = Seq2SeqTrainer(
            model=self.model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            tokenizer=self.processor.feature_extractor,
        )
        trainer.train()
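One practical gap in the trainer setup above: Seq2SeqTrainer needs a collator that pads variable-length label sequences, with padded positions set to -100 so the cross-entropy loss ignores them (input features are already a fixed 30-second size). The padding logic, sketched framework-free (the helper name is ours):

```python
def pad_labels(label_batches, pad_to=None, ignore_index=-100):
    """Pad variable-length token-id lists to a common length.

    Positions added as padding get `ignore_index` so the loss
    skips them during fine-tuning.
    """
    max_len = pad_to or max(len(ids) for ids in label_batches)
    return [ids + [ignore_index] * (max_len - len(ids)) for ids in label_batches]

batch = pad_labels([[50258, 464, 3797], [50258, 7454]])
# batch[1] -> [50258, 7454, -100]
```

In practice this logic lives inside a data collator class (e.g. the `DataCollatorSpeechSeq2SeqWithPadding` pattern from the HF fine-tuning guides) that also stacks the input features into a tensor batch.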
1.2 Speech Synthesis (TTS)¶
# VITS/Coqui TTS
def text_to_speech_coqui():
    """Speech synthesis with Coqui TTS."""
    from TTS.api import TTS

    # Multilingual TTS
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

    # Synthesize speech (Korean example sentence)
    tts.tts_to_file(
        text="안녕하세요, Foundation Model 학습 자료입니다.",
        file_path="output.wav",
        speaker_wav="reference_voice.wav",  # Voice cloning reference
        language="ko"
    )
# Bark (Suno AI)
def text_to_speech_bark():
    """Speech synthesis with Bark (supports non-verbal expressions)."""
    from transformers import AutoProcessor, BarkModel
    import scipy.io.wavfile as wavfile

    processor = AutoProcessor.from_pretrained("suno/bark")
    model = BarkModel.from_pretrained("suno/bark")

    # Text (non-verbal cues like [laughs] are allowed)
    text = "[laughs] Hello! This is amazing. [sighs]"
    inputs = processor(text, return_tensors="pt")

    audio_array = model.generate(**inputs)
    audio_array = audio_array.cpu().numpy().squeeze()

    # Save (Bark outputs 24 kHz audio)
    wavfile.write("bark_output.wav", 24000, audio_array)
2. Audio Generation Models¶
2.1 AudioLM¶
Google's audio language model:
AudioLM structure:
+----------------------------------------------------+
|                    Audio Input                     |
|                         |                          |
|  +----------------------------------------------+  |
|  |        Semantic Tokens (w2v-BERT)            |  |
|  |        - High-level content                  |  |
|  |        - ~25 tokens/second                   |  |
|  +----------------------------------------------+  |
|                         |                          |
|  +----------------------------------------------+  |
|  |   Coarse Acoustic Tokens (SoundStream)       |  |
|  |        - Medium-level details                |  |
|  |        - ~50 tokens/second                   |  |
|  +----------------------------------------------+  |
|                         |                          |
|  +----------------------------------------------+  |
|  |    Fine Acoustic Tokens (SoundStream)        |  |
|  |        - Fine-grained details                |  |
|  |        - ~100 tokens/second                  |  |
|  +----------------------------------------------+  |
|                         |                          |
|                SoundStream Decoder                 |
|                         |                          |
|                    Audio Output                    |
+----------------------------------------------------+
Three-stage generation:
1. Semantic → Semantic (continuation)
2. Semantic → Coarse Acoustic
3. Coarse → Fine Acoustic
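Because each stream runs at a different token rate, sequence length multiplies across the hierarchy. Taking the per-stream rates in the diagram at face value, a sketch of the counts for a given clip length:

```python
# Token counts per AudioLM stage for a clip of a given duration
# (rates taken from the diagram above).
def audiolm_token_counts(seconds):
    rates = {"semantic": 25, "coarse": 50, "fine": 100}  # tokens per second
    return {name: int(rate * seconds) for name, rate in rates.items()}

print(audiolm_token_counts(10))
# {'semantic': 250, 'coarse': 500, 'fine': 1000}
```

This is why AudioLM generates coarse-to-fine: modeling the fine acoustic stream directly would mean far longer sequences with much weaker long-range structure.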
2.2 MusicGen¶
Meta's music generation model:
import torch
import numpy as np
from transformers import AutoProcessor, MusicgenForConditionalGeneration
import scipy.io.wavfile as wavfile

class MusicGenerator:
    """Music generation with MusicGen."""

    def __init__(self, model_size: str = "small"):
        """
        model_size: "small" (300M), "medium" (1.5B), "large" (3.3B)
        """
        model_name = f"facebook/musicgen-{model_size}"
        self.processor = AutoProcessor.from_pretrained(model_name)
        self.model = MusicgenForConditionalGeneration.from_pretrained(model_name)
        if torch.cuda.is_available():
            self.model = self.model.to("cuda")

    def generate_from_text(
        self,
        prompt: str,
        duration: float = 10.0,
        temperature: float = 1.0,
        guidance_scale: float = 3.0
    ):
        """Generate music from a text prompt."""
        inputs = self.processor(
            text=[prompt],
            padding=True,
            return_tensors="pt"
        )
        if torch.cuda.is_available():
            inputs = {k: v.to("cuda") for k, v in inputs.items()}

        # Token count (32 kHz audio, 50 tokens/second)
        max_new_tokens = int(duration * 50)

        audio_values = self.model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            guidance_scale=guidance_scale
        )
        return audio_values[0, 0].cpu().numpy()

    def generate_with_melody(
        self,
        prompt: str,
        melody_audio: torch.Tensor,
        duration: float = 10.0
    ):
        """Melody-conditioned generation (melody checkpoints only)."""
        inputs = self.processor(
            text=[prompt],
            audio=melody_audio,
            sampling_rate=32000,
            padding=True,
            return_tensors="pt"
        )
        max_new_tokens = int(duration * 50)
        audio_values = self.model.generate(
            **inputs,
            max_new_tokens=max_new_tokens
        )
        return audio_values[0, 0].cpu().numpy()

    def save_audio(self, audio: np.ndarray, path: str):
        """Save audio to a WAV file (MusicGen outputs 32 kHz)."""
        wavfile.write(path, 32000, audio)
# Usage examples
def music_generation_examples():
    """Assorted music generation examples."""
    generator = MusicGenerator("small")

    # Text-conditioned generation
    prompts = [
        "A calm piano melody with soft strings in the background",
        "Upbeat electronic dance music with heavy bass drops",
        "Traditional Korean music with gayageum and janggu",
        "Jazz trio improvisation with drums, bass, and piano"
    ]
    for i, prompt in enumerate(prompts):
        audio = generator.generate_from_text(
            prompt,
            duration=15.0,
            temperature=0.9,
            guidance_scale=3.5
        )
        generator.save_audio(audio, f"music_{i}.wav")
        print(f"Generated: {prompt[:50]}...")
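MusicGen decodes EnCodec tokens at roughly 50 steps per second of 32 kHz audio, which is why the code above computes max_new_tokens as duration * 50. As a tiny helper (the function name is ours):

```python
def duration_to_tokens(seconds, frame_rate=50):
    """Convert a target duration to MusicGen decoder steps (EnCodec at ~50 Hz)."""
    return int(seconds * frame_rate)

assert duration_to_tokens(15.0) == 750   # the 15-second clips above
assert duration_to_tokens(30.0) == 1500
```

Longer clips therefore cost proportionally more autoregressive steps, which dominates generation latency.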
2.3 AudioCraft (Audio Diffusion)¶
# AudioGen (sound-effect generation)
def generate_sound_effects():
    """Generate sound effects with AudioGen."""
    from audiocraft.models import AudioGen
    from audiocraft.data.audio import audio_write

    model = AudioGen.get_pretrained("facebook/audiogen-medium")
    model.set_generation_params(duration=5)

    descriptions = [
        "Dog barking in the distance",
        "Thunder and heavy rain",
        "Car engine starting and driving away"
    ]
    wav = model.generate(descriptions)
    for i, one_wav in enumerate(wav):
        audio_write(f"sound_{i}", one_wav.cpu(), model.sample_rate)
3. Video Understanding Models¶
3.1 Video-LLaMA / VideoLLaMA 2¶
VideoLLaMA architecture:
+-----------------------------------------------------+
|                    Video Input                      |
|           [Frame1, Frame2, ..., FrameN]             |
|                         |                           |
|  +-----------------------------------------------+  |
|  |          Visual Encoder (ViT/CLIP)            |  |
|  |          - Frame-level features               |  |
|  +-----------------------------------------------+  |
|                         |                           |
|  +-----------------------------------------------+  |
|  |               Video Q-Former                  |  |
|  |          - Temporal aggregation               |  |
|  |          - Cross-attention with queries       |  |
|  +-----------------------------------------------+  |
|                         |                           |
|                 Video Embeddings                    |
|                         +                           |
|           Audio Embeddings (ImageBind)              |
|                         |                           |
|  +-----------------------------------------------+  |
|  |                 LLM Backbone                  |  |
|  |                (LLaMA/Vicuna)                 |  |
|  +-----------------------------------------------+  |
|                         |                           |
|                   Text Response                     |
+-----------------------------------------------------+
import torch
import numpy as np
from transformers import AutoProcessor, AutoModelForVision2Seq

class VideoUnderstanding:
    """Video understanding model."""

    def __init__(self, model_name: str = "DAMO-NLP-SG/Video-LLaMA-2-7B"):
        self.processor = AutoProcessor.from_pretrained(model_name)
        self.model = AutoModelForVision2Seq.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )

    def extract_frames(
        self,
        video_path: str,
        num_frames: int = 8,
        uniform: bool = True
    ):
        """Extract frames from a video."""
        import cv2

        cap = cv2.VideoCapture(video_path)
        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

        if uniform:
            # Uniform sampling
            indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
        else:
            # Random sampling
            indices = sorted(np.random.choice(total_frames, num_frames, replace=False))

        frames = []
        for idx in indices:
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ret, frame = cap.read()
            if ret:
                frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                frames.append(frame)
        cap.release()
        return frames

    def analyze_video(
        self,
        video_path: str,
        question: str,
        num_frames: int = 8
    ):
        """Analyze a video and answer a question about it."""
        frames = self.extract_frames(video_path, num_frames)

        # Prepare inputs
        inputs = self.processor(
            text=question,
            images=frames,
            return_tensors="pt"
        ).to(self.model.device)

        # Generate
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=True,
            temperature=0.7
        )
        response = self.processor.decode(outputs[0], skip_special_tokens=True)
        return response
# Video Captioning
class VideoCaptioner:
    """Video captioning via per-frame BLIP-2 captions."""

    def __init__(self):
        from transformers import Blip2Processor, Blip2ForConditionalGeneration

        self.processor = Blip2Processor.from_pretrained(
            "Salesforce/blip2-opt-2.7b"
        )
        self.model = Blip2ForConditionalGeneration.from_pretrained(
            "Salesforce/blip2-opt-2.7b",
            torch_dtype=torch.float16
        )

    def caption_video(
        self,
        video_path: str,
        num_frames: int = 5
    ):
        """Generate a caption for a video."""
        frames = self._extract_frames(video_path, num_frames)

        captions = []
        for frame in frames:
            inputs = self.processor(images=frame, return_tensors="pt")
            output = self.model.generate(**inputs, max_new_tokens=50)
            caption = self.processor.decode(output[0], skip_special_tokens=True)
            captions.append(caption)

        # Merge per-frame captions
        summary = self._summarize_captions(captions)
        return summary

    def _extract_frames(self, video_path: str, num_frames: int):
        """Uniformly sample frames (same logic as VideoUnderstanding.extract_frames)."""
        import cv2
        cap = cv2.VideoCapture(video_path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        for idx in np.linspace(0, total - 1, num_frames, dtype=int):
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ret, frame = cap.read()
            if ret:
                frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        cap.release()
        return frames

    def _summarize_captions(self, captions: list) -> str:
        """Merge per-frame captions into a video-level summary."""
        # Naive concatenation; in practice, summarize with an LLM.
        return " → ".join(captions)
3.2 Video Generation Concepts (Sora)¶
Core concepts behind Sora:
+----------------------------------------------------------+
|                       Text Prompt                        |
|  "A cat playing piano in a cozy room with warm light"    |
|                            |                             |
|  +----------------------------------------------------+  |
|  |             Text Encoder (T5/CLIP)                 |  |
|  |             - Semantic understanding               |  |
|  +----------------------------------------------------+  |
|                            |                             |
|  +----------------------------------------------------+  |
|  |            Spacetime Latent Patches                |  |
|  |            - Video as 3D patches                   |  |
|  |            - Compress H x W x T into latent        |  |
|  +----------------------------------------------------+  |
|                            |                             |
|  +----------------------------------------------------+  |
|  |             Diffusion Transformer                  |  |
|  |             - DiT backbone                         |  |
|  |             - Attention over spacetime             |  |
|  |             - Variable resolution/duration         |  |
|  +----------------------------------------------------+  |
|                            |                             |
|  +----------------------------------------------------+  |
|  |                  VAE Decoder                       |  |
|  |             - Latent -> Pixel space                |  |
|  +----------------------------------------------------+  |
|                            |                             |
|                       Video Output                       |
|               (Variable length, up to 1 min)             |
+----------------------------------------------------------+
Key techniques:
1. Spacetime patches: split the video into spatio-temporal patches
2. DiT (Diffusion Transformer): Transformer-based diffusion backbone
3. Variable resolution: supports diverse resolutions and durations
4. Recaptioning: retrain on detailed, regenerated captions
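Spacetime patching determines the transformer's sequence length: a latent of shape (T, H, W) with temporal patch size t_p and spatial patch size p yields (T/t_p) x (H/p) x (W/p) tokens. A quick sketch, using the same default sizes as the conceptual implementation below:

```python
# Token-count arithmetic for spacetime patches (3D tokenization).
def num_spacetime_patches(frames, height, width, t_patch=2, s_patch=16):
    assert frames % t_patch == 0 and height % s_patch == 0 and width % s_patch == 0
    return (frames // t_patch) * (height // s_patch) * (width // s_patch)

# 16 frames of a 256x256 latent with 2x16x16 patches:
print(num_spacetime_patches(16, 256, 256))  # 8 * 16 * 16 = 2048
```

Doubling the spatial resolution quadruples the token count, which is why video DiTs operate on a compressed VAE latent rather than raw pixels.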
# Conceptual implementation of a simple video diffusion model
import math
import torch
import torch.nn as nn
from einops import rearrange

class SpacetimePatchEmbed(nn.Module):
    """Spacetime patch embedding."""

    def __init__(
        self,
        img_size: int = 256,
        patch_size: int = 16,
        num_frames: int = 16,
        temporal_patch: int = 2,
        in_channels: int = 4,  # VAE latent channels
        embed_dim: int = 768
    ):
        super().__init__()
        self.patch_size = patch_size
        self.temporal_patch = temporal_patch

        # 3D patch embedding
        self.proj = nn.Conv3d(
            in_channels, embed_dim,
            kernel_size=(temporal_patch, patch_size, patch_size),
            stride=(temporal_patch, patch_size, patch_size)
        )

        # Patch counts
        self.num_spatial_patches = (img_size // patch_size) ** 2
        self.num_temporal_patches = num_frames // temporal_patch
        self.num_patches = self.num_spatial_patches * self.num_temporal_patches

    def forward(self, x):
        """
        Args:
            x: (B, C, T, H, W) - video latent
        Returns:
            patches: (B, N, D) - spacetime patches
        """
        # (B, D, t, h, w)
        x = self.proj(x)
        # (B, D, N) -> (B, N, D)
        x = x.flatten(2).transpose(1, 2)
        return x
class VideoTransformerBlock(nn.Module):
    """Transformer block for video."""

    def __init__(
        self,
        dim: int,
        num_heads: int,
        mlp_ratio: float = 4.0,
        num_spatial_patches: int = 256,
        num_temporal_patches: int = 8
    ):
        super().__init__()
        self.num_spatial = num_spatial_patches
        self.num_temporal = num_temporal_patches

        # Spatial attention
        self.spatial_norm = nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        # Temporal attention
        self.temporal_norm = nn.LayerNorm(dim)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        # FFN
        self.ffn_norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim)
        )

    def forward(self, x, t_emb=None):
        """
        Args:
            x: (B, T*S, D) - spacetime patches
            t_emb: (B, D) - timestep embedding (conditioning, e.g. via
                adaLN, is omitted in this simplified block)
        """
        B, N, D = x.shape
        T, S = self.num_temporal, self.num_spatial

        # Spatial attention (within each frame)
        x_spatial = rearrange(x, 'b (t s) d -> (b t) s d', t=T, s=S)
        x_spatial = self.spatial_norm(x_spatial)
        attn_out, _ = self.spatial_attn(x_spatial, x_spatial, x_spatial)
        x_spatial = rearrange(attn_out, '(b t) s d -> b (t s) d', b=B, t=T)
        x = x + x_spatial

        # Temporal attention (across frames at the same spatial position)
        x_temporal = rearrange(x, 'b (t s) d -> (b s) t d', t=T, s=S)
        x_temporal = self.temporal_norm(x_temporal)
        attn_out, _ = self.temporal_attn(x_temporal, x_temporal, x_temporal)
        x_temporal = rearrange(attn_out, '(b s) t d -> b (t s) d', b=B, s=S)
        x = x + x_temporal

        # FFN
        x = x + self.ffn(self.ffn_norm(x))
        return x
class SimpleDiT(nn.Module):
    """Simplified Diffusion Transformer."""

    def __init__(
        self,
        img_size: int = 256,
        patch_size: int = 16,
        num_frames: int = 16,
        in_channels: int = 4,
        hidden_size: int = 768,
        depth: int = 12,
        num_heads: int = 12
    ):
        super().__init__()

        # Patch embedding
        self.patch_embed = SpacetimePatchEmbed(
            img_size, patch_size, num_frames,
            temporal_patch=2, in_channels=in_channels,
            embed_dim=hidden_size
        )
        num_spatial = (img_size // patch_size) ** 2
        num_temporal = num_frames // 2

        # Position embedding
        self.pos_embed = nn.Parameter(
            torch.zeros(1, num_spatial * num_temporal, hidden_size)
        )

        # Timestep embedding
        self.time_embed = nn.Sequential(
            nn.Linear(hidden_size, hidden_size * 4),
            nn.SiLU(),
            nn.Linear(hidden_size * 4, hidden_size)
        )

        # Transformer blocks
        self.blocks = nn.ModuleList([
            VideoTransformerBlock(
                hidden_size, num_heads,
                num_spatial_patches=num_spatial,
                num_temporal_patches=num_temporal
            )
            for _ in range(depth)
        ])

        # Output projection: one patch of shape (temporal_patch=2, patch, patch, C)
        self.final_norm = nn.LayerNorm(hidden_size)
        self.final_proj = nn.Linear(
            hidden_size,
            patch_size * patch_size * 2 * in_channels
        )

    def forward(self, x, t, cond=None):
        """
        Args:
            x: (B, C, T, H, W) - noisy video latent
            t: (B,) - diffusion timestep
            cond: (B, L, D) - text conditioning (unused in this sketch)
        """
        # Patch embedding
        x = self.patch_embed(x) + self.pos_embed

        # Timestep embedding (sinusoidal)
        t_emb = self._sinusoidal_embedding(t, x.shape[-1])
        t_emb = self.time_embed(t_emb)

        # Transformer blocks
        for block in self.blocks:
            x = block(x, t_emb)

        # Output
        x = self.final_norm(x)
        x = self.final_proj(x)
        return x

    def _sinusoidal_embedding(self, t, dim):
        """Sinusoidal timestep embedding."""
        half = dim // 2
        freqs = torch.exp(
            -math.log(10000) * torch.arange(half, device=t.device) / half
        )
        args = t[:, None] * freqs[None]
        embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
        return embedding
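The sinusoidal timestep embedding can be sanity-checked without torch: at t = 0, every cosine term is 1 and every sine term is 0. A dependency-free mirror of `_sinusoidal_embedding` for a single timestep (the standalone function name is ours):

```python
import math

def sinusoidal_embedding(t, dim):
    """Pure-Python mirror of SimpleDiT._sinusoidal_embedding for one timestep."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000) * i / half) for i in range(half)]
    args = [t * f for f in freqs]
    return [math.cos(a) for a in args] + [math.sin(a) for a in args]

emb = sinusoidal_embedding(0, 8)
print(emb)  # [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```

Distinct timesteps map to distinct, smoothly varying vectors, which is what lets the network condition on how much noise remains.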
4. Practical Applications¶
4.1 Multimodal Pipeline¶
import whisper
from TTS.api import TTS
from transformers import AutoProcessor, MusicgenForConditionalGeneration

class MultimodalPipeline:
    """Unified audio/video pipeline."""

    def __init__(self):
        # Speech recognition
        self.whisper = whisper.load_model("base")

        # Music generation
        self.music_processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
        self.music_model = MusicgenForConditionalGeneration.from_pretrained(
            "facebook/musicgen-small"
        )

        # TTS
        self.tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

    def transcribe_and_translate(
        self,
        audio_path: str,
        target_language: str = "en"
    ):
        """Speech recognition and translation."""
        # Transcribe
        result = self.whisper.transcribe(audio_path)
        original_text = result["text"]
        source_language = result["language"]

        # Translate (into English)
        if source_language != target_language:
            translation = self.whisper.transcribe(
                audio_path,
                task="translate"
            )["text"]
        else:
            translation = original_text

        return {
            "original": original_text,
            "source_language": source_language,
            "translation": translation
        }

    def generate_soundtrack(
        self,
        video_description: str,
        mood: str,
        duration: float = 30.0
    ):
        """Generate background music from a video description."""
        prompt = f"{mood} music for: {video_description}"

        inputs = self.music_processor(
            text=[prompt],
            padding=True,
            return_tensors="pt"
        )
        max_new_tokens = int(duration * 50)

        audio_values = self.music_model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            guidance_scale=4.0
        )
        return audio_values[0, 0].numpy()

    def create_voiceover(
        self,
        script: str,
        reference_voice: str,
        language: str = "en"
    ):
        """Create a voiceover with voice cloning."""
        output_path = "voiceover.wav"
        self.tts.tts_to_file(
            text=script,
            file_path=output_path,
            speaker_wav=reference_voice,
            language=language
        )
        return output_path
# Usage example
def demo_pipeline():
    """Pipeline demo."""
    pipeline = MultimodalPipeline()

    # 1. Transcribe and translate a speech recording
    result = pipeline.transcribe_and_translate(
        "korean_speech.mp3",
        target_language="en"
    )
    print(f"Original: {result['original']}")
    print(f"Translation: {result['translation']}")

    # 2. Generate background music for a video
    music = pipeline.generate_soundtrack(
        video_description="A documentary about ocean wildlife",
        mood="Calm and majestic",
        duration=60.0
    )

    # 3. Create narration
    voiceover = pipeline.create_voiceover(
        script="Welcome to our exploration of the deep ocean.",
        reference_voice="narrator_sample.wav",
        language="en"
    )
4.2 Real-time Processing¶
import asyncio
from collections import deque

import numpy as np
import whisper

class RealTimeAudioProcessor:
    """Real-time audio processing."""

    def __init__(self, buffer_size: float = 3.0):
        self.buffer_size = buffer_size
        self.sample_rate = 16000
        self.audio_buffer = deque(maxlen=int(buffer_size * self.sample_rate))

        # Whisper model (use a small variant for latency)
        self.model = whisper.load_model("tiny")

    async def process_stream(self, audio_stream):
        """Process an audio stream (assumes float32 PCM at 16 kHz)."""
        while True:
            # Receive an audio chunk
            chunk = await audio_stream.receive()
            self.audio_buffer.extend(chunk)

            # Transcribe once the buffer holds enough audio
            if len(self.audio_buffer) >= self.sample_rate * 2:
                audio_array = np.array(self.audio_buffer)

                # Run transcription off the event loop
                result = await asyncio.to_thread(
                    self.model.transcribe,
                    audio_array,
                    fp16=False
                )
                yield result["text"]

                # Keep part of the buffer as overlap between windows
                self.audio_buffer = deque(
                    list(self.audio_buffer)[self.sample_rate:],
                    maxlen=int(self.buffer_size * self.sample_rate)
                )
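process_stream keeps the most recent second of audio when it trims the buffer, so consecutive transcription windows overlap and words straddling a chunk boundary are not cut off. The sliding-window bookkeeping in isolation (integer sample indices stand in for audio data):

```python
from collections import deque

SAMPLE_RATE = 16_000
BUFFER_SECONDS = 3.0

buf = deque(maxlen=int(BUFFER_SECONDS * SAMPLE_RATE))
buf.extend(range(2 * SAMPLE_RATE))  # 2 s of "audio" arrives

# After transcribing, drop the oldest second but keep the rest as overlap.
kept = list(buf)[SAMPLE_RATE:]
buf = deque(kept, maxlen=int(BUFFER_SECONDS * SAMPLE_RATE))

print(len(buf) / SAMPLE_RATE)  # 1.0 second retained as overlap
```

The deque's maxlen additionally caps memory: if chunks arrive faster than transcription completes, the oldest samples are silently evicted rather than growing the buffer without bound.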
class StreamingVideoAnalyzer:
    """Streaming video analysis."""

    def __init__(self, frame_interval: int = 30):
        self.frame_interval = frame_interval
        self.frame_count = 0

        # CLIP for quick frame classification
        from transformers import CLIPProcessor, CLIPModel
        self.processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
        self.model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

    def analyze_frame(self, frame, categories: list):
        """Classify a single frame against text categories."""
        inputs = self.processor(
            text=categories,
            images=frame,
            return_tensors="pt",
            padding=True
        )
        outputs = self.model(**inputs)
        probs = outputs.logits_per_image.softmax(dim=1)
        return {cat: prob.item() for cat, prob in zip(categories, probs[0])}

    def process_video_stream(self, video_stream, categories: list):
        """Process a video stream."""
        import cv2

        while True:
            ret, frame = video_stream.read()
            if not ret:
                break
            self.frame_count += 1

            # Only analyze every `frame_interval`-th frame
            if self.frame_count % self.frame_interval == 0:
                frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                analysis = self.analyze_frame(frame_rgb, categories)
                yield self.frame_count, analysis
5. Model Comparison¶
5.1 Speech Models¶
| Model | Parameters | Features | Use case |
|---|---|---|---|
| Whisper Large | 1.55B | Multilingual, translation | General-purpose ASR |
| Whisper Large-v3 | 1.55B | Improved accuracy | Production |
| wav2vec 2.0 | 300M | Self-supervised | Fine-tuning base |
| HuBERT | 300M-1B | Masked prediction | Speech representation |
5.2 Audio Generation¶
| Model | Size | Features | Output |
|---|---|---|---|
| MusicGen Small | 300M | Fast generation | Music |
| MusicGen Large | 3.3B | High quality | Music |
| AudioGen | 300M-1.5B | Sound effects | Audio |
| Bark | 1B+ | Non-verbal expressions | TTS |
5.3 Video Models¶
| Model | Architecture | Input | Task |
|---|---|---|---|
| VideoLLaMA | LLaMA + Q-Former | Video + Audio | VQA, captioning |
| Video-ChatGPT | LLaVA variant | Video | Conversation |
| TimeSformer | Divided attention | Video | Classification |
| ViViT | Factorized attention | Video | Classification |
Key Takeaways¶
Audio Foundation Models¶
Whisper: general-purpose ASR + translation
├── Encoder-decoder Transformer
├── Trained on 680K hours of audio
└── Multilingual (99 languages)
MusicGen: text → music
├── Autoregressive Transformer
├── EnCodec tokenization
└── Text/melody conditioning
Video Foundation Models¶
Video Understanding:
├── Frame sampling → visual encoder
├── Temporal aggregation (Q-Former/pooling)
└── LLM backbone for reasoning
Video Generation (Sora concept):
├── Spacetime patches (3D tokenization)
├── Diffusion Transformer (DiT)
└── Variable resolution/duration
Practical Tips¶
- Whisper: fine-tuning enables domain specialization
- MusicGen: trade off quality vs. diversity with guidance_scale
- Video: the frame sampling strategy matters
- Real-time: use small models plus a streaming buffer
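On the frame-sampling point: uniform sampling (evenly spaced indices across the whole clip) is the common default because it guarantees temporal coverage. A dependency-free version of the np.linspace-based sampling used in the video examples earlier (the function name is ours):

```python
def uniform_frame_indices(total_frames, num_frames):
    """Evenly spaced frame indices over [0, total_frames - 1]."""
    if num_frames == 1:
        return [0]
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]

print(uniform_frame_indices(300, 8))  # [0, 43, 85, 128, 171, 214, 256, 299]
```

Random or shot-boundary-aware sampling can capture fast events that uniform sampling misses, at the cost of reproducibility.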
References¶
- Radford et al. (2022). "Robust Speech Recognition via Large-Scale Weak Supervision" (Whisper)
- Copet et al. (2023). "Simple and Controllable Music Generation" (MusicGen)
- Borsos et al. (2023). "AudioLM: a Language Modeling Approach to Audio Generation"
- Zhang et al. (2023). "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model"
- OpenAI Sora Technical Report (2024)