18. Audio/Video Foundation Models
Overview¶
Foundation models for audio and video cover a range of multimedia tasks, including speech recognition and synthesis, music and sound generation, and video understanding and generation.
1. Speech Foundation Models¶
1.1 Whisper¶
OpenAI's general-purpose speech recognition model:
Whisper Architecture:
┌─────────────────────────────────────────────┐
│ Audio Input (30-second segments) │
│ ↓ │
│ Log-Mel Spectrogram (80 bins) │
│ ↓ │
│ ┌──────────────────────┐ │
│ │ Audio Encoder │ │
│ │ (Transformer) │ │
│ │ - Conv1d stem │ │
│ │ - Sinusoidal pos │ │
│ │ - N layers │ │
│ └──────────────────────┘ │
│ ↓ │
│ Audio Features │
│ ↓ │
│ ┌──────────────────────┐ │
│ │ Text Decoder │ │
│ │ (Transformer) │ │
│ │ - Cross-attention │ │
│ │ - Causal masking │ │
│ └──────────────────────┘ │
│ ↓ │
│ Text Output (Transcription/Translation) │
└─────────────────────────────────────────────┘
Model sizes (speed is approximate and relative to the large model):
- tiny: 39M params, ~32x
- base: 74M params, ~16x
- small: 244M params, ~6x
- medium: 769M params, ~2x
- large: 1.55B params, 1x
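To make the input shapes in the diagram concrete, the sketch below works out how many spectrogram frames and encoder positions a 30-second window yields. It assumes Whisper's published front-end settings (16 kHz audio, a 10 ms hop, 80 mel bins, and a stride-2 convolution in the encoder stem). The function name `whisper_input_shape` is illustrative and not part of any library.

```python
def whisper_input_shape(seconds=30.0, sample_rate=16_000, hop_length=160,
                        n_mels=80, conv_stride=2):
    """Return sample, spectrogram, and encoder sizes for one audio window."""
    n_samples = int(seconds * sample_rate)   # raw waveform length
    n_frames = n_samples // hop_length       # one mel frame per hop (10 ms)
    n_positions = n_frames // conv_stride    # the conv stem halves the time axis
    return {
        "samples": n_samples,
        "mel_shape": (n_mels, n_frames),
        "encoder_positions": n_positions,
    }


shape = whisper_input_shape()
print(shape)
```

Shorter clips are padded to the same 30-second window, so under these assumptions the encoder always sees the same number of positions.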
import torch
import whisper
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Using OpenAI whisper
def transcribe_with_whisper():
    """Speech recognition with OpenAI Whisper"""
    model = whisper.load_model("base")

    # Basic transcription
    result = model.transcribe("audio.mp3")
    print(result["text"])

    # Language detection and translation
    result = model.transcribe(
        "audio.mp3",
        task="translate",  # Translate to English
        language=None,     # Auto-detect
        fp16=torch.cuda.is_available()
    )

    # With timestamps
    result = model.transcribe(
        "audio.mp3",
        word_timestamps=True
    )
    for segment in result["segments"]:
        print(f"[{segment['start']:.2f} - {segment['end']:.2f}] {segment['text']}")
    return result

# Using HuggingFace Transformers
def transcribe_with_hf_whisper():
    """Using HuggingFace Whisper"""
    processor = WhisperProcessor.from_pretrained("openai/whisper-base")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")

    # Load audio (resampled to 16kHz, the rate Whisper expects)
    audio, sr = librosa.load("audio.mp3", sr=16000)

    # Process input
    input_features = processor(
        audio,
        sampling_rate=16000,
        return_tensors="pt"
    ).input_features

    # Generate
    predicted_ids = model.generate(
        input_features,
        language="korean",
        task="transcribe"
    )

    # Decode
    transcription = processor.batch_decode(
        predicted_ids,
        skip_special_tokens=True
    )[0]
    return transcription
# Whisper Fine-tuning
class WhisperFineTuner:
    """Domain-specific Whisper Fine-tuning"""

    def __init__(self, model_name: str = "openai/whisper-small", freeze_encoder: bool = True):
        from transformers import WhisperForConditionalGeneration, WhisperProcessor

        self.processor = WhisperProcessor.from_pretrained(model_name)
        self.model = WhisperForConditionalGeneration.from_pretrained(model_name)

        # Freeze encoder (optional): only the decoder is updated
        if freeze_encoder:
            for param in self.model.model.encoder.parameters():
                param.requires_grad = False

    def prepare_dataset(self, dataset):
        """Dataset preprocessing"""
        def prepare_example(example):
            audio = example["audio"]["array"]

            # Extract input features (audio must already be sampled at 16kHz)
            input_features = self.processor(
                audio,
                sampling_rate=16000,
                return_tensors="pt"
            ).input_features[0]

            # Tokenize labels
            labels = self.processor.tokenizer(
                example["transcription"]
            ).input_ids

            return {
                "input_features": input_features,
                "labels": labels
            }

        return dataset.map(prepare_example)

    def train(self, train_dataset, eval_dataset, data_collator=None):
        """Run fine-tuning

        data_collator should pad input_features and labels per batch;
        label sequences differ in length, so the default collator is not enough.
        """
        from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

        training_args = Seq2SeqTrainingArguments(
            output_dir="./whisper-finetuned",
            per_device_train_batch_size=8,
            gradient_accumulation_steps=2,
            learning_rate=1e-5,
            warmup_steps=500,
            num_train_epochs=3,
            evaluation_strategy="steps",  # named eval_strategy in newer transformers releases
            eval_steps=500,
            save_steps=1000,
            fp16=True,  # requires a CUDA device
            predict_with_generate=True,
            generation_max_length=225
        )

        trainer = Seq2SeqTrainer(
            model=self.model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            data_collator=data_collator,
            tokenizer=self.processor.feature_extractor,
        )
        trainer.train()
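Tokenized transcripts differ in length, so batching them for training needs a padding step. The sketch below shows only that step, in plain Python: label sequences are right-padded with `-100`, the value PyTorch's cross-entropy loss ignores by default. The helper name `pad_labels` and the token ids are made up for illustration; a real collator would also stack the audio features into tensors.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def pad_labels(label_batch, ignore_index=IGNORE_INDEX):
    """Right-pad every label sequence to the longest one in the batch."""
    max_len = max(len(labels) for labels in label_batch)
    return [
        list(labels) + [ignore_index] * (max_len - len(labels))
        for labels in label_batch
    ]


# Two tokenized transcripts of different lengths (ids are illustrative)
batch = [[50258, 7, 8, 50257], [50258, 9, 50257]]
padded = pad_labels(batch)
print(padded)
```

Padding with the ignore index rather than a real token id keeps the padded positions from contributing to the loss.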
1.2 Speech Synthesis (TTS)¶
# Coqui TTS (XTTS v2)
def text_to_speech_coqui():
    """Speech synthesis with Coqui TTS"""
    from TTS.api import TTS

    # Multilingual TTS
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

    # Speech synthesis
    tts.tts_to_file(
        text="Hello, this is Foundation Model learning material.",
        file_path="output.wav",
        speaker_wav="reference_voice.wav",  # Voice cloning
        language="en"
    )

# Bark (Suno AI)
def text_to_speech_bark():
    """Speech synthesis with Bark (including non-verbal expressions)"""
    from transformers import AutoProcessor, BarkModel
    import scipy.io.wavfile as wavfile

    processor = AutoProcessor.from_pretrained("suno/bark")
    model = BarkModel.from_pretrained("suno/bark")

    # Text (can include non-verbal expressions)
    text = "[laughs] Hello! This is amazing. [sighs]"
    inputs = processor(text, return_tensors="pt")
    audio_array = model.generate(**inputs)
    audio_array = audio_array.cpu().numpy().squeeze()

    # Save at the model's own sample rate (24kHz for Bark)
    sample_rate = model.generation_config.sample_rate
    wavfile.write("bark_output.wav", rate=sample_rate, data=audio_array)
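Both TTS examples end by writing a waveform to disk. As a dependency-free illustration of what that step involves, the sketch below encodes mono float samples as 16-bit PCM WAV using only the standard library. The helper name `float_to_wav_bytes` is hypothetical, and the 24 kHz rate simply mirrors the Bark example.

```python
import io
import struct
import wave


def float_to_wav_bytes(samples, sample_rate):
    """Encode mono float samples in [-1, 1] as 16-bit PCM WAV bytes."""
    pcm = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clip to the valid range
        pcm += struct.pack("<h", int(round(s * 32767)))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pcm))
    return buffer.getvalue()


wav_bytes = float_to_wav_bytes([0.0, 1.0, -1.0], sample_rate=24000)
```

Writing the wrong sample rate into the header does not change the samples, but it makes players reproduce the audio at the wrong speed and pitch, which is why the rate should come from the model rather than a guess.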
2. Audio Generation Models¶
2.1 AudioLM¶
Google's Audio Language Model:
AudioLM Structure:
┌────────────────────────────────────────────────────┐
│ Audio Input │
│ ↓ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Semantic Tokens (w2v-BERT) │ │
│ │ - High-level content │ │
│ │ - ~25 tokens/second │ │
│ └──────────────────────────────────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Coarse Acoustic Tokens (SoundStream) │ │
│ │ - Medium-level details │ │
│ │ - 50 frames/second, first quantizer levels │ │
│ └──────────────────────────────────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Fine Acoustic Tokens (SoundStream) │ │
│ │ - Fine-grained details │ │
│ │ - 50 frames/second, remaining quantizer levels │ │
│ └──────────────────────────────────────────────┘ │
│ ↓ │
│ SoundStream Decoder │
│ ↓ │
│ Audio Output │
└────────────────────────────────────────────────────┘
3-stage generation:
1. Semantic → Semantic (continuation)
2. Semantic → Coarse Acoustic
3. Coarse → Fine Acoustic
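To get a feel for how sequence length grows across the three stages, the sketch below counts tokens per stage for a clip of a given length. It assumes the codec emits one token per quantizer level per frame, so the per-second totals for the acoustic stages depend on how many levels each stage models. The default rates and level counts (25 Hz semantic tokens, 50 Hz codec frames, 12 quantizer levels with the first 4 treated as coarse) are assumptions for illustration, and `audiolm_token_budget` is a hypothetical helper.

```python
def audiolm_token_budget(seconds, semantic_rate=25, codec_frame_rate=50,
                         num_quantizers=12, num_coarse=4):
    """Count tokens per stage for a clip, under the stated assumptions."""
    frames = int(seconds * codec_frame_rate)
    return {
        "semantic": int(seconds * semantic_rate),
        "coarse": frames * num_coarse,                  # first quantizer levels
        "fine": frames * (num_quantizers - num_coarse), # remaining levels
    }


budget = audiolm_token_budget(10)
print(budget)
```

The growth from stage to stage is one reason the generation is split: each stage only has to model a sequence conditioned on the shorter, more abstract one before it.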
2.2 MusicGen¶
Meta's music generation model:
import numpy as np
import scipy.io.wavfile as wavfile
import torch
from transformers import AutoProcessor, MusicgenForConditionalGeneration

class MusicGenerator:
    """Music generation using MusicGen"""

    def __init__(self, model_size: str = "small"):
        """
        model_size: "small" (300M), "medium" (1.5B), "large" (3.3B)
        """
        model_name = f"facebook/musicgen-{model_size}"
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.processor = AutoProcessor.from_pretrained(model_name)
        self.model = MusicgenForConditionalGeneration.from_pretrained(model_name).to(self.device)
        # Output sample rate of the audio codec (32kHz for MusicGen)
        self.sample_rate = self.model.config.audio_encoder.sampling_rate

    def generate_from_text(
        self,
        prompt: str,
        duration: float = 10.0,
        temperature: float = 1.0,
        guidance_scale: float = 3.0
    ):
        """Generate music from text prompt"""
        inputs = self.processor(
            text=[prompt],
            padding=True,
            return_tensors="pt"
        )
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        # Calculate token count (32kHz audio, 50 tokens/second)
        max_new_tokens = int(duration * 50)
        audio_values = self.model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            guidance_scale=guidance_scale
        )
        return audio_values[0, 0].cpu().numpy()

    def generate_with_melody(
        self,
        prompt: str,
        melody_audio: np.ndarray,
        duration: float = 10.0
    ):
        """Audio-prompted generation.

        With MusicgenForConditionalGeneration the audio acts as a prompt that the
        model continues. True melody conditioning needs a facebook/musicgen-melody
        checkpoint loaded with MusicgenMelodyForConditionalGeneration.
        """
        inputs = self.processor(
            text=[prompt],
            audio=melody_audio,   # mono waveform sampled at 32kHz
            sampling_rate=32000,
            padding=True,
            return_tensors="pt"
        )
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        max_new_tokens = int(duration * 50)
        audio_values = self.model.generate(
            **inputs,
            max_new_tokens=max_new_tokens
        )
        return audio_values[0, 0].cpu().numpy()

    def save_audio(self, audio: np.ndarray, path: str):
        """Save audio"""
        wavfile.write(path, self.sample_rate, audio)
# Usage examples
def music_generation_examples():
    """Various music generation examples"""
    generator = MusicGenerator("small")

    # Text-based generation
    prompts = [
        "A calm piano melody with soft strings in the background",
        "Upbeat electronic dance music with heavy bass drops",
        "Traditional Korean music with gayageum and janggu",
        "Jazz trio improvisation with drums, bass, and piano"
    ]

    for i, prompt in enumerate(prompts):
        audio = generator.generate_from_text(
            prompt,
            duration=15.0,
            temperature=0.9,
            guidance_scale=3.5
        )
        generator.save_audio(audio, f"music_{i}.wav")
        print(f"Generated: {prompt[:50]}...")
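The `guidance_scale` argument used above is a classifier-free guidance weight. The sketch below shows the usual formulation in plain Python, assuming the model scores each next token twice, once with the text prompt and once without it. The function name `apply_guidance` and the logit values are illustrative; the library's internal implementation may differ in detail.

```python
def apply_guidance(cond_logits, uncond_logits, guidance_scale):
    """Push logits away from the unconditional prediction, toward the prompt."""
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(cond_logits, uncond_logits)
    ]


cond = [2.0, 0.5, -1.0]    # logits with the text prompt
uncond = [1.0, 1.0, 0.0]   # logits without the prompt
guided = apply_guidance(cond, uncond, guidance_scale=3.0)
print(guided)
```

A scale of 1.0 reproduces the conditional logits unchanged. Larger values exaggerate whatever the prompt changed, which tends to improve prompt adherence at the cost of diversity.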
2.3 AudioCraft (AudioGen)¶
# AudioGen (sound effects generation)
def generate_sound_effects():
    """Generate sound effects with AudioGen"""
    from audiocraft.models import AudioGen
    from audiocraft.data.audio import audio_write

    model = AudioGen.get_pretrained("facebook/audiogen-medium")
    model.set_generation_params(duration=5)  # seconds per clip

    descriptions = [
        "Dog barking in the distance",
        "Thunder and heavy rain",
        "Car engine starting and driving away"
    ]

    # One waveform per description
    wav = model.generate(descriptions)
    for i, one_wav in enumerate(wav):
        audio_write(f"sound_{i}", one_wav.cpu(), model.sample_rate)
3. Video Understanding Models¶
3.1 Video-LLaMA / VideoLLaMA 2¶
VideoLLaMA Architecture:
┌─────────────────────────────────────────────────────┐
│ Video Input │
│ [Frame1, Frame2, ..., FrameN] │
│ ↓ │
│ ┌────────────────────────────────────────────────┐ │
│ │ Visual Encoder (ViT/CLIP) │ │
│ │ - Frame-level features │ │
│ └────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌────────────────────────────────────────────────┐ │
│ │ Video Q-Former │ │
│ │ - Temporal aggregation │ │
│ │ - Cross-attention with queries │ │
│ └────────────────────────────────────────────────┘ │
│ ↓ │
│ Video Embeddings │
│ + │
│ Audio Embeddings (ImageBind) │
│ ↓ │
│ ┌────────────────────────────────────────────────┐ │
│ │ LLM Backbone │ │
│ │ (LLaMA/Vicuna) │ │
│ └────────────────────────────────────────────────┘ │
│ ↓ │
│ Text Response │
└─────────────────────────────────────────────────────┘
import cv2
import numpy as np
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

class VideoUnderstanding:
    """Video understanding model"""

    # NOTE: the default checkpoint name is illustrative. Check that the checkpoint
    # you use ships a transformers-compatible processor and model before relying
    # on the Auto classes.
    def __init__(self, model_name: str = "DAMO-NLP-SG/Video-LLaMA-2-7B"):
        self.processor = AutoProcessor.from_pretrained(model_name)
        self.model = AutoModelForVision2Seq.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )

    def extract_frames(
        self,
        video_path: str,
        num_frames: int = 8,
        uniform: bool = True
    ):
        """Extract frames from video"""
        cap = cv2.VideoCapture(video_path)
        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        # Cannot sample more frames than the video contains
        num_frames = min(num_frames, total_frames)

        if uniform:
            # Uniform sampling
            indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
        else:
            # Random sampling (sorted to keep temporal order)
            indices = sorted(np.random.choice(total_frames, num_frames, replace=False))

        frames = []
        for idx in indices:
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
            ret, frame = cap.read()
            if ret:
                # OpenCV decodes to BGR; vision models expect RGB
                frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                frames.append(frame)
        cap.release()
        return frames

    def analyze_video(
        self,
        video_path: str,
        question: str,
        num_frames: int = 8
    ):
        """Video analysis and question answering"""
        frames = self.extract_frames(video_path, num_frames)

        # Prepare input
        inputs = self.processor(
            text=question,
            images=frames,
            return_tensors="pt"
        ).to(self.model.device)

        # Generate
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=True,
            temperature=0.7
        )
        response = self.processor.decode(outputs[0], skip_special_tokens=True)
        return response
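# Video Captioning
class VideoCaptioner:
    """Video captioning"""

    def __init__(self):
        # BLIP-2 checkpoints need the Blip2* classes, not the original Blip* ones
        from transformers import Blip2Processor, Blip2ForConditionalGeneration

        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        # float16 is only practical on GPU
        self.dtype = torch.float16 if self.device == "cuda" else torch.float32
        self.processor = Blip2Processor.from_pretrained(
            "Salesforce/blip2-opt-2.7b"
        )
        self.model = Blip2ForConditionalGeneration.from_pretrained(
            "Salesforce/blip2-opt-2.7b",
            torch_dtype=self.dtype
        ).to(self.device)

    def _extract_frames(self, video_path: str, num_frames: int):
        """Uniformly sample RGB frames with OpenCV"""
        import cv2
        import numpy as np

        cap = cv2.VideoCapture(video_path)
        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        num_frames = min(num_frames, total_frames)
        frames = []
        for idx in np.linspace(0, total_frames - 1, num_frames, dtype=int):
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
            ret, frame = cap.read()
            if ret:
                frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        cap.release()
        return frames

    def caption_video(
        self,
        video_path: str,
        num_frames: int = 5
    ):
        """Generate video caption"""
        frames = self._extract_frames(video_path, num_frames)

        captions = []
        for frame in frames:
            inputs = self.processor(images=frame, return_tensors="pt").to(self.device, self.dtype)
            output = self.model.generate(**inputs, max_new_tokens=50)
            caption = self.processor.decode(output[0], skip_special_tokens=True)
            captions.append(caption.strip())

        # Integrate captions
        summary = self._summarize_captions(captions)
        return summary

    def _summarize_captions(self, captions: list) -> str:
        """Integrate frame captions into video summary"""
        # Simple integration (using LLM recommended in practice):
        # drop consecutive duplicates, then join in temporal order
        merged = []
        for caption in captions:
            if not merged or caption != merged[-1]:
                merged.append(caption)
        return " → ".join(merged)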
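Frame sampling decides which parts of a video the model ever sees, so it helps to look at the index arithmetic on its own. The sketch below picks evenly spaced frame indices using integer arithmetic only, similar in spirit to the `np.linspace` call in the code above. The function name `uniform_frame_indices` is hypothetical.

```python
def uniform_frame_indices(total_frames, num_frames):
    """Pick evenly spaced frame indices, always including first and last."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    num_frames = min(num_frames, total_frames)  # cannot exceed the video length
    if num_frames == 1:
        return [0]
    last = total_frames - 1
    # Integer floor division avoids floating-point rounding surprises
    return [i * last // (num_frames - 1) for i in range(num_frames)]


indices = uniform_frame_indices(total_frames=300, num_frames=8)
print(indices)
```

For a 300-frame clip at 30 fps, eight uniformly sampled frames are more than a second apart, so short events between samples can be missed entirely. That is the trade-off behind choosing `num_frames`.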
3.2 Video Generation Concept (Sora)¶
Sora Core Concepts:
┌────────────────────────────────────────────────────────┐
│ Text Prompt │
│ "A cat playing piano in a cozy room with warm light" │
│ ↓ │
│ ┌────────────────────────────────────────────────────┐│
│ │ Text Encoder (T5/CLIP) ││
│ │ - Semantic understanding ││
│ └────────────────────────────────────────────────────┘│
│ ↓ │
│ ┌────────────────────────────────────────────────────┐│
│ │ Spacetime Latent Patches ││
│ │ - Video as 3D patches ││
│ │ - Compress H×W×T into latent ││
│ └────────────────────────────────────────────────────┘│
│ ↓ │
│ ┌────────────────────────────────────────────────────┐│
│ │ Diffusion Transformer ││
│ │ - DiT backbone ││
│ │ - Attention over spacetime ││
│ │ - Variable resolution/duration ││
│ └────────────────────────────────────────────────────┘│
│ ↓ │
│ ┌────────────────────────────────────────────────────┐│
│ │ VAE Decoder ││
│ │ - Latent → Pixel space ││
│ └────────────────────────────────────────────────────┘│
│ ↓ │
│ Video Output │
│ (Variable length, up to 1 min) │
└────────────────────────────────────────────────────────┘
Key techniques:
1. Spacetime Patches: Divide spacetime into patches
2. DiT (Diffusion Transformer): Transformer-based diffusion
3. Variable Resolution: Support various resolutions/lengths
4. Recaptioning: Train on detailed captions produced by a captioning model
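The spacetime-patch idea can be made concrete by counting tokens. The sketch below computes how many patches a latent video of a given size yields, which is the sequence length the transformer has to attend over. The patch sizes are illustrative assumptions (Sora's actual values are not public), and `spacetime_token_count` is a hypothetical helper.

```python
def spacetime_token_count(num_frames, height, width,
                          temporal_patch=2, patch_size=16):
    """Return the (time, height, width) patch grid and the total token count."""
    if num_frames % temporal_patch or height % patch_size or width % patch_size:
        raise ValueError("video dimensions must be divisible by the patch sizes")
    t = num_frames // temporal_patch
    h = height // patch_size
    w = width // patch_size
    return t, h, w, t * h * w


# A square 16-frame clip and a longer, wider one
print(spacetime_token_count(16, 256, 256))
print(spacetime_token_count(32, 256, 512))
```

Because the token count simply scales with duration and resolution, the same transformer can in principle process clips of different shapes, which is what makes variable resolution and length possible.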
# Simple Video Diffusion concept implementation
import math

import torch
import torch.nn as nn
from einops import rearrange

class SpacetimePatchEmbed(nn.Module):
    """Spacetime patch embedding"""

    def __init__(
        self,
        img_size: int = 256,
        patch_size: int = 16,
        num_frames: int = 16,
        temporal_patch: int = 2,
        in_channels: int = 4,  # VAE latent
        embed_dim: int = 768
    ):
        super().__init__()
        self.patch_size = patch_size
        self.temporal_patch = temporal_patch

        # 3D patch embedding: non-overlapping (time, height, width) patches
        self.proj = nn.Conv3d(
            in_channels, embed_dim,
            kernel_size=(temporal_patch, patch_size, patch_size),
            stride=(temporal_patch, patch_size, patch_size)
        )

        # Calculate patch counts
        self.num_spatial_patches = (img_size // patch_size) ** 2
        self.num_temporal_patches = num_frames // temporal_patch
        self.num_patches = self.num_spatial_patches * self.num_temporal_patches

    def forward(self, x):
        """
        Args:
            x: (B, C, T, H, W) - video latent
        Returns:
            patches: (B, N, D) - spacetime patches
        """
        # (B, D, t, h, w)
        x = self.proj(x)
        # (B, D, N) -> (B, N, D)
        x = x.flatten(2).transpose(1, 2)
        return x
class VideoTransformerBlock(nn.Module):
    """Transformer block for video"""

    def __init__(
        self,
        dim: int,
        num_heads: int,
        mlp_ratio: float = 4.0,
        num_spatial_patches: int = 256,
        num_temporal_patches: int = 8
    ):
        super().__init__()
        self.num_spatial = num_spatial_patches
        self.num_temporal = num_temporal_patches

        # Spatial attention
        self.spatial_norm = nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        # Temporal attention
        self.temporal_norm = nn.LayerNorm(dim)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        # FFN
        self.ffn_norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim)
        )

    def forward(self, x, t_emb=None):
        """
        Args:
            x: (B, T*S, D) - spacetime patches
            t_emb: (B, D) - timestep embedding
        """
        B, N, D = x.shape
        T, S = self.num_temporal, self.num_spatial

        # Inject the diffusion timestep. Additive conditioning is a simplification;
        # DiT itself modulates the layers with adaptive LayerNorm.
        if t_emb is not None:
            x = x + t_emb[:, None, :]

        # Spatial attention (within each frame)
        x_spatial = rearrange(x, 'b (t s) d -> (b t) s d', t=T, s=S)
        x_spatial = self.spatial_norm(x_spatial)
        attn_out, _ = self.spatial_attn(x_spatial, x_spatial, x_spatial)
        x_spatial = rearrange(attn_out, '(b t) s d -> b (t s) d', b=B, t=T)
        x = x + x_spatial

        # Temporal attention (across frames at same position)
        x_temporal = rearrange(x, 'b (t s) d -> (b s) t d', t=T, s=S)
        x_temporal = self.temporal_norm(x_temporal)
        attn_out, _ = self.temporal_attn(x_temporal, x_temporal, x_temporal)
        # Restore the original time-major token order
        x_temporal = rearrange(attn_out, '(b s) t d -> b (t s) d', b=B, s=S)
        x = x + x_temporal

        # FFN
        x = x + self.ffn(self.ffn_norm(x))
        return x
class SimpleDiT(nn.Module):
    """Simplified Diffusion Transformer"""

    def __init__(
        self,
        img_size: int = 256,
        patch_size: int = 16,
        num_frames: int = 16,
        in_channels: int = 4,
        hidden_size: int = 768,
        depth: int = 12,
        num_heads: int = 12
    ):
        super().__init__()
        temporal_patch = 2

        # Patch embedding
        self.patch_embed = SpacetimePatchEmbed(
            img_size, patch_size, num_frames,
            temporal_patch=temporal_patch, in_channels=in_channels,
            embed_dim=hidden_size
        )
        num_spatial = (img_size // patch_size) ** 2
        num_temporal = num_frames // temporal_patch

        # Position embedding (learned, one vector per spacetime patch)
        self.pos_embed = nn.Parameter(
            torch.zeros(1, num_spatial * num_temporal, hidden_size)
        )

        # Timestep embedding
        self.time_embed = nn.Sequential(
            nn.Linear(hidden_size, hidden_size * 4),
            nn.SiLU(),
            nn.Linear(hidden_size * 4, hidden_size)
        )

        # Transformer blocks
        self.blocks = nn.ModuleList([
            VideoTransformerBlock(
                hidden_size, num_heads,
                num_spatial_patches=num_spatial,
                num_temporal_patches=num_temporal
            )
            for _ in range(depth)
        ])

        # Output: one flattened (temporal_patch x patch x patch x C) patch per token
        self.final_norm = nn.LayerNorm(hidden_size)
        self.final_proj = nn.Linear(
            hidden_size,
            patch_size * patch_size * temporal_patch * in_channels
        )

    def forward(self, x, t, cond=None):
        """
        Args:
            x: (B, C, T, H, W) - noisy video latent
            t: (B,) - diffusion timestep
            cond: (B, L, D) - text conditioning (unused in this simplified sketch)
        Returns:
            (B, N, patch_dim) - per-patch prediction, not yet reshaped to a video
        """
        # Patch embedding
        x = self.patch_embed(x) + self.pos_embed

        # Timestep embedding (sinusoidal)
        t_emb = self._sinusoidal_embedding(t, x.shape[-1])
        t_emb = self.time_embed(t_emb)

        # Transformer blocks
        for block in self.blocks:
            x = block(x, t_emb)

        # Output
        x = self.final_norm(x)
        x = self.final_proj(x)
        return x

    def _sinusoidal_embedding(self, t, dim):
        """Sinusoidal timestep embedding"""
        half = dim // 2
        freqs = torch.exp(
            -math.log(10000) * torch.arange(half, device=t.device) / half
        )
        # Cast to float so integer timesteps work
        args = t[:, None].float() * freqs[None]
        embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
        return embedding
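The sinusoidal timestep embedding is easier to inspect without tensors. The sketch below reproduces the same formula for a single timestep using only the standard library, so the individual frequencies and values can be printed. The standalone function `sinusoidal_embedding` and its `max_period` parameter name are illustrative.

```python
import math


def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Embed a scalar timestep as [cos terms..., sin terms...] of length dim."""
    half = dim // 2
    # Frequencies fall geometrically from 1 toward 1 / max_period
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]


emb_start = sinusoidal_embedding(0, dim=8)
emb_later = sinusoidal_embedding(10, dim=8)
print(emb_start)
print(emb_later)
```

High-frequency components separate nearby timesteps while low-frequency ones vary slowly across the whole schedule, giving the network a smooth but unambiguous encoding of the noise level.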
4. Practical Applications¶
4.1 Multimodal Pipeline¶
import whisper
from transformers import AutoProcessor, MusicgenForConditionalGeneration
from TTS.api import TTS

class MultimodalPipeline:
    """Integrated audio/video pipeline"""

    def __init__(self):
        # Speech recognition
        self.whisper = whisper.load_model("base")

        # Music generation
        self.music_processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
        self.music_model = MusicgenForConditionalGeneration.from_pretrained(
            "facebook/musicgen-small"
        )

        # TTS
        self.tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

    def transcribe_and_translate(
        self,
        audio_path: str,
        target_language: str = "en"
    ):
        """Speech recognition and translation"""
        # Whisper's translate task only produces English output
        if target_language != "en":
            raise ValueError("Whisper can only translate into English ('en')")

        # Speech recognition
        result = self.whisper.transcribe(audio_path)
        original_text = result["text"]
        source_language = result["language"]

        # Translation (to English)
        if source_language != target_language:
            translation = self.whisper.transcribe(
                audio_path,
                task="translate"
            )["text"]
        else:
            translation = original_text

        return {
            "original": original_text,
            "source_language": source_language,
            "translation": translation
        }

    def generate_soundtrack(
        self,
        video_description: str,
        mood: str,
        duration: float = 30.0
    ):
        """Generate background music based on video description"""
        prompt = f"{mood} music for: {video_description}"
        inputs = self.music_processor(
            text=[prompt],
            padding=True,
            return_tensors="pt"
        )

        # MusicGen is trained on clips of about 30 seconds, so cap the request
        duration = min(duration, 30.0)
        max_new_tokens = int(duration * 50)  # 50 tokens/second
        audio_values = self.music_model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            guidance_scale=4.0
        )
        return audio_values[0, 0].cpu().numpy()

    def create_voiceover(
        self,
        script: str,
        reference_voice: str,
        language: str = "en"
    ):
        """Generate voiceover"""
        output_path = "voiceover.wav"
        self.tts.tts_to_file(
            text=script,
            file_path=output_path,
            speaker_wav=reference_voice,
            language=language
        )
        return output_path

# Usage example
def demo_pipeline():
    """Pipeline demo"""
    pipeline = MultimodalPipeline()

    # 1. Transcribe and translate audio file
    result = pipeline.transcribe_and_translate(
        "korean_speech.mp3",
        target_language="en"
    )
    print(f"Original: {result['original']}")
    print(f"Translation: {result['translation']}")

    # 2. Generate background music for video
    music = pipeline.generate_soundtrack(
        video_description="A documentary about ocean wildlife",
        mood="Calm and majestic",
        duration=30.0
    )

    # 3. Generate narration
    voiceover = pipeline.create_voiceover(
        script="Welcome to our exploration of the deep ocean.",
        reference_voice="narrator_sample.wav",
        language="en"
    )
4.2 Real-time Processing¶
import asyncio
from collections import deque

import numpy as np
import torch
import whisper

class RealTimeAudioProcessor:
    """Real-time audio processing"""

    def __init__(self, buffer_size: float = 3.0):
        self.buffer_size = buffer_size  # seconds
        self.sample_rate = 16000
        self.audio_buffer = deque(maxlen=int(buffer_size * self.sample_rate))
        # Whisper model (use small version)
        self.model = whisper.load_model("tiny")

    async def process_stream(self, audio_stream):
        """Process audio stream

        audio_stream is any object with an async receive() method that
        returns a chunk of float samples at 16kHz.
        """
        while True:
            # Receive audio chunk
            chunk = await audio_stream.receive()
            self.audio_buffer.extend(chunk)

            # Process when buffer is sufficient (at least 2 seconds)
            if len(self.audio_buffer) >= self.sample_rate * 2:
                # Whisper expects a float32 waveform
                audio_array = np.array(self.audio_buffer, dtype=np.float32)

                # Async transcription (runs the blocking call in a worker thread)
                result = await asyncio.to_thread(
                    self.model.transcribe,
                    audio_array,
                    fp16=False
                )
                yield result["text"]

                # Keep partial buffer (overlap): drop the oldest second
                self.audio_buffer = deque(
                    list(self.audio_buffer)[self.sample_rate:],
                    maxlen=int(self.buffer_size * self.sample_rate)
                )

class StreamingVideoAnalyzer:
    """Streaming video analysis"""

    def __init__(self, frame_interval: int = 30):
        self.frame_interval = frame_interval
        self.frame_count = 0
        # CLIP for quick frame analysis
        from transformers import CLIPProcessor, CLIPModel
        self.processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
        self.model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

    def analyze_frame(self, frame, categories: list):
        """Frame classification"""
        inputs = self.processor(
            text=categories,
            images=frame,
            return_tensors="pt",
            padding=True
        )
        # Inference only, so skip gradient tracking
        with torch.no_grad():
            outputs = self.model(**inputs)
        probs = outputs.logits_per_image.softmax(dim=1)
        return {cat: prob.item() for cat, prob in zip(categories, probs[0])}

    def process_video_stream(self, video_stream, categories: list):
        """Process video stream"""
        import cv2

        while True:
            ret, frame = video_stream.read()
            if not ret:
                break
            self.frame_count += 1

            # Analyze only at intervals
            if self.frame_count % self.frame_interval == 0:
                frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                analysis = self.analyze_frame(frame_rgb, categories)
                yield self.frame_count, analysis
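The audio processor above transcribes overlapping windows so that words cut at one boundary reappear whole in the next window. The sketch below isolates that windowing logic in plain Python, with the window and hop expressed in samples. The function name `sliding_windows` is hypothetical, and the toy input stands in for an audio buffer.

```python
def sliding_windows(samples, window, hop):
    """Split a sequence into fixed-size windows that advance by `hop` samples."""
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    windows = []
    start = 0
    while start + window <= len(samples):
        windows.append(samples[start:start + window])
        start += hop  # hop < window means consecutive windows overlap
    return windows


# Toy signal: 10 samples, 4-sample windows advancing by 2 (50% overlap)
chunks = sliding_windows(list(range(10)), window=4, hop=2)
print(chunks)
```

At 16 kHz, a 2-second window with a 1-second hop corresponds to `window=32000` and `hop=16000`. The overlap means some text is transcribed twice, so a real system also needs to merge or deduplicate consecutive outputs.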
5. Model Comparison¶
5.1 Speech Models¶
| Model | Parameters | Features | Use Case |
|---|---|---|---|
| Whisper Large | 1.55B | Multilingual, Translation | General ASR |
| Whisper Large-v3 | 1.55B | Improved accuracy | Production |
| wav2vec 2.0 | 95M-317M | Self-supervised | Fine-tuning base |
| HuBERT | 95M-1B | Masked prediction | Speech representation |
5.2 Audio Generation¶
| Model | Size | Features | Output |
|---|---|---|---|
| MusicGen Small | 300M | Fast generation | Music |
| MusicGen Large | 3.3B | High quality | Music |
| AudioGen | 300M-1.5B | Sound effects | Audio |
| Bark | 1B+ | Non-verbal expressions | TTS |
5.3 Video Models¶
| Model | Architecture | Input | Tasks |
|---|---|---|---|
| VideoLLaMA | LLaMA + Q-Former | Video + Audio | VQA, Captioning |
| Video-ChatGPT | LLaVA variant | Video | Conversation |
| TimeSformer | Divided attention | Video | Classification |
| ViViT | Factorized | Video | Classification |
Key Summary¶
Audio Foundation Models¶
Whisper: General ASR + Translation
├── Encoder-Decoder Transformer
├── 680K hours training data
└── Multilingual (99 languages)
MusicGen: Text→Music
├── Autoregressive Transformer
├── EnCodec tokenization
└── Text/Melody conditioned
Video Foundation Models¶
Video Understanding:
├── Frame sampling → Visual encoder
├── Temporal aggregation (Q-Former/pooling)
└── LLM backbone for reasoning
Video Generation (Sora concept):
├── Spacetime patches (3D tokenization)
├── Diffusion Transformer (DiT)
└── Variable resolution/duration
Practical Points¶
- Whisper: Can be domain-specialized with fine-tuning
- MusicGen: Trade off prompt adherence and diversity with guidance_scale and temperature
- Video: Frame sampling strategy is crucial
- Real-time: Small models + streaming buffer
References¶
- Radford et al. (2022). "Robust Speech Recognition via Large-Scale Weak Supervision" (Whisper)
- Copet et al. (2023). "Simple and Controllable Music Generation" (MusicGen)
- Borsos et al. (2023). "AudioLM: a Language Modeling Approach to Audio Generation"
- Zhang et al. (2023). "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model"
- OpenAI Sora Technical Report (2024)