27. TensorBoard Visualization

Learning Objectives

  • Understand TensorBoard's core features and use cases
  • Learn how to integrate TensorBoard with PyTorch
  • Visualize training metrics, model graphs, and embeddings
  • Compare and analyze hyperparameter tuning results

1. Introduction to TensorBoard

1.1 What Is TensorBoard?

TensorBoard is a tool for visualizing and analyzing machine learning experiments.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                             TensorBoard                              β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ Scalars     β”‚  β”‚ Images      β”‚  β”‚ Graphs      β”‚  β”‚ Histograms  β”‚  β”‚
β”‚  β”‚ (loss,      β”‚  β”‚ (samples,   β”‚  β”‚ (model      β”‚  β”‚ (weight     β”‚  β”‚
β”‚  β”‚  accuracy)  β”‚  β”‚  outputs)   β”‚  β”‚  structure) β”‚  β”‚  dists)     β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                                                                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ Embeddings  β”‚  β”‚ Text        β”‚  β”‚ Audio       β”‚  β”‚ HParams     β”‚  β”‚
β”‚  β”‚ (t-SNE,     β”‚  β”‚ (logs,      β”‚  β”‚ (speech     β”‚  β”‚ (hyper-     β”‚  β”‚
β”‚  β”‚  PCA)       β”‚  β”‚  samples)   β”‚  β”‚  samples)   β”‚  β”‚  parameters)β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

1.2 Installation and Launch

# Install
pip install tensorboard

# Launch
tensorboard --logdir=runs --port=6006

# Open in a browser: http://localhost:6006

2. Integrating TensorBoard with PyTorch

2.1 SummaryWriter Basics

from torch.utils.tensorboard import SummaryWriter
import torch
import torch.nn as nn

# Create a SummaryWriter
writer = SummaryWriter('runs/experiment_1')

# Log scalar values
for step in range(100):
    loss = 1.0 / (step + 1)  # dummy loss value
    accuracy = step / 100.0   # dummy accuracy

    writer.add_scalar('Loss/train', loss, step)
    writer.add_scalar('Accuracy/train', accuracy, step)

# Close the writer
writer.close()

2.2 Organizing Log Directories per Experiment

from datetime import datetime
import os

def create_writer(experiment_name: str, extra: str | None = None) -> SummaryWriter:
    """Create a unique log directory for each experiment."""
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

    if extra:
        log_dir = f'runs/{experiment_name}/{extra}/{timestamp}'
    else:
        log_dir = f'runs/{experiment_name}/{timestamp}'

    os.makedirs(log_dir, exist_ok=True)
    return SummaryWriter(log_dir)

# Usage example
writer = create_writer('mnist_cnn', 'lr_0.001_batch_32')

3. Scalar Logging (Scalars)

3.1 Logging Training/Validation Metrics

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter

class Trainer:
    def __init__(self, model, train_loader, val_loader, device='cuda'):
        self.model = model.to(device)
        self.train_loader = train_loader
        self.val_loader = val_loader
        self.device = device
        self.criterion = nn.CrossEntropyLoss()
        self.optimizer = optim.Adam(model.parameters(), lr=0.001)
        self.writer = SummaryWriter()
        self.global_step = 0

    def train_epoch(self, epoch: int):
        self.model.train()
        running_loss = 0.0
        correct = 0
        total = 0

        for batch_idx, (data, target) in enumerate(self.train_loader):
            data, target = data.to(self.device), target.to(self.device)

            self.optimizer.zero_grad()
            output = self.model(data)
            loss = self.criterion(output, target)
            loss.backward()
            self.optimizer.step()

            running_loss += loss.item()
            _, predicted = output.max(1)
            total += target.size(0)
            correct += predicted.eq(target).sum().item()

            # Per-batch logging (optional)
            if batch_idx % 100 == 0:
                self.writer.add_scalar('Loss/train_step', loss.item(), self.global_step)
            self.global_step += 1

        # Per-epoch logging
        epoch_loss = running_loss / len(self.train_loader)
        epoch_acc = 100. * correct / total

        self.writer.add_scalar('Loss/train', epoch_loss, epoch)
        self.writer.add_scalar('Accuracy/train', epoch_acc, epoch)

        return epoch_loss, epoch_acc

    def validate(self, epoch: int):
        self.model.eval()
        val_loss = 0.0
        correct = 0
        total = 0

        with torch.no_grad():
            for data, target in self.val_loader:
                data, target = data.to(self.device), target.to(self.device)
                output = self.model(data)
                val_loss += self.criterion(output, target).item()
                _, predicted = output.max(1)
                total += target.size(0)
                correct += predicted.eq(target).sum().item()

        val_loss /= len(self.val_loader)
        val_acc = 100. * correct / total

        self.writer.add_scalar('Loss/val', val_loss, epoch)
        self.writer.add_scalar('Accuracy/val', val_acc, epoch)

        return val_loss, val_acc
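A usage sketch, assuming model, train_loader, and val_loader are already defined:

trainer = Trainer(model, train_loader, val_loader)
for epoch in range(10):
    trainer.train_epoch(epoch)
    trainer.validate(epoch)
trainer.writer.close()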

3.2 Showing Multiple Scalars on One Chart

# Method 1: use add_scalars
writer.add_scalars('Loss', {
    'train': train_loss,
    'val': val_loss
}, epoch)

writer.add_scalars('Accuracy', {
    'train': train_acc,
    'val': val_acc
}, epoch)

# Method 2: use a shared tag prefix
# Loss/train and Loss/val are grouped automatically in TensorBoard
# (note: add_scalars writes each key to its own subdirectory and event file)

3.3 Logging the Learning-Rate Scheduler

from torch.optim.lr_scheduler import CosineAnnealingLR

# assumes `optimizer`, `writer`, and `train_one_epoch` are defined elsewhere
scheduler = CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    train_one_epoch()
    scheduler.step()

    # Log the current learning rate
    current_lr = scheduler.get_last_lr()[0]
    writer.add_scalar('Learning_Rate', current_lr, epoch)
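If the optimizer defines several parameter groups (for example, different learning rates for backbone and head), each group can be logged under its own tag; a minimal sketch:

# One scalar per parameter group
for i, group in enumerate(optimizer.param_groups):
    writer.add_scalar(f'Learning_Rate/group_{i}', group['lr'], epoch)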

4. Image Logging (Images)

4.1 Visualizing Input Images

import torchvision
from torchvision import transforms

def log_images(writer, images, tag, step, normalize=True):
    """Visualize a batch of images as a grid."""
    # Undo normalization to restore the original value range (optional)
    if normalize:
        # Invert ImageNet normalization
        mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
        std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
        images = images * std + mean
        images = torch.clamp(images, 0, 1)

    # Build the grid
    grid = torchvision.utils.make_grid(images, nrow=8, padding=2)
    writer.add_image(tag, grid, step)

# Usage example
for batch_idx, (images, labels) in enumerate(train_loader):
    if batch_idx == 0:  # log only the first batch
        log_images(writer, images[:32], 'Input/samples', epoch)
        break
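add_image expects CHW tensors by default; an HWC array (as produced by matplotlib or OpenCV) can be logged by passing dataformats. A sketch, where img_hwc is a hypothetical H x W x 3 array:

writer.add_image('Input/single_hwc', img_hwc, epoch, dataformats='HWC')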

4.2 Visualizing Generative Model Outputs

class GANTrainer:
    def __init__(self, generator, discriminator, writer):
        self.G = generator
        self.D = discriminator
        self.writer = writer
        self.fixed_noise = torch.randn(64, 100, 1, 1)  # fixed noise for comparable snapshots

    def log_generated_images(self, epoch):
        """Visualize generated images (to track training progress)."""
        self.G.eval()
        with torch.no_grad():
            # nn.Module has no .device attribute; infer it from the parameters
            device = next(self.G.parameters()).device
            fake_images = self.G(self.fixed_noise.to(device))
            fake_images = (fake_images + 1) / 2  # [-1, 1] -> [0, 1]

        grid = torchvision.utils.make_grid(fake_images, nrow=8)
        self.writer.add_image('Generated/samples', grid, epoch)
        self.G.train()

4.3 Visualizing Feature Maps

def visualize_feature_maps(model, image, writer, layer_name, step):
    """Visualize feature maps from an intermediate CNN layer."""
    activation = {}

    def get_activation(name):
        def hook(model, input, output):
            activation[name] = output.detach()
        return hook

    # Register the hook
    layer = dict(model.named_modules())[layer_name]
    handle = layer.register_forward_hook(get_activation(layer_name))

    # Forward pass
    model.eval()
    with torch.no_grad():
        _ = model(image.unsqueeze(0))

    # Extract the feature map
    feat = activation[layer_name].squeeze(0)  # [C, H, W]

    # μ±„λ„λ³„λ‘œ μ‹œκ°ν™” (처음 16개)
    feat = feat[:16].unsqueeze(1)  # [16, 1, H, W]
    grid = torchvision.utils.make_grid(feat, nrow=4, normalize=True)
    writer.add_image(f'Features/{layer_name}', grid, step)

    handle.remove()
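A usage sketch; the layer name 'layer1' assumes a torchvision ResNet, so adjust it for your architecture (inspect dict(model.named_modules()).keys()):

image, _ = val_dataset[0]  # a single CHW image tensor
visualize_feature_maps(model, image, writer, 'layer1', step=0)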

4.4 Grad-CAM Visualization

import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt

class GradCAM:
    def __init__(self, model, target_layer):
        self.model = model
        self.target_layer = target_layer
        self.gradients = None
        self.activations = None

        # Register hooks
        target_layer.register_forward_hook(self.save_activation)
        target_layer.register_full_backward_hook(self.save_gradient)

    def save_activation(self, module, input, output):
        self.activations = output.detach()

    def save_gradient(self, module, grad_input, grad_output):
        self.gradients = grad_output[0].detach()

    def __call__(self, x, class_idx=None):
        self.model.eval()
        output = self.model(x)

        if class_idx is None:
            class_idx = output.argmax(dim=1)

        self.model.zero_grad()
        one_hot = torch.zeros_like(output)
        one_hot[0, class_idx] = 1
        output.backward(gradient=one_hot)

        # Compute Grad-CAM
        weights = self.gradients.mean(dim=(2, 3), keepdim=True)
        cam = (weights * self.activations).sum(dim=1, keepdim=True)
        cam = F.relu(cam)
        cam = F.interpolate(cam, size=x.shape[2:], mode='bilinear', align_corners=False)
        cam = cam - cam.min()
        cam = cam / (cam.max() + 1e-8)  # epsilon avoids division by zero for flat maps

        return cam.squeeze().cpu().numpy()

def log_gradcam(writer, model, image, target_layer, step):
    """Log Grad-CAM results to TensorBoard."""
    gradcam = GradCAM(model, target_layer)
    cam = gradcam(image.unsqueeze(0))

    # Render original, heatmap, and overlay with matplotlib
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))

    # Original image
    img_np = image.permute(1, 2, 0).cpu().numpy()
    img_np = (img_np - img_np.min()) / (img_np.max() - img_np.min())
    axes[0].imshow(img_np)
    axes[0].set_title('Original')
    axes[0].axis('off')

    # Grad-CAM
    axes[1].imshow(cam, cmap='jet')
    axes[1].set_title('Grad-CAM')
    axes[1].axis('off')

    # μ˜€λ²„λ ˆμ΄
    axes[2].imshow(img_np)
    axes[2].imshow(cam, cmap='jet', alpha=0.5)
    axes[2].set_title('Overlay')
    axes[2].axis('off')

    plt.tight_layout()
    writer.add_figure('GradCAM', fig, step)
    plt.close(fig)
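A usage sketch; model.layer4 assumes a torchvision ResNet, where the last convolutional block is the usual Grad-CAM target:

log_gradcam(writer, model, image, model.layer4, step=0)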

5. Histograms

5.1 Visualizing Weight Distributions

def log_weights_histograms(writer, model, epoch):
    """λͺ¨λΈ κ°€μ€‘μΉ˜ 뢄포λ₯Ό νžˆμŠ€ν† κ·Έλž¨μœΌλ‘œ μ‹œκ°ν™”"""
    for name, param in model.named_parameters():
        if param.requires_grad:
            # Weight values
            writer.add_histogram(f'Weights/{name}', param.data, epoch)

            # Gradient values (if present)
            if param.grad is not None:
                writer.add_histogram(f'Gradients/{name}', param.grad, epoch)

# ν•™μŠ΅ λ£¨ν”„μ—μ„œ μ‚¬μš©
for epoch in range(num_epochs):
    train_one_epoch()

    # Log histograms every 10 epochs
    if epoch % 10 == 0:
        log_weights_histograms(writer, model, epoch)
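Histograms are informative but relatively heavy to record; a cheap complementary signal is the global gradient norm logged as a scalar, which makes exploding or vanishing gradients easy to spot. A minimal sketch (call right after loss.backward()):

def log_grad_norm(writer, model, step):
    """Log the L2 norm over all parameter gradients as a single scalar."""
    grads = [p.grad.norm() for p in model.parameters() if p.grad is not None]
    total_norm = torch.norm(torch.stack(grads))
    writer.add_scalar('Gradients/global_norm', total_norm.item(), step)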

5.2 Tracking Activation Distributions

class ActivationLogger:
    """Track per-layer activation distributions."""

    def __init__(self, model, writer):
        self.writer = writer
        self.activations = {}
        self.hooks = []

        for name, module in model.named_modules():
            if isinstance(module, (nn.ReLU, nn.GELU, nn.SiLU)):
                hook = module.register_forward_hook(
                    self._make_hook(name)
                )
                self.hooks.append(hook)

    def _make_hook(self, name):
        def hook(module, input, output):
            self.activations[name] = output.detach()
        return hook

    def log(self, step):
        for name, activation in self.activations.items():
            self.writer.add_histogram(f'Activations/{name}', activation, step)
        self.activations.clear()

    def remove_hooks(self):
        for hook in self.hooks:
            hook.remove()
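A usage sketch, where sample_batch is a hypothetical input batch:

logger = ActivationLogger(model, writer)
_ = model(sample_batch)  # the forward pass fills logger.activations
logger.log(step=epoch)
logger.remove_hooks()    # detach hooks when done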

6. λͺ¨λΈ κ·Έλž˜ν”„ (Graphs)

6.1 λͺ¨λΈ ꡬ쑰 μ‹œκ°ν™”

import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

# λͺ¨λΈ κ·Έλž˜ν”„ λ‘œκΉ…
model = SimpleCNN()
dummy_input = torch.randn(1, 3, 32, 32)

writer = SummaryWriter('runs/model_graph')
writer.add_graph(model, dummy_input)
writer.close()

6.2 Graphs of Complex Models

# Graph of a Transformer model
from torchvision.models import vit_b_16

model = vit_b_16(weights=None)  # the 'pretrained' flag is deprecated in recent torchvision
dummy_input = torch.randn(1, 3, 224, 224)

writer.add_graph(model, dummy_input)
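add_graph also accepts a tuple of tensors for models whose forward takes several inputs; a hedged sketch (multi_input_model, input_a, and input_b are hypothetical):

writer.add_graph(multi_input_model, (input_a, input_b))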

7. Embedding Visualization (Embeddings)

7.1 Visualizing Embeddings with t-SNE/PCA

import torch
import torchvision
from torchvision import datasets, transforms

def extract_embeddings(model, dataloader, device):
    """λͺ¨λΈμ˜ λ§ˆμ§€λ§‰ λ ˆμ΄μ–΄ μ „ μž„λ² λ”© μΆ”μΆœ"""
    model.eval()
    embeddings = []
    labels = []
    images = []

    with torch.no_grad():
        for data, target in dataloader:
            data = data.to(device)

            # Forward pass up to (but not including) the final FC layer;
            # adjust to match your model architecture
            x = model.features(data)
            x = model.avgpool(x)
            emb = x.view(x.size(0), -1)

            embeddings.append(emb.cpu())
            labels.append(target)
            images.append(data.cpu())

    return (
        torch.cat(embeddings),
        torch.cat(labels),
        torch.cat(images)
    )

# Usage example
embeddings, labels, images = extract_embeddings(model, test_loader, device)

# Log the embeddings to TensorBoard
writer.add_embedding(
    embeddings,
    metadata=labels,
    label_img=images,
    global_step=epoch,
    tag='Embeddings/test_set'
)

7.2 Visualizing Word Embeddings (NLP)

import torch.nn as nn

# Word embedding example
vocab = ['king', 'queen', 'man', 'woman', 'prince', 'princess',
         'dog', 'cat', 'puppy', 'kitten']
embedding_dim = 128

embedding_layer = nn.Embedding(len(vocab), embedding_dim)

# Extract the embedding vectors (detached so they can be converted to numpy)
indices = torch.arange(len(vocab))
embeddings = embedding_layer(indices).detach()

# Log to TensorBoard
writer.add_embedding(
    embeddings,
    metadata=vocab,
    tag='Word_Embeddings'
)

8. ν•˜μ΄νΌνŒŒλΌλ―Έν„° νŠœλ‹ (HParams)

8.1 ν•˜μ΄νΌνŒŒλΌλ―Έν„° μ‹€ν—˜ λ‘œκΉ…

from torch.utils.tensorboard.summary import hparams

def train_with_hparams(lr, batch_size, optimizer_name, epochs=10):
    """ν•˜μ΄νΌνŒŒλΌλ―Έν„°λ³„ μ‹€ν—˜ μ‹€ν–‰"""

    # 고유 μ‹€ν—˜ 디렉토리
    run_name = f'lr_{lr}_bs_{batch_size}_{optimizer_name}'
    writer = SummaryWriter(f'runs/hparam_search/{run_name}')

    # λͺ¨λΈ 및 데이터 μ„€μ •
    model = SimpleCNN().to(device)

    if optimizer_name == 'adam':
        optimizer = optim.Adam(model.parameters(), lr=lr)
    elif optimizer_name == 'sgd':
        optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

    # Train
    best_accuracy = 0
    for epoch in range(epochs):
        train_loss, train_acc = train_one_epoch(model, train_loader, optimizer)
        val_loss, val_acc = validate(model, val_loader)

        writer.add_scalar('Loss/train', train_loss, epoch)
        writer.add_scalar('Accuracy/val', val_acc, epoch)

        best_accuracy = max(best_accuracy, val_acc)

    # ν•˜μ΄νΌνŒŒλΌλ―Έν„°μ™€ μ΅œμ’… λ©”νŠΈλ¦­ 기둝
    hparam_dict = {
        'lr': lr,
        'batch_size': batch_size,
        'optimizer': optimizer_name
    }
    metric_dict = {
        'hparam/best_accuracy': best_accuracy,
        'hparam/final_loss': val_loss
    }

    writer.add_hparams(hparam_dict, metric_dict)
    writer.close()

    return best_accuracy

# Run the grid search
learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [32, 64, 128]
optimizers = ['adam', 'sgd']

for lr in learning_rates:
    for bs in batch_sizes:
        for opt in optimizers:
            acc = train_with_hparams(lr, bs, opt)
            print(f'LR={lr}, BS={bs}, OPT={opt} -> Acc={acc:.2f}%')

8.2 Integrating Optuna with TensorBoard

import optuna
from optuna.integration import TensorBoardCallback

def objective(trial):
    # ν•˜μ΄νΌνŒŒλΌλ―Έν„° μƒ˜ν”Œλ§
    lr = trial.suggest_float('lr', 1e-5, 1e-1, log=True)
    batch_size = trial.suggest_categorical('batch_size', [32, 64, 128])
    n_layers = trial.suggest_int('n_layers', 1, 3)
    hidden_dim = trial.suggest_int('hidden_dim', 64, 256, step=64)
    dropout = trial.suggest_float('dropout', 0.1, 0.5)

    # λͺ¨λΈ 생성
    model = create_model(n_layers, hidden_dim, dropout)

    # Train and evaluate
    accuracy = train_and_evaluate(model, lr, batch_size)

    return accuracy

# Optimize with the TensorBoard callback
study = optuna.create_study(direction='maximize')
tensorboard_callback = TensorBoardCallback('runs/optuna/', metric_name='accuracy')

study.optimize(
    objective,
    n_trials=100,
    callbacks=[tensorboard_callback]
)

print(f'Best trial: {study.best_trial.params}')
print(f'Best accuracy: {study.best_value:.2f}%')

9. Custom Scalar Layouts

9.1 Defining a Dashboard Layout

from torch.utils.tensorboard import SummaryWriter

# Define the custom layout
layout = {
    'Training Metrics': {
        'loss': ['Multiline', ['Loss/train', 'Loss/val']],
        'accuracy': ['Multiline', ['Accuracy/train', 'Accuracy/val']],
    },
    'Learning Rate': {
        'lr': ['Multiline', ['Learning_Rate']],
    },
    'Per-Class Accuracy': {
        'classes': ['Multiline', [f'Accuracy/class_{i}' for i in range(10)]],
    },
}

writer = SummaryWriter('runs/custom_layout')
writer.add_custom_scalars(layout)

# Then log scalars as usual
for epoch in range(100):
    writer.add_scalar('Loss/train', train_loss, epoch)
    writer.add_scalar('Loss/val', val_loss, epoch)
    writer.add_scalar('Accuracy/train', train_acc, epoch)
    writer.add_scalar('Accuracy/val', val_acc, epoch)
    writer.add_scalar('Learning_Rate', lr, epoch)

    for i in range(10):
        writer.add_scalar(f'Accuracy/class_{i}', class_acc[i], epoch)
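Besides 'Multiline', the custom scalars layout also supports 'Margin' charts, which draw a central value with lower/upper bounds; a hedged sketch:

layout = {
    'Loss band': {
        # Margin charts take [value, lower_bound, upper_bound] tags
        'loss': ['Margin', ['Loss/mean', 'Loss/lower', 'Loss/upper']],
    },
}
writer.add_custom_scalars(layout)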

10. ν…μŠ€νŠΈ 및 μ˜€λ””μ˜€ λ‘œκΉ…

10.1 ν…μŠ€νŠΈ λ‘œκΉ…

# ν•™μŠ΅ 둜그 기둝
writer.add_text('Hyperparameters', f'''
- Learning Rate: {lr}
- Batch Size: {batch_size}
- Optimizer: {optimizer_name}
- Epochs: {num_epochs}
''', 0)

# λͺ¨λΈ μš”μ•½ 기둝
from torchinfo import summary

model_summary = str(summary(model, input_size=(1, 3, 224, 224), verbose=0))
writer.add_text('Model/summary', f'```\n{model_summary}\n```', 0)

# Log NLP samples
writer.add_text('Samples/input', 'The quick brown fox jumps over the lazy dog', 0)
writer.add_text('Samples/prediction', 'The fast brown fox jumps over the lazy dog', 0)
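The text plugin renders Markdown, so small tables can be logged as well; a minimal sketch:

writer.add_text('Results/table', '| metric | value |\n|--------|-------|\n| acc | 0.93 |', 0)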

10.2 Audio Logging

import torchaudio

# Log an audio file
waveform, sample_rate = torchaudio.load('audio.wav')
writer.add_audio('Audio/input', waveform, 0, sample_rate=sample_rate)

# μƒμ„±λœ μ˜€λ””μ˜€ λ‘œκΉ… (예: TTS, μŒμ•… 생성)
generated_audio = model.generate(text_input)
writer.add_audio('Audio/generated', generated_audio, step, sample_rate=22050)

11. ν”„λ‘œνŒŒμΌλ§ (Profiler)

11.1 PyTorch Profiler와 TensorBoard

import torch
from torch.profiler import profile, record_function, ProfilerActivity

# ν”„λ‘œνŒŒμΌλ§ μ„€μ •
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(
        wait=1,      # idle steps before each cycle
        warmup=1,    # warm-up steps (traced but discarded)
        active=3,    # steps actually recorded
        repeat=2     # number of wait/warmup/active cycles
    ),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('runs/profiler'),
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    for step, (data, target) in enumerate(train_loader):
        if step >= (1 + 1 + 3) * 2:
            break

        with record_function("data_loading"):
            data, target = data.to(device), target.to(device)

        with record_function("forward"):
            output = model(data)
            loss = criterion(output, target)

        with record_function("backward"):
            optimizer.zero_grad()
            loss.backward()

        with record_function("optimizer_step"):
            optimizer.step()

        prof.step()

# TensorBoardμ—μ„œ PYTORCH_PROFILER νƒ­ 확인

11.2 Memory Profiling

def profile_memory(model, input_size, device='cuda'):
    """Measure GPU memory usage of a forward and backward pass."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.empty_cache()

    model = model.to(device)
    x = torch.randn(input_size).to(device)

    # Forward pass
    torch.cuda.synchronize()
    output = model(x)
    forward_memory = torch.cuda.max_memory_allocated() / 1e9

    # Backward pass
    loss = output.sum()
    loss.backward()
    torch.cuda.synchronize()
    total_memory = torch.cuda.max_memory_allocated() / 1e9

    print(f'Forward memory: {forward_memory:.2f} GB')
    print(f'Total memory (forward + backward): {total_memory:.2f} GB')

    return forward_memory, total_memory

# Log the measurements
fwd_mem, total_mem = profile_memory(model, (32, 3, 224, 224))
writer.add_scalar('Memory/forward_GB', fwd_mem, 0)
writer.add_scalar('Memory/total_GB', total_mem, 0)

12. TensorBoard in Distributed Training

12.1 Logging under DDP

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed():
    dist.init_process_group(backend='nccl')
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    return rank, world_size

def train_ddp():
    rank, world_size = setup_distributed()

    # Rank 0μ—μ„œλ§Œ TensorBoard λ‘œκΉ…
    writer = SummaryWriter() if rank == 0 else None

    model = MyModel().to(rank)
    model = DDP(model, device_ids=[rank])

    for epoch in range(num_epochs):
        # Compute local metrics
        local_loss = train_one_epoch(model, train_loader)

        # Average the loss across all processes
        loss_tensor = torch.tensor([local_loss]).to(rank)
        dist.all_reduce(loss_tensor, op=dist.ReduceOp.SUM)
        avg_loss = loss_tensor.item() / world_size

        # Rank 0μ—μ„œλ§Œ λ‘œκΉ…
        if writer is not None:
            writer.add_scalar('Loss/train', avg_loss, epoch)

    if writer is not None:
        writer.close()

    dist.destroy_process_group()

13. Complete Example: A Full Training Pipeline

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
from torchvision import datasets, transforms, models
from datetime import datetime
import os

class TensorBoardTrainer:
    def __init__(
        self,
        model: nn.Module,
        train_loader: DataLoader,
        val_loader: DataLoader,
        optimizer: optim.Optimizer,
        scheduler=None,
        device: str = 'cuda',
        experiment_name: str = 'default'
    ):
        self.model = model.to(device)
        self.train_loader = train_loader
        self.val_loader = val_loader
        self.optimizer = optimizer
        self.scheduler = scheduler
        self.device = device
        self.criterion = nn.CrossEntropyLoss()

        # TensorBoard setup
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        log_dir = f'runs/{experiment_name}/{timestamp}'
        self.writer = SummaryWriter(log_dir)

        # λͺ¨λΈ κ·Έλž˜ν”„ λ‘œκΉ…
        dummy_input = next(iter(train_loader))[0][:1].to(device)
        self.writer.add_graph(model, dummy_input)

        # ν•˜μ΄νΌνŒŒλΌλ―Έν„° λ‘œκΉ…
        self._log_hyperparameters()

        self.global_step = 0
        self.best_val_acc = 0

    def _log_hyperparameters(self):
        hparams = {
            'lr': self.optimizer.param_groups[0]['lr'],
            'batch_size': self.train_loader.batch_size,
            'optimizer': self.optimizer.__class__.__name__,
            'model': self.model.__class__.__name__,
        }
        self.writer.add_text('Hyperparameters', str(hparams), 0)

    def train_epoch(self, epoch: int):
        self.model.train()
        running_loss = 0.0
        correct = 0
        total = 0

        for batch_idx, (data, target) in enumerate(self.train_loader):
            data, target = data.to(self.device), target.to(self.device)

            self.optimizer.zero_grad()
            output = self.model(data)
            loss = self.criterion(output, target)
            loss.backward()
            self.optimizer.step()

            running_loss += loss.item()
            _, predicted = output.max(1)
            total += target.size(0)
            correct += predicted.eq(target).sum().item()

            # Per-batch loss logging
            self.writer.add_scalar('Loss/train_step', loss.item(), self.global_step)
            self.global_step += 1

            # Log images from the first batch
            if batch_idx == 0 and epoch % 10 == 0:
                grid = torchvision.utils.make_grid(data[:16])
                self.writer.add_image('Input/train_samples', grid, epoch)

        epoch_loss = running_loss / len(self.train_loader)
        epoch_acc = 100. * correct / total

        self.writer.add_scalar('Loss/train', epoch_loss, epoch)
        self.writer.add_scalar('Accuracy/train', epoch_acc, epoch)

        # Weight histograms (every 10 epochs)
        if epoch % 10 == 0:
            for name, param in self.model.named_parameters():
                self.writer.add_histogram(f'Weights/{name}', param, epoch)
                if param.grad is not None:
                    self.writer.add_histogram(f'Gradients/{name}', param.grad, epoch)

        return epoch_loss, epoch_acc

    def validate(self, epoch: int):
        self.model.eval()
        val_loss = 0.0
        correct = 0
        total = 0
        all_preds = []
        all_targets = []

        with torch.no_grad():
            for data, target in self.val_loader:
                data, target = data.to(self.device), target.to(self.device)
                output = self.model(data)
                val_loss += self.criterion(output, target).item()
                _, predicted = output.max(1)
                total += target.size(0)
                correct += predicted.eq(target).sum().item()
                all_preds.extend(predicted.cpu().numpy())
                all_targets.extend(target.cpu().numpy())

        val_loss /= len(self.val_loader)
        val_acc = 100. * correct / total

        self.writer.add_scalar('Loss/val', val_loss, epoch)
        self.writer.add_scalar('Accuracy/val', val_acc, epoch)

        # Track the best score
        if val_acc > self.best_val_acc:
            self.best_val_acc = val_acc
            self.writer.add_scalar('Best/val_accuracy', val_acc, epoch)

        return val_loss, val_acc

    def train(self, num_epochs: int):
        for epoch in range(num_epochs):
            train_loss, train_acc = self.train_epoch(epoch)
            val_loss, val_acc = self.validate(epoch)

            if self.scheduler:
                self.scheduler.step()
                self.writer.add_scalar(
                    'Learning_Rate',
                    self.scheduler.get_last_lr()[0],
                    epoch
                )

            print(f'Epoch {epoch+1}/{num_epochs}:')
            print(f'  Train Loss: {train_loss:.4f}, Acc: {train_acc:.2f}%')
            print(f'  Val Loss: {val_loss:.4f}, Acc: {val_acc:.2f}%')

        # Log final metrics
        self.writer.add_hparams(
            {'lr': self.optimizer.param_groups[0]['lr']},
            {'hparam/best_accuracy': self.best_val_acc}
        )

        self.writer.close()
        print(f'\nTraining complete. Best Val Accuracy: {self.best_val_acc:.2f}%')


# Usage example
if __name__ == '__main__':
    # Prepare the data
    transform = transforms.Compose([
        transforms.Resize(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])

    train_dataset = datasets.CIFAR10('./data', train=True, download=True, transform=transform)
    val_dataset = datasets.CIFAR10('./data', train=False, transform=transform)

    train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=4)
    val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False, num_workers=4)

    # λͺ¨λΈ μ„€μ •
    model = models.resnet18(pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, 10)

    optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

    # Train
    trainer = TensorBoardTrainer(
        model=model,
        train_loader=train_loader,
        val_loader=val_loader,
        optimizer=optimizer,
        scheduler=scheduler,
        device='cuda',
        experiment_name='cifar10_resnet18'
    )

    trainer.train(num_epochs=50)

14. Tips and Best Practices

14.1 Optimizing Logging Frequency

# Logging every batch slows training down.
# Recommended: batch loss every 100-500 steps, epoch metrics every epoch.

LOG_INTERVAL = 100

for batch_idx, (data, target) in enumerate(train_loader):
    # ... training code ...

    if batch_idx % LOG_INTERVAL == 0:
        writer.add_scalar('Loss/train_step', loss.item(), global_step)
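For long runs that you monitor live, flush() pushes pending events to disk without closing the writer:

writer.flush()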

14.2 Managing Log Files

# Remove log directories older than 30 days
find runs/ -type d -mtime +30 -exec rm -rf {} +

# Serve only a specific experiment
tensorboard --logdir=runs/experiment_final --port=6006

14.3 Remote TensorBoard

# Run TensorBoard on the server
tensorboard --logdir=runs --host=0.0.0.0 --port=6006

# SSH tunnel from your local machine
ssh -L 6006:localhost:6006 user@server

# Or use ngrok
ngrok http 6006

Exercises

Exercise 1: Basic Logging

While training an MNIST classifier, log the following to TensorBoard:
  • training/validation loss and accuracy
  • learning-rate changes
  • sample input images

Exercise 2: Model Analysis

For a trained CNN model:
  • visualize weight histograms
  • visualize feature maps
  • apply Grad-CAM

Exercise 3: Hyperparameter Tuning

For learning rate, batch size, and dropout ratio:
  • run a grid search
  • compare the results in the HParams dashboard
  • find the best combination

