10. ์ •๋ณด ์ด๋ก  (Information Theory)

10. ์ •๋ณด ์ด๋ก  (Information Theory)

ํ•™์Šต ๋ชฉํ‘œ

  • ์ •๋ณด๋Ÿ‰๊ณผ ์—”ํŠธ๋กœํ”ผ์˜ ๊ฐœ๋…์„ ์ดํ•ดํ•˜๊ณ  ๋ถˆํ™•์‹ค์„ฑ์˜ ์ธก๋„๋กœ์„œ์˜ ์—ญํ• ์„ ํŒŒ์•…ํ•œ๋‹ค
  • ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ์™€ KL ๋ฐœ์‚ฐ์˜ ์ •์˜์™€ ์„ฑ์งˆ์„ ํ•™์Šตํ•˜๊ณ  ๋จธ์‹ ๋Ÿฌ๋‹์—์„œ์˜ ํ™œ์šฉ์„ ์ดํ•ดํ•œ๋‹ค
  • ์ƒํ˜ธ ์ •๋ณด๋Ÿ‰์„ ์ดํ•ดํ•˜๊ณ  ๋ณ€์ˆ˜ ๊ฐ„ ์˜์กด์„ฑ ์ธก์ • ๋ฐฉ๋ฒ•์„ ์ตํžŒ๋‹ค
  • ์˜Œ์„ผ ๋ถ€๋“ฑ์‹์„ ํ™œ์šฉํ•˜์—ฌ ์ •๋ณด ์ด๋ก ์˜ ์ฃผ์š” ๋ถ€๋“ฑ์‹์„ ์ฆ๋ช…ํ•  ์ˆ˜ ์žˆ๋‹ค
  • VAE, GAN ๋“ฑ ์ƒ์„ฑ ๋ชจ๋ธ์—์„œ ์ •๋ณด ์ด๋ก ์ด ์–ด๋–ป๊ฒŒ ํ™œ์šฉ๋˜๋Š”์ง€ ํ•™์Šตํ•œ๋‹ค
  • Python์œผ๋กœ ์—”ํŠธ๋กœํ”ผ, KL ๋ฐœ์‚ฐ, ์ƒํ˜ธ ์ •๋ณด๋Ÿ‰์„ ๊ณ„์‚ฐํ•˜๊ณ  ์‹œ๊ฐํ™”ํ•  ์ˆ˜ ์žˆ๋‹ค

1. ์ •๋ณด๋Ÿ‰๊ณผ ์—”ํŠธ๋กœํ”ผ (Information and Entropy)

1.1 ์ž๊ธฐ ์ •๋ณด (Self-Information)

์‚ฌ๊ฑด $x$๊ฐ€ ๋ฐœ์ƒํ–ˆ์„ ๋•Œ์˜ ์ •๋ณด๋Ÿ‰:

$$I(x) = -\log P(x) = \log \frac{1}{P(x)}$$

์ง๊ด€: - ํ™•๋ฅ ์ด ๋‚ฎ์€ ์‚ฌ๊ฑด โ†’ ๋งŽ์€ ์ •๋ณด (๋†€๋ผ์›€) - ํ™•๋ฅ ์ด ๋†’์€ ์‚ฌ๊ฑด โ†’ ์ ์€ ์ •๋ณด

๋‹จ์œ„:

  • $\log_2$: bits
  • $\log_e$: nats

์˜ˆ์ œ: - ๋™์ „ ๋˜์ง€๊ธฐ (์•ž๋ฉด ํ™•๋ฅ  0.5): $I(\text{H}) = -\log_2(0.5) = 1$ bit - ์ฃผ์‚ฌ์œ„ (๊ฐ ๋ฉด ํ™•๋ฅ  1/6): $I(1) = -\log_2(1/6) \approx 2.58$ bits

1.2 ์ƒค๋„Œ ์—”ํŠธ๋กœํ”ผ (Shannon Entropy)

ํ™•๋ฅ  ๋ณ€์ˆ˜ $X$์˜ ์—”ํŠธ๋กœํ”ผ๋Š” ํ‰๊ท  ์ •๋ณด๋Ÿ‰์ž…๋‹ˆ๋‹ค:

$$H(X) = -\sum_{x} P(x) \log P(x) = \mathbb{E}_{P(x)}[-\log P(x)]$$

์—ฐ์† ํ™•๋ฅ  ๋ณ€์ˆ˜์˜ ๊ฒฝ์šฐ (๋ฏธ๋ถ„ ์—”ํŠธ๋กœํ”ผ):

$$h(X) = -\int p(x) \log p(x) dx$$

์„ฑ์งˆ (์ด์‚ฐ ๋ถ„ํฌ ๊ธฐ์ค€):

  1. $H(X) \geq 0$ (๋น„์Œ์„ฑ; ๋ฏธ๋ถ„ ์—”ํŠธ๋กœํ”ผ๋Š” ์Œ์ˆ˜๊ฐ€ ๋  ์ˆ˜ ์žˆ์Œ)
  2. $H(X) = 0 \Leftrightarrow X$๊ฐ€ ๊ฒฐ์ •์  (ํ™•๋ฅ  1์ธ ์‚ฌ๊ฑด ํ•˜๋‚˜๋งŒ ์กด์žฌ)
  3. ์ตœ๋Œ€ ์—”ํŠธ๋กœํ”ผ: ๊ท ๋“ฑ ๋ถ„ํฌ์ผ ๋•Œ

1.3 ์—”ํŠธ๋กœํ”ผ์˜ ์˜๋ฏธ

์—”ํŠธ๋กœํ”ผ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ํ•ด์„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  1. ๋ถˆํ™•์‹ค์„ฑ์˜ ์ธก๋„: ๋ถ„ํฌ๊ฐ€ ์–ผ๋งˆ๋‚˜ ๋ถˆํ™•์‹คํ•œ๊ฐ€?
  2. ํ‰๊ท  ์ •๋ณด๋Ÿ‰: ์ƒ˜ํ”Œ๋ง ์‹œ ๊ธฐ๋Œ€๋˜๋Š” ์ •๋ณด๋Ÿ‰
  3. ์ตœ์†Œ ๋ถ€ํ˜ธํ™” ๊ธธ์ด: ๋ฐ์ดํ„ฐ๋ฅผ ์••์ถ•ํ•˜๋Š”๋ฐ ํ•„์š”ํ•œ ์ตœ์†Œ ๋น„ํŠธ ์ˆ˜

์˜ˆ์ œ:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def entropy(p):
    """์—”ํŠธ๋กœํ”ผ ๊ณ„์‚ฐ (0*log(0) = 0์œผ๋กœ ์ฒ˜๋ฆฌ)"""
    p = np.array(p)
    p = p[p > 0]  # 0 ์ œ๊ฑฐ
    return -np.sum(p * np.log2(p))

# ์ด์ง„ ๋ถ„ํฌ์˜ ์—”ํŠธ๋กœํ”ผ
p_range = np.linspace(0.01, 0.99, 100)
H_binary = [-p * np.log2(p) - (1-p) * np.log2(1-p) for p in p_range]

plt.figure(figsize=(14, 4))

# ์ด์ง„ ์—”ํŠธ๋กœํ”ผ
plt.subplot(131)
plt.plot(p_range, H_binary, linewidth=2)
plt.axhline(1, color='r', linestyle='--', alpha=0.5, label='Maximum H=1')
plt.axvline(0.5, color='g', linestyle='--', alpha=0.5, label='p=0.5')
plt.xlabel('p (ํ™•๋ฅ )')
plt.ylabel('H(X) (bits)')
plt.title('Binary Entropy Function')
plt.legend()
plt.grid(True)

# ์ฃผ์‚ฌ์œ„ vs ํŽธํ–ฅ๋œ ์ฃผ์‚ฌ์œ„
fair_die = [1/6] * 6
biased_die = [0.5, 0.3, 0.1, 0.05, 0.03, 0.02]

H_fair = entropy(fair_die)
H_biased = entropy(biased_die)

plt.subplot(132)
x = np.arange(1, 7)
width = 0.35
plt.bar(x - width/2, fair_die, width, label=f'Fair (H={H_fair:.2f})', alpha=0.7)
plt.bar(x + width/2, biased_die, width, label=f'Biased (H={H_biased:.2f})', alpha=0.7)
plt.xlabel('Face')
plt.ylabel('Probability')
plt.title('Dice Distributions')
plt.legend()
plt.grid(True, axis='y')

# ๋‹ค์–‘ํ•œ ๋ถ„ํฌ์˜ ์—”ํŠธ๋กœํ”ผ
n_outcomes = 10
distributions = {
    'Uniform': np.ones(n_outcomes) / n_outcomes,
    'Peaked': stats.norm.pdf(np.arange(n_outcomes), 5, 1),
    'Very Peaked': stats.norm.pdf(np.arange(n_outcomes), 5, 0.5),
}

# ์ •๊ทœํ™”
for key in distributions:
    distributions[key] /= distributions[key].sum()

plt.subplot(133)
x_pos = np.arange(len(distributions))
entropies = [entropy(distributions[key]) for key in distributions]
bars = plt.bar(x_pos, entropies, alpha=0.7)
plt.xticks(x_pos, distributions.keys())
plt.ylabel('Entropy (bits)')
plt.title('Entropy of Different Distributions')
plt.grid(True, axis='y')

# ๊ฐ ๋ง‰๋Œ€์— ๊ฐ’ ํ‘œ์‹œ
for i, (bar, h) in enumerate(zip(bars, entropies)):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.05,
             f'{h:.2f}', ha='center', va='bottom')

plt.tight_layout()
plt.savefig('entropy_examples.png', dpi=150, bbox_inches='tight')
plt.show()

print("๊ท ๋“ฑ ๋ถ„ํฌ๊ฐ€ ์ตœ๋Œ€ ์—”ํŠธ๋กœํ”ผ๋ฅผ ๊ฐ€์ง!")
print(f"Fair die: {H_fair:.3f} bits")
print(f"Biased die: {H_biased:.3f} bits")

1.4 ์ตœ๋Œ€ ์—”ํŠธ๋กœํ”ผ ๋ถ„ํฌ

์ •๋ฆฌ: ์ œ์•ฝ์ด ์—†์„ ๋•Œ, $n$๊ฐœ์˜ ์ด์‚ฐ ๊ฒฐ๊ณผ๋ฅผ ๊ฐ€์ง„ ํ™•๋ฅ  ๋ถ„ํฌ ์ค‘ ์—”ํŠธ๋กœํ”ผ๊ฐ€ ์ตœ๋Œ€์ธ ๊ฒƒ์€ ๊ท ๋“ฑ ๋ถ„ํฌ์ž…๋‹ˆ๋‹ค.

$$H(X) \leq \log n$$

๋“ฑํ˜ธ๋Š” $P(x) = 1/n$ (๋ชจ๋“  $x$)์ผ ๋•Œ ์„ฑ๋ฆฝ.

์ฆ๋ช…: ๋ผ๊ทธ๋ž‘์ฃผ ์Šน์ˆ˜๋ฒ• ์‚ฌ์šฉ.

์—ฐ์† ๋ถ„ํฌ์˜ ๊ฒฝ์šฐ: - ์ œ์•ฝ ์—†์Œ โ†’ ์ •์˜ ๋ถˆ๊ฐ€ (๋ฌดํ•œ๋Œ€) - ํ‰๊ท  ๊ณ ์ • โ†’ ์ง€์ˆ˜ ๋ถ„ํฌ - ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ ๊ณ ์ • โ†’ ๊ฐ€์šฐ์‹œ์•ˆ ๋ถ„ํฌ (์ตœ๋Œ€ ์—”ํŠธ๋กœํ”ผ!)

์ด๊ฒƒ์ด ๊ฐ€์šฐ์‹œ์•ˆ์ด ์ž์—ฐ์—์„œ ํ”ํ•œ ์ด์œ  ์ค‘ ํ•˜๋‚˜์ž…๋‹ˆ๋‹ค.

2. ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ (Cross-Entropy)

2.1 ์ •์˜

๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ๋Š” ๋ถ„ํฌ $P$๋กœ๋ถ€ํ„ฐ ์ƒ˜ํ”Œ๋ง๋œ ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„ํฌ $Q$๋กœ ๋ถ€ํ˜ธํ™”ํ•  ๋•Œ ํ•„์š”ํ•œ ํ‰๊ท  ๋น„ํŠธ ์ˆ˜์ž…๋‹ˆ๋‹ค:

$$H(P, Q) = -\sum_{x} P(x) \log Q(x) = \mathbb{E}_{P(x)}[-\log Q(x)]$$

ํ•ด์„: - $P$: ์‹ค์ œ ๋ถ„ํฌ (๋ฐ์ดํ„ฐ) - $Q$: ๋ชจ๋ธ ๋ถ„ํฌ (์˜ˆ์ธก) - $H(P, Q)$: $Q$๋กœ $P$๋ฅผ ๊ทผ์‚ฌํ•  ๋•Œ์˜ "๋น„์šฉ"

์„ฑ์งˆ:

  1. $H(P, Q) \geq H(P)$ (๋“ฑํ˜ธ๋Š” $P = Q$์ผ ๋•Œ)
  2. ๋น„๋Œ€์นญ: ์ผ๋ฐ˜์ ์œผ๋กœ $H(P, Q) \neq H(Q, P)$

2.2 ์ด์ง„ ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ (Binary Cross-Entropy)

๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€์—์„œ ์‚ฌ์šฉ:

$$H(y, \hat{y}) = -[y \log \hat{y} + (1-y) \log(1-\hat{y})]$$

  • $y \in \{0, 1\}$: ์‹ค์ œ ๋ ˆ์ด๋ธ”
  • $\hat{y} \in [0, 1]$: ์˜ˆ์ธก ํ™•๋ฅ 

n๊ฐœ ์ƒ˜ํ”Œ์˜ ํ‰๊ท :

$$\mathcal{L}_{\text{BCE}} = -\frac{1}{n}\sum_{i=1}^n [y_i \log \hat{y}_i + (1-y_i) \log(1-\hat{y}_i)]$$

2.3 ์นดํ…Œ๊ณ ๋ฆฌ์ปฌ ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ (Categorical Cross-Entropy)

๋‹ค์ค‘ ๋ถ„๋ฅ˜์—์„œ ์‚ฌ์šฉ:

$$H(P, Q) = -\sum_{k=1}^K P_k \log Q_k$$

One-hot ์ธ์ฝ”๋”ฉ๋œ ๊ฒฝ์šฐ ($P_k = \delta_{k,c}$):

$$\mathcal{L}_{\text{CCE}} = -\log Q_c$$

์—ฌ๊ธฐ์„œ $c$๋Š” ์ •๋‹ต ํด๋ž˜์Šค.

import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# ์ด์ง„ ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ ์‹œ๊ฐํ™”
y_true = np.array([0, 0, 1, 1])  # ์‹ค์ œ ๋ ˆ์ด๋ธ”
y_pred_range = np.linspace(0.01, 0.99, 100)

# ๊ฐ ์ƒ˜ํ”Œ์— ๋Œ€ํ•œ ์†์‹ค
fig, axes = plt.subplots(1, 4, figsize=(16, 4))

for i, y in enumerate(y_true):
    if y == 1:
        loss = -np.log(y_pred_range)
    else:
        loss = -np.log(1 - y_pred_range)

    axes[i].plot(y_pred_range, loss, linewidth=2)
    axes[i].set_xlabel('Predicted Probability')
    axes[i].set_ylabel('Loss')
    axes[i].set_title(f'True Label: {y}')
    axes[i].grid(True)
    axes[i].axvline(y, color='r', linestyle='--', alpha=0.5, label=f'Optimal: {y}')
    axes[i].legend()

plt.tight_layout()
plt.savefig('binary_cross_entropy.png', dpi=150, bbox_inches='tight')
plt.show()

# PyTorch ๊ตฌํ˜„ ๊ฒ€์ฆ
y_true_tensor = torch.tensor([0., 0., 1., 1.])
y_pred_tensor = torch.tensor([0.1, 0.2, 0.8, 0.9])

bce_loss = nn.BCELoss()
loss = bce_loss(y_pred_tensor, y_true_tensor)

# ์ˆ˜๋™ ๊ณ„์‚ฐ
manual_loss = -(y_true_tensor * torch.log(y_pred_tensor) +
                (1 - y_true_tensor) * torch.log(1 - y_pred_tensor)).mean()

print(f"PyTorch BCE Loss: {loss.item():.4f}")
print(f"Manual BCE Loss: {manual_loss.item():.4f}")
print(f"Difference: {abs(loss - manual_loss).item():.10f}")

# ์นดํ…Œ๊ณ ๋ฆฌ์ปฌ ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ
print("\n์นดํ…Œ๊ณ ๋ฆฌ์ปฌ ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ:")
y_true_cat = torch.tensor([2])  # ํด๋ž˜์Šค 2๊ฐ€ ์ •๋‹ต
y_pred_logits = torch.tensor([[1.0, 2.0, 3.0, 1.5]])  # ๋กœ์ง“

ce_loss = nn.CrossEntropyLoss()
loss_cat = ce_loss(y_pred_logits, y_true_cat)

# ์ˆ˜๋™ ๊ณ„์‚ฐ
y_pred_softmax = torch.softmax(y_pred_logits, dim=1)
manual_loss_cat = -torch.log(y_pred_softmax[0, 2])

print(f"PyTorch CE Loss: {loss_cat.item():.4f}")
print(f"Manual CE Loss: {manual_loss_cat.item():.4f}")
print(f"Softmax probabilities: {y_pred_softmax.numpy()}")

3. KL ๋ฐœ์‚ฐ (Kullback-Leibler Divergence)

3.1 ์ •์˜

KL ๋ฐœ์‚ฐ์€ ๋‘ ๋ถ„ํฌ $P$์™€ $Q$ ์‚ฌ์ด์˜ "๊ฑฐ๋ฆฌ"๋ฅผ ์ธก์ •ํ•ฉ๋‹ˆ๋‹ค:

$$D_{\text{KL}}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} = \mathbb{E}_{P(x)}\left[\log \frac{P(x)}{Q(x)}\right]$$

์—ฐ์† ๋ถ„ํฌ:

$$D_{\text{KL}}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} dx$$

3.2 KL ๋ฐœ์‚ฐ์˜ ์„ฑ์งˆ

  1. ๋น„์Œ์„ฑ: $D_{\text{KL}}(P \| Q) \geq 0$
  2. ๋“ฑํ˜ธ ์กฐ๊ฑด: $D_{\text{KL}}(P \| Q) = 0 \Leftrightarrow P = Q$ (๊ฑฐ์˜ ๋ชจ๋“  ๊ณณ์—์„œ)
  3. ๋น„๋Œ€์นญ: $D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P)$ โ†’ ๊ฑฐ๋ฆฌ ํ•จ์ˆ˜๊ฐ€ ์•„๋‹˜!
  4. ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ์™€์˜ ๊ด€๊ณ„: $$D_{\text{KL}}(P \| Q) = H(P, Q) - H(P)$$

3.3 ๋น„์Œ์„ฑ ์ฆ๋ช… (๊น์Šค ๋ถ€๋“ฑ์‹)

Jensen ๋ถ€๋“ฑ์‹ ์‚ฌ์šฉ: $-\log$๋Š” ๋ณผ๋ก ํ•จ์ˆ˜์ด๋ฏ€๋กœ,

$$-\mathbb{E}[\log X] \geq -\log \mathbb{E}[X]$$

์ ์šฉ:

$$D_{\text{KL}}(P \| Q) = \mathbb{E}_{P(x)}\left[\log \frac{P(x)}{Q(x)}\right]$$

$$= -\mathbb{E}_{P(x)}\left[\log \frac{Q(x)}{P(x)}\right]$$

$$\geq -\log \mathbb{E}_{P(x)}\left[\frac{Q(x)}{P(x)}\right]$$

$$= -\log \sum_{x} P(x) \frac{Q(x)}{P(x)}$$

$$= -\log \sum_{x} Q(x) = -\log 1 = 0$$

3.4 Forward KL vs Reverse KL

Forward KL: $D_{\text{KL}}(P \| Q)$

  • $Q$๊ฐ€ $P$๋ฅผ ์ปค๋ฒ„ํ•˜๋„๋ก ๊ฐ•์ œ (mode-covering)
  • $P(x) > 0$์ด๋ฉด $Q(x) > 0$์ด์–ด์•ผ ํ•จ

Reverse KL: $D_{\text{KL}}(Q \| P)$

  • $Q$๊ฐ€ $P$์˜ ๋ชจ๋“œ๋ฅผ ์„ ํƒ (mode-seeking)
  • $Q$๋Š” $P$์˜ ํ•œ ๋ชจ๋“œ์—๋งŒ ์ง‘์ค‘ํ•  ์ˆ˜ ์žˆ์Œ

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# ๋‘ ๊ฐœ์˜ ๊ฐ€์šฐ์‹œ์•ˆ ํ˜ผํ•ฉ์„ ํƒ€๊ฒŸ์œผ๋กœ
np.random.seed(42)
x = np.linspace(-5, 10, 1000)

# ํƒ€๊ฒŸ: ๋‘ ๊ฐœ์˜ ๊ฐ€์šฐ์‹œ์•ˆ ํ˜ผํ•ฉ
p = 0.5 * stats.norm.pdf(x, 0, 1) + 0.5 * stats.norm.pdf(x, 5, 1)
p = p / np.trapz(p, x)  # ์ •๊ทœํ™”

# Forward KL๋กœ ๊ทผ์‚ฌ (mode-covering)
def kl_divergence_forward(mu, sigma, x, p_true):
    q = stats.norm.pdf(x, mu, sigma)
    q = q / np.trapz(q, x)
    # Forward KL: E_p[log(p/q)]
    kl = np.trapz(p_true * np.log((p_true + 1e-10) / (q + 1e-10)), x)
    return kl

# Reverse KL๋กœ ๊ทผ์‚ฌ (mode-seeking)
def kl_divergence_reverse(mu, sigma, x, p_true):
    q = stats.norm.pdf(x, mu, sigma)
    q = q / np.trapz(q, x)
    # Reverse KL: E_q[log(q/p)]
    kl = np.trapz(q * np.log((q + 1e-10) / (p_true + 1e-10)), x)
    return kl

# ๊ทธ๋ฆฌ๋“œ ์„œ์น˜๋กœ ์ตœ์ ํ™” (์‹ค์ œ๋กœ๋Š” ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ• ์‚ฌ์šฉ)
mu_range = np.linspace(-2, 7, 50)
sigma_range = np.linspace(0.5, 3, 30)

best_forward_kl = float('inf')
best_reverse_kl = float('inf')
best_forward_params = None
best_reverse_params = None

for mu in mu_range:
    for sigma in sigma_range:
        fkl = kl_divergence_forward(mu, sigma, x, p)
        rkl = kl_divergence_reverse(mu, sigma, x, p)

        if fkl < best_forward_kl:
            best_forward_kl = fkl
            best_forward_params = (mu, sigma)

        if rkl < best_reverse_kl:
            best_reverse_kl = rkl
            best_reverse_params = (mu, sigma)

# ์ตœ์  ๋ถ„ํฌ
q_forward = stats.norm.pdf(x, *best_forward_params)
q_forward = q_forward / np.trapz(q_forward, x)

q_reverse = stats.norm.pdf(x, *best_reverse_params)
q_reverse = q_reverse / np.trapz(q_reverse, x)

# ์‹œ๊ฐํ™”
plt.figure(figsize=(14, 5))

plt.subplot(121)
plt.plot(x, p, 'k-', linewidth=2, label='True P (bimodal)')
plt.plot(x, q_forward, 'b--', linewidth=2,
         label=f'Forward KL Q (ฮผ={best_forward_params[0]:.2f}, ฯƒ={best_forward_params[1]:.2f})')
plt.fill_between(x, 0, p, alpha=0.2, color='k')
plt.fill_between(x, 0, q_forward, alpha=0.2, color='b')
plt.xlabel('x')
plt.ylabel('Density')
plt.title(f'Forward KL: D(P||Q) (mode-covering)\nKL = {best_forward_kl:.4f}')
plt.legend()
plt.grid(True)

plt.subplot(122)
plt.plot(x, p, 'k-', linewidth=2, label='True P (bimodal)')
plt.plot(x, q_reverse, 'r--', linewidth=2,
         label=f'Reverse KL Q (ฮผ={best_reverse_params[0]:.2f}, ฯƒ={best_reverse_params[1]:.2f})')
plt.fill_between(x, 0, p, alpha=0.2, color='k')
plt.fill_between(x, 0, q_reverse, alpha=0.2, color='r')
plt.xlabel('x')
plt.ylabel('Density')
plt.title(f'Reverse KL: D(Q||P) (mode-seeking)\nKL = {best_reverse_kl:.4f}')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.savefig('forward_vs_reverse_kl.png', dpi=150, bbox_inches='tight')
plt.show()

print("Forward KL (mode-covering): ๋‘ ๋ชจ๋“œ ์‚ฌ์ด์— ๋„“๊ฒŒ ํผ์ง")
print(f"  Best params: ฮผ={best_forward_params[0]:.2f}, ฯƒ={best_forward_params[1]:.2f}")
print("\nReverse KL (mode-seeking): ํ•œ ๋ชจ๋“œ๋ฅผ ์„ ํƒ")
print(f"  Best params: ฮผ={best_reverse_params[0]:.2f}, ฯƒ={best_reverse_params[1]:.2f}")

4. ์ƒํ˜ธ ์ •๋ณด๋Ÿ‰ (Mutual Information)

4.1 ์ •์˜

๋‘ ํ™•๋ฅ  ๋ณ€์ˆ˜ $X$์™€ $Y$ ์‚ฌ์ด์˜ ์ƒํ˜ธ ์ •๋ณด๋Ÿ‰:

$$I(X; Y) = \sum_{x, y} P(x, y) \log \frac{P(x, y)}{P(x)P(y)}$$

$$= D_{\text{KL}}(P(X, Y) \| P(X)P(Y))$$

๋‹ค๋ฅธ ํ‘œํ˜„:

$$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$$

$$= H(X) + H(Y) - H(X, Y)$$

4.2 ์กฐ๊ฑด๋ถ€ ์—”ํŠธ๋กœํ”ผ

$Y$๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ $X$์˜ ์กฐ๊ฑด๋ถ€ ์—”ํŠธ๋กœํ”ผ:

$$H(X|Y) = \sum_{y} P(y) H(X|Y=y)$$

$$= -\sum_{x, y} P(x, y) \log P(x|y)$$

์ฒด์ธ ๊ทœ์น™:

$$H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$$

4.3 ์ƒํ˜ธ ์ •๋ณด๋Ÿ‰์˜ ์„ฑ์งˆ

  1. ๋น„์Œ์„ฑ: $I(X; Y) \geq 0$
  2. ๋Œ€์นญ์„ฑ: $I(X; Y) = I(Y; X)$
  3. ๋…๋ฆฝ์„ฑ: $I(X; Y) = 0 \Leftrightarrow X \perp Y$
  4. ์ƒํ•œ: $I(X; Y) \leq \min(H(X), H(Y))$

์ง๊ด€: - $I(X; Y)$: $X$๋ฅผ ์•Œ๋ฉด $Y$์— ๋Œ€ํ•œ ๋ถˆํ™•์‹ค์„ฑ์ด ์–ผ๋งˆ๋‚˜ ์ค„์–ด๋“œ๋Š”๊ฐ€? - $= H(Y) - H(Y|X)$: $Y$์˜ ์—”ํŠธ๋กœํ”ผ์—์„œ $X$๋ฅผ ์•Œ์•˜์„ ๋•Œ์˜ ์กฐ๊ฑด๋ถ€ ์—”ํŠธ๋กœํ”ผ๋ฅผ ๋บ€ ๊ฒƒ

4.4 ์‘์šฉ: ํŠน์ง• ์„ ํƒ

๋จธ์‹ ๋Ÿฌ๋‹์—์„œ ํŠน์ง• $X$์™€ ํƒ€๊ฒŸ $Y$ ์‚ฌ์ด์˜ ์ƒํ˜ธ ์ •๋ณด๋Ÿ‰์ด ๋†’์œผ๋ฉด, $X$๋Š” ์œ ์šฉํ•œ ํŠน์ง•์ž…๋‹ˆ๋‹ค.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import mutual_info_classif
from sklearn.datasets import make_classification

# ๋ฐ์ดํ„ฐ ์ƒ์„ฑ
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                          n_redundant=5, n_repeated=0, random_state=42)

# ์ƒํ˜ธ ์ •๋ณด๋Ÿ‰ ๊ณ„์‚ฐ
mi = mutual_info_classif(X, y, random_state=42)

# ์‹œ๊ฐํ™”
plt.figure(figsize=(14, 4))

plt.subplot(131)
plt.bar(range(len(mi)), mi)
plt.xlabel('Feature Index')
plt.ylabel('Mutual Information')
plt.title('Mutual Information with Target')
plt.grid(True, axis='y')

# ์ƒํ˜ธ ์ •๋ณด๋Ÿ‰์ด ๋†’์€ ํŠน์ง• vs ๋‚ฎ์€ ํŠน์ง•
high_mi_idx = np.argmax(mi)
low_mi_idx = np.argmin(mi)

plt.subplot(132)
plt.scatter(X[:, high_mi_idx], y, alpha=0.5)
plt.xlabel(f'Feature {high_mi_idx}')
plt.ylabel('Target')
plt.title(f'High MI Feature (MI={mi[high_mi_idx]:.3f})')
plt.grid(True)

plt.subplot(133)
plt.scatter(X[:, low_mi_idx], y, alpha=0.5)
plt.xlabel(f'Feature {low_mi_idx}')
plt.ylabel('Target')
plt.title(f'Low MI Feature (MI={mi[low_mi_idx]:.3f})')
plt.grid(True)

plt.tight_layout()
plt.savefig('mutual_information.png', dpi=150, bbox_inches='tight')
plt.show()

print("์ƒํ˜ธ ์ •๋ณด๋Ÿ‰์ด ๋†’์€ ํŠน์ง•:")
top_features = np.argsort(mi)[-5:][::-1]
for idx in top_features:
    print(f"  Feature {idx}: MI = {mi[idx]:.4f}")

print("\n์ƒํ˜ธ ์ •๋ณด๋Ÿ‰์ด ๋‚ฎ์€ ํŠน์ง•:")
bottom_features = np.argsort(mi)[:5]
for idx in bottom_features:
    print(f"  Feature {idx}: MI = {mi[idx]:.4f}")

5. ์˜Œ์„ผ ๋ถ€๋“ฑ์‹ (Jensen's Inequality)

5.1 ์ •์˜

ํ•จ์ˆ˜ $f$๊ฐ€ ๋ณผ๋ก ํ•จ์ˆ˜(convex)์ด๋ฉด:

$$f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$$

ํ•จ์ˆ˜ $f$๊ฐ€ ์˜ค๋ชฉ ํ•จ์ˆ˜(concave)์ด๋ฉด:

$$f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$$

์ด์‚ฐ ํ˜•ํƒœ:

$$f\left(\sum_i \lambda_i x_i\right) \leq \sum_i \lambda_i f(x_i)$$

์—ฌ๊ธฐ์„œ $\lambda_i \geq 0, \sum_i \lambda_i = 1$.

5.2 ๋ณผ๋ก ํ•จ์ˆ˜์˜ ์˜ˆ

  • $f(x) = x^2$
  • $f(x) = e^x$
  • $f(x) = -\log x$ (for $x > 0$)

5.3 ์‘์šฉ: KL ๋ฐœ์‚ฐ์˜ ๋น„์Œ์„ฑ

$$D_{\text{KL}}(P \| Q) = \mathbb{E}_{P}\left[\log \frac{P(x)}{Q(x)}\right]$$

$$= -\mathbb{E}_{P}\left[\log \frac{Q(x)}{P(x)}\right]$$

$-\log$๋Š” ๋ณผ๋ก ํ•จ์ˆ˜์ด๋ฏ€๋กœ Jensen ๋ถ€๋“ฑ์‹:

$$-\mathbb{E}_{P}\left[\log \frac{Q(x)}{P(x)}\right] \geq -\log \mathbb{E}_{P}\left[\frac{Q(x)}{P(x)}\right]$$

$$= -\log \sum_{x} P(x) \frac{Q(x)}{P(x)} = -\log \sum_{x} Q(x) = 0$$

5.4 ์‘์šฉ: ELBO ์œ ๋„ (VAE)

VAE์—์„œ Evidence Lower BOund (ELBO):

$$\log P(x) = \log \int P(x, z) dz = \log \int Q(z) \frac{P(x, z)}{Q(z)} dz$$

Jensen ($\log$๋Š” ์˜ค๋ชฉ):

$$\geq \int Q(z) \log \frac{P(x, z)}{Q(z)} dz$$

$$= \mathbb{E}_{Q(z)}[\log P(x, z)] - \mathbb{E}_{Q(z)}[\log Q(z)]$$

$$= \mathbb{E}_{Q(z)}[\log P(x|z)] - D_{\text{KL}}(Q(z) \| P(z))$$

์ด๊ฒƒ์ด VAE์˜ ์†์‹ค ํ•จ์ˆ˜์ž…๋‹ˆ๋‹ค!

import numpy as np
import matplotlib.pyplot as plt

# Jensen ๋ถ€๋“ฑ์‹ ์‹œ๊ฐํ™”
np.random.seed(42)

# ๋ณผ๋ก ํ•จ์ˆ˜: f(x) = x^2
x_values = np.linspace(-2, 2, 100)
f_convex = x_values**2

# ์ƒ˜ํ”Œ
samples = np.array([-1.5, -0.5, 0.5, 1.5])
weights = np.array([0.25, 0.25, 0.25, 0.25])

# ๊ธฐ๋Œ€๊ฐ’
E_x = np.sum(weights * samples)
f_E_x = E_x**2

# f(x)์˜ ๊ธฐ๋Œ€๊ฐ’
E_f_x = np.sum(weights * samples**2)

plt.figure(figsize=(14, 5))

# ๋ณผ๋ก ํ•จ์ˆ˜
plt.subplot(121)
plt.plot(x_values, f_convex, 'b-', linewidth=2, label='f(x) = xยฒ')
plt.scatter(samples, samples**2, color='r', s=100, zorder=5,
            label='Sample points')

# E[X]
plt.axvline(E_x, color='g', linestyle='--', alpha=0.7,
            label=f'E[X] = {E_x:.2f}')
plt.scatter([E_x], [f_E_x], color='g', s=200, marker='s',
            zorder=5, label=f'f(E[X]) = {f_E_x:.2f}')

# E[f(X)]
plt.axhline(E_f_x, color='orange', linestyle='--', alpha=0.7,
            label=f'E[f(X)] = {E_f_x:.2f}')

plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Jensen\'s Inequality (Convex)\nf(E[X]) โ‰ค E[f(X)]')
plt.legend()
plt.grid(True)

# ์˜ค๋ชฉ ํ•จ์ˆ˜: g(x) = log(x)
x_values_pos = np.linspace(0.1, 3, 100)
g_concave = np.log(x_values_pos)

samples_pos = np.array([0.5, 1.0, 1.5, 2.0])
E_x_pos = np.sum(weights * samples_pos)
g_E_x = np.log(E_x_pos)
E_g_x = np.sum(weights * np.log(samples_pos))

plt.subplot(122)
plt.plot(x_values_pos, g_concave, 'b-', linewidth=2, label='g(x) = log(x)')
plt.scatter(samples_pos, np.log(samples_pos), color='r', s=100,
            zorder=5, label='Sample points')

plt.axvline(E_x_pos, color='g', linestyle='--', alpha=0.7,
            label=f'E[X] = {E_x_pos:.2f}')
plt.scatter([E_x_pos], [g_E_x], color='g', s=200, marker='s',
            zorder=5, label=f'g(E[X]) = {g_E_x:.2f}')

plt.axhline(E_g_x, color='orange', linestyle='--', alpha=0.7,
            label=f'E[g(X)] = {E_g_x:.2f}')

plt.xlabel('x')
plt.ylabel('g(x)')
plt.title('Jensen\'s Inequality (Concave)\ng(E[X]) โ‰ฅ E[g(X)]')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.savefig('jensen_inequality.png', dpi=150, bbox_inches='tight')
plt.show()

print("Convex (f(x) = xยฒ):")
print(f"  f(E[X]) = {f_E_x:.4f}")
print(f"  E[f(X)] = {E_f_x:.4f}")
print(f"  f(E[X]) โ‰ค E[f(X)]: {f_E_x <= E_f_x}")

print("\nConcave (g(x) = log(x)):")
print(f"  g(E[X]) = {g_E_x:.4f}")
print(f"  E[g(X)] = {E_g_x:.4f}")
print(f"  g(E[X]) โ‰ฅ E[g(X)]: {g_E_x >= E_g_x}")

6. ๋จธ์‹ ๋Ÿฌ๋‹์—์„œ์˜ ์ •๋ณด ์ด๋ก 

6.1 ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ ์†์‹ค = MLE

๋ถ„๋ฅ˜ ๋ฌธ์ œ์—์„œ ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ ์ตœ์†Œํ™” = ์Œ์˜ ๋กœ๊ทธ ์šฐ๋„ ์ตœ์†Œํ™”:

$$\min_{\theta} H(P_{\text{data}}, P_{\theta}) = \min_{\theta} -\mathbb{E}_{(x,y) \sim P_{\text{data}}}[\log P_{\theta}(y|x)]$$

์ด๋Š” MLE์™€ ๋™์ผํ•ฉ๋‹ˆ๋‹ค!
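์ด ๋™์น˜ ๊ด€๊ณ„๋Š” ์ž‘์€ ์˜ˆ์‹œ๋กœ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค: ๊ด€์ธก ๋ฐ์ดํ„ฐ์˜ ํ‰๊ท  ์Œ์˜ ๋กœ๊ทธ ์šฐ๋„๊ฐ€ ๊ฒฝํ—˜ ๋ถ„ํฌ์™€ ๋ชจ๋ธ ๋ถ„ํฌ ์‚ฌ์ด์˜ ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค (๋ฐ์ดํ„ฐ์™€ ๋ชจ๋ธ ๋ถ„ํฌ๋Š” ๊ฐ€์ƒ์˜ ์˜ˆ์‹œ):

```python
import numpy as np

labels = np.array([0, 0, 0, 1, 1, 2])  # ๊ด€์ธก๋œ ํด๋ž˜์Šค ๋ ˆ์ด๋ธ” (์˜ˆ์‹œ)
q = np.array([0.5, 0.3, 0.2])          # ๋ชจ๋ธ ๋ถ„ํฌ Q_theta (์˜ˆ์‹œ)

# (1) ํ‰๊ท  ์Œ์˜ ๋กœ๊ทธ ์šฐ๋„ (MLE ๋ชฉ์  ํ•จ์ˆ˜)
nll = -np.mean(np.log(q[labels]))

# (2) ๊ฒฝํ—˜ ๋ถ„ํฌ P_data์™€ Q_theta ์‚ฌ์ด์˜ ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ
p_data = np.bincount(labels, minlength=3) / len(labels)
ce = -np.sum(p_data * np.log(q))

print(np.isclose(nll, ce))  # True: ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ ์ตœ์†Œํ™” = NLL ์ตœ์†Œํ™”
```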

6.2 VAE์˜ ELBO

Variational Autoencoder์˜ ๋ชฉ์  ํ•จ์ˆ˜:

$$\mathcal{L}_{\text{VAE}} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \| p(z))$$

  • ์ฒซ ๋ฒˆ์งธ ํ•ญ: ์žฌ๊ตฌ์„ฑ ์†์‹ค (reconstruction loss)
  • ๋‘ ๋ฒˆ์งธ ํ•ญ: KL ์ •๊ทœํ™” (KL regularization)

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super(VAE, self).__init__()

        # Encoder
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)

        # Decoder
        self.fc3 = nn.Linear(latent_dim, hidden_dim)
        self.fc4 = nn.Linear(hidden_dim, input_dim)

    def encode(self, x):
        h = F.relu(self.fc1(x))
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        h = F.relu(self.fc3(z))
        return torch.sigmoid(self.fc4(h))

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar

def vae_loss(recon_x, x, mu, logvar):
    # ์žฌ๊ตฌ์„ฑ ์†์‹ค (Binary Cross-Entropy)
    BCE = F.binary_cross_entropy(recon_x, x, reduction='sum')

    # KL ๋ฐœ์‚ฐ: D_KL(q(z|x) || p(z))
    # p(z) = N(0, I)์ด๋ฏ€๋กœ ํ•ด์„์ ์œผ๋กœ ๊ณ„์‚ฐ ๊ฐ€๋Šฅ
    # KL = -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return BCE + KLD, BCE, KLD

# ๊ฐ„๋‹จํ•œ ํ…Œ์ŠคํŠธ
model = VAE(input_dim=784, latent_dim=20)
x = torch.rand(32, 784)  # ๋ฐฐ์น˜ ํฌ๊ธฐ 32 (BCE ํƒ€๊ฒŸ์€ [0, 1] ๋ฒ”์œ„์—ฌ์•ผ ํ•จ)

recon_x, mu, logvar = model(x)
loss, bce, kld = vae_loss(recon_x, x, mu, logvar)

print(f"Total Loss: {loss.item():.2f}")
print(f"Reconstruction Loss (BCE): {bce.item():.2f}")
print(f"KL Divergence: {kld.item():.2f}")
print(f"\nELBO = -Loss = {-loss.item():.2f}")

6.3 GAN์˜ JS ๋ฐœ์‚ฐ

Generative Adversarial Network์˜ ์›๋ž˜ ๋ชฉ์  ํ•จ์ˆ˜๋Š” Jensen-Shannon (JS) ๋ฐœ์‚ฐ๊ณผ ๊ด€๋ จ:

$$D_{\text{JS}}(P \| Q) = \frac{1}{2}D_{\text{KL}}(P \| M) + \frac{1}{2}D_{\text{KL}}(Q \| M)$$

์—ฌ๊ธฐ์„œ $M = \frac{1}{2}(P + Q)$.

GAN์˜ ์ตœ์  ํŒ๋ณ„๊ธฐ๋Š” JS ๋ฐœ์‚ฐ์„ ์ถ”์ •ํ•ฉ๋‹ˆ๋‹ค.

6.4 ์ •๋ณด ๋ณ‘๋ชฉ (Information Bottleneck)

์ •๋ณด ๋ณ‘๋ชฉ ์ด๋ก ์€ ์ž…๋ ฅ $X$์™€ ํƒ€๊ฒŸ $Y$ ์‚ฌ์ด์— ์••์ถ•๋œ ํ‘œํ˜„ $Z$๋ฅผ ์ฐพ์Šต๋‹ˆ๋‹ค:

$$\min_{p(z|x)} I(X; Z) - \beta I(Z; Y)$$

  • $I(X; Z)$: ์ตœ์†Œํ™” โ†’ ์••์ถ•
  • $I(Z; Y)$: ์ตœ๋Œ€ํ™” โ†’ ์ •๋ณด ๋ณด์กด
  • $\beta$: trade-off ํŒŒ๋ผ๋ฏธํ„ฐ

์ด๋Š” ๋”ฅ๋Ÿฌ๋‹์˜ ์ด๋ก ์  ์ดํ•ด์— ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

import numpy as np
import matplotlib.pyplot as plt

# JS ๋ฐœ์‚ฐ ์‹œ๊ฐํ™”
def js_divergence(p, q):
    """Jensen-Shannon divergence"""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log((p + 1e-10) / (m + 1e-10)))
    kl_qm = np.sum(q * np.log((q + 1e-10) / (m + 1e-10)))
    return 0.5 * kl_pm + 0.5 * kl_qm

# ๋‘ ๋ถ„ํฌ
n_bins = 10
p = np.random.dirichlet(np.ones(n_bins))
q = np.random.dirichlet(np.ones(n_bins))

# KL vs JS
kl_pq = np.sum(p * np.log((p + 1e-10) / (q + 1e-10)))
kl_qp = np.sum(q * np.log((q + 1e-10) / (p + 1e-10)))
js_pq = js_divergence(p, q)

plt.figure(figsize=(14, 4))

plt.subplot(131)
x = np.arange(n_bins)
width = 0.35
plt.bar(x - width/2, p, width, label='P', alpha=0.7)
plt.bar(x + width/2, q, width, label='Q', alpha=0.7)
plt.xlabel('Bin')
plt.ylabel('Probability')
plt.title('Two Distributions')
plt.legend()
plt.grid(True, axis='y')

plt.subplot(132)
divergences = [kl_pq, kl_qp, js_pq]
labels = ['KL(P||Q)', 'KL(Q||P)', 'JS(P||Q)']
colors = ['blue', 'red', 'green']
bars = plt.bar(labels, divergences, color=colors, alpha=0.7)
plt.ylabel('Divergence')
plt.title('Divergence Comparison')
plt.grid(True, axis='y')

for bar, val in zip(bars, divergences):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{val:.3f}', ha='center', va='bottom')

# JS ๋ฐœ์‚ฐ์˜ ๋Œ€์นญ์„ฑ
plt.subplot(133)
# ๋‹ค์–‘ํ•œ ๋ถ„ํฌ ์Œ์— ๋Œ€ํ•ด
n_tests = 20
kl_diffs = []
js_vals = []

for _ in range(n_tests):
    p_test = np.random.dirichlet(np.ones(n_bins))
    q_test = np.random.dirichlet(np.ones(n_bins))

    kl_pq_test = np.sum(p_test * np.log((p_test + 1e-10) / (q_test + 1e-10)))
    kl_qp_test = np.sum(q_test * np.log((q_test + 1e-10) / (p_test + 1e-10)))
    js_test = js_divergence(p_test, q_test)

    kl_diffs.append(abs(kl_pq_test - kl_qp_test))
    js_vals.append(js_test)

plt.scatter(kl_diffs, js_vals, alpha=0.7)
plt.xlabel('|KL(P||Q) - KL(Q||P)|')
plt.ylabel('JS(P||Q)')
plt.title('JS is Symmetric, KL is Not')
plt.grid(True)

plt.tight_layout()
plt.savefig('js_divergence.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"KL(P||Q) = {kl_pq:.4f}")
print(f"KL(Q||P) = {kl_qp:.4f}")
print(f"JS(P||Q) = {js_pq:.4f}")
print(f"\nKL์€ ๋น„๋Œ€์นญ, JS๋Š” ๋Œ€์นญ!")
print(f"JS(P||Q) = JS(Q||P) ํ•ญ์ƒ ์„ฑ๋ฆฝ")

์—ฐ์Šต ๋ฌธ์ œ

๋ฌธ์ œ 1: ์ตœ๋Œ€ ์—”ํŠธ๋กœํ”ผ ์ฆ๋ช…

๋ผ๊ทธ๋ž‘์ฃผ ์Šน์ˆ˜๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ, ์ œ์•ฝ ์กฐ๊ฑด $\sum_{i=1}^n p_i = 1$ํ•˜์—์„œ ์—”ํŠธ๋กœํ”ผ $H = -\sum_{i=1}^n p_i \log p_i$๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ๋ถ„ํฌ๊ฐ€ ๊ท ๋“ฑ ๋ถ„ํฌ $p_i = 1/n$์ž„์„ ์ฆ๋ช…ํ•˜์‹œ์˜ค.

ํžŒํŠธ: - ๋ผ๊ทธ๋ž‘์ง€์•ˆ: $L = -\sum_i p_i \log p_i - \lambda(\sum_i p_i - 1)$ - $\frac{\partial L}{\partial p_i} = 0$ ํ’€๊ธฐ

๋ฌธ์ œ 2: ์กฐ๊ฑด๋ถ€ ์—”ํŠธ๋กœํ”ผ์™€ ์ƒํ˜ธ ์ •๋ณด๋Ÿ‰

๋‹ค์Œ์„ ์ฆ๋ช…ํ•˜์‹œ์˜ค:

(a) $H(X, Y) = H(X) + H(Y|X)$ (์ฒด์ธ ๊ทœ์น™)

(b) $I(X; Y) = H(X) + H(Y) - H(X, Y)$

(c) $I(X; Y) \leq \min(H(X), H(Y))$

(d) $I(X; Y) = 0 \Leftrightarrow X \perp Y$

๋ฌธ์ œ 3: KL ๋ฐœ์‚ฐ ๊ณ„์‚ฐ

๋‘ ๊ฐ€์šฐ์‹œ์•ˆ ๋ถ„ํฌ $P = \mathcal{N}(\mu_1, \sigma_1^2)$์™€ $Q = \mathcal{N}(\mu_2, \sigma_2^2)$ ์‚ฌ์ด์˜ KL ๋ฐœ์‚ฐ์„ ํ•ด์„์ ์œผ๋กœ ๊ณ„์‚ฐํ•˜์‹œ์˜ค.

๊ฒฐ๊ณผ: $$D_{\text{KL}}(P \| Q) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$

Python์œผ๋กœ ๊ฒ€์ฆํ•˜์‹œ์˜ค (์ˆ˜์น˜ ์ ๋ถ„ vs ํ•ด์„ ํ•ด).

๋ฌธ์ œ 4: VAE์˜ KL ๋ฐœ์‚ฐ

VAE์—์„œ $q(z|x) = \mathcal{N}(\mu, \text{diag}(\sigma^2))$์ด๊ณ  $p(z) = \mathcal{N}(0, I)$์ผ ๋•Œ, KL ๋ฐœ์‚ฐ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์Œ์„ ์œ ๋„ํ•˜์‹œ์˜ค:

$$D_{\text{KL}}(q(z|x) \| p(z)) = \frac{1}{2}\sum_{j=1}^d \left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)$$

์—ฌ๊ธฐ์„œ $d$๋Š” ์ž ์žฌ ๊ณต๊ฐ„์˜ ์ฐจ์›.

๋ฌธ์ œ 5: ์ •๋ณด ๋ณ‘๋ชฉ ์‹œ๋ฎฌ๋ ˆ์ด์…˜

๊ฐ„๋‹จํ•œ ๋ฐ์ดํ„ฐ์…‹ (์˜ˆ: 2D ๋ถ„๋ฅ˜ ๋ฌธ์ œ)์— ๋Œ€ํ•ด ์ •๋ณด ๋ณ‘๋ชฉ ์›๋ฆฌ๋ฅผ ๊ตฌํ˜„ํ•˜์‹œ์˜ค:

(a) ์ž…๋ ฅ $X$, ์••์ถ• ํ‘œํ˜„ $Z$, ํƒ€๊ฒŸ $Y$๋ฅผ ์ •์˜

(b) $I(X; Z)$์™€ $I(Z; Y)$๋ฅผ ์ถ”์ • (ํžˆ์Šคํ† ๊ทธ๋žจ ๊ธฐ๋ฐ˜)

(c) ๋‹ค์–‘ํ•œ $\beta$ ๊ฐ’์— ๋Œ€ํ•ด trade-off ๊ณก์„  ๊ทธ๋ฆฌ๊ธฐ

(d) ์ตœ์  ์••์ถ• ์ˆ˜์ค€ ์ฐพ๊ธฐ

์ฐธ๊ณ  ์ž๋ฃŒ

  1. Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley. [์ •๋ณด ์ด๋ก ์˜ ๋ฐ”์ด๋ธ”]
  2. MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
  3. Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. Chapter 6 (Information Theory).
  4. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. Chapter 3 (Information Theory).
  5. ๋…ผ๋ฌธ: Tishby, N., & Zaslavsky, N. (2015). "Deep Learning and the Information Bottleneck Principle". IEEE Information Theory Workshop.
  6. ๋…ผ๋ฌธ: Kingma, D. P., & Welling, M. (2013). "Auto-Encoding Variational Bayes". ICLR.
  7. ํŠœํ† ๋ฆฌ์–ผ: Shwartz-Ziv, R., & Tishby, N. (2017). "Opening the Black Box of Deep Neural Networks via Information". arXiv:1703.00810.
  8. ๋ธ”๋กœ๊ทธ: Colah's Blog - Visual Information Theory - https://colah.github.io/posts/2015-09-Visual-Information/