10. Information Theory

Learning Objectives
- Understand information content and entropy, and their role as measures of uncertainty
- Learn the definitions and properties of cross-entropy and KL divergence, and their uses in machine learning
- Understand mutual information and how to measure dependence between variables
- Prove key information-theoretic inequalities using Jensen's inequality
- Learn how information theory is used in generative models such as VAEs and GANs
- Compute and visualize entropy, KL divergence, and mutual information in Python
1. Information and Entropy

1.1 Self-Information

The information content of an event $x$:

$$I(x) = -\log P(x) = \log \frac{1}{P(x)}$$

Intuition:
- Low-probability event → more information (more surprising)
- High-probability event → less information

Units:
- $\log_2$: bits
- $\log_e$: nats

Examples:
- Coin flip (heads with probability 0.5): $I(\text{H}) = -\log_2(0.5) = 1$ bit
- Die roll (each face with probability 1/6): $I(1) = -\log_2(1/6) \approx 2.58$ bits
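These two examples can be checked directly. A minimal sketch using NumPy; `self_information` is a helper defined here for illustration, not a library function:

```python
import numpy as np

def self_information(p, base=2):
    """Self-information I(x) = -log(p) of an event with probability p."""
    return -np.log(p) / np.log(base)

# Coin flip: probability 0.5 -> exactly 1 bit
print(self_information(0.5))
# Die roll: probability 1/6 -> about 2.58 bits
print(self_information(1/6))
# Rarer events carry more information
print(self_information(0.01))
```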
1.2 Shannon Entropy

The entropy of a random variable $X$ is its expected information content:

$$H(X) = -\sum_{x} P(x) \log P(x) = \mathbb{E}_{P(x)}[-\log P(x)]$$

For a continuous random variable (differential entropy):

$$h(X) = -\int p(x) \log p(x) \, dx$$

Properties:
1. $H(X) \geq 0$ (non-negativity; this is for the discrete case — differential entropy can be negative)
2. $H(X) = 0 \Leftrightarrow X$ is deterministic (a single outcome has probability 1)
3. Maximum entropy is attained by the uniform distribution
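The caveat in property 1 is easy to see for the Gaussian, whose differential entropy has the closed form $h = \frac{1}{2}\ln(2\pi e \sigma^2)$ nats. A quick check against SciPy (whose `entropy()` method returns differential entropy in nats for continuous distributions):

```python
import numpy as np
from scipy import stats

# Differential entropy of N(0, sigma^2): h = 0.5 * ln(2*pi*e*sigma^2) (nats)
for sigma in [0.1, 1.0, 2.0]:
    h_formula = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
    h_scipy = stats.norm(scale=sigma).entropy()
    print(f"sigma={sigma}: formula={h_formula:.4f}, scipy={h_scipy:.4f}")
# For sigma=0.1, h < 0: unlike discrete entropy, differential entropy can be negative
```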
1.3 Interpretations of Entropy

Entropy can be interpreted as:

- A measure of uncertainty: how unpredictable is the distribution?
- Expected information content: the average information gained per sample
- Optimal code length: the minimum number of bits needed to compress the data

Example:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def entropy(p):
    """Compute entropy in bits (treating 0*log(0) as 0)."""
    p = np.array(p)
    p = p[p > 0]  # drop zero-probability entries
    return -np.sum(p * np.log2(p))

# Entropy of a Bernoulli distribution
p_range = np.linspace(0.01, 0.99, 100)
H_binary = [-p * np.log2(p) - (1-p) * np.log2(1-p) for p in p_range]

plt.figure(figsize=(14, 4))

# Binary entropy function
plt.subplot(131)
plt.plot(p_range, H_binary, linewidth=2)
plt.axhline(1, color='r', linestyle='--', alpha=0.5, label='Maximum H=1')
plt.axvline(0.5, color='g', linestyle='--', alpha=0.5, label='p=0.5')
plt.xlabel('p (probability)')
plt.ylabel('H(X) (bits)')
plt.title('Binary Entropy Function')
plt.legend()
plt.grid(True)

# Fair die vs. biased die
fair_die = [1/6] * 6
biased_die = [0.5, 0.3, 0.1, 0.05, 0.03, 0.02]
H_fair = entropy(fair_die)
H_biased = entropy(biased_die)

plt.subplot(132)
x = np.arange(1, 7)
width = 0.35
plt.bar(x - width/2, fair_die, width, label=f'Fair (H={H_fair:.2f})', alpha=0.7)
plt.bar(x + width/2, biased_die, width, label=f'Biased (H={H_biased:.2f})', alpha=0.7)
plt.xlabel('Face')
plt.ylabel('Probability')
plt.title('Dice Distributions')
plt.legend()
plt.grid(True, axis='y')

# Entropy of various distributions
n_outcomes = 10
distributions = {
    'Uniform': np.ones(n_outcomes) / n_outcomes,
    'Peaked': stats.norm.pdf(np.arange(n_outcomes), 5, 1),
    'Very Peaked': stats.norm.pdf(np.arange(n_outcomes), 5, 0.5),
}
# Normalize
for key in distributions:
    distributions[key] /= distributions[key].sum()

plt.subplot(133)
x_pos = np.arange(len(distributions))
entropies = [entropy(distributions[key]) for key in distributions]
bars = plt.bar(x_pos, entropies, alpha=0.7)
plt.xticks(x_pos, distributions.keys())
plt.ylabel('Entropy (bits)')
plt.title('Entropy of Different Distributions')
plt.grid(True, axis='y')
# Annotate each bar with its value
for bar, h in zip(bars, entropies):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.05,
             f'{h:.2f}', ha='center', va='bottom')

plt.tight_layout()
plt.savefig('entropy_examples.png', dpi=150, bbox_inches='tight')
plt.show()

print("The uniform distribution has maximum entropy!")
print(f"Fair die: {H_fair:.3f} bits")
print(f"Biased die: {H_biased:.3f} bits")
1.4 Maximum Entropy Distributions

Theorem: Among probability distributions over $n$ discrete outcomes, with no constraints beyond normalization, the uniform distribution has maximum entropy:

$$H(X) \leq \log n$$

Equality holds when $P(x) = 1/n$ for all $x$.

Proof: use Lagrange multipliers (see Exercise 1).

For continuous distributions:
- No constraints → no maximizer (entropy is unbounded)
- Fixed mean, support $[0, \infty)$ → exponential distribution
- Fixed mean and variance → Gaussian distribution (maximum entropy!)

This is one reason the Gaussian appears so often in nature.
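The bound $H(X) \leq \log n$ can be probed empirically: sample many random distributions and compare their entropy to that of the uniform distribution. A minimal sketch (the `entropy_bits` helper is defined here for self-containment):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # number of outcomes

def entropy_bits(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# 1000 random distributions over n outcomes all satisfy H(X) <= log2(n)
max_H = np.log2(n)
samples = rng.dirichlet(np.ones(n), size=1000)
assert all(entropy_bits(p) <= max_H + 1e-9 for p in samples)

# Equality holds for the uniform distribution
uniform = np.ones(n) / n
print(entropy_bits(uniform), max_H)  # both are 3.0
```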
2. Cross-Entropy

2.1 Definition

Cross-entropy is the average number of bits needed to encode data sampled from distribution $P$ using a code optimized for distribution $Q$:

$$H(P, Q) = -\sum_{x} P(x) \log Q(x) = \mathbb{E}_{P(x)}[-\log Q(x)]$$

Interpretation:
- $P$: true distribution (data)
- $Q$: model distribution (predictions)
- $H(P, Q)$: the "cost" of approximating $P$ with $Q$

Properties:
1. $H(P, Q) \geq H(P)$ (equality iff $P = Q$)
2. Asymmetric: in general $H(P, Q) \neq H(Q, P)$
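Both properties can be verified on a small example (a minimal sketch; `cross_entropy` is defined here for illustration):

```python
import numpy as np

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log2 Q(x)"""
    return -np.sum(p * np.log2(q))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

H_p = cross_entropy(p, p)   # H(P, P) = H(P)
H_pq = cross_entropy(p, q)
H_qp = cross_entropy(q, p)

print(f"H(P)   = {H_p:.4f}")
print(f"H(P,Q) = {H_pq:.4f}  (>= H(P))")
print(f"H(Q,P) = {H_qp:.4f}  (!= H(P,Q): asymmetric)")
```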
2.2 Binary Cross-Entropy

Used in logistic regression:

$$H(y, \hat{y}) = -[y \log \hat{y} + (1-y) \log(1-\hat{y})]$$

- $y \in \{0, 1\}$: true label
- $\hat{y} \in [0, 1]$: predicted probability

Averaged over $n$ samples:

$$\mathcal{L}_{\text{BCE}} = -\frac{1}{n}\sum_{i=1}^n [y_i \log \hat{y}_i + (1-y_i) \log(1-\hat{y}_i)]$$

2.3 Categorical Cross-Entropy

Used in multi-class classification:

$$H(P, Q) = -\sum_{k=1}^K P_k \log Q_k$$

With one-hot encoded labels ($P_k = \delta_{k,c}$):

$$\mathcal{L}_{\text{CCE}} = -\log Q_c$$

where $c$ is the correct class.
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# Visualize binary cross-entropy
y_true = np.array([0, 0, 1, 1])  # true labels
y_pred_range = np.linspace(0.01, 0.99, 100)

# Loss curve for each sample
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for i, y in enumerate(y_true):
    if y == 1:
        loss = -np.log(y_pred_range)
    else:
        loss = -np.log(1 - y_pred_range)
    axes[i].plot(y_pred_range, loss, linewidth=2)
    axes[i].set_xlabel('Predicted Probability')
    axes[i].set_ylabel('Loss')
    axes[i].set_title(f'True Label: {y}')
    axes[i].grid(True)
    axes[i].axvline(y, color='r', linestyle='--', alpha=0.5, label=f'Optimal: {y}')
    axes[i].legend()
plt.tight_layout()
plt.savefig('binary_cross_entropy.png', dpi=150, bbox_inches='tight')
plt.show()

# Verify against the PyTorch implementation
y_true_tensor = torch.tensor([0., 0., 1., 1.])
y_pred_tensor = torch.tensor([0.1, 0.2, 0.8, 0.9])
bce_loss = nn.BCELoss()
loss = bce_loss(y_pred_tensor, y_true_tensor)

# Manual computation
manual_loss = -(y_true_tensor * torch.log(y_pred_tensor) +
                (1 - y_true_tensor) * torch.log(1 - y_pred_tensor)).mean()

print(f"PyTorch BCE Loss: {loss.item():.4f}")
print(f"Manual BCE Loss: {manual_loss.item():.4f}")
print(f"Difference: {abs(loss - manual_loss).item():.10f}")

# Categorical cross-entropy
print("\nCategorical cross-entropy:")
y_true_cat = torch.tensor([2])  # class 2 is the correct class
y_pred_logits = torch.tensor([[1.0, 2.0, 3.0, 1.5]])  # logits
ce_loss = nn.CrossEntropyLoss()
loss_cat = ce_loss(y_pred_logits, y_true_cat)

# Manual computation
y_pred_softmax = torch.softmax(y_pred_logits, dim=1)
manual_loss_cat = -torch.log(y_pred_softmax[0, 2])

print(f"PyTorch CE Loss: {loss_cat.item():.4f}")
print(f"Manual CE Loss: {manual_loss_cat.item():.4f}")
print(f"Softmax probabilities: {y_pred_softmax.numpy()}")
3. KL Divergence (Kullback-Leibler Divergence)

3.1 Definition

The KL divergence measures the "distance" between two distributions $P$ and $Q$:

$$D_{\text{KL}}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} = \mathbb{E}_{P(x)}\left[\log \frac{P(x)}{Q(x)}\right]$$

For continuous distributions:

$$D_{\text{KL}}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$

3.2 Properties of the KL Divergence

- Non-negativity: $D_{\text{KL}}(P \| Q) \geq 0$
- Equality condition: $D_{\text{KL}}(P \| Q) = 0 \Leftrightarrow P = Q$ (almost everywhere)
- Asymmetric: $D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P)$ → not a metric!
- Relation to cross-entropy: $$D_{\text{KL}}(P \| Q) = H(P, Q) - H(P)$$
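The cross-entropy relation and the asymmetry are easy to confirm numerically, as in this minimal sketch:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

H_p = -np.sum(p * np.log2(p))       # H(P)
H_pq = -np.sum(p * np.log2(q))      # H(P, Q)
kl_pq = np.sum(p * np.log2(p / q))  # D_KL(P || Q)
kl_qp = np.sum(q * np.log2(q / p))  # D_KL(Q || P)

print(f"D_KL(P||Q) = {kl_pq:.4f} = H(P,Q) - H(P) = {H_pq - H_p:.4f}")
print(f"D_KL(Q||P) = {kl_qp:.4f}  (asymmetric)")
```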
3.3 Proof of Non-Negativity (via Jensen's Inequality)

Since $-\log$ is convex, Jensen's inequality gives

$$-\mathbb{E}[\log X] \geq -\log \mathbb{E}[X]$$

Applying this:

$$D_{\text{KL}}(P \| Q) = \mathbb{E}_{P(x)}\left[\log \frac{P(x)}{Q(x)}\right]$$
$$= -\mathbb{E}_{P(x)}\left[\log \frac{Q(x)}{P(x)}\right]$$
$$\geq -\log \mathbb{E}_{P(x)}\left[\frac{Q(x)}{P(x)}\right]$$
$$= -\log \sum_{x} P(x) \frac{Q(x)}{P(x)}$$
$$= -\log \sum_{x} Q(x) = -\log 1 = 0$$

3.4 Forward KL vs. Reverse KL

Forward KL: $D_{\text{KL}}(P \| Q)$
- Forces $Q$ to cover all of $P$ (mode-covering)
- Wherever $P(x) > 0$, we need $Q(x) > 0$

Reverse KL: $D_{\text{KL}}(Q \| P)$
- $Q$ locks onto a mode of $P$ (mode-seeking)
- $Q$ may concentrate on a single mode of $P$
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Target: a mixture of two Gaussians
np.random.seed(42)
x = np.linspace(-5, 10, 1000)
p = 0.5 * stats.norm.pdf(x, 0, 1) + 0.5 * stats.norm.pdf(x, 5, 1)
p = p / np.trapz(p, x)  # normalize

def kl_divergence_forward(mu, sigma, x, p_true):
    """Forward KL: E_p[log(p/q)] (mode-covering)."""
    q = stats.norm.pdf(x, mu, sigma)
    q = q / np.trapz(q, x)
    return np.trapz(p_true * np.log((p_true + 1e-10) / (q + 1e-10)), x)

def kl_divergence_reverse(mu, sigma, x, p_true):
    """Reverse KL: E_q[log(q/p)] (mode-seeking)."""
    q = stats.norm.pdf(x, mu, sigma)
    q = q / np.trapz(q, x)
    return np.trapz(q * np.log((q + 1e-10) / (p_true + 1e-10)), x)

# Optimize by grid search (in practice, use gradient descent)
mu_range = np.linspace(-2, 7, 50)
sigma_range = np.linspace(0.5, 3, 30)
best_forward_kl = float('inf')
best_reverse_kl = float('inf')
best_forward_params = None
best_reverse_params = None
for mu in mu_range:
    for sigma in sigma_range:
        fkl = kl_divergence_forward(mu, sigma, x, p)
        rkl = kl_divergence_reverse(mu, sigma, x, p)
        if fkl < best_forward_kl:
            best_forward_kl = fkl
            best_forward_params = (mu, sigma)
        if rkl < best_reverse_kl:
            best_reverse_kl = rkl
            best_reverse_params = (mu, sigma)

# Best-fitting Gaussians under each objective
q_forward = stats.norm.pdf(x, *best_forward_params)
q_forward = q_forward / np.trapz(q_forward, x)
q_reverse = stats.norm.pdf(x, *best_reverse_params)
q_reverse = q_reverse / np.trapz(q_reverse, x)

# Visualization
plt.figure(figsize=(14, 5))
plt.subplot(121)
plt.plot(x, p, 'k-', linewidth=2, label='True P (bimodal)')
plt.plot(x, q_forward, 'b--', linewidth=2,
         label=f'Forward KL Q (μ={best_forward_params[0]:.2f}, σ={best_forward_params[1]:.2f})')
plt.fill_between(x, 0, p, alpha=0.2, color='k')
plt.fill_between(x, 0, q_forward, alpha=0.2, color='b')
plt.xlabel('x')
plt.ylabel('Density')
plt.title(f'Forward KL: D(P||Q) (mode-covering)\nKL = {best_forward_kl:.4f}')
plt.legend()
plt.grid(True)

plt.subplot(122)
plt.plot(x, p, 'k-', linewidth=2, label='True P (bimodal)')
plt.plot(x, q_reverse, 'r--', linewidth=2,
         label=f'Reverse KL Q (μ={best_reverse_params[0]:.2f}, σ={best_reverse_params[1]:.2f})')
plt.fill_between(x, 0, p, alpha=0.2, color='k')
plt.fill_between(x, 0, q_reverse, alpha=0.2, color='r')
plt.xlabel('x')
plt.ylabel('Density')
plt.title(f'Reverse KL: D(Q||P) (mode-seeking)\nKL = {best_reverse_kl:.4f}')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.savefig('forward_vs_reverse_kl.png', dpi=150, bbox_inches='tight')
plt.show()

print("Forward KL (mode-covering): spreads widely across both modes")
print(f"  Best params: μ={best_forward_params[0]:.2f}, σ={best_forward_params[1]:.2f}")
print("\nReverse KL (mode-seeking): picks a single mode")
print(f"  Best params: μ={best_reverse_params[0]:.2f}, σ={best_reverse_params[1]:.2f}")
4. Mutual Information

4.1 Definition

The mutual information between two random variables $X$ and $Y$:

$$I(X; Y) = \sum_{x, y} P(x, y) \log \frac{P(x, y)}{P(x)P(y)}$$
$$= D_{\text{KL}}(P(X, Y) \| P(X)P(Y))$$

Equivalent forms:

$$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$$
$$= H(X) + H(Y) - H(X, Y)$$

4.2 Conditional Entropy

The conditional entropy of $X$ given $Y$:

$$H(X|Y) = \sum_{y} P(y) H(X|Y=y)$$
$$= -\sum_{x, y} P(x, y) \log P(x|y)$$

Chain rule:

$$H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$$
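The chain rule and the entropy form of mutual information can be verified on a small joint distribution table; a minimal sketch, using an arbitrary 2×3 joint distribution chosen here for illustration:

```python
import numpy as np

# An arbitrary joint distribution P(X, Y) as a 2x3 table (sums to 1)
P_xy = np.array([[0.2, 0.1, 0.1],
                 [0.1, 0.3, 0.2]])
P_x = P_xy.sum(axis=1)  # marginal of X
P_y = P_xy.sum(axis=0)  # marginal of Y

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_x, H_y, H_xy = H(P_x), H(P_y), H(P_xy.ravel())

# Chain rule: H(Y|X) = H(X,Y) - H(X)
H_y_given_x = H_xy - H_x
# Mutual information, via entropies and via the KL form
I_xy = H_x + H_y - H_xy
I_kl = np.sum(P_xy * np.log2(P_xy / np.outer(P_x, P_y)))

print(f"H(X,Y) = {H_xy:.4f} = H(X) + H(Y|X) = {H_x + H_y_given_x:.4f}")
print(f"I(X;Y) = {I_xy:.4f} (KL form: {I_kl:.4f})")
```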
4.3 Properties of Mutual Information

- Non-negativity: $I(X; Y) \geq 0$
- Symmetry: $I(X; Y) = I(Y; X)$
- Independence: $I(X; Y) = 0 \Leftrightarrow X \perp Y$
- Upper bound: $I(X; Y) \leq \min(H(X), H(Y))$

Intuition:
- $I(X; Y)$: how much does knowing $X$ reduce the uncertainty about $Y$?
- $= H(Y) - H(Y|X)$: the entropy of $Y$ minus the conditional entropy of $Y$ given $X$

4.4 Application: Feature Selection

In machine learning, a feature $X$ with high mutual information with the target $Y$ is a useful feature.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import mutual_info_classif
from sklearn.datasets import make_classification

# Generate data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_repeated=0, random_state=42)

# Estimate mutual information
mi = mutual_info_classif(X, y, random_state=42)

# Visualization
plt.figure(figsize=(14, 4))
plt.subplot(131)
plt.bar(range(len(mi)), mi)
plt.xlabel('Feature Index')
plt.ylabel('Mutual Information')
plt.title('Mutual Information with Target')
plt.grid(True, axis='y')

# Highest-MI feature vs. lowest-MI feature
high_mi_idx = np.argmax(mi)
low_mi_idx = np.argmin(mi)

plt.subplot(132)
plt.scatter(X[:, high_mi_idx], y, alpha=0.5)
plt.xlabel(f'Feature {high_mi_idx}')
plt.ylabel('Target')
plt.title(f'High MI Feature (MI={mi[high_mi_idx]:.3f})')
plt.grid(True)

plt.subplot(133)
plt.scatter(X[:, low_mi_idx], y, alpha=0.5)
plt.xlabel(f'Feature {low_mi_idx}')
plt.ylabel('Target')
plt.title(f'Low MI Feature (MI={mi[low_mi_idx]:.3f})')
plt.grid(True)

plt.tight_layout()
plt.savefig('mutual_information.png', dpi=150, bbox_inches='tight')
plt.show()

print("Features with the highest mutual information:")
top_features = np.argsort(mi)[-5:][::-1]
for idx in top_features:
    print(f"  Feature {idx}: MI = {mi[idx]:.4f}")
print("\nFeatures with the lowest mutual information:")
bottom_features = np.argsort(mi)[:5]
for idx in bottom_features:
    print(f"  Feature {idx}: MI = {mi[idx]:.4f}")
5. Jensen's Inequality

5.1 Definition

If $f$ is a convex function:

$$f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$$

If $f$ is a concave function:

$$f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$$

Discrete form (for convex $f$):

$$f\left(\sum_i \lambda_i x_i\right) \leq \sum_i \lambda_i f(x_i)$$

where $\lambda_i \geq 0$ and $\sum_i \lambda_i = 1$.

5.2 Examples of Convex Functions

- $f(x) = x^2$
- $f(x) = e^x$
- $f(x) = -\log x$ (for $x > 0$)

5.3 Application: Non-Negativity of the KL Divergence

$$D_{\text{KL}}(P \| Q) = \mathbb{E}_{P}\left[\log \frac{P(x)}{Q(x)}\right]$$
$$= -\mathbb{E}_{P}\left[\log \frac{Q(x)}{P(x)}\right]$$

Since $-\log$ is convex, Jensen's inequality gives:

$$-\mathbb{E}_{P}\left[\log \frac{Q(x)}{P(x)}\right] \geq -\log \mathbb{E}_{P}\left[\frac{Q(x)}{P(x)}\right]$$
$$= -\log \sum_{x} P(x) \frac{Q(x)}{P(x)} = -\log \sum_{x} Q(x) = 0$$

5.4 Application: Deriving the ELBO (VAE)

In a VAE, the Evidence Lower BOund (ELBO):

$$\log P(x) = \log \int P(x, z) \, dz = \log \int Q(z) \frac{P(x, z)}{Q(z)} \, dz$$

By Jensen ($\log$ is concave):

$$\geq \int Q(z) \log \frac{P(x, z)}{Q(z)} \, dz$$
$$= \mathbb{E}_{Q(z)}[\log P(x, z)] - \mathbb{E}_{Q(z)}[\log Q(z)]$$
$$= \mathbb{E}_{Q(z)}[\log P(x|z)] - D_{\text{KL}}(Q(z) \| P(z))$$

This is the VAE objective (the loss is its negative)!
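The bound $\log P(x) \geq \text{ELBO}$ can be verified exactly on a toy model with a discrete latent variable, where the integral becomes a finite sum. A minimal sketch; the joint $P(x,z)$ here is an arbitrary vector chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: discrete latent z with 4 values; arbitrary joint P(x, z) for one fixed x
P_xz = rng.dirichlet(np.ones(4)) * 0.3  # so that P(x) = sum_z P(x, z) = 0.3
log_Px = np.log(P_xz.sum())

def elbo(Q):
    """ELBO = E_Q[log P(x,z) - log Q(z)] for a distribution Q over z."""
    return np.sum(Q * np.log(P_xz / Q))

# For any Q(z), ELBO <= log P(x)
for _ in range(100):
    Q = rng.dirichlet(np.ones(4))
    assert elbo(Q) <= log_Px + 1e-9

# Equality when Q(z) = P(z|x), the true posterior
posterior = P_xz / P_xz.sum()
print(f"log P(x) = {log_Px:.6f}, ELBO at posterior = {elbo(posterior):.6f}")
```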
import numpy as np
import matplotlib.pyplot as plt

# Visualize Jensen's inequality
np.random.seed(42)

# Convex function: f(x) = x^2
x_values = np.linspace(-2, 2, 100)
f_convex = x_values**2

# Sample points and weights
samples = np.array([-1.5, -0.5, 0.5, 1.5])
weights = np.array([0.25, 0.25, 0.25, 0.25])

# E[X] and f(E[X])
E_x = np.sum(weights * samples)
f_E_x = E_x**2
# E[f(X)]
E_f_x = np.sum(weights * samples**2)

plt.figure(figsize=(14, 5))

# Convex case
plt.subplot(121)
plt.plot(x_values, f_convex, 'b-', linewidth=2, label='f(x) = x²')
plt.scatter(samples, samples**2, color='r', s=100, zorder=5,
            label='Sample points')
plt.axvline(E_x, color='g', linestyle='--', alpha=0.7,
            label=f'E[X] = {E_x:.2f}')
plt.scatter([E_x], [f_E_x], color='g', s=200, marker='s',
            zorder=5, label=f'f(E[X]) = {f_E_x:.2f}')
plt.axhline(E_f_x, color='orange', linestyle='--', alpha=0.7,
            label=f'E[f(X)] = {E_f_x:.2f}')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Jensen\'s Inequality (Convex)\nf(E[X]) ≤ E[f(X)]')
plt.legend()
plt.grid(True)

# Concave function: g(x) = log(x)
x_values_pos = np.linspace(0.1, 3, 100)
g_concave = np.log(x_values_pos)
samples_pos = np.array([0.5, 1.0, 1.5, 2.0])
E_x_pos = np.sum(weights * samples_pos)
g_E_x = np.log(E_x_pos)
E_g_x = np.sum(weights * np.log(samples_pos))

plt.subplot(122)
plt.plot(x_values_pos, g_concave, 'b-', linewidth=2, label='g(x) = log(x)')
plt.scatter(samples_pos, np.log(samples_pos), color='r', s=100,
            zorder=5, label='Sample points')
plt.axvline(E_x_pos, color='g', linestyle='--', alpha=0.7,
            label=f'E[X] = {E_x_pos:.2f}')
plt.scatter([E_x_pos], [g_E_x], color='g', s=200, marker='s',
            zorder=5, label=f'g(E[X]) = {g_E_x:.2f}')
plt.axhline(E_g_x, color='orange', linestyle='--', alpha=0.7,
            label=f'E[g(X)] = {E_g_x:.2f}')
plt.xlabel('x')
plt.ylabel('g(x)')
plt.title('Jensen\'s Inequality (Concave)\ng(E[X]) ≥ E[g(X)]')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.savefig('jensen_inequality.png', dpi=150, bbox_inches='tight')
plt.show()

print("Convex (f(x) = x²):")
print(f"  f(E[X]) = {f_E_x:.4f}")
print(f"  E[f(X)] = {E_f_x:.4f}")
print(f"  f(E[X]) ≤ E[f(X)]: {f_E_x <= E_f_x}")
print("\nConcave (g(x) = log(x)):")
print(f"  g(E[X]) = {g_E_x:.4f}")
print(f"  E[g(X)] = {E_g_x:.4f}")
print(f"  g(E[X]) ≥ E[g(X)]: {g_E_x >= E_g_x}")
6. Information Theory in Machine Learning

6.1 Cross-Entropy Loss = MLE

In classification, minimizing cross-entropy equals minimizing the negative log-likelihood:

$$\min_{\theta} H(P_{\text{data}}, P_{\theta}) = \min_{\theta} -\mathbb{E}_{(x,y) \sim P_{\text{data}}}[\log P_{\theta}(y|x)]$$

This is identical to maximum likelihood estimation!
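This equivalence can be seen directly in PyTorch: the library's cross-entropy loss equals the mean negative log-likelihood of the correct classes computed by hand. A minimal sketch with random logits:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 4)            # batch of 8, 4 classes
targets = torch.randint(0, 4, (8,))   # random class labels

# Cross-entropy loss as implemented in PyTorch
ce = F.cross_entropy(logits, targets)

# Negative log-likelihood of the correct classes, computed by hand
log_probs = F.log_softmax(logits, dim=1)
nll = -log_probs[torch.arange(8), targets].mean()

print(f"cross_entropy = {ce.item():.6f}, NLL = {nll.item():.6f}")  # identical
```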
6.2 The VAE's ELBO

The objective function of the Variational Autoencoder:

$$\mathcal{L}_{\text{VAE}} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \| p(z))$$

- First term: reconstruction loss
- Second term: KL regularization
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super(VAE, self).__init__()
        # Encoder
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
        # Decoder
        self.fc3 = nn.Linear(latent_dim, hidden_dim)
        self.fc4 = nn.Linear(hidden_dim, input_dim)

    def encode(self, x):
        h = F.relu(self.fc1(x))
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        h = F.relu(self.fc3(z))
        return torch.sigmoid(self.fc4(h))

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar

def vae_loss(recon_x, x, mu, logvar):
    # Reconstruction loss (binary cross-entropy)
    BCE = F.binary_cross_entropy(recon_x, x, reduction='sum')
    # KL divergence D_KL(q(z|x) || p(z)); since p(z) = N(0, I), it has the
    # closed form -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return BCE + KLD, BCE, KLD

# Quick test
model = VAE(input_dim=784, latent_dim=20)
x = torch.rand(32, 784)  # batch of 32; values in [0, 1], as BCE requires
recon_x, mu, logvar = model(x)
loss, bce, kld = vae_loss(recon_x, x, mu, logvar)

print(f"Total Loss: {loss.item():.2f}")
print(f"Reconstruction Loss (BCE): {bce.item():.2f}")
print(f"KL Divergence: {kld.item():.2f}")
print(f"\nELBO = -Loss = {-loss.item():.2f}")
6.3 GANs and the JS Divergence

The original GAN objective is related to the Jensen-Shannon (JS) divergence:

$$D_{\text{JS}}(P \| Q) = \frac{1}{2}D_{\text{KL}}(P \| M) + \frac{1}{2}D_{\text{KL}}(Q \| M)$$

where $M = \frac{1}{2}(P + Q)$.

The optimal discriminator in a GAN estimates the JS divergence.
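This connection can be checked numerically with discrete distributions. A minimal sketch, assuming the standard result that for a fixed generator the optimal discriminator is $D^*(x) = \frac{p(x)}{p(x)+q(x)}$, at which the GAN value function equals $2\,D_{\text{JS}}(P\|Q) - \log 4$:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(10))  # "data" distribution
q = rng.dirichlet(np.ones(10))  # "generator" distribution

def kl(a, b):
    return np.sum(a * np.log(a / b))  # in nats

m = 0.5 * (p + q)
js = 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Optimal discriminator D*(x) = p(x) / (p(x) + q(x))
D_star = p / (p + q)
# GAN value function at D*: E_p[log D*] + E_q[log(1 - D*)]
V = np.sum(p * np.log(D_star)) + np.sum(q * np.log(1 - D_star))

print(f"V(D*)              = {V:.6f}")
print(f"2*JS(P||Q) - log 4 = {2 * js - np.log(4):.6f}")  # the same value
```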
6.4 Information Bottleneck

Information bottleneck theory seeks a compressed representation $Z$ between the input $X$ and the target $Y$:

$$\min_{p(z|x)} I(X; Z) - \beta I(Z; Y)$$

- Minimizing $I(X; Z)$ → compression
- Maximizing $I(Z; Y)$ → preserving relevant information
- $\beta$: trade-off parameter

It is used in the theoretical analysis of deep learning.
import numpy as np
import matplotlib.pyplot as plt

def js_divergence(p, q):
    """Jensen-Shannon divergence."""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log((p + 1e-10) / (m + 1e-10)))
    kl_qm = np.sum(q * np.log((q + 1e-10) / (m + 1e-10)))
    return 0.5 * kl_pm + 0.5 * kl_qm

# Two random distributions
n_bins = 10
p = np.random.dirichlet(np.ones(n_bins))
q = np.random.dirichlet(np.ones(n_bins))

# KL vs. JS
kl_pq = np.sum(p * np.log((p + 1e-10) / (q + 1e-10)))
kl_qp = np.sum(q * np.log((q + 1e-10) / (p + 1e-10)))
js_pq = js_divergence(p, q)

plt.figure(figsize=(14, 4))
plt.subplot(131)
x = np.arange(n_bins)
width = 0.35
plt.bar(x - width/2, p, width, label='P', alpha=0.7)
plt.bar(x + width/2, q, width, label='Q', alpha=0.7)
plt.xlabel('Bin')
plt.ylabel('Probability')
plt.title('Two Distributions')
plt.legend()
plt.grid(True, axis='y')

plt.subplot(132)
divergences = [kl_pq, kl_qp, js_pq]
labels = ['KL(P||Q)', 'KL(Q||P)', 'JS(P||Q)']
colors = ['blue', 'red', 'green']
bars = plt.bar(labels, divergences, color=colors, alpha=0.7)
plt.ylabel('Divergence')
plt.title('Divergence Comparison')
plt.grid(True, axis='y')
for bar, val in zip(bars, divergences):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{val:.3f}', ha='center', va='bottom')

# Symmetry of the JS divergence across random distribution pairs
plt.subplot(133)
n_tests = 20
kl_diffs = []
js_vals = []
for _ in range(n_tests):
    p_test = np.random.dirichlet(np.ones(n_bins))
    q_test = np.random.dirichlet(np.ones(n_bins))
    kl_pq_test = np.sum(p_test * np.log((p_test + 1e-10) / (q_test + 1e-10)))
    kl_qp_test = np.sum(q_test * np.log((q_test + 1e-10) / (p_test + 1e-10)))
    js_test = js_divergence(p_test, q_test)
    kl_diffs.append(abs(kl_pq_test - kl_qp_test))
    js_vals.append(js_test)
plt.scatter(kl_diffs, js_vals, alpha=0.7)
plt.xlabel('|KL(P||Q) - KL(Q||P)|')
plt.ylabel('JS(P||Q)')
plt.title('JS is Symmetric, KL is Not')
plt.grid(True)

plt.tight_layout()
plt.savefig('js_divergence.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"KL(P||Q) = {kl_pq:.4f}")
print(f"KL(Q||P) = {kl_qp:.4f}")
print(f"JS(P||Q) = {js_pq:.4f}")
print("\nKL is asymmetric; JS is symmetric!")
print("JS(P||Q) = JS(Q||P) always holds")
Exercises

Problem 1: Maximum Entropy Proof

Using Lagrange multipliers, prove that under the constraint $\sum_{i=1}^n p_i = 1$, the distribution maximizing the entropy $H = -\sum_{i=1}^n p_i \log p_i$ is the uniform distribution $p_i = 1/n$.

Hint:
- Lagrangian: $L = -\sum_i p_i \log p_i - \lambda(\sum_i p_i - 1)$
- Set $\frac{\partial L}{\partial p_i} = 0$

Problem 2: Conditional Entropy and Mutual Information

Prove the following:

(a) $H(X, Y) = H(X) + H(Y|X)$ (chain rule)

(b) $I(X; Y) = H(X) + H(Y) - H(X, Y)$

(c) $I(X; Y) \leq \min(H(X), H(Y))$

(d) $I(X; Y) = 0 \Leftrightarrow X \perp Y$

Problem 3: KL Divergence Computation

Compute analytically the KL divergence between two Gaussians $P = \mathcal{N}(\mu_1, \sigma_1^2)$ and $Q = \mathcal{N}(\mu_2, \sigma_2^2)$.

Result: $$D_{\text{KL}}(P \| Q) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$

Verify in Python (numerical integration vs. the closed form).

Problem 4: The VAE's KL Divergence

For a VAE with $q(z|x) = \mathcal{N}(\mu, \text{diag}(\sigma^2))$ and $p(z) = \mathcal{N}(0, I)$, derive that the KL divergence is:

$$D_{\text{KL}}(q(z|x) \| p(z)) = \frac{1}{2}\sum_{j=1}^d \left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)$$

where $d$ is the dimension of the latent space.

Problem 5: Information Bottleneck Simulation

For a simple dataset (e.g., a 2D classification problem), implement the information bottleneck principle:

(a) Define the input $X$, compressed representation $Z$, and target $Y$

(b) Estimate $I(X; Z)$ and $I(Z; Y)$ (histogram-based)

(c) Plot the trade-off curve for various values of $\beta$

(d) Find the optimal compression level
References

- Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley. [The classic information theory text]
- MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
- Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. Chapter 6 (Information Theory).
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. Chapter 3 (Information Theory).
- Paper: Tishby, N., & Zaslavsky, N. (2015). "Deep Learning and the Information Bottleneck Principle". IEEE Information Theory Workshop.
- Paper: Kingma, D. P., & Welling, M. (2014). "Auto-Encoding Variational Bayes". ICLR.
- Paper: Shwartz-Ziv, R., & Tishby, N. (2017). "Opening the Black Box of Deep Neural Networks via Information". arXiv:1703.00810.
- Blog: Colah's Blog, "Visual Information Theory" - https://colah.github.io/posts/2015-09-Visual-Information/