08. Probability for Machine Learning¶
Learning Objectives¶
- Understand and apply basic probability axioms, conditional probability, and Bayes' theorem
- Learn the concept of random variables and the differences between discrete and continuous distributions
- Calculate and interpret key statistics of random variables such as expectation, variance, and covariance
- Learn the characteristics and applications of probability distributions commonly used in machine learning
- Implement probabilistic inference and Bayesian updates using Bayes' theorem
- Understand the difference between generative and discriminative models from a probabilistic perspective
1. Foundations of Probability¶
1.1 Axioms of Probability¶
Sample Space $\Omega$: set of all possible outcomes
Event $A$: subset of the sample space
Probability Measure $P$ satisfies the following axioms:
- Non-negativity: $P(A) \geq 0$ for all $A$
- Normalization: $P(\Omega) = 1$
- Countable Additivity: For mutually exclusive events $A_1, A_2, \ldots$ $$P\left(\bigcup_{i=1}^\infty A_i\right) = \sum_{i=1}^\infty P(A_i)$$
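As a quick sanity check, the three axioms can be verified empirically with a simulated fair die. This is a minimal sketch (the die simulation itself is our illustration, not part of the text above): empirical frequencies are non-negative, sum to 1 over the sample space, and add up for disjoint events.

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # fair six-sided die

# Empirical probabilities of the elementary events {1}, ..., {6}
p = np.array([(rolls == face).mean() for face in range(1, 7)])

# Non-negativity and normalization
assert (p >= 0).all()
assert abs(p.sum() - 1.0) < 1e-9

# Additivity for the disjoint events A = {roll is even}, B = {roll == 1}
p_A = (rolls % 2 == 0).mean()
p_B = (rolls == 1).mean()
p_union = ((rolls % 2 == 0) | (rolls == 1)).mean()
print(p_union, p_A + p_B)  # equal, since A and B are disjoint
```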
1.2 Conditional Probability¶
Probability of event $A$ occurring given that event $B$ has occurred:
$$ P(A|B) = \frac{P(A \cap B)}{P(B)}, \quad \text{if } P(B) > 0 $$
Intuition: When we know $B$ has occurred, the sample space shrinks from $\Omega$ to $B$.
1.3 Independence¶
Events $A$ and $B$ are independent if:
$$ P(A \cap B) = P(A) \cdot P(B) $$
or equivalently: $$ P(A|B) = P(A) $$
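These definitions can be checked exactly by enumerating a small sample space. In this sketch (the specific events are our illustration), $A$ = "first die shows 6" and $B$ = "the two dice sum to 7" turn out to be independent:

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely pairs

def P(event):
    """Exact probability of an event over the uniform sample space."""
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

A = lambda w: w[0] == 6         # first die shows 6
B = lambda w: w[0] + w[1] == 7  # the two dice sum to 7

P_A, P_B = P(A), P(B)
P_AB = P(lambda w: A(w) and B(w))

print(P_A, P_B, P_AB)     # 1/6, 1/6, 1/36
print(P_AB == P_A * P_B)  # True: A and B are independent
print(P_AB / P_B == P_A)  # True: equivalently, P(A|B) = P(A)
```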
1.4 Law of Total Probability¶
If $B_1, \ldots, B_n$ form a partition of the sample space:
$$ P(A) = \sum_{i=1}^n P(A|B_i)P(B_i) $$
1.5 Bayes' Theorem¶
$$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $$
or using the law of total probability:
$$ P(A|B) = \frac{P(B|A)P(A)}{\sum_{i} P(B|A_i)P(A_i)} $$
Terminology:
- $P(A)$: prior probability
- $P(B|A)$: likelihood
- $P(A|B)$: posterior probability
- $P(B)$: marginal probability or evidence
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Bayes' theorem example: medical diagnosis
# Disease prevalence: 1%
P_disease = 0.01
P_no_disease = 1 - P_disease

# Test accuracy
# Sensitivity: probability of a positive result given disease
P_positive_given_disease = 0.95
# Specificity: probability of a negative result given no disease
P_negative_given_no_disease = 0.95
P_positive_given_no_disease = 1 - P_negative_given_no_disease

# Total probability: probability of a positive test
P_positive = (P_positive_given_disease * P_disease +
              P_positive_given_no_disease * P_no_disease)

# Bayes' theorem: probability of actually having the disease given a positive test
P_disease_given_positive = (P_positive_given_disease * P_disease) / P_positive

print("Medical diagnosis example (Bayes' theorem)")
print(f"Disease prevalence (prior): {P_disease:.1%}")
print(f"Test sensitivity: {P_positive_given_disease:.1%}")
print(f"Test specificity: {P_negative_given_no_disease:.1%}")
print(f"\nProbability of a positive test (total probability): {P_positive:.4f}")
print(f"Probability of disease given a positive test (posterior): {P_disease_given_positive:.1%}")
print(f"\nInterpretation: even with a positive test, the probability of disease is only {P_disease_given_positive:.1%}")
print("   (false positives dominate because the prevalence is low)")

# Visualization: Bayes' theorem
fig, ax = plt.subplots(figsize=(12, 6))
categories = ['Prior\nP(disease)', 'Likelihood\nP(+|disease)', 'Posterior\nP(disease|+)']
probabilities = [P_disease, P_positive_given_disease, P_disease_given_positive]
colors = ['skyblue', 'lightgreen', 'salmon']
bars = ax.bar(categories, probabilities, color=colors, edgecolor='black', linewidth=2)

# Value labels
for bar, prob in zip(bars, probabilities):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{prob:.1%}', ha='center', va='bottom', fontsize=14, fontweight='bold')

ax.set_ylabel('Probability', fontsize=13)
ax.set_title("Bayes' Theorem: Medical Diagnosis Example", fontsize=15)
ax.set_ylim(0, 1.0)
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.savefig('bayes_theorem_medical.png', dpi=150)
plt.show()
2. Random Variables¶
2.1 Definition of Random Variables¶
Random Variable: a function from the sample space to real numbers $$X: \Omega \to \mathbb{R}$$
- Discrete random variable: takes countably many values (e.g., dice rolls, coin flips)
- Continuous random variable: takes values in a continuum (e.g., height, temperature)
2.2 Probability Mass Function (PMF)¶
For a discrete random variable $X$:
$$ p_X(x) = P(X = x) $$
Properties:
- $p_X(x) \geq 0$ for all $x$
- $\sum_{x} p_X(x) = 1$
2.3 Probability Density Function (PDF)¶
For a continuous random variable $X$:
$$ P(a \leq X \leq b) = \int_a^b f_X(x) dx $$
Properties:
- $f_X(x) \geq 0$ for all $x$
- $\int_{-\infty}^{\infty} f_X(x) dx = 1$
- $P(X = x) = 0$ (the probability of any single point is 0)
2.4 Cumulative Distribution Function (CDF)¶
$$ F_X(x) = P(X \leq x) $$
Properties:
- Non-decreasing
- $\lim_{x \to -\infty} F_X(x) = 0$, $\lim_{x \to \infty} F_X(x) = 1$
- For continuous random variables: $f_X(x) = \frac{d}{dx}F_X(x)$
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# 1. Discrete: binomial distribution
n, p = 10, 0.5
x_binom = np.arange(0, n+1)
pmf_binom = stats.binom.pmf(x_binom, n, p)
cdf_binom = stats.binom.cdf(x_binom, n, p)
axes[0, 0].bar(x_binom, pmf_binom, color='skyblue', edgecolor='black')
axes[0, 0].set_title('Binomial PMF\n$n=10, p=0.5$', fontsize=12)
axes[0, 0].set_xlabel('x')
axes[0, 0].set_ylabel('P(X=x)')
axes[0, 0].grid(True, alpha=0.3)
axes[1, 0].step(x_binom, cdf_binom, where='post', linewidth=2, color='blue')
axes[1, 0].set_title('Binomial CDF', fontsize=12)
axes[1, 0].set_xlabel('x')
axes[1, 0].set_ylabel('P(X≤x)')
axes[1, 0].grid(True, alpha=0.3)

# 2. Continuous: normal distribution
mu, sigma = 0, 1
x_norm = np.linspace(-4, 4, 1000)
pdf_norm = stats.norm.pdf(x_norm, mu, sigma)
cdf_norm = stats.norm.cdf(x_norm, mu, sigma)
axes[0, 1].plot(x_norm, pdf_norm, linewidth=2, color='red')
axes[0, 1].fill_between(x_norm, pdf_norm, alpha=0.3, color='red')
axes[0, 1].set_title('Normal PDF\n$\\mu=0, \\sigma=1$', fontsize=12)
axes[0, 1].set_xlabel('x')
axes[0, 1].set_ylabel('f(x)')
axes[0, 1].grid(True, alpha=0.3)
axes[1, 1].plot(x_norm, cdf_norm, linewidth=2, color='darkred')
axes[1, 1].set_title('Normal CDF', fontsize=12)
axes[1, 1].set_xlabel('x')
axes[1, 1].set_ylabel('F(x)')
axes[1, 1].grid(True, alpha=0.3)

# 3. Continuous: exponential distribution
lam = 1.0
x_exp = np.linspace(0, 5, 1000)
pdf_exp = stats.expon.pdf(x_exp, scale=1/lam)
cdf_exp = stats.expon.cdf(x_exp, scale=1/lam)
axes[0, 2].plot(x_exp, pdf_exp, linewidth=2, color='green')
axes[0, 2].fill_between(x_exp, pdf_exp, alpha=0.3, color='green')
axes[0, 2].set_title('Exponential PDF\n$\\lambda=1$', fontsize=12)
axes[0, 2].set_xlabel('x')
axes[0, 2].set_ylabel('f(x)')
axes[0, 2].grid(True, alpha=0.3)
axes[1, 2].plot(x_exp, cdf_exp, linewidth=2, color='darkgreen')
axes[1, 2].set_title('Exponential CDF', fontsize=12)
axes[1, 2].set_xlabel('x')
axes[1, 2].set_ylabel('F(x)')
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('pmf_pdf_cdf.png', dpi=150)
plt.show()

print("PMF vs PDF:")
print("  PMF (discrete): probability of an exact value, P(X=x)")
print("  PDF (continuous): probability density; interval probabilities come from integration")
print("  CDF: cumulative probability P(X≤x), defined for both discrete and continuous variables")
2.5 Joint, Marginal, and Conditional Distributions¶
Joint Distribution: $$P(X = x, Y = y)$$ or $$f_{X,Y}(x, y)$$
Marginal Distribution: $$p_X(x) = \sum_y p_{X,Y}(x, y)$$ or $$f_X(x) = \int f_{X,Y}(x, y) dy$$
Conditional Distribution: $$p_{X|Y}(x|y) = \frac{p_{X,Y}(x, y)}{p_Y(y)}$$
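All three objects are easy to see on a small discrete table. A minimal sketch with a hypothetical joint PMF (the numbers are made up for illustration): marginals come from summing out the other variable, and conditionals from dividing each column by its marginal.

```python
import numpy as np

# Hypothetical joint PMF p_{X,Y}(x, y) over X in {0,1}, Y in {0,1,2}
joint = np.array([[0.10, 0.20, 0.10],   # row x = 0
                  [0.30, 0.15, 0.15]])  # row x = 1; columns are y = 0, 1, 2
assert np.isclose(joint.sum(), 1.0)

# Marginals: sum out the other variable
p_X = joint.sum(axis=1)   # p_X(x) = sum_y p(x, y)
p_Y = joint.sum(axis=0)   # p_Y(y) = sum_x p(x, y)

# Conditional p_{X|Y}(x|y) = p(x, y) / p_Y(y); one column per value of y
cond_X_given_Y = joint / p_Y

print(p_X)                   # [0.4 0.6]
print(p_Y)                   # [0.4 0.35 0.25]
print(cond_X_given_Y[:, 0])  # p_{X|Y}(x|y=0) = [0.25 0.75]
```

Each column of `cond_X_given_Y` sums to 1: conditioning on $Y=y$ renormalizes that slice of the joint table.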
# Joint distribution example: bivariate normal
from scipy.stats import multivariate_normal

# Parameters
mu = np.array([0, 0])
cov = np.array([[1, 0.7],
                [0.7, 1]])

# Grid
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
pos = np.dstack((X, Y))

# Joint PDF
rv = multivariate_normal(mu, cov)
Z = rv.pdf(pos)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Contour plot
ax = axes[0]
contour = ax.contourf(X, Y, Z, levels=15, cmap='viridis')
plt.colorbar(contour, ax=ax)
ax.set_xlabel('X', fontsize=12)
ax.set_ylabel('Y', fontsize=12)
ax.set_title('Joint distribution $f_{X,Y}(x,y)$ (bivariate normal)', fontsize=14)
ax.grid(True, alpha=0.3)

# Marginal distributions
ax = axes[1]
marginal_X = stats.norm.pdf(x, mu[0], np.sqrt(cov[0, 0]))
marginal_Y = stats.norm.pdf(y, mu[1], np.sqrt(cov[1, 1]))
ax.plot(x, marginal_X, linewidth=3, label='Marginal $f_X(x)$', color='blue')
ax.plot(y, marginal_Y, linewidth=3, label='Marginal $f_Y(y)$', color='red')
ax.set_xlabel('Value', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title('Marginal distributions', fontsize=14)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('joint_marginal_distributions.png', dpi=150)
plt.show()

print(f"Covariance matrix:\n{cov}")
print(f"Correlation coefficient: {cov[0,1] / np.sqrt(cov[0,0] * cov[1,1]):.2f}")
3. Expectation and Variance¶
3.1 Expectation¶
Discrete: $$\mathbb{E}[X] = \sum_x x \cdot p_X(x)$$
Continuous: $$\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f_X(x) dx$$
Expectation of a function (LOTUS - Law of the Unconscious Statistician): $$\mathbb{E}[g(X)] = \sum_x g(x) \cdot p_X(x) \quad \text{or} \quad \int g(x) \cdot f_X(x) dx$$
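For instance, LOTUS lets us compute $\mathbb{E}[X^2]$ for a fair die directly from the PMF of $X$, without deriving the distribution of $X^2$ first. A small sketch (the die example is our illustration):

```python
import numpy as np

x = np.arange(1, 7)
p = np.full(6, 1/6)  # PMF of a fair die

E_X = np.sum(x * p)       # E[X] = 3.5
E_X2 = np.sum(x**2 * p)   # LOTUS with g(x) = x^2: E[X^2] = 91/6 ≈ 15.167

# Note E[X^2] != (E[X])^2; the gap is exactly the variance
var = E_X2 - E_X**2       # 35/12 ≈ 2.917
print(E_X, E_X2, var)
```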
3.2 Properties of Expectation¶
- Linearity: $$\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]$$
- Product of independent variables: if $X$ and $Y$ are independent, then $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$
3.3 Variance¶
$$ \text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 $$
Standard Deviation: $$\sigma_X = \sqrt{\text{Var}(X)}$$
Properties of variance:
- $\text{Var}(aX + b) = a^2 \text{Var}(X)$
- If $X, Y$ independent, then $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$
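Both variance properties can be checked by simulation. A minimal Monte Carlo sketch (the distributions, constants, and sample sizes here are chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(0, 2, size=1_000_000)       # Var(X) = 4
Y = rng.exponential(1.0, size=1_000_000)   # independent of X, Var(Y) = 1

a, b = 3.0, 5.0
# Var(aX + b) = a^2 Var(X): the shift b has no effect on spread
print(np.var(a * X + b), a**2 * np.var(X))
# Var(X + Y) = Var(X) + Var(Y) for independent X, Y
print(np.var(X + Y), np.var(X) + np.var(Y))
```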
3.4 Covariance¶
$$ \text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] $$
Correlation Coefficient: $$ \rho_{X,Y} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} \in [-1, 1] $$
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Monte Carlo estimation of expectation and variance
np.random.seed(42)

# Sample from a normal distribution
mu, sigma = 2, 1.5
samples = np.random.normal(mu, sigma, 10000)

# Estimate expectation and variance
estimated_mean = np.mean(samples)
estimated_var = np.var(samples, ddof=0)
estimated_std = np.std(samples, ddof=0)

print("Monte Carlo estimation")
print(f"Theoretical mean: {mu}, estimated mean: {estimated_mean:.4f}")
print(f"Theoretical variance: {sigma**2}, estimated variance: {estimated_var:.4f}")
print(f"Theoretical std: {sigma}, estimated std: {estimated_std:.4f}")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Histogram + theoretical PDF
ax = axes[0]
ax.hist(samples, bins=50, density=True, alpha=0.7, color='skyblue',
        edgecolor='black', label='Sample histogram')
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 1000)
pdf = stats.norm.pdf(x, mu, sigma)
ax.plot(x, pdf, linewidth=3, color='red', label='Theoretical PDF')
ax.axvline(estimated_mean, color='green', linestyle='--', linewidth=2,
           label=f'Estimated mean = {estimated_mean:.2f}')
ax.set_xlabel('x', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title(f'Normal samples (μ={mu}, σ={sigma})', fontsize=14)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

# Convergence with sample size
ax = axes[1]
sample_sizes = np.arange(10, 10001, 10)
running_means = [np.mean(samples[:n]) for n in sample_sizes]
ax.plot(sample_sizes, running_means, linewidth=2, color='blue',
        label='Running mean')
ax.axhline(mu, color='red', linestyle='--', linewidth=2, label=f'Theoretical mean = {mu}')
ax.set_xlabel('Sample size', fontsize=12)
ax.set_ylabel('Running mean', fontsize=12)
ax.set_title('Law of Large Numbers', fontsize=14)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('expectation_variance_estimation.png', dpi=150)
plt.show()
# Covariance examples
np.random.seed(42)
n = 1000

# Positive correlation
X1 = np.random.randn(n)
Y1 = 0.8 * X1 + 0.3 * np.random.randn(n)

# Negative correlation
X2 = np.random.randn(n)
Y2 = -0.8 * X2 + 0.3 * np.random.randn(n)

# Independent
X3 = np.random.randn(n)
Y3 = np.random.randn(n)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
datasets = [(X1, Y1, 'Positive correlation'), (X2, Y2, 'Negative correlation'),
            (X3, Y3, 'Independent (no correlation)')]

for idx, (X, Y, title) in enumerate(datasets):
    ax = axes[idx]
    ax.scatter(X, Y, alpha=0.5, s=20, edgecolors='k', linewidths=0.5)
    # Compute statistics
    cov = np.cov(X, Y)[0, 1]
    corr = np.corrcoef(X, Y)[0, 1]
    ax.set_xlabel('X', fontsize=12)
    ax.set_ylabel('Y', fontsize=12)
    ax.set_title(f'{title}\nCov={cov:.3f}, ρ={corr:.3f}', fontsize=13)
    ax.grid(True, alpha=0.3)
    ax.set_aspect('equal')

plt.tight_layout()
plt.savefig('covariance_correlation.png', dpi=150)
plt.show()

print("\nCovariance and correlation:")
print("  Cov > 0: positive relationship (X up → Y up)")
print("  Cov < 0: negative relationship (X up → Y down)")
print("  Cov = 0: no linear relationship (independence implies Cov=0, but not conversely)")
print("  ρ ∈ [-1, 1]: normalized covariance (unit-free)")
4. Common Probability Distributions¶
4.1 Discrete Distributions¶
Bernoulli Distribution: $$X \sim \text{Ber}(p), \quad P(X=1) = p, \; P(X=0) = 1-p$$
- Mean: $p$, Variance: $p(1-p)$

Binomial Distribution: $$X \sim \text{Bin}(n, p), \quad P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}$$
- Number of successes in $n$ independent Bernoulli trials
- Mean: $np$, Variance: $np(1-p)$

Poisson Distribution: $$X \sim \text{Pois}(\lambda), \quad P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
- Number of events in a unit of time/space
- Mean: $\lambda$, Variance: $\lambda$

Categorical Distribution: $$X \sim \text{Cat}(p_1, \ldots, p_K), \quad P(X=k) = p_k, \; \sum_k p_k = 1$$
- The basic distribution for multiclass classification
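A categorical variable can be sampled with NumPy's weighted `choice`. A minimal sketch (the probability vector below is hypothetical): the empirical class frequencies converge to $p$.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])  # hypothetical class probabilities, sum to 1

# Draw class indices k with P(X=k) = p_k
samples = rng.choice(len(p), size=100_000, p=p)

# Empirical frequencies approximate p
freq = np.bincount(samples, minlength=len(p)) / len(samples)
print(freq)  # close to [0.2, 0.5, 0.3]
```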
4.2 Continuous Distributions¶
Uniform Distribution: $$X \sim \text{Unif}(a, b), \quad f(x) = \frac{1}{b-a} \text{ for } x \in [a, b]$$
- Mean: $\frac{a+b}{2}$, Variance: $\frac{(b-a)^2}{12}$

Normal/Gaussian Distribution: $$X \sim \mathcal{N}(\mu, \sigma^2), \quad f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
- Mean: $\mu$, Variance: $\sigma^2$
- Arises naturally via the Central Limit Theorem

Exponential Distribution: $$X \sim \text{Exp}(\lambda), \quad f(x) = \lambda e^{-\lambda x} \text{ for } x \geq 0$$
- Waiting time in a Poisson process
- Mean: $1/\lambda$, Variance: $1/\lambda^2$

Beta Distribution: $$X \sim \text{Beta}(\alpha, \beta), \quad f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)} \text{ for } x \in [0, 1]$$
- A distribution over probabilities (a common prior in Bayesian inference)
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
fig, axes = plt.subplots(3, 3, figsize=(18, 15))

# 1. Bernoulli
ax = axes[0, 0]
p = 0.7
x = [0, 1]
pmf = [1-p, p]
ax.bar(x, pmf, color='skyblue', edgecolor='black', width=0.4)
ax.set_title(f'Bernoulli (p={p})', fontsize=12)
ax.set_xticks([0, 1])
ax.set_ylabel('P(X=x)')

# 2. Binomial
ax = axes[0, 1]
n, p = 20, 0.5
x = np.arange(0, n+1)
pmf = stats.binom.pmf(x, n, p)
ax.bar(x, pmf, color='lightgreen', edgecolor='black')
ax.set_title(f'Binomial (n={n}, p={p})', fontsize=12)
ax.set_xlabel('x')

# 3. Poisson
ax = axes[0, 2]
lam = 5
x = np.arange(0, 20)
pmf = stats.poisson.pmf(x, lam)
ax.bar(x, pmf, color='salmon', edgecolor='black')
ax.set_title(f'Poisson (λ={lam})', fontsize=12)
ax.set_xlabel('x')

# 4. Uniform
ax = axes[1, 0]
a, b = 0, 1
x = np.linspace(-0.5, 1.5, 1000)
pdf = stats.uniform.pdf(x, a, b-a)
ax.plot(x, pdf, linewidth=3, color='blue')
ax.fill_between(x, pdf, alpha=0.3, color='blue')
ax.set_title(f'Uniform (a={a}, b={b})', fontsize=12)
ax.set_ylabel('f(x)')

# 5. Normal (several parameter settings)
ax = axes[1, 1]
x = np.linspace(-5, 5, 1000)
params = [(0, 1), (0, 0.5), (1, 1)]
for mu, sigma in params:
    pdf = stats.norm.pdf(x, mu, sigma)
    ax.plot(x, pdf, linewidth=2, label=f'μ={mu}, σ={sigma}')
ax.set_title('Normal', fontsize=12)
ax.legend(fontsize=9)

# 6. Exponential
ax = axes[1, 2]
x = np.linspace(0, 5, 1000)
lambdas = [0.5, 1, 2]
for lam in lambdas:
    pdf = stats.expon.pdf(x, scale=1/lam)
    ax.plot(x, pdf, linewidth=2, label=f'λ={lam}')
ax.set_title('Exponential', fontsize=12)
ax.legend(fontsize=9)

# 7. Gamma
ax = axes[2, 0]
x = np.linspace(0, 20, 1000)
params = [(1, 1), (2, 2), (5, 1)]
for k, theta in params:
    pdf = stats.gamma.pdf(x, k, scale=theta)
    ax.plot(x, pdf, linewidth=2, label=f'k={k}, θ={theta}')
ax.set_title('Gamma', fontsize=12)
ax.set_ylabel('f(x)')
ax.legend(fontsize=9)

# 8. Beta
ax = axes[2, 1]
x = np.linspace(0, 1, 1000)
params = [(0.5, 0.5), (2, 2), (5, 2)]
for alpha, beta in params:
    pdf = stats.beta.pdf(x, alpha, beta)
    ax.plot(x, pdf, linewidth=2, label=f'α={alpha}, β={beta}')
ax.set_title('Beta', fontsize=12)
ax.legend(fontsize=9)

# 9. Chi-squared
ax = axes[2, 2]
x = np.linspace(0, 15, 1000)
dfs = [2, 4, 6]
for df in dfs:
    pdf = stats.chi2.pdf(x, df)
    ax.plot(x, pdf, linewidth=2, label=f'df={df}')
ax.set_title('Chi-squared', fontsize=12)
ax.legend(fontsize=9)

for ax in axes.flat:
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('common_distributions.png', dpi=150)
plt.show()

print("Uses in machine learning:")
print("  Bernoulli/Binomial: binary classification")
print("  Categorical: multiclass classification")
print("  Normal: continuous data, error models, VAE latent spaces")
print("  Poisson: count data (recommender systems, web traffic)")
print("  Beta: priors in Bayesian inference")
print("  Exponential/Gamma: waiting times, survival analysis")
4.3 Multivariate Normal Distribution¶
$$ \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}), \quad f(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}|}}\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right) $$
- $\boldsymbol{\mu} \in \mathbb{R}^d$: mean vector
- $\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}$: covariance matrix (positive definite)
from scipy.stats import multivariate_normal
import numpy as np
import matplotlib.pyplot as plt
# Multivariate normal visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

mu = np.array([0, 0])
covs = [
    np.array([[1, 0], [0, 1]]),       # independent
    np.array([[1, 0.8], [0.8, 1]]),   # positive correlation
    np.array([[1, -0.8], [-0.8, 1]])  # negative correlation
]
titles = ['Independent (ρ=0)', 'Positive correlation (ρ=0.8)', 'Negative correlation (ρ=-0.8)']

x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
pos = np.dstack((X, Y))

for ax, cov, title in zip(axes, covs, titles):
    rv = multivariate_normal(mu, cov)
    Z = rv.pdf(pos)
    contour = ax.contourf(X, Y, Z, levels=15, cmap='viridis')
    ax.contour(X, Y, Z, levels=15, colors='white', alpha=0.3, linewidths=0.5)
    ax.set_xlabel('$X_1$', fontsize=12)
    ax.set_ylabel('$X_2$', fontsize=12)
    ax.set_title(title, fontsize=13)
    ax.set_aspect('equal')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('multivariate_normal.png', dpi=150)
plt.show()

print("Multivariate normal:")
print("  - The basic building block for modeling high-dimensional data")
print("  - Central to Gaussian processes, GMMs, VAEs, and more")
print("  - The covariance matrix captures dependencies between variables")
5. Advanced Bayes' Theorem¶
5.1 Bayesian Update¶
Prior → Data → Posterior
$$ P(\theta | D) = \frac{P(D | \theta) P(\theta)}{P(D)} \propto P(D | \theta) P(\theta) $$
- $\theta$: parameter (treated as random variable)
- $D$: observed data
- $P(\theta)$: prior probability (belief before data)
- $P(D | \theta)$: likelihood (plausibility of data given parameter)
- $P(\theta | D)$: posterior probability (updated belief after data)
5.2 Example: Coin Flip (Beta-Binomial Model)¶
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
# Beta-binomial model: estimating the heads probability of a coin
# Prior: Beta(α, β)
# Likelihood: Binomial
# Posterior: Beta(α + n_heads, β + n_tails)
np.random.seed(42)

# True heads probability (assumed unknown to the learner)
true_p = 0.7

# Prior (uniform prior: Beta(1, 1))
alpha_prior, beta_prior = 1, 1

# Simulate coin flips
n_flips_list = [0, 1, 5, 20, 100]
data = np.random.binomial(1, true_p, 100)

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.flatten()
p_vals = np.linspace(0, 1, 1000)

for idx, n_flips in enumerate(n_flips_list):
    ax = axes[idx]
    if n_flips == 0:
        # Prior only
        prior_pdf = stats.beta.pdf(p_vals, alpha_prior, beta_prior)
        ax.plot(p_vals, prior_pdf, linewidth=3, color='blue', label='Prior')
    else:
        # Data
        observed_data = data[:n_flips]
        n_heads = np.sum(observed_data)
        n_tails = n_flips - n_heads
        # Posterior
        alpha_post = alpha_prior + n_heads
        beta_post = beta_prior + n_tails
        posterior_pdf = stats.beta.pdf(p_vals, alpha_post, beta_post)
        # Prior
        prior_pdf = stats.beta.pdf(p_vals, alpha_prior, beta_prior)
        ax.plot(p_vals, prior_pdf, linewidth=2, color='blue', linestyle='--',
                label='Prior', alpha=0.7)
        ax.plot(p_vals, posterior_pdf, linewidth=3, color='red', label='Posterior')
        # MAP estimate (maximum a posteriori)
        map_estimate = (alpha_post - 1) / (alpha_post + beta_post - 2)
        ax.axvline(map_estimate, color='red', linestyle=':', linewidth=2,
                   label=f'MAP = {map_estimate:.3f}')
    # True probability
    ax.axvline(true_p, color='green', linestyle='--', linewidth=2,
               label=f'True p = {true_p}')
    ax.set_xlabel('p (heads probability)', fontsize=11)
    ax.set_ylabel('Density', fontsize=11)
    ax.set_title(f'After {n_flips} flips' if n_flips > 0 else 'Prior', fontsize=12)
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)
    ax.set_xlim(0, 1)

# Convergence curve
ax = axes[-1]
n_range = np.arange(1, 101)
map_estimates = []
for n in n_range:
    n_heads = np.sum(data[:n])
    n_tails = n - n_heads
    alpha_post = alpha_prior + n_heads
    beta_post = beta_prior + n_tails
    map_est = (alpha_post - 1) / (alpha_post + beta_post - 2)
    map_estimates.append(map_est)
ax.plot(n_range, map_estimates, linewidth=2, color='red', label='MAP estimate')
ax.axhline(true_p, color='green', linestyle='--', linewidth=2, label=f'True p = {true_p}')
ax.set_xlabel('Number of flips', fontsize=11)
ax.set_ylabel('Estimated p', fontsize=11)
ax.set_title('Convergence of Bayesian learning', fontsize=12)
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('bayesian_update_coin.png', dpi=150)
plt.show()

print("Bayesian updating:")
print("  - As data accumulates, the posterior concentrates around the true value")
print("  - The influence of the prior fades as more data arrives")
print("  - Uncertainty is represented as a distribution, not a point estimate")
6. Probability in Machine Learning¶
6.1 Generative vs Discriminative Models¶
Generative Model:
- Models $P(X, Y) = P(Y)P(X|Y)$
- Learns the data distribution of each class
- Prediction: $P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}$ via Bayes' theorem
- Examples: Naive Bayes, GMM, VAE, GAN

Discriminative Model:
- Directly models $P(Y|X)$
- Learns only the decision boundary
- Examples: logistic regression, SVM, neural networks
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
# Generate data
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Generative model: naive Bayes
generative_model = GaussianNB()
generative_model.fit(X_train, y_train)

# Discriminative model: logistic regression
discriminative_model = LogisticRegression()
discriminative_model.fit(X_train, y_train)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Grid
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                     np.linspace(y_min, y_max, 200))

models = [
    (generative_model, 'Generative model (naive Bayes)', axes[0]),
    (discriminative_model, 'Discriminative model (logistic regression)', axes[1])
]

for model, title, ax in models:
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    ax.scatter(X_train[y_train==0, 0], X_train[y_train==0, 1],
               c='blue', marker='o', s=50, edgecolors='k', label='Class 0')
    ax.scatter(X_train[y_train==1, 0], X_train[y_train==1, 1],
               c='red', marker='s', s=50, edgecolors='k', label='Class 1')
    score = model.score(X_test, y_test)
    ax.set_xlabel('Feature 1', fontsize=12)
    ax.set_ylabel('Feature 2', fontsize=12)
    ax.set_title(f'{title}\nTest Accuracy: {score:.3f}', fontsize=13)
    ax.legend(fontsize=10)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('generative_vs_discriminative.png', dpi=150)
plt.show()

print("Generative vs discriminative:")
print("  Generative: models the full joint P(X,Y) → can generate samples")
print("  Discriminative: models only the conditional P(Y|X) → prediction only, often higher accuracy")
6.2 Naive Bayes Classifier¶
Assumption: features are conditionally independent given the class
$$ P(X_1, \ldots, X_d | Y) = \prod_{i=1}^d P(X_i | Y) $$
Prediction: $$ \hat{y} = \arg\max_y P(Y=y) \prod_{i=1}^d P(X_i | Y=y) $$
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
# Text classification example
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
train = fetch_20newsgroups(subset='train', categories=categories, random_state=42)
test = fetch_20newsgroups(subset='test', categories=categories, random_state=42)

# Feature extraction (Bag-of-Words)
vectorizer = CountVectorizer(max_features=1000)
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

# Train naive Bayes
nb_model = MultinomialNB()
nb_model.fit(X_train, train.target)

# Predict
y_pred = nb_model.predict(X_test)

print("Naive Bayes text classification:")
print(classification_report(test.target, y_pred, target_names=test.target_names))

print("\nCharacteristics of naive Bayes:")
print("  - Conditional independence assumption (hence 'naive') → computationally efficient")
print("  - Works well even on high-dimensional data (e.g., text classification)")
print("  - Probabilistic, interpretable outputs")
print("  - Reasonable performance even on small datasets")
6.3 Probabilistic Graphical Models¶
- Bayesian Network: represents conditional independence with directed acyclic graph (DAG)
- Markov Random Field: undirected graph
- Hidden Markov Model (HMM): inference of hidden states in time series data
- Applications: speech recognition, natural language processing, computer vision
# Simple Bayesian network example (conceptual)
import networkx as nx
import matplotlib.pyplot as plt

# Bayesian network structure
# Rain → Sprinkler, Rain → Grass Wet, Sprinkler → Grass Wet
G = nx.DiGraph()
G.add_edges_from([('Rain', 'Sprinkler'), ('Rain', 'Grass Wet'),
                  ('Sprinkler', 'Grass Wet')])

plt.figure(figsize=(10, 6))
pos = {'Rain': (0.5, 1), 'Sprinkler': (0, 0), 'Grass Wet': (1, 0)}
nx.draw(G, pos, with_labels=True, node_size=3000, node_color='lightblue',
        font_size=12, font_weight='bold', arrowsize=20, arrows=True)
plt.title('Bayesian network: Rain → Sprinkler, Grass Wet', fontsize=14)
plt.tight_layout()
plt.savefig('bayesian_network_example.png', dpi=150)
plt.show()

print("Probabilistic graphical models:")
print("  - Represent dependencies between variables as a graph")
print("  - Exploit conditional independence for efficient computation")
print("  - Inference: estimate hidden variables from observed ones")
print("  - Learning: fit the graph structure and probability parameters from data")
Practice Problems¶
- Bayes' Theorem Application: Design a spam filter using Bayes' theorem. Derive the formula for the spam probability given the presence of specific words, and implement it with simple example data.
- Distribution Fitting: Use scipy.stats to fit a normal distribution to real data (e.g., heights, test scores) and check goodness of fit with a Q-Q plot. If the normal distribution is inadequate, try other distributions.
- Monte Carlo Integration: For $X \sim \mathcal{N}(0, 1)$, compute $\mathbb{E}[e^X]$ (1) analytically and (2) by Monte Carlo sampling. Verify convergence as the sample size increases.
- Bayesian Linear Regression: Implement linear regression from a Bayesian perspective. Place a normal prior on the weights and update the posterior with each observation. Visualize the posterior mean and its uncertainty.
- Naive Bayes vs Logistic Regression: Compare naive Bayes (generative) and logistic regression (discriminative) on the Iris dataset. Plot learning curves as the training-set size varies, and analyze which model is advantageous in which situations.
References¶
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press
- The bible of ML from a probabilistic viewpoint
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer
- Chapters 1-2: probability foundations and distributions
- Wasserman, L. (2004). All of Statistics. Springer
- Concise summary of statistics and probability
- Koller, D., & Friedman, N. (2009). Probabilistic Graphical Models. MIT Press
- Comprehensive textbook on probabilistic graphical models
- SciPy Stats Documentation: https://docs.scipy.org/doc/scipy/reference/stats.html
- Seeing Theory (probability/statistics visualization): https://seeing-theory.brown.edu/
- Bayesian Methods for Hackers (online book): https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers