์ฐจ์› ์ถ•์†Œ (Dimensionality Reduction)

์ฐจ์› ์ถ•์†Œ (Dimensionality Reduction)

๊ฐœ์š”

์ฐจ์› ์ถ•์†Œ๋Š” ๊ณ ์ฐจ์› ๋ฐ์ดํ„ฐ๋ฅผ ์ €์ฐจ์›์œผ๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ ๊ณ„์‚ฐ ํšจ์œจ์„ฑ์„ ๋†’์ด๊ณ  ์‹œ๊ฐํ™”๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ์ฃผ์š” ๋ฐฉ๋ฒ•์œผ๋กœ PCA, t-SNE, ํŠน์„ฑ ์„ ํƒ ๋“ฑ์ด ์žˆ์Šต๋‹ˆ๋‹ค.


1. ์ฐจ์› ์ถ•์†Œ์˜ ํ•„์š”์„ฑ

1.1 ์ฐจ์›์˜ ์ €์ฃผ (Curse of Dimensionality)

"""
์ฐจ์›์˜ ์ €์ฃผ:
1. ๊ณ ์ฐจ์›์—์„œ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ ๊ฐ„ ๊ฑฐ๋ฆฌ๊ฐ€ ๋น„์Šทํ•ด์ง
2. ๋ฐ์ดํ„ฐ๊ฐ€ ํฌ์†Œํ•ด์ง (sparse)
3. ๋ชจ๋ธ ํ•™์Šต์— ๋” ๋งŽ์€ ๋ฐ์ดํ„ฐ ํ•„์š”
4. ๊ณผ์ ํ•ฉ ์œ„ํ—˜ ์ฆ๊ฐ€
5. ๊ณ„์‚ฐ ๋น„์šฉ ์ฆ๊ฐ€

์ฐจ์› ์ถ•์†Œ์˜ ๋ชฉ์ :
1. ์‹œ๊ฐํ™” (2D/3D)
2. ๋…ธ์ด์ฆˆ ์ œ๊ฑฐ
3. ๊ณ„์‚ฐ ํšจ์œจ์„ฑ
4. ๋‹ค์ค‘๊ณต์„ ์„ฑ ์ œ๊ฑฐ
5. ํŠน์„ฑ ์ถ”์ถœ
"""

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits, load_iris, fetch_olivetti_faces
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# ์ฐจ์›์˜ ์ €์ฃผ ๋ฐ๋ชจ: ๊ณ ์ฐจ์›์—์„œ ๊ฑฐ๋ฆฌ ๋ถ„ํฌ
np.random.seed(42)

def distance_distribution(n_dims, n_points=1000):
    """๊ณ ์ฐจ์›์—์„œ ๊ฑฐ๋ฆฌ ๋ถ„ํฌ ํ™•์ธ"""
    points = np.random.rand(n_points, n_dims)
    # ๋žœ๋ค ํฌ์ธํŠธ ์Œ ๊ฐ„ ๊ฑฐ๋ฆฌ
    idx = np.random.choice(n_points, size=(500, 2), replace=False)
    distances = [np.linalg.norm(points[i] - points[j]) for i, j in idx]
    return distances

# ๋‹ค์–‘ํ•œ ์ฐจ์›์—์„œ ๊ฑฐ๋ฆฌ ๋ถ„ํฌ
dims = [2, 10, 100, 1000]

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, d in zip(axes, dims):
    distances = distance_distribution(d)
    ax.hist(distances, bins=30, edgecolor='black')
    ax.set_title(f'Dim={d}\nMean={np.mean(distances):.2f}, Std={np.std(distances):.2f}')
    ax.set_xlabel('Distance')

plt.tight_layout()
plt.show()

print("์ฐจ์›์ด ์ฆ๊ฐ€ํ• ์ˆ˜๋ก ๊ฑฐ๋ฆฌ ๋ถ„ํฌ๊ฐ€ ์ข์•„์ง โ†’ ํฌ์ธํŠธ๋“ค์ด ๋น„์Šทํ•œ ๊ฑฐ๋ฆฌ์— ์œ„์น˜")

2. ์ฃผ์„ฑ๋ถ„ ๋ถ„์„ (PCA)

2.1 PCA์˜ ์›๋ฆฌ

"""
PCA (Principal Component Analysis):
- ๋ฐ์ดํ„ฐ์˜ ๋ถ„์‚ฐ์„ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ์ถ•(์ฃผ์„ฑ๋ถ„)์„ ์ฐพ์Œ
- ๊ณ ์ฐจ์› โ†’ ์ €์ฐจ์› ํˆฌ์˜
- ์„ ํ˜• ๋ณ€ํ™˜

์ˆ˜ํ•™์  ์›๋ฆฌ:
1. ๋ฐ์ดํ„ฐ ์ค‘์‹ฌํ™” (ํ‰๊ท  0)
2. ๊ณต๋ถ„์‚ฐ ํ–‰๋ ฌ ๊ณ„์‚ฐ
3. ๊ณ ์œ ๊ฐ’ ๋ถ„ํ•ด (eigendecomposition)
4. ๊ณ ์œ ๊ฐ’์ด ํฐ ์ˆœ์„œ๋กœ ๊ณ ์œ ๋ฒกํ„ฐ(์ฃผ์„ฑ๋ถ„) ์„ ํƒ
5. ์„ ํƒ๋œ ์ฃผ์„ฑ๋ถ„์œผ๋กœ ๋ฐ์ดํ„ฐ ํˆฌ์˜

์ฃผ์„ฑ๋ถ„:
- ์ฒซ ๋ฒˆ์งธ ์ฃผ์„ฑ๋ถ„: ๋ถ„์‚ฐ์ด ๊ฐ€์žฅ ํฐ ๋ฐฉํ–ฅ
- ๋‘ ๋ฒˆ์งธ ์ฃผ์„ฑ๋ถ„: ์ฒซ ๋ฒˆ์งธ์™€ ์ง๊ตํ•˜๋ฉด์„œ ๋ถ„์‚ฐ์ด ํฐ ๋ฐฉํ–ฅ
- n๋ฒˆ์งธ ์ฃผ์„ฑ๋ถ„: ์ด์ „ ์ฃผ์„ฑ๋ถ„๋“ค๊ณผ ์ง๊ต
"""

from sklearn.decomposition import PCA

# Visualize PCA on a 2D example
np.random.seed(42)
X_2d = np.dot(np.random.randn(200, 2), [[2, 1], [1, 2]])

# Fit PCA
pca = PCA(n_components=2)
pca.fit(X_2d)

# ์‹œ๊ฐํ™”
plt.figure(figsize=(10, 8))
plt.scatter(X_2d[:, 0], X_2d[:, 1], alpha=0.5)

# ์ฃผ์„ฑ๋ถ„ ๋ฐฉํ–ฅ (ํ™”์‚ดํ‘œ)
mean = pca.mean_
for i, (comp, var) in enumerate(zip(pca.components_, pca.explained_variance_)):
    end = mean + comp * np.sqrt(var) * 3
    plt.arrow(mean[0], mean[1], end[0]-mean[0], end[1]-mean[1],
              head_width=0.3, head_length=0.2, fc=f'C{i}', ec=f'C{i}',
              linewidth=2, label=f'PC{i+1} (Var: {var:.2f})')

plt.xlabel('X1')
plt.ylabel('X2')
plt.title('PCA: Principal Components')
plt.legend()
plt.axis('equal')
plt.grid(True, alpha=0.3)
plt.show()

print(f"์ฃผ์„ฑ๋ถ„:\n{pca.components_}")
print(f"์„ค๋ช…๋œ ๋ถ„์‚ฐ: {pca.explained_variance_}")
print(f"์„ค๋ช…๋œ ๋ถ„์‚ฐ ๋น„์œจ: {pca.explained_variance_ratio_}")

2.2 PCA with scikit-learn

from sklearn.decomposition import PCA

# Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# ์Šค์ผ€์ผ๋ง (PCA ์ „ ํ•„์ˆ˜)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA (reduce to 2 dimensions)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(f"์›๋ณธ ํ˜•์ƒ: {X.shape}")
print(f"PCA ํ›„ ํ˜•์ƒ: {X_pca.shape}")
print(f"์„ค๋ช…๋œ ๋ถ„์‚ฐ ๋น„์œจ: {pca.explained_variance_ratio_}")
print(f"๋ˆ„์  ์„ค๋ช… ๋ถ„์‚ฐ: {sum(pca.explained_variance_ratio_):.4f}")

# ์‹œ๊ฐํ™”
plt.figure(figsize=(10, 8))
for i, target_name in enumerate(iris.target_names):
    mask = y == i
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1], label=target_name, alpha=0.7)

plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%})')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%})')
plt.title('PCA: Iris Dataset')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

2.3 ์ฃผ์„ฑ๋ถ„ ์ˆ˜ ์„ ํƒ

# ์ „์ฒด ์ฃผ์„ฑ๋ถ„์œผ๋กœ PCA
pca_full = PCA()
pca_full.fit(X_scaled)

# ๋ˆ„์  ์„ค๋ช… ๋ถ„์‚ฐ
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

# ์‹œ๊ฐํ™”
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# ๊ฐœ๋ณ„ ๋ถ„์‚ฐ
axes[0].bar(range(1, len(pca_full.explained_variance_ratio_)+1),
            pca_full.explained_variance_ratio_, edgecolor='black')
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Explained Variance Ratio')
axes[0].set_title('Individual Explained Variance')

# ๋ˆ„์  ๋ถ„์‚ฐ
axes[1].plot(range(1, len(cumulative_variance)+1), cumulative_variance, 'o-')
axes[1].axhline(y=0.95, color='r', linestyle='--', label='95% variance')
axes[1].axhline(y=0.99, color='g', linestyle='--', label='99% variance')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Cumulative Explained Variance')
axes[1].set_title('Cumulative Explained Variance')
axes[1].legend()

plt.tight_layout()
plt.show()

# Number of components needed to explain 95% of the variance
n_components_95 = np.argmax(cumulative_variance >= 0.95) + 1
print(f"Components needed to explain 95% of the variance: {n_components_95}")

2.4 Choosing Components by Variance Ratio

# ๋ถ„์‚ฐ ๋น„์œจ๋กœ ์ฃผ์„ฑ๋ถ„ ์ˆ˜ ์ž๋™ ๊ฒฐ์ •
pca_95 = PCA(n_components=0.95)  # 95% ๋ถ„์‚ฐ ์„ค๋ช…
X_pca_95 = pca_95.fit_transform(X_scaled)

print(f"95% ๋ถ„์‚ฐ โ†’ {pca_95.n_components_}๊ฐœ ์ฃผ์„ฑ๋ถ„ ์„ ํƒ")
print(f"์‹ค์ œ ์„ค๋ช…๋œ ๋ถ„์‚ฐ: {sum(pca_95.explained_variance_ratio_):.4f}")

# ๋‹ค์–‘ํ•œ ๋ถ„์‚ฐ ๋น„์œจ
for var_ratio in [0.8, 0.9, 0.95, 0.99]:
    pca_temp = PCA(n_components=var_ratio)
    pca_temp.fit(X_scaled)
    print(f"{var_ratio*100:.0f}% ๋ถ„์‚ฐ โ†’ {pca_temp.n_components_}๊ฐœ ์ฃผ์„ฑ๋ถ„")

2.5 PCA in Practice: Noise Reduction

# ์ˆซ์ž ์ด๋ฏธ์ง€ ๋ฐ์ดํ„ฐ
digits = load_digits()
X_digits = digits.data
y_digits = digits.target

# ๋…ธ์ด์ฆˆ ์ถ”๊ฐ€
np.random.seed(42)
X_noisy = X_digits + np.random.normal(0, 4, X_digits.shape)

# PCA๋กœ ๋…ธ์ด์ฆˆ ์ œ๊ฑฐ (์ฃผ์š” ์ฃผ์„ฑ๋ถ„๋งŒ ์œ ์ง€)
pca_denoise = PCA(n_components=20)
X_reduced = pca_denoise.fit_transform(X_noisy)
X_denoised = pca_denoise.inverse_transform(X_reduced)

# ์‹œ๊ฐํ™”
fig, axes = plt.subplots(3, 10, figsize=(15, 5))

for i in range(10):
    # ์›๋ณธ
    axes[0, i].imshow(X_digits[i].reshape(8, 8), cmap='gray')
    axes[0, i].axis('off')
    if i == 0:
        axes[0, i].set_title('Original')

    # ๋…ธ์ด์ฆˆ
    axes[1, i].imshow(X_noisy[i].reshape(8, 8), cmap='gray')
    axes[1, i].axis('off')
    if i == 0:
        axes[1, i].set_title('Noisy')

    # ๋ณต์›
    axes[2, i].imshow(X_denoised[i].reshape(8, 8), cmap='gray')
    axes[2, i].axis('off')
    if i == 0:
        axes[2, i].set_title('Denoised')

plt.suptitle('PCA for Noise Reduction')
plt.tight_layout()
plt.show()

3. t-SNE

3.1 How t-SNE Works

"""
t-SNE (t-distributed Stochastic Neighbor Embedding):
- ๋น„์„ ํ˜• ์ฐจ์› ์ถ•์†Œ
- ์‹œ๊ฐํ™”์— ์ฃผ๋กœ ์‚ฌ์šฉ (2D/3D)
- ์ง€์—ญ ๊ตฌ์กฐ ๋ณด์กด์— ๋›ฐ์–ด๋‚จ

์›๋ฆฌ:
1. ๊ณ ์ฐจ์›์—์„œ ์ ๋“ค ๊ฐ„ ์œ ์‚ฌ๋„๋ฅผ ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ ๋กœ ๊ณ„์‚ฐ
2. ์ €์ฐจ์›์—์„œ t-๋ถ„ํฌ ๊ธฐ๋ฐ˜ ์œ ์‚ฌ๋„ ์ •์˜
3. KL-divergence ์ตœ์†Œํ™”๋กœ ์ €์ฐจ์› ์ขŒํ‘œ ํ•™์Šต

ํŠน์ง•:
- ๋น„์„ ํ˜• ๊ด€๊ณ„ ํฌ์ฐฉ
- ํด๋Ÿฌ์Šคํ„ฐ ๋ถ„๋ฆฌ์— ํšจ๊ณผ์ 
- ๊ณ„์‚ฐ ๋น„์šฉ ๋†’์Œ
- ์ƒˆ ๋ฐ์ดํ„ฐ ๋ณ€ํ™˜ ๋ถˆ๊ฐ€ (transform ์—†์Œ)
- ๊ฒฐ๊ณผ ์žฌํ˜„์„ฑ ๋ฌธ์ œ (random_state ์ค‘์š”)
"""

from sklearn.manifold import TSNE

# Apply t-SNE
tsne = TSNE(
    n_components=2,
    perplexity=30,          # effective local neighborhood size (typically 5-50)
    learning_rate='auto',   # learning rate
    max_iter=1000,          # number of iterations (named n_iter before scikit-learn 1.5)
    random_state=42
)

# t-SNE is slow, so use only a subset
X_sample = X_digits[:500]
y_sample = y_digits[:500]

X_tsne = tsne.fit_transform(X_sample)

# ์‹œ๊ฐํ™”
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_sample, cmap='tab10', alpha=0.7)
plt.colorbar(scatter)
plt.title('t-SNE: Digits Dataset')
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.show()

3.2 The perplexity Parameter

# Effect of perplexity
perplexities = [5, 30, 50, 100]

fig, axes = plt.subplots(1, 4, figsize=(20, 5))

for ax, perp in zip(axes, perplexities):
    tsne_temp = TSNE(n_components=2, perplexity=perp, random_state=42)
    X_temp = tsne_temp.fit_transform(X_sample)

    scatter = ax.scatter(X_temp[:, 0], X_temp[:, 1], c=y_sample, cmap='tab10', alpha=0.7)
    ax.set_title(f'perplexity={perp}')
    ax.set_xlabel('t-SNE 1')
    ax.set_ylabel('t-SNE 2')

plt.tight_layout()
plt.show()

print("perplexity ๊ฐ€์ด๋“œ:")
print("  - ์ž‘์€ ๊ฐ’ (5-10): ์ง€์—ญ ๊ตฌ์กฐ์— ์ง‘์ค‘")
print("  - ํฐ ๊ฐ’ (30-50): ์ „์—ญ ๊ตฌ์กฐ ๊ณ ๋ ค")
print("  - ๋ฐ์ดํ„ฐ ํฌ๊ธฐ์— ๋”ฐ๋ผ ์กฐ์ ˆ ํ•„์š”")

3.3 PCA vs t-SNE

# ์Šค์ผ€์ผ๋ง
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_sample)

# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

# ๋น„๊ต ์‹œ๊ฐํ™”
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

scatter1 = axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y_sample, cmap='tab10', alpha=0.7)
axes[0].set_title('PCA')
axes[0].set_xlabel('PC1')
axes[0].set_ylabel('PC2')

scatter2 = axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_sample, cmap='tab10', alpha=0.7)
axes[1].set_title('t-SNE')
axes[1].set_xlabel('t-SNE 1')
axes[1].set_ylabel('t-SNE 2')

plt.tight_layout()
plt.show()

print("PCA: ๋ถ„์‚ฐ ์ตœ๋Œ€ํ™”, ์„ ํ˜•, ๋น ๋ฆ„, ์ „์—ญ ๊ตฌ์กฐ")
print("t-SNE: ์ด์›ƒ ๋ณด์กด, ๋น„์„ ํ˜•, ๋А๋ฆผ, ์ง€์—ญ ๊ตฌ์กฐ")

4. UMAP

"""
UMAP (Uniform Manifold Approximation and Projection):
- t-SNE๋ณด๋‹ค ๋น ๋ฆ„
- ์ „์—ญ ๊ตฌ์กฐ ๋” ์ž˜ ๋ณด์กด
- ์ƒˆ ๋ฐ์ดํ„ฐ ๋ณ€ํ™˜ ๊ฐ€๋Šฅ

# pip install umap-learn
"""

# import umap

# umap_reducer = umap.UMAP(
#     n_neighbors=15,      # local neighborhood size
#     min_dist=0.1,        # minimum distance between embedded points
#     n_components=2,
#     random_state=42
# )
# X_umap = umap_reducer.fit_transform(X_scaled)

# ์„ค์น˜ ์—†์ด ์„ค๋ช…
print("UMAP ํŠน์ง•:")
print("  - t-SNE๋ณด๋‹ค ๋น ๋ฆ„")
print("  - ์ „์—ญ ๊ตฌ์กฐ ๋” ์ž˜ ๋ณด์กด")
print("  - transform() ์ง€์› (์ƒˆ ๋ฐ์ดํ„ฐ ๋ณ€ํ™˜)")
print("  - ์ฃผ์š” ํŒŒ๋ผ๋ฏธํ„ฐ: n_neighbors, min_dist")

5. ํŠน์„ฑ ์„ ํƒ (Feature Selection)

5.1 ํ•„ํ„ฐ ๋ฐฉ๋ฒ• (Filter Methods)

from sklearn.feature_selection import (
    SelectKBest, SelectPercentile,
    f_classif, mutual_info_classif, chi2
)

"""
ํ•„ํ„ฐ ๋ฐฉ๋ฒ•:
- ๋ชจ๋ธ๊ณผ ๋…๋ฆฝ์ ์œผ๋กœ ํŠน์„ฑ ํ‰๊ฐ€
- ๋น ๋ฆ„, ๊ฐ„๋‹จ
- ํ†ต๊ณ„์  ๊ฒ€์ • ๊ธฐ๋ฐ˜

๋ฐฉ๋ฒ•:
1. ๋ถ„์‚ฐ ๊ธฐ๋ฐ˜: VarianceThreshold
2. ์ƒ๊ด€๊ด€๊ณ„ ๊ธฐ๋ฐ˜: ํƒ€๊ฒŸ๊ณผ์˜ ์ƒ๊ด€๊ณ„์ˆ˜
3. ํ†ต๊ณ„ ๊ฒ€์ •: ANOVA F-value, ์นด์ด์ œ๊ณฑ
4. ์ •๋ณด ์ด๋ก : ์ƒํ˜ธ ์ •๋ณด๋Ÿ‰
"""

# ๋ฐ์ดํ„ฐ
X, y = load_iris(return_X_y=True)

# ANOVA F-value ๊ธฐ๋ฐ˜ ํŠน์„ฑ ์„ ํƒ
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("ANOVA F-value ํŠน์„ฑ ์„ ํƒ:")
print(f"์›๋ณธ ํŠน์„ฑ ์ˆ˜: {X.shape[1]}")
print(f"์„ ํƒ๋œ ํŠน์„ฑ ์ˆ˜: {X_selected.shape[1]}")
print(f"๊ฐ ํŠน์„ฑ ์ ์ˆ˜: {selector.scores_}")
print(f"์„ ํƒ๋œ ํŠน์„ฑ ์ธ๋ฑ์Šค: {selector.get_support(indices=True)}")

# ์ƒํ˜ธ ์ •๋ณด๋Ÿ‰ ๊ธฐ๋ฐ˜
selector_mi = SelectKBest(score_func=mutual_info_classif, k=2)
selector_mi.fit(X, y)
print(f"\n์ƒํ˜ธ ์ •๋ณด๋Ÿ‰ ์ ์ˆ˜: {selector_mi.scores_}")

5.2 ๋ž˜ํผ ๋ฐฉ๋ฒ• (Wrapper Methods)

from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression

"""
๋ž˜ํผ ๋ฐฉ๋ฒ•:
- ๋ชจ๋ธ ์„ฑ๋Šฅ ๊ธฐ๋ฐ˜ ํŠน์„ฑ ์„ ํƒ
- ์ •ํ™•ํ•˜์ง€๋งŒ ๋А๋ฆผ
- ๊ณผ์ ํ•ฉ ์œ„ํ—˜

๋ฐฉ๋ฒ•:
1. RFE (Recursive Feature Elimination)
2. ์ „์ง„ ์„ ํƒ (Forward Selection)
3. ํ›„์ง„ ์ œ๊ฑฐ (Backward Elimination)
"""

# RFE (Recursive Feature Elimination)
model = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=model, n_features_to_select=2, step=1)
rfe.fit(X, y)

print("RFE feature selection:")
print(f"Selected features: {rfe.get_support()}")
print(f"Feature ranking: {rfe.ranking_}")

# RFECV (RFE with cross-validation)
rfecv = RFECV(estimator=model, cv=5, scoring='accuracy')
rfecv.fit(X, y)

print(f"\nRFECV optimal number of features: {rfecv.n_features_}")
print(f"Selected features: {rfecv.get_support()}")

# Plot CV scores
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(rfecv.cv_results_['mean_test_score'])+1),
         rfecv.cv_results_['mean_test_score'], 'o-')
plt.xlabel('Number of Features')
plt.ylabel('Cross-Validation Score')
plt.title('RFECV: Optimal Number of Features')
plt.grid(True, alpha=0.3)
plt.show()
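The forward and backward searches listed above are available in scikit-learn as SequentialFeatureSelector; a minimal sketch on the iris data (the parameter choices are ours):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Forward selection: start with no features and greedily add the one that
# most improves cross-validated accuracy, until 2 features are chosen.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction='forward',   # 'backward' gives backward elimination
    cv=5
)
sfs.fit(X, y)

print(f"Forward-selected features: {sfs.get_support()}")
```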

5.3 ์ž„๋ฒ ๋””๋“œ ๋ฐฉ๋ฒ• (Embedded Methods)

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso

"""
์ž„๋ฒ ๋””๋“œ ๋ฐฉ๋ฒ•:
- ๋ชจ๋ธ ํ•™์Šต ๊ณผ์ •์—์„œ ํŠน์„ฑ ์„ ํƒ
- ํ•„ํ„ฐ์™€ ๋ž˜ํผ์˜ ์ค‘๊ฐ„
- L1 ์ •๊ทœํ™”, ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ชจ๋ธ

๋ฐฉ๋ฒ•:
1. L1 ์ •๊ทœํ™” (Lasso)
2. ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ์ค‘์š”๋„
"""

# Random Forest importance-based selection
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Feature importances
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]

# ์‹œ๊ฐํ™”
plt.figure(figsize=(10, 6))
plt.bar(range(X.shape[1]), importances[indices])
plt.xticks(range(X.shape[1]), [f'Feature {i}' for i in indices])
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.title('Random Forest Feature Importance')
plt.show()

# SelectFromModel
selector = SelectFromModel(rf, threshold='median')
selector.fit(X, y)
X_selected = selector.transform(X)

print(f"Random Forest ๊ธฐ๋ฐ˜ ์„ ํƒ๋œ ํŠน์„ฑ ์ˆ˜: {X_selected.shape[1]}")
print(f"์„ ํƒ๋œ ํŠน์„ฑ: {selector.get_support()}")

6. ๋ถ„์‚ฐ ๊ธฐ๋ฐ˜ ํŠน์„ฑ ์„ ํƒ

from sklearn.feature_selection import VarianceThreshold

# ์ƒ˜ํ”Œ ๋ฐ์ดํ„ฐ (๋ถ„์‚ฐ์ด ๋‹ค๋ฅธ ํŠน์„ฑ)
X_var = np.array([
    [0, 0, 1, 100],
    [0, 0, 0, 101],
    [0, 0, 1, 99],
    [0, 0, 0, 100],
    [0, 0, 1, 102]
])

# ๋ถ„์‚ฐ์ด ๋‚ฎ์€ ํŠน์„ฑ ์ œ๊ฑฐ
selector = VarianceThreshold(threshold=0.5)
X_high_var = selector.fit_transform(X_var)

print("๋ถ„์‚ฐ ๊ธฐ๋ฐ˜ ํŠน์„ฑ ์„ ํƒ:")
print(f"๊ฐ ํŠน์„ฑ ๋ถ„์‚ฐ: {selector.variances_}")
print(f"์„ ํƒ๋œ ํŠน์„ฑ: {selector.get_support()}")
print(f"์›๋ณธ ํ˜•์ƒ: {X_var.shape}")
print(f"์„ ํƒ ํ›„ ํ˜•์ƒ: {X_high_var.shape}")

7. ์ƒ๊ด€๊ด€๊ณ„ ๊ธฐ๋ฐ˜ ํŠน์„ฑ ์ œ๊ฑฐ

import pandas as pd

# ์ƒ˜ํ”Œ ๋ฐ์ดํ„ฐ (์ƒ๊ด€๋œ ํŠน์„ฑ ํฌํ•จ)
np.random.seed(42)
n_samples = 100

X_corr = np.column_stack([
    np.random.randn(n_samples),  # ํŠน์„ฑ 0
    np.random.randn(n_samples),  # ํŠน์„ฑ 1
    np.random.randn(n_samples),  # ํŠน์„ฑ 2
])
# ๋†’์€ ์ƒ๊ด€๊ด€๊ณ„ ํŠน์„ฑ ์ถ”๊ฐ€
X_corr = np.column_stack([X_corr, X_corr[:, 0] + np.random.randn(n_samples) * 0.1])

df = pd.DataFrame(X_corr, columns=['F0', 'F1', 'F2', 'F3'])

# ์ƒ๊ด€ํ–‰๋ ฌ
corr_matrix = df.corr().abs()

# ์ƒ๊ด€๊ด€๊ณ„ ํžˆํŠธ๋งต
plt.figure(figsize=(8, 6))
plt.imshow(corr_matrix, cmap='coolwarm', vmin=0, vmax=1)
plt.colorbar(label='Correlation')
plt.xticks(range(len(corr_matrix.columns)), corr_matrix.columns)
plt.yticks(range(len(corr_matrix.columns)), corr_matrix.columns)
plt.title('Feature Correlation Matrix')

for i in range(len(corr_matrix)):
    for j in range(len(corr_matrix)):
        plt.text(j, i, f'{corr_matrix.iloc[i, j]:.2f}',
                 ha='center', va='center')
plt.show()

# ๋†’์€ ์ƒ๊ด€๊ด€๊ณ„ ํŠน์„ฑ ์ œ๊ฑฐ ํ•จ์ˆ˜
def remove_highly_correlated(df, threshold=0.9):
    corr_matrix = df.corr().abs()
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
    return df.drop(columns=to_drop), to_drop

df_cleaned, dropped = remove_highly_correlated(df, threshold=0.9)
print(f"์ œ๊ฑฐ๋œ ํŠน์„ฑ: {dropped}")
print(f"๋‚จ์€ ํŠน์„ฑ: {list(df_cleaned.columns)}")

8. ์ฐจ์› ์ถ•์†Œ ํŒŒ์ดํ”„๋ผ์ธ

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# ๋ฐ์ดํ„ฐ
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# PCA + SVM pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=30)),
    ('svm', SVC(kernel='rbf', random_state=42))
])

# ๊ต์ฐจ ๊ฒ€์ฆ
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"PCA (30) + SVM CV ์ ์ˆ˜: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

# ์ „์ฒด ํŠน์„ฑ vs PCA
pipeline_full = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', random_state=42))
])

cv_scores_full = cross_val_score(pipeline_full, X_train, y_train, cv=5)
print(f"์ „์ฒด ํŠน์„ฑ + SVM CV ์ ์ˆ˜: {cv_scores_full.mean():.4f} (+/- {cv_scores_full.std():.4f})")

print(f"\nPCA๋กœ {X.shape[1]} โ†’ 30 ์ฐจ์› ์ถ•์†Œ")

9. Incremental PCA (Large Datasets)

from sklearn.decomposition import IncrementalPCA

"""
Incremental PCA:
- ๋Œ€์šฉ๋Ÿ‰ ๋ฐ์ดํ„ฐ์— ์ ํ•ฉ
- ๋ฏธ๋‹ˆ๋ฐฐ์น˜๋กœ ์ฒ˜๋ฆฌ
- ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์ 
"""

# ๋Œ€์šฉ๋Ÿ‰ ๋ฐ์ดํ„ฐ ์‹œ๋ฎฌ๋ ˆ์ด์…˜
X_large = np.random.randn(10000, 100)

# ์ผ๋ฐ˜ PCA
pca_regular = PCA(n_components=10)
pca_regular.fit(X_large)

# Incremental PCA
ipca = IncrementalPCA(n_components=10, batch_size=500)
ipca.fit(X_large)

print("์ผ๋ฐ˜ PCA vs Incremental PCA:")
print(f"์„ค๋ช…๋œ ๋ถ„์‚ฐ ๋น„์œจ (์ผ๋ฐ˜): {sum(pca_regular.explained_variance_ratio_):.4f}")
print(f"์„ค๋ช…๋œ ๋ถ„์‚ฐ ๋น„์œจ (์ฆ๋ถ„): {sum(ipca.explained_variance_ratio_):.4f}")

# ๋ฐฐ์น˜๋กœ ์ฒ˜๋ฆฌ (๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ)
ipca_batch = IncrementalPCA(n_components=10)
for batch_start in range(0, len(X_large), 1000):
    batch = X_large[batch_start:batch_start+1000]
    ipca_batch.partial_fit(batch)

print(f"๋ฐฐ์น˜ ์ฒ˜๋ฆฌ ์„ค๋ช…๋œ ๋ถ„์‚ฐ: {sum(ipca_batch.explained_variance_ratio_):.4f}")

10. ์ฐจ์› ์ถ•์†Œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋น„๊ต

from sklearn.decomposition import PCA, KernelPCA, TruncatedSVD
from sklearn.manifold import TSNE, MDS, Isomap
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

"""
์ฐจ์› ์ถ•์†Œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋น„๊ต:

1. PCA: ์„ ํ˜•, ๋ถ„์‚ฐ ์ตœ๋Œ€ํ™”, ๋น ๋ฆ„
2. Kernel PCA: ๋น„์„ ํ˜• PCA
3. LDA: ํด๋ž˜์Šค ๋ถ„๋ฆฌ ์ตœ๋Œ€ํ™” (์ง€๋„ ํ•™์Šต)
4. t-SNE: ์‹œ๊ฐํ™”, ์ง€์—ญ ๊ตฌ์กฐ
5. UMAP: ์‹œ๊ฐํ™”, ์ „์—ญ+์ง€์—ญ ๊ตฌ์กฐ
6. MDS: ๊ฑฐ๋ฆฌ ๋ณด์กด
7. Isomap: ์ธก์ง€์„  ๊ฑฐ๋ฆฌ ๋ณด์กด
"""

# ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋น„๊ต (์ž‘์€ ๋ฐ์ดํ„ฐ์…‹)
algorithms = {
    'PCA': PCA(n_components=2),
    'Kernel PCA': KernelPCA(n_components=2, kernel='rbf'),
    'LDA': LDA(n_components=2),
    't-SNE': TSNE(n_components=2, random_state=42)
}

# ๋ฐ์ดํ„ฐ
X, y = load_iris(return_X_y=True)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# ๋น„๊ต ์‹œ๊ฐํ™”
fig, axes = plt.subplots(1, 4, figsize=(20, 5))

for ax, (name, algo) in zip(axes, algorithms.items()):
    if name == 'LDA':
        X_reduced = algo.fit_transform(X_scaled, y)
    else:
        X_reduced = algo.fit_transform(X_scaled)

    scatter = ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='viridis', alpha=0.7)
    ax.set_title(name)
    ax.set_xlabel('Component 1')
    ax.set_ylabel('Component 2')

plt.tight_layout()
plt.show()

์—ฐ์Šต ๋ฌธ์ œ

๋ฌธ์ œ 1: PCA ์ ์šฉ

Digits ๋ฐ์ดํ„ฐ์— PCA๋ฅผ ์ ์šฉํ•˜๊ณ  95% ๋ถ„์‚ฐ์„ ์„ค๋ช…ํ•˜๋Š” ์ฃผ์„ฑ๋ถ„ ์ˆ˜๋ฅผ ์ฐพ์œผ์„ธ์š”.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X = digits.data

# ํ’€์ด
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA()
pca.fit(X_scaled)

cumsum = np.cumsum(pca.explained_variance_ratio_)
n_95 = np.argmax(cumsum >= 0.95) + 1

print(f"95% ๋ถ„์‚ฐ์— ํ•„์š”ํ•œ ์ฃผ์„ฑ๋ถ„ ์ˆ˜: {n_95}")
print(f"์›๋ณธ ์ฐจ์›: {X.shape[1]}")

๋ฌธ์ œ 2: t-SNE ์‹œ๊ฐํ™”

Digits ๋ฐ์ดํ„ฐ๋ฅผ t-SNE๋กœ ์‹œ๊ฐํ™”ํ•˜์„ธ์š”.

from sklearn.manifold import TSNE

# ํ’€์ด (์‹œ๊ฐ„ ๋‹จ์ถ•์„ ์œ„ํ•ด ์ผ๋ถ€๋งŒ)
X_sample = X[:500]
y_sample = digits.target[:500]

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_sample)

plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_sample, cmap='tab10')
plt.colorbar(scatter)
plt.title('t-SNE: Digits')
plt.show()

๋ฌธ์ œ 3: ํŠน์„ฑ ์„ ํƒ

Random Forest ์ค‘์š”๋„ ๊ธฐ๋ฐ˜์œผ๋กœ ์ƒ์œ„ 20๊ฐœ ํŠน์„ฑ์„ ์„ ํƒํ•˜์„ธ์š”.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# ํ’€์ด
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, digits.target)

# ์ƒ์œ„ 20๊ฐœ
selector = SelectFromModel(rf, max_features=20, threshold=-np.inf)
selector.fit(X, digits.target)
X_selected = selector.transform(X)

print(f"์„ ํƒ๋œ ํŠน์„ฑ ์ˆ˜: {X_selected.shape[1]}")
print(f"์„ ํƒ๋œ ํŠน์„ฑ ์ธ๋ฑ์Šค: {np.where(selector.get_support())[0]}")

์š”์•ฝ

๋ฐฉ๋ฒ• ์œ ํ˜• ํŠน์ง• ์šฉ๋„
PCA ์„ ํ˜• ๋ถ„์‚ฐ ์ตœ๋Œ€ํ™” ์ผ๋ฐ˜์ ์ธ ์ฐจ์› ์ถ•์†Œ
Kernel PCA ๋น„์„ ํ˜• ์ปค๋„ ํŠธ๋ฆญ ๋น„์„ ํ˜• ํŒจํ„ด
LDA ์ง€๋„ ํ•™์Šต ํด๋ž˜์Šค ๋ถ„๋ฆฌ ๋ถ„๋ฅ˜ ์ „์ฒ˜๋ฆฌ
t-SNE ๋น„์„ ํ˜• ์ง€์—ญ ๊ตฌ์กฐ ๋ณด์กด ์‹œ๊ฐํ™”
UMAP ๋น„์„ ํ˜• ๋น ๋ฆ„, ์ „์—ญ ๊ตฌ์กฐ ์‹œ๊ฐํ™”

ํŠน์„ฑ ์„ ํƒ ๋ฐฉ๋ฒ• ๋น„๊ต

๋ฐฉ๋ฒ• ์œ ํ˜• ์žฅ์  ๋‹จ์ 
Filter ํ†ต๊ณ„ ๊ธฐ๋ฐ˜ ๋น ๋ฆ„ ํŠน์„ฑ ๊ฐ„ ๊ด€๊ณ„ ๋ฌด์‹œ
Wrapper ๋ชจ๋ธ ๊ธฐ๋ฐ˜ ์ •ํ™• ๋А๋ฆผ, ๊ณผ์ ํ•ฉ
Embedded ํ•™์Šต ์ค‘ ์„ ํƒ ํšจ์œจ์  ๋ชจ๋ธ ์˜์กด์ 

์ฐจ์› ์ถ•์†Œ ์„ ํƒ ๊ฐ€์ด๋“œ

์ƒํ™ฉ ๊ถŒ์žฅ ๋ฐฉ๋ฒ•
๋…ธ์ด์ฆˆ ์ œ๊ฑฐ, ์••์ถ• PCA
์‹œ๊ฐํ™” (2D/3D) t-SNE, UMAP
๋ถ„๋ฅ˜ ์ „์ฒ˜๋ฆฌ LDA
๋น„์„ ํ˜• ํŒจํ„ด Kernel PCA, UMAP
๋Œ€์šฉ๋Ÿ‰ ๋ฐ์ดํ„ฐ Incremental PCA, TruncatedSVD
ํŠน์„ฑ ํ•ด์„ ํ•„์š” ํŠน์„ฑ ์„ ํƒ (Filter/Embedded)