Dimensionality Reduction
Dimensionality Reduction¶
Overview¶
Dimensionality reduction transforms high-dimensional data into lower dimensions while preserving important information. It's used for visualization, noise reduction, and improving model performance.
1. Why Dimensionality Reduction?¶
1.1 Curse of Dimensionality¶
Problem: As dimensions increase: - Data becomes sparse (most points are far apart) - Distance metrics become meaningless - Computational cost increases exponentially - Overfitting risk increases
1.2 Benefits of Dimensionality Reduction¶
- Visualization: Project high-D data to 2D/3D for plotting
- Noise Reduction: Remove irrelevant features
- Faster Training: Fewer features → faster algorithms
- Storage Efficiency: Smaller datasets
- Multicollinearity Reduction: Remove redundant features
1.3 Types of Dimensionality Reduction¶
| Type | Description | Examples |
|---|---|---|
| Feature Selection | Select subset of original features | Filter, Wrapper, Embedded methods |
| Feature Extraction | Create new features from original ones | PCA, t-SNE, LDA |
2. Principal Component Analysis (PCA)¶
2.1 Concepts¶
Goal: Find orthogonal axes (principal components) that capture maximum variance
Key Ideas: - PC1 (1st component): Direction of maximum variance - PC2 (2nd component): Orthogonal to PC1, captures next most variance - PC3, PC4, ...: Continue orthogonally
Mathematical Foundation: - Eigenvalue decomposition of covariance matrix - Eigenvectors → Principal components - Eigenvalues → Variance explained
2.2 Basic Implementation¶
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
# Load data
iris = load_iris()
X = iris.data
y = iris.target
# Standardize (IMPORTANT for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Visualize
plt.figure(figsize=(10, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolors='k', alpha=0.7)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%} variance)')
plt.title('PCA of Iris Dataset')
plt.colorbar(scatter, label='Species')
plt.show()
# Print explained variance
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {pca.explained_variance_ratio_.sum():.2%}")
2.3 Choosing Number of Components¶
Method 1: Explained Variance Ratio
# PCA with all components
pca_full = PCA()
pca_full.fit(X_scaled)
# Plot cumulative explained variance
cumsum = np.cumsum(pca_full.explained_variance_ratio_)
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(cumsum) + 1), cumsum, marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance by PCA Components')
plt.axhline(y=0.95, color='r', linestyle='--', label='95% variance')
plt.legend()
plt.grid(True)
plt.show()
# Find number of components for 95% variance
n_components_95 = np.argmax(cumsum >= 0.95) + 1
print(f"Components needed for 95% variance: {n_components_95}")
Method 2: Scree Plot (Elbow method)
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(pca_full.explained_variance_) + 1),
pca_full.explained_variance_, marker='o')
plt.xlabel('Principal Component')
plt.ylabel('Eigenvalue (Variance)')
plt.title('Scree Plot')
plt.grid(True)
plt.show()
2.4 PCA with Automatic Variance Selection¶
# Keep components that explain at least 95% variance
pca_95 = PCA(n_components=0.95)
X_pca_95 = pca_95.fit_transform(X_scaled)
print(f"Original dimensions: {X_scaled.shape[1]}")
print(f"Reduced dimensions: {X_pca_95.shape[1]}")
print(f"Variance explained: {pca_95.explained_variance_ratio_.sum():.2%}")
2.5 Inverse Transform (Reconstruction)¶
# Reduce to 2 components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
# Reconstruct original data
X_reconstructed = pca.inverse_transform(X_reduced)
# Calculate reconstruction error
mse = np.mean((X_scaled - X_reconstructed) ** 2)
print(f"Reconstruction MSE: {mse:.4f}")
# Visualize first sample
print("Original (scaled):", X_scaled[0])
print("Reconstructed: ", X_reconstructed[0])
2.6 PCA Components (Loadings)¶
# Get component loadings
components = pca_full.components_
# Visualize first two components
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for i, ax in enumerate(axes):
ax.bar(range(len(components[i])), components[i])
ax.set_xlabel('Original Feature')
ax.set_ylabel('Contribution')
ax.set_title(f'PC{i+1} Loadings')
ax.set_xticks(range(len(iris.feature_names)))
ax.set_xticklabels(iris.feature_names, rotation=45)
plt.tight_layout()
plt.show()
3. t-SNE (t-Distributed Stochastic Neighbor Embedding)¶
3.1 Concepts¶
Goal: Preserve local structure (nearby points stay nearby in low dimensions)
Key Differences from PCA: - Non-linear (PCA is linear) - Primarily for visualization (2D/3D), not feature reduction - Stochastic (different runs give different results) - Slower than PCA
Perplexity: Balances local vs global structure (typically 5-50)
3.2 Basic Implementation¶
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
# Load digits dataset (64 dimensions)
digits = load_digits()
X = digits.data
y = digits.target
# Apply t-SNE (WARNING: slow on large datasets)
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X)
# Visualize
plt.figure(figsize=(12, 8))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', alpha=0.6, s=5)
plt.colorbar(scatter, label='Digit')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.title('t-SNE Visualization of Digits Dataset')
plt.show()
3.3 Effect of Perplexity¶
# Compare different perplexity values
perplexities = [5, 30, 50]
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, perplexity in zip(axes, perplexities):
tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
X_tsne = tsne.fit_transform(X[:500]) # Use subset for speed
ax.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y[:500], cmap='tab10', alpha=0.6, s=10)
ax.set_title(f'Perplexity = {perplexity}')
ax.set_xlabel('t-SNE 1')
ax.set_ylabel('t-SNE 2')
plt.tight_layout()
plt.show()
3.4 PCA → t-SNE Pipeline (Recommended)¶
# Speed up t-SNE by first reducing dimensions with PCA
from sklearn.pipeline import Pipeline
# Pipeline: PCA (reduce to 50 components) → t-SNE (reduce to 2)
tsne_pipeline = Pipeline([
('pca', PCA(n_components=50)),
('tsne', TSNE(n_components=2, perplexity=30, random_state=42))
])
X_tsne_fast = tsne_pipeline.fit_transform(X)
plt.figure(figsize=(12, 8))
scatter = plt.scatter(X_tsne_fast[:, 0], X_tsne_fast[:, 1], c=y, cmap='tab10', alpha=0.6, s=5)
plt.colorbar(scatter, label='Digit')
plt.title('t-SNE with PCA Preprocessing')
plt.show()
3.5 t-SNE vs PCA Comparison¶
from sklearn.datasets import load_wine
# Load wine dataset
wine = load_wine()
X = wine.data
y = wine.target
# Standardize
X_scaled = StandardScaler().fit_transform(X)
# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', alpha=0.7)
axes[0].set_title('PCA')
axes[0].set_xlabel('PC1')
axes[0].set_ylabel('PC2')
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis', alpha=0.7)
axes[1].set_title('t-SNE')
axes[1].set_xlabel('t-SNE 1')
axes[1].set_ylabel('t-SNE 2')
plt.tight_layout()
plt.show()
4. Other Dimensionality Reduction Techniques¶
4.1 Truncated SVD (Similar to PCA, works with sparse matrices)¶
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=2, random_state=42)
X_svd = svd.fit_transform(X_scaled)
print(f"Explained variance ratio: {svd.explained_variance_ratio_}")
4.2 UMAP (Uniform Manifold Approximation and Projection)¶
Advantages over t-SNE: - Faster - Better preserves global structure - Can be used for feature extraction (not just visualization)
# pip install umap-learn
import umap
# Apply UMAP
umap_model = umap.UMAP(n_components=2, random_state=42)
X_umap = umap_model.fit_transform(X_scaled)
plt.figure(figsize=(10, 6))
plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='viridis', alpha=0.7)
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.title('UMAP Visualization')
plt.colorbar(label='Class')
plt.show()
5. Feature Selection Methods¶
5.1 Filter Methods (Statistical tests)¶
from sklearn.feature_selection import SelectKBest, chi2, f_classif
# Chi-squared test (for non-negative features)
X_positive = X - X.min() # Make all values non-negative
selector = SelectKBest(chi2, k=2)
X_selected = selector.fit_transform(X_positive, y)
print(f"Original shape: {X.shape}")
print(f"Selected shape: {X_selected.shape}")
print(f"Selected features: {selector.get_support()}")
# F-statistic (ANOVA)
selector_f = SelectKBest(f_classif, k=2)
X_selected_f = selector_f.fit_transform(X, y)
print(f"F-scores: {selector_f.scores_}")
5.2 Wrapper Methods (Recursive Feature Elimination)¶
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
# Recursive Feature Elimination
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfe = RFE(estimator=rfc, n_features_to_select=2)
rfe.fit(X, y)
print(f"Selected features: {rfe.support_}")
print(f"Feature ranking: {rfe.ranking_}")
5.3 Embedded Methods (Feature Importance)¶
# Train Random Forest
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X, y)
# Get feature importances
importances = rfc.feature_importances_
indices = np.argsort(importances)[::-1]
# Plot
plt.figure(figsize=(10, 6))
plt.bar(range(len(importances)), importances[indices])
plt.xticks(range(len(importances)), [iris.feature_names[i] for i in indices], rotation=45)
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.title('Feature Importances from Random Forest')
plt.tight_layout()
plt.show()
# Select top k features
k = 2
top_k_indices = indices[:k]
X_selected = X[:, top_k_indices]
print(f"Top {k} features: {[iris.feature_names[i] for i in top_k_indices]}")
5.4 L1 Regularization (Lasso)¶
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
# L1 regularization
lasso = LogisticRegression(penalty='l1', solver='liblinear', C=0.1, random_state=42)
lasso.fit(X_scaled, y)
# Select features based on L1 coefficients
selector = SelectFromModel(lasso, prefit=True)
X_selected = selector.transform(X_scaled)
print(f"Original features: {X_scaled.shape[1]}")
print(f"Selected features: {X_selected.shape[1]}")
print(f"Selected feature mask: {selector.get_support()}")
6. Practical Application: High-Dimensional Data¶
6.1 MNIST Digits Visualization¶
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
# Load MNIST (WARNING: large dataset)
mnist = fetch_openml('mnist_784', version=1, parser='auto')
X, y = mnist.data[:5000], mnist.target[:5000] # Use subset
# Standardize
X_scaled = StandardScaler().fit_transform(X)
# PCA (fast)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# t-SNE with PCA preprocessing
pca_50 = PCA(n_components=50)
X_pca_50 = pca_50.fit_transform(X_scaled)
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_pca_50)
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y.astype(int), cmap='tab10', alpha=0.5, s=1)
axes[0].set_title('PCA on MNIST')
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y.astype(int), cmap='tab10', alpha=0.5, s=1)
axes[1].set_title('t-SNE on MNIST')
plt.tight_layout()
plt.show()
6.2 Impact on Classification Performance¶
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import time
# Original data
start = time.time()
rfc = RandomForestClassifier(n_estimators=50, random_state=42)
rfc.fit(X_scaled, y)
y_pred = rfc.predict(X_scaled)
time_original = time.time() - start
acc_original = accuracy_score(y, y_pred)
# PCA-reduced data (50 components)
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X_scaled)
start = time.time()
rfc_pca = RandomForestClassifier(n_estimators=50, random_state=42)
rfc_pca.fit(X_pca, y)
y_pred_pca = rfc_pca.predict(X_pca)
time_pca = time.time() - start
acc_pca = accuracy_score(y, y_pred_pca)
print(f"Original (784 features):")
print(f" Accuracy: {acc_original:.4f}, Time: {time_original:.2f}s")
print(f"\nPCA (50 components):")
print(f" Accuracy: {acc_pca:.4f}, Time: {time_pca:.2f}s")
print(f" Speedup: {time_original/time_pca:.2f}x")
7. Comparison of Methods¶
| Method | Type | Linear | Speed | Use Case |
|---|---|---|---|---|
| PCA | Extraction | Yes | Fast | Feature reduction, noise reduction |
| t-SNE | Extraction | No | Slow | Visualization only |
| UMAP | Extraction | No | Medium | Visualization + feature reduction |
| SelectKBest | Selection | - | Fast | Quick feature selection |
| RFE | Selection | - | Slow | Thorough feature selection |
| Feature Importance | Selection | - | Medium | Interpretable feature selection |
8. Practical Tips¶
8.1 When to Use PCA¶
- High-dimensional data (>50 features)
- Features are correlated
- Need linear transformation
- Want to remove multicollinearity
- Need reproducible results
8.2 When to Use t-SNE/UMAP¶
- Visualization of high-D data in 2D/3D
- Need to preserve local structure
- Exploring clusters in data
- Not for feature extraction in pipelines (except UMAP)
8.3 Feature Selection vs Extraction¶
| Aspect | Feature Selection | Feature Extraction |
|---|---|---|
| Interpretability | High (keeps original features) | Low (new features) |
| Information Loss | Possible (discards features) | Minimal (transforms all) |
| Use Case | When features have meaning | When interpretability isn't critical |
9. Exercises¶
Exercise 1: PCA on Wine Dataset¶
Load the wine dataset, apply PCA, and determine how many components capture 90% variance.
from sklearn.datasets import load_wine
# Your code here
Exercise 2: t-SNE Visualization¶
Load the digits dataset and compare t-SNE with different perplexity values (5, 30, 50).
from sklearn.datasets import load_digits
# Your code here
Exercise 3: Feature Selection¶
Use RFE to select the best 3 features from the breast cancer dataset and evaluate performance.
from sklearn.datasets import load_breast_cancer
# Your code here
Exercise 4: PCA + Classification¶
Compare classification performance on original vs PCA-reduced data on a dataset of your choice.
# Your code here
Summary¶
| Topic | Key Points |
|---|---|
| PCA | Linear, maximize variance, eigenvectors, choose components by explained variance |
| Scree Plot | Elbow method for PCA, plot eigenvalues |
| Inverse Transform | Reconstruct original data from reduced dimensions |
| t-SNE | Non-linear, visualization, perplexity parameter, slow |
| UMAP | Faster than t-SNE, preserves global structure |
| Feature Selection | Filter (statistical), Wrapper (RFE), Embedded (importance) |
| Standardization | Always scale features before PCA/t-SNE |
| Applications | Visualization, noise reduction, faster training, multicollinearity |
Key Takeaway: Use PCA for linear dimensionality reduction and feature extraction. Use t-SNE/UMAP for non-linear visualization. Use feature selection when interpretability is important.