22. Multivariate Analysis¶
Overview¶
Multivariate analysis covers statistical techniques for analyzing several variables simultaneously. In this chapter we look at dimensionality reduction (PCA, factor analysis), discriminant analysis (LDA, QDA), and cluster validation.
1. Principal Component Analysis (PCA)¶
1.1 PCA Concept¶
Goal: Project high-dimensional data onto a lower-dimensional space while preserving as much variance as possible
Principal Component: An orthogonal direction along which the variance of the data is maximal
Mathematical Definition:
- First principal component: the unit vector w₁ that maximizes Var(w₁ᵀX)
- k-th principal component: the unit vector that maximizes the projected variance while being orthogonal to the first k−1 components
Maximizing Var(wᵀX) = wᵀΣw subject to ‖w‖ = 1 leads to the eigenproblem Σw = λw, so the principal components are the eigenvectors of the covariance matrix ordered by eigenvalue (this is exactly what the from-scratch implementation in 1.2 computes).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris, load_wine, load_breast_cancer
np.random.seed(42)
# Intuitive understanding of PCA with 2D example
n = 200
theta = np.pi / 4
cov = [[3, 2], [2, 2]]
X_2d = np.random.multivariate_normal([0, 0], cov, n)
# Perform PCA
pca_2d = PCA()
X_pca = pca_2d.fit_transform(X_2d)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Original data
ax = axes[0]
ax.scatter(X_2d[:, 0], X_2d[:, 1], alpha=0.5)
# Show principal component directions
mean = X_2d.mean(axis=0)
for i, (comp, var) in enumerate(zip(pca_2d.components_, pca_2d.explained_variance_)):
ax.annotate('', xy=mean + 2 * np.sqrt(var) * comp, xytext=mean,
arrowprops=dict(arrowstyle='->', color=['red', 'blue'][i], lw=2))
ax.set_xlabel('X1')
ax.set_ylabel('X2')
ax.set_title('Original Data and Principal Component Directions')
ax.axis('equal')
ax.grid(True, alpha=0.3)
# Principal component space
ax = axes[1]
ax.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.5)
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_title('Principal Component Space')
ax.axis('equal')
ax.grid(True, alpha=0.3)
# Explained variance
ax = axes[2]
explained_var_ratio = pca_2d.explained_variance_ratio_
ax.bar([1, 2], explained_var_ratio, alpha=0.7)
ax.set_xlabel('Principal Component')
ax.set_ylabel('Explained Variance Ratio')
ax.set_title(f'Explained Variance: PC1={explained_var_ratio[0]:.1%}, PC2={explained_var_ratio[1]:.1%}')
ax.set_xticks([1, 2])
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
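As a sanity check on the definition in 1.1, we can compare the variance of the data projected onto PC1 against many random unit directions; none should exceed PC1. This is a minimal sketch that assumes X_2d and pca_2d from the cell above are still in scope.
# Projected variance along random unit vectors vs. along PC1
rng = np.random.default_rng(0)
directions = rng.normal(size=(1000, 2))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
X_centered = X_2d - X_2d.mean(axis=0)
proj_var = ((X_centered @ directions.T) ** 2).sum(axis=0) / (len(X_2d) - 1)
print(f"Max variance over 1000 random directions: {proj_var.max():.3f}")
print(f"Variance along PC1:                       {pca_2d.explained_variance_[0]:.3f}")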
1.2 PCA Theory¶
def pca_from_scratch(X, n_components=None):
"""
PCA implementation from scratch
1. Center data (mean 0)
2. Compute covariance matrix
3. Eigenvalue decomposition
4. Sort eigenvectors (descending eigenvalues)
"""
# Center data
X_centered = X - X.mean(axis=0)
# Covariance matrix
n = X.shape[0]
cov_matrix = (X_centered.T @ X_centered) / (n - 1)
# Eigenvalue decomposition
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# Sort in descending order
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
# Explained variance ratio
explained_variance_ratio = eigenvalues / eigenvalues.sum()
if n_components is not None:
eigenvectors = eigenvectors[:, :n_components]
eigenvalues = eigenvalues[:n_components]
explained_variance_ratio = explained_variance_ratio[:n_components]
# Project
X_pca = X_centered @ eigenvectors
return {
'components': eigenvectors.T,
'explained_variance': eigenvalues,
'explained_variance_ratio': explained_variance_ratio,
'transformed': X_pca
}
# Verify: compare with sklearn
X_test = np.random.randn(100, 5)
result_scratch = pca_from_scratch(X_test, n_components=3)
pca_sklearn = PCA(n_components=3).fit(X_test)
print("=== PCA Implementation Verification ===")
print(f"Explained variance ratio (scratch): {result_scratch['explained_variance_ratio']}")
print(f"Explained variance ratio (sklearn): {pca_sklearn.explained_variance_ratio_}")
print("(Signs may differ but absolute values should match)")
1.3 PCA with sklearn¶
# Iris dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
feature_names = iris.feature_names
# Standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_iris)
# PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
# Results analysis
print("=== Iris PCA Results ===")
print(f"Original features: {X_iris.shape[1]}")
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Cumulative explained variance: {np.cumsum(pca.explained_variance_ratio_)}")
# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Scree plot
ax = axes[0]
ax.bar(range(1, 5), pca.explained_variance_ratio_, alpha=0.7, label='Individual')
ax.plot(range(1, 5), np.cumsum(pca.explained_variance_ratio_), 'ro-', label='Cumulative')
ax.set_xlabel('Principal Component')
ax.set_ylabel('Explained Variance Ratio')
ax.set_title('Scree Plot')
ax.legend()
ax.set_xticks(range(1, 5))
ax.grid(True, alpha=0.3)
# PC1 vs PC2
ax = axes[1]
for target in np.unique(y_iris):
mask = y_iris == target
ax.scatter(X_pca[mask, 0], X_pca[mask, 1], alpha=0.7,
label=iris.target_names[target])
ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
ax.set_title('PCA: PC1 vs PC2')
ax.legend()
ax.grid(True, alpha=0.3)
# Component loadings
ax = axes[2]
loadings = pd.DataFrame(
pca.components_[:2].T,
columns=['PC1', 'PC2'],
index=feature_names
)
loadings.plot(kind='bar', ax=ax, alpha=0.7)
ax.set_ylabel('Loading')
ax.set_title('Principal Component Loadings')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
1.4 Determining Number of Components¶
def determine_n_components(X, methods=['kaiser', 'variance', 'elbow']):
"""
Methods for determining number of principal components
1. Kaiser rule: eigenvalue > 1 (for standardized data)
2. Variance criterion: cumulative variance >= threshold (typically 85-95%)
3. Scree plot: elbow point
"""
pca = PCA()
pca.fit(X)
results = {}
# Kaiser rule
if 'kaiser' in methods:
n_kaiser = np.sum(pca.explained_variance_ > 1)
results['kaiser'] = n_kaiser
print(f"Kaiser rule (eigenvalue > 1): {n_kaiser} components")
# Variance criterion
if 'variance' in methods:
cumsum = np.cumsum(pca.explained_variance_ratio_)
n_80 = np.argmax(cumsum >= 0.80) + 1
n_90 = np.argmax(cumsum >= 0.90) + 1
n_95 = np.argmax(cumsum >= 0.95) + 1
results['variance_80'] = n_80
results['variance_90'] = n_90
results['variance_95'] = n_95
print(f"Variance criterion 80%: {n_80} components")
print(f"Variance criterion 90%: {n_90} components")
print(f"Variance criterion 95%: {n_95} components")
# Scree plot
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
ax = axes[0]
ax.plot(range(1, len(pca.explained_variance_) + 1),
pca.explained_variance_, 'bo-')
ax.axhline(1, color='r', linestyle='--', label='Kaiser (eigenvalue=1)')
ax.set_xlabel('Principal Component')
ax.set_ylabel('Eigenvalue')
ax.set_title('Scree Plot (Eigenvalues)')
ax.legend()
ax.grid(True, alpha=0.3)
ax = axes[1]
ax.plot(range(1, len(pca.explained_variance_ratio_) + 1),
np.cumsum(pca.explained_variance_ratio_), 'go-')
ax.axhline(0.80, color='orange', linestyle='--', label='80%')
ax.axhline(0.90, color='r', linestyle='--', label='90%')
ax.axhline(0.95, color='purple', linestyle='--', label='95%')
ax.set_xlabel('Principal Component')
ax.set_ylabel('Cumulative Explained Variance')
ax.set_title('Cumulative Variance')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
return results
# Test with Wine dataset
wine = load_wine()
X_wine = StandardScaler().fit_transform(wine.data)
print("=== Wine Dataset Component Selection ===")
results = determine_n_components(X_wine)
1.5 Biplot¶
def biplot(X, y, pca, feature_names, target_names, ax=None):
"""
PCA biplot: display observations and variables simultaneously
"""
if ax is None:
fig, ax = plt.subplots(figsize=(10, 8))
X_pca = pca.transform(X)
# Scaling (same scale for observations and loadings)
scale = 1 / np.max(np.abs(X_pca[:, :2]))
# Plot observations
for target in np.unique(y):
mask = y == target
ax.scatter(X_pca[mask, 0] * scale, X_pca[mask, 1] * scale,
alpha=0.5, label=target_names[target], s=30)
# Loading vectors
loadings = pca.components_[:2].T
for i, (loading, name) in enumerate(zip(loadings, feature_names)):
ax.arrow(0, 0, loading[0], loading[1],
head_width=0.05, head_length=0.03, fc='red', ec='red')
ax.text(loading[0] * 1.1, loading[1] * 1.1, name, fontsize=9,
ha='center', va='center')
ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
ax.set_title('Biplot')
ax.legend(loc='best')
ax.grid(True, alpha=0.3)
ax.axhline(0, color='k', linewidth=0.5)
ax.axvline(0, color='k', linewidth=0.5)
return ax
# Iris biplot
pca_iris = PCA(n_components=4).fit(X_scaled)
biplot(X_scaled, y_iris, pca_iris, feature_names, iris.target_names)
plt.show()
2. Factor Analysis¶
2.1 Factor Analysis vs PCA¶
| Aspect | PCA | Factor Analysis |
|---|---|---|
| Goal | Maximize variance | Discover latent factors |
| Model | Components are linear combinations of the observed variables | Observed variable = loadings × common factors + unique error |
| Unique variance | None | Unique variance per variable |
| Rotation | Not used (solution is unique up to sign) | Rotated (e.g., Varimax) to aid interpretation |
| Use | Dimensionality reduction | Structure discovery, survey analysis |
2.2 Factor Analysis Model¶
Model: $$X_i = \mu_i + \lambda_{i1}F_1 + \lambda_{i2}F_2 + \cdots + \lambda_{im}F_m + \epsilon_i$$
- Fⱼ: Common factor (latent factor)
- λᵢⱼ: Factor loading
- εᵢ: Unique factor (error)
from sklearn.decomposition import FactorAnalysis
from scipy.stats import zscore
# Factor analysis example
np.random.seed(42)
# Generate data with 2 latent factors
n = 300
F1 = np.random.normal(0, 1, n) # Factor 1
F2 = np.random.normal(0, 1, n) # Factor 2
# 6 observed variables (3 loading on each factor)
X1 = 0.8 * F1 + 0.1 * F2 + np.random.normal(0, 0.3, n)
X2 = 0.7 * F1 + 0.2 * F2 + np.random.normal(0, 0.3, n)
X3 = 0.9 * F1 + 0.0 * F2 + np.random.normal(0, 0.3, n)
X4 = 0.1 * F1 + 0.8 * F2 + np.random.normal(0, 0.3, n)
X5 = 0.2 * F1 + 0.7 * F2 + np.random.normal(0, 0.3, n)
X6 = 0.0 * F1 + 0.9 * F2 + np.random.normal(0, 0.3, n)
X_fa = np.column_stack([X1, X2, X3, X4, X5, X6])
X_fa = zscore(X_fa) # Standardize
# Factor analysis
fa = FactorAnalysis(n_components=2, random_state=42)
F_scores = fa.fit_transform(X_fa)
print("=== Factor Analysis Results ===")
print("\nFactor Loadings:")
loadings_df = pd.DataFrame(
fa.components_.T,
columns=['Factor 1', 'Factor 2'],
index=[f'X{i+1}' for i in range(6)]
)
print(loadings_df.round(3))
print(f"\nUniqueness:")
print(pd.Series(fa.noise_variance_, index=[f'X{i+1}' for i in range(6)]).round(3))
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Factor loadings plot
ax = axes[0]
loadings_df.plot(kind='bar', ax=ax, alpha=0.7)
ax.set_ylabel('Loading')
ax.set_title('Factor Loadings')
ax.set_xticklabels(ax.get_xticklabels(), rotation=0)
ax.axhline(0, color='k', linewidth=0.5)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
# Factor scores
ax = axes[1]
ax.scatter(F_scores[:, 0], F_scores[:, 1], alpha=0.5)
ax.set_xlabel('Factor 1')
ax.set_ylabel('Factor 2')
ax.set_title('Factor Scores')
ax.axhline(0, color='k', linewidth=0.5)
ax.axvline(0, color='k', linewidth=0.5)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
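Because the variables were standardized, each variable's unit variance should split into a communality (the row sum of squared loadings, i.e., the part reproduced by the common factors) plus its uniqueness. A quick check reusing fa and loadings_df from above; the sums are only approximately 1 because the model is fit by maximum likelihood.
# Communality + uniqueness should be close to 1 for standardized variables
communality = (loadings_df ** 2).sum(axis=1)
decomposition = pd.DataFrame({
    'communality': communality,
    'uniqueness': fa.noise_variance_,
    'total': communality + fa.noise_variance_
})
print(decomposition.round(3))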
2.3 Factor Rotation¶
def varimax_rotation(loadings, n_iter=100, tol=1e-6):
"""
Varimax rotation (orthogonal rotation)
Maximize loading variance for easier interpretation
"""
p, k = loadings.shape
rotated = loadings.copy()
for _ in range(n_iter):
old_rotated = rotated.copy()
for i in range(k - 1):
for j in range(i + 1, k):
# 2x2 rotation
x = rotated[:, i]
y = rotated[:, j]
u = x**2 - y**2
v = 2 * x * y
A = np.sum(u)
B = np.sum(v)
C = np.sum(u**2 - v**2)
D = 2 * np.sum(u * v)
phi = 0.25 * np.arctan2(D - 2 * A * B / p,
C - (A**2 - B**2) / p)
cos_phi = np.cos(phi)
sin_phi = np.sin(phi)
rotated[:, i] = x * cos_phi + y * sin_phi
rotated[:, j] = -x * sin_phi + y * cos_phi
if np.max(np.abs(rotated - old_rotated)) < tol:
break
return rotated
# Compare before and after rotation
loadings_original = fa.components_.T
loadings_rotated = varimax_rotation(loadings_original)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Before rotation
ax = axes[0]
pd.DataFrame(loadings_original, columns=['F1', 'F2'],
index=[f'X{i+1}' for i in range(6)]).plot(kind='bar', ax=ax, alpha=0.7)
ax.set_title('Before Rotation')
ax.set_xticklabels(ax.get_xticklabels(), rotation=0)
ax.grid(True, alpha=0.3, axis='y')
# After rotation
ax = axes[1]
pd.DataFrame(loadings_rotated, columns=['F1', 'F2'],
index=[f'X{i+1}' for i in range(6)]).plot(kind='bar', ax=ax, alpha=0.7)
ax.set_title('After Varimax Rotation')
ax.set_xticklabels(ax.get_xticklabels(), rotation=0)
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
print("=== Loadings After Varimax Rotation ===")
print(pd.DataFrame(loadings_rotated, columns=['Factor 1', 'Factor 2'],
index=[f'X{i+1}' for i in range(6)]).round(3))
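Since Varimax is an orthogonal rotation, it only redistributes variance between the factors; each variable's communality (row sum of squared loadings) is unchanged. A one-line check on the arrays computed above:
# Orthogonal rotation preserves communalities
print("Communalities before rotation:", (loadings_original**2).sum(axis=1).round(3))
print("Communalities after rotation: ", (loadings_rotated**2).sum(axis=1).round(3))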
3. Discriminant Analysis¶
3.1 LDA (Linear Discriminant Analysis)¶
Goal: Find linear combinations of the features that maximize class separation
Criterion: Maximize the ratio of between-class variance to within-class variance (Fisher criterion) $$J(w) = \frac{w^\top S_B w}{w^\top S_W w}$$
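To make this criterion concrete before calling sklearn, here is a minimal from-scratch sketch: build the within-class and between-class scatter matrices and solve the generalized eigenproblem S_B w = λ S_W w; the eigenvectors with the largest eigenvalues are the discriminant directions. It assumes X_scaled and y_iris from section 1.3 are in scope; sklearn's LDA below uses a different (SVD-based) solver, so the scaling and signs of its axes may differ from this sketch.
from scipy.linalg import eigh

def fisher_directions(X, y, n_components=2):
    """Discriminant directions maximizing between-class over within-class scatter (illustrative sketch)."""
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))  # within-class scatter
    S_B = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)
        S_W += (X_c - mean_c).T @ (X_c - mean_c)
        diff = (mean_c - overall_mean)[:, None]
        S_B += len(X_c) * (diff @ diff.T)
    # Generalized eigenproblem S_B w = lambda S_W w (S_W must be positive definite)
    eigvals, eigvecs = eigh(S_B, S_W)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order]

W_fisher = fisher_directions(X_scaled, y_iris)
X_fisher = X_scaled @ W_fisher
print(f"Fisher projection shape: {X_fisher.shape}")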
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
# Iris data LDA
lda = LinearDiscriminantAnalysis()
X_lda = lda.fit_transform(X_scaled, y_iris)
print("=== LDA Results ===")
print(f"Number of discriminant functions: {X_lda.shape[1]}")
print(f"Explained variance ratio: {lda.explained_variance_ratio_}")
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# LDA projection
ax = axes[0]
for target in np.unique(y_iris):
mask = y_iris == target
ax.scatter(X_lda[mask, 0], X_lda[mask, 1], alpha=0.7,
label=iris.target_names[target])
ax.set_xlabel(f'LD1 ({lda.explained_variance_ratio_[0]:.1%})')
ax.set_ylabel(f'LD2 ({lda.explained_variance_ratio_[1]:.1%})')
ax.set_title('LDA Projection')
ax.legend()
ax.grid(True, alpha=0.3)
# PCA vs LDA comparison
ax = axes[1]
pca_2 = PCA(n_components=2)
X_pca_2 = pca_2.fit_transform(X_scaled)
for target in np.unique(y_iris):
mask = y_iris == target
ax.scatter(X_pca_2[mask, 0], X_pca_2[mask, 1], alpha=0.7,
label=iris.target_names[target])
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_title('PCA Projection (comparison)')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
3.2 LDA Classifier¶
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y_iris, test_size=0.3, random_state=42
)
# LDA classifier
lda_clf = LinearDiscriminantAnalysis()
lda_clf.fit(X_train, y_train)
y_pred_lda = lda_clf.predict(X_test)
print("=== LDA Classification Performance ===")
print(f"Train accuracy: {lda_clf.score(X_train, y_train):.4f}")
print(f"Test accuracy: {lda_clf.score(X_test, y_test):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_lda, target_names=iris.target_names))
# Cross-validation
cv_scores = cross_val_score(lda_clf, X_scaled, y_iris, cv=5)
print(f"\n5-fold CV accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")
3.3 QDA (Quadratic Discriminant Analysis)¶
# QDA: allows different covariances per class
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)
y_pred_qda = qda.predict(X_test)
print("=== QDA Classification Performance ===")
print(f"Train accuracy: {qda.score(X_train, y_train):.4f}")
print(f"Test accuracy: {qda.score(X_test, y_test):.4f}")
# LDA vs QDA comparison
print("\n=== LDA vs QDA Comparison ===")
comparison = pd.DataFrame({
'Model': ['LDA', 'QDA'],
'Train Accuracy': [lda_clf.score(X_train, y_train),
qda.score(X_train, y_train)],
'Test Accuracy': [lda_clf.score(X_test, y_test),
qda.score(X_test, y_test)]
})
print(comparison)
print("\nLDA vs QDA Selection Criteria:")
print("- LDA: Assumes equal covariances across classes, simpler, suitable for small data")
print("- QDA: Allows different covariances per class, more flexible, suitable for large data")
3.4 Decision Boundary Visualization¶
def plot_decision_boundary_2d(model, X, y, title='', ax=None):
"""Visualize 2D decision boundary"""
if ax is None:
fig, ax = plt.subplots(figsize=(8, 6))
# Create grid
h = 0.02
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# Predict
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Decision boundary
ax.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
ax.contour(xx, yy, Z, colors='k', linewidths=0.5)
# Data points
scatter = ax.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis',
edgecolors='k', s=50, alpha=0.7)
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.set_title(title)
ax.grid(True, alpha=0.3)
return ax
# Reduce to 2D for visualization
X_2d = X_scaled[:, :2]
X_train_2d, X_test_2d, y_train_2d, y_test_2d = train_test_split(
X_2d, y_iris, test_size=0.3, random_state=42
)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# LDA decision boundary
lda_2d = LinearDiscriminantAnalysis()
lda_2d.fit(X_train_2d, y_train_2d)
plot_decision_boundary_2d(lda_2d, X_2d, y_iris, 'LDA Decision Boundary', axes[0])
# QDA decision boundary
qda_2d = QuadraticDiscriminantAnalysis()
qda_2d.fit(X_train_2d, y_train_2d)
plot_decision_boundary_2d(qda_2d, X_2d, y_iris, 'QDA Decision Boundary', axes[1])
plt.tight_layout()
plt.show()
4. Cluster Validation¶
4.1 Internal Metrics¶
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples, calinski_harabasz_score, davies_bouldin_score
# Clustering analysis example
np.random.seed(42)
# Generate data (3 clusters)
n_samples = 300
X_cluster = np.vstack([
np.random.normal([0, 0], 0.5, (n_samples//3, 2)),
np.random.normal([3, 3], 0.5, (n_samples//3, 2)),
np.random.normal([0, 3], 0.5, (n_samples//3, 2))
])
def evaluate_clustering(X, k_range=range(2, 8)):
"""
Evaluate cluster validity for various K
"""
results = []
for k in k_range:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
silhouette = silhouette_score(X, labels)
calinski = calinski_harabasz_score(X, labels)
davies = davies_bouldin_score(X, labels)
inertia = kmeans.inertia_
results.append({
'k': k,
'silhouette': silhouette,
'calinski_harabasz': calinski,
'davies_bouldin': davies,
'inertia': inertia
})
return pd.DataFrame(results)
# Evaluation
eval_results = evaluate_clustering(X_cluster)
print("=== Cluster Validity Metrics ===")
print(eval_results)
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Elbow plot (Inertia)
ax = axes[0, 0]
ax.plot(eval_results['k'], eval_results['inertia'], 'bo-')
ax.set_xlabel('K')
ax.set_ylabel('Inertia')
ax.set_title('Elbow Plot')
ax.grid(True, alpha=0.3)
# Silhouette score
ax = axes[0, 1]
ax.plot(eval_results['k'], eval_results['silhouette'], 'go-')
ax.set_xlabel('K')
ax.set_ylabel('Silhouette Score')
ax.set_title('Silhouette Score (higher is better)')
ax.grid(True, alpha=0.3)
# Calinski-Harabasz
ax = axes[1, 0]
ax.plot(eval_results['k'], eval_results['calinski_harabasz'], 'ro-')
ax.set_xlabel('K')
ax.set_ylabel('Calinski-Harabasz Index')
ax.set_title('Calinski-Harabasz (higher is better)')
ax.grid(True, alpha=0.3)
# Davies-Bouldin
ax = axes[1, 1]
ax.plot(eval_results['k'], eval_results['davies_bouldin'], 'mo-')
ax.set_xlabel('K')
ax.set_ylabel('Davies-Bouldin Index')
ax.set_title('Davies-Bouldin (lower is better)')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
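The silhouette score used above is the average of per-point coefficients s(i) = (b − a) / max(a, b), where a is the mean distance from point i to the other members of its own cluster and b is the mean distance to the nearest other cluster. A minimal sketch for a single point, checked against silhouette_samples (reuses X_cluster from above):
from sklearn.metrics import pairwise_distances

def silhouette_one_point(X, labels, i):
    """Silhouette coefficient s(i) for one observation (illustrative sketch of the definition)."""
    D = pairwise_distances(X)
    own = (labels == labels[i]) & (np.arange(len(X)) != i)
    a = D[i, own].mean()                                   # mean distance within own cluster (excl. self)
    b = min(D[i, labels == c].mean()                       # mean distance to the nearest other cluster
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

kmeans3 = KMeans(n_clusters=3, random_state=42, n_init=10)
labels3 = kmeans3.fit_predict(X_cluster)
print(f"Manual s(0):  {silhouette_one_point(X_cluster, labels3, 0):.4f}")
print(f"sklearn s(0): {silhouette_samples(X_cluster, labels3)[0]:.4f}")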
4.2 Silhouette Analysis¶
def silhouette_analysis(X, n_clusters):
"""
Silhouette analysis visualization
"""
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
silhouette_avg = silhouette_score(X, labels)
sample_silhouette_values = silhouette_samples(X, labels)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Silhouette plot
ax = axes[0]
y_lower = 10
for i in range(n_clusters):
ith_cluster_silhouette_values = sample_silhouette_values[labels == i]
ith_cluster_silhouette_values.sort()
size_cluster_i = ith_cluster_silhouette_values.shape[0]
y_upper = y_lower + size_cluster_i
ax.fill_betweenx(np.arange(y_lower, y_upper),
0, ith_cluster_silhouette_values,
alpha=0.7, label=f'Cluster {i}')
ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
y_lower = y_upper + 10
ax.axvline(x=silhouette_avg, color="red", linestyle="--",
label=f'Average: {silhouette_avg:.3f}')
ax.set_xlabel('Silhouette Coefficient')
ax.set_ylabel('Cluster')
ax.set_title(f'Silhouette Analysis (K={n_clusters})')
ax.legend()
# Cluster visualization
ax = axes[1]
colors = plt.cm.viridis(np.linspace(0, 1, n_clusters))
for i, c in enumerate(colors):
ax.scatter(X[labels == i, 0], X[labels == i, 1],
color=c, alpha=0.7, label=f'Cluster {i}')
ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
marker='x', s=200, linewidths=3, color='red', label='Centroids')
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.set_title(f'K-Means Clustering (K={n_clusters})')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
return silhouette_avg
# Silhouette analysis with K=3
silhouette_analysis(X_cluster, n_clusters=3)
# Compare with K=4
silhouette_analysis(X_cluster, n_clusters=4)
4.3 External Metrics (when labels are available)¶
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score, homogeneity_completeness_v_measure
# True labels
true_labels = np.repeat([0, 1, 2], n_samples//3)
# K-means clustering
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
pred_labels = kmeans.fit_predict(X_cluster)
# External metrics
ari = adjusted_rand_score(true_labels, pred_labels)
nmi = normalized_mutual_info_score(true_labels, pred_labels)
homogeneity, completeness, v_measure = homogeneity_completeness_v_measure(true_labels, pred_labels)
print("=== External Cluster Validity Metrics ===")
print(f"Adjusted Rand Index (ARI): {ari:.4f}")
print(f" - Range: [-1, 1], 1 is perfect match")
print(f"\nNormalized Mutual Information (NMI): {nmi:.4f}")
print(f" - Range: [0, 1], 1 is perfect match")
print(f"\nHomogeneity: {homogeneity:.4f}")
print(f" - Extent to which each cluster contains only members of a single class")
print(f"\nCompleteness: {completeness:.4f}")
print(f" - Extent to which all members of a class are assigned to the same cluster")
print(f"\nV-measure: {v_measure:.4f}")
print(f" - Harmonic mean of homogeneity and completeness")
5. Practice Example¶
5.1 Comprehensive Multivariate Analysis¶
def comprehensive_multivariate_analysis(X, y, feature_names, target_names):
"""
Perform comprehensive multivariate analysis
"""
print("="*60)
print("Comprehensive Multivariate Analysis")
print("="*60)
# Standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# 1. PCA
print("\n[1] Principal Component Analysis (PCA)")
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
print(f"Explained variance: {pca.explained_variance_ratio_.round(3)}")
print(f"Cumulative variance: {np.cumsum(pca.explained_variance_ratio_).round(3)}")
# 2. LDA
print("\n[2] Linear Discriminant Analysis (LDA)")
lda = LinearDiscriminantAnalysis()
X_lda = lda.fit_transform(X_scaled, y)
print(f"Number of LDA axes: {X_lda.shape[1]}")
print(f"Explained variance: {lda.explained_variance_ratio_.round(3)}")
# 3. Classification performance
print("\n[3] Classification Performance")
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.3, random_state=42
)
models = {
'LDA': LinearDiscriminantAnalysis(),
'QDA': QuadraticDiscriminantAnalysis()
}
for name, model in models.items():
model.fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"{name}: Train={train_acc:.4f}, Test={test_acc:.4f}")
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
# PCA
ax = axes[0, 0]
for target in np.unique(y):
mask = y == target
ax.scatter(X_pca[mask, 0], X_pca[mask, 1], alpha=0.7,
label=target_names[target])
ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
ax.set_title('PCA')
ax.legend()
ax.grid(True, alpha=0.3)
# LDA
ax = axes[0, 1]
for target in np.unique(y):
mask = y == target
if X_lda.shape[1] >= 2:
ax.scatter(X_lda[mask, 0], X_lda[mask, 1], alpha=0.7,
label=target_names[target])
ax.set_xlabel('LD1')
ax.set_ylabel('LD2')
else:
ax.scatter(X_lda[mask, 0], np.random.randn(mask.sum())*0.1, alpha=0.7,
label=target_names[target])
ax.set_xlabel('LD1')
ax.set_ylabel('(jitter)')
ax.set_title('LDA')
ax.legend()
ax.grid(True, alpha=0.3)
# Scree plot
ax = axes[1, 0]
ax.bar(range(1, len(pca.explained_variance_ratio_)+1),
pca.explained_variance_ratio_, alpha=0.7, label='Individual')
ax.plot(range(1, len(pca.explained_variance_ratio_)+1),
np.cumsum(pca.explained_variance_ratio_), 'ro-', label='Cumulative')
ax.set_xlabel('Principal Component')
ax.set_ylabel('Explained Variance')
ax.set_title('Scree Plot')
ax.legend()
ax.grid(True, alpha=0.3)
# Loadings
ax = axes[1, 1]
loadings_df = pd.DataFrame(
pca.components_[:2].T,
columns=['PC1', 'PC2'],
index=feature_names
)
loadings_df.plot(kind='bar', ax=ax, alpha=0.7)
ax.set_ylabel('Loading')
ax.set_title('PC1, PC2 Loadings')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
# Test with Wine data
comprehensive_multivariate_analysis(wine.data, wine.target,
wine.feature_names, wine.target_names)
6. Practice Problems¶
Problem 1: PCA¶
Apply PCA to the breast cancer dataset (load_breast_cancer):
1. Determine the number of components needed to explain 95% of the variance
2. Visualize the data using the first 2 principal components
3. Identify the top 3 features contributing to PC1
Problem 2: Factor Analysis¶
Generate a 6-variable dataset with 2 latent factors:
1. Fit a 2-factor model
2. Apply Varimax rotation
3. Interpret each factor
Problem 3: LDA vs QDA¶
Using the Wine dataset:
1. Compare LDA and QDA classification performance
2. Perform 5-fold cross-validation
3. Visualize the decision boundaries (after reducing to 2D)
Problem 4: Cluster Validation¶
Apply K-means clustering to synthetic data:
1. Determine the optimal K using the elbow method
2. Perform silhouette analysis
3. Compare cluster quality at K=2, 3, and 4
7. Key Summary¶
Method Selection Guide¶
| Purpose | Method | Characteristics |
|---|---|---|
| Dimensionality reduction (unsupervised) | PCA | Maximizes variance, fast |
| Structure discovery | Factor Analysis | Interprets latent variables |
| Dimensionality reduction (supervised) | LDA | Maximizes class separation |
| Classification (linear) | LDA | Assumes equal covariances |
| Classification (nonlinear) | QDA | Allows different covariances |
PCA Essentials¶
- Apply after standardization (uniform variable scale)
- Number of components: Kaiser rule, variance criterion, scree plot
- Loading interpretation: contribution of each variable to components
Cluster Validation¶
- Internal metrics: Silhouette, Calinski-Harabasz, Davies-Bouldin
- External metrics (when labels available): ARI, NMI
Next Chapter Preview¶
The next chapter, Nonparametric Statistics, will cover:
- When nonparametric tests are needed
- The Mann-Whitney U, Wilcoxon, and Kruskal-Wallis tests
- Spearman and Kendall rank correlation