Support Vector Machine (SVM)
Overview
Support Vector Machine (SVM) is a powerful supervised learning algorithm for classification and regression. It finds the optimal decision boundary (hyperplane) that maximally separates classes.
1. Core Concepts of SVM
1.1 Hyperplane and Margin
Hyperplane
- Decision boundary that separates different classes
- In 2D: a line; in 3D: a plane; in N-D: a hyperplane
- Equation: w·x + b = 0 (where w is the weight vector and b is the bias)
Margin
- Distance between the hyperplane and the nearest data points
- SVM finds the hyperplane with the maximum margin
- Larger margin → generally better generalization
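As a rough illustration of the hyperplane equation and the margin, the sketch below fits a linear SVC on four hand-picked points, reads w and b from `coef_` and `intercept_`, and computes the margin width as 2/‖w‖. The toy data and the very large C (used here to approximate a hard margin) are assumptions made for the example.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (assumed for illustration): two classes separated by the line x1 = 1
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM
toy_clf = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

w = toy_clf.coef_[0]        # weight vector of the hyperplane
b = toy_clf.intercept_[0]   # bias term

# For a linear kernel the decision function matches w·x + b
scores = X_toy @ w + b
print(np.allclose(scores, toy_clf.decision_function(X_toy)))

# Distance between the two margin boundaries (w·x + b = ±1)
margin_width = 2.0 / np.linalg.norm(w)
print(f"w = {w}, b = {b:.3f}, margin width = {margin_width:.3f}")
```

Because the two classes sit at x1 = 0 and x1 = 2, the widest possible margin here is 2, which is what the fitted model should approximately recover.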
1.2 Support Vectors
Definition
- Data points closest to the decision boundary
- Critical points that define the hyperplane
- Only support vectors affect the decision boundary
Characteristics
- Removing non-support vectors doesn't change the model
- SVM is robust to outliers far from the boundary
- Memory efficient (only stores support vectors)
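The claim that only support vectors matter can be checked empirically. The following sketch, on synthetic blobs assumed for illustration, fits a linear SVC on all points, refits on the support vectors alone, and compares the two models. The variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data (assumed for illustration)
X_blobs, y_blobs = make_blobs(n_samples=60, centers=[[-2, -2], [2, 2]],
                              cluster_std=1.0, random_state=0)

full_model = SVC(kernel="linear", C=1.0).fit(X_blobs, y_blobs)

# Keep only the support vectors and refit with the same settings
sv_idx = full_model.support_
sv_model = SVC(kernel="linear", C=1.0).fit(X_blobs[sv_idx], y_blobs[sv_idx])

print(f"Training points: {len(X_blobs)}, support vectors: {len(sv_idx)}")
print("Same predictions:", np.array_equal(full_model.predict(X_blobs),
                                           sv_model.predict(X_blobs)))
print("w (all data):", full_model.coef_[0])
print("w (SVs only):", sv_model.coef_[0])
```

The two weight vectors should agree up to the solver's numerical tolerance, since the discarded points have zero dual coefficients.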
1.3 Hard Margin vs Soft Margin
| Type | Description | Use Case |
|---|---|---|
| Hard Margin | Strictly separates all training data | Linearly separable data only |
| Soft Margin | Allows some misclassification | Real-world data (with noise/outliers) |
Soft Margin SVM introduces:
- Slack variables ξ (xi): Allow some points to violate the margin
- C parameter: Controls the trade-off between margin size and misclassification
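To make the slack variables concrete, the sketch below recovers them from a fitted model using the standard definition ξ_i = max(0, 1 − y_i·f(x_i)) with labels mapped to ±1. The overlapping synthetic data and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping synthetic classes (assumed for illustration)
X_ov, y_ov = make_blobs(n_samples=100, centers=[[-1, -1], [1, 1]],
                        cluster_std=1.0, random_state=0)
soft_svm = SVC(kernel="linear", C=1.0).fit(X_ov, y_ov)

# Map labels {0, 1} to {-1, +1}; decision_function is positive for class 1
y_signed = 2 * y_ov - 1
f = soft_svm.decision_function(X_ov)

# Slack is 0 for points on the correct side of their margin, positive otherwise
xi = np.maximum(0.0, 1.0 - y_signed * f)

print(f"Points violating the margin (xi > 0): {(xi > 1e-6).sum()}")
print(f"Misclassified points (xi > 1): {(xi > 1).sum()}")
print(f"Total slack: {xi.sum():.3f}")
```

A slack between 0 and 1 means the point is inside the margin but still correctly classified; a slack above 1 means it is on the wrong side of the hyperplane.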
2. Linear SVM
2.1 sklearn Implementation
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
# Generate linearly separable data
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Linear SVM
svm_clf = SVC(kernel='linear', C=1.0, random_state=42)
svm_clf.fit(X_train, y_train)
print(f"Train Accuracy: {svm_clf.score(X_train, y_train):.4f}")
print(f"Test Accuracy: {svm_clf.score(X_test, y_test):.4f}")
print(f"Number of Support Vectors: {len(svm_clf.support_vectors_)}")
2.2 Visualizing Decision Boundary
def plot_svm_decision_boundary(svm_clf, X, y):
    # Create mesh grid
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))
    # Predict on mesh grid
    Z = svm_clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot
    plt.figure(figsize=(10, 6))
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', edgecolors='k')
    # Plot support vectors
    plt.scatter(svm_clf.support_vectors_[:, 0],
                svm_clf.support_vectors_[:, 1],
                s=200, linewidth=1.5, facecolors='none', edgecolors='black',
                label='Support Vectors')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('SVM Decision Boundary')
    plt.legend()
    plt.show()

plot_svm_decision_boundary(svm_clf, X_train, y_train)
2.3 Effect of C Parameter
# Compare different C values
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
C_values = [0.1, 1.0, 10.0]
for ax, C in zip(axes, C_values):
    svm = SVC(kernel='linear', C=C, random_state=42)
    svm.fit(X_train, y_train)
    # Plot decision boundary
    x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
    y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='coolwarm', edgecolors='k')
    ax.scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1],
               s=200, facecolors='none', edgecolors='black')
    ax.set_title(f'C={C}, Support Vectors={len(svm.support_vectors_)}')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
plt.tight_layout()
plt.show()
C Parameter Interpretation:
- Small C (e.g., 0.1): Wide margin, more misclassification allowed (high bias, low variance)
- Large C (e.g., 10): Narrow margin, fewer misclassifications (low bias, high variance)
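The relationship between C and margin width can also be measured numerically rather than read off a plot. This sketch, on overlapping synthetic blobs assumed for illustration, reports the margin width 2/‖w‖ and the support vector count for three values of C.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping synthetic classes (assumed for illustration)
X_c, y_c = make_blobs(n_samples=100, centers=[[-1, -1], [1, 1]],
                      cluster_std=1.0, random_state=0)

margin_by_C = {}
for C_demo in [0.01, 1.0, 100.0]:
    model = SVC(kernel="linear", C=C_demo).fit(X_c, y_c)
    # Margin width for a linear SVM is 2 / ||w||
    margin_by_C[C_demo] = 2.0 / np.linalg.norm(model.coef_[0])
    print(f"C={C_demo:>6}: margin width={margin_by_C[C_demo]:.3f}, "
          f"support vectors={len(model.support_)}")
```

A smaller C should give a visibly wider margin and more support vectors, since more points are allowed to fall inside it.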
3. Kernel Trick
3.1 Why Kernels?
Problem: Real-world data is often not linearly separable.
Solution: Kernel trick
- Map data to a higher-dimensional space where it becomes linearly separable
- No explicit transformation needed (computed via a kernel function)
Kernel Function: K(x, x') = φ(x) · φ(x')
- Computes the inner product in the high-dimensional space
- Computationally efficient
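The identity K(x, x') = φ(x) · φ(x') can be verified by hand for a small case. The sketch below uses the degree-2 homogeneous polynomial kernel K(x, z) = (x · z)² in 2D, whose explicit feature map is φ(x) = (x₁², √2·x₁x₂, x₂²). The kernel choice and the sample vectors are assumptions picked for illustration.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the kernel K(x, z) = (x · z)^2 in 2D."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x_a = np.array([1.0, 2.0])
x_b = np.array([3.0, 4.0])

explicit = phi(x_a) @ phi(x_b)   # inner product in the 3D feature space
kernel = (x_a @ x_b) ** 2        # same quantity computed in the original 2D space

print(explicit, kernel)
```

Both values equal 121 up to floating-point rounding: the kernel gets the feature-space inner product without ever constructing φ. For the RBF kernel the corresponding feature space is infinite-dimensional, so the shortcut is the only practical option.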
3.2 Common Kernels
| Kernel | Formula | Use Case |
|---|---|---|
| Linear | K(x, x') = x · x' | Linearly separable data |
| Polynomial | K(x, x') = (γ x · x' + r)^d | Moderately non-linear data |
| RBF (Radial Basis Function) | K(x, x') = exp(-γ‖x - x'‖²) | Most common, highly non-linear data |
| Sigmoid | K(x, x') = tanh(γ x · x' + r) | Neural network-like behavior |
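The formulas in the table can be cross-checked against the kernel functions in `sklearn.metrics.pairwise`. In this sketch the random inputs and the values chosen for γ, r and d are arbitrary assumptions; r corresponds to the `coef0` argument and d to `degree`.

```python
import numpy as np
from sklearn.metrics.pairwise import (linear_kernel, polynomial_kernel,
                                      rbf_kernel, sigmoid_kernel)

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))   # 5 samples, 3 features
B = rng.normal(size=(4, 3))   # 4 samples, 3 features
gamma_demo, r, d = 0.5, 1.0, 3

dots = A @ B.T                                                   # pairwise x · x'
sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)    # pairwise ||x - x'||^2

manual = {
    "linear": dots,
    "poly": (gamma_demo * dots + r) ** d,
    "rbf": np.exp(-gamma_demo * sq_dists),
    "sigmoid": np.tanh(gamma_demo * dots + r),
}
library = {
    "linear": linear_kernel(A, B),
    "poly": polynomial_kernel(A, B, degree=d, gamma=gamma_demo, coef0=r),
    "rbf": rbf_kernel(A, B, gamma=gamma_demo),
    "sigmoid": sigmoid_kernel(A, B, gamma=gamma_demo, coef0=r),
}
for name in manual:
    print(name, np.allclose(manual[name], library[name]))
```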
3.3 RBF Kernel Example
from sklearn.datasets import make_circles
# Generate non-linearly separable data (two concentric circles)
X, y = make_circles(n_samples=200, factor=0.5, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# RBF SVM
rbf_svm = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
rbf_svm.fit(X_train, y_train)
print(f"Train Accuracy: {rbf_svm.score(X_train, y_train):.4f}")
print(f"Test Accuracy: {rbf_svm.score(X_test, y_test):.4f}")
# Visualize
plot_svm_decision_boundary(rbf_svm, X_train, y_train)
3.4 Gamma Parameter (RBF Kernel)
Gamma (γ): Controls the influence of a single training example
- Small gamma: Far reach → smooth decision boundary (high bias)
- Large gamma: Close reach → complex decision boundary (high variance)
# Compare different gamma values
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
gamma_values = [0.1, 1.0, 10.0]
for ax, gamma in zip(axes, gamma_values):
    svm = SVC(kernel='rbf', C=1.0, gamma=gamma, random_state=42)
    svm.fit(X_train, y_train)
    # Plot decision boundary
    x_min, x_max = X_train[:, 0].min() - 0.5, X_train[:, 0].max() + 0.5
    y_min, y_max = X_train[:, 1].min() - 0.5, X_train[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='coolwarm', edgecolors='k')
    ax.set_title(f'gamma={gamma}, Accuracy={svm.score(X_test, y_test):.3f}')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
plt.tight_layout()
plt.show()
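Several examples in this document pass gamma='scale' instead of a number. In scikit-learn, 'scale' resolves to 1 / (n_features × X.var()) and 'auto' to 1 / n_features, both computed from the training data passed to fit. The sketch below reproduces 'scale' by hand on a fresh circles dataset and checks that it yields the same model. The variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X_g, y_g = make_circles(n_samples=200, factor=0.5, noise=0.1, random_state=42)

# Values that the string options resolve to for this training set
gamma_scale = 1.0 / (X_g.shape[1] * X_g.var())
gamma_auto = 1.0 / X_g.shape[1]

svm_scale = SVC(kernel="rbf", gamma="scale").fit(X_g, y_g)
svm_manual = SVC(kernel="rbf", gamma=gamma_scale).fit(X_g, y_g)

print(f"gamma='scale' -> {gamma_scale:.4f}, gamma='auto' -> {gamma_auto:.4f}")
print("Same decision function:",
      np.allclose(svm_scale.decision_function(X_g),
                  svm_manual.decision_function(X_g)))
```

Because 'scale' depends on the variance of the inputs, its effective value changes when features are rescaled, which is one reason to keep scaling and the SVM in the same pipeline.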
3.5 Polynomial Kernel Example
# Polynomial kernel
poly_svm = SVC(kernel='poly', degree=3, C=1.0, gamma='scale', random_state=42)
poly_svm.fit(X_train, y_train)
print(f"Polynomial SVM Accuracy: {poly_svm.score(X_test, y_test):.4f}")
4. Multiclass Classification with SVM
SVM is originally a binary classifier. For multiclass problems:
4.1 Strategies
| Strategy | Description |
|---|---|
| One-vs-Rest (OvR) | Train N classifiers (one per class vs all others) |
| One-vs-One (OvO) | Train N(N-1)/2 classifiers (one for each pair of classes) |
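sklearn's SVC always trains One-vs-One classifiers internally for multiclass problems. Its default decision_function_shape='ovr' only changes the shape of the decision function output, not the training strategy. LinearSVC, by contrast, uses One-vs-Rest.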
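The number of pairwise classifiers can be observed through the shape of the decision function. The sketch below uses a synthetic 4-class dataset, assumed for illustration, so that N(N-1)/2 = 6 differs from N = 4.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Four synthetic classes (assumed for illustration)
X_mc, y_mc = make_blobs(n_samples=200, centers=4, random_state=0)

ovo_svm = SVC(kernel="linear", decision_function_shape="ovo").fit(X_mc, y_mc)
ovr_svm = SVC(kernel="linear", decision_function_shape="ovr").fit(X_mc, y_mc)

ovo_shape = ovo_svm.decision_function(X_mc[:5]).shape  # one column per class pair
ovr_shape = ovr_svm.decision_function(X_mc[:5]).shape  # one column per class
print(ovo_shape, ovr_shape)

# The option only reshapes the scores; the predicted labels are the same
same_labels = np.array_equal(ovo_svm.predict(X_mc), ovr_svm.predict(X_mc))
print("Same predictions:", same_labels)
```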
4.2 Example
from sklearn.datasets import load_iris
# Load iris dataset (3 classes)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
# Multiclass SVM
multi_svm = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
multi_svm.fit(X_train, y_train)
print(f"Train Accuracy: {multi_svm.score(X_train, y_train):.4f}")
print(f"Test Accuracy: {multi_svm.score(X_test, y_test):.4f}")
# Confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
y_pred = multi_svm.predict(X_test)
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
5. Feature Scaling and Preprocessing
5.1 Why Scaling is Critical for SVM
SVM is distance-based and sensitive to feature scales.
- Features with large scales dominate the decision boundary
- Always standardize or normalize features before SVM
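To see why a large-scale feature dominates, it helps to look at how much each feature contributes to squared Euclidean distances, which is what the RBF kernel operates on. The sketch below uses synthetic data with one feature on a scale of thousands and one on a unit scale. The data and the helper `distance_share` are hypothetical, written only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_raw = np.column_stack([
    rng.normal(0, 1000, size=200),  # a feature measured in large units
    rng.normal(0, 1, size=200),     # a feature on a unit scale
])

def distance_share(data):
    """Fraction of the total squared pairwise distance contributed by each feature."""
    diffs = data[:, None, :] - data[None, :, :]
    per_feature = (diffs ** 2).sum(axis=(0, 1))
    return per_feature / per_feature.sum()

share_raw = distance_share(X_raw)
share_scaled = distance_share(StandardScaler().fit_transform(X_raw))

print("Share per feature before scaling:", share_raw.round(4))
print("Share per feature after scaling: ", share_scaled.round(4))
```

Before scaling, nearly all of the distance comes from the first feature, so the second is effectively invisible to the kernel. After standardization both features contribute equally.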
5.2 Standardization Example
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Create pipeline with scaling
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42))
])
# Without scaling
svm_no_scale = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
svm_no_scale.fit(X_train, y_train)
# With scaling
svm_pipeline.fit(X_train, y_train)
print("Without Scaling:")
print(f" Train: {svm_no_scale.score(X_train, y_train):.4f}")
print(f" Test: {svm_no_scale.score(X_test, y_test):.4f}")
print("\nWith Scaling:")
print(f" Train: {svm_pipeline.score(X_train, y_train):.4f}")
print(f" Test: {svm_pipeline.score(X_test, y_test):.4f}")
6. Hyperparameter Tuning
6.1 Grid Search for RBF SVM
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1, 'scale', 'auto'],
    'kernel': ['rbf']
}
# Grid search
grid_search = GridSearchCV(
    SVC(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")
print(f"Test score: {grid_search.score(X_test, y_test):.4f}")
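The grid search above runs on unscaled features. One way to combine it with the scaling advice from Section 5 is to search over a Pipeline, so the scaler is refit inside each cross-validation fold. The following is a sketch under that assumption; the names `pipe`, `pipe_grid` and `pipe_search` are hypothetical, and the data is reloaded so the block runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris_data = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_data.data, iris_data.target, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf")),
])

# The "svm__" prefix routes each parameter to the pipeline step named "svm"
pipe_grid = {
    "svm__C": [0.1, 1, 10, 100],
    "svm__gamma": [0.001, 0.01, 0.1, 1, "scale"],
}

pipe_search = GridSearchCV(pipe, pipe_grid, cv=5, scoring="accuracy")
pipe_search.fit(X_tr, y_tr)

print(f"Best parameters: {pipe_search.best_params_}")
print(f"Best cross-validation score: {pipe_search.best_score_:.4f}")
print(f"Test score: {pipe_search.score(X_te, y_te):.4f}")
```

Fitting the scaler inside the pipeline keeps validation-fold statistics out of the training step, which scaling the whole training set up front would not.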
6.2 Kernel Selection
import pandas as pd
# Compare different kernels
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
results = {}
for kernel in kernels:
    svm = SVC(kernel=kernel, C=1.0, gamma='scale', random_state=42)
    svm.fit(X_train, y_train)
    results[kernel] = {
        'train': svm.score(X_train, y_train),
        'test': svm.score(X_test, y_test)
    }
df_results = pd.DataFrame(results).T
print(df_results)
7. SVM for Regression (SVR)
7.1 SVR Concept
Support Vector Regression (SVR): Finds a function that deviates from the actual targets by at most ε (epsilon)
- Instead of maximizing the margin between classes, it fits a tube of half-width ε around the regression function while keeping the function as flat as possible
- Points within the ε-tube don't contribute to the loss
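The ε-tube corresponds to the ε-insensitive loss, max(0, |y − f(x)| − ε): residuals smaller than ε cost nothing, and larger ones are penalized linearly beyond the tube. A minimal NumPy sketch follows; the function name and sample values are assumptions for illustration.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Loss is zero inside the epsilon-tube and grows linearly outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.05, 2.5, 3.0, 3.0])

losses = epsilon_insensitive_loss(y_true_demo, y_pred_demo, epsilon=0.1)
print(losses)
```

With ε = 0.1, the first and third predictions fall inside the tube and incur zero loss, while the residuals of 0.5 and 1.0 are charged 0.4 and 0.9 respectively.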
7.2 SVR Example
from sklearn.svm import SVR
from sklearn.datasets import make_regression
# Generate regression data
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train SVR
svr = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1)
svr.fit(X_train, y_train)
# Predict
y_pred = svr.predict(X_test)
# Evaluate
from sklearn.metrics import mean_squared_error, r2_score
print(f"MSE: {mean_squared_error(y_test, y_pred):.4f}")
print(f"R²: {r2_score(y_test, y_pred):.4f}")
# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, label='Train', alpha=0.6)
plt.scatter(X_test, y_test, label='Test', alpha=0.6)
X_plot = np.linspace(X.min(), X.max(), 300).reshape(-1, 1)
y_plot = svr.predict(X_plot)
plt.plot(X_plot, y_plot, 'r-', label='SVR prediction', linewidth=2)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Support Vector Regression')
plt.legend()
plt.show()
8. Advantages and Disadvantages of SVM
8.1 Advantages
- Effective in High Dimensions: Works well with many features
- Memory Efficient: Only stores support vectors (not all training data)
- Versatile: Different kernels for different data patterns
- Robust to Overfitting: Especially in high-dimensional space (with proper C and gamma)
- Works with Small Datasets: Effective even with limited training samples
8.2 Disadvantages
- Slow Training: Training time scales between O(N²) and O(N³) in the number of samples N, which becomes costly for large datasets (N > 10,000)
- Sensitive to Feature Scaling: Requires normalization/standardization
- No Probability Estimates by Default: Obtaining them requires additional computation (probability=True)
- Difficult Hyperparameter Tuning: C, gamma, and kernel selection require experimentation
- Black Box with Non-linear Kernels: Hard to interpret the decision process
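As a brief illustration of the probability point, the sketch below fits SVC on the iris data with and without probability=True. Without the flag, `predict_proba` is not exposed; with it, scikit-learn performs an extra internal calibration step during fit, which is the additional computation mentioned above. The variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X_p, y_p = load_iris(return_X_y=True)

plain_svc = SVC(kernel="rbf").fit(X_p, y_p)
proba_svc = SVC(kernel="rbf", probability=True, random_state=42).fit(X_p, y_p)

# Without probability=True the estimator does not expose predict_proba
print("predict_proba available:", hasattr(plain_svc, "predict_proba"))

# With probability=True each row is a distribution over the three classes
proba = proba_svc.predict_proba(X_p[:3])
print(proba.round(3))
```

When only a ranking or a confidence score is needed, `decision_function` is available on both models at no extra training cost.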
9. Practical Tips
9.1 When to Use SVM
| Scenario | Recommendation |
|---|---|
| Small to medium dataset (<10K samples) | SVM is a good choice |
| High-dimensional data (text, genomics) | SVM works well |
| Clear margin of separation | Linear SVM is efficient |
| Non-linear relationships | RBF or polynomial kernel |
| Need probability estimates | Use probability=True or consider Logistic Regression |
| Very large dataset (>100K samples) | Consider faster alternatives (Logistic Regression, SGDClassifier) |
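For the large-dataset row, one possible sketch of a faster alternative is SGDClassifier with loss="hinge", which optimizes a linear SVM objective by stochastic gradient descent. The synthetic dataset, its size, and the hyperparameter values below are assumptions for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed for illustration)
X_big, y_big = make_classification(n_samples=20000, n_features=20,
                                   n_informative=10, n_clusters_per_class=1,
                                   class_sep=2.0, random_state=42)
X_big_tr, X_big_te, y_big_tr, y_big_te = train_test_split(
    X_big, y_big, test_size=0.2, random_state=42
)

# loss="hinge" gives a linear SVM objective, trained with stochastic gradient descent
sgd_svm = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000, tol=1e-3, random_state=42),
)
sgd_svm.fit(X_big_tr, y_big_tr)

sgd_accuracy = sgd_svm.score(X_big_te, y_big_te)
print(f"Test accuracy: {sgd_accuracy:.4f}")
```

This approach only covers linear decision boundaries, so it trades the flexibility of kernels for training time that grows roughly linearly with the number of samples.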
9.2 Hyperparameter Tuning Guidelines
1. Always scale features (StandardScaler)
2. Start with RBF kernel
3. Use GridSearchCV with cross-validation
4. Tune C and gamma together:
   - Start: C=1, gamma='scale'
   - Try: C=[0.1, 1, 10, 100], gamma=[0.001, 0.01, 0.1, 1]
5. If RBF doesn't work well, try:
   - Linear kernel (for linearly separable data)
   - Polynomial kernel (for moderate non-linearity)
9.3 Common Mistakes
- Not scaling features: Always use StandardScaler or MinMaxScaler
- Using default C=1 without tuning: C depends on your data scale
- Ignoring computational cost: SVM is slow on large datasets
- Forgetting to set probability=True: If you need class probabilities
10. Exercises
Exercise 1: Linear vs RBF Kernel
Load the wine dataset and compare linear and RBF SVM. Which performs better?
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
# Your code here
Exercise 2: Hyperparameter Tuning
Use GridSearchCV to find optimal C and gamma for the breast cancer dataset.
from sklearn.datasets import load_breast_cancer
# Your code here
Exercise 3: SVM Regression
Create a polynomial regression problem and compare SVR with different kernels.
# Your code here
Exercise 4: Effect of Scaling
Train SVM on the iris dataset with and without scaling. Compare the results.
# Your code here
Summary
| Topic | Key Points |
|---|---|
| Core Concept | Maximum margin classifier, support vectors define boundary |
| Linear SVM | C parameter controls margin vs misclassification trade-off |
| Kernel Trick | Maps to high-dimensional space without explicit transformation |
| RBF Kernel | Most common, gamma controls complexity |
| Multiclass | One-vs-One (OvO) strategy by default in sklearn |
| Scaling | CRITICAL - always scale features before SVM |
| Hyperparameters | C (regularization), gamma (RBF influence), kernel type |
| Use Cases | Small to medium datasets, high-dimensional data, clear margins |
| Limitations | Slow on large datasets, requires scaling, black box with kernels |
Key Takeaway: SVM is powerful for small to medium-sized datasets with complex decision boundaries. Always scale features and tune C and gamma for best results.