Support Vector Machine (SVM)

Overview

Support Vector Machine (SVM) is a supervised learning algorithm for classification and regression. For classification, it finds the decision boundary (hyperplane) that separates the classes with the largest possible margin.

1. Core Concepts of SVM

1.1 Hyperplane and Margin

Hyperplane
- Decision boundary that separates different classes
- In 2D: a line; in 3D: a plane; in N-D: a hyperplane
- Equation: w·x + b = 0 (where w is the weight vector and b is the bias)

Margin
- Distance between the hyperplane and the nearest data points
- SVM finds the hyperplane with the maximum margin
- A larger margin tends to give better generalization
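As a small worked example of the hyperplane equation and the margin, the sketch below uses a made-up weight vector, bias, and point (the values of `w`, `b`, and `x` are illustrative assumptions, not taken from any dataset). It computes the signed distance of a point to the hyperplane and the margin width 2/‖w‖, which applies when the hyperplane is scaled so that the closest points satisfy |w·x + b| = 1.

```python
import math

# Hypothetical 2-D hyperplane w·x + b = 0 (values chosen for easy arithmetic)
w = (3.0, 4.0)
b = -5.0
x = (3.0, 4.0)  # hypothetical query point


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


w_norm = math.sqrt(dot(w, w))              # ||w||
decision_value = dot(w, x) + b             # w·x + b; the sign gives the predicted side
signed_distance = decision_value / w_norm  # geometric distance to the hyperplane
margin_width = 2.0 / w_norm                # distance between the two margin boundaries

print(f"w·x + b = {decision_value}")
print(f"signed distance = {signed_distance}")
print(f"margin width = {margin_width}")
```

A positive decision value places the point on the positive side of the hyperplane. Shrinking ‖w‖ widens the margin, which is why maximizing the margin is equivalent to minimizing ‖w‖.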

1.2 Support Vectors

Definition
- Data points closest to the decision boundary (with a soft margin, also points inside the margin or on the wrong side)
- Critical points that define the hyperplane
- Only support vectors affect the decision boundary

Characteristics
- Removing non-support vectors doesn't change the model
- SVM is robust to outliers that lie far from the boundary on the correct side
- Memory efficient (only stores support vectors)
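The claim that non-support vectors do not affect the model can be checked directly. The following is a minimal sketch on synthetic, well-separated data (the dataset, the tight `tol` setting, and the variable names are assumptions made for illustration): it fits a linear SVM, refits on the support vectors alone, and compares the two models.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters (synthetic data, assumed for illustration)
X, y = make_blobs(n_samples=60, centers=[[-3, -3], [3, 3]],
                  cluster_std=0.8, random_state=0)

# Tight solver tolerance so both fits converge to nearly the same optimum
full = SVC(kernel='linear', C=1.0, tol=1e-6).fit(X, y)

# Keep only the support vectors and refit on that subset
sv_idx = full.support_
reduced = SVC(kernel='linear', C=1.0, tol=1e-6).fit(X[sv_idx], y[sv_idx])

same_predictions = np.array_equal(full.predict(X), reduced.predict(X))

print(f"Support vectors: {len(sv_idx)} of {len(X)} points")
print(f"w (all data):             {full.coef_[0]}")
print(f"w (support vectors only): {reduced.coef_[0]}")
print(f"Identical predictions on all points: {same_predictions}")
```

Both fits should give essentially the same weight vector, up to solver tolerance, because the discarded points had no active constraint in the optimization.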

1.3 Hard Margin vs Soft Margin

Type	Description	Use Case
Hard Margin	Strictly separates all training data	Linearly separable data only
Soft Margin	Allows some misclassification	Real-world data (with noise/outliers)

Soft Margin SVM introduces:
- Slack variables ξ (xi): Allow some points to violate the margin
- C parameter: Controls the trade-off between margin size and misclassification
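For reference, the soft-margin trade-off is usually written as the optimization problem below. This is the standard textbook formulation rather than anything specific to this tutorial; the notation (n training points, labels y_i ∈ {−1, +1}) is assumed for the sketch.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\quad i = 1,\dots,n
```

A point with ξ_i = 0 lies on or outside the margin, 0 < ξ_i ≤ 1 means it is inside the margin but still on the correct side, and ξ_i > 1 means it is misclassified. As C grows, violations become more expensive and the solution approaches the hard-margin case.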

2. Linear SVM

2.1 sklearn Implementation

from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np

# Generate a 2-D toy dataset (roughly, not strictly, linearly separable)
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear SVM
svm_clf = SVC(kernel='linear', C=1.0, random_state=42)
svm_clf.fit(X_train, y_train)

print(f"Train Accuracy: {svm_clf.score(X_train, y_train):.4f}")
print(f"Test Accuracy: {svm_clf.score(X_test, y_test):.4f}")
print(f"Number of Support Vectors: {len(svm_clf.support_vectors_)}")
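With a linear kernel, the fitted model exposes the hyperplane parameters w and b through `coef_` and `intercept_`. The sketch below uses its own synthetic dataset (an assumption, so that it runs on its own) to show that the decision function is just w·x + b, and to read off the margin width 2/‖w‖.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data, assumed for illustration
X, y = make_blobs(n_samples=80, centers=2, cluster_std=1.0, random_state=1)

clf = SVC(kernel='linear', C=1.0).fit(X, y)

w = clf.coef_[0]        # weight vector, shape (n_features,); linear kernel only
b = clf.intercept_[0]   # bias term

# Recompute the decision values by hand and compare with sklearn's output
manual_scores = X @ w + b
matches = np.allclose(manual_scores, clf.decision_function(X))

margin_width = 2.0 / np.linalg.norm(w)

print(f"w = {w}, b = {b:.4f}")
print(f"Manual scores match decision_function: {matches}")
print(f"Margin width: {margin_width:.4f}")
```

`coef_` is only defined for `kernel='linear'`; with other kernels the hyperplane lives in the implicit feature space and has no explicit weight vector.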

2.2 Visualizing Decision Boundary

def plot_svm_decision_boundary(svm_clf, X, y):
    # Create mesh grid
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))

    # Predict on mesh grid
    Z = svm_clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot
    plt.figure(figsize=(10, 6))
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', edgecolors='k')

    # Plot support vectors
    plt.scatter(svm_clf.support_vectors_[:, 0],
                svm_clf.support_vectors_[:, 1],
                s=200, linewidth=1.5, facecolors='none', edgecolors='black',
                label='Support Vectors')

    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('SVM Decision Boundary')
    plt.legend()
    plt.show()

plot_svm_decision_boundary(svm_clf, X_train, y_train)

2.3 Effect of C Parameter

# Compare different C values
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
C_values = [0.1, 1.0, 10.0]

for ax, C in zip(axes, C_values):
    svm = SVC(kernel='linear', C=C, random_state=42)
    svm.fit(X_train, y_train)

    # Plot decision boundary
    x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
    y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    ax.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='coolwarm', edgecolors='k')
    ax.scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1],
               s=200, facecolors='none', edgecolors='black')
    ax.set_title(f'C={C}, Support Vectors={len(svm.support_vectors_)}')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

C Parameter Interpretation:
- Small C (e.g., 0.1): Wide margin, more misclassification allowed (high bias, low variance)
- Large C (e.g., 10): Narrow margin, fewer misclassifications (low bias, high variance)

3. Kernel Trick

3.1 Why Kernels?

Problem: Real-world data is often not linearly separable

Solution: Kernel trick
- Map data to a higher-dimensional space where it may become linearly separable
- No explicit transformation needed (computed via kernel function)

Kernel Function: K(x, x') = φ(x) · φ(x')
- Computes the inner product in the high-dimensional space
- Computationally efficient: the cost depends on the original input dimension, not on the dimension of φ(x)
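To make the identity K(x, x') = φ(x) · φ(x') concrete, here is a small sketch for one specific case: the homogeneous degree-2 polynomial kernel on 2-D inputs. The function names and sample points are hypothetical, and the feature map shown is valid for this kernel and input dimension only.

```python
import math


def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: K(x, z) = (x · z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def phi(x):
    """Explicit feature map matching poly2_kernel, for 2-D inputs only."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


x = (1.0, 2.0)
z = (3.0, 4.0)

# Kernel evaluated directly in the original 2-D space
kernel_value = poly2_kernel(x, z)

# Same quantity via the explicit 3-D feature map and an ordinary inner product
explicit_value = sum(a * b for a, b in zip(phi(x), phi(z)))

print(f"K(x, z)       = {kernel_value}")
print(f"phi(x)·phi(z) = {explicit_value}")
```

Both routes give the same number, but the kernel never builds the 3-D vectors. For kernels such as RBF the implicit feature space is infinite-dimensional, so the kernel route is the only practical one.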

3.2 Common Kernels

Kernel	Formula	Use Case
Linear	K(x, x') = x · x'	Linearly separable data
Polynomial	K(x, x') = (γx · x' + r)^d	Moderately non-linear data
RBF (Radial Basis Function)	K(x, x') = exp(-γ||x - x'||²)	Most common, highly non-linear data
Sigmoid	K(x, x') = tanh(γx · x' + r)	Neural network-like behavior

3.3 RBF Kernel Example

from sklearn.datasets import make_circles

# Generate non-linearly separable data (two concentric circles)
X, y = make_circles(n_samples=200, factor=0.5, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# RBF SVM
rbf_svm = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
rbf_svm.fit(X_train, y_train)

print(f"Train Accuracy: {rbf_svm.score(X_train, y_train):.4f}")
print(f"Test Accuracy: {rbf_svm.score(X_test, y_test):.4f}")

# Visualize
plot_svm_decision_boundary(rbf_svm, X_train, y_train)

3.4 Gamma Parameter (RBF Kernel)

Gamma (γ): Controls the influence of a single training example
- Small gamma: Far reach → smooth decision boundary (high bias)
- Large gamma: Close reach → complex decision boundary (high variance)
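The "reach" of a training point can be read directly from the RBF formula. The sketch below (the helper name and the chosen distances are assumptions for illustration) evaluates exp(-γ·d²) at a few distances d for the same three gamma values used in this section.

```python
import math


def rbf_kernel_value(distance, gamma):
    """RBF kernel as a function of the Euclidean distance between two points."""
    return math.exp(-gamma * distance ** 2)


# Similarity of two points at distance 0.5, 1.0 and 2.0 for each gamma
for gamma in (0.1, 1.0, 10.0):
    values = [rbf_kernel_value(d, gamma) for d in (0.5, 1.0, 2.0)]
    formatted = ", ".join(f"{v:.4f}" for v in values)
    print(f"gamma={gamma:>4}: {formatted}")
```

With a small gamma the similarity decays slowly, so distant points still influence each other. With a large gamma it drops to nearly zero within a short distance, so each support vector only shapes the boundary in its immediate neighborhood. In scikit-learn, `gamma='scale'` sets γ = 1 / (n_features · X.var()) and `gamma='auto'` sets γ = 1 / n_features.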

# Compare different gamma values
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
gamma_values = [0.1, 1.0, 10.0]

for ax, gamma in zip(axes, gamma_values):
    svm = SVC(kernel='rbf', C=1.0, gamma=gamma, random_state=42)
    svm.fit(X_train, y_train)

    # Plot decision boundary
    x_min, x_max = X_train[:, 0].min() - 0.5, X_train[:, 0].max() + 0.5
    y_min, y_max = X_train[:, 1].min() - 0.5, X_train[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    ax.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='coolwarm', edgecolors='k')
    ax.set_title(f'gamma={gamma}, Accuracy={svm.score(X_test, y_test):.3f}')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

3.5 Polynomial Kernel Example

# Polynomial kernel
poly_svm = SVC(kernel='poly', degree=3, C=1.0, gamma='scale', random_state=42)
poly_svm.fit(X_train, y_train)

print(f"Polynomial SVM Accuracy: {poly_svm.score(X_test, y_test):.4f}")

4. Multiclass Classification with SVM

SVM is inherently a binary classifier. Multiclass problems are handled by combining several binary classifiers.

4.1 Strategies

Strategy	Description
One-vs-Rest (OvR)	Train N classifiers (one per class vs all others)
One-vs-One (OvO)	Train N(N-1)/2 classifiers (one for each pair of classes)

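sklearn's SVC always trains One-vs-One classifiers internally. Its default decision_function_shape='ovr' only reshapes the decision function output to one column per class; it does not change how the model is trained.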
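A minimal sketch of how the two conventions show up in scikit-learn, using a synthetic 4-class dataset (the data and variable names are assumptions for illustration). With N = 4 classes, One-vs-One yields 4·3/2 = 6 pairwise scores per sample, while the One-vs-Rest-shaped output has 4 columns.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic 4-class problem (assumed data, for illustration only)
X, y = make_blobs(n_samples=200, centers=4, random_state=0)

ovo = SVC(kernel='linear', decision_function_shape='ovo').fit(X, y)
ovr = SVC(kernel='linear', decision_function_shape='ovr').fit(X, y)

ovo_shape = ovo.decision_function(X).shape  # one column per pair of classes
ovr_shape = ovr.decision_function(X).shape  # one column per class

# The setting changes the output shape, not the predicted labels
same_labels = bool((ovo.predict(X) == ovr.predict(X)).all())

print(f"OvO decision_function shape: {ovo_shape}")
print(f"OvR decision_function shape: {ovr_shape}")
print(f"Same predicted labels: {same_labels}")
```

For a true One-vs-Rest model trained with N classifiers, scikit-learn offers `sklearn.multiclass.OneVsRestClassifier` as a wrapper around a binary estimator.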

4.2 Example

from sklearn.datasets import load_iris

# Load iris dataset (3 classes)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Multiclass SVM
multi_svm = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
multi_svm.fit(X_train, y_train)

print(f"Train Accuracy: {multi_svm.score(X_train, y_train):.4f}")
print(f"Test Accuracy: {multi_svm.score(X_test, y_test):.4f}")

# Confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
y_pred = multi_svm.predict(X_test)
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

5. Feature Scaling and Preprocessing

5.1 Why Scaling is Critical for SVM

SVM relies on distances and inner products between samples, so it is sensitive to feature scales.
- Features with large scales dominate the decision boundary
- Always standardize or normalize features before SVM
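To see how a large-scale feature dominates, the sketch below uses a tiny made-up dataset (the values are hypothetical) and measures how much each feature contributes to the squared Euclidean distance between two samples, before and after standardization. The manual standardization uses the population standard deviation, which is what `StandardScaler` uses by default.

```python
import numpy as np

# Two hypothetical features on very different scales:
# column 0 lies in the range 0-1, column 1 is in the thousands
X = np.array([[0.1, 1000.0],
              [0.9, 1100.0],
              [0.5, 3000.0],
              [0.2, 2500.0]])


def squared_distance_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = (a - b) ** 2
    return sq / sq.sum()


raw_share = squared_distance_share(X[0], X[1])

# Standardize each column to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_share = squared_distance_share(X_std[0], X_std[1])

print(f"Raw data:     feature shares = {raw_share}")
print(f"Standardized: feature shares = {std_share}")
```

On the raw data the distance between the first two samples is determined almost entirely by the second feature, even though the samples differ far more in the first feature relative to its range. After standardization the first feature accounts for most of the distance.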

5.2 Standardization Example

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Create pipeline with scaling
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42))
])

# Without scaling
svm_no_scale = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
svm_no_scale.fit(X_train, y_train)

# With scaling
svm_pipeline.fit(X_train, y_train)

print("Without Scaling:")
print(f"  Train: {svm_no_scale.score(X_train, y_train):.4f}")
print(f"  Test:  {svm_no_scale.score(X_test, y_test):.4f}")

print("\nWith Scaling:")
print(f"  Train: {svm_pipeline.score(X_train, y_train):.4f}")
print(f"  Test:  {svm_pipeline.score(X_test, y_test):.4f}")

6. Hyperparameter Tuning

6.1 Grid Search for RBF SVM

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1, 'scale', 'auto'],
    'kernel': ['rbf']
}

# Grid search
grid_search = GridSearchCV(
    SVC(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")
print(f"Test score: {grid_search.score(X_test, y_test):.4f}")
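Grid search can also be combined with the scaling pipeline from Section 5. The following is a sketch under the assumption that the iris data is used again; the step names `scaler` and `svm` and the reduced grid are illustrative choices. Because the scaler sits inside the pipeline, it is re-fit on the training part of each cross-validation fold, so no statistics leak from the validation fold.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf')),
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'
param_grid = {
    'svm__C': [0.1, 1, 10],
    'svm__gamma': [0.01, 0.1, 1],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

test_score = search.score(X_test, y_test)
print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validation score: {search.best_score_:.4f}")
print(f"Test score: {test_score:.4f}")
```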

6.2 Kernel Selection

import pandas as pd

# Compare different kernels (same C and gamma setting for each)
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
results = {}

for kernel in kernels:
    svm = SVC(kernel=kernel, C=1.0, gamma='scale', random_state=42)
    svm.fit(X_train, y_train)
    results[kernel] = {
        'train': svm.score(X_train, y_train),
        'test': svm.score(X_test, y_test)
    }

# One row per kernel, with train and test accuracy as columns
df_results = pd.DataFrame(results).T
print(df_results)

7. SVM for Regression (SVR)

7.1 SVR Concept

Support Vector Regression (SVR): Fits a function that is as flat as possible while keeping most training targets within ε (epsilon) of its predictions
- Instead of maximizing the margin between classes, it fits an ε-wide tube around the regression function
- Points within the ε-tube don't contribute to the loss
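The ε-tube idea corresponds to the ε-insensitive loss, max(0, |y − f(x)| − ε). A minimal sketch with made-up target and prediction values (the function name and numbers are assumptions for illustration):

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the epsilon-tube, growing linearly with the residual outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Residual of about 0.05 is smaller than epsilon: no penalty
inside = epsilon_insensitive_loss(2.0, 2.05, epsilon=0.1)

# Residual of 0.5 exceeds epsilon: only the excess over epsilon is penalized
outside = epsilon_insensitive_loss(2.0, 2.5, epsilon=0.1)

print(f"Loss inside the tube:  {inside}")
print(f"Loss outside the tube: {outside:.2f}")
```

Only points on or outside the tube become support vectors, so a larger ε generally gives a sparser model at the cost of a coarser fit.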

7.2 SVR Example

from sklearn.svm import SVR
from sklearn.datasets import make_regression

# Generate regression data
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train SVR
svr = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1)
svr.fit(X_train, y_train)

# Predict
y_pred = svr.predict(X_test)

# Evaluate
from sklearn.metrics import mean_squared_error, r2_score
print(f"MSE: {mean_squared_error(y_test, y_pred):.4f}")
print(f"R²: {r2_score(y_test, y_pred):.4f}")

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, label='Train', alpha=0.6)
plt.scatter(X_test, y_test, label='Test', alpha=0.6)
X_plot = np.linspace(X.min(), X.max(), 300).reshape(-1, 1)
y_plot = svr.predict(X_plot)
plt.plot(X_plot, y_plot, 'r-', label='SVR prediction', linewidth=2)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Support Vector Regression')
plt.legend()
plt.show()

8. Advantages and Disadvantages of SVM

8.1 Advantages

- Effective in High Dimensions: Works well with many features
- Memory Efficient: The decision function uses only the support vectors (not all training data)
- Versatile: Different kernels for different data patterns
- Robust to Overfitting: Especially in high-dimensional space (with proper C and gamma)
- Works with Small Datasets: Effective even with limited training samples

8.2 Disadvantages

- Slow Training: Kernel SVM training time grows between O(N²) and O(N³) in the number of samples N, which becomes impractical for large datasets (roughly N > 10,000)
- Sensitive to Feature Scaling: Requires normalization/standardization
- No Native Probability Estimates: Class probabilities require an extra calibration step (probability=True in sklearn), which slows training
- Difficult Hyperparameter Tuning: C, gamma, and kernel selection require experimentation
- Black Box with Non-linear Kernels: Hard to interpret the decision process
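As an illustration of the probability point above, the sketch below contrasts `decision_function`, which is always available, with `predict_proba`, which requires `probability=True` at construction time. It is fitted on the full iris data purely to keep the example short; the variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Raw decision scores need no extra training work
plain = SVC(kernel='rbf', gamma='scale').fit(X, y)
scores = plain.decision_function(X[:3])

# probability=True enables predict_proba at the cost of slower training
calibrated = SVC(kernel='rbf', gamma='scale', probability=True,
                 random_state=42).fit(X, y)
proba = calibrated.predict_proba(X[:3])

print("Decision scores:\n", np.round(scores, 3))
print("Class probabilities:\n", np.round(proba, 3))
```

The probabilities come from a calibration model fitted with internal cross-validation, so on small datasets they can occasionally disagree with the label returned by `predict`. When only a ranking or a confidence score is needed, the decision scores are often sufficient.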

9. Practical Tips

9.1 When to Use SVM

Scenario	Recommendation
Small to medium dataset (<10K samples)	SVM is a good choice
High-dimensional data (text, genomics)	SVM works well
Clear margin of separation	Linear SVM is efficient
Non-linear relationships	RBF or polynomial kernel
Need probability estimates	Use `probability=True` or consider Logistic Regression
Very large dataset (>100K samples)	Consider faster alternatives (Logistic Regression, SGDClassifier)
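For the large-dataset row, one option is `SGDClassifier` with hinge loss, which fits a linear SVM by stochastic gradient descent and scales roughly linearly with the number of samples. The sketch below is an assumption-laden illustration: the synthetic data stands in for a large dataset, and the hyperparameter values are typical starting points rather than tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a large dataset (sizes are arbitrary)
X, y = make_classification(n_samples=20000, n_features=20, n_informative=5,
                           n_clusters_per_class=1, class_sep=2.0,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# loss='hinge' makes SGDClassifier optimize a linear SVM objective
sgd_svm = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss='hinge', alpha=1e-4, max_iter=1000, tol=1e-3,
                  random_state=42),
)
sgd_svm.fit(X_train, y_train)

test_accuracy = sgd_svm.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.4f}")
```

This approach only covers linear decision boundaries. Here `alpha` plays the role of the regularization strength, acting inversely to C.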

9.2 Hyperparameter Tuning Guidelines

1. Always scale features (StandardScaler)
2. Start with RBF kernel
3. Use GridSearchCV with cross-validation
4. Tune C and gamma together:
   - Start: C=1, gamma='scale'
   - Try: C=[0.1, 1, 10, 100], gamma=[0.001, 0.01, 0.1, 1]
5. If RBF doesn't work well, try:
   - Linear kernel (for linearly separable data)
   - Polynomial kernel (for moderate non-linearity)

9.3 Common Mistakes

- Not scaling features: Always use StandardScaler or MinMaxScaler
- Using default C=1 without tuning: The best C depends on the data and its noise level
- Ignoring computational cost: SVM is slow on large datasets
- Forgetting to set probability=True: Needed if you want class probabilities from predict_proba

10. Exercises

Exercise 1: Linear vs RBF Kernel

Load the wine dataset and compare linear and RBF SVM. Which performs better?

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Your code here

Exercise 2: Hyperparameter Tuning

Use GridSearchCV to find optimal C and gamma for the breast cancer dataset.

from sklearn.datasets import load_breast_cancer

# Your code here

Exercise 3: SVM Regression

Create a polynomial regression problem and compare SVR with different kernels.

# Your code here

Exercise 4: Effect of Scaling

Train SVM on the iris dataset with and without scaling. Compare the results.

# Your code here

Summary

Topic	Key Points
Core Concept	Maximum margin classifier, support vectors define boundary
Linear SVM	C parameter controls margin vs misclassification trade-off
Kernel Trick	Maps to high-dimensional space without explicit transformation
RBF Kernel	Most common, gamma controls complexity
Multiclass	One-vs-One (OvO) strategy by default in sklearn
Scaling	CRITICAL - always scale features before SVM
Hyperparameters	C (regularization), gamma (RBF influence), kernel type
Use Cases	Small to medium datasets, high-dimensional data, clear margins
Limitations	Slow on large datasets, requires scaling, black box with kernels

Key Takeaway: SVM is powerful for small to medium-sized datasets with complex decision boundaries. Always scale features and tune C and gamma for best results.