k-Nearest Neighbors (k-NN) and Naive Bayes¶
Overview¶
This lesson covers two simple yet powerful algorithms:
- k-NN: Distance-based classification and regression
- Naive Bayes: Probabilistic classification based on Bayes' theorem
Part 1: k-Nearest Neighbors (k-NN)¶
1. k-NN Concepts¶
1.1 Basic Principle¶
"You are the average of your k nearest neighbors"
- Non-parametric algorithm: Doesn't assume any data distribution
- Instance-based learning (lazy learning): Stores all training data, no explicit training phase
- Prediction: Based on majority vote (classification) or average (regression) of k nearest neighbors
1.2 How k-NN Works¶
1. Store all training data
2. For a new data point:
a. Calculate distance to all training points
b. Find k nearest neighbors
c. Classification: Majority vote among the k neighbors; Regression: Average of the k neighbors' target values
1.3 Distance Metrics¶
| Metric | Formula | Use Case |
|---|---|---|
| Euclidean | √(Σ(xi - yi)²) | Most common, continuous features |
| Manhattan | Σ|xi - yi| | High-dimensional data, grid-like paths |
| Minkowski | (Σ|xi - yi|^p)^(1/p) | Generalization (p=1: Manhattan, p=2: Euclidean) |
| Hamming | Count of differing positions | Categorical/binary features |
| Cosine | 1 - (x·y)/(‖x‖·‖y‖) | Text and sparse data, where direction matters more than magnitude |
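To make the procedure in 1.2 concrete, here is a minimal from-scratch sketch of a k-NN classifier using the Euclidean metric. The function and variable names are illustrative only, not part of any library; the rest of the lesson uses scikit-learn's optimized implementation.
import numpy as np
from collections import Counter
def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the class of one point by majority vote of its k nearest neighbors."""
    # Step a: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step b: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step c: majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]
# Tiny made-up dataset: two features, two classes
X_demo = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
y_demo = np.array([0, 0, 1, 1])
print(knn_predict(X_demo, y_demo, np.array([1.1, 0.9]), k=3))  # expected: 0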
2. k-NN Classification¶
2.1 Basic Implementation¶
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.2, random_state=42
)
# Train k-NN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
# Predict
y_pred = knn.predict(X_test)
# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
2.2 Choosing Optimal k¶
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import numpy as np
# Try different k values
k_values = range(1, 31)
cv_scores = []
for k in k_values:
knn = KNeighborsClassifier(n_neighbors=k)
scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy')
cv_scores.append(scores.mean())
# Plot results
plt.figure(figsize=(10, 6))
plt.plot(k_values, cv_scores, marker='o')
plt.xlabel('k (Number of Neighbors)')
plt.ylabel('Cross-Validation Accuracy')
plt.title('Optimal k Selection')
plt.grid(True)
plt.show()
# Best k
optimal_k = k_values[np.argmax(cv_scores)]
print(f"Optimal k: {optimal_k}, Accuracy: {max(cv_scores):.4f}")
Guidelines for choosing k:
- Small k (e.g., 1-5): More complex decision boundary, sensitive to noise
- Large k: Smoother decision boundary, may underfit
- Rule of thumb: Start with k ≈ √N (N = number of training samples)
- Odd k: Avoids ties in binary classification
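As a quick illustration of the rule of thumb above (a starting point only, not a substitute for the cross-validation search shown earlier):
import numpy as np
# Rule of thumb: start near k = √N, rounded to an odd integer to avoid ties
n_samples = len(X_train)
k_start = int(np.sqrt(n_samples))
if k_start % 2 == 0:
    k_start += 1
print(f"Rule-of-thumb starting k for N={n_samples}: {k_start}")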
2.3 Distance Metrics Comparison¶
metrics = ['euclidean', 'manhattan', 'minkowski', 'cosine']
results = {}
for metric in metrics:
knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
knn.fit(X_train, y_train)
results[metric] = knn.score(X_test, y_test)
import pandas as pd
df_results = pd.DataFrame(list(results.items()), columns=['Metric', 'Accuracy'])
print(df_results.sort_values('Accuracy', ascending=False))
2.4 Weighted k-NN¶
Give closer neighbors more weight:
# Uniform weights (all neighbors equal)
knn_uniform = KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn_uniform.fit(X_train, y_train)
# Distance-based weights (each neighbor weighted by 1/distance, so closer neighbors have more influence)
knn_distance = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn_distance.fit(X_train, y_train)
print(f"Uniform weights: {knn_uniform.score(X_test, y_test):.4f}")
print(f"Distance weights: {knn_distance.score(X_test, y_test):.4f}")
3. k-NN Regression¶
3.1 Basic Implementation¶
from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error, r2_score
# Load data
diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
diabetes.data, diabetes.target, test_size=0.2, random_state=42
)
# Train k-NN Regressor
knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)
# Predict
y_pred = knn_reg.predict(X_test)
# Evaluate
print(f"MSE: {mean_squared_error(y_test, y_pred):.2f}")
print(f"Rยฒ: {r2_score(y_test, y_pred):.4f}")
3.2 Visualizing k-NN Regression¶
from sklearn.datasets import make_regression
import numpy as np
# Generate 1D regression data
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different k values
k_values = [1, 5, 20]
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, k in zip(axes, k_values):
knn = KNeighborsRegressor(n_neighbors=k)
knn.fit(X_train, y_train)
# Predict on smooth curve
X_plot = np.linspace(X.min(), X.max(), 300).reshape(-1, 1)
y_plot = knn.predict(X_plot)
# Plot
ax.scatter(X_train, y_train, alpha=0.6, label='Train')
ax.scatter(X_test, y_test, alpha=0.6, label='Test')
ax.plot(X_plot, y_plot, 'r-', linewidth=2, label='k-NN prediction')
ax.set_title(f'k={k}')
ax.set_xlabel('X')
ax.set_ylabel('y')
ax.legend()
plt.tight_layout()
plt.show()
4. Feature Scaling for k-NN¶
4.1 Why Scaling is Critical¶
k-NN is distance-based, so features with large scales dominate the distance calculation.
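A small numeric sketch of the problem (the numbers are made up for illustration): when one feature lives in the thousands and another between 0 and 1, the large-scale feature dominates the Euclidean distance until the features are rescaled.
import numpy as np
# Two toy points: feature 1 in the thousands, feature 2 between 0 and 1
a = np.array([1000.0, 0.2])
b = np.array([1010.0, 0.9])
print(np.sqrt(((a - b) ** 2).sum()))  # ~10.02, almost entirely from feature 1
# Rescale each feature by a typical spread so both contribute comparably
scale = np.array([1000.0, 1.0])
print(np.sqrt((((a - b) / scale) ** 2).sum()))  # ~0.70, feature 2 now matters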
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Re-split a classification dataset whose features have very different scales (wine)
from sklearn.datasets import load_wine
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)
# Without scaling
knn_no_scale = KNeighborsClassifier(n_neighbors=5)
knn_no_scale.fit(X_train, y_train)
# With scaling (Pipeline)
knn_pipeline = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=5))
])
knn_pipeline.fit(X_train, y_train)
print("Without Scaling:")
print(f" Train: {knn_no_scale.score(X_train, y_train):.4f}")
print(f" Test: {knn_no_scale.score(X_test, y_test):.4f}")
print("\nWith Scaling:")
print(f" Train: {knn_pipeline.score(X_train, y_train):.4f}")
print(f" Test: {knn_pipeline.score(X_test, y_test):.4f}")
5. Advantages and Disadvantages of k-NN¶
5.1 Advantages¶
- Simple and Intuitive: Easy to understand and implement
- No Training Phase: Fast training (just stores data)
- No Assumptions: Works with any data distribution
- Versatile: Works for classification and regression
5.2 Disadvantages¶
- Slow Prediction: O(N) per prediction (N = training set size)
- Memory Intensive: Stores all training data
- Sensitive to Feature Scaling: Requires normalization
- Curse of Dimensionality: Performance degrades in high dimensions
- Sensitive to Irrelevant Features: Noisy features hurt performance
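One practical way to soften the slow-prediction and memory issues is scikit-learn's algorithm parameter, which can build a KD-tree or ball-tree index over the training data instead of scanning every point. This is only a sketch of the option: tree indexes help most on dense, low-dimensional data and give little benefit in high dimensions.
# Tree-based neighbor search (helps on dense, low-dimensional data)
knn_tree = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
knn_tree.fit(X_train, y_train)
# 'brute' scans all training points; 'auto' (the default) picks a strategy from the data
knn_brute = KNeighborsClassifier(n_neighbors=5, algorithm='brute')
knn_brute.fit(X_train, y_train)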
Part 2: Naive Bayes¶
6. Naive Bayes Concepts¶
6.1 Bayes' Theorem¶
P(Class|Features) = P(Features|Class) × P(Class) / P(Features)
Where:
- P(Class|Features): Posterior probability (what we want)
- P(Features|Class): Likelihood (probability of features given class)
- P(Class): Prior probability (class frequency in training data)
- P(Features): Evidence (normalizing constant)
6.2 "Naive" Assumption¶
Assumption: All features are independent given the class
P(x1, x2, ..., xn | Class) = P(x1|Class) × P(x2|Class) × ... × P(xn|Class)
This assumption is rarely true in practice, but Naive Bayes works well anyway!
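A tiny worked example with made-up (hypothetical) probabilities for a two-word spam filter shows how the pieces fit together, and why a single zero factor would wipe out the whole product:
# Hypothetical numbers for illustration only
p_spam, p_ham = 0.3, 0.7                   # priors P(spam), P(ham)
p_words_given_spam = 0.8 * 0.4             # P(word1|spam) * P(word2|spam), naive independence
p_words_given_ham = 0.1 * 0.3              # P(word1|ham)  * P(word2|ham)
# Numerators of Bayes' theorem (unnormalized posteriors)
score_spam = p_words_given_spam * p_spam
score_ham = p_words_given_ham * p_ham
# Normalize by the evidence P(words)
print(f"P(spam|words) = {score_spam / (score_spam + score_ham):.3f}")  # ~0.821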
7. Types of Naive Bayes¶
7.1 Gaussian Naive Bayes¶
Assumption: Features follow a normal (Gaussian) distribution
Use Case: Continuous features
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.2, random_state=42
)
# Train Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# Predict
y_pred = gnb.predict(X_test)
# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
7.2 Multinomial Naive Bayes¶
Assumption: Features represent counts or frequencies
Use Case: Text classification (word counts, TF-IDF)
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
# Load text data
categories = ['alt.atheism', 'talk.religion.misc']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, random_state=42)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, random_state=42)
# Convert text to word counts
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(newsgroups_train.data)
X_test = vectorizer.transform(newsgroups_test.data)
y_train = newsgroups_train.target
y_test = newsgroups_test.target
# Train Multinomial Naive Bayes
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
# Predict
y_pred = mnb.predict(X_test)
# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=newsgroups_test.target_names))
7.3 Bernoulli Naive Bayes¶
Assumption: Features are binary (0 or 1)
Use Case: Binary features (document contains word or not)
from sklearn.naive_bayes import BernoulliNB
# Convert to binary (word present or not)
X_train_binary = (X_train > 0).astype(int)
X_test_binary = (X_test > 0).astype(int)
# Train Bernoulli Naive Bayes
bnb = BernoulliNB()
bnb.fit(X_train_binary, y_train)
# Predict
y_pred = bnb.predict(X_test_binary)
# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
8. Text Classification with Naive Bayes¶
8.1 Complete Pipeline¶
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
# Create pipeline
text_clf = Pipeline([
('tfidf', TfidfVectorizer(max_features=1000, stop_words='english')),
('clf', MultinomialNB(alpha=1.0))
])
# Train
text_clf.fit(newsgroups_train.data, newsgroups_train.target)
# Predict
predicted = text_clf.predict(newsgroups_test.data)
# Evaluate
print(f"Accuracy: {accuracy_score(newsgroups_test.target, predicted):.4f}")
# Test on custom text
docs_new = [
'God is love',
'OpenGL on the GPU is fast'
]
predicted_new = text_clf.predict(docs_new)
for doc, category in zip(docs_new, predicted_new):
print(f'{doc} => {newsgroups_test.target_names[category]}')
8.2 Alpha Parameter (Laplace Smoothing)¶
Problem: What if a word never appears in a class during training?
- P(word|class) = 0 → the entire product of probabilities for that class becomes 0
Solution: Additive (Laplace) smoothing - add α (alpha) to every count:
- P(word|class) = (count(word, class) + α) / (count(class) + α·V), where V is the vocabulary size
# Compare different alpha values
alphas = [0.1, 1.0, 10.0]
for alpha in alphas:
mnb = MultinomialNB(alpha=alpha)
mnb.fit(X_train, y_train)
accuracy = mnb.score(X_test, y_test)
print(f"alpha={alpha}: Accuracy={accuracy:.4f}")
9. Comparison: k-NN vs Naive Bayes¶
| Feature | k-NN | Naive Bayes |
|---|---|---|
| Type | Instance-based | Probabilistic |
| Training Speed | Fast (no training) | Fast |
| Prediction Speed | Slow (O(N)) | Fast |
| Memory Usage | High (stores all data) | Low (stores parameters) |
| Feature Scaling | Required | Not required |
| High Dimensions | Suffers (curse of dimensionality) | Works well (especially text) |
| Interpretability | Hard | Easy (probability-based) |
| Best Use Case | Small datasets, low dimensions | Text classification, categorical data |
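To sanity-check the prediction-speed row on your own machine, here is a rough, self-contained timing sketch on the iris data; the absolute numbers are machine-dependent, and the gap grows with the size of the training set.
import time
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
for name, model in [('k-NN', KNeighborsClassifier(n_neighbors=5)), ('Gaussian NB', GaussianNB())]:
    model.fit(X_tr, y_tr)
    start = time.perf_counter()
    model.predict(X_te)
    print(f"{name}: {time.perf_counter() - start:.6f} s to predict")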
10. Practical Tips¶
10.1 When to Use k-NN¶
- Small to medium datasets (<10K samples)
- Low-dimensional data (<20 features)
- Non-linear decision boundaries
- Need for interpretability (show nearest neighbors)
10.2 When to Use Naive Bayes¶
- Text classification (spam detection, sentiment analysis)
- Large datasets (fast training and prediction)
- Categorical features
- Need for probability estimates
- When independence assumption is reasonable
10.3 Hyperparameter Tuning¶
from sklearn.model_selection import GridSearchCV
# k-NN Grid Search (run on scaled, dense numeric features for meaningful distances)
knn_params = {
'n_neighbors': [3, 5, 7, 9, 11],
'weights': ['uniform', 'distance'],
'metric': ['euclidean', 'manhattan']
}
knn_grid = GridSearchCV(KNeighborsClassifier(), knn_params, cv=5, scoring='accuracy')
knn_grid.fit(X_train, y_train)
print(f"Best k-NN params: {knn_grid.best_params_}")
# Naive Bayes Grid Search (MultinomialNB expects non-negative, count-like features)
nb_params = {
'alpha': [0.1, 0.5, 1.0, 2.0, 10.0]
}
nb_grid = GridSearchCV(MultinomialNB(), nb_params, cv=5, scoring='accuracy')
nb_grid.fit(X_train, y_train)
print(f"Best NB params: {nb_grid.best_params_}")
11. Exercises¶
Exercise 1: k-NN Classification¶
Load the wine dataset and find the optimal k value using cross-validation.
from sklearn.datasets import load_wine
# Your code here
Exercise 2: k-NN Regression¶
Create a sine wave dataset and compare k-NN regression with different k values.
# Your code here
Exercise 3: Text Classification¶
Use Naive Bayes to classify movie reviews as positive or negative.
from sklearn.datasets import load_files
# Your code here
Exercise 4: Feature Scaling Impact¶
Compare k-NN performance on the breast cancer dataset with and without feature scaling.
from sklearn.datasets import load_breast_cancer
# Your code here
Summary¶
| Topic | Key Points |
|---|---|
| k-NN Principle | Classification/regression based on k nearest neighbors |
| k Selection | Small k = complex boundary, large k = smooth boundary |
| Distance Metrics | Euclidean (most common), Manhattan, Cosine (text) |
| Feature Scaling | CRITICAL for k-NN (distance-based algorithm) |
| k-NN Limitations | Slow prediction, high memory, curse of dimensionality |
| Naive Bayes | Probabilistic classifier based on Bayes' theorem |
| Independence Assumption | Assumes features are independent (naive but works well) |
| NB Types | Gaussian (continuous), Multinomial (counts), Bernoulli (binary) |
| Text Classification | Naive Bayes excels at text data (spam, sentiment) |
| Alpha (Smoothing) | Prevents zero probabilities for unseen words |
Key Takeaway: k-NN is simple and works well for small, low-dimensional datasets. Naive Bayes is fast and excellent for text classification despite its naive independence assumption.