Ensemble Learning - Boosting¶
Overview¶
Boosting is an ensemble technique that sequentially trains weak learners, with each model focusing on the errors made by previous models to improve overall performance.
1. Boosting Concepts¶
1.1 Key Principles¶
**Sequential Training**
- Train weak learners sequentially
- Each model corrects the errors of previous models
- Final prediction combines all models

**Sample Weighting**
- Increase weights on incorrectly classified samples
- Subsequent models focus on difficult cases
- Achieve high accuracy progressively
1.2 Differences from Bagging¶
| Feature | Bagging | Boosting |
|---|---|---|
| Training | Parallel | Sequential |
| Sample Weighting | Equal | Increases for errors |
| Primary Goal | Reduce variance | Reduce bias |
| Overfitting Risk | Low | Higher (requires careful tuning) |
| Example | Random Forest | XGBoost, AdaBoost |
2. AdaBoost (Adaptive Boosting)¶
2.1 Algorithm Process¶
1. Initialize sample weights (1/N for all)
2. For each iteration t:
   a. Train weak learner h_t on weighted samples
   b. Calculate error rate ε_t
   c. Calculate model weight α_t = 0.5 × ln((1-ε_t) / ε_t)
   d. Update sample weights:
      - Increase weight for misclassified samples
      - Decrease weight for correctly classified samples
   e. Normalize weights
3. Final prediction: weighted vote of all weak learners
2.2 Weight Update Formula¶
New weight = Old weight × exp(α_t × prediction_error)
Where:
- prediction_error = 1 (incorrect), -1 (correct)
- α_t = model weight (higher when error rate is lower)
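The loop and weight-update rule above can be sketched from scratch with NumPy and scikit-learn decision stumps. This is a minimal illustration of the algorithm for binary labels in {-1, +1}, not a production implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
y_signed = np.where(y == 1, 1, -1)  # AdaBoost works with labels in {-1, +1}

n_samples = len(y)
w = np.full(n_samples, 1 / n_samples)  # 1. initialize weights to 1/N
stumps, alphas = [], []

for t in range(50):
    stump = DecisionTreeClassifier(max_depth=1)      # a. train weak learner
    stump.fit(X, y_signed, sample_weight=w)
    pred = stump.predict(X)
    eps = np.sum(w[pred != y_signed])                # b. weighted error rate
    alpha = 0.5 * np.log((1 - eps) / (eps + 1e-10))  # c. model weight
    w = w * np.exp(-alpha * y_signed * pred)         # d. up-weight mistakes, down-weight hits
    w = w / w.sum()                                  # e. normalize
    stumps.append(stump)
    alphas.append(alpha)

# 3. final prediction: sign of the weighted vote of all stumps
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
accuracy = (np.sign(scores) == y_signed).mean()
print(f"Train accuracy: {accuracy:.4f}")
```

Note how `exp(-alpha * y_signed * pred)` implements the update formula: the exponent is +α_t for a misclassified sample (y and prediction disagree) and -α_t for a correct one.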
2.3 Implementation with sklearn¶
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Generate data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# AdaBoost Classifier
ada_clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # Weak learner (stump); 'base_estimator' in sklearn < 1.2
    n_estimators=50,     # Number of weak learners
    learning_rate=1.0,   # Weight update rate
    algorithm='SAMME',   # 'SAMME.R' was removed in scikit-learn 1.6
    random_state=42
)
ada_clf.fit(X_train, y_train)
print(f"Train Accuracy: {ada_clf.score(X_train, y_train):.4f}")
print(f"Test Accuracy: {ada_clf.score(X_test, y_test):.4f}")
# Feature importance
import matplotlib.pyplot as plt
import numpy as np
importances = ada_clf.feature_importances_
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(10, 6))
plt.title("Feature Importances (AdaBoost)")
plt.bar(range(X.shape[1]), importances[indices])
plt.xlabel("Feature Index")
plt.ylabel("Importance")
plt.show()
2.4 AdaBoost Hyperparameters¶
- n_estimators: Number of weak learners (default: 50)
- learning_rate: Contribution weight of each weak learner (default: 1.0)
- estimator: Weak learner model (default: decision stump, max_depth=1); named base_estimator in scikit-learn < 1.2
- algorithm: 'SAMME' (discrete); the real-valued 'SAMME.R' variant was removed in scikit-learn 1.6
3. Gradient Boosting¶
3.1 Core Concepts¶
**Gradient Descent in Function Space**
- Each model predicts the residuals (errors) of previous models
- Uses gradient descent to minimize the loss function
- Powerful for both regression and classification

**Process**
1. Initialize with a simple model (e.g., the mean)
2. For each iteration t:
   a. Calculate residuals (the negative gradient of the loss)
   b. Train weak learner h_t to predict the residuals
   c. Add h_t to the ensemble with learning rate η
3. Final prediction = initial model + Σ(η × h_t)
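For squared-error loss the negative gradient is exactly the residual, so the process above can be sketched in a few lines. This is a toy regression example using scikit-learn trees as the weak learners:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=300)

eta = 0.1                      # learning rate η
F = np.full_like(y, y.mean())  # 1. initialize with a constant model (the mean)
trees = []

for t in range(100):
    residuals = y - F                          # a. negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2)  # b. weak learner fits the residuals
    tree.fit(X, residuals)
    F = F + eta * tree.predict(X)              # c. shrunken additive update
    trees.append(tree)

train_mse = np.mean((F - y) ** 2)
print(f"Train MSE: {train_mse:.4f}")
```

Each iteration nudges the ensemble prediction F toward the targets by a fraction η of what the new tree learned, which is why a lower learning rate needs more trees.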
3.2 sklearn GradientBoostingClassifier¶
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
# Gradient Boosting
gb_clf = GradientBoostingClassifier(
    n_estimators=100,      # Number of boosting stages
    learning_rate=0.1,     # Shrinkage rate
    max_depth=3,           # Max depth of trees
    subsample=0.8,         # Fraction of samples for training each tree
    min_samples_split=2,   # Minimum samples to split a node
    min_samples_leaf=1,    # Minimum samples in a leaf
    max_features='sqrt',   # Number of features to consider
    random_state=42
)
gb_clf.fit(X_train, y_train)
print(f"Train Accuracy: {gb_clf.score(X_train, y_train):.4f}")
print(f"Test Accuracy: {gb_clf.score(X_test, y_test):.4f}")
3.3 Key Hyperparameters¶
| Parameter | Description | Tuning Tips |
|---|---|---|
| n_estimators | Number of boosting stages | More is better, but watch for overfitting |
| learning_rate | Shrinkage rate for each tree | Lower values require more trees (trade-off) |
| max_depth | Maximum depth of trees | 3-5 typically works well |
| subsample | Fraction of samples for training | 0.5-0.8 reduces overfitting |
| min_samples_split | Minimum samples to split | Increase to prevent overfitting |
| min_samples_leaf | Minimum samples in leaf | Increase to prevent overfitting |
| max_features | Features to consider | 'sqrt' or 'log2' for high-dimensional data |
4. XGBoost (Extreme Gradient Boosting)¶
4.1 Advantages of XGBoost¶
- High Performance: Parallel processing, cache optimization
- Regularization: L1 (Lasso) and L2 (Ridge) regularization to prevent overfitting
- Tree Pruning: Grows trees depth-first to max_depth, then prunes back splits whose gain falls below gamma
- Missing Value Handling: Automatically learns best direction for missing values
- Early Stopping: Stops training when validation performance doesn't improve
4.2 Installation and Basic Usage¶
# Install XGBoost
pip install xgboost
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
# XGBoost Classifier
xgb_clf = xgb.XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,
    colsample_bytree=0.8,  # Fraction of features to use per tree
    gamma=0,               # Minimum loss reduction for split
    reg_alpha=0,           # L1 regularization
    reg_lambda=1,          # L2 regularization
    random_state=42
)
xgb_clf.fit(X_train, y_train)
y_pred = xgb_clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
4.3 Early Stopping¶
# Early stopping with validation set
xgb_clf = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.1,
    max_depth=3,
    early_stopping_rounds=10,  # Stop if no improvement for 10 rounds
    random_state=42
)
xgb_clf.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=True
)
print(f"Best iteration: {xgb_clf.best_iteration}")
print(f"Best score: {xgb_clf.best_score:.4f}")
4.4 Feature Importance Visualization¶
import matplotlib.pyplot as plt
# Plot feature importance
xgb.plot_importance(xgb_clf, max_num_features=10)
plt.title("Feature Importance (XGBoost)")
plt.show()
# Get feature importance as array
importances = xgb_clf.feature_importances_
print("Top 5 features:")
for idx in importances.argsort()[::-1][:5]:
    print(f"Feature {idx}: {importances[idx]:.4f}")
5. LightGBM¶
5.1 Features of LightGBM¶
- Leaf-wise Growth: Grows tree leaf-wise (not level-wise) for better accuracy
- Histogram-based Learning: Bins continuous features for faster training
- GOSS (Gradient-based One-Side Sampling): Samples based on gradients
- EFB (Exclusive Feature Bundling): Bundles mutually exclusive features
- Categorical Feature Support: Handles categorical features directly
5.2 Installation and Usage¶
# Install LightGBM
pip install lightgbm
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
# LightGBM Classifier
lgb_clf = lgb.LGBMClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=-1,          # No limit (use num_leaves instead)
    num_leaves=31,         # Maximum number of leaves
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0,           # L1 regularization
    reg_lambda=1,          # L2 regularization
    random_state=42
)
lgb_clf.fit(X_train, y_train)
y_pred = lgb_clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
5.3 LightGBM with Categorical Features¶
import pandas as pd
import lightgbm as lgb
# Example with categorical features
df = pd.DataFrame({
    'cat_feature': ['A', 'B', 'A', 'C', 'B'],
    'num_feature': [1.0, 2.0, 3.0, 4.0, 5.0],
    'target': [0, 1, 0, 1, 1]
})

# LightGBM requires string columns to use the pandas 'category' dtype
df['cat_feature'] = df['cat_feature'].astype('category')

lgb_clf = lgb.LGBMClassifier(random_state=42)
lgb_clf.fit(
    df[['cat_feature', 'num_feature']],
    df['target'],
    categorical_feature=['cat_feature']  # Specify categorical features
)
6. Comparison: XGBoost vs LightGBM vs CatBoost¶
| Feature | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Tree Growth | Level-wise | Leaf-wise | Symmetric (level-wise) |
| Speed | Fast | Fastest | Moderate |
| Memory Usage | Moderate | Low | Moderate |
| Categorical Handling | Manual encoding | Supported | Best support |
| Overfitting Risk | Moderate | Higher (leaf-wise) | Lower |
| Tuning Difficulty | Moderate | Moderate | Easier (good defaults) |
| Use Case | General purpose | Large datasets, speed critical | Categorical features, ease of use |
6.1 CatBoost Example¶
# Install CatBoost
pip install catboost
from catboost import CatBoostClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
# CatBoost Classifier
cat_clf = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=3,
    verbose=False,
    random_state=42
)
cat_clf.fit(X_train, y_train)
print(f"Accuracy: {cat_clf.score(X_test, y_test):.4f}")
7. Hyperparameter Tuning for Boosting Models¶
7.1 Grid Search for XGBoost¶
from sklearn.model_selection import GridSearchCV
import xgboost as xgb
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.3],
    'max_depth': [3, 5, 7],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}
xgb_clf = xgb.XGBClassifier(random_state=42)
grid_search = GridSearchCV(
    xgb_clf, param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")
7.2 Randomized Search for LightGBM¶
from sklearn.model_selection import RandomizedSearchCV
import lightgbm as lgb
from scipy.stats import randint, uniform
param_dist = {
    'n_estimators': randint(50, 300),
    'learning_rate': uniform(0.01, 0.3),
    'num_leaves': randint(20, 100),
    'max_depth': randint(3, 10),
    'subsample': uniform(0.6, 0.4),
    'colsample_bytree': uniform(0.6, 0.4)
}
lgb_clf = lgb.LGBMClassifier(random_state=42)
random_search = RandomizedSearchCV(
    lgb_clf, param_dist,
    n_iter=50,  # Number of parameter combinations to try
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1,
    random_state=42
)
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best score: {random_search.best_score_:.4f}")
8. Preventing Overfitting in Boosting¶
8.1 Regularization Techniques¶
# Example with multiple regularization techniques
xgb_clf = xgb.XGBClassifier(
    n_estimators=100,
    learning_rate=0.05,    # Lower learning rate
    max_depth=3,           # Limit tree depth
    min_child_weight=3,    # Minimum sum of instance weights in a child node
    gamma=0.1,             # Minimum loss reduction for split
    subsample=0.8,         # Row sampling
    colsample_bytree=0.8,  # Column sampling
    reg_alpha=0.1,         # L1 regularization
    reg_lambda=1.0,        # L2 regularization
    random_state=42
)
8.2 Early Stopping¶
# Early stopping to prevent overfitting
xgb_clf = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.1,
    eval_metric='logloss',     # Metric monitored on eval_set (a constructor argument in XGBoost >= 2.0)
    early_stopping_rounds=50,  # Stop if no improvement for 50 rounds
    random_state=42
)
xgb_clf.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],
    verbose=10
)
9. Practical Tips¶
9.1 When to Use Which Algorithm¶
| Scenario | Recommended Algorithm |
|---|---|
| Small dataset (<10K rows) | Gradient Boosting, AdaBoost |
| Large dataset (>100K rows) | LightGBM |
| Many categorical features | CatBoost |
| Need feature importance | XGBoost, LightGBM |
| Need high interpretability | Gradient Boosting (fewer trees) |
| Speed is critical | LightGBM |
| Balanced performance | XGBoost (most versatile) |
9.2 Hyperparameter Tuning Order¶
1. Fix n_estimators to a high value (e.g., 1000)
2. Tune learning_rate (start with 0.1)
3. Tune tree-specific parameters (max_depth, num_leaves, min_child_weight)
4. Tune sampling parameters (subsample, colsample_bytree)
5. Tune regularization parameters (gamma, reg_alpha, reg_lambda)
6. Lower learning_rate and increase n_estimators for final model
9.3 Common Mistakes to Avoid¶
- Not using early stopping: Always use validation set with early stopping
- Unnecessary feature scaling: Tree-based models split on thresholds and are invariant to monotonic transforms, so scaling adds nothing; spend preprocessing effort on encoding and missing values instead
- Default hyperparameters: Always tune for your specific dataset
- Overfitting on small datasets: Use stronger regularization
- Not handling imbalanced data: Use scale_pos_weight (XGBoost, LightGBM) or class_weight
10. Exercises¶
Exercise 1: AdaBoost vs Gradient Boosting¶
Compare AdaBoost and Gradient Boosting on the iris dataset. Which performs better?
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
# Your code here
Exercise 2: XGBoost Hyperparameter Tuning¶
Load the wine dataset and use GridSearchCV to find optimal hyperparameters for XGBoost.
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
import xgboost as xgb
# Your code here
Exercise 3: LightGBM with Early Stopping¶
Train a LightGBM model on the digits dataset with early stopping. Plot training and validation curves.
from sklearn.datasets import load_digits
import lightgbm as lgb
import matplotlib.pyplot as plt
# Your code here
Exercise 4: Feature Importance Comparison¶
Compare feature importances from XGBoost, LightGBM, and Random Forest on the same dataset.
# Your code here
Summary¶
| Topic | Key Points |
|---|---|
| Boosting Basics | Sequential training, error correction, sample weighting |
| AdaBoost | Adaptive weighting, weak learners, SAMME algorithm |
| Gradient Boosting | Gradient descent in function space, residual prediction |
| XGBoost | Regularization, parallel processing, early stopping |
| LightGBM | Leaf-wise growth, histogram-based, fastest training |
| CatBoost | Best categorical handling, symmetric trees, easy to use |
| Tuning | learning_rate ↔ n_estimators trade-off, regularization |
| Overfitting | Early stopping, regularization, sampling, tree depth |
Key Takeaway: Boosting models are powerful but require careful tuning. Start with XGBoost for general use, use LightGBM for large datasets, and CatBoost for categorical features.