Real-World Projects¶
Overview¶
This lesson works through end-to-end solutions for a classification problem and a regression problem on real datasets, covering the Kaggle-style problem-solving process and practical know-how.
1. Machine Learning Project Workflow¶
1.1 Overall Process¶
"""
Machine Learning Project Stages:
1. Problem Definition
- Understand business objectives
- Define success metrics
- Decide: classification/regression/clustering
2. Data Collection and Exploration
- Load data
- EDA (Exploratory Data Analysis)
- Check data quality
3. Data Preprocessing
- Handle missing values
- Handle outliers
- Feature engineering
- Encoding and scaling
4. Modeling
- Baseline model
- Model selection and comparison
- Hyperparameter tuning
5. Evaluation and Interpretation
- Performance evaluation
- Error analysis
- Feature importance
6. Deployment and Monitoring
- Model saving
- Prediction API
- Performance monitoring
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import (
accuracy_score, classification_report, confusion_matrix,
mean_squared_error, r2_score, mean_absolute_error
)
import warnings
warnings.filterwarnings('ignore')
2. Project 1: Titanic Survival Prediction (Classification)¶
2.1 Data Loading and Exploration¶
# Load data (actual Kaggle data or seaborn built-in data)
# df = pd.read_csv('titanic.csv')
df = sns.load_dataset('titanic')
print("=== Basic Data Information ===")
print(f"Data shape: {df.shape}")
print(f"\nColumn info:")
print(df.info())
print(f"\nFirst 5 rows:")
print(df.head())
print(f"\nDescriptive statistics:")
print(df.describe())
print(f"\nTarget distribution:")
print(df['survived'].value_counts(normalize=True))
2.2 Exploratory Data Analysis (EDA)¶
# Check missing values
print("=== Missing Value Analysis ===")
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({'Missing Count': missing, 'Missing Percentage (%)': missing_pct})
print(missing_df[missing_df['Missing Count'] > 0])
# Visualization
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
# Survival rate
sns.countplot(data=df, x='survived', ax=axes[0, 0])
axes[0, 0].set_title('Survival Distribution')
# Survival by gender
sns.countplot(data=df, x='sex', hue='survived', ax=axes[0, 1])
axes[0, 1].set_title('Survival by Sex')
# Survival by class
sns.countplot(data=df, x='pclass', hue='survived', ax=axes[0, 2])
axes[0, 2].set_title('Survival by Class')
# Age distribution
sns.histplot(data=df, x='age', hue='survived', kde=True, ax=axes[1, 0])
axes[1, 0].set_title('Age Distribution by Survival')
# Fare distribution
sns.histplot(data=df, x='fare', hue='survived', kde=True, ax=axes[1, 1])
axes[1, 1].set_title('Fare Distribution by Survival')
# Survival by port of embarkation
sns.countplot(data=df, x='embarked', hue='survived', ax=axes[1, 2])
axes[1, 2].set_title('Survival by Embarked')
plt.tight_layout()
plt.show()
# Correlation
print("\n=== Correlation with Numeric Variables ===")
numeric_cols = df.select_dtypes(include=[np.number]).columns
print(df[numeric_cols].corr()['survived'].sort_values(ascending=False))
2.3 Data Preprocessing¶
# Working copy
df_clean = df.copy()
# Remove unnecessary columns
drop_cols = ['deck', 'embark_town', 'alive', 'who', 'adult_male', 'class']
df_clean = df_clean.drop(columns=drop_cols, errors='ignore')
# Handle missing values
# Age: Replace with median
df_clean['age'] = df_clean['age'].fillna(df_clean['age'].median())
# Port of embarkation: Replace with mode
df_clean['embarked'] = df_clean['embarked'].fillna(df_clean['embarked'].mode()[0])
# Feature engineering
# Family size
df_clean['family_size'] = df_clean['sibsp'] + df_clean['parch'] + 1
# Traveling alone
df_clean['is_alone'] = (df_clean['family_size'] == 1).astype(int)
# Age groups
df_clean['age_group'] = pd.cut(df_clean['age'],
bins=[0, 12, 18, 35, 60, 100],
labels=['Child', 'Teen', 'Young', 'Middle', 'Senior'])
# Categorical encoding
df_clean['sex'] = LabelEncoder().fit_transform(df_clean['sex'])
df_clean['embarked'] = LabelEncoder().fit_transform(df_clean['embarked'])
df_clean['age_group'] = LabelEncoder().fit_transform(df_clean['age_group'])
# Final feature selection
features = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
'embarked', 'family_size', 'is_alone', 'age_group']
X = df_clean[features]
y = df_clean['survived']
print(f"Final features: {features}")
print(f"X shape: {X.shape}")
2.4 Modeling¶
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
# Data split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Model definitions
models = {
'Logistic Regression': LogisticRegression(max_iter=1000),
'Decision Tree': DecisionTreeClassifier(random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'Gradient Boosting': GradientBoostingClassifier(random_state=42),
'XGBoost': XGBClassifier(eval_metric='logloss', random_state=42),
'LightGBM': LGBMClassifier(random_state=42, verbose=-1)
}
# Model comparison
print("=== Model Comparison ===")
results = []
for name, model in models.items():
    # Use scaled data for the linear model (tree-based models don't need scaling)
    if name in ['Logistic Regression']:
        X_tr, X_te = X_train_scaled, X_test_scaled
    else:
        X_tr, X_te = X_train, X_test
    # Cross-validation
    cv_scores = cross_val_score(model, X_tr, y_train, cv=5, scoring='accuracy')
    # Train and test
    model.fit(X_tr, y_train)
    test_score = model.score(X_te, y_test)
    results.append({
        'Model': name,
        'CV Mean': cv_scores.mean(),
        'CV Std': cv_scores.std(),
        'Test Score': test_score
    })
    print(f"{name}: CV={cv_scores.mean():.4f}(+/-{cv_scores.std():.4f}), Test={test_score:.4f}")
results_df = pd.DataFrame(results)
print(f"\nBest CV score: {results_df.loc[results_df['CV Mean'].idxmax(), 'Model']}")
2.5 Hyperparameter Tuning¶
# Tune best performing model (e.g., Random Forest)
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [5, 10, 15, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)
print("\n=== Hyperparameter Tuning Results ===")
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
print(f"Test score: {grid_search.score(X_test, y_test):.4f}")
best_model = grid_search.best_estimator_
2.6 Final Evaluation¶
# Predictions
y_pred = best_model.predict(X_test)
# Confusion matrix
print("=== Final Evaluation ===")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Not Survived', 'Survived']))
# Visualize confusion matrix
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=['Not Survived', 'Survived'],
yticklabels=['Not Survived', 'Survived'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
# Feature importance
importances = best_model.feature_importances_
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(10, 6))
plt.bar(range(len(importances)), importances[indices])
plt.xticks(range(len(importances)), [features[i] for i in indices], rotation=45)
plt.title('Feature Importance')
plt.tight_layout()
plt.show()
print("\nFeature importance:")
for i in indices:
    print(f" {features[i]}: {importances[i]:.4f}")
3. Project 2: Housing Price Prediction (Regression)¶
3.1 Data Loading and Exploration¶
from sklearn.datasets import fetch_california_housing
# California housing price data
housing = fetch_california_housing()
df_house = pd.DataFrame(housing.data, columns=housing.feature_names)
df_house['MedHouseVal'] = housing.target
print("=== Housing Price Data ===")
print(f"Data shape: {df_house.shape}")
print(f"\nColumns: {list(df_house.columns)}")
print(f"\nDescriptive statistics:")
print(df_house.describe())
# Target distribution
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.hist(df_house['MedHouseVal'], bins=50, edgecolor='black')
plt.xlabel('Median House Value')
plt.ylabel('Count')
plt.title('Target Distribution')
plt.subplot(1, 2, 2)
plt.hist(np.log1p(df_house['MedHouseVal']), bins=50, edgecolor='black')
plt.xlabel('Log(Median House Value)')
plt.ylabel('Count')
plt.title('Log-Transformed Target')
plt.tight_layout()
plt.show()
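The right-hand panel previews a log1p transform of the target. If the target were skewed enough to warrant it, one option is to fit on the transformed target and invert predictions with expm1; a minimal, illustrative sketch (the rest of this project fits on the raw target):
# Fit on log1p(y), then bring predictions back to the original scale with expm1
from sklearn.linear_model import LinearRegression
X_tmp = df_house.drop('MedHouseVal', axis=1)
y_log = np.log1p(df_house['MedHouseVal'])
X_tr, X_va, y_tr, y_va = train_test_split(X_tmp, y_log, test_size=0.2, random_state=42)
lin = LinearRegression().fit(X_tr, y_tr)
pred = np.expm1(lin.predict(X_va))  # back to the original scale
print(f"RMSE (original scale): {np.sqrt(mean_squared_error(np.expm1(y_va), pred)):.4f}")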
3.2 Exploratory Data Analysis¶
# Correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(df_house.corr(), annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()
# Correlation with target
print("\nCorrelation with target:")
correlations = df_house.corr()['MedHouseVal'].drop('MedHouseVal').sort_values(ascending=False)
print(correlations)
# Main features vs target relationship
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for ax, col in zip(axes.flatten(), housing.feature_names):
    ax.scatter(df_house[col], df_house['MedHouseVal'], alpha=0.1)
    ax.set_xlabel(col)
    ax.set_ylabel('MedHouseVal')
    ax.set_title(f'Corr: {df_house[col].corr(df_house["MedHouseVal"]):.3f}')
plt.tight_layout()
plt.show()
3.3 Data Preprocessing¶
# Separate features and target
X = df_house.drop('MedHouseVal', axis=1)
y = df_house['MedHouseVal']
# Feature engineering
X_eng = X.copy()
# Rooms per person
X_eng['RoomsPerPerson'] = X_eng['AveRooms'] / X_eng['AveOccup']
# Bedroom ratio
X_eng['BedroomRatio'] = X_eng['AveBedrms'] / X_eng['AveRooms']
# Population density (approximate)
X_eng['PopDensity'] = X_eng['Population'] / X_eng['AveOccup']
# Handle inf/NaN
X_eng = X_eng.replace([np.inf, -np.inf], np.nan)
X_eng = X_eng.fillna(X_eng.median())
# Data split
X_train, X_test, y_train, y_test = train_test_split(
X_eng, y, test_size=0.2, random_state=42
)
# Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(f"Training data: {X_train.shape}")
print(f"Test data: {X_test.shape}")
3.4 Modeling¶
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
# Model definitions
reg_models = {
'Linear Regression': LinearRegression(),
'Ridge': Ridge(alpha=1.0),
'Lasso': Lasso(alpha=0.1),
'ElasticNet': ElasticNet(alpha=0.1, l1_ratio=0.5),
'Decision Tree': DecisionTreeRegressor(random_state=42),
'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
'Gradient Boosting': GradientBoostingRegressor(random_state=42),
'XGBoost': XGBRegressor(random_state=42),
'LightGBM': LGBMRegressor(random_state=42, verbose=-1)
}
# Model comparison
print("=== Regression Model Comparison ===")
reg_results = []
for name, model in reg_models.items():
    # Use scaled data for linear models
    if name in ['Linear Regression', 'Ridge', 'Lasso', 'ElasticNet']:
        X_tr, X_te = X_train_scaled, X_test_scaled
    else:
        X_tr, X_te = X_train, X_test
    # Cross-validation
    cv_scores = cross_val_score(model, X_tr, y_train, cv=5, scoring='r2')
    # Train and predict
    model.fit(X_tr, y_train)
    y_pred = model.predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    reg_results.append({
        'Model': name,
        'CV R2 Mean': cv_scores.mean(),
        'CV R2 Std': cv_scores.std(),
        'Test RMSE': rmse,
        'Test R2': r2
    })
    print(f"{name}: CV R2={cv_scores.mean():.4f}(+/-{cv_scores.std():.4f}), RMSE={rmse:.4f}, R2={r2:.4f}")
reg_results_df = pd.DataFrame(reg_results)
print(f"\nBest test R2: {reg_results_df.loc[reg_results_df['Test R2'].idxmax(), 'Model']}")
3.5 Hyperparameter Tuning¶
# Tune LightGBM
lgbm_params = {
'n_estimators': [100, 200, 300],
'max_depth': [5, 10, 15, -1],
'learning_rate': [0.01, 0.05, 0.1],
'num_leaves': [31, 63, 127]
}
lgbm = LGBMRegressor(random_state=42, verbose=-1)
grid_lgbm = GridSearchCV(lgbm, lgbm_params, cv=5, scoring='r2', n_jobs=-1, verbose=1)
grid_lgbm.fit(X_train, y_train)
print("\n=== LightGBM Tuning Results ===")
print(f"Best parameters: {grid_lgbm.best_params_}")
print(f"Best CV R2: {grid_lgbm.best_score_:.4f}")
y_pred_best = grid_lgbm.predict(X_test)
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_best)):.4f}")
print(f"Test R2: {r2_score(y_test, y_pred_best):.4f}")
3.6 Final Evaluation¶
# Final model
best_reg = grid_lgbm.best_estimator_
y_pred_final = best_reg.predict(X_test)
# Evaluation metrics
print("=== Final Regression Evaluation ===")
print(f"MAE: {mean_absolute_error(y_test, y_pred_final):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_final)):.4f}")
print(f"R2: {r2_score(y_test, y_pred_final):.4f}")
# Actual vs Predicted
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
# Scatter plot
axes[0].scatter(y_test, y_pred_final, alpha=0.3)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
axes[0].set_xlabel('Actual')
axes[0].set_ylabel('Predicted')
axes[0].set_title('Actual vs Predicted')
# Residual distribution
residuals = y_test - y_pred_final
axes[1].hist(residuals, bins=50, edgecolor='black')
axes[1].set_xlabel('Residual')
axes[1].set_ylabel('Count')
axes[1].set_title(f'Residual Distribution\nMean: {residuals.mean():.4f}')
# Residuals vs predicted
axes[2].scatter(y_pred_final, residuals, alpha=0.3)
axes[2].axhline(y=0, color='r', linestyle='--')
axes[2].set_xlabel('Predicted')
axes[2].set_ylabel('Residual')
axes[2].set_title('Residuals vs Predicted')
plt.tight_layout()
plt.show()
# Feature importance
importances = best_reg.feature_importances_
feature_names = X_eng.columns
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(12, 6))
plt.bar(range(len(importances)), importances[indices])
plt.xticks(range(len(importances)), [feature_names[i] for i in indices], rotation=45)
plt.title('Feature Importance (LightGBM)')
plt.tight_layout()
plt.show()
4. Kaggle Competition Strategy¶
4.1 Basic Strategy¶
"""
Kaggle competition strategy:
1. Quick Start
- Run provided baseline code
- Make first submission with simple model
- Check leaderboard position
2. Focus on EDA
- Data understanding is key
- Identify missing values, outliers, distributions
- Analyze relationship with target
3. Feature Engineering
- Leverage domain knowledge
- Create interaction features
- Group statistics
4. Try Various Models
- Linear models → Tree-based → Ensemble
- Hyperparameter tuning
5. Ensemble
- Combine predictions from different models
- Blending, stacking
6. Validation Strategy
- Ensure local CV matches leaderboard scores
- Prevent overfitting
"""
4.2 Cross-Validation Strategy¶
from sklearn.model_selection import KFold, StratifiedKFold
def cross_validate_model(model, X, y, n_splits=5, stratified=False, return_preds=False):
    """
    Cross-validation and OOF (out-of-fold) prediction generation
    """
    if stratified:
        kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
        split_func = kf.split(X, y)
    else:
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
        split_func = kf.split(X)
    scores = []
    oof_preds = np.zeros(len(X))
    for fold, (train_idx, val_idx) in enumerate(split_func):
        X_train_fold = X.iloc[train_idx] if hasattr(X, 'iloc') else X[train_idx]
        X_val_fold = X.iloc[val_idx] if hasattr(X, 'iloc') else X[val_idx]
        y_train_fold = y.iloc[train_idx] if hasattr(y, 'iloc') else y[train_idx]
        y_val_fold = y.iloc[val_idx] if hasattr(y, 'iloc') else y[val_idx]
        model.fit(X_train_fold, y_train_fold)
        val_pred = model.predict(X_val_fold)
        oof_preds[val_idx] = val_pred
        score = r2_score(y_val_fold, val_pred)  # scored with R2; swap in another metric for classification
        scores.append(score)
        print(f"Fold {fold+1}: {score:.4f}")
    print(f"Mean: {np.mean(scores):.4f} (+/- {np.std(scores):.4f})")
    if return_preds:
        return oof_preds
    return np.mean(scores)
# Usage example
# oof_preds = cross_validate_model(model, X_train, y_train, n_splits=5, return_preds=True)
4.3 Ensemble Techniques¶
def simple_blend(predictions_list, weights=None):
    """Simple blending"""
    if weights is None:
        weights = [1 / len(predictions_list)] * len(predictions_list)
    blended = np.zeros(len(predictions_list[0]))
    for pred, weight in zip(predictions_list, weights):
        blended += weight * pred
    return blended
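# Usage example (pred_rf, pred_lgbm: hypothetical prediction arrays from two trained models)
# blended = simple_blend([pred_rf, pred_lgbm], weights=[0.6, 0.4])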
def stacking_ensemble(models, X_train, y_train, X_test, n_folds=5):
    """Stacking ensemble"""
    n_models = len(models)
    n_train = len(X_train)
    n_test = len(X_test)
    # OOF predictions
    oof_train = np.zeros((n_train, n_models))
    oof_test = np.zeros((n_test, n_models))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    for i, model in enumerate(models):
        print(f"Training model {i+1}/{n_models}")
        oof_test_fold = np.zeros((n_test, n_folds))
        for fold, (train_idx, val_idx) in enumerate(kf.split(X_train)):
            X_tr = X_train.iloc[train_idx] if hasattr(X_train, 'iloc') else X_train[train_idx]
            X_val = X_train.iloc[val_idx] if hasattr(X_train, 'iloc') else X_train[val_idx]
            y_tr = y_train.iloc[train_idx] if hasattr(y_train, 'iloc') else y_train[train_idx]
            model.fit(X_tr, y_tr)
            oof_train[val_idx, i] = model.predict(X_val)
            oof_test_fold[:, fold] = model.predict(X_test)
        oof_test[:, i] = oof_test_fold.mean(axis=1)
    return oof_train, oof_test
# Usage example
# models = [RandomForestRegressor(), XGBRegressor(), LGBMRegressor()]
# oof_train, oof_test = stacking_ensemble(models, X_train, y_train, X_test)
# meta_model = Ridge()
# meta_model.fit(oof_train, y_train)
# final_preds = meta_model.predict(oof_test)
5. Practical Tips¶
5.1 Quick Experiment Template¶
def quick_experiment(X_train, y_train, X_test, y_test, task='classification'):
    """Quick model comparison"""
    if task == 'classification':
        models = {
            'LR': LogisticRegression(max_iter=1000),
            'RF': RandomForestClassifier(n_estimators=100, random_state=42),
            'XGB': XGBClassifier(eval_metric='logloss', random_state=42),
            'LGBM': LGBMClassifier(random_state=42, verbose=-1)
        }
        scoring = 'accuracy'
    else:
        models = {
            'Ridge': Ridge(),
            'RF': RandomForestRegressor(n_estimators=100, random_state=42),
            'XGB': XGBRegressor(random_state=42),
            'LGBM': LGBMRegressor(random_state=42, verbose=-1)
        }
        scoring = 'r2'
    results = {}
    for name, model in models.items():
        cv_score = cross_val_score(model, X_train, y_train, cv=5, scoring=scoring).mean()
        model.fit(X_train, y_train)
        test_score = model.score(X_test, y_test)
        results[name] = {'CV': cv_score, 'Test': test_score}
        print(f"{name}: CV={cv_score:.4f}, Test={test_score:.4f}")
    return results
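# Usage example (e.g., with the housing split from Project 2)
# results = quick_experiment(X_train, y_train, X_test, y_test, task='regression')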
5.2 Memory Optimization¶
def reduce_memory_usage(df):
    """Reduce DataFrame memory usage"""
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
    end_mem = df.memory_usage().sum() / 1024**2
    print(f'Memory: {start_mem:.2f} MB → {end_mem:.2f} MB ({100*(start_mem-end_mem)/start_mem:.1f}% reduction)')
    return df
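# Usage example (note: float16 can lose precision; drop that branch if exact values matter)
# df_house = reduce_memory_usage(df_house)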
5.3 Error Analysis¶
def analyze_errors(y_true, y_pred, X, feature_names, top_n=10):
    """Error analysis"""
    # Work with plain arrays so the positional indices from argsort line up correctly
    errors = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    # Largest errors
    top_errors_idx = np.argsort(errors)[-top_n:]
    print(f"=== Top {top_n} Error Analysis ===")
    for idx in top_errors_idx[::-1]:
        print(f"\nIndex {idx}:")
        print(f" Actual: {y_true.iloc[idx] if hasattr(y_true, 'iloc') else y_true[idx]:.4f}")
        print(f" Predicted: {y_pred[idx]:.4f}")
        print(f" Error: {errors[idx]:.4f}")
    # Error correlation with features
    X_arr = X.values if hasattr(X, 'values') else X
    error_corr = []
    for i, name in enumerate(feature_names):
        corr = np.corrcoef(errors, X_arr[:, i])[0, 1]
        error_corr.append((name, corr))
    error_corr.sort(key=lambda x: abs(x[1]), reverse=True)
    print("\n=== Error Correlation with Features ===")
    for name, corr in error_corr[:5]:
        print(f" {name}: {corr:.4f}")
# Usage example
# analyze_errors(y_test, y_pred, X_test, feature_names)
Exercises¶
Exercise 1: Complete Pipeline¶
Build a complete ML pipeline using the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
# Solution
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.2, random_state=42
)
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"CV score: {cv_scores.mean():.4f}")
pipeline.fit(X_train, y_train)
print(f"Test score: {pipeline.score(X_test, y_test):.4f}")
Exercise 2: Feature Engineering¶
Add new features to the DataFrame below.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 2, 3, 4, 5],
'B': [10, 20, 30, 40, 50],
'C': [100, 200, 300, 400, 500]
})
# Solution: Add ratio features, sum features
df['A_B_ratio'] = df['A'] / df['B']
df['AB_sum'] = df['A'] + df['B']
df['log_C'] = np.log1p(df['C'])
df['A_squared'] = df['A'] ** 2
print(df)
Exercise 3: Ensemble¶
Blend the predictions of multiple models.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Solution
models = [
LogisticRegression(max_iter=1000),
RandomForestClassifier(n_estimators=100, random_state=42)
]
predictions = []
for model in models:
    model.fit(X_train, y_train)
    pred = model.predict_proba(X_test)[:, 1]
    predictions.append(pred)
# Average blending
blended = np.mean(predictions, axis=0)
blended_labels = (blended > 0.5).astype(int)
print(f"Blending accuracy: {accuracy_score(y_test, blended_labels):.4f}")
Summary¶
Classification vs Regression Checklist¶
| Stage | Classification | Regression |
|---|---|---|
| Evaluation Metrics | Accuracy, F1, AUC | RMSE, MAE, R2 |
| Target Handling | Encoding | Check outliers, log transform |
| Imbalance | SMOTE, class weights | N/A |
| Error Analysis | Confusion matrix | Residual analysis |
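The imbalance row above mentions class weights; a minimal sketch, assuming a binary classification task with a rare positive class (class_weight='balanced' reweights classes inversely to their frequency):
from sklearn.ensemble import RandomForestClassifier
# Reweight classes instead of resampling; evaluate with F1/AUC rather than accuracy
clf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
# clf.fit(X_train, y_train)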
Model Selection Guide¶
| Situation | Recommended Model |
|---|---|
| Quick baseline | Logistic/Linear regression |
| General performance | Random Forest |
| Best performance | XGBoost, LightGBM |
| Interpretation needed | Decision tree, linear model |
| Large dataset | LightGBM |
Kaggle Essential Tips¶
- Always compare local CV and leaderboard scores
- Beware of overfitting - don't tune to public leaderboard
- Ensemble with diverse models
- Feature engineering is key
- Reference public kernels/notebooks, but understand them before applying their ideas