Machine Learning Overview¶
1. What is Machine Learning?¶
Machine learning is a field of AI in which algorithms learn patterns from data to make predictions or decisions, rather than following rules that were explicitly programmed.
# Traditional Programming vs Machine Learning
# Traditional: Rules + Data → Results
# Machine Learning: Data + Results → Rules (Model)
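As a rough illustration of the contrast above, the sketch below converts Celsius to Fahrenheit twice: once with a hand-written rule, and once by letting a model recover the rule from example pairs. The function name `celsius_to_fahrenheit_rule` and the sample temperatures are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Traditional programming: a person writes the rule.
def celsius_to_fahrenheit_rule(celsius):
    return celsius * 1.8 + 32

# Machine learning: the rule is estimated from (data, results) pairs.
celsius = np.array([[-10.0], [0.0], [10.0], [20.0], [30.0]])
fahrenheit = celsius_to_fahrenheit_rule(celsius).ravel()

model = LinearRegression()
model.fit(celsius, fahrenheit)

learned_slope = model.coef_[0]
learned_intercept = model.intercept_
print(f"Learned rule: F = {learned_slope:.2f} * C + {learned_intercept:.2f}")
```

Because the example data is noise-free, the fitted slope and intercept should land very close to 1.8 and 32. Real data is noisier, so a learned rule is usually only an approximation.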
2. Types of Machine Learning¶
2.1 Supervised Learning¶
Learning with labeled data (input X and target y).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Example: Predict house price from size
X = np.array([[50], [60], [70], [80], [90], [100]]) # Size (pyeong)
y = np.array([1.5, 1.8, 2.1, 2.5, 2.8, 3.2]) # Price (100M KRW)
# Train model
model = LinearRegression()
model.fit(X, y)
# Predict
new_house = [[75]]
predicted_price = model.predict(new_house)
print(f"Predicted price for 75 pyeong house: {predicted_price[0]:.2f} 100M KRW")
Main Algorithms:
- Regression: Predict continuous values
  - Linear regression, polynomial regression, ridge, lasso
- Classification: Predict categories
  - Logistic regression, SVM, decision trees, random forest
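The house price example covers regression. For classification, a minimal sketch might look like the following. The study-hours data is hypothetical and chosen only so that the two classes are easy to separate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied -> exam result (0 = fail, 1 = pass)
hours = np.array([[1], [2], [3], [4], [6], [7], [8], [9]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Train a classifier on the labeled examples
clf = LogisticRegression()
clf.fit(hours, passed)

# Predict a category and the estimated probability of passing
new_students = np.array([[2], [8]])
predicted_labels = clf.predict(new_students)
predicted_probs = clf.predict_proba(new_students)[:, 1]

for h, label, prob in zip(new_students.ravel(), predicted_labels, predicted_probs):
    print(f"{h} hours -> class {label} (P(pass) = {prob:.2f})")
```

Unlike regression, the output here is a discrete label, and `predict_proba` exposes how confident the model is in that label.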
2.2 Unsupervised Learning¶
Learning patterns or structures in data without labels.
from sklearn.cluster import KMeans
import numpy as np
# Customer data (age, purchase amount)
X = np.array([[25, 100], [30, 150], [35, 120],
              [50, 300], [55, 350], [60, 400]])
# K-Means clustering
kmeans = KMeans(n_clusters=2, random_state=42)
labels = kmeans.fit_predict(X)
print(f"Cluster labels: {labels}")
print(f"Cluster centers:\n{kmeans.cluster_centers_}")
Main Algorithms:
- Clustering: K-Means, DBSCAN, hierarchical clustering
- Dimensionality Reduction: PCA, t-SNE
- Anomaly Detection: Isolation Forest
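The clustering example covers one family of unsupervised methods. For dimensionality reduction, a short PCA sketch on the iris features could look like this. The labels are deliberately ignored, and the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Use only the features; no labels are involved
X_iris = load_iris().data          # 150 samples, 4 features

# Project the 4 features onto the 2 directions of greatest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_iris)

explained = pca.explained_variance_ratio_
print(f"Reduced shape: {X_2d.shape}")
print(f"Variance kept by 2 components: {explained.sum():.2%}")
```

The explained variance ratio gives a quick check on how much information the lower-dimensional representation retains.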
2.3 Reinforcement Learning¶
Learning by interacting with an environment to maximize cumulative reward.
- Agent selects actions
- Receives rewards or penalties from environment
- Maximizes cumulative reward
Applications: Game AI, robot control, autonomous driving
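The agent, action, and reward loop described above can be sketched with a multi-armed bandit, one of the simplest reinforcement learning settings. This is an epsilon-greedy sketch under assumed values: the three win probabilities, the exploration rate, and the step count are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical environment: three slot machines with hidden win probabilities
true_win_probs = np.array([0.2, 0.5, 0.8])
n_arms = len(true_win_probs)

q_values = np.zeros(n_arms)              # agent's running reward estimate per arm
pull_counts = np.zeros(n_arms, dtype=int)
epsilon = 0.1                            # fraction of steps spent exploring
total_reward = 0.0

for step in range(2000):
    # Agent selects an action: explore at random or exploit the best estimate
    if rng.random() < epsilon:
        action = int(rng.integers(n_arms))
    else:
        action = int(np.argmax(q_values))

    # Environment returns a reward (1.0) or nothing (0.0)
    reward = float(rng.random() < true_win_probs[action])

    # Update the estimate for the chosen arm with an incremental mean
    pull_counts[action] += 1
    q_values[action] += (reward - q_values[action]) / pull_counts[action]
    total_reward += reward

print(f"Estimated values: {np.round(q_values, 2)}")
print(f"Pulls per arm: {pull_counts}")
```

The agent is never told which arm is best. It concentrates its pulls on the highest-paying arm only because that choice accumulates the most reward.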
3. Machine Learning Workflow¶
1. Problem Definition → 2. Data Collection → 3. Data Exploration (EDA) → 4. Data Preprocessing
→ 5. Model Training → 6. Model Selection → 7. Deployment/Monitoring
3.1 Basic Workflow Example¶
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
# 1. Load data
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# 3. Data preprocessing (scaling)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 4. Train model
model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)
# 5. Predict
y_pred = model.predict(X_test_scaled)
# 6. Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
4. Core Concepts¶
4.1 Train/Validation/Test Split¶
from sklearn.model_selection import train_test_split
# Split data into train (60%), validation (20%), test (20%)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42  # 0.25 * 0.8 = 0.2
)
print(f"Train: {len(X_train)}, Validation: {len(X_val)}, Test: {len(X_test)}")
- Training data: Used for model training
- Validation data: Used for hyperparameter tuning
- Test data: Used for final performance evaluation (only once)
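To make the three roles concrete, the sketch below tunes one hyperparameter on the validation set and touches the test set only once at the end. The choice of k-nearest neighbors and the candidate values of `k` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 60% train, 20% validation, 20% test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

# Hyperparameter tuning: compare candidates on the validation set only
candidate_k = [1, 3, 5, 7, 9]
val_scores = {}
for k in candidate_k:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    val_scores[k] = knn.score(X_val, y_val)

best_k = max(val_scores, key=val_scores.get)

# Final evaluation: the test set is used once, after k has been chosen
final_model = KNeighborsClassifier(n_neighbors=best_k)
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)

print(f"Validation scores: {val_scores}")
print(f"Best k: {best_k}, test accuracy: {test_accuracy:.4f}")
```

If the test set were also used to pick `k`, the reported accuracy would be optimistically biased, which is why the two held-out sets are kept separate.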
4.2 Overfitting and Underfitting¶
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Generate data
np.random.seed(42)
X = np.sort(np.random.rand(20, 1) * 6, axis=0)
y = np.sin(X).ravel() + np.random.randn(20) * 0.1
# Models with different complexity
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
degrees = [1, 4, 15]
titles = ['Underfitting', 'Good Fit', 'Overfitting']
for ax, degree, title in zip(axes, degrees, titles):
    # Expand X into polynomial features of the given degree
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X)
    model = LinearRegression()
    model.fit(X_poly, y)
    # Evaluate the fitted curve on a dense grid for plotting
    X_plot = np.linspace(0, 6, 100).reshape(-1, 1)
    X_plot_poly = poly.transform(X_plot)
    y_plot = model.predict(X_plot_poly)
    ax.scatter(X, y, color='blue', alpha=0.7)
    ax.plot(X_plot, y_plot, color='red', linewidth=2)
    ax.set_title(title)
    ax.set_xlabel('X')
    ax.set_ylabel('y')
plt.tight_layout()
plt.show()
- Underfitting: The model is too simple to capture the pattern, so it performs poorly even on the training data
- Overfitting: The model fits the training data too closely (including its noise) and fails to generalize to new data
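Plots show the effect visually. A numeric check is to compare training and test error for each model complexity. The sketch below assumes the same kind of noisy sine data, with a larger sample so that half of it can be held out. The sample size and the 50/50 split are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy sine data
rng = np.random.default_rng(42)
X = np.sort(rng.random((60, 1)) * 6, axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)

train_mse, test_mse = {}, {}
for degree in [1, 4, 15]:
    # Polynomial features followed by ordinary least squares
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(X_train))
    test_mse[degree] = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse[degree]:.4f}  "
          f"test MSE={test_mse[degree]:.4f}")
```

An underfit model tends to show high error on both sets, while an overfit model tends to show a low training error paired with a noticeably higher test error.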
4.3 Bias-Variance Tradeoff¶
Total Error = Bias² + Variance + Noise
Bias: Error from overly simple modeling assumptions
Variance: Sensitivity of the model to changes in the training data
High bias → Underfitting
High variance → Overfitting
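The decomposition can be estimated empirically by refitting the same model class on many independently drawn training sets. The simulation below is a sketch under assumed settings: a sine target, Gaussian noise, and polynomial degrees 1 and 7 standing in for a simple and a flexible model. The helper name `estimate_bias_variance` is hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
noise_std = 0.3
x_eval = np.linspace(1, 5, 50)           # points where predictions are compared
true_values = np.sin(x_eval)

def estimate_bias_variance(degree, n_repeats=200, n_samples=30):
    """Fit a polynomial of the given degree on many fresh training sets."""
    predictions = np.empty((n_repeats, x_eval.size))
    for i in range(n_repeats):
        x_train = rng.uniform(0, 6, n_samples)
        y_train = np.sin(x_train) + rng.normal(scale=noise_std, size=n_samples)
        fitted = Polynomial.fit(x_train, y_train, degree)
        predictions[i] = fitted(x_eval)
    mean_prediction = predictions.mean(axis=0)
    # Bias²: how far the average prediction is from the true function
    bias_sq = np.mean((mean_prediction - true_values) ** 2)
    # Variance: how much predictions move between training sets
    variance = np.mean(predictions.var(axis=0))
    return bias_sq, variance

bias_sq_simple, var_simple = estimate_bias_variance(degree=1)
bias_sq_flexible, var_flexible = estimate_bias_variance(degree=7)

print(f"degree 1: bias²={bias_sq_simple:.4f}, variance={var_simple:.4f}")
print(f"degree 7: bias²={bias_sq_flexible:.4f}, variance={var_flexible:.4f}")
```

The simple model should show the larger bias² and the flexible model the larger variance, which is the tradeoff the formula describes.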
4.4 Feature Scaling¶
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Example data
X = np.array([[100, 0.001], [200, 0.002], [300, 0.003]])
# StandardScaler (Z-score normalization)
scaler_std = StandardScaler()
X_std = scaler_std.fit_transform(X)
print("StandardScaler result:")
print(X_std)
# MinMaxScaler (0-1 normalization)
scaler_minmax = MinMaxScaler()
X_minmax = scaler_minmax.fit_transform(X)
print("\nMinMaxScaler result:")
print(X_minmax)
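Both scalers apply simple column-wise formulas. As a sanity check, the sketch below recomputes them by hand with NumPy and compares the results with scikit-learn. The `_manual` variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[100, 0.001], [200, 0.002], [300, 0.003]])

# Z-score: subtract the column mean, divide by the column standard deviation.
# StandardScaler uses the population standard deviation (ddof=0),
# which is also the default for np.std.
X_std_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max: shift by the column minimum, divide by the column range
X_minmax_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(np.allclose(X_std_manual, StandardScaler().fit_transform(X)))
print(np.allclose(X_minmax_manual, MinMaxScaler().fit_transform(X)))
```

After either transformation the two columns hold identical values, even though their original magnitudes differ by a factor of 100,000.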
5. sklearn Basic API¶
5.1 Estimator Interface¶
# All sklearn estimators follow the same interface
# (X_train, X_test, y_train, y_test come from an earlier train_test_split)
from sklearn.ensemble import RandomForestClassifier
# 1. Create model
model = RandomForestClassifier(n_estimators=100, random_state=42)
# 2. Train (fit)
model.fit(X_train, y_train)
# 3. Predict (predict)
y_pred = model.predict(X_test)
# 4. Predict probability (predict_proba) - classification models
y_proba = model.predict_proba(X_test)
# 5. Score (score)
accuracy = model.score(X_test, y_test)
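One practical consequence of the shared interface is that different models can be swapped in a loop without changing the surrounding code. The sketch below assumes the iris dataset and three arbitrarily chosen classifiers.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Three different algorithms, one identical calling pattern
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                  # same fit call
    scores[name] = model.score(X_test, y_test)   # same score call
    print(f"{name}: {scores[name]:.4f}")
```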
5.2 Transformer Interface¶
from sklearn.preprocessing import StandardScaler
# 1. Create transformer
scaler = StandardScaler()
# 2. Fit (fit)
scaler.fit(X_train)
# 3. Transform (transform)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# fit + transform in one step (training data only)
X_train_scaled = scaler.fit_transform(X_train)
# Warning: never fit on test data; reuse the training statistics via transform
X_test_scaled = scaler.transform(X_test)
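One way to make the fit-on-train-only rule hard to break is to chain the transformer and the model in a `Pipeline`. The sketch below assumes the iris dataset. The step names `"scaler"` and `"clf"` are arbitrary labels.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits the scaler and the classifier together
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=200)),
])

# During cross-validation the scaler is refit on each training fold only
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)

# fit() scales and trains on X_train; score() only transforms X_test
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)

print(f"CV accuracy: {cv_scores.mean():.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
```

With a pipeline, the scaler never sees the held-out data during fitting, so the separate `fit` and `transform` calls do not have to be managed by hand.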
6. Datasets¶
6.1 sklearn Built-in Datasets¶
from sklearn.datasets import (
    load_iris,            # Classification (3 classes)
    load_digits,          # Classification (10 classes)
    load_breast_cancer,   # Binary classification
    # load_boston was removed in scikit-learn 1.2; importing it raises ImportError
    load_diabetes,        # Regression
    make_classification,  # Synthetic classification data
    make_regression,      # Synthetic regression data
)
# Iris dataset
iris = load_iris()
print(f"Features: {iris.feature_names}")
print(f"Targets: {iris.target_names}")
print(f"Data shape: {iris.data.shape}")
# Generate synthetic data
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    n_classes=2,
    random_state=42
)
print(f"Synthetic data shape: {X.shape}")
6.2 Load External Data¶
import pandas as pd
# CSV
df = pd.read_csv('data.csv')
# Separate features and target
X = df.drop('target', axis=1)
y = df['target']
# Kaggle data (example)
# !pip install kaggle
# !kaggle datasets download -d username/dataset-name
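The CSV snippet above needs a `data.csv` file on disk. To try the same steps without one, the sketch below reads CSV text from memory. The column names and values are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """area,rooms,target
50,2,0
60,2,0
90,3,1
100,4,1
"""

# read_csv accepts any file-like object, not only a path
df = pd.read_csv(io.StringIO(csv_text))

# Separate features and target
X = df.drop(columns="target")
y = df["target"]

print(X.shape)
print(y.tolist())
```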
7. Machine Learning Project Template¶
"""
Basic Machine Learning Project Template
"""
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report
# 1. Load data
# df = pd.read_csv('data.csv')
# X = df.drop('target', axis=1)
# y = df['target']
# 2. Exploratory Data Analysis (EDA)
# print(df.info())
# print(df.describe())
# print(df['target'].value_counts())
# 3. Data preprocessing
# - Handle missing values
# - Encoding
# - Scaling
# 4. Split data
# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42, stratify=y
# )
# 5. Model selection and training
# from sklearn.ensemble import RandomForestClassifier
# model = RandomForestClassifier(random_state=42)
# model.fit(X_train, y_train)
# 6. Cross-validation
# cv_scores = cross_val_score(model, X_train, y_train, cv=5)
# print(f"CV score: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
# 7. Hyperparameter tuning
# from sklearn.model_selection import GridSearchCV
# param_grid = {'n_estimators': [50, 100, 200]}
# grid_search = GridSearchCV(model, param_grid, cv=5)
# grid_search.fit(X_train, y_train)
# 8. Final evaluation (use the tuned model from the grid search)
# best_model = grid_search.best_estimator_
# y_pred = best_model.predict(X_test)
# print(classification_report(y_test, y_pred))
# 9. Save model
# import joblib
# joblib.dump(best_model, 'model.pkl')
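For reference, a condensed, runnable version of the template might look like the sketch below. It makes several assumptions: the built-in breast cancer dataset replaces `data.csv`, the EDA and preprocessing steps are skipped, the grid is kept very small, and the model is saved to a temporary directory.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# Load data (built-in dataset used in place of a CSV file)
X, y = load_breast_cancer(return_X_y=True)

# Split data, keeping class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Hyperparameter tuning with cross-validation on the training set
param_grid = {"n_estimators": [50, 100]}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=3
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")

# Final evaluation with the tuned model
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(classification_report(y_test, best_model.predict(X_test)))

# Save the model, then reload it to confirm the round trip works
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(best_model, model_path)
reloaded = joblib.load(model_path)
```

`GridSearchCV` refits the best configuration on the full training set by default, so `best_estimator_` can be evaluated and saved directly.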
Practice Problems¶
Problem 1: Data Splitting¶
Split iris data into 80:20 while maintaining class proportions.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X, y = iris.data, iris.target
# Solution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training data: {len(X_train)}")
print(f"Test data: {len(X_test)}")
print(f"Test class distribution: {np.bincount(y_test)}")
Problem 2: Basic Model Training¶
Train a logistic regression model and compute accuracy.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Solution
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
Summary¶
| Concept | Description |
|---|---|
| Supervised Learning | Learn from labeled data (regression, classification) |
| Unsupervised Learning | Learn patterns without labels (clustering, dimensionality reduction) |
| Reinforcement Learning | Learn by interaction to maximize rewards |
| Overfitting | Model fits training data too closely and fails to generalize |
| Underfitting | Model too simple to capture the pattern, even on training data |
| Bias-Variance | Tradeoff between model complexity and generalization |