{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b2c3d4e5-0001-4000-8000-000000000001",
   "metadata": {},
   "source": [
    "# Regression Analysis\n",
    "\n",
    "Regression analysis models the relationship between a **response variable** (dependent) and one or more **predictor variables** (independent). It is the workhorse of statistical modelling and machine learning.\n",
    "\n",
    "**Topics covered in this notebook:**\n",
    "1. Simple linear regression — fitting a line, OLS estimation\n",
    "2. Interpreting coefficients — R², p-values, residual diagnostics\n",
    "3. Multiple linear regression — multiple predictors\n",
    "4. Polynomial regression — non-linear patterns, bias-variance tradeoff\n",
    "5. Logistic regression — binary outcomes, sigmoid function\n",
    "6. Regression assumptions — linearity, normality, homoscedasticity\n",
    "\n",
    "**Mathematical foundation — Ordinary Least Squares (OLS):**\n",
    "\n",
    "$$\\hat{\\boldsymbol{\\beta}} = (\\mathbf{X}^\\top \\mathbf{X})^{-1} \\mathbf{X}^\\top \\mathbf{y}$$\n",
    "\n",
    "OLS minimises the **residual sum of squares (RSS)** $\\sum_{i=1}^{n}(y_i - \\hat{y}_i)^2$ and is the **BLUE** (Best Linear Unbiased Estimator) under the Gauss-Markov assumptions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2c3d4e5-0002-4000-8000-000000000002",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import scipy\n",
    "import scipy.stats as stats\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib.gridspec as gridspec\n",
    "\n",
    "# Optional: scikit-learn (used where noted; gracefully falls back to manual)\n",
    "try:\n",
    "    from sklearn.linear_model import LinearRegression, LogisticRegression\n",
    "    from sklearn.preprocessing import PolynomialFeatures, StandardScaler\n",
    "    from sklearn.pipeline import make_pipeline\n",
    "    from sklearn.metrics import r2_score\n",
    "    SKLEARN = True\n",
    "    print('scikit-learn is available — using sklearn where convenient.')\n",
    "except ImportError:\n",
    "    SKLEARN = False\n",
    "    print('scikit-learn not found — using NumPy/SciPy only.')\n",
    "\n",
    "rng = np.random.default_rng(seed=0)\n",
    "\n",
    "plt.rcParams.update({\n",
    "    'figure.dpi': 100,\n",
    "    'axes.spines.top': False,\n",
    "    'axes.spines.right': False,\n",
    "    'font.size': 11,\n",
    "})\n",
    "\n",
    "print('NumPy:', np.__version__)\n",
    "print('SciPy:', scipy.__version__)"
   ]
  },
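  {
   "cell_type": "markdown",
   "id": "b2c3d4e5-0101-4000-8000-000000000101",
   "metadata": {},
   "source": [
    "**Sanity check (optional).** The closed-form OLS solution quoted in the introduction can be verified numerically. The sketch below is self-contained on synthetic data; in practice `np.linalg.lstsq` (QR/SVD based) is preferred over explicitly inverting $\\mathbf{X}^\\top\\mathbf{X}$, which can be numerically unstable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2c3d4e5-0102-4000-8000-000000000102",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Verify the normal-equations formula (XᵀX)⁻¹Xᵀy against np.linalg.lstsq\n",
    "import numpy as np\n",
    "\n",
    "rng_check = np.random.default_rng(42)\n",
    "X_check = np.column_stack([np.ones(50), rng_check.normal(size=50)])   # intercept + 1 predictor\n",
    "y_check = X_check @ np.array([1.0, 2.0]) + rng_check.normal(0, 0.1, size=50)\n",
    "\n",
    "beta_normal_eq = np.linalg.inv(X_check.T @ X_check) @ X_check.T @ y_check\n",
    "beta_lstsq = np.linalg.lstsq(X_check, y_check, rcond=None)[0]\n",
    "\n",
    "print('Normal equations:', beta_normal_eq)\n",
    "print('lstsq           :', beta_lstsq)\n",
    "assert np.allclose(beta_normal_eq, beta_lstsq)"
   ]
  },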
  {
   "cell_type": "markdown",
   "id": "b2c3d4e5-0003-4000-8000-000000000003",
   "metadata": {},
   "source": [
    "## 1. Simple Linear Regression\n",
    "\n",
    "The simple linear model relates a single predictor $x$ to the response $y$:\n",
    "\n",
    "$$y_i = \\beta_0 + \\beta_1 x_i + \\varepsilon_i, \\qquad \\varepsilon_i \\overset{\\text{i.i.d.}}{\\sim} \\mathcal{N}(0, \\sigma^2)$$\n",
    "\n",
    "OLS closed-form estimates:\n",
    "\n",
    "$$\\hat{\\beta}_1 = \\frac{\\sum(x_i - \\bar{x})(y_i - \\bar{y})}{\\sum(x_i - \\bar{x})^2} = \\frac{S_{xy}}{S_{xx}}, \\qquad \\hat{\\beta}_0 = \\bar{y} - \\hat{\\beta}_1 \\bar{x}$$\n",
    "\n",
    "The **95% confidence band** around the regression line reflects uncertainty in the estimated mean response $E[y \\mid x]$. The wider **prediction interval** additionally accounts for individual observation noise $\\sigma^2$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2c3d4e5-0004-4000-8000-000000000004",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scenario: predict house price (€ thousands) from size (m²)\n",
    "n = 80\n",
    "x = rng.uniform(40, 200, size=n)     # house size (m²)\n",
    "true_beta0, true_beta1 = 30.0, 2.5   # true intercept, slope\n",
    "noise_std = 25.0\n",
    "y = true_beta0 + true_beta1 * x + rng.normal(0, noise_std, size=n)\n",
    "\n",
    "# OLS via scipy.stats.linregress\n",
    "slope, intercept, r_value, p_value, stderr = stats.linregress(x, y)\n",
    "\n",
    "print(f'Fitted intercept β₀ : {intercept:.4f} (true: {true_beta0})')\n",
    "print(f'Fitted slope β₁     : {slope:.4f} (true: {true_beta1})')\n",
    "print(f'R                   : {r_value:.4f}')\n",
    "print(f'R²                  : {r_value**2:.4f}')\n",
    "print(f'p-value (slope)     : {p_value:.6f}')\n",
    "print(f'Std error (slope)   : {stderr:.4f}')\n",
    "\n",
    "# Predictions + 95% confidence band (analytical formula)\n",
    "x_grid = np.linspace(x.min(), x.max(), 300)\n",
    "y_hat = intercept + slope * x_grid\n",
    "\n",
    "x_mean = x.mean()\n",
    "n_obs = len(x)\n",
    "Sxx = ((x - x_mean) ** 2).sum()\n",
    "residuals = y - (intercept + slope * x)\n",
    "s_e = np.sqrt((residuals ** 2).sum() / (n_obs - 2))   # residual std error\n",
    "\n",
    "t_crit = stats.t.ppf(0.975, df=n_obs - 2)\n",
    "se_mean = s_e * np.sqrt(1 / n_obs + (x_grid - x_mean) ** 2 / Sxx)\n",
    "se_pred = s_e * np.sqrt(1 + 1 / n_obs + (x_grid - x_mean) ** 2 / Sxx)\n",
    "\n",
    "ci_lower = y_hat - t_crit * se_mean\n",
    "ci_upper = y_hat + t_crit * se_mean\n",
    "pi_lower = y_hat - t_crit * se_pred\n",
    "pi_upper = y_hat + t_crit * se_pred\n",
    "\n",
    "# Plot\n",
    "fig, ax = plt.subplots(figsize=(8, 5))\n",
    "ax.scatter(x, y, color='steelblue', alpha=0.6, s=30, label='Observations')\n",
    "ax.plot(x_grid, y_hat, 'r-', lw=2, label=f'OLS fit: y = {intercept:.1f} + {slope:.2f}x')\n",
    "ax.fill_between(x_grid, ci_lower, ci_upper, color='red', alpha=0.15,\n",
    "                label='95% confidence band')\n",
    "ax.fill_between(x_grid, pi_lower, pi_upper, color='orange', alpha=0.10,\n",
    "                label='95% prediction interval')\n",
    "ax.set_xlabel('House size (m²)')\n",
    "ax.set_ylabel('Price (€ thousands)')\n",
    "ax.set_title(f'Simple Linear Regression | R²={r_value**2:.3f}, p={p_value:.4f}')\n",
    "ax.legend()\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
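  {
   "cell_type": "markdown",
   "id": "b2c3d4e5-0103-4000-8000-000000000103",
   "metadata": {},
   "source": [
    "The $S_{xy}/S_{xx}$ closed-form estimates from Section 1 can also be computed by hand. This minimal self-contained check (fresh synthetic data with hypothetical coefficients 2.0 and 0.7) confirms they agree with `scipy.stats.linregress`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2c3d4e5-0104-4000-8000-000000000104",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hand-computed closed-form OLS estimates vs scipy.stats.linregress\n",
    "import numpy as np\n",
    "from scipy import stats as _stats\n",
    "\n",
    "rng_sxy = np.random.default_rng(1)\n",
    "xs = rng_sxy.uniform(0, 10, 50)\n",
    "ys = 2.0 + 0.7 * xs + rng_sxy.normal(0, 0.5, 50)\n",
    "\n",
    "Sxy = ((xs - xs.mean()) * (ys - ys.mean())).sum()\n",
    "Sxx_manual = ((xs - xs.mean()) ** 2).sum()\n",
    "b1_manual = Sxy / Sxx_manual\n",
    "b0_manual = ys.mean() - b1_manual * xs.mean()\n",
    "\n",
    "fit = _stats.linregress(xs, ys)\n",
    "print(f'Manual    : β₀={b0_manual:.4f}, β₁={b1_manual:.4f}')\n",
    "print(f'linregress: β₀={fit.intercept:.4f}, β₁={fit.slope:.4f}')\n",
    "assert np.isclose(b1_manual, fit.slope) and np.isclose(b0_manual, fit.intercept)"
   ]
  },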
  {
   "cell_type": "markdown",
   "id": "b2c3d4e5-0005-4000-8000-000000000005",
   "metadata": {},
   "source": [
    "## 2. Interpreting Coefficients: R², Residuals, and Q-Q Plot\n",
    "\n",
    "**R² (coefficient of determination)** measures the proportion of variance in $y$ explained by the model:\n",
    "\n",
    "$$R^2 = 1 - \\frac{\\text{RSS}}{\\text{TSS}} = 1 - \\frac{\\sum(y_i - \\hat{y}_i)^2}{\\sum(y_i - \\bar{y})^2}$$\n",
    "\n",
    "**Residual diagnostics** are essential for validating OLS assumptions:\n",
    "\n",
    "| Plot | What to look for | Violation suggests |\n",
    "|------|------------------|--------------------|\n",
    "| Residuals vs Fitted | Random scatter around 0 | Non-linearity or heteroscedasticity |\n",
    "| Q-Q plot | Points on diagonal | Non-normality of errors |\n",
    "| Scale-Location | Horizontal band | Heteroscedasticity |\n",
    "| Residuals vs Leverage | No high-leverage outliers | Influential observations |\n",
    "\n",
    "The **p-value of the slope** tests H₀: β₁ = 0 (no linear relationship). Reject when p < α."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2c3d4e5-0006-4000-8000-000000000006",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Re-use the house price data fitted above\n",
    "y_fitted = intercept + slope * x\n",
    "residuals = y - y_fitted\n",
    "std_resid = residuals / s_e   # standardised residuals (not leverage-adjusted)\n",
    "\n",
    "RSS = (residuals ** 2).sum()\n",
    "TSS = ((y - y.mean()) ** 2).sum()\n",
    "R2 = 1 - RSS / TSS\n",
    "adj_R2 = 1 - (1 - R2) * (n_obs - 1) / (n_obs - 2)\n",
    "\n",
    "print(f'R²          : {R2:.4f}')\n",
    "print(f'Adjusted R² : {adj_R2:.4f}')\n",
    "print(f'RSS         : {RSS:.2f}')\n",
    "print(f'Residual SE : {s_e:.4f}')\n",
    "\n",
    "# Shapiro-Wilk test on residuals\n",
    "sw_stat, sw_p = stats.shapiro(residuals)\n",
    "print(f'\\nShapiro-Wilk test on residuals: W={sw_stat:.4f}, p={sw_p:.4f}')\n",
    "print('Residuals appear normal.' if sw_p > 0.05 else 'Residuals may NOT be normal.')\n",
    "\n",
    "fig = plt.figure(figsize=(11, 8))\n",
    "gs = gridspec.GridSpec(2, 2, hspace=0.40, wspace=0.35)\n",
    "\n",
    "# 1. Residuals vs Fitted\n",
    "ax1 = fig.add_subplot(gs[0, 0])\n",
    "ax1.scatter(y_fitted, residuals, alpha=0.6, color='steelblue', s=25)\n",
    "ax1.axhline(0, color='red', lw=1.5, ls='--')\n",
    "ax1.set_xlabel('Fitted values'); ax1.set_ylabel('Residuals')\n",
    "ax1.set_title('Residuals vs Fitted')\n",
    "\n",
    "# 2. Q-Q plot\n",
    "ax2 = fig.add_subplot(gs[0, 1])\n",
    "(osm, osr), (slope_qq, intercept_qq, r_qq) = stats.probplot(residuals, dist='norm')\n",
    "ax2.scatter(osm, osr, alpha=0.6, color='steelblue', s=25)\n",
    "ax2.plot(osm, slope_qq * np.array(osm) + intercept_qq, 'r-', lw=1.5)\n",
    "ax2.set_xlabel('Theoretical quantiles'); ax2.set_ylabel('Sample quantiles')\n",
    "ax2.set_title('Normal Q-Q Plot')\n",
    "\n",
    "# 3. Scale-Location (sqrt |standardised residuals| vs fitted)\n",
    "ax3 = fig.add_subplot(gs[1, 0])\n",
    "ax3.scatter(y_fitted, np.sqrt(np.abs(std_resid)), alpha=0.6, color='steelblue', s=25)\n",
    "ax3.set_xlabel('Fitted values'); ax3.set_ylabel('√|Standardised residuals|')\n",
    "ax3.set_title('Scale-Location')\n",
    "\n",
    "# 4. Residual histogram\n",
    "ax4 = fig.add_subplot(gs[1, 1])\n",
    "ax4.hist(residuals, bins=14, color='steelblue', edgecolor='white', alpha=0.8, density=True)\n",
    "xr = np.linspace(residuals.min(), residuals.max(), 200)\n",
    "ax4.plot(xr, stats.norm.pdf(xr, residuals.mean(), residuals.std()), 'r-', lw=2)\n",
    "ax4.set_xlabel('Residual'); ax4.set_ylabel('Density')\n",
    "ax4.set_title('Residual Distribution')\n",
    "\n",
    "fig.suptitle(f'Residual Diagnostics | R²={R2:.3f}', fontsize=13, y=1.01)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2c3d4e5-0007-4000-8000-000000000007",
   "metadata": {},
   "source": [
    "## 3. Multiple Linear Regression\n",
    "\n",
    "With $p$ predictors the model becomes:\n",
    "\n",
    "$$\\mathbf{y} = \\mathbf{X}\\boldsymbol{\\beta} + \\boldsymbol{\\varepsilon}, \\qquad \\hat{\\boldsymbol{\\beta}} = (\\mathbf{X}^\\top\\mathbf{X})^{-1}\\mathbf{X}^\\top\\mathbf{y}$$\n",
    "\n",
    "where $\\mathbf{X}$ is the $n \\times (p+1)$ design matrix (first column all ones for the intercept).\n",
    "\n",
    "**Key considerations:**\n",
    "- **Multicollinearity**: when predictors are correlated, coefficient estimates become unstable. Check with Variance Inflation Factor (VIF).\n",
    "- **Adjusted R²**: penalises adding uninformative predictors — always report alongside R².\n",
    "- **F-statistic**: tests whether *all* coefficients (except intercept) are simultaneously zero.\n",
    "\n",
    "$$F = \\frac{(\\text{TSS} - \\text{RSS})/p}{\\text{RSS}/(n - p - 1)}$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2c3d4e5-0008-4000-8000-000000000008",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scenario: predict salary (€k) from years of experience, education level (0-3), and performance score\n",
    "n = 120\n",
    "experience = rng.uniform(0, 30, n)\n",
    "education = rng.integers(0, 4, n).astype(float)\n",
    "performance = rng.uniform(50, 100, n)\n",
    "\n",
    "# True model: salary = 25 + 1.8*exp + 5*edu + 0.3*perf + noise\n",
    "true_betas = np.array([25.0, 1.8, 5.0, 0.3])\n",
    "noise = rng.normal(0, 8, n)\n",
    "y = true_betas[0] + true_betas[1]*experience + true_betas[2]*education + true_betas[3]*performance + noise\n",
    "\n",
    "# Build design matrix (with intercept column)\n",
    "X_raw = np.column_stack([experience, education, performance])\n",
    "ones = np.ones((n, 1))\n",
    "X = np.hstack([ones, X_raw])   # shape: (n, 4)\n",
    "\n",
    "# OLS fit via np.linalg.lstsq (solves the least-squares problem; numerically stabler than explicitly inverting XᵀX)\n",
    "beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]\n",
    "\n",
    "y_hat = X @ beta_hat\n",
    "residuals = y - y_hat\n",
    "p_pred = X.shape[1] - 1   # number of predictors (excl. intercept)\n",
    "RSS = (residuals ** 2).sum()\n",
    "TSS = ((y - y.mean()) ** 2).sum()\n",
    "R2 = 1 - RSS / TSS\n",
    "adj_R2 = 1 - (1 - R2) * (n - 1) / (n - p_pred - 1)\n",
    "s2 = RSS / (n - p_pred - 1)\n",
    "\n",
    "# Standard errors of coefficients\n",
    "XtX_inv = np.linalg.inv(X.T @ X)\n",
    "se_betas = np.sqrt(np.diag(XtX_inv) * s2)\n",
    "\n",
    "# t-statistics and p-values\n",
    "t_stats = beta_hat / se_betas\n",
    "df_res = n - p_pred - 1\n",
    "p_vals = 2 * stats.t.sf(np.abs(t_stats), df=df_res)\n",
    "\n",
    "# F-statistic for overall model significance\n",
    "F_stat = ((TSS - RSS) / p_pred) / (RSS / df_res)\n",
    "F_pval = stats.f.sf(F_stat, p_pred, df_res)\n",
    "\n",
    "print('Multiple Linear Regression Results')\n",
    "print('=' * 55)\n",
    "print(f'R²: {R2:.4f}   Adjusted R²: {adj_R2:.4f}')\n",
    "print(f'F-statistic: {F_stat:.2f} (p={F_pval:.2e})')\n",
    "print(f'n={n}, p={p_pred}')\n",
    "print()\n",
    "labels = ['Intercept', 'Experience', 'Education', 'Performance']\n",
    "print(f'{\"Variable\":<14} {\"Estimate\":>10} {\"Std Err\":>10} {\"t-stat\":>10} {\"p-value\":>12}')\n",
    "print('-' * 58)\n",
    "for lbl, b, se, t, p in zip(labels, beta_hat, se_betas, t_stats, p_vals):\n",
    "    sig = '***' if p < 0.001 else ('**' if p < 0.01 else ('*' if p < 0.05 else ''))\n",
    "    print(f'{lbl:<14} {b:>10.4f} {se:>10.4f} {t:>10.4f} {p:>12.4f} {sig}')\n",
    "\n",
    "# Coefficient plot\n",
    "fig, ax = plt.subplots(figsize=(7, 4))\n",
    "ci_mult = stats.t.ppf(0.975, df=df_res)\n",
    "ci_half = ci_mult * se_betas[1:]   # skip intercept\n",
    "y_pos = np.arange(p_pred)\n",
    "ax.barh(y_pos, beta_hat[1:], xerr=ci_half, color='steelblue', alpha=0.7,\n",
    "        edgecolor='white', capsize=4)\n",
    "ax.axvline(0, color='black', lw=1)\n",
    "ax.set_yticks(y_pos)\n",
    "ax.set_yticklabels(['Experience', 'Education', 'Performance'])\n",
    "ax.set_xlabel('Coefficient estimate (with 95% CI)')\n",
    "ax.set_title('Multiple Regression Coefficients')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
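  {
   "cell_type": "markdown",
   "id": "b2c3d4e5-0105-4000-8000-000000000105",
   "metadata": {},
   "source": [
    "Section 3 flags multicollinearity but does not compute the VIF. The sketch below is a minimal self-contained implementation (hypothetical `vif` helper on synthetic predictors `a`, `b`, `c`): each $\\text{VIF}_j = 1/(1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on the remaining predictors. Values above roughly 10 are commonly flagged as problematic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2c3d4e5-0106-4000-8000-000000000106",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Variance Inflation Factor: VIF_j = 1 / (1 - R_j²)\n",
    "import numpy as np\n",
    "\n",
    "def vif(X_pred):\n",
    "    \"\"\"VIF for each column of X_pred (predictors only, no intercept column).\"\"\"\n",
    "    n_rows, p_cols = X_pred.shape\n",
    "    out = []\n",
    "    for j in range(p_cols):\n",
    "        others = np.delete(X_pred, j, axis=1)\n",
    "        A = np.column_stack([np.ones(n_rows), others])   # regress column j on the rest\n",
    "        coef, *_ = np.linalg.lstsq(A, X_pred[:, j], rcond=None)\n",
    "        resid_j = X_pred[:, j] - A @ coef\n",
    "        r2_j = 1 - (resid_j ** 2).sum() / ((X_pred[:, j] - X_pred[:, j].mean()) ** 2).sum()\n",
    "        out.append(1.0 / (1.0 - r2_j))\n",
    "    return np.array(out)\n",
    "\n",
    "rng_vif = np.random.default_rng(7)\n",
    "a = rng_vif.normal(size=200)\n",
    "b = a + 0.1 * rng_vif.normal(size=200)   # nearly collinear with a -> large VIF\n",
    "c = rng_vif.normal(size=200)             # independent -> VIF near 1\n",
    "for name, v in zip(['a', 'b', 'c'], vif(np.column_stack([a, b, c]))):\n",
    "    print(f'VIF({name}) = {v:.2f}')"
   ]
  },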
  {
   "cell_type": "markdown",
   "id": "b2c3d4e5-0009-4000-8000-000000000009",
   "metadata": {},
   "source": [
    "## 4. Polynomial Regression\n",
    "\n",
    "When the relationship between $x$ and $y$ is non-linear, we can extend OLS by including polynomial features:\n",
    "\n",
    "$$y = \\beta_0 + \\beta_1 x + \\beta_2 x^2 + \\cdots + \\beta_d x^d + \\varepsilon$$\n",
    "\n",
    "This is still *linear* in the parameters $\\boldsymbol{\\beta}$ — just a linear model in a transformed feature space.\n",
    "\n",
    "**Bias-variance tradeoff:**\n",
    "- Low-degree polynomial → high bias (underfitting)\n",
    "- High-degree polynomial → high variance (overfitting)\n",
    "- The optimal degree minimises prediction error on unseen data (use cross-validation)\n",
    "\n",
    "The training R² always increases with degree, so use **adjusted R²** or **AIC/BIC** to compare models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2c3d4e5-0010-4000-8000-000000000010",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scenario: model a noisy sinusoidal relationship\n",
    "n = 60\n",
    "x_poly = rng.uniform(-3, 3, size=n)\n",
    "y_poly = np.sin(x_poly) + rng.normal(0, 0.3, size=n)\n",
    "\n",
    "x_grid_poly = np.linspace(-3.2, 3.2, 300)\n",
    "degrees = [1, 3, 7, 15]\n",
    "\n",
    "fig, axes = plt.subplots(2, 2, figsize=(11, 8), sharey=True)\n",
    "axes = axes.ravel()\n",
    "\n",
    "r2_scores = []\n",
    "for ax, deg in zip(axes, degrees):\n",
    "    # Build Vandermonde matrix for training data and grid\n",
    "    X_train = np.vander(x_poly, N=deg + 1, increasing=True)\n",
    "    X_grid = np.vander(x_grid_poly, N=deg + 1, increasing=True)\n",
    "\n",
    "    coeffs = np.linalg.lstsq(X_train, y_poly, rcond=None)[0]\n",
    "    y_pred_train = X_train @ coeffs\n",
    "    y_pred_grid = X_grid @ coeffs\n",
    "\n",
    "    rss = ((y_poly - y_pred_train) ** 2).sum()\n",
    "    tss = ((y_poly - y_poly.mean()) ** 2).sum()\n",
    "    r2 = 1 - rss / tss\n",
    "    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - deg - 1)\n",
    "    r2_scores.append(r2)\n",
    "\n",
    "    ax.scatter(x_poly, y_poly, color='steelblue', alpha=0.5, s=20, label='Data')\n",
    "    ax.plot(x_grid_poly, np.sin(x_grid_poly), 'g--', lw=1.5, label='True sin(x)')\n",
    "    ax.plot(x_grid_poly, y_pred_grid, 'r-', lw=2, label=f'Degree {deg}')\n",
    "    ax.set_ylim(-2.5, 2.5)\n",
    "    ax.set_title(f'Degree {deg} | R²={r2:.3f}, adj-R²={adj_r2:.3f}')\n",
    "    ax.set_xlabel('x')\n",
    "    ax.set_ylabel('y')\n",
    "    ax.legend(fontsize=9)\n",
    "\n",
    "plt.suptitle('Polynomial Regression: Underfitting → Overfitting', fontsize=13)\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print('Training R² by degree:')\n",
    "for d, r2 in zip(degrees, r2_scores):\n",
    "    bar = '#' * int(r2 * 40)\n",
    "    print(f'  Degree {d:>2}: R²={r2:.4f} |{bar}')"
   ]
  },
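  {
   "cell_type": "markdown",
   "id": "b2c3d4e5-0107-4000-8000-000000000107",
   "metadata": {},
   "source": [
    "Training R² alone cannot pick the degree, so here is a self-contained sketch of AIC/BIC comparison under a Gaussian likelihood. Assumptions in this sketch: $\\text{AIC} = n\\ln(\\text{RSS}/n) + 2k$ and $\\text{BIC} = n\\ln(\\text{RSS}/n) + k\\ln n$ up to an additive constant, with $k = \\text{degree} + 2$ counting the intercept and the noise variance (conventions vary across texts). Lower is better; the penalty term punishes the high-degree fits."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2c3d4e5-0108-4000-8000-000000000108",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare polynomial degrees with AIC/BIC (Gaussian likelihood, constants dropped)\n",
    "import numpy as np\n",
    "\n",
    "rng_ic = np.random.default_rng(3)\n",
    "x_ic = rng_ic.uniform(-3, 3, 60)\n",
    "y_ic = np.sin(x_ic) + rng_ic.normal(0, 0.3, 60)\n",
    "n_ic = len(x_ic)\n",
    "\n",
    "print(f'{\"degree\":>6} {\"RSS\":>10} {\"AIC\":>10} {\"BIC\":>10}')\n",
    "for deg in [1, 3, 7, 15]:\n",
    "    Xd = np.vander(x_ic, N=deg + 1, increasing=True)\n",
    "    coef, *_ = np.linalg.lstsq(Xd, y_ic, rcond=None)\n",
    "    rss = ((y_ic - Xd @ coef) ** 2).sum()\n",
    "    k = deg + 2   # polynomial coefficients + noise variance\n",
    "    aic = n_ic * np.log(rss / n_ic) + 2 * k\n",
    "    bic = n_ic * np.log(rss / n_ic) + k * np.log(n_ic)\n",
    "    print(f'{deg:>6} {rss:>10.3f} {aic:>10.2f} {bic:>10.2f}')"
   ]
  },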
  {
   "cell_type": "markdown",
   "id": "b2c3d4e5-0011-4000-8000-000000000011",
   "metadata": {},
   "source": [
    "## 5. Logistic Regression (Binary Classification)\n",
    "\n",
    "When the response is binary ($y \\in \\{0, 1\\}$), linear regression is inappropriate. **Logistic regression** models the log-odds:\n",
    "\n",
    "$$\\log\\frac{P(y=1 \\mid x)}{1 - P(y=1 \\mid x)} = \\beta_0 + \\beta_1 x$$\n",
    "\n",
    "Transforming back via the **sigmoid function**:\n",
    "\n",
    "$$P(y=1 \\mid x) = \\sigma(\\beta_0 + \\beta_1 x) = \\frac{1}{1 + e^{-(\\beta_0 + \\beta_1 x)}}$$\n",
    "\n",
    "Parameters are estimated by **maximum likelihood** (not OLS). The decision boundary is where $P = 0.5$, i.e., $\\beta_0 + \\beta_1 x = 0$.\n",
    "\n",
    "**Interpretation of coefficients:** $e^{\\beta_j}$ is the **odds ratio** — the factor by which the odds of $y=1$ multiply for a unit increase in $x_j$, holding other predictors constant."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2c3d4e5-0012-4000-8000-000000000012",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scenario: predict exam pass/fail from hours studied\n",
    "n_logit = 100\n",
    "hours = rng.uniform(0, 10, n_logit)\n",
    "\n",
    "# True logistic model: log-odds = -4 + 1.2 * hours\n",
    "true_b0, true_b1 = -4.0, 1.2\n",
    "log_odds = true_b0 + true_b1 * hours\n",
    "prob_pass = 1 / (1 + np.exp(-log_odds))\n",
    "passed = rng.binomial(1, prob_pass)   # 0 = fail, 1 = pass\n",
    "\n",
    "# Fit via scikit-learn (or manual gradient descent fallback)\n",
    "if SKLEARN:\n",
    "    X_logit = hours.reshape(-1, 1)\n",
    "    clf = LogisticRegression(solver='lbfgs', C=1e6)   # large C → minimal regularisation\n",
    "    clf.fit(X_logit, passed)\n",
    "    b0_hat = clf.intercept_[0]\n",
    "    b1_hat = clf.coef_[0, 0]\n",
    "    print(f'Fitted (sklearn): β₀={b0_hat:.4f}, β₁={b1_hat:.4f}')\n",
    "else:\n",
    "    # Manual gradient descent\n",
    "    def sigmoid(z): return 1 / (1 + np.exp(-z))\n",
    "    b0_hat, b1_hat = 0.0, 0.0\n",
    "    lr = 0.05\n",
    "    for _ in range(3000):\n",
    "        p = sigmoid(b0_hat + b1_hat * hours)\n",
    "        err = p - passed\n",
    "        b0_hat -= lr * err.mean()\n",
    "        b1_hat -= lr * (err * hours).mean()\n",
    "    print(f'Fitted (grad desc): β₀={b0_hat:.4f}, β₁={b1_hat:.4f}')\n",
    "\n",
    "print(f'True coefficients: β₀={true_b0}, β₁={true_b1}')\n",
    "\n",
    "# Decision boundary: b0 + b1*x = 0 → x = -b0/b1\n",
    "boundary = -b0_hat / b1_hat\n",
    "print(f'Decision boundary : {boundary:.2f} hours')\n",
    "print(f'Odds ratio e^β₁   : {np.exp(b1_hat):.4f}')\n",
    "\n",
    "# Visualise\n",
    "x_range = np.linspace(0, 10, 300)\n",
    "p_hat = 1 / (1 + np.exp(-(b0_hat + b1_hat * x_range)))\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n",
    "\n",
    "# Left: sigmoid curve\n",
    "ax = axes[0]\n",
    "ax.scatter(hours[passed == 0], passed[passed == 0], color='tomato', alpha=0.5, s=30,\n",
    "           label='Fail (0)', zorder=3)\n",
    "ax.scatter(hours[passed == 1], passed[passed == 1], color='steelblue', alpha=0.5, s=30,\n",
    "           label='Pass (1)', zorder=3)\n",
    "ax.plot(x_range, p_hat, 'k-', lw=2, label='Fitted P(pass)')\n",
    "ax.axvline(boundary, color='green', ls='--', lw=1.5, label=f'Boundary = {boundary:.1f}h')\n",
    "ax.axhline(0.5, color='gray', ls=':', lw=1)\n",
    "ax.set_xlabel('Hours studied')\n",
    "ax.set_ylabel('P(pass)')\n",
    "ax.set_title('Logistic Regression: Sigmoid Curve')\n",
    "ax.legend()\n",
    "\n",
    "# Right: log-odds plot\n",
    "ax = axes[1]\n",
    "ax.plot(x_range, b0_hat + b1_hat * x_range, 'purple', lw=2, label='Log-odds')\n",
    "ax.axhline(0, color='gray', ls='--')\n",
    "ax.axvline(boundary, color='green', ls='--', lw=1.5, label=f'Boundary ({boundary:.1f}h)')\n",
    "ax.fill_between(x_range, b0_hat + b1_hat * x_range, 0,\n",
    "                where=(b0_hat + b1_hat * x_range > 0), alpha=0.15, color='steelblue', label='Predict pass')\n",
    "ax.fill_between(x_range, b0_hat + b1_hat * x_range, 0,\n",
    "                where=(b0_hat + b1_hat * x_range < 0), alpha=0.15, color='tomato', label='Predict fail')\n",
    "ax.set_xlabel('Hours studied')\n",
    "ax.set_ylabel('Log-odds')\n",
    "ax.set_title('Log-odds and Decision Boundary')\n",
    "ax.legend(fontsize=9)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
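  {
   "cell_type": "markdown",
   "id": "b2c3d4e5-0109-4000-8000-000000000109",
   "metadata": {},
   "source": [
    "One way to quantify the logistic fit is classification accuracy at the $P = 0.5$ threshold. The sketch below is self-contained (fresh synthetic data with the same hypothetical coefficients $-4$ and $1.2$, fitted by plain gradient descent as in the fallback above); note that training accuracy is optimistic, and the Bayes-optimal accuracy is below 1 because outcomes near the boundary are genuinely random."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2c3d4e5-0110-4000-8000-000000000110",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Accuracy of the 0.5-threshold classifier on synthetic pass/fail data\n",
    "import numpy as np\n",
    "\n",
    "rng_acc = np.random.default_rng(5)\n",
    "hours_acc = rng_acc.uniform(0, 10, 200)\n",
    "p_true = 1 / (1 + np.exp(-(-4.0 + 1.2 * hours_acc)))\n",
    "y_acc = rng_acc.binomial(1, p_true)\n",
    "\n",
    "# Full-batch gradient descent on the mean log-loss\n",
    "b0a, b1a = 0.0, 0.0\n",
    "for _ in range(5000):\n",
    "    q = 1 / (1 + np.exp(-(b0a + b1a * hours_acc)))\n",
    "    err = q - y_acc\n",
    "    b0a -= 0.1 * err.mean()\n",
    "    b1a -= 0.1 * (err * hours_acc).mean()\n",
    "\n",
    "pred = (1 / (1 + np.exp(-(b0a + b1a * hours_acc))) >= 0.5).astype(int)\n",
    "accuracy = (pred == y_acc).mean()\n",
    "print(f'Fitted: β₀={b0a:.3f}, β₁={b1a:.3f}')\n",
    "print(f'Training accuracy at P=0.5 threshold: {accuracy:.3f}')"
   ]
  },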
  {
   "cell_type": "markdown",
   "id": "b2c3d4e5-0013-4000-8000-000000000013",
   "metadata": {},
   "source": [
    "## 6. Checking Regression Assumptions\n",
    "\n",
    "OLS inference is valid only when the Gauss-Markov assumptions hold. Violations affect different inferential properties:\n",
    "\n",
    "| Assumption | Consequence if violated | Diagnostic |\n",
    "|------------|------------------------|------------|\n",
    "| **Linearity** | Biased estimates | Residuals vs Fitted — look for curvature |\n",
    "| **Independence** | Underestimated SEs | Durbin-Watson test (time-series) |\n",
    "| **Homoscedasticity** | Inefficient estimates, invalid SEs | Scale-Location plot, Breusch-Pagan |\n",
    "| **Normality of errors** | Invalid t/F tests and intervals (small n) | Q-Q plot, Shapiro-Wilk test |\n",
    "| **No perfect multicollinearity** | Singular $\\mathbf{X}^\\top\\mathbf{X}$, unreliable estimates | VIF > 10 is problematic |\n",
    "\n",
    "Below we deliberately create two datasets — one that satisfies the assumptions and one that violates **homoscedasticity** (heteroscedastic errors) — to compare the diagnostic plots."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2c3d4e5-0014-4000-8000-000000000014",
   "metadata": {},
   "outputs": [],
   "source": [
    "n_diag = 100\n",
    "x_diag = rng.uniform(1, 10, n_diag)\n",
    "\n",
    "# --- Dataset A: well-behaved (homoscedastic, normal errors)\n",
    "y_good = 3 + 2 * x_diag + rng.normal(0, 1.5, n_diag)\n",
    "\n",
    "# --- Dataset B: heteroscedastic (variance grows with x)\n",
    "y_hetero = 3 + 2 * x_diag + rng.normal(0, 0.5 * x_diag, n_diag)\n",
    "\n",
    "def fit_ols_1d(x, y):\n",
    "    slope, intercept, *_ = stats.linregress(x, y)\n",
    "    y_hat = intercept + slope * x\n",
    "    resid = y - y_hat\n",
    "    s = np.sqrt((resid**2).sum() / (len(x) - 2))\n",
    "    return y_hat, resid, s\n",
    "\n",
    "y_hat_good, resid_good, s_good = fit_ols_1d(x_diag, y_good)\n",
    "y_hat_hetero, resid_hetero, s_hetero = fit_ols_1d(x_diag, y_hetero)\n",
    "\n",
    "def add_smooth_line(ax, x, y, frac=0.5, color='red'):\n",
    "    \"\"\"Moving-average smoother, standing in for LOWESS (no statsmodels dependency).\"\"\"\n",
    "    idx = np.argsort(x)\n",
    "    x_s, y_s = x[idx], y[idx]\n",
    "    w = max(1, int(frac * len(x)))\n",
    "    y_smooth = np.convolve(y_s, np.ones(w)/w, mode='same')\n",
    "    ax.plot(x_s, y_smooth, color=color, lw=1.5, ls='--')\n",
    "\n",
    "fig, axes = plt.subplots(2, 3, figsize=(13, 8))\n",
    "\n",
    "for row, (x, y, y_hat, resid, s, label) in enumerate([\n",
    "        (x_diag, y_good, y_hat_good, resid_good, s_good, 'A: Well-behaved'),\n",
    "        (x_diag, y_hetero, y_hat_hetero, resid_hetero, s_hetero, 'B: Heteroscedastic')]):\n",
    "\n",
    "    std_resid = resid / s\n",
    "\n",
    "    # (1) Residuals vs Fitted\n",
    "    ax = axes[row, 0]\n",
    "    ax.scatter(y_hat, resid, alpha=0.5, color='steelblue', s=20)\n",
    "    ax.axhline(0, color='red', lw=1.5, ls='--')\n",
    "    add_smooth_line(ax, y_hat, resid)\n",
    "    ax.set_xlabel('Fitted'); ax.set_ylabel('Residuals')\n",
    "    ax.set_title(f'{label}\\nResiduals vs Fitted')\n",
    "\n",
    "    # (2) Q-Q plot\n",
    "    ax = axes[row, 1]\n",
    "    (osm, osr), (sq, iq, rq) = stats.probplot(resid, dist='norm')\n",
    "    ax.scatter(osm, osr, alpha=0.5, color='steelblue', s=20)\n",
    "    ax.plot(osm, sq * np.array(osm) + iq, 'r-', lw=1.5)\n",
    "    ax.set_xlabel('Theoretical quantiles'); ax.set_ylabel('Sample quantiles')\n",
    "    ax.set_title('Normal Q-Q')\n",
    "\n",
    "    # (3) Scale-Location\n",
    "    ax = axes[row, 2]\n",
    "    ax.scatter(y_hat, np.sqrt(np.abs(std_resid)), alpha=0.5, color='steelblue', s=20)\n",
    "    add_smooth_line(ax, y_hat, np.sqrt(np.abs(std_resid)))\n",
    "    ax.set_xlabel('Fitted'); ax.set_ylabel('√|Std residuals|')\n",
    "    ax.set_title('Scale-Location')\n",
    "\n",
    "    # Breusch-Pagan-like test: correlation between |residuals| and fitted values\n",
    "    corr, bp_p = stats.pearsonr(y_hat, np.abs(resid))\n",
    "    print(f'{label}: |residual| ~ fitted r={corr:.3f}, p={bp_p:.4f}')\n",
    "\n",
    "plt.suptitle('Regression Diagnostic Plots: Well-behaved vs Heteroscedastic', fontsize=13)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
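  {
   "cell_type": "markdown",
   "id": "b2c3d4e5-0111-4000-8000-000000000111",
   "metadata": {},
   "source": [
    "The Independence row above points to the Durbin-Watson test for serial correlation. A minimal self-contained sketch of the statistic, $DW = \\sum_t (e_t - e_{t-1})^2 / \\sum_t e_t^2$: values near 2 indicate no first-order autocorrelation, while values toward 0 (or 4) indicate positive (or negative) autocorrelation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2c3d4e5-0112-4000-8000-000000000112",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Durbin-Watson statistic: ≈2 no autocorrelation, →0 positive, →4 negative\n",
    "import numpy as np\n",
    "\n",
    "def durbin_watson(resid):\n",
    "    return (np.diff(resid) ** 2).sum() / (resid ** 2).sum()\n",
    "\n",
    "rng_dw = np.random.default_rng(9)\n",
    "white = rng_dw.normal(size=500)   # independent errors\n",
    "ar1 = np.zeros(500)\n",
    "for t in range(1, 500):\n",
    "    ar1[t] = 0.8 * ar1[t - 1] + rng_dw.normal()   # AR(1) errors, rho = 0.8\n",
    "\n",
    "print(f'DW (independent)    : {durbin_watson(white):.3f}')\n",
    "print(f'DW (autocorrelated) : {durbin_watson(ar1):.3f}')"
   ]
  },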
  {
   "cell_type": "markdown",
   "id": "b2c3d4e5-0015-4000-8000-000000000015",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "### Regression Model Comparison\n",
    "\n",
    "| Model | Response | Key idea | Fitting method | Interpretation |\n",
    "|-------|----------|----------|----------------|----------------|\n",
    "| Simple linear | Continuous | One predictor | OLS | β₁: change in ŷ per unit x |\n",
    "| Multiple linear | Continuous | Multiple predictors | OLS | β_j: partial effect of x_j |\n",
    "| Polynomial | Continuous | Non-linear with OLS | OLS in transformed space | Compare adj-R² across degrees |\n",
    "| Logistic | Binary (0/1) | Log-odds linear in x | MLE (iterative) | e^β_j: odds ratio |\n",
    "\n",
    "### Workflow for Any Regression Problem\n",
    "\n",
    "1. **Explore** — scatter plots, correlation matrix, check for outliers\n",
    "2. **Fit** — choose model family (linear / polynomial / logistic / etc.)\n",
    "3. **Evaluate** — R², adjusted R², AIC/BIC, cross-validation MSE\n",
    "4. **Diagnose** — residual plots, Q-Q plot, scale-location, leverage\n",
    "5. **Iterate** — transform variables, add/remove predictors, consider regularisation (Ridge/Lasso)\n",
    "6. **Report** — coefficient estimates, confidence intervals, effect sizes\n",
    "\n",
    "### Extensions to Explore\n",
    "- **Regularised regression**: Ridge (L2), Lasso (L1), ElasticNet — handle multicollinearity and feature selection\n",
    "- **Generalised Linear Models (GLM)**: Poisson (count data), Negative Binomial, Gamma — when errors are non-normal\n",
    "- **Mixed-effects models**: repeated measures and nested data structures\n",
    "- **Robust regression**: Huber loss — reduces sensitivity to outliers\n",
    "- **Non-parametric regression**: Kernel regression, GAM — flexible functional forms"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}