12_regression_analysis.ipynb

  1{
  2 "cells": [
  3  {
  4   "cell_type": "markdown",
  5   "id": "b2c3d4e5-0001-4000-8000-000000000001",
  6   "metadata": {},
  7   "source": [
  8    "# Regression Analysis\n",
  9    "\n",
 10    "Regression analysis models the relationship between a **response variable** (dependent) and one or more **predictor variables** (independent). It is the workhorse of statistical modelling and machine learning.\n",
 11    "\n",
 12    "**Topics covered in this notebook:**\n",
 13    "1. Simple linear regression — fitting a line, OLS estimation\n",
 14    "2. Interpreting coefficients — R², p-values, residual diagnostics\n",
 15    "3. Multiple linear regression — multiple predictors\n",
 16    "4. Polynomial regression — non-linear patterns, bias-variance tradeoff\n",
 17    "5. Logistic regression — binary outcomes, sigmoid function\n",
 18    "6. Regression assumptions — linearity, normality, homoscedasticity\n",
 19    "\n",
 20    "**Mathematical foundation — Ordinary Least Squares (OLS):**\n",
 21    "\n",
 22    "$$\\hat{\\boldsymbol{\\beta}} = (\\mathbf{X}^\\top \\mathbf{X})^{-1} \\mathbf{X}^\\top \\mathbf{y}$$\n",
 23    "\n",
 24    "OLS minimises the **residual sum of squares (RSS)** $\\sum_{i=1}^{n}(y_i - \\hat{y}_i)^2$ and is the **BLUE** (Best Linear Unbiased Estimator) under the Gauss-Markov assumptions."
 25   ]
 26  },
 27  {
 28   "cell_type": "code",
 29   "execution_count": null,
 30   "id": "b2c3d4e5-0002-4000-8000-000000000002",
 31   "metadata": {},
 32   "outputs": [],
 33   "source": [
 34    "import numpy as np\n",
 35    "import scipy.stats as stats\n",
 36    "import matplotlib.pyplot as plt\n",
 37    "import matplotlib.gridspec as gridspec\n",
 38    "\n",
 39    "# Optional: scikit-learn (used where noted; gracefully falls back to manual)\n",
 40    "try:\n",
 41    "    from sklearn.linear_model import LinearRegression, LogisticRegression\n",
 42    "    from sklearn.preprocessing import PolynomialFeatures, StandardScaler\n",
 43    "    from sklearn.pipeline import make_pipeline\n",
 44    "    from sklearn.metrics import r2_score\n",
 45    "    SKLEARN = True\n",
 46    "    print('scikit-learn is available — using sklearn where convenient.')\n",
 47    "except ImportError:\n",
 48    "    SKLEARN = False\n",
 49    "    print('scikit-learn not found — using NumPy/SciPy only.')\n",
 50    "\n",
 51    "rng = np.random.default_rng(seed=0)\n",
 52    "\n",
 53    "plt.rcParams.update({\n",
 54    "    'figure.dpi': 100,\n",
 55    "    'axes.spines.top': False,\n",
 56    "    'axes.spines.right': False,\n",
 57    "    'font.size': 11,\n",
 58    "})\n",
 59    "\n",
 60    "print('NumPy:', np.__version__)\n",
 61    "import scipy; print('SciPy:', scipy.__version__)"
 62   ]
 63  },
 64  {
 65   "cell_type": "markdown",
 66   "id": "b2c3d4e5-0003-4000-8000-000000000003",
 67   "metadata": {},
 68   "source": [
 69    "## 1. Simple Linear Regression\n",
 70    "\n",
 71    "The simple linear model relates a single predictor $x$ to the response $y$:\n",
 72    "\n",
 73    "$$y_i = \\beta_0 + \\beta_1 x_i + \\varepsilon_i, \\qquad \\varepsilon_i \\overset{\\text{i.i.d.}}{\\sim} \\mathcal{N}(0, \\sigma^2)$$\n",
 74    "\n",
 75    "OLS closed-form estimates:\n",
 76    "\n",
 77    "$$\\hat{\\beta}_1 = \\frac{\\sum(x_i - \\bar{x})(y_i - \\bar{y})}{\\sum(x_i - \\bar{x})^2} = \\frac{S_{xy}}{S_{xx}}, \\qquad \\hat{\\beta}_0 = \\bar{y} - \\hat{\\beta}_1 \\bar{x}$$\n",
 78    "\n",
 79    "The **95% confidence band** around the regression line reflects uncertainty in the estimated mean response $E[y \\mid x]$. The wider **prediction interval** additionally accounts for individual observation noise $\\sigma^2$."
 80   ]
 81  },
 82  {
 83   "cell_type": "code",
 84   "execution_count": null,
 85   "id": "b2c3d4e5-0004-4000-8000-000000000004",
 86   "metadata": {},
 87   "outputs": [],
 88   "source": [
 89    "# Scenario: predict house price (€ thousands) from size (m²)\n",
 90    "n = 80\n",
 91    "x = rng.uniform(40, 200, size=n)                            # house size (m²)\n",
 92    "true_beta0, true_beta1 = 30.0, 2.5                         # true intercept, slope\n",
 93    "noise_std = 25.0\n",
 94    "y = true_beta0 + true_beta1 * x + rng.normal(0, noise_std, size=n)\n",
 95    "\n",
 96    "# OLS via scipy.stats.linregress\n",
 97    "slope, intercept, r_value, p_value, stderr = stats.linregress(x, y)\n",
 98    "\n",
 99    "print(f'Fitted intercept β₀ : {intercept:.4f}  (true: {true_beta0})')\n",
100    "print(f'Fitted slope     β₁ : {slope:.4f}  (true: {true_beta1})')\n",
101    "print(f'Correlation r       : {r_value:.4f}')\n",
102    "print(f'R²                  : {r_value**2:.4f}')\n",
103    "print(f'p-value (slope)     : {p_value:.6f}')\n",
104    "print(f'Std error (slope)   : {stderr:.4f}')\n",
105    "\n",
106    "# Predictions + 95% confidence band (analytical formula)\n",
107    "x_grid = np.linspace(x.min(), x.max(), 300)\n",
108    "y_hat  = intercept + slope * x_grid\n",
109    "\n",
110    "x_mean = x.mean()\n",
111    "n_obs  = len(x)\n",
112    "Sxx = ((x - x_mean) ** 2).sum()\n",
113    "residuals = y - (intercept + slope * x)\n",
114    "s_e = np.sqrt((residuals ** 2).sum() / (n_obs - 2))       # residual std error\n",
115    "\n",
116    "t_crit = stats.t.ppf(0.975, df=n_obs - 2)\n",
117    "se_mean = s_e * np.sqrt(1 / n_obs + (x_grid - x_mean) ** 2 / Sxx)\n",
118    "se_pred = s_e * np.sqrt(1 + 1 / n_obs + (x_grid - x_mean) ** 2 / Sxx)\n",
119    "\n",
120    "ci_lower = y_hat - t_crit * se_mean\n",
121    "ci_upper = y_hat + t_crit * se_mean\n",
122    "pi_lower = y_hat - t_crit * se_pred\n",
123    "pi_upper = y_hat + t_crit * se_pred\n",
124    "\n",
125    "# Plot\n",
126    "fig, ax = plt.subplots(figsize=(8, 5))\n",
127    "ax.scatter(x, y, color='steelblue', alpha=0.6, s=30, label='Observations')\n",
128    "ax.plot(x_grid, y_hat, 'r-', lw=2, label=f'OLS fit: y = {intercept:.1f} + {slope:.2f}x')\n",
129    "ax.fill_between(x_grid, ci_lower, ci_upper, color='red', alpha=0.15,\n",
130    "                label='95% confidence band')\n",
131    "ax.fill_between(x_grid, pi_lower, pi_upper, color='orange', alpha=0.10,\n",
132    "                label='95% prediction interval')\n",
133    "ax.set_xlabel('House size (m²)')\n",
134    "ax.set_ylabel('Price (€ thousands)')\n",
135    "ax.set_title(f'Simple Linear Regression  |  R²={r_value**2:.3f}, p={p_value:.4f}')\n",
136    "ax.legend()\n",
137    "plt.tight_layout()\n",
138    "plt.show()"
139   ]
140  },
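 {
  "cell_type": "code",
  "execution_count": null,
  "id": "b2c3d4e5-0101-4000-8000-000000000101",
  "metadata": {},
  "outputs": [],
  "source": [
   "# Sanity check (a small sketch, not part of the fit): the closed-form\n",
   "# estimates from the markdown above — slope = Sxy/Sxx, intercept from the\n",
   "# sample means — should reproduce scipy.stats.linregress exactly.\n",
   "Sxy       = ((x - x.mean()) * (y - y.mean())).sum()\n",
   "Sxx_check = ((x - x.mean()) ** 2).sum()\n",
   "beta1_manual = Sxy / Sxx_check\n",
   "beta0_manual = y.mean() - beta1_manual * x.mean()\n",
   "print(f'Manual slope    : {beta1_manual:.6f}  (linregress: {slope:.6f})')\n",
   "print(f'Manual intercept: {beta0_manual:.6f}  (linregress: {intercept:.6f})')\n",
   "assert np.isclose(beta1_manual, slope) and np.isclose(beta0_manual, intercept)"
  ]
 },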
141  {
142   "cell_type": "markdown",
143   "id": "b2c3d4e5-0005-4000-8000-000000000005",
144   "metadata": {},
145   "source": [
146    "## 2. Interpreting Coefficients: R², Residuals, and Q-Q Plot\n",
147    "\n",
148    "**R² (coefficient of determination)** measures the proportion of variance in $y$ explained by the model:\n",
149    "\n",
150    "$$R^2 = 1 - \\frac{\\text{RSS}}{\\text{TSS}} = 1 - \\frac{\\sum(y_i - \\hat{y}_i)^2}{\\sum(y_i - \\bar{y})^2}$$\n",
151    "\n",
152    "**Residual diagnostics** are essential for validating OLS assumptions:\n",
153    "\n",
154    "| Plot | What to look for | Violation suggests |\n",
155    "|------|------------------|-----------------------|\n",
156    "| Residuals vs Fitted | Random scatter around 0 | Non-linearity or heteroscedasticity |\n",
157    "| Q-Q plot | Points on diagonal | Non-normality of errors |\n",
158    "| Scale-Location | Horizontal band | Heteroscedasticity |\n",
159    "| Residuals vs Leverage | No high-leverage outliers | Influential observations |\n",
160    "\n",
161    "The **p-value of the slope** tests H₀: β₁ = 0 (no linear relationship). Reject when p < α."
162   ]
163  },
164  {
165   "cell_type": "code",
166   "execution_count": null,
167   "id": "b2c3d4e5-0006-4000-8000-000000000006",
168   "metadata": {},
169   "outputs": [],
170   "source": [
171    "# Re-use the house price data from Cell 4\n",
172    "y_fitted   = intercept + slope * x\n",
173    "residuals  = y - y_fitted\n",
174    "std_resid  = residuals / s_e                               # standardised residuals\n",
175    "\n",
176    "RSS = (residuals ** 2).sum()\n",
177    "TSS = ((y - y.mean()) ** 2).sum()\n",
178    "R2  = 1 - RSS / TSS\n",
179    "adj_R2 = 1 - (1 - R2) * (n_obs - 1) / (n_obs - 2)\n",
180    "\n",
181    "print(f'R²          : {R2:.4f}')\n",
182    "print(f'Adjusted R² : {adj_R2:.4f}')\n",
183    "print(f'RSS         : {RSS:.2f}')\n",
184    "print(f'Residual SE : {s_e:.4f}')\n",
185    "\n",
186    "# Shapiro-Wilk test on residuals\n",
187    "sw_stat, sw_p = stats.shapiro(residuals)\n",
188    "print(f'\\nShapiro-Wilk test on residuals: W={sw_stat:.4f}, p={sw_p:.4f}')\n",
189    "print('No evidence against normality (fail to reject H₀).' if sw_p > 0.05 else 'Residuals may NOT be normal.')\n",
190    "\n",
191    "fig = plt.figure(figsize=(11, 8))\n",
192    "gs  = gridspec.GridSpec(2, 2, hspace=0.40, wspace=0.35)\n",
193    "\n",
194    "# 1. Residuals vs Fitted\n",
195    "ax1 = fig.add_subplot(gs[0, 0])\n",
196    "ax1.scatter(y_fitted, residuals, alpha=0.6, color='steelblue', s=25)\n",
197    "ax1.axhline(0, color='red', lw=1.5, ls='--')\n",
198    "ax1.set_xlabel('Fitted values'); ax1.set_ylabel('Residuals')\n",
199    "ax1.set_title('Residuals vs Fitted')\n",
200    "\n",
201    "# 2. Q-Q plot\n",
202    "ax2 = fig.add_subplot(gs[0, 1])\n",
203    "(osm, osr), (slope_qq, intercept_qq, r_qq) = stats.probplot(residuals, dist='norm')\n",
204    "ax2.scatter(osm, osr, alpha=0.6, color='steelblue', s=25)\n",
205    "ax2.plot(osm, slope_qq * np.array(osm) + intercept_qq, 'r-', lw=1.5)\n",
206    "ax2.set_xlabel('Theoretical quantiles'); ax2.set_ylabel('Sample quantiles')\n",
207    "ax2.set_title('Normal Q-Q Plot')\n",
208    "\n",
209    "# 3. Scale-Location (sqrt |standardised residuals| vs fitted)\n",
210    "ax3 = fig.add_subplot(gs[1, 0])\n",
211    "ax3.scatter(y_fitted, np.sqrt(np.abs(std_resid)), alpha=0.6, color='steelblue', s=25)\n",
212    "ax3.set_xlabel('Fitted values'); ax3.set_ylabel('√|Standardised residuals|')\n",
213    "ax3.set_title('Scale-Location')\n",
214    "\n",
215    "# 4. Residual histogram\n",
216    "ax4 = fig.add_subplot(gs[1, 1])\n",
217    "ax4.hist(residuals, bins=14, color='steelblue', edgecolor='white', alpha=0.8, density=True)\n",
218    "xr = np.linspace(residuals.min(), residuals.max(), 200)\n",
219    "ax4.plot(xr, stats.norm.pdf(xr, residuals.mean(), residuals.std()), 'r-', lw=2)\n",
220    "ax4.set_xlabel('Residual'); ax4.set_ylabel('Density')\n",
221    "ax4.set_title('Residual Distribution')\n",
222    "\n",
223    "fig.suptitle(f'Residual Diagnostics  |  R²={R2:.3f}', fontsize=13, y=1.01)\n",
224    "plt.show()"
225   ]
226  },
227  {
228   "cell_type": "markdown",
229   "id": "b2c3d4e5-0007-4000-8000-000000000007",
230   "metadata": {},
231   "source": [
232    "## 3. Multiple Linear Regression\n",
233    "\n",
234    "With $p$ predictors the model becomes:\n",
235    "\n",
236    "$$\\mathbf{y} = \\mathbf{X}\\boldsymbol{\\beta} + \\boldsymbol{\\varepsilon}, \\qquad \\hat{\\boldsymbol{\\beta}} = (\\mathbf{X}^\\top\\mathbf{X})^{-1}\\mathbf{X}^\\top\\mathbf{y}$$\n",
237    "\n",
238    "where $\\mathbf{X}$ is the $n \\times (p+1)$ design matrix (first column all ones for the intercept).\n",
239    "\n",
240    "**Key considerations:**\n",
241    "- **Multicollinearity**: when predictors are correlated, coefficient estimates become unstable. Check with Variance Inflation Factor (VIF).\n",
242    "- **Adjusted R²**: penalises adding uninformative predictors — always report alongside R².\n",
243    "- **F-statistic**: tests whether *all* coefficients (except intercept) are simultaneously zero.\n",
244    "\n",
245    "$$F = \\frac{(\\text{TSS} - \\text{RSS})/p}{\\text{RSS}/(n - p - 1)}$$"
246   ]
247  },
248  {
249   "cell_type": "code",
250   "execution_count": null,
251   "id": "b2c3d4e5-0008-4000-8000-000000000008",
252   "metadata": {},
253   "outputs": [],
254   "source": [
255    "# Scenario: predict salary (€k) from years of experience, education level (0-3), and performance score\n",
256    "n = 120\n",
257    "experience  = rng.uniform(0, 30, n)\n",
258    "education   = rng.integers(0, 4, n).astype(float)\n",
259    "performance = rng.uniform(50, 100, n)\n",
260    "\n",
261    "# True model: salary = 25 + 1.8*exp + 5*edu + 0.3*perf + noise\n",
262    "true_betas = np.array([25.0, 1.8, 5.0, 0.3])\n",
263    "noise = rng.normal(0, 8, n)\n",
264    "y = true_betas[0] + true_betas[1]*experience + true_betas[2]*education + true_betas[3]*performance + noise\n",
265    "\n",
266    "# Build design matrix (with intercept column)\n",
267    "X_raw = np.column_stack([experience, education, performance])\n",
268    "ones  = np.ones((n, 1))\n",
269    "X     = np.hstack([ones, X_raw])                           # shape: (n, 4)\n",
270    "\n",
271    "# OLS via np.linalg.lstsq (SVD-based; avoids explicitly inverting XᵀX)\n",
272    "beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]\n",
273    "\n",
274    "y_hat    = X @ beta_hat\n",
275    "residuals = y - y_hat\n",
276    "p_pred   = X.shape[1] - 1                                  # number of predictors (excl. intercept)\n",
277    "RSS      = (residuals ** 2).sum()\n",
278    "TSS      = ((y - y.mean()) ** 2).sum()\n",
279    "R2       = 1 - RSS / TSS\n",
280    "adj_R2   = 1 - (1 - R2) * (n - 1) / (n - p_pred - 1)\n",
281    "s2       = RSS / (n - p_pred - 1)\n",
282    "\n",
283    "# Standard errors of coefficients\n",
284    "XtX_inv  = np.linalg.inv(X.T @ X)\n",
285    "se_betas = np.sqrt(np.diag(XtX_inv) * s2)\n",
286    "\n",
287    "# t-statistics and p-values\n",
288    "t_stats = beta_hat / se_betas\n",
289    "df_res  = n - p_pred - 1\n",
290    "p_vals  = 2 * stats.t.sf(np.abs(t_stats), df=df_res)\n",
291    "\n",
292    "# F-statistic for overall model significance\n",
293    "F_stat = ((TSS - RSS) / p_pred) / (RSS / df_res)\n",
294    "F_pval = stats.f.sf(F_stat, p_pred, df_res)\n",
295    "\n",
296    "print('Multiple Linear Regression Results')\n",
297    "print('=' * 55)\n",
298    "print(f'R²: {R2:.4f}   Adjusted R²: {adj_R2:.4f}')\n",
299    "print(f'F-statistic: {F_stat:.2f}  (p={F_pval:.2e})')\n",
300    "print(f'n={n}, p={p_pred}')\n",
301    "print()\n",
302    "labels = ['Intercept', 'Experience', 'Education', 'Performance']\n",
303    "print(f'{\"Variable\":<14} {\"Estimate\":>10} {\"Std Err\":>10} {\"t-stat\":>10} {\"p-value\":>12}')\n",
304    "print('-' * 58)\n",
305    "for lbl, b, se, t, p in zip(labels, beta_hat, se_betas, t_stats, p_vals):\n",
306    "    sig = '***' if p < 0.001 else ('**' if p < 0.01 else ('*' if p < 0.05 else ''))\n",
307    "    print(f'{lbl:<14} {b:>10.4f} {se:>10.4f} {t:>10.4f} {p:>12.4f} {sig}')\n",
308    "\n",
309    "# Coefficient plot\n",
310    "fig, ax = plt.subplots(figsize=(7, 4))\n",
311    "ci_mult = stats.t.ppf(0.975, df=df_res)\n",
312    "ci_half = ci_mult * se_betas[1:]                           # skip intercept\n",
313    "y_pos = np.arange(p_pred)\n",
314    "ax.barh(y_pos, beta_hat[1:], xerr=ci_half, color='steelblue', alpha=0.7,\n",
315    "        edgecolor='white', capsize=4)\n",
316    "ax.axvline(0, color='black', lw=1)\n",
317    "ax.set_yticks(y_pos)\n",
318    "ax.set_yticklabels(['Experience', 'Education', 'Performance'])\n",
319    "ax.set_xlabel('Coefficient estimate (with 95% CI)')\n",
320    "ax.set_title('Multiple Regression Coefficients')\n",
321    "plt.tight_layout()\n",
322    "plt.show()"
323   ]
324  },
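 {
  "cell_type": "code",
  "execution_count": null,
  "id": "b2c3d4e5-0102-4000-8000-000000000102",
  "metadata": {},
  "outputs": [],
  "source": [
   "# Variance Inflation Factor — a hand-rolled sketch of the check mentioned\n",
   "# above (statsmodels' variance_inflation_factor does the same job).\n",
   "# VIF_j = 1 / (1 - R²_j), where R²_j regresses predictor j on the others.\n",
   "def vif(X_pred):\n",
   "    \"\"\"VIF for each column of X_pred (predictors only, no intercept column).\"\"\"\n",
   "    out = []\n",
   "    for j in range(X_pred.shape[1]):\n",
   "        others = np.delete(X_pred, j, axis=1)\n",
   "        A = np.column_stack([np.ones(len(X_pred)), others])\n",
   "        coef = np.linalg.lstsq(A, X_pred[:, j], rcond=None)[0]\n",
   "        resid_j = X_pred[:, j] - A @ coef\n",
   "        r2_j = 1 - (resid_j ** 2).sum() / ((X_pred[:, j] - X_pred[:, j].mean()) ** 2).sum()\n",
   "        out.append(1 / (1 - r2_j))\n",
   "    return np.array(out)\n",
   "\n",
   "# The three predictors here were generated independently, so VIFs should be ≈ 1.\n",
   "for name, v in zip(['Experience', 'Education', 'Performance'], vif(X_raw)):\n",
   "    flag = '  <-- investigate' if v > 10 else ''\n",
   "    print(f'VIF({name:<11}): {v:.3f}{flag}')"
  ]
 },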
325  {
326   "cell_type": "markdown",
327   "id": "b2c3d4e5-0009-4000-8000-000000000009",
328   "metadata": {},
329   "source": [
330    "## 4. Polynomial Regression\n",
331    "\n",
332    "When the relationship between $x$ and $y$ is non-linear, we can extend OLS by including polynomial features:\n",
333    "\n",
334    "$$y = \\beta_0 + \\beta_1 x + \\beta_2 x^2 + \\cdots + \\beta_d x^d + \\varepsilon$$\n",
335    "\n",
336    "This is still *linear* in the parameters $\\boldsymbol{\\beta}$ — just a linear model in a transformed feature space.\n",
337    "\n",
338    "**Bias-variance tradeoff:**\n",
339    "- Low-degree polynomial → high bias (underfitting)\n",
340    "- High-degree polynomial → high variance (overfitting)\n",
341    "- The optimal degree minimises prediction error on unseen data (use cross-validation)\n",
342    "\n",
343    "The training R² always increases with degree, so use **adjusted R²** or **AIC/BIC** to compare models."
344   ]
345  },
346  {
347   "cell_type": "code",
348   "execution_count": null,
349   "id": "b2c3d4e5-0010-4000-8000-000000000010",
350   "metadata": {},
351   "outputs": [],
352   "source": [
353    "# Scenario: model a noisy sinusoidal relationship\n",
354    "n = 60\n",
355    "x_poly = rng.uniform(-3, 3, size=n)\n",
356    "y_poly = np.sin(x_poly) + rng.normal(0, 0.3, size=n)\n",
357    "\n",
358    "x_grid_poly = np.linspace(-3.2, 3.2, 300)\n",
359    "degrees = [1, 3, 7, 15]\n",
360    "\n",
361    "fig, axes = plt.subplots(2, 2, figsize=(11, 8), sharey=True)\n",
362    "axes = axes.ravel()\n",
363    "\n",
364    "r2_scores = []\n",
365    "for ax, deg in zip(axes, degrees):\n",
366    "    # Build Vandermonde matrix for training data and grid\n",
367    "    X_train = np.vander(x_poly,      N=deg + 1, increasing=True)\n",
368    "    X_grid  = np.vander(x_grid_poly, N=deg + 1, increasing=True)\n",
369    "\n",
370    "    coeffs = np.linalg.lstsq(X_train, y_poly, rcond=None)[0]\n",
371    "    y_pred_train = X_train @ coeffs\n",
372    "    y_pred_grid  = X_grid  @ coeffs\n",
373    "\n",
374    "    rss = ((y_poly - y_pred_train) ** 2).sum()\n",
375    "    tss = ((y_poly - y_poly.mean()) ** 2).sum()\n",
376    "    r2  = 1 - rss / tss\n",
377    "    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - deg - 1)\n",
378    "    r2_scores.append(r2)\n",
379    "\n",
380    "    ax.scatter(x_poly, y_poly, color='steelblue', alpha=0.5, s=20, label='Data')\n",
381    "    ax.plot(x_grid_poly, np.sin(x_grid_poly), 'g--', lw=1.5, label='True sin(x)')\n",
382    "    ax.plot(x_grid_poly, y_pred_grid, 'r-', lw=2,\n",
383    "            label=f'Degree {deg}')\n",
384    "    ax.set_ylim(-2.5, 2.5)\n",
385    "    ax.set_title(f'Degree {deg}  |  R²={r2:.3f}, adj-R²={adj_r2:.3f}')\n",
386    "    ax.set_xlabel('x')\n",
387    "    ax.set_ylabel('y')\n",
388    "    ax.legend(fontsize=9)\n",
389    "\n",
390    "plt.suptitle('Polynomial Regression: Underfitting → Overfitting', fontsize=13)\n",
391    "plt.tight_layout()\n",
392    "plt.show()\n",
393    "\n",
394    "print('Training R² by degree:')\n",
395    "for d, r2 in zip(degrees, r2_scores):\n",
396    "    bar = '#' * int(r2 * 40)\n",
397    "    print(f'  Degree {d:>2}: R²={r2:.4f}  |{bar}')"
398   ]
399  },
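 {
  "cell_type": "code",
  "execution_count": null,
  "id": "b2c3d4e5-0103-4000-8000-000000000103",
  "metadata": {},
  "outputs": [],
  "source": [
   "# AIC/BIC degree comparison — a minimal sketch of the model-selection idea\n",
   "# mentioned above, using the Gaussian log-likelihood up to a constant:\n",
   "# AIC = n·ln(RSS/n) + 2k,  BIC = n·ln(RSS/n) + k·ln(n),\n",
   "# with k = degree + 1 fitted coefficients (σ² omitted from k for simplicity).\n",
   "print(f'{\"deg\":>4} {\"RSS\":>10} {\"AIC\":>10} {\"BIC\":>10}')\n",
   "for deg in range(1, 16):\n",
   "    Xd = np.vander(x_poly, N=deg + 1, increasing=True)\n",
   "    cd = np.linalg.lstsq(Xd, y_poly, rcond=None)[0]\n",
   "    rss_d = ((y_poly - Xd @ cd) ** 2).sum()\n",
   "    k = deg + 1\n",
   "    aic = n * np.log(rss_d / n) + 2 * k\n",
   "    bic = n * np.log(rss_d / n) + k * np.log(n)\n",
   "    print(f'{deg:>4} {rss_d:>10.3f} {aic:>10.2f} {bic:>10.2f}')\n",
   "# Unlike training R², AIC/BIC eventually worsen as degree grows."
  ]
 },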
400  {
401   "cell_type": "markdown",
402   "id": "b2c3d4e5-0011-4000-8000-000000000011",
403   "metadata": {},
404   "source": [
405    "## 5. Logistic Regression (Binary Classification)\n",
406    "\n",
407    "When the response is binary ($y \\in \\{0, 1\\}$), linear regression is inappropriate. **Logistic regression** models the log-odds:\n",
408    "\n",
409    "$$\\log\\frac{P(y=1 \\mid x)}{1 - P(y=1 \\mid x)} = \\beta_0 + \\beta_1 x$$\n",
410    "\n",
411    "Transforming back via the **sigmoid function**:\n",
412    "\n",
413    "$$P(y=1 \\mid x) = \\sigma(\\beta_0 + \\beta_1 x) = \\frac{1}{1 + e^{-(\\beta_0 + \\beta_1 x)}}$$\n",
414    "\n",
415    "Parameters are estimated by **maximum likelihood** (not OLS). The decision boundary is where $P = 0.5$, i.e., $\\beta_0 + \\beta_1 x = 0$.\n",
416    "\n",
417    "**Interpretation of coefficients:** $e^{\\beta_j}$ is the **odds ratio** — the factor by which the odds of $y=1$ multiply for a unit increase in $x_j$, holding other predictors constant."
418   ]
419  },
420  {
421   "cell_type": "code",
422   "execution_count": null,
423   "id": "b2c3d4e5-0012-4000-8000-000000000012",
424   "metadata": {},
425   "outputs": [],
426   "source": [
427    "# Scenario: predict exam pass/fail from hours studied\n",
428    "n_logit = 100\n",
429    "hours   = rng.uniform(0, 10, n_logit)\n",
430    "\n",
431    "# True logistic model: log-odds = -4 + 1.2 * hours\n",
432    "true_b0, true_b1 = -4.0, 1.2\n",
433    "log_odds = true_b0 + true_b1 * hours\n",
434    "prob_pass = 1 / (1 + np.exp(-log_odds))\n",
435    "passed = rng.binomial(1, prob_pass)                        # 0 = fail, 1 = pass\n",
436    "\n",
437    "# Fit via scikit-learn (or manual gradient descent fallback)\n",
438    "if SKLEARN:\n",
439    "    X_logit = hours.reshape(-1, 1)\n",
440    "    clf = LogisticRegression(solver='lbfgs', C=1e6)        # large C → minimal regularisation\n",
441    "    clf.fit(X_logit, passed)\n",
442    "    b0_hat = clf.intercept_[0]\n",
443    "    b1_hat = clf.coef_[0, 0]\n",
444    "    print(f'Fitted (sklearn):   β₀={b0_hat:.4f}, β₁={b1_hat:.4f}')\n",
445    "else:\n",
446    "    # Manual gradient descent\n",
447    "    def sigmoid(z): return 1 / (1 + np.exp(-z))\n",
448    "    b0_hat, b1_hat = 0.0, 0.0\n",
449    "    lr = 0.05\n",
450    "    for _ in range(3000):\n",
451    "        p = sigmoid(b0_hat + b1_hat * hours)\n",
452    "        err = p - passed\n",
453    "        b0_hat -= lr * err.mean()\n",
454    "        b1_hat -= lr * (err * hours).mean()\n",
455    "    print(f'Fitted (grad desc): β₀={b0_hat:.4f}, β₁={b1_hat:.4f}')\n",
456    "\n",
457    "print(f'True coefficients:  β₀={true_b0}, β₁={true_b1}')\n",
458    "\n",
459    "# Decision boundary: b0 + b1*x = 0  →  x = -b0/b1\n",
460    "boundary = -b0_hat / b1_hat\n",
461    "print(f'Decision boundary   : {boundary:.2f} hours')\n",
462    "print(f'Odds ratio e^β₁     : {np.exp(b1_hat):.4f}')\n",
463    "\n",
464    "# Visualise\n",
465    "x_range = np.linspace(0, 10, 300)\n",
466    "p_hat   = 1 / (1 + np.exp(-(b0_hat + b1_hat * x_range)))\n",
467    "\n",
468    "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n",
469    "\n",
470    "# Left: sigmoid curve\n",
471    "ax = axes[0]\n",
472    "ax.scatter(hours[passed == 0], passed[passed == 0], color='tomato',    alpha=0.5, s=30,\n",
473    "           label='Fail (0)', zorder=3)\n",
474    "ax.scatter(hours[passed == 1], passed[passed == 1], color='steelblue', alpha=0.5, s=30,\n",
475    "           label='Pass (1)', zorder=3)\n",
476    "ax.plot(x_range, p_hat, 'k-', lw=2, label='Fitted P(pass)')\n",
477    "ax.axvline(boundary, color='green', ls='--', lw=1.5, label=f'Boundary = {boundary:.1f}h')\n",
478    "ax.axhline(0.5, color='gray', ls=':', lw=1)\n",
479    "ax.set_xlabel('Hours studied')\n",
480    "ax.set_ylabel('P(pass)')\n",
481    "ax.set_title('Logistic Regression: Sigmoid Curve')\n",
482    "ax.legend()\n",
483    "\n",
484    "# Right: log-odds plot\n",
485    "ax = axes[1]\n",
486    "ax.plot(x_range, b0_hat + b1_hat * x_range, 'purple', lw=2, label='Log-odds')\n",
487    "ax.axhline(0, color='gray', ls='--')\n",
488    "ax.axvline(boundary, color='green', ls='--', lw=1.5, label=f'Boundary ({boundary:.1f}h)')\n",
489    "ax.fill_between(x_range, b0_hat + b1_hat * x_range, 0,\n",
490    "                where=(b0_hat + b1_hat * x_range > 0), alpha=0.15, color='steelblue', label='Predict pass')\n",
491    "ax.fill_between(x_range, b0_hat + b1_hat * x_range, 0,\n",
492    "                where=(b0_hat + b1_hat * x_range < 0), alpha=0.15, color='tomato', label='Predict fail')\n",
493    "ax.set_xlabel('Hours studied')\n",
494    "ax.set_ylabel('Log-odds')\n",
495    "ax.set_title('Log-odds and Decision Boundary')\n",
496    "ax.legend(fontsize=9)\n",
497    "\n",
498    "plt.tight_layout()\n",
499    "plt.show()"
500   ]
501  },
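 {
  "cell_type": "code",
  "execution_count": null,
  "id": "b2c3d4e5-0104-4000-8000-000000000104",
  "metadata": {},
  "outputs": [],
  "source": [
   "# Using the fitted model (sketch): predicted pass probabilities for a few\n",
   "# study times, and simple training accuracy at the P = 0.5 threshold.\n",
   "def predict_proba(h):\n",
   "    return 1 / (1 + np.exp(-(b0_hat + b1_hat * np.asarray(h))))\n",
   "\n",
   "for h in [2, 4, 6, 8]:\n",
   "    print(f'{h} hours studied -> P(pass) = {predict_proba(h):.3f}')\n",
   "\n",
   "acc = ((predict_proba(hours) >= 0.5) == passed).mean()\n",
   "print(f'Training accuracy at 0.5 threshold: {acc:.3f}')"
  ]
 },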
502  {
503   "cell_type": "markdown",
504   "id": "b2c3d4e5-0013-4000-8000-000000000013",
505   "metadata": {},
506   "source": [
507    "## 6. Checking Regression Assumptions\n",
508    "\n",
509    "OLS inference is valid only when the Gauss-Markov assumptions hold. Violations affect different inferential properties:\n",
510    "\n",
511    "| Assumption | Consequence if violated | Diagnostic |\n",
512    "|------------|------------------------|------------|\n",
513    "| **Linearity** | Biased estimates | Residuals vs Fitted — look for curvature |\n",
514    "| **Independence** | Underestimated SEs | Durbin-Watson test (time-series) |\n",
515    "| **Homoscedasticity** | Inefficient estimates, invalid SEs | Scale-Location plot, Breusch-Pagan |\n",
516    "| **Normality of errors** | Invalid t/F intervals (small n) | Q-Q plot, Shapiro-Wilk test |\n",
517    "| **No perfect multicollinearity** | Singular $\\mathbf{X}^\\top\\mathbf{X}$, unreliable estimates | VIF > 10 is problematic |\n",
518    "\n",
519    "Below we deliberately create two datasets — one that satisfies assumptions and one that violates **homoscedasticity** (heteroscedastic errors) — to compare the diagnostic plots."
520   ]
521  },
522  {
523   "cell_type": "code",
524   "execution_count": null,
525   "id": "b2c3d4e5-0014-4000-8000-000000000014",
526   "metadata": {},
527   "outputs": [],
528   "source": [
529    "n_diag = 100\n",
530    "x_diag = rng.uniform(1, 10, n_diag)\n",
531    "\n",
532    "# --- Dataset A: well-behaved (homoscedastic, normal errors)\n",
533    "y_good = 3 + 2 * x_diag + rng.normal(0, 1.5, n_diag)\n",
534    "\n",
535    "# --- Dataset B: heteroscedastic (variance grows with x)\n",
536    "y_hetero = 3 + 2 * x_diag + rng.normal(0, 0.5 * x_diag, n_diag)\n",
537    "\n",
538    "def fit_ols_1d(x, y):\n",
539    "    slope, intercept, *_ = stats.linregress(x, y)\n",
540    "    y_hat = intercept + slope * x\n",
541    "    resid = y - y_hat\n",
542    "    s = np.sqrt((resid**2).sum() / (len(x) - 2))\n",
543    "    return y_hat, resid, s\n",
544    "\n",
545    "y_hat_good,   resid_good,   s_good   = fit_ols_1d(x_diag, y_good)\n",
546    "y_hat_hetero, resid_hetero, s_hetero = fit_ols_1d(x_diag, y_hetero)\n",
547    "\n",
548    "def add_lowess_line(ax, x, y, frac=0.5, color='red'):\n",
549    "    \"\"\"Simple moving average as a smoother (no statsmodels dependency).\"\"\"\n",
550    "    idx    = np.argsort(x)\n",
551    "    x_s, y_s = x[idx], y[idx]\n",
552    "    w = max(1, int(frac * len(x)))\n",
553    "    y_smooth = np.convolve(y_s, np.ones(w)/w, mode='same')\n",
554    "    ax.plot(x_s, y_smooth, color=color, lw=1.5, ls='--')\n",
555    "\n",
556    "fig, axes = plt.subplots(2, 3, figsize=(13, 8))\n",
557    "titles_row = ['Homoscedastic (well-behaved)', 'Heteroscedastic (variance grows with x)']\n",
558    "\n",
559    "for row, (x, y, y_hat, resid, s, label) in enumerate([\n",
560    "        (x_diag, y_good,   y_hat_good,   resid_good,   s_good,   'A: Well-behaved'),\n",
561    "        (x_diag, y_hetero, y_hat_hetero, resid_hetero, s_hetero, 'B: Heteroscedastic')]):\n",
562    "\n",
563    "    std_resid = resid / s\n",
564    "\n",
565    "    # (1) Residuals vs Fitted\n",
566    "    ax = axes[row, 0]\n",
567    "    ax.scatter(y_hat, resid, alpha=0.5, color='steelblue', s=20)\n",
568    "    ax.axhline(0, color='red', lw=1.5, ls='--')\n",
569    "    add_lowess_line(ax, y_hat, resid)\n",
570    "    ax.set_xlabel('Fitted'); ax.set_ylabel('Residuals')\n",
571    "    ax.set_title(f'{label}\\nResiduals vs Fitted')\n",
572    "\n",
573    "    # (2) Q-Q plot\n",
574    "    ax = axes[row, 1]\n",
575    "    (osm, osr), (sq, iq, rq) = stats.probplot(resid, dist='norm')\n",
576    "    ax.scatter(osm, osr, alpha=0.5, color='steelblue', s=20)\n",
577    "    ax.plot(osm, sq * np.array(osm) + iq, 'r-', lw=1.5)\n",
578    "    ax.set_xlabel('Theoretical quantiles'); ax.set_ylabel('Sample quantiles')\n",
579    "    ax.set_title('Normal Q-Q')\n",
580    "\n",
581    "    # (3) Scale-Location\n",
582    "    ax = axes[row, 2]\n",
583    "    ax.scatter(y_hat, np.sqrt(np.abs(std_resid)), alpha=0.5, color='steelblue', s=20)\n",
584    "    add_lowess_line(ax, y_hat, np.sqrt(np.abs(std_resid)))\n",
585    "    ax.set_xlabel('Fitted'); ax.set_ylabel('√|Std residuals|')\n",
586    "    ax.set_title('Scale-Location')\n",
587    "\n",
588    "    # Breusch-Pagan-like test: correlation between |residuals| and fitted values\n",
589    "    corr, bp_p = stats.pearsonr(y_hat, np.abs(resid))\n",
590    "    print(f'{label}: |residual| ~ fitted  r={corr:.3f}, p={bp_p:.4f}')\n",
591    "\n",
592    "plt.suptitle('Regression Diagnostic Plots: Well-behaved vs Heteroscedastic', fontsize=13)\n",
593    "plt.tight_layout()\n",
594    "plt.show()"
595   ]
596  },
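 {
  "cell_type": "code",
  "execution_count": null,
  "id": "b2c3d4e5-0105-4000-8000-000000000105",
  "metadata": {},
  "outputs": [],
  "source": [
   "# A hand-rolled Breusch-Pagan test (sketch; statsmodels' het_breuschpagan is\n",
   "# the usual choice). Regress the squared residuals on x; under H₀\n",
   "# (homoscedasticity) the LM statistic n·R²_aux is ~ χ² with 1 df here.\n",
   "def breusch_pagan(x, resid):\n",
   "    A = np.column_stack([np.ones(len(x)), x])\n",
   "    g = resid ** 2\n",
   "    coef = np.linalg.lstsq(A, g, rcond=None)[0]\n",
   "    r2_aux = 1 - ((g - A @ coef) ** 2).sum() / ((g - g.mean()) ** 2).sum()\n",
   "    lm = len(x) * r2_aux\n",
   "    return lm, stats.chi2.sf(lm, df=1)\n",
   "\n",
   "for label, resid in [('A: Well-behaved', resid_good), ('B: Heteroscedastic', resid_hetero)]:\n",
   "    lm, p = breusch_pagan(x_diag, resid)\n",
   "    verdict = 'heteroscedasticity detected' if p < 0.05 else 'no evidence against H₀'\n",
   "    print(f'{label}: LM={lm:.2f}, p={p:.4f}  ->  {verdict}')"
  ]
 },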
597  {
598   "cell_type": "markdown",
599   "id": "b2c3d4e5-0015-4000-8000-000000000015",
600   "metadata": {},
601   "source": [
602    "## Summary\n",
603    "\n",
604    "### Regression Model Comparison\n",
605    "\n",
606    "| Model | Response | Key idea | Fitting method | Interpretation |\n",
607    "|-------|----------|----------|----------------|----------------|\n",
608    "| Simple linear | Continuous | One predictor | OLS | β₁: change in ŷ per unit x |\n",
609    "| Multiple linear | Continuous | Multiple predictors | OLS | β_j: partial effect of x_j |\n",
610    "| Polynomial | Continuous | Non-linear with OLS | OLS in transformed space | Compare adj-R² across degrees |\n",
611    "| Logistic | Binary (0/1) | Log-odds linear in x | MLE (iterative) | e^β_j: odds ratio |\n",
612    "\n",
613    "### Workflow for Any Regression Problem\n",
614    "\n",
615    "1. **Explore** — scatter plots, correlation matrix, check for outliers\n",
616    "2. **Fit** — choose model family (linear / polynomial / logistic / etc.)\n",
617    "3. **Evaluate** — R², adjusted R², AIC/BIC, cross-validation MSE\n",
618    "4. **Diagnose** — residual plots, Q-Q plot, scale-location, leverage\n",
619    "5. **Iterate** — transform variables, add/remove predictors, consider regularisation (Ridge/Lasso)\n",
620    "6. **Report** — coefficient estimates, confidence intervals, effect sizes\n",
621    "\n",
622    "### Extensions to Explore\n",
623    "- **Regularised regression**: Ridge (L2), Lasso (L1), ElasticNet — handle multicollinearity and feature selection\n",
624    "- **Generalised Linear Models (GLM)**: Poisson (count data), Negative Binomial, Gamma — when errors are non-normal\n",
625    "- **Mixed-effects models**: repeated measures and nested data structures\n",
626    "- **Robust regression**: Huber loss — reduces sensitivity to outliers\n",
627    "- **Non-parametric regression**: Kernel regression, GAM — flexible functional forms"
628   ]
629  }
630 ],
631 "metadata": {
632  "kernelspec": {
633   "display_name": "Python 3",
634   "language": "python",
635   "name": "python3"
636  },
637  "language_info": {
638   "name": "python",
639   "version": "3.11.0"
640  }
641 },
642 "nbformat": 4,
643 "nbformat_minor": 5
644}