11_hypothesis_testing.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a1b2c3d4-0001-4000-8000-000000000001",
   "metadata": {},
   "source": [
    "# Hypothesis Testing\n",
    "\n",
    "Hypothesis testing is a statistical framework for making decisions from data under uncertainty. We define a **null hypothesis** (H₀) representing the status quo, and an **alternative hypothesis** (H₁) representing the claim we want to support. A test statistic is computed from sample data and compared against a critical value or converted to a p-value.\n",
    "\n",
    "**Key concepts:**\n",
    "- **p-value**: probability of observing data at least as extreme as the sample, assuming H₀ is true\n",
    "- **Significance level (α)**: threshold below which we reject H₀ (commonly 0.05)\n",
    "- **Type I error**: rejecting a true H₀ (false positive, probability = α)\n",
    "- **Type II error**: failing to reject a false H₀ (false negative, probability = β)\n",
    "- **Power**: probability of correctly rejecting a false H₀ (= 1 − β)\n",
    "\n",
    "This notebook covers the most common parametric and non-parametric tests used in statistics and data science."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1b2c3d4-0002-4000-8000-000000000002",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import scipy\n",
    "import scipy.stats as stats\n",
    "import matplotlib\n",
    "import matplotlib.pyplot as plt\n",
    "from itertools import combinations\n",
    "\n",
    "# Reproducibility\n",
    "rng = np.random.default_rng(seed=42)\n",
    "\n",
    "# Plot aesthetics\n",
    "plt.rcParams.update({\n",
    "    'figure.dpi': 100,\n",
    "    'axes.spines.top': False,\n",
    "    'axes.spines.right': False,\n",
    "    'font.size': 11,\n",
    "})\n",
    "\n",
    "print('NumPy:', np.__version__)\n",
    "print('SciPy:', scipy.__version__)\n",
    "print('Matplotlib:', matplotlib.__version__)"
   ]
  },
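  {
   "cell_type": "markdown",
   "id": "a1b2c3d4-0016-4000-8000-000000000016",
   "metadata": {},
   "source": [
    "As a quick illustration of **power** (= 1 − β), the short Monte Carlo simulation below estimates how often a one-sample t-test at α = 0.05 correctly rejects H₀ when the true effect is 0.5σ. The effect size and sample sizes here are arbitrary choices for illustration, not prescriptions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1b2c3d4-0017-4000-8000-000000000017",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Monte Carlo estimate of power: P(reject H₀ | H₁ true).\n",
    "# True mean is 0.5, H₀ claims 0 — an effect of 0.5 standard deviations.\n",
    "alpha = 0.05\n",
    "effect, sigma = 0.5, 1.0\n",
    "n_sims = 2000\n",
    "\n",
    "for n in [10, 20, 40, 80]:\n",
    "    rejections = 0\n",
    "    for _ in range(n_sims):\n",
    "        sim_sample = rng.normal(loc=effect, scale=sigma, size=n)\n",
    "        if stats.ttest_1samp(sim_sample, popmean=0).pvalue < alpha:\n",
    "            rejections += 1\n",
    "    print(f'n={n:3d}: estimated power ≈ {rejections / n_sims:.2f}')"
   ]
  },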
  {
   "cell_type": "markdown",
   "id": "a1b2c3d4-0003-4000-8000-000000000003",
   "metadata": {},
   "source": [
    "## 1. One-Sample t-Test\n",
    "\n",
    "The **one-sample t-test** asks: is the mean of a population equal to a hypothesised value μ₀?\n",
    "\n",
    "$$H_0: \\mu = \\mu_0 \\qquad H_1: \\mu \\neq \\mu_0$$\n",
    "\n",
    "The test statistic is:\n",
    "\n",
    "$$t = \\frac{\\bar{x} - \\mu_0}{s / \\sqrt{n}}$$\n",
    "\n",
    "where $\\bar{x}$ is the sample mean, $s$ is the sample standard deviation, and $n$ is the sample size. Under H₀ this follows a t-distribution with $n - 1$ degrees of freedom.\n",
    "\n",
    "**Assumptions:** observations are independent and (approximately) normally distributed — the test is robust to mild departures when $n \\geq 30$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1b2c3d4-0004-4000-8000-000000000004",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scenario: a factory claims its bolts have mean diameter 10 mm.\n",
    "# We sample 40 bolts and want to test that claim.\n",
    "\n",
    "mu_claimed = 10.0          # hypothesised population mean (mm)\n",
    "n = 40\n",
    "true_mean = 10.3           # the factory is actually producing slightly larger bolts\n",
    "true_std  = 0.8\n",
    "\n",
    "sample = rng.normal(loc=true_mean, scale=true_std, size=n)\n",
    "\n",
    "t_stat, p_value = stats.ttest_1samp(sample, popmean=mu_claimed)\n",
    "\n",
    "print(f'Sample mean : {sample.mean():.4f} mm')\n",
    "print(f'Sample std  : {sample.std(ddof=1):.4f} mm')\n",
    "print(f't-statistic : {t_stat:.4f}')\n",
    "print(f'p-value     : {p_value:.4f}')\n",
    "\n",
    "alpha = 0.05\n",
    "if p_value < alpha:\n",
    "    print(f'\\nDecision: Reject H₀ (p={p_value:.4f} < α={alpha})')\n",
    "    print(f'The sample mean differs significantly from {mu_claimed} mm.')\n",
    "else:\n",
    "    print(f'\\nDecision: Fail to reject H₀ (p={p_value:.4f} ≥ α={alpha})')\n",
    "    print(f'No significant evidence that the mean differs from {mu_claimed} mm.')\n",
    "\n",
    "# Visualise: sample distribution vs hypothesised mean\n",
    "fig, ax = plt.subplots(figsize=(7, 4))\n",
    "ax.hist(sample, bins=10, color='steelblue', edgecolor='white', alpha=0.8, label='Sample')\n",
    "ax.axvline(sample.mean(), color='steelblue', lw=2, ls='--', label=f'Sample mean ({sample.mean():.2f})')\n",
    "ax.axvline(mu_claimed,    color='tomato',   lw=2, ls='--', label=f'Claimed mean ({mu_claimed})')\n",
    "ax.set_xlabel('Bolt diameter (mm)')\n",
    "ax.set_ylabel('Count')\n",
    "ax.set_title(f'One-sample t-test  |  t={t_stat:.3f}, p={p_value:.4f}')\n",
    "ax.legend()\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1b2c3d4-0005-4000-8000-000000000005",
   "metadata": {},
   "source": [
    "## 2. Two-Sample t-Test (Independent Groups)\n",
    "\n",
    "The **independent two-sample t-test** compares the means of two separate groups:\n",
    "\n",
    "$$H_0: \\mu_1 = \\mu_2 \\qquad H_1: \\mu_1 \\neq \\mu_2$$\n",
    "\n",
    "SciPy's `ttest_ind` performs Student's t-test by default (`equal_var=True`); passing `equal_var=False` selects Welch's t-test, which does **not** assume equal variances and is generally preferred:\n",
    "\n",
    "$$t = \\frac{\\bar{x}_1 - \\bar{x}_2}{\\sqrt{\\frac{s_1^2}{n_1} + \\frac{s_2^2}{n_2}}}$$\n",
    "\n",
    "**When to use:** comparing outcomes between two independent groups (e.g., A/B test, treatment vs control)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1b2c3d4-0006-4000-8000-000000000006",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scenario: compare exam scores between two teaching methods.\n",
    "n1, n2 = 35, 40\n",
    "group_A = rng.normal(loc=72, scale=10, size=n1)   # Method A\n",
    "group_B = rng.normal(loc=77, scale=12, size=n2)   # Method B (slightly better)\n",
    "\n",
    "# Welch's t-test (does not assume equal variances)\n",
    "t_stat, p_value = stats.ttest_ind(group_A, group_B, equal_var=False)\n",
    "\n",
    "print('Group A — mean: {:.2f}, std: {:.2f}, n={}'.format(group_A.mean(), group_A.std(ddof=1), n1))\n",
    "print('Group B — mean: {:.2f}, std: {:.2f}, n={}'.format(group_B.mean(), group_B.std(ddof=1), n2))\n",
    "print(f'\\nt-statistic : {t_stat:.4f}')\n",
    "print(f'p-value     : {p_value:.4f}')\n",
    "\n",
    "alpha = 0.05\n",
    "conclusion = 'Reject H₀' if p_value < alpha else 'Fail to reject H₀'\n",
    "print(f'Decision    : {conclusion} at α={alpha}')\n",
    "\n",
    "# Visualise distributions\n",
    "fig, ax = plt.subplots(figsize=(8, 4))\n",
    "bins = np.linspace(40, 115, 25)\n",
    "ax.hist(group_A, bins=bins, alpha=0.6, color='steelblue', edgecolor='white', label=f'Method A (mean={group_A.mean():.1f})')\n",
    "ax.hist(group_B, bins=bins, alpha=0.6, color='salmon',    edgecolor='white', label=f'Method B (mean={group_B.mean():.1f})')\n",
    "ax.axvline(group_A.mean(), color='steelblue', lw=2, ls='--')\n",
    "ax.axvline(group_B.mean(), color='salmon',    lw=2, ls='--')\n",
    "ax.set_xlabel('Exam Score')\n",
    "ax.set_ylabel('Count')\n",
    "ax.set_title(f'Two-sample t-test (Welch)  |  t={t_stat:.3f}, p={p_value:.4f}')\n",
    "ax.legend()\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
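  {
   "cell_type": "markdown",
   "id": "a1b2c3d4-0018-4000-8000-000000000018",
   "metadata": {},
   "source": [
    "A significant p-value says nothing about *how large* the difference is. As a sketch of effect-size reporting, the cell below computes Cohen's d for the two teaching methods using the pooled standard deviation. The `cohens_d` helper is our own, not a SciPy function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1b2c3d4-0019-4000-8000-000000000019",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cohen's d: standardised mean difference (manual helper, not part of SciPy).\n",
    "def cohens_d(x, y):\n",
    "    nx, ny = len(x), len(y)\n",
    "    # Pooled variance weights each group by its degrees of freedom\n",
    "    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)\n",
    "    return (x.mean() - y.mean()) / np.sqrt(pooled_var)\n",
    "\n",
    "d = cohens_d(group_B, group_A)\n",
    "print(f\"Cohen's d (B vs A): {d:.3f}\")\n",
    "print('Rule of thumb: 0.2 small, 0.5 medium, 0.8 large (Cohen, 1988)')"
   ]
  },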
  {
   "cell_type": "markdown",
   "id": "a1b2c3d4-0007-4000-8000-000000000007",
   "metadata": {},
   "source": [
    "## 3. Paired t-Test\n",
    "\n",
    "The **paired t-test** is used when each observation in one group is matched with exactly one observation in the other group. Common examples: measuring the same subject before and after treatment, or comparing two sensors on the same specimen.\n",
    "\n",
    "$$H_0: \\mu_d = 0 \\qquad H_1: \\mu_d \\neq 0$$\n",
    "\n",
    "where $d_i = x_{i,\\text{after}} - x_{i,\\text{before}}$. This reduces to a one-sample t-test on the differences:\n",
    "\n",
    "$$t = \\frac{\\bar{d}}{s_d / \\sqrt{n}}$$\n",
    "\n",
    "**Advantage over two-sample t-test:** pairing removes between-subject variability, increasing statistical power."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1b2c3d4-0008-4000-8000-000000000008",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scenario: blood pressure (mmHg) measured before and after a 12-week exercise programme.\n",
    "n = 25\n",
    "baseline = rng.normal(loc=130, scale=15, size=n)                 # baseline (before)\n",
    "treatment_effect = rng.normal(loc=-8, scale=5, size=n)           # individual response\n",
    "followup = baseline + treatment_effect                            # after\n",
    "\n",
    "t_stat, p_value = stats.ttest_rel(followup, baseline)\n",
    "\n",
    "differences = followup - baseline\n",
    "print(f'Mean difference (after − before): {differences.mean():.2f} mmHg')\n",
    "print(f'Std of differences              : {differences.std(ddof=1):.2f} mmHg')\n",
    "print(f't-statistic : {t_stat:.4f}')\n",
    "print(f'p-value     : {p_value:.6f}')\n",
    "\n",
    "alpha = 0.05\n",
    "if p_value < alpha:\n",
    "    print(f'\\nDecision: Reject H₀ — the exercise programme significantly changed blood pressure.')\n",
    "else:\n",
    "    print(f'\\nDecision: Fail to reject H₀.')\n",
    "\n",
    "# Plot differences\n",
    "fig, axes = plt.subplots(1, 2, figsize=(10, 4))\n",
    "\n",
    "# Paired lines\n",
    "ax = axes[0]\n",
    "for i in range(n):\n",
    "    color = 'steelblue' if followup[i] < baseline[i] else 'salmon'\n",
    "    ax.plot([0, 1], [baseline[i], followup[i]], color=color, alpha=0.5, lw=1)\n",
    "ax.plot([0, 1], [baseline.mean(), followup.mean()], 'k-o', lw=2.5, markersize=8, label='Group mean')\n",
    "ax.set_xticks([0, 1])\n",
    "ax.set_xticklabels(['Before', 'After'])\n",
    "ax.set_ylabel('Blood Pressure (mmHg)')\n",
    "ax.set_title('Individual trajectories')\n",
    "ax.legend()\n",
    "\n",
    "# Histogram of differences\n",
    "ax = axes[1]\n",
    "ax.hist(differences, bins=8, color='steelblue', edgecolor='white', alpha=0.8)\n",
    "ax.axvline(0, color='tomato', lw=2, ls='--', label='H₀: mean diff = 0')\n",
    "ax.axvline(differences.mean(), color='steelblue', lw=2, ls='--',\n",
    "           label=f'Observed mean ({differences.mean():.1f})')\n",
    "ax.set_xlabel('After − Before (mmHg)')\n",
    "ax.set_ylabel('Count')\n",
    "ax.set_title(f'Paired t-test  |  p={p_value:.4f}')\n",
    "ax.legend()\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1b2c3d4-0009-4000-8000-000000000009",
   "metadata": {},
   "source": [
    "## 4. Chi-Squared Test for Independence\n",
    "\n",
    "The **χ² test for independence** determines whether two categorical variables are associated. Given an observed contingency table with counts $O_{ij}$ and expected counts $E_{ij}$ under independence:\n",
    "\n",
    "$$\\chi^2 = \\sum_{i,j} \\frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \\qquad E_{ij} = \\frac{R_i \\cdot C_j}{N}$$\n",
    "\n",
    "Degrees of freedom: $(r - 1)(c - 1)$ where $r$ = rows, $c$ = columns.\n",
    "\n",
    "**Assumptions:**\n",
    "- Observations are independent\n",
    "- Expected cell counts ≥ 5 (for a 2×2 table that fails this, use Fisher's exact test)\n",
    "\n",
    "**When to use:** testing association between categorical variables (e.g., survey responses vs demographic group)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1b2c3d4-0010-4000-8000-000000000010",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scenario: Is preference for a product (A/B/C) independent of age group (18-34 / 35-54 / 55+)?\n",
    "observed = np.array([\n",
    "    #  Prod A  Prod B  Prod C\n",
    "    [   45,    30,    25],   # 18-34\n",
    "    [   35,    50,    15],   # 35-54\n",
    "    [   20,    20,    60],   # 55+\n",
    "])\n",
    "\n",
    "chi2, p_value, dof, expected = stats.chi2_contingency(observed)\n",
    "\n",
    "print('Observed frequencies:')\n",
    "print(observed)\n",
    "print('\\nExpected frequencies (under independence):')\n",
    "print(np.round(expected, 1))\n",
    "print(f'\\nχ² statistic : {chi2:.4f}')\n",
    "print(f'Degrees of freedom: {dof}')\n",
    "print(f'p-value      : {p_value:.6f}')\n",
    "\n",
    "alpha = 0.05\n",
    "if p_value < alpha:\n",
    "    print(f'\\nDecision: Reject H₀ — product preference IS associated with age group.')\n",
    "else:\n",
    "    print(f'\\nDecision: Fail to reject H₀ — no significant association detected.')\n",
    "\n",
    "# Heatmap of standardised residuals\n",
    "std_residuals = (observed - expected) / np.sqrt(expected)\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(11, 4))\n",
    "\n",
    "age_labels    = ['18-34', '35-54', '55+']\n",
    "product_labels = ['Product A', 'Product B', 'Product C']\n",
    "\n",
    "for ax, data, title, fmt in zip(\n",
    "        axes,\n",
    "        [observed, std_residuals],\n",
    "        ['Observed counts', 'Standardised residuals  (|resid| > 2 = notable)'],\n",
    "        ['.0f', '.2f']):\n",
    "    im = ax.imshow(data, cmap='RdYlGn' if 'residual' in title.lower() else 'Blues',\n",
    "                   aspect='auto')\n",
    "    ax.set_xticks(range(3)); ax.set_xticklabels(product_labels)\n",
    "    ax.set_yticks(range(3)); ax.set_yticklabels(age_labels)\n",
    "    for i in range(3):\n",
    "        for j in range(3):\n",
    "            ax.text(j, i, format(data[i, j], fmt), ha='center', va='center', fontsize=12)\n",
    "    ax.set_title(title)\n",
    "    plt.colorbar(im, ax=ax, shrink=0.8)\n",
    "\n",
    "plt.suptitle(f'Chi-squared test  |  χ²={chi2:.2f}, df={dof}, p={p_value:.4f}', y=1.02)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
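  {
   "cell_type": "markdown",
   "id": "a1b2c3d4-0020-4000-8000-000000000020",
   "metadata": {},
   "source": [
    "χ² grows with sample size, so it is not itself an effect size. As an illustrative follow-up, Cramér's V rescales χ² to [0, 1]; the computation below is a manual sketch, not a SciPy call."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1b2c3d4-0021-4000-8000-000000000021",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cramér's V: sqrt(χ² / (N · (min(r, c) − 1))), computed manually.\n",
    "N = observed.sum()\n",
    "r, c = observed.shape\n",
    "cramers_v = np.sqrt(chi2 / (N * (min(r, c) - 1)))\n",
    "print(f\"Cramér's V: {cramers_v:.3f}  (0 = no association, 1 = perfect association)\")"
   ]
  },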
  {
   "cell_type": "markdown",
   "id": "a1b2c3d4-0011-4000-8000-000000000011",
   "metadata": {},
   "source": [
    "## 5. One-Way ANOVA\n",
    "\n",
    "**Analysis of Variance (ANOVA)** tests whether the means of three or more independent groups are all equal:\n",
    "\n",
    "$$H_0: \\mu_1 = \\mu_2 = \\cdots = \\mu_k \\qquad H_1: \\text{at least one } \\mu_i \\neq \\mu_j$$\n",
    "\n",
    "The F-statistic partitions total variance into:\n",
    "- **Between-group variance** (signal): variation explained by group membership\n",
    "- **Within-group variance** (noise): variation within each group\n",
    "\n",
    "$$F = \\frac{\\text{MS}_{\\text{between}}}{\\text{MS}_{\\text{within}}}$$\n",
    "\n",
    "**Assumptions:** independence, normality within groups, homogeneity of variances (check with Levene's test).\n",
    "\n",
    "A significant ANOVA result only tells us *some* means differ — post-hoc tests (Tukey HSD, Bonferroni) identify *which* pairs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1b2c3d4-0012-4000-8000-000000000012",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scenario: compare crop yield (kg/plot) under three fertiliser treatments.\n",
    "n_per_group = 20\n",
    "control  = rng.normal(loc=50, scale=8, size=n_per_group)\n",
    "fert_A   = rng.normal(loc=55, scale=8, size=n_per_group)\n",
    "fert_B   = rng.normal(loc=62, scale=9, size=n_per_group)\n",
    "\n",
    "f_stat, p_value = stats.f_oneway(control, fert_A, fert_B)\n",
    "\n",
    "print(f'Control  — mean: {control.mean():.2f}, std: {control.std(ddof=1):.2f}')\n",
    "print(f'Fert A   — mean: {fert_A.mean():.2f}, std: {fert_A.std(ddof=1):.2f}')\n",
    "print(f'Fert B   — mean: {fert_B.mean():.2f}, std: {fert_B.std(ddof=1):.2f}')\n",
    "print(f'\\nF-statistic : {f_stat:.4f}')\n",
    "print(f'p-value     : {p_value:.6f}')\n",
    "\n",
    "alpha = 0.05\n",
    "if p_value < alpha:\n",
    "    print(f'\\nDecision: Reject H₀ — at least one group mean differs significantly.')\n",
    "else:\n",
    "    print(f'\\nDecision: Fail to reject H₀.')\n",
    "\n",
    "# Post-hoc pairwise t-tests with Bonferroni correction\n",
    "groups = {'Control': control, 'Fert A': fert_A, 'Fert B': fert_B}\n",
    "names = list(groups.keys())\n",
    "pairs = list(combinations(names, 2))\n",
    "print('\\nPost-hoc pairwise Welch t-tests (Bonferroni corrected):')\n",
    "for g1, g2 in pairs:\n",
    "    t, p = stats.ttest_ind(groups[g1], groups[g2], equal_var=False)\n",
    "    p_adj = min(p * len(pairs), 1.0)   # Bonferroni adjustment\n",
    "    sig = '***' if p_adj < 0.001 else ('**' if p_adj < 0.01 else ('*' if p_adj < 0.05 else 'ns'))\n",
    "    print(f'  {g1} vs {g2}: p_adj={p_adj:.4f} {sig}')\n",
    "\n",
    "# Boxplot\n",
    "fig, ax = plt.subplots(figsize=(7, 5))\n",
    "data_to_plot = [control, fert_A, fert_B]\n",
    "bp = ax.boxplot(data_to_plot, patch_artist=True, widths=0.5,\n",
    "                medianprops={'color': 'black', 'lw': 2})\n",
    "colors = ['#6baed6', '#74c476', '#fd8d3c']\n",
    "for patch, color in zip(bp['boxes'], colors):\n",
    "    patch.set_facecolor(color)\n",
    "    patch.set_alpha(0.75)\n",
    "ax.set_xticklabels(['Control', 'Fertiliser A', 'Fertiliser B'])\n",
    "ax.set_ylabel('Yield (kg / plot)')\n",
    "ax.set_title(f'One-way ANOVA  |  F={f_stat:.3f}, p={p_value:.4f}')\n",
    "\n",
    "# Strip plot overlay\n",
    "for i, arr in enumerate(data_to_plot, start=1):\n",
    "    jitter = rng.uniform(-0.1, 0.1, size=len(arr))\n",
    "    ax.scatter(i + jitter, arr, color='black', alpha=0.3, s=15, zorder=3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
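  {
   "cell_type": "markdown",
   "id": "a1b2c3d4-0022-4000-8000-000000000022",
   "metadata": {},
   "source": [
    "The effect size for one-way ANOVA, η² (the share of total variance explained by group membership), follows directly from the sums of squares. The short computation below is a manual sketch using the fertiliser groups above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1b2c3d4-0023-4000-8000-000000000023",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Eta-squared: η² = SS_between / SS_total.\n",
    "all_data = np.concatenate([control, fert_A, fert_B])\n",
    "grand_mean = all_data.mean()\n",
    "ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in [control, fert_A, fert_B])\n",
    "ss_total = ((all_data - grand_mean) ** 2).sum()\n",
    "eta_sq = ss_between / ss_total\n",
    "print(f'η² = {eta_sq:.3f}  ({eta_sq:.1%} of yield variance explained by treatment)')"
   ]
  },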
  {
   "cell_type": "markdown",
   "id": "a1b2c3d4-0013-4000-8000-000000000013",
   "metadata": {},
   "source": [
    "## 6. Multiple Testing Correction\n",
    "\n",
    "When we perform $m$ hypothesis tests simultaneously, the probability of at least one false positive increases rapidly. If each of the $m$ tests is independent and has type I error rate α, the **family-wise error rate (FWER)** is:\n",
    "\n",
    "$$\\text{FWER} = 1 - (1 - \\alpha)^m$$\n",
    "\n",
    "For $m = 20$ tests at α = 0.05: FWER ≈ 64%!\n",
    "\n",
    "**Bonferroni correction** is the simplest method: adjust the threshold to $\\alpha^* = \\alpha / m$, or equivalently multiply each p-value by $m$ (capping at 1).\n",
    "\n",
    "**Benjamini-Hochberg (BH)** controls the **False Discovery Rate (FDR)** instead — less conservative and more powerful when many tests are conducted (e.g., genomics, neuroimaging).\n",
    "\n",
    "| Method | Controls | Conservative? | Use when |\n",
    "|--------|----------|---------------|----------|\n",
    "| Bonferroni | FWER | Very | Few tests, confirmatory study |\n",
    "| Holm-Bonferroni | FWER | Moderately | Few tests, step-down |\n",
    "| Benjamini-Hochberg | FDR | Mildly | Many tests, exploratory study |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1b2c3d4-0014-4000-8000-000000000014",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Simulate 20 one-sample t-tests.\n",
    "# 15 null hypotheses are TRUE (μ=0), 5 are FALSE (μ≠0 — real effects).\n",
    "\n",
    "m_null = 15   # true null\n",
    "m_alt  = 5    # true alternatives\n",
    "m_total = m_null + m_alt\n",
    "n_obs = 30\n",
    "\n",
    "null_samples = [rng.normal(0, 1, n_obs) for _ in range(m_null)]\n",
    "alt_samples  = [rng.normal(0.9, 1, n_obs) for _ in range(m_alt)]  # real effect\n",
    "all_samples  = null_samples + alt_samples\n",
    "true_labels  = ['null'] * m_null + ['alternative'] * m_alt\n",
    "\n",
    "raw_pvals = np.array([stats.ttest_1samp(s, popmean=0).pvalue for s in all_samples])\n",
    "\n",
    "# Bonferroni correction\n",
    "bonferroni_pvals = np.minimum(raw_pvals * m_total, 1.0)\n",
    "\n",
    "# Benjamini-Hochberg (manual implementation)\n",
    "def bh_correction(pvals, fdr=0.05):\n",
    "    n = len(pvals)\n",
    "    sorted_idx = np.argsort(pvals)\n",
    "    sorted_p = pvals[sorted_idx]\n",
    "    thresholds = fdr * np.arange(1, n + 1) / n\n",
    "    reject_mask = sorted_p <= thresholds\n",
    "    # Find the largest k such that p_(k) <= k/m * FDR\n",
    "    if reject_mask.any():\n",
    "        cutoff = np.where(reject_mask)[0].max()\n",
    "        final_reject = np.zeros(n, dtype=bool)\n",
    "        final_reject[sorted_idx[:cutoff + 1]] = True\n",
    "    else:\n",
    "        final_reject = np.zeros(n, dtype=bool)\n",
    "    return final_reject\n",
    "\n",
    "bh_reject = bh_correction(raw_pvals, fdr=0.05)\n",
    "\n",
    "alpha = 0.05\n",
    "print(f'Uncorrected  α=0.05: {(raw_pvals < alpha).sum()} significant')\n",
    "print(f'Bonferroni corrected: {(bonferroni_pvals < alpha).sum()} significant')\n",
    "print(f'BH FDR-corrected    : {bh_reject.sum()} significant')\n",
    "\n",
    "# Visualise p-values\n",
    "fig, ax = plt.subplots(figsize=(10, 4))\n",
    "x = np.arange(m_total)\n",
    "colors_bar = ['#e74c3c' if t == 'alternative' else '#3498db' for t in true_labels]\n",
    "bars = ax.bar(x, -np.log10(raw_pvals), color=colors_bar, alpha=0.7, edgecolor='white')\n",
    "\n",
    "ax.axhline(-np.log10(alpha),           color='black',  ls='-',  lw=1.5, label=f'Uncorrected α={alpha}')\n",
    "ax.axhline(-np.log10(alpha / m_total), color='orange', ls='--', lw=1.5, label=f'Bonferroni α*={alpha/m_total:.4f}')\n",
    "\n",
    "from matplotlib.patches import Patch\n",
    "legend_handles = [\n",
    "    Patch(facecolor='#e74c3c', alpha=0.7, label='True alternative (real effect)'),\n",
    "    Patch(facecolor='#3498db', alpha=0.7, label='True null (no effect)'),\n",
    "]\n",
    "leg1 = ax.legend(handles=legend_handles, loc='upper left')\n",
    "ax.add_artist(leg1)\n",
    "ax.legend(loc='upper right')\n",
    "\n",
    "ax.set_xlabel('Test index')\n",
    "ax.set_ylabel('-log₁₀(p-value)')\n",
    "ax.set_title('Multiple testing: raw p-values and correction thresholds')\n",
    "ax.set_xticks(x)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "# FWER inflation demo\n",
    "m_range = np.arange(1, 101)\n",
    "fwer = 1 - (1 - 0.05) ** m_range\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(7, 3))\n",
    "ax.plot(m_range, fwer, color='tomato', lw=2)\n",
    "ax.axhline(0.05, color='steelblue', ls='--', label='α = 0.05')\n",
    "ax.axhline(0.50, color='gray',      ls=':',  label='50% FWER')\n",
    "ax.set_xlabel('Number of tests (m)')\n",
    "ax.set_ylabel('Family-wise error rate')\n",
    "ax.set_title('FWER inflation as number of simultaneous tests grows')\n",
    "ax.legend()\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
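  {
   "cell_type": "markdown",
   "id": "a1b2c3d4-0024-4000-8000-000000000024",
   "metadata": {},
   "source": [
    "The summary below also lists non-parametric alternatives. As a brief sketch, the cell below draws two small, strongly skewed samples (exponential, an arbitrary illustrative choice), checks normality with Shapiro-Wilk, then compares the groups with both Welch's t-test and the Mann-Whitney U test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1b2c3d4-0025-4000-8000-000000000025",
   "metadata": {},
   "outputs": [],
   "source": [
    "# When normality clearly fails on small samples, prefer a rank-based test.\n",
    "skew_A = rng.exponential(scale=2.0, size=25)\n",
    "skew_B = rng.exponential(scale=3.5, size=25)\n",
    "\n",
    "# Shapiro-Wilk: H₀ = data are normally distributed\n",
    "for name, arr in [('A', skew_A), ('B', skew_B)]:\n",
    "    w, p_sw = stats.shapiro(arr)\n",
    "    print(f'Shapiro-Wilk group {name}: W={w:.3f}, p={p_sw:.4f}')\n",
    "\n",
    "t_stat, p_t = stats.ttest_ind(skew_A, skew_B, equal_var=False)\n",
    "u_stat, p_u = stats.mannwhitneyu(skew_A, skew_B, alternative='two-sided')\n",
    "print(f'\\nWelch t-test   : p={p_t:.4f}')\n",
    "print(f'Mann-Whitney U : p={p_u:.4f}')"
   ]
  },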
  {
   "cell_type": "markdown",
   "id": "a1b2c3d4-0015-4000-8000-000000000015",
   "metadata": {},
   "source": [
    "## Summary: Choosing the Right Test\n",
    "\n",
    "| Test | Data type | Groups | Key assumption | SciPy function |\n",
    "|------|-----------|--------|----------------|----------------|\n",
    "| One-sample t-test | Continuous | 1 vs hypothesised μ | Normality | `ttest_1samp` |\n",
    "| Independent two-sample t-test (Welch) | Continuous | 2 independent | Normality | `ttest_ind` |\n",
    "| Paired t-test | Continuous | 2 matched/repeated | Normality of differences | `ttest_rel` |\n",
    "| One-way ANOVA | Continuous | ≥ 3 independent | Normality, equal variance | `f_oneway` |\n",
    "| Chi-squared test | Categorical | 2+ categories × 2+ categories | Expected counts ≥ 5 | `chi2_contingency` |\n",
    "| Mann-Whitney U | Ordinal/non-normal | 2 independent | None (non-parametric) | `mannwhitneyu` |\n",
    "| Wilcoxon signed-rank | Ordinal/non-normal | 2 matched | Symmetry of differences | `wilcoxon` |\n",
    "| Kruskal-Wallis | Ordinal/non-normal | ≥ 3 independent | None (non-parametric) | `kruskal` |\n",
    "\n",
    "**Decision flowchart:**\n",
    "1. How many groups? → 1, 2, or ≥ 3\n",
    "2. Are groups independent or paired?\n",
    "3. Is the outcome continuous or categorical?\n",
    "4. Are normality assumptions met? (use Q-Q plots + Shapiro-Wilk)\n",
    "5. If conducting multiple tests → apply correction (Bonferroni or BH)\n",
    "\n",
    "**Effect size matters:** a statistically significant result does not imply practical significance. Report Cohen's d (t-tests), η² (ANOVA), or Cramér's V (chi-squared) alongside p-values."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}