{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a1b2c3d4-0001-4000-8000-000000000001",
   "metadata": {},
   "source": [
    "# Hypothesis Testing\n",
    "\n",
    "Hypothesis testing is a statistical framework for making decisions from data under uncertainty. We define a **null hypothesis** (H₀) representing the status quo, and an **alternative hypothesis** (H₁) representing the claim we want to support. A test statistic is computed from sample data and compared against a critical value or converted to a p-value.\n",
    "\n",
    "**Key concepts:**\n",
    "- **p-value**: probability of observing data at least as extreme as the sample, assuming H₀ is true\n",
    "- **Significance level (α)**: threshold below which we reject H₀ (commonly 0.05)\n",
    "- **Type I error**: rejecting a true H₀ (false positive, probability = α)\n",
    "- **Type II error**: failing to reject a false H₀ (false negative, probability = β)\n",
    "- **Power**: probability of correctly rejecting a false H₀ (= 1 − β)\n",
    "\n",
    "This notebook covers the most common parametric and non-parametric tests used in statistics and data science."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1b2c3d4-0002-4000-8000-000000000002",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import scipy\n",
    "import scipy.stats as stats\n",
    "import matplotlib\n",
    "import matplotlib.pyplot as plt\n",
    "from itertools import combinations\n",
    "\n",
    "# Reproducibility\n",
    "rng = np.random.default_rng(seed=42)\n",
    "\n",
    "# Plot aesthetics\n",
    "plt.rcParams.update({\n",
    "    'figure.dpi': 100,\n",
    "    'axes.spines.top': False,\n",
    "    'axes.spines.right': False,\n",
    "    'font.size': 11,\n",
    "})\n",
    "\n",
    "print('NumPy:', np.__version__)\n",
    "print('SciPy:', scipy.__version__)\n",
    "print('Matplotlib:', matplotlib.__version__)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1b2c3d4-0003-4000-8000-000000000003",
   "metadata": {},
   "source": [
    "## 1. One-Sample t-Test\n",
    "\n",
    "The **one-sample t-test** asks: is the mean of a population equal to a hypothesised value μ₀?\n",
    "\n",
    "$$H_0: \\mu = \\mu_0 \\qquad H_1: \\mu \\neq \\mu_0$$\n",
    "\n",
    "The test statistic is:\n",
    "\n",
    "$$t = \\frac{\\bar{x} - \\mu_0}{s / \\sqrt{n}}$$\n",
    "\n",
    "where $\\bar{x}$ is the sample mean, $s$ is the sample standard deviation, and $n$ is the sample size. Under H₀ this follows a t-distribution with $n - 1$ degrees of freedom.\n",
    "\n",
    "**Assumptions:** observations are independent and (approximately) normally distributed — the test is robust to mild departures when $n \\geq 30$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1b2c3d4-0004-4000-8000-000000000004",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scenario: a factory claims its bolts have mean diameter 10 mm.\n",
    "# We sample 40 bolts and want to test that claim.\n",
    "\n",
    "mu_claimed = 10.0  # hypothesised population mean (mm)\n",
    "n = 40\n",
    "true_mean = 10.3  # the factory is actually producing slightly larger bolts\n",
    "true_std = 0.8\n",
    "\n",
    "sample = rng.normal(loc=true_mean, scale=true_std, size=n)\n",
    "\n",
    "t_stat, p_value = stats.ttest_1samp(sample, popmean=mu_claimed)\n",
    "\n",
    "print(f'Sample mean : {sample.mean():.4f} mm')\n",
    "print(f'Sample std  : {sample.std(ddof=1):.4f} mm')\n",
    "print(f't-statistic : {t_stat:.4f}')\n",
    "print(f'p-value     : {p_value:.4f}')\n",
    "\n",
    "alpha = 0.05\n",
    "if p_value < alpha:\n",
    "    print(f'\\nDecision: Reject H₀ (p={p_value:.4f} < α={alpha})')\n",
    "    print(f'The sample mean differs significantly from {mu_claimed} mm.')\n",
    "else:\n",
    "    print(f'\\nDecision: Fail to reject H₀ (p={p_value:.4f} ≥ α={alpha})')\n",
    "    print(f'No significant evidence that the mean differs from {mu_claimed} mm.')\n",
    "\n",
    "# Visualise: sample distribution vs hypothesised mean\n",
    "fig, ax = plt.subplots(figsize=(7, 4))\n",
    "ax.hist(sample, bins=10, color='steelblue', edgecolor='white', alpha=0.8, label='Sample')\n",
    "ax.axvline(sample.mean(), color='steelblue', lw=2, ls='--', label=f'Sample mean ({sample.mean():.2f})')\n",
    "ax.axvline(mu_claimed, color='tomato', lw=2, ls='--', label=f'Claimed mean ({mu_claimed})')\n",
    "ax.set_xlabel('Bolt diameter (mm)')\n",
    "ax.set_ylabel('Count')\n",
    "ax.set_title(f'One-sample t-test | t={t_stat:.3f}, p={p_value:.4f}')\n",
    "ax.legend()\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1b2c3d4-0005-4000-8000-000000000005",
   "metadata": {},
   "source": [
    "## 2. Two-Sample t-Test (Independent Groups)\n",
    "\n",
    "The **independent two-sample t-test** compares the means of two separate groups:\n",
    "\n",
    "$$H_0: \\mu_1 = \\mu_2 \\qquad H_1: \\mu_1 \\neq \\mu_2$$\n",
    "\n",
    "By default, SciPy's `ttest_ind` performs Student's t-test (`equal_var=True`); passing `equal_var=False` selects **Welch's t-test**, which does not assume equal variances and is generally preferred:\n",
    "\n",
    "$$t = \\frac{\\bar{x}_1 - \\bar{x}_2}{\\sqrt{\\frac{s_1^2}{n_1} + \\frac{s_2^2}{n_2}}}$$\n",
    "\n",
    "**When to use:** comparing outcomes between two independent groups (e.g., A/B test, treatment vs control)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1b2c3d4-0006-4000-8000-000000000006",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scenario: compare exam scores between two teaching methods.\n",
    "n1, n2 = 35, 40\n",
    "group_A = rng.normal(loc=72, scale=10, size=n1)  # Method A\n",
    "group_B = rng.normal(loc=77, scale=12, size=n2)  # Method B (slightly better)\n",
    "\n",
    "# Welch's t-test (does not assume equal variances)\n",
    "t_stat, p_value = stats.ttest_ind(group_A, group_B, equal_var=False)\n",
    "\n",
    "print('Group A — mean: {:.2f}, std: {:.2f}, n={}'.format(group_A.mean(), group_A.std(ddof=1), n1))\n",
    "print('Group B — mean: {:.2f}, std: {:.2f}, n={}'.format(group_B.mean(), group_B.std(ddof=1), n2))\n",
    "print(f'\\nt-statistic : {t_stat:.4f}')\n",
    "print(f'p-value     : {p_value:.4f}')\n",
    "\n",
    "alpha = 0.05\n",
    "conclusion = 'Reject H₀' if p_value < alpha else 'Fail to reject H₀'\n",
    "print(f'Decision    : {conclusion} at α={alpha}')\n",
    "\n",
    "# Visualise distributions\n",
    "fig, ax = plt.subplots(figsize=(8, 4))\n",
    "bins = np.linspace(40, 115, 25)\n",
    "ax.hist(group_A, bins=bins, alpha=0.6, color='steelblue', edgecolor='white', label=f'Method A (mean={group_A.mean():.1f})')\n",
    "ax.hist(group_B, bins=bins, alpha=0.6, color='salmon', edgecolor='white', label=f'Method B (mean={group_B.mean():.1f})')\n",
    "ax.axvline(group_A.mean(), color='steelblue', lw=2, ls='--')\n",
    "ax.axvline(group_B.mean(), color='salmon', lw=2, ls='--')\n",
    "ax.set_xlabel('Exam Score')\n",
    "ax.set_ylabel('Count')\n",
    "ax.set_title(f'Two-sample t-test (Welch) | t={t_stat:.3f}, p={p_value:.4f}')\n",
    "ax.legend()\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1b2c3d4-0007-4000-8000-000000000007",
   "metadata": {},
   "source": [
    "## 3. Paired t-Test\n",
    "\n",
    "The **paired t-test** is used when each observation in one group is matched with exactly one observation in the other group. Common examples: measuring the same subject before and after treatment, or comparing two sensors on the same specimen.\n",
    "\n",
    "$$H_0: \\mu_d = 0 \\qquad H_1: \\mu_d \\neq 0$$\n",
    "\n",
    "where $d_i = x_{i,\\text{after}} - x_{i,\\text{before}}$. This reduces to a one-sample t-test on the differences:\n",
    "\n",
    "$$t = \\frac{\\bar{d}}{s_d / \\sqrt{n}}$$\n",
    "\n",
    "**Advantage over two-sample t-test:** pairing removes between-subject variability, increasing statistical power."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1b2c3d4-0008-4000-8000-000000000008",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scenario: blood pressure (mmHg) measured before and after a 12-week exercise programme.\n",
    "n = 25\n",
    "baseline = rng.normal(loc=130, scale=15, size=n)  # baseline (before)\n",
    "treatment_effect = rng.normal(loc=-8, scale=5, size=n)  # individual response\n",
    "followup = baseline + treatment_effect  # after\n",
    "\n",
    "t_stat, p_value = stats.ttest_rel(followup, baseline)\n",
    "\n",
    "differences = followup - baseline\n",
    "print(f'Mean difference (after − before): {differences.mean():.2f} mmHg')\n",
    "print(f'Std of differences : {differences.std(ddof=1):.2f} mmHg')\n",
    "print(f't-statistic : {t_stat:.4f}')\n",
    "print(f'p-value : {p_value:.6f}')\n",
    "\n",
    "alpha = 0.05\n",
    "if p_value < alpha:\n",
    "    print('\\nDecision: Reject H₀ — the exercise programme significantly changed blood pressure.')\n",
    "else:\n",
    "    print('\\nDecision: Fail to reject H₀.')\n",
    "\n",
    "# Plot differences\n",
    "fig, axes = plt.subplots(1, 2, figsize=(10, 4))\n",
    "\n",
    "# Paired lines\n",
    "ax = axes[0]\n",
    "for i in range(n):\n",
    "    color = 'steelblue' if followup[i] < baseline[i] else 'salmon'\n",
    "    ax.plot([0, 1], [baseline[i], followup[i]], color=color, alpha=0.5, lw=1)\n",
    "ax.plot([0, 1], [baseline.mean(), followup.mean()], 'k-o', lw=2.5, markersize=8, label='Group mean')\n",
    "ax.set_xticks([0, 1])\n",
    "ax.set_xticklabels(['Before', 'After'])\n",
    "ax.set_ylabel('Blood Pressure (mmHg)')\n",
    "ax.set_title('Individual trajectories')\n",
    "ax.legend()\n",
    "\n",
    "# Histogram of differences\n",
    "ax = axes[1]\n",
    "ax.hist(differences, bins=8, color='steelblue', edgecolor='white', alpha=0.8)\n",
    "ax.axvline(0, color='tomato', lw=2, ls='--', label='H₀: mean diff = 0')\n",
    "ax.axvline(differences.mean(), color='steelblue', lw=2, ls='--',\n",
    "           label=f'Observed mean ({differences.mean():.1f})')\n",
    "ax.set_xlabel('After − Before (mmHg)')\n",
    "ax.set_ylabel('Count')\n",
    "ax.set_title(f'Paired t-test | p={p_value:.4f}')\n",
    "ax.legend()\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1b2c3d4-0009-4000-8000-000000000009",
   "metadata": {},
   "source": [
    "## 4. Chi-Squared Test for Independence\n",
    "\n",
    "The **χ² test for independence** determines whether two categorical variables are associated. Given an observed contingency table with counts $O_{ij}$ and expected counts $E_{ij}$ under independence:\n",
    "\n",
    "$$\\chi^2 = \\sum_{i,j} \\frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \\qquad E_{ij} = \\frac{R_i \\cdot C_j}{N}$$\n",
    "\n",
    "Degrees of freedom: $(r - 1)(c - 1)$ where $r$ = rows, $c$ = columns.\n",
    "\n",
    "**Assumptions:**\n",
    "- Observations are independent\n",
    "- Expected cell counts ≥ 5 (use Fisher's exact test when this fails)\n",
    "\n",
    "**When to use:** testing association between categorical variables (e.g., survey responses vs demographic group)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1b2c3d4-0010-4000-8000-000000000010",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scenario: Is preference for a product (A/B/C) independent of age group (18-34 / 35-54 / 55+)?\n",
    "observed = np.array([\n",
    "    # Prod A  Prod B  Prod C\n",
    "    [45, 30, 25],  # 18-34\n",
    "    [35, 50, 15],  # 35-54\n",
    "    [20, 20, 60],  # 55+\n",
    "])\n",
    "\n",
    "chi2, p_value, dof, expected = stats.chi2_contingency(observed)\n",
    "\n",
    "print('Observed frequencies:')\n",
    "print(observed)\n",
    "print('\\nExpected frequencies (under independence):')\n",
    "print(np.round(expected, 1))\n",
    "print(f'\\nχ² statistic : {chi2:.4f}')\n",
    "print(f'Degrees of freedom: {dof}')\n",
    "print(f'p-value : {p_value:.6f}')\n",
    "\n",
    "alpha = 0.05\n",
    "if p_value < alpha:\n",
    "    print('\\nDecision: Reject H₀ — product preference IS associated with age group.')\n",
    "else:\n",
    "    print('\\nDecision: Fail to reject H₀ — no significant association detected.')\n",
    "\n",
    "# Heatmap of standardised residuals\n",
    "std_residuals = (observed - expected) / np.sqrt(expected)\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(11, 4))\n",
    "\n",
    "age_labels = ['18-34', '35-54', '55+']\n",
    "product_labels = ['Product A', 'Product B', 'Product C']\n",
    "\n",
    "for ax, data, title, fmt in zip(\n",
    "        axes,\n",
    "        [observed, std_residuals],\n",
    "        ['Observed counts', 'Standardised residuals (|r| > 2 = notable)'],\n",
    "        ['.0f', '.2f']):\n",
    "    im = ax.imshow(data, cmap='RdYlGn' if 'residual' in title.lower() else 'Blues',\n",
    "                   aspect='auto')\n",
    "    ax.set_xticks(range(3)); ax.set_xticklabels(product_labels)\n",
    "    ax.set_yticks(range(3)); ax.set_yticklabels(age_labels)\n",
    "    for i in range(3):\n",
    "        for j in range(3):\n",
    "            ax.text(j, i, format(data[i, j], fmt), ha='center', va='center', fontsize=12)\n",
    "    ax.set_title(title)\n",
    "    plt.colorbar(im, ax=ax, shrink=0.8)\n",
    "\n",
    "plt.suptitle(f'Chi-squared test | χ²={chi2:.2f}, df={dof}, p={p_value:.4f}', y=1.02)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1b2c3d4-0011-4000-8000-000000000011",
   "metadata": {},
   "source": [
    "## 5. One-Way ANOVA\n",
    "\n",
    "**Analysis of Variance (ANOVA)** tests whether the means of three or more independent groups are all equal:\n",
    "\n",
    "$$H_0: \\mu_1 = \\mu_2 = \\cdots = \\mu_k \\qquad H_1: \\text{at least one } \\mu_i \\neq \\mu_j$$\n",
    "\n",
    "The F-statistic partitions total variance into:\n",
    "- **Between-group variance** (signal): variation explained by group membership\n",
    "- **Within-group variance** (noise): variation within each group\n",
    "\n",
    "$$F = \\frac{\\text{MS}_{\\text{between}}}{\\text{MS}_{\\text{within}}}$$\n",
    "\n",
    "**Assumptions:** independence, normality within groups, homogeneity of variances (Levene's test).\n",
    "\n",
    "A significant ANOVA result only tells us *some* means differ — post-hoc tests (Tukey HSD, Bonferroni) identify *which* pairs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1b2c3d4-0012-4000-8000-000000000012",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scenario: compare crop yield (kg/plot) under three fertiliser treatments.\n",
    "n_per_group = 20\n",
    "control = rng.normal(loc=50, scale=8, size=n_per_group)\n",
    "fert_A = rng.normal(loc=55, scale=8, size=n_per_group)\n",
    "fert_B = rng.normal(loc=62, scale=9, size=n_per_group)\n",
    "\n",
    "f_stat, p_value = stats.f_oneway(control, fert_A, fert_B)\n",
    "\n",
    "print(f'Control — mean: {control.mean():.2f}, std: {control.std(ddof=1):.2f}')\n",
    "print(f'Fert A  — mean: {fert_A.mean():.2f}, std: {fert_A.std(ddof=1):.2f}')\n",
    "print(f'Fert B  — mean: {fert_B.mean():.2f}, std: {fert_B.std(ddof=1):.2f}')\n",
    "print(f'\\nF-statistic : {f_stat:.4f}')\n",
    "print(f'p-value : {p_value:.6f}')\n",
    "\n",
    "alpha = 0.05\n",
    "if p_value < alpha:\n",
    "    print('\\nDecision: Reject H₀ — at least one group mean differs significantly.')\n",
    "else:\n",
    "    print('\\nDecision: Fail to reject H₀.')\n",
    "\n",
    "# Post-hoc pairwise t-tests with Bonferroni correction\n",
    "groups = {'Control': control, 'Fert A': fert_A, 'Fert B': fert_B}\n",
    "names = list(groups.keys())\n",
    "pairs = list(combinations(names, 2))\n",
    "print('\\nPost-hoc pairwise Welch t-tests (Bonferroni corrected):')\n",
    "for g1, g2 in pairs:\n",
    "    t, p = stats.ttest_ind(groups[g1], groups[g2], equal_var=False)\n",
    "    p_adj = min(p * len(pairs), 1.0)  # Bonferroni adjustment\n",
    "    sig = '***' if p_adj < 0.001 else ('**' if p_adj < 0.01 else ('*' if p_adj < 0.05 else 'ns'))\n",
    "    print(f'  {g1} vs {g2}: p_adj={p_adj:.4f} {sig}')\n",
    "\n",
    "# Boxplot\n",
    "fig, ax = plt.subplots(figsize=(7, 5))\n",
    "data_to_plot = [control, fert_A, fert_B]\n",
    "bp = ax.boxplot(data_to_plot, patch_artist=True, widths=0.5,\n",
    "                medianprops={'color': 'black', 'lw': 2})\n",
    "colors = ['#6baed6', '#74c476', '#fd8d3c']\n",
    "for patch, color in zip(bp['boxes'], colors):\n",
    "    patch.set_facecolor(color)\n",
    "    patch.set_alpha(0.75)\n",
    "ax.set_xticklabels(['Control', 'Fertiliser A', 'Fertiliser B'])\n",
    "ax.set_ylabel('Yield (kg / plot)')\n",
    "ax.set_title(f'One-way ANOVA | F={f_stat:.3f}, p={p_value:.4f}')\n",
    "\n",
    "# Strip plot overlay\n",
    "for i, arr in enumerate(data_to_plot, start=1):\n",
    "    jitter = rng.uniform(-0.1, 0.1, size=len(arr))\n",
    "    ax.scatter(i + jitter, arr, color='black', alpha=0.3, s=15, zorder=3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
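  {
   "cell_type": "markdown",
   "id": "a1b2c3d4-0016-4000-8000-000000000016",
   "metadata": {},
   "source": [
    "The ANOVA assumptions listed above can be checked directly. Below is a minimal sketch (data regenerated with its own seed so the cell runs on its own): Levene's test for homogeneity of variances and Shapiro-Wilk for within-group normality.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1b2c3d4-0017-4000-8000-000000000017",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assumption checks for one-way ANOVA (illustrative sketch).\n",
    "import numpy as np\n",
    "import scipy.stats as stats\n",
    "\n",
    "rng_check = np.random.default_rng(0)\n",
    "check_groups = {name: rng_check.normal(loc=mu, scale=8, size=20)\n",
    "                for name, mu in [('Control', 50), ('Fert A', 55), ('Fert B', 62)]}\n",
    "\n",
    "# Levene's test: H₀ says all group variances are equal\n",
    "w, p = stats.levene(*check_groups.values())\n",
    "print(f\"Levene's test: W={w:.3f}, p={p:.4f}\")\n",
    "\n",
    "# Shapiro-Wilk: H₀ says the sample comes from a normal distribution\n",
    "for name, g in check_groups.items():\n",
    "    sw, sp = stats.shapiro(g)\n",
    "    print(f'Shapiro-Wilk ({name}): W={sw:.3f}, p={sp:.4f}')"
   ]
  },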
  {
   "cell_type": "markdown",
   "id": "a1b2c3d4-0013-4000-8000-000000000013",
   "metadata": {},
   "source": [
    "## 6. Multiple Testing Correction\n",
    "\n",
    "When we perform $m$ hypothesis tests simultaneously, the probability of at least one false positive increases rapidly. If each test has type I error rate α and the tests are independent, the **family-wise error rate (FWER)** is:\n",
    "\n",
    "$$\\text{FWER} = 1 - (1 - \\alpha)^m$$\n",
    "\n",
    "For $m = 20$ tests at α = 0.05: FWER ≈ 64%!\n",
    "\n",
    "**Bonferroni correction** is the simplest method: adjust the threshold to $\\alpha^* = \\alpha / m$, or equivalently multiply each p-value by $m$ (capping at 1).\n",
    "\n",
    "**Benjamini-Hochberg (BH)** controls the **False Discovery Rate (FDR)** instead — less conservative and more powerful when many tests are conducted (e.g., genomics, neuroimaging).\n",
    "\n",
    "| Method | Controls | Conservative? | Use when |\n",
    "|--------|----------|---------------|----------|\n",
    "| Bonferroni | FWER | Very | Few tests, confirmatory study |\n",
    "| Holm-Bonferroni | FWER | Moderately | Few tests, step-down |\n",
    "| Benjamini-Hochberg | FDR | Mildly | Many tests, exploratory study |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1b2c3d4-0014-4000-8000-000000000014",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Simulate 20 one-sample t-tests.\n",
    "# 15 null hypotheses are TRUE (μ=0), 5 are FALSE (μ≠0 — real effects).\n",
    "\n",
    "m_null = 15  # true null\n",
    "m_alt = 5    # true alternatives\n",
    "m_total = m_null + m_alt\n",
    "n_obs = 30\n",
    "\n",
    "null_samples = [rng.normal(0, 1, n_obs) for _ in range(m_null)]\n",
    "alt_samples = [rng.normal(0.9, 1, n_obs) for _ in range(m_alt)]  # real effect\n",
    "all_samples = null_samples + alt_samples\n",
    "true_labels = ['null'] * m_null + ['alternative'] * m_alt\n",
    "\n",
    "raw_pvals = np.array([stats.ttest_1samp(s, popmean=0).pvalue for s in all_samples])\n",
    "\n",
    "# Bonferroni correction\n",
    "bonferroni_pvals = np.minimum(raw_pvals * m_total, 1.0)\n",
    "\n",
    "# Benjamini-Hochberg (manual implementation)\n",
    "def bh_correction(pvals, fdr=0.05):\n",
    "    n = len(pvals)\n",
    "    sorted_idx = np.argsort(pvals)\n",
    "    sorted_p = pvals[sorted_idx]\n",
    "    thresholds = fdr * np.arange(1, n + 1) / n\n",
    "    reject_mask = sorted_p <= thresholds\n",
    "    # Find the largest k such that p_(k) <= (k/n) * FDR; reject tests 1..k\n",
    "    final_reject = np.zeros(n, dtype=bool)\n",
    "    if reject_mask.any():\n",
    "        cutoff = np.where(reject_mask)[0].max()\n",
    "        final_reject[sorted_idx[:cutoff + 1]] = True\n",
    "    return final_reject\n",
    "\n",
    "bh_reject = bh_correction(raw_pvals, fdr=0.05)\n",
    "\n",
    "alpha = 0.05\n",
    "print(f'Uncorrected α=0.05  : {(raw_pvals < alpha).sum()} significant')\n",
    "print(f'Bonferroni corrected: {(bonferroni_pvals < alpha).sum()} significant')\n",
    "print(f'BH FDR-corrected    : {bh_reject.sum()} significant')\n",
    "\n",
    "# Visualise p-values\n",
    "fig, ax = plt.subplots(figsize=(10, 4))\n",
    "x = np.arange(m_total)\n",
    "colors_bar = ['#e74c3c' if t == 'alternative' else '#3498db' for t in true_labels]\n",
    "ax.bar(x, -np.log10(raw_pvals), color=colors_bar, alpha=0.7, edgecolor='white')\n",
    "\n",
    "ax.axhline(-np.log10(alpha), color='black', ls='-', lw=1.5, label=f'Uncorrected α={alpha}')\n",
    "ax.axhline(-np.log10(alpha / m_total), color='orange', ls='--', lw=1.5, label=f'Bonferroni α*={alpha/m_total:.4f}')\n",
    "\n",
    "from matplotlib.patches import Patch\n",
    "legend_handles = [\n",
    "    Patch(facecolor='#e74c3c', alpha=0.7, label='True alternative (real effect)'),\n",
    "    Patch(facecolor='#3498db', alpha=0.7, label='True null (no effect)'),\n",
    "]\n",
    "leg1 = ax.legend(handles=legend_handles, loc='upper left')\n",
    "ax.add_artist(leg1)\n",
    "ax.legend(loc='upper right')\n",
    "\n",
    "ax.set_xlabel('Test index')\n",
    "ax.set_ylabel('-log₁₀(p-value)')\n",
    "ax.set_title('Multiple testing: raw p-values and correction thresholds')\n",
    "ax.set_xticks(x)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "# FWER inflation demo\n",
    "m_range = np.arange(1, 101)\n",
    "fwer = 1 - (1 - 0.05) ** m_range\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(7, 3))\n",
    "ax.plot(m_range, fwer, color='tomato', lw=2)\n",
    "ax.axhline(0.05, color='steelblue', ls='--', label='α = 0.05')\n",
    "ax.axhline(0.50, color='gray', ls=':', label='50% FWER')\n",
    "ax.set_xlabel('Number of tests (m)')\n",
    "ax.set_ylabel('Family-wise error rate')\n",
    "ax.set_title('FWER inflation as number of simultaneous tests grows')\n",
    "ax.legend()\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1b2c3d4-0015-4000-8000-000000000015",
   "metadata": {},
   "source": [
    "## Summary: Choosing the Right Test\n",
    "\n",
    "| Test | Data type | Groups | Key assumption | SciPy function |\n",
    "|------|-----------|--------|----------------|----------------|\n",
    "| One-sample t-test | Continuous | 1 vs hypothesised μ | Normality | `ttest_1samp` |\n",
    "| Independent two-sample t-test (Welch) | Continuous | 2 independent | Normality | `ttest_ind` |\n",
    "| Paired t-test | Continuous | 2 matched/repeated | Normality of differences | `ttest_rel` |\n",
    "| One-way ANOVA | Continuous | ≥ 3 independent | Normality, equal variance | `f_oneway` |\n",
    "| Chi-squared test | Categorical | 2+ categories × 2+ categories | Expected counts ≥ 5 | `chi2_contingency` |\n",
    "| Mann-Whitney U | Ordinal/non-normal | 2 independent | None (non-parametric) | `mannwhitneyu` |\n",
    "| Wilcoxon signed-rank | Ordinal/non-normal | 2 matched | Symmetry of differences | `wilcoxon` |\n",
    "| Kruskal-Wallis | Ordinal/non-normal | ≥ 3 independent | None (non-parametric) | `kruskal` |\n",
    "\n",
    "**Decision flowchart:**\n",
    "1. How many groups? → 1, 2, or ≥ 3\n",
    "2. Are groups independent or paired?\n",
    "3. Is the outcome continuous or categorical?\n",
    "4. Are normality assumptions met? (use Q-Q plots + Shapiro-Wilk)\n",
    "5. If conducting multiple tests → apply correction (Bonferroni or BH)\n",
    "\n",
    "**Effect size matters:** a statistically significant result does not imply practical significance. Report Cohen's d (t-tests), η² (ANOVA), or Cramér's V (chi-squared) alongside p-values."
   ]
  }
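  ,
  {
   "cell_type": "markdown",
   "id": "a1b2c3d4-0018-4000-8000-000000000018",
   "metadata": {},
   "source": [
    "As a minimal illustration of the effect-size advice above (a sketch only, using a hypothetical helper `cohens_d`, which is not a SciPy function):\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1b2c3d4-0019-4000-8000-000000000019",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cohen's d for two independent samples (pooled-SD definition).\n",
    "# `cohens_d` is an illustrative helper, not part of SciPy.\n",
    "import numpy as np\n",
    "\n",
    "def cohens_d(x, y):\n",
    "    nx, ny = len(x), len(y)\n",
    "    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)\n",
    "    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)\n",
    "\n",
    "rng_es = np.random.default_rng(7)\n",
    "scores_A = rng_es.normal(72, 10, 35)\n",
    "scores_B = rng_es.normal(77, 12, 40)\n",
    "print(f\"Cohen's d (A vs B): {cohens_d(scores_A, scores_B):.3f}\")"
   ]
  }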
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}