24. Experimental Design
24. Experimental Design¶
Previous: Nonparametric Statistics | Next: Practical Projects
Overview¶
Experimental design is a systematic methodology for inferring causal relationships. This chapter covers the basic principles of experimental design, A/B testing, sample size determination through power analysis, and sequential testing methods.
1. Basic Principles of Experimental Design¶
1.1 Three Core Principles¶
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
from scipy.stats import norm, t
np.random.seed(42)
def experimental_design_principles():
"""Three core principles of experimental design"""
print("""
=================================================
Three Core Principles of Experimental Design
=================================================
1. Randomization
─────────────────────────────
- Randomly assign subjects to treatment groups
- Equally distribute effects of confounding variables
- Foundation for causal inference
Examples:
- Coin flip for A/B group assignment
- Computer-generated random numbers
- Blocked randomization (stratify then randomize)
2. Replication
─────────────────────────────
- Sufficient number of independent observations
- Ensure statistical power
- Enable variability estimation
Considerations:
- Sample size calculation (power analysis)
- Cost-benefit trade-off
- Practical constraints
3. Blocking
─────────────────────────────
- Group subjects by known sources of variation
- Random assignment within blocks
- Reduce error, increase power
Examples:
- Block by gender → random assignment within each block
- Stratify by age group
- Region, time period, etc.
=================================================
Additional Principles
=================================================
- Control: Include control group
- Blinding: Single/double blind
- Balance: Equal allocation across groups
""")
experimental_design_principles()
1.2 Implementing Randomization¶
def randomize_participants(participants, n_groups=2, method='simple', block_var=None):
"""
Randomize participants
Parameters:
-----------
participants : DataFrame
Participant information
n_groups : int
Number of groups
method : str
'simple' - Simple randomization
'stratified' - Stratified randomization
block_var : str
Stratification variable (when method='stratified')
"""
n = len(participants)
result = participants.copy()
if method == 'simple':
# Simple random assignment
assignments = np.random.choice(range(n_groups), size=n)
result['group'] = assignments
elif method == 'stratified' and block_var is not None:
# Stratified random assignment
result['group'] = -1
for block_value in participants[block_var].unique():
mask = participants[block_var] == block_value
block_n = mask.sum()
assignments = np.random.choice(range(n_groups), size=block_n)
result.loc[mask, 'group'] = assignments
return result
# Example: 100 participants
np.random.seed(42)
participants = pd.DataFrame({
'id': range(100),
'age': np.random.choice(['young', 'middle', 'old'], 100),
'gender': np.random.choice(['M', 'F'], 100)
})
# Simple randomization
simple_rand = randomize_participants(participants, n_groups=2, method='simple')
# Stratified randomization (by gender)
stratified_rand = randomize_participants(participants, n_groups=2,
method='stratified', block_var='gender')
print("=== Simple Randomization Results ===")
print(pd.crosstab(simple_rand['gender'], simple_rand['group']))
print("\n=== Stratified Randomization Results (by gender) ===")
print(pd.crosstab(stratified_rand['gender'], stratified_rand['group']))
1.3 Types of Experimental Designs¶
def experimental_design_types():
"""Major types of experimental designs"""
print("""
=================================================
Experimental Design Types
=================================================
1. Completely Randomized Design
- Simplest design
- Complete random assignment of subjects to treatments
- Analysis: Independent t-test, one-way ANOVA
2. Randomized Block Design
- Stratify by blocking variable then randomize
- All treatment levels included within each block
- Analysis: Two-way ANOVA (removing block effect)
3. Factorial Design
- Study combined effects of multiple factors
- Can detect interaction effects
- Analysis: Multi-way ANOVA
4. Crossover Design
- Subjects receive all treatments sequentially
- Controls for between-subject variation
- Watch for carryover effects
5. Split-Plot Design
- One factor applied to whole units, another to sub-units
- Common in agriculture and engineering
""")
experimental_design_types()
2. A/B Testing Theory¶
2.1 A/B Testing Overview¶
def ab_test_overview():
"""A/B testing overview"""
print("""
=================================================
A/B Testing
=================================================
Definition:
- Randomized controlled experiment comparing two versions (A, B)
- Most widely used experimental method in web/app
Terminology:
- Control (A): Existing version (control group)
- Treatment (B): New version (treatment group)
- Conversion Rate: Proportion of goal actions
- Lift: (B - A) / A
Process:
1. Formulate hypothesis
2. Define metrics
3. Calculate sample size
4. Run experiment
5. Statistical analysis
6. Decision making
Cautions:
- Unit consistency (users vs sessions vs pageviews)
- Experiment duration (minimum 1-2 weeks, consider day-of-week effects)
- Multiple comparison correction
- Network effects (spillover)
""")
ab_test_overview()
2.2 A/B Test Analysis¶
class ABTest:
"""A/B test analysis class"""
def __init__(self, control_visitors, control_conversions,
treatment_visitors, treatment_conversions):
self.n_c = control_visitors
self.x_c = control_conversions
self.n_t = treatment_visitors
self.x_t = treatment_conversions
self.p_c = self.x_c / self.n_c
self.p_t = self.x_t / self.n_t
def z_test(self, alternative='two-sided'):
"""Z-test for two proportions"""
# Pooled proportion
p_pooled = (self.x_c + self.x_t) / (self.n_c + self.n_t)
# Standard error
se = np.sqrt(p_pooled * (1 - p_pooled) * (1/self.n_c + 1/self.n_t))
# Z statistic
z = (self.p_t - self.p_c) / se
# p-value
if alternative == 'two-sided':
p_value = 2 * (1 - norm.cdf(abs(z)))
elif alternative == 'greater': # treatment > control
p_value = 1 - norm.cdf(z)
else: # treatment < control
p_value = norm.cdf(z)
return z, p_value
def confidence_interval(self, alpha=0.05):
"""Confidence interval for the difference"""
diff = self.p_t - self.p_c
# Variance for each proportion
var_c = self.p_c * (1 - self.p_c) / self.n_c
var_t = self.p_t * (1 - self.p_t) / self.n_t
se = np.sqrt(var_c + var_t)
z_crit = norm.ppf(1 - alpha/2)
ci_lower = diff - z_crit * se
ci_upper = diff + z_crit * se
return diff, (ci_lower, ci_upper)
def lift(self):
"""Calculate lift"""
if self.p_c == 0:
return np.inf
return (self.p_t - self.p_c) / self.p_c
def summary(self):
"""Summary of results"""
print("=== A/B Test Summary ===")
print(f"\nControl: {self.x_c:,}/{self.n_c:,} = {self.p_c:.4f} ({self.p_c*100:.2f}%)")
print(f"Treatment: {self.x_t:,}/{self.n_t:,} = {self.p_t:.4f} ({self.p_t*100:.2f}%)")
z, p_value = self.z_test()
diff, ci = self.confidence_interval()
lift = self.lift()
print(f"\nDifference: {diff:.4f} ({diff*100:.2f}%p)")
print(f"Lift: {lift*100:.2f}%")
print(f"95% CI: ({ci[0]*100:.2f}%p, {ci[1]*100:.2f}%p)")
print(f"\nZ statistic: {z:.3f}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
print("\nConclusion: Statistically significant difference (p < 0.05)")
if diff > 0:
print("Treatment is significantly higher than Control")
else:
print("Treatment is significantly lower than Control")
else:
print("\nConclusion: No statistically significant difference (p >= 0.05)")
# Example: Button color A/B test
ab_test = ABTest(
control_visitors=10000,
control_conversions=350,
treatment_visitors=10000,
treatment_conversions=420
)
ab_test.summary()
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Conversion rate comparison
ax = axes[0]
bars = ax.bar(['Control', 'Treatment'], [ab_test.p_c, ab_test.p_t], alpha=0.7)
ax.set_ylabel('Conversion Rate')
ax.set_title('A/B Test: Conversion Rate Comparison')
# Add error bars
se_c = np.sqrt(ab_test.p_c * (1 - ab_test.p_c) / ab_test.n_c)
se_t = np.sqrt(ab_test.p_t * (1 - ab_test.p_t) / ab_test.n_t)
ax.errorbar(['Control', 'Treatment'], [ab_test.p_c, ab_test.p_t],
yerr=[1.96*se_c, 1.96*se_t], fmt='none', color='black', capsize=5)
ax.grid(True, alpha=0.3, axis='y')
# Confidence interval of difference
ax = axes[1]
diff, ci = ab_test.confidence_interval()
ax.errorbar([0], [diff], yerr=[[diff - ci[0]], [ci[1] - diff]],
fmt='o', markersize=10, capsize=10, capthick=2)
ax.axhline(0, color='r', linestyle='--', label='No difference')
ax.set_xlim(-1, 1)
ax.set_ylabel('Conversion Rate Difference')
ax.set_title(f'95% Confidence Interval of Difference\n({ci[0]:.4f}, {ci[1]:.4f})')
ax.set_xticks([])
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
2.3 Bayesian A/B Testing¶
def bayesian_ab_test(n_c, x_c, n_t, x_t, alpha_prior=1, beta_prior=1, n_samples=100000):
"""
Bayesian A/B test
Conversion rate estimation using Beta prior
"""
# Posterior distribution (Beta-Binomial conjugate)
alpha_c = alpha_prior + x_c
beta_c = beta_prior + n_c - x_c
alpha_t = alpha_prior + x_t
beta_t = beta_prior + n_t - x_t
# Sample from posterior
samples_c = np.random.beta(alpha_c, beta_c, n_samples)
samples_t = np.random.beta(alpha_t, beta_t, n_samples)
# P(Treatment > Control)
prob_t_better = np.mean(samples_t > samples_c)
# Expected lift
lift_samples = (samples_t - samples_c) / samples_c
expected_lift = np.mean(lift_samples)
lift_ci = np.percentile(lift_samples, [2.5, 97.5])
print("=== Bayesian A/B Test ===")
print(f"\nP(Treatment > Control): {prob_t_better:.4f} ({prob_t_better*100:.1f}%)")
print(f"Expected lift: {expected_lift*100:.2f}%")
print(f"Lift 95% CI: ({lift_ci[0]*100:.2f}%, {lift_ci[1]*100:.2f}%)")
# Decision criteria
print("\nDecision:")
if prob_t_better > 0.95:
print(" → Recommend adopting Treatment (P > 95%)")
elif prob_t_better < 0.05:
print(" → Recommend keeping Control (P < 5%)")
else:
print(" → Need more data")
# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Posterior distribution comparison
ax = axes[0]
x_range = np.linspace(0, 0.1, 200)
ax.plot(x_range, stats.beta(alpha_c, beta_c).pdf(x_range), label='Control')
ax.plot(x_range, stats.beta(alpha_t, beta_t).pdf(x_range), label='Treatment')
ax.fill_between(x_range, stats.beta(alpha_c, beta_c).pdf(x_range), alpha=0.3)
ax.fill_between(x_range, stats.beta(alpha_t, beta_t).pdf(x_range), alpha=0.3)
ax.set_xlabel('Conversion Rate')
ax.set_ylabel('Density')
ax.set_title('Posterior Distribution of Conversion Rate')
ax.legend()
ax.grid(True, alpha=0.3)
# Difference distribution
ax = axes[1]
diff_samples = samples_t - samples_c
ax.hist(diff_samples, bins=50, density=True, alpha=0.7, edgecolor='black')
ax.axvline(0, color='r', linestyle='--', label='No difference')
ax.axvline(np.mean(diff_samples), color='g', linestyle='-',
label=f'Mean: {np.mean(diff_samples):.4f}')
ax.set_xlabel('Conversion Rate Difference (T - C)')
ax.set_ylabel('Density')
ax.set_title(f'Posterior Distribution of Difference\nP(T>C)={prob_t_better:.3f}')
ax.legend()
ax.grid(True, alpha=0.3)
# Lift distribution
ax = axes[2]
lift_samples_clipped = np.clip(lift_samples, -1, 2)
ax.hist(lift_samples_clipped, bins=50, density=True, alpha=0.7, edgecolor='black')
ax.axvline(0, color='r', linestyle='--', label='0%')
ax.axvline(expected_lift, color='g', linestyle='-',
label=f'Expected: {expected_lift*100:.1f}%')
ax.set_xlabel('Lift')
ax.set_ylabel('Density')
ax.set_title('Posterior Distribution of Lift')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
return prob_t_better, expected_lift
# Bayesian analysis
prob_better, exp_lift = bayesian_ab_test(10000, 350, 10000, 420)
3. Sample Size Determination (Power Analysis)¶
3.1 Power Analysis Concepts¶
def power_analysis_concepts():
"""Core concepts of power analysis"""
print("""
=================================================
Power Analysis
=================================================
Four elements (calculate one from the other three):
─────────────────────────────────────────────────
1. Effect Size
- Minimum effect to detect
- Example: conversion rate difference 0.02 (2%p)
2. Significance Level α
- Type I error probability
- Typically 0.05
3. Power 1-β
- Probability of detecting effect when it exists
- Typically 0.80 (minimum) ~ 0.90
4. Sample Size n
- Required number of observations
Calculation flow:
─────────────────────────────────────────────────
Effect size + α + (1-β) → n (prior design)
n + α + (1-β) → Minimum detectable effect (sensitivity)
n + α + Effect size → Achieved power (post-hoc)
Rules of thumb:
─────────────────────────────────────────────────
- Power < 80%: Underpowered
- Power 80-90%: Generally recommended
- Power > 90%: High power study
""")
power_analysis_concepts()
3.2 Sample Size for Two Proportion Comparison¶
def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80, ratio=1):
"""
Calculate sample size for comparing two proportions
Parameters:
-----------
p1 : float
Control conversion rate (baseline)
p2 : float
Treatment conversion rate (target)
alpha : float
Significance level
power : float
Statistical power
ratio : float
n2/n1 ratio (default 1 = equal size)
Returns:
--------
n1, n2 : int
Required sample size for each group
"""
# Effect size
effect = abs(p2 - p1)
p_pooled = (p1 + ratio * p2) / (1 + ratio)
# Z values
z_alpha = norm.ppf(1 - alpha/2) # Two-sided
z_beta = norm.ppf(power)
# Sample size formula
numerator = (z_alpha * np.sqrt((1 + ratio) * p_pooled * (1 - p_pooled)) +
z_beta * np.sqrt(p1 * (1 - p1) + ratio * p2 * (1 - p2)))**2
n1 = numerator / (effect**2 * ratio)
n2 = n1 * ratio
return int(np.ceil(n1)), int(np.ceil(n2))
def plot_sample_size_analysis(p1_base, effects, alpha=0.05, power=0.80):
"""Required sample size by effect size"""
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Effect size vs sample size
ax = axes[0]
sample_sizes = []
for effect in effects:
p2 = p1_base + effect
n1, _ = sample_size_two_proportions(p1_base, p2, alpha, power)
sample_sizes.append(n1)
ax.plot(np.array(effects)*100, sample_sizes, 'bo-', linewidth=2)
ax.set_xlabel('Effect Size (Conversion Rate Difference %p)')
ax.set_ylabel('Required Sample Size per Group')
ax.set_title(f'Effect Size vs Sample Size\n(Baseline rate={p1_base:.1%}, α={alpha}, power={power})')
ax.grid(True, alpha=0.3)
# Log scale
ax.set_yscale('log')
for i, (eff, n) in enumerate(zip(effects, sample_sizes)):
ax.annotate(f'{n:,}', (eff*100, n), textcoords="offset points",
xytext=(0, 10), ha='center', fontsize=9)
# Power vs sample size
ax = axes[1]
effect_fixed = 0.02 # Fixed at 2%p
p2_fixed = p1_base + effect_fixed
powers = np.linspace(0.5, 0.95, 10)
sample_sizes_power = []
for pwr in powers:
n1, _ = sample_size_two_proportions(p1_base, p2_fixed, alpha, pwr)
sample_sizes_power.append(n1)
ax.plot(powers*100, sample_sizes_power, 'go-', linewidth=2)
ax.set_xlabel('Power (%)')
ax.set_ylabel('Required Sample Size per Group')
ax.set_title(f'Power vs Sample Size\n(Effect size={effect_fixed:.1%}p)')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Example
p1 = 0.05 # Baseline conversion rate 5%
effects = [0.005, 0.01, 0.015, 0.02, 0.025, 0.03] # 0.5%p ~ 3%p
print("=== Sample Size Calculation ===")
print(f"Baseline conversion rate: {p1:.1%}")
print(f"α = 0.05, Power = 0.80")
print()
for effect in effects:
p2 = p1 + effect
n1, n2 = sample_size_two_proportions(p1, p2)
print(f"Effect {effect*100:.1f}%p (relative {effect/p1*100:.0f}%): n1={n1:,}, n2={n2:,}, total={n1+n2:,}")
plot_sample_size_analysis(p1, effects)
3.3 Power Analysis with statsmodels¶
from statsmodels.stats.power import TTestPower, NormalIndPower, tt_ind_solve_power
from statsmodels.stats.proportion import proportion_effectsize
def statsmodels_power_analysis():
"""Power analysis using statsmodels"""
# 1. t-test power analysis
print("=== t-test Power Analysis ===")
# Calculate effect size (Cohen's d)
# d = (μ1 - μ2) / σ
mean_diff = 5
std = 15
d = mean_diff / std
print(f"Cohen's d = {d:.3f}")
# Required sample size
power_analysis = TTestPower()
n = power_analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
alternative='two-sided')
print(f"Required sample size (per group): {int(np.ceil(n))}")
# Achieved power
achieved_power = power_analysis.power(effect_size=d, nobs=100, alpha=0.05,
alternative='two-sided')
print(f"Power at n=100: {achieved_power:.3f}")
# 2. Proportion test power analysis
print("\n=== Proportion Test Power Analysis ===")
p1 = 0.05
p2 = 0.07
effect = proportion_effectsize(p1, p2)
print(f"Effect size (h): {effect:.3f}")
# Required sample size
power_prop = NormalIndPower()
n_prop = power_prop.solve_power(effect_size=effect, alpha=0.05, power=0.80,
alternative='two-sided', ratio=1)
print(f"Required sample size (per group): {int(np.ceil(n_prop))}")
return n, n_prop
n_t, n_prop = statsmodels_power_analysis()
3.4 Power Curves¶
def plot_power_curve(effect_sizes, n_per_group, alpha=0.05):
"""Visualize power curves"""
power_analysis = NormalIndPower()
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Effect size vs power (n fixed)
ax = axes[0]
for n in n_per_group:
powers = [power_analysis.power(effect_size=es, nobs=n, alpha=alpha,
alternative='two-sided', ratio=1)
for es in effect_sizes]
ax.plot(effect_sizes, powers, '-o', label=f'n={n}')
ax.axhline(0.80, color='r', linestyle='--', alpha=0.5, label='Power=0.80')
ax.set_xlabel('Effect Size (Cohen\'s h)')
ax.set_ylabel('Power')
ax.set_title('Power Curves (by Sample Size)')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_ylim(0, 1)
# Sample size vs power (effect size fixed)
ax = axes[1]
n_range = np.arange(50, 1001, 50)
effect_fixed = [0.1, 0.2, 0.3, 0.5]
for es in effect_fixed:
powers = [power_analysis.power(effect_size=es, nobs=n, alpha=alpha,
alternative='two-sided', ratio=1)
for n in n_range]
ax.plot(n_range, powers, '-', label=f'h={es}')
ax.axhline(0.80, color='r', linestyle='--', alpha=0.5, label='Power=0.80')
ax.set_xlabel('Sample Size per Group')
ax.set_ylabel('Power')
ax.set_title('Power Curves (by Effect Size)')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_ylim(0, 1)
plt.tight_layout()
plt.show()
effect_sizes = np.linspace(0.05, 0.5, 20)
n_per_group = [50, 100, 200, 500]
plot_power_curve(effect_sizes, n_per_group)
4. Sequential Testing¶
4.1 Why Sequential Testing?¶
def sequential_testing_motivation():
"""Need for sequential testing"""
print("""
=================================================
Sequential Testing
=================================================
Problem: Peeking Problem
─────────────────────────────────────────────────
- Checking results mid-experiment increases Type I error rate
- Designed at α=0.05, but 5 interim looks → actual rate ~14%
Examples:
- 1 look: α = 0.05
- 5 looks: α ≈ 0.14
- 10 looks: α ≈ 0.19
Solutions:
─────────────────────────────────────────────────
1. Fixed sample test: Wait until predetermined n
2. Sequential test: Allow interim looks with correction
- O'Brien-Fleming
- Pocock
- Alpha spending functions
Advantages:
- Early stopping if effect is clear → Save cost/time
- Quick stop if no effect
- Maintain statistical validity
""")
sequential_testing_motivation()
4.2 Peeking Problem Simulation¶
def simulate_peeking_problem(n_simulations=10000, n_total=1000, n_looks=5):
"""
Peeking problem simulation:
How often is significance found when null is true (no actual difference)
"""
np.random.seed(42)
alpha = 0.05
# Interim look points
look_points = np.linspace(n_total // n_looks, n_total, n_looks).astype(int)
false_positives_fixed = 0 # Fixed sample (last look only)
false_positives_peeking = 0 # All looks
for _ in range(n_simulations):
# Generate data under null hypothesis (two groups identical)
control = np.random.binomial(1, 0.1, n_total)
treatment = np.random.binomial(1, 0.1, n_total)
# Peeking: Test at each look
for look in look_points:
x_c = control[:look].sum()
x_t = treatment[:look].sum()
n = look
# Proportions
p_c = x_c / n
p_t = x_t / n
p_pooled = (x_c + x_t) / (2 * n)
if p_pooled > 0 and p_pooled < 1:
se = np.sqrt(p_pooled * (1 - p_pooled) * 2 / n)
z = (p_t - p_c) / se if se > 0 else 0
p_value = 2 * (1 - norm.cdf(abs(z)))
if p_value < alpha:
false_positives_peeking += 1
break # Stop once significant
# Fixed sample: Last look only
x_c = control.sum()
x_t = treatment.sum()
p_c = x_c / n_total
p_t = x_t / n_total
p_pooled = (x_c + x_t) / (2 * n_total)
if p_pooled > 0 and p_pooled < 1:
se = np.sqrt(p_pooled * (1 - p_pooled) * 2 / n_total)
z = (p_t - p_c) / se if se > 0 else 0
p_value = 2 * (1 - norm.cdf(abs(z)))
if p_value < alpha:
false_positives_fixed += 1
fpr_fixed = false_positives_fixed / n_simulations
fpr_peeking = false_positives_peeking / n_simulations
print("=== Peeking Problem Simulation ===")
print(f"Number of simulations: {n_simulations:,}")
print(f"Total sample size: {n_total}")
print(f"Number of interim looks: {n_looks}")
print(f"Target α: {alpha}")
print(f"\nFixed sample test false positive rate: {fpr_fixed:.4f} ({fpr_fixed*100:.2f}%)")
print(f"Peeking test false positive rate: {fpr_peeking:.4f} ({fpr_peeking*100:.2f}%)")
print(f"False positive rate inflation: {(fpr_peeking/alpha - 1)*100:.1f}%")
return fpr_fixed, fpr_peeking
fpr_fixed, fpr_peeking = simulate_peeking_problem()
4.3 Alpha Spending Functions¶
def alpha_spending_pocock(t, alpha=0.05):
"""Pocock alpha spending function"""
return alpha * np.log(1 + (np.e - 1) * t)
def alpha_spending_obrien_fleming(t, alpha=0.05):
"""O'Brien-Fleming alpha spending function"""
return 2 * (1 - norm.cdf(norm.ppf(1 - alpha/2) / np.sqrt(t)))
def plot_alpha_spending():
"""Visualize alpha spending functions"""
t = np.linspace(0.01, 1, 100)
alpha = 0.05
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(t, alpha_spending_pocock(t, alpha), label='Pocock', linewidth=2)
ax.plot(t, alpha_spending_obrien_fleming(t, alpha), label="O'Brien-Fleming", linewidth=2)
ax.plot(t, t * alpha, '--', label='Linear (reference)', alpha=0.5)
ax.axhline(alpha, color='r', linestyle=':', label=f'Total α={alpha}')
ax.set_xlabel('Information Fraction (Current/Final)')
ax.set_ylabel('Cumulative α Spent')
ax.set_title('Alpha Spending Functions')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 1)
ax.set_ylim(0, alpha * 1.1)
plt.show()
print("=== Alpha Spending Function Comparison ===")
print("\nPocock:")
print(" - Same threshold at each analysis")
print(" - More lenient for early stopping")
print(" - More conservative at final analysis")
print("\nO'Brien-Fleming:")
print(" - Very conservative early (high threshold)")
print(" - Similar to fixed sample later")
print(" - Early stopping only for extreme effects")
plot_alpha_spending()
4.4 Sequential Test Implementation¶
class SequentialTest:
"""Sequential A/B test"""
def __init__(self, max_n, n_looks, alpha=0.05, spending='obrien_fleming'):
"""
Parameters:
-----------
max_n : int
Maximum sample size (per group)
n_looks : int
Number of interim analyses
alpha : float
Overall significance level
spending : str
'pocock' or 'obrien_fleming'
"""
self.max_n = max_n
self.n_looks = n_looks
self.alpha = alpha
self.spending = spending
# Analysis times
self.look_times = np.linspace(1/n_looks, 1, n_looks)
# Alpha to use at each analysis
self.alphas = self._compute_alphas()
def _compute_alphas(self):
"""Calculate alpha for each analysis"""
if self.spending == 'pocock':
cumulative = [alpha_spending_pocock(t, self.alpha) for t in self.look_times]
else:
cumulative = [alpha_spending_obrien_fleming(t, self.alpha) for t in self.look_times]
# Incremental alpha
alphas = [cumulative[0]]
for i in range(1, len(cumulative)):
alphas.append(cumulative[i] - cumulative[i-1])
return alphas
def critical_values(self):
"""Critical Z values for each analysis"""
return [norm.ppf(1 - a/2) for a in self.alphas]
def summary(self):
"""Analysis plan summary"""
print("=== Sequential Test Plan ===")
print(f"Maximum sample: {self.max_n} (per group)")
print(f"Interim analyses: {self.n_looks} times")
print(f"Overall α: {self.alpha}")
print(f"Spending: {self.spending}")
print("\nPlan by analysis:")
print("-" * 50)
print(f"{'Look':<6} {'n':<10} {'Cumulative α':<12} {'Increment α':<12} {'Z Critical':<10}")
print("-" * 50)
cumulative_alpha = 0
z_crits = self.critical_values()
for i, (t, a) in enumerate(zip(self.look_times, self.alphas)):
n = int(t * self.max_n)
cumulative_alpha += a
print(f"{i+1:<6} {n:<10} {cumulative_alpha:<12.4f} {a:<12.4f} {z_crits[i]:<10.3f}")
# Example
seq_test = SequentialTest(max_n=5000, n_looks=5, alpha=0.05, spending='obrien_fleming')
seq_test.summary()
print("\n")
seq_test_pocock = SequentialTest(max_n=5000, n_looks=5, alpha=0.05, spending='pocock')
seq_test_pocock.summary()
5. Common Pitfalls and Cautions¶
5.1 Multiple Comparisons Problem¶
def multiple_comparisons_problem():
"""Multiple comparisons problem"""
print("""
=================================================
Multiple Comparisons Problem
=================================================
Problem:
- Conducting multiple tests simultaneously increases Type I error
- Probability of at least one false positive with k tests: 1 - (1-α)^k
Examples (α=0.05):
- 1 test: 5%
- 5 tests: 23%
- 10 tests: 40%
- 20 tests: 64%
Correction methods:
─────────────────────────────────────────────────
1. Bonferroni: α' = α/k (most conservative)
2. Holm-Bonferroni: Sequential Bonferroni
3. Benjamini-Hochberg (FDR): Control false discovery rate
4. Pre-registration: Specify one primary hypothesis
""")
# Visualization
k_values = range(1, 21)
alpha = 0.05
fwer = [1 - (1 - alpha)**k for k in k_values]
bonferroni = [min(alpha * k, 1.0) for k in k_values] # Pre-correction tolerance
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(k_values, fwer, 'b-o', label='Without correction (FWER)')
ax.axhline(alpha, color='r', linestyle='--', label=f'Target α={alpha}')
ax.set_xlabel('Number of Tests')
ax.set_ylabel('Probability of At Least One False Positive')
ax.set_title('Multiple Comparisons Problem: Tests vs False Positive Rate')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_ylim(0, 0.7)
plt.show()
multiple_comparisons_problem()
5.2 Other Cautions¶
def common_pitfalls():
"""Common A/B testing pitfalls"""
print("""
=================================================
A/B Testing Cautions
=================================================
1. Peeking (Interim Checking)
- Problem: Check until desired result appears
- Solution: Sequential testing or fixed sample
2. Multiple Comparisons
- Problem: Testing multiple metrics/segments
- Solution: Pre-registration, correction, specify primary metric
3. Inadequate Sample Size
- Problem: Too small → fail to detect effect, too large → wasteful
- Solution: Prior power analysis
4. Novelty Effect
- Problem: Newness itself causes temporary effect
- Solution: Sufficient experiment duration
5. Network Effects (Spillover)
- Problem: Interaction between groups
- Solution: Cluster randomization
6. Simpson's Paradox
- Problem: Overall vs segment-wise results conflict
- Solution: Stratified analysis, causal graphs
7. Ignoring Practical Significance
- Problem: Statistical significance ≠ Practical importance
- Solution: Consider effect size, CI, business impact
8. Insufficient Power
- Problem: "No effect" ≠ "Null hypothesis is true"
- Solution: Report power, equivalence testing
""")
common_pitfalls()
6. Practice Examples¶
6.1 Complete Experimental Design¶
def complete_ab_test_workflow():
"""Complete A/B test workflow"""
print("="*60)
print("A/B Test Workflow")
print("="*60)
# 1. Formulate hypothesis
print("\n[Step 1] Formulate Hypothesis")
print(" H0: New button color has no effect on conversion rate")
print(" H1: New button color changes conversion rate")
# 2. Define metrics
print("\n[Step 2] Define Metrics")
baseline_rate = 0.05 # 5%
mde = 0.01 # Minimum detectable effect: 1%p
print(f" Baseline conversion rate: {baseline_rate:.1%}")
print(f" MDE (Minimum Detectable Effect): {mde:.1%}p")
# 3. Calculate sample size
print("\n[Step 3] Calculate Sample Size")
target_rate = baseline_rate + mde
n1, n2 = sample_size_two_proportions(baseline_rate, target_rate, alpha=0.05, power=0.80)
print(f" Required sample size: {n1:,} (per group)")
print(f" Total required traffic: {n1 + n2:,}")
# 4. Run experiment (simulation)
print("\n[Step 4] Run Experiment (Simulation)")
np.random.seed(42)
n_control = n1
n_treatment = n2
x_control = np.random.binomial(n_control, baseline_rate)
x_treatment = np.random.binomial(n_treatment, baseline_rate + mde * 0.8) # Actual effect is 80% of MDE
print(f" Control: {x_control:,}/{n_control:,} = {x_control/n_control:.2%}")
print(f" Treatment: {x_treatment:,}/{n_treatment:,} = {x_treatment/n_treatment:.2%}")
# 5. Analysis
print("\n[Step 5] Analysis")
ab = ABTest(n_control, x_control, n_treatment, x_treatment)
z, p = ab.z_test()
diff, ci = ab.confidence_interval()
lift = ab.lift()
print(f" Difference: {diff:.4f} ({diff*100:.2f}%p)")
print(f" Lift: {lift*100:.2f}%")
print(f" 95% CI: ({ci[0]*100:.2f}%p, {ci[1]*100:.2f}%p)")
print(f" Z statistic: {z:.3f}")
print(f" p-value: {p:.4f}")
# 6. Decision making
print("\n[Step 6] Decision Making")
if p < 0.05:
if diff > 0:
print(" Conclusion: Adopt Treatment (statistically significant improvement)")
else:
print(" Conclusion: Keep Control (Treatment is worse)")
else:
print(" Conclusion: Decision deferred (no significant difference)")
print(" Considerations: Increase sample size or test another variant")
return ab
ab_result = complete_ab_test_workflow()
7. Practice Problems¶
Problem 1: Sample Size Calculation¶
If baseline conversion rate is 3%, and you want to detect a minimum 20% relative lift (to 3.6%): 1. Required sample size at α=0.05, Power=0.80 2. Sample size change if increasing Power to 0.90 3. Sample size change if reducing MDE to 10% lift
Problem 2: Experiment Duration Estimation¶
If daily traffic is 10,000 visits with 50:50 split: 1. Days needed to achieve sample size from Problem 1 2. Minimum weeks for experiment considering weekend effects?
Problem 3: Sequential Test Design¶
If planning 5 interim analyses: 1. Each analysis threshold using O'Brien-Fleming method 2. Compare with Pocock method 3. Early stopping condition at first analysis
Problem 4: Multiple Comparison Correction¶
If analyzing A/B test results across 5 segments (age groups): 1. Bonferroni-corrected significance level 2. Conclusion if one segment shows p=0.02 3. How would pre-registration change this?
8. Key Summary¶
Experimental Design Checklist¶
- [ ] Clear hypothesis and metric definition
- [ ] Sample size determination via power analysis
- [ ] Randomization method selection
- [ ] Experiment duration setting (consider weekly effects)
- [ ] Interim analysis plan (sequential testing)
- [ ] Multiple comparison considerations
- [ ] Pre-registration
Sample Size Formula (Proportion Comparison)¶
$$n = \frac{(z_{\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + z_{\beta}\sqrt{p_1(1-p_1)+p_2(1-p_2)})^2}{(p_1-p_2)^2}$$
Power Relationships¶
| Factor | Sample Size When Increased |
|---|---|
| Effect size ↑ | Decreases |
| Power ↑ | Increases |
| α ↓ (more strict) | Increases |
| Variance ↑ | Increases |
Sequential Testing¶
| Method | Early | Late |
|---|---|---|
| O'Brien-Fleming | Very conservative | Similar to fixed sample |
| Pocock | Constant | More conservative |
Python Libraries¶
from statsmodels.stats.power import TTestPower, NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize
from scipy.stats import norm
# Sample size calculation
power_analysis = NormalIndPower()
n = power_analysis.solve_power(effect_size=h, alpha=0.05, power=0.80)