From EDA to Statistical Inference¶

Learning Objectives¶

Understand the limitations of descriptive statistics and EDA
Distinguish between populations and samples and the need for inference
Recognize different types of statistical questions (estimation, testing, prediction)
Learn when to use which statistical method based on data type and research question
Connect EDA findings to formal statistical tests
Avoid common pitfalls in statistical inference
Transition from exploratory analysis to confirmatory analysis

Difficulty: ⭐⭐ (Intermediate)

1. Introduction: The Limits of "Just Looking"¶

In the previous lessons, we've learned powerful tools for Exploratory Data Analysis (EDA):

- Data manipulation with Pandas
- Visualization with Matplotlib and Seaborn
- Descriptive statistics (mean, median, standard deviation)
- Pattern detection and outlier identification

But EDA alone cannot answer critical questions:

- "Is this difference real or just random noise?"
- "Can we generalize these findings beyond our dataset?"
- "How confident are we in our conclusions?"
- "What can we predict about future observations?"

This is where statistical inference comes in.

The Detective Analogy¶

Think of data science as detective work:

- EDA = gathering clues, examining the crime scene, forming hypotheses
- Statistical Inference = testing those hypotheses rigorously, building a case for court

EDA tells you what happened in your data. Inference tells you what it means for the world beyond your data.

2. Population vs Sample: Why We Need Inference¶

Key Concepts¶

Population: The complete set of all individuals/items we want to study
Example: All customers of an e-commerce platform
Example: All possible measurements of a physical constant
Sample: A subset of the population we actually observe
Example: 10,000 customers from our database
Example: 100 measurements in our experiment
Sampling Variability: Different samples from the same population will give different results

Why Can't We Just Use the Sample Statistics?¶

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Simulate a population: 1 million customers with average purchase $50, std $15
np.random.seed(42)
population = np.random.normal(loc=50, scale=15, size=1_000_000)
true_mean = population.mean()
print(f"True population mean: ${true_mean:.2f}")

# Take 5 different samples of size 100
sample_means = []
for i in range(5):
    sample = np.random.choice(population, size=100, replace=False)
    sample_mean = sample.mean()
    sample_means.append(sample_mean)
    print(f"Sample {i+1} mean: ${sample_mean:.2f}")

print(f"\nRange of sample means: ${min(sample_means):.2f} to ${max(sample_means):.2f}")

Example output (illustrative; exact values depend on the random state):

True population mean: $50.00
Sample 1 mean: $48.93
Sample 2 mean: $51.24
Sample 3 mean: $49.67
Sample 4 mean: $50.82
Sample 5 mean: $49.15

Range of sample means: $48.93 to $51.24

Key insight: Each sample gives a different estimate! Statistical inference helps us:

1. Quantify this uncertainty
2. Make probabilistic statements about the population
3. Test hypotheses with controlled error rates
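
One way to quantify this uncertainty is to draw many samples and look at how much the sample means spread out. The sketch below is a minimal simulation under the same assumed population as above (normal, mean $50, SD $15); the names `sample_size` and `n_draws` are illustrative choices, not part of the lesson's code. It compares the empirical spread of the sample means with the theoretical standard error σ/√n.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed population, mirroring the example above: mean $50, SD $15
population = rng.normal(loc=50, scale=15, size=1_000_000)

sample_size = 100   # observations per sample
n_draws = 2_000     # number of repeated samples (illustrative choice)

# Draw many samples and record each sample mean.
# Sampling with replacement is a close approximation here because the
# population is far larger than any single sample.
sample_means = np.array([
    rng.choice(population, size=sample_size, replace=True).mean()
    for _ in range(n_draws)
])

empirical_se = sample_means.std(ddof=1)                    # spread of the sample means
theoretical_se = population.std() / np.sqrt(sample_size)   # sigma / sqrt(n)

print(f"Empirical SD of sample means: {empirical_se:.3f}")
print(f"Theoretical standard error:   {theoretical_se:.3f}")
```

The two numbers should be close to each other (around 1.5 for these settings), which is why a single sample's standard error can stand in for the spread we would see across many samples.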

3. The Statistical Thinking Shift¶

From Descriptive to Inferential¶

Descriptive Statistics (EDA)	Inferential Statistics
"The sample mean is 50.2"	"The population mean is likely between 48.5 and 51.9 (95% CI)"
"Group A has a higher average"	"Group A's mean is significantly higher (p < 0.01)"
"Variables X and Y correlate at 0.7"	"The population correlation is positive (p < 0.001)"
"This pattern appears in our data"	"This pattern generalizes beyond our sample (AIC comparison)"

The Inference Mindset¶

When moving from EDA to inference, ask:

What is my population of interest?
Not just "my dataset" but the broader context
How was my sample obtained?
Random sampling? Convenience sample? This affects validity
What assumptions am I making?
Normality? Independence? Homogeneity of variance?
What is my uncertainty?
Confidence intervals, p-values, credible intervals
What are the practical consequences of being wrong?
Type I vs Type II errors, effect sizes
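
The last question can be made concrete with a small simulation. The sketch below estimates how often a two-sample t-test rejects H0 when there is no effect (Type I error) and when there is a real effect (power, whose complement is the Type II error rate). The sample size, effect size, and the helper name `rejection_rate` are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_per_group = 30      # illustrative sample size per group
n_sims = 2_000        # number of simulated experiments
true_effect = 0.5     # assumed effect (in SD units) for the power scenario

def rejection_rate(effect):
    """Fraction of simulated two-sample t-tests with p < alpha."""
    a = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
    b = rng.normal(effect, 1.0, size=(n_sims, n_per_group))
    _, p = stats.ttest_ind(a, b, axis=1)   # one test per simulated experiment
    return np.mean(p < alpha)

type_i_rate = rejection_rate(0.0)        # H0 true: every rejection is a false positive
power = rejection_rate(true_effect)      # H0 false: every rejection is a correct detection
type_ii_rate = 1 - power                 # real effects that were missed

print(f"Type I error rate: {type_i_rate:.3f} (should be near alpha = {alpha})")
print(f"Power:             {power:.3f}")
print(f"Type II error rate: {type_ii_rate:.3f}")
```

With these settings the Type I rate stays near α, while a medium effect is missed roughly half the time, which illustrates why sample size matters as much as the significance threshold.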

4. Types of Statistical Questions¶

4.1 Estimation Questions¶

Question: "What is the value of a population parameter?"

Examples:

- What is the average customer lifetime value?
- What proportion of users click the button?

Tools: Confidence intervals, point estimates

# Example: Estimate mean customer spend with confidence interval
sample = np.random.choice(population, size=100, replace=False)
sample_mean = sample.mean()
sample_se = stats.sem(sample)  # Standard error of the mean
ci_95 = stats.t.interval(0.95, len(sample)-1, loc=sample_mean, scale=sample_se)

print(f"Sample mean: ${sample_mean:.2f}")
print(f"95% Confidence Interval: ${ci_95[0]:.2f} to ${ci_95[1]:.2f}")
print("Interpretation: intervals built this way capture the true population mean in about 95% of repeated samples")
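
What "95% confidence" means can be checked by simulation. The following is a minimal sketch, assuming a normal population with mean 50 and SD 15 as in the earlier examples: it builds a t-based interval for many independent samples and counts how often the interval contains the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, true_sd = 50.0, 15.0   # assumed population parameters
n, n_repeats = 100, 2_000         # sample size and number of repeated samples

# Each row is one independent sample of size n
samples = rng.normal(true_mean, true_sd, size=(n_repeats, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)   # standard error of each sample mean

# Critical t value for a two-sided 95% interval with n-1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)
lower = means - t_crit * ses
upper = means + t_crit * ses

coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

The coverage should land close to 0.95. The 95% describes the long-run behavior of the procedure, not the probability that any one computed interval is correct.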

4.2 Hypothesis Testing Questions¶

Question: "Is there a significant difference/effect?"

Examples:

- Does treatment A work better than treatment B?
- Did the website redesign increase conversion rates?

Tools: t-tests, chi-square tests, ANOVA, permutation tests

# Example: A/B test - did the new design increase conversion?
# Control group (old design)
control_conversions = np.random.binomial(1, 0.10, size=1000)  # 10% conversion
# Treatment group (new design)
treatment_conversions = np.random.binomial(1, 0.12, size=1000)  # 12% conversion

# Hypothesis test
from statsmodels.stats.proportion import proportions_ztest

count = np.array([treatment_conversions.sum(), control_conversions.sum()])
nobs = np.array([len(treatment_conversions), len(control_conversions)])

z_stat, p_value = proportions_ztest(count, nobs)
print(f"Treatment conversion: {treatment_conversions.mean():.3f}")
print(f"Control conversion: {control_conversions.mean():.3f}")
print(f"p-value: {p_value:.4f}")
print(f"Result: {'Significant' if p_value < 0.05 else 'Not significant'} at α=0.05")
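
Permutation tests, listed among the tools above, need very few assumptions. The sketch below shows one possible implementation for a difference in means: it repeatedly shuffles the group labels and asks how often a shuffled difference is at least as extreme as the observed one. The helper `permutation_test` and the simulated data are hypothetical, written only for illustration.

```python
import numpy as np

def permutation_test(x, y, n_permutations=5_000, seed=0):
    """Two-sided permutation test for a difference in means (illustrative helper)."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    observed = y.mean() - x.mean()
    pooled = np.concatenate([x, y])

    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)   # reassign group labels at random
        diff = shuffled[len(x):].mean() - shuffled[:len(x)].mean()
        if abs(diff) >= abs(observed):
            count += 1

    # Add 1 to numerator and denominator so the p-value is never exactly 0
    p_value = (count + 1) / (n_permutations + 1)
    return observed, p_value

rng = np.random.default_rng(42)
control = rng.normal(50, 10, size=40)      # simulated control group
treatment = rng.normal(65, 10, size=40)    # simulated treatment group (true shift of 15)

observed_diff, p_value = permutation_test(control, treatment)
print(f"Observed difference in means: {observed_diff:.2f}")
print(f"Permutation p-value: {p_value:.4f}")
```

Because the null distribution is built from the data themselves, this approach is a reasonable fallback when the distributional assumptions of a t-test are doubtful.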

4.3 Prediction Questions¶

Question: "What will happen for new observations?"

Examples:

- What will this customer spend next month?
- How many units will we sell?

Tools: Regression, time series models, machine learning
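
A prediction question asks about a new observation, so its uncertainty is wider than that of an estimated mean. The sketch below fits a simple linear regression with `scipy.stats.linregress` and computes an approximate 95% prediction interval using the standard textbook formula. The data are simulated, and the variable names (`visits`, `spend`, `x_new`) are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated data (assumption): monthly spend depends linearly on site visits
visits = rng.uniform(1, 20, size=80)
spend = 10 + 3 * visits + rng.normal(0, 5, size=80)

fit = stats.linregress(visits, spend)

x_new = 12.0                                   # hypothetical new customer
y_pred = fit.intercept + fit.slope * x_new     # point prediction

n = len(visits)
residuals = spend - (fit.intercept + fit.slope * visits)
s = np.sqrt(np.sum(residuals**2) / (n - 2))    # residual standard error
sxx = np.sum((visits - visits.mean())**2)

# Standard error for predicting a single new observation at x_new
se_pred = s * np.sqrt(1 + 1/n + (x_new - visits.mean())**2 / sxx)
t_crit = stats.t.ppf(0.975, df=n - 2)
pred_low = y_pred - t_crit * se_pred
pred_high = y_pred + t_crit * se_pred

print(f"Predicted spend at {x_new:.0f} visits: {y_pred:.2f}")
print(f"95% prediction interval: ({pred_low:.2f}, {pred_high:.2f})")
```

The leading `1` inside the square root accounts for the noise in an individual observation; dropping it would give the narrower confidence interval for the mean response instead.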

4.4 Association Questions¶

Question: "How are variables related?"

Examples:

- Does education level correlate with income?
- Are variables independent or dependent?

Tools: Correlation, regression, contingency tables
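
For two categorical variables, the association question is usually answered with a contingency table. The sketch below runs a chi-square test of independence with `scipy.stats.chi2_contingency` and adds Cramér's V as an effect size. The counts are hypothetical and exist only to make the example runnable.

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table:
# rows = education level, columns = income bracket (low, middle, high)
observed = np.array([
    [30, 15,  5],   # high school
    [20, 25, 15],   # bachelor's
    [10, 20, 30],   # graduate
])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

# Cramér's V: effect size for association in an r x c table (0 = none, 1 = perfect)
n = observed.sum()
cramers_v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))

print(f"Chi-square = {chi2:.2f}, dof = {dof}, p = {p_value:.6f}")
print(f"Cramér's V = {cramers_v:.3f}")
```

The `expected` array holds the counts we would see if the two variables were independent, which is useful for checking the common rule of thumb that expected counts should not be too small.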

5. When to Use Which Method: A Decision Guide¶

5.1 Based on Data Type¶

┌─── What type of outcome variable? ───┐
│                                       │
│  Continuous (numeric)                 │
│  ├─ One group → One-sample t-test    │
│  ├─ Two groups → Two-sample t-test   │
│  ├─ 3+ groups → ANOVA                │
│  └─ Predictor variables → Regression │
│                                       │
│  Categorical (binary/count)           │
│  ├─ One proportion → Proportion test │
│  ├─ Two proportions → Chi-square     │
│  └─ Predictor variables → Logistic   │
│                                       │
│  Time series                          │
│  └─ Temporal patterns → ARIMA, etc   │
└───────────────────────────────────────┘

5.2 Based on Research Question¶

def suggest_test(data_type, num_groups, paired=False, question_type="difference"):
    """
    Simple decision tree for choosing statistical test

    Parameters:
    -----------
    data_type : str
        'continuous' or 'categorical'
    num_groups : int
        Number of groups to compare
    paired : bool
        Are observations paired/matched?
    question_type : str
        'difference', 'association', 'prediction'
    """

    if question_type == "association":
        if data_type == "continuous":
            return "Pearson/Spearman correlation, Linear regression"
        else:
            return "Chi-square test of independence, Odds ratio"

    if question_type == "prediction":
        return "Regression (linear/logistic), Machine learning"

    # For difference questions
    if data_type == "continuous":
        if num_groups == 1:
            return "One-sample t-test"
        elif num_groups == 2:
            if paired:
                return "Paired t-test"
            else:
                return "Independent two-sample t-test (or Mann-Whitney if not normal)"
        else:
            return "One-way ANOVA (or Kruskal-Wallis if not normal)"
    else:  # categorical
        if num_groups == 1:
            return "One-proportion z-test, Binomial test"
        elif num_groups == 2:
            return "Two-proportion z-test, Chi-square test"
        else:
            return "Chi-square test for multiple groups"

# Examples
print(suggest_test('continuous', 2, paired=False))
# → Independent two-sample t-test (or Mann-Whitney if not normal)

print(suggest_test('categorical', 2, question_type='difference'))
# → Two-proportion z-test, Chi-square test

print(suggest_test('continuous', 1, question_type='association'))
# → Pearson/Spearman correlation, Linear regression

6. Connecting EDA to Inference¶

The Workflow: From Exploration to Confirmation¶

import pandas as pd
import seaborn as sns

# Step 1: EXPLORATORY - Load and visualize data
np.random.seed(42)
data = pd.DataFrame({
    'group': ['A']*50 + ['B']*50,
    'score': np.concatenate([
        np.random.normal(75, 10, 50),  # Group A
        np.random.normal(80, 10, 50)   # Group B
    ])
})

# EDA: Visualize the difference
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
sns.boxplot(data=data, x='group', y='score')
plt.title('EDA: Boxplot suggests Group B scores higher')

plt.subplot(1, 2, 2)
sns.histplot(data=data, x='score', hue='group', kde=True)
plt.title('EDA: Distributions appear roughly normal')

plt.tight_layout()
plt.savefig('/tmp/eda_to_inference.png', dpi=100, bbox_inches='tight')
plt.close()

# EDA: Descriptive statistics
print("=== EXPLORATORY PHASE ===")
print(data.groupby('group')['score'].describe())
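group_means = data.groupby('group')['score'].mean()  # computed from the data rather than hard-coded
print(f"\nObservation: mean score is {group_means['B']:.1f} in Group B vs {group_means['A']:.1f} in Group A")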
print("Question: Is this difference statistically significant?\n")

# Step 2: CONFIRMATORY - Hypothesis test
print("=== INFERENTIAL PHASE ===")

# Check assumptions
group_a = data[data['group'] == 'A']['score']
group_b = data[data['group'] == 'B']['score']

# Normality test
_, p_norm_a = stats.shapiro(group_a)
_, p_norm_b = stats.shapiro(group_b)
print(f"Normality test (Shapiro-Wilk):")
print(f"  Group A: p={p_norm_a:.3f} → {'No evidence against normality' if p_norm_a > 0.05 else 'Evidence of non-normality'}")
print(f"  Group B: p={p_norm_b:.3f} → {'No evidence against normality' if p_norm_b > 0.05 else 'Evidence of non-normality'}")

# Variance equality test
_, p_var = stats.levene(group_a, group_b)
print(f"\nEqual variance test (Levene):")
print(f"  p={p_var:.3f} → {'Equal variances' if p_var > 0.05 else 'Unequal variances'}")

# Two-sample t-test
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"\nTwo-sample t-test:")
print(f"  t-statistic: {t_stat:.3f}")
print(f"  p-value: {p_value:.4f}")
print(f"  Result: {'Reject H0' if p_value < 0.05 else 'Fail to reject H0'} at α=0.05")

# Effect size (Cohen's d)
pooled_std = np.sqrt((group_a.std()**2 + group_b.std()**2) / 2)
cohens_d = (group_b.mean() - group_a.mean()) / pooled_std
print(f"  Cohen's d: {cohens_d:.3f} ({'small' if abs(cohens_d) < 0.5 else 'medium' if abs(cohens_d) < 0.8 else 'large'} effect)")

# Confidence interval for difference
diff_mean = group_b.mean() - group_a.mean()
se_diff = np.sqrt(group_a.var()/len(group_a) + group_b.var()/len(group_b))
ci_diff = stats.t.interval(0.95, len(group_a)+len(group_b)-2, loc=diff_mean, scale=se_diff)
print(f"  95% CI for difference: ({ci_diff[0]:.2f}, {ci_diff[1]:.2f})")

print("\n=== CONCLUSION ===")
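if p_value < 0.05 and diff_mean > 0:
    print(f"Group B scores are significantly higher than Group A (p={p_value:.4f}).")
else:
    print(f"The data do not show a significantly higher mean for Group B (p={p_value:.4f}).")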
print(f"The mean difference is {diff_mean:.2f} points (95% CI: {ci_diff[0]:.2f} to {ci_diff[1]:.2f}).")
print(f"This represents a {('small' if abs(cohens_d) < 0.5 else 'medium' if abs(cohens_d) < 0.8 else 'large')} effect size.")

Key Takeaway: EDA guides your inference strategy:

- Histograms → Check normality assumption
- Boxplots → Identify appropriate test (parametric vs non-parametric)
- Scatter plots → Inform regression choices
- Missing data patterns → Handle before inference
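
The histogram-to-normality link can also be expressed numerically. As a rough sketch, `scipy.stats.probplot` returns the correlation between the ordered data and the theoretical normal quantiles (the straightness of a Q-Q plot), which can be compared across datasets. The helper name `qq_correlation` and the simulated samples are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
normal_scores = rng.normal(75, 10, size=200)          # roughly symmetric data
skewed_scores = rng.exponential(scale=10, size=200)   # right-skewed data

def qq_correlation(values):
    """Correlation between ordered data and theoretical normal quantiles."""
    (_, _), (_, _, r) = stats.probplot(values, dist="norm")
    return r

r_normal = qq_correlation(normal_scores)
r_skewed = qq_correlation(skewed_scores)

print(f"Q-Q correlation, normal data: {r_normal:.3f}")
print(f"Q-Q correlation, skewed data: {r_skewed:.3f}")
```

A value very close to 1 indicates points hugging the Q-Q line; a visibly lower value signals curvature, which is a cue to consider a transformation or a non-parametric test.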

7. Common Pitfalls in Statistical Inference¶

7.1 p-Hacking (Data Dredging)¶

Problem: Testing many hypotheses until you find p < 0.05

# BAD PRACTICE: Testing many variables without correction
np.random.seed(123)
num_tests = 20
p_values = []

for i in range(num_tests):
    # Generate random data (no real effect)
    group1 = np.random.normal(0, 1, 30)
    group2 = np.random.normal(0, 1, 30)
    _, p = stats.ttest_ind(group1, group2)
    p_values.append(p)
    if p < 0.05:
        print(f"Test {i+1}: p={p:.4f} 🎉 Significant!")

print(f"\nFound {sum(p < 0.05 for p in p_values)} 'significant' results out of {num_tests} tests")
print("All data were generated with no real effect, so any 'significant' result here is a Type I error (false positive)")

Solution:

- Use multiple testing correction (Bonferroni, Benjamini-Hochberg)
- Pre-register hypotheses
- Report all tests performed

from statsmodels.stats.multitest import multipletests

# Correct for multiple comparisons
rejected, p_corrected, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
print(f"\nAfter Bonferroni correction: {sum(rejected)} significant results")
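
Bonferroni controls the chance of any false positive and can be conservative. The Benjamini-Hochberg procedure mentioned above controls the false discovery rate instead. The sketch below is a hand-written version of the step-up rule for illustration (the function name `benjamini_hochberg` and the example p-values are made up); in practice `multipletests(..., method='fdr_bh')` from statsmodels offers the same procedure.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean array marking hypotheses rejected at FDR level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                           # indices of p-values, smallest first
    thresholds = alpha * np.arange(1, m + 1) / m    # step-up thresholds (k/m) * alpha
    below = p[order] <= thresholds

    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k_max = np.max(np.nonzero(below)[0])        # largest rank meeting its threshold
        rejected[order[:k_max + 1]] = True          # reject everything up to that rank
    return rejected

# Made-up p-values for illustration
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

bh_rejected = benjamini_hochberg(example_p)
bonferroni_rejected = np.array(example_p) <= 0.05 / len(example_p)

print(f"Benjamini-Hochberg rejections: {bh_rejected.sum()}")
print(f"Bonferroni rejections:         {bonferroni_rejected.sum()}")
```

On these ten p-values the step-up rule rejects two hypotheses while Bonferroni rejects one, showing the usual trade-off: more discoveries in exchange for tolerating a controlled fraction of false ones.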

7.2 Confusing Correlation with Causation¶

Problem: "X and Y correlate, therefore X causes Y"

# Spurious correlation example: temperature (a confounder) drives both variables
np.random.seed(42)
temperature = np.random.uniform(10, 35, size=100)  # simulated daily temperature (°C)
ice_cream_sales = 20 * temperature + np.random.normal(0, 50, len(temperature))
drowning_deaths = 0.5 * temperature + np.random.normal(0, 2, len(temperature))

corr, p_corr = stats.pearsonr(ice_cream_sales, drowning_deaths)
print(f"Correlation between ice cream sales and drowning deaths: r={corr:.3f}, p={p_corr:.4f}")
print("Conclusion: Ice cream causes drowning? NO!")
print("Explanation: Both are driven by a confounding variable (temperature)")

Remember:

- Correlation ≠ Causation
- Causal claims need experimental design (randomization, control)
- Consider confounding (third) variables and reverse causation
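
When the suspected confounder has been measured, one simple check is to remove its linear effect from both variables and correlate what is left (a partial correlation). The sketch below simulates a confounder explicitly; the data-generating numbers and the helper `residualize` are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated data (assumption): temperature drives both outcomes,
# and neither outcome affects the other
temperature = rng.uniform(10, 35, size=200)
ice_cream_sales = 20 * temperature + rng.normal(0, 50, size=200)
drowning_deaths = 0.5 * temperature + rng.normal(0, 2, size=200)

def residualize(y, x):
    """Remove the linear effect of x from y and return the residuals."""
    fit = stats.linregress(x, y)
    return y - (fit.intercept + fit.slope * x)

raw_r, _ = stats.pearsonr(ice_cream_sales, drowning_deaths)
partial_r, _ = stats.pearsonr(
    residualize(ice_cream_sales, temperature),
    residualize(drowning_deaths, temperature),
)

print(f"Raw correlation:                      r = {raw_r:.3f}")
print(f"Correlation after removing temperature: r = {partial_r:.3f}")
```

The raw correlation is strong, but it largely disappears once temperature is accounted for. This only works for confounders that were actually measured, which is why randomized experiments remain the stronger basis for causal claims.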

7.3 Ignoring Assumptions¶

Problem: Using tests without checking their assumptions

# Example: t-test on heavily skewed data
np.random.seed(42)
skewed_data1 = np.random.exponential(scale=2, size=30)
skewed_data2 = np.random.exponential(scale=2.5, size=30)

# Wrong: Using t-test without checking normality
t_stat, p_ttest = stats.ttest_ind(skewed_data1, skewed_data2)
print(f"t-test p-value: {p_ttest:.4f}")

# Right: Check assumption first
_, p_norm = stats.shapiro(skewed_data1)
print(f"Shapiro-Wilk test for normality: p={p_norm:.4f}")
if p_norm < 0.05:
    print("Data is not normal! Use Mann-Whitney U test instead")
    u_stat, p_mann = stats.mannwhitneyu(skewed_data1, skewed_data2)
    print(f"Mann-Whitney U test p-value: {p_mann:.4f}")
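
Another option when normality is doubtful is the bootstrap, which estimates uncertainty by resampling the observed data. The sketch below builds a percentile bootstrap confidence interval for the difference in medians between two skewed groups. The data are simulated with the same exponential scales as above, and the number of resamples is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(42)
group1 = rng.exponential(scale=2.0, size=30)   # simulated right-skewed data
group2 = rng.exponential(scale=2.5, size=30)

n_boot = 5_000
boot_diffs = np.empty(n_boot)
for i in range(n_boot):
    # Resample each group with replacement, keeping the original sizes
    resample1 = rng.choice(group1, size=len(group1), replace=True)
    resample2 = rng.choice(group2, size=len(group2), replace=True)
    boot_diffs[i] = np.median(resample2) - np.median(resample1)

# Percentile interval: middle 95% of the bootstrap distribution
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])
observed_diff = np.median(group2) - np.median(group1)

print(f"Observed difference in medians: {observed_diff:.2f}")
print(f"95% bootstrap CI: ({ci_low:.2f}, {ci_high:.2f})")
```

If the interval includes 0, the data are compatible with no difference in medians. With only 30 observations per group the interval is typically wide, which is itself useful information.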

7.4 Confusing Statistical and Practical Significance¶

Problem: "p < 0.05, therefore it's important!"

# Large sample can make tiny effects "significant"
np.random.seed(42)
# n is large enough that even a 0.5-point difference is detected reliably
large_group1 = np.random.normal(100, 15, 100_000)
large_group2 = np.random.normal(100.5, 15, 100_000)  # Tiny difference

t_stat, p_value = stats.ttest_ind(large_group1, large_group2)
cohens_d = (large_group2.mean() - large_group1.mean()) / large_group1.std()

print(f"Mean difference: {large_group2.mean() - large_group1.mean():.3f}")
print(f"p-value: {p_value:.4f} → {'Statistically significant!' if p_value < 0.05 else 'Not significant'}")
print(f"Cohen's d: {cohens_d:.3f} → Practically negligible (tiny effect)")
print("\nAlways report effect sizes, not just p-values!")
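
Since effect sizes recur throughout this lesson, it can help to wrap the calculation in a reusable function. The sketch below computes Cohen's d with the pooled standard deviation and maps it to Cohen's conventional labels (0.2, 0.5, 0.8). Both helper names are hypothetical, and the cutoffs are rules of thumb rather than strict boundaries.

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d for two independent samples using the pooled standard deviation."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    # Pooled variance weights each sample variance by its degrees of freedom
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (y.mean() - x.mean()) / np.sqrt(pooled_var)

def label_effect(d):
    """Map |d| to conventional labels (rules of thumb, not strict cutoffs)."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
d = cohens_d(a, b)

print(f"Cohen's d = {d:.3f} ({label_effect(d)})")
```

Including a "negligible" band below 0.2 avoids describing a very small standardized difference as a "small effect" when it is unlikely to matter in practice.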

8. Practical Example: From EDA to Full Inference¶

Scenario: E-commerce A/B Test¶

We want to know if a new checkout flow increases purchase amounts.

# Generate realistic data
np.random.seed(42)
n = 200

data_ab = pd.DataFrame({
    'user_id': range(n),
    'variant': ['control']*100 + ['treatment']*100,
    'purchase_amount': np.concatenate([
        np.random.gamma(shape=2, scale=25, size=100),  # Control
        np.random.gamma(shape=2.3, scale=25, size=100)  # Treatment (slightly higher)
    ])
})

# Add a covariate: user tenure (generated independently of variant here, so it does not actually confound the comparison)
data_ab['tenure_months'] = np.random.poisson(lam=12, size=n)

print("=== STAGE 1: EXPLORATORY DATA ANALYSIS ===\n")

# 1. Summary statistics
print(data_ab.groupby('variant')['purchase_amount'].describe())

# 2. Visualize
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram
data_ab.hist(column='purchase_amount', by='variant', bins=20, ax=axes[0:2], alpha=0.7)
axes[0].set_title('Control Group')
axes[1].set_title('Treatment Group')

# Boxplot
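axes[2].boxplot([
    data_ab[data_ab['variant']=='control']['purchase_amount'],
    data_ab[data_ab['variant']=='treatment']['purchase_amount']
])
# Set tick labels separately: the boxplot `labels=` keyword is deprecated in recent Matplotlib releases
axes[2].set_xticklabels(['Control', 'Treatment'])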
axes[2].set_ylabel('Purchase Amount')
axes[2].set_title('Distribution Comparison')

plt.tight_layout()
plt.savefig('/tmp/ab_test_eda.png', dpi=100, bbox_inches='tight')
plt.close()

print("\n=== STAGE 2: FORMULATE STATISTICAL QUESTION ===\n")
print("Research Question: Does the new checkout flow (treatment) increase purchase amounts?")
print("Null Hypothesis (H0): μ_treatment = μ_control")
print("Alternative Hypothesis (H1): μ_treatment > μ_control (one-tailed)")
print("Significance level: α = 0.05")

print("\n=== STAGE 3: CHECK ASSUMPTIONS ===\n")

control = data_ab[data_ab['variant']=='control']['purchase_amount']
treatment = data_ab[data_ab['variant']=='treatment']['purchase_amount']

# Normality (note: with n=100, CLT applies, but let's check)
_, p_norm_c = stats.shapiro(control)
_, p_norm_t = stats.shapiro(treatment)
print(f"Normality (Shapiro-Wilk):")
print(f"  Control: p={p_norm_c:.4f}")
print(f"  Treatment: p={p_norm_t:.4f}")
print(f"  → Data is {'normal' if min(p_norm_c, p_norm_t) > 0.05 else 'not normal, but n=100 so CLT applies'}")

# Equal variance
_, p_var = stats.levene(control, treatment)
print(f"\nEqual variance (Levene): p={p_var:.4f}")
print(f"  → Variances are {'equal' if p_var > 0.05 else 'unequal'}")

print("\n=== STAGE 4: CONDUCT INFERENCE ===\n")

# Two-sample t-test (one-tailed)
t_stat, p_twotail = stats.ttest_ind(treatment, control, equal_var=(p_var>0.05))
p_onetail = p_twotail / 2 if t_stat > 0 else 1 - p_twotail / 2

print(f"Two-sample t-test (one-tailed):")
print(f"  t-statistic: {t_stat:.3f}")
print(f"  p-value: {p_onetail:.4f}")
print(f"  Decision: {'Reject H0' if p_onetail < 0.05 else 'Fail to reject H0'}")

# Effect size
cohens_d = (treatment.mean() - control.mean()) / np.sqrt((control.var() + treatment.var())/2)
print(f"\nEffect size (Cohen's d): {cohens_d:.3f}")

# Confidence interval
diff_mean = treatment.mean() - control.mean()
se_diff = np.sqrt(control.var()/len(control) + treatment.var()/len(treatment))
ci_95 = stats.t.interval(0.95, len(control)+len(treatment)-2, loc=diff_mean, scale=se_diff)
print(f"95% CI for difference: (${ci_95[0]:.2f}, ${ci_95[1]:.2f})")

# Practical significance
revenue_increase = (diff_mean / control.mean()) * 100
print(f"\nRevenue increase: {revenue_increase:.1f}%")

print("\n=== STAGE 5: REPORT RESULTS ===\n")
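# Word the report according to the test result instead of assuming significance
verdict = "had significantly higher" if p_onetail < 0.05 else "did not have significantly higher"
print(f"The treatment group {verdict} purchase amounts than the control group")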
print(f"(M_treatment = ${treatment.mean():.2f}, M_control = ${control.mean():.2f}, t({len(control)+len(treatment)-2}) = {t_stat:.2f}, p = {p_onetail:.4f}).")
print(f"The mean difference was ${diff_mean:.2f} (95% CI: ${ci_95[0]:.2f} to ${ci_95[1]:.2f}),")
print(f"representing a {revenue_increase:.1f}% increase in average purchase amount.")
print(f"The effect size was {('small' if abs(cohens_d)<0.5 else 'medium' if abs(cohens_d)<0.8 else 'large')} (Cohen's d = {cohens_d:.2f}).")

9. Exercises¶

Exercise 1: Choose the Right Test¶

For each scenario, identify the appropriate statistical test:

a) A company wants to know if customer satisfaction scores (1-10) differ between three service centers.

b) A researcher wants to test if the proportion of left-handed people differs between men and women.

c) A data scientist wants to predict house prices based on square footage, number of bedrooms, and location.

d) An analyst wants to know if there's a relationship between hours studied and exam scores.

Answers:

- a) One-way ANOVA if scores are treated as continuous (3 groups); Kruskal-Wallis if they are treated as ordinal
- b) Two-proportion z-test or Chi-square test (categorical outcome, 2 groups)
- c) Multiple linear regression (continuous outcome, multiple predictors)
- d) Pearson correlation / Simple linear regression (two continuous variables)

Exercise 2: From EDA to Hypothesis¶

You perform EDA on employee data and notice:

- The median salary for Department A is $75,000
- The median salary for Department B is $82,000
- Boxplots show some overlap but Department B appears higher

Questions:

1. What is the population of interest?
2. Formulate a null and alternative hypothesis
3. What test would you use? What assumptions would you check?
4. If p = 0.03, what would you conclude?
5. What additional information would help assess practical significance?

Exercise 3: Identify the Pitfall¶

Identify the statistical pitfall in each scenario:

a) A researcher tests 50 different variables and reports only the 3 that had p < 0.05.

b) A study finds that coffee consumption correlates with heart disease and concludes coffee causes heart disease.

c) A company reports "statistically significant improvement" (p=0.04) but the actual increase in conversion rate was 0.1%.

d) An analyst uses a t-test on heavily right-skewed income data without checking assumptions.

Answers:

- a) p-hacking / multiple comparisons problem
- b) Confusing correlation with causation
- c) Confusing statistical and practical significance
- d) Violating test assumptions

Exercise 4: Complete Analysis Pipeline¶

Using the tips dataset (available in seaborn), perform a complete analysis:

import seaborn as sns
tips = sns.load_dataset('tips')

# Your task:
# 1. EDA: Explore if smokers tip differently than non-smokers
# 2. Formulate hypothesis
# 3. Check assumptions
# 4. Conduct appropriate test
# 5. Report results with effect size and confidence interval

10. Summary¶

Key Takeaways¶

EDA is exploration; Inference is confirmation
EDA generates hypotheses; inference tests them rigorously
Samples are imperfect windows into populations
Sampling variability means we need probabilistic statements
Confidence intervals and p-values quantify uncertainty
Different questions require different methods
Estimation → Confidence intervals
Testing → Hypothesis tests (t-test, ANOVA, chi-square, etc.)
Prediction → Regression, ML models
Association → Correlation, contingency tables
Always check assumptions
Normality, independence, equal variance
Use robust alternatives when assumptions fail
Report effect sizes, not just p-values
Statistical significance ≠ Practical significance
Context matters for interpretation
Beware common pitfalls
Multiple testing without correction
Correlation ≠ Causation
Violating assumptions
Overinterpreting p-values

The Bridge You've Crossed¶

You now understand:

- ✅ Why we can't just "trust the data" without inference
- ✅ How to move from descriptive patterns to formal statistical questions
- ✅ When to use which statistical method
- ✅ How to connect EDA findings to rigorous tests
- ✅ Common mistakes to avoid

What's Next?¶

The remaining lessons will dive deeper into:

- L11-L13: Probability foundations and distributions
- L14-L16: Hypothesis testing frameworks and power analysis
- L17-L18: Regression and model evaluation
- L19-L21: Bayesian inference
- L22-L24: Time series and advanced topics

You're now ready to move beyond "what the data shows" to "what we can conclude with confidence."

11. Additional Resources¶

Books¶

"The Art of Statistics" by David Spiegelhalter - accessible introduction to statistical thinking
"Statistical Rethinking" by Richard McElreath - Bayesian approach with intuitive examples
"Naked Statistics" by Charles Wheelan - conceptual understanding without heavy math

Online Resources¶

Seeing Theory - Visual introduction to probability and statistics
StatQuest - Video explanations of statistical concepts
Cross Validated - Q&A for statistics

Python Libraries¶

scipy.stats: Statistical tests and distributions
statsmodels: Regression, hypothesis testing, time series
pingouin: User-friendly statistical tests with effect sizes
scikit-learn: Machine learning and predictive modeling
