Data Science Study Guide¶

Introduction¶

Welcome to the Data Science Study Guide! This comprehensive topic covers the essential tools and statistical methods for modern data analysis. You'll learn how to manipulate, visualize, and draw meaningful conclusions from data using industry-standard Python libraries.

Data Science combines:
- Data manipulation tools (NumPy, Pandas) for handling structured data
- Visualization techniques (Matplotlib, Seaborn) for exploring patterns
- Statistical inference (scipy, statsmodels) for drawing valid conclusions
- Practical applications through hands-on projects

This topic is designed to take you from basic data manipulation through exploratory data analysis (EDA) to rigorous statistical inference and advanced modeling techniques.

Learning Roadmap¶

The 25 lessons follow a structured progression:

Phase 1: Data Tools (L01-L06)¶

Master the fundamental libraries for data manipulation and preprocessing.

Phase 2: Exploratory Data Analysis (L07-L09)¶

Learn to visualize, summarize, and explore data patterns.

Phase 3: Bridge to Inference (L10) 🌉¶

Critical transition: Understand when and why to move beyond descriptive statistics to formal statistical testing.

Phase 4: Statistical Foundations (L11-L14)¶

Build probability foundations and learn hypothesis testing frameworks.

Phase 5: Advanced Inference (L15-L24)¶

Master specialized techniques: ANOVA, regression, Bayesian methods, time series, and experimental design.

Phase 6: Practical Integration (L25)¶

Apply everything in comprehensive real-world projects.

Lesson List¶

Lesson	Title	Difficulty	Topics
01	NumPy Basics	⭐	Arrays, indexing, broadcasting, basic operations
02	NumPy Advanced	⭐⭐	Vectorization, linear algebra, random sampling
03	Pandas Basics	⭐	Series, DataFrames, reading/writing data
04	Pandas Data Manipulation	⭐⭐	Filtering, groupby, merging, reshaping
05	Pandas Advanced	⭐⭐⭐	MultiIndex, time series, categorical data
06	Data Preprocessing	⭐⭐	Missing data, outliers, scaling, encoding
07	Descriptive Statistics & EDA	⭐⭐	Summary statistics, distributions, correlation
08	Data Visualization Basics	⭐⭐	Matplotlib fundamentals, plot types
09	Data Visualization Advanced	⭐⭐⭐	Seaborn, complex plots, interactive viz
10	From EDA to Inference	⭐⭐	Bridge lesson: population vs sample, statistical thinking, choosing tests
11	Probability Review	⭐⭐	Random variables, distributions, expectation
12	Sampling and Estimation	⭐⭐	Sampling methods, point estimation, bias/variance
13	Confidence Intervals	⭐⭐⭐	CI construction, interpretation, margin of error
14	Hypothesis Testing Advanced	⭐⭐⭐	p-values, Type I/II errors, power analysis
15	ANOVA	⭐⭐⭐	One-way, two-way, post-hoc tests
16	Regression Analysis Advanced	⭐⭐⭐	Multiple regression, diagnostics, regularization
17	Generalized Linear Models	⭐⭐⭐⭐	Logistic regression, Poisson regression, GLM theory
18	Bayesian Statistics Basics	⭐⭐⭐	Bayes theorem, prior/posterior, conjugacy
19	Bayesian Inference	⭐⭐⭐⭐	MCMC, PyMC, credible intervals
20	Time Series Basics	⭐⭐⭐	Trends, seasonality, decomposition
21	Time Series Models	⭐⭐⭐⭐	ARIMA, SARIMA, forecasting, diagnostics
22	Multivariate Analysis	⭐⭐⭐	PCA, factor analysis, clustering
23	Nonparametric Statistics	⭐⭐⭐	Rank tests, bootstrap, permutation tests
24	Experimental Design	⭐⭐⭐	A/B testing, randomization, DOE principles
25	Practical Projects	⭐⭐⭐⭐	End-to-end data science projects

Prerequisites¶

Required Knowledge¶

Python Basics: Variables, functions, loops, conditionals
Basic Math: Algebra; basic calculus is helpful but not required
Curiosity: Willingness to ask "why?" and "how can I test this?"

Recommended (but not required)¶

Familiarity with Jupyter notebooks
Basic understanding of scientific notation
Experience with any programming language

Environment Setup¶

Installation¶

Install the required libraries using pip:

# Core data science stack
pip install numpy pandas matplotlib seaborn

# Statistical libraries
pip install scipy statsmodels

# Optional: Bayesian inference
pip install pymc arviz

# Optional: Machine learning integration
pip install scikit-learn

# Optional: Interactive visualization
pip install plotly

Verify Installation¶

Run this Python snippet to verify that the core libraries are installed and importable:

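import matplotlib
import matplotlib.pyplot as plt  # confirms the plotting interface imports
import numpy as np
import pandas as pd
import scipy
import seaborn as sns
import statsmodels
import statsmodels.api as sm  # confirms the statsmodels API imports
from scipy import stats  # confirms scipy.stats imports

# Read __version__ from the top-level packages; matplotlib.pyplot and
# scipy.stats do not define it.
print("NumPy version:", np.__version__)
print("Pandas version:", pd.__version__)
print("Matplotlib version:", matplotlib.__version__)
print("Seaborn version:", sns.__version__)
print("SciPy version:", scipy.__version__)
print("Statsmodels version:", statsmodels.__version__)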
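
If an import in the verification snippet fails, it can help to see which packages are missing without stopping at the first error. The following is a minimal sketch that uses only the standard library's `importlib.metadata`. The package list mirrors the pip commands in the Installation section, and the helper name `installed_versions` is hypothetical, chosen for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they appear in the pip commands above.
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn", "scipy", "statsmodels"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print one line per package so missing ones are easy to spot.
for name, ver in installed_versions(REQUIRED).items():
    print(f"{name:12s} {ver or 'NOT INSTALLED'}")
```

Because this reads package metadata instead of importing the libraries, it reports every missing package in one pass.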

Recommended IDE¶

Jupyter Notebook or JupyterLab: Best for exploratory analysis
VS Code with Python extension: Good for script development
Google Colab: Free cloud environment (no installation needed)

This topic connects closely with other areas in the study guide:

Prerequisites (Recommended)¶

Python: Learn Python fundamentals first
Programming: Core programming concepts

Next Steps¶

Machine Learning: Predictive modeling with scikit-learn
Deep Learning: Neural networks with PyTorch
Statistics: Deeper statistical theory

Data Analysis: Lighter introduction to NumPy/Pandas
Data Engineering: Large-scale data pipelines
MLOps: Deploying models to production

How to Use This Guide¶

For Beginners¶

Start with L01-L06 to build data manipulation skills
Practice with the provided exercises and datasets
Move to L07-L09 for visualization
Don't skip L10! It's the critical bridge to inference
Progress through inference topics (L11-L24) at your own pace

For Intermediate Learners¶

Skim L01-L09 if you know NumPy/Pandas
Study L10 carefully to solidify your statistical thinking
Focus on inference topics (L11-L24) based on your interests
Complete L25 projects to integrate knowledge

For Advanced Users¶

Use as a reference for specific techniques
Review L10 for decision frameworks on choosing tests
Dive into advanced topics (Bayesian, GLM, time series)
Adapt L25 projects to your domain

Learning Tips¶

Active Learning¶

Code along: Don't just read—run every code example
Modify examples: Change parameters, try different datasets
Ask "what if?": Test edge cases and assumptions
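
As one illustration of asking "what if?", the sketch below checks how a single extreme value changes two common summaries. It uses only the standard library so it runs before anything else is installed. The bill amounts are made up, and the helper name `summarize` is hypothetical.

```python
import statistics

# Made-up restaurant bills, for illustration only.
bills = [12.5, 15.0, 16.2, 18.4, 20.1, 22.3, 24.8]


def summarize(values):
    """Return (mean, median) for a sequence of numbers."""
    return statistics.mean(values), statistics.median(values)


baseline = summarize(bills)
# What if one extreme value sneaks in, for example a data-entry error?
with_outlier = summarize(bills + [500.0])

print("baseline      mean=%.2f  median=%.2f" % baseline)
print("with outlier  mean=%.2f  median=%.2f" % with_outlier)
```

The mean moves a long way while the median barely changes. Small experiments like this build intuition for which summaries are robust.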

Practice Datasets¶

Use these built-in datasets for practice:

import seaborn as sns

# Load sample datasets
tips = sns.load_dataset('tips')
iris = sns.load_dataset('iris')
titanic = sns.load_dataset('titanic')
diamonds = sns.load_dataset('diamonds')

Key Habits¶

Always visualize before running statistical tests
Check assumptions (normality, independence, etc.)
Report effect sizes, not just p-values
Document your reasoning in comments/markdown
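
To make the "report effect sizes" habit concrete, here is a minimal sketch of Cohen's d for two independent groups, one common effect-size measure among several. It assumes the pooled-standard-deviation form of d. The data are made up, and the function name `cohens_d` is hypothetical, not part of any library used in this guide.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Made-up measurements for two groups, for illustration only.
control = [5.1, 4.9, 5.6, 5.8, 5.0, 5.3]
treatment = [5.9, 6.1, 5.7, 6.4, 6.0, 5.8]

print(f"Cohen's d = {cohens_d(treatment, control):.2f}")
```

Reporting d alongside a p-value tells the reader how large the difference is in standard-deviation units, not only whether it is distinguishable from noise.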

Assessment and Projects¶

Self-Assessment¶

Each lesson includes:
- Exercises: Practice problems with solutions
- Conceptual questions: Test your understanding
- Code challenges: Apply techniques to new scenarios

Capstone Projects (L25)¶

The final lesson includes complete projects:
1. Retail Sales Analysis: Time series forecasting
2. A/B Test Evaluation: Hypothesis testing workflow
3. Survey Data Analysis: Multivariate techniques
4. Predictive Modeling: Regression and classification

Additional Resources¶

Books¶

"Python for Data Analysis" by Wes McKinney (Pandas creator)
"The Art of Statistics" by David Spiegelhalter
"Statistical Rethinking" by Richard McElreath

Online Courses¶

Kaggle Learn: Free interactive tutorials
StatQuest: Video explanations of statistics
Seeing Theory: Visual probability/statistics

Documentation¶

Getting Help¶

During Study¶

Check official documentation first
Use the help() function, or append ? to a name in Jupyter
Search Stack Overflow for pandas/numpy questions
Ask on Cross Validated for statistics questions

Common Issues¶

ImportError: Reinstall library with pip install --upgrade <library>
DeprecationWarning: Check library versions for compatibility
MemoryError: Use smaller samples or chunking for large datasets
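
The chunking advice for MemoryError can be sketched with the standard library alone: read a fixed number of rows at a time and keep only running totals. The function name `chunked_mean`, the column name `amount`, and the inline data are all hypothetical. In pandas, the analogous approach is passing `chunksize` to `pd.read_csv`, which yields DataFrames one chunk at a time.

```python
import csv
import io
from itertools import islice


def chunked_mean(rows, column, chunk_size=1000):
    """Mean of a numeric column, holding at most chunk_size rows in memory."""
    rows = iter(rows)  # ensure islice continues where the last chunk ended
    total, count = 0.0, 0
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            break
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count if count else float("nan")


# Stand-in for a large file on disk; a real file object opened with
# newline="" would be passed to csv.DictReader in the same way.
fake_file = io.StringIO("id,amount\n1,10.0\n2,20.0\n3,30.0\n4,40.0\n5,50.0\n")
print(chunked_mean(csv.DictReader(fake_file), "amount", chunk_size=2))  # 30.0
```

Only statistics that can be updated incrementally (sums, counts, minima, maxima) work this way. A median, for instance, needs a different strategy.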

Philosophy of This Guide¶

Balancing Rigor and Intuition¶

We aim to:
- Build intuition first: Visual and conceptual understanding before formulas
- Connect theory to practice: Every concept with code examples
- Emphasize critical thinking: Know when to use techniques, not just how

The EDA-Inference Connection¶

Lesson 10 is the heart of this guide. Most courses treat EDA and inference as separate topics. We emphasize the transition:
- EDA generates questions → Inference answers them rigorously
- Visualization suggests patterns → Tests confirm them with controlled error
- Descriptive stats describe your sample → Inference generalizes to populations
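
As a rough illustration of moving from "the groups look different" to a test with a controlled error rate, the sketch below runs a two-sided permutation test on a difference in means. It is one possible approach, not necessarily the one Lesson 10 uses. The data are made up, and the function name `permutation_test` and its parameters are hypothetical.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference and an approximate p-value.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = statistics.fmean(group_a) - statistics.fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel the observations at random
        diff = statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
        # Small tolerance guards against floating-point ties.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # The add-one correction keeps the estimate away from an impossible p = 0.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up tip percentages for two meal times, for illustration only.
lunch = [14.2, 15.1, 13.8, 16.0, 15.5, 14.9]
dinner = [16.8, 17.5, 15.9, 18.2, 17.1, 16.4]

diff, p = permutation_test(dinner, lunch)
print(f"observed difference = {diff:.2f}, p = {p:.4f}")
```

The descriptive step is the observed difference in sample means. The inferential step asks how often a difference at least that large would arise if the group labels were irrelevant.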

Start here: 01_NumPy_Basics
Critical bridge: 10_From_EDA_to_Inference
Final projects: 25_Practical_Projects

Ready to begin your data science journey? Let's start with NumPy fundamentals!