Data Science Study Guide¶
Introduction¶
Welcome to the Data Science Study Guide! This comprehensive topic covers the essential tools and statistical methods for modern data analysis. You'll learn how to manipulate, visualize, and draw meaningful conclusions from data using industry-standard Python libraries.
Data Science combines:
- Data manipulation tools (NumPy, Pandas) for handling structured data
- Visualization techniques (Matplotlib, Seaborn) for exploring patterns
- Statistical inference (scipy, statsmodels) for drawing valid conclusions
- Practical applications through hands-on projects
This topic is designed to take you from basic data manipulation through exploratory data analysis (EDA) to rigorous statistical inference and advanced modeling techniques.
Learning Roadmap¶
The 25 lessons follow a structured progression:
Phase 1: Data Tools (L01-L06)¶
Master the fundamental libraries for data manipulation and preprocessing.
Phase 2: Exploratory Data Analysis (L07-L09)¶
Learn to visualize, summarize, and explore data patterns.
Phase 3: Bridge to Inference (L10)¶
Critical transition: Understand when and why to move beyond descriptive statistics to formal statistical testing.
Phase 4: Statistical Foundations (L11-L14)¶
Build probability foundations and learn hypothesis testing frameworks.
Phase 5: Advanced Inference (L15-L24)¶
Master specialized techniques: ANOVA, regression, Bayesian methods, time series, and experimental design.
Phase 6: Practical Integration (L25)¶
Apply everything in comprehensive real-world projects.
Lesson List¶
| Lesson | Title | Difficulty | Topics |
|---|---|---|---|
| 01 | NumPy Basics | ⭐ | Arrays, indexing, broadcasting, basic operations |
| 02 | NumPy Advanced | ⭐⭐ | Vectorization, linear algebra, random sampling |
| 03 | Pandas Basics | ⭐ | Series, DataFrames, reading/writing data |
| 04 | Pandas Data Manipulation | ⭐⭐ | Filtering, groupby, merging, reshaping |
| 05 | Pandas Advanced | ⭐⭐⭐ | MultiIndex, time series, categorical data |
| 06 | Data Preprocessing | ⭐⭐ | Missing data, outliers, scaling, encoding |
| 07 | Descriptive Statistics & EDA | ⭐⭐ | Summary statistics, distributions, correlation |
| 08 | Data Visualization Basics | ⭐⭐ | Matplotlib fundamentals, plot types |
| 09 | Data Visualization Advanced | ⭐⭐⭐ | Seaborn, complex plots, interactive viz |
| 10 | From EDA to Inference | ⭐⭐ | Bridge lesson: population vs sample, statistical thinking, choosing tests |
| 11 | Probability Review | ⭐⭐ | Random variables, distributions, expectation |
| 12 | Sampling and Estimation | ⭐⭐ | Sampling methods, point estimation, bias/variance |
| 13 | Confidence Intervals | ⭐⭐⭐ | CI construction, interpretation, margin of error |
| 14 | Hypothesis Testing Advanced | ⭐⭐⭐ | p-values, Type I/II errors, power analysis |
| 15 | ANOVA | ⭐⭐⭐ | One-way, two-way, post-hoc tests |
| 16 | Regression Analysis Advanced | ⭐⭐⭐ | Multiple regression, diagnostics, regularization |
| 17 | Generalized Linear Models | ⭐⭐⭐⭐ | Logistic regression, Poisson regression, GLM theory |
| 18 | Bayesian Statistics Basics | ⭐⭐⭐ | Bayes theorem, prior/posterior, conjugacy |
| 19 | Bayesian Inference | ⭐⭐⭐⭐ | MCMC, PyMC, credible intervals |
| 20 | Time Series Basics | ⭐⭐⭐ | Trends, seasonality, decomposition |
| 21 | Time Series Models | ⭐⭐⭐⭐ | ARIMA, SARIMA, forecasting, diagnostics |
| 22 | Multivariate Analysis | ⭐⭐⭐ | PCA, factor analysis, clustering |
| 23 | Nonparametric Statistics | ⭐⭐⭐ | Rank tests, bootstrap, permutation tests |
| 24 | Experimental Design | ⭐⭐⭐ | A/B testing, randomization, DOE principles |
| 25 | Practical Projects | ⭐⭐⭐⭐ | End-to-end data science projects |
Prerequisites¶
Required Knowledge¶
- Python Basics: Variables, functions, loops, conditionals
- Basic Math: Algebra is required; basic calculus is helpful but not required
- Curiosity: Willingness to ask "why?" and "how can I test this?"
Recommended (but not required)¶
- Familiarity with Jupyter notebooks
- Basic understanding of scientific notation
- Experience with any programming language
Environment Setup¶
Installation¶
Install the required libraries using pip:
# Core data science stack
pip install numpy pandas matplotlib seaborn
# Statistical libraries
pip install scipy statsmodels
# Optional: Bayesian inference
pip install pymc arviz
# Optional: Machine learning integration
pip install scikit-learn
# Optional: Interactive visualization
pip install plotly
Verify Installation¶
Run this Python snippet to verify all libraries are installed:
import matplotlib
import numpy as np
import pandas as pd
import scipy
import seaborn as sns
import statsmodels

# __version__ lives on each top-level package, not on submodules
# such as matplotlib.pyplot or scipy.stats
print("NumPy version:", np.__version__)
print("Pandas version:", pd.__version__)
print("Matplotlib version:", matplotlib.__version__)
print("Seaborn version:", sns.__version__)
print("SciPy version:", scipy.__version__)
print("Statsmodels version:", statsmodels.__version__)
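If you would rather see every missing library in one pass instead of stopping at the first failed import, a check based on the standard library's `importlib.metadata` is one option. This is a minimal sketch: the `REQUIRED` list and the `installed_versions` helper are illustrative names, not part of the guide, and the list simply mirrors the required pip commands above.

```python
from importlib import metadata

# Distribution names as pip knows them (assumed to match the install commands above)
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn", "scipy", "statsmodels"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Because this reads package metadata rather than importing anything, it reports what is installed in the current environment but does not prove that each library imports cleanly.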
Recommended IDE¶
- Jupyter Notebook or JupyterLab: Best for exploratory analysis
- VS Code with Python extension: Good for script development
- Google Colab: Free cloud environment (no installation needed)
Related Topics¶
This topic connects closely with other areas in the study guide:
Prerequisites (Recommended)¶
- Python: Learn Python fundamentals first
- Programming: Core programming concepts
Next Steps¶
- Machine Learning: Predictive modeling with scikit-learn
- Deep Learning: Neural networks with PyTorch
- Statistics: Deeper statistical theory
Related Applications¶
- Data Analysis: Lighter introduction to NumPy/Pandas
- Data Engineering: Large-scale data pipelines
- MLOps: Deploying models to production
How to Use This Guide¶
For Beginners¶
- Start with L01-L06 to build data manipulation skills
- Practice with the provided exercises and datasets
- Move to L07-L09 for visualization
- Don't skip L10! It's the critical bridge to inference
- Progress through inference topics (L11-L24) at your own pace
For Intermediate Learners¶
- Skim L01-L09 if you know NumPy/Pandas
- Study L10 carefully to solidify your statistical thinking
- Focus on inference topics (L11-L24) based on your interests
- Complete L25 projects to integrate knowledge
For Advanced Users¶
- Use as a reference for specific techniques
- Review L10 for decision frameworks on choosing tests
- Dive into advanced topics (Bayesian, GLM, time series)
- Adapt L25 projects to your domain
Learning Tips¶
Active Learning¶
- Code along: Don't just read; run every code example
- Modify examples: Change parameters, try different datasets
- Ask "what if?": Test edge cases and assumptions
Practice Datasets¶
Use these built-in datasets for practice:
import seaborn as sns
# Load sample datasets
tips = sns.load_dataset('tips')
iris = sns.load_dataset('iris')
titanic = sns.load_dataset('titanic')
diamonds = sns.load_dataset('diamonds')
Key Habits¶
- Always visualize before running statistical tests
- Check assumptions (normality, independence, etc.)
- Report effect sizes, not just p-values
- Document your reasoning in comments/markdown
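As a rough illustration of the habits above, the sketch below checks normality, runs a test, and reports an effect size alongside the p-value. The two groups are synthetic data generated only for this example, `cohens_d` is a hypothetical helper written here (not a SciPy function), and the choice of Welch's t-test is an assumption about what fits such data. The plotting step is omitted to keep the example short; in practice you would look at histograms or box plots first.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=50.0, scale=10.0, size=40)
group_b = rng.normal(loc=55.0, scale=10.0, size=40)

# Check assumptions: Shapiro-Wilk test of normality for each group
_, p_normal_a = stats.shapiro(group_a)
_, p_normal_b = stats.shapiro(group_b)

# Welch's t-test, which does not assume equal variances
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)


def cohens_d(x, y):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)


effect_size = cohens_d(group_a, group_b)

print(f"Shapiro p-values: A={p_normal_a:.3f}, B={p_normal_b:.3f}")
print(f"Welch t={t_stat:.2f}, p={p_value:.4f}, Cohen's d={effect_size:.2f}")
```

Reporting `effect_size` next to `p_value` shows how large the difference is, not only whether it is statistically detectable.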
Assessment and Projects¶
Self-Assessment¶
Each lesson includes:
- Exercises: Practice problems with solutions
- Conceptual questions: Test your understanding
- Code challenges: Apply techniques to new scenarios
Capstone Projects (L25)¶
The final lesson includes complete projects:
1. Retail Sales Analysis: Time series forecasting
2. A/B Test Evaluation: Hypothesis testing workflow
3. Survey Data Analysis: Multivariate techniques
4. Predictive Modeling: Regression and classification
Additional Resources¶
Books¶
- "Python for Data Analysis" by Wes McKinney (Pandas creator)
- "The Art of Statistics" by David Spiegelhalter
- "Statistical Rethinking" by Richard McElreath
Online Courses¶
- Kaggle Learn: Free interactive tutorials
- StatQuest: Video explanations of statistics
- Seeing Theory: Visual probability/statistics
Documentation¶
Getting Help¶
During Study¶
- Check official documentation first
- Use the `help()` function or `?` in Jupyter
- Search Stack Overflow for pandas/numpy questions
- Ask on Cross Validated for statistics questions
Common Issues¶
- ImportError: Install or upgrade the library with `pip install --upgrade <library>`, and confirm you are running the Python environment you installed it into
- DeprecationWarning: Check library versions for compatibility
- MemoryError: Use smaller samples or chunking for large datasets
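The chunking suggestion can be sketched with the `chunksize` parameter of `pandas.read_csv`, which yields DataFrames piece by piece instead of loading the whole file. The in-memory CSV and the `value` column below are stand-ins invented for this example; with real data you would pass a file path.

```python
import io

import pandas as pd

# Stand-in for a large file on disk (a tiny CSV with the values 1..10)
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(1, 11)))

total = 0
row_count = 0

# With chunksize set, read_csv returns an iterator of DataFrames
for chunk in pd.read_csv(csv_data, chunksize=4):
    total += chunk["value"].sum()
    row_count += len(chunk)

# The mean is assembled from per-chunk sums, so the full data never sits in memory
mean_value = total / row_count
print(f"rows={row_count}, mean={mean_value}")
```

This pattern works for statistics that can be accumulated, such as sums and counts. Quantities like the median need a different approach or a sample.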
Philosophy of This Guide¶
Balancing Rigor and Intuition¶
We aim to:
- Build intuition first: Visual and conceptual understanding before formulas
- Connect theory to practice: Every concept comes with code examples
- Emphasize critical thinking: Know when to use techniques, not just how
The EDA-Inference Connection¶
Lesson 10 is the heart of this guide. Most courses treat EDA and inference as separate topics. We emphasize the transition:
- EDA generates questions → Inference answers them rigorously
- Visualization suggests patterns → Tests check them with controlled error rates
- Descriptive stats describe your sample → Inference generalizes to populations
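To make the last contrast concrete, the sketch below computes a descriptive summary of a sample and then a 95% confidence interval for the population mean. The sample is synthetic, and the t-interval is one assumed choice of method for roughly normal data; Lesson 10 may present the idea differently.

```python
import numpy as np
from scipy import stats

# Synthetic sample for illustration only
rng = np.random.default_rng(seed=42)
sample = rng.normal(loc=100.0, scale=15.0, size=50)

# Descriptive statistics: statements about this sample
sample_mean = sample.mean()
sample_sd = sample.std(ddof=1)

# Inference: a 95% t-interval, a statement about the population mean
standard_error = sample_sd / np.sqrt(len(sample))
ci_low, ci_high = stats.t.interval(
    0.95, df=len(sample) - 1, loc=sample_mean, scale=standard_error
)

print(f"Sample mean: {sample_mean:.2f} (SD {sample_sd:.2f})")
print(f"95% CI for the population mean: [{ci_low:.2f}, {ci_high:.2f}]")
```

The sample mean is a fact about the 50 observed values, while the interval expresses the uncertainty in generalizing from them to the population.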
Navigation¶
- Start here: 01_NumPy_Basics
- Critical bridge: 10_From_EDA_to_Inference
- Final projects: 25_Practical_Projects
Ready to begin your data science journey? Let's start with NumPy fundamentals!