Statistical Foundations for Data Science
Build the statistical intuition needed for machine learning. Understand distributions, central tendency, variability, correlation, and basic hypothesis testing.
TIP: Learning Objectives: After this lesson, you'll understand the statistical concepts essential for data science: distributions, central tendency, variability, correlation, and basic hypothesis testing.
Why Statistics for Data Science?
Statistics provides the mathematical foundation for making decisions from data. It helps us distinguish real patterns from random noise.
Measures of Central Tendency
Where is the "center" of your data?
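As a minimal sketch of how the mean and the median can disagree, the snippet below uses Python's built-in `statistics` module on a made-up salary list (the numbers are hypothetical, chosen so that one outlier pulls the mean upward).

```python
import statistics

# Hypothetical salaries in thousands; the last value is a deliberate outlier.
salaries = [42, 45, 47, 50, 52, 55, 300]

mean_salary = statistics.mean(salaries)      # sensitive to the outlier
median_salary = statistics.median(salaries)  # middle value, robust to the outlier

print(f"mean   = {mean_salary:.1f}")
print(f"median = {median_salary}")
```

With skewed data like this, the median is usually the more representative "center"; for roughly symmetric data the two tend to agree.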
Mode and Other Measures
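The mode is the most frequent value, and it is the only one of the three measures that works for categorical data. A small sketch with hypothetical data, again using the standard-library `statistics` module:

```python
import statistics

# Hypothetical categorical data: T-shirt sizes sold.
sizes = ["M", "L", "M", "S", "M", "L", "XL"]
most_common = statistics.mode(sizes)

# Hypothetical ratings where two values tie for most frequent.
ratings = [1, 2, 2, 3, 3, 5]
modes = statistics.multimode(ratings)  # returns every tied mode as a list

print(most_common)
print(modes)
```

`multimode` is the safer choice when ties are possible, because `mode` returns only a single value.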
Measures of Spread (Variability)
How spread out is your data?
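The following sketch computes standard deviation and interquartile range (IQR) on a small hypothetical dataset. It assumes the default "exclusive" quartile method of `statistics.quantiles`; other tools (for example NumPy's default percentile method) can give a slightly different IQR for the same data.

```python
import statistics

# Hypothetical measurements.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_std = statistics.pstdev(data)    # population standard deviation (divides by n)
sample_std = statistics.stdev(data)  # sample standard deviation (divides by n - 1)

# Quartiles using the module's default "exclusive" method.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1  # spread of the middle 50% of the data

print(f"population std = {pop_std:.3f}")
print(f"sample std     = {sample_std:.3f}")
print(f"IQR            = {iqr:.3f}")
```

Standard deviation uses every value and so reacts to outliers; the IQR ignores the tails and is the more robust of the two.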
The 68-95-99.7 Rule (Empirical Rule)
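The rule says that for a normal distribution, roughly 68%, 95%, and 99.7% of values fall within 1, 2, and 3 standard deviations of the mean. One way to check those figures, sketched here with the standard library's `NormalDist`:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Probability mass within k standard deviations of the mean.
coverage = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, prob in coverage.items():
    print(f"within {k} sd: {prob:.4f}")
```

The rule only applies to approximately normal data; for skewed or heavy-tailed data these percentages can be far off.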
Common Probability Distributions
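A brief sketch of the four distributions named in this lesson, using `scipy.stats`. The parameter values are arbitrary choices for illustration, not values taken from the lesson.

```python
from scipy import stats

# Normal: continuous, symmetric (e.g. measurement error).
p_below_mean = stats.norm(loc=0, scale=1).cdf(0)

# Uniform on [0, 10]: every value in the range is equally likely.
uniform_mean = stats.uniform(loc=0, scale=10).mean()

# Binomial: number of successes in n independent yes/no trials.
p_five_heads = stats.binom.pmf(5, n=10, p=0.5)

# Poisson: count of events in a fixed interval, given an average rate mu.
p_no_events = stats.poisson.pmf(0, mu=3)

print(f"P(Z <= 0)                 = {p_below_mean:.3f}")
print(f"mean of Uniform(0, 10)    = {uniform_mean:.1f}")
print(f"P(5 heads in 10 flips)    = {p_five_heads:.4f}")
print(f"P(0 events | rate 3)      = {p_no_events:.4f}")
```

Each distribution object also offers `rvs` for drawing random samples, which is handy for simulation.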
Visualizing Distributions
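A histogram is the usual first look at a distribution. The sketch below bins a simulated sample with `numpy.histogram` and prints a rough text histogram so it runs without a plotting library; in practice a call such as matplotlib's `plt.hist(sample, bins=10)` would typically be used. The sample itself is synthetic, drawn with an arbitrary seed.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=50, scale=10, size=1000)  # synthetic, roughly bell-shaped

# Count how many values fall into each of 10 equal-width bins.
counts, edges = np.histogram(sample, bins=10)

# One row per bin; each '#' stands for 10 observations.
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    bar = "#" * (int(count) // 10)
    print(f"{left:6.1f} to {right:6.1f} | {bar}")
```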
Correlation and Relationships
How strongly are two variables related?
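Pearson's correlation measures linear association, while Spearman's rank correlation measures any monotonic association. The sketch below uses a made-up relationship (`y = x**3`) that is perfectly monotonic but not linear, so the two coefficients differ.

```python
import numpy as np
from scipy import stats

x = np.arange(1, 11)
y = x ** 3  # perfectly monotonic in x, but curved rather than linear

pearson_r, pearson_p = stats.pearsonr(x, y)     # strength of linear relationship
spearman_rho, spearman_p = stats.spearmanr(x, y)  # strength of monotonic relationship

print(f"Pearson r    = {pearson_r:.3f}")
print(f"Spearman rho = {spearman_rho:.3f}")
```

A Pearson coefficient below 1 here does not mean the relationship is noisy, only that a straight line does not describe it fully.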
Correlation vs Causation
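A common way two variables end up correlated without one causing the other is a shared driver (a confounder). The simulation below is a hypothetical illustration: temperature drives both ice cream sales and drownings, and all coefficients, noise levels, and the seed are invented.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = 500

# Simulated confounder and two outcomes that both depend on it.
temperature = rng.normal(25, 5, n)
ice_cream = 2.0 * temperature + rng.normal(0, 3, n)
drownings = 0.5 * temperature + rng.normal(0, 1, n)

raw_corr = np.corrcoef(ice_cream, drownings)[0, 1]

def residuals(y, x):
    """Return what is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation after controlling for temperature (a partial correlation).
partial_corr = np.corrcoef(
    residuals(ice_cream, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw correlation:                   {raw_corr:.2f}")
print(f"after controlling for temperature: {partial_corr:.2f}")
```

In this simulation the strong raw correlation largely disappears once temperature is accounted for, because neither outcome influences the other.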
Basic Hypothesis Testing
Make decisions about populations based on samples.
One-Sample t-test
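A one-sample t-test asks whether a sample mean differs from a reference value. The sketch below uses `scipy.stats.ttest_1samp` on hypothetical measurements against an assumed reference mean of 5.0, and recomputes the t statistic by hand to show what the function does.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements; the reference value 5.0 is an assumption.
sample = np.array([5.1, 4.9, 5.6, 5.8, 6.0, 5.4, 5.7, 5.9, 5.3, 5.5])
reference_mean = 5.0

t_stat, p_value = stats.ttest_1samp(sample, popmean=reference_mean)

# Same statistic by hand: (sample mean - reference) / standard error.
standard_error = sample.std(ddof=1) / np.sqrt(len(sample))
manual_t = (sample.mean() - reference_mean) / standard_error

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the sample mean differs from the reference.")
else:
    print("Fail to reject H0: no evidence of a difference.")
```

The 0.05 threshold is a convention, not a law; it should be chosen before looking at the data.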
Two-Sample t-test
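A two-sample t-test compares the means of two independent groups. This sketch uses `scipy.stats.ttest_ind` with `equal_var=False` (Welch's t-test), which does not assume the groups share a variance; that choice and the data are illustrative assumptions.

```python
from scipy import stats

# Hypothetical metric for two independent groups (e.g. control vs. variant).
group_a = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7]
group_b = [12.9, 13.1, 12.7, 13.4, 12.8, 13.0, 13.2, 12.6]

# Welch's t-test: does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"t = {t_stat:.3f}, p = {p_value:.6f}")
```

A negative t statistic here simply means the first group's mean is lower than the second's. If the same subjects are measured twice, a paired test (`stats.ttest_rel`) is the appropriate choice instead.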
Confidence Intervals
Quantify uncertainty in estimates.
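For a sample mean, a 95% confidence interval is commonly built as mean ± t* × standard error, where t* comes from the t distribution with n − 1 degrees of freedom. A minimal sketch with hypothetical data, using `scipy.stats.sem` and `scipy.stats.t.interval`:

```python
import numpy as np
from scipy import stats

# Hypothetical measurements.
sample = np.array([5.1, 4.9, 5.6, 5.8, 6.0, 5.4, 5.7, 5.9, 5.3, 5.5])

n = len(sample)
mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean (uses ddof=1)

# 95% interval from the t distribution with n - 1 degrees of freedom.
low, high = stats.t.interval(0.95, n - 1, loc=mean, scale=sem)

print(f"mean = {mean:.3f}")
print(f"95% CI = ({low:.3f}, {high:.3f})")
```

The interval describes the procedure, not this one dataset: if the sampling were repeated many times, about 95% of intervals built this way would contain the true mean.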
Practical Application
Key Takeaways
✅ Central tendency: Mean, median, mode—choose based on distribution shape
✅ Spread: Standard deviation and IQR measure variability
✅ Distributions: Normal, uniform, binomial, Poisson—know when to use each
✅ Correlation: Measures linear relationship strength (-1 to +1)
✅ Hypothesis testing: Framework for making data-driven decisions
✅ Confidence intervals: Quantify uncertainty in estimates
✅ Causation: Requires more than just correlation
Connections: Statistics in Data Science
🔗 Connection to Machine Learning
| Statistical Concept | ML Application |
|---|---|
| Probability distributions | Naive Bayes, probabilistic models |
| Hypothesis testing | A/B testing, feature selection |
| Confidence intervals | Model uncertainty |
| Correlation | Feature engineering, multicollinearity |
| Variance | Bias-variance tradeoff |
🔗 Connection to Business Decisions
| Business Question | Statistical Approach |
|---|---|
| Is the new feature better? | A/B test, t-test |
| What's the expected revenue? | Mean + Confidence interval |
| Is this result reliable? | Hypothesis testing |
| Which factors matter most? | Correlation analysis |
Practice Exercise
Next Steps
In the final lesson, you'll apply everything in a complete data analysis project—from loading raw data to presenting insights.
Ready for the capstone? Let's put it all together!
Further Reading
Interactive Visualizations
- Seeing Theory (Brown University) — the most beautiful interactive intro to probability and statistics. Distributions, CLT, regression, Bayesian inference, all live.
- Setosa — Conditional Probability — the clearest single explainer of Bayes' rule.
- Distill — Visual Information Theory — Christopher Olah. Neighbor topic; deepens intuition for entropy.
Free Books
- Think Stats (2e) — Allen Downey. Free, Python-first, exactly the right level for a data scientist.
- Think Bayes (2e) — Downey's Bayesian companion.
- An Introduction to Statistical Learning — James, Witten, Hastie, Tibshirani. Chapter 3 (linear regression) is the bridge from "stats" to "ML."
- OpenIntro Statistics (4e) — undergraduate stats textbook, free PDF.
Video Series
- StatQuest — Statistics Fundamentals (Josh Starmer) — pair every concept in this lesson with the matching StatQuest video. The clearest intuitions on YouTube.
- 3Blue1Brown — Probability — geometric intuition.
Modern Python Stats Stack
- scipy.stats: the workhorse.
- statsmodels: for serious statistical modeling (OLS with diagnostics, time series, GLMs).
- pingouin: friendlier API for everyday hypothesis tests.
- PyMC v5: modern Python probabilistic programming for Bayesian models.
Don't Misuse Stats
- ASA Statement on p-values — the official "p < 0.05 isn't what you think" statement.
- Book: Statistics Done Wrong — Alex Reinhart (free online). Required reading before you ship any stats-driven analysis.