Statistical Foundations for Data Science
Build the statistical intuition needed for machine learning. Understand distributions, central tendency, variability, correlation, and basic hypothesis testing.
Tip: Learning Objectives: After this lesson, you'll understand the statistical concepts essential for data science: distributions, central tendency, variability, correlation, and basic hypothesis testing.
Why Statistics for Data Science?
Statistics provides the mathematical foundation for making decisions from data. It helps us distinguish real patterns from random noise. Before we crunch numbers, it helps to see what a distribution is: the shape that central tendency, spread, and the empirical rule all describe.
Try it: Toggle each curve's legend entry on and off to isolate it, and hover along a line to read its height at any point. Notice how switching between the two series changes how tall and how wide the bell appears — that contrast is exactly the "spread" idea you'll quantify below.
Measures of Central Tendency
Where is the "center" of your data?
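The two most common answers are the mean and the median. The sketch below uses Python's standard `statistics` module on a made-up salary list (the data and variable names are illustrative assumptions) to show how one outlier pulls the mean while the median barely moves.

```python
import statistics

# Hypothetical salaries in thousands; the last value is a deliberate outlier.
salaries = [42, 45, 45, 48, 50, 52, 55, 250]

mean_salary = statistics.mean(salaries)      # arithmetic average, sensitive to outliers
median_salary = statistics.median(salaries)  # middle value of the sorted data

print(f"mean   = {mean_salary:.3f}")
print(f"median = {median_salary:.3f}")
```

When the two numbers disagree this much, the data is probably skewed, and the median is usually the safer summary.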
Mode and Other Measures
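The mode is the most frequent value, which makes it the natural "center" for categorical or discrete data. As a rough sketch, the snippet below uses the standard `statistics` module; the trimmed mean is included as one example of an "other" measure, and both `trimmed_mean` and the sample data are illustrative assumptions.

```python
import statistics

# Hypothetical shoe sizes sold in one day.
sizes = [7, 8, 8, 9, 9, 9, 10, 10, 11]

most_common = statistics.mode(sizes)                # single most frequent value
tied_modes = statistics.multimode([1, 1, 2, 2, 3])  # all values tied for most frequent

def trimmed_mean(values, proportion=0.1):
    """Mean after dropping `proportion` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)       # how many values to cut per side
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)

print(most_common, tied_modes)
print(trimmed_mean([1, 2, 3, 4, 100], proportion=0.2))
```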
Measures of Spread (Variability)
How spread out is your data?
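The usual measures are the range, variance, standard deviation, and interquartile range (IQR). Here is a minimal NumPy sketch on made-up test scores; `ddof=1` is used on the assumption that the data is a sample rather than the whole population.

```python
import numpy as np

# Hypothetical test scores.
scores = np.array([62, 70, 71, 75, 78, 80, 84, 90])

data_range = scores.max() - scores.min()  # largest minus smallest
sample_var = scores.var(ddof=1)           # sample variance (divides by n - 1)
sample_std = scores.std(ddof=1)           # sample standard deviation
q1, q3 = np.percentile(scores, [25, 75])  # first and third quartiles
iqr = q3 - q1                             # spread of the middle 50%

print(f"range={data_range}, var={sample_var:.2f}, std={sample_std:.2f}, IQR={iqr:.2f}")
```

Standard deviation is in the same units as the data, which is why it is reported more often than variance. The IQR ignores the tails, so it is more robust to outliers.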
The 68-95-99.7 Rule (Empirical Rule)
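For normally distributed data, roughly 68% of values fall within one standard deviation of the mean, 95% within two, and 99.7% within three. The simulation below is one way to check this empirically; the mean, standard deviation, and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded so the run is repeatable
mu, sigma = 100.0, 15.0          # hypothetical mean and standard deviation
samples = rng.normal(mu, sigma, size=100_000)

within = {}
for k in (1, 2, 3):
    inside = np.abs(samples - mu) <= k * sigma  # True where a value is within k sigma
    within[k] = inside.mean()                   # fraction of True values
    print(f"within {k} sigma: {within[k]:.2%}")
```

The fractions will not match 68/95/99.7 exactly because the sample is finite, but they should land close. The rule applies only to data that is approximately normal.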
Common Probability Distributions
A handful of distributions cover most real-world data. Know their shape, when they show up, and the NumPy call that generates them:
| Distribution | Shape | Examples | Parameters | NumPy |
|---|---|---|---|---|
| Normal (Gaussian) | Bell curve, symmetric | Heights, test scores, measurement errors | mean (μ), std dev (σ) | np.random.normal(mean, std, size) |
| Uniform | Flat, equal probability | RNG output; dice rolls (discrete analogue) | min, max | np.random.uniform(low, high, size) |
| Binomial | Discrete, # of successes | Coin flips, conversion rates | n (trials), p (probability) | np.random.binomial(n, p, size) |
| Poisson | Discrete, count of events | Customers/hour, errors/day | λ (average rate) | np.random.poisson(lam, size) |
| Exponential | Right-skewed continuous | Time between events, lifetime | scale (1/λ) | np.random.exponential(scale, size) |
Now see them drawn from real samples:
Visualizing Distributions
We met the GraphPlotter at the top of this lesson — flip back to it and re-toggle the two normal curves now that you know μ and σ by name. Below, we generate samples from several distributions and compare their summary statistics.
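A sketch of that comparison follows. It uses NumPy's `Generator` API (`np.random.default_rng`), whose methods mirror the `np.random.*` calls in the table; the parameter values and seed are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)  # seeded for repeatability
n = 10_000

samples = {
    "normal": rng.normal(50, 10, n),         # mean 50, std dev 10
    "uniform": rng.uniform(0, 10, n),        # flat between 0 and 10
    "binomial": rng.binomial(10, 0.3, n),    # 10 trials, p = 0.3
    "poisson": rng.poisson(4, n),            # average rate 4
    "exponential": rng.exponential(2.0, n),  # scale 2, i.e. rate 0.5
}

summary = {}
for name, values in samples.items():
    summary[name] = {
        "mean": values.mean(),
        "median": np.median(values),
        "std": values.std(ddof=1),
    }
    s = summary[name]
    print(f"{name:<12} mean={s['mean']:7.3f} median={s['median']:7.3f} std={s['std']:7.3f}")
```

For the symmetric distributions the mean and median should nearly agree. For the exponential, the mean sits above the median, which is the signature of right skew.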
Correlation and Relationships
How strongly are two variables related?
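The standard answer is the Pearson correlation coefficient r, which ranges from -1 to +1. The sketch below computes it with `np.corrcoef` and again by hand from the definition, using made-up study-hours data (an illustrative assumption).

```python
import numpy as np

# Hypothetical data: hours studied and exam score for eight students.
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
scores = np.array([52, 55, 61, 64, 70, 71, 78, 85])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
r = np.corrcoef(hours, scores)[0, 1]

# Same quantity from the definition: covariance over the product of spreads.
dx = hours - hours.mean()
dy = scores - scores.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(f"r = {r:.4f}, by hand = {r_manual:.4f}")
```

Pearson's r only captures linear relationships. A strong curved relationship can still produce an r near zero, so plot the data before trusting the number.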
Correlation vs Causation
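Two variables can be strongly correlated simply because a third variable drives both. The simulation below is a sketch of that situation: a hypothetical `temperature` variable causes both `ice_cream` sales and `sunburns`, which have no direct effect on each other. All names, coefficients, and the seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

temperature = rng.normal(25, 5, n)                   # the hidden common cause
ice_cream = 10 * temperature + rng.normal(0, 20, n)  # driven by temperature
sunburns = 2 * temperature + rng.normal(0, 5, n)     # also driven by temperature

r_raw = np.corrcoef(ice_cream, sunburns)[0, 1]

def residuals(y, x):
    """What is left of y after removing its best straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlate what remains once temperature is accounted for in both variables.
r_controlled = np.corrcoef(
    residuals(ice_cream, temperature),
    residuals(sunburns, temperature),
)[0, 1]

print(f"raw correlation:              {r_raw:.3f}")
print(f"after controlling for temp.:  {r_controlled:.3f}")
```

The raw correlation is strong, but it largely vanishes once the confounder is controlled for. In real data the confounder is often unmeasured, which is why correlation alone cannot establish causation.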
Basic Hypothesis Testing
Make decisions about populations based on samples.
One-Sample t-test
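A one-sample t-test asks whether a sample mean differs from a claimed population mean by more than chance would explain. Below is a minimal sketch using `scipy.stats.ttest_1samp`; the bag weights, the claimed mean of 500, and the 0.05 threshold are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Hypothetical weights (grams) of ten bags labeled as 500 g.
weights = np.array([498, 502, 495, 501, 497, 499, 503, 496, 494, 500])
claimed_mean = 500
alpha = 0.05  # conventional significance threshold, chosen here as an assumption

result = stats.ttest_1samp(weights, popmean=claimed_mean)
t_stat, p_value = result.statistic, result.pvalue

# The same t statistic from the formula: (sample mean - claim) / standard error.
standard_error = weights.std(ddof=1) / np.sqrt(len(weights))
t_manual = (weights.mean() - claimed_mean) / standard_error

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value < alpha:
    print("Reject the null hypothesis: the mean appears to differ from the claim.")
else:
    print("Fail to reject the null hypothesis: no clear evidence of a difference.")
```

Failing to reject is not proof that the claim is true. It only means this sample does not provide strong evidence against it.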
Two-Sample t-test
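A two-sample t-test compares the means of two independent groups. The sketch below uses `scipy.stats.ttest_ind` with `equal_var=False` (Welch's variant, which does not assume the groups share a variance). The checkout-time data for the two page designs is made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical checkout times (seconds) under two page designs.
design_a = np.array([12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7])
design_b = np.array([12.9, 13.1, 12.6, 13.0, 12.8, 13.3, 12.7, 13.2])

result = stats.ttest_ind(design_a, design_b, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue

# Welch's t by hand: difference in means over the combined standard error.
se = np.sqrt(design_a.var(ddof=1) / len(design_a) + design_b.var(ddof=1) / len(design_b))
t_manual = (design_a.mean() - design_b.mean()) / se

print(f"mean A = {design_a.mean():.3f}, mean B = {design_b.mean():.3f}")
print(f"t = {t_stat:.3f}, p = {p_value:.6f}")
```

Welch's variant is a reasonable default when you have no reason to believe the two groups have equal variance. If the same subjects appear in both groups, a paired test is the appropriate tool instead.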
Confidence Intervals
Quantify uncertainty in estimates.
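A 95% confidence interval for a mean can be sketched as the sample mean plus or minus a t critical value times the standard error. The example below computes it with `scipy.stats.t.interval` and again by hand; the order values and the 95% level are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Hypothetical order values (dollars) from ten customers.
sample = np.array([23.5, 25.1, 22.8, 24.9, 26.3, 24.0, 23.7, 25.6, 24.4, 25.0])
n = len(sample)
confidence = 0.95

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean (uses ddof=1)

# Interval from the t distribution with n - 1 degrees of freedom.
low, high = stats.t.interval(confidence, n - 1, loc=mean, scale=sem)

# Same interval by hand: mean +/- critical value * standard error.
t_crit = stats.t.ppf((1 + confidence) / 2, n - 1)
margin = t_crit * sem

print(f"mean = {mean:.3f}")
print(f"95% CI = ({low:.3f}, {high:.3f}), margin of error = {margin:.3f}")
```

The interval describes the procedure, not this one sample: if the sampling were repeated many times, about 95% of the intervals built this way would contain the true mean.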
Practical Application
Key Takeaways
✅ Central tendency: Mean, median, mode—choose based on distribution shape
✅ Spread: Standard deviation and IQR measure variability
✅ Distributions: Normal, uniform, binomial, Poisson, exponential; know when to use each
✅ Correlation: Measures linear relationship strength (-1 to +1)
✅ Hypothesis testing: Framework for making data-driven decisions
✅ Confidence intervals: Quantify uncertainty in estimates
✅ Causation: Requires more than just correlation
Connections: Statistics in Data Science
🔗 Connection to Machine Learning
| Statistical Concept | ML Application |
|---|---|
| Probability distributions | Naive Bayes, probabilistic models |
| Hypothesis testing | A/B testing, feature selection |
| Confidence intervals | Model uncertainty |
| Correlation | Feature engineering, multicollinearity |
| Variance | Bias-variance tradeoff |
🔗 Connection to Business Decisions
| Business Question | Statistical Approach |
|---|---|
| Is the new feature better? | A/B test, t-test |
| What's the expected revenue? | Mean + Confidence interval |
| Is this result reliable? | Hypothesis testing |
| Which factors matter most? | Correlation analysis |
Practice Exercise
Next Steps
In the final lesson, you'll apply everything in a complete data analysis project—from loading raw data to presenting insights.
Ready for the capstone? Let's put it all together!
Further Reading
Interactive Visualizations
- Seeing Theory (Brown University) — the most beautiful interactive intro to probability and statistics. Distributions, CLT, regression, Bayesian inference, all live.
- Setosa — Conditional Probability — the clearest single explainer of Bayes' rule.
- Distill — Visual Information Theory — Christopher Olah. Neighbor topic; deepens intuition for entropy.
Free Books
- Think Stats (2e) — Allen Downey. Free, Python-first, exactly the right level for a data scientist.
- Think Bayes (2e) — Downey's Bayesian companion.
- An Introduction to Statistical Learning — James, Witten, Hastie, Tibshirani. Chapter 3 (linear regression) is the bridge from "stats" to "ML."
- OpenIntro Statistics (4e) — undergraduate stats textbook, free PDF.
Video Series
- StatQuest — Statistics Fundamentals (Josh Starmer) — pair every concept in this lesson with the matching StatQuest video. The clearest intuitions on YouTube.
- 3Blue1Brown — Probability — geometric intuition.
Modern Python Stats Stack
- scipy.stats: the workhorse.
- statsmodels: for serious statistical modeling (OLS with diagnostics, time series, GLMs).
- pingouin: friendlier API for everyday hypothesis tests.
- PyMC v5: modern Python probabilistic programming for Bayesian models.
Don't Misuse Stats
- ASA Statement on p-values — the official "p < 0.05 isn't what you think" statement.
- Book: Statistics Done Wrong — Alex Reinhart (free online). Required reading before you ship any stats-driven analysis.