Statistical Foundations for Data Science

Learning Objectives: After this lesson, you'll understand the statistical concepts essential for data science—distributions, central tendency, variability, correlation, and basic hypothesis testing.

Why Statistics for Data Science?

Statistics provides the mathematical foundation for making decisions from data. It helps us distinguish real patterns from random noise. Before we crunch numbers, it helps to see what a distribution is: the overall shape of the data, which central tendency, spread, and the empirical rule each describe in part.

Fig. 02: Graph Plotter, an interactive plotting tool for visualizing data and relationships.

Try it: Toggle each curve's legend entry on and off to isolate it, and hover along a line to read its height at any point. Notice how switching between the two series changes how tall and how wide the bell appears — that contrast is exactly the "spread" idea you'll quantify below.

Fig. 04: Python Code Executor, an interactive Python code execution environment.
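
To put numbers on the tall-versus-wide contrast between two bell curves, here is a minimal sketch using the standard-library `statistics.NormalDist`. The two standard deviations (1 and 2) are illustrative assumptions, not the values the plotter uses.

```python
from statistics import NormalDist

# Two normal distributions with the same center but different spread.
# The sigma values are illustrative assumptions.
narrow = NormalDist(mu=0.0, sigma=1.0)
wide = NormalDist(mu=0.0, sigma=2.0)

for name, dist in [("narrow", narrow), ("wide", wide)]:
    peak = dist.pdf(dist.mean)  # the density is highest at the mean
    print(f"{name}: sigma={dist.stdev}, peak height={peak:.4f}")

# Doubling sigma halves the peak, because the total area must stay at 1.
peak_ratio = narrow.pdf(0.0) / wide.pdf(0.0)
print(f"peak ratio: {peak_ratio:.2f}")
```

A wider curve is a shorter curve: the area under any probability density is 1, so spreading the data out has to lower the peak.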

Measures of Central Tendency

Where is the "center" of your data?

Fig. 06: Python Code Executor, an interactive Python code execution environment.
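
As a rough illustration of how the mean and median answer the "center" question differently, the sketch below compares them on a small made-up salary list, with and without one extreme value. The numbers are hypothetical.

```python
from statistics import mean, median

# Hypothetical salaries in thousands.
salaries = [42, 45, 47, 50, 52, 55, 58]
# The same list with one extreme value appended.
with_outlier = salaries + [400]

for label, data in [("without outlier", salaries), ("with outlier", with_outlier)]:
    print(f"{label}: mean={mean(data):.2f}, median={median(data):.2f}")
```

The single outlier nearly doubles the mean, while the median moves from 50 to 51. That sensitivity is the usual reason to prefer the median for skewed data.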

Mode and Other Measures

Fig. 08: Python Code Executor, an interactive Python code execution environment.
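
The mode is the most frequent value, and unlike the mean or median it also works on categories. A minimal sketch with the standard library, using hypothetical data:

```python
from collections import Counter
from statistics import mode, multimode

# Hypothetical categorical and numeric data.
shirt_sizes = ["M", "L", "M", "S", "M", "L", "XL"]
ratings = [1, 2, 2, 3, 3, 4]

most_common_size = mode(shirt_sizes)   # works on non-numeric data too
tied_modes = multimode(ratings)        # every value tied for most frequent
size_counts = Counter(shirt_sizes).most_common(2)  # top two values with counts

print("most common size:", most_common_size)
print("modes of ratings:", tied_modes)
print("top sizes with counts:", size_counts)
```

`multimode` is the safer call when ties are possible, since `mode` returns only the first of the tied values.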

Measures of Spread (Variability)

How spread out is your data?

Fig. 10: Python Code Executor, an interactive Python code execution environment.
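
One way to quantify spread is to compute the range, standard deviation, and interquartile range (IQR) side by side. The sketch below uses the standard-library `statistics` module on a small hypothetical data set.

```python
from statistics import mean, pstdev, quantiles, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements

value_range = max(data) - min(data)
population_sd = pstdev(data)  # divides by n
sample_sd = stdev(data)       # divides by n - 1 (Bessel's correction)

# quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

print(f"mean={mean(data)}, range={value_range}")
print(f"population sd={population_sd:.3f}, sample sd={sample_sd:.3f}")
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
```

Quartile conventions differ between libraries, so an IQR computed here may not match one from NumPy or pandas exactly on small samples.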

The 68-95-99.7 Rule (Empirical Rule)

Fig. 12: Python Code Executor, an interactive Python code execution environment.
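
The rule's three percentages are the share of a normal distribution that falls within 1, 2, and 3 standard deviations of the mean. A minimal sketch that computes them from the normal CDF and checks them against simulated draws (the seed and sample size are arbitrary choices):

```python
import random
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Theoretical share of a normal distribution within k standard deviations.
theoretical = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)
}

# Simulated check with a seeded generator so the run is repeatable.
rng = random.Random(42)
samples = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
simulated = {
    k: sum(abs(x) <= k for x in samples) / len(samples) for k in (1, 2, 3)
}

for k in (1, 2, 3):
    print(f"within {k} sd: theory={theoretical[k]:.4f}, simulated={simulated[k]:.4f}")
```

The rule applies to normally distributed data only; skewed or heavy-tailed data can deviate from these percentages substantially.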

Common Probability Distributions

A handful of distributions cover most real-world data. Know their shape, when they show up, and the NumPy call that generates them:

Distribution	Shape	Examples	Parameters	NumPy
Normal (Gaussian)	Bell curve, symmetric	Heights, test scores, measurement errors	mean (μ), std dev (σ)	`np.random.normal(mean, std, size)`
Uniform	Flat, equal probability	Random number generators, dice (discrete case)	min (low), max (high)	`np.random.uniform(low, high, size)`
Binomial	Discrete, # of successes	Coin flips, conversion rates	n (trials), p (probability)	`np.random.binomial(n, p, size)`
Poisson	Discrete, count of events	Customers/hour, errors/day	λ (average rate)	`np.random.poisson(lam, size)`
Exponential	Right-skewed continuous	Time between events, lifetime	scale (1/λ)	`np.random.exponential(scale, size)`

Now see them drawn from simulated random samples:

Fig. 14: Python Code Executor, an interactive Python code execution environment.
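
A minimal sketch of sampling from each distribution in the table and comparing the sample mean with the theoretical one. It uses NumPy's `default_rng` generator, whose methods mirror the `np.random.*` calls listed above; all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the run is repeatable
size = 10_000

# (samples, theoretical mean) for each distribution in the table.
# All parameter values are illustrative assumptions.
draws = {
    "normal": (rng.normal(170, 10, size), 170.0),
    "uniform": (rng.uniform(0, 1, size), 0.5),        # mean = (low + high) / 2
    "binomial": (rng.binomial(10, 0.5, size), 5.0),   # mean = n * p
    "poisson": (rng.poisson(3, size), 3.0),           # mean = lam
    "exponential": (rng.exponential(2, size), 2.0),   # mean = scale
}

for name, (samples, expected_mean) in draws.items():
    print(f"{name:<12} sample mean={samples.mean():8.3f}  theoretical mean={expected_mean}")
```

With 10,000 draws each, the sample means should land close to the theoretical values, though never exactly on them.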

Visualizing Distributions

We met the Graph Plotter at the top of this lesson. Flip back to it and re-toggle the two normal curves now that you know μ and σ by name. Below, we generate samples from several distributions and compare their summary statistics.

Fig. 16: Python Code Executor, an interactive Python code execution environment.
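
One way to compare distributions without a plot is to tabulate their summary statistics. The sketch below does this for three simulated samples; the parameters are assumptions, and `skewness` is a small hypothetical helper that computes the mean of the cubed z-scores.

```python
import numpy as np

rng = np.random.default_rng(0)
size = 10_000

# Parameter values are illustrative assumptions.
samples = {
    "normal": rng.normal(50, 10, size),
    "uniform": rng.uniform(0, 100, size),
    "exponential": rng.exponential(20, size),
}


def skewness(x):
    """Sample skewness: the mean of the cubed z-scores."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))


summary = {
    name: {
        "mean": float(x.mean()),
        "median": float(np.median(x)),
        "std": float(x.std()),
        "skew": skewness(x),
    }
    for name, x in samples.items()
}

for name, stats_row in summary.items():
    formatted = ", ".join(f"{key}={value:.2f}" for key, value in stats_row.items())
    print(f"{name:<12} {formatted}")
```

The symmetric distributions show skewness near 0 and a mean close to the median. The right-skewed exponential sample has a clearly positive skewness and a mean pulled above its median.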

Correlation and Relationships

How strongly are two variables related?

Fig. 18: Python Code Executor, an interactive Python code execution environment.
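
The Pearson correlation coefficient is the covariance of two variables divided by the product of their standard deviations. A minimal from-scratch sketch, with a hypothetical `pearson_r` helper and made-up data:

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical study data: a roughly linear relationship.
hours_studied = [1, 2, 3, 4, 5, 6]
exam_scores = [52, 58, 61, 70, 74, 79]
r_linear = pearson_r(hours_studied, exam_scores)

# A perfect but non-linear relationship: y = x squared on a symmetric range.
xs = [-2, -1, 0, 1, 2]
squares = [x ** 2 for x in xs]
r_curved = pearson_r(xs, squares)

print(f"linear relationship: r = {r_linear:.3f}")
print(f"curved relationship: r = {r_curved:.3f}")
```

The second case matters: `squares` is completely determined by `xs`, yet r is 0, because Pearson's r measures only linear association.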

Correlation vs Causation

Fig. 20: Python Code Executor, an interactive Python code execution environment.
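
A common way two variables end up correlated without either causing the other is a shared driver (a confounder). The simulation below is a sketch of that situation: temperature drives both ice cream sales and sunburn cases. Every variable name and coefficient is a made-up assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 500

# Temperature is the confounder: it drives both outcomes independently.
temperature = rng.normal(25, 5, n_days)
ice_cream_sales = 2.0 * temperature + rng.normal(0, 5, n_days)
sunburn_cases = 1.5 * temperature + rng.normal(0, 5, n_days)

r_raw = np.corrcoef(ice_cream_sales, sunburn_cases)[0, 1]


def residuals(y, x):
    """What is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


# Correlate what remains once temperature is accounted for in both variables.
r_adjusted = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(sunburn_cases, temperature),
)[0, 1]

print(f"raw correlation:               {r_raw:.3f}")
print(f"after adjusting for temperature: {r_adjusted:.3f}")
```

The raw correlation is strong, but it largely disappears once temperature is removed from both series. Neither variable influences the other in this simulation.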

Basic Hypothesis Testing

Make decisions about populations based on samples.

Fig. 22: Python Code Executor, an interactive Python code execution environment.
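
The general recipe is: state a null hypothesis, choose a significance level, and compute how likely data at least as extreme as the observed data would be if the null were true (the p-value). A minimal sketch using an exact binomial test on a hypothetical coin-flip experiment:

```python
from math import comb

n_flips = 100
observed_heads = 60  # hypothetical observation
alpha = 0.05         # significance level chosen before looking at the data


# H0: the coin is fair (p = 0.5). H1: the coin is not fair.
def prob_heads(k):
    """Probability of exactly k heads in n_flips tosses of a fair coin."""
    return comb(n_flips, k) / 2 ** n_flips


# Two-sided p-value: probability, under H0, of a head count at least as far
# from the expected 50 as the observed count.
distance = abs(observed_heads - n_flips / 2)
p_value = sum(
    prob_heads(k)
    for k in range(n_flips + 1)
    if abs(k - n_flips / 2) >= distance
)
reject_null = p_value < alpha

print(f"p-value = {p_value:.4f}, reject H0 at alpha={alpha}: {reject_null}")
```

Sixty heads looks lopsided, but the p-value lands just above 0.05, so this test does not reject fairness at that threshold. Failing to reject is not the same as showing the coin is fair.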

One-Sample t-test

Fig. 24: Python Code Executor, an interactive Python code execution environment.
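
A one-sample t-test asks whether a sample mean differs from a claimed population mean. The sketch below runs SciPy's `stats.ttest_1samp` and recomputes the t statistic by hand for comparison. The sample values and the claimed mean are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

from scipy import stats

# Hypothetical measurements and a claimed population mean.
sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.7, 5.4, 5.9, 5.5]
claimed_mean = 5.0

# H0: the population mean equals claimed_mean.
result = stats.ttest_1samp(sample, popmean=claimed_mean)

# The same statistic by hand: (sample mean - claimed mean) / standard error.
standard_error = stdev(sample) / sqrt(len(sample))
t_manual = (mean(sample) - claimed_mean) / standard_error

print(f"t = {result.statistic:.3f} (manual: {t_manual:.3f}), p = {result.pvalue:.4f}")
```

The test assumes the observations are independent and roughly normally distributed, which matters most for small samples like this one.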

Two-Sample t-test

Fig. 26: Python Code Executor, an interactive Python code execution environment.
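
A two-sample t-test compares the means of two independent groups. The sketch below uses SciPy's `stats.ttest_ind` with `equal_var=False`, which is Welch's variant and does not assume equal variances. Choosing Welch is an assumption made here, and the group data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

from scipy import stats

# Hypothetical page load times (seconds) for two independent groups.
group_a = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7]
group_b = [12.9, 13.1, 12.6, 13.4, 12.8, 13.0, 12.7, 13.2]

# H0: both groups share the same population mean.
result = stats.ttest_ind(group_a, group_b, equal_var=False)

# Welch's t by hand: difference in means over the combined standard error.
standard_error = sqrt(
    variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
)
t_manual = (mean(group_a) - mean(group_b)) / standard_error

print(f"t = {result.statistic:.3f} (manual: {t_manual:.3f}), p = {result.pvalue:.6f}")
```

The sign of t reflects argument order only: a negative value here means the first group's mean is lower than the second's.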

Confidence Intervals

Quantify uncertainty in estimates.

Fig. 28: Python Code Executor, an interactive Python code execution environment.
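
For a small sample, a confidence interval for the mean is usually built as the sample mean plus or minus a t critical value times the standard error. A minimal sketch with hypothetical data, using SciPy's `stats.t.ppf` for the critical value:

```python
from math import sqrt
from statistics import mean, stdev

from scipy import stats

# Hypothetical order values.
sample = [23.1, 25.4, 22.8, 24.9, 26.2, 23.7, 25.0, 24.3, 22.5, 25.8]
confidence = 0.95

n = len(sample)
sample_mean = mean(sample)
standard_error = stdev(sample) / sqrt(n)

# Two-sided critical value from the t distribution with n - 1 degrees of freedom.
t_critical = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
margin = t_critical * standard_error
ci_low, ci_high = sample_mean - margin, sample_mean + margin

print(f"mean = {sample_mean:.2f}")
print(f"{confidence:.0%} CI = ({ci_low:.2f}, {ci_high:.2f}), t* = {t_critical:.3f}")
```

A 95% interval means that the procedure, repeated on many samples, would capture the true mean about 95% of the time. It is not a 95% probability statement about this one interval.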

Practical Application

Fig. 30: Python Code Executor, an interactive Python code execution environment.
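
As one possible practical application, the sketch below analyzes a hypothetical A/B test of conversion rates with a two-proportion z-test. The visitor and conversion counts are invented, and the z-test is a choice made for this sketch (reasonable for large samples), not necessarily the method this lesson's own example uses.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical A/B test results.
visitors_a, conversions_a = 5000, 200   # control
visitors_b, conversions_b = 5000, 250   # new feature

rate_a = conversions_a / visitors_a
rate_b = conversions_b / visitors_b

# Under H0 (no difference), both groups share one pooled conversion rate.
pooled_rate = (conversions_a + conversions_b) / (visitors_a + visitors_b)
standard_error = sqrt(
    pooled_rate * (1 - pooled_rate) * (1 / visitors_a + 1 / visitors_b)
)

z = (rate_b - rate_a) / standard_error
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided

print(f"conversion A = {rate_a:.1%}, conversion B = {rate_b:.1%}")
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

With these made-up counts the difference is significant at the 0.05 level. Whether a one-point lift justifies shipping the feature is a business question the p-value does not answer.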

Key Takeaways

✅ Central tendency: Mean, median, mode—choose based on distribution shape

✅ Spread: Standard deviation and IQR measure variability

✅ Distributions: Normal, uniform, binomial, Poisson, and exponential; know when each one applies

✅ Correlation: Measures linear relationship strength (-1 to +1)

✅ Hypothesis testing: Framework for making data-driven decisions

✅ Confidence intervals: Quantify uncertainty in estimates

✅ Causation: Requires more than just correlation

Connections: Statistics in Data Science

🔗 Connection to Machine Learning

Statistical Concept	ML Application
Probability distributions	Naive Bayes, probabilistic models
Hypothesis testing	A/B testing, feature selection
Confidence intervals	Model uncertainty
Correlation	Feature engineering, multicollinearity
Variance	Bias-variance tradeoff

🔗 Connection to Business Decisions

Business Question	Statistical Approach
Is the new feature better?	A/B test, t-test
What's the expected revenue?	Mean + Confidence interval
Is this result reliable?	Hypothesis testing
Which factors matter most?	Correlation analysis

Practice Exercise

Fig. 32: Python Code Executor, an interactive Python code execution environment.

Next Steps

In the final lesson, you'll apply everything in a complete data analysis project—from loading raw data to presenting insights.

Ready for the capstone? Let's put it all together!
