Mathematical Foundations of Machine Learning

Introduction: The Language of Learning

Imagine you're teaching a child to recognize different types of fruit. You don't just say "this is an apple" – you help them understand patterns: apples are round, usually red or green, have smooth skin, and weigh a certain amount. The child learns by building a mental model from these numerical features and mathematical relationships.

Machine learning works the same way, but it speaks the language of linear algebra, probability, and calculus. In this lesson, we'll build the mathematical toolkit you need to understand and implement ML algorithms. Don't worry if you're rusty – we'll approach each concept visually and intuitively first.

Prerequisites

This lesson assumes you're comfortable with:

Basic Python programming
High school algebra (equations, functions)
Nice to have: Some exposure to matrices and derivatives (we'll review!)

See the Math Update Its Mind

Before the formulas, get your hands on the one idea that ties this whole lesson together: distributions aren't static — they update as evidence arrives. The Probability Field instrument lets you watch a belief shift in real time.

Fig. 02 (interactive): Probability Field. Conjugate-prior Bayesian updating: Beta–Bernoulli, Normal–Normal, and Monty Hall.

Try it: drag the prior sliders to reshape the starting belief, then click to add observations one at a time and watch the posterior curve tighten and shift toward the data. Switch modes (Beta–Bernoulli, Normal–Normal, Monty Hall) to see the same Bayesian update in different settings. We unpack exactly what's happening in Section 2.

1. Linear Algebra: The Geometry of Data

Why Linear Algebra?

In machine learning, we work with high-dimensional data. A house might have 10 features (size, bedrooms, age, location, etc.). An image might have millions of pixels. Linear algebra gives us a clean, efficient way to represent and manipulate this data.

Key Insight: Machine learning is fundamentally about finding relationships between inputs and outputs. Linear algebra lets us express these relationships compactly.

TIP

💡 Build visual intuition first: before the formulas feel natural, watch 3Blue1Brown's Essence of Linear Algebra, 15 short videos that turn vectors, matrices, and eigenvalues into geometric objects you can see moving. Even the first few episodes make the notation in this section much easier to follow.

Vectors: Representing Data Points

A vector is an ordered list of numbers. In ML, each data point is a vector.

# A house represented as a vector
house = [1200,  # square feet
         3,      # bedrooms
         2,      # bathrooms
         2010,   # year built
         350000] # price

# In NumPy
import numpy as np
house_vector = np.array([1200, 3, 2, 2010, 350000])

Geometric Interpretation: In 2D or 3D, we can visualize vectors as arrows from the origin:

Fig. 04 (interactive): Python code execution environment.
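As a small numeric companion to the arrow picture, the sketch below treats a 2D vector as an arrow from the origin and computes its length and direction. The vector `v` and its values are illustrative assumptions, not part of the house dataset.

```python
import numpy as np

# Hypothetical 2D vector: an arrow from the origin (0, 0) to the point (3, 4)
v = np.array([3.0, 4.0])

length = np.linalg.norm(v)                       # arrow length: sqrt(3^2 + 4^2)
angle_deg = np.degrees(np.arctan2(v[1], v[0]))   # angle measured from the x-axis
unit = v / length                                # same direction, length 1

print(f"length = {length}")             # length = 5.0
print(f"angle  = {angle_deg:.2f} deg")  # about 53.13 deg
print(f"unit   = {unit}")               # [0.6 0.8]
```

The unit vector keeps only the direction of the arrow, which is handy whenever the scale of a feature vector should not matter.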

Matrices: Transforming Data

A matrix is a 2D array of numbers. In ML, matrices represent:

Datasets (rows = samples, columns = features)
Transformations (linear mappings)
Model parameters (weights)

# A dataset of 3 houses
dataset = np.array([
    [1200, 3, 2, 2010, 350000],  # House 1
    [1500, 4, 3, 2015, 425000],  # House 2
    [900,  2, 1, 2005, 280000]   # House 3
])

print(f"Shape: {dataset.shape}")  # (3, 5) = 3 samples, 5 columns (4 features + price)

Key Operations

1. Dot Product (Inner Product)

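The dot product measures how strongly two vectors point in the same direction, scaled by their lengths, which is why it is often used as a similarity score:

\[ \mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1b_1 + a_2b_2 + \cdots + a_nb_n \]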

Fig. 06 (interactive): Python code execution environment.

ML Application: The dot product is the foundation of linear models! A prediction is just the dot product of features and weights.
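Here is a minimal sketch of the dot product computed two ways, plus a toy linear prediction. The vectors, feature values, and weights are made up for illustration, and the prediction leaves out a bias term for simplicity.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Definition: sum of element-wise products
dot_manual = sum(a_i * b_i for a_i, b_i in zip(a, b))
dot_numpy = np.dot(a, b)  # same value, computed by NumPy

# Cosine similarity divides out the vector lengths
cosine = dot_numpy / (np.linalg.norm(a) * np.linalg.norm(b))

# A linear model's prediction is a dot product of features and weights.
# These feature and weight values are hypothetical.
features = np.array([1200.0, 3.0, 2.0])       # sqft, bedrooms, bathrooms
weights = np.array([200.0, 10000.0, 5000.0])
prediction = np.dot(features, weights)

print(dot_manual, dot_numpy)   # 32.0 32.0
print(round(cosine, 4))        # 0.9746
print(prediction)              # 280000.0
```

Because the raw dot product grows with vector length, cosine similarity is the usual choice when only direction should count.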

2. Matrix Multiplication

Matrix multiplication combines transformations:

\[ \mathbf{C} = \mathbf{A} \mathbf{B} \quad \text{where} \quad C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj} \]

Fig. 08 (interactive): Python code execution environment.
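The sketch below checks one entry of a product against the summation formula and shows what "combining transformations" means in practice. The matrices `A` and `B` and the vector `x` are arbitrary small examples.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])        # shape (3, 2)
B = np.array([[1, 0, 2],
              [0, 1, 3]])     # shape (2, 3)

C = A @ B                     # shape (3, 3); the inner dimensions (2 and 2) must match

# Check one entry against the definition C_ij = sum_k A_ik * B_kj
i, j = 2, 2
c_ij = sum(A[i, k] * B[k, j] for k in range(A.shape[1]))

# Composition: applying B and then A to a vector equals applying A @ B once
x = np.array([1, 1, 1])
same = np.array_equal(A @ (B @ x), C @ x)

print(C)
print(c_ij == C[i, j], same)   # True True
```

Note that the order matters: here `B @ A` has shape (2, 2), so it is a different matrix from `A @ B`.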

3. Transpose

Flipping rows and columns:

\[ (\mathbf{A}^T)_{ij} = A_{ji} \]

A = np.array([[1, 2, 3],
              [4, 5, 6]])
              
print("Original shape:", A.shape)  # (2, 3)
print("Transposed shape:", A.T.shape)  # (3, 2)

ML Application: Computing gradients, normal equations, covariance matrices.
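As one example of the transpose at work, this sketch builds a sample covariance matrix from a centered data matrix and compares it with NumPy's built-in. The toy data values are assumptions chosen for illustration.

```python
import numpy as np

# Toy data: 4 samples (rows), 2 features (columns); values are illustrative
X = np.array([[1.0, 2.0],
              [2.0, 4.1],
              [3.0, 5.9],
              [4.0, 8.2]])

n = X.shape[0]
X_centered = X - X.mean(axis=0)          # subtract each column's mean

# Sample covariance matrix: (X_c^T X_c) / (n - 1), shape (2, 2)
cov_manual = X_centered.T @ X_centered / (n - 1)

# np.cov treats rows as variables by default, so pass rowvar=False
cov_numpy = np.cov(X, rowvar=False)

print(np.allclose(cov_manual, cov_numpy))   # True
```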

2. Probability Theory: Modeling Uncertainty

Why Probability?

Real-world data is noisy and uncertain. Instead of saying "this house costs exactly $350,000", we say "there's an 80% chance it's between $340K and $360K". Probability lets us:

Model noise in data
Quantify prediction confidence
Derive optimal learning algorithms

Random Variables

A random variable \(X\) is a variable whose value is determined by chance.

Example: The outcome of rolling a die is a random variable \(X \in \{1, 2, 3, 4, 5, 6\}\).

Probability Distributions

A probability distribution describes the likelihood of different outcomes.

Fig. 10 (interactive): Python code execution environment.
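To make "likelihood of different outcomes" concrete, the sketch below simulates the die from the example above and estimates its distribution from the rolls. The number of rolls and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)      # seeded so the run is repeatable

# Simulate 60,000 rolls of a fair six-sided die
rolls = rng.integers(low=1, high=7, size=60_000)   # high is exclusive

# Empirical distribution: fraction of rolls landing on each face
faces, counts = np.unique(rolls, return_counts=True)
empirical = counts / rolls.size

for face, p in zip(faces, empirical):
    print(f"P(X = {face}) ~ {p:.3f}   (theory: {1/6:.3f})")
```

With more rolls the empirical fractions settle closer to the theoretical value of 1/6 for every face.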

Key Distributions in ML

Distribution	Use Case	Example
Normal (Gaussian)	Continuous variables, noise modeling	House prices, measurement errors
Bernoulli	Binary outcomes	Classification (yes/no)
Categorical	Multiple classes	Image classification (cat/dog/bird)
Poisson	Count data	Number of website visits per hour
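Each row of the table can be sampled with NumPy's random generator, as in the sketch below. All parameter values (means, probabilities, rates) are illustrative assumptions rather than values fitted to real data.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = 10_000

# Parameter values below are illustrative assumptions, not fitted to data
normal = rng.normal(loc=350_000, scale=50_000, size=n)      # house prices
bernoulli = rng.binomial(n=1, p=0.3, size=n)                # yes/no outcome
categorical = rng.choice(["cat", "dog", "bird"], size=n, p=[0.5, 0.3, 0.2])
poisson = rng.poisson(lam=4.0, size=n)                      # visits per hour

print(f"Normal mean             ~ {normal.mean():,.0f}")
print(f"Bernoulli P(1)          ~ {bernoulli.mean():.3f}")
print(f"Categorical 'cat' share ~ {(categorical == 'cat').mean():.3f}")
print(f"Poisson mean            ~ {poisson.mean():.2f}")
```

A Bernoulli draw is written here as a binomial with a single trial, which is the same distribution.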

Belief Updating, by Hand

Distributions in ML aren't static — they update as evidence comes in. Open the Probability Field instrument and play with the three modes:

Beta–Bernoulli: drag the α and β sliders to shape a prior over the rate of an unknown coin. Then click add success / add failure and watch the posterior shift. After k successes in n flips, the closed-form rule gives the posterior Beta(α + k, β + n − k), which is literally what's happening under your cursor.
Normal–Normal: set a prior N(μ₀, σ₀²) for an unknown mean, then click observe to draw samples from a hidden true distribution. Posterior precision (1/σ_n²) adds linearly — see it tighten with each sample.
Monty Hall: the canonical "your intuition is wrong" Bayes problem, with the 3×3 likelihood table visible.
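The first two modes can be sketched in a few lines of plain Python. The function names are hypothetical, and the Normal–Normal version assumes the observation noise variance is known, which the lesson does not state explicitly.

```python
def beta_bernoulli_update(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing Bernoulli outcomes."""
    return alpha + successes, beta + failures


def normal_normal_update(mu0, var0, observations, noise_var):
    """Posterior mean and variance for an unknown mean (noise variance assumed known)."""
    n = len(observations)
    precision = 1.0 / var0 + n / noise_var           # precisions add
    mean = (mu0 / var0 + sum(observations) / noise_var) / precision
    return mean, 1.0 / precision


# Beta(2, 2) prior, then 7 successes in 10 flips (k = 7, n = 10)
a_post, b_post = beta_bernoulli_update(2, 2, successes=7, failures=3)
print(a_post, b_post)                    # 9 5
print(a_post / (a_post + b_post))        # posterior mean, about 0.643

# N(0, 4) prior on the mean, unit observation noise, three observations
mean, var = normal_normal_update(0.0, 4.0, [1.0, 2.0, 3.0], noise_var=1.0)
print(mean, var)                         # about 1.846 and 0.308
```

In both cases the posterior variance shrinks as observations accumulate, which is the tightening you see in the instrument.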

Scroll back up to the Probability Field instrument at the top of the lesson and revisit it now that you have the closed-form rules in hand.

The same Bayes-rule update underlies probabilistic ML methods such as Naïve Bayes, Gaussian Processes, and Bayesian Neural Networks. When no conjugate prior is available and the posterior has no closed form, it is approximated numerically, for example with MCMC or variational inference.
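Bayes' rule on a discrete set of hypotheses is easiest to see in the Monty Hall problem mentioned above. The sketch below works through one case (you pick door 1, the host opens door 3) under the standard assumptions that the host never reveals the car and picks at random when two doors are available.

```python
from fractions import Fraction

# You pick door 1. Hypotheses: the car is behind door 1, 2, or 3.
prior = [Fraction(1, 3)] * 3

# Likelihood that the host opens door 3 under each hypothesis
likelihood_open_3 = [Fraction(1, 2),  # car behind 1: host picks door 2 or 3
                     Fraction(1),     # car behind 2: host must open door 3
                     Fraction(0)]     # car behind 3: host cannot open it

# Bayes' rule: posterior is proportional to prior * likelihood
unnormalized = [p * l for p, l in zip(prior, likelihood_open_3)]
evidence = sum(unnormalized)
posterior = [u / evidence for u in unnormalized]

print(posterior)   # [Fraction(1, 3), Fraction(2, 3), Fraction(0, 1)]
```

Switching to door 2 wins with probability 2/3, while staying with door 1 wins with probability 1/3.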

Expected Value and Variance

Expected value \(E[X]\): The "average" or "center of mass" of a distribution

\[ E[X] = \sum_{x} x \cdot P(X = x) \quad \text{(discrete)} \quad \text{or} \quad E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx \quad \text{(continuous)} \]

Variance \(\text{Var}(X)\): How spread out the values are

\[ \text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2 \]

Fig. 12 (interactive): Python code execution environment.
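For a fair six-sided die, both quantities can be computed directly from the discrete formulas. The sketch below also confirms that the two variance expressions agree.

```python
import numpy as np

values = np.arange(1, 7)          # faces of a fair die
probs = np.full(6, 1 / 6)         # each face has probability 1/6

expected = np.sum(values * probs)                      # E[X]
var_def = np.sum((values - expected) ** 2 * probs)     # E[(X - E[X])^2]
var_alt = np.sum(values ** 2 * probs) - expected ** 2  # E[X^2] - (E[X])^2

print(round(expected, 4))                     # 3.5
print(round(var_def, 4), round(var_alt, 4))   # 2.9167 2.9167
```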

3. Calculus: Finding Optimal Solutions

Why Calculus?

Machine learning algorithms learn by finding the best parameters (weights) that minimize error. Calculus, specifically derivatives and gradients, tells us which direction to adjust parameters to reduce error.

Analogy: Imagine you're hiking in dense fog and want to reach the valley (lowest point). You can't see far, but you can feel the slope under your feet. You always walk in the direction that descends most steeply – that's gradient descent!

Derivatives: Rate of Change

The derivative measures how a function changes as its input changes:

\[ f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} \]

Fig. 14 (interactive): Python code execution environment.

Key Insight: When \(f'(x) = 0\), we're at a critical point: possibly a minimum or a maximum, but it can also be a flat inflection point, as with \(f(x) = x^3\) at \(x = 0\).
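The limit definition suggests a simple numerical check: pick a small step and compute the slope. This sketch uses a central difference (a symmetric variant of the formula above, usually more accurate) on an illustrative quadratic that is not part of the lesson.

```python
def f(x):
    return x ** 2 - 4 * x + 1      # illustrative function, minimum at x = 2


def numerical_derivative(func, x, h=1e-6):
    """Central difference approximation of func'(x)."""
    return (func(x + h) - func(x - h)) / (2 * h)


for x in [0.0, 2.0, 3.0]:
    approx = numerical_derivative(f, x)
    exact = 2 * x - 4              # derivative worked out by hand
    print(f"x = {x}: approx = {approx:.4f}, exact = {exact:.4f}")
```

At x = 2 the derivative is zero up to rounding error, which marks the critical point of this function.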

Partial Derivatives and Gradients

For functions of multiple variables \(f(x_1, x_2, \ldots, x_n)\), we use partial derivatives:

\[ \frac{\partial f}{\partial x_i} = \text{rate of change of } f \text{ with respect to } x_i \text{ (holding others constant)} \]

The gradient is the vector of all partial derivatives:

\[ \nabla f = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right] \]

Geometric Interpretation: The gradient points in the direction of steepest ascent. To minimize \(f\), we move in the direction opposite to the gradient.

Fig. 16 (interactive): Python code execution environment.
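Moving against the gradient can be sketched on a simple two-variable bowl. The function, starting point, learning rate, and step count below are all arbitrary choices made for illustration.

```python
import numpy as np


def f(p):
    x, y = p
    return (x - 1) ** 2 + 2 * (y + 2) ** 2       # illustrative bowl, minimum at (1, -2)


def grad_f(p):
    x, y = p
    return np.array([2 * (x - 1), 4 * (y + 2)])  # partial derivatives w.r.t. x and y


p = np.array([4.0, 3.0])      # starting point (arbitrary)
learning_rate = 0.1           # step size (arbitrary choice)

for step in range(100):
    p = p - learning_rate * grad_f(p)   # move against the gradient

print(np.round(p, 4))         # close to [ 1. -2.]
```

A learning rate that is too large would overshoot and diverge on this function, so the step size is a real design choice rather than a detail.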

4. The Learning Problem: Putting It All Together

The Mathematical Framework

Machine learning can be formulated as an optimization problem:

Given:

Training data \(\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}\)
A model \(f(\mathbf{x}; \mathbf{w})\) parameterized by weights \(\mathbf{w}\)
A loss function \(\mathcal{L}(y, \hat{y})\) that measures prediction error

Find: \[ \mathbf{w}^* = \arg\min_{\mathbf{w}} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(y_i, f(\mathbf{x}_i; \mathbf{w})) \]

This is the empirical risk minimization principle.
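The objective inside the arg min is just an average of per-sample losses, which the sketch below computes for any model and loss passed in. The helper names, the linear model, and the toy data are hypothetical choices for illustration.

```python
import numpy as np


def empirical_risk(w, X, y, model, loss):
    """Average loss of model(x, w) over the training set."""
    predictions = np.array([model(x, w) for x in X])
    return np.mean(loss(y, predictions))


def linear_model(x, w):
    return np.dot(w, x)                  # hypothetical model choice


def squared_error(y_true, y_pred):
    return (y_true - y_pred) ** 2        # hypothetical loss choice


X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # first column is a constant 1
y = np.array([3.0, 5.0, 7.0])                         # generated by y = 1 + 2x

print(empirical_risk(np.array([1.0, 2.0]), X, y, linear_model, squared_error))  # 0.0
print(empirical_risk(np.array([0.0, 2.0]), X, y, linear_model, squared_error))  # 1.0
```

Learning then amounts to searching over `w` for the value that makes this number as small as possible.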

Example: Linear Regression

Model: \(f(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T \mathbf{x} = w_0 + w_1 x_1 + w_2 x_2 + \cdots\), where \(\mathbf{x}\) is extended with a constant entry \(x_0 = 1\) so that \(w_0\) acts as the intercept.

Loss: Mean Squared Error (MSE) \[ \mathcal{L}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 \]

Solution: Use calculus to find \(\mathbf{w}^*\) where \(\nabla_{\mathbf{w}} \mathcal{L} = \mathbf{0}\)
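The zero-gradient condition can be worked out in matrix form. The symbols \(\mathbf{X}\) (the matrix whose rows are the \(\mathbf{x}_i^T\)) and \(\mathbf{y}\) (the vector of targets) are introduced here for the derivation and are not used elsewhere in the lesson; the final step assumes \(\mathbf{X}^T \mathbf{X}\) is invertible.

```latex
\begin{aligned}
\mathcal{L}(\mathbf{w}) &= \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2
  = \frac{1}{n} \, \| \mathbf{y} - \mathbf{X} \mathbf{w} \|^2 \\
\nabla_{\mathbf{w}} \mathcal{L} &= -\frac{2}{n} \, \mathbf{X}^T (\mathbf{y} - \mathbf{X} \mathbf{w}) \\
\nabla_{\mathbf{w}} \mathcal{L} = \mathbf{0}
  &\;\Longrightarrow\; \mathbf{X}^T \mathbf{X} \, \mathbf{w}^* = \mathbf{X}^T \mathbf{y}
  \;\Longrightarrow\; \mathbf{w}^* = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
\end{aligned}
```

The middle equation \(\mathbf{X}^T \mathbf{X} \, \mathbf{w}^* = \mathbf{X}^T \mathbf{y}\) is known as the normal equations.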

Fig. 18 (interactive): Python code execution environment.
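To see the pieces working together, this sketch fits a line to synthetic data in two ways: by solving the normal equations and by gradient descent on the MSE. The data-generating line, noise level, learning rate, and iteration count are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic data (an assumption for illustration): y = 4 + 3x + noise
x = rng.uniform(0, 10, size=200)
y = 4.0 + 3.0 * x + rng.normal(0, 1.0, size=200)

X = np.column_stack([np.ones_like(x), x])    # column of ones so w[0] is the intercept

# Closed form: solve the normal equations X^T X w = X^T y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the MSE loss
w = np.zeros(2)
learning_rate = 0.01
for _ in range(5000):
    gradient = -(2 / len(y)) * X.T @ (y - X @ w)
    w = w - learning_rate * gradient

print("closed form:     ", np.round(w_closed, 3))
print("gradient descent:", np.round(w, 3))
```

Both approaches should land near the intercept 4 and slope 3 used to generate the data. The closed form is convenient for small problems, while gradient descent scales to models where no closed form exists.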

Key Takeaways

✓ Linear Algebra: Represents data (vectors, matrices) and transformations efficiently

Dot products measure similarity
Matrix multiplication applies transformations
Transpose and inverse are fundamental operations

✓ Probability: Models uncertainty and noise in data

Distributions describe data patterns
Expected value and variance quantify center and spread
Foundation for probabilistic models

✓ Calculus: Finds optimal model parameters

Derivatives measure rate of change
Gradients point in the direction of steepest ascent
Setting gradients to zero finds critical points

✓ The Learning Problem: Minimize error (loss) over training data

Choose model architecture
Define loss function
Optimize parameters using calculus

Practice Problems

Problem 1: Vector Operations

Given vectors \(\mathbf{a} = [2, 3, -1]\) and \(\mathbf{b} = [1, -2, 4]\), compute:

\(\mathbf{a} + \mathbf{b}\)
\(\mathbf{a} \cdot \mathbf{b}\)
\(\|\mathbf{a}\|\) (magnitude/norm)

Fig. 20 (interactive): Python code execution environment.
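One way to check your answers, sketched with NumPy (try the problem by hand first):

```python
import numpy as np

a = np.array([2, 3, -1])
b = np.array([1, -2, 4])

print(a + b)               # [3 1 3]
print(np.dot(a, b))        # -8
print(np.linalg.norm(a))   # sqrt(14), about 3.742
```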

Problem 2: Probability

A dataset of house prices follows a normal distribution with mean $350,000 and standard deviation $50,000. What's the probability a randomly selected house costs between $300,000 and $400,000?

Fig. 22 (interactive): Python code execution environment.
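A possible check using only the standard library: the interval is one standard deviation either side of the mean, and the normal CDF can be written with the error function. The helper `normal_cdf` is a name introduced here for illustration.

```python
from math import erf, sqrt


def normal_cdf(x, mean, std):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1 + erf((x - mean) / (std * sqrt(2))))


mean, std = 350_000, 50_000
prob = normal_cdf(400_000, mean, std) - normal_cdf(300_000, mean, std)
print(f"{prob:.4f}")   # 0.6827
```

This is the familiar "68% within one standard deviation" rule for normal distributions.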

Problem 3: Gradient Descent

Implement one step of gradient descent for \(f(x) = x^2 - 6x + 9\) starting at \(x = 5\) with learning rate \(\alpha = 0.1\).

Fig. 24 (interactive): Python code execution environment.
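A minimal sketch of the single update step, using the hand-derived derivative \(f'(x) = 2x - 6\):

```python
def f_prime(x):
    return 2 * x - 6        # derivative of f(x) = x^2 - 6x + 9


x = 5.0
alpha = 0.1
x_new = x - alpha * f_prime(x)   # 5 - 0.1 * 4

print(x_new)   # 4.6
```

Since \(f(x) = (x - 3)^2\), the minimum is at \(x = 3\), and the step from 5 to 4.6 moves toward it.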

Next Steps

You now have the mathematical foundation for machine learning! In the next lesson, we'll formalize the supervised learning framework, exploring:

Training vs testing data
Loss functions in detail
The bias-variance tradeoff
Model capacity and generalization

These mathematical tools will appear in every ML algorithm we study. Keep this lesson as a reference – you'll return to it often!
