Mathematical Foundations of Machine Learning

Introduction: The Language of Learning

Imagine you're teaching a child to recognize different types of fruit. You don't just say "this is an apple" – you help them understand patterns: apples are round, usually red or green, have smooth skin, and weigh a certain amount. The child learns by building a mental model from these features and the relationships between them.

Machine learning works the same way, but it speaks the language of linear algebra, probability, and calculus. In this lesson, we'll build the mathematical toolkit you need to understand and implement ML algorithms. Don't worry if you're rusty – we'll approach each concept visually and intuitively first.

Prerequisites

This lesson assumes you're comfortable with:

  • Basic Python programming
  • High school algebra (equations, functions)
  • Nice to have: Some exposure to matrices and derivatives (we'll review!)

1. Linear Algebra: The Geometry of Data

Why Linear Algebra?

In machine learning, we work with high-dimensional data. A house might have 10 features (size, bedrooms, age, location, etc.). An image might have millions of pixels. Linear algebra gives us a clean, efficient way to represent and manipulate this data.

Key Insight: Machine learning is fundamentally about finding relationships between inputs and outputs. Linear algebra lets us express these relationships compactly.

Vectors: Representing Data Points

A vector is an ordered list of numbers. In ML, each data point is a vector.

```python
# A house represented as a vector
house = [1200,    # square feet
         3,       # bedrooms
         2,       # bathrooms
         2010,    # year built
         350000]  # price

# In NumPy
import numpy as np
house_vector = np.array([1200, 3, 2, 2010, 350000])
```

Geometric Interpretation: In 2D or 3D, we can visualize vectors as arrows from the origin:

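Here is a minimal matplotlib sketch (the two vectors are illustrative) that draws them as arrows from the origin:

```python
import numpy as np
import matplotlib.pyplot as plt

# Two illustrative 2D vectors
a = np.array([3, 1])
b = np.array([1, 2])

fig, ax = plt.subplots(figsize=(5, 5))
# Draw each vector as an arrow starting at the origin
ax.quiver(0, 0, a[0], a[1], angles='xy', scale_units='xy', scale=1,
          color='steelblue', label='a = [3, 1]')
ax.quiver(0, 0, b[0], b[1], angles='xy', scale_units='xy', scale=1,
          color='coral', label='b = [1, 2]')
ax.set_xlim(-1, 4)
ax.set_ylim(-1, 4)
ax.axhline(0, color='gray', linewidth=0.5)
ax.axvline(0, color='gray', linewidth=0.5)
ax.set_aspect('equal')
ax.set_title('Vectors as arrows from the origin')
ax.legend()
plt.show()
```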

Matrices: Transforming Data

A matrix is a 2D array of numbers. In ML, matrices represent:

  • Datasets (rows = samples, columns = features)
  • Transformations (linear mappings)
  • Model parameters (weights)
```python
# A dataset of 3 houses
dataset = np.array([
    [1200, 3, 2, 2010, 350000],  # House 1
    [1500, 4, 3, 2015, 425000],  # House 2
    [900,  2, 1, 2005, 280000],  # House 3
])
print(f"Shape: {dataset.shape}")  # (3, 5) = 3 samples, 5 features
```

Key Operations

1. Dot Product (Inner Product)

The dot product measures similarity between vectors:

[ \mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1b_1 + a_2b_2 + \cdots + a_nb_n ]

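A short NumPy sketch (values are illustrative) that computes the dot product element by element and with `np.dot`, then uses it the way a linear model does – features dotted with weights:

```python
import numpy as np

a = np.array([2.0, 3.0, -1.0])
b = np.array([1.0, -2.0, 4.0])

# Element-wise products summed up: a1*b1 + a2*b2 + a3*b3
manual = np.sum(a * b)

# NumPy's built-in dot product (equivalent: a @ b)
builtin = np.dot(a, b)

print(manual, builtin)  # -8.0 -8.0

# ML flavor: a linear model's prediction is a dot product of features and weights
features = np.array([1200, 3, 2])              # e.g. sqft, bedrooms, bathrooms (illustrative)
weights = np.array([150.0, 10000.0, 5000.0])   # illustrative weights
print(features @ weights)                      # 220000.0
```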

ML Application: The dot product is the foundation of linear models! A prediction is just the dot product of features and weights.

2. Matrix Multiplication

Matrix multiplication combines transformations:

[ \mathbf{C} = \mathbf{A} \mathbf{B} \quad \text{where} \quad C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj} ]

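A small NumPy sketch (illustrative matrices) showing matrix multiplication with the `@` operator, plus the common ML pattern of multiplying a data matrix by a weight vector to get predictions for every sample at once:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# C[i, j] = sum over k of A[i, k] * B[k, j]
C = A @ B
print(C)
# [[19 22]
#  [43 50]]

# ML pattern: X (n_samples x n_features) times w (n_features,) -> n_samples predictions
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -1.0])
print(X @ w)  # [-1.5 -2.5 -3.5]
```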

3. Transpose

Flipping rows and columns:

[ (\mathbf{A}^T)_{ij} = \mathbf{A}_{ji} ]

```python
A = np.array([[1, 2, 3],
              [4, 5, 6]])
print("Original shape:", A.shape)      # (2, 3)
print("Transposed shape:", A.T.shape)  # (3, 2)
```

ML Application: Computing gradients, normal equations, covariance matrices.


2. Probability Theory: Modeling Uncertainty

Why Probability?

Real-world data is noisy and uncertain. Instead of saying "this house costs exactly $350,000", we say "there's an 80% chance it's between $340K and $360K". Probability lets us:

  • Model noise in data
  • Quantify prediction confidence
  • Derive optimal learning algorithms

Random Variables

A random variable (X) is a variable whose value is determined by chance.

Example: The outcome of rolling a die is a random variable (X \in \{1, 2, 3, 4, 5, 6\}).

Probability Distributions

A probability distribution describes the likelihood of different outcomes.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Normal (Gaussian) distribution
x = np.linspace(-4, 4, 200)
normal_pdf = stats.norm.pdf(x, loc=0, scale=1)

# Binomial distribution
x_binom = np.arange(0, 11)
binomial_pmf = stats.binom.pmf(x_binom, n=10, p=0.5)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Normal distribution
ax1.plot(x, normal_pdf, 'b-', linewidth=2, label='μ=0, σ²=1')
ax1.fill_between(x, normal_pdf, alpha=0.3)
ax1.set_title('Normal Distribution (Continuous)', fontsize=14, fontweight='bold')
ax1.set_xlabel('x')
ax1.set_ylabel('Probability Density')
ax1.grid(True, alpha=0.3)
ax1.legend()

# Binomial distribution
ax2.bar(x_binom, binomial_pmf, color='coral', edgecolor='darkred', linewidth=1.5)
ax2.set_title('Binomial Distribution (Discrete)', fontsize=14, fontweight='bold')
ax2.set_xlabel('Number of Successes')
ax2.set_ylabel('Probability')
ax2.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("Normal distribution: models continuous outcomes (e.g., house prices)")
print("Binomial distribution: models discrete outcomes (e.g., coin flips)")
```

Key Distributions in ML

| Distribution | Use Case | Example |
|---|---|---|
| Normal (Gaussian) | Continuous variables, noise modeling | House prices, measurement errors |
| Bernoulli | Binary outcomes | Classification (yes/no) |
| Categorical | Multiple classes | Image classification (cat/dog/bird) |
| Poisson | Count data | Number of website visits per hour |

Expected Value and Variance

Expected value (E[X]): The "average" or "center of mass" of a distribution

[ E[X] = \sum_{x} x \cdot P(X = x) \quad \text{(discrete)} \quad \text{or} \quad E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx \quad \text{(continuous)} ]

Variance (\text{Var}(X)): How spread out the values are

[ \text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2 ]

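A quick sketch (the fair-die example is illustrative) computing the expected value and variance directly from these formulas, then checking against simulated rolls:

```python
import numpy as np

# A fair six-sided die: outcomes and their probabilities
outcomes = np.array([1, 2, 3, 4, 5, 6])
probs = np.full(6, 1/6)

# Expected value: E[X] = sum of x * P(X = x)
expected = np.sum(outcomes * probs)

# Variance: Var(X) = E[X^2] - (E[X])^2
variance = np.sum(outcomes**2 * probs) - expected**2

print(f"E[X]   = {expected:.4f}")   # 3.5000
print(f"Var(X) = {variance:.4f}")   # 2.9167

# Empirical check: the mean and variance of many simulated rolls
rolls = np.random.default_rng(0).integers(1, 7, size=100_000)
print(rolls.mean(), rolls.var())    # close to 3.5 and 2.9167
```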


3. Calculus: Finding Optimal Solutions

Why Calculus?

Machine learning algorithms learn by finding the best parameters (weights) that minimize error. Calculus, specifically derivatives and gradients, tells us which direction to adjust parameters to reduce error.

Analogy: Imagine you're hiking in dense fog and want to reach the valley (lowest point). You can't see far, but you can feel the slope under your feet. You repeatedly step in the steepest downhill direction – that's gradient descent!

Derivatives: Rate of Change

The derivative measures how a function changes as its input changes:

[ f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} ]

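A short sketch (using an illustrative quadratic) that approximates the derivative numerically with a central difference and compares it to the analytic derivative:

```python
def f(x):
    return x**2 - 6*x + 9          # f'(x) = 2x - 6, minimum at x = 3

def numerical_derivative(f, x, h=1e-5):
    # Central-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

for x in [0.0, 3.0, 5.0]:
    approx = numerical_derivative(f, x)
    exact = 2*x - 6
    print(f"x = {x}: numerical = {approx:.4f}, analytic = {exact:.4f}")
# At x = 3 the derivative is 0 -- a critical point (here, the minimum).
```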

Key Insight: When (f'(x) = 0), we're at a critical point (possibly a minimum or maximum).

Partial Derivatives and Gradients

For functions of multiple variables (f(x_1, x_2, \ldots, x_n)), we use partial derivatives:

[ \frac{\partial f}{\partial x_i} = \text{rate of change of } f \text{ with respect to } x_i \text{ (holding others constant)} ]

The gradient is the vector of all partial derivatives:

[ \nabla f = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right] ]

Geometric Interpretation: The gradient points in the direction of steepest ascent. To minimize (f), we move in the direction opposite to the gradient.

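A minimal sketch (with an illustrative bowl-shaped function) that evaluates the gradient from its partial derivatives and takes a few steps opposite to it:

```python
import numpy as np

def f(w):
    # f(w1, w2) = w1^2 + 3*w2^2  (a simple bowl with minimum at the origin)
    return w[0]**2 + 3 * w[1]**2

def grad_f(w):
    # Partial derivatives: df/dw1 = 2*w1, df/dw2 = 6*w2
    return np.array([2 * w[0], 6 * w[1]])

w = np.array([2.0, 1.0])
lr = 0.1                            # learning rate

for step in range(5):
    g = grad_f(w)
    w = w - lr * g                  # move opposite to the gradient (steepest descent)
    print(f"step {step + 1}: w = {w}, f(w) = {f(w):.4f}")
```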


4. The Learning Problem: Putting It All Together

The Mathematical Framework

Machine learning can be formulated as an optimization problem:

Given:

  • Training data (\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\})
  • A model (f(\mathbf{x}; \mathbf{w})) parameterized by weights (\mathbf{w})
  • A loss function (\mathcal{L}(y, \hat{y})) that measures prediction error

Find: [ \mathbf{w}^* = \arg\min_{\mathbf{w}} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(y_i, f(\mathbf{x}_i; \mathbf{w})) ]

This is the empirical risk minimization principle.

Example: Linear Regression

Model: (f(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T \mathbf{x} = w_0 + w_1 x_1 + w_2 x_2 + \cdots)

Loss: Mean Squared Error (MSE) [ \mathcal{L}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 ]

Solution: Use calculus to find (\mathbf{w}^*) where (\nabla_{\mathbf{w}} \mathcal{L} = \mathbf{0})

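A small sketch fitting linear regression on synthetic data by solving the normal equations (the data-generating values 4 and 3 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: y = 4 + 3*x + noise
n = 100
x = rng.uniform(0, 10, size=n)
y = 4 + 3 * x + rng.normal(0, 1, size=n)

# Design matrix with a column of ones for the intercept w0
X = np.column_stack([np.ones(n), x])

# Setting the gradient of the MSE to zero gives the normal equations:
#   X^T X w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
print(f"Learned weights: w0 = {w[0]:.3f}, w1 = {w[1]:.3f}")  # close to 4 and 3

# Predictions are just a matrix-vector product
y_pred = X @ w
mse = np.mean((y - y_pred)**2)
print(f"MSE: {mse:.3f}")
```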


Key Takeaways

Linear Algebra: Represents data (vectors, matrices) and transformations efficiently

  • Dot products measure similarity
  • Matrix multiplication applies transformations
  • Transpose and inverse are fundamental operations

Probability: Models uncertainty and noise in data

  • Distributions describe data patterns
  • Expected value and variance quantify center and spread
  • Foundation for probabilistic models

Calculus: Finds optimal model parameters

  • Derivatives measure rate of change
  • Gradients point to steepest ascent
  • Setting gradients to zero finds critical points

The Learning Problem: Minimize error (loss) over training data

  • Choose model architecture
  • Define loss function
  • Optimize parameters using calculus

Practice Problems

Problem 1: Vector Operations

Given vectors (\mathbf{a} = [2, 3, -1]) and (\mathbf{b} = [1, -2, 4]), compute:

  1. (\mathbf{a} + \mathbf{b})
  2. (\mathbf{a} \cdot \mathbf{b})
  3. (||\mathbf{a}||) (magnitude/norm)

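One possible solution sketch in NumPy:

```python
import numpy as np

a = np.array([2, 3, -1])
b = np.array([1, -2, 4])

print("a + b  =", a + b)              # [ 3  1  3]
print("a . b  =", np.dot(a, b))       # 2 - 6 - 4 = -8
print("||a||  =", np.linalg.norm(a))  # sqrt(4 + 9 + 1) ≈ 3.742
```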

Problem 2: Probability

A dataset of house prices follows a normal distribution with mean $350,000 and standard deviation $50,000. What's the probability a randomly selected house costs between $300,000 and $400,000?

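One possible solution sketch using scipy.stats:

```python
from scipy import stats

mu, sigma = 350_000, 50_000

# P(300,000 <= X <= 400,000) = CDF(400K) - CDF(300K)
p = stats.norm.cdf(400_000, loc=mu, scale=sigma) - stats.norm.cdf(300_000, loc=mu, scale=sigma)
print(f"P(300K <= price <= 400K) ≈ {p:.4f}")  # ≈ 0.6827 (within one standard deviation)
```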

Problem 3: Gradient Descent

Implement one step of gradient descent for (f(x) = x^2 - 6x + 9) starting at (x = 5) with learning rate (\alpha = 0.1).

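One possible solution sketch:

```python
def f(x):
    return x**2 - 6*x + 9

def f_prime(x):
    return 2*x - 6

x = 5.0
alpha = 0.1

# One gradient descent step: x_new = x - alpha * f'(x)
x_new = x - alpha * f_prime(x)
print(f"x: {x} -> {x_new}, f: {f(x):.2f} -> {f(x_new):.2f}")
# x moves from 5.0 to 4.6, heading toward the minimum at x = 3
```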


Next Steps

You now have the mathematical foundation for machine learning! In the next lesson, we'll formalize the supervised learning framework, exploring:

  • Training vs testing data
  • Loss functions in detail
  • The bias-variance tradeoff
  • Model capacity and generalization

These mathematical tools will appear in every ML algorithm we study. Keep this lesson as a reference – you'll return to it often!

Further Reading

  • Linear Algebra: Introduction to Linear Algebra by Gilbert Strang
  • Probability: Think Stats by Allen Downey
  • Calculus: Calculus by James Stewart
  • ML Math: Mathematics for Machine Learning by Deisenroth, Faisal, Ong (free online)

Remember: Don't try to memorize every formula. Focus on understanding the intuition – the formulas are just precise ways to express ideas you already understand!