LESSON 01 · INTERMEDIATE · 60 MIN

Mathematical Foundations of Machine Learning

Build the mathematical framework for understanding ML: linear algebra essentials, probability theory, optimization basics, and the learning problem formulation.

Introduction: The Language of Learning

Imagine you're teaching a child to recognize different types of fruit. You don't just say "this is an apple" – you help them understand patterns: apples are round, usually red or green, have smooth skin, and weigh a certain amount. The child learns by building a mental model from these features and the relationships among them.

Machine learning works the same way, but it speaks the language of linear algebra, probability, and calculus. In this lesson, we'll build the mathematical toolkit you need to understand and implement ML algorithms. Don't worry if you're rusty – we'll approach each concept visually and intuitively first.

Prerequisites

This lesson assumes you're comfortable with:

  • Basic Python programming
  • High school algebra (equations, functions)
  • Nice to have: Some exposure to matrices and derivatives (we'll review!)

1. Linear Algebra: The Geometry of Data

Why Linear Algebra?

In machine learning, we work with high-dimensional data. A house might have 10 features (size, bedrooms, age, location, etc.). An image might have millions of pixels. Linear algebra gives us a clean, efficient way to represent and manipulate this data.

Key Insight: Machine learning is fundamentally about finding relationships between inputs and outputs. Linear algebra lets us express these relationships compactly.

TIP

💡 Build visual intuition first: before the formulas feel natural, watch 3Blue1Brown's Essence of Linear Algebra — 15 short videos that turn vectors, matrices, and eigenvalues into geometric objects you can see moving. Twenty minutes of the series saves months of confusion.

Vectors: Representing Data Points

A vector is an ordered list of numbers. In ML, each data point is a vector.

    # A house represented as a vector
    house = [1200,    # square feet
             3,       # bedrooms
             2,       # bathrooms
             2010,    # year built
             350000]  # price

    # In NumPy
    import numpy as np
    house_vector = np.array([1200, 3, 2, 2010, 350000])

Geometric Interpretation: In 2D or 3D, a vector can be visualized as an arrow from the origin to the point given by its coordinates.
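
To make the arrow picture concrete, here is a minimal sketch that computes the length and direction of a 2D vector and adds a second vector to it. The vectors `v` and `w` are made-up values chosen only for illustration.

```python
import numpy as np

# A 2D vector: an arrow from the origin (0, 0) to the point (3, 4)
v = np.array([3.0, 4.0])

# Length of the arrow (Euclidean norm): sqrt(3^2 + 4^2) = 5
length = np.linalg.norm(v)

# Direction of the arrow: angle measured from the positive x-axis
angle_deg = np.degrees(np.arctan2(v[1], v[0]))

# Adding vectors places the arrows tip to tail
w = np.array([1.0, -2.0])
v_plus_w = v + w

print(f"length = {length:.2f}")              # 5.00
print(f"angle  = {angle_deg:.2f} degrees")   # 53.13 degrees
print(f"v + w  = {v_plus_w}")                # [4. 2.]
```

The same ideas carry over to vectors with many more dimensions, even though we can no longer draw them.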

Matrices: Transforming Data

A matrix is a 2D array of numbers. In ML, matrices represent:

  • Datasets (rows = samples, columns = features)
  • Transformations (linear mappings)
  • Model parameters (weights)

    # A dataset of 3 houses
    dataset = np.array([
        [1200, 3, 2, 2010, 350000],  # House 1
        [1500, 4, 3, 2015, 425000],  # House 2
        [900,  2, 1, 2005, 280000],  # House 3
    ])

    print(f"Shape: {dataset.shape}")  # (3, 5) = 3 samples, 5 features

Key Operations

1. Dot Product (Inner Product)

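The dot product multiplies two vectors entry by entry and sums the results. It is large and positive when the vectors point in similar directions, zero when they are perpendicular, and negative when they point in opposite directions, which is why it is often used as a similarity measure: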

[ \mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1b_1 + a_2b_2 + \cdots + a_nb_n ]

ML Application: The dot product is the foundation of linear models! A prediction is just the dot product of features and weights.
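
As a rough sketch of both ideas, the snippet below computes a dot product, normalizes it into a cosine similarity, and then uses a dot product as a linear prediction. The `weights` and `bias` values are invented for illustration and are not fitted to any data.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Dot product: 1*4 + 2*5 + 3*6 = 32
dot = np.dot(a, b)

# Cosine similarity removes the effect of vector length,
# leaving only the agreement in direction (between -1 and 1)
cos_sim = dot / (np.linalg.norm(a) * np.linalg.norm(b))

# A linear prediction is a dot product of features and weights.
# These weights and the bias are hypothetical, chosen by hand.
features = np.array([1200.0, 3.0, 2.0])        # sq ft, bedrooms, bathrooms
weights = np.array([200.0, 10000.0, 5000.0])   # dollars per unit of each feature
bias = 50000.0
predicted_price = np.dot(features, weights) + bias

print(f"a . b             = {dot}")                    # 32.0
print(f"cosine similarity = {cos_sim:.3f}")            # 0.975
print(f"predicted price   = {predicted_price:,.0f}")   # 330,000
```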

2. Matrix Multiplication

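Matrix multiplication composes linear transformations. If (\mathbf{A}) has shape (m \times n) and (\mathbf{B}) has shape (n \times p), the inner dimensions must match and the product (\mathbf{C}) has shape (m \times p):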

[ \mathbf{C} = \mathbf{A} \mathbf{B} \quad \text{where} \quad C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj} ]
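
The following sketch multiplies two small matrices with NumPy's `@` operator and checks one entry against the summation formula. The matrices are arbitrary examples.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])          # shape (3, 2)
B = np.array([[7, 8, 9],
              [10, 11, 12]])    # shape (2, 3)

# Inner dimensions match (2 and 2), so the product has shape (3, 3)
C = A @ B

# Check one entry against the formula: C[0, 1] = sum over k of A[0, k] * B[k, 1]
c01 = sum(A[0, k] * B[k, 1] for k in range(A.shape[1]))

# Order matters: B @ A is a different matrix with a different shape
D = B @ A

print("C shape:", C.shape)        # (3, 3)
print("C[0, 1]:", C[0, 1], c01)   # 30 30
print("D shape:", D.shape)        # (2, 2)
```

Multiplying a dataset matrix (samples in rows) by a weight vector in this way produces one prediction per sample in a single operation.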

3. Transpose

The transpose swaps rows and columns:

[ (\mathbf{A}^T)_{ij} = \mathbf{A}_{ji} ]

    A = np.array([[1, 2, 3],
                  [4, 5, 6]])

    print("Original shape:", A.shape)      # (2, 3)
    print("Transposed shape:", A.T.shape)  # (3, 2)

ML Application: Computing gradients, normal equations, covariance matrices.
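
As one example of the transpose at work, here is a small sketch that builds a sample covariance matrix from centered data and compares it with `np.cov`. The two-feature dataset is made up for illustration.

```python
import numpy as np

# 3 samples (rows), 2 features (columns): square feet and bedrooms
X = np.array([[1200.0, 3.0],
              [1500.0, 4.0],
              [900.0, 2.0]])

# Center each feature by subtracting its column mean
X_centered = X - X.mean(axis=0)

# Sample covariance: (X_c^T X_c) / (n - 1), a (features x features) matrix
n = X.shape[0]
cov_manual = X_centered.T @ X_centered / (n - 1)

# NumPy's built-in version; rowvar=False means columns are the variables
cov_numpy = np.cov(X, rowvar=False)

print(cov_manual)
print(np.allclose(cov_manual, cov_numpy))  # True
```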


2. Probability Theory: Modeling Uncertainty

Why Probability?

Real-world data is noisy and uncertain. Instead of saying "this house costs exactly $350,000", we say "there's an 80% chance it's between $340K and $360K". Probability lets us:

  • Model noise in data
  • Quantify prediction confidence
  • Derive optimal learning algorithms

Random Variables

A random variable (X) is a variable whose value is determined by chance.

Example: The outcome of rolling a die is a random variable (X \in \{1, 2, 3, 4, 5, 6\}).

Probability Distributions

A probability distribution describes the likelihood of different outcomes.
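
Continuing the die example, this sketch compares the theoretical distribution of a fair die with the frequencies observed in simulated rolls. The seed and the number of rolls are arbitrary choices.

```python
import numpy as np

# Theoretical distribution of a fair die: each face has probability 1/6
outcomes = np.arange(1, 7)
pmf = np.full(6, 1 / 6)

# Simulate rolls (the upper bound of rng.integers is exclusive by default)
rng = np.random.default_rng(seed=0)
rolls = rng.integers(low=1, high=7, size=10_000)

# Empirical distribution: fraction of rolls that landed on each face
counts = np.bincount(rolls, minlength=7)[1:]   # drop the unused count for 0
empirical = counts / rolls.size

for face, p_true, p_hat in zip(outcomes, pmf, empirical):
    print(f"P(X = {face}): theory {p_true:.3f}, simulated {p_hat:.3f}")
```

With more rolls, the simulated frequencies should move closer to 1/6.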

Key Distributions in ML

| Distribution | Use Case | Example |
| --- | --- | --- |
| Normal (Gaussian) | Continuous variables, noise modeling | House prices, measurement errors |
| Bernoulli | Binary outcomes | Classification (yes/no) |
| Categorical | Multiple classes | Image classification (cat/dog/bird) |
| Poisson | Count data | Number of website visits per hour |

Expected Value and Variance

Expected value (E[X]): The "average" or "center of mass" of a distribution

[ E[X] = \sum_{x} x \cdot P(X = x) \quad \text{(discrete)} \quad \text{or} \quad E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx \quad \text{(continuous)} ]

Variance (\text{Var}(X)): How spread out the values are

[ \text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2 ]
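
As a quick check of these formulas, the sketch below computes the expected value and variance of a fair six-sided die, using both forms of the variance formula.

```python
import numpy as np

# Fair six-sided die: outcomes and their probabilities
x = np.arange(1, 7)
p = np.full(6, 1 / 6)

# Expected value: sum of x * P(X = x)
mean = np.sum(x * p)

# Variance, definition form: E[(X - E[X])^2]
var_definition = np.sum((x - mean) ** 2 * p)

# Variance, shortcut form: E[X^2] - (E[X])^2
var_shortcut = np.sum(x ** 2 * p) - mean ** 2

print(f"E[X]   = {mean:.4f}")             # 3.5000
print(f"Var(X) = {var_definition:.4f}")   # 2.9167
print(f"Var(X) = {var_shortcut:.4f}")     # 2.9167
```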

3. Calculus: Finding Optimal Solutions

Why Calculus?

Machine learning algorithms learn by finding the best parameters (weights) that minimize error. Calculus, specifically derivatives and gradients, tells us which direction to adjust parameters to reduce error.

Analogy: Imagine you're hiking in dense fog and want to reach the valley (lowest point). You can't see far, but you can feel the slope under your feet. If you always step in the direction of steepest descent, you are doing gradient descent!

Derivatives: Rate of Change

The derivative measures how a function changes as its input changes:

[ f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} ]

Key Insight: When (f'(x) = 0), we're at a critical point (a minimum, a maximum, or a flat inflection point).
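
The limit definition can be approximated numerically by picking a small (h). The sketch below does this for an example function, (f(x) = (x - 2)^2 + 1), chosen here only for illustration, and compares the result with the exact derivative (f'(x) = 2(x - 2)).

```python
def f(x):
    """Example function with its minimum at x = 2."""
    return (x - 2.0) ** 2 + 1.0


def f_prime_exact(x):
    """Exact derivative of f, found by hand: f'(x) = 2(x - 2)."""
    return 2.0 * (x - 2.0)


def numerical_derivative(func, x, h=1e-6):
    """Forward-difference approximation of the limit definition."""
    return (func(x + h) - func(x)) / h


for x in [0.0, 2.0, 3.0]:
    approx = numerical_derivative(f, x)
    exact = f_prime_exact(x)
    print(f"x = {x}: numerical {approx:.4f}, exact {exact:.4f}")
```

At (x = 2) both values are essentially zero, which matches the critical point at the bottom of the parabola.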

Partial Derivatives and Gradients

For functions of multiple variables (f(x_1, x_2, \ldots, x_n)), we use partial derivatives:

[ \frac{\partial f}{\partial x_i} = \text{rate of change of } f \text{ with respect to } x_i \text{ (holding others constant)} ]

The gradient is the vector of all partial derivatives:

[ \nabla f = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right] ]

Geometric Interpretation: The gradient points in the direction of steepest ascent. To minimize (f), we move in the direction opposite to the gradient.
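
Here is a minimal sketch of that idea on an example function, (f(x, y) = x^2 + 3y^2), whose minimum is at the origin. The function, starting point, learning rate, and step count are all illustrative choices.

```python
import numpy as np


def f(p):
    """Example bowl-shaped function f(x, y) = x^2 + 3y^2."""
    x, y = p
    return x ** 2 + 3.0 * y ** 2


def grad_f(p):
    """Gradient found by hand: [df/dx, df/dy] = [2x, 6y]."""
    x, y = p
    return np.array([2.0 * x, 6.0 * y])


def numerical_gradient(func, p, h=1e-6):
    """Central-difference estimate of each partial derivative."""
    grad = np.zeros_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = h
        grad[i] = (func(p + step) - func(p - step)) / (2.0 * h)
    return grad


start = np.array([4.0, -2.0])
print("analytic gradient: ", grad_f(start))                  # [  8. -12.]
print("numerical gradient:", numerical_gradient(f, start))

# Gradient descent: repeatedly step opposite to the gradient
learning_rate = 0.1
p = start.copy()
for _ in range(50):
    p = p - learning_rate * grad_f(p)

print("after 50 steps:", p, "f =", f(p))
```

Each step reduces (f) here because the learning rate is small enough; a rate that is too large can overshoot the minimum.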

4. The Learning Problem: Putting It All Together

The Mathematical Framework

Machine learning can be formulated as an optimization problem:

Given:

  • Training data (\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\})
  • A model (f(\mathbf{x}; \mathbf{w})) parameterized by weights (\mathbf{w})
  • A loss function (\mathcal{L}(y, \hat{y})) that measures prediction error

Find: [ \mathbf{w}^* = \arg\min_{\mathbf{w}} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(y_i, f(\mathbf{x}_i; \mathbf{w})) ]

This is the empirical risk minimization principle.

Example: Linear Regression

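Model: (f(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T \mathbf{x} = w_0 + w_1 x_1 + w_2 x_2 + \cdots), where (\mathbf{x}) is extended with a constant feature (x_0 = 1) so that (w_0) acts as the intercept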

Loss: Mean Squared Error (MSE) [ \mathcal{L}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 ]

Solution: Because the MSE is convex in (\mathbf{w}), any point where (\nabla_{\mathbf{w}} \mathcal{L} = \mathbf{0}) is a global minimum, so we use calculus to solve for that (\mathbf{w}^*)
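
For the MSE loss, setting the gradient to zero leads to the normal equations, (\mathbf{X}^T \mathbf{X} \mathbf{w} = \mathbf{X}^T \mathbf{y}), where the rows of (\mathbf{X}) are the training inputs. The sketch below solves them on synthetic data; the true intercept 3, slope 2, noise level, and seed are assumptions made up for this example.

```python
import numpy as np

# Synthetic data: y = 3 + 2x + noise (all values chosen for illustration)
rng = np.random.default_rng(seed=0)
n = 100
x = rng.uniform(0.0, 10.0, size=n)
y = 3.0 + 2.0 * x + rng.normal(0.0, 0.5, size=n)

# Design matrix with a leading column of ones, so w[0] is the intercept
X = np.column_stack([np.ones(n), x])

# Solve the normal equations X^T X w = X^T y for w
w_star = np.linalg.solve(X.T @ X, X.T @ y)


def mse(w):
    """Mean squared error of the linear model with weights w."""
    return np.mean((y - X @ w) ** 2)


def mse_gradient(w):
    """Gradient of the MSE: -(2/n) X^T (y - Xw)."""
    return -(2.0 / n) * X.T @ (y - X @ w)


print("estimated [intercept, slope]:", w_star)
print("MSE at w*:", mse(w_star))
print("gradient at w*:", mse_gradient(w_star))   # numerically close to zero
```

The recovered weights should land near the values used to generate the data, and the gradient at the solution is zero up to floating-point error.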

Key Takeaways

Linear Algebra: Represents data (vectors, matrices) and transformations efficiently

  • Dot products measure similarity
  • Matrix multiplication applies transformations
  • Transpose and inverse are fundamental operations

Probability: Models uncertainty and noise in data

  • Distributions describe data patterns
  • Expected value and variance quantify center and spread
  • Foundation for probabilistic models

Calculus: Finds optimal model parameters

  • Derivatives measure rate of change
  • Gradients point in the direction of steepest ascent
  • Setting gradients to zero finds critical points

The Learning Problem: Minimize error (loss) over training data

  • Choose model architecture
  • Define loss function
  • Optimize parameters using calculus

Practice Problems

Problem 1: Vector Operations

Given vectors (\mathbf{a} = [2, 3, -1]) and (\mathbf{b} = [1, -2, 4]), compute:

  1. (\mathbf{a} + \mathbf{b})
  2. (\mathbf{a} \cdot \mathbf{b})
  3. (||\mathbf{a}||) (magnitude/norm)
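
Once you have worked the three parts by hand, one way to check your answers is a short NumPy script like this one:

```python
import numpy as np

a = np.array([2, 3, -1])
b = np.array([1, -2, 4])

# 1. Element-wise sum
vector_sum = a + b

# 2. Dot product: 2*1 + 3*(-2) + (-1)*4
dot_product = np.dot(a, b)

# 3. Euclidean norm: sqrt(2^2 + 3^2 + (-1)^2) = sqrt(14)
norm_a = np.linalg.norm(a)

print("a + b =", vector_sum)        # [3 1 3]
print("a . b =", dot_product)       # -8
print(f"||a|| = {norm_a:.4f}")      # 3.7417
```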

Problem 2: Probability

A dataset of house prices follows a normal distribution with mean $350,000 and standard deviation $50,000. What's the probability a randomly selected house costs between $300,000 and $400,000?
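
The range in this problem is exactly one standard deviation on either side of the mean. A possible way to check the answer with only the standard library is to build the normal CDF from `math.erf`; the helper name `normal_cdf` is our own, not a library function.

```python
import math


def normal_cdf(x, mean, std):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


mean = 350_000
std = 50_000

# P(300,000 <= X <= 400,000) = CDF(400,000) - CDF(300,000)
prob = normal_cdf(400_000, mean, std) - normal_cdf(300_000, mean, std)

print(f"P(300K <= price <= 400K) = {prob:.4f}")   # 0.6827
```

This is the familiar rule of thumb that about 68% of a normal distribution lies within one standard deviation of the mean.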

Problem 3: Gradient Descent

Implement one step of gradient descent for (f(x) = x^2 - 6x + 9) starting at (x = 5) with learning rate (\alpha = 0.1).
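
One possible solution sketch is shown below. It uses the derivative (f'(x) = 2x - 6), worked out by hand, and applies the update (x_{\text{new}} = x - \alpha f'(x)) once.

```python
def f(x):
    """f(x) = x^2 - 6x + 9, which equals (x - 3)^2."""
    return x ** 2 - 6.0 * x + 9.0


def f_prime(x):
    """Derivative found by hand: f'(x) = 2x - 6."""
    return 2.0 * x - 6.0


x = 5.0
alpha = 0.1

# One gradient descent step: move opposite to the derivative
gradient = f_prime(x)            # 2*5 - 6 = 4
x_new = x - alpha * gradient     # 5 - 0.1*4 = 4.6

print(f"gradient at x = 5: {gradient:.2f}")   # 4.00
print(f"new x:             {x_new:.2f}")      # 4.60
print(f"f(x) before:       {f(x):.2f}")       # 4.00
print(f"f(x) after:        {f(x_new):.2f}")   # 2.56
```

The step moves (x) toward the minimum at (x = 3), and the function value drops accordingly.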

Next Steps

You now have the mathematical foundation for machine learning! In the next lesson, we'll formalize the supervised learning framework, exploring:

  • Training vs testing data
  • Loss functions in detail
  • The bias-variance tradeoff
  • Model capacity and generalization

These mathematical tools will appear in every ML algorithm we study. Keep this lesson as a reference – you'll return to it often!

Further Reading

Classic References

  • Book: Introduction to Linear Algebra — Gilbert Strang
  • Book: Think Stats — Allen Downey (free online)
  • Book: Calculus — James Stewart

Remember: Don't try to memorize every formula. Focus on understanding the intuition – the formulas are just precise ways to express ideas you already understand!