Introduction: The Language of Learning
Imagine you're teaching a child to recognize different types of fruit. You don't just say "this is an apple" – you help them understand patterns: apples are round, usually red or green, have smooth skin, and weigh a certain amount. The child learns by building a mental model from these features and the relationships between them.
Machine learning works the same way, but it speaks the language of linear algebra, probability, and calculus. In this lesson, we'll build the mathematical toolkit you need to understand and implement ML algorithms. Don't worry if you're rusty – we'll approach each concept visually and intuitively first.
Prerequisites
This lesson assumes you're comfortable with:
- Basic Python programming
- High school algebra (equations, functions)
- Nice to have: Some exposure to matrices and derivatives (we'll review!)
1. Linear Algebra: The Geometry of Data
Why Linear Algebra?
In machine learning, we work with high-dimensional data. A house might have 10 features (size, bedrooms, age, location, etc.). An image might have millions of pixels. Linear algebra gives us a clean, efficient way to represent and manipulate this data.
Key Insight: Machine learning is fundamentally about finding relationships between inputs and outputs. Linear algebra lets us express these relationships compactly.
Vectors: Representing Data Points
A vector is an ordered list of numbers. In ML, each data point is a vector.
```python
# A house represented as a vector
house = [1200,    # square feet
         3,       # bedrooms
         2,       # bathrooms
         2010,    # year built
         350000]  # price

# In NumPy
import numpy as np
house_vector = np.array([1200, 3, 2, 2010, 350000])
```
Geometric Interpretation: In 2D or 3D, we can visualize vectors as arrows from the origin:
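Here's a minimal sketch of that picture, using two made-up 2D vectors drawn as arrows from the origin and measuring their lengths:

```python
import numpy as np
import matplotlib.pyplot as plt

# Two illustrative 2D vectors, viewed as arrows from the origin
a = np.array([3, 1])
b = np.array([1, 2])

plt.figure(figsize=(5, 5))
plt.quiver(0, 0, a[0], a[1], angles='xy', scale_units='xy', scale=1, color='steelblue')
plt.quiver(0, 0, b[0], b[1], angles='xy', scale_units='xy', scale=1, color='coral')
plt.xlim(-1, 4)
plt.ylim(-1, 4)
plt.grid(True, alpha=0.3)
plt.title('Vectors as arrows from the origin')
plt.show()

# Their lengths (Euclidean norms)
print("||a|| =", np.linalg.norm(a))  # ~3.16
print("||b|| =", np.linalg.norm(b))  # ~2.24
```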
Matrices: Transforming Data
A matrix is a 2D array of numbers. In ML, matrices represent:
- Datasets (rows = samples, columns = features)
- Transformations (linear mappings)
- Model parameters (weights)
```python
# A dataset of 3 houses
dataset = np.array([
    [1200, 3, 2, 2010, 350000],  # House 1
    [1500, 4, 3, 2015, 425000],  # House 2
    [900,  2, 1, 2005, 280000],  # House 3
])
print(f"Shape: {dataset.shape}")  # (3, 5) = 3 samples, 5 features
```
Key Operations
1. Dot Product (Inner Product)
The dot product measures similarity between vectors:
[ \mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1b_1 + a_2b_2 + \cdots + a_nb_n ]
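As a quick illustration (the vector values are arbitrary), the dot product can be computed element by element, straight from the formula, or with NumPy:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, -1.0, 2.0])

# Element by element, following the summation formula
manual = sum(a_i * b_i for a_i, b_i in zip(a, b))

# NumPy equivalents
print(manual)        # 8.0
print(np.dot(a, b))  # 8.0
print(a @ b)         # 8.0
```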
ML Application: The dot product is the foundation of linear models! A prediction is just the dot product of features and weights.
2. Matrix Multiplication
Matrix multiplication combines transformations:
[ \mathbf{C} = \mathbf{A} \mathbf{B} \quad \text{where} \quad C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj} ]
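A small sketch of the definition: each entry of C is the dot product of a row of A with a column of B. The second half shows the ML flavour with an illustrative (made-up) weight vector – predictions for a whole dataset are a single matrix–vector product.

```python
import numpy as np

# C[i, j] is the dot product of row i of A with column j of B
A = np.array([[1, 2],
              [3, 4]])      # shape (2, 2)
B = np.array([[5, 6, 7],
              [8, 9, 10]])  # shape (2, 3)

C = A @ B                   # shape (2, 3)
print(C)
# [[21 24 27]
#  [47 54 61]]

# ML flavour: one prediction per house from a single matrix–vector product
X = np.array([[1200, 3], [1500, 4], [900, 2]])  # 3 houses, 2 features
w = np.array([250.0, 10000.0])                  # illustrative weights
print(X @ w)
```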
3. Transpose
Flipping rows and columns:
[ (\mathbf{A}^T)_{ij} = \mathbf{A}_{ji} ]
```python
A = np.array([[1, 2, 3],
              [4, 5, 6]])
print("Original shape:", A.shape)      # (2, 3)
print("Transposed shape:", A.T.shape)  # (3, 2)
```
ML Application: Computing gradients, normal equations, covariance matrices.
2. Probability Theory: Modeling Uncertainty
Why Probability?
Real-world data is noisy and uncertain. Instead of saying "this house costs exactly $350K", we say "this house probably costs between $340K and $360K". Probability lets us:
- Model noise in data
- Quantify prediction confidence
- Derive optimal learning algorithms
Random Variables
A random variable (X) is a variable whose value is determined by chance.
Example: The outcome of rolling a die is a random variable (X \in \{1, 2, 3, 4, 5, 6\}).
Probability Distributions
A probability distribution describes the likelihood of different outcomes.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Normal (Gaussian) distribution
x = np.linspace(-4, 4, 200)
normal_pdf = stats.norm.pdf(x, loc=0, scale=1)

# Binomial distribution
x_binom = np.arange(0, 11)
binomial_pmf = stats.binom.pmf(x_binom, n=10, p=0.5)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Normal distribution
ax1.plot(x, normal_pdf, 'b-', linewidth=2, label='μ=0, σ²=1')
ax1.fill_between(x, normal_pdf, alpha=0.3)
ax1.set_title('Normal Distribution (Continuous)', fontsize=14, fontweight='bold')
ax1.set_xlabel('x')
ax1.set_ylabel('Probability Density')
ax1.grid(True, alpha=0.3)
ax1.legend()

# Binomial distribution
ax2.bar(x_binom, binomial_pmf, color='coral', edgecolor='darkred', linewidth=1.5)
ax2.set_title('Binomial Distribution (Discrete)', fontsize=14, fontweight='bold')
ax2.set_xlabel('Number of Successes')
ax2.set_ylabel('Probability')
ax2.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("Normal distribution: models continuous outcomes (e.g., house prices)")
print("Binomial distribution: models discrete outcomes (e.g., coin flips)")
```
Key Distributions in ML
| Distribution | Use Case | Example |
|---|---|---|
| Normal (Gaussian) | Continuous variables, noise modeling | House prices, measurement errors |
| Bernoulli | Binary outcomes | Classification (yes/no) |
| Categorical | Multiple classes | Image classification (cat/dog/bird) |
| Poisson | Count data | Number of website visits per hour |
Expected Value and Variance
Expected value (E[X]): The "average" or "center of mass" of a distribution
[ E[X] = \sum_{x} x \cdot P(X = x) \quad \text{(discrete)} \quad \text{or} \quad E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx \quad \text{(continuous)} ]
Variance (\text{Var}(X)): How spread out the values are
[ \text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2 ]
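A quick sketch using the die example from above – note that both variance formulas give the same answer:

```python
import numpy as np

# Fair six-sided die: outcomes 1..6, each with probability 1/6
outcomes = np.arange(1, 7)
probs = np.full(6, 1 / 6)

# Expected value: sum of x * P(X = x)
mean = np.sum(outcomes * probs)

# Variance: E[(X - E[X])^2], equivalently E[X^2] - (E[X])^2
var = np.sum((outcomes - mean) ** 2 * probs)
var_alt = np.sum(outcomes ** 2 * probs) - mean ** 2

print(f"E[X]   = {mean:.3f}")     # 3.500
print(f"Var(X) = {var:.3f}")      # 2.917
print(f"Check  = {var_alt:.3f}")  # 2.917
```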
3. Calculus: Finding Optimal Solutions
Why Calculus?
Machine learning algorithms learn by finding the best parameters (weights) that minimize error. Calculus, specifically derivatives and gradients, tells us which direction to adjust parameters to reduce error.
Analogy: Imagine you're hiking in dense fog and want to reach the valley (lowest point). You can't see far, but you can feel the slope under your feet, so you repeatedly step in whichever direction slopes downhill most steeply – that's gradient descent!
Derivatives: Rate of Change
The derivative measures how a function changes as its input changes:
[ f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} ]
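A minimal numerical check of this definition, using f(x) = x² (whose exact derivative is 2x) and a small but finite h:

```python
# Approximate the derivative of f(x) = x^2 with a finite difference
def f(x):
    return x ** 2

def numerical_derivative(f, x, h=1e-5):
    return (f(x + h) - f(x)) / h

for x in [-2.0, 0.0, 3.0]:
    print(f"f'({x}) ≈ {numerical_derivative(f, x):.4f}  (exact: {2 * x})")
```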
Key Insight: When (f'(x) = 0), we're at a critical point (possibly a minimum or maximum).
Partial Derivatives and Gradients
For functions of multiple variables (f(x_1, x_2, \ldots, x_n)), we use partial derivatives:
[ \frac{\partial f}{\partial x_i} = \text{rate of change of } f \text{ with respect to } x_i \text{ (holding others constant)} ]
The gradient is the vector of all partial derivatives:
[ \nabla f = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right] ]
Geometric Interpretation: The gradient points in the direction of steepest ascent. To minimize (f), we move in the direction opposite to the gradient.
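Here's a sketch with a toy function f(x, y) = x² + 3y², approximating each partial derivative with a central difference and comparing against the analytic gradient [2x, 6y]:

```python
import numpy as np

# Toy function: f(x, y) = x^2 + 3y^2, gradient [2x, 6y]
def f(v):
    x, y = v
    return x ** 2 + 3 * y ** 2

def numerical_gradient(f, v, h=1e-5):
    grad = np.zeros_like(v, dtype=float)
    for i in range(len(v)):
        step = np.zeros_like(v, dtype=float)
        step[i] = h
        grad[i] = (f(v + step) - f(v - step)) / (2 * h)  # central difference
    return grad

point = np.array([1.0, 2.0])
print("numerical gradient:", numerical_gradient(f, point))           # ~[2., 12.]
print("analytic gradient :", np.array([2 * point[0], 6 * point[1]]))
```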
4. The Learning Problem: Putting It All Together
The Mathematical Framework
Machine learning can be formulated as an optimization problem:
Given:
- Training data (\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\})
- A model (f(\mathbf{x}; \mathbf{w})) parameterized by weights (\mathbf{w})
- A loss function (\mathcal{L}(y, \hat{y})) that measures prediction error
Find: [ \mathbf{w}^* = \arg\min_{\mathbf{w}} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(y_i, f(\mathbf{x}_i; \mathbf{w})) ]
This is the empirical risk minimization principle.
Example: Linear Regression
Model: (f(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T \mathbf{x} = w_0 + w_1 x_1 + w_2 x_2 + \cdots)
Loss: Mean Squared Error (MSE) [ \mathcal{L}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 ]
Solution: Use calculus to find (\mathbf{w}^*) where (\nabla_{\mathbf{w}} \mathcal{L} = \mathbf{0})
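One way to see all the pieces working together is a minimal linear regression on synthetic 1D data, fit both by the closed-form normal equations and by gradient descent. The data, learning rate, and iteration count below are illustrative choices, not prescribed by the lesson:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2 + 3x + noise
n = 100
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 1, n)

# Design matrix with a bias column of ones
X = np.column_stack([np.ones(n), x])

# Closed form: set the gradient of the MSE to zero -> normal equations
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the same MSE loss
w = np.zeros(2)
lr = 0.01
for _ in range(5000):
    grad = (2 / n) * X.T @ (X @ w - y)  # gradient of MSE w.r.t. w
    w -= lr * grad

print("normal equations :", w_closed)  # ~[2, 3]
print("gradient descent :", w)         # ~[2, 3]
```

Both approaches recover weights close to the true values used to generate the data, which is exactly the empirical risk minimization principle in action.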
Key Takeaways
✓ Linear Algebra: Represents data (vectors, matrices) and transformations efficiently
- Dot products measure similarity
- Matrix multiplication applies transformations
- Transpose and inverse are fundamental operations
✓ Probability: Models uncertainty and noise in data
- Distributions describe data patterns
- Expected value and variance quantify center and spread
- Foundation for probabilistic models
✓ Calculus: Finds optimal model parameters
- Derivatives measure rate of change
- Gradients point to steepest ascent
- Setting gradients to zero finds critical points
✓ The Learning Problem: Minimize error (loss) over training data
- Choose model architecture
- Define loss function
- Optimize parameters using calculus
Practice Problems
Problem 1: Vector Operations
Given vectors (\mathbf{a} = [2, 3, -1]) and (\mathbf{b} = [1, -2, 4]), compute:
- (\mathbf{a} + \mathbf{b})
- (\mathbf{a} \cdot \mathbf{b})
- (||\mathbf{a}||) (magnitude/norm)
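One possible solution sketch in NumPy:

```python
import numpy as np

a = np.array([2, 3, -1])
b = np.array([1, -2, 4])

print("a + b =", a + b)              # [3, 1, 3]
print("a · b =", np.dot(a, b))       # 2 - 6 - 4 = -8
print("||a|| =", np.linalg.norm(a))  # sqrt(4 + 9 + 1) ≈ 3.742
```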
Problem 2: Probability
A dataset of house prices follows a normal distribution with mean $350,000 and standard deviation $50,000. What's the probability a randomly selected house costs between $300,000 and $400,000?
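One way to compute this with SciPy's normal CDF, assuming the distribution parameters and price range stated in the problem above:

```python
from scipy import stats

mean, std = 350_000, 50_000
lower, upper = 300_000, 400_000

# P(lower <= X <= upper) = CDF(upper) - CDF(lower)
prob = stats.norm.cdf(upper, loc=mean, scale=std) - stats.norm.cdf(lower, loc=mean, scale=std)
print(f"P({lower} <= price <= {upper}) = {prob:.3f}")  # ~0.683, the familiar ±1σ range
```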
Problem 3: Gradient Descent
Implement one step of gradient descent for (f(x) = x^2 - 6x + 9) starting at (x = 5) with learning rate (\alpha = 0.1).
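A sketch of a single update, with the derivative f'(x) = 2x − 6 written out analytically:

```python
def grad(x):
    # Derivative of f(x) = x^2 - 6x + 9
    return 2 * x - 6

x = 5.0
alpha = 0.1

# One gradient descent step: x_new = x - alpha * f'(x)
x_new = x - alpha * grad(x)
print(x_new)  # 5 - 0.1 * 4 = 4.6  (the minimum is at x = 3)
```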
Next Steps
You now have the mathematical foundation for machine learning! In the next lesson, we'll formalize the supervised learning framework, exploring:
- Training vs testing data
- Loss functions in detail
- The bias-variance tradeoff
- Model capacity and generalization
These mathematical tools will appear in every ML algorithm we study. Keep this lesson as a reference – you'll return to it often!
Further Reading
- Linear Algebra: Introduction to Linear Algebra by Gilbert Strang
- Probability: Think Stats by Allen Downey
- Calculus: Calculus by James Stewart
- ML Math: Mathematics for Machine Learning by Deisenroth, Faisal, Ong (free online)
Remember: Don't try to memorize every formula. Focus on understanding the intuition – the formulas are just precise ways to express ideas you already understand!