Introduction: The Language of Learning
Imagine you're teaching a child to recognize different types of fruit. You don't just say "this is an apple" – you help them understand patterns: apples are round, usually red or green, have smooth skin, and weigh a certain amount. The child learns by building a mental model from these features and the relationships between them.
Machine learning works the same way, but it speaks the language of linear algebra, probability, and calculus. In this lesson, we'll build the mathematical toolkit you need to understand and implement ML algorithms. Don't worry if you're rusty – we'll approach each concept visually and intuitively first.
Prerequisites
This lesson assumes you're comfortable with:
- Basic Python programming
- High school algebra (equations, functions)
- Nice to have: Some exposure to matrices and derivatives (we'll review!)
1. Linear Algebra: The Geometry of Data
Why Linear Algebra?
In machine learning, we work with high-dimensional data. A house might have 10 features (size, bedrooms, age, location, etc.). An image might have millions of pixels. Linear algebra gives us a clean, efficient way to represent and manipulate this data.
Key Insight: Machine learning is fundamentally about finding relationships between inputs and outputs. Linear algebra lets us express these relationships compactly.
Vectors: Representing Data Points
A vector is an ordered list of numbers. In ML, each data point is a vector.
```python
# A house represented as a vector
house = [1200,    # square feet
         3,       # bedrooms
         2,       # bathrooms
         2010,    # year built
         350000]  # price

# In NumPy
import numpy as np
house_vector = np.array([1200, 3, 2, 2010, 350000])
```
Geometric Interpretation: In 2D or 3D, we can visualize vectors as arrows from the origin:
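Here's a minimal sketch of that picture, using two made-up 2D vectors drawn as arrows from the origin and measuring their lengths:

```python
import numpy as np
import matplotlib.pyplot as plt

# Two illustrative 2D vectors, viewed as arrows from the origin
a = np.array([3, 1])
b = np.array([1, 2])

plt.figure(figsize=(5, 5))
plt.quiver(0, 0, a[0], a[1], angles='xy', scale_units='xy', scale=1, color='steelblue')
plt.quiver(0, 0, b[0], b[1], angles='xy', scale_units='xy', scale=1, color='coral')
plt.xlim(-1, 4)
plt.ylim(-1, 4)
plt.grid(True, alpha=0.3)
plt.title('Vectors as arrows from the origin')
plt.show()

# Their lengths (Euclidean norms)
print("||a|| =", np.linalg.norm(a))  # ~3.16
print("||b|| =", np.linalg.norm(b))  # ~2.24
```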
Matrices: Transforming Data
A matrix is a 2D array of numbers. In ML, matrices represent:
- Datasets (rows = samples, columns = features)
- Transformations (linear mappings)
- Model parameters (weights)
```python
# A dataset of 3 houses
dataset = np.array([
    [1200, 3, 2, 2010, 350000],  # House 1
    [1500, 4, 3, 2015, 425000],  # House 2
    [900,  2, 1, 2005, 280000],  # House 3
])
print(f"Shape: {dataset.shape}")  # (3, 5) = 3 samples, 5 features
```
Key Operations
1. Dot Product (Inner Product)
The dot product measures similarity between vectors:
[ \mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1b_1 + a_2b_2 + \cdots + a_nb_n ]
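As a quick illustration (the vector values are arbitrary), the dot product can be computed element by element, straight from the formula, or with NumPy:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, -1.0, 2.0])

# Element by element, following the summation formula
manual = sum(a_i * b_i for a_i, b_i in zip(a, b))

# NumPy equivalents
print(manual)        # 8.0
print(np.dot(a, b))  # 8.0
print(a @ b)         # 8.0
```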
ML Application: The dot product is the foundation of linear models! A prediction is just the dot product of features and weights.
2. Matrix Multiplication
Matrix multiplication combines transformations:
[ \mathbf{C} = \mathbf{A} \mathbf{B} \quad \text{where} \quad C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj} ]
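A small sketch of the definition: each entry of C is the dot product of a row of A with a column of B. The second half shows the ML flavour with an illustrative (made-up) weight vector – predictions for a whole dataset are a single matrix–vector product.

```python
import numpy as np

# C[i, j] is the dot product of row i of A with column j of B
A = np.array([[1, 2],
              [3, 4]])      # shape (2, 2)
B = np.array([[5, 6, 7],
              [8, 9, 10]])  # shape (2, 3)

C = A @ B                   # shape (2, 3)
print(C)
# [[21 24 27]
#  [47 54 61]]

# ML flavour: one prediction per house from a single matrix–vector product
X = np.array([[1200, 3], [1500, 4], [900, 2]])  # 3 houses, 2 features
w = np.array([250.0, 10000.0])                  # illustrative weights
print(X @ w)
```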
3. Transpose
Flipping rows and columns:
[ (\mathbf{A}^T)_{ij} = \mathbf{A}_{ji} ]
```python
A = np.array([[1, 2, 3],
              [4, 5, 6]])
print("Original shape:", A.shape)      # (2, 3)
print("Transposed shape:", A.T.shape)  # (3, 2)
```
ML Application: Computing gradients, normal equations, covariance matrices.
2. Probability Theory: Modeling Uncertainty
Why Probability?
Real-world data is noisy and uncertain. Instead of saying "this house costs exactly $350K", we say "this house probably costs between $340K and $360K". Probability lets us:
- Model noise in data
- Quantify prediction confidence
- Derive optimal learning algorithms
Random Variables
A random variable (X) is a variable whose value is determined by chance.
Example: The outcome of rolling a die is a random variable (X \in \{1, 2, 3, 4, 5, 6\}).
Probability Distributions
A probability distribution describes the likelihood of different outcomes.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Normal (Gaussian) distribution
x = np.linspace(-4, 4, 200)
normal_pdf = stats.norm.pdf(x, loc=0, scale=1)

# Binomial distribution
x_binom = np.arange(0, 11)
binomial_pmf = stats.binom.pmf(x_binom, n=10, p=0.5)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Normal distribution
ax1.plot(x, normal_pdf, 'b-', linewidth=2, label='μ=0, σ²=1')
ax1.fill_between(x, normal_pdf, alpha=0.3)
ax1.set_title('Normal Distribution (Continuous)', fontsize=14, fontweight='bold')
ax1.set_xlabel('x')
ax1.set_ylabel('Probability Density')
ax1.grid(True, alpha=0.3)
ax1.legend()

# Binomial distribution
ax2.bar(x_binom, binomial_pmf, color='coral', edgecolor='darkred', linewidth=1.5)
ax2.set_title('Binomial Distribution (Discrete)', fontsize=14, fontweight='bold')
ax2.set_xlabel('Number of Successes')
ax2.set_ylabel('Probability')
ax2.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("Normal distribution: models continuous outcomes (e.g., house prices)")
print("Binomial distribution: models discrete outcomes (e.g., coin flips)")
```
Key Distributions in ML
| Distribution | Use Case | Example |
|---|---|---|
| Normal (Gaussian) | Continuous variables, noise modeling | House prices, measurement errors |
| Bernoulli | Binary outcomes | Classification (yes/no) |
| Categorical | Multiple classes | Image classification (cat/dog/bird) |
| Poisson | Count data | Number of website visits per hour |
Expected Value and Variance
Expected value (E[X]): The "average" or "center of mass" of a distribution
[ E[X] = \sum_{x} x \cdot P(X = x) \quad \text{(discrete)} \quad \text{or} \quad E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx \quad \text{(continuous)} ]
Variance (\text{Var}(X)): How spread out the values are
[ \text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2 ]
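A quick sketch using the die example from above – note that both variance formulas give the same answer:

```python
import numpy as np

# Fair six-sided die: outcomes 1..6, each with probability 1/6
outcomes = np.arange(1, 7)
probs = np.full(6, 1 / 6)

# Expected value: sum of x * P(X = x)
mean = np.sum(outcomes * probs)

# Variance: E[(X - E[X])^2], equivalently E[X^2] - (E[X])^2
var = np.sum((outcomes - mean) ** 2 * probs)
var_alt = np.sum(outcomes ** 2 * probs) - mean ** 2

print(f"E[X]   = {mean:.3f}")     # 3.500
print(f"Var(X) = {var:.3f}")      # 2.917
print(f"Check  = {var_alt:.3f}")  # 2.917
```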
3. Calculus: Finding Optimal Solutions
Why Calculus?
Machine learning algorithms learn by finding the best parameters (weights) that minimize error. Calculus, specifically derivatives and gradients, tells us which direction to adjust parameters to reduce error.
Analogy: Imagine you're hiking in dense fog and want to reach the valley (lowest point). You can't see far, but you can feel the slope under your feet, so you repeatedly step in whichever direction slopes downhill most steeply – that's gradient descent!
Derivatives: Rate of Change
The derivative measures how a function changes as its input changes:
[ f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} ]
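A minimal numerical check of this definition, using f(x) = x² (whose exact derivative is 2x) and a small but finite h:

```python
# Approximate the derivative of f(x) = x^2 with a finite difference
def f(x):
    return x ** 2

def numerical_derivative(f, x, h=1e-5):
    return (f(x + h) - f(x)) / h

for x in [-2.0, 0.0, 3.0]:
    print(f"f'({x}) ≈ {numerical_derivative(f, x):.4f}  (exact: {2 * x})")
```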
Key Insight: When (f'(x) = 0), we're at a critical point (possibly a minimum or maximum).
Partial Derivatives and Gradients
For functions of multiple variables (f(x_1, x_2, \ldots, x_n)), we use partial derivatives:
[ \frac{\partial f}{\partial x_i} = \text{rate of change of } f \text{ with respect to } x_i \text{ (holding others constant)} ]
The gradient is the vector of all partial derivatives:
[ \nabla f = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right] ]
Geometric Interpretation: The gradient points in the direction of steepest ascent. To minimize (f), we move in the direction opposite to the gradient.
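Here's a sketch with a toy function f(x, y) = x² + 3y², approximating each partial derivative with a central difference and comparing against the analytic gradient [2x, 6y]:

```python
import numpy as np

# Toy function: f(x, y) = x^2 + 3y^2, gradient [2x, 6y]
def f(v):
    x, y = v
    return x ** 2 + 3 * y ** 2

def numerical_gradient(f, v, h=1e-5):
    grad = np.zeros_like(v, dtype=float)
    for i in range(len(v)):
        step = np.zeros_like(v, dtype=float)
        step[i] = h
        grad[i] = (f(v + step) - f(v - step)) / (2 * h)  # central difference
    return grad

point = np.array([1.0, 2.0])
print("numerical gradient:", numerical_gradient(f, point))           # ~[2., 12.]
print("analytic gradient :", np.array([2 * point[0], 6 * point[1]]))
```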
4. The Learning Problem: Putting It All Together
The Mathematical Framework
Machine learning can be formulated as an optimization problem:
Given:
- Training data (\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\})
- A model (f(\mathbf{x}; \mathbf{w})) parameterized by weights (\mathbf{w})
- A loss function (\mathcal{L}(y, \hat{y})) that measures prediction error
Find: [ \mathbf{w}^* = \arg\min_{\mathbf{w}} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(y_i, f(\mathbf{x}_i; \mathbf{w})) ]
This is the empirical risk minimization principle.
Example: Linear Regression
Model: (f(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T \mathbf{x} = w_0 + w_1 x_1 + w_2 x_2 + \cdots)
Loss: Mean Squared Error (MSE) [ \mathcal{L}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 ]
Solution: Use calculus to find (\mathbf{w}^*) where (\nabla_{\mathbf{w}} \mathcal{L} = \mathbf{0})
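One way to see all the pieces working together is a minimal linear regression on synthetic 1D data, fit both by the closed-form normal equations and by gradient descent. The data, learning rate, and iteration count below are illustrative choices, not prescribed by the lesson:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2 + 3x + noise
n = 100
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 1, n)

# Design matrix with a bias column of ones
X = np.column_stack([np.ones(n), x])

# Closed form: set the gradient of the MSE to zero -> normal equations
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the same MSE loss
w = np.zeros(2)
lr = 0.01
for _ in range(5000):
    grad = (2 / n) * X.T @ (X @ w - y)  # gradient of MSE w.r.t. w
    w -= lr * grad

print("normal equations :", w_closed)  # ~[2, 3]
print("gradient descent :", w)         # ~[2, 3]
```

Both approaches recover weights close to the true values used to generate the data, which is exactly the empirical risk minimization principle in action.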
Key Takeaways
✓ Linear Algebra: Represents data (vectors, matrices) and transformations efficiently
- Dot products measure similarity
- Matrix multiplication applies transformations
- Transpose and inverse are fundamental operations
✓ Probability: Models uncertainty and noise in data
- Distributions describe data patterns
- Expected value and variance quantify center and spread
- Foundation for probabilistic models
✓ Calculus: Finds optimal model parameters
- Derivatives measure rate of change
- Gradients point to steepest ascent
- Setting gradients to zero finds critical points
✓ The Learning Problem: Minimize error (loss) over training data
- Choose model architecture
- Define loss function
- Optimize parameters using calculus
Practice Problems
Problem 1: Vector Operations
Given vectors (\mathbf{a} = [2, 3, -1]) and (\mathbf{b} = [1, -2, 4]), compute:
- (\mathbf{a} + \mathbf{b})
- (\mathbf{a} \cdot \mathbf{b})
- (||\mathbf{a}||) (magnitude/norm)
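One possible solution sketch in NumPy:

```python
import numpy as np

a = np.array([2, 3, -1])
b = np.array([1, -2, 4])

print("a + b =", a + b)              # [3, 1, 3]
print("a · b =", np.dot(a, b))       # 2 - 6 - 4 = -8
print("||a|| =", np.linalg.norm(a))  # sqrt(4 + 9 + 1) ≈ 3.742
```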
Problem 2: Probability
A dataset of house prices follows a normal distribution with mean $350,000 and standard deviation $50,000. What's the probability a randomly selected house costs between $300,000 and $400,000?
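One way to compute this with SciPy's normal CDF, assuming the distribution parameters and price range stated in the problem above:

```python
from scipy import stats

mean, std = 350_000, 50_000
lower, upper = 300_000, 400_000

# P(lower <= X <= upper) = CDF(upper) - CDF(lower)
prob = stats.norm.cdf(upper, loc=mean, scale=std) - stats.norm.cdf(lower, loc=mean, scale=std)
print(f"P({lower} <= price <= {upper}) = {prob:.3f}")  # ~0.683, the familiar ±1σ range
```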
Problem 3: Gradient Descent
Implement one step of gradient descent for (f(x) = x^2 - 6x + 9) starting at (x = 5) with learning rate (\alpha = 0.1).
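A sketch of a single update, with the derivative f'(x) = 2x − 6 written out analytically:

```python
def grad(x):
    # Derivative of f(x) = x^2 - 6x + 9
    return 2 * x - 6

x = 5.0
alpha = 0.1

# One gradient descent step: x_new = x - alpha * f'(x)
x_new = x - alpha * grad(x)
print(x_new)  # 5 - 0.1 * 4 = 4.6  (the minimum is at x = 3)
```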
Next Steps
You now have the mathematical foundation for machine learning! In the next lesson, we'll formalize the supervised learning framework, exploring:
- Training vs testing data
- Loss functions in detail
- The bias-variance tradeoff
- Model capacity and generalization
These mathematical tools will appear in every ML algorithm we study. Keep this lesson as a reference – you'll return to it often!
Further Reading
- Linear Algebra: Introduction to Linear Algebra by Gilbert Strang
- Probability: Think Stats by Allen Downey
- Calculus: Calculus by James Stewart
- ML Math: Mathematics for Machine Learning by Deisenroth, Faisal, Ong (free online)
Remember: Don't try to memorize every formula. Focus on understanding the intuition – the formulas are just precise ways to express ideas you already understand!