Introduction: Learning from Examples
Imagine you're teaching a child to identify poisonous mushrooms. You don't explain the underlying biology – instead, you show them examples: "This one with red spots is poisonous. This brown one is safe. This one with gills underneath is poisonous." The child learns a pattern by seeing many labeled examples.
This is supervised learning: learning from labeled data to make predictions on new, unseen examples. It's called "supervised" because we provide the "correct answers" (labels) during training – like a teacher supervising a student.
In this lesson, we'll formalize the supervised learning framework and explore the fundamental concepts that underlie every ML algorithm you'll learn.
Learning Objectives
By the end of this lesson, you'll understand:
- The supervised learning problem formulation
- Training, validation, and test sets
- Loss functions and empirical risk minimization
- The bias-variance tradeoff
- Overfitting and underfitting
- Model capacity and generalization
1. The Supervised Learning Problem
Formal Definition
Given:
- A dataset (\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\})
- (\mathbf{x}_i \in \mathbb{R}^d): input features (d-dimensional vector)
- (y_i): output label (real number for regression, category for classification)
Goal: Learn a function (f: \mathbb{R}^d \rightarrow \mathbb{R}) (or (\mathbb{R}^k) for classification with (k) classes, one score per class) that:
- Fits the training data well (low training error)
- Generalizes to new data (low test error)
Two Main Tasks
Task | Output | Example |
---|---|---|
Regression | Continuous value | Predict house price ($350,000) |
Classification | Discrete category | Predict email spam (yes/no) |
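To make the notation concrete, here is a minimal sketch using NumPy. The synthetic data, the coefficients in `true_w`, and the names `X`, `y_reg`, and `y_clf` are illustrative assumptions, not part of the lesson's original material.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 3                      # n examples, d features per example
X = rng.normal(size=(n, d))        # each row is one input vector x_i

# Regression: the label is a real number (here, a noisy linear function of x).
true_w = np.array([2.0, -1.0, 0.5])             # made-up coefficients
y_reg = X @ true_w + rng.normal(scale=0.1, size=n)

# Classification: the label is a discrete category (here, 0 or 1).
y_clf = (X[:, 0] + X[:, 1] > 0).astype(int)

print(X.shape, y_reg.shape, y_clf.shape)        # (100, 3) (100,) (100,)
```

The only difference between the two tasks in this sketch is the type of the label array: floating-point values for regression, integer class ids for classification.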
2. Training, Validation, and Test Sets
Why Split the Data?
The Fundamental Problem: We care about performance on future, unseen data, not memorizing the training examples.
Analogy: Preparing for an exam by memorizing the practice test questions isn't useful if the real exam has different questions. You need to understand the concepts to generalize.
The Three-Way Split
Set | Purpose | Usage |
---|---|---|
Training | Learn model parameters | Optimize weights to minimize loss |
Validation | Select model & hyperparameters | Compare different models, tune settings |
Test | Final performance estimate | Only used once at the end |
Important Rules
- Never train on test data – it's your "sealed envelope" for final evaluation
- Don't tune on test data – use validation set for hyperparameter selection
- Test set performance is your best estimate of real-world performance, and it stays unbiased only if the test set was never used for training or tuning
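One way to produce the three sets is to shuffle the indices once and cut them into three parts. The sketch below uses only NumPy; the helper name `three_way_split` and the 70/15/15 proportions are assumptions chosen for illustration.

```python
import numpy as np

def three_way_split(X, y, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into train / validation / test partitions."""
    n = len(X)
    idx = np.random.default_rng(seed).permutation(n)
    n_test = int(round(test_frac * n))
    n_val = int(round(val_frac * n))
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]            # everything that is left
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))

X = np.arange(200).reshape(100, 2)              # toy inputs
y = np.arange(100)                              # toy labels
(train_X, train_y), (val_X, val_y), (test_X, test_y) = three_way_split(X, y)
print(len(train_X), len(val_X), len(test_X))    # 70 15 15
```

Because every index lands in exactly one partition, no test example can leak into training.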
3. Loss Functions: Measuring Error
A loss function (\mathcal{L}(y, \hat{y})) quantifies how wrong our prediction (\hat{y}) is compared to the true value (y).
Common Loss Functions
For Regression: Mean Squared Error (MSE)
[ \mathcal{L}_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 ]
Intuition: Penalize large errors heavily (squared term). Prediction off by 2 is 4× worse than off by 1.
For Classification: Cross-Entropy Loss
[ \mathcal{L}_{\text{CE}} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] ]
Intuition: Penalize confident wrong predictions heavily. If truth is 1 but you predict 0.01, loss is huge.
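Both losses fit in a few lines of NumPy. This is a sketch rather than a reference implementation: the function names are made up, and the `eps` clipping (to avoid `log(0)`) is an added assumption that the formulas above do not mention.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, averaged over all examples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy; p_pred holds predicted probabilities of class 1."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(mse([0.0], [1.0]), mse([0.0], [2.0]))     # 1.0 4.0
print(binary_cross_entropy([1], [0.99]))        # about 0.01
print(binary_cross_entropy([1], [0.01]))        # about 4.6
```

The last two lines show the asymmetry described above: a confident correct prediction costs almost nothing, while a confident wrong one costs several hundred times more.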
Empirical Risk Minimization
The training objective is to minimize the empirical risk (average loss on training data):
[ \mathbf{w}^* = \arg\min_{\mathbf{w}} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(y_i, f(\mathbf{x}_i; \mathbf{w})) ]
Gradient descent and other optimization algorithms are methods for solving this minimization problem.
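As a rough illustration, the sketch below minimizes the empirical risk of a linear model under MSE with plain gradient descent. The synthetic data, the learning rate, and the step count are assumptions chosen so that the loop converges on this toy problem.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([3.0, -2.0])                  # made-up "ground truth" weights
y = X @ true_w + rng.normal(scale=0.1, size=200)

def empirical_risk(w):
    """Average squared loss of the linear model f(x; w) = x . w on the data."""
    return np.mean((y - X @ w) ** 2)

w = np.zeros(2)                                 # start from all-zero weights
lr = 0.1                                        # learning rate (assumed value)
initial_risk = empirical_risk(w)
for _ in range(500):
    grad = -2.0 * X.T @ (y - X @ w) / len(y)    # gradient of the MSE w.r.t. w
    w -= lr * grad                              # step against the gradient
final_risk = empirical_risk(w)

print(w, initial_risk, final_risk)
```

After the loop, `w` should sit close to `true_w`, and the remaining risk is roughly the variance of the injected noise.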
4. The Bias-Variance Tradeoff
The Central Challenge of Machine Learning
The Dilemma: We want a model that:
- Fits the training data well (low bias)
- Generalizes to new data (low variance)
But these goals are often in tension!
Definitions
Bias: Error from overly simplistic assumptions
- High bias → underfitting → model too simple
- Can't capture underlying pattern
Variance: Error from sensitivity to training data noise
- High variance → overfitting → model too complex
- Memorizes noise, doesn't generalize
The Decomposition
For squared-error loss, the expected test error of a model can be decomposed:
[ \text{Expected Test Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} ]
- Bias²: Squared difference between the model's average prediction (averaged over possible training sets) and the truth
- Variance: How much predictions vary across different training sets
- Irreducible Error: Noise in the data (can't be reduced)
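The decomposition can be estimated by simulation: draw many training sets, fit the same model class to each, and look at the spread of the predictions. The sketch below assumes a sine target, Gaussian noise, and polynomial models; all of these choices, and the function name `bias2_and_variance`, are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = np.sin                                 # assumed ground-truth function
noise_std = 0.3                                 # assumed label noise
x_test = np.linspace(0.0, np.pi, 50)            # fixed evaluation points

def bias2_and_variance(degree, n_train=30, n_repeats=200):
    """Estimate bias^2 and variance of a degree-`degree` polynomial fit."""
    preds = np.empty((n_repeats, x_test.size))
    for r in range(n_repeats):
        # A fresh training set for every repeat.
        x = rng.uniform(0.0, np.pi, size=n_train)
        y = true_f(x) + rng.normal(scale=noise_std, size=n_train)
        coeffs = np.polyfit(x, y, degree)
        preds[r] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)              # average model over training sets
    bias2 = np.mean((mean_pred - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

results = {degree: bias2_and_variance(degree) for degree in (0, 2, 8)}
for degree, (b2, var) in results.items():
    print(f"degree={degree}: bias^2={b2:.4f}, variance={var:.4f}")
```

In this setup the constant model (degree 0) is dominated by bias, while the degree-8 model has little bias but a noticeably larger variance.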
5. Overfitting and Underfitting
Visual Intuition
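A numeric stand-in for the picture: fit polynomials of increasing degree to the same small noisy dataset and compare training and validation error. The target function, noise level, sample sizes, and degrees below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=n)   # assumed target + noise
    return x, y

x_train, y_train = make_data(20)                # small training set
x_val, y_val = make_data(200)                   # held-out validation set

def fit_and_score(degree):
    """Fit on the training set, return (train MSE, validation MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    return train_mse, val_mse

scores = {degree: fit_and_score(degree) for degree in (1, 3, 12)}
for degree, (train_mse, val_mse) in scores.items():
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  val MSE={val_mse:.3f}")
```

Training error can only go down as the degree rises, because each higher-degree model contains the lower-degree ones. Validation error typically falls at first and then rises again once the model starts fitting noise.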
Signs of Overfitting
- Training error much lower than validation error (large gap)
- Model performs well on training data, poorly on new data
- Model is very complex relative to amount of training data
- Training error keeps decreasing but validation error increases
Signs of Underfitting
- Both training and validation errors are high
- Model is too simple to capture patterns
- Learning curves plateau at high error
Solutions
Problem | Solutions |
---|---|
Underfitting | • Increase model complexity<br>• Add more features<br>• Reduce regularization<br>• Train longer |
Overfitting | • Get more training data<br>• Reduce model complexity<br>• Add regularization<br>• Early stopping<br>• Dropout/data augmentation |
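To illustrate one remedy from the table, the sketch below adds L2 regularization (ridge regression) to a high-degree polynomial fit. The closed-form solver, the data, and both penalty strengths are assumptions for this example. For simplicity the intercept is penalized too, which practical implementations often avoid.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=15)
y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=15)      # assumed noisy target

degree = 9
Phi = np.vander(x, degree + 1)          # polynomial features, shape (15, 10)

def ridge_fit(Phi, y, lam):
    """Minimize ||Phi w - y||^2 + lam * ||w||^2 in closed form."""
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)

w_weak = ridge_fit(Phi, y, lam=1e-6)    # almost no regularization
w_strong = ridge_fit(Phi, y, lam=1e-1)  # stronger penalty on large weights

print(np.linalg.norm(w_weak), np.linalg.norm(w_strong))
```

A larger penalty shrinks the weight vector, which limits how wildly the fitted curve can swing between training points.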
6. Model Capacity and Generalization
Model Capacity
Model capacity: The range of functions a model can represent.
- Low capacity: Linear models, shallow trees (risk: underfitting)
- High capacity: Deep neural networks, high-degree polynomials (risk: overfitting)
Key Principle: Match model capacity to:
- Problem complexity
- Amount of training data
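The interaction between capacity and data size can be sketched by holding the model fixed and varying only the number of training points. The degree, target function, noise level, and sample sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
degree = 9                                      # fixed, fairly high capacity

def make_data(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=n)   # assumed target + noise
    return x, y

x_val, y_val = make_data(1000)                  # shared validation set

val_mse = {}
for n_train in (12, 50, 1000):
    x_train, y_train = make_data(n_train)
    coeffs = np.polyfit(x_train, y_train, degree)
    val_mse[n_train] = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"n_train={n_train:4d}  val MSE={val_mse[n_train]:.3f}")
```

With barely more points than parameters the degree-9 model generalizes poorly. Given enough data, the same model approaches the noise floor.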
The Golden Rule
[ \text{Model Capacity} \propto \sqrt{\text{Training Data Size}} ]
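More data → Can use more complex models. Treat the square-root relation as a rough heuristic rather than a law, since the capacity that works best also depends on the noise level, the regularization used, and the problem itself.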
7. Practical Guidelines
Checklist for Supervised Learning
- Split your data properly
  - 60-70% training, 15-20% validation, 15-20% test
  - Use stratified sampling for classification
- Choose appropriate loss function
  - Regression: MSE, MAE, Huber
  - Classification: Cross-entropy, hinge loss
- Start simple, increase complexity
  - Begin with linear models
  - Add complexity only if needed
- Monitor training and validation errors
  - Gap widening? → Overfitting
  - Both high? → Underfitting
- Use validation set to tune
  - Model selection
  - Hyperparameters
  - Early stopping
- Test set: use only once!
  - Final performance estimate
  - Report this as your result
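The checklist mentions MAE and Huber loss without defining them. The sketch below uses their standard definitions. The function names, the default `delta=1.0`, and the toy numbers are assumptions for illustration.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: grows linearly with the size of each miss."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(err))

def huber(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic for small errors, linear beyond `delta`."""
    err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return np.mean(np.where(err <= delta, quadratic, linear))

y_true = [0.0, 0.0, 0.0, 0.0]
y_pred = [0.5, -0.5, 0.5, 10.0]                 # the last prediction is a large miss

mse_value = np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
print(mse_value, mae(y_true, y_pred), huber(y_true, y_pred))  # 25.1875 2.875 2.46875
```

The single large miss dominates the MSE, while MAE and Huber are affected far less. That is why they are common choices when the data contains outliers.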
Key Takeaways
✓ Supervised Learning: Learn from labeled examples to predict on new data
✓ Data Splits: Training (learn), Validation (tune), Test (evaluate once)
✓ Loss Functions: Quantify prediction error
- MSE for regression
- Cross-entropy for classification
✓ Bias-Variance Tradeoff: Balance model complexity
- High bias → underfitting (too simple)
- High variance → overfitting (too complex)
✓ Generalization: True goal is performance on unseen data, not memorizing training data
✓ Model Capacity: Match complexity to data size and problem difficulty
Practice Problems
Problem 1: Data Splitting
You have 5000 labeled images. Split them into train/validation/test sets using scikit-learn.
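One possible approach calls scikit-learn's `train_test_split` twice. The random arrays below stand in for real images and labels, and the 70/15/15 proportions are an assumption. Passing integer sizes avoids rounding surprises.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((5000, 64))                      # placeholder for 5000 flattened images
y = rng.integers(0, 10, size=5000)              # placeholder labels for 10 classes

n_holdout = len(X) * 15 // 100                  # 750 examples each for val and test

# First carve off the test set, stratified by label.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=n_holdout, stratify=y, random_state=42)

# Then split the remainder into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=n_holdout, stratify=y_rest, random_state=42)

print(len(X_train), len(X_val), len(X_test))    # 3500 750 750
```

Stratifying on the labels keeps the class proportions roughly equal across the three sets.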
Problem 2: Identify the Problem
Describe whether each scenario is underfitting or overfitting:
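A scenario can be encoded as a pair of training and validation errors. The three scenarios below and the thresholds inside the hypothetical `diagnose` helper are made-up numbers for practice, not results from a real experiment.

```python
def diagnose(train_error, val_error, acceptable_error=0.10, max_gap=0.05):
    """Rule-of-thumb diagnosis; both thresholds are arbitrary assumptions."""
    if train_error > acceptable_error:
        return "underfitting"                   # cannot even fit the training data
    if val_error - train_error > max_gap:
        return "overfitting"                    # large train/validation gap
    return "reasonable fit"

# Hypothetical (training error, validation error) pairs.
scenarios = {
    "A": (0.01, 0.30),
    "B": (0.28, 0.30),
    "C": (0.06, 0.08),
}
diagnoses = {name: diagnose(tr, va) for name, (tr, va) in scenarios.items()}
for name, verdict in diagnoses.items():
    print(name, verdict)
# A overfitting
# B underfitting
# C reasonable fit
```

Try to predict each verdict from the numbers before running the code.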
Problem 3: Calculate MSE
Implement MSE loss from scratch and compare with sklearn.
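A minimal sketch of one solution, checked against `sklearn.metrics.mean_squared_error`. The example values are arbitrary and the function name `mse_from_scratch` is an assumption.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def mse_from_scratch(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ours = mse_from_scratch(y_true, y_pred)         # (0.25 + 0.25 + 0 + 1) / 4
theirs = mean_squared_error(y_true, y_pred)
print(ours, theirs)                             # 0.375 0.375
```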
Next Steps
You now understand the framework for supervised learning. In the next lessons, we'll dive into specific algorithms:
- Lesson 3: Linear Regression – the simplest regression algorithm
- Lesson 4: Logistic Regression – binary classification
- Lesson 5: Regularization – preventing overfitting
Each algorithm is a different way to minimize loss and find the best function (f) for your data!
Further Reading
- Bias-Variance: Understanding the Bias-Variance Tradeoff
- Overfitting: Overfitting in Machine Learning
- Cross-Validation: Cross-validation: evaluating estimator performance
- Book: Pattern Recognition and Machine Learning by Christopher Bishop (Chapter 1)
Remember: The goal isn't to memorize training data – it's to learn patterns that generalize to new data. This is the essence of machine learning!