The Supervised Learning Framework
Understand the core concepts: training vs testing, loss functions, empirical risk minimization, and the bias-variance decomposition.
Introduction: Learning from Examples
Imagine you're teaching a child to identify poisonous mushrooms. You don't explain the underlying biology – instead, you show them examples: "This one with red spots is poisonous. This brown one is safe. This one with gills underneath is poisonous." The child learns a pattern by seeing many labeled examples.
This is supervised learning: learning from labeled data to make predictions on new, unseen examples. It's called "supervised" because we provide the "correct answers" (labels) during training – like a teacher supervising a student.
In this lesson, we'll formalize the supervised learning framework and explore the fundamental concepts that underlie every ML algorithm you'll learn.
The single hardest idea here is the bias-variance tradeoff — the tension between a model that's too simple and one that's too complex. Play with it now, then we'll build up the theory behind what you're seeing:
Try it: drag the model-complexity control from low to high and watch how the fitted curve changes — at low complexity it stays too flat to follow the data (underfitting), and at high complexity it wiggles to chase every point (overfitting). The sweet spot is in the middle.
Learning Objectives
By the end of this lesson, you'll understand:
- The supervised learning problem formulation
- Training, validation, and test sets
- Loss functions and empirical risk minimization
- The bias-variance tradeoff
- Overfitting and underfitting
- Model capacity and generalization
1. The Supervised Learning Problem
Formal Definition
Given:
- A dataset (\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\})
- (\mathbf{x}_i \in \mathbb{R}^d): input features (d-dimensional vector)
- (y_i): output label (real number for regression, category for classification)
Goal: Learn a function (f: \mathbb{R}^d \rightarrow \mathbb{R}) (or (\mathbb{R}^k) for classification) that:
- Fits the training data well (low training error)
- Generalizes to new data (low test error)
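To make the notation concrete, here is a minimal sketch of a dataset and a candidate function in NumPy. The toy numbers, the linear form of `f`, and the parameter names `w` and `b` are illustrative assumptions, not part of the formal definition.

```python
import numpy as np

# Hypothetical toy dataset: n = 4 examples, d = 2 features each.
X = np.array([[1.0, 2.0],
              [0.5, 1.5],
              [3.0, 0.0],
              [2.0, 2.0]])            # shape (n, d): one row per x_i
y = np.array([5.0, 3.5, 3.0, 6.0])    # shape (n,): one label y_i per row

def f(X, w, b):
    """A candidate function f: R^d -> R, here assumed to be linear."""
    return X @ w + b

# One particular choice of parameters (picked by hand for illustration).
w = np.array([1.0, 2.0])
b = 0.0

y_hat = f(X, w, b)                     # predictions for every example
training_error = np.mean((y - y_hat) ** 2)
print(y_hat, training_error)
```

Here the hand-picked parameters happen to reproduce the labels exactly, so the training error is zero. That says nothing yet about the second goal, generalizing to new data.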
Two Main Tasks
| Task | Output | Example |
|---|---|---|
| Regression | Continuous value | Predict house price ($350,000) |
| Classification | Discrete category | Predict email spam (yes/no) |
2. Training, Validation, and Test Sets
Why Split the Data?
The Fundamental Problem: We care about performance on future, unseen data, not memorizing the training examples.
Analogy: Preparing for an exam by memorizing the practice test questions isn't useful if the real exam has different questions. You need to understand the concepts to generalize.
The Three-Way Split
| Set | Purpose | Usage |
|---|---|---|
| Training | Learn model parameters | Optimize weights to minimize loss |
| Validation | Select model & hyperparameters | Compare different models, tune settings |
| Test | Final performance estimate | Only used once at the end |
Important Rules
- Never train on test data – it's your "sealed envelope" for final evaluation
- Don't tune on test data – use validation set for hyperparameter selection
- Test set performance is your best estimate of real-world performance, but only if the test set is used once and resembles the data the model will see after deployment
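One common way to get a three-way split is to call scikit-learn's `train_test_split` twice: once to seal away the test set, and once to divide what remains. The sketch below assumes synthetic data and a 70/15/15 split; the sizes and the random seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 examples, 5 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Step 1: set aside 150 examples (15%) as the sealed test set.
# stratify=y keeps the class proportions similar in both parts.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=150, stratify=y, random_state=0
)

# Step 2: split the remaining 850 into 700 training and 150 validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=150, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

Passing integer sizes keeps the counts exact. Fractions such as `test_size=0.15` also work, but the second fraction is then relative to the remainder, not to the full dataset.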
3. Loss Functions: Measuring Error
A loss function (\mathcal{L}(y, \hat{y})) quantifies how wrong our prediction (\hat{y}) is compared to the true value (y).
Common Loss Functions
For Regression: Mean Squared Error (MSE)
[ \mathcal{L}_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 ]
Intuition: Penalize large errors heavily (squared term). Prediction off by 2 is 4× worse than off by 1.
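The "4× worse" claim is easy to check numerically. This is a minimal NumPy sketch; the function name `mse` and the toy values are illustrative assumptions.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

y_true = [3.0, 5.0, 7.0]

off_by_one = mse(y_true, [4.0, 6.0, 8.0])   # every prediction off by 1
off_by_two = mse(y_true, [5.0, 7.0, 9.0])   # every prediction off by 2

print(off_by_one, off_by_two)  # 1.0 4.0
```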
For Classification: Cross-Entropy Loss
[ \mathcal{L}_{\text{CE}} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] ]
Intuition: Penalize confident wrong predictions heavily. If the truth is 1 but you predict 0.01, the loss is -log(0.01) ≈ 4.6; predicting 0.99 instead costs only about 0.01.
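A minimal sketch of the binary cross-entropy formula above, using the natural logarithm. The function name and the clipping constant `eps` (which guards against `log(0)`) are assumptions added for this illustration.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average binary cross-entropy for labels in {0, 1}."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predicted probabilities away from exactly 0 and 1.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

confident_right = binary_cross_entropy([1], [0.99])  # about 0.01
confident_wrong = binary_cross_entropy([1], [0.01])  # about 4.61

print(confident_right, confident_wrong)
```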
Empirical Risk Minimization
The training objective is to minimize the empirical risk (average loss on training data):
[ \mathbf{w}^* = \arg\min_{\mathbf{w}} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(y_i, f(\mathbf{x}_i; \mathbf{w})) ]
This is what gradient descent and other optimization algorithms do!
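As a sketch of what "minimizing empirical risk" looks like in practice, the loop below runs plain gradient descent on the MSE of a linear model. The synthetic data, learning rate, and step count are arbitrary assumptions chosen so the example converges quickly.

```python
import numpy as np

# Synthetic data from an assumed ground truth y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, size=n)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=n)

# Design matrix with a column of ones so w = [slope, intercept].
X = np.column_stack([x, np.ones(n)])

def empirical_risk(w):
    """Average squared loss of the linear model on the training data."""
    return np.mean((y - X @ w) ** 2)

w = np.zeros(2)
learning_rate = 0.1
for step in range(2000):
    # Gradient of the mean squared error with respect to w.
    grad = -2.0 / n * X.T @ (y - X @ w)
    w -= learning_rate * grad

print(w, empirical_risk(w))
```

For this particular loss and model the minimizer also has a closed form (ordinary least squares), so gradient descent is not strictly needed here. It is shown because the same loop carries over to models where no closed form exists.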
4. The Bias-Variance Tradeoff
Before the math, see it: open R2D3's Visual Introduction to ML, Part 2 in another tab. Scroll through once. The dart-board illustration you're about to read makes ten times more sense afterwards.
The Central Challenge of Machine Learning
The Dilemma: We want a model that:
- Fits the training data well (low bias)
- Generalizes to new data (low variance)
But these goals are often in tension!
Definitions
Bias: Error from overly simplistic assumptions
- High bias → underfitting → model too simple
- Can't capture underlying pattern
Variance: Error from sensitivity to training data noise
- High variance → overfitting → model too complex
- Memorizes noise, doesn't generalize
Now revisit the BiasVarianceExplorer at the top of the lesson with these definitions in mind — the complexity control is moving you along exactly this bias-variance axis.
The Decomposition
For squared-error loss, the expected test error of any model (averaged over possible training sets) can be decomposed:
[ \text{Expected Test Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} ]
- Bias²: The squared gap between the model's average prediction (averaged over training sets) and the truth
- Variance: How much the prediction at a given input changes from one training set to another
- Irreducible Error: Noise in the data (can't be reduced)
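The bias and variance terms can be estimated by simulation: draw many training sets, fit the same model class to each, and compare the predictions at fixed test inputs. The sketch below assumes a sine-wave ground truth, Gaussian noise, and polynomial models; all of these are illustrative choices, not part of the decomposition itself.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_f(x):
    """Assumed ground-truth function for this simulation."""
    return np.sin(2 * np.pi * x)

noise_std = 0.3          # standard deviation of the irreducible noise
n_train = 30             # examples per training set
n_repeats = 500          # number of independent training sets
x_test = np.linspace(0.05, 0.95, 50)

def bias_variance(degree):
    """Estimate bias^2 and variance of degree-`degree` polynomial fits."""
    preds = np.empty((n_repeats, x_test.size))
    for r in range(n_repeats):
        x = rng.uniform(0, 1, size=n_train)
        y = true_f(x) + rng.normal(scale=noise_std, size=n_train)
        preds[r] = Polynomial.fit(x, y, deg=degree)(x_test)
    mean_pred = preds.mean(axis=0)                     # average model
    bias_sq = np.mean((mean_pred - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))              # spread across fits
    return bias_sq, variance

results = {degree: bias_variance(degree) for degree in (1, 3, 9)}
for degree, (bias_sq, variance) in results.items():
    print(f"degree {degree}: bias^2 = {bias_sq:.3f}, variance = {variance:.3f}")
```

In runs like this, the straight line should show the largest bias and the degree-9 polynomial the largest variance. The irreducible error is `noise_std ** 2` regardless of the model.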
5. Overfitting and Underfitting
Visual Intuition
You already saw this interactively in the BiasVarianceExplorer at the top: a degree-1 line underfits, a moderate degree fits well, and a very high degree overfits by snaking through every noisy point. The pattern in the error numbers is what to remember:
- Underfitting (too simple): high training error and high test error.
- Good fit (just right): low training error and low test error, with only a small gap between them.
- Overfitting (too complex): very low training error, but high test error.
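The pattern in the error numbers can be reproduced with a few lines of NumPy. This sketch assumes a noisy sine wave as the data source and polynomial degrees 1, 3, and 12 as the "too simple", "moderate", and "too complex" models; the exact numbers depend on the random seed.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def true_f(x):
    """Assumed ground-truth function for this illustration."""
    return np.sin(2 * np.pi * x)

# A small training set and a large held-out set from the same source.
x_train = rng.uniform(0, 1, size=20)
y_train = true_f(x_train) + rng.normal(scale=0.3, size=20)
x_test = rng.uniform(0, 1, size=1000)
y_test = true_f(x_test) + rng.normal(scale=0.3, size=1000)

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

errors = {}
for degree in (1, 3, 12):
    model = Polynomial.fit(x_train, y_train, deg=degree)
    errors[degree] = (mse(y_train, model(x_train)), mse(y_test, model(x_test)))
    print(f"degree {degree:2d}: train MSE = {errors[degree][0]:.3f}, "
          f"test MSE = {errors[degree][1]:.3f}")
```

Training error can only go down as the degree rises, because each higher-degree model contains the lower-degree ones. The test error is the number that reveals whether the extra flexibility helped.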
Signs of Overfitting
- Training error much lower than validation error (large gap)
- Model performs well on training data, poorly on new data
- Model is very complex relative to amount of training data
- Training error keeps decreasing but validation error increases
Signs of Underfitting
- Both training and validation errors are high
- Model is too simple to capture patterns
- Learning curves plateau at high error
Solutions
| Problem | Solutions |
|---|---|
| Underfitting | • Increase model complexity<br>• Add more features<br>• Reduce regularization<br>• Train longer |
| Overfitting | • Get more training data<br>• Reduce model complexity<br>• Add regularization<br>• Early stopping<br>• Dropout/data augmentation |
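As one example from the table, here is a minimal sketch of L2 regularization (ridge regression) applied to an over-flexible polynomial model. The data, the degree, and the penalty strengths are assumptions for illustration, and for simplicity the penalty is applied to every coefficient including the intercept.

```python
import numpy as np

rng = np.random.default_rng(2)

# Small noisy dataset and degree-12 polynomial features (13 columns).
x = rng.uniform(-1, 1, size=20)
y = np.sin(np.pi * x) + rng.normal(scale=0.3, size=20)
X = np.vander(x, N=13, increasing=True)

def ridge_fit(X, y, lam):
    """Minimize ||y - Xw||^2 + lam * ||w||^2 via an augmented least-squares problem."""
    n_features = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(n_features)])
    y_aug = np.concatenate([y, np.zeros(n_features)])
    w, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return w

norms = {}
for lam in (0.0, 1e-3, 1.0):
    w = ridge_fit(X, y, lam)
    norms[lam] = np.linalg.norm(w)
    print(f"lambda = {lam:g}: ||w|| = {norms[lam]:.2f}")
```

A stronger penalty always shrinks the coefficient vector. Whether that improves validation error depends on the data, which is why the penalty strength is tuned on the validation set rather than fixed in advance.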
6. Model Capacity and Generalization
Model Capacity
Model capacity: The range of functions a model can represent.
- Low capacity: Linear models, shallow trees (risk: underfitting)
- High capacity: Deep neural networks, high-degree polynomials (risk: overfitting)
Key Principle: Match model capacity to:
- Problem complexity
- Amount of training data
The Golden Rule
[ \text{Model Capacity} \propto \sqrt{\text{Training Data Size}} ]
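Read this as a rough heuristic rather than a theorem: the capacity you can afford grows with the amount of data, but the exact rate depends on the problem, the noise level, and how much regularization you use. More data → can use more complex models.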
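The effect of data size on a fixed high-capacity model can be seen in a small experiment. The sketch below fits the same degree-9 polynomial to a small and a large training set and measures the error against an assumed noise-free ground truth; the function, sizes, and seed are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(3)

def true_f(x):
    """Assumed ground-truth function for this illustration."""
    return np.sin(2 * np.pi * x)

x_grid = np.linspace(0, 1, 200)   # where the fitted curve is evaluated

def error_vs_truth(n_train, degree=9, noise_std=0.3):
    """Fit a polynomial to n_train noisy points; return MSE against the true function."""
    x = rng.uniform(0, 1, size=n_train)
    y = true_f(x) + rng.normal(scale=noise_std, size=n_train)
    model = Polynomial.fit(x, y, deg=degree)
    return np.mean((true_f(x_grid) - model(x_grid)) ** 2)

err_small = error_vs_truth(n_train=15)
err_large = error_vs_truth(n_train=2000)
print(f"15 examples: {err_small:.4f}   2000 examples: {err_large:.4f}")
```

With only 15 examples the degree-9 model has almost as many parameters as data points and mostly fits noise. With 2000 examples the same model has enough evidence to pin down its coefficients.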
7. Practical Guidelines
Checklist for Supervised Learning
- Split your data properly
  - 60-70% training, 15-20% validation, 15-20% test
  - Use stratified sampling for classification
- Choose appropriate loss function
  - Regression: MSE, MAE, Huber
  - Classification: Cross-entropy, hinge loss
- Start simple, increase complexity
  - Begin with linear models
  - Add complexity only if needed
- Monitor training and validation errors
  - Gap widening? → Overfitting
  - Both high? → Underfitting
- Use validation set to tune
  - Model selection
  - Hyperparameters
  - Early stopping
- Test set: use only once!
  - Final performance estimate
  - Report this as your result
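The checklist can be sketched end to end in a few lines. The example below assumes a synthetic regression problem and uses polynomial degree as the only hyperparameter; the split sizes, candidate degrees, and seed are illustrative choices.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(4)

# Synthetic regression data (assumed ground truth: a noisy sine wave).
n = 300
x = rng.uniform(0, 1, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

# 1. Split: 70% train, 15% validation, 15% test.
idx = rng.permutation(n)
train, val, test = idx[:210], idx[210:255], idx[255:]

# 2. Loss: MSE for regression.
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# 3-5. Start simple, increase complexity, and compare on the validation set.
models, val_errors = {}, {}
for degree in range(1, 11):
    models[degree] = Polynomial.fit(x[train], y[train], deg=degree)
    val_errors[degree] = mse(y[val], models[degree](x[val]))

best_degree = min(val_errors, key=val_errors.get)

# 6. Touch the test set exactly once, with the chosen model.
test_mse = mse(y[test], models[best_degree](x[test]))
print(f"chosen degree = {best_degree}, test MSE = {test_mse:.3f}")
```

The test set plays no role in choosing `best_degree`, so `test_mse` is not biased by the selection step.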
Key Takeaways
✓ Supervised Learning: Learn from labeled examples to predict on new data
✓ Data Splits: Training (learn), Validation (tune), Test (evaluate once)
✓ Loss Functions: Quantify prediction error
- MSE for regression
- Cross-entropy for classification
✓ Bias-Variance Tradeoff: Balance model complexity
- High bias → underfitting (too simple)
- High variance → overfitting (too complex)
✓ Generalization: True goal is performance on unseen data, not memorizing training data
✓ Model Capacity: Match complexity to data size and problem difficulty
Practice Problems
Problem 1: Data Splitting
You have 5000 labeled images. Split them into train/validation/test sets using scikit-learn.
Problem 2: Identify the Problem
Describe whether each scenario is underfitting or overfitting:
Problem 3: Calculate MSE
Implement MSE loss from scratch and compare with sklearn.
Next Steps
You now understand the framework for supervised learning. In the next lessons, we'll dive into specific algorithms:
- Lesson 3: Linear Regression – the simplest regression algorithm
- Lesson 4: Logistic Regression – binary classification
- Lesson 5: Regularization – preventing overfitting
Each algorithm is a different way to minimize loss and find the best function (f) for your data!
Further Reading
Interactive Visualizations
- A Visual Introduction to Machine Learning — Part 2: Bias-Variance (R2D3) — the best scroll-story ever made on overfitting, variance, and why cross-validation matters.
- MLU-Explain: The Bias-Variance Tradeoff — interactive dart-board and curve-fitting demo with adjustable complexity.
- MLU-Explain: Train, Test, and Validation Sets — why we split, what each split is for, and what goes wrong when you peek at the test set.
- Scott Fortmann-Roe: Understanding the Bias-Variance Tradeoff — the classic essay with target-shooting illustrations.
Video Courses
- StatQuest — Bias and Variance (Josh Starmer) — intuitive 7-minute breakdown.
- Google ML Crash Course — Generalization — short interactive lessons on overfitting, splits, and regularization.
Papers & Articles
- A Few Useful Things to Know About Machine Learning — Pedro Domingos, CACM 2012. Twelve evergreen lessons on generalization, bias-variance, and the curse of dimensionality.
- Rethinking Bias-Variance Trade-off for Generalization of Neural Networks — Yang et al., ICML 2020. Double-descent and how modern over-parameterized models reshape the classical picture.
Documentation & Books
- Book: Pattern Recognition and Machine Learning — Christopher Bishop (Chapter 1)
- Book: Understanding Machine Learning: From Theory to Algorithms — Shalev-Shwartz & Ben-David (free PDF)
- scikit-learn: Cross-validation — the canonical how-to.
Remember: The goal isn't to memorize training data – it's to learn patterns that generalize to new data. This is the essence of machine learning!