Introduction: Learning from Examples
Imagine you're teaching a child to identify poisonous mushrooms. You don't explain the underlying biology – instead, you show them examples: "This one with red spots is poisonous. This brown one is safe. This one with gills underneath is poisonous." The child learns a pattern by seeing many labeled examples.
This is supervised learning: learning from labeled data to make predictions on new, unseen examples. It's called "supervised" because we provide the "correct answers" (labels) during training – like a teacher supervising a student.
In this lesson, we'll formalize the supervised learning framework and explore the fundamental concepts that underlie every ML algorithm you'll learn.
Learning Objectives
By the end of this lesson, you'll understand:
- The supervised learning problem formulation
- Training, validation, and test sets
- Loss functions and empirical risk minimization
- The bias-variance tradeoff
- Overfitting and underfitting
- Model capacity and generalization
1. The Supervised Learning Problem
Formal Definition
Given:
- A dataset (\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\})
- (\mathbf{x}_i \in \mathbb{R}^d): input features (d-dimensional vector)
- (y_i): output label (real number for regression, category for classification)
Goal: Learn a function (f: \mathbb{R}^d \rightarrow \mathbb{R}) for regression (or (f: \mathbb{R}^d \rightarrow \mathbb{R}^k), a vector of class scores, for (k)-class classification) that:
- Fits the training data well (low training error)
- Generalizes to new data (low test error)
Two Main Tasks
| Task | Output | Example |
|---|---|---|
| Regression | Continuous value | Predict house price ($350,000) |
| Classification | Discrete category | Predict email spam (yes/no) |
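To make the notation concrete, here is a minimal sketch (with made-up toy numbers, purely for illustration) of what a dataset (\mathcal{D}) looks like in code: a feature matrix with one row per example (\mathbf{x}_i) and one column per feature, plus a continuous target for regression or a discrete label for classification.

```python
import numpy as np

# Toy dataset: n = 4 examples, d = 3 features per example
X = np.array([
    [1200.0, 3.0, 1.0],   # x_1: square feet, bedrooms, has_garage
    [ 850.0, 2.0, 0.0],   # x_2
    [2100.0, 4.0, 1.0],   # x_3
    [1500.0, 3.0, 0.0],   # x_4
])                        # shape (n, d) = (4, 3)

# Regression labels: a continuous value per example (e.g., house price)
y_regression = np.array([350_000.0, 210_000.0, 520_000.0, 400_000.0])

# Classification labels: a discrete category per example (e.g., spam = 1, not spam = 0)
y_classification = np.array([1, 0, 1, 0])

def f(x, w):
    """A simple linear model f(x; w) = w · x, just to show the shape of f."""
    return x @ w
```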
2. Training, Validation, and Test Sets
Why Split the Data?
The Fundamental Problem: We care about performance on future, unseen data, not memorizing the training examples.
Analogy: Preparing for an exam by memorizing the practice test questions isn't useful if the real exam has different questions. You need to understand the concepts to generalize.
The Three-Way Split
| Set | Purpose | Usage |
|---|---|---|
| Training | Learn model parameters | Optimize weights to minimize loss |
| Validation | Select model & hyperparameters | Compare different models, tune settings |
| Test | Final performance estimate | Only used once at the end |
Important Rules
- Never train on test data – it's your "sealed envelope" for final evaluation
- Don't tune on test data – use validation set for hyperparameter selection
- Test set performance is your true estimate of real-world performance
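As a minimal sketch, one common way to produce the three-way split is to call scikit-learn's `train_test_split` twice; the toy data and the exact percentages below are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy classification data just so the example runs end to end
X = np.arange(200).reshape(100, 2)   # 100 examples, 2 features
y = np.array([0, 1] * 50)            # binary labels

# First carve off the test set (15%), then split the remainder into train/validation.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y   # stratify keeps class ratios; drop it for regression
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.15 / 0.85, random_state=42, stratify=y_temp
)
print(len(X_train), len(X_val), len(X_test))   # roughly 70 / 15 / 15
```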
3. Loss Functions: Measuring Error
A loss function (\mathcal{L}(y, \hat{y})) quantifies how wrong our prediction (\hat{y}) is compared to the true value (y).
Common Loss Functions
For Regression: Mean Squared Error (MSE)
[ \mathcal{L}_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 ]
Intuition: Penalize large errors heavily (squared term). Prediction off by 2 is 4× worse than off by 1.
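A minimal NumPy sketch of MSE matching the formula above (the array values are arbitrary, just for illustration):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

# An error of 2 contributes 4, an error of 1 contributes 1
print(mse([3.0, 5.0], [5.0, 6.0]))   # ((-2)^2 + (-1)^2) / 2 = 2.5
```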
For Classification: Cross-Entropy Loss
[ \mathcal{L}_{\text{CE}} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] ]
Intuition: Penalize confident wrong predictions heavily. If truth is 1 but you predict 0.01, loss is huge.
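And a matching sketch of the binary cross-entropy above; the small epsilon that clips predictions away from 0 and 1 is an implementation choice to keep the logarithm finite, not part of the formula itself.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy for labels in {0, 1} and predicted probabilities in (0, 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Confident and wrong: truth is 1, predicted probability is 0.01 -> large loss
print(binary_cross_entropy([1], [0.01]))   # ~4.6
print(binary_cross_entropy([1], [0.99]))   # ~0.01
```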
Empirical Risk Minimization
The training objective is to minimize the empirical risk (average loss on training data):
[ \mathbf{w}^* = \arg\min_{\mathbf{w}} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(y_i, f(\mathbf{x}_i; \mathbf{w})) ]
This is what gradient descent and other optimization algorithms do!
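To make that concrete, here is a tiny sketch of empirical risk minimization for a linear model (f(\mathbf{x}; \mathbf{w}) = \mathbf{w}^\top \mathbf{x}) with MSE loss, using plain gradient descent; the synthetic data, learning rate, and number of iterations are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)   # synthetic labels with a little noise

w = np.zeros(d)   # initialize parameters
lr = 0.1          # learning rate
for step in range(200):
    y_hat = X @ w                        # predictions f(x_i; w)
    grad = (2 / n) * X.T @ (y_hat - y)   # gradient of the empirical MSE risk w.r.t. w
    w -= lr * grad                       # gradient descent update

print(w)   # close to w_true = [2.0, -1.0, 0.5]
```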
4. The Bias-Variance Tradeoff
The Central Challenge of Machine Learning
The Dilemma: We want a model that:
- Fits the training data well (low bias)
- Generalizes to new data (low variance)
But these goals are often in tension!
Definitions
Bias: Error from overly simplistic assumptions
- High bias → underfitting → model too simple
- Can't capture underlying pattern
Variance: Error from sensitivity to training data noise
- High variance → overfitting → model too complex
- Memorizes noise, doesn't generalize
5. Overfitting and Underfitting
Visual Intuition
Picture fitting curves to noisy data points: a straight line that misses the overall trend underfits, a moderately flexible curve that follows the trend fits well, and a wildly wiggly curve that passes through every single point overfits.
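A minimal sketch that reproduces this picture numerically by fitting polynomials of increasing degree to the same noisy data; the degrees, noise level, and sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=30)        # noisy sine wave
x_val = np.sort(rng.uniform(0, 1, 30))
y_val = np.sin(2 * np.pi * x_val) + 0.2 * rng.normal(size=30)

for degree in (1, 4, 15):
    coeffs = np.polyfit(x, y, deg=degree)                     # fit polynomial of given degree
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    val_err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, validation MSE {val_err:.3f}")

# degree  1: high bias  -- both errors high (underfits the sine wave)
# degree  4: balanced   -- both errors low
# degree 15: high variance -- training error very small, validation error typically much larger
```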
Signs of Overfitting
- Training error much lower than validation error (large gap)
- Model performs well on training data, poorly on new data
- Model is very complex relative to amount of training data
- Training error keeps decreasing but validation error increases
Signs of Underfitting
- Both training and validation errors are high
- Model is too simple to capture patterns
- Learning curves plateau at high error
Solutions
| Problem | Solutions |
|---|---|
| Underfitting | • Increase model complexity<br>• Add more features<br>• Reduce regularization<br>• Train longer |
| Overfitting | • Get more training data<br>• Reduce model complexity<br>• Add regularization<br>• Early stopping (see the sketch below)<br>• Dropout/data augmentation |
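Several of these remedies are only a few lines of code. For instance, here is a minimal sketch of early stopping on a toy linear-regression problem: training continues only while the validation loss keeps improving. The dataset, learning rate, and patience value are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup that is easy to overfit: 20 examples, 10 features, only the first carries signal
X_train, X_val = rng.normal(size=(20, 10)), rng.normal(size=(200, 10))
w_true = np.zeros(10)
w_true[0] = 1.0
y_train = X_train @ w_true + 0.5 * rng.normal(size=20)
y_val = X_val @ w_true + 0.5 * rng.normal(size=200)

def val_loss(w):
    """MSE on the validation set."""
    return np.mean((X_val @ w - y_val) ** 2)

w = np.zeros(10)
lr, patience = 0.05, 10
best_val, best_w, patience_left = np.inf, w.copy(), patience

for epoch in range(500):
    grad = (2 / len(y_train)) * X_train.T @ (X_train @ w - y_train)
    w -= lr * grad                       # one training step (gradient descent on MSE)
    current = val_loss(w)
    if current < best_val:               # validation loss improved
        best_val, best_w, patience_left = current, w.copy(), patience
    else:                                # no improvement: count down patience
        patience_left -= 1
        if patience_left == 0:
            break                        # stop early; keep best_w, not the final w

print(f"stopped after epoch {epoch}, best validation MSE = {best_val:.3f}")
```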
6. Model Capacity and Generalization
Model Capacity
Model capacity: The range of functions a model can represent.
- Low capacity: Linear models, shallow trees (risk: underfitting)
- High capacity: Deep neural networks, high-degree polynomials (risk: overfitting)
Key Principle: Match model capacity to:
- Problem complexity
- Amount of training data
The Golden Rule
[ \text{Model Capacity} \propto \sqrt{\text{Training Data Size}} ]
This is a rough heuristic rather than a precise law: the more training data you have, the more complex a model you can use without overfitting.
7. Practical Guidelines
Checklist for Supervised Learning
- Split your data properly
  - 60-70% training, 15-20% validation, 15-20% test
  - Use stratified sampling for classification
- Choose appropriate loss function
  - Regression: MSE, MAE, Huber
  - Classification: Cross-entropy, hinge loss
- Start simple, increase complexity
  - Begin with linear models
  - Add complexity only if needed
- Monitor training and validation errors
  - Gap widening? → Overfitting
  - Both high? → Underfitting
- Use validation set to tune
  - Model selection
  - Hyperparameters
  - Early stopping
- Test set: use only once!
  - Final performance estimate
  - Report this as your result
Key Takeaways
✓ Supervised Learning: Learn from labeled examples to predict on new data
✓ Data Splits: Training (learn), Validation (tune), Test (evaluate once)
✓ Loss Functions: Quantify prediction error
- MSE for regression
- Cross-entropy for classification
✓ Bias-Variance Tradeoff: Balance model complexity
- High bias → underfitting (too simple)
- High variance → overfitting (too complex)
✓ Generalization: True goal is performance on unseen data, not memorizing training data
✓ Model Capacity: Match complexity to data size and problem difficulty
Practice Problems
Problem 1: Data Splitting
You have 5000 labeled images. Split them into train/validation/test sets using scikit-learn.
Problem 2: Identify the Problem
Decide whether each of the following scenarios is underfitting or overfitting: (a) training error is very low, but validation error is much higher; (b) training and validation errors are both high and the learning curves have plateaued.
Problem 3: Calculate MSE
Implement MSE loss from scratch and compare with sklearn.
Next Steps
You now understand the framework for supervised learning. In the next lessons, we'll dive into specific algorithms:
- Lesson 3: Linear Regression – the simplest regression algorithm
- Lesson 4: Logistic Regression – binary classification
- Lesson 5: Regularization – preventing overfitting
Each algorithm is a different way to minimize loss and find the best function (f) for your data!
Further Reading
- Bias-Variance: Understanding the Bias-Variance Tradeoff
- Overfitting: Overfitting in Machine Learning
- Cross-Validation: Cross-validation: evaluating estimator performance
- Book: Pattern Recognition and Machine Learning by Christopher Bishop (Chapter 1)
Remember: The goal isn't to memorize training data – it's to learn patterns that generalize to new data. This is the essence of machine learning!