The Supervised Learning Framework

Introduction: Learning from Examples

Imagine you're teaching a child to identify poisonous mushrooms. You don't explain the underlying biology – instead, you show them examples: "This one with red spots is poisonous. This brown one is safe. This one with gills underneath is poisonous." The child learns a pattern by seeing many labeled examples.

This is supervised learning: learning from labeled data to make predictions on new, unseen examples. It's called "supervised" because we provide the "correct answers" (labels) during training – like a teacher supervising a student.

In this lesson, we'll formalize the supervised learning framework and explore the fundamental concepts that underlie every ML algorithm you'll learn.

Learning Objectives

By the end of this lesson, you'll understand:

The supervised learning problem formulation
Training, validation, and test sets
Loss functions and empirical risk minimization
The bias-variance tradeoff
Overfitting and underfitting
Model capacity and generalization

1. The Supervised Learning Problem

Formal Definition

Given:

A dataset \(\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}\)
- \(\mathbf{x}_i \in \mathbb{R}^d\): input features (d-dimensional vector)
- \(y_i\): output label (real number for regression, category for classification)

Goal: Learn a function \(f: \mathbb{R}^d \rightarrow \mathbb{R}\) (or \(\mathbb{R}^k\) for classification with \(k\) classes, one score per class) that:

Fits the training data well (low training error)
Generalizes to new data (low test error)

Two Main Tasks

| Task | Output | Example |
|------|--------|---------|
| Regression | Continuous value | Predict house price ($350,000) |
| Classification | Discrete category | Predict email spam (yes/no) |
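To make the two tasks concrete, here is a minimal NumPy sketch. The toy feature matrix, the hand-picked weights `w` and bias `b`, and the helper names are all hypothetical and chosen only for illustration; nothing here is learned from data.

```python
import numpy as np

# Toy inputs: n = 4 examples, each a d = 2 dimensional feature vector.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [0.5, 3.0]])

# Hypothetical parameters, picked by hand (not learned).
w = np.array([0.8, -0.4])
b = 0.1

def predict_regression(X, w, b):
    """Regression: map each x in R^d to a continuous value."""
    return X @ w + b

def predict_classification(X, w, b, threshold=0.5):
    """Binary classification: squash the score into (0, 1), then pick a class."""
    prob = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return (prob >= threshold).astype(int)

y_reg = predict_regression(X, w, b)
y_cls = predict_classification(X, w, b)
print("continuous outputs:", y_reg)
print("discrete outputs:  ", y_cls)
```

The same linear score feeds both predictors; only the final step differs, which is why many algorithms come in a regression and a classification variant.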


2. Training, Validation, and Test Sets

Why Split the Data?

The Fundamental Problem: We care about performance on future, unseen data, not memorizing the training examples.

Analogy: Preparing for an exam by memorizing the practice test questions isn't useful if the real exam has different questions. You need to understand the concepts to generalize.

The Three-Way Split


| Set | Purpose | Usage |
|-----|---------|-------|
| Training | Learn model parameters | Optimize weights to minimize loss |
| Validation | Select model & hyperparameters | Compare different models, tune settings |
| Test | Final performance estimate | Only used once at the end |

Important Rules

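Never train on test data – it's your "sealed envelope" for final evaluation
Don't tune on test data – use the validation set for hyperparameter selection
Test set performance is an unbiased estimate of real-world performance only if the test set was never used for training or tuning and resembles the data the model will see in practice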
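The three-way split and the rules above can be sketched with plain NumPy. The function name `three_way_split`, the 70/15/15 proportions, and the seed are assumptions made for this example, not a prescribed recipe.

```python
import numpy as np

def three_way_split(X, y, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into disjoint train/validation/test parts."""
    n = len(X)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(round(test_frac * n))
    n_val = int(round(val_frac * n))
    test_idx = idx[:n_test]                    # set aside first, touched last
    val_idx = idx[n_test:n_test + n_val]       # used for tuning only
    train_idx = idx[n_test + n_val:]           # used for fitting parameters
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))

# Toy data: 100 examples with 2 features each.
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.arange(100)
(train_X, train_y), (val_X, val_y), (test_X, test_y) = three_way_split(X, y)
print(len(train_y), len(val_y), len(test_y))  # 70 15 15
```

Because the indices come from a single permutation, no example can land in more than one set, which is what keeps the test set a "sealed envelope".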


3. Loss Functions: Measuring Error

A loss function \(\mathcal{L}(y, \hat{y})\) quantifies how wrong our prediction \(\hat{y}\) is compared to the true value \(y\).

Common Loss Functions

For Regression: Mean Squared Error (MSE)

\[ \mathcal{L}_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

Intuition: Squaring penalizes large errors heavily. A prediction that is off by 2 incurs 4× the loss of one that is off by 1.

For Classification: Cross-Entropy Loss

\[ \mathcal{L}_{\text{CE}} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] \]

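Intuition: Penalize confident wrong predictions heavily. This binary form assumes \(y_i \in \{0, 1\}\) and that \(\hat{y}_i\) is the predicted probability of class 1. If the truth is 1 but you predict 0.01, the loss for that example is \(-\log(0.01) \approx 4.6\), compared with about 0.01 for a prediction of 0.99.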
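Both formulas translate almost directly into NumPy. The sketch below is a minimal version; the function names and the clipping constant `eps` are assumptions added here to keep the logarithm finite, and the natural logarithm is used.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy for labels in {0, 1} and predicted P(class 1)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so log() stays finite (eps is an assumption).
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1.0 - y_true) * np.log(1.0 - y_prob)))

print(mse([0.0], [1.0]), mse([0.0], [2.0]))   # 1.0 4.0
print(binary_cross_entropy([1], [0.99]))      # about 0.01
print(binary_cross_entropy([1], [0.01]))      # about 4.6
```

The printed values match the two intuitions: doubling the error quadruples the squared loss, and a confident wrong probability costs far more than a confident right one.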


Empirical Risk Minimization

The training objective is to minimize the empirical risk (average loss on training data):

\[ \mathbf{w}^* = \arg\min_{\mathbf{w}} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(y_i, f(\mathbf{x}_i; \mathbf{w})) \]

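Here \(\mathbf{w}\) denotes the model's parameters (weights). Gradient descent and other optimization algorithms are methods for solving this minimization, exactly or approximately.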
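As an illustration of empirical risk minimization, the sketch below runs plain gradient descent on the MSE of a linear model. The synthetic data, the "true" weights `w_true`, the learning rate, and the step count are all assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])          # hypothetical weights used to generate data
y = X @ w_true + 0.1 * rng.normal(size=n)    # labels with a little noise

def empirical_risk(w):
    """Average squared loss of the linear model f(x; w) = x . w on the training set."""
    return float(np.mean((y - X @ w) ** 2))

w = np.zeros(d)
lr = 0.1                                     # assumed learning rate
for step in range(500):
    # Gradient of the empirical risk with respect to w.
    grad = (-2.0 / n) * (X.T @ (y - X @ w))
    w -= lr * grad

print("risk at start:", empirical_risk(np.zeros(d)))
print("risk at end:  ", empirical_risk(w))
print("learned w:    ", w)
```

For this particular loss and model a closed-form least-squares solution also exists; gradient descent is shown because the same loop applies when no closed form is available.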

4. The Bias-Variance Tradeoff

The Central Challenge of Machine Learning

The Dilemma: We want a model that:

Fits the training data well (low bias)
Generalizes to new data (low variance)

But these goals are often in tension!

Definitions

Bias: Error from overly simplistic assumptions

High bias → underfitting → model too simple
Can't capture underlying pattern

Variance: Error from sensitivity to training data noise

High variance → overfitting → model too complex
Memorizes noise, doesn't generalize


The Decomposition

For squared-error loss, the expected test error at a given input can be decomposed:

\[ \text{Expected Test Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} \]

Bias²: How far off our model's average prediction (averaged over possible training sets) is from the truth, squared
Variance: How much predictions vary for different training sets
Irreducible Error: Noise in the data (can't be reduced by any model)
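The decomposition can be estimated by simulation: refit the same model on many freshly sampled training sets and look at how its predictions behave. The sketch below assumes a sine target function, Gaussian noise, and polynomial models of degree 1 and 9; all of these, and the function names, are illustrative choices rather than part of the lesson's original material.

```python
import numpy as np
from numpy.polynomial import Polynomial

def true_f(x):
    """Assumed ground-truth function for the simulation."""
    return np.sin(2 * np.pi * x)

def bias_variance(degree, n_train=20, noise_std=0.3, n_trials=300, seed=0):
    """Monte Carlo estimate of bias^2 and variance for a polynomial of given degree."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)
    x_test = np.linspace(0.05, 0.95, 50)
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        # Each trial draws a new noisy training set from the same process.
        y_train = true_f(x_train) + noise_std * rng.normal(size=n_train)
        model = Polynomial.fit(x_train, y_train, degree)
        preds[t] = model(x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - true_f(x_test)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

results = {}
for degree in (1, 9):
    results[degree] = bias_variance(degree)
    b2, var = results[degree]
    print(f"degree {degree}: bias^2 = {b2:.3f}, variance = {var:.3f}")
```

Under these assumptions the straight line shows high bias and low variance, while the degree-9 polynomial shows the reverse. The irreducible error is the noise variance, here `noise_std ** 2`, regardless of which model is used.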


5. Overfitting and Underfitting

Visual Intuition


Signs of Overfitting

Training error much lower than validation error (large gap)
Model performs well on training data, poorly on new data
Model is very complex relative to amount of training data
Training error keeps decreasing but validation error increases

Signs of Underfitting

Both training and validation errors are high
Model is too simple to capture patterns
Learning curves plateau at high error
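These signs can be checked numerically by comparing training and validation error for a model that is too simple and one that is too flexible. The sketch below assumes a noisy sine dataset and polynomial models of degree 1 and 12 fitted to only 15 training points; the data and degrees are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
noise_std = 0.2

def sample(x):
    """Noisy observations of an assumed sine target."""
    return np.sin(2 * np.pi * x) + noise_std * rng.normal(size=x.size)

x_train = np.linspace(0.0, 1.0, 15)
x_val = rng.uniform(0.0, 1.0, size=200)
y_train, y_val = sample(x_train), sample(x_val)

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

errors = {}
for degree in (1, 12):
    model = Polynomial.fit(x_train, y_train, degree)
    errors[degree] = (mse(y_train, model(x_train)), mse(y_val, model(x_val)))
    train_err, val_err = errors[degree]
    print(f"degree {degree:2d}: train MSE = {train_err:.3f}, validation MSE = {val_err:.3f}")
```

The degree-1 model should show both errors well above the noise level (underfitting), while the degree-12 model should show a very small training error next to a clearly larger validation error (overfitting).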

Solutions

| Problem | Solutions |
|---------|-----------|
| Underfitting | • Increase model complexity<br>• Add more features<br>• Reduce regularization<br>• Train longer |
| Overfitting | • Get more training data<br>• Reduce model complexity<br>• Add regularization<br>• Early stopping<br>• Dropout/data augmentation |
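As one example of "add regularization", the sketch below fits a flexible polynomial with an L2 (ridge) penalty in closed form and shows how the penalty strength changes the fit. The dataset, the degree, and the values of `lam` are assumptions for illustration; for simplicity the penalty also applies to the intercept, which practical implementations often exclude.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 15)
y = np.sin(np.pi * x) + 0.2 * rng.normal(size=x.size)

degree = 12
Phi = np.vander(x, degree + 1, increasing=True)   # columns: 1, x, x^2, ..., x^12

def ridge_fit(Phi, y, lam):
    """Minimize ||y - Phi w||^2 + lam * ||w||^2 using the normal equations."""
    k = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(k), Phi.T @ y)

lams = (1e-6, 1e-2, 1.0)
weight_norms, train_mses = [], []
for lam in lams:
    w = ridge_fit(Phi, y, lam)
    weight_norms.append(float(np.linalg.norm(w)))
    train_mses.append(float(np.mean((y - Phi @ w) ** 2)))
    print(f"lam = {lam:g}: ||w|| = {weight_norms[-1]:.2f}, train MSE = {train_mses[-1]:.4f}")
```

A larger penalty shrinks the weights and raises the training error slightly. That trade is the point: the model gives up some fit to the training noise in exchange for a smoother function, and the validation set decides how much penalty is worthwhile.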

6. Model Capacity and Generalization

Model Capacity

Model capacity: The range of functions a model can represent.

Low capacity: Linear models, shallow trees (risk: underfitting)
High capacity: Deep neural networks, high-degree polynomials (risk: overfitting)

Key Principle: Match model capacity to:

Problem complexity
Amount of training data
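One way to match capacity to the problem is to sweep a capacity knob and let the validation set choose. The sketch below uses polynomial degree as that knob; the noisy sine data, the sample sizes, and the degree range are assumptions made for the example.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def make_data(n):
    """Noisy samples from an assumed sine target on [0, 1]."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)
    return x, y

x_train, y_train = make_data(30)
x_val, y_val = make_data(200)

degrees = list(range(0, 11))
train_err, val_err = [], []
for deg in degrees:
    # Fix the domain so every model is scaled the same way on [0, 1].
    model = Polynomial.fit(x_train, y_train, deg, domain=[0.0, 1.0])
    train_err.append(float(np.mean((y_train - model(x_train)) ** 2)))
    val_err.append(float(np.mean((y_val - model(x_val)) ** 2)))

best = degrees[int(np.argmin(val_err))]
for deg, tr, va in zip(degrees, train_err, val_err):
    print(f"degree {deg:2d}: train = {tr:.3f}, validation = {va:.3f}")
print("degree chosen by validation error:", best)
```

Training error can only go down as the degree grows, so it cannot be used to choose capacity. Validation error typically falls and then stops improving or rises, and its minimum marks a capacity suited to this dataset.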


The Golden Rule

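\[ \text{Model Capacity} \propto \sqrt{\text{Training Data Size}} \]

Treat this as a rough rule of thumb rather than a precise law: the exact relationship depends on the model family and the problem. The practical message is that more data lets you use more complex models without overfitting.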
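The qualitative claim, that a complex model becomes usable once enough data is available, can be checked with a small simulation. The sketch below keeps a degree-9 polynomial fixed and varies only the training set size; the sine target, noise level, sizes, and the use of the median over repeated runs are all assumptions for illustration. It does not test the square-root relationship itself.

```python
import numpy as np
from numpy.polynomial import Polynomial

def median_val_error(n_train, degree=9, n_repeats=20, noise_std=0.2, seed=0):
    """Median validation MSE of a fixed-capacity model over repeated random datasets."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_repeats):
        x_tr = rng.uniform(0.0, 1.0, size=n_train)
        y_tr = np.sin(2 * np.pi * x_tr) + noise_std * rng.normal(size=n_train)
        x_va = rng.uniform(0.0, 1.0, size=500)
        y_va = np.sin(2 * np.pi * x_va) + noise_std * rng.normal(size=500)
        model = Polynomial.fit(x_tr, y_tr, degree, domain=[0.0, 1.0])
        errs.append(np.mean((y_va - model(x_va)) ** 2))
    return float(np.median(errs))

sizes = (15, 100, 1000)
val_by_size = {n: median_val_error(n) for n in sizes}
for n in sizes:
    print(f"n_train = {n:4d}: median validation MSE = {val_by_size[n]:.3f}")
```

With very few points the degree-9 model overfits badly; with many points its validation error should approach the noise variance (`noise_std ** 2`), the best any model can achieve on this data.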

7. Practical Guidelines

Checklist for Supervised Learning

Split your data properly
- 60-70% training, 15-20% validation, 15-20% test
- Use stratified sampling for classification
Choose an appropriate loss function
- Regression: MSE, MAE, Huber
- Classification: Cross-entropy, hinge loss
Start simple, increase complexity
- Begin with linear models
- Add complexity only if needed
Monitor training and validation errors
- Gap widening? → Overfitting
- Both high? → Underfitting
Use validation set to tune
- Model selection
- Hyperparameters
- Early stopping
Test set: use only once!
- Final performance estimate
- Report this as your result
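The checklist's advice to use stratified sampling for classification can be sketched in a few lines of NumPy. The function `stratified_split` and the imbalanced toy labels are hypothetical; in practice, scikit-learn's `train_test_split` offers similar behavior through its `stratify` argument.

```python
import numpy as np

def stratified_split(y, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) so each class is split in the same proportion."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for label in np.unique(y):
        # Shuffle the indices of this class only, then cut them.
        idx = rng.permutation(np.flatnonzero(y == label))
        n_test = int(round(test_frac * idx.size))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Imbalanced toy labels: 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)
train_idx, test_idx = stratified_split(y)
print(y[train_idx].mean(), y[test_idx].mean())  # 0.1 0.1
```

A purely random split of such a small, imbalanced dataset could leave the test set with very few positives or none; splitting within each class keeps the class ratio the same in every subset.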

Key Takeaways

✓ Supervised Learning: Learn from labeled examples to predict on new data

✓ Data Splits: Training (learn), Validation (tune), Test (evaluate once)

✓ Loss Functions: Quantify prediction error

MSE for regression
Cross-entropy for classification

✓ Bias-Variance Tradeoff: Balance model complexity

High bias → underfitting (too simple)
High variance → overfitting (too complex)

✓ Generalization: True goal is performance on unseen data, not memorizing training data

✓ Model Capacity: Match complexity to data size and problem difficulty

Practice Problems

Problem 1: Data Splitting

You have 5000 labeled images. Split them into train/validation/test sets using scikit-learn.


Problem 2: Identify the Problem

Describe whether each scenario is underfitting or overfitting:


Problem 3: Calculate MSE

Implement MSE loss from scratch and compare with sklearn.


Next Steps

You now understand the framework for supervised learning. In the next lessons, we'll dive into specific algorithms:

Lesson 3: Linear Regression – the simplest regression algorithm
Lesson 4: Logistic Regression – binary classification
Lesson 5: Regularization – preventing overfitting

Each algorithm is a different way to minimize loss and find the best function \(f\) for your data!
