The Supervised Learning Framework

Introduction: Learning from Examples

Imagine you're teaching a child to identify poisonous mushrooms. You don't explain the underlying biology – instead, you show them examples: "This one with red spots is poisonous. This brown one is safe. This one with gills underneath is poisonous." The child learns a pattern by seeing many labeled examples.

This is supervised learning: learning from labeled data to make predictions on new, unseen examples. It's called "supervised" because we provide the "correct answers" (labels) during training – like a teacher supervising a student.

In this lesson, we'll formalize the supervised learning framework and explore the fundamental concepts that underlie every ML algorithm you'll learn.

The single hardest idea here is the bias-variance tradeoff — the tension between a model that's too simple and one that's too complex. Play with it now, then we'll build up the theory behind what you're seeing:

Fig. 02: Bias-Variance Tradeoff Explorer (interactive visualization of the bias-variance tradeoff)

Try it: drag the model-complexity control from low to high and watch how the fitted curve changes — at low complexity it stays too flat to follow the data (underfitting), and at high complexity it wiggles to chase every point (overfitting). The sweet spot is in the middle.

Learning Objectives

By the end of this lesson, you'll understand:

The supervised learning problem formulation
Training, validation, and test sets
Loss functions and empirical risk minimization
The bias-variance tradeoff
Overfitting and underfitting
Model capacity and generalization

1. The Supervised Learning Problem

Formal Definition

Given:

A dataset (\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\})
- (\mathbf{x}_i \in \mathbb{R}^d): input features (d-dimensional vector)
- (y_i): output label (real number for regression, category for classification)

Goal: Learn a function (f: \mathbb{R}^d \rightarrow \mathbb{R}) (or (\mathbb{R}^k) for classification) that:

Fits the training data well (low training error)
Generalizes to new data (low test error)

Two Main Tasks

Task	Output	Example
Regression	Continuous value	Predict house price ($350,000)
Classification	Discrete category	Predict email spam (yes/no)
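To make the notation concrete, here is a minimal sketch of a dataset of (x, y) pairs and of a function f for each task. The feature meanings, the weights, and the helper names `f_regression` and `f_classification` are all made up for illustration; they are not part of the lesson's own code.

```python
# Toy datasets D = {(x_i, y_i)} with d = 2 features per example.
# All numbers and feature meanings below are illustrative assumptions.
regression_data = [
    ([1400.0, 3.0], 350_000.0),  # (square feet, bedrooms) -> price in dollars
    ([900.0, 2.0], 210_000.0),
]
classification_data = [
    ([12.0, 1.0], 1),  # (exclamation marks, contains a link) -> 1 = spam
    ([0.0, 0.0], 0),   # 0 = not spam
]


def f_regression(x, w, b):
    """Hypothetical linear model f(x; w, b): returns a real number."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def f_classification(x, w, b):
    """Same linear score, thresholded at 0 to return a discrete label."""
    return 1 if f_regression(x, w, b) > 0 else 0


# Hand-picked (not learned) parameters, just to show the two output types.
price = f_regression([1400.0, 3.0], w=[200.0, 20_000.0], b=10_000.0)
label = f_classification([12.0, 1.0], w=[0.5, 1.0], b=-3.0)
print(price, label)
```

Learning, in this framing, means choosing `w` and `b` from the data instead of by hand.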

Fig. 04: Interactive Python code execution environment

2. Training, Validation, and Test Sets

Why Split the Data?

The Fundamental Problem: We care about performance on future, unseen data, not memorizing the training examples.

Analogy: Preparing for an exam by memorizing the practice test questions isn't useful if the real exam has different questions. You need to understand the concepts to generalize.

The Three-Way Split

Fig. 06: Flow diagram (three-way data split)

Set	Purpose	Usage
Training	Learn model parameters	Optimize weights to minimize loss
Validation	Select model & hyperparameters	Compare different models, tune settings
Test	Final performance estimate	Only used once at the end

Important Rules

Never train on test data – it's your "sealed envelope" for final evaluation
Don't tune on test data – use validation set for hyperparameter selection
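Test set performance is your best estimate of real-world performance, and it is trustworthy only if the test set played no part in training or tuning and resembles the data you will see in deployment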
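The split itself can be sketched in a few lines of plain Python. This is one possible approach, assuming a 70/15/15 split and a fixed shuffle seed; the function name `three_way_split` and its parameters are hypothetical, and the integers stand in for real labeled examples.

```python
import random


def three_way_split(examples, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into disjoint train / validation / test lists."""
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


data = list(range(1000))  # stand-in for 1000 labeled examples
train, val, test = three_way_split(data)
print(len(train), len(val), len(test))
```

Because the three lists come from one shuffled copy, no example can appear in more than one set, which is what keeps the test set a "sealed envelope".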

Fig. 08: Interactive Python code execution environment

3. Loss Functions: Measuring Error

A loss function (\mathcal{L}(y, \hat{y})) quantifies how wrong our prediction (\hat{y}) is compared to the true value (y).

Common Loss Functions

For Regression: Mean Squared Error (MSE)

[ \mathcal{L}_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 ]

Intuition: Squaring penalizes large errors heavily. A prediction that is off by 2 incurs 4× the loss of one that is off by 1.
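The 4× claim is easy to check numerically. A minimal sketch, with made-up target values and a helper named `mse` that is introduced here for illustration:

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 7.0]
off_by_one = [4.0, 6.0, 8.0]  # every prediction is 1 too high
off_by_two = [5.0, 7.0, 9.0]  # every prediction is 2 too high

print(mse(y_true, off_by_one))  # 1.0
print(mse(y_true, off_by_two))  # 4.0
```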

For Classification: Cross-Entropy Loss

[ \mathcal{L}_{\text{CE}} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] ]

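Here (y_i \in \{0, 1\}) is the true label and (\hat{y}_i) is the predicted probability of class 1, so this is the binary form of the loss.

Intuition: Penalize confident wrong predictions heavily. If the truth is 1 but you predict 0.01, that example's loss is -log(0.01) ≈ 4.6 (natural log), versus about 0.01 for a prediction of 0.99.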
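A small sketch of the binary cross-entropy formula, using the natural logarithm. The clipping constant `eps` is an added assumption to keep `log(0)` from occurring; it is not part of the formula above, and the function name is hypothetical.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average binary cross-entropy; y_prob holds predicted P(class = 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so log(0) never occurs (assumption)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


confident_right = binary_cross_entropy([1], [0.99])
confident_wrong = binary_cross_entropy([1], [0.01])
print(f"{confident_right:.4f} {confident_wrong:.4f}")
```

The confident wrong prediction costs several hundred times more than the confident right one, which is the asymmetry the intuition describes.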

Fig. 10: Interactive Python code execution environment

Empirical Risk Minimization

The training objective is to minimize the empirical risk (average loss on training data):

[ \mathbf{w}^* = \arg\min_{\mathbf{w}} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(y_i, f(\mathbf{x}_i; \mathbf{w})) ]

Gradient descent and other optimization algorithms are procedures for solving this minimization, exactly for simple convex problems and approximately otherwise.
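As a sketch of empirical risk minimization in practice, the following fits a one-feature linear model f(x; w, b) = w·x + b by gradient descent on the MSE. The noise-free data (generated from y = 2x + 1), the learning rate, and the step count are illustrative assumptions.

```python
def empirical_risk(w, b, data):
    """Average squared loss of f(x; w, b) = w * x + b over the training data."""
    return sum((y - (w * x + b)) ** 2 for x, y in data) / len(data)


def gradient_descent(data, lr=0.05, steps=2000):
    """Minimize the empirical risk by repeatedly stepping against its gradient."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = sum(-2 * x * (y - (w * x + b)) for x, y in data) / n
        grad_b = sum(-2 * (y - (w * x + b)) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Noise-free toy data from y = 2x + 1, so the minimizer is w = 2, b = 1.
data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
w_star, b_star = gradient_descent(data)
print(w_star, b_star, empirical_risk(w_star, b_star, data))
```

The learning rate matters: too large a value makes the updates diverge on this data, so 0.05 was picked to be comfortably stable.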

4. The Bias-Variance Tradeoff

🎯 Before the math — see it: open R2D3's Visual Introduction to ML, Part 2 in another tab. Scroll through once. The dart-board illustration you're about to read makes ten times more sense afterwards.

The Central Challenge of Machine Learning

The Dilemma: We want a model that:

Fits the training data well (low bias)
Generalizes to new data (low variance)

But these goals are often in tension!

Definitions

Bias: Error from overly simplistic assumptions

High bias → underfitting → model too simple
Can't capture underlying pattern

Variance: Error from sensitivity to training data noise

High variance → overfitting → model too complex
Memorizes noise, doesn't generalize

Now revisit the Bias-Variance Tradeoff Explorer (Fig. 02) at the top of the lesson with these definitions in mind: the complexity control moves you along exactly this bias-variance axis.

The Decomposition

For squared-error loss, the expected test error of any model can be decomposed:

[ \text{Expected Test Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} ]

Bias²: Squared gap between the model's average prediction (averaged over possible training sets) and the truth
Variance: How much predictions vary across different training sets
Irreducible Error: Noise in the data (no model can remove it)
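The first two terms can be estimated by simulation: draw many training sets, refit, and look at the predictions at one fixed test point. The sketch below assumes a true function sin(2πx), Gaussian noise with standard deviation 0.3 (so the irreducible error is 0.09), and two deliberately extreme models chosen for illustration: one that always predicts the training mean, and a 1-nearest-neighbor predictor.

```python
import math
import random


def true_f(x):
    return math.sin(2 * math.pi * x)


def sample_training_set(rng, n=20, noise_sd=0.3):
    """Draw n noisy examples (x, y) with x uniform on [0, 1)."""
    xs = [rng.random() for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0.0, noise_sd)) for x in xs]


def predict_mean(train, x0):
    """High-bias model: ignores x0 and predicts the average training label."""
    return sum(y for _, y in train) / len(train)


def predict_1nn(train, x0):
    """High-variance model: copies the label of the nearest training point."""
    return min(train, key=lambda pair: abs(pair[0] - x0))[1]


def bias2_and_variance(predict, x0, trials=2000, seed=0):
    """Estimate bias^2 and variance of predict at x0 over many training sets."""
    rng = random.Random(seed)
    preds = [predict(sample_training_set(rng), x0) for _ in range(trials)]
    avg = sum(preds) / trials
    bias2 = (avg - true_f(x0)) ** 2
    variance = sum((p - avg) ** 2 for p in preds) / trials
    return bias2, variance


x0 = 0.25  # test point where the true function equals 1
for name, model in [("mean", predict_mean), ("1-NN", predict_1nn)]:
    b2, var = bias2_and_variance(model, x0)
    print(f"{name}: bias^2={b2:.3f} variance={var:.3f}")
```

The mean predictor should show large bias and small variance, and the 1-NN predictor the reverse, which is the tradeoff in miniature.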

Fig. 12: Interactive Python code execution environment

5. Overfitting and Underfitting

Visual Intuition

You already saw this interactively in the Bias-Variance Tradeoff Explorer (Fig. 02) at the top: a degree-1 line underfits, a moderate degree fits well, and a very high degree overfits by snaking through every noisy point. The pattern in the error numbers is what to remember:

Underfitting (too simple): high training error and high test error.
Good fit (just right): low training error and low test error.
Overfitting (too complex): very low training error, but high test error.

Signs of Overfitting

Training error much lower than validation error (large gap)
Model performs well on training data, poorly on new data
Model is very complex relative to amount of training data
Training error keeps decreasing but validation error increases

Signs of Underfitting

Both training and validation errors are high
Model is too simple to capture patterns
Learning curves plateau at high error
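These signs can be turned into a rough first-pass check on a pair of error rates. The sketch below is only a heuristic: the function name `diagnose` and both thresholds are hypothetical, and sensible values depend entirely on the task and the metric.

```python
def diagnose(train_error, val_error, high_error=0.30, max_gap=0.10):
    """Rough label from two error rates; the thresholds are illustrative only."""
    if train_error > high_error and val_error > high_error:
        return "underfitting"    # both errors are high
    if val_error - train_error > max_gap:
        return "overfitting"     # large train/validation gap
    return "reasonable fit"


print(diagnose(0.40, 0.42))  # underfitting
print(diagnose(0.02, 0.35))  # overfitting
print(diagnose(0.08, 0.10))  # reasonable fit
```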

Solutions

Problem	Solutions
Underfitting	Increase model complexity; add more features; reduce regularization; train longer
Overfitting	Get more training data; reduce model complexity; add regularization; early stopping; dropout; data augmentation

6. Model Capacity and Generalization

Model Capacity

Model capacity: The range of functions a model can represent.

Low capacity: Linear models, shallow trees (risk: underfitting)
High capacity: Deep neural networks, high-degree polynomials (risk: overfitting)

Key Principle: Match model capacity to:

Problem complexity
Amount of training data
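One way to watch capacity at work, as a sketch: k-nearest-neighbor regression, where a smaller k means higher capacity. The data-generating function, noise level, sample sizes, and k values are assumptions chosen for illustration, and k-NN stands in here for the model families listed above.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is reproducible


def make_data(n, noise_sd=0.3):
    """Synthetic 1-D data: y = sin(2*pi*x) + Gaussian noise (an assumed setup)."""
    data = []
    for _ in range(n):
        x = rng.random()
        data.append((x, math.sin(2 * math.pi * x) + rng.gauss(0.0, noise_sd)))
    return data


def knn_predict(train, x0, k):
    """Average the labels of the k training points closest to x0."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def knn_mse(train, eval_data, k):
    """Mean squared error of k-NN (fitted on train) over eval_data."""
    errors = [(y - knn_predict(train, x, k)) ** 2 for x, y in eval_data]
    return sum(errors) / len(errors)


train, val = make_data(40), make_data(40)
for k in (1, 5, 40):  # k=1: highest capacity; k=40: always predicts the training mean
    print(f"k={k:2d} train MSE={knn_mse(train, train, k):.3f} "
          f"val MSE={knn_mse(train, val, k):.3f}")
```

With k=1 every training point is its own nearest neighbor, so the training error is exactly zero. That is memorization, and on its own it says nothing about how the model does on the validation set.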

Fig. 14: Interactive Python code execution environment

The Golden Rule

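[ \text{Model Capacity} \propto \sqrt{\text{Training Data Size}} ]

Treat this as a loose rule of thumb rather than a law: the right scaling depends on the model family and on how noisy the data is. The dependable part is the direction: more data → can use more complex models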

7. Practical Guidelines

Checklist for Supervised Learning

Split your data properly
- 60-70% training, 15-20% validation, 15-20% test
- Use stratified sampling for classification
Choose appropriate loss function
- Regression: MSE, MAE, Huber
- Classification: Cross-entropy, hinge loss
Start simple, increase complexity
- Begin with linear models
- Add complexity only if needed
Monitor training and validation errors
- Gap widening? → Overfitting
- Both high? → Underfitting
Use validation set to tune
- Model selection
- Hyperparameters
- Early stopping
Test set: use only once!
- Final performance estimate
- Report this as your result
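The early-stopping item in the checklist can be sketched as a rule applied to the sequence of validation losses. The function name, the `patience` parameter, and the loss values below are illustrative assumptions, not output from a real training run.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the best epoch, stopping once validation loss
    has gone `patience` epochs without improving on the best value seen."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch


# Made-up validation losses: improving, then drifting upward (overfitting).
val_losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55, 0.30]
print(early_stopping_epoch(val_losses))  # 3
```

Training stops at epoch 6, so the late value 0.30 is never seen. That is the price of early stopping: a larger `patience` trades extra compute for a smaller chance of stopping too soon.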

Key Takeaways

✓ Supervised Learning: Learn from labeled examples to predict on new data

✓ Data Splits: Training (learn), Validation (tune), Test (evaluate once)

✓ Loss Functions: Quantify prediction error

MSE for regression
Cross-entropy for classification

✓ Bias-Variance Tradeoff: Balance model complexity

High bias → underfitting (too simple)
High variance → overfitting (too complex)

✓ Generalization: True goal is performance on unseen data, not memorizing training data

✓ Model Capacity: Match complexity to data size and problem difficulty

Practice Problems

Problem 1: Data Splitting

You have 5000 labeled images. Split them into train/validation/test sets using scikit-learn.

Fig. 16: Interactive Python code execution environment

Problem 2: Identify the Problem

Describe whether each scenario is underfitting or overfitting:

Fig. 18: Interactive Python code execution environment

Problem 3: Calculate MSE

Implement MSE loss from scratch and compare with sklearn.

Fig. 20: Interactive Python code execution environment

Next Steps

You now understand the framework for supervised learning. In the next lessons, we'll dive into specific algorithms:

Lesson 3: Linear Regression – the simplest regression algorithm
Lesson 4: Logistic Regression – binary classification
Lesson 5: Regularization – preventing overfitting

Each algorithm is a different way to minimize loss and find the best function (f) for your data!
