The Supervised Learning Framework

Introduction: Learning from Examples

Imagine you're teaching a child to identify poisonous mushrooms. You don't explain the underlying biology – instead, you show them examples: "This one with red spots is poisonous. This brown one is safe. This one with gills underneath is poisonous." The child learns a pattern by seeing many labeled examples.

This is supervised learning: learning from labeled data to make predictions on new, unseen examples. It's called "supervised" because we provide the "correct answers" (labels) during training – like a teacher supervising a student.

In this lesson, we'll formalize the supervised learning framework and explore the fundamental concepts that underlie every ML algorithm you'll learn.

Learning Objectives

By the end of this lesson, you'll understand:

  • The supervised learning problem formulation
  • Training, validation, and test sets
  • Loss functions and empirical risk minimization
  • The bias-variance tradeoff
  • Overfitting and underfitting
  • Model capacity and generalization

1. The Supervised Learning Problem

Formal Definition

Given:

  • A dataset (\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\})
    • (\mathbf{x}_i \in \mathbb{R}^d): input features (d-dimensional vector)
    • (y_i): output label (real number for regression, category for classification)

Goal: Learn a function (f: \mathbb{R}^d \rightarrow \mathbb{R}) (or (\mathbb{R}^k) for classification) that:

  1. Fits the training data well (low training error)
  2. Generalizes to new data (low test error)

Two Main Tasks

| Task | Output | Example |
|---|---|---|
| Regression | Continuous value | Predict house price ($350,000) |
| Classification | Discrete category | Predict email spam (yes/no) |
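
To make the notation concrete, here is a minimal sketch of how a dataset and a candidate function might look in NumPy. The feature values, labels, weights `w`, and offset `b` are all made up for illustration; the linear form of `f` is just one possible choice, not something this lesson prescribes.

```python
import numpy as np

# Toy dataset: n = 4 examples, d = 2 features (all values are made up).
X = np.array([[1200.0, 2.0],
              [1500.0, 3.0],
              [800.0, 1.0],
              [2000.0, 4.0]])           # shape (n, d)

# Regression: each label is a real number.
y_regression = np.array([250_000.0, 320_000.0, 180_000.0, 410_000.0])

# Classification: each label is a discrete category, here encoded as 0/1.
y_classification = np.array([0, 1, 0, 1])

def f(x, w, b):
    """A hypothetical linear model f(x) = w . x + b mapping R^d -> R."""
    return x @ w + b

w = np.array([150.0, 20_000.0])   # illustrative weights, not fitted to the data
b = 10_000.0
predictions = f(X, w, b)          # one real-valued prediction per row of X

print(X.shape, predictions.shape)
```

Learning, in this picture, means choosing `w` and `b` from the data rather than by hand.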


2. Training, Validation, and Test Sets

Why Split the Data?

The Fundamental Problem: We care about performance on future, unseen data, not memorizing the training examples.

Analogy: Preparing for an exam by memorizing the practice test questions isn't useful if the real exam has different questions. You need to understand the concepts to generalize.

The Three-Way Split

| Set | Purpose | Usage |
|---|---|---|
| Training | Learn model parameters | Optimize weights to minimize loss |
| Validation | Select model & hyperparameters | Compare different models, tune settings |
| Test | Final performance estimate | Only used once at the end |

Important Rules

  1. Never train on test data – it's your "sealed envelope" for final evaluation
  2. Don't tune on test data – use validation set for hyperparameter selection
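  3. Test set performance is your best estimate of real-world performance, provided the test set is used only for final evaluation and is drawn from the same distribution as the data the model will see in practice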
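
One way to produce such a three-way split is to call scikit-learn's `train_test_split` twice. The sketch below assumes synthetic data and a 70/15/15 split; the array sizes, the seed, and the variable names are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # synthetic features
y = rng.integers(0, 2, size=1000)       # synthetic binary labels

n_test = 150   # 15% of 1000
n_val = 150    # 15% of 1000

# Step 1: set the test set aside and do not touch it again until the end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=n_test, stratify=y, random_state=0
)

# Step 2: split what remains into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=n_val, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

Passing `stratify` keeps the class proportions roughly equal across the three sets, which matters most when one class is rare.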


3. Loss Functions: Measuring Error

A loss function (\mathcal{L}(y, \hat{y})) quantifies how wrong our prediction (\hat{y}) is compared to the true value (y).

Common Loss Functions

For Regression: Mean Squared Error (MSE)

[ \mathcal{L}_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 ]

Intuition: The squared term penalizes large errors heavily. A prediction that is off by 2 contributes 4× the loss of one that is off by 1.
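
The 4× claim can be checked directly with a few lines of NumPy. This is a minimal from-scratch sketch; the function name `mse` and the sample values are illustrative.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 7.0])

loss_off_by_1 = mse(y_true, y_true + 1.0)   # every prediction off by 1
loss_off_by_2 = mse(y_true, y_true + 2.0)   # every prediction off by 2

print(loss_off_by_1, loss_off_by_2)   # 1.0 4.0
```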

For Classification: Cross-Entropy Loss

[ \mathcal{L}_{\text{CE}} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] ]

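Intuition: This is the binary form of cross-entropy, where (y_i \in \{0, 1\}) is the true label and (\hat{y}_i) is the predicted probability of class 1. Confident wrong predictions are penalized heavily: if the truth is 1 but you predict 0.01, that example's loss is (-\log(0.01) \approx 4.6) (natural log), compared with about 0.01 for a prediction of 0.99.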
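
A minimal NumPy sketch of the binary cross-entropy formula shows how sharply the loss grows as a prediction becomes confidently wrong. The clipping constant `eps` is an added safeguard against `log(0)`, not part of the formula itself.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average binary cross-entropy; p_pred is the predicted probability of class 1."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so that log() stays finite.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

# The true label is 1 in all three cases; only the predicted probability changes.
confident_right = binary_cross_entropy([1], [0.99])   # about 0.01
unsure = binary_cross_entropy([1], [0.50])            # about 0.69
confident_wrong = binary_cross_entropy([1], [0.01])   # about 4.61

print(confident_right, unsure, confident_wrong)
```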

Empirical Risk Minimization

The training objective is to minimize the empirical risk (average loss on training data):

[ \mathbf{w}^* = \arg\min_{\mathbf{w}} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(y_i, f(\mathbf{x}_i; \mathbf{w})) ]

Gradient descent and other optimization algorithms are different ways of carrying out this minimization.
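
As a rough illustration, the sketch below minimizes the empirical risk of a linear model under squared-error loss using plain gradient descent. The synthetic data, the "true" weights, the learning rate, and the number of steps are all assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = 2*x1 - 3*x2 + small noise (made-up ground truth).
n, d = 200, 2
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -3.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

def empirical_risk(w):
    """Average squared-error loss of the linear model f(x; w) = w . x."""
    return np.mean((y - X @ w) ** 2)

w = np.zeros(d)          # initial guess
learning_rate = 0.1
for _ in range(500):
    # Gradient of the empirical risk with respect to w.
    gradient = (-2.0 / n) * (X.T @ (y - X @ w))
    w -= learning_rate * gradient

print("learned w:", w, "empirical risk:", empirical_risk(w))
```

The learned `w` should land close to `w_true`, because the only thing separating them is the small noise added to `y`.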


4. The Bias-Variance Tradeoff

The Central Challenge of Machine Learning

The Dilemma: We want a model that:

  • Fits the training data well (low bias)
  • Generalizes to new data (low variance)

But these goals are often in tension!

Definitions

Bias: Error from overly simplistic assumptions

  • High bias → underfitting → model too simple
  • Can't capture underlying pattern

Variance: Error from sensitivity to training data noise

  • High variance → overfitting → model too complex
  • Memorizes noise, doesn't generalize

The Decomposition

For squared-error loss, the expected test error at a given input, averaged over possible training sets, can be decomposed:

[ \text{Expected Test Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} ]

  • Bias²: How far off our model's average prediction is from the truth
  • Variance: How much predictions vary for different training sets
  • Irreducible Error: Noise in the data (can't be reduced)
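
The bias and variance terms can be estimated by simulation: refit the same model on many training sets and look at how the predictions behave at fixed test points. The sketch below is one such experiment under stated assumptions: a made-up ground truth `sin(pi * x)`, fixed training inputs with freshly drawn label noise on each repeat, and polynomial fits of degree 1 and degree 9 standing in for a simple and a complex model.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(np.pi * x)          # made-up ground-truth function

noise_std = 0.3
x_train = np.linspace(-1.0, 1.0, 20)  # fixed training inputs
x_test = np.linspace(-0.8, 0.8, 50)   # fixed points where predictions are compared

def bias2_and_variance(degree, n_repeats=300):
    """Estimate bias^2 and variance of a polynomial fit of the given degree."""
    preds = np.empty((n_repeats, x_test.size))
    for r in range(n_repeats):
        # Each repeat is a new training set: same inputs, new noisy labels.
        y_train = true_f(x_train) + noise_std * rng.normal(size=x_train.size)
        coeffs = np.polyfit(x_train, y_train, degree)
        preds[r] = np.polyval(coeffs, x_test)
    avg_pred = preds.mean(axis=0)
    bias2 = np.mean((avg_pred - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

bias2_simple, var_simple = bias2_and_variance(degree=1)
bias2_complex, var_complex = bias2_and_variance(degree=9)

print(f"degree 1: bias^2 = {bias2_simple:.4f}, variance = {var_simple:.4f}")
print(f"degree 9: bias^2 = {bias2_complex:.4f}, variance = {var_complex:.4f}")
```

In this setup the straight line has the larger bias² (it cannot bend to follow the sine curve), while the degree-9 polynomial has the larger variance (it chases the noise in each training set).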


5. Overfitting and Underfitting

Visual Intuition
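
A small numeric experiment can convey the same intuition as a plot. The sketch below fits polynomials of increasing degree to a small noisy dataset and prints training and validation error for each. The ground truth `sin(pi * x)`, the noise level, the dataset sizes, and the degrees (1, 3, 10) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    """Noisy samples of a made-up ground truth, sin(pi * x)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + 0.3 * rng.normal(size=n)
    return x, y

x_train, y_train = make_data(15)     # deliberately small training set
x_val, y_val = make_data(200)

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

results = {}
for degree in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = mse(y_train, np.polyval(coeffs, x_train))
    val_err = mse(y_val, np.polyval(coeffs, x_val))
    results[degree] = (train_err, val_err)
    print(f"degree {degree:2d}: train MSE = {train_err:.3f}, validation MSE = {val_err:.3f}")
```

Training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. Validation error is the number that reveals when the extra flexibility has started fitting noise.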

Signs of Overfitting

  1. Training error much lower than validation error (large gap)
  2. Model performs well on training data, poorly on new data
  3. Model is very complex relative to amount of training data
  4. Training error keeps decreasing but validation error increases

Signs of Underfitting

  1. Both training and validation errors are high
  2. Model is too simple to capture patterns
  3. Learning curves plateau at high error

Solutions

| Problem | Solutions |
|---|---|
| Underfitting | Increase model complexity; add more features; reduce regularization; train longer |
| Overfitting | Get more training data; reduce model complexity; add regularization; early stopping; dropout/data augmentation |
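
To give one of these remedies a concrete shape, here is a minimal sketch of L2 regularization (ridge regression) in closed form. Ridge is used only as an example of "add regularization"; the synthetic data, the function name `ridge_fit`, and the penalty values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Few examples relative to the number of features: a setting prone to overfitting.
n, d = 30, 20
X = rng.normal(size=(n, d))
y = X[:, 0] + 0.5 * rng.normal(size=n)   # only the first feature matters (made up)

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: minimize ||y - Xw||^2 + lam * ||w||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

penalties = (0.0, 1.0, 10.0, 100.0)
norms = []
for lam in penalties:
    w = ridge_fit(X, y, lam)
    norms.append(np.linalg.norm(w))
    print(f"lambda = {lam:6.1f}  ->  ||w|| = {norms[-1]:.3f}")
```

A larger penalty shrinks the weights toward zero, which limits how much the model can bend to fit noise. Too large a penalty pushes the model back toward underfitting, so the penalty is itself a hyperparameter to tune on the validation set.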

6. Model Capacity and Generalization

Model Capacity

Model capacity: The range of functions a model can represent.

  • Low capacity: Linear models, shallow trees (risk: underfitting)
  • High capacity: Deep neural networks, high-degree polynomials (risk: overfitting)

Key Principle: Match model capacity to:

  1. Problem complexity
  2. Amount of training data
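
The interaction between capacity and data size can be seen in a small experiment. The sketch below compares a low-capacity model (degree-1 polynomial) with a higher-capacity one (degree-9 polynomial) as the training set grows. The ground truth, noise level, degrees, and sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    """Noisy samples of a made-up ground truth, sin(pi * x)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + 0.3 * rng.normal(size=n)
    return x, y

x_val, y_val = make_data(1000)

def validation_mse(degree, n_train):
    """Fit a polynomial on a fresh training set and score it on the validation set."""
    x, y = make_data(n_train)
    coeffs = np.polyfit(x, y, degree)
    return np.mean((y_val - np.polyval(coeffs, x_val)) ** 2)

for n_train in (15, 50, 500):
    low = validation_mse(degree=1, n_train=n_train)
    high = validation_mse(degree=9, n_train=n_train)
    print(f"n = {n_train:3d}: degree-1 MSE = {low:.3f}, degree-9 MSE = {high:.3f}")
```

With very little data the high-capacity model tends to do badly; with plenty of data it can use its flexibility, while the low-capacity model stays limited by its bias no matter how much data it sees.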

The Golden Rule

[ \text{Model Capacity} \propto \sqrt{\text{Training Data Size}} ]

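More data → can support more complex models. Treat the proportionality above as a rough rule of thumb rather than a theorem: how much capacity a dataset can support also depends on the model class, the noise level, and the regularization used.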


7. Practical Guidelines

Checklist for Supervised Learning

  1. Split your data properly

    • 60-70% training, 15-20% validation, 15-20% test
    • Use stratified sampling for classification
  2. Choose appropriate loss function

    • Regression: MSE, MAE, Huber
    • Classification: Cross-entropy, hinge loss
  3. Start simple, increase complexity

    • Begin with linear models
    • Add complexity only if needed
  4. Monitor training and validation errors

    • Gap widening? → Overfitting
    • Both high? → Underfitting
  5. Use validation set to tune

    • Model selection
    • Hyperparameters
    • Early stopping
  6. Test set: use only once!

    • Final performance estimate
    • Report this as your result
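
The checklist can be sketched end to end in a few lines. The example below assumes synthetic data, a 70/15/15 split, and polynomial degree as the only hyperparameter; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data from a made-up ground truth.
x = rng.uniform(-1.0, 1.0, size=300)
y = np.sin(np.pi * x) + 0.3 * rng.normal(size=300)

# 1. Split: 70% training, 15% validation, 15% test, after shuffling.
idx = rng.permutation(300)
train, val, test = idx[:210], idx[210:255], idx[255:]

# 2. Loss function for regression.
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# 3-5. Start simple, then compare candidates on the validation set only.
candidates = (1, 3, 5, 7)
val_errors = {}
for degree in candidates:
    coeffs = np.polyfit(x[train], y[train], degree)
    val_errors[degree] = mse(y[val], np.polyval(coeffs, x[val]))
best_degree = min(val_errors, key=val_errors.get)

# 6. Evaluate the chosen model on the test set exactly once.
best_coeffs = np.polyfit(x[train], y[train], best_degree)
test_error = mse(y[test], np.polyval(best_coeffs, x[test]))
print(f"chosen degree = {best_degree}, test MSE = {test_error:.3f}")
```

The test indices are never consulted while choosing the degree, so `test_error` is not biased by the selection step.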

Key Takeaways

Supervised Learning: Learn from labeled examples to predict on new data

Data Splits: Training (learn), Validation (tune), Test (evaluate once)

Loss Functions: Quantify prediction error

  • MSE for regression
  • Cross-entropy for classification

Bias-Variance Tradeoff: Balance model complexity

  • High bias → underfitting (too simple)
  • High variance → overfitting (too complex)

Generalization: True goal is performance on unseen data, not memorizing training data

Model Capacity: Match complexity to data size and problem difficulty


Practice Problems

Problem 1: Data Splitting

You have 5000 labeled images. Split them into train/validation/test sets using scikit-learn.

Problem 2: Identify the Problem

Describe whether each scenario is underfitting or overfitting:

Problem 3: Calculate MSE

Implement MSE loss from scratch and compare with sklearn.


Next Steps

You now understand the framework for supervised learning. In the next lessons, we'll dive into specific algorithms:

  • Lesson 3: Linear Regression – the simplest regression algorithm
  • Lesson 4: Logistic Regression – binary classification
  • Lesson 5: Regularization – preventing overfitting

Each algorithm is a different way to minimize loss and find the best function (f) for your data!



Remember: The goal isn't to memorize training data – it's to learn patterns that generalize to new data. This is the essence of machine learning!