Introduction: From Continuous to Categorical
Imagine you're building an email spam filter. Unlike predicting house prices (a number), you need to predict a category: spam or not spam. You can't use linear regression here, because it predicts continuous values like 127.3, while you need a probability between 0 and 1.
Logistic regression solves this by applying a special function (the sigmoid) that "squashes" any real number into the range [0, 1], converting it to a probability.
Key Insight: Despite its name, logistic regression is a classification algorithm! It is called "regression" because it fits a linear model to the log-odds, but its output is a probability that we convert into a class prediction.
Learning Objectives
- Understand binary and multi-class classification
- Derive the logistic regression model from first principles
- Master the sigmoid function and decision boundaries
- Implement binary cross-entropy loss
- Train logistic regression with gradient descent
- Visualize decision boundaries interactively
- Extend to multi-class classification
1. From Regression to Classification
The Classification Problem
In classification, we predict discrete categories (classes):
| Type | Classes | Examples |
|---|---|---|
| Binary | 2 classes | Spam/Not Spam, Disease/Healthy, Cat/Dog |
| Multi-class | (k > 2) classes | Digit Recognition (0-9), Image Classification (cat/dog/bird) |
| Multi-label | Multiple per sample | Movie genres, Medical diagnoses |
This lesson focuses on binary classification (2 classes: 0 and 1).
2. The Sigmoid Function: From Scores to Probabilities
Why Not Linear Regression?
If we try linear regression for classification: [ \hat{y} = \mathbf{w}^T \mathbf{x} ]
Problems:
- The output can be any real number (e.g., -10, 5, 127)
- We need probabilities in [0, 1]
- A prediction like (\hat{y} = 2.7) is hard to interpret as a class
The Sigmoid (Logistic) Function
[ \sigma(z) = \frac{1}{1 + e^{-z}} ]
Properties:
- Maps any real number to [0, 1]
- (\sigma(0) = 0.5) (decision boundary)
- (\sigma(z) \to 1) as (z \to +\infty)
- (\sigma(z) \to 0) as (z \to -\infty)
- Smooth and differentiable everywhere
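To make these properties concrete, here is a minimal NumPy sketch of the sigmoid (the function name and test values are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array of them) into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Verify the properties listed above
print(sigmoid(0.0))    # 0.5 -> the decision boundary
print(sigmoid(10.0))   # ~0.99995, approaches 1 as z grows
print(sigmoid(-10.0))  # ~0.00005, approaches 0 as z shrinks
print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # works elementwise on arrays
```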
The Logistic Regression Model
[ \hat{y} = P(y=1 | \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}}} ]
Interpretation:
- (\hat{y}) = probability that sample belongs to class 1
- (1 - \hat{y}) = probability of class 0
- Decision rule: predict class 1 if (\hat{y} > 0.5), else class 0
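Here is a minimal sketch of the model and its decision rule, assuming the bias is folded into (\mathbf{w}) by appending a column of ones to the inputs (an assumption for this sketch, not something the formula above requires):

```python
import numpy as np

def predict_proba(X, w):
    """P(y=1 | x) for each row of X, where X already contains a bias column of ones."""
    return 1.0 / (1.0 + np.exp(-X @ w))

def predict(X, w, threshold=0.5):
    """Decision rule: class 1 if P(y=1 | x) exceeds the threshold, else class 0."""
    return (predict_proba(X, w) > threshold).astype(int)
```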
3. Decision Boundaries
Visualizing Classification
The decision boundary is where the model is uncertain: (P(y=1) = 0.5)
For logistic regression: (\mathbf{w}^T \mathbf{x} = 0)
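In two dimensions this boundary is just a line, which you can solve for directly. A quick sketch, with illustrative weights (w_1, w_2) and bias (b):

```python
import numpy as np

w1, w2, b = 2.0, -1.0, 0.5            # illustrative weights and bias
x1 = np.linspace(-3, 3, 100)
# Solve w1*x1 + w2*x2 + b = 0 for x2 (assumes w2 != 0)
x2_boundary = -(w1 * x1 + b) / w2
# Points on one side of this line get P(y=1) > 0.5 (class 1); the other side gets class 0.
```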
🧪 Push the idea further: TensorFlow Playground starts exactly where logistic regression ends, with a single sigmoid neuron on 2-D data. Click "Run" on the linear dataset, then switch to the circle dataset to watch it fail. In the next lesson on decision trees (and later with kernels), you'll fix that failure in two very different ways.
4. Binary Cross-Entropy Loss
Why Not MSE?
MSE is non-convex when combined with the sigmoid: multiple local minima make optimization hard.
Instead, we use Binary Cross-Entropy (Log Loss):
[ J(\mathbf{w}) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right] ]
Where (\hat{y}_i = \sigma(\mathbf{w}^T \mathbf{x}_i))
Intuition:
- If (y_i = 1): loss = (-\log(\hat{y}_i)) → high loss if (\hat{y}_i) is small
- If (y_i = 0): loss = (-\log(1-\hat{y}_i)) → high loss if (\hat{y}_i) is large
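A minimal NumPy sketch of this loss; clipping the predictions away from 0 and 1 is an implementation detail added here to avoid log(0), not part of the formula:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average log loss over n samples; y_pred are sigmoid outputs in (0, 1)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Confident and correct -> small loss; confident and wrong -> large loss
print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.1])))  # ~0.105
print(binary_cross_entropy(np.array([1, 0]), np.array([0.1, 0.9])))  # ~2.303
```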
Gradient for Logistic Regression
The gradient of cross-entropy with respect to weights is:
[ \nabla_{\mathbf{w}} J = \frac{1}{n} \mathbf{X}^T (\hat{\mathbf{y}} - \mathbf{y}) ]
Amazing fact: Same form as linear regression! Just replace predictions with sigmoid outputs.
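A vectorized sketch of this gradient, where X is the n×d design matrix and y holds the 0/1 labels:

```python
import numpy as np

def gradient(X, y, w):
    """Gradient of binary cross-entropy w.r.t. w: (1/n) * X^T (y_hat - y)."""
    y_hat = 1.0 / (1.0 + np.exp(-X @ w))  # sigmoid predictions
    return X.T @ (y_hat - y) / len(y)
```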
5. Training Logistic Regression
Interactive Model Training
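If you're reading a static copy of this lesson without the interactive widget, here is a minimal gradient-descent loop you can run instead. The toy data, learning rate, and iteration count below are arbitrary choices, not prescribed values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D data: class 0 clustered around (-1, -1), class 1 around (+1, +1)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
X = np.hstack([X, np.ones((100, 1))])     # append a bias column of ones

w = np.zeros(3)
lr = 0.1
for step in range(1000):
    y_hat = 1.0 / (1.0 + np.exp(-X @ w))  # forward pass: sigmoid(Xw)
    grad = X.T @ (y_hat - y) / len(y)     # cross-entropy gradient
    w -= lr * grad                        # gradient descent update

accuracy = np.mean((y_hat > 0.5).astype(int) == y)
print("weights:", w, "training accuracy:", accuracy)
```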
6. Multi-Class Classification
One-vs-Rest (OvR)
For (K) classes, train (K) binary classifiers:
- Classifier 1: Class 1 vs. {2, 3, ..., K}
- Classifier 2: Class 2 vs. {1, 3, ..., K}
- ...
Prediction: Choose class with highest probability.
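A sketch of the One-vs-Rest recipe; `fit_binary` and `predict_proba` stand in for any binary logistic regression trainer and scorer (such as the loop above), and their names are placeholders:

```python
import numpy as np

def ovr_train(X, y, num_classes, fit_binary):
    """Train one binary classifier per class: class k vs. everything else."""
    return [fit_binary(X, (y == k).astype(int)) for k in range(num_classes)]

def ovr_predict(X, models, predict_proba):
    """Score every sample with every binary model, then pick the most confident class."""
    scores = np.column_stack([predict_proba(X, m) for m in models])
    return np.argmax(scores, axis=1)
```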
Softmax Regression (Multinomial Logistic)
Direct extension to multi-class:
[ P(y = k | \mathbf{x}) = \frac{e^{\mathbf{w}_k^T \mathbf{x}}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^T \mathbf{x}}} ]
This is the softmax function, which generalizes the sigmoid to (K) classes.
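A small, numerically stable sketch of the softmax; subtracting the maximum score before exponentiating is an implementation detail that does not change the result:

```python
import numpy as np

def softmax(scores):
    """Turn a vector (or rows) of K class scores into probabilities that sum to 1."""
    scores = scores - np.max(scores, axis=-1, keepdims=True)  # stability shift
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.659, 0.242, 0.099]
```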
Key Takeaways
✅ Logistic Regression: Classification algorithm using the sigmoid function
✅ Sigmoid Function: (\sigma(z) = \frac{1}{1+e^{-z}}) maps real numbers to [0, 1]
✅ Model: (P(y=1|\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x}))
✅ Loss: Binary cross-entropy (convex, probabilistically motivated)
✅ Training: Gradient descent with the same update rule as linear regression
✅ Decision Boundary: Linear in feature space (where (\mathbf{w}^T \mathbf{x} = 0))
✅ Multi-Class: One-vs-Rest or softmax regression
Practice Problems
Problem 1: Implement Sigmoid
Problem 2: Compute Cross-Entropy Loss
Problem 3: Decision Boundary Interpretation
Given (\mathbf{w} = [2, -1, 3]) (including bias), what is the decision boundary equation?
Next Steps
You've mastered binary classification with logistic regression! Next:
- Lesson 5: Regularization – preventing overfitting with L1/L2 penalties
- Lesson 6: Decision Trees – non-linear decision boundaries
Logistic regression is used everywhere: web click prediction, medical diagnosis, credit scoring, and more!
Further Reading
Interactive Visualizations
- MLU-Explain: Logistic Regression – a scroll-story with an in-browser model that retrains as you drag points across the boundary.
- MLU-Explain: Double Descent – a classification-first tour of modern generalization theory; builds directly on this lesson's decision-boundary picture.
- TensorFlow Playground – the quintessential decision-boundary sandbox. Set activation to "Sigmoid" + 0 hidden layers for pure logistic regression, then add features to see why we'll need non-linear models next.
- Seeing Theory – Frequentist Inference – background on likelihood, the engine behind cross-entropy.
Video Tutorials
- StatQuest – Logistic Regression: Main Ideas and the maximum-likelihood follow-up (Josh Starmer) – the clearest audio-visual explanation of the sigmoid and its loss.
- Google ML Crash Course – Logistic Regression – short interactive exercises on sigmoids, log-loss, and thresholding.
Papers & Articles
- A Comparison of Numerical Optimizers for Logistic Regression – Tom Minka. Why L-BFGS and Newton methods converge faster than vanilla gradient descent for logistic loss.
- The Softmax Layer in Large Language Models – a modern reminder that multi-class logistic regression is still the output layer of every transformer.
Documentation & Books
- Book: An Introduction to Statistical Learning (2e) – James, Witten, Hastie, Tibshirani, Chapter 4 (free PDF).
- ML Cheatsheet – Cross-Entropy Derivation – a step-by-step derivation.
- scikit-learn: Logistic Regression – implementation reference with solver comparison.
Remember: Logistic regression is a linear model that learns linear decision boundaries. For non-linear problems, we'll need more powerful models (coming soon)!