Introduction: From Continuous to Categorical
Imagine you're building an email spam filter. Unlike predicting house prices (a number), you need to predict a category: spam or not spam. You can't use linear regression here – it predicts continuous values like 127.3, but you need a probability between 0 and 1.
Logistic regression solves this by applying a special function (the sigmoid) that "squashes" any real number into the range [0, 1], converting it to a probability.
Key Insight: Despite its name, logistic regression is a classification algorithm! It's called "regression" for historical reasons, but it outputs probabilities that we convert to class predictions.
Learning Objectives
- Understand binary and multi-class classification
- Derive the logistic regression model from first principles
- Master the sigmoid function and decision boundaries
- Implement binary cross-entropy loss
- Train logistic regression with gradient descent
- Visualize decision boundaries interactively
- Extend to multi-class classification
1. From Regression to Classification
The Classification Problem
In classification, we predict discrete categories (classes):
| Type | Classes | Examples |
| --- | --- | --- |
| Binary | 2 classes | Spam/Not Spam, Disease/Healthy, Cat/Dog |
| Multi-class | $k > 2$ classes | Digit Recognition (0-9), Image Classification (cat/dog/bird) |
| Multi-label | Multiple labels per sample | Movie genres, Medical diagnoses |
This lesson focuses on binary classification (2 classes: 0 and 1).
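If you want to follow along in code, here is a minimal illustrative sketch (assuming NumPy and scikit-learn are available) that generates a small synthetic binary dataset with labels 0 and 1:

```python
import numpy as np
from sklearn.datasets import make_classification

# Toy binary classification problem: 100 samples, 2 features, labels in {0, 1}
X, y = make_classification(
    n_samples=100, n_features=2, n_informative=2, n_redundant=0,
    n_classes=2, random_state=42,
)

print(X.shape)        # (100, 2)
print(np.unique(y))   # [0 1]
```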
2. The Sigmoid Function: From Scores to Probabilities
Why Not Linear Regression?
If we try linear regression for classification:

$$\hat{y} = \mathbf{w}^T \mathbf{x}$$
Problems:
- ❌ Output can be any value (e.g., -10, 5, 127)
- ❌ We need probabilities in [0, 1]
- ❌ Hard to interpret $\hat{y} = 2.7$ as a class
The Sigmoid (Logistic) Function
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Properties:
- Maps any real number to [0, 1]
- $\sigma(0) = 0.5$ (decision boundary)
- $\sigma(z) \to 1$ as $z \to +\infty$
- $\sigma(z) \to 0$ as $z \to -\infty$
- Smooth and differentiable everywhere
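To make these properties concrete, here is a minimal NumPy sketch of the sigmoid (an illustrative implementation, not tied to any particular library):

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array of them) into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))                             # 0.5 -> the decision threshold
print(sigmoid(np.array([-10, -1, 0, 1, 10])))
# approx [0.00005, 0.269, 0.5, 0.731, 0.99995]: saturates toward 0 and 1
```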
The Logistic Regression Model
$$\hat{y} = P(y=1 | \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}}}$$
Interpretation:
- $\hat{y}$ = probability that the sample belongs to class 1
- $1 - \hat{y}$ = probability of class 0
- Decision rule: predict class 1 if $\hat{y} > 0.5$, else class 0 (see the sketch below)
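As a rough sketch of how the model and decision rule fit together (the bias is folded into $\mathbf{w}$ via a constant 1 feature, and the weights below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w):
    """P(y=1 | x) for each row of X; X already contains a column of 1s for the bias."""
    return sigmoid(X @ w)

def predict(X, w, threshold=0.5):
    """Hard 0/1 labels from the decision rule y_hat > threshold."""
    return (predict_proba(X, w) > threshold).astype(int)

X = np.array([[1.0,  2.0, -1.0],
              [1.0, -3.0,  0.5]])   # first column = 1 (bias term)
w = np.array([0.1, 1.5, -0.5])      # made-up weights

print(predict_proba(X, w))   # probabilities in (0, 1), e.g. ~[0.97, 0.01]
print(predict(X, w))         # class labels, here [1 0]
```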
3. Decision Boundaries
Visualizing Classification
The decision boundary is where the model is uncertain: $P(y=1) = 0.5$
For logistic regression: $\mathbf{w}^T \mathbf{x} = 0$
Linear Decision Boundaries
Logistic regression creates linear decision boundaries in feature space:
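Here is a small illustrative sketch with made-up weights for two features, solving for the boundary line explicitly:

```python
import numpy as np

# Hypothetical learned parameters: bias w0 plus feature weights w1, w2
w0, w1, w2 = -1.0, 2.0, 3.0

# Boundary: w0 + w1*x1 + w2*x2 = 0  ->  x2 = -(w0 + w1*x1) / w2  (assuming w2 != 0)
x1 = np.linspace(-3, 3, 5)
x2 = -(w0 + w1 * x1) / w2

print(np.column_stack([x1, x2]))   # points lying exactly on the linear boundary
# Since w2 > 0, points above this line give w^T x > 0, i.e. P(y=1) > 0.5
```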
4. Binary Cross-Entropy Loss
Why Not MSE?
Composing the sigmoid with mean squared error gives a non-convex objective in $\mathbf{w}$, with flat regions and local minima that make gradient-based optimization unreliable.
Instead, we use Binary Cross-Entropy (Log Loss):
$$J(\mathbf{w}) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right]$$

where $\hat{y}_i = \sigma(\mathbf{w}^T \mathbf{x}_i)$.
Intuition:
- If $y_i = 1$: loss = $-\log(\hat{y}_i)$ → high loss if $\hat{y}_i$ is small
- If $y_i = 0$: loss = $-\log(1-\hat{y}_i)$ → high loss if $\hat{y}_i$ is large
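A minimal NumPy sketch of the loss (predicted probabilities are clipped away from 0 and 1 so the logarithms stay finite):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average log loss for labels in {0, 1} and predicted probabilities in (0, 1)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
print(binary_cross_entropy(y_true, np.array([0.9, 0.1, 0.8, 0.7])))  # confident and correct -> ~0.20
print(binary_cross_entropy(y_true, np.array([0.1, 0.9, 0.2, 0.3])))  # confident and wrong  -> ~1.86
```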
Gradient for Logistic Regression
The gradient of cross-entropy with respect to weights is:
$$\nabla_{\mathbf{w}} J = \frac{1}{n} \mathbf{X}^T (\hat{\mathbf{y}} - \mathbf{y})$$
Remarkable fact: this has the same form as the gradient for linear regression; the only difference is that the predictions $\hat{\mathbf{y}}$ are now sigmoid outputs instead of raw linear outputs.
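As a sanity check, here is a small sketch (toy data, random weights) that computes this gradient and compares it against a finite-difference approximation of the loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(w, X, y):
    return X.T @ (sigmoid(X @ w) - y) / len(y)   # (1/n) X^T (y_hat - y)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])   # bias column + 2 features
y = (rng.random(20) > 0.5).astype(float)
w = rng.normal(size=3)

# Central finite difference with respect to the first weight
eps = 1e-6
e0 = np.array([1.0, 0.0, 0.0])
numeric = (loss(w + eps * e0, X, y) - loss(w - eps * e0, X, y)) / (2 * eps)
print(gradient(w, X, y)[0], numeric)   # the two values should agree to several decimals
```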
5. Training Logistic Regression
Implementation from Scratch
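Below is a compact from-scratch sketch of batch gradient descent on the cross-entropy loss; the synthetic data, learning rate, and iteration count are arbitrary illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent; X should already include a bias column of 1s."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        y_hat = sigmoid(X @ w)          # current probabilities
        grad = X.T @ (y_hat - y) / n    # gradient of binary cross-entropy
        w -= lr * grad                  # gradient descent update
    return w

# Tiny synthetic problem: class 1 whenever x1 + x2 > 0
rng = np.random.default_rng(42)
X_raw = rng.normal(size=(200, 2))
y = (X_raw[:, 0] + X_raw[:, 1] > 0).astype(float)
X = np.column_stack([np.ones(len(X_raw)), X_raw])   # prepend the bias column

w = fit_logistic_regression(X, y)
accuracy = np.mean((sigmoid(X @ w) > 0.5) == y)
print("weights:", w)
print("training accuracy:", accuracy)
```

On this linearly separable toy problem, the learned boundary should land close to $x_1 + x_2 = 0$, so training accuracy should be near 1.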
6. Multi-Class Classification
One-vs-Rest (OvR)
For $k$ classes, train $k$ binary classifiers:
- Classifier 1: Class 1 vs. {2, 3, ..., k}
- Classifier 2: Class 2 vs. {1, 3, ..., k}
- ...
Prediction: Choose class with highest probability.
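A brief sketch of One-vs-Rest using scikit-learn's `OneVsRestClassifier` (the Iris dataset and default hyperparameters are just illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)   # 3 classes -> 3 binary classifiers

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

print(len(ovr.estimators_))   # 3: one "class k vs. the rest" model per class
print(ovr.predict(X[:5]))     # each prediction is the class with the highest score
```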
Softmax Regression (Multinomial Logistic)
Direct extension to multi-class:
$$P(y = k | \mathbf{x}) = \frac{e^{\mathbf{w}_k^T \mathbf{x}}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^T \mathbf{x}}}$$

This is the softmax function – it generalizes the sigmoid to $K$ classes.
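A minimal NumPy sketch of the softmax, using the usual subtract-the-max trick for numerical stability:

```python
import numpy as np

def softmax(scores):
    """Turn a vector of K class scores (w_k^T x) into K probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtracting the max avoids overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

scores = np.array([2.0, 1.0, 0.1])      # hypothetical scores for 3 classes
probs = softmax(scores)
print(probs)          # approx [0.659, 0.242, 0.099]
print(probs.sum())    # 1.0

# With K = 2, softmax reduces to the sigmoid of the score difference
print(softmax(np.array([3.0, 0.0]))[0], 1 / (1 + np.exp(-3.0)))   # both ~0.9526
```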
Key Takeaways
✓ Logistic Regression: Classification algorithm using sigmoid function
✓ Sigmoid Function: $\sigma(z) = \frac{1}{1+e^{-z}}$ maps real numbers to [0, 1]
✓ Model: $P(y=1|\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x})$
✓ Loss: Binary cross-entropy (convex, probabilistically motivated)
✓ Training: Gradient descent with same update rule as linear regression
✓ Decision Boundary: Linear in feature space (where $\mathbf{w}^T \mathbf{x} = 0$)
✓ Multi-Class: One-vs-Rest or Softmax regression
Practice Problems
Problem 1: Implement Sigmoid
Problem 2: Compute Cross-Entropy Loss
Problem 3: Decision Boundary Interpretation
Given $\mathbf{w} = [2, -1, 3]$ (including the bias term), what is the equation of the decision boundary?
Next Steps
You've mastered binary classification with logistic regression! Next:
- Lesson 5: Regularization – preventing overfitting with L1/L2 penalties
- Lesson 6: Decision Trees – non-linear decision boundaries
Logistic regression is used everywhere: web click prediction, medical diagnosis, credit scoring, and more!
Further Reading
- Tutorial: Logistic Regression in scikit-learn
- Mathematics: Cross-Entropy Derivation
- Book: An Introduction to Statistical Learning (Chapter 4)
- Interactive: Distill.pub - Neural Networks
Remember: Logistic regression is a linear model that learns linear decision boundaries. For non-linear problems, we'll need more powerful models (coming soon)!