Logistic Regression and Classification

Introduction: From Continuous to Categorical

Imagine you're building an email spam filter. Unlike predicting house prices (a number), you need to predict a category: spam or not spam. Linear regression is a poor fit here: it predicts unbounded continuous values like 127.3, but you need a probability between 0 and 1.

Logistic regression solves this by applying a special function (the sigmoid) that "squashes" any real number into the interval (0, 1), so the output can be read as a probability.

Key Insight: Despite its name, logistic regression is a classification algorithm! It's called "regression" because it fits a linear model to the log-odds, but it outputs probabilities that we convert to class predictions.

Learning Objectives

  • Understand binary and multi-class classification
  • Derive the logistic regression model from first principles
  • Master the sigmoid function and decision boundaries
  • Implement binary cross-entropy loss
  • Train logistic regression with gradient descent
  • Visualize decision boundaries interactively
  • Extend to multi-class classification

1. From Regression to Classification

The Classification Problem

In classification, we predict discrete categories (classes):

Type         Classes               Examples
Binary       2 classes             Spam/Not Spam, Disease/Healthy, Cat/Dog
Multi-class  \(k > 2\) classes     Digit recognition (0-9), image classification (cat/dog/bird)
Multi-label  Multiple per sample   Movie genres, medical diagnoses

This lesson focuses on binary classification (2 classes: 0 and 1).


2. The Sigmoid Function: From Scores to Probabilities

Why Not Linear Regression?

If we try linear regression for classification: \[ \hat{y} = \mathbf{w}^T \mathbf{x} \]

Problems:

  • ❌ Output can be any value (e.g., -10, 5, 127)
  • ❌ We need probabilities in [0, 1]
  • ❌ Hard to interpret \(\hat{y} = 2.7\) as a class

The Sigmoid (Logistic) Function

\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]

Properties:

  • Maps any real number into the open interval (0, 1)
  • \(\sigma(0) = 0.5\) (decision boundary)
  • \(\sigma(z) \to 1\) as \(z \to +\infty\)
  • \(\sigma(z) \to 0\) as \(z \to -\infty\)
  • Smooth and differentiable everywhere (checked numerically in the sketch below)
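
A quick numerical check of these properties, as a minimal NumPy sketch (the helper name `sigmoid` is our choice):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Verify the properties listed above
for z in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(f"sigmoid({z:+5.1f}) = {sigmoid(z):.6f}")
# sigmoid( +0.0) = 0.500000, and large |z| pushes the output toward 0 or 1
```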

The Logistic Regression Model

\[ \hat{y} = P(y=1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}}} \]

Interpretation:

  • \(\hat{y}\) = probability that the sample belongs to class 1
  • \(1 - \hat{y}\) = probability of class 0
  • Decision rule: predict class 1 if \(\hat{y} > 0.5\), else class 0 (see the sketch below)
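
A minimal sketch of the model and decision rule, assuming the bias is folded into \(\mathbf{w}\) as a constant-1 feature (the helper names `predict_proba` and `predict`, and the weight values, are our choices):

```python
import numpy as np

def predict_proba(X, w):
    """P(y=1 | x) for each row of X; assumes a constant-1 bias column in X."""
    return 1.0 / (1.0 + np.exp(-X @ w))

def predict(X, w, threshold=0.5):
    """Hard 0/1 labels via the decision rule above."""
    return (predict_proba(X, w) >= threshold).astype(int)

X = np.array([[1.0, 2.0, -1.0],    # first column is the bias feature
              [1.0, -0.5, 0.3]])
w = np.array([0.1, 1.5, -0.7])     # illustrative weights
print(predict_proba(X, w))         # probabilities in (0, 1)
print(predict(X, w))               # class labels 1 and 0
```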

3. Decision Boundaries

Visualizing Classification

The decision boundary is where the model is maximally uncertain: \(P(y=1) = 0.5\)

For logistic regression, this is the set of points where \(\mathbf{w}^T \mathbf{x} = 0\)
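
For example, in two dimensions with an explicit bias \(b\), the boundary \(b + w_1 x_1 + w_2 x_2 = 0\) is a straight line. A small sketch with illustrative weights:

```python
import numpy as np

# Illustrative 2-D weights: z = b + w1*x1 + w2*x2
b, w1, w2 = -0.5, 2.0, 1.0

# On the boundary z = 0, so  x2 = -(b + w1*x1) / w2
x1 = np.linspace(-2, 2, 5)
x2 = -(b + w1 * x1) / w2
print(np.column_stack([x1, x2]))  # points where the model predicts P(y=1) = 0.5
```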

🧪 Push the idea further: TensorFlow Playground starts exactly where logistic regression ends — a single sigmoid neuron on 2-D data. Click "Run" on the linear dataset, then switch to the circle dataset to watch it fail. In the next lesson on decision trees (and later with kernels), you'll fix that failure two very different ways.


4. Binary Cross-Entropy Loss

Why Not MSE?

With a sigmoid output, MSE is non-convex in the weights, so gradient descent can stall in flat regions or get stuck in local minima.

Instead, we use Binary Cross-Entropy (Log Loss):

\[ J(\mathbf{w}) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right] \]

Where \(\hat{y}_i = \sigma(\mathbf{w}^T \mathbf{x}_i)\)

Intuition:

  • If \(y_i = 1\): loss = \(-\log(\hat{y}_i)\) → high loss if \(\hat{y}_i\) is small
  • If \(y_i = 0\): loss = \(-\log(1-\hat{y}_i)\) → high loss if \(\hat{y}_i\) is large (see the numeric check below)
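
A quick numeric check of this intuition (the prediction values are illustrative):

```python
import numpy as np

# Per-sample loss when the true label is y = 1
for y_hat in [0.9, 0.5, 0.1]:
    print(f"y=1, y_hat={y_hat}: loss = {-np.log(y_hat):.3f}")
# 0.9 -> 0.105 (confident and correct: tiny loss)
# 0.1 -> 2.303 (confident and wrong: large loss)
```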

Gradient for Logistic Regression

The gradient of cross-entropy with respect to weights is:

\[ \nabla_{\mathbf{w}} J = \frac{1}{n} \mathbf{X}^T (\hat{\mathbf{y}} - \mathbf{y}) \]

Amazing fact: this is exactly the same form as the linear regression gradient; the only change is that the predictions \(\hat{\mathbf{y}}\) now come from the sigmoid.


5. Training Logistic Regression

Interactive Model Training

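A minimal end-to-end training sketch in NumPy, assuming synthetic Gaussian data and a bias folded into the weights (the learning rate and step count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-class data: two Gaussian blobs, bias folded in as a 1s column
n = 200
X = np.vstack([rng.normal(-1.0, 1.0, (n // 2, 2)),
               rng.normal(+1.0, 1.0, (n // 2, 2))])
X = np.hstack([np.ones((n, 1)), X])          # prepend bias feature
y = np.array([0] * (n // 2) + [1] * (n // 2))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(X.shape[1])
lr = 0.1
for step in range(500):
    y_hat = sigmoid(X @ w)                   # forward pass
    grad = X.T @ (y_hat - y) / n             # gradient from Section 4
    w -= lr * grad                           # gradient descent update

acc = np.mean((sigmoid(X @ w) >= 0.5) == y)
print(f"training accuracy: {acc:.2f}")
```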

6. Multi-Class Classification

One-vs-Rest (OvR)

For \(K\) classes, train \(K\) binary classifiers:

  • Classifier 1: Class 1 vs. {2, 3, ..., K}
  • Classifier 2: Class 2 vs. {1, 3, ..., K}
  • ...

Prediction: choose the class whose classifier assigns the highest probability (see the sketch below).
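
A minimal NumPy sketch of OvR built on the binary trainer from Section 5 (all helper names are ours; labels are assumed to be 0, ..., K-1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, lr=0.1, steps=500):
    """Plain binary logistic regression (as in Section 5) on 0/1 labels."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

def ovr_fit(X, y, K):
    # One classifier per class: class c vs. everything else
    return [train_binary(X, (y == c).astype(float)) for c in range(K)]

def ovr_predict(X, weights):
    probs = np.column_stack([sigmoid(X @ w) for w in weights])
    return probs.argmax(axis=1)  # class whose classifier is most confident
```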

Softmax Regression (Multinomial Logistic)

Direct extension to multi-class:

\[ P(y = k \mid \mathbf{x}) = \frac{e^{\mathbf{w}_k^T \mathbf{x}}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^T \mathbf{x}}} \]

This is the softmax function – it generalizes the sigmoid to \(K\) classes.
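
A minimal sketch of a numerically stable softmax (subtracting the row maximum before exponentiating is a standard overflow guard; the names and score values are ours):

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; the max-subtraction leaves the result unchanged."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

# scores[i, k] = w_k . x_i for sample i and class k (values illustrative)
scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 3.0, 0.2]])
probs = softmax(scores)
print(probs, probs.sum(axis=1))  # each row sums to 1
```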


Key Takeaways

Logistic Regression: Classification algorithm using sigmoid function

Sigmoid Function: \(\sigma(z) = \frac{1}{1+e^{-z}}\) maps any real number into (0, 1)

Model: \(P(y=1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x})\)

Loss: Binary cross-entropy (convex, probabilistically motivated)

Training: Gradient descent with same update rule as linear regression

Decision Boundary: Linear in feature space (where \(\mathbf{w}^T \mathbf{x} = 0\))

Multi-Class: One-vs-Rest or Softmax regression


Practice Problems

Problem 1: Implement Sigmoid

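One possible solution sketch: a numerically stable sigmoid that splits on the sign of \(z\) to avoid overflow in `exp` (the function name is our choice):

```python
import numpy as np

def sigmoid_stable(z):
    """Numerically stable sigmoid for 1-D float arrays."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])           # safe: z < 0 here, so exp(z) <= 1
    out[~pos] = ez / (1.0 + ez)
    return out

print(sigmoid_stable(np.array([-1000.0, 0.0, 1000.0])))  # [0.  0.5 1. ]
```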

Problem 2: Compute Cross-Entropy Loss

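One possible solution sketch: the loss from Section 4, with predictions clipped away from exactly 0 and 1 so the logs stay finite (the names and test values are ours):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean log loss; clipping keeps log() away from 0 and 1 exactly."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.8, 0.3])
print(binary_cross_entropy(y_true, y_pred))  # ~0.4389
```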

Problem 3: Decision Boundary Interpretation

Given \(\mathbf{w} = [2, -1, 3]\) (including bias), what is the decision boundary equation?

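One hedged worked solution, assuming the first component of \(\mathbf{w}\) is the bias, so that \(z = 2 - x_1 + 3x_2\):

\[ 2 - x_1 + 3x_2 = 0 \quad \Longleftrightarrow \quad x_2 = \frac{x_1 - 2}{3} \]

Points with \(z > 0\) (above this line) are predicted as class 1.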

Next Steps

You've mastered binary classification with logistic regression! Next:

  • Lesson 5: Regularization – preventing overfitting with L1/L2 penalties
  • Lesson 6: Decision Trees – non-linear decision boundaries

Logistic regression is used everywhere: web click prediction, medical diagnosis, credit scoring, and more!


Further Reading

Interactive Visualizations

  • MLU-Explain: Logistic Regression — scroll-story with an in-browser model that retrains as you drag points across the boundary.
  • MLU-Explain: Double Descent — a classification-first tour of modern generalization theory; builds directly on this lesson's decision-boundary picture.
  • TensorFlow Playground — the quintessential decision-boundary sandbox. Set activation to "Sigmoid" + 0 hidden layers for pure logistic regression, then add features to see why we'll need non-linear models next.
  • Seeing Theory — Frequentist Inference — background on likelihood, the engine behind cross-entropy.



Remember: Logistic regression is a linear model that learns linear decision boundaries. For non-linear problems, we'll need more powerful models (coming soon)!