Linear Regression: From Theory to Practice

Introduction: Finding the Best Line

Imagine you're a real estate agent trying to estimate house prices. You notice a pattern: bigger houses tend to cost more. If you plot square footage vs. price, the data roughly follows a straight line. Linear regression is about finding the best line that fits this relationship.

Key Insight: Linear regression assumes the relationship between inputs (features) and output is approximately linear. It's the simplest, most interpretable ML algorithm – and surprisingly powerful!

In this lesson, we'll explore linear regression from multiple angles:

  • Geometric intuition (fitting a line)
  • Algebraic approach (normal equation)
  • Optimization approach (gradient descent)
  • Probabilistic interpretation (maximum likelihood)

Learning Objectives

  • Understand the linear regression model
  • Derive the cost function (MSE)
  • Solve using the normal equation
  • Implement gradient descent from scratch
  • Interpret coefficients and make predictions
  • Recognize when linear regression is appropriate

1. The Linear Model

Simple Linear Regression (One Feature)

The simplest case: predict (y) from a single feature (x):

[ \hat{y} = w_0 + w_1 x ]

  • (w_0): intercept (bias) – value when (x = 0)
  • (w_1): slope (weight) – change in (y) for unit change in (x)
  • (\hat{y}): predicted value

Geometric View: This is the equation of a line!

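To make this concrete, here is a minimal NumPy sketch that fits a line to synthetic house-price data; the sizes, prices, and noise level below are made up purely for illustration:

```python
import numpy as np

# Synthetic data: house size (sq ft / 100) vs. price (in $1000s).
# The "true" relationship and the noise level are invented for this example.
rng = np.random.default_rng(0)
x = rng.uniform(5, 35, size=50)            # square footage / 100
y = 50 + 9 * x + rng.normal(0, 15, 50)     # price = 50 + 9*x + noise

# Fit a degree-1 polynomial: np.polyfit returns [w1, w0] (slope, then intercept)
w1, w0 = np.polyfit(x, y, deg=1)
print(f"intercept w0 = {w0:.2f}, slope w1 = {w1:.2f}")

# Predict the price of a 2,000 sq ft house (x = 20)
x_new = 20
print(f"predicted price: {w0 + w1 * x_new:.1f} (thousand dollars)")
```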

Multiple Linear Regression (Multiple Features)

Real-world problems have many features. For a house: square footage, bedrooms, age, location, etc.

[ \hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d = \mathbf{w}^T \mathbf{x} ]

In vector notation (prepending a constant 1 to (\mathbf{x}) so that (w_0) acts as the bias): [ \hat{y} = \mathbf{w}^T \mathbf{x} = \begin{bmatrix} w_0 & w_1 & w_2 & \cdots & w_d \end{bmatrix} \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} ]

Key: The model is linear in the parameters (w_i), not necessarily in the features!
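As a quick illustration of the vector form, the sketch below builds a small design matrix with a leading column of 1s (so the first weight plays the role of the intercept) and computes all predictions with one matrix-vector product. The feature values and weights are invented for the example:

```python
import numpy as np

# Three hypothetical houses: [square footage, bedrooms, age in years]
X_raw = np.array([
    [2100, 3, 20],
    [1600, 2, 35],
    [2400, 4,  5],
], dtype=float)

# Prepend a column of 1s so that w[0] acts as the intercept w_0
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# Made-up weights: [w0, w_sqft, w_bedrooms, w_age]
w = np.array([50.0, 0.12, 15.0, -0.8])

# One matrix-vector product gives all predictions at once: y_hat = X w
y_hat = X @ w
print(y_hat)
```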


2. The Cost Function: Mean Squared Error

Measuring Goodness of Fit

How do we find the "best" line? We need a cost function (loss function) that quantifies how well our line fits the data.

Mean Squared Error (MSE): [ J(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 ]

Or in matrix form: [ J(\mathbf{w}) = \frac{1}{n} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 ]

Where:

  • (\mathbf{X}): (n \times (d+1)) matrix (each row is a sample, first column is all 1s for intercept)
  • (\mathbf{y}): (n \times 1) vector of targets
  • (\mathbf{w}): ((d+1) \times 1) vector of weights

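A direct translation of the matrix form into NumPy might look like the following sketch; the tiny dataset is arbitrary and only meant to show that a better-fitting (\mathbf{w}) yields a lower cost:

```python
import numpy as np

def mse(w, X, y):
    """Mean squared error J(w) = (1/n) * ||y - Xw||^2."""
    residuals = y - X @ w
    return (residuals @ residuals) / len(y)

# Tiny synthetic example (values are arbitrary, for illustration only)
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # bias column + one feature
y = np.array([2.0, 2.9, 4.2])

print(mse(np.array([1.0, 1.0]), X, y))   # cost for w = [1, 1] (good fit, low cost)
print(mse(np.array([0.0, 0.0]), X, y))   # cost for w = [0, 0] (worse fit, higher cost)
```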


3. Solution #1: The Normal Equation (Analytical)

Closed-Form Solution

For linear regression, we can solve for (\mathbf{w}^*) analytically using calculus!

Derivation: Set the gradient of (J(\mathbf{w})) to zero:

[ \nabla_{\mathbf{w}} J = \frac{2}{n} \mathbf{X}^T (\mathbf{X}\mathbf{w} - \mathbf{y}) = \mathbf{0} ]

Solving for (\mathbf{w}):

[ \mathbf{X}^T \mathbf{X} \mathbf{w} = \mathbf{X}^T \mathbf{y} ]

[ \boxed{\mathbf{w}^* = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}} ]

This is the normal equation! It gives the exact optimal solution in one step.

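Here is one way to apply the normal equation in NumPy on synthetic data with known weights, so the recovered solution can be checked. The sketch solves the linear system (\mathbf{X}^T \mathbf{X} \mathbf{w} = \mathbf{X}^T \mathbf{y}) with np.linalg.solve rather than forming the inverse explicitly, which is the numerically safer choice:

```python
import numpy as np

# Synthetic data with known weights, so we can check the recovered solution.
rng = np.random.default_rng(1)
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
w_true = np.array([4.0, 2.0, -1.0, 0.5])
y = X @ w_true + rng.normal(0, 0.1, n)

# Normal equation: solve (X^T X) w = X^T y.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print("recovered weights:", np.round(w_star, 3))
print("true weights:     ", w_true)
```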

Pros and Cons of Normal Equation

Pros:

  • ✅ Exact solution (no iterations)
  • ✅ No hyperparameters to tune
  • ✅ Guaranteed global optimum

Cons:

  • ❌ Requires matrix inversion: (O(d^3)) complexity
  • ❌ Doesn't scale to large (d) (millions of features)
  • ❌ Requires (\mathbf{X}^T \mathbf{X}) to be invertible
  • ❌ Doesn't work for non-linear models

When to use: Small to medium datasets with (d < 10,000) features.


4. Solution #2: Gradient Descent (Iterative)

The Optimization Approach

Instead of solving analytically, we can iteratively improve (\mathbf{w}) by moving in the direction that reduces the cost.

Algorithm:

  1. Start with random weights (\mathbf{w})
  2. Compute gradient: (\nabla_{\mathbf{w}} J = \frac{2}{n} \mathbf{X}^T (\mathbf{X}\mathbf{w} - \mathbf{y}))
  3. Update weights: (\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} J)
  4. Repeat until convergence

Where (\alpha) is the learning rate (step size).


Implementing Gradient Descent

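A from-scratch sketch of batch gradient descent for MSE, following the four steps above; the synthetic data and hyperparameter values are placeholders you can change:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Minimize MSE with batch gradient descent; returns weights and cost history."""
    n, d = X.shape
    w = np.zeros(d)                             # step 1: start from initial weights
    history = []
    for _ in range(n_iters):
        grad = (2.0 / n) * X.T @ (X @ w - y)    # step 2: gradient of MSE
        w -= alpha * grad                       # step 3: move against the gradient
        history.append(np.mean((y - X @ w) ** 2))
    return w, history                           # step 4: stop after n_iters (or on convergence)

# Synthetic data (same style as the normal-equation example)
rng = np.random.default_rng(1)
n = 200
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
w_true = np.array([4.0, 2.0, -1.0])
y = X @ w_true + rng.normal(0, 0.1, n)

w_gd, history = gradient_descent(X, y, alpha=0.1, n_iters=500)
print("gradient descent:", np.round(w_gd, 3))
print("final cost:      ", round(history[-1], 5))
```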

Choosing the Learning Rate

The learning rate (\alpha) is critical:

  • Too small: slow convergence – takes forever to train
  • Just right: smooth, fast convergence – ✅ optimal
  • Too large: oscillations or divergence – never converges!

Rule of thumb: Start with (\alpha = 0.01) and adjust based on the cost curve.
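The short experiment below, on made-up data, runs the same gradient-descent loop with a learning rate that is too small, reasonable, and too large, then prints the final cost in each case. The exact numbers depend on the synthetic data, but the qualitative pattern matches the summary above:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 1))])
y = X @ np.array([3.0, 2.0]) + rng.normal(0, 0.1, n)

for alpha in [0.001, 0.1, 1.5]:        # too small, reasonable, too large
    w = np.zeros(2)
    for _ in range(100):
        w -= alpha * (2.0 / n) * X.T @ (X @ w - y)
    cost = np.mean((y - X @ w) ** 2)
    print(f"alpha={alpha:<6} final cost after 100 steps: {cost:.4g}")
```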


5. Probabilistic Interpretation (Maximum Likelihood)

The Statistical View

Linear regression can be derived from a probabilistic perspective:

Assumption: Targets are noisy observations of a linear function: [ y_i = \mathbf{w}^T \mathbf{x}_i + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2) ]

Where (\epsilon_i) is Gaussian noise.

This means: [ y_i | \mathbf{x}_i, \mathbf{w} \sim \mathcal{N}(\mathbf{w}^T \mathbf{x}_i, \sigma^2) ]

Maximum Likelihood Estimation (MLE): Find (\mathbf{w}) that maximizes the probability of observing the data.

Result: MLE for linear regression with Gaussian noise is equivalent to minimizing MSE!

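To see why, write out the log-likelihood of the data under the Gaussian noise assumption (a short derivation sketch):

[ \log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \sum_{i=1}^{n} \log \mathcal{N}(y_i \mid \mathbf{w}^T \mathbf{x}_i, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 - \frac{n}{2} \log(2\pi\sigma^2) ]

The second term does not depend on (\mathbf{w}), so maximizing the log-likelihood over (\mathbf{w}) is exactly the same as minimizing (\sum_i (y_i - \mathbf{w}^T \mathbf{x}_i)^2), which is (n) times the MSE.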


6. Model Interpretation and Diagnostics

Interpreting Coefficients

Each weight (w_j) tells us: holding all other features constant, a 1-unit increase in (x_j) corresponds to a (w_j)-unit change in the predicted value (\hat{y}).

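The sketch below fits a least-squares model to invented housing data and prints each coefficient next to its feature name; all feature names, values, and "true" effects here are fabricated for illustration:

```python
import numpy as np

# Hypothetical housing data: columns are [sqft, bedrooms, age]; values are made up.
feature_names = ["sqft", "bedrooms", "age"]
rng = np.random.default_rng(3)
n = 300
sqft = rng.uniform(800, 3000, n)
bedrooms = rng.integers(1, 6, n).astype(float)
age = rng.uniform(0, 50, n)
price = 20 + 0.10 * sqft + 12 * bedrooms - 0.5 * age + rng.normal(0, 10, n)

X = np.column_stack([np.ones(n), sqft, bedrooms, age])
w, *_ = np.linalg.lstsq(X, price, rcond=None)

print(f"intercept: {w[0]:.2f}")
for name, coef in zip(feature_names, w[1:]):
    print(f"{name:>8}: {coef:+.3f}  (change in price per 1-unit increase, others held fixed)")
```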

Residual Plots

Always check residuals to validate assumptions:

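One way to produce a residual plot with Matplotlib, using self-contained synthetic data (any fitted model and dataset could be substituted); refer to the checklist below this sketch when reading the plot:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data where the linear assumption holds, so residuals should look like noise.
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 200)
y = 1.5 + 2.0 * x + rng.normal(0, 1.0, 200)

X = np.column_stack([np.ones_like(x), x])
w, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ w
residuals = y - y_hat

plt.scatter(y_hat, residuals, s=10, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.title("Residuals vs. predictions (should look like structureless noise)")
plt.show()
```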

Diagnostic Checklist:

  • ✅ Residuals centered at 0
  • ✅ No pattern in residual plot
  • ✅ Residuals roughly constant variance (homoscedastic)
  • ✅ Residuals approximately Gaussian

7. When to Use Linear Regression

Appropriate When:

  • ✅ Relationship is approximately linear
  • ✅ You need interpretability (coefficients have meaning)
  • ✅ You have sufficient data (more samples than features)
  • ✅ Features are relatively independent (low multicollinearity)

Not Appropriate When:

  • ❌ Relationship is clearly non-linear (use polynomial regression or other models)
  • ❌ Severe outliers (consider robust regression)
  • ❌ More features than samples (use regularization – next lesson!)
  • ❌ Complex interactions between features (consider tree-based methods)

Key Takeaways

Linear Model: (\hat{y} = \mathbf{w}^T \mathbf{x}) – prediction is weighted sum of features

Cost Function: MSE measures average squared error

Two Solution Methods:

  • Normal Equation: (\mathbf{w}^* = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}) (exact, one-step)
  • Gradient Descent: Iterative optimization (scalable)

Probabilistic View: Minimizing MSE ≡ Maximum Likelihood with Gaussian noise

Interpretation: Coefficients show feature importance and direction of effect

Diagnostics: Check residual plots to validate linear assumption


Practice Problems

Problem 1: Implement Normal Equation

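The original interactive cell is not shown here; as a stand-in, here is a minimal starter template (the synthetic check data and function stub are placeholders of our own). Complete `normal_equation` and compare your result against NumPy's least-squares solver:

```python
import numpy as np

def normal_equation(X, y):
    """Return w* = (X^T X)^{-1} X^T y.

    TODO: implement this. Hint: prefer np.linalg.solve(X.T @ X, X.T @ y)
    over explicitly inverting X.T @ X.
    """
    return np.zeros(X.shape[1])        # placeholder; replace with your solution

# Check your implementation against NumPy's least-squares solver.
rng = np.random.default_rng(5)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, -2.0, 3.0]) + rng.normal(0, 0.1, 100)

w_yours = normal_equation(X, y)
w_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
print("yours:    ", np.round(w_yours, 3))
print("reference:", np.round(w_ref, 3))
```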

Problem 2: Gradient Descent with Momentum

Enhance gradient descent with momentum: (v_t = \beta v_{t-1} + \alpha \nabla J), then (\mathbf{w} \leftarrow \mathbf{w} - v_t)

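Again as a stand-in for the interactive cell, here is one possible reference sketch of the momentum update described above, which you can compare your own implementation against (the synthetic data and hyperparameter values are arbitrary):

```python
import numpy as np

def gradient_descent_momentum(X, y, alpha=0.1, beta=0.9, n_iters=500):
    """Gradient descent with momentum: v <- beta*v + alpha*grad; w <- w - v."""
    n, d = X.shape
    w = np.zeros(d)
    v = np.zeros(d)
    history = []
    for _ in range(n_iters):
        grad = (2.0 / n) * X.T @ (X @ w - y)
        v = beta * v + alpha * grad     # accumulate a velocity term
        w = w - v                       # update weights using the velocity
        history.append(np.mean((y - X @ w) ** 2))
    return w, history

# Synthetic check data (arbitrary values, for illustration)
rng = np.random.default_rng(6)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
y = X @ np.array([2.0, 0.5, -1.5]) + rng.normal(0, 0.1, 200)

w_m, hist = gradient_descent_momentum(X, y)
print("weights:", np.round(w_m, 3), " final cost:", round(hist[-1], 5))
```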


Next Steps

You now understand linear regression deeply! Next lessons:

  • Lesson 4: Logistic Regression – extending to classification
  • Lesson 5: Regularization – handling overfitting and many features

Linear regression is the foundation for many algorithms. Master it, and more complex models will make sense!


Remember: Linear regression is simple but powerful. Often, the simplest model that works is the best model to use!