Linear Regression: From Theory to Practice

Introduction: Finding the Best Line

Imagine you're a real estate agent trying to estimate house prices. You notice a pattern: bigger houses tend to cost more. If you plot square footage vs. price, the data roughly follows a straight line. Linear regression is about finding the best line that fits this relationship.

Key Insight: Linear regression assumes the relationship between inputs (features) and output is approximately linear. It's the simplest, most interpretable ML algorithm – and surprisingly powerful!

In this lesson, we'll explore linear regression from multiple angles:

Geometric intuition (fitting a line)
Algebraic approach (normal equation)
Optimization approach (gradient descent)
Probabilistic interpretation (maximum likelihood)

See It Before You Derive It

Before any math, watch the algorithm actually find the best line. The Gradient Descent Animator runs the optimization step-by-step on a loss surface — the same process we'll formalize.

FIG. 02Gradient Descent Animator

INTERACTIVE

LOADING INSTRUMENT

Fig. 02Step-by-step visualization of gradient descent

Try it: Crank the learning rate up to its maximum and press run — watch the descent path overshoot and bounce across the valley instead of settling. Then drop it back to a small value and notice how the loss curve falls smoothly but slowly. That tension is the whole story of section 4.

Learning Objectives

Understand the linear regression model
Derive the cost function (MSE)
Solve using the normal equation
Implement gradient descent from scratch
Interpret coefficients and make predictions
Recognize when linear regression is appropriate

1. The Linear Model

Simple Linear Regression (One Feature)

The simplest case: predict (y) from a single feature (x):

[ \hat{y} = w_0 + w_1 x ]

(w_0): intercept (bias) – value when (x = 0)
(w_1): slope (weight) – change in (y) for unit change in (x)
(\hat{y}): predicted value

Geometric View: This is the equation of a line!

FIG. 04Python Code Executor

INTERACTIVE

LOADING INSTRUMENT

Fig. 04Interactive Python code execution environment

Multiple Linear Regression (Multiple Features)

Real-world problems have many features. For a house: square footage, bedrooms, age, location, etc.

[ \hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d = \mathbf{w}^T \mathbf{x} ]

In vector notation (adding bias to (\mathbf{x})): [ \hat{y} = \mathbf{w}^T \mathbf{x} = \begin{bmatrix} w_0 & w_1 & w_2 & \cdots & w_d \end{bmatrix} \begin{bmatrix} 1 \ x_1 \ x_2 \ \vdots \ x_d \end{bmatrix} ]

Key: The model is linear in the parameters (w_i), not necessarily in the features!

2. The Cost Function: Mean Squared Error

Measuring Goodness of Fit

How do we find the "best" line? We need a cost function (loss function) that quantifies how well our line fits the data.

Mean Squared Error (MSE): [ J(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}i)^2 = \frac{1}{n} \sum{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 ]

Or in matrix form: [ J(\mathbf{w}) = \frac{1}{n} |\mathbf{y} - \mathbf{X}\mathbf{w}|^2 ]

Where:

(\mathbf{X}): (n \times (d+1)) matrix (each row is a sample, first column is all 1s for intercept)
(\mathbf{y}): (n \times 1) vector of targets
(\mathbf{w}): ((d+1) \times 1) vector of weights

FIG. 06Python Code Executor

INTERACTIVE

LOADING INSTRUMENT

Fig. 06Interactive Python code execution environment

3. Solution #1: The Normal Equation (Analytical)

Closed-Form Solution

For linear regression, we can solve for (\mathbf{w}^*) analytically using calculus!

Derivation: Set the gradient of (J(\mathbf{w})) to zero:

[ \nabla_{\mathbf{w}} J = \frac{2}{n} \mathbf{X}^T (\mathbf{X}\mathbf{w} - \mathbf{y}) = \mathbf{0} ]

Solving for (\mathbf{w}):

[ \mathbf{X}^T \mathbf{X} \mathbf{w} = \mathbf{X}^T \mathbf{y} ]

[ \boxed{\mathbf{w}^* = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}} ]

This is the normal equation! It gives the exact optimal solution in one step.

FIG. 08Python Code Executor

INTERACTIVE

LOADING INSTRUMENT

Fig. 08Interactive Python code execution environment

Pros and Cons of Normal Equation

Pros:

✅ Exact solution (no iterations)
✅ No hyperparameters to tune
✅ Guaranteed global optimum

Cons:

❌ Requires matrix inversion: (O(d^3)) complexity
❌ Doesn't scale to large (d) (millions of features)
❌ Requires (\mathbf{X}^T \mathbf{X}) to be invertible
❌ Doesn't work for non-linear models

When to use: Small to medium datasets with (d < 10,000) features.

4. Solution #2: Gradient Descent (Iterative)

The Optimization Approach

Instead of solving analytically, we can iteratively improve (\mathbf{w}) by moving in the direction that reduces the cost.

Algorithm:

Start with random weights (\mathbf{w})
Compute gradient: (\nabla_{\mathbf{w}} J = \frac{2}{n} \mathbf{X}^T (\mathbf{X}\mathbf{w} - \mathbf{y}))
Update weights: (\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} J)
Repeat until convergence

Where (\alpha) is the learning rate (step size).

Visualizing Gradient Descent

Recall the Gradient Descent Animator from the top of the lesson — now that you have the update rule (\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} J), scroll back up and rerun it with that equation in mind. Each frame is one application of that line.

SEE

🗺️ Companion 3-D view from outside the platform: ML Visualized's linear-regression trainer shows the fitted line and the parameter-space descent path side by side. Seeing both at once is the moment "the slope of the loss is the update rule" stops being a slogan.

Key Observations:

The algorithm starts far from the optimum and iteratively moves toward the minimum
Learning rate controls step size: too high → oscillation, too low → slow convergence
Different optimizers (Momentum, Adam) can converge faster on challenging surfaces
The loss decreases monotonically when learning rate is properly tuned

Training the Model Interactively

Now try training a linear regression model yourself! Adjust the learning rate and watch how it affects convergence.

FIG. 10ML Model Trainer

INTERACTIVE

LOADING INSTRUMENT

Fig. 10Train machine learning models interactively

Implementing Gradient Descent from Scratch

FIG. 12Python Code Executor

INTERACTIVE

LOADING INSTRUMENT

Fig. 12Interactive Python code execution environment

Choosing the Learning Rate

The learning rate (\alpha) is critical:

(\alpha)	Effect	Result
Too small	Slow convergence	Takes forever to train
Just right	Smooth, fast convergence	✅ Optimal
Too large	Oscillations, divergence	Never converges!

Rule of thumb: Start with (\alpha = 0.01) and adjust based on the cost curve.

5. Probabilistic Interpretation (Maximum Likelihood)

The Statistical View

Linear regression can be derived from a probabilistic perspective:

Assumption: Targets are noisy observations of a linear function: [ y_i = \mathbf{w}^T \mathbf{x}_i + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2) ]

Where (\epsilon_i) is Gaussian noise.

This means: [ y_i | \mathbf{x}_i, \mathbf{w} \sim \mathcal{N}(\mathbf{w}^T \mathbf{x}_i, \sigma^2) ]

Maximum Likelihood Estimation (MLE): Find (\mathbf{w}) that maximizes the probability of observing the data.

Result: MLE for linear regression with Gaussian noise is equivalent to minimizing MSE!

FIG. 14Python Code Executor

INTERACTIVE

LOADING INSTRUMENT

Fig. 14Interactive Python code execution environment

6. Model Interpretation and Diagnostics

Interpreting Coefficients

Each weight (w_j) tells us: "Holding all other features constant, a 1-unit increase in (x_j) leads to a (w_j)-unit change in (y)."

FIG. 16Python Code Executor

INTERACTIVE

LOADING INSTRUMENT

Fig. 16Interactive Python code execution environment

Residual Plots

Always check residuals to validate assumptions:

FIG. 18Python Code Executor

INTERACTIVE

LOADING INSTRUMENT

Fig. 18Interactive Python code execution environment

Diagnostic Checklist:

✅ Residuals centered at 0
✅ No pattern in residual plot
✅ Residuals roughly constant variance (homoscedastic)
✅ Residuals approximately Gaussian

7. When to Use Linear Regression

Appropriate When:

✅ Relationship is approximately linear
✅ You need interpretability (coefficients have meaning)
✅ You have sufficient data (more samples than features)
✅ Features are relatively independent (low multicollinearity)

Not Appropriate When:

❌ Relationship is clearly non-linear (use polynomial regression or other models)
❌ Severe outliers (consider robust regression)
❌ More features than samples (use regularization – next lesson!)
❌ Complex interactions between features (consider tree-based methods)

Key Takeaways

✓ Linear Model: (\hat{y} = \mathbf{w}^T \mathbf{x}) – prediction is weighted sum of features

✓ Cost Function: MSE measures average squared error

✓ Two Solution Methods:

Normal Equation: (\mathbf{w}^* = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}) (exact, one-step)
Gradient Descent: Iterative optimization (scalable)

✓ Probabilistic View: Minimizing MSE ≡ Maximum Likelihood with Gaussian noise

✓ Interpretation: Coefficients show feature importance and direction of effect

✓ Diagnostics: Check residual plots to validate linear assumption

Practice Problems

Problem 1: Implement Normal Equation

FIG. 20Python Code Executor

INTERACTIVE

LOADING INSTRUMENT

Fig. 20Interactive Python code execution environment

Problem 2: Gradient Descent with Momentum

Enhance gradient descent with momentum: (v_t = \beta v_{t-1} + \alpha \nabla J), then (\mathbf{w} \leftarrow \mathbf{w} - v_t)

FIG. 22Python Code Executor

INTERACTIVE

LOADING INSTRUMENT

Fig. 22Interactive Python code execution environment

Next Steps

You now understand linear regression deeply! Next lessons:

Lesson 4: Logistic Regression – extending to classification
Lesson 5: Regularization – handling overfitting and many features

Linear regression is the foundation for many algorithms. Master it, and more complex models will make sense!

Linear Regression: From Theory to Practice

Introduction: Finding the Best Line

See It Before You Derive It

Learning Objectives

1. The Linear Model

Simple Linear Regression (One Feature)

Multiple Linear Regression (Multiple Features)

2. The Cost Function: Mean Squared Error

Measuring Goodness of Fit

3. Solution #1: The Normal Equation (Analytical)

Closed-Form Solution

Pros and Cons of Normal Equation

4. Solution #2: Gradient Descent (Iterative)

The Optimization Approach

Visualizing Gradient Descent

Training the Model Interactively

Implementing Gradient Descent from Scratch

Choosing the Learning Rate

5. Probabilistic Interpretation (Maximum Likelihood)

The Statistical View

6. Model Interpretation and Diagnostics

Interpreting Coefficients

Residual Plots

7. When to Use Linear Regression

Appropriate When:

Not Appropriate When:

Key Takeaways

Practice Problems

Problem 1: Implement Normal Equation

Problem 2: Gradient Descent with Momentum

Next Steps

Further Reading

Interactive Visualizations

Video Tutorials

Papers & Articles

Documentation & Books