Regularization: L1, L2, and Elastic Net

Introduction: Taming Overfitting

Imagine you're fitting a curve through noisy data. With enough polynomial terms, you can make the curve pass through every single point. But this "perfect" fit is terrible – it follows the noise, not the pattern!

Regularization is like telling the model: "Yes, fit the data well, but keep it simple. Don't go crazy with wild coefficients just to capture every tiny wiggle."

Key Insight: Regularization adds a "simplicity penalty" to the loss function, discouraging complex models that overfit.

Learning Objectives

Understand overfitting and why regularization helps
Master Ridge regression (L2 regularization)
Learn Lasso regression (L1 regularization)
Combine both with Elastic Net
Choose the right regularization for your problem
Implement regularized models from scratch

1. The Overfitting Problem (Revisited)

When Models Get Too Confident

The core issue: With many features or complex models, we can fit training data perfectly but fail on new data.
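
The effect can be reproduced in a few lines. The sketch below is a hypothetical toy setup (a noisy sine curve, 12 training points, NumPy's `Polynomial.fit`), not the lesson's original demo: it fits polynomials of increasing degree and compares the error on the training points with the error on fresh points.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a smooth function (an assumed toy problem)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(12)
x_test, y_test = make_data(200)

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 3, 11):
    # Least-squares fit; degree 11 has enough terms to hit all 12 points.
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(model, x_train, y_train),
                       mse(model, x_test, y_test))
    print(f"degree {degree:2d}: "
          f"train MSE = {results[degree][0]:.4f}, "
          f"test MSE = {results[degree][1]:.4f}")
```

The training error can only go down as the degree grows, because each model contains the previous one. The test error is the number that matters, and for the degree-11 fit it is far above the near-zero training error.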

2. Ridge Regression (L2 Regularization)

Adding a Simplicity Penalty

Idea: Penalize large coefficients to keep the model simple.

Ridge cost function: \[ J(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 + \alpha \sum_{j=1}^{d} w_j^2 \]

Or more compactly: \[ J(\mathbf{w}) = \text{MSE}(\mathbf{w}) + \alpha \|\mathbf{w}\|_2^2 \]

Parameters:

- \(\alpha \ge 0\): regularization strength
  - \(\alpha = 0\): No regularization (standard linear regression)
  - \(\alpha \to \infty\): Maximum regularization (all weights → 0)
- \(\|\mathbf{w}\|_2^2 = \sum_j w_j^2\): squared L2 (Euclidean) norm of the weights
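
As a small worked example of the cost function, the sketch below evaluates it on hypothetical toy data where the weights fit the targets exactly, so the MSE term is zero and only the penalty remains. The function name `ridge_cost` and the data are illustrative assumptions.

```python
import numpy as np

def ridge_cost(w, X, y, alpha):
    """J(w) = MSE(w) + alpha * ||w||_2^2."""
    residuals = y - X @ w
    mse = np.mean(residuals ** 2)
    penalty = alpha * np.sum(w ** 2)
    return mse + penalty

# Assumed toy data: 3 samples, 2 features.
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([5.0, 2.0, 5.0])
w = np.array([1.0, 2.0])   # X @ w equals y, so the MSE is zero

print(ridge_cost(w, X, y, alpha=0.0))  # data term only
print(ridge_cost(w, X, y, alpha=0.1))  # adds 0.1 * (1^2 + 2^2)
```

With \(\alpha = 0\) the cost is 0; with \(\alpha = 0.1\) it is \(0.1 \cdot 5 = 0.5\), even though the fit to the data has not changed. The penalty charges for the size of the weights alone.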

Closed-Form Solution

Ridge regression has an analytical solution:

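\[ \mathbf{w}^* = (\mathbf{X}^T \mathbf{X} + n\alpha \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y} \]

The factor \(n\) comes from the \(\frac{1}{n}\) in the MSE term: setting the gradient \(-\frac{2}{n}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w}) + 2\alpha\mathbf{w}\) to zero gives \((\mathbf{X}^T\mathbf{X} + n\alpha\mathbf{I})\mathbf{w} = \mathbf{X}^T\mathbf{y}\). If the loss is written as a sum of squared errors instead of a mean, \(n\) is absorbed into \(\alpha\) and the familiar form \((\mathbf{X}^T \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}\) results.

Key difference from the normal equation: for any \(\alpha > 0\), the added multiple of \(\mathbf{I}\) makes the matrix positive definite and therefore always invertible, even when \(\mathbf{X}^T \mathbf{X}\) is singular (collinear features, or \(d > n\))!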
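
A minimal NumPy sketch of the closed-form solution follows. It assumes no intercept term (or centered data) and uses the mean-squared-error form of the cost, which is why the identity matrix is scaled by `n * alpha`; with a sum-of-squares loss it would be scaled by `alpha` alone. The name `ridge_fit` and the random data are illustrative.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Minimize (1/n) * ||y - X w||^2 + alpha * ||w||^2 in closed form."""
    n, d = X.shape
    A = X.T @ X + n * alpha * np.eye(d)
    # Solving the linear system is more stable than forming the inverse.
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
n, d = 5, 8                      # more features than samples
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# X^T X is 8x8 but has rank at most 5, so plain least squares
# has no unique solution here.
print("rank of X^T X:", np.linalg.matrix_rank(X.T @ X))

alpha = 0.1
w = ridge_fit(X, y, alpha)

# At the minimizer the gradient of the ridge cost vanishes.
grad = -(2 / n) * X.T @ (y - X @ w) + 2 * alpha * w
print("max |gradient|:", np.max(np.abs(grad)))
```

The gradient check is a quick way to confirm that the returned weights really minimize the regularized cost, even though the unregularized problem is singular.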

Geometric Interpretation

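Ridge regression is equivalent to minimizing the MSE subject to \(\|\mathbf{w}\|_2^2 \le t\), where the budget \(t\) shrinks as \(\alpha\) grows. The weights are therefore confined to a sphere (the L2 ball), and the solution sits where the elliptical MSE contours first touch it.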
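
One way to see the shrinking ball numerically is to track the length of the ridge solution as \(\alpha\) increases. The sketch below uses assumed synthetic data and the closed-form solution for the mean-squared-error cost; the weights `w_true` are a hypothetical ground truth.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 5
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5, 0.0, 3.0])   # assumed ground truth
y = X @ w_true + rng.normal(scale=0.5, size=n)

def ridge_fit(X, y, alpha):
    """Closed-form ridge for (1/n) * ||y - X w||^2 + alpha * ||w||^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * alpha * np.eye(d), X.T @ y)

alphas = [0.0, 0.01, 0.1, 1.0, 10.0]
norms = [float(np.linalg.norm(ridge_fit(X, y, a))) for a in alphas]
for a, r in zip(alphas, norms):
    print(f"alpha = {a:5.2f}  ->  ||w||_2 = {r:.3f}")
```

The norm decreases monotonically: a larger \(\alpha\) corresponds to a smaller ball that the weights must fit inside.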

3. Lasso Regression (L1 Regularization)

Sparse Solutions

Lasso cost function: \[ J(\mathbf{w}) = \text{MSE}(\mathbf{w}) + \alpha \sum_{j=1}^{d} |w_j| = \text{MSE}(\mathbf{w}) + \alpha \|\mathbf{w}\|_1 \]

Key difference: Uses absolute values (\(|w_j|\)) instead of squares!

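Magic property: once \(\alpha\) is large enough, Lasso drives some coefficients exactly to zero → automatic feature selection! Unlike Ridge, Lasso has no closed-form solution in general, so it is fitted iteratively (for example by coordinate descent).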
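
A from-scratch sketch of Lasso using coordinate descent is shown below. It is one common way to fit the model, not necessarily the method the lesson's original code used. It assumes no intercept and the cost exactly as written above (mean squared error plus \(\alpha \|\mathbf{w}\|_1\)), which puts the soft-threshold at \(\alpha / 2\); libraries that halve the MSE term threshold at \(\alpha\) instead. The function names and synthetic data are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; return exactly 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, n_iters=200):
    """Coordinate descent for (1/n) * ||y - X w||^2 + alpha * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = np.sum(X ** 2, axis=0) / n       # (1/n) * x_j^T x_j
    for _ in range(n_iters):
        for j in range(d):
            # Residual with feature j's current contribution removed.
            r_j = y - X @ w + X[:, j] * w[j]
            rho_j = X[:, j] @ r_j / n
            w[j] = soft_threshold(rho_j, alpha / 2) / col_sq[j]
    return w

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:2] = [3.0, -2.0]                      # only two features matter
y = X @ w_true + rng.normal(scale=0.1, size=n)

w_hat = lasso_cd(X, y, alpha=0.2)
print(np.round(w_hat, 3))
print("non-zero coefficients:", np.count_nonzero(w_hat))
```

The eight irrelevant features receive coefficients that are exactly zero, not merely small, while the two relevant ones are kept and shrunk slightly toward zero.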

Why Lasso Creates Sparsity

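The L1 constraint region \(\|\mathbf{w}\|_1 \le t\) forms a diamond (not a circle). Its corners lie on the coordinate axes, so the elliptical MSE contours often first touch the region at a corner, where one or more weights are exactly zero. The L2 ball has no corners, so its contact point almost never lands exactly on an axis.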
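
The same contrast shows up in one dimension, where both penalties have a closed-form answer. In the sketch below, `z` stands for the unregularized solution of a single weight; minimizing \((w - z)^2 + \alpha |w|\) gives soft thresholding at \(\alpha / 2\), while minimizing \((w - z)^2 + \alpha w^2\) gives \(z / (1 + \alpha)\). The function names are illustrative.

```python
import numpy as np

def lasso_1d(z, alpha):
    """argmin_w (w - z)^2 + alpha * |w|: soft thresholding at alpha / 2."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha / 2, 0.0)

def ridge_1d(z, alpha):
    """argmin_w (w - z)^2 + alpha * w^2: uniform rescaling."""
    return z / (1 + alpha)

z = np.array([-2.0, 0.1, 0.4, 2.0])   # unregularized solutions
print("L1:", lasso_1d(z, alpha=1.0))
print("L2:", ridge_1d(z, alpha=1.0))
```

With \(\alpha = 1\), the L1 solutions are \(-1.5, 0, 0, 1.5\): anything within \(0.5\) of zero is set exactly to zero and everything else moves by a fixed amount. The L2 solutions are \(-1, 0.05, 0.2, 1\): every value is halved, so small values get smaller but never vanish.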

4. Elastic Net: Best of Both Worlds

Combines L1 and L2: \[ J(\mathbf{w}) = \text{MSE}(\mathbf{w}) + \alpha \left[ \rho \|\mathbf{w}\|_1 + (1-\rho) \|\mathbf{w}\|_2^2 \right] \]

Parameters:

- \(\alpha\): Overall regularization strength
- \(\rho \in [0, 1]\): Mix between L1 and L2
  - \(\rho = 0\): Pure Ridge
  - \(\rho = 1\): Pure Lasso
  - \(\rho = 0.5\): Equal mix

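When to use: When you have correlated features and want both feature selection (L1) and the grouping effect (L2), where correlated features receive similar coefficients instead of Lasso keeping one of them more or less arbitrarily.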
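
The grouping effect can be illustrated with a from-scratch coordinate descent sketch for the cost above. This is an assumed implementation (no intercept, mean-squared-error loss), and other libraries may scale the two penalty terms differently. The L1 part appears as a soft threshold at \(\alpha\rho / 2\) and the L2 part as an extra \(\alpha(1-\rho)\) in the denominator. The data contains two identical copies of one feature, which is the extreme case of correlation.

```python
import numpy as np

def elastic_net_cd(X, y, alpha, rho, n_iters=500):
    """Coordinate descent for
    (1/n)*||y - X w||^2 + alpha*(rho*||w||_1 + (1 - rho)*||w||_2^2)."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = np.sum(X ** 2, axis=0) / n
    l1, l2 = alpha * rho, alpha * (1 - rho)
    for _ in range(n_iters):
        for j in range(d):
            # Residual with feature j's current contribution removed.
            r_j = y - X @ w + X[:, j] * w[j]
            rho_j = X[:, j] @ r_j / n
            # Soft threshold (L1), then extra shrinkage (L2).
            shrunk = np.sign(rho_j) * max(abs(rho_j) - l1 / 2, 0.0)
            w[j] = shrunk / (col_sq[j] + l2)
    return w

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
other = rng.normal(size=n)
X = np.column_stack([x, x, other])     # first two features are identical
y = 2.0 * x + rng.normal(scale=0.1, size=n)

w_lasso = elastic_net_cd(X, y, alpha=0.1, rho=1.0)
w_enet = elastic_net_cd(X, y, alpha=0.1, rho=0.5)
print("rho = 1.0 (Lasso):      ", np.round(w_lasso, 3))
print("rho = 0.5 (Elastic Net):", np.round(w_enet, 3))
```

With \(\rho = 1\) the weight ends up on the first copy simply because it is updated first; the cost cannot distinguish between the copies. With \(\rho = 0.5\) the L2 term makes the solution unique and splits the weight equally between the two copies.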

5. Choosing Regularization

Decision Guide

Scenario	Best Choice	Reason
Many correlated features	Ridge	Keeps all features, handles multicollinearity
Many irrelevant features	Lasso	Automatic feature selection
Mix of above	Elastic Net	Balance between Ridge and Lasso
\(d > n\) (more features than samples)	Ridge or Lasso	Ordinary least squares has no unique solution
Need interpretability	Lasso	Sparse model, clear feature importance

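Choosing \(\alpha\) (Regularization Strength)

Use cross-validation! Fit the model for each candidate \(\alpha\) on the training folds, measure the error on the held-out fold, and keep the \(\alpha\) with the lowest average validation error. Candidates are usually spaced on a logarithmic grid (for example \(10^{-4}\) to \(10^{2}\)), because the useful range of \(\alpha\) spans several orders of magnitude.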
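
A minimal k-fold sketch for Ridge is given below, written with NumPy only. The helper names (`ridge_fit`, `cv_mse`), the fold count, the grid, and the synthetic data are all assumptions made for illustration; in practice a library routine would usually do this.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge for (1/n) * ||y - X w||^2 + alpha * ||w||^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * alpha * np.eye(d), X.T @ y)

def cv_mse(X, y, alpha, k=5, seed=0):
    """Average validation MSE of ridge over k shuffled folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train], y[train], alpha)
        errors.append(np.mean((y[val] - X[val] @ w) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(42)
n, d = 40, 30                          # few samples, many features
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + rng.normal(scale=1.0, size=n)

alphas = np.logspace(-4, 2, 13)        # log-spaced candidate grid
scores = [cv_mse(X, y, a) for a in alphas]
best_alpha = alphas[int(np.argmin(scores))]
print(f"best alpha = {best_alpha:.4g}, CV MSE = {min(scores):.3f}")
```

The same seed is used for every \(\alpha\), so all candidates are scored on identical folds and the comparison is fair. Once the best value is chosen, the model is typically refit on the full training set with that \(\alpha\).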

Key Takeaways

✓ Regularization: Adds penalty to loss function to prevent overfitting

✓ Ridge (L2):

Penalty: \(\alpha \sum_j w_j^2\)
Shrinks all coefficients toward zero
Coefficients almost never reach exactly zero → keeps all features
Good for correlated features

✓ Lasso (L1):

Penalty: \(\alpha \sum_j |w_j|\)
Can drive coefficients to exactly zero
Automatic feature selection
Good for high-dimensional problems where only a few features matter

✓ Elastic Net: Combines L1 and L2 for best of both worlds

✓ Choosing α: Use cross-validation to find optimal regularization strength

Practice Problems

Problem 1: Implement Ridge from Scratch

Problem 2: Compare L1 vs L2 Regularization Paths

Next Steps

With regularization mastered, you can now:

Handle high-dimensional data
Prevent overfitting
Perform automatic feature selection

Next lessons explore non-linear models:

Lesson 6: Decision Trees – non-linear decision boundaries
Lesson 7: Random Forests – ensemble power
