Regularization: L1, L2, and Elastic Net

Introduction: Taming Overfitting

Imagine you're fitting a curve through noisy data. With enough polynomial terms, you can make the curve pass through every single training point. But this "perfect" fit is terrible: it follows the noise, not the pattern!

Regularization is like telling the model: "Yes, fit the data well, but keep it simple. Don't go crazy with wild coefficients just to capture every tiny wiggle."

Key Insight: Regularization adds a "simplicity penalty" to the loss function, discouraging complex models that overfit.

Learning Objectives

Understand overfitting and why regularization helps
Master Ridge regression (L2 regularization)
Learn Lasso regression (L1 regularization)
Combine both with Elastic Net
Choose the right regularization for your problem
Implement regularized models from scratch

1. The Overfitting Problem (Revisited)

When Models Get Too Confident

Fig. 02: Bias-Variance Tradeoff Explorer (interactive visualization of the bias-variance tradeoff)

The core issue: With many features or complex models, we can fit training data perfectly but fail on new data.

Fig. 04: Python Code Executor (interactive Python code execution environment)
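The gap between training and test error can be reproduced with a small NumPy sketch. The ground-truth function (a noisy sine), the sample sizes, and the polynomial degrees are illustrative assumptions, not values taken from this lesson.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Sample n noisy points from an assumed ground truth, sin(pi * x)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(10)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    """Mean squared error of a polynomial (highest power first) on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit on the training points only.
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    errors[degree] = (mse(coeffs, x_train, y_train),
                      mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE = {errors[degree][0]:.4f}, "
          f"test MSE = {errors[degree][1]:.4f}")
```

Degree 9 has as many coefficients as there are training points, so it interpolates them and its training error is essentially zero. Comparing the two printed columns shows whether that carries over to the unseen test points.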

2. Ridge Regression (L2 Regularization)

Adding a Simplicity Penalty

Idea: Penalize large coefficients to keep the model simple.

Ridge cost function: \[ J(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 + \alpha \sum_{j=1}^{d} w_j^2 \]

Or more compactly: \[ J(\mathbf{w}) = \text{MSE}(\mathbf{w}) + \alpha \|\mathbf{w}\|_2^2 \]

Parameters:

\(\alpha \ge 0\): regularization strength
- \(\alpha = 0\): No regularization (standard linear regression)
- \(\alpha \to \infty\): Maximum regularization (all weights → 0)
\(\|\mathbf{w}\|_2^2 = \sum_j w_j^2\): squared L2 (Euclidean) norm
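As a quick check of the cost function, here is a minimal sketch that evaluates it on a tiny hand-made dataset. The function name `ridge_cost` and the numbers are assumptions chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

def ridge_cost(w, X, y, alpha):
    """MSE(w) + alpha * ||w||_2^2, with no intercept term."""
    residuals = y - X @ w
    mse = np.mean(residuals ** 2)
    penalty = alpha * np.sum(w ** 2)
    return float(mse + penalty)

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, 0.0])

# Predictions are [0.5, 1.5, 2.5], so every residual is 0.5 and MSE = 0.25.
print(ridge_cost(w, X, y, alpha=0.0))  # MSE only
# The penalty adds alpha * (0.5**2 + 0.0**2) = 0.1 * 0.25 = 0.025.
print(ridge_cost(w, X, y, alpha=0.1))
```

Increasing `alpha` raises the cost of any nonzero weight vector without changing how well it fits, which is what pushes the minimizer toward smaller weights.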

Closed-Form Solution

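Ridge regression has an analytical solution. For the sum-of-squared-errors form of the objective, \( \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \alpha \|\mathbf{w}\|_2^2 \), it is:

\[ \mathbf{w}^* = (\mathbf{X}^T \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y} \]

With the \(\frac{1}{n}\)-scaled MSE cost above, the same formula holds with \(\alpha\) replaced by \(n\alpha\).

Key difference from the normal equation: for \(\alpha > 0\), adding \(\alpha \mathbf{I}\) makes \(\mathbf{X}^T \mathbf{X} + \alpha \mathbf{I}\) positive definite, so the matrix is always invertible!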

Fig. 06: Python Code Executor (interactive Python code execution environment)
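The closed-form solution can be sketched in a few lines of NumPy. This version assumes the \(\frac{1}{n}\)-scaled MSE cost defined earlier in this section, for which setting the gradient to zero gives the system \((\mathbf{X}^T \mathbf{X} + n\alpha \mathbf{I})\,\mathbf{w} = \mathbf{X}^T \mathbf{y}\); for the unscaled sum-of-squares objective the factor \(n\) drops out. The function names and the synthetic data are assumptions for illustration, and no intercept is fitted.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Minimize (1/n) * ||y - Xw||^2 + alpha * ||w||^2 in closed form.

    The 1/n in the MSE turns the alpha * I term into n * alpha * I.
    """
    n, d = X.shape
    A = X.T @ X + n * alpha * np.eye(d)
    # Solving the linear system is more stable than forming the inverse.
    return np.linalg.solve(A, X.T @ y)

def ridge_gradient(w, X, y, alpha):
    """Gradient of the ridge cost; it should vanish at the minimizer."""
    n = X.shape[0]
    return -2.0 / n * (X.T @ (y - X @ w)) + 2.0 * alpha * w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

w_ols = ridge_closed_form(X, y, alpha=0.0)
w_ridge = ridge_closed_form(X, y, alpha=1.0)
print("OLS:  ", w_ols)
print("Ridge:", w_ridge)
print("gradient at ridge solution:", ridge_gradient(w_ridge, X, y, 1.0))

# Duplicate a column: X loses full column rank, so X^T X is singular,
# yet the ridge system with alpha > 0 is still solvable.
X_dup = np.hstack([X, X[:, :1]])
rank = np.linalg.matrix_rank(X_dup)
w_dup = ridge_closed_form(X_dup, y, alpha=0.1)
print("column rank:", rank, "of", X_dup.shape[1])
print("weights with duplicated feature:", w_dup)
```

The duplicated feature and its copy receive the same weight, which is one way to see how Ridge handles perfectly collinear columns.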

Geometric Interpretation

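Ridge regression is equivalent to minimizing the MSE while constraining the weights to lie inside a ball (the L2 ball \(\|\mathbf{w}\|_2 \le t\), a disk in two dimensions), where a larger \(\alpha\) corresponds to a smaller radius \(t\):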

Fig. 08: Python Code Executor (interactive Python code execution environment)
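The constraint view can be checked numerically. The sketch below, on assumed synthetic two-feature data, shows that the norm of the ridge solution shrinks as \(\alpha\) grows, and that no randomly sampled point inside the ball of that radius achieves a lower MSE than the ridge solution itself. It uses the \(\frac{1}{n}\)-scaled cost, hence the `n * alpha` term.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = X @ np.array([3.0, -2.0]) + rng.normal(scale=0.5, size=40)
n, d = X.shape

def mse(w):
    return float(np.mean((y - X @ w) ** 2))

def ridge(alpha):
    """Closed-form minimizer of MSE(w) + alpha * ||w||^2."""
    return np.linalg.solve(X.T @ X + n * alpha * np.eye(d), X.T @ y)

# 1) Larger alpha -> smaller ball: the solution norm decreases.
alphas = [0.0, 0.1, 1.0, 10.0]
norms = [float(np.linalg.norm(ridge(a))) for a in alphas]
for a, r in zip(alphas, norms):
    print(f"alpha = {a:5.1f}  ->  ||w||_2 = {r:.4f}")

# 2) Inside the ball of radius ||w*||, nothing beats w* on MSE.
w_star = ridge(1.0)
radius = np.linalg.norm(w_star)
directions = rng.normal(size=(1000, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
radii = radius * np.sqrt(rng.uniform(size=(1000, 1)))  # uniform over the disk
candidates = directions * radii
best_in_ball = min(mse(w) for w in candidates)
print("MSE at ridge solution:      ", mse(w_star))
print("best MSE among ball samples:", best_in_ball)
```

Random sampling is only a sanity check, not a proof, but the inequality it tests follows directly from the ridge solution minimizing the penalized cost.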

3. Lasso Regression (L1 Regularization)

Sparse Solutions

Lasso cost function: \[ J(\mathbf{w}) = \text{MSE}(\mathbf{w}) + \alpha \sum_{j=1}^{d} |w_j| = \text{MSE}(\mathbf{w}) + \alpha \|\mathbf{w}\|_1 \]

Key difference: Uses absolute values (\(|w_j|\)) instead of squares!

Key property: For large enough \(\alpha\), Lasso drives some coefficients exactly to zero → automatic feature selection!

Fig. 10: Python Code Executor (interactive Python code execution environment)
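Lasso has no closed-form solution in general, so it is usually fitted iteratively. The sketch below uses cyclic coordinate descent with soft-thresholding, one common choice and not necessarily the method behind this lesson's interactive examples. It minimizes the cost function above, \(\text{MSE}(\mathbf{w}) + \alpha \|\mathbf{w}\|_1\), without an intercept; the function names, synthetic data, and iteration count are assumptions.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, clipping at exactly zero."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iters=200):
    """Minimize (1/n) * ||y - Xw||^2 + alpha * ||w||_1, one weight at a time."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n  # (1/n) * ||X_j||^2 for each column
    for _ in range(n_iters):
        for j in range(d):
            # Residual with feature j's current contribution added back.
            partial_residual = y - X @ w + X[:, j] * w[j]
            corr = X[:, j] @ partial_residual / n
            # Exact one-dimensional minimizer; the threshold is alpha / 2.
            w[j] = soft_threshold(corr, alpha / 2.0) / col_sq[j]
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([3.0, 0.0, 0.0, -2.0, 0.0])  # only two relevant features
y = X @ true_w + rng.normal(scale=0.1, size=100)

w_lasso = lasso_coordinate_descent(X, y, alpha=0.5)
print("true weights: ", true_w)
print("lasso weights:", w_lasso)
print("nonzero coefficients:", np.count_nonzero(w_lasso))
```

The irrelevant features end up with coefficients that are exactly zero, not merely small, while the two relevant ones are kept but shrunk toward zero.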

Why Lasso Creates Sparsity

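The L1 constraint region \(\|\mathbf{w}\|_1 \le t\) forms a diamond (not a circle). Its corners lie on the coordinate axes, so the MSE contours often first touch the region at a corner, where one or more weights are exactly zero: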

Fig. 12: Python Code Executor (interactive Python code execution environment)
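The same effect shows up algebraically in one dimension. Assume a single weight \(w\) whose unregularized least-squares value is \(z\), so the fit term is \((w - z)^2\). Adding \(\alpha w^2\) gives the minimizer \(w = z / (1 + \alpha)\), while adding \(\alpha |w|\) gives \(w = \operatorname{sign}(z)\,\max(|z| - \alpha/2,\, 0)\). A small sketch of both rules (the function names are hypothetical):

```python
def ridge_shrink(z, alpha):
    """Minimizer of (w - z)**2 + alpha * w**2: proportional shrinkage."""
    return z / (1.0 + alpha)

def lasso_shrink(z, alpha):
    """Minimizer of (w - z)**2 + alpha * abs(w): soft-thresholding."""
    if abs(z) <= alpha / 2.0:
        return 0.0  # small coefficients are set exactly to zero
    return z - alpha / 2.0 if z > 0 else z + alpha / 2.0

alpha = 1.0
for z in (-2.0, -0.4, 0.3, 2.0):
    print(f"z = {z:5.2f}   ridge -> {ridge_shrink(z, alpha):6.3f}   "
          f"lasso -> {lasso_shrink(z, alpha):6.3f}")
```

Ridge rescales every coefficient by the same factor, so a nonzero \(z\) never becomes zero. Lasso subtracts a fixed amount, so any \(z\) with \(|z| \le \alpha/2\) lands exactly on zero.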

4. Elastic Net: Best of Both Worlds

Combines L1 and L2: \[ J(\mathbf{w}) = \text{MSE}(\mathbf{w}) + \alpha \left[ \rho \|\mathbf{w}\|_1 + (1-\rho) \|\mathbf{w}\|_2^2 \right] \]

Parameters:

\(\alpha\): Overall regularization strength
\(\rho \in [0, 1]\): Mix between L1 and L2
- \(\rho = 0\): Pure Ridge
- \(\rho = 1\): Pure Lasso
- \(\rho = 0.5\): Equal mix

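When to use: When you have correlated features and want both feature selection (L1) and the grouping effect (L2), where correlated features tend to receive similar coefficients instead of one being picked arbitrarily.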

Fig. 14: Python Code Executor (interactive Python code execution environment)
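Both effects can be seen in a minimal coordinate-descent sketch for the Elastic Net cost defined above. This is an illustration under assumptions (no intercept, synthetic data with one exactly duplicated feature and one irrelevant feature, hypothetical function names). Note that library implementations may scale the two penalty terms differently from this lesson's formula.

```python
import numpy as np

def soft_threshold(value, threshold):
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def elastic_net_cd(X, y, alpha, rho, n_iters=500):
    """Minimize MSE(w) + alpha * (rho * ||w||_1 + (1 - rho) * ||w||_2^2)."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iters):
        for j in range(d):
            partial_residual = y - X @ w + X[:, j] * w[j]
            corr = X[:, j] @ partial_residual / n
            # The L1 part sets the threshold; the L2 part enlarges the denominator.
            w[j] = (soft_threshold(corr, alpha * rho / 2.0)
                    / (col_sq[j] + alpha * (1.0 - rho)))
    return w

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
# Columns 0 and 1 are identical; column 2 is unrelated to y.
X = np.column_stack([x, x, rng.normal(size=n)])
y = 2.0 * x + rng.normal(scale=0.1, size=n)

w_enet = elastic_net_cd(X, y, alpha=0.5, rho=0.5)
print("elastic net weights:", w_enet)
```

The two identical columns share the weight equally (the grouping effect from the L2 term), while the unrelated column is dropped entirely (feature selection from the L1 term).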

5. Choosing Regularization

Decision Guide

Scenario	Best Choice	Reason
Many correlated features	Ridge	Keeps all, handles multicollinearity
Many irrelevant features	Lasso	Automatic feature selection
Mix of above	Elastic Net	Balance between Ridge and Lasso
\(d > n\) (more features than samples)	Ridge or Lasso	Ordinary least squares has no unique solution
Need interpretability	Lasso	Sparse model, clear feature importance

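Choosing \(\alpha\) (Regularization Strength)

Use cross-validation: fit the model for a grid of \(\alpha\) values (typically spaced on a log scale) and keep the one with the lowest average validation error.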

Fig. 16: Python Code Executor (interactive Python code execution environment)
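A minimal sketch of that procedure for Ridge, written with plain NumPy so every step is visible. The grid of \(\alpha\) values, the number of folds, and the synthetic data are assumptions; in practice a library routine would normally do this work.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge for the cost MSE(w) + alpha * ||w||^2 (no intercept)."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * alpha * np.eye(d), X.T @ y)

def cv_mse(X, y, alpha, n_folds=5, seed=0):
    """Average validation MSE over n_folds shuffled folds."""
    rng = np.random.default_rng(seed)  # same split for every alpha
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[i] for i in range(n_folds) if i != k])
        w = ridge_fit(X[train_idx], y[train_idx], alpha)
        errors.append(np.mean((y[val_idx] - X[val_idx] @ w) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(42)
X = rng.normal(size=(60, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.0, 0.5]  # only three informative features
y = X @ true_w + rng.normal(scale=1.0, size=60)

alphas = [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]
scores = {a: cv_mse(X, y, a) for a in alphas}
best_alpha = min(scores, key=scores.get)

for a in alphas:
    print(f"alpha = {a:8.4f}   CV MSE = {scores[a]:.4f}")
print("selected alpha:", best_alpha)
```

Reusing the same fold split for every \(\alpha\) keeps the comparison fair, since differences in the scores then come from the regularization strength and not from the shuffling.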

Key Takeaways

✓ Regularization: Adds penalty to loss function to prevent overfitting

✓ Ridge (L2):

Penalty: \(\alpha \sum_j w_j^2\)
Shrinks all coefficients toward zero
Coefficients are generally not exactly zero → keeps all features
Good for correlated features

✓ Lasso (L1):

Penalty: \(\alpha \sum_j |w_j|\)
Drives some coefficients to exactly zero
Automatic feature selection
Good for high-dimensional sparse problems

✓ Elastic Net: Combines L1 and L2 for best of both worlds

✓ Choosing α: Use cross-validation to find optimal regularization strength

Practice Problems

Problem 1: Implement Ridge from Scratch

Fig. 18: Python Code Executor (interactive Python code execution environment)

Problem 2: Compare L1 vs L2 Regularization Paths

Fig. 20: Python Code Executor (interactive Python code execution environment)

Next Steps

With regularization mastered, you can now:

Handle high-dimensional data
Prevent overfitting
Perform automatic feature selection

Next lessons explore non-linear models:

Lesson 6: Decision Trees – non-linear decision boundaries
Lesson 7: Random Forests – ensemble power
