Regularization: L1, L2, and Elastic Net

Introduction: Taming Overfitting

Imagine you're fitting a curve through noisy data. With enough polynomial terms, you can make the curve pass through every single point perfectly. But this "perfect" fit is terrible – it follows the noise, not the pattern!

Regularization is like telling the model: "Yes, fit the data well, but keep it simple. Don't go crazy with wild coefficients just to capture every tiny wiggle."

Key Insight: Regularization adds a "simplicity penalty" to the loss function, discouraging complex models that overfit.

Learning Objectives

  • Understand overfitting and why regularization helps
  • Master Ridge regression (L2 regularization)
  • Learn Lasso regression (L1 regularization)
  • Combine both with Elastic Net
  • Choose the right regularization for your problem
  • Implement regularized models from scratch

1. The Overfitting Problem (Revisited)

When Models Get Too Confident


The core issue: With many features or complex models, we can fit training data perfectly but fail on new data.

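To make this concrete, here is a minimal sketch using NumPy and scikit-learn. The sine-wave data and the polynomial degrees are arbitrary illustrative choices, not part of the original lesson; the point is that a high-degree polynomial nearly memorizes the training points while its test error explodes.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic noisy data: a smooth sine curve plus Gaussian noise
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    # High degrees drive training error toward zero while test error grows
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```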


2. Ridge Regression (L2 Regularization)

Adding a Simplicity Penalty

Idea: Penalize large coefficients to keep the model simple.

Ridge cost function: [ J(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 + \alpha \sum_{j=1}^{d} w_j^2 ]

Or more compactly: [ J(\mathbf{w}) = \text{MSE}(\mathbf{w}) + \alpha \|\mathbf{w}\|_2^2 ]

Parameters:

  • (\alpha > 0): regularization strength
    • (\alpha = 0): No regularization (standard linear regression)
    • (\alpha \to \infty): Maximum regularization (all weights → 0)
  • (\|\mathbf{w}\|_2^2 = \sum_j w_j^2): the squared L2 (Euclidean) norm

Closed-Form Solution

Ridge regression has an analytical solution:

[ \mathbf{w}^* = (\mathbf{X}^T \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y} ]

Key difference from the normal equation: the (\alpha \mathbf{I}) term makes (\mathbf{X}^T \mathbf{X} + \alpha \mathbf{I}) positive definite, so it is always invertible for (\alpha > 0)! (Strictly, with the averaged MSE above the penalty term appears as (n\alpha \mathbf{I}); the factor of (n) is simply absorbed into (\alpha).)

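As a quick sanity check, here is a sketch of that closed-form solution in plain NumPy, compared against scikit-learn's Ridge. The synthetic data and the value of α are arbitrary choices; fit_intercept=False is used so both approaches minimize exactly the same un-normalized objective.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (arbitrary illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=100)

alpha = 1.0
# Closed-form ridge solution: w* = (X^T X + alpha*I)^{-1} X^T y
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# scikit-learn solves the same un-normalized objective when fit_intercept=False
ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)
print(np.allclose(w_closed, ridge.coef_))  # True
```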

Geometric Interpretation

Ridge regression is equivalent to minimizing the MSE subject to a budget (\|\mathbf{w}\|_2^2 \le t), so the weights are confined to an L2 ball (a disk in two dimensions). Because the ball is round, the loss contours typically touch it at a point where no coordinate is exactly zero, so Ridge shrinks coefficients but rarely eliminates them. The constraint regions for the L2 and L1 penalties are compared in the Lasso section below.



3. Lasso Regression (L1 Regularization)

Sparse Solutions

Lasso cost function: [ J(\mathbf{w}) = \text{MSE}(\mathbf{w}) + \alpha \sum_{j=1}^{d} |w_j| = \text{MSE}(\mathbf{w}) + \alpha \|\mathbf{w}\|_1 ]

Key difference: Uses absolute values (|w_j|) instead of squares!

Magic property: Lasso drives some coefficients exactly to zero → automatic feature selection!

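Here is a small illustration of that property, assuming a synthetic dataset in which only 3 of 10 features actually matter. With a moderate α, Lasso typically zeroes out the irrelevant coefficients while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical setup: 10 features, but only the first 3 influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ np.array([3.0, -2.0, 1.5] + [0.0] * 7) + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)
print("Lasso:", np.round(lasso.coef_, 3))   # irrelevant features typically end up exactly 0
print("Ridge:", np.round(ridge.coef_, 3))   # small but nonzero everywhere
print("Features kept by Lasso:", np.flatnonzero(lasso.coef_))
```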

Why Lasso Creates Sparsity

The L1 constraint (\|\mathbf{w}\|_1 \le t) forms a diamond, not a circle. Its corners sit on the coordinate axes, and the loss contours typically first touch the constraint region at a corner, where some coefficients are exactly zero:

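One way to see the geometry is simply to plot the two unit balls with matplotlib. This is only an illustrative sketch of the constraint regions, not output from a fitted model.

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 2 * np.pi, 400)
fig, ax = plt.subplots(figsize=(5, 5))

# L2 constraint boundary: w1^2 + w2^2 = 1 (a circle, no corners)
ax.plot(np.cos(theta), np.sin(theta), label="L2 ball (Ridge)")

# L1 constraint boundary: |w1| + |w2| = 1 (a diamond with corners on the axes)
diamond = np.array([[1, 0], [0, 1], [-1, 0], [0, -1], [1, 0]])
ax.plot(diamond[:, 0], diamond[:, 1], label="L1 ball (Lasso)")

ax.axhline(0, color="gray", lw=0.5)
ax.axvline(0, color="gray", lw=0.5)
ax.set_xlabel("$w_1$")
ax.set_ylabel("$w_2$")
ax.set_aspect("equal")
ax.legend()
plt.show()
```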


4. Elastic Net: Best of Both Worlds

Combines L1 and L2: [ J(\mathbf{w}) = \text{MSE}(\mathbf{w}) + \alpha \left[ \rho \|\mathbf{w}\|_1 + (1-\rho) \|\mathbf{w}\|_2^2 \right] ]

Parameters:

  • (\alpha): Overall regularization strength
  • (\rho \in [0, 1]): Mix between L1 and L2
    • (\rho = 0): Pure Ridge
    • (\rho = 1): Pure Lasso
    • (\rho = 0.5): Equal mix

When to use: When you have correlated features and want both feature selection (L1) and grouping effect (L2).

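Below is a short sketch using scikit-learn's ElasticNet on synthetic data with two strongly correlated features. Note that scikit-learn's l1_ratio parameter plays the role of (\rho) above, although the library's exact scaling of the two penalty terms differs slightly from the formula.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Synthetic data with two strongly correlated features (an assumption for the demo)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 1] = X[:, 0] + rng.normal(scale=0.05, size=200)
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# l1_ratio plays the role of rho: 0 -> pure L2 penalty, 1 -> pure L1 penalty
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.round(enet.coef_, 3))
```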


5. Choosing Regularization

Decision Guide

| Scenario | Best Choice | Reason |
| --- | --- | --- |
| Many correlated features | Ridge | Keeps all features, handles multicollinearity |
| Many irrelevant features | Lasso | Automatic feature selection |
| Mix of the above | Elastic Net | Balance between Ridge and Lasso |
| (d > n) (more features than samples) | Ridge or Lasso | Standard regression fails |
| Need interpretability | Lasso | Sparse model, clear feature importance |

Choosing (\alpha) (Regularization Strength)

Use cross-validation!

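For example, scikit-learn's RidgeCV and LassoCV run this search directly; the data and the α grid below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

# Synthetic data; the alpha grid below is an arbitrary illustrative choice
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.5, size=200)

ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 25), cv=5).fit(X, y)
lasso_cv = LassoCV(cv=5).fit(X, y)  # LassoCV builds its own alpha path by default
print("Best Ridge alpha:", ridge_cv.alpha_)
print("Best Lasso alpha:", lasso_cv.alpha_)
```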


Key Takeaways

Regularization: Adds penalty to loss function to prevent overfitting

Ridge (L2):

  • Penalty: (\alpha \sum w_j^2)
  • Shrinks all coefficients toward zero
  • Never exactly zero → keeps all features
  • Good for correlated features

Lasso (L1):

  • Penalty: (\alpha \sum |w_j|)
  • Drives coefficients to exactly zero
  • Automatic feature selection
  • Good for high-dimensional sparse problems

Elastic Net: Combines L1 and L2 for best of both worlds

Choosing α: Use cross-validation to find optimal regularization strength


Practice Problems

Problem 1: Implement Ridge from Scratch

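One possible starting point, sketched here with gradient descent on the 1/n-scaled objective (any working minimizer is fine):

```python
import numpy as np

def ridge_gradient_descent(X, y, alpha=1.0, lr=0.05, n_iters=5000):
    """Minimize (1/n)*||y - Xw||^2 + alpha*||w||^2 by gradient descent (no intercept)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = (2 / n) * X.T @ (X @ w - y) + 2 * alpha * w
        w -= lr * grad
    return w

# Sanity check against the closed form for the same 1/n-scaled objective
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
alpha, n = 0.1, X.shape[0]
w_gd = ridge_gradient_descent(X, y, alpha=alpha)
w_closed = np.linalg.solve(X.T @ X / n + alpha * np.eye(3), X.T @ y / n)
print(np.allclose(w_gd, w_closed, atol=1e-3))  # should print True
```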

Problem 2: Compare L1 vs L2 Regularization Paths

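A possible approach, again only a sketch on synthetic data: refit Ridge and Lasso over a log-spaced grid of α values and plot how each coefficient evolves.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: 3 informative features, 3 irrelevant ones
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=100)

alphas = np.logspace(-3, 2, 50)
ridge_path = np.array([Ridge(alpha=a).fit(X, y).coef_ for a in alphas])
lasso_path = np.array([Lasso(alpha=a, max_iter=10_000).fit(X, y).coef_ for a in alphas])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
ax1.plot(alphas, ridge_path)
ax1.set_xscale("log")
ax1.set_title("Ridge: coefficients shrink smoothly")
ax2.plot(alphas, lasso_path)
ax2.set_xscale("log")
ax2.set_title("Lasso: coefficients hit exactly zero")
for ax in (ax1, ax2):
    ax.set_xlabel("alpha")
ax1.set_ylabel("coefficient value")
plt.show()
```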


Next Steps

With regularization mastered, you can now:

  • Handle high-dimensional data
  • Prevent overfitting
  • Perform automatic feature selection

Next lessons explore non-linear models:

  • Lesson 6: Decision Trees – non-linear decision boundaries
  • Lesson 7: Random Forests – ensemble power

Remember: When in doubt, start with Ridge. If you need feature selection, try Lasso. For complex scenarios, Elastic Net is your friend!