Lesson 05 · Intermediate · 60 min

Regularization: L1, L2, and Elastic Net

Prevent overfitting with regularization: Ridge, Lasso, and Elastic Net. Understand the geometry and sparsity-inducing properties.

Introduction: Taming Overfitting

Imagine you're fitting a curve through noisy data. With enough polynomial terms, you can make the curve pass through every single point. But this "perfect" fit is terrible: it follows the noise, not the pattern!

Regularization is like telling the model: "Yes, fit the data well, but keep it simple. Don't go crazy with wild coefficients just to capture every tiny wiggle."

Key Insight: Regularization adds a "simplicity penalty" to the loss function, discouraging complex models that overfit.

Learning Objectives

  • Understand overfitting and why regularization helps
  • Master Ridge regression (L2 regularization)
  • Learn Lasso regression (L1 regularization)
  • Combine both with Elastic Net
  • Choose the right regularization for your problem
  • Implement regularized models from scratch

1. The Overfitting Problem (Revisited)

When Models Get Too Confident

Fig. 02: Bias-Variance Tradeoff Explorer (interactive visualization of the bias-variance tradeoff)

The core issue: With many features or complex models, we can fit training data perfectly but fail on new data.
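To make this concrete, here is a minimal sketch. The sine-shaped target, the noise level, the seed, and the two polynomial degrees are illustrative assumptions, not values from the lesson. It fits a low-degree and a high-degree polynomial to the same 12 training points and compares training and test error.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def make_data(n):
    """Noisy samples of sin(2*pi*x) on [0, 1] (an assumed toy target)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=n)
    return x, y

x_train, y_train = make_data(12)
x_test, y_test = make_data(200)

def fit_and_score(degree):
    # Polynomial.fit rescales x internally, which keeps high degrees stable
    poly = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    train_mse = np.mean((poly(x_train) - y_train) ** 2)
    test_mse = np.mean((poly(x_test) - y_test) ** 2)
    return train_mse, test_mse

results = {degree: fit_and_score(degree) for degree in (3, 11)}
for degree, (train_mse, test_mse) in results.items():
    print(f"degree {degree:2d}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

The degree-11 polynomial has enough coefficients to interpolate all 12 training points, so its training error is essentially zero. Its test error is typically far larger than that of the degree-3 fit, which is the overfitting pattern described above.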


2. Ridge Regression (L2 Regularization)

Adding a Simplicity Penalty

Idea: Penalize large coefficients to keep the model simple.

Ridge cost function: [ J(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 + \alpha \sum_{j=1}^{d} w_j^2 ]

Or more compactly: [ J(\mathbf{w}) = \text{MSE}(\mathbf{w}) + \alpha \|\mathbf{w}\|_2^2 ]

Parameters:

  • (\alpha \ge 0): regularization strength
    • (\alpha = 0): No regularization (standard linear regression)
    • (\alpha \to \infty): Maximum regularization (all weights → 0)
  • (\|\mathbf{w}\|_2^2 = \sum_j w_j^2): squared L2 (Euclidean) norm of the weights
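As a quick check of how the two terms combine, the sketch below evaluates the ridge cost directly. The function name `ridge_cost` and the tiny dataset are made up for illustration; the weights are chosen to fit the data exactly so that only the penalty term changes with alpha.

```python
import numpy as np

def ridge_cost(w, X, y, alpha):
    """MSE plus alpha times the squared L2 norm of w."""
    residuals = y - X @ w
    mse = np.mean(residuals ** 2)
    penalty = alpha * np.sum(w ** 2)
    return mse + penalty

# Tiny hand-made example (values chosen only for illustration)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 2.0])  # fits y exactly, so the MSE term is 0

for alpha in (0.0, 0.1, 1.0):
    print(alpha, ridge_cost(w, X, y, alpha))
```

Because the squared norm of these weights is 5, the cost here is simply 5 times alpha: a perfect fit still pays for large coefficients.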

Closed-Form Solution

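Ridge regression has an analytical solution. Setting the gradient of (J(\mathbf{w})) to zero gives:

[ \mathbf{w}^* = (\mathbf{X}^T \mathbf{X} + n\alpha \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y} ]

The factor (n) comes from the (\frac{1}{n}) in the MSE. Texts that penalize the sum of squared errors instead write the same solution with (\alpha \mathbf{I}).

Key difference from the normal equation: (\mathbf{X}^T \mathbf{X}) is positive semidefinite, so adding (n\alpha \mathbf{I}) with (\alpha > 0) makes the matrix positive definite and therefore always invertible, even when features are collinear or (d > n)!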
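The closed form can be sketched in a few lines of NumPy. This assumes the cost defined earlier (MSE with its 1/n factor plus alpha times the squared L2 norm), which is why the identity matrix is scaled by n times alpha; if the penalty is defined against the sum of squared errors, that n is absorbed into alpha. The function name and the synthetic data, including the deliberately duplicated column, are illustrative.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + n*alpha*I) w = X^T y for the MSE-based ridge cost."""
    n, d = X.shape
    A = X.T @ X + n * alpha * np.eye(d)
    # solve() is preferred over forming the inverse explicitly
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X = np.column_stack([X, X[:, 0]])  # duplicate a column: X^T X is now singular
w_true = np.array([2.0, -1.0, 0.5, 0.0])
y = X @ w_true + rng.normal(0.0, 0.1, size=50)

w_ridge = ridge_closed_form(X, y, alpha=0.1)
print("columns:", X.shape[1], "rank:", np.linalg.matrix_rank(X))
print("ridge weights:", np.round(w_ridge, 3))
```

Ordinary least squares has no unique solution here because two columns are identical, yet the ridge system solves without trouble and assigns the two copies the same weight.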


Geometric Interpretation

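Ridge regression is equivalent to minimizing the MSE subject to a budget on the weights, (\|\mathbf{w}\|_2^2 \le t), where a larger (\alpha) corresponds to a smaller budget (t). Geometrically, the weights are confined to a sphere (the L2 ball), and the solution sits where the loss contours first touch that sphere.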
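One way to see the shrinking ball numerically is to track the length of the ridge solution as alpha grows. The sketch below uses synthetic data and an assumed grid of alpha values, and solves the MSE-based ridge problem in closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, -0.5]) + rng.normal(0.0, 0.5, size=n)

def ridge_weights(alpha):
    # Closed-form minimizer of MSE(w) + alpha * ||w||_2^2
    return np.linalg.solve(X.T @ X + n * alpha * np.eye(d), X.T @ y)

alphas = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
norms = [float(np.linalg.norm(ridge_weights(a))) for a in alphas]
for a, r in zip(alphas, norms):
    print(f"alpha = {a:7.2f} -> ||w||_2 = {r:.4f}")
```

The norm decreases at every step of the grid: each larger alpha pulls the solution into a smaller ball around the origin.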

3. Lasso Regression (L1 Regularization)

Sparse Solutions

Lasso cost function: [ J(\mathbf{w}) = \text{MSE}(\mathbf{w}) + \alpha \sum_{j=1}^{d} |w_j| = \text{MSE}(\mathbf{w}) + \alpha \|\mathbf{w}\|_1 ]

Key difference: Uses absolute values ((|w_j|)) instead of squares!

Magic property: Lasso drives some coefficients exactly to zero → automatic feature selection!
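Lasso has no closed-form solution in general, so it is solved iteratively. The sketch below uses proximal gradient descent (often called ISTA), which is one common choice and not necessarily the method this lesson's own code uses. The function names, the step-size rule, and the synthetic data (two relevant features, four noise features) are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t; entries within [-t, t] become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, alpha, n_iter=1000):
    """Minimize MSE(w) + alpha * ||w||_1 by proximal gradient descent."""
    n, d = X.shape
    # Step size 1/L, where L is the Lipschitz constant of the MSE gradient
    step = 1.0 / (2.0 * np.linalg.eigvalsh(X.T @ X / n).max())
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = -(2.0 / n) * (X.T @ (y - X @ w))  # gradient of the MSE term only
        w = soft_threshold(w - step * grad, step * alpha)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# Only the first two features matter; the other four are pure noise (assumed setup)
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0.0, 0.1, size=200)

w_lasso = lasso_ista(X, y, alpha=1.0)
print("lasso weights:", np.round(w_lasso, 3))
print("exact zeros:", int(np.sum(w_lasso == 0.0)), "of", w_lasso.size)
```

The soft-thresholding step is what produces exact zeros: any coordinate whose update lands inside the threshold band is set to 0.0 rather than to a small number. The surviving weights are also shrunk relative to the true values 4 and -3, which is the price of the penalty.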


Why Lasso Creates Sparsity

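The L1 constraint region is a diamond, not a circle. Its corners sit on the coordinate axes, and the contours of the MSE often touch the region first at a corner, where one or more weights are exactly zero. The smooth L2 ball has no corners, so its contact point rarely has a zero coordinate.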
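The corner effect can be checked with a brute-force search in two dimensions. This is a toy sketch: the loss is an assumed squared distance to a made-up unregularized solution, and both constraint regions have radius 1. Since that solution lies outside both regions, the constrained minimum must be on the boundary, so only boundary points are sampled.

```python
import numpy as np

# Unconstrained minimizer of a toy loss with circular contours (assumed values)
w_ols = np.array([2.0, 0.5])

def loss(points):
    """Squared distance to w_ols for each row of `points`."""
    return np.sum((points - w_ols) ** 2, axis=1)

# Boundary of the L1 ball |w1| + |w2| = 1: four straight edges of a diamond
t = np.linspace(0.0, 1.0, 201)
diamond = np.vstack([
    np.column_stack([sx * t, sy * (1.0 - t)])
    for sx in (1.0, -1.0) for sy in (1.0, -1.0)
])

# Boundary of the L2 ball w1^2 + w2^2 = 1: a circle
theta = np.linspace(0.0, 2.0 * np.pi, 721)
circle = np.column_stack([np.cos(theta), np.sin(theta)])

l1_best = diamond[np.argmin(loss(diamond))]
l2_best = circle[np.argmin(loss(circle))]
print("best point on diamond:", l1_best)  # lands on a corner
print("best point on circle: ", l2_best)  # both coordinates nonzero
```

The best point on the diamond is the corner (1, 0), so the second weight is dropped entirely. The best point on the circle keeps both weights, only smaller.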

4. Elastic Net: Best of Both Worlds

Combines L1 and L2: [ J(\mathbf{w}) = \text{MSE}(\mathbf{w}) + \alpha \left[ \rho \|\mathbf{w}\|_1 + (1-\rho) \|\mathbf{w}\|_2^2 \right] ]

Parameters:

  • (\alpha): Overall regularization strength
  • (\rho \in [0, 1]): Mix between L1 and L2
    • (\rho = 0): Pure Ridge
    • (\rho = 1): Pure Lasso
    • (\rho = 0.5): Equal mix

When to use: When you have correlated features and want both feature selection (L1) and grouping effect (L2).
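The grouping effect is easiest to see with two identical features. The sketch below is a plain coordinate-descent solver for the cost above, written for illustration only. The function names, the sweep count, and the synthetic data are assumptions. Setting rho to 1 gives pure Lasso, and rho of 0.5 gives an even mix.

```python
import numpy as np

def soft_threshold(value, t):
    """Scalar soft-thresholding: shrink toward zero by t, clipping at zero."""
    return np.sign(value) * max(abs(value) - t, 0.0)

def elastic_net_cd(X, y, alpha, rho, n_sweeps=500):
    """Coordinate descent for MSE(w) + alpha * (rho*||w||_1 + (1-rho)*||w||_2^2)."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = np.sum(X ** 2, axis=0) / n            # (1/n) * x_j^T x_j
    for _ in range(n_sweeps):
        for j in range(d):
            partial = y - X @ w + X[:, j] * w[j]   # residual ignoring feature j
            corr = X[:, j] @ partial / n
            w[j] = soft_threshold(corr, alpha * rho / 2.0) / (col_sq[j] + alpha * (1.0 - rho))
    return w

rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([x, x, rng.normal(size=100)])  # columns 0 and 1 are identical
y = 2.0 * x + rng.normal(0.0, 0.1, size=100)

w_lasso = elastic_net_cd(X, y, alpha=0.2, rho=1.0)  # pure L1
w_enet = elastic_net_cd(X, y, alpha=0.2, rho=0.5)   # L1 + L2 mix
print("lasso      :", np.round(w_lasso, 3))
print("elastic net:", np.round(w_enet, 3))
```

Pure Lasso puts all the weight on whichever copy it visits first and leaves the other at zero, while the Elastic Net splits the weight evenly between the two copies. Library implementations may scale the terms differently (scikit-learn's ElasticNet, for example, halves both the squared-error term and the L2 term), so alpha values are not directly comparable across implementations.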


5. Choosing Regularization

Decision Guide

Scenario                              | Best Choice    | Reason
Many correlated features              | Ridge          | Keeps all, handles multicollinearity
Many irrelevant features              | Lasso          | Automatic feature selection
Mix of above                          | Elastic Net    | Balance between Ridge and Lasso
(d > n) (more features than samples)  | Ridge or Lasso | Standard regression fails
Need interpretability                 | Lasso          | Sparse model, clear feature importance

Choosing (\alpha) (Regularization Strength)

Use cross-validation!
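A minimal sketch of that procedure, assuming 5 shuffled folds, a logarithmic grid of candidate values, and closed-form ridge as the model. The helper names and the synthetic data are illustrative. Every candidate is scored on the same folds, and the value with the lowest average validation error wins.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form minimizer of MSE(w) + alpha * ||w||_2^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * alpha * np.eye(d), X.T @ y)

def cv_mse(X, y, alpha, k=5, seed=0):
    """Average validation MSE of ridge over k shuffled folds."""
    indices = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train], y[train], alpha)
        scores.append(np.mean((y[val] - X[val] @ w) ** 2))
    return float(np.mean(scores))

rng = np.random.default_rng(42)
X = rng.normal(size=(60, 20))
w_true = np.concatenate([[3.0, -2.0, 1.5, 1.0, -1.0], np.zeros(15)])
y = X @ w_true + rng.normal(0.0, 1.0, size=60)

alphas = np.logspace(-4, 2, 13)  # candidate strengths on a log grid
scores = [cv_mse(X, y, a) for a in alphas]
best_alpha = alphas[int(np.argmin(scores))]
print(f"best alpha = {best_alpha:.4g}, CV MSE = {min(scores):.3f}")
```

Using the same seed for every candidate keeps the folds identical, so differences in score come from alpha rather than from the split. In practice, the cross-validated estimators in scikit-learn mentioned under Further Reading do this bookkeeping for you.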


Key Takeaways

Regularization: Adds penalty to loss function to prevent overfitting

Ridge (L2):

  • Penalty: (\alpha \sum w_j^2)
  • Shrinks all coefficients toward zero
  • Never exactly zero → keeps all features
  • Good for correlated features

Lasso (L1):

  • Penalty: (\alpha \sum |w_j|)
  • Drives coefficients to exactly zero
  • Automatic feature selection
  • Good for high-dimensional sparse problems

Elastic Net: Combines L1 and L2 for best of both worlds

Choosing α: Use cross-validation to find optimal regularization strength


Practice Problems

Problem 1: Implement Ridge from Scratch
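One possible starting point, offered as a sketch and not as the reference solution: a small class built on the closed form, with the intercept left unpenalized by centering the data first. The class name `RidgeScratch`, its attribute names, and the synthetic data are all hypothetical.

```python
import numpy as np

class RidgeScratch:
    """Ridge via the closed form, with an unpenalized intercept."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        # Centering X and y removes the intercept from the problem,
        # so the penalty touches only the slope coefficients.
        self.x_mean_ = X.mean(axis=0)
        self.y_mean_ = y.mean()
        Xc, yc = X - self.x_mean_, y - self.y_mean_
        n, d = Xc.shape
        self.coef_ = np.linalg.solve(Xc.T @ Xc + n * self.alpha * np.eye(d), Xc.T @ yc)
        self.intercept_ = self.y_mean_ - self.x_mean_ @ self.coef_
        return self

    def predict(self, X):
        return X @ self.coef_ + self.intercept_

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(80, 3))
y = 10.0 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, size=80)

model = RidgeScratch(alpha=0.01).fit(X, y)
print("coef:", np.round(model.coef_, 3), "intercept:", round(float(model.intercept_), 3))
```

Things to try from here: standardize the features before fitting so the penalty treats them equally, and compare the weights against the unregularized fit as alpha grows.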


Problem 2: Compare L1 vs L2 Regularization Paths
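A useful warm-up is the special case of standardized, uncorrelated features, where (1/n) X^T X is the identity and both paths have closed forms in terms of the unregularized coefficients. The sketch below assumes that special case, with made-up coefficient values. On real, correlated data the Lasso path needs an iterative solver instead.

```python
import numpy as np

# Assumed unregularized coefficients for three standardized, uncorrelated features
z = np.array([3.0, -1.0, 0.2])
alphas = np.array([0.0, 0.1, 0.5, 1.0, 2.0, 4.0, 8.0])

# With (1/n) X^T X = I, both problems decouple per coordinate:
#   ridge: w_j = z_j / (1 + alpha)            (proportional shrinkage)
#   lasso: w_j = soft_threshold(z_j, alpha/2) (shift toward zero, then clip)
ridge_path = np.array([z / (1.0 + a) for a in alphas])
lasso_path = np.array([np.sign(z) * np.maximum(np.abs(z) - a / 2.0, 0.0) for a in alphas])

print(" alpha | ridge weights | lasso weights")
for a, r, l in zip(alphas, ridge_path, lasso_path):
    print(f"{a:6.1f} | {np.round(r, 3)} | {np.round(l, 3)}")
```

Each Lasso coefficient reaches exactly zero once alpha is at least twice its unregularized magnitude, smallest coefficient first. Every Ridge coefficient shrinks by the same factor and never reaches zero for any finite alpha.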


Next Steps

With regularization mastered, you can now:

  • Handle high-dimensional data
  • Prevent overfitting
  • Perform automatic feature selection

Next lessons explore non-linear models:

  • Lesson 6: Decision Trees – non-linear decision boundaries
  • Lesson 7: Random Forests – ensemble power

Further Reading

Documentation & Books

  • Book: The Elements of Statistical Learning — Chapter 3.4 (free PDF).
  • Book: Statistical Learning with Sparsity: The Lasso and Generalizations — Hastie, Tibshirani, Wainwright (free PDF).
  • scikit-learn: Regularized Linear Models — Ridge, Lasso, ElasticNet with cross-validated variants.

Remember: When in doubt, start with Ridge. If you need feature selection, try Lasso. For complex scenarios, Elastic Net is your friend!