Regularization: L1, L2, and Elastic Net
Prevent overfitting with regularization: Ridge, Lasso, and Elastic Net. Understand the geometry and sparsity-inducing properties.
Introduction: Taming Overfitting
Imagine you're fitting a curve through noisy data. With enough polynomial terms, you can make the curve pass through every single point perfectly. But this "perfect" fit is terrible: it follows the noise, not the pattern!
Regularization is like telling the model: "Yes, fit the data well, but keep it simple. Don't go crazy with wild coefficients just to capture every tiny wiggle."
Key Insight: Regularization adds a "simplicity penalty" to the loss function, discouraging complex models that overfit.
Try it: Drag the complexity/degree slider up and watch the train and test curves split apart — train loss keeps falling while test loss bottoms out and then climbs. Push capacity past the interpolation point in the Overfitting Room to see the test loss spike and then descend a second time (double descent).
Learning Objectives
- Understand overfitting and why regularization helps
- Master Ridge regression (L2 regularization)
- Learn Lasso regression (L1 regularization)
- Combine both with Elastic Net
- Choose the right regularization for your problem
- Implement regularized models from scratch
1. The Overfitting Problem (Revisited)
When Models Get Too Confident
Use the Bias-Variance Explorer at the top of this lesson to revisit how the U-shaped test curve emerges as complexity grows.
See the Double Descent
The classical bias-variance picture says: as you add capacity, test loss first drops (good — less bias), then rises (bad — more variance). It's the U-shape we all teach.
But that picture is incomplete. The Overfitting Room lets you sweep model capacity past the interpolation threshold, where the number of parameters d equals the number of training points N, and watch what happens next. At d == N the test loss spikes, which is the classical overfitting peak. Push further and it descends again. This is double descent, and it is part of the explanation for why modern over-parameterized networks can generalize well.
Pick the DOUBLE-DESCENT-SWEEP mode in the Overfitting Room above and scan the degree slider to watch train and test loss trace the double-descent curve.
The core issue: With many features or complex models, we can fit training data perfectly but fail on new data.
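The train/test split described above can be reproduced in a few lines. This is a minimal sketch, assuming synthetic data drawn from a noisy sine wave and NumPy's `Polynomial.fit` for the least-squares fit; the sample sizes, noise level, and degrees are illustrative choices, not values from the lesson.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a smooth underlying pattern (here, a sine wave)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 3, 12):
    # Least-squares polynomial fit; Polynomial.fit rescales x internally
    # for better numerical conditioning.
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree={degree:2d}  train MSE={results[degree][0]:.4f}  "
          f"test MSE={results[degree][1]:.4f}")
```

Training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. The test error is the number to watch.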
2. Ridge Regression (L2 Regularization)
Adding a Simplicity Penalty
Idea: Penalize large coefficients to keep the model simple.
Ridge cost function: [ J(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 + \alpha \sum_{j=1}^{d} w_j^2 ]
Or more compactly: [ J(\mathbf{w}) = \text{MSE}(\mathbf{w}) + \alpha \|\mathbf{w}\|_2^2 ]
Parameters:
- (\alpha \ge 0): regularization strength
- (\alpha = 0): No regularization (standard linear regression)
- (\alpha \to \infty): Maximum regularization (all weights → 0)
- (\|\mathbf{w}\|_2^2 = \sum w_j^2): squared L2 (Euclidean) norm
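As a quick check on the formula, the cost can be evaluated directly. The sketch below assumes no intercept term and uses a hypothetical helper name, `ridge_cost`, with a tiny hand-checkable dataset.

```python
import numpy as np

def ridge_cost(w, X, y, alpha):
    """MSE plus alpha times the squared L2 norm of the weights."""
    residuals = y - X @ w
    mse = np.mean(residuals ** 2)
    penalty = alpha * np.sum(w ** 2)
    return float(mse + penalty)

X = np.array([[1.0, 0.0],
              [0.0, 1.0]])
y = np.array([1.0, 2.0])
w = np.array([1.0, 1.0])

print(ridge_cost(w, X, y, alpha=0.0))  # MSE only: (0^2 + 1^2) / 2
print(ridge_cost(w, X, y, alpha=0.1))  # adds 0.1 * (1^2 + 1^2) to the MSE
```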
Closed-Form Solution
Ridge regression has an analytical solution:
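[ \mathbf{w}^* = (\mathbf{X}^T \mathbf{X} + n\alpha \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y} ]
The factor (n) comes from the (1/n) in the MSE term of the cost function. If the data term is written as a plain sum of squared errors instead, the factor disappears and the added term is just (\alpha \mathbf{I}).
Key difference from normal equation: for any (\alpha > 0), the added multiple of (\mathbf{I}) makes the matrix positive definite, so it is always invertible!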
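A minimal sketch of the closed-form solve follows, under two assumptions: there is no intercept, and the cost is the 1/n-averaged version defined earlier. With that averaging, setting the gradient to zero puts n times alpha on the diagonal; texts that use a plain sum of squared errors absorb the factor n into alpha. The function name `ridge_fit` is illustrative. The example deliberately uses more features than samples, where the unregularized matrix is singular.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Minimize (1/n) * ||y - X w||^2 + alpha * ||w||^2 in closed form."""
    n, d = X.shape
    # The 1/n in the MSE turns the penalty into n * alpha in the linear system.
    A = X.T @ X + n * alpha * np.eye(d)
    # Solving the system is more stable than forming the inverse explicitly.
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
n, d = 5, 10                      # more features than samples
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

rank = np.linalg.matrix_rank(X.T @ X)
print(rank)                       # at most n, so the d x d matrix X^T X is singular
w = ridge_fit(X, y, alpha=0.1)

# The gradient of the ridge cost should vanish at the solution.
grad = -2.0 / n * X.T @ (y - X @ w) + 2.0 * 0.1 * w
print(np.max(np.abs(grad)))
```

Using `np.linalg.solve` rather than an explicit inverse is a design choice for numerical stability; the result is the same vector.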
Geometric Interpretation
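Ridge regression is equivalent to minimizing the MSE while constraining the weights to lie inside a sphere (L2 ball) centered at the origin. A larger (\alpha) corresponds to a smaller radius.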
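One way to see the ball picture numerically is to watch the length of the fitted weight vector as the penalty grows. The sketch below assumes synthetic data and a small illustrative helper, `ridge_fit`, that solves the 1/n-averaged ridge problem without an intercept.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 4
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=n)

def ridge_fit(X, y, alpha):
    """Closed-form ridge for (1/n) * ||y - X w||^2 + alpha * ||w||^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * alpha * np.eye(d), X.T @ y)

alphas = [0.0, 0.1, 1.0, 10.0, 100.0]
norms = [float(np.linalg.norm(ridge_fit(X, y, a))) for a in alphas]
for a, r in zip(alphas, norms):
    print(f"alpha={a:6.1f}  ||w||_2={r:.4f}")
```

Each value of alpha corresponds to some radius of the L2 ball: the printed norms shrink steadily as alpha increases.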
3. Lasso Regression (L1 Regularization)
Sparse Solutions
Lasso cost function: [ J(\mathbf{w}) = \text{MSE}(\mathbf{w}) + \alpha \sum_{j=1}^{d} |w_j| = \text{MSE}(\mathbf{w}) + \alpha \|\mathbf{w}\|_1 ]
Key difference: Uses absolute values ((|w_j|)) instead of squares!
Magic property: Lasso drives some coefficients exactly to zero → automatic feature selection!
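The exact zeros are easiest to see with a single weight. Suppose the unregularized estimate of one coefficient is z and we minimize (w - z)^2 plus a penalty; this one-dimensional setup is a simplifying assumption for illustration, and the function names are hypothetical. The L1 penalty yields soft-thresholding, while the L2 penalty only rescales.

```python
import numpy as np

def lasso_1d(z, alpha):
    """argmin_w (w - z)^2 + alpha * |w|  (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha / 2.0, 0.0)

def ridge_1d(z, alpha):
    """argmin_w (w - z)^2 + alpha * w^2  (proportional shrinkage)."""
    return z / (1.0 + alpha)

z = np.array([-2.0, -0.3, 0.1, 0.4, 1.5])   # unregularized estimates
alpha = 1.0
w_l1 = lasso_1d(z, alpha)
w_l2 = ridge_1d(z, alpha)
print(w_l1)   # estimates with |z| <= alpha / 2 are set exactly to zero
print(w_l2)   # every estimate is halved, none becomes zero
```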
Why Lasso Creates Sparsity
The L1 constraint forms a diamond (not a circle). Where Ridge's circular constraint lets the solution land anywhere on a smooth curve, Lasso's diamond has sharp corners on the axes, and the loss contours tend to first touch the budget region at one of those corners, forcing a coefficient to exactly zero.
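Lasso has no closed-form solution in general, so it is usually fitted iteratively. The sketch below uses cyclic coordinate descent with soft-thresholding, one common approach and not necessarily the one this lesson has in mind. It assumes no intercept, the 1/n-averaged MSE from the cost function, and synthetic data in which only two of five features matter; `lasso_cd` and `soft_threshold` are illustrative names.

```python
import numpy as np

def soft_threshold(c, t):
    """Shrink c toward zero by t, clipping at zero."""
    return np.sign(c) * max(abs(c) - t, 0.0)

def lasso_cd(X, y, alpha, n_iters=200):
    """Coordinate descent for (1/n) * ||y - X w||^2 + alpha * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    z = np.sum(X ** 2, axis=0) / n          # per-feature curvature
    for _ in range(n_iters):
        for j in range(d):
            # Residual with feature j's current contribution removed.
            r_j = y - X @ w + X[:, j] * w[j]
            c_j = X[:, j] @ r_j / n
            w[j] = soft_threshold(c_j, alpha / 2.0) / z[j]
    return w

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))
w_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ w_true + rng.normal(scale=0.5, size=n)

w_lasso = lasso_cd(X, y, alpha=0.5)
w_ridge = np.linalg.solve(X.T @ X + n * 0.5 * np.eye(5), X.T @ y)
print("lasso:", np.round(w_lasso, 3))
print("ridge:", np.round(w_ridge, 3))
```

On this data the Lasso fit zeroes out the three irrelevant features, while the ridge fit at the same alpha keeps small nonzero values for them.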
4. Elastic Net: Best of Both Worlds
Combines L1 and L2: [ J(\mathbf{w}) = \text{MSE}(\mathbf{w}) + \alpha \left[ \rho \|\mathbf{w}\|_1 + (1-\rho) \|\mathbf{w}\|_2^2 \right] ]
Parameters:
- (\alpha): Overall regularization strength
- (\rho \in [0, 1]): Mix between L1 and L2
- (\rho = 0): Pure Ridge
- (\rho = 1): Pure Lasso
- (\rho = 0.5): Equal mix
When to use: When you have correlated features and want both feature selection (L1) and grouping effect (L2).
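The combined penalty changes a coordinate-descent update in two places: the L1 part sets the soft-threshold, and the L2 part adds to the denominator. The sketch below assumes the exact cost written above (1/n-averaged MSE, no intercept) and synthetic data with two nearly identical features; `elastic_net_cd` is an illustrative name. Note that scikit-learn's `ElasticNet` scales its terms differently, so alpha values are not directly comparable.

```python
import numpy as np

def elastic_net_cd(X, y, alpha, rho, n_iters=500):
    """Coordinate descent for
    (1/n)*||y - X w||^2 + alpha * (rho*||w||_1 + (1 - rho)*||w||_2^2)."""
    n, d = X.shape
    w = np.zeros(d)
    z = np.sum(X ** 2, axis=0) / n
    for _ in range(n_iters):
        for j in range(d):
            r_j = y - X @ w + X[:, j] * w[j]
            c_j = X[:, j] @ r_j / n
            # L1 part sets the threshold; L2 part adds to the denominator.
            shrunk = np.sign(c_j) * max(abs(c_j) - alpha * rho / 2.0, 0.0)
            w[j] = shrunk / (z[j] + alpha * (1.0 - rho))
    return w

rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)
X = np.column_stack([
    base + 0.05 * rng.normal(size=n),   # two nearly identical features
    base + 0.05 * rng.normal(size=n),
    rng.normal(size=n),                 # an irrelevant feature
])
y = 2.0 * base + rng.normal(scale=0.5, size=n)

fits = {rho: elastic_net_cd(X, y, alpha=0.5, rho=rho) for rho in (0.0, 0.5, 1.0)}
for rho, w in fits.items():
    print(f"rho={rho:.1f}  w={np.round(w, 3)}")
```

With rho = 0 the loop reproduces the ridge solution, and with rho = 1 it reduces to the Lasso update, which matches the two pure cases in the parameter list.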
5. Choosing Regularization
Decision Guide
| Scenario | Best Choice | Reason |
|---|---|---|
| Many correlated features | Ridge | Keeps all, handles multicollinearity |
| Many irrelevant features | Lasso | Automatic feature selection |
| Mix of above | Elastic Net | Balance between Ridge and Lasso |
| (d > n) (more features than samples) | Ridge or Lasso | Ordinary least squares has no unique solution |
| Need interpretability | Lasso | Sparse model, clear feature importance |
Choosing (\alpha) (Regularization Strength)
Use cross-validation!
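A minimal sketch of k-fold cross-validation over a grid of candidate strengths is shown below. It assumes ridge without an intercept, a log-spaced grid, five folds, and synthetic data; all of these are illustrative choices, and the helper names are hypothetical. In practice, scikit-learn's cross-validated estimators (`RidgeCV`, `LassoCV`, `ElasticNetCV`) cover the same ground.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge for (1/n) * ||y - X w||^2 + alpha * ||w||^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * alpha * np.eye(d), X.T @ y)

def cv_mse(X, y, alpha, k=5, seed=0):
    """Average validation MSE of ridge over k shuffled folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train], y[train], alpha)
        scores.append(np.mean((y[val] - X[val] @ w) ** 2))
    return float(np.mean(scores))

rng = np.random.default_rng(42)
n, d = 60, 30
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]
y = X @ w_true + rng.normal(scale=1.0, size=n)

alphas = np.logspace(-4, 2, 13)
scores = [cv_mse(X, y, a) for a in alphas]
best_alpha = alphas[int(np.argmin(scores))]
print(f"best alpha = {best_alpha:.4g}, CV MSE = {min(scores):.3f}")
```

The same seed is reused for every alpha so that all candidates are scored on identical folds, which keeps the comparison fair.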
Key Takeaways
✓ Regularization: Adds penalty to loss function to prevent overfitting
✓ Ridge (L2):
- Penalty: (\alpha \sum w_j^2)
- Shrinks all coefficients toward zero
- Never exactly zero → keeps all features
- Good for correlated features
✓ Lasso (L1):
- Penalty: (\alpha \sum |w_j|)
- Drives coefficients to exactly zero
- Automatic feature selection
- Good for high-dimensional sparse problems
✓ Elastic Net: Combines L1 and L2 for best of both worlds
✓ Choosing α: Use cross-validation to find optimal regularization strength
Practice Problems
Problem 1: Implement Ridge from Scratch
Problem 2: Compare L1 vs L2 Regularization Paths
Next Steps
With regularization mastered, you can now:
- Handle high-dimensional data
- Prevent overfitting
- Perform automatic feature selection
Next lessons explore non-linear models:
- Lesson 6: Decision Trees – non-linear decision boundaries
- Lesson 7: Random Forests – ensemble power
Further Reading
Interactive Visualizations
- Explained.AI — How Regularization Works (Terence Parr) — the geometric picture: rotating contours against L1 diamonds and L2 circles, with sliders for λ.
- Setosa — Ordinary Least Squares (with shrinkage) — add penalty terms and watch the fitted line shrink toward zero.
- MLU-Explain: Logistic Regression — includes an interactive regularization-strength slider showing coefficient paths live.
Video Tutorials
- StatQuest — Ridge (L2) Regression & Lasso (L1) Regression & Elastic Net (Josh Starmer) — the three canonical videos, watched in order.
- 3Blue1Brown — But what is a Neural Network's regularization? — the regularization intuitions generalize past linear models.
Papers & Articles
- Regression Shrinkage and Selection via the Lasso — Tibshirani, 1996. The original paper.
- Regularization and Variable Selection via the Elastic Net — Zou & Hastie, 2005.
- A Bayesian view of regularization — Ridge = Gaussian prior, Lasso = Laplace prior; why this matters in practice.
Documentation & Books
- Book: The Elements of Statistical Learning — Chapter 3.4 (free PDF).
- Book: Statistical Learning with Sparsity: The Lasso and Generalizations — Hastie, Tibshirani, Wainwright (free PDF).
- scikit-learn: Regularized Linear Models — Ridge, Lasso, ElasticNet with cross-validated variants.
Remember: When in doubt, start with Ridge. If you need feature selection, try Lasso. For complex scenarios, Elastic Net is your friend!