Introduction: Taming Overfitting
Imagine you're fitting a curve through noisy data. With enough polynomial terms, you can make the fitted curve pass through every single point perfectly. But this "perfect" fit is terrible – it follows the noise, not the pattern!
Regularization is like telling the model: "Yes, fit the data well, but keep it simple. Don't go crazy with wild coefficients just to capture every tiny wiggle."
Key Insight: Regularization adds a "simplicity penalty" to the loss function, discouraging complex models that overfit.
Learning Objectives
- Understand overfitting and why regularization helps
- Master Ridge regression (L2 regularization)
- Learn Lasso regression (L1 regularization)
- Combine both with Elastic Net
- Choose the right regularization for your problem
- Implement regularized models from scratch
1. The Overfitting Problem (Revisited)
When Models Get Too Confident
The core issue: With many features or complex models, we can fit training data perfectly but fail on new data.
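For a concrete picture, here is a minimal sketch (synthetic sine data and scikit-learn pipelines are my own choices, not a prescribed setup) that fits polynomials of increasing degree and compares training error to test error:

```python
# A small demo of overfitting: a high-degree polynomial fits the training
# points almost perfectly but generalizes poorly to held-out data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=30)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

for degree in [1, 3, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree={degree:2d}  "
          f"train MSE={mean_squared_error(y_train, model.predict(X_train)):.4f}  "
          f"test MSE={mean_squared_error(y_test, model.predict(X_test)):.4f}")
```

The degree-15 model typically shows a near-zero training error alongside a much larger test error – exactly the gap regularization is designed to close.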
2. Ridge Regression (L2 Regularization)
Adding a Simplicity Penalty
Idea: Penalize large coefficients to keep the model simple.
Ridge cost function:

$$ J(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 + \alpha \sum_{j=1}^{d} w_j^2 $$

Or more compactly:

$$ J(\mathbf{w}) = \text{MSE}(\mathbf{w}) + \alpha \|\mathbf{w}\|_2^2 $$

Parameters:
- $\alpha > 0$: regularization strength
- $\alpha = 0$: No regularization (standard linear regression)
- $\alpha \to \infty$: Maximum regularization (all weights → 0)
- $\|\mathbf{w}\|_2^2 = \sum_j w_j^2$: squared L2 (Euclidean) norm
Closed-Form Solution
Ridge regression has an analytical solution:
$$ \mathbf{w}^* = (\mathbf{X}^T \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y} $$
Key difference from the normal equation: the added $\alpha \mathbf{I}$ term makes $\mathbf{X}^T \mathbf{X} + \alpha \mathbf{I}$ positive definite, so it is always invertible – even with collinear features or $d > n$!
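As a sketch, the closed form can be implemented directly in NumPy and checked against scikit-learn's Ridge (the synthetic data and the value of alpha below are arbitrary choices):

```python
# Closed-form Ridge: w* = (X^T X + alpha*I)^{-1} X^T y.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

alpha = 1.0
d = X.shape[1]
# Solve (X^T X + alpha*I) w = X^T y instead of explicitly inverting the matrix
w_closed_form = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

# scikit-learn's Ridge minimizes the sum-of-squares form of the loss,
# which matches this closed form; fit_intercept=False keeps the problems identical.
ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)
print("closed form :", np.round(w_closed_form, 4))
print("sklearn     :", np.round(ridge.coef_, 4))
```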
Geometric Interpretation
Ridge regression constrains the weights to lie inside an L2 ball (a disk in two dimensions), i.e. it minimizes the MSE subject to $\|\mathbf{w}\|_2^2 \le t$:
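A rough sketch of that picture (2-D synthetic data with correlated features and an arbitrary ball radius, both my own choices) plots the elliptical MSE contours together with the L2 constraint region:

```python
# MSE contours plus the L2 constraint disk ||w||_2 <= t in weight space.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
X[:, 1] = 0.6 * X[:, 0] + 0.8 * X[:, 1]   # correlate the features -> elliptical contours
y = X @ np.array([3.0, 2.0]) + rng.normal(scale=0.5, size=50)

# Evaluate the MSE on a grid of candidate weight vectors (w1, w2)
w1, w2 = np.meshgrid(np.linspace(-3, 5, 200), np.linspace(-3, 5, 200))
preds = X[:, 0, None, None] * w1 + X[:, 1, None, None] * w2
mse = ((y[:, None, None] - preds) ** 2).mean(axis=0)

fig, ax = plt.subplots(figsize=(5, 5))
ax.contour(w1, w2, mse, levels=15, cmap="viridis")        # MSE contours
ax.add_patch(plt.Circle((0, 0), 2.0, color="red", alpha=0.3))  # L2 ball of radius 2
ax.set_xlabel("$w_1$")
ax.set_ylabel("$w_2$")
ax.set_title("Ridge: MSE contours and the L2 constraint region")
ax.set_aspect("equal")
plt.show()
```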
3. Lasso Regression (L1 Regularization)
Sparse Solutions
Lasso cost function:

$$ J(\mathbf{w}) = \text{MSE}(\mathbf{w}) + \alpha \sum_{j=1}^{d} |w_j| = \text{MSE}(\mathbf{w}) + \alpha \|\mathbf{w}\|_1 $$

Key difference: Uses absolute values ($|w_j|$) instead of squares!
Magic property: Lasso drives some coefficients exactly to zero → automatic feature selection!
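As an illustrative sketch (synthetic data in which only 3 of 10 features carry signal; the alpha value is an arbitrary choice), Lasso zeroes out the irrelevant coefficients while ordinary least squares keeps them all small but nonzero:

```python
# Lasso's sparsity in action: irrelevant coefficients are driven exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])  # only 3 features matter
y = X @ true_w + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("OLS coefficients  :", np.round(ols.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Features kept by Lasso:", np.flatnonzero(lasso.coef_ != 0))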
Why Lasso Creates Sparsity
The L1 constraint forms a diamond (not a circle):
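A quick sketch of the two constraint regions in 2-D makes the difference visible:

```python
# The L1 "ball" is a diamond whose corners sit on the axes; when the MSE
# contours touch a corner, one coordinate of the solution is exactly zero.
# The smooth L2 circle has no corners, so Ridge shrinks weights without zeroing them.
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 2 * np.pi, 400)
diamond = np.array([[1, 0], [0, 1], [-1, 0], [0, -1], [1, 0]])

fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(np.cos(theta), np.sin(theta), label="L2 ball (circle)")
ax.plot(diamond[:, 0], diamond[:, 1], label="L1 ball (diamond)")
ax.axhline(0, color="gray", lw=0.5)
ax.axvline(0, color="gray", lw=0.5)
ax.set_xlabel("$w_1$")
ax.set_ylabel("$w_2$")
ax.set_aspect("equal")
ax.legend()
ax.set_title("Unit L1 and L2 constraint regions")
plt.show()
```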
4. Elastic Net: Best of Both Worlds
Combines L1 and L2:

$$ J(\mathbf{w}) = \text{MSE}(\mathbf{w}) + \alpha \left[ \rho \|\mathbf{w}\|_1 + (1-\rho) \|\mathbf{w}\|_2^2 \right] $$

Parameters:
- $\alpha$: Overall regularization strength
- $\rho \in [0, 1]$: Mix between L1 and L2
- $\rho = 0$: Pure Ridge
- $\rho = 1$: Pure Lasso
- $\rho = 0.5$: Equal mix
When to use: When you have correlated features and want both feature selection (L1) and grouping effect (L2).
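A small sketch with two nearly duplicate features (synthetic data; note that scikit-learn's ElasticNet calls the mixing parameter l1_ratio, which plays the role of $\rho$ here, and the alpha values are arbitrary):

```python
# Elastic Net on correlated features, compared with Ridge and Lasso.
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
# Two nearly identical (highly correlated) features plus one independent one
X = np.column_stack([z, z + 0.01 * rng.normal(size=n), rng.normal(size=n)])
y = 2 * z + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=n)

for name, model in [
    ("Ridge      ", Ridge(alpha=1.0)),
    ("Lasso      ", Lasso(alpha=0.1)),
    ("Elastic Net", ElasticNet(alpha=0.1, l1_ratio=0.5)),
]:
    model.fit(X, y)
    print(name, np.round(model.coef_, 3))
```

Lasso often keeps only one of the correlated pair, while Elastic Net tends to share the weight across both – the grouping effect mentioned above.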
5. Choosing Regularization
Decision Guide
| Scenario | Best Choice | Reason |
|---|---|---|
| Many correlated features | Ridge | Keeps all features, handles multicollinearity |
| Many irrelevant features | Lasso | Automatic feature selection |
| Mix of the above | Elastic Net | Balance between Ridge and Lasso |
| $d > n$ (more features than samples) | Ridge or Lasso | Standard linear regression fails |
| Need interpretability | Lasso | Sparse model, clear feature importance |
Choosing $\alpha$ (Regularization Strength)
Use cross-validation!
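For example, scikit-learn's RidgeCV and LassoCV search a grid of alphas with built-in cross-validation (the grid and the synthetic data below are arbitrary choices):

```python
# Picking alpha by cross-validation over a log-spaced grid.
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
true_w = np.array([2.0, -1.0, 0, 0, 0.5, 0, 0, 0])
y = X @ true_w + rng.normal(scale=0.5, size=200)

alphas = np.logspace(-3, 3, 25)

ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)
lasso_cv = LassoCV(alphas=alphas, cv=5, random_state=0).fit(X, y)

print("Best Ridge alpha:", ridge_cv.alpha_)
print("Best Lasso alpha:", lasso_cv.alpha_)
```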
Key Takeaways
✓ Regularization: Adds penalty to loss function to prevent overfitting
✓ Ridge (L2):
- Penalty: $\alpha \sum_j w_j^2$
- Shrinks all coefficients toward zero
- Never exactly zero → keeps all features
- Good for correlated features
✓ Lasso (L1):
- Penalty: $\alpha \sum_j |w_j|$
- Drives coefficients to exactly zero
- Automatic feature selection
- Good for high-dimensional sparse problems
✓ Elastic Net: Combines L1 and L2 for best of both worlds
✓ Choosing α: Use cross-validation to find optimal regularization strength
Practice Problems
Problem 1: Implement Ridge from Scratch
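One possible solution sketch (among many): fit Ridge by full-batch gradient descent and sanity-check it against the closed form from Section 2. The function name, learning rate, and synthetic data are my own choices.

```python
# Gradient descent on the sum-of-squares form of the Ridge objective,
#   J(w) = ||y - Xw||^2 + alpha * ||w||^2,
# whose minimizer is the closed-form solution shown earlier.
import numpy as np

def ridge_gd(X, y, alpha=1.0, lr=1e-3, n_iters=5000):
    """Minimize ||y - Xw||^2 + alpha * ||w||^2 by gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = -2 * X.T @ (y - X @ w) + 2 * alpha * w   # gradient of J(w)
        w -= lr * grad
    return w

# Sanity check against w* = (X^T X + alpha*I)^{-1} X^T y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

alpha = 1.0
w_gd = ridge_gd(X, y, alpha=alpha)
w_cf = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)
print("gradient descent:", np.round(w_gd, 4))
print("closed form     :", np.round(w_cf, 4))
```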
Problem 2: Compare L1 vs L2 Regularization Paths
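A starter sketch: fit Ridge and Lasso across a log-spaced grid of alphas and plot each coefficient's path (the synthetic data and the alpha grid below are placeholders for whatever dataset you choose).

```python
# Regularization paths: how each coefficient changes as alpha grows.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
true_w = np.array([3.0, -2.0, 1.0, 0, 0, 0, 0, 0])
y = X @ true_w + rng.normal(scale=0.5, size=100)

alphas = np.logspace(-3, 2, 50)
ridge_path = np.array([Ridge(alpha=a).fit(X, y).coef_ for a in alphas])
lasso_path = np.array([Lasso(alpha=a, max_iter=10000).fit(X, y).coef_ for a in alphas])

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, path, title in [(axes[0], ridge_path, "Ridge (L2) path"),
                        (axes[1], lasso_path, "Lasso (L1) path")]:
    ax.plot(alphas, path)            # one line per coefficient
    ax.set_xscale("log")
    ax.set_xlabel("alpha")
    ax.set_title(title)
axes[0].set_ylabel("coefficient value")
plt.tight_layout()
plt.show()
```

Notice how Ridge coefficients shrink smoothly toward zero, while Lasso coefficients hit exactly zero one by one as alpha increases.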
Next Steps
With regularization mastered, you can now:
- Handle high-dimensional data
- Prevent overfitting
- Perform automatic feature selection
Next lessons explore non-linear models:
- Lesson 6: Decision Trees – non-linear decision boundaries
- Lesson 7: Random Forests – ensemble power
Further Reading
- Tutorial: Regularized Linear Models in scikit-learn
- Theory: The Elements of Statistical Learning (Chapter 3.4)
- Practical: Ridge vs Lasso: When to Use Which?
- Advanced: Bayesian interpretation of regularization
Remember: When in doubt, start with Ridge. If you need feature selection, try Lasso. For complex scenarios, Elastic Net is your friend!