Introduction: Less Can Be More
Imagine cleaning your closet. You have 500 items but only wear 50 regularly. The other 450 just add clutter, make it hard to find things, and waste space!
Feature selection is the same idea: remove features that don't help your model (or that actively hurt it). The result? Faster training, better generalization, and easier interpretation.
More features ≠ better model. Often, fewer good features beat many mediocre ones!
Key Insight: Irrelevant and redundant features add noise, increase overfitting risk, and slow down training. Feature selection searches for a small subset that keeps the predictive signal.
Learning Objectives
- Understand why feature selection matters
- Master filter methods (correlation, mutual information)
- Apply wrapper methods (RFE, sequential selection)
- Use embedded methods (L1 regularization, tree-based)
- Handle multicollinearity
- Avoid selection bias in cross-validation
- Choose appropriate methods for different problems
1. Why Feature Selection Matters
The Curse of Dimensionality
Problem: As the number of dimensions grows, a fixed amount of data becomes sparse: every point ends up far from every other point, so models have room to fit noise and overfit.
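A minimal sketch of that sparsity, using NumPy and SciPy on synthetic uniform data (my illustration, not part of the lesson's examples): with a fixed sample size, the nearest neighbor is barely closer than the average point once dimensionality is high.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n_samples = 200

for n_dims in (2, 10, 100, 1000):
    X = rng.uniform(size=(n_samples, n_dims))
    d = pdist(X)  # all pairwise Euclidean distances
    # As dimensions grow, the min/mean ratio approaches 1: distances concentrate,
    # so "local" neighborhoods stop being meaningfully local.
    print(f"dims={n_dims:4d}  nearest={d.min():.2f}  average={d.mean():.2f}  "
          f"ratio={d.min() / d.mean():.2f}")
```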
2. Filter Methods: Independent of Model
Based on Statistical Properties
Idea: Score features independently, keep top-k
Advantages: Fast, model-agnostic
Disadvantages: Ignores feature interactions
Variance Threshold
Remove features with low variance (nearly constant)
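A rough sketch with scikit-learn's VarianceThreshold (the toy matrix and the 0.01 threshold are illustrative assumptions):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [0, 1.0, 0.1],
    [0, 2.0, 0.2],
    [0, 3.0, 0.1],
    [0, 4.0, 0.2],
])  # first column is constant, third barely varies

selector = VarianceThreshold(threshold=0.01)  # drop features with variance <= 0.01
X_reduced = selector.fit_transform(X)

print(selector.variances_)     # per-feature variances
print(selector.get_support())  # boolean mask of kept features
print(X_reduced.shape)         # (4, 1) -> only the second column survives
```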
Correlation with Target
Select features highly correlated with target
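One possible sketch, scoring features by absolute Pearson correlation with the target via scikit-learn's r_regression inside SelectKBest (the synthetic dataset and k=5 are assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, r_regression

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Score by |Pearson r| so strong negative correlations count as well
selector = SelectKBest(score_func=lambda X, y: np.abs(r_regression(X, y)), k=5)
X_top = selector.fit_transform(X, y)

print(selector.get_support(indices=True))  # indices of the 5 selected features
print(X_top.shape)                         # (200, 5)
```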
Mutual Information
Measures how much knowing a feature reduces uncertainty about the target, so it captures non-linear dependencies that correlation misses
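A small synthetic sketch (my example, not the lesson's) comparing mutual_info_regression with Pearson correlation on a feature that influences the target only quadratically:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x_linear = rng.uniform(-2, 2, size=500)
x_nonlinear = rng.uniform(-2, 2, size=500)
x_noise = rng.uniform(-2, 2, size=500)

y = x_linear + x_nonlinear ** 2          # one linear and one purely non-linear signal
X = np.column_stack([x_linear, x_nonlinear, x_noise])

mi = mutual_info_regression(X, y, random_state=0)
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

print("mutual information:", mi.round(3))    # high for both signal features
print("|Pearson r|       :", corr.round(3))  # near zero for the quadratic feature
```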
3. Wrapper Methods: Model-Based Selection
Recursive Feature Elimination (RFE)
Algorithm:
- Train model on all features
- Remove least important feature
- Repeat until the desired number of features remains (see the sketch below)
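A minimal sketch of this loop using scikit-learn's RFE (the logistic-regression estimator and n_features_to_select=5 are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

rfe = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=5,  # stop when 5 features remain
    step=1,                  # drop the single weakest feature per iteration
)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of selected features
print(rfe.ranking_)   # 1 = selected; higher numbers were eliminated earlier
```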
Forward/Backward Sequential Selection
Forward: Start with no features, add the best feature iteratively
Backward: Start with all features, remove the worst feature iteratively
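A hedged sketch with scikit-learn's SequentialFeatureSelector on synthetic data; swap direction="backward" for backward elimination:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=15, n_informative=5, random_state=0)

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    direction="forward",   # add the feature that most improves CV score at each step
    cv=5,
)
sfs.fit(X, y)

print(sfs.get_support(indices=True))  # indices of the selected features
```

Note that each step re-evaluates candidate features with cross-validation, which is why wrapper methods are thorough but slow.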
4. Embedded Methods: Selection During Training
L1 Regularization (Lasso)
L1 penalty drives some coefficients to exactly zero → automatic feature selection!
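One way to sketch this with scikit-learn: fit a Lasso and keep the features whose coefficients survive, via SelectFromModel (alpha=1.0 is an arbitrary assumption; in practice tune it, e.g. with LassoCV):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=30, n_informative=5, noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)  # L1 penalties are scale-sensitive

lasso = Lasso(alpha=1.0).fit(X, y)            # alpha chosen arbitrarily for illustration
selector = SelectFromModel(lasso, prefit=True)

print((lasso.coef_ != 0).sum(), "non-zero coefficients")
print(selector.transform(X).shape)            # only the surviving features remain
```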
Tree-Based Feature Importance
Decision trees and ensembles provide built-in feature importance
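A minimal sketch that ranks features with a random forest's impurity-based importances and cross-checks them with permutation importance, which is less prone to the biases flagged in the readings below (synthetic data, illustrative settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Impurity-based importances come for free but can favor high-cardinality features
print("impurity importances   :", forest.feature_importances_.round(3))

# Permutation importance measures the score drop when a feature is shuffled
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
print("permutation importances:", perm.importances_mean.round(3))
```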
5. Handling Multicollinearity
Detecting and Removing Correlated Features
Problem: Highly correlated features carry redundant information; they add little signal but inflate model complexity and destabilize linear-model coefficients
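A hedged sketch of the usual correlation-threshold pruning; the helper name drop_correlated is mine, and dropping the second member of each correlated pair is an arbitrary convention:

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> list[str]:
    """Return column names to drop so no remaining pair exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=100)})
df["b"] = df["a"] * 2 + rng.normal(scale=0.01, size=100)  # nearly duplicates "a"
df["c"] = rng.normal(size=100)                            # independent feature

to_drop = drop_correlated(df, threshold=0.9)
print(to_drop)                         # ['b']
print(df.drop(columns=to_drop).shape)  # (100, 2)
```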
6. Avoiding Selection Bias
The Right Way: Selection Inside CV
Problem: If you select features on the full dataset and then cross-validate, information from the test folds leaks into the selection step and you overestimate performance!
Solution: Feature selection must be done inside each CV fold
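A minimal sketch of the leakage-free pattern: the selector sits inside a Pipeline, so every fold re-fits it on that fold's training data only (the dataset and k=10 are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=10)),  # fit inside each fold
    ("clf", LogisticRegression(max_iter=1000)),
])

# The selector never sees the held-out fold, so the estimate is not inflated
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean().round(3))
```

Putting the selector in the pipeline (rather than transforming X up front) is exactly what makes the cross-validation estimate honest.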
Key Takeaways
✓ Why: Remove irrelevant/redundant features → faster, better generalization, interpretability
✓ Filter Methods: Fast, model-agnostic (variance, correlation, mutual information)
✓ Wrapper Methods: Model-based, considers feature combinations (RFE, sequential)
✓ Embedded Methods: Selection during training (L1 regularization, tree importance)
✓ Multicollinearity: Remove highly correlated features (threshold ~0.9)
✓ Avoid Bias: Always do selection INSIDE CV folds (use Pipeline)
✓ Trade-offs: Filter (fast, simple) vs Wrapper (slow, thorough) vs Embedded (integrated)
Practice Problems
Problem 1: Complete Feature Selection Pipeline
Problem 2: Compare Selection Methods
Next Steps
You've mastered feature selection! 🎉
Final lesson: End-to-End ML Project – tying together everything you've learned in a real-world workflow!
Further Reading
Interactive Visualizations
- scikit-learn — Feature Selection Gallery — live plots for VarianceThreshold, RFE, SelectKBest, and Lasso paths, all runnable.
- Setosa: Principal Component Analysis — the best interactive intro to PCA, a cousin of feature selection.
- SHAP — Feature Attribution Explorer — modern replacement for naive feature-importance; interactive notebooks showing why RF default importance can mislead.
- UMAP Explorer — Google PAIR's interactive explainer for non-linear dimensionality reduction, a neighbor of the PCA section here.
Video Tutorials
- StatQuest — PCA Main Ideas and PCA Step-By-Step (Josh Starmer) — pair them and PCA stops feeling mysterious.
- Data School — Feature Selection Strategies — hands-on practical walkthrough in scikit-learn.
Papers & Articles
- Feature Selection: A Data Perspective — Li et al., 2017. The authoritative survey.
- Beware Default Random Forest Importances — Parr et al. Why permutation importance (and SHAP) should replace Gini/MDI importance.
- A Unified Approach to Interpreting Model Predictions (SHAP) — Lundberg & Lee, NeurIPS 2017.
- Boruta: All-Relevant Feature Selection — Kursa & Rudnicki, 2010. A popular wrapper that aims to find all relevant features, not just the minimal set.
- The Curse(s) of Dimensionality — a modern re-examination.
Documentation & Books
- Book: Feature Engineering and Selection — Kuhn & Johnson (free online).
- Book: Interpretable Machine Learning — Christoph Molnar (free online) — the chapters on permutation importance and SHAP pair perfectly with this lesson.
- scikit-learn: Feature selection — the canonical API.
- Yellowbrick — Feature Analysis Visualizers — drop-in scikit-learn-compatible feature-analysis plots.
Remember: "It is not the strongest features that survive, nor the most intelligent, but the ones most responsive to change." – Adapted from Darwin