Introduction: The Wisdom of Crowds
Imagine you're trying to guess the number of jelly beans in a jar. Instead of relying on one person's guess, you ask 100 people and average their guesses. Surprisingly, this average is often more accurate than any individual guess, even an expert's!
This is the wisdom of crowds: combining many diverse opinions often beats individual expertise.
Random Forests apply this principle to machine learning: instead of one decision tree, we train many trees on slightly different data and average their predictions. This ensemble is more accurate and robust than any single tree!
Key Insight: Random Forests reduce overfitting and variance by combining predictions from multiple diverse trees trained through bootstrap aggregating (bagging).
Learning Objectives
- Understand ensemble learning principles
- Master bootstrap sampling and bagging
- Build Random Forests from scratch
- Tune forest hyperparameters
- Understand feature randomness and its benefits
- Compare Random Forests with single trees
1. Ensemble Learning: Combining Models
Why Ensembles?
A single decision tree is:
- ✅ Interpretable
- ✅ Fast to train
- ❌ High variance (unstable)
- ❌ Tends to overfit
Solution: Train multiple trees and combine them!
2. Bootstrap Aggregating (Bagging)
Bootstrap Sampling
Bootstrap: Sample n data points with replacement from a dataset of size n.
- Some samples appear multiple times
- Some samples don't appear at all: the chance a given point is never drawn is (1 − 1/n)^n → 1/e ≈ 0.368, so about 37% is left out
- Each bootstrap sample is slightly different
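The ~37% figure is easy to check empirically. A minimal sketch with NumPy (the variable names here are illustrative, not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
data_indices = np.arange(n)

# Draw n indices with replacement -- one bootstrap sample.
boot = rng.choice(data_indices, size=n, replace=True)

# Fraction of original points that never appear (the out-of-bag points).
oob_fraction = 1 - len(np.unique(boot)) / n
print(f"Out-of-bag fraction: {oob_fraction:.3f}")  # close to 1/e ≈ 0.368
```

For large n the out-of-bag fraction concentrates tightly around 1/e.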
Bagging Algorithm
Bootstrap Aggregating:
- Create B bootstrap samples from the training data
- Train one model (tree) on each bootstrap sample
- Aggregate predictions:
- Classification: Majority vote
- Regression: Average
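The three steps above can be sketched from scratch. This is a hedged, minimal implementation of bagging for classification (using scikit-learn's `DecisionTreeClassifier` as the base model and a synthetic dataset), not a production ensemble:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

B = 25  # number of bootstrap samples / trees
trees = []
for _ in range(B):
    # Step 1: bootstrap sample (indices drawn with replacement).
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Step 2: train one tree on that sample.
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx]))

# Step 3: aggregate by majority vote across the B trees.
votes = np.stack([t.predict(X_te) for t in trees])
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("bagged accuracy:", (y_pred == y_te).mean())
```

For regression, the only change is averaging the trees' numeric predictions instead of voting.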
3. Random Forests: Adding Feature Randomness
The Extra Ingredient
Problem: Bagging helps, but trees can still be too similar (correlated).
Solution: When splitting each node, consider only a random subset of the features!
Random Forest = Bagging + Feature Randomness
At each split:
- Select m features at random (typically m = √d for classification, where d is the total number of features)
- Find the best split among these m features only
- This decorrelates trees → better diversity → better ensemble
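The feature-subsampling step can be sketched as a tiny helper. The function name here is hypothetical, purely for illustration:

```python
import numpy as np

def candidate_features(d, rng):
    """Pick a random subset of m = ceil(sqrt(d)) feature indices to
    consider at one split -- the common random-forest default for
    classification."""
    m = int(np.ceil(np.sqrt(d)))
    return rng.choice(d, size=m, replace=False)

rng = np.random.default_rng(0)
subset = candidate_features(16, rng)
print(subset)  # 4 of the 16 feature indices, no repeats
```

A real forest calls something like this at every node of every tree, so different trees end up splitting on different features even when trained on similar data.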
4. Out-of-Bag (OOB) Error
Free Cross-Validation
Remember: ~37% of data is not in each bootstrap sample.
OOB samples: Samples not used to train a particular tree.
OOB Error: For each training sample, aggregate predictions only from the trees that didn't see it during training, then measure the error of those predictions over the whole training set.
Benefit: Get validation error estimate without separate validation set!
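In scikit-learn this is a single flag. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# oob_score=True evaluates each sample using only the trees whose
# bootstrap samples excluded it -- no separate hold-out set needed.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print(f"OOB accuracy: {rf.oob_score_:.3f}")
```

The `oob_score_` attribute is typically close to what k-fold cross-validation would report, at no extra training cost.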
5. Hyperparameter Tuning
Key Hyperparameters
| Parameter | What it Controls | Typical Values |
|---|---|---|
| `n_estimators` | Number of trees | 100–1000 (more is better, with diminishing returns) |
| `max_depth` | Maximum tree depth | 10–30 (or `None` for full depth) |
| `max_features` | Features considered per split | `'sqrt'` (classification default), `'log2'`, or an integer |
| `min_samples_split` | Minimum samples required to split a node | 2–10 |
| `min_samples_leaf` | Minimum samples in a leaf | 1–5 |
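These hyperparameters are commonly tuned with a cross-validated grid search. A small illustrative sketch (the grid values are example choices, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=600, n_features=12, random_state=0)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10],
    "max_features": ["sqrt", "log2"],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

For larger grids, `RandomizedSearchCV` is usually a cheaper alternative.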
6. Feature Importance
Random Forests automatically calculate feature importance!
Method: For each feature, sum the decrease in impurity (Gini/entropy) at every node that splits on it, weighted by the fraction of samples reaching that node, then average across all trees (mean decrease in impurity, MDI).
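In scikit-learn these MDI scores are exposed as `feature_importances_`. A minimal sketch on synthetic data where we know which features matter (with `shuffle=False`, `make_classification` places the informative features in the first columns):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 3 informative features out of 10; columns 0-2 are the informative ones.
X, y = make_classification(n_samples=800, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Mean decrease in impurity, normalized to sum to 1 across features.
importances = rf.feature_importances_
print(np.argsort(importances)[::-1][:3])  # indices of the top-3 features
```

Note the Parr et al. article in Further Reading: MDI is biased toward high-cardinality features, so permutation importance is often a safer choice.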
7. Advantages and Limitations
✅ Advantages
- High Accuracy: Often best off-the-shelf performance
- Robust: Handles outliers and noisy data well
- No Overfitting from More Trees: Adding trees doesn't cause overfitting (unlike growing a single tree deeper)
- Feature Importance: Automatic ranking
- Handles Mixed Data: Numerical and categorical features
- Parallel: Trees train independently (fast on multiple cores)
- OOB Error: Built-in validation
❌ Limitations
- Black Box: Less interpretable than single tree
- Memory: Stores many trees (can be large)
- Slow Prediction: Must query all trees
- Not for Extrapolation: Can't predict beyond training data range
- Bias: Biased toward features with many categories
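The extrapolation limitation is worth seeing directly. A small regression sketch: train on y = 2x for x in [0, 10], then predict outside that range:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = 2 * X_train.ravel()  # true relationship: y = 2x

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Outside the training range, each tree can only return averages of
# targets it has seen, so predictions plateau near max(y_train) = 20
# instead of following the linear trend (true values: 30 and 200).
print(rf.predict([[15.0], [100.0]]))
```

Linear models or gradient boosting with a linear component handle this kind of trend better.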
Key Takeaways
✓ Random Forests: Ensemble of decision trees via bagging + feature randomness
✓ Bagging: Bootstrap sampling + aggregating (majority vote or average)
✓ Feature Randomness: Consider only subset of features at each split → decorrelates trees
✓ OOB Error: Free validation using samples not in bootstrap → no need for separate val set
✓ Hyperparameters: n_estimators (more is better), max_depth, max_features
✓ Feature Importance: Automatic ranking of feature relevance
✓ Strengths: High accuracy, robust, handles mixed data, parallelizable
✓ Limitations: Black box, can't extrapolate, slower prediction than single tree
Practice Problems
Problem 1: Implement Simple Bagging
Problem 2: Compare Bagging vs Random Forest
Next Steps
Random Forests are powerful, but there's another ensemble method that often outperforms them on tabular data: Boosting!
Next lesson:
- Lesson 8: Gradient Boosting – sequentially building trees that fix previous errors
Boosting methods like XGBoost and LightGBM dominate ML competitions!
Further Reading
Interactive Visualizations
- MLU-Explain: Random Forest — a scroll-story showing how bootstrapping and feature-randomness produce diverse trees, with live retraining.
- A Visual Introduction to Machine Learning, Part 2 (R2D3) — ties ensembles directly to the bias-variance picture you've already seen.
- dtreeviz forest gallery — render a whole random forest's trees side by side to see how they disagree.
Video Tutorials
- StatQuest — Random Forests Part 1 and Part 2 (Josh Starmer) — Building, using, and evaluating, including OOB.
- Google ML Crash Course — Random Forests — short interactive exercises.
Papers & Articles
- Random Forests — Leo Breiman, 2001. The original paper — surprisingly readable.
- Beware Default Random Forest Importances — Parr et al. Why Gini/MDI importance is biased, and what to do instead (permutation importance).
- Extremely Randomized Trees — Geurts, Ernst, Wehenkel, 2006. A close cousin (ExtraTrees) with even more variance reduction.
- Understanding Random Forests: From Theory to Practice — Louppe, 2014. PhD thesis, the deepest free treatment.
Documentation & Books
- Book: The Elements of Statistical Learning — Chapter 15.
- scikit-learn: Forests of Randomized Trees — RandomForest, ExtraTrees, and tuning advice.
Remember: Random Forests combine simplicity (trees) with power (ensembles). They're often the first model to try on tabular data!