Introduction: The Wisdom of Crowds
Imagine you're trying to guess the number of jelly beans in a jar. Instead of relying on one person's guess, you ask 100 people and average their guesses. Surprisingly, this average is often more accurate than any individual guess, even an expert's!
This is the wisdom of crowds: combining many diverse opinions often beats individual expertise.
Random Forests apply this principle to machine learning: instead of one decision tree, we train many trees on slightly different data and average their predictions. This ensemble is more accurate and robust than any single tree!
Key Insight: Random Forests reduce overfitting and variance by combining predictions from multiple diverse trees trained through bootstrap aggregating (bagging).
Learning Objectives
- Understand ensemble learning principles
- Master bootstrap sampling and bagging
- Build Random Forests from scratch
- Tune forest hyperparameters
- Understand feature randomness and its benefits
- Compare Random Forests with single trees
1. Ensemble Learning: Combining Models
Why Ensembles?
A single decision tree is:
- ✅ Interpretable
- ✅ Fast to train
- ❌ High variance (unstable)
- ❌ Tends to overfit
Solution: Train multiple trees and combine them!
2. Bootstrap Aggregating (Bagging)
Bootstrap Sampling
Bootstrap: Sample n data points with replacement from a dataset of size n.
- Some samples appear multiple times
- Some samples don't appear at all: the chance a given point is never drawn is (1 − 1/n)^n → 1/e ≈ 0.368, so about 37% is left out
- Each bootstrap sample is slightly different
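The ~37% figure is easy to check empirically. A minimal sketch with NumPy (the variable names here are illustrative, not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
data_indices = np.arange(n)

# Draw n indices with replacement -- one bootstrap sample.
boot = rng.choice(data_indices, size=n, replace=True)

# Fraction of original points that never appear (the out-of-bag points).
oob_fraction = 1 - len(np.unique(boot)) / n
print(f"Out-of-bag fraction: {oob_fraction:.3f}")  # close to 1/e ≈ 0.368
```

For large n the out-of-bag fraction concentrates tightly around 1/e.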
Bagging Algorithm
Bootstrap Aggregating:
- Create B bootstrap samples from the training data
- Train one model (tree) on each bootstrap sample
- Aggregate predictions:
- Classification: Majority vote
- Regression: Average
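The three steps above can be sketched from scratch. This is a hedged, minimal implementation of bagging for classification (using scikit-learn's `DecisionTreeClassifier` as the base model and a synthetic dataset), not a production ensemble:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

B = 25  # number of bootstrap samples / trees
trees = []
for _ in range(B):
    # Step 1: bootstrap sample (indices drawn with replacement).
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Step 2: train one tree on that sample.
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx]))

# Step 3: aggregate by majority vote across the B trees.
votes = np.stack([t.predict(X_te) for t in trees])
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("bagged accuracy:", (y_pred == y_te).mean())
```

For regression, the only change is averaging the trees' numeric predictions instead of voting.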
3. Random Forests: Adding Feature Randomness
The Extra Ingredient
Problem: Bagging helps, but trees can still be too similar (correlated).
Solution: When splitting each node, consider only a random subset of the features!
Random Forest = Bagging + Feature Randomness
At each split:
- Select m features at random (typically m = √d for classification, where d is the total number of features)
- Find the best split among these m features only
- This decorrelates trees → better diversity → better ensemble
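The feature-subsampling step can be sketched as a tiny helper. The function name here is hypothetical, purely for illustration:

```python
import numpy as np

def candidate_features(d, rng):
    """Pick a random subset of m = ceil(sqrt(d)) feature indices to
    consider at one split -- the common random-forest default for
    classification."""
    m = int(np.ceil(np.sqrt(d)))
    return rng.choice(d, size=m, replace=False)

rng = np.random.default_rng(0)
subset = candidate_features(16, rng)
print(subset)  # 4 of the 16 feature indices, no repeats
```

A real forest calls something like this at every node of every tree, so different trees end up splitting on different features even when trained on similar data.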
4. Out-of-Bag (OOB) Error
Free Cross-Validation
Remember: ~37% of data is not in each bootstrap sample.
OOB samples: Samples not used to train a particular tree.
OOB Error: For each training sample, aggregate predictions only from the trees that didn't see it during training, then measure the error of those predictions over the whole training set.
Benefit: Get validation error estimate without separate validation set!
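In scikit-learn this is a single flag. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# oob_score=True evaluates each sample using only the trees whose
# bootstrap samples excluded it -- no separate hold-out set needed.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print(f"OOB accuracy: {rf.oob_score_:.3f}")
```

The `oob_score_` attribute is typically close to what k-fold cross-validation would report, at no extra training cost.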
5. Hyperparameter Tuning
Key Hyperparameters
| Parameter | What it Controls | Typical Values |
|---|---|---|
| `n_estimators` | Number of trees | 100–1000 (more is better, with diminishing returns) |
| `max_depth` | Maximum tree depth | 10–30 (or `None` for full depth) |
| `max_features` | Features considered per split | `'sqrt'` (classification default), `'log2'`, or an integer |
| `min_samples_split` | Minimum samples required to split a node | 2–10 |
| `min_samples_leaf` | Minimum samples in a leaf | 1–5 |
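These hyperparameters are commonly tuned with a cross-validated grid search. A small illustrative sketch (the grid values are example choices, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=600, n_features=12, random_state=0)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10],
    "max_features": ["sqrt", "log2"],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

For larger grids, `RandomizedSearchCV` is usually a cheaper alternative.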
6. Feature Importance
Random Forests automatically calculate feature importance!
Method: For each feature, sum the decrease in impurity (Gini/entropy) at every node that splits on it, weighted by the fraction of samples reaching that node, then average across all trees (mean decrease in impurity, MDI).
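In scikit-learn these MDI scores are exposed as `feature_importances_`. A minimal sketch on synthetic data where we know which features matter (with `shuffle=False`, `make_classification` places the informative features in the first columns):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 3 informative features out of 10; columns 0-2 are the informative ones.
X, y = make_classification(n_samples=800, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Mean decrease in impurity, normalized to sum to 1 across features.
importances = rf.feature_importances_
print(np.argsort(importances)[::-1][:3])  # indices of the top-3 features
```

Note the Parr et al. article in Further Reading: MDI is biased toward high-cardinality features, so permutation importance is often a safer choice.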
7. Advantages and Limitations
✅ Advantages
- High Accuracy: Often best off-the-shelf performance
- Robust: Handles outliers and noisy data well
- No Overfitting from More Trees: Adding trees doesn't cause overfitting (unlike growing a single tree deeper)
- Feature Importance: Automatic ranking
- Handles Mixed Data: Numerical and categorical features
- Parallel: Trees train independently (fast on multiple cores)
- OOB Error: Built-in validation
❌ Limitations
- Black Box: Less interpretable than single tree
- Memory: Stores many trees (can be large)
- Slow Prediction: Must query all trees
- Not for Extrapolation: Can't predict beyond training data range
- Bias: Biased toward features with many categories
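The extrapolation limitation is worth seeing directly. A small regression sketch: train on y = 2x for x in [0, 10], then predict outside that range:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = 2 * X_train.ravel()  # true relationship: y = 2x

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Outside the training range, each tree can only return averages of
# targets it has seen, so predictions plateau near max(y_train) = 20
# instead of following the linear trend (true values: 30 and 200).
print(rf.predict([[15.0], [100.0]]))
```

Linear models or gradient boosting with a linear component handle this kind of trend better.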
Key Takeaways
✓ Random Forests: Ensemble of decision trees via bagging + feature randomness
✓ Bagging: Bootstrap sampling + aggregating (majority vote or average)
✓ Feature Randomness: Consider only subset of features at each split → decorrelates trees
✓ OOB Error: Free validation using samples not in bootstrap → no need for separate val set
✓ Hyperparameters: n_estimators (more is better), max_depth, max_features
✓ Feature Importance: Automatic ranking of feature relevance
✓ Strengths: High accuracy, robust, handles mixed data, parallelizable
✓ Limitations: Black box, can't extrapolate, slower prediction than single tree
Practice Problems
Problem 1: Implement Simple Bagging
Problem 2: Compare Bagging vs Random Forest
Next Steps
Random Forests are powerful, but there's another ensemble method that often outperforms them on tabular data: Boosting!
Next lesson:
- Lesson 8: Gradient Boosting – sequentially building trees that fix previous errors
Boosting methods like XGBoost and LightGBM dominate ML competitions!
Further Reading
Interactive Visualizations
- MLU-Explain: Random Forest — a scroll-story showing how bootstrapping and feature-randomness produce diverse trees, with live retraining.
- A Visual Introduction to Machine Learning, Part 2 (R2D3) — ties ensembles directly to the bias-variance picture you've already seen.
- dtreeviz forest gallery — render a whole random forest's trees side by side to see how they disagree.
Video Tutorials
- StatQuest — Random Forests Part 1 and Part 2 (Josh Starmer) — Building, using, and evaluating, including OOB.
- Google ML Crash Course — Random Forests — short interactive exercises.
Papers & Articles
- Random Forests — Leo Breiman, 2001. The original paper — surprisingly readable.
- Beware Default Random Forest Importances — Parr et al. Why Gini/MDI importance is biased, and what to do instead (permutation importance).
- Extremely Randomized Trees — Geurts, Ernst, Wehenkel, 2006. A close cousin (ExtraTrees) with even more variance reduction.
- Understanding Random Forests: From Theory to Practice — Louppe, 2014. PhD thesis, the deepest free treatment.
Documentation & Books
- Book: The Elements of Statistical Learning — Chapter 15.
- scikit-learn: Forests of Randomized Trees — RandomForest, ExtraTrees, and tuning advice.
Remember: Random Forests combine simplicity (trees) with power (ensembles). They're often the first model to try on tabular data!