Random Forests and Bagging

Introduction: The Wisdom of Crowds

Imagine you're trying to guess the number of jelly beans in a jar. Instead of relying on one person's guess, you ask 100 people and average their guesses. Surprisingly, this average is often more accurate than any individual guess, even the experts'.

This is the wisdom of crowds: combining many diverse opinions often beats individual expertise.

Random Forests apply this principle to machine learning: instead of one decision tree, we train many trees on slightly different data and average their predictions. This ensemble is more accurate and robust than any single tree!

Key Insight: Random Forests reduce overfitting and variance by combining predictions from multiple diverse trees trained through bootstrap aggregating (bagging).

Learning Objectives

  • Understand ensemble learning principles
  • Master bootstrap sampling and bagging
  • Build Random Forests from scratch
  • Tune forest hyperparameters
  • Understand feature randomness and its benefits
  • Compare Random Forests with single trees

1. Ensemble Learning: Combining Models

Why Ensembles?

A single decision tree is:

  • ✅ Interpretable
  • ✅ Fast to train
  • ❌ High variance (unstable)
  • ❌ Tends to overfit

Solution: Train multiple trees and combine them!

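To see that variance concretely, here is a minimal sketch (assuming scikit-learn and a synthetic make_moons dataset, neither of which comes from this lesson). It trains twenty unpruned trees, each on a different random half of the data, and shows that the individual trees' test accuracies fluctuate while simply averaging their predicted probabilities is already more stable and accurate:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class dataset (assumption: any noisy tabular dataset behaves similarly)
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_test, y_test = make_moons(n_samples=500, noise=0.3, random_state=1)

rng = np.random.default_rng(0)
accuracies, probabilities = [], []
for _ in range(20):
    # Simulate "a slightly different training set": a random half of the data
    idx = rng.choice(len(X), size=len(X) // 2, replace=False)
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    accuracies.append(tree.score(X_test, y_test))
    probabilities.append(tree.predict_proba(X_test)[:, 1])

# Averaging the trees' predicted probabilities is the simplest possible ensemble
ensemble_pred = (np.mean(probabilities, axis=0) >= 0.5).astype(int)
print(f"single trees:      {np.mean(accuracies):.3f} +/- {np.std(accuracies):.3f}")
print(f"averaged ensemble: {np.mean(ensemble_pred == y_test):.3f}")
```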


2. Bootstrap Aggregating (Bagging)

Bootstrap Sampling

Bootstrap: Sample \(n\) data points with replacement from a dataset of size \(n\).

  • Some samples appear multiple times
  • Some samples don't appear at all: on average about 37% are left out, since the chance that a given point is never drawn is \((1 - 1/n)^n \approx e^{-1} \approx 0.37\)
  • Each bootstrap sample is therefore slightly different

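A small NumPy sketch of bootstrap sampling (the dataset size of 10,000 is an arbitrary choice for this example), confirming that roughly 63% of the points land in a bootstrap sample and about 37% are left out:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Draw n indices with replacement: one bootstrap sample of a size-n dataset
boot_idx = rng.integers(0, n, size=n)

unique_frac = len(np.unique(boot_idx)) / n
print(f"fraction of points included:   {unique_frac:.3f}")        # ~0.632
print(f"fraction left out (OOB):       {1 - unique_frac:.3f}")    # ~0.368
print(f"theory: (1 - 1/n)^n =          {(1 - 1 / n) ** n:.3f}")   # ~0.368
```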

Bagging Algorithm

Bootstrap Aggregating:

  1. Create \(B\) bootstrap samples from the training data
  2. Train one model (tree) on each bootstrap sample
  3. Aggregate predictions:
    • Classification: Majority vote
    • Regression: Average

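The three steps above fit in a few lines. The following is an illustrative from-scratch sketch, assuming scikit-learn's DecisionTreeClassifier as the base learner and a synthetic make_moons dataset; the helper names bagging_fit and bagging_predict are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_trees=50, random_state=0):
    """Steps 1 and 2: train one unpruned tree per bootstrap sample."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # bootstrap sample: n draws with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Step 3: aggregate by majority vote (labels are 0/1 here)."""
    votes = np.stack([tree.predict(X) for tree in trees])  # shape (n_trees, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)

X, y = make_moons(n_samples=1000, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
bagged_trees = bagging_fit(X_train, y_train)
print(f"single tree:     {single_tree.score(X_test, y_test):.3f}")
print(f"bagged 50 trees: {np.mean(bagging_predict(bagged_trees, X_test) == y_test):.3f}")
```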


3. Random Forests: Adding Feature Randomness

The Extra Ingredient

Problem: Bagging helps, but trees can still be too similar (correlated).

Solution: When splitting each node, consider only a random subset of the features!

Random Forest = Bagging + Feature Randomness

At each split:

  • Select \(m\) features at random (typically \(m = \sqrt{d}\) for classification, where \(d\) is the total number of features)
  • Find the best split among these \(m\) features only
  • This decorrelates the trees → greater diversity → a better ensemble

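One way to see the effect of feature randomness is a sketch (assuming scikit-learn and a synthetic make_classification dataset) that compares the same forest with max_features=None, which is plain bagging of trees, against max_features='sqrt':

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic dataset with informative plus redundant (correlated) features
X, y = make_classification(n_samples=2000, n_features=40, n_informative=10,
                           n_redundant=10, random_state=0)

settings = [
    (None, "bagged trees (all 40 features per split)"),
    ("sqrt", "random forest (~6 features per split)"),
]
for max_features, label in settings:
    clf = RandomForestClassifier(n_estimators=200, max_features=max_features,
                                 random_state=0, n_jobs=-1)
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{label}: {scores.mean():.3f} +/- {scores.std():.3f}")
```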


4. Out-of-Bag (OOB) Error

Free Cross-Validation

Remember: on average, about 37% of the training data does not appear in a given bootstrap sample.

OOB samples: The samples not used to train a particular tree.

OOB Error: For each training sample, aggregate the predictions of only the trees that never saw it during training, then measure the error of those aggregated predictions.

Benefit: You get a validation error estimate without setting aside a separate validation set!

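In scikit-learn this is a single flag. A minimal sketch on synthetic data, comparing the OOB estimate with a held-out test score:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# oob_score=True scores each training sample using only the trees that never saw it
forest = RandomForestClassifier(n_estimators=300, oob_score=True,
                                random_state=0, n_jobs=-1)
forest.fit(X_train, y_train)

print(f"OOB accuracy estimate:  {forest.oob_score_:.3f}")
print(f"held-out test accuracy: {forest.score(X_test, y_test):.3f}")  # usually close
```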


5. Hyperparameter Tuning

Key Hyperparameters

| Parameter | What it Controls | Typical Values |
| --- | --- | --- |
| n_estimators | Number of trees | 100-1000 (more is better, with diminishing returns) |
| max_depth | Maximum tree depth | 10-30 (or None for full depth) |
| max_features | Features considered per split | 'sqrt' (classification), 'log2', or an integer |
| min_samples_split | Minimum samples required to split a node | 2-10 |
| min_samples_leaf | Minimum samples required in a leaf | 1-5 |

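A sketch of tuning these parameters with scikit-learn's RandomizedSearchCV on synthetic data; the search space below simply mirrors the "Typical Values" column above and is not prescriptive:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Search space mirroring the table of typical values
param_distributions = {
    "n_estimators": [100, 300, 500],
    "max_depth": [10, 20, 30, None],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0, n_jobs=-1),
    param_distributions, n_iter=20, cv=3, random_state=0,
)
search.fit(X, y)
print("best parameters:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```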


6. Feature Importance

Random Forests automatically calculate feature importance!

Method: For each feature, sum the impurity decrease (Gini or entropy) at every node that splits on that feature, weighted by the fraction of samples reaching the node, then average across all trees.

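A minimal sketch (on a synthetic dataset with five informative and five noise features, an assumption of this example) showing how to read the impurity-based importances from a fitted forest:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Features 0-4 are informative, 5-9 are pure noise (shuffle=False keeps that column order)
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# feature_importances_ holds the impurity-based importances, normalized to sum to 1
for i in np.argsort(forest.feature_importances_)[::-1]:
    print(f"feature {i}: {forest.feature_importances_[i]:.3f}")
```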


7. Advantages and Limitations

✅ Advantages

  1. High Accuracy: Often best off-the-shelf performance
  2. Robust: Handles outliers and noisy data well
  3. Resistant to Overfitting: Adding more trees does not increase overfitting (unlike growing a single tree deeper)
  4. Feature Importance: Automatic ranking
  5. Handles Mixed Data: Numerical and categorical features
  6. Parallel: Trees train independently (fast on multiple cores)
  7. OOB Error: Built-in validation

❌ Limitations

  1. Black Box: Less interpretable than single tree
  2. Memory: Stores many trees (can be large)
  3. Slow Prediction: Must query all trees
  4. Not for Extrapolation: Regression forests can't predict values outside the range of the training targets
  5. Importance Bias: Impurity-based feature importance is biased toward features with many categories or unique values

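The extrapolation limitation is easy to demonstrate in regression. A small sketch, assuming a synthetic linear target y = 2x trained only on x in [0, 10]:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Train on x in [0, 10] with y = 2x plus noise, then predict outside that range
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(500, 1))
y_train = 2 * X_train.ravel() + rng.normal(0, 0.5, size=500)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

for x in [5.0, 10.0, 15.0, 20.0]:
    prediction = forest.predict(np.array([[x]]))[0]
    print(f"x = {x:5.1f}   true 2x = {2 * x:5.1f}   forest predicts {prediction:5.1f}")
# Beyond x = 10 the forest keeps predicting about 20: each leaf outputs an average
# of training targets, so predictions plateau outside the training range.
```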


Key Takeaways

Random Forests: Ensemble of decision trees via bagging + feature randomness

Bagging: Bootstrap sampling + aggregating (majority vote or average)

Feature Randomness: Consider only a random subset of features at each split → decorrelates the trees

OOB Error: Free validation using the samples left out of each bootstrap sample → no separate validation set needed

Hyperparameters: n_estimators (more is better), max_depth, max_features

Feature Importance: Automatic ranking of feature relevance

Strengths: High accuracy, robust, handles mixed data, parallelizable

Limitations: Black box, can't extrapolate, slower prediction than a single tree


Practice Problems

Problem 1: Implement Simple Bagging


Problem 2: Compare Bagging vs Random Forest



Next Steps

Random Forests are powerful, but there's another ensemble method that's often even more accurate: Boosting!

Next lesson:

  • Lesson 8: Gradient Boosting, which builds trees sequentially so that each new tree corrects the errors of the previous ones

Boosting methods like XGBoost and LightGBM dominate ML competitions!




Remember: Random Forests combine simplicity (trees) with power (ensembles). They're often the first model to try on tabular data!