Random Forests and Bagging

Introduction: The Wisdom of Crowds

Imagine you're trying to guess the number of jelly beans in a jar. Instead of relying on one person's guess, you ask 100 people and average their guesses. Surprisingly, this average is often more accurate than most individual guesses, including those of experts!

This is the wisdom of crowds: combining many diverse opinions often beats individual expertise.

Random Forests apply this principle to machine learning: instead of relying on one decision tree, we train many trees on slightly different versions of the data and combine their predictions (averaging for regression, majority vote for classification). The resulting ensemble is typically more accurate and more robust than a single tree!

Key Insight: Random Forests reduce overfitting and variance by combining predictions from multiple diverse trees trained through bootstrap aggregating (bagging).

Learning Objectives

Understand ensemble learning principles
Master bootstrap sampling and bagging
Build Random Forests from scratch
Tune forest hyperparameters
Understand feature randomness and its benefits
Compare Random Forests with single trees

1. Ensemble Learning: Combining Models

Why Ensembles?

A single decision tree is:

✅ Interpretable
✅ Fast to train
❌ High variance (unstable)
❌ Tends to overfit

Solution: Train multiple trees and combine them!
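
To see why combining helps, here is a minimal simulation. It assumes an idealized setting that is not in the lesson itself: each "model" predicts the true value plus independent Gaussian noise. Real trees trained on overlapping data are correlated, so the gain in practice is smaller. The constants (`TRUE_VALUE`, `NOISE_STD`, `N_MODELS`, `N_TRIALS`) are illustrative choices.

```python
import random
import statistics

rng = random.Random(0)
TRUE_VALUE = 10.0   # quantity every model tries to predict (illustrative)
NOISE_STD = 2.0     # spread of a single model's error (illustrative)
N_MODELS = 25       # models averaged in one ensemble
N_TRIALS = 2000     # repetitions used to estimate each variance


def noisy_prediction():
    """One high-variance model: the truth plus independent Gaussian noise."""
    return rng.gauss(TRUE_VALUE, NOISE_STD)


# Predictions of a single model, repeated many times.
single = [noisy_prediction() for _ in range(N_TRIALS)]

# Predictions of an ensemble that averages N_MODELS independent models.
ensemble = [
    statistics.mean([noisy_prediction() for _ in range(N_MODELS)])
    for _ in range(N_TRIALS)
]

var_single = statistics.variance(single)
var_ensemble = statistics.variance(ensemble)
print(f"variance of one model:          {var_single:.3f}")
print(f"variance of {N_MODELS}-model average: {var_ensemble:.3f}")
```

With independent errors, the variance of the average shrinks roughly in proportion to the number of models, while the expected prediction stays the same.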

2. Bootstrap Aggregating (Bagging)

Bootstrap Sampling

Bootstrap: Sample \(n\) data points with replacement from a dataset of size \(n\).

Some samples appear multiple times
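Some samples don't appear at all (about 37% are left out on average, because the chance that a given point is never drawn is \((1 - 1/n)^n \approx 1/e \approx 0.368\) for large \(n\))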
Each bootstrap sample is slightly different
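
The properties listed above can be checked with a short sketch using only the standard library. The dataset is represented by its row indices, and the size `n = 10_000` and the seed are arbitrary choices for illustration.

```python
import random

rng = random.Random(42)
n = 10_000
indices = range(n)  # stand-in for the rows of a dataset

# Draw n indices with replacement: one bootstrap sample.
bootstrap = rng.choices(indices, k=n)

# Rows that never got drawn are the "left out" (out-of-bag) rows.
unique_in_sample = set(bootstrap)
left_out_fraction = 1 - len(unique_in_sample) / n

# Theoretical chance that a given row is never drawn in n tries.
theoretical = (1 - 1 / n) ** n

print(f"rows drawn more than once exist: {len(unique_in_sample) < n}")
print(f"left out (simulated):   {left_out_fraction:.3f}")
print(f"left out (theoretical): {theoretical:.3f}")
```

Changing the seed gives a different sample each time, which is exactly what makes the trees in an ensemble differ from one another.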

Bagging Algorithm

Bootstrap Aggregating:

Create \(B\) bootstrap samples from the training data
Train one model (tree) on each bootstrap sample
Aggregate predictions:
- Classification: Majority vote
- Regression: Average
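
The three steps above can be sketched from scratch. To keep the example short, it uses decision stumps (depth-1 trees) as the base model and a synthetic two-feature dataset; the helper names (`fit_stump`, `fit_bagging`, `predict_bagging`) are hypothetical and chosen for this illustration only.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Fit a depth-1 tree: the (feature, threshold) split with the fewest errors."""
    best = None  # (errors, feature, threshold, left_label, right_label)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [label for row, label in zip(X, y) if row[j] <= t]
            right = [label for row, label in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(l != left_label for l in left)
            errors += sum(r != right_label for r in right)
            if best is None or errors < best[0]:
                best = (errors, j, t, left_label, right_label)
    if best is None:  # no valid split: always predict the majority class
        majority = Counter(y).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]


def predict_stump(stump, row):
    j, t, left_label, right_label = stump
    return left_label if row[j] <= t else right_label


def fit_bagging(X, y, n_models, rng):
    """Steps 1 and 2: draw bootstrap samples and train one model on each."""
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models


def predict_bagging(models, row):
    """Step 3 (classification): majority vote over all models."""
    votes = Counter(predict_stump(m, row) for m in models)
    return votes.most_common(1)[0][0]


# Synthetic data: label is 1 when the two features sum to more than 1.
rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(120)]
y = [int(a + b > 1.0) for a, b in X]

models = fit_bagging(X, y, n_models=15, rng=rng)
accuracy = sum(predict_bagging(models, r) == l for r, l in zip(X, y)) / len(X)
print(f"training accuracy of 15 bagged stumps: {accuracy:.2f}")
```

For regression, `predict_bagging` would return the mean of the individual predictions instead of the most common vote.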

3. Random Forests: Adding Feature Randomness

The Extra Ingredient

Problem: Bagging helps, but trees can still be too similar (correlated).

Solution: When splitting each node, consider only a random subset of the features!

Random Forest = Bagging + Feature Randomness
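
Why correlation between trees matters can be made precise with a standard identity. The symbols are introduced here for illustration and are not part of the lesson's notation: assume each tree's prediction \(T_b(x)\) has variance \(\sigma^2\), every pair of trees has correlation \(\rho\), and \(B\) trees are averaged.

```latex
\[
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
\]
```

As \(B\) grows, the second term shrinks toward zero but the first does not, so adding trees alone cannot push the variance below \(\rho\sigma^2\). Reducing \(\rho\) is the job of feature randomness.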

At each split:

Select \(m\) features at random (typically \(m = \sqrt{d}\) for classification, where \(d\) is the total number of features)
Find the best split among these \(m\) features only
This decorrelates trees → better diversity → better ensemble
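
A single split with feature randomness might look like the sketch below. It assumes Gini impurity as the split criterion and rounds \(\sqrt{d}\) down to an integer; the function name `best_split_on_random_subset` and the synthetic data are hypothetical.

```python
import math
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split_on_random_subset(X, y, rng):
    """Find the best split while looking at only m = floor(sqrt(d)) features."""
    d = len(X[0])
    m = max(1, int(math.sqrt(d)))
    candidates = rng.sample(range(d), m)  # features considered at this node
    best = None  # (weighted_impurity, feature, threshold)
    for j in candidates:
        for t in sorted({row[j] for row in X}):
            left = [label for row, label in zip(X, y) if row[j] <= t]
            right = [label for row, label in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t)
    return candidates, best


# Synthetic data: 9 features, only feature 0 determines the label.
rng = random.Random(1)
X = [[rng.random() for _ in range(9)] for _ in range(60)]
y = [int(row[0] > 0.5) for row in X]

candidates, best = best_split_on_random_subset(X, y, rng)
print(f"features considered at this node: {sorted(candidates)}")
print(f"chosen feature: {best[1]}, threshold: {best[2]:.3f}, impurity: {best[0]:.3f}")
```

Because the candidate set is redrawn at every node, the strongest feature is sometimes unavailable, which forces different trees to rely on different features.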

4. Out-of-Bag (OOB) Error

Free Cross-Validation

Remember: each bootstrap sample leaves out about 37% of the training data.

OOB samples: Samples not used to train a particular tree.

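OOB Error: For each sample, aggregate the predictions (majority vote or average) of only those trees that did not see it during training, then compare the result with the true value. The error rate over all samples is the OOB error.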

Benefit: Get validation error estimate without separate validation set!
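
As a sketch of how this looks in practice, assuming scikit-learn is installed: `RandomForestClassifier` accepts `oob_score=True` and then exposes the OOB accuracy as `oob_score_`. The synthetic dataset and the held-out test split are added here only to compare the two estimates.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# oob_score=True scores each training row using only the trees
# whose bootstrap sample did not contain that row.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

oob_accuracy = forest.oob_score_
test_accuracy = forest.score(X_test, y_test)
print(f"OOB accuracy:  {oob_accuracy:.3f}")
print(f"test accuracy: {test_accuracy:.3f}")
```

The two numbers are usually close, which is why the OOB estimate is often used when data is too scarce to set aside a validation split.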

5. Hyperparameter Tuning

Key Hyperparameters

Parameter	What it Controls	Typical Values
`n_estimators`	Number of trees	100-1000 (more trees help, with diminishing returns and longer training)
`max_depth`	Maximum tree depth	10-30 (or None for full depth)
`max_features`	Features considered per split	'sqrt' (classification), 'log2', an integer count, or a float fraction
`min_samples_split`	Min samples to split	2-10
`min_samples_leaf`	Min samples in leaf	1-5
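
One common way to search over these parameters is a cross-validated grid search. The sketch below assumes scikit-learn is installed and uses `GridSearchCV` on a synthetic dataset; the grid is kept deliberately small so it runs quickly and is not a recommended set of values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, for illustration only.
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=4, random_state=0
)

# Small illustrative grid: 2 x 2 x 2 = 8 combinations.
param_grid = {
    "n_estimators": [25, 100],
    "max_depth": [5, None],
    "max_features": ["sqrt", "log2"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,  # 3-fold cross-validation for each combination
)
search.fit(X, y)

print(f"best parameters: {search.best_params_}")
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```

For larger grids, a randomized search over the same ranges is often cheaper than trying every combination.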

6. Feature Importance

Random Forests automatically calculate feature importance!

Method: For each feature, sum the decrease in impurity (Gini/entropy) over every node that splits on that feature, weighting each node by the probability of reaching it, then average across all trees.
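
In scikit-learn (assumed to be installed here), a fitted forest exposes these impurity-based scores as `feature_importances_`, normalized to sum to 1. The sketch uses synthetic data built with `shuffle=False` so that the first `N_INFORMATIVE` columns are the informative ones and the rest are noise; that setup is an illustration choice, not part of the lesson.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

N_INFORMATIVE = 3

# With shuffle=False the informative features come first, then pure noise.
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=N_INFORMATIVE,
    n_redundant=0,
    n_repeated=0,
    shuffle=False,
    random_state=0,
)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = forest.feature_importances_

# Print features from most to least important.
for rank, j in enumerate(np.argsort(importances)[::-1], start=1):
    kind = "informative" if j < N_INFORMATIVE else "noise"
    print(f"{rank:2d}. feature {j} ({kind}): {importances[j]:.3f}")
```

Noise features still receive small non-zero scores, so importances are best read as a ranking rather than as a test of whether a feature matters.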

7. Advantages and Limitations

✅ Advantages

High Accuracy: Often best off-the-shelf performance
Robust: Handles outliers and noisy data well
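Resists Overfitting: Adding more trees does not increase overfitting, although very deep trees can still fit noise
Feature Importance: Automatic ranking
Handles Mixed Data: Numerical and categorical features (some implementations, including scikit-learn, require categorical features to be encoded as numbers first)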
Parallel: Trees train independently (fast on multiple cores)
OOB Error: Built-in validation
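
The accuracy claim can be checked on a given dataset by cross-validating a single tree and a forest side by side. This sketch assumes scikit-learn is installed and uses synthetic data, so the size of the gap shown here says nothing about other datasets.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validated accuracy for each model.
tree_scores = cross_val_score(tree, X, y, cv=5)
forest_scores = cross_val_score(forest, X, y, cv=5)

print(f"single tree:   {tree_scores.mean():.3f} +/- {tree_scores.std():.3f}")
print(f"random forest: {forest_scores.mean():.3f} +/- {forest_scores.std():.3f}")
```

The spread across folds is worth reading alongside the mean, since lower variance is one of the main reasons to prefer the forest.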

❌ Limitations

Black Box: Less interpretable than single tree
Memory: Stores many trees (can be large)
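Slower Prediction: Every tree must be queried
Not for Extrapolation: In regression, predictions cannot go beyond the range of target values seen in training
Importance Bias: Impurity-based feature importance is biased toward features with many categories or unique values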
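
The extrapolation limit is easy to demonstrate. The sketch below assumes scikit-learn is installed and uses a made-up linear relationship, \(y = 2x\), with training inputs restricted to the range 0 to 10.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Training data covers only x in [0, 10]; the true relationship is y = 2x.
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# One point inside the range, one on its edge, one far outside it.
X_new = np.array([[5.0], [10.0], [20.0]])
predictions = forest.predict(X_new)

for x, pred in zip(X_new.ravel(), predictions):
    print(f"x = {x:5.1f}   true y = {2 * x:5.1f}   predicted y = {pred:6.2f}")
```

Each leaf predicts an average of training targets, so the prediction at x = 20 stays at the level learned near x = 10 instead of following the trend.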

Key Takeaways

✓ Random Forests: Ensemble of decision trees via bagging + feature randomness

✓ Bagging: Bootstrap sampling + aggregating (majority vote or average)

✓ Feature Randomness: Consider only subset of features at each split → decorrelates trees

✓ OOB Error: Free validation using samples not in bootstrap → no need for separate validation set

✓ Hyperparameters: n_estimators (more helps, with diminishing returns), max_depth, max_features

✓ Feature Importance: Automatic ranking of feature relevance

✓ Strengths: High accuracy, robust, handles mixed data, parallelizable

✓ Limitations: Black box, can't extrapolate, slower prediction than single tree

Practice Problems

Problem 1: Implement Simple Bagging

Problem 2: Compare Bagging vs Random Forest

Next Steps

Random Forests are powerful, but there's another ensemble method that often achieves even higher accuracy: Boosting!

Next lesson:

Lesson 8: Gradient Boosting – sequentially building trees that fix previous errors

Boosting methods like XGBoost and LightGBM are popular choices in ML competitions, especially on tabular data!
