CLASSICAL MACHINE LEARNING: SUPERVISED LEARNING FOUNDATIONS / L07 RANDOM FORESTS AND BAGGING
LESSON 07 · INTERMEDIATE · 60 MIN

Random Forests and Bagging

Ensemble power through bagging: bootstrap aggregating, random forests, feature randomness, and out-of-bag evaluation.

Introduction: The Wisdom of Crowds

Imagine you're trying to guess the number of jelly beans in a jar. Instead of relying on one person's guess, you ask 100 people and average their guesses. Surprisingly, this average is often more accurate than most individual guesses, including those of experts!

This is the wisdom of crowds: combining many diverse opinions often beats individual expertise.

Random Forests apply this principle to machine learning: instead of one decision tree, we train many trees on slightly different data and combine their predictions (averaging for regression, majority vote for classification). The ensemble is usually more accurate and more robust than any single tree in it!

Key Insight: Random Forests reduce overfitting and variance by combining predictions from multiple diverse trees trained through bootstrap aggregating (bagging).

Learning Objectives

  • Understand ensemble learning principles
  • Master bootstrap sampling and bagging
  • Build Random Forests from scratch
  • Tune forest hyperparameters
  • Understand feature randomness and its benefits
  • Compare Random Forests with single trees

1. Ensemble Learning: Combining Models

Why Ensembles?

A single decision tree is:

  • ✅ Interpretable
  • ✅ Fast to train
  • ❌ High variance (unstable)
  • ❌ Tends to overfit

Solution: Train multiple trees and combine them!
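
Why combining helps can be illustrated with a small simulation. This is a minimal sketch under an idealized assumption: each "model" is an unbiased guess with independent Gaussian noise (the names `noisy_guess` and `ensemble_guess` are hypothetical, chosen for illustration). Real trees trained on overlapping data are correlated, so the variance reduction in practice is smaller than shown here.

```python
import random
import statistics

def noisy_guess(truth, noise_sd, rng):
    """One 'model': an unbiased but noisy estimate of the truth."""
    return rng.gauss(truth, noise_sd)

def ensemble_guess(truth, noise_sd, n_models, rng):
    """Average of n_models independent noisy guesses."""
    return statistics.fmean(
        noisy_guess(truth, noise_sd, rng) for _ in range(n_models)
    )

rng = random.Random(0)
truth, noise_sd, trials = 500.0, 100.0, 2000

# Repeat the experiment many times to measure how much each estimate varies.
single = [ensemble_guess(truth, noise_sd, 1, rng) for _ in range(trials)]
crowd = [ensemble_guess(truth, noise_sd, 100, rng) for _ in range(trials)]

sd_single = statistics.stdev(single)
sd_crowd = statistics.stdev(crowd)
print(f"spread of a single guess:   {sd_single:.1f}")
print(f"spread of a 100-guess mean: {sd_crowd:.1f}")
```

With independent errors, the spread of the average shrinks roughly by a factor of the square root of the number of models, which is the variance reduction that ensembles aim for.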

2. Bootstrap Aggregating (Bagging)

Bootstrap Sampling

Bootstrap: Sample \(n\) data points with replacement from a dataset of size \(n\).

  • Some samples appear multiple times
  • Some samples don't appear at all (~37% left out, because each point is missed with probability \((1 - 1/n)^n \approx 1/e \approx 0.368\))
  • Each bootstrap sample is slightly different
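
The properties listed above can be checked empirically. The following is a small sketch using only the standard library; `bootstrap_indices` is a hypothetical helper name, and the dataset is represented simply by its row indices.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n row indices from range(n) with replacement."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(42)
n = 10_000
idx = bootstrap_indices(n, rng)

in_bag = set(idx)                   # rows that were drawn at least once
oob_fraction = 1 - len(in_bag) / n  # rows that were never drawn

print(f"unique rows drawn: {len(in_bag)} of {n}")
print(f"left-out fraction: {oob_fraction:.3f}")
print(f"(1 - 1/n)^n:       {(1 - 1 / n) ** n:.3f}")
```

The measured left-out fraction should land close to the theoretical value of about 0.368, and it varies slightly from one bootstrap sample to the next.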

Bagging Algorithm

Bootstrap Aggregating:

  1. Create \(B\) bootstrap samples from training data
  2. Train one model (tree) on each bootstrap sample
  3. Aggregate predictions:
    • Classification: Majority vote
    • Regression: Average
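
The three steps above can be sketched in a few lines. This is an illustrative implementation, not a reference one: it assumes NumPy and scikit-learn are available, uses a synthetic binary dataset from `make_classification`, and the helper names `fit_bagged_trees` and `predict_majority` are hypothetical. The voting shortcut assumes labels are 0/1 and an odd number of trees (so there are no ties).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fit_bagged_trees(X, y, n_trees, rng):
    """Steps 1-2: draw a bootstrap sample and train one tree on it, n_trees times."""
    trees = []
    n = len(X)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # n row indices, with replacement
        tree = DecisionTreeClassifier(random_state=int(rng.integers(1_000_000)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_majority(trees, X):
    """Step 3 (classification): majority vote across trees."""
    votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
    # Labels are 0/1, so the majority class is 1 when the mean vote is >= 0.5.
    return (votes.mean(axis=0) >= 0.5).astype(int)

X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
trees = fit_bagged_trees(X_tr, y_tr, n_trees=25, rng=rng)
bagged_pred = predict_majority(trees, X_te)
bagged_acc = (bagged_pred == y_te).mean()
single_acc = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
print(f"single tree: {single_acc:.3f}  bagged (25 trees): {bagged_acc:.3f}")
```

For regression, the only change would be replacing the vote with `votes.mean(axis=0)` over the trees' numeric predictions.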

3. Random Forests: Adding Feature Randomness

The Extra Ingredient

Problem: Bagging helps, but trees can still be too similar (correlated).

Solution: When splitting each node, consider only a random subset of the features!

Random Forest = Bagging + Feature Randomness

At each split:

  • Select \(m\) features at random (typically \(m = \sqrt{d}\) for classification, where \(d\) is the total number of features)
  • Find the best split among these \(m\) features only
  • This decorrelates trees → better diversity → better ensemble
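
To make the per-split procedure concrete, here is a from-scratch sketch of a single split search on a toy dataset. It is an illustration under simple assumptions (Gini impurity, thresholds taken at observed values, \(m = \lfloor \sqrt{d} \rfloor\)); the function names `gini` and `best_split_random_subset` are hypothetical, and production libraries implement this far more efficiently.

```python
import math
import random

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return 1.0 - sum((k / n) ** 2 for k in counts.values())

def best_split_random_subset(X, y, rng):
    """Pick m = floor(sqrt(d)) random features and search for the best split among them only."""
    d = len(X[0])
    m = max(1, int(math.sqrt(d)))
    candidates = rng.sample(range(d), m)
    best = None  # (weighted impurity, feature index, threshold)
    for j in candidates:
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue  # a split must send rows to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t)
    return candidates, best

# Toy data: feature 0 separates the classes perfectly, the others do not.
X = [[1, 5, 0, 3], [2, 4, 1, 3], [3, 6, 0, 2],
     [7, 5, 1, 2], [8, 4, 0, 3], [9, 6, 1, 2]]
y = [0, 0, 0, 1, 1, 1]

rng = random.Random(1)
candidates, best = best_split_random_subset(X, y, rng)
print("candidate features:", candidates)
print("best split (impurity, feature, threshold):", best)
```

When the strongest feature is not among the candidates, the node is forced to split on a weaker one. That is exactly what makes the trees differ from each other.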

4. Out-of-Bag (OOB) Error

Free Cross-Validation

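Remember: about 37% of the training data is left out of each bootstrap sample.

OOB samples: Samples not used to train a particular tree.

OOB Error: For each sample, aggregate the predictions (majority vote or average) from only the trees that didn't see it during training, then compare that prediction with the true target. The error over all samples is the OOB error.

Benefit: Get a validation error estimate without a separate validation set!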
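
In scikit-learn, the OOB estimate is available by passing `oob_score=True` to the forest. The sketch below assumes scikit-learn is installed and uses a synthetic dataset; it holds out a test set only to show that the OOB accuracy tends to be close to the held-out accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# bootstrap=True is required for OOB scoring (it is also the default).
forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                bootstrap=True, random_state=0)
forest.fit(X_tr, y_tr)

oob_acc = forest.oob_score_          # accuracy estimated from out-of-bag samples
test_acc = forest.score(X_te, y_te)  # accuracy on the held-out test set
print(f"OOB accuracy:  {oob_acc:.3f}")
print(f"test accuracy: {test_acc:.3f}")
```

Note that `oob_score_` is an accuracy for classifiers (R² for regressors), so the OOB error is `1 - oob_score_`.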

5. Hyperparameter Tuning

Key Hyperparameters

Parameter         | What it Controls     | Typical Values
------------------|----------------------|-----------------------------------------------
n_estimators      | Number of trees      | 100-1000 (more is better, diminishing returns)
max_depth         | Maximum tree depth   | 10-30 (or None for full depth)
max_features      | Features per split   | 'sqrt' (classification), 'log2', or integer
min_samples_split | Min samples to split | 2-10
min_samples_leaf  | Min samples in leaf  | 1-5
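
One common way to explore these settings is a cross-validated grid search. The sketch below assumes scikit-learn and a synthetic dataset; the grid is deliberately tiny so that it runs quickly, and the specific values are illustrative rather than recommended.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, n_features=12, n_informative=4,
                           random_state=0)

# A small illustrative grid; real searches usually cover wider ranges.
param_grid = {
    "max_depth": [5, None],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation per combination
    scoring="accuracy",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```

`n_estimators` is left out of the grid on purpose: it is usually set as high as the time budget allows rather than tuned, since extra trees rarely hurt accuracy.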

6. Feature Importance

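Random Forests provide a feature importance score as a by-product of training!

Method (mean decrease in impurity): For each feature, sum the decrease in impurity (Gini/entropy) over every node that splits on that feature, weighting each node by the probability of reaching it. Average the result across all trees.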
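
In scikit-learn these impurity-based scores are exposed as `feature_importances_`, normalized to sum to 1. The sketch below assumes scikit-learn and a synthetic dataset built with `shuffle=False`, so that the three informative features sit in columns 0-2 and the remaining columns are pure noise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the informative features first (columns 0, 1, 2).
X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = forest.feature_importances_  # one score per feature, sums to 1

# Print features from most to least important.
for i in np.argsort(importances)[::-1]:
    print(f"feature {i}: {importances[i]:.3f}")
```

The informative columns should receive most of the total importance, while the noise columns still get small non-zero scores because trees occasionally split on them by chance.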

7. Advantages and Limitations

✅ Advantages

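  1. High Accuracy: Strong off-the-shelf performance on tabular data, often with little tuning
  2. Robust: Handles outliers and noisy data well
  3. Resistant to Overfitting: Adding more trees doesn't increase overfitting (though very deep trees on noisy data can still overfit)
  4. Feature Importance: Automatic ranking
  5. Handles Mixed Data: Numerical and categorical features (some implementations, such as scikit-learn's, require categorical features to be encoded as numbers first)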
  6. Parallel: Trees train independently (fast on multiple cores)
  7. OOB Error: Built-in validation

❌ Limitations

  1. Black Box: Less interpretable than single tree
  2. Memory: Stores many trees (can be large)
  3. Slow Prediction: Must query all trees
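  4. Not for Extrapolation: In regression, predictions are averages of training targets, so they can't go beyond the range of target values seen in training
  5. Bias: Impurity-based feature importance is biased toward features with many categories or many distinct values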
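
The extrapolation limitation is easy to demonstrate. This sketch assumes NumPy and scikit-learn and uses a made-up linear relationship, \(y = 2x\), observed only for \(x\) between 0 and 10.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Training data: y = 2x, with x only in the range [0, 10].
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# x = 5 is inside the training range; x = 20 is far outside it.
pred_inside, pred_outside = forest.predict(np.array([[5.0], [20.0]]))
print(f"x = 5:  predicted {pred_inside:.2f} (true value 10)")
print(f"x = 20: predicted {pred_outside:.2f} (true value 40)")
print(f"largest training target: {y_train.max():.2f}")
```

Because each tree predicts the mean of the training targets in a leaf, the prediction at \(x = 20\) cannot exceed the largest training target (20 here), no matter how many trees are used.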

Key Takeaways

Random Forests: Ensemble of decision trees via bagging + feature randomness

Bagging: Bootstrap sampling + aggregating (majority vote or average)

Feature Randomness: Consider only subset of features at each split → decorrelates trees

OOB Error: Free validation using the samples left out of each bootstrap sample → often no need for a separate validation set

Hyperparameters: n_estimators (more trees help, with diminishing returns), max_depth, max_features

Feature Importance: Automatic ranking of feature relevance

Strengths: High accuracy, robust, handles mixed data, parallelizable

Limitations: Black box, can't extrapolate, slower prediction than single tree


Practice Problems

Problem 1: Implement Simple Bagging

Problem 2: Compare Bagging vs Random Forest

Next Steps

Random Forests are powerful, but there's another ensemble method that often performs even better on tabular data: Boosting!

Next lesson:

  • Lesson 8: Gradient Boosting – sequentially building trees that fix previous errors

Boosting methods like XGBoost and LightGBM are frequent winners in ML competitions on tabular data!



Remember: Random Forests combine simplicity (trees) with power (ensembles). They're often the first model to try on tabular data!