Gradient Boosting: From AdaBoost to XGBoost

Introduction: Learning from Mistakes

Imagine you're learning to shoot free throws in basketball. After your first attempt, you notice you're shooting too far left. On your second attempt, you adjust by aiming right. You miss again, but this time too high. You adjust again, aiming lower. With each attempt, you correct your previous mistakes.

Gradient Boosting works the same way: each new model focuses on correcting the errors of the previous models!

Unlike Random Forests, which train their trees independently on random bootstrap samples, Gradient Boosting trains trees sequentially: each new tree learns to fix what the previous ones got wrong.

Key Insight: Boosting builds a strong model by sequentially combining many weak models, each focusing on the hardest examples.

Learning Objectives

  • Understand boosting vs bagging
  • Master the gradient boosting algorithm
  • Implement gradient boosting from scratch
  • Learn about AdaBoost (historical context)
  • Tune boosting hyperparameters
  • Use XGBoost and LightGBM in practice
  • Understand when to use boosting vs Random Forests

1. Boosting: The Core Idea

Ensemble Learning: Two Approaches


Bagging (Random Forests):

  • Train models in parallel
  • Each model sees random data (bootstrap)
  • Average predictions
  • Reduces variance (overfitting)

Boosting:

  • Train models sequentially
  • Each model focuses on previous errors
  • Weighted sum of predictions
  • Reduces bias (underfitting)

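To make the contrast concrete, here is a minimal sketch (not the lesson's original code cell) that trains one ensemble of each type on a synthetic regression task with scikit-learn; the dataset and settings are illustrative.

```python
# Bagging vs boosting on the same synthetic regression problem.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)

# Bagging: many deep trees trained independently on bootstrap samples.
rf = RandomForestRegressor(n_estimators=200, random_state=0)

# Boosting: many shallow trees trained sequentially on previous errors.
gb = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=0)

for name, model in [("Random Forest (bagging)", rf), ("Gradient Boosting", gb)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: R^2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```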


2. AdaBoost: The Pioneer (1995)

Before gradient boosting, there was AdaBoost (Adaptive Boosting).

AdaBoost Algorithm

  1. Start with equal weights for all training examples
  2. Train a weak learner (shallow tree)
  3. Increase weights for misclassified examples
  4. Train next learner on reweighted data (focuses on hard examples)
  5. Repeat
  6. Final prediction: weighted vote of all learners

Key Idea: Each new model pays more attention to examples that previous models got wrong.

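As a rough sketch of this recipe in practice, scikit-learn's AdaBoostClassifier implements the reweighting loop described above; the synthetic dataset and hyperparameter values below are illustrative assumptions.

```python
# AdaBoost with scikit-learn: the default weak learner is a depth-1 stump.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(
    n_estimators=200,     # number of sequential weak learners
    learning_rate=0.5,    # shrinks each learner's contribution
    random_state=0,
)
ada.fit(X_train, y_train)
print("AdaBoost test accuracy:", round(ada.score(X_test, y_test), 3))
```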


3. Gradient Boosting: The Modern Approach

From AdaBoost to Gradient Boosting

AdaBoost reweights samples. Gradient Boosting is more general:

Key Innovation: Frame boosting as gradient descent in function space!

Gradient Boosting Algorithm (Regression)

Goal: Learn a function \(F(x)\) that predicts \(y\).

Start with an initial prediction: \[F_0(x) = \text{mean}(y)\]

For each iteration \(m = 1, 2, \ldots, M\):

  1. Compute residuals (the errors of the current model): \[r_i = y_i - F_{m-1}(x_i)\]

  2. Train a tree \(h_m(x)\) to predict the residuals \(r_i\)

  3. Update the model: \[F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x)\] where \(\nu\) is the learning rate (typically 0.01-0.3)

Final model: \[F(x) = F_0(x) + \nu \sum_{m=1}^{M} h_m(x)\]

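The following is a minimal from-scratch sketch of the algorithm above (illustrative, not production code): start from the mean, then repeatedly fit a shallow tree to the residuals and add it with a learning-rate weight.

```python
# Gradient boosting for regression, from scratch, using sklearn trees.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

n_trees, learning_rate = 100, 0.1
F = np.full(len(y), y.mean())             # F_0(x) = mean(y)
trees = []

for m in range(n_trees):
    residuals = y - F                      # r_i = y_i - F_{m-1}(x_i)
    tree = DecisionTreeRegressor(max_depth=3, random_state=m)
    tree.fit(X, residuals)                 # h_m approximates the residuals
    F += learning_rate * tree.predict(X)   # F_m = F_{m-1} + nu * h_m
    trees.append(tree)

def predict(X_new):
    pred = np.full(len(X_new), y.mean())
    for tree in trees:
        pred += learning_rate * tree.predict(X_new)
    return pred

print("Training MSE:", round(float(np.mean((y - predict(X)) ** 2)), 3))
```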


4. Gradient Boosting for Classification

For classification, we use log-loss (cross-entropy) instead of MSE.

Algorithm (binary classification):

  1. Initialize with the log-odds: \(F_0(x) = \log\frac{p}{1-p}\), where \(p = \text{mean}(y)\)

  2. For each iteration:

    • Compute pseudo-residuals (the negative gradient of the log-loss): \[r_i = y_i - \sigma(F_{m-1}(x_i))\] where \(\sigma\) is the sigmoid function

    • Train a tree to predict \(r_i\)

    • Update: \(F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x)\)

  3. Final prediction: \(P(y=1 \mid x) = \sigma(F(x))\)

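Here is a compact from-scratch sketch of this recipe for binary 0/1 labels. It is a simplification: the trees fit the raw pseudo-residuals directly, without the per-leaf corrections that full implementations apply.

```python
# Gradient boosting for binary classification: boost in log-odds space,
# then pass the final score through a sigmoid.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_trees, learning_rate = 100, 0.1
p = y.mean()
F = np.full(len(y), np.log(p / (1 - p)))   # F_0 = log-odds of the positive class
trees = []

for m in range(n_trees):
    residuals = y - sigmoid(F)             # pseudo-residuals: y_i - sigma(F(x_i))
    tree = DecisionTreeRegressor(max_depth=3, random_state=m)
    tree.fit(X, residuals)
    F += learning_rate * tree.predict(X)
    trees.append(tree)

proba = sigmoid(F)                          # P(y = 1 | x) on the training data
print("Training accuracy:", round(float(np.mean((proba > 0.5) == y)), 3))
```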


5. Hyperparameters and Tuning

Key Hyperparameters

| Parameter | What it Controls | Typical Values | Effect |
|---|---|---|---|
| n_estimators | Number of trees | 100-1000 | More trees = better fit (diminishing returns) |
| learning_rate | Step size (\(\nu\)) | 0.01-0.3 | Lower = more trees needed, but better generalization |
| max_depth | Tree depth | 3-8 | Deeper = more complex trees |
| subsample | Fraction of samples per tree | 0.5-1.0 | <1.0 = stochastic gradient boosting |
| min_samples_split | Min samples to split | 2-10 | Higher = simpler trees |
| max_features | Features per split | 'sqrt', 'log2' | Adds randomness like Random Forest |

Key Tradeoff: n_estimators × learning_rate

  • High learning rate + few trees = fast but less accurate
  • Low learning rate + many trees = slow but more accurate

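One way to explore this tradeoff is a small grid search over the key parameters; the grid below is an illustrative assumption, not a recommended recipe.

```python
# Grid-searching the n_estimators x learning_rate tradeoff (plus max_depth).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [3, 5],
    "subsample": [0.8, 1.0],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
    n_jobs=-1,
)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```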


6. XGBoost and LightGBM: Modern Implementations

Scikit-learn's GradientBoostingClassifier/GradientBoostingRegressor are good for learning, but XGBoost and LightGBM are production-grade:

XGBoost (Extreme Gradient Boosting)

Improvements:

  • Regularization: L1/L2 penalties on leaf weights
  • Parallel processing: Fast tree construction
  • Handling missing values: Built-in
  • Tree pruning: Loss-guided (not depth-limited)
  • Built-in cross-validation

LightGBM (Light Gradient Boosting Machine)

Innovations:

  • Leaf-wise growth: Instead of level-wise (can be deeper, faster)
  • Histogram-based: Buckets features → faster
  • GOSS: Gradient-based One-Side Sampling (samples intelligently)
  • EFB: Exclusive Feature Bundling (reduces features)

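A brief usage sketch, assuming the third-party xgboost and lightgbm packages are installed (pip install xgboost lightgbm); the dataset and parameter values are illustrative, not tuned settings.

```python
# XGBoost and LightGBM share the familiar scikit-learn fit/predict API.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

xgb = XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=4,
                    reg_lambda=1.0, random_state=0)
lgbm = LGBMClassifier(n_estimators=300, learning_rate=0.1, num_leaves=31,
                      random_state=0)

for name, model in [("XGBoost", xgb), ("LightGBM", lgbm)]:
    model.fit(X_train, y_train)
    print(f"{name} test accuracy: {model.score(X_test, y_test):.3f}")
```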


7. Gradient Boosting vs Random Forest

When to Use Each?

| Aspect | Random Forest | Gradient Boosting |
|---|---|---|
| Training Speed | ✅ Fast (parallel) | ❌ Slower (sequential) |
| Prediction Speed | ❌ Slower | ✅ Fast |
| Accuracy | Good | ✅ Often better |
| Overfitting Risk | ✅ Low | ❌ Higher (needs tuning) |
| Hyperparameter Sensitivity | ✅ Robust | ❌ Sensitive |
| Interpretability | Moderate | ❌ Lower |
| Default Performance | ✅ Works well | Needs tuning |

Rule of Thumb:

  • Start with Random Forest: Quick baseline, robust
  • Try Gradient Boosting: If you need every bit of accuracy
  • Use XGBoost/LightGBM: For production systems, competitions

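A quick side-by-side sketch on one built-in dataset; results will vary by problem, which is exactly why the baseline-first rule of thumb above is useful.

```python
# Random Forest vs Gradient Boosting on the breast cancer dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=300,
                                                    learning_rate=0.1,
                                                    random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")
```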


Key Takeaways

Boosting: Sequential training where each model corrects previous errors

AdaBoost: Reweights samples; focuses on hard examples

Gradient Boosting: Frames boosting as gradient descent; trains on residuals

Algorithm: Start with mean, iteratively add trees that predict residuals

Hyperparameters: Balance n_estimators × learning_rate; tune max_depth

XGBoost/LightGBM: Production-grade implementations with extra features

vs Random Forest: GB often more accurate but needs tuning; RF more robust

Use Cases: GB dominates tabular data competitions (Kaggle), widely used in industry


Practice Problems

Problem 1: Implement Gradient Boosting from Scratch

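A possible starting skeleton (an assumption, not the lesson's original starter cell): fill in fit and predict following Section 3, then try to match scikit-learn's reference implementation on the held-out split.

```python
# Starter skeleton: implement your own booster, then compare to sklearn.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=800, n_features=15, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

class MyGradientBoostingRegressor:
    """TODO: mean init, fit shallow trees to residuals, learning-rate updates."""

    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth

    def fit(self, X, y):
        raise NotImplementedError  # hint: sklearn.tree.DecisionTreeRegressor

    def predict(self, X):
        raise NotImplementedError

# Reference score to match (or beat):
ref = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                max_depth=3, random_state=0)
ref.fit(X_train, y_train)
print("Reference test R^2:", round(ref.score(X_test, y_test), 3))
```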

Problem 2: Tune XGBoost

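A possible starting point, assuming the xgboost package is installed; the classification task and search space below are illustrative assumptions to adapt as you experiment.

```python
# Randomized search over common XGBoost hyperparameters.
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_distributions = {
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "max_depth": [3, 4, 6, 8],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
}

search = RandomizedSearchCV(
    XGBClassifier(random_state=0),
    param_distributions,
    n_iter=20,
    cv=3,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Test accuracy:", round(search.score(X_test, y_test), 3))
```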


Next Steps

You now understand the two major ensemble methods: Random Forests and Gradient Boosting!

Next, we'll explore a completely different class of algorithms:

  • Lesson 9: Support Vector Machines (Linear Case) – maximum margin classifiers
  • Lesson 10: Kernel Methods – the "kernel trick" for non-linear decision boundaries

SVMs have beautiful geometric intuition and powerful theoretical guarantees!


Remember: Gradient Boosting is the "Swiss Army knife" of supervised learning for tabular data. Master it, and you'll have a tool that works for almost any prediction problem!