Introduction: Learning from Mistakes
Imagine you're learning to shoot free throws in basketball. After your first attempt, you notice you're shooting too far left. On your second attempt, you adjust by aiming right. You miss again, but this time too high. You adjust again, aiming lower. With each attempt, you correct your previous mistakes.
Gradient Boosting works the same way: each new model focuses on correcting the errors of the previous models!
Unlike Random Forests (which train trees in parallel on random bootstrap samples), Gradient Boosting trains trees sequentially, where each tree learns to fix what the previous ones got wrong.
Key Insight: Boosting builds a strong model by sequentially combining many weak models, each focusing on the hardest examples.
Learning Objectives
- Understand boosting vs bagging
- Master the gradient boosting algorithm
- Implement gradient boosting from scratch
- Learn about AdaBoost (historical context)
- Tune boosting hyperparameters
- Use XGBoost and LightGBM in practice
- Understand when to use boosting vs Random Forests
1. Boosting: The Core Idea
Ensemble Learning: Two Approaches
Bagging (Random Forests):
- Train models in parallel
- Each model sees random data (bootstrap)
- Average predictions
- Reduces variance (overfitting)
Boosting:
- Train models sequentially
- Each model focuses on previous errors
- Weighted sum of predictions
- Reduces bias (underfitting)
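To see both styles side by side, here is a minimal sketch (not this lesson's interactive demo) that fits a Random Forest and a Gradient Boosting model on the same synthetic regression data with scikit-learn:

```python
# Sketch: bagging vs boosting on a synthetic regression task (assumes scikit-learn is installed).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: many deep trees trained independently, predictions averaged.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Boosting: many shallow trees trained sequentially, each fitting the previous errors.
gb = GradientBoostingRegressor(n_estimators=200, max_depth=3, learning_rate=0.1,
                               random_state=0).fit(X_train, y_train)

print("Random Forest MSE:    ", mean_squared_error(y_test, rf.predict(X_test)))
print("Gradient Boosting MSE:", mean_squared_error(y_test, gb.predict(X_test)))
```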
2. AdaBoost: The Pioneer (1995)
Before gradient boosting, there was AdaBoost (Adaptive Boosting).
AdaBoost Algorithm
- Start with equal weights for all training examples
- Train a weak learner (shallow tree)
- Increase weights for misclassified examples
- Train next learner on reweighted data (focuses on hard examples)
- Repeat
- Final prediction: weighted vote of all learners
Key Idea: Each new model pays more attention to examples that previous models got wrong.
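A minimal sketch of AdaBoost in practice, using scikit-learn's AdaBoostClassifier (whose default weak learner is a depth-1 decision stump) on synthetic data:

```python
# Sketch: AdaBoost with decision stumps on a synthetic dataset (assumes scikit-learn is installed).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# By default, AdaBoostClassifier boosts depth-1 decision trees ("stumps");
# each round up-weights the examples the previous stumps misclassified.
ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=0)

print("AdaBoost CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())
```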
3. Gradient Boosting: The Modern Approach
From AdaBoost to Gradient Boosting
AdaBoost reweights samples. Gradient Boosting is more general:
Key Innovation: Frame boosting as gradient descent in function space!
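To make that concrete: at each step, the "target" for the next tree is the negative gradient of the loss with respect to the current model's prediction. For squared-error loss, that negative gradient is exactly the ordinary residual, which is why the regression algorithm below simply fits each tree to the current errors:
$$r_i = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}, \qquad L(y, F) = \tfrac{1}{2}(y - F)^2 \;\Rightarrow\; r_i = y_i - F_{m-1}(x_i)$$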
Gradient Boosting Algorithm (Regression)
Goal: learn a function $F(x)$ that predicts $y$.
Start with an initial prediction: $$F_0(x) = \text{mean}(y)$$
For each iteration $m = 1, 2, \ldots, M$:
1. Compute the residuals (the errors of the current model): $$r_i = y_i - F_{m-1}(x_i)$$
2. Train a tree $h_m(x)$ to predict the residuals $r_i$
3. Update the model: $$F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x)$$ where $\nu$ is the learning rate (typically 0.01-0.3)
Final model: $$F(x) = F_0(x) + \nu \sum_{m=1}^{M} h_m(x)$$
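Here is a from-scratch sketch of the algorithm above, using scikit-learn's DecisionTreeRegressor as the weak learner on synthetic data (the helper `predict` is just for illustration):

```python
# Sketch: the regression algorithm above, implemented directly with scikit-learn trees.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

nu, M = 0.1, 100                      # learning rate and number of trees
F = np.full(len(y), y.mean())         # F_0(x) = mean(y)
trees = []

for m in range(M):
    residuals = y - F                              # r_i = y_i - F_{m-1}(x_i)
    tree = DecisionTreeRegressor(max_depth=3, random_state=m).fit(X, residuals)
    F = F + nu * tree.predict(X)                   # F_m = F_{m-1} + nu * h_m
    trees.append(tree)

def predict(X_new, base=y.mean()):
    """Final model: F(x) = F_0(x) + nu * sum_m h_m(x)."""
    return base + nu * sum(t.predict(X_new) for t in trees)

print("Training MSE:", mean_squared_error(y, predict(X)))
```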
4. Gradient Boosting for Classification
For classification, we use log-loss (cross-entropy) instead of MSE.
Algorithm (binary classification):
1. Initialize with the log-odds: $F_0(x) = \log\frac{p}{1-p}$, where $p = \text{mean}(y)$
2. For each iteration $m = 1, 2, \ldots, M$:
   - Compute the pseudo-residuals (the negative gradient of the log-loss): $$r_i = y_i - \sigma(F_{m-1}(x_i))$$ where $\sigma$ is the sigmoid function
   - Train a tree $h_m(x)$ to predict $r_i$
   - Update: $F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x)$
3. Final prediction: $P(y=1 \mid x) = \sigma(F(x))$
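A from-scratch sketch of the classification variant, again with scikit-learn trees on synthetic data. Note that full implementations (Friedman's algorithm, scikit-learn) also re-optimize each leaf value with a Newton step; this sketch skips that and fits trees to the pseudo-residuals directly:

```python
# Sketch: binary-classification boosting with log-loss pseudo-residuals (scikit-learn trees).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

nu, M = 0.1, 100
p = y.mean()
F = np.full(len(y), np.log(p / (1 - p)))           # F_0(x) = log-odds of the positive class
trees = []

for m in range(M):
    residuals = y - sigmoid(F)                     # pseudo-residuals: y_i - sigma(F_{m-1}(x_i))
    tree = DecisionTreeRegressor(max_depth=3, random_state=m).fit(X, residuals)
    F = F + nu * tree.predict(X)                   # F_m = F_{m-1} + nu * h_m
    trees.append(tree)

proba = sigmoid(F)                                 # P(y=1|x) = sigma(F(x))
print("Training accuracy:", accuracy_score(y, (proba >= 0.5).astype(int)))
```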
5. Hyperparameters and Tuning
Key Hyperparameters
Parameter | What it Controls | Typical Values | Effect |
---|---|---|---|
n_estimators | Number of trees | 100-1000 | More trees = better fit (diminishing returns) |
learning_rate | Step size ($\nu$) | 0.01-0.3 | Lower = more trees needed, but better generalization |
max_depth | Tree depth | 3-8 | Deeper = more complex trees |
subsample | Fraction of samples per tree | 0.5-1.0 | <1.0 = stochastic gradient boosting |
min_samples_split | Min samples to split | 2-10 | Higher = simpler trees |
max_features | Features per split | 'sqrt', 'log2' | Adds randomness like Random Forest |
Key Tradeoff: `n_estimators` × `learning_rate`
- High learning rate + few trees = fast but less accurate
- Low learning rate + many trees = slow but more accurate
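One way to see the `n_estimators` × `learning_rate` tradeoff is scikit-learn's staged_predict, which yields predictions after every boosting stage. The sketch below (synthetic data, illustrative values) tracks how many trees each learning rate needs:

```python
# Sketch: the n_estimators x learning_rate tradeoff, tracked with staged_predict
# (assumes scikit-learn; staged_predict yields predictions after each boosting stage).
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for lr in (0.3, 0.1, 0.01):
    gb = GradientBoostingRegressor(n_estimators=500, learning_rate=lr,
                                   max_depth=3, random_state=0).fit(X_train, y_train)
    # Test error after each stage; the best stage shows how many trees this rate needs.
    errors = [mean_squared_error(y_test, pred) for pred in gb.staged_predict(X_test)]
    best = min(range(len(errors)), key=errors.__getitem__)
    print(f"learning_rate={lr:<5} best n_estimators={best + 1:4d} test MSE={errors[best]:.1f}")
```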
6. XGBoost and LightGBM: Modern Implementations
Scikit-learn's GradientBoostingClassifier and GradientBoostingRegressor are good for learning, but XGBoost and LightGBM are production-grade:
XGBoost (Extreme Gradient Boosting)
Improvements:
- Regularization: L1/L2 penalties on leaf weights
- Parallel processing: Fast tree construction
- Handling missing values: Built-in
- Tree pruning: Loss-guided (not depth-limited)
- Built-in cross-validation
LightGBM (Light Gradient Boosting Machine)
Innovations:
- Leaf-wise growth: Instead of level-wise (can be deeper, faster)
- Histogram-based: Buckets features → faster
- GOSS: Gradient-based One-Side Sampling (samples intelligently)
- EFB: Exclusive Feature Bundling (reduces features)
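Both libraries expose scikit-learn-style estimators, so swapping them in is straightforward. A minimal usage sketch, assuming the xgboost and lightgbm packages are installed (hyperparameter values are illustrative):

```python
# Sketch: the scikit-learn-style APIs of XGBoost and LightGBM
# (assumes the xgboost and lightgbm packages are installed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

xgb = XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=4,
                    subsample=0.8, reg_lambda=1.0, random_state=0)
lgbm = LGBMClassifier(n_estimators=300, learning_rate=0.1, num_leaves=31,
                      random_state=0)

for name, model in [("XGBoost", xgb), ("LightGBM", lgbm)]:
    model.fit(X_train, y_train)
    print(f"{name} accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```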
7. Gradient Boosting vs Random Forest
When to Use Each?
Aspect | Random Forest | Gradient Boosting |
---|---|---|
Training Speed | ✅ Fast (parallel) | ❌ Slower (sequential) |
Prediction Speed | ❌ Slower | ✅ Fast |
Accuracy | Good | ✅ Often better |
Overfitting Risk | ✅ Low | ❌ Higher (needs tuning) |
Hyperparameter Sensitivity | ✅ Robust | ❌ Sensitive |
Interpretability | Moderate | ❌ Lower |
Default Performance | ✅ Works well | Needs tuning |
Rule of Thumb:
- Start with Random Forest: Quick baseline, robust
- Try Gradient Boosting: If you need every bit of accuracy
- Use XGBoost/LightGBM: For production systems, competitions
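A quick head-to-head sketch on synthetic data (accuracy and timings will vary with the dataset and settings, so treat this only as a template):

```python
# Sketch: cross-validated comparison of Random Forest vs Gradient Boosting
# on a synthetic dataset (assumes scikit-learn; real rankings depend on the data).
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=30, n_informative=10, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=300, learning_rate=0.1,
                                                    max_depth=3, random_state=0),
}

for name, model in models.items():
    start = time.perf_counter()
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:18s} accuracy={score:.3f}  time={time.perf_counter() - start:.1f}s")
```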
Key Takeaways
✓ Boosting: Sequential training where each model corrects previous errors
✓ AdaBoost: Reweights samples; focuses on hard examples
✓ Gradient Boosting: Frames boosting as gradient descent; trains on residuals
✓ Algorithm: Start with mean, iteratively add trees that predict residuals
✓ Hyperparameters: Balance `n_estimators` × `learning_rate`; tune `max_depth`
✓ XGBoost/LightGBM: Production-grade implementations with extra features
✓ vs Random Forest: GB often more accurate but needs tuning; RF more robust
✓ Use Cases: GB dominates tabular data competitions (Kaggle), widely used in industry
Practice Problems
Problem 1: Implement Gradient Boosting from Scratch
Problem 2: Tune XGBoost
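One possible starting point for this exercise: a small grid search over XGBoost hyperparameters with scikit-learn's GridSearchCV (XGBoost's estimators follow the scikit-learn API). The grid values below are illustrative:

```python
# Sketch: one way to start Problem 2 -- a small grid search over XGBoost hyperparameters
# (assumes xgboost and scikit-learn are installed; the grid values are illustrative).
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)

param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1, 0.3],
    "max_depth": [3, 5],
    "subsample": [0.8, 1.0],
}

search = GridSearchCV(XGBRegressor(random_state=0), param_grid,
                      scoring="neg_mean_squared_error", cv=3)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV MSE:    ", -search.best_score_)
```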
Next Steps
You now understand the two major ensemble methods: Random Forests and Gradient Boosting!
Next, we'll explore a completely different class of algorithms:
- Lesson 9: Support Vector Machines (Linear Case) – maximum margin classifiers
- Lesson 10: Kernel Methods – the "kernel trick" for non-linear decision boundaries
SVMs have beautiful geometric intuition and powerful theoretical guarantees!
Further Reading
- Original Paper: Greedy Function Approximation: A Gradient Boosting Machine by Jerome Friedman (2001)
- XGBoost Paper: XGBoost: A Scalable Tree Boosting System
- LightGBM Paper: LightGBM: A Highly Efficient Gradient Boosting Decision Tree
- Tutorial: Gradient Boosting Explained
- Book: The Elements of Statistical Learning (Chapter 10)
Remember: Gradient Boosting is the "Swiss Army knife" of supervised learning for tabular data. Master it, and you'll have a tool that works for almost any prediction problem!