Introduction: Playing 20 Questions
Remember playing "20 Questions"? You ask yes/no questions to narrow down possibilities:
- "Is it bigger than a breadbox?" → Yes
- "Is it electronic?" → No
- "Can you eat it?" → Yes
- "Is it a fruit?" → Yes
- "Is it red?" → Yes
- "Is it an apple?" → YES!
Decision trees work exactly like this! They ask a series of questions about features to classify or predict outcomes. Unlike linear models that draw straight boundaries, trees create rectangular decision boundaries by partitioning feature space.
Key Insight: Decision trees are interpretable, handle non-linear relationships naturally, and don't require feature scaling. They're the foundation for powerful ensemble methods!
Learning Objectives
- Understand tree-based decision making
- Master entropy and information gain
- Build decision trees from scratch
- Prune trees to prevent overfitting
- Compare trees with linear models
- Visualize tree structures and decisions
1. From Linear to Non-Linear
🏡 The definitive interactive explainer: before we code a single split, scroll through A Visual Introduction to Machine Learning — Part 1 (R2D3). It teaches every idea in this lesson — splits, entropy, depth, overfitting — using a San-Francisco-vs-New-York housing classifier. Five minutes of scrolling and the rest feels obvious.
Why Trees?
Linear models create straight decision boundaries. But what if your classes form a pattern that no single straight line can separate, such as clusters sitting in opposite corners of the feature space?
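A tiny made-up illustration: four XOR-style points. The classes sit in opposite corners, so no single linear threshold separates them, yet a hand-built depth-2 tree (two axis-aligned splits) classifies them perfectly. The thresholds below are chosen by hand for this toy data, not learned.

```python
# Four XOR-style points: each class occupies opposite corners,
# so no single straight line separates them.
points = [((0, 0), 0), ((1, 1), 0), ((0, 1), 1), ((1, 0), 1)]

def tree_predict(x):
    if x[0] <= 0.5:                       # root split on the first feature
        return 0 if x[1] <= 0.5 else 1    # left subtree splits on the second
    return 1 if x[1] <= 0.5 else 0        # right subtree splits on the second

print(all(tree_predict(x) == y for x, y in points))  # True
```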
2. How Decision Trees Work
The Tree Structure
A decision tree consists of:
- Root node: Starting point (all data)
- Internal nodes: Decision points (if feature_x > threshold?)
- Leaf nodes: Final predictions (class label or value)
- Branches: Outcomes of decisions (yes/no)
Making Predictions
To classify a new sample:
- Start at root
- Follow branches based on feature values
- Arrive at leaf node
- Return leaf's prediction
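The traversal above can be sketched with a tree stored as nested dicts. The dict layout, feature indices, thresholds, and class names here are illustrative choices, not a library format.

```python
# Internal nodes hold a (feature index, threshold) pair; leaves hold a label.
tree = {
    "feature": 0, "threshold": 2.5,
    "left":  {"leaf": "setosa"},
    "right": {"feature": 1, "threshold": 1.75,
              "left":  {"leaf": "versicolor"},
              "right": {"leaf": "virginica"}},
}

def predict(node, x):
    # Walk from the root, following branches until a leaf is reached.
    while "leaf" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

print(predict(tree, [3.0, 1.2]))  # feature 0 > 2.5, feature 1 <= 1.75: versicolor
```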
3. Entropy and Information Gain
Measuring Impurity
To build a tree, we need to decide which feature to split on and where to split it. We use entropy to measure "messiness" or impurity.
Entropy for a node:

$$H(S) = -\sum_{c=1}^{C} p_c \log_2(p_c)$$

where $p_c$ is the proportion of samples in class $c$.
Properties:
- $H = 0$: Pure node (all same class) – best!
- $H = 1$: Maximum impurity (50/50 split for binary) – worst! (With $C$ classes, the maximum is $\log_2 C$.)
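A minimal entropy function following the formula above, checked against the two extreme cases:

```python
import math

def entropy(labels):
    """H(S) = -sum over classes of p_c * log2(p_c)."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

print(entropy([0, 0, 1, 1]))             # 1.0, maximum impurity for binary
print(round(entropy([0, 0, 0, 1]), 3))   # 0.811
assert entropy(["a", "a", "a"]) == 0     # pure node
```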
Information Gain
Information Gain = How much does splitting reduce entropy?
$$IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)$$

Where:
- $S$: Parent node samples
- $A$: Feature to split on
- $S_v$: Subset of $S$ after splitting on value $v$
Goal: Choose split that maximizes information gain (reduces entropy most).
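A sketch of information gain built on the entropy definition above; the toy labels and splits are made up to show the two extremes (a perfect split and a useless one):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """IG = H(parent) minus the size-weighted entropies of the child nodes."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

labels = ["+", "+", "-", "-"]
print(information_gain(labels, [["+", "+"], ["-", "-"]]))  # 1.0, perfect split
print(information_gain(labels, [["+", "-"], ["+", "-"]]))  # 0.0, useless split
```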
4. Building a Decision Tree (ID3 Algorithm)
Algorithm Steps
- Start with all data at the root
- For each node:
  - If pure (all same class) → make leaf
  - If max depth reached → make leaf
  - Else:
    - Find best split (maximum information gain)
    - Create child nodes
    - Recurse into each child
- Stop when all leaves are pure or max depth is reached
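The recursive steps above can be sketched in plain Python. This is a simplified ID3-style learner for numeric features; the dict layout, helper names, and toy dataset are illustrative, and real implementations add tie-breaking, categorical handling, and efficiency tricks.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(X, y):
    """Return the (feature, threshold) pair with maximum information gain."""
    best, best_gain = None, 0.0
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            gain = entropy(y) - (len(left) / len(y) * entropy(left)
                                 + len(right) / len(y) * entropy(right))
            if gain > best_gain:
                best_gain, best = gain, (f, t)
    return best

def build(X, y, depth=0, max_depth=3):
    split = None if depth == max_depth else best_split(X, y)
    if len(set(y)) == 1 or split is None:  # pure, depth cap, or no useful split
        return {"leaf": Counter(y).most_common(1)[0][0]}
    f, t = split
    li = [i for i, row in enumerate(X) if row[f] <= t]
    ri = [i for i, row in enumerate(X) if row[f] > t]
    return {"feature": f, "threshold": t,
            "left": build([X[i] for i in li], [y[i] for i in li], depth + 1, max_depth),
            "right": build([X[i] for i in ri], [y[i] for i in ri], depth + 1, max_depth)}

def predict(node, x):
    while "leaf" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["leaf"]

X = [[1.0, 1.0], [2.0, 1.0], [3.0, 4.0], [4.0, 5.0]]
y = ["a", "a", "b", "b"]
tree = build(X, y)
print(predict(tree, [1.5, 1.0]))  # a
```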
5. Overfitting and Pruning
The Overfitting Problem
Deep trees can memorize training data perfectly but fail on new data!
Pruning Strategies
Pre-pruning (stop early):
- Limit max_depth
- Require min_samples_split (minimum samples needed to split a node)
- Require min_samples_leaf (minimum samples in each leaf)
- Require a minimum information gain per split
Post-pruning (build full tree, then trim):
- Remove branches that don't improve validation performance
- More complex but often better
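Assuming scikit-learn (linked under Further Reading), both strategies look roughly like this: constructor limits such as max_depth and min_samples_leaf pre-prune, while cost_complexity_pruning_path plus ccp_alpha post-prunes. The specific alpha picked below is arbitrary for the sketch.

```python
# Sketch of pre- vs post-pruning with scikit-learn on the iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early via constructor limits.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                             random_state=0).fit(X_tr, y_tr)

# Post-pruning: compute the cost-complexity path, then refit with a chosen alpha.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
post = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2],
                              random_state=0).fit(X_tr, y_tr)

print(pre.get_depth(), post.get_depth())
```

In practice you would choose ccp_alpha by validation or cross-validated accuracy rather than grabbing a value off the path as done here.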
6. Advantages and Disadvantages
✅ Advantages
- Interpretable: Easy to understand and visualize
- No preprocessing: Don't need to scale features
- Handle non-linear: Naturally capture non-linear relationships
- Mixed data: Can handle numerical and categorical features
- Feature importance: Automatically ranks features
- Fast prediction: traversal depth is $O(\log n)$ for a reasonably balanced tree
❌ Disadvantages
- Overfitting: Can create overly complex trees
- Instability: Small changes in data → completely different tree
- Bias: Greedy algorithm may not find globally optimal tree
- Imbalanced: Biased toward dominant classes
- Smooth functions: Poor at learning smooth relationships
Key Takeaways
✓ Decision Trees: Make sequential binary decisions to classify/predict
✓ Entropy: Measures impurity/disorder in a node (0 = pure; 1 = maximum for a binary split)
✓ Information Gain: Reduction in entropy from a split (choose split that maximizes this)
✓ Building Trees: Recursive algorithm (ID3, C4.5, CART) that greedily selects best splits
✓ Overfitting: Deep trees memorize training data → use max_depth, min_samples, or pruning
✓ Interpretability: Trees are "white box" models – easy to understand decision logic
✓ Non-linear: Naturally handle non-linear relationships with rectangular boundaries
Practice Problems
Problem 1: Calculate Entropy
Problem 2: Find Best Split
Next Steps
Decision trees are powerful on their own, but their real strength comes from ensembles!
Next lessons:
- Lesson 7: Random Forests – combining many trees through bagging
- Lesson 8: Gradient Boosting – sequentially improving trees
These ensemble methods are among the most powerful ML techniques available!
Further Reading
Interactive Visualizations
- A Visual Introduction to Machine Learning — Part 1 (R2D3) — the landmark scroll-story that teaches decision-tree splits through a San Francisco vs. New York housing classifier. Start here.
- MLU-Explain: Decision Trees — retrains a tree in-browser as you tweak depth and min-samples, and shows the partitioned feature space live.
- dtreeviz Gallery (Terence Parr) — a library that renders trees with histograms at every split; the linked gallery makes the structure obvious at a glance.
Video Tutorials
- StatQuest — Decision and Classification Trees, Clearly Explained and Regression Trees (Josh Starmer) — the canonical pair.
- Google ML Crash Course — Decision Trees & Random Forests — free interactive course on the whole tree/forest family.
Papers & Articles
- Classification and Regression Trees — Breiman, Friedman, Olshen, Stone (1984). The CART book.
- Why Model-Agnostic Interpretability is the Future — Christoph Molnar's Interpretable Machine Learning chapter on trees.
- Oblique Decision Trees via Differentiable Formulations — modern twist on splits beyond axis-aligned cuts.
Documentation & Books
- Book: The Elements of Statistical Learning — Chapter 9.
- scikit-learn: Decision Trees Guide — with plotting and pruning APIs.
Remember: Trees are interpretable, handle non-linearity naturally, but can overfit. Balance tree depth carefully, or use ensembles for better performance!