Introduction: Playing 20 Questions
Remember playing "20 Questions"? You ask yes/no questions to narrow down possibilities:
- "Is it bigger than a breadbox?" → Yes
- "Is it electronic?" → No
- "Can you eat it?" → Yes
- "Is it a fruit?" → Yes
- "Is it red?" → Yes
- "Is it an apple?" → YES!
Decision trees work exactly like this! They ask a series of questions about features to classify or predict outcomes. Unlike linear models, which draw straight boundaries, trees create axis-aligned rectangular decision regions by recursively partitioning the feature space.
Key Insight: Decision trees are interpretable, handle non-linear relationships naturally, and don't require feature scaling. They're the foundation for powerful ensemble methods!
Learning Objectives
- Understand tree-based decision making
- Master entropy and information gain
- Build decision trees from scratch
- Prune trees to prevent overfitting
- Compare trees with linear models
- Visualize tree structures and decisions
1. From Linear to Non-Linear
Why Trees?
Linear models can only draw straight (linear) decision boundaries. But what if your classes aren't linearly separable at all, say two interleaving half-moons or an XOR-like pattern? A tree can carve out such regions with a handful of axis-aligned splits, as the sketch below suggests.
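A minimal sketch of this contrast, assuming scikit-learn is available and using its `make_moons` toy dataset (two interleaving half-circles) to compare a linear classifier against a shallow tree:

```python
# Compare a linear model with a decision tree on data that is not
# linearly separable (two interleaving half-moons).
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.25, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

linear = LogisticRegression().fit(X_train, y_train)
tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)

print(f"Logistic regression accuracy: {linear.score(X_test, y_test):.2f}")
print(f"Decision tree accuracy:       {tree.score(X_test, y_test):.2f}")
```

The exact numbers depend on the noise and the random seed, but the tree usually scores noticeably higher because its rectangular regions can bend around the moon shapes while a single straight line cannot.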
2. How Decision Trees Work
The Tree Structure
A decision tree consists of:
- Root node: Starting point (all data)
- Internal nodes: Decision points (if feature_x > threshold?)
- Leaf nodes: Final predictions (class label or value)
- Branches: Outcomes of decisions (yes/no)
Making Predictions
To classify a new sample:
- Start at root
- Follow branches based on feature values
- Arrive at leaf node
- Return leaf's prediction
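A minimal from-scratch sketch of this traversal; the `Node` class, its field names, and the iris-style thresholds below are purely illustrative, not taken from any library:

```python
# A tiny hand-built tree: internal nodes hold a question
# (feature index + threshold), leaves hold a prediction.
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, prediction=None):
        self.feature = feature        # index of the feature to test
        self.threshold = threshold    # "is x[feature] <= threshold?"
        self.left = left              # child to follow when the answer is yes
        self.right = right            # child to follow when the answer is no
        self.prediction = prediction  # class label (leaf nodes only)

def predict_one(node, x):
    """Walk from the root to a leaf, following the answer to each question."""
    while node.prediction is None:            # internal node: keep asking questions
        if x[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.prediction                    # leaf node: return its stored label

# Example tree: "petal length <= 2.5?" -> setosa, else "petal width <= 1.7?" -> ...
root = Node(feature=2, threshold=2.5,
            left=Node(prediction="setosa"),
            right=Node(feature=3, threshold=1.7,
                       left=Node(prediction="versicolor"),
                       right=Node(prediction="virginica")))

print(predict_one(root, [6.1, 2.8, 4.7, 1.2]))  # -> "versicolor"
```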
3. Entropy and Information Gain
Measuring Impurity
To build a tree, we need to decide which feature to split on and where to split it. We use entropy to measure "messiness" or impurity.
Entropy for a node:

\[ H(S) = -\sum_{c=1}^{C} p_c \log_2(p_c) \]

Where \(p_c\) is the proportion of samples in class \(c\).
Properties:
- \(H = 0\): Pure node (all same class) – best!
- \(H = 1\): Maximum impurity (a 50/50 split in binary classification) – worst!
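A small sketch of this calculation with NumPy (the helper name `entropy` is my own):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a node's class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    if len(p) == 1:
        return 0.0                      # pure node
    return -np.sum(p * np.log2(p))      # p > 0 by construction, so log2 is safe

print(entropy([1, 1, 1, 1]))            # 0.0   -> pure node
print(entropy([0, 0, 1, 1]))            # 1.0   -> maximum impurity for binary labels
print(entropy([0, 0, 0, 1]))            # ~0.81 -> somewhere in between
```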
Information Gain
Information Gain = How much does splitting reduce entropy?
\[ IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v) \]

Where:
- \(S\): parent node samples
- \(A\): feature to split on
- \(S_v\): subset of \(S\) where feature \(A\) takes value \(v\)
Goal: Choose split that maximizes information gain (reduces entropy most).
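Building on the entropy helper above, a sketch of information gain for a binary split (the function names are again my own):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y_parent, y_left, y_right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(y_parent)
    weighted_children = (len(y_left) / n) * entropy(y_left) + \
                        (len(y_right) / n) * entropy(y_right)
    return entropy(y_parent) - weighted_children

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])    # H = 1.0
left, right = parent[:4], parent[4:]           # a perfect split into pure children
print(information_gain(parent, left, right))   # 1.0 -> all entropy removed
```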
4. Building a Decision Tree (ID3 Algorithm)
Algorithm Steps
- Start with all the data at the root
- For each node:
  - If the node is pure (all samples share one class) → make it a leaf
  - If the maximum depth is reached → make it a leaf
  - Otherwise:
    - Find the best split (maximum information gain)
    - Create child nodes
    - Recurse on each child
- Stop when all leaves are pure or the maximum depth is reached
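A compact from-scratch sketch of this recursion for numeric features and binary threshold splits, greedily choosing the split with the highest information gain. This is a simplified illustration (ID3 proper works on categorical features; all names and the nested-dict tree structure here are my own), not a production implementation:

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X, y):
    """Return (feature, threshold, gain) of the split with maximum information gain."""
    best = (None, None, 0.0)
    parent_entropy = entropy(y)
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            mask = X[:, feature] <= threshold
            if mask.all() or (~mask).all():
                continue  # a split must produce two non-empty children
            child_entropy = (mask.mean() * entropy(y[mask])
                             + (1 - mask.mean()) * entropy(y[~mask]))
            gain = parent_entropy - child_entropy
            if gain > best[2]:
                best = (feature, threshold, gain)
    return best

def build_tree(X, y, depth=0, max_depth=3):
    """Recursively grow a tree, returning nested dicts (leaves are plain labels)."""
    classes, counts = np.unique(y, return_counts=True)
    majority = classes[np.argmax(counts)]
    if len(classes) == 1 or depth == max_depth:
        return majority                          # pure node or depth limit -> leaf
    feature, threshold, gain = best_split(X, y)
    if feature is None or gain <= 0:
        return majority                          # no useful split -> leaf
    mask = X[:, feature] <= threshold
    return {"feature": feature, "threshold": threshold,
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth)}

X = np.array([[2.0, 1.0], [3.0, 1.5], [6.0, 1.2], [7.0, 0.8]])
y = np.array([0, 0, 1, 1])
print(build_tree(X, y))   # splits on feature 0 at threshold 3.0, then pure leaves
```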
5. Overfitting and Pruning
The Overfitting Problem
Deep trees can memorize training data perfectly but fail on new data!
Pruning Strategies
Pre-pruning (stop early):
- Limit `max_depth`
- Require `min_samples_split` (minimum samples needed to split a node)
- Require `min_samples_leaf` (minimum samples in each leaf)
- Require a minimum information gain per split
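In scikit-learn, these pre-pruning controls map directly onto `DecisionTreeClassifier` constructor parameters; the specific values below are arbitrary examples, not recommendations:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruned tree: growth stops early instead of fitting the training set perfectly.
pruned = DecisionTreeClassifier(
    max_depth=4,                  # cap on tree depth
    min_samples_split=20,         # a node needs at least 20 samples to be split
    min_samples_leaf=10,          # every leaf must keep at least 10 samples
    min_impurity_decrease=0.01,   # require a minimum impurity reduction per split
    random_state=0,
).fit(X_train, y_train)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("full tree   test accuracy:", round(full.score(X_test, y_test), 3))
print("pruned tree test accuracy:", round(pruned.score(X_test, y_test), 3))
```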
Post-pruning (build full tree, then trim):
- Remove branches that don't improve validation performance
- More complex but often better
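scikit-learn implements post-pruning via cost-complexity pruning: grow the full tree, compute the pruning path, and choose `ccp_alpha` on a validation split. A minimal sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# The pruning path lists the alpha values at which subtrees get collapsed.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)   # keep the alpha with the best validation accuracy
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"best ccp_alpha={best_alpha:.5f}, validation accuracy={best_score:.3f}")
```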
6. Advantages and Disadvantages
✅ Advantages
- Interpretable: Easy to understand and visualize
- No preprocessing: Don't need to scale features
- Handle non-linear: Naturally capture non-linear relationships
- Mixed data: Can handle numerical and categorical features
- Feature importance: Automatically ranks features (see the sketch after this list)
- Fast prediction: classifying a sample only walks one root-to-leaf path, \(O(\log n)\) deep for a balanced tree
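A quick illustration of the feature-importance ranking mentioned above, using the breast-cancer dataset as an arbitrary example:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(data.data, data.target)

# feature_importances_ sums each feature's impurity reduction across the tree.
ranking = np.argsort(tree.feature_importances_)[::-1][:5]
for i in ranking:
    print(f"{data.feature_names[i]:<25} {tree.feature_importances_[i]:.3f}")
```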
❌ Disadvantages
- Overfitting: Can create overly complex trees
- Instability: Small changes in data → completely different tree
- Bias: Greedy algorithm may not find globally optimal tree
- Imbalanced: Biased toward dominant classes
- Smooth functions: Piecewise-constant predictions approximate smooth relationships poorly
Key Takeaways
✓ Decision Trees: Make sequential binary decisions to classify/predict
✓ Entropy: Measures impurity/disorder in a node (0 = pure, 1 = maximum impurity for binary classification)
✓ Information Gain: Reduction in entropy from a split (choose split that maximizes this)
✓ Building Trees: Recursive algorithm (ID3, C4.5, CART) that greedily selects best splits
✓ Overfitting: Deep trees memorize training data → use max_depth, min_samples, or pruning
✓ Interpretability: Trees are "white box" models – easy to understand decision logic
✓ Non-linear: Naturally handle non-linear relationships with rectangular boundaries
Practice Problems
Problem 1: Calculate Entropy
Problem 2: Find Best Split
Next Steps
Decision trees are powerful on their own, but their real strength comes from ensembles!
Next lessons:
- Lesson 7: Random Forests – combining many trees through bagging
- Lesson 8: Gradient Boosting – sequentially improving trees
These ensemble methods are among the most powerful ML techniques available!
Further Reading
- Book: The Elements of Statistical Learning (Chapter 9)
- Algorithm: CART Algorithm
- Visualization: Decision Tree Visualization Tool
- scikit-learn: Decision Trees Guide
Remember: Trees are interpretable, handle non-linearity naturally, but can overfit. Balance tree depth carefully, or use ensembles for better performance!