Cross-Validation and Model Selection

Introduction: Don't Trust a Single Split!

Imagine you're hiring an employee based on one interview. By chance, they have a great day and ace it. You hire them, but they turn out to be mediocre. One sample isn't enough to judge!

The same applies to model evaluation. If you evaluate on just one train/test split, you might get lucky (or unlucky) with the split. Your performance estimate will be unreliable.

Cross-validation solves this: evaluate on multiple different splits and average the results. This gives a much more robust estimate of true performance!

Key Insight: Cross-validation provides a reliable estimate of model performance by testing on multiple data partitions, reducing variance and catching overfitting.

Learning Objectives

  • Understand why single train/test splits are insufficient
  • Master k-fold cross-validation
  • Learn specialized CV strategies (stratified, time-series, grouped)
  • Implement nested cross-validation for hyperparameter tuning
  • Avoid data leakage pitfalls
  • Choose appropriate CV strategy for different problems
  • Understand computational tradeoffs

1. The Problem with Train/Test Split

Variance in Performance Estimates

A single train/test split can give misleading results depending on which samples end up in each set!

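To see this instability directly, here is a small experiment (a sketch assuming scikit-learn and a synthetic dataset) that scores the same model on 20 different random splits:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic classification data so the example is self-contained
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Same model, same data -- only the random split changes
scores = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print(f"min={min(scores):.2f}  max={max(scores):.2f}  "
      f"spread={max(scores) - min(scores):.2f}")
```

The spread between the best and worst split is pure luck of the draw: nothing about the model changed.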


2. K-Fold Cross-Validation

The Standard Approach

Algorithm:

  1. Split the data into k equal-sized folds
  2. For each fold i = 1, ..., k:
    • Train on all folds except fold i
    • Test on fold i
  3. Average the k performance scores

Common choices: k = 5 or k = 10

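The algorithm above maps directly onto scikit-learn's `KFold` and `cross_val_score`; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# 5 folds: each sample is used for testing exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kf)

print(f"fold scores: {np.round(scores, 3)}")
print(f"mean = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean together with the standard deviation makes clear how much the estimate still varies across folds.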


3. Stratified K-Fold: For Imbalanced Data

Maintaining Class Proportions

Problem: Random folds might have different class distributions

Solution: Stratified K-Fold ensures each fold has the same class proportions as the original data

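A quick sketch with `StratifiedKFold` on a deliberately imbalanced synthetic dataset (~90% class 0), showing that every test fold keeps minority-class samples:

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_counts = []
for i, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    counts = Counter(y[test_idx])          # class distribution in this test fold
    fold_counts.append(counts)
    print(f"fold {i}: test class counts = {dict(counts)}")
```

With plain `KFold` on data this imbalanced, a fold can end up with very few (or zero) minority samples, making its score meaningless.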


4. Specialized Cross-Validation Strategies

Time Series Split: Respecting Temporal Order

Problem: Time series data has temporal dependencies

Solution: TimeSeriesSplit – always train on past, test on future

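The following sketch uses scikit-learn's `TimeSeriesSplit` on 12 time-ordered observations; each split trains only on indices that precede the test indices:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered observations (e.g. monthly measurements)
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for i, (train_idx, test_idx) in enumerate(splits):
    # Every training index precedes every test index: no peeking at the future
    assert train_idx.max() < test_idx.min()
    print(f"split {i}: train={train_idx.tolist()}  test={test_idx.tolist()}")
```

Notice the training window grows with each split while the test window always lies strictly after it, which is exactly the constraint shuffled k-fold would violate.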

Group K-Fold: Keeping Groups Together

Problem: Data has groups that shouldn't be split (e.g., multiple samples from same patient)

Solution: GroupKFold – keeps all samples from the same group in the same fold

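A minimal `GroupKFold` sketch with a made-up patient scenario (4 patients, 3 samples each); the assertion checks that no patient's samples appear on both sides of a split:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# 12 samples from 4 patients (3 samples each); samples from one
# patient must never be split between train and test
X = np.arange(12).reshape(-1, 1)
y = np.zeros(12)
groups = np.repeat([0, 1, 2, 3], 3)

gkf = GroupKFold(n_splits=4)

test_group_sets = []
for train_idx, test_idx in gkf.split(X, y, groups):
    train_groups = set(groups[train_idx])
    test_groups = set(groups[test_idx])
    assert train_groups.isdisjoint(test_groups)  # groups never straddle the split
    test_group_sets.append(test_groups)
    print(f"test groups: {sorted(test_groups)}")
```

If regular k-fold were used here, the model could memorize a patient from the training fold and be "tested" on that same patient, inflating the score.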


5. Nested Cross-Validation: For Hyperparameter Tuning

The Right Way to Tune and Evaluate

Problem: If you tune hyperparameters on the same CV folds you use to report performance, the reported score is optimistically biased – you've overfit to those folds!

Solution: Nested CV

  • Outer loop: Estimates true performance
  • Inner loop: Hyperparameter tuning
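In scikit-learn, nesting falls out naturally from composing `GridSearchCV` (inner loop) with `cross_val_score` (outer loop); a sketch on synthetic data, with an illustrative grid over the regularization strength `C`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Inner loop: tunes C using 3-fold CV on each outer training set
inner = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=3,
)

# Outer loop: estimates the performance of the *whole tuning procedure*
outer_scores = cross_val_score(inner, X, y, cv=5)
print(f"nested CV estimate: {outer_scores.mean():.3f} "
      f"+/- {outer_scores.std():.3f}")
```

The key point: the outer test folds are never seen by the tuning step, so the outer score is an honest estimate of how the tuned model will perform on new data.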


6. Common Pitfalls and Best Practices

Data Leakage in Cross-Validation

Leakage: Information from test fold influences training

Common mistakes:

  1. Fitting a preprocessor (e.g., a scaler) on the full dataset before CV
  2. Performing feature selection on the full dataset before CV
  3. Using regular (shuffled) CV for time-series or grouped data

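The standard fix for mistakes 1 and 2 is to put all preprocessing inside a `Pipeline`, so it is refit on each training fold; a sketch contrasting the leaky and leak-free versions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# WRONG: the scaler sees the test folds before CV even starts (leakage)
# X_scaled = StandardScaler().fit_transform(X)
# cross_val_score(LogisticRegression(max_iter=1000), X_scaled, y, cv=5)

# RIGHT: the scaler is refit on each training fold inside CV
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(f"leak-free CV score: {scores.mean():.3f}")
```

With simple scaling the difference may be small, but with aggressive preprocessing (feature selection, target encoding) leaky CV can overstate performance badly.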


Key Takeaways

Single train/test split: High variance, unreliable performance estimates

K-Fold CV: Averages over k different splits (typical: k = 5 or k = 10)

Stratified K-Fold: Maintains class proportions (use for imbalanced data)

Time Series Split: Always train on past, test on future (never shuffle!)

Group K-Fold: Keeps grouped samples together (medical, user data)

Nested CV: Outer loop for evaluation, inner loop for hyperparameter tuning

Avoid Leakage: All preprocessing must happen inside CV folds (use Pipeline)

Tradeoff: Larger k → more reliable estimates but slower computation


Practice Problems

Problem 1: Implement K-Fold from Scratch

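Try it yourself first! If you get stuck, here is one possible sketch using only NumPy (the function name `kfold_indices` is just illustrative):

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k roughly equal folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)        # shuffle once, up front
    folds = np.array_split(idx, k)          # k roughly equal chunks
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(kfold_indices(10, k=5))
for train_idx, test_idx in splits:
    print(f"test fold: {sorted(test_idx.tolist())}")
```

Check your own version the same way: every sample should appear in exactly one test fold, and train/test indices should never overlap.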

Problem 2: Design CV for a Real Problem



Next Steps

You now understand how to properly validate models!

Next: We'll cover the art of making models better through:

  • Lesson 13: Feature Engineering – crafting powerful features
  • Lesson 14: Feature Selection – choosing the most informative features

These skills separate good ML practitioners from great ones!


Remember: Proper validation is the foundation of trustworthy machine learning. Don't skip it!