Cross-Validation and Model Selection

Introduction: Don't Trust a Single Split!

Imagine you're hiring an employee based on one interview. By chance, they have a great day and ace it. You hire them, but they turn out to be mediocre. One sample isn't enough to judge!

The same applies to model evaluation. If you evaluate on just one train/test split, you might get lucky (or unlucky) with the split. Your performance estimate will be unreliable.

Cross-validation solves this: evaluate on multiple different splits and average the results. This gives a much more robust estimate of true performance!

Key Insight: Cross-validation provides a reliable estimate of model performance by testing on multiple data partitions, reducing variance and catching overfitting.

Learning Objectives

  • Understand why single train/test splits are insufficient
  • Master k-fold cross-validation
  • Learn specialized CV strategies (stratified, time-series, grouped)
  • Implement nested cross-validation for hyperparameter tuning
  • Avoid data leakage pitfalls
  • Choose appropriate CV strategy for different problems
  • Understand computational tradeoffs

1. The Problem with Train/Test Split

Variance in Performance Estimates

A single train/test split can give misleading results depending on which samples end up in each set!

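To see this instability directly, here is a small experiment (a sketch assuming scikit-learn and a synthetic dataset) that scores the same model on 20 different random splits:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic classification data so the example is self-contained
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Same model, same data -- only the random split changes
scores = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print(f"min={min(scores):.2f}  max={max(scores):.2f}  "
      f"spread={max(scores) - min(scores):.2f}")
```

The spread between the best and worst split is pure luck of the draw: nothing about the model changed.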


2. K-Fold Cross-Validation

The Standard Approach

Algorithm:

  1. Split the data into k equal-sized folds
  2. For each fold i = 1, ..., k:
    • Train on all folds except fold i
    • Test on fold i
  3. Average the k performance scores

Common choices: k = 5 or k = 10

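The algorithm above maps directly onto scikit-learn's `KFold` and `cross_val_score`; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# 5 folds: each sample is used for testing exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kf)

print(f"fold scores: {np.round(scores, 3)}")
print(f"mean = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean together with the standard deviation makes clear how much the estimate still varies across folds.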


3. Stratified K-Fold: For Imbalanced Data

Maintaining Class Proportions

Problem: Random folds might have different class distributions

Solution: Stratified K-Fold ensures each fold has the same class proportions as the original data

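A quick sketch with `StratifiedKFold` on a deliberately imbalanced synthetic dataset (~90% class 0), showing that every test fold keeps minority-class samples:

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_counts = []
for i, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    counts = Counter(y[test_idx])          # class distribution in this test fold
    fold_counts.append(counts)
    print(f"fold {i}: test class counts = {dict(counts)}")
```

With plain `KFold` on data this imbalanced, a fold can end up with very few (or zero) minority samples, making its score meaningless.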


4. Specialized Cross-Validation Strategies

Time Series Split: Respecting Temporal Order

Problem: Time series data has temporal dependencies

Solution: TimeSeriesSplit – always train on past, test on future

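The following sketch uses scikit-learn's `TimeSeriesSplit` on 12 time-ordered observations; each split trains only on indices that precede the test indices:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered observations (e.g. monthly measurements)
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for i, (train_idx, test_idx) in enumerate(splits):
    # Every training index precedes every test index: no peeking at the future
    assert train_idx.max() < test_idx.min()
    print(f"split {i}: train={train_idx.tolist()}  test={test_idx.tolist()}")
```

Notice the training window grows with each split while the test window always lies strictly after it, which is exactly the constraint shuffled k-fold would violate.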

Group K-Fold: Keeping Groups Together

Problem: Data has groups that shouldn't be split (e.g., multiple samples from same patient)

Solution: GroupKFold – keeps all samples from the same group in the same fold

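A minimal `GroupKFold` sketch with a made-up patient scenario (4 patients, 3 samples each); the assertion checks that no patient's samples appear on both sides of a split:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# 12 samples from 4 patients (3 samples each); samples from one
# patient must never be split between train and test
X = np.arange(12).reshape(-1, 1)
y = np.zeros(12)
groups = np.repeat([0, 1, 2, 3], 3)

gkf = GroupKFold(n_splits=4)

test_group_sets = []
for train_idx, test_idx in gkf.split(X, y, groups):
    train_groups = set(groups[train_idx])
    test_groups = set(groups[test_idx])
    assert train_groups.isdisjoint(test_groups)  # groups never straddle the split
    test_group_sets.append(test_groups)
    print(f"test groups: {sorted(test_groups)}")
```

If regular k-fold were used here, the model could memorize a patient from the training fold and be "tested" on that same patient, inflating the score.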


5. Nested Cross-Validation: For Hyperparameter Tuning

The Right Way to Tune and Evaluate

Problem: If you tune hyperparameters on the same CV folds you use to report performance, the reported score is optimistically biased – you've overfit to those folds!

Solution: Nested CV

  • Outer loop: Estimates true performance
  • Inner loop: Hyperparameter tuning
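In scikit-learn, nesting falls out naturally from composing `GridSearchCV` (inner loop) with `cross_val_score` (outer loop); a sketch on synthetic data, with an illustrative grid over the regularization strength `C`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Inner loop: tunes C using 3-fold CV on each outer training set
inner = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=3,
)

# Outer loop: estimates the performance of the *whole tuning procedure*
outer_scores = cross_val_score(inner, X, y, cv=5)
print(f"nested CV estimate: {outer_scores.mean():.3f} "
      f"+/- {outer_scores.std():.3f}")
```

The key point: the outer test folds are never seen by the tuning step, so the outer score is an honest estimate of how the tuned model will perform on new data.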


6. Common Pitfalls and Best Practices

Data Leakage in Cross-Validation

Leakage: Information from test fold influences training

Common mistakes:

  1. Fitting a preprocessor (e.g., a scaler) on the full dataset before CV
  2. Performing feature selection on the full dataset before CV
  3. Using regular (shuffled) CV for time-series or grouped data

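The standard fix for mistakes 1 and 2 is to put all preprocessing inside a `Pipeline`, so it is refit on each training fold; a sketch contrasting the leaky and leak-free versions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# WRONG: the scaler sees the test folds before CV even starts (leakage)
# X_scaled = StandardScaler().fit_transform(X)
# cross_val_score(LogisticRegression(max_iter=1000), X_scaled, y, cv=5)

# RIGHT: the scaler is refit on each training fold inside CV
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(f"leak-free CV score: {scores.mean():.3f}")
```

With simple scaling the difference may be small, but with aggressive preprocessing (feature selection, target encoding) leaky CV can overstate performance badly.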


Key Takeaways

Single train/test split: High variance, unreliable performance estimates

K-Fold CV: Averages over k different splits (typical: k = 5 or k = 10)

Stratified K-Fold: Maintains class proportions (use for imbalanced data)

Time Series Split: Always train on past, test on future (never shuffle!)

Group K-Fold: Keeps grouped samples together (medical, user data)

Nested CV: Outer loop for evaluation, inner loop for hyperparameter tuning

Avoid Leakage: All preprocessing must happen inside CV folds (use Pipeline)

Tradeoff: Larger k → more reliable estimates but slower computation


Practice Problems

Problem 1: Implement K-Fold from Scratch

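Try it yourself first! If you get stuck, here is one possible sketch using only NumPy (the function name `kfold_indices` is just illustrative):

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k roughly equal folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)        # shuffle once, up front
    folds = np.array_split(idx, k)          # k roughly equal chunks
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(kfold_indices(10, k=5))
for train_idx, test_idx in splits:
    print(f"test fold: {sorted(test_idx.tolist())}")
```

Check your own version the same way: every sample should appear in exactly one test fold, and train/test indices should never overlap.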

Problem 2: Design CV for a Real Problem



Next Steps

You now understand how to properly validate models!

Next: We'll cover the art of making models better through:

  • Lesson 13: Feature Engineering – crafting powerful features
  • Lesson 14: Feature Selection – choosing the most informative features

These skills separate good ML practitioners from great ones!


Remember: Proper validation is the foundation of trustworthy machine learning. Don't skip it!