Cross-Validation and Model Selection

Introduction: Don't Trust a Single Split!

Imagine you're hiring an employee based on one interview. By chance, they have a great day and ace it. You hire them, but they turn out to be mediocre. One sample isn't enough to judge!

The same applies to model evaluation. If you evaluate on just one train/test split, you might get lucky (or unlucky) with the split. Your performance estimate will be unreliable.

Cross-validation solves this: evaluate on multiple different splits and average the results. This gives a much more robust estimate of true performance!

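Key Insight: Cross-validation gives a more reliable estimate of generalization performance by testing on several data partitions. Averaging across folds reduces the variance of the estimate, and a large gap between training and validation scores exposes overfitting.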

Learning Objectives

Understand why single train/test splits are insufficient
Master k-fold cross-validation
Learn specialized CV strategies (stratified, time-series, grouped)
Implement nested cross-validation for hyperparameter tuning
Avoid data leakage pitfalls
Choose appropriate CV strategy for different problems
Understand computational tradeoffs

1. The Problem with Train/Test Split

Variance in Performance Estimates

A single train/test split can give misleading results depending on which samples end up in each set!
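To make the variance concrete, here is a minimal sketch that keeps the data and model fixed and changes only the random seed of the split. The synthetic dataset, the logistic regression model, and the 20 seeds are assumptions chosen for illustration, not part of the original lesson.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small synthetic dataset (an assumption, for illustration only).
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

scores = []
for seed in range(20):
    # Same data, same model; only the random split changes.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

scores = np.array(scores)
print(f"min={scores.min():.3f}  max={scores.max():.3f}  "
      f"mean={scores.mean():.3f}  std={scores.std():.3f}")
```

The spread between the minimum and maximum accuracy is the uncertainty you silently accept when you report a single split.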

2. K-Fold Cross-Validation

The Standard Approach

Algorithm:

Split the data into (k) roughly equal-sized folds, so that every sample is used for testing exactly once
For each fold (i = 1, ..., k):
- Train on all folds except (i)
- Test on fold (i)
Average the (k) performance scores

Common choices: (k=5) or (k=10)
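The steps above can be sketched with scikit-learn's KFold splitter. The synthetic data and the logistic regression model are assumptions for illustration; the explicit loop is written out only to mirror the algorithm, and cross_val_score performs the same procedure in one call.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (an assumption, for illustration only).
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Train on all folds except the current one, test on the held-out fold.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

manual_mean = float(np.mean(fold_scores))

# The same procedure via the convenience helper.
helper_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

print("per-fold accuracy:", np.round(fold_scores, 3))
print(f"mean accuracy: {manual_mean:.3f}")
```

Reporting the per-fold scores alongside the mean is usually worthwhile, since their spread indicates how stable the estimate is.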

3. Stratified K-Fold: For Imbalanced Data

Maintaining Class Proportions

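Problem: With imbalanced classes, random folds can end up with very different class distributions; a fold may contain few or no minority-class samples

Solution: Stratified K-Fold gives each fold approximately the same class proportions as the original data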
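A small sketch of the difference, assuming a synthetic label vector with 90 negatives and 10 positives (the features are placeholders because only the labels matter for splitting). It counts how many positives land in each test fold under plain and stratified splitting.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced labels: 90 negatives, 10 positives (synthetic, illustration only).
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features; irrelevant for splitting

def positives_per_fold(splitter):
    """Count positive samples in each test fold produced by the splitter."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]

plain = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
strat = positives_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold positives per test fold:          ", plain)
print("StratifiedKFold positives per test fold:", strat)
```

With 10 positives and 5 folds, the stratified splitter places exactly 2 positives in every test fold, while the plain splitter's counts depend on the shuffle.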

4. Specialized Cross-Validation Strategies

Time Series Split: Respecting Temporal Order

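Problem: Time series data has temporal dependencies, so shuffled folds let the model train on observations that come after the ones it is tested on

Solution: TimeSeriesSplit – always train on the past, test on the future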
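As a sketch, the following prints the index sets that scikit-learn's TimeSeriesSplit produces for 12 time-ordered observations. The toy array is an assumption for illustration; the point is that every training window ends before its test window begins, and the training window grows with each split.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 observations, assumed to be already sorted by time.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for train_idx, test_idx in splits:
    # Every training index precedes every test index.
    print("train:", train_idx, "test:", test_idx)
```

Note that the data must be in chronological order before splitting, because the splitter works on row positions, not on timestamps.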

Group K-Fold: Keeping Groups Together

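Problem: The data contains groups that shouldn't be split across train and test (e.g., multiple samples from the same patient); otherwise the model is tested on individuals it has already seen

Solution: GroupKFold – keeps all samples from the same group in the same fold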
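A minimal sketch with a hypothetical setup of 6 patients and 3 samples each. The group labels, features, and targets are made up for illustration; the check confirms that no patient appears on both sides of any split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical setup: 6 patients, 3 samples per patient.
groups = np.repeat(np.arange(6), 3)
X = np.zeros((len(groups), 2))   # placeholder features
y = np.tile([0, 1, 0], 6)        # placeholder labels

gkf = GroupKFold(n_splits=3)
overlaps = []
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # Patients that appear in both train and test (should be none).
    shared = set(groups[train_idx].tolist()) & set(groups[test_idx].tolist())
    overlaps.append(shared)
    print("test patients:", sorted(set(groups[test_idx].tolist())))
```

The groups array is passed to split (or to cross_val_score via its groups argument); the splitter cannot infer the grouping from X alone.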

5. Nested Cross-Validation: For Hyperparameter Tuning

The Right Way to Tune and Evaluate

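Problem: If you tune hyperparameters on your CV folds and then report the best CV score, that score is optimistically biased: the hyperparameters were chosen precisely because they did well on those folds.

Solution: Nested CV

Outer loop: Estimates performance on folds that the tuning never saw
Inner loop: Tunes hyperparameters using only the outer loop's training data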
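One common way to sketch the two loops in scikit-learn is to pass a GridSearchCV object (the inner loop) to cross_val_score (the outer loop). The synthetic data, the SVC model, and the small grid over C are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (an assumption, for illustration only).
X, y = make_classification(n_samples=150, n_features=10, n_informative=4, random_state=0)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)  # hyperparameter tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)  # performance estimation

pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1, 10]}  # illustrative grid, not a recommendation

# Inner loop: for each outer training set, pick the best C by 3-fold CV.
search = GridSearchCV(pipe, param_grid, cv=inner_cv)

# Outer loop: score the tuned model on folds the search never saw.
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print(f"nested CV accuracy: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```

The nested score estimates how well the whole procedure (tuning included) generalizes. Different outer folds may select different hyperparameters, so it does not describe one fixed model.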

6. Common Pitfalls and Best Practices

Data Leakage in Cross-Validation

Leakage: Information from the test fold influences training

Common mistakes:

Fitting a preprocessor (e.g., a scaler or imputer) on the full data before CV
Running feature selection on the full data before CV
Using regular (shuffled) CV for time series or grouped data
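The feature-selection mistake can be illustrated with pure noise, where no model should beat chance. In this sketch (random features, random labels, and the choice of 20 selected features are all assumptions for illustration), selecting features on the full data before CV typically yields a score well above chance, while doing the selection inside a Pipeline keeps the estimate honest.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))   # pure noise features
y = rng.integers(0, 2, size=100)   # labels unrelated to X

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: features are chosen using ALL labels, including those of the test folds.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=cv).mean()

# Correct: the selector is refit on the training part of every fold.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
clean = cross_val_score(pipe, X, y, cv=cv).mean()

print(f"selection before CV (leaky): {leaky:.3f}")
print(f"selection inside Pipeline:   {clean:.3f}")
```

Because the labels are random, any accuracy far above 0.5 in the leaky variant is an artifact of the selection step having seen the test labels.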

Key Takeaways

✓ Single train/test split: High variance, unreliable performance estimates

✓ K-Fold CV: Averages over (k) different splits (typical: (k = 5) or (k = 10))

✓ Stratified K-Fold: Maintains class proportions (use for imbalanced data)

✓ Time Series Split: Always train on past, test on future (never shuffle!)

✓ Group K-Fold: Keeps grouped samples together (medical, user data)

✓ Nested CV: Outer loop for evaluation, inner loop for hyperparameter tuning

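✓ Avoid Leakage: Any preprocessing that learns from the data must be fitted inside the CV folds (use Pipeline)

✓ Tradeoff: Larger (k) → each model trains on more of the data (less pessimistic estimates) but requires more model fits; the variance of the estimate does not necessarily shrink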

Practice Problems

Problem 1: Implement K-Fold from Scratch
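One possible starting point, as a sketch rather than a reference solution: a NumPy-only generator that yields train and test indices. The function name k_fold_indices and its parameters are hypothetical, chosen for this example.

```python
import numpy as np

def k_fold_indices(n_samples, k, shuffle=True, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = np.arange(n_samples)
    if shuffle:
        np.random.default_rng(seed).shuffle(indices)
    # array_split tolerates n_samples not divisible by k:
    # the first n_samples % k folds receive one extra sample.
    folds = np.array_split(indices, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in splits:
    print("train size:", len(train_idx), "test size:", len(test_idx))
```

A useful check when writing your own version: the test folds should be disjoint and together cover every sample exactly once.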

Problem 2: Design CV for a Real Problem

Next Steps

You now understand how to properly validate models!

Next: We'll cover the art of making models better through:

Lesson 13: Feature Engineering – crafting powerful features
Lesson 14: Feature Selection – choosing the most informative features

These skills separate good ML practitioners from great ones!
