LESSON 12 · INTERMEDIATE · 60 MIN

Cross-Validation and Model Selection

Robust model evaluation: k-fold CV, stratified sampling, time series CV, nested CV for hyperparameter tuning.

Introduction: Don't Trust a Single Split!

Imagine you're hiring an employee based on one interview. By chance, they have a great day and ace it. You hire them, but they turn out to be mediocre. One sample isn't enough to judge!

The same applies to model evaluation. If you evaluate on just one train/test split, you might get lucky (or unlucky) with the split. Your performance estimate will be unreliable.

Cross-validation solves this: evaluate on multiple different splits and average the results. This gives a much more robust estimate of true performance!

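Key Insight: Cross-validation gives a more reliable estimate of model performance by testing on multiple data partitions. Averaging over partitions reduces the variance of the estimate, and a large gap between training and validation scores helps expose overfitting.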

Learning Objectives

  • Understand why single train/test splits are insufficient
  • Master k-fold cross-validation
  • Learn specialized CV strategies (stratified, time-series, grouped)
  • Implement nested cross-validation for hyperparameter tuning
  • Avoid data leakage pitfalls
  • Choose appropriate CV strategy for different problems
  • Understand computational tradeoffs

1. The Problem with Train/Test Split

Variance in Performance Estimates

A single train/test split can give misleading results depending on which samples end up in each set!
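
To make the variance concrete, here is a minimal sketch that refits the same model on 20 different random splits. The dataset (`load_breast_cancer`), the logistic-regression pipeline, and the seeds are illustrative assumptions, not part of the original lesson; the exact numbers will differ on other data.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Same model, same data; only the random split changes between iterations.
scores = []
for seed in range(20):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

scores = np.array(scores)
print(f"min={scores.min():.3f}  max={scores.max():.3f}  std={scores.std():.3f}")
```

The spread between the minimum and maximum is the range of answers you could have reported had you picked just one of these splits.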


2. K-Fold Cross-Validation

The Standard Approach

Algorithm:

  1. Split data into k (approximately) equal-sized folds
  2. For each fold i = 1, ..., k:
    • Train on all folds except fold i
    • Test on fold i
  3. Average the k performance scores

Common choices: k = 5 or k = 10
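
The three steps above can be sketched with scikit-learn's `KFold`, first as an explicit loop and then with `cross_val_score`, which runs the same loop in one call. The dataset and model are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Step 1: split into k folds (shuffled once, with a fixed seed for repeatability).
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Step 2: for each fold, train on the other k-1 folds and test on the held-out one.
fold_scores = []
for train_idx, test_idx in cv.split(X):
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

# Step 3: average the k scores.
print(f"manual loop:     {np.mean(fold_scores):.3f}")

# cross_val_score clones the model and performs the same loop internally.
auto_scores = cross_val_score(model, X, y, cv=cv)
print(f"cross_val_score: {auto_scores.mean():.3f}")
```

Reporting the standard deviation of the fold scores alongside the mean is usually worthwhile, since it indicates how much the estimate depends on the particular partition.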


3. Stratified K-Fold: For Imbalanced Data

Maintaining Class Proportions

Problem: Random folds might have different class distributions; with a rare class, some folds may contain very few (or no) examples of it

Solution: Stratified K-Fold ensures each fold has approximately the same class proportions as the original data
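
A small sketch of the difference, using synthetic labels with 10% positives (an assumption made up for this illustration). It compares the positive rate in each test fold under plain `KFold` and under `StratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic imbalanced labels: 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself


def positive_rate_per_fold(splitter):
    """Return the fraction of positives in each test fold."""
    return [y[test_idx].mean() for _, test_idx in splitter.split(X, y)]


plain = positive_rate_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
strat = positive_rate_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)

print("KFold          :", np.round(plain, 2))
print("StratifiedKFold:", np.round(strat, 2))
```

With stratification every test fold keeps the original 10% positive rate, while the plain folds are free to drift away from it.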


4. Specialized Cross-Validation Strategies

Time Series Split: Respecting Temporal Order

Problem: Time series data has temporal dependencies, so randomly shuffled folds let the model train on the future and be tested on the past

Solution: TimeSeriesSplit – always train on the past, test on the future
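
As a minimal sketch, the following prints the indices that `TimeSeriesSplit` produces for 12 observations. It assumes the rows are already sorted oldest first; the splitter works on row order and does not look at timestamps.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 observations, assumed to be sorted by time (oldest first).
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each training window ends before its test window begins, and the training window grows from one fold to the next.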


Group K-Fold: Keeping Groups Together

Problem: Data has groups that shouldn't be split across train and test (e.g., multiple samples from the same patient)

Solution: GroupKFold – keeps all samples from the same group in the same fold
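
The sketch below uses hypothetical patient IDs (`p1` to `p4`, three samples each, invented for this example) and checks that no patient appears on both sides of any split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: 4 patients with 3 samples each.
groups = np.repeat(["p1", "p2", "p3", "p4"], 3)
X = np.arange(len(groups)).reshape(-1, 1)
y = np.tile([0, 1, 0], 4)

gkf = GroupKFold(n_splits=4)
overlaps = []
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    train_groups = set(groups[train_idx])
    test_groups = set(groups[test_idx])
    # An empty intersection means no patient leaks from train into test.
    overlaps.append(train_groups & test_groups)
    print(f"test patients: {sorted(test_groups)}")
```

Note that `groups` must be passed to `split` (or to `cross_val_score` via its `groups` argument); the splitter cannot infer group membership from `X` alone.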


5. Nested Cross-Validation: For Hyperparameter Tuning

The Right Way to Tune and Evaluate

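Problem: If you tune hyperparameters on your CV folds and then report the best CV score from that same search, the reported score is optimistically biased: the hyperparameters were chosen because they scored well on exactly those folds.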

Solution: Nested CV

  • Outer loop: Estimates true performance
  • Inner loop: Hyperparameter tuning
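
One common way to sketch this in scikit-learn is to wrap a `GridSearchCV` (inner loop) inside `cross_val_score` (outer loop). The SVC model, the parameter grid, and the fold counts here are illustrative assumptions rather than recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC())
# make_pipeline names the SVC step "svc", hence the "svc__" prefix.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: the search tunes hyperparameters using only outer-training data.
search = GridSearchCV(pipe, param_grid, cv=inner_cv)

# Outer loop: each outer test fold is unseen by the search that is scored on it.
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print(f"nested CV accuracy: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```

The outer score estimates the performance of the whole procedure (tuning included), not of one fixed hyperparameter setting; different outer folds may select different settings.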

6. Common Pitfalls and Best Practices

Data Leakage in Cross-Validation

Leakage: Information from the test fold influences training

Common mistakes:

  1. Fitting a preprocessor (e.g., a scaler) on the full data before CV
  2. Selecting features on the full data before CV
  3. Using regular (shuffled) CV for time series or grouped data
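
Mistake 2 can be illustrated with a sketch on pure noise, where the labels are independent of the features and an honest accuracy should sit near 0.5. The data sizes, the choice of `SelectKBest` with `f_classif`, and the seed are assumptions made for this demonstration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Pure noise: 100 samples, 2000 features, labels unrelated to the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))
y = np.array([0, 1] * 50)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Mistake: select features using ALL rows (test folds included), then run CV.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(clf, X_leaky, y, cv=cv).mean()

# Fix: put selection inside a Pipeline so it is refit on each training fold only.
honest_model = make_pipeline(SelectKBest(f_classif, k=20), clf)
honest = cross_val_score(honest_model, X, y, cv=cv).mean()

print(f"selection before CV (leaky): {leaky:.2f}")
print(f"selection inside CV:         {honest:.2f}")
```

The leaky score looks impressive even though there is nothing to learn, because the selected features were chosen with knowledge of the test labels. The same Pipeline pattern applies to scalers, imputers, and any other step that is fit to data.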

Key Takeaways

Single train/test split: High variance, unreliable performance estimates

K-Fold CV: Averages over k different splits (typical: k = 5 or k = 10)

Stratified K-Fold: Maintains class proportions (use for imbalanced data)

Time Series Split: Always train on past, test on future (never shuffle!)

Group K-Fold: Keeps grouped samples together (medical, user data)

Nested CV: Outer loop for evaluation, inner loop for hyperparameter tuning

Avoid Leakage: All preprocessing must happen inside CV folds (use Pipeline)

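Tradeoff: Larger k → each model trains on more of the data (less pessimistic estimate) but needs more model fits (slower); a very large k does not guarantee a lower-variance estimate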


Practice Problems

Problem 1: Implement K-Fold from Scratch


Problem 2: Design CV for a Real Problem


Next Steps

You now understand how to properly validate models!

Next: We'll cover the art of making models better through:

  • Lesson 13: Feature Engineering – crafting powerful features
  • Lesson 14: Feature Selection – choosing the most informative features

These skills separate good ML practitioners from great ones!


Further Reading

Documentation & Books

  • scikit-learn: Cross-validation — full API including cross_val_predict, RepeatedKFold, and grouped variants.
  • Book: Feature Engineering and Selection — Kuhn & Johnson (Chapter on Resampling, free online).

Remember: Proper validation is the foundation of trustworthy machine learning. Don't skip it!