Evaluation Metrics: Beyond Accuracy

Introduction: Beyond Accuracy

Imagine you're building a model to detect a rare cancer. Your model predicts "no cancer" for everyone and achieves 99% accuracy! Sounds great, right? Wrong!

If only 1% of patients actually have cancer, that "do-nothing" model scores 99% accuracy while missing every single cancer case. Lives are lost!

Accuracy is not everything. The right metric depends on your problem, your costs, and your goals.

Key Insight: Different problems require different metrics. Choosing the right evaluation metric is as important as choosing the right algorithm!

Learning Objectives

  • Understand why accuracy can be misleading
  • Master classification metrics: precision, recall, F1-score, ROC-AUC
  • Grasp confusion matrices and their interpretation
  • Learn regression metrics: MSE, RMSE, MAE, R²
  • Handle imbalanced datasets properly
  • Choose appropriate metrics for different business problems
  • Implement custom metrics when needed

1. The Problem with Accuracy

When 99% Accuracy Means Failure

Accuracy: \(\frac{\text{Correct Predictions}}{\text{Total Predictions}}\)

Problem: Doesn't account for class imbalance or error costs!

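As a quick illustration, here is a minimal sketch of the accuracy trap, assuming NumPy and scikit-learn are available: a majority-class "model" on simulated 1%-positive data scores about 99% accuracy while catching zero positives.

```python
# A minimal sketch of the accuracy trap, assuming NumPy and scikit-learn are installed.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(42)

# Simulated labels: roughly 1% positive ("cancer"), 99% negative
y_true = (rng.random(10_000) < 0.01).astype(int)

# A "model" that always predicts the majority class (no cancer)
y_pred = np.zeros_like(y_true)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")  # ~0.99
print(f"Recall:   {recall_score(y_true, y_pred):.3f}")    # 0.0 -- misses every positive
```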


2. The Confusion Matrix: Foundation of Classification Metrics

Understanding the Four Quadrants

For binary classification:

                       Predicted Positive      Predicted Negative
Actually Positive      True Positive (TP)      False Negative (FN)
Actually Negative      False Positive (FP)     True Negative (TN)

Definitions:

  • TP: Correctly predicted positive
  • TN: Correctly predicted negative
  • FP: Incorrectly predicted positive (Type I error)
  • FN: Incorrectly predicted negative (Type II error)

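The sketch below computes the four counts with scikit-learn's confusion_matrix (the labels and predictions are made-up examples). Note that scikit-learn orders rows and columns as [negative, positive], so the raw matrix reads [[TN, FP], [FN, TP]] rather than matching the table above row for row.

```python
# A small sketch of the four counts via scikit-learn; labels/predictions are made up.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# scikit-learn orders rows/columns as [negative, positive],
# so the raw matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")
```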


3. Key Classification Metrics

Precision and Recall

Precision: Of all predicted positives, how many are correct? \[ \text{Precision} = \frac{TP}{TP + FP} \]

Recall (Sensitivity): Of all actual positives, how many did we catch? \[ \text{Recall} = \frac{TP}{TP + FN} \]

Tradeoff: High precision → fewer false alarms; High recall → catch more positives

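A short sketch with scikit-learn, reusing the made-up labels from the confusion-matrix example above (TP=3, FP=1, FN=2 for these arrays):

```python
# Precision and recall with scikit-learn on the same made-up labels as above.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # TP/(TP+FP) = 3/4
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # TP/(TP+FN) = 3/5
```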

F1-Score: Harmonic Mean

F1-Score: Harmonic mean of precision and recall \[ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

Use: When you want to balance precision and recall

Variants: The \(F_\beta\) score weights recall \(\beta\) times as heavily as precision

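A sketch of F1 and the \(F_\beta\) variants, assuming scikit-learn's f1_score and fbeta_score, again on the made-up labels from above:

```python
# F1 and F-beta with scikit-learn on the same made-up labels.
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

print(f"F1:   {f1_score(y_true, y_pred):.3f}")
# beta > 1 emphasizes recall, beta < 1 emphasizes precision
print(f"F2:   {fbeta_score(y_true, y_pred, beta=2):.3f}")
print(f"F0.5: {fbeta_score(y_true, y_pred, beta=0.5):.3f}")
```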


4. ROC Curve and AUC

Receiver Operating Characteristic (ROC)

ROC Curve: Plots True Positive Rate (TPR) vs False Positive Rate (FPR) at various thresholds

  • TPR (Recall): \(\frac{TP}{TP + FN}\)
  • FPR: \(\frac{FP}{FP + TN}\)

AUC (Area Under Curve): Single-number summary

  • AUC = 1.0: Perfect classifier
  • AUC = 0.5: Random guessing
  • AUC < 0.5: Worse than random (flip predictions!)

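The sketch below computes an ROC curve and its AUC on a small synthetic, imbalanced dataset; make_classification and logistic regression are illustrative choices here, not requirements:

```python
# A sketch of ROC/AUC on a synthetic, imbalanced problem; the dataset and the
# logistic-regression model are illustrative assumptions, not requirements.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# ROC needs scores or probabilities, not hard 0/1 predictions
probs = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, probs)  # one (FPR, TPR) point per threshold
print(f"AUC: {roc_auc_score(y_te, probs):.3f}")
```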


5. Regression Metrics

Mean Squared Error (MSE) and Variants

MSE: \(\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2\)

RMSE: \(\sqrt{\text{MSE}}\) (same units as target)

MAE: \(\frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|\) (less sensitive to outliers)

R² Score: Proportion of variance explained (1.0 = perfect, 0.0 = no better than predicting the mean, < 0 = worse than the mean)

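A compact sketch of these four metrics, assuming scikit-learn and NumPy (the y values are made-up examples):

```python
# The common regression metrics with scikit-learn; y values are made up.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
print(f"MSE:  {mse:.3f}")
print(f"RMSE: {np.sqrt(mse):.3f}")  # back in the target's units
print(f"MAE:  {mean_absolute_error(y_true, y_pred):.3f}")
print(f"R2:   {r2_score(y_true, y_pred):.3f}")
```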


6. Choosing the Right Metric

Decision Guide

Problem Type            Metric            When to Use
Binary Classification   Accuracy          Balanced classes, equal error costs
                        Precision         Minimize false positives (spam detection)
                        Recall            Minimize false negatives (cancer detection)
                        F1-Score          Balance precision & recall
                        ROC-AUC           Ranking quality, balanced classes
                        PR-AUC            Imbalanced classes
Multi-class             Macro/Micro F1    Weighted or unweighted class performance
Regression              MSE/RMSE          General purpose, penalize large errors
                        MAE               Robust to outliers
                        R²                Explained variance
                        MAPE              Percentage errors matter

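To make the PR-AUC vs ROC-AUC row concrete, the sketch below scores one model both ways on a heavily imbalanced synthetic dataset (the dataset and model choices are illustrative assumptions):

```python
# Contrasting ROC-AUC with PR-AUC (average precision) when positives are rare;
# the synthetic dataset and model are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# On rare-positive data, ROC-AUC can look comfortable while PR-AUC shows how
# hard the positives really are to isolate.
print(f"ROC-AUC: {roc_auc_score(y_te, probs):.3f}")
print(f"PR-AUC:  {average_precision_score(y_te, probs):.3f}")
```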


Key Takeaways

Accuracy is not enough: Misleading for imbalanced datasets and asymmetric costs

Confusion Matrix: Foundation for all classification metrics (TP, TN, FP, FN)

Precision: Of predicted positives, how many are correct? (Minimize FP)

Recall: Of actual positives, how many did we catch? (Minimize FN)

F1-Score: Harmonic mean of precision and recall (balance both)

ROC-AUC: Ranking quality across all thresholds (good for balanced data)

PR-AUC: Better than ROC-AUC for imbalanced data

Regression: MSE (penalizes large errors), MAE (robust to outliers), R² (variance explained)

Choose metric: Based on problem type, class balance, and business costs!


Practice Problems

Problem 1: Compute Metrics from Confusion Matrix

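A possible starter sketch for this exercise (the TP/FP/FN/TN counts are hypothetical; swap in the confusion matrix you are given and apply the formulas from Sections 2 and 3):

```python
# Starter with hypothetical counts: compute the metrics directly from the formulas.
tp, fp, fn, tn = 30, 10, 20, 940

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # 0.970
print(f"Precision: {precision:.3f}")  # 0.750
print(f"Recall:    {recall:.3f}")     # 0.600
print(f"F1:        {f1:.3f}")         # 0.667
```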

Problem 2: Choose the Right Metric



Next Steps

You now understand how to properly evaluate models!

Next lesson: Cross-Validation Strategies – how to get reliable performance estimates and avoid overfitting to validation data.

Proper evaluation + proper validation = trustworthy models!


Remember: The metric you optimize is the behavior you get. Choose wisely!