Evaluation Metrics: Beyond Accuracy

Introduction: Beyond Accuracy

Imagine you're building a model to detect a rare cancer. Your model predicts "no cancer" for everyone and achieves 99% accuracy! Sounds great, right? Wrong!

If only 1% of patients actually have cancer, that "do-nothing" model scores 99% accuracy while missing every single cancer case. Lives are lost!

Accuracy is not everything. The right metric depends on your problem, your costs, and your goals.

Key Insight: Different problems require different metrics. Choosing the right evaluation metric is as important as choosing the right algorithm!

Learning Objectives

  • Understand why accuracy can be misleading
  • Master classification metrics: precision, recall, F1-score, ROC-AUC
  • Grasp confusion matrices and their interpretation
  • Learn regression metrics: MSE, RMSE, MAE, R²
  • Handle imbalanced datasets properly
  • Choose appropriate metrics for different business problems
  • Implement custom metrics when needed

1. The Problem with Accuracy

When 99% Accuracy Means Failure

Accuracy: \(\frac{\text{Correct Predictions}}{\text{Total Predictions}}\)

Problem: Doesn't account for class imbalance or error costs!

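As a quick illustration, here is a minimal sketch of the accuracy trap, assuming NumPy and scikit-learn are available: a majority-class "model" on simulated 1%-positive data scores about 99% accuracy while catching zero positives.

```python
# A minimal sketch of the accuracy trap, assuming NumPy and scikit-learn are installed.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(42)

# Simulated labels: roughly 1% positive ("cancer"), 99% negative
y_true = (rng.random(10_000) < 0.01).astype(int)

# A "model" that always predicts the majority class (no cancer)
y_pred = np.zeros_like(y_true)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")  # ~0.99
print(f"Recall:   {recall_score(y_true, y_pred):.3f}")    # 0.0 -- misses every positive
```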


2. The Confusion Matrix: Foundation of Classification Metrics

Understanding the Four Quadrants

For binary classification:

                       Predicted Positive      Predicted Negative
Actually Positive      True Positive (TP)      False Negative (FN)
Actually Negative      False Positive (FP)     True Negative (TN)

Definitions:

  • TP: Correctly predicted positive
  • TN: Correctly predicted negative
  • FP: Incorrectly predicted positive (Type I error)
  • FN: Incorrectly predicted negative (Type II error)

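The sketch below computes the four counts with scikit-learn's confusion_matrix (the labels and predictions are made-up examples). Note that scikit-learn orders rows and columns as [negative, positive], so the raw matrix reads [[TN, FP], [FN, TP]] rather than matching the table above row for row.

```python
# A small sketch of the four counts via scikit-learn; labels/predictions are made up.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# scikit-learn orders rows/columns as [negative, positive],
# so the raw matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")
```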


3. Key Classification Metrics

Precision and Recall

Precision: Of all predicted positives, how many are correct? \[ \text{Precision} = \frac{TP}{TP + FP} \]

Recall (Sensitivity): Of all actual positives, how many did we catch? \[ \text{Recall} = \frac{TP}{TP + FN} \]

Tradeoff: High precision → fewer false alarms; High recall → catch more positives

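A short sketch with scikit-learn, reusing the made-up labels from the confusion-matrix example above (TP=3, FP=1, FN=2 for these arrays):

```python
# Precision and recall with scikit-learn on the same made-up labels as above.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # TP/(TP+FP) = 3/4
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # TP/(TP+FN) = 3/5
```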

F1-Score: Harmonic Mean

F1-Score: Harmonic mean of precision and recall \[ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

Use: When you want to balance precision and recall

Variants: The \(F_\beta\) score weights recall \(\beta\) times as heavily as precision

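A sketch of F1 and the \(F_\beta\) variants, assuming scikit-learn's f1_score and fbeta_score, again on the made-up labels from above:

```python
# F1 and F-beta with scikit-learn on the same made-up labels.
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

print(f"F1:   {f1_score(y_true, y_pred):.3f}")
# beta > 1 emphasizes recall, beta < 1 emphasizes precision
print(f"F2:   {fbeta_score(y_true, y_pred, beta=2):.3f}")
print(f"F0.5: {fbeta_score(y_true, y_pred, beta=0.5):.3f}")
```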


4. ROC Curve and AUC

Receiver Operating Characteristic (ROC)

ROC Curve: Plots True Positive Rate (TPR) vs False Positive Rate (FPR) at various thresholds

  • TPR (Recall): \(\frac{TP}{TP + FN}\)
  • FPR: \(\frac{FP}{FP + TN}\)

AUC (Area Under Curve): Single-number summary

  • AUC = 1.0: Perfect classifier
  • AUC = 0.5: Random guessing
  • AUC < 0.5: Worse than random (flip predictions!)

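The sketch below computes an ROC curve and its AUC on a small synthetic, imbalanced dataset; make_classification and logistic regression are illustrative choices here, not requirements:

```python
# A sketch of ROC/AUC on a synthetic, imbalanced problem; the dataset and the
# logistic-regression model are illustrative assumptions, not requirements.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# ROC needs scores or probabilities, not hard 0/1 predictions
probs = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, probs)  # one (FPR, TPR) point per threshold
print(f"AUC: {roc_auc_score(y_te, probs):.3f}")
```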


5. Regression Metrics

Mean Squared Error (MSE) and Variants

MSE: \(\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2\)

RMSE: \(\sqrt{\text{MSE}}\) (same units as target)

MAE: \(\frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|\) (less sensitive to outliers)

R² Score: Proportion of variance explained (1.0 = perfect, 0.0 = no better than predicting the mean, < 0 = worse than the mean)

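A compact sketch of these four metrics, assuming scikit-learn and NumPy (the y values are made-up examples):

```python
# The common regression metrics with scikit-learn; y values are made up.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
print(f"MSE:  {mse:.3f}")
print(f"RMSE: {np.sqrt(mse):.3f}")  # back in the target's units
print(f"MAE:  {mean_absolute_error(y_true, y_pred):.3f}")
print(f"R2:   {r2_score(y_true, y_pred):.3f}")
```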


6. Choosing the Right Metric

Decision Guide

Problem Type            Metric            When to Use
Binary Classification   Accuracy          Balanced classes, equal error costs
                        Precision         Minimize false positives (spam detection)
                        Recall            Minimize false negatives (cancer detection)
                        F1-Score          Balance precision & recall
                        ROC-AUC           Ranking quality, balanced classes
                        PR-AUC            Imbalanced classes
Multi-class             Macro/Micro F1    Weighted or unweighted class performance
Regression              MSE/RMSE          General purpose, penalize large errors
                        MAE               Robust to outliers
                        R²                Explained variance
                        MAPE              Percentage errors matter

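To make the PR-AUC vs ROC-AUC row concrete, the sketch below scores one model both ways on a heavily imbalanced synthetic dataset (the dataset and model choices are illustrative assumptions):

```python
# Contrasting ROC-AUC with PR-AUC (average precision) when positives are rare;
# the synthetic dataset and model are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# On rare-positive data, ROC-AUC can look comfortable while PR-AUC shows how
# hard the positives really are to isolate.
print(f"ROC-AUC: {roc_auc_score(y_te, probs):.3f}")
print(f"PR-AUC:  {average_precision_score(y_te, probs):.3f}")
```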


Key Takeaways

Accuracy is not enough: Misleading for imbalanced datasets and asymmetric costs

Confusion Matrix: Foundation for all classification metrics (TP, TN, FP, FN)

Precision: Of predicted positives, how many are correct? (Minimize FP)

Recall: Of actual positives, how many did we catch? (Minimize FN)

F1-Score: Harmonic mean of precision and recall (balance both)

ROC-AUC: Ranking quality across all thresholds (good for balanced data)

PR-AUC: Better than ROC-AUC for imbalanced data

Regression: MSE (penalizes large errors), MAE (robust to outliers), R² (variance explained)

Choose metric: Based on problem type, class balance, and business costs!


Practice Problems

Problem 1: Compute Metrics from Confusion Matrix

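A possible starter sketch for this exercise (the TP/FP/FN/TN counts are hypothetical; swap in the confusion matrix you are given and apply the formulas from Sections 2 and 3):

```python
# Starter with hypothetical counts: compute the metrics directly from the formulas.
tp, fp, fn, tn = 30, 10, 20, 940

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # 0.970
print(f"Precision: {precision:.3f}")  # 0.750
print(f"Recall:    {recall:.3f}")     # 0.600
print(f"F1:        {f1:.3f}")         # 0.667
```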

Problem 2: Choose the Right Metric



Next Steps

You now understand how to properly evaluate models!

Next lesson: Cross-Validation Strategies – how to get reliable performance estimates and avoid overfitting to validation data.

Proper evaluation + proper validation = trustworthy models!


Remember: The metric you optimize is the behavior you get. Choose wisely!