Introduction: Beyond Accuracy
Imagine you're building a model to detect a rare cancer. Your model predicts "no cancer" for everyone and achieves 99% accuracy. Sounds great, right? Wrong!
If only 1% of patients actually have cancer, that trivial always-"no cancer" model scores 99% accuracy while missing every single cancer case. Lives are lost!
Accuracy is not everything. The right metric depends on your problem, your costs, and your goals.
Key Insight: Different problems require different metrics. Choosing the right evaluation metric is as important as choosing the right algorithm!
Learning Objectives
- Understand why accuracy can be misleading
- Master classification metrics: precision, recall, F1-score, ROC-AUC
- Grasp confusion matrices and their interpretation
- Learn regression metrics: MSE, RMSE, MAE, R²
- Handle imbalanced datasets properly
- Choose appropriate metrics for different business problems
- Implement custom metrics when needed
1. The Problem with Accuracy
When 99% Accuracy Means Failure
Accuracy: $\frac{\text{Correct Predictions}}{\text{Total Predictions}}$
Problem: Doesn't account for class imbalance or error costs!
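To make the accuracy trap concrete, here is a minimal sketch (assuming NumPy and scikit-learn are available; the data is synthetic) in which an always-negative classifier scores roughly 99% accuracy yet 0% recall:

```python
# Minimal sketch (NumPy + scikit-learn assumed): an "always negative" model
# on a ~99:1 imbalanced dataset looks great on accuracy and useless on recall.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(42)
y_true = (rng.random(10_000) < 0.01).astype(int)  # ~1% positive ("cancer") cases
y_pred = np.zeros_like(y_true)                    # always predict "no cancer"

print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")  # ~0.99
print(f"Recall:   {recall_score(y_true, y_pred):.3f}")    # 0.00 -- misses every case
```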
2. The Confusion Matrix: Foundation of Classification Metrics
Understanding the Four Quadrants
For binary classification:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |
Definitions:
- TP: Correctly predicted positive
- TN: Correctly predicted negative
- FP: Incorrectly predicted positive (Type I error)
- FN: Incorrectly predicted negative (Type II error)
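As a quick illustration (scikit-learn assumed, labels made up for this example), all four counts can be read straight out of `sklearn.metrics.confusion_matrix`:

```python
# Short sketch: build a confusion matrix and unpack its four quadrants.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]  # hypothetical ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]  # hypothetical model predictions

# For binary labels {0, 1}, sklearn returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=4, FP=1, FN=2
```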
3. Key Classification Metrics
Precision and Recall
Precision: Of all predicted positives, how many are correct? $$\text{Precision} = \frac{TP}{TP + FP}$$
Recall (Sensitivity): Of all actual positives, how many did we catch? $$\text{Recall} = \frac{TP}{TP + FN}$$
Tradeoff: High precision → fewer false alarms; High recall → catch more positives
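A small sketch (scikit-learn assumed, reusing the hypothetical labels from the confusion-matrix example) computing both metrics by hand and with the library helpers:

```python
# Precision and recall: manual formulas vs. scikit-learn built-ins.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

print(f"Precision: {tp / (tp + fp):.2f} (manual) vs {precision_score(y_true, y_pred):.2f} (sklearn)")
print(f"Recall:    {tp / (tp + fn):.2f} (manual) vs {recall_score(y_true, y_pred):.2f} (sklearn)")
```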
F1-Score: Harmonic Mean
F1-Score: Harmonic mean of precision and recall $$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
Use: When you want to balance precision and recall
Variants: The $F_\beta$ score treats recall as $\beta$ times as important as precision
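A short sketch (scikit-learn assumed, same hypothetical labels as above) comparing $F_1$ with recall-leaning and precision-leaning $F_\beta$ variants:

```python
# F1 vs. F-beta: beta > 1 favors recall, beta < 1 favors precision.
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

print(f"F1:   {f1_score(y_true, y_pred):.2f}")
print(f"F2:   {fbeta_score(y_true, y_pred, beta=2):.2f}")    # leans toward recall
print(f"F0.5: {fbeta_score(y_true, y_pred, beta=0.5):.2f}")  # leans toward precision
```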
4. ROC Curve and AUC
Receiver Operating Characteristic (ROC)
ROC Curve: Plots True Positive Rate (TPR) vs False Positive Rate (FPR) at various thresholds
- TPR (Recall): $\frac{TP}{TP + FN}$
- FPR: $\frac{FP}{FP + TN}$
AUC (Area Under Curve): Single-number summary
- AUC = 1.0: Perfect classifier
- AUC = 0.5: Random guessing
- AUC < 0.5: Worse than random (flip predictions!)
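A compact sketch (scikit-learn assumed, synthetic data) that computes the ROC curve and its AUC; note that both are computed from predicted probabilities or scores, not hard class labels:

```python
# ROC curve points and ROC-AUC for a logistic regression on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]         # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)   # one (FPR, TPR) point per threshold
print(f"ROC-AUC: {roc_auc_score(y_test, scores):.3f}")  # 1.0 = perfect, 0.5 = random
```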
5. Regression Metrics
Mean Squared Error (MSE) and Variants
MSE: $\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$
RMSE: $\sqrt{\text{MSE}}$ (same units as target)
MAE: $\frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|$ (less sensitive to outliers)
R² Score: Proportion of variance explained (1.0 = perfect, 0.0 = no better than predicting the mean, <0 = worse than the mean)
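A minimal sketch (NumPy and scikit-learn assumed, hypothetical values) scoring one set of predictions with all four metrics; note how the single large miss inflates MSE and RMSE far more than MAE:

```python
# MSE, RMSE, MAE, and R^2 on the same hypothetical predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.0])
y_pred = np.array([2.8, 5.3, 2.7, 6.8, 9.0])  # last prediction is a big miss

mse = mean_squared_error(y_true, y_pred)
print(f"MSE:  {mse:.2f}")
print(f"RMSE: {np.sqrt(mse):.2f}")  # back in the target's units
print(f"MAE:  {mean_absolute_error(y_true, y_pred):.2f}")
print(f"R2:   {r2_score(y_true, y_pred):.2f}")
```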
6. Choosing the Right Metric
Decision Guide
| Problem Type | Metric | When to Use |
|---|---|---|
| Binary Classification | Accuracy | Balanced classes, equal error costs |
| | Precision | Minimize false positives (spam detection) |
| | Recall | Minimize false negatives (cancer detection) |
| | F1-Score | Balance precision & recall |
| | ROC-AUC | Ranking quality, balanced classes |
| | PR-AUC | Imbalanced classes |
| Multi-class | Macro/Micro F1 | Unweighted per-class average (macro) or instance-level aggregate (micro) |
| Regression | MSE/RMSE | General purpose, penalize large errors |
| | MAE | Robust to outliers |
| | R² | Explained variance |
| | MAPE | Percentage errors matter |
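To illustrate the ROC-AUC vs. PR-AUC rows, here is a hedged sketch (scikit-learn assumed, synthetic ~99:1 imbalanced data) where the two summaries can tell quite different stories, because the random-guess baseline for average precision is the positive rate (~0.01) rather than 0.5:

```python
# ROC-AUC vs. PR-AUC (average precision) on a heavily imbalanced synthetic problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.99], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]
print(f"ROC-AUC: {roc_auc_score(y_test, scores):.3f}")            # can look strong
print(f"PR-AUC:  {average_precision_score(y_test, scores):.3f}")  # often much lower here
```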
Key Takeaways
✓ Accuracy is not enough: Misleading for imbalanced datasets and asymmetric costs
✓ Confusion Matrix: Foundation for all classification metrics (TP, TN, FP, FN)
✓ Precision: Of predicted positives, how many are correct? (Minimize FP)
✓ Recall: Of actual positives, how many did we catch? (Minimize FN)
✓ F1-Score: Harmonic mean of precision and recall (balance both)
✓ ROC-AUC: Ranking quality across all thresholds (good for balanced data)
✓ PR-AUC: Better than ROC-AUC for imbalanced data
✓ Regression: MSE (penalizes large errors), MAE (robust to outliers), R² (variance explained)
✓ Choose metric: Based on problem type, class balance, and business costs!
Practice Problems
Problem 1: Compute Metrics from Confusion Matrix
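A possible starting point (the counts below are hypothetical placeholders, not necessarily the lesson's original numbers):

```python
# Derive accuracy, precision, recall, and F1 directly from TP/TN/FP/FN counts.
tp, fn = 40, 10    # hypothetical confusion-matrix counts
fp, tn = 20, 930

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # (TP + TN) / total
print(f"Precision: {precision:.3f}")  # TP / (TP + FP)
print(f"Recall:    {recall:.3f}")     # TP / (TP + FN)
print(f"F1-score:  {f1:.3f}")
```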
Problem 2: Choose the Right Metric
Next Steps
You now understand how to properly evaluate models!
Next lesson: Cross-Validation Strategies – how to get reliable performance estimates and avoid overfitting to validation data.
Proper evaluation + proper validation = trustworthy models!
Further Reading
- Paper: The Relationship Between Precision-Recall and ROC Curves
- Guide: Classification Metrics
- Tutorial: Beyond Accuracy: Precision and Recall
Remember: The metric you optimize is the behavior you get. Choose wisely!