Introduction: Raw Data Is Rarely Perfect
Imagine you're a chef. You could serve customers raw ingredients – flour, eggs, sugar – or you could transform them into a delicious cake!
Feature engineering is the art of transforming raw data into features that help models learn better. It's often more important than choosing the "best" algorithm!
As Andrew Ng famously said: "Applied machine learning is basically feature engineering."
Key Insight: Good features can make a simple model outperform a complex one. Feature engineering is where domain knowledge meets data science creativity!
Learning Objectives
- Understand why feature engineering matters
- Master common transformations (scaling, encoding, binning)
- Create polynomial and interaction features
- Extract temporal and cyclical features
- Handle missing values effectively
- Apply domain-specific feature engineering
- Avoid common pitfalls and data leakage
1. Why Feature Engineering Matters
Features > Algorithms (Often!)
Example: Predicting house prices
- Bad features: Raw pixel values of a house photo
- Good features: Square footage, number of bedrooms, neighborhood
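To make the point concrete, here is a minimal sketch on synthetic data. It assumes a made-up pricing rule in which price depends on floor area; the lot dimensions, noise level, and the helper name holdout_r2 are illustrative assumptions, not real housing data. The same linear model is fit once on raw length and width and once on the engineered area feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Hypothetical raw measurements (synthetic, for illustration only)
length = rng.uniform(5, 30, n)
width = rng.uniform(5, 30, n)

# Assumed pricing rule: price is proportional to area, plus noise
price = 1_000 * length * width + rng.normal(0, 20_000, n)

X_raw = np.column_stack([length, width])   # raw features
X_eng = np.column_stack([length * width])  # engineered feature: area


def holdout_r2(X, y):
    """Fit a linear model on a training split and return R^2 on the held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    return LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)


r2_raw = holdout_r2(X_raw, price)
r2_eng = holdout_r2(X_eng, price)
print(f"R^2 with raw length/width: {r2_raw:.3f}")
print(f"R^2 with engineered area:  {r2_eng:.3f}")
```

The algorithm is identical in both runs; only the feature changes. A linear model cannot represent the product of length and width on its own, so supplying the area directly lets it fit the assumed relationship.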
2. Scaling and Normalization
Making Features Comparable
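Problem: Features with larger numeric ranges can dominate models that rely on distances or gradient magnitudes (for example k-NN, SVMs, and regularized linear models)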
Solution: Scale all features to similar ranges
Methods:
- StandardScaler: \(z = \frac{x - \mu}{\sigma}\) (mean=0, std=1)
- MinMaxScaler: \(x' = \frac{x - \min}{\max - \min}\) (range [0, 1])
- RobustScaler: Uses median and IQR (robust to outliers)
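The three scalers can be compared side by side. This is a small sketch using scikit-learn's StandardScaler, MinMaxScaler, and RobustScaler on a toy column whose last value is a deliberate outlier; the data is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature, five samples; 100.0 is an intentional outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(X)  # (x - mean) / std
minmax = MinMaxScaler().fit_transform(X)      # (x - min) / (max - min)
robust = RobustScaler().fit_transform(X)      # (x - median) / IQR

for name, scaled in [("standard", standard), ("minmax", minmax), ("robust", robust)]:
    print(f"{name:>8}: {np.round(scaled.ravel(), 3)}")
```

With the outlier present, MinMaxScaler squeezes the four ordinary values close to 0, while RobustScaler keeps them evenly spread because the median and IQR are barely affected by the extreme value.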
3. Encoding Categorical Features
From Categories to Numbers
Problem: Most ML algorithms need numerical input
Solutions:
One-Hot Encoding
Convert each category to a binary feature: \(N\) categories → \(N\) features
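A minimal sketch of one-hot encoding with pandas.get_dummies follows. The color column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical nominal feature with three categories
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One binary column per category; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(df, columns=["color"], dtype=int)
print(one_hot)
```

Each row has exactly one 1 across the new columns. When the same encoding has to be reused on unseen data, scikit-learn's OneHotEncoder with handle_unknown="ignore" is usually the safer choice, since it remembers the categories seen during fit.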
Label/Ordinal Encoding
Assign an integer to each category (preserving the order for ordinal data)
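For ordinal data the category order has to be given explicitly, otherwise the encoder falls back to sorting the values. A short sketch with scikit-learn's OrdinalEncoder, using a hypothetical size column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature
df = pd.DataFrame({"size": ["medium", "small", "large", "small"]})

# Pass the order explicitly: small < medium < large
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = encoder.fit_transform(df[["size"]]).ravel()
print(df)
```

Without the categories argument the values would be sorted alphabetically (large, medium, small), which would reverse the intended ordering.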
Target Encoding
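Replace each category with the mean target value for that category, computed on the training data only (powerful but risky: it can leak the target and overfit on rare categories!)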
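One way to reduce the risk is to fit the encoding on training rows only and shrink rare categories toward the global mean. The sketch below is one possible implementation under those assumptions; the function name fit_target_encoding, the smoothing value, and the toy data are illustrative, not a standard API.

```python
import pandas as pd

# Toy data (hypothetical): the encoding is learned from train and applied to test
train = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "C"],
    "target": [1, 1, 0, 0, 0, 1],
})
test = pd.DataFrame({"city": ["A", "B", "C", "D"]})  # "D" never appears in train


def fit_target_encoding(df, column, target, smoothing=2.0):
    """Return (per-category smoothed target means, global target mean)."""
    global_mean = df[target].mean()
    stats = df.groupby(column)[target].agg(["mean", "count"])
    # Weighted blend: categories with few rows are pulled toward the global mean
    smoothed = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return smoothed, global_mean


mapping, global_mean = fit_target_encoding(train, "city", "target")

# Unseen categories fall back to the global mean
test["city_encoded"] = test["city"].map(mapping).fillna(global_mean)
print(test)
```

Category C has a single training row with target 1, so its raw mean would be 1.0; smoothing pulls it down to about 0.67. Encoding the training rows themselves still needs care (for example out-of-fold means), which this sketch does not cover.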
4. Polynomial and Interaction Features
Capturing Non-Linear Relationships
Polynomial Features: Add powers of features
\[x, x^2, x^3, \dots\]
Interaction Features: Add products of features
\[x_1, x_2, x_1 \cdot x_2, x_1^2, x_2^2, \dots\]
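Both expansions are available through scikit-learn's PolynomialFeatures. A small sketch on a single made-up sample with two features:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two features: x1 = 2, x2 = 3
X = np.array([[2.0, 3.0]])

# Degree-2 expansion: x1, x2, x1^2, x1*x2, x2^2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Interaction terms only: x1, x2, x1*x2 (no squared terms)
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = inter.fit_transform(X)

print(X_poly)
print(X_inter)
```

The number of generated columns grows quickly with the degree and the number of input features, so higher degrees are usually paired with regularization or feature selection.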
5. Temporal Features
Extracting Time-Based Patterns
From timestamps, extract:
- Year, month, day, hour, minute, weekday
- Season, quarter
- Time since event
- Cyclical encodings (sin/cos for periodic features)
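The extractions listed above can be sketched with the pandas .dt accessor. The timestamps are made up, and the column names (is_weekend, days_since_first, and so on) are illustrative choices.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-15 00:00", "2024-06-30 06:00", "2024-12-25 23:00"]
    )
})

# Calendar components
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df["quarter"] = df["timestamp"].dt.quarter
df["weekday"] = df["timestamp"].dt.dayofweek  # Monday=0 ... Sunday=6
df["hour"] = df["timestamp"].dt.hour
df["is_weekend"] = df["weekday"] >= 5

# Time since an event (here: the earliest timestamp in the data)
df["days_since_first"] = (df["timestamp"] - df["timestamp"].min()).dt.days

# Cyclical encoding: map the hour onto a circle so 23:00 and 00:00 end up close
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)

print(df)
```

With the raw hour, 23 and 0 look 23 units apart. In the (sin, cos) plane they are neighbors, which matches how time of day actually behaves.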
6. Handling Missing Values
Strategies for Incomplete Data
Methods:
- Drop: Remove rows/columns with missing values
- Mean/Median Imputation: Fill with central tendency
- Mode Imputation: For categorical features
- Forward/Backward Fill: For time series
- Model-Based: Predict missing values
- Add Indicator: Flag whether value was missing
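Several of these strategies can be sketched with plain pandas. The columns and values below are hypothetical; in a real project the fill values (median, mode) should be computed from the training split only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Paris", None, "Rome", "Paris"],
})

# Add indicator: record which rows were missing BEFORE filling them
df["age_missing"] = df["age"].isna().astype(int)

# Median imputation for a numeric feature
df["age"] = df["age"].fillna(df["age"].median())

# Mode imputation for a categorical feature
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Forward fill for an ordered series: carry the last observed value forward
readings = pd.Series([1.0, np.nan, np.nan, 4.0])
readings_filled = readings.ffill()

print(df)
print(readings_filled.tolist())
```

scikit-learn's SimpleImputer covers the same ground (for example strategy="median" together with add_indicator=True) and fits naturally into a Pipeline.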
7. Domain-Specific Feature Engineering
Real-World Examples
Example 1: E-commerce
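As an illustration of ratios, rates, and flags in an e-commerce setting, here is a sketch on a hypothetical per-customer table. Every column name and the 180-day lapse threshold are assumptions chosen for the example, not part of any real dataset.

```python
import pandas as pd

# Hypothetical per-customer aggregates
customers = pd.DataFrame({
    "total_spent": [250.0, 40.0, 0.0],
    "n_orders": [5, 1, 0],
    "n_sessions": [20, 4, 3],
    "days_since_last_order": [3, 200, 400],
})

# Ratio: average order value, guarding against division by zero orders
orders_nonzero = customers["n_orders"].where(customers["n_orders"] > 0)
customers["avg_order_value"] = (customers["total_spent"] / orders_nonzero).fillna(0.0)

# Rate: share of sessions that ended in an order
customers["conversion_rate"] = customers["n_orders"] / customers["n_sessions"]

# Flag: customer has not ordered for more than 180 days (assumed threshold)
customers["is_lapsed"] = (customers["days_since_last_order"] > 180).astype(int)

print(customers)
```

None of these columns adds new information in a strict sense, but each one encodes a relationship that a domain expert would consider meaningful and that a model might otherwise struggle to learn from the raw counts.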
Key Takeaways
✓ Feature engineering often matters more than algorithm choice
✓ Scaling: StandardScaler (Gaussian), MinMaxScaler (bounded), RobustScaler (outliers)
✓ Encoding: One-hot (nominal), ordinal (ordered), target (powerful but risky)
✓ Polynomial/Interactions: Capture non-linear relationships
✓ Temporal: Extract components, create flags, use cyclical encoding (sin/cos)
✓ Missing values: Mean/median, model-based, add indicators
✓ Domain knowledge: Create ratios, rates, flags, compound features
✓ Always: Fit feature transformations INSIDE CV folds (use Pipeline) to avoid data leakage
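The last takeaway can be sketched as follows: wrapping the scaler and the model in a scikit-learn Pipeline means cross_val_score refits the scaler on the training part of every fold, so statistics from the validation part never reach the transformer. The synthetic dataset and the choice of LogisticRegression are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data (illustration only)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# The scaler is a pipeline step, so it is fit separately inside each CV fold
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean accuracy:   {scores.mean():.3f}")
```

The leaky alternative is to call fit_transform on the full dataset first and cross-validate afterwards. That lets the validation rows influence the scaling statistics and tends to make scores look better than they are.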
Practice Problems
Problem 1: Engineer Features for House Prices
Problem 2: Time-Based Feature Engineering
Next Steps
You've mastered feature engineering! Next: Feature Selection – choosing which features to keep and which to discard.
Not all features are useful. Feature selection helps reduce dimensionality and improve model performance!
Further Reading
- Book: Feature Engineering for Machine Learning by Alice Zheng & Amanda Casari
- Article: Feature Engineering Tips
- Tutorial: Feature Engineering in Python
Remember: "Applied machine learning is basically feature engineering!" – Andrew Ng