Introduction: Raw Data Is Rarely Perfect
Imagine you're a chef. You could serve customers raw ingredients – flour, eggs, sugar – or you could transform them into a delicious cake!
Feature engineering is the art of transforming raw data into features that help models learn better. It's often more important than choosing the "best" algorithm!
As Andrew Ng famously said: "Applied machine learning is basically feature engineering."
Key Insight: Good features can make a simple model outperform a complex one. Feature engineering is where domain knowledge meets data science creativity!
Learning Objectives
- Understand why feature engineering matters
- Master common transformations (scaling, encoding, binning)
- Create polynomial and interaction features
- Extract temporal and cyclical features
- Handle missing values effectively
- Apply domain-specific feature engineering
- Avoid common pitfalls and data leakage
1. Why Feature Engineering Matters
Features > Algorithms (Often!)
Example: Predicting house prices
- Bad features: Raw pixel values of a house photo
- Good features: Square footage, number of bedrooms, neighborhood
Interactive Feature Engineering Workbench
Let's explore feature transformations interactively! This workbench lets you try scaling, polynomial features, encoding, and feature selection – all the key techniques for transforming raw data.
2. Scaling and Normalization
Making Features Comparable
Problem: Features on larger scales can dominate distance-based models and slow gradient-based training
Solution: Scale all features to similar ranges
Methods (see the sketch after this list):
- StandardScaler: \( z = \frac{x - \mu}{\sigma} \) (mean = 0, std = 1)
- MinMaxScaler: \( x' = \frac{x - \min}{\max - \min} \) (range [0, 1])
- RobustScaler: Uses the median and IQR (robust to outliers)
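Here's a minimal sketch of all three scalers on the same toy matrix (the income/age columns are made up for illustration). Notice how the income outlier in the last row squeezes MinMaxScaler's output, while RobustScaler keeps the bulk of the data near zero:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Toy data: income (dollars) and age (years) live on very different scales,
# and the last row is an income outlier.
X = np.array([[30_000, 25],
              [52_000, 31],
              [61_000, 45],
              [250_000, 38]])

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    print(scaler.__class__.__name__)
    print(scaler.fit_transform(X).round(2))
```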
3. Encoding Categorical Features
From Categories to Numbers
Problem: Most ML algorithms need numerical input
Solutions:
One-Hot Encoding
Convert each category to a binary feature: \(N\) categories → \(N\) features
Label/Ordinal Encoding
Assign an integer to each category (preserves order for ordinal data)
Target Encoding
Replace each category with the mean target value for that category (powerful but risky – it can leak the target into your features if you don't cross-fit!)
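A short sketch of the first two encoders, assuming scikit-learn ≥ 1.2 (where `sparse_output` replaced the older `sparse` flag). For the third, scikit-learn ≥ 1.3 ships a `TargetEncoder` that cross-fits during `fit_transform` to blunt the leakage risk noted above:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],      # nominal: no order
    "size":  ["small", "large", "medium", "small"],  # ordinal: has an order
})

# One-hot: 3 colors -> 3 binary columns
onehot = OneHotEncoder(sparse_output=False)
print(onehot.fit_transform(df[["color"]]))

# Ordinal: pass the order explicitly so small < medium < large survives encoding
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
print(ordinal.fit_transform(df[["size"]]))
```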
4. Polynomial and Interaction Features
Capturing Non-Linear Relationships
Polynomial Features: Add powers of features
\[ x,\ x^2,\ x^3,\ \dots \]
Interaction Features: Add products of features
\[ x_1,\ x_2,\ x_1 \cdot x_2,\ x_1^2,\ x_2^2,\ \dots \]
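A quick sketch with scikit-learn's `PolynomialFeatures`, showing both variants on a single two-feature sample:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # one sample with features x1 and x2

# Full degree-2 expansion: x1, x2, x1^2, x1*x2, x2^2
poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(X))                      # [[2. 3. 4. 6. 9.]]
print(poly.get_feature_names_out(["x1", "x2"]))

# interaction_only=True keeps cross-products but drops pure powers
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(inter.fit_transform(X))                     # [[2. 3. 6.]]
```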
5. Temporal Features
Extracting Time-Based Patterns
From timestamps, extract:
- Year, month, day, hour, minute, weekday
- Season, quarter
- Time since event
- Cyclical encodings (sin/cos for periodic features – see the sketch below)
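A minimal pandas sketch of component extraction plus the sin/cos trick. The point of the cyclical encoding is that hour 23 and hour 1 land close together on the circle, which the raw hour number doesn't capture:

```python
import numpy as np
import pandas as pd

ts = pd.DataFrame({"timestamp": pd.to_datetime(
    ["2024-01-15 08:30", "2024-06-01 23:10", "2025-01-01 01:00"])})

# Plain component extraction
ts["hour"] = ts["timestamp"].dt.hour
ts["weekday"] = ts["timestamp"].dt.dayofweek
ts["month"] = ts["timestamp"].dt.month

# Cyclical encoding: maps the hour onto a circle so 23:00 and 01:00 are neighbors
ts["hour_sin"] = np.sin(2 * np.pi * ts["hour"] / 24)
ts["hour_cos"] = np.cos(2 * np.pi * ts["hour"] / 24)
print(ts)
```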
6. Handling Missing Values
Strategies for Incomplete Data
Methods:
- Drop: Remove rows/columns with missing values
- Mean/Median Imputation: Fill with central tendency
- Mode Imputation: For categorical features
- Forward/Backward Fill: For time series
- Model-Based: Predict missing values
- Add Indicator: Flag whether a value was missing (see the sketch below)
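A minimal sketch combining median imputation with a missingness indicator, using scikit-learn's `SimpleImputer`:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, np.nan]])

# Fill NaNs with each column's median; add_indicator=True appends one binary
# column per feature that had missing values, preserving the missingness signal
imputer = SimpleImputer(strategy="median", add_indicator=True)
print(imputer.fit_transform(X))
```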
7. Domain-Specific Feature Engineering
Real-World Examples
Example 1: E-commerce
Typical engineered features are ratios, rates, and flags built from raw order logs: average order value (total spend ÷ number of orders), days since last purchase, and items per order.
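As a hedged sketch (the order log and all column names here are hypothetical), building such ratio features from per-customer aggregates might look like this:

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "order_value": [120.0, 80.0, 40.0],
    "n_items":     [4, 2, 1],
})

# Aggregate raw order rows into one feature row per customer
feats = orders.groupby("customer_id").agg(
    total_spend=("order_value", "sum"),
    n_orders=("order_value", "count"),
    avg_items=("n_items", "mean"),
)
# Ratio feature: average order value = spend per order
feats["avg_order_value"] = feats["total_spend"] / feats["n_orders"]
print(feats)
```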
Key Takeaways
✓ Feature engineering often matters more than algorithm choice
✓ Scaling: StandardScaler (Gaussian), MinMaxScaler (bounded), RobustScaler (outliers)
✓ Encoding: One-hot (nominal), ordinal (ordered), target (powerful but risky)
✓ Polynomial/Interactions: Capture non-linear relationships
✓ Temporal: Extract components, create flags, use cyclical encoding (sin/cos)
✓ Missing values: Mean/median, model-based, add indicators
✓ Domain knowledge: Create ratios, rates, flags, compound features
✓ Always: Do feature engineering INSIDE CV folds – wrapping it in a Pipeline makes this automatic (see the sketch below)
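To make that last takeaway concrete, here's a minimal sketch: because the scaler lives inside the Pipeline, `cross_val_score` re-fits it on each training fold, so the test fold never leaks into preprocessing.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# Scaler + model travel together: each CV fold fits its own scaler
pipe = make_pipeline(StandardScaler(), LogisticRegression())
print(cross_val_score(pipe, X, y, cv=5).mean())
```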
Practice Problems
Problem 1: Engineer Features for House Prices
Problem 2: Time-Based Feature Engineering
Next Steps
You've mastered feature engineering! Next: Feature Selection – choosing which features to keep and which to discard.
Not all features are useful. Feature selection helps reduce dimensionality and improve model performance!
Further Reading
Interactive Visualizations
- Compare the Effect of Different Scalers on Data with Outliers — scikit-learn's live gallery comparing StandardScaler, RobustScaler, QuantileTransformer, and PowerTransformer on the same distribution.
- scikit-learn — Encoding Categorical Features — side-by-side visual of one-hot, ordinal, and target encoders.
- Kaggle — Feature Engineering Tutorial — interactive notebooks you can fork and run.
- Feature Engineering for Time Series — interactive charts of lag, rolling, and cyclical features.
Video Tutorials
- StatQuest — Feature Encoding (Categorical → Numeric) — short intuition for each common encoder.
- Google ML Crash Course — Feature Engineering — interactive lessons on numeric features, binning, and categorical encodings.
- Kaggle Notebook Club — Target Encoding Demystified — fixes the leakage trap this lesson warns about.
Papers & Articles
- A Comparative Study of Categorical Variable Encoding Techniques for Neural Network Classifiers — Potdar, Pardawala, Pai, 2017.
- Beyond One-Hot: An Exploration of Categorical Variables — Moeyersoms & Martens.
- A Tutorial on Time-Series Feature Engineering — Barandas et al.
- AutoML: A Survey of the State-of-the-Art — He et al., 2019. What to automate vs. keep manual.
Documentation & Books
- Book: Feature Engineering for Machine Learning — Alice Zheng & Amanda Casari (O'Reilly).
- Book: Feature Engineering and Selection — Kuhn & Johnson (free online).
- scikit-learn: Preprocessing — the complete API reference.
- Feature-engine Library — a specialized library with 50+ production-tested transformers that plug into sklearn pipelines.
Remember: "Applied machine learning is basically feature engineering!" – Andrew Ng