Learning Objectives: After this lesson, you'll master essential data wrangling techniques—merging datasets, grouping operations, pivot tables, and handling missing data to prepare real-world data for analysis.
The Art of Data Wrangling
Real-world data is messy. It comes from multiple sources, has missing values, duplicates, and inconsistent formats. Data wrangling (also called data munging) is the process of transforming raw data into a clean, analysis-ready format.
Handling Missing Data
Missing data is one of the most common data quality issues. Pandas represents missing values as NaN (Not a Number) or None.
Detecting Missing Values
Handling Missing Values
Imputation Strategies
Merging and Joining DataFrames
Combining data from multiple sources is a fundamental data wrangling task. The diagram below shows how different join types work:
Types of Joins
Merging on Different Column Names
Concatenating DataFrames
Grouping and Aggregation
Grouping lets you split data into groups, apply functions, and combine results—essential for data analysis. Explore the data below to understand aggregation patterns:
Basic Groupby
Multiple Grouping Columns
Custom Aggregation Functions
Transform: Apply Group-Level Values Back to Rows
Pivot Tables
Pivot tables reshape data for cross-tabulation analysis—like Excel pivot tables.
String Operations
Pandas provides vectorized string operations for cleaning text data.
Text Pattern Matching
Date and Time Operations
Date Calculations
Practical Example: Complete Data Wrangling Pipeline
Key Takeaways
✅ Missing data can be detected with isnull() and handled with dropna() or fillna()
✅ Merging combines DataFrames with merge() (like SQL joins) and concat() (stacking)
✅ Groupby splits data into groups for aggregation—fundamental for data analysis
✅ Pivot tables reshape data for cross-tabulation analysis
✅ String operations via .str accessor clean and transform text data
✅ Date operations via .dt accessor extract components and perform calculations
Connections: Data Wrangling in Practice
🔗 Connection to SQL
| SQL | Pandas |
|---|---|
| JOIN | pd.merge() |
| UNION | pd.concat() |
| GROUP BY | df.groupby() |
| COALESCE | fillna() |
| CASE WHEN | np.where() or apply() |
🔗 Connection to Machine Learning
Data wrangling is 80% of a data scientist's job:
- Feature engineering: Creating new columns from existing ones
- Data cleaning: Handling missing values before training
- Data integration: Combining multiple data sources
- Normalization: Standardizing formats for model input
Practice Exercises
Exercise 1: Join and Aggregate
Exercise 2: Clean and Transform
Next Steps
In the next lesson, we'll explore Data Input/Output—loading data from various sources (CSV, JSON, Excel, APIs) and saving your analysis results.
Ready to work with real data files? Let's learn data I/O!