Data Wrangling with Pandas: Transform and Clean

Learning Objectives: After this lesson, you'll master essential data wrangling techniques—merging datasets, grouping operations, pivot tables, and handling missing data to prepare real-world data for analysis.

The Art of Data Wrangling

Real-world data is messy. It comes from multiple sources, has missing values, duplicates, and inconsistent formats. Data wrangling (also called data munging) is the process of transforming raw data into a clean, analysis-ready format.

Loading tool...

Handling Missing Data

Missing data is one of the most common data quality issues. Pandas represents missing values as NaN (Not a Number) or None.

Detecting Missing Values

Loading tool...

Handling Missing Values

Loading tool...

Imputation Strategies

Loading tool...

Merging and Joining DataFrames

Combining data from multiple sources is a fundamental data wrangling task. The diagram below shows how different join types work:

Loading tool...

Types of Joins

Loading tool...

Merging on Different Column Names

Loading tool...

Concatenating DataFrames

Loading tool...

Grouping and Aggregation

Grouping lets you split data into groups, apply functions, and combine results—essential for data analysis. Explore the data below to understand aggregation patterns:

Unknown component: DataFrameExplorer

Basic Groupby

Loading tool...

Multiple Grouping Columns

Loading tool...

Custom Aggregation Functions

Loading tool...

Transform: Apply Group-Level Values Back to Rows

Loading tool...

Pivot Tables

Pivot tables reshape data for cross-tabulation analysis—like Excel pivot tables.

Loading tool...

String Operations

Pandas provides vectorized string operations for cleaning text data.

Loading tool...

Text Pattern Matching

Loading tool...

Date and Time Operations

Loading tool...

Date Calculations

Loading tool...

Practical Example: Complete Data Wrangling Pipeline

Loading tool...

Key Takeaways

✅ Missing data can be detected with isnull() and handled with dropna() or fillna()

✅ Merging combines DataFrames with merge() (like SQL joins) and concat() (stacking)

✅ Groupby splits data into groups for aggregation—fundamental for data analysis

✅ Pivot tables reshape data for cross-tabulation analysis

✅ String operations via .str accessor clean and transform text data

✅ Date operations via .dt accessor extract components and perform calculations

Connections: Data Wrangling in Practice

🔗 Connection to SQL

SQL	Pandas
JOIN	`pd.merge()`
UNION	`pd.concat()`
GROUP BY	`df.groupby()`
COALESCE	`fillna()`
CASE WHEN	`np.where()` or `apply()`

🔗 Connection to Machine Learning

Data wrangling is 80% of a data scientist's job:

Feature engineering: Creating new columns from existing ones
Data cleaning: Handling missing values before training
Data integration: Combining multiple data sources
Normalization: Standardizing formats for model input

Practice Exercises

Exercise 1: Join and Aggregate

Loading tool...

Exercise 2: Clean and Transform

Loading tool...

Next Steps

In the next lesson, we'll explore Data Input/Output—loading data from various sources (CSV, JSON, Excel, APIs) and saving your analysis results.

Ready to work with real data files? Let's learn data I/O!

Python for Data Science: From Arrays to Analysis

Data Wrangling with Pandas: Transform and Clean

The Art of Data Wrangling

Handling Missing Data

Detecting Missing Values

Handling Missing Values

Imputation Strategies

Merging and Joining DataFrames

Types of Joins

Merging on Different Column Names

Concatenating DataFrames

Grouping and Aggregation

Basic Groupby

Multiple Grouping Columns

Custom Aggregation Functions

Transform: Apply Group-Level Values Back to Rows

Pivot Tables

String Operations

Text Pattern Matching

Date and Time Operations

Date Calculations

Practical Example: Complete Data Wrangling Pipeline

Key Takeaways

Connections: Data Wrangling in Practice

🔗 Connection to SQL

🔗 Connection to Machine Learning

Practice Exercises

Exercise 1: Join and Aggregate

Exercise 2: Clean and Transform

Next Steps