Data Wrangling with Pandas: Transform and Clean

Learning Objectives: After this lesson, you'll master essential data wrangling techniques—merging datasets, grouping operations, pivot tables, and handling missing data to prepare real-world data for analysis.

The Art of Data Wrangling

Real-world data is messy. It comes from multiple sources, has missing values, duplicates, and inconsistent formats. Data wrangling (also called data munging) is the process of transforming raw data into a clean, analysis-ready format.

Loading tool...

Handling Missing Data

Missing data is one of the most common data quality issues. Pandas represents missing values as NaN (Not a Number) or None.

Detecting Missing Values

Loading tool...

Handling Missing Values

Loading tool...

Imputation Strategies

Loading tool...

Merging and Joining DataFrames

Combining data from multiple sources is a fundamental data wrangling task. The diagram below shows how different join types work:

Loading tool...

Types of Joins

Loading tool...

Merging on Different Column Names

Loading tool...

Concatenating DataFrames

Loading tool...

Grouping and Aggregation

Grouping lets you split data into groups, apply functions, and combine results—essential for data analysis. Explore the data below to understand aggregation patterns:

Unknown component: DataFrameExplorer

Basic Groupby

Loading tool...

Multiple Grouping Columns

Loading tool...

Custom Aggregation Functions

Loading tool...

Transform: Apply Group-Level Values Back to Rows

Loading tool...

Pivot Tables

Pivot tables reshape data for cross-tabulation analysis—like Excel pivot tables.

Loading tool...

String Operations

Pandas provides vectorized string operations for cleaning text data.

Loading tool...

Text Pattern Matching

Loading tool...

Date and Time Operations

Loading tool...

Date Calculations

Loading tool...

Practical Example: Complete Data Wrangling Pipeline

Loading tool...

Key Takeaways

Missing data can be detected with isnull() and handled with dropna() or fillna()

Merging combines DataFrames with merge() (like SQL joins) and concat() (stacking)

Groupby splits data into groups for aggregation—fundamental for data analysis

Pivot tables reshape data for cross-tabulation analysis

String operations via .str accessor clean and transform text data

Date operations via .dt accessor extract components and perform calculations

Connections: Data Wrangling in Practice

🔗 Connection to SQL

SQLPandas
JOINpd.merge()
UNIONpd.concat()
GROUP BYdf.groupby()
COALESCEfillna()
CASE WHENnp.where() or apply()

🔗 Connection to Machine Learning

Data wrangling is 80% of a data scientist's job:

  • Feature engineering: Creating new columns from existing ones
  • Data cleaning: Handling missing values before training
  • Data integration: Combining multiple data sources
  • Normalization: Standardizing formats for model input

Practice Exercises

Exercise 1: Join and Aggregate

Loading tool...

Exercise 2: Clean and Transform

Loading tool...

Next Steps

In the next lesson, we'll explore Data Input/Output—loading data from various sources (CSV, JSON, Excel, APIs) and saving your analysis results.


Ready to work with real data files? Let's learn data I/O!