课程 · 10 · 04 / 10
Data Wrangling with Pandas: Transform and Clean
Learn essential data wrangling techniques: merging datasets, grouping operations, pivot tables, and handling missing data like a pro.
TIPLearning Objectives: After this lesson, you'll master essential data wrangling techniques—merging datasets, grouping operations, pivot tables, and handling missing data to prepare real-world data for analysis.
The Art of Data Wrangling
Real-world data is messy. It comes from multiple sources, has missing values, duplicates, and inconsistent formats. Data wrangling (also called data munging) is the process of transforming raw data into a clean, analysis-ready format.
Handling Missing Data
Missing data is one of the most common data quality issues. Pandas represents missing values as NaN (Not a Number) or None.
Detecting Missing Values
Handling Missing Values
Imputation Strategies
Merging and Joining DataFrames
Combining data from multiple sources is a fundamental data wrangling task. The diagram below shows how different join types work:
Types of Joins
Merging on Different Column Names
Concatenating DataFrames
Grouping and Aggregation
Grouping lets you split data into groups, apply functions, and combine results—essential for data analysis. Explore the data below to understand aggregation patterns:
Here's a visual representation of groupby results:
Basic Groupby
Multiple Grouping Columns
Custom Aggregation Functions
Transform: Apply Group-Level Values Back to Rows
Pivot Tables
Pivot tables reshape data for cross-tabulation analysis—like Excel pivot tables.
String Operations
Pandas provides vectorized string operations for cleaning text data.
Text Pattern Matching
Date and Time Operations
Date Calculations
Practical Example: Complete Data Wrangling Pipeline
Key Takeaways
✅ Missing data can be detected with isnull() and handled with dropna() or fillna()
✅ Merging combines DataFrames with merge() (like SQL joins) and concat() (stacking)
✅ Groupby splits data into groups for aggregation—fundamental for data analysis
✅ Pivot tables reshape data for cross-tabulation analysis
✅ String operations via .str accessor clean and transform text data
✅ Date operations via .dt accessor extract components and perform calculations
Connections: Data Wrangling in Practice
🔗 Connection to SQL
| SQL | Pandas |
|---|---|
| JOIN | pd.merge() |
| UNION | pd.concat() |
| GROUP BY | df.groupby() |
| COALESCE | fillna() |
| CASE WHEN | np.where() or apply() |
🔗 Connection to Machine Learning
Data wrangling is 80% of a data scientist's job:
- Feature engineering: Creating new columns from existing ones
- Data cleaning: Handling missing values before training
- Data integration: Combining multiple data sources
- Normalization: Standardizing formats for model input
Practice Exercises
Exercise 1: Join and Aggregate
Exercise 2: Clean and Transform
Next Steps
In the next lesson, we'll explore Data Input/Output—loading data from various sources (CSV, JSON, Excel, APIs) and saving your analysis results.
Ready to work with real data files? Let's learn data I/O!
Further Reading
Visualize It
- Pandas Tutor — paste a
groupby().agg()chain or amergeand step through it. Joins finally make sense. - Pandas Merge Methods (official diagram) — side-by-side
inner/outer/left/right.
Official Docs
- Pandas — Group By: split-apply-combine — the core mental model for aggregation.
- Pandas — Reshaping and Pivot Tables —
melt,pivot,stack,unstack,crosstab. - Pandas — Working with Missing Data —
NaN,NA,interpolate,dropna,fillna.
Tutorials
- Modern Pandas — Method Chaining — Tom Augspurger. The single biggest readability upgrade for pandas code.
- Real Python —
groupby()— the canonical groupby walkthrough. - Tidy Data (Hadley Wickham) — the conceptual foundation for
melt/pivotoperations.
Speed When pandas Isn't Fast Enough
- Polars Migration Guide — Polars' lazy expressions often beat pandas 5–50× on group-by/joins.
- DuckDB on Pandas — SQL is sometimes the right language for the wrangling step.
Books
- Book: Python for Data Analysis (3rd ed.) — Wes McKinney (free). Chapters 7–10 cover wrangling.
- Book: Effective Pandas — Matt Harrison. Method-chained idioms throughout.