PYTHON FOR DATA SCIENCE: FROM ARRAYS TO ANALYSIS / L04DATA WRANGLING WITH PANDAS: TRANSFORM AND CLEAN
课程 · 10 · 04 / 10
LESSON 04 · INTERMEDIATE · 75 MIN · ◆ 4 INSTRUMENTS

Data Wrangling with Pandas: Transform and Clean

Learn essential data wrangling techniques: merging datasets, grouping operations, pivot tables, and handling missing data like a pro.

TIP

Learning Objectives: After this lesson, you'll master essential data wrangling techniques—merging datasets, grouping operations, pivot tables, and handling missing data to prepare real-world data for analysis.

The Art of Data Wrangling

Real-world data is messy. It comes from multiple sources, has missing values, duplicates, and inconsistent formats. Data wrangling (also called data munging) is the process of transforming raw data into a clean, analysis-ready format.

FIG. 02Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 02Interactive Python code execution environment

Handling Missing Data

Missing data is one of the most common data quality issues. Pandas represents missing values as NaN (Not a Number) or None.

Detecting Missing Values

FIG. 04Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 04Interactive Python code execution environment

Handling Missing Values

FIG. 06Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 06Interactive Python code execution environment

Imputation Strategies

FIG. 08Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 08Interactive Python code execution environment

Merging and Joining DataFrames

Combining data from multiple sources is a fundamental data wrangling task. The diagram below shows how different join types work:

FIG. 10Flow Diagram
INTERACTIVE
LOADING INSTRUMENT
Fig. 10Interactive flow diagrams, timelines, and process visualizations

Types of Joins

FIG. 12Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 12Interactive Python code execution environment

Merging on Different Column Names

FIG. 14Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 14Interactive Python code execution environment

Concatenating DataFrames

FIG. 16Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 16Interactive Python code execution environment

Grouping and Aggregation

Grouping lets you split data into groups, apply functions, and combine results—essential for data analysis. Explore the data below to understand aggregation patterns:

UNKNOWN COMPONENT
DataFrameExplorer

Here's a visual representation of groupby results:

FIG. 20Graph Plotter
INTERACTIVE
LOADING INSTRUMENT
Fig. 20Interactive plotting tool for visualizing data and relationships

Basic Groupby

FIG. 22Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 22Interactive Python code execution environment

Multiple Grouping Columns

FIG. 24Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 24Interactive Python code execution environment

Custom Aggregation Functions

FIG. 26Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 26Interactive Python code execution environment

Transform: Apply Group-Level Values Back to Rows

FIG. 28Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 28Interactive Python code execution environment

Pivot Tables

Pivot tables reshape data for cross-tabulation analysis—like Excel pivot tables.

FIG. 30Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 30Interactive Python code execution environment

String Operations

Pandas provides vectorized string operations for cleaning text data.

FIG. 32Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 32Interactive Python code execution environment

Text Pattern Matching

FIG. 34Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 34Interactive Python code execution environment

Date and Time Operations

FIG. 36Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 36Interactive Python code execution environment

Date Calculations

FIG. 38Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 38Interactive Python code execution environment

Practical Example: Complete Data Wrangling Pipeline

FIG. 40Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 40Interactive Python code execution environment

Key Takeaways

Missing data can be detected with isnull() and handled with dropna() or fillna()

Merging combines DataFrames with merge() (like SQL joins) and concat() (stacking)

Groupby splits data into groups for aggregation—fundamental for data analysis

Pivot tables reshape data for cross-tabulation analysis

String operations via .str accessor clean and transform text data

Date operations via .dt accessor extract components and perform calculations

Connections: Data Wrangling in Practice

🔗 Connection to SQL

SQLPandas
JOINpd.merge()
UNIONpd.concat()
GROUP BYdf.groupby()
COALESCEfillna()
CASE WHENnp.where() or apply()

🔗 Connection to Machine Learning

Data wrangling is 80% of a data scientist's job:

  • Feature engineering: Creating new columns from existing ones
  • Data cleaning: Handling missing values before training
  • Data integration: Combining multiple data sources
  • Normalization: Standardizing formats for model input

Practice Exercises

Exercise 1: Join and Aggregate

FIG. 42Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 42Interactive Python code execution environment

Exercise 2: Clean and Transform

FIG. 44Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 44Interactive Python code execution environment

Next Steps

In the next lesson, we'll explore Data Input/Output—loading data from various sources (CSV, JSON, Excel, APIs) and saving your analysis results.


Ready to work with real data files? Let's learn data I/O!


Further Reading

Visualize It

Official Docs

Tutorials

Speed When pandas Isn't Fast Enough

Books

  • Book: Python for Data Analysis (3rd ed.) — Wes McKinney (free). Chapters 7–10 cover wrangling.
  • Book: Effective Pandas — Matt Harrison. Method-chained idioms throughout.