Python for Data Science: From Arrays to Analysis · Lesson 05 of 10 · Intermediate · 45 min

Data Input/Output: Loading and Saving Data

Master reading and writing data in various formats: CSV, JSON, Excel, and more. Learn to fetch data from web APIs and handle different file encodings.

Learning Objectives: After this lesson, you will be able to read and write data in CSV, JSON, Excel, and other formats, fetch data from web APIs, and handle different file encodings.

The Data Lifecycle

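Data analysis starts with loading data and ends with saving results. Pandas covers both ends with paired read_* and to_* functions for each format, so the same workflow applies whether the data lives in CSV, JSON, Excel, SQL, or a binary file.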
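
The load, transform, save cycle can be sketched in a few lines. This is a minimal illustration, not the lesson's original example: the data, column names, and the in-memory buffer standing in for a file are all assumptions.

```python
import io

import pandas as pd

# Hypothetical raw data; a StringIO buffer stands in for a file on disk.
raw = io.StringIO("city,temp_c\nOslo,4.0\nCairo,31.5\n")

# 1. Load
df = pd.read_csv(raw)

# 2. Transform: derive a Fahrenheit column from the Celsius one
df["temp_f"] = df["temp_c"] * 9 / 5 + 32

# 3. Save: with no path argument, to_csv() returns the CSV text as a string
result = df.to_csv(index=False)
print(result)
```

In real use the buffer would be replaced by a file path for both the read and the write.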

Reading CSV Files

CSV (Comma-Separated Values) is one of the most common formats for tabular data. Pandas reads it with pd.read_csv(), which infers column names and types from the file.

Basic CSV Reading
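
A minimal sketch of a basic read, assuming a small made-up dataset. The CSV text is held in a string so the example runs without a file; with a real file you would pass its path to pd.read_csv() instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file such as "people.csv"
csv_text = """name,age,city
Alice,30,Paris
Bob,25,Berlin
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows
print(df.dtypes)   # inferred column types
print(df.shape)    # (rows, columns)
```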

Common CSV Options
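
The sketch below shows a handful of frequently used pd.read_csv() parameters on one made-up file. The data, the semicolon delimiter, and the "missing" marker are assumptions chosen for illustration; the parameters themselves (sep, skiprows, usecols, index_col, parse_dates, na_values, dtype) are part of the pandas API.

```python
import io

import pandas as pd

# Hypothetical export: a comment line, semicolon delimiter, custom missing marker
csv_text = """# exported by some tool
id;date;amount;status
1;2024-01-02;10.5;ok
2;2024-01-03;missing;ok
3;2024-01-04;7.25;failed
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                          # delimiter other than a comma
    skiprows=1,                       # skip the leading comment line
    usecols=["id", "date", "amount"], # load only the columns you need
    index_col="id",                   # use a column as the row index
    parse_dates=["date"],             # convert to datetime on load
    na_values=["missing"],            # extra strings to treat as NaN
    dtype={"amount": "float64"},      # set a type instead of relying on inference
)

print(df)
print(df.dtypes)
```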

Handling Large Files
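
One common approach for files that do not fit in memory is to read them in chunks with the chunksize parameter and aggregate as you go. A minimal sketch, assuming a ten-row stand-in for a large file:

```python
import io

import pandas as pd

# Ten rows stand in for a file too large to load at once.
csv_text = "value\n" + "\n".join(str(i) for i in range(10)) + "\n"

total = 0
n_chunks = 0

# With chunksize set, read_csv yields DataFrames of at most that many rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["value"].sum()
    n_chunks += 1

print(f"sum={total} across {n_chunks} chunks")
```

Loading fewer columns with usecols and setting compact types with dtype also reduce memory use, and both combine with chunking.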

Writing CSV Files
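
A minimal sketch of df.to_csv() with a few common options. The DataFrame and file name are made up, and the file is written to a temporary directory so the example cleans up after itself.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [90.456, None]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scores.csv")  # hypothetical file name
    df.to_csv(
        path,
        index=False,          # leave out the row index
        float_format="%.2f",  # round floats in the output
        na_rep="NA",          # text written for missing values
        encoding="utf-8",
    )
    with open(path, encoding="utf-8") as f:
        written = f.read()

print(written)
```

Leaving index=False out is a frequent source of an unwanted extra column when the file is read back.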

Working with JSON

JSON is common for web APIs and nested data structures.

Reading JSON
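
A minimal sketch of two reading paths, using made-up records: pd.read_json() for a flat array of objects, and pd.json_normalize() for nested objects, which flattens nested keys into dotted column names.

```python
import io
import json

import pandas as pd

# Flat records: read_json handles these directly.
flat_json = '[{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]'
df_flat = pd.read_json(io.StringIO(flat_json))

# Nested records: parse first, then flatten with json_normalize.
nested_json = """
[
  {"id": 1, "user": {"name": "Alice", "address": {"city": "Paris"}}},
  {"id": 2, "user": {"name": "Bob", "address": {"city": "Berlin"}}}
]
"""
records = json.loads(nested_json)
df_nested = pd.json_normalize(records)

print(df_flat)
print(df_nested.columns.tolist())
```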

Writing JSON
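
A minimal sketch of df.to_json(), assuming a small made-up DataFrame. The orient parameter controls the layout; "records" gives a list of row objects, and adding lines=True writes one object per line (JSON Lines).

```python
import json

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"]})

# With no path argument, to_json() returns the JSON text as a string.
records_json = df.to_json(orient="records")

# JSON Lines: one JSON object per line
lines_json = df.to_json(orient="records", lines=True)

# Parse both back with the standard library to confirm they are valid JSON.
parsed_records = json.loads(records_json)
parsed_lines = [json.loads(line) for line in lines_json.splitlines() if line]

print(records_json)
print(lines_json)
```

Passing a file path as the first argument writes to disk instead of returning a string.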

Excel Files

Pandas can read and write Excel files (requires openpyxl for .xlsx).
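
A minimal sketch of writing two sheets with pd.ExcelWriter and reading them back. The data, sheet names, and file name are assumptions, and the example skips itself when openpyxl is not installed.

```python
import os
import tempfile

import pandas as pd

try:
    import openpyxl  # noqa: F401  (engine pandas uses for .xlsx files)
    HAVE_OPENPYXL = True
except ImportError:
    HAVE_OPENPYXL = False

sales = pd.DataFrame({"region": ["north", "south"], "units": [10, 7]})
returns = pd.DataFrame({"region": ["north"], "units": [1]})

if HAVE_OPENPYXL:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "report.xlsx")  # hypothetical file name

        # One workbook, several sheets
        with pd.ExcelWriter(path) as writer:
            sales.to_excel(writer, sheet_name="Sales", index=False)
            returns.to_excel(writer, sheet_name="Returns", index=False)

        # Read a single sheet by name
        sales_back = pd.read_excel(path, sheet_name="Sales")

        # sheet_name=None reads every sheet into a dict of DataFrames
        all_sheets = pd.read_excel(path, sheet_name=None)

    print(sales_back)
    print(list(all_sheets))
else:
    print("openpyxl is not installed; skipping the Excel example")
```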

Working with SQL Databases

Pandas reads the result of a SQL query into a DataFrame with pd.read_sql() and writes a DataFrame to a database table with df.to_sql().
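
A minimal sketch using an in-memory SQLite database from the standard library, so no server is needed. The table name, columns, and query are made up for illustration; other databases are typically reached through a SQLAlchemy engine instead of a sqlite3 connection.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")

orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3],
        "customer": ["ann", "bob", "ann"],
        "total": [20.0, 35.5, 12.25],
    }
)

# Write the DataFrame to a table, replacing it if it already exists
orders.to_sql("orders", conn, index=False, if_exists="replace")

# Read a query result back; params keeps values out of the SQL string
query = """
    SELECT customer, SUM(total) AS spent
    FROM orders
    WHERE total > ?
    GROUP BY customer
    ORDER BY customer
"""
summary = pd.read_sql(query, conn, params=(10,))
conn.close()

print(summary)
```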

Fetching Data from Web APIs

APIs return data (usually JSON) that can be converted to DataFrames.
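
A minimal sketch of the fetch-then-convert pattern. The fetch_json helper, the URL in the comment, and the payload shape are all hypothetical; the helper is defined but not called, so the example runs offline on a hard-coded payload of the kind an API might return.

```python
import pandas as pd


def fetch_json(url, params=None, timeout=10):
    """Fetch and parse a JSON response (needs network access and requests)."""
    import requests  # third-party; imported here so the offline demo runs without it

    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.json()


# Offline stand-in for: payload = fetch_json("https://api.example.com/users")
payload = {
    "status": "ok",
    "data": [
        {"id": 1, "name": "Alice", "stats": {"posts": 12}},
        {"id": 2, "name": "Bob", "stats": {"posts": 3}},
    ],
}

# The records usually sit under a key; json_normalize flattens nested fields.
df = pd.json_normalize(payload["data"])
print(df)
```

Setting a timeout and calling raise_for_status() keeps a slow or failing API from silently producing an empty or malformed DataFrame.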

Pagination and Multiple API Calls
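
Paginated APIs differ in how they signal the next page, so the sketch below assumes one simple scheme: each response carries an "items" list and a "next_page" number that is None on the last page. The fetch_page function and FAKE_PAGES data are stand-ins for real HTTP calls.

```python
import pandas as pd

# Hypothetical responses keyed by page number
FAKE_PAGES = {
    1: {"items": [{"id": 1}, {"id": 2}], "next_page": 2},
    2: {"items": [{"id": 3}], "next_page": None},
}


def fetch_page(page):
    """Stand-in for an HTTP call such as requests.get(url, params={"page": page})."""
    return FAKE_PAGES[page]


def fetch_all(fetch, first_page=1, max_pages=100):
    """Collect every page into one DataFrame, with a cap to avoid endless loops."""
    frames = []
    page = first_page
    for _ in range(max_pages):
        payload = fetch(page)
        frames.append(pd.DataFrame(payload["items"]))
        page = payload["next_page"]
        if page is None:
            break
        # For a real API, pause here (time.sleep) to respect rate limits.
    return pd.concat(frames, ignore_index=True)


df = fetch_all(fetch_page)
print(df)
```

Collecting the frames in a list and concatenating once at the end avoids repeatedly copying a growing DataFrame.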

Handling File Encodings

Different files use different character encodings. Understanding this prevents errors.
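
A minimal sketch of the most common encoding problem: a file saved as Latin-1 read with the default UTF-8. The sample text is made up, and a bytes buffer stands in for the file.

```python
import io

import pandas as pd

text = "name,city\nJosé,München\n"
latin1_bytes = text.encode("latin-1")  # simulate a file saved as Latin-1

# read_csv assumes UTF-8 unless told otherwise, which fails on these bytes.
try:
    pd.read_csv(io.BytesIO(latin1_bytes))
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Naming the correct encoding fixes the read.
df = pd.read_csv(io.BytesIO(latin1_bytes), encoding="latin-1")

print("default read failed:", decode_failed)
print(df)
```

When the encoding is unknown, "utf-8", "latin-1", and "cp1252" are common candidates to try. When writing a CSV that will be opened in Excel, encoding="utf-8-sig" adds a byte order mark that helps Excel detect UTF-8.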

Binary Formats: Parquet and Pickle

For performance, consider binary formats.
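
A minimal sketch of a round trip through both formats, assuming a small made-up DataFrame. Pickle needs no extra packages; Parquet needs pyarrow or fastparquet, so that part is skipped when neither is installed.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "when": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
        "label": pd.Categorical(["a", "b", "a"]),
    }
)

with tempfile.TemporaryDirectory() as tmp:
    # Pickle: exact pandas round trip, Python-only
    pkl_path = os.path.join(tmp, "data.pkl")
    df.to_pickle(pkl_path)
    from_pickle = pd.read_pickle(pkl_path)

    # Parquet: typed, columnar, readable from other tools
    try:
        pq_path = os.path.join(tmp, "data.parquet")
        df.to_parquet(pq_path)
        from_parquet = pd.read_parquet(pq_path)
    except ImportError:
        from_parquet = None  # no Parquet engine installed

# Unlike CSV, column types survive the round trip.
print(from_pickle.dtypes)
```

Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code. Parquet is the safer choice for data that is shared.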

Practical Example: Data Pipeline
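
One possible shape for a small pipeline is sketched below: extract from CSV, transform with pandas, load into a SQL table. It is an illustration, not the lesson's original example; the data, function names, and table name are all assumptions.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical raw export with one missing amount
RAW_CSV = """order_id,customer,amount,order_date
1,ann,20.0,2024-01-05
2,bob,,2024-01-06
3,ann,12.5,2024-02-01
"""


def extract(source):
    """Read the raw CSV, parsing the date column on load."""
    return pd.read_csv(source, parse_dates=["order_date"])


def transform(df):
    """Drop rows without an amount and total the rest per customer and month."""
    clean = df.dropna(subset=["amount"]).copy()
    clean["month"] = clean["order_date"].dt.strftime("%Y-%m")
    return clean.groupby(["customer", "month"], as_index=False)["amount"].sum()


def load(df, conn, table):
    """Write the result to a SQL table, replacing any previous run."""
    df.to_sql(table, conn, index=False, if_exists="replace")


conn = sqlite3.connect(":memory:")
summary = transform(extract(io.StringIO(RAW_CSV)))
load(summary, conn, "monthly_sales")

stored = pd.read_sql("SELECT * FROM monthly_sales ORDER BY customer, month", conn)
conn.close()
print(stored)
```

Keeping extract, transform, and load as separate functions makes it easy to swap the source or destination format without touching the transformation logic.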

Key Takeaways

CSV is universal—use pd.read_csv() and df.to_csv() with appropriate options

JSON handles nested data—use pd.json_normalize() for complex structures

Excel requires openpyxl—supports multiple sheets with ExcelWriter

SQL integration is seamless—use pd.read_sql() and df.to_sql()

APIs return JSON—convert to DataFrames after parsing

Encodings matter—specify encoding for international data

Binary formats (Parquet, Pickle) are faster for large data

Connections: Data I/O in Practice

🔗 Connection to Data Engineering

Task              Pandas Function
ETL Pipeline      read_* → transform → to_*
Data Lake         read_parquet() / to_parquet()
Data Warehouse    read_sql() / to_sql()
API Integration   requests + read_json()

🔗 Connection to Machine Learning

Loading data is the first step in any ML workflow:

    # Typical ML data loading
    import pandas as pd

    train = pd.read_csv('train.csv')
    X = train.drop('target', axis=1)  # feature columns
    y = train['target']               # label column

Practice Exercises

Exercise 1: Multi-Format Pipeline

Next Steps

Now that you can load and save data, you're ready for Data Visualization with Matplotlib—turning your data into compelling visual stories.

Further Reading

Choose the Right Format

  • CSV: human-readable, universal — but slow, type-ambiguous, no compression by default.
  • Parquet: columnar, typed, compressed. Use for everything that isn't shared with non-tech humans.
  • JSON Lines: one JSON object per line. Ideal for log/streaming data.
  • Feather / Arrow IPC: fastest pandas-to-pandas roundtrip.

APIs & Web Scraping

  • httpx: modern HTTP client with sync and async support and a largely requests-compatible API.
  • Beautiful Soup — HTML parsing. Pair with requests/httpx for scraping.
  • Playwright Python — for JavaScript-heavy sites that need a real browser.

Books

  • Book: Python for Data Analysis (3rd ed.) — Wes McKinney. Chapter 6 ("Data Loading, Storage, and File Formats").