LESSON 11 · ADVANCED · 60 MIN

Model Optimization: Quantization, Pruning, Distillation

Optimize models for production with quantization, pruning, knowledge distillation, and compilation, trading some accuracy for smaller size and lower latency.

Introduction: Making Models Faster and Smaller

Your neural network achieves 95% accuracy... but it's 500MB, takes 2 seconds per prediction, and requires a GPU. Production systems need fast, small, and efficient models.

Model optimization reduces size and latency while keeping accuracy close to that of the original model.

Key Techniques:

  • Quantization (reduce precision)
  • Pruning (remove weights)
  • Knowledge distillation (teacher-student)
  • Compilation (optimize computation graph)

Learning Objectives

  • Understand model optimization trade-offs
  • Apply quantization techniques
  • Implement model pruning
  • Use knowledge distillation
  • Optimize inference with compilation
  • Choose appropriate optimization for use case

1. Why Optimize?

The Optimization Trade-off
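
The size side of the trade-off can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation, assuming the 500MB model from the introduction consists only of FP32 weights (so the parameter count is derived from that assumption, not measured) and that 1 MB = 10^6 bytes. The helper name `model_size_mb` is hypothetical.

```python
# Back-of-the-envelope weight storage at different precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def model_size_mb(num_params: int, dtype: str) -> float:
    """Approximate weight storage in MB (1 MB = 1e6 bytes), ignoring file metadata."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6

# Assumption: the 500 MB model is all FP32 weights, i.e. 125 million parameters.
num_params = 125_000_000

sizes = {dtype: model_size_mb(num_params, dtype) for dtype in BYTES_PER_PARAM}
for dtype, size in sizes.items():
    print(f"{dtype}: {size:.0f} MB")
```

Storage shrinks in proportion to bytes per parameter. Latency and accuracy do not follow such a simple rule and have to be measured on the target hardware.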


2. Quantization

Idea: Use lower precision (INT8 instead of FP32) for weights and activations.

  • FP32 (32-bit float): Standard training precision
  • INT8 (8-bit integer): 4x smaller than FP32; often 2-4x faster inference on hardware with INT8 support
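
To make the idea concrete, here is a minimal NumPy sketch of symmetric per-tensor INT8 quantization, one of several possible schemes. It is an illustration under stated assumptions, not the procedure any particular framework uses: the function names `quantize_int8` and `dequantize` are hypothetical, and the weights are random stand-ins for a real layer.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization: map floats to int8 using one scale."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation of the original values."""
    return q.astype(np.float32) * scale

# Random stand-in for a layer's FP32 weights (assumption for illustration).
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=(256, 256)).astype(np.float32)

q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

print("FP32 bytes:", weights.nbytes)    # 4 bytes per value
print("INT8 bytes:", q_weights.nbytes)  # 1 byte per value
print("max abs error:", float(np.max(np.abs(weights - restored))))
```

The rounding error per value is bounded by about half the scale, which is where the accuracy cost of quantization comes from. Real toolchains typically add calibration data and per-channel scales to reduce it.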


3. Model Pruning

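Idea: Remove unimportant weights (set them to zero) to reduce model size. The stored size only shrinks in practice when the zeros are kept in a sparse format or whole structures are removed.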

Types:

  • Unstructured pruning: Remove individual weights
  • Structured pruning: Remove entire neurons/channels
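
The two types can be contrasted with a small NumPy sketch. It uses weight magnitude as the importance criterion, which is one common choice and an assumption here, since the lesson does not name a criterion. The function names and the layer shape are hypothetical.

```python
import numpy as np

def prune_unstructured(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction `sparsity` of individual weights."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    # k-th smallest absolute value acts as the cut-off.
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) > threshold, w, 0.0)

def prune_structured(w: np.ndarray, keep: int) -> np.ndarray:
    """Keep the `keep` output neurons (rows) with the largest L2 norm."""
    norms = np.linalg.norm(w, axis=1)
    kept_rows = np.sort(np.argsort(norms)[-keep:])
    return w[kept_rows]

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32))  # hypothetical layer: 64 output neurons, 32 inputs

sparse = prune_unstructured(w, sparsity=0.5)
smaller = prune_structured(w, keep=48)

print("unstructured:", sparse.shape, "zeros:", int(np.sum(sparse == 0)))
print("structured:  ", smaller.shape)
```

Unstructured pruning keeps the tensor shape and only introduces zeros, so it needs sparse storage or sparse kernels to pay off. Structured pruning yields a smaller dense tensor, but in a real network the next layer's inputs must be trimmed to match the removed neurons.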

4. Knowledge Distillation

Idea: Train a small "student" model to mimic a large "teacher" model.

Process:

  1. Train a large, accurate teacher model
  2. Use the teacher's outputs as "soft targets"
  3. Train a smaller student to match the teacher
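
Steps 2 and 3 are usually expressed as a loss function. The sketch below follows the commonly used formulation that mixes a temperature-softened KL term with ordinary cross-entropy on the true labels. The temperature of 4.0 and the weight `alpha` of 0.5 are assumed example values, not recommendations from this lesson, and the logits are made up for illustration.

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) on soft targets
    plus (1 - alpha) * cross-entropy on hard labels."""
    eps = 1e-12
    p_teacher = softmax(teacher_logits, temperature)  # soft targets
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps)), axis=-1)
    hard_probs = softmax(student_logits)
    ce = -np.log(hard_probs[np.arange(len(labels)), labels] + eps)
    return float(np.mean(alpha * temperature**2 * kl + (1 - alpha) * ce))

# Made-up logits for two samples and three classes.
teacher_logits = np.array([[8.0, 2.0, 0.5], [0.2, 6.0, 1.0]])
labels = np.array([0, 1])

untrained_student = np.zeros_like(teacher_logits)  # predicts uniform probabilities
matching_student = teacher_logits.copy()           # reproduces the teacher exactly

loss_untrained = distillation_loss(untrained_student, teacher_logits, labels)
loss_matching = distillation_loss(matching_student, teacher_logits, labels)
print("untrained student loss:", loss_untrained)
print("matching student loss: ", loss_matching)
```

In training, this loss would be minimized with respect to the student's parameters while the teacher stays frozen. A higher temperature spreads the teacher's probability mass over more classes, exposing how it ranks the wrong answers.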

5. Model Compilation

Idea: Optimize the computation graph for specific hardware.

Tools:

  • TensorRT (NVIDIA GPUs)
  • ONNX Runtime
  • TensorFlow Lite (mobile/edge)
  • Apache TVM
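
The tools above cannot be demonstrated without installing them, but the kind of graph rewrite they perform can be shown by hand. The NumPy sketch below illustrates one common example, operator fusion: an inference-mode batch normalization is folded into the preceding linear layer so that two operations become one. It illustrates the idea only and does not show how any listed tool implements it; `fold_batchnorm` and all parameter values are hypothetical.

```python
import numpy as np

def fold_batchnorm(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * ((W @ x + b) - mean) / sqrt(var + eps) + beta
    into a single linear layer y = W_folded @ x + b_folded."""
    scale = gamma / np.sqrt(var + eps)
    W_folded = W * scale[:, None]         # rescale each output row
    b_folded = (b - mean) * scale + beta
    return W_folded, b_folded

# Random stand-ins for trained parameters (assumption for illustration).
rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 8)), rng.normal(size=4)
gamma, beta = rng.normal(size=4), rng.normal(size=4)
mean, var = rng.normal(size=4), rng.uniform(0.5, 2.0, size=4)
x = rng.normal(size=8)

# Original graph: linear layer followed by batch norm (inference mode).
y_original = gamma * ((W @ x + b) - mean) / np.sqrt(var + 1e-5) + beta

# Optimized graph: one linear layer with precomputed parameters.
W_f, b_f = fold_batchnorm(W, b, gamma, beta, mean, var)
y_folded = W_f @ x + b_f

print("outputs match:", bool(np.allclose(y_original, y_folded)))
```

The rewrite changes no results beyond floating-point rounding, which is why compilation is usually the lowest-risk optimization to try first.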

6. Optimization Decision Guide
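
As a starting point, the lesson's own section summaries can be condensed into a lookup from the main constraint to the techniques associated with it. The mapping below is an illustrative rule of thumb built from those summaries, not a benchmark-derived recommendation. The constraint names and the `suggest_techniques` helper are hypothetical.

```python
# Hypothetical rule-of-thumb lookup; constraint names are illustrative only.
TECHNIQUE_FOR_CONSTRAINT = {
    "model_size": ["quantization", "pruning"],
    "latency": ["quantization", "compilation"],
    "small_architecture": ["knowledge distillation"],
    "target_hardware": ["compilation"],
}

def suggest_techniques(constraints):
    """Return techniques for the given constraints, in first-seen order, without duplicates."""
    suggestions = []
    for constraint in constraints:
        for technique in TECHNIQUE_FOR_CONSTRAINT.get(constraint, []):
            if technique not in suggestions:
                suggestions.append(technique)
    return suggestions

result = suggest_techniques(["model_size", "latency"])
print(result)
```

Any suggestion from a table like this should be validated by measuring accuracy, size, and latency before and after on the actual deployment target.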


Key Takeaways

  • Quantization reduces precision (FP32 → INT8) for 4x smaller models
  • Pruning removes unimportant weights to produce sparse models
  • Knowledge distillation trains small students to mimic large teachers
  • Compilation optimizes the computation graph for the target hardware
  • Trade-offs: Speed and size gains vs. accuracy loss
  • Combine techniques for maximum optimization


What's Next?

Final lesson: Production Best Practices – security, reliability, scalability, and real-world deployment strategies!


Further Reading

Visualizations

  • Netron — drag-and-drop model file → interactive graph viewer for ONNX, TF, PyTorch, Keras. Indispensable for inspecting compiled / pruned models.

Documentation & Books

  • ONNX Runtime — cross-platform optimized inference.
  • TensorRT — NVIDIA's GPU compiler.
  • Apache TVM — open-source compiler stack across CPUs, GPUs, accelerators.
  • Book: Efficient Deep Learning Book — Menghani (free draft). Modern, comprehensive treatment.