Introduction: Making Models Faster and Smaller
Your neural network achieves 95% accuracy... but it's 500MB, takes 2 seconds per prediction, and requires a GPU. Production systems need fast, small, and efficient models.
Model optimization reduces size and latency while maintaining accuracy.
Key Techniques:
- Quantization (reduce precision)
- Pruning (remove weights)
- Knowledge distillation (teacher-student)
- Compilation (optimize computation graph)
Learning Objectives
- Understand model optimization trade-offs
- Apply quantization techniques
- Implement model pruning
- Use knowledge distillation
- Optimize inference with compilation
- Choose appropriate optimization for use case
1. Why Optimize?
The Optimization Trade-off
Every technique in this lesson trades a small amount of accuracy for large gains in model size, latency, or memory. The goal is to find the point where the model is small and fast enough for your deployment target while the accuracy loss stays within an acceptable margin.
2. Quantization
Idea: Use lower precision (INT8 instead of FP32) for weights and activations.
- FP32 (32-bit float): standard training precision
- INT8 (8-bit integer): 4x smaller, 2-4x faster inference
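As a concrete starting point, here is a minimal sketch of post-training dynamic quantization using PyTorch's built-in API; the toy model, layer sizes, and input shape are illustrative placeholders rather than anything from this lesson.

```python
# A minimal sketch of post-training dynamic quantization in PyTorch.
# The model architecture and sizes are placeholders for illustration.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Dynamic quantization: weights are stored as INT8, activations are
# quantized on the fly at inference time. Works best for Linear/LSTM-heavy models.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same output shape, smaller stored weights
```

Dynamic quantization requires no calibration data; static quantization and quantization-aware training (covered in the PyTorch tutorials under Further Reading) typically recover more accuracy at the cost of extra work.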
3. Model Pruning
Idea: Remove unimportant weights (set to zero) to reduce model size.
Types:
- Unstructured pruning: Remove individual weights
- Structured pruning: Remove entire neurons/channels
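The sketch below illustrates both pruning styles with torch.nn.utils.prune; the layer shapes and the 30% sparsity target are arbitrary example values.

```python
# A minimal sketch of weight pruning with torch.nn.utils.prune.
# Layer sizes and the 30% sparsity level are illustrative choices.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(512, 256)

# Unstructured: zero out the 30% of individual weights with the smallest magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.3)
print(f"Sparsity: {(layer.weight == 0).float().mean():.1%}")

# Make the pruning permanent: drop the mask and bake the zeros into the weight tensor.
prune.remove(layer, "weight")

# Structured: remove 30% of entire output channels (rows), ranked by L2 norm.
conv = nn.Conv2d(16, 32, kernel_size=3)
prune.ln_structured(conv, name="weight", amount=0.3, n=2, dim=0)
```

Note that unstructured sparsity only speeds up inference on runtimes that exploit it (e.g., DeepSparse, listed under Further Reading), while structured pruning shrinks the dense computation directly.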
4. Knowledge Distillation
Idea: Train a small "student" model to mimic a large "teacher" model.
Process:
1. Train a large, accurate teacher model
2. Use the teacher's outputs as "soft targets"
3. Train the smaller student to match the teacher (a minimal loss sketch follows below)
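Here is a minimal sketch of a distillation loss that mixes a temperature-softened KL term (match the teacher's distribution) with ordinary cross-entropy on the true labels; the temperature, mixing weight, and random tensors are illustrative assumptions, not values from this lesson.

```python
# A minimal sketch of a knowledge-distillation loss (Hinton et al., 2015 style).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: the student learns the teacher's full output distribution,
    # softened by temperature T so small probabilities carry signal.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # standard scale correction for the temperature
    # Hard targets: normal supervised loss on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```

A higher temperature exposes more of the teacher's "dark knowledge" (relative probabilities of wrong classes); the best T and alpha are tuned per task.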
5. Model Compilation
Idea: Optimize the computation graph for specific hardware.
Tools:
- TensorRT (NVIDIA GPUs)
- ONNX Runtime
- TensorFlow Lite (mobile/edge)
- Apache TVM
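As a small example of the compilation workflow, the sketch below exports a toy PyTorch model to ONNX and runs it with ONNX Runtime (assuming the onnxruntime package is installed); the model, file name, and shapes are placeholders.

```python
# A minimal sketch: export a PyTorch model to ONNX, then run it with ONNX Runtime.
import numpy as np
import torch
import torch.nn as nn
import onnxruntime as ort

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
dummy = torch.randn(1, 512)

# Export the computation graph to the portable ONNX format.
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["logits"],
)

# Run the exported graph with ONNX Runtime (CPU here; other execution
# providers such as CUDA or TensorRT can be selected if available).
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {"input": np.random.randn(1, 512).astype(np.float32)})
print(outputs[0].shape)  # (1, 10)
```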
6. Optimization Decision Guide
- Need a quick win with minimal effort: start with post-training quantization (FP32 → INT8)
- Deploying to mobile or edge devices: quantize and compile with TensorFlow Lite
- Serving on NVIDIA GPUs: compile with TensorRT; on CPUs or mixed hardware, use ONNX Runtime
- Latency or size still too high: add pruning, or distill into a smaller student model
- Maximum compression: combine techniques, e.g., distill, then prune, then quantize
Key Takeaways
✅ Quantization reduces precision (FP32 → INT8) for 4x smaller models
✅ Pruning removes unimportant weights for sparse models
✅ Knowledge distillation trains small students to mimic large teachers
✅ Compilation optimizes computation graph for hardware
✅ Trade-offs: Speed/size gains vs. accuracy loss
✅ Combine techniques for maximum optimization
What's Next?
Final lesson: Production Best Practices – security, reliability, scalability, and real-world deployment strategies!
Further Reading
Hands-On Tutorials & Tools
- PyTorch — Quantization Tutorials — start with dynamic INT8, then static, then QAT.
- bitsandbytes — 8-bit and 4-bit quantization that drops into Hugging Face Transformers in one line.
- Hugging Face — Optimum — exports + accelerates HF models for ONNX, TensorRT, OpenVINO, AWS Inferentia.
- Neural Magic — DeepSparse — CPU inference engine that exploits unstructured sparsity.
Visualizations
- Netron — drag-and-drop model file → interactive graph viewer for ONNX, TF, PyTorch, Keras. Indispensable for inspecting compiled / pruned models.
Papers & Articles
- A Survey of Quantization Methods for Efficient Neural Network Inference — Gholami et al., 2021. The reference.
- Distilling the Knowledge in a Neural Network — Hinton, Vinyals, Dean, 2015. The original distillation paper.
- The State of Sparsity in Deep Neural Networks — Gale, Elsen, Hooker, 2019. Magnitude pruning is a strong baseline.
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — Frantar et al., 2022.
- QLoRA: Efficient Finetuning of Quantized LLMs — Dettmers et al., NeurIPS 2023. 4-bit + LoRA.
- SmoothQuant & AWQ — modern activation-aware quantization for LLMs.
Documentation & Books
- ONNX Runtime — cross-platform optimized inference.
- TensorRT — NVIDIA's GPU compiler.
- Apache TVM — open-source compiler stack across CPUs, GPUs, accelerators.
- Book: Efficient Deep Learning Book — Menghani (free draft). Modern, comprehensive treatment.