Introduction: Making Models Faster and Smaller
Your neural network achieves 95% accuracy... but it's 500MB, takes 2 seconds per prediction, and requires a GPU. Production systems need fast, small, and efficient models.
Model optimization reduces size and latency while maintaining accuracy.
Key Techniques:
- Quantization (reduce precision)
- Pruning (remove weights)
- Knowledge distillation (teacher-student)
- Compilation (optimize computation graph)
Learning Objectives
- Understand model optimization trade-offs
- Apply quantization techniques
- Implement model pruning
- Use knowledge distillation
- Optimize inference with compilation
- Choose appropriate optimization for use case
1. Why Optimize?
The Optimization Trade-off
Every technique in this lesson trades a small amount of accuracy for large gains in model size, latency, or memory. The goal is to find the point where the model is small and fast enough for your deployment target while the accuracy loss stays within an acceptable margin.
2. Quantization
Idea: Use lower precision (INT8 instead of FP32) for weights and activations.
- FP32 (32-bit float): standard training precision
- INT8 (8-bit integer): 4x smaller, 2-4x faster inference
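As a concrete starting point, here is a minimal sketch of post-training dynamic quantization using PyTorch's built-in API; the toy model, layer sizes, and input shape are illustrative placeholders rather than anything from this lesson.

```python
# A minimal sketch of post-training dynamic quantization in PyTorch.
# The model architecture and sizes are placeholders for illustration.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Dynamic quantization: weights are stored as INT8, activations are
# quantized on the fly at inference time. Works best for Linear/LSTM-heavy models.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same output shape, smaller stored weights
```

Dynamic quantization requires no calibration data; static quantization and quantization-aware training (covered in the PyTorch tutorials under Further Reading) typically recover more accuracy at the cost of extra work.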
3. Model Pruning
Idea: Remove unimportant weights (set to zero) to reduce model size.
Types:
- Unstructured pruning: Remove individual weights
- Structured pruning: Remove entire neurons/channels
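The sketch below illustrates both pruning styles with torch.nn.utils.prune; the layer shapes and the 30% sparsity target are arbitrary example values.

```python
# A minimal sketch of weight pruning with torch.nn.utils.prune.
# Layer sizes and the 30% sparsity level are illustrative choices.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(512, 256)

# Unstructured: zero out the 30% of individual weights with the smallest magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.3)
print(f"Sparsity: {(layer.weight == 0).float().mean():.1%}")

# Make the pruning permanent: drop the mask and bake the zeros into the weight tensor.
prune.remove(layer, "weight")

# Structured: remove 30% of entire output channels (rows), ranked by L2 norm.
conv = nn.Conv2d(16, 32, kernel_size=3)
prune.ln_structured(conv, name="weight", amount=0.3, n=2, dim=0)
```

Note that unstructured sparsity only speeds up inference on runtimes that exploit it (e.g., DeepSparse, listed under Further Reading), while structured pruning shrinks the dense computation directly.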
4. Knowledge Distillation
Idea: Train a small "student" model to mimic a large "teacher" model.
Process:
1. Train a large, accurate teacher model
2. Use the teacher's outputs as "soft targets"
3. Train the smaller student to match the teacher (a minimal loss sketch follows below)
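Here is a minimal sketch of a distillation loss that mixes a temperature-softened KL term (match the teacher's distribution) with ordinary cross-entropy on the true labels; the temperature, mixing weight, and random tensors are illustrative assumptions, not values from this lesson.

```python
# A minimal sketch of a knowledge-distillation loss (Hinton et al., 2015 style).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: the student learns the teacher's full output distribution,
    # softened by temperature T so small probabilities carry signal.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # standard scale correction for the temperature
    # Hard targets: normal supervised loss on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```

A higher temperature exposes more of the teacher's "dark knowledge" (relative probabilities of wrong classes); the best T and alpha are tuned per task.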
5. Model Compilation
Idea: Optimize the computation graph for specific hardware.
Tools:
- TensorRT (NVIDIA GPUs)
- ONNX Runtime
- TensorFlow Lite (mobile/edge)
- Apache TVM
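As a small example of the compilation workflow, the sketch below exports a toy PyTorch model to ONNX and runs it with ONNX Runtime (assuming the onnxruntime package is installed); the model, file name, and shapes are placeholders.

```python
# A minimal sketch: export a PyTorch model to ONNX, then run it with ONNX Runtime.
import numpy as np
import torch
import torch.nn as nn
import onnxruntime as ort

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
dummy = torch.randn(1, 512)

# Export the computation graph to the portable ONNX format.
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["logits"],
)

# Run the exported graph with ONNX Runtime (CPU here; other execution
# providers such as CUDA or TensorRT can be selected if available).
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {"input": np.random.randn(1, 512).astype(np.float32)})
print(outputs[0].shape)  # (1, 10)
```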
6. Optimization Decision Guide
- Need a quick win with minimal effort: start with post-training quantization (FP32 → INT8)
- Deploying to mobile or edge devices: quantize and compile with TensorFlow Lite
- Serving on NVIDIA GPUs: compile with TensorRT; on CPUs or mixed hardware, use ONNX Runtime
- Latency or size still too high: add pruning, or distill into a smaller student model
- Maximum compression: combine techniques, e.g., distill, then prune, then quantize
Key Takeaways
✅ Quantization reduces precision (FP32 → INT8) for 4x smaller models
✅ Pruning removes unimportant weights for sparse models
✅ Knowledge distillation trains small students to mimic large teachers
✅ Compilation optimizes computation graph for hardware
✅ Trade-offs: Speed/size gains vs. accuracy loss
✅ Combine techniques for maximum optimization
What's Next?
Final lesson: Production Best Practices – security, reliability, scalability, and real-world deployment strategies!
Further Reading
Hands-On Tutorials & Tools
- PyTorch — Quantization Tutorials — start with dynamic INT8, then static, then QAT.
- bitsandbytes — 8-bit and 4-bit quantization that drops into Hugging Face Transformers in one line.
- Hugging Face — Optimum — exports + accelerates HF models for ONNX, TensorRT, OpenVINO, AWS Inferentia.
- Neural Magic — DeepSparse — CPU inference engine that exploits unstructured sparsity.
Visualizations
- Netron — drag-and-drop model file → interactive graph viewer for ONNX, TF, PyTorch, Keras. Indispensable for inspecting compiled / pruned models.
Papers & Articles
- A Survey of Quantization Methods for Efficient Neural Network Inference — Gholami et al., 2021. The reference.
- Distilling the Knowledge in a Neural Network — Hinton, Vinyals, Dean, 2015. The original distillation paper.
- The State of Sparsity in Deep Neural Networks — Gale, Elsen, Hooker, 2019. Magnitude pruning is a strong baseline.
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — Frantar et al., 2022.
- QLoRA: Efficient Finetuning of Quantized LLMs — Dettmers et al., NeurIPS 2023. 4-bit + LoRA.
- SmoothQuant & AWQ — modern activation-aware quantization for LLMs.
Documentation & Books
- ONNX Runtime — cross-platform optimized inference.
- TensorRT — NVIDIA's GPU compiler.
- Apache TVM — open-source compiler stack across CPUs, GPUs, accelerators.
- Book: Efficient Deep Learning Book — Menghani (free draft). Modern, comprehensive treatment.