Model Deployment: API Design & Serving Infrastructure
Deploy ML models to production. Design REST APIs, containerize models with Docker, and build scalable serving infrastructure.
Introduction: The Deployment Gap
You've built an amazing machine learning model. It achieves 95% accuracy on your test set. You're excited to share it with the world... and then reality hits.
The deployment gap is real:
- Your model works great in Jupyter, but fails in production
- It takes 5 seconds to predict (users want < 100ms)
- It crashes under load
- You can't monitor or debug it
- Rolling back changes is a nightmare
Key Insight: A model that's not deployed is worthless. Production deployment is where the real engineering challenges begin!
Learning Objectives
- Understand production requirements (latency, throughput, reliability)
- Learn deployment patterns (batch vs. real-time, edge vs. cloud)
- Master model serving with REST APIs
- Implement monitoring and logging
- Handle versioning and rollback
- Scale to handle production traffic
- Optimize for different deployment environments
1. Production Requirements
The 3 Pillars of Production ML
1. Performance
- Latency: Time to return a prediction (target: < 100ms for real-time)
- Throughput: Requests handled per second (RPS)
- Resource usage: CPU, RAM, GPU
2. Reliability
- Availability: System uptime (target: 99.9% = 43 minutes downtime/month)
- Error handling: Graceful degradation
- Monitoring: Detect issues before users do
3. Maintainability
- Versioning: Track model versions
- Rollback: Quickly revert to previous version
- A/B testing: Compare models in production
- Continuous deployment: Automated updates
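The availability target above turns into a concrete downtime budget with simple arithmetic. Here is a minimal sketch, assuming a 30-day month; the function name `downtime_budget_minutes` is illustrative and not part of any library.

```python
def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Return the allowed downtime in minutes for an availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    total_minutes = days * 24 * 60  # minutes in the period
    return total_minutes * (1.0 - availability)


for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {downtime_budget_minutes(target):.1f} min/month")
```

For 99.9% this gives about 43.2 minutes per month, which matches the figure quoted above. Each extra "nine" shrinks the budget by a factor of ten.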
2. Deployment Patterns
Real-Time vs. Batch Prediction
Real-Time (Online)
- Predictions on-demand as requests arrive
- Low latency required (< 100ms)
- Examples: Fraud detection, recommendation systems
Batch (Offline)
- Predictions computed in bulk, stored, and served
- Can tolerate higher latency
- Examples: Daily email recommendations, monthly churn predictions
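The difference between the two patterns is mostly about when the model runs. The sketch below is one way to picture it, assuming a toy `score` function in place of a real model and a plain dictionary as the prediction store; all names here are hypothetical.

```python
from typing import Callable, Dict, Sequence

Features = Sequence[float]


def score(features: Features) -> float:
    """Stand-in for a real model (hypothetical): averages the features."""
    return sum(features) / len(features)


def run_batch_job(
    rows: Dict[str, Features], predict: Callable[[Features], float]
) -> Dict[str, float]:
    """Offline job: score every known user in bulk and return a lookup table."""
    return {user_id: predict(features) for user_id, features in rows.items()}


def serve_from_batch(store: Dict[str, float], user_id: str, default: float = 0.0) -> float:
    """Batch serving path: a cheap lookup, no model call at request time."""
    return store.get(user_id, default)


def serve_real_time(features: Features, predict: Callable[[Features], float]) -> float:
    """Online serving path: the model runs once per request."""
    return predict(features)


nightly_rows = {"alice": [0.5, 0.3, 0.8], "bob": [0.2, 0.7, 0.4]}
prediction_store = run_batch_job(nightly_rows, score)

print(serve_from_batch(prediction_store, "alice"))
print(serve_real_time([0.9, 0.1, 0.5], score))
```

Note the trade-off: the batch path answers instantly but only for users seen by the last job (unknown users fall back to a default), while the real-time path handles any input at the cost of running the model on every request.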
Cloud vs. Edge Deployment
Cloud Deployment
- Models run on cloud servers (AWS, GCP, Azure)
- Easy to scale and update
- Requires internet connection
- Higher latency due to network
Edge Deployment
- Models run on user devices (phones, IoT)
- Ultra-low latency
- Works offline
- Limited compute resources
3. Model Serving with REST API
Flask API Example
A Flask service is a standalone process that listens on a port, so it can't run inside an in-browser sandbox. Save the following as app.py and run it with a regular Python interpreter:
```python
import pickle
import time

import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

MODEL_VERSION = '1.0.0'

# Load model once at startup (in production)
# model = pickle.load(open('model.pkl', 'rb'))


# Mock model for demonstration
class MockModel:
    def predict(self, X):
        return np.random.rand(len(X), 1)


model = MockModel()


@app.route('/health', methods=['GET'])
def health():
    """Health check endpoint"""
    return jsonify({'status': 'healthy', 'version': MODEL_VERSION})


@app.route('/predict', methods=['POST'])
def predict():
    """Prediction endpoint"""
    start = time.perf_counter()
    try:
        # silent=True returns None instead of raising on a missing or invalid JSON body
        data = request.get_json(silent=True)
        if not data or 'features' not in data:
            return jsonify({'error': "Request body must be JSON with a 'features' key"}), 400

        features = np.array(data['features'])
        if features.ndim != 2:
            return jsonify({'error': 'Features must be 2D array'}), 400

        predictions = model.predict(features)
        latency_ms = (time.perf_counter() - start) * 1000  # measured, not hard-coded
        return jsonify({
            'predictions': predictions.tolist(),
            'model_version': MODEL_VERSION,
            'latency_ms': round(latency_ms, 2),
        })
    except Exception as e:
        return jsonify({'error': str(e)}), 500


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```
The service exposes two endpoints:
| Method | Path | Purpose |
|---|---|---|
| GET | /health | Check if the service is running |
| POST | /predict | Make predictions |
A prediction request and its response look like this:
```text
POST /predict
{"features": [[0.5, 0.3, 0.8], [0.2, 0.7, 0.4]]}

200 OK
{"predictions": [[0.85], [0.42]], "model_version": "1.0.0", "latency_ms": 45}
```
FastAPI (Modern Alternative)
4. Docker Containerization
Why Docker?
Problem: "It works on my machine" 🤷
Solution: Package everything (code, dependencies, environment) into a container!
Dockerfile Example
A Dockerfile is a build recipe read by the Docker engine, not Python code, so there's nothing to run in the browser here. Save the following as Dockerfile for an ML serving image:
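```dockerfile
# Start with Python base image
FROM python:3.9-slim

# Set working directory
WORKDIR /app

# Copy requirements and install
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and application code
COPY model.pkl .
COPY app.py .

# Expose port
EXPOSE 5000

# Set environment variables
ENV MODEL_PATH=/app/model.pkl
ENV PORT=5000

# Health check (the slim image does not include curl, so use Python's standard library)
HEALTHCHECK \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:5000/health')" || exit 1

# Run the application (Flask's built-in server, fine for a demo)
CMD ["python", "app.py"]
# For production traffic, consider the gunicorn pinned in requirements.txt instead:
# CMD ["gunicorn", "--bind", "0.0.0.0:5000", "app:app"]
```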
Pin your dependencies in requirements.txt so builds are reproducible:
```text
flask==2.3.0
scikit-learn==1.3.0
numpy==1.24.0
gunicorn==21.2.0
```
Then build and run the container:
```bash
docker build -t ml-model-api .
docker run -p 5000:5000 ml-model-api
```
Benefits:
- Consistent environment across dev/staging/prod
- Easy deployment to any cloud platform
- Isolation from other services
- Simple scaling with orchestration tools
5. Monitoring and Logging
Key Metrics to Monitor
System Metrics:
- CPU/GPU utilization
- Memory usage
- Request latency (p50, p95, p99)
- Throughput (requests/second)
- Error rate
Model Metrics:
- Prediction distribution
- Input feature drift
- Output drift (are predictions changing over time?)
- Model confidence/uncertainty
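Input feature drift can be tracked with a simple distribution comparison. The sketch below uses the Population Stability Index (PSI), one common choice among several; it is a plain-Python illustration under the assumption of a single numeric feature, and the function name `psi` and the synthetic data are made up for the example.

```python
import math
import random
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant features

    def bucket_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the reference range are clamped into the edge buckets.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps log() defined when a bucket is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_frac = bucket_fractions(reference)
    cur_frac = bucket_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))


rng = random.Random(0)
training = [rng.gauss(0.0, 1.0) for _ in range(5000)]   # data the model was trained on
same_dist = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # production data, no drift
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]    # production data, mean shifted

print(f"PSI (no drift): {psi(training, same_dist):.3f}")
print(f"PSI (shifted):  {psi(training, shifted):.3f}")
```

The shifted sample produces a much larger PSI than the undrifted one. Alert thresholds in the range of 0.1 to 0.25 are commonly quoted rules of thumb, not guarantees, so tune them against your own data.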
Logging Best Practices
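One widely used practice is structured logging: emit one JSON object per line so logs can be searched and aggregated by field. The sketch below uses only the standard library; the `JsonFormatter` class, the `context` key, and the logged field names are assumptions chosen for illustration.

```python
import json
import logging
import time
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"context": {...}} becomes a record attribute.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


logger = logging.getLogger("model_api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def log_prediction(model_version: str, n_rows: int, latency_ms: float) -> str:
    """Log one prediction request and return its generated request id."""
    request_id = str(uuid.uuid4())
    logger.info(
        "prediction served",
        extra={"context": {
            "request_id": request_id,
            "model_version": model_version,
            "n_rows": n_rows,
            "latency_ms": round(latency_ms, 2),
        }},
    )
    return request_id


start = time.perf_counter()
# ... model.predict(...) would run here ...
log_prediction("1.0.0", n_rows=2, latency_ms=(time.perf_counter() - start) * 1000)
```

The example logs a request id, the model version, and the latency rather than raw feature values. Whether to log inputs at all depends on your privacy and storage constraints.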
6. Model Versioning and A/B Testing
Model Versioning Strategy
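A versioning strategy needs at least two things: a record of every model version, and a pointer to the one currently serving traffic. The toy in-memory registry below sketches that idea; `ModelRegistry` is a hypothetical class, and a real setup would persist models and metadata in durable storage.

```python
from typing import Any, Dict, List, Optional


class ModelRegistry:
    """In-memory registry: tracks versions and which one serves traffic."""

    def __init__(self) -> None:
        self._models: Dict[str, Any] = {}
        self._history: List[str] = []  # promoted versions, oldest first

    def register(self, version: str, model: Any) -> None:
        if version in self._models:
            raise ValueError(f"version {version} already registered")
        self._models[version] = model

    def promote(self, version: str) -> None:
        """Make a registered version the one that serves traffic."""
        if version not in self._models:
            raise KeyError(f"unknown version {version}")
        self._history.append(version)

    def rollback(self) -> str:
        """Revert to the previously promoted version and return it."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def current_version(self) -> Optional[str]:
        return self._history[-1] if self._history else None

    def current_model(self) -> Any:
        if self.current_version is None:
            raise RuntimeError("no model promoted yet")
        return self._models[self.current_version]


registry = ModelRegistry()
registry.register("1.0.0", lambda x: x * 2)  # lambdas stand in for real models
registry.register("1.1.0", lambda x: x * 3)
registry.promote("1.0.0")
registry.promote("1.1.0")
print(registry.current_version)  # 1.1.0
print(registry.rollback())       # 1.0.0
```

Because old versions are never overwritten, rollback is just a pointer change, which is what makes it fast.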
A/B Testing
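To compare two models in production, each user should be assigned to the same variant on every request. One simple way to get that, sketched here, is to hash the user id into a bucket; the function name `assign_variant`, the salt value, and the 10% treatment share are all illustrative assumptions.

```python
import hashlib
from collections import Counter


def assign_variant(user_id: str, treatment_share: float = 0.1, salt: str = "exp-001") -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # First 8 hex digits -> integer in [0, 2**32), scaled to [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "treatment" if bucket < treatment_share else "control"


counts = Counter(assign_variant(f"user-{i}") for i in range(10_000))
print(counts)
```

Hashing avoids storing assignments anywhere, and changing the salt for a new experiment reshuffles users so the same people are not always in the treatment group.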
7. Performance Optimization
Optimization Strategies
1. Model Optimization
- Quantization (reduce precision: FP32 → FP16 or INT8)
- Pruning (remove unnecessary weights)
- Distillation (train smaller model to mimic larger one)
2. Serving Optimization
- Batch predictions together
- Use model compilation (TensorRT, ONNX Runtime)
- Cache frequently requested predictions
- Load balancing across multiple instances
3. Infrastructure Optimization
- Auto-scaling based on load
- Use appropriate hardware (CPU vs GPU vs TPU)
- Geographic distribution (CDN for models)
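To make quantization from the list above concrete, here is a minimal sketch of symmetric INT8 quantization in plain Python: every weight is divided by a shared scale and rounded to an integer in [-127, 127]. Real toolchains are considerably more sophisticated (per-channel scales, calibration data), so treat this only as an illustration of the core idea; the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


weights = [0.82, -0.41, 0.05, -1.27, 0.33]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(f"scale={scale:.4f}, max round-trip error={max_error:.4f}")
```

Each weight now fits in one byte instead of four, and the round-trip error stays within half a quantization step. Whether that loss is acceptable has to be checked against model accuracy.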
Benchmarking Example
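A basic benchmark measures per-request latency percentiles and overall throughput. The sketch below does this sequentially with the standard library, assuming a toy `fake_predict` function in place of real inference; for a deployed service you would typically run a load-testing tool against the HTTP endpoint instead.

```python
import statistics
import time
from typing import Callable, Dict, List


def fake_predict(features: List[float]) -> float:
    """Stand-in for model inference (hypothetical): a small CPU-bound loop."""
    return sum(x * x for x in features)


def benchmark(
    predict: Callable[[List[float]], float], n_requests: int = 1000
) -> Dict[str, float]:
    """Call predict sequentially and summarize latency and throughput."""
    features = [0.1] * 100
    latencies_ms: List[float] = []
    wall_start = time.perf_counter()
    for _ in range(n_requests):
        start = time.perf_counter()
        predict(features)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    wall_seconds = time.perf_counter() - wall_start

    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "throughput_rps": n_requests / wall_seconds,
    }


for name, value in benchmark(fake_predict).items():
    print(f"{name}: {value:.3f}")
```

Report percentiles rather than the mean: a handful of slow requests barely move the average but show up clearly in p95 and p99, which is what your slowest users experience.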
Key Takeaways
✅ Production ML is fundamentally different from experimentation
✅ Performance matters: Optimize for latency, throughput, and resource usage
✅ Reliability: Monitor, log, and handle errors gracefully
✅ Deployment patterns: Choose real-time vs batch, cloud vs edge based on requirements
✅ Containerization: Use Docker for consistent, reproducible deployments
✅ Versioning: Track models, enable rollback, and A/B test new versions
✅ Monitoring: Measure system AND model metrics continuously
What's Next?
Next lesson: MLOps Fundamentals – automating the ML lifecycle with CI/CD, orchestration, and production best practices!
Further Reading
Hands-On Tutorials
- FastAPI — Tutorial — the canonical modern Python web framework for serving models. Auto-generated OpenAPI docs are a superpower.
- Kubernetes the Hard Way — Kelsey Hightower. If your model serving will run on K8s, do this once.
- Cog — packages models as Docker containers without writing a Dockerfile. Replicate's standard.
Production-Ready Serving
- BentoML — model serving framework with batching, observability, and multi-model orchestration built in.
- NVIDIA Triton Inference Server — high-throughput GPU serving across PyTorch / TF / ONNX / TensorRT.
- KServe — Kubernetes-native, autoscaling model serving.
- vLLM & TensorRT-LLM — for LLM-specific serving (paged attention, continuous batching).
Papers & Articles
- The ML Test Score: A Rubric for ML Production Readiness — Breck et al., Google 2017. 28 specific tests every production ML system should have.
- Towards ML Engineering: A Brief History of TensorFlow Extended (TFX) — Katsiapis et al., Google 2020.
- Continuous Delivery for Machine Learning — Sato, Wider, Windheuser (Martin Fowler's blog). Essential read.
Documentation & Books
- Book: Designing Machine Learning Systems — Chip Huyen (Chapters 7–9 on deployment + monitoring).
- Book: Building Machine Learning Powered Applications — Emmanuel Ameisen.
- Awesome MLOps — curated index of ~300 tools and papers.