ADVANCED ML: UNSUPERVISED LEARNING & PRODUCTION
LESSON 09 OF 12 · ADVANCED · 60 MIN

Model Deployment: API Design & Serving Infrastructure

Deploy ML models to production. Design REST APIs, containerize models with Docker, and build scalable serving infrastructure.

Introduction: The Deployment Gap

You've built an amazing machine learning model. It achieves 95% accuracy on your test set. You're excited to share it with the world... and then reality hits.

The deployment gap is real:

  • Your model works great in Jupyter, but fails in production
  • It takes 5 seconds to predict (users want < 100ms)
  • It crashes under load
  • You can't monitor or debug it
  • Rolling back changes is a nightmare

Key Insight: A model that is never deployed delivers no value. Production deployment is where the real engineering challenges begin.

Learning Objectives

  • Understand production requirements (latency, throughput, reliability)
  • Learn deployment patterns (batch vs. real-time, edge vs. cloud)
  • Master model serving with REST APIs
  • Implement monitoring and logging
  • Handle versioning and rollback
  • Scale to handle production traffic
  • Optimize for different deployment environments

1. Production Requirements

The 3 Pillars of Production ML

1. Performance

  • Latency: Time to return a prediction (target: < 100ms for real-time)
  • Throughput: Requests handled per second (RPS)
  • Resource usage: CPU, RAM, GPU

2. Reliability

  • Availability: System uptime (target: 99.9%, which allows about 43 minutes of downtime per 30-day month)
  • Error handling: Graceful degradation
  • Monitoring: Detect issues before users do
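
The availability figure above is simple arithmetic, and it is worth seeing how quickly the budget shrinks as targets tighten. The following is a minimal sketch; the function name `downtime_budget_minutes` and the 30-day period are assumptions chosen for illustration.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float = 30) -> float:
    """Minutes of downtime allowed in a period for a given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


for target in (99.0, 99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% -> {budget:.1f} min per 30 days")
# 99.0%  -> 432.0 min per 30 days
# 99.9%  -> 43.2 min per 30 days
# 99.99% -> 4.3 min per 30 days
```

Each extra "nine" cuts the allowed downtime by a factor of ten, which is why higher targets tend to require redundancy rather than just faster fixes.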

3. Maintainability

  • Versioning: Track model versions
  • Rollback: Quickly revert to previous version
  • A/B testing: Compare models in production
  • Continuous deployment: Automated updates

Interactive Exploration

Fig. 02. Deployment Simulator (interactive): simulate ML model deployment with metrics.

Try it: Switch the environment (CPU vs GPU) and drag the request-rate control up — watch the latency and throughput readouts shift in real time, and notice when the system tips into degradation.

Try this:

  1. Start with CPU deployment – observe latency and throughput
  2. Switch to GPU – see the performance boost
  3. Increase request rate – watch for degradation
  4. Compare different deployment options

2. Deployment Patterns

Real-Time vs. Batch Prediction

Real-Time (Online)

  • Predictions on-demand as requests arrive
  • Low latency required (< 100ms)
  • Examples: Fraud detection, recommendation systems

Batch (Offline)

  • Predictions computed in bulk, stored, and served
  • Can tolerate higher latency
  • Examples: Daily email recommendations, monthly churn predictions
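
The difference between the two patterns can be sketched in a few lines. Everything here is hypothetical: `score` stands in for a real model, and `prediction_store` stands in for a database or key-value store that a scheduled job would fill.

```python
from typing import Dict, List

Features = List[float]


def score(features: Features) -> float:
    """Hypothetical stand-in for a model: returns the mean of the features."""
    return sum(features) / len(features)


def predict_realtime(features: Features) -> float:
    """Real-time: the model runs when the request arrives."""
    return score(features)


def run_batch_job(users: Dict[str, Features]) -> Dict[str, float]:
    """Batch: score every known user ahead of time and store the results."""
    return {user_id: score(f) for user_id, f in users.items()}


# In a real system this would run on a schedule (for example nightly).
users = {"u1": [0.2, 0.4], "u2": [0.9, 0.7]}
prediction_store = run_batch_job(users)


def serve_from_store(user_id: str, default: float = 0.0) -> float:
    """Serving a batch prediction is a lookup, not a model call."""
    return prediction_store.get(user_id, default)
```

The trade-off shows up in `serve_from_store`: lookups are cheap, but a user who was not in the last batch run only gets the default.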

Cloud vs. Edge Deployment

Cloud Deployment

  • Models run on cloud servers (AWS, GCP, Azure)
  • Easy to scale and update
  • Requires internet connection
  • Higher latency due to network

Edge Deployment

  • Models run on user devices (phones, IoT)
  • Very low latency (no network round trip)
  • Works offline
  • Limited compute resources

3. Model Serving with REST API

Flask API Example

A Flask service is a standalone process that listens on a port. Save the following as app.py and run it with a regular Python interpreter:

    from flask import Flask, request, jsonify
    import numpy as np
    import pickle
    import time

    app = Flask(__name__)

    # Load model once at startup (in production)
    # model = pickle.load(open('model.pkl', 'rb'))

    # Mock model for demonstration
    class MockModel:
        def predict(self, X):
            return np.random.rand(len(X), 1)

    model = MockModel()

    @app.route('/health', methods=['GET'])
    def health():
        """Health check endpoint"""
        return jsonify({'status': 'healthy', 'version': '1.0.0'})

    @app.route('/predict', methods=['POST'])
    def predict():
        """Prediction endpoint"""
        try:
            start = time.perf_counter()
            # silent=True returns None instead of raising on a bad or missing body
            data = request.get_json(silent=True)
            if not isinstance(data, dict) or 'features' not in data:
                return jsonify({'error': "Body must be JSON with a 'features' key"}), 400

            features = np.array(data['features'])
            if features.ndim != 2:
                return jsonify({'error': 'Features must be 2D array'}), 400

            predictions = model.predict(features)
            latency_ms = (time.perf_counter() - start) * 1000
            return jsonify({
                'predictions': predictions.tolist(),
                'model_version': '1.0.0',
                'latency_ms': round(latency_ms, 2)  # measured, not hard-coded
            })
        except Exception as e:
            return jsonify({'error': str(e)}), 500

    if __name__ == '__main__':
        app.run(host='0.0.0.0', port=5000)

The service exposes two endpoints:

    Method   Path       Purpose
    GET      /health    Check if the service is running
    POST     /predict   Make predictions

A prediction request and its response look like this:

    POST /predict
    {"features": [[0.5, 0.3, 0.8], [0.2, 0.7, 0.4]]}

    200 OK
    {"predictions": [[0.85], [0.42]], "model_version": "1.0.0", "latency_ms": 45}
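
A client for this endpoint could be sketched with only the standard library. The helper names `build_predict_request` and `parse_predictions` are hypothetical, and the base URL assumes the service is running locally on port 5000. The request is built but not sent, so the snippet runs without a live server.

```python
import json
import urllib.request
from typing import List


def build_predict_request(base_url: str, features: List[List[float]]) -> urllib.request.Request:
    """Build (but do not send) a POST /predict request with a JSON body."""
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        # Flask's get_json() expects a JSON content type by default.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_predictions(response_body: str) -> List[List[float]]:
    """Extract the 'predictions' list from a JSON response body."""
    return json.loads(response_body)["predictions"]


req = build_predict_request("http://localhost:5000", [[0.5, 0.3, 0.8], [0.2, 0.7, 0.4]])

# To actually send it (requires the service to be running):
# with urllib.request.urlopen(req, timeout=2) as resp:
#     print(parse_predictions(resp.read().decode("utf-8")))
```

Setting an explicit timeout on the client side matters in production: a caller that waits forever on a slow model turns one slow service into two.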

FastAPI (Modern Alternative)

Fig. 04. Interactive Python code execution environment.

4. Docker Containerization

Why Docker?

Problem: "It works on my machine" 🤷‍♂️

Solution: Package everything (code, dependencies, environment) into a container!

Dockerfile Example

A Dockerfile is a build recipe read by the Docker engine, not Python code. Save the following as Dockerfile for an ML serving image:

    # Start with Python base image
    FROM python:3.9-slim

    # Set working directory
    WORKDIR /app

    # Copy requirements and install
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    # Copy model and application code
    COPY model.pkl .
    COPY app.py .

    # Expose port
    EXPOSE 5000

    # Set environment variables
    ENV MODEL_PATH=/app/model.pkl
    ENV PORT=5000

    # Health check (uses Python because the slim base image does not ship curl)
    HEALTHCHECK --interval=30s --timeout=3s \
      CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:5000/health', timeout=2)" || exit 1

    # Run the application (Flask's built-in server; the pinned gunicorn
    # is an option when serving production traffic)
    CMD ["python", "app.py"]

Pin your dependencies in requirements.txt so builds are reproducible:

    flask==2.3.0
    scikit-learn==1.3.0
    numpy==1.24.0
    gunicorn==21.2.0

Then build and run the container:

    docker build -t ml-model-api .
    docker run -p 5000:5000 ml-model-api
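
The Dockerfile sets MODEL_PATH and PORT as environment variables, but the Flask example hard-codes its port. One way app.py could pick those values up is sketched below; the helper name `load_settings` and the fallback defaults are assumptions, not part of the original service.

```python
import os
from typing import Any, Dict, Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Read serving configuration from environment variables, with local defaults."""
    return {
        "model_path": env.get("MODEL_PATH", "model.pkl"),
        "port": int(env.get("PORT", "5000")),
    }


settings = load_settings()
# In app.py this could replace the hard-coded values, for example:
# app.run(host="0.0.0.0", port=settings["port"])
```

Reading configuration from the environment lets the same image run in dev, staging, and prod with different settings, for example via `docker run -e PORT=8080 ...`.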

Benefits:

  • Consistent environment across dev/staging/prod
  • Easy deployment to any cloud platform
  • Isolation from other services
  • Simple scaling with orchestration tools

5. Monitoring and Logging

Key Metrics to Monitor

System Metrics:

  • CPU/GPU utilization
  • Memory usage
  • Request latency (p50, p95, p99)
  • Throughput (requests/second)
  • Error rate

Model Metrics:

  • Prediction distribution
  • Input feature drift
  • Output drift (are predictions changing over time?)
  • Model confidence/uncertainty
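
Input feature drift can be quantified in many ways. One common choice is the Population Stability Index (PSI), which compares the binned distribution of live values against a reference sample such as the training data. The sketch below is a plain-Python version under stated assumptions: equal-width bins derived from the reference sample, a small `eps` to avoid taking the log of zero, and function names invented for this example.

```python
import bisect
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values per bin; out-of-range values fall into the end bins."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = bisect.bisect_right(edges, v) - 1
        idx = min(max(idx, 0), len(counts) - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def population_stability_index(
    reference: Sequence[float],
    live: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI = sum over bins of (live - ref) * ln(live / ref)."""
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference values must not all be identical")
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(live, edges)
    total = 0.0
    for r, c in zip(ref, cur):
        r, c = max(r, eps), max(c, eps)  # avoid log(0) for empty bins
        total += (c - r) * math.log(c / r)
    return total


reference = [i / 100 for i in range(100)]   # e.g. a training-time feature sample
shifted = [x + 0.5 for x in reference]      # the same feature after a shift
print(population_stability_index(reference, reference))  # identical data gives 0.0
print(population_stability_index(reference, shifted))    # a large positive value
```

A frequently quoted rule of thumb treats PSI above roughly 0.2 as worth investigating, but that threshold is a convention rather than a guarantee and should be tuned per feature.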

Logging Best Practices

Fig. 06. Interactive Python code execution environment.
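
One widely used practice is structured logging: emit one JSON object per prediction so logs can be searched and aggregated by field. The sketch below uses only the standard `logging` module. The logger name `model_api`, the `fields` attribute, and the helper `log_prediction` are assumptions made for this example.

```python
import json
import logging
import uuid
from typing import Optional


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Extra structured fields are passed via logger.info(..., extra={"fields": {...}})
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


logger = logging.getLogger("model_api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def log_prediction(model_version: str, n_rows: int, latency_ms: float,
                   request_id: Optional[str] = None) -> str:
    """Log one prediction event and return the request id used."""
    request_id = request_id or str(uuid.uuid4())
    logger.info("prediction", extra={"fields": {
        "request_id": request_id,
        "model_version": model_version,
        "n_rows": n_rows,
        "latency_ms": latency_ms,
    }})
    return request_id


log_prediction(model_version="1.0.0", n_rows=2, latency_ms=45.0)
```

Logging the request id, model version, and latency (rather than raw inputs) keeps logs useful for debugging while limiting how much user data ends up in them.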

6. Model Versioning and A/B Testing

Model Versioning Strategy

Fig. 08. Interactive Python code execution environment.
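
A versioning strategy needs, at minimum, a way to register models under a version string, mark one as active, and return to the previous one. The in-memory `ModelRegistry` below is a hypothetical sketch of that idea, with models represented as plain callables; a real setup would persist artifacts and metadata in durable storage.

```python
from typing import Any, Callable, Dict, List, Optional


class ModelRegistry:
    """Minimal in-memory registry: register, promote, and roll back versions."""

    def __init__(self) -> None:
        self._models: Dict[str, Callable[[List[float]], Any]] = {}
        self._history: List[str] = []  # versions in the order they were promoted

    def register(self, version: str, model: Callable[[List[float]], Any]) -> None:
        if version in self._models:
            raise ValueError(f"version {version} is already registered")
        self._models[version] = model

    def promote(self, version: str) -> None:
        """Make a registered version the active one."""
        if version not in self._models:
            raise KeyError(version)
        self._history.append(version)

    def rollback(self) -> str:
        """Revert to the previously active version and return it."""
        if len(self._history) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def active_version(self) -> Optional[str]:
        return self._history[-1] if self._history else None

    def predict(self, features: List[float]) -> Dict[str, Any]:
        if self.active_version is None:
            raise RuntimeError("no model has been promoted")
        model = self._models[self.active_version]
        return {"model_version": self.active_version, "prediction": model(features)}


registry = ModelRegistry()
registry.register("1.0.0", lambda x: sum(x))
registry.register("1.1.0", lambda x: 2 * sum(x))
registry.promote("1.0.0")
registry.promote("1.1.0")
print(registry.predict([1, 2]))   # served by 1.1.0
registry.rollback()
print(registry.predict([1, 2]))   # served by 1.0.0 again
```

Returning the version with every prediction makes it possible to trace any logged output back to the model that produced it.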

A/B Testing

Fig. 10. Interactive Python code execution environment.
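
A/B testing a model starts with splitting traffic so that each user consistently sees the same variant. One common approach is to hash the user id into a bucket, as sketched below. The function name `assign_variant`, the `salt` value, and the 10% treatment share are assumptions for illustration.

```python
import hashlib


def assign_variant(user_id: str, treatment_share: float = 0.1, salt: str = "exp-001") -> str:
    """Deterministically assign a user to variant 'A' (control) or 'B' (new model)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "B" if bucket < treatment_share else "A"


# The same user always lands in the same variant.
print(assign_variant("user-42") == assign_variant("user-42"))  # True

# Across many users, roughly treatment_share of them see variant B.
assignments = [assign_variant(f"user-{i}") for i in range(10_000)]
print(assignments.count("B") / len(assignments))
```

Hashing avoids storing an assignment table, and changing the salt reshuffles users for the next experiment so the same people are not always in the treatment group.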

7. Performance Optimization

Optimization Strategies

1. Model Optimization

  • Quantization (reduce precision: FP32 → FP16 or INT8)
  • Pruning (remove unnecessary weights)
  • Distillation (train smaller model to mimic larger one)
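
To make quantization concrete, here is a minimal sketch of symmetric linear quantization to the int8 range, written in plain Python so the arithmetic is visible. The function names are hypothetical, and real frameworks apply the same idea to whole tensors with calibrated scales.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.013, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now needs 8 bits instead of 32, at the cost of a rounding error of at most half the scale per weight. Whether that error is acceptable has to be checked against model accuracy.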

2. Serving Optimization

  • Batch predictions together
  • Use model compilation (TensorRT, ONNX Runtime)
  • Cache frequently requested predictions
  • Load balancing across multiple instances
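
Caching is often the cheapest of these to try. The sketch below wraps a stand-in model call with `functools.lru_cache`. The names `expensive_model` and `cached_predict` and the call counter are assumptions for illustration, and features are passed as a tuple because cache keys must be hashable.

```python
from functools import lru_cache
from typing import Tuple

CALLS = {"model": 0}  # counts how often the underlying model actually runs


def expensive_model(features: Tuple[float, ...]) -> float:
    """Hypothetical stand-in for a slow model call."""
    CALLS["model"] += 1
    return sum(features) / len(features)


@lru_cache(maxsize=1024)
def cached_predict(features: Tuple[float, ...]) -> float:
    """Return a cached prediction when the same features were seen before."""
    return expensive_model(features)


cached_predict((0.5, 0.3, 0.8))
cached_predict((0.5, 0.3, 0.8))   # served from the cache
print(CALLS["model"])              # 1
print(cached_predict.cache_info())
```

This only helps when identical inputs repeat, and cached entries must be cleared (`cached_predict.cache_clear()`) whenever a new model version is deployed.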

3. Infrastructure Optimization

  • Auto-scaling based on load
  • Use appropriate hardware (CPU vs GPU vs TPU)
  • Geographic distribution (serve from regions close to users; CDN for model artifacts)

Benchmarking Example

Fig. 12. Interactive Python code execution environment.
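
A basic benchmark measures per-request latency over many calls and reports percentiles and throughput. The sketch below uses only the standard library. The `benchmark` helper, the warm-up count, and the toy `predict_fn` are assumptions, and a real benchmark would call the deployed endpoint under concurrent load.

```python
import statistics
import time
from typing import Any, Callable, Dict


def benchmark(predict_fn: Callable[[Any], Any], payload: Any,
              n_requests: int = 200, warmup: int = 20) -> Dict[str, float]:
    """Measure sequential latency percentiles (ms) and throughput (requests/second)."""
    for _ in range(warmup):          # warm caches before timing
        predict_fn(payload)

    latencies_ms = []
    start = time.perf_counter()
    for _ in range(n_requests):
        t0 = time.perf_counter()
        predict_fn(payload)
        latencies_ms.append((time.perf_counter() - t0) * 1000)
    elapsed = time.perf_counter() - start

    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "throughput_rps": n_requests / elapsed,
    }


result = benchmark(lambda x: sum(v * v for v in x), [0.1] * 1000)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Reporting p95 and p99 alongside the median matters because tail latency is what the slowest users experience, and averages hide it.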

Key Takeaways

Production ML is fundamentally different from experimentation

Performance matters: Optimize for latency, throughput, and resource usage

Reliability: Monitor, log, and handle errors gracefully

Deployment patterns: Choose real-time vs batch, cloud vs edge based on requirements

Containerization: Use Docker for consistent, reproducible deployments

Versioning: Track models, enable rollback, and A/B test new versions

Monitoring: Measure system AND model metrics continuously


What's Next?

Next lesson: MLOps Fundamentals – automating the ML lifecycle with CI/CD, orchestration, and production best practices!


Further Reading

Hands-On Tutorials

  • FastAPI — Tutorial — the canonical modern Python web framework for serving models. Auto-generated OpenAPI docs are a superpower.
  • Kubernetes the Hard Way — Kelsey Hightower. If your model serving will run on K8s, do this once.
  • Cog — packages models as Docker containers without writing a Dockerfile. Replicate's standard.

Production-Ready Serving

  • BentoML — model serving framework with batching, observability, and multi-model orchestration built in.
  • NVIDIA Triton Inference Server — high-throughput GPU serving across PyTorch / TF / ONNX / TensorRT.
  • KServe — Kubernetes-native, autoscaling model serving.
  • vLLM & TensorRT-LLM — for LLM-specific serving (paged attention, continuous batching).

Documentation & Books

  • Book: Designing Machine Learning Systems — Chip Huyen (Chapters 7–9 on deployment + monitoring).
  • Book: Building Machine Learning Powered Applications — Emmanuel Ameisen.
  • Awesome MLOps — curated index of ~300 tools and papers.