ADVANCED ML: UNSUPERVISED LEARNING & PRODUCTION / L09 MODEL DEPLOYMENT: API DESIGN & SERVING INFRASTRUCTURE
LESSON 09 · ADVANCED · 60 MIN · ◆ 2 INSTRUMENTS

Model Deployment: API Design & Serving Infrastructure

Deploy ML models to production. Design REST APIs, containerize models with Docker, and build scalable serving infrastructure.

Introduction: The Deployment Gap

You've built an amazing machine learning model. It achieves 95% accuracy on your test set. You're excited to share it with the world... and then reality hits.

The deployment gap is real:

  • Your model works great in Jupyter, but fails in production
  • It takes 5 seconds to predict (users want < 100ms)
  • It crashes under load
  • You can't monitor or debug it
  • Rolling back changes is a nightmare

Key Insight: A model that is never deployed delivers no value to users. Production deployment is where the real engineering challenges begin!

Learning Objectives

  • Understand production requirements (latency, throughput, reliability)
  • Learn deployment patterns (batch vs. real-time, edge vs. cloud)
  • Master model serving with REST APIs
  • Implement monitoring and logging
  • Handle versioning and rollback
  • Scale to handle production traffic
  • Optimize for different deployment environments

1. Production Requirements

The 3 Pillars of Production ML

1. Performance

  • Latency: Time to return a prediction (target: < 100ms for real-time)
  • Throughput: Requests handled per second (RPS)
  • Resource usage: CPU, RAM, GPU

2. Reliability

  • Availability: System uptime (target: 99.9% ≈ 43 minutes of downtime in a 30-day month)
  • Error handling: Graceful degradation
  • Monitoring: Detect issues before users do
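The availability target above translates directly into a downtime budget. Here is a minimal sketch of that arithmetic, assuming a 30-day month; the function name `downtime_budget_minutes` is illustrative and not part of the lesson.

```python
def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Return the minutes of downtime allowed over `days` at a given availability."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1.0 - availability)


# Compare a few common availability targets.
for target in (0.99, 0.999, 0.9999):
    budget = downtime_budget_minutes(target)
    print(f"{target:.2%} availability -> {budget:.1f} minutes of downtime per 30 days")
```

Each extra "nine" cuts the allowed downtime by a factor of ten, which is why higher targets usually need redundancy rather than just faster fixes.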

3. Maintainability

  • Versioning: Track model versions
  • Rollback: Quickly revert to previous version
  • A/B testing: Compare models in production
  • Continuous deployment: Automated updates

Interactive Exploration

Fig. 02: Deployment Simulator (interactive). Simulate ML model deployment with metrics.

Try this:

  1. Start with CPU deployment – observe latency and throughput
  2. Switch to GPU – see the performance boost
  3. Increase request rate – watch for degradation
  4. Compare different deployment options

2. Deployment Patterns

Real-Time vs. Batch Prediction

Real-Time (Online)

  • Predictions on-demand as requests arrive
  • Low latency required (< 100ms)
  • Examples: Fraud detection, recommendation systems

Batch (Offline)

  • Predictions computed in bulk, stored, and served
  • Can tolerate higher latency
  • Examples: Daily email recommendations, monthly churn predictions
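The difference between the two patterns is mostly about when the model runs. The sketch below illustrates this with a stand-in `score` function (a fixed weighted sum, purely hypothetical): the batch path scores everyone ahead of time and serves by lookup, while the real-time path scores each request as it arrives.

```python
from typing import Dict, List


def score(features: List[float]) -> float:
    """Stand-in for a trained model: a fixed weighted sum (hypothetical)."""
    weights = [0.5, -0.25, 1.0]
    return sum(w * x for w, x in zip(weights, features))


def run_batch_job(user_features: Dict[str, List[float]]) -> Dict[str, float]:
    """Batch (offline): score every known user in one pass and store the results."""
    return {user_id: score(feats) for user_id, feats in user_features.items()}


def handle_request(features: List[float]) -> float:
    """Real-time (online): score a single request on demand."""
    return score(features)


nightly_features = {"alice": [1.0, 2.0, 3.0], "bob": [0.0, 4.0, 1.0]}
prediction_store = run_batch_job(nightly_features)  # computed ahead of time

print(prediction_store["alice"])        # batch: served by a cheap lookup
print(handle_request([1.0, 2.0, 3.0]))  # real-time: model runs per request
```

The batch path cannot answer for a user it has not seen, and its answers are only as fresh as the last job. That freshness-versus-cost trade-off usually drives the choice.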

Cloud vs. Edge Deployment

Cloud Deployment

  • Models run on cloud servers (AWS, GCP, Azure)
  • Easy to scale and update
  • Requires internet connection
  • Higher latency due to network

Edge Deployment

  • Models run on user devices (phones, IoT)
  • Very low latency (no network round trip)
  • Works offline
  • Limited compute resources

3. Model Serving with REST API

Flask API Example

Fig. 04: Interactive Python code execution environment
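As a rough illustration of what a Flask prediction endpoint can look like, here is a minimal sketch. The route names (`/health`, `/predict`), the `features` payload field, and the `predict_one` stand-in model are all assumptions made for this example; a real service would load a trained model instead.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_one(features):
    """Stand-in for model.predict: a hypothetical threshold on the feature sum."""
    return int(sum(features) > 1.0)


@app.route("/health", methods=["GET"])
def health():
    # Lightweight endpoint for load balancers and uptime checks.
    return jsonify(status="ok")


@app.route("/predict", methods=["POST"])
def predict():
    # silent=True returns None instead of raising on malformed JSON.
    payload = request.get_json(silent=True)
    if not isinstance(payload, dict):
        return jsonify(error="expected a JSON object"), 400

    features = payload.get("features")
    valid = (
        isinstance(features, list)
        and len(features) > 0
        and all(isinstance(x, (int, float)) and not isinstance(x, bool) for x in features)
    )
    if not valid:
        return jsonify(error="'features' must be a non-empty list of numbers"), 400

    return jsonify(prediction=predict_one(features))
```

Assuming the file is saved as `app.py`, `flask --app app run` starts the development server, and a POST to `/predict` with a body like `{"features": [0.7, 0.6]}` returns a JSON prediction. Validating input before calling the model keeps bad requests from surfacing as 500 errors.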

FastAPI (Modern Alternative)

Fig. 06: Interactive Python code execution environment

4. Docker Containerization

Why Docker?

Problem: "It works on my machine" 🤷‍♂️

Solution: Package everything (code, dependencies, environment) into a container!

Dockerfile Example

Fig. 08: Interactive Python code execution environment
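A Dockerfile for a small Python model service could look roughly like the following. This is a sketch under assumptions: the file names `requirements.txt`, `app.py`, and `model.pkl` are hypothetical, and the start command assumes an ASGI app object named `app` served by uvicorn.

```dockerfile
# Small official Python base image
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so this layer is cached between code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code and the serialized model (hypothetical file names)
COPY app.py model.pkl ./

EXPOSE 8000

# Assumes app.py defines an ASGI application object called `app`
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```

With this file in place, `docker build -t model-api .` builds the image and `docker run -p 8000:8000 model-api` starts the container. Pinning exact versions in `requirements.txt` is what makes the build reproducible.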

5. Monitoring and Logging

Key Metrics to Monitor

System Metrics:

  • CPU/GPU utilization
  • Memory usage
  • Request latency (p50, p95, p99)
  • Throughput (requests/second)
  • Error rate

Model Metrics:

  • Prediction distribution
  • Input feature drift
  • Output drift (are predictions changing over time?)
  • Model confidence/uncertainty
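One common way to put a number on input feature drift is the Population Stability Index (PSI), which compares the binned distribution of a feature in production against a reference sample such as the training data. The sketch below is one possible implementation using only the standard library; the bin count and the smoothing constant are assumptions, and any alert threshold should be validated on your own data.

```python
import math
import random
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all reference values are equal

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        eps = 1e-6  # smoothing so empty bins do not produce log(0)
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))


rng = random.Random(0)
training = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same = [rng.gauss(0.0, 1.0) for _ in range(5000)]     # same distribution
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]  # mean moved by one std dev

print(f"no drift: PSI = {psi(training, same):.3f}")
print(f"shifted:  PSI = {psi(training, shifted):.3f}")
```

A PSI near zero means the two samples look alike, and larger values mean a bigger shift. The same function can be applied to model outputs to track output drift.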

Logging Best Practices

Fig. 10: Interactive Python code execution environment
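One widely used practice is structured logging: emit one JSON object per prediction, with a request ID, the model version, and the latency, so logs can be searched and aggregated. The sketch below uses only the standard `logging` module; the field names and the `logged_predict` wrapper are assumptions for illustration.

```python
import json
import logging
import time
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        entry.update(getattr(record, "fields", {}))  # structured extras, if any
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


logger = logging.getLogger("model_server")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate lines via the root logger


def logged_predict(features, model_version="v1"):
    """Run a stand-in prediction and log its outcome with request context."""
    request_id = str(uuid.uuid4())
    start = time.perf_counter()
    try:
        prediction = int(sum(features) > 1.0)  # stand-in for model.predict
    except Exception:
        logger.exception("prediction_failed", extra={"fields": {"request_id": request_id}})
        raise
    latency_ms = (time.perf_counter() - start) * 1000
    logger.info("prediction", extra={"fields": {
        "request_id": request_id,
        "model_version": model_version,
        "n_features": len(features),
        "prediction": prediction,
        "latency_ms": round(latency_ms, 3),
    }})
    return prediction


logged_predict([0.7, 0.6])
```

Note that this logs the number of features rather than the raw values. Whether raw inputs may be logged depends on privacy and storage constraints.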

6. Model Versioning and A/B Testing

Model Versioning Strategy

Fig. 12: Interactive Python code execution environment
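A versioning strategy needs at least three operations: register a version, promote it to serve traffic, and roll back. The in-memory `ModelRegistry` below is a hypothetical sketch of that idea. A production setup would typically persist models and metadata in a dedicated registry or artifact store.

```python
from typing import Callable, Dict, List, Optional

Model = Callable[[List[float]], int]


class ModelRegistry:
    """In-memory registry: register versions, promote one, roll back."""

    def __init__(self) -> None:
        self._models: Dict[str, Model] = {}
        self._history: List[str] = []  # promoted versions, oldest first

    def register(self, version: str, model: Model) -> None:
        if version in self._models:
            raise ValueError(f"version {version!r} is already registered")
        self._models[version] = model

    def promote(self, version: str) -> None:
        if version not in self._models:
            raise KeyError(version)
        self._history.append(version)

    def rollback(self) -> str:
        """Drop the active version and return the one that is active afterwards."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def active_version(self) -> Optional[str]:
        return self._history[-1] if self._history else None

    def predict(self, features: List[float]) -> dict:
        version = self.active_version
        if version is None:
            raise RuntimeError("no model has been promoted")
        # Return the version with every prediction so results stay traceable.
        return {"model_version": version, "prediction": self._models[version](features)}


registry = ModelRegistry()
registry.register("1.0.0", lambda f: int(sum(f) > 1.0))  # stand-in models
registry.register("1.1.0", lambda f: int(sum(f) > 0.5))
registry.promote("1.0.0")
registry.promote("1.1.0")
print(registry.predict([0.4, 0.3]))  # served by 1.1.0
registry.rollback()
print(registry.predict([0.4, 0.3]))  # served by 1.0.0 again
```

Tagging each response with the model version is what makes it possible to tell later which model produced a given prediction.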

A/B Testing

Fig. 14: Interactive Python code execution environment
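An A/B test needs a way to split traffic so that each user consistently sees the same model. One simple approach, sketched below, is to hash the user ID into a bucket. The function name, the `salt` value, and the 10% treatment share are assumptions for this example.

```python
import hashlib
from collections import Counter


def assign_variant(user_id: str, treatment_share: float = 0.1, salt: str = "exp-1") -> str:
    """Deterministically map a user to 'A' (control) or 'B' (treatment)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # first 32 bits scaled into [0, 1)
    return "B" if bucket < treatment_share else "A"


# The same user always lands in the same variant.
print(assign_variant("user-42"), assign_variant("user-42"))

# Across many users the split is close to the requested share.
counts = Counter(assign_variant(f"user-{i}") for i in range(10_000))
print(counts)
```

Hashing instead of calling a random number generator on each request means no assignment table has to be stored, and changing the salt reshuffles users for the next experiment.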

7. Performance Optimization

Optimization Strategies

1. Model Optimization

  • Quantization (reduce precision: FP32 → FP16 or INT8)
  • Pruning (remove unnecessary weights)
  • Distillation (train smaller model to mimic larger one)
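To make quantization concrete, the sketch below applies symmetric INT8 quantization to a handful of weights in plain Python. It is a toy illustration of the idea only. Real toolchains add calibration data, per-channel scales, and quantized kernels, none of which is shown here.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map the integers back to approximate float values."""
    return [v * scale for v in q]


weights = [0.82, -1.27, 0.003, 0.5, -0.11]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized:", q)
print(f"scale = {scale:.4f}, max round-trip error = {max_error:.4f}")
```

Each weight now fits in one byte instead of four, at the cost of a rounding error of at most half the scale. Small weights such as 0.003 collapse to zero, which is one reason accuracy should be re-checked after quantization.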

2. Serving Optimization

  • Batch predictions together
  • Use an optimized inference runtime or compiler (TensorRT, ONNX Runtime)
  • Cache frequently requested predictions
  • Load balancing across multiple instances
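Caching is often the cheapest of these to try. The sketch below wraps a stand-in model with `functools.lru_cache`, so repeated requests with identical features skip the model call. The `slow_model` function and the call counter are hypothetical and exist only to make the effect visible.

```python
from functools import lru_cache
from typing import Tuple

CALLS = {"model": 0}  # counts how often the underlying model actually runs


def slow_model(features: Tuple[float, ...]) -> int:
    """Stand-in for an expensive model call."""
    CALLS["model"] += 1
    return int(sum(features) > 1.0)


@lru_cache(maxsize=1024)
def cached_predict(features: Tuple[float, ...]) -> int:
    # Arguments must be hashable, so features are passed as a tuple, not a list.
    return slow_model(features)


for feats in [(0.7, 0.6), (0.1, 0.2), (0.7, 0.6)]:
    cached_predict(feats)

print("model calls:", CALLS["model"])
print(cached_predict.cache_info())
```

This only helps when identical inputs recur, and the cache has to be cleared (`cached_predict.cache_clear()`) whenever a new model version is deployed, otherwise stale predictions are served.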

3. Infrastructure Optimization

  • Auto-scaling based on load
  • Use appropriate hardware (CPU vs GPU vs TPU)
  • Geographic distribution (serve models from regions close to users)

Benchmarking Example

Fig. 16: Interactive Python code execution environment
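A basic benchmark times many calls to the prediction function and reports latency percentiles and throughput. The sketch below is an in-process, single-threaded measurement under assumed defaults (1,000 requests, 50 warm-up calls). It does not include network or serialization overhead, so real end-to-end numbers will be higher.

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark(
    predict: Callable[[List[float]], int],
    sample: List[float],
    n_requests: int = 1000,
    warmup: int = 50,
) -> Dict[str, float]:
    """Time sequential calls to `predict` and summarize latency and throughput."""
    for _ in range(warmup):  # warm caches before measuring
        predict(sample)

    latencies_ms = []
    start = time.perf_counter()
    for _ in range(n_requests):
        t0 = time.perf_counter()
        predict(sample)
        latencies_ms.append((time.perf_counter() - t0) * 1000)
    elapsed = time.perf_counter() - start

    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "throughput_rps": n_requests / elapsed,
    }


def toy_model(features: List[float]) -> int:
    """Stand-in for model.predict."""
    return int(sum(x * x for x in features) > 1.0)


results = benchmark(toy_model, [0.1] * 100)
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

Reporting p95 and p99 alongside the median matters because a small share of slow requests can dominate user experience even when the typical request is fast.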

Key Takeaways

  • Production ML is fundamentally different from experimentation
  • Performance matters: Optimize for latency, throughput, and resource usage
  • Reliability: Monitor, log, and handle errors gracefully
  • Deployment patterns: Choose real-time vs. batch, cloud vs. edge based on requirements
  • Containerization: Use Docker for consistent, reproducible deployments
  • Versioning: Track models, enable rollback, and A/B test new versions
  • Monitoring: Measure system AND model metrics continuously


What's Next?

Next lesson: MLOps Fundamentals – automating the ML lifecycle with CI/CD, orchestration, and production best practices!


Further Reading

Hands-On Tutorials

  • FastAPI — Tutorial — the canonical modern Python web framework for serving models. Auto-generated OpenAPI docs are a superpower.
  • Kubernetes the Hard Way — Kelsey Hightower. If your model serving will run on K8s, do this once.
  • Cog — packages models as Docker containers without writing a Dockerfile. Replicate's standard.

Production-Ready Serving

  • BentoML — model serving framework with batching, observability, and multi-model orchestration built in.
  • NVIDIA Triton Inference Server — high-throughput GPU serving across PyTorch / TF / ONNX / TensorRT.
  • KServe — Kubernetes-native, autoscaling model serving.
  • vLLM & TensorRT-LLM — for LLM-specific serving (paged attention, continuous batching).

Documentation & Books

  • Book: Designing Machine Learning Systems — Chip Huyen (Chapters 7–9 on deployment + monitoring).
  • Book: Building Machine Learning Powered Applications — Emmanuel Ameisen.
  • Awesome MLOps — curated index of ~300 tools and papers.