Model Deployment: API Design & Serving Infrastructure

Introduction: The Deployment Gap

You've built an amazing machine learning model. It achieves 95% accuracy on your test set. You're excited to share it with the world... and then reality hits.

The deployment gap is real:

  • Your model works great in Jupyter, but fails in production
  • It takes 5 seconds to predict (users want < 100ms)
  • It crashes under load
  • You can't monitor or debug it
  • Rolling back changes is a nightmare

Key Insight: A model that's not deployed is worthless. Production deployment is where the real engineering challenges begin!

Learning Objectives

  • Understand production requirements (latency, throughput, reliability)
  • Learn deployment patterns (batch vs. real-time, edge vs. cloud)
  • Master model serving with REST APIs
  • Implement monitoring and logging
  • Handle versioning and rollback
  • Scale to handle production traffic
  • Optimize for different deployment environments

1. Production Requirements

The 3 Pillars of Production ML

1. Performance

  • Latency: Time to return a prediction (target: < 100ms for real-time)
  • Throughput: Requests handled per second (RPS)
  • Resource usage: CPU, RAM, GPU

2. Reliability

  • Availability: System uptime (target: 99.9%, i.e., roughly 43 minutes of downtime per month)
  • Error handling: Graceful degradation
  • Monitoring: Detect issues before users do

3. Maintainability

  • Versioning: Track model versions
  • Rollback: Quickly revert to previous version
  • A/B testing: Compare models in production
  • Continuous deployment: Automated updates

FastAPI (Modern Alternative)


4. Docker Containerization

Why Docker?

Problem: "It works on my machine" 🤷‍♂️

Solution: Package everything (code, dependencies, environment) into a container!

Dockerfile Example

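
A representative Dockerfile, assuming a FastAPI app served by uvicorn with code under `app/`, the model artifact under `model/`, and a `requirements.txt` (all paths and names here are illustrative):

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so this layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code and the model artifact
COPY app/ ./app
COPY model/ ./model

EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Ordering the `COPY` of `requirements.txt` before the application code means dependency layers are rebuilt only when dependencies change, which keeps image builds fast.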

5. Monitoring and Logging

Key Metrics to Monitor

System Metrics:

  • CPU/GPU utilization
  • Memory usage
  • Request latency (p50, p95, p99)
  • Throughput (requests/second)
  • Error rate

Model Metrics:

  • Prediction distribution
  • Input feature drift
  • Output drift (are predictions changing over time?)
  • Model confidence/uncertainty
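
A crude input-drift check, as a sketch: compare the live feature mean against a training-time baseline, measured in baseline standard deviations. Production systems use proper statistical tests (KS test, Population Stability Index) instead of this simple z-score, but the shape of the check is the same:

```python
# Simple drift check: |shift of the live mean| in units of the
# baseline standard deviation. Threshold and method are illustrative.
import statistics


def drift_score(baseline, live):
    """How far the live mean has moved, in baseline std-dev units."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    return abs(statistics.mean(live) - base_mean) / base_std


def is_drifting(baseline, live, threshold=3.0):
    """Flag the feature if the mean shifted more than `threshold` sigmas."""
    return drift_score(baseline, live) > threshold
```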

Logging Best Practices

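
A common pattern is structured (JSON-lines) logging: one JSON object per request, which is easy to grep, parse, and ship to a log aggregator. A sketch, with illustrative field names:

```python
# Structured logging sketch for a prediction service.
import json
import logging
import time
import uuid

logger = logging.getLogger("model_server")
logger.setLevel(logging.INFO)
if not logger.handlers:
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)


def build_record(features, prediction, latency_ms, model_version="1.0.0"):
    """Assemble one log record; a unique request_id ties logs to traces."""
    return {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
        "latency_ms": round(latency_ms, 2),
    }


def log_prediction(features, prediction, latency_ms):
    """Emit one JSON line per request."""
    logger.info(json.dumps(build_record(features, prediction, latency_ms)))
```

Logging the model version with every prediction is what makes post-hoc debugging of a bad deploy possible.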

6. Model Versioning and A/B Testing

Model Versioning Strategy

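
The core bookkeeping is: register versions, promote one to "live", and keep the previous one around for instant rollback. A minimal in-memory sketch (real deployments back this with MLflow, a database, or object-storage naming conventions; the API here is hypothetical):

```python
# Minimal model-registry sketch: register / promote / rollback.
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ModelRegistry:
    versions: Dict[str, str] = field(default_factory=dict)  # version -> artifact path
    current: Optional[str] = None    # version serving live traffic
    previous: Optional[str] = None   # kept around for instant rollback

    def register(self, version: str, artifact_path: str) -> None:
        self.versions[version] = artifact_path

    def promote(self, version: str) -> None:
        """Make `version` live, remembering the old one for rollback."""
        if version not in self.versions:
            raise KeyError(f"unknown version: {version}")
        self.previous, self.current = self.current, version

    def rollback(self) -> None:
        """Swap back to the previously live version."""
        if self.previous is None:
            raise RuntimeError("no previous version to roll back to")
        self.current, self.previous = self.previous, self.current
```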

A/B Testing

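
The routing half of an A/B test can be as simple as hashing the user ID, so each user deterministically lands in the same bucket across requests. A sketch (the variant names and 10% split are illustrative):

```python
# Deterministic traffic splitting for A/B tests: hash the user ID so a
# given user always sees the same model variant.
import hashlib


def assign_variant(user_id: str, treatment_share: float = 0.1) -> str:
    """Route `treatment_share` of users to model B, the rest to model A."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "model_b" if bucket < treatment_share else "model_a"
```

Hashing (rather than random assignment per request) keeps the user experience consistent and makes per-user metrics attributable to a single variant.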

7. Performance Optimization

Optimization Strategies

1. Model Optimization

  • Quantization (reduce precision: FP32 → FP16 or INT8)
  • Pruning (remove unnecessary weights)
  • Distillation (train smaller model to mimic larger one)
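
To make quantization concrete, here is the affine (scale + zero-point) arithmetic behind INT8 post-training quantization, as a pure-Python illustration; real frameworks apply it per-tensor or per-channel with calibrated ranges:

```python
# Affine INT8 quantization sketch: map floats to [0, 255] ints via a
# scale and zero point, then map back. Error is bounded by the scale.
def quantize(values, num_bits=8):
    """Return (quantized ints, scale, zero_point) for a list of floats."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # guard constant tensors
    zero_point = round(qmin - lo / scale)
    q = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate floats from the quantized representation."""
    return [(qi - zero_point) * scale for qi in q]
```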

2. Serving Optimization

  • Batch predictions together
  • Use model compilation (TensorRT, ONNX Runtime)
  • Cache frequently requested predictions
  • Load balancing across multiple instances
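
Caching can be sketched in a few lines with `functools.lru_cache`; this is only valid while the model is fixed, so the cache must be cleared on every deploy (the averaging "model" below is a placeholder):

```python
# Prediction-caching sketch: memoize repeated identical requests.
from functools import lru_cache


@lru_cache(maxsize=10_000)
def cached_predict(features: tuple) -> float:
    # Placeholder for a real (expensive) model call.
    # Features must be hashable (hence a tuple, not a list).
    return sum(features) / len(features)
```

Call `cached_predict.cache_clear()` whenever a new model version is promoted, or stale predictions will be served.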

3. Infrastructure Optimization

  • Auto-scaling based on load
  • Use appropriate hardware (CPU vs GPU vs TPU)
  • Geographic distribution (CDN for models)

Benchmarking Example

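
A sketch of the kind of micro-benchmark used to get the p50/p95/p99 latency numbers that dashboards track (the warmup count and percentile choices are illustrative):

```python
# Latency benchmark sketch: time each request and report percentiles.
import statistics
import time


def benchmark(predict_fn, payloads, warmup=10):
    """Return p50/p95/p99 latency in milliseconds for `predict_fn`."""
    for p in payloads[:warmup]:          # warm caches before timing
        predict_fn(p)
    latencies_ms = []
    for p in payloads:
        start = time.perf_counter()
        predict_fn(p)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    qs = statistics.quantiles(latencies_ms, n=100)
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}
```

Report percentiles rather than the mean: a handful of slow outliers can leave the average looking fine while p99 (what your unluckiest users experience) blows past the latency budget.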

Key Takeaways

  • Production ML is fundamentally different from experimentation
  • Performance matters: optimize for latency, throughput, and resource usage
  • Reliability: monitor, log, and handle errors gracefully
  • Deployment patterns: choose real-time vs. batch, cloud vs. edge based on requirements
  • Containerization: use Docker for consistent, reproducible deployments
  • Versioning: track models, enable rollback, and A/B test new versions
  • Monitoring: measure system AND model metrics continuously


What's Next?

Next lesson: MLOps Fundamentals – automating the ML lifecycle with CI/CD, orchestration, and production best practices!


Further Reading

Hands-On Tutorials

  • FastAPI — Tutorial — the canonical modern Python web framework for serving models. Auto-generated OpenAPI docs are a superpower.
  • Kubernetes the Hard Way — Kelsey Hightower. If your model serving will run on K8s, do this once.
  • Cog — packages models as Docker containers without writing a Dockerfile. Replicate's standard.

Production-Ready Serving

  • BentoML — model serving framework with batching, observability, and multi-model orchestration built in.
  • NVIDIA Triton Inference Server — high-throughput GPU serving across PyTorch / TF / ONNX / TensorRT.
  • KServe — Kubernetes-native, autoscaling model serving.
  • vLLM & TensorRT-LLM — for LLM-specific serving (paged attention, continuous batching).

Documentation & Books

  • Book: Designing Machine Learning Systems — Chip Huyen (Chapters 7–9 on deployment + monitoring).
  • Book: Building Machine Learning Powered Applications — Emmanuel Ameisen.
  • Awesome MLOps — curated index of ~300 tools and papers.