Model Deployment: API Design & Serving Infrastructure

Introduction: The Deployment Gap

You've built an amazing machine learning model. It achieves 95% accuracy on your test set. You're excited to share it with the world... and then reality hits.

The deployment gap is real:

  • Your model works great in Jupyter, but fails in production
  • It takes 5 seconds to predict (users want < 100ms)
  • It crashes under load
  • You can't monitor or debug it
  • Rolling back changes is a nightmare

Key Insight: A model that's not deployed is worthless. Production deployment is where the real engineering challenges begin!

Learning Objectives

  • Understand production requirements (latency, throughput, reliability)
  • Learn deployment patterns (batch vs. real-time, edge vs. cloud)
  • Master model serving with REST APIs
  • Implement monitoring and logging
  • Handle versioning and rollback
  • Scale to handle production traffic
  • Optimize for different deployment environments

1. Production Requirements

The 3 Pillars of Production ML

1. Performance

  • Latency: Time to return a prediction (target: < 100ms for real-time)
  • Throughput: Requests handled per second (RPS)
  • Resource usage: CPU, RAM, GPU

2. Reliability

  • Availability: System uptime (target: 99.9%, which allows roughly 43 minutes of downtime per month)
  • Error handling: Graceful degradation
  • Monitoring: Detect issues before users do

3. Maintainability

  • Versioning: Track model versions
  • Rollback: Quickly revert to previous version
  • A/B testing: Compare models in production
  • Continuous deployment: Automated updates

Interactive Exploration

[Interactive deployment simulator: compare CPU and GPU serving under different request rates]

Try this:

  1. Start with CPU deployment – observe latency and throughput
  2. Switch to GPU – see the performance boost
  3. Increase request rate – watch for degradation
  4. Compare different deployment options

2. Deployment Patterns

Real-Time vs. Batch Prediction

Real-Time (Online)

  • Predictions on-demand as requests arrive
  • Low latency required (< 100ms)
  • Examples: Fraud detection, recommendation systems

Batch (Offline)

  • Predictions computed in bulk, stored, and served
  • Can tolerate higher latency
  • Examples: Daily email recommendations, monthly churn predictions

Cloud vs. Edge Deployment

Cloud Deployment

  • Models run on cloud servers (AWS, GCP, Azure)
  • Easy to scale and update
  • Requires internet connection
  • Higher latency due to network

Edge Deployment

  • Models run on user devices (phones, IoT)
  • Ultra-low latency
  • Works offline
  • Limited compute resources

3. Model Serving with REST API

Flask API Example

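A minimal sketch of a Flask prediction endpoint. The model file name (`model.pkl`), the endpoint path, and the input schema (`{"features": [...]}`) are assumptions for illustration:

```python
# Minimal Flask serving sketch (model file and input schema are assumptions)
import pickle

import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the model once at startup, not on every request
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    try:
        payload = request.get_json(force=True)
        features = np.array(payload["features"]).reshape(1, -1)
        prediction = model.predict(features)
        return jsonify({"prediction": prediction.tolist()})
    except (KeyError, ValueError) as exc:
        # Graceful degradation: return a clear error instead of crashing
        return jsonify({"error": str(exc)}), 400

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

A request would then look like `curl -X POST http://localhost:5000/predict -H "Content-Type: application/json" -d '{"features": [1.0, 2.0, 3.0]}'`.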

FastAPI (Modern Alternative)

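The same idea with FastAPI, which adds automatic request validation (via Pydantic) and interactive API docs. Again a sketch, with the same assumed `model.pkl` artifact:

```python
# Minimal FastAPI serving sketch (model file and schema are assumptions)
import pickle
from typing import List

import numpy as np
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="Model Serving API")

with open("model.pkl", "rb") as f:
    model = pickle.load(f)

class PredictRequest(BaseModel):
    features: List[float]  # validated automatically by Pydantic

class PredictResponse(BaseModel):
    prediction: float

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest):
    try:
        x = np.array(req.features).reshape(1, -1)
        return PredictResponse(prediction=float(model.predict(x)[0]))
    except ValueError as exc:
        raise HTTPException(status_code=400, detail=str(exc))

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
```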


4. Docker Containerization

Why Docker?

Problem: "It works on my machine" 🤷‍♂️

Solution: Package everything (code, dependencies, environment) into a container!

Dockerfile Example

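A sketch of a Dockerfile for the FastAPI service above. The file names (`app.py`, `requirements.txt`, `model.pkl`) and base image are assumptions:

```dockerfile
# Container sketch for the FastAPI service (file names are illustrative)
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so this layer is cached between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code and the model artifact
COPY app.py model.pkl ./

EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```

Build and run with `docker build -t model-api .` followed by `docker run -p 8000:8000 model-api`.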


5. Monitoring and Logging

Key Metrics to Monitor

System Metrics:

  • CPU/GPU utilization
  • Memory usage
  • Request latency (p50, p95, p99)
  • Throughput (requests/second)
  • Error rate

Model Metrics:

  • Prediction distribution
  • Input feature drift (see the drift-check sketch after this list)
  • Output drift (are predictions changing over time?)
  • Model confidence/uncertainty
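
A simple way to approximate input drift is to compare live feature statistics against the training distribution. The sketch below flags a feature whose mean has shifted too far; the threshold, window size, and feature are illustrative:

```python
# Simple input-drift check: compare live feature statistics against the
# training distribution (threshold and sample sizes are illustrative)
import numpy as np

def drift_score(train_values: np.ndarray, live_values: np.ndarray) -> float:
    """Absolute shift in the mean, measured in training standard deviations."""
    train_std = max(train_values.std(), 1e-9)
    return abs(live_values.mean() - train_values.mean()) / train_std

# Example: flag a feature whose live mean drifted > 0.5 std from training
train_age = np.random.default_rng(0).normal(35, 10, 10_000)  # training data
live_age = np.random.default_rng(1).normal(42, 10, 1_000)    # recent requests
if drift_score(train_age, live_age) > 0.5:
    print("ALERT: feature 'age' has drifted; consider retraining")
```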

Logging Best Practices

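A sketch of structured (JSON) request logging, which makes logs easy to search and aggregate. The field names and logger configuration are illustrative:

```python
# Structured logging sketch for a prediction service
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("model_service")

def log_prediction(features, prediction, model_version: str, latency_ms: float):
    """Emit one JSON log line per request."""
    logger.info(json.dumps({
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
        "latency_ms": round(latency_ms, 2),
    }))

# Usage inside an endpoint:
# start = time.perf_counter()
# pred = model.predict(x)
# log_prediction(x.tolist(), pred.tolist(), "v1.2.0",
#                (time.perf_counter() - start) * 1000)
```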


6. Model Versioning and A/B Testing

Model Versioning Strategy

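A minimal file-based registry sketch to make versioning and rollback concrete. The directory layout and metadata fields are assumptions; in practice a tool such as MLflow or a cloud model registry plays this role:

```python
# Sketch of a minimal file-based model registry (layout is illustrative)
import json
import pickle
from datetime import datetime, timezone
from pathlib import Path

REGISTRY = Path("model_registry")

def register_model(model, version: str, metrics: dict) -> Path:
    """Save a model plus metadata under model_registry/<version>/."""
    version_dir = REGISTRY / version
    version_dir.mkdir(parents=True, exist_ok=True)
    with open(version_dir / "model.pkl", "wb") as f:
        pickle.dump(model, f)
    metadata = {
        "version": version,
        "registered_at": datetime.now(timezone.utc).isoformat(),
        "metrics": metrics,
    }
    (version_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return version_dir

def load_model(version: str):
    """Load a specific version; switching this string is the rollback mechanism."""
    with open(REGISTRY / version / "model.pkl", "rb") as f:
        return pickle.load(f)
```

The key property is that every deployed model maps to an immutable, retrievable version, so rolling back is just loading the previous one.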

A/B Testing

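One common pattern is deterministic traffic splitting by user ID, so each user consistently sees the same model variant. A sketch, with the split fraction and model names as assumptions:

```python
# A/B routing sketch: send a fixed fraction of traffic to the candidate model.
# Hashing on user_id keeps each user in the same group across requests.
import hashlib

def assign_variant(user_id: str, treatment_fraction: float = 0.1) -> str:
    """Deterministically map a user to 'control' (model A) or 'treatment' (model B)."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "treatment" if bucket < treatment_fraction * 100 else "control"

# Usage inside the /predict endpoint (the models dict is illustrative):
# model = models["v2"] if assign_variant(user_id) == "treatment" else models["v1"]
```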


7. Performance Optimization

Optimization Strategies

1. Model Optimization

  • Quantization (reduce precision: FP32 → FP16 or INT8) – see the sketch after this list
  • Pruning (remove unnecessary weights)
  • Distillation (train smaller model to mimic larger one)
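
As one example of the quantization item above, here is a minimal sketch using PyTorch dynamic quantization; the toy architecture and layer choice are illustrative:

```python
# Dynamic INT8 quantization sketch with PyTorch (toy model is illustrative)
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Replace Linear layers with INT8 equivalents; weights are quantized ahead of
# time, activations are quantized dynamically at inference
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 128)
with torch.no_grad():
    print(model(x).shape, quantized(x).shape)  # same interface, smaller/faster model
```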

2. Serving Optimization

  • Batch predictions together
  • Use model compilation (TensorRT, ONNX Runtime)
  • Cache frequently requested predictions (see the caching sketch after this list)
  • Load balancing across multiple instances
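
A sketch of the caching idea above, wrapping a model with an in-process LRU cache. The wrapper class and cache size are illustrative; a shared cache (e.g., Redis) is more common when serving from multiple instances:

```python
# Prediction-cache sketch: memoize results for repeated inputs.
# lru_cache needs hashable arguments, so features are passed as a tuple.
from functools import lru_cache

import numpy as np

class CachedModel:
    def __init__(self, model, maxsize: int = 10_000):
        self.model = model
        # Wrap the raw predict call in an LRU cache keyed by the feature tuple
        self._predict = lru_cache(maxsize=maxsize)(self._predict_uncached)

    def _predict_uncached(self, features: tuple) -> float:
        return float(self.model.predict(np.array(features).reshape(1, -1))[0])

    def predict(self, features) -> float:
        return self._predict(tuple(features))

# Usage: served = CachedModel(model); served.predict([1.0, 2.0, 3.0])
```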

3. Infrastructure Optimization

  • Auto-scaling based on load
  • Use appropriate hardware (CPU vs GPU vs TPU)
  • Geographic distribution (CDN for models)

Benchmarking Example

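A sketch of a client-side latency benchmark against the API above, reporting the percentiles listed in the monitoring section. The URL, payload, and request count are assumptions:

```python
# Latency benchmark sketch: measure p50/p95/p99 against a running endpoint
import time

import numpy as np
import requests

URL = "http://localhost:8000/predict"   # assumed endpoint from the FastAPI sketch
payload = {"features": [0.1, 0.2, 0.3]}

latencies_ms = []
for _ in range(200):
    start = time.perf_counter()
    requests.post(URL, json=payload, timeout=5)
    latencies_ms.append((time.perf_counter() - start) * 1000)

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50: {p50:.1f} ms  p95: {p95:.1f} ms  p99: {p99:.1f} ms")
print(f"throughput (sequential): {1000 / np.mean(latencies_ms):.1f} req/s")
```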


Key Takeaways

  • Production ML is fundamentally different from experimentation
  • Performance matters: optimize for latency, throughput, and resource usage
  • Reliability: monitor, log, and handle errors gracefully
  • Deployment patterns: choose real-time vs. batch and cloud vs. edge based on requirements
  • Containerization: use Docker for consistent, reproducible deployments
  • Versioning: track models, enable rollback, and A/B test new versions
  • Monitoring: measure system AND model metrics continuously


What's Next?

Next lesson: MLOps Fundamentals – automating the ML lifecycle with CI/CD, orchestration, and production best practices!