Model Deployment: API Design & Serving Infrastructure

Introduction: The Deployment Gap

You've built an amazing machine learning model. It achieves 95% accuracy on your test set. You're excited to share it with the world... and then reality hits.

The deployment gap is real:

Your model works great in Jupyter, but fails in production
It takes 5 seconds to predict (users want < 100ms)
It crashes under load
You can't monitor or debug it
Rolling back changes is a nightmare

Key Insight: A model that's not deployed is worthless. Production deployment is where the real engineering challenges begin!

Learning Objectives

Understand production requirements (latency, throughput, reliability)
Learn deployment patterns (batch vs. real-time, edge vs. cloud)
Master model serving with REST APIs
Implement monitoring and logging
Handle versioning and rollback
Scale to handle production traffic
Optimize for different deployment environments

1. Production Requirements

The 3 Pillars of Production ML

1. Performance

Latency: Time to return a prediction (target: < 100ms for real-time)
Throughput: Requests handled per second (RPS)
Resource usage: CPU, RAM, GPU

2. Reliability

Availability: System uptime (target: 99.9%, which allows about 43 minutes of downtime per 30-day month)
Error handling: Graceful degradation
Monitoring: Detect issues before users do

3. Maintainability

Versioning: Track model versions
Rollback: Quickly revert to previous version
A/B testing: Compare models in production
Continuous deployment: Automated updates
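
The availability target under Reliability translates directly into a downtime budget. Here is a minimal sketch of that arithmetic, assuming a 30-day month; the function name `downtime_budget_minutes` is illustrative and does not come from any library:

```python
def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Return the minutes of downtime allowed over `days` at a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - availability)


# Compare a few common availability targets
for target in (0.99, 0.999, 0.9999):
    budget = downtime_budget_minutes(target)
    print(f"{target:.2%} availability -> {budget:.1f} minutes of downtime per month")
```

At 99.9% this works out to roughly 43 minutes per month, which matches the target quoted above. Each extra "nine" cuts the budget by a factor of ten.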

Interactive Exploration

Fig. 02: Deployment Simulator (interactive). Simulate ML model deployment with metrics.

Try it: Switch the environment (CPU vs GPU) and drag the request-rate control up — watch the latency and throughput readouts shift in real time, and notice when the system tips into degradation.

Try this:

Start with CPU deployment – observe latency and throughput
Switch to GPU – see the performance boost
Increase request rate – watch for degradation
Compare different deployment options

2. Deployment Patterns

Real-Time vs. Batch Prediction

Real-Time (Online)

Predictions on-demand as requests arrive
Low latency required (< 100ms)
Examples: Fraud detection, recommendation systems

Batch (Offline)

Predictions computed in bulk, stored, and served
Can tolerate higher latency
Examples: Daily email recommendations, monthly churn predictions
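
The difference between the two patterns is mostly about when the model runs. The sketch below illustrates this with a stand-in model and an in-memory dictionary as the prediction store; `MockModel`, `run_batch_job`, `serve_from_store`, and `serve_real_time` are hypothetical names chosen for this example only:

```python
from typing import Dict, List


class MockModel:
    """Stand-in for a trained model: scores a row by averaging its features."""

    def predict(self, rows: List[List[float]]) -> List[float]:
        return [sum(row) / len(row) for row in rows]


model = MockModel()


def run_batch_job(user_features: Dict[str, List[float]]) -> Dict[str, float]:
    """Batch (offline): score every known user in one pass and store the results."""
    user_ids = list(user_features)
    scores = model.predict([user_features[u] for u in user_ids])
    return dict(zip(user_ids, scores))


def serve_from_store(store: Dict[str, float], user_id: str, default: float = 0.0) -> float:
    """Serving a batch prediction is only a lookup; unseen users get a default."""
    return store.get(user_id, default)


def serve_real_time(features: List[float]) -> float:
    """Real-time (online): run the model while the caller waits."""
    return model.predict([features])[0]


store = run_batch_job({"alice": [0.2, 0.4], "bob": [0.9, 0.7]})
print(serve_from_store(store, "alice"))  # precomputed score
print(serve_from_store(store, "carol"))  # unknown user falls back to the default
print(serve_real_time([0.5, 0.5]))       # computed on demand
```

The batch path keeps request latency tiny but can only answer for inputs it has already seen; the real-time path handles any input at the cost of running the model on every request.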

Cloud vs. Edge Deployment

Cloud Deployment

Models run on cloud servers (AWS, GCP, Azure)
Easy to scale and update
Requires internet connection
Higher latency due to network

Edge Deployment

Models run on user devices (phones, IoT)
Ultra-low latency
Works offline
Limited compute resources

3. Model Serving with REST API

Flask API Example

A Flask service is a standalone process that listens on a port, so it can't run inside this in-browser sandbox. Read it as the structure you'd save to app.py and run with a real Python process:

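import time

from flask import Flask, request, jsonify
import numpy as np

app = Flask(__name__)

# In production, load the real model once at startup, for example:
# import pickle
# with open('model.pkl', 'rb') as f:
#     model = pickle.load(f)

# Mock model for demonstration
class MockModel:
    def predict(self, X):
        # One random score per input row
        return np.random.rand(len(X), 1)

model = MockModel()

@app.route('/health', methods=['GET'])
def health():
    """Health check endpoint"""
    return jsonify({'status': 'healthy', 'version': '1.0.0'})

@app.route('/predict', methods=['POST'])
def predict():
    """Prediction endpoint"""
    start = time.perf_counter()

    # silent=True returns None instead of raising on a missing or malformed JSON body
    data = request.get_json(silent=True)
    if not isinstance(data, dict) or 'features' not in data:
        return jsonify({'error': "Request body must be JSON with a 'features' key"}), 400

    try:
        features = np.array(data['features'], dtype=float)
    except (TypeError, ValueError):
        return jsonify({'error': 'Features must be a numeric 2D array'}), 400

    if features.ndim != 2:
        return jsonify({'error': 'Features must be 2D array'}), 400

    try:
        predictions = model.predict(features)
    except Exception:
        # Log the traceback server-side; don't leak internals to the client
        app.logger.exception('Prediction failed')
        return jsonify({'error': 'Internal server error'}), 500

    return jsonify({
        'predictions': predictions.tolist(),
        'model_version': '1.0.0',
        'latency_ms': round((time.perf_counter() - start) * 1000, 2)
    })

if __name__ == '__main__':
    # Flask's built-in server is for development; use a WSGI server such as gunicorn in production
    app.run(host='0.0.0.0', port=5000)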

The service exposes two endpoints:

Method	Path	Purpose
`GET`	`/health`	Check if the service is running
`POST`	`/predict`	Make predictions

A prediction request and its response look like this:

POST /predict
{"features": [[0.5, 0.3, 0.8], [0.2, 0.7, 0.4]]}

200 OK
{"predictions": [[0.85], [0.42]], "model_version": "1.0.0", "latency_ms": 45}
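
A client for this endpoint can be sketched with the Python standard library alone. The helper names `build_predict_request` and `call_predict` are made up for this example, and the base URL assumes the service is running locally on port 5000 as in the Flask code:

```python
import json
import urllib.request


def build_predict_request(base_url: str, features: list) -> urllib.request.Request:
    """Build a POST /predict request carrying the features as a JSON body."""
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_predict(base_url: str, features: list, timeout: float = 2.0) -> dict:
    """Send the request and decode the JSON response (needs a running service)."""
    req = build_predict_request(base_url, features)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request does not touch the network, so this part runs anywhere
req = build_predict_request("http://localhost:5000", [[0.5, 0.3, 0.8], [0.2, 0.7, 0.4]])
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Calling `call_predict("http://localhost:5000", [[0.5, 0.3, 0.8]])` would send the request for real. Setting an explicit timeout matters in production clients, since a hung model server should not hang its callers too.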

FastAPI (Modern Alternative)

4. Docker Containerization

Why Docker?

Problem: "It works on my machine" 🤷‍♂️

Solution: Package everything (code, dependencies, environment) into a container!

Dockerfile Example

A Dockerfile is a build recipe, not Python — it's read by the Docker engine, so there's nothing to run here. Save the following as Dockerfile for an ML serving image:

# Start with Python base image
FROM python:3.9-slim

# Set working directory
WORKDIR /app

# Copy requirements and install
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and application code
COPY model.pkl .
COPY app.py .

# Expose port
EXPOSE 5000

# Set environment variables
ENV MODEL_PATH=/app/model.pkl
ENV PORT=5000

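# Health check (python:3.9-slim does not ship curl, so use Python's standard library)
HEALTHCHECK --interval=30s --timeout=3s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:5000/health', timeout=2)" || exit 1

# Run the application with gunicorn (pinned in requirements.txt) instead of Flask's development server
CMD ["gunicorn", "--bind", "0.0.0.0:5000", "app:app"]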

Pin your dependencies in requirements.txt so builds are reproducible:

flask==2.3.0
scikit-learn==1.3.0
numpy==1.24.0
gunicorn==21.2.0

Then build and run the container:

docker build -t ml-model-api .
docker run -p 5000:5000 ml-model-api

Benefits:

Consistent environment across dev/staging/prod
Easy deployment to any cloud platform
Isolation from other services
Simple scaling with orchestration tools

5. Monitoring and Logging

Key Metrics to Monitor

System Metrics:

CPU/GPU utilization
Memory usage
Request latency (p50, p95, p99)
Throughput (requests/second)
Error rate

Model Metrics:

Prediction distribution
Input feature drift
Output drift (are predictions changing over time?)
Model confidence/uncertainty
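
Two of these metrics are easy to compute by hand. The sketch below shows nearest-rank latency percentiles and a deliberately crude drift signal: how far the live mean of a feature has moved from its training mean, measured in training standard deviations. Both function names are illustrative, and the mean-shift score is an assumption made for this example; real systems often use richer tests over the whole distribution:

```python
import math
import statistics
from typing import Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def mean_shift_score(reference: Sequence[float], live: Sequence[float]) -> float:
    """Distance between live and reference means, in reference standard deviations."""
    ref_std = statistics.pstdev(reference)
    if ref_std == 0:
        raise ValueError("reference values have zero variance")
    return abs(statistics.mean(live) - statistics.mean(reference)) / ref_std


# Latency samples of 1..100 ms make the percentiles easy to check by eye
latencies_ms = list(range(1, 101))
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")

training_values = [1.0, 2.0, 3.0, 4.0, 5.0]
live_values = [5.0, 6.0, 7.0]
print(f"mean shift: {mean_shift_score(training_values, live_values):.2f} std devs")
```

Percentiles matter more than averages here because a handful of slow requests can hide behind a healthy-looking mean.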

Logging Best Practices
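
One common practice is to emit structured logs (one JSON object per line) so they can be searched and aggregated later. The sketch below uses only the standard `logging` module; `JsonFormatter`, `log_prediction`, and the `context` field name are choices made for this example, not part of any library:

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} become an attribute on the record
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


logger = logging.getLogger("model_api")
logger.setLevel(logging.INFO)
_handler = logging.StreamHandler()
_handler.setFormatter(JsonFormatter())
logger.addHandler(_handler)


def log_prediction(n_rows: int, latency_ms: float, model_version: str) -> str:
    """Log one served prediction and return the request ID used to trace it."""
    request_id = str(uuid.uuid4())
    # Log metadata about the request, not the raw features, which may be sensitive
    logger.info(
        "prediction served",
        extra={"context": {
            "request_id": request_id,
            "n_rows": n_rows,
            "latency_ms": latency_ms,
            "model_version": model_version,
        }},
    )
    return request_id


log_prediction(n_rows=2, latency_ms=12.5, model_version="1.0.0")
```

Including a request ID and the model version in every entry makes it possible to trace a bad prediction back to the exact request and model that produced it.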

6. Model Versioning and A/B Testing

Model Versioning Strategy
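
A versioning strategy needs three operations at minimum: register a model under a version, promote a version to live, and roll back to the previous live version. The in-memory `ModelRegistry` below is a hypothetical sketch of that idea; a production registry would persist models and metadata rather than hold them in a dictionary:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ModelRegistry:
    """Stores models by version and tracks which version is live."""

    models: Dict[str, Any] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)  # live versions, oldest first

    def register(self, version: str, model: Any) -> None:
        if version in self.models:
            raise ValueError(f"version {version} is already registered")
        self.models[version] = model

    def promote(self, version: str) -> None:
        """Make a registered version the live one."""
        if version not in self.models:
            raise KeyError(f"unknown version {version}")
        self.history.append(version)

    @property
    def live_version(self) -> Optional[str]:
        return self.history[-1] if self.history else None

    def rollback(self) -> str:
        """Revert to the previously live version and return it."""
        if len(self.history) < 2:
            raise RuntimeError("no previous version to roll back to")
        self.history.pop()
        return self.history[-1]

    def get_live_model(self) -> Any:
        if self.live_version is None:
            raise RuntimeError("no model has been promoted yet")
        return self.models[self.live_version]


registry = ModelRegistry()
registry.register("1.0.0", "model-object-v1")  # strings stand in for real models
registry.register("1.1.0", "model-object-v2")
registry.promote("1.0.0")
registry.promote("1.1.0")
print("live:", registry.live_version)
print("rolled back to:", registry.rollback())
```

Because old versions stay registered, a rollback is just a pointer change, which is what makes it fast.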

A/B Testing
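
To compare two models in production, each user should be routed to the same variant on every request. One way to sketch this is hash-based bucketing, shown below with the standard library; the function name `assign_variant`, the salt string, and the 10% treatment share are assumptions made for this example:

```python
import hashlib
from collections import Counter


def assign_variant(user_id: str, treatment_share: float = 0.1,
                   salt: str = "model-ab-test") -> str:
    """Deterministically route a user to 'control' or 'treatment'."""
    if not 0.0 <= treatment_share <= 1.0:
        raise ValueError("treatment_share must be between 0 and 1")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a value in [0, 1)
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


# The same user always lands in the same variant
print(assign_variant("user-42"), assign_variant("user-42"))

# Across many users, the split approximates the requested share
counts = Counter(assign_variant(f"user-{i}") for i in range(10_000))
print(counts)
```

Hashing avoids storing an assignment table, and changing the salt reshuffles users for the next experiment. Deciding which variant actually won still requires a statistical comparison of the metrics collected for each group.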

7. Performance Optimization

Optimization Strategies

1. Model Optimization

Quantization (reduce precision: FP32 → FP16 or INT8)
Pruning (remove unnecessary weights)
Distillation (train smaller model to mimic larger one)
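
Quantization is the easiest of these to show in a few lines. The sketch below applies symmetric per-tensor INT8 quantization to a random weight matrix with NumPy. It is a simplified illustration with made-up function names; real toolchains also quantize activations and often use per-channel scales:

```python
import numpy as np


def quantize_int8(weights: np.ndarray):
    """Map FP32 weights onto the int8 range [-127, 127] with a single scale factor."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 weights from the int8 values."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
weights = rng.normal(size=(256, 256)).astype(np.float32)

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print("FP32 bytes:", weights.nbytes)
print("INT8 bytes:", q.nbytes)
print("max absolute error:", float(np.max(np.abs(weights - restored))))
```

Storage drops by a factor of four, and the rounding error per weight is bounded by about half the scale factor. Whether that error is acceptable has to be checked against model accuracy on a validation set.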

2. Serving Optimization

Batch predictions together
Use an optimized inference runtime or compiler (TensorRT, ONNX Runtime)
Cache frequently requested predictions
Load balancing across multiple instances
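
Caching is simple to sketch with `functools.lru_cache` from the standard library. In this example `SlowModel` is a stand-in that counts how often it is called, and features are passed as a tuple because cached arguments must be hashable; both are assumptions made for illustration:

```python
from functools import lru_cache
from typing import Tuple


class SlowModel:
    """Stand-in for an expensive model; counts how many times it actually runs."""

    def __init__(self) -> None:
        self.calls = 0

    def predict_one(self, features: Tuple[float, ...]) -> float:
        self.calls += 1
        return sum(features) / len(features)


model = SlowModel()


@lru_cache(maxsize=1024)
def cached_predict(features: Tuple[float, ...]) -> float:
    """Return a cached prediction when the exact same features were seen before."""
    return model.predict_one(features)


first = cached_predict((0.5, 0.3, 0.8))
second = cached_predict((0.5, 0.3, 0.8))  # served from the cache

print("model calls:", model.calls)
print("cache stats:", cached_predict.cache_info())
```

This only helps when identical inputs recur, and the cache must be cleared (`cached_predict.cache_clear()`) whenever a new model version is deployed, or stale predictions will be served.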

3. Infrastructure Optimization

Auto-scaling based on load
Use appropriate hardware (CPU vs GPU vs TPU)
Geographic distribution (serve models from regions close to users, much as a CDN does for static content)

Benchmarking Example
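
A basic benchmark times many calls to the prediction function and reports latency percentiles and throughput. The sketch below is single-threaded and measures only the function call itself, so it ignores network overhead and concurrency; `benchmark` and `fake_predict` are hypothetical names for this example:

```python
import math
import time
from typing import Callable, Dict


def benchmark(predict_fn: Callable[[], object], n_requests: int = 200,
              warmup: int = 20) -> Dict[str, float]:
    """Time repeated calls and summarize latency (ms) and throughput (requests/s)."""
    if n_requests < 1:
        raise ValueError("n_requests must be at least 1")

    # Warm-up calls let caches and lazy initialization settle before measuring
    for _ in range(warmup):
        predict_fn()

    latencies_ms = []
    started = time.perf_counter()
    for _ in range(n_requests):
        t0 = time.perf_counter()
        predict_fn()
        latencies_ms.append((time.perf_counter() - t0) * 1000)
    elapsed = time.perf_counter() - started

    latencies_ms.sort()

    def pct(p: float) -> float:
        # Nearest-rank percentile over the sorted samples
        return latencies_ms[math.ceil(p * n_requests / 100) - 1]

    return {
        "p50_ms": pct(50),
        "p95_ms": pct(95),
        "p99_ms": pct(99),
        "throughput_rps": n_requests / elapsed,
    }


def fake_predict() -> float:
    """Stand-in workload; replace with a call to the real model."""
    return sum(i * i for i in range(10_000)) ** 0.5


results = benchmark(fake_predict)
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

Running the same benchmark before and after an optimization, on the same hardware, gives a fair comparison. Absolute numbers will differ from machine to machine.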

Key Takeaways

✅ Production ML is fundamentally different from experimentation

✅ Performance matters: Optimize for latency, throughput, and resource usage

✅ Reliability: Monitor, log, and handle errors gracefully

✅ Deployment patterns: Choose real-time vs batch, cloud vs edge based on requirements

✅ Containerization: Use Docker for consistent, reproducible deployments

✅ Versioning: Track models, enable rollback, and A/B test new versions

✅ Monitoring: Measure system AND model metrics continuously

What's Next?

Next lesson: MLOps Fundamentals – automating the ML lifecycle with CI/CD, orchestration, and production best practices!

