Production Deployment and Operations

Overview

After developing, training, and fine-tuning language models, the next crucial step is deploying them to production environments where they can provide value to users. However, deploying LLMs presents unique challenges due to their size, complexity, and resource requirements. This lesson covers strategies for successfully deploying LLMs in production, including infrastructure considerations, monitoring approaches, A/B testing methodologies, and version management techniques.

We'll explore how to transition from a successful model in the research environment to a reliable, scalable, and cost-effective system in production. You'll learn about the architectural patterns, operational practices, and technical solutions that enable effective LLM deployments across different scales and use cases.

Learning Objectives

After completing this lesson, you will be able to:

Design scalable and cost-effective infrastructure for LLM deployment
Implement comprehensive monitoring and observability for production LLMs
Set up A/B testing and experimentation frameworks for continuous improvement
Develop strategies for versioning and managing model lifecycles
Apply best practices for security, compliance, and responsible AI
Troubleshoot common issues in production LLM systems
Choose appropriate deployment architectures based on requirements and constraints

From Research to Production: The Deployment Gap

The Deployment Challenge

Transitioning from a successful model in research to a reliable production system involves bridging what's often called the "deployment gap" – the difference between what works in a controlled research environment and what's needed for reliable production systems.

Analogy: From Prototype to Manufacturing

Think of the transition from research to production as similar to moving from a prototype car to mass manufacturing:

Research Phase (Prototype): Building a single working model as a proof of concept. Engineers can constantly tinker and adjust, and raw performance is the main concern.
Production Phase (Manufacturing): Creating a reliable, reproducible process that delivers consistent quality at scale. Considerations include cost efficiency, reliability, maintainability, and user safety.

Just as automotive manufacturers must solve supply chain, quality control, and maintenance issues that weren't priorities during prototyping, ML teams must address deployment challenges that weren't relevant during model development.

Deployment Challenges for LLMs

Aspect	Research Environment	Production Environment
Primary Focus	Model accuracy and capabilities	Reliability, cost, and user experience
Hardware	High-end GPUs/TPUs with flexibility	Cost-optimized, often heterogeneous
Latency	Not a primary concern	Critical for user experience
Scale	Limited test users	Potentially millions of users
Monitoring	Manual evaluation	Automated, comprehensive systems
Updates	Frequent and experimental	Carefully tested and controlled
Cost	Less constrained (within budget)	Key business constraint
Safety	Basic safeguards	Robust safety systems

Challenge 1: Model Size and Computational Requirements

Modern LLMs present unique deployment challenges due to their sheer size:

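Memory Footprint: The largest models have tens to hundreds of billions of parameters (vendors of proprietary models such as GPT-4 have not published exact counts). At 16-bit precision, every billion parameters needs about 2 GB of GPU memory for the weights alone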
Computational Demands: Inference requires substantial computing power for acceptable latency
Cost Considerations: Running large models 24/7 at scale can incur substantial cloud costs
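To make the memory footprint concrete, here is a minimal back-of-the-envelope sketch. It only counts the weights; the KV cache, activations, and framework overhead need additional memory on top. The 7-billion-parameter model and the function name are illustrative assumptions, not figures from a specific deployment.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Hypothetical 7-billion-parameter model at three common precisions.
for label, nbytes in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    print(f"{label}: {weight_memory_gb(7e9, nbytes):.1f} GB")
```

Halving the bytes per parameter halves the weight memory, which is why quantization is often the first lever when a model does not fit on the available GPU.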

Challenge 2: Latency and Throughput Requirements

User-facing applications have strict performance requirements:

Inference Latency: Users expect responses within seconds, not minutes
Throughput: Production systems must handle many concurrent requests
Cost-Performance Balance: Finding the optimal tradeoff between performance and operational costs
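Latency and throughput can be tied together with a rough capacity estimate. The sketch below uses Little's law (requests in flight = arrival rate × average latency) to guess how many replicas a workload needs. The traffic numbers and the per-replica concurrency are assumptions chosen for illustration; real capacity planning should be based on load tests.

```python
import math


def replicas_needed(requests_per_second: float,
                    avg_latency_s: float,
                    concurrent_per_replica: int) -> int:
    """Estimate replica count from average load using Little's law."""
    # Average number of requests being processed at any moment.
    in_flight = requests_per_second * avg_latency_s
    return max(1, math.ceil(in_flight / concurrent_per_replica))


# Assumed workload: 50 requests/s, 2 s average latency, 16 concurrent requests per replica.
print(replicas_needed(50, 2.0, 16))
```

Because the estimate uses averages, it is a lower bound: traffic bursts and long generations call for headroom beyond this number.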

Challenge 3: Scalability and Reliability

Production systems need to handle variable load while maintaining reliability:

Elastic Scaling: Efficiently scaling up and down with demand
High Availability: Ensuring system resilience despite hardware or software failures
Resource Management: Efficiently allocating computing resources across services
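A small calculation shows why redundancy matters for high availability. This sketch assumes replica failures are independent and that any single healthy replica can keep the service up; both are simplifications, since shared dependencies and load limits reduce the real figure.

```python
def combined_availability(single_replica_availability: float, replicas: int) -> float:
    """Probability that at least one of N independent replicas is up."""
    all_down = (1.0 - single_replica_availability) ** replicas
    return 1.0 - all_down


# Assumed 99% availability per replica.
for n in (1, 2, 3):
    print(n, f"{combined_availability(0.99, n):.6f}")
```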

Deployment Infrastructure for LLMs

Choosing the Right Infrastructure

The choice of infrastructure depends on factors like model size, latency requirements, budget constraints, and expected load. Start from the model's characteristics and the users' requirements, derive the deployment requirements from them, and then select among three broad options: cloud, on-premises, or hybrid.

Infrastructure Options

1. Cloud-based Deployment

Advantages:

Scalability and flexibility
Access to specialized hardware (latest GPUs/TPUs)
Managed services for many deployment components
Lower upfront costs

Considerations:

Long-term costs can be high for constant workloads
Limited control over hardware specifics
Potential data security and compliance concerns
Vendor lock-in risks

2. On-Premises Deployment

Advantages:

Complete control over infrastructure
Can be more cost-effective for stable, high-volume workloads
Data remains within your physical control
No dependency on external internet connectivity

Considerations:

High upfront capital expenditure
Requires specialized DevOps expertise
Hardware becomes outdated
Scaling requires physical hardware procurement

3. Hybrid Approaches

Advantages:

Balance between control and convenience
Flexibility to optimize for cost vs. performance
Can address specific compliance requirements
Resilience through diversity

Considerations:

More complex architecture and management
Requires expertise in multiple environments
Potential synchronization challenges
More complex security model
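The cost trade-off between cloud and on-premises deployment can be framed as a break-even calculation. The sketch below is deliberately simple, and all dollar figures are made-up placeholders; it ignores hardware depreciation, staffing, and utilization, which a real comparison would need to include.

```python
from typing import Optional


def break_even_months(hardware_cost: float,
                      onprem_monthly_opex: float,
                      cloud_monthly_cost: float) -> Optional[float]:
    """Months until buying hardware becomes cheaper than renting cloud capacity.

    Returns None when on-premises running costs are not lower than the cloud bill,
    in which case the upfront purchase never pays for itself.
    """
    monthly_savings = cloud_monthly_cost - onprem_monthly_opex
    if monthly_savings <= 0:
        return None
    return hardware_cost / monthly_savings


# Hypothetical figures: $240k of GPUs, $4k/month to run them, $14k/month in the cloud.
print(break_even_months(240_000, 4_000, 14_000))
```

A break-even point well beyond the useful life of the hardware argues for staying in the cloud; a short one supports the "stable, high-volume workloads" case for on-premises.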

Cloud Provider Comparison

Provider	Key Offerings	Advantages	Considerations
AWS	SageMaker, EC2 G5/P4 instances, Inferentia	Deep integration with AWS services, global reach	Premium pricing, complex pricing model
Google Cloud	Vertex AI, TPUs, Cloud GPUs	TPU access, specialized for ML workloads	TPU learning curve, fewer deployment options
Azure	Azure OpenAI Service, ML Service, NC-series VMs	Strong enterprise integration, OpenAI partnership	Limited hardware options compared to competitors
Specialized providers (Lambda, CoreWeave)	GPU-optimized infrastructure	Optimized for ML workloads, potentially lower costs	Smaller ecosystem, fewer integrated services

Containerization and Orchestration

Modern LLM deployments often leverage containerization for consistency and orchestration for management:

Docker containers provide a consistent environment across development and production
Kubernetes offers orchestration capabilities to manage scaling and resource allocation
Helm charts help standardize deployments

Code Example: Basic Kubernetes Deployment for Model Serving

# model-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference-service
  labels:
    app: llm-inference
spec:
  replicas: 3  # Initial pod count; the HorizontalPodAutoscaler below adjusts it afterwards
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
      - name: model-server
        image: your-registry/llm-model:v1.0.0
        resources:
          limits:
            nvidia.com/gpu: 1  # Each pod requests 1 GPU
            memory: "16Gi"
            cpu: "8"
          requests:
            nvidia.com/gpu: 1
            memory: "12Gi"
            cpu: "4"
        ports:
        - containerPort: 8000
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        env:
        - name: MODEL_PATH
          value: "/models/llama-7b-chat-q4"
        - name: MAX_CONCURRENT_REQUESTS
          value: "16"
        volumeMounts:
        - name: model-storage
          mountPath: /models
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: llm-inference-service
spec:
  selector:
    app: llm-inference
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference-service
  minReplicas: 1
  maxReplicas: 10
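  metrics:
  # CPU utilization is only a rough proxy for GPU-bound inference load.
  # Scaling on GPU or request-queue metrics requires a custom or external metrics adapter.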
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
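To see how the HorizontalPodAutoscaler in the manifest above turns a utilization reading into a replica count, the following sketch reproduces the scaling formula from the Kubernetes documentation: desired = ceil(current replicas × current metric / target metric), clamped to the min/max bounds. The 10% tolerance is the documented default; the function itself is an illustration, not Kubernetes code, and it omits stabilization windows and readiness handling.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int, max_replicas: int, tolerance: float = 0.1) -> int:
    """Approximate the HPA decision for a single utilization metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: no scaling action.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 3 pods at 90% average CPU with a 70% target, bounds taken from the manifest above.
print(desired_replicas(3, 90, 70, min_replicas=1, max_replicas=10))
```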

Deployment Architecture Patterns

Model-as-a-Service Architecture

In this pattern, the LLM is deployed as a standalone service with a REST or gRPC API:

[Fig. 02: Flow diagram of the Model-as-a-Service architecture; not included in this text]
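A standalone model service is easier to operate when its API contract is explicit. The sketch below shows one possible request schema, validated at the service boundary before any GPU time is spent. The field names and defaults mirror the /generate example later in this lesson; the class, the validation rules, and `parse_request` are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerateRequest:
    prompt: str
    max_length: int = 512
    temperature: float = 0.7
    top_p: float = 0.9

    def __post_init__(self):
        # Reject malformed requests early, before they reach the model.
        if not self.prompt:
            raise ValueError("prompt must be non-empty")
        if self.max_length <= 0:
            raise ValueError("max_length must be positive")
        if self.temperature < 0:
            raise ValueError("temperature must be non-negative")
        if not 0 < self.top_p <= 1:
            raise ValueError("top_p must be in (0, 1]")


def parse_request(body: str) -> GenerateRequest:
    """Parse a JSON request body into a validated request object."""
    return GenerateRequest(**json.loads(body))


print(parse_request('{"prompt": "Hello", "temperature": 0.2}'))
```

Unknown fields raise a `TypeError` here, which a real service would translate into a 4xx response rather than a server error.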

Monitoring and Observability

The Importance of LLM Monitoring

Monitoring is particularly crucial for LLMs due to several factors:

Resource Intensity: Detecting inefficiencies or problems that could lead to high costs
Performance Drift: Detecting when model behavior changes over time
Reliability Concerns: Ensuring consistent service despite complex systems
Safety and Compliance: Monitoring for problematic outputs or usage patterns

Analogy: Monitoring as a Dashboard

Think of monitoring and observability as the dashboard in a complex vehicle:

Gauges (metrics) show you the current state of key systems
Warning lights (alerts) notify you when something needs attention
Diagnostic port (logging) lets you dig deeper when problems arise
Black box (tracing) records everything for post-incident analysis

Just as the operator of a complex vehicle needs both basic instruments and deeper diagnostics, LLM systems need multiple layers of monitoring.

LLM-Specific Monitoring Considerations

Metrics to Monitor

Category	Metrics	Purpose
System Performance	GPU/CPU utilization, Memory usage, I/O wait times	Identify resource bottlenecks and capacity planning
Operational Metrics	Request latency, Throughput, Error rates, Queue length	Ensure system meets performance requirements
Model Metrics	Token throughput, Perplexity, Generation length, Attention patterns	Track model efficiency and behavior
Business Metrics	Cost per request, User engagement, Conversion rates	Evaluate business impact and ROI
Safety Metrics	Content policy violations, User reports, Safety filter activations	Monitor for problematic or harmful outputs
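Several of the operational metrics in the table are summaries over raw request records. As a minimal sketch, the code below computes nearest-rank latency percentiles and an error rate from a small, made-up sample; in production these would normally come from the metrics backend rather than from application code.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def error_rate(status_codes):
    """Fraction of requests that ended in a 5xx response."""
    return sum(1 for code in status_codes if code >= 500) / len(status_codes)


# Made-up sample: one slow outlier and one failed request out of ten.
latencies_s = [0.8, 1.1, 0.9, 4.2, 1.0, 1.3, 0.7, 1.2, 1.0, 0.9]
statuses = [200] * 9 + [500]

print("p50:", percentile(latencies_s, 50))
print("p95:", percentile(latencies_s, 95))
print("error rate:", error_rate(statuses))
```

The median hides the 4.2 s outlier while the 95th percentile exposes it, which is why tail percentiles are the usual basis for latency alerts.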

Implementing a Monitoring Stack

[Fig. 04: Model Training & Parallelism Explorer, an interactive dashboard for exploring training and inference metrics; not included in this text]

A Comprehensive Monitoring Architecture

A comprehensive monitoring architecture for LLM services:

[Fig. 06: Flow diagram of the monitoring architecture; not included in this text]

Implementing Metrics Collection

Here's a Python example using Prometheus with FastAPI for serving an LLM:

from fastapi import FastAPI, Request, Response
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import time
import os
from prometheus_client import (
    CONTENT_TYPE_LATEST, Counter, Histogram, Gauge, generate_latest,
)

app = FastAPI()

# Load the model and tokenizer once at startup
model_name = os.environ.get("MODEL_NAME", "mistralai/Mistral-7B-Instruct-v0.2")
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define Prometheus metrics
REQUEST_COUNT = Counter('llm_request_count', 'Total number of requests')
REQUEST_LATENCY = Histogram('llm_request_latency_seconds', 'Request latency in seconds')
MODEL_TEMPERATURE = Gauge('llm_temperature', 'Temperature parameter for generation')
TOKEN_THROUGHPUT = Histogram('llm_token_throughput', 'Tokens generated per second')
GPU_MEMORY_USED = Gauge('llm_gpu_memory_used_bytes', 'GPU memory used by the model')
ACTIVE_REQUESTS = Gauge('llm_active_requests', 'Number of active inference requests')
TOKEN_COUNT = Histogram('llm_token_count', 'Number of tokens in generation')

# Middleware that tracks how many requests are currently in flight
@app.middleware("http")
async def track_requests(request: Request, call_next):
    ACTIVE_REQUESTS.inc()
    try:
        response = await call_next(request)
        return response
    finally:
        ACTIVE_REQUESTS.dec()

@app.post("/generate")
async def generate_text(request: dict):
    REQUEST_COUNT.inc()
    start_time = time.perf_counter()

    # Extract parameters
    prompt = request["prompt"]
    max_length = request.get("max_length", 512)  # counts prompt tokens plus generated tokens
    temperature = request.get("temperature", 0.7)
    top_p = request.get("top_p", 0.9)

    # Record the last requested temperature
    MODEL_TEMPERATURE.set(temperature)

    # Record GPU memory currently allocated on device 0
    if torch.cuda.is_available():
        memory_allocated = torch.cuda.memory_allocated(0)
        GPU_MEMORY_USED.set(memory_allocated)

    # Only pass sampling parameters when sampling is enabled (temperature 0 means greedy decoding)
    do_sample = temperature > 0
    sampling_args = {"temperature": temperature, "top_p": top_p} if do_sample else {}

    # Tokenize the prompt and move the tensors to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    input_token_count = len(inputs.input_ids[0])

    # Note: model.generate blocks the event loop while it runs; a production
    # service would offload it to a worker or a dedicated inference server.
    generation_start = time.perf_counter()
    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_length=max_length,
            do_sample=do_sample,
            **sampling_args,
        )

    generation_time = time.perf_counter() - generation_start
    output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Calculate token throughput (output tokens / generation time)
    output_token_count = len(outputs[0]) - input_token_count
    token_throughput = output_token_count / generation_time if generation_time > 0 else 0

    # Update metrics
    TOKEN_THROUGHPUT.observe(token_throughput)
    TOKEN_COUNT.observe(output_token_count)

    total_time = time.perf_counter() - start_time
    REQUEST_LATENCY.observe(total_time)

    return {
        "text": output_text,
        "generation_time": generation_time,
        "total_time": total_time,
        "input_tokens": input_token_count,
        "output_tokens": output_token_count,
        "token_throughput": token_throughput
    }

@app.get("/metrics")
async def metrics():
    # Prometheus expects its text exposition format, not a JSON-encoded string
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)

@app.get("/health")
async def health_check():
    return {"status": "ok"}
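One detail worth checking in the example above is histogram bucket boundaries. The `prometheus_client` default buckets top out at 10, which suits latencies in seconds but not values such as tokens per second or token counts. The pure-Python sketch below illustrates the effect without needing Prometheus installed; the custom boundaries and the sample throughputs are assumptions, and in `prometheus_client` custom boundaries would be supplied through the `buckets=` argument of `Histogram`.

```python
import bisect

# Default upper bounds used by prometheus_client histograms (the +Inf bucket is implicit).
DEFAULT_BUCKETS = (0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25,
                   0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0)
# Hypothetical boundaries better suited to tokens per second.
THROUGHPUT_BUCKETS = (1, 5, 10, 20, 40, 80, 160)


def bucket_counts(observations, upper_bounds):
    """Count observations per bucket; the last slot is the +Inf overflow bucket."""
    counts = [0] * (len(upper_bounds) + 1)
    for value in observations:
        # A value belongs to the first bucket whose upper bound is >= the value.
        counts[bisect.bisect_left(upper_bounds, value)] += 1
    return counts


throughputs = [12.0, 35.5, 48.0, 22.3]  # made-up tokens/second samples
print(bucket_counts(throughputs, DEFAULT_BUCKETS))
print(bucket_counts(throughputs, THROUGHPUT_BUCKETS))
```

With the default bounds every sample lands in the overflow bucket, so percentile queries over that histogram carry no information; the custom bounds spread the samples across buckets.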

A/B Testing and Experimentation

Why A/B Testing is Critical for LLMs

A/B testing and controlled experimentation are essential for safe, effective improvements to production LLM systems:

Validating Model Improvements: Ensuring new models actually improve real-world performance
Parameter Optimization: Testing different inference parameters (temperature, top-p, etc.)
User Experience Testing: Understanding how model changes affect user satisfaction
Safety Evaluation: Assessing whether model changes introduce new risks or reduce existing ones
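Validating an improvement means checking that an observed difference is unlikely to be noise. As a minimal sketch, the code below runs a two-sided two-proportion z-test on a binary success metric (for example, thumbs-up rate). The counts are invented, and the choice of test is an assumption; other metrics may call for different tests or for sequential methods.

```python
import math


def two_proportion_z_test(success_a: int, total_a: int, success_b: int, total_b: int):
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        # Both groups are all-success or all-failure: no measurable difference.
        return 0.0, 1.0
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented counts: control 400/1000 positive ratings, treatment 450/1000.
z, p = two_proportion_z_test(400, 1000, 450, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```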

Analogy: Scientific Experimentation

Think of A/B testing as running scientific experiments:

You have a control group (existing model/configuration)
You have a treatment group (new model/configuration)
You need a hypothesis (what improvement you expect)
You need metrics (to measure success)
You run both systems simultaneously to compare results

Just as good science requires controlled conditions and sufficient sample sizes, good A/B testing requires careful experimental design.
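Running control and treatment side by side requires a stable way to split traffic. One common approach, sketched here under the assumption that each request carries a user ID, is deterministic hash-based assignment: the same user always sees the same variant, and no assignment table needs to be stored. The function name, experiment name, and 10% treatment share are illustrative.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.1) -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    # Including the experiment name gives each experiment an independent split.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return "treatment" if bucket < round(treatment_share * 10_000) else "control"


users = [f"user-{i}" for i in range(10_000)]
share = sum(assign_variant(u, "new-model-v2") == "treatment" for u in users) / len(users)
print(f"observed treatment share: {share:.3f}")
```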

Setting Up an A/B Testing Framework

Key Components of an LLM Experimentation System

