LESSON 12 · ADVANCED · 90 MIN · ◆ 2 INSTRUMENTS

Performance Optimization: Model and Infrastructure Optimization

Master model quantization, compression, and infrastructure-level optimizations for faster and more cost-effective agent systems.

Learning Objectives

By the end of this lesson, you will be able to:

  • Implement model quantization and compression techniques
  • Design hardware-accelerated inference pipelines
  • Optimize costs through intelligent model selection and routing
  • Build efficient multi-model serving architectures
  • Deploy optimized models across different hardware configurations

Introduction

While the previous lesson focused on application-level optimizations, this lesson dives into model and infrastructure-level performance improvements. We'll explore how to make models smaller, faster, and more cost-effective while maintaining quality.

Model Optimization Techniques

Model Optimization Strategy Overview

Fig. 02. Flow diagram.

Optimization Techniques Comparison

| Technique | Size Reduction | Speed Improvement | Quality Impact | Implementation Complexity |
|---|---|---|---|---|
| Quantization (INT8) | 75% | 2-4x | Minimal | Low |
| Pruning (50%) | 50% | 1.5-2x | Minimal | Medium |
| Knowledge Distillation | 60-80% | 3-5x | Low | High |
| Weight Compression | 30-50% | 1.2-1.5x | None | Medium |
| Operator Fusion | 0% | 1.5-2x | None | Low |
| Mixed Precision | 50% | 1.5-2x | Minimal | Low |
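
To make the INT8 row concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. The function names and sample weights are illustrative assumptions, not part of the lesson, and real toolchains typically quantize per channel with calibration data. The 75% figure follows from storing each weight in 1 byte instead of the 4 bytes of FP32.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the INT8 codes."""
    return [q * scale for q in quantized]


# Hypothetical sample weights, chosen only for illustration.
weights = [0.82, -1.27, 0.003, 0.41, -0.66]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

fp32_bytes = 4 * len(weights)  # 4 bytes per FP32 weight
int8_bytes = 1 * len(weights)  # 1 byte per INT8 weight
size_reduction = 1 - int8_bytes / fp32_bytes
```

The largest-magnitude weight maps to -127, and no restored weight differs from its original by more than half of `scale`, which is why the quality impact is usually small when the weight distribution has no extreme outliers.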

Fig. 04. AI Agents Explorer (interactive): watch AI agents move, think, and communicate in real time.

Infrastructure Optimization

Hardware-Accelerated Inference Pipeline

Fig. 06. Flow diagram.

Hardware Performance Comparison

| Hardware | Throughput | Latency | Cost/Hour | Power Usage | Best Use Case |
|---|---|---|---|---|---|
| CPU (High-end) | Low | Medium | $0.10 | Low | Small models, edge inference |
| GPU (A100) | High | Low | $3.00 | High | Large models, training |
| TPU v4 | Very High | Medium | $2.40 | Medium | Batch processing, training |
| Edge TPU | Medium | Very Low | $0.05 | Very Low | Mobile, IoT devices |
| AWS Inferentia | High | Low | $0.80 | Low | Production inference |
| Custom ASIC | Very High | Very Low | $1.50 | Low | Specialized workloads |
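
Hourly price alone does not tell you which hardware is cheapest, since what matters is the price divided by the work done in that hour. The sketch below converts the hourly prices from the table into cost per million requests. The requests-per-second figures are hypothetical placeholders, because the table gives throughput only qualitatively. Replace them with numbers measured on your own model.

```python
# Hourly prices are taken from the table above.
HOURLY_PRICE_USD = {
    "CPU (High-end)": 0.10,
    "GPU (A100)": 3.00,
    "AWS Inferentia": 0.80,
}

# ASSUMPTION: sustained requests per second for one model on each device.
# These values are invented for illustration and must be benchmarked.
ASSUMED_RPS = {
    "CPU (High-end)": 1.0,
    "GPU (A100)": 50.0,
    "AWS Inferentia": 20.0,
}


def cost_per_million_requests(hourly_price, requests_per_second):
    """Convert an hourly price and a sustained request rate to USD per 1M requests."""
    requests_per_hour = requests_per_second * 3600
    return hourly_price / requests_per_hour * 1_000_000


costs = {
    name: cost_per_million_requests(HOURLY_PRICE_USD[name], ASSUMED_RPS[name])
    for name in HOURLY_PRICE_USD
}
cheapest = min(costs, key=costs.get)
```

Under these assumed rates the device with the lowest hourly price (the CPU) is the most expensive per request, which illustrates why benchmarking throughput comes before comparing price lists.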

Cost Optimization Strategies

Multi-Model Serving Architecture

Fig. 08. Flow diagram.

Interactive Cost Optimization Demo

TIP

▶ Try this first. Open the AgentExplorer below and explore how requests flow through resource pooling and monitoring before any auto-scaling kicks in — watch how a fixed pool of model workers absorbs load, and notice the point where demand outpaces capacity. Come back to the cost-versus-quality theory once you've seen the pressure build.

Fig. 10. AI Agents Explorer (interactive): watch AI agents move, think, and communicate in real time.

Cost vs Quality Trade-offs

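| Model Size | Inference Cost (per request) | Quality Score | Latency (ms) | Cost per 1M Requests |
|---|---|---|---|---|
| Small (1B params) | $0.0001 | 85% | 50 | $100 |
| Medium (7B params) | $0.001 | 92% | 150 | $1,000 |
| Large (70B params) | $0.01 | 96% | 500 | $10,000 |
| External API | $0.002 | 94% | 200 | $2,000 |
| Hybrid (Smart Routing) | $0.0015 | 93% | 120 | $1,500 |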
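
One way to read this table is as a constrained choice: pick the cheapest option that still meets a quality floor and a latency ceiling. The sketch below encodes the table rows and applies that rule. The `ModelOption` class and `cheapest_meeting` helper are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelOption:
    name: str
    cost_per_request: float  # USD, from the table
    quality: float           # quality score as a fraction
    latency_ms: int


OPTIONS = [
    ModelOption("Small (1B)", 0.0001, 0.85, 50),
    ModelOption("Medium (7B)", 0.001, 0.92, 150),
    ModelOption("Large (70B)", 0.01, 0.96, 500),
    ModelOption("External API", 0.002, 0.94, 200),
    ModelOption("Hybrid (Smart Routing)", 0.0015, 0.93, 120),
]


def cheapest_meeting(options: List[ModelOption], min_quality: float,
                     max_latency_ms: int) -> Optional[ModelOption]:
    """Return the lowest-cost option that satisfies both constraints, or None."""
    eligible = [o for o in options
                if o.quality >= min_quality and o.latency_ms <= max_latency_ms]
    return min(eligible, key=lambda o: o.cost_per_request) if eligible else None


def cost_per_million(option: ModelOption) -> float:
    """Scale the per-request cost to one million requests."""
    return option.cost_per_request * 1_000_000


choice = cheapest_meeting(OPTIONS, min_quality=0.93, max_latency_ms=200)
```

With a 93% quality floor and a 200 ms ceiling, the hybrid row wins: the 70B model is too slow and the external API costs more. Loosening the floor to 92% shifts the choice to the 7B model.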

Model Serving Optimization

Dynamic Model Loading and Scaling

Fig. 12. Flow diagram.

Serving Strategy Performance

| Strategy | Cold Start Time | Resource Efficiency | Cost Efficiency | Complexity | Scalability |
|---|---|---|---|---|---|
| Always Warm | 0s | Low | Low | Low | Limited |
| Auto-scaling | 30-60s | Medium | Medium | Medium | Good |
| Predictive Scaling | 5-15s | High | High | High | Very Good |
| Serverless | 1-10s | Very High | Very High | Low | Excellent |
| Hybrid Approach | 2-20s | High | High | High | Excellent |
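
Dynamic model loading sits between "Always Warm" and fully on-demand serving: keep a bounded set of models resident and pay a cold start only when a request needs one that is not loaded. The sketch below shows one possible policy, least-recently-used eviction. The `ModelCache` class and the `loader` callable are hypothetical, and a real loader would move weights onto an accelerator rather than return a string.

```python
from collections import OrderedDict


class ModelCache:
    """Keep at most `capacity` models resident, evicting the least recently used."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader          # callable: model name -> loaded model
        self._models = OrderedDict()  # ordered from least to most recently used
        self.cold_starts = 0

    def get(self, name):
        if name in self._models:
            self._models.move_to_end(name)  # mark as most recently used
            return self._models[name]
        self.cold_starts += 1               # not resident: pay the load cost
        model = self.loader(name)
        self._models[name] = model
        if len(self._models) > self.capacity:
            self._models.popitem(last=False)  # evict the least recently used
        return model

    def loaded(self):
        return list(self._models)


# Stand-in loader used only for this demonstration.
cache = ModelCache(capacity=2, loader=lambda name: f"<{name} weights>")
for requested in ["small", "large", "small", "medium"]:
    cache.get(requested)
```

After this request sequence the cache holds "small" and "medium": the second request for "small" refreshed it, so "large" was the one evicted when "medium" arrived. Three of the four requests were cold starts.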

Advanced Infrastructure Patterns

Edge Computing and Model Distribution

Fig. 14. Flow diagram.

Connections to Previous Concepts

Building on Performance Fundamentals

Model optimization extends our application-level performance strategies:

From Performance Efficiency:

  • Caching: Enhanced with model artifact caching
  • Resource Pooling: Extended to GPU and specialized hardware pools
  • Monitoring: Augmented with model-specific metrics

Integration with Production Systems:

  • Infrastructure: Optimized hardware selection and deployment
  • Scaling: Model-aware autoscaling strategies
  • Cost Management: Multi-dimensional optimization (compute, storage, network)
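
As one illustration of model-aware autoscaling, the sketch below sizes a replica pool from token demand rather than request count, since a request that generates 500 tokens loads a model far more than one that generates 20. The function name, the 70% utilization target, and the throughput figures are all assumptions made for this example.

```python
import math


def desired_replicas(requests_per_second, output_tokens_per_request,
                     replica_tokens_per_second, target_utilization=0.7,
                     min_replicas=1):
    """Replicas needed so each one runs at or below the target utilization."""
    demand_tokens_per_second = requests_per_second * output_tokens_per_request
    usable_per_replica = replica_tokens_per_second * target_utilization
    needed = math.ceil(demand_tokens_per_second / usable_per_replica)
    return max(min_replicas, needed)


# Hypothetical load: 10 req/s at 200 output tokens each,
# on replicas that each sustain 1,000 tokens/s.
replicas = desired_replicas(10, 200, 1000)
```

Here the demand is 2,000 tokens/s against 700 usable tokens/s per replica, so three replicas are needed. Keeping utilization below 100% leaves headroom for bursts while new replicas start.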

Fig. 16. AI Agents Explorer (interactive): watch AI agents move, think, and communicate in real time.

End-to-End Optimization Pipeline

Fig. 18. Flow diagram.

Worked Example: Optimizing a RAG Support Agent

Start with one concrete workload: a customer-support RAG agent serving 40 req/s. Each request retrieves 5 chunks, stuffs ~2,400 tokens of context into a 70B model, and emits ~180 tokens. Baseline today: p95 latency 1,900 ms, cost $0.0120/req. The team targets sub-1s p95 and a 50%+ cost cut without dropping answer quality below a 92% rubric score. We apply four changes in order and measure after each.

Step 1 — INT8 quantize the 70B. Weight-only INT8 stores each weight in 1 byte, a 4x reduction relative to FP32 (2x relative to FP16), so the model fits on fewer GPUs and decode latency drops. Rubric score holds at 95% (quantization impact is minimal here). Latency drops to 1,250 ms; cost to $0.0090/req.
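
The "fewer GPUs" claim can be checked with simple arithmetic on weight memory. The sketch below counts weights only and ignores the KV cache and activations, which also need room in practice. The 80 GB accelerator size and the 90% usable fraction are assumptions for illustration.

```python
import math

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def gpus_needed(memory_gb, gpu_memory_gb=80, usable_fraction=0.9):
    """Minimum GPU count to hold the weights. ASSUMPTION: 80 GB cards, 90% usable."""
    return math.ceil(memory_gb / (gpu_memory_gb * usable_fraction))


PARAMS_70B = 70e9
footprint = {dtype: weight_memory_gb(PARAMS_70B, dtype) for dtype in BYTES_PER_PARAM}
gpu_count = {dtype: gpus_needed(gb) for dtype, gb in footprint.items()}
```

Under these assumptions the weights take 280 GB in FP32, 140 GB in FP16, and 70 GB in INT8, which corresponds to four, two, and one GPU respectively before accounting for the KV cache.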

Step 2 — Cache retrieval + responses. Support questions cluster ("reset password", "refund status"). An exact-match response cache on normalized queries plus an embedding cache for retrieval yields a 35% hit rate. Hits return in ~40 ms at near-zero marginal cost, pulling the blended numbers down sharply.
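
A minimal sketch of the exact-match response cache follows, assuming a simple normalization (lowercase, strip punctuation, collapse whitespace) that the lesson does not specify. The embedding cache for retrieval is not shown. The `blended_cost` helper reproduces the cost figure for this step: with a 35% hit rate and near-zero hit cost, only the 65% of misses pay the $0.0090 model cost.

```python
import re


def normalize(query):
    """Lowercase, drop punctuation, and collapse whitespace (assumed rules)."""
    text = query.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


class ResponseCache:
    """Exact-match cache keyed on the normalized query."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, query, compute):
        key = normalize(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = compute(query)  # falls through to the model on a miss
        self._store[key] = answer
        return answer


def blended_cost(hit_rate, miss_cost, hit_cost=0.0):
    """Expected cost per request given the cache hit rate."""
    return hit_rate * hit_cost + (1 - hit_rate) * miss_cost


cache = ResponseCache()
answer_model = lambda q: "Use the reset link on the sign-in page."  # stand-in model
cache.get_or_compute("How do I RESET my password?", answer_model)
cache.get_or_compute("  how do I reset my password ", answer_model)

cost_after_caching = blended_cost(hit_rate=0.35, miss_cost=0.0090)
```

The two differently formatted questions share one cache entry, and the blended cost comes to $0.00585, which rounds to the $0.0059 reported for this step. Note that the same weighted average does not apply to p95 latency, which depends on the shape of the latency distribution.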

Step 3 — Continuous batching (vLLM). Replace one-request-at-a-time decoding with PagedAttention continuous batching. Same hardware now serves more concurrent decodes, so under load p95 falls again with no quality change.
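
The toy simulation below illustrates why continuous batching helps. It is not vLLM's scheduler and does not model PagedAttention. It only counts decode iterations for a fixed number of batch slots, comparing a static batch, which waits for its longest request, with a scheduler that refills a slot the moment a request finishes. The request lengths are invented for the example.

```python
from collections import deque


def static_batching_steps(decode_lengths, slots):
    """Each batch of `slots` requests runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(decode_lengths), slots):
        steps += max(decode_lengths[start:start + slots])
    return steps


def continuous_batching_steps(decode_lengths, slots):
    """Admit a waiting request as soon as any running request completes."""
    waiting = deque(decode_lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        steps += 1  # one decode iteration emits one token per running request
        running = [left - 1 for left in running if left > 1]
    return steps


# Hypothetical mix of long and short answers, measured in output tokens.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
static_steps = static_batching_steps(lengths, slots=4)
continuous_steps = continuous_batching_steps(lengths, slots=4)
```

For this mix, static batching takes 16 iterations and continuous batching takes 10, because short requests no longer hold slots idle while a long request in the same batch finishes.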

Step 4 — Route cheap-model-first. Send every query to a quantized 7B first; only escalate to the 70B when a confidence gate (self-rated <0.7 or retrieval score below threshold) fires. About 60% of tickets resolve on the 7B at $0.001/req.
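
The confidence gate can be sketched as follows. The model callables, their return shapes, and the 0.5 retrieval threshold are hypothetical: the lesson specifies only the 0.7 self-rated confidence cutoff and that a retrieval threshold exists.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class RoutedAnswer:
    text: str
    model: str  # which model produced the final answer


def route(query: str,
          retrieval_score: float,
          small_model: Callable[[str], Tuple[str, float]],
          large_model: Callable[[str], str],
          min_confidence: float = 0.7,
          min_retrieval_score: float = 0.5) -> RoutedAnswer:
    """Try the 7B first; escalate to the 70B when either gate condition fires."""
    text, confidence = small_model(query)  # every query hits the cheap model
    weak_retrieval = retrieval_score < min_retrieval_score
    if confidence < min_confidence or weak_retrieval:
        return RoutedAnswer(large_model(query), "70B")
    return RoutedAnswer(text, "7B")


# Stand-in models for demonstration; real ones would call an inference server.
confident_small = lambda q: ("Use the reset link on the sign-in page.", 0.9)
unsure_small = lambda q: ("Possibly a billing issue.", 0.4)
large = lambda q: "Detailed answer from the larger model."

easy = route("reset password", 0.8, confident_small, large)
hard = route("disputed double charge", 0.8, unsure_small, large)
```

An escalated request pays for both models, so the gate threshold controls a real trade-off: set it too high and most traffic pays twice, set it too low and hard tickets get weak answers.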

| Step | p95 latency | Cost/req | Rubric score |
|---|---|---|---|
| Baseline (70B) | 1,900 ms | $0.0120 | 95% |
| + INT8 quantize | 1,250 ms | $0.0090 | 95% |
| + Caching (35% hit) | 850 ms | $0.0059 | 95% |
| + Continuous batching | 620 ms | $0.0059 | 95% |
| + Cheap-first routing | 540 ms | $0.0036 | 93% |

Decision: the routing step trades 2 points of rubric quality for the last big cost drop — acceptable because the confidence gate still escalates hard tickets. Final result: p95 540 ms (3.5x faster), $0.0036/req (70% cheaper), comfortably inside both targets. The order matters: quantize and cache before routing, so the escalation path is already cheap when it fires.
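
The headline numbers can be re-derived from the results table with a few lines of arithmetic. This sketch only restates the figures above and checks them against the three targets set at the start of the example (sub-1s p95, a cost cut of at least 50%, rubric score of at least 92%).

```python
# (step, p95 latency in ms, cost per request in USD, rubric score in %)
STEPS = [
    ("Baseline (70B)", 1900, 0.0120, 95),
    ("+ INT8 quantize", 1250, 0.0090, 95),
    ("+ Caching (35% hit)", 850, 0.0059, 95),
    ("+ Continuous batching", 620, 0.0059, 95),
    ("+ Cheap-first routing", 540, 0.0036, 93),
]

_, base_latency, base_cost, base_score = STEPS[0]
_, final_latency, final_cost, final_score = STEPS[-1]

speedup = base_latency / final_latency  # how many times faster
cost_cut = 1 - final_cost / base_cost   # fraction of cost removed
quality_drop = base_score - final_score # rubric points lost

meets_targets = (
    final_latency < 1000
    and cost_cut >= 0.50
    and final_score >= 92
)
```

The speedup works out to about 3.5x and the cost cut to 70%, with a 2-point rubric drop that stays above the 92% floor.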

Key Takeaways

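  1. Model Optimization: INT8 quantization cuts model size by about 75% and speeds inference 2-4x with minimal quality impact; 50% pruning roughly halves model size
  2. Hardware Matters: Match hardware to the workload, for example AWS Inferentia for production inference and Edge TPU for mobile and IoT devices
  3. Cost Awareness: Cheap-first routing resolved about 60% of tickets on the 7B model in the worked example, where the full pipeline cut cost per request by 70%
  4. Continuous Monitoring: Measure latency, cost, and quality after every change to confirm that each optimization pays off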
  5. Multi-Model Strategy: Different models for different tasks based on requirements
  6. Infrastructure as Code: Automated deployment and scaling for consistent performance

Next Steps

In the next lesson, we'll cover Ethics and Safety in AI agent systems, addressing:

  • Responsible AI practices and bias mitigation
  • Safety measures and fail-safes
  • Privacy and data protection
  • Ethical decision-making frameworks

Practice Exercises

  1. Implement Model Quantization: Quantize a model using both static and dynamic methods
  2. Build a Model Router: Create an intelligent routing system for multiple models
  3. Cost Optimization Dashboard: Build a real-time cost monitoring and optimization system
  4. Hardware Benchmarking: Compare model performance across different hardware configurations
  5. Multi-Model Serving: Implement a production-ready serving system with load balancing

Further Reading

Frameworks & Libraries

  • vLLM — high-throughput serving with PagedAttention and continuous batching
  • LiteLLM — unified proxy for model routing, fallbacks, and cost tracking across providers

Related concepts: ai-agents, optimization, quantization, compression, infrastructure, cost-optimization