Performance Optimization: Model and Infrastructure Optimization
Master model quantization, compression, and infrastructure-level optimizations for faster and more cost-effective agent systems.
Learning Objectives
By the end of this lesson, you will be able to:
- Implement model quantization and compression techniques
- Design hardware-accelerated inference pipelines
- Optimize costs through intelligent model selection and routing
- Build efficient multi-model serving architectures
- Deploy optimized models across different hardware configurations
Introduction
While the previous lesson focused on application-level optimizations, this lesson dives into model and infrastructure-level performance improvements. We'll explore how to make models smaller, faster, and more cost-effective while maintaining quality.
Model Optimization Techniques
Model Optimization Strategy Overview
Optimization Techniques Comparison
| Technique | Size Reduction (vs. FP32 baseline) | Speed Improvement (typical) | Quality Impact | Implementation Complexity |
|---|---|---|---|---|
| Quantization (INT8) | 75% | 2-4x | Minimal | Low |
| Pruning (50%) | 50% | 1.5-2x | Minimal | Medium |
| Knowledge Distillation | 60-80% | 3-5x | Low | High |
| Weight Compression | 30-50% | 1.2-1.5x | None | Medium |
| Operator Fusion | 0% | 1.5-2x | None | Low |
| Mixed Precision | 50% | 1.5-2x | Minimal | Low |
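To make the INT8 row concrete, here is a minimal sketch of symmetric per-tensor quantization using only the standard library. It assumes FP32 weights as the baseline and a single scale for the whole tensor; the function names `quantize_int8` and `dequantize` are illustrative, and production toolchains typically use per-channel scales and calibration data instead.

```python
from array import array

def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the signed 8-bit range.
    q = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return q, scale

def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [scale * v for v in q]

# Toy FP32 weight tensor (array type "f" stores 4 bytes per value).
weights = array("f", [0.12, -0.5, 0.33, 0.9, -0.77, 0.01, 0.0, -0.25])
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

fp32_bytes = weights.itemsize * len(weights)  # 4 bytes per weight
int8_bytes = q.itemsize * len(q)              # 1 byte per weight
size_reduction = 1 - int8_bytes / fp32_bytes  # 0.75, matching the table
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The rounding error per weight is bounded by half a quantization step (`scale / 2`), which is why the quality impact is usually small when the weight distribution has no extreme outliers.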
Infrastructure Optimization
Hardware-Accelerated Inference Pipeline
Hardware Performance Comparison
| Hardware | Throughput | Latency | Cost/Hour | Power Usage | Best Use Case |
|---|---|---|---|---|---|
| CPU (High-end) | Low | Medium | $0.10 | Low | Small models, edge inference |
| GPU (A100) | High | Low | $3.00 | High | Large models, training |
| TPU v4 | Very High | Medium | $2.40 | Medium | Batch processing, training |
| Edge TPU | Medium | Very Low | $0.05 | Very Low | Mobile, IoT devices |
| AWS Inferentia | High | Low | $0.80 | Low | Production inference |
| Custom ASIC | Very High | Very Low | $1.50 | Low | Specialized workloads |
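Hourly price alone does not say which hardware is cheapest per request; sustained throughput matters just as much. The sketch below combines the hourly costs from the table with throughput figures that are purely hypothetical placeholders (the table only gives qualitative throughput), so substitute your own benchmark numbers before drawing conclusions.

```python
def cost_per_million(hourly_cost_usd, sustained_rps):
    """Cost of serving one million requests at a given sustained request rate."""
    requests_per_hour = sustained_rps * 3600
    return hourly_cost_usd / requests_per_hour * 1_000_000

# Hourly costs taken from the table above.
HOURLY_COST = {"CPU (High-end)": 0.10, "GPU (A100)": 3.00, "AWS Inferentia": 0.80}

# ASSUMPTION: requests per second for one model on each device.
# These values are invented for illustration, not measurements.
ASSUMED_RPS = {"CPU (High-end)": 2, "GPU (A100)": 120, "AWS Inferentia": 40}

ranked = sorted(
    HOURLY_COST,
    key=lambda hw: cost_per_million(HOURLY_COST[hw], ASSUMED_RPS[hw]),
)

for hw in ranked:
    print(f"{hw}: ${cost_per_million(HOURLY_COST[hw], ASSUMED_RPS[hw]):.2f} per 1M requests")
```

Under these assumed throughputs the CPU, despite the lowest hourly price, ends up the most expensive per request, which is the kind of reversal a benchmark is meant to surface.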
Cost Optimization Strategies
Multi-Model Serving Architecture
Interactive Cost Optimization Demo
Tip: Try this first. Open the AgentExplorer below and explore how requests flow through resource pooling and monitoring before any auto-scaling kicks in. Watch how a fixed pool of model workers absorbs load, and notice the point where demand outpaces capacity. Come back to the cost-versus-quality theory once you've seen the pressure build.
Cost vs Quality Trade-offs
| Model Option | Inference Cost (per request) | Quality Score | Latency (ms) | Cost per 1M Requests |
|---|---|---|---|---|
| Small (1B params) | $0.0001 | 85% | 50 | $100 |
| Medium (7B params) | $0.001 | 92% | 150 | $1,000 |
| Large (70B params) | $0.01 | 96% | 500 | $10,000 |
| External API | $0.002 | 94% | 200 | $2,000 |
| Hybrid (Smart Routing) | $0.0015 | 93% | 120 | $1,500 |
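One simple way to use a table like this is to pick the cheapest option that satisfies a quality floor and a latency budget. The sketch below encodes the four single-model rows from the table; the `ModelOption` class and `cheapest_meeting` function are hypothetical names chosen for illustration, not part of any serving framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelOption:
    name: str
    cost_per_request: float  # USD
    quality: float           # quality score as a fraction
    latency_ms: int

# Values copied from the table above.
OPTIONS = [
    ModelOption("Small (1B)", 0.0001, 0.85, 50),
    ModelOption("Medium (7B)", 0.001, 0.92, 150),
    ModelOption("Large (70B)", 0.01, 0.96, 500),
    ModelOption("External API", 0.002, 0.94, 200),
]

def cheapest_meeting(options, min_quality, max_latency_ms):
    """Return the lowest-cost option that meets both constraints, or None."""
    eligible = [
        o for o in options
        if o.quality >= min_quality and o.latency_ms <= max_latency_ms
    ]
    return min(eligible, key=lambda o: o.cost_per_request) if eligible else None

def cost_per_million(option):
    """Scale the per-request cost to one million requests."""
    return option.cost_per_request * 1_000_000

choice = cheapest_meeting(OPTIONS, min_quality=0.90, max_latency_ms=300)
```

With a 90% quality floor and a 300 ms budget this selects the 7B model; raising the floor to 95% under the same budget leaves no eligible option, which signals that one of the two constraints has to give.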
Model Serving Optimization
Dynamic Model Loading and Scaling
Serving Strategy Performance
| Strategy | Cold Start Time | Resource Efficiency | Cost Efficiency | Complexity | Scalability |
|---|---|---|---|---|---|
| Always Warm | 0s | Low | Low | Low | Limited |
| Auto-scaling | 30-60s | Medium | Medium | Medium | Good |
| Predictive Scaling | 5-15s | High | High | High | Very Good |
| Serverless | 1-10s | Very High | Very High | Low | Excellent |
| Hybrid Approach | 2-20s | High | High | High | Excellent |
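The trade-off between "always warm" and on-demand loading can be illustrated with a small pool that keeps a fixed number of models resident and evicts the least recently used one. This is a sketch under simplifying assumptions: `WarmModelPool` and its `loader` callback are hypothetical names, and a real loader would read weights from disk or object storage and manage GPU memory.

```python
from collections import OrderedDict

class WarmModelPool:
    """Keep up to `capacity` models loaded; evict the least recently used."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader          # callable: model name -> loaded model
        self._models = OrderedDict()  # insertion order tracks recency
        self.cold_starts = 0

    def get(self, name):
        if name in self._models:
            # Warm hit: mark as most recently used.
            self._models.move_to_end(name)
            return self._models[name]
        # Cold start: load the model, then evict if over capacity.
        self.cold_starts += 1
        model = self.loader(name)
        self._models[name] = model
        if len(self._models) > self.capacity:
            self._models.popitem(last=False)
        return model

    def resident(self):
        """Names of loaded models, least recently used first."""
        return list(self._models)

# Stand-in loader that returns a placeholder instead of real weights.
pool = WarmModelPool(capacity=2, loader=lambda name: f"<{name} weights>")
for requested in ["7b", "70b", "7b", "1b", "70b"]:
    pool.get(requested)
```

Counting `cold_starts` against total requests gives a rough measure of how often users pay the cold start times listed in the table for a given pool size.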
Advanced Infrastructure Patterns
Edge Computing and Model Distribution
Connections to Previous Concepts
Building on Performance Fundamentals
Model optimization extends our application-level performance strategies:
From Performance Efficiency:
- Caching: Enhanced with model artifact caching
- Resource Pooling: Extended to GPU and specialized hardware pools
- Monitoring: Augmented with model-specific metrics
Integration with Production Systems:
- Infrastructure: Optimized hardware selection and deployment
- Scaling: Model-aware autoscaling strategies
- Cost Management: Multi-dimensional optimization (compute, storage, network)
End-to-End Optimization Pipeline
Worked Example: Optimizing a RAG Support Agent
Start with one concrete workload: a customer-support RAG agent serving 40 req/s. Each request retrieves 5 chunks, stuffs ~2,400 tokens of context into a 70B model, and emits ~180 tokens. Baseline today: p95 latency 1,900 ms, cost $0.0120/req. The team targets sub-1s p95 and a 50%+ cost cut without dropping answer quality below a 92% rubric score. We apply four changes in order and measure after each.
Step 1: INT8-quantize the 70B. Weight-only INT8 stores each weight in one byte, a 4x reduction relative to FP32 (2x relative to FP16), so the model fits on fewer GPUs and decodes faster. The rubric score holds at 95%, so the quality impact of quantization is minimal here. p95 latency drops to 1,250 ms and cost to $0.0090/req.
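The "fits on fewer GPUs" claim follows from simple arithmetic on weight memory. The sketch below assumes 80 GB accelerators with 80% of memory available for weights; both figures are assumptions for illustration, and the estimate ignores KV cache and activation memory, which also need room in practice.

```python
import math

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype):
    """Memory needed to hold the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

def min_gpus(num_params, dtype, gpu_memory_gb=80, usable_fraction=0.8):
    """Lower bound on GPU count. ASSUMES 80 GB cards with 80% usable for weights."""
    usable_gb = gpu_memory_gb * usable_fraction
    return math.ceil(weight_memory_gb(num_params, dtype) / usable_gb)

PARAMS_70B = 70e9
for dtype in ("fp32", "fp16", "int8"):
    print(dtype, weight_memory_gb(PARAMS_70B, dtype), "GB,", min_gpus(PARAMS_70B, dtype), "GPUs")
```

Under these assumptions the weights shrink from 280 GB (FP32) or 140 GB (FP16) to 70 GB at INT8, and the GPU lower bound falls accordingly.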
Step 2: Cache retrieval and responses. Support questions cluster ("reset password", "refund status"). An exact-match response cache on normalized queries plus an embedding cache for retrieval yields a 35% hit rate. Hits return in ~40 ms at near-zero marginal cost, so blended cost falls to about $0.0059/req (65% of $0.0090) and measured p95 to 850 ms.
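An exact-match response cache on normalized queries could look like the sketch below. The normalization rules (lowercasing, stripping punctuation, collapsing whitespace) and the names `normalize`, `ResponseCache`, and `blended_cost` are assumptions for illustration; the embedding cache for retrieval is not shown.

```python
import re

def normalize(query):
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", query.lower())
    return " ".join(cleaned.split())

class ResponseCache:
    """Exact-match cache keyed on the normalized query text."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, query, compute):
        key = normalize(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = compute(query)  # e.g. the full RAG pipeline
        self._store[key] = answer
        return answer

def blended_cost(hit_rate, miss_cost, hit_cost=0.0):
    """Expected cost per request when a fraction of requests is served from cache."""
    return hit_rate * hit_cost + (1 - hit_rate) * miss_cost

cache = ResponseCache()
answer_fn = lambda q: f"answer for: {normalize(q)}"  # stand-in for the model call
cache.get_or_compute("How do I reset my password?", answer_fn)
cache.get_or_compute("  how do I RESET my password ", answer_fn)  # same key, served from cache
```

With a 35% hit rate and a $0.0090 miss cost, `blended_cost(0.35, 0.0090)` evaluates to about $0.00585, which the results table rounds to $0.0059.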
Step 3: Continuous batching (vLLM). Replace one-request-at-a-time decoding with continuous batching backed by PagedAttention. The same hardware now serves more concurrent decodes, so under load p95 falls to 620 ms with cost and rubric score unchanged.
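The scheduling idea behind continuous batching can be shown with a toy model that counts decode steps. This is not vLLM's implementation and does not model PagedAttention; it assumes every decode step costs the same, ignores prefill, and compares a fixed-batch scheduler against one that refills a slot the moment a sequence finishes. The function names are hypothetical.

```python
import heapq

def static_batch_steps(output_lengths, slots):
    """Fixed batches: each batch runs until its longest sequence finishes."""
    total = 0
    for i in range(0, len(output_lengths), slots):
        total += max(output_lengths[i:i + slots])
    return total

def continuous_batch_steps(output_lengths, slots):
    """Continuous batching: a freed slot immediately takes the next request."""
    finish_times = []  # min-heap of the step at which each busy slot frees up
    for length in output_lengths:
        # Start at step 0 while slots are free, otherwise when the earliest slot frees.
        start = 0 if len(finish_times) < slots else heapq.heappop(finish_times)
        heapq.heappush(finish_times, start + length)
    return max(finish_times) if finish_times else 0

# Output lengths in tokens: a few long answers mixed with many short ones.
lengths = [200, 20, 20, 20, 180, 30, 30, 30]
static_total = static_batch_steps(lengths, slots=4)
continuous_total = continuous_batch_steps(lengths, slots=4)
```

In this example the static scheduler needs 380 decode steps because short requests wait behind the longest sequence in each batch, while the continuous scheduler finishes the same work in 200 steps.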
Step 4: Route cheap-model-first. Send every query to a quantized 7B first and escalate to the 70B only when a confidence gate fires (self-rated confidence below 0.7, or retrieval score below a threshold). About 60% of tickets resolve on the 7B at $0.001/req.
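A confidence-gated router along these lines might be sketched as follows. The 0.7 confidence threshold comes from the step above; the retrieval threshold of 0.5, the `Draft` type, and the `fake_small` and `fake_large` stand-ins are assumptions for illustration, since a real system would call the two deployed models and obtain a self-rated confidence from the 7B.

```python
from dataclasses import dataclass

@dataclass
class Draft:
    answer: str
    confidence: float  # self-rated by the small model, in [0, 1]

def route(query, retrieval_score, small_model, large_model,
          min_confidence=0.7, min_retrieval_score=0.5):
    """Try the small model first; escalate when either gate condition fires."""
    draft = small_model(query)
    if draft.confidence < min_confidence or retrieval_score < min_retrieval_score:
        return large_model(query), "70B"
    return draft.answer, "7B"

# Stand-in models: the small one is only confident about password questions.
def fake_small(query):
    confident = "password" in query.lower()
    return Draft(answer="small: " + query, confidence=0.9 if confident else 0.4)

def fake_large(query):
    return "large: " + query

easy = route("How do I reset my password?", 0.8, fake_small, fake_large)
hard = route("Why was I charged twice?", 0.8, fake_small, fake_large)
weak_retrieval = route("Reset password", 0.2, fake_small, fake_large)
```

Note that an escalated request pays for both models, so the savings depend on how often the gate fires and on how well the small model's self-rating tracks actual answer quality.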
| Step | p95 latency | Cost/req | Rubric score |
|---|---|---|---|
| Baseline (70B) | 1,900 ms | $0.0120 | 95% |
| + INT8 quantize | 1,250 ms | $0.0090 | 95% |
| + Caching (35% hit) | 850 ms | $0.0059 | 95% |
| + Continuous batching | 620 ms | $0.0059 | 95% |
| + Cheap-first routing | 540 ms | $0.0036 | 93% |
Decision: the routing step trades 2 points of rubric quality (95% to 93%) for the last big cost drop. That is acceptable because the confidence gate still escalates hard tickets and 93% stays above the 92% floor. Final result: p95 of 540 ms (3.5x faster) and $0.0036/req (70% cheaper), inside both targets. The order matters: quantize and cache before routing, so the escalation path is already cheap when it fires.
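The headline figures can be checked directly from the results table. The short script below recomputes the speedup and cost reduction and tests them against the stated targets (sub-1s p95, at least a 50% cost cut, rubric score of at least 92%); the `summarize` helper is an illustrative name.

```python
# (step, p95 latency in ms, cost per request in USD, rubric score in %)
STEPS = [
    ("Baseline (70B)", 1900, 0.0120, 95),
    ("+ INT8 quantize", 1250, 0.0090, 95),
    ("+ Caching (35% hit)", 850, 0.0059, 95),
    ("+ Continuous batching", 620, 0.0059, 95),
    ("+ Cheap-first routing", 540, 0.0036, 93),
]

def summarize(steps, max_p95_ms=1000, min_cost_cut=0.50, min_rubric=92):
    """Compare the final row against the baseline and the team's targets."""
    _, base_ms, base_cost, _ = steps[0]
    _, final_ms, final_cost, final_rubric = steps[-1]
    speedup = base_ms / final_ms
    cost_cut = 1 - final_cost / base_cost
    meets = (
        final_ms < max_p95_ms
        and cost_cut >= min_cost_cut
        and final_rubric >= min_rubric
    )
    return speedup, cost_cut, meets

speedup, cost_cut, meets_targets = summarize(STEPS)
print(f"{speedup:.1f}x faster, {cost_cut:.0%} cheaper, targets met: {meets_targets}")
```

Running the same check after each step, rather than only at the end, makes it easier to attribute a regression to the change that caused it.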
Key Takeaways
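- Model Optimization: INT8 quantization and pruning shrink models by roughly 50-75% and cut latency, usually with minimal quality impact
- Hardware Matters: Match hardware to the workload; throughput, latency, hourly cost, and power usage vary widely across CPUs, GPUs, TPUs, and inference ASICs
- Cost Awareness: Cheap-first routing behind a confidence gate can cut per-request cost substantially while keeping quality above a set floor
- Continuous Monitoring: Measure latency, cost, and quality after every change to confirm each optimization pays off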
- Multi-Model Strategy: Different models for different tasks based on requirements
- Infrastructure as Code: Automated deployment and scaling for consistent performance
Next Steps
In the next lesson, we'll cover Ethics and Safety in AI agent systems, addressing:
- Responsible AI practices and bias mitigation
- Safety measures and fail-safes
- Privacy and data protection
- Ethical decision-making frameworks
Practice Exercises
- Implement Model Quantization: Quantize a model using both static and dynamic methods
- Build a Model Router: Create an intelligent routing system for multiple models
- Cost Optimization Dashboard: Build a real-time cost monitoring and optimization system
- Hardware Benchmarking: Compare model performance across different hardware configurations
- Multi-Model Serving: Implement a production-ready serving system with load balancing
Further Reading
Papers & Articles
- Speculative Decoding — using small draft models to accelerate large model inference without quality loss
- Distilling Step-by-Step — training smaller models from LLM reasoning traces for agent tasks
- RouteLLM: Learning to Route LLMs — intelligent model routing to balance cost and quality