Model Serving Infrastructure — Compute Reference

1. At a glance

Model serving is the runtime infrastructure that hosts trained ML models and exposes them for inference at scale. In the 2024-26 window LLM serving dominates compute and cost across the industry — a single H100 hour costs $3-8, a typical production deployment pins tens to thousands of them, and inference (not training) is now the majority of accelerator demand at OpenAI, Anthropic, Meta, and Google.

The goals of a serving system are five-way:

  • Low latency — Time-to-First-Token (TTFT) and Inter-Token-Latency (ITL) within SLO at the P99 tail.
  • High throughput — tokens-per-second per dollar, maximized via batching and quantization.
  • Cost efficiency — right-sized hardware, autoscaling, spot capacity, prompt caching, model cascading.
  • Reliability — graceful degradation, retries with budget, circuit breakers, multi-region failover.
  • Observability — metrics + logs + traces deep enough to diagnose token-level pathologies.

Companion notes — [[Compute/inference-optimization]] covers the algorithmic side (quantization, speculative decoding, KV cache tricks, kernel-level work); this note covers the infrastructure and serving-framework side. [[Compute/transformer-architecture]] covers the model internals these systems exploit. [[Compute/kubernetes-deep]] covers the orchestration substrate most serving fleets run on.

The 2024-26 ecosystem is bifurcated. Generic model serving (recommenders, vision, classical ML) still uses TensorFlow Serving, TorchServe, Triton, KServe, and Seldon. LLM serving has its own stack — vLLM, TGI, TensorRT-LLM, SGLang, LMDeploy, DeepSpeed-MII — because the workload (autoregressive decoding, multi-tenant prefix sharing, GB-scale KV cache, dynamic batch evolution) breaks the assumptions of older frameworks.

A serving system is not just a single binary. It is a stack: hardware (GPU + interconnect + memory), kernels (CUDA, Triton-lang, ROCm, MLX), runtime (vLLM / TGI / TRT-LLM), gateway (Envoy / NGINX / Cloudflare / LiteLLM), orchestration (K8s + KEDA + Karpenter), storage (S3 + Lustre + Safetensors), and observability (Prometheus + OTel + LangFuse). Operating it well requires understanding the seams between layers — a problem at any seam shows up as P99 latency or unexplained cost on the bill.

2. Serving architecture patterns

How the model lives relative to the application is the first architectural choice.

Embedded / in-process — model loaded into the application process via PyTorch, ONNX Runtime, llama.cpp, or Core ML. Simple, no network hop, the lowest latency for small models, but no horizontal scaling, no GPU sharing across instances, no model versioning, and one bad model crash kills the app. Common for small classification heads, recommender features, and on-device inference.

Sidecar — model runs in a separate container co-located with the application pod (same node, loopback or Unix socket). Decouples language runtime from model runtime, allows GPU pinning per pod, but does not share GPU across applications. Used when one app owns one model exclusively.

Centralized service — dedicated serving fleet behind a gateway. Many apps and tenants share the same model replicas. The dominant pattern at scale. Adds a network hop (sub-millisecond inside a cluster, 5-50ms across regions) but unlocks batching, multi-tenancy, central observability, and operational ownership by an ML platform team.

Serverless — Modal, Replicate, RunPod, AWS Lambda (CPU + small models only), Vercel AI SDK, Cloudflare Workers AI. Scale-to-zero, pay-per-second, no idle GPU cost. Cold start is the killer — loading a 70B model from object storage to GPU HBM is 60-300s without tricks. Good for spiky workloads and prototyping; less suitable for steady high-QPS LLM serving.

Edge / on-device — model ships with the client. Core ML on iOS, TFLite / LiteRT on Android, MediaPipe LLM Inference for cross-platform mobile, WebGPU + transformers.js / MLC LLM in the browser, Microsoft Phi Silica on Windows on ARM, Apple Intelligence on iPhone 15 Pro+ and M-series Macs. Zero per-request cost, full privacy, but limited model size (1-8B parameters typical), heterogeneous hardware, and shipping model updates requires app store dance.

Hybrid request routing — many modern stacks combine three or four of these patterns. A typical large-scale architecture looks like: edge model for instant short responses (Apple Intelligence, on-device Phi Silica), centralized service for primary chat (vLLM cluster on K8s), serverless burst for spikes (Modal / RunPod), and a sidecar embedding service for retrieval (BGE-M3 or NV-Embed-v2 colocated with the application). The routing layer (LiteLLM, Cloudflare AI Gateway, Vercel AI Gateway) picks among them based on latency budget, content sensitivity, and cost target.

API surface conventions — OpenAI Chat Completions and the newer Responses API have become de facto standards. Anthropic Messages API and Google’s Vertex AI generateContent are similar in shape. vLLM, TGI, TRT-LLM, SGLang, Together, Fireworks, Anyscale, and most open-source servers all expose an OpenAI-compatible endpoint specifically because client tooling assumes it. Building against this contract gives you free portability across providers.

Streaming vs non-streaming endpoints — most chat workloads stream (SSE) for perceived latency, most embedding/classification workloads return all at once. Embeddings APIs (OpenAI embeddings, Cohere embed, Voyage AI, NV-Embed) batch many texts per request to amortize HTTP overhead — typical batch is 100-1000 texts per call.

The right pattern is workload-dependent. Most production LLM systems combine a centralized fleet (steady traffic) with serverless burst (peak spikes) and on-device for latency-critical features.

3. General serving frameworks

The non-LLM serving stack matured 2016-2022 and remains the workhorse for vision, recommenders, fraud, ranking, and classical ML.

TensorFlow Serving (Google, open-sourced 2016) — the original. gRPC + REST APIs, model versioning with hot-reload, server-side request batching, A/B testing via labels, monitoring exporters. Native to SavedModel format. Still the canonical TF deployment path; integrates with Vertex AI Prediction.

TorchServe (Meta + AWS, 2020) — PyTorch-native serving. Custom handler classes (preprocess / inference / postprocess), workflow DAGs for multi-model pipelines, dynamic batching, KServe integration. AWS deprecated SageMaker’s older PyTorch container in favor of TorchServe-based ones. Loses ground to vLLM and Triton for LLM workloads.

NVIDIA Triton Inference Server — open-source, multi-framework. Backends for ONNX Runtime, TensorRT, PyTorch (LibTorch), TensorFlow, OpenVINO, Python, DALI, FIL (forest models), and now vllm and tensorrtllm_backend. Request batching, concurrent model execution, model ensembles (DAG of models), Python business-logic scripting, model warmup, dynamic shape support. Production standard inside NVIDIA’s customer base — Meta, Microsoft, NVIDIA NIM containers, AWS SageMaker Triton endpoints. Triton 24.x adds first-class vLLM and TRT-LLM backends, blurring the line between generic and LLM serving.

KServe (CNCF, formerly KFServing) — Kubernetes-native serving via the InferenceService Custom Resource Definition. Sits on top of either Knative (scale-to-zero, autoscaling on RPS or concurrency) or RAW deployment mode (plain Deployment + HPA, no Knative dependency). Transformers, predictors, explainers, and storage initializers compose into a graph. Standard for K8s-first organizations.

Seldon Core — alternative K8s-native framework. SeldonDeployment CRD with prepackaged servers (SKLearn, XGBoost, MLflow, Triton, TensorFlow) and Python language wrappers. Pivoted to Seldon Core v2 (separate from v1) with a more declarative approach.

BentoML — Python-native packaging and serving. Define a Service class, bentoml serve, get an HTTP server. BentoCloud is the hosted offering. Yatai is the K8s operator. Strong DX for Python ML teams; less common at FAANG-scale.

Ray Serve (Anyscale) — Ray-based, designed for multi-model orchestration. Deployments are decorated Python classes, composed into graphs, scaled per-deployment. Strong fit when the model serving lives alongside Ray training, Ray Data, or Ray RLHF in the same cluster. Anyscale Endpoints exposes hosted Ray Serve.

MLServer (Seldon, open) — Python protocol-server implementing the V2 Inference Protocol (KServe Open Inference Protocol). Used as the runtime backend inside KServe and Seldon Core.

MLflow Models + mlflow.deployments — flavors-based packaging (pyfunc, sklearn, pytorch, transformers) and a deploy plugin API for Databricks, SageMaker, Azure ML targets. Often the model registry feeding into other serving frameworks.

Cog (Replicate’s container packaging) — cog.yaml + predict.py builds a container with a standard inference API. Powers Replicate’s hosted predictions. Lightweight, opinionated, good for one-off model demos.

NVIDIA NIM (NVIDIA Inference Microservices) — pre-packaged OCI containers per popular model with a tuned TRT-LLM or vLLM backend, OpenAI-compatible API, NVIDIA AI Enterprise support contract. Increasingly the path of least resistance for enterprise deployments that want NVIDIA’s stack without the engine-build burden. Each NIM is a self-contained, versioned container with built-in observability hooks.

AWS SageMaker, Azure ML, Vertex AI managed endpoints — hyperscaler-managed serving with autoscaling, monitoring, and IAM integration baked in. The trade-off is reduced flexibility (your framework versions and tuning options are constrained) for radically simpler operations. SageMaker LMI (Large Model Inference) containers ship with vLLM and TRT-LLM pre-installed.

Most teams pick one general framework (Triton or KServe) and one LLM framework (vLLM or TGI) and run them side by side.

4. LLM-specific serving (2024-26)

LLM serving is its own discipline because autoregressive decoding requires per-token compute, the KV cache grows unboundedly, prompts vary 100x in length, and prefix sharing across requests is a first-order optimization.

vLLM (Kwon et al., UC Berkeley, SOSP 2023) — the open-source default. Invented PagedAttention (treating KV cache as paged virtual memory) and continuous batching (re-batching at every decode step instead of per-request). Supports tensor parallel, pipeline parallel, expert parallel, multi-LoRA serving, FP8 / AWQ / GPTQ / SqueezeLLM quantization, speculative decoding, prefix caching. Native architectures cover Llama, Mistral, Mixtral, Qwen, DeepSeek, Yi, Phi, Gemma, Command-R, Granite, Falcon, plus most vision-language models (LLaVA, InternVL, Qwen-VL). OpenAI-compatible HTTP API. The de facto choice for self-hosted LLM serving in 2026.

TGI (Text Generation Inference) (Hugging Face, 2022+) — production-grade serving with strong HuggingFace integration. gRPC internal protocol with HTTP / OpenAI-compatible frontend. Continuous batching, tensor parallel, quantization (bitsandbytes, GPTQ, AWQ, EETQ, FP8), speculative decoding, guidance-based structured output, vision-language support. License changed to a more restrictive HF license in 2023 then back to Apache 2.0 in late 2024 — historically a source of confusion. Powers Hugging Face Inference Endpoints.

TensorRT-LLM (NVIDIA) — the fastest option on NVIDIA hardware. Precompiles a model into a TensorRT engine for a specific GPU, batch size, and sequence length range. INT4 (W4A16 / W4A8), INT8 (W8A8 / SmoothQuant), FP8 native; speculative decoding with draft heads, Medusa, EAGLE; in-flight batching equivalent to continuous batching; multi-LoRA; tensor + pipeline + expert parallel via NCCL. The build step is the friction — engine recompile per GPU SKU, per max batch, per max sequence length — but the runtime wins by 1.5-3x over vLLM on identical hardware for well-tuned configs. Drives NVIDIA NIM and ships as a Triton backend.

SGLang (Zheng et al., 2023, ongoing through 2025) — RadixAttention prefix cache (LRU radix tree over prefix tokens, automatic cross-request reuse), structured generation via grammar-constrained decoding (xgrammar integration), JSON schema enforcement, multi-modal support (Gemma 3, Qwen-VL, Llama 3.2 Vision, InternVL), strong DSL for chained / agentic prompts. Competitive with vLLM on throughput; faster on prefix-heavy workloads (system prompts, few-shot, agent loops).

LMDeploy (OpenMMLab / Shanghai AI Lab) — vLLM-like with strong W4A16 (AWQ) and W8A8 (SmoothQuant) quantization, particularly good for InternLM and Qwen models. Two engines — TurboMind (C++ / CUDA, fastest) and PyTorch (more flexible).

DeepSpeed-MII + DeepSpeed-FastGen (Microsoft) — built on DeepSpeed Inference. Dynamic SplitFuse continuous batching, model-parallel options. Less momentum than vLLM but still used inside Microsoft and for ZeRO-Inference compatibility.

MAX serve (Modular, 2024+) — Mojo-based runtime with Triton-compatible APIs. Promises CPU and GPU portability via the MAX Engine. Early adopters; ecosystem still small relative to vLLM / TRT-LLM.

Hugging Face Inference Endpoints + Inference Endpoints Hardware Hub — managed, dedicated endpoints on AWS / Azure / GCP, automatic TGI selection per model. The hardware hub (2024+) exposes Cerebras, Groq, SambaNova, and Hyperbolic Labs as backends behind the same API.

Local + edge LLM stack — llama.cpp (the foundational C++ inference engine, GGUF quantization formats Q2_K through Q8_0 and the newer IQ-series), Ollama (llama.cpp wrapper with a Modelfile and HTTP API), LM Studio (desktop GUI), Cortex (Jan’s runtime), mistral.rs (Rust llama.cpp alternative), Apple MLX-LM (Apple Silicon unified-memory inference), Microsoft Foundry Local + Phi Silica (Windows on ARM with NPU), Cog and Replicate’s local CLI. These power the desktop, mobile, and laptop tier of inference.

Choosing among the LLM frameworks — practical decision guide. Pick vLLM if you want the broadest model support, fast moving open-source community, OpenAI-compatible API, and don’t want to compile engines. Pick TGI if you’re deep in the Hugging Face ecosystem and want managed Inference Endpoints to mirror your self-hosted setup. Pick TensorRT-LLM if you need maximum performance on NVIDIA hardware and can absorb the engine-build complexity (most large-scale production deployments converge here). Pick SGLang if your workload is prefix-heavy (agents, structured generation, RAG) or you need grammar-constrained decoding. Pick LMDeploy if you primarily serve InternLM / Qwen and want strong quantization defaults. Pick llama.cpp / Ollama if the deployment is on CPU, Apple Silicon, or a laptop.

Hot-swapping frameworks — many production stacks abstract behind Triton or an OpenAI-compatible gateway so the underlying engine can be swapped without client changes. Useful when benchmarking vLLM vs TRT-LLM vs TGI on the same hardware, or when migrating from one to the other gradually. The cost of this abstraction is small (a few ms of routing overhead) and the optionality is large.

5. Key serving features

These primitives differentiate a modern LLM server from a generic one.

Continuous batching (also called in-flight batching or iteration-level scheduling, introduced by Yu et al., Orca, OSDI 2022) — instead of forming a batch and decoding it to completion before forming the next, the scheduler re-evaluates the batch at every decode step. Finished sequences drop out, new requests join, prefill happens interleaved with decode. Throughput improves 2-10x over static batching on bursty workloads.

PagedAttention (Kwon et al., 2023) — KV cache stored as fixed-size blocks (16 tokens typical) addressed via a per-sequence page table, analogous to OS virtual memory. Eliminates internal fragmentation, enables copy-on-write for parallel sampling (best-of-N, beam search) and prefix sharing.

Prefix / prompt caching — when many requests share the same leading tokens (system prompt, few-shot examples, document context for RAG), the KV cache for that prefix is computed once and shared. Anthropic Prompt Caching (2024) and OpenAI prompt caching (2024) charge 10-50% of normal input price for cached prefixes. Google Gemini context caching uses an explicit cache object. Self-hosted: vLLM --enable-prefix-caching, SGLang RadixAttention (automatic), TRT-LLM kv_cache_reuse.

Speculative decoding — a small draft model proposes K tokens, the large target model verifies them in a single forward pass. Acceptance rate 50-80% typical, end-to-end speedup 1.5-3x. Variants: standard speculative, Medusa (multiple decoding heads), EAGLE (autoregressive feature prediction), Lookahead Decoding (Jacobi iteration), self-speculative. Covered in detail in [[Compute/inference-optimization]].

Disaggregated prefill + decode — prefill is compute-bound (high FLOPS, low memory bandwidth pressure per token); decode is memory-bandwidth-bound (one token at a time, full weight matrix touched). DistServe (Zhong et al., 2024) and Splitwise (Patel et al., Microsoft, ISCA 2024) showed that separating these onto different GPU pools — typically H100 for prefill, A100 or L40S for decode — improves token / $ and tail latency. Colocated mode (prefill and decode interleaved on same GPU) is simpler and sufficient under low QPS; disaggregated is optimal but requires KV cache transfer between pools (NVLink or RDMA). Chunked prefill (vLLM, 2024) is a middle ground — split a long prefill into chunks that interleave with decode steps, smoothing tail latency without a separate pool.

Chunked prefill — vLLM’s default scheduler from 0.5+ slices a long prompt into chunks (e.g. 512 tokens each) and processes each chunk alongside ongoing decode iterations. The benefit: a single 50K-token prompt no longer monopolizes the GPU for hundreds of milliseconds while other users’ decode tokens stall. Trade-off: the prompt’s own TTFT increases slightly, in exchange for everyone else’s ITL staying smooth. SGLang implements analogous chunked-prefill scheduling.

Tensor + pipeline + expert parallel — when a model exceeds single-GPU memory or single-GPU throughput is insufficient. Tensor parallel splits each weight matrix across devices (Megatron-LM style, all-reduce per layer), pipeline parallel splits layers across devices (bubble-time trade-off, GPipe / 1F1B schedules), expert parallel splits MoE experts across devices. vLLM, TGI, and TRT-LLM all support TP (most common), PP (for very large models that need multi-node), and EP (for MoE like Mixtral 8x22B, DeepSeek V3, Qwen MoE).

Multi-LoRA serving — fine-tuned LoRA adapters (4-64 MB each) layered on a shared base model at inference time. vLLM --enable-lora, S-LoRA (Sheng et al., 2023), Punica (Chen et al., 2024), TRT-LLM Multi-LoRA. Lets a single base model serve N customer-specific or task-specific adapters concurrently — the economics that make per-customer fine-tuning viable.

Structured generation / JSON mode / tool calling — constrained decoding that guarantees output matches a schema. Outlines (Willard, 2023), Guidance (Microsoft), LMFE, xgrammar (SGLang’s engine), JSONFormer. Implemented by masking the logits to zero out tokens that would violate the grammar. Required for production tool-use, code generation, and structured extraction.

Multimodal serving — vision and audio tokenization, cross-modal attention, separate image / audio encoders feeding the LLM. vLLM, SGLang, TGI, and TRT-LLM all support major VLMs (Llama 3.2 Vision, Qwen2.5-VL, Pixtral, InternVL3, Phi-4-multimodal, GPT-4o-like architectures via open weights).

KV cache offload + tiered storage — when GPU HBM fills up but you don’t want to evict, offload less-recently-used KV blocks to CPU RAM, NVMe, or remote tiered storage. LMCache (CMU + UChicago, 2024) and vLLM’s experimental disaggregated KV-cache backend implement this. Reloading from CPU RAM costs 1-5ms per 1K tokens; from NVMe 10-30ms; from remote RDMA 5-15ms. The economics make sense when prefix-cache reuse is high and the alternative is recomputing prefill from scratch.

Token streaming protocols — Server-Sent Events (SSE, OpenAI standard via text/event-stream), HTTP/2 chunked, gRPC server streaming, WebSocket (less common). SSE wins on simplicity and works through most proxies; WebSocket is needed for bidirectional or when SSE buffering on intermediaries is unacceptable. HTTP/2 multiplexing is critical — without it, browsers max out at 6 concurrent connections and streaming chat suffers.

Long-context strategies — for >100K-token contexts, the KV cache dominates memory and the attention compute becomes a bottleneck. Strategies: ring attention (split sequence across GPUs), sliding-window attention, attention sinks (StreamingLLM, Xiao et al. 2024), context compression (LLMLingua, Activation Beacon), retrieval-augmented compression (RAG over the long context itself), Mamba / SSM architectures for some workloads.

Reasoning model serving — o1, o3, R1, and Claude extended-thinking variants emit long internal reasoning chains before the user-visible answer. Serving them requires longer per-request compute (often 10-100x more tokens than a standard chat reply), stricter timeout budgets, and care with caching (the reasoning chain itself is often invisible to the user but still counts in tokens). Throughput per GPU drops significantly; cost per task rises but accuracy improves.

Agent-loop serving — agents make many sequential LLM calls in a single user-facing turn (planner tool call observation next step). Optimizations: keep the same conversation pinned to one replica via sticky routing to maximize prefix cache hits, batch tool-execution side calls, use a smaller / cheaper model for routine sub-steps (model cascade), aggressively cap max-steps to bound runaway loops.

6. Throughput vs latency optimization

The two goals trade off, and the right knob depends on workload.

Throughput-oriented workloads — batch inference, embeddings, document summarization, classification, agent backends with relaxed user-facing SLOs. Maximize tokens / sec / $. Push batch size large, quantize aggressively (W4A16, FP8), pack multiple short sequences per batch, accept higher per-request latency.

Latency-oriented workloads — chat, code completion, voice agents, anything user-facing and synchronous. Minimize TTFT (Time-to-First-Token, mostly prefill-bound, sensitive to prefix cache hit) and ITL (Inter-Token-Latency, mostly decode-bound, sensitive to batch size, GPU memory bandwidth, and quantization quality). Cap batch size, use prefix caching, use speculative decoding, target faster GPUs (H100 SXM, H200, MI300X) for memory bandwidth.

Typical chat SLOs — P50 TTFT under 200ms, P99 TTFT under 500ms, P50 ITL under 30ms (33 tokens/sec), P99 ITL under 50ms (20 tokens/sec). Coding assistants tolerate higher TTFT (500-1000ms) but want fast ITL. Voice agents need sub-300ms end-to-end including ASR + LLM + TTS, so LLM TTFT under 150ms is critical.

Online vs offline — offline batch APIs accept 24-hour SLA and trade latency for cost. Anthropic Message Batches API and OpenAI Batch API both offer 50% discount; Google Vertex Batch Prediction and AWS Bedrock Batch similar. Self-hosted equivalent: run a separate dedicated low-priority queue on spot GPUs overnight.

Headroom planning — pick a target P99 latency, identify the batch size at which the framework hits that latency, then set max-concurrency below that to leave headroom for traffic spikes. A common mistake is sizing for steady-state P50 and watching P99 collapse when a thundering herd arrives. 30-50% headroom on the per-replica concurrency limit, with horizontal scaling making up the difference, gives smoother tails than a tight per-replica limit.

Queueing theory implications — when arrival rate approaches service rate, queue depth and wait time grow toward infinity (M/M/1 queue intuition). LLM serving has variable service time (output length varies 10x across requests), making the tail even worse. Practical rule: target 60-70% utilization on average, scale before exceeding 80%.

7. Multi-tenancy + isolation

Sharing model replicas across tenants is mandatory for cost efficiency but requires care.

Per-tenant quotas — rate-limit by requests-per-minute and tokens-per-minute. Stripe’s Limit-Service pattern (sliding window over Redis), Envoy rate-limit filter with descriptors, or in-application Token Bucket. Quotas should be enforced at the gateway, not the inference server, to reject early.

Fair scheduling between tenants — within a single replica, multiple tenants’ requests interleave through continuous batching. Weighted-fair-queueing across tenants prevents a single noisy tenant from monopolizing the batch. vLLM’s priority and SGLang’s tenant scheduling extensions provide hooks.

GPU isolation — MIG (Multi-Instance GPU) on A100/H100/H200 partitions a single GPU into up to seven isolated slices with dedicated SMs, L2 cache, and HBM. Useful when tenants need hard isolation guarantees. vGPU (NVIDIA AI Enterprise) virtualizes a GPU into time-sliced shares. Full-GPU-dedicated remains the default for performance-critical paths.

Container resource limitscgroups memory + CPU limits, K8s resources.limits for nvidia.com/gpu (whole-GPU only without MIG), eBPF-based GPU sharing (HAMi project, NVIDIA k8s-device-plugin with time-slicing). The CPU side of the inference pod also matters — token streaming, tokenization, and HTTP handling can saturate cores if undersized.

GPU device plugin pitfalls — the NVIDIA Kubernetes device plugin exposes whole GPUs by default; sharing requires either MIG (clean isolation but rigid slice sizes) or time-slicing (no isolation, requests can starve each other). HAMi and the NVIDIA Multi-Process Service (MPS) offer middle grounds. Choose carefully — the wrong sharing mode silently destroys SLOs.

Cross-tenant audit trail — for SOC 2, ISO 27001, HIPAA, EU AI Act, and similar regimes, every inference request must be auditable: who, what model, what version, what prompt (or its hash), what response (or its hash), at what time, from what IP, under what API key. Write to an immutable append-only store (AWS S3 Object Lock, GCS retention policy, Azure Immutable Blob) and retain per your retention policy. Anthropic and OpenAI both publish detailed compliance documentation for their managed services along these lines.

Tenant-aware caching — prefix cache hits should not leak tenant A’s prompt content to tenant B. Either partition the cache per tenant, hash with a tenant salt, or accept that the cache key is content-addressed and rely on the fact that two tenants having the identical prefix is benign.

Noisy-neighbor mitigation — one tenant submitting 100K-token prompts at high QPS can blow out KV cache for everyone else on that replica. Defenses: per-tenant KV cache budgets (vLLM --max-num-batched-tokens interacts with this), dedicated long-context replicas separate from chat replicas, admission control that rejects when KV cache exceeds a high-watermark.

Tenant-priority tiers — gold / silver / bronze QoS classes routed to different replica pools with different SLO targets. Premium customers land on dedicated H100 pools with aggressive replica count; trial users share L4 pools with scale-to-zero. K8s topology-aware routing + Istio destination subsets implement this cleanly.

8. Routing + load balancing

Getting requests to the right replica is its own discipline.

K8s Service + Ingress / Gateway API — Service does intra-cluster L4 load balancing (kube-proxy iptables/IPVS), Ingress and Gateway API expose externally. Gateway API (the successor to Ingress) has first-class support for HTTP/2, weighted backends, and traffic splits.

Envoy / Istio / Linkerd — L7 routing, traffic split for canaries, retries with budget, outlier detection, mTLS, distributed tracing headers. Istio’s VirtualService and DestinationRule are the standard primitives. Linkerd is lighter and faster but has fewer features.

Sticky sessions for prefix-cache hit — without stickiness, a follow-up turn in a chat may land on a different replica with cold KV cache, wasting prefix-cache potential. Route by conversation ID, user ID, or hash of the system prompt to the same replica. Envoy ring_hash LB policy, NGINX hash, or Istio consistent-hash subsets achieve this.

Latency-aware routing + circuit breaking — least-outstanding-requests (LOR) is a better policy than round-robin for variable-cost LLM workloads. Circuit breakers (per-backend, per-route) prevent cascade failures when a replica goes slow.

NGINX + HAProxy — workhorse HTTP/2 reverse proxies. HAProxy excels at low-latency, high-connection-count workloads; NGINX is more general. Both can do consistent hashing and least-connections.

Skypilot + Anyscale + Modal — multi-cloud burst orchestration. Skypilot (Berkeley) finds cheapest spot GPU across AWS/GCP/Azure/Lambda/RunPod and runs the job there; Anyscale and Modal are commercial equivalents with managed infrastructure.

Cloudflare AI Gateway + LangSmith Proxy + LiteLLM Proxy + Helicone — LLM-specific gateways. Caching, retry-with-fallback to alternate providers, cost tracking, prompt versioning, rate limiting, evaluation hooks. LiteLLM Proxy is open-source and self-hostable.

Cross-region failover — for multi-region serving, route by latency (geographic proximity) but maintain failover to another region on errors or capacity exhaustion. AWS Global Accelerator, Cloudflare Load Balancing, GCP Global External HTTP Load Balancer, Azure Front Door. Critical: pre-warm capacity in failover regions or accept cold-start tax during incidents. Anthropic, OpenAI, and Google all run multi-region inference fleets with active-active failover.

Latency-budgeted routing — given a per-request latency budget (e.g. 3 seconds total), the gateway can pick a model and route based on expected service time. Fast cheap models first; on timeout, retry with a more capable one within the remaining budget. LangChain’s RunnableWithFallbacks, LiteLLM’s fallback chain, and custom routers implement this.

Semantic caching — beyond exact-prefix caching, semantic caches (GPTCache, Redis VectorSet semantic cache, Cloudflare AI Gateway semantic cache) embed the prompt and look for near-neighbors in a vector index of past prompts. Hit returns the cached response without running the model. Works well for FAQ-style traffic with high prompt similarity; risky for time-sensitive or personalized content (stale cache hits look like bugs).

9. Autoscaling

LLM workloads are spikier and heavier than typical web workloads, and scaling has to account for slow cold starts.

HPA / VPA — Horizontal Pod Autoscaler scales pod count on CPU + GPU util or custom metrics. VPA tunes resource requests / limits per pod. For LLM serving the right signals are queue depth, KV cache utilization, and tokens-per-second per replica — not raw CPU. Export these from vLLM’s /metrics (Prometheus format) and feed into Prometheus Adapter for HPA.

KEDA (Kubernetes Event-Driven Autoscaling) — scaler library that lets HPA react to external signals: SQS queue depth, Kafka consumer lag, Pulsar backlog, Prometheus PromQL queries, Redis lengths, NATS subjects. Standard for async / queue-driven LLM workloads.

Cluster Autoscaler / Karpenter (AWS) / GKE Cluster Autoscaler — node-level. Adds GPU nodes when pending pods can’t schedule, drains and terminates underutilized ones. Karpenter (AWS) is faster and handles spot and on-demand mixes more elegantly than the upstream Cluster Autoscaler.

Cold-start mitigation — preload model weights into a warm pool of nodes, cache container images locally on each node (NodeImageCache, ECR pull-through cache, Harbor proxy cache), use lazy-loading container images (Stargz, SOCI), keep N warm replicas always-on with a “minimum scale” of 1+. For very-cold paths: use nydus or Hydrostor for streaming weights from S3 directly into GPU HBM.

Scale-to-zero — Knative autoscaling.knative.dev/min-scale: "0", KEDA minReplicaCount: 0, Modal default behavior, AWS Lambda. Cold start for LLMs is 5-60s depending on model size, image, and weight loading strategy — acceptable for batch and dev, often not for prod chat.

Predictive autoscaling — beyond reactive HPA, use traffic forecasts (time-of-day, day-of-week, marketing-launch calendars) to pre-warm capacity. Cloudflare and many SaaS providers use simple ARIMA or Prophet models to predict 5-minute-ahead load and scale ahead of the curve. Cuts cold-start tail risk significantly.

Knative autoscaler (KPA) vs HPA — Knative’s Pod Autoscaler (KPA) reacts on per-second concurrency or RPS with sub-second granularity, designed for spike-handling. HPA polls every 15-60s and is too slow for true scale-to-zero workloads. For steady production fleets HPA suffices; for serverless or burst workloads KPA or KEDA is the better fit.

Scaling signal selection — for LLM serving, useful HPA custom metrics include vllm:num_requests_running, vllm:num_requests_waiting, vllm:gpu_cache_usage_perc, vllm:time_to_first_token_seconds P95, and vllm:e2e_request_latency_seconds P95. Avoid scaling purely on GPU memory util — KV cache pre-allocation makes it look like memory is fully used at startup.

10. Storage + model loading

Model weights are massive and how you move them around matters.

Model weight sizes — Llama 3.1 8B is ~16 GB at BF16, 4 GB at INT4. Llama 3.3 70B is ~140 GB at BF16, ~40 GB at INT4. Llama 3.1 405B is ~810 GB at BF16, ~225 GB at INT4. Mixtral 8x22B is ~280 GB BF16. DeepSeek-V3 671B is ~1.3 TB BF16, ~400 GB at FP8.

OCI image bloat — embedding weights in container images creates 50-500 GB images that are slow to pull. The pattern is to ship a thin image (10s of MB) and pull weights from object store at pod-start via an init container or in-process fetch. Tools: huggingface_hub snapshot download with hf_transfer (Rust-based parallel downloader, 4-8x faster than default), nydusify for lazy-pull images, AWS S3 + s5cmd, GCS gsutil -m, MinIO + Alluxio for shared cache.

NFS / shared filesystem — EFS, FSx for Lustre, Filestore, Azure Files Premium. Acts as a warm cache across nodes — first puller of a model warms the cache, subsequent pullers read at network speed. Trade-off: NFS contention and IOPS limits under fanout. Lustre / WekaFS scale better but cost more.

Model warm-up — first N tokens of decode JIT-compile attention kernels (Triton kernel autotuning, CUDA graph capture, FlashAttention version selection). Run a warm-up request at startup with representative shapes so production traffic doesn’t pay the JIT tax. vLLM does this automatically with --enforce-eager off and CUDA graph capture for common batch sizes.

Weight caching tiers — L1: GPU HBM (the live copy); L2: host CPU RAM (instant reload after eviction); L3: local NVMe (10-30 GB/sec sequential read on modern NVMe); L4: regional shared filesystem (Lustre, WekaFS, FSx); L5: object store (S3, GCS, R2). Engineer the deployment so that the working set sits in the highest tier that fits, and cold paths fall through deterministically.

Container image layering — keep weights out of the image. A 5-10 GB image (CUDA runtime + framework + model code) pulls in seconds; a 200 GB image pulls in 10+ minutes. Use init containers, sidecar fetchers, or in-process downloads on first call to populate weights from object store. AWS SageMaker, Vertex AI, and Hugging Face Inference Endpoints all follow this pattern internally.

Disk format — Safetensors is the standard (memory-mapped, no pickle, fast load), GGUF for llama.cpp ecosystem, ONNX for cross-runtime, Engine files for TensorRT-LLM (pre-compiled, GPU- and batch-specific). FP8 weights ship in their own variants of Safetensors with quantization metadata; MXFP4 / NVFP4 (Blackwell-era 4-bit microscaling formats) are 2025+.

Lazy weight loading — RunPod’s Serverless GPU and AWS SageMaker Inference Optimization both stream weights from S3 to GPU HBM in parallel chunks during model load, overlapping I/O with the small amount of CPU-side initialization. The technique cuts cold start for a 70B model from 90s+ to 15-30s on a fast cross-VPC path.

Multi-model on one replica — when serving many small models (per-customer fine-tunes, specialized rerankers, classifier heads), keep all of them resident on one GPU using shared base weights + LoRA adapters or weight swapping. vLLM multi-LoRA, Triton model-control mode, BentoML runners, and Ray Serve replicas all support patterns here.

11. Observability + monitoring

LLM serving needs LLM-aware telemetry, not just CPU and RAM.

Metrics — collect at minimum:

  • Latency percentiles — P50 / P95 / P99 for TTFT, ITL, and end-to-end.
  • Token throughput — prefill tokens/sec and decode tokens/sec, separately.
  • GPU utilization — SM util, memory util, HBM bandwidth util (use DCGM exporter).
  • KV cache utilization and prefix-cache hit rate.
  • Queue depth and time-in-queue.
  • Batch size distribution per iteration (continuous batching makes this dynamic).
  • Error rate by class — OOM, timeout, malformed-request, model-unavailable.

Logs — per-request structured logs with sanitization. Capture request ID, tenant ID, model ID, prompt length, output length, finish reason, latency breakdown — but redact prompt and completion content (or store separately with stricter access).

Traces — OpenTelemetry distributed traces span gateway → router → inference server → GPU kernel level. Tempo / Jaeger / Honeycomb on the backend. Critical for diagnosing where a slow P99 actually spent its time.

LLM-specific observability — token usage and dollar-cost tracking per tenant, prefix-cache hit rate, prompt and output length histograms, tool-call frequency, structured-output validation failure rate.

Tooling stack — Prometheus + Grafana + Loki + Tempo as the generic OSS stack (see [[Compute/observability-stack]]). LLM-specific managed: LangFuse, Helicone, LangSmith (LangChain), Phoenix (Arize), Braintrust, Weights & Biases Weave, Honeycomb’s LLM features, Datadog LLM Observability. Self-hostable: LangFuse OSS, Phoenix OSS, Helicone OSS.

GPU-specific telemetry — NVIDIA DCGM (Data Center GPU Manager) exposes SM occupancy, HBM bandwidth, NVLink throughput, PCIe traffic, ECC errors, power, temperature, and clock-throttling reasons. nvidia-smi dmon, nvidia-smi pmon, and the DCGM Exporter for Prometheus are essential. For AMD: ROCm SMI + AMD GPU Operator metrics. For Apple Silicon: powermetrics and Metal Performance Shaders profiler.

Saturation signals — the four Golden Signals (latency, traffic, errors, saturation) for LLM serving translate to: latency = TTFT + ITL, traffic = tokens/sec in and out, errors = HTTP 5xx + token-budget exceeded + safety filter rejects, saturation = KV cache util + queue depth + GPU SM occupancy. Alert on saturation before latency degrades — early warning is more useful than after-the-fact pages.

Sampling strategies for high-cardinality data — token-level traces explode storage cost. Sample aggressively: 1% baseline trace sampling, 100% on errors, head-based sampling by tenant for compliance traffic, tail-based sampling (OTel collector) to retain slow tails. Honeycomb and Tempo both support tail-based sampling natively.

Cost dashboards — surface per-tenant, per-model, per-feature spend hourly. Without this the bill comes as a surprise. Useful axes: tokens/$ trend, cache-hit-rate trend, requests-per-second per replica, GPU hours per request class. FinOps for AI is now its own discipline; Vantage, CloudHealth, Anodot AI Cost, and homegrown dashboards on Snowflake / BigQuery are common.

12. Eval + canary + rollout

Shipping a model change is much riskier than shipping a code change because behavior is statistical.

Shadow traffic — duplicate prod requests to a candidate replica, log responses, do not return them to the user. Diff against prod outputs (exact match for deterministic, semantic similarity or LLM-as-judge for generative). Catches regressions before any user impact.

A/B and canary — route a small percentage (1-5%) of real traffic to the candidate. Hold for hours or days, watch metrics (latency, completion rate, regenerate rate, thumbs-down, downstream conversion). Ramp up over a curve. Istio / Linkerd / Cloudflare / AWS App Mesh provide traffic-splitting primitives.

Offline eval — golden set of N representative prompts + reference outputs. Ragas for RAG-specific metrics. LLM-as-judge with a strong model (Claude Opus, GPT-4o, Gemini Pro) scoring along rubrics. Manual review for high-stakes axes (safety, accuracy on regulated content). MTEB / BEIR for embeddings, HumanEval / MBPP / LiveCodeBench / SWE-bench for code, MMLU / GPQA / ARC for general reasoning, MT-Bench / Arena-Hard / Chatbot Arena for chat quality.

Online eval — implicit signals: thumbs-up / thumbs-down, completion rate (did user accept the answer?), regenerate rate, follow-up question rate (a high rate may indicate poor first answer), session length, task completion.

Drift detection — input distribution drift (changing prompt patterns, new languages, new lengths) and output distribution drift (refusal rate change, length change, hallucination rate change). Compare rolling histograms to a reference window.

Progressive delivery primitives — feature flags scoped to models (LaunchDarkly, Statsig, Unleash, OpenFeature), weighted traffic split via Istio / Linkerd / Cloudflare, regional rollouts (start in low-traffic region, ramp by region), and automated rollback on SLO breach (Argo Rollouts, Flagger). Treat the model version as a code dependency: pin it, test it, roll it with the same discipline as any binary.

Replay-based regression testing — capture a corpus of production prompts with their responses on the current model. When deploying a new model version, replay the corpus and diff. Use embedding similarity, BLEU-like surface metrics, and LLM-as-judge for a multi-dimensional score. Flag regressions that exceed thresholds.

Tiered evaluation gates — promote a model through stages: smoke tests (does it answer at all), golden-set scoring (does it score within X% of incumbent), shadow traffic (does it diverge from incumbent on real traffic), canary (does live A/B show no SLO regression), full rollout. Each gate has a defined pass/fail criterion and rollback trigger.

Outage playbooks — when a primary model provider degrades (latency spike, error rate, capacity exhaustion), automated fallback to a secondary provider or self-hosted backup keeps the product up. Define and rehearse the playbook quarterly: what threshold triggers fallback, who decides, how long the fallback persists, how the system recovers. Multi-cloud LLM gateways (LiteLLM, Portkey, OpenRouter) make this mechanically straightforward.

Eval frameworks — Inspect (UK AI Safety Institute), OpenAI Evals, Anthropic’s open-sourced bench harnesses, EleutherAI lm-eval-harness, HELM (Stanford CRFM), MTEB / MMTEB (embeddings), BEIR (retrieval), AgentBench, SWE-bench, LiveCodeBench. Pick a coherent set that covers your workload and run them on every candidate model. Cache results so you don’t re-run unchanged evals.

13. Cost optimization

LLM serving cost dominates many companies’ AI budgets. The high-leverage levers:

Right-size hardware — H100 SXM5 is the default for large models needing memory bandwidth, but L40S, L4, A10G, and even Mac M-series (via MLX) are dramatically cheaper for smaller models. H200 (141 GB HBM3e) and B100 / B200 (Blackwell, 2025+) extend the high end. AMD MI300X (192 GB HBM3) is a viable alternative for 70B+ models needing memory.

Quantize aggressively while watching quality on representative eval. W4A16 (AWQ, GPTQ) is nearly lossless for most modern models, W8A8 (SmoothQuant) is broadly safe, FP8 (E4M3 / E5M2 on Hopper+, native MX-FP8 on Blackwell) is the new default for both train and inference, INT4 KV cache halves memory for long contexts.

Speculative decoding with a small draft model (3-8B drafting for a 70B target) gives 1.5-3x speedup with no quality loss.

Prefix caching is enormous for system-prompt-heavy workloads. Anthropic Prompt Caching (90% input discount on cache hits), OpenAI prompt caching (50% discount), Google Gemini context caching (75% discount), self-hosted vLLM / SGLang automatic.

Batch API for offline workloads — Anthropic Message Batches and OpenAI Batch API offer 50% discount for 24h SLA. Move any non-interactive workload (overnight embeddings, scheduled extractions, training data labeling) there.

Model cascade — try a cheap small model first (Haiku, Gemini Flash, Llama 3.1 8B); if confidence is low or task is complex, escalate to a large model (Opus, Gemini Pro, GPT-4o, Llama 3.3 70B). Confidence signal: output log-probability, dedicated router, or an explicit “is this hard” classifier.

Distillation + MoE pruning + small-model fine-tune — for high-volume domain workloads, distill a large model down to a small one specialized for that workload. The economics flip dramatically at >10M tokens/day.

Spot / preemptible GPUs — Lambda Labs Reserved + Spot, RunPod Spot, Hyperstack Reserved, AWS EC2 Capacity Blocks + Spot, GCP Spot VMs, Azure Spot. 50-80% discount over on-demand. Pair with checkpointing and short job duration.

Self-hosted vs API crossover — depends on model size, hardware price, utilization, and operational cost. Rough rule of thumb: a single H100 at full utilization serves ~2000 Llama 3 70B tokens/sec, ~$0.50/hour amortized, beating most API pricing above 10-50M tokens/day. Below that, hosted APIs win on operational overhead alone.

Idle cost is the silent killer — the real economics depend on utilization, not peak throughput. A reserved H100 cluster sitting at 20% utilization costs 5x what its tokens-served-per-dollar number would suggest. Squeeze utilization with multi-tenant scheduling, cross-region failover (one cluster handles two timezones), batch backfill during low-traffic windows, and aggressive scale-down policies.

Token accounting hygiene — track input tokens, output tokens, cached input tokens, batched tokens, and rejected tokens separately per tenant per model. Map each category to its cost line. Without this, attribution to product features or customers is impossible. OpenTelemetry’s GenAI semantic conventions (2024) define a standardized set of attributes — adopt them rather than inventing your own.

14. Security + safety

TLS + mTLS — between gateway and inference server, between inference server and any sidecar. Istio / Linkerd / SPIRE / Consul auto-rotate mTLS certs.

API key + per-tenant auth — see [[Compute/auth-authz]]. Per-key rate limits, per-key scopes (which models, which features), key rotation.

Input filtering and output filtering — Lakera Guard, NeMo Guardrails (NVIDIA), Llama Guard 3 / Prompt Guard (Meta), Microsoft Azure AI Content Safety, OpenAI Moderation, Google Vertex Safety Filters. Run input through a fast classifier before the model, output through a moderation pass before returning.

Prompt injection detection — heuristic (instruction-pattern matching) + LLM-based classifier + canary-token techniques. The threat is high in agentic systems where the LLM acts on retrieved or user-supplied content. PromptGuard-2, Lakera, and Microsoft’s Prompt Shields are 2024-26 defenses.

PII redaction before logging — Presidio (Microsoft), Amazon Comprehend PII, Google DLP, OpenAI Moderation. Pipe prompts and completions through redaction before they hit log stores or eval datasets.

Audit log + provenance — who called what, with what prompt, against what model version. Immutable append-only store (e.g. S3 with object lock). Required for SOC 2, HIPAA-like compliance, and AI Act conformity.

Model leakage prevention — bound error messages so they don’t echo prompts back, don’t expose internal model files via debug routes, scrub weights from container images that ship to customers.

Confidential inference — for regulated workloads (healthcare, finance, defense) the entire inference path runs inside a TEE (Trusted Execution Environment). NVIDIA H100 + H200 Confidential Computing (CC) mode encrypts GPU memory and the CPU-GPU bus with attestation; Intel TDX and AMD SEV-SNP secure the CPU side. Apple Private Cloud Compute pioneers a transparent-build attested cloud inference architecture, publishing the production image hash and allowing third-party auditors to verify what binary served a given request.

Data residency + region pinning — for GDPR, HIPAA, China cybersecurity law, and EU AI Act compliance, pin specific tenants or specific data categories to specific regions. Most hosted providers (Anthropic, OpenAI, Google, AWS Bedrock, Azure OpenAI) offer EU-only or US-only or sovereign-cloud variants. Self-hosted: K8s topology constraints + region-aware routing.

Abuse detection — high-frequency identical-prompt traffic is often abuse (scraping, training-data extraction attempts). Adversarial-prompt-extraction defenses, output filtering against known canary leaks, rate-limit-by-prompt-similarity (locality-sensitive hashing).

15. Edge + on-device serving

The “smaller models, closer to user” tier of inference is growing fast.

Mobile:

  • iOS — Core ML (Apple, 2017), Core ML Tools converts from PyTorch / TF / ONNX. Apple Intelligence (iPhone 15 Pro+, 2024) runs a 3B foundation model on-device + a server cloud “Private Cloud Compute” fallback. Apple MLX-LM and MLX-Swift for direct M-series / A-series inference.
  • Android — TFLite, renamed LiteRT in 2024, with GPU + NNAPI + Hexagon delegates. MediaPipe LLM Inference (Google, 2024) wraps LiteRT for Gemma, Phi, Llama, Falcon on Android and iOS, exposing a high-level chat API.

Web — transformers.js (Hugging Face, ONNX Runtime Web under the hood), MLC LLM (CMU + Apache TVM, compiles models to WebGPU), WebLLM (Chen et al., CMU, 2023), ONNX Runtime Web, TensorFlow.js. WebGPU is the unlock — H1 2024 saw it ship in Chrome and Edge, late 2024 in Safari, 2025 in Firefox stable, enabling sub-second-per-token Llama 3.2 1B in the browser.

Apple MLX — Apple’s array framework, unified-memory inference on M-series Macs and A-series iPhones. MLX-LM exposes a Python and Swift interface, supports quantization, LoRA fine-tuning, and weight sharing. The Mac Studio M2 Ultra (192 GB unified memory) and M3 Max / M4 Max (128 GB) are surprisingly viable for 70B inference.

Browser-side small LLMs — Phi-3-mini, Gemma 2B / 3 4B, Llama 3.2 1B and 3B all run on consumer phones and laptops. Microsoft Phi Silica (Copilot+ PCs with NPUs, Windows on ARM) ships at OS level; Foundry Local (2024) is the equivalent of Ollama for Windows + NPU.

Desktop / dev — Ollama, LM Studio, GPT4All, Jan (Cortex-based), Cog local, Hugging Face Hub pipeline(), Apple MLX-LM CLI, Microsoft Foundry Local CLI.

Hybrid edge + cloud patterns — Apple’s Private Cloud Compute (2024) and Microsoft’s Copilot+ + Azure handoff (2024-25) show the dominant production pattern: a small on-device model handles short turns, simple intent, and offline mode; a larger cloud model handles complex reasoning, long context, or tool use. The handoff is explicit (UI signals which one is in use) and end-to-end encrypted with attestation (Apple’s PCC uses Secure Enclave attestation + transparent build logs).

Quantization for edge — on-device inference lives or dies by quantization. Common formats: Q4_K_M (llama.cpp 4-bit with K-quants), Q8_0 (8-bit), MLX 4-bit, Core ML 4-bit palettization, MediaPipe int8 with per-channel scales. The quality gap between Q4 and BF16 has shrunk significantly through 2024-25 as quantization-aware training and improved formats (AWQ, GPTQ, SmoothQuant, SpinQuant, QuaRot) became standard.

NPU offload (Copilot+ PCs, Apple Neural Engine, Qualcomm Hexagon) — dedicated neural accelerators on consumer devices now offer 40-50 TOPS at single-digit watts. The challenge is software fragmentation: Microsoft Foundry Local / DirectML / ONNX Runtime for Windows NPUs, Core ML for Apple, Qualcomm AI Engine SDK for Snapdragon, Samsung Exynos NPU stack. MediaPipe LLM Inference abstracts some of this. Real-world: small models (1-4B) run latency-competitive with cloud frontier models for short prompts.

16. Hosted inference providers (2026)

Closed-weight frontier:

  • OpenAI (GPT-4.5 / o3 family, GPT-4o / 4o-mini, embeddings).
  • Anthropic (Claude Opus 4.7 / Sonnet 4.7 / Haiku 4.x — the family this assistant is part of).
  • Google DeepMind (Gemini 2.5 / 3 Pro / Flash, Vertex AI + AI Studio).
  • Mistral La Plateforme (Mistral Large / Medium / Small / Codestral / Pixtral).
  • xAI (Grok 3 / 4).
  • Cohere (Command R+, Command R7B, embed-multilingual-v4, Aya Expanse).
  • Reka (Reka Core / Flash / Edge).

Open-weight serving (model catalog + cheap inference) — Together AI, Fireworks AI, Anyscale Endpoints, Replicate, Modal, RunPod, DeepInfra, OctoAI (acquired by NVIDIA, 2024), Lepton AI (acquired by NVIDIA, 2025), Hyperbolic, Hyperstack. Specialized speed: Cerebras Inference (record speed via wafer-scale CS-3 chips, 2000+ tokens/sec for Llama 3.3 70B), Groq LPU (deterministic SRAM-based inference, 500+ tokens/sec), SambaNova SN40L (Reconfigurable Dataflow Unit).

Sovereign / regional — Mistral (France), Aleph Alpha (Germany / Schwarz Digits), Apertus / Swiss AI Initiative (EPFL + ETH Zurich), Silo AI (Finland, AMD), G42 / Inception / Falcon (UAE), Baichuan / Zhipu / DeepSeek (China), Sarvam (India), Lelapa AI (Africa).

Specialized inference accelerators (non-NVIDIA) — beyond the standard GPU vendors, several silicon startups now offer commercial inference:

  • Cerebras CS-3 wafer-scale (44 GB SRAM on-chip, all weights resident, no HBM round-trip) — sustains 2000+ tokens/sec on Llama 3.3 70B.
  • Groq LPU (Tensor Streaming Processor, deterministic compile-time scheduling, fully on-chip SRAM) — 500+ tokens/sec, deterministic batch-1 latency.
  • SambaNova SN40L (Reconfigurable Dataflow Unit with three-tier memory: SRAM + HBM3 + DDR) — strong for very large MoE models that don’t fit on a single GPU.
  • Tenstorrent Wormhole / Blackhole (RISC-V + Tensix cores, open-source software stack) — emerging open alternative.
  • Etched Sohu (transformer-specific ASIC, 2024-25) — claims order-of-magnitude better tokens/sec/$ for transformer inference only.
  • AWS Trainium / Inferentia (Trn2 + Inf2) — Anthropic Project Rainier ships on Trainium2 at multi-hundred-thousand-chip scale.
  • Google TPU v5p / Trillium (v6e) / v7 — first-party inference on Vertex AI, dominant inside Google products.

Pricing models — token-based (per-million input/output tokens, the OpenAI / Anthropic / Google norm), throughput-based (dedicated capacity, $/hour per GPU or per replica), spot/on-demand GPU rental (Lambda, RunPod, etc.), batch-discounted (50% off for 24h SLA). The break-even between these depends on workload shape and is worth modeling explicitly before committing.

Provisioned throughput vs on-demand — Anthropic, OpenAI, AWS Bedrock, and Vertex all offer “provisioned throughput” or “dedicated capacity” tiers that reserve compute at a fixed monthly cost in exchange for guaranteed availability and often lower per-token cost at high volume. The break-even is workload-shape-dependent: steady, high-volume traffic favors provisioned; spiky, low-volume favors on-demand. Modeling the curve and re-evaluating quarterly is a high-leverage finance exercise.

Commit + buyback patterns — many hyperscalers and dedicated providers offer reserved-capacity discounts (1-year, 3-year) and surplus capacity buyback / resale. Combining a base commit (sized at P50 demand) with on-demand burst (for P95) and spot (for batch backfill) yields blended pricing 30-60% below pure on-demand.

Open-weight model catalog (May 2026) — the landscape of models you can self-host: Llama 3.3 70B / 3.4 (Meta), Mistral Large / Medium / Small / Mixtral 8x22B / Pixtral / Codestral, Qwen 2.5 / 3 series (Alibaba), DeepSeek-V3 / R1 (DeepSeek AI, MoE 671B with 37B active), Gemma 2 / 3 (Google open variant), Phi 3 / 4 (Microsoft), Command-R / Command-R+ / Aya (Cohere), Granite (IBM), Yi 1.5 / 34B (01.AI), InternLM 3 (Shanghai AI Lab), Falcon 3 (TII Abu Dhabi), Apertus (Swiss AI Initiative, public weights with provenance). Most ship Apache 2.0, MIT, or Llama-style community license; check terms before commercial use.

Inference cost projections — Epoch AI, SemiAnalysis, and ARK research consistently project a 4-10x annual cost-per-intelligence reduction through 2027, driven by algorithmic improvements (better attention kernels, MoE, distillation, speculative decoding), hardware improvements (HBM3e HBM4, Blackwell Rubin, MI300X MI400), and software improvements (kernel autotuning, custom kernels, paged everything). Plan capacity assuming this trend continues, while hedging against the (real) risk that it stalls.

Where this is going — the dominant 2026-28 themes are: disaggregated everything (prefill + decode + KV cache + scheduler as separate scalable pools), reasoning-aware serving (mid-generation cost amplification handled gracefully), tighter edge + cloud integration (Apple PCC, Phi Silica + Azure, Pixel + Vertex), open-source serving frameworks reaching feature parity with proprietary ones, and standardization on OpenTelemetry GenAI conventions for cross-vendor observability.

Datacenter MaaS (Model-as-a-Service via hyperscalers) — AWS Bedrock + SageMaker Endpoints + EC2 P5/P5en/P6 + Capacity Blocks; Azure OpenAI Service + Azure AI Foundry + Azure ML Online Endpoints + Azure AI Inference; GCP Vertex AI + Model Garden + GKE with TPU Trillium; Oracle GenAI; IBM watsonx.ai; Alibaba PAI; Tencent Cloud TI; Huawei Pangu.

17. Performance optimization workflow

A practical sequence when tuning a deployment:

  1. Profile the baseline — measure TTFT, ITL, throughput, GPU util, KV cache util, batch size distribution on representative traffic. Don’t optimize what you haven’t measured. Tools: vLLM /metrics, NVIDIA Nsight Systems, PyTorch Profiler, DCGM exporter.

  2. Apply quantization — start with W4A16 (AWQ or GPTQ), evaluate quality on your eval set, escalate to W4A8 / FP8 if quality holds. KV cache quantization (INT8 KV, FP8 KV) for long contexts.

  3. Enable continuous batching — vLLM and TGI default to it; if you’re on an older framework, switching is the single biggest throughput win.

  4. Increase batch size until you hit the latency budget — push max-num-seqs upward, watch P99 ITL, stop when SLO breaks.

  5. Add prefix caching if your workload has recurring prefixes — system prompts, RAG context windows, agent state. vLLM --enable-prefix-caching, SGLang RadixAttention.

  6. Speculative decoding if a draft model is available — pair Llama 3.2 1B with Llama 3.1 70B, or use Medusa / EAGLE heads on the target.

  7. Tensor-parallel across GPUs only after single-GPU is maxed — TP introduces NVLink / NVSwitch traffic and synchronization; don’t add it prematurely.

  8. Disaggregate prefill + decode at very large scale — DistServe / Splitwise architectures pay off above several thousand QPS or with very different prefill / decode profiles. Most deployments don’t need this.

  9. Move to a faster GPU only as a last lever — H100 → H200 → B100 / B200 → MI300X each step has real cost. Re-measure after each software optimization before paying for hardware.

  10. Revisit the workflow quarterly — vLLM, TRT-LLM, and SGLang ship breaking improvements every 4-8 weeks. A tuning run from six months ago may be 30-50% slower than what current versions yield on the same hardware. Pin versions, run regression benchmarks before upgrading, then capture new optimal config.

Benchmark methodology — use representative production traffic, not synthetic. Capture a week of real prompts (sanitized), replay through a load generator (vllm-bench, genai-perf, locust, k6 with custom payload generator, nv-gpu-bench). Measure at the target QPS, not just peak throughput. Report TTFT P50/P95/P99, ITL P50/P95/P99, throughput in tokens/sec, and tokens per dollar. Include cold-start measurements and steady-state separately.

18. Common pitfalls

  • Batch size too small — GPU sits underutilized, throughput plummets, tokens/$ is awful. Increase --max-num-seqs and check GPU util reaches 80%+ at steady state.

  • Memory allocator fragmentation — the original problem PagedAttention solved. If using a non-paged framework, KV cache memory waste can be 40%+. Use vLLM, TGI, TRT-LLM, or SGLang.

  • Cold start without warm-up — first request after deploy or after a scale-up event triggers kernel JIT compilation, CUDA graph capture, weight prefetch. Run a synthetic warm-up request at pod start.

  • Logging full prompts and completions — PII leak risk, storage cost explosion, slow log pipelines. Redact, sample, or store separately with stricter access.

  • Not measuring TTFT separately from total latency — total latency depends heavily on output length, which the model controls. TTFT is the user’s perceived first-byte; ITL is the streaming smoothness; max-tokens is a cap. Track all three.

  • Same machine for prefill and decode at high QPS — head-of-line blocking, prefill bursts cause decode stalls, tail latency degrades. Disaggregate or use chunked prefill (vLLM’s chunked-prefill scheduler, default since 0.5+).

  • Forgetting to bound max-tokens — a runaway model can generate 16K tokens and bill you for it. Set sensible defaults at the gateway, enforce per-tenant limits.

  • Cache pollution from one-off long prompts — a single 100K-token request can evict hot system-prompt prefixes from the prefix cache, hurting many subsequent requests. Use cache priorities, separate pools for long-context vs chat, or evict-on-finish heuristics.

  • Ignoring tail latency in autoscaling signals — average GPU util can look fine while P99 latency is failing. Scale on P99 latency or queue depth, not just util.

  • Mixing model versions on the same replica without strict label-based routing — a request labeled for v2 lands on a v1 replica because of routing-spec misconfiguration. Validate at the gateway, log the served version in every response.

  • Sampling parameter drift — default temperature, top_p, top_k, repetition_penalty, min_p, and frequency_penalty differ across frameworks. The same model on vLLM vs TGI vs TRT-LLM produces different outputs at “default” settings. Pin sampling parameters explicitly in client SDKs and treat them as part of the model contract.

  • Tokenizer mismatch between training and serving — subtle bugs from tokenizer version drift (special-token additions, normalization changes) produce silent quality regressions. Pin tokenizer version in the model card and assert at startup that the loaded tokenizer hash matches expected.

  • Ignoring streaming back-pressure — clients that disconnect mid-stream cause server-side memory pressure if the runtime doesn’t detect the disconnect promptly. Wire up HTTP/2 RST_STREAM and gRPC cancellation to cancel in-progress generation, freeing KV cache slots.

  • No graceful drain on pod termination — K8s sends SIGTERM with a grace period, but vLLM by default may abort in-flight requests. Implement a preStop hook that stops accepting new requests, drains in-flight ones, then exits. Tail-latency users mid-stream during a deploy notice if you skip this.

  • Forgetting NUMA pinning on multi-socket nodes — CPU pre/post-processing, tokenization, and network I/O can suffer 2x latency hits when threads bounce across NUMA nodes. Pin pods or threads to the socket that owns the GPU’s PCIe root.

  • Not testing with realistic prompt-length distributions — synthetic benchmarks often use uniform 512-token prompts. Production traffic is heavy-tailed (a few 50K-token RAG prompts mixed with many 50-token chat turns). Latency and throughput profiles differ dramatically between the two.

  • Letting the prefix cache get evicted on every deploy — a rolling restart wipes the in-process prefix cache. For prefix-heavy workloads, consider externalizing the prefix cache (LMCache, Redis-backed KV cache, vLLM disaggregated KV cache) or doing canary deploys that warm one replica at a time.

19. Cross-references

  • [[Compute/inference-optimization]] — algorithmic side: quantization, speculative decoding, KV cache tricks, kernel optimization.
  • [[Compute/transformer-architecture]] — what the runtime is serving.
  • [[Compute/kubernetes-deep]] — the orchestration substrate.
  • [[Compute/containers-service-mesh]] — Istio, Linkerd, Envoy patterns.
  • [[Compute/observability-stack]] — Prometheus / Grafana / Loki / Tempo / OTel.
  • [[Compute/auth-authz]] — API keys, OAuth, mTLS, per-tenant authz.
  • [[Compute/cuda-triton-gpu-programming]] — kernel-level work that backs these frameworks.
  • [[Compute/rag-embeddings-vector-search]] — RAG workloads exercise prefix caching heavily.
  • [[Compute/fine-tuning-rlhf]] — produces the adapters served via multi-LoRA.
  • [[Engineering/Tier3/semiconductor-materials]] — HBM and 3D-stack memory that bounds decode throughput.
  • [[Engineering/Tier3/semiconductor-packages]] — CoWoS, HBM3e, chiplet packaging that drives H100/H200/B100/MI300X capacities.
  • [[Compute/distributed-systems-fundamentals]] — consistency, sharding, replication patterns that apply to distributed serving fleets.
  • [[Compute/networking-foundations]] — RDMA, NVLink, NVSwitch, InfiniBand for multi-GPU and multi-node serving.
  • [[Compute/http2-http3-quic]] — protocols underpinning token streaming and gateway behavior.

20. Citations

  • Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., & Stoica, I. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023. (vLLM)
  • Yu, G.-I., Jeong, J. S., Kim, G.-W., Kim, S., & Chun, B.-G. (2022). Orca: A Distributed Serving System for Transformer-Based Generative Models. OSDI 2022. (continuous batching origin)
  • Patel, P., Choukse, E., Zhang, C., Shah, A., Goiri, Í., Maleki, S., & Bianchini, R. (2024). Splitwise: Efficient Generative LLM Inference Using Phase Splitting. ISCA 2024. (Microsoft prefill/decode disaggregation)
  • Zhong, Y., Liu, S., Chen, J., Hu, J., Zhu, Y., Liu, X., Jin, X., & Zhang, H. (2024). DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving. OSDI 2024.
  • Zheng, L., Yin, L., Xie, Z., Sun, C., Huang, J., Yu, C. H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J. E., Barrett, C., & Sheng, Y. (2023+, ongoing). SGLang: Efficient Execution of Structured Language Model Programs. NeurIPS 2024 + project updates.
  • Sheng, Y., Cao, S., Li, D., Hooper, C., Lee, N., Yang, S., Chou, C., Zhu, B., Zheng, L., Keutzer, K., Gonzalez, J. E., & Stoica, I. (2023). S-LoRA: Serving Thousands of Concurrent LoRA Adapters. arXiv 2311.03285.
  • Chen, L., Ye, Z., Wu, Y., Zhuo, D., Ceze, L., & Krishnamurthy, A. (2024). Punica: Multi-Tenant LoRA Serving. MLSys 2024.
  • NVIDIA TensorRT-LLM Developer Guide and Best Practices (2024-26 releases).
  • NVIDIA Triton Inference Server Documentation (24.x releases).
  • vLLM Project Documentation (v0.6+, v0.7+ ongoing).
  • Hugging Face Text Generation Inference repository and docs.
  • TorchServe Documentation (PyTorch + AWS).
  • TensorFlow Serving Documentation.
  • KServe Documentation (CNCF).
  • Cerebras Inference Launch and Llama 3.3 70B benchmark announcements (2024-25).
  • Groq LPU Architecture Paper and Inference Benchmark Reports.
  • Anthropic Prompt Caching and Message Batches API announcements (2024).
  • OpenAI Prompt Caching and Batch API documentation (2024).
  • Google Vertex AI Context Caching documentation (2024).
  • Apple MLX and Apple Intelligence on-device foundation model technical disclosures (2024-25).
  • Microsoft Phi Silica, Foundry Local, and Copilot+ PC NPU documentation (2024-26).
  • Meta Llama Guard 3 + Prompt Guard 2 model cards (2024-26).
  • MediaPipe LLM Inference API documentation (Google, 2024-26).
  • Xiao, G., Tian, Y., Chen, B., Han, S., & Lewis, M. (2024). Efficient Streaming Language Models with Attention Sinks. ICLR 2024. (StreamingLLM)
  • Liu, Z., Desai, A., Liao, F., Wang, W., Xie, V., Xu, Z., Kyrillidis, A., & Shrivastava, A. (2024). Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time. NeurIPS 2024.
  • Liu, H., Yan, W., & Abbeel, P. (2023). Ring Attention with Blockwise Transformers for Near-Infinite Context. arXiv 2310.01889.
  • Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., & Han, S. (2024). AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. MLSys 2024.
  • Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2023). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR 2023.
  • Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., & Han, S. (2023). SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. ICML 2023.
  • Ashkboos, S., Mohtashami, A., Croci, M. L., Li, B., Cameron, P., Jaggi, M., Alistarh, D., Hoefler, T., & Hensman, J. (2024). QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs. NeurIPS 2024.
  • Liu, Z., Kong, C., Liu, Y., & Sun, M. (2024). LMCache: Redis for the LLM Era. NSDI 2025 (CMU + UChicago).
  • OpenTelemetry GenAI semantic conventions (CNCF, 2024-26).
  • Apple Private Cloud Compute Security Guide and verifiable transparency log (2024).
  • Jiang, H., Wu, Q., Lin, C.-Y., Yang, Y., & Qiu, L. (2023). LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. EMNLP 2023.
  • Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Levskaya, A., Heek, J., Xiao, K., Agrawal, S., & Dean, J. (2022). Efficiently Scaling Transformer Inference. MLSys 2023. (Google JetStream / TPU serving)
  • DeepSeek-AI. (2024-25). DeepSeek-V3 + DeepSeek-R1 Technical Reports. (MoE training and inference at scale, MLA attention, FP8 serving)
  • Anthropic Engineering blog posts on Project Rainier, batch inference, and prompt caching architecture (2024-25).
  • Microsoft Azure AI Foundry + Foundry Local launch documentation (2024-26).
  • Hugging Face Hardware Hub launch and integration partner documentation (2024-26).
  • Epoch AI inference cost reports (2024-26 quarterly).
  • SemiAnalysis quarterly AI hardware + serving cost analyses (2024-26).