name: nvidia-dynamo
description: "NVIDIA Dynamo — distributed LLM inference orchestration with disaggregated prefill-decode, KV-aware routing, NIXL transfers, and SLO-based autoscaling. Use when deploying multi-GPU/multi-node inference with vLLM, TRT-LLM, or SGLang backends."
NVIDIA Dynamo
Dynamo is NVIDIA's open-source distributed inference serving framework for generative AI. It sits above inference engines (vLLM, TensorRT-LLM, SGLang) as an orchestration layer that adds disaggregated serving, intelligent routing, KV cache management, and dynamic GPU scheduling — capabilities individual engines lack.
Not the PyTorch torch.compile Dynamo compiler. This is the inference serving framework at github.com/ai-dynamo/dynamo.
When to Use Dynamo
- Multi-node LLM inference where models exceed single-GPU memory
- Disaggregated prefill-decode deployments for throughput optimization
- Workloads requiring KV cache reuse across requests (multi-turn, agentic)
- Production Kubernetes deployments needing SLO-based autoscaling
- Any backend: vLLM, TensorRT-LLM, or SGLang
Architecture Overview
Dynamo consists of five core components sitting between the API layer and inference workers:
Frontend (Rust)
OpenAI-compatible HTTP server handling tokenization, prompt templating, and request dispatch. High-performance Rust implementation with OpenAPI 3 spec at /openapi.json.
Smart Router
Routes requests to optimal workers by computing overlap scores between incoming request tokens and KV cache blocks already resident on each GPU across the cluster. Unlike round-robin or load-based routing, the router factors in:
- Cache hit rate — how much KV cache can be reused on each worker
- Workload balance — current queue depth and GPU utilization
- GPU capacity — available memory and compute headroom
The router maintains a Radix Tree of hashed KV cache locations across all GPUs, enabling O(prefix-length) lookups. It also implements intelligent eviction policies to balance retention vs. lookup latency.
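As a rough illustration of the scoring idea, here is a toy prefix index in Python. The names (`PrefixIndex`, `route_score`) and the weights are illustrative assumptions, not Dynamo's actual API; the real router hashes KV blocks and tracks eviction, which this sketch omits.

```python
class _Node:
    __slots__ = ("children", "workers")
    def __init__(self):
        self.children = {}    # hashed block id -> _Node
        self.workers = set()  # workers that hold this block

class PrefixIndex:
    """Toy radix-style index mapping block-hash prefixes to the workers
    that already cache them; a lookup walks at most O(prefix length) nodes."""
    def __init__(self):
        self.root = _Node()

    def insert(self, block_hashes, worker):
        node = self.root
        for h in block_hashes:
            node = node.children.setdefault(h, _Node())
            node.workers.add(worker)

    def overlap(self, block_hashes, worker):
        """Count of leading request blocks already resident on `worker`."""
        node, matched = self.root, 0
        for h in block_hashes:
            node = node.children.get(h)
            if node is None or worker not in node.workers:
                break
            matched += 1
        return matched

def route_score(index, request_blocks, worker, queue_depth,
                w_cache=1.0, w_load=0.5):
    # Reward reusable KV cache on the candidate worker, penalize its queue depth.
    return w_cache * index.overlap(request_blocks, worker) - w_load * queue_depth
```

With two workers caching different prefixes of the same conversation, the worker holding the longer matching prefix scores higher at equal load.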
GAIE Integration (v0.9.0+): The router accepts Gateway API Inference Extension (GAIE) routing hints — K8s-native inference routing metadata (model identity, session affinity, priority) can flow through standard Gateway API headers into Dynamo's routing decisions.
Planner (GPU Resource Scheduler)
The Planner continuously monitors GPU capacity metrics and makes dynamic decisions:
- Aggregated vs. disaggregated — whether a request should use combined prefill+decode on one GPU or split across specialized pools
- Worker scaling — how many GPUs to assign to prefill vs. decode phases
- SLO enforcement — adjusts allocation to meet TTFT and ITL targets
The SLO-based Planner (v0.4+, updated v0.9.0) uses a forward-looking approach:
- Pre-deployment profiling to understand behavior under different configs
- Traffic prediction using a Kalman filter with Mooncake-style warmup (replaces the earlier ARIMA/Prophet approach)
- Output block tracking with fractional decay — the router accounts for expected request duration when scoring workers, not just current cache state
- Minimum worker calculation to meet SLA under predicted demand
- Continuous reassessment and proactive scaling before bottlenecks occur
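The predict-then-provision loop can be sketched numerically: a scalar Kalman update smooths the observed request rate, and a minimum-worker bound follows from profiled per-worker throughput. The function names, headroom factor, and noise parameters here are illustrative assumptions, not the Planner's actual internals.

```python
import math

def kalman_step(est, var, measurement, process_var=1.0, meas_var=4.0):
    """One scalar Kalman update: blend the previous rate estimate with a
    newly observed request rate, weighted by relative uncertainty."""
    var = var + process_var           # predict: uncertainty grows over time
    gain = var / (var + meas_var)     # Kalman gain
    est = est + gain * (measurement - est)
    var = (1.0 - gain) * var
    return est, var

def min_workers(predicted_rps, per_worker_rps, headroom=0.2):
    """Smallest worker count serving the predicted load with headroom.
    `per_worker_rps` would come from pre-deployment profiling; `headroom`
    keeps utilization below saturation so TTFT/ITL targets hold."""
    if predicted_rps <= 0:
        return 1  # never scale below one warm worker
    effective = per_worker_rps * (1.0 - headroom)
    return max(1, math.ceil(predicted_rps / effective))
```

For example, a smoothed prediction of 100 req/s against profiled workers handling 25 req/s each (with 20% headroom) yields a floor of 5 workers.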
Distributed KV Cache Manager (KVBM)
Manages KV cache across memory hierarchies — GPU HBM, CPU DRAM, local SSD, networked object storage. Uses tiered caching policies: frequently accessed blocks stay in GPU memory, less-accessed blocks migrate to cheaper tiers. Supports cluster-level cache management spanning multiple nodes.
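A minimal sketch of the tiering idea, assuming a two-level hierarchy with LRU demotion and promotion on reuse. The class and tier names are illustrative; KVBM's real policies span more tiers (SSD, networked storage) and operate cluster-wide.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier cache: hot KV blocks stay in a small 'gpu' tier,
    LRU blocks demote to a larger 'host' tier instead of being dropped."""
    def __init__(self, gpu_slots, host_slots):
        self.gpu = OrderedDict()   # block_id -> payload, most recent last
        self.host = OrderedDict()
        self.gpu_slots, self.host_slots = gpu_slots, host_slots

    def put(self, block_id, payload):
        self.gpu[block_id] = payload
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_slots:
            victim, data = self.gpu.popitem(last=False)  # evict LRU from GPU tier
            self.host[victim] = data                     # demote, don't drop
            while len(self.host) > self.host_slots:
                self.host.popitem(last=False)            # host tier is finite too

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id], "gpu"
        if block_id in self.host:
            payload = self.host.pop(block_id)
            self.put(block_id, payload)                  # promote on reuse
            return payload, "gpu"
        return None, None
```

A reused block comes back from the host tier into GPU memory, which is the behavior that makes multi-turn and agentic workloads cheap to resume.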
Event Plane (v0.9.0+)
Transport-agnostic pub/sub layer built on ZMQ with MessagePack serialization. Together with the Discovery Plane and Request Plane, it forms Dynamo's fully decoupled communication architecture. The Event Plane replaces NATS for system event distribution — KV router queries, worker lifecycle events, and scaling signals all flow through it without external message brokers.
Disaggregated Prefill-Decode Serving
The core innovation. Traditional inference co-locates prefill (compute-bound, processes all input tokens) and decode (memory-bandwidth-bound, generates tokens one at a time) on the same GPU. This wastes resources because:
- Prefill benefits from low tensor parallelism (less communication overhead)
- Decode benefits from high tensor parallelism (more memory bandwidth)
- Long prefills block decode, causing latency spikes for other users
Disaggregated serving separates these phases onto different GPU pools, enabling:
- Independent parallelism strategies per phase
- Independent scaling (more prefill GPUs during summarization spikes, more decode GPUs during chat workloads)
- Better SLO control — TTFT and ITL managed independently
- Up to 30x throughput improvement on GB200 NVL72 with DeepSeek-R1, 2x+ on Hopper with Llama 70B
When NOT to Disaggregate
- Short input sequences where prefill is cheap
- Low traffic where GPU utilization is already low
- Single-GPU deployments where overhead exceeds benefit
- The Planner can dynamically choose aggregated mode per-request when disaggregation isn't beneficial
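A per-request policy in this spirit can be sketched as a simple threshold rule. The 2048-token cutoff and the function name are made-up illustrations, not Dynamo's actual heuristic, which also weighs cluster state.

```python
def choose_serving_mode(prompt_tokens, prefill_pool_free=True,
                        disagg_threshold_tokens=2048):
    """Short prompts stay aggregated (KV transfer overhead would dominate);
    long prefills go to the dedicated prefill pool so decode latency on
    other requests stays stable."""
    if prompt_tokens < disagg_threshold_tokens:
        return "aggregated"
    if not prefill_pool_free:
        return "aggregated"  # fall back rather than queue behind the pool
    return "disaggregated"
```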
Encoder Disaggregation (v0.9.0+)
For multimodal models, encoding (processing images/video/audio) can be separated onto dedicated GPUs, creating a full Encode/Prefill/Decode (E/P/D) split:
- vLLM: Uses the Embedding Cache connector — the encoder runs on a separate GPU and caches computed embeddings for prefill workers to consume
- TRT-LLM: Uses a standalone encoder process
- SGLang: Supports single-GPU multimodal serving without requiring a full E/P/D split
This prevents heavy vision encoding from starving the prefill/decode pipeline and allows independent scaling of encoder capacity for image-heavy workloads.
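The embedding-cache pattern can be sketched as follows: embeddings are keyed by content hash so identical images are encoded once and reused by prefill workers. This is a conceptual illustration only; the names and hashing scheme are assumptions, not vLLM's Embedding Cache connector API.

```python
import hashlib

class EmbeddingCache:
    """Toy embedding cache in the spirit of encoder disaggregation: the
    encoder computes an image embedding once, keyed by content hash, and
    prefill workers reuse it instead of re-encoding."""
    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(image_bytes):
        return hashlib.sha256(image_bytes).hexdigest()

    def get_or_encode(self, image_bytes, encode_fn):
        k = self.key(image_bytes)
        if k in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[k] = encode_fn(image_bytes)  # the expensive GPU step
        return self._store[k]
```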
NIXL — NVIDIA Inference Transfer Library
NIXL handles all data movement in disaggregated deployments — primarily KV cache transfer from prefill to decode workers. Key properties:
- Hardware-agnostic API — same semantics across NVLink, InfiniBand, RoCE, Ethernet
- Non-blocking, non-contiguous transfers — async by design, handles scattered KV blocks
- Heterogeneous memory support — GPU HBM ↔ CPU DRAM ↔ SSD ↔ networked storage (S3, block, file)
- Backend auto-selection — Dynamo's policy engine picks optimal transport (GPUDirect RDMA, UCX, GPUDirect Storage)
- Storage partner integrations — Dell PowerScale, WEKA for KV cache persistence
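NIXL's actual API is not reproduced here. As a rough stdlib `asyncio` illustration only, the sketch below mimics the transfer pattern: scattered (non-contiguous) block ids, with every move issued concurrently rather than serialized.

```python
import asyncio

async def transfer_block(block_id, src, dst):
    # Stand-in for a one-sided hardware transfer of a single KV block.
    await asyncio.sleep(0)  # yield: transfers never block the event loop
    dst[block_id] = src[block_id]

async def transfer_kv(src, dst, block_ids):
    """NIXL-style pattern: KV blocks live at scattered (non-contiguous)
    ids, and all moves are issued concurrently."""
    await asyncio.gather(*(transfer_block(b, src, dst) for b in block_ids))

# Prefill worker holds blocks 3, 7, 42; the decode worker needs all of them.
prefill_kv = {3: "k3", 7: "k7", 42: "k42"}
decode_kv = {}
asyncio.run(transfer_kv(prefill_kv, decode_kv, [42, 3, 7]))
```

In real deployments the "copy" is a hardware transfer (GPUDirect RDMA, NVLink, UCX) chosen by Dynamo's policy engine, not a Python assignment.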
Backend Support Matrix
| Feature | SGLang | TensorRT-LLM | vLLM |
|---|---|---|---|
| Best for | High throughput | Max performance | Broadest features |
| Disaggregated Serving | ✅ | ✅ | ✅ |
| KV-Aware Routing | ✅ | ✅ | ✅ |
| SLA-Based Planner | ✅ | ✅ | ✅ |
| KVBM | 🚧 | ✅ | ✅ |
| Multimodal | ✅ | ✅ | ✅ |
| Tool Calling | ✅ | ✅ | ✅ |
Diffusion Language Model Support (v0.9.0+)
Dynamo supports LLaDA2.0 and other diffusion-based language models that generate tokens via iterative denoising rather than autoregressive decoding. The Planner and Router handle diffusion LLM scheduling — the decode phase runs fixed-step denoising iterations instead of variable-length token generation, which changes scaling characteristics (predictable step count per request).
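The scaling difference can be made concrete with a small capacity-planning sketch (function name and the 64-step default are illustrative assumptions): autoregressive decode cost depends on each request's output length, while diffusion decode is a fixed per-request step count.

```python
def decode_cost_tokens(requests, mode, denoise_steps=64):
    """Toy decode-cost model: autoregressive cost varies per request,
    diffusion cost is a fixed number of denoising steps per request,
    which makes the load predictable for the Planner."""
    if mode == "autoregressive":
        return sum(r["expected_output_tokens"] for r in requests)
    if mode == "diffusion":
        return len(requests) * denoise_steps
    raise ValueError(f"unknown mode: {mode}")
```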
Deployment
Local Development (Single Machine)
Dynamo supports a zero-dependency local mode using --discovery-backend file that avoids etcd/NATS:
```bash
# Terminal 1: Frontend
python3 -m dynamo.frontend --http-port 8000 --discovery-backend file

# Terminal 2: Worker (choose backend)
python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --discovery-backend file \
  --kv-events-config '{"enable_kv_cache_events": false}'
```
Container Images (Recommended)
Pre-built containers on NGC with all dependencies:
```
nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0
nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0
nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0
```
Kubernetes (Production)
Dynamo has a Kubernetes-native deployment model using CRDs:
- Install Dynamo Platform — operator, CRDs, RBAC
- Create namespace and secrets — HuggingFace token for model downloads
- Apply DynamoGraphDeployment — declarative YAML defining model, backend, topology
```yaml
# Example: kubectl apply -f agg.yaml -n dynamo-system
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
```
Service discovery uses K8s-native EndpointSlices — no etcd or NATS needed in K8s deployments.
Pre-built recipes exist for common configurations:
- Llama-3-70B on 4x H100 (vLLM, aggregated)
- DeepSeek-R1 on 8x H200 (SGLang, disaggregated)
- Qwen3-32B-FP8 on 8x GPU (TRT-LLM, aggregated)
Cloud-specific guides: AWS EKS, Google GKE, Azure AKS.
Tolerations and Node Affinity (v0.9.0+)
DynamoGraphDeployment supports tolerations and nodeAffinity fields directly on graph node specs — required for scheduling onto tainted GPU node pools in EKS/GKE:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: llama-70b-disagg
  namespace: dynamo-system
spec:
  graph:
    prefillWorkers:
      replicas: 2
      resources:
        requests:
          nvidia.com/gpu: "4"
        limits:
          nvidia.com/gpu: "4"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: nvidia.com/gpu.product
                  operator: In
                  values:
                    - NVIDIA-H100-80GB-HBM3
    decodeWorkers:
      replicas: 4
      resources:
        requests:
          nvidia.com/gpu: "4"
        limits:
          nvidia.com/gpu: "4"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
```
Rolling Restarts
DynamoGraphDeployment supports rolling restarts — updating configuration (model parameters, scaling settings, environment variables) triggers a rolling update of affected pods without tearing down the entire inference graph. Unchanged components remain running, minimizing downtime during day-2 config changes.
Service Discovery
As of v0.9.0, NATS and etcd are fully optional in all deployment modes. The Event Plane (ZMQ + MessagePack) replaces NATS, and K8s-native or file-based discovery replaces etcd.
| Deployment | etcd | NATS | Notes |
|---|---|---|---|
| Local dev | ❌ | ❌ | --discovery-backend file |
| Kubernetes | ❌ | ❌ | K8s-native CRDs + EndpointSlices |
| Multi-node (bare metal) | ❌ | ❌ | Event Plane (ZMQ) for pub/sub, file or custom discovery |
AIConfigurator Integration
AIConfigurator recommends optimal PD disaggregation configuration and model parallelism strategy for a given model, GPU budget, and SLO targets. It:
- Uses pre-measured performance data across model layers (attention, FFN, communication, memory)
- Models different scheduling techniques (static batching, inflight batching, disaggregated serving)
- Suggests PD configs maximizing throughput/GPU while meeting SLOs
- Generates backend configurations deployable directly in Dynamo
Available as CLI and web interface. Initial support for TensorRT-LLM on Hopper.
See the aiconfigurator skill for detailed usage.
Autoscaling
For HPA manifests, KEDA ScaledObjects, scale-to-zero patterns, and Prometheus recording rules, see references/autoscaling.md.
Three scaling approaches:
| Approach | When to Use |
|---|---|
| Dynamo Planner | Disaggregated P/D deployments — SLO-aware, scales prefill and decode pools independently |
| HPA + Prometheus Adapter | Monolithic vLLM workers — scale on dynamo_queued_requests or TTFT percentiles |
| KEDA | Event-driven or scale-to-zero for batch inference — no Prometheus Adapter needed |
Key metrics: dynamo_queued_requests (queue depth), dynamo_ttft_seconds (latency histogram), vllm:gpu_cache_usage_perc (KV cache utilization).
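For the HPA path, the standard Kubernetes scaling rule applied to a queue-depth metric looks like this in miniature (the tolerance and replica bounds mirror typical HPA defaults; the target of 10 queued requests per replica is an example value, not a recommendation):

```python
import math

def desired_replicas(current_replicas, metric_value, target_value,
                     tolerance=0.1, min_replicas=1, max_replicas=32):
    """Kubernetes HPA rule: desired = ceil(current * metric / target),
    skipped when the ratio is within tolerance to avoid flapping."""
    ratio = metric_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

With 4 replicas averaging 25 queued requests each against a target of 10, the controller scales to 10 replicas; a reading of 10.5 stays within the 10% tolerance and leaves the deployment alone.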
Observability
Dynamo components emit Prometheus metrics across control, data, and event planes:
- Requests/second and duration
- TTFT and ITL (average and percentiles)
- Input/output sequence lengths
- GPU utilization and power
- Custom metric API for user-defined metrics
Pre-built Grafana dashboards are included in the repo (deploy/metrics/).
Distributed Tracing (v0.9.0+)
Dynamo traces span the full request path — from frontend through router, prefill, KV transfer (including TCP transport), and decode. Traces export via OTLP:
```bash
# Enable OTLP trace export
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
export OTEL_SERVICE_NAME=dynamo-frontend
export OTEL_TRACES_EXPORTER=otlp
```
Each component (frontend, router, workers) propagates trace context automatically. Configure an OpenTelemetry Collector to receive traces and forward to Jaeger, Tempo, or your preferred backend.
Fault Tolerance (v0.4+)
- Inflight request re-routing — when a worker goes offline, in-progress requests forward to healthy workers preserving intermediate computation (no reprocessing from scratch)
- Early failure detection — the Smart Router detects worker failures directly, bypassing etcd's slower health check propagation
- Reduces detection-to-recovery window from seconds to sub-second
Dynamo vs Ray Serve
Key architectural difference: Dynamo is a purpose-built LLM inference orchestrator with built-in disaggregated serving, KV-aware routing, and NIXL data transfer. Ray Serve is a general-purpose model serving framework built on the Ray distributed computing runtime.
| Dimension | NVIDIA Dynamo | Ray Serve |
|---|---|---|
| Primary focus | LLM inference at datacenter scale | General ML model serving |
| Disaggregated P/D | Native, first-class | Via Ray Serve LLM (newer addition) |
| KV cache routing | Built-in Radix Tree router | Not built-in |
| Data transfer | NIXL (RDMA, NVLink, GPUDirect) | Ray object store (plasma) |
| Backends | vLLM, TRT-LLM, SGLang | vLLM, SGLang (via integration) |
| Autoscaling | SLO-based Planner + K8s HPA | Replica-based autoscaling |
| Multi-model | Designed for single-model scaling | Native multi-model composition |
| Non-LLM models | Not designed for this | Full support (CV, audio, etc.) |
| Language | Rust core + Python bindings | Pure Python |
| K8s integration | Native CRDs (DynamoGraphDeployment) | RayService CRD via KubeRay |
| Maturity | Released March 2025 (GTC) | Production since ~2020 |
When to Choose Dynamo
- Pure LLM inference workload needing maximum throughput
- Disaggregated serving with optimized KV cache transfer
- NVIDIA GPU fleet with NVLink/InfiniBand interconnects
- Multi-node model parallelism (TP, PP, EP across nodes)
- SLO-driven autoscaling of prefill/decode pools
When to Choose Ray Serve
- Mixed model types (LLM + vision + audio + custom)
- Complex model composition pipelines with business logic
- Existing Ray ecosystem (Ray Train, Ray Data)
- Need for fractional GPU allocation
- Team preference for Python-centric development
- Heterogeneous hardware environments
When to Choose Gateway API Inference (llm-d)
- Kubernetes-native approach using standard Gateway API
- Want the CNCF ecosystem (Envoy, Istio compatibility)
- Multi-tenant environments with standard K8s RBAC
- Lighter-weight orchestration without full Dynamo stack
- Note: llm-d uses NIXL from Dynamo for KV cache transfer
Relationship to Other NVIDIA Components
- Triton Inference Server — Dynamo's predecessor for general inference; Dynamo is LLM-specialized
- TensorRT-LLM — inference engine (backend); Dynamo orchestrates it
- NIM — NVIDIA's inference microservice packaging; Dynamo will be included with NIM via AI Enterprise
- AIConfigurator — companion tool for optimal Dynamo configuration
- NIXL — Dynamo's data transfer library, also used independently by llm-d
Key Configuration
Environment Variables
- `DYN_LOG` — logging level (same syntax as `RUST_LOG`), e.g., `export DYN_LOG=debug`
Frontend Flags
- `--http-port` — API server port (default 8000)
- `--discovery-backend file|etcd|k8s` — service discovery mode
Worker Flags (vary by backend)
- `--model` / `--model-path` — HuggingFace model identifier
- `--kv-events-config` — KV cache event publishing (vLLM requires explicit disable for local dev)
- Backend-specific flags pass through to the underlying engine
Troubleshooting
- "Cannot connect to ModelExpress server" (TRT-LLM) — expected warning, safe to ignore
- KV events errors in local dev — add `--kv-events-config '{"enable_kv_cache_events": false}'` for vLLM; SGLang/TRT-LLM disable by default
- System verification — run `python3 deploy/sanity_check.py` before starting Dynamo
- Worker detection delay — with Event Plane (v0.9.0+), worker registration is near-instant via ZMQ; older etcd-based setups may take seconds
- NATS/etcd errors after upgrade — v0.9.0 no longer requires NATS or etcd; remove these dependencies and their configuration. KV router queries now run over the native Dynamo endpoint
- Multimodal encoder bottleneck — if image/video processing slows prefill, enable encoder disaggregation (E/P/D split) to offload encoding to dedicated GPUs
- HttpError truncation — v0.9.0 truncates long HTTP error messages to 8192 chars to prevent ValueError on oversized error responses
- Performance tuning — use AIConfigurator to find optimal TP/PP/EP/DP settings before deploying
References
- `troubleshooting.md` — Common errors and fixes
- `kubernetes-deployment.md` — Full Dynamo stack on K8s: Helm install, ETCD/NATS clusters, DynamoJob CRD, worker GPU scheduling, event plane configuration
- `autoscaling.md` — Planner SLO config, HPA manifests with custom metrics, KEDA ScaledObjects, scale-to-zero for batch, Prometheus metric reference
Cross-References
- nixl — KV cache transfer layer powering Dynamo's disaggregated serving
- vllm — Supported backend engine; Dynamo orchestrates distributed vLLM workers
- tensorrt-llm — Supported backend engine; Dynamo orchestrates TRT-LLM workers
- sglang — Supported backend engine; Dynamo orchestrates SGLang workers
- ray-serve — Alternative serving orchestration; Dynamo is GPU-inference-specialized
- llm-d — K8s-native alternative; llm-d uses Gateway API vs Dynamo's built-in routing
- gpu-operator — GPU driver and device plugin prerequisites
- nccl — NCCL for multi-GPU communication in Dynamo workers
- keda — Autoscale Dynamo worker pools
- prometheus-grafana — Monitor Dynamo serving metrics
- aiconfigurator — Automated performance tuning for Dynamo configurations
