name: nvidia-dynamo
description: "NVIDIA Dynamo — distributed LLM inference orchestration with disaggregated prefill-decode, KV-aware routing, NIXL transfers, and SLO-based autoscaling. Use when deploying multi-GPU/multi-node inference with vLLM, TRT-LLM, or SGLang backends."
NVIDIA Dynamo
Dynamo is NVIDIA's open-source distributed inference serving framework for generative AI. It sits above inference engines (vLLM, TensorRT-LLM, SGLang) as an orchestration layer that adds disaggregated serving, intelligent routing, KV cache management, and dynamic GPU scheduling — capabilities individual engines lack.
Not the PyTorch torch.compile Dynamo compiler. This is the inference serving framework at github.com/ai-dynamo/dynamo.
When to Use Dynamo
- Multi-node LLM inference where models exceed single-GPU memory
- Disaggregated prefill-decode deployments for throughput optimization
- Workloads requiring KV cache reuse across requests (multi-turn, agentic)
- Production Kubernetes deployments needing SLO-based autoscaling
- Any backend: vLLM, TensorRT-LLM, or SGLang
Architecture Overview
Dynamo consists of five core components sitting between the API layer and inference workers:
Frontend (Rust)
OpenAI-compatible HTTP server handling tokenization, prompt templating, and request dispatch. High-performance Rust implementation with OpenAPI 3 spec at /openapi.json.
Smart Router
Routes requests to optimal workers by computing overlap scores between incoming request tokens and KV cache blocks already resident on each GPU across the cluster. Unlike round-robin or load-based routing, the router factors in:
- Cache hit rate — how much KV cache can be reused on each worker
- Workload balance — current queue depth and GPU utilization
- GPU capacity — available memory and compute headroom
The router maintains a Radix Tree of hashed KV cache locations across all GPUs, enabling O(prefix-length) lookups. It also implements intelligent eviction policies to balance retention vs. lookup latency.
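As a rough illustration of the scoring idea, here is a toy prefix index in Python. The names (`PrefixIndex`, `route_score`) and the weights are illustrative assumptions, not Dynamo's actual API; the real router hashes KV blocks and tracks eviction, which this sketch omits.

```python
class _Node:
    __slots__ = ("children", "workers")
    def __init__(self):
        self.children = {}    # hashed block id -> _Node
        self.workers = set()  # workers that hold this block

class PrefixIndex:
    """Toy radix-style index mapping block-hash prefixes to the workers
    that already cache them; a lookup walks at most O(prefix length) nodes."""
    def __init__(self):
        self.root = _Node()

    def insert(self, block_hashes, worker):
        node = self.root
        for h in block_hashes:
            node = node.children.setdefault(h, _Node())
            node.workers.add(worker)

    def overlap(self, block_hashes, worker):
        """Count of leading request blocks already resident on `worker`."""
        node, matched = self.root, 0
        for h in block_hashes:
            node = node.children.get(h)
            if node is None or worker not in node.workers:
                break
            matched += 1
        return matched

def route_score(index, request_blocks, worker, queue_depth,
                w_cache=1.0, w_load=0.5):
    # Reward reusable KV cache on the candidate worker, penalize its queue depth.
    return w_cache * index.overlap(request_blocks, worker) - w_load * queue_depth
```

With two workers caching different prefixes of the same conversation, the worker holding the longer matching prefix scores higher at equal load.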
GAIE Integration (v0.9.0+): The router accepts Gateway API Inference Extension (GAIE) routing hints — K8s-native inference routing metadata (model identity, session affinity, priority) can flow through standard Gateway API headers into Dynamo's routing decisions.
Planner (GPU Resource Scheduler)
The Planner continuously monitors GPU capacity metrics and makes dynamic decisions:
- Aggregated vs. disaggregated — whether a request should use combined prefill+decode on one GPU or split across specialized pools
- Worker scaling — how many GPUs to assign to prefill vs. decode phases
- SLO enforcement — adjusts allocation to meet TTFT and ITL targets
The SLO-based Planner (v0.4+, updated v0.9.0) uses a forward-looking approach:
- Pre-deployment profiling to understand behavior under different configs
- Traffic prediction using a Kalman filter with Mooncake-style warmup (replaces the earlier ARIMA/Prophet approach)
- Output block tracking with fractional decay — the router accounts for expected request duration when scoring workers, not just current cache state
- Minimum worker calculation to meet SLA under predicted demand
- Continuous reassessment and proactive scaling before bottlenecks occur
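The predict-then-provision loop can be sketched numerically: a scalar Kalman update smooths the observed request rate, and a minimum-worker bound follows from profiled per-worker throughput. The function names, headroom factor, and noise parameters here are illustrative assumptions, not the Planner's actual internals.

```python
import math

def kalman_step(est, var, measurement, process_var=1.0, meas_var=4.0):
    """One scalar Kalman update: blend the previous rate estimate with a
    newly observed request rate, weighted by relative uncertainty."""
    var = var + process_var           # predict: uncertainty grows over time
    gain = var / (var + meas_var)     # Kalman gain
    est = est + gain * (measurement - est)
    var = (1.0 - gain) * var
    return est, var

def min_workers(predicted_rps, per_worker_rps, headroom=0.2):
    """Smallest worker count serving the predicted load with headroom.
    `per_worker_rps` would come from pre-deployment profiling; `headroom`
    keeps utilization below saturation so TTFT/ITL targets hold."""
    if predicted_rps <= 0:
        return 1  # never scale below one warm worker
    effective = per_worker_rps * (1.0 - headroom)
    return max(1, math.ceil(predicted_rps / effective))
```

For example, a smoothed prediction of 100 req/s against profiled workers handling 25 req/s each (with 20% headroom) yields a floor of 5 workers.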
Distributed KV Cache Manager (KVBM)
Manages KV cache across memory hierarchies — GPU HBM, CPU DRAM, local SSD, networked object storage. Uses tiered caching policies: frequently accessed blocks stay in GPU memory, less-accessed blocks migrate to cheaper tiers. Supports cluster-level cache management spanning multiple nodes.
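A minimal sketch of the tiering idea, assuming a two-level hierarchy with LRU demotion and promotion on reuse. The class and tier names are illustrative; KVBM's real policies span more tiers (SSD, networked storage) and operate cluster-wide.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier cache: hot KV blocks stay in a small 'gpu' tier,
    LRU blocks demote to a larger 'host' tier instead of being dropped."""
    def __init__(self, gpu_slots, host_slots):
        self.gpu = OrderedDict()   # block_id -> payload, most recent last
        self.host = OrderedDict()
        self.gpu_slots, self.host_slots = gpu_slots, host_slots

    def put(self, block_id, payload):
        self.gpu[block_id] = payload
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_slots:
            victim, data = self.gpu.popitem(last=False)  # evict LRU from GPU tier
            self.host[victim] = data                     # demote, don't drop
            while len(self.host) > self.host_slots:
                self.host.popitem(last=False)            # host tier is finite too

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id], "gpu"
        if block_id in self.host:
            payload = self.host.pop(block_id)
            self.put(block_id, payload)                  # promote on reuse
            return payload, "gpu"
        return None, None
```

A reused block comes back from the host tier into GPU memory, which is the behavior that makes multi-turn and agentic workloads cheap to resume.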
Event Plane (v0.9.0+)
Transport-agnostic pub/sub layer built on ZMQ with MessagePack serialization. Together with the Discovery Plane and Request Plane, it forms Dynamo's fully decoupled communication architecture. The Event Plane replaces NATS for system event distribution — KV router queries, worker lifecycle events, and scaling signals all flow through it without external message brokers.
Disaggregated Prefill-Decode Serving
The core innovation. Traditional inference co-locates prefill (compute-bound, processes all input tokens) and decode (memory-bandwidth-bound, generates tokens one at a time) on the same GPU. This wastes resources because:
- Prefill benefits from low tensor parallelism (less communication overhead)
- Decode benefits from high tensor parallelism (more memory bandwidth)
- Long prefills block decode, causing latency spikes for other users
Disaggregated serving separates these phases onto different GPU pools, enabling:
- Independent parallelism strategies per phase
- Independent scaling (more prefill GPUs during summarization spikes, more decode GPUs during chat workloads)
- Better SLO control — TTFT and ITL managed independently
- Up to 30x throughput improvement on GB200 NVL72 with DeepSeek-R1, 2x+ on Hopper with Llama 70B
When NOT to Disaggregate
- Short input sequences where prefill is cheap
- Low traffic where GPU utilization is already low
- Single-GPU deployments where overhead exceeds benefit
- The Planner can dynamically choose aggregated mode per-request when disaggregation isn't beneficial
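A per-request policy in this spirit can be sketched as a simple threshold rule. The 2048-token cutoff and the function name are made-up illustrations, not Dynamo's actual heuristic, which also weighs cluster state.

```python
def choose_serving_mode(prompt_tokens, prefill_pool_free=True,
                        disagg_threshold_tokens=2048):
    """Short prompts stay aggregated (KV transfer overhead would dominate);
    long prefills go to the dedicated prefill pool so decode latency on
    other requests stays stable."""
    if prompt_tokens < disagg_threshold_tokens:
        return "aggregated"
    if not prefill_pool_free:
        return "aggregated"  # fall back rather than queue behind the pool
    return "disaggregated"
```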
Encoder Disaggregation (v0.9.0+)
For multimodal models, encoding (processing images/video/audio) can be separated onto dedicated GPUs, creating a full Encode/Prefill/Decode (E/P/D) split:
- vLLM: Uses the Embedding Cache connector — the encoder runs on a separate GPU and caches computed embeddings for prefill workers to consume
- TRT-LLM: Uses a standalone encoder process
- SGLang: Supports single-GPU multimodal serving without requiring a full E/P/D split
This prevents heavy vision encoding from starving the prefill/decode pipeline and allows independent scaling of encoder capacity for image-heavy workloads.
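The embedding-cache pattern can be sketched as follows: embeddings are keyed by content hash so identical images are encoded once and reused by prefill workers. This is a conceptual illustration only; the names and hashing scheme are assumptions, not vLLM's Embedding Cache connector API.

```python
import hashlib

class EmbeddingCache:
    """Toy embedding cache in the spirit of encoder disaggregation: the
    encoder computes an image embedding once, keyed by content hash, and
    prefill workers reuse it instead of re-encoding."""
    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(image_bytes):
        return hashlib.sha256(image_bytes).hexdigest()

    def get_or_encode(self, image_bytes, encode_fn):
        k = self.key(image_bytes)
        if k in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[k] = encode_fn(image_bytes)  # the expensive GPU step
        return self._store[k]
```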
NIXL — NVIDIA Inference Transfer Library
NIXL handles all data movement in disaggregated deployments — primarily KV cache transfer from prefill to decode workers. Key properties:
- Hardware-agnostic API — same semantics across NVLink, InfiniBand, RoCE, Ethernet
- Non-blocking, non-contiguous transfers — async by design, handles scattered KV blocks
- Heterogeneous memory support — GPU HBM ↔ CPU DRAM ↔ SSD ↔ networked storage (S3, block, file)
- Backend auto-selection — Dynamo's policy engine picks optimal transport (GPUDirect RDMA, UCX, GPUDirect Storage)
- Storage partner integrations — Dell PowerScale, WEKA for KV cache persistence
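NIXL's actual API is not reproduced here. As a rough stdlib `asyncio` illustration only, the sketch below mimics the transfer pattern: scattered (non-contiguous) block ids, with every move issued concurrently rather than serialized.

```python
import asyncio

async def transfer_block(block_id, src, dst):
    # Stand-in for a one-sided hardware transfer of a single KV block.
    await asyncio.sleep(0)  # yield: transfers never block the event loop
    dst[block_id] = src[block_id]

async def transfer_kv(src, dst, block_ids):
    """NIXL-style pattern: KV blocks live at scattered (non-contiguous)
    ids, and all moves are issued concurrently."""
    await asyncio.gather(*(transfer_block(b, src, dst) for b in block_ids))

# Prefill worker holds blocks 3, 7, 42; the decode worker needs all of them.
prefill_kv = {3: "k3", 7: "k7", 42: "k42"}
decode_kv = {}
asyncio.run(transfer_kv(prefill_kv, decode_kv, [42, 3, 7]))
```

In real deployments the "copy" is a hardware transfer (GPUDirect RDMA, NVLink, UCX) chosen by Dynamo's policy engine, not a Python assignment.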
Backend Support Matrix
| Feature | SGLang | TensorRT-LLM | vLLM |
|---|---|---|---|
| Best for | High throughput | Max performance | Broadest features |
| Disaggregated Serving | ✅ | ✅ | ✅ |
| KV-Aware Routing | ✅ | ✅ | ✅ |
| SLA-Based Planner | ✅ | ✅ | ✅ |
| KVBM | 🚧 | ✅ | ✅ |
| Multimodal | ✅ | ✅ | ✅ |
| Tool Calling | ✅ | ✅ | ✅ |
Diffusion Language Model Support (v0.9.0+)
Dynamo supports LLaDA2.0 and other diffusion-based language models that generate tokens via iterative denoising rather than autoregressive decoding. The Planner and Router handle diffusion LLM scheduling — the decode phase runs fixed-step denoising iterations instead of variable-length token generation, which changes scaling characteristics (predictable step count per request).
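The scaling difference can be made concrete with a small capacity-planning sketch (function name and the 64-step default are illustrative assumptions): autoregressive decode cost depends on each request's output length, while diffusion decode is a fixed per-request step count.

```python
def decode_cost_tokens(requests, mode, denoise_steps=64):
    """Toy decode-cost model: autoregressive cost varies per request,
    diffusion cost is a fixed number of denoising steps per request,
    which makes the load predictable for the Planner."""
    if mode == "autoregressive":
        return sum(r["expected_output_tokens"] for r in requests)
    if mode == "diffusion":
        return len(requests) * denoise_steps
    raise ValueError(f"unknown mode: {mode}")
```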
Deployment
Local Development (Single Machine)
Dynamo supports a zero-dependency local mode using --discovery-backend file that avoids etcd/NATS:
```bash
# Terminal 1: Frontend
python3 -m dynamo.frontend --http-port 8000 --discovery-backend file

# Terminal 2: Worker (choose backend)
python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --discovery-backend file \
  --kv-events-config '{"enable_kv_cache_events": false}'
```
Container Images (Recommended)
Pre-built containers on NGC with all dependencies:
```
nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0
nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0
nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0
```
Kubernetes (Production)
Dynamo has a Kubernetes-native deployment model using CRDs:
- Install Dynamo Platform — operator, CRDs, RBAC
- Create namespace and secrets — HuggingFace token for model downloads
- Apply DynamoGraphDeployment — declarative YAML defining model, backend, topology
```yaml
# Example: kubectl apply -f agg.yaml -n dynamo-system
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
```
Service discovery uses K8s-native EndpointSlices — no etcd or NATS needed in K8s deployments.
Pre-built recipes exist for common configurations:
- Llama-3-70B on 4x H100 (vLLM, aggregated)
- DeepSeek-R1 on 8x H200 (SGLang, disaggregated)
- Qwen3-32B-FP8 on 8x GPU (TRT-LLM, aggregated)
Cloud-specific guides: AWS EKS, Google GKE, Azure AKS.
Tolerations and Node Affinity (v0.9.0+)
DynamoGraphDeployment supports tolerations and nodeAffinity fields directly on graph node specs — required for scheduling onto tainted GPU node pools in EKS/GKE:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: llama-70b-disagg
  namespace: dynamo-system
spec:
  graph:
    prefillWorkers:
      replicas: 2
      resources:
        requests:
          nvidia.com/gpu: "4"
        limits:
          nvidia.com/gpu: "4"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: nvidia.com/gpu.product
                  operator: In
                  values:
                    - NVIDIA-H100-80GB-HBM3
    decodeWorkers:
      replicas: 4
      resources:
        requests:
          nvidia.com/gpu: "4"
        limits:
          nvidia.com/gpu: "4"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
```
Rolling Restarts
DynamoGraphDeployment supports rolling restarts — updating configuration (model parameters, scaling settings, environment variables) triggers a rolling update of affected pods without tearing down the entire inference graph. Unchanged components remain running, minimizing downtime during day-2 config changes.
Service Discovery
As of v0.9.0, NATS and etcd are fully optional in all deployment modes. The Event Plane (ZMQ + MessagePack) replaces NATS, and K8s-native or file-based discovery replaces etcd.
| Deployment | etcd | NATS | Notes |
|---|---|---|---|
| Local dev | ❌ | ❌ | --discovery-backend file |
| Kubernetes | ❌ | ❌ | K8s-native CRDs + EndpointSlices |
| Multi-node (bare metal) | ❌ | ❌ | Event Plane (ZMQ) for pub/sub, file or custom discovery |
AIConfigurator Integration
AIConfigurator recommends optimal PD disaggregation configuration and model parallelism strategy for a given model, GPU budget, and SLO targets. It:
- Uses pre-measured performance data across model layers (attention, FFN, communication, memory)
- Models different scheduling techniques (static batching, inflight batching, disaggregated serving)
- Suggests PD configs maximizing throughput/GPU while meeting SLOs
- Generates backend configurations deployable directly in Dynamo
Available as CLI and web interface. Initial support for TensorRT-LLM on Hopper.
See the aiconfigurator skill for detailed usage.
Autoscaling
For HPA manifests, KEDA ScaledObjects, scale-to-zero patterns, and Prometheus recording rules, see references/autoscaling.md.
Three scaling approaches:
| Approach | When to Use |
|---|---|
| Dynamo Planner | Disaggregated P/D deployments — SLO-aware, scales prefill and decode pools independently |
| HPA + Prometheus Adapter | Monolithic vLLM workers — scale on dynamo_queued_requests or TTFT percentiles |
| KEDA | Event-driven or scale-to-zero for batch inference — no Prometheus Adapter needed |
Key metrics: dynamo_queued_requests (queue depth), dynamo_ttft_seconds (latency histogram), vllm:gpu_cache_usage_perc (KV cache utilization).
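For the HPA path, the standard Kubernetes scaling rule applied to a queue-depth metric looks like this in miniature (the tolerance and replica bounds mirror typical HPA defaults; the target of 10 queued requests per replica is an example value, not a recommendation):

```python
import math

def desired_replicas(current_replicas, metric_value, target_value,
                     tolerance=0.1, min_replicas=1, max_replicas=32):
    """Kubernetes HPA rule: desired = ceil(current * metric / target),
    skipped when the ratio is within tolerance to avoid flapping."""
    ratio = metric_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

With 4 replicas averaging 25 queued requests each against a target of 10, the controller scales to 10 replicas; a reading of 10.5 stays within the 10% tolerance and leaves the deployment alone.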
Observability
Dynamo components emit Prometheus metrics across control, data, and event planes:
- Requests/second and duration
- TTFT and ITL (average and percentiles)
- Input/output sequence lengths
- GPU utilization and power
- Custom metric API for user-defined metrics
Pre-built Grafana dashboards are included in the repo (deploy/metrics/).
Distributed Tracing (v0.9.0+)
Dynamo traces span the full request path — from frontend through router, prefill, KV transfer (including TCP transport), and decode. Traces export via OTLP:
```bash
# Enable OTLP trace export
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
export OTEL_SERVICE_NAME=dynamo-frontend
export OTEL_TRACES_EXPORTER=otlp
```
Each component (frontend, router, workers) propagates trace context automatically. Configure an OpenTelemetry Collector to receive traces and forward to Jaeger, Tempo, or your preferred backend.
Fault Tolerance (v0.4+)
- Inflight request re-routing — when a worker goes offline, in-progress requests forward to healthy workers preserving intermediate computation (no reprocessing from scratch)
- Early failure detection — the Smart Router detects worker failures directly, bypassing etcd's slower health check propagation
- Reduces detection-to-recovery window from seconds to sub-second
Dynamo vs Ray Serve
Key architectural difference: Dynamo is a purpose-built LLM inference orchestrator with built-in disaggregated serving, KV-aware routing, and NIXL data transfer. Ray Serve is a general-purpose model serving framework built on the Ray distributed computing runtime.
| Dimension | NVIDIA Dynamo | Ray Serve |
|---|---|---|
| Primary focus | LLM inference at datacenter scale | General ML model serving |
| Disaggregated P/D | Native, first-class | Via Ray Serve LLM (newer addition) |
| KV cache routing | Built-in Radix Tree router | Not built-in |
| Data transfer | NIXL (RDMA, NVLink, GPUDirect) | Ray object store (plasma) |
| Backends | vLLM, TRT-LLM, SGLang | vLLM, SGLang (via integration) |
| Autoscaling | SLO-based Planner + K8s HPA | Replica-based autoscaling |
| Multi-model | Designed for single-model scaling | Native multi-model composition |
| Non-LLM models | Not designed for this | Full support (CV, audio, etc.) |
| Language | Rust core + Python bindings | Pure Python |
| K8s integration | Native CRDs (DynamoGraphDeployment) | RayService CRD via KubeRay |
| Maturity | Released March 2025 (GTC) | Production since ~2020 |
When to Choose Dynamo
- Pure LLM inference workload needing maximum throughput
- Disaggregated serving with optimized KV cache transfer
- NVIDIA GPU fleet with NVLink/InfiniBand interconnects
- Multi-node model parallelism (TP, PP, EP across nodes)
- SLO-driven autoscaling of prefill/decode pools
When to Choose Ray Serve
- Mixed model types (LLM + vision + audio + custom)
- Complex model composition pipelines with business logic
- Existing Ray ecosystem (Ray Train, Ray Data)
- Need for fractional GPU allocation
- Team preference for Python-centric development
- Heterogeneous hardware environments
When to Choose Gateway API Inference (llm-d)
- Kubernetes-native approach using standard Gateway API
- Want the CNCF ecosystem (Envoy, Istio compatibility)
- Multi-tenant environments with standard K8s RBAC
- Lighter-weight orchestration without full Dynamo stack
- Note: llm-d uses NIXL from Dynamo for KV cache transfer
Relationship to Other NVIDIA Components
- Triton Inference Server — Dynamo's predecessor for general inference; Dynamo is LLM-specialized
- TensorRT-LLM — inference engine (backend); Dynamo orchestrates it
- NIM — NVIDIA's inference microservice packaging; Dynamo will be included with NIM via AI Enterprise
- AIConfigurator — companion tool for optimal Dynamo configuration
- NIXL — Dynamo's data transfer library, also used independently by llm-d
Key Configuration
Environment Variables
- `DYN_LOG` — logging level (same syntax as `RUST_LOG`), e.g., `export DYN_LOG=debug`
Frontend Flags
- `--http-port` — API server port (default 8000)
- `--discovery-backend file|etcd|k8s` — service discovery mode
Worker Flags (vary by backend)
- `--model` / `--model-path` — HuggingFace model identifier
- `--kv-events-config` — KV cache event publishing (vLLM requires explicit disable for local dev)
- Backend-specific flags pass through to the underlying engine
Troubleshooting
- "Cannot connect to ModelExpress server" (TRT-LLM) — expected warning, safe to ignore
- KV events errors in local dev — add `--kv-events-config '{"enable_kv_cache_events": false}'` for vLLM; SGLang/TRT-LLM disable by default
- System verification — run `python3 deploy/sanity_check.py` before starting Dynamo
- Worker detection delay — with Event Plane (v0.9.0+), worker registration is near-instant via ZMQ; older etcd-based setups may take seconds
- NATS/etcd errors after upgrade — v0.9.0 no longer requires NATS or etcd; remove these dependencies and their configuration. KV router queries now run over the native Dynamo endpoint
- Multimodal encoder bottleneck — if image/video processing slows prefill, enable encoder disaggregation (E/P/D split) to offload encoding to dedicated GPUs
- HttpError truncation — v0.9.0 truncates long HTTP error messages to 8192 chars to prevent ValueError on oversized error responses
- Performance tuning — use AIConfigurator to find optimal TP/PP/EP/DP settings before deploying
References
- `troubleshooting.md` — Common errors and fixes
- `kubernetes-deployment.md` — Full Dynamo stack on K8s: Helm install, ETCD/NATS clusters, DynamoJob CRD, worker GPU scheduling, event plane configuration
- `autoscaling.md` — Planner SLO config, HPA manifests with custom metrics, KEDA ScaledObjects, scale-to-zero for batch, Prometheus metric reference
Cross-References
- nixl — KV cache transfer layer powering Dynamo's disaggregated serving
- vllm — Supported backend engine; Dynamo orchestrates distributed vLLM workers
- tensorrt-llm — Supported backend engine; Dynamo orchestrates TRT-LLM workers
- sglang — Supported backend engine; Dynamo orchestrates SGLang workers
- ray-serve — Alternative serving orchestration; Dynamo is GPU-inference-specialized
- llm-d — K8s-native alternative; llm-d uses Gateway API vs Dynamo's built-in routing
- gpu-operator — GPU driver and device plugin prerequisites
- nccl — NCCL for multi-GPU communication in Dynamo workers
- keda — Autoscale Dynamo worker pools
- prometheus-grafana — Monitor Dynamo serving metrics
- aiconfigurator — Automated performance tuning for Dynamo configurations
