NVIDIA GPU Operator on Kubernetes — ClusterPolicy, DRA (Dynamic Resource Allocation), time-slicing, MIG, DCGM metrics, driver management, GPUDirect RDMA/GDS, CDI, and GPU scheduling. Use when installing, configuring, or troubleshooting GPU infrastructure on K8s.
NVIDIA GPU Operator
Automates management of NVIDIA software components on Kubernetes — drivers, device plugin, container toolkit, DCGM, and GPU feature discovery. Version: v25.10.x
Docs: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/ GitHub: https://github.com/NVIDIA/gpu-operator
ClusterPolicy Configuration
The GPU Operator is configured through a ClusterPolicy CRD. Key Helm values map directly to ClusterPolicy spec fields.
Essential Helm Values
helm install gpu-operator nvidia/gpu-operator \
-n gpu-operator --create-namespace \
--version=v25.10.1 \
--set driver.enabled=true \
--set driver.kernelModuleType=auto \
--set dcgmExporter.enabled=true \
--set cdi.enabled=true \
--set mig.strategy=single
| Parameter | Default | Purpose |
|---|---|---|
| `driver.enabled` | `true` | Deploy driver containers; set `false` if drivers are pre-installed |
| `driver.kernelModuleType` | `auto` | `auto`, `open`, or `proprietary` — `auto` picks based on GPU + driver branch |
| `driver.rdma.enabled` | `false` | Build/load nvidia-peermem for GPUDirect RDMA |
| `driver.rdma.useHostMofed` | `false` | Set `true` if MLNX_OFED is pre-installed on the host |
| `cdi.enabled` | `true` | Use Container Device Interface for GPU injection (v25.10+) |
| `dcgmExporter.enabled` | `true` | Deploy DCGM Exporter for Prometheus GPU metrics |
| `devicePlugin.config` | `{}` | ConfigMap name for time-slicing or MPS config |
| `mig.strategy` | `none` | `none`, `single`, or `mixed` |
| `nfd.enabled` | `true` | Deploy Node Feature Discovery; disable if already running |
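The same configuration can be kept in a values file instead of repeated `--set` flags — a sketch mirroring the flags above; adjust values to your environment:

```yaml
# values.yaml — equivalent of the --set flags shown above
driver:
  enabled: true
  kernelModuleType: auto
dcgmExporter:
  enabled: true
cdi:
  enabled: true
mig:
  strategy: single
```

Install with `helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace --version=v25.10.1 -f values.yaml`.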
Pre-installed Drivers
When using pre-installed drivers (e.g., AWS EKS optimized AMIs, or host-managed installs):
helm install gpu-operator nvidia/gpu-operator \
-n gpu-operator --create-namespace \
--set driver.enabled=false
Patching ClusterPolicy
Modify configuration post-install:
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
--type='json' \
-p='[{"op":"replace", "path":"/spec/mig/strategy", "value":"mixed"}]'
GPU Time-Slicing
Oversubscribe GPUs by advertising multiple replicas per physical GPU. No memory or fault isolation (unlike MIG).
Configuration
Create a ConfigMap in the GPU Operator namespace:
apiVersion: v1
kind: ConfigMap
metadata:
name: time-slicing-config
namespace: gpu-operator
data:
any: |-
version: v1
flags:
migStrategy: none
sharing:
timeSlicing:
renameByDefault: false
failRequestsGreaterThanOne: false
resources:
- name: nvidia.com/gpu
replicas: 4
| Field | Purpose |
|---|---|
| `replicas` | Number of virtual GPUs per physical GPU |
| `renameByDefault` | If `true`, advertise as `nvidia.com/gpu.shared` instead of `nvidia.com/gpu` |
| `failRequestsGreaterThanOne` | If `true`, reject pods requesting >1 GPU (prevents accidental full-GPU requests) |
| `migStrategy` | Set to `none` for time-slicing only; `mixed` to combine with MIG |
Enable Time-Slicing
Reference the ConfigMap in the ClusterPolicy:
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
--type merge \
-p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"any"}}}}'
Apply to specific nodes via label:
kubectl label node <node> nvidia.com/device-plugin.config=any
Node Labels After Configuration
nvidia.com/gpu.replicas = 4
nvidia.com/gpu.product = A100-SXM4-80GB-SHARED
Use the -SHARED suffix in node selectors to schedule onto time-sliced GPUs specifically.
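A pod targeting time-sliced GPUs can combine that selector with a normal GPU request — a sketch, assuming an A100 node and `renameByDefault: false` (so the resource name stays `nvidia.com/gpu`); the image name is hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: notebook
spec:
  nodeSelector:
    # -SHARED suffix is added by GPU Feature Discovery on time-sliced nodes
    nvidia.com/gpu.product: A100-SXM4-80GB-SHARED
  containers:
  - name: jupyter
    image: my-notebook:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1       # one time-sliced replica, not a full GPU
```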
Best Practices
- Dev/test: 4-8 replicas — acceptable for notebooks, small experiments
- Training: Avoid time-slicing — no memory isolation means OOM from neighbors
- Inference: 2-4 replicas for small models with predictable memory usage
- Set `failRequestsGreaterThanOne: true` to prevent accidental exclusive-access requests
Multi-Instance GPU (MIG)
Hardware-level GPU partitioning with memory and fault isolation. Supported on A100, A30, H100, H200.
MIG Strategies
| Strategy | Use Case | Behavior |
|---|---|---|
| `single` | All GPUs get the same MIG profile | One resource type advertised (e.g., `nvidia.com/gpu`) |
| `mixed` | Different profiles per GPU | Multiple resource types (e.g., `nvidia.com/mig-1g.10gb`) |
Configure MIG Profiles
# Set strategy
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
--type='json' \
-p='[{"op":"replace", "path":"/spec/mig/strategy", "value":"single"}]'
# Label node with desired profile
kubectl label nodes <node> nvidia.com/mig.config=all-1g.10gb --overwrite
Common MIG Profiles (H100 80GB)
| Profile | Instances | GPU Memory Each | Compute (SMs) |
|---|---|---|---|
| `all-1g.10gb` | 7 | 10GB | 1/7 |
| `all-2g.20gb` | 3 | 20GB | 2/7 |
| `all-3g.40gb` | 2 | 40GB | 3/7 |
| `all-7g.80gb` | 1 | 80GB | 7/7 |
Mixed MIG Example
For heterogeneous profiles on a single GPU:
kubectl label nodes <node> \
nvidia.com/mig.config=1g.10gb-2,3g.40gb-1 --overwrite
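With the `mixed` strategy, pods then request the profile-specific resource name — a sketch; the image name is hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: small-inference
spec:
  containers:
  - name: model
    image: my-inference:latest    # hypothetical image
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1   # one 1g.10gb MIG slice
```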
MIG + Time-Slicing
Combine MIG partitions with time-slicing for maximum sharing:
sharing:
timeSlicing:
resources:
- name: nvidia.com/mig-1g.10gb
replicas: 2
MIG Best Practices
- Reconfiguring MIG requires stopping all GPU workloads on the node — cordon first
- Use `single` strategy when all workloads need the same partition size
- Use `mixed` when running diverse workloads (e.g., inference + training on the same node)
- Monitor the `nvidia.com/mig.config.state` label — it should read `success` after configuration
DCGM Monitoring
DCGM Exporter provides Prometheus metrics for GPU health and utilization.
Key Metrics
| Metric | Description |
|---|---|
| `DCGM_FI_DEV_GPU_UTIL` | GPU compute utilization % |
| `DCGM_FI_DEV_MEM_COPY_UTIL` | Memory bandwidth utilization % |
| `DCGM_FI_DEV_FB_USED` | Framebuffer memory used (MB) |
| `DCGM_FI_DEV_FB_FREE` | Framebuffer memory free (MB) |
| `DCGM_FI_DEV_GPU_TEMP` | GPU temperature (°C) |
| `DCGM_FI_DEV_POWER_USAGE` | Power draw (W) |
| `DCGM_FI_DEV_SM_CLOCK` | SM clock frequency (MHz) |
| `DCGM_FI_DEV_XID_ERRORS` | XID error count (hardware errors) |
| `DCGM_FI_DEV_PCIE_REPLAY_COUNTER` | PCIe replay errors (connectivity issues) |
Custom Metrics ConfigMap
Override default metrics collection:
apiVersion: v1
kind: ConfigMap
metadata:
name: dcgm-metrics
namespace: gpu-operator
data:
default-counters.csv: |
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization
DCGM_FI_DEV_FB_USED, gauge, Framebuffer used
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer free
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature
DCGM_FI_DEV_POWER_USAGE, gauge, Power usage
DCGM_FI_DEV_XID_ERRORS, gauge, XID errors
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Graphics engine active ratio
DCGM_FI_PROF_DRAM_ACTIVE, gauge, DRAM active ratio
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Tensor core active ratio
Reference in Helm:
helm install gpu-operator nvidia/gpu-operator \
--set dcgmExporter.config.name=dcgm-metrics
HPC Job Mapping (v25.10.1+)
DCGM Exporter can associate GPU metrics with HPC jobs scheduled by a workload manager (e.g., Slurm). Enable via ClusterPolicy:
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
--type merge \
-p '{"spec":{"dcgmExporter":{"hpcJobMapping":{"enabled":true,"directory":"/var/lib/dcgm-exporter/job-mapping"}}}}'
| Field | Default | Purpose |
|---|---|---|
| `dcgmExporter.hpcJobMapping.enabled` | `false` | Enable HPC job-to-GPU metric association |
| `dcgmExporter.hpcJobMapping.directory` | `/var/lib/dcgm-exporter/job-mapping` | Directory where the workload manager writes job-mapping files |
Prometheus ServiceMonitor
DCGM Exporter creates a Service that Prometheus can scrape. For kube-prometheus-stack:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: dcgm-exporter
namespace: gpu-operator
spec:
selector:
matchLabels:
app: nvidia-dcgm-exporter
endpoints:
- port: gpu-metrics
interval: 15s
Useful PromQL Queries
# Average GPU utilization per node
avg by (node) (DCGM_FI_DEV_GPU_UTIL)
# Memory pressure — nodes over 90% GPU memory
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.9
# XID errors (hardware issues)
rate(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
# Tensor core utilization (training efficiency)
avg by (pod) (DCGM_FI_PROF_PIPE_TENSOR_ACTIVE)
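With kube-prometheus-stack, the XID query above can back an alert — a sketch; tune the window, labels, and severity to your environment:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-xid-errors
  namespace: gpu-operator
spec:
  groups:
  - name: gpu-health
    rules:
    - alert: GpuXidErrors
      expr: rate(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "XID errors detected on a GPU — check node hardware health"
```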
GPU Feature Discovery Labels
GFD automatically labels nodes. Useful for scheduling:
| Label | Example | Use For |
|---|---|---|
| `nvidia.com/gpu.product` | `NVIDIA-H100-80GB-HBM3` | Target specific GPU models |
| `nvidia.com/gpu.memory` | `81920` | Schedule by GPU memory (MB) |
| `nvidia.com/gpu.count` | `8` | Node GPU count |
| `nvidia.com/gpu.replicas` | `4` | Time-slicing replica count |
| `nvidia.com/mig.capable` | `true` | MIG-capable GPUs |
| `nvidia.com/cuda.driver.major` | `570` | Driver version filtering |
| `nvidia.com/gpu.compute.major` | `9` | Compute capability (9.0 = H100) |
Node Selector Example
nodeSelector:
nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
nvidia.com/gpu.sharing-strategy: none # exclusive access
GPUDirect RDMA
Enables direct data exchange between GPUs and network devices (ConnectX NICs, BlueField DPUs) via PCIe. Two kernel-level approaches:
| Approach | Requirements | Helm Config |
|---|---|---|
| DMA-BUF (recommended) | Open kernel modules, Linux 5.12+, CUDA 11.7+ | Default — no extra flags needed |
| Legacy nvidia-peermem | Any kernel module type, MLNX_OFED required | --set driver.rdma.enabled=true |
# DMA-BUF with host-installed MOFED
helm install gpu-operator nvidia/gpu-operator \
--set driver.rdma.useHostMofed=true
# Legacy nvidia-peermem
helm install gpu-operator nvidia/gpu-operator \
--set driver.rdma.enabled=true
Use with NVIDIA Network Operator for full RDMA lifecycle management. For GDS and GDRCopy configuration, see references/advanced-features.md.
Container Toolkit and CDI
The NVIDIA Container Toolkit ensures containers can access GPUs. CDI (Container Device Interface) is the modern injection method (default in v25.10+).
| Helm Parameter | Default | Purpose |
|---|---|---|
| `toolkit.enabled` | `true` | Deploy Container Toolkit DaemonSet |
| `cdi.enabled` | `true` | Use CDI for device injection |
| `cdi.default` | `false` | Make CDI the default runtime mode |
Validator
The GPU Operator Validator runs post-install checks to confirm all components are functional:
kubectl get pods -n gpu-operator -l app=nvidia-operator-validator
kubectl logs -n gpu-operator -l app=nvidia-operator-validator
| Helm Parameter | Default | Purpose |
|---|---|---|
| `validator.driver.env` | `[]` | Env vars for driver validation (e.g., `DISABLE_DEV_CHAR_SYMLINK_CREATION`) |
Diagnostics
Run the bundled diagnostic script for a comprehensive GPU Operator health check:
bash scripts/gpu-operator-diag.sh [namespace]
Checks: operator pod health, ClusterPolicy state, GPU node detection, driver/device-plugin/DCGM/GFD pod status, GPU workload inventory, scheduling failures, and time-slicing configuration.
Troubleshooting
See references/troubleshooting.md for:
- Driver pod CrashLoopBackOff diagnosis
- Device plugin not advertising GPUs
- MIG configuration stuck in pending state
- DCGM Exporter missing metrics
- Node labeling issues after upgrades
- GPU operator upgrade procedures
Dynamic Resource Allocation (DRA)
DRA replaces the traditional device plugin model with flexible, claim-based GPU allocation. Instead of requesting nvidia.com/gpu: 1, workloads create ResourceClaim objects that reference DeviceClass resources with CEL-based attribute selectors.
Requirements: Kubernetes ≥ v1.34.2, GPU Operator v25.10.0+, NVIDIA Driver ≥ 580.
Install with DRA
DRA uses a separate Helm chart (nvidia-dra-driver-gpu) alongside the GPU Operator. The legacy device plugin must be disabled:
# Label nodes for DRA
kubectl label node <node> nvidia.com/dra-kubelet-plugin=true
# Install GPU Operator with device plugin disabled
helm upgrade --install gpu-operator nvidia/gpu-operator \
--version=v25.10.1 \
--namespace gpu-operator --create-namespace \
--set devicePlugin.enabled=false \
--set driver.manager.env[0].name=NODE_LABEL_FOR_GPU_POD_EVICTION \
--set driver.manager.env[0].value="nvidia.com/dra-kubelet-plugin"
# Install DRA driver
helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
--version="25.12.0" \
--namespace nvidia-dra-driver-gpu --create-namespace \
--set nvidiaDriverRoot=/run/nvidia/driver \
--set gpuResourcesEnabledOverride=true \
-f dra-values.yaml
DRA driver dra-values.yaml:
image:
pullPolicy: IfNotPresent
kubeletPlugin:
nodeSelector:
nvidia.com/dra-kubelet-plugin: "true"
Verify
kubectl get pods -n nvidia-dra-driver-gpu
kubectl get deviceclass
# Expected: gpu.nvidia.com, mig.nvidia.com
Workload Example
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
name: gpu-claim
spec:
spec:
devices:
requests:
- name: gpu
deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
name: inference
spec:
containers:
- name: model
image: my-inference:latest
resources:
claims:
- name: gpu
resourceClaims:
- name: gpu
resourceClaimTemplateName: gpu-claim
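Claims can also filter devices by attribute. A minimal sketch of a CEL selector, mirroring the request structure above — attribute names vary by DRA driver version, so treat this as illustrative:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: a100-claim
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
        selectors:
        - cel:
            # hypothetical attribute; consult the DRA driver's published attributes
            expression: device.attributes['gpu.nvidia.com'].productName.contains('A100')
```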
For CEL-based GPU filtering, MIG with DRA, ComputeDomains (Multi-Node NVLink), prioritized device lists, health monitoring, and migration guide from device plugin to DRA, see references/dra.md.
Cross-References
- kueue — GPU quota management and job queueing
- deepspeed — Distributed training with ZeRO optimization
- promql — Query DCGM metrics in Prometheus
- nccl — NCCL communication layer managed alongside GPU drivers
- vllm — GPU inference serving requiring GPU Operator
- sglang — Alternative inference engine requiring GPU Operator
- leaderworkerset — Multi-node GPU workloads with device plugin
- fsdp — Distributed training requiring GPU access
- megatron-lm — Large-scale training requiring multi-GPU setup
- karpenter — Node provisioning for GPU instances
- volcano — Batch scheduling for GPU training jobs
- prometheus-grafana — Scrape DCGM-exporter GPU metrics
References
- `dra.md` — Dynamic Resource Allocation: CEL selectors, MIG with DRA, ComputeDomains, health checks, migration from device plugin
- `advanced-features.md` — KubeVirt GPU passthrough, Kata Containers, Confidential Computing, NVIDIADriver CRD, GDS, and GDRCopy
- `security-hardening.md` — Pod Security Standards, security contexts, /dev/shm, NCCL NetworkPolicies, GPU device plugin security
- `troubleshooting.md` — Operator pod failures, driver issues, and GPU detection problems
