gpu-operator

@tylertitsworth/gpu-operator, by tylertitsworth. Updated 4/1/2026.

NVIDIA GPU Operator on Kubernetes — ClusterPolicy, DRA (Dynamic Resource Allocation), time-slicing, MIG, DCGM metrics, driver management, GPUDirect RDMA/GDS, CDI, and GPU scheduling. Use when installing, configuring, or troubleshooting GPU infrastructure on K8s.

Installation

npx agent-skills-cli install @tylertitsworth/gpu-operator

Supported assistants: Claude Code, Cursor, Copilot, Codex, Antigravity.

Details

Path: gpu-operator/SKILL.md
Branch: main
Scoped Name: @tylertitsworth/gpu-operator

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

npx agent-skills-cli list

Skill Instructions


---
name: gpu-operator
description: "NVIDIA GPU Operator on Kubernetes — ClusterPolicy, DRA (Dynamic Resource Allocation), time-slicing, MIG, DCGM metrics, driver management, GPUDirect RDMA/GDS, CDI, and GPU scheduling. Use when installing, configuring, or troubleshooting GPU infrastructure on K8s."
---

NVIDIA GPU Operator

Automates management of NVIDIA software components on Kubernetes — drivers, device plugin, container toolkit, DCGM, and GPU feature discovery. Version: v25.10.x

Docs: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/
GitHub: https://github.com/NVIDIA/gpu-operator

ClusterPolicy Configuration

The GPU Operator is configured through a ClusterPolicy CRD. Key Helm values map directly to ClusterPolicy spec fields.

Essential Helm Values

helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --version=v25.10.1 \
  --set driver.enabled=true \
  --set driver.kernelModuleType=auto \
  --set dcgmExporter.enabled=true \
  --set cdi.enabled=true \
  --set mig.strategy=single

| Parameter | Default | Purpose |
|---|---|---|
| driver.enabled | true | Deploy driver containers; set false if drivers pre-installed |
| driver.kernelModuleType | auto | auto, open, or proprietary — auto picks based on GPU + driver branch |
| driver.rdma.enabled | false | Build/load nvidia-peermem for GPUDirect RDMA |
| driver.rdma.useHostMofed | false | Set true if MLNX_OFED pre-installed on host |
| cdi.enabled | true | Use Container Device Interface for GPU injection (v25.10+) |
| dcgmExporter.enabled | true | Deploy DCGM Exporter for Prometheus GPU metrics |
| devicePlugin.config | {} | ConfigMap name for time-slicing or MPS config |
| mig.strategy | none | none, single, or mixed |
| nfd.enabled | true | Deploy Node Feature Discovery; disable if already running |
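
For GitOps-style installs, the same settings can be kept in a values file instead of repeated --set flags; a sketch mirroring the install command above:

```yaml
# values.yaml, mirroring the --set flags shown above
driver:
  enabled: true
  kernelModuleType: auto
dcgmExporter:
  enabled: true
cdi:
  enabled: true
mig:
  strategy: single
```

Apply with helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace --version=v25.10.1 -f values.yaml.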

Pre-installed Drivers

When using pre-installed drivers (e.g., AWS EKS optimized AMIs, or host-managed installs):

helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.enabled=false
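
Either way, a quick check that GPUs are schedulable is a one-shot pod that requests a GPU and runs nvidia-smi; the image tag below is illustrative, so pick one matching your CUDA and driver versions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.4.1-base-ubuntu22.04  # illustrative tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```

kubectl logs gpu-smoke-test should print the driver version and GPU model if the stack is healthy.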

Patching ClusterPolicy

Modify configuration post-install:

kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  --type='json' \
  -p='[{"op":"replace", "path":"/spec/mig/strategy", "value":"mixed"}]'

GPU Time-Slicing

Oversubscribe GPUs by advertising multiple replicas per physical GPU. No memory or fault isolation (unlike MIG).

Configuration

Create a ConfigMap in the GPU Operator namespace:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4

| Field | Purpose |
|---|---|
| replicas | Number of virtual GPUs per physical GPU |
| renameByDefault | If true, advertise as nvidia.com/gpu.shared instead of nvidia.com/gpu |
| failRequestsGreaterThanOne | If true, reject pods requesting >1 GPU (prevents accidental full-GPU requests) |
| migStrategy | Set to none for time-slicing only; mixed to combine with MIG |

Enable Time-Slicing

Reference the ConfigMap in the ClusterPolicy:

kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  --type merge \
  -p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"any"}}}}'

Apply to specific nodes via label:

kubectl label node <node> nvidia.com/device-plugin.config=any

Node Labels After Configuration

nvidia.com/gpu.replicas = 4
nvidia.com/gpu.product = A100-SXM4-80GB-SHARED

Use the -SHARED suffix in node selectors to schedule onto time-sliced GPUs specifically.
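
For example, a notebook pod can pin itself to time-sliced GPUs via the -SHARED product label; the product name and image below are illustrative and will vary with your hardware:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: notebook
spec:
  nodeSelector:
    nvidia.com/gpu.product: A100-SXM4-80GB-SHARED  # matches time-sliced nodes only
  containers:
  - name: jupyter
    image: jupyter/base-notebook  # illustrative image
    resources:
      limits:
        nvidia.com/gpu: 1  # one replica of a shared GPU, not a full GPU
```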

Best Practices

  • Dev/test: 4-8 replicas — acceptable for notebooks, small experiments
  • Training: Avoid time-slicing — no memory isolation means OOM from neighbors
  • Inference: 2-4 replicas for small models with predictable memory usage
  • Set failRequestsGreaterThanOne: true to prevent accidental exclusive access requests

Multi-Instance GPU (MIG)

Hardware-level GPU partitioning with memory and fault isolation. Supported on A100, A30, H100, H200.

MIG Strategies

| Strategy | Use Case | Behavior |
|---|---|---|
| single | All GPUs get same MIG profile | One resource type advertised (e.g., nvidia.com/gpu) |
| mixed | Different profiles per GPU | Multiple resource types (e.g., nvidia.com/mig-1g.10gb) |

Configure MIG Profiles

# Set strategy
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  --type='json' \
  -p='[{"op":"replace", "path":"/spec/mig/strategy", "value":"single"}]'

# Label node with desired profile
kubectl label nodes <node> nvidia.com/mig.config=all-1g.10gb --overwrite

Common MIG Profiles (H100 80GB)

| Profile | Instances | GPU Memory Each | Compute (SMs) |
|---|---|---|---|
| all-1g.10gb | 7 | 10GB | 1/7 |
| all-2g.20gb | 3 | 20GB | 2/7 |
| all-3g.40gb | 2 | 40GB | 3/7 |
| all-7g.80gb | 1 | 80GB | 7/7 |

Mixed MIG Example

For heterogeneous profiles on a single GPU:

kubectl label nodes <node> \
  nvidia.com/mig.config=1g.10gb-2,3g.40gb-1 --overwrite
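
Under the mixed strategy, pods request a specific partition by its resource name; a sketch (the image is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: small-inference
spec:
  containers:
  - name: model
    image: my-inference:latest  # placeholder image
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1  # one 1g.10gb MIG slice
```

Under the single strategy, slices are instead requested as plain nvidia.com/gpu.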

MIG + Time-Slicing

Combine MIG partitions with time-slicing for maximum sharing:

sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/mig-1g.10gb
      replicas: 2

MIG Best Practices

  • Reconfiguring MIG requires stopping all GPU workloads on the node — cordon first
  • Use single strategy when all workloads need the same partition size
  • Use mixed when running diverse workloads (e.g., inference + training on same node)
  • Monitor nvidia.com/mig.config.state label — should be success after configuration

DCGM Monitoring

DCGM Exporter provides Prometheus metrics for GPU health and utilization.

Key Metrics

| Metric | Description |
|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU compute utilization % |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory bandwidth utilization % |
| DCGM_FI_DEV_FB_USED | Framebuffer memory used (MB) |
| DCGM_FI_DEV_FB_FREE | Framebuffer memory free (MB) |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (°C) |
| DCGM_FI_DEV_POWER_USAGE | Power draw (W) |
| DCGM_FI_DEV_SM_CLOCK | SM clock frequency (MHz) |
| DCGM_FI_DEV_XID_ERRORS | XID error count (hardware errors) |
| DCGM_FI_DEV_PCIE_REPLAY_COUNTER | PCIe replay errors (connectivity issues) |

Custom Metrics ConfigMap

Override default metrics collection:

apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-metrics
  namespace: gpu-operator
data:
  default-counters.csv: |
    DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization
    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer used
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer free
    DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature
    DCGM_FI_DEV_POWER_USAGE, gauge, Power usage
    DCGM_FI_DEV_XID_ERRORS, gauge, XID errors
    DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Graphics engine active ratio
    DCGM_FI_PROF_DRAM_ACTIVE, gauge, DRAM active ratio
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Tensor core active ratio

Reference in Helm:

helm install gpu-operator nvidia/gpu-operator \
  --set dcgmExporter.config.name=dcgm-metrics

HPC Job Mapping (v25.10.1+)

DCGM Exporter can associate GPU metrics with HPC jobs scheduled by a workload manager (e.g., Slurm). Enable via ClusterPolicy:

kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  --type merge \
  -p '{"spec":{"dcgmExporter":{"hpcJobMapping":{"enabled":true,"directory":"/var/lib/dcgm-exporter/job-mapping"}}}}'

| Field | Default | Purpose |
|---|---|---|
| dcgmExporter.hpcJobMapping.enabled | false | Enable HPC job-to-GPU metric association |
| dcgmExporter.hpcJobMapping.directory | /var/lib/dcgm-exporter/job-mapping | Directory where the workload manager writes job mapping files |

Prometheus ServiceMonitor

DCGM Exporter creates a Service that Prometheus can scrape. For kube-prometheus-stack:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
  - port: gpu-metrics
    interval: 15s

Useful PromQL Queries

# Average GPU utilization per node
avg by (node) (DCGM_FI_DEV_GPU_UTIL)

# Memory pressure — nodes over 90% GPU memory
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.9

# XID errors (hardware issues)
rate(DCGM_FI_DEV_XID_ERRORS[5m]) > 0

# Tensor core utilization (training efficiency)
avg by (pod) (DCGM_FI_PROF_PIPE_TENSOR_ACTIVE)
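
The same expressions can back alerting; a sketch of a PrometheusRule for kube-prometheus-stack (the namespace and rule names are assumptions to adapt to your Prometheus setup):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: gpu-operator
spec:
  groups:
  - name: gpu.rules
    rules:
    - alert: GPUXidErrors
      expr: rate(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "GPU XID errors detected (possible hardware fault)"
```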

GPU Feature Discovery Labels

GFD automatically labels nodes. Useful for scheduling:

| Label | Example | Use For |
|---|---|---|
| nvidia.com/gpu.product | NVIDIA-H100-80GB-HBM3 | Target specific GPU models |
| nvidia.com/gpu.memory | 81920 | Schedule by GPU memory (MB) |
| nvidia.com/gpu.count | 8 | Node GPU count |
| nvidia.com/gpu.replicas | 4 | Time-slicing replica count |
| nvidia.com/mig.capable | true | MIG-capable GPUs |
| nvidia.com/cuda.driver.major | 570 | Driver version filtering |
| nvidia.com/gpu.compute.major | 9 | Compute capability (9.0 = H100) |

Node Selector Example

nodeSelector:
  nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
  nvidia.com/gpu.sharing-strategy: none  # exclusive access

GPUDirect RDMA

Enables direct data exchange between GPUs and network devices (ConnectX NICs, BlueField DPUs) via PCIe. Two kernel-level approaches:

| Approach | Requirements | Helm Config |
|---|---|---|
| DMA-BUF (recommended) | Open kernel modules, Linux 5.12+, CUDA 11.7+ | Default — no extra flags needed |
| Legacy nvidia-peermem | Any kernel module type, MLNX_OFED required | --set driver.rdma.enabled=true |

# DMA-BUF with host-installed MOFED
helm install gpu-operator nvidia/gpu-operator \
  --set driver.rdma.useHostMofed=true

# Legacy nvidia-peermem
helm install gpu-operator nvidia/gpu-operator \
  --set driver.rdma.enabled=true

Use with NVIDIA Network Operator for full RDMA lifecycle management. For GDS and GDRCopy configuration, see references/advanced-features.md.
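
With the Network Operator in place, a workload typically requests both a GPU and an RDMA device. A sketch; the RDMA resource name (rdma/rdma_shared_device_a) and image are assumptions that depend on your NicClusterPolicy:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-training
spec:
  containers:
  - name: trainer
    image: my-training:latest  # placeholder image
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]  # commonly required for RDMA memory registration
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_a: 1  # resource name depends on Network Operator config
```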

Container Toolkit and CDI

The NVIDIA Container Toolkit ensures containers can access GPUs. CDI (Container Device Interface) is the modern injection method (default in v25.10+).

| Helm Parameter | Default | Purpose |
|---|---|---|
| toolkit.enabled | true | Deploy Container Toolkit DaemonSet |
| cdi.enabled | true | Use CDI for device injection |
| cdi.default | false | Make CDI the default runtime mode |

Validator

The GPU Operator Validator runs post-install checks to confirm all components are functional:

kubectl get pods -n gpu-operator -l app=nvidia-operator-validator
kubectl logs -n gpu-operator -l app=nvidia-operator-validator

| Helm Parameter | Default | Purpose |
|---|---|---|
| validator.driver.env | [] | Env vars for driver validation (e.g., DISABLE_DEV_CHAR_SYMLINK_CREATION) |

Diagnostics

Run the bundled diagnostic script for a comprehensive GPU Operator health check:

bash scripts/gpu-operator-diag.sh [namespace]

Checks: operator pod health, ClusterPolicy state, GPU node detection, driver/device-plugin/DCGM/GFD pod status, GPU workload inventory, scheduling failures, and time-slicing configuration.

Troubleshooting

See references/troubleshooting.md for:

  • Driver pod CrashLoopBackOff diagnosis
  • Device plugin not advertising GPUs
  • MIG configuration stuck in pending state
  • DCGM Exporter missing metrics
  • Node labeling issues after upgrades
  • GPU operator upgrade procedures

Dynamic Resource Allocation (DRA)

DRA replaces the traditional device plugin model with flexible, claim-based GPU allocation. Instead of requesting nvidia.com/gpu: 1, workloads create ResourceClaim objects that reference DeviceClass resources with CEL-based attribute selectors.

Requirements: Kubernetes ≥ v1.34.2, GPU Operator v25.10.0+, NVIDIA Driver ≥ 580.

Install with DRA

DRA uses a separate Helm chart (nvidia-dra-driver-gpu) alongside the GPU Operator. The legacy device plugin must be disabled:

# Label nodes for DRA
kubectl label node <node> nvidia.com/dra-kubelet-plugin=true

# Install GPU Operator with device plugin disabled
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --version=v25.10.1 \
  --namespace gpu-operator --create-namespace \
  --set devicePlugin.enabled=false \
  --set driver.manager.env[0].name=NODE_LABEL_FOR_GPU_POD_EVICTION \
  --set driver.manager.env[0].value="nvidia.com/dra-kubelet-plugin"

# Install DRA driver
helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --version="25.12.0" \
  --namespace nvidia-dra-driver-gpu --create-namespace \
  --set nvidiaDriverRoot=/run/nvidia/driver \
  --set gpuResourcesEnabledOverride=true \
  -f dra-values.yaml

DRA driver dra-values.yaml:

image:
  pullPolicy: IfNotPresent
kubeletPlugin:
  nodeSelector:
    nvidia.com/dra-kubelet-plugin: "true"

Verify

kubectl get pods -n nvidia-dra-driver-gpu
kubectl get deviceclass
# Expected: gpu.nvidia.com, mig.nvidia.com

Workload Example

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gpu-claim
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: inference
spec:
  containers:
  - name: model
    image: my-inference:latest
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: gpu-claim
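
Claims can also filter devices with CEL expressions over vendor attributes. A sketch only: the attribute name (productName) comes from the NVIDIA DRA driver's published attributes, and the field layout mirrors the example above, so verify both against references/dra.md for your API version:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: h100-claim
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
        selectors:
        - cel:
            expression: device.attributes['gpu.nvidia.com'].productName.contains('H100')
```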

For CEL-based GPU filtering, MIG with DRA, ComputeDomains (Multi-Node NVLink), prioritized device lists, health monitoring, and migration guide from device plugin to DRA, see references/dra.md.

Cross-References

  • kueue — GPU quota management and job queueing
  • deepspeed — Distributed training with ZeRO optimization
  • promql — Query DCGM metrics in Prometheus
  • nccl — NCCL communication layer managed alongside GPU drivers
  • vllm — GPU inference serving requiring GPU Operator
  • sglang — Alternative inference engine requiring GPU Operator
  • leaderworkerset — Multi-node GPU workloads with device plugin
  • fsdp — Distributed training requiring GPU access
  • megatron-lm — Large-scale training requiring multi-GPU setup
  • karpenter — Node provisioning for GPU instances
  • volcano — Batch scheduling for GPU training jobs
  • prometheus-grafana — Scrape DCGM Exporter GPU metrics

References

  • dra.md — Dynamic Resource Allocation: CEL selectors, MIG with DRA, ComputeDomains, health checks, migration from device plugin
  • advanced-features.md — KubeVirt GPU passthrough, Kata Containers, Confidential Computing, NVIDIADriver CRD, GDS, and GDRCopy
  • security-hardening.md — Pod Security Standards, security contexts, /dev/shm, NCCL NetworkPolicies, GPU device plugin security
  • troubleshooting.md — Operator pod failures, driver issues, and GPU detection problems