gpu-operator

@tylertitsworth/gpu-operator, by tylertitsworth. Updated 4/1/2026.

NVIDIA GPU Operator on Kubernetes — ClusterPolicy, DRA (Dynamic Resource Allocation), time-slicing, MIG, DCGM metrics, driver management, GPUDirect RDMA/GDS, CDI, and GPU scheduling. Use when installing, configuring, or troubleshooting GPU infrastructure on K8s.

Installation

npx agent-skills-cli install @tylertitsworth/gpu-operator

Supported assistants: Claude Code, Cursor, Copilot, Codex, Antigravity.

Details

Path: gpu-operator/SKILL.md
Branch: main
Scoped Name: @tylertitsworth/gpu-operator

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

npx agent-skills-cli list

Skill Instructions


---
name: gpu-operator
description: "NVIDIA GPU Operator on Kubernetes — ClusterPolicy, DRA (Dynamic Resource Allocation), time-slicing, MIG, DCGM metrics, driver management, GPUDirect RDMA/GDS, CDI, and GPU scheduling. Use when installing, configuring, or troubleshooting GPU infrastructure on K8s."
---

NVIDIA GPU Operator

Automates management of NVIDIA software components on Kubernetes — drivers, device plugin, container toolkit, DCGM, and GPU feature discovery. Version: v25.10.x

Docs: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/
GitHub: https://github.com/NVIDIA/gpu-operator

ClusterPolicy Configuration

The GPU Operator is configured through a ClusterPolicy CRD. Key Helm values map directly to ClusterPolicy spec fields.

Essential Helm Values

helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --version=v25.10.1 \
  --set driver.enabled=true \
  --set driver.kernelModuleType=auto \
  --set dcgmExporter.enabled=true \
  --set cdi.enabled=true \
  --set mig.strategy=single

| Parameter | Default | Purpose |
|---|---|---|
| driver.enabled | true | Deploy driver containers; set false if drivers pre-installed |
| driver.kernelModuleType | auto | auto, open, or proprietary — auto picks based on GPU + driver branch |
| driver.rdma.enabled | false | Build/load nvidia-peermem for GPUDirect RDMA |
| driver.rdma.useHostMofed | false | Set true if MLNX_OFED pre-installed on host |
| cdi.enabled | true | Use Container Device Interface for GPU injection (v25.10+) |
| dcgmExporter.enabled | true | Deploy DCGM Exporter for Prometheus GPU metrics |
| devicePlugin.config | {} | ConfigMap name for time-slicing or MPS config |
| mig.strategy | none | none, single, or mixed |
| nfd.enabled | true | Deploy Node Feature Discovery; disable if already running |
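
For GitOps-style installs, the same settings can be kept in a values file instead of repeated --set flags; a sketch mirroring the install command above:

```yaml
# values.yaml, mirroring the --set flags shown above
driver:
  enabled: true
  kernelModuleType: auto
dcgmExporter:
  enabled: true
cdi:
  enabled: true
mig:
  strategy: single
```

Apply with helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace --version=v25.10.1 -f values.yaml.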

Pre-installed Drivers

When using pre-installed drivers (e.g., AWS EKS optimized AMIs, or host-managed installs):

helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.enabled=false
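
Either way, a quick check that GPUs are schedulable is a one-shot pod that requests a GPU and runs nvidia-smi; the image tag below is illustrative, so pick one matching your CUDA and driver versions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.4.1-base-ubuntu22.04  # illustrative tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```

kubectl logs gpu-smoke-test should print the driver version and GPU model if the stack is healthy.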

Patching ClusterPolicy

Modify configuration post-install:

kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  --type='json' \
  -p='[{"op":"replace", "path":"/spec/mig/strategy", "value":"mixed"}]'

GPU Time-Slicing

Oversubscribe GPUs by advertising multiple replicas per physical GPU. No memory or fault isolation (unlike MIG).

Configuration

Create a ConfigMap in the GPU Operator namespace:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4

| Field | Purpose |
|---|---|
| replicas | Number of virtual GPUs per physical GPU |
| renameByDefault | If true, advertise as nvidia.com/gpu.shared instead of nvidia.com/gpu |
| failRequestsGreaterThanOne | If true, reject pods requesting >1 GPU (prevents accidental full-GPU requests) |
| migStrategy | Set to none for time-slicing only; mixed to combine with MIG |

Enable Time-Slicing

Reference the ConfigMap in the ClusterPolicy:

kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  --type merge \
  -p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"any"}}}}'

Apply to specific nodes via label:

kubectl label node <node> nvidia.com/device-plugin.config=any

Node Labels After Configuration

nvidia.com/gpu.replicas = 4
nvidia.com/gpu.product = A100-SXM4-80GB-SHARED

Use the -SHARED suffix in node selectors to schedule onto time-sliced GPUs specifically.
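
For example, a notebook pod can pin itself to time-sliced GPUs via the -SHARED product label; the product name and image below are illustrative and will vary with your hardware:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: notebook
spec:
  nodeSelector:
    nvidia.com/gpu.product: A100-SXM4-80GB-SHARED  # matches time-sliced nodes only
  containers:
  - name: jupyter
    image: jupyter/base-notebook  # illustrative image
    resources:
      limits:
        nvidia.com/gpu: 1  # one replica of a shared GPU, not a full GPU
```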

Best Practices

  • Dev/test: 4-8 replicas — acceptable for notebooks, small experiments
  • Training: Avoid time-slicing — no memory isolation means OOM from neighbors
  • Inference: 2-4 replicas for small models with predictable memory usage
  • Set failRequestsGreaterThanOne: true to prevent accidental exclusive access requests

Multi-Instance GPU (MIG)

Hardware-level GPU partitioning with memory and fault isolation. Supported on A100, A30, H100, H200.

MIG Strategies

| Strategy | Use Case | Behavior |
|---|---|---|
| single | All GPUs get same MIG profile | One resource type advertised (e.g., nvidia.com/gpu) |
| mixed | Different profiles per GPU | Multiple resource types (e.g., nvidia.com/mig-1g.10gb) |

Configure MIG Profiles

# Set strategy
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  --type='json' \
  -p='[{"op":"replace", "path":"/spec/mig/strategy", "value":"single"}]'

# Label node with desired profile
kubectl label nodes <node> nvidia.com/mig.config=all-1g.10gb --overwrite

Common MIG Profiles (H100 80GB)

| Profile | Instances | GPU Memory Each | Compute (SMs) |
|---|---|---|---|
| all-1g.10gb | 7 | 10GB | 1/7 |
| all-2g.20gb | 3 | 20GB | 2/7 |
| all-3g.40gb | 2 | 40GB | 3/7 |
| all-7g.80gb | 1 | 80GB | 7/7 |

Mixed MIG Example

For heterogeneous profiles on a single GPU:

kubectl label nodes <node> \
  nvidia.com/mig.config=1g.10gb-2,3g.40gb-1 --overwrite
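
Under the mixed strategy, pods request a specific partition by its resource name; a sketch (the image is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: small-inference
spec:
  containers:
  - name: model
    image: my-inference:latest  # placeholder image
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1  # one 1g.10gb MIG slice
```

Under the single strategy, slices are instead requested as plain nvidia.com/gpu.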

MIG + Time-Slicing

Combine MIG partitions with time-slicing for maximum sharing:

sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/mig-1g.10gb
      replicas: 2

MIG Best Practices

  • Reconfiguring MIG requires stopping all GPU workloads on the node — cordon first
  • Use single strategy when all workloads need the same partition size
  • Use mixed when running diverse workloads (e.g., inference + training on same node)
  • Monitor nvidia.com/mig.config.state label — should be success after configuration

DCGM Monitoring

DCGM Exporter provides Prometheus metrics for GPU health and utilization.

Key Metrics

| Metric | Description |
|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU compute utilization % |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory bandwidth utilization % |
| DCGM_FI_DEV_FB_USED | Framebuffer memory used (MB) |
| DCGM_FI_DEV_FB_FREE | Framebuffer memory free (MB) |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (°C) |
| DCGM_FI_DEV_POWER_USAGE | Power draw (W) |
| DCGM_FI_DEV_SM_CLOCK | SM clock frequency (MHz) |
| DCGM_FI_DEV_XID_ERRORS | XID error count (hardware errors) |
| DCGM_FI_DEV_PCIE_REPLAY_COUNTER | PCIe replay errors (connectivity issues) |

Custom Metrics ConfigMap

Override default metrics collection:

apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-metrics
  namespace: gpu-operator
data:
  default-counters.csv: |
    DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization
    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer used
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer free
    DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature
    DCGM_FI_DEV_POWER_USAGE, gauge, Power usage
    DCGM_FI_DEV_XID_ERRORS, gauge, XID errors
    DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Graphics engine active ratio
    DCGM_FI_PROF_DRAM_ACTIVE, gauge, DRAM active ratio
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Tensor core active ratio

Reference in Helm:

helm install gpu-operator nvidia/gpu-operator \
  --set dcgmExporter.config.name=dcgm-metrics

HPC Job Mapping (v25.10.1+)

DCGM Exporter can associate GPU metrics with HPC jobs scheduled by a workload manager (e.g., Slurm). Enable via ClusterPolicy:

kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  --type merge \
  -p '{"spec":{"dcgmExporter":{"hpcJobMapping":{"enabled":true,"directory":"/var/lib/dcgm-exporter/job-mapping"}}}}'

| Field | Default | Purpose |
|---|---|---|
| dcgmExporter.hpcJobMapping.enabled | false | Enable HPC job-to-GPU metric association |
| dcgmExporter.hpcJobMapping.directory | /var/lib/dcgm-exporter/job-mapping | Directory where the workload manager writes job mapping files |

Prometheus ServiceMonitor

DCGM Exporter creates a Service that Prometheus can scrape. For kube-prometheus-stack:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
  - port: gpu-metrics
    interval: 15s

Useful PromQL Queries

# Average GPU utilization per node
avg by (node) (DCGM_FI_DEV_GPU_UTIL)

# Memory pressure — nodes over 90% GPU memory
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.9

# XID errors (hardware issues)
rate(DCGM_FI_DEV_XID_ERRORS[5m]) > 0

# Tensor core utilization (training efficiency)
avg by (pod) (DCGM_FI_PROF_PIPE_TENSOR_ACTIVE)
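
The same expressions can back alerting; a sketch of a PrometheusRule for kube-prometheus-stack (the namespace and rule names are assumptions to adapt to your Prometheus setup):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: gpu-operator
spec:
  groups:
  - name: gpu.rules
    rules:
    - alert: GPUXidErrors
      expr: rate(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "GPU XID errors detected (possible hardware fault)"
```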

GPU Feature Discovery Labels

GFD automatically labels nodes. Useful for scheduling:

| Label | Example | Use For |
|---|---|---|
| nvidia.com/gpu.product | NVIDIA-H100-80GB-HBM3 | Target specific GPU models |
| nvidia.com/gpu.memory | 81920 | Schedule by GPU memory (MB) |
| nvidia.com/gpu.count | 8 | Node GPU count |
| nvidia.com/gpu.replicas | 4 | Time-slicing replica count |
| nvidia.com/mig.capable | true | MIG-capable GPUs |
| nvidia.com/cuda.driver.major | 570 | Driver version filtering |
| nvidia.com/gpu.compute.major | 9 | Compute capability (9.0 = H100) |

Node Selector Example

nodeSelector:
  nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
  nvidia.com/gpu.sharing-strategy: none  # exclusive access

GPUDirect RDMA

Enables direct data exchange between GPUs and network devices (ConnectX NICs, BlueField DPUs) via PCIe. Two kernel-level approaches:

| Approach | Requirements | Helm Config |
|---|---|---|
| DMA-BUF (recommended) | Open kernel modules, Linux 5.12+, CUDA 11.7+ | Default — no extra flags needed |
| Legacy nvidia-peermem | Any kernel module type, MLNX_OFED required | --set driver.rdma.enabled=true |

# DMA-BUF with host-installed MOFED
helm install gpu-operator nvidia/gpu-operator \
  --set driver.rdma.useHostMofed=true

# Legacy nvidia-peermem
helm install gpu-operator nvidia/gpu-operator \
  --set driver.rdma.enabled=true

Use with NVIDIA Network Operator for full RDMA lifecycle management. For GDS and GDRCopy configuration, see references/advanced-features.md.
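
With the Network Operator in place, a workload typically requests both a GPU and an RDMA device. A sketch; the RDMA resource name (rdma/rdma_shared_device_a) and image are assumptions that depend on your NicClusterPolicy:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-training
spec:
  containers:
  - name: trainer
    image: my-training:latest  # placeholder image
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]  # commonly required for RDMA memory registration
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_a: 1  # resource name depends on Network Operator config
```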

Container Toolkit and CDI

The NVIDIA Container Toolkit ensures containers can access GPUs. CDI (Container Device Interface) is the modern injection method (default in v25.10+).

| Helm Parameter | Default | Purpose |
|---|---|---|
| toolkit.enabled | true | Deploy Container Toolkit DaemonSet |
| cdi.enabled | true | Use CDI for device injection |
| cdi.default | false | Make CDI the default runtime mode |

Validator

The GPU Operator Validator runs post-install checks to confirm all components are functional:

kubectl get pods -n gpu-operator -l app=nvidia-operator-validator
kubectl logs -n gpu-operator -l app=nvidia-operator-validator

| Helm Parameter | Default | Purpose |
|---|---|---|
| validator.driver.env | [] | Env vars for driver validation (e.g., DISABLE_DEV_CHAR_SYMLINK_CREATION) |

Diagnostics

Run the bundled diagnostic script for a comprehensive GPU Operator health check:

bash scripts/gpu-operator-diag.sh [namespace]

Checks: operator pod health, ClusterPolicy state, GPU node detection, driver/device-plugin/DCGM/GFD pod status, GPU workload inventory, scheduling failures, and time-slicing configuration.

Troubleshooting

See references/troubleshooting.md for:

  • Driver pod CrashLoopBackOff diagnosis
  • Device plugin not advertising GPUs
  • MIG configuration stuck in pending state
  • DCGM Exporter missing metrics
  • Node labeling issues after upgrades
  • GPU operator upgrade procedures

Dynamic Resource Allocation (DRA)

DRA replaces the traditional device plugin model with flexible, claim-based GPU allocation. Instead of requesting nvidia.com/gpu: 1, workloads create ResourceClaim objects that reference DeviceClass resources with CEL-based attribute selectors.

Requirements: Kubernetes ≥ v1.34.2, GPU Operator v25.10.0+, NVIDIA Driver ≥ 580.

Install with DRA

DRA uses a separate Helm chart (nvidia-dra-driver-gpu) alongside the GPU Operator. The legacy device plugin must be disabled:

# Label nodes for DRA
kubectl label node <node> nvidia.com/dra-kubelet-plugin=true

# Install GPU Operator with device plugin disabled
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --version=v25.10.1 \
  --namespace gpu-operator --create-namespace \
  --set devicePlugin.enabled=false \
  --set driver.manager.env[0].name=NODE_LABEL_FOR_GPU_POD_EVICTION \
  --set driver.manager.env[0].value="nvidia.com/dra-kubelet-plugin"

# Install DRA driver
helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --version="25.12.0" \
  --namespace nvidia-dra-driver-gpu --create-namespace \
  --set nvidiaDriverRoot=/run/nvidia/driver \
  --set gpuResourcesEnabledOverride=true \
  -f dra-values.yaml

DRA driver dra-values.yaml:

image:
  pullPolicy: IfNotPresent
kubeletPlugin:
  nodeSelector:
    nvidia.com/dra-kubelet-plugin: "true"

Verify

kubectl get pods -n nvidia-dra-driver-gpu
kubectl get deviceclass
# Expected: gpu.nvidia.com, mig.nvidia.com

Workload Example

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gpu-claim
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: inference
spec:
  containers:
  - name: model
    image: my-inference:latest
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: gpu-claim
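
Claims can also filter devices with CEL expressions over vendor attributes. A sketch only: the attribute name (productName) comes from the NVIDIA DRA driver's published attributes, and the field layout mirrors the example above, so verify both against references/dra.md for your API version:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: h100-claim
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
        selectors:
        - cel:
            expression: device.attributes['gpu.nvidia.com'].productName.contains('H100')
```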

For CEL-based GPU filtering, MIG with DRA, ComputeDomains (Multi-Node NVLink), prioritized device lists, health monitoring, and migration guide from device plugin to DRA, see references/dra.md.

Cross-References

  • kueue — GPU quota management and job queueing
  • deepspeed — Distributed training with ZeRO optimization
  • promql — Query DCGM metrics in Prometheus
  • nccl — NCCL communication layer managed alongside GPU drivers
  • vllm — GPU inference serving requiring GPU Operator
  • sglang — Alternative inference engine requiring GPU Operator
  • leaderworkerset — Multi-node GPU workloads with device plugin
  • fsdp — Distributed training requiring GPU access
  • megatron-lm — Large-scale training requiring multi-GPU setup
  • karpenter — Node provisioning for GPU instances
  • volcano — Batch scheduling for GPU training jobs
  • prometheus-grafana — Scrape DCGM Exporter GPU metrics

References

  • dra.md — Dynamic Resource Allocation: CEL selectors, MIG with DRA, ComputeDomains, health checks, migration from device plugin
  • advanced-features.md — KubeVirt GPU passthrough, Kata Containers, Confidential Computing, NVIDIADriver CRD, GDS, and GDRCopy
  • security-hardening.md — Pod Security Standards, security contexts, /dev/shm, NCCL NetworkPolicies, GPU device plugin security
  • troubleshooting.md — Operator pod failures, driver issues, and GPU detection problems