Agent Skills

@tylertitsworth/kuberay

Author: tylertitsworth · 0 forks · Updated 4/1/2026
View on GitHub

KubeRay operator — RayCluster, RayJob, RayService, GPU scheduling, autoscaling, auth tokens, Label Selector API, GCS fault tolerance, TLS, observability, and Kueue/Volcano integration. Use when deploying Ray on Kubernetes. NOT for Ray Core programming (see ray-core).

Installation

$ npx agent-skills-cli install @tylertitsworth/kuberay

Supported assistants: Claude Code, Cursor, Copilot, Codex, Antigravity

Details

Path: kuberay/SKILL.md
Branch: main
Scoped Name: @tylertitsworth/kuberay

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

npx agent-skills-cli list

Skill Instructions


name: kuberay
description: "KubeRay operator — RayCluster, RayJob, RayService, GPU scheduling, autoscaling, auth tokens, Label Selector API, GCS fault tolerance, TLS, observability, and Kueue/Volcano integration. Use when deploying Ray on Kubernetes. NOT for Ray Core programming (see ray-core)."

KubeRay

Kubernetes operator for Ray. Provides CRDs for running distributed Ray workloads natively on K8s.

Docs: https://docs.ray.io/en/latest/cluster/kubernetes/index.html
GitHub: https://github.com/ray-project/kuberay
Operator: v1.5.1 | Ray: 2.54.0 | API: ray.io/v1

CRDs

| CRD | Purpose | Lifecycle |
| --- | --- | --- |
| RayCluster | Long-running Ray cluster (head + worker groups) | Manual or autoscaled |
| RayJob | One-shot: creates cluster, submits job, optionally cleans up | Ephemeral |
| RayService | Ray Serve with zero-downtime upgrades | Long-running serving |

When to use which:

  • RayJob for batch/training — new cluster per job, auto-cleanup, cost-efficient
  • RayCluster for interactive/dev — persistent, no startup latency per job
  • RayService for model serving — managed upgrades, HA, traffic routing

Installation

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator --version 1.5.1 \
  --namespace kuberay-system --create-namespace

Key Helm Values

image:
  repository: quay.io/kuberay/operator
  tag: v1.5.1

# Namespace scoping
watchNamespace: []                    # empty = all namespaces
singleNamespaceInstall: false         # true = Role instead of ClusterRole

# RBAC
rbacEnable: true
crNamespacedRbacEnable: true          # false for GitOps tools like ArgoCD

# Feature gates
featureGates:
- name: RayClusterStatusConditions
  enabled: true

# Operator tuning
reconcileConcurrency: 1               # increase for many CRs
batchScheduler: ""                     # "volcano" or "yunikorn"

# Leader election (for HA)
leaderElection:
  enabled: true

Verify: kubectl get pods -n kuberay-system

RayCluster

GPU Cluster Example

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: gpu-cluster
spec:
  rayVersion: "2.54.0"
  enableInTreeAutoscaling: true
  autoscalerOptions:
    upscalingMode: Default             # Default | Conservative | Aggressive
    idleTimeoutSeconds: 60
  headGroupSpec:
    serviceType: ClusterIP
    rayStartParams:
      dashboard-host: "0.0.0.0"
      num-cpus: "0"                    # prevent workloads on head
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray-ml:2.54.0-gpu
          resources:
            limits: { cpu: "4", memory: 16Gi }
            requests: { cpu: "4", memory: 16Gi }
          env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: void                # head doesn't need GPU
  workerGroupSpecs:
  - groupName: gpu-a100
    replicas: 2
    minReplicas: 0
    maxReplicas: 8
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray-ml:2.54.0-gpu
          resources:
            limits: { cpu: "8", memory: 64Gi, nvidia.com/gpu: "1" }
            requests: { cpu: "8", memory: 64Gi, nvidia.com/gpu: "1" }
        tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
        nodeSelector:
          nvidia.com/gpu.product: A100

Add multiple workerGroupSpecs entries for heterogeneous hardware (GPU types, spot vs on-demand, CPU-only groups).
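For example, a hypothetical CPU spot-instance group added alongside the A100 group above. The node label and taint keys below are illustrative placeholders; substitute whatever your node pools actually use:

```yaml
workerGroupSpecs:
- groupName: gpu-a100              # the GPU group from the example above
  # ...
- groupName: cpu-spot              # hypothetical CPU-only spot group
  replicas: 0
  minReplicas: 0
  maxReplicas: 16
  rayStartParams: {}
  template:
    spec:
      containers:
      - name: ray-worker
        image: rayproject/ray-ml:2.54.0-gpu   # same image + version as head
        resources:
          limits: { cpu: "16", memory: 64Gi }
          requests: { cpu: "16", memory: 64Gi }
      nodeSelector:
        node.kubernetes.io/instance-type: m5.4xlarge   # assumed label
      tolerations:
      - key: spot                  # assumed taint on the spot node pool
        operator: Exists
        effect: NoSchedule
```

The autoscaler treats each group independently, so idle spot workers scale to zero while the GPU group keeps its minimum.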

Label Selector API (v1.5+, Preferred Over rayStartParams)

KubeRay v1.5 introduces a top-level resources and labels API for each head/worker group — replacing the previous practice of embedding labels and resources in rayStartParams strings. These top-level values are mirrored into pod labels, enabling combined Ray + K8s label selector queries, and are consumed by the Ray autoscaler for improved decisions.

headGroupSpec:
  rayStartParams: {}               # no longer need label/resource hacks here
  resources:
    custom_accelerator: "4"        # Ray logical resource (replaces JSON string in rayStartParams)
  labels:
    ray.io/zone: us-west-2a        # also mirrors into pod labels
    ray.io/region: us-west-2
workerGroupSpecs:
- groupName: gpu-workers
  rayStartParams: {}
  resources:
    custom_accelerator: "4"
  labels:
    ray.io/zone: us-west-2b
  template:
    # ...

Before (v1.4 style — still works, but deprecated):

rayStartParams:
  resources: '"{\"custom_accelerator\": 4}"'

Configuration Best Practices

Pod sizing:

  • Size each Ray pod to fill one K8s node (fewer large pods > many small)
  • Set memory and GPU requests = limits (KubeRay ignores memory/GPU requests, uses limits)
  • CPU: set requests only (no limits) to avoid throttling; KubeRay uses requests if limits absent
  • KubeRay rounds CPU to nearest integer for Ray resource accounting

Head pod:

  • Set num-cpus: "0" to prevent workloads on head
  • Set dashboard-host: "0.0.0.0" to expose dashboard
  • Set NVIDIA_VISIBLE_DEVICES: void if head is on GPU node but shouldn't use GPUs

Worker groups:

  • All rayStartParams values must be strings
  • Use same Ray image + version across head and all workers (same Python version too)
  • Multiple worker groups for heterogeneous hardware (GPU types, spot vs on-demand)
  • Use nodeSelector and tolerations to target specific node pools

Custom Ray resources:

rayStartParams:
  resources: '"{\"TPU\": 4, \"custom_resource\": 1}"'  # JSON string of custom resources
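The double-escaped value is error-prone to write by hand. One way to generate it is to JSON-encode the resource map twice (a sketch; wrap the printed value in single quotes in your YAML):

```python
import json

# rayStartParams values must be strings, and "resources" is a JSON object
# that must itself arrive quoted -- encoding twice produces the escaped form.
custom = {"TPU": 4, "custom_resource": 1}
inner = json.dumps(custom)    # '{"TPU": 4, "custom_resource": 1}'
quoted = json.dumps(inner)    # adds the outer quotes and backslash escapes
print(quoted)
```
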

Head Service

KubeRay auto-creates <cluster>-head-svc with ports:

| Port | Name | Purpose |
| --- | --- | --- |
| 6379 | gcs | Global Control Store |
| 8265 | dashboard | Ray Dashboard + Jobs API |
| 10001 | client | Ray client connections |
| 8000 | serve | Ray Serve HTTP endpoint |

Override serviceType in headGroupSpec: ClusterIP (default), NodePort, LoadBalancer.

RayJob

RayJob manages two things: a RayCluster and a submitter that calls ray job submit to run your code on that cluster. The submitter is NOT your workload — it's a lightweight pod that submits and monitors.

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: training-job
spec:
  entrypoint: python /home/ray/train.py --epochs 10
  runtimeEnvYAML: |
    pip:
      - torch==2.5.0
      - transformers
    env_vars:
      WANDB_API_KEY: "secret"
    working_dir: "https://github.com/org/repo/archive/main.zip"
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 300
  activeDeadlineSeconds: 7200          # max total runtime
  backoffLimit: 0                      # ⚠️ each retry = NEW full cluster (see warning below)
  submissionMode: K8sJobMode           # see submission modes below
  suspend: false                       # true for Kueue integration
  rayClusterSpec:
    rayVersion: "2.54.0"
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray-ml:2.54.0-gpu
            resources:
              limits:
                cpu: "4"
                memory: 16Gi
    workerGroupSpecs:
    - groupName: gpu-workers
      replicas: 4
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray-ml:2.54.0-gpu
            resources:
              limits:
                cpu: "8"
                memory: 64Gi
                nvidia.com/gpu: "1"

⚠️ backoffLimit Creates Full New Clusters Per Retry

Cost trap: spec.backoffLimit on a RayJob creates a completely new RayCluster per retry — not a cheap pod restart. backoffLimit: 3 means up to 4× full GPU clusters provisioned sequentially: 4× node spin-up, 4× image pulls, 4× Kueue quota consumed. On 8×A100 nodes, that's 32 GPU-hours wasted on retries alone.

Three different retry mechanisms exist — don't confuse them:

| Field | Scope | What happens on retry | Cost |
| --- | --- | --- | --- |
| spec.backoffLimit | Entire RayJob | Deletes cluster, creates a brand new one | Full cluster cost per retry |
| submitterConfig.backoffLimit | Submitter pod only | Restarts the lightweight ray job submit pod | Near zero |
| ray.train.FailureConfig(max_failures=N) | Ray Train workers | Restarts failed workers on the same cluster | Near zero |

Production recommendation:

spec:
  backoffLimit: 0              # never recreate the entire cluster on failure
  submitterConfig:
    backoffLimit: 3            # retry submission if dashboard temporarily unreachable

Use FailureConfig(max_failures=N) in your Ray Train script for worker-level recovery with checkpoint restore — this retries on the same cluster without reprovisioning:

from ray.train import RunConfig, FailureConfig

run_config = RunConfig(
    failure_config=FailureConfig(max_failures=3),  # retry workers, keep cluster
)

Set backoffLimit: 1 only if you experience transient node provisioning failures and want one automatic retry of the full cluster.

Submission Modes

| Mode | How It Works | When to Use |
| --- | --- | --- |
| K8sJobMode (default) | Creates a K8s Job pod that runs ray job submit | Most reliable. Works with Kueue. |
| HTTPMode | Operator sends an HTTP POST to the Ray Dashboard directly | No extra pod. Dashboard must be reachable from the operator. |
| SidecarMode | Injects the submitter container into the head pod | No extra pod. Cannot use clusterSelector. Head pod restartPolicy must be Never. |
| InteractiveMode (alpha) | Waits for the user to submit via the kubectl ray plugin | Jupyter/notebook workflows. |

In K8sJobMode, the submitter pod gets two injected env vars: RAY_DASHBOARD_ADDRESS and RAY_JOB_SUBMISSION_ID.
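As a runnable sketch of how those variables are consumed (echo is used so no cluster is needed; the default values and entrypoint path are illustrative placeholders), the submitter effectively assembles a command like:

```shell
# Fall back to placeholder values when not running inside the submitter pod.
RAY_DASHBOARD_ADDRESS="${RAY_DASHBOARD_ADDRESS:-training-job-raycluster-head-svc:8265}"
RAY_JOB_SUBMISSION_ID="${RAY_JOB_SUBMISSION_ID:-training-job-xxxxx}"

# Print (rather than run) the effective ray job submit invocation.
echo ray job submit \
  --address "http://${RAY_DASHBOARD_ADDRESS}" \
  --submission-id "${RAY_JOB_SUBMISSION_ID}" \
  -- python /home/ray/train.py --epochs 10
```

Pinning the submission ID is what lets the submitter retry idempotently: resubmitting the same ID does not start a duplicate job.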

Key Fields

| Field | Purpose |
| --- | --- |
| entrypoint | Command passed to ray job submit |
| runtimeEnvYAML | pip packages, env vars, working_dir, py_modules |
| shutdownAfterJobFinishes | Delete RayCluster on completion (simple boolean — prefer deletionPolicy for fine-grained control) |
| ttlSecondsAfterFinished | Delay before cleanup (applies when shutdownAfterJobFinishes: true) |
| deletionPolicy | Advanced deletion control (v1.5+, see below) |
| activeDeadlineSeconds | Max runtime before DeadlineExceeded failure |
| backoffLimit | Full retries (each = new cluster). Different from submitterConfig.backoffLimit (submitter pod retries). |
| submissionMode | See table above |
| suspend | Set true for Kueue (Kueue controls unsuspension) |
| clusterSelector | Use existing RayCluster instead of creating one |
| entrypointNumCpus/Gpus | Reserve head resources for the driver script |

Advanced Deletion Policies (v1.5+)

Replaces the shutdownAfterJobFinishes boolean with per-status, per-action TTLs (DeleteCluster, DeleteWorkers, DeleteSelf). For the full spec and action table, see references/kuberay-v1.5.md.

For full RayJob details (lifecycle, deletion strategies, submitter customization, troubleshooting), see references/rayjob.md.

Using Existing Clusters

Skip cluster creation — submit to a running RayCluster:

spec:
  clusterSelector:
    ray.io/cluster: my-existing-cluster
  # Do NOT include rayClusterSpec

Autoscaling

Three levels of autoscaling work together:

  1. Ray Serve auto-scales replicas (actors) based on request load
  2. Ray Autoscaler scales Ray worker pods based on logical resource demand
  3. K8s Cluster Autoscaler provisions new nodes for pending pods

Configuration

spec:
  enableInTreeAutoscaling: true
  autoscalerOptions:
    upscalingMode: Default           # Default | Aggressive | Conservative
    idleTimeoutSeconds: 60           # seconds before removing idle workers
    resources:
      limits:
        cpu: "500m"
        memory: 512Mi

| Mode | Behavior |
| --- | --- |
| Default | Scale up to meet demand, conservative bin-packing |
| Aggressive | Scale up faster, less bin-packing |
| Conservative | Scale up more slowly |

Key behavior: The autoscaler monitors logical resource demands (from @ray.remote decorators and placement groups), not physical utilization. If a task requests more resources than any single worker type can provide, the autoscaler cannot satisfy it and will not scale up.

Autoscaler V2 (alpha, Ray ≥ 2.10): Improved observability and stability. Enable via KubeRay feature gate.

GCS Fault Tolerance

Without GCS FT, head pod failure kills the entire cluster. Enable with external Redis:

spec:
  headGroupSpec:
    rayStartParams:
      redis-password: "${REDIS_PASSWORD}"
    template:
      metadata:
        annotations:
          ray.io/ft-enabled: "true"
      spec:
        containers:
        - name: ray-head
          env:
          - name: RAY_REDIS_ADDRESS
            value: "redis:6379"
          - name: RAY_gcs_rpc_server_reconnect_timeout_s
            value: "120"             # worker reconnect timeout (default 60s)

With GCS FT: workers continue serving during head recovery, cluster state persists in Redis.
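A minimal in-cluster Redis to back GCS FT might look like the following sketch. The Secret name is an assumption for illustration, and production deployments should use a hardened or HA Redis; the Service name matches the RAY_REDIS_ADDRESS above:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 1
  selector:
    matchLabels: { app: redis }
  template:
    metadata:
      labels: { app: redis }
    spec:
      containers:
      - name: redis
        image: redis:7
        args: ["--requirepass", "$(REDIS_PASSWORD)"]   # k8s expands $(VAR) from env
        env:
        - name: REDIS_PASSWORD
          valueFrom:
            secretKeyRef: { name: redis-password-secret, key: password }  # assumed Secret
        ports:
        - containerPort: 6379
---
apiVersion: v1
kind: Service
metadata:
  name: redis          # resolves as RAY_REDIS_ADDRESS "redis:6379" in-namespace
spec:
  selector: { app: redis }
  ports:
  - port: 6379
```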

Authentication (v1.5+)

For token auth (v1.5.1+, Ray ≥ 2.52.0) and TLS gRPC encryption, see references/kuberay-v1.5.md.

Kueue Integration

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: queued-training
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  suspend: true    # Kueue controls unsuspension
  # ... rest of spec

Also works with RayCluster (set spec.suspend: true).
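For context, a minimal Kueue queue pair that the queue-name label above could reference might look like this sketch (quota numbers and the default-flavor ResourceFlavor name are assumptions for illustration):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-queue
spec:
  namespaceSelector: {}            # admit workloads from all namespaces
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor         # assumes a ResourceFlavor with this name exists
      resources:
      - name: cpu
        nominalQuota: 64
      - name: memory
        nominalQuota: 512Gi
      - name: nvidia.com/gpu
        nominalQuota: 8
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: user-queue                 # matches the kueue.x-k8s.io/queue-name label
  namespace: default
spec:
  clusterQueue: gpu-queue
```

Kueue admits the suspended RayJob only when the whole cluster's resources fit within the remaining quota, which gives gang-style admission for head plus workers.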

Observability

Ray Dashboard

export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl port-forward $HEAD_POD 8265:8265
# Open http://localhost:8265

Status Conditions (feature gate: RayClusterStatusConditions)

| Condition | True When |
| --- | --- |
| RayClusterProvisioned | All pods reached ready at least once |
| HeadPodReady | Head pod is currently ready |
| RayClusterReplicaFailure | Reconciliation error (failed pod create/delete) |

RayService conditions: Ready (serving traffic), UpgradeInProgress (pending cluster exists).

Prometheus Metrics

Head pod exposes metrics on port 8080. Configure ServiceMonitor to scrape.
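A ServiceMonitor for the head might look like the following sketch. It assumes the Prometheus Operator CRDs are installed and that the head Service exposes a port named metrics on 8080; the namespace and label values are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ray-head-monitor
  namespace: prometheus-system     # assumed monitoring namespace
spec:
  namespaceSelector:
    matchNames: [default]          # where the RayCluster runs
  selector:
    matchLabels:
      ray.io/node-type: head       # label on the head Service (verify on your cluster)
  endpoints:
  - port: metrics                  # assumes a named "metrics" port on 8080
```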

kubectl exec -it $HEAD_POD -- ray status       # cluster resources
kubectl exec -it $HEAD_POD -- ray list actors   # actor states
kubectl exec -it $HEAD_POD -- ray summary actors

Kubernetes Events

kubectl describe raycluster <name>   # events: CreatedService, CreatedHeadPod, CreatedWorkerPod
kubectl describe rayjob <name>       # events: job submission, completion, failure

kubectl ray Plugin

The KubeRay kubectl plugin (beta, v1.3.0+) simplifies cluster creation, log collection, sessions, and job submission without YAML. For installation, commands, and a comparison with raw kubectl, see references/kubectl-ray-plugin.md.

Key kubectl Commands

# List all Ray resources
kubectl get rayclusters,rayjobs,rayservices -A

# Ray pods
kubectl get pods -l ray.io/is-ray-node=yes
kubectl get pods -l ray.io/node-type=head
kubectl get pods -l ray.io/node-type=worker

# Head pod logs
kubectl logs $HEAD_POD -c ray-head

# Autoscaler logs (sidecar)
kubectl logs $HEAD_POD -c autoscaler

# Worker init container (if stuck)
kubectl logs <worker-pod> -c wait-gcs-ready

# Operator logs
kubectl logs -n kuberay-system deploy/kuberay-operator

# Ray internal logs
kubectl exec -it $HEAD_POD -- ls /tmp/ray/session_latest/logs/
kubectl exec -it $HEAD_POD -- cat /tmp/ray/session_latest/logs/gcs_server.out

RayService

For Ray Serve deployments with zero-downtime upgrades, see references/rayservice.md.

Incremental Upgrade Strategy (v1.5+)

Replaces blue-green (100% resource surge) with rolling traffic migration using maxSurgePercent/stepSizePercent/intervalSeconds. For the full YAML spec and resource comparison, see references/kuberay-v1.5.md.

Disaggregated Prefill-Decode via RayService

For deploying vLLM with disaggregated prefill-decode using build_pd_openai_app in a RayService, see the Disaggregated PD section in references/rayservice.md.

Troubleshooting

For debugging common KubeRay issues, see references/troubleshooting.md.

References

  • kubectl-ray-plugin.md — kubectl ray plugin for cluster management shortcuts
  • kuberay-v1.5.md — v1.5 features: token auth, TLS, deletion policies, incremental upgrade
  • rayjob.md — RayJob lifecycle, submission modes, and batch workload patterns
  • rayservice.md — RayService for serving Ray Serve apps with zero-downtime upgrades
  • security.md — Dashboard authentication, NetworkPolicies, GCS port security, RBAC scoping
  • troubleshooting.md — Common KubeRay issues: pod scheduling, GCS failures, and autoscaler problems

Cross-References

  • kueue — Queue and gang-schedule Ray workloads
  • ray-core — Ray programming model
  • ray-train — Distributed training on Ray clusters
  • ray-serve — Model serving on Ray clusters
  • ray-data — Data processing on Ray clusters
  • gpu-operator — GPU driver and device plugin for Ray GPU workers
  • volcano — Alternative scheduler for Ray workloads
  • prometheus-grafana — Scrape Ray cluster Prometheus metrics
  • nccl — NCCL tuning for Ray Train multi-node GPU communication
  • flyte-kuberay — Run Flyte tasks on KubeRay clusters