Agent Skills

kubeflow-trainer (@tylertitsworth/kubeflow-trainer)
by tylertitsworth
Updated 4/7/2026

Kubeflow Trainer v2 - TrainJob CRD, Training Runtimes, Python SDK, JobSet, multi-framework support. Use when orchestrating distributed training on K8s. NOT for inference or Ray Train.

Installation

$ npx agent-skills-cli install @tylertitsworth/kubeflow-trainer

Details

Path: kubeflow-trainer/SKILL.md
Branch: main
Scoped Name: @tylertitsworth/kubeflow-trainer

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

npx agent-skills-cli list

Skill Instructions


name: kubeflow-trainer
description: "Kubeflow Trainer v2 - TrainJob CRD, Training Runtimes, Python SDK, JobSet, multi-framework support. Use when orchestrating distributed training on K8s. NOT for inference or Ray Train."

Kubeflow Trainer v2

Kubernetes-native distributed training operator. Replaces per-framework CRDs (PyTorchJob, MPIJob, etc.) with a unified TrainJob + Runtime architecture built on JobSet.

Docs: https://www.kubeflow.org/docs/components/trainer/
GitHub: https://github.com/kubeflow/trainer
Version: v2.1.x | API: trainer.kubeflow.org/v1alpha1 | Requires: Kubernetes >= 1.31

Architecture Overview

Trainer v2 separates concerns between two personas:

  • Platform Administrators create ClusterTrainingRuntime / TrainingRuntime: reusable blueprints defining infrastructure (images, scheduling, gang-scheduling, failure policies)
  • AI Practitioners create TrainJob: lightweight job specs referencing a runtime, specifying only training-specific settings (image, args, resources, number of nodes)

Under the hood, the controller reconciles a TrainJob into a JobSet, which creates indexed Jobs, which in turn create Pods. The controller injects runtime config, environment variables (torchrun, MPI), and scheduling metadata automatically.

CRD Hierarchy

CRD                    | Scope     | Purpose
ClusterTrainingRuntime | Cluster   | Shared runtime blueprints (torch-distributed, deepspeed-distributed, etc.)
TrainingRuntime        | Namespace | Team-specific runtime blueprints
TrainJob               | Namespace | Individual training job referencing a runtime

Installation

See references/installation-and-operations.md for full installation (Helm charts, kustomize manifests), CRD setup, runtime deployment, Helm values, and upgrade procedures.

Quick start:

export VERSION=v2.1.0
# Controller + CRDs
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=${VERSION}"
# Default runtimes
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=${VERSION}"

Or with Helm:

helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer \
  --namespace kubeflow-system --create-namespace \
  --version ${VERSION#v} \
  --set runtimes.defaultEnabled=true

TrainJob Spec

The TrainJob is intentionally minimal. AI practitioners specify only what differs from the runtime blueprint:

Field                          | Purpose
spec.runtimeRef                | Reference to ClusterTrainingRuntime or TrainingRuntime (apiGroup, kind, name)
spec.trainer.image             | Training container image (overrides runtime default)
spec.trainer.command           | Entrypoint command
spec.trainer.args              | Arguments passed to the training script
spec.trainer.numNodes          | Number of training nodes (overrides the runtime's mlPolicy.numNodes)
spec.trainer.resourcesPerNode  | Resource requests/limits per node (CPU, memory, GPU)
spec.trainer.env               | Additional environment variables
spec.initializer.dataset       | Dataset initializer config (HuggingFace, S3, etc.)
spec.initializer.model         | Model initializer config (HuggingFace, S3, etc.)
spec.labels / spec.annotations | Propagated to the generated JobSet and Pods
spec.managedBy                 | External controller field (e.g., Kueue's kueue.x-k8s.io/multikueue)
spec.suspend                   | Suspend the job (used by Kueue for admission control)

Runtime Reference

Every TrainJob must reference a runtime. The runtimeRef field specifies:

  • apiGroup: trainer.kubeflow.org (default)
  • kind: ClusterTrainingRuntime or TrainingRuntime
  • name: the runtime name (e.g., torch-distributed)

The controller merges TrainJob overrides on top of the runtime template to produce the final JobSet.
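
As a sketch, a minimal TrainJob overriding a few fields of the default torch-distributed runtime could look like the following (the job name, image, and resource sizes are illustrative placeholders, not values from the official docs):

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: torch-example              # hypothetical job name
  namespace: default
spec:
  runtimeRef:
    apiGroup: trainer.kubeflow.org
    kind: ClusterTrainingRuntime
    name: torch-distributed
  trainer:
    image: ghcr.io/example/train:latest   # placeholder training image
    numNodes: 2                           # overrides the runtime's mlPolicy.numNodes
    resourcesPerNode:
      limits:
        nvidia.com/gpu: 1
```

Everything not set here (launcher, volumes, scheduling) comes from the referenced runtime blueprint.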

Training Runtimes

Runtimes define the infrastructure blueprint. Key sections of the runtime spec:

MLPolicy

Controls distributed training framework behavior. Only one framework source can be set:

Framework | mlPolicy field | What it configures
PlainML   | (none - just numNodes) | Simple parallel jobs without a distributed framework
PyTorch   | torch.numProcPerNode   | torchrun launcher - set to auto or an integer for the GPU count per node
MPI       | mpi.numProcPerNode, mpi.mpiImplementation, mpi.sshAuthMountPath, mpi.runLauncherAsNode | mpirun launcher - automatic SSH key generation, OpenMPI/IntelMPI support

Torch policy: The controller injects torchrun as the entrypoint with --nnodes, --nproc-per-node, --master-addr, --master-port configured automatically. Set numProcPerNode: auto to use all available GPUs, or a specific integer.

MPI policy: The controller generates SSH keys, creates hostfiles, and launches via mpirun. DeepSpeed uses this path since it launches via MPI under the hood. Set runLauncherAsNode: true to have the launcher also participate as a training node.
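
To make the Torch policy concrete, here is a trimmed sketch of a runtime using it (structure follows the sections described above; the runtime name and container image are placeholders):

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed-example        # hypothetical runtime name
  labels:
    trainer.kubeflow.org/framework: torch
spec:
  mlPolicy:
    numNodes: 1
    torch:
      numProcPerNode: auto               # use all available GPUs per node
  template:
    spec:
      replicatedJobs:
        - name: node
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: trainer
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      image: pytorch/pytorch:latest   # placeholder image
```

The controller injects torchrun and its rendezvous settings into the node containers at reconcile time, so the runtime only declares the infrastructure.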

Job Template

The template field contains a full JobSet spec. Each replicatedJob defines a step:

ReplicatedJob name  | Ancestor label      | Purpose
dataset-initializer | dataset-initializer | Download/prepare the dataset to a shared volume
model-initializer   | model-initializer   | Download model weights to a shared volume
node                | trainer             | Training worker nodes
launcher            | trainer             | MPI launcher node (MPI runtimes only)

The ancestor label (trainer.kubeflow.org/trainjob-ancestor-step) tells the controller which TrainJob fields to inject into each ReplicatedJob. Use dependsOn to sequence steps (initializers complete before training starts).

PodGroupPolicy

Configures gang scheduling. See references/scheduling-and-integrations.md for Kueue, Volcano, and Coscheduling integration details.

Framework Label

Every runtime must include trainer.kubeflow.org/framework: <name> label. The SDK uses this to identify compatible runtimes. Values used by default runtimes: torch, deepspeed, mlx, torchtune. Custom runtimes can use any string; JAX support is under active development (proposal stage as of v2.1.0).

Supported Runtimes

Runtime Name           | Framework | Launcher | Use Case
torch-distributed      | PyTorch   | torchrun | DDP, FSDP, any torch.distributed workload
deepspeed-distributed  | DeepSpeed | mpirun   | ZeRO stages, pipeline parallelism
mlx-distributed        | MLX       | mpirun   | Apple MLX distributed training (CUDA supported v2.1+)
torchtune-llama3.2-1b  | TorchTune | torchrun | BuiltinTrainer LLM fine-tuning (Llama 3.2 1B)
torchtune-llama3.2-3b  | TorchTune | torchrun | BuiltinTrainer LLM fine-tuning (Llama 3.2 3B)
torchtune-qwen2.5-1.5b | TorchTune | torchrun | BuiltinTrainer LLM fine-tuning (Qwen 2.5 1.5B)

Custom runtimes can define any container image, command, volumes, init containers, sidecars, and PodTemplate overrides. See references/advanced-configuration.md.

Distributed AI Data Cache (v2.1+)

Stream data directly to GPU nodes from an in-memory cache cluster powered by Apache Arrow and DataFusion. The Data Cache eliminates I/O bottlenecks for large tabular datasets via zero-copy transfers, maximizing GPU utilization during pre-training and post-training.

How It Works

  1. A cache initializer downloads the dataset and loads it into an in-memory Arrow/DataFusion cache cluster
  2. Cache nodes expose a gRPC endpoint with readiness probes - training doesn't start until the cache is warm
  3. Training nodes read data from the cache via the Arrow Flight protocol - zero-copy, no serialization overhead
  4. The cache runs as a separate ReplicatedJob in the JobSet, co-located with training nodes for low latency

When to Use

  • Datasets too large for each node to hold locally
  • I/O-bound training where GPU utilization drops below 80%
  • Repeated epochs over the same dataset (cache persists across epochs)
  • Parquet, CSV, or any tabular format supported by DataFusion

SDK Usage

from kubeflow.trainer import (
    TrainerClient, CustomTrainer, Runtime,
    Initializer, DataCacheInitializer,
)

client = TrainerClient()

job_name = client.train(
    runtime=Runtime(name="torch-distributed-with-cache"),
    initializer=Initializer(
        dataset=DataCacheInitializer(
            storage_uri="cache://schema_name/table_name",
            metadata_loc="s3://my-bucket/path/to/metadata.json",
            num_data_nodes=4,
            worker_mem="64Gi",
        ),
    ),
    trainer=CustomTrainer(
        func=my_train_func,
        num_nodes=8,
        resources_per_node={"gpu": 8},
    ),
)

YAML Configuration

A cache-enabled runtime adds a cache-initializer ReplicatedJob:

apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed-with-cache
spec:
  template:
    spec:
      replicatedJobs:
        - name: cache-initializer
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: cache-initializer
            spec:
              template:
                spec:
                  containers:
                    - name: cache-initializer
                      image: ghcr.io/kubeflow/trainer/dataset-initializer
                      readinessProbe:
                        grpc:
                          port: 50051
                        initialDelaySeconds: 10
                        periodSeconds: 5
        - name: node
          dependsOn:
            - name: cache-initializer
              status: Ready  # Waits for readiness, not completion
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: trainer
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      image: training:latest
                      env:
                        - name: DATA_CACHE_ENDPOINT
                          value: "cache-initializer-0:50051"

See the official docs for full configuration options.

Python SDK

The Kubeflow Python SDK (pip install kubeflow) provides a Pythonic interface that abstracts away the Kubernetes YAML:

See references/python-sdk.md for the full TrainerClient API, CustomTrainer/BuiltinTrainer configuration, job lifecycle management, and log streaming.

Key SDK operations:

  • TrainerClient().train(runtime=..., trainer=...) - submit a TrainJob
  • client.get_job(name) - get job status and steps
  • client.get_job_logs(name, follow=True) - stream logs
  • client.list_runtimes() - discover available runtimes
  • client.delete_job(name) - clean up

Multi-Node Networking

For distributed training across nodes, the network stack matters:

  • NCCL is the default backend for GPU-to-GPU communication in PyTorch distributed. The controller sets MASTER_ADDR and MASTER_PORT automatically.
  • NCCL environment variables - set via spec.trainer.env or the runtime template for tuning: NCCL_DEBUG, NCCL_SOCKET_IFNAME, NCCL_IB_DISABLE, NCCL_P2P_DISABLE, NCCL_NET_GDR_LEVEL, NCCL_SOCKET_NTHREADS
  • Shared storage - initializer steps write to a shared PVC (ReadWriteMany) mounted across all training nodes. Define volumes in the runtime template's JobSet spec.
  • Host networking - for InfiniBand/RoCE, set hostNetwork: true in the Pod spec within the runtime template
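
For example, NCCL tuning variables can be passed per job through spec.trainer.env; a sketch (the interface name and values are illustrative and must match your nodes):

```yaml
spec:
  trainer:
    env:
      - name: NCCL_DEBUG
        value: "INFO"          # verbose NCCL logging for debugging
      - name: NCCL_SOCKET_IFNAME
        value: "eth0"          # pin NCCL to the high-speed interface
      - name: NCCL_IB_DISABLE
        value: "0"             # keep InfiniBand enabled if present
```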

See references/advanced-configuration.md for NCCL tuning and network topology details.

Resource Management

GPU and resource configuration flows through two layers:

  1. Runtime template - default resources in the container spec
  2. TrainJob override - spec.trainer.resourcesPerNode overrides per job

Additional scheduling controls are set in the runtime's Job template PodSpec:

  • nodeSelector - target specific GPU node pools
  • tolerations - tolerate GPU node taints
  • priorityClassName - Kubernetes priority class
  • topologySpreadConstraints - spread across failure domains
  • affinity - node/pod affinity rules
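
These controls live in the runtime's Pod template. A sketch targeting a tainted GPU node pool (the node label and priority class name are placeholders for whatever your cluster defines):

```yaml
spec:
  template:
    spec:
      replicatedJobs:
        - name: node
          template:
            spec:
              template:
                spec:
                  nodeSelector:
                    cloud.example.com/gpu-pool: a100   # placeholder node label
                  tolerations:
                    - key: nvidia.com/gpu              # tolerate the GPU taint
                      operator: Exists
                      effect: NoSchedule
                  priorityClassName: training-high     # placeholder priority class
```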

See references/scheduling-and-integrations.md for Kueue quota management and topology-aware scheduling.

Fault Tolerance

Built on JobSet's failure policy and Kubernetes Job restart semantics:

  • Restart policy - set restartPolicy in the runtime's Pod template (OnFailure, Never)
  • JobSet failure policy - configure max restarts, per-index retry, and failure rules in the runtime's JobSet template
  • Checkpoint recovery - mount a persistent volume for checkpoints; training code resumes from the latest checkpoint on restart. The shared volume from initializers can serve this purpose.
  • Elastic training - PyTorch's torchrun supports elastic scaling natively; set min/max nodes in the torchrun config. Trainer v2 exposes this through the Torch MLPolicy. Full elastic training support (auto-scaling node count) is under active development.
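
A hedged sketch of the JobSet-level failure knobs inside a runtime template (field names follow the JobSet API; the restart limit is illustrative):

```yaml
spec:
  template:
    spec:
      failurePolicy:
        maxRestarts: 3          # recreate failed Jobs up to 3 times
      replicatedJobs:
        - name: node
          template:
            spec:
              backoffLimit: 0   # fail the Job fast; let JobSet drive restarts
              template:
                spec:
                  restartPolicy: Never
```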

Monitoring

  • Job status: kubectl get trainjobs shows phase (Suspended, Running, Succeeded, Failed)
  • Events: kubectl describe trainjob <name> - the controller emits events for creation, scheduling, completion, and failure
  • Logs: kubectl logs -l job-name=<trainjob-name>-node-0 --follow, or via the SDK's client.get_job_logs()
  • Metrics: the controller exposes Prometheus metrics at /metrics - reconciliation latency, queue depth, active jobs
  • JobSet status: kubectl get jobsets shows the underlying JobSet phase and per-job completion

Troubleshooting

See references/troubleshooting.md for detailed debugging patterns covering NCCL timeouts, OOM kills, scheduling failures, image pull errors, runtime misconfigurations, and SDK issues.

Quick diagnostics:

  1. kubectl get trainjob <name> -o yaml - check .status.conditions for error messages
  2. kubectl get jobset <name> -o yaml - check the underlying JobSet status
  3. kubectl describe pods -l job-name=<name>-node-0 - check events for scheduling/pull issues
  4. kubectl logs <pod> -c node - check training container logs for NCCL/OOM errors

Cross-References

  • volcano - gang-schedule training jobs via the Volcano scheduler integration
  • kueue - queue management and fair sharing for training workloads
  • gpu-operator - GPU driver and device plugin prerequisites for training nodes
  • nccl - NCCL tuning for multi-node distributed training jobs
  • fsdp - FSDP training strategy used in Kubeflow training jobs
  • deepspeed - DeepSpeed ZeRO integration for memory-efficient training
  • pytorch - PyTorch distributed training fundamentals
  • wandb - experiment tracking from Kubeflow training containers
  • mlflow - alternative experiment tracking and model registry
  • prometheus-grafana - monitor training job metrics and GPU utilization
  • minio - S3-compatible checkpoint and dataset storage
  • aws-fsx - high-performance shared storage for training data
  • leaderworkerset - alternative multi-node workload primitive for inference/training

References

File | Contents
references/installation-and-operations.md | Installation methods, Helm values, CRD management, upgrades, runtime deployment
references/python-sdk.md | TrainerClient API, CustomTrainer, BuiltinTrainer, TorchTune, job lifecycle
references/scheduling-and-integrations.md | Kueue integration, Volcano, Coscheduling, topology-aware scheduling, priority classes
references/advanced-configuration.md | Custom runtimes, PodTemplate overrides, init containers, sidecars, NCCL tuning, volumes
references/troubleshooting.md | Common failures, debugging patterns, NCCL issues, OOM, scheduling, SDK errors
references/training-pipeline-integration.md | End-to-end pipeline: DVC -> Ray Data -> Kubeflow Trainer (FSDP/DeepSpeed) -> W&B/MLflow -> S3 -> lm-eval