Agent Skills

kubeflow-trainer (@tylertitsworth/kubeflow-trainer)
by tylertitsworth
Updated 4/7/2026

Kubeflow Trainer v2 - TrainJob CRD, Training Runtimes, Python SDK, JobSet, multi-framework support. Use when orchestrating distributed training on K8s. NOT for inference or Ray Train.

Installation

$ npx agent-skills-cli install @tylertitsworth/kubeflow-trainer

Details

Path: kubeflow-trainer/SKILL.md
Branch: main
Scoped Name: @tylertitsworth/kubeflow-trainer

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

npx agent-skills-cli list

Skill Instructions


name: kubeflow-trainer
description: "Kubeflow Trainer v2 - TrainJob CRD, Training Runtimes, Python SDK, JobSet, multi-framework support. Use when orchestrating distributed training on K8s. NOT for inference or Ray Train."

Kubeflow Trainer v2

Kubernetes-native distributed training operator. Replaces per-framework CRDs (PyTorchJob, MPIJob, etc.) with a unified TrainJob + Runtime architecture built on JobSet.

Docs: https://www.kubeflow.org/docs/components/trainer/
GitHub: https://github.com/kubeflow/trainer
Version: v2.1.x | API: trainer.kubeflow.org/v1alpha1 | Requires: Kubernetes >= 1.31

Architecture Overview

Trainer v2 separates concerns between two personas:

  • Platform Administrators create ClusterTrainingRuntime / TrainingRuntime: reusable blueprints defining infrastructure (images, scheduling, gang-scheduling, failure policies)
  • AI Practitioners create TrainJob: lightweight job specs referencing a runtime, specifying only training-specific settings (image, args, resources, number of nodes)

Under the hood, the controller reconciles a TrainJob into a JobSet, which creates indexed Jobs, which in turn create Pods. The controller injects runtime config, environment variables (torchrun, MPI), and scheduling metadata automatically.

CRD Hierarchy

CRD                    | Scope     | Purpose
ClusterTrainingRuntime | Cluster   | Shared runtime blueprints (torch-distributed, deepspeed-distributed, etc.)
TrainingRuntime        | Namespace | Team-specific runtime blueprints
TrainJob               | Namespace | Individual training job referencing a runtime

Installation

See references/installation-and-operations.md for full installation (Helm charts, kustomize manifests), CRD setup, runtime deployment, Helm values, and upgrade procedures.

Quick start:

export VERSION=v2.1.0
# Controller + CRDs
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=${VERSION}"
# Default runtimes
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=${VERSION}"

Or with Helm:

helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer \
  --namespace kubeflow-system --create-namespace \
  --version ${VERSION#v} \
  --set runtimes.defaultEnabled=true

TrainJob Spec

The TrainJob is intentionally minimal. AI practitioners specify only what differs from the runtime blueprint:

Field                          | Purpose
spec.runtimeRef                | Reference to ClusterTrainingRuntime or TrainingRuntime (apiGroup, kind, name)
spec.trainer.image             | Training container image (overrides runtime default)
spec.trainer.command           | Entrypoint command
spec.trainer.args              | Arguments passed to the training script
spec.trainer.numNodes          | Number of training nodes (overrides the runtime's mlPolicy.numNodes)
spec.trainer.resourcesPerNode  | Resource requests/limits per node (CPU, memory, GPU)
spec.trainer.env               | Additional environment variables
spec.initializer.dataset       | Dataset initializer config (HuggingFace, S3, etc.)
spec.initializer.model         | Model initializer config (HuggingFace, S3, etc.)
spec.labels / spec.annotations | Propagated to the generated JobSet and Pods
spec.managedBy                 | External controller field (e.g., Kueue's kueue.x-k8s.io/multikueue)
spec.suspend                   | Suspend the job (used by Kueue for admission control)

Runtime Reference

Every TrainJob must reference a runtime. The runtimeRef field specifies:

  • apiGroup: trainer.kubeflow.org (default)
  • kind: ClusterTrainingRuntime or TrainingRuntime
  • name: the runtime name (e.g., torch-distributed)

The controller merges TrainJob overrides on top of the runtime template to produce the final JobSet.
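
As a sketch, a minimal TrainJob overriding a few fields of the default torch-distributed runtime could look like the following (the job name, image, and resource sizes are illustrative placeholders, not values from the official docs):

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: torch-example              # hypothetical job name
  namespace: default
spec:
  runtimeRef:
    apiGroup: trainer.kubeflow.org
    kind: ClusterTrainingRuntime
    name: torch-distributed
  trainer:
    image: ghcr.io/example/train:latest   # placeholder training image
    numNodes: 2                           # overrides the runtime's mlPolicy.numNodes
    resourcesPerNode:
      limits:
        nvidia.com/gpu: 1
```

Everything not set here (launcher, volumes, scheduling) comes from the referenced runtime blueprint.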

Training Runtimes

Runtimes define the infrastructure blueprint. Key sections of the runtime spec:

MLPolicy

Controls distributed training framework behavior. Only one framework source can be set:

Framework | mlPolicy field | What it configures
PlainML   | (none - just numNodes) | Simple parallel jobs without a distributed framework
PyTorch   | torch.numProcPerNode   | torchrun launcher - set to auto or an integer for the GPU count per node
MPI       | mpi.numProcPerNode, mpi.mpiImplementation, mpi.sshAuthMountPath, mpi.runLauncherAsNode | mpirun launcher - automatic SSH key generation, OpenMPI/IntelMPI support

Torch policy: The controller injects torchrun as the entrypoint with --nnodes, --nproc-per-node, --master-addr, --master-port configured automatically. Set numProcPerNode: auto to use all available GPUs, or a specific integer.

MPI policy: The controller generates SSH keys, creates hostfiles, and launches via mpirun. DeepSpeed uses this path since it launches via MPI under the hood. Set runLauncherAsNode: true to have the launcher also participate as a training node.
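
To make the Torch policy concrete, here is a trimmed sketch of a runtime using it (structure follows the sections described above; the runtime name and container image are placeholders):

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed-example        # hypothetical runtime name
  labels:
    trainer.kubeflow.org/framework: torch
spec:
  mlPolicy:
    numNodes: 1
    torch:
      numProcPerNode: auto               # use all available GPUs per node
  template:
    spec:
      replicatedJobs:
        - name: node
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: trainer
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      image: pytorch/pytorch:latest   # placeholder image
```

The controller injects torchrun and its rendezvous settings into the node containers at reconcile time, so the runtime only declares the infrastructure.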

Job Template

The template field contains a full JobSet spec. Each replicatedJob defines a step:

ReplicatedJob name  | Ancestor label      | Purpose
dataset-initializer | dataset-initializer | Download/prepare the dataset to a shared volume
model-initializer   | model-initializer   | Download model weights to a shared volume
node                | trainer             | Training worker nodes
launcher            | trainer             | MPI launcher node (MPI runtimes only)

The ancestor label (trainer.kubeflow.org/trainjob-ancestor-step) tells the controller which TrainJob fields to inject into each ReplicatedJob. Use dependsOn to sequence steps (initializers complete before training starts).

PodGroupPolicy

Configures gang scheduling. See references/scheduling-and-integrations.md for Kueue, Volcano, and Coscheduling integration details.

Framework Label

Every runtime must include trainer.kubeflow.org/framework: <name> label. The SDK uses this to identify compatible runtimes. Values used by default runtimes: torch, deepspeed, mlx, torchtune. Custom runtimes can use any string; JAX support is under active development (proposal stage as of v2.1.0).

Supported Runtimes

Runtime Name           | Framework | Launcher | Use Case
torch-distributed      | PyTorch   | torchrun | DDP, FSDP, any torch.distributed workload
deepspeed-distributed  | DeepSpeed | mpirun   | ZeRO stages, pipeline parallelism
mlx-distributed        | MLX       | mpirun   | Apple MLX distributed training (CUDA supported v2.1+)
torchtune-llama3.2-1b  | TorchTune | torchrun | BuiltinTrainer LLM fine-tuning (Llama 3.2 1B)
torchtune-llama3.2-3b  | TorchTune | torchrun | BuiltinTrainer LLM fine-tuning (Llama 3.2 3B)
torchtune-qwen2.5-1.5b | TorchTune | torchrun | BuiltinTrainer LLM fine-tuning (Qwen 2.5 1.5B)

Custom runtimes can define any container image, command, volumes, init containers, sidecars, and PodTemplate overrides. See references/advanced-configuration.md.

Distributed AI Data Cache (v2.1+)

Stream data directly to GPU nodes from an in-memory cache cluster powered by Apache Arrow and DataFusion. The Data Cache eliminates I/O bottlenecks for large tabular datasets via zero-copy transfers, maximizing GPU utilization during pre-training and post-training.

How It Works

  1. A cache initializer downloads the dataset and loads it into an in-memory Arrow/DataFusion cache cluster
  2. Cache nodes expose a gRPC endpoint with readiness probes - training doesn't start until the cache is warm
  3. Training nodes read data from the cache via the Arrow Flight protocol - zero-copy, no serialization overhead
  4. The cache runs as a separate ReplicatedJob in the JobSet, co-located with training nodes for low latency

When to Use

  • Datasets too large for each node to hold locally
  • I/O-bound training where GPU utilization drops below 80%
  • Repeated epochs over the same dataset (cache persists across epochs)
  • Parquet, CSV, or any tabular format supported by DataFusion

SDK Usage

from kubeflow.trainer import (
    TrainerClient, CustomTrainer, Runtime,
    Initializer, DataCacheInitializer,
)

client = TrainerClient()

job_name = client.train(
    runtime=Runtime(name="torch-distributed-with-cache"),
    initializer=Initializer(
        dataset=DataCacheInitializer(
            storage_uri="cache://schema_name/table_name",
            metadata_loc="s3://my-bucket/path/to/metadata.json",
            num_data_nodes=4,
            worker_mem="64Gi",
        ),
    ),
    trainer=CustomTrainer(
        func=my_train_func,
        num_nodes=8,
        resources_per_node={"gpu": 8},
    ),
)

YAML Configuration

A cache-enabled runtime adds a cache-initializer ReplicatedJob:

apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed-with-cache
spec:
  template:
    spec:
      replicatedJobs:
        - name: cache-initializer
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: cache-initializer
            spec:
              template:
                spec:
                  containers:
                    - name: cache-initializer
                      image: ghcr.io/kubeflow/trainer/dataset-initializer
                      readinessProbe:
                        grpc:
                          port: 50051
                        initialDelaySeconds: 10
                        periodSeconds: 5
        - name: node
          dependsOn:
            - name: cache-initializer
              status: Ready  # Waits for readiness, not completion
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: trainer
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      image: training:latest
                      env:
                        - name: DATA_CACHE_ENDPOINT
                          value: "cache-initializer-0:50051"

See the official docs for full configuration options.

Python SDK

The Kubeflow Python SDK (pip install kubeflow) provides a Pythonic interface that abstracts away the Kubernetes YAML:

See references/python-sdk.md for the full TrainerClient API, CustomTrainer/BuiltinTrainer configuration, job lifecycle management, and log streaming.

Key SDK operations:

  • TrainerClient().train(runtime=..., trainer=...) - submit a TrainJob
  • client.get_job(name) - get job status and steps
  • client.get_job_logs(name, follow=True) - stream logs
  • client.list_runtimes() - discover available runtimes
  • client.delete_job(name) - clean up

Multi-Node Networking

For distributed training across nodes, the network stack matters:

  • NCCL is the default backend for GPU-to-GPU communication in PyTorch distributed. The controller sets MASTER_ADDR and MASTER_PORT automatically.
  • NCCL environment variables - set via spec.trainer.env or the runtime template for tuning: NCCL_DEBUG, NCCL_SOCKET_IFNAME, NCCL_IB_DISABLE, NCCL_P2P_DISABLE, NCCL_NET_GDR_LEVEL, NCCL_SOCKET_NTHREADS
  • Shared storage - initializer steps write to a shared PVC (ReadWriteMany) mounted across all training nodes. Define volumes in the runtime template's JobSet spec.
  • Host networking - for InfiniBand/RoCE, set hostNetwork: true in the Pod spec within the runtime template
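
For example, NCCL tuning variables can be passed per job through spec.trainer.env; a sketch (the interface name and values are illustrative and must match your nodes):

```yaml
spec:
  trainer:
    env:
      - name: NCCL_DEBUG
        value: "INFO"          # verbose NCCL logging for debugging
      - name: NCCL_SOCKET_IFNAME
        value: "eth0"          # pin NCCL to the high-speed interface
      - name: NCCL_IB_DISABLE
        value: "0"             # keep InfiniBand enabled if present
```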

See references/advanced-configuration.md for NCCL tuning and network topology details.

Resource Management

GPU and resource configuration flows through two layers:

  1. Runtime template - default resources in the container spec
  2. TrainJob override - spec.trainer.resourcesPerNode overrides per job

Additional scheduling controls are set in the runtime's Job template PodSpec:

  • nodeSelector - target specific GPU node pools
  • tolerations - tolerate GPU node taints
  • priorityClassName - Kubernetes priority class
  • topologySpreadConstraints - spread across failure domains
  • affinity - node/pod affinity rules
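
These controls live in the runtime's Pod template. A sketch targeting a tainted GPU node pool (the node label and priority class name are placeholders for whatever your cluster defines):

```yaml
spec:
  template:
    spec:
      replicatedJobs:
        - name: node
          template:
            spec:
              template:
                spec:
                  nodeSelector:
                    cloud.example.com/gpu-pool: a100   # placeholder node label
                  tolerations:
                    - key: nvidia.com/gpu              # tolerate the GPU taint
                      operator: Exists
                      effect: NoSchedule
                  priorityClassName: training-high     # placeholder priority class
```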

See references/scheduling-and-integrations.md for Kueue quota management and topology-aware scheduling.

Fault Tolerance

Built on JobSet's failure policy and Kubernetes Job restart semantics:

  • Restart policy - set restartPolicy in the runtime's Pod template (OnFailure, Never)
  • JobSet failure policy - configure max restarts, per-index retry, and failure rules in the runtime's JobSet template
  • Checkpoint recovery - mount a persistent volume for checkpoints; training code resumes from the latest checkpoint on restart. The shared volume from initializers can serve this purpose.
  • Elastic training - PyTorch's torchrun supports elastic scaling natively; set min/max nodes in the torchrun config. Trainer v2 exposes this through the Torch MLPolicy. Full elastic training support (auto-scaling node count) is under active development.
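
A hedged sketch of the JobSet-level failure knobs inside a runtime template (field names follow the JobSet API; the restart limit is illustrative):

```yaml
spec:
  template:
    spec:
      failurePolicy:
        maxRestarts: 3          # recreate failed Jobs up to 3 times
      replicatedJobs:
        - name: node
          template:
            spec:
              backoffLimit: 0   # fail the Job fast; let JobSet drive restarts
              template:
                spec:
                  restartPolicy: Never
```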

Monitoring

  • Job status: kubectl get trainjobs shows phase (Suspended, Running, Succeeded, Failed)
  • Events: kubectl describe trainjob <name> - the controller emits events for creation, scheduling, completion, and failure
  • Logs: kubectl logs -l job-name=<trainjob-name>-node-0 --follow, or via the SDK's client.get_job_logs()
  • Metrics: the controller exposes Prometheus metrics at /metrics - reconciliation latency, queue depth, active jobs
  • JobSet status: kubectl get jobsets shows the underlying JobSet phase and per-job completion

Troubleshooting

See references/troubleshooting.md for detailed debugging patterns covering NCCL timeouts, OOM kills, scheduling failures, image pull errors, runtime misconfigurations, and SDK issues.

Quick diagnostics:

  1. kubectl get trainjob <name> -o yaml - check .status.conditions for error messages
  2. kubectl get jobset <name> -o yaml - check the underlying JobSet status
  3. kubectl describe pods -l job-name=<name>-node-0 - check events for scheduling/pull issues
  4. kubectl logs <pod> -c node - check training container logs for NCCL/OOM errors

Cross-References

  • volcano - gang-schedule training jobs via the Volcano scheduler integration
  • kueue - queue management and fair sharing for training workloads
  • gpu-operator - GPU driver and device plugin prerequisites for training nodes
  • nccl - NCCL tuning for multi-node distributed training jobs
  • fsdp - FSDP training strategy used in Kubeflow training jobs
  • deepspeed - DeepSpeed ZeRO integration for memory-efficient training
  • pytorch - PyTorch distributed training fundamentals
  • wandb - experiment tracking from Kubeflow training containers
  • mlflow - alternative experiment tracking and model registry
  • prometheus-grafana - monitor training job metrics and GPU utilization
  • minio - S3-compatible checkpoint and dataset storage
  • aws-fsx - high-performance shared storage for training data
  • leaderworkerset - alternative multi-node workload primitive for inference/training

References

File | Contents
references/installation-and-operations.md | Installation methods, Helm values, CRD management, upgrades, runtime deployment
references/python-sdk.md | TrainerClient API, CustomTrainer, BuiltinTrainer, TorchTune, job lifecycle
references/scheduling-and-integrations.md | Kueue integration, Volcano, Coscheduling, topology-aware scheduling, priority classes
references/advanced-configuration.md | Custom runtimes, PodTemplate overrides, init containers, sidecars, NCCL tuning, volumes
references/troubleshooting.md | Common failures, debugging patterns, NCCL issues, OOM, scheduling, SDK errors
references/training-pipeline-integration.md | End-to-end pipeline: DVC -> Ray Data -> Kubeflow Trainer (FSDP/DeepSpeed) -> W&B/MLflow -> S3 -> lm-eval