---
name: kubeflow-trainer
description: "Kubeflow Trainer v2 — TrainJob CRD, Training Runtimes, Python SDK, JobSet, multi-framework support. Use when orchestrating distributed training on K8s. NOT for inference or Ray Train."
---
# Kubeflow Trainer v2

Kubernetes-native distributed training operator. Replaces per-framework CRDs (PyTorchJob, MPIJob, etc.) with a unified TrainJob + Runtime architecture built on JobSet.

- Docs: https://www.kubeflow.org/docs/components/trainer/
- GitHub: https://github.com/kubeflow/trainer
- Version: v2.1.x | API: `trainer.kubeflow.org/v1alpha1` | Requires: Kubernetes ≥ 1.31
## Architecture Overview

Trainer v2 separates concerns between two personas:

- **Platform Administrators** create `ClusterTrainingRuntime`/`TrainingRuntime` — reusable blueprints defining infrastructure (images, scheduling, gang-scheduling, failure policies)
- **AI Practitioners** create `TrainJob` — lightweight job specs referencing a runtime, specifying only training-specific settings (image, args, resources, num nodes)

Under the hood, a TrainJob → controller creates a JobSet → which creates indexed Jobs → which create Pods. The controller injects runtime config, environment variables (torchrun, MPI), and scheduling metadata automatically.
### CRD Hierarchy

| CRD | Scope | Purpose |
|---|---|---|
| `ClusterTrainingRuntime` | Cluster | Shared runtime blueprints (`torch-distributed`, `deepspeed-distributed`, etc.) |
| `TrainingRuntime` | Namespace | Team-specific runtime blueprints |
| `TrainJob` | Namespace | Individual training job referencing a runtime |
## Installation

See references/installation-and-operations.md for full installation (Helm charts, kustomize manifests), CRD setup, runtime deployment, Helm values, and upgrade procedures.

Quick start:

```bash
export VERSION=v2.1.0

# Controller + CRDs
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=${VERSION}"

# Default runtimes
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=${VERSION}"
```

Or with Helm:

```bash
helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer \
  --namespace kubeflow-system --create-namespace \
  --version ${VERSION#v} \
  --set runtimes.defaultEnabled=true
```
## TrainJob Spec

The TrainJob is intentionally minimal. AI practitioners specify only what differs from the runtime blueprint:

| Field | Purpose |
|---|---|
| `spec.runtimeRef` | Reference to ClusterTrainingRuntime or TrainingRuntime (apiGroup, kind, name) |
| `spec.trainer.image` | Training container image (overrides runtime default) |
| `spec.trainer.command` | Entrypoint command |
| `spec.trainer.args` | Arguments passed to the training script |
| `spec.trainer.numNodes` | Number of training nodes (overrides the runtime's `mlPolicy.numNodes`) |
| `spec.trainer.resourcesPerNode` | Resource requests/limits per node (CPU, memory, GPU) |
| `spec.trainer.env` | Additional environment variables |
| `spec.initializer.dataset` | Dataset initializer config (HuggingFace, S3, etc.) |
| `spec.initializer.model` | Model initializer config (HuggingFace, S3, etc.) |
| `spec.labels` / `spec.annotations` | Propagated to the generated JobSet and Pods |
| `spec.managedBy` | External controller field (e.g., Kueue's `kueue.x-k8s.io/multikueue`) |
| `spec.suspend` | Suspend the job (used by Kueue for admission control) |
### Runtime Reference

Every TrainJob must reference a runtime. The `runtimeRef` field specifies:

- `apiGroup`: `trainer.kubeflow.org` (default)
- `kind`: `ClusterTrainingRuntime` or `TrainingRuntime`
- `name`: the runtime name (e.g., `torch-distributed`)

The controller merges TrainJob overrides on top of the runtime template to produce the final JobSet.
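Putting the spec table and `runtimeRef` together, a minimal TrainJob might look like the sketch below. The image name is a placeholder; the field names follow the spec table above:

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-ddp-example
spec:
  runtimeRef:
    apiGroup: trainer.kubeflow.org
    kind: ClusterTrainingRuntime
    name: torch-distributed
  trainer:
    image: docker.io/example/train:latest  # placeholder image
    numNodes: 2                            # overrides the runtime's mlPolicy.numNodes
    resourcesPerNode:
      limits:
        nvidia.com/gpu: 1
```

Everything not set here (launcher, scheduling, volumes, failure policy) comes from the referenced runtime blueprint.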
## Training Runtimes

Runtimes define the infrastructure blueprint. Key sections of the runtime spec:

### MLPolicy

Controls distributed training framework behavior. Only one framework source can be set:

| Framework | mlPolicy field | What it configures |
|---|---|---|
| PlainML | (none — just `numNodes`) | Simple parallel jobs without a distributed framework |
| PyTorch | `torch.numProcPerNode` | torchrun launcher — set to `auto` or an integer for GPU count per node |
| MPI | `mpi.numProcPerNode`, `mpi.mpiImplementation`, `mpi.sshAuthMountPath`, `mpi.runLauncherAsNode` | mpirun launcher — automatic SSH key generation, OpenMPI/IntelMPI support |

**Torch policy:** The controller injects torchrun as the entrypoint with `--nnodes`, `--nproc-per-node`, `--master-addr`, and `--master-port` configured automatically. Set `numProcPerNode: auto` to use all available GPUs, or a specific integer.

**MPI policy:** The controller generates SSH keys, creates hostfiles, and launches via mpirun. DeepSpeed uses this path since it launches via MPI under the hood. Set `runLauncherAsNode: true` to have the launcher also participate as a training node.
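As a sketch, the `mlPolicy` section of a torch runtime (using the fields named in the table above) might look like this:

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed
spec:
  mlPolicy:
    numNodes: 1            # default; a TrainJob's spec.trainer.numNodes overrides it
    torch:
      numProcPerNode: auto # one torchrun process per available GPU
  template:
    spec:
      replicatedJobs:
      - name: node
        # full JobSet replicatedJob template goes here
```

With this policy, the controller wraps the container entrypoint in torchrun and wires the rendezvous flags itself, so the training script only needs standard `torch.distributed` initialization.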
### Job Template

The `template` field contains a full JobSet spec. Each `replicatedJob` defines a step:

| ReplicatedJob name | Ancestor label | Purpose |
|---|---|---|
| `dataset-initializer` | `dataset-initializer` | Download/prepare the dataset to a shared volume |
| `model-initializer` | `model-initializer` | Download model weights to a shared volume |
| `node` | `trainer` | Training worker nodes |
| `launcher` | `trainer` | MPI launcher node (MPI runtimes only) |

The ancestor label (`trainer.kubeflow.org/trainjob-ancestor-step`) tells the controller which TrainJob fields to inject into each ReplicatedJob. Use `dependsOn` to sequence steps (initializers complete before training starts).
### PodGroupPolicy

Configures gang scheduling. See references/scheduling-and-integrations.md for Kueue, Volcano, and Coscheduling integration details.

### Framework Label

Every runtime must include the `trainer.kubeflow.org/framework: <name>` label. The SDK uses this to identify compatible runtimes. Values used by the default runtimes: `torch`, `deepspeed`, `mlx`, `torchtune`. Custom runtimes can use any string; JAX support is under active development (proposal stage as of v2.1.0).
## Supported Runtimes

| Runtime Name | Framework | Launcher | Use Case |
|---|---|---|---|
| `torch-distributed` | PyTorch | torchrun | DDP, FSDP, any torch.distributed workload |
| `deepspeed-distributed` | DeepSpeed | mpirun | ZeRO stages, pipeline parallelism |
| `mlx-distributed` | MLX | mpirun | Apple MLX distributed training (CUDA supported v2.1+) |
| `torchtune-llama3.2-1b` | TorchTune | torchrun | BuiltinTrainer LLM fine-tuning (Llama 3.2 1B) |
| `torchtune-llama3.2-3b` | TorchTune | torchrun | BuiltinTrainer LLM fine-tuning (Llama 3.2 3B) |
| `torchtune-qwen2.5-1.5b` | TorchTune | torchrun | BuiltinTrainer LLM fine-tuning (Qwen 2.5 1.5B) |

Custom runtimes can define any container image, command, volumes, init containers, sidecars, and PodTemplate overrides. See references/advanced-configuration.md.
## Distributed AI Data Cache (v2.1+)

Stream data directly to GPU nodes from an in-memory cache cluster powered by Apache Arrow and DataFusion. The Data Cache eliminates I/O bottlenecks for large tabular datasets via zero-copy transfers, maximizing GPU utilization during pre-training and post-training.

### How It Works

- A cache initializer downloads the dataset and loads it into an in-memory Arrow/DataFusion cache cluster
- Cache nodes expose a gRPC endpoint with readiness probes — training doesn't start until the cache is warm
- Training nodes read data from the cache via the Arrow Flight protocol — zero-copy, zero serialization overhead
- The cache runs as a separate ReplicatedJob in the JobSet, co-located with training nodes for low latency

### When to Use

- Datasets too large for each node to hold locally
- I/O-bound training where GPU utilization drops below 80%
- Repeated epochs over the same dataset (the cache persists across epochs)
- Parquet, CSV, or any tabular format supported by DataFusion
### SDK Usage

```python
from kubeflow.trainer import (
    TrainerClient, CustomTrainer, Runtime,
    Initializer, DataCacheInitializer,
)

client = TrainerClient()
job_name = client.train(
    runtime=Runtime(name="torch-distributed-with-cache"),
    initializer=Initializer(
        dataset=DataCacheInitializer(
            storage_uri="cache://schema_name/table_name",
            metadata_loc="s3://my-bucket/path/to/metadata.json",
            num_data_nodes=4,
            worker_mem="64Gi",
        ),
    ),
    trainer=CustomTrainer(
        func=my_train_func,
        num_nodes=8,
        resources_per_node={"gpu": 8},
    ),
)
```
### YAML Configuration

A cache-enabled runtime adds a `cache-initializer` ReplicatedJob:

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed-with-cache
spec:
  template:
    spec:
      replicatedJobs:
      - name: cache-initializer
        template:
          metadata:
            labels:
              trainer.kubeflow.org/trainjob-ancestor-step: cache-initializer
          spec:
            template:
              spec:
                containers:
                - name: cache-initializer
                  image: ghcr.io/kubeflow/trainer/dataset-initializer
                  readinessProbe:
                    grpc:
                      port: 50051
                    initialDelaySeconds: 10
                    periodSeconds: 5
      - name: node
        dependsOn:
        - name: cache-initializer
          status: Ready  # waits for readiness, not completion
        template:
          metadata:
            labels:
              trainer.kubeflow.org/trainjob-ancestor-step: trainer
          spec:
            template:
              spec:
                containers:
                - name: node
                  image: training:latest
                  env:
                  - name: DATA_CACHE_ENDPOINT
                    value: "cache-initializer-0:50051"
```

See the official docs for full configuration options.
## Python SDK

The Kubeflow Python SDK (`pip install kubeflow`) provides a Pythonic interface that abstracts away all Kubernetes YAML.

See references/python-sdk.md for the full TrainerClient API, CustomTrainer/BuiltinTrainer configuration, job lifecycle management, and log streaming.

Key SDK operations:

- `TrainerClient().train(runtime=..., trainer=...)` — submit a TrainJob
- `client.get_job(name)` — get job status and steps
- `client.get_job_logs(name, follow=True)` — stream logs
- `client.list_runtimes()` — discover available runtimes
- `client.delete_job(name)` — clean up
## Multi-Node Networking

For distributed training across nodes, the network stack matters:

- **NCCL** is the default backend for GPU-to-GPU communication in PyTorch distributed. The controller sets `MASTER_ADDR` and `MASTER_PORT` automatically.
- **NCCL environment variables** — set via `spec.trainer.env` or the runtime template for tuning: `NCCL_DEBUG`, `NCCL_SOCKET_IFNAME`, `NCCL_IB_DISABLE`, `NCCL_P2P_DISABLE`, `NCCL_NET_GDR_LEVEL`, `NCCL_SOCKET_NTHREADS`
- **Shared storage** — initializer steps write to a shared PVC (ReadWriteMany) mounted across all training nodes. Define volumes in the runtime template's JobSet spec.
- **Host networking** — for InfiniBand/RoCE, set `hostNetwork: true` in the Pod spec within the runtime template

See references/advanced-configuration.md for NCCL tuning and network topology details.
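A small sketch of passing NCCL tuning variables through the SDK. The env dict itself is plain Python; the `CustomTrainer(env=...)` call at the end is an assumption based on the SDK reference, and the values shown are illustrative bring-up defaults, not recommendations:

```python
# NCCL tuning knobs destined for the trainer container (spec.trainer.env).
nccl_env = {
    "NCCL_DEBUG": "INFO",          # verbose logs for bring-up; drop to WARN in production
    "NCCL_SOCKET_IFNAME": "eth0",  # restrict NCCL to the pod network interface
    "NCCL_IB_DISABLE": "0",        # keep InfiniBand enabled when HCAs are present
}

# With the SDK, this would be passed as e.g.:
#   CustomTrainer(func=my_train_func, num_nodes=2, env=nccl_env)
# In raw YAML, the same pairs become name/value entries under spec.trainer.env:
env_entries = [{"name": k, "value": v} for k, v in sorted(nccl_env.items())]
print(env_entries[0])
```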
## Resource Management

GPU and resource configuration flows through two layers:

- **Runtime template** — default resources in the container spec
- **TrainJob override** — `spec.trainer.resourcesPerNode` overrides per-job

Additional scheduling controls are set in the runtime's Job template PodSpec:

- `nodeSelector` — target specific GPU node pools
- `tolerations` — tolerate GPU node taints
- `priorityClassName` — Kubernetes priority class
- `topologySpreadConstraints` — spread across failure domains
- `affinity` — node/pod affinity rules
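For illustration, these controls sit on the Pod spec nested inside the runtime's JobSet template. The node label, taint key, and priority class below are placeholders for whatever exists in your cluster:

```yaml
spec:
  template:                # runtime's JobSet template
    spec:
      replicatedJobs:
      - name: node
        template:
          spec:
            template:
              spec:
                nodeSelector:
                  node-pool: gpu-a100            # placeholder pool label
                tolerations:
                - key: nvidia.com/gpu
                  operator: Exists
                  effect: NoSchedule
                priorityClassName: training-high  # assumes this class exists
```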
See references/scheduling-and-integrations.md for Kueue quota management and topology-aware scheduling.
## Fault Tolerance

Built on JobSet's failure policy and Kubernetes Job restart semantics:

- **Restart policy** — set `restartPolicy` in the runtime's Pod template (`OnFailure`, `Never`)
- **JobSet failure policy** — configure max restarts, per-index retry, and failure rules in the runtime's JobSet template
- **Checkpoint recovery** — mount a persistent volume for checkpoints; training code resumes from the latest checkpoint on restart. The shared volume from the initializers can serve this purpose.
- **Elastic training** — PyTorch's torchrun supports elastic scaling natively; set min/max nodes in the torchrun config. Trainer v2 leverages this through the Torch MLPolicy. Full elastic training support (auto-scaling node count) is under active development.
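As a sketch, a JobSet-level retry budget in the runtime template might look like this. `maxRestarts` is part of the JobSet `failurePolicy` API; the value is arbitrary:

```yaml
spec:
  template:              # runtime's JobSet template
    spec:
      failurePolicy:
        maxRestarts: 3   # recreate the JobSet up to 3 times on failure
      replicatedJobs:
      - name: node
        template:
          spec:
            backoffLimit: 0  # let JobSet, not the individual Job, own retries
```

Combined with checkpointing to a persistent volume, each restart resumes from the last saved step rather than from scratch.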
## Monitoring

- **Job status**: `kubectl get trainjobs` shows the phase (Suspended, Running, Succeeded, Failed)
- **Events**: `kubectl describe trainjob <name>` — the controller emits events for creation, scheduling, completion, and failure
- **Logs**: `kubectl logs -l job-name=<trainjob-name>-node-0 --follow`, or via the SDK's `client.get_job_logs()`
- **Metrics**: the controller exposes Prometheus metrics at `/metrics` — reconciliation latency, queue depth, active jobs
- **JobSet status**: `kubectl get jobsets` shows the underlying JobSet phase and per-job completion
## Troubleshooting

See references/troubleshooting.md for detailed debugging patterns covering NCCL timeouts, OOM kills, scheduling failures, image pull errors, runtime misconfigurations, and SDK issues.

Quick diagnostics:

- `kubectl get trainjob <name> -o yaml` — check `.status.conditions` for error messages
- `kubectl get jobset <name> -o yaml` — check the underlying JobSet status
- `kubectl describe pods -l job-name=<name>-node-0` — check events for scheduling/pull issues
- `kubectl logs <pod> -c node` — check training container logs for NCCL/OOM errors
## Cross-References
- volcano — Gang-schedule training jobs via Volcano scheduler integration
- kueue — Queue management and fair-sharing for training workloads
- gpu-operator — GPU driver and device plugin prerequisites for training nodes
- nccl — NCCL tuning for multi-node distributed training jobs
- fsdp — FSDP training strategy used in Kubeflow training jobs
- deepspeed — DeepSpeed ZeRO integration for memory-efficient training
- pytorch — PyTorch distributed training fundamentals
- wandb — Experiment tracking from Kubeflow training containers
- mlflow — Alternative experiment tracking and model registry
- prometheus-grafana — Monitor training job metrics and GPU utilization
- minio — S3-compatible checkpoint and dataset storage
- aws-fsx — High-performance shared storage for training data
- leaderworkerset — Alternative multi-node workload primitive for inference/training
## References

| File | Contents |
|---|---|
| references/installation-and-operations.md | Installation methods, Helm values, CRD management, upgrades, runtime deployment |
| references/python-sdk.md | TrainerClient API, CustomTrainer, BuiltinTrainer, TorchTune, job lifecycle |
| references/scheduling-and-integrations.md | Kueue integration, Volcano, Coscheduling, topology-aware scheduling, priority classes |
| references/advanced-configuration.md | Custom runtimes, PodTemplate overrides, init containers, sidecars, NCCL tuning, volumes |
| references/troubleshooting.md | Common failures, debugging patterns, NCCL issues, OOM, scheduling, SDK errors |
| references/training-pipeline-integration.md | End-to-end pipeline: DVC → Ray Data → Kubeflow Trainer (FSDP/DeepSpeed) → W&B/MLflow → S3 → lm-eval |