LeaderWorkerSet – multi-node ML inference and training on K8s, Kueue integration, TAS placement. Use when deploying multi-node ML workloads. NOT for single-node jobs.
LeaderWorkerSet (LWS)
Kubernetes API for deploying groups of pods as a unit of replication. The standard K8s primitive for multi-node ML inference and training where pods must be managed as a cohesive group.
Docs: https://lws.sigs.k8s.io/docs/
GitHub: https://github.com/kubernetes-sigs/lws
Version: v0.8.0 | API: leaderworkerset.x-k8s.io/v1 | Requires: Kubernetes ≥ 1.26
Why LWS
Traditional K8s workload APIs (Deployment, StatefulSet) treat each pod independently. Multi-node ML workloads need pods managed as a group – if one pod fails, the entire group must restart. LWS provides:
- Group-as-a-unit: N replicas, each containing 1 leader + M workers (a "super pod")
- All-or-nothing restart: If any pod in a group fails, the whole group recreates
- Topology-aware placement: Co-locate all pods in a group on the same rack/block
- HPA on groups: Scale replicas (groups) based on leader pod metrics
- Rolling updates per group: Upgrade groups atomically, not individual pods
- Built-in env vars: `LWS_LEADER_ADDRESS`, `LWS_GROUP_SIZE`, `LWS_WORKER_INDEX` injected into every pod
Adopted by: NVIDIA NIM, vLLM, SGLang, Kubeflow Trainer, llm-d, NVIDIA Dynamo, Apple AXLearn, llmaz, OME. Supported on GKE, EKS, OpenShift.
Core Concepts
Each LWS creates N replicas of pod groups. Each group has:
- 1 leader pod (index 0) – named `<lws-name>-<replicaIndex>`
- M worker pods (index 1..M) – named `<lws-name>-<replicaIndex>-<workerIndex>`
- Total pods per group = `size` field (leader + workers)
Workers are managed via a StatefulSet per group. The leader pod is created directly.
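Applying the naming scheme above, an LWS named `vllm` with `replicas: 2` and `size: 3` produces these pods:

```
vllm-0      # leader,   group 0
vllm-0-1    # worker 1, group 0
vllm-0-2    # worker 2, group 0
vllm-1      # leader,   group 1
vllm-1-1    # worker 1, group 1
vllm-1-2    # worker 2, group 1
```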
Injected Environment Variables
| Variable | Value | Available In |
|---|---|---|
| `LWS_LEADER_ADDRESS` | DNS name of the leader pod | All pods |
| `LWS_GROUP_SIZE` | Total pods in the group (`size` field) | All pods |
| `LWS_WORKER_INDEX` | 0 for leader, 1..M for workers | All pods |
These are the primary mechanism for distributed frameworks (Ray, torchrun, NCCL) to discover peers.
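As a hedged illustration, a container entrypoint can branch on these variables to decide whether to start the head process or join it. The `role_banner` helper and the fallback defaults are hypothetical, for demonstration only; in a real pod the LWS controller injects the variables, so the defaults never apply:

```shell
# Hypothetical entrypoint sketch: leader (index 0) serves, workers join.
role_banner() {
  if [ "${LWS_WORKER_INDEX:-0}" = "0" ]; then
    echo "leader: serving at ${LWS_LEADER_ADDRESS:-localhost}"
  else
    echo "worker: joining ${LWS_LEADER_ADDRESS:-localhost}"
  fi
}
role_banner
```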
Installation
```shell
# kubectl
VERSION=v0.8.0
kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/$VERSION/manifests.yaml
kubectl wait deploy/lws-controller-manager -n lws-system --for=condition=available --timeout=5m

# Helm
helm install lws oci://registry.k8s.io/lws/charts/lws \
  --version=0.8.0 --namespace lws-system --create-namespace --wait
```
Deployment Pattern
The canonical LWS pattern: the leader starts the distributed runtime and serving process; workers join via `LWS_LEADER_ADDRESS`. The same structure applies to vLLM (Ray), SGLang (native NCCL), and NIM – only the container image and startup commands differ.
Key configuration points in the LWS manifest:
- `size`: Total pods per group (leader + workers). Set to the number of nodes needed.
- `restartPolicy: RecreateGroupOnPodRestart`: Required for distributed ML – partial restarts corrupt NCCL/Ray state.
- Leader template: starts distributed runtime head + serving process
- Worker template: joins leader via `$(LWS_LEADER_ADDRESS)`
- `/dev/shm` `emptyDir`: required for NCCL shared memory (16GB+ recommended)
- GPU resources on both leader and worker
Set `tensor-parallel-size` = GPUs per node and `pipeline-parallel-size` = number of nodes. For NIM multi-node, see the nvidia-nim skill.
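A minimal sketch of this pattern for a 2-node vLLM deployment (TP=8 × PP=2, 16 GPUs). The image tag, model name, and exact startup commands are illustrative assumptions, not canonical values; the structural elements (`size`, `restartPolicy`, `/dev/shm` emptyDir, GPU limits on both templates) follow the points above:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2                                 # 1 leader + 1 worker = 2 nodes
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      spec:
        containers:
        - name: vllm-leader
          image: vllm/vllm-openai:latest    # illustrative tag
          command: ["sh", "-c"]
          args:
          - |
            ray start --head --port=6379 &&
            vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
              --tensor-parallel-size 8 \
              --pipeline-parallel-size 2
          resources:
            limits:
              nvidia.com/gpu: "8"
          volumeMounts:
          - name: dshm
            mountPath: /dev/shm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi                 # 16GB+ for NCCL shared memory
    workerTemplate:
      spec:
        containers:
        - name: vllm-worker
          image: vllm/vllm-openai:latest
          command: ["sh", "-c"]
          args:
          - ray start --address=$(LWS_LEADER_ADDRESS):6379 --block
          resources:
            limits:
              nvidia.com/gpu: "8"
          volumeMounts:
          - name: dshm
            mountPath: /dev/shm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi
```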
Parallelism Strategy
| GPUs Needed | Nodes | TP | PP | Example |
|---|---|---|---|---|
| ≤8 | 1 | 8 | 1 | Single node, no LWS needed |
| 16 | 2 | 8 | 2 | ~400B dense model, FP8 |
| 32 | 4 | 8 | 4 | ~600B MoE model, BF16 |
| 16 | 2 | 16 | 1 | SGLang native TP across nodes |
Restart Policies
| Policy | Behavior |
|---|---|
| `None` | No group-level restart (default). Pods restart individually per K8s policy. |
| `RecreateGroupOnPodRestart` | If any pod in the group restarts, all pods in the group are recreated. Use this for distributed ML. |
`RecreateGroupOnPodRestart` is critical because a partial restart corrupts distributed state (NCCL communicators, Ray object store, model shards).
Startup Policy
| Policy | Behavior |
|---|---|
| `LeaderCreated` (default) | Worker StatefulSet created when the leader pod object is created |
| `LeaderReady` | Worker StatefulSet created only after the leader pod is Ready |
Use `LeaderReady` when workers need the leader's service (e.g., Ray head, NCCL rendezvous) to be running before they start.
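For example, a sketch assuming `startupPolicy` sits at the LWS `spec` level in the v1 API:

```yaml
spec:
  startupPolicy: LeaderReady   # workers created only after the leader is Ready
  leaderWorkerTemplate:
    size: 2
    # leaderTemplate / workerTemplate as in the deployment pattern above
```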
Persistent Storage (v0.8.0+)
LWS supports `volumeClaimTemplates` for per-pod persistent storage – useful for checkpointing and model caching. Volumes are provisioned per pod in each group, similar to StatefulSet PVCs.
```yaml
leaderWorkerTemplate:
  leaderTemplate:
    spec:
      containers:
      - name: leader
        volumeMounts:
        - name: checkpoint
          mountPath: /checkpoints
  workerTemplate:
    spec:
      containers:
      - name: worker
        volumeMounts:
        - name: checkpoint
          mountPath: /checkpoints
  volumeClaimTemplates:
  - metadata:
      name: checkpoint
    spec:
      accessModes: [ReadWriteOnce]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 100Gi
persistentVolumeClaimRetentionPolicy:
  whenDeleted: Delete  # or Retain
  whenScaled: Retain   # or Delete
```
Networking
LWS creates a headless Service for DNS-based pod discovery. Pod DNS: `<pod-name>.<headless-service>.<namespace>.svc.cluster.local`
Subdomain policies:
- `Shared` (default): Single headless Service for all replicas
- `Exclusive`: Separate headless Service per replica group – use when replicas must be network-isolated
RDMA/RoCE
For high-performance NCCL communication, pods may need `hostNetwork: true` and InfiniBand device mounts. For NCCL tuning, see the nccl skill.
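A hedged fragment showing the shape of such a worker spec; the RDMA resource name depends entirely on the device plugin deployed in your cluster and is shown here only as an example:

```yaml
workerTemplate:
  spec:
    hostNetwork: true                   # share the node's network namespace
    dnsPolicy: ClusterFirstWithHostNet  # keep cluster DNS with hostNetwork
    containers:
    - name: worker
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]             # allow pinning memory for RDMA
      resources:
        limits:
          nvidia.com/gpu: "8"
          rdma/rdma_shared_device_a: "1"  # plugin-specific; name varies
```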
Topology Aware Scheduling (TAS)
Co-locate leader and workers in the same topology domain (rack, block) for optimal NCCL performance. Two mechanisms available:
Kueue TAS Integration
Requires Kueue with TAS enabled. The `kueue.x-k8s.io/podset-group-name` annotation ensures leader and workers land in the same topology domain:
```yaml
leaderTemplate:
  metadata:
    annotations:
      kueue.x-k8s.io/podset-required-topology: "cloud.provider.com/topology-rack"
      kueue.x-k8s.io/podset-group-name: "lws-group"
workerTemplate:
  metadata:
    annotations:
      kueue.x-k8s.io/podset-required-topology: "cloud.provider.com/topology-rack"
      kueue.x-k8s.io/podset-group-name: "lws-group"
```
preferred vs required topology: Use `kueue.x-k8s.io/podset-preferred-topology` for best-effort placement (allows scheduling even if the ideal topology is unavailable). Use `kueue.x-k8s.io/podset-required-topology` for strict co-location (pods wait until a matching domain has capacity).
LWS-Native Exclusive Topology
Independent of Kueue. Enforces 1:1 mapping between LWS replica and topology domain:
```yaml
metadata:
  annotations:
    leaderworkerset.sigs.k8s.io/exclusive-topology: rack
```
Each replica group exclusively owns its topology domain – no other LWS replicas can share it. Best for workloads that need guaranteed bandwidth isolation.
Subgroups (Disaggregated Serving)
For prefill/decode separation or heterogeneous pod placement, SubGroups define scheduling sub-units within a replica:
```yaml
leaderWorkerTemplate:
  size: 9                 # 1 leader + 8 workers
  subGroupPolicy:
    subGroupSize: 4       # 4 pods per sub-group
    subGroupPolicyType: LeaderExcluded  # leader excluded from worker subgroups
```
Policy types:
- `LeaderWorker` (default): Leader is included in the first subgroup alongside workers. Subgroups: `(0..subGroupSize)`, `(subGroupSize+1..2*subGroupSize)`, ...
- `LeaderExcluded`: Leader is excluded from all subgroups. Workers are grouped by `subGroupSize`. Use when the leader is a lightweight coordinator on a CPU node and workers are GPU nodes.
Use the `leaderworkerset.sigs.k8s.io/subgroup-exclusive-topology: rack` annotation to place each subgroup on a separate rack. This enables patterns like:
- Prefill workers on high-compute racks, decode workers on high-memory racks
- Leader as a lightweight router on CPU nodes, GPU workers on accelerator racks
Each subgroup's index is exposed via the `leaderworkerset.sigs.k8s.io/subgroup-index` label.
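Mirroring the exclusive-topology example above, the subgroup annotation can be set in the LWS metadata; the placement shown here is a sketch, so confirm it against the API reference:

```yaml
metadata:
  annotations:
    leaderworkerset.sigs.k8s.io/subgroup-exclusive-topology: rack
```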
Kueue Integration
Label the LWS for Kueue admission control:
```yaml
metadata:
  labels:
    kueue.x-k8s.io/queue-name: gpu-queue
```
Kueue treats each LWS replica group as a unit of admission. Scale-up creates new groups that remain gated until quota is available. Enable in KueueConfiguration:
```yaml
integrations:
  frameworks:
  - "leaderworkerset.x-k8s.io/leaderworkerset"
```
Enabled by default since Kueue v0.16.
Rollout Strategy
```yaml
spec:
  rolloutStrategy:
    type: RollingUpdate
    rollingUpdateConfiguration:
      maxUnavailable: 1   # max groups down during rollout
      maxSurge: 0         # max extra groups during rollout
      partition: 0        # canary: set >0 to only update groups >= partition
```
Updates are atomic per group – all pods in a group are updated together. Use `partition` for canary deployments: set it to N-1 to update only the last group first.
HPA Scaling
LWS exposes a scale subresource. HPA targets leader pods:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    name: vllm
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_requests_running
      target:
        type: AverageValue
        averageValue: "50"
```
Scaling adds/removes entire groups, not individual pods. Combine with Kueue for quota-gated scale-up: new groups queue until GPU quota is available.
Labels and Selectors
| Label | Value | On |
|---|---|---|
| `leaderworkerset.sigs.k8s.io/name` | LWS name | All pods |
| `leaderworkerset.sigs.k8s.io/group-index` | Replica index (0..N-1) | All pods |
| `leaderworkerset.sigs.k8s.io/worker-index` | Worker index (0=leader) | All pods |
| `leaderworkerset.sigs.k8s.io/subgroup-index` | Subgroup index (when SubGroupPolicy set) | All pods |
```shell
# All pods
kubectl get pods -l leaderworkerset.sigs.k8s.io/name=<lws-name>
# Leaders only
kubectl get pods -l leaderworkerset.sigs.k8s.io/name=<lws-name>,leaderworkerset.sigs.k8s.io/worker-index=0
```
Distributed Training
LWS can orchestrate multi-node PyTorch training using torchrun. Use `workerTemplate` alone (same template for leader and workers) with `torchrun --node-rank=$(LWS_WORKER_INDEX) --rdzv-endpoint=$(LWS_LEADER_ADDRESS):29500`.
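A sketch of that training setup, assuming 4 nodes with 8 GPUs each; the image tag and `train.py` script are illustrative placeholders:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: train
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 4                                  # 4 nodes, one torchrun per node
    restartPolicy: RecreateGroupOnPodRestart
    workerTemplate:                          # same template used for all pods
      spec:
        containers:
        - name: trainer
          image: pytorch/pytorch:latest      # illustrative tag
          command: ["sh", "-c"]
          args:
          - |
            torchrun --nnodes=$(LWS_GROUP_SIZE) --nproc-per-node=8 \
              --node-rank=$(LWS_WORKER_INDEX) \
              --rdzv-backend=c10d \
              --rdzv-endpoint=$(LWS_LEADER_ADDRESS):29500 \
              train.py
          resources:
            limits:
              nvidia.com/gpu: "8"
```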
Note: For training, consider Kubeflow TrainJob or PyTorchJob which provide richer training lifecycle management. LWS is preferable when you need custom topologies, HPA scaling of training replicas, or integration with serving patterns.
When to Use LWS vs Alternatives
| Workload | Use | Why |
|---|---|---|
| Multi-node inference (vLLM/SGLang/NIM) | LWS | Group restart, HPA, topology placement |
| Multi-node training (one-shot) | PyTorchJob / TrainJob | Richer training lifecycle, completion tracking |
| Multi-node training (long-running, elastic) | LWS | HPA scaling, rolling updates |
| Batch job parallelism | JobSet | Completion semantics, indexed jobs |
| Stateful services (databases) | StatefulSet | Ordered rollout, stable storage |
LWS vs Volcano: Complementary – LWS defines the pod-group primitive, Volcano provides gang scheduling as a scheduler replacement. They can be used together. For Kueue-style admission, use the native Kueue LWS integration. See the volcano skill.
Troubleshooting
See references/troubleshooting.md for common issues with scheduling, networking, restarts, and NCCL failures.
```shell
kubectl get lws                                                       # LWS status
kubectl describe lws <name>                                           # detailed state + events
kubectl logs -n lws-system deploy/lws-controller-manager --tail=200   # controller logs
kubectl exec <pod-name> -- env | grep LWS_                            # verify injected vars
```
References
- `references/troubleshooting.md` – Pending pods, rolling update issues, and network configuration problems
Cross-References
- kueue – Queue and quota management for LWS workloads
- vllm – vLLM serving configuration and optimization
- nccl – NCCL environment tuning, RDMA/RoCE config, and benchmarking
- nvidia-nim – NIM inference microservices; multi-node NIM uses LWS
- sglang – SGLang serving configuration and K8s deployment patterns
- volcano – Alternative batch scheduler with gang scheduling
- gpu-operator – NVIDIA GPU driver and device plugin setup
- skills/karpenter – Provision GPU nodes for LWS workloads
- skills/keda – Autoscale LWS-based inference deployments
- prometheus-grafana – Monitor LWS workload metrics
- gateway-api-inference – Route inference traffic to LWS-managed model servers
- aws-efa – EFA networking for multi-node LWS workloads
Reference
- LWS docs
- LWS GitHub
- LWS API reference
- Kueue LWS integration
