LeaderWorkerSet – multi-node ML inference and training on K8s, Kueue integration, TAS placement. Use when deploying multi-node ML workloads. NOT for single-node jobs.
LeaderWorkerSet (LWS)
Kubernetes API for deploying groups of pods as a unit of replication. The standard K8s primitive for multi-node ML inference and training where pods must be managed as a cohesive group.
Docs: https://lws.sigs.k8s.io/docs/
GitHub: https://github.com/kubernetes-sigs/lws
Version: v0.8.0 | API: leaderworkerset.x-k8s.io/v1 | Requires: Kubernetes ≥ 1.26
Why LWS
Traditional K8s workload APIs (Deployment, StatefulSet) treat each pod independently. Multi-node ML workloads need pods managed as a group – if one pod fails, the entire group must restart. LWS provides:
- Group-as-a-unit: N replicas, each containing 1 leader + M workers (a "super pod")
- All-or-nothing restart: If any pod in a group fails, the whole group recreates
- Topology-aware placement: Co-locate all pods in a group on the same rack/block
- HPA on groups: Scale replicas (groups) based on leader pod metrics
- Rolling updates per group: Upgrade groups atomically, not individual pods
- Built-in env vars: `LWS_LEADER_ADDRESS`, `LWS_GROUP_SIZE`, `LWS_WORKER_INDEX` injected into every pod
Adopted by: NVIDIA NIM, vLLM, SGLang, Kubeflow Trainer, llm-d, NVIDIA Dynamo, Apple AXLearn, llmaz, OME. Supported on GKE, EKS, OpenShift.
Core Concepts
Each LWS creates N replicas of pod groups. Each group has:
- 1 leader pod (index 0) – named `<lws-name>-<replicaIndex>`
- M worker pods (index 1..M) – named `<lws-name>-<replicaIndex>-<workerIndex>`
- Total pods per group = `size` field (leader + workers)
Workers are managed via a StatefulSet per group. The leader pod is created directly.
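Applying the naming scheme above, an LWS named `vllm` with `replicas: 2` and `size: 3` produces these pods:

```
vllm-0      # leader,   group 0
vllm-0-1    # worker 1, group 0
vllm-0-2    # worker 2, group 0
vllm-1      # leader,   group 1
vllm-1-1    # worker 1, group 1
vllm-1-2    # worker 2, group 1
```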
Injected Environment Variables
| Variable | Value | Available In |
|---|---|---|
| `LWS_LEADER_ADDRESS` | DNS name of the leader pod | All pods |
| `LWS_GROUP_SIZE` | Total pods in the group (`size` field) | All pods |
| `LWS_WORKER_INDEX` | 0 for leader, 1..M for workers | All pods |
These are the primary mechanism for distributed frameworks (Ray, torchrun, NCCL) to discover peers.
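As a hedged illustration, a container entrypoint can branch on these variables to decide whether to start the head process or join it. The `role_banner` helper and the fallback defaults are hypothetical, for demonstration only; in a real pod the LWS controller injects the variables, so the defaults never apply:

```shell
# Hypothetical entrypoint sketch: leader (index 0) serves, workers join.
role_banner() {
  if [ "${LWS_WORKER_INDEX:-0}" = "0" ]; then
    echo "leader: serving at ${LWS_LEADER_ADDRESS:-localhost}"
  else
    echo "worker: joining ${LWS_LEADER_ADDRESS:-localhost}"
  fi
}
role_banner
```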
Installation
```shell
# kubectl
VERSION=v0.8.0
kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/$VERSION/manifests.yaml
kubectl wait deploy/lws-controller-manager -n lws-system --for=condition=available --timeout=5m

# Helm
helm install lws oci://registry.k8s.io/lws/charts/lws \
  --version=0.8.0 --namespace lws-system --create-namespace --wait
```
Deployment Pattern
The canonical LWS pattern: the leader starts the distributed runtime and serving process; workers join via `LWS_LEADER_ADDRESS`. The same structure applies to vLLM (Ray), SGLang (native NCCL), and NIM – only the container image and startup commands differ.
Key configuration points in the LWS manifest:
- `size`: Total pods per group (leader + workers). Set to the number of nodes needed.
- `restartPolicy: RecreateGroupOnPodRestart`: Required for distributed ML – partial restarts corrupt NCCL/Ray state.
- Leader template: starts distributed runtime head + serving process
- Worker template: joins leader via `$(LWS_LEADER_ADDRESS)`
- `/dev/shm` `emptyDir`: required for NCCL shared memory (16GB+ recommended)
- GPU resources on both leader and worker
Set `tensor-parallel-size` = GPUs per node and `pipeline-parallel-size` = number of nodes. For NIM multi-node, see the nvidia-nim skill.
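A minimal sketch of this pattern for a 2-node vLLM deployment (TP=8 × PP=2, 16 GPUs). The image tag, model name, and exact startup commands are illustrative assumptions, not canonical values; the structural elements (`size`, `restartPolicy`, `/dev/shm` emptyDir, GPU limits on both templates) follow the points above:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2                                 # 1 leader + 1 worker = 2 nodes
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      spec:
        containers:
        - name: vllm-leader
          image: vllm/vllm-openai:latest    # illustrative tag
          command: ["sh", "-c"]
          args:
          - |
            ray start --head --port=6379 &&
            vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
              --tensor-parallel-size 8 \
              --pipeline-parallel-size 2
          resources:
            limits:
              nvidia.com/gpu: "8"
          volumeMounts:
          - name: dshm
            mountPath: /dev/shm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi                 # 16GB+ for NCCL shared memory
    workerTemplate:
      spec:
        containers:
        - name: vllm-worker
          image: vllm/vllm-openai:latest
          command: ["sh", "-c"]
          args:
          - ray start --address=$(LWS_LEADER_ADDRESS):6379 --block
          resources:
            limits:
              nvidia.com/gpu: "8"
          volumeMounts:
          - name: dshm
            mountPath: /dev/shm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi
```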
Parallelism Strategy
| GPUs Needed | Nodes | TP | PP | Example |
|---|---|---|---|---|
| ≤8 | 1 | 8 | 1 | Single node, no LWS needed |
| 16 | 2 | 8 | 2 | ~400B dense model, FP8 |
| 32 | 4 | 8 | 4 | ~600B MoE model, BF16 |
| 16 | 2 | 16 | 1 | SGLang native TP across nodes |
Restart Policies
| Policy | Behavior |
|---|---|
| `None` | No group-level restart (default). Pods restart individually per K8s policy. |
| `RecreateGroupOnPodRestart` | If any pod in the group restarts, all pods in the group are recreated. Use this for distributed ML. |
`RecreateGroupOnPodRestart` is critical because a partial restart corrupts distributed state (NCCL communicators, Ray object store, model shards).
Startup Policy
| Policy | Behavior |
|---|---|
| `LeaderCreated` (default) | Worker StatefulSet created when the leader pod object is created |
| `LeaderReady` | Worker StatefulSet created only after the leader pod is Ready |
Use `LeaderReady` when workers need the leader's service (e.g., Ray head, NCCL rendezvous) to be running before they start.
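For example, a sketch assuming `startupPolicy` sits at the LWS `spec` level in the v1 API:

```yaml
spec:
  startupPolicy: LeaderReady   # workers created only after the leader is Ready
  leaderWorkerTemplate:
    size: 2
    # leaderTemplate / workerTemplate as in the deployment pattern above
```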
Persistent Storage (v0.8.0+)
LWS supports `volumeClaimTemplates` for per-pod persistent storage – useful for checkpointing and model caching. Volumes are provisioned per pod in each group, similar to StatefulSet PVCs.
```yaml
leaderWorkerTemplate:
  leaderTemplate:
    spec:
      containers:
      - name: leader
        volumeMounts:
        - name: checkpoint
          mountPath: /checkpoints
  workerTemplate:
    spec:
      containers:
      - name: worker
        volumeMounts:
        - name: checkpoint
          mountPath: /checkpoints
  volumeClaimTemplates:
  - metadata:
      name: checkpoint
    spec:
      accessModes: [ReadWriteOnce]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 100Gi
persistentVolumeClaimRetentionPolicy:
  whenDeleted: Delete  # or Retain
  whenScaled: Retain   # or Delete
```
Networking
LWS creates a headless Service for DNS-based pod discovery. Pod DNS: `<pod-name>.<headless-service>.<namespace>.svc.cluster.local`
Subdomain policies:
- `Shared` (default): Single headless Service for all replicas
- `Exclusive`: Separate headless Service per replica group – use when replicas must be network-isolated
RDMA/RoCE
For high-performance NCCL communication, pods may need `hostNetwork: true` and InfiniBand device mounts. For NCCL tuning, see the nccl skill.
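A hedged fragment showing the shape of such a worker spec; the RDMA resource name depends entirely on the device plugin deployed in your cluster and is shown here only as an example:

```yaml
workerTemplate:
  spec:
    hostNetwork: true                   # share the node's network namespace
    dnsPolicy: ClusterFirstWithHostNet  # keep cluster DNS with hostNetwork
    containers:
    - name: worker
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]             # allow pinning memory for RDMA
      resources:
        limits:
          nvidia.com/gpu: "8"
          rdma/rdma_shared_device_a: "1"  # plugin-specific; name varies
```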
Topology Aware Scheduling (TAS)
Co-locate leader and workers in the same topology domain (rack, block) for optimal NCCL performance. Two mechanisms available:
Kueue TAS Integration
Requires Kueue with TAS enabled. The `kueue.x-k8s.io/podset-group-name` annotation ensures leader and workers land in the same topology domain:
```yaml
leaderTemplate:
  metadata:
    annotations:
      kueue.x-k8s.io/podset-required-topology: "cloud.provider.com/topology-rack"
      kueue.x-k8s.io/podset-group-name: "lws-group"
workerTemplate:
  metadata:
    annotations:
      kueue.x-k8s.io/podset-required-topology: "cloud.provider.com/topology-rack"
      kueue.x-k8s.io/podset-group-name: "lws-group"
```
preferred vs required topology: Use `kueue.x-k8s.io/podset-preferred-topology` for best-effort placement (allows scheduling even if the ideal topology is unavailable). Use `kueue.x-k8s.io/podset-required-topology` for strict co-location (pods wait until a matching domain has capacity).
LWS-Native Exclusive Topology
Independent of Kueue. Enforces 1:1 mapping between LWS replica and topology domain:
```yaml
metadata:
  annotations:
    leaderworkerset.sigs.k8s.io/exclusive-topology: rack
```
Each replica group exclusively owns its topology domain – no other LWS replicas can share it. Best for workloads that need guaranteed bandwidth isolation.
Subgroups (Disaggregated Serving)
For prefill/decode separation or heterogeneous pod placement, SubGroups define scheduling sub-units within a replica:
```yaml
leaderWorkerTemplate:
  size: 9                 # 1 leader + 8 workers
  subGroupPolicy:
    subGroupSize: 4       # 4 pods per sub-group
    subGroupPolicyType: LeaderExcluded  # leader excluded from worker subgroups
```
Policy types:
- `LeaderWorker` (default): Leader is included in the first subgroup alongside workers. Subgroups: `(0..subGroupSize)`, `(subGroupSize+1..2*subGroupSize)`, ...
- `LeaderExcluded`: Leader is excluded from all subgroups. Workers are grouped by `subGroupSize`. Use when the leader is a lightweight coordinator on a CPU node and workers are GPU nodes.
Use the `leaderworkerset.sigs.k8s.io/subgroup-exclusive-topology: rack` annotation to place each subgroup on a separate rack. This enables patterns like:
- Prefill workers on high-compute racks, decode workers on high-memory racks
- Leader as a lightweight router on CPU nodes, GPU workers on accelerator racks
Each subgroup's index is exposed via the `leaderworkerset.sigs.k8s.io/subgroup-index` label.
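Mirroring the exclusive-topology example above, the subgroup annotation can be set in the LWS metadata; the placement shown here is a sketch, so confirm it against the API reference:

```yaml
metadata:
  annotations:
    leaderworkerset.sigs.k8s.io/subgroup-exclusive-topology: rack
```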
Kueue Integration
Label the LWS for Kueue admission control:
```yaml
metadata:
  labels:
    kueue.x-k8s.io/queue-name: gpu-queue
```
Kueue treats each LWS replica group as a unit of admission. Scale-up creates new groups that remain gated until quota is available. Enable in KueueConfiguration:
```yaml
integrations:
  frameworks:
  - "leaderworkerset.x-k8s.io/leaderworkerset"
```
Enabled by default since Kueue v0.16.
Rollout Strategy
```yaml
spec:
  rolloutStrategy:
    type: RollingUpdate
    rollingUpdateConfiguration:
      maxUnavailable: 1   # max groups down during rollout
      maxSurge: 0         # max extra groups during rollout
      partition: 0        # canary: set >0 to only update groups >= partition
```
Updates are atomic per group – all pods in a group are updated together. Use `partition` for canary deployments: set it to N-1 to update only the last group first.
HPA Scaling
LWS exposes a scale subresource. HPA targets leader pods:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    name: vllm
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_requests_running
      target:
        type: AverageValue
        averageValue: "50"
```
Scaling adds/removes entire groups, not individual pods. Combine with Kueue for quota-gated scale-up: new groups queue until GPU quota is available.
Labels and Selectors
| Label | Value | On |
|---|---|---|
| `leaderworkerset.sigs.k8s.io/name` | LWS name | All pods |
| `leaderworkerset.sigs.k8s.io/group-index` | Replica index (0..N-1) | All pods |
| `leaderworkerset.sigs.k8s.io/worker-index` | Worker index (0=leader) | All pods |
| `leaderworkerset.sigs.k8s.io/subgroup-index` | Subgroup index (when SubGroupPolicy set) | All pods |
```shell
# All pods
kubectl get pods -l leaderworkerset.sigs.k8s.io/name=<lws-name>
# Leaders only
kubectl get pods -l leaderworkerset.sigs.k8s.io/name=<lws-name>,leaderworkerset.sigs.k8s.io/worker-index=0
```
Distributed Training
LWS can orchestrate multi-node PyTorch training using torchrun. Use `workerTemplate` alone (same template for leader and workers) with `torchrun --node-rank=$(LWS_WORKER_INDEX) --rdzv-endpoint=$(LWS_LEADER_ADDRESS):29500`.
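A sketch of that training setup, assuming 4 nodes with 8 GPUs each; the image tag and `train.py` script are illustrative placeholders:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: train
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 4                                  # 4 nodes, one torchrun per node
    restartPolicy: RecreateGroupOnPodRestart
    workerTemplate:                          # same template used for all pods
      spec:
        containers:
        - name: trainer
          image: pytorch/pytorch:latest      # illustrative tag
          command: ["sh", "-c"]
          args:
          - |
            torchrun --nnodes=$(LWS_GROUP_SIZE) --nproc-per-node=8 \
              --node-rank=$(LWS_WORKER_INDEX) \
              --rdzv-backend=c10d \
              --rdzv-endpoint=$(LWS_LEADER_ADDRESS):29500 \
              train.py
          resources:
            limits:
              nvidia.com/gpu: "8"
```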
Note: For training, consider Kubeflow TrainJob or PyTorchJob which provide richer training lifecycle management. LWS is preferable when you need custom topologies, HPA scaling of training replicas, or integration with serving patterns.
When to Use LWS vs Alternatives
| Workload | Use | Why |
|---|---|---|
| Multi-node inference (vLLM/SGLang/NIM) | LWS | Group restart, HPA, topology placement |
| Multi-node training (one-shot) | PyTorchJob / TrainJob | Richer training lifecycle, completion tracking |
| Multi-node training (long-running, elastic) | LWS | HPA scaling, rolling updates |
| Batch job parallelism | JobSet | Completion semantics, indexed jobs |
| Stateful services (databases) | StatefulSet | Ordered rollout, stable storage |
LWS vs Volcano: Complementary – LWS defines the pod-group primitive, Volcano provides gang scheduling as a scheduler replacement. They can be used together. For Kueue-style admission, use the native Kueue LWS integration. See the volcano skill.
Troubleshooting
See references/troubleshooting.md for common issues with scheduling, networking, restarts, and NCCL failures.
```shell
kubectl get lws                                                       # LWS status
kubectl describe lws <name>                                           # detailed state + events
kubectl logs -n lws-system deploy/lws-controller-manager --tail=200   # controller logs
kubectl exec <pod-name> -- env | grep LWS_                            # verify injected vars
```
References
- `references/troubleshooting.md` – Pending pods, rolling update issues, and network configuration problems
Cross-References
- kueue – Queue and quota management for LWS workloads
- vllm – vLLM serving configuration and optimization
- nccl – NCCL environment tuning, RDMA/RoCE config, and benchmarking
- nvidia-nim – NIM inference microservices; multi-node NIM uses LWS
- sglang – SGLang serving configuration and K8s deployment patterns
- volcano – Alternative batch scheduler with gang scheduling
- gpu-operator – NVIDIA GPU driver and device plugin setup
- skills/karpenter – Provision GPU nodes for LWS workloads
- skills/keda – Autoscale LWS-based inference deployments
- prometheus-grafana – Monitor LWS workload metrics
- gateway-api-inference – Route inference traffic to LWS-managed model servers
- aws-efa – EFA networking for multi-node LWS workloads
Reference
- LWS docs
- LWS GitHub
- LWS API reference
- Kueue LWS integration
