Agent Skills
tylertitsworth

leaderworkerset

@tylertitsworth/leaderworkerset
Updated 4/1/2026

LeaderWorkerSet — multi-node ML inference and training on K8s, Kueue integration, TAS placement. Use when deploying multi-node ML workloads. NOT for single-node jobs.

Installation

$ npx agent-skills-cli install @tylertitsworth/leaderworkerset

Supported assistants: Claude Code, Cursor, Copilot, Codex, Antigravity

Details

Path: leaderworkerset/SKILL.md
Branch: main
Scoped Name: @tylertitsworth/leaderworkerset

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

npx agent-skills-cli list

Skill Instructions


name: leaderworkerset
description: "LeaderWorkerSet — multi-node ML inference and training on K8s, Kueue integration, TAS placement. Use when deploying multi-node ML workloads. NOT for single-node jobs."

LeaderWorkerSet (LWS)

Kubernetes API for deploying groups of pods as a unit of replication. The standard K8s primitive for multi-node ML inference and training where pods must be managed as a cohesive group.

Docs: https://lws.sigs.k8s.io/docs/
GitHub: https://github.com/kubernetes-sigs/lws
Version: v0.8.0 | API: leaderworkerset.x-k8s.io/v1 | Requires: Kubernetes ≥ 1.26

Why LWS

Traditional K8s workload APIs (Deployment, StatefulSet) treat each pod independently. Multi-node ML workloads need pods managed as a group — if one pod fails, the entire group must restart. LWS provides:

  • Group-as-a-unit: N replicas, each containing 1 leader + M workers (a "super pod")
  • All-or-nothing restart: If any pod in a group fails, the whole group recreates
  • Topology-aware placement: Co-locate all pods in a group on the same rack/block
  • HPA on groups: Scale replicas (groups) based on leader pod metrics
  • Rolling updates per group: Upgrade groups atomically, not individual pods
  • Built-in env vars: LWS_LEADER_ADDRESS, LWS_GROUP_SIZE, LWS_WORKER_INDEX injected into every pod

Adopted by: NVIDIA NIM, vLLM, SGLang, Kubeflow Trainer, llm-d, NVIDIA Dynamo, Apple AXLearn, llmaz, OME. Supported on GKE, EKS, OpenShift.

Core Concepts

Each LWS creates N replicas of pod groups. Each group has:

  • 1 leader pod (index 0) — named <lws-name>-<replicaIndex>
  • M worker pods (index 1..M) — named <lws-name>-<replicaIndex>-<workerIndex>
  • Total pods per group = size field (leader + workers)

Workers are managed via a StatefulSet per group. The leader pod is created directly.
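The naming and sizing rules above map onto a minimal manifest like the following sketch (the image and names are placeholders, not a working workload):

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: example
spec:
  replicas: 2                  # N groups: pods example-0, example-1, example-0-1, ...
  leaderWorkerTemplate:
    size: 4                    # pods per group: 1 leader + 3 workers
    leaderTemplate:
      spec:
        containers:
        - name: leader
          image: registry.example.com/app:latest   # placeholder image
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: registry.example.com/app:latest   # placeholder image
```

This creates 2 × 4 = 8 pods total: leaders example-0 and example-1, plus workers example-0-1 through example-0-3 and example-1-1 through example-1-3.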

Injected Environment Variables

Variable           | Value                                 | Available In
LWS_LEADER_ADDRESS | DNS name of the leader pod            | All pods
LWS_GROUP_SIZE     | Total pods in the group (size field)  | All pods
LWS_WORKER_INDEX   | 0 for leader, 1..M for workers        | All pods

These are the primary mechanism for distributed frameworks (Ray, torchrun, NCCL) to discover peers.
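In practice this means leader and workers can share one entrypoint that branches on the injected variables. A hypothetical helper sketching that pattern (the echoed messages stand in for real `ray start` or rendezvous commands):

```shell
# Hypothetical entrypoint helper: derive this pod's role from the
# LWS-injected environment variables. A real script would exec the
# distributed runtime (e.g. a Ray head or worker) instead of echoing.
lws_role() {
  if [ "${LWS_WORKER_INDEX}" = "0" ]; then
    echo "leader (group size ${LWS_GROUP_SIZE})"
  else
    echo "worker ${LWS_WORKER_INDEX} -> ${LWS_LEADER_ADDRESS}"
  fi
}
```

Because every pod sees the same variables, the same container image and script can serve both templates.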

Installation

# kubectl
VERSION=v0.8.0
kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/$VERSION/manifests.yaml
kubectl wait deploy/lws-controller-manager -n lws-system --for=condition=available --timeout=5m

# Helm
helm install lws oci://registry.k8s.io/lws/charts/lws \
  --version=0.8.0 --namespace lws-system --create-namespace --wait

Deployment Pattern

The canonical LWS pattern: leader starts the distributed runtime + serving process, workers join via LWS_LEADER_ADDRESS. The same structure applies to vLLM (Ray), SGLang (native NCCL), and NIM — only the container image and startup commands differ.

Key configuration points in the LWS manifest:

  • size: Total pods per group (leader + workers). Set to number of nodes needed.
  • restartPolicy: RecreateGroupOnPodRestart: Required for distributed ML — partial restarts corrupt NCCL/Ray state.
  • Leader template: starts distributed runtime head + serving process
  • Worker template: joins leader via $(LWS_LEADER_ADDRESS)
  • /dev/shm emptyDir: required for NCCL shared memory (16GB+ recommended)
  • GPU resources on both leader and worker

Set tensor-parallel-size = GPUs per node, pipeline-parallel-size = number of nodes. For NIM multi-node, see the nvidia-nim skill.
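A condensed sketch of those configuration points for a 2-node vLLM-on-Ray deployment (the image tag, MODEL_NAME placeholder, and startup commands are illustrative, not a pinned recipe):

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2                                  # 2 nodes: TP 8 per node, PP 2 across nodes
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      spec:
        containers:
        - name: vllm-leader
          image: vllm/vllm-openai:latest     # illustrative; pin a version in practice
          command: ["sh", "-c"]
          args:                              # startup commands are illustrative
          - ray start --head --port=6379 &&
            vllm serve MODEL_NAME --tensor-parallel-size 8 --pipeline-parallel-size 2
          resources:
            limits:
              nvidia.com/gpu: "8"
          volumeMounts:
          - name: dshm
            mountPath: /dev/shm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi                  # NCCL shared memory
    workerTemplate:
      spec:
        containers:
        - name: vllm-worker
          image: vllm/vllm-openai:latest
          command: ["sh", "-c"]
          args:
          - ray start --address=$(LWS_LEADER_ADDRESS):6379 --block
          resources:
            limits:
              nvidia.com/gpu: "8"
          volumeMounts:
          - name: dshm
            mountPath: /dev/shm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi
```

Swapping the image and startup commands adapts the same skeleton to SGLang or NIM.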

Parallelism Strategy

GPUs Needed | Nodes | TP | PP | Example
≤8          | 1     | 8  | 1  | Single node, no LWS needed
16          | 2     | 8  | 2  | ~400B dense model, FP8
32          | 4     | 8  | 4  | ~600B MoE model, BF16
16          | 2     | 16 | 1  | SGLang native TP across nodes

Restart Policies

Policy                    | Behavior
None                      | No group-level restart (default). Pods restart individually per K8s policy.
RecreateGroupOnPodRestart | If any pod in the group restarts, all pods in the group are recreated. Use this for distributed ML.

RecreateGroupOnPodRestart is critical because a partial restart corrupts distributed state (NCCL communicators, Ray object store, model shards).

Startup Policy

Policy                  | Behavior
LeaderCreated (default) | Worker StatefulSet created when the leader pod object is created
LeaderReady             | Worker StatefulSet created only after the leader pod is Ready

Use LeaderReady when workers need the leader's service (e.g., Ray head, NCCL rendezvous) to be running before they start.
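Both knobs live on the LWS spec; a fragment combining them (field placement per the v1 API, to the best of my reading):

```yaml
spec:
  startupPolicy: LeaderReady               # workers wait for the leader to be Ready
  leaderWorkerTemplate:
    restartPolicy: RecreateGroupOnPodRestart
```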

Persistent Storage (v0.8.0+)

LWS supports volumeClaimTemplates for per-pod persistent storage — useful for checkpointing and model caching. Volumes are provisioned per-pod in each group, similar to StatefulSet PVCs.

leaderWorkerTemplate:
  leaderTemplate:
    spec:
      containers:
      - name: leader
        volumeMounts:
        - name: checkpoint
          mountPath: /checkpoints
  workerTemplate:
    spec:
      containers:
      - name: worker
        volumeMounts:
        - name: checkpoint
          mountPath: /checkpoints
  volumeClaimTemplates:
  - metadata:
      name: checkpoint
    spec:
      accessModes: [ReadWriteOnce]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 100Gi
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete    # or Retain
    whenScaled: Retain     # or Delete

Networking

LWS creates a headless Service for DNS-based pod discovery. Pod DNS: <pod-name>.<headless-service>.<namespace>.svc.cluster.local

Subdomain policies:

  • Shared (default): Single headless Service for all replicas
  • Exclusive: Separate headless Service per replica group — use when replicas must be network-isolated
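The DNS pattern can be assembled mechanically. A tiny hypothetical helper (assumes the default Shared policy, where the headless Service is named after the LWS):

```shell
# Hypothetical helper: build the cluster DNS name for an LWS pod from
# its pod name, headless service name, and namespace.
lws_pod_dns() {
  echo "$1.$2.$3.svc.cluster.local"
}
```

For example, `lws_pod_dns vllm-0 vllm default` yields the address of the group-0 leader.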

RDMA/RoCE

For high-performance NCCL communication, pods may need hostNetwork: true and InfiniBand device mounts. For NCCL tuning, see the nccl skill.
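A worker-template fragment showing where those settings go (the RDMA resource name is device-plugin specific and purely illustrative; check your cluster's plugin):

```yaml
workerTemplate:
  spec:
    hostNetwork: true
    dnsPolicy: ClusterFirstWithHostNet   # keep cluster DNS with hostNetwork
    containers:
    - name: worker
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]              # commonly needed for RDMA memory pinning
      resources:
        limits:
          nvidia.com/gpu: "8"
          rdma/ib: "1"                   # illustrative resource name
```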

Topology Aware Scheduling (TAS)

Co-locate leader and workers in the same topology domain (rack, block) for optimal NCCL performance. Two mechanisms available:

Kueue TAS Integration

Requires Kueue with TAS enabled. The podset-group-name annotation ensures leader and workers land in the same topology domain:

leaderTemplate:
  metadata:
    annotations:
      kueue.x-k8s.io/podset-required-topology: "cloud.provider.com/topology-rack"
      kueue.x-k8s.io/podset-group-name: "lws-group"
workerTemplate:
  metadata:
    annotations:
      kueue.x-k8s.io/podset-required-topology: "cloud.provider.com/topology-rack"
      kueue.x-k8s.io/podset-group-name: "lws-group"

preferred vs required topology: Use podset-preferred-topology for best-effort placement (allows scheduling even if ideal topology is unavailable). Use podset-required-topology for strict co-location (pods wait until a matching domain has capacity).

LWS-Native Exclusive Topology

Independent of Kueue. Enforces 1:1 mapping between LWS replica and topology domain:

metadata:
  annotations:
    leaderworkerset.sigs.k8s.io/exclusive-topology: rack

Each replica group exclusively owns its topology domain — no other LWS replicas can share it. Best for workloads that need guaranteed bandwidth isolation.

Subgroups (Disaggregated Serving)

For prefill/decode separation or heterogeneous pod placement, SubGroups define scheduling sub-units within a replica:

leaderWorkerTemplate:
  size: 9                                # 1 leader + 8 workers
  subGroupPolicy:
    subGroupSize: 4                      # 4 pods per sub-group
    subGroupPolicyType: LeaderExcluded   # leader excluded from worker subgroups

Policy types:

  • LeaderWorker (default): Leader is included in the first subgroup alongside workers. Subgroups: (0..subGroupSize-1), (subGroupSize..2*subGroupSize-1), ...
  • LeaderExcluded: Leader is excluded from all subgroups. Workers are grouped by subGroupSize. Use when the leader is a lightweight coordinator on a CPU node and workers are GPU nodes.

Use leaderworkerset.sigs.k8s.io/subgroup-exclusive-topology: rack to place each subgroup on a separate rack. This enables patterns like:

  • Prefill workers on high-compute racks, decode workers on high-memory racks
  • Leader as a lightweight router on CPU nodes, GPU workers on accelerator racks

Each subgroup's index is exposed via the leaderworkerset.sigs.k8s.io/subgroup-index label.

Kueue Integration

Label the LWS for Kueue admission control:

metadata:
  labels:
    kueue.x-k8s.io/queue-name: gpu-queue

Kueue treats each LWS replica group as a unit of admission. Scale-up creates new groups that remain gated until quota is available. Enable in KueueConfiguration:

integrations:
  frameworks:
  - "leaderworkerset.x-k8s.io/leaderworkerset"

Enabled by default since Kueue v0.16.

Rollout Strategy

spec:
  rolloutStrategy:
    type: RollingUpdate
    rollingUpdateConfiguration:
      maxUnavailable: 1       # max groups down during rollout
      maxSurge: 0             # max extra groups during rollout
      partition: 0            # canary: set >0 to only update groups >= partition

Updates are atomic per group — all pods in a group are updated together. Use partition for canary deployments: set to N-1 to update only the last group first.

HPA Scaling

LWS exposes a scale subresource. HPA targets leader pods:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    name: vllm
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_requests_running
      target:
        type: AverageValue
        averageValue: "50"

Scaling adds/removes entire groups, not individual pods. Combine with Kueue for quota-gated scale-up: new groups queue until GPU quota is available.

Labels and Selectors

Label                                      | Value                                     | On
leaderworkerset.sigs.k8s.io/name           | LWS name                                  | All pods
leaderworkerset.sigs.k8s.io/group-index    | Replica index (0..N-1)                    | All pods
leaderworkerset.sigs.k8s.io/worker-index   | Worker index (0 = leader)                 | All pods
leaderworkerset.sigs.k8s.io/subgroup-index | Subgroup index (when SubGroupPolicy set)  | All pods
# All pods
kubectl get pods -l leaderworkerset.sigs.k8s.io/name=<lws-name>
# Leaders only
kubectl get pods -l leaderworkerset.sigs.k8s.io/name=<lws-name>,leaderworkerset.sigs.k8s.io/worker-index=0

Distributed Training

LWS can orchestrate multi-node PyTorch training using torchrun. Use workerTemplate alone (same template for leader and workers) with torchrun --node-rank=$(LWS_WORKER_INDEX) --rdzv-endpoint=$(LWS_LEADER_ADDRESS):29500.
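A sketch of how that command line falls out of the injected variables (a hypothetical helper that only prints the command; train.py and port 29500 are placeholders):

```shell
# Hypothetical helper: assemble the torchrun invocation for this pod
# from the LWS-injected variables. A real entrypoint would exec the
# printed command instead of echoing it.
torchrun_cmd() {
  echo "torchrun --nnodes=${LWS_GROUP_SIZE} --node-rank=${LWS_WORKER_INDEX} --rdzv-backend=c10d --rdzv-endpoint=${LWS_LEADER_ADDRESS}:29500 train.py"
}
```

Since the leader has worker index 0, it naturally becomes node rank 0 without any per-template branching.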

Note: For training, consider Kubeflow TrainJob or PyTorchJob which provide richer training lifecycle management. LWS is preferable when you need custom topologies, HPA scaling of training replicas, or integration with serving patterns.

When to Use LWS vs Alternatives

Workload                                    | Use                   | Why
Multi-node inference (vLLM/SGLang/NIM)      | LWS                   | Group restart, HPA, topology placement
Multi-node training (one-shot)              | PyTorchJob / TrainJob | Richer training lifecycle, completion tracking
Multi-node training (long-running, elastic) | LWS                   | HPA scaling, rolling updates
Batch job parallelism                       | JobSet                | Completion semantics, indexed jobs
Stateful services (databases)               | StatefulSet           | Ordered rollout, stable storage

LWS vs Volcano: Complementary — LWS defines the pod-group primitive, Volcano provides gang scheduling as a scheduler replacement. They can be used together. For Kueue-style admission, use the native Kueue LWS integration. See the volcano skill.

Troubleshooting

See references/troubleshooting.md for common issues with scheduling, networking, restarts, and NCCL failures.

kubectl get lws                                        # LWS status
kubectl describe lws <name>                            # detailed state + events
kubectl logs -n lws-system deploy/lws-controller-manager --tail=200  # controller logs
kubectl exec <pod-name> -- env | grep LWS_             # verify injected vars

References

  • troubleshooting.md — Pending pods, rolling update issues, and network configuration problems

Cross-References

  • kueue — Queue and quota management for LWS workloads
  • vllm — vLLM serving configuration and optimization
  • nccl — NCCL environment tuning, RDMA/RoCE config, and benchmarking
  • nvidia-nim — NIM inference microservices; multi-node NIM uses LWS
  • sglang — SGLang serving configuration and K8s deployment patterns
  • volcano — Alternative batch scheduler with gang scheduling
  • gpu-operator — NVIDIA GPU driver and device plugin setup
  • skills/karpenter — Provision GPU nodes for LWS workloads
  • skills/keda — Autoscale LWS-based inference deployments
  • prometheus-grafana — Monitor LWS workload metrics
  • gateway-api-inference — Route inference traffic to LWS-managed model servers
  • aws-efa — EFA networking for multi-node LWS workloads
