aws-efa

@tylertitsworth/aws-efa (by tylertitsworth, updated 4/7/2026)

AWS EFA — SRD protocol, GPUDirect RDMA, NCCL integration, EKS node setup. Use when configuring EFA for distributed GPU training on EKS. NOT for standard TCP networking.

Installation

$ npx agent-skills-cli install @tylertitsworth/aws-efa

Supported assistants: Claude Code, Cursor, Copilot, Codex, Antigravity.

Details

Path: aws-efa/SKILL.md
Branch: main
Scoped Name: @tylertitsworth/aws-efa

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

npx agent-skills-cli list

Skill Instructions


---
name: aws-efa
description: "AWS EFA — SRD protocol, GPUDirect RDMA, NCCL integration, EKS node setup. Use when configuring EFA for distributed GPU training on EKS. NOT for standard TCP networking."
---

AWS Elastic Fabric Adapter (EFA)

What EFA Is

EFA is a network interface that combines a standard Elastic Network Adapter (ENA) with an OS-bypass interface using the AWS Scalable Reliable Datagram (SRD) protocol. The OS-bypass path allows applications (via libfabric) to communicate directly with the network hardware, skipping the kernel network stack entirely.

On GPU instances, EFA enables GPUDirect RDMA: NCCL collective operations move data directly between GPU memory across nodes without copying through CPU memory.

Protocol Stack

Application (NCCL)
    ↓
aws-ofi-nccl plugin (NCCL → libfabric translation)
    ↓
libfabric (EFA provider)
    ↓
EFA OS-bypass hardware interface
    ↓
SRD protocol (AWS Scalable Reliable Datagram)
    ↓
AWS network fabric

SRD is a custom AWS transport protocol optimized for HPC/ML. It provides:

  • Reliable delivery with minimal overhead
  • Multi-path routing across the AWS network (unlike TCP, which pins each flow to a single path)
  • Packet spraying across multiple network paths for higher aggregate throughput
  • Built-in congestion control tuned for collective operations

EFA vs ENA vs EFA-Only

| Interface Type | IP Stack | OS-Bypass | Use |
|---|---|---|---|
| ENA | Yes | No | Standard networking (pods, services) |
| EFA | Yes | Yes | Primary interface (network card 0) — both IP + RDMA |
| EFA-only | No | Yes | Additional interfaces (cards 1+) — RDMA traffic only |

EFA-only interfaces (available on p5, p5e, trn2) have no IP address — they carry only SRD/RDMA traffic with lower overhead. Network card 0 must always be a full EFA (with ENA) for standard connectivity.

Instance Types and Topology

| Instance | GPUs | EFA Interfaces | Aggregate Bandwidth | NVSwitch |
|---|---|---|---|---|
| p4d.24xlarge | 8× A100 40GB | 4 | 400 Gbps | Yes (intra-node) |
| p4de.24xlarge | 8× A100 80GB | 4 | 400 Gbps | Yes |
| p5.48xlarge | 8× H100 80GB | 32 | 3200 Gbps | Yes |
| p5e.48xlarge | 8× H200 141GB | 32 | 3200 Gbps | Yes |
| p5en.48xlarge | 8× H200 141GB | 32 | 3200 Gbps | Yes |
| p6-b200.48xlarge | 8× B200 | 32 | 3200 Gbps | Yes |
| trn1.32xlarge | 16× Trainium | 8 | 800 Gbps | N/A |
| trn2.48xlarge | 16× Trainium2 | 16 | 1600 Gbps | N/A |

Topology: How NCCL Uses EFA

Within a node, GPUs communicate via NVSwitch (NVLink). Across nodes, NCCL maps EFA interfaces to GPUs:

  • p4d (4 EFA, 8 GPU): Each EFA serves 2 GPUs. NCCL uses 4 network channels.
  • p5 (32 EFA, 8 GPU): Each GPU gets 4 dedicated EFA interfaces. NCCL uses up to 32 channels.

NCCL auto-detects this topology. The number of channels (NCCL_MIN_NCHANNELS) defaults based on EFA count — don't override unless benchmarking shows improvement.
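The mapping above is just the ratio of interface count to GPU count; a quick sketch using the numbers from the table:

```shell
# EFA-to-GPU mapping derived from the instance specs above
gpus=8
p4d_efas=4; p5_efas=32
echo "p4d: $((gpus / p4d_efas)) GPUs share each EFA interface"
echo "p5:  $((p5_efas / gpus)) dedicated EFA interfaces per GPU"
# → p4d: 2 GPUs share each EFA interface
# → p5:  4 dedicated EFA interfaces per GPU
```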

How GPUDirect RDMA Works on EFA

  1. GPU allocates memory for send/receive buffers
  2. NCCL calls aws-ofi-nccl, which calls libfabric
  3. Libfabric's EFA provider registers GPU memory with the EFA device
  4. EFA hardware DMA-reads directly from GPU memory (no CPU copy)
  5. SRD packets are sent across the AWS fabric
  6. Remote EFA hardware DMA-writes directly into remote GPU memory

Requirements for GPUDirect RDMA:

  • NVIDIA peer memory module loaded (included in EKS GPU AMIs)
  • EFA driver with GPUDirect support (included in EKS AMIs ≥ 2023)
  • Huge Pages allocated for EFA internal buffers
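These prerequisites can be spot-checked from a shell on the node (e.g. a privileged debug pod). A sketch assuming standard Linux paths; whether each check passes depends on the AMI and instance type:

```shell
# Each check prints OK/MISSING instead of aborting, so all four always run.
check() { desc=$1; shift; "$@" >/dev/null 2>&1 && echo "OK: $desc" || echo "MISSING: $desc"; }

check "nvidia_peermem module"  grep -q nvidia_peermem /proc/modules
check "efa kernel module"      grep -qw efa /proc/modules
check "libfabric EFA provider" fi_info -p efa
check "2MiB huge pages"        awk '/HugePages_Total/ && $2 > 0 {f=1} END {exit !f}' /proc/meminfo
```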

EKS Node Configuration

eksctl

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ml-cluster
  region: us-west-2
  version: "1.33"
iam:
  withOIDC: true
availabilityZones: ["us-west-2a", "us-west-2b"]
managedNodeGroups:
  - name: gpu-efa
    instanceType: p5.48xlarge
    minSize: 0
    desiredCapacity: 2
    maxSize: 8
    availabilityZones: ["us-west-2a"]    # Single AZ for placement group
    volumeSize: 500
    privateNetworking: true
    efaEnabled: true                      # Handles SG, placement group, device plugin

When efaEnabled: true, eksctl automatically:

  1. Creates an EFA security group (all traffic between members)
  2. Creates a cluster placement group (co-locates instances)
  3. Deploys the AWS EFA device plugin DaemonSet
  4. Deploys the NVIDIA device plugin (Amazon Linux 2)
  5. Configures all EFA interfaces on the launch template

Terraform

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"
  cluster_name = "ml-cluster"
  enable_efa_support = true
  eks_managed_node_groups = {
    gpu-efa = {
      ami_type           = "AL2023_x86_64_NVIDIA"
      instance_types     = ["p5.48xlarge"]
      enable_efa_support = true
      subnet_ids         = [module.vpc.private_subnets[0]]  # Single AZ
    }
  }
}

EFA Device Plugin

The AWS EFA Kubernetes device plugin exposes vpc.amazonaws.com/efa as a schedulable resource.

# Verify
kubectl get ds -n kube-system aws-efa-k8s-device-plugin-daemonset
kubectl describe node <node> | grep "vpc.amazonaws.com/efa"
# → vpc.amazonaws.com/efa: 32   (p5.48xlarge)

For p6-b200 instances, use device plugin v0.5.6 or later.

Pod Resource Requests

When using Ray Train via RayJob/RayCluster CRDs, the worker pod template should request all EFA interfaces:

resources:
  requests:
    nvidia.com/gpu: "8"
    vpc.amazonaws.com/efa: "32"        # All 32 on p5
    hugepages-2Mi: "5120Mi"
    memory: "128Gi"
  limits:
    nvidia.com/gpu: "8"
    vpc.amazonaws.com/efa: "32"
    hugepages-2Mi: "5120Mi"
    memory: "128Gi"

Always request all EFA interfaces — partial allocation causes suboptimal NCCL topology mapping and degraded performance.
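A quick way to catch partial allocation is to compare the pod request against the node's allocatable count. A sketch with values hard-coded for p5; on a live cluster, read the node-side number with the kubectl command in the comment:

```shell
# kubectl get node <node> -o jsonpath="{.status.allocatable['vpc\.amazonaws\.com/efa']}"
node_efa=32   # allocatable EFA interfaces on p5.48xlarge
pod_efa=32    # vpc.amazonaws.com/efa request from the pod spec above
if [ "$pod_efa" -eq "$node_efa" ]; then
  echo "full EFA allocation"
else
  echo "partial allocation ($pod_efa/$node_efa) -- expect degraded NCCL topology"
fi
# → full EFA allocation
```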

Security Groups

EFA requires a security group allowing all traffic between members:

Inbound:  All protocols, all ports, source = self (same SG)
Outbound: All protocols, all ports, destination = self (same SG)

This is separate from the EKS cluster security group. Both must be attached to EFA nodes. eksctl and the Terraform EKS module create this automatically with efaEnabled/enable_efa_support.

Placement Groups

All EFA nodes must be in a cluster placement group in a single Availability Zone. This ensures instances are physically co-located on the same network spine for lowest latency.

aws ec2 create-placement-group --group-name ml-pg --strategy cluster

Capacity limitations: Cluster placement groups can fail to launch if insufficient capacity exists in the AZ. Use minSize: 0 and scale up on demand.

Huge Pages

EFA requires 2MiB Huge Pages for internal buffers. EKS GPU AMIs pre-allocate 5128 × 2MiB (~10 GiB).

Pods must request Huge Pages and mount the hugepages volume:

resources:
  requests:
    hugepages-2Mi: "5120Mi"
  limits:
    hugepages-2Mi: "5120Mi"
volumes:
  - name: hugepages
    emptyDir:
      medium: HugePages
volumeMounts:
  - name: hugepages
    mountPath: /dev/hugepages

For Bottlerocket nodes, configure via settings:

bottlerocket:
  settings:
    kernel:
      sysctl:
        "vm.nr_hugepages": "5128"
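The numbers line up: the node-level pool of 5128 × 2MiB pages covers the 5120Mi pod request. A quick arithmetic check:

```shell
nr_hugepages=5128               # 2MiB pages pre-allocated by the AMI / Bottlerocket sysctl
pod_request_mi=5120             # hugepages-2Mi request from the pod spec
pool_mi=$((nr_hugepages * 2))   # total pool in MiB
echo "pool ${pool_mi}Mi vs request ${pod_request_mi}Mi"
[ "$pool_mi" -ge "$pod_request_mi" ] && echo "request fits" || echo "request cannot be satisfied"
# → pool 10256Mi vs request 5120Mi
# → request fits
```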

NCCL over EFA: aws-ofi-nccl

The aws-ofi-nccl plugin bridges NCCL and libfabric. It's pre-installed in:

  • NVIDIA NGC containers (nvcr.io/nvidia/pytorch:*)
  • AWS Deep Learning Containers (763104351884.dkr.ecr.*.amazonaws.com/pytorch-training:*)

Environment Variables

Modern software stacks (aws-ofi-nccl ≥ 1.7.0, libfabric ≥ 1.18.0) require minimal configuration — most legacy env vars are auto-detected:

| Variable | Status | Notes |
|---|---|---|
| FI_PROVIDER=efa | Not needed | Auto-detected by libfabric |
| FI_EFA_USE_DEVICE_RDMA=1 | Not needed (libfabric ≥ 1.18.0) | Was needed for older stacks; harmless to set |
| FI_EFA_FORK_SAFE=1 | Not needed (aws-ofi-nccl ≥ 1.7.0) | Legacy |
| NCCL_MIN_NCHANNELS | Leave at default | Auto-set based on NIC count (8 for p4d, higher for p5) |
| NCCL_ALGO | Optional | Override collective algorithm (Tree, Ring, CollnetDirect, CollnetChain, NVLS) |
| NCCL_PROTO | Optional | Override protocol (Simple, LL, LL128) |
| NCCL_CROSS_NIC | 0 (default) | Set to 1 to allow cross-NIC communication patterns |
| NCCL_NET_GDR_LEVEL | Auto | GPUDirect RDMA level; auto-detected |
| NCCL_DEBUG | WARN (default) | Set to INFO for debugging, TRACE for verbose |
| NCCL_TOPO_DUMP_FILE | Not set | Dump detected topology to a file for inspection |
| NCCL_SOCKET_IFNAME | Auto | Network interface for OOB (out-of-band) communication |
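On a current stack, a debugging session therefore only needs a couple of explicit settings; a minimal sketch:

```shell
export NCCL_DEBUG=INFO                          # surfaces provider-selection and topology lines
export NCCL_TOPO_DUMP_FILE=/tmp/nccl-topo.xml   # dump detected topology for later inspection
# Deliberately NOT set: FI_PROVIDER, FI_EFA_USE_DEVICE_RDMA, NCCL_MIN_NCHANNELS --
# auto-detected on aws-ofi-nccl >= 1.7.0 / libfabric >= 1.18.0.
echo "NCCL_DEBUG=$NCCL_DEBUG"
# → NCCL_DEBUG=INFO
```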

Container Image Requirements

Your training container must include:

  1. NCCL (typically bundled with PyTorch/CUDA)
  2. aws-ofi-nccl plugin
  3. libfabric with EFA provider
  4. EFA installer components (efa_installer or individual packages)

AWS Deep Learning Containers and NGC PyTorch containers include all of these. If building a custom image:

FROM nvcr.io/nvidia/pytorch:24.12-py3
# EFA components already included in NGC containers

# Or for custom builds:
# RUN apt-get update && apt-get install -y libfabric-dev
# RUN git clone https://github.com/aws/aws-ofi-nccl && cd aws-ofi-nccl && ...

Verifying EFA is Active

In NCCL debug output (NCCL_DEBUG=INFO), look for:

NCCL INFO NET/OFI Using aws-ofi-nccl ...
NCCL INFO NET/OFI Selected Provider is efa

Bad signs — EFA not being used:

  • Selected Provider is sockets or tcp → EFA driver not available
  • No EFA devices found → Device plugin not running or EFA interfaces not allocated
  • Bandwidth in NCCL all-reduce test ≪ expected → Check placement group, security group
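The check boils down to a grep over worker logs. A sketch run against a sample log line; on a cluster you would pipe `kubectl logs <pod>` into the same grep:

```shell
# Sample of the line a healthy EFA setup emits under NCCL_DEBUG=INFO
log='node0:123:123 [0] NCCL INFO NET/OFI Selected Provider is efa'
if echo "$log" | grep -q 'Selected Provider is efa'; then
  echo "EFA transport active"
else
  echo "EFA NOT selected -- check device plugin and EFA resource requests"
fi
# → EFA transport active
```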

Expected Bandwidth

| Instance | Message Size | Expected Bus BW | Notes |
|---|---|---|---|
| p4d.24xlarge | 1 GB+ | ~50-80 GB/s | 4 EFA × 100 Gbps |
| p5.48xlarge | 1 GB+ | ~300-400 GB/s | 32 EFA × 100 Gbps |

If observed bandwidth is ~10-25 GB/s (TCP-level), EFA is not being used correctly.
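The figures above follow from converting aggregate line rate from bits to bytes; a back-of-envelope sketch (the `observed` value is a hypothetical nccl-tests result, and note that bus BW is a normalized metric that can sit near or above raw line rate):

```shell
p4d_line_rate=$((400 / 8))    # GB/s from 400 Gbps aggregate
p5_line_rate=$((3200 / 8))    # GB/s from 3200 Gbps aggregate
echo "p4d: ${p4d_line_rate} GB/s line rate, p5: ${p5_line_rate} GB/s line rate"
# → p4d: 50 GB/s line rate, p5: 400 GB/s line rate

observed=20   # hypothetical all-reduce bus BW on p5, GB/s
[ "$observed" -le 25 ] && echo "TCP-range bandwidth -- EFA likely not in use"
# → TCP-range bandwidth -- EFA likely not in use
```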

EFA-Only Interfaces

For p5/p5e/trn2, interfaces on network cards 1+ can be EFA-only (no IP stack):

  • Lower overhead — no TCP/IP processing on RDMA paths
  • All bandwidth dedicated to collective operations
  • Requires: VPC CNI ≥ 1.18.5, Amazon Linux 2 AMI ≥ v20240928
  • Cannot be configured via eksctl — requires custom launch template with InterfaceType: efa-only

Communication Backends: Libfabric vs UCX

Two communication libraries sit between applications and EFA hardware. Understanding their differences matters for configuring NCCL (training), NixlConnector (disaggregated serving), and DeepEP (MOE all2all).

Libfabric (OFI)

Libfabric is the native communication library for EFA. The aws-ofi-nccl plugin bridges NCCL → libfabric → EFA.

| Aspect | Details |
|---|---|
| EFA integration | Native — EFA provider is maintained by AWS in the libfabric tree |
| GPUDirect RDMA | Supported via FI_EFA_USE_DEVICE_RDMA (auto-enabled on libfabric ≥ 1.18.0) |
| Used by | NCCL (via aws-ofi-nccl), training frameworks (PyTorch DDP, FSDP, Megatron-LM) |
| Transport selection | Automatic — libfabric detects the EFA provider. No manual transport config needed. |
| Key env vars | FI_PROVIDER (auto), FI_EFA_USE_DEVICE_RDMA (auto ≥ 1.18.0), FI_EFA_FORK_SAFE (auto ≥ aws-ofi-nccl 1.7.0) |
| Multi-path | SRD protocol provides built-in multi-path routing across AWS fabric |

UCX (Unified Communication X)

UCX is a general-purpose communication library supporting multiple transports. It's the default backend for NIXL/NixlConnector in vLLM disaggregated serving.

| Aspect | Details |
|---|---|
| EFA integration | Via libfabric shim — UCX calls libfabric's EFA provider internally |
| GPUDirect RDMA | Supported when kv_buffer_device="cuda" and EFA GPUDirect is available |
| Used by | vLLM NixlConnector, NIXL library, some MPI implementations |
| Transport selection | UCX_TLS=all auto-selects best available (RDMA > shared memory > TCP). Can specify rc,ud,sm for InfiniBand, or let auto-detection find EFA via libfabric. |
| Key env vars | UCX_TLS (transport selection), UCX_NET_DEVICES (NIC selection), UCX_LOG_LEVEL (debugging) |
| Multi-path | Relies on underlying provider — gets SRD multi-path when using EFA |

Which Backend for What

| Workload | Backend | Why |
|---|---|---|
| NCCL collectives (training, all-reduce) | Libfabric (via aws-ofi-nccl) | Native EFA support, mature, AWS-optimized |
| KV cache transfer (NixlConnector PD) | UCX (via NIXL) | NIXL uses UCX as default; async send/recv model fits PD pattern |
| MOE all2all (DeepEP, pplx) | Libfabric or UCX | Depends on backend: deepep_* uses its own transport; pplx uses NCCL → libfabric |
| Custom connectors | Either | P2pNcclConnector uses NCCL (→ libfabric); LMCache uses NIXL (→ UCX) |

Key Difference: NCCL env vars don't apply to NixlConnector

When using NixlConnector for PD serving, NCCL_IB_HCA, NCCL_SOCKET_IFNAME, etc. have no effect. Configure UCX variables instead:

| PD Serving (NixlConnector) | Training (NCCL) |
|---|---|
| UCX_TLS=all | FI_PROVIDER=efa (auto) |
| UCX_NET_DEVICES=all | NCCL_SOCKET_IFNAME (auto) |
| UCX_LOG_LEVEL=info (debug) | NCCL_DEBUG=INFO (debug) |
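For a NixlConnector deployment, that means exporting UCX variables in the serving pods. A sketch with the permissive defaults from the table made explicit:

```shell
export UCX_TLS=all           # auto-select transport; EFA is found via libfabric
export UCX_NET_DEVICES=all   # consider every available NIC
export UCX_LOG_LEVEL=info    # log which transport UCX actually picked
echo "UCX_TLS=$UCX_TLS"
# → UCX_TLS=all
```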

EFA vs InfiniBand for Disaggregated Serving and MOE

EFA and InfiniBand both provide RDMA for KV cache transfer and expert all2all, but differ fundamentally in architecture:

Protocol and Routing

| | EFA (SRD) | InfiniBand (IB) |
|---|---|---|
| Protocol | Scalable Reliable Datagram — connectionless, multi-path | Queue Pair — connection-oriented, single-path per QP |
| Routing | Packet spraying across AWS fabric — automatic load balancing | Subnet-managed routing — deterministic paths, manual ECMP for multi-path |
| Congestion control | Built into SRD, tuned for ML collectives | Hardware-based (ECN), requires switch configuration |
| Topology | Flat — any instance can reach any other at full bandwidth (within placement group) | Fat-tree or rail-optimized — bandwidth depends on switch tiers |

Implications for PD Serving

| Concern | EFA | InfiniBand |
|---|---|---|
| KV transfer bandwidth | Consistent — SRD multi-path prevents hotspots. p5: 3200 Gbps aggregate. | High peak, but single-QP transfers use one path. Multiple QPs or RDMA-CM needed for multi-path. |
| Tail latency | Lower variance — packet spraying smooths bursts | Lower absolute minimum latency, but higher variance under contention |
| Scaling | Easier — no subnet manager, no switch config. Placement groups handle locality. | Requires subnet manager, careful topology planning, switch firmware management |
| GPUDirect RDMA | Supported on p4d/p5/p5e (NVIDIA peer memory + EFA driver) | Native support on all RDMA NICs (ConnectX-6/7) |
| NixlConnector | UCX auto-detects EFA via libfabric. UCX_TLS=all works. | UCX native IB support. UCX_TLS=rc for reliable connected. UCX_NET_DEVICES=mlx5_0:1. |

Implications for MOE All2All

| Concern | EFA | InfiniBand |
|---|---|---|
| All2all pattern | Good — SRD handles many-to-many well due to packet spraying | Good — but all2all generates N² flows; fat-tree bandwidth must support it |
| DeepEP backends | deepep_high_throughput and deepep_low_latency work over EFA via NCCL → libfabric | Native IB support; deepep_low_latency benefits from IB's lower absolute latency |
| pplx backend | Works via NCCL → aws-ofi-nccl → libfabric → EFA | Works via NCCL → IB verbs directly |
| Expert migration (EPLB) | Fast — equal bandwidth to all nodes in placement group | Depends on topology — migration between nodes on different switches may be slower |

When EFA Wins

  • Cloud-native deployments — no switch management, no subnet manager, automatic scaling
  • Large clusters — SRD multi-path scales better than fat-tree at 100+ nodes
  • Mixed workloads — PD + MOE + training can coexist on EFA without careful traffic engineering

When InfiniBand Wins

  • Absolute lowest latency — IB RDMA has ~1-2 μs vs EFA's ~3-5 μs for small messages
  • On-prem — IB is the standard; EFA is AWS-only
  • Established tooling — IB has decades of RDMA ecosystem (perftest, ibstat, opensm)

Concurrent Traffic Patterns: PD + MOE on EFA

MOE models with disaggregated serving generate overlapping network traffic:

| Traffic Type | Pattern | Bandwidth Need | EFA Interfaces Used |
|---|---|---|---|
| KV cache transfer (PD) | Point-to-point: prefill → decode | High burst (GB-scale per request) | UCX selects available EFA NICs |
| All2all (MOE expert routing) | Many-to-many: all GPUs exchange tokens | Sustained, proportional to batch × experts | NCCL maps NICs to GPUs |
| NCCL collectives (attention layers) | AllReduce within TP group | Moderate | Shared with all2all |

Bandwidth Planning

| Instance | Aggregate BW | KV Transfer Budget | All2All Budget | Sufficient? |
|---|---|---|---|---|
| p4d (4 EFA, 400 Gbps) | ~50 GB/s | ~15 GB/s | ~35 GB/s | ⚠️ Tight for large MOE + PD |
| p5 (32 EFA, 3200 Gbps) | ~400 GB/s | ~50 GB/s | ~350 GB/s | ✅ Comfortable |
| p5e (32 EFA, 3200 Gbps) | ~400 GB/s | ~50 GB/s | ~350 GB/s | ✅ Comfortable |

On p4d, if running MOE + PD simultaneously, KV transfer and all2all contend for the same 4 EFA interfaces. Monitor with NCCL_DEBUG=INFO and UCX_LOG_LEVEL=info to identify bottlenecks. Consider dedicating specific EFA interfaces to each traffic type via UCX_NET_DEVICES and NCCL_IB_HCA if contention is observed.
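If contention does show up, the split might look like the sketch below. The rdmap* device names are hypothetical placeholders; enumerate the real ones on the node with `fi_info -p efa` or `ibv_devices`.

```shell
# KV-transfer pods (NixlConnector): pin UCX to a subset of NICs
export UCX_NET_DEVICES=rdmap16s27:1,rdmap17s27:1   # hypothetical device names

# Collective/all2all pods: steer NCCL to the remaining NICs
export NCCL_IB_HCA=rdmap18s27,rdmap19s27           # hypothetical device names

echo "UCX NICs: $UCX_NET_DEVICES / NCCL NICs: $NCCL_IB_HCA"
```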

References

Cross-References

  • aws-fsx — FSx storage for training data on EFA-enabled nodes
  • pytorch — PyTorch distributed training over EFA
  • fsdp — FSDP distributed training using EFA for all-reduce
  • megatron-lm — Megatron-LM multi-node training over EFA
  • ray-train — Ray Train distributed jobs on EFA-enabled clusters
  • vllm — vLLM disaggregated serving using EFA for KV cache transfer
  • ray-serve — Ray Serve PD serving on EFA-enabled clusters
  • gpu-operator — GPU driver and GPUDirect RDMA support
  • nccl — NCCL communication via EFA transport (aws-ofi-nccl)
  • deepspeed — DeepSpeed multi-node training over EFA
  • kubeflow-trainer — Orchestrate EFA-enabled training jobs on K8s
  • karpenter — Provision EFA-enabled GPU instances

Reference

  • EFA docs
  • EFA EKS best practices
  • aws-ofi-nccl GitHub
  • references/troubleshooting.md — NCCL debug, diagnostics, bandwidth expectations
  • scripts/check_efa.sh — verify EFA device availability, libfabric provider, GPU, and NCCL config on a node
  • assets/architecture.md — EFA network stack, GPUDirect RDMA flow, and EKS multi-node topology diagrams