aws-fsx

@tylertitsworth/aws-fsx
Updated 4/7/2026

FSx for Lustre — performance tuning, striping, S3 data repositories, EKS integration. Use when configuring high-performance storage for ML on EKS. NOT for EBS or EFS.

Installation

npx agent-skills-cli install @tylertitsworth/aws-fsx

Details

Path: aws-fsx/SKILL.md
Branch: main
Scoped Name: @tylertitsworth/aws-fsx

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

npx agent-skills-cli list

Skill Instructions


name: aws-fsx
description: "FSx for Lustre: performance tuning, striping, S3 data repositories, EKS integration. Use when configuring high-performance storage for ML on EKS. NOT for EBS or EFS."

AWS FSx for Lustre

Architecture

An FSx for Lustre filesystem consists of:

  • Metadata Targets (MDTs): Store file metadata — names, timestamps, permissions, directory structure, file layouts. Hosted on metadata servers (MDS).
  • Object Storage Targets (OSTs): Store actual file data. Each OST is backed by a disk (SSD or HDD). Files are striped across OSTs for parallel I/O.
  • File servers: In-memory cache layer in front of OSTs. Hot data is served from cache (network-limited), cold data from disk.
Client (pod) ─── Lustre client ─── File servers ─── OSTs (data)
                                      │                └── SSD or HDD disks
                                      └── MDT (metadata)

Every client mounts the full filesystem and communicates directly with the file servers hosting the relevant OSTs — no single server bottleneck.

Read/Write Paths

  • Cached read: Client → file server in-memory/SSD cache → network-limited throughput
  • Uncached read: Client → file server → disk → limited by lower of network and disk throughput
  • Write: Client → file server → disk → limited by lower of network and disk throughput
  • S3 lazy load: First access to an S3-backed file triggers a fetch from S3 → stored on OSTs → subsequent reads from OSTs

Client Throughput Limits

| Client Network Interface | Max Throughput per Client |
| --- | --- |
| Standard ENA | 100 Gbps (up to 5 Gbps to any single object storage server) |
| EFA | 700 Gbps |
| EFA with GPUDirect Storage (GDS) | 1200 Gbps |

For filesystems with >10 GB/s throughput capacity, AWS recommends EFA-enabled clients. GDS allows GPU memory to read/write directly to Lustre storage without CPU copies.
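
The table above is in gigabits per second, while the filesystem threshold is in gigabytes per second. As a rough aid for comparing the two, here is a minimal sketch of the conversion (decimal units, 8 bits per byte). The helper name `gbps_to_gbytes` is made up for this example.

```shell
# Convert a network rate in Gbps to GB/s (8 bits per byte, decimal units).
# gbps_to_gbytes is a hypothetical helper name, not part of any AWS tooling.
gbps_to_gbytes() {
  awk -v gbps="$1" 'BEGIN { printf "%g\n", gbps / 8 }'
}

# Print the per-client ceilings from the table in GB/s.
for rate in 100 700 1200; do
  printf '%s Gbps = %s GB/s\n' "$rate" "$(gbps_to_gbytes "$rate")"
done
```

By this arithmetic a standard ENA client tops out at 12.5 GB/s, only slightly above the 10 GB/s threshold.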

Deployment Types

| Type | Durability | Throughput | Min Size | Use Case |
| --- | --- | --- | --- | --- |
| SCRATCH_1 | None | 200 MB/s per TiB (burst) | 1.2 TiB | Ephemeral training data |
| SCRATCH_2 | None | 200 MB/s per TiB (burst) | 1.2 TiB | Better networking than SCRATCH_1 |
| PERSISTENT_1 | In-AZ replication | 50–200 MB/s per TiB | 1.2 TiB | Longer-lived workloads |
| PERSISTENT_2 | In-AZ replication | 125–1000 MB/s per TiB | 1.2 TiB | Production, highest throughput |

SCRATCH filesystems have no data replication — data is lost if hardware fails. Ideal for training jobs where data is re-derivable from S3.

PERSISTENT_2 supports perUnitStorageThroughput of 125, 250, 500, or 1000 MB/s per TiB. 1000 MB/s requires SSD storage.

Storage sizing: Minimum 1.2 TiB, then increments of 2.4 TiB. Throughput scales linearly with size.
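
Because throughput scales linearly with size, the aggregate baseline for a filesystem can be estimated with simple arithmetic. The sketch below assumes capacity is given in GiB using the API convention where "1.2 TiB" is requested as 1200 GiB. The function name `fsx_throughput_mbps` is hypothetical.

```shell
# Estimate aggregate baseline throughput (MB/s) for a given capacity.
# Assumption: capacity is in GiB, where 1200 GiB corresponds to "1.2 TiB".
fsx_throughput_mbps() {
  capacity_gib=$1
  per_unit=$2   # MB/s per TiB, e.g. 125, 250, 500, 1000

  # Valid sizes per the sizing rule above: 1200 GiB, or a positive multiple of 2400 GiB.
  if [ "$capacity_gib" -ne 1200 ] &&
     { [ "$capacity_gib" -le 0 ] || [ $((capacity_gib % 2400)) -ne 0 ]; }; then
    echo "invalid capacity: $capacity_gib GiB" >&2
    return 1
  fi

  echo $((capacity_gib * per_unit / 1000))
}

# 4.8 TiB at 250 MB/s per TiB
fsx_throughput_mbps 4800 250
```

Working backwards from a target throughput gives the smallest capacity and `perUnitStorageThroughput` combination that meets it.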

Striping

Lustre splits files into chunks distributed across multiple OSTs. This is the primary performance lever for large file I/O.

Default Progressive File Layout (PFL)

Filesystems created after August 2023 use a 4-component PFL:

| File Size | Stripe Count | Effect |
| --- | --- | --- |
| ≤ 100 MiB | 1 | Single OST, no overhead |
| 100 MiB – 10 GiB | 8 | Parallel I/O across 8 OSTs |
| 10 GiB – 100 GiB | 16 | Higher parallelism |
| > 100 GiB | 32 | Maximum parallelism |
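
To reason about which component of the default layout a file reaches, the table can be expressed as a small lookup. This is only a sketch of the table above, not a query against a real filesystem (use `lfs getstripe` for that). The function name `pfl_stripe_count` is hypothetical, and it reports the stripe count of the widest component a file of the given size reaches.

```shell
# Stripe count of the widest default-PFL component reached by a file
# of the given size in MiB, following the table above.
pfl_stripe_count() {
  size_mib=$1
  if   [ "$size_mib" -le 100 ];    then echo 1    # up to 100 MiB
  elif [ "$size_mib" -le 10240 ];  then echo 8    # up to 10 GiB
  elif [ "$size_mib" -le 102400 ]; then echo 16   # up to 100 GiB
  else                                  echo 32   # beyond 100 GiB
  fi
}

pfl_stripe_count 51200   # a 50 GiB checkpoint
```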

Custom Striping

# Set stripe count on a directory (applies to new files)
lfs setstripe -c 16 /mount/training-data/

# Set PFL for a directory
lfs setstripe -E 100M -c 1 -E 10G -c 8 -E 100G -c 16 -E -1 -c 32 /mount/data/

# Check file layout
lfs getstripe /mount/data/large-file.bin

# Migrate existing file to new layout
lfs migrate -c 32 /mount/data/existing-file.bin

# View OST usage
lfs df -h /mount/

Striping Guidelines

  • Large files (>1 GiB): Higher stripe count improves throughput. Stripe across many OSTs.
  • Small files (<100 MiB): Stripe count of 1. Higher counts add metadata overhead (network round-trip per OST in layout).
  • Stripe count -1: Stripe across all OSTs. Use for largest files.
  • Stripe size: Default 1 MiB. Rarely needs changing.
  • Don't set high stripe counts on directories with many small files — metadata overhead degrades performance.

What Striping Can't Fix

  • Metadata-heavy workloads (millions of tiny files, ls on huge directories): Limited by MDT IOPS, not striping.
  • Single-threaded sequential reads: Limited by single OST throughput. Application must use parallel I/O.
  • Random small I/O: Lustre is optimized for large sequential I/O. Small random reads/writes are limited by latency.
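
For the single-threaded case, one way an application or script can issue parallel I/O is to read disjoint byte ranges of the same file concurrently. The following is a minimal sketch using `dd` and `xargs -P`. The function name `parallel_read` and its arguments are hypothetical, and the chunk size should be tuned to the file's stripe size rather than taken from this example.

```shell
# parallel_read FILE JOBS CHUNK_MIB
# Reads FILE in CHUNK_MIB-sized ranges using up to JOBS concurrent dd
# processes and prints the total number of bytes read.
parallel_read() {
  file=$1; jobs=$2; chunk_mib=$3
  size=$(wc -c < "$file")
  chunk=$((chunk_mib * 1024 * 1024))
  nchunks=$(( (size + chunk - 1) / chunk ))   # round up to cover the tail
  [ "$nchunks" -gt 0 ] || { echo 0; return 0; }

  # Each worker reads exactly one chunk (block index = skip) and reports its byte count.
  seq 0 $((nchunks - 1)) |
    xargs -P "$jobs" -I{} sh -c \
      'dd if="$1" bs="$2" skip="$3" count=1 2>/dev/null | wc -c' _ "$file" "$chunk" {} |
    awk '{ total += $1 } END { print total + 0 }'
}
```

On a striped file, concurrent range reads like these can be served by different OSTs, which is what lets striping translate into throughput. A data loader with multiple worker processes achieves the same effect.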

Metadata IOPS

Metadata operations (file create, open, close, delete, directory operations) are limited by MDT performance.

PERSISTENT_2: User-Provisioned Metadata IOPS

| Operation | Operations per Second per Provisioned Metadata IOPS |
| --- | --- |
| File create, open, close | 2 |
| File delete | 1 |
| Directory create, rename | 0.1 |
| Directory delete | 0.2 |

Valid provisioned values: 1500, 3000, 6000, 12000, and multiples of 12000 up to 192000.
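
To see what a provisioned level means in practice, the multipliers in the table can be applied to a chosen value. The sketch below also rejects values outside the valid set listed above. The function name `metadata_ops` and its output keys are hypothetical.

```shell
# Approximate operations/second for a provisioned metadata IOPS level,
# using the per-operation multipliers from the table above.
metadata_ops() {
  iops=$1
  case "$iops" in
    1500|3000|6000|12000) ;;
    *)
      # Otherwise it must be a multiple of 12000, up to 192000.
      if [ "$iops" -le 0 ] || [ "$iops" -gt 192000 ] ||
         [ $((iops % 12000)) -ne 0 ]; then
        echo "invalid metadata IOPS value: $iops" >&2
        return 1
      fi ;;
  esac
  echo "file_create_open_close=$((iops * 2))"
  echo "file_delete=$iops"
  echo "dir_create_rename=$((iops / 10))"
  echo "dir_delete=$((iops / 5))"
}

metadata_ops 12000
```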

SSD: Automatic Mode

| Storage Capacity | Included Metadata IOPS |
| --- | --- |
| 1.2 TiB | 1500 |
| 2.4 TiB | 3000 |
| 4.8–9.6 TiB | 6000 |
| 12–45.6 TiB | 12000 |
| ≥48 TiB | 12000 per 24 TiB |

ML Workload Implications

  • Training data loading: Mostly sequential reads of large files → limited by OST throughput, not metadata. Striping helps.
  • Checkpoint saving: Large sequential writes → striping helps. But initial file creation hits MDT.
  • Preprocessing with many small files: Can bottleneck on metadata IOPS. Consider pre-aggregating into fewer large files (TFRecord, WebDataset, etc.).
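
As one illustration of pre-aggregation, many small files can be packed into a handful of tar shards (the container format WebDataset reads) before they are placed on the filesystem. This is a minimal sketch, assuming file names contain no newlines. The function name `shard_dir` and the `shard-` naming scheme are hypothetical.

```shell
# shard_dir SRC_DIR OUT_DIR FILES_PER_SHARD
# Packs the files under SRC_DIR into tar shards of FILES_PER_SHARD files each.
# Assumes file names contain no newlines.
shard_dir() {
  src=$(cd "$1" && pwd) || return 1
  mkdir -p "$2" || return 1
  out=$(cd "$2" && pwd) || return 1
  per=$3
  list=$(mktemp) || return 1

  # Stable, sorted list of relative paths.
  (cd "$src" && find . -type f | sort) > "$list"

  # One list file per shard: shard-aaaa, shard-aaab, ...
  split -l "$per" -a 4 "$list" "$out/shard-"

  for part in "$out"/shard-????; do
    [ -f "$part" ] || continue   # nothing to pack: the glob stayed literal
    tar -C "$src" -cf "$part.tar" -T "$part" && rm -f "$part"
  done
  rm -f "$list"
}
```

Each shard then costs one file create and open on the MDT instead of one per sample, and reads become large and sequential.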

Data Repository Associations (DRA)

DRAs are the mechanism for linking FSx for Lustre to S3 buckets. Each DRA maps a filesystem path to an S3 prefix, creating a bidirectional data bridge.

Key Properties

  • Maximum 8 DRAs per filesystem
  • Each DRA maps one filesystem path to one S3 prefix (1:1)
  • Paths cannot overlap (e.g., /ns1/ and /ns1/subdir/ cannot coexist)
  • / as filesystem path is only allowed for the first DRA
  • Only one DRA request processed at a time (others queue)
  • Not available on Scratch 1 or Lustre 2.10 filesystems
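
The no-overlap rule is easy to violate when DRA paths are generated by tooling, so it can help to check candidate paths up front. The sketch below treats two paths as overlapping when one is equal to, or nested inside, the other. The function names `normalize_path` and `paths_overlap` are hypothetical, and the service performs its own validation regardless.

```shell
# Normalize a filesystem path to exactly one leading and one trailing slash.
normalize_path() {
  printf '%s\n' "$1" | sed -e 's#/\{1,\}#/#g' -e 's#^/*#/#' -e 's#/*$#/#'
}

# Returns 0 (true) if one path is equal to, or nested inside, the other.
paths_overlap() {
  a=$(normalize_path "$1")
  b=$(normalize_path "$2")
  case "$a" in "$b"*) return 0 ;; esac
  case "$b" in "$a"*) return 0 ;; esac
  return 1
}

if paths_overlap /ns1/ /ns1/subdir/; then
  echo "overlap: /ns1/ and /ns1/subdir/ cannot coexist"
fi
```

Comparing with a trailing slash keeps sibling paths such as `/ns1/` and `/ns10/` from being reported as overlapping.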

How DRAs Work

  1. Metadata import: S3 object metadata (name, size, timestamps) is loaded into the MDT. File data is not copied — it's lazy-loaded on first access.
  2. First read: Triggers an HSM (Hierarchical Storage Management) restore from S3 → data fetched and cached on OSTs.
  3. Subsequent reads: Served from OSTs (no S3 latency).
  4. Auto-import: S3 changes automatically reflected in Lustre metadata.
  5. Auto-export: Filesystem changes automatically written back to S3 (asynchronous).

DRA Configuration

| Setting | Options | Default (Console) | Default (CLI/API) |
| --- | --- | --- | --- |
| Import policy | None, New, Changed, Deleted (any combination) | New + Changed + Deleted | Disabled |
| Export policy | None, New, Changed, Deleted (any combination) | New + Changed + Deleted | Disabled |
| Import metadata on create | Yes / No | Yes | No |

Auto-import policies:

| Policy | Effect |
| --- | --- |
| None | No auto-import. Use import data repository tasks manually. |
| New | Import metadata for new S3 objects |
| Changed | Import metadata for modified S3 objects |
| Deleted | Remove metadata for deleted S3 objects |
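
Because the CLI/API defaults leave both policies disabled, a DRA created from the command line needs them spelled out. The sketch below only prints an `aws fsx create-data-repository-association` command for review and does not run it. The function name `dra_create_cmd` and every ID, path, and bucket passed to it are hypothetical placeholders.

```shell
# Prints (does not run) an AWS CLI command that creates a DRA with the
# given auto-import/auto-export events and imports metadata on create.
# All argument values used with this helper are placeholders.
dra_create_cmd() {
  fs_id=$1; fs_path=$2; s3_path=$3
  events=$4   # comma-separated, e.g. NEW,CHANGED,DELETED

  printf 'aws fsx create-data-repository-association \\\n'
  printf '  --file-system-id %s \\\n' "$fs_id"
  printf '  --file-system-path %s \\\n' "$fs_path"
  printf '  --data-repository-path %s \\\n' "$s3_path"
  printf '  --batch-import-meta-data-on-create \\\n'
  printf '  --s3 "AutoImportPolicy={Events=[%s]},AutoExportPolicy={Events=[%s]}"\n' \
    "$events" "$events"
}

dra_create_cmd fs-0123456789abcdef0 /ns1/ s3://example-bucket/datasets/ NEW,CHANGED,DELETED
```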

Auto-export: Exports regular files, symlinks, and directories. Does not export special files (FIFO, block, character, socket).

Conflict handling: If the same file is modified in both the filesystem and S3 simultaneously, there is no automatic conflict resolution. Application-level coordination is required.

Import and Export Data Repository Tasks

In addition to automatic import/export, you can run on-demand tasks:

  • Import task: Loads metadata for new/changed files from S3 into the filesystem
  • Export task: Exports file data and metadata to S3

Note: Auto-export and export tasks cannot be used simultaneously on the same filesystem. Auto-import and import tasks can be used simultaneously.

Cross-Region and Cross-Account

| Feature | Cross-Region | Cross-Account |
| --- | --- | --- |
| Auto-import | ❌ Same region only | ✅ Supported |
| Auto-export | ✅ Supported | ✅ Supported |
Pre-Warming Data

Lazy loading means first-epoch training has S3 latency on first access. Pre-warm with:

# Restore specific files from S3 archive to OSTs
lfs hsm_restore /mount/data/file1 /mount/data/file2

# Bulk restore a directory (8 restores in flight; nohup covers the whole pipeline)
nohup sh -c 'find /mount/data/ -type f -print0 | xargs -0 -n 1 -P 8 lfs hsm_restore' &
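
Restores run asynchronously, so it can be useful to check how much of a directory is still waiting on S3 before starting a job. The sketch below counts files whose `lfs hsm_state` output reports the `released` flag. The function name `count_released` and the `LFS` override variable are hypothetical, and the match assumes `released` is printed right after the hex state value, as in typical `hsm_state` output.

```shell
# Set LFS to override the lfs binary (e.g. for testing); defaults to lfs on PATH.
LFS=${LFS:-lfs}

# count_released DIR
# Prints how many files under DIR are still "released", i.e. their metadata
# is in Lustre but their data has not been restored from S3 yet.
count_released() {
  find "$1" -type f -print0 |
    xargs -0 -n 64 "$LFS" hsm_state |
    grep -c ') released'
}
```

A count of 0 for the dataset directory suggests the pre-warm has finished.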

ImportedFileChunkSize

Controls how S3-imported files are striped (default: 1 GiB). Files larger than this value are automatically striped across ceil(FileSize / ChunkSize) + 1 OSTs.

Intelligent-Tiering Storage Class

An alternative to SSD/HDD storage classes. Automatically tiers data across three access tiers:

| Tier | Latency | Cost | When Data Moves Here |
| --- | --- | --- | --- |
| Frequent Access | Sub-millisecond (SSD) | Highest | Recently accessed data |
| Infrequent Access | Low (HDD) | Lower | Data not accessed recently |
| Archive | Higher (retrieval delay) | Lowest | Rarely accessed data |

  • Optional SSD read cache for frequently accessed data
  • No minimum storage capacity — pay only for data stored
  • Starting at <$0.005/GB-month
  • Not available with Scratch deployments
  • Metadata IOPS: Only 6000 or 12000 (user-provisioned mode only, no automatic)

ML use case: Good for long-lived datasets where only a subset is actively used. Training data accessed during current epoch stays in Frequent Access tier; older experiment data tiers down automatically.

Compression

LZ4 transparent compression (dataCompressionType: LZ4) reduces storage costs and can improve throughput for compressible data. Applied to new files written after enabling.

Backups

  • Automatic backups: Daily, stored in S3 (11 9's durability). Retention 0-90 days.
  • Manual backups: On-demand via console/CLI/API.
  • Cross-region/cross-account backup copy: For disaster recovery and compliance.
  • Restore: Creates a new filesystem from backup.
  • Incremental: Only changed data since last backup.
  • Persistent filesystems only — Scratch filesystems do not support backups.

Backups can also be managed through AWS Backup for policy-based scheduling and retention.

Storage Quotas

User-level and group-level quotas to control storage consumption:

# Set block quota for a user (soft limit, hard limit)
lfs setquota -u username --block-softlimit 100G --block-hardlimit 120G /mount/

# Set the grace period users get after exceeding the soft limit
lfs setquota -t -u --block-grace 1w /mount/

# Set block quota for a group
lfs setquota -g groupname --block-softlimit 1T --block-hardlimit 1.2T /mount/

# Check quota usage
lfs quota -u username /mount/
lfs quota -g groupname /mount/

Useful for multi-team environments sharing a filesystem to prevent any team from exhausting storage.

Encryption

  • At rest: All filesystems encrypted. AWS managed key (default) or customer-managed KMS key.
  • In transit: Data moving between clients and file servers is encrypted. Available in select regions. Enabled automatically when the filesystem is accessed from EC2 instance types that support in-transit encryption.

GPUDirect Storage (GDS)

For EFA-enabled filesystems with NVIDIA GPUs, GDS enables direct data transfer between GPU memory and FSx for Lustre storage, bypassing CPU and system memory entirely.

  • Throughput: Up to 1200 Gbps per client (vs 700 Gbps with EFA alone)
  • Requirements: EFA-enabled filesystem, EFA-enabled GPU instance, NVIDIA GDS driver
  • Eliminates the CPU copy bottleneck for checkpoint writes and data loading

All Features Summary

Supported

  • POSIX filesystem semantics (with caveats below)
  • ReadWriteMany — multiple pods across nodes simultaneously
  • Data Repository Associations (DRA) — up to 8 S3 links per filesystem
  • Auto-import and auto-export with S3
  • Import/export data repository tasks (on-demand)
  • Progressive file layouts (PFL) with automatic striping
  • lfs CLI for stripe management, HSM operations, quotas
  • Transparent LZ4 compression
  • Encryption at rest (AWS managed or customer KMS keys)
  • Encryption in transit (select regions)
  • EFA (700 Gbps) and GPUDirect Storage (1200 Gbps) per-client throughput
  • Intelligent-Tiering storage class (automatic cost optimization)
  • Automatic and manual backups (persistent filesystems)
  • Cross-region and cross-account backup copy
  • Storage quotas (user and group level)
  • Lustre client on Amazon Linux 2, AL2023, Ubuntu, RHEL, CentOS, SUSE
  • EKS CSI driver (dynamic and static provisioning)
  • AWS Backup integration
  • CloudWatch metrics (throughput, IOPS, metadata ops, capacity)
  • Storage capacity scaling (increase online)

Not Supported / Limitations

| Limitation | Detail |
| --- | --- |
| Single AZ | FSx for Lustre filesystems exist in one subnet/AZ. No multi-AZ replication. |
| No NFS/SMB | Lustre protocol only. Requires Lustre client (kernel module). |
| No online resize down | Can increase capacity, cannot shrink. |
| No snapshots | No built-in snapshot capability (unlike EBS or FSx ONTAP). |
| Minimum size | 1.2 TiB minimum. Can't create small filesystems. |
| S3 lazy load latency | First access to uncached S3-backed files has S3 latency. |
| Metadata IOPS cap | Directory operations are slow relative to data I/O. Millions of tiny files suffer. |
| Client compatibility | Requires specific kernel versions with Lustre client module. |
| Hard link limit | Lustre has lower hard link limits than ext4/XFS. |
| No POSIX ACLs | Only basic Unix permissions (uid/gid/mode). |

EKS CSI Driver

The fsx.csi.aws.com CSI driver enables dynamic and static provisioning. Install as an EKS add-on (requires Pod Identity agent). Key StorageClass parameters:

| Parameter | Values | Effect |
| --- | --- | --- |
| subnetId | subnet ID | Required. Subnet for filesystem ENI |
| securityGroupIds | SG IDs | Security groups (must allow TCP 988 inbound) |
| deploymentType | SCRATCH_1, SCRATCH_2, PERSISTENT_1, PERSISTENT_2 | Durability/performance tier |
| perUnitStorageThroughput | 125–1000 | MB/s per TiB (PERSISTENT only) |
| dataCompressionType | NONE, LZ4 | Transparent compression |
| s3ImportPath | s3://bucket/prefix | S3 data repository source |
| autoImportPolicy | NONE, NEW, NEW_CHANGED, NEW_CHANGED_DELETED | Auto-import from S3 |
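
To show how these parameters fit together, the sketch below renders a StorageClass manifest for dynamic provisioning from a shell heredoc. It is not the repository's `assets/storageclass.yaml`. The class name `fsx-lustre`, the function name `render_storageclass`, and the default subnet and security group IDs are hypothetical placeholders to replace with real values.

```shell
# Placeholder IDs; override via the environment with real values.
SUBNET_ID=${SUBNET_ID:-subnet-0123456789abcdef0}
SG_ID=${SG_ID:-sg-0123456789abcdef0}

# Prints a StorageClass manifest that uses the parameters from the table above.
render_storageclass() {
  cat <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fsx-lustre
provisioner: fsx.csi.aws.com
parameters:
  subnetId: ${SUBNET_ID}
  securityGroupIds: ${SG_ID}
  deploymentType: PERSISTENT_2
  perUnitStorageThroughput: "250"
  dataCompressionType: LZ4
EOF
}

render_storageclass
```

The output can be reviewed and then piped to `kubectl apply -f -`. A PVC that references this class with `ReadWriteMany` access would then trigger creation of the filesystem.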

See references/troubleshooting.md for common issues.

References

Cross-References

  • aws-efa — EFA networking for maximum client throughput (700 Gbps+)
  • ray-train — Distributed training jobs consuming FSx-backed PVCs
  • kueue — Queue training jobs that mount FSx volumes
  • megatron-lm — Large-scale training with shared checkpoint storage
  • minio — S3-compatible alternative for smaller-scale storage
  • longhorn — Alternative distributed storage for non-Lustre workloads
  • kubeflow-trainer — FSx-backed PVCs for training jobs
  • gpu-operator — GPU nodes consuming FSx-mounted training data
  • deepspeed — DeepSpeed checkpoint storage on FSx

Reference

  • FSx for Lustre docs
  • FSx CSI driver
  • FSx pricing
  • references/troubleshooting.md — mount issues, performance, data repository tasks
  • assets/storageclass.yaml — FSx Lustre StorageClass with S3 data repository, LZ4 compression, and PVC example