Agent Skills
tylertitsworth

megatron-lm

@tylertitsworth/megatron-lm
tylertitsworth
0
0 forks
Updated 4/7/2026
View on GitHub

NVIDIA Megatron-LM — tensor/pipeline/sequence/expert/context parallelism, Megatron-Core composable library, Megatron FSDP, mixed precision (BF16/FP8), checkpoint conversion (Megatron Bridge), data loading, and MoE training. Use when training LLMs at scale with Megatron. NOT for inference.

Installation

$ npx agent-skills-cli install @tylertitsworth/megatron-lm
Supported assistants: Claude Code, Cursor, Copilot, Codex, Antigravity

Details

Path: megatron-lm/SKILL.md
Branch: main
Scoped Name: @tylertitsworth/megatron-lm

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

npx agent-skills-cli list

Skill Instructions


name: megatron-lm
description: "NVIDIA Megatron-LM — tensor/pipeline/sequence/expert/context parallelism, Megatron-Core composable library, Megatron FSDP, mixed precision (BF16/FP8), checkpoint conversion (Megatron Bridge), data loading, and MoE training. Use when training LLMs at scale with Megatron. NOT for inference."

Megatron-LM

NVIDIA Megatron-LM is a framework for training transformer models at scale. It provides GPU-optimized building blocks with advanced parallelism strategies and achieves up to 47% MFU on H100 clusters.

Components:

  • Megatron-Core (MCore) — composable library: parallelism, kernels, model building blocks, distributed checkpointing, Megatron FSDP
  • Megatron-LM — reference training scripts using Megatron-Core
  • Megatron Bridge — HuggingFace ↔ Megatron checkpoint conversion

Requirements: NVIDIA GPUs (A100/H100/B200), CUDA 12+, PyTorch 2.4+, Transformer Engine (required for FP8), Apex (optional, for fused optimizers/kernels).

Parallelism Configuration

Megatron combines multiple parallelism dimensions. Total GPUs = TP × PP × DP (× CP × EP for advanced configs).

Parallelism Settings

| Setting | Purpose | Default | Notes |
|---|---|---|---|
| tensor_model_parallel_size | Shard layers across GPUs (TP) | 1 | ≤ GPUs per node (NVLink bound) |
| pipeline_model_parallel_size | Split layers into pipeline stages (PP) | 1 | For models exceeding single-node memory |
| sequence_parallel | Distribute LayerNorm/Dropout across TP ranks (SP) | False | Requires TP > 1; always enable with TP |
| context_parallel_size | Split long sequences across GPUs (CP) | 1 | For very long context training |
| expert_model_parallel_size | Distribute MoE experts across GPUs (EP) | 1 | For MoE models |
| num_layers_per_virtual_pipeline_stage | Interleaved PP schedule | None | Reduces pipeline bubbles |
| use_distributed_optimizer | ZeRO-1 style optimizer sharding | False | Reduces per-GPU optimizer memory by 1/DP |
| use_megatron_fsdp | Enable Megatron FSDP (MCore 0.14+) | False | Full parameter sharding; see Megatron FSDP section |

Sequence Parallelism Memory Footgun

The trap: Without --sequence-parallel, only weights and attention/MLP compute are sharded across TP ranks. LayerNorm inputs/outputs and Dropout activations remain fully duplicated on every TP rank. Engineers assume "activations scale as 1/TP" and OOM on long sequences.

Activation memory per GPU (for one transformer layer, batch size b, sequence length s, hidden size h):

| Component | Without SP | With SP |
|---|---|---|
| LayerNorm inputs (×2) | 2 × 2bsh | 2 × 2bsh / TP |
| Dropout masks (×2) | 2 × bsh | 2 × bsh / TP |
| Total non-sharded activations | ~6bsh | ~6bsh / TP |

Without SP, these ~6bsh bytes per layer per GPU are constant regardless of TP degree. At 70B scale (h=8192) with sequence length 8192, batch 1, TP=8: that's ~384 MB/layer × 80 layers = ~30 GB of wasted duplicate activations.
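A quick back-of-envelope check of these numbers (a standalone sketch; assumes BF16 activations at 2 bytes/element and 1-byte dropout masks, matching the table's accounting):

```python
# Duplicated (non-TP-sharded) activation bytes per transformer layer without
# sequence parallelism: two LayerNorm inputs (BF16, 2 bytes/elem -> 2 * 2bsh)
# plus two dropout masks (1 byte/elem -> 2 * bsh) = ~6bsh bytes.
def duplicated_activation_bytes(b, s, h):
    layernorm_inputs = 2 * 2 * b * s * h  # 2 inputs x 2 bytes x b*s*h elements
    dropout_masks = 2 * 1 * b * s * h     # 2 masks x 1 byte x b*s*h elements
    return layernorm_inputs + dropout_masks

per_layer = duplicated_activation_bytes(b=1, s=8192, h=8192)
print(per_layer / 2**20)        # 384.0 MiB per layer, independent of TP
print(per_layer * 80 / 2**30)   # 30.0 GiB duplicated across an 80-layer model
```

With --sequence-parallel these terms divide by TP, so at TP=8 the same 30 GiB shrinks to ~3.75 GiB per GPU.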

Rule: Always pass --sequence-parallel when --tensor-model-parallel-size > 1. There is no downside — SP adds minimal all-gather/reduce-scatter ops that overlap with compute.

Parallelism Interaction Matrix

Not all parallelism dimensions compose freely. This matrix shows tested/supported combinations:

| Combination | Supported | Notes |
|---|---|---|
| TP + PP | ✅ | Standard 3D parallelism; TP within node, PP across nodes |
| TP + SP | ✅ | SP requires TP > 1; always enable SP with TP |
| TP + CP | ✅ | CP orthogonal to TP; each CP rank handles a sequence chunk |
| TP + EP | ✅ | EP for MoE experts, TP for attention/dense layers |
| PP + CP | ✅ | CP applied at each pipeline stage |
| PP + EP | ✅ | Experts distributed across EP group at each pipeline stage |
| CP + EP | ✅ | Fully orthogonal for long-context MoE |
| TP + PP + CP | ✅ | Full 4D parallelism; total GPUs = TP × PP × CP × DP |
| TP + PP + EP | ✅ | MoE with pipeline stages |
| TP + PP + CP + EP | ✅ | Maximum parallelism for large MoE with long context |
| Megatron FSDP + TP | ✅ | FSDP replaces DP with full parameter sharding |
| Megatron FSDP + EP | ✅ | FSDP-aware expert data parallelism |
| Megatron FSDP + CP | ✅ | Supported since MCore 0.14 |
| Megatron FSDP + PP | ❌ | Not supported; FSDP and PP are mutually exclusive |

GPU allocation formula:

  • Standard: total_GPUs = TP × PP × DP
  • With CP: total_GPUs = TP × PP × CP × DP
  • With EP: MoE layers use EP group, dense layers use DP group. total_GPUs = TP × PP × (EP or DP)
  • With FSDP: total_GPUs = TP × CP × FSDP_DP (no PP)
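The standard decomposition can be expressed as a small validation helper (an illustrative sketch; the function name is ours, not a Megatron API):

```python
def data_parallel_size(world_size, tp=1, pp=1, cp=1):
    """Derive DP from the standard decomposition: world = TP * PP * CP * DP."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size {world_size} not divisible by TP*PP*CP = {model_parallel}"
        )
    return world_size // model_parallel

# 70B recommended config from the table below: 32 GPUs, TP=4, PP=2 -> DP=4
print(data_parallel_size(32, tp=4, pp=2))        # 4
# Full 4D example: 128 GPUs, TP=8, PP=4, CP=2 -> DP=2
print(data_parallel_size(128, tp=8, pp=4, cp=2))  # 2
```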

Recommended Configurations

| Model Size | GPUs | TP | PP | DP | Notes |
|---|---|---|---|---|---|
| 7B | 8 | 1 | 1 | 8 | Single node, DP only |
| 13B | 8 | 2 | 1 | 4 | TP within NVLink domain |
| 70B | 32 | 4 | 2 | 4 | TP + PP across nodes |
| 175B | 128 | 8 | 4 | 4 | Full 3D parallelism |
| 405B | 256 | 8 | 8 | 4 | Large-scale multi-node |
| MoE 671B (DeepSeek-V3) | 256 | 8 | 1 | 4 | TP=8, EP=32, no PP for MoE |

Megatron-Core Architecture

Megatron-Core (MCore) is the composable library underpinning Megatron-LM. Import from megatron.core:

Key MCore Modules

| Module | Purpose |
|---|---|
| megatron.core.transformer | TransformerLayer, TransformerBlock, attention, MLP implementations |
| megatron.core.models | GPTModel, T5Model, multimodal models |
| megatron.core.parallel_state | Process group management for all parallelism dimensions |
| megatron.core.tensor_parallel | ColumnParallelLinear, RowParallelLinear, VocabParallelEmbedding |
| megatron.core.pipeline_parallel | Pipeline schedules (1F1B, interleaved) |
| megatron.core.context_parallel | Ring Attention / CP implementation |
| megatron.core.distributed | DistributedDataParallel, Megatron FSDP |
| megatron.core.optimizer | DistributedOptimizer with mixed precision support |
| megatron.core.dist_checkpointing | Sharded checkpoint save/load with parallelism resharding |
| megatron.core.transformer.moe | MoE router, expert layers, grouped GEMM, token dispatch |

Building Custom Models with MCore

from megatron.core.transformer.transformer_config import TransformerConfig
from megatron.core.models.gpt.gpt_model import GPTModel
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_with_transformer_engine_spec

config = TransformerConfig(
    num_layers=32,
    hidden_size=4096,
    num_attention_heads=32,
    num_query_groups=8,           # GQA
    ffn_hidden_size=11008,
    use_cpu_initialization=True,
    bf16=True,
    fp8=None,                      # Set to "e4m3" for FP8 on H100
    sequence_parallel=True,
    tensor_model_parallel_size=4,
    pipeline_model_parallel_size=2,
    context_parallel_size=1,
)

model = GPTModel(
    config=config,
    transformer_layer_spec=get_gpt_layer_with_transformer_engine_spec(),
    vocab_size=32000,
    max_sequence_length=4096,
    parallel_output=True,
)

Megatron FSDP (MCore 0.14+)

Megatron FSDP is MCore's custom FSDP implementation — replaces DP with full parameter/gradient/optimizer sharding. Designed specifically for LLM training with MCore's parallelism stack.

When to use: When model + optimizer state exceeds GPU memory with standard DP, but you want to avoid PP (which introduces pipeline bubbles). Megatron FSDP is MCore's alternative to DeepSpeed ZeRO-3.

Enable Megatron FSDP

--use-megatron-fsdp \
--data-parallel-sharding-strategy optim_grads_params \
--no-gradient-accumulation-fusion \
--use-distributed-optimizer

Key Features

  • Shards optimizer states, gradients, and parameters across DP ranks
  • Communication/computation overlap for minimal overhead
  • Compatible with BF16, FP8 (E4M3/E5M2), mixed precision O1/O2/O3
  • Works with TP, EP, and CP (not PP)
  • Meta device initialization for extremely large models
  • DTensor-based distributed checkpointing (MCore 0.14+)

Configuration Recommendations

# Disable CUDA_DEVICE_MAX_CONNECTIONS for full comm/compute overlap
unset CUDA_DEVICE_MAX_CONNECTIONS

# Add per-token loss for gradient sharding optimization
--calculate-per-token-loss

FSDP Unit Configuration

from megatron.core.distributed import FullyShardedDataParallel
from megatron.core.transformer.transformer_layer import TransformerLayer
from megatron.core.models.common.embeddings.language_model_embedding import LanguageModelEmbedding

model = FullyShardedDataParallel(
    transformer_config,
    model,
    ddp_config,
    fsdp_unit_modules=[TransformerLayer, LanguageModelEmbedding],
)

Each FSDP unit is the smallest module whose weights can be released after use. Typically one TransformerLayer = one FSDP unit.

Model Architecture Settings

| Setting | Purpose | Example |
|---|---|---|
| num_layers | Transformer layers | 32 |
| hidden_size | Hidden dimension | 4096 |
| num_attention_heads | Attention heads | 32 |
| num_query_groups | GQA groups (0 = MHA) | 8 |
| ffn_hidden_size | FFN hidden size | 11008 |
| seq_length | Training sequence length | 4096 |
| max_position_embeddings | Max position embeddings | 4096 |
| swiglu | SwiGLU activation | True |
| normalization | Norm type (LayerNorm, RMSNorm) | RMSNorm |
| use_rotary_position_embeddings | RoPE | True |
| untie_embeddings_and_output_weights | Separate embed/output weights | True |
| disable_bias_linear | Remove bias from linear layers | True |

Training Settings

| Setting | Purpose | Default |
|---|---|---|
| micro_batch_size | Per-GPU batch size | 1 |
| global_batch_size | Total batch across all GPUs | required |
| lr | Learning rate | required |
| min_lr | Minimum LR (for decay) | 0.0 |
| lr_decay_style | Decay schedule (cosine, linear, constant) | cosine |
| lr_warmup_iters | Warmup iterations | 0 |
| train_iters | Total training iterations | required |
| weight_decay | Weight decay | 0.01 |
| clip_grad | Gradient clipping | 1.0 |
| dataloader_type | Dataloader (cyclic, single) | cyclic |
| split | Train/val/test split | 99,1,0 |
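Taken together as CLI flags, a typical training-loop fragment might look like this (all values are illustrative, not recommendations; tune for your model and data):

```shell
# Hypothetical values for pretrain_gpt.py; adjust lr/batch/iters per run
--micro-batch-size 1 \
--global-batch-size 1024 \
--lr 3.0e-4 \
--min-lr 3.0e-5 \
--lr-decay-style cosine \
--lr-warmup-iters 2000 \
--train-iters 500000 \
--weight-decay 0.01 \
--clip-grad 1.0 \
--split 99,1,0
```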

Mixed Precision

| Setting | Purpose |
|---|---|
| bf16 | BF16 training (standard for A100/H100) |
| attention_softmax_in_fp32 | FP32 softmax for numerical stability |
| accumulate_allreduce_grads_in_fp32 | FP32 gradient accumulation |
| fp8_format | FP8 format: hybrid (E4M3 forward, E5M2 backward) |
| fp8_amax_history_len | FP8 amax history length |
| fp8_amax_compute_algo | FP8 amax algorithm (max, most_recent) |

FP8 requires H100+ GPUs and Transformer Engine in the container image.
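As CLI flags, an FP8 recipe on H100 might look like this fragment (values illustrative; verify flag names against your Megatron version):

```shell
# BF16 base precision with FP8 GEMMs via Transformer Engine (H100+)
--bf16 \
--fp8-format hybrid \
--fp8-amax-history-len 1024 \
--fp8-amax-compute-algo max \
--attention-softmax-in-fp32 \
--accumulate-allreduce-grads-in-fp32
```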

Dynamic Context Parallelism (MCore, Jan 2026)

Version note: Dynamic CP was introduced in MCore main branch in January 2026. It is not available in MCore ≤ 0.15.x stable releases — you need MCore main (post-January 2026) or a future stable release.

Data format prerequisite: Dynamic CP requires inputs in THD (Token × Head × Dim) layout. If your existing data pipeline produces BSHD-layout tensors (batch × sequence × head × dim), you must migrate the format before enabling dynamic CP.

Static CP assigns the same context_parallel_size to every micro-batch in a run. This causes two problems for variable-length datasets (post-training, video generation):

  1. CP over-sharding: A micro-batch containing only short sequences gets sharded across the CP group even though all sequences fit on a single GPU — adding unnecessary ring-attention communication with too little compute to hide it.
  2. DP imbalance: Micro-batches with disproportionately long sequences drive excessive FLOPs on some DP ranks, stalling gradient synchronization.

Dynamic CP solves both by selecting the optimal CP size per micro-batch at runtime.

How It Works

A per-iteration solver runs inside data_iterator_wrapper at training time (not as a separate offline preprocessing step). Each iteration, the solver examines the current micro-batch's sequence-length distribution and assigns the optimal CP size for that batch. Short-sequence batches use CP=1; long-sequence batches use CP=4 or CP=8. MCore re-forms the CP communication groups accordingly between micro-batches. Only CP groups are rebuilt (not TP or PP), which is cheap.

Reported gains: Up to 1.48× throughput on real post-training datasets with long-tail sequence distributions (NVIDIA technical blog, January 28, 2026).

Enabling Dynamic CP

Dynamic CP is configured via YAML config (not a standalone --dynamic-context-parallel CLI flag). Use Megatron's YAML config system (--yaml-cfg) with a config that sets dynamic_context_parallel: true on the model config, alongside the standard sequence-packing flags:

# Train with dynamic CP via YAML config (requires MCore main, Jan 2026+).
# --context-parallel-size is the max CP to use; actual per-batch CP is <= this.
# --seq-length is the max packed sequence length. --reset-position-ids and
# --reset-attention-mask are required for packed variable-length inputs, and
# --eod-mask-loss masks loss on EOD tokens (padding).
python pretrain_gpt.py \
  --yaml-cfg dynamic_cp_config.yaml \
  --context-parallel-size 8 \
  --seq-length 32768 \
  --reset-position-ids \
  --reset-attention-mask \
  --eod-mask-loss \
  ...

--context-parallel-size sets the upper bound on CP degree. The solver will choose any CP ≤ this value per micro-batch (typically powers of 2: 1, 2, 4, 8).

Interaction with Other Parallelism

| Combination | Dynamic CP Support |
|---|---|
| DP + Dynamic CP | ✅ Core use case: eliminates DP FLOPs imbalance |
| TP + Dynamic CP | ✅ TP fixed; only CP varies per micro-batch |
| PP + Dynamic CP | ⚠️ Limited: the 1F1B schedule keeps multiple micro-batches in flight across stages, and re-forming CP groups while stages are mid-flight stalls the pipeline. In practice, dynamic CP works only with PP=1 in the current MCore implementation |
| EP + Dynamic CP | ✅ Orthogonal; MoE expert routing is CP-unaware |

When to Use Dynamic CP

| Workload | Use? | Reason |
|---|---|---|
| Post-training / SFT on chat data | ✅ Yes | High sequence-length variance |
| Pre-training on uniform-length data | ❌ Skip | Sequences are already uniform; static CP is sufficient |
| Video generation (DiT) | ✅ Yes | Token counts vary widely by video resolution/duration |
| Long-context pre-training (uniform 128K) | ❌ Skip | All sequences same length; use static CP |

Communication Optimization

| Setting | Purpose |
|---|---|
| overlap_grad_reduce | Overlap gradient all-reduce with backward pass |
| overlap_param_gather | Overlap parameter all-gather with forward pass |
| tp_comm_overlap | Overlap TP communication with compute |
| dp_comm_overlap_bucket_size | Bucket size for gradient reduction (tune for network) |
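The corresponding CLI flags (a sketch; --tp-comm-overlap additionally depends on Transformer Engine userbuffers support, so verify against your build):

```shell
# Overlap gradient reduction and parameter gathering with compute
--overlap-grad-reduce \
--overlap-param-gather \
--tp-comm-overlap
```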

Memory Optimization

| Setting | Purpose |
|---|---|
| recompute_activations | Recompute all activations (max memory savings) |
| recompute_granularity | selective (recommended) or full |
| recompute_method | uniform or block (with full granularity) |
| recompute_num_layers | Layers to recompute per stage |
| use_distributed_optimizer | Shard optimizer state across DP ranks |
| cpu_offload_optimizer | Offload optimizer to CPU (slower, saves GPU) |
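A common progression when memory-bound is to start with selective recomputation and escalate to full only if needed (flag values illustrative):

```shell
# Recommended first step: recompute only the memory-heavy attention parts
--recompute-granularity selective
# If still OOM, recompute full layers uniformly (more compute overhead):
# --recompute-granularity full \
# --recompute-method uniform \
# --recompute-num-layers 1
```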

MoE (Mixture of Experts)

| Setting | Purpose |
|---|---|
| num_experts | Total number of experts |
| moe_router_topk | Experts per token |
| moe_aux_loss_coeff | Auxiliary load balancing loss coefficient |
| moe_grouped_gemm | Fused expert computation |
| expert_model_parallel_size | Distribute experts across GPUs |
| moe_permute_fusion | Fused permutation kernel |
| moe_token_dispatcher_type | alltoall (default) or allgather |
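Putting these together, a Mixtral-style 8-expert, top-2 configuration could be sketched as (values illustrative):

```shell
# 8 experts, 2 active per token, experts sharded across 8 GPUs
--num-experts 8 \
--moe-router-topk 2 \
--moe-aux-loss-coeff 1e-2 \
--moe-grouped-gemm \
--moe-token-dispatcher-type alltoall \
--expert-model-parallel-size 8
```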

Resiliency and Fault Tolerance

Checkpointing

| Setting | Purpose |
|---|---|
| save | Checkpoint save directory (PVC path) |
| load | Checkpoint load directory |
| save_interval | Save every N iterations |
| ckpt_format | Format: torch or torch_dist (distributed, faster) |
| auto_detect_ckpt_format | Auto-detect checkpoint format on load |
| no_save_optim | Skip optimizer state in checkpoint (smaller files) |
| no_save_rng | Skip RNG state |

Megatron saves sharded checkpoints — each TP/PP rank saves its shard to the same directory.
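A representative checkpointing fragment (paths and interval are illustrative):

```shell
# Pointing --load at the same directory resumes from the latest checkpoint
--save /checkpoints/run1 \
--load /checkpoints/run1 \
--save-interval 500 \
--ckpt-format torch_dist
```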

Elastic Training and Failure Recovery

  • Checkpoint-based recovery: Set save_interval frequently (every 100-500 iterations). On pod failure, the Job restarts and resumes from the last checkpoint via load.
  • torch_dist format: Use ckpt_format="torch_dist" for async distributed checkpointing — significantly faster save/load at scale.
  • Straggler detection: Set enable_straggler_detection=True to log slow ranks. Helps identify failing nodes before they crash.
  • Manual elastic resizing: Megatron doesn't support automatic elastic training. To change GPU count, save a checkpoint, convert parallelism with Megatron Bridge, and restart.

Checkpoint Conversion (Megatron Bridge)

Convert between HuggingFace and Megatron formats, or change parallelism dimensions. This is critical for:

  • Starting training from HF pretrained weights
  • Exporting Megatron checkpoints for inference (vLLM, TGI, SGLang)
  • Changing parallelism config (re-sharding) without retraining

from megatron.bridge import AutoBridge

# HuggingFace → Megatron
bridge = AutoBridge.from_hf_pretrained("my-org/my-model", trust_remote_code=True)
provider = bridge.to_megatron_provider()
provider.tensor_model_parallel_size = 4
provider.pipeline_model_parallel_size = 2
provider.finalize()
model = provider.provide_distributed_model(wrap_with_ddp=False)

# Megatron → HuggingFace (for inference export)
bridge.save_hf_pretrained(model, "/checkpoints/hf_export")

# Stream weights without full save
for name, weight in bridge.export_hf_weights(model, cpu=True):
    print(name, tuple(weight.shape))

Install: pip install megatron-bridge. For a CLI-based conversion script, use the example from the package:

# Using the included conversion example script
python examples/conversion/convert_checkpoints.py import \
  --hf-model-path my-org/my-model \
  --output-path /checkpoints/megatron \
  --tp 4 --pp 2

python examples/conversion/convert_checkpoints.py export \
  --checkpoint-path /checkpoints/megatron \
  --output-path /checkpoints/hf_export

Checkpoint Format Comparison

| Format | Save Speed | Load Speed | Portability | Use Case |
|---|---|---|---|---|
| torch | Slow (serial) | Slow | Highest: single file | Small models, debugging |
| torch_dist | Fast (parallel, async) | Fast | Requires same TP×PP | Production training |
| HuggingFace (via Bridge) | N/A (conversion) | N/A | Universal | Inference, sharing |

MCore Distributed Checkpointing

MCore's dist_checkpointing module handles sharded save/load with automatic parallelism resharding:

from megatron.core import dist_checkpointing

# Save sharded state dict
state_dict = {"model": model.sharded_state_dict()}
dist_checkpointing.save(state_dict, checkpoint_dir)

# Load with different parallelism (automatic resharding); load() returns
# the filled state dict
sharded_sd = {"model": model.sharded_state_dict()}
state_dict = dist_checkpointing.load(sharded_sd, checkpoint_dir)
model.load_state_dict(state_dict["model"])

Data Configuration

| Setting | Purpose |
|---|---|
| tokenizer_type | HuggingFaceTokenizer or SentencePieceTokenizer |
| tokenizer_model | HF model ID or path to tokenizer |
| data_path | Path to preprocessed binary data (.bin + .idx) |
| split | Train/val/test split ratio (e.g., 99,1,0) |
| dataloader_type | cyclic (wraps around) or single (one pass) |

Data Preprocessing

Run as a preprocessing Job before training:

python tools/preprocess_data.py \
  --input raw_data.jsonl \
  --output-prefix /data/my_dataset \
  --tokenizer-type HuggingFaceTokenizer \
  --tokenizer-model meta-llama/Llama-3.1-8B \
  --workers 32 --append-eod
# Produces: my_dataset_text_document.bin + .idx

Data blending — weighted mix of datasets:

data_path: "0.7 dataset_a_text_document 0.3 dataset_b_text_document"
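The same blend as CLI flags (paths are illustrative; prefixes omit the .bin/.idx extensions):

```shell
# 70/30 weighted mix of two preprocessed datasets
--data-path 0.7 /data/dataset_a_text_document 0.3 /data/dataset_b_text_document
```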

Kubernetes Deployment

Megatron training runs as a multi-node PyTorchJob or similar CRD. Key pod spec considerations:

  • MASTER_ADDR / MASTER_PORT: Set via the training operator (PyTorchJob sets these automatically)
  • Shared storage: All ranks need access to the same data and checkpoint PVC
  • /dev/shm: Mount as emptyDir with medium: Memory for inter-GPU communication
  • GPU topology: Use topologySpreadConstraints or node affinity to keep TP ranks on the same node
  • nproc_per_node: Match to nvidia.com/gpu resource limit
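Inside each pod, the container command typically wraps the training script with torchrun. A minimal sketch, assuming the operator injects MASTER_ADDR, MASTER_PORT, and a node-rank variable (exact environment variable names vary by operator and version):

```shell
# One launcher process per pod; 8 ranks per node to match the GPU limit.
# Remaining model/data/training flags are appended after the script name.
torchrun \
  --nproc_per_node 8 \
  --nnodes "${NNODES}" \
  --node_rank "${NODE_RANK}" \
  --master_addr "${MASTER_ADDR}" \
  --master_port "${MASTER_PORT}" \
  pretrain_gpt.py \
  --tensor-model-parallel-size 8 \
  --sequence-parallel
```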

Debugging

See references/troubleshooting.md for:

  • Communication errors and rank failures
  • OOM at various scales
  • Pipeline bubble optimization
  • Checkpoint conversion issues
  • Data loading problems

References

Cross-References

  • pytorch — PyTorch distributed training fundamentals
  • fsdp — Alternative: PyTorch FSDP for smaller-scale training
  • deepspeed — Alternative: DeepSpeed ZeRO for large model training
  • flash-attention — Ring Attention / context parallelism
  • aws-efa — EFA networking for multi-node Megatron training
  • verl — RL training using Megatron-LM backend
  • nccl — NCCL tuning for multi-node 3D parallelism
  • gpu-operator — GPU driver and device plugin prerequisites
  • kubeflow-trainer — Orchestrate Megatron training jobs on Kubernetes
  • kueue — Queue large-scale Megatron training workloads
  • wandb — Experiment tracking for Megatron training runs
  • nemo — NeMo framework built on Megatron Core
