NVIDIA Megatron-LM: tensor/pipeline/sequence/expert/context parallelism, Megatron-Core composable library, Megatron FSDP, mixed precision (BF16/FP8), checkpoint conversion (Megatron Bridge), data loading, and MoE training. Use when training LLMs at scale with Megatron. NOT for inference.
# Megatron-LM

NVIDIA Megatron-LM is a framework for training transformer models at scale. It provides GPU-optimized building blocks with advanced parallelism strategies and achieves up to 47% MFU on H100 clusters.
Components:
- Megatron-Core (MCore): composable library providing parallelism, kernels, model building blocks, distributed checkpointing, and Megatron FSDP
- Megatron-LM: reference training scripts built on Megatron-Core
- Megatron Bridge: HuggingFace ↔ Megatron checkpoint conversion
Requirements: NVIDIA GPUs (A100/H100/B200), CUDA 12+, PyTorch 2.4+, Transformer Engine (required for FP8), Apex (optional, for fused optimizers/kernels).
## Parallelism Configuration

Megatron combines multiple parallelism dimensions. Total GPUs = TP × PP × DP (× CP × EP for advanced configs).
### Parallelism Settings

| Setting | Purpose | Default | Notes |
|---|---|---|---|
| `tensor_model_parallel_size` | Shard layers across GPUs (TP) | 1 | ≤ GPUs per node (NVLink bound) |
| `pipeline_model_parallel_size` | Split layers into pipeline stages (PP) | 1 | For models exceeding single-node memory |
| `sequence_parallel` | Distribute LayerNorm/Dropout across TP ranks (SP) | False | Requires TP > 1; always enable with TP |
| `context_parallel_size` | Split long sequences across GPUs (CP) | 1 | For very long context training |
| `expert_model_parallel_size` | Distribute MoE experts across GPUs (EP) | 1 | For MoE models |
| `num_layers_per_virtual_pipeline_stage` | Interleaved PP schedule | None | Reduces pipeline bubbles |
| `use_distributed_optimizer` | ZeRO-1 style optimizer sharding | False | Reduces per-GPU optimizer memory to 1/DP of the replicated size |
| `use_megatron_fsdp` | Enable Megatron FSDP (MCore 0.14+) | False | Full parameter sharding; see Megatron FSDP section |
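To see why `use_distributed_optimizer` matters, a back-of-the-envelope sketch. It assumes Adam with FP32 master weights and two FP32 moments, i.e. roughly 12 bytes of optimizer state per parameter; the exact accounting depends on the precision recipe, so treat the numbers as order-of-magnitude only:

```python
def optimizer_state_gb(n_params: float, dp_size: int, sharded: bool) -> float:
    """Approximate per-GPU Adam optimizer-state memory in GB.

    Assumes FP32 master weights + two FP32 Adam moments = 12 bytes/param.
    With --use-distributed-optimizer (ZeRO-1 style), state is sharded
    across DP ranks, so each GPU holds 1/DP of it.
    """
    bytes_per_param = 12  # 4 (master weight) + 4 (exp_avg) + 4 (exp_avg_sq)
    total = n_params * bytes_per_param
    if sharded:
        total /= dp_size
    return total / 1e9

# 7B model, DP=8: ~84 GB replicated on every GPU vs ~10.5 GB when sharded
print(optimizer_state_gb(7e9, 8, sharded=False))  # 84.0
print(optimizer_state_gb(7e9, 8, sharded=True))   # 10.5
```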
### Sequence Parallelism Memory Footgun
The trap: Without --sequence-parallel, only weights and attention/MLP compute are sharded across TP ranks. LayerNorm inputs/outputs and Dropout activations remain fully duplicated on every TP rank. Engineers assume "activations scale as 1/TP" and OOM on long sequences.
Activation memory per GPU (for one transformer layer, batch size b, sequence length s, hidden size h):
| Component | Without SP | With SP |
|---|---|---|
| LayerNorm inputs (×2) | 2 × 2bsh | 2 × 2bsh / TP |
| Dropout masks (×2) | 2 × bsh | 2 × bsh / TP |
| Total non-sharded activations | ~6bsh | ~6bsh / TP |
Without SP, these ~6bsh bytes per layer per GPU are constant regardless of TP degree. At 70B scale (h=8192) with sequence length 8192, batch 1, TP=8: that's ~384 MB/layer × 80 layers = ~30 GB of wasted duplicate activations.
Rule: Always pass `--sequence-parallel` when `--tensor-model-parallel-size` > 1. There is no downside: SP adds minimal all-gather/reduce-scatter ops that overlap with compute.
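The ~6bsh figure is easy to sanity-check numerically, assuming BF16 activations at 2 bytes/element and dropout masks at 1 byte/element as in the table above:

```python
def unsharded_activation_bytes(b: int, s: int, h: int) -> int:
    """Per-layer activation bytes NOT sharded without --sequence-parallel.

    Two LayerNorm inputs at 2 bytes/elem (BF16) plus two dropout masks
    at 1 byte/elem: 2*2bsh + 2*bsh = 6bsh bytes per layer per GPU.
    """
    return 2 * 2 * b * s * h + 2 * b * s * h  # = 6*b*s*h

# 70B-scale example from the text: b=1, s=8192, h=8192
per_layer = unsharded_activation_bytes(1, 8192, 8192)
print(per_layer / 2**20)       # 384.0  (MB per layer)
print(per_layer * 80 / 2**30)  # 30.0   (GB across 80 layers)
```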
### Parallelism Interaction Matrix

Not all parallelism dimensions compose freely. This matrix shows tested/supported combinations:

| Combination | Supported | Notes |
|---|---|---|
| TP + PP | ✅ | Standard 3D parallelism. TP within node, PP across nodes |
| TP + SP | ✅ | SP requires TP > 1. Always enable SP with TP |
| TP + CP | ✅ | CP orthogonal to TP. Each CP rank handles a sequence chunk |
| TP + EP | ✅ | EP for MoE experts, TP for attention/dense layers |
| PP + CP | ✅ | CP applied at each pipeline stage |
| PP + EP | ✅ | Experts distributed across EP group at each pipeline stage |
| CP + EP | ✅ | Fully orthogonal for long-context MoE |
| TP + PP + CP | ✅ | Full 4D parallelism. Total GPUs = TP × PP × CP × DP |
| TP + PP + EP | ✅ | MoE with pipeline stages |
| TP + PP + CP + EP | ✅ | Maximum parallelism for large MoE with long context |
| Megatron FSDP + TP | ✅ | FSDP replaces DP with full parameter sharding |
| Megatron FSDP + EP | ✅ | FSDP-aware expert data parallelism |
| Megatron FSDP + CP | ✅ | Supported since MCore 0.14 |
| Megatron FSDP + PP | ❌ | Not supported; FSDP and PP are mutually exclusive |
GPU allocation formula:
- Standard: `total_GPUs = TP × PP × DP`
- With CP: `total_GPUs = TP × PP × CP × DP`
- With EP: MoE layers use the EP group, dense layers use the DP group: `total_GPUs = TP × PP × (EP or DP)`
- With FSDP: `total_GPUs = TP × CP × FSDP_DP` (no PP)
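The formulas above can be wrapped in a small validation helper, useful for catching world-size mismatches before launching a job (an illustrative sketch, not part of Megatron itself):

```python
def data_parallel_size(world_size: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Derive DP from total GPUs: world_size = TP * PP * CP * DP.

    Raises if the world size is not divisible by the model-parallel product,
    which is the usual cause of startup failures after a resize.
    """
    model_parallel = tp * pp * cp
    if world_size % model_parallel:
        raise ValueError(
            f"world_size {world_size} not divisible by TP*PP*CP = {model_parallel}"
        )
    return world_size // model_parallel

# The 175B configuration: 128 GPUs, TP=8, PP=4 -> DP=4
print(data_parallel_size(128, tp=8, pp=4))  # 4
```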
### Recommended Configurations
| Model Size | GPUs | TP | PP | DP | Notes |
|---|---|---|---|---|---|
| 7B | 8 | 1 | 1 | 8 | Single node, DP only |
| 13B | 8 | 2 | 1 | 4 | TP within NVLink domain |
| 70B | 32 | 4 | 2 | 4 | TP + PP across nodes |
| 175B | 128 | 8 | 4 | 4 | Full 3D parallelism |
| 405B | 256 | 8 | 8 | 4 | Large-scale multi-node |
| MoE 671B (DeepSeek-V3) | 256 | 8 | 1 | 4 | TP=8, EP=32, no PP for MoE |
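As a concrete example, the 70B row above might translate into launch flags like the following. This is a sketch, not a tested recipe: the script name, node count, and batch sizes are placeholder assumptions, and the remaining model/data flags are elided:

```shell
# 32 GPUs = TP(4) x PP(2) x DP(4); SP always enabled alongside TP
torchrun --nproc_per_node 8 --nnodes 4 pretrain_gpt.py \
  --tensor-model-parallel-size 4 \
  --pipeline-model-parallel-size 2 \
  --sequence-parallel \
  --use-distributed-optimizer \
  --bf16 \
  --micro-batch-size 1 \
  --global-batch-size 512 \
  ...
```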
## Megatron-Core Architecture

Megatron-Core (MCore) is the composable library underpinning Megatron-LM. Import from `megatron.core`:
### Key MCore Modules

| Module | Purpose |
|---|---|
| `megatron.core.transformer` | TransformerLayer, TransformerBlock, attention, MLP implementations |
| `megatron.core.models` | GPTModel, T5Model, multimodal models |
| `megatron.core.parallel_state` | Process group management for all parallelism dimensions |
| `megatron.core.tensor_parallel` | ColumnParallelLinear, RowParallelLinear, VocabParallelEmbedding |
| `megatron.core.pipeline_parallel` | Pipeline schedules (1F1B, interleaved) |
| `megatron.core.context_parallel` | Ring Attention / CP implementation |
| `megatron.core.distributed` | DistributedDataParallel, Megatron FSDP |
| `megatron.core.optimizer` | DistributedOptimizer with mixed precision support |
| `megatron.core.dist_checkpointing` | Sharded checkpoint save/load with parallelism resharding |
| `megatron.core.transformer.moe` | MoE router, expert layers, grouped GEMM, token dispatch |
### Building Custom Models with MCore

```python
from megatron.core.transformer.transformer_config import TransformerConfig
from megatron.core.models.gpt.gpt_model import GPTModel
# The layer-spec helper was missing from the original snippet's imports:
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_with_transformer_engine_spec

config = TransformerConfig(
    num_layers=32,
    hidden_size=4096,
    num_attention_heads=32,
    num_query_groups=8,  # GQA
    ffn_hidden_size=11008,
    use_cpu_initialization=True,
    bf16=True,
    fp8=None,  # Set to "e4m3" for FP8 on H100
    sequence_parallel=True,
    tensor_model_parallel_size=4,
    pipeline_model_parallel_size=2,
    context_parallel_size=1,
)

model = GPTModel(
    config=config,
    transformer_layer_spec=get_gpt_layer_with_transformer_engine_spec(),
    vocab_size=32000,
    max_sequence_length=4096,
    parallel_output=True,
)
```
## Megatron FSDP (MCore 0.14+)

Megatron FSDP is MCore's custom FSDP implementation: it replaces DP with full parameter/gradient/optimizer sharding, designed specifically for LLM training with MCore's parallelism stack.

When to use: when model + optimizer state exceeds GPU memory with standard DP, but you want to avoid PP (which introduces pipeline bubbles). Megatron FSDP is MCore's alternative to DeepSpeed ZeRO-3.
### Enable Megatron FSDP

```shell
--use-megatron-fsdp \
--data-parallel-sharding-strategy optim_grads_params \
--no-gradient-accumulation-fusion \
--use-distributed-optimizer
```
### Key Features
- Shards optimizer states, gradients, and parameters across DP ranks
- Communication/computation overlap for minimal overhead
- Compatible with BF16, FP8 (E4M3/E5M2), mixed precision O1/O2/O3
- Works with TP, EP, and CP (not PP)
- Meta device initialization for extremely large models
- DTensor-based distributed checkpointing (MCore 0.14+)
### Configuration Recommendations

```shell
# Unset CUDA_DEVICE_MAX_CONNECTIONS for full comm/compute overlap
unset CUDA_DEVICE_MAX_CONNECTIONS

# Add per-token loss for gradient sharding optimization
--calculate-per-token-loss
```
### FSDP Unit Configuration

```python
from megatron.core.distributed import FullyShardedDataParallel

model = FullyShardedDataParallel(
    transformer_config,
    model,
    ddp_config,
    fsdp_unit_modules=[TransformerLayer, LanguageModelEmbedding],
)
```
Each FSDP unit is the smallest module whose weights can be released after use. Typically one TransformerLayer = one FSDP unit.
## Model Architecture Settings

| Setting | Purpose | Example |
|---|---|---|
| `num_layers` | Transformer layers | 32 |
| `hidden_size` | Hidden dimension | 4096 |
| `num_attention_heads` | Attention heads | 32 |
| `num_query_groups` | GQA groups (0 = MHA) | 8 |
| `ffn_hidden_size` | FFN hidden size | 11008 |
| `seq_length` | Training sequence length | 4096 |
| `max_position_embeddings` | Max position embeddings | 4096 |
| `swiglu` | SwiGLU activation | True |
| `normalization` | Norm type (LayerNorm, RMSNorm) | RMSNorm |
| `use_rotary_position_embeddings` | RoPE | True |
| `untie_embeddings_and_output_weights` | Separate embed/output weights | True |
| `disable_bias_linear` | Remove bias from linear layers | True |
## Training Settings

| Setting | Purpose | Default |
|---|---|---|
| `micro_batch_size` | Per-GPU batch size | 1 |
| `global_batch_size` | Total batch across all GPUs | required |
| `lr` | Learning rate | required |
| `min_lr` | Minimum LR (for decay) | 0.0 |
| `lr_decay_style` | Decay schedule (cosine, linear, constant) | cosine |
| `lr_warmup_iters` | Warmup iterations | 0 |
| `train_iters` | Total training iterations | required |
| `weight_decay` | Weight decay | 0.01 |
| `clip_grad` | Gradient clipping | 1.0 |
| `dataloader_type` | Dataloader (cyclic, single) | cyclic |
| `split` | Train/val/test split | 99,1,0 |
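The batch settings are linked: `global_batch_size` must equal `micro_batch_size × DP × num_microbatches`, where `num_microbatches` is the gradient-accumulation count. A quick helper to derive it (illustrative; Megatron computes this internally):

```python
def num_microbatches(global_batch: int, micro_batch: int, dp: int) -> int:
    """Gradient-accumulation steps per optimizer step.

    global_batch = micro_batch * dp * num_microbatches, so the product
    micro_batch * dp must divide global_batch exactly.
    """
    per_step = micro_batch * dp
    if global_batch % per_step:
        raise ValueError(f"global_batch {global_batch} not divisible by {per_step}")
    return global_batch // per_step

# e.g. global batch 512, micro batch 1, DP=8 -> 64 accumulation steps
print(num_microbatches(512, 1, 8))  # 64
```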
## Mixed Precision

| Setting | Purpose |
|---|---|
| `bf16` | BF16 training (standard for A100/H100) |
| `attention_softmax_in_fp32` | FP32 softmax for numerical stability |
| `accumulate_allreduce_grads_in_fp32` | FP32 gradient accumulation |
| `fp8_format` | FP8 format: hybrid (E4M3 forward, E5M2 backward) |
| `fp8_amax_history_len` | FP8 amax history length |
| `fp8_amax_compute_algo` | FP8 amax algorithm (max, most_recent) |
FP8 requires H100+ GPUs and Transformer Engine in the container image.
## Dynamic Context Parallelism (MCore, Jan 2026)

Version note: Dynamic CP was introduced in the MCore main branch in January 2026. It is not available in MCore ≤ 0.15.x stable releases; you need MCore main (post-January 2026) or a future stable release.

Data format prerequisite: Dynamic CP requires inputs in THD (Token × Head × Dim) layout. If your existing data pipeline produces BSHD-layout tensors (batch × sequence × head × dim), you must migrate the format before enabling dynamic CP.
Static CP assigns the same context_parallel_size to every micro-batch in a run. This causes two problems for variable-length datasets (post-training, video generation):
- CP over-sharding: A micro-batch containing only short sequences gets sharded across the CP group even though all sequences fit on a single GPU, adding unnecessary ring-attention communication with too little compute to hide it.
- DP imbalance: Micro-batches with disproportionately long sequences drive excessive FLOPs on some DP ranks, stalling gradient synchronization.
Dynamic CP solves both by selecting the optimal CP size per micro-batch at runtime.
### How It Works
A per-iteration solver runs inside data_iterator_wrapper at training time (not as a separate offline preprocessing step). Each iteration, the solver examines the current micro-batch's sequence-length distribution and assigns the optimal CP size for that batch. Short-sequence batches use CP=1; long-sequence batches use CP=4 or CP=8. MCore re-forms the CP communication groups accordingly between micro-batches. Only CP groups are rebuilt (not TP or PP), which is cheap.
Reported gains: up to 1.48× throughput on real post-training datasets with long-tail sequence distributions (NVIDIA technical blog, January 28, 2026).
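The per-iteration decision can be pictured as follows. This is a hypothetical illustration of the policy (pick the smallest power-of-two CP that keeps the batch's longest sequence within a per-rank token budget), not MCore's actual solver, and `max_tokens_per_rank` is an invented knob:

```python
def choose_cp_size(seq_lens, max_cp: int, max_tokens_per_rank: int) -> int:
    """Pick the smallest power-of-two CP size (<= max_cp) such that the
    longest sequence in the micro-batch, split across CP ranks, fits a
    per-rank token budget. Hypothetical policy for illustration only.
    """
    longest = max(seq_lens)
    cp = 1
    while cp < max_cp and longest / cp > max_tokens_per_rank:
        cp *= 2
    return cp

# A short-sequence batch stays at CP=1; a long-tail batch escalates to CP=8
print(choose_cp_size([512, 1024, 2048], max_cp=8, max_tokens_per_rank=4096))  # 1
print(choose_cp_size([32768, 1024], max_cp=8, max_tokens_per_rank=4096))      # 8
```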
### Enabling Dynamic CP

Dynamic CP is configured via YAML config (not a standalone `--dynamic-context-parallel` CLI flag). Use Megatron's YAML config system (`--yaml-cfg`) with a config that sets `dynamic_context_parallel: true` on the model config, alongside the standard sequence-packing flags:

```shell
# Train with dynamic CP via YAML config (requires MCore main, Jan 2026+).
#   --context-parallel-size: max CP to use; actual per-batch CP <= this
#   --seq-length: max packed sequence length
#   --reset-position-ids / --reset-attention-mask: required for packed
#     variable-length inputs
#   --eod-mask-loss: mask loss on EOD tokens (padding)
python pretrain_gpt.py \
  --yaml-cfg dynamic_cp_config.yaml \
  --context-parallel-size 8 \
  --seq-length 32768 \
  --reset-position-ids \
  --reset-attention-mask \
  --eod-mask-loss \
  ...
```
`--context-parallel-size` sets the upper bound on CP degree. The solver will choose any CP ≤ this value per micro-batch (typically powers of 2: 1, 2, 4, 8).
### Interaction with Other Parallelism

| Combination | Dynamic CP Support |
|---|---|
| DP + Dynamic CP | ✅ Core use case: eliminates DP FLOPs imbalance |
| TP + Dynamic CP | ✅ TP fixed; only CP varies per micro-batch |
| PP + Dynamic CP | ⚠️ Limited. The 1F1B pipeline schedule keeps multiple micro-batches in flight simultaneously across stages; re-forming CP groups per micro-batch while stages are mid-flight stalls the pipeline. In practice, dynamic CP works only with PP=1 in the current MCore implementation. |
| EP + Dynamic CP | ✅ Orthogonal; MoE expert routing is CP-unaware |
### When to Use Dynamic CP

| Workload | Use? | Reason |
|---|---|---|
| Post-training / SFT on chat data | ✅ Yes | High sequence-length variance |
| Pre-training on uniform-length data | ❌ Skip | Sequences are already uniform; static CP is sufficient |
| Video generation (DiT) | ✅ Yes | Token counts vary widely by video resolution/duration |
| Long-context pre-training (uniform 128K) | ❌ Skip | All sequences same length; use static CP |
## Communication Optimization

| Setting | Purpose |
|---|---|
| `overlap_grad_reduce` | Overlap gradient all-reduce with backward pass |
| `overlap_param_gather` | Overlap parameter all-gather with forward pass |
| `tp_comm_overlap` | Overlap TP communication with compute |
| `dp_comm_overlap_bucket_size` | Bucket size for gradient reduction (tune for network) |
## Memory Optimization

| Setting | Purpose |
|---|---|
| `recompute_activations` | Recompute all activations (max memory savings) |
| `recompute_granularity` | selective (recommended) or full |
| `recompute_method` | uniform or block (with full granularity) |
| `recompute_num_layers` | Layers to recompute per stage |
| `use_distributed_optimizer` | Shard optimizer state across DP ranks |
| `cpu_offload_optimizer` | Offload optimizer to CPU (slower, saves GPU) |
## MoE (Mixture of Experts)

| Setting | Purpose |
|---|---|
| `num_experts` | Total number of experts |
| `moe_router_topk` | Experts per token |
| `moe_aux_loss_coeff` | Auxiliary load balancing loss coefficient |
| `moe_grouped_gemm` | Fused expert computation |
| `expert_model_parallel_size` | Distribute experts across GPUs |
| `moe_permute_fusion` | Fused permutation kernel |
| `moe_token_dispatcher_type` | alltoall (default) or allgather |
## Resiliency and Fault Tolerance

### Checkpointing

| Setting | Purpose |
|---|---|
| `save` | Checkpoint save directory (PVC path) |
| `load` | Checkpoint load directory |
| `save_interval` | Save every N iterations |
| `ckpt_format` | Format: torch or torch_dist (distributed, faster) |
| `auto_detect_ckpt_format` | Auto-detect checkpoint format on load |
| `no_save_optim` | Skip optimizer state in checkpoint (smaller files) |
| `no_save_rng` | Skip RNG state |

Megatron saves sharded checkpoints: each TP/PP rank saves its shard to the same directory.
### Elastic Training and Failure Recovery

- Checkpoint-based recovery: set `save_interval` frequently (every 100-500 iterations). On pod failure, the Job restarts and resumes from the last checkpoint via `load`.
- torch_dist format: use `ckpt_format="torch_dist"` for async distributed checkpointing; significantly faster save/load at scale.
- Straggler detection: set `enable_straggler_detection=True` to log slow ranks. Helps identify failing nodes before they crash.
- Manual elastic resizing: Megatron doesn't support automatic elastic training. To change GPU count, save a checkpoint, convert parallelism with Megatron Bridge, and restart.
## Checkpoint Conversion (Megatron Bridge)
Convert between HuggingFace and Megatron formats, or change parallelism dimensions. This is critical for:
- Starting training from HF pretrained weights
- Exporting Megatron checkpoints for inference (vLLM, TGI, SGLang)
- Changing parallelism config (re-sharding) without retraining
```python
from megatron.bridge import AutoBridge

# HuggingFace -> Megatron
bridge = AutoBridge.from_hf_pretrained("my-org/my-model", trust_remote_code=True)
provider = bridge.to_megatron_provider()
provider.tensor_model_parallel_size = 4
provider.pipeline_model_parallel_size = 2
provider.finalize()
model = provider.provide_distributed_model(wrap_with_ddp=False)

# Megatron -> HuggingFace (for inference export)
bridge.save_hf_pretrained(model, "/checkpoints/hf_export")

# Stream weights without a full save
for name, weight in bridge.export_hf_weights(model, cpu=True):
    print(name, tuple(weight.shape))
```
Install: `pip install megatron-bridge`. For a CLI-based conversion, use the example script from the package:

```shell
# Using the included conversion example script
python examples/conversion/convert_checkpoints.py import \
  --hf-model-path my-org/my-model \
  --output-path /checkpoints/megatron \
  --tp 4 --pp 2

python examples/conversion/convert_checkpoints.py export \
  --checkpoint-path /checkpoints/megatron \
  --output-path /checkpoints/hf_export
```
### Checkpoint Format Comparison

| Format | Save Speed | Load Speed | Portability | Use Case |
|---|---|---|---|---|
| `torch` | Slow (serial) | Slow | Highest (single file) | Small models, debugging |
| `torch_dist` | Fast (parallel, async) | Fast | Requires same TP×PP | Production training |
| HuggingFace (via Bridge) | N/A (conversion) | N/A | Universal | Inference, sharing |
### MCore Distributed Checkpointing

MCore's `dist_checkpointing` module handles sharded save/load with automatic parallelism resharding:

```python
from megatron.core import dist_checkpointing

# Save sharded state dict
state_dict = {"model": model.sharded_state_dict()}
dist_checkpointing.save(state_dict, checkpoint_dir)

# Load with different parallelism (automatic resharding);
# load returns the populated state dict
state_dict = {"model": model.sharded_state_dict()}
state_dict = dist_checkpointing.load(state_dict, checkpoint_dir)
model.load_state_dict(state_dict["model"])
```
## Data Configuration

| Setting | Purpose |
|---|---|
| `tokenizer_type` | HuggingFaceTokenizer or SentencePieceTokenizer |
| `tokenizer_model` | HF model ID or path to tokenizer |
| `data_path` | Path to preprocessed binary data (.bin + .idx) |
| `split` | Train/val/test split ratio (e.g., 99,1,0) |
| `dataloader_type` | cyclic (wraps around) or single (one pass) |
### Data Preprocessing

Run as a preprocessing Job before training:

```shell
python tools/preprocess_data.py \
  --input raw_data.jsonl \
  --output-prefix /data/my_dataset \
  --tokenizer-type HuggingFaceTokenizer \
  --tokenizer-model meta-llama/Llama-3.1-8B \
  --workers 32 --append-eod
# Produces: my_dataset_text_document.bin + .idx
```
Data blending (weighted mix of datasets):

```
data_path: "0.7 dataset_a_text_document 0.3 dataset_b_text_document"
```
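A minimal sketch of how such a blend string decomposes into (weight, prefix) pairs. This is an illustrative parser, not Megatron's internal implementation:

```python
def parse_blend(spec: str) -> list[tuple[float, str]]:
    """Split a 'w1 prefix1 w2 prefix2 ...' blend spec into pairs."""
    tokens = spec.split()
    if len(tokens) % 2:
        raise ValueError("blend spec must alternate weight and dataset prefix")
    # Even positions are weights, odd positions are dataset prefixes
    return [(float(w), p) for w, p in zip(tokens[::2], tokens[1::2])]

print(parse_blend("0.7 dataset_a_text_document 0.3 dataset_b_text_document"))
# [(0.7, 'dataset_a_text_document'), (0.3, 'dataset_b_text_document')]
```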
## Kubernetes Deployment

Megatron training runs as a multi-node PyTorchJob or similar CRD. Key pod spec considerations:

- `MASTER_ADDR`/`MASTER_PORT`: set via the training operator (PyTorchJob sets these automatically)
- Shared storage: all ranks need access to the same data and checkpoint PVC
- `/dev/shm`: mount as `emptyDir` with `medium: Memory` for inter-GPU communication
- GPU topology: use `topologySpreadConstraints` or node affinity to keep TP ranks on the same node
- `nproc_per_node`: match to the `nvidia.com/gpu` resource limit
## Debugging

See `references/troubleshooting.md` for:
- Communication errors and rank failures
- OOM at various scales
- Pipeline bubble optimization
- Checkpoint conversion issues
- Data loading problems
## References

- `troubleshooting.md`: communication errors, checkpoint issues, and training failures
## Cross-References

- pytorch: PyTorch distributed training fundamentals
- fsdp: PyTorch FSDP, an alternative for smaller-scale training
- deepspeed: DeepSpeed ZeRO, an alternative for large model training
- flash-attention: Ring Attention / context parallelism
- aws-efa: EFA networking for multi-node Megatron training
- verl: RL training using the Megatron-LM backend
- nccl: NCCL tuning for multi-node 3D parallelism
- gpu-operator: GPU driver and device plugin prerequisites
- kubeflow-trainer: orchestrate Megatron training jobs on Kubernetes
- kueue: queue large-scale Megatron training workloads
- wandb: experiment tracking for Megatron training runs
- nemo: NeMo framework built on Megatron-Core
## Reference
- Megatron-LM GitHub
- Megatron-Core docs
- Megatron Bridge
- Megatron FSDP User Guide
- `references/troubleshooting.md`: common errors and fixes
- `assets/pretrain_llama.sh`: launch script for Llama-style pretraining with TP, distributed optimizer, selective recomputation, and torch_dist checkpoints
- `assets/architecture.md`: Mermaid architecture diagrams