Agent Skills
tylertitsworth

megatron-lm

@tylertitsworth/megatron-lm
tylertitsworth
0
0 forks
Updated 4/7/2026
View on GitHub

NVIDIA Megatron-LM — tensor/pipeline/sequence/expert/context parallelism, Megatron-Core composable library, Megatron FSDP, mixed precision (BF16/FP8), checkpoint conversion (Megatron Bridge), data loading, and MoE training. Use when training LLMs at scale with Megatron. NOT for inference.

Installation

$ npx agent-skills-cli install @tylertitsworth/megatron-lm
Supported assistants: Claude Code, Cursor, Copilot, Codex, Antigravity

Details

Path: megatron-lm/SKILL.md
Branch: main
Scoped Name: @tylertitsworth/megatron-lm

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

npx agent-skills-cli list

Skill Instructions


name: megatron-lm
description: "NVIDIA Megatron-LM — tensor/pipeline/sequence/expert/context parallelism, Megatron-Core composable library, Megatron FSDP, mixed precision (BF16/FP8), checkpoint conversion (Megatron Bridge), data loading, and MoE training. Use when training LLMs at scale with Megatron. NOT for inference."

Megatron-LM

NVIDIA Megatron-LM is a framework for training transformer models at scale. It provides GPU-optimized building blocks with advanced parallelism strategies and achieves up to 47% MFU on H100 clusters.

Components:

  • Megatron-Core (MCore) — composable library: parallelism, kernels, model building blocks, distributed checkpointing, Megatron FSDP
  • Megatron-LM — reference training scripts using Megatron-Core
  • Megatron Bridge — HuggingFace ↔ Megatron checkpoint conversion

Requirements: NVIDIA GPUs (A100/H100/B200), CUDA 12+, PyTorch 2.4+, Transformer Engine (required for FP8), Apex (optional, for fused optimizers/kernels).

Parallelism Configuration

Megatron combines multiple parallelism dimensions. Total GPUs = TP × PP × DP (× CP × EP for advanced configs).

Parallelism Settings

| Setting | Purpose | Default | Notes |
|---|---|---|---|
| tensor_model_parallel_size | Shard layers across GPUs (TP) | 1 | ≤ GPUs per node (NVLink bound) |
| pipeline_model_parallel_size | Split layers into pipeline stages (PP) | 1 | For models exceeding single-node memory |
| sequence_parallel | Distribute LayerNorm/Dropout across TP ranks (SP) | False | Requires TP > 1; always enable with TP |
| context_parallel_size | Split long sequences across GPUs (CP) | 1 | For very long context training |
| expert_model_parallel_size | Distribute MoE experts across GPUs (EP) | 1 | For MoE models |
| num_layers_per_virtual_pipeline_stage | Interleaved PP schedule | None | Reduces pipeline bubbles |
| use_distributed_optimizer | ZeRO-1 style optimizer sharding | False | Reduces per-GPU optimizer memory by 1/DP |
| use_megatron_fsdp | Enable Megatron FSDP (MCore 0.14+) | False | Full parameter sharding; see Megatron FSDP section |

Sequence Parallelism Memory Footgun

The trap: Without --sequence-parallel, only weights and attention/MLP compute are sharded across TP ranks. LayerNorm inputs/outputs and Dropout activations remain fully duplicated on every TP rank. Engineers assume "activations scale as 1/TP" and OOM on long sequences.

Activation memory per GPU (for one transformer layer, batch size b, sequence length s, hidden size h):

| Component | Without SP | With SP |
|---|---|---|
| LayerNorm inputs (×2) | 2 × 2bsh | 2 × 2bsh / TP |
| Dropout masks (×2) | 2 × bsh | 2 × bsh / TP |
| Total non-sharded activations | ~6bsh | ~6bsh / TP |

Without SP, these ~6bsh bytes per layer per GPU are constant regardless of TP degree. At 70B scale (h=8192) with sequence length 8192, batch 1, TP=8: that's ~384 MB/layer × 80 layers = ~30 GB of wasted duplicate activations.
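A quick back-of-envelope check of these numbers (a standalone sketch; assumes BF16 activations at 2 bytes/element and 1-byte dropout masks, matching the table's accounting):

```python
# Duplicated (non-TP-sharded) activation bytes per transformer layer without
# sequence parallelism: two LayerNorm inputs (BF16, 2 bytes/elem -> 2 * 2bsh)
# plus two dropout masks (1 byte/elem -> 2 * bsh) = ~6bsh bytes.
def duplicated_activation_bytes(b, s, h):
    layernorm_inputs = 2 * 2 * b * s * h  # 2 inputs x 2 bytes x b*s*h elements
    dropout_masks = 2 * 1 * b * s * h     # 2 masks x 1 byte x b*s*h elements
    return layernorm_inputs + dropout_masks

per_layer = duplicated_activation_bytes(b=1, s=8192, h=8192)
print(per_layer / 2**20)        # 384.0 MiB per layer, independent of TP
print(per_layer * 80 / 2**30)   # 30.0 GiB duplicated across an 80-layer model
```

With --sequence-parallel these terms divide by TP, so at TP=8 the same 30 GiB shrinks to ~3.75 GiB per GPU.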

Rule: Always pass --sequence-parallel when --tensor-model-parallel-size > 1. There is no downside — SP adds minimal all-gather/reduce-scatter ops that overlap with compute.

Parallelism Interaction Matrix

Not all parallelism dimensions compose freely. This matrix shows tested/supported combinations:

| Combination | Supported | Notes |
|---|---|---|
| TP + PP | ✅ | Standard 3D parallelism; TP within node, PP across nodes |
| TP + SP | ✅ | SP requires TP > 1; always enable SP with TP |
| TP + CP | ✅ | CP orthogonal to TP; each CP rank handles a sequence chunk |
| TP + EP | ✅ | EP for MoE experts, TP for attention/dense layers |
| PP + CP | ✅ | CP applied at each pipeline stage |
| PP + EP | ✅ | Experts distributed across EP group at each pipeline stage |
| CP + EP | ✅ | Fully orthogonal for long-context MoE |
| TP + PP + CP | ✅ | Full 4D parallelism; total GPUs = TP × PP × CP × DP |
| TP + PP + EP | ✅ | MoE with pipeline stages |
| TP + PP + CP + EP | ✅ | Maximum parallelism for large MoE with long context |
| Megatron FSDP + TP | ✅ | FSDP replaces DP with full parameter sharding |
| Megatron FSDP + EP | ✅ | FSDP-aware expert data parallelism |
| Megatron FSDP + CP | ✅ | Supported since MCore 0.14 |
| Megatron FSDP + PP | ❌ | Not supported; FSDP and PP are mutually exclusive |

GPU allocation formula:

  • Standard: total_GPUs = TP × PP × DP
  • With CP: total_GPUs = TP × PP × CP × DP
  • With EP: MoE layers use EP group, dense layers use DP group. total_GPUs = TP × PP × (EP or DP)
  • With FSDP: total_GPUs = TP × CP × FSDP_DP (no PP)
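The standard decomposition can be expressed as a small validation helper (an illustrative sketch; the function name is ours, not a Megatron API):

```python
def data_parallel_size(world_size, tp=1, pp=1, cp=1):
    """Derive DP from the standard decomposition: world = TP * PP * CP * DP."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size {world_size} not divisible by TP*PP*CP = {model_parallel}"
        )
    return world_size // model_parallel

# 70B recommended config from the table below: 32 GPUs, TP=4, PP=2 -> DP=4
print(data_parallel_size(32, tp=4, pp=2))        # 4
# Full 4D example: 128 GPUs, TP=8, PP=4, CP=2 -> DP=2
print(data_parallel_size(128, tp=8, pp=4, cp=2))  # 2
```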

Recommended Configurations

| Model Size | GPUs | TP | PP | DP | Notes |
|---|---|---|---|---|---|
| 7B | 8 | 1 | 1 | 8 | Single node, DP only |
| 13B | 8 | 2 | 1 | 4 | TP within NVLink domain |
| 70B | 32 | 4 | 2 | 4 | TP + PP across nodes |
| 175B | 128 | 8 | 4 | 4 | Full 3D parallelism |
| 405B | 256 | 8 | 8 | 4 | Large-scale multi-node |
| MoE 671B (DeepSeek-V3) | 256 | 8 | 1 | 4 | TP=8, EP=32, no PP for MoE |

Megatron-Core Architecture

Megatron-Core (MCore) is the composable library underpinning Megatron-LM. Import from megatron.core:

Key MCore Modules

| Module | Purpose |
|---|---|
| megatron.core.transformer | TransformerLayer, TransformerBlock, attention, MLP implementations |
| megatron.core.models | GPTModel, T5Model, multimodal models |
| megatron.core.parallel_state | Process group management for all parallelism dimensions |
| megatron.core.tensor_parallel | ColumnParallelLinear, RowParallelLinear, VocabParallelEmbedding |
| megatron.core.pipeline_parallel | Pipeline schedules (1F1B, interleaved) |
| megatron.core.context_parallel | Ring Attention / CP implementation |
| megatron.core.distributed | DistributedDataParallel, Megatron FSDP |
| megatron.core.optimizer | DistributedOptimizer with mixed precision support |
| megatron.core.dist_checkpointing | Sharded checkpoint save/load with parallelism resharding |
| megatron.core.transformer.moe | MoE router, expert layers, grouped GEMM, token dispatch |

Building Custom Models with MCore

from megatron.core.transformer.transformer_config import TransformerConfig
from megatron.core.models.gpt.gpt_model import GPTModel
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_with_transformer_engine_spec

config = TransformerConfig(
    num_layers=32,
    hidden_size=4096,
    num_attention_heads=32,
    num_query_groups=8,           # GQA
    ffn_hidden_size=11008,
    use_cpu_initialization=True,
    bf16=True,
    fp8=None,                      # Set to "e4m3" for FP8 on H100
    sequence_parallel=True,
    tensor_model_parallel_size=4,
    pipeline_model_parallel_size=2,
    context_parallel_size=1,
)

model = GPTModel(
    config=config,
    transformer_layer_spec=get_gpt_layer_with_transformer_engine_spec(),
    vocab_size=32000,
    max_sequence_length=4096,
    parallel_output=True,
)

Megatron FSDP (MCore 0.14+)

Megatron FSDP is MCore's custom FSDP implementation — replaces DP with full parameter/gradient/optimizer sharding. Designed specifically for LLM training with MCore's parallelism stack.

When to use: When model + optimizer state exceeds GPU memory with standard DP, but you want to avoid PP (which introduces pipeline bubbles). Megatron FSDP is MCore's alternative to DeepSpeed ZeRO-3.

Enable Megatron FSDP

--use-megatron-fsdp \
--data-parallel-sharding-strategy optim_grads_params \
--no-gradient-accumulation-fusion \
--use-distributed-optimizer

Key Features

  • Shards optimizer states, gradients, and parameters across DP ranks
  • Communication/computation overlap for minimal overhead
  • Compatible with BF16, FP8 (E4M3/E5M2), mixed precision O1/O2/O3
  • Works with TP, EP, and CP (not PP)
  • Meta device initialization for extremely large models
  • DTensor-based distributed checkpointing (MCore 0.14+)

Configuration Recommendations

# Disable CUDA_DEVICE_MAX_CONNECTIONS for full comm/compute overlap
unset CUDA_DEVICE_MAX_CONNECTIONS

# Add per-token loss for gradient sharding optimization
--calculate-per-token-loss

FSDP Unit Configuration

from megatron.core.distributed import FullyShardedDataParallel
from megatron.core.transformer.transformer_layer import TransformerLayer
from megatron.core.models.common.embeddings.language_model_embedding import LanguageModelEmbedding

model = FullyShardedDataParallel(
    transformer_config,
    model,
    ddp_config,
    fsdp_unit_modules=[TransformerLayer, LanguageModelEmbedding],
)

Each FSDP unit is the smallest module whose weights can be released after use. Typically one TransformerLayer = one FSDP unit.

Model Architecture Settings

| Setting | Purpose | Example |
|---|---|---|
| num_layers | Transformer layers | 32 |
| hidden_size | Hidden dimension | 4096 |
| num_attention_heads | Attention heads | 32 |
| num_query_groups | GQA groups (0 = MHA) | 8 |
| ffn_hidden_size | FFN hidden size | 11008 |
| seq_length | Training sequence length | 4096 |
| max_position_embeddings | Max position embeddings | 4096 |
| swiglu | SwiGLU activation | True |
| normalization | Norm type (LayerNorm, RMSNorm) | RMSNorm |
| use_rotary_position_embeddings | RoPE | True |
| untie_embeddings_and_output_weights | Separate embed/output weights | True |
| disable_bias_linear | Remove bias from linear layers | True |

Training Settings

| Setting | Purpose | Default |
|---|---|---|
| micro_batch_size | Per-GPU batch size | 1 |
| global_batch_size | Total batch across all GPUs | required |
| lr | Learning rate | required |
| min_lr | Minimum LR (for decay) | 0.0 |
| lr_decay_style | Decay schedule (cosine, linear, constant) | cosine |
| lr_warmup_iters | Warmup iterations | 0 |
| train_iters | Total training iterations | required |
| weight_decay | Weight decay | 0.01 |
| clip_grad | Gradient clipping | 1.0 |
| dataloader_type | Dataloader (cyclic, single) | cyclic |
| split | Train/val/test split | 99,1,0 |
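Taken together as CLI flags, a typical training-loop fragment might look like this (all values are illustrative, not recommendations; tune for your model and data):

```shell
# Hypothetical values for pretrain_gpt.py; adjust lr/batch/iters per run
--micro-batch-size 1 \
--global-batch-size 1024 \
--lr 3.0e-4 \
--min-lr 3.0e-5 \
--lr-decay-style cosine \
--lr-warmup-iters 2000 \
--train-iters 500000 \
--weight-decay 0.01 \
--clip-grad 1.0 \
--split 99,1,0
```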

Mixed Precision

| Setting | Purpose |
|---|---|
| bf16 | BF16 training (standard for A100/H100) |
| attention_softmax_in_fp32 | FP32 softmax for numerical stability |
| accumulate_allreduce_grads_in_fp32 | FP32 gradient accumulation |
| fp8_format | FP8 format: hybrid (E4M3 forward, E5M2 backward) |
| fp8_amax_history_len | FP8 amax history length |
| fp8_amax_compute_algo | FP8 amax algorithm (max, most_recent) |

FP8 requires H100+ GPUs and Transformer Engine in the container image.
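As CLI flags, an FP8 recipe on H100 might look like this fragment (values illustrative; verify flag names against your Megatron version):

```shell
# BF16 base precision with FP8 GEMMs via Transformer Engine (H100+)
--bf16 \
--fp8-format hybrid \
--fp8-amax-history-len 1024 \
--fp8-amax-compute-algo max \
--attention-softmax-in-fp32 \
--accumulate-allreduce-grads-in-fp32
```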

Dynamic Context Parallelism (MCore, Jan 2026)

Version note: Dynamic CP was introduced in MCore main branch in January 2026. It is not available in MCore ≤ 0.15.x stable releases — you need MCore main (post-January 2026) or a future stable release.

Data format prerequisite: Dynamic CP requires inputs in THD (Token × Head × Dim) layout. If your existing data pipeline produces BSHD-layout tensors (batch × sequence × head × dim), you must migrate the format before enabling dynamic CP.

Static CP assigns the same context_parallel_size to every micro-batch in a run. This causes two problems for variable-length datasets (post-training, video generation):

  1. CP over-sharding: A micro-batch containing only short sequences gets sharded across the CP group even though all sequences fit on a single GPU — adding unnecessary ring-attention communication with too little compute to hide it.
  2. DP imbalance: Micro-batches with disproportionately long sequences drive excessive FLOPs on some DP ranks, stalling gradient synchronization.

Dynamic CP solves both by selecting the optimal CP size per micro-batch at runtime.

How It Works

A per-iteration solver runs inside data_iterator_wrapper at training time (not as a separate offline preprocessing step). Each iteration, the solver examines the current micro-batch's sequence-length distribution and assigns the optimal CP size for that batch. Short-sequence batches use CP=1; long-sequence batches use CP=4 or CP=8. MCore re-forms the CP communication groups accordingly between micro-batches. Only CP groups are rebuilt (not TP or PP), which is cheap.

Reported gains: Up to 1.48× throughput on real post-training datasets with long-tail sequence distributions (NVIDIA technical blog, January 28, 2026).

Enabling Dynamic CP

Dynamic CP is configured via YAML config (not a standalone --dynamic-context-parallel CLI flag). Use Megatron's YAML config system (--yaml-cfg) with a config that sets dynamic_context_parallel: true on the model config, alongside the standard sequence-packing flags:

# Train with dynamic CP via YAML config (requires MCore main, Jan 2026+).
# --context-parallel-size is the max CP to use; actual per-batch CP is <= this.
# --seq-length is the max packed sequence length. --reset-position-ids and
# --reset-attention-mask are required for packed variable-length inputs, and
# --eod-mask-loss masks loss on EOD tokens (padding).
python pretrain_gpt.py \
  --yaml-cfg dynamic_cp_config.yaml \
  --context-parallel-size 8 \
  --seq-length 32768 \
  --reset-position-ids \
  --reset-attention-mask \
  --eod-mask-loss \
  ...

--context-parallel-size sets the upper bound on CP degree. The solver will choose any CP ≤ this value per micro-batch (typically powers of 2: 1, 2, 4, 8).

Interaction with Other Parallelism

| Combination | Dynamic CP Support |
|---|---|
| DP + Dynamic CP | ✅ Core use case: eliminates DP FLOPs imbalance |
| TP + Dynamic CP | ✅ TP fixed; only CP varies per micro-batch |
| PP + Dynamic CP | ⚠️ Limited: the 1F1B schedule keeps multiple micro-batches in flight across stages, and re-forming CP groups while stages are mid-flight stalls the pipeline. In practice, dynamic CP works only with PP=1 in the current MCore implementation |
| EP + Dynamic CP | ✅ Orthogonal; MoE expert routing is CP-unaware |

When to Use Dynamic CP

| Workload | Use? | Reason |
|---|---|---|
| Post-training / SFT on chat data | ✅ Yes | High sequence-length variance |
| Pre-training on uniform-length data | ❌ Skip | Sequences are already uniform; static CP is sufficient |
| Video generation (DiT) | ✅ Yes | Token counts vary widely by video resolution/duration |
| Long-context pre-training (uniform 128K) | ❌ Skip | All sequences same length; use static CP |

Communication Optimization

| Setting | Purpose |
|---|---|
| overlap_grad_reduce | Overlap gradient all-reduce with backward pass |
| overlap_param_gather | Overlap parameter all-gather with forward pass |
| tp_comm_overlap | Overlap TP communication with compute |
| dp_comm_overlap_bucket_size | Bucket size for gradient reduction (tune for network) |
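The corresponding CLI flags (a sketch; --tp-comm-overlap additionally depends on Transformer Engine userbuffers support, so verify against your build):

```shell
# Overlap gradient reduction and parameter gathering with compute
--overlap-grad-reduce \
--overlap-param-gather \
--tp-comm-overlap
```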

Memory Optimization

| Setting | Purpose |
|---|---|
| recompute_activations | Recompute all activations (max memory savings) |
| recompute_granularity | selective (recommended) or full |
| recompute_method | uniform or block (with full granularity) |
| recompute_num_layers | Layers to recompute per stage |
| use_distributed_optimizer | Shard optimizer state across DP ranks |
| cpu_offload_optimizer | Offload optimizer to CPU (slower, saves GPU) |
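A common progression when memory-bound is to start with selective recomputation and escalate to full only if needed (flag values illustrative):

```shell
# Recommended first step: recompute only the memory-heavy attention parts
--recompute-granularity selective
# If still OOM, recompute full layers uniformly (more compute overhead):
# --recompute-granularity full \
# --recompute-method uniform \
# --recompute-num-layers 1
```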

MoE (Mixture of Experts)

| Setting | Purpose |
|---|---|
| num_experts | Total number of experts |
| moe_router_topk | Experts per token |
| moe_aux_loss_coeff | Auxiliary load balancing loss coefficient |
| moe_grouped_gemm | Fused expert computation |
| expert_model_parallel_size | Distribute experts across GPUs |
| moe_permute_fusion | Fused permutation kernel |
| moe_token_dispatcher_type | alltoall (default) or allgather |
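Putting these together, a Mixtral-style 8-expert, top-2 configuration could be sketched as (values illustrative):

```shell
# 8 experts, 2 active per token, experts sharded across 8 GPUs
--num-experts 8 \
--moe-router-topk 2 \
--moe-aux-loss-coeff 1e-2 \
--moe-grouped-gemm \
--moe-token-dispatcher-type alltoall \
--expert-model-parallel-size 8
```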

Resiliency and Fault Tolerance

Checkpointing

| Setting | Purpose |
|---|---|
| save | Checkpoint save directory (PVC path) |
| load | Checkpoint load directory |
| save_interval | Save every N iterations |
| ckpt_format | Format: torch or torch_dist (distributed, faster) |
| auto_detect_ckpt_format | Auto-detect checkpoint format on load |
| no_save_optim | Skip optimizer state in checkpoint (smaller files) |
| no_save_rng | Skip RNG state |

Megatron saves sharded checkpoints — each TP/PP rank saves its shard to the same directory.
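A representative checkpointing fragment (paths and interval are illustrative):

```shell
# Pointing --load at the same directory resumes from the latest checkpoint
--save /checkpoints/run1 \
--load /checkpoints/run1 \
--save-interval 500 \
--ckpt-format torch_dist
```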

Elastic Training and Failure Recovery

  • Checkpoint-based recovery: Set save_interval frequently (every 100-500 iterations). On pod failure, the Job restarts and resumes from the last checkpoint via load.
  • torch_dist format: Use ckpt_format="torch_dist" for async distributed checkpointing — significantly faster save/load at scale.
  • Straggler detection: Set enable_straggler_detection=True to log slow ranks. Helps identify failing nodes before they crash.
  • Manual elastic resizing: Megatron doesn't support automatic elastic training. To change GPU count, save a checkpoint, convert parallelism with Megatron Bridge, and restart.

Checkpoint Conversion (Megatron Bridge)

Convert between HuggingFace and Megatron formats, or change parallelism dimensions. This is critical for:

  • Starting training from HF pretrained weights
  • Exporting Megatron checkpoints for inference (vLLM, TGI, SGLang)
  • Changing parallelism config (re-sharding) without retraining

from megatron.bridge import AutoBridge

# HuggingFace → Megatron
bridge = AutoBridge.from_hf_pretrained("my-org/my-model", trust_remote_code=True)
provider = bridge.to_megatron_provider()
provider.tensor_model_parallel_size = 4
provider.pipeline_model_parallel_size = 2
provider.finalize()
model = provider.provide_distributed_model(wrap_with_ddp=False)

# Megatron → HuggingFace (for inference export)
bridge.save_hf_pretrained(model, "/checkpoints/hf_export")

# Stream weights without full save
for name, weight in bridge.export_hf_weights(model, cpu=True):
    print(name, tuple(weight.shape))

Install: pip install megatron-bridge. For a CLI-based conversion script, use the example from the package:

# Using the included conversion example script
python examples/conversion/convert_checkpoints.py import \
  --hf-model-path my-org/my-model \
  --output-path /checkpoints/megatron \
  --tp 4 --pp 2

python examples/conversion/convert_checkpoints.py export \
  --checkpoint-path /checkpoints/megatron \
  --output-path /checkpoints/hf_export

Checkpoint Format Comparison

| Format | Save Speed | Load Speed | Portability | Use Case |
|---|---|---|---|---|
| torch | Slow (serial) | Slow | Highest: single file | Small models, debugging |
| torch_dist | Fast (parallel, async) | Fast | Requires same TP×PP | Production training |
| HuggingFace (via Bridge) | N/A (conversion) | N/A | Universal | Inference, sharing |

MCore Distributed Checkpointing

MCore's dist_checkpointing module handles sharded save/load with automatic parallelism resharding:

from megatron.core import dist_checkpointing

# Save sharded state dict
state_dict = {"model": model.sharded_state_dict()}
dist_checkpointing.save(state_dict, checkpoint_dir)

# Load with different parallelism (automatic resharding); load() returns
# the filled state dict
sharded_sd = {"model": model.sharded_state_dict()}
state_dict = dist_checkpointing.load(sharded_sd, checkpoint_dir)
model.load_state_dict(state_dict["model"])

Data Configuration

| Setting | Purpose |
|---|---|
| tokenizer_type | HuggingFaceTokenizer or SentencePieceTokenizer |
| tokenizer_model | HF model ID or path to tokenizer |
| data_path | Path to preprocessed binary data (.bin + .idx) |
| split | Train/val/test split ratio (e.g., 99,1,0) |
| dataloader_type | cyclic (wraps around) or single (one pass) |

Data Preprocessing

Run as a preprocessing Job before training:

python tools/preprocess_data.py \
  --input raw_data.jsonl \
  --output-prefix /data/my_dataset \
  --tokenizer-type HuggingFaceTokenizer \
  --tokenizer-model meta-llama/Llama-3.1-8B \
  --workers 32 --append-eod
# Produces: my_dataset_text_document.bin + .idx

Data blending — weighted mix of datasets:

data_path: "0.7 dataset_a_text_document 0.3 dataset_b_text_document"
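The same blend as CLI flags (paths are illustrative; prefixes omit the .bin/.idx extensions):

```shell
# 70/30 weighted mix of two preprocessed datasets
--data-path 0.7 /data/dataset_a_text_document 0.3 /data/dataset_b_text_document
```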

Kubernetes Deployment

Megatron training runs as a multi-node PyTorchJob or similar CRD. Key pod spec considerations:

  • MASTER_ADDR / MASTER_PORT: Set via the training operator (PyTorchJob sets these automatically)
  • Shared storage: All ranks need access to the same data and checkpoint PVC
  • /dev/shm: Mount as emptyDir with medium: Memory for inter-GPU communication
  • GPU topology: Use topologySpreadConstraints or node affinity to keep TP ranks on the same node
  • nproc_per_node: Match to nvidia.com/gpu resource limit
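Inside each pod, the container command typically wraps the training script with torchrun. A minimal sketch, assuming the operator injects MASTER_ADDR, MASTER_PORT, and a node-rank variable (exact environment variable names vary by operator and version):

```shell
# One launcher process per pod; 8 ranks per node to match the GPU limit.
# Remaining model/data/training flags are appended after the script name.
torchrun \
  --nproc_per_node 8 \
  --nnodes "${NNODES}" \
  --node_rank "${NODE_RANK}" \
  --master_addr "${MASTER_ADDR}" \
  --master_port "${MASTER_PORT}" \
  pretrain_gpt.py \
  --tensor-model-parallel-size 8 \
  --sequence-parallel
```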

Debugging

See references/troubleshooting.md for:

  • Communication errors and rank failures
  • OOM at various scales
  • Pipeline bubble optimization
  • Checkpoint conversion issues
  • Data loading problems

References

Cross-References

  • pytorch — PyTorch distributed training fundamentals
  • fsdp — Alternative: PyTorch FSDP for smaller-scale training
  • deepspeed — Alternative: DeepSpeed ZeRO for large model training
  • flash-attention — Ring Attention / context parallelism
  • aws-efa — EFA networking for multi-node Megatron training
  • verl — RL training using Megatron-LM backend
  • nccl — NCCL tuning for multi-node 3D parallelism
  • gpu-operator — GPU driver and device plugin prerequisites
  • kubeflow-trainer — Orchestrate Megatron training jobs on Kubernetes
  • kueue — Queue large-scale Megatron training workloads
  • wandb — Experiment tracking for Megatron training runs
  • nemo — NeMo framework built on Megatron Core
