NVFP4 Pretraining with Megatron Bridge
Pretrain Llama 3.1 8B with NVFP4 mixed precision on DGX Station using Megatron Bridge
Set up the environment
The recommended way to run Megatron-Bridge on DGX Station is through the NeMo Framework container, which includes Megatron-Bridge, Megatron-Core, Transformer Engine, and all CUDA dependencies pre-installed. Running outside the container is not supported in this playbook — the NVFP4 kernels rely on the exact Transformer Engine / CUDA versions shipped inside the image.
git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-nvfp4-pretraining/assets
# Use the latest nemo tag
export TAG=26.04
docker run --rm -it \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
-v "$(pwd):/workdir" \
-w /workdir \
--entrypoint bash \
nvcr.io/nvidia/nemo:${TAG}
All subsequent torchrun / python commands in this playbook are meant to be executed from the shell inside this container.
Review the pretraining script
The pretraining script can be found at pretrain_llama.py. The key piece is the NVFP4 precision config, built on top of Megatron-Bridge's prebuilt bf16_with_nvfp4_mixed recipe:
from megatron.bridge.training.mixed_precision import bf16_with_nvfp4_mixed

def nvfp4_mixed_precision():
    # Start from the prebuilt BF16 + NVFP4 recipe
    cfg = bf16_with_nvfp4_mixed()
    # Pin no layers at the start and keep the last 4 layers in BF16
    cfg.first_last_layers_bf16 = True
    cfg.num_layers_at_start_in_bf16 = 0
    cfg.num_layers_at_end_in_bf16 = 4
    return cfg
bf16_with_nvfp4_mixed() already sets fp8="e4m3" and fp8_recipe="nvfp4" under the hood; we just toggle the layer-pinning knobs on top:
- Last 4 layers in BF16 (num_layers_at_end_in_bf16=4) for training stability (adjustable per model)
- No start-layer pinning (num_layers_at_start_in_bf16=0); last-layer stability is usually enough
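To make the effect of the two knobs concrete, the sketch below lists which transformer layers would stay in BF16 for a given layer count. It only illustrates the arithmetic: the helper name bf16_layer_indices is hypothetical, and the real layer selection happens inside Megatron-Core and Transformer Engine. The layer count of 32 is that of Llama 3.1 8B.

```python
def bf16_layer_indices(num_layers, num_start, num_end):
    """Return 0-based indices of layers kept in BF16 (hypothetical helper)."""
    if num_start < 0 or num_end < 0 or num_start + num_end > num_layers:
        raise ValueError("pinned layer counts must fit within num_layers")
    first = list(range(num_start))                        # layers pinned at the start
    last = list(range(num_layers - num_end, num_layers))  # layers pinned at the end
    return first + last


if __name__ == "__main__":
    # Values used in this playbook: no start pinning, last 4 layers in BF16.
    pinned = bf16_layer_indices(num_layers=32, num_start=0, num_end=4)
    print("BF16 layers:", pinned)
    print("NVFP4 layers:", 32 - len(pinned))
```

With these settings, 28 of the 32 layers run in NVFP4 and layers 28 through 31 stay in BF16. Raising num_end trades some NVFP4 speedup for stability.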
NOTE
The script uses llama3_8b_pretrain_config(), which defaults to context_parallel_size=2, and overrides it to context_parallel_size=1 for single-GPU runs. If you swap in a larger recipe (e.g. nemotron_3_nano_pretrain_config, which defaults to TP=4), either launch torchrun --nproc_per_node=4 on a 4-GPU node or set config.model.tensor_model_parallel_size = 1 before calling pretrain(...). Otherwise you will hit:
AssertionError: world size (1) is not divisible by total_model_size (...tensor_model_parallel_size=4 * ...).
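The assertion comes down to a divisibility check between the number of launched processes and the model-parallel size. Here is a minimal sketch of that check, assuming the model-parallel size is the product of the tensor, pipeline, and context parallel sizes. The function name and error text are illustrative and are not Megatron-Bridge's own code.

```python
def check_world_size(world_size, tp=1, pp=1, cp=1):
    """Return the data-parallel size, or raise if the layout does not fit.

    Assumption: total model size = tp * pp * cp.
    """
    total_model_size = tp * pp * cp
    if world_size % total_model_size != 0:
        raise ValueError(
            f"world size ({world_size}) is not divisible by "
            f"total model size ({total_model_size} = tp {tp} * pp {pp} * cp {cp})"
        )
    # Whatever is left after model parallelism is data parallelism.
    return world_size // total_model_size


if __name__ == "__main__":
    print(check_world_size(1, cp=1))   # single GPU with the override applied
    print(check_world_size(4, tp=4))   # 4-GPU node with a TP=4 recipe
```

Calling check_world_size(1, tp=4) raises, which mirrors the failure described in the note: one process cannot host a recipe that expects four tensor-parallel ranks.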
Launch NVFP4 pre-training
Launch a short training run with mock data and redirect the output to a log file so you can inspect VRAM and per-iteration latency afterwards:
torchrun --nproc_per_node=1 pretrain_llama.py > nvfp4.log 2>&1
Expected output (see nvfp4.log):
- Model initialization logs and a Theoretical memory footprints: weight and optimizer=... line
- Iteration progress printed every step (log_interval=1), e.g. iteration 10/50 | ... elapsed time per iteration (ms): ... | lm loss: ...
- A [Rank 0] ... memory (GB) | mem-max-reserved-gigabytes: ... line (this is your peak VRAM)
- A checkpoint saved to /workdir/nemo_experiments/default/checkpoints
If torchrun exits with status 0 and nvfp4.log contains no traceback, your NVFP4 pretraining setup is working.
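Because the output goes to a file rather than the terminal, a quick scan of the log can confirm the run looks healthy. This sketch checks for the marker strings quoted in the expected output and for a Python traceback. The helper name check_log is hypothetical, and the exact log wording may differ between container versions.

```python
from pathlib import Path

# Marker strings taken from the expected output listed in this playbook.
EXPECTED_MARKERS = (
    "Theoretical memory footprints",
    "elapsed time per iteration (ms)",
    "mem-max-reserved-gigabytes",
)


def check_log(text):
    """Return (ok, missing_markers, has_traceback) for a training log."""
    missing = [m for m in EXPECTED_MARKERS if m not in text]
    has_traceback = "Traceback (most recent call last)" in text
    return (not missing and not has_traceback), missing, has_traceback


if __name__ == "__main__":
    log = Path("nvfp4.log")
    if log.exists():
        ok, missing, tb = check_log(log.read_text(errors="replace"))
        print(f"ok={ok} missing={missing} traceback={tb}")
    else:
        print("nvfp4.log not found; run the training command first")
```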
Compare with BF16 baseline
Run the same script with --disable-fp4 to establish a BF16 baseline, again logging to a file:
# Remove the prior checkpoint directory so the two runs don't interfere
rm -rf nemo_experiments
torchrun --nproc_per_node=1 pretrain_llama.py --disable-fp4 > bf16.log 2>&1
To compare the two runs on latency and throughput, grep the per-iteration lines out of each log:
grep -E "elapsed time per iteration|MODEL_TFLOP" nvfp4.log
grep -E "elapsed time per iteration|MODEL_TFLOP" bf16.log
Each step prints two lines:
- Step Time : 5.39s GPU utilization: 2347.0MODEL_TFLOP/s/GPU (step latency and throughput)
- iteration 10/50 | ... elapsed time per iteration (ms): 5390 | ... lm loss: ... (the same latency in ms, plus loss)
Iteration 10 includes one-time CUDA-graph/compile overhead, so average iterations 20–50 for a fair per-step latency number.
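The averaging over iterations 20–50 can be scripted instead of done by hand. This sketch assumes the per-iteration lines follow the "iteration N/M | ... elapsed time per iteration (ms): X" shape shown above. The regular expression and function names are illustrative and may need adjusting if the log format differs.

```python
import re
from pathlib import Path
from statistics import mean

# Matches e.g. "iteration 10/50 | ... elapsed time per iteration (ms): 5390 | ..."
ITER_RE = re.compile(
    r"iteration\s+(\d+)\s*/\s*\d+.*?"
    r"elapsed time per iteration \(ms\):\s*(\d+(?:\.\d+)?)"
)


def step_times_ms(text, first=20, last=50):
    """Collect per-iteration latencies (ms) for iterations first..last inclusive."""
    times = []
    for line in text.splitlines():
        m = ITER_RE.search(line)
        if m and first <= int(m.group(1)) <= last:
            times.append(float(m.group(2)))
    return times


def mean_step_ms(text, first=20, last=50):
    """Mean latency over the selected iterations, or None if none matched."""
    times = step_times_ms(text, first, last)
    return mean(times) if times else None


if __name__ == "__main__":
    results = {}
    for name in ("nvfp4.log", "bf16.log"):
        path = Path(name)
        if path.exists():
            results[name] = mean_step_ms(path.read_text(errors="replace"))
            print(name, "mean ms/iter:", results[name])
    if results.get("nvfp4.log") and results.get("bf16.log"):
        ratio = results["bf16.log"] / results["nvfp4.log"]
        print("BF16 / NVFP4 latency ratio:", ratio)
```

A ratio above 1 means the NVFP4 run took less time per step than the BF16 baseline.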
Measuring peak VRAM (from nvidia-smi)
Megatron's in-log memory numbers (mem-max-reserved-gigabytes) reflect PyTorch's caching-allocator reservation, which can drift from what the device actually holds. For an accurate read, watch nvidia-smi live from a second shell while training runs:
watch -n 1 nvidia-smi
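If you prefer a recorded peak over watching the screen, the same reading can be polled from Python. This sketch uses nvidia-smi's --query-gpu=memory.used CSV output and keeps the maximum seen per GPU. The function names are hypothetical, and it assumes nvidia-smi reports numeric MiB values on this system. Sampling once per second can miss short spikes, so treat the result as a lower bound on the true peak.

```python
import subprocess
import time

# Ask nvidia-smi for used memory only, one bare MiB value per GPU line.
QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]


def parse_used_mib(output):
    """Parse nvidia-smi CSV output (one MiB value per GPU line) into ints."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def poll_peak_mib(samples, interval_s=1.0):
    """Sample memory.used `samples` times; return the per-GPU maximum in MiB."""
    peak = []
    for _ in range(samples):
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        used = parse_used_mib(out)
        peak = used if not peak else [max(p, u) for p, u in zip(peak, used)]
        time.sleep(interval_s)
    return peak


# Example (run in a second shell while training is active):
# print(poll_peak_mib(samples=600))
```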
See the measured numbers in overview.md for expected VRAM and latency on 1× GB300 with Llama 3.1 8B.
Script arguments
pretrain_llama.py accepts the following arguments:
| Argument | Type | Default | Description |
|---|---|---|---|
| --disable-fp4 | flag | off | Disable NVFP4; use plain BF16 mixed precision as a baseline |
| --train-iters | int | 50 | Number of training iterations |
| --warmup-iters | int | 2 | Number of warmup iterations |
| --global-batch-size | int | 64 | Global batch size |
| --micro-batch-size | int | 4 | Micro batch size (drives peak VRAM; increase to use more memory) |
| --seq-length | int | 4096 | Sequence length |
Example combining several flags:
torchrun --nproc_per_node=1 pretrain_llama.py \
--train-iters 50 --warmup-iters 2 \
--global-batch-size 64 --micro-batch-size 4 --seq-length 4096
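The batch-size flags interact: the global batch is processed as several micro batches per optimizer step, so only the micro batch size affects how much is resident on the GPU at once. A minimal sketch of that arithmetic, assuming the usual Megatron relationship (global batch = micro batch × data-parallel size × accumulation steps); the helper name batch_plan is hypothetical:

```python
def batch_plan(global_batch_size, micro_batch_size, seq_length, data_parallel_size=1):
    """Derive accumulation steps and tokens per iteration from the batch flags."""
    per_pass = micro_batch_size * data_parallel_size
    if global_batch_size % per_pass != 0:
        raise ValueError(
            "global batch size must be divisible by "
            "micro batch size * data-parallel size"
        )
    return {
        # Forward/backward passes accumulated before each optimizer step.
        "grad_accum_steps": global_batch_size // per_pass,
        # Tokens consumed by one training iteration.
        "tokens_per_iteration": global_batch_size * seq_length,
    }


if __name__ == "__main__":
    # Defaults from the argument table, on a single GPU.
    print(batch_plan(global_batch_size=64, micro_batch_size=4, seq_length=4096))
```

With the defaults on one GPU, each iteration accumulates 16 micro batches and consumes 262,144 tokens.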
Point to real data
To train on your own dataset, modify the config in the script:
config = llama3_8b_pretrain_config()
config.data.data_path = "/path/to/your/preprocessed/dataset"
config.train.train_iters = 5000
config.train.global_batch_size = 256
config.train.micro_batch_size = 2
Megatron-Bridge expects preprocessed data in Megatron format. See the Megatron-Bridge data preparation guide for details.
Cleanup
Remove checkpoints and log files generated by the runs:
rm -rf nemo_experiments/ nvfp4.log bf16.log
Then exit the container shell (exit). The --rm flag passed to docker run in "Set up the environment" removes the container automatically.
References
- Quickstart: https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/tutorials/recipes/llama/00_quickstart_pretrain.py
- Mixed precision: https://docs.nvidia.com/nemo/megatron-bridge/latest/training/mixed-precision.html
- API: https://docs.nvidia.com/nemo/megatron-bridge/latest/apidocs/bridge/bridge.training.mixed_precision.html