NVFP4 Pretraining with Megatron Bridge

NVFP4 training

NVFP4 is a 4-bit floating-point format natively supported by NVIDIA Blackwell Tensor Cores. Applied during pretraining, it reduces the memory bandwidth and compute cost of matrix multiplications, and it preserves model quality by accumulating results in higher precision (BF16/FP32).
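To make "4-bit floating point" more concrete, here is a minimal pure-Python sketch of block-scaled 4-bit quantization. It assumes the commonly described NVFP4 layout, which this playbook does not spell out: E2M1 elements (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) grouped in blocks of 16 that share one scale factor. The scale is kept as a plain Python float for simplicity, and the function names are illustrative only. This is not how Transformer Engine or Megatron-Bridge implement the Blackwell kernels.

```python
# Illustrative only: a "fake quantization" round trip in plain Python.
# Assumptions (not from this playbook): E2M1 element grid, 16-value blocks,
# one shared scale per block kept here as an ordinary float.

E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # representable magnitudes
BLOCK_SIZE = 16


def quantize_block(block):
    """Return (scale, codes) where each code is a signed E2M1 grid value."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest grid value (6.0).
    scale = amax / E2M1_VALUES[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        # Round the scaled magnitude to the nearest representable value.
        mag = min(E2M1_VALUES, key=lambda v: abs(abs(x) / scale - v))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate real values from the grid codes and block scale."""
    return [scale * c for c in codes]


def fake_quantize(values):
    """Quantize and immediately dequantize a flat list, block by block."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        scale, codes = quantize_block(values[start:start + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out
```

Calling fake_quantize on a list of weights shows the rounding error that a per-block scale introduces. Because each block is scaled to its own largest magnitude, one outlier only degrades the 16 values that share its block.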

Megatron-Bridge is NVIDIA's library for large-scale distributed training built on top of Megatron-Core. It provides composable recipe configs for models, optimizers, and mixed-precision strategies — including the first-class bf16_with_nvfp4_mixed recipe used in this playbook.

Combining the two lets you pretrain LLMs at lower memory cost and higher throughput compared to BF16-only training, with minimal accuracy trade-off.

Key benefits:

Up to ~2× higher training throughput vs BF16 (1.68× measured in this playbook), with minimal loss in model quality
Native Blackwell NVFP4 GEMMs — FP4 matmuls run as a single Tensor Core instruction, no software emulation overhead
Recipe-based configuration — swap between bf16_mixed, bf16_with_fp8_current_scaling_mixed, and bf16_with_nvfp4_mixed with a single line
Stability controls — pin the first/last N transformer layers in BF16 (this playbook keeps the last 4 layers in BF16 via first_last_layers_bf16)
Smaller weights for inference: ~2× less weight storage than FP8 and ~3.5× less than FP16
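The weight-storage ratios in the last item can be sanity-checked with a short calculation. The sketch below assumes that each block of 16 four-bit values carries one extra 8-bit scale factor and ignores any per-tensor metadata. These layout details are assumptions rather than something this playbook states, and bits_per_value is a hypothetical helper.

```python
def bits_per_value(element_bits, block_size=None, scale_bits=0):
    """Average storage cost per value, including amortized block-scale bits."""
    if block_size is None:
        return float(element_bits)
    return element_bits + scale_bits / block_size


# Assumed NVFP4 layout: 4-bit elements, one 8-bit scale per 16-element block.
nvfp4_bits = bits_per_value(4, block_size=16, scale_bits=8)
fp8_bits = bits_per_value(8)
fp16_bits = bits_per_value(16)

print(f"NVFP4 bits per value: {nvfp4_bits}")
print(f"FP8 / NVFP4 storage:  {fp8_bits / nvfp4_bits:.2f}x")
print(f"FP16 / NVFP4 storage: {fp16_bits / nvfp4_bits:.2f}x")
```

Under these assumptions the scale overhead brings the effective cost to 4.5 bits per value, which is why the ratio against FP16 lands near 3.5× rather than a clean 4×, and the ratio against FP8 is somewhat below 2×.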

What you'll accomplish

Pretrain a Llama 3.1 8B model using Megatron-Bridge with NVFP4 mixed precision on an NVIDIA DGX Station. You'll run a short training loop with mock data to verify the full pipeline end to end, compare it against a plain BF16 baseline via the --disable-fp4 flag, and then learn how to point it at real data if required.

Measured results

Run settings:

Model: Llama 3.1 8B (llama3_8b_pretrain_config())
50 iterations, 2 warmup
Global batch size 64, micro batch size 4, sequence length 4096
Dummy data (Megatron-Core's built-in MockGPTDataset — synthetic random token IDs, no real corpus)
Single GB300 GPU, nvcr.io/nvidia/nemo:26.04 container
Latency: average over iterations 20–50; iteration 10 is excluded because it includes one-time CUDA-graph/compile overhead
VRAM: peak of nvidia-smi --query-compute-apps=used_memory sampled every 2 s during the run
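The VRAM measurement described above can be approximated with a small polling script. This is a sketch rather than the script used for the published numbers: it assumes nvidia-smi is on PATH, adds --format=csv,noheader,nounits so that each line is a plain MiB value, and sums across all compute processes on the machine. The function names are illustrative.

```python
import subprocess
import time

# Same query as in the run settings, plus a machine-readable output format.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=used_memory",
    "--format=csv,noheader,nounits",
]


def parse_used_mib(output):
    """Sum per-process used memory (MiB); skip non-numeric lines such as [N/A]."""
    total = 0
    for line in output.splitlines():
        line = line.strip()
        if line.isdigit():
            total += int(line)
    return total


def poll_peak_mib(duration_s, interval_s=2.0):
    """Sample nvidia-smi every interval_s seconds and return the peak in MiB."""
    peak = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        peak = max(peak, parse_used_mib(result.stdout))
        time.sleep(interval_s)
    return peak


# Example (run in a second terminal while training is active):
#   peak = poll_peak_mib(duration_s=600)
#   print(f"peak: {peak / 1024:.1f} GiB")
```

A 2 s sampling interval can miss short allocation spikes, so treat the result as a lower bound on the true peak.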

Precision	Recipe	Avg step time	Throughput (Model TFLOP/s/GPU)	Peak VRAM
BF16 baseline	`bf16_mixed()`	9.05 s	~1399	221.6 GB
NVFP4 (last-4 BF16)	`bf16_with_nvfp4_mixed()` + `first_last_layers_bf16=True`, `num_layers_at_end_in_bf16=4`	5.39 s	~2347	207.8 GB

NVFP4 is 1.68× faster than BF16 (≈68% higher throughput) with ≈13.8 GB (≈6%) less peak VRAM — the regime NVFP4 was designed for, where matmul FLOPs dominate each step and quantization overhead is amortized over wide linear projections.
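The derived figures in this paragraph follow directly from the table and the run settings. The short script below reproduces them and also computes tokens per second, which the table does not list. It assumes a data-parallel size of 1 (a single GPU) when working out the number of gradient-accumulation steps.

```python
# Values copied from the run settings and results table above.
GLOBAL_BATCH = 64
MICRO_BATCH = 4
SEQ_LEN = 4096
DATA_PARALLEL_SIZE = 1  # assumption: single GPU, no data parallelism

runs = {
    "bf16": {"step_s": 9.05, "tflops": 1399, "vram_gb": 221.6},
    "nvfp4": {"step_s": 5.39, "tflops": 2347, "vram_gb": 207.8},
}

tokens_per_step = GLOBAL_BATCH * SEQ_LEN
grad_accum_steps = GLOBAL_BATCH // (MICRO_BATCH * DATA_PARALLEL_SIZE)

speedup = runs["bf16"]["step_s"] / runs["nvfp4"]["step_s"]
tflops_ratio = runs["nvfp4"]["tflops"] / runs["bf16"]["tflops"]
vram_saved_gb = runs["bf16"]["vram_gb"] - runs["nvfp4"]["vram_gb"]
vram_saved_pct = 100 * vram_saved_gb / runs["bf16"]["vram_gb"]

print(f"tokens per step:       {tokens_per_step}")
print(f"micro-batches per step: {grad_accum_steps}")
for name, run in runs.items():
    print(f"{name}: {tokens_per_step / run['step_s']:.0f} tokens/s")
print(f"step-time speedup:     {speedup:.2f}x")
print(f"TFLOP/s ratio:         {tflops_ratio:.2f}x")
print(f"peak VRAM saved:       {vram_saved_gb:.1f} GB ({vram_saved_pct:.1f}%)")
```

The step-time ratio and the TFLOP/s ratio agree to two decimal places, which is expected because both runs process the same number of tokens and model FLOPs per step.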

What to know before starting

Basic Python and PyTorch usage
Familiarity with distributed training concepts (torchrun)
Understanding of mixed precision training (FP16/BF16/FP8)

Prerequisites

NVIDIA DGX Station with Blackwell architecture GPU (GB300 chip)
Docker installed with GPU support
NVIDIA Container Toolkit configured
Megatron-Bridge installed (via the NeMo Framework NGC container)

Verify your setup:

# Check GPU availability and architecture
nvidia-smi

# Verify that Python and PyTorch can see the GPU
python3 -c "import torch; print(torch.cuda.get_device_name(0))"
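Because NVFP4 needs a Blackwell GPU, it can help to check the compute capability as well as the device name before launching a run. The sketch below is a coarse pre-flight check built on the assumption that Blackwell-class devices report a CUDA compute capability with major version 10 or higher, while Hopper reports 9.x. Both function names are hypothetical, and passing this check does not guarantee that every NVFP4 kernel is available in your container.

```python
def supports_native_nvfp4(capability):
    """Coarse check: assume compute capability major >= 10 means Blackwell or newer."""
    major, _minor = capability
    return major >= 10


def check_gpu(device_index=0):
    """Raise if no suitable GPU is visible; otherwise return (name, capability)."""
    import torch  # imported lazily so the helper above works without PyTorch

    if not torch.cuda.is_available():
        raise RuntimeError("No CUDA device is visible to PyTorch")
    name = torch.cuda.get_device_name(device_index)
    capability = torch.cuda.get_device_capability(device_index)
    if not supports_native_nvfp4(capability):
        raise RuntimeError(
            f"{name} reports compute capability {capability}; "
            "NVFP4 is expected to need a Blackwell-class GPU"
        )
    return name, capability


# Example, inside the container:
#   name, capability = check_gpu()
#   print(name, capability)
```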

Time & risk

Estimated duration: 20-30 minutes (quick test loop with default --train-iters 50); longer for real data
Risks:
- NVFP4 requires a Blackwell GPU; the run will fail on Hopper or older architectures
- Mock data is used by default (eval_iters=0); real data requires a preprocessed Megatron-format dataset
Rollback: Stop the torchrun process and remove any checkpoint directories
Last Updated: 05/26/2026
- First Publication