NVFP4 Pretraining with Megatron Bridge
Pretrain Llama 3.1 8B with NVFP4 mixed precision on DGX Station using Megatron Bridge
NVFP4 training
NVFP4 is a 4-bit floating-point format natively supported by NVIDIA Blackwell Tensor Cores. When applied during pretraining, NVFP4 reduces memory bandwidth and compute cost for matrix multiplications while preserving model quality by accumulating in higher precision (BF16/FP32).
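To make the format more concrete, here is a minimal pure-Python sketch of NVFP4-style block quantization. It assumes the commonly described layout (4-bit E2M1 elements with one shared scale per block of 16 values) and keeps the scale in full precision for simplicity. The function names are hypothetical, the rounding is plain nearest-value, and this is an illustration of the idea rather than what Blackwell Tensor Cores or Megatron-Bridge actually execute.

```python
# Illustrative NVFP4-style "fake quantization" (a sketch, not the hardware path).

# Magnitudes representable by a 4-bit E2M1 value (sign is stored separately).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumption: one scale factor shared by 16 consecutive values


def quantize_block(values):
    """Scale a block into the E2M1 range, snap to the nearest grid point, rescale."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return list(values)
    # The real format stores this scale in FP8; it is kept in full precision here.
    scale = amax / E2M1_MAGNITUDES[-1]
    out = []
    for v in values:
        scaled = abs(v) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda g: abs(g - scaled))
        out.append((nearest if v >= 0 else -nearest) * scale)
    return out


def quantize(values):
    """Apply block-wise quantization to a flat list of floats."""
    out = []
    for i in range(0, len(values), BLOCK_SIZE):
        out.extend(quantize_block(values[i:i + BLOCK_SIZE]))
    return out


if __name__ == "__main__":
    xs = [0.01 * i - 0.08 for i in range(16)]
    for x, q in zip(xs, quantize(xs)):
        print(f"{x:+.3f} -> {q:+.3f}")
```

Because every block gets its own scale, an outlier only coarsens the grid for the 16 values that share its block, which is part of why a 4-bit format can remain usable for training.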
Megatron-Bridge is NVIDIA's library for large-scale distributed training built on top of Megatron-Core.
It provides composable recipe configs for models, optimizers, and mixed-precision strategies, including the first-class `bf16_with_nvfp4_mixed` recipe used in this playbook.
Combining the two lets you pretrain LLMs at lower memory cost and higher throughput compared to BF16-only training, with minimal accuracy trade-off.
Key benefits:
- Up to ~2× higher training throughput vs BF16: higher TFLOP/s at minimal loss in model quality (this playbook measures 1.68× on a single GB300)
- Native Blackwell NVFP4 GEMMs: FP4 matmuls run as a single Tensor Core instruction, with no software emulation overhead
- Recipe-based configuration: swap between `bf16_mixed`, `bf16_with_fp8_current_scaling_mixed`, and `bf16_with_nvfp4_mixed` with a single line
- Stability controls: pin the first/last N transformer layers in BF16 (this playbook keeps the last 4 layers in BF16 via `first_last_layers_bf16`)
- ~2× memory reduction: inference weight storage shrinks ~2× vs FP8 and ~3.5× vs FP16
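The storage ratios in the last bullet can be sanity-checked with a little arithmetic. The sketch below assumes the usual NVFP4 accounting of 4 bits per element plus one 8-bit scale shared by each block of 16 values, and ignores any per-tensor scale. Those constants are assumptions about the format, not values stated in this playbook.

```python
ELEMENT_BITS = 4   # E2M1 payload per value
SCALE_BITS = 8     # assumption: one FP8 scale factor per block
BLOCK_SIZE = 16    # assumption: values sharing one scale factor


def effective_bits_per_value(element_bits=ELEMENT_BITS,
                             scale_bits=SCALE_BITS,
                             block_size=BLOCK_SIZE):
    """Bits per stored value once the shared block scale is amortized."""
    return element_bits + scale_bits / block_size


def storage_ratio(baseline_bits):
    """How many times smaller NVFP4 storage is than a baseline format."""
    return baseline_bits / effective_bits_per_value()


if __name__ == "__main__":
    print(f"NVFP4 effective bits/value: {effective_bits_per_value():.2f}")
    print(f"vs FP8:  {storage_ratio(8):.2f}x smaller")
    print(f"vs FP16: {storage_ratio(16):.2f}x smaller")
```

Under these assumptions the ratios come out slightly below 2× against FP8 and slightly above 3.5× against FP16, which is consistent with the approximate figures quoted above.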
What you'll accomplish
Pretrain a Llama 3.1 8B model using Megatron-Bridge with NVFP4 mixed precision on NVIDIA DGX Station.
You'll run a short training loop with mock data to verify the full pipeline end-to-end, compare against a plain BF16 baseline via the `--disable-fp4` flag, and then learn how to point it at real data if required.
Measured results
Run settings:
- Model: Llama 3.1 8B (`llama3_8b_pretrain_config()`)
- 50 iterations, 2 warmup
- Global batch size 64, micro batch size 4, sequence length 4096
- Dummy data (Megatron-Core's built-in `MockGPTDataset`: synthetic random token IDs, no real corpus)
- Single GB300 GPU, `nvcr.io/nvidia/nemo:26.04` container
- Latency: average of iterations 20–50 (iter 10 includes one-time CUDA-graph/compile overhead)
- VRAM: peak of `nvidia-smi --query-compute-apps=used_memory` sampled every 2 s during the run
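The VRAM measurement described in the last setting could be reproduced with a small poller along these lines. This is a sketch, not the script used for the numbers in this playbook: the helper names are hypothetical, the `--format=csv,noheader,nounits` option is added so the output is easy to parse, and summing across processes is a choice that only matters when more than one compute process is running.

```python
import subprocess
import time

# nvidia-smi prints one row per compute process; with "nounits" each row is a
# bare number, which nvidia-smi reports in MiB.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=used_memory",
    "--format=csv,noheader,nounits",
]


def parse_used_mib(output):
    """Sum per-process memory (MiB) from the CSV output, skipping non-numeric rows."""
    total = 0
    for line in output.splitlines():
        field = line.strip()
        if field.isdigit():
            total += int(field)
    return total


def sample_peak_mib(duration_s, interval_s=2.0):
    """Poll nvidia-smi every interval_s seconds and return the peak total in MiB."""
    peak = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
        peak = max(peak, parse_used_mib(result.stdout))
        time.sleep(interval_s)
    return peak
```

Run it in a second terminal for the length of the training job, for example `sample_peak_mib(duration_s=600)`, and convert the returned MiB value to whichever unit you report.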
| Precision | Recipe | Avg step time | Throughput (Model TFLOP/s/GPU) | Peak VRAM |
|---|---|---|---|---|
| BF16 baseline | bf16_mixed() | 9.05 s | ~1399 | 221.6 GB |
| NVFP4 (last-4 BF16) | bf16_with_nvfp4_mixed() + first_last_layers_bf16=True, num_layers_at_end_in_bf16=4 | 5.39 s | ~2347 | 207.8 GB |
NVFP4 is 1.68× faster than BF16 (≈68% higher throughput) with ≈13.8 GB (≈6%) less peak VRAM — the regime NVFP4 was designed for, where matmul FLOPs dominate each step and quantization overhead is amortized over wide linear projections.
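The headline numbers follow directly from the table and run settings. The short script below recomputes them and also derives tokens per second and the gradient-accumulation count. It assumes the usual relation global batch = micro batch × data-parallel size × accumulation steps, with a data-parallel size of 1 for the single-GPU run. The dictionary layout and function names are illustrative only.

```python
# Values copied from the measured-results table and run settings.
BF16 = {"step_s": 9.05, "tflops": 1399.0, "vram_gb": 221.6}
NVFP4 = {"step_s": 5.39, "tflops": 2347.0, "vram_gb": 207.8}
GLOBAL_BATCH = 64
MICRO_BATCH = 4
SEQ_LEN = 4096
DATA_PARALLEL_SIZE = 1  # assumption: single GPU, so no data parallelism


def speedup(baseline_step_s, new_step_s):
    """Ratio of step times; values above 1 mean the new run is faster."""
    return baseline_step_s / new_step_s


def tokens_per_second(step_s):
    """Tokens processed per second given the time for one optimizer step."""
    return GLOBAL_BATCH * SEQ_LEN / step_s


def grad_accum_steps():
    """Micro-batches accumulated per optimizer step."""
    return GLOBAL_BATCH // (MICRO_BATCH * DATA_PARALLEL_SIZE)


if __name__ == "__main__":
    saved_gb = BF16["vram_gb"] - NVFP4["vram_gb"]
    print(f"step-time speedup:  {speedup(BF16['step_s'], NVFP4['step_s']):.2f}x")
    print(f"throughput ratio:   {NVFP4['tflops'] / BF16['tflops']:.2f}x")
    print(f"peak VRAM saved:    {saved_gb:.1f} GB ({100 * saved_gb / BF16['vram_gb']:.1f}%)")
    print(f"tokens per step:    {GLOBAL_BATCH * SEQ_LEN}")
    print(f"BF16 tokens/s:      {tokens_per_second(BF16['step_s']):.0f}")
    print(f"NVFP4 tokens/s:     {tokens_per_second(NVFP4['step_s']):.0f}")
    print(f"grad accum steps:   {grad_accum_steps()}")
```

The step-time ratio and the TFLOP/s ratio agree to two decimal places, which is expected when both runs process the same number of tokens per step.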
What to know before starting
- Basic Python and PyTorch usage
- Familiarity with distributed training concepts (`torchrun`)
- Understanding of mixed precision training (FP16/BF16/FP8)
Prerequisites
- NVIDIA DGX Station with Blackwell architecture GPU (GB300 chip)
- Docker installed with GPU support
- NVIDIA Container Toolkit configured
- Megatron-Bridge installed (via the NeMo Framework NGC container)
Verify your setup:
```bash
# Check GPU availability and architecture
nvidia-smi
# Verify Python and torch
python3 -c "import torch; print(torch.cuda.get_device_name(0))"
```
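Since the NVFP4 recipe fails on pre-Blackwell GPUs, it can help to check the compute capability before launching a run. The sketch below uses `torch.cuda.get_device_capability`. The threshold of major version 10 is an assumption used as a heuristic for "Blackwell or newer" (Hopper reports 9.x), and both function names are hypothetical.

```python
def supports_native_nvfp4(major):
    """Heuristic (assumption): Blackwell-class GPUs report compute capability major >= 10."""
    return major >= 10


def check_gpu():
    """Return a one-line verdict for GPU 0 without raising if PyTorch or CUDA is missing."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "No CUDA device visible"
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    if supports_native_nvfp4(major):
        verdict = "OK for NVFP4"
    else:
        verdict = "pre-Blackwell, NVFP4 recipe expected to fail"
    return f"{name} (compute capability {major}.{minor}): {verdict}"


if __name__ == "__main__":
    print(check_gpu())
```

Running this inside the container before the first training launch gives a quicker failure signal than waiting for the recipe itself to error out.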
Time & risk
- Estimated duration: 20-30 minutes (quick test loop with default `--train-iters 50`); longer for real data
- Risks:
  - NVFP4 requires Blackwell GPUs; it will fail on Hopper or older
  - Mock data is used by default (`eval_iters=0`); real data requires a preprocessed Megatron-format dataset
- Rollback: Stop the `torchrun` process and remove any checkpoint directories
- Last Updated: 05/26/2026
  - First Publication