NVFP4 Pretraining with Megatron Bridge

30 MIN

Pretrain Llama 3.1 8B with NVFP4 mixed precision on DGX Station using Megatron Bridge


NVFP4 training

NVFP4 is a 4-bit floating-point format natively supported by NVIDIA Blackwell Tensor Cores. When applied during pretraining, NVFP4 reduces memory bandwidth and compute cost for matrix multiplications, and it preserves model quality by accumulating results in higher precision (BF16/FP32).
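To make the format more concrete, here is a minimal pure-Python sketch of block-scaled 4-bit (E2M1) quantization. It assumes the publicly described NVFP4 layout, which this playbook does not spell out: values are grouped in blocks of 16, each block shares one scale, and each element is rounded to the nearest E2M1 magnitude. The function names are hypothetical, the scale is kept as an ordinary Python float rather than an FP8 value, and the code is a simulation for intuition, not how Tensor Cores or Megatron-Bridge implement the format.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumption: one shared scale per 16 elements


def quantize_block(block):
    """Return (scale, codes), where each code is a signed E2M1 value."""
    amax = max(abs(x) for x in block)
    # Map the block's largest magnitude onto the largest E2M1 value (6.0).
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(-nearest if x < 0 else nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate real values from a block's scale and codes."""
    return [scale * c for c in codes]


def fake_quantize(values):
    """Quantize then dequantize a flat list of floats, block by block."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        scale, codes = quantize_block(values[start:start + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out


if __name__ == "__main__":
    weights = [0.01 * i - 0.1 for i in range(32)]
    restored = fake_quantize(weights)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"max abs error over {len(weights)} values: {worst:.4f}")
```

Because the scale is chosen per block, a single outlier only coarsens the 16 values that share its block instead of the whole tensor.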

Megatron-Bridge is NVIDIA's library for large-scale distributed training built on top of Megatron-Core. It provides composable recipe configs for models, optimizers, and mixed-precision strategies — including the first-class bf16_with_nvfp4_mixed recipe used in this playbook.

Combining the two lets you pretrain LLMs at lower memory cost and higher throughput compared to BF16-only training, with minimal accuracy trade-off.

Key benefits:

  • Up to ~2× higher training throughput vs BF16: higher TFLOP/s with minimal loss in model quality (this playbook measures 1.68× on its test run)
  • Native Blackwell NVFP4 GEMMs — FP4 matmuls run as a single Tensor Core instruction, no software emulation overhead
  • Recipe-based configuration — swap between bf16_mixed, bf16_with_fp8_current_scaling_mixed, and bf16_with_nvfp4_mixed with a single line
  • Stability controls — pin the first/last N transformer layers in BF16 (this playbook keeps the last 4 layers in BF16 via first_last_layers_bf16)
  • ~2× memory reduction: inference weight storage shrinks ~2× vs FP8 and ~3.5× vs FP16
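The weight-storage ratios in the last bullet can be approximated with a little arithmetic. The sketch below assumes each block of 16 four-bit elements carries one 8-bit scale (4.5 bits per weight on average), ignores any per-tensor scale, and uses a rounded 8 billion parameters for Llama 3.1 8B. These are illustrative assumptions, not figures taken from this playbook.

```python
NUM_PARAMS = 8.0e9  # rounded parameter count for Llama 3.1 8B (assumption)


def bits_per_weight_nvfp4(element_bits=4, scale_bits=8, block_size=16):
    """Average bits per weight when one scale is shared by each block."""
    return element_bits + scale_bits / block_size


def weight_gib(num_params, bits_per_weight):
    """Weight storage in GiB for a given average bit width."""
    return num_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    formats = {"FP16": 16.0, "FP8": 8.0, "NVFP4": bits_per_weight_nvfp4()}
    nvfp4_bits = formats["NVFP4"]
    for name, bits in formats.items():
        size = weight_gib(NUM_PARAMS, bits)
        ratio = bits / nvfp4_bits
        print(f"{name:>5}: {size:6.2f} GiB  ({ratio:.2f}x the NVFP4 size)")
```

Under these assumptions FP16 comes out at roughly 3.6× and FP8 at roughly 1.8× the NVFP4 size, in line with the rounded ~3.5× and ~2× figures. These ratios describe stored weights only; training also holds gradients, optimizer state, and activations, so peak training VRAM shrinks far less.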

What you'll accomplish

Pretrain a Llama 3.1 8B model using Megatron-Bridge with NVFP4 mixed precision on NVIDIA DGX Station. You'll run a short training loop with mock data to verify the full pipeline end-to-end, compare it against a plain BF16 baseline via the --disable-fp4 flag, and then learn how to point the pipeline at real data if needed.
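As a rough illustration of how a --disable-fp4 switch could choose between the two recipes, the sketch below maps the flag to a recipe name and marks which transformer layers would stay in BF16. It does not import Megatron-Bridge and is not the playbook's training script: the parser, select_recipe, and layer_precisions are hypothetical helpers, and only the recipe names, the two override keys, and the --disable-fp4 and --train-iters flags come from this page. The layer count of 32 is the published depth of Llama 3.1 8B.

```python
import argparse

NUM_LAYERS = 32  # transformer layers in Llama 3.1 8B


def build_parser():
    """Hypothetical CLI exposing the two flags mentioned in this playbook."""
    parser = argparse.ArgumentParser(description="Precision selector sketch")
    parser.add_argument("--disable-fp4", action="store_true",
                        help="fall back to the plain BF16 recipe")
    parser.add_argument("--train-iters", type=int, default=50)
    return parser


def select_recipe(disable_fp4):
    """Return (recipe name, overrides) for the requested precision."""
    if disable_fp4:
        return "bf16_mixed", {}
    return "bf16_with_nvfp4_mixed", {
        "first_last_layers_bf16": True,
        "num_layers_at_end_in_bf16": 4,
    }


def layer_precisions(num_layers, recipe, overrides):
    """Label every layer as 'bf16' or 'nvfp4' under the chosen recipe."""
    if recipe == "bf16_mixed":
        return ["bf16"] * num_layers
    pinned = 0
    if overrides.get("first_last_layers_bf16"):
        pinned = overrides.get("num_layers_at_end_in_bf16", 0)
    return ["nvfp4"] * (num_layers - pinned) + ["bf16"] * pinned


if __name__ == "__main__":
    for argv in ([], ["--disable-fp4"]):
        args = build_parser().parse_args(argv)
        recipe, overrides = select_recipe(args.disable_fp4)
        plan = layer_precisions(NUM_LAYERS, recipe, overrides)
        print(argv, recipe, f"bf16 layers: {plan.count('bf16')}/{NUM_LAYERS}")
```

Keeping the choice in one function means the BF16 baseline and the NVFP4 run differ only in the recipe and its overrides, which is what makes the comparison fair.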

Measured results

Run settings:

  • Model: Llama 3.1 8B (llama3_8b_pretrain_config())
  • 50 iterations, 2 warmup
  • Global batch size 64, micro batch size 4, sequence length 4096
  • Dummy data (Megatron-Core's built-in MockGPTDataset — synthetic random token IDs, no real corpus)
  • Single GB300 GPU, nvcr.io/nvidia/nemo:26.04 container
  • Latency: average of iterations 20–50 (iter 10 includes one-time CUDA-graph/compile overhead)
  • VRAM: peak of nvidia-smi --query-compute-apps=used_memory sampled every 2 s during the run
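The VRAM measurement described in the last setting can be approximated with a small polling loop. This is a minimal sketch, not the script used to produce the numbers on this page: it assumes nvidia-smi is on PATH, adds the standard --format=csv,noheader,nounits option so the output is one number (MiB) per compute process, and sums the processes, which matches a single-process run. The function names are hypothetical.

```python
import subprocess
import time

# Query named in the run settings, plus CSV formatting for easy parsing.
QUERY = ["nvidia-smi", "--query-compute-apps=used_memory",
         "--format=csv,noheader,nounits"]


def parse_used_mib(output):
    """Sum per-process memory (MiB) from nvidia-smi CSV output."""
    total = 0
    for line in output.splitlines():
        line = line.strip()
        if line.isdigit():  # skips blanks and placeholders such as [N/A]
            total += int(line)
    return total


def sample_peak_mib(duration_s, interval_s=2.0, run=subprocess.run):
    """Poll nvidia-smi for duration_s seconds and return the peak MiB seen."""
    peak = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        result = run(QUERY, capture_output=True, text=True, check=True)
        peak = max(peak, parse_used_mib(result.stdout))
        time.sleep(interval_s)
    return peak
```

Started in a second terminal alongside the training run, a call such as sample_peak_mib(600) reports the highest reading over ten minutes. With a 2 s interval, spikes shorter than that can be missed, so the result is best read as a lower bound on the true peak.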

| Precision | Recipe | Avg step time | Throughput (Model TFLOP/s/GPU) | Peak VRAM |
| --- | --- | --- | --- | --- |
| BF16 baseline | bf16_mixed() | 9.05 s | ~1399 | 221.6 GB |
| NVFP4 (last-4 BF16) | bf16_with_nvfp4_mixed() + first_last_layers_bf16=True, num_layers_at_end_in_bf16=4 | 5.39 s | ~2347 | 207.8 GB |

NVFP4 is 1.68× faster than BF16 (≈68% higher throughput) and uses ≈13.8 GB (≈6%) less peak VRAM. This is the regime NVFP4 was designed for: matmul FLOPs dominate each step, and quantization overhead is amortized over wide linear projections.
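The headline figures follow directly from the measured values and run settings. The short sketch below recomputes them and also derives tokens per second and the gradient-accumulation count, assuming the usual relation global batch = micro batch × data-parallel size × accumulation steps with a data-parallel size of 1 on a single GPU. The helper names are illustrative.

```python
# Measured values and run settings copied from this section.
RUNS = {
    "bf16": {"step_s": 9.05, "tflops": 1399, "vram_gb": 221.6},
    "nvfp4": {"step_s": 5.39, "tflops": 2347, "vram_gb": 207.8},
}
GLOBAL_BATCH = 64
MICRO_BATCH = 4
SEQ_LEN = 4096


def speedup(baseline, candidate):
    """Ratio of step times (values above 1 mean the candidate is faster)."""
    return baseline["step_s"] / candidate["step_s"]


def vram_saved_gb(baseline, candidate):
    """Absolute reduction in peak VRAM, in GB."""
    return baseline["vram_gb"] - candidate["vram_gb"]


def tokens_per_second(run):
    """Tokens processed per second for one optimizer step."""
    return GLOBAL_BATCH * SEQ_LEN / run["step_s"]


def grad_accum_steps(data_parallel_size=1):
    """Micro-batches accumulated per optimizer step."""
    return GLOBAL_BATCH // (MICRO_BATCH * data_parallel_size)


if __name__ == "__main__":
    base, fp4 = RUNS["bf16"], RUNS["nvfp4"]
    saved = vram_saved_gb(base, fp4)
    print(f"speedup: {speedup(base, fp4):.2f}x")
    print(f"VRAM saved: {saved:.1f} GB ({100 * saved / base['vram_gb']:.1f}%)")
    print(f"accumulation steps: {grad_accum_steps()}")
    for name, run in RUNS.items():
        print(f"{name}: {tokens_per_second(run):,.0f} tokens/s")
```

The step-time ratio and the ratio of the reported TFLOP/s values both round to 1.68, so the two throughput columns in the table are consistent with each other.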

What to know before starting

  • Basic Python and PyTorch usage
  • Familiarity with distributed training concepts (torchrun)
  • Understanding of mixed precision training (FP16/BF16/FP8)

Prerequisites

  • NVIDIA DGX Station with Blackwell architecture GPU (GB300 chip)
  • Docker installed with GPU support
  • NVIDIA Container Toolkit configured
  • Megatron-Bridge installed (via the NeMo Framework NGC container)

Verify your setup:

# Check GPU availability and architecture
nvidia-smi

# Verify that PyTorch can see the GPU (prints the device name)
python3 -c "import torch; print(torch.cuda.get_device_name(0))"
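Since NVFP4 fails on pre-Blackwell GPUs, it can help to check the architecture programmatically before launching a run. The sketch below assumes that Blackwell-class GPUs report a CUDA compute capability with major version 10 or higher (Hopper reports 9.0); treat that threshold as an assumption and confirm it against NVIDIA's compute capability tables. The helper names are hypothetical, and torch.cuda.get_device_capability is the only PyTorch query used.

```python
def supports_nvfp4(capability):
    """Heuristic: True when the (major, minor) capability looks Blackwell-class.

    Assumption: Blackwell and newer GPUs report major version >= 10.
    """
    major, _minor = capability
    return major >= 10


def detect_capability(device_index=0):
    """Return (major, minor) for a CUDA device, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(device_index)


if __name__ == "__main__":
    capability = detect_capability()
    if capability is None:
        print("No CUDA device visible to PyTorch (or PyTorch not installed)")
    elif supports_nvfp4(capability):
        print(f"Compute capability {capability}: NVFP4 should be available")
    else:
        print(f"Compute capability {capability}: expect NVFP4 to fail here")
```

Running this inside the container, rather than on the host, also confirms that the GPU is passed through to the environment where training will run.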

Time & risk

  • Estimated duration: 20-30 minutes (quick test loop with default --train-iters 50); longer for real data
  • Risks:
    • NVFP4 requires Blackwell GPUs — will fail on Hopper or older
    • Mock data is used by default (eval_iters=0); real data requires a preprocessed Megatron-format dataset
  • Rollback: Stop the torchrun process and remove any checkpoint directories
  • Last Updated: 05/26/2026
    • First Publication

Resources

  • Megatron Bridge Documentation
  • Mixed Precision Training Guide
  • Megatron Bridge GitHub

Copyright © 2026 NVIDIA Corporation