NVFP4 Pretraining with Megatron Bridge

30 MIN

Pretrain Llama 3.1 8B with NVFP4 mixed precision on DGX Station using Megatron Bridge


Step 1
Set up the environment

The recommended way to run Megatron-Bridge on DGX Station is through the NeMo Framework container, which includes Megatron-Bridge, Megatron-Core, Transformer Engine, and all CUDA dependencies pre-installed. Running outside the container is not supported in this playbook — the NVFP4 kernels rely on the exact Transformer Engine / CUDA versions shipped inside the image.

git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-nvfp4-pretraining/assets

# Use the latest nemo tag
export TAG=26.04

docker run --rm -it \
  --gpus all \
  --ipc host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -v "$(pwd):/workdir" \
  -w /workdir \
  --entrypoint bash \
  nvcr.io/nvidia/nemo:${TAG}

All subsequent torchrun / python commands in this playbook are meant to be executed from the shell inside this container.

Step 2
Review the pretraining script

The pretraining script can be found at pretrain_llama.py. The key piece is the NVFP4 precision config, built on top of Megatron-Bridge's prebuilt bf16_with_nvfp4_mixed recipe:

from megatron.bridge.training.mixed_precision import bf16_with_nvfp4_mixed

def nvfp4_mixed_precision():
    cfg = bf16_with_nvfp4_mixed()
    cfg.first_last_layers_bf16 = True
    cfg.num_layers_at_start_in_bf16 = 0
    cfg.num_layers_at_end_in_bf16 = 4
    return cfg

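bf16_with_nvfp4_mixed() already sets fp4="e2m1" and fp4_recipe="nvfp4" under the hood (NVFP4 is a 4-bit E2M1 format, so it is configured through the fp4 fields rather than the fp8 ones). The function above only adjusts the layer-pinning knobs: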

  • Last 4 layers in BF16 (num_layers_at_end_in_bf16=4) for training stability (adjustable per model)
  • No start-layer pinning (num_layers_at_start_in_bf16=0) — last-layer stability is usually enough
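
To make the layer-pinning settings concrete, the following sketch labels each transformer layer as BF16 or NVFP4 for a given pair of start/end counts. It only illustrates the intended semantics and is not Megatron-Core or Transformer Engine code; the helper name layer_precisions is hypothetical, and the layer count of 32 is the Llama 3.1 8B depth.

```python
def layer_precisions(num_layers, num_start_bf16, num_end_bf16):
    """Return one 'bf16' or 'nvfp4' label per layer (illustrative only)."""
    if num_start_bf16 + num_end_bf16 > num_layers:
        raise ValueError("pinned layers exceed the total layer count")
    labels = []
    for i in range(num_layers):
        # A layer is pinned to BF16 if it falls in the first or last pinned block.
        pinned = i < num_start_bf16 or i >= num_layers - num_end_bf16
        labels.append("bf16" if pinned else "nvfp4")
    return labels


# Settings from nvfp4_mixed_precision(): 0 layers at the start, 4 at the end.
labels = layer_precisions(num_layers=32, num_start_bf16=0, num_end_bf16=4)
print(labels.count("nvfp4"), "NVFP4 layers;", labels.count("bf16"), "BF16 layers")
```

With these settings, layers 0 to 27 would run in NVFP4 and layers 28 to 31 would stay in BF16.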

NOTE

The script uses llama3_8b_pretrain_config(), which defaults to context_parallel_size=2, and overrides it to context_parallel_size=1 for single-GPU runs. If you swap in a larger recipe (e.g. nemotron_3_nano_pretrain_config, which defaults to TP=4), either launch with torchrun --nproc_per_node=4 on a 4-GPU node or set config.model.tensor_model_parallel_size = 1 before calling pretrain(...). Otherwise you will hit: AssertionError: world size (1) is not divisible by total_model_size (...tensor_model_parallel_size=4 * ...).
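
The divisibility rule behind that assertion can be checked before launching. The sketch below is a standalone pre-flight check, assuming the model-parallel size is the product of the tensor, pipeline, and context parallel sizes; the function name check_world_size is hypothetical and not part of Megatron-Bridge.

```python
def check_world_size(world_size, tp=1, pp=1, cp=1):
    """Return the implied data-parallel size, or raise if the layout cannot fit."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size ({world_size}) is not divisible by "
            f"tp*pp*cp = {tp}*{pp}*{cp} = {model_parallel}"
        )
    return world_size // model_parallel


# Single GPU with context_parallel_size overridden to 1: fits, data-parallel size 1.
print(check_world_size(world_size=1, cp=1))

# Single GPU with a recipe that defaults to TP=4: rejected.
try:
    check_world_size(world_size=1, tp=4)
except ValueError as err:
    print("would fail:", err)
```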

Step 3
Launch NVFP4 pre-training

Launch a short training run with mock data and redirect the output to a log file so you can inspect VRAM and per-iteration latency afterwards:

torchrun --nproc_per_node=1 pretrain_llama.py > nvfp4.log 2>&1

Expected output (see nvfp4.log):

  • Model initialization logs and a Theoretical memory footprints: weight and optimizer=... line
  • Iteration progress printed every step (log_interval=1), e.g. iteration 10/50 | ... elapsed time per iteration (ms): ... | lm loss: ...
  • A [Rank 0] ... memory (GB) | mem-max-reserved-gigabytes: ... line — this is your peak VRAM
  • A checkpoint saved to /workdir/nemo_experiments/default/checkpoints

If torchrun exits with status 0 (check with echo $?) and nvfp4.log contains no traceback, your NVFP4 pretraining setup is working.
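
Because the output goes to a file rather than the terminal, a small script can scan the log for the markers listed above. This is a minimal sketch: the marker strings are taken from the expected-output list, the function names are hypothetical, and the exact log wording may differ between container versions.

```python
from pathlib import Path

# Substrings quoted in the expected-output list of this step.
EXPECTED_MARKERS = (
    "Theoretical memory footprints",
    "elapsed time per iteration",
    "mem-max-reserved-gigabytes",
)


def summarize_log(text):
    """Report which expected markers appear and whether a Python traceback is present."""
    summary = {marker: marker in text for marker in EXPECTED_MARKERS}
    summary["Traceback"] = "Traceback (most recent call last)" in text
    return summary


def run_looks_healthy(text):
    """True if every expected marker is present and no traceback was logged."""
    summary = summarize_log(text)
    return all(summary[m] for m in EXPECTED_MARKERS) and not summary["Traceback"]


log_path = Path("nvfp4.log")
if log_path.exists():
    print(summarize_log(log_path.read_text(errors="replace")))
```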

Step 4
Compare with BF16 baseline

Run the same script with --disable-fp4 to establish a BF16 baseline, again logging to a file:

# Remove the prior checkpoint directory so the two runs don't interfere
rm -rf nemo_experiments

torchrun --nproc_per_node=1 pretrain_llama.py --disable-fp4 > bf16.log 2>&1

To compare the two runs on latency and throughput, grep the per-iteration lines out of each log:

grep -E "elapsed time per iteration|MODEL_TFLOP" nvfp4.log
grep -E "elapsed time per iteration|MODEL_TFLOP" bf16.log

Each step prints two lines:

  • Step Time : 5.39s GPU utilization: 2347.0MODEL_TFLOP/s/GPU — step latency and throughput
  • iteration 10/50 | ... elapsed time per iteration (ms): 5390 | ... lm loss: ... — same latency in ms plus loss

Iteration 10 includes one-time CUDA-graph/compile overhead, so average iterations 20–50 for a fair per-step latency number.
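
The averaging can be scripted instead of done by hand. The sketch below assumes the iteration lines look like the example above (an iteration N/M field followed later on the same line by elapsed time per iteration (ms): value); the regular expression and function names are illustrative and may need adjusting to the actual log layout.

```python
import re
from statistics import mean

# Captures the iteration number and the per-iteration latency from one log line.
ITER_RE = re.compile(
    r"iteration\s+(\d+)\s*/\s*\d+.*?elapsed time per iteration \(ms\):\s*([0-9.]+)"
)


def steady_state_ms(log_text, first_iter=20):
    """Average latency (ms) over logged iterations >= first_iter, skipping warmup."""
    samples = [
        float(ms) for it, ms in ITER_RE.findall(log_text) if int(it) >= first_iter
    ]
    if not samples:
        raise ValueError("no iteration lines at or after the requested iteration")
    return mean(samples)


def speedup(bf16_log_text, nvfp4_log_text, first_iter=20):
    """BF16 latency divided by NVFP4 latency; values above 1 mean NVFP4 is faster."""
    return steady_state_ms(bf16_log_text, first_iter) / steady_state_ms(
        nvfp4_log_text, first_iter
    )
```

Typical use would be speedup(open("bf16.log").read(), open("nvfp4.log").read()) once both runs have finished.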

Measuring peak VRAM (from nvidia-smi)

Megatron's in-log memory numbers (mem-max-reserved-gigabytes) reflect PyTorch's caching-allocator reservation, which can drift from what the device actually holds. For an accurate read, watch nvidia-smi live from a second shell while training runs:

watch -n 1 nvidia-smi
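
If you would rather record the peak than watch it, the same reading can be polled from Python. This sketch uses nvidia-smi's --query-gpu=memory.used CSV output; the helper names, polling interval, and duration are illustrative choices, and it assumes the driver reports a numeric memory.used value on your system.

```python
import subprocess
import time

QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]


def parse_used_mib(output):
    """Parse nvidia-smi CSV output into one memory.used value (MiB) per GPU."""
    return [int(line) for line in output.splitlines() if line.strip()]


def poll_peak_mib(duration_s=60.0, interval_s=1.0):
    """Poll nvidia-smi for duration_s seconds and return the highest MiB seen on any GPU."""
    peak = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        readings = parse_used_mib(out)
        if readings:
            peak = max(peak, max(readings))
        time.sleep(interval_s)
    return peak


# Example (run from a second shell while training is active):
#   print(poll_peak_mib(duration_s=300) / 1024, "GiB peak")
```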

See the measured numbers in overview.md for expected VRAM and latency on 1× GB300 with Llama 3.1 8B.

Step 5
Script arguments

pretrain_llama.py accepts the following arguments:

Argument             Type  Default  Description
--disable-fp4        flag  off      Disable NVFP4; use plain BF16 mixed precision as a baseline
--train-iters        int   50       Number of training iterations
--warmup-iters       int   2        Number of warmup iterations
--global-batch-size  int   64       Global batch size
--micro-batch-size   int   4        Micro batch size (drives peak VRAM; increase to use more memory)
--seq-length         int   4096     Sequence length

Example combining several flags:

torchrun --nproc_per_node=1 pretrain_llama.py \
    --train-iters 50 --warmup-iters 2 \
    --global-batch-size 64 --micro-batch-size 4 --seq-length 4096
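
For orientation, the documented arguments can be mirrored with argparse as shown below. This is a hypothetical reconstruction from the argument list, not the parser in pretrain_llama.py. The grad_accum_steps helper is also an assumption: it illustrates the usual Megatron relationship in which the global batch is split into micro batches across data-parallel ranks.

```python
import argparse


def build_parser():
    """Hypothetical mirror of the documented pretrain_llama.py arguments."""
    p = argparse.ArgumentParser(description="Sketch of the pretrain_llama.py CLI")
    p.add_argument("--disable-fp4", action="store_true", help="BF16 baseline")
    p.add_argument("--train-iters", type=int, default=50)
    p.add_argument("--warmup-iters", type=int, default=2)
    p.add_argument("--global-batch-size", type=int, default=64)
    p.add_argument("--micro-batch-size", type=int, default=4)
    p.add_argument("--seq-length", type=int, default=4096)
    return p


def grad_accum_steps(global_batch, micro_batch, data_parallel=1):
    """Micro-batch passes per optimizer step on each data-parallel rank."""
    per_pass = micro_batch * data_parallel
    if global_batch % per_pass != 0:
        raise ValueError("global batch must be a multiple of micro batch * data-parallel size")
    return global_batch // per_pass


args = build_parser().parse_args([])  # documented defaults
print(grad_accum_steps(args.global_batch_size, args.micro_batch_size))  # 64 / 4 = 16
```

On a single GPU with the defaults, each optimizer step would therefore accumulate 16 micro batches, which is why raising --micro-batch-size increases peak memory without changing the global batch.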

Step 6
Point to real data

To train on your own dataset, modify the config in the script:

config = llama3_8b_pretrain_config()
config.data.data_path = "/path/to/your/preprocessed/dataset"
config.train.train_iters = 5000
config.train.global_batch_size = 256
config.train.micro_batch_size = 2

Megatron-Bridge expects preprocessed data in Megatron format. See the Megatron-Bridge data preparation guide for details.
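
Before pointing the run at real data, it helps to check how many tokens the chosen settings consume, so the dataset is large enough. The arithmetic below is a sketch that assumes the sequence length stays at the script default of 4096; token_budget is a hypothetical helper name.

```python
def token_budget(train_iters, global_batch_size, seq_length):
    """Total tokens consumed: iterations * sequences per step * tokens per sequence."""
    return train_iters * global_batch_size * seq_length


# Settings from the config snippet above, with seq_length assumed to be 4096.
real_run = token_budget(train_iters=5000, global_batch_size=256, seq_length=4096)

# Mock-data run with the script defaults, for comparison.
mock_run = token_budget(train_iters=50, global_batch_size=64, seq_length=4096)

print(f"real-data run: {real_run:,} tokens")  # 5,242,880,000
print(f"mock-data run: {mock_run:,} tokens")  # 13,107,200
```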

Step 7
Cleanup

Remove checkpoints and log files generated by the runs:

rm -rf nemo_experiments/ nvfp4.log bf16.log

Then leave the container shell with exit. The --rm flag from Step 1 removes the container automatically.

References

  • Quickstart: https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/tutorials/recipes/llama/00_quickstart_pretrain.py
  • Mixed precision: https://docs.nvidia.com/nemo/megatron-bridge/latest/training/mixed-precision.html
  • API: https://docs.nvidia.com/nemo/megatron-bridge/latest/apidocs/bridge/bridge.training.mixed_precision.html
