NVFP4 Pretraining with Megatron Bridge

Step 1
Set up the environment

The recommended way to run Megatron-Bridge on DGX Station is through the NeMo Framework container, which includes Megatron-Bridge, Megatron-Core, Transformer Engine, and all CUDA dependencies pre-installed. Running outside the container is not supported in this playbook — the NVFP4 kernels rely on the exact Transformer Engine / CUDA versions shipped inside the image.

The training recipe fetches the Llama 3 8B architecture config from HuggingFace, so export a HuggingFace token with access to meta-llama/Meta-Llama-3-8B before launching the container:

export HF_TOKEN=<YOUR_HF_TOKEN>

git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-nvfp4-pretraining/assets

# Use the latest nemo tag
export TAG=26.04

# This playbook targets the GB300 GPU; look up its device index
GB300_DEVICE=$(nvidia-smi --query-gpu=index,name --format=csv,noheader | awk -F', ' '/GB300/ {print $1; exit}')

docker run --rm -it \
  --gpus device=${GB300_DEVICE} \
  --ipc host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -v "$(pwd):/workdir" \
  -w /workdir \
  --entrypoint bash \
  nvcr.io/nvidia/nemo:${TAG}

All subsequent torchrun / python commands in this playbook are meant to be executed from the shell inside this container.
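If you would rather select the device from a script than from the awk one-liner above, the same lookup can be sketched in Python. This is illustrative only: find_gpu_index and the sample GPU names are hypothetical, and the sketch assumes the "index, name" CSV layout produced by the nvidia-smi query used in the setup commands. Unlike the awk filter, it matches on the name field only.

```python
import subprocess


def find_gpu_index(smi_csv, name_fragment="GB300"):
    """Return the index of the first GPU whose name contains name_fragment, else None."""
    for line in smi_csv.splitlines():
        if not line.strip():
            continue
        # Each line looks like "<index>, <name>" with --format=csv,noheader.
        index, _, name = line.partition(", ")
        if name_fragment in name:
            return int(index.strip())
    return None


def query_gpus():
    """Run the same nvidia-smi query as the shell snippet and return its stdout."""
    cmd = ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


# Hypothetical sample output; real device names on your system may differ.
sample = "0, NVIDIA RTX PRO 6000\n1, NVIDIA GB300\n"
selected = find_gpu_index(sample)
```

On a real system you would call find_gpu_index(query_gpus()) and stop if it returns None, since an empty device index would make the docker run command fail.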

Step 2
Review the pretraining script

The pretraining script can be found at pretrain_llama.py. The key piece is the NVFP4 precision config, built on top of Megatron-Bridge's prebuilt bf16_with_nvfp4_mixed recipe:

from megatron.bridge.training.mixed_precision import bf16_with_nvfp4_mixed

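def nvfp4_mixed_precision():
    # Start from the prebuilt BF16 + NVFP4 mixed-precision recipe
    cfg = bf16_with_nvfp4_mixed()
    # Enable layer pinning so selected layers stay in BF16
    cfg.first_last_layers_bf16 = True
    # No layers pinned at the start of the model
    cfg.num_layers_at_start_in_bf16 = 0
    # Keep the last 4 layers in BF16 for training stability
    cfg.num_layers_at_end_in_bf16 = 4
    return cfg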

bf16_with_nvfp4_mixed() already sets fp8="e4m3" and fp8_recipe="nvfp4" under the hood; we just toggle the layer-pinning knobs on top:

Last 4 layers in BF16 (num_layers_at_end_in_bf16=4) for training stability (adjustable per model)
No start-layer pinning (num_layers_at_start_in_bf16=0): pinning only the last layers is usually enough for stability

NOTE

The script uses llama3_8b_pretrain_config(), which defaults to context_parallel_size=2, and overrides it to context_parallel_size=1 for single-GPU runs. If you swap in a larger recipe (e.g. nemotron_3_nano_pretrain_config, which defaults to TP=4), either launch torchrun --nproc_per_node=4 on a 4-GPU node or set config.model.tensor_model_parallel_size = 1 before calling pretrain(...). Otherwise the run fails with: AssertionError: world size (1) is not divisible by total_model_size (...tensor_model_parallel_size=4 * ...).
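The divisibility rule behind that assertion can be checked before launching. The sketch below is a stand-alone illustration, not Megatron code: check_parallel_layout is a hypothetical helper, and it assumes the model-parallel footprint is the product of the tensor, pipeline, and context parallel sizes.

```python
def check_parallel_layout(world_size, tensor_parallel=1, pipeline_parallel=1,
                          context_parallel=1):
    """Return the implied data-parallel size, or raise if the layout cannot fit.

    Assumption: total model size = TP * PP * CP, and the world size
    (number of launched processes) must be a multiple of it.
    """
    total_model_size = tensor_parallel * pipeline_parallel * context_parallel
    if world_size % total_model_size != 0:
        raise ValueError(
            f"world size ({world_size}) is not divisible by "
            f"TP*PP*CP ({total_model_size})"
        )
    return world_size // total_model_size


# Single GPU with every parallel size set to 1: valid, data-parallel size 1.
single_gpu = check_parallel_layout(world_size=1)

# Four processes with TP=4: valid, data-parallel size 1.
four_gpu_tp4 = check_parallel_layout(world_size=4, tensor_parallel=4)
```

Calling check_parallel_layout(world_size=1, tensor_parallel=4) raises, which mirrors the situation described in the note: a TP=4 recipe launched with --nproc_per_node=1.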

Step 3
Launch NVFP4 pre-training

Launch a short training run with mock data and redirect the output to a log file so you can inspect VRAM and per-iteration latency afterwards:

torchrun --nproc_per_node=1 pretrain_llama.py > nvfp4.log 2>&1

Expected output (see nvfp4.log):

Model initialization logs and a Theoretical memory footprints: weight and optimizer=... line
Iteration progress printed every step (log_interval=1), e.g. iteration 10/50 | ... elapsed time per iteration (ms): ... | lm loss: ...
A [Rank 0] ... memory (GB) | mem-max-reserved-gigabytes: ... line, which reports the peak memory reserved by PyTorch (an approximation of peak VRAM)
A checkpoint saved to /workdir/nemo_experiments/default/checkpoints

If the run finishes with exit code 0 and nvfp4.log contains no traceback, your NVFP4 pretraining setup is working.
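The checks above can be scripted once the run has finished. The following is a minimal sketch, assuming the memory line carries a plain decimal number after mem-max-reserved-gigabytes: (the exact layout of the surrounding text is not relied on). summarize_log and the sample log text are hypothetical.

```python
import re

# Assumption: the value after the key is a plain decimal number such as "61.25".
MEM_RE = re.compile(r"mem-max-reserved-gigabytes:\s*([0-9]+(?:\.[0-9]+)?)")


def summarize_log(text):
    """Report whether the log is free of tracebacks and the largest reserved-memory value."""
    has_traceback = "Traceback (most recent call last)" in text
    peaks = [float(value) for value in MEM_RE.findall(text)]
    return {
        "ok": not has_traceback,
        "peak_reserved_gb": max(peaks) if peaks else None,
    }


# Synthetic log text for illustration; the numbers are made up.
sample_log = (
    "[Rank 0] (after 1 iterations) memory (GB) | mem-max-reserved-gigabytes: 61.25\n"
    "[Rank 0] (after 2 iterations) memory (GB) | mem-max-reserved-gigabytes: 62.5\n"
)
summary = summarize_log(sample_log)
```

For a real run, read the file first, for example summarize_log(open("nvfp4.log").read()).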

Step 4
Compare with BF16 baseline

Run the same script with --disable-fp4 to establish a BF16 baseline, again logging to a file:

# Remove the prior checkpoint directory so the two runs don't interfere
rm -rf nemo_experiments

torchrun --nproc_per_node=1 pretrain_llama.py --disable-fp4 > bf16.log 2>&1

To compare the two runs on latency and throughput, grep the per-iteration lines out of each log:

grep -E "elapsed time per iteration|MODEL_TFLOP" nvfp4.log
grep -E "elapsed time per iteration|MODEL_TFLOP" bf16.log

Each step prints two lines:

Step Time : 5.39s GPU utilization: 2347.0MODEL_TFLOP/s/GPU — step latency and throughput
iteration 10/50 | ... elapsed time per iteration (ms): 5390 | ... lm loss: ... — same latency in ms plus loss

Iteration 10 includes one-time CUDA-graph/compile overhead, so average iterations 20–50 for a fair per-step latency number.
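The averaging over iterations 20 to 50 can be done with a short script instead of by eye. This sketch assumes the iteration lines follow the "iteration N/M | ... elapsed time per iteration (ms): X | ..." shape shown above; mean_step_ms, speedup, and the synthetic log text are hypothetical and the timing values are invented for illustration.

```python
import re
from statistics import mean

# Captures the iteration number and the per-iteration time in milliseconds.
ITER_RE = re.compile(
    r"iteration\s+(\d+)/\s*\d+\s*\|.*?"
    r"elapsed time per iteration \(ms\):\s*(\d+(?:\.\d+)?)"
)


def mean_step_ms(log_text, first=20, last=50):
    """Average the per-iteration time over iterations first..last (inclusive)."""
    times = [
        float(ms)
        for iteration, ms in ITER_RE.findall(log_text)
        if first <= int(iteration) <= last
    ]
    if not times:
        raise ValueError("no iteration lines found in the requested range")
    return mean(times)


def speedup(bf16_ms, nvfp4_ms):
    """Ratio of BF16 step time to NVFP4 step time (values above 1 favour NVFP4)."""
    return bf16_ms / nvfp4_ms


# Synthetic log: slow early iterations, steady state from iteration 20 onwards.
synthetic = "\n".join(
    f" iteration {i:8d}/{50:8d} | elapsed time per iteration (ms): "
    f"{1000.0 if i >= 20 else 5000.0} | lm loss: 1.0E+01 |"
    for i in range(1, 51)
)
steady_state_ms = mean_step_ms(synthetic)
```

Running mean_step_ms on the contents of nvfp4.log and bf16.log and passing the two results to speedup gives a single comparison number for the two runs.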

Measuring peak VRAM (from `nvidia-smi`)

Megatron's in-log memory numbers (mem-max-reserved-gigabytes) reflect PyTorch's caching-allocator reservation, which can drift from what the device actually holds. For an accurate read, watch nvidia-smi live from a second shell while training runs:

watch -n 1 nvidia-smi

See the measured numbers in overview.md for expected VRAM and latency on 1× GB300 with Llama 3 8B.
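Watching nvidia-smi by hand makes it easy to miss the peak, so a small poller that records the maximum can help. The sketch below is one possible approach, not part of the playbook assets: parse_used_mib and poll_peak_mib are hypothetical helpers, and it assumes nvidia-smi reports memory.used as an integer number of MiB on your system. Lines that are not plain integers (for example "[N/A]") are skipped.

```python
import subprocess
import time

QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]


def parse_used_mib(smi_output):
    """Parse one integer MiB value per line, skipping anything non-numeric."""
    return [
        int(line.strip())
        for line in smi_output.splitlines()
        if line.strip().isdigit()
    ]


def poll_peak_mib(gpu_index, seconds, interval=1.0):
    """Poll nvidia-smi for the given GPU and return the highest memory.used seen."""
    peak = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        out = subprocess.run(
            QUERY + [f"--id={gpu_index}"],
            capture_output=True, text=True, check=True,
        ).stdout
        peak = max([peak, *parse_used_mib(out)])
        time.sleep(interval)
    return peak


# Parsing example on made-up readings.
readings = parse_used_mib("1024\n2048\n")
```

Start poll_peak_mib from a second shell just before launching training, with a duration long enough to cover the run.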

Step 5
Script arguments

pretrain_llama.py accepts the following arguments:

Argument	Type	Default	Description
`--disable-fp4`	flag	off	Disable NVFP4; use plain BF16 mixed precision as a baseline
`--train-iters`	int	50	Number of training iterations
`--warmup-iters`	int	2	Number of warmup iterations
`--global-batch-size`	int	64	Global batch size
`--micro-batch-size`	int	4	Micro batch size (drives peak VRAM; increase to use more memory)
`--seq-length`	int	4096	Sequence length

Example combining several flags:

torchrun --nproc_per_node=1 pretrain_llama.py \
    --train-iters 50 --warmup-iters 2 \
    --global-batch-size 64 --micro-batch-size 4 --seq-length 4096
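For orientation, the argument table maps naturally onto an argparse parser. The sketch below reconstructs one from the table alone; it is an assumption about how such a parser could look, not a copy of pretrain_llama.py, and build_parser is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser with the flags, types, and defaults listed in the table."""
    parser = argparse.ArgumentParser(description="NVFP4 pretraining sketch")
    parser.add_argument("--disable-fp4", action="store_true",
                        help="Disable NVFP4; use plain BF16 mixed precision")
    parser.add_argument("--train-iters", type=int, default=50,
                        help="Number of training iterations")
    parser.add_argument("--warmup-iters", type=int, default=2,
                        help="Number of warmup iterations")
    parser.add_argument("--global-batch-size", type=int, default=64,
                        help="Global batch size")
    parser.add_argument("--micro-batch-size", type=int, default=4,
                        help="Micro batch size (drives peak VRAM)")
    parser.add_argument("--seq-length", type=int, default=4096,
                        help="Sequence length")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(["--disable-fp4", "--micro-batch-size", "2"])
defaults = build_parser().parse_args([])
```

argparse converts dashes to underscores, so --micro-batch-size is read back as args.micro_batch_size.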

Step 6
Point to real data

To train on your own dataset, modify the config in the script:

config = llama3_8b_pretrain_config()
config.dataset.data_path = ["/path/to/your/preprocessed/dataset"]
config.train.train_iters = 5000
config.train.global_batch_size = 256
config.train.micro_batch_size = 2

Megatron-Bridge expects preprocessed data in Megatron format. See the Megatron-Bridge data preparation guide for details.
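Before committing to a long run, it helps to work out what the batch settings above imply. The sketch below uses the standard relation global batch = micro batch × data-parallel size × accumulation steps. The helper names are hypothetical, and the sequence length of 4096 is an assumption taken from the script's default, since the config snippet above does not set it.

```python
def accumulation_steps(global_batch_size, micro_batch_size, data_parallel_size=1):
    """Number of micro-batches each rank processes per optimizer step."""
    per_pass = micro_batch_size * data_parallel_size
    if global_batch_size % per_pass != 0:
        raise ValueError(
            "global batch size must be a multiple of "
            "micro batch size * data-parallel size"
        )
    return global_batch_size // per_pass


def tokens_per_step(global_batch_size, seq_length):
    """Tokens consumed by one optimizer step."""
    return global_batch_size * seq_length


# Settings from the config snippet, on a single GPU (data-parallel size 1).
steps = accumulation_steps(global_batch_size=256, micro_batch_size=2)

# Assumed sequence length of 4096 (the script default).
step_tokens = tokens_per_step(global_batch_size=256, seq_length=4096)
total_tokens = step_tokens * 5000  # train_iters = 5000
```

With these values each optimizer step accumulates 128 micro-batches and consumes 1,048,576 tokens, so 5000 iterations cover roughly 5.2 billion tokens.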

Step 7
Cleanup

Remove checkpoints and log files generated by the runs:

rm -rf nemo_experiments/ nvfp4.log bf16.log

Then leave the container shell with exit. Because the container was started with --rm in Step 1, Docker removes it automatically.
