NVFP4 Pretraining with Megatron Bridge
30 MIN
Pretrain Llama 3.1 8B with NVFP4 mixed precision on DGX Station using Megatron Bridge
| Symptom | Cause | Fix |
|---|---|---|
| `RuntimeError: NVFP4 is not supported on this GPU` or similar FP4 error | GPU is not Blackwell architecture | NVFP4 requires Blackwell GPUs (GB200, GB300). Check the GPU model with `nvidia-smi` |
| `ModuleNotFoundError: No module named 'megatron.bridge'` | Megatron Bridge not installed | Run `pip install megatron-bridge` or use the NGC container |
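| CUDA out of memory during model init | Insufficient GPU memory for Llama 3.1 8B + optimizer states | Reduce `micro_batch_size`, or shard the model across more GPUs with tensor or pipeline parallelism (`--nproc_per_node` only sets how many processes `torchrun` launches per node) |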
| `torchrun` hangs or times out | NCCL communication failure between GPUs | Rerun with `NCCL_DEBUG=INFO torchrun ...` for details; verify all GPUs are visible |
| Training loss is NaN | Precision instability | Increase `num_layers_at_end_in_bf16` (e.g., from 4 to 8) or reduce the learning rate |
| `--disable-fp4` works but NVFP4 crashes | Transformer Engine version mismatch | Ensure the installed Transformer Engine supports NVFP4; update with `pip install --upgrade transformer-engine` |
| Slow training throughput | Not using Tensor Cores efficiently | Ensure batch dimensions are multiples of 8; check that `nvidia-smi` shows high GPU utilization |
| Permission denied on Docker | User not in `docker` group | Run `sudo usermod -aG docker $USER && newgrp docker` |
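
Several of the rows above (wrong GPU architecture, missing `megatron.bridge`, missing Transformer Engine, GPUs not visible) can be checked before launching a run. The following is a minimal preflight sketch, not part of Megatron Bridge itself. The helper names are hypothetical. Treating compute capability 10.x and above as Blackwell-class is an assumption made for this example. Finding the `transformer_engine` module only shows that it is installed, not that its version supports NVFP4.

```python
"""Preflight checks for some failure modes in the troubleshooting table (illustrative sketch)."""
import importlib.util
import shutil
import subprocess


def is_blackwell_or_newer(major: int, minor: int = 0) -> bool:
    """Assumption: compute capability 10.x and above is Blackwell-class."""
    return major >= 10


def module_available(name: str) -> bool:
    """True if `name` can be located; a missing or broken parent package counts as unavailable."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def visible_gpus() -> list:
    """Lines printed by `nvidia-smi -L`, one per GPU, or [] if nvidia-smi is not on PATH."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    return [line for line in result.stdout.splitlines() if line.startswith("GPU ")]


def main() -> None:
    gpus = visible_gpus()
    print("GPUs visible:", gpus if gpus else "none")

    # Module names taken from the error messages and fixes in the table.
    for name in ("megatron.bridge", "transformer_engine"):
        print(f"{name}: {'found' if module_available(name) else 'MISSING'}")

    # Only query the device if PyTorch and a CUDA device are actually present.
    if module_available("torch"):
        import torch

        if torch.cuda.is_available():
            major, minor = torch.cuda.get_device_capability(0)
            status = "OK" if is_blackwell_or_newer(major, minor) else "not Blackwell"
            print(f"compute capability {major}.{minor}: {status}")


if __name__ == "__main__":
    main()
```

Running the script inside the same container or environment used for training gives the most useful result, since that is where the imports and GPU visibility have to work.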