NVFP4 Quantization

1 HR

Quantize a model to NVFP4 to run on Spark using TensorRT Model Optimizer

Symptom | Cause | Fix
"Permission denied" when accessing Hugging Face | Missing or invalid HF token | Run huggingface-cli login with a valid token
Container exits with CUDA out of memory | Insufficient GPU memory | Reduce batch size or use a machine with more GPU memory
Model files not found in output directory | Volume mount failed or wrong path | Verify $(pwd)/output_models resolves correctly
Git clone fails inside container | Network connectivity issues | Check internet connection and retry
Quantization process hangs | Container resource limits | Increase Docker memory limits or use --ulimit flags
Cannot access gated repo for URL | Certain Hugging Face models have restricted access | Regenerate your Hugging Face token and request access to the gated model in your web browser
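
For the "Model files not found in output directory" case, a quick pre-flight check on the host can confirm that the mount source resolves to a real, writable directory before the container starts. The sketch below is only an illustration: check_output_dir is a hypothetical helper name, and it assumes the container mounts $(pwd)/output_models from the directory where the command is run.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any NVIDIA or Docker tooling):
# confirm that the host directory used as the volume mount source
# exists and is writable. Defaults to $(pwd)/output_models.
check_output_dir() {
  local dir="${1:-$(pwd)/output_models}"

  # Create the directory if it is missing so the mount has a real source.
  mkdir -p "$dir" || return 1

  if [ -w "$dir" ]; then
    echo "OK: $dir is writable"
  else
    echo "ERROR: $dir is not writable" >&2
    return 1
  fi
}
```

Calling check_output_dir with no arguments from the directory where the container is launched prints the resolved absolute path, which makes a wrong working directory easy to spot.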

NOTE

DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. Because many applications are still being updated to take advantage of UMA, you may encounter memory issues even when usage is within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
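
To see whether flushing the cache actually freed memory, the MemAvailable value in /proc/meminfo can be compared before and after. The following is a minimal sketch under that assumption; mem_available_kib and flush_and_report are hypothetical helper names introduced here for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the MemAvailable value (in KiB) from a
# meminfo-style file. Defaults to /proc/meminfo on Linux.
mem_available_kib() {
  awk '/^MemAvailable:/ {print $2}' "${1:-/proc/meminfo}"
}

# Hypothetical wrapper: report available memory before and after
# running the cache flush command shown above (requires sudo).
flush_and_report() {
  local before after
  before="$(mem_available_kib)"
  sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
  after="$(mem_available_kib)"
  echo "MemAvailable before: ${before} KiB"
  echo "MemAvailable after:  ${after} KiB"
}
```

Running flush_and_report prints both readings, so it is clear how much memory the flush made available before retrying the quantization.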