Multi-modal Inference

1 HR

Set up multi-modal inference with TensorRT

| Symptom | Cause | Fix |
| --- | --- | --- |
| "CUDA out of memory" error | Insufficient VRAM for the model | Use FP8/FP4 quantization or a smaller model |
| "Invalid HF token" error | Missing or expired HuggingFace token | Set a valid token: export HF_TOKEN=<YOUR_TOKEN> |
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Request access to the gated model in your web browser, then regenerate your HuggingFace token |
| Model download timeouts | Network issues or rate limiting | Retry the command or pre-download the models |
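For the last row, one way to pre-download is the Hugging Face CLI. A minimal sketch, assuming the CLI is installed and a valid token is available; the model ID and local directory are illustrative examples, not names prescribed by this guide:

```shell
# Hypothetical pre-download sketch: fetch model weights ahead of time
# so inference startup does not hit network timeouts.
export HF_TOKEN=<YOUR_TOKEN>                 # placeholder; use your real token
huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct \
    --local-dir ./models/qwen2.5-vl-7b       # cache the weights locally
```

Once cached locally, subsequent runs can point at the local directory instead of re-downloading.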

NOTE

DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. Because many applications are still being updated to take advantage of UMA, you may encounter memory issues even when your workload is within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
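To confirm the flush had an effect, you can compare memory usage before and after. An illustrative check (not part of the guide itself); the flush step requires root:

```shell
# Observe the buff/cache column before flushing
free -h
# Write dirty pages to disk, then drop page, dentry, and inode caches
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
# The buff/cache figure should now be noticeably smaller
free -h
```

Dropping caches is non-destructive (sync runs first, so no data is lost), but the kernel will repopulate the caches as files are read again.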