Install and Use vLLM for Inference

Estimated time: 30 minutes

Use a container or build vLLM from source for DGX Spark

Common issues when running on a single Spark

| Symptom | Cause | Fix |
| --- | --- | --- |
| CUDA version mismatch errors | Wrong CUDA toolkit version | Reinstall CUDA 12.9 using the exact installer (see the version check below) |
| Container registry authentication fails | Invalid or expired GitLab token | Generate a new auth token |
| SM_121a architecture not recognized | Missing LLVM patches | Verify the SM_121a patches were applied to the LLVM source |
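
If you see CUDA version mismatch errors, it helps to confirm which toolkit and driver the system actually reports before reinstalling. A minimal check using standard CUDA and driver tools (nothing DGX-specific assumed):

```bash
# Toolkit version seen by the compiler (should report CUDA 12.9)
nvcc --version

# Driver version, supported CUDA runtime, and GPU visibility
nvidia-smi
```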

Common issues when running on two Sparks

| Symptom | Cause | Fix |
| --- | --- | --- |
| Node 2 not visible in the Ray cluster | Network connectivity issue | Verify the QSFP cable connection and check the IP configuration (see `ray status` below) |
| Cannot access gated repo for URL | Certain Hugging Face models have restricted access | Regenerate your Hugging Face token and request access to the gated model in your web browser |
| Model download fails | Authentication or network issue | Re-run `huggingface-cli login` and check internet access |
| CUDA out of memory with 405B | Insufficient GPU memory | Use the 70B model or reduce the `max_model_len` parameter (see the example below) |
| Container startup fails | Missing ARM64 image | Rebuild the vLLM image following the ARM64 instructions |
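
The following sketch shows the commands behind the Ray, Hugging Face, and memory fixes above; the model ID and context length are illustrative examples, not prescribed values:

```bash
# On the head node: confirm both Sparks are registered in the Ray cluster
ray status

# Re-authenticate with Hugging Face if downloads fail or a repo is gated
huggingface-cli login

# If a model runs out of memory, cap the context length to shrink the KV cache
# (the model ID and 8192 here are examples only)
vllm serve meta-llama/Llama-3.1-70B-Instruct --max-model-len 8192
```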

NOTE

DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. Because many applications are still being updated to take advantage of UMA, you may encounter memory issues even when you are within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache:

```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
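
To confirm the flush actually released memory, you can compare the buff/cache column before and after with standard Linux tooling (not DGX-specific):

```bash
free -h   # note the buff/cache column
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
free -h   # buff/cache should drop noticeably
```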