NIM on Spark

30 MIN

Deploy a NIM on Spark

| Symptom | Cause | Fix |
|---|---|---|
| Container fails to start with GPU error | NVIDIA Container Toolkit not configured | Install nvidia-container-toolkit and restart Docker |
| "Invalid credentials" during docker login | Incorrect NGC API key format | Verify API key from NGC portal, ensure no extra whitespace |
| Model download hangs or fails | Network connectivity or insufficient disk space | Check internet connection and available disk space in cache directory |
| API returns 404 or connection refused | Container not fully started or wrong port | Wait for container startup completion, verify port 8000 is accessible |
| runtime not found | NVIDIA Container Toolkit not properly configured | Run sudo nvidia-ctk runtime configure --runtime=docker and restart Docker |
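
Two of the fixes above (checking disk space in the cache directory and waiting for the container to finish starting) can be scripted. The following is a minimal sketch, not part of the official playbook: the function names `has_free_space` and `wait_for_http`, the default timeout and polling interval, and the thresholds are all hypothetical choices made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the troubleshooting steps above.
# Function names, defaults, and thresholds are assumptions, not NVIDIA tooling.

# Succeeds when DIR has at least MIN_GB gigabytes free.
has_free_space() {
  local dir="$1" min_gb="$2" avail_kb
  # df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is "Available".
  avail_kb=$(df -Pk "$dir" 2>/dev/null | awk 'NR==2 {print $4}')
  [ -n "$avail_kb" ] && [ "$avail_kb" -ge $(( min_gb * 1024 * 1024 )) ]
}

# Polls URL until it answers with an HTTP success status,
# or returns 1 once roughly TIMEOUT seconds have been spent waiting.
wait_for_http() {
  local url="$1" timeout="${2:-300}" interval="${3:-5}" waited=0
  until curl --silent --fail --output /dev/null --max-time 5 "$url"; do
    waited=$(( waited + interval ))
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
  done
}
```

For example, `has_free_space "$NIM_CACHE_DIR" 50 || echo "low disk space"` checks a cache directory (the variable name and the 50 GB threshold are placeholders), and `wait_for_http http://localhost:8000/v1/health/ready 600` waits up to ten minutes for the service on port 8000. The `/v1/health/ready` path is an assumption about the deployed NIM; substitute whichever endpoint your container documents.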

NOTE

DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. Because many applications have not yet been updated to take advantage of UMA, you may encounter memory issues even when usage is within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
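
To see whether the flush had any effect, it can help to compare memory figures before and after running it. The sketch below reads the standard Linux `/proc/meminfo` fields `MemAvailable` and `Cached`; the helper names `meminfo_kb` and `report_cache` are hypothetical and are offered only as one possible way to inspect the result.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for inspecting memory around a cache flush.

# Prints the value (in kB) of one /proc/meminfo field, e.g. MemAvailable or Cached.
# The optional second argument overrides the file to read (useful for testing).
meminfo_kb() {
  local field="$1" file="${2:-/proc/meminfo}"
  # Match the field name exactly so "Cached" does not also match "SwapCached".
  awk -v key="${field}:" '$1 == key {print $2}' "$file"
}

# Prints available memory and the size of the page cache.
report_cache() {
  local file="${1:-/proc/meminfo}"
  echo "MemAvailable: $(meminfo_kb MemAvailable "$file") kB"
  echo "Cached:       $(meminfo_kb Cached "$file") kB"
}
```

Calling `report_cache` once before and once after the flush command gives a rough view of how much page cache was released; the exact numbers will vary with whatever else is running on the system.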