# Nemotron-3-Nano with llama.cpp
30 MIN
Run the Nemotron-3-Nano-30B model using llama.cpp on DGX Spark.
| Symptom | Cause | Fix |
|---|---|---|
| cmake fails with "CUDA not found" | CUDA toolkit not in PATH | Run export PATH=/usr/local/cuda/bin:$PATH and retry |
| Model download fails or is interrupted | Network issues | Re-run the hf download command; it resumes from where it stopped |
| "CUDA out of memory" when starting server | Insufficient GPU memory | Reduce --ctx-size to 4096 or use a smaller quantization (Q4_K_M) |
| Server starts but inference is slow | Model not fully loaded to GPU | Verify --n-gpu-layers 99 is set and check nvidia-smi shows GPU usage |
| "Connection refused" on port 30000 | Server not running or wrong port | Verify server is running and check the --port parameter |
| "model not found" in API response | Wrong model path | Verify the model path in --model parameter matches the downloaded file location |
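Several of the fixes above come down to confirming that the server is actually up and answering on the expected port. A minimal sketch, assuming the server was launched with --port 30000 on the local machine (llama.cpp's server exposes a /health endpoint):

```shell
# Quick connectivity check for the llama.cpp server (sketch; assumes
# the server was started with --port 30000 on localhost).
PORT=30000
if curl -sf "http://localhost:${PORT}/health" > /dev/null; then
    echo "server healthy on port ${PORT}"
else
    # Covers both the "server not running" and "wrong port" rows above.
    echo "server not reachable on port ${PORT}"
fi
```

If the server responds, querying the OpenAI-compatible /v1/models endpoint on the same port shows which model file is actually loaded, which also helps diagnose the "model not found" case.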
**NOTE:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. Many applications are still being updated to take advantage of UMA, so you may encounter memory issues even when you are within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
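To see whether the page cache is what is occupying memory, you can inspect the kernel's Buffers and Cached counters before and after flushing. A minimal check (Linux-specific, no root required):

```shell
# Show how much memory the kernel page cache currently holds.
# If these numbers are large and a GPU allocation fails, flush with the
# drop_caches command above and compare the values.
awk '/^Buffers:|^Cached:/ {printf "%s %s %s\n", $1, $2, $3}' /proc/meminfo
```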
For the latest known issues, review the DGX Spark User Guide.