Nemotron-3-Nano with llama.cpp

30 MIN

Run the Nemotron-3-Nano-30B model with llama.cpp on DGX Spark

| Symptom | Cause | Fix |
| --- | --- | --- |
| `cmake` fails with "CUDA not found" | CUDA toolkit not in `PATH` | Run `export PATH=/usr/local/cuda/bin:$PATH` and retry (see the build example after this table) |
| Model download fails or is interrupted | Network issues | Re-run the `hf download` command; it will resume from where it stopped (see the download example below) |
| "CUDA out of memory" when starting the server | Insufficient GPU memory | Reduce `--ctx-size` to 4096 or use a smaller quantization (Q4_K_M); see the launch example below |
| Server starts but inference is slow | Model not fully loaded to the GPU | Verify `--n-gpu-layers 99` is set and check that `nvidia-smi` shows GPU usage |
| "Connection refused" on port 30000 | Server not running or wrong port | Verify the server is running and check the `--port` parameter (see the health check below) |
| "model not found" in API response | Wrong model path | Verify that the path in the `--model` parameter matches the downloaded file location |

NOTE

DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. Because many applications are still being updated to take advantage of UMA, you may encounter memory issues even when you are within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:

```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
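
To confirm the flush actually released memory, you can compare the `buff/cache` column of `free -h` before and after:

```bash
free -h   # note the buff/cache figure
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
free -h   # buff/cache should have dropped sharply
```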

For the latest known issues, review the DGX Spark User Guide.