| Symptom | Cause | Fix |
|---|---|---|
| cmake fails with "CUDA not found" | CUDA toolkit not in PATH | Run export PATH=/usr/local/cuda/bin:$PATH and re-run CMake from a clean build directory |
| Build errors mentioning wrong GPU arch | CMAKE_CUDA_ARCHITECTURES does not match the GB10 GPU | Use -DCMAKE_CUDA_ARCHITECTURES="121" for DGX Spark GB10 as in the instructions |
| GGUF download fails or stalls | Network or Hugging Face availability | Re-run hf download; it resumes partial files |
| "CUDA out of memory" when starting llama-server | Model too large for the current context size or available VRAM | Lower --ctx-size (e.g. 4096) or use a smaller quantization from the same repo |
| Server runs but latency is high | Layers not on GPU | Confirm --n-gpu-layers is high enough for your model; check nvidia-smi during a request |
| curl: (7) Failed to connect on port 30000 | No listener yet, wrong host, or server crash | Wait until the server logs that it is listening; run curl on the same host as llama-server (or use the Spark's IP); run ss -tln and confirm :30000 is present; read server stderr for OOM or a bad --model path |
| Chat API errors or empty replies | Wrong --model path or incompatible GGUF | Verify the path to the .gguf file; update llama.cpp if the GGUF requires a newer format |
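The connection failures in the table above can be triaged with a few commands. A minimal sketch, assuming a Linux host and that llama-server was started with --port 30000 (adjust PORT and HOST for your setup); /health is llama-server's readiness endpoint, which returns HTTP 200 once the model is loaded:

```shell
#!/bin/sh
# Quick triage for "curl: (7) Failed to connect" -- hypothetical values,
# change PORT/HOST to match how you launched llama-server.
PORT=30000
HOST=127.0.0.1

# 1. Is anything listening on the port? (ss ships with iproute2 on most distros)
if command -v ss >/dev/null 2>&1 && ss -tln | grep -q ":${PORT} "; then
  STATUS="listening"
else
  STATUS="no listener"
fi
echo "port ${PORT}: ${STATUS}"

# 2. If a listener exists, probe the readiness endpoint; 200 means the
#    model finished loading, 503 means it is still loading.
if [ "$STATUS" = "listening" ]; then
  curl -s -o /dev/null -w "%{http_code}\n" "http://${HOST}:${PORT}/health"
fi
```

If step 1 reports no listener, the server either has not finished starting or has already exited; its stderr will show an OOM or a bad --model path in that case.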
**NOTE:** DGX Spark uses a Unified Memory Architecture (UMA), which allows flexible sharing of memory between the GPU and CPU. Some software is still catching up to UMA behavior. If you hit unexpected memory pressure, you can try flushing the page cache (use with care on shared systems):
```shell
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
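Before flushing, it can help to see how much memory the page cache is actually holding. A small sketch that reads the standard /proc/meminfo fields (values are in kB on Linux):

```shell
# Gauge page-cache size and available memory before deciding to flush.
# Both fields come straight from /proc/meminfo.
CACHED_KB=$(awk '/^Cached:/ {print $2}' /proc/meminfo)
AVAIL_KB=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
echo "page cache: ${CACHED_KB} kB, available: ${AVAIL_KB} kB"
```

Note that dropping caches only releases clean page-cache pages; it does not reclaim memory the model itself has allocated.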
For the latest platform issues, see the DGX Spark known issues documentation.