Run models with llama.cpp on DGX Spark

30 MIN

Build llama.cpp with CUDA support and serve models through an OpenAI-compatible API (Gemma 3 27B IT as the example model)

| Symptom | Cause | Fix |
| --- | --- | --- |
| `cmake` fails with "CUDA not found" | CUDA toolkit not in `PATH` | Run `export PATH=/usr/local/cuda/bin:$PATH` and re-run CMake from a clean build directory |
| Build errors mentioning the wrong GPU arch | `CMAKE_CUDA_ARCHITECTURES` does not match GB10 | Use `-DCMAKE_CUDA_ARCHITECTURES="121"` for the DGX Spark GB10, as in the instructions |
| GGUF download fails or stalls | Network or Hugging Face availability | Re-run `hf download`; it resumes partial files |
| "CUDA out of memory" when starting `llama-server` | Model too large for the current context size or VRAM | Lower `--ctx-size` (e.g. 4096) or use a smaller quantization from the same repo |
| Server runs but latency is high | Layers not offloaded to the GPU | Confirm `--n-gpu-layers` is high enough for your model; check `nvidia-smi` during a request |
| `curl: (7) Failed to connect` on port 30000 | No listener yet, wrong host, or a crashed server | Wait for the "server is listening" log line; run `curl` on the same host as `llama-server` (or use the Spark's IP); run `ss -tln` and confirm `:30000` is listed; read the server's stderr for OOM or a bad `--model` path |
| Chat API errors or empty replies | Wrong `--model` path or incompatible GGUF | Verify the path to the `.gguf` file; update llama.cpp if the GGUF requires a newer format |
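The fixes above reference the build flags and server options together; a minimal sketch of the full flow on DGX Spark is below. The model repository and GGUF filename are assumptions for illustration — substitute the repo and quantization from your instructions.

```shell
# Build llama.cpp with CUDA for the GB10 (compute capability 12.1)
export PATH=/usr/local/cuda/bin:$PATH
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121"
cmake --build build --config Release -j

# Download a GGUF (repo and filename are assumptions; hf download resumes partial files)
hf download ggml-org/gemma-3-27b-it-GGUF --local-dir models/

# Serve on port 30000 with all layers offloaded to the GPU
./build/bin/llama-server \
  --model models/gemma-3-27b-it-Q4_K_M.gguf \
  --n-gpu-layers 999 \
  --ctx-size 8192 \
  --host 0.0.0.0 --port 30000
```

If you hit the OOM row in the table above, lowering `--ctx-size` here is the first knob to turn before switching to a smaller quantization.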

NOTE

DGX Spark uses Unified Memory Architecture (UMA), which allows flexible sharing between GPU and CPU memory. Some software is still catching up to UMA behavior. If you hit memory pressure unexpectedly, you can try flushing the page cache (use with care on shared systems):

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

For the latest platform issues, see the DGX Spark known issues documentation.
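Once the server is listening, you can exercise the OpenAI-compatible endpoint from any HTTP client. A minimal Python sketch using only the standard library is below; the base URL assumes `llama-server` was started on port 30000, and the model name is an assumption — `llama-server` accepts any model string since it serves a single loaded model.

```python
import json
from urllib import request

# Assumption: llama-server is running on this host at port 30000
BASE_URL = "http://localhost:30000/v1"

def build_chat_request(prompt: str, model: str = "gemma-3-27b-it") -> dict:
    """Build an OpenAI-compatible chat completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

def chat(prompt: str) -> str:
    """Send a chat request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(prompt)).encode()
    req = request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (with the server running):
#   reply = chat("Say hello in one sentence.")
```

An empty or error reply here maps to the last two rows of the troubleshooting table: check the `--model` path and the server's stderr first.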