Run models with llama.cpp on DGX Spark

30 MIN

Build llama.cpp with CUDA support and serve models through an OpenAI-compatible API (Gemma 3 27B IT as the example model)

| Symptom | Cause | Fix |
| --- | --- | --- |
| `cmake` fails with "CUDA not found" | CUDA toolkit not in `PATH` | Run `export PATH=/usr/local/cuda/bin:$PATH` and re-run CMake from a clean build directory |
| Build errors mentioning the wrong GPU arch | `CMAKE_CUDA_ARCHITECTURES` does not match GB10 | Use `-DCMAKE_CUDA_ARCHITECTURES="121"` for the DGX Spark GB10, as in the instructions |
| GGUF download fails or stalls | Network or Hugging Face availability | Re-run `hf download`; it resumes partial files |
| "CUDA out of memory" when starting `llama-server` | Model too large for the current context size or VRAM | Lower `--ctx-size` (e.g. 4096) or use a smaller quantization from the same repo |
| Server runs but latency is high | Layers not offloaded to the GPU | Confirm `--n-gpu-layers` is high enough for your model; check `nvidia-smi` during a request |
| `curl: (7) Failed to connect` on port 30000 | No listener yet, wrong host, or a crashed server | Wait for the "server is listening" log line; run `curl` on the same host as `llama-server` (or use the Spark's IP); run `ss -tln` and confirm `:30000` is listed; read the server's stderr for OOM or a bad `--model` path |
| Chat API errors or empty replies | Wrong `--model` path or incompatible GGUF | Verify the path to the `.gguf` file; update llama.cpp if the GGUF requires a newer format |
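The fixes above reference the build flags and server options together; a minimal sketch of the full flow on DGX Spark is below. The model repository and GGUF filename are assumptions for illustration — substitute the repo and quantization from your instructions.

```shell
# Build llama.cpp with CUDA for the GB10 (compute capability 12.1)
export PATH=/usr/local/cuda/bin:$PATH
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121"
cmake --build build --config Release -j

# Download a GGUF (repo and filename are assumptions; hf download resumes partial files)
hf download ggml-org/gemma-3-27b-it-GGUF --local-dir models/

# Serve on port 30000 with all layers offloaded to the GPU
./build/bin/llama-server \
  --model models/gemma-3-27b-it-Q4_K_M.gguf \
  --n-gpu-layers 999 \
  --ctx-size 8192 \
  --host 0.0.0.0 --port 30000
```

If you hit the OOM row in the table above, lowering `--ctx-size` here is the first knob to turn before switching to a smaller quantization.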

NOTE

DGX Spark uses Unified Memory Architecture (UMA), which allows flexible sharing between GPU and CPU memory. Some software is still catching up to UMA behavior. If you hit memory pressure unexpectedly, you can try flushing the page cache (use with care on shared systems):

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

For the latest platform issues, see the DGX Spark known issues documentation.
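Once the server is listening, you can exercise the OpenAI-compatible endpoint from any HTTP client. A minimal Python sketch using only the standard library is below; the base URL assumes `llama-server` was started on port 30000, and the model name is an assumption — `llama-server` accepts any model string since it serves a single loaded model.

```python
import json
from urllib import request

# Assumption: llama-server is running on this host at port 30000
BASE_URL = "http://localhost:30000/v1"

def build_chat_request(prompt: str, model: str = "gemma-3-27b-it") -> dict:
    """Build an OpenAI-compatible chat completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

def chat(prompt: str) -> str:
    """Send a chat request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(prompt)).encode()
    req = request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (with the server running):
#   reply = chat("Say hello in one sentence.")
```

An empty or error reply here maps to the last two rows of the troubleshooting table: check the `--model` path and the server's stderr first.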