llama.cpp is a lightweight C/C++ inference stack for large language models. You build it with CUDA so tensor work runs on the DGX Spark GB10 GPU, then load GGUF weights and expose chat through llama-server’s OpenAI-compatible HTTP API.
This playbook walks through that stack end to end. As its example model it uses Gemma 4 31B IT, a frontier reasoning model from Google DeepMind with strengths in coding, agentic workflows, and fine-tuning, which llama.cpp supports. The instructions download its F16 GGUF from Hugging Face; the same build and server steps apply to other GGUFs, including the other sizes in the support matrix below.
You will build llama.cpp with CUDA for GB10, download a Gemma 4 31B IT model checkpoint, and run llama-server with GPU offload. You get:
- A `/v1/chat/completions` endpoint for tools and apps

## Hardware requirements
## Software requirements
```shell
git --version
cmake --version
nvcc --version
```

The following models are supported with llama.cpp on Spark. All listed models are available and ready to use:
| Model | Support Status | HF Handle |
|---|---|---|
| Gemma 4 31B IT | ✅ | ggml-org/gemma-4-31B-it-GGUF |
| Gemma 4 26B A4B IT | ✅ | ggml-org/gemma-4-26B-A4B-it-GGUF |
| Gemma 4 E4B IT | ✅ | ggml-org/gemma-4-E4B-it-GGUF |
| Gemma 4 E2B IT | ✅ | ggml-org/gemma-4-E2B-it-GGUF |
| Nemotron-3-Nano | ✅ | unsloth/Nemotron-3-Nano-30B-A3B-GGUF |
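With the prerequisites in place, the end-to-end flow can be sketched as below. This is a sketch under stated assumptions: the clone location, port, and `-j` parallelism are not from this playbook, and `-hf` caches the download itself rather than using `~/models/`. The flags shown (`-hf`, `--n-gpu-layers`, `--host`, `--port`) are standard llama.cpp options, and the Hugging Face handle comes from the support matrix above.

```shell
# Clone and build llama.cpp with CUDA so layers offload to the GB10 GPU
# (clone path and -j value are assumptions, not from the playbook)
git clone https://github.com/ggml-org/llama.cpp ~/llama.cpp
cd ~/llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Serve Gemma 4 31B IT; -hf fetches the GGUF from Hugging Face on first
# launch, and --n-gpu-layers 999 requests full GPU offload
./build/bin/llama-server \
  -hf ggml-org/gemma-4-31B-it-GGUF \
  --n-gpu-layers 999 \
  --host 0.0.0.0 --port 8080

# From another shell: exercise the OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":64}'
```

Any OpenAI-compatible client can point at the same base URL (`http://localhost:8080/v1` in this sketch), so tools and apps need no llama.cpp-specific code.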
## Cleanup

When you are finished, remove the llama.cpp clone and the model directory under `~/models/` to reclaim disk space.