llama.cpp is a lightweight C/C++ inference stack for large language models. You build it with CUDA so tensor work runs on the DGX Spark GB10 GPU, then load GGUF weights and expose chat through llama-server’s OpenAI-compatible HTTP API.
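As a minimal sketch of that build, assuming a fresh clone (GGML_CUDA is the CMake option that enables llama.cpp's CUDA backend; the clone location and job count are up to you):

```bash
# Clone and configure llama.cpp with the CUDA backend enabled
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON

# Build release binaries (llama-server, llama-cli, ...) in parallel
cmake --build build --config Release -j
```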
This playbook walks through that stack end to end, using Nemotron 3 Nano Omni as the hands-on example: an NVIDIA MoE family that runs well from quantized GGUF on Spark. Checkpoint choices and Hugging Face handles for all supported models are summarized in the matrix below; the commands are in the instructions.
You will build llama.cpp with CUDA for the GB10, download an example Nemotron 3 Nano Omni checkpoint, and run llama-server with GPU offload. You get:
- A `/v1/chat/completions` endpoint for tools and apps
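Once the server is up, any OpenAI-style client can talk to it. A quick smoke test, assuming llama-server's default port of 8080 on localhost:

```bash
# Minimal chat request against the OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Say hello from the DGX Spark."}
    ]
  }'
```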
## Hardware requirements

- NVIDIA DGX Spark (GB10 GPU)

## Software requirements
You need Git, CMake, and the CUDA toolkit. Verify each is on your PATH:

- `git --version`
- `cmake --version`
- `nvcc --version`

The following models are supported with llama.cpp on Spark. The instructions use the Nemotron 3 Nano Omni example row by default.
| Model | Support Status | HF Handle |
|---|---|---|
| Nemotron 3 Nano Omni (example walkthrough) | ✅ | ggml-org/NVIDIA-Nemotron-3-Nano-Omni |
| Qwen3.6-35B-A3B | ✅ | unsloth/Qwen3.6-35B-A3B-GGUF |
| Qwen3.6-27B | ✅ | unsloth/Qwen3.6-27B-GGUF |
| Gemma 4 31B IT | ✅ | ggml-org/gemma-4-31B-it-GGUF |
| Gemma 4 26B A4B IT | ✅ | ggml-org/gemma-4-26B-A4B-it-GGUF |
| Gemma 4 E4B IT | ✅ | ggml-org/gemma-4-E4B-it-GGUF |
| Gemma 4 E2B IT | ✅ | ggml-org/gemma-4-E2B-it-GGUF |
| Nemotron-3-Nano | ✅ | unsloth/Nemotron-3-Nano-30B-A3B-GGUF |
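As a hedged sketch of the download-and-serve step for the example row, assuming the build above, the ~/models layout this playbook uses, and llama-server's default port; the local directory name and the `<quant>.gguf` filename are placeholders to fill in from the repo's file list:

```bash
# Download the example GGUF checkpoint into ~/models (directory name is illustrative)
huggingface-cli download ggml-org/NVIDIA-Nemotron-3-Nano-Omni \
  --local-dir ~/models/NVIDIA-Nemotron-3-Nano-Omni

# Serve it with full GPU offload; -ngl 99 pushes all layers to the GB10
./build/bin/llama-server \
  -m ~/models/NVIDIA-Nemotron-3-Nano-Omni/<quant>.gguf \
  -ngl 99 \
  --port 8080
```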
## Cleanup

When you are finished, delete the llama.cpp clone and the model directory under ~/models/ to reclaim disk space.
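For example, assuming the clone and download locations from the sketches above (both paths are assumptions; adjust them to where you actually cloned and downloaded):

```bash
# Paths are assumptions; adjust to your setup before running
rm -rf ~/llama.cpp                          # the llama.cpp clone
rm -rf ~/models/NVIDIA-Nemotron-3-Nano-Omni # the downloaded GGUF directory
```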