Basic idea
llama.cpp is a lightweight C/C++ inference stack for large language models. You build it with CUDA so tensor work runs on the DGX Spark GB10 GPU, then load GGUF weights and expose chat through llama-server’s OpenAI-compatible HTTP API.
This playbook walks through that stack end to end using Nemotron 3 Nano Omni as the hands-on example: an NVIDIA MoE family that runs well from quantized GGUF on Spark. Checkpoint choices and paths for all supported models are summarized in the matrix below; commands are in the instructions.
What you'll accomplish
You will build llama.cpp with CUDA for GB10, download a Nemotron 3 Nano Omni example checkpoint, and run llama-server with GPU offload. You get:
- Local inference through llama.cpp (no separate Python inference framework required)
- An OpenAI-compatible
/v1/chat/completionsendpoint for tools and apps - A concrete validation that the Nemotron 3 Nano Omni example runs on this stack on DGX Spark
What to know before starting
- Basic familiarity with Linux command line and terminal commands
- Understanding of git and building from source with CMake
- Basic knowledge of REST APIs and cURL for testing
- Familiarity with Hugging Face Hub for downloading GGUF files
Prerequisites
Hardware requirements
- NVIDIA DGX Spark with GB10 GPU
- Sufficient unified memory for the example Q8_0 checkpoint (weights on the order of ~35GB, plus KV cache and runtime overhead—scale up if you pick a larger quant or longer context)
- At least ~40GB free disk for the example download plus build artifacts (more if you keep multiple GGUFs)
Software requirements
- NVIDIA DGX OS
- Git:
git --version - CMake (3.14+):
cmake --version - CUDA Toolkit:
nvcc --version - Network access to GitHub and Hugging Face
Model support matrix
The following models are supported with llama.cpp on Spark. The instructions use the Nemotron 3 Nano Omni example row by default.
| Model | Support Status | HF Handle |
|---|---|---|
| Nemotron 3 Nano Omni (example walkthrough) | ✅ | ggml-org/NVIDIA-Nemotron-3-Nano-Omni |
| Qwen3.6-35B-A3B | ✅ | unsloth/Qwen3.6-35B-A3B-GGUF |
| Qwen3.6-27B | ✅ | unsloth/Qwen3.6-27B-GGUF |
| Gemma 4 31B IT | ✅ | ggml-org/gemma-4-31B-it-GGUF |
| Gemma 4 26B A4B IT | ✅ | ggml-org/gemma-4-26B-A4B-it-GGUF |
| Gemma 4 E4B IT | ✅ | ggml-org/gemma-4-E4B-it-GGUF |
| Gemma 4 E2B IT | ✅ | ggml-org/gemma-4-E2B-it-GGUF |
| Nemotron-3-Nano | ✅ | unsloth/Nemotron-3-Nano-30B-A3B-GGUF |
Time & risk
- Estimated time: About 30 minutes, plus downloading the example GGUF (~35GB order of magnitude for the default quant)
- Risk level: Low — build is local to your clone; no system-wide installs required for the steps below
- Rollback: Remove the
llama.cppclone and the model directory under~/models/to reclaim disk space - Last updated: 04/28/2026
- Walkthrough now uses Nemotron Omni; other model rows stay available