Basic idea
llama.cpp is a lightweight C/C++ inference stack for large language models. You build it with CUDA so it fully utilizes the DGX Spark GB10 GPU, then load GGUF weights and expose chat through llama-server’s OpenAI-compatible HTTP API.
This playbook walks through that stack end to end using MTP-enabled Qwen3.6-35B-A3B as the hands-on example. Checkpoint choices and paths for all supported models are summarized in the matrix below; commands are in the instructions.
What you'll accomplish
You will build llama.cpp with CUDA for GB10, download a Qwen3.6-35B-A3B checkpoint, and run llama-server with GPU offload. You get:
- Local inference through llama.cpp (no separate Python inference framework required)
- An OpenAI-compatible
/v1/chat/completions endpoint for tools and apps
- A concrete validation that the Qwen3.6-35B-A3B example runs on this stack on DGX Spark with MTP support.
What to know before starting
- Basic familiarity with Linux command line and terminal commands
- Understanding of git and building from source with CMake
- Basic knowledge of REST APIs and cURL for testing
Prerequisites
Hardware requirements
- NVIDIA DGX Spark with GB10 GPU
- Sufficient unified memory for the model and the KV-Cache being utilized (about 30GB free RAM for the model in the example)
- At least ~40GB free disk for the example download plus build artifacts (more if you keep multiple GGUFs)
Software requirements
- NVIDIA DGX OS
- Git:
git --version
- CMake (3.14+):
cmake --version
- CUDA Toolkit:
nvcc --version
- Network access to GitHub and Hugging Face
Model support matrix
DGX Spark supports any GGUF format model checkpoint with llama.cpp, as long as the system has memory available to host and run the checkpoint.
Time & risk
- Estimated time: About 30 minutes, plus downloading the example GGUF (~35GB order of magnitude for the default quant)
- Risk level: Low — build is local to your clone; no system-wide installs required for the steps below
- Rollback: Remove the
llama.cpp clone and the model directory under ~/.cache/huggingface/hub/ to reclaim disk space
- Last updated: 06/03/2026
- Walkthrough now uses Qwen3.6-35B-A3B as an example