llama.cpp is a lightweight C/C++ inference stack for large language models. You build it with CUDA so tensor work runs on the DGX Spark GB10 GPU, then load GGUF weights and expose chat through llama-server’s OpenAI-compatible HTTP API.
This playbook walks through that stack end to end. As its example model it uses Gemma 4 31B IT, a frontier reasoning model from Google DeepMind with strengths in coding, agentic workflows, and fine-tuning, which llama.cpp supports. The instructions download its F16 GGUF from Hugging Face; the same build and server steps apply to other GGUFs, including the other sizes in the support matrix below.
You will build llama.cpp with CUDA for GB10, download a Gemma 4 31B IT model checkpoint, and run llama-server with GPU offload. You get:
- A `/v1/chat/completions` endpoint for tools and apps

## Hardware requirements
## Software requirements
```shell
git --version
cmake --version
nvcc --version
```

The following models are supported with llama.cpp on Spark. All listed models are available and ready to use:
| Model | Support Status | HF Handle |
|---|---|---|
| Gemma 4 31B IT | ✅ | ggml-org/gemma-4-31B-it-GGUF |
| Gemma 4 26B A4B IT | ✅ | ggml-org/gemma-4-26B-A4B-it-GGUF |
| Gemma 4 E4B IT | ✅ | ggml-org/gemma-4-E4B-it-GGUF |
| Gemma 4 E2B IT | ✅ | ggml-org/gemma-4-E2B-it-GGUF |
| Nemotron-3-Nano | ✅ | unsloth/Nemotron-3-Nano-30B-A3B-GGUF |
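With the prerequisites in place, the end-to-end flow can be sketched as below. This is a sketch under stated assumptions: the clone location, port, and `-j` parallelism are not from this playbook, and `-hf` caches the download itself rather than using `~/models/`. The flags shown (`-hf`, `--n-gpu-layers`, `--host`, `--port`) are standard llama.cpp options, and the Hugging Face handle comes from the support matrix above.

```shell
# Clone and build llama.cpp with CUDA so layers offload to the GB10 GPU
# (clone path and -j value are assumptions, not from the playbook)
git clone https://github.com/ggml-org/llama.cpp ~/llama.cpp
cd ~/llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Serve Gemma 4 31B IT; -hf fetches the GGUF from Hugging Face on first
# launch, and --n-gpu-layers 999 requests full GPU offload
./build/bin/llama-server \
  -hf ggml-org/gemma-4-31B-it-GGUF \
  --n-gpu-layers 999 \
  --host 0.0.0.0 --port 8080

# From another shell: exercise the OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":64}'
```

Any OpenAI-compatible client can point at the same base URL (`http://localhost:8080/v1` in this sketch), so tools and apps need no llama.cpp-specific code.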
## Cleanup

When you are finished, remove the llama.cpp clone and the model directory under `~/models/` to reclaim disk space.