llama.cpp is a lightweight C/C++ inference stack for large language models. You build it with CUDA so tensor work runs on the DGX Spark GB10 GPU, then load GGUF weights and expose chat through llama-server’s OpenAI-compatible HTTP API.
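As a minimal sketch of that build, assuming a fresh clone (GGML_CUDA is the CMake option that enables llama.cpp's CUDA backend; the clone location and job count are up to you):

```bash
# Clone and configure llama.cpp with the CUDA backend enabled
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON

# Build release binaries (llama-server, llama-cli, ...) in parallel
cmake --build build --config Release -j
```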
This playbook walks through that stack end to end, using Nemotron 3 Nano Omni as the hands-on example: an NVIDIA MoE family that runs well from quantized GGUF on Spark. Checkpoint choices and Hugging Face handles for all supported models are summarized in the matrix below; the commands are in the instructions.
You will build llama.cpp with CUDA for the GB10, download an example Nemotron 3 Nano Omni checkpoint, and run llama-server with GPU offload. You get:
- A `/v1/chat/completions` endpoint for tools and apps
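Once the server is up, any OpenAI-style client can talk to it. A quick smoke test, assuming llama-server's default port of 8080 on localhost:

```bash
# Minimal chat request against the OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Say hello from the DGX Spark."}
    ]
  }'
```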
## Hardware requirements

- NVIDIA DGX Spark (GB10 GPU)

## Software requirements
You need Git, CMake, and the CUDA toolkit. Verify each is on your PATH:

- `git --version`
- `cmake --version`
- `nvcc --version`

The following models are supported with llama.cpp on Spark. The instructions use the Nemotron 3 Nano Omni example row by default.
| Model | Support Status | HF Handle |
|---|---|---|
| Nemotron 3 Nano Omni (example walkthrough) | ✅ | ggml-org/NVIDIA-Nemotron-3-Nano-Omni |
| Qwen3.6-35B-A3B | ✅ | unsloth/Qwen3.6-35B-A3B-GGUF |
| Qwen3.6-27B | ✅ | unsloth/Qwen3.6-27B-GGUF |
| Gemma 4 31B IT | ✅ | ggml-org/gemma-4-31B-it-GGUF |
| Gemma 4 26B A4B IT | ✅ | ggml-org/gemma-4-26B-A4B-it-GGUF |
| Gemma 4 E4B IT | ✅ | ggml-org/gemma-4-E4B-it-GGUF |
| Gemma 4 E2B IT | ✅ | ggml-org/gemma-4-E2B-it-GGUF |
| Nemotron-3-Nano | ✅ | unsloth/Nemotron-3-Nano-30B-A3B-GGUF |
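As a hedged sketch of the download-and-serve step for the example row, assuming the build above, the ~/models layout this playbook uses, and llama-server's default port; the local directory name and the `<quant>.gguf` filename are placeholders to fill in from the repo's file list:

```bash
# Download the example GGUF checkpoint into ~/models (directory name is illustrative)
huggingface-cli download ggml-org/NVIDIA-Nemotron-3-Nano-Omni \
  --local-dir ~/models/NVIDIA-Nemotron-3-Nano-Omni

# Serve it with full GPU offload; -ngl 99 pushes all layers to the GB10
./build/bin/llama-server \
  -m ~/models/NVIDIA-Nemotron-3-Nano-Omni/<quant>.gguf \
  -ngl 99 \
  --port 8080
```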
## Cleanup

When you are finished, delete the llama.cpp clone and the model directory under ~/models/ to reclaim disk space.
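For example, assuming the clone and download locations from the sketches above (both paths are assumptions; adjust them to where you actually cloned and downloaded):

```bash
# Paths are assumptions; adjust to your setup before running
rm -rf ~/llama.cpp                          # the llama.cpp clone
rm -rf ~/models/NVIDIA-Nemotron-3-Nano-Omni # the downloaded GGUF directory
```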