Run models with llama.cpp on DGX Spark

30 MIN

Build llama.cpp with CUDA and serve models via an OpenAI-compatible API (Gemma 4 31B IT as example)

Basic idea

llama.cpp is a lightweight C/C++ inference stack for large language models. You build it with CUDA so tensor work runs on the DGX Spark GB10 GPU, then load GGUF weights and expose chat through llama-server’s OpenAI-compatible HTTP API.

This playbook walks through that stack end to end. As the model example, it uses Gemma 4 31B IT, a frontier reasoning model built by Google DeepMind that llama.cpp supports, with strengths in coding, agentic workflows, and fine-tuning. The instructions download its F16 GGUF from Hugging Face. The same build and server steps apply to other GGUFs (including other sizes in the support matrix below).

What you'll accomplish

You will build llama.cpp with CUDA for GB10, download a Gemma 4 31B IT model checkpoint, and run llama-server with GPU offload. You get:

  • Local inference through llama.cpp (no separate Python inference framework required)
  • An OpenAI-compatible /v1/chat/completions endpoint for tools and apps
  • A concrete validation that Gemma 4 31B IT runs on this stack on DGX Spark
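Once the build and download are done, serving and querying looks roughly like this. The model filename, port, and `-ngl` value are assumptions, not fixed requirements — adjust them to match your download and environment.

```shell
PORT=8080   # assumed free port on the Spark

# Start the server with full GPU offload (-ngl 999 pushes all layers to the GPU).
# The .gguf filename below is illustrative; use the file you actually downloaded.
./build/bin/llama-server \
  -m ~/models/gemma-4-31B-it-GGUF/gemma-4-31B-it-f16.gguf \
  -ngl 999 --host 0.0.0.0 --port "$PORT" &

# Query the OpenAI-compatible endpoint once the server reports it is ready.
curl "http://localhost:${PORT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-4-31B-it", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'
```

Any OpenAI-compatible client can point at the same endpoint by setting its base URL to `http://localhost:8080/v1`.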

What to know before starting

  • Basic familiarity with Linux command line and terminal commands
  • Understanding of git and building from source with CMake
  • Basic knowledge of REST APIs and cURL for testing
  • Familiarity with Hugging Face Hub for downloading GGUF files

Prerequisites

Hardware requirements

  • NVIDIA DGX Spark with GB10 GPU
  • Sufficient unified memory for the F16 checkpoint (on the order of ~62GB for weights alone; more when KV cache and runtime overhead are included)
  • At least ~70GB free disk for the F16 download plus build artifacts (use a smaller quant from the same repo if you need less disk and VRAM)
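The ~62GB figure follows directly from the parameter count at F16 precision, a back-of-envelope check you can reproduce:

```shell
# F16 stores 2 bytes per parameter, so a 31B-parameter model
# needs roughly 31 * 2 = 62 GB for the weights alone.
PARAMS_B=31          # model size in billions of parameters
BYTES_PER_PARAM=2    # F16 = 16 bits = 2 bytes
WEIGHT_GB=$((PARAMS_B * BYTES_PER_PARAM))
echo "approx weight size: ${WEIGHT_GB} GB"
```

KV cache and runtime buffers come on top of this, which is why the prerequisite calls for headroom beyond 62GB.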

Software requirements

  • NVIDIA DGX OS
  • Git: git --version
  • CMake (3.14+): cmake --version
  • CUDA Toolkit: nvcc --version
  • Network access to GitHub and Hugging Face

Model Support Matrix

The following models are supported by llama.cpp on DGX Spark; all of them are available and ready to use:

Model              | Support Status | HF Handle
Gemma 4 31B IT     | Supported      | ggml-org/gemma-4-31B-it-GGUF
Gemma 4 26B A4B IT | Supported      | ggml-org/gemma-4-26B-A4B-it-GGUF
Gemma 4 E4B IT     | Supported      | ggml-org/gemma-4-E4B-it-GGUF
Gemma 4 E2B IT     | Supported      | ggml-org/gemma-4-E2B-it-GGUF
Nemotron-3-Nano    | Supported      | unsloth/Nemotron-3-Nano-30B-A3B-GGUF
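For the smaller entries in the matrix, llama-server can also pull a GGUF straight from Hugging Face with the `-hf` flag rather than downloading manually. The handle below is the smallest entry in the matrix; check `llama-server --help` on your build for the exact quant-selection syntax.

```shell
HF_HANDLE=ggml-org/gemma-4-E2B-it-GGUF   # smallest entry in the matrix above

# Fetch (on first run) and serve the model directly by its HF handle.
./build/bin/llama-server -hf "$HF_HANDLE" --port 8080
```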

Time & risk

  • Estimated time: About 30 minutes, plus time to download the ~62GB example checkpoint
  • Risk level: Low — build is local to your clone; no system-wide installs required for the steps below
  • Rollback: Remove the llama.cpp clone and the model directory under ~/models/ to reclaim disk space
  • Last updated: 04/02/2026 (first publication)
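The rollback above amounts to two deletions; the paths assume the clone and download locations used in this playbook.

```shell
# Remove the source tree (including build artifacts) and the downloaded weights.
rm -rf ~/llama.cpp
rm -rf ~/models/gemma-4-31B-it-GGUF
```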