
Run models with llama.cpp on DGX Spark

30 MIN

Build llama.cpp with CUDA and serve models via an OpenAI-compatible API (Gemma 4 31B IT as example)

Tags: DGX Spark · Inference · LLM · llama.cpp
View llama.cpp on GitHub

Basic idea

llama.cpp is a lightweight C/C++ inference stack for large language models. You build it with CUDA so tensor work runs on the DGX Spark GB10 GPU, then load GGUF weights and expose chat through llama-server’s OpenAI-compatible HTTP API.

This playbook walks through that stack end to end. As its example model it uses Gemma 4 31B IT, a frontier reasoning model from Google DeepMind that llama.cpp supports, with strengths in coding, agentic workflows, and fine-tuning. The instructions download its F16 GGUF from Hugging Face; the same build and server steps apply to other GGUFs, including the other sizes in the support matrix below.
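At a high level, the build step looks like the sketch below. This assumes the standard llama.cpp CMake workflow; `GGML_CUDA=ON` is the flag that enables the CUDA backend so layers can be offloaded to the GB10 GPU.

```shell
# Clone llama.cpp and build it with CUDA support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# GGML_CUDA=ON enables the CUDA backend; Release builds run much faster
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```

The build stays inside the `build/` directory of your clone, so nothing is installed system-wide.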

What you'll accomplish

You will build llama.cpp with CUDA for GB10, download a Gemma 4 31B IT model checkpoint, and run llama-server with GPU offload. You get:

  • Local inference through llama.cpp (no separate Python inference framework required)
  • An OpenAI-compatible /v1/chat/completions endpoint for tools and apps
  • A concrete validation that Gemma 4 31B IT runs on this stack on DGX Spark
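Once llama-server is running, any OpenAI-compatible client can talk to it. A minimal smoke test with cURL might look like this, assuming the server is listening on its default port 8080; the `model` value shown is illustrative (llama-server serves whichever GGUF you loaded).

```shell
# Send one chat turn to llama-server's OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemma-4-31b-it",
        "messages": [
          {"role": "user", "content": "Say hello from DGX Spark."}
        ]
      }'
```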

What to know before starting

  • Basic familiarity with Linux command line and terminal commands
  • Understanding of git and building from source with CMake
  • Basic knowledge of REST APIs and cURL for testing
  • Familiarity with Hugging Face Hub for downloading GGUF files

Prerequisites

Hardware requirements

  • NVIDIA DGX Spark with GB10 GPU
  • Sufficient unified memory for the F16 checkpoint (on the order of ~62GB for weights alone; more when KV cache and runtime overhead are included)
  • At least ~70GB free disk for the F16 download plus build artifacts (use a smaller quant from the same repo if you need less disk and VRAM)

Software requirements

  • NVIDIA DGX OS
  • Git: git --version
  • CMake (3.14+): cmake --version
  • CUDA Toolkit: nvcc --version
  • Network access to GitHub and Hugging Face
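You can confirm the toolchain in one pass before starting; these are the same version checks listed above.

```shell
git --version     # any recent git works
cmake --version   # needs 3.14 or newer
nvcc --version    # CUDA Toolkit compiler, preinstalled on DGX OS
```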

Model Support Matrix

The following models are supported with llama.cpp on Spark. All listed models are available and ready to use:

| Model | Support Status | HF Handle |
| --- | --- | --- |
| Gemma 4 31B IT | ✅ | ggml-org/gemma-4-31B-it-GGUF |
| Gemma 4 26B A4B IT | ✅ | ggml-org/gemma-4-26B-A4B-it-GGUF |
| Gemma 4 E4B IT | ✅ | ggml-org/gemma-4-E4B-it-GGUF |
| Gemma 4 E2B IT | ✅ | ggml-org/gemma-4-E2B-it-GGUF |
| Nemotron-3-Nano | ✅ | unsloth/Nemotron-3-Nano-30B-A3B-GGUF |
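Any entry in the matrix can be fetched from Hugging Face ahead of serving. A sketch using `huggingface-cli download` follows; the local directory and the `.gguf` file name are assumptions (check the repo for the actual file names), and `-ngl 999` simply requests that all layers be offloaded to the GPU. llama.cpp can also pull a repo directly at launch via `llama-server -hf <repo>`.

```shell
# Download the example F16 GGUF (~62GB) into ~/models
pip install -U "huggingface_hub[cli]"
huggingface-cli download ggml-org/gemma-4-31B-it-GGUF \
  --local-dir ~/models/gemma-4-31B-it-GGUF

# Serve it with full GPU offload; the .gguf file name below is illustrative
./build/bin/llama-server \
  -m ~/models/gemma-4-31B-it-GGUF/gemma-4-31B-it-F16.gguf \
  -ngl 999 --host 0.0.0.0 --port 8080
```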

Time & risk

  • Estimated time: About 30 minutes, plus time to download the ~62GB example checkpoint
  • Risk level: Low — build is local to your clone; no system-wide installs required for the steps below
  • Rollback: Remove the llama.cpp clone and the model directory under ~/models/ to reclaim disk space
  • Last updated: 04/02/2026
    • First Publication
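The rollback described above amounts to deleting two directories; the paths below assume you cloned into your home directory and downloaded the model under ~/models/ as in this playbook.

```shell
# Remove the build tree and the downloaded weights to reclaim disk space
rm -rf ~/llama.cpp
rm -rf ~/models/gemma-4-31B-it-GGUF
```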

Resources

  • llama.cpp GitHub Repository
  • DGX Spark Documentation
  • DGX Spark Forum
  • DGX Spark User Performance Guide

Copyright © 2026 NVIDIA Corporation