Run models with llama.cpp on DGX Spark

30 MIN

Build llama.cpp with CUDA and serve models via an OpenAI-compatible API (Nemotron 3 Nano Omni as example)

DGX Spark · Inference · LLM · llama.cpp
View llama.cpp on GitHub
Overview

Basic idea

llama.cpp is a lightweight C/C++ inference stack for large language models. You build it with CUDA so tensor work runs on the DGX Spark GB10 GPU, then load GGUF weights and expose chat through llama-server’s OpenAI-compatible HTTP API.
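
As a minimal sketch of that build (the Instructions tab has the full, authoritative commands):

```bash
# Minimal sketch of the CUDA build described above; verify against the
# Instructions tab. GGML_CUDA=ON routes tensor work to the GB10 GPU.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```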

This playbook walks through that stack end to end using Nemotron 3 Nano Omni as the hands-on example: an NVIDIA MoE family that runs well from quantized GGUF on Spark. Checkpoint choices and paths for all supported models are summarized in the matrix below; commands are in the instructions.

What you'll accomplish

You will build llama.cpp with CUDA for GB10, download a Nemotron 3 Nano Omni example checkpoint, and run llama-server with GPU offload. You get:

  • Local inference through llama.cpp (no separate Python inference framework required)
  • An OpenAI-compatible /v1/chat/completions endpoint for tools and apps (example request after this list)
  • A concrete validation that the Nemotron 3 Nano Omni example runs on this stack on DGX Spark
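
As a preview of the finished stack, here is a hedged sketch of serving and querying; the GGUF path and filename are placeholders rather than the playbook's actual paths, and -ngl 99 offloads all layers to the GPU.

```bash
# Illustrative only: start llama-server with full GPU offload and query the
# OpenAI-compatible endpoint. The model path is a placeholder.
./build/bin/llama-server \
  -m ~/models/nemotron-3-nano-omni/model-Q8_0.gguf \
  -ngl 99 --port 8080 &

# Once the server reports it is listening:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hello from DGX Spark."}]}'
```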

What to know before starting

  • Basic familiarity with Linux command line and terminal commands
  • Understanding of git and building from source with CMake
  • Basic knowledge of REST APIs and cURL for testing
  • Familiarity with Hugging Face Hub for downloading GGUF files

Prerequisites

Hardware requirements

  • NVIDIA DGX Spark with GB10 GPU
  • Sufficient unified memory for the example Q8_0 checkpoint: weights on the order of ~35GB, plus KV cache and runtime overhead. Scale up if you pick a larger quant or a longer context (a back-of-envelope follows this list)
  • At least ~40GB free disk for the example download plus build artifacts (more if you keep multiple GGUFs)
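
A back-of-envelope for that ~35GB figure: llama.cpp's Q8_0 stores blocks of 32 weights as 32 one-byte values plus a 2-byte scale, about 8.5 bits per weight, so the quoted size implies a parameter count in the low-30B range (an inference from the size, not a published spec).

```bash
# Rough check of the ~35GB weights figure; the 33B parameter count is an
# assumption inferred from the quoted size.
python3 -c "print(f'{33e9 * 8.5 / 8 / 1e9:.1f} GB')"   # ≈ 35.1 GB
```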

Software requirements

  • NVIDIA DGX OS
  • Git: git --version
  • CMake (3.14+): cmake --version
  • CUDA Toolkit: nvcc --version
  • Network access to GitHub and Hugging Face
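
The version checks above can be run in one pass; nvidia-smi is an optional extra to confirm the GPU is visible before starting a long build.

```bash
# One-pass preflight using the checks listed above.
git --version
cmake --version    # want 3.14 or newer
nvcc --version     # CUDA toolkit on PATH
nvidia-smi         # optional: confirm the GB10 GPU is visible
```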

Model support matrix

The following models are supported with llama.cpp on Spark. The instructions use the Nemotron 3 Nano Omni example row by default; a hedged download sketch follows the matrix.

| Model | Support Status | HF Handle |
|---|---|---|
| Nemotron 3 Nano Omni (example walkthrough) | ✅ | ggml-org/NVIDIA-Nemotron-3-Nano-Omni |
| Qwen3.6-35B-A3B | ✅ | unsloth/Qwen3.6-35B-A3B-GGUF |
| Qwen3.6-27B | ✅ | unsloth/Qwen3.6-27B-GGUF |
| Gemma 4 31B IT | ✅ | ggml-org/gemma-4-31B-it-GGUF |
| Gemma 4 26B A4B IT | ✅ | ggml-org/gemma-4-26B-A4B-it-GGUF |
| Gemma 4 E4B IT | ✅ | ggml-org/gemma-4-E4B-it-GGUF |
| Gemma 4 E2B IT | ✅ | ggml-org/gemma-4-E2B-it-GGUF |
| Nemotron-3-Nano | ✅ | unsloth/Nemotron-3-Nano-30B-A3B-GGUF |
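
Downloads are covered in the instructions; as a sketch, one way to fetch a single quant from the example row with the Hugging Face CLI. The *Q8_0* filename pattern and target directory are assumptions, not the repo's confirmed layout.

```bash
# Hypothetical fetch of the example row; adjust the include pattern to the
# actual filenames in the repo.
pip install -U "huggingface_hub[cli]"
huggingface-cli download ggml-org/NVIDIA-Nemotron-3-Nano-Omni \
  --include "*Q8_0*" \
  --local-dir ~/models/nemotron-3-nano-omni
```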

Time & risk

  • Estimated time: About 30 minutes, plus time to download the example GGUF (roughly 35GB at the default Q8_0 quant)
  • Risk level: Low — build is local to your clone; no system-wide installs required for the steps below
  • Rollback: Remove the llama.cpp clone and the model directory under ~/models/ to reclaim disk space (a cleanup sketch follows this list)
  • Last updated: 04/28/2026
    • Walkthrough now uses Nemotron Omni; other model rows stay available
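
Rollback is a plain file removal; the paths below mirror the placeholder locations used in this playbook and should be adjusted to wherever you actually cloned and downloaded.

```bash
# Reclaim disk: remove the build/clone and the example model directory.
# Both paths are assumptions based on the placeholders above.
rm -rf ~/llama.cpp
rm -rf ~/models/nemotron-3-nano-omni
```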

Resources

  • llama.cpp GitHub Repository
  • DGX Spark Documentation
  • DGX Spark Forum
  • DGX Spark User Performance Guide