Nemotron-3-Nano with llama.cpp
Run the Nemotron-3-Nano-30B-A3B model using llama.cpp on DGX Spark
Basic idea
Nemotron-3-Nano-30B-A3B is NVIDIA's language model built on a 30-billion-parameter Mixture of Experts (MoE) architecture that activates only about 3 billion parameters per token. This efficient design enables high-quality inference with lower computational requirements, making it well suited to DGX Spark's GB10 GPU.
This playbook demonstrates how to run Nemotron-3-Nano using llama.cpp, which compiles CUDA kernels at build time specifically for your GPU architecture. The model includes built-in reasoning (thinking mode) and tool calling support via the chat template.
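As a preview, the build follows llama.cpp's standard CMake flow with CUDA enabled. Treat this as a sketch rather than the playbook's exact commands; the branch and flags needed for Nemotron support may differ:

```bash
# Sketch of a CUDA-enabled llama.cpp build; nvcc compiles kernels
# for the GPU architecture it detects at build time.
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```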
What you'll accomplish
You will have a fully functional Nemotron-3-Nano-30B-A3B inference server running on your DGX Spark, accessible via an OpenAI-compatible API. This setup enables:
- Local LLM inference
- OpenAI-compatible API endpoint for easy integration with existing tools (see the example request after this list)
- Built-in reasoning and tool calling capabilities
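For example, once the server is running you can exercise the endpoint with cURL. The port and model name below are assumptions (llama-server defaults to port 8080); adjust them to match your launch command:

```bash
# Hypothetical request against a local llama-server instance
# (assumes the default port 8080; the "model" field is informational).
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nemotron-3-nano",
        "messages": [{"role": "user", "content": "Explain MoE in one sentence."}]
      }'
```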
What to know before starting
- Basic familiarity with Linux command line and terminal commands
- Understanding of git and working with branches
- Experience building software from source with CMake
- Basic knowledge of REST APIs and cURL for testing
- Familiarity with Hugging Face Hub for model downloads
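If the Hugging Face CLI is new to you, a typical download looks like the sketch below. The repository ID is a placeholder, not the exact repo this playbook uses:

```bash
# Hypothetical download sketch; substitute the actual GGUF
# repository for <org>/<gguf-repo>.
pip install -U "huggingface_hub[cli]"
huggingface-cli download <org>/<gguf-repo> --local-dir ./models
```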
Prerequisites
Hardware Requirements:
- NVIDIA DGX Spark with GB10 GPU
- At least 40GB available GPU memory (model uses ~38GB VRAM)
- At least 50GB available storage space for model downloads and build artifacts
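You can sanity-check memory and disk headroom before starting. This assumes nvidia-smi reports the GB10's unified memory pool:

```bash
# Check GPU memory (unified on DGX Spark) and free disk space.
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
df -h ~
```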
Software Requirements:
- NVIDIA DGX OS
- Git: `git --version`
- CMake (3.14+): `cmake --version`
- CUDA Toolkit: `nvcc --version`
- Network access to GitHub and Hugging Face
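A quick way to confirm the toolchain before building:

```bash
git --version    # any recent version
cmake --version  # must report 3.14 or newer
nvcc --version   # confirms the CUDA Toolkit is on PATH
```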
Time & risk
- Estimated time: 30 minutes (including model download of ~38GB)
- Risk level: Low
- Build process compiles from source but doesn't modify system files
- Model downloads can be resumed if interrupted
- Rollback: Delete the cloned llama.cpp directory and downloaded model files to fully remove the installation (a cleanup sketch appears at the end of this section)
- Last updated: December 17, 2025 (first publication)
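For reference, the rollback amounts to deleting two directories. The paths below are assumptions based on where you cloned the repository and downloaded the model:

```bash
# Hypothetical cleanup; adjust paths to your actual clone and
# model-download locations.
rm -rf ~/llama.cpp
rm -rf ~/models
```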