Nemotron-3-Nano with llama.cpp

30 MIN

Run the Nemotron-3-Nano-30B model using llama.cpp on DGX Spark

Basic idea

Nemotron-3-Nano-30B-A3B is NVIDIA's language model built on a 30-billion-parameter Mixture of Experts (MoE) architecture that activates only about 3 billion parameters per token. Because only the active experts are evaluated for each token, the model offers high-quality inference at a much lower compute cost than a dense 30B model, making it well suited to DGX Spark's GB10 GPU.

This playbook demonstrates how to run Nemotron-3-Nano using llama.cpp, which compiles CUDA kernels at build time specifically for your GPU architecture. The model includes built-in reasoning (thinking mode) and tool calling support via the chat template.
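
As a sketch, a CUDA-enabled build of llama.cpp looks like the following. The flags are standard llama.cpp CMake options; with a recent CMake, CMAKE_CUDA_ARCHITECTURES=native targets whatever GPU is detected at build time, so the exact value for the GB10 does not need to be hard-coded.

    # Clone llama.cpp and build it with the CUDA backend.
    git clone https://github.com/ggml-org/llama.cpp.git
    cd llama.cpp

    # GGML_CUDA=ON enables the CUDA backend; "native" compiles kernels
    # for the GPU present on this machine.
    cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native
    cmake --build build --config Release -j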

What you'll accomplish

You will have a fully functional Nemotron-3-Nano-30B-A3B inference server running on your DGX Spark, accessible via an OpenAI-compatible API. This setup enables:

  • Local LLM inference
  • OpenAI-compatible API endpoint for easy integration with existing tools (see the example after this list)
  • Built-in reasoning and tool calling capabilities
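
For example, once the server is running, a request like the one below exercises the OpenAI-compatible endpoint. This assumes llama-server's default port of 8080; the model field is a placeholder, since llama-server serves whichever model it was launched with.

    # Minimal smoke test of the chat completions endpoint.
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "nemotron-3-nano",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'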

What to know before starting

  • Basic familiarity with Linux command line and terminal commands
  • Understanding of git and working with branches
  • Experience building software from source with CMake
  • Basic knowledge of REST APIs and cURL for testing
  • Familiarity with Hugging Face Hub for model downloads
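
If you have not used the Hugging Face CLI before, a download typically looks like the sketch below. The repository id is a placeholder; substitute the Nemotron-3-Nano GGUF repository linked under Resources.

    # Install the CLI and fetch the GGUF weights (repo id is a placeholder).
    pip install -U "huggingface_hub[cli]"
    huggingface-cli download <org>/<nemotron-3-nano-gguf> \
      --include "*.gguf" --local-dir ./models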

Prerequisites

Hardware Requirements:

  • NVIDIA DGX Spark with GB10 GPU
  • At least 40GB available GPU memory (model uses ~38GB VRAM)
  • At least 50GB available storage space for model downloads and build artifacts
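
A quick way to confirm headroom before starting (DGX Spark uses unified memory, so nvidia-smi's figures reflect the shared pool):

    # Check GPU memory and free disk space in the working directory.
    nvidia-smi --query-gpu=memory.total,memory.used --format=csv
    df -h .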

Software Requirements:

  • NVIDIA DGX OS
  • Git: git --version
  • CMake (3.14+): cmake --version
  • CUDA Toolkit: nvcc --version
  • Network access to GitHub and Hugging Face
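
The listed commands can be run in one pass; each should print a version rather than an error:

    # Verify the build toolchain is present.
    git --version && cmake --version && nvcc --version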

Time & risk

  • Estimated time: 30 minutes (including model download of ~38GB)
  • Risk level: Low
    • Build process compiles from source but doesn't modify system files
    • Model downloads can be resumed if interrupted
  • Rollback: Delete the cloned llama.cpp directory and downloaded model files to fully remove the installation (see the sketch after this list)
  • Last Updated: 12/17/2025
    • First Publication
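
A minimal rollback sketch, assuming the repository was cloned to ./llama.cpp and the model was downloaded to ./models (adjust both paths to wherever you placed them):

    # Remove the source tree and build artifacts.
    rm -rf llama.cpp
    # Remove the downloaded model files (path is an assumption).
    rm -rf models
    # If the weights went through the Hugging Face cache, this interactive
    # helper can clear them too:
    # huggingface-cli delete-cache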

Resources

  • llama.cpp GitHub Repository
  • Nemotron-3-Nano GGUF on Hugging Face
  • DGX Spark Documentation
  • DGX Spark Forum
  • DGX Spark User Performance Guide