
Nemotron-3-Nano with llama.cpp

30 MIN

Run the Nemotron-3-Nano-30B model using llama.cpp on DGX Spark


Step 1
Verify prerequisites

Ensure you have the required tools installed on your DGX Spark before proceeding.

git --version
cmake --version
nvcc --version

All commands should return version information. If any are missing, install them before continuing.
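
On DGX OS (Ubuntu-based), the missing tools can typically be installed with apt; nvcc ships with the CUDA toolkit, which DGX OS preinstalls. A minimal sketch, assuming standard Ubuntu package names:

# Typical package names on Ubuntu-based systems (assumption; verify for your image)
sudo apt update
sudo apt install -y git cmake build-essential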

Install the Hugging Face CLI in a Python virtual environment:

python3 -m venv nemotron-venv
source nemotron-venv/bin/activate
pip install -U "huggingface_hub[cli]"

Verify installation:

hf version

Step 2
Clone llama.cpp repository

Clone the llama.cpp repository, which provides the inference framework used to run Nemotron models.

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

Step 3
Build llama.cpp with CUDA support

Build llama.cpp with CUDA enabled, targeting the GB10's sm_121 compute architecture. This compiles CUDA kernels specifically for your DGX Spark GPU.

mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121" -DLLAMA_CURL=OFF
make -j8

The build process takes approximately 5-10 minutes. You should see compilation progress and eventually a successful build message.
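
As a quick sanity check, confirm the server binary was produced (run from the build directory; the CMake build places binaries under bin/):

ls -lh bin/llama-server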

Step 4
Download the Nemotron GGUF model

Download the Q8-quantized GGUF model from Hugging Face. Q8 quantization retains quality close to the original weights while the model still fits within the GB10's unified memory.

hf download unsloth/Nemotron-3-Nano-30B-A3B-GGUF \
  Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
  --local-dir ~/models/nemotron3-gguf

This downloads approximately 38GB. The download can be resumed if interrupted.
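
Once it completes, verify the file landed where expected and is roughly the right size:

ls -lh ~/models/nemotron3-gguf/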

Step 5
Start the llama.cpp server

Launch the inference server with the Nemotron model. The server provides an OpenAI-compatible API endpoint.

./bin/llama-server \
  --model ~/models/nemotron3-gguf/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
  --host 0.0.0.0 \
  --port 30000 \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --threads 8

Parameter explanation:

  • --host 0.0.0.0: Listen on all network interfaces
  • --port 30000: API server port
  • --n-gpu-layers 99: Offload all layers to GPU
  • --ctx-size 8192: Context window size (can be increased up to 1M tokens)
  • --threads 8: CPU threads for non-GPU operations

You should see server startup messages indicating the model is loaded and ready:

llama_new_context_with_model: n_ctx = 8192
...
main: server is listening on 0.0.0.0:30000
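
Before sending prompts, you can confirm readiness from a second terminal; the llama.cpp server exposes a /health endpoint that reports OK once the model is loaded:

curl http://localhost:30000/health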

Step 6
Test the API

Open a new terminal and test the inference server using the OpenAI-compatible chat completions endpoint.

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron",
    "messages": [{"role": "user", "content": "New York is a great city because..."}],
    "max_tokens": 100
  }'

Expected response format:

{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "We need to respond to user statement: \"New York is a great city because...\". Probably they want continuation, maybe a discussion. It's a simple open-ended prompt. Provide reasons why New York is great. No policy issues. Just respond creatively.",
        "content": "New York is a great city because it's a living, breathing collage of cultures, ideas, and possibilities—all stacked into one vibrant, never‑sleeping metropolis. Here are just a few reasons that many people ("
      }
    }
  ],
  "created": 1765916539,
  "model": "Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 100,
    "prompt_tokens": 25,
    "total_tokens": 125
  },
  "id": "chatcmpl-...",
  "timings": {
    ...
  }
}
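
To extract just the generated text, pipe the response through jq (assuming jq is installed; the field path matches the response format shown above):

curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nemotron", "messages": [{"role": "user", "content": "New York is a great city because..."}], "max_tokens": 100}' \
  | jq -r '.choices[0].message.content'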

Step 7
Test reasoning capabilities

Nemotron-3-Nano includes built-in reasoning capabilities. Test with a more complex prompt:

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron",
    "messages": [{"role": "user", "content": "Solve this step by step: If a train travels 120 miles in 2 hours, what is its average speed?"}],
    "max_tokens": 500
  }'

The model returns a detailed reasoning chain in the reasoning_content field before giving the final answer in content, as in the example response above.

Step 8
Cleanup

To stop the server, press Ctrl+C in the terminal where it's running.

To completely remove the installation:

# Remove llama.cpp build
rm -rf ~/llama.cpp

# Remove downloaded models
rm -rf ~/models/nemotron3-gguf
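
If you created the nemotron-venv virtual environment in Step 1, remove it as well (the path below assumes it was created in your home directory; adjust if you created it elsewhere):

# Leave the virtual environment first if it is active, then remove it
deactivate
rm -rf ~/nemotron-venv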

Step 9
Next steps

  1. Increase context size: For longer conversations, increase --ctx-size up to 1048576 (1M tokens), though this will use more memory
  2. Integrate with applications: Use the OpenAI-compatible API with tools like Open WebUI, Continue.dev, or custom applications

The server supports the full OpenAI API specification including streaming responses, function calling, and multi-turn conversations.
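
For example, to stream tokens as they are generated, set "stream": true; the server then returns server-sent events in the same chunk format as the OpenAI API (the prompt here is illustrative):

curl -N http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "max_tokens": 100,
    "stream": true
  }'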

Resources

  • llama.cpp GitHub Repository
  • Nemotron-3-Nano GGUF on Hugging Face
  • DGX Spark Documentation
  • DGX Spark Forum
  • DGX Spark User Performance Guide