Run the Nemotron-3-Nano-30B model using llama.cpp on DGX Spark
Ensure you have the required tools installed on your DGX Spark before proceeding.
git --version
cmake --version
nvcc --version
All commands should return version information. If any are missing, install them before continuing.
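If you prefer a scripted check, the same verification can be done from Python using the standard library's shutil.which; a minimal sketch:

import shutil

# Each tool must resolve to a path on PATH; a None result means it is missing.
for tool in ("git", "cmake", "nvcc"):
    path = shutil.which(tool)
    print(f"{tool}: {path or 'NOT FOUND - install before continuing'}")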
Install the Hugging Face CLI:
python3 -m venv nemotron-venv
source nemotron-venv/bin/activate
pip install -U "huggingface_hub[cli]"
Verify installation:
hf version
Clone the llama.cpp repository, which provides the inference framework for running Nemotron models.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
Build llama.cpp with CUDA enabled, targeting the GB10's sm_121 compute architecture. This compiles CUDA kernels optimized specifically for the DGX Spark GPU.
mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121" -DLLAMA_CURL=OFF
make -j8
The build process takes approximately 5-10 minutes. You should see compilation progress and eventually a successful build message.
Download the Q8-quantized GGUF model from Hugging Face. Q8 quantization preserves near-full-precision quality while fitting comfortably within the GB10's 128 GB of unified memory.
hf download unsloth/Nemotron-3-Nano-30B-A3B-GGUF \
Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
--local-dir ~/models/nemotron3-gguf
This downloads approximately 38GB. The download can be resumed if interrupted.
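If you prefer to script the download, huggingface_hub (installed with the CLI above) exposes hf_hub_download, which fetches the same file and resumes interrupted transfers; a minimal sketch:

import os
from huggingface_hub import hf_hub_download

# Download the quantized model file into the same directory used above.
path = hf_hub_download(
    repo_id="unsloth/Nemotron-3-Nano-30B-A3B-GGUF",
    filename="Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf",
    local_dir=os.path.expanduser("~/models/nemotron3-gguf"),
)
print(f"Model saved to {path}")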
Launch the inference server with the Nemotron model. The server provides an OpenAI-compatible API endpoint.
./bin/llama-server \
--model ~/models/nemotron3-gguf/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
--host 0.0.0.0 \
--port 30000 \
--n-gpu-layers 99 \
--ctx-size 8192 \
--threads 8
Parameter explanation:
--host 0.0.0.0: Listen on all network interfaces
--port 30000: API server port
--n-gpu-layers 99: Offload all layers to the GPU
--ctx-size 8192: Context window size (can be increased up to 1M tokens)
--threads 8: CPU threads for non-GPU operations
You should see server startup messages indicating the model is loaded and ready:
llama_new_context_with_model: n_ctx = 8192
...
main: server is listening on 0.0.0.0:30000
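You can also confirm readiness programmatically by polling llama-server's /health endpoint, which returns an error response while the model is still loading. A standard-library-only sketch, assuming the default address used above:

import json
import time
import urllib.request

# Poll until the server reports it is ready to accept requests.
while True:
    try:
        with urllib.request.urlopen("http://localhost:30000/health") as resp:
            print(json.load(resp))  # e.g. {"status": "ok"}
            break
    except Exception:
        time.sleep(2)  # server still starting or model still loading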
Open a new terminal and test the inference server using the OpenAI-compatible chat completions endpoint.
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nemotron",
"messages": [{"role": "user", "content": "New York is a great city because..."}],
"max_tokens": 100
}'
Expected response format:
{
"choices": [
{
"finish_reason": "length",
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": "We need to respond to user statement: \"New York is a great city because...\". Probably they want continuation, maybe a discussion. It's a simple open-ended prompt. Provide reasons why New York is great. No policy issues. Just respond creatively.",
"content": "New York is a great city because it's a living, breathing collage of cultures, ideas, and possibilities—all stacked into one vibrant, never‑sleeping metropolis. Here are just a few reasons that many people ("
}
}
],
"created": 1765916539,
"model": "Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf",
"object": "chat.completion",
"usage": {
"completion_tokens": 100,
"prompt_tokens": 25,
"total_tokens": 125
},
"id": "chatcmpl-...",
"timings": {
...
}
}
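Because the endpoint is OpenAI-compatible, you can also call it with the official OpenAI Python client (pip install openai) instead of curl. A minimal sketch; the api_key value is a placeholder, since llama-server does not require authentication:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="nemotron",
    messages=[{"role": "user", "content": "New York is a great city because..."}],
    max_tokens=100,
)
print(response.choices[0].message.content)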
Nemotron-3-Nano includes built-in reasoning capabilities. Test with a more complex prompt:
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nemotron",
"messages": [{"role": "user", "content": "Solve this step by step: If a train travels 120 miles in 2 hours, what is its average speed?"}],
"max_tokens": 500
}'
The model will provide a detailed reasoning chain before giving the final answer.
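The reasoning chain arrives in the reasoning_content field shown in the earlier sample response; note that this field is a llama.cpp extension, not part of the standard OpenAI schema. A standard-library sketch that separates the reasoning from the final answer:

import json
import urllib.request

payload = json.dumps({
    "model": "nemotron",
    "messages": [{"role": "user", "content": "Solve this step by step: If a train travels 120 miles in 2 hours, what is its average speed?"}],
    "max_tokens": 500,
}).encode()

req = urllib.request.Request(
    "http://localhost:30000/v1/chat/completions",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    message = json.load(resp)["choices"][0]["message"]

# reasoning_content may be absent if the model skips explicit reasoning.
print("Reasoning:", message.get("reasoning_content", ""))
print("Answer:", message["content"])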
To stop the server, press Ctrl+C in the terminal where it's running.
To completely remove the installation:
# Remove llama.cpp build
rm -rf ~/llama.cpp
# Remove downloaded models
rm -rf ~/models/nemotron3-gguf
You can increase --ctx-size up to 1048576 (1M tokens), though this will use more memory.
The server supports the full OpenAI API specification, including streaming responses, function calling, and multi-turn conversations.
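As a starting point for streaming and multi-turn use, here is a minimal sketch with the OpenAI Python client: it streams one reply token by token, then appends the assistant turn to the history so a follow-up request carries the full conversation.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

history = [{"role": "user", "content": "Give me one fact about New York."}]

# stream=True delivers the reply incrementally as server-sent events.
stream = client.chat.completions.create(
    model="nemotron",
    messages=history,
    max_tokens=200,
    stream=True,
)

answer = ""
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    answer += delta
    print(delta, end="", flush=True)

# Carry the conversation forward by appending both turns to the history.
history.append({"role": "assistant", "content": answer})
history.append({"role": "user", "content": "Tell me another."})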