Nemotron-3-Nano with llama.cpp
Run the Nemotron-3-Nano-30B model using llama.cpp on DGX Spark
Verify prerequisites
Ensure you have the required tools installed on your DGX Spark before proceeding.
git --version
cmake --version
nvcc --version
All commands should return version information. If any are missing, install them before continuing.
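If git or cmake is missing, it can typically be installed from the system package manager, while nvcc comes from the CUDA toolkit. A minimal sketch, assuming an apt-based DGX OS / Ubuntu image:
# Install missing build tools (assumes an apt-based system)
sudo apt-get update
sudo apt-get install -y git cmake build-essential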
Install the Hugging Face CLI:
python3 -m venv nemotron-venv
source nemotron-venv/bin/activate
pip install -U "huggingface_hub[cli]"
Verify installation:
hf version
Clone llama.cpp repository
Clone the llama.cpp repository, which provides the inference framework for running Nemotron models.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
Build llama.cpp with CUDA support
Build llama.cpp with CUDA enabled and targeting the GB10's sm_121 compute architecture. This compiles CUDA kernels specifically optimized for your DGX Spark GPU.
mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121" -DLLAMA_CURL=OFF
make -j8
The build process takes approximately 5-10 minutes. You should see compilation progress and eventually a successful build message.
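To confirm the build succeeded, check that the compiled binaries were placed in bin/ inside the build directory (llama-cli is built alongside the server):
# Run from inside the build directory
ls -lh bin/llama-server bin/llama-cli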
Download the Nemotron GGUF model
Download the Q8 quantized GGUF model from Hugging Face. This model provides excellent quality while fitting within the GB10's memory capacity.
hf download unsloth/Nemotron-3-Nano-30B-A3B-GGUF \
Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
--local-dir ~/models/nemotron3-gguf
This downloads approximately 38GB. The download can be resumed if interrupted.
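Before launching the server, you can confirm the file downloaded completely by checking its size on disk (it should be roughly 38 GB):
ls -lh ~/models/nemotron3-gguf/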
Start the llama.cpp server
Launch the inference server with the Nemotron model. The server provides an OpenAI-compatible API endpoint.
./bin/llama-server \
--model ~/models/nemotron3-gguf/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
--host 0.0.0.0 \
--port 30000 \
--n-gpu-layers 99 \
--ctx-size 8192 \
--threads 8
Parameter explanation:
- --host 0.0.0.0: Listen on all network interfaces
- --port 30000: API server port
- --n-gpu-layers 99: Offload all layers to GPU
- --ctx-size 8192: Context window size (can increase up to 1M)
- --threads 8: CPU threads for non-GPU operations
You should see server startup messages indicating the model is loaded and ready:
llama_new_context_with_model: n_ctx = 8192
...
main: server is listening on 0.0.0.0:30000
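llama-server also exposes a /health endpoint you can poll from another terminal; it returns a small JSON status once the model has finished loading, which is a quick way to confirm the server is ready before sending requests:
curl http://localhost:30000/health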
Test the API
Open a new terminal and test the inference server using the OpenAI-compatible chat completions endpoint.
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nemotron",
"messages": [{"role": "user", "content": "New York is a great city because..."}],
"max_tokens": 100
}'
Expected response format:
{
"choices": [
{
"finish_reason": "length",
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": "We need to respond to user statement: \"New York is a great city because...\". Probably they want continuation, maybe a discussion. It's a simple open-ended prompt. Provide reasons why New York is great. No policy issues. Just respond creatively.",
"content": "New York is a great city because it's a living, breathing collage of cultures, ideas, and possibilities—all stacked into one vibrant, never‑sleeping metropolis. Here are just a few reasons that many people ("
}
}
],
"created": 1765916539,
"model": "Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf",
"object": "chat.completion",
"usage": {
"completion_tokens": 100,
"prompt_tokens": 25,
"total_tokens": 125
},
"id": "chatcmpl-...",
"timings": {
...
}
}
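If you only want the generated text rather than the full JSON document, you can pipe the response through a short Python one-liner that extracts the assistant message content shown above (a convenience sketch using only the standard library):
curl -s http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "nemotron", "messages": [{"role": "user", "content": "New York is a great city because..."}], "max_tokens": 100}' \
| python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'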
Test reasoning capabilities
Nemotron-3-Nano includes built-in reasoning capabilities. Test with a more complex prompt:
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nemotron",
"messages": [{"role": "user", "content": "Solve this step by step: If a train travels 120 miles in 2 hours, what is its average speed?"}],
"max_tokens": 500
}'
The model will provide a detailed reasoning chain before giving the final answer.
Cleanup
To stop the server, press Ctrl+C in the terminal where it's running.
To completely remove the installation, delete the cloned repository and the downloaded model (adjust the paths if you cloned or downloaded to different locations):
# Remove llama.cpp build
rm -rf ~/llama.cpp
# Remove downloaded models
rm -rf ~/models/nemotron3-gguf
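If you also created the Python virtual environment for the Hugging Face CLI, you can remove it as well (this assumes nemotron-venv was created in your home directory; adjust the path if you created it elsewhere):
# Deactivate the venv if it is currently active, then delete it
deactivate
rm -rf ~/nemotron-venv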
Next steps
- Increase context size: For longer conversations, increase --ctx-size up to 1048576 (1M tokens), though this will use more memory; see the example command below.
- Integrate with applications: Use the OpenAI-compatible API with tools like Open WebUI, Continue.dev, or custom applications.
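For example, a launch command with a larger context window looks like the one below; 131072 is only an illustration, and memory use grows with the context size you choose:
./bin/llama-server \
--model ~/models/nemotron3-gguf/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
--host 0.0.0.0 \
--port 30000 \
--n-gpu-layers 99 \
--ctx-size 131072 \
--threads 8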
The server supports the full OpenAI API specification including streaming responses, function calling, and multi-turn conversations.
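For example, streaming can be requested by adding "stream": true to the request body, which makes the server return the completion incrementally as server-sent events (the -N flag tells curl not to buffer the output):
curl -N http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nemotron",
"messages": [{"role": "user", "content": "Write a haiku about New York."}],
"max_tokens": 100,
"stream": true
}'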