Run the Nemotron-3-Nano-30B model using llama.cpp on DGX Spark
Ensure you have the required tools installed on your DGX Spark before proceeding.
git --version
cmake --version
nvcc --version
All commands should return version information. If any are missing, install them before continuing.
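If you prefer a scripted check, the same verification can be done from Python using the standard library's shutil.which; a minimal sketch:

import shutil

# Each tool must resolve to a path on PATH; a None result means it is missing.
for tool in ("git", "cmake", "nvcc"):
    path = shutil.which(tool)
    print(f"{tool}: {path or 'NOT FOUND - install before continuing'}")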
Install the Hugging Face CLI:
python3 -m venv nemotron-venv
source nemotron-venv/bin/activate
pip install -U "huggingface_hub[cli]"
Verify installation:
hf version
Clone the llama.cpp repository, which provides the inference framework for running Nemotron models.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
Build llama.cpp with CUDA enabled, targeting the GB10's sm_121 compute architecture. This compiles CUDA kernels optimized specifically for the DGX Spark GPU.
mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121" -DLLAMA_CURL=OFF
make -j8
The build process takes approximately 5-10 minutes. You should see compilation progress and eventually a successful build message.
Download the Q8-quantized GGUF model from Hugging Face. Q8 quantization preserves near-full-precision quality while fitting comfortably within the GB10's 128 GB of unified memory.
hf download unsloth/Nemotron-3-Nano-30B-A3B-GGUF \
Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
--local-dir ~/models/nemotron3-gguf
This downloads approximately 38GB. The download can be resumed if interrupted.
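If you prefer to script the download, huggingface_hub (installed with the CLI above) exposes hf_hub_download, which fetches the same file and resumes interrupted transfers; a minimal sketch:

import os
from huggingface_hub import hf_hub_download

# Download the quantized model file into the same directory used above.
path = hf_hub_download(
    repo_id="unsloth/Nemotron-3-Nano-30B-A3B-GGUF",
    filename="Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf",
    local_dir=os.path.expanduser("~/models/nemotron3-gguf"),
)
print(f"Model saved to {path}")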
Launch the inference server with the Nemotron model. The server provides an OpenAI-compatible API endpoint.
./bin/llama-server \
--model ~/models/nemotron3-gguf/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
--host 0.0.0.0 \
--port 30000 \
--n-gpu-layers 99 \
--ctx-size 8192 \
--threads 8
Parameter explanation:
--host 0.0.0.0: Listen on all network interfaces
--port 30000: API server port
--n-gpu-layers 99: Offload all layers to the GPU
--ctx-size 8192: Context window size (can be increased up to 1M tokens)
--threads 8: CPU threads for non-GPU operations
You should see server startup messages indicating the model is loaded and ready:
llama_new_context_with_model: n_ctx = 8192
...
main: server is listening on 0.0.0.0:30000
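You can also confirm readiness programmatically by polling llama-server's /health endpoint, which returns an error response while the model is still loading. A standard-library-only sketch, assuming the default address used above:

import json
import time
import urllib.request

# Poll until the server reports it is ready to accept requests.
while True:
    try:
        with urllib.request.urlopen("http://localhost:30000/health") as resp:
            print(json.load(resp))  # e.g. {"status": "ok"}
            break
    except Exception:
        time.sleep(2)  # server still starting or model still loading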
Open a new terminal and test the inference server using the OpenAI-compatible chat completions endpoint.
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nemotron",
"messages": [{"role": "user", "content": "New York is a great city because..."}],
"max_tokens": 100
}'
Expected response format:
{
"choices": [
{
"finish_reason": "length",
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": "We need to respond to user statement: \"New York is a great city because...\". Probably they want continuation, maybe a discussion. It's a simple open-ended prompt. Provide reasons why New York is great. No policy issues. Just respond creatively.",
"content": "New York is a great city because it's a living, breathing collage of cultures, ideas, and possibilities—all stacked into one vibrant, never‑sleeping metropolis. Here are just a few reasons that many people ("
}
}
],
"created": 1765916539,
"model": "Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf",
"object": "chat.completion",
"usage": {
"completion_tokens": 100,
"prompt_tokens": 25,
"total_tokens": 125
},
"id": "chatcmpl-...",
"timings": {
...
}
}
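Because the endpoint is OpenAI-compatible, you can also call it with the official OpenAI Python client (pip install openai) instead of curl. A minimal sketch; the api_key value is a placeholder, since llama-server does not require authentication:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="nemotron",
    messages=[{"role": "user", "content": "New York is a great city because..."}],
    max_tokens=100,
)
print(response.choices[0].message.content)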
Nemotron-3-Nano includes built-in reasoning capabilities. Test with a more complex prompt:
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nemotron",
"messages": [{"role": "user", "content": "Solve this step by step: If a train travels 120 miles in 2 hours, what is its average speed?"}],
"max_tokens": 500
}'
The model will provide a detailed reasoning chain before giving the final answer.
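The reasoning chain arrives in the reasoning_content field shown in the earlier sample response; note that this field is a llama.cpp extension, not part of the standard OpenAI schema. A standard-library sketch that separates the reasoning from the final answer:

import json
import urllib.request

payload = json.dumps({
    "model": "nemotron",
    "messages": [{"role": "user", "content": "Solve this step by step: If a train travels 120 miles in 2 hours, what is its average speed?"}],
    "max_tokens": 500,
}).encode()

req = urllib.request.Request(
    "http://localhost:30000/v1/chat/completions",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    message = json.load(resp)["choices"][0]["message"]

# reasoning_content may be absent if the model skips explicit reasoning.
print("Reasoning:", message.get("reasoning_content", ""))
print("Answer:", message["content"])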
To stop the server, press Ctrl+C in the terminal where it's running.
To completely remove the installation:
# Remove llama.cpp build
rm -rf ~/llama.cpp
# Remove downloaded models
rm -rf ~/models/nemotron3-gguf
You can increase --ctx-size up to 1048576 (1M tokens), though this will use more memory.
The server supports the full OpenAI API specification, including streaming responses, function calling, and multi-turn conversations.
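As a starting point for streaming and multi-turn use, here is a minimal sketch with the OpenAI Python client: it streams one reply token by token, then appends the assistant turn to the history so a follow-up request carries the full conversation.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

history = [{"role": "user", "content": "Give me one fact about New York."}]

# stream=True delivers the reply incrementally as server-sent events.
stream = client.chat.completions.create(
    model="nemotron",
    messages=history,
    max_tokens=200,
    stream=True,
)

answer = ""
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    answer += delta
    print(delta, end="", flush=True)

# Carry the conversation forward by appending both turns to the history.
history.append({"role": "assistant", "content": answer})
history.append({"role": "user", "content": "Tell me another."})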