Run models with llama.cpp on DGX Spark

Step 1
Verify prerequisites

The example checkpoint is nemotron-3-nano-omni-ga_v1.0-Q8_0.gguf from Hugging Face repo ggml-org/NVIDIA-Nemotron-3-Nano-Omni (full handle: ggml-org/NVIDIA-Nemotron-3-Nano-Omni/nemotron-3-nano-omni-ga_v1.0-Q8_0.gguf). Other supported GGUFs—including Qwen3.6, Gemma, and alternate Nemotron Omni builds—use the same build and server steps; change hf download and --model paths (see the overview model matrix).

Ensure the required tools are installed:

git --version
cmake --version
nvcc --version

All commands should return version information. If any are missing, install them before continuing.

Install the Hugging Face CLI:

python3 -m venv llama-cpp-venv
source llama-cpp-venv/bin/activate
pip install -U "huggingface_hub[cli]"

Verify installation:

hf version

Step 2
Clone the llama.cpp repository

Clone upstream llama.cpp—the framework you are building:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

Step 3
Build llama.cpp with CUDA

Configure CMake with CUDA and GB10’s sm_121 architecture so GGML’s CUDA backend matches your GPU:

mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121" -DLLAMA_CURL=OFF
make -j8

The build usually takes on the order of 5–10 minutes. When it finishes, binaries such as llama-server appear under build/bin/.

Step 4
Download example Nemotron 3 Nano Omni GGUF

llama.cpp loads models in GGUF format. This playbook uses the Q8_0 checkpoint from ggml-org/NVIDIA-Nemotron-3-Nano-Omni, which balances quality and memory on DGX Spark GB10 unified memory.

hf download ggml-org/NVIDIA-Nemotron-3-Nano-Omni \
  nemotron-3-nano-omni-ga_v1.0-Q8_0.gguf \
  --local-dir ~/models/NVIDIA-Nemotron-3-Nano-Omni

The file is on the order of ~35GB (exact size may vary). The download can be resumed if interrupted.

Step 5
Start llama-server with Nemotron 3 Nano Omni

From your llama.cpp/build directory, launch the OpenAI-compatible server with GPU offload:

./bin/llama-server \
  --model ~/models/NVIDIA-Nemotron-3-Nano-Omni/nemotron-3-nano-omni-ga_v1.0-Q8_0.gguf \
  --host 0.0.0.0 \
  --port 30000 \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --threads 8

Parameters (short):

--host / --port: bind address and port for the HTTP API
--n-gpu-layers 99: offload layers to the GPU (adjust if you use a different model)
--ctx-size: context length (can be increased up to model/server limits; uses more memory)
--threads: CPU threads for non-GPU work

You should see log lines similar to:

llama_new_context_with_model: n_ctx = 8192
...
main: server is listening on 0.0.0.0:30000

Keep this terminal open while testing. Large GGUFs can take a minute or more to load; until you see server is listening, nothing accepts connections on port 30000 (see Troubleshooting if curl reports connection refused).

Step 6
Test the API

Use a second terminal on the same machine that runs llama-server (for example another SSH session into DGX Spark). If you run curl on your laptop while the server runs only on Spark, use the Spark hostname or IP instead of localhost.

curl -X POST http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron",
    "messages": [{"role": "user", "content": "New York is a great city because..."}],
    "max_tokens": 100
  }'

If you see curl: (7) Failed to connect, the server is still loading, the process exited (check the server log for OOM or path errors), or you are not curling the host that runs llama-server.

Example shape of the response (fields vary by llama.cpp version; message may include extra keys):

{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "New York is a great city because it's a living, breathing collage of cultures, ideas, and possibilities—all stacked into one vibrant, never‑sleeping metropolis. Here are just a few reasons that many people ("
      }
    }
  ],
  "created": 1765916539,
  "model": "nemotron-3-nano-omni-ga_v1.0-Q8_0.gguf",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 100,
    "prompt_tokens": 25,
    "total_tokens": 125
  },
  "id": "chatcmpl-...",
  "timings": {
    ...
  }
}

Step 7
Longer completion (with Nemotron 3 Nano Omni)

Try a slightly longer prompt to confirm stable generation with Nemotron 3 Nano Omni:

curl -X POST http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron",
    "messages": [{"role": "user", "content": "Solve this step by step: If a train travels 120 miles in 2 hours, what is its average speed?"}],
    "max_tokens": 500
  }'

Step 8
Cleanup

Stop the server with Ctrl+C in the terminal where it is running.

To remove this tutorial’s artifacts:

rm -rf ~/llama.cpp
rm -rf ~/models/NVIDIA-Nemotron-3-Nano-Omni

Deactivate the Python venv if you no longer need hf:

deactivate

Step 9
Next steps

Context length: Increase --ctx-size for longer chats (watch memory; 1M-token class contexts are possible only when the build, model, and hardware allow).
Other models: Point --model at any compatible GGUF; the llama.cpp server API stays the same.
Integrations: Point Open WebUI, Continue.dev, or custom clients at http://<spark-host>:30000/v1 using the OpenAI client pattern.

The server implements the usual OpenAI-style chat features your llama.cpp build enables (including streaming and tool-related flows where supported).

Step 1
Verify prerequisites

Ensure the required tools are installed:

git --version
cmake --version
nvcc --version

All commands should return version information. If any are missing, install them before continuing.

Install the Hugging Face CLI:

python3 -m venv llama-cpp-venv
source llama-cpp-venv/bin/activate
pip install -U "huggingface_hub[cli]"

Verify installation:

hf version

Step 2
Clone the llama.cpp repository

Clone upstream llama.cpp—the framework you are building:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

Step 3
Build llama.cpp with CUDA

Configure CMake with CUDA and GB10’s sm_121 architecture so GGML’s CUDA backend matches your GPU:

mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121" -DLLAMA_CURL=OFF
make -j8

The build usually takes on the order of 5–10 minutes. When it finishes, binaries such as llama-server appear under build/bin/.

Step 4
Download example Nemotron 3 Nano Omni GGUF

llama.cpp loads models in GGUF format. This playbook uses the Q8_0 checkpoint from ggml-org/NVIDIA-Nemotron-3-Nano-Omni, which balances quality and memory on DGX Spark GB10 unified memory.

hf download ggml-org/NVIDIA-Nemotron-3-Nano-Omni \
  nemotron-3-nano-omni-ga_v1.0-Q8_0.gguf \
  --local-dir ~/models/NVIDIA-Nemotron-3-Nano-Omni

The file is on the order of ~35GB (exact size may vary). The download can be resumed if interrupted.

Step 5
Start llama-server with Nemotron 3 Nano Omni

From your llama.cpp/build directory, launch the OpenAI-compatible server with GPU offload:

./bin/llama-server \
  --model ~/models/NVIDIA-Nemotron-3-Nano-Omni/nemotron-3-nano-omni-ga_v1.0-Q8_0.gguf \
  --host 0.0.0.0 \
  --port 30000 \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --threads 8

Parameters (short):

--host / --port: bind address and port for the HTTP API
--n-gpu-layers 99: offload layers to the GPU (adjust if you use a different model)
--ctx-size: context length (can be increased up to model/server limits; uses more memory)
--threads: CPU threads for non-GPU work

You should see log lines similar to:

llama_new_context_with_model: n_ctx = 8192
...
main: server is listening on 0.0.0.0:30000

Step 6
Test the API

curl -X POST http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron",
    "messages": [{"role": "user", "content": "New York is a great city because..."}],
    "max_tokens": 100
  }'

If you see curl: (7) Failed to connect, the server is still loading, the process exited (check the server log for OOM or path errors), or you are not curling the host that runs llama-server.

Example shape of the response (fields vary by llama.cpp version; message may include extra keys):

{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "New York is a great city because it's a living, breathing collage of cultures, ideas, and possibilities—all stacked into one vibrant, never‑sleeping metropolis. Here are just a few reasons that many people ("
      }
    }
  ],
  "created": 1765916539,
  "model": "nemotron-3-nano-omni-ga_v1.0-Q8_0.gguf",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 100,
    "prompt_tokens": 25,
    "total_tokens": 125
  },
  "id": "chatcmpl-...",
  "timings": {
    ...
  }
}

Step 7
Longer completion (with Nemotron 3 Nano Omni)

Try a slightly longer prompt to confirm stable generation with Nemotron 3 Nano Omni:

curl -X POST http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron",
    "messages": [{"role": "user", "content": "Solve this step by step: If a train travels 120 miles in 2 hours, what is its average speed?"}],
    "max_tokens": 500
  }'

Step 8
Cleanup

Stop the server with Ctrl+C in the terminal where it is running.

To remove this tutorial’s artifacts:

rm -rf ~/llama.cpp
rm -rf ~/models/NVIDIA-Nemotron-3-Nano-Omni

Deactivate the Python venv if you no longer need hf:

deactivate

Step 9
Next steps

Context length: Increase --ctx-size for longer chats (watch memory; 1M-token class contexts are possible only when the build, model, and hardware allow).
Other models: Point --model at any compatible GGUF; the llama.cpp server API stays the same.
Integrations: Point Open WebUI, Continue.dev, or custom clients at http://<spark-host>:30000/v1 using the OpenAI client pattern.

The server implements the usual OpenAI-style chat features your llama.cpp build enables (including streaming and tool-related flows where supported).

Run models with llama.cpp on DGX Spark

Step 1
Verify prerequisites

Step 2
Clone the llama.cpp repository

Step 3
Build llama.cpp with CUDA

Step 4
Download example Nemotron 3 Nano Omni GGUF

Step 5
Start llama-server with Nemotron 3 Nano Omni

Step 6
Test the API

Step 7
Longer completion (with Nemotron 3 Nano Omni)

Step 8
Cleanup

Step 9
Next steps

Resources

Run models with llama.cpp on DGX Spark

Step 1
Verify prerequisites

Step 2
Clone the llama.cpp repository

Step 3
Build llama.cpp with CUDA

Step 4
Download example Nemotron 3 Nano Omni GGUF

Step 5
Start llama-server with Nemotron 3 Nano Omni

Step 6
Test the API

Step 7
Longer completion (with Nemotron 3 Nano Omni)

Step 8
Cleanup

Step 9
Next steps

Resources

Run models with llama.cpp on DGX Spark

Step 1Verify prerequisites

Step 2Clone the llama.cpp repository

Step 3Build llama.cpp with CUDA

Step 4Download example Nemotron 3 Nano Omni GGUF

Step 5Start llama-server with Nemotron 3 Nano Omni

Step 6Test the API

Step 7Longer completion (with Nemotron 3 Nano Omni)

Step 8Cleanup

Step 9Next steps

Resources

Run models with llama.cpp on DGX Spark

Step 1Verify prerequisites

Step 2Clone the llama.cpp repository

Step 3Build llama.cpp with CUDA

Step 4Download example Nemotron 3 Nano Omni GGUF

Step 5Start llama-server with Nemotron 3 Nano Omni

Step 6Test the API

Step 7Longer completion (with Nemotron 3 Nano Omni)

Step 8Cleanup

Step 9Next steps

Resources

Step 1
Verify prerequisites

Step 2
Clone the llama.cpp repository

Step 3
Build llama.cpp with CUDA

Step 4
Download example Nemotron 3 Nano Omni GGUF

Step 5
Start llama-server with Nemotron 3 Nano Omni

Step 6
Test the API

Step 7
Longer completion (with Nemotron 3 Nano Omni)

Step 8
Cleanup

Step 9
Next steps

Step 1
Verify prerequisites

Step 2
Clone the llama.cpp repository

Step 3
Build llama.cpp with CUDA

Step 4
Download example Nemotron 3 Nano Omni GGUF

Step 5
Start llama-server with Nemotron 3 Nano Omni

Step 6
Test the API

Step 7
Longer completion (with Nemotron 3 Nano Omni)

Step 8
Cleanup

Step 9
Next steps