Copyright © 2026 NVIDIA Corporation

Run models with llama.cpp on DGX Spark

30 MIN

Build llama.cpp with CUDA and serve models via an OpenAI-compatible API

Tags: DGX Spark, Inference, LLM, llama.cpp

Step 1
Install the dependencies

Install the required dependencies:

sudo apt install -y git clang cmake libcurl4-openssl-dev libssl-dev
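
Before building, it can help to confirm that the build tools are on your PATH. The following is a minimal sketch, not part of the official playbook: the helper name missing_tools is hypothetical, and it checks only the executables (git, clang, cmake), not the development libraries installed above.

```shell
# Hypothetical helper: print each named command that is NOT found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Report any missing build tools from Step 1; prints nothing if all are present.
missing_tools git clang cmake
```

If the command prints any names, rerun the apt install line above and check its output for errors.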

Step 2
Clone the llama.cpp repository

From your home directory, clone the upstream llama.cpp repository, the project you will build:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

Step 3
Build llama.cpp with CUDA

Configure CMake with CUDA and GB10’s sm_121 architecture so GGML’s CUDA backend matches your GPU:

cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON -DGGML_CURL=ON -DGGML_RPC=ON -DCMAKE_CUDA_ARCHITECTURES=121a-real
cmake --build build --config Release -j

The build typically takes 5–10 minutes. When it finishes, binaries such as llama-server are available under build/bin/.
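
A quick way to confirm the build produced what the next step needs is to check for the binary. This is a sketch under assumptions: the helper name have_binaries is hypothetical, and it only tests that the files exist and are executable, not that they run correctly.

```shell
# Hypothetical helper: succeed only if every named binary exists and is
# executable inside the given directory.
have_binaries() {
  bin_dir="$1"; shift
  for name in "$@"; do
    [ -x "$bin_dir/$name" ] || { echo "missing: $bin_dir/$name" >&2; return 1; }
  done
}

# Run from the llama.cpp checkout after the build completes.
have_binaries build/bin llama-server \
  && echo "build looks complete" \
  || echo "build incomplete"
```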

Step 4
Start llama-server with a model

llama.cpp loads models in GGUF format. This playbook uses the UD-Q4_K_XL quantization of unsloth/Qwen3.6-35B-A3B-MTP-GGUF, which offers a good balance of quality and speed on DGX Spark.

From your llama.cpp/build directory, launch the OpenAI-compatible server with GPU offload. The server first downloads the model from Hugging Face if it has not been downloaded before or if an update is available.

All models are saved in the default Hugging Face cache directory, ~/.cache/huggingface/hub. For instance, this model is saved to ~/.cache/huggingface/hub/models--unsloth--Qwen3.6-35B-A3B-MTP-GGUF.
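
The cache folder name follows from the repository id: each "/" becomes "--" and the result is prefixed with "models--". The sketch below reproduces the example path above and reports its disk usage. The helper name hf_cache_dir is hypothetical, and the sketch assumes the default cache location with no cache-related environment overrides.

```shell
# Hypothetical helper: map a Hugging Face repo id ("org/name") to its
# folder under the hub cache (second argument overrides the cache root).
hf_cache_dir() {
  repo="$1"
  cache_root="${2:-$HOME/.cache/huggingface/hub}"
  # Replace every "/" in the repo id with "--" and prefix with "models--".
  printf '%s/models--%s\n' "$cache_root" "$(printf '%s' "$repo" | sed 's|/|--|g')"
}

dir="$(hf_cache_dir unsloth/Qwen3.6-35B-A3B-MTP-GGUF)"
echo "$dir"
# Show how much disk space the cached model uses, if it has been downloaded.
[ -d "$dir" ] && du -sh "$dir" || echo "not downloaded yet"
```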

The server also loads the mmproj file automatically to enable vision capabilities when the model supports them. By default, llama-server tries to fit the model's full context while serving 4 concurrent requests, and it adjusts these parameters automatically if needed.

./bin/llama-server \
  -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
  --host 0.0.0.0 \
  --port 30000

To run with MTP speculative decoding, pass the additional parameters shown in the example below. MTP requires a compatible model, such as the unsloth/Qwen3.6-35B-A3B-MTP-GGUF model used here. The example also sets the “preserve_thinking” flag, which lets Qwen models use “interleaved thinking” by keeping all prior thinking blocks in the history; this can be useful for agentic workflows.

./bin/llama-server \
  -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
  --host 0.0.0.0 \
  --port 30000 \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 

Parameters (short):

  • --host / --port: bind address and port for the HTTP API
  • --chat-template-kwargs: sets additional parameters for the JSON template parser; the value must be a valid JSON object string
  • --spec-type: comma-separated list of speculative decoding types to use (default: none). Most MTP-compatible models use “draft-mtp”, but check the model card first
  • --spec-draft-n-max: number of tokens to draft for speculative decoding (default: 3)

You should see log lines similar to:

0.14.322.968 I srv    load_model: speculative decoding context initialized
0.14.322.970 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 262144
0.14.322.972 I slot   load_model: id  1 | task -1 | new slot, n_ctx = 262144
0.14.322.972 I slot   load_model: id  2 | task -1 | new slot, n_ctx = 262144
0.14.322.973 I slot   load_model: id  3 | task -1 | new slot, n_ctx = 262144
0.14.323.063 I srv    load_model: prompt cache is enabled, size limit: 8192 MiB

...
0.14.342.935 I srv  llama_server: model loaded
0.14.342.939 I srv  llama_server: server is listening on http://0.0.0.0:30000
0.14.342.944 I srv  update_slots: all slots are idle

Keep this terminal open while testing. Large GGUF files can take a minute or more to load, and the initial download can take considerably longer if the model is not yet cached. A progress bar is shown while the model downloads.

The server accepts connections on port 30000 only after the “server is listening” message appears (see Troubleshooting if curl reports connection refused).
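
Rather than watching the log, a script can poll until the server answers. The sketch below is a generic retry loop; the helper name wait_until is hypothetical. The commented usage assumes your llama-server build exposes the /health endpoint, which returns an error status while the model is still loading; check the server documentation for your build if unsure.

```shell
# Hypothetical helper: retry a command until it succeeds.
# Usage: wait_until <max_tries> <delay_seconds> <command...>
wait_until() {
  tries="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$tries" ]; do
    # Succeed as soon as the probe command exits with status 0.
    "$@" >/dev/null 2>&1 && return 0
    i=$((i + 1))
    # Pause between attempts, but not after the final one.
    [ "$i" -le "$tries" ] && sleep "$delay"
  done
  return 1
}

# Example (assumes the server from this step is starting on port 30000):
# poll /health every 5 seconds, for up to 10 minutes.
#   wait_until 120 5 curl -sf http://127.0.0.1:30000/health && echo "server ready"
```

Passing the probe as a command keeps the loop reusable: the same helper works with a different port, host, or readiness check.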

Step 5
Test the API

Use a second terminal on the same machine that runs llama-server (for example, another SSH session into DGX Spark). If you run curl on your laptop while the server runs on Spark, replace 127.0.0.1 with the Spark hostname or IP address.

curl -X POST http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL",
    "messages": [{"role": "user", "content": "New York is a great city because..."}],
    "max_tokens": 100
  }'

If you see curl: (7) Failed to connect, then either the server is still loading, the process has exited (check the server log for OOM or path errors), or the request is not reaching the host that runs llama-server.

Example shape of the response (fields vary by llama.cpp version; message may include extra keys):

{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "New York is a great city because it's a living, breathing collage of cultures, ideas, and possibilities—all stacked into one vibrant, never‑sleeping metropolis. Here are just a few reasons that many people ("
      }
    }
  ],
  "created": 1765916539,
  "model": "$MODEL_PATH",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 100,
    "prompt_tokens": 25,
    "total_tokens": 125
  },
  "id": "chatcmpl-...",
  "timings": {
    ...
  }
}

Step 6
Longer completion (with Qwen3.6-35B-A3B)

Try a slightly longer prompt to confirm stable generation with Qwen3.6-35B-A3B:

curl -X POST http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL",
    "messages": [{"role": "user", "content": "Solve this step by step: If a train travels 120 miles in 2 hours, what is its average speed?"}],
    "max_tokens": 500
  }'

Step 7
Cleanup

Stop the server with Ctrl+C in the terminal where it is running.

To remove this tutorial’s artifacts:

rm -rf ~/llama.cpp
rm -rf ~/.cache/huggingface/hub/models--unsloth--Qwen3.6-35B-A3B-MTP-GGUF

Step 8
Next steps

  1. Context length: By default, llama.cpp tries to allocate the maximum context size the model supports, but you can set it manually with --ctx-size (or -c). Agentic and coding workloads need at least 32768 tokens, and preferably 100000 or more.
  2. Other models: Use --model to load any compatible GGUF that you have downloaded locally; the llama.cpp server API stays the same. Use -hf to let llama.cpp manage downloads and updates automatically. If you use --model with a multimodal model, you must also pass the path to the mmproj file with --mmproj. With -hf, the mmproj file is loaded automatically.
  3. Integrations: Point Open WebUI, Continue.dev, or custom clients at http://<spark-host>:30000/v1 using the OpenAI client pattern.
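
For custom clients, the integration item above usually comes down to setting a base URL. The sketch below builds that URL and exports the environment variables that the official OpenAI SDKs read. The helper name api_base_url and the SPARK_HOST variable are hypothetical, and the dummy API key assumes the server was started without --api-key.

```shell
# Hypothetical helper: print the OpenAI-compatible base URL for a host/port.
api_base_url() {
  printf 'http://%s:%s/v1\n' "$1" "${2:-30000}"
}

# SPARK_HOST is a placeholder; set it to your Spark's hostname or IP address.
SPARK_HOST="${SPARK_HOST:-127.0.0.1}"

# Environment variables read by the official OpenAI SDKs.
export OPENAI_BASE_URL="$(api_base_url "$SPARK_HOST")"
export OPENAI_API_KEY="not-needed"   # placeholder; assumes no --api-key on the server

echo "$OPENAI_BASE_URL"
# List the models the server reports (requires the server to be running):
#   curl -s "$OPENAI_BASE_URL/models"
```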

The server supports the usual OpenAI-style chat features that your llama.cpp build enables, including streaming and tool-related flows where supported.
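
As an illustration of streaming, the sketch below sends the same kind of chat request with "stream": true, so the reply arrives as a series of "data: {...}" events instead of one JSON object. The helper names stream_payload and stream_chat are hypothetical, and the payload builder does no JSON escaping, so it assumes the prompt contains no double quotes or backslashes.

```shell
MODEL="unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL"

# Hypothetical helper: build a chat request body with streaming enabled.
# No JSON escaping is done, so the prompt must not contain double quotes
# or backslashes.
stream_payload() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"max_tokens":%s,"stream":true}\n' \
    "$MODEL" "$1" "${2:-100}"
}

# Hypothetical helper: send the request. -N disables curl's output buffering
# so each event is printed as soon as it arrives.
# Arguments: <prompt> [max_tokens] [host]
stream_chat() {
  curl -sN "http://${3:-127.0.0.1}:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(stream_payload "$1" "${2:-100}")"
}

# Example (requires the running server from Step 4):
#   stream_chat "Name three uses for a GPU." 64
```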

Resources

  • llama.cpp GitHub Repository
  • DGX Spark Documentation
  • DGX Spark Forum
  • DGX Spark User Performance Guide