Run models with llama.cpp on DGX Spark

30 MIN

Build llama.cpp with CUDA and serve models via an OpenAI-compatible API (Gemma 3 27B IT as an example)

Tags: DGX Spark · Inference · LLM · llama.cpp
View llama.cpp on GitHub
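The build-and-serve flow this playbook covers can be sketched as follows. This is a hedged outline, not the playbook's exact instructions: the clone URL, model path placeholder, context size, and port are illustrative, and the `CMAKE_CUDA_ARCHITECTURES="121"` value assumes the DGX Spark GB10 GPU (compute capability 12.1) noted in the troubleshooting table below.

```shell
# Make the CUDA toolkit visible to CMake (assumes the default install path).
export PATH=/usr/local/cuda/bin:$PATH

# Fetch and build llama.cpp with CUDA enabled, targeting GB10 (sm_121).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121"
cmake --build build --config Release -j

# Serve a GGUF model over the OpenAI-compatible API (path and flags illustrative).
./build/bin/llama-server --model <path-to-model>.gguf \
  --n-gpu-layers 999 --ctx-size 8192 --port 30000
```

A high `--n-gpu-layers` value asks llama.cpp to offload as many layers as fit onto the GPU; the troubleshooting table below covers the common failure modes of each step.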
Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| `cmake` fails with "CUDA not found" | CUDA toolkit not in `PATH` | Run `export PATH=/usr/local/cuda/bin:$PATH` and re-run CMake from a clean build directory |
| Build errors mentioning the wrong GPU arch | `CMAKE_CUDA_ARCHITECTURES` does not match GB10 | Use `-DCMAKE_CUDA_ARCHITECTURES="121"` for DGX Spark GB10 as in the instructions |
| GGUF download fails or stalls | Network or Hugging Face availability | Re-run `hf download`; it resumes partial files |
| "CUDA out of memory" when starting `llama-server` | Model too large for the current context or VRAM | Lower `--ctx-size` (e.g. 4096) or use a smaller quantization from the same repo |
| Server runs but latency is high | Layers not on the GPU | Confirm `--n-gpu-layers` is high enough for your model; check `nvidia-smi` during a request |
| `curl: (7) Failed to connect` on port 30000 | No listener yet, wrong host, or a crash | Wait until the server logs that it is listening; run `curl` on the same host as `llama-server` (or use the Spark's IP); run `ss -tln` and confirm `:30000` is present; read the server stderr for OOM or a bad `--model` path |
| Chat API errors or empty replies | Wrong `--model` path or incompatible GGUF | Verify the path to the `.gguf` file; update llama.cpp if the GGUF requires a newer format |
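A quick way to distinguish connection problems from API problems is a minimal smoke test against the endpoint. This sketch assumes `llama-server` is already listening on `localhost:30000`; the `model` field value is illustrative, since llama-server serves whichever GGUF it was launched with.

```shell
# One-shot chat completion against the OpenAI-compatible endpoint.
# Requires a running llama-server; adjust host/port to your setup.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'
```

If this returns JSON with a `choices` array, the server and model are healthy and any remaining problem is on the client side.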

NOTE

DGX Spark uses Unified Memory Architecture (UMA), which allows flexible sharing between GPU and CPU memory. Some software is still catching up to UMA behavior. If you hit memory pressure unexpectedly, you can try flushing the page cache (use with care on shared systems):

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

For the latest platform issues, see the DGX Spark known issues documentation.
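To see how much unified memory is actually free before loading a model, or how much the cache flush above reclaimed, you can read `MemAvailable` from `/proc/meminfo`. This one-liner is a sketch assuming a Linux system (which DGX OS is); the output format is my own.

```shell
# Print available unified memory in GiB (MemAvailable is reported in kB).
awk '/MemAvailable/ {printf "%.1f GiB available\n", $2/1048576}' /proc/meminfo
```

Running it immediately before and after the `drop_caches` command shows how much page cache was released.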

Resources

  • llama.cpp GitHub Repository
  • DGX Spark Documentation
  • DGX Spark Forum
  • DGX Spark User Performance Guide

Copyright © 2026 NVIDIA Corporation