Speculative Decoding

30 MIN

Learn how to set up speculative decoding for fast inference on Spark

DGX Spark
View on GitHub
Troubleshooting
| Symptom | Cause | Fix |
| --- | --- | --- |
| "CUDA out of memory" error | Insufficient GPU memory | Reduce kv_cache_free_gpu_memory_fraction to 0.9, or use a device with more VRAM |
| Container fails to start | Docker GPU support issues | Verify that nvidia-docker is installed and the --gpus=all flag is supported |
| Model download fails | Network or authentication issues | Check HuggingFace authentication and network connectivity |
| Cannot access gated repo | Some HuggingFace models have restricted access | Regenerate your HuggingFace token and request access to the gated model in your browser |
| Server doesn't respond | Port conflicts or firewall | Check that port 8000 is available and not blocked |
| mpirun fails with SSH connection refused | SSH not configured between containers or nodes | Complete the SSH setup from the Connect Two Sparks playbook; verify that ssh <node_ip> works without a password from both nodes |
| mpirun hangs or times out connecting to a remote node | Hostfile IPs don't match actual node IPs | Verify that the IPs in /etc/openmpi-hostfile match the IPs assigned to the network interfaces (check with ip addr show) |
| NCCL error: "Socket operation on non-socket" | Wrong network interface specified | Check ibdev2netdev output and ensure NCCL_SOCKET_IFNAME and UCX_NET_DEVICES match the active interfaces (enp1s0f1np1,enP2p1s0f1np1) |
| Permission denied (publickey) during mpirun | SSH keys not exchanged between containers | Re-run the SSH setup from the Connect Two Sparks playbook, or manually verify that /root/.ssh/authorized_keys on each node contains the public keys from both nodes |
| Model download fails silently in multi-node setup | HF_TOKEN not propagated to mpirun | Add -e HF_TOKEN=$HF_TOKEN to the docker exec command and -x HF_TOKEN to the mpirun command |
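Several of the multi-node fixes above amount to propagating the right environment into both the container and mpirun. A minimal launch sketch combining them is shown below; the container name trtllm, the hostfile path, and the interface names are taken from the table or assumed, so substitute your own values:

```shell
# Sketch only: container name "trtllm" and the serve command placeholder
# are assumptions; hostfile path and interface names come from the table above.
# First verify passwordless SSH works from both nodes:
#   ssh <node_ip> hostname
docker exec -e HF_TOKEN=$HF_TOKEN trtllm \
  mpirun --hostfile /etc/openmpi-hostfile \
         -x HF_TOKEN \
         -x NCCL_SOCKET_IFNAME=enp1s0f1np1,enP2p1s0f1np1 \
         -x UCX_NET_DEVICES=enp1s0f1np1,enP2p1s0f1np1 \
         <trtllm-serve command>
```

Passing -x HF_TOKEN forwards the variable to the remote rank as well, which avoids the silent model-download failure listed in the last row.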

NOTE

DGX Spark uses a Unified Memory Architecture (UMA), which lets the GPU and CPU share memory dynamically. Because many applications are still being updated to take full advantage of UMA, you may hit memory errors even when your workload fits within DGX Spark's memory capacity. If that happens, manually flush the buffer cache with:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
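To confirm the flush actually reclaimed memory, you can compare the kernel's cache counters before and after running the command above. This read-only check reports values in kB; the Cached figure should drop sharply after the flush:

```shell
# Read-only check of reclaimable page-cache usage from /proc/meminfo.
# Run before and after the drop_caches command; "Cached" should shrink.
grep -E '^(MemFree|Buffers|Cached):' /proc/meminfo
```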

Resources

  • Speculative Decoding
  • DGX Spark Documentation
  • DGX Spark Forum
  • DGX Spark User Performance Guide

Copyright © 2026 NVIDIA Corporation