
Speculative Decoding

30 MIN

Learn how to set up speculative decoding for fast inference on Spark

DGX Spark
View on GitHub
Overview | Instructions | Run on Two Sparks | Troubleshooting

Basic idea

Speculative decoding speeds up text generation by using a small, fast model to draft several tokens ahead, then having the larger model quickly verify or adjust them. This way, the big model doesn't need to predict every token step-by-step, reducing latency while keeping output quality.
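
For example, if the draft model proposes four tokens, the target model can check all four in a single forward pass, keep the longest prefix that matches its own predictions, and resample from the first mismatch. Accepting even two or three drafted tokens per verification step meaningfully raises decoding throughput.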

What you'll accomplish

You'll explore speculative decoding with TensorRT-LLM on the NVIDIA DGX Spark using two approaches: EAGLE-3 and Draft-Target. These examples demonstrate how to accelerate large language model inference while maintaining output quality.

Why two Sparks?

A single DGX Spark has 128 GB of unified memory shared between the CPU and GPU. This is sufficient to run models like GPT-OSS-120B with EAGLE-3 or Llama-3.3-70B with Draft-Target, as shown in the Instructions tab.
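
As a rough sketch of what the Instructions tab walks through, TensorRT-LLM's trtllm-serve can enable EAGLE-3 through an extra LLM API options file. The paths and field values below are illustrative assumptions, not the playbook's exact commands:

    # Illustrative sketch only -- see the Instructions tab for the exact commands.
    # Field names follow TensorRT-LLM's speculative_config; paths are placeholders.
    cat > eagle.yaml <<'EOF'
    speculative_config:
      decoding_type: Eagle
      max_draft_len: 4
      speculative_model_dir: /models/eagle3-draft-head
    EOF

    # Serve the target model with the EAGLE-3 draft head attached
    trtllm-serve /models/gpt-oss-120b --extra_llm_api_options eagle.yaml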

Larger models like Qwen3-235B-A22B exceed what a single Spark can hold in memory: even with FP4 quantization, the model weights, KV cache, and EAGLE-3 draft head together require more than 128 GB. By connecting two Sparks, you double the available memory to 256 GB, making it possible to serve these larger models.
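
A rough back-of-the-envelope estimate makes the gap concrete (all numbers approximate):

    235B parameters x ~0.5 bytes/param (FP4)    ~ 118 GB for weights alone
    + KV cache for serving-length contexts      + tens of GB
    + EAGLE-3 draft head and runtime buffers    + several GB
    ----------------------------------------------------------------------
    total                                       > 128 GB on a single Spark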

The Run on Two Sparks tab walks through this setup. The two Sparks are connected via QSFP cable and use tensor parallelism (TP=2) to split the model — each Spark holds half of every layer's weight matrices and computes its portion of each forward pass. The nodes communicate intermediate results over the high-bandwidth link using NCCL and OpenMPI, so the model operates as a single logical instance across both devices.
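
The exact launch procedure lives in the Run on Two Sparks tab; purely as an illustration of the shape of a two-node launch, TensorRT-LLM's MPI launcher can start one rank per Spark with tensor parallelism set to 2 (hostnames and paths below are placeholders):

    # Illustrative only -- the Run on Two Sparks tab has the real commands.
    # One MPI rank per Spark; NCCL carries activations over the QSFP link.
    mpirun -np 2 --host spark-0,spark-1 \
      trtllm-llmapi-launch trtllm-serve /models/qwen3-235b-a22b \
        --tp_size 2 --extra_llm_api_options eagle.yaml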

In short: two Sparks let you run models that are too large for one, while speculative decoding (EAGLE-3) on top further accelerates inference by drafting and verifying multiple tokens in parallel.
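
However you launch it, trtllm-serve exposes an OpenAI-compatible HTTP API; a quick smoke test, assuming the default port 8000 and a placeholder model name:

    # Smoke-test the server via its OpenAI-compatible chat endpoint
    curl -s http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "qwen3-235b-a22b",
            "messages": [{"role": "user", "content": "Say hello in five words."}],
            "max_tokens": 64
          }'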

What to know before starting

  • Experience with Docker and containerized applications
  • Understanding of speculative decoding concepts
  • Familiarity with TensorRT-LLM serving and API endpoints
  • Knowledge of GPU memory management for large language models

Prerequisites

  • NVIDIA DGX Spark with sufficient GPU memory available

  • Docker with GPU support enabled

    # Verify that containers can see the GPU before pulling any models
    docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc12 nvidia-smi
    
  • Active Hugging Face token for model access (see the token sketch after this list)

  • Network connectivity for model downloads
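
For the Hugging Face token mentioned above, a minimal sketch is to export it on the host and pass it through to the container (the token value is a placeholder, and this assumes the image bundles huggingface_hub's CLI):

    # Placeholder token -- create your own at huggingface.co/settings/tokens
    export HF_TOKEN=hf_xxxxxxxxxxxxxxxx

    # Pass the token into the container so gated model downloads authenticate
    docker run --rm --gpus all -e HF_TOKEN \
      nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc12 \
      huggingface-cli whoami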

Time & risk

  • Duration: 10-20 minutes for setup, additional time for model downloads (varies by network speed)
  • Risks: GPU memory exhaustion with large models, container registry access issues, network timeouts during downloads
  • Rollback: Stop Docker containers and optionally clean up downloaded model cache.
  • Last Updated: 04/20/2026
    • Upgrade to latest container 1.3.0rc12
    • Add Speculative Decoding example with Qwen3-235B-A22B on Two Sparks

Resources

  • Speculative Decoding
  • DGX Spark Documentation
  • DGX Spark Forum
  • DGX Spark User Performance Guide