Speculative Decoding

30 MIN

Learn how to set up speculative decoding for fast inference on Spark


Basic idea

Speculative decoding speeds up text generation by using a small, fast model to draft several tokens ahead, then having the larger model verify or correct them. Because the large model scores the whole draft in a single forward pass rather than predicting each token step by step, latency drops while output quality is preserved.
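
To make the loop concrete, here is a toy Python sketch of one draft-and-verify round under greedy decoding. The two callables are stand-ins for real models; this illustrates the idea only and is not the TensorRT-LLM implementation.

    # Illustrative only: one draft-then-verify round under greedy decoding.
    # `draft_next` and `target_next` stand in for real models; each maps a
    # token list to the next token that model would emit.

    def speculative_step(target_next, draft_next, prefix, k=4):
        # 1. Draft: the small model proposes k tokens beyond the prefix.
        ctx = list(prefix)
        draft = []
        for _ in range(k):
            tok = draft_next(ctx)
            draft.append(tok)
            ctx.append(tok)

        # 2. Verify: the target model checks each drafted position in order.
        #    (A real engine scores all k positions in one batched forward pass,
        #    which is where the speedup comes from.)
        ctx = list(prefix)
        accepted = []
        for tok in draft:
            want = target_next(ctx)
            accepted.append(want)   # the target's choice always wins
            if want != tok:         # first mismatch ends the round
                break
            ctx.append(tok)
        return list(prefix) + accepted

    # Trivial stand-ins: the draft agrees with the target except when the
    # context length is a multiple of 3.
    target = lambda ctx: (len(ctx) * 7) % 10
    draft = lambda ctx: (len(ctx) * 7) % 10 if len(ctx) % 3 else 0
    print(speculative_step(target, draft, [4, 2]))

Each round advances the sequence by at least one token (the target's own choice at the first mismatch), so the output matches what the target model would produce on its own; the saving is that accepted draft tokens cost the target one batched pass instead of one pass apiece.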

What you'll accomplish

You'll explore speculative decoding with TensorRT-LLM on NVIDIA DGX Spark using the traditional Draft-Target approach. These examples demonstrate how to accelerate large language model inference while maintaining output quality.
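
As a rough sketch of what this looks like through the TensorRT-LLM LLM API, the snippet below pairs a large target model with a small draft model. The DraftTargetDecodingConfig class, its field names, and both model paths are assumptions based on recent TensorRT-LLM releases; verify them against the version shipped in the Spark container.

    # Sketch under assumptions: DraftTargetDecodingConfig, its fields, and
    # the model paths below should be checked against your installed
    # TensorRT-LLM version.
    from tensorrt_llm import LLM, SamplingParams
    from tensorrt_llm.llmapi import DraftTargetDecodingConfig

    spec_config = DraftTargetDecodingConfig(
        max_draft_len=4,                        # tokens drafted per round
        speculative_model_dir="/models/draft",  # small, fast draft model (placeholder path)
    )

    llm = LLM(
        model="/models/target",                 # large target model (placeholder path)
        speculative_config=spec_config,
    )

    outputs = llm.generate(
        ["Speculative decoding works by"],
        SamplingParams(max_tokens=64),
    )
    print(outputs[0].outputs[0].text)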

What to know before starting

  • Experience with Docker and containerized applications
  • Understanding of speculative decoding concepts
  • Familiarity with TensorRT-LLM serving and API endpoints
  • Knowledge of GPU memory management for large language models

Prerequisites

  • NVIDIA Spark device with sufficient GPU memory available

  • Docker with GPU support enabled

    # Quick check that Docker containers can access the GPU
    docker run --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi
    
  • HuggingFace authentication configured (if needed for model downloads)

    huggingface-cli login
    
  • Network connectivity for model downloads (an optional pre-fetch sketch follows this list)
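
Model downloads dominate the setup time, so you can optionally pre-fetch weights with huggingface_hub before starting the container. The two model IDs below are examples only; any pair works for Draft-Target as long as the draft and target models share a tokenizer.

    # Optional pre-fetch so serving doesn't block on the network.
    # Model IDs are examples only; a draft-target pair must share a tokenizer.
    from huggingface_hub import snapshot_download

    for repo in (
        "meta-llama/Llama-3.1-8B-Instruct",  # example target (large)
        "meta-llama/Llama-3.2-1B-Instruct",  # example draft (small)
    ):
        path = snapshot_download(repo)
        print(f"{repo} cached at {path}")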

Time & risk

  • Duration: 10-20 minutes for setup, additional time for model downloads (varies by network speed)
  • Risks: GPU memory exhaustion with large models, container registry access issues, network timeouts during downloads
  • Rollback: Stop the Docker containers and optionally clean up the downloaded model cache (a cleanup sketch follows this list).
  • Last Updated: 10/12/2025
    • First publication
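
For the cache-cleanup half of the rollback, stopping the serving container is just docker stop; the sketch below then reclaims disk space through huggingface_hub's cache API. Note that it deletes every cached revision, so filter the repo list if you want to keep other models.

    # Rollback helper: reclaim disk space from the Hugging Face model cache.
    # WARNING: this deletes EVERY cached revision; filter `cache.repos` to
    # be selective.
    from huggingface_hub import scan_cache_dir

    cache = scan_cache_dir()
    revisions = [
        rev.commit_hash
        for repo in cache.repos
        for rev in repo.revisions
    ]
    strategy = cache.delete_revisions(*revisions)
    print(f"Freeing {strategy.expected_freed_size_str}")
    strategy.execute()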

Resources

  • DGX Spark Documentation
  • DGX Spark Forum