Speculative decoding speeds up text generation by using a small, fast model to draft several tokens ahead, then having the larger model verify or correct them in a single forward pass. Because the large model checks a batch of drafted tokens at once instead of predicting every token step by step, latency drops while output quality is preserved.
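To make the loop concrete, here is a minimal, framework-agnostic sketch of greedy speculative decoding. It is illustrative only: `draft_next`, `target_next`, and `num_draft` are hypothetical stand-ins, not the TensorRT-LLM API, and the target model's verification is shown as a loop for clarity even though it runs as one batched forward pass in practice.

```python
# Toy sketch of one greedy speculative-decoding step (illustrative, not
# the TensorRT-LLM implementation). `draft_next` and `target_next` each
# map a token sequence to that model's argmax next token.

def speculative_step(tokens, draft_next, target_next, num_draft=4):
    # 1. Draft phase: the small model proposes `num_draft` tokens autoregressively.
    drafted, ctx = [], list(tokens)
    for _ in range(num_draft):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2. Verify phase: the large model checks the drafted tokens and accepts
    #    the longest prefix it agrees with (in practice a single batched
    #    forward pass over all drafted positions, not a Python loop).
    accepted, ctx = [], list(tokens)
    for tok in drafted:
        expected = target_next(ctx)
        if expected != tok:
            accepted.append(expected)  # first mismatch: keep the target's token, drop the rest
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # Every draft was accepted, so the target's own next token comes for free.
    accepted.append(target_next(ctx))
    return accepted
```

Under greedy decoding this produces exactly the tokens the large model would have generated on its own; the speedup comes from accepting several of them per large-model pass. EAGLE-3 refines the drafting side with a lightweight draft head that reuses the target model's hidden states, while Draft-Target uses a separate small model, as in the sketch above.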
You'll explore speculative decoding with TensorRT-LLM on NVIDIA DGX Spark through two approaches: EAGLE-3 and Draft-Target. These examples demonstrate how to accelerate large language model inference while maintaining output quality.
A single DGX Spark has 128 GB of unified memory shared between the CPU and GPU. This is sufficient to run models like GPT-OSS-120B with EAGLE-3 or Llama-3.3-70B with Draft-Target, as shown in the Instructions tab.
Larger models like Qwen3-235B-A22B exceed what a single Spark can hold in memory: even with FP4 quantization, the model weights, KV cache, and EAGLE-3 draft head together require more than 128 GB. By connecting two Sparks, you double the available memory to 256 GB, making it possible to serve these larger models.
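A rough back-of-envelope check illustrates the gap. The figures below are assumptions for illustration only (roughly 0.5 bytes per parameter at FP4, plus a flat allowance for KV cache, draft head, and runtime overhead), not measured numbers:

```python
# Back-of-envelope serving-memory estimate at FP4 (illustrative assumptions,
# not measurements): ~0.5 bytes per parameter for weights, plus a flat
# allowance for KV cache, draft head, and runtime overhead.

def estimated_gib(params_billions, bytes_per_param=0.5, overhead_gib=30):
    weights_gib = params_billions * 1e9 * bytes_per_param / 2**30
    return weights_gib + overhead_gib

print(f"GPT-OSS-120B:    ~{estimated_gib(120):.0f} GiB")  # well under 128 GB -> one Spark
print(f"Qwen3-235B-A22B: ~{estimated_gib(235):.0f} GiB")  # over 128 GB -> two Sparks
```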
The Run on Two Sparks tab walks through this setup. The two Sparks are connected via QSFP cable and use tensor parallelism (TP=2) to split the model — each Spark holds half of every layer's weight matrices and computes its portion of each forward pass. The nodes communicate intermediate results over the high-bandwidth link using NCCL and OpenMPI, so the model operates as a single logical instance across both devices.
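As a toy illustration of the column-parallel split behind TP=2 (array slices stand in for the two Sparks; in the real system the halves live on separate devices and the gather happens over NCCL):

```python
import numpy as np

# Toy illustration of tensor parallelism (TP=2) for one linear layer.
# Each "Spark" is simulated by an array slice; in practice the halves live
# on separate devices and the concatenation is an NCCL all-gather.

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8))     # activations, replicated on both nodes
W = rng.standard_normal((8, 16))    # full weight matrix of the layer

W0, W1 = W[:, :8], W[:, 8:]         # column-parallel split: half per Spark

y0 = x @ W0                         # computed on Spark 0
y1 = x @ W1                         # computed on Spark 1

y = np.concatenate([y0, y1], axis=1)  # "all-gather" over the QSFP link
assert np.allclose(y, x @ W)          # identical to the single-device result
```

Row-parallel layers work analogously, with an all-reduce instead of an all-gather; either way, each node stores and computes only half of every layer.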
In short: two Sparks let you run models that are too large for one, while speculative decoding (EAGLE-3) on top further accelerates inference by drafting and verifying multiple tokens in parallel.
Before you begin, you'll need:

- An NVIDIA DGX Spark device with sufficient GPU memory available
- Docker with GPU support enabled; verify that containers can see the GPU with:

  ```bash
  docker run --gpus all nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc12 nvidia-smi
  ```

- An active Hugging Face token for model access
- Network connectivity for model downloads