Speculative decoding speeds up text generation by using a small, fast draft model to propose several tokens ahead, then having the larger target model verify (and, where needed, correct) them in a single forward pass. Because the large model no longer has to predict every token one step at a time, latency drops while output quality is preserved.
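To make the control flow concrete, here is a minimal, framework-free sketch of the draft-and-verify loop. The `draft_next` and `target_next` callables are hypothetical stand-ins for real models, and this greedy-matching variant is a simplification: production implementations (including TensorRT-LLM's) verify all drafted positions in one batched forward pass and, when sampling, resample from a corrected distribution on rejection.

```python
from typing import Callable, List

def speculative_step(
    draft_next: Callable[[List[int]], int],   # hypothetical: greedy next token from the small model
    target_next: Callable[[List[int]], int],  # hypothetical: greedy next token from the large model
    context: List[int],
    k: int = 4,
) -> List[int]:
    """One speculative decoding step: draft k tokens, then verify them.

    Returns the tokens accepted this step (always at least one, so the
    loop is guaranteed to make progress).
    """
    # 1. Draft phase: the small model proposes k tokens autoregressively.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft_next(context + drafted))

    # 2. Verify phase: the large model checks each drafted position.
    #    (In a real engine this is a single batched forward pass,
    #    which is where the speedup comes from.)
    accepted: List[int] = []
    for tok in drafted:
        expected = target_next(context + accepted)
        if tok == expected:
            accepted.append(tok)       # draft agreed with the target model
        else:
            accepted.append(expected)  # take the target's token at the first mismatch and stop
            break
    else:
        # All k drafts accepted; the target's verification also yields one bonus token.
        accepted.append(target_next(context + accepted))
    return accepted
```

With greedy decoding this reproduces the target model's output exactly; how much time is saved depends on how often the draft model agrees with the target.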
You'll explore speculative decoding with TensorRT-LLM on NVIDIA Spark through two approaches: EAGLE-3 and Draft-Target. These examples demonstrate how to accelerate large language model inference while maintaining output quality.
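As a preview of what the EAGLE-3 path can look like in code, the sketch below uses TensorRT-LLM's Python LLM API. The class and field names (`EagleDecodingConfig`, `speculative_config`, `max_draft_len`, `speculative_model_dir`) and the checkpoints shown are assumptions based on recent TensorRT-LLM releases; verify them against the API shipped in the container you pull.

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import EagleDecodingConfig

# Assumed EAGLE-3 configuration; exact field names may differ across releases.
spec_config = EagleDecodingConfig(
    max_draft_len=4,  # tokens drafted per speculative step
    speculative_model_dir="yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",  # hypothetical draft checkpoint
    eagle3_one_model=True,
)

# Hypothetical target checkpoint; substitute the model you intend to serve.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config=spec_config,
)

outputs = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```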
Before you begin, you'll need:

- An NVIDIA Spark device with sufficient GPU memory available
- Docker with GPU support enabled; you can verify GPU access from a container with:

  ```bash
  docker run --gpus all nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 nvidia-smi
  ```

- An active HuggingFace token for model access (see the sketch after this list for one way to supply it)
- Network connectivity for model downloads
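For the HuggingFace token, one common setup (an assumption about your environment, not a requirement of TensorRT-LLM) is to export it as `HF_TOKEN` on the host, pass it into the container, and authenticate from Python with the `huggingface_hub` library:

```python
import os
from huggingface_hub import login

# Assumes the token was exported as HF_TOKEN before launching the container,
# e.g. via `docker run -e HF_TOKEN ...`.
login(token=os.environ["HF_TOKEN"])
```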