Speculative Decoding
Learn how to set up speculative decoding for fast inference on Spark
Configure Docker permissions
To manage containers without sudo, your user must be in the docker group. If you skip this step, you will need to prefix every Docker command with sudo.
Open a new terminal and test Docker access by running:
docker ps
If you see a permission-denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you don't need to run the command with sudo:
sudo usermod -aG docker $USER
newgrp docker
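To confirm the change took effect, rerun the check; it should now print a container table (possibly empty) instead of a permission error:
# Should list running containers (or an empty table) without sudo
docker ps
# Optionally confirm that "docker" now appears in your group list
id -nG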
Set Environment Variables
Set the Hugging Face token as an environment variable for the downstream services:
export HF_TOKEN=<your_huggingface_token>
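Before launching the container, it can help to confirm the token is actually set in this shell. The check below is a minimal sketch; the whoami-v2 call is an optional Hugging Face Hub API request used here only for illustration:
# Fail fast if the token was not exported in this shell
[ -n "$HF_TOKEN" ] || echo "HF_TOKEN is not set"
# Optional: verify the token against the Hugging Face Hub API
curl -s -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami-v2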
Run Speculative Decoding Methods
Option 1: EAGLE-3
Run EAGLE-3 Speculative Decoding by executing the following command:
docker run \
-e HF_TOKEN=$HF_TOKEN \
-v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
--rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
--gpus=all --ipc=host --network host \
nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 \
bash -c '
hf download openai/gpt-oss-120b && \
hf download nvidia/gpt-oss-120b-Eagle3-long-context \
--local-dir /opt/gpt-oss-120b-Eagle3/ && \
cat > /tmp/extra-llm-api-config.yml <<EOF
enable_attention_dp: false
disable_overlap_scheduler: false
enable_autotuner: false
cuda_graph_config:
  max_batch_size: 1
speculative_config:
  decoding_type: Eagle
  max_draft_len: 5
  speculative_model_dir: /opt/gpt-oss-120b-Eagle3/
kv_cache_config:
  free_gpu_memory_fraction: 0.9
  enable_block_reuse: false
EOF
export TIKTOKEN_ENCODINGS_BASE="/tmp/harmony-reqs" && \
mkdir -p $TIKTOKEN_ENCODINGS_BASE && \
wget -P $TIKTOKEN_ENCODINGS_BASE https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken && \
wget -P $TIKTOKEN_ENCODINGS_BASE https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken && \
trtllm-serve openai/gpt-oss-120b \
--backend pytorch --tp_size 1 \
--max_batch_size 1 \
--extra_llm_api_options /tmp/extra-llm-api-config.yml'
Once the server is running, test it by making an API call from another terminal:
# Test completion endpoint
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-oss-120b",
"prompt": "Solve the following problem step by step. If a train travels 180 km in 3 hours, and then slows down by 20% for the next 2 hours, what is the total distance traveled? Show all intermediate calculations and provide a final numeric answer.",
"max_tokens": 300,
"temperature": 0.7
}'
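Because trtllm-serve exposes an OpenAI-compatible API, you can also exercise the chat completions route. The request below is a sketch that assumes the server from the command above is still listening on port 8000:
# Test chat completions endpoint
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [
      {"role": "user", "content": "In two sentences, explain how speculative decoding speeds up inference."}
    ],
    "max_tokens": 200
  }'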
Key Features of EAGLE-3 Speculative Decoding
- Simpler deployment — Instead of managing a separate draft model, EAGLE-3 uses a built-in drafting head that generates speculative tokens internally.
- Better accuracy — By fusing features from multiple layers of the model, draft tokens are more likely to be accepted, reducing wasted computation.
- Faster generation — Multiple tokens are verified in parallel per forward pass, cutting down the latency of autoregressive inference (a quick way to check this is sketched below).
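To see the effect of the drafting head in practice, you can compare end-to-end timings with and without the speculative_config block in /tmp/extra-llm-api-config.yml. The snippet below is only a rough check; it assumes the server is still running on port 8000, that the response includes an OpenAI-style usage object, and that jq is installed:
# Time one request and inspect token counts; rerun after restarting the
# server without speculative_config to compare wall-clock time
time curl -s -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-120b", "prompt": "List the first ten prime numbers and briefly explain why each is prime.", "max_tokens": 256, "temperature": 0}' \
  | jq '.usage'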
Option 2: Draft Target
Execute the following command to set up and run draft-target speculative decoding:
docker run \
-e HF_TOKEN=$HF_TOKEN \
-v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
--rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
--gpus=all --ipc=host --network host nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 \
bash -c "
# Download models
hf download nvidia/Llama-3.3-70B-Instruct-FP4 && \
hf download nvidia/Llama-3.1-8B-Instruct-FP4 \
--local-dir /opt/Llama-3.1-8B-Instruct-FP4/ && \
# Create configuration file
cat <<EOF > extra-llm-api-config.yml
print_iter_log: false
disable_overlap_scheduler: true
speculative_config:
  decoding_type: DraftTarget
  max_draft_len: 4
  speculative_model_dir: /opt/Llama-3.1-8B-Instruct-FP4/
kv_cache_config:
  enable_block_reuse: false
EOF
# Start TensorRT-LLM server
trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP4 \
--backend pytorch --tp_size 1 \
--max_batch_size 1 \
--kv_cache_free_gpu_memory_fraction 0.9 \
--extra_llm_api_options ./extra-llm-api-config.yml
"
Once the server is running, test it by making an API call from another terminal:
# Test completion endpoint
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/Llama-3.3-70B-Instruct-FP4",
"prompt": "Explain the benefits of speculative decoding:",
"max_tokens": 150,
"temperature": 0.7
}'
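For interactive use, you can also request a streamed response. Streaming is part of the OpenAI-compatible completions API; the example below follows that convention and assumes the draft-target server above is running:
# Stream tokens as they are generated (server-sent events); -N disables curl buffering
curl -N -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Llama-3.3-70B-Instruct-FP4",
    "prompt": "Explain the benefits of speculative decoding:",
    "max_tokens": 150,
    "temperature": 0.7,
    "stream": true
  }'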
Key features of draft-target:
- Efficient resource usage: 8B draft model accelerates 70B target model
- Flexible configuration: Adjustable draft token length for optimization
- Memory efficient: Uses FP4 quantized models for reduced memory footprint (see the memory check below)
- Compatible models: Uses Llama family models with consistent tokenization
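Since the FP4 target and draft models share GPU memory with the KV cache, it can be useful to watch memory headroom while the server loads. A simple way on the host, assuming nvidia-smi is available, is:
# Watch GPU memory usage while both models load (Ctrl+C to stop)
watch -n 2 nvidia-smi --query-gpu=memory.used,memory.total --format=csv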
Cleanup
Stop the Docker container when finished:
# Find and stop the container
docker ps
docker stop <container_id>
# Optional: Clean up downloaded models from cache
# rm -rf $HOME/.cache/huggingface/hub/models--*gpt-oss*
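If you no longer need the TensorRT-LLM image itself, you can also remove it to reclaim disk space:
# Optional: remove the TensorRT-LLM container image
docker rmi nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6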
Next Steps
- Experiment with different max_draft_len values (1, 2, 3, 4, 8); see the sketch after this list
- Monitor token acceptance rates and throughput improvements
- Test with different prompt lengths and generation parameters
- Read more on Speculative Decoding here.
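As a starting point for comparing max_draft_len settings, the loop below sends the same request a few times and prints each end-to-end latency using curl's built-in timing. It is only a sketch: it assumes the EAGLE-3 server from Option 1 is running on port 8000, and each new max_draft_len value still requires editing the YAML and restarting trtllm-serve.
# Report end-to-end latency for the currently running configuration;
# rerun after restarting trtllm-serve with a different max_draft_len
for i in 1 2 3 4 5; do
  curl -s -o /dev/null -w "request $i: %{time_total}s\n" \
    -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "openai/gpt-oss-120b", "prompt": "Write a short poem about trains.", "max_tokens": 128, "temperature": 0}'
done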