Speculative Decoding

30 MIN

Learn how to set up speculative decoding for fast inference on Spark

DGX Spark
View on GitHub

Step 1
Configure Docker Permissions

Run on both Spark A and Spark B:

sudo usermod -aG docker $USER
newgrp docker

Step 2
Network Setup

Follow the network setup instructions from the Connect Two Sparks playbook.

NOTE

Complete Steps 1-3 from the Connect Two Sparks playbook before proceeding:

  • Step 1: Ensure same username on both systems
  • Step 2: Physical hardware connection (QSFP cable)
  • Step 3: Network interface configuration
    • Use Option 2: Manual IP Assignment with the netplan configure file
    • Each Spark has two pairs of network ports. When you physically connect a cable between two Sparks, the connected ports will show as Up. You can use whichever pair is Up — either enp1s0f0np0 and enP2p1s0f0np0, or enp1s0f1np1 and enP2p1s0f1np1
    • This playbook assumes you are using enp1s0f1np1 and enP2p1s0f1np1. If your Up interfaces are different, substitute your interface names in the commands below

For this playbook, we will use the following IP addresses:

Spark A (Node 1):

  • enp1s0f1np1: 192.168.200.12/24
  • enP2p1s0f1np1: 192.168.200.14/24

Spark B (Node 2):

  • enp1s0f1np1: 192.168.200.13/24
  • enP2p1s0f1np1: 192.168.200.15/24
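For Option 2 (manual IP assignment), the netplan stanza on Spark A might look like the sketch below. The file name is hypothetical and the exact schema should follow the Connect Two Sparks playbook; use Spark B's addresses (192.168.200.13 and 192.168.200.15) in the corresponding file on the other node, and substitute your interface names if different ports show as Up.

```yaml
# Hypothetical /etc/netplan/99-spark-link.yaml on Spark A.
# Apply with: sudo netplan apply
network:
  version: 2
  ethernets:
    enp1s0f1np1:
      addresses: [192.168.200.12/24]
    enP2p1s0f1np1:
      addresses: [192.168.200.14/24]
```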

After completing the Connect Two Sparks setup, return here to continue with the TRT-LLM container setup.

Step 3
Set Container Name Variable

Run on both Spark A and Spark B:

export TRTLLM_MN_CONTAINER=trtllm-multinode
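Every later step references $TRTLLM_MN_CONTAINER, so if you expect to open new terminal sessions during setup, you may want the variable to persist. One option, assuming a bash login shell:

```shell
# Persist the container-name variable for future shells (bash assumed)
echo 'export TRTLLM_MN_CONTAINER=trtllm-multinode' >> ~/.bashrc
```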

Step 4
Start the TRT-LLM Multi-Node Container

Run on both Spark A and Spark B:

docker run -d --rm \
  --name $TRTLLM_MN_CONTAINER \
  --gpus '"device=all"' \
  --network host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --device /dev/infiniband:/dev/infiniband \
  -e UCX_NET_DEVICES="enp1s0f1np1,enP2p1s0f1np1" \
  -e NCCL_SOCKET_IFNAME="enp1s0f1np1,enP2p1s0f1np1" \
  -e OMPI_MCA_btl_tcp_if_include="enp1s0f1np1,enP2p1s0f1np1" \
  -e OMPI_MCA_orte_default_hostfile="/etc/openmpi-hostfile" \
  -e OMPI_MCA_rmaps_ppr_n_pernode="1" \
  -e OMPI_ALLOW_RUN_AS_ROOT="1" \
  -e OMPI_ALLOW_RUN_AS_ROOT_CONFIRM="1" \
  -e CPATH="/usr/local/cuda/include" \
  -e TRITON_PTXAS_PATH="/usr/local/cuda/bin/ptxas" \
  -v ~/.cache/huggingface/:/root/.cache/huggingface/ \
  -v ~/.ssh:/tmp/.ssh:ro \
  nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc12 \
  bash -c "curl https://raw.githubusercontent.com/NVIDIA/dgx-spark-playbooks/refs/heads/main/nvidia/trt-llm/assets/trtllm-mn-entrypoint.sh | bash"

Verify:

docker logs -f $TRTLLM_MN_CONTAINER

Expected output at the end:

total 56K
drwx------ 2 root root 4.0K Jan 13 05:13 .
drwx------ 1 root root 4.0K Jan 13 05:12 ..
-rw------- 1 root root  100 Jan 13 05:13 authorized_keys
-rw------- 1 root root   45 Jan 13 05:13 config
-rw------- 1 root root  411 Jan 13 05:13 id_ed25519
-rw-r--r-- 1 root root  102 Jan 13 05:13 id_ed25519.pub
-rw------- 1 root root  411 Jan 13 05:13 id_ed25519_shared
-rw-r--r-- 1 root root  100 Jan 13 05:13 id_ed25519_shared.pub
-rw------- 1 root root 3.4K Jan 13 05:13 id_rsa
-rw-r--r-- 1 root root  743 Jan 13 05:13 id_rsa.pub
-rw------- 1 root root 5.0K Jan 13 05:13 known_hosts
-rw------- 1 root root 3.2K Jan 13 05:13 known_hosts.old
Starting SSH

Step 5
Configure OpenMPI Hostfile

The hostfile tells MPI which nodes participate in distributed execution. Use the IPs from the enp1s0f1np1 interface configured in Step 2.

On both Spark A and Spark B, create the hostfile:

cat > ~/openmpi-hostfile <<EOF
192.168.200.12
192.168.200.13
EOF

Run on both Spark A and Spark B to copy the hostfile into each container:

docker cp ~/openmpi-hostfile $TRTLLM_MN_CONTAINER:/etc/openmpi-hostfile

Verify connectivity:

docker exec -it $TRTLLM_MN_CONTAINER bash -c "mpirun -np 2 hostname"

Expected output:

Warning: Permanently added '[192.168.200.13]:2233' (ED25519) to the list of known hosts.
spark-afe0
spark-ae11

Step 6
Launch Eagle3 Speculative Decoding

Eagle3 speculative decoding accelerates inference by predicting multiple tokens ahead, then validating them in parallel. This can provide significant speedup compared to standard autoregressive generation.

Set your Hugging Face token

export HF_TOKEN=your_huggingface_token_here

Download the Eagle3 speculative model on both nodes

docker exec \
  -e HF_TOKEN=$HF_TOKEN \
  -it $TRTLLM_MN_CONTAINER bash -c "
    mpirun -x HF_TOKEN -np 2 bash -c 'hf download nvidia/Qwen3-235B-A22B-Eagle3 --local-dir /opt/Qwen3-235B-A22B-Eagle3/'
"
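The Qwen3-235B base model and Eagle3 draft checkpoints together take substantial disk space, so it is worth confirming free space on both nodes before (or while) downloading. A minimal check, assuming the Hugging Face cache lives in its default location under $HOME:

```shell
# Show free space on the filesystem holding the HF cache
df -h "$HOME"
```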

Create the Eagle3 speculative decoding configuration

This configuration enables Eagle speculative decoding with 3 draft tokens and conservative memory settings.

docker exec -it $TRTLLM_MN_CONTAINER bash -c "cat > /tmp/extra-llm-api-config.yml <<EOF
enable_attention_dp: false
disable_overlap_scheduler: false
enable_autotuner: false
enable_chunked_prefill: false
cuda_graph_config:
    max_batch_size: 1
speculative_config:
    decoding_type: Eagle
    max_draft_len: 3
    speculative_model_dir: /opt/Qwen3-235B-A22B-Eagle3/
kv_cache_config:
    free_gpu_memory_fraction: 0.9
    enable_block_reuse: false
EOF
"

Launch the server with Eagle3 speculative decoding

Run on Spark A only. This starts the TensorRT-LLM API server using the FP4 base model with Eagle3 speculative decoding enabled. The mpirun command coordinates execution across both nodes, so it only needs to be launched from Spark A. The maximum token length is set to 1024 (adjust as needed).

docker exec \
  -e MODEL="nvidia/Qwen3-235B-A22B-FP4" \
  -e HF_TOKEN=$HF_TOKEN \
  -it $TRTLLM_MN_CONTAINER bash -c '
    mpirun -x CPATH=/usr/local/cuda/include \
           -x TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas \
           -x HF_TOKEN \
           trtllm-llmapi-launch \
           trtllm-serve \
           $MODEL \
           --backend pytorch \
           --tp_size 2 \
           --max_num_tokens 1024 \
           --extra_llm_api_options /tmp/extra-llm-api-config.yml \
           --port 8355 --host 0.0.0.0
'

Expected output when the endpoint is ready:

[01/13/2026-06:16:56] [TRT-LLM] [I] get signal from executor worker
INFO:     Started server process [2011]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Step 7
Validate the API

Run on Spark A only. The server is listening on Spark A, so test the endpoint from there:

curl -s http://localhost:8355/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Qwen3-235B-A22B-FP4",
    "messages": [{"role": "user", "content": "Paris is great because"}],
    "max_tokens": 64
  }'

Expected: A JSON response with generated text. This confirms the multi-node TensorRT-LLM server with Eagle3 speculative decoding is working correctly.
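To read just the generated text rather than the full JSON body, you can pipe the response through a small parser. A sketch assuming python3 is available on the host; the sample RESPONSE below is an abbreviated, hypothetical body in the OpenAI chat-completions shape, not real server output:

```shell
# Abbreviated, hypothetical response (real responses have more fields)
RESPONSE='{"choices":[{"message":{"content":"Paris is great because of its museums."}}]}'

# Pull out only the assistant message text
echo "$RESPONSE" | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
```

In practice, capture the output of the curl command above into RESPONSE with command substitution and pipe it the same way.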

Step 8
Cleanup

Stop the containers

Run on both Spark A and Spark B:

docker stop $TRTLLM_MN_CONTAINER

The containers will be automatically removed due to the --rm flag.

(Optional) Remove downloaded models

If you need to free up disk space:

Run on both Spark A and Spark B:

rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3*
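If you want to see how much space the cached checkpoints occupy before deleting them, a quick check (the path assumes the default Hugging Face cache layout used above):

```shell
# Report the size of any cached Qwen3 model directories, if present
du -sh "$HOME"/.cache/huggingface/hub/models--nvidia--Qwen3* 2>/dev/null \
  || echo "no cached Qwen3 models found"
```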

This removes the model files (several hundred GB). Skip this if you plan to run the setup again.

Step 9
Next Steps

Now that you have Eagle3 speculative decoding running, consider these optimizations and experiments:

  • Adjust draft length: Modify max_draft_len in the configuration (try values from 2 to 5) to balance speculation speed vs. accuracy
  • Try different models: Experiment with other model pairs that support Eagle speculative decoding
  • Optimize batch size: Adjust max_batch_size in cuda_graph_config for throughput-latency tradeoffs
  • Learn more: Review the TensorRT-LLM Speculative Decoding documentation for advanced tuning options
  • Benchmark performance: Compare inference speeds with and without speculative decoding to measure speedup gains
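For the benchmarking suggestion above, a rough wall-clock sketch: time N identical requests against the server (endpoint and model name as configured in Step 6), once with speculative_config present in the YAML and once with it removed, then compare the averages. This only approximates latency; a proper benchmark would use a dedicated load-generation tool.

```shell
# Rough average-latency measurement over N identical requests.
# Assumes the server from Step 6 is listening on localhost:8355.
N=5
start=$(date +%s%N)
for i in $(seq 1 $N); do
  curl -s http://localhost:8355/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"nvidia/Qwen3-235B-A22B-FP4","messages":[{"role":"user","content":"Explain speculative decoding in one sentence."}],"max_tokens":128}' \
    > /dev/null
done
end=$(date +%s%N)
echo "avg latency: $(( (end - start) / N / 1000000 )) ms"
```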

Resources

  • Speculative Decoding
  • DGX Spark Documentation
  • DGX Spark Forum
  • DGX Spark User Performance Guide

Copyright © 2026 NVIDIA Corporation