Run on both Spark A and Spark B to add your user to the docker group (so Docker can be used without sudo):
sudo usermod -aG docker $USER
newgrp docker
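To confirm the group change took effect in the current session (after newgrp docker or a re-login), list your group memberships; docker should appear in the output:

```shell
# List the current session's group memberships; "docker" should be present
id -nG
```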
Follow the network setup instructions from the Connect Two Sparks playbook.
NOTE
Complete Steps 1-3 from the Connect Two Sparks playbook before proceeding. The connected interfaces should be enp1s0f0np0 and enP2p1s0f0np0, or enp1s0f1np1 and enP2p1s0f1np1. If your UP interfaces are different, substitute your interface names in the commands below.
For this playbook, we will use the following IP addresses:
Spark A (Node 1):
enp1s0f1np1: 192.168.200.12/24
enP2p1s0f1np1: 192.168.200.14/24
Spark B (Node 2):
enp1s0f1np1: 192.168.200.13/24
enP2p1s0f1np1: 192.168.200.15/24
After completing the Connect Two Sparks setup, return here to continue with the TRT-LLM container setup.
Run on both Spark A and Spark B:
export TRTLLM_MN_CONTAINER=trtllm-multinode
Run on both Spark A and Spark B:
docker run -d --rm \
--name $TRTLLM_MN_CONTAINER \
--gpus '"device=all"' \
--network host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--device /dev/infiniband:/dev/infiniband \
-e UCX_NET_DEVICES="enp1s0f1np1,enP2p1s0f1np1" \
-e NCCL_SOCKET_IFNAME="enp1s0f1np1,enP2p1s0f1np1" \
-e OMPI_MCA_btl_tcp_if_include="enp1s0f1np1,enP2p1s0f1np1" \
-e OMPI_MCA_orte_default_hostfile="/etc/openmpi-hostfile" \
-e OMPI_MCA_rmaps_ppr_n_pernode="1" \
-e OMPI_ALLOW_RUN_AS_ROOT="1" \
-e OMPI_ALLOW_RUN_AS_ROOT_CONFIRM="1" \
-e CPATH="/usr/local/cuda/include" \
-e TRITON_PTXAS_PATH="/usr/local/cuda/bin/ptxas" \
-v ~/.cache/huggingface/:/root/.cache/huggingface/ \
-v ~/.ssh:/tmp/.ssh:ro \
nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc12 \
bash -c "curl https://raw.githubusercontent.com/NVIDIA/dgx-spark-playbooks/refs/heads/main/nvidia/trt-llm/assets/trtllm-mn-entrypoint.sh | bash"
Verify:
docker logs -f $TRTLLM_MN_CONTAINER
Expected output at the end:
total 56K
drwx------ 2 root root 4.0K Jan 13 05:13 .
drwx------ 1 root root 4.0K Jan 13 05:12 ..
-rw------- 1 root root 100 Jan 13 05:13 authorized_keys
-rw------- 1 root root 45 Jan 13 05:13 config
-rw------- 1 root root 411 Jan 13 05:13 id_ed25519
-rw-r--r-- 1 root root 102 Jan 13 05:13 id_ed25519.pub
-rw------- 1 root root 411 Jan 13 05:13 id_ed25519_shared
-rw-r--r-- 1 root root 100 Jan 13 05:13 id_ed25519_shared.pub
-rw------- 1 root root 3.4K Jan 13 05:13 id_rsa
-rw-r--r-- 1 root root 743 Jan 13 05:13 id_rsa.pub
-rw------- 1 root root 5.0K Jan 13 05:13 known_hosts
-rw------- 1 root root 3.2K Jan 13 05:13 known_hosts.old
Starting SSH
The hostfile tells MPI which nodes participate in distributed execution. Use the IPs from the enp1s0f1np1 interface configured in Step 2.
On both Spark A and Spark B, create the hostfile:
cat > ~/openmpi-hostfile <<EOF
192.168.200.12
192.168.200.13
EOF
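As a variation, the hostfile can be generated from shell variables so the addresses live in one place. SPARK_A_IP and SPARK_B_IP are hypothetical names holding the addresses from the note above; substitute yours:

```shell
# Hypothetical variables holding the Step 2 addresses
SPARK_A_IP=192.168.200.12
SPARK_B_IP=192.168.200.13

# Write one node address per line, as mpirun expects, then show the result
printf '%s\n' "$SPARK_A_IP" "$SPARK_B_IP" > ~/openmpi-hostfile
cat ~/openmpi-hostfile
```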
Run on both Spark A and Spark B to copy the hostfile into each container:
docker cp ~/openmpi-hostfile $TRTLLM_MN_CONTAINER:/etc/openmpi-hostfile
Verify connectivity:
docker exec -it $TRTLLM_MN_CONTAINER bash -c "mpirun -np 2 hostname"
Expected output:
nvidia@spark-afe0:~$ docker exec -it $TRTLLM_MN_CONTAINER bash -c "mpirun -np 2 hostname"
Warning: Permanently added '[192.168.200.13]:2233' (ED25519) to the list of known hosts.
spark-afe0
spark-ae11
nvidia@spark-afe0:~$
Eagle3 speculative decoding accelerates inference by predicting multiple tokens ahead, then validating them in parallel. This can provide significant speedup compared to standard autoregressive generation.
export HF_TOKEN=your_huggingface_token_here
docker exec \
-e HF_TOKEN=$HF_TOKEN \
-it $TRTLLM_MN_CONTAINER bash -c "
mpirun -x HF_TOKEN -np 2 bash -c 'hf download nvidia/Qwen3-235B-A22B-Eagle3 --local-dir /opt/Qwen3-235B-A22B-Eagle3/'
"
This configuration enables Eagle speculative decoding with 3 draft tokens and conservative memory settings.
docker exec -it $TRTLLM_MN_CONTAINER bash -c "cat > /tmp/extra-llm-api-config.yml <<EOF
enable_attention_dp: false
disable_overlap_scheduler: false
enable_autotuner: false
enable_chunked_prefill: false
cuda_graph_config:
  max_batch_size: 1
speculative_config:
  decoding_type: Eagle
  max_draft_len: 3
  speculative_model_dir: /opt/Qwen3-235B-A22B-Eagle3/
kv_cache_config:
  free_gpu_memory_fraction: 0.9
  enable_block_reuse: false
EOF
"
Run on Spark A only. This starts the TensorRT-LLM API server using the FP4 base model with Eagle3 speculative decoding enabled. The mpirun command coordinates execution across both nodes, so it only needs to be launched from Spark A. The maximum token length is set to 1024 (adjust as needed).
docker exec \
-e MODEL="nvidia/Qwen3-235B-A22B-FP4" \
-e HF_TOKEN=$HF_TOKEN \
-it $TRTLLM_MN_CONTAINER bash -c '
mpirun -x CPATH=/usr/local/cuda/include \
-x TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas \
-x HF_TOKEN \
trtllm-llmapi-launch \
trtllm-serve \
$MODEL \
--backend pytorch \
--tp_size 2 \
--max_num_tokens 1024 \
--extra_llm_api_options /tmp/extra-llm-api-config.yml \
--port 8355 --host 0.0.0.0
'
Expected output when the endpoint is ready:
[01/13/2026-06:16:56] [TRT-LLM] [I] get signal from executor worker
INFO: Started server process [2011]
INFO: Waiting for application startup.
INFO: Application startup complete.
Run on Spark A only. The server is listening on Spark A, so test the endpoint from there:
curl -s http://localhost:8355/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/Qwen3-235B-A22B-FP4",
"messages": [{"role": "user", "content": "Paris is great because"}],
"max_tokens": 64
}'
Expected: A JSON response with generated text. This confirms the multi-node TensorRT-LLM server with Eagle3 speculative decoding is working correctly.
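If jq is installed on Spark A (an assumption; it is not part of this playbook), you can extract just the generated text from the same response:

```shell
# Same request as above, but print only the assistant's reply
curl -s http://localhost:8355/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Qwen3-235B-A22B-FP4",
    "messages": [{"role": "user", "content": "Paris is great because"}],
    "max_tokens": 64
  }' | jq -r '.choices[0].message.content'
```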
Run on both Spark A and Spark B:
docker stop $TRTLLM_MN_CONTAINER
The containers will be automatically removed due to the --rm flag.
If you need to free up disk space:
Run on both Spark A and Spark B:
rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3*
This removes the model files (several hundred GB). Skip this if you plan to run the setup again.
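To see how much space the cached weights occupy (before deleting, or to confirm they are gone afterwards), du can report the sizes; the paths match the cleanup command above:

```shell
# Report per-model cache sizes, or a note if nothing is cached
du -sh "$HOME"/.cache/huggingface/hub/models--nvidia--Qwen3* 2>/dev/null \
  || echo "no cached Qwen3 models found"
```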
Now that you have Eagle3 speculative decoding running, consider these optimizations and experiments:
Adjust max_draft_len in the configuration (try values between 2 and 5) to balance speculation speed against accuracy.
Tune max_batch_size in cuda_graph_config to trade throughput against latency.
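For example, a more aggressive variant of the configuration from the earlier step might change just these keys (illustrative values, not validated on this hardware; keep the remaining keys as before):

```yaml
# Illustrative tweaks to /tmp/extra-llm-api-config.yml
cuda_graph_config:
  max_batch_size: 4        # larger batches: more throughput, higher per-request latency
speculative_config:
  decoding_type: Eagle
  max_draft_len: 5         # more draft tokens: faster when acceptance rate is high
  speculative_model_dir: /opt/Qwen3-235B-A22B-Eagle3/
```

Restart the trtllm-serve command on Spark A after editing the file for the changes to take effect.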