TRT LLM for Inference

Estimated time: 1 hour

Install and use TensorRT-LLM on DGX Spark

DGX Spark
View on GitHub

Contents: Overview · Single Spark · Run on two Sparks · Open WebUI for TensorRT-LLM · Troubleshooting

Step 1
Configure network connectivity

Follow the network setup instructions from the Connect two Sparks playbook to establish connectivity between your DGX Spark nodes.

This includes:

  • Physical QSFP cable connection
  • Network interface configuration (automatic or manual IP assignment)
  • Passwordless SSH setup
  • Network connectivity verification

Step 2
Configure Docker permissions

To manage containers without sudo, your user must be in the docker group. If you skip this step, prefix every Docker command with sudo.

Open a new terminal and test Docker access. In the terminal, run:

docker ps

If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you no longer need sudo:

sudo usermod -aG docker $USER
newgrp docker

Repeat this step on both nodes.
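If you are unsure whether the change took effect, a quick membership check avoids re-running usermod unnecessarily. A minimal sketch (the message strings are arbitrary); it makes no changes and is safe on any node:

```shell
# Check whether the current user already belongs to the docker group.
user=$(id -un)
if id -nG "$user" | tr ' ' '\n' | grep -qx docker; then
  echo "docker group: ok"
else
  echo "docker group: missing (run usermod, then re-login or newgrp docker)"
fi
```

Remember that group changes only apply to new sessions, which is why the playbook uses newgrp docker above.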

Step 3
Create OpenMPI hostfile

Create a hostfile with the IP addresses of both nodes for MPI operations. On each node, get the IP address of your network interface:

ip a show enp1s0f0np0

Or if you're using the second interface:

ip a show enp1s0f1np1

Look for the inet line to find the IP address (e.g., 192.168.1.10/24).
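If you script this step, the address can be extracted directly instead of read by eye. A minimal sketch, shown here against a canned inet line; on a real node, pipe in the output of ip a show instead:

```shell
# Extract the IPv4 address (without the /24 prefix) from an `ip a show` line.
# The sample line below is canned; on a real node, pipe in:
#   ip a show enp1s0f0np0
line="    inet 192.168.1.10/24 brd 192.168.1.255 scope global enp1s0f0np0"
addr=$(printf '%s\n' "$line" | awk '/inet /{split($2, a, "/"); print a[1]; exit}')
echo "$addr"
```

The trailing space in the /inet / pattern keeps inet6 (IPv6) lines from matching.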

On your primary node, create the hostfile ~/openmpi-hostfile with the collected IPs:

cat > ~/openmpi-hostfile <<EOF
192.168.1.10
192.168.1.11
EOF

Replace the IP addresses with your actual node IPs.

Step 4
Start containers on both nodes

On each node (primary and worker), run the following command to start the TRT-LLM container:

  docker run -d --rm \
  --name trtllm-multinode \
  --gpus '"device=all"' \
  --network host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --device /dev/infiniband:/dev/infiniband \
  -e UCX_NET_DEVICES="enp1s0f0np0,enp1s0f1np1" \
  -e NCCL_SOCKET_IFNAME="enp1s0f0np0,enp1s0f1np1" \
  -e OMPI_MCA_btl_tcp_if_include="enp1s0f0np0,enp1s0f1np1" \
  -e OMPI_MCA_orte_default_hostfile="/etc/openmpi-hostfile" \
  -e OMPI_MCA_rmaps_ppr_n_pernode="1" \
  -e OMPI_ALLOW_RUN_AS_ROOT="1" \
  -e OMPI_ALLOW_RUN_AS_ROOT_CONFIRM="1" \
  -e CPATH=/usr/local/cuda/include \
  -e TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas \
  -v ~/.cache/huggingface/:/root/.cache/huggingface/ \
  -v ~/.ssh:/tmp/.ssh:ro \
  nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc5 \
  sh -c "curl https://raw.githubusercontent.com/NVIDIA/dgx-spark-playbooks/refs/heads/main/nvidia/trt-llm/assets/trtllm-mn-entrypoint.sh | sh"

NOTE

Make sure to run this command on both the primary and worker nodes.

Step 5
Verify containers are running

On each node, verify the container is running:

docker ps

You should see output similar to:

CONTAINER ID   IMAGE                                                 COMMAND                  CREATED          STATUS          PORTS     NAMES
abc123def456   nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc5         "sh -c 'curl https:…"    10 seconds ago   Up 8 seconds              trtllm-multinode
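The check above can also be scripted, for example from a launch wrapper. A sketch assuming the container name used in this playbook; it degrades gracefully when Docker is not reachable:

```shell
# Report whether the trtllm-multinode container is up, without failing
# the script if Docker is unavailable on this machine.
name=trtllm-multinode
status=$(docker ps --filter "name=$name" --format '{{.Status}}' 2>/dev/null) || status=""
if [ -n "$status" ]; then
  result="running: $status"
else
  result="not running"
fi
echo "$result"
```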

Step 6
Copy hostfile to primary container

On your primary node, copy the OpenMPI hostfile into the container:

docker cp ~/openmpi-hostfile trtllm-multinode:/etc/openmpi-hostfile

Step 7
Save container reference

On your primary node, save the container name in a variable for convenience:

export TRTLLM_MN_CONTAINER=trtllm-multinode

Step 8
Generate configuration file

On your primary node, generate the configuration file inside the container:

docker exec $TRTLLM_MN_CONTAINER bash -c 'cat <<EOF > /tmp/extra-llm-api-config.yml
print_iter_log: false
kv_cache_config:
  dtype: "auto"
  free_gpu_memory_fraction: 0.9
cuda_graph_config:
  enable_padding: true
EOF'

Step 9
Download model

Download a model with the following command. You can replace nvidia/Qwen3-235B-A22B-FP4 with the model of your choice. Because the command runs under mpirun, the model is downloaded on both nodes.

# Need to specify huggingface token for model download.
export HF_TOKEN=<your-huggingface-token>

docker exec \
  -e MODEL="nvidia/Qwen3-235B-A22B-FP4" \
  -e HF_TOKEN=$HF_TOKEN \
  -it $TRTLLM_MN_CONTAINER bash -c 'mpirun -x HF_TOKEN bash -c "hf download $MODEL"'
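After the download, the weights land in the cache directory that is mounted into the container. Assuming the default Hugging Face hub layout (slashes in the repo id become --), the local path can be computed like this:

```shell
# Compute the local cache directory for a Hugging Face model, following
# the default hub layout: ~/.cache/huggingface/hub/models--ORG--NAME
MODEL="nvidia/Qwen3-235B-A22B-FP4"
cache_name="models--$(printf '%s' "$MODEL" | sed 's#/#--#g')"
echo "$HOME/.cache/huggingface/hub/$cache_name"
```

This is the same directory the cleanup step later removes with a glob.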

Step 10
Serve the model

On your primary node, start the TensorRT-LLM server:

docker exec \
  -e MODEL="nvidia/Qwen3-235B-A22B-FP4" \
  -e HF_TOKEN=$HF_TOKEN \
  -it $TRTLLM_MN_CONTAINER bash -c '
    mpirun -x HF_TOKEN trtllm-llmapi-launch trtllm-serve $MODEL \
      --tp_size 2 \
      --backend pytorch \
      --max_num_tokens 32768 \
      --max_batch_size 4 \
      --extra_llm_api_options /tmp/extra-llm-api-config.yml \
      --port 8355'

This will start the TensorRT-LLM server on port 8355. You can then make inference requests to http://localhost:8355 using the OpenAI-compatible API format.

NOTE

You might see a warning such as UCX WARN network device 'enp1s0f0np0' is not available, please use one or more of. You can ignore this warning if inference succeeds; it indicates that only one of the two ConnectX-7 ports is in use while the other sits idle.

Expected output: Server startup logs and ready message.

Step 11
Validate API server

Once the server is running, you can test it with a CURL request. Run this on the primary node:

curl -s http://localhost:8355/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Qwen3-235B-A22B-FP4",
    "messages": [{"role": "user", "content": "Paris is great because"}],
    "max_tokens": 64
  }'

Expected output: JSON response with generated text completion.
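To script the validation, the assistant text can be pulled out of the JSON response. The sketch below uses python3 against a canned response for illustration; in practice, pipe the curl output in instead:

```shell
# Extract the generated text from an OpenAI-style chat-completion response.
# The response here is canned; replace it with the output of the curl command.
response='{"choices":[{"message":{"role":"assistant","content":"it has great food."}}]}'
text=$(printf '%s' "$response" | python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])')
echo "$text"
```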

Step 12
Cleanup and rollback

Stop the container on each node. Because it was started with --rm, stopping also removes it. SSH to each node and run:

docker stop trtllm-multinode

WARNING

This removes all inference data and performance reports. Copy any files you still need before cleaning up.

Remove downloaded models to free disk space on each node:

rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3*

Step 13
Next steps

You can now deploy other models on your DGX Spark cluster.

Resources

  • TensorRT-LLM Documentation
  • DGX Spark Documentation
  • DGX Spark Forum
  • DGX Spark User Performance Guide