TRT LLM for Inference

1 HR

Install and use TensorRT-LLM on DGX Spark

View on GitHub
Sections: Overview • Single Spark • Run on two Sparks • Open WebUI for TensorRT-LLM • Troubleshooting

Common issues for running on a single Spark

| Symptom | Cause | Fix |
|---|---|---|
| Cannot access gated repo for URL | Certain Hugging Face models have restricted access | Regenerate your Hugging Face token and request access to the gated model in your web browser (see the pre-flight sketch after this table) |
| OOM during weight loading (e.g., Nemotron Super 49B) | Parallel weight-loading memory pressure | export TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=1 |
| "CUDA out of memory" | GPU VRAM insufficient for the model | Reduce free_gpu_memory_fraction below 0.9, reduce the batch size, or use a smaller model |
| "Model not found" error | HF_TOKEN invalid or model inaccessible | Verify the token and model permissions |
| Container pull timeout | Network connectivity issues | Retry the pull or use a local mirror |
| Import tensorrt_llm fails | Container runtime issues | Restart the Docker daemon and retry |
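
Several of these fixes are environment settings you can apply before launching anything. Below is a minimal pre-flight sketch, assuming a bash shell; the token value is a placeholder, and the Hugging Face whoami-v2 call is just one convenient way to confirm the token is valid, not a step the playbook itself prescribes.

# Pre-flight checks before running TensorRT-LLM on a single Spark.
# Replace the placeholder with your own Hugging Face token.
export HF_TOKEN=<your_huggingface_token>

# Sanity-check the token (prints account info on success; if it fails, regenerate
# the token and request access to the gated model in your browser).
curl -s -H "Authorization: Bearer ${HF_TOKEN}" https://huggingface.co/api/whoami-v2

# Work around OOM during parallel weight loading of large models (e.g., Nemotron Super 49B).
export TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=1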

Common issues for running on two Sparks

| Symptom | Cause | Fix |
|---|---|---|
| MPI hostname test returns a single hostname | Network connectivity issues | Verify both nodes are on reachable IP addresses |
| "Permission denied" on Hugging Face download | Invalid or missing HF_TOKEN | Set a valid token: export HF_TOKEN=<TOKEN> |
| Cannot access gated repo for URL | Certain Hugging Face models have restricted access | Regenerate your Hugging Face token and request access to the gated model in your web browser |
| "CUDA out of memory" errors | Insufficient GPU memory | Reduce --max_batch_size or --max_num_tokens |
| Container exits immediately | Missing entrypoint script | Ensure the trtllm-mn-entrypoint.sh download succeeded and the script is executable, and make sure the container is not already running on the node; if port 2233 is already in use, the entrypoint script will not start |
| Error response from daemon: error while validating Root CA Certificate | System clock out of sync or expired certificates | Sync the system time with an NTP server: sudo timedatectl set-ntp true |
| "invalid mount config for type 'bind'" | Missing or non-executable entrypoint script | Run docker inspect <container_id> to see the full error message, then verify trtllm-mn-entrypoint.sh exists in your home directory on both nodes (ls -la $HOME/trtllm-mn-entrypoint.sh) and is executable (chmod +x $HOME/trtllm-mn-entrypoint.sh) |
| "task: non-zero exit (255)" | Container exited with error code 255 | Run docker ps -a --filter "name=trtllm-multinode_trtllm" to get the container ID, then docker logs <container_id> for the detailed error (see the diagnostics sketch after this table) |
| Docker state stuck in "Pending" with "no suitable node (insufficien...)" | Docker daemon not properly configured for GPU access | Verify steps 2-4 were completed successfully and check that /etc/docker/daemon.json contains the correct GPU configuration |
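
Most of the two-Spark fixes above boil down to quick checks you can run on each node. The following diagnostic sketch strings them together, assuming the entrypoint script lives in your home directory and the Docker stack is named trtllm-multinode, as in the table:

# Run on BOTH nodes.

# The entrypoint script must exist and be executable (covers "invalid mount config
# for type 'bind'" and containers that exit immediately).
ls -la "$HOME/trtllm-mn-entrypoint.sh"
chmod +x "$HOME/trtllm-mn-entrypoint.sh"

# Keep the system clock in sync to avoid Root CA certificate validation errors.
sudo timedatectl set-ntp true

# For "task: non-zero exit (255)", pull the container logs to see the real error.
CONTAINER_ID=$(docker ps -a --filter "name=trtllm-multinode_trtllm" --format '{{.ID}}' | head -n 1)
[ -n "$CONTAINER_ID" ] && docker logs "$CONTAINER_ID"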

NOTE

DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. Because many applications are still being updated to take advantage of UMA, you may encounter memory issues even when your workload is within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
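
To see how much memory the flush actually releases, bracket it with a standard memory check (free is generic Linux tooling, not something specific to this playbook):

free -h    # memory and buff/cache usage before the flush
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
free -h    # buff/cache should drop after the flush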

Resources

  • TensorRT-LLM Documentation
  • DGX Spark Documentation
  • DGX Spark Forum
  • DGX Spark User Performance Guide