TRT LLM for Inference

1 HR

Install and use TensorRT-LLM on DGX Spark

Troubleshooting

Common issues when running on a single Spark

Symptom | Cause | Fix
Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your HuggingFace token and request access to the gated model in your web browser
OOM during weight loading (e.g., Nemotron Super 49B) | Parallel weight-loading memory pressure | export TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=1
"CUDA out of memory" | GPU VRAM insufficient for the model | Reduce free_gpu_memory_fraction (e.g., below 0.9) or the batch size, or use a smaller model
"Model not found" error | HF_TOKEN invalid or model inaccessible | Verify your token and model permissions
Container pull timeout | Network connectivity issues | Retry the pull or use a local mirror
import tensorrt_llm fails | Container runtime issues | Restart the Docker daemon and retry
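Several of the fixes above come down to environment setup. The sketch below is a hypothetical pre-flight helper (the check_env function is our own, not part of the playbook) that verifies HF_TOKEN is set and applies the parallel weight-loading workaround before launching the container:

```shell
# Hypothetical pre-flight check before starting the TRT-LLM container.
check_env() {
  # "Model not found" and "Permission denied" errors usually trace back
  # to a missing or invalid HuggingFace token.
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set"
    return 1
  fi
  # Work around parallel weight-loading memory pressure (see table above).
  export TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=1
  echo "ok"
}

check_env || echo "fix your environment before launching the container"
```

Running such a check before every launch makes failures like the OOM-during-weight-loading row reproducible rather than intermittent.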

Common issues when running on two Sparks

Symptom | Cause | Fix
MPI hostname test returns a single hostname | Network connectivity issues | Verify that both nodes are on reachable IP addresses
"Permission denied" on HuggingFace download | Invalid or missing HF_TOKEN | Set a valid token: export HF_TOKEN=<TOKEN>
Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your HuggingFace token and request access to the gated model in your web browser
"CUDA out of memory" errors | Insufficient GPU memory | Reduce --max_batch_size or --max_num_tokens
Container exits immediately | Missing entrypoint script | Ensure the trtllm-mn-entrypoint.sh download succeeded and the script is executable, and that the container is not already running on the node; if port 2233 is already in use, the entrypoint script will not start
Error response from daemon: error while validating Root CA Certificate | System clock out of sync or expired certificates | Sync the system time with an NTP server: sudo timedatectl set-ntp true
"invalid mount config for type 'bind'" | Missing or non-executable entrypoint script | Run docker inspect <container_id> to see the full error message; verify trtllm-mn-entrypoint.sh exists in your home directory on both nodes (ls -la $HOME/trtllm-mn-entrypoint.sh) and is executable (chmod +x $HOME/trtllm-mn-entrypoint.sh)
"task: non-zero exit (255)" | Container exited with error code 255 | Run docker ps -a --filter "name=trtllm-multinode_trtllm" to get the container ID, then docker logs <container_id> for detailed error messages
Docker state stuck in "Pending" with "no suitable node (insufficien...)" | Docker daemon not properly configured for GPU access | Verify that steps 2-4 completed successfully and that /etc/docker/daemon.json contains the correct GPU configuration
Serving the model fails with ptxas fatal errors | The model needs runtime Triton kernel compilation | In Step 10, add -x TRITON_PTXAS_PATH to your mpirun command
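Two of the rows above ("Container exits immediately" and "invalid mount config") reduce to the same precondition: the entrypoint script must exist and be executable on both nodes. A minimal sketch of that check (the check_entrypoint function is illustrative, not part of the playbook):

```shell
# Verify the multi-node entrypoint script exists and is executable,
# applying chmod +x as the table's fix suggests if it is not.
check_entrypoint() {
  script="$1"
  if [ ! -f "$script" ]; then
    echo "missing"
    return 1
  fi
  if [ ! -x "$script" ]; then
    chmod +x "$script"    # grant executable permission
  fi
  echo "executable"
}
```

Run it on each node, e.g. check_entrypoint "$HOME/trtllm-mn-entrypoint.sh", before starting the containers.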

NOTE

DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. Because many applications are still being updated to take advantage of UMA, you may encounter memory issues even when you are within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
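To confirm the flush actually freed memory, you can read MemAvailable from /proc/meminfo before and after. A minimal sketch, assuming a Linux /proc filesystem (the flush line is commented out because it requires root):

```shell
# Report available memory in kB, as the kernel sees it.
avail_kb() {
  awk '/^MemAvailable/ {print $2}' /proc/meminfo
}

echo "MemAvailable before flush: $(avail_kb) kB"
# Uncomment on DGX Spark to flush the buffer cache (requires root):
# sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```

A large jump in MemAvailable after the flush indicates the buffer cache, not the model, was consuming the capacity.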

Resources

  • TensorRT-LLM Documentation
  • DGX Spark Documentation
  • DGX Spark Forum
  • DGX Spark User Performance Guide

Copyright © 2026 NVIDIA Corporation