
TRT LLM for Inference

1 HR

Install and use TensorRT-LLM on DGX Spark

View on GitHub
Sections in this playbook:

  • Overview
  • Single Spark
  • Run on two Sparks
  • Open WebUI for TensorRT-LLM
  • Troubleshooting

Basic idea

NVIDIA TensorRT-LLM (TRT-LLM) is an open-source library for optimizing and accelerating large language model (LLM) inference on NVIDIA GPUs.

It provides highly efficient kernels, memory management, and parallelism strategies—like tensor, pipeline, and sequence parallelism—so developers can serve LLMs with lower latency and higher throughput.

TRT-LLM integrates with frameworks like Hugging Face and PyTorch, making it easier to deploy state-of-the-art models at scale.

What you'll accomplish

You'll set up TensorRT-LLM to optimize and deploy large language models on your DGX Spark, achieving significantly higher throughput and lower latency than standard PyTorch inference through kernel-level optimizations, efficient memory layouts, and advanced quantization.

What to know before starting

  • Python proficiency and experience with PyTorch or similar ML frameworks
  • Command-line comfort for running CLI tools and Docker containers
  • Basic understanding of GPU concepts including VRAM, batching, and quantization (FP16/INT8)
  • Familiarity with NVIDIA software stack (CUDA Toolkit, drivers)
  • Experience with inference servers and containerized environments

Prerequisites

  • DGX Spark device
  • NVIDIA drivers compatible with CUDA 12.x (verify with nvidia-smi)
  • Docker installed with GPU support configured (verify with docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 nvidia-smi)
  • Hugging Face account with an access token for model downloads (verify with echo $HF_TOKEN)
  • Sufficient GPU VRAM (40GB+ recommended for 70B models)
  • Internet connectivity for downloading models and container images
  • Network: open TCP ports 8355 (LLM) and 8356 (VLM) on host for OpenAI-compatible serving
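The commands below simply gather the verification steps from the prerequisites into one pre-flight check. The container tag and port numbers are taken from the list above; the port check with ss is an added convenience and assumes iproute2 is installed.

# Confirm the NVIDIA driver and CUDA 12.x stack are visible
nvidia-smi

# Confirm Docker can see the GPU from inside the TensorRT-LLM release container
docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 nvidia-smi

# Confirm a Hugging Face token is available for model downloads
echo $HF_TOKEN

# Confirm the serving ports are free (8355 for LLM, 8356 for VLM)
ss -ltn | grep -E ':(8355|8356)' || echo "ports 8355/8356 are free"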

Ancillary files

All required assets can be found in this playbook's GitHub repository (see View on GitHub above):

  • trtllm-mn-entrypoint.sh — container entrypoint script for multi-node setup

Model Support Matrix

The following models are supported with TensorRT-LLM on Spark. All listed models are available and ready to use:

Model | Quantization | Support Status | HF Handle
GPT-OSS-20B | MXFP4 | ✅ | openai/gpt-oss-20b
GPT-OSS-120B | MXFP4 | ✅ | openai/gpt-oss-120b
Llama-3.1-8B-Instruct | FP8 | ✅ | nvidia/Llama-3.1-8B-Instruct-FP8
Llama-3.1-8B-Instruct | NVFP4 | ✅ | nvidia/Llama-3.1-8B-Instruct-FP4
Llama-3.3-70B-Instruct | NVFP4 | ✅ | nvidia/Llama-3.3-70B-Instruct-FP4
Qwen3-8B | FP8 | ✅ | nvidia/Qwen3-8B-FP8
Qwen3-8B | NVFP4 | ✅ | nvidia/Qwen3-8B-FP4
Qwen3-14B | FP8 | ✅ | nvidia/Qwen3-14B-FP8
Qwen3-14B | NVFP4 | ✅ | nvidia/Qwen3-14B-FP4
Qwen3-32B | NVFP4 | ✅ | nvidia/Qwen3-32B-FP4
Phi-4-multimodal-instruct | FP8 | ✅ | nvidia/Phi-4-multimodal-instruct-FP8
Phi-4-multimodal-instruct | NVFP4 | ✅ | nvidia/Phi-4-multimodal-instruct-FP4
Phi-4-reasoning-plus | FP8 | ✅ | nvidia/Phi-4-reasoning-plus-FP8
Phi-4-reasoning-plus | NVFP4 | ✅ | nvidia/Phi-4-reasoning-plus-FP4
Qwen3-30B-A3B | NVFP4 | ✅ | nvidia/Qwen3-30B-A3B-FP4
Llama-4-Scout-17B-16E-Instruct | NVFP4 | ✅ | nvidia/Llama-4-Scout-17B-16E-Instruct-FP4
Qwen3-235B-A22B (two Sparks only) | NVFP4 | ✅ | nvidia/Qwen3-235B-A22B-FP4

NOTE

You can use the NVFP4 Quantization documentation to generate your own NVFP4-quantized checkpoints for your favorite models. This enables you to take advantage of the performance and memory benefits of NVFP4 quantization even for models not already published by NVIDIA.

Reminder: not all model architectures are supported for NVFP4 quantization.
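To put the support matrix into practice, here is a minimal sketch of serving one of the listed checkpoints from the release container and querying its OpenAI-compatible endpoint. The model handle, container tag, and port 8355 come from the tables above; the exact trtllm-serve invocation and flags can differ between releases, so treat this as an outline and follow the Single Spark section of this playbook for the authoritative command.

# Launch the TensorRT-LLM release container and serve an NVFP4 checkpoint
# (sketch only; verify the trtllm-serve flags against the Single Spark steps)
docker run --rm -it --gpus all \
  -e HF_TOKEN=$HF_TOKEN \
  -p 8355:8355 \
  nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 \
  trtllm-serve nvidia/Llama-3.1-8B-Instruct-FP4 --host 0.0.0.0 --port 8355

# From another terminal, exercise the OpenAI-compatible chat endpoint
curl http://localhost:8355/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/Llama-3.1-8B-Instruct-FP4",
        "messages": [{"role": "user", "content": "Hello from DGX Spark"}],
        "max_tokens": 64
      }'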

Time & risk

  • Duration: 45-60 minutes for setup and API server deployment
  • Risk level: Medium - container pulls and model downloads may fail due to network issues
  • Rollback: Stop inference servers and remove downloaded models to free resources.
  • Last Updated: 01/02/2026
    • Improve TRT-LLM Run on Two Sparks workflow
    • Upgrade to the latest TRT-LLM container v1.2.0rc6
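A hedged sketch of the rollback step, assuming model weights were pulled into the default Hugging Face cache on the host; adjust the image tag and cache path to match your environment, and skip the cache removal if other projects share it.

# Stop any running TensorRT-LLM serving containers started from the release image
docker ps --filter ancestor=nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 -q | xargs -r docker stop

# Optionally reclaim disk space from downloaded model weights
rm -rf ~/.cache/huggingface/hub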

Resources

  • TensorRT-LLM Documentation
  • DGX Spark Documentation
  • DGX Spark Forum
  • DGX Spark User Performance Guide