Install and use TensorRT-LLM on DGX Spark
NVIDIA TensorRT-LLM (TRT-LLM) is an open-source library for optimizing and accelerating large language model (LLM) inference on NVIDIA GPUs.
It provides highly efficient kernels, memory management, and parallelism strategies—like tensor, pipeline, and sequence parallelism—so developers can serve LLMs with lower latency and higher throughput.
TRT-LLM integrates with frameworks like Hugging Face and PyTorch, making it easier to deploy state-of-the-art models at scale.
In this guide, you'll set up TensorRT-LLM to optimize and deploy large language models on your DGX Spark. Compared with standard PyTorch inference, TensorRT-LLM targets higher throughput and lower latency through kernel-level optimizations, efficient memory layouts, and advanced quantization.
nvidia-smi
docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 nvidia-smi
echo $HF_TOKEN
All required assets can be found here on GitHub.
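The checks above can be wrapped in a small script that reports what is missing instead of stopping at the first failure. This is only a sketch, assuming a bash-compatible shell; the function names `check`, `check_prereqs`, and `check_container_gpu` are hypothetical helpers, not part of TensorRT-LLM or Docker.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: report prerequisite status without aborting.
# The image tag is the one used in the command above.
IMAGE="nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6"

check() {
  # $1 = human-readable label, remaining arguments = command to run quietly
  label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "ok: $label"
  else
    echo "missing: $label"
  fi
}

check_prereqs() {
  check "GPU driver (nvidia-smi)" nvidia-smi
  check "Docker CLI" docker --version
  # test -n succeeds only when HF_TOKEN is set and non-empty
  check "Hugging Face token (HF_TOKEN)" test -n "${HF_TOKEN:-}"
}

check_container_gpu() {
  # Pulls the image on first use, so it is kept separate from the quick checks.
  check "GPU visible inside container" docker run --rm --gpus all "$IMAGE" nvidia-smi
}

check_prereqs
```

Call `check_container_gpu` separately once the quick checks pass, since the first run downloads the container image.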
The following models are supported with TensorRT-LLM on Spark. All listed models are available and ready to use:
| Model | Quantization | Support Status | HF Handle |
|---|---|---|---|
| GPT-OSS-20B | MXFP4 | ✅ | openai/gpt-oss-20b |
| GPT-OSS-120B | MXFP4 | ✅ | openai/gpt-oss-120b |
| Llama-3.1-8B-Instruct | FP8 | ✅ | nvidia/Llama-3.1-8B-Instruct-FP8 |
| Llama-3.1-8B-Instruct | NVFP4 | ✅ | nvidia/Llama-3.1-8B-Instruct-FP4 |
| Llama-3.3-70B-Instruct | NVFP4 | ✅ | nvidia/Llama-3.3-70B-Instruct-FP4 |
| Qwen3-8B | FP8 | ✅ | nvidia/Qwen3-8B-FP8 |
| Qwen3-8B | NVFP4 | ✅ | nvidia/Qwen3-8B-FP4 |
| Qwen3-14B | FP8 | ✅ | nvidia/Qwen3-14B-FP8 |
| Qwen3-14B | NVFP4 | ✅ | nvidia/Qwen3-14B-FP4 |
| Qwen3-32B | NVFP4 | ✅ | nvidia/Qwen3-32B-FP4 |
| Phi-4-multimodal-instruct | FP8 | ✅ | nvidia/Phi-4-multimodal-instruct-FP8 |
| Phi-4-multimodal-instruct | NVFP4 | ✅ | nvidia/Phi-4-multimodal-instruct-FP4 |
| Phi-4-reasoning-plus | FP8 | ✅ | nvidia/Phi-4-reasoning-plus-FP8 |
| Phi-4-reasoning-plus | NVFP4 | ✅ | nvidia/Phi-4-reasoning-plus-FP4 |
| Qwen3-30B-A3B | NVFP4 | ✅ | nvidia/Qwen3-30B-A3B-FP4 |
| Llama-4-Scout-17B-16E-Instruct | NVFP4 | ✅ | nvidia/Llama-4-Scout-17B-16E-Instruct-FP4 |
| Qwen3-235B-A22B (two Sparks only) | NVFP4 | ✅ | nvidia/Qwen3-235B-A22B-FP4 |
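The HF Handle column is the identifier you would pass to download or serve a model. As a minimal sketch, the lookup below maps a model name and quantization to its handle for a few rows of the table; `handle_for` and `MODEL_HANDLE` are hypothetical names chosen for illustration, and the remaining rows would need to be added by hand.

```shell
#!/usr/bin/env bash
# Sketch: resolve a (model, quantization) pair from the table to its HF handle.
# Only a subset of rows is reproduced here; extend the case statement as needed.
handle_for() {
  case "$1:$2" in
    GPT-OSS-20B:MXFP4)           echo "openai/gpt-oss-20b" ;;
    GPT-OSS-120B:MXFP4)          echo "openai/gpt-oss-120b" ;;
    Llama-3.1-8B-Instruct:FP8)   echo "nvidia/Llama-3.1-8B-Instruct-FP8" ;;
    Llama-3.1-8B-Instruct:NVFP4) echo "nvidia/Llama-3.1-8B-Instruct-FP4" ;;
    Qwen3-8B:FP8)                echo "nvidia/Qwen3-8B-FP8" ;;
    Qwen3-8B:NVFP4)              echo "nvidia/Qwen3-8B-FP4" ;;
    *)
      echo "unknown model/quantization: $1 $2" >&2
      return 1
      ;;
  esac
}

# Pick one entry and export it for later commands.
MODEL_HANDLE="$(handle_for Qwen3-8B NVFP4)"
export MODEL_HANDLE
echo "Selected handle: $MODEL_HANDLE"
```

An exported variable like this can be forwarded into a container with Docker's `-e MODEL_HANDLE` flag, so the same handle is used consistently across commands.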
NOTE
You can use the NVFP4 Quantization documentation to generate your own NVFP4-quantized checkpoints for your favorite models. This enables you to take advantage of the performance and memory benefits of NVFP4 quantization even for models not already published by NVIDIA.
Reminder: not all model architectures are supported for NVFP4 quantization.
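To make the memory benefit concrete, the sketch below estimates weight storage for a nominal 70B-parameter model at different precisions. It assumes NVFP4 costs about 4.5 bits per weight (4-bit values plus one 8-bit scale shared by each block of 16 values) and ignores the KV cache, activations, and any layers a checkpoint keeps in higher precision, so treat the results as rough lower bounds. The helper name `weight_gb` is hypothetical.

```shell
#!/usr/bin/env bash
# Rough weight-memory estimate:
#   parameters (in billions) x bits per weight / 8 = gigabytes (10^9 bytes)
weight_gb() {
  # $1 = parameter count in billions, $2 = effective bits per weight
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Assumed effective bits per weight: BF16 = 16, FP8 = 8, NVFP4 = 4.5
echo "BF16:  $(weight_gb 70 16) GB"   # 140.0 GB
echo "FP8:   $(weight_gb 70 8) GB"    # 70.0 GB
echo "NVFP4: $(weight_gb 70 4.5) GB"  # 39.4 GB
```

Under these assumptions, NVFP4 weights take a little over a quarter of the BF16 footprint, which is why larger models in the table are listed only with NVFP4 checkpoints.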