Basic idea
vLLM is an inference engine designed to run large language models efficiently. Its core goals are maximizing throughput and minimizing memory waste when serving LLMs.
- It uses a memory-efficient attention algorithm called PagedAttention to handle long sequences without running out of GPU memory.
- Continuous batching lets new requests join a batch that is already in flight, keeping the GPU fully utilized.
- It exposes an OpenAI-compatible API, so applications built for the OpenAI API can switch to a vLLM backend with little or no modification (see the client sketch after this list).
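As a quick illustration of the OpenAI-compatible API, the sketch below points the standard `openai` Python client at a locally running vLLM server. The endpoint, API key, and model id are assumptions; substitute the address and model your server actually exposes.

```python
# A minimal client sketch, assuming a vLLM server is already running and
# serving a chat model. Endpoint, key, and model id are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed vLLM endpoint (default port)
    api_key="EMPTY",  # vLLM accepts any key unless one is configured
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical served model
    messages=[{"role": "user", "content": "Hello from DGX Spark!"}],
)
print(response.choices[0].message.content)
```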
What you'll accomplish
You'll set up high-throughput LLM serving with vLLM on DGX Spark (Blackwell architecture), either by using a pre-built Docker container or by building from source with custom LLVM/Triton support for ARM64.
What to know before starting
- Experience building and configuring containers with Docker
- Familiarity with CUDA toolkit installation and version management
- Understanding of Python virtual environments and package management
- Knowledge of building software from source using CMake and Ninja
- Experience with Git version control and patch management
Prerequisites
- DGX Spark device with ARM64 processor and Blackwell GPU architecture
- CUDA 13.0 toolkit installed: `nvcc --version` shows the CUDA toolkit version
- Docker installed and configured: `docker --version` succeeds
- NVIDIA Container Toolkit installed
- Python 3.12 available: `python3.12 --version` succeeds
- Git installed: `git --version` succeeds
- Network access to download packages and container images
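If you want to check these in one pass, the following sketch runs each version command from the checklist above. The binary names are taken from the list and assumed to be on PATH; adjust for your system.

```python
# A minimal sketch that runs each version command from the checklist above.
import shutil
import subprocess

CHECKS = [
    ["nvcc", "--version"],
    ["docker", "--version"],
    ["python3.12", "--version"],
    ["git", "--version"],
]

for cmd in CHECKS:
    if shutil.which(cmd[0]) is None:
        print(f"MISSING: {cmd[0]} not found on PATH")
        continue
    result = subprocess.run(cmd, capture_output=True, text=True)
    first_line = result.stdout.splitlines()[0] if result.stdout else ""
    status = "OK" if result.returncode == 0 else f"FAILED (exit {result.returncode})"
    print(f"{cmd[0]}: {status} {first_line}")
```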
Model Support Matrix
The following models are supported with vLLM on Spark. All listed models are available and ready to use:
NOTE
The Phi-4-multimodal-instruct models require `--trust-remote-code` when launching vLLM.
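For reference, a minimal sketch of the equivalent offline usage via vLLM's Python API: the `trust_remote_code=True` keyword mirrors the `--trust-remote-code` server flag, assuming the Hugging Face model id `microsoft/Phi-4-multimodal-instruct`.

```python
# A minimal sketch of the offline equivalent of --trust-remote-code.
from vllm import LLM

llm = LLM(
    model="microsoft/Phi-4-multimodal-instruct",
    trust_remote_code=True,  # the checkpoint ships custom modeling code
)
```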
NOTE
You can use the NVFP4 Quantization documentation to generate your own NVFP4-quantized checkpoints for your favorite models. This enables you to take advantage of the performance and memory benefits of NVFP4 quantization even for models not already published by NVIDIA.
Reminder: not all model architectures are supported for NVFP4 quantization.
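Once you have an NVFP4 checkpoint, serving it with vLLM looks the same as serving any other model, since the quantization scheme is read from the checkpoint's own config. A minimal sketch, assuming a hypothetical NVFP4 checkpoint id:

```python
# A minimal sketch: an NVFP4 checkpoint loads like any other model because
# vLLM picks up the quantization scheme from the checkpoint config. The
# model id is a placeholder -- substitute a checkpoint you generated or
# one published by NVIDIA.
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP4")  # assumed NVFP4 checkpoint id
outputs = llm.generate(
    ["Summarize PagedAttention in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```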
Time & risk
- Duration: 30 minutes for Docker approach
- Risks: Container registry access requires internal credentials
- Rollback: Container approach is non-destructive.
- Last Updated: 03/12/2026
  - Added support for Nemotron-3-Super-120B model
  - Updated container to Feb 2026 release (26.02-py3)