vLLM for Inference
Install and use vLLM on DGX Spark
Basic idea
vLLM is an inference engine designed to run large language models efficiently. The key idea is maximizing throughput and minimizing memory waste when serving LLMs.
- It uses a memory-efficient attention algorithm called PagedAttention to handle long sequences without running out of GPU memory.
- Continuous batching lets new requests join a batch that is already in flight, keeping the GPU fully utilized.
- It exposes an OpenAI-compatible API, so applications built for the OpenAI API can switch to a vLLM backend with little or no modification (see the example below).
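As a minimal sketch of what the OpenAI-compatible API looks like in practice: once a vLLM server is running, any OpenAI-style client or plain curl can talk to it. The example assumes the server is listening on the default port 8000 and serving one of the models from the support matrix below.

```bash
# Minimal sketch: query a local vLLM server through its OpenAI-compatible endpoint.
# Assumes the server is already running on the default port 8000 and serving
# nvidia/Llama-3.1-8B-Instruct-FP8 (any model from the support matrix works).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/Llama-3.1-8B-Instruct-FP8",
        "messages": [{"role": "user", "content": "What is DGX Spark?"}],
        "max_tokens": 128
      }'
```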
What you'll accomplish
You'll set up high-throughput LLM serving with vLLM on a DGX Spark with the Blackwell GPU architecture, either by using a pre-built Docker container or by building from source with custom LLVM/Triton support for ARM64.
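For orientation, a rough sketch of the Docker path follows. The image name below is a placeholder for the actual DGX Spark vLLM container (for example the 25.11-py3 release), which must be pulled with the appropriate registry credentials; the model name is taken from the support matrix later in this page.

```bash
# Sketch only: <vllm-image> is a placeholder for the DGX Spark vLLM container
# you pulled from the registry (e.g. the 25.11-py3 release).
docker run --rm -it --gpus all \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  <vllm-image> \
  vllm serve nvidia/Llama-3.1-8B-Instruct-FP8
```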
What to know before starting
- Experience building and configuring containers with Docker
- Familiarity with CUDA toolkit installation and version management
- Understanding of Python virtual environments and package management
- Knowledge of building software from source using CMake and Ninja
- Experience with Git version control and patch management
Prerequisites
- DGX Spark device with ARM64 processor and Blackwell GPU architecture
- CUDA 13.0 toolkit installed: `nvcc --version` shows the CUDA toolkit version
- Docker installed and configured: `docker --version` succeeds
- NVIDIA Container Toolkit installed
- Python 3.12 available: `python3.12 --version` succeeds
- Git installed: `git --version` succeeds
- Network access to download packages and container images
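A quick preflight pass over these prerequisites can be run in one go. This assumes the tools are already on your PATH; the NVIDIA Container Toolkit check via `nvidia-ctk` applies only if that CLI is installed.

```bash
# Preflight check: print the versions the prerequisites call for.
nvcc --version        # expect CUDA 13.0
docker --version
python3.12 --version
git --version
nvidia-ctk --version  # NVIDIA Container Toolkit CLI, if installed
```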
Model Support Matrix
The following models are supported with vLLM on DGX Spark. All listed models are available and ready to use:
| Model | Quantization | Support Status | HF Handle |
|---|---|---|---|
| GPT-OSS-20B | MXFP4 | ✅ | openai/gpt-oss-20b |
| GPT-OSS-120B | MXFP4 | ✅ | openai/gpt-oss-120b |
| Llama-3.1-8B-Instruct | FP8 | ✅ | nvidia/Llama-3.1-8B-Instruct-FP8 |
| Llama-3.1-8B-Instruct | NVFP4 | ✅ | nvidia/Llama-3.1-8B-Instruct-FP4 |
| Llama-3.3-70B-Instruct | NVFP4 | ✅ | nvidia/Llama-3.3-70B-Instruct-FP4 |
| Qwen3-8B | FP8 | ✅ | nvidia/Qwen3-8B-FP8 |
| Qwen3-8B | NVFP4 | ✅ | nvidia/Qwen3-8B-FP4 |
| Qwen3-14B | FP8 | ✅ | nvidia/Qwen3-14B-FP8 |
| Qwen3-14B | NVFP4 | ✅ | nvidia/Qwen3-14B-FP4 |
| Qwen3-32B | NVFP4 | ✅ | nvidia/Qwen3-32B-FP4 |
| Qwen2.5-VL-7B-Instruct | NVFP4 | ✅ | nvidia/Qwen2.5-VL-7B-Instruct-FP4 |
| Phi-4-multimodal-instruct | FP8 | ✅ | nvidia/Phi-4-multimodal-instruct-FP8 |
| Phi-4-multimodal-instruct | NVFP4 | ✅ | nvidia/Phi-4-multimodal-instruct-FP4 |
| Phi-4-reasoning-plus | FP8 | ✅ | nvidia/Phi-4-reasoning-plus-FP8 |
| Phi-4-reasoning-plus | NVFP4 | ✅ | nvidia/Phi-4-reasoning-plus-FP4 |
NOTE
The Phi-4-multimodal-instruct models require `--trust-remote-code` when launching vLLM.
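For example, a launch sketch for one of these checkpoints (assuming vLLM is already available on the host or inside the running container):

```bash
# Phi-4-multimodal-instruct ships custom modeling code, so vLLM must be told
# to trust and execute it when loading the checkpoint.
vllm serve nvidia/Phi-4-multimodal-instruct-FP8 --trust-remote-code
```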
NOTE
You can use the NVFP4 Quantization documentation to generate your own NVFP4-quantized checkpoints for your favorite models. This enables you to take advantage of the performance and memory benefits of NVFP4 quantization even for models not already published by NVIDIA.
Reminder: not all model architectures are supported for NVFP4 quantization.
Time & risk
- Duration: about 30 minutes for the Docker approach
- Risks: container registry access requires internal credentials
- Rollback: the container approach is non-destructive
- Last Updated: 01/02/2026
  - Added supported model matrix (25.11-py3)
  - Improved cluster setup instructions