vLLM for Inference

30 MIN

Install and use vLLM on DGX Spark

DGXSpark

Basic idea

vLLM is an inference engine designed to run large language models efficiently. The key idea is maximizing throughput and minimizing memory waste when serving LLMs.

  • It uses a memory-efficient attention algorithm called PagedAttention to handle long sequences without running out of GPU memory.
  • Continuous batching lets new requests join a batch already in progress, keeping the GPU fully utilized.
  • It exposes an OpenAI-compatible API, so applications built for the OpenAI API can switch to a vLLM backend with little or no modification.
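Because the API is OpenAI-compatible, any HTTP client can talk to a running vLLM server. Below is a minimal sketch using only the Python standard library; it assumes a server already started with `vllm serve` on the default port 8000, and the model name shown is just one of the entries from the support matrix:

```python
"""Minimal client for a vLLM OpenAI-compatible server (stdlib only).

Assumes a server is running, e.g. `vllm serve nvidia/Qwen3-8B-FP8`;
the URL and model name below are placeholders for your own setup.
"""
import json
import urllib.request


def build_chat_request(model: str, prompt: str) -> bytes:
    """Serialize an OpenAI-style chat-completion request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }).encode()


def chat(prompt: str,
         model: str = "nvidia/Qwen3-8B-FP8",
         url: str = "http://localhost:8000/v1/chat/completions") -> str:
    """POST the request and return the first choice's message text."""
    req = urllib.request.Request(
        url,
        data=build_chat_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Requires a live vLLM server; otherwise this raises URLError.
    print(chat("What is PagedAttention?"))
```

The same endpoint also works with the official `openai` Python client by pointing its `base_url` at the server, which is how existing OpenAI-based applications migrate without code changes.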

What you'll accomplish

You'll set up high-throughput LLM serving with vLLM on DGX Spark's Blackwell architecture, either by using a pre-built Docker container or by building from source with custom LLVM/Triton support for ARM64.
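As a rough sketch of the container route, the invocation typically looks like the following. The image name is a placeholder (the actual image comes from the Instructions section), and the model is one entry from the support matrix below:

```shell
# <vllm-image> is a placeholder -- substitute the vLLM container image
# referenced in the Instructions section. --gpus all exposes the GPU via
# the NVIDIA Container Toolkit; 8000 is vLLM's default API port.
docker run --gpus all -p 8000:8000 \
  <vllm-image> \
  vllm serve nvidia/Qwen3-8B-FP8
```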

What to know before starting

  • Experience building and configuring containers with Docker
  • Familiarity with CUDA toolkit installation and version management
  • Understanding of Python virtual environments and package management
  • Knowledge of building software from source using CMake and Ninja
  • Experience with Git version control and patch management

Prerequisites

  • DGX Spark device with ARM64 processor and Blackwell GPU architecture
  • CUDA 13.0 toolkit installed: nvcc --version reports release 13.0
  • Docker installed and configured: docker --version succeeds
  • NVIDIA Container Toolkit installed
  • Python 3.12 available: python3.12 --version succeeds
  • Git installed: git --version succeeds
  • Network access to download packages and container images

Model Support Matrix

The following models are supported with vLLM on Spark. All listed models are available and ready to use:

| Model | Quantization | Support Status | HF Handle |
|---|---|---|---|
| Gemma 4 31B IT | Base | ✅ | google/gemma-4-31B-it |
| Gemma 4 31B IT | NVFP4 | ✅ | nvidia/Gemma-4-31B-IT-NVFP4 |
| Gemma 4 26B A4B IT | Base | ✅ | google/gemma-4-26B-A4B-it |
| Gemma 4 E4B IT | Base | ✅ | google/gemma-4-E4B-it |
| Gemma 4 E2B IT | Base | ✅ | google/gemma-4-E2B-it |
| Nemotron-3-Super-120B | NVFP4 | ✅ | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 |
| GPT-OSS-20B | MXFP4 | ✅ | openai/gpt-oss-20b |
| GPT-OSS-120B | MXFP4 | ✅ | openai/gpt-oss-120b |
| Llama-3.1-8B-Instruct | FP8 | ✅ | nvidia/Llama-3.1-8B-Instruct-FP8 |
| Llama-3.1-8B-Instruct | NVFP4 | ✅ | nvidia/Llama-3.1-8B-Instruct-NVFP4 |
| Llama-3.3-70B-Instruct | NVFP4 | ✅ | nvidia/Llama-3.3-70B-Instruct-NVFP4 |
| Qwen3-8B | FP8 | ✅ | nvidia/Qwen3-8B-FP8 |
| Qwen3-8B | NVFP4 | ✅ | nvidia/Qwen3-8B-NVFP4 |
| Qwen3-14B | FP8 | ✅ | nvidia/Qwen3-14B-FP8 |
| Qwen3-14B | NVFP4 | ✅ | nvidia/Qwen3-14B-NVFP4 |
| Qwen3-32B | NVFP4 | ✅ | nvidia/Qwen3-32B-NVFP4 |
| Qwen2.5-VL-7B-Instruct | NVFP4 | ✅ | nvidia/Qwen2.5-VL-7B-Instruct-NVFP4 |
| Qwen3-VL-Reranker-2B | Base | ✅ | Qwen/Qwen3-VL-Reranker-2B |
| Qwen3-VL-Reranker-8B | Base | ✅ | Qwen/Qwen3-VL-Reranker-8B |
| Qwen3-VL-Embedding-2B | Base | ✅ | Qwen/Qwen3-VL-Embedding-2B |
| Phi-4-multimodal-instruct | FP8 | ✅ | nvidia/Phi-4-multimodal-instruct-FP8 |
| Phi-4-multimodal-instruct | NVFP4 | ✅ | nvidia/Phi-4-multimodal-instruct-NVFP4 |
| Phi-4-reasoning-plus | FP8 | ✅ | nvidia/Phi-4-reasoning-plus-FP8 |
| Phi-4-reasoning-plus | NVFP4 | ✅ | nvidia/Phi-4-reasoning-plus-NVFP4 |
| Nemotron3-Nano | BF16 | ✅ | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 |
| Nemotron3-Nano | FP8 | ✅ | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 |

NOTE

The Phi-4-multimodal-instruct models require --trust-remote-code when launching vLLM.
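For example, a launch command for one of these checkpoints would look like this (flags beyond the model name are up to your setup):

```shell
# Phi-4 multimodal checkpoints ship custom model code on Hugging Face,
# so vLLM must be explicitly allowed to execute it:
vllm serve nvidia/Phi-4-multimodal-instruct-FP8 --trust-remote-code
```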

NOTE

You can use the NVFP4 Quantization documentation to generate your own NVFP4-quantized checkpoints for your favorite models. This enables you to take advantage of the performance and memory benefits of NVFP4 quantization even for models not already published by NVIDIA.

Reminder: not all model architectures are supported for NVFP4 quantization.

Time & risk

  • Duration: 30 minutes for the Docker approach
  • Risks: Container registry access requires internal credentials
  • Rollback: The container approach is non-destructive.
  • Last Updated: 04/02/2026
    • Added support for the Gemma 4 model family

Resources

  • vLLM Documentation
  • DGX Spark Documentation
  • DGX Spark Forum
  • DGX Spark User Performance Guide

Copyright © 2026 NVIDIA Corporation