Skip to main content
NVIDIA
Explore
Models
Skills
Blueprints
GPUs
Docs
Terms of Use
Privacy Policy
Your Privacy Choices
Contact

Copyright © 2026 NVIDIA Corporation

View All Playbooks
View All Playbooks

onboarding

  • Set Up Local Network Access
  • Open WebUI with Ollama

data science

  • Single-cell RNA Sequencing
  • Portfolio Optimization
  • CUDA-X Data Science
  • Text to Knowledge Graph
  • Optimized JAX

tools

  • DGX Dashboard
  • RAG Application in AI Workbench
  • Set up Tailscale on Your Spark
  • VS Code
  • Connect Three DGX Spark in a Ring Topology
  • Connect Multiple DGX Spark through a Switch

fine tuning

  • FLUX.1 Dreambooth LoRA Fine-tuning
  • LLaMA Factory
  • Fine-tune with NeMo
  • Fine-tune with Pytorch
  • Unsloth on DGX Spark

use case

  • Run NemoClaw with a Local LLM
  • 🦞 Set Up Example NemoClaw Agents 🦞
  • Run Hermes Agent with Local Models
  • cuTile Kernels
  • CLI Coding Agent
  • Live VLM WebUI
  • Install and Use Isaac Sim and Isaac Lab
  • Vibe Coding in VS Code
  • Build and Deploy a Multi-Agent Chatbot
  • Connect Two Sparks
  • NCCL for Two Sparks
  • Build a Video Search and Summarization (VSS) Agent
  • Secure Long Running AI Agents with OpenShell on DGX Spark
  • OpenClaw 🦞

inference

  • Speculative Decoding
  • Run models with llama.cpp on DGX Spark
  • Nemotron-3-Nano with llama.cpp
  • SGLang for Inference
  • TRT LLM for Inference
  • NVFP4 Quantization
  • Multi-modal Inference
  • NIM on Spark
  • LM Studio on DGX Spark
  • vLLM for Inference

vLLM for Inference

30 MIN

Install and use vLLM on DGX Spark

DGXSpark
OverviewOverviewInstructionsInstructionsRun on two SparksRun on two SparksRun on multiple Sparks through a switchRun on multiple Sparks through a switchRun Agent Ready Qwen3.6 35B Model with vLLMRun Agent Ready Qwen3.6 35B Model with vLLMTroubleshootingTroubleshooting

Basic idea

vLLM is an inference engine designed to run large language models efficiently. The key idea is maximizing throughput and minimizing memory waste when serving LLMs.

  • It uses a memory-efficient attention algoritm called PagedAttention to handle long sequences without running out of GPU memory.
  • New requests can be added to a batch already in process through continuous batching to keep GPUs fully utilized.
  • It has an OpenAI-compatible API so applications built for the OpenAI API can switch to a vLLM backend with little or no modification.

What you'll accomplish

You'll set up vLLM high-throughput LLM serving on DGX Spark with Blackwell architecture, either using a pre-built Docker container or building from source with custom LLVM/Triton support for ARM64.

What to know before starting

  • Experience building and configuring containers with Docker
  • Familiarity with CUDA toolkit installation and version management
  • Understanding of Python virtual environments and package management
  • Knowledge of building software from source using CMake and Ninja
  • Experience with Git version control and patch management

Prerequisites

  • DGX Spark device with ARM64 processor and Blackwell GPU architecture
  • CUDA 13.0 toolkit installed: nvcc --version shows CUDA toolkit version.
  • Docker installed and configured: docker --version succeeds
  • NVIDIA Container Toolkit installed
  • Python 3.12 available: python3.12 --version succeeds
  • Git installed: git --version succeeds
  • Network access to download packages and container images

Model Support Matrix

The following models are supported with vLLM on Spark. All listed models are available and ready to use:

ModelQuantizationSupport StatusHF Handle
DiffusionGemma 26B A4B ITBF16✅google/diffusiongemma-26B-A4B-it
DiffusionGemma 26B A4B ITNVFP4✅nvidia/diffusiongemma-26B-A4B-it-NVFP4
Nemotron-3-Nano-Omni-30B-A3B-ReasoningBF16✅nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
Nemotron-3-Nano-Omni-30B-A3B-ReasoningFP8✅nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8
Nemotron-3-Nano-Omni-30B-A3B-ReasoningNVFP4✅nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4
Gemma 4 31B ITBase✅google/gemma-4-31B-it
Gemma 4 31B ITNVFP4✅nvidia/Gemma-4-31B-IT-NVFP4
Gemma 4 26B A4B ITBase✅google/gemma-4-26B-A4B-it
Gemma 4 E4B ITBase✅google/gemma-4-E4B-it
Gemma 4 E2B ITBase✅google/gemma-4-E2B-it
Nemotron-3-Super-120BNVFP4✅nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
GPT-OSS-20BMXFP4✅openai/gpt-oss-20b
GPT-OSS-120BMXFP4✅openai/gpt-oss-120b
Llama-3.1-8B-InstructFP8✅nvidia/Llama-3.1-8B-Instruct-FP8
Llama-3.1-8B-InstructNVFP4✅nvidia/Llama-3.1-8B-Instruct-NVFP4
Llama-3.3-70B-InstructNVFP4✅nvidia/Llama-3.3-70B-Instruct-NVFP4
Qwen3-8BFP8✅nvidia/Qwen3-8B-FP8
Qwen3-8BNVFP4✅nvidia/Qwen3-8B-NVFP4
Qwen3-14BFP8✅nvidia/Qwen3-14B-FP8
Qwen3-14BNVFP4✅nvidia/Qwen3-14B-NVFP4
Qwen3-32BNVFP4✅nvidia/Qwen3-32B-NVFP4
Qwen2.5-VL-7B-InstructNVFP4✅nvidia/Qwen2.5-VL-7B-Instruct-NVFP4
Qwen3-VL-Reranker-2BBase✅Qwen/Qwen3-VL-Reranker-2B
Qwen3-VL-Reranker-8BBase✅Qwen/Qwen3-VL-Reranker-8B
Qwen3-VL-Embedding-2BBase✅Qwen/Qwen3-VL-Embedding-2B
Phi-4-multimodal-instructFP8✅nvidia/Phi-4-multimodal-instruct-FP8
Phi-4-multimodal-instructNVFP4✅nvidia/Phi-4-multimodal-instruct-NVFP4
Phi-4-reasoning-plusFP8✅nvidia/Phi-4-reasoning-plus-FP8
Phi-4-reasoning-plusNVFP4✅nvidia/Phi-4-reasoning-plus-NVFP4
Nemotron3-NanoBF16✅nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
Nemotron3-NanoFP8✅nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8

NOTE

The Phi-4-multimodal-instruct models require --trust-remote-code when launching vLLM.

NOTE

You can use the NVFP4 Quantization documentation to generate your own NVFP4-quantized checkpoints for your favorite models. This enables you to take advantage of the performance and memory benefits of NVFP4 quantization even for models not already published by NVIDIA.

Reminder: not all model architectures are supported for NVFP4 quantization.

Time & risk

  • Duration: 30 minutes for Docker approach
  • Risks: Container registry access requires internal credentials
  • Rollback: Container approach is non-destructive.
  • Last Updated: 06/12/2026
    • Add Agent ready model recipe for Qwen3.6 35B

Resources

  • vLLM Documentation
  • DGX Spark Documentation
  • DGX Spark Forum
  • DGX Spark User Performance Guide