Basic idea
vLLM is an inference engine designed to run large language models efficiently. The key idea is maximizing throughput and minimizing memory waste when serving LLMs.
- PagedAttention handles long sequences without running out of GPU memory.
- Continuous batching keeps GPUs fully utilized by adding new requests to batches in progress.
- OpenAI-compatible API allows applications built for OpenAI to switch to vLLM with minimal changes.
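Because the API is OpenAI-compatible, a running vLLM server can be queried with the same request shape an OpenAI client would send. A minimal sketch, assuming the server listens on `localhost:8000` (vLLM's default port) and serves the model under the name shown; the exact model identifier depends on how the server was launched:

```shell
# Send a chat completion request to a vLLM server's OpenAI-compatible endpoint.
# The host, port, and model name are assumptions; match them to your deployment.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-235B-A22B-NVFP4",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
```

Existing applications that target the OpenAI API typically only need their base URL pointed at the vLLM server.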
What you'll accomplish
Serve the Qwen3-235B-A22B-NVFP4 model using vLLM on NVIDIA DGX Station. This 235B parameter model uses NVFP4 quantization and fits entirely in VRAM on the GB300 GPU.
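The serving step above can be sketched as a single `docker run` invocation of vLLM's official OpenAI-compatible server image. This is a sketch, not the definitive command for this guide: the model repository path is an assumption (substitute the actual HuggingFace repo for the NVFP4 checkpoint), and `HF_TOKEN` must hold a valid HuggingFace access token:

```shell
# Launch the vLLM OpenAI-compatible server in Docker, caching downloaded
# weights on the host so later runs skip the download.
# NOTE: the model path "Qwen3-235B-A22B-NVFP4" is illustrative.
docker run --gpus all --rm --name vllm-server \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN="$HF_TOKEN" \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen3-235B-A22B-NVFP4
```

The first run downloads the weights, which dominates the setup time; subsequent runs start from the host cache.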
What to know before starting
- Basic Docker container usage
- Familiarity with REST APIs
Prerequisites
- NVIDIA DGX Station with GB300 and RTX 6000 Pro GPUs
- Docker installed (verify with `docker --version`)
- NVIDIA Container Toolkit configured
- HuggingFace account with access token
- Network access to NGC and HuggingFace
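The prerequisites above can be checked before starting. A quick sketch, assuming a standard setup (the CUDA base image tag is illustrative; any recent `nvidia/cuda` tag works for the toolkit check):

```shell
# Host-level checks.
docker --version          # Docker is installed
nvidia-smi                # GPUs are visible to the host driver

# Confirm the NVIDIA Container Toolkit passes GPUs into containers.
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```

If the last command lists the GPUs from inside the container, the container runtime is ready for vLLM.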
Time & risk
- Duration: 15-20 minutes (longer on first run due to model download)
- Risks: the model download fails without valid HuggingFace authentication, and the weights consume substantial disk space
- Rollback: Stop and remove the container to restore state
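The rollback step above amounts to removing the container and, optionally, the cached weights. A sketch, assuming the container was started with `--name vllm-server` (substitute your container name or ID):

```shell
# Stop and remove the serving container.
docker stop vllm-server
docker rm vllm-server          # not needed if the container ran with --rm

# Optionally reclaim disk space from the host-side HuggingFace cache
# (the exact cache directory name depends on the model repo that was pulled):
# rm -rf ~/.cache/huggingface/hub/models--<org>--<model>
```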
- Last updated: 03/02/2026 (first publication)