vLLM for Inference

Basic idea

vLLM is an inference engine designed to run large language models efficiently. The key idea is maximizing throughput and minimizing memory waste when serving LLMs.

PagedAttention handles long sequences without running out of GPU memory.
Continuous batching keeps GPUs fully utilized by adding new requests to batches in progress.
OpenAI-compatible API allows applications built for OpenAI to switch to vLLM with minimal changes.

What you'll accomplish

Serve a supported model using vLLM on NVIDIA RTX Pro 6000. See the list of supported models below.

What to know before starting

Basic Docker container usage
Familiarity with REST APIs

Prerequisites

NVIDIA RTX Pro 6000 with Ubuntu 22.04 or 24.04 host
Docker installed: docker --version
NVIDIA Container Toolkit configured
HuggingFace account with access token
Network access to NGC and HuggingFace

Model Support Matrix

The following models are supported with vLLM on RTX Pro 6000. All listed models are available and ready to use:

Model	Quantization	Support Status	HF Handle
DiffusionGemma 26B A4B IT	BF16	✅	`google/diffusiongemma-26B-A4B-it`
DiffusionGemma 26B A4B IT	NVFP4	✅	`nvidia/diffusiongemma-26B-A4B-it-NVFP4`

Time & risk

Duration: 30 minutes (longer on first run due to model download)
Risks: Model download requires HuggingFace authentication
Rollback: Stop and remove the container to restore state
Last Updated: 06/10/2026
- First publication