Serve Qwen3-235B with vLLM

20 MIN

Set up a vLLM server with Qwen3-235B on DGX Station

Inference · vLLM

Basic idea

vLLM is an inference engine designed to run large language models efficiently. Its core ideas are to maximize throughput and minimize memory waste when serving LLMs:

  • PagedAttention stores the KV cache in fixed-size blocks, so long sequences don't run the GPU out of memory.
  • Continuous batching keeps GPUs fully utilized by adding new requests to batches in progress.
  • OpenAI-compatible API allows applications built for OpenAI to switch to vLLM with minimal changes.
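
As a quick illustration of that last point, a running vLLM server can be exercised with plain curl against its OpenAI-compatible endpoint. This is a minimal sketch: the port (8000, vLLM's default) and the model ID are assumptions and must match however the server was actually launched.

    # Assumes a server is already listening on vLLM's default port 8000;
    # the model ID below is a placeholder and must match the served model.
    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "Qwen3-235B-A22B-NVFP4",
            "messages": [{"role": "user", "content": "Say hello."}],
            "max_tokens": 64
          }'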

What you'll accomplish

Serve the Qwen3-235B-A22B-NVFP4 model using vLLM on NVIDIA DGX Station. This 235B-parameter model uses NVFP4 quantization and fits entirely in VRAM on the GB300 GPU.
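
A launch along these lines is sketched below, assuming the upstream vllm/vllm-openai container image; the Serve Qwen3-235B section of this playbook has the authoritative DGX Station steps, and the exact HuggingFace repository ID for the NVFP4 checkpoint may differ.

    # Sketch only: image tag, model ID, and cache path are assumptions.
    docker run --rm --gpus all --ipc=host \
      -p 8000:8000 \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      -e HF_TOKEN=<your HuggingFace access token> \
      vllm/vllm-openai:latest \
      --model nvidia/Qwen3-235B-A22B-NVFP4

The first launch downloads the weights into the mounted cache, which is why the Time & risk section below flags a longer first run; later starts reuse the cache.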

What to know before starting

  • Basic Docker container usage
  • Familiarity with REST APIs

Prerequisites

  • NVIDIA DGX Station with GB300 and RTX 6000 Pro GPUs
  • Docker installed: docker --version
  • NVIDIA Container Toolkit configured
  • HuggingFace account with access token
  • Network access to NGC and HuggingFace
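
The one-liners below are one way to confirm these prerequisites before pulling any large images; the last command is the sample workload from the NVIDIA Container Toolkit documentation.

    docker --version     # Docker is installed
    nvidia-smi           # the GB300 and RTX 6000 Pro GPUs are visible to the driver
    # The toolkit should expose GPUs inside a container:
    docker run --rm --gpus all ubuntu nvidia-smi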

Time & risk

  • Duration: 15-20 minutes (longer on first run due to model download)
  • Risks: Model download requires HuggingFace authentication
  • Rollback: Stop and remove the container to restore state (commands sketched after this list)
  • Last Updated: 03/02/2026
    • First Publication
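
For the rollback noted above, stopping and removing the serving container is enough. The container name here is hypothetical, so look yours up with docker ps first; a container started with --rm removes itself when stopped.

    docker ps                  # find the serving container's name or ID
    docker stop vllm-server    # "vllm-server" is a hypothetical name
    docker rm vllm-server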

Resources

  • vLLM Documentation