Basic idea
vLLM is an inference engine designed to run large language models efficiently. The key idea is maximizing throughput and minimizing memory waste when serving LLMs.
- PagedAttention handles long sequences without running out of GPU memory.
- Continuous batching keeps GPUs fully utilized by adding new requests to batches in progress.
- OpenAI-compatible API allows applications built for OpenAI to switch to vLLM with minimal changes.
What you'll accomplish
Serve a supported model using vLLM on NVIDIA DGX Station. Refer to the table below to see the supported models.
You'll set up vLLM high-throughput LLM serving on NVIDIA DGX Station with Blackwell architecture.
What to know before starting
- Basic Docker container usage
- Familiarity with REST APIs
Prerequisites
- NVIDIA DGX Station with GB300 and RTX 6000 Pro GPUs
- Docker installed:
docker --version - NVIDIA Container Toolkit configured
- HuggingFace account with access token
- Network access to NGC and HuggingFace
Model Support Matrix
The following models are supported with vLLM on DGX Station. All listed models are available and ready to use:
| Model | Quantization | Support Status | HF Handle |
|---|---|---|---|
| Step-3.7-Flash-FP8 | FP8 | ✅ | stepfun-ai/Step-3.7-Flash-FP8 |
| Step-3.7-Flash-NVFP4 | NVFP4 | ✅ | stepfun-ai/Step-3.7-Flash-NVFP4 |
| Qwen3-235B-A22B-NVFP4 | NVFP4 | ✅ | nvidia/Qwen3-235B-A22B-NVFP4 |
Time & risk
- Duration: 30 minutes (longer on first run due to model download)
- Risks: Model download requires HuggingFace authentication
- Rollback: Stop and remove the container to restore state
- Last Updated: 05/28/2026
- Update models