vLLM is an inference engine designed to run large language models efficiently. The key idea is maximizing throughput and minimizing memory waste when serving LLMs.
Serve a supported model using vLLM on NVIDIA DGX Station. Refer to the table below to see the supported models.
You'll set up vLLM high-throughput LLM serving on NVIDIA DGX Station with Blackwell architecture.
docker --versionThe following models are supported with vLLM on DGX Station. All listed models are available and ready to use:
| Model | Quantization | Support Status | HF Handle |
|---|---|---|---|
| DiffusionGemma 26B A4B IT | BF16 | ✅ | google/diffusiongemma-26B-A4B-it |
| DiffusionGemma 26B A4B IT | NVFP4 | ✅ | nvidia/diffusiongemma-26B-A4B-it-NVFP4 |
| Step-3.7-Flash-FP8 | FP8 | ✅ | stepfun-ai/Step-3.7-Flash-FP8 |
| Step-3.7-Flash-NVFP4 | NVFP4 | ✅ | stepfun-ai/Step-3.7-Flash-NVFP4 |
| Qwen3-235B-A22B-NVFP4 | NVFP4 | ✅ | nvidia/Qwen3-235B-A22B-NVFP4 |