Basic idea
vLLM is an inference engine designed to run large language models efficiently. The key idea is maximizing throughput and minimizing memory waste when serving LLMs.
- PagedAttention handles long sequences without running out of GPU memory.
- Continuous batching keeps GPUs fully utilized by adding new requests to batches in progress.
- OpenAI-compatible API allows applications built for OpenAI to switch to vLLM with minimal changes.
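Because the API is OpenAI-compatible, a running vLLM server can be queried with the same request shape an OpenAI client would send. A minimal sketch, assuming the server listens on `localhost:8000` (vLLM's default port) and serves the model under the name shown; the exact model identifier depends on how the server was launched:

```shell
# Send a chat completion request to a vLLM server's OpenAI-compatible endpoint.
# The host, port, and model name are assumptions; match them to your deployment.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-235B-A22B-NVFP4",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
```

Existing applications that target the OpenAI API typically only need their base URL pointed at the vLLM server.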
What you'll accomplish
Serve the Qwen3-235B-A22B-NVFP4 model using vLLM on NVIDIA DGX Station. This 235B parameter model uses NVFP4 quantization and fits entirely in VRAM on the GB300 GPU.
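The serving step above can be sketched as a single `docker run` invocation of vLLM's official OpenAI-compatible server image. This is a sketch, not the definitive command for this guide: the model repository path is an assumption (substitute the actual HuggingFace repo for the NVFP4 checkpoint), and `HF_TOKEN` must hold a valid HuggingFace access token:

```shell
# Launch the vLLM OpenAI-compatible server in Docker, caching downloaded
# weights on the host so later runs skip the download.
# NOTE: the model path "Qwen3-235B-A22B-NVFP4" is illustrative.
docker run --gpus all --rm --name vllm-server \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN="$HF_TOKEN" \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen3-235B-A22B-NVFP4
```

The first run downloads the weights, which dominates the setup time; subsequent runs start from the host cache.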
What to know before starting
- Basic Docker container usage
- Familiarity with REST APIs
Prerequisites
- NVIDIA DGX Station with GB300 and RTX 6000 Pro GPUs
- Docker installed (verify with `docker --version`)
- NVIDIA Container Toolkit configured
- HuggingFace account with access token
- Network access to NGC and HuggingFace
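The prerequisites above can be checked before starting. A quick sketch, assuming a standard setup (the CUDA base image tag is illustrative; any recent `nvidia/cuda` tag works for the toolkit check):

```shell
# Host-level checks.
docker --version          # Docker is installed
nvidia-smi                # GPUs are visible to the host driver

# Confirm the NVIDIA Container Toolkit passes GPUs into containers.
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```

If the last command lists the GPUs from inside the container, the container runtime is ready for vLLM.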
Time & risk
- Duration: 15-20 minutes (longer on first run due to model download)
- Risks: the model download fails without valid HuggingFace authentication, and the weights consume substantial disk space
- Rollback: Stop and remove the container to restore state
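The rollback step above amounts to removing the container and, optionally, the cached weights. A sketch, assuming the container was started with `--name vllm-server` (substitute your container name or ID):

```shell
# Stop and remove the serving container.
docker stop vllm-server
docker rm vllm-server          # not needed if the container ran with --rm

# Optionally reclaim disk space from the host-side HuggingFace cache
# (the exact cache directory name depends on the model repo that was pulled):
# rm -rf ~/.cache/huggingface/hub/models--<org>--<model>
```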
- Last updated: 03/02/2026 (first publication)