Basic idea
vLLM is an inference engine designed to run large language models efficiently. The key idea is maximizing throughput and minimizing memory waste when serving LLMs.
- PagedAttention handles long sequences without running out of GPU memory.
- Continuous batching keeps GPUs fully utilized by adding new requests to batches in progress.
- OpenAI-compatible API allows applications built for OpenAI to switch to vLLM with minimal changes.
What you'll accomplish
Serve a supported model using vLLM on NVIDIA RTX Pro 6000. See the list of supported models below.
What to know before starting
- Basic Docker container usage
- Familiarity with REST APIs
Prerequisites
- NVIDIA RTX Pro 6000 with Ubuntu 22.04 or 24.04 host
- Docker installed:
docker --version
- NVIDIA Container Toolkit configured
- HuggingFace account with access token
- Network access to NGC and HuggingFace
Model Support Matrix
The following models are supported with vLLM on RTX Pro 6000. All listed models are available and ready to use:
Time & risk
- Duration: 30 minutes (longer on first run due to model download)
- Risks: Model download requires HuggingFace authentication
- Rollback: Stop and remove the container to restore state
- Last Updated: 06/10/2026