Serve Qwen3-235B with vLLM

20 MIN

Set up vLLM server with Qwen3-235B on DGX Station

Basic idea

vLLM is an open-source inference engine for running large language models efficiently. Its design centers on maximizing throughput and minimizing GPU memory waste when serving LLMs.

  • PagedAttention stores the KV cache in fixed-size blocks, so long sequences don't fragment or exhaust GPU memory.
  • Continuous batching keeps GPUs fully utilized by adding new requests to batches in progress.
  • OpenAI-compatible API allows applications built for OpenAI to switch to vLLM with minimal changes.
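Because the API surface is OpenAI-compatible, a standard chat-completions request works unchanged. A minimal sketch, assuming the server from this guide is listening on localhost:8000 and the model is registered under the name shown (both are assumptions; adjust to your deployment):

```shell
# Build the request payload (model name is an assumption; match whatever
# name the vLLM server registered the model under).
cat > /tmp/qwen_request.json <<'EOF'
{
  "model": "nvidia/Qwen3-235B-A22B-NVFP4",
  "messages": [{"role": "user", "content": "Say hello in one sentence."}],
  "max_tokens": 64
}
EOF

# POST it to vLLM's OpenAI-compatible chat-completions endpoint.
# The fallback message just flags a server that isn't up yet.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @/tmp/qwen_request.json \
  || echo "vLLM server not reachable yet"
```

Any OpenAI client library can be pointed at the same endpoint by overriding its base URL.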

What you'll accomplish

Serve the Qwen3-235B-A22B-NVFP4 model using vLLM on NVIDIA DGX Station. This mixture-of-experts model has 235B total parameters with 22B active per token; NVFP4 quantization lets it fit entirely in the GB300 GPU's VRAM.
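A sketch of the launch command, assuming the upstream vllm/vllm-openai image and the nvidia/Qwen3-235B-A22B-NVFP4 HuggingFace repo; the NGC image name used on DGX Station, the exact model repo, and the container name may all differ in your environment:

```shell
# Launch vLLM's OpenAI-compatible server in Docker (image tag, model
# repo, and container name are assumptions; adjust for your setup).
# The cache mount avoids re-downloading weights on restart.
docker run --gpus all -d --name vllm-qwen3 \
  -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  vllm/vllm-openai:latest \
  --model nvidia/Qwen3-235B-A22B-NVFP4
```

Arguments after the image name are passed through to the vLLM server, so additional serving flags can be appended the same way.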

What to know before starting

  • Basic Docker container usage
  • Familiarity with REST APIs

Prerequisites

  • NVIDIA DGX Station with GB300 and RTX 6000 Pro GPUs
  • Docker installed (verify with docker --version)
  • NVIDIA Container Toolkit configured
  • HuggingFace account with access token
  • Network access to NGC and HuggingFace
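The prerequisites above can be checked up front; a pre-flight sketch, assuming a pullable CUDA base image (the tag shown is an assumption; any CUDA image works for the GPU check):

```shell
# Verify Docker is installed.
docker --version

# Verify the NVIDIA Container Toolkit can expose the GPUs to containers.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

# Verify a HuggingFace token is available for the model download.
[ -n "$HF_TOKEN" ] && echo "HF_TOKEN is set" || echo "HF_TOKEN is missing"
```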

Time & risk

  • Duration: 15-20 minutes (longer on first run due to model download)
  • Risks: the model download fails without valid HuggingFace authentication; the first download is large
  • Rollback: Stop and remove the container to restore state
  • Last Updated: 03/02/2026
    • First Publication
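The rollback step amounts to stopping and removing the serving container; a sketch assuming it was started with the name vllm-qwen3 (an assumption; substitute your container name):

```shell
# Stop and remove the container to restore the host's state. Model
# weights cached under ~/.cache/huggingface can also be deleted
# separately to reclaim disk space.
docker stop vllm-qwen3
docker rm vllm-qwen3
```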