Serve Qwen3-235B with vLLM

20 MIN

Set up vLLM server with Qwen3-235B on DGX Station

Set up Docker permissions

If you haven't already, add your user to the docker group to run Docker without sudo:

sudo usermod -aG docker $USER
newgrp docker
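The group change only applies to new login shells (or the newgrp session). As a quick convenience check, not part of the official setup, you can verify the current shell sees the docker group:

```shell
# Check whether the current shell already has the docker group.
# If not, log out and back in, or run 'newgrp docker' as above.
if id -nG | tr ' ' '\n' | grep -qx docker; then
  echo "docker group active"
else
  echo "docker group not active yet"
fi
```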

Set up environment variables

Set the following so the vLLM container can download the model and use your chosen context length:

# HuggingFace token (required)
# Get a token from https://huggingface.co/settings/tokens
export HF_TOKEN="your_huggingface_token"

# Model to serve
export MODEL_HANDLE="nvidia/Qwen3-235B-A22B-NVFP4"

# Maximum context length
export MAX_MODEL_LEN=8192
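Before starting the container, it can help to fail fast if any of these variables is missing. A minimal pre-flight sketch (check_env is our own helper name, not part of vLLM):

```shell
# Hypothetical pre-flight check: verify the variables above are set and
# non-empty before launching the container. check_env is our own helper.
check_env() {
  local var
  for var in HF_TOKEN MODEL_HANDLE MAX_MODEL_LEN; do
    if [ -z "${!var}" ]; then
      echo "ERROR: $var is not set" >&2
      return 1
    fi
  done
  echo "all required variables are set"
}
```

Run check_env after the exports above; it prints an error and returns non-zero if anything is unset.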

Pull vLLM container image

Pull the vLLM container from NGC. Use the 26.01 image on DGX Station; the 25.10 image can fail during engine startup with a FlashInfer buffer overflow on some configurations.

docker pull nvcr.io/nvidia/vllm:26.01-py3

Start vLLM server

Start the vLLM server with the Qwen3-235B model; it fits entirely in GPU memory on the GB300. On a single-GPU DGX Station, --gpus all selects the GB300. If the system has multiple GPUs and you want to pin the server to the GB300, replace --gpus all with --gpus '"device=N"', where N is the GB300's device ID as reported by nvidia-smi.

docker run -d \
  --name vllm-server \
  --gpus all \
  --ipc host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
  nvcr.io/nvidia/vllm:26.01-py3 \
  vllm serve "$MODEL_HANDLE" \
    --max-model-len "$MAX_MODEL_LEN" \
    --gpu-memory-utilization 0.9

Check the server logs for startup progress:

docker logs -f vllm-server

Expected output includes:

  • Model download progress (first run only)
  • Model loading into GPU memory
  • Uvicorn running on http://0.0.0.0:8000

Press Ctrl+C to exit the log view once the server is ready.
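Instead of watching the logs by hand, you can poll the server's OpenAI-compatible /v1/models endpoint until it answers. A sketch of such a wait loop (wait_for_vllm is our own helper name; the endpoint itself is part of vLLM's OpenAI-compatible API):

```shell
# Hypothetical helper: poll /v1/models until the server answers or a
# timeout expires. wait_for_vllm is our own name, not a vLLM command.
wait_for_vllm() {
  local url="${1:-http://localhost:8000/v1/models}"
  local timeout="${2:-600}"   # seconds; large models can take minutes to load
  local start elapsed
  start=$(date +%s)
  until curl -sf "$url" > /dev/null; do
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for $url" >&2
      return 1
    fi
    sleep 2
  done
  echo "vLLM server is ready"
}
```

Call wait_for_vllm with no arguments to wait up to 10 minutes for the default local endpoint.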

Test the API

Send a test request to verify the server is working:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$MODEL_HANDLE"'",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
    "max_tokens": 256
  }'

The response should contain a choices array with the model's answer.
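To pull just the answer text out of that JSON, you can pipe the response through jq (assuming jq is installed). The sample response below is a simplified stand-in containing only the fields the query touches:

```shell
# Extract the assistant's reply from a chat-completions response.
# The JSON here is a simplified stand-in for a real server response.
response='{"choices":[{"message":{"role":"assistant","content":"Quantum computing uses qubits."}}]}'
echo "$response" | jq -r '.choices[0].message.content'
```

In practice, append | jq -r '.choices[0].message.content' to the curl command above.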

Cleanup

Stop and remove the container:

docker stop vllm-server
docker rm vllm-server

Optionally, remove the image and cached model:

docker rmi nvcr.io/nvidia/vllm:26.01-py3
rm -rf "$HOME/.cache/huggingface/hub/models--nvidia--Qwen3-235B-A22B-NVFP4"