Set up Docker permissions
If you haven't already, add your user to the docker group so you can run Docker without sudo. Note that newgrp docker applies the new membership to the current shell only; other sessions need you to log out and back in:
sudo usermod -aG docker $USER
newgrp docker
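To confirm the group change is active in your current shell before running any Docker commands, you can check the session's group list:

```shell
# Verify the change took effect in this shell; `id -nG` lists the
# groups active for the current session.
if id -nG | grep -qw docker; then
  echo "docker group active"
else
  echo "open a new shell or run 'newgrp docker' first"
fi
```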
Set up environment variables
Set the following so the vLLM container can download the model and use your chosen context length:
# HuggingFace token (required)
# Get a token from https://huggingface.co/settings/tokens
export HF_TOKEN="your_huggingface_token"
# Model to serve
export MODEL_HANDLE="nvidia/Qwen3-235B-A22B-NVFP4"
# Maximum context length
export MAX_MODEL_LEN=8192
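Before starting the container, you can fail fast if any of the variables above are missing. A minimal sketch, using a hypothetical check_env helper (requires bash for the ${!var} indirect expansion):

```shell
# Hypothetical helper: report any listed variables that are unset or
# empty; returns nonzero if anything is missing.
check_env() {
  local missing=0 var
  for var in "$@"; do
    if [ -z "${!var}" ]; then
      echo "ERROR: $var is not set" >&2
      missing=1
    fi
  done
  return $missing
}

if check_env HF_TOKEN MODEL_HANDLE MAX_MODEL_LEN; then
  echo "environment OK"
fi
```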
Pull vLLM container image
Pull the vLLM container from NGC. Use the 26.01 image on DGX Station; the 25.10 image can fail during engine startup with a FlashInfer buffer overflow on some configurations.
docker pull nvcr.io/nvidia/vllm:26.01-py3
Start vLLM server
Start the vLLM server with the Qwen3-235B model. This model fits entirely in VRAM on the GB300. On a single-GPU DGX Station, --gpus all uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with --gpus '"device=N"' where N is the GB300 device ID from nvidia-smi.
docker run -d \
--name vllm-server \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
nvcr.io/nvidia/vllm:26.01-py3 \
vllm serve "$MODEL_HANDLE" \
  --max-model-len "$MAX_MODEL_LEN" \
--gpu-memory-utilization 0.9
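Rather than watching the logs, you can poll until the server answers on the /health route exposed by vLLM's OpenAI-compatible server. This is a sketch with a hypothetical wait_for_server helper, not part of vLLM itself:

```shell
# Hypothetical helper: poll a URL until it answers or we give up.
wait_for_server() {
  local url=$1 attempts=${2:-60} i
  for i in $(seq 1 "$attempts"); do
    if curl -sf "$url" > /dev/null; then
      echo "server is ready"
      return 0
    fi
    sleep 5   # model load can take a while; poll gently
  done
  echo "server did not become ready" >&2
  return 1
}

# Against the container started above (raise attempts for slow first-run downloads):
# wait_for_server http://localhost:8000/health 120
```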
Check the server logs for startup progress:
docker logs -f vllm-server
Expected output includes:
- Model download progress (first run only)
- Model loading into GPU memory
- A final Uvicorn running on http://0.0.0.0:8000 line once the server is ready
Press Ctrl+C to exit log view once the server is ready.
Test the API
Send a test request to verify the server is working:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "'"$MODEL_HANDLE"'",
"messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
"max_tokens": 256
}'
The response should contain a choices array with the model's answer.
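To pull just the model's text out of that response, here is a small sketch using python3's stdlib json module; extract_answer is a hypothetical helper, not part of vLLM:

```shell
# Hypothetical helper: read a chat-completions JSON response on stdin
# and print the assistant message text.
extract_answer() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Usage with the curl request above:
# curl -s http://localhost:8000/v1/chat/completions ... | extract_answer
```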
Cleanup
Stop and remove the container:
docker stop vllm-server
docker rm vllm-server
Optionally, remove the image and cached model:
docker rmi nvcr.io/nvidia/vllm:26.01-py3
rm -rf "$HOME/.cache/huggingface/hub/models--nvidia--Qwen3-235B-A22B-NVFP4"