Serve Qwen3-235B with vLLM

Estimated time: 20 minutes

Set up a vLLM server running Qwen3-235B on a DGX Station.

Common issues

| Symptom | Cause | Fix |
| --- | --- | --- |
| "permission denied" when running `docker` | User not in the `docker` group | Run `sudo usermod -aG docker $USER && newgrp docker` |
| Container fails to start with a GPU error | NVIDIA Container Toolkit not configured | Run `nvidia-ctk runtime configure --runtime=docker` and restart Docker |
| "Token is required" or 401 error | Missing HuggingFace token | Ensure `HF_TOKEN` is exported before running the `docker` command |
| Model download hangs or fails | Network or authentication issue | Check the internet connection and verify that `HF_TOKEN` is valid |
| CUDA out of memory | Context length too large | Reduce `MAX_MODEL_LEN` or lower `--gpu-memory-utilization` |
| Server not responding on port 8000 | Port already in use | Check with `lsof -i :8000`; use `-p 8001:8000` to map a different host port |
| Model runs on the wrong GPU | Default GPU selection | Use `--gpus '"device=0"'` to select a specific GPU |
| NGC authentication fails | Invalid or missing credentials | Run `docker login nvcr.io` with an NGC API key |
| EngineCore failed / FlashInfer "Buffer overflow when allocating memory for batch_prefill_tmp_v" | Known issue with vLLM 25.10 on some DGX Station setups during CUDA graph capture | Use the 26.01 container image, `nvcr.io/nvidia/vllm:26.01-py3`, instead of 25.10 |
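Several of the fixes above can be combined into a single launch command. The sketch below assumes the container accepts a `vllm serve` command, and the specific flag values (`--max-model-len 8192`, `--gpu-memory-utilization 0.90`) and host port are illustrative, not prescriptive:

```shell
# Sketch of a launch command combining the fixes from the table above;
# adjust model ID, ports, and flag values for your setup.

# HuggingFace token must be exported first (see the "Token is required" row).
export HF_TOKEN=<your-hf-token>

# Authenticate to NGC so the container image can be pulled
# (see the "NGC authentication fails" row).
docker login nvcr.io

# Pin to GPU 0, forward the token, and map host port 8001 in case 8000 is busy.
# The 26.01 image avoids the FlashInfer buffer-overflow issue noted above.
docker run --rm --gpus '"device=0"' \
  -e HF_TOKEN \
  -p 8001:8000 \
  nvcr.io/nvidia/vllm:26.01-py3 \
  vllm serve Qwen/Qwen3-235B-A22B \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90
```

Once the server is up, `curl http://localhost:8001/v1/models` should list the served model, confirming the port mapping works.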