Serve Qwen3-235B with vLLM

20 MIN

Set up vLLM server with Qwen3-235B on DGX Station

Inference · vLLM

Troubleshooting

Common issues
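Several of the fixes in the table below are just flags on the same `docker run` invocation. The following is a hedged sketch combining them, not the playbook's exact command: the model repository name (`Qwen/Qwen3-235B-A22B`), the `vllm serve` entrypoint, and the numeric values are assumptions to adjust for your setup.

```shell
# Required for gated model downloads ("Token is required" / 401 errors).
# Replace the placeholder with your actual HuggingFace token.
export HF_TOKEN=<your-huggingface-token>

docker run --rm \
  --gpus '"device=0"' \
  -e HF_TOKEN="$HF_TOKEN" \
  -p 8001:8000 \
  nvcr.io/nvidia/vllm:26.01-py3 \
  vllm serve Qwen/Qwen3-235B-A22B \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

# --gpus '"device=0"'   pins the container to a specific GPU
# -p 8001:8000          remaps the API port if host port 8000 is taken
# 26.01-py3             avoids the FlashInfer buffer-overflow issue seen in 25.10
# --max-model-len and --gpu-memory-utilization are the two out-of-memory levers
```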

| Symptom | Cause | Fix |
| --- | --- | --- |
| "permission denied" when running docker | User not in the docker group | Run `sudo usermod -aG docker $USER && newgrp docker` |
| Container fails to start with a GPU error | NVIDIA Container Toolkit not configured | Run `nvidia-ctk runtime configure --runtime=docker` and restart Docker |
| "Token is required" or 401 error | Missing HuggingFace token | Ensure `HF_TOKEN` is exported before running the docker command |
| Model download hangs or fails | Network or authentication issue | Check the internet connection and verify that `HF_TOKEN` is valid |
| CUDA out of memory | Context length too large | Reduce `MAX_MODEL_LEN` or lower `--gpu-memory-utilization` |
| Server not responding on port 8000 | Port already in use | Check with `lsof -i :8000`; use `-p 8001:8000` to map a different port |
| Model runs on the wrong GPU | Default GPU selection | Use `--gpus '"device=0"'` to select a specific GPU |
| NGC authentication fails | Invalid or missing credentials | Run `docker login nvcr.io` with an NGC API key |
| EngineCore failure / FlashInfer "Buffer overflow when allocating memory for batch_prefill_tmp_v" | Known issue with vLLM 25.10 on some DGX Station setups during CUDA graph capture | Use the 26.01 container image (`nvcr.io/nvidia/vllm:26.01-py3`) instead of 25.10 |
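The "CUDA out of memory" row works because the KV cache grows linearly with context length, so lowering `MAX_MODEL_LEN` directly shrinks the per-sequence memory reservation. A back-of-the-envelope sketch; the model-shape numbers are assumptions (roughly the published Qwen3-235B-A22B configuration), not values from this playbook:

```python
# Estimate per-sequence KV-cache memory as a function of max_model_len.
# num_layers, num_kv_heads, and head_dim below are assumed values for
# illustration; check the model's config.json for the real ones.

def kv_cache_bytes(max_model_len: int,
                   num_layers: int = 94,
                   num_kv_heads: int = 4,
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    """KV cache for one sequence: a K and a V tensor per layer, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * max_model_len

if __name__ == "__main__":
    for ctx in (8_192, 32_768, 131_072):
        gib = kv_cache_bytes(ctx) / 2**30
        print(f"max_model_len={ctx:>7}: ~{gib:.1f} GiB of KV cache per sequence")
```

With these assumed shapes, cutting the context from 131k to 32k tokens frees roughly three quarters of the per-sequence KV-cache budget, which is why it is the first OOM lever to try.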

Resources

  • vLLM Documentation
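Once the container is up, the quickest check for the "server not responding" and port-mapping rows is a request against vLLM's OpenAI-compatible API. A minimal stdlib-only smoke test, assuming the default `localhost:8000` (adjust if you remapped the port); the model name passed in the payload is an assumption, and `GET /v1/models` will show what the server actually loaded:

```python
# Smoke-test a vLLM server via its OpenAI-compatible chat endpoint.
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # change if you mapped a different host port

def build_chat_request(model: str, prompt: str, max_tokens: int = 32) -> dict:
    """Payload for POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(model: str, prompt: str) -> str:
    """Send one chat completion and return the assistant's reply text."""
    body = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Model name is an assumption; list loaded models with GET /v1/models.
    print(chat("Qwen/Qwen3-235B-A22B", "Say hello in one short sentence."))
```

A timeout or connection-refused error here points back to the port and container rows in the table above rather than to the model itself.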

Copyright © 2026 NVIDIA Corporation