Serve Qwen3-235B with vLLM

20 MIN

Set up vLLM server with Qwen3-235B on DGX Station

Inference · vLLM

Troubleshooting

Common issues
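Several of the fixes in the table below are just flags on the same `docker run` invocation. The following is a hedged sketch combining them, not the playbook's exact command: the model repository name (`Qwen/Qwen3-235B-A22B`), the `vllm serve` entrypoint, and the numeric values are assumptions to adjust for your setup.

```shell
# Required for gated model downloads ("Token is required" / 401 errors).
# Replace the placeholder with your actual HuggingFace token.
export HF_TOKEN=<your-huggingface-token>

docker run --rm \
  --gpus '"device=0"' \
  -e HF_TOKEN="$HF_TOKEN" \
  -p 8001:8000 \
  nvcr.io/nvidia/vllm:26.01-py3 \
  vllm serve Qwen/Qwen3-235B-A22B \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

# --gpus '"device=0"'   pins the container to a specific GPU
# -p 8001:8000          remaps the API port if host port 8000 is taken
# 26.01-py3             avoids the FlashInfer buffer-overflow issue seen in 25.10
# --max-model-len and --gpu-memory-utilization are the two out-of-memory levers
```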

| Symptom | Cause | Fix |
| --- | --- | --- |
| "permission denied" when running docker | User not in the docker group | Run `sudo usermod -aG docker $USER && newgrp docker` |
| Container fails to start with a GPU error | NVIDIA Container Toolkit not configured | Run `nvidia-ctk runtime configure --runtime=docker` and restart Docker |
| "Token is required" or 401 error | Missing HuggingFace token | Ensure `HF_TOKEN` is exported before running the docker command |
| Model download hangs or fails | Network or authentication issue | Check the internet connection and verify that `HF_TOKEN` is valid |
| CUDA out of memory | Context length too large | Reduce `MAX_MODEL_LEN` or lower `--gpu-memory-utilization` |
| Server not responding on port 8000 | Port already in use | Check with `lsof -i :8000`; use `-p 8001:8000` to map a different port |
| Model runs on the wrong GPU | Default GPU selection | Use `--gpus '"device=0"'` to select a specific GPU |
| NGC authentication fails | Invalid or missing credentials | Run `docker login nvcr.io` with an NGC API key |
| EngineCore failure / FlashInfer "Buffer overflow when allocating memory for batch_prefill_tmp_v" | Known issue with vLLM 25.10 on some DGX Station setups during CUDA graph capture | Use the 26.01 container image (`nvcr.io/nvidia/vllm:26.01-py3`) instead of 25.10 |
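The "CUDA out of memory" row works because the KV cache grows linearly with context length, so lowering `MAX_MODEL_LEN` directly shrinks the per-sequence memory reservation. A back-of-the-envelope sketch; the model-shape numbers are assumptions (roughly the published Qwen3-235B-A22B configuration), not values from this playbook:

```python
# Estimate per-sequence KV-cache memory as a function of max_model_len.
# num_layers, num_kv_heads, and head_dim below are assumed values for
# illustration; check the model's config.json for the real ones.

def kv_cache_bytes(max_model_len: int,
                   num_layers: int = 94,
                   num_kv_heads: int = 4,
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    """KV cache for one sequence: a K and a V tensor per layer, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * max_model_len

if __name__ == "__main__":
    for ctx in (8_192, 32_768, 131_072):
        gib = kv_cache_bytes(ctx) / 2**30
        print(f"max_model_len={ctx:>7}: ~{gib:.1f} GiB of KV cache per sequence")
```

With these assumed shapes, cutting the context from 131k to 32k tokens frees roughly three quarters of the per-sequence KV-cache budget, which is why it is the first OOM lever to try.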

Resources

  • vLLM Documentation
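Once the container is up, the quickest check for the "server not responding" and port-mapping rows is a request against vLLM's OpenAI-compatible API. A minimal stdlib-only smoke test, assuming the default `localhost:8000` (adjust if you remapped the port); the model name passed in the payload is an assumption, and `GET /v1/models` will show what the server actually loaded:

```python
# Smoke-test a vLLM server via its OpenAI-compatible chat endpoint.
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # change if you mapped a different host port

def build_chat_request(model: str, prompt: str, max_tokens: int = 32) -> dict:
    """Payload for POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(model: str, prompt: str) -> str:
    """Send one chat completion and return the assistant's reply text."""
    body = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Model name is an assumption; list loaded models with GET /v1/models.
    print(chat("Qwen/Qwen3-235B-A22B", "Say hello in one short sentence."))
```

A timeout or connection-refused error here points back to the port and container rows in the table above rather than to the model itself.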

Copyright © 2026 NVIDIA Corporation