vLLM for Inference

30 MIN

Install and use vLLM on DGX Station

Step 1
Set up Docker permissions

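If you haven't already, add your user to the docker group so you can run Docker without sudo. The newgrp command applies the new group in the current shell only; other sessions pick it up after you log out and back in.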

sudo usermod -aG docker $USER
newgrp docker
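
To confirm that the group change took effect in the current shell, a check along these lines may help. This is a minimal sketch: has_docker_group is a hypothetical helper name, not part of Docker, and it only inspects the group list printed by id -nG.

```shell
# Hypothetical helper: succeeds if the space-separated group list in $1
# contains a group named exactly "docker".
has_docker_group() {
  local g
  for g in $1; do
    [ "$g" = "docker" ] && return 0
  done
  return 1
}

# id -nG prints the current shell's group names, space-separated.
if has_docker_group "$(id -nG)"; then
  echo "docker group active: Docker should work without sudo"
else
  echo "docker group not active in this shell: run 'newgrp docker' or log out and back in"
fi
```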

Step 2
Set up environment variables

Set the following environment variables so the vLLM container can download the model and use your chosen context length:

# Hugging Face token (required)
# Get a token from https://huggingface.co/settings/tokens
export HF_TOKEN="your_huggingface_token"

# Model to serve
export MODEL_HANDLE="<HF_HANDLE>"

# Maximum context length
export MAX_MODEL_LEN=8192
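
Before starting the container, it can be worth checking that none of these variables is empty or still set to its placeholder. The following is a sketch under that assumption; check_vllm_env is a hypothetical helper, and the placeholder strings it compares against are the ones shown in this step.

```shell
# Hypothetical helper: returns 0 only if the three variables from this step look usable.
check_vllm_env() {
  local ok=0
  if [ -z "${HF_TOKEN:-}" ] || [ "${HF_TOKEN:-}" = "your_huggingface_token" ]; then
    echo "HF_TOKEN is unset or still the placeholder" >&2
    ok=1
  fi
  if [ -z "${MODEL_HANDLE:-}" ] || [ "${MODEL_HANDLE:-}" = "<HF_HANDLE>" ]; then
    echo "MODEL_HANDLE is unset or still the placeholder" >&2
    ok=1
  fi
  # Reject empty values, anything containing a non-digit, and leading zeros.
  case "${MAX_MODEL_LEN:-}" in
    ''|*[!0-9]*|0*)
      echo "MAX_MODEL_LEN must be a positive integer" >&2
      ok=1
      ;;
  esac
  return "$ok"
}

if check_vllm_env; then
  echo "environment looks ready"
else
  echo "fix the variables above before starting the server"
fi
```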

Step 3
Pull vLLM container image

Pull the vLLM container from NGC. Use the 26.01 image on DGX Station; the 25.10 image can fail during engine startup with a FlashInfer buffer overflow on some configurations.

docker pull nvcr.io/nvidia/vllm:26.01-py3

For Step-3.7-Flash models, pull the custom vLLM container instead:

docker pull vllm/vllm-openai:stepfun37

Step 4
Start vLLM server

Start the vLLM server with the model. On a single-GPU DGX Station, --gpus all uses the GB300. If the system has multiple GPUs and you want to use only the GB300, replace --gpus all with --gpus '"device=N"', where N is the GB300 device index reported by nvidia-smi.

For the Qwen3-235B NVFP4 model, run the NGC container. This model fits entirely in VRAM on the GB300.
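
On a multi-GPU system, the device index can be looked up from the nvidia-smi query output rather than read off by eye. The sketch below assumes the GPU name reported by nvidia-smi contains the substring "GB300"; find_gpu_index is a hypothetical helper, and the sample lines in the demo are made up so the block runs without a GPU.

```shell
# Hypothetical helper: read "index, name" lines on stdin and print the index
# of the first GPU whose name contains the substring in $1 (case-insensitive).
find_gpu_index() {
  awk -F', ' -v pat="$1" 'index(tolower($2), tolower(pat)) { print $1; exit }'
}

# Usage on a real system (needs the NVIDIA driver tools):
#   N=$(nvidia-smi --query-gpu=index,name --format=csv,noheader | find_gpu_index GB300)
#   docker run --gpus "\"device=$N\"" ...

# Demo with made-up sample lines:
printf '0, NVIDIA Example GPU\n1, NVIDIA GB300\n' | find_gpu_index GB300   # prints 1
```

If the helper prints nothing, no GPU name matched; check the names in the nvidia-smi output and adjust the substring.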

docker run -d \
  --name vllm-server \
  --gpus all \
  --ipc host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
  nvcr.io/nvidia/vllm:26.01-py3 \
  vllm serve "$MODEL_HANDLE" \
    --max-model-len "$MAX_MODEL_LEN" \
    --gpu-memory-utilization 0.9

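For Step-3.7-Flash models, run the custom vLLM container. Both the FP8 and the NVFP4 versions fit entirely in VRAM on the GB300. This command does not pass --max-model-len, so the MAX_MODEL_LEN value from Step 2 is not used here.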

docker run -d \
  --name vllm-server \
  --gpus all \
  --ipc host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
  vllm/vllm-openai:stepfun37 \
  "$MODEL_HANDLE" \
    --gpu-memory-utilization 0.95 \
    --trust-remote-code \
    --reasoning-parser step3p5 \
    --enable-auto-tool-choice \
    --tool-call-parser step3p5 \
    --kv-cache-dtype fp8

Check the server logs for startup progress:

docker logs -f vllm-server

Expected output includes:

  • Model download progress (first run only)
  • Model loading into GPU memory
  • Application startup complete.

Press Ctrl+C to exit log view once the server is ready.
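
Instead of watching the logs, a script can poll the server until it responds. This is a minimal sketch: wait_for_http is a hypothetical helper, and the usage line assumes the /health endpoint of the vLLM OpenAI-compatible server is reachable through the port mapping from this step.

```shell
# Hypothetical helper: poll a URL until it answers with an HTTP success status.
# $1 = URL, $2 = max attempts (default 60), $3 = seconds between attempts (default 5)
wait_for_http() {
  local url="$1"
  local tries="${2:-60}"
  local delay="${3:-5}"
  local i=1
  while [ "$i" -le "$tries" ]; do
    # -f makes curl exit non-zero on HTTP errors; -s silences progress output
    if curl -fs -o /dev/null "$url"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $tries attempts" >&2
  return 1
}

# Usage (large models can take a while to download and load, so allow plenty of attempts):
#   wait_for_http http://localhost:8000/health 120 5
```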

Step 5
Test the API

Send a test request to verify the server is working:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$MODEL_HANDLE"'",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
    "max_tokens": 256
  }'

The response should contain a choices array with the model's answer.

Step 6
Cleanup

Stop and remove the container:

docker stop vllm-server
docker rm vllm-server

Optionally, remove the image and the cached model. For example:

docker rmi "<docker image name>"
rm -rf "$HOME/.cache/huggingface/hub/<downloaded model name>"
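
The Hugging Face hub cache stores each model in a directory named models--<org>--<name>, so the directory to delete can be derived from the model handle. The sketch below shows that mapping; hf_cache_dir_name is a hypothetical helper and the handle in the demo is made up. Verify the resulting path with ls before passing it to rm -rf.

```shell
# Hypothetical helper: map a Hugging Face model handle (org/name) to the
# directory name the hub cache uses for it (models--org--name).
hf_cache_dir_name() {
  printf 'models--%s\n' "$(printf '%s' "$1" | sed 's|/|--|g')"
}

# Demo with a made-up handle:
hf_cache_dir_name "example-org/example-model"   # prints models--example-org--example-model

# To inspect the size of the cached model before deleting it:
#   du -sh "$HOME/.cache/huggingface/hub/$(hf_cache_dir_name "$MODEL_HANDLE")"
```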

Resources

  • vLLM Documentation

Copyright © 2026 NVIDIA Corporation