vLLM for Inference

30 MIN

Install and use vLLM on DGX Spark

Configure Docker permissions

To manage containers without sudo, your user must be in the docker group. If you skip this step, you will need to prefix every Docker command with sudo.

Open a new terminal and test Docker access:

docker ps
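If Docker access is already configured, the command prints a (possibly empty) container table:

CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES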

If you see a permission-denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you don't need to run Docker commands with sudo:

sudo usermod -aG docker $USER
newgrp docker
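Note that newgrp docker applies the new group to the current shell only; log out and back in for it to take effect in all sessions. Then verify:

# Should now succeed without a permission error
docker ps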

Pull vLLM container image

Find the latest container build at https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm, then set the environment variables below:

export LATEST_VLLM_VERSION=<latest_container_version>
# example
# export LATEST_VLLM_VERSION=26.02-py3

export HF_MODEL_HANDLE=<HF_HANDLE>
# example
# export HF_MODEL_HANDLE=openai/gpt-oss-20b

docker pull nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION}
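You can confirm the image is available locally before continuing:

# List the pulled vLLM image and its tag
docker images nvcr.io/nvidia/vllm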

For the Gemma 4 model family, use the custom vLLM container:

docker pull vllm/vllm-openai:gemma4-cu130

Test vLLM in container

Launch the container and start the vLLM server with a test model to verify basic functionality.

docker run -it --gpus all -p 8000:8000 \
nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION} \
vllm serve ${HF_MODEL_HANDLE}
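Weights downloaded inside the container are lost when it is removed. As an optional refinement, you can mount your local Hugging Face cache and pass a token for gated models; this is a sketch assuming the container runs as root and honors the standard Hugging Face cache path and HF_TOKEN environment variable:

# Persist model weights across container runs and authenticate to Hugging Face
docker run -it --gpus all -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e HF_TOKEN=${HF_TOKEN} \
nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION} \
vllm serve ${HF_MODEL_HANDLE}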

To run models from the Gemma 4 model family (e.g. google/gemma-4-31B-it), set HF_MODEL_HANDLE accordingly; this image's entrypoint already launches the server, so only the model handle is passed:

docker run -it --gpus all -p 8000:8000 \
vllm/vllm-openai:gemma4-cu130 ${HF_MODEL_HANDLE}

Expected output should include:

  • Model loading confirmation
  • Server startup on port 8000
  • GPU memory allocation details
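Once these messages appear, you can confirm the endpoint is up from a second terminal by listing the served models (vLLM exposes the standard OpenAI-compatible API):

# Should return a JSON model list that includes ${HF_MODEL_HANDLE}
curl http://localhost:8000/v1/models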

In another terminal, test the server:

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "'"${HF_MODEL_HANDLE}"'",
    "messages": [{"role": "user", "content": "12*17"}],
    "max_tokens": 500
}'

The response should contain the computed result, for example "content": "204".
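Assuming jq is installed, you can extract just the reply text from the same request:

# Print only the assistant's message content
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "'"${HF_MODEL_HANDLE}"'",
    "messages": [{"role": "user", "content": "12*17"}],
    "max_tokens": 500
}' | jq -r '.choices[0].message.content'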

Cleanup and rollback

For the container approach (non-destructive):

# Force-remove any containers created from the image, then remove the tagged image
docker rm -f $(docker ps -aq --filter ancestor=nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION})
docker rmi nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION}
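To confirm the cleanup, the following should list no images:

docker images nvcr.io/nvidia/vllm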

Next steps

  • Production deployment: Configure vLLM with your specific model requirements
  • Performance tuning: Adjust batch sizes and memory settings for your workload (see the sketch after this list)
  • Monitoring: Set up logging and metrics collection for production use
  • Model management: Explore additional model formats and quantization options
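For the performance-tuning item, the sketch below shows common vllm serve flags; the values are illustrative placeholders to adjust for your model and workload, not DGX Spark recommendations:

# Illustrative tuning knobs: GPU memory fraction, context length, concurrent sequences
docker run -it --gpus all -p 8000:8000 \
nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION} \
vllm serve ${HF_MODEL_HANDLE} \
--gpu-memory-utilization 0.80 \
--max-model-len 8192 \
--max-num-seqs 64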