vLLM for Inference

Step 1
Set up Docker permissions

If you haven't already, add your user to the docker group to run Docker without sudo:

sudo usermod -aG docker $USER
newgrp docker
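
To confirm the group change took effect, a quick check along these lines can help. This is a minimal sketch: in_docker_group is a hypothetical helper name (not part of Docker), and it only looks at the group names reported by id.

```shell
# Succeeds (exit 0) if "docker" is among the reported group names.
# With no argument it checks the groups active in the current shell;
# with a user name it checks the membership recorded for that user.
in_docker_group() {
  id -nG ${1:+"$1"} 2>/dev/null | tr ' ' '\n' | grep -qx docker
}

# Usage: in_docker_group && echo "docker group is active in this shell"
```

If the check fails in the current shell but succeeds when given your user name, the membership is recorded but not yet active; logging out and back in usually picks it up.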

Step 2
Set up environment variables

Set the following variables so the vLLM container can authenticate with Hugging Face and download the model you want to serve:

# HuggingFace token (required)
# Get a token from https://huggingface.co/settings/tokens
export HF_TOKEN="your_huggingface_token"

# Model to serve (Hugging Face model handle, e.g. google/diffusiongemma-26B-A4B-it)
export MODEL_HANDLE="<HF_HANDLE>"
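
Because a leftover placeholder only surfaces later as a failed download inside the container, it may be worth checking both variables before moving on. The sketch below is one way to do that; check_vllm_env is a hypothetical helper name, and the placeholder strings it looks for are the ones used in this guide.

```shell
# Returns 0 only when HF_TOKEN and MODEL_HANDLE are set to non-placeholder values.
check_vllm_env() {
  vllm_env_rc=0
  if [ -z "${HF_TOKEN:-}" ] || [ "$HF_TOKEN" = "your_huggingface_token" ]; then
    echo "HF_TOKEN is unset or still the placeholder" >&2
    vllm_env_rc=1
  fi
  case "${MODEL_HANDLE:-}" in
    ""|"<"*">")
      echo "MODEL_HANDLE is unset or still a placeholder" >&2
      vllm_env_rc=1
      ;;
  esac
  return "$vllm_env_rc"
}

# Usage: check_vllm_env && echo "environment looks ready"
```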

Step 3
Pull vLLM container image

For DiffusionGemma, pull the custom vLLM container image:

docker pull vllm/vllm-openai:gemma

Step 4
Start vLLM server

Start the vLLM server with the model.

For DiffusionGemma models (e.g. google/diffusiongemma-26B-A4B-it), run the custom vLLM container:

docker run -d \
  --name vllm-server \
  -p 8000:8000 \
  --gpus all \
  --shm-size=16g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -e HF_TOKEN="$HF_TOKEN" \
  -e VLLM_USE_V2_MODEL_RUNNER=1 \
  vllm/vllm-openai:gemma "$MODEL_HANDLE" \
  --gpu-memory-utilization 0.85 \
  --attention-backend TRITON_ATTN \
  --max-num-seqs 8 \
  --diffusion-config '{"canvas_length":256}' \
  --override-generation-config '{"max_new_tokens": null}' \
  --load-format fastsafetensors \
  --enable-prefix-caching \
  --reasoning-parser gemma4 \
  --default-chat-template-kwargs '{"enable_thinking": true}' \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4

# For a BF16 checkpoint, add "--moe-backend triton" to the command above for better performance

Check the server logs for startup progress:

docker logs -f vllm-server

Expected output includes:

Model download progress (first run only)
Model loading into GPU memory
Application startup complete.

Press Ctrl+C to exit the log view once the server is ready. The container was started with -d, so it keeps running in the background.
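
As an alternative to watching the logs, a script can poll the server until it responds. The sketch below assumes the server exposes a /health endpoint on the published port, as the vLLM OpenAI-compatible server normally does; wait_for_vllm is a hypothetical helper name, and the default timeout is an arbitrary choice that may be too short for a first-run model download.

```shell
# Poll the /health endpoint until it answers or the timeout expires.
# Arguments: base URL (default http://localhost:8000), timeout in seconds (default 600).
# The elapsed counter only counts sleep time, so the real wait can run slightly longer.
wait_for_vllm() {
  base_url="${1:-http://localhost:8000}"
  timeout_s="${2:-600}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout_s" ]; do
    if curl -sf -o /dev/null --max-time 5 "$base_url/health"; then
      echo "vLLM server is ready"
      return 0
    fi
    sleep 2
    elapsed=$((elapsed + 2))
  done
  echo "Timed out after ${timeout_s}s waiting for $base_url" >&2
  return 1
}

# Usage: wait_for_vllm http://localhost:8000 900
```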

Step 5
Test the API

Send a test request to verify the server is working:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$MODEL_HANDLE"'",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
    "max_tokens": 256
  }'

The response should contain a choices array with the model's answer.
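
To print only the answer text rather than the full JSON, the response can be piped through a small filter. This is a sketch that assumes python3 is available on the host and that the response follows the standard chat-completions shape (choices[0].message.content); extract_answer is a hypothetical helper name.

```shell
# Read a chat-completions JSON response on stdin and print the first choice's text.
extract_answer() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Usage: add -s to the curl command above and pipe its output:
#   curl -s http://localhost:8000/v1/chat/completions ... | extract_answer
```

If the server returns an error object instead of a completion, the filter fails with a KeyError, which is a reasonable signal to inspect the raw response.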

Step 6
Cleanup

Stop and remove the container:

docker stop vllm-server
docker rm vllm-server

Optionally, remove the image and the cached model, for example:

docker rmi "<docker image name>"
rm -rf "$HOME"/.cache/huggingface/hub/"<downloaded model name>"
