If you haven't already, add your user to the docker group to run Docker without sudo:
sudo usermod -aG docker $USER
newgrp docker
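To confirm that the group change is active in the current shell, a check along these lines may help. This is a minimal sketch: `has_group` is a hypothetical helper name, and it only inspects the groups that `id -nG` reports for the current session.

```shell
# has_group LIST NAME: succeed if NAME appears as a whole word in the
# space-separated LIST (hypothetical helper, not part of Docker).
has_group() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# id -nG prints the group names of the current session.
if has_group "$(id -nG)" docker; then
  echo "docker group is active in this shell"
else
  echo "docker group not active yet; run 'newgrp docker' or log in again"
fi
```

Matching on whole words avoids a false positive from a group whose name merely contains "docker".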
Set the following environment variables so the vLLM container can download and serve the model:
# HuggingFace token (required)
# Get a token from https://huggingface.co/settings/tokens
export HF_TOKEN="your_huggingface_token"
# Model to serve
export MODEL_HANDLE="<HF_HANDLE>"
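Because the container needs both variables, it can be worth checking them before launching. The following is a sketch only: `require_var` and `check_env` are hypothetical helper names. They test only that each variable is non-empty, not that the token or model handle is valid.

```shell
# require_var NAME: succeed only if the variable called NAME is set and
# non-empty (hypothetical helper).
require_var() {
  # Indirect lookup: read the value of the variable whose name is in $1.
  eval "value=\${$1:-}"
  if [ -z "$value" ]; then
    echo "error: $1 is not set" >&2
    return 1
  fi
}

# check_env: verify the two variables exported above.
check_env() {
  require_var HF_TOKEN && require_var MODEL_HANDLE
}
```

Running `check_env || exit 1` at the top of a launch script stops early with a clear message instead of failing later inside the container.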
For DiffusionGemma, use the vLLM custom container:
docker pull vllm/vllm-openai:gemma
Start the vLLM server with the model.
For DiffusionGemma models (e.g. google/diffusiongemma-26B-A4B-it), run the custom vLLM container:
docker run -d \
--name vllm-server \
-p 8000:8000 \
--gpus all \
--shm-size=16g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-e HF_TOKEN="$HF_TOKEN" \
-e VLLM_USE_V2_MODEL_RUNNER=1 \
vllm/vllm-openai:gemma "${MODEL_HANDLE}" \
--gpu-memory-utilization 0.85 \
--attention-backend TRITON_ATTN \
--max-num-seqs 8 \
--diffusion-config '{"canvas_length":256}' \
--override-generation-config '{"max_new_tokens": null}' \
--load-format fastsafetensors \
--enable-prefix-caching \
--reasoning-parser gemma4 \
--default-chat-template-kwargs '{"enable_thinking": true}' \
--enable-auto-tool-choice \
--tool-call-parser gemma4
# For a BF16 checkpoint, add "--moe-backend triton" to the command above for better performance.
Check the server logs for startup progress:
docker logs -f vllm-server
Expected output includes:
Application startup complete.
Press Ctrl+C to exit the log view once the server is ready.
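As an alternative to watching the logs, a script can poll the server until it responds. The sketch below assumes the server exposes a `GET /health` endpoint on the mapped port 8000. `retry` and `server_ready` are hypothetical helper names, and the retry count and delay are illustrative values.

```shell
# retry TRIES DELAY CMD...: run CMD until it succeeds, at most TRIES times,
# sleeping DELAY seconds between attempts (hypothetical helper).
retry() {
  tries="$1"
  delay="$2"
  shift 2
  attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$tries" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# server_ready: succeed when the health endpoint answers without an HTTP
# error. Assumes GET /health is available on localhost:8000.
server_ready() {
  curl -sf -o /dev/null http://localhost:8000/health
}
```

For example, `retry 120 5 server_ready && echo "server is ready"` waits up to roughly ten minutes. Raise the limits if the model download is slow.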
Send a test request to verify the server is working:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "'"$MODEL_HANDLE"'",
"messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
"max_tokens": 256
}'
The response should contain a choices array with the model's answer.
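To print only the answer text rather than the full JSON, the response can be piped through a small filter. This is a sketch that assumes `python3` is installed on the host and that the response follows the usual chat completions layout, with the text at `choices[0].message.content`. `extract_answer` is a hypothetical helper name.

```shell
# extract_answer: read a chat completion JSON document on stdin and print
# the content of the first choice (hypothetical helper; needs python3).
extract_answer() {
  python3 -c 'import json, sys
data = json.load(sys.stdin)
print(data["choices"][0]["message"]["content"])'
}
```

Add `-s` to the curl command above to silence its progress meter, then append `| extract_answer`. If the response carries no `choices` array, for instance when the server returns an error object, the filter exits with a Python error instead of printing text.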
Stop and remove the container:
docker stop vllm-server
docker rm vllm-server
Optionally, remove the image and the cached model, for example:
docker rmi "<docker image name>"
rm -rf "$HOME/.cache/huggingface/hub/<downloaded model name>"