Certain models require special deployment configurations. Please refer to their respective model cards to run on DGX Spark:
| Model | Quantization | HF Model Card Link |
|---|---|---|
| Nemotron-3-Nano-Omni-30B-A3B-Reasoning | BF16 | https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 |
| Nemotron-3-Nano-Omni-30B-A3B-Reasoning | FP8 | https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 |
| Nemotron-3-Nano-Omni-30B-A3B-Reasoning | NVFP4 | https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 |
To manage containers without sudo, your user must be in the docker group. If you skip this step, you will need to prefix every Docker command with sudo.
Open a new terminal and test Docker access. In the terminal, run:
docker ps
If you see a permission denied error (for example, "permission denied while trying to connect to the Docker daemon socket"), add your user to the docker group so that you no longer need to run the command with sudo:
sudo usermod -aG docker $USER
newgrp docker
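Group membership added with `usermod` only applies to new login sessions, and `newgrp docker` applies it to the current shell alone. The following is a minimal sketch of a check for whether the current shell already has the docker group, assuming a POSIX shell and the standard `id` utility. The function names `has_group` and `check_docker_group` are illustrative and are not part of Docker.

```shell
# Return 0 if the space-separated group list in $2 contains the group named $1.
has_group() {
  case " $2 " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Report whether the current shell session has the docker group.
# `id -nG` prints the group names of the calling process.
check_docker_group() {
  if has_group docker "$(id -nG)"; then
    echo "docker group active: sudo is not needed"
  else
    echo "docker group not active: run 'newgrp docker' or log in again"
    return 1
  fi
}

# Usage (uncomment to run):
# check_docker_group
```

If the check still fails in other terminals after `newgrp docker`, logging out and back in should apply the new group everywhere.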
Find the latest container build tag at https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm, then export it together with your Hugging Face token and model handle:
# HuggingFace token (required)
# Get a token from https://huggingface.co/settings/tokens
export HF_TOKEN="your_huggingface_token"
export LATEST_VLLM_VERSION=<latest_container_version>
# example
# export LATEST_VLLM_VERSION=26.05.post1-py3
export HF_MODEL_HANDLE=<HF_HANDLE>
# example
# export HF_MODEL_HANDLE=openai/gpt-oss-20b
docker pull nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION}
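A missing variable turns the pull command into a request for an empty tag, so it can help to validate the environment first. The sketch below assumes a POSIX shell. The helper names `missing_vars` and `ngc_image` are hypothetical and only illustrate the idea.

```shell
# Print the name of every variable in the argument list that is unset or empty.
# Only pass trusted variable names, since the lookup uses eval.
missing_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      printf '%s\n' "$name"
    fi
  done
}

# Compose the NGC image reference from a version tag.
ngc_image() {
  printf 'nvcr.io/nvidia/vllm:%s\n' "$1"
}

# Usage (uncomment to run):
# missing="$(missing_vars HF_TOKEN LATEST_VLLM_VERSION HF_MODEL_HANDLE)"
# if [ -n "$missing" ]; then
#   echo "please export: $missing" >&2
# else
#   docker pull "$(ngc_image "$LATEST_VLLM_VERSION")"
# fi
```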
For DiffusionGemma models, use the vLLM custom container:
docker pull vllm/vllm-openai:gemma
For the Gemma 4 model family, use the vLLM custom container:
docker pull vllm/vllm-openai:gemma4-cu130
Launch the container and start the vLLM server with a test model to verify basic functionality.
docker run -it --gpus all -p 8000:8000 \
nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION} \
vllm serve ${HF_MODEL_HANDLE}
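The server needs time to download and load the model before it accepts requests. Because `docker run -it` keeps it in the foreground, readiness is easiest to check from a second terminal. The sketch below polls until the server answers, assuming `curl` is installed and that the server is reachable on port 8000 through vLLM's `/health` endpoint. The function name `wait_for_server` and its defaults are illustrative.

```shell
# Poll a URL until it returns a success status or the attempts run out.
#   $1 = URL to probe
#   $2 = maximum number of attempts (default 60)
#   $3 = seconds to sleep between attempts (default 5)
wait_for_server() {
  url="$1"
  attempts="${2:-60}"
  delay="${3:-5}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -s: silent, -f: treat HTTP 4xx/5xx as failure, --max-time: cap each probe
    if curl -sf --max-time 5 "$url" > /dev/null 2>&1; then
      echo "server ready after $i attempt(s)"
      return 0
    fi
    # Sleep only if another attempt follows.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "server not ready after $attempts attempt(s)" >&2
  return 1
}

# Usage (uncomment to run):
# wait_for_server http://localhost:8000/health
```

Large checkpoints may need more than the default five minutes on first launch, so raise the attempt count if the download is slow.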
To run DiffusionGemma models (e.g. google/diffusiongemma-26B-A4B-it):
docker run -it \
-p 8000:8000 \
--gpus all \
--shm-size=16g \
-e HF_TOKEN="$HF_TOKEN" \
-e VLLM_USE_V2_MODEL_RUNNER=1 \
vllm/vllm-openai:gemma ${HF_MODEL_HANDLE} \
--gpu-memory-utilization 0.8 \
--max-model-len 262144 \
--attention-backend TRITON_ATTN \
--max-num-seqs 10 \
--diffusion-config '{"canvas_length":256}' \
--override-generation-config '{"max_new_tokens": null}' \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--reasoning-parser gemma4 \
--enable-prefix-caching \
--default-chat-template-kwargs '{"enable_thinking": true}' \
--load-format fastsafetensors
# For the BF16 checkpoint, add "--moe-backend triton" for better performance.
To run models from the Gemma 4 model family (e.g. google/gemma-4-31B-it):
docker run -it --gpus all -p 8000:8000 \
vllm/vllm-openai:gemma4-cu130 ${HF_MODEL_HANDLE}
Expected output should include:
In another terminal, test the server:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "'"${HF_MODEL_HANDLE}"'",
"messages": [{"role": "user", "content": "12*17"}],
"max_tokens": 500
}'
The response should contain "content": "204" or an equivalent answer to the calculation.
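The `"'"${HF_MODEL_HANDLE}"'"` sequence in the request closes the single-quoted JSON string, inserts the variable in double quotes, and reopens the single quotes. A small helper can make this easier to read. The sketch below assumes a POSIX shell. The function name `chat_payload` is hypothetical, and it performs no JSON escaping, so it assumes the model name and prompt contain no double quotes or backslashes.

```shell
# Build a chat-completions request body.
#   $1 = model name
#   $2 = user prompt
#   $3 = max_tokens (default 500)
# No JSON escaping is done: inputs must not contain double quotes or backslashes.
chat_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_tokens": %s}\n' \
    "$1" "$2" "${3:-500}"
}

# Usage (uncomment to run):
# curl -s http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$(chat_payload "$HF_MODEL_HANDLE" "12*17")"
```

If `jq` happens to be installed (an assumption), piping the response through `jq -r '.choices[0].message.content'` prints only the generated text.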
For the container approach (non-destructive), remove the containers and then the image:
NGC Container:
docker rm $(docker ps -aq --filter ancestor=nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION})
docker rmi nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION}
Upstream Container:
docker stop "<container name>"
docker rm "<container name>"
docker rmi "<container image name>"
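The same removal steps apply to either image, so they can be wrapped in one function. This is a minimal sketch, assuming a POSIX shell and the `docker` CLI on the PATH. The function name `cleanup_image` is illustrative. Unlike the commands above, it uses `docker rm -f`, which also stops containers that are still running.

```shell
# Remove every container created from an image, then remove the image itself.
#   $1 = full image reference, including the tag
cleanup_image() {
  image="$1"
  # List the IDs of all containers (running or stopped) created from this image.
  ids="$(docker ps -aq --filter "ancestor=$image")"
  if [ -n "$ids" ]; then
    # $ids is left unquoted on purpose so each ID becomes a separate argument.
    docker rm -f $ids
  fi
  docker rmi "$image"
}

# Usage (uncomment to run):
# cleanup_image "nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION}"
# cleanup_image vllm/vllm-openai:gemma
```

Skipping `docker rm` when no containers match avoids the error that `docker rm` reports when it is called without arguments.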