vLLM for Inference

30 MIN

Install and use vLLM on DGX Spark

DGXSpark
Run on two Sparks

Step 1
Configure network connectivity

Follow the network setup instructions from the Connect two Sparks playbook to establish connectivity between your DGX Spark nodes.

This includes:

  • Physical QSFP cable connection
  • Network interface configuration (automatic or manual IP assignment)
  • Passwordless SSH setup
  • Network connectivity verification

Step 2
Download cluster deployment script

Obtain the vLLM cluster deployment script on both nodes. This script orchestrates the Ray cluster setup required for distributed inference.

# Download on both nodes
wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/examples/online_serving/run_cluster.sh
chmod +x run_cluster.sh
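
Before moving on, it can help to confirm on each node that the download produced a non-empty, executable file. The following is a minimal sketch; the helper name `script_ready` is hypothetical and is not part of vLLM.

```shell
# Hypothetical helper: succeed only if the file exists, is non-empty, and is executable.
script_ready() {
  local path="$1"
  [ -s "$path" ] && [ -x "$path" ]
}

# Example: script_ready ./run_cluster.sh && echo "run_cluster.sh is ready"
```

If the check fails, repeat the `wget` and `chmod` commands on that node.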

Step 3
Pull the NVIDIA vLLM Image from NGC

First, configure Docker so that you can pull from NGC. If this is your first time using Docker, run:

sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker

After this, you should be able to run docker commands without using sudo.

docker pull nvcr.io/nvidia/vllm:25.11-py3
export VLLM_IMAGE=nvcr.io/nvidia/vllm:25.11-py3
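
Group membership changes apply only to new login sessions (or a shell started with `newgrp`), so a `docker pull` that fails with a permission error often means the current shell has not picked up the `docker` group yet. A minimal sketch of a membership check follows; the helper name `has_group` is hypothetical.

```shell
# Hypothetical helper: succeed if a space-separated group list contains the wanted group.
has_group() {
  local groups="$1" wanted="$2" g
  for g in $groups; do
    [ "$g" = "$wanted" ] && return 0
  done
  return 1
}

# Example: has_group "$(id -nG)" docker || echo "Log out and back in, or run: newgrp docker"
```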

Step 4
Start Ray head node

Launch the Ray cluster head node on Node 1. This node coordinates the distributed inference and serves the API endpoint.

# On Node 1, start head node

# Get the IP address of the high-speed interface
# Use the interface that shows "(Up)" from ibdev2netdev (enp1s0f0np0 or enp1s0f1np1)
export MN_IF_NAME=enp1s0f1np1
export VLLM_HOST_IP=$(ip -4 addr show $MN_IF_NAME | grep -oP '(?<=inet\s)\d+(\.\d+){3}')

echo "Using interface $MN_IF_NAME with IP $VLLM_HOST_IP"

bash run_cluster.sh $VLLM_IMAGE $VLLM_HOST_IP --head ~/.cache/huggingface \
  -e VLLM_HOST_IP=$VLLM_HOST_IP \
  -e UCX_NET_DEVICES=$MN_IF_NAME \
  -e NCCL_SOCKET_IFNAME=$MN_IF_NAME \
  -e OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME \
  -e GLOO_SOCKET_IFNAME=$MN_IF_NAME \
  -e TP_SOCKET_IFNAME=$MN_IF_NAME \
  -e RAY_memory_monitor_refresh_ms=0 \
  -e MASTER_ADDR=$VLLM_HOST_IP
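
The `grep` that fills `VLLM_HOST_IP` prints nothing when the interface has no IPv4 address and several lines when it has more than one, and either result would be passed to `run_cluster.sh` unchanged. A minimal sketch of a guard you could run before launching; the helper name `is_ipv4` is hypothetical.

```shell
# Hypothetical helper: succeed only for a single dotted-quad IPv4 address.
is_ipv4() {
  # Reject empty values, stray characters (including newlines), and leading/trailing dots
  case "$1" in
    ''|*[!0-9.]*|.*|*.) return 1 ;;
  esac
  # Require exactly four dot-separated fields, each in the range 0..255
  local IFS=. count=0 octet
  for octet in $1; do
    [ -n "$octet" ] && [ "$octet" -le 255 ] || return 1
    count=$((count + 1))
  done
  [ "$count" -eq 4 ]
}

# Example: is_ipv4 "$VLLM_HOST_IP" || echo "Check the address on $MN_IF_NAME" >&2
```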

Step 5
Start Ray worker node

Connect Node 2 to the Ray cluster as a worker node. This provides additional GPU resources for tensor parallelism.

# On Node 2, join as worker

# Set the interface name (same as Node 1)
export MN_IF_NAME=enp1s0f1np1

# Get Node 2's own IP address
export VLLM_HOST_IP=$(ip -4 addr show $MN_IF_NAME | grep -oP '(?<=inet\s)\d+(\.\d+){3}')

# IMPORTANT: Set HEAD_NODE_IP to Node 1's IP address
# You must get this value from Node 1 (run: echo $VLLM_HOST_IP on Node 1)
export HEAD_NODE_IP=<NODE_1_IP_ADDRESS>

echo "Worker IP: $VLLM_HOST_IP, connecting to head node at: $HEAD_NODE_IP"

bash run_cluster.sh $VLLM_IMAGE $HEAD_NODE_IP --worker ~/.cache/huggingface \
  -e VLLM_HOST_IP=$VLLM_HOST_IP \
  -e UCX_NET_DEVICES=$MN_IF_NAME \
  -e NCCL_SOCKET_IFNAME=$MN_IF_NAME \
  -e OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME \
  -e GLOO_SOCKET_IFNAME=$MN_IF_NAME \
  -e TP_SOCKET_IFNAME=$MN_IF_NAME \
  -e RAY_memory_monitor_refresh_ms=0 \
  -e MASTER_ADDR=$HEAD_NODE_IP

Note: Replace <NODE_1_IP_ADDRESS> with the actual IP address of Node 1, specifically the address of the QSFP interface enp1s0f1np1 configured in the Connect two Sparks playbook.
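
If the worker fails to join, a quick TCP check from Node 2 can separate network problems from Ray problems. The sketch below assumes the Ray head listens on port 6379, which is Ray's default; check your copy of `run_cluster.sh` if it uses a different port. The helper name `port_open` is hypothetical, and it relies on bash's `/dev/tcp` redirection and the coreutils `timeout` command.

```shell
# Hypothetical helper: succeed if a TCP connection to host:port opens within 3 seconds.
port_open() {
  local host="$1" port="$2"
  # $0 and $1 inside the inner script are the host and port passed after it
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example (6379 is an assumption, see above):
# port_open "$HEAD_NODE_IP" 6379 || echo "Cannot reach Ray head at $HEAD_NODE_IP" >&2
```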

Step 6
Verify cluster status

Confirm both nodes are recognized and available in the Ray cluster.

# On Node 1 (head node)
# Find the vLLM container name (it will be node-<random_number>)
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
echo "Found container: $VLLM_CONTAINER"

docker exec $VLLM_CONTAINER ray status

Expected output shows 2 nodes with available GPU resources.
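
To check this from a script rather than by eye, you can extract the GPU total from the `ray status` text. This is a minimal sketch that assumes the output contains a usage line of the form ` 0.0/2.0 GPU`; the exact layout may differ between Ray versions, and the helper name `ray_gpu_total` is hypothetical.

```shell
# Hypothetical helper: read `ray status` text on stdin and print the total GPU count.
# Prints 0 when no GPU usage line is found.
ray_gpu_total() {
  awk '$2 == "GPU" { split($1, parts, "/"); print int(parts[2]); found = 1; exit }
       END { if (!found) print 0 }'
}

# Example: docker exec "$VLLM_CONTAINER" ray status | ray_gpu_total
```

With both nodes joined, the value should be 2, one GPU per DGX Spark.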

Step 7
Download Llama 3.3 70B model

Authenticate with Hugging Face and download the recommended production-ready model.

# On Node 1, open a shell in the container where `ray status` ran
docker exec -it $VLLM_CONTAINER /bin/bash

# Inside the container
hf auth login
hf download meta-llama/Llama-3.3-70B-Instruct

Step 8
Launch inference server for Llama 3.3 70B

Start the vLLM inference server with tensor parallelism across both nodes.

# On Node 1, enter container and start server
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
docker exec -it $VLLM_CONTAINER /bin/bash -c '
  vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 2 --max-model-len 2048'
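
The server does not accept requests until the model weights have finished loading, which can take several minutes for a 70B model. A minimal sketch of a polling loop you could run from a second terminal on Node 1 follows; the helper name `retry_until` is hypothetical, and the attempt count and delay in the example are arbitrary.

```shell
# Hypothetical helper: run a command up to N times, sleeping between attempts.
# Usage: retry_until <attempts> <delay_seconds> <command> [args...]
retry_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    # Stop as soon as the command succeeds
    "$@" && return 0
    # Sleep only if another attempt remains
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example: retry_until 60 10 curl -sf http://localhost:8000/health && echo "server ready"
```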

Step 9
Test 70B model inference

Verify the deployment with a sample inference request.

# Test from Node 1; from an external client, replace localhost with Node 1's IP address
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "prompt": "Write a haiku about a GPU",
    "max_tokens": 32,
    "temperature": 0.7
  }'

Expected output includes a generated haiku response.

Step 10
(Optional) Deploy Llama 3.1 405B model

WARNING

The 405B model has insufficient memory headroom for production use.

Download the quantized 405B model for testing purposes only.

# On Node 1, inside the vLLM container, download the quantized model
hf download hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4
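
A rough estimate shows why the warning above applies. The sketch below computes only the weight footprint and assumes 128 GB of unified memory per DGX Spark (256 GB across two nodes), a figure that is not stated on this page. It ignores quantization scales, KV cache, activations, and operating system overhead, so real usage is higher. The helper name `weights_gb` is hypothetical.

```shell
# Hypothetical helper: approximate weight footprint in GB (decimal), given the
# parameter count in billions and the bits per weight. Integer math, rounds down.
weights_gb() {
  echo $(( $1 * $2 / 8 ))
}

# weights_gb 405 4   -> 202  (405B parameters at 4 bits per weight)
# weights_gb 70 16   -> 140  (70B parameters at 16 bits per weight)
```

At roughly 202 GB for weights alone, the 405B INT4 model leaves little of the assumed 256 GB for everything else, which is consistent with the very small context length used in the next step.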

Step 11
(Optional) Launch 405B inference server

Start the server with memory-constrained parameters for the large model.

# On Node 1, launch with restricted parameters
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
docker exec -it $VLLM_CONTAINER /bin/bash -c '
  vllm serve hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 \
    --tensor-parallel-size 2 --max-model-len 64 --gpu-memory-utilization 0.9 \
    --max-num-seqs 1 --max-num-batched-tokens 64'

Step 12
(Optional) Test 405B model inference

Verify the 405B deployment with constrained parameters.

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4",
    "prompt": "Write a haiku about a GPU",
    "max_tokens": 32,
    "temperature": 0.7
  }'

Step 13
Validate deployment

Perform comprehensive validation of the distributed inference system.

# Check Ray cluster health
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
docker exec $VLLM_CONTAINER ray status

# Verify the server health endpoint (from another machine, replace localhost with Node 1's IP address)
curl http://localhost:8000/health

# Monitor GPU utilization; run these on both nodes
nvidia-smi
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
docker exec $VLLM_CONTAINER nvidia-smi --query-gpu=memory.used,memory.total --format=csv
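
The CSV output of the last command can be reduced to a single percentage for logging. This is a minimal sketch that assumes the usual two-line layout (a header row, then values such as `61440 MiB, 81920 MiB`). On unified-memory systems the fields may be reported as `[N/A]`, so the sketch prints `n/a` in that case. The helper name `mem_used_pct` is hypothetical.

```shell
# Hypothetical helper: read the nvidia-smi CSV on stdin and print used memory
# as a whole percentage, or "n/a" when the fields are not numeric.
mem_used_pct() {
  awk -F', *' 'NR == 2 {
    used = $1 + 0; total = $2 + 0
    if ($1 ~ /^[0-9]+/ && total > 0) printf "%d\n", used * 100 / total
    else print "n/a"
  }'
}

# Example:
# docker exec "$VLLM_CONTAINER" nvidia-smi --query-gpu=memory.used,memory.total --format=csv | mem_used_pct
```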

Step 14
Next steps

Access the Ray dashboard for cluster monitoring and explore additional features:

# Ray dashboard available at:
http://<head-node-ip>:8265

# Consider implementing for production:
# - Health checks and automatic restarts
# - Log rotation for long-running services
# - Persistent model caching across restarts
# - Alternative quantization methods (FP8, INT4)

Resources

  • vLLM Documentation
  • DGX Spark Documentation
  • DGX Spark Forum
  • DGX Spark User Performance Guide

Copyright © 2026 NVIDIA Corporation