Follow the network setup instructions from the Connect two Sparks playbook to establish connectivity between your DGX Spark nodes.
This includes:
Heads up: the discover-sparks script in the linked playbook writes its SSH key to ~/.ssh/ and fails if the directory does not exist yet. Run mkdir -p ~/.ssh && chmod 700 ~/.ssh on both nodes first if you have never used SSH on them.
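The directory preparation above can be wrapped in a small helper so it is safe to rerun on both nodes. This is a minimal sketch assuming a POSIX shell; the function name ensure_ssh_dir is illustrative and is not part of the linked playbook.

```shell
# ensure_ssh_dir is a hypothetical helper name, not part of the linked playbook.
# It creates the SSH directory if it is missing and restricts it to the owner.
ensure_ssh_dir() {
  ssh_dir="${1:-$HOME/.ssh}"   # optional argument: directory to prepare
  mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir"
}

# Usage (run once on each node before discover-sparks):
#   ensure_ssh_dir
```

Both mkdir -p and chmod are idempotent, so running the helper on a node that already has ~/.ssh only resets the directory mode to 700.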
Obtain the vLLM cluster deployment script on both nodes. This script orchestrates the Ray cluster setup required for distributed inference.
# Download on both nodes — pinned to a known-good commit so upstream changes
# can't silently break this playbook against the 26.05-py3 image.
wget https://raw.githubusercontent.com/vllm-project/vllm/51c1ee9b7c8acbba4899a8ebffd390685d171946/examples/ray_serving/run_cluster.sh
# Patch the script to pip-install ray inside the container before ray starts.
# The 26.05-py3 NGC image ships without ray (upstream made it an optional CUDA dep);
# the install takes ~10s on first container launch.
sed -i 's|^RAY_START_CMD="ray start|RAY_START_CMD="pip install -q --root-user-action=ignore '\''ray[default]>=2.9'\'' \&\& ray start|' run_cluster.sh
chmod +x run_cluster.sh
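Because sed exits successfully even when its pattern matches nothing, it may be worth confirming that the patch actually landed before launching anything. The check below is a sketch; check_ray_patch is a hypothetical helper name, and it only looks at how the RAY_START_CMD line begins.

```shell
# check_ray_patch is a hypothetical helper, not part of run_cluster.sh.
# It succeeds only if the RAY_START_CMD line now installs ray before starting it.
check_ray_patch() {
  script_path="${1:-run_cluster.sh}"   # optional argument: script to inspect
  if grep -q '^RAY_START_CMD="pip install .*ray\[default\].* && ray start' "$script_path"; then
    echo "patched"
  else
    echo "NOT patched: no RAY_START_CMD line with the pip install prefix" >&2
    return 1
  fi
}

# Usage on each node, in the directory that holds run_cluster.sh:
#   check_ray_patch
```

If the check fails, inspect the script with grep -n RAY_START_CMD run_cluster.sh; the sed expression expects a line that starts with RAY_START_CMD="ray start.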
First, configure Docker. If this is your first time using Docker, run:
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
After this, you should be able to run Docker commands without sudo.
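To confirm that the group change is active in the current shell, you can inspect the session's group list. The sketch below uses a hypothetical helper, has_group, that tests whether a name appears in the space-separated output of id -nG.

```shell
# has_group is a hypothetical helper: succeed if group $2 appears as a whole
# word in the space-separated group list $1.
has_group() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: id -nG prints the groups of the current shell session.
#   if has_group "$(id -nG)" docker; then
#     echo "docker group active"
#   else
#     echo "docker group not active yet: run newgrp docker or log in again"
#   fi
```

Padding both sides with spaces keeps a group such as dockerx from being mistaken for docker.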
Pull the image on both nodes:
docker pull nvcr.io/nvidia/vllm:26.05-py3
export VLLM_IMAGE=nvcr.io/nvidia/vllm:26.05-py3
Launch the Ray cluster head node on Node 1. This node coordinates the distributed inference and serves the API endpoint.
# On Node 1, start head node. Run inside tmux/screen so an SSH drop doesn't
# tear down the cluster (run_cluster.sh has an EXIT trap that stops the container).
# Get the IP address of the high-speed interface
# Use the interface that shows "(Up)" from ibdev2netdev (enp1s0f0np0 or enp1s0f1np1)
export MN_IF_NAME=enp1s0f1np1
export VLLM_HOST_IP=$(ip -4 addr show $MN_IF_NAME | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
export VLLM_IMAGE=nvcr.io/nvidia/vllm:26.05-py3
echo "Using interface $MN_IF_NAME with IP $VLLM_HOST_IP"
bash run_cluster.sh $VLLM_IMAGE $VLLM_HOST_IP --head ~/.cache/huggingface \
-e VLLM_HOST_IP=$VLLM_HOST_IP \
-e UCX_NET_DEVICES=$MN_IF_NAME \
-e NCCL_SOCKET_IFNAME=$MN_IF_NAME \
-e OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME \
-e GLOO_SOCKET_IFNAME=$MN_IF_NAME \
-e TP_SOCKET_IFNAME=$MN_IF_NAME \
-e RAY_memory_monitor_refresh_ms=0 \
-e MASTER_ADDR=$VLLM_HOST_IP
Leave this terminal open — closing it stops the head node and tears down the cluster.
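If MN_IF_NAME points at an interface without an IPv4 address, VLLM_HOST_IP ends up empty and run_cluster.sh receives shifted arguments. A guard along these lines can catch that before launch. It is a sketch: extract_ipv4 and require_ip are hypothetical helper names, and the sed expression is a portable stand-in for the grep -oP call used above.

```shell
# extract_ipv4: read `ip -4 addr show <iface>` output on stdin and print the
# first IPv4 address without its prefix length. Prints nothing if none is found.
extract_ipv4() {
  sed -n 's/^[[:space:]]*inet \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# require_ip: fail with a message when the detected address ($1) is empty.
# $2 is the interface name, used only in the message.
require_ip() {
  if [ -z "$1" ]; then
    echo "No IPv4 address on $2; check ibdev2netdev and MN_IF_NAME" >&2
    return 1
  fi
}

# Usage on either node, before calling run_cluster.sh:
#   VLLM_HOST_IP=$(ip -4 addr show "$MN_IF_NAME" | extract_ipv4)
#   require_ip "$VLLM_HOST_IP" "$MN_IF_NAME" || exit 1
```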
Open a second terminal, SSH to Node 2 (ssh user@<NODE_2_IP>), and join the Ray cluster as a worker. Replace <NODE_1_IP_ADDRESS> below with the QSFP-side IP from Node 1 (run echo $VLLM_HOST_IP on Node 1 to print it). Run inside tmux/screen on Node 2 as well.
# On Node 2, join as worker
# Set the interface name (same as Node 1)
export MN_IF_NAME=enp1s0f1np1
# Get Node 2's own IP address
export VLLM_HOST_IP=$(ip -4 addr show $MN_IF_NAME | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
# Set this to Node 1's QSFP IP (the value of VLLM_HOST_IP on Node 1)
export HEAD_NODE_IP=<NODE_1_IP_ADDRESS>
# Set the image tag (same as Step 3)
export VLLM_IMAGE=nvcr.io/nvidia/vllm:26.05-py3
echo "Worker IP: $VLLM_HOST_IP, connecting to head node at: $HEAD_NODE_IP"
bash run_cluster.sh $VLLM_IMAGE $HEAD_NODE_IP --worker ~/.cache/huggingface \
-e VLLM_HOST_IP=$VLLM_HOST_IP \
-e UCX_NET_DEVICES=$MN_IF_NAME \
-e NCCL_SOCKET_IFNAME=$MN_IF_NAME \
-e OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME \
-e GLOO_SOCKET_IFNAME=$MN_IF_NAME \
-e TP_SOCKET_IFNAME=$MN_IF_NAME \
-e RAY_memory_monitor_refresh_ms=0 \
-e MASTER_ADDR=$HEAD_NODE_IP
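Before joining, it can help to confirm that HEAD_NODE_IP really holds an address and that Node 1 answers over the QSFP link. The following is a sketch; is_ipv4 is a hypothetical helper that checks only the dotted-quad shape, not that each octet is at most 255.

```shell
# is_ipv4 is a hypothetical helper: succeed if $1 looks like a dotted-quad
# IPv4 address. It does not validate that each octet is <= 255.
is_ipv4() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# Usage on Node 2, before calling run_cluster.sh:
#   is_ipv4 "$HEAD_NODE_IP" || { echo "HEAD_NODE_IP is not an IPv4 address" >&2; exit 1; }
#   ping -c 3 "$HEAD_NODE_IP"
```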
Confirm both nodes are recognized and available in the Ray cluster.
# On Node 1 (head node)
# Find the vLLM container name (it will be node-<random_number>)
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
echo "Found container: $VLLM_CONTAINER"
docker exec $VLLM_CONTAINER ray status
Expected output shows 2 nodes with available GPU resources.
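The node count can also be checked from a script instead of by eye. The sketch below assumes that ray status lists each active node on its own line in the form "<count> node_<id>"; that layout is an assumption about the installed Ray version, so adjust the pattern if your output differs. count_ray_nodes is a hypothetical helper name.

```shell
# count_ray_nodes: read `ray status` output on stdin and print how many lines
# look like " 1 node_<id>".
# ASSUMPTION: node entries are printed as "<count> node_<id>"; this is not
# guaranteed across Ray versions.
count_ray_nodes() {
  grep -c '^[[:space:]]*[0-9][0-9]* node_'
}

# Usage on Node 1:
#   nodes=$(docker exec "$VLLM_CONTAINER" ray status | count_ray_nodes)
#   [ "$nodes" -eq 2 ] && echo "both nodes joined" || echo "found $nodes node(s)"
```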
Llama 3.3 70B is a gated model — first accept its license at https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct and create an HF access token with read permission. Then authenticate inside the container so the cache lands at /root/.cache/huggingface (mounted from ~/.cache/huggingface).
docker exec -it $VLLM_CONTAINER /bin/bash -c '
hf auth login
hf download meta-llama/Llama-3.3-70B-Instruct'
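To confirm that the download landed in the mounted cache, you can look for the model directory from the host side. The Hugging Face hub cache stores each model repository under hub/ in a directory named models--<org>--<name>; the helper below, with the hypothetical name hf_model_cache_dir, builds that name from a repo id.

```shell
# hf_model_cache_dir is a hypothetical helper: print the hub cache directory
# name for a model repo id, e.g. org/name -> models--org--name.
hf_model_cache_dir() {
  printf 'models--%s\n' "$(printf '%s' "$1" | sed 's|/|--|g')"
}

# Usage on Node 1 (host side of the mount):
#   ls ~/.cache/huggingface/hub/"$(hf_model_cache_dir meta-llama/Llama-3.3-70B-Instruct)"
```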
Start the vLLM inference server with tensor parallelism across both nodes.
# On Node 1, enter container and start server
docker exec -it $VLLM_CONTAINER /bin/bash -c '
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 2 --max-model-len 2048 \
--distributed-executor-backend ray'
Verify the deployment with a sample inference request. Run this on Node 1 itself; from an external client, replace localhost with Node 1's reachable IP.
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.3-70B-Instruct",
"prompt": "Write a haiku about a GPU",
"max_tokens": 32,
"temperature": 0.7
}'
Expected output includes a generated haiku response.
WARNING: The 405B model has insufficient memory headroom for production use.
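A rough calculation shows why the headroom is so tight. The sketch below assumes 128 GB of unified memory per DGX Spark, a figure this playbook does not state, and it counts only the 4-bit weights; quantization scales, the KV cache, activations, and the operating system all share whatever is left.

```shell
# weights_gb: approximate weight footprint in GB for a model with
# $1 billion parameters stored at $2 bits per parameter.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# memory_budget_gb: memory vLLM may claim across $1 nodes with $2 GB each
# at --gpu-memory-utilization $3.
memory_budget_gb() {
  awk -v n="$1" -v m="$2" -v u="$3" 'BEGIN { printf "%.1f\n", n * m * u }'
}

weights_gb 405 4            # prints 202.5
memory_budget_gb 2 128 0.9  # prints 230.4 (128 GB per node is an assumption)
```

Under these assumptions, roughly 28 GB across both nodes remains for everything other than the raw weights, which is consistent with the very small context length and batch limits used for this model.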
Download the quantized 405B model for testing purposes only. Runs inside the head container so the cache lands in the mounted HF directory.
docker exec -it $VLLM_CONTAINER /bin/bash -c '
hf download hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4'
Start the server with memory-constrained parameters for the large model.
# On Node 1, launch with restricted parameters
docker exec -it $VLLM_CONTAINER /bin/bash -c '
vllm serve hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 \
--tensor-parallel-size 2 --max-model-len 64 --gpu-memory-utilization 0.9 \
--max-num-seqs 1 --max-num-batched-tokens 64 \
--distributed-executor-backend ray'
Startup is slow for 405B — expect several minutes of model-loading logs across both nodes. The server is ready to take traffic once you see:
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
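Instead of watching the logs, you can poll the server's health endpoint from another terminal until it answers. The retry loop below is a generic sketch; wait_until is a hypothetical helper name, and the try count and delay in the usage line are arbitrary choices rather than measured values.

```shell
# wait_until TRIES DELAY CMD...: run CMD up to TRIES times, sleeping DELAY
# seconds between attempts; succeed as soon as CMD succeeds.
# Note: POSIX sh has no local variables, so these names are global.
wait_until() {
  wu_tries="$1"
  wu_delay="$2"
  shift 2
  wu_attempt=1
  while [ "$wu_attempt" -le "$wu_tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep only if another attempt remains.
    if [ "$wu_attempt" -lt "$wu_tries" ]; then
      sleep "$wu_delay"
    fi
    wu_attempt=$((wu_attempt + 1))
  done
  return 1
}

# Usage on Node 1: poll for up to about 30 minutes (180 tries, 10 s apart).
#   wait_until 180 10 curl -sf http://localhost:8000/health && echo "server ready"
```

curl -sf stays quiet and returns a non-zero status on connection errors and HTTP error codes, so the loop keeps retrying while the model is still loading.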
Verify the 405B deployment with constrained parameters. As in Step 9, run on Node 1 or replace localhost with Node 1's reachable IP from an external client.
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4",
"prompt": "Write a haiku about a GPU",
"max_tokens": 32,
"temperature": 0.7
}'
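With --max-model-len 64, the prompt tokens plus max_tokens have to fit inside 64 tokens, or the server is expected to reject the request. The check below is a sketch of that budget; fits_context is a hypothetical helper, and the prompt token count you pass in is your own estimate, since the real count depends on the tokenizer.

```shell
# fits_context PROMPT_TOKENS MAX_TOKENS MAX_MODEL_LEN
# Succeed if the prompt plus the requested completion fits in the context.
# ASSUMPTION: a request whose total equals MAX_MODEL_LEN is still accepted.
fits_context() {
  [ $(( $1 + $2 )) -le "$3" ]
}

# Usage: a short prompt estimated at 10 tokens with max_tokens 32 fits in 64.
#   fits_context 10 32 64 && echo "request fits" || echo "reduce max_tokens"
```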
Validate the distributed inference system end to end: Ray cluster membership, server health, and GPU activity.
# Check Ray cluster health
docker exec $VLLM_CONTAINER ray status
# Verify server health endpoint
curl http://localhost:8000/health
# Monitor GPU utilization on both nodes (DGX Spark has unified memory,
# so the --query-gpu memory fields report N/A; use raw nvidia-smi instead).
nvidia-smi
The Ray dashboard runs on port 8265 of the head node. The container uses host networking, and the dashboard listens on the loopback address by default, so it is only directly reachable from Node 1 itself. From an external workstation, tunnel it over SSH:
# From your workstation:
ssh -L 8265:localhost:8265 nvidia@<NODE_1_IP>
# then open http://localhost:8265 in a local browser
Consider for production: