vLLM for Inference

30 MIN

Install and use vLLM on DGX Spark

Configure network connectivity

Follow the network setup instructions from the Connect two Sparks playbook to establish connectivity between your DGX Spark nodes.

This includes:

  • Physical QSFP cable connection
  • Network interface configuration (automatic or manual IP assignment)
  • Passwordless SSH setup
  • Network connectivity verification (a quick check is sketched below)
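
Once that playbook is complete, a quick sanity check from Node 1 might look like the following sketch, where <NODE_2_QSFP_IP> and <node2-user> stand in for the address and user configured in that playbook, and enp1s0f1np1 is the QSFP interface used throughout this guide:

# On Node 1: confirm the QSFP interface is up and has its assigned address
ibdev2netdev
ip -4 addr show enp1s0f1np1

# Confirm Node 2 is reachable over the QSFP link and passwordless SSH works
ping -c 3 <NODE_2_QSFP_IP>
ssh <node2-user>@<NODE_2_QSFP_IP> hostname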

Download cluster deployment script

Obtain the vLLM cluster deployment script on both nodes. This script orchestrates the Ray cluster setup required for distributed inference.

# Download on both nodes
wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/examples/online_serving/run_cluster.sh
chmod +x run_cluster.sh

Pull the NVIDIA vLLM Image from NGC

First, you will need to configure Docker so you can pull the image from NGC. If this is your first time using Docker on this machine, run:

# Add your user to the docker group (the group may already exist)
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker

After this, you should be able to run docker commands without using sudo.

# Run on both nodes: run_cluster.sh uses the VLLM_IMAGE variable on each node
docker pull nvcr.io/nvidia/vllm:25.11-py3
export VLLM_IMAGE=nvcr.io/nvidia/vllm:25.11-py3

Start Ray head node

Launch the Ray cluster head node on Node 1. This node coordinates the distributed inference and serves the API endpoint.

# On Node 1, start head node

# Get the IP address of the high-speed interface
# Use the interface that shows "(Up)" from ibdev2netdev (enp1s0f0np0 or enp1s0f1np1)
export MN_IF_NAME=enp1s0f1np1
export VLLM_HOST_IP=$(ip -4 addr show $MN_IF_NAME | grep -oP '(?<=inet\s)\d+(\.\d+){3}')

echo "Using interface $MN_IF_NAME with IP $VLLM_HOST_IP"

bash run_cluster.sh $VLLM_IMAGE $VLLM_HOST_IP --head ~/.cache/huggingface \
  -e VLLM_HOST_IP=$VLLM_HOST_IP \
  -e UCX_NET_DEVICES=$MN_IF_NAME \
  -e NCCL_SOCKET_IFNAME=$MN_IF_NAME \
  -e OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME \
  -e GLOO_SOCKET_IFNAME=$MN_IF_NAME \
  -e TP_SOCKET_IFNAME=$MN_IF_NAME \
  -e RAY_memory_monitor_refresh_ms=0 \
  -e MASTER_ADDR=$VLLM_HOST_IP
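
From a second terminal on Node 1 (run_cluster.sh typically keeps running in the foreground while the cluster is up), you can confirm the head container came up before joining the worker; the node-<random_number> name is the same one used in the verification step later:

# On Node 1, in a second terminal
docker ps --filter name=node- --format '{{.Names}}\t{{.Status}}'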

Start Ray worker node

Connect Node 2 to the Ray cluster as a worker node. This provides additional GPU resources for tensor parallelism.

# On Node 2, join as worker

# Set the interface name (same as Node 1)
export MN_IF_NAME=enp1s0f1np1

# Get Node 2's own IP address
export VLLM_HOST_IP=$(ip -4 addr show $MN_IF_NAME | grep -oP '(?<=inet\s)\d+(\.\d+){3}')

# IMPORTANT: Set HEAD_NODE_IP to Node 1's IP address
# You must get this value from Node 1 (run: echo $VLLM_HOST_IP on Node 1)
export HEAD_NODE_IP=<NODE_1_IP_ADDRESS>

echo "Worker IP: $VLLM_HOST_IP, connecting to head node at: $HEAD_NODE_IP"

bash run_cluster.sh $VLLM_IMAGE $HEAD_NODE_IP --worker ~/.cache/huggingface \
  -e VLLM_HOST_IP=$VLLM_HOST_IP \
  -e UCX_NET_DEVICES=$MN_IF_NAME \
  -e NCCL_SOCKET_IFNAME=$MN_IF_NAME \
  -e OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME \
  -e GLOO_SOCKET_IFNAME=$MN_IF_NAME \
  -e TP_SOCKET_IFNAME=$MN_IF_NAME \
  -e RAY_memory_monitor_refresh_ms=0 \
  -e MASTER_ADDR=$HEAD_NODE_IP

Note: Replace <NODE_1_IP_ADDRESS> with the actual IP address from Node 1, specifically the IP of the QSFP interface enp1s0f1np1 configured in the Connect two Sparks playbook.

Verify cluster status

Confirm both nodes are recognized and available in the Ray cluster.

# On Node 1 (head node)
# Find the vLLM container name (it will be node-<random_number>)
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
echo "Found container: $VLLM_CONTAINER"

docker exec $VLLM_CONTAINER ray status

Expected output shows 2 nodes with available GPU resources.
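
If you prefer a scriptable check over reading the ray status summary, you can count the live Ray nodes from inside the container (a sketch; it uses the Ray Python API that ships in the vLLM container):

# Should print 2 once both Sparks have joined the cluster
docker exec $VLLM_CONTAINER python3 -c \
  "import ray; ray.init(address='auto'); print(sum(1 for n in ray.nodes() if n['Alive']))"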

Download Llama 3.3 70B model

Authenticate with Hugging Face and download the recommended production-ready model.

# On Node 1, authenticate and download
huggingface-cli login
huggingface-cli download meta-llama/Llama-3.3-70B-Instruct
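
Note that Llama 3.3 70B is a gated model, so the Hugging Face account you authenticate with must already have been granted access to it. For scripted setups, a non-interactive login is also possible (a sketch; <your_hf_access_token> is a token created in your Hugging Face account settings):

# Non-interactive alternative to the interactive login prompt
export HF_TOKEN=<your_hf_access_token>
huggingface-cli login --token $HF_TOKEN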

Launch inference server for Llama 3.3 70B

Start the vLLM inference server with tensor parallelism across both nodes.

# On Node 1, enter container and start server
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
docker exec -it $VLLM_CONTAINER /bin/bash -c '
  vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 2 --max-model-len 2048'
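
Loading the 70B model across two nodes can take several minutes. From another shell on Node 1, you can poll the server's health endpoint (the same endpoint used in the validation step below) until it responds:

# Wait until the vLLM server reports healthy (HTTP 200)
until curl -sf http://localhost:8000/health > /dev/null; do
  echo "Waiting for vLLM server to come up..."
  sleep 10
done
echo "vLLM server is ready"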

Test 70B model inference

Verify the deployment with a sample inference request.

# Test from Node 1 or external client
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "prompt": "Write a haiku about a GPU",
    "max_tokens": 32,
    "temperature": 0.7
  }'

Expected output includes a generated haiku response.
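
Because the server exposes the OpenAI-compatible completions API, the generated text is returned in choices[0].text of the JSON response. If jq is installed on the client, you can extract just the haiku, for example:

curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "prompt": "Write a haiku about a GPU",
    "max_tokens": 32,
    "temperature": 0.7
  }' | jq -r '.choices[0].text'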

(Optional) Deploy Llama 3.1 405B model

WARNING

The 405B model has insufficient memory headroom on this two-node setup for production use.

Download the quantized 405B model for testing purposes only.

# On Node 1, download quantized model
huggingface-cli download hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4

(Optional) Launch 405B inference server

Start the server with memory-constrained parameters for the large model.

# On Node 1, launch with restricted parameters
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
docker exec -it $VLLM_CONTAINER /bin/bash -c '
  vllm serve hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 \
    --tensor-parallel-size 2 --max-model-len 256 --gpu-memory-utilization 1.0 \
    --max-num-seqs 1 --max-num-batched-tokens 256'

(Optional) Test 405B model inference

Verify the 405B deployment with constrained parameters.

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4",
    "prompt": "Write a haiku about a GPU",
    "max_tokens": 32,
    "temperature": 0.7
  }'

Validate deployment

Validate the distributed inference system end to end: cluster health, the server health endpoint, and GPU utilization on both nodes.

# Check Ray cluster health
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
docker exec $VLLM_CONTAINER ray status

# Verify the server health endpoint (replace 192.168.100.10 with your head node's QSFP IP)
curl http://192.168.100.10:8000/health

# Monitor GPU utilization on both nodes
nvidia-smi
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
docker exec $VLLM_CONTAINER nvidia-smi --query-gpu=memory.used,memory.total --format=csv
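
As a rough end-to-end latency check (not a formal benchmark), you can time a few sequential requests from Node 1 using curl's built-in timing, for example:

# Send 5 sequential requests and report wall-clock time for each
for i in 1 2 3 4 5; do
  curl -s -o /dev/null -w "request $i: %{time_total}s\n" \
    http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.3-70B-Instruct", "prompt": "Hello", "max_tokens": 16}'
done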

Next steps

Access the Ray dashboard for cluster monitoring and explore additional features:

# Ray dashboard available at:
http://<head-node-ip>:8265

# Consider implementing for production:
# - Health checks and automatic restarts (a minimal watchdog is sketched below)
# - Log rotation for long-running services
# - Persistent model caching across restarts
# - Alternative quantization methods (FP8, INT4)
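
As a starting point for the first item, a minimal watchdog can simply poll the health endpoint and log failures; fully automatic restarts would also need to re-run run_cluster.sh and vllm serve on both nodes, which is beyond this sketch:

# Run periodically (e.g. from cron) on Node 1 to log failed health checks
if ! curl -sf --max-time 10 http://localhost:8000/health > /dev/null; then
  echo "$(date -Is) vLLM health check failed" >> ~/vllm-health.log
fi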