Install and Use vLLM for Inference
Use a container or build vLLM from source for Spark
Configure network connectivity
Follow the network setup instructions from the Connect two Sparks playbook to establish connectivity between your DGX Spark nodes.
This includes the following; a minimal sketch of the manual steps appears after the list:
- Physical QSFP cable connection
- Network interface configuration (automatic or manual IP assignment)
- Passwordless SSH setup
- Network connectivity verification
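If you assigned addresses manually, the steps look roughly like the following. This is a minimal sketch assuming the interface name enP2p1s0f1np1 and the 192.168.100.0/24 addressing used throughout this guide; substitute your own username and interface name.
# On Node 1 (use 192.168.100.11/24 on Node 2)
sudo ip addr add 192.168.100.10/24 dev enP2p1s0f1np1
sudo ip link set enP2p1s0f1np1 up
# Verify connectivity and passwordless SSH from Node 1
ping -c 3 192.168.100.11
ssh <user>@192.168.100.11 hostname   # should print Node 2's hostname without a password prompt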
Download cluster deployment script
Obtain the vLLM cluster deployment script on both nodes. This script orchestrates the Ray cluster setup required for distributed inference.
# Download on both nodes
wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/examples/online_serving/run_cluster.sh
chmod +x run_cluster.sh
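If you prefer to drive both machines from Node 1, you can fetch the script on Node 2 over the passwordless SSH link configured earlier. This sketch assumes Node 2 is reachable at 192.168.100.11 and that you want the script in the remote home directory; adjust the username.
# From Node 1, download the script on Node 2
ssh <user>@192.168.100.11 \
  "wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/examples/online_serving/run_cluster.sh && chmod +x run_cluster.sh"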
Pull the NVIDIA vLLM Image from NGC
First, configure Docker so that you can pull images from NGC. If this is your first time using Docker on these systems, add your user to the docker group:
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
After this, you should be able to run docker commands without using sudo.
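Public NGC images can usually be pulled anonymously, but if the pull below is denied, authenticate with an NGC API key first. The username is the literal string $oauthtoken; the API key is the password.
# Only needed if the anonymous pull fails
docker login nvcr.io
# Username: $oauthtoken
# Password: <your NGC API key>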
docker pull nvcr.io/nvidia/vllm:25.09-py3
export VLLM_IMAGE=nvcr.io/nvidia/vllm:25.09-py3
Start Ray head node
Launch the Ray cluster head node on Node 1. This node coordinates the distributed inference and serves the API endpoint.
# On Node 1, start head node
export MN_IF_NAME=enP2p1s0f1np1
bash run_cluster.sh $VLLM_IMAGE 192.168.100.10 --head ~/.cache/huggingface \
-e VLLM_HOST_IP=192.168.100.10 \
-e UCX_NET_DEVICES=$MN_IF_NAME \
-e NCCL_SOCKET_IFNAME=$MN_IF_NAME \
-e OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME \
-e GLOO_SOCKET_IFNAME=$MN_IF_NAME \
-e TP_SOCKET_IFNAME=$MN_IF_NAME \
-e RAY_memory_monitor_refresh_ms=0 \
-e MASTER_ADDR=192.168.100.10
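Before joining the worker, it is worth confirming that the head container (named node by run_cluster.sh) is running and that Ray started cleanly. A quick check:
# On Node 1, confirm the container is up and inspect the most recent startup logs
docker ps --filter name=node
docker logs --tail 20 node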
Start Ray worker node
Connect Node 2 to the Ray cluster as a worker node. This provides additional GPU resources for tensor parallelism.
# On Node 2, join as worker
export MN_IF_NAME=enP2p1s0f1np1
bash run_cluster.sh $VLLM_IMAGE 192.168.100.10 --worker ~/.cache/huggingface \
-e VLLM_HOST_IP=192.168.100.11 \
-e UCX_NET_DEVICES=$MN_IF_NAME \
-e NCCL_SOCKET_IFNAME=$MN_IF_NAME \
-e OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME \
-e GLOO_SOCKET_IFNAME=$MN_IF_NAME \
-e TP_SOCKET_IFNAME=$MN_IF_NAME \
-e RAY_memory_monitor_refresh_ms=0 \
-e MASTER_ADDR=192.168.100.10
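To confirm the worker attached to the head, you can follow the worker container's logs on Node 2; the exact messages vary by Ray version, but connection errors will show up here.
# On Node 2, follow the worker container's logs (Ctrl+C to stop)
docker logs -f node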
Verify cluster status
Confirm both nodes are recognized and available in the Ray cluster.
# On Node 1 (head node)
docker exec node ray status
Expected output shows 2 nodes with available GPU resources.
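If you prefer a programmatic check over reading the status table, the following sketch counts live Ray nodes from inside the head container using the Ray Python API (assumes the container is named node); it should print 2.
# On Node 1, count live Ray nodes (expect 2)
docker exec node python3 -c "import ray; ray.init(address='auto'); print(sum(1 for n in ray.nodes() if n['Alive']))"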
Download Llama 3.3 70B model
Authenticate with Hugging Face and download the recommended production-ready model.
# On Node 1, authenticate and download
huggingface-cli login
huggingface-cli download meta-llama/Llama-3.3-70B-Instruct
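For scripted setups, you can authenticate non-interactively instead of using the login prompt. This assumes your Hugging Face access token is available in an HF_TOKEN environment variable and that your account has been granted access to the gated Llama repository.
# Non-interactive alternative
export HF_TOKEN=<your Hugging Face access token>
huggingface-cli login --token "$HF_TOKEN"
huggingface-cli download meta-llama/Llama-3.3-70B-Instruct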
Launch inference server for Llama 3.3 70B
Start the vLLM inference server with tensor parallelism across both nodes.
# On Node 1, enter container and start server
docker exec -it node /bin/bash
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 2 --max-model-len 2048
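Model loading and tensor-parallel initialization can take several minutes. One way to wait for readiness is to poll the health endpoint from another shell on Node 1:
# Poll until the OpenAI-compatible server reports healthy
until curl -sf http://localhost:8000/health > /dev/null; do
  echo "waiting for vLLM server..."; sleep 10
done
echo "server is ready"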
Test 70B model inference
Verify the deployment with a sample inference request.
# Test from Node 1 or external client
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.3-70B-Instruct",
"prompt": "Write a haiku about a GPU",
"max_tokens": 32,
"temperature": 0.7
}'
Expected output includes a generated haiku response.
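The server also exposes the OpenAI-compatible chat endpoint, so any OpenAI-compatible client can be pointed at it. An equivalent chat-style request:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [{"role": "user", "content": "Write a haiku about a GPU"}],
    "max_tokens": 32,
    "temperature": 0.7
  }'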
(Optional) Deploy Llama 3.1 405B model
WARNING
The quantized 405B model leaves insufficient memory headroom for production use.
Download the quantized 405B model for testing purposes only.
# On Node 1, download quantized model
huggingface-cli download hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4
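The AWQ-INT4 checkpoint is still very large, so confirm there is enough free space in the Hugging Face cache before starting the download; the exact size depends on the model revision.
# Check free space where the model will be cached
df -h ~/.cache/huggingface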
(Optional) Launch 405B inference server
Start the server with memory-constrained parameters for the large model.
# On Node 1, launch with restricted parameters
docker exec -it node /bin/bash
vllm serve hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 \
--tensor-parallel-size 2 --max-model-len 256 --gpu-memory-utilization 1.0 \
--max-num-seqs 1 --max-num-batched-tokens 256
(Optional) Test 405B model inference
Verify the 405B deployment with constrained parameters.
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4",
"prompt": "Write a haiku about a GPU",
"max_tokens": 32,
"temperature": 0.7
}'
Validate deployment
Perform comprehensive validation of the distributed inference system.
# Check Ray cluster health
docker exec node ray status
# Verify server health endpoint
curl http://192.168.100.10:8000/health
# Monitor GPU utilization on both nodes
nvidia-smi
docker exec node nvidia-smi --query-gpu=memory.used,memory.total --format=csv
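To watch GPU memory on both Sparks from a single shell on Node 1, you can pair the local query with one over SSH. This assumes passwordless SSH to Node 2 at 192.168.100.11; substitute your username.
# GPU memory on Node 1 and Node 2 from Node 1
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
ssh <user>@192.168.100.11 nvidia-smi --query-gpu=memory.used,memory.total --format=csv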
Cleanup and rollback
Remove temporary configurations and containers when testing is complete.
WARNING
This will stop all inference services and remove cluster configuration.
# Stop containers on both nodes
docker stop node
docker rm node
# Remove network configuration on both nodes
sudo ip addr del 192.168.100.10/24 dev enP2p1s0f1np1 # Node 1
sudo ip addr del 192.168.100.11/24 dev enP2p1s0f1np1 # Node 2
sudo ip link set enP2p1s0f1np1 down
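Optionally, reclaim disk space by removing the pulled image and the deployment script on both nodes. This assumes the script was downloaded to your home directory; skip it if you plan to redeploy.
# Optional: remove the vLLM image and the cluster script
docker rmi nvcr.io/nvidia/vllm:25.09-py3
rm -f ~/run_cluster.sh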
Next steps
Access the Ray dashboard for cluster monitoring and explore additional features:
The Ray dashboard is available at http://192.168.100.10:8265.
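If your workstation cannot reach the 192.168.100.0/24 cluster network directly, one option is to forward the dashboard port over SSH. This is a sketch; substitute your username and an address of Node 1 that your workstation can reach.
# From your workstation, forward the Ray dashboard to localhost:8265
ssh -L 8265:192.168.100.10:8265 <user>@<node1-address>
# Then open http://localhost:8265 in a browser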
For production deployments, consider adding:
- Health checks and automatic restarts
- Log rotation for long-running services
- Persistent model caching across restarts
- Alternative quantization methods (FP8, INT4)