On all four systems, check that the usernames match:
# Check current username
whoami
If the usernames don't match, create a new user (e.g., nvidia) on all four systems and log in as that user:
# Create nvidia user and add to sudo group
sudo useradd -m nvidia
sudo usermod -aG sudo nvidia
# Set password for nvidia user
sudo passwd nvidia
# Switch to nvidia user
su - nvidia
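Before moving on, it can be worth confirming on each system that the account exists and has sudo rights. A minimal sketch (the check_user helper is our own, not part of any NVIDIA tooling):

```shell
# Succeeds only if the account exists and is a member of the sudo group.
check_user() {
    id "$1" >/dev/null 2>&1 && id -nG "$1" | grep -qw sudo
}

if check_user nvidia; then
    echo "nvidia is ready"
else
    echo "nvidia is missing or not in the sudo group"
fi
```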
Most QSFP switches offer a management interface, either through a CLI or a UI. Refer to the switch documentation and connect to the management interface, then make sure the ports on the switch are enabled. To connect four Sparks, the switch must be configured to provide a 200Gbps connection to each DGX Spark. If not done already, refer to the Overview of this playbook for the prior knowledge and prerequisites required for this playbook.
Connect the QSFP cables between the DGX Spark systems and the switch (QSFP56-DD/QSFP56 ports) using one CX7 port on each Spark system. It is recommended to use the same CX7 port on all Spark systems for easier network configuration and to avoid NCCL test failures. In this playbook the second port (the one further from the Ethernet port) is used. This establishes the 200Gbps connection required for high-speed inter-node communication. You should see output like the example below on all four Sparks. In this example, the interfaces showing as 'Up' are enp1s0f1np1 and enP2p1s0f1np1 (each physical port has two logical interfaces).
Example output:
# Check QSFP interface availability on all nodes
nvidia@dgx-spark-1:~$ ibdev2netdev
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
NOTE
If none of the interfaces show as 'Up', check the QSFP cable connection, reboot the systems, and try again. Which interfaces show as 'Up' depends on which port you used to connect the nodes to the switch. Each physical port has two logical interfaces; for example, port 1 has enp1s0f1np1 and enP2p1s0f1np1. Disregard enp1s0f0np0 and enP2p1s0f0np0, and use enp1s0f1np1 and enP2p1s0f1np1 only.
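If you want to pick out the 'Up' interfaces programmatically (useful when scripting the later steps), the ibdev2netdev output can be filtered with awk; a sketch, assuming the line format shown above:

```shell
# Print only the netdev names whose link state is Up.
# On a live system you would pipe the real command instead:
#   ibdev2netdev | awk '/\(Up\)/ {print $5}'
printf '%s\n' \
  'rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)' \
  'roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)' \
  'roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)' |
awk '/\(Up\)/ {print $5}'
# -> enp1s0f1np1
# -> enP2p1s0f1np1
```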
The link speed might not default to 200Gbps with auto-negotiation. To confirm, run the command below on all Sparks and check that the speed is shown as 200000Mb/s. If it shows less than that value, the link speed needs to be set to 200Gbps manually in the switch port configuration and auto-negotiation disabled. Refer to the switch's manual/documentation to disable auto-negotiation and set the link speed manually to 200Gbps (e.g., 200GBASE-CR4).
Example output:
nvidia@dgx-spark-1:~$ sudo ethtool enp1s0f1np1 | grep Speed
Speed: 100000Mb/s
nvidia@dgx-spark-1:~$ sudo ethtool enP2p1s0f1np1 | grep Speed
Speed: 100000Mb/s
After setting the correct speed on the switch ports, verify the link speed on all the DGX Sparks again.
Example output:
nvidia@dgx-spark-1:~$ sudo ethtool enp1s0f1np1 | grep Speed
Speed: 200000Mb/s
nvidia@dgx-spark-1:~$ sudo ethtool enP2p1s0f1np1 | grep Speed
Speed: 200000Mb/s
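When checking across several nodes, the Speed line can be validated automatically; a minimal sketch (check_speed is our own helper, and it assumes the ethtool output format shown above):

```shell
# Return success only when the link reports the full 200Gbps.
# On a live system: check_speed "$(sudo ethtool enp1s0f1np1 | grep Speed)"
check_speed() {
    case "$1" in
        *200000Mb/s*) echo "OK: link at 200G"; return 0 ;;
        *)            echo "WARN: link below 200G ($1)"; return 1 ;;
    esac
}

check_speed "Speed: 200000Mb/s"          # OK
check_speed "Speed: 100000Mb/s" || true  # WARN: fix the switch port config
```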
NOTE
Full bandwidth can be achieved with just one QSFP cable.
For a clustered setup with all DGX Sparks:
It is recommended to use the Ethernet/WiFi network for management and internet traffic and keep it separate from the CX7 network, so that CX7 bandwidth is not consumed by non-workload traffic.
The supported way to configure a cluster with a switch is to configure a bridge (or use the default bridge) on the switch and add all the ports of interest (the ports connected to the DGX Sparks) to it through the switch management interface.
Once you have created the bridge and added the ports, you are ready to configure networking on the DGX Spark side.
We have created a script, available here on GitHub, which automates the network configuration steps described in the following sections.
NOTE
You can use the script or continue with the manual configurations in the following sections. If you use the script, you can skip the rest of the setup sections in this playbook.
Use the steps below to run the script:
# Clone the repository
git clone https://github.com/NVIDIA/dgx-spark-playbooks
# Enter the script directory
cd dgx-spark-playbooks/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup
# Check the README.md in the script directory for steps to run the script and configure the cluster networking with "--run-setup" argument
Choose one of the following options to assign IPs to the CX7 logical interfaces. Options 1, 2, and 3 are mutually exclusive.
# Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
  version: 2
  ethernets:
    enp1s0f1np1:
      dhcp4: true
    enP2p1s0f1np1:
      dhcp4: true
EOF
# Set appropriate permissions
sudo chmod 600 /etc/netplan/40-cx7.yaml
# Apply the configuration
sudo netplan apply
# In this example, we are using interface enp1s0f1np1. Similarly check enP2p1s0f1np1.
nvidia@dgx-spark-1:~$ ip addr show enp1s0f1np1 | grep -w inet
inet 100.100.100.4/24 brd 100.100.100.255 scope global noprefixroute enp1s0f1np1
Configure network interfaces using netplan on all DGX Spark nodes for automatic link-local addressing:
# Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
  version: 2
  ethernets:
    enp1s0f1np1:
      link-local: [ ipv4 ]
    enP2p1s0f1np1:
      link-local: [ ipv4 ]
EOF
# Set appropriate permissions
sudo chmod 600 /etc/netplan/40-cx7.yaml
# Apply the configuration
sudo netplan apply
On node 1:
# Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
  version: 2
  ethernets:
    enp1s0f1np1:
      addresses:
        - 192.168.100.10/24
      dhcp4: no
    enP2p1s0f1np1:
      addresses:
        - 192.168.100.11/24
      dhcp4: no
EOF
# Set appropriate permissions
sudo chmod 600 /etc/netplan/40-cx7.yaml
# Apply the configuration
sudo netplan apply
On node 2:
# Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
  version: 2
  ethernets:
    enp1s0f1np1:
      addresses:
        - 192.168.100.12/24
      dhcp4: no
    enP2p1s0f1np1:
      addresses:
        - 192.168.100.13/24
      dhcp4: no
EOF
# Set appropriate permissions
sudo chmod 600 /etc/netplan/40-cx7.yaml
# Apply the configuration
sudo netplan apply
On node 3:
# Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
  version: 2
  ethernets:
    enp1s0f1np1:
      addresses:
        - 192.168.100.14/24
      dhcp4: no
    enP2p1s0f1np1:
      addresses:
        - 192.168.100.15/24
      dhcp4: no
EOF
# Set appropriate permissions
sudo chmod 600 /etc/netplan/40-cx7.yaml
# Apply the configuration
sudo netplan apply
On node 4:
# Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
  version: 2
  ethernets:
    enp1s0f1np1:
      addresses:
        - 192.168.100.16/24
      dhcp4: no
    enP2p1s0f1np1:
      addresses:
        - 192.168.100.17/24
      dhcp4: no
EOF
# Set appropriate permissions
sudo chmod 600 /etc/netplan/40-cx7.yaml
# Apply the configuration
sudo netplan apply
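After applying the static configuration on all four nodes, a quick connectivity sweep from any node confirms the fabric is reachable; a sketch assuming the exact addresses used above:

```shell
# The eight CX7 addresses assigned above (nodes 1-4, two interfaces each).
cluster_ips() {
    seq 10 17 | while read -r last; do
        echo "192.168.100.$last"
    done
}

cluster_ips
# On a live cluster, ping each address from one node:
# for ip in $(cluster_ips); do ping -c 1 -W 2 "$ip" >/dev/null && echo "$ip ok"; done
```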
Run the DGX Spark discover-sparks.sh script from one of the nodes to automatically discover and configure SSH:
curl -O https://raw.githubusercontent.com/NVIDIA/dgx-spark-playbooks/refs/heads/main/nvidia/connect-two-sparks/assets/discover-sparks
bash ./discover-sparks
Expected output is similar to the following, with different IPs and node names. You may see up to two IPs per node, since two interfaces (e.g., enp1s0f1np1 and enP2p1s0f1np1) have IP addresses assigned; this is expected and does not cause any issues. The first time you run the script, you'll be prompted for your password for each node.
Found: 169.254.35.62 (dgx-spark-1.local)
Found: 169.254.35.63 (dgx-spark-2.local)
Found: 169.254.35.64 (dgx-spark-3.local)
Found: 169.254.35.65 (dgx-spark-4.local)
Setting up bidirectional SSH access (local <-> remote nodes)...
You may be prompted for your password for each node.
SSH setup complete! All local and remote nodes can now SSH to each other without passwords.
NOTE
If you encounter any errors, please follow Option 2 below to manually configure SSH and debug the issue.
You will need to find the IP addresses of the CX7 interfaces that are up. On all nodes, run the following command to find the IP addresses and note them for the next step.
ip addr show enp1s0f1np1
Example output:
# In this example, we are using interface enp1s0f1np1.
nvidia@dgx-spark-1:~$ ip addr show enp1s0f1np1
4: enp1s0f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 3c:6d:66:cc:b3:b7 brd ff:ff:ff:ff:ff:ff
inet 169.254.35.62/16 brd 169.254.255.255 scope link noprefixroute enp1s0f1np1
valid_lft forever preferred_lft forever
inet6 fe80::3e6d:66ff:fecc:b3b7/64 scope link
valid_lft forever preferred_lft forever
In this example, the IP address for Node 1 is 169.254.35.62. Repeat the process for other nodes.
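The address can also be extracted in one line, which helps when scripting across nodes (a sketch; the interface name is the one from the example above):

```shell
# Print just the IPv4 address of the CX7 interface (no mask, no labels).
ip -4 -o addr show enp1s0f1np1 | awk '{print $4}' | cut -d/ -f1
```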
On all nodes, run the following commands to enable passwordless SSH:
# Copy your SSH public key to all nodes. Replace the IP addresses with the ones you found in the previous step.
ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@<IP for Node 1>
ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@<IP for Node 2>
ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@<IP for Node 3>
ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@<IP for Node 4>
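ssh-copy-id requires an existing key pair; if ~/.ssh/id_rsa does not exist yet, generate one first. A standard non-interactive sketch:

```shell
# Generate an RSA key pair without a passphrase if one does not already exist.
[ -f "$HOME/.ssh/id_rsa" ] || ssh-keygen -t rsa -b 4096 -N "" -f "$HOME/.ssh/id_rsa"
```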
Test basic multi-node functionality from the head node:
# Test hostname resolution across nodes
ssh <IP for Node 1> hostname
ssh <IP for Node 2> hostname
ssh <IP for Node 3> hostname
ssh <IP for Node 4> hostname
Now your cluster is set up to run distributed workloads across four nodes. Try running the NCCL playbook.
NOTE
Wherever the playbook asks to run a command on two nodes, run it on all four nodes. Make sure to adapt the mpirun NCCL command that you run on the head node to accommodate four nodes.
Example mpirun command for NCCL:
# Set network interface environment variables (use your Up interface from the previous step)
export UCX_NET_DEVICES=enp1s0f1np1
export NCCL_SOCKET_IFNAME=enp1s0f1np1
export OMPI_MCA_btl_tcp_if_include=enp1s0f1np1
# Run the all_gather performance test across four nodes (replace the IP addresses with the ones you found in the previous step)
mpirun -np 4 -H <IP for Node 1>:1,<IP for Node 2>:1,<IP for Node 3>:1,<IP for Node 4>:1 \
--mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
-x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
$HOME/nccl-tests/build/all_gather_perf
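Equivalently, the host list can be kept in a hostfile, which is easier to maintain as hosts are added; a sketch using the static addresses from the earlier example (the $HOME/hostfile path is our choice, not a required location):

```shell
# Write one line per node; slots=1 runs one rank per Spark.
cat > "$HOME/hostfile" <<'EOF'
192.168.100.10 slots=1
192.168.100.12 slots=1
192.168.100.14 slots=1
192.168.100.16 slots=1
EOF

# Then replace "-H <ip>:1,..." with: mpirun -np 4 --hostfile "$HOME/hostfile" ...
```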
WARNING
These steps will reset network configuration.
# Rollback network configuration
sudo rm /etc/netplan/40-cx7.yaml
sudo netplan apply
NOTE
If disconnecting the switch, make sure to do the following: