Basic idea
Configure four DGX Spark systems for high-speed inter-node communication using 200Gbps QSFP connections through a QSFP switch. This setup enables distributed workloads across multiple DGX Spark nodes by establishing network connectivity and configuring SSH authentication.
What you will accomplish
In this playbook, you will physically connect four DGX Spark devices with QSFP cables and a QSFP switch, configure network interfaces for cluster communication, and establish passwordless SSH between nodes to create a functional distributed computing environment. The same setup scales to more DGX Spark devices connected through the same switch.
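Once passwordless SSH has been set up, a check along the following lines can confirm that it works from one node to the others. This is a minimal sketch, not part of the playbook's scripts: the function name check_passwordless_ssh and the hostnames in the usage comment are hypothetical placeholders.

```shell
# Sketch: report whether each node accepts SSH logins without a password.
# Returns non-zero if any node fails the check.
check_passwordless_ssh() {
  local node failed=0
  for node in "$@"; do
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true; then
      echo "$node: ok"
    else
      echo "$node: passwordless SSH not working"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (hypothetical hostnames):
#   check_passwordless_ssh spark-1 spark-2 spark-3 spark-4
```

Running the check from every node, not just one, covers the case where keys were distributed in only one direction.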
What to know before starting
- Basic understanding of distributed computing concepts
- Experience with network interface configuration and netplan
- Experience with SSH key management
- Experience configuring the managed QSFP network switch you plan to use. Refer to its instruction manual to learn how to:
  - Connect to the switch to manage ports and features
  - Enable or disable QSFP ports and create a software bridge on the switch
  - Set the link speed manually on a port and disable auto-negotiation if needed
Prerequisites
- Four DGX Spark systems (these instructions will work for any number of DGX Spark devices connected with a switch)
- QSFP switch with at least 4 QSFP56-DD ports (at least 200Gbps each)
- QSFP cables for 200Gbps connections from the switch to the devices. Use the recommended cable or an equivalent.
  - One cable per DGX Spark
  - If the switch has 400Gbps ports, you can also use breakout cables to split each one into two 200Gbps ports
- SSH access available to all systems
- Root or sudo access on all systems: sudo whoami
- The same username on all systems
- All systems updated to the latest OS and firmware. Refer to the DGX Spark documentation: https://docs.nvidia.com/dgx/dgx-spark/os-and-component-update.html
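The username prerequisite above can be checked from a single machine with a small helper like the one below. This is a sketch under assumptions: the function name check_usernames and the hostnames in the usage comment are hypothetical, and ssh may prompt for a password because passwordless SSH is not configured yet at this stage. Sudo access is still best confirmed by running sudo whoami interactively on each system.

```shell
# Sketch: print "node username" for each node and return non-zero
# if a node is unreachable or the usernames are not all identical.
check_usernames() {
  local first="" user node status=0
  for node in "$@"; do
    # Ask the remote system which user the SSH session runs as.
    if ! user="$(ssh -o ConnectTimeout=5 "$node" whoami)"; then
      echo "$node: unreachable" >&2
      status=1
      continue
    fi
    echo "$node $user"
    # Remember the first username seen and compare the rest against it.
    if [ -z "$first" ]; then
      first="$user"
    elif [ "$user" != "$first" ]; then
      echo "$node: username '$user' differs from '$first'" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage (hypothetical hostnames):
#   check_usernames spark-1 spark-2 spark-3 spark-4
```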
Ancillary files
All required files for this playbook are available on GitHub:
- discover-sparks.sh script for automatic node discovery and SSH key distribution
- Cluster setup script for automatic network configuration, validation, and an NCCL sanity test
Time & risk
- Duration: 2 hours including validation
- Risk level: Medium - involves network reconfiguration
- Rollback: Network changes can be reversed by removing netplan configs or IP assignments
- Last Updated: 3/19/2026
  - First publication
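The rollback described above could look roughly like the following on each node. This is a hedged sketch, not the playbook's own procedure: the function name print_rollback is hypothetical, and the netplan file path, interface name, and address in the usage comment are placeholders that must be replaced with the values actually used during setup. The function only prints the commands so they can be reviewed before being run.

```shell
# Sketch: print rollback commands for one node without executing them.
# Arguments: netplan file, interface name, address in CIDR form.
# Pass "" for the file, or for the interface/address, to skip that part.
print_rollback() {
  local netplan_file="$1" iface="$2" cidr="$3"
  if [ -n "$netplan_file" ]; then
    # Remove the added netplan config, then regenerate and apply the rest.
    printf 'sudo rm -f -- %s\n' "$netplan_file"
    printf 'sudo netplan apply\n'
  fi
  if [ -n "$iface" ] && [ -n "$cidr" ]; then
    # Remove a manually assigned address from the interface.
    printf 'sudo ip addr del %s dev %s\n' "$cidr" "$iface"
  fi
}

# Usage (all three values are hypothetical placeholders):
#   print_rollback /etc/netplan/40-cluster.yaml enp1s0f0np0 192.168.100.10/24
```

Printing first and running second is a deliberate choice here: a mistyped interface or file name is easier to catch in the printed output than after the network has already been reconfigured.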