Connect Multiple DGX Spark through a Switch

1 HR

Set up a cluster of DGX Spark devices connected through a switch.

| Symptom | Cause | Fix |
| --- | --- | --- |
| "Network unreachable" errors | Network interfaces not configured | Verify the netplan config and run sudo netplan apply |
| SSH authentication failures | SSH keys not properly distributed | Re-run ./discover-sparks and enter passwords |
| Nodes not visible in cluster | Network connectivity issue | Verify the QSFP cable connection and check the IP configuration |
| "APT update" errors (e.g. E: The list of sources could not be read.) | APT source errors, conflicting sources or signing keys | Check the APT and Ubuntu documentation to fix the conflicting sources or keys |
| NCCL test failures (e.g. libnccl.so.2: cannot open shared object file) | NCCL configuration not done on all nodes | Follow the NCCL playbook to configure all nodes before running the NCCL test |
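Some of the checks in the table above can be scripted before digging into individual nodes. The following is a minimal sketch, not part of the playbook: it assumes SSH listens on the default port 22, the helper names (port_open, check_nodes, nccl_library_found) are hypothetical, and the node addresses in the usage note are placeholders. A failed TCP connection points at the network or SSH rows, and a missing NCCL library is only a rough indicator of the libnccl.so.2 error.

```python
import ctypes.util
import socket


def port_open(host, port=22, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 22 (SSH) is an assumption; use whatever port your nodes expose.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and "Network unreachable".
        return False


def check_nodes(hosts, port=22, timeout=2.0):
    """Map each host to True/False depending on whether its port is reachable."""
    return {host: port_open(host, port, timeout) for host in hosts}


def nccl_library_found():
    """Return True if the system library search can locate an NCCL library.

    This only hints at whether libnccl is installed on the local node; it
    does not prove that the NCCL test itself will run.
    """
    return ctypes.util.find_library("nccl") is not None
```

To use it, call check_nodes(["192.168.100.10", "192.168.100.11"]) with your own node addresses (the ones shown are placeholders) and run nccl_library_found() on each node. Any host that maps to False is a candidate for the netplan, cabling, or SSH key fixes listed in the table.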