NCCL for Two Sparks

30 MIN

Install and test NCCL on two Sparks

Common issues when running on two Sparks

Issue: mpirun hangs or times out
Cause: SSH connectivity issues
Solution:
1. Test basic SSH connectivity: ssh <remote_ip> should work without password prompts
2. Try a simple mpirun test: mpirun -np 2 -H <IP for Node 1>:1,<IP for Node 2>:1 hostname
3. Verify SSH keys are set up correctly for all nodes

Issue: Network interface not found
Cause: Wrong interface name or down status
Solution: Check interface status with ibdev2netdev and verify IP configuration

Issue: NCCL build fails
Cause: Missing dependencies such as OpenMPI or incorrect CUDA version
Solution: Verify CUDA installation and required libraries are present
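
The checks listed above can be combined into a single preflight script. What follows is a minimal sketch rather than part of the original instructions. The function names (build_host_list, check_ssh, up_interfaces, check_tool) are hypothetical. It assumes ibdev2netdev prints lines of the form "mlx5_0 port 1 ==> <interface> (Up)", and that nvcc and mpirun are the tools to look for on PATH.

```shell
#!/usr/bin/env bash
# Preflight checks for a two-node NCCL run (illustrative sketch only).
# Usage: ./preflight.sh <node1_ip> <node2_ip>

# Build the mpirun -H argument with one slot per node: "ip1:1,ip2:1".
build_host_list() {
    local list="" ip
    for ip in "$@"; do
        list="${list:+$list,}${ip}:1"
    done
    printf '%s\n' "$list"
}

# Succeed only if SSH works without a password prompt.
# BatchMode=yes makes ssh fail instead of asking for a password.
check_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Read ibdev2netdev output on stdin and print the interfaces marked Up.
# Assumed line format: "mlx5_0 port 1 ==> enp1s0f0np0 (Up)".
up_interfaces() {
    awk '/\(Up\)/ { print $5 }'
}

# Report whether a required tool is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1 found"
    else
        echo "MISSING: $1"
        return 1
    fi
}

# Run the checks only when at least two node addresses are given.
if [ "$#" -ge 2 ]; then
    for ip in "$@"; do
        if check_ssh "$ip"; then
            echo "ok: passwordless ssh to $ip"
        else
            echo "FAIL: ssh to $ip (check keys and connectivity)"
        fi
    done
    echo "Interfaces reported Up:"
    ibdev2netdev | up_interfaces
    check_tool nvcc
    check_tool mpirun
    # Same simple test as in the list above: each node prints its hostname.
    mpirun -np "$#" -H "$(build_host_list "$@")" hostname
fi
```

Run it from one node with both addresses as arguments, for example ./preflight.sh <IP for Node 1> <IP for Node 2>. If the final mpirun step prints a hostname for each node, the SSH and MPI launch path is working, and any remaining problem is more likely in the network interface or the NCCL build.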