| Symptom | Cause | Fix |
|---|---|---|
| "CUDA out of memory" error | Insufficient GPU memory | Reduce kv_cache_free_gpu_memory_fraction to 0.9 or use a device with more VRAM |
| Container fails to start | Docker GPU support issues | Verify nvidia-docker is installed and --gpus=all flag is supported |
| Model download fails | Network or authentication issues | Check HuggingFace authentication and network connectivity |
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your HuggingFace token and request access to the gated model in your web browser |
| Server doesn't respond | Port conflicts or firewall | Check if port 8000 is available and not blocked |
| mpirun fails with SSH connection refused | SSH not configured between containers or nodes | Complete SSH setup from Connect Two Sparks playbook; verify `ssh <node_ip>` works without password from both nodes |
| mpirun hangs or times out connecting to remote node | Hostfile IPs don't match actual node IPs | Verify IPs in `/etc/openmpi-hostfile` match the IPs assigned to network interfaces, as shown by `ip addr show` |
| NCCL error: "Socket operation on non-socket" | Wrong network interface specified | Check ibdev2netdev output and ensure NCCL_SOCKET_IFNAME and UCX_NET_DEVICES match the active interfaces enp1s0f1np1,enP2p1s0f1np1 |
| Permission denied (publickey) during mpirun | SSH keys not exchanged between containers | Re-run SSH setup from Connect Two Sparks playbook or manually verify /root/.ssh/authorized_keys contains public keys from both nodes |
| Model download fails silently in multi-node setup | HF_TOKEN not propagated to mpirun | Add -e HF_TOKEN=$HF_TOKEN to docker exec command and -x HF_TOKEN to mpirun command |
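
The hostfile and SSH checks from the table can be scripted. The following is a minimal sketch, not part of the playbook: the helper names `hostfile_hosts`, `ip_is_local`, and `check_nodes` are hypothetical, and it assumes the hostfile lists one host or IP as the first field of each line.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the mpirun rows above; adjust paths to your setup.

# Print the first field (host or IP) of each non-blank, non-comment hostfile line.
hostfile_hosts() {
    awk 'NF && $1 !~ /^#/ { print $1 }' "$1"
}

# Succeed if IPv4 address $1 appears in $2, where $2 is the text
# produced by `ip -o -4 addr show` (address/prefix is the 4th field).
ip_is_local() {
    printf '%s\n' "$2" | awk '{ split($4, a, "/"); print a[1] }' | grep -qxF "$1"
}

# For every hostfile entry, report whether it is a local address and
# whether passwordless SSH to it works (BatchMode forbids password prompts).
check_nodes() {
    hostfile="${1:-/etc/openmpi-hostfile}"
    addrs=$(ip -o -4 addr show)
    for host in $(hostfile_hosts "$hostfile"); do
        if ip_is_local "$host" "$addrs"; then
            echo "$host: assigned to a local interface"
        fi
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "$host: passwordless SSH OK"
        else
            echo "$host: SSH failed (see the SSH rows in the table)"
        fi
    done
}
```

Run `check_nodes` on each node. Every node should see exactly one hostfile entry reported as local, and SSH should succeed for all entries.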

NOTE
DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. Because many applications are still being updated to take advantage of UMA, you may encounter memory issues even when usage is within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
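
To see what the flush actually freed, you can compare `/proc/meminfo` before and after. This is a minimal sketch: `meminfo_kib` and `flush_and_report` are hypothetical helper names, and the choice of `MemFree` and `Cached` as the fields to watch is an assumption, not something the playbook prescribes.

```shell
#!/usr/bin/env bash
# Print the value (in KiB) of one /proc/meminfo field, e.g. MemFree or Cached.
# An optional second argument names an alternative meminfo-format file.
meminfo_kib() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Flush the page cache with the command above and report the change.
flush_and_report() {
    free_before=$(meminfo_kib MemFree)
    cached_before=$(meminfo_kib Cached)
    sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
    echo "MemFree: ${free_before} KiB -> $(meminfo_kib MemFree) KiB"
    echo "Cached:  ${cached_before} KiB -> $(meminfo_kib Cached) KiB"
}
```

Calling `flush_and_report` should show `Cached` shrinking and `MemFree` growing if the buffer cache was holding a significant share of memory.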