Symptom:
RuntimeError: NCCL error in: ...
Cause: Network connectivity issues between nodes or firewall blocking distributed training communication.
Solution:
# Test connectivity between nodes
ping <WORKER_IP>
# Open the distributed training port in the firewall
sudo ufw allow 29500/tcp
# Ensure NCCL_SOCKET_IFNAME matches your network interface (default: enp1s0f0np0)
# Disable InfiniBand transport if the nodes only have Ethernet
export NCCL_IB_DISABLE=1
Symptom:
RuntimeError: CUDA out of memory
Cause: Training or inference is using more memory than available on the GPU.
Solution: Reduce batch size in the training scripts:
# Edit speedrun_spark.sh and change the device_batch_size parameter
--device_batch_size=16 # or 8, 4, 2, 1
You can also reduce the depth parameter:
--depth=18 # instead of 20 for d20 model
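If OOM failures keep recurring, the halving above can be automated. A minimal sketch, assuming speedrun_spark.sh exits non-zero on an OOM failure (the script and flag names are the ones used above; the actual launch line is left commented out so the sketch is safe to run anywhere):

```shell
#!/bin/sh
# Sketch: retry training with a halved per-device batch size after an OOM
# failure. Assumption: speedrun_spark.sh returns a non-zero exit code on OOM.

fallback_sizes() {             # print candidate sizes, halving down to 1
    bs=$1
    while [ "$bs" -ge 1 ]; do
        echo "$bs"
        bs=$((bs / 2))
    done
}

for bs in $(fallback_sizes 16); do
    echo "would run: bash speedrun_spark.sh --device_batch_size=$bs"
    # bash speedrun_spark.sh --device_batch_size="$bs" && break
done
```

Stopping at the first size that completes keeps throughput as high as memory allows.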
Symptom: Container fails to start or immediately exits.
Cause: GPU not available, Docker doesn't have GPU access, or another container is using the GPU.
Solution:
# Check the GPU is visible to the host
nvidia-smi
# Check whether another container is already running
docker ps
# Verify Docker can access the GPU
docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi
# Inspect the container's logs for the failure reason
docker logs nanochat
Symptom:
Permission denied (publickey,password)
Cause: SSH key authentication not set up between nodes.
Solution: Set up SSH key authentication:
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
ssh-copy-id $USER@<WORKER_IP>
Test the connection:
ssh $USER@<WORKER_IP> "echo 'Connection successful'"
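With more than one worker, the same key setup has to be repeated per node. A small sketch of that loop, using hypothetical worker addresses (10.0.0.2, 10.0.0.3) and with the actual ssh-copy-id call left commented out:

```shell
#!/bin/sh
# Sketch: repeat the key setup above for every worker node.
# The addresses below are placeholders - substitute your own.

ensure_key() {                      # generate a key only if one is missing
    [ -f "$1" ] || ssh-keygen -t rsa -b 4096 -N "" -f "$1"
}

copy_to_workers() {                 # print the copy command for each worker
    for ip in "$@"; do
        echo "ssh-copy-id $USER@$ip"
        # ssh-copy-id "$USER@$ip"   # uncomment to actually install the key
    done
}

# ensure_key "$HOME/.ssh/id_rsa"    # uncomment on the host node
copy_to_workers 10.0.0.2 10.0.0.3   # hypothetical worker addresses
```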
Symptom: GPU becomes unresponsive, training stops, or system becomes unstable.
Cause: An out-of-memory condition has left the GPU in a crashed or unresponsive state.
Solution: Reboot the GPU system:
# Soft reboot
sudo reboot
# Or power the system off, wait 30 seconds, then turn it back on
After reboot, reduce memory usage by lowering batch size or depth parameter in speedrun_spark.sh.
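Rather than waiting for the crash, you can poll GPU memory and react early. A sketch that parses the CSV output of an nvidia-smi query (the 90% threshold is an arbitrary choice, and the sample line at the end stands in for a real query so the sketch runs anywhere):

```shell
#!/bin/sh
# Sketch: warn before the GPU runs out of memory, so you can lower the
# batch size instead of rebooting. Threshold (90%) is an arbitrary choice.

usage_pct() {                 # used total (MiB) -> integer percent
    echo $(( $1 * 100 / $2 ))
}

check_gpu() {                 # reads one "used, total" CSV line from stdin
    IFS=', ' read -r used total
    pct=$(usage_pct "$used" "$total")
    if [ "$pct" -ge 90 ]; then
        echo "WARNING: GPU memory at ${pct}% - reduce --device_batch_size"
    else
        echo "GPU memory at ${pct}%"
    fi
}

# Real invocation (commented out so the sketch runs without a GPU):
# nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | check_gpu
echo "119000, 122000" | check_gpu   # sample line standing in for real output
```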
Symptom:
OSError: [Errno 28] No space left on device
Cause: The disk is full, leaving no space to write files or perform basic filesystem operations. This may be due to the large dataset download (~24GB for FineWeb).
Solution: Free up filesystem space:
# Check disk usage
df -h
# Clear Docker cache
docker system prune -a
# Remove old training checkpoints if not needed
rm -rf ~/.cache/nanochat/old_runs
# Remove downloaded datasets if redownload is acceptable
rm -rf ~/.cache/nanochat/data
To prevent this issue, ensure you have at least 50GB of free disk space before starting training.
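The 50GB precondition can be checked automatically before launching a run. A minimal sketch using df, with the threshold taken from the recommendation above:

```shell
#!/bin/sh
# Sketch: refuse to start a run with less than 50GB free (the figure this
# guide recommends). Checks whatever filesystem holds the given path.

check_space() {               # $1 = path, $2 = required GB; exit 0 if enough
    free_kb=$(df -Pk "$1" | awk 'NR==2 {print $4}')
    free_gb=$(( free_kb / 1024 / 1024 ))
    echo "${free_gb}GB free on $1"
    [ "$free_gb" -ge "$2" ]
}

check_space "$HOME" 50 || echo "free up space before starting training"
```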
Symptom: Training loss remains constant or increases instead of decreasing.
Cause: Learning rate too high, data not loading properly, or distributed training not coordinating correctly.
Solution:
# Verify the training data downloaded correctly
ls -lh ~/.cache/nanochat/data/
# Confirm all GPUs are active during training
nvidia-smi
Symptom: Cannot access the web UI at http://localhost:8000 or http://<HOST_IP>:8000.
Cause: Web server not running, port not forwarded, or firewall blocking access.
Solution:
# Check the web server process is running
ps aux | grep chat_web
# Check something is listening on port 8000
netstat -tuln | grep 8000
# Forward the port over SSH if accessing remotely
ssh -L 8000:localhost:8000 username@<HOST_IP>
# Open the port in the firewall
sudo ufw allow 8000/tcp
If you encounter issues not covered here:
# Check container logs
docker logs nanochat
# Check system logs
sudo journalctl -xe
# Monitor resource usage with htop and nvidia-smi
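When filing a report, it helps to capture all of the above in one place. A hypothetical collect helper that gathers the logs listed here, skipping any tool that isn't installed:

```shell
#!/bin/sh
# Sketch: bundle the diagnostics listed above into one directory you can
# attach to a bug report. Tools that are absent are silently skipped.

collect() {
    out=$1
    mkdir -p "$out"
    command -v docker > /dev/null     && docker logs nanochat > "$out/container.log" 2>&1
    command -v nvidia-smi > /dev/null && nvidia-smi > "$out/nvidia-smi.txt" 2>&1
    df -h > "$out/disk.txt"
    echo "diagnostics written to $out"
}

collect "/tmp/nanochat-diag-$(date +%s)"
```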