# Setup Nanochat on Dual-Spark

## Troubleshooting

### Common Issues
#### NCCL timeout or connection errors

**Symptom:**

```
RuntimeError: NCCL error in: ...
```

**Cause:** Network connectivity issues between nodes, or a firewall blocking distributed-training communication.

**Solution:**

1. Verify network connectivity between nodes:
   ```bash
   ping <WORKER_IP>
   ```
2. Check that firewall rules allow traffic on port 29500:
   ```bash
   sudo ufw allow 29500/tcp
   ```
3. Ensure `NCCL_SOCKET_IFNAME` matches your network interface (default: `enp1s0f0np0`).
4. Try disabling InfiniBand:
   ```bash
   export NCCL_IB_DISABLE=1
   ```
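Before launching a multi-node run, it can save a debugging cycle to confirm that the rendezvous port is actually reachable from the worker. A minimal Python sketch (29500 here is the port used above; the host value is a placeholder you would fill in):

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (placeholder host): run from the worker against the host node.
# port_reachable("<HOST_IP>", 29500)
```

A `False` result points at a firewall or routing problem rather than an NCCL bug, so it is worth running before digging into NCCL debug logs.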
#### Out of memory (OOM) errors

**Symptom:**

```
RuntimeError: CUDA out of memory
```

**Cause:** Training or inference is using more memory than is available on the GPU.

**Solution:** Reduce the batch size in the training scripts:

```bash
# Edit speedrun_spark.sh and change the device_batch_size parameter
--device_batch_size=16  # or 8, 4, 2, 1
```

You can also reduce the `depth` parameter:

```bash
--depth=18  # instead of 20 for the d20 model
```
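When you lower `device_batch_size`, the usual way to keep the optimization dynamics comparable is to raise gradient accumulation so the effective batch size stays constant. nanochat's scripts handle this internally; the sketch below only illustrates the arithmetic, with hypothetical batch sizes:

```python
def grad_accum_steps(total_batch_size: int, device_batch_size: int, world_size: int) -> int:
    """Gradient-accumulation steps needed so that
    device_batch_size * world_size * steps == total_batch_size."""
    per_step = device_batch_size * world_size
    if total_batch_size % per_step != 0:
        raise ValueError("total batch size must divide evenly across devices")
    return total_batch_size // per_step

# Hypothetical numbers: halving the per-device batch doubles the accumulation steps.
print(grad_accum_steps(512, 32, 2))  # 8
print(grad_accum_steps(512, 16, 2))  # 16
```

So dropping from `--device_batch_size=32` to `16` costs more optimizer steps per batch, not model quality, as long as the effective batch size is preserved.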
#### Docker container not starting

**Symptom:** Container fails to start or exits immediately.

**Cause:** GPU not available, Docker lacking GPU access, or another container using the GPU.

**Solution:**

1. Check GPU availability:
   ```bash
   nvidia-smi
   ```
2. Ensure no other containers are using the GPUs:
   ```bash
   docker ps
   ```
3. Verify Docker has GPU access:
   ```bash
   docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi
   ```
4. Check the Docker logs:
   ```bash
   docker logs nanochat
   ```
#### SSH permission denied to worker node

**Symptom:**

```
Permission denied (publickey,password)
```

**Cause:** SSH key authentication is not set up between the nodes.

**Solution:** Set up SSH key authentication:

```bash
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
ssh-copy-id $USER@<WORKER_IP>
```

Test the connection:

```bash
ssh $USER@<WORKER_IP> "echo 'Connection successful'"
```
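If you script the multi-node launch, it helps to verify key-based auth non-interactively rather than letting the launcher hang at a password prompt. A sketch that builds such a check command (the helper name and timeout are illustrative, not part of nanochat):

```python
def ssh_check_cmd(user: str, host: str) -> list:
    """Build an ssh command that exits non-zero instead of prompting for a
    password when key auth is missing (BatchMode=yes disables prompts)."""
    return [
        "ssh",
        "-o", "BatchMode=yes",     # fail instead of asking for a password
        "-o", "ConnectTimeout=5",  # don't hang on an unreachable host
        f"{user}@{host}",
        "true",
    ]

# Illustrative usage: run with subprocess.run(); exit code 0 means key auth works.
```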
#### GPU freezes during training

**Symptom:** The GPU becomes unresponsive, training stops, or the system becomes unstable.

**Cause:** Running out of memory has caused the GPU to crash or become unresponsive.

**Solution:** Reboot the affected system:

```bash
# Soft reboot
sudo reboot
```

If a soft reboot doesn't recover the GPU, power the machine off, wait 30 seconds, then power it back on. After reboot, reduce memory usage by lowering the batch size or the `depth` parameter in `speedrun_spark.sh`.
#### Disk space full error

**Symptom:**

```
OSError: [Errno 28] No space left on device
```

**Cause:** The disk is full, so files cannot be written and basic filesystem operations fail. This may be due to a large dataset download (~24GB for FineWeb).

**Solution:** Free up filesystem space:

```bash
# Check disk usage
df -h

# Clear Docker cache
docker system prune -a

# Remove old training checkpoints if not needed
rm -rf ~/.cache/nanochat/old_runs

# Remove downloaded datasets if redownloading is acceptable
rm -rf ~/.cache/nanochat/data
```

To prevent this issue, ensure you have at least 50GB of free disk space before starting training.
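The preflight check above can be automated so a run never starts without headroom. A minimal sketch using the standard library (the 50 GB threshold mirrors the guidance above and is adjustable):

```python
import shutil

def enough_disk(path: str = ".", required_gb: float = 50.0) -> bool:
    """Return True if the filesystem containing `path` has at least
    `required_gb` gigabytes free (50 GB matches the guidance above)."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb
```

Calling this on `~/.cache/nanochat` before launching `speedrun_spark.sh` turns a mid-run crash into an early, clear failure.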
#### Training loss not decreasing

**Symptom:** Training loss remains constant or increases instead of decreasing.

**Cause:** Learning rate too high, data not loading properly, or distributed training not coordinating correctly.

**Solution:**

1. Check that the data downloaded correctly:
   ```bash
   ls -lh ~/.cache/nanochat/data/
   ```
2. Verify both nodes are participating: check the logs on both the host and worker containers.
3. Ensure the GPUs are being utilized:
   ```bash
   nvidia-smi
   ```
4. Check the W&B dashboard (if enabled) for training curves.
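If you log per-step losses, a plateau can also be detected mechanically rather than by eyeballing curves. A rough sketch, where the window size and drop threshold are arbitrary illustrative choices, not nanochat defaults:

```python
def loss_is_decreasing(losses, window=50, min_drop=0.01):
    """Compare the mean of the first and last `window` losses; return True
    if the recent mean is lower by at least `min_drop`."""
    if len(losses) < 2 * window:
        return True  # not enough data to judge yet
    early = sum(losses[:window]) / window
    recent = sum(losses[-window:]) / window
    return early - recent >= min_drop
```

A `False` over a few thousand steps is a stronger signal to start checking the learning rate and data pipeline than a noisy short-horizon plot.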
#### Web UI not accessible

**Symptom:** Cannot access the web UI at `http://localhost:8000` or `http://<HOST_IP>:8000`.

**Cause:** Web server not running, port not forwarded, or a firewall blocking access.

**Solution:**

1. Verify the web server is running:
   ```bash
   ps aux | grep chat_web
   ```
2. Check whether port 8000 is open:
   ```bash
   netstat -tuln | grep 8000
   ```
3. If using an SSH tunnel, ensure port forwarding is active:
   ```bash
   ssh -L 8000:localhost:8000 username@<HOST_IP>
   ```
4. Check the firewall rules:
   ```bash
   sudo ufw allow 8000/tcp
   ```
### Getting Additional Help

If you encounter issues not covered here:

1. Check the nanochat repository issues: https://github.com/karpathy/nanochat/issues
2. Review the Docker container logs:
   ```bash
   docker logs nanochat
   ```
3. Check the system logs:
   ```bash
   sudo journalctl -xe
   ```
4. Verify system resources with `htop` and `nvidia-smi`.