Nanochat on Dual-Spark

Troubleshooting

Common Issues

NCCL timeout or connection errors

Symptom:

RuntimeError: NCCL error in: ...

Cause: Network connectivity issues between nodes or firewall blocking distributed training communication.

Solution:

  • Verify network connectivity between nodes: ping <WORKER_IP>
  • Check firewall rules allow traffic on port 29500: sudo ufw allow 29500/tcp
  • Ensure NCCL_SOCKET_IFNAME matches your network interface (default: enp1s0f0np0)
  • Try disabling InfiniBand: export NCCL_IB_DISABLE=1
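Putting the last three bullets together, a minimal sketch of the environment to export before relaunching training. The interface name is this guide's default; `NCCL_DEBUG` is an additional standard NCCL variable for verbose diagnostics, included here as a suggestion:

```shell
# A minimal sketch: NCCL settings to export before relaunching training.
# enp1s0f0np0 is this guide's default interface; find yours with `ip -br addr`.
export NCCL_SOCKET_IFNAME=enp1s0f0np0  # bind NCCL sockets to the right NIC
export NCCL_IB_DISABLE=1               # force TCP if InfiniBand is misbehaving
export NCCL_DEBUG=INFO                 # verbose logs to locate the failing step
```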

Out of memory (OOM) errors

Symptom:

RuntimeError: CUDA out of memory

Cause: Training or inference is using more memory than available on the GPU.

Solution: Reduce the per-device batch size in the training script:

# Edit speedrun_spark.sh and change the device_batch_size parameter
--device_batch_size=16  # or 8, 4, 2, 1

You can also reduce the depth parameter:

--depth=18  # instead of the default 20 (d20 model)

Docker container not starting

Symptom: Container fails to start or immediately exits.

Cause: GPU not available, Docker doesn't have GPU access, or another container is using the GPU.

Solution:

  • Check GPU availability: nvidia-smi
  • Ensure no other containers are using the GPUs: docker ps
  • Verify Docker has GPU access: docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
  • Check Docker logs: docker logs nanochat
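The checks above can be combined into one rough diagnostic pass. This is a sketch, assuming the container is named nanochat as elsewhere in this guide; each step falls back to a message instead of aborting, so it is safe to run on any machine:

```shell
# Rough diagnostic pass over the checks above; each step falls back to a
# message instead of failing, so the script always runs to completion.
gpu=$(nvidia-smi 2>/dev/null || echo "nvidia-smi unavailable - GPU/driver problem?")
containers=$(docker ps --format '{{.Names}}\t{{.Status}}' 2>/dev/null \
  || echo "docker unavailable or daemon not running")
logs=$(docker logs --tail 50 nanochat 2>/dev/null \
  || echo "no logs - is there a container named nanochat?")
printf '%s\n\n%s\n\n%s\n' "$gpu" "$containers" "$logs"
```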

SSH permission denied to worker node

Symptom:

Permission denied (publickey,password)

Cause: SSH key authentication not set up between nodes.

Solution: Set up SSH key authentication:

ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
ssh-copy-id $USER@<WORKER_IP>

Test the connection:

ssh $USER@<WORKER_IP> "echo 'Connection successful'"

GPU freezes during training

Symptom: GPU becomes unresponsive, training stops, or system becomes unstable.

Cause: An out-of-memory event or driver fault has left the GPU hung or unresponsive.

Solution: Reboot the GPU system:

# Soft reboot
sudo reboot

# Or power-cycle the machine: turn it off, wait 30 seconds, then turn it back on

After reboot, reduce memory usage by lowering batch size or depth parameter in speedrun_spark.sh.

Disk space full error

Symptom:

OSError: [Errno 28] No space left on device

Cause: The disk is full, leaving no room to write files or perform basic filesystem operations. A common culprit is a large dataset download (~24GB for FineWeb).

Solution: Free up filesystem space:

# Check disk usage
df -h

# Clear Docker cache
docker system prune -a

# Remove old training checkpoints if not needed
rm -rf ~/.cache/nanochat/old_runs

# Remove downloaded datasets if redownload is acceptable
rm -rf ~/.cache/nanochat/data

To prevent this issue, ensure you have at least 50GB of free disk space before starting training.
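As a small sketch, the 50GB headroom can be checked up front. This assumes GNU df (standard on Linux), whose -BG and --output options report available space in gigabytes:

```shell
# Sketch: verify the 50 GB of free space recommended above before training.
NEED_GB=50
# GNU df: report available space on the filesystem holding $HOME, in GB
FREE_GB=$(df -BG --output=avail "$HOME" | tail -1 | tr -dc '0-9')
if [ "$FREE_GB" -lt "$NEED_GB" ]; then
  echo "Only ${FREE_GB}GB free under $HOME; need at least ${NEED_GB}GB"
else
  echo "OK: ${FREE_GB}GB free"
fi
```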

Training loss not decreasing

Symptom: Training loss remains constant or increases instead of decreasing.

Cause: Learning rate too high, data not loading properly, or distributed training not coordinating correctly.

Solution:

  • Check if data is downloading correctly: ls -lh ~/.cache/nanochat/data/
  • Verify both nodes are participating: Check logs on both host and worker containers
  • Ensure GPUs are being utilized: nvidia-smi
  • Check W&B dashboard (if enabled) for training curves
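For the first bullet, a short sketch that reports whether the dataset shards actually landed; the path follows this guide's ~/.cache/nanochat layout:

```shell
# Sketch: confirm the dataset download actually produced files.
data_dir="$HOME/.cache/nanochat/data"
if [ -d "$data_dir" ]; then
  # count the shard files and report their total size
  msg="$(find "$data_dir" -type f | wc -l) files, $(du -sh "$data_dir" | cut -f1) total"
else
  msg="data directory missing: $data_dir - the download likely failed"
fi
echo "$msg"
```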

Web UI not accessible

Symptom: Cannot access the web UI at http://localhost:8000 or http://<HOST_IP>:8000.

Cause: Web server not running, port not forwarded, or firewall blocking access.

Solution:

  • Verify the web server is running: ps aux | grep chat_web
  • Check if port 8000 is open: netstat -tuln | grep 8000
  • If using SSH tunnel, ensure port forwarding is active: ssh -L 8000:localhost:8000 username@<HOST_IP>
  • Check firewall rules: sudo ufw allow 8000/tcp
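A quick sketch combining the first two checks (ss is the modern replacement for netstat on most distros; the port follows this guide):

```shell
# Sketch: check whether anything is listening on the web UI port.
port=8000
if ss -tuln 2>/dev/null | grep -q ":${port} "; then
  msg="something is listening on port ${port}"
else
  msg="nothing listening on port ${port} - start the web server first"
fi
echo "$msg"
```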

Getting Additional Help

If you encounter issues not covered here:

  1. Check the nanochat repository issues: https://github.com/karpathy/nanochat/issues
  2. Review Docker container logs: docker logs nanochat
  3. Check system logs: sudo journalctl -xe
  4. Verify system resources: htop and nvidia-smi