Nanochat on Dual-Spark

Troubleshooting

Common Issues

NCCL timeout or connection errors

Symptom:

RuntimeError: NCCL error in: ...

Cause: Network connectivity issues between nodes or firewall blocking distributed training communication.

Solution:

  • Verify network connectivity between nodes: ping <WORKER_IP>
  • Check firewall rules allow traffic on port 29500: sudo ufw allow 29500/tcp
  • Ensure NCCL_SOCKET_IFNAME matches your network interface (default: enp1s0f0np0)
  • Try disabling InfiniBand: export NCCL_IB_DISABLE=1
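Putting the last three bullets together, a minimal sketch of the environment to export before relaunching training. The interface name is this guide's default; `NCCL_DEBUG` is an additional standard NCCL variable for verbose diagnostics, included here as a suggestion:

```shell
# A minimal sketch: NCCL settings to export before relaunching training.
# enp1s0f0np0 is this guide's default interface; find yours with `ip -br addr`.
export NCCL_SOCKET_IFNAME=enp1s0f0np0  # bind NCCL sockets to the right NIC
export NCCL_IB_DISABLE=1               # force TCP if InfiniBand is misbehaving
export NCCL_DEBUG=INFO                 # verbose logs to locate the failing step
```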

Out of memory (OOM) errors

Symptom:

RuntimeError: CUDA out of memory

Cause: Training or inference is using more memory than available on the GPU.

Solution: Reduce the per-device batch size in the training script:

# Edit speedrun_spark.sh and change the device_batch_size parameter
--device_batch_size=16  # or 8, 4, 2, 1

You can also reduce the depth parameter:

--depth=18  # instead of the default 20 (d20 model)

Docker container not starting

Symptom: Container fails to start or immediately exits.

Cause: GPU not available, Docker doesn't have GPU access, or another container is using the GPU.

Solution:

  • Check GPU availability: nvidia-smi
  • Ensure no other containers are using the GPUs: docker ps
  • Verify Docker has GPU access: docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
  • Check Docker logs: docker logs nanochat
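The checks above can be combined into one rough diagnostic pass. This is a sketch, assuming the container is named nanochat as elsewhere in this guide; each step falls back to a message instead of aborting, so it is safe to run on any machine:

```shell
# Rough diagnostic pass over the checks above; each step falls back to a
# message instead of failing, so the script always runs to completion.
gpu=$(nvidia-smi 2>/dev/null || echo "nvidia-smi unavailable - GPU/driver problem?")
containers=$(docker ps --format '{{.Names}}\t{{.Status}}' 2>/dev/null \
  || echo "docker unavailable or daemon not running")
logs=$(docker logs --tail 50 nanochat 2>/dev/null \
  || echo "no logs - is there a container named nanochat?")
printf '%s\n\n%s\n\n%s\n' "$gpu" "$containers" "$logs"
```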

SSH permission denied to worker node

Symptom:

Permission denied (publickey,password)

Cause: SSH key authentication not set up between nodes.

Solution: Set up SSH key authentication:

ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
ssh-copy-id $USER@<WORKER_IP>

Test the connection:

ssh $USER@<WORKER_IP> "echo 'Connection successful'"

GPU freezes during training

Symptom: GPU becomes unresponsive, training stops, or system becomes unstable.

Cause: An out-of-memory event or driver fault has left the GPU hung or unresponsive.

Solution: Reboot the GPU system:

# Soft reboot
sudo reboot

# Or power-cycle the machine: turn it off, wait 30 seconds, then turn it back on

After reboot, reduce memory usage by lowering batch size or depth parameter in speedrun_spark.sh.

Disk space full error

Symptom:

OSError: [Errno 28] No space left on device

Cause: The disk is full, leaving no room to write files or perform basic filesystem operations. A common culprit is a large dataset download (~24GB for FineWeb).

Solution: Free up filesystem space:

# Check disk usage
df -h

# Clear Docker cache
docker system prune -a

# Remove old training checkpoints if not needed
rm -rf ~/.cache/nanochat/old_runs

# Remove downloaded datasets if redownload is acceptable
rm -rf ~/.cache/nanochat/data

To prevent this issue, ensure you have at least 50GB of free disk space before starting training.
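As a small sketch, the 50GB headroom can be checked up front. This assumes GNU df (standard on Linux), whose -BG and --output options report available space in gigabytes:

```shell
# Sketch: verify the 50 GB of free space recommended above before training.
NEED_GB=50
# GNU df: report available space on the filesystem holding $HOME, in GB
FREE_GB=$(df -BG --output=avail "$HOME" | tail -1 | tr -dc '0-9')
if [ "$FREE_GB" -lt "$NEED_GB" ]; then
  echo "Only ${FREE_GB}GB free under $HOME; need at least ${NEED_GB}GB"
else
  echo "OK: ${FREE_GB}GB free"
fi
```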

Training loss not decreasing

Symptom: Training loss remains constant or increases instead of decreasing.

Cause: Learning rate too high, data not loading properly, or distributed training not coordinating correctly.

Solution:

  • Check if data is downloading correctly: ls -lh ~/.cache/nanochat/data/
  • Verify both nodes are participating: Check logs on both host and worker containers
  • Ensure GPUs are being utilized: nvidia-smi
  • Check W&B dashboard (if enabled) for training curves
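For the first bullet, a short sketch that reports whether the dataset shards actually landed; the path follows this guide's ~/.cache/nanochat layout:

```shell
# Sketch: confirm the dataset download actually produced files.
data_dir="$HOME/.cache/nanochat/data"
if [ -d "$data_dir" ]; then
  # count the shard files and report their total size
  msg="$(find "$data_dir" -type f | wc -l) files, $(du -sh "$data_dir" | cut -f1) total"
else
  msg="data directory missing: $data_dir - the download likely failed"
fi
echo "$msg"
```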

Web UI not accessible

Symptom: Cannot access the web UI at http://localhost:8000 or http://<HOST_IP>:8000.

Cause: Web server not running, port not forwarded, or firewall blocking access.

Solution:

  • Verify the web server is running: ps aux | grep chat_web
  • Check if port 8000 is open: netstat -tuln | grep 8000
  • If using SSH tunnel, ensure port forwarding is active: ssh -L 8000:localhost:8000 username@<HOST_IP>
  • Check firewall rules: sudo ufw allow 8000/tcp
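A quick sketch combining the first two checks (ss is the modern replacement for netstat on most distros; the port follows this guide):

```shell
# Sketch: check whether anything is listening on the web UI port.
port=8000
if ss -tuln 2>/dev/null | grep -q ":${port} "; then
  msg="something is listening on port ${port}"
else
  msg="nothing listening on port ${port} - start the web server first"
fi
echo "$msg"
```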

Getting Additional Help

If you encounter issues not covered here:

  1. Check the nanochat repository issues: https://github.com/karpathy/nanochat/issues
  2. Review Docker container logs: docker logs nanochat
  3. Check system logs: sudo journalctl -xe
  4. Verify system resources: htop and nvidia-smi