Nanochat Training

| Symptom | Cause | Fix |
|---------|-------|-----|
| `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required environment variables were not exported before running `launch.sh` | Run `export WANDB_API_KEY=<key>` and `export HF_TOKEN=<token>` in the same shell, then re-run `./launch.sh` |
| `RuntimeError: CUDA out of memory` | Batch size too large for available VRAM | Edit `speedrun_station.sh` and reduce `--device-batch-size` (try 64, then 32, 16, or 8). Re-run `./setup.sh`, then `./launch.sh` |
| Docker container exits immediately | Missing environment variables, bad cache paths, or a build failure | Find the container with `docker ps -a`, then read its logs with `docker logs <container_id>`. Fix the environment variables or paths as needed |
| `nanochat` image not found | Setup was not run, or the Docker build failed | From the `assets/` directory, run `./setup.sh` and confirm with `docker images \| grep nanochat` |
| `No such file or directory` for cache paths | Cache directories don't exist | `launch.sh` creates them automatically under `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`. If you use custom paths, create them first: `mkdir -p "$NANOCHAT_CACHE" "$HF_CACHE"` |
| Training hangs at "Waiting for dataset download" | Network issue while downloading FineWeb shards | Check network connectivity. The download can take a while depending on bandwidth. If the hang persists, restart `./launch.sh` |
| W&B shows wrong user / stale login | Cached W&B credentials in the container volume | `speedrun_station.sh` runs `wandb login --relogin` with your key automatically. Make sure `WANDB_API_KEY` is correct |
| Container starts but `launch.sh` prints "Training complete!" immediately | Container failed fast and exited before the poll loop detected it | Run `docker ps -a` to find the exited container and inspect its logs with `docker logs <container_id>` |
| GPU not visible inside container | Docker NVIDIA runtime not configured | Test with `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`. If it fails, install or configure the NVIDIA Container Toolkit |
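
Several of the rows above come down to the same preconditions: both tokens exported and the cache directories present. The sketch below shows one way to check these before launching. The function names (`missing_vars`, `ensure_cache_dirs`, `preflight`) are hypothetical and are not part of the playbook's scripts; the default cache locations are the ones listed in the table, and it is an assumption that custom paths are passed through `NANOCHAT_CACHE` and `HF_CACHE`.

```shell
# Hypothetical pre-flight helpers; not part of setup.sh or launch.sh.

# Print the name of every listed variable that is unset or empty.
missing_vars() {
    for name in "$@"; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            printf '%s\n' "$name"
        fi
    done
}

# Create the cache directories, falling back to the defaults from the table.
ensure_cache_dirs() {
    nanochat_cache="${NANOCHAT_CACHE:-$(pwd)/nanochat_cache}"
    hf_cache="${HF_CACHE:-$(pwd)/hf_cache}"
    mkdir -p "$nanochat_cache" "$hf_cache"
}

# Return non-zero if a token is missing; otherwise prepare the cache paths.
preflight() {
    missing="$(missing_vars WANDB_API_KEY HF_TOKEN)"
    if [ -n "$missing" ]; then
        printf 'Not set: %s\n' "$(printf '%s' "$missing" | tr '\n' ' ')" >&2
        return 1
    fi
    ensure_cache_dirs
}
```

If these functions were saved to a file (say `preflight.sh`, also a hypothetical name), they could be used as `. ./preflight.sh && preflight && ./launch.sh`, so that the launch only starts when the check passes.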

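For the rows that point to `docker ps -a` and `docker logs`, the lookup can be wrapped in a small helper. This is a sketch under assumptions: the function name `last_container_report` is hypothetical, the image is assumed to be tagged `nanochat`, and it relies on `docker ps` listing the newest container first. It prints the container ID, its exit code, and the last 50 log lines, and could be called as `last_container_report nanochat` right after `launch.sh` returns.

```shell
# Hypothetical helper; not part of the playbook's scripts.
# Reports on the most recent container created from the given image.
last_container_report() {
    image="${1:-nanochat}"
    # -a includes exited containers, -q prints only container IDs.
    id="$(docker ps -a -q --filter "ancestor=$image" | head -n 1)"
    if [ -z "$id" ]; then
        echo "No container found for image '$image'" >&2
        return 1
    fi
    echo "container: $id"
    # A non-zero exit code means the container failed rather than finished.
    echo "exit code: $(docker inspect --format '{{.State.ExitCode}}' "$id")"
    docker logs --tail 50 "$id"
}
```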
Resources