| WANDB_API_KEY is not set or HF_TOKEN is not set | Required env vars not exported before launch.sh | export WANDB_API_KEY=<your_key> and export HF_TOKEN=<your_token> in the same shell, then run ./launch.sh. |
| RuntimeError: CUDA out of memory | Batch size or model too large for GPU | In the training script in the cloned nanochat repo (e.g. speedrun.sh), reduce --device_batch_size (e.g. 16, 8, 4, 2, or 1). |
| Docker container not starting or no GPU | Docker or NVIDIA runtime misconfigured | Run nvidia-smi on your DGX Station. Check that no other containers hold GPUs: docker ps. Test GPU access in Docker: docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi. |
| Permission denied or No such file or directory for cache paths in launch.sh | Paths like /home/scratch.lramesh_dpt/... don’t exist on your system | Edit launch.sh: set cache dirs to paths you can create (e.g. $(pwd)/nanochat_cache, $(pwd)/hf_cache). Run mkdir -p <your_cache_dirs> and re-run launch.sh. |
| nanochat image not found when running launch.sh | Setup not run or build failed | From nvidia/nanochat/assets, run ./setup.sh and confirm with docker images (look for the nanochat image). |
| Training exits immediately or script doesn’t wait | Container fails early (missing keys, paths, or OOM) | Check container logs: docker ps -a then docker logs <container_id>. Fix env vars, cache paths, or batch size as above. |
| Wrong cache path or "No such file" when launching | launch.sh uses non-existent paths (e.g. /home/scratch...) | On DGX Station, edit launch.sh: replace cache dirs with $(pwd)/nanochat_cache and $(pwd)/hf_cache, then run mkdir -p nanochat_cache hf_cache. |
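
Several of the rows above (missing keys, unusable cache paths) can be caught before launching. The following is a minimal preflight sketch, not part of the nanochat assets: the function names `check_env` and `ensure_cache_dirs` and the variables `NANOCHAT_CACHE_DIR` and `HF_CACHE_DIR` are hypothetical and exist only for this example. It assumes bash and falls back to the `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache` directories suggested in the table.

```shell
#!/usr/bin/env bash
# Preflight sketch for ./launch.sh (illustrative only, assumes bash).
# NANOCHAT_CACHE_DIR and HF_CACHE_DIR are hypothetical names used by this
# sketch; launch.sh itself may name its cache paths differently.

check_env() {
  # Print each required variable that is unset or empty.
  # Returns 1 if any variable is missing, 0 otherwise.
  local missing=0 name
  for name in WANDB_API_KEY HF_TOKEN; do
    if [ -z "${!name}" ]; then
      echo "MISSING: $name"
      missing=1
    fi
  done
  return "$missing"
}

ensure_cache_dirs() {
  # Create each directory passed as an argument and confirm it is writable.
  # Stops at the first directory that cannot be used and returns 1.
  local dir
  for dir in "$@"; do
    if mkdir -p "$dir" 2>/dev/null && [ -w "$dir" ]; then
      echo "OK: $dir"
    else
      echo "UNUSABLE: $dir"
      return 1
    fi
  done
}

check_env || echo "Export the missing variables in this shell, then re-run."
ensure_cache_dirs \
  "${NANOCHAT_CACHE_DIR:-$(pwd)/nanochat_cache}" \
  "${HF_CACHE_DIR:-$(pwd)/hf_cache}" \
  || echo "Edit the cache paths in launch.sh to directories you can create."
```

Run it in the same shell you will use for `./launch.sh`, since exported variables do not carry over to a new terminal. The sketch does not check Docker or GPU state; for those, use the `nvidia-smi`, `docker ps`, and `docker logs` commands from the table.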