Nanochat Training

30 MIN

Train a small ChatGPT-style LLM (nanochat) — covering tokenizer training, pretraining, midtraining, and SFT — on a DGX Station with GB300 Ultra.

| Symptom | Cause | Fix |
| --- | --- | --- |
| `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required environment variables were not exported before running `launch.sh`. | Run `export WANDB_API_KEY=<your_key>` and `export HF_TOKEN=<your_token>` in the same shell, then run `./launch.sh`. |
| `RuntimeError: CUDA out of memory` | Batch size or model is too large for the GPU. | In the training script in the cloned nanochat repo (e.g. `speedrun.sh`), reduce `--device_batch_size` (e.g. 16, 8, 4, 2, or 1). |
| Docker container does not start, or no GPU is visible | Docker or the NVIDIA runtime is misconfigured. | Run `nvidia-smi` on your DGX Station. Check that no other containers hold the GPUs: `docker ps`. Test GPU access inside Docker: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi`. |
| `Permission denied` or `No such file or directory` for cache paths in `launch.sh` | Paths like `/home/scratch.lramesh_dpt/...` don't exist on your system. | Edit `launch.sh`: set the cache dirs to paths you can create (e.g. `$(pwd)/nanochat_cache`, `$(pwd)/hf_cache`). Run `mkdir -p <your_cache_dirs>` and re-run `launch.sh`. |
| nanochat image not found when running `launch.sh` | Setup was not run, or the image build failed. | From `nvidia/nanochat/assets`, run `./setup.sh` and confirm with `docker images` (look for the `nanochat` image). |
| Training exits immediately, or the script doesn't wait | The container fails early (missing keys, bad paths, or OOM). | Check the container logs: `docker ps -a`, then `docker logs <container_id>`. Fix env vars, cache paths, or batch size as above. |
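Several rows above reduce to the same two preflight checks: required environment variables and writable cache directories. A minimal sketch that bundles them before `./launch.sh` (the `check_env` helper name is hypothetical; the cache paths are the example paths from the table, not fixed by nanochat):

```shell
# Hypothetical preflight for ./launch.sh: verify required env vars, create cache dirs.
check_env() {
  missing=0
  for var in WANDB_API_KEY HF_TOKEN; do
    # Indirect lookup of the variable named in $var (POSIX-compatible).
    eval "val=\${$var}"
    if [ -z "$val" ]; then
      echo "ERROR: $var is not set" >&2
      missing=1
    fi
  done
  return $missing
}

# Create the cache directories launch.sh expects (example paths from the table).
mkdir -p "$(pwd)/nanochat_cache" "$(pwd)/hf_cache"
```

Run `check_env && ./launch.sh` so the launch only proceeds once both tokens are exported in the current shell.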
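For the OOM row, the fix is an edit to the training script rather than a command. One way to script the halving of `--device_batch_size` is a small sed-based helper (the `reduce_batch_size` function is a sketch of mine, not part of nanochat; the exact contents of `speedrun.sh` may differ in your checkout):

```shell
# Hypothetical helper: halve the --device_batch_size value in a training script
# and print the new value. Assumes the flag appears as "--device_batch_size N"
# or "--device_batch_size=N".
reduce_batch_size() {
  script="$1"
  # Extract the current numeric value of the flag.
  cur=$(grep -oE 'device_batch_size[= ][0-9]+' "$script" | grep -oE '[0-9]+' | head -n 1)
  if [ -n "$cur" ] && [ "$cur" -gt 1 ]; then
    new=$((cur / 2))
    # Rewrite the flag in place with the halved value.
    sed -i -E "s/(device_batch_size[= ])[0-9]+/\1$new/" "$script"
    echo "$new"
  fi
}
```

Repeat (16 → 8 → 4 → 2 → 1) until training fits in GPU memory; halving keeps the value on the power-of-two ladder the table suggests.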
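When the container exits immediately, the table's advice is to read `docker logs` and match the output against the failure modes above. A hedged sketch of that triage as a pipe filter (the `triage_logs` name and the signature strings are assumptions drawn from the symptoms listed in the table):

```shell
# Hypothetical log triage: print the first line matching a known failure
# signature from the table, or a fallback message if none matches.
triage_logs() {
  grep -E -m 1 'WANDB_API_KEY|HF_TOKEN|CUDA out of memory|Permission denied|No such file' \
    || echo "no known signature found"
}

# usage: docker logs <container_id> 2>&1 | triage_logs
```

This narrows a long container log to the one line that tells you which row of the table applies.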