Nanochat Training
30 MIN
Train a small ChatGPT-style LLM (nanochat) with tokenizer, pretraining, midtraining, and SFT on DGX Station with GB300 Ultra
| Symptom | Cause | Fix |
|---|---|---|
| WANDB_API_KEY is not set or HF_TOKEN is not set | Required env vars not exported before launch.sh | export WANDB_API_KEY=<key> and export HF_TOKEN=<token> in the same shell, then re-run ./launch.sh |
| RuntimeError: CUDA out of memory | Batch size too large for available VRAM | Edit speedrun_station.sh: reduce --device-batch-size (try 64, 32, 16, 8). Re-run ./setup.sh then ./launch.sh |
| Docker container exits immediately | Missing env vars, bad cache paths, or build failure | Check logs: docker ps -a then docker logs <container_id>. Fix env vars or paths as needed |
| nanochat image not found | Setup not run or Docker build failed | From the assets/ directory, run ./setup.sh and confirm with docker images \| grep nanochat |
| No such file or directory for cache paths | Cache directories don't exist | launch.sh creates them automatically under $(pwd)/nanochat_cache and $(pwd)/hf_cache. If using custom paths, create them: mkdir -p $NANOCHAT_CACHE $HF_CACHE |
| Training hangs at "Waiting for dataset download" | Network issue downloading FineWeb shards | Check network connectivity. The download can take time depending on bandwidth. If it persists, restart ./launch.sh |
| W&B shows wrong user / stale login | Cached W&B credentials in container volume | speedrun_station.sh runs wandb login --relogin with your key automatically. Ensure WANDB_API_KEY is correct |
| Container runs but launch.sh says "Training complete!" immediately | Container failed fast and exited before the poll loop detected it | Check docker ps -a for the exited container and inspect logs with docker logs <id> |
| GPU not visible inside container | Docker NVIDIA runtime not configured | Test: docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi. If it fails, install/configure NVIDIA Container Toolkit |
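
Several rows in the table above can be checked before launching. The following is a minimal pre-flight sketch, not part of the nanochat assets: the function names `check_env`, `ensure_cache_dirs`, `image_present`, and `container_exit_code` are hypothetical helpers introduced here for illustration. It assumes the variable names and default cache locations given in the table, and that the image name contains `nanochat`.

```shell
#!/bin/sh
# Pre-flight sketch for the failure modes listed in the table.
# All function names below are hypothetical, not part of launch.sh.

# Report every required variable that is unset or empty.
# Returns 0 when both are set, 1 otherwise.
check_env() {
    missing=0
    for name in WANDB_API_KEY HF_TOKEN; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Create the cache directories, falling back to the default
# locations the table says launch.sh uses under $(pwd).
ensure_cache_dirs() {
    NANOCHAT_CACHE="${NANOCHAT_CACHE:-$(pwd)/nanochat_cache}"
    HF_CACHE="${HF_CACHE:-$(pwd)/hf_cache}"
    mkdir -p "$NANOCHAT_CACHE" "$HF_CACHE"
}

# Succeeds only if an image whose listing mentions "nanochat" exists.
image_present() {
    docker images | grep -q nanochat
}

# Print the exit code of a stopped container (pass an ID from `docker ps -a`).
# A non-zero code alongside an immediate "Training complete!" message
# suggests the container failed fast.
container_exit_code() {
    docker inspect --format '{{.State.ExitCode}}' "$1"
}
```

One way to use it is to source the file and run `check_env && ensure_cache_dirs && image_present && ./launch.sh`, so the launch is skipped when any prerequisite is missing.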