Nanochat Training
Train a small ChatGPT-style LLM (nanochat), covering tokenizer training, pretraining, midtraining, and SFT, on a DGX Station with GB300 Ultra
Prerequisites and environment
Ensure your DGX Station has Docker with NVIDIA runtime and GPU access. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets.
# Verify GPU and Docker
nvidia-smi
docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi
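Before running the two checks above, it can help to confirm that the tools are on your PATH at all. The following is a minimal sketch, not part of the playbook: `require_cmds` is a hypothetical helper name, and it only tests for the presence of each command, not for a working GPU or NVIDIA runtime.

```shell
#!/bin/sh
# require_cmds: hypothetical helper, not shipped with the playbook.
# Returns 0 if every named command is found on PATH, 1 otherwise.
require_cmds() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `require_cmds nvidia-smi docker && echo "tools found"` prints the message only when both commands are installed.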
Create a W&B account and a Hugging Face token if you don't have them. Export both keys in your shell:
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
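A quick guard can catch a missing key before a long run starts. This is a sketch under the assumption that only the two variables named above are required; `check_keys` is a hypothetical helper, not something the playbook provides.

```shell
#!/bin/sh
# check_keys: hypothetical helper, not shipped with the playbook.
# Returns 0 if both WANDB_API_KEY and HF_TOKEN are non-empty, 1 otherwise.
check_keys() {
    key_status=0
    if [ -z "${WANDB_API_KEY:-}" ]; then
        echo "WANDB_API_KEY is not set" >&2
        key_status=1
    fi
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set" >&2
        key_status=1
    fi
    return "$key_status"
}
```

Calling `check_keys` in the same shell where you ran the exports reports which variable, if any, is still empty.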
Clone and set up
Clone the playbook repository and navigate to the assets directory:
git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-nanochat/assets
Run the setup script. It clones nanochat, checks out the supported commit, copies the station-adapted speedrun_station.sh, and builds the nanochat Docker image (PyTorch NGC base with dependencies):
./setup.sh
Running docker images should now list the nanochat image, and your directory structure should look like this:
assets/
├── Dockerfile
├── launch.sh
├── setup.sh
├── speedrun_station.sh
└── nanochat/
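The layout above can also be checked with a short script. This is a minimal sketch: `check_layout` is a hypothetical helper, and it checks only the five entries shown in the tree, not the contents of nanochat/.

```shell
#!/bin/sh
# check_layout: hypothetical helper, not shipped with the playbook.
# $1 is the assets directory (defaults to the current directory).
# Returns 0 if the expected files and the nanochat/ directory exist.
check_layout() {
    dir="${1:-.}"
    layout_status=0
    for f in Dockerfile launch.sh setup.sh speedrun_station.sh; do
        [ -f "$dir/$f" ] || { echo "missing file: $dir/$f" >&2; layout_status=1; }
    done
    [ -d "$dir/nanochat" ] || { echo "missing directory: $dir/nanochat" >&2; layout_status=1; }
    return "$layout_status"
}
```

Run `check_layout .` from inside assets/ after setup; any missing entry is reported on stderr.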
Launch training
Ensure your API keys are exported, then launch:
./launch.sh
The training runs inside the nanochat container and executes the full pipeline automatically:
- Tokenizer training: downloads ~2B characters from FineWeb, trains a 65K BPE tokenizer
- Base model pretraining: downloads additional FineWeb shards, pretrains a d24 model (~1B params) with FP8
- SFT: downloads synthetic identity conversations, fine-tunes for chat
- Report generation: produces report.md with metrics and samples
The full d24 run takes roughly 12 hours or more on a single GB300 Ultra.
Monitor training
W&B dashboard:
Track training at wandb.ai under the nanochat project. The exact link to the W&B run is printed in the training logs. Key metrics:
- Training loss
- Validation BPB
- Throughput (tokens/sec)
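If you capture the launch output to a file, the run link can be pulled out of it afterwards. The sketch below assumes you saved the logs yourself, for instance with `./launch.sh 2>&1 | tee train.log` (the file name is an arbitrary choice). `wandb_urls` is a hypothetical helper that matches any https://wandb.ai/ URL; it makes no assumption about the surrounding log text.

```shell
#!/bin/sh
# wandb_urls: hypothetical helper, not shipped with the playbook.
# Prints each unique https://wandb.ai/ URL found in the log file $1.
wandb_urls() {
    grep -oE 'https://wandb\.ai/[^[:space:]]+' "$1" | sort -u
}
```

For example, `wandb_urls train.log` lists the links found in the captured output, one per line.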
Inference
After training, checkpoints are saved under the nanochat_cache/ directory. Run inference in the container through either the web UI or the CLI:
Web UI (recommended):
docker run --rm --gpus all --net=host \
-v $(pwd)/nanochat:/workspace/nanochat \
-v $(pwd)/nanochat_cache:/root/.cache/nanochat \
-w /workspace/nanochat \
nanochat \
python -m scripts.chat_web
Open a browser to http://<STATION_IP>:8000 where <STATION_IP> is your DGX Station’s IP address.
CLI:
docker run --rm -it --gpus all \
-v $(pwd)/nanochat:/workspace/nanochat \
-v $(pwd)/nanochat_cache:/root/.cache/nanochat \
-w /workspace/nanochat \
nanochat \
python -m scripts.chat_cli -p "Why is the sky blue?"
Cleanup
To stop training early, interrupt the launch script or stop the container:
WARNING
This stops the training run and any in-progress work in the container.
# If launch.sh is running: press Ctrl+C
# Or stop the container directly
docker stop $(docker ps -q --filter ancestor=nanochat)
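To free disk space, delete the caches and prune Docker. Note that docker system prune -a removes all unused images, stopped containers, and build cache on the machine, not only those belonging to nanochat: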
rm -rf ./nanochat_cache ./hf_cache
docker system prune -a
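Because the deletion above is irreversible, you may want to see how much space the caches occupy first. This is a minimal sketch: `cache_usage` is a hypothetical helper that wraps `du -sh` and skips directories that do not exist.

```shell
#!/bin/sh
# cache_usage: hypothetical helper, not shipped with the playbook.
# Prints the disk usage of each directory argument that exists,
# and a "not found" line for each one that does not.
cache_usage() {
    for d in "$@"; do
        if [ -d "$d" ]; then
            du -sh "$d"
        else
            echo "not found: $d"
        fi
    done
}
```

For example, `cache_usage ./nanochat_cache ./hf_cache` reports the size of both cache directories before you remove them.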
Customization
Smaller/faster run: Edit speedrun_station.sh before running setup to reduce data and model size:
# Fewer data shards (10 instead of default)
python -m nanochat.dataset -n 10 &
# Smaller model (d4 instead of d24), smaller batch size
python -m scripts.base_train --depth=4 --device-batch-size=32
Batch size: The default --device-batch-size=64 is tuned for the GB300's 288 GB of VRAM. Lower it if training runs out of memory (OOM), or raise it if GPU utilization is low.
After editing speedrun_station.sh, re-run ./setup.sh to rebuild the image with your changes.
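To find the lines to edit, it may help to list where the script sets these values. The sketch below assumes speedrun_station.sh contains the strings shown in this section (nanochat.dataset, --depth, --device-batch-size); the exact lines in your copy may differ. `show_tunables` is a hypothetical helper.

```shell
#!/bin/sh
# show_tunables: hypothetical helper, not shipped with the playbook.
# Lists, with line numbers, the lines of script $1 that mention the
# dataset download, the model depth, or the device batch size.
show_tunables() {
    grep -nE -- 'nanochat\.dataset|--depth|--device-batch-size' "$1"
}
```

For example, `show_tunables speedrun_station.sh` prints each matching line prefixed with its line number, so you can jump to it in an editor.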